Wolf, OMG!! I feel so inadequate now. What amount of ass licking and payoffs is it gonna take to get you to release a properly written and professional X11 miner to us noobs then?
Me? Dude, I am probably not yet capable of properly writing X11 - I've done it decently, that's all. A properly written X11 miner wouldn't have all the hashes calculated by a single thread; Lord knows SIMD, at least, is horrible on GPU when done this way. While my kernels depend a lot less on memory than the stock ones, they still depend on it, because SIMD is so huge and poorly done that it accesses global memory a lot.
A PROPERLY done X11 would likely not even use the AES tables - the dev would bitslice the AES S-box, and that implementation would help out Groestl a lot, too, which is currently one of the slower ones. SIMD would be split at least 4-way, probably 8-way, meaning that up to 8 threads work together to calculate a single hash. GPUs hate large amounts of code and complex tasks - they excel at tons of work in tiny pieces done by many different threads. BMW isn't as good as it could be, in stock or mine - Lord only knows how to fix that, but I know there's gotta be a better way. Luffa, another kinda slow one, can easily be done in parallel - what stopped me is the ugliness of the SGMiner host code; I decided not to edit it and to confine my work to the OpenCL device code only. CubeHash isn't too slow, but a 2-way split would probably make it faster.
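For anyone wondering what "bitslice the AES S-box" means in practice, here is a rough sketch in plain Python (not OpenCL, and nowhere near a tuned kernel). The idea is to replace the table lookup with pure AND/XOR logic that runs on many bytes at once, with one bit-plane per input bit. This toy version derives the logic mechanically from the S-box itself by a Moebius transform, which is an assumption made for illustration; a real miner would use a hand-minimized gate circuit instead. All function names here are made up for the example.

```python
# Toy bitsliced AES S-box. Every name below is illustrative only.

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def sbox_reference(x):
    """Reference S-box: field inverse followed by the AES affine map."""
    inv = 1
    for _ in range(254):              # x^254 is the inverse; 0 maps to 0
        inv = gf_mul(inv, x)
    rot = lambda v, s: ((v << s) | (v >> (8 - s))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63

SBOX = [sbox_reference(x) for x in range(256)]

def anf(truth):
    """Moebius transform: truth table of one output bit -> XOR-of-ANDs terms."""
    a = list(truth)
    for i in range(8):
        for m in range(256):
            if m & (1 << i):
                a[m] ^= a[m ^ (1 << i)]
    return a

# For each output bit, the monomials (as input-bit masks) that appear in it.
MONOMIALS = [
    [m for m, c in enumerate(anf([(SBOX[x] >> bit) & 1 for x in range(256)])) if c]
    for bit in range(8)
]

def to_planes(values):
    """Transpose bytes into 8 bit-planes: bit k of plane i is bit i of values[k]."""
    planes = [0] * 8
    for k, v in enumerate(values):
        for i in range(8):
            planes[i] |= ((v >> i) & 1) << k
    return planes

def from_planes(planes, count):
    """Inverse of to_planes: rebuild `count` bytes from 8 bit-planes."""
    return [sum(((planes[i] >> k) & 1) << i for i in range(8)) for k in range(count)]

def sbox_bitsliced(planes, count):
    """Apply the S-box to `count` bytes at once using only AND and XOR."""
    prod = {0: (1 << count) - 1}      # the empty product is all ones

    def product(m):
        # AND together the planes named by mask m, caching shared sub-products.
        if m not in prod:
            low = m & -m              # lowest set input bit
            prod[m] = product(m ^ low) & planes[low.bit_length() - 1]
        return prod[m]

    out = []
    for bit in range(8):
        acc = 0
        for m in MONOMIALS[bit]:
            acc ^= product(m)
        out.append(acc)
    return out

# Run all 256 possible bytes through the bitsliced version in one pass.
data = list(range(256))
result = from_planes(sbox_bitsliced(to_planes(data), len(data)), len(data))
assert result == SBOX
```

Note that nothing in `sbox_bitsliced` indexes memory with a data-dependent address, which is the point: on a GPU the same logic would live in registers instead of hitting lookup tables. Python's big integers stand in for the wide registers here, so 256 bytes get processed per operation.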
In short, I am NOT who you're looking for - I am no professional, not yet.