From my understanding:
I would treat the 33% as an indicator only, since profiling small methods can itself distort the overall performance picture. Instead, I would run the algorithm on a large dataset and measure the total time, then judge the effectiveness of each optimization by how that total time changes. I would also include a warm-up phase so that the JIT can apply its optimizations before timing starts.
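The measurement approach above can be sketched as follows. This is a minimal harness, not your actual algorithm: the kernel `countAll` and the dataset size are placeholders I made up for illustration. The point is the structure, warm-up iterations first, then timing whole runs.

```java
import java.util.Random;

// Sketch of a benchmark with an explicit warm-up phase so the JIT can
// compile the hot path before timing begins. The kernel and dataset
// below are placeholders, not the code from the question.
public class BitCountBench {
    // Hypothetical hot path: count all set bits over a dataset.
    static long countAll(long[] data) {
        long total = 0;
        for (long word : data) {
            total += Long.bitCount(word);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 20];
        Random rnd = new Random(42);
        for (int i = 0; i < data.length; i++) data[i] = rnd.nextLong();

        // Warm-up: run the kernel enough times for the JIT to optimize it.
        long sink = 0;
        for (int i = 0; i < 50; i++) sink += countAll(data);

        // Timed runs: compare *total* time across variants, not the
        // per-method percentages a profiler reports.
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) sink += countAll(data);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("total: " + elapsedMs + " ms (sink=" + sink + ")");
    }
}
```

For anything serious you would want a proper harness such as JMH, which handles warm-up, dead-code elimination, and run-to-run variance for you; the sketch above just shows the idea.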
In fact, bit counting seems to be one of the key parts of your algorithm anyway: if you optimize everything and manage to make every key part 10 times faster, this part would still profile at around 33%. That is not bad in itself.
Taking inspiration from http://bmagic.sourceforge.net/bmsse2opt.html, you could try to use the SSE instructions present in all recent Intel/AMD processors, if I remember right (you could always fall back to plain Java otherwise). An interesting point in that article is that, most of the time, the operation is memory-bound anyway. But I would still try it and see how it works for you.
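Note that from pure Java you cannot emit SSE instructions directly, but HotSpot already intrinsifies `Long.bitCount`: on x86 CPUs with the POPCNT instruction it typically compiles down to that single instruction, and otherwise it runs a pure-Java parallel (SWAR) bit count. The sketch below shows that fallback algorithm, which mirrors the JDK's own implementation, so you can see what you get even without hardware support:

```java
// Long.bitCount is a JIT intrinsic: with hardware POPCNT it becomes one
// instruction; without it, HotSpot runs SWAR code equivalent to this.
public class PopCount {
    // Pure-Java parallel (SWAR) bit count: sum bits in pairs, then
    // nibbles, then bytes, folding the partial sums together.
    static int swarBitCount(long i) {
        i = i - ((i >>> 1) & 0x5555555555555555L);
        i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L);
        i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
        i = i + (i >>> 8);
        i = i + (i >>> 16);
        i = i + (i >>> 32);
        return (int) i & 0x7f;
    }

    public static void main(String[] args) {
        long x = 0xF0F0F0F0F0F0F0F0L;
        // Both paths agree; prefer Long.bitCount so the JIT can use POPCNT.
        System.out.println(Long.bitCount(x));  // 32
        System.out.println(swarBitCount(x));   // 32
    }
}
```

So the first experiment is cheap: make sure your hot loop goes through `Long.bitCount` and measure the total time before attempting anything more exotic.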
A GPU would be a perfect fit for insanely fast processing (easily a hundred times one CPU core) and bandwidth. The main problem would be pushing the data to the GPU's dedicated memory and getting the results back. But if you perform more operations than just bit counting, this could bring huge gains.
There is no shortcut anyway: you must try several approaches and see which brings the most gain. But measure total time spent, not percentages.
`bin[0]` is the relative frequency that a random 64-bit sample has no bits set, `bin[64]` is the relative frequency that all bits are set, and so on. Alternatively, if you can post the distribution of the bits (such as Bernoulli), it will also help. – Breeder