bit_count function in PostgreSQL

Asked 18/9, 2017 at 13:36 Answered 22/11, 2022 at 11:50

We are in the process of migrating a MySQL 5.7 database to PostgreSQL 9.6.

A real issue is the lack of bit_count function in PostgreSQL. This function is also not available in the upcoming version 10.

Current MySQL code snippet (simplified):

-- mysql specific, tested with 5.7.19
select code,phash,bit_count(phash ^ -9187530158960050433) as hd 
from documents 
where phash is not null and bit_count(phash ^ -9187530158960050433) < 7
order by hd;

We tried a naive solution (converting the BIGINT to a String and counting the "1"'s), but it performs terribly compared to MySQL.

Java uses a trick from Hacker's Delight, but AFAIK this is not possible with PostgreSQL, because the >>> operator is (also) not available.

Question: Is a there solution/workaround available comparable with MySQL performance wise?

UPDATE 1

Best solution i could find is based on this SO answer:

First create bit_count function:

CREATE OR REPLACE FUNCTION bit_count(value bigint) 
RETURNS numeric 
AS $$ SELECT SUM((value >> bit) & 1) FROM generate_series(0, 63) bit $$
LANGUAGE SQL IMMUTABLE STRICT;

Now we can use almost the same SQL as with MySQL:

-- postgresql specific, tested with 9.6.5
select code,phash,bit_count(phash # -9187530158960050433) as hd 
from documents 
where phash is not null and bit_count(phash # -9187530158960050433) < 7
order by hd;

UPDATE 2

Based on @a_horse_with_no_name comment, i tried this function:

-- fastest implementation so far. 10 - 11 x faster than the naive solution (see UPDATE 1)
CREATE OR REPLACE FUNCTION bit_count(value bigint) 
RETURNS integer 
AS $$ SELECT length(replace(value::bit(64)::text,'0','')); $$
LANGUAGE SQL IMMUTABLE STRICT;

However, this is still 5 - 6 times slower than MySQL (tested wit exact the same data set of 200k phash values on the same hardware).

Disquisition answered 18/9, 2017 at 13:36 Comment(7)

Please Edit your question and add some sample data and the expected output based on that data. Formatted text please, no screen shots. edit your question - do not post code or additional information in comments. – Negatron 18/9, 2017 at 13:40

Hi there. If you found a solution to your question you can add it as an answer and wait a bit before mark it as the accepted one. Just to make sure that no one has a better solution to offer. – Birdella 18/9, 2017 at 14:16

I wonder if this is faster: length(replace((phash # -9187530158960050433)::bit(64)::text,'0','')) – Negatron 18/9, 2017 at 14:22

If you are trying to compute the hamming distance of perceptual hashes or similar LSH bit strings, then this question may be a duplicate of this: https://mcmap.net/q/82890/-hamming-distance-similarity-searches-in-a-database Check my answer there – Important 27/11, 2017 at 19:45

@PhilippeOmbredanne This is indeed the background of this question. However, this question is specifically about finding the fastest way to emulate the MySQL bit_count function (which works perfectly well in a MySQL environment). Nevertheless, thanks for the hint. – Disquisition 28/11, 2017 at 6:34

Then this may be the cure: github.com/commonsmachinery/hmsearch/tree/postgres ... a postgres extension for hamming distance search ;) – Important 28/11, 2017 at 7:26

@PhilippeOmbredanne Will check it. Thanks again. – Disquisition 28/11, 2017 at 7:42

Function bit_count is available since PostgreSQL version 14, see Bit String Functions and Operators.

Example:

select bit_count(B'1101');

Result is 3.

Note that the function is defined for types bit and bit varying. So if you want to use it with integer values, you need to cast.

Example:

select cast (cast (1101 as text) as bit varying);

Result is B'1101'.

Combining both examples:

select bit_count(cast (cast (1101 as text) as bit varying));

Esposito answered 22/11, 2022 at 11:50 Comment(0)

Question: Is a there solution/workaround available comparable with MySQL performance wise?

To get a comparable speed, a compiled C function should be used. If you can compile C code, see for instance https://github.com/dverite/postgresql-functions/tree/master/hamming_weight

The code itself is very simple.

The result seems 10 times faster than the bit_count function based on counting the 0 characters in the bit(64) string as text.

Example:

plpgsql function:

test=> select sum(bit_count(x)) from generate_series(1,1000000) x;
   sum   
---------
 9884999
(1 row)

Time: 2442,340 ms

C function:

test=> select sum(hamming_weight(x::int8)) from generate_series(1,1000000) x;
   sum   
---------
 9884999
(1 row)

Time: 239,749 ms

Saucedo answered 12/3, 2018 at 15:56 Comment(1)

Yes, this performs on par with MySQL. The downside of this solution is the "do it yourself" character. – Disquisition 13/3, 2018 at 6:55

Function bit_count is available since PostgreSQL version 14, see Bit String Functions and Operators.

Example:

select bit_count(B'1101');

Result is 3.

Note that the function is defined for types bit and bit varying. So if you want to use it with integer values, you need to cast.

Example:

select cast (cast (1101 as text) as bit varying);

Result is B'1101'.

Combining both examples:

select bit_count(cast (cast (1101 as text) as bit varying));

Esposito answered 22/11, 2022 at 11:50 Comment(0)

If you are trying to compute the hamming distance of perceptual hashes or similar LSH bit strings, then this question may be a closely related to this answer

If you are looking specifically for a pre-built way to do hamming distance queries on a PostgreSQL database, then this may be the cure: an extension for hamming distance search

Important answered 28/11, 2017 at 9:45 Comment(0)

Recommended topics

Hot tags