Extract numbers from a field in PostgreSQL
Asked Answered
R

5

36

I have a table with a column po_number of type varchar in Postgres 8.4. It stores alphanumeric values with some special characters. I want to ignore the characters [/alpha/?/$/encoding/.] and check if the column contains a number or not. If its a number then it needs to typecast as number or else pass null, as my output field po_number_new is a number field.

Below is the example:

example

SQL Fiddle.

I tired this statement:

select 
(case when  regexp_replace(po_number,'[^\w],.-+\?/','') then po_number::numeric
else null
end) as po_number_new from test

But I got an error for explicit cast:

error

Rifkin answered 12/11, 2016 at 15:36 Comment(2)
Undelete your new question, you will get an answer. Just explain all edge cases carefully, that's all.Assignation
It's ok Tim. I will post a proper question from now-on without any edits : ) no issues. By the way thanks : )Rifkin
J
85

Simply:

SELECT NULLIF(regexp_replace(po_number, '\D','','g'), '')::numeric AS result
FROM   tbl;

\D being the class shorthand for "not a digit".
And you need the 4th parameter 'g' (for "globally") to replace all occurrences.
Details in the manual.

For a known, limited set of characters to replace, plain string manipulation functions like replace() or translate() are substantially cheaper. Regular expressions are just more versatile, and we want to eliminate everything but digits in this case. Related:

But why Postgres 8.4? Consider upgrading to a modern version.

Consider pitfalls for outdated versions:

Johns answered 12/11, 2016 at 15:59 Comment(1)
You might be interested in my proposed solution to this question - using the TRANSLATE function in this particularly easy case instead of the more expensive REGEXP_REPLACE? I'd be interested in, and grateful for, any comment(s) you may have on what I wrote! What led me here was this question - a lot tricker! :-) I have a working solution for PostgreSQL and a guy I was chatting with came up with a SQL Server one - trying to do it without regexes tricky though!Scrutable
S
3

I think you want something like this:

select (case when regexp_replace(po_number, '[^\w],.-+\?/', '') ~ '^[0-9]+$'
             then regexp_replace(po_number, '[^\w],.-+\?/', '')::numeric
        end) as po_number_new 
from test;

That is, you need to do the conversion on the string after replacement.

Note: This assumes that the "number" is just a string of digits.

Superior answered 12/11, 2016 at 15:41 Comment(3)
Could you explain the regexp ,.-+? What it means?Eads
@Eads I think they should be inside the brackets, though I'm not in front of Postgre right now to test it.Assignation
I am getting only nulls after using the above... instead I need numbers from each record and nulls for records that doesn't have numbers in itRifkin
S
3

If you want to extract floating numbers try to use this:

SELECT NULLIF(regexp_replace(po_number, '[^\.\d]','','g'), '')::numeric AS result FROM tbl;

It's the same as Erwin Brandstetter answer but with different expression:

[^...] - match any character except a list of excluded characters, put the excluded charaters instead of ...

\. - point character (also you can change it to , char)

\d - digit character

Sweettalk answered 30/9, 2021 at 7:30 Comment(0)
A
2

The logic I would use to determine if the po_number field contains numeric digits is that its length should decrease when attempting to remove numeric digits.

If so, then all non numeric digits ([^\d]) should be removed from the po_number column. Otherwise, NULL should be returned.

select case when char_length(regexp_replace(po_number, '\d', '', 'g')) < char_length(po_number)
            then regexp_replace(po_number, '[^0-9]', '', 'g')
            else null
       end as po_number_new
from test
Assignation answered 12/11, 2016 at 15:44 Comment(6)
I am getting an SQL error [42883] after trying thisRifkin
@Rifkin The error was being caused by you using Postgres 8.x, which does not have a length function. I updated to use char_length and it should work now.Assignation
I did a select version(); I am now using PostgreSQL 9.5.2, compiled by Visual C++ build 1800, 64-bit. Still getting an error. I posted the image error.Rifkin
@user1538020: Asides: The error you get is due to a typo: regex_replace <> regexp_replace. And there is a length() function in pg 8.4.Johns
@Tim Biegeleisen : select case when char_length(regexp_replace(po_number, '\d', '', 'g')) < char_length(po_number) then regexp_replace(po_number, '[^0-9]', '', 'g') else null end as po_number_new from test; Corrected the syntax, it now works. thanks..Rifkin
@ErwinBrandstetter Thanks Erwin, you're truly the king of Postgres on Stack Overflow :-)Assignation
J
2

Since version 12 - that's 2 years + 4 months ago at the time of writing (but after the last edit that I can see on the accepted answer), you could use a GENERATED FIELD to do this quite easily on a one-time basis rather than having to calculate it each time you wish to SELECT a new po_number.

Furthermore, you can use the TRANSLATE function to extract your digits which is less expensive than the REGEXP_REPLACE solution proposed by @ErwinBrandstetter!

I would do this as follows (all of the code below is available on the fiddle here):

CREATE TABLE s
(
  num TEXT,
  
  new_num INTEGER GENERATED ALWAYS AS
    (NULLIF(TRANSLATE(num, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ. ', ''), '')::INTEGER) STORED
);

You can add to the 'ABCDEFG... string in the TRANSLATE function as appropriate - I have decimal point (.) and a space ( ) at the end - you may wish to have more characters there depending on your input!

And checking:

INSERT INTO s VALUES ('2'), (''), (NULL), (' ');
INSERT INTO t VALUES ('2'), (''), (NULL), (' ');
SELECT * FROM s;
SELECT * FROM t;

Result (same for both):

num    new_num
  2          2
          NULL
          NULL
          NULL

So, I wanted to check how efficient my solution was, so I ran the following test inserting 10,000 records into both tables s and t as follows (from here):

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
INSERT INTO t 
with symbols(characters) as 
(
  VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
)
select string_agg(substr(characters, (random() * length(characters) + 1) :: INTEGER, 1), '')
from symbols
join generate_series(1,10) as word(chr_idx) on 1 = 1 -- word length
join generate_series(1,10000) as words(idx) on 1 = 1 -- # of words
group by idx;

The differences weren't that huge but the regex solution was consistently slower by about 25% - even changing the order of the tables undergoing the INSERTs.

However, where the TRANSLATE solution really shines is when doing a "raw" SELECT as follows:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
  NULLIF(TRANSLATE(num, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ. ', ''), '')::INTEGER
FROM s;

and the same for the REGEXP_REPLACE solution.

The differences were very marked, the TRANSLATE taking approx. 25% of the time of the other function. Finally, in the interests of fairness, I also did this for both tables:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
  num, new_num
FROM t;

Both extremely quick and identical!

Jellaba answered 9/2, 2022 at 2:49 Comment(1)
Yes, regular expression functions are known to be substantially more expensive (got faster in recent versions, but still). I've kept pointing that out over the last decade: https://mcmap.net/q/427251/-postgresql-select-only-alpha-characters-on-a-row, https://mcmap.net/q/427252/-is-there-a-regexp_replace-equivalent-for-postgresql-7-4, ... However, listing all possible non-digits was somewhat more practical in times of ASCII, not so much with UTF ...Johns

© 2022 - 2024 — McMap. All rights reserved.