Splitting string into multiple rows in Oracle
Asked Answered
P

14

131

I know this has been answered to some degree with PHP and MYSQL, but I was wondering if someone could teach me the simplest approach to splitting a string (comma delimited) into multiple rows in Oracle 10g (preferably) and 11g.

The table is as follows:

Name | Project | Error 
108    test      Err1, Err2, Err3
109    test2     Err1

I want to create the following:

Name | Project | Error
108    Test      Err1
108    Test      Err2 
108    Test      Err3 
109    Test2     Err1

I've seen a few potential solutions around stack, however they only accounted for a single column (being the comma delimited string). Any help would be greatly appreciated.

Proconsul answered 14/1, 2013 at 23:20 Comment(2)
For examples using REGEXP, XMLTABLE and MODEL clause, see Split comma delimited strings in a table using Oracle SQLRaquelraquela
I cannot resist, but the outlandish complexities you have to go through for such a simple task on Oracle just shows what an outdated and moronically designed system Oracle actually is. Compare that with PostgreSQL where the design is well done, and therefore such tasks have very easy solutions. WITH Test AS (SELECT 108 as name, 'test' as project, 'Err1, Err2, Err3' as error UNION ALL SELECT 109, 'test2', 'Err1') SELECT * FROM Test LEFT JOIN LATERAL unnest(string_to_array(error, ', ')) WITH ORDINALITY Error(error, i) ON true;Odds
C
148

This may be an improved way (also with regexp and connect by):

with temp as
(
    select 108 Name, 'test' Project, 'Err1, Err2, Err3' Error  from dual
    union all
    select 109, 'test2', 'Err1' from dual
)
select distinct
  t.name, t.project,
  trim(regexp_substr(t.error, '[^,]+', 1, levels.column_value))  as error
from 
  temp t,
  table(cast(multiset(select level from dual connect by  level <= length (regexp_replace(t.error, '[^,]+'))  + 1) as sys.OdciNumberList)) levels
order by name

EDIT: Here is a simple (as in, "not in depth") explanation of the query.

  1. length (regexp_replace(t.error, '[^,]+')) + 1 uses regexp_replace to erase anything that is not the delimiter (comma in this case) and length +1 to get how many elements (errors) are there.
  2. The select level from dual connect by level <= (...) uses a hierarchical query to create a column with an increasing number of matches found, from 1 to the total number of errors.

    Preview:

    select level, length (regexp_replace('Err1, Err2, Err3', '[^,]+'))  + 1 as max 
    from dual connect by level <= length (regexp_replace('Err1, Err2, Err3', '[^,]+'))  + 1
    
  3. table(cast(multiset(.....) as sys.OdciNumberList)) does some casting of oracle types.
    • The cast(multiset(.....)) as sys.OdciNumberList transforms multiple collections (one collection for each row in the original data set) into a single collection of numbers, OdciNumberList.
    • The table() function transforms a collection into a resultset.
  4. FROM without a join creates a cross join between your dataset and the multiset. As a result, a row in the data set with 4 matches will repeat 4 times (with an increasing number in the column named "column_value").

    Preview:

    select * from 
    temp t,
    table(cast(multiset(select level from dual connect by  level <= length (regexp_replace(t.error, '[^,]+'))  + 1) as sys.OdciNumberList)) levels
    
  5. trim(regexp_substr(t.error, '[^,]+', 1, levels.column_value)) uses the column_value as the nth_appearance/ocurrence parameter for regexp_substr.
  6. You can add some other columns from your data set (t.name, t.project as an example) for easy visualization.

Some references to Oracle docs:

Conglutinate answered 25/11, 2014 at 22:5 Comment(8)
Beware! A regex of the format '[^,]+' to parse strings does not return the correct item if there is a null element in the list. See here for more info: #31464775Giggle
since 11g you can use regexp_count(t.error, ',') instead of length (regexp_replace(t.error, '[^,]+')), which may bring another performance improvementVeritable
485 seconds with "normal" CONNECT BY. 0.296 seconds this way. You ROCK! Now all I have to do is understand how it works. :-)Rasp
@BobJarvis added an edit to explain what it does. Spelling/grammar corrections are welcome.Conglutinate
"The accepted answer has poor performance" - what is the accepted answer in this topic? Please use the links to reference the other post.Nut
Agreed. But after 5 years I can't remember which one was.Conglutinate
The distinct is no longer necessary here.Nonentity
Question - How is this possible, that in CAST(MULTISET( there is a "visible" data from table t? Normally in where clause subquery does not see data from another tableCawnpore
E
36

regular expressions is a wonderful thing :)

with temp as  (
       select 108 Name, 'test' Project, 'Err1, Err2, Err3' Error  from dual
       union all
       select 109, 'test2', 'Err1' from dual
     )

SELECT distinct Name, Project, trim(regexp_substr(str, '[^,]+', 1, level)) str
  FROM (SELECT Name, Project, Error str FROM temp) t
CONNECT BY instr(str, ',', 1, level - 1) > 0
order by Name
Ezequiel answered 15/1, 2013 at 4:12 Comment(5)
hi can you please clarify me why the above query gives duplicate rows if i didn't use distinct keyword in queryTouraine
That query is unusable due to @JagadeeshG, especially on huge tables.Emmery
Extremely slow, there's a better answer belowPredestine
The reason of the slowness is that every combination of Names is connected, which can be seen if you remove distinct. Unfortunately adding and Name = prior Name to the connect by clause causes ORA-01436: CONNECT BY loop in user data.Nonentity
You can avoid the ORA-01436 error by adding AND name = PRIOR name (or whatever the primary key might be) and AND PRIOR SYS_GUID() IS NOT NULLSerriform
R
32

There is a huge difference between the below two:

  • splitting a single delimited string
  • splitting delimited strings for multiple rows in a table.

If you do not restrict the rows, then the CONNECT BY clause would produce multiple rows and will not give the desired output.

Apart from Regular Expressions, a few other alternatives are using:

  • XMLTable
  • MODEL clause

Setup

SQL> CREATE TABLE t (
  2    ID          NUMBER GENERATED ALWAYS AS IDENTITY,
  3    text        VARCHAR2(100)
  4  );

Table created.

SQL>
SQL> INSERT INTO t (text) VALUES ('word1, word2, word3');

1 row created.

SQL> INSERT INTO t (text) VALUES ('word4, word5, word6');

1 row created.

SQL> INSERT INTO t (text) VALUES ('word7, word8, word9');

1 row created.

SQL> COMMIT;

Commit complete.

SQL>
SQL> SELECT * FROM t;

        ID TEXT
---------- ----------------------------------------------
         1 word1, word2, word3
         2 word4, word5, word6
         3 word7, word8, word9

SQL>

Using XMLTABLE:

SQL> SELECT id,
  2         trim(COLUMN_VALUE) text
  3  FROM t,
  4    xmltable(('"'
  5    || REPLACE(text, ',', '","')
  6    || '"'))
  7  /

        ID TEXT
---------- ------------------------
         1 word1
         1 word2
         1 word3
         2 word4
         2 word5
         2 word6
         3 word7
         3 word8
         3 word9

9 rows selected.

SQL>

Using MODEL clause:

SQL> WITH
  2  model_param AS
  3     (
  4            SELECT id,
  5                      text AS orig_str ,
  6                   ','
  7                          || text
  8                          || ','                                 AS mod_str ,
  9                   1                                             AS start_pos ,
 10                   Length(text)                                   AS end_pos ,
 11                   (Length(text) - Length(Replace(text, ','))) + 1 AS element_count ,
 12                   0                                             AS element_no ,
 13                   ROWNUM                                        AS rn
 14            FROM   t )
 15     SELECT   id,
 16              trim(Substr(mod_str, start_pos, end_pos-start_pos)) text
 17     FROM     (
 18                     SELECT *
 19                     FROM   model_param MODEL PARTITION BY (id, rn, orig_str, mod_str)
 20                     DIMENSION BY (element_no)
 21                     MEASURES (start_pos, end_pos, element_count)
 22                     RULES ITERATE (2000)
 23                     UNTIL (ITERATION_NUMBER+1 = element_count[0])
 24                     ( start_pos[ITERATION_NUMBER+1] = instr(cv(mod_str), ',', 1, cv(element_no)) + 1,
 25                     end_pos[iteration_number+1] = instr(cv(mod_str), ',', 1, cv(element_no) + 1) )
 26                 )
 27     WHERE    element_no != 0
 28     ORDER BY mod_str ,
 29           element_no
 30  /

        ID TEXT
---------- --------------------------------------------------
         1 word1
         1 word2
         1 word3
         2 word4
         2 word5
         2 word6
         3 word7
         3 word8
         3 word9

9 rows selected.

SQL>
Raquelraquela answered 5/5, 2015 at 5:24 Comment(3)
Can you elaborate more, why there has to be ('"' || REPLACE(text, ',', '","') || '"') and brackets cannot be removed? Oracle docs ([docs.oracle.com/database/121/SQLRF/functions268.htm) are not clear to me. Is it XQuery_string?Audiometer
@Audiometer it is a XQuery expression.Raquelraquela
XMLTABLE solution for some reason constantly fails to output the last entry for a mixed length rows. Eg. row1: 3 words; row2: 2 words, row3: 1 word; row4 : 2 words, row5: 1 word -- will not output the last word. Ordering of rows doesn't matter.Tonsillotomy
U
10

A couple of more examples of the same:

SELECT trim(regexp_substr('Err1, Err2, Err3', '[^,]+', 1, LEVEL)) str_2_tab
  FROM dual
CONNECT BY LEVEL <= regexp_count('Err1, Err2, Err3', ',')+1
/

SELECT trim(regexp_substr('Err1, Err2, Err3', '[^,]+', 1, LEVEL)) str_2_tab
  FROM dual
CONNECT BY LEVEL <= length('Err1, Err2, Err3') - length(REPLACE('Err1, Err2, Err3', ',', ''))+1
/

Also, may use DBMS_UTILITY.comma_to_table & table_to_comma: http://www.oracle-base.com/articles/9i/useful-procedures-and-functions-9i.php#DBMS_UTILITY.comma_to_table

Unlimber answered 17/1, 2013 at 20:27 Comment(1)
Be aware that comma_to_table() only works with tokens which fit Oracle's database object naming conventions. It will hurl on a string like '123,456,789' for instance.Jellied
L
10

I would like to propose a different approach using a PIPELINED table function. It's somewhat similar to the technique of the XMLTABLE, except that you are providing your own custom function to split the character string:

-- Create a collection type to hold the results
CREATE OR REPLACE TYPE typ_str2tbl_nst AS TABLE OF VARCHAR2(30);
/

-- Split the string according to the specified delimiter
CREATE OR REPLACE FUNCTION str2tbl (
  p_string    VARCHAR2,
  p_delimiter CHAR DEFAULT ',' 
)
RETURN typ_str2tbl_nst PIPELINED
AS
  l_tmp VARCHAR2(32000) := p_string || p_delimiter;
  l_pos NUMBER;
BEGIN
  LOOP
    l_pos := INSTR( l_tmp, p_delimiter );
    EXIT WHEN NVL( l_pos, 0 ) = 0;
    PIPE ROW ( RTRIM( LTRIM( SUBSTR( l_tmp, 1, l_pos-1) ) ) );
    l_tmp := SUBSTR( l_tmp, l_pos+1 );
  END LOOP;
END str2tbl;
/

-- The problem solution
SELECT name, 
       project, 
       TRIM(COLUMN_VALUE) error
  FROM t, TABLE(str2tbl(error));

Results:

      NAME PROJECT    ERROR
---------- ---------- --------------------
       108 test       Err1
       108 test       Err2
       108 test       Err3
       109 test2      Err1

The problem with this type of approach is that often the optimizer won't know the cardinality of the table function and it will have to make a guess. This could be potentialy harmful to your execution plans, so this solution can be extended to provide execution statistics for the optimizer.

You can see this optimizer estimate by running an EXPLAIN PLAN on the query above:

Execution Plan
----------------------------------------------------------
Plan hash value: 2402555806

----------------------------------------------------------------------------------------------
| Id  | Operation                          | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |         | 16336 |   366K|    59   (0)| 00:00:01 |
|   1 |  NESTED LOOPS                      |         | 16336 |   366K|    59   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL                | T       |     2 |    42 |     3   (0)| 00:00:01 |
|   3 |   COLLECTION ITERATOR PICKLER FETCH| STR2TBL |  8168 | 16336 |    28   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------

Even though the collection has only 3 values, the optimizer estimated 8168 rows for it (default value). This may seem irrelevant at first, but it may be enough for the optimizer to decide for a sub-optimal plan.

The solution is to use the optimizer extensions to provide statistics for the collection:

-- Create the optimizer interface to the str2tbl function
CREATE OR REPLACE TYPE typ_str2tbl_stats AS OBJECT (
  dummy NUMBER,

  STATIC FUNCTION ODCIGetInterfaces ( p_interfaces OUT SYS.ODCIObjectList )
  RETURN NUMBER,

  STATIC FUNCTION ODCIStatsTableFunction ( p_function  IN  SYS.ODCIFuncInfo,
                                           p_stats     OUT SYS.ODCITabFuncStats,
                                           p_args      IN  SYS.ODCIArgDescList,
                                           p_string    IN  VARCHAR2,
                                           p_delimiter IN  CHAR DEFAULT ',' )
  RETURN NUMBER
);
/

-- Optimizer interface implementation
CREATE OR REPLACE TYPE BODY typ_str2tbl_stats
AS
  STATIC FUNCTION ODCIGetInterfaces ( p_interfaces OUT SYS.ODCIObjectList )
  RETURN NUMBER
  AS
  BEGIN
    p_interfaces := SYS.ODCIObjectList ( SYS.ODCIObject ('SYS', 'ODCISTATS2') );
    RETURN ODCIConst.SUCCESS;
  END ODCIGetInterfaces;

  -- This function is responsible for returning the cardinality estimate
  STATIC FUNCTION ODCIStatsTableFunction ( p_function  IN  SYS.ODCIFuncInfo,
                                           p_stats     OUT SYS.ODCITabFuncStats,
                                           p_args      IN  SYS.ODCIArgDescList,
                                           p_string    IN  VARCHAR2,
                                           p_delimiter IN  CHAR DEFAULT ',' )
  RETURN NUMBER
  AS
  BEGIN
    -- I'm using basically half the string lenght as an estimator for its cardinality
    p_stats := SYS.ODCITabFuncStats( CEIL( LENGTH( p_string ) / 2 ) );
    RETURN ODCIConst.SUCCESS;
  END ODCIStatsTableFunction;

END;
/

-- Associate our optimizer extension with the PIPELINED function   
ASSOCIATE STATISTICS WITH FUNCTIONS str2tbl USING typ_str2tbl_stats;

Testing the resulting execution plan:

Execution Plan
----------------------------------------------------------
Plan hash value: 2402555806

----------------------------------------------------------------------------------------------
| Id  | Operation                          | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |         |     1 |    23 |    59   (0)| 00:00:01 |
|   1 |  NESTED LOOPS                      |         |     1 |    23 |    59   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL                | T       |     2 |    42 |     3   (0)| 00:00:01 |
|   3 |   COLLECTION ITERATOR PICKLER FETCH| STR2TBL |     1 |     2 |    28   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------

As you can see the cardinality on the plan above is not the 8196 guessed value anymore. It's still not correct because we are passing a column instead of a string literal to the function.

Some tweaking to the function code would be necessary to give a closer estimate in this particular case, but I think the overall concept is pretty much explained here.

The str2tbl function used in this answer was originally developed by Tom Kyte: https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:110612348061

The concept of associating statistics with object types can be further explored by reading this article: http://www.oracle-developer.net/display.php?id=427

The technique described here works in 10g+.

Loiretcher answered 27/10, 2016 at 13:55 Comment(0)
P
9

Starting from Oracle 12c you could use JSON_TABLE and JSON_ARRAY:

CREATE TABLE tab(Name, Project, Error) AS
SELECT 108,'test' ,'Err1, Err2, Err3' FROM dual UNION 
SELECT 109,'test2','Err1'             FROM dual;

And query:

SELECT *
FROM tab t
OUTER APPLY (SELECT TRIM(p) AS p
            FROM JSON_TABLE(REPLACE(JSON_ARRAY(t.Error), ',', '","'),
           '$[*]' COLUMNS (p VARCHAR2(4000) PATH '$'))) s;

Output:

┌──────┬─────────┬──────────────────┬──────┐
│ Name │ Project │      Error       │  P   │
├──────┼─────────┼──────────────────┼──────┤
│  108 │ test    │ Err1, Err2, Err3 │ Err1 │
│  108 │ test    │ Err1, Err2, Err3 │ Err2 │
│  108 │ test    │ Err1, Err2, Err3 │ Err3 │
│  109 │ test2   │ Err1             │ Err1 │
└──────┴─────────┴──────────────────┴──────┘

db<>fiddle demo

Percept answered 6/6, 2018 at 12:48 Comment(4)
I concede this is a clever trick but frankly it would puzzle me if I came across it in a code base.Jellied
@Jellied This is just show of what is possible with SQL. If I have to use such code in my codebase I would definitely wrap it with in a function or leave an extended comment :)Percept
Of course. It's just that this thread is one of the more popular hits for string tokenization with Oracle so I think we should include caveats on the more exotic solutions, to protect the innocent from themselves :)Jellied
I love this solution! it was helpful to get my objetive, thanks!!!Barbuto
D
4

REGEXP_COUNT wasn't added until Oracle 11i. Here's an Oracle 10g solution, adopted from Art's solution.

SELECT trim(regexp_substr('Err1, Err2, Err3', '[^,]+', 1, LEVEL)) str_2_tab
  FROM dual
CONNECT BY LEVEL <=
  LENGTH('Err1, Err2, Err3')
    - LENGTH(REPLACE('Err1, Err2, Err3', ',', ''))
    + 1;
Deadhead answered 21/7, 2014 at 20:20 Comment(1)
How can I add a filter for this lets say I want to filter with only name = '108'. I tried adding a where after the from clause but ended up with duplicates.Pasticcio
F
4

Here is an alternative implementation using XMLTABLE that allows for casting to different data types:

select 
  xmltab.txt
from xmltable(
  'for $text in tokenize("a,b,c", ",") return $text'
  columns 
    txt varchar2(4000) path '.'
) xmltab
;

... or if your delimited strings are stored in one or more rows of a table:

select 
  xmltab.txt
from (
  select 'a;b;c' inpt from dual union all
  select 'd;e;f' from dual
) base
inner join xmltable(
  'for $text in tokenize($input, ";") return $text'
  passing base.inpt as "input"
  columns 
    txt varchar2(4000) path '.'
) xmltab
  on 1=1
;
Fighterbomber answered 1/12, 2017 at 13:16 Comment(1)
I think this solution works for Oracle 11.2.0.3 and later versions.Jellied
A
3

Without using connect by or regexp:

    with mytable as (
      select 108 name, 'test' project, 'Err1,Err2,Err3' error from dual
      union all
      select 109, 'test2', 'Err1' from dual
    )
    ,x as (
      select name
      ,project
      ,','||error||',' error
      from mytable
    )
    ,iter as (SELECT rownum AS pos
        FROM all_objects
    )
    select x.name,x.project
    ,SUBSTR(x.error
      ,INSTR(x.error, ',', 1, iter.pos) + 1
      ,INSTR(x.error, ',', 1, iter.pos + 1)-INSTR(x.error, ',', 1, iter.pos)-1
    ) error
    from x, iter
    where iter.pos < = (LENGTH(x.error) - LENGTH(REPLACE(x.error, ','))) - 1;
Aide answered 26/11, 2015 at 21:31 Comment(1)
I went with this solution for a similar problem but with a slight change to iter: SELECT LEVEL AS POS FROM DUAL CONNECT BY LEVEL <= 100 The use of all_objects was a little random for my liking, when I knew I would not exceed a certain value and it was more explicit about the intention of iter.Conciliar
R
3

I had the same problem, and xmltable helped me:

SELECT id, trim(COLUMN_VALUE) text FROM t, xmltable(('"' || REPLACE(text, ',', '","') || '"'))

Repeater answered 14/1, 2018 at 4:34 Comment(1)
select trim(column_value) from xmltable('"SVN","ITA"') select to_number(column_value) from xmltable('1,2,3')Fairyfairyland
A
2

I'd like to add another method. This one uses recursive querys, something I haven't seen in the other answers. It is supported by Oracle since 11gR2.

with cte0 as (
    select phone_number x
    from hr.employees
), cte1(xstr,xrest,xremoved) as (
        select x, x, null
        from cte0
    union all        
        select xstr,
            case when instr(xrest,'.') = 0 then null else substr(xrest,instr(xrest,'.')+1) end,
            case when instr(xrest,'.') = 0 then xrest else substr(xrest,1,instr(xrest,'.') - 1) end
        from cte1
        where xrest is not null
)
select xstr, xremoved from cte1  
where xremoved is not null
order by xstr

It is quite flexible with the splitting character. Simply change it in the INSTR calls.

Ayesha answered 17/8, 2015 at 13:54 Comment(0)
D
2

If you have Oracle APEX 5.1 or later installed, you can use the convenient APEX_STRING.split function, e.g.:

select q.Name, q.Project, s.column_value as Error
from mytable q,
     APEX_STRING.split(q.Error, ',') s

The second parameter is the delimiter string. It also accepts a 3rd parameter to limit how many splits you want it to perform.

https://docs.oracle.com/en/database/oracle/application-express/20.1/aeapi/SPLIT-Function-Signature-1.html#GUID-3BE7FF37-E54F-4503-91B8-94F374E243E6

Dorcas answered 6/10, 2020 at 4:36 Comment(0)
F
1

In Oracle 11g and later, you can use a recursive sub-query and simple string functions (which may be faster than regular expressions and correlated hierarchical sub-queries):

Oracle Setup:

CREATE TABLE table_name ( name, project, error ) as
 select 108, 'test',  'Err1, Err2, Err3' from dual union all
 select 109, 'test2', 'Err1'             from dual;

Query:

WITH table_name_error_bounds ( name, project, error, start_pos, end_pos ) AS (
  SELECT name,
         project,
         error,
         1,
         INSTR( error, ', ', 1 )
  FROM   table_name
UNION ALL
  SELECT name,
         project,
         error,
         end_pos + 2,
         INSTR( error, ', ', end_pos + 2 )
  FROM   table_name_error_bounds
  WHERE  end_pos > 0
)
SELECT name,
       project,
       CASE end_pos
       WHEN 0
       THEN SUBSTR( error, start_pos )
       ELSE SUBSTR( error, start_pos, end_pos - start_pos )
       END AS error
FROM   table_name_error_bounds

Output:

NAME | PROJECT | ERROR
---: | :------ | :----
 108 | test    | Err1 
 109 | test2   | Err1 
 108 | test    | Err2 
 108 | test    | Err3 

db<>fiddle here

Furtherance answered 28/10, 2019 at 13:29 Comment(0)
B
-1

i had used the DBMS_UTILITY.comma_to _table function actually its working the code as follows

declare
l_tablen  BINARY_INTEGER;
l_tab     DBMS_UTILITY.uncl_array;
cursor cur is select * from qwer;
rec cur%rowtype;
begin
open cur;
loop
fetch cur into rec;
exit when cur%notfound;
DBMS_UTILITY.comma_to_table (
     list   => rec.val,
     tablen => l_tablen,
     tab    => l_tab);
FOR i IN 1 .. l_tablen LOOP
    DBMS_OUTPUT.put_line(i || ' : ' || l_tab(i));
END LOOP;
end loop;
close cur;
end; 

i had used my own table and column names

Burns answered 20/10, 2014 at 12:18 Comment(3)
Be aware that comma_to_table() only works with tokens which fit Oracle's database object naming conventions. It will hurl on a string like '123,456,789' for instance.Jellied
can we implement using temporary tables?Burns
Umm, given all the other workable solutions why would we want to use temporary tables which come with a massive overhead of materializing the data?Jellied

© 2022 - 2024 — McMap. All rights reserved.