Monday, November 22, 2021

How to split comma separated value strings into rows in Oracle Database

Often you receive a CSV string that you need to convert to rows. To do this with SQL you need to:

◉ Generate a row for each value

◉ For each row N, extract the value at position N from the string

◉ You can do this in Oracle Database with a query like:

with rws as (

  select 'split,into,rows' str from dual

)

  select regexp_substr (

           str,

           '[^,]+',

           1,

           level

         ) value

  from   rws

  connect by level <= 

    length ( str ) - length ( replace ( str, ',' ) ) + 1;

VALUE   

split    

into     

rows

So what's going on here?

The connect by level clause generates a row for each value. It finds how many values there are by:

◉ Using replace ( str, ',' ) to remove all the commas from the string

◉ Subtracting the length of the replaced string from the original to get the number of commas

◉ Add one to this result to get the number of values

The regexp_substr extracts each value using this regular expression:

[^,]+

This searches for:

◉ Characters not in the list after the caret. So everything except a comma.

◉ The plus operator means it must match one or more of these non-comma characters

The third argument tells regexp_substr to start the search at the first character. And the final one instructs it to fetch the Nth occurrence of the pattern. So row one finds the first value, row two the second, and so on.

Oracle Database, Oracle Database Exam, Oracle Database Exam Prep, Oracle Database Tutorial and Materials, Oracle Database Certification, Oracle Database Learning, Oracle Database Career, Oracle Database Preparation
Photo by Toa Heftiba Şinca from Pexels

Before we move on, note the solution above assumes there are no commas at the start or end of the string. If this is possible, you should trim the string first:

with rws as (
  select ',leading,commas,and,trailing,' str from dual
)
  select regexp_substr (
           str,
           '[^,]+',
           1,
           level
         ) value
  from   rws
  connect by level <= 
    length ( trim ( both ',' from str ) ) - 
    length ( replace ( str, ',' ) ) + 1;
    
VALUE      
leading     
commas      
and         
trailing

You can also adapt this to split strings with other delimiters.

How to turn space, semicolon, pipe or other delimited strings into rows


CSV may be the most common format for delimited values. But often you'll want to split strings with other delimiters, such as pipes or semicolons. To do this, swap commas for the new character:

with rws as (
  select 'split;semicolons;into;rows' str from dual
)
  select regexp_substr (
           str,
           '[^;]+',
           1,
           level
         ) value
  from   rws
  connect by level <= 
    length ( str ) - 
    length ( replace ( str, ';' ) ) + 1;
  
VALUE        
split         
semicolons    
into          
rows

Using this basic method you can overcome a common challenge: passing a CSV as input to an in list.

How to do variable in lists


The in condition returns all rows where the target matches one of the values in a comma-separated list. For example:

select employee_id, first_name, last_name
from   hr.employees
where  last_name in ( 'King', 'Kochhar', 'De Haan' );

EMPLOYEE_ID FIRST_NAME    LAST_NAME   
        102 Lex           De Haan      
        156 Janette       King         
        100 Steven        King         
        101 Neena         Kochhar

The issue here is the items in the list need to be separate values. If you pass them as a single CSV string like this:

select employee_id, first_name, last_name
from   hr.employees
where  last_name in ( 'King,Kochhar,De Haan' );

no rows selected

The query returns no rows!

This is because it's searching for someone with the name King,Kochhar,De Haan. Unsurprisingly, no one has that name.

This makes in hard to use with bind variables, particularly if the number of values is unknown.

Using the CSV-to-rows trick above you can pass the search string as a single value and split it into rows like this:

with rws as (
  select 'King,Kochhar,De Haan' str from dual
)
select employee_id, first_name, last_name
from   hr.employees
where  last_name in ( 
  select regexp_substr (
           str,
           '[^,]+',
           1,
           level
         ) value
  from   rws
  connect by level <= 
    length ( str ) - 
    length ( replace ( str, ',' ) ) + 1
);

EMPLOYEE_ID FIRST_NAME    LAST_NAME   
        156 Janette       King         
        100 Steven        King         
        101 Neena         Kochhar      
        102 Lex           De Haan

This enables you to use a bind variable for the in list:

declare
  emp_cur     sys_refcursor;
  search_str  varchar2(100) := 'King,Kochhar,De Haan';
  employee_id integer;
  first_name  varchar2(30);
  last_name   varchar2(30);
begin
  open emp_cur for q'!select employee_id, first_name, last_name
from   hr.employees
where  last_name in ( 
  select regexp_substr (
           :str,
           '[^,]+',
           1,
           level
         ) value
  from   dual
  connect by level <= 
    length ( :str ) - length ( replace ( :str, ',' ) ) + 1
)!' using search_str, search_str, search_str;

  loop
    fetch emp_cur into employee_id, first_name, last_name;
    exit when emp_cur%notfound;
    dbms_output.put_line (  
      employee_id || ' ' || first_name || ' ' || last_name 
    );
  end loop;
end;
/

156 Janette King
100 Steven King
101 Neena Kochhar
102 Lex De Haan

This works well when you're splitting one value, such as a string literal or bind variable.

But what if the strings are stored in a table and you have many rows with strings to split?

Oracle Database, Oracle Database Exam, Oracle Database Exam Prep, Oracle Database Tutorial and Materials, Oracle Database Certification, Oracle Database Learning, Oracle Database Career, Oracle Database Preparation
Image by ejaugsburg from Pixabay

Splitting delimited values stored in columns into rows


Sometimes you'll find columns storing lists of separated values. For example:

create table csvs (
  csv_id   integer
    primary key,
  csv_text varchar2(1000)
);

insert into csvs 
  values ( 1, 'split,into,rows' );
insert into csvs 
  values ( 2, 'even,more,values,to,extract' );
  
commit;

This is bad design, violates first normal form, and causes many headaches.

For example, if you want to join each value to rows in another table, you need to convert the string into rows first.

You can do this with the CSV-to-rows trick above as the basis. But it needs a tweak.

If you swap dual for your table (csvs here), the query links every row with every other row in the table! This leads to generating a huge data set and slowing the query to a crawl.

There are a few ways to avoid this.

From 12c you can use a lateral join. This enables you to write a subquery that uses values from tables to its left. So the database only generates a tree corresponding to each source row.

This gives a query like:

select csv_id, regexp_substr (
         csv_text,
         '[^,]+',
         1,
         rn
       ) val
from   csvs
cross  join lateral (
  select level rn from dual
  connect by level <=   
    length ( csv_text ) - length ( replace ( csv_text, ',' ) ) + 1
);

CSV_ID VAL       
     1 split      
     1 into       
     1 rows       
     2 even       
     2 more       
     2 values     
     2 to         
     2 extract

If you're stuck on ancient versions of Oracle Database, you can extend the connect by clause instead:

select csv_id, regexp_substr (
         csv_text,
         '[^,]+',
         1,
         level
       ) val
from   csvs 
connect by level <=   
  length ( csv_text ) - length ( replace ( csv_text, ',' ) ) + 1
and    prior csv_text = csv_text
and    prior sys_guid () is not null;

CSV_ID VAL       
     2 even       
     2 more       
     2 values     
     2 to         
     2 extract    
     1 split      
     1 into       
     1 rows

So what do the extra connect by clauses do?

◉ prior csv = csv ensures the query only creates new rows linked to the original

◉ prior sys_guid () is not null adds a unique bit of data to each row. This prevents cycles because Oracle Database uses all the columns with prior operators in the connect by to check for loops. If two rows in a tree have the same value for these columns, the database considers this a loop.

These methods help you turn delimited values into rows, but so far we've made a big assumption:

The separator cannot appear within the values themselves.

What do you do if it can?

Oracle Database, Oracle Database Exam, Oracle Database Exam Prep, Oracle Database Tutorial and Materials, Oracle Database Certification, Oracle Database Learning, Oracle Database Career, Oracle Database Preparation
Photo by Toa Heftiba Şinca from Pexels

Converting quoted values containing the separator into rows


Say you've received a CSV with the values in "lastname, firstname" format. Splitting on each comma will place last and first names on separate lines, giving twice the number of rows you need!

To overcome this change the regular expression. Instead of searching for the Nth token that excludes a comma, an alternative is to:

◉ Find the number of tokens by counting how many times "," (quote comma quote) appears; regexp_count ( str, '","' ) is a neat way to do this
◉ Find the position of the next double quote after every second comma. You can do this with instr ( str, '"', 1, ( rn * 2 ) - 1 )
◉ Return all the characters that are not a double quote from this point

Which looks something like:

with rws as (
  select department_id, 
         listagg ( '"' || last_name || ', ' || first_name || '"', ',' ) 
           within group ( order by employee_id ) str
  from   hr.employees
  where  department_id in ( 10, 20, 30 )
  group  by department_id
)
  select department_id, regexp_substr (
           str,
           '[^"]+',
           instr  ( str, '"', 1, (rn*2)-1 ), 
           1
         ) employee
  from   rws, lateral (
    select level rn from dual
    connect by level <=   
      regexp_count ( str, '","' ) + 1
  );
  
DEPARTMENT_ID EMPLOYEE             
           10 Whalen, Jennifer      
           20 Hartstein, Michael    
           20 Fay, Pat              
           30 Raphaely, Den         
           30 Khoo, Alexander       
           30 Baida, Shelli         
           30 Tobias, Sigal         
           30 Himuro, Guy           
           30 Colmenares, Karen

This makes two important assumptions:

◉ Every value contains exactly one comma. If there can be a different number of delimiters in each value, you'll need to adjust the logic.
◉ Each value is enclosed in quotes (or some other character not present in the values themselves)

A note on performance


At this point you may be wondering:

Why not use regexp_count in the original query to find how many values there? Why subtract the lengths of the strings?

The short answer is: you can!

But be aware that regular expressions are slower than other string manipulation functions.

On small data sets processing short CSVs you're unlikely to notice this difference. But if gaining every last microsecond matters to you, avoid the regular expressions. You can also swap out the regexp_substr expression in the select for one using substr and instr.

Whichever delimited values to rows method you use, writing the query is fiddly. As this is a common task, it'd be good to create a function to split strings for you.

You can split a single string with pipelined table functions (PTFs) like this:

create or replace type string_table as 
  table of varchar2 (4000);
/

create or replace function split_string (
  delimited_string varchar2,
  separator        varchar2 default ','
) return string_table
  pipelined
as
  delimited_string_cleaned varchar2(32767);
  substring varchar2(4000);
  pos       pls_integer;
begin
  delimited_string_cleaned := 
    trim ( both separator from delimited_string ) || 
    separator;
  pos := instr ( delimited_string_cleaned, separator );
  substring := substr ( delimited_string_cleaned, 1, pos - 1 );
  
  loop
    exit when substring is null;
    pipe row ( substring );
  
    substring := substr ( 
      delimited_string_cleaned, 
      pos + 1, 
      instr ( 
        delimited_string_cleaned, separator, pos + 1 
      ) - pos - 1 
    );
    pos := instr ( delimited_string_cleaned, separator, pos + 1 );   
  end loop;

  return;
end;
/

select *
from   split_string ( 'King,Kochhar,De Haan' );

COLUMN_VALUE   
King            
Kochhar         
De Haan

While PTFs can accept cursors, making a generic PTF that converts CSVs stored in a table to rows is hard.

Luckily Oracle Database 19c offers a better way: SQL macros!

Oracle Database, Oracle Database Exam, Oracle Database Exam Prep, Oracle Database Tutorial and Materials, Oracle Database Certification, Oracle Database Learning, Oracle Database Career, Oracle Database Preparation

Make reusable string-to-rows functions with SQL macros


Added in Oracle Database 19.6, SQL macros return table SQL expressions. From 21c you can also make macros that return scalar expressions.

You can think of SQL macros as query templates. They return a string containing the text of your expression. The function's parameters are placeholders in this string.

At parse time the database resolves the expression. It replaces the parameter placeholders with the text of the actual values you call it with. It then substitutes this expression back into your query.

This enables you to define a reusable query you can pass tables or columns to at runtime.

This is a big advantage over PTFs which can't accept tables as parameters.

Macros also resolve to pure SQL. This avoids any context switches from SQL to PL/SQL and gives the optimizer better visibility of the query. These benefits mean using macros is likely to be faster than PTFs.

In this post we've identified two template queries:

◉ Taking a CSV string literal and converting it into rows
◉ Extracting delimited values stored in a table

So we need two macros. One that accepts an input string, the other a table name and source column. You could create these as two functions with different names. Or create one overloaded function in a package. I'll do the latter here.

One overload will accept a string, the other a table and column containing delimited values. Both will have an optional parameter for the separator, which defaults to a comma:

create or replace package string_macros_pkg as 
  function split_string ( 
    tab dbms_tf.table_t,
    col dbms_tf.columns_t,
    separator varchar2 default ','
  ) return clob sql_macro;
  
  function split_string ( 
    delimited_string varchar2,
    separator        varchar2 default ','
  ) return clob sql_macro;
end;
/

Place the template queries above into these like so:

create or replace package body string_macros_pkg as 
  function split_string ( 
    tab dbms_tf.table_t,
    col dbms_tf.columns_t,
    separator varchar2 default ','
  ) return clob sql_macro as
    sql_text clob;
  begin
    
    sql_text := 'select t.*, 
         regexp_substr (
           ' || col ( 1 ) || ',
           ''[^'' || separator || '']+'',
           1,
           pos
         ) str,
         pos
  from   tab t,
         lateral (
           select level pos
           from   dual
           connect by level <= 
             length ( ' || col ( 1 ) || ' ) 
               - length ( replace ( ' || col ( 1 ) || ', separator ) ) 
               + 1
         )';
  
    return sql_text;
    
  end split_string;
  
  function split_string ( 
    delimited_string varchar2,
    separator        varchar2 default ','
  ) return clob sql_macro as
      sql_text clob;
  begin
    
    sql_text := 'select 
         regexp_substr (
           delimited_string,
           ''[^'' || separator || '']+'',
           1,
           level
         ) str,
         level pos
  from   dual
  connect by level <= 
    length ( delimited_string ) 
      - length ( replace ( delimited_string, separator ) ) 
      + 1';
  
    return sql_text;
    
  end split_string;

end;
/

With this in place, you can now call the macro, passing either a string or a table's column to split. Here are a few examples to get you going:

select employee_id, first_name, last_name
from   hr.employees
where  last_name in (
  select str 
  from   string_macros_pkg.split_string ( 'King,Kochhar,De Haan' )
);

EMPLOYEE_ID FIRST_NAME    LAST_NAME   
        100 Steven        King         
        156 Janette       King         
        101 Neena         Kochhar      
        102 Lex           De Haan  

select employee_id, first_name, last_name 
from   hr.employees
join   string_macros_pkg.split_string ( 'King,Kochhar,De Haan' )
on     last_name = str;

EMPLOYEE_ID FIRST_NAME    LAST_NAME
        100 Steven        King     
        156 Janette       King     
        101 Neena         Kochhar  
        102 Lex           De Haan  
        
select csv_id, str, pos 
from   string_macros_pkg.split_string ( 
  csvs, columns ( csv_text ), ',' 
);

CSV_ID STR       POS   
     1 split       1 
     1 into        2 
     1 rows        3 
     2 even        1 
     2 more        2 
     2 values      3 
     2 to          4 
     2 extract     5

Source: oracle.com

Related Posts

0 comments:

Post a Comment