Often you receive a CSV string that you need to convert to rows. To do this with SQL you need to:
◉ Generate a row for each value
◉ For each row N, extract the value at position N from the string
◉ You can do this in Oracle Database with a query like:
with rws as (
select 'split,into,rows' str from dual
)
select regexp_substr (
str,
'[^,]+',
1,
level
) value
from rws
connect by level <=
length ( str ) - length ( replace ( str, ',' ) ) + 1;
VALUE
split
into
rows
So what's going on here?
The connect by level clause generates a row for each value. It finds how many values there are by:
◉ Using replace ( str, ',' ) to remove all the commas from the string
◉ Subtracting the length of the replaced string from the original to get the number of commas
◉ Add one to this result to get the number of values
The regexp_substr extracts each value using this regular expression:
[^,]+
This searches for:
◉ Characters not in the list after the caret. So everything except a comma.
◉ The plus operator means it must match one or more of these non-comma characters
The third argument tells regexp_substr to start the search at the first character. And the final one instructs it to fetch the Nth occurrence of the pattern. So row one finds the first value, row two the second, and so on.
Photo by Toa Heftiba Şinca from Pexels
Before we move on, note the solution above assumes there are no commas at the start or end of the string. If this is possible, you should trim the string first:
with rws as (
select ',leading,commas,and,trailing,' str from dual
)
select regexp_substr (
str,
'[^,]+',
1,
level
) value
from rws
connect by level <=
length ( trim ( both ',' from str ) ) -
length ( replace ( str, ',' ) ) + 1;
VALUE
leading
commas
and
trailing
You can also adapt this to split strings with other delimiters.
How to turn space, semicolon, pipe or other delimited strings into rows
CSV may be the most common format for delimited values. But often you'll want to split strings with other delimiters, such as pipes or semicolons. To do this, swap commas for the new character:
with rws as (
select 'split;semicolons;into;rows' str from dual
)
select regexp_substr (
str,
'[^;]+',
1,
level
) value
from rws
connect by level <=
length ( str ) -
length ( replace ( str, ';' ) ) + 1;
VALUE
split
semicolons
into
rows
Using this basic method you can overcome a common challenge: passing a CSV as input to an in list.
How to do variable in lists
The in condition returns all rows where the target matches one of the values in a comma-separated list. For example:
select employee_id, first_name, last_name
from hr.employees
where last_name in ( 'King', 'Kochhar', 'De Haan' );
EMPLOYEE_ID FIRST_NAME LAST_NAME
102 Lex De Haan
156 Janette King
100 Steven King
101 Neena Kochhar
The issue here is the items in the list need to be separate values. If you pass them as a single CSV string like this:
select employee_id, first_name, last_name
from hr.employees
where last_name in ( 'King,Kochhar,De Haan' );
no rows selected
The query returns no rows!
This is because it's searching for someone with the name King,Kochhar,De Haan. Unsurprisingly, no one has that name.
Using the CSV-to-rows trick above you can pass the search string as a single value and split it into rows like this:
with rws as (
select 'King,Kochhar,De Haan' str from dual
)
select employee_id, first_name, last_name
from hr.employees
where last_name in (
select regexp_substr (
str,
'[^,]+',
1,
level
) value
from rws
connect by level <=
length ( str ) -
length ( replace ( str, ',' ) ) + 1
);
EMPLOYEE_ID FIRST_NAME LAST_NAME
156 Janette King
100 Steven King
101 Neena Kochhar
102 Lex De Haan
This enables you to use a bind variable for the in list:
declare
emp_cur sys_refcursor;
search_str varchar2(100) := 'King,Kochhar,De Haan';
employee_id integer;
first_name varchar2(30);
last_name varchar2(30);
begin
open emp_cur for q'!select employee_id, first_name, last_name
from hr.employees
where last_name in (
select regexp_substr (
:str,
'[^,]+',
1,
level
) value
from dual
connect by level <=
length ( :str ) - length ( replace ( :str, ',' ) ) + 1
)!' using search_str, search_str, search_str;
loop
fetch emp_cur into employee_id, first_name, last_name;
exit when emp_cur%notfound;
dbms_output.put_line (
employee_id || ' ' || first_name || ' ' || last_name
);
end loop;
end;
/
156 Janette King
100 Steven King
101 Neena Kochhar
102 Lex De Haan
This works well when you're splitting one value, such as a string literal or bind variable.
But what if the strings are stored in a table and you have many rows with strings to split?
Image by ejaugsburg from Pixabay
Splitting delimited values stored in columns into rows
Sometimes you'll find columns storing lists of separated values. For example:
create table csvs (
csv_id integer
primary key,
csv_text varchar2(1000)
);
insert into csvs
values ( 1, 'split,into,rows' );
insert into csvs
values ( 2, 'even,more,values,to,extract' );
commit;
For example, if you want to join each value to rows in another table, you need to convert the string into rows first.
You can do this with the CSV-to-rows trick above as the basis. But it needs a tweak.
If you swap dual for your table (csvs here), the query links every row with every other row in the table! This leads to generating a huge data set and slowing the query to a crawl.
There are a few ways to avoid this.
From 12c you can use a lateral join. This enables you to write a subquery that uses values from tables to its left. So the database only generates a tree corresponding to each source row.
This gives a query like:
select csv_id, regexp_substr (
csv_text,
'[^,]+',
1,
rn
) val
from csvs
cross join lateral (
select level rn from dual
connect by level <=
length ( csv_text ) - length ( replace ( csv_text, ',' ) ) + 1
);
CSV_ID VAL
1 split
1 into
1 rows
2 even
2 more
2 values
2 to
2 extract
If you're stuck on ancient versions of Oracle Database, you can extend the connect by clause instead:
select csv_id, regexp_substr (
csv_text,
'[^,]+',
1,
level
) val
from csvs
connect by level <=
length ( csv_text ) - length ( replace ( csv_text, ',' ) ) + 1
and prior csv_text = csv_text
and prior sys_guid () is not null;
CSV_ID VAL
2 even
2 more
2 values
2 to
2 extract
1 split
1 into
1 rows
So what do the extra connect by clauses do?
◉ prior csv = csv ensures the query only creates new rows linked to the original
◉ prior sys_guid () is not null adds a unique bit of data to each row. This prevents cycles because Oracle Database uses all the columns with prior operators in the connect by to check for loops. If two rows in a tree have the same value for these columns, the database considers this a loop.
These methods help you turn delimited values into rows, but so far we've made a big assumption:
The separator cannot appear within the values themselves.
What do you do if it can?
Photo by Toa Heftiba Şinca from Pexels
Converting quoted values containing the separator into rows
Say you've received a CSV with the values in "lastname, firstname" format. Splitting on each comma will place last and first names on separate lines, giving twice the number of rows you need!
To overcome this change the regular expression. Instead of searching for the Nth token that excludes a comma, an alternative is to:
◉ Find the number of tokens by counting how many times "," (quote comma quote) appears; regexp_count ( str, '","' ) is a neat way to do this
◉ Find the position of the next double quote after every second comma. You can do this with instr ( str, '"', 1, ( rn * 2 ) - 1 )
◉ Return all the characters that are not a double quote from this point
Which looks something like:
with rws as (
select department_id,
listagg ( '"' || last_name || ', ' || first_name || '"', ',' )
within group ( order by employee_id ) str
from hr.employees
where department_id in ( 10, 20, 30 )
group by department_id
)
select department_id, regexp_substr (
str,
'[^"]+',
instr ( str, '"', 1, (rn*2)-1 ),
1
) employee
from rws, lateral (
select level rn from dual
connect by level <=
regexp_count ( str, '","' ) + 1
);
DEPARTMENT_ID EMPLOYEE
10 Whalen, Jennifer
20 Hartstein, Michael
20 Fay, Pat
30 Raphaely, Den
30 Khoo, Alexander
30 Baida, Shelli
30 Tobias, Sigal
30 Himuro, Guy
30 Colmenares, Karen
This makes two important assumptions:
◉ Every value contains exactly one comma. If there can be a different number of delimiters in each value, you'll need to adjust the logic.
◉ Each value is enclosed in quotes (or some other character not present in the values themselves)
A note on performance
At this point you may be wondering:
Why not use regexp_count in the original query to find how many values there? Why subtract the lengths of the strings?
The short answer is: you can!
But be aware that regular expressions are slower than other string manipulation functions.
On small data sets processing short CSVs you're unlikely to notice this difference. But if gaining every last microsecond matters to you, avoid the regular expressions. You can also swap out the regexp_substr expression in the select for one using substr and instr.
Whichever delimited values to rows method you use, writing the query is fiddly. As this is a common task, it'd be good to create a function to split strings for you.
You can split a single string with pipelined table functions (PTFs) like this:
create or replace type string_table as
table of varchar2 (4000);
/
create or replace function split_string (
delimited_string varchar2,
separator varchar2 default ','
) return string_table
pipelined
as
delimited_string_cleaned varchar2(32767);
substring varchar2(4000);
pos pls_integer;
begin
delimited_string_cleaned :=
trim ( both separator from delimited_string ) ||
separator;
pos := instr ( delimited_string_cleaned, separator );
substring := substr ( delimited_string_cleaned, 1, pos - 1 );
loop
exit when substring is null;
pipe row ( substring );
substring := substr (
delimited_string_cleaned,
pos + 1,
instr (
delimited_string_cleaned, separator, pos + 1
) - pos - 1
);
pos := instr ( delimited_string_cleaned, separator, pos + 1 );
end loop;
return;
end;
/
select *
from split_string ( 'King,Kochhar,De Haan' );
COLUMN_VALUE
King
Kochhar
De Haan
While PTFs can accept cursors, making a generic PTF that converts CSVs stored in a table to rows is hard.
Luckily Oracle Database 19c offers a better way: SQL macros!
Make reusable string-to-rows functions with SQL macros
Added in Oracle Database 19.6, SQL macros return table SQL expressions. From 21c you can also make macros that return scalar expressions.
You can think of SQL macros as query templates. They return a string containing the text of your expression. The function's parameters are placeholders in this string.
At parse time the database resolves the expression. It replaces the parameter placeholders with the text of the actual values you call it with. It then substitutes this expression back into your query.
This enables you to define a reusable query you can pass tables or columns to at runtime.
This is a big advantage over PTFs which can't accept tables as parameters.
Macros also resolve to pure SQL. This avoids any context switches from SQL to PL/SQL and gives the optimizer better visibility of the query. These benefits mean using macros is likely to be faster than PTFs.
In this post we've identified two template queries:
◉ Taking a CSV string literal and converting it into rows
◉ Extracting delimited values stored in a table
So we need two macros. One that accepts an input string, the other a table name and source column. You could create these as two functions with different names. Or create one overloaded function in a package. I'll do the latter here.
One overload will accept a string, the other a table and column containing delimited values. Both will have an optional parameter for the separator, which defaults to a comma:
create or replace package string_macros_pkg as
function split_string (
tab dbms_tf.table_t,
col dbms_tf.columns_t,
separator varchar2 default ','
) return clob sql_macro;
function split_string (
delimited_string varchar2,
separator varchar2 default ','
) return clob sql_macro;
end;
/
Place the template queries above into these like so:
create or replace package body string_macros_pkg as
function split_string (
tab dbms_tf.table_t,
col dbms_tf.columns_t,
separator varchar2 default ','
) return clob sql_macro as
sql_text clob;
begin
sql_text := 'select t.*,
regexp_substr (
' || col ( 1 ) || ',
''[^'' || separator || '']+'',
1,
pos
) str,
pos
from tab t,
lateral (
select level pos
from dual
connect by level <=
length ( ' || col ( 1 ) || ' )
- length ( replace ( ' || col ( 1 ) || ', separator ) )
+ 1
)';
return sql_text;
end split_string;
function split_string (
delimited_string varchar2,
separator varchar2 default ','
) return clob sql_macro as
sql_text clob;
begin
sql_text := 'select
regexp_substr (
delimited_string,
''[^'' || separator || '']+'',
1,
level
) str,
level pos
from dual
connect by level <=
length ( delimited_string )
- length ( replace ( delimited_string, separator ) )
+ 1';
return sql_text;
end split_string;
end;
/
With this in place, you can now call the macro, passing either a string or a table's column to split. Here are a few examples to get you going:
select employee_id, first_name, last_name
from hr.employees
where last_name in (
select str
from string_macros_pkg.split_string ( 'King,Kochhar,De Haan' )
);
EMPLOYEE_ID FIRST_NAME LAST_NAME
100 Steven King
156 Janette King
101 Neena Kochhar
102 Lex De Haan
select employee_id, first_name, last_name
from hr.employees
join string_macros_pkg.split_string ( 'King,Kochhar,De Haan' )
on last_name = str;
EMPLOYEE_ID FIRST_NAME LAST_NAME
100 Steven King
156 Janette King
101 Neena Kochhar
102 Lex De Haan
select csv_id, str, pos
from string_macros_pkg.split_string (
csvs, columns ( csv_text ), ','
);
CSV_ID STR POS
1 split 1
1 into 2
1 rows 3
2 even 1
2 more 2
2 values 3
2 to 4
2 extract 5
Source: oracle.com
0 comments:
Post a Comment