Wednesday, December 8, 2021

Search indexes for JSON

Oracle Database, Oracle Database Certification, Database Career, Database Preparation, Database Guides, Database Prep

The JSON Search Index

The JSON search index is a flexible index which indexes all the content of a JSON document collection for both full text search (known as keyword searches) using json_textcontains() predicates, and scalar value range search using json_exists() and json_value() predicates. Additionally, it can be used for maintaining an up-to-date JSON dataguide describing the schema of the JSON document collection.

When should we use a JSON search index?

If you know the layout of your JSON in advance, and know which fields will be commonly searched, you are probably best off creating a single-value functional B-tree index for json_value() predicates or multi-value functional B-tree index for json_exists() predicates. For example, let’s assume we've created a table and inserted a document like this:

create table mytab (jtext varchar2(2000) constraint jtextjson check (jtext is json))

insert into mytab (jtext) values ( '{ "customer": { "id": 20, "name": "Acme Inc" } }' )

If we frequently search on single-value customer ID or single-value customer name, it makes sense to create single value functional indexes on those fields for using json_value() predicates. We can do that using the "simple dot notation" for JSON:

create index custid on mytab ( jtext.customer.id.number() )

create index custname on mytab ( jtext.customer.name.string() )

In order to search an array via json_exists, in Oracle Database 21c we can create a multi-value functional index. In Oracle Database 19c we can can create an on-statement-refreshable materialized view over a json_table().

For example, given the following JSON document with an array creditscore[]:

 create table person_collection_t (id number primary key,  jdoc JSON);

 insert into person_collection_t

 values(1,

'{"person":

    {"birthdate":"1998-03-04", "creditscore":[700, 650, 720],

     "experiences" : "json xml semi-structured data processing",

     "hobby" : ["bicycle", "chess", "hiking"] } }

     ' );

 commit;

If we want to perform json_exists() predicates on the creditscore[] array, we may benefit from an index.

In 21c we can create a multi-value functional index:

create multivalue index creditsIdx

 on person_collection t (t.jdoc.person.creditscore[*].number());

However, there are various situations where we can't make use of  functional indexes or materialized views:

1. We don't know in advance what fields will be present, or used for searching, in our JSON. That is, the search is ad-hoc in nature.

2. We need to do full text searching (also known as keyword searches) on textual fields within our JSON we can do this with json_textcontains() specifying a path context for the search.

3. There are too many JSON path scalar fields that may need to be searched via json_exists() and json_value(). In general, as number of functional indices or materialized views increases, the storage space overhead increases and DML performance degrades due to maintaining too many functional indices and/or materialized views.

A 'search index' overcomes all these issues.  We can create a search index thus:

create search index srchidx on person_collection_t (jdoc) for json

How is a JSON search index used ?

By default, JSON Search index indexes both JSON path structures and leaf values of JSON documents. Leaf values that are numbers, or strings that are convertible to numbers, are indexed as numbers. Time (temporal) datatypes, or strings that are convertible to them according to the ISO8601 format, are indexed as timestamp normalized to UTC timezone.

Therefore, when we search against a table with a JSON search index, the query processor will know that the search index is present and will use it to speed up all appropriate searches using json_textcontains(), json_exists() and json_value() predicates in WHERE clause.

Let's look at the plan for a Json_textcontains() query using a search index:

explain plan for

select count(*) from person_collection t

where json_textcontains(t.jdoc, '$.person.experiences', 'data process')

/

PLAN_TABLE_OUTPUT

----------------------------------------------------------------------------

|   0 | SELECT STATEMENT

|   1 |  SORT AGGREGATE  

|*  2 |   DOMAIN INDEX   | SRCHIDX

---------------------------------------------------

   2 - access("CTXSYS"."CONTAINS"("T"."JDOC" /*+ LOB_BY_VALUE */

              ,'(data process) INPATH (/person/experiences)')>0)

And a Json_value() query using a search index

explain plan for

select count(*) from person_collection t

where JSON_VALUE(t.jdoc, '$.person.birthdate' returning date)  < :1

/

PLAN_TABLE_OUTPUT

--------------------------------------------------------------------------------

|   0 | SELECT STATEMENT             

|   1 |  SORT AGGREGATE              

|*  2 |   TABLE ACCESS BY INDEX ROWID| PERSON_COLLECTION

|*  3 |    DOMAIN INDEX              | SRCHIDX

   2 - filter(JSON_VALUE("T"."JDOC" /*+ LOB_BY_VALUE */  FORMAT OSON ,

              '$.person.birthdate' RETURNING DATE TRUNCATE TIME NULL ON ERROR)<:1

   3 - access("CTXSYS"."CONTAINS"("T"."JDOC" /*+ LOB_BY_VALUE */

              ,'('||('sdata(FTMS_CE3B2313E5C2A8B4A28FA522BF150D9B_birthdate  <

              "'||TO_CHAR(:1,'SYYYY-MM-DD"T"HH24:MI:SS')||'" )')||')')>0)

Finally,  the plan for Json_exists() using search index

explain plan for

select count(*) from person_collection t

where json_exists(t.jdoc, '$.person.creditscore[*]?(@.number() > 700)')

/

--------------------------------------------------------------------------------

|   0 | SELECT STATEMENT           

|   1 |  SORT AGGREGATE     

|*  2 |   TABLE ACCESS BY INDEX ROWID| PERSON_COLLECTION

|*  3 |    DOMAIN INDEX              | SRCHIDX           

   2 - filter(JSON_EXISTS2("T"."JDOC" /*+ LOB_BY_VALUE */  FORMAT OSON ,

              '$.person.creditscore[*]?(@.number() > 700)' FALSE ON ERROR)=1)

   3 - access("CTXSYS"."CONTAINS"("T"."JDOC" /*+ LOB_BY_VALUE */

              ,'(sdata(FNUM_9FF831A9FB5F275573ED5611C9D52D76_creditscore  > 700

))')>0)

explain plan for

select count(*) from person_collection t

where json_exists(t.jdoc, '$.person.hobby')

/

----------------------------------------------------------------------------

|   0 | SELECT STATEMENT |

|   1 |  SORT AGGREGATE   |

|*  2 |   DOMAIN INDEX   | SRCHIDX |

---------------------------------------------------

   2 - access("CTXSYS"."CONTAINS"("T"."JDOC" /*+ LOB_BY_VALUE */

              ,'HASPATH(/person/hobby)')>0)

Note JSON search index works for JSON text with IS JSON check constraint in 19c, However, Oracle highly recommends using 21c JSON datatype when possible, for advanced JSON functionality, optimal performance and search index capabilities.

The basic search index above works well for small databases with low volumes of DML (creates, updates and deletes). For larger databases, or those with high volumes of DML, we may want to customize the search index.

JSON search indexes are based on Oracle Text technology, and accept most of the same index options as the Oracle Text CONTEXT indexes.  In the next section we will consider some of the more common options you may wish to use.

Customizing the Search Index

SYNC intervals

JSON search indexes must be "synchronized" after DML, before they are searchable. You may notice that if you insert a document but do not commit it, some searches using a search index will not find it, because the index has not yet been synchronized.

By default, JSON search indexes use a method known as sync on commit. That means the index is synchronized automatically as part of the commit. This has the advantage that the documents are immediately searchable after being committed. However, it has the disadvantage that it slows DML down, and can cause contention between different processes all trying to perform DML at the same time.

We therefore recommend interval syncs for most customers (this is likely to become the default for search indexes in a future release). That way the index is synchronized by a regular background task. To ensure that updates are synchronized as soon as possible, it is common to use a one second interval between syncs.  We can modify the previous index to use one-second syncs using a parameters clause as follows:

drop index srchidx

create search index srchidx on person_collection_t (jdoc) for json

parameters ('sync (every "freq=secondly; interval=1")')

Synchronization jobs are launched by DBMS_SCHEDULER. So in order to create an interval-sync index, our user must have CREATE JOB privilege (this requirement goes away in 21c). The interval specification after "every" uses scheduler syntax, and any legal scheduler interval specification may be used.

Index Optimization

Over time, search indexes can become fragmented, and can contain garbage data from deleted documents. There are various methods to manually optimize indexes, but the easiest way is to add a further 'optimize' clause to the parameters. This clause can take EVERY with a time interval, like sync, or you can use the special term "auto_daily":

create search index srchidx on person_collection_t (jdoc) for json

parameters ('sync     (every "freq=secondly; interval=1")

             optimize (auto_daily)')

With "auto_daily", optimization will be run once a day at midnight. If a time period is specified, optimize will be run for the top 10 most fragmented tokens at that time interval (an interval of one hour is suggested), and additionally a full optimize run is run at midnight on Saturday night.

Memory and Parallelism

Creating a search index on a large document collection is an intensive process, and can take some time. It's not unknown for very large indexes to take hours or even days to create. You can tune this index creation by providing more resources, by setting the index memory and degree of parallelism for index creation.  The default index memory is 50MB, and degree of parallelism is normally one.  We can modify those as follows:

create search index srchidx on person_collection_t (jdoc) for json

parameters ('sync     (every "freq=secondly; interval=1")

             optimize (auto_daily)

             memory 500M')

parallel 4

Note that the memory setting is inside the parameters clause (since it's specific to search indexes) but the parallel setting is part of the standard SQL syntax for creating indexes, so is outside the parameters clause.

Before we go any further, we need to understand that the index memory is allocated to each parallel thread, so in the example above we could use up to 2GB of memory (500MB for each of four threads).

Index memory is allocated from the PGA (Program Global Area). If we try to allocate more PGA memory than is available, the index creation will fail. It's therefore a good idea to check the maximum PGA memory available, and keep well below it. We can get that, in GB, by running the following as a user with suitable privilege:

select value/1073741820 from gv$parameter where name = 'pga_aggregate_limit'

It also makes sense not to try to run more parallel threads than we have CPUs available. We can check that with

select value from gv$parameter where name = 'cpu_count';

Path Subsetting using a User Datastore

By default, a JSON search index indexes all parts of a JSON document. But what if we only want part of the document indexed?  We can do this using a powerful feature known as the 'User Datastore'. A user datastore allows us to specify a PL/SQL procedure which provides the actual content to index. The user datastore procedure is called for each document that is indexed. What the user datastore provides back for indexing is entirely up to the code of that procedure. It is given the rowid of the document, but nothing else, and can assemble the document from multiple columns of the indexed table, from a set of other tables, or from anywhere else it wants.

By calling JSON_TRANSFORM from within the user datastore procedure, we can fetch back only parts of the JSON document in question.

Let's assume we've created a populated a table thus:

create table jdocs(jtext varchar2(2000) constraint jtextisjson check (jtext is json));

insert into jdocs values (

 '{ "meta":

       { "key":12345, "createdby": "john" },

    "info":

       { "title": "A Tale of Two Cities", "author": "Charles Dickens" }

  }'

);

Maybe we don't want to index the meta part, only the info.  As a quick test we can confirm that JSON transform can give us that:

select json_transform(j.jtext, KEEP '$.info' returning clob) from jdocs j;

Great, that works. So let's incorporate that into a procedure using the standard interface for a user datastore procedure:

create or replace procedure get_info_from_json

   (rid     in            rowid,

    outclob in out nocopy clob

   ) is

begin

   select json_transform(j.jtext, KEEP '$.info' returning clob) into outclob

   from jdocs j where rowid = rid;

end;

Our procedure is fetching the subset of the JSON into outclob, which as it's defined as an IN OUT variable will be returned to the caller for indexing.

Now we need to use the preference/attribute system for JSON search and text indexes to create a user datastore preference using our procedure:

exec ctx_ddl.create_preference('my_datastore', 'USER_DATASTORE')

exec ctx_ddl.set_attribute    ('my_datastore', 'PROCEDURE', 'get_info_from_json')

And finally we need to create our index, using that user datastore preference:

create search index jdocsidx on jdocs(jtext) for json parameters ('datastore my_datastore');

Now if we search for Dickens using json_textcontains we will find it:

select * from jdocs where json_textcontains(jtext, '$.*', 'Dickens');

But if we search for John, we will not, since that is in the now-hidden "meta" section of the document:

select * from jdocs where json_textcontains(jtext, '$.*', 'John');

There is much more you can do with JSON using the full capabilities of Oracle Text. Feel free to experiment, and do let us know if you find some cool new capabilities!

Source: oracle.com

Related Posts

0 comments:

Post a Comment