Oracle: Full text search with condition
Asked Answered
S

4

11

I've created an Oracle Text index like the following:

create index my_idx on my_table (text) indextype is ctxsys.context; 

And I can then do the following:

select * from my_table where contains(text, '%blah%') > 0;

But lets say we have a have another column in this table, say group_id, and I wanted to do the following query instead:

select * from my_table where contains(text, '%blah%') > 0 and group_id = 43;

With the above index, Oracle will have to search for all items that contain 'blah', and then check all of their group_ids.

Ideally, I'd prefer to only search the items with group_id = 43, so I'd want an index like this:

create index my_idx on my_table (group_id, text) indextype is ctxsys.context; 

Kind of like a normal index, so a separate text search can be done for each group_id.

Is there a way to do something like this in Oracle (I'm using 10g if that is important)?

Edit (clarification)

Consider a table with one million rows and the following two columns among others, A and B, both numeric. Lets say there are 500 different values of A and 2000 different values of B, and each row is unique.

Now lets consider select ... where A = x and B = y

An index on A and B separately as far as I can tell do an index search on B, which will return 500 different rows, and then do a join/scan on these rows. In any case, at least 500 rows have to be looked at (aside from the database being lucky and finding the required row early.

Whereas an index on (A,B) is much more effective, it finds the one row in one index search.

Putting separate indexes on group_id and the text I feel only leaves the query generator with two options.

(1) Use the group_id index, and scan all the resulting rows for the text.
(2) Use the text index, and scan all the resulting rows for the group_id.
(3) Use both indexes, and do a join.

Whereas I want:

(4) Use the (group_id, "text") index to find the text index under the particular group_id and scan that text index for the particular row/rows I need. No scanning and checking or joining required, much like when using an index on (A,B).

Signorelli answered 9/9, 2011 at 6:54 Comment(3)
I don't think you understand what contains(text, ...) actually does. contains() is not the kind of function you use to filter results based on occurrences of a certain word. It actually calculates a sort of relevance score any given text has for the column you are using it on.Adkinson
Say you have a row that has text = 'hello world'. When you do a where contains(text, 'hello') > 0 this row might or not might be included. Are you sure this is what you actually want?Adkinson
@NullUserException: Could you explain in an answer what contains(...) (and catsearch(...)) actually does, and if any of them does a full text search? (i.e. what you typical get if you use "find" in a text editor).Signorelli
A
10

Oracle Text

1 - You can improve performance by creating the CONTEXT index with FILTER BY:

create index my_idx on my_table(text) indextype is ctxsys.context filter by group_id;

In my tests the filter by definitely improved the performance, but it was still slightly faster to just use a btree index on group_id.

2 - CTXCAT indexes use "sub-indexes", and seem to work similar to a multi-column index. This seems to be the option (4) you're looking for:

begin
  ctx_ddl.create_index_set('my_table_index_set');
  ctx_ddl.add_index('my_table_index_set', 'group_id');
end;
/

create index my_idx2 on my_table(text) indextype is ctxsys.ctxcat
    parameters('index set my_table_index_set');

select * from my_table where catsearch(text, 'blah', 'group_id = 43') > 0

This is likely the fastest approach. Using the above query against 120MB of random text similar to your A and B scenario required only 18 consistent gets. But on the downside, creating the CTXCAT index took almost 11 minutes and used 1.8GB of space.

(Note: Oracle Text seems to work correctly here, but I'm not familiar with Text and I can't gaurentee this isn't an inappropriate use of these indexes like @NullUserException said.)

Multi-column indexes vs. index joins

For the situation you describe in your edit, normally there would not be a significant difference between using an index on (A,B) and joining separate indexes on A and B. I built some tests with data similar to what you described and an index join required only 7 consistent gets versus 2 consistent gets for the multi-column index.

The reason for this is because Oracle retrieves data in blocks. A block is usually 8K, and an index block is already sorted, so you can probably fit the 500 to 2000 values in a few blocks. If you're worried about performance, usually the IO to read and write blocks is the only thing that matters. Whether or not Oracle has to join together a few thousand rows is an inconsequential amount of CPU time.

However, this doesn't apply to Oracle Text indexes. You can join a CONTEXT index with a btree index (a "bitmap and"?), but the performance is poor.

Aspersion answered 16/9, 2011 at 4:35 Comment(0)
S
1

I'd put an index on group_id and see if that's good enough. You don't say how many rows we're talking about or what performance you need.

Remember, the order in which the predicates are handled is not necessarily the order in which you wrote them in the query. Don't try to outsmart the optimizer unless you have a real reason to.

Slide answered 10/9, 2011 at 14:53 Comment(1)
really. The optimizer should be able to estimate what index will be the most appropriate to use between the CONTEXT index and the B-TREE index. Statistics will help choose a good execution plan.Inigo
A
1

Short version: There's no need to do that. The query optimizer is smart enough to decide what's the best way to select your data. Just create a btree index on group_id, ie:

CREATE INDEX my_group_idx ON my_table (group_id); 

Long version: I created a script (testperf.sql) that inserts 136 rows of dummy data.

DESC my_table;

Name     Null     Type      
-------- -------- --------- 
ID       NOT NULL NUMBER(4) 
GROUP_ID          NUMBER(4) 
TEXT              CLOB      

There is a btree index on group_id. To ensure the index will actually be used, run this as a dba user:

EXEC DBMS_STATS.GATHER_TABLE_STATS('<YOUR USER HERE>', 'MY_TABLE', cascade=>TRUE);

Here's how many rows each group_id has and the corresponding percentage:

GROUP_ID               COUNT                  PCT                    
---------------------- ---------------------- ---------------------- 
1                      1                      1                      
2                      2                      1                      
3                      4                      3                      
4                      8                      6                      
5                      16                     12                     
6                      32                     24                     
7                      64                     47                     
8                      9                      7         

Note that the query optimizer will use an index only if it thinks it's a good idea - that is, you are retrieving up to a certain percentage of rows. So, if you ask it for a query plan on:

SELECT * FROM my_table WHERE group_id = 1;
SELECT * FROM my_table WHERE group_id = 7;

You will see that for the first query, it will use the index, whereas for the second query, it will perform a full table scan, since there are too many rows for the index to be effective when group_id = 7.

Now, consider a different condition - WHERE group_id = Y AND text LIKE '%blah%' (since I am not very familiar with ctxsys.context).

SELECT * FROM my_table WHERE group_id = 1 AND text LIKE '%ipsum%';

Looking at the query plan, you will see that it will use the index on group_id. Note that the order of your conditions is not important:

SELECT * FROM my_table WHERE text LIKE '%ipsum%' AND group_id = 1;

Generates the same query plan. And if you try to run the same query on group_id = 7, you will see that it goes back to the full table scan:

SELECT * FROM my_table WHERE group_id = 7 AND text LIKE '%ipsum%';

Note that stats are gathered automatically by Oracle every day (it's scheduled to run every night and on weekends), to continually improve the effectiveness of the query optimizer. In short, Oracle does its best to optimize the optimizer, so you don't have to.

Adkinson answered 11/9, 2011 at 16:49 Comment(0)
B
0

I do not have an Oracle instance at hand to test, and have not used the full-text indexing in Oracle, but I have generally had good performance with inline views, which might be an alternative to the sort of index you had in mind. Is the following syntax legit when contains() is involved?

This inline view gets you the PK values of the rows in group 43:

             (
             select T.pkcol
             from T
             where group = 43
             )

If group has a normal index, and doesn't have low cardinality, fetching this set should be quick. Then you would inner join that set with T again:

           select * from T
           inner join
            (
             select T.pkcol
             from T
             where group = 43
             ) as MyGroup

           on T.pkcol = MyGroup.pkcol
           where contains(text, '%blah%') > 0

Hopefully the optimizer would be able to use the PK index to optimize the join and then appy the contains predicate only to the group 43 rows.

Bastogne answered 10/9, 2011 at 12:41 Comment(3)
Hi Tim, I don't see how it could use the text index on only the rows of group 43? Forgetting about oracle text for the moment, if I've got separate indexes on say columns A and B, if I want to do select ... where A = x and B = y I have to either (1) Use the A index and check all the B items or (2) Use the B index and check all the A items. For this query to be efficient you really need an index on (A,B) (or (B,A)). I don't see how Oracle Text makes that reality any different?Signorelli
@Clinton: If the join is performed before the where-condition is applied, then you're wrong; if the join is performed after the where-condition is applied, then you're right. It should be easy enough to find out. Just put an index on groupid.Bastogne
@Tim: Adding an inline view like this usually won't make any difference. Oracle may merge the inline view, or push the predicates into the inline view. This behavior does not follow the SQL standard, and leads to some interesting data conversion problems, but usually helps significantly with performance. As other have pointed out, if you write the query the simple way Oracle will likely rewrite it to run fast. And if you really need Oracle to perform things in a specific order, there are other tricks or hints you have to use (such as adding ROWNUM to the inline view).Aspersion

© 2022 - 2024 — McMap. All rights reserved.