Notes on Oracle: April 2009

For the client where I'm currently working, the need arose that we needed to find lastnames regardless of diacritical characters. In Dutch, we have lastnames (and firstname too) where characters like ü, ä, ö, é occur.

The "usual" way to handle this was to add a column to the table, and store the name there without the diacritical characters. Meaning the name "müller" would be in one column, while another column would contain the name "muller", the same letters but without the double dots (called a trema in Dutch) over the "u". A common complaint with this approach, is that when you query with the double dots, you won't be able to find what you are looking for.
There is also the possibility of using Oracle Text to handle this. In the past I have been to a presentation on Oracle Text, but never used it before. It scared me in a certain way. I thought it would be quite complicated to use it, but it turned out to be real easy.

Let's start with the table


create table text_test
(name varchar2(255)
);

insert into text_test values ('muller');
insert into text_test values ('müller');
insert into text_test values ('MULLER');
insert into text_test values ('MÜLLER');
insert into text_test values ('mueller');
insert into text_test values ('MUELLER');
--
insert into text_test values ('möller');
insert into text_test values ('moller');
--
insert into text_test values ('mäller');
insert into text_test values ('maller');
--
insert into text_test values ('Médar');

commit;

In this table are some variations of "Muller", upper- and lowercase and with and without the diacritical character.
In order to use Oracle Text to search for the name regardless of diacritic, we need to create a Custom Lexer. This custom lexer is needed, because we need to change the Base Letter setting. From the documentation:

With base-letter conversions enabled, letters with umlauts, acute accents, cedillas, and the like are converted to their basic forms for indexing, so fiancé is indexed both as fiancé and as fiance, and a query of fiancé returns documents containing either form.

And this is exactly what we're after.

To be able to search using the Base Letter conversion, as described in the quote from the ducmentation, we need to create a Preference.
To change the setting of the Base Letter, the default is NO, we to set this attribute to YES.


begin
   ctxsys.ctx_ddl.create_preference ('cust_lexer','BASIC_LEXER');
   ctxsys.ctx_ddl.set_attribute ('cust_lexer','base_letter','YES'); -- removes diacritics
end;
/

In the above code, we create a preference, called CUST_LEXER. Because the main language will be Dutch, I use the BASIC_LEXER. More information regarding this in the Oracle documentation.
The attribute that I want to override is the BASE_LETTER, so set that attribute to YES. And that should take care of it.
Now the only thing to do is create the Oracle Text index and specify that we want to use our preference set.


CREATE INDEX text_test_idx ON text_test(name) INDEXTYPE IS CTXSYS.CONTEXT
parameters ('LEXER cust_lexer');

Now that we are all set, let's run a query and verify the results.


SQL> select name
  2    from text_test
  3   where contains (name, 'muller') > 0
  4  ;

NAME
-----------------------------------------
müller
MULLER
muller
MÜLLER

And there you have it, it works as advertised.

While writing this blog, a colleague of mine pointed out that in Germany it is accepted to spell the name "müller" as "mueller". Oracle Text can even handle that.
Add this to your preference set, and you're good to go.


ctxsys.ctx_ddl.set_attribute ('cust_lexer','alternate_spelling','GERMAN');

With a little demo:


SQL> select name
  2    from text_test
  3   where contains (name, 'muller') > 0
  4  ;

NAME
-----------------------------------------
muller
müller
MULLER
MÜLLER
mueller
MUELLER

This also works when searching for "Mueller"


SQL> select name
  2    from text_test
  3   where contains (name, 'mueller') > 0
  4  ;

NAME
-----------------------------------------
muller
müller
MULLER
MÜLLER
mueller
MUELLER

The problem with Oracle Text used to be the synchronization of the Text indexes with DML actions. In the old days, you needed to take care of this yourself. Schedule a "make sure the Text index is updated to be in sync with the table". This could mean that after you added a name like "Désiré" to your table, you wouldn't be able to find because it wasn't in the text index.
Starting with Oracle 10g (release 2) you can indicate that the index needs to be synchronized during a COMMIT. Just specify it with the creation of the index, and that's it.


CREATE INDEX text_test_idx ON text_test(name) INDEXTYPE IS CTXSYS.CONTEXT
parameters ('LEXER cust_lexer SYNC (ON COMMIT)');

This Oracle Text is really cool stuff, and this is just the beginning. The Oracle Documentation is a great source of information. A link is included at the bottom.

Finally the cleanup:


drop table text_test cascade constraints
/
begin
   ctxsys.ctx_ddl.drop_preference ('cust_lexer');
end;
/

Oracle Text Documentation

Notes on Oracle

Pages

21 April 2009

Copy and Paste: Clipboard Items

06 April 2009

Oracle Text: diacritic search

Blog Archive

Non Oracle Links