Yesterday I attended a technical presentation on Character Sets. As I told the presenter afterwards, to me attending this meeting was like reading the Kitab-i-Iqan (book of certitude) and then looking at religious history – it was the key that suddenly made everything else clear. I had come across the terms Character Set, Character Encoding, collation, UTF8, ASCII, ANSI, etc many times and had some idea what they referred to, but was really confused about how they all relate to each other. Now, it was all much more clear. We’ll have a second part of the presentation later, and I may post a summary of the concepts after that.
But already puting theory to practice – this morning I had a request to remove diacritics from a listing of countries, so they sort properly. I thought “Aha!” This is a collation issue. Collation refers to setting items in order, and in the case of databases the collation sequence is the set of rules by which characters are ordered – for example a, b, c, ä or a, ä, b, c. So instead of simply removing the diacritics I googled “oracle collation” and ended up finding a better solution:
Oracle uses a setting called NLS_LANG to determine a number of things – language, territory and character set. This is an important setting, as any Oracle DBA would tell you. The default collation sequence is derived from the language portion of that setting, as there is a corresponding NLS_SORT setting for each NLS_LANGUAGE, but then there are also more custom values that you can use. Once you set a value for NLS_SORT, that collating sequence will then be used for sorting when running a query with the ORDER BY clause.
First let’s check what our current collation sequence is set to:
SELECT SYS_CONTEXT ('USERENV', 'NLS_SORT') FROM DUAL;
Then, get a list of the possible values you can set NLS_SORT to:
SELECT * FROM v$nls_valid_values WHERE parameter = 'SORT';
Choose the collation sequence you want to use from this list. In my case, I wasn’t quite sure which one I wanted, so I searched some more and found one called BINARY_AI which did what I was looking for. Turns out each value in the above table of possible values can have a hybrid definition, by adding a suffix of _CI (case insensitivity) or _AI (accent-insensitive and case-insensitive).
Finally, to set the collation sequence for the current session:
ALTER SESSION SET nls_sort='BINARY_AI';
My query for a list of countries originally gave these results (NLS_SORT was set to BINARY):
Romania Russian Federation Rwanda Réunion
After the NLS_SORT was set to BINARY_AI:
Réunion Romania Russian Federation Rwanda