PostgreSQL: TO_ASCII & UTF8

While fixing our code for an upcoming database upgrade in one of our $-projects, I ran into some strange behaviour. The initial situation:

  • we're moving from PostgreSQL 8.1.3 to 8.3.5
  • we're moving from database encoding LATIN1 to UTF8
  • in our code we're using the TO_ASCII function a few times.

And this combination produces some headaches.

But first a gentle introduction to the TO_ASCII function. It converts any given text into its ASCII representation. Folks bound to languages with German umlauts or other accented characters encounter many problems. For example: what should you do if you have to build some kind of index based on the first character of the last name? Certainly you don't want an extra entry for 'Ü'; instead you want to put those names into the 'U' list. Grand entrance, TO_ASCII:

  SELECT to_ascii('Übermeier');
  Ubermeier
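The first-letter index mentioned above could be built roughly like this. It is only a sketch: the table `customers` and its column `lastname` are made-up names for illustration, and the query assumes a database in one of the encodings TO_ASCII supports (LATIN1 here).

```sql
-- Hypothetical table: customers(lastname text), in a LATIN1 database.
-- Group names by their accent-free first letter, so 'Übermeier' lands under 'U'.
SELECT upper(substr(to_ascii(lastname), 1, 1)) AS initial,
       count(*) AS entries
FROM customers
GROUP BY upper(substr(to_ascii(lastname), 1, 1))
ORDER BY initial;
```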

Works like a charm. Caveat: TO_ASCII only supports the LATIN1, LATIN2, LATIN9 and WIN1250 encodings, but not UTF8.
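To see whether a database is affected, you can check its encoding before calling TO_ASCII. A small sketch; the last statement is expected to fail with an error in a UTF8 database rather than return a result:

```sql
-- Encoding of the database you are connected to.
SHOW server_encoding;

-- Encoding of every database in the cluster.
SELECT datname, pg_encoding_to_char(encoding) AS encoding
FROM pg_database;

-- In a UTF8 database the one-argument form is rejected,
-- because TO_ASCII has no mapping for that encoding.
SELECT to_ascii('Übermeier');
```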

Okay, the first guess would be to convert the text to LATIN1 before handing it to TO_ASCII:

  SELECT to_ascii(convert_to('Übermeier', 'latin1'));
  ERROR:  function to_ascii(bytea) does not exist

Bummer. CONVERT_TO returns BYTEA, but TO_ASCII only accepts TEXT.

There has been some discussion about this on the pgsql.hackers mailing list, and frankly I can follow both parties' points of view. But thanks to Pavel Stehule we have a hack to sidestep the issue:

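  -- Add an overload that binds the built-in routine to_ascii_encname to a BYTEA argument.
  -- Creating LANGUAGE internal functions requires superuser rights.
  CREATE FUNCTION to_ascii(bytea, name)
  RETURNS text STRICT AS 'to_ascii_encname' LANGUAGE internal;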

This version gladly accepts the BYTEA data returned by CONVERT_TO, so we can use it like this:

  SELECT to_ascii(convert_to('Übermeier', 'latin1'), 'latin1');
  Ubermeier
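If the call shows up in several places, it may be convenient to hide the double conversion behind a small helper. This is just a sketch: the name `to_ascii_utf8` is invented for illustration, and it relies on the `to_ascii(bytea, name)` overload created above.

```sql
-- Hypothetical helper; requires the to_ascii(bytea, name) overload from above.
CREATE FUNCTION to_ascii_utf8(text)
RETURNS text AS $$
  -- Convert the UTF8 text to LATIN1 bytes, then strip the accents.
  SELECT to_ascii(convert_to($1, 'latin1'), 'latin1');
$$ LANGUAGE sql STRICT;

SELECT to_ascii_utf8('Übermeier');
```

Keep in mind that CONVERT_TO raises an error for characters that have no LATIN1 equivalent, so this only helps with text that fits into LATIN1 in the first place.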

Problem solved.

Edit: Added fix by eMerzh. Thanks!