PostgreSQL: TO_ASCII & UTF8

In the process of fixing our code for an upcoming upgrade of one database version for one of our $-projects I encountered a strange behaviour. Initiual situation:

  • we're moving from PostgreSQL 8.1.3 to the 8.3.5
  • we're moving from database encoding LATIN1 to UTF8
  • in our code we're using the TO_ASCII function a few times.

And this combination produces some headaches.

But first a gentle introduction to the TO_ASCII function. It converts any given text into it's ASCII representation. Folks bound to languages with german umlauts or some kind of apostrophes encounter many problems. For example: what should you do if you have to build some kind of index based on the first character of the lastname. Certainly you don't want to have an extra entry with 'Ü', instead you want to but them into the 'U' list. Grand entrance TO_ASCII:

  1. SELECT to_ascii('Ãœbermeier');
  2. Ubermeie

Works like a charm. Caveat: TO_ASCII only supports LATIN1, LATIN2, LATIN9 and WIN1250 encodings but no UTF8.

Okay, the first guess would be to do something like this:

  1. SELECT to_ascii(convert_to('Ãœbermeier', 'latin1'));
  2. ERROR:  FUNCTION to_ascii(bytea) does NOT exist

Bummer. CONVERT_TO returnes BYTEA, TO_ASCII only wants TEXT.

There has been some discussion going on on the pgsql.hackers mailinglist and frankly I can follow both parties in their point of view. But thanks to Pavel Stehule we have some kind of a hack to sidestep this issue:

  1. CREATE FUNCTION to_ascii(bytea, name)
  2. RETURNS text STRICT AS 'to_ascii_encname' LANGUAGE internal;

This version gladly accepts the BYTEA data returned by CONVERT_TO so we can just use it in this way:

  1. SELECT to_ascii(convert_to('Ãœbermeier', 'latin1'), 'latin1');
  2. Ubermeie

Problem solved.

Edit: Added fix by eMerzh. Thanks!

12 Kommentare bisher »

  1. P sagt

    am 17. März 2009 @ 15:26

    Problem solved too.
    You saved my life.

    Thx for share!

  2. eMerzh sagt

    am 18. März 2009 @ 11:06

    Thanks for your function …

    But…

    if you do select public.to_ascii(null::bytea, ‚latin1′);

    Postmaster will crash!

    because of the null value!

    Please mark your function to strict :

    CREATE FUNCTION to_ascii(bytea, name)
    RETURNS STRICT text AS ‚to_ascii_encname‘ LANGUAGE internal;

  3. eMerzh sagt

    am 18. März 2009 @ 11:06

    oups
    CREATE FUNCTION to_ascii(bytea, name)
    RETURNS text STRICT AS ‚to_ascii_encname‘ LANGUAGE internal;

  4. poncho sagt

    am 26. Mai 2009 @ 19:49

    There are characters that cannot be converted from utf to latin.
    If you have only one record with such character, every query using convert_to raises an error.
    Have someone found out how to work around this?

  5. loob sagt

    am 14. Juli 2009 @ 9:00

    poncho: I just change the encoding from latin1 to latin2 and it works now! 😉

  6. Santiago Sexo sagt

    am 4. August 2009 @ 6:55

    any news coming ?

  7. emi sagt

    am 16. November 2010 @ 12:50

    I have the same problem that @poncho. I have tried with LATIN1, LATIN2, LATIN9 and WIN1250, but all of them fail.

    This is a 7000 rows table, so looking „manually“ for a bad character is not a way.

    Any help on this?

    Thanks in advance.

  8. Rodrigo Almeida sagt

    am 1. Februar 2011 @ 19:01

    Problem solved! One point for you! :) Thank you.

  9. Antonio Garcia sagt

    am 2. August 2011 @ 11:03

    Thank for this solution.

  10. Johng849 sagt

    am 7. Mai 2014 @ 5:34

    Excellent post. I was checking continuously this blog and I’m impressed! Very useful info particularly the last part affdakekkedc

  11. How can I do a accent insensitive search in Postgres 8.3.x with a DB in utf-8? | Yourshaw Answers sagt

    am 8. Dezember 2014 @ 17:36

    […] Look here. […]

  12. Túlio sagt

    am 16. Dezember 2014 @ 20:09

    This is great. You are a hero. Thanks for sharing!

Komentar RSS · TrackBack URI

Hinterlasse einen Kommentar

Name: (erforderlich)

eMail: (erforderlich)

Website:

Kommentar: