Moving to proper UTF-8 in MySQL for bugzilla on CentOS 5

October 2nd, 2009

I have an old bugzilla instance that has been live for some years, with lots of text in it containing the Swedish non-ASCII characters å, ä and ö. When I set it up I didn't think about what character encoding I used for the data; I just added data and it worked. A few days back it was time to migrate the instance to a new bugzilla version, on a CentOS 5 box. It seemed like a good idea to convert the database contents to properly UTF-8 encoded data while I was moving it anyway. It turned out to be more difficult than I anticipated. Here is a short list of discoveries:

  1. The text was encoded in UTF-8 in the old database, but mysql thought that it was what it calls latin1. What I had entered as å the database perceived as Ã¥, but the transformation was applied on both write to and read from the database, so the characters turned out to be correct when displayed in bugzilla again.
  2. The default behavior of mysqldump is to convert data it believes to be latin1 into UTF-8 in the output file. Since my data was really UTF-8, but mysql was under the impression that it was latin1, it encoded the UTF-8 into UTF-8 once more.
  3. To make matters even more complicated, what mysql calls 'latin1' is not actually ISO-8859-1 but rather a slightly modified variant of the Windows-1252 character encoding. A result of this is that, after the double application of the UTF-8 transformation, a single input character can in some instances end up as 5 bytes in the output.
  4. The solution to this mess is a curiously named mysqldump option, --default-character-set. It can be used to override the default behavior of encoding strings marked as latin1 into UTF-8. Running mysqldump --default-character-set=latin1 outputs my UTF-8 unchanged. Once the database is in a file, just search and replace default charset=latin1 with default charset=utf8 and import the data.
  5. At this point, the data that was UTF-8 all along is now correctly understood by mysql to be UTF-8.
  6. Next problem: when starting up bugzilla with UTF-8 settings, the characters still get mangled.
  7. It turns out that the bridge between mysql and perl in CentOS 5, the perl-DBD-MySQL package, is too old to support the mysql_enable_utf8 connection parameter. As a result, strings coming out of perl-DBD-MySQL that contain non-ASCII characters are not marked as utf8 strings.
  8. So, why didn't checksetup.pl tell me this when I ran it? It turns out that there is a patch in the bugzilla shipped with EPEL to remove the check for the proper perl-DBD-MySQL version to make it runnable on CentOS 5. Perhaps a reasonable tradeoff, but a bit annoying when trying to find out what fails.
  9. So I compiled a recent perl-DBD-MySQL and put it in my playground repository and now my bugzilla displays all sorts of strange characters correctly.
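
To make points 1 to 4 a bit more concrete, here is a small Python sketch of what I believe happens to the bytes. It is only an illustration under some assumptions: it uses Python's cp1252 codec as a stand-in for what mysql calls latin1 (the two differ for a handful of byte values), and the function names double_encode, repair and retag_charset are made up for this example. The last function mimics the search and replace step on a dump file, assuming the dump contains the text DEFAULT CHARSET=latin1 in its CREATE TABLE statements.

```python
import re


def double_encode(text: str) -> bytes:
    """Simulate the default dump: UTF-8 bytes misread as cp1252, then encoded as UTF-8 again."""
    raw = text.encode("utf-8")        # the bytes actually stored in the table
    misread = raw.decode("cp1252")    # what the server believes the text to be
    return misread.encode("utf-8")    # what ends up in the dump file


def repair(mangled: bytes) -> str:
    """Undo one round of double encoding (only works for bytes cp1252 can represent)."""
    return mangled.decode("utf-8").encode("cp1252").decode("utf-8")


def retag_charset(dump_sql: str) -> str:
    """Relabel tables in a dump taken with --default-character-set=latin1 as utf8."""
    return re.sub(r"(?i)default charset=latin1", "DEFAULT CHARSET=utf8", dump_sql)


if __name__ == "__main__":
    # å is two bytes in UTF-8 and four bytes after double encoding.
    print(double_encode("å"))
    # Å contains the byte 0x85, which cp1252 maps to the ellipsis character,
    # so it grows to five bytes.
    print(len(double_encode("Å")))
    print(repair(double_encode("åäöÅÄÖ")))
    print(retag_charset("CREATE TABLE t (a text) DEFAULT CHARSET=latin1;"))
```

Dumping with --default-character-set=latin1 avoids the need for anything like repair() in the first place; the function is only there to show that the mangling is reversible for these characters.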
