Moving to proper UTF-8 in MySQL for bugzilla on CentOS 5
I have an old bugzilla instance that has been live for some years, with lots of text in it containing the Swedish non-ASCII characters å, ä and ö. When I set it up I didn't think about what character encoding I used for the data; I just added data and it worked. A few days back it was time to migrate the instance to a new bugzilla version, on a CentOS 5 box. It seemed like a good idea to convert the database to properly UTF-8 encoded data while I was moving it anyway. It turned out to be more difficult than I anticipated. Here is a short list of discoveries:
- The text was encoded in UTF-8 in the old database, but mysql thought that it was what it calls latin1. What I had entered as å the database perceived as Ã¥, but the transformation was applied on both write to and read from the database, so the characters turned out to be correct when displayed in bugzilla again.
- The default behavior of mysqldump is to convert data it believes to be latin1 into UTF-8 in the output file. Since my data was really UTF-8, but mysql was under the impression that it was latin1, it encoded the UTF-8 into UTF-8 once more.
- To make matters even more complicated, what mysql calls 'latin1' is not actually ISO-8859-1 but rather a slightly modified variant of the Windows-1252 character encoding. As a result, in some instances the double application of the UTF-8 transformation turns a single input character into 5 output bytes.
- The solution to this mess is a curiously named option to mysqldump named --default-character-set. It can be used to override the default behavior of encoding strings marked as latin1 into UTF-8. mysqldump --default-character-set latin1 outputs my UTF-8 correctly. Once the database is in a file, just search and replace default charset=latin1 with default charset=utf8 and import the data.
- At this point, the data that was UTF-8 all along is now correctly understood by mysql to be UTF-8.
- Next problem: when starting up bugzilla with UTF-8 settings, the characters still get mangled.
- It turns out that the bridge between mysql and perl in CentOS 5, the perl-DBD-MySQL package, is too old to support the mysql_enable_utf8 connection parameter. As a result, strings coming out of perl-DBD-MySQL that contain non-ASCII characters are not marked as utf8 strings.
- So, why didn't checksetup.pl tell me this when I ran it? It turns out that there is a patch in the bugzilla shipped with EPEL to remove the check for the proper perl-DBD-MySQL version to make it runnable on CentOS 5. Perhaps a reasonable tradeoff, but a bit annoying when trying to find out what fails.
- So I compiled a recent perl-DBD-MySQL and put it in my playground repository and now my bugzilla displays all sorts of strange characters correctly.
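The double encoding described in the list can be reproduced outside the database. What follows is a minimal Python sketch, not the procedure used in the migration: the function names are made up for illustration, and the two helper codecs only approximate mysql's 'latin1' by taking Python's cp1252 codec and passing through the five bytes that cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), which as far as I know is how mysql treats them.

```python
# Illustrative sketch only: all function names here are hypothetical.

def mysql_latin1_decode(raw: bytes) -> str:
    """Decode bytes the way mysql's 'latin1' is assumed to: cp1252,
    with the bytes cp1252 leaves undefined mapped to the same code point."""
    out = []
    for b in raw:
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            out.append(chr(b))
    return "".join(out)


def mysql_latin1_encode(text: str) -> bytes:
    """Inverse of mysql_latin1_decode. Raises ValueError for characters
    that have no single-byte representation at all."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            out.append(ord(ch))
    return bytes(out)


def double_encode(text: str) -> bytes:
    """Simulate UTF-8 text stored in a column labelled latin1 and then
    dumped with the default conversion to UTF-8."""
    stored = text.encode("utf-8")          # the bytes really in the table
    misread = mysql_latin1_decode(stored)  # what mysql believes they mean
    return misread.encode("utf-8")         # what ends up in the dump file


def repair(dumped: bytes) -> str:
    """Undo one round of double encoding."""
    return mysql_latin1_encode(dumped.decode("utf-8")).decode("utf-8")


if __name__ == "__main__":
    for ch in "åäöÅÄÖ":
        mangled = double_encode(ch)
        print(ch, len(mangled), mangled.decode("utf-8"), repair(mangled))
```

Running it shows å turning into Ã¥, as described in the first bullet. The Ö case is the interesting one: its second UTF-8 byte, 0x96, is a dash (U+2013) in Windows-1252, and that dash needs three bytes in UTF-8, so the single character grows to five bytes in the dump.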