Moving to proper UTF-8 in MySQL for bugzilla on CentOS 5
I have an old bugzilla instance that has been live for some years, with lots of text in it containing the Swedish non-ascii characters å, ä and ö. When I set it up I didn't think about what character encoding I used for the data; I just added data and it worked. A few days back it was time to migrate the instance to a new bugzilla version, on a CentOS 5 box. It seemed like a good idea to convert the database to properly UTF-8 encoded data while I was in the process of moving it. It turned out to be more difficult than I anticipated. Here is a short list of discoveries:
- The text was encoded in UTF-8 in the old database, but mysql thought that it was what it calls latin1. What I had entered as å the database perceived as Ã¥, but the transformation was applied on both write to and read from the database, so the characters turned out to be correct when displayed in bugzilla again.
- The default behavior of mysqldump is to convert data it knows to be latin1 into UTF-8 in the output file. Since my data was really UTF-8, but mysql was under the impression that it was latin1, mysqldump encoded the UTF-8 into UTF-8 once more.
- To make matters even more complicated, what mysql calls 'latin1' is not actually ISO-8859-1 but rather a slightly modified variant of the Windows-1252 character encoding. A result of this is that, in some instances, the double application of the UTF-8 transformation turns a single input character into 5 output characters.
- The solution to this mess is a curiously named option to mysqldump named --default-character-set. It can be used to override the default behavior of encoding strings marked as latin1 into UTF-8. mysqldump --default-character-set latin1 outputs my UTF-8 correctly. Once the database is in a file, just search and replace default charset=latin1 with default charset=utf8 and import the data.
- At this point, the data that was UTF-8 all along is now correctly understood by mysql to be UTF-8.
- Next problem: when starting up bugzilla with UTF-8 settings, the characters still get mangled.
- It turns out that the bridge between mysql and perl in CentOS 5, the perl-DBD-MySQL package, is too old to support the mysql_enable_utf8 connection parameter. As a result, strings coming out of perl-DBD-MySQL that contain non-ascii characters are not marked as utf8 strings.
- So, why didn't checksetup.pl tell me this when I ran it? It turns out that there is a patch in the bugzilla shipped with EPEL to remove the check for the proper perl-DBD-MySQL version to make it runnable on CentOS 5. Perhaps a reasonable tradeoff, but a bit annoying when trying to find out what fails.
- So I compiled a recent perl-DBD-MySQL and put it in my playground repository and now my bugzilla displays all sorts of strange characters correctly.
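The double encoding described in the list above can be reproduced in a few lines of Python. This is only a sketch for illustration, not part of the actual migration: it assumes Python's cp1252 codec is a close enough stand-in for what mysql calls latin1 (the two differ on a handful of undefined bytes), and the function names are my own invention.

```python
import re


def misread_as_latin1(text: str) -> str:
    """Return what the database perceives when UTF-8 bytes are stored
    in a column it believes to be latin1 (approximated here by cp1252)."""
    return text.encode("utf-8").decode("cp1252")


def default_dump_bytes(text: str) -> bytes:
    """Bytes a default mysqldump would write for such a column:
    the misread text, encoded as UTF-8 a second time."""
    return misread_as_latin1(text).encode("utf-8")


def undo_double_encoding(mangled: str) -> str:
    """Reverse the misreading: turn the perceived characters back into
    the original bytes and decode those bytes as the UTF-8 they always were."""
    return mangled.encode("cp1252").decode("utf-8")


def retag_charset(dump_sql: str) -> str:
    """The search-and-replace step on the dump file: relabel table
    definitions from latin1 to utf8 without touching the data itself."""
    return re.sub(r"(?i)(default charset=)latin1", r"\g<1>utf8", dump_sql)
```

With these helpers, misread_as_latin1("å") gives Ã¥, the same thing the database perceived, and default_dump_bytes("Ö") is five bytes long because the second byte of Ö lands on one of the Windows-1252 characters that needs three bytes in UTF-8.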
CentOS prerelease security
The CentOS Linux distribution is in many respects the optimal choice for anyone that wants a stable system that is supported over a number of years. I run it on a handful of servers and the problems are few and far between.
However, one thing has been emerging as a bit of a problem lately: security updates from Red Hat have taken quite some time to get built and released for CentOS. This is especially true in the weeks following a new point release from Red Hat. Not having security updates available for known problems for weeks at a time makes users of CentOS less secure than they would otherwise be.
To help make this problem a bit less pronounced I have started to rebuild security updates from Red Hat and install them on the systems I administer. That's one of the beauties of open source: you can fix things. If anyone else is interested in those updated packages they can be found at http://rpm.resare.com/centos5-pre-security. If you're in the target demographic you'll know what to do.
So, I wanted encrypted access to multiple websites
Hosting multiple websites with encrypted access on a single server has traditionally required one IP address per website. However, that is no longer necessary now that modern web browsers support Server Name Indication (SNI), which lets multiple HTTPS websites share a single IP address. All that is needed is to enable support for this on your webserver.
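To see why this lets several sites share one address, it helps to look at what the client actually sends. The Python sketch below builds the first TLS handshake message in memory, without any network connection, and returns its raw bytes; the requested hostname travels in that first message, which is what allows the server to pick the matching certificate. The function name and the hostname bugs.example.org are made up for illustration.

```python
import ssl


def client_hello(hostname: str) -> bytes:
    """Return the raw bytes of the first handshake message a TLS client
    would send when asked to connect to `hostname`."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming = ssl.MemoryBIO()   # data from the server (stays empty here)
    outgoing = ssl.MemoryBIO()   # data the client wants to send
    tls = context.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # Expected: the client has written its hello and now waits
        # for a server reply that never arrives in this sketch.
        pass
    return outgoing.read()


hello = client_hello("bugs.example.org")
# The hostname is part of the hello message sent to the server.
print(b"bugs.example.org" in hello)
```

A server that understands SNI reads this name before choosing which certificate to present, so it no longer needs a separate IP address to tell the sites apart.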
On the Linux distribution I use on my servers, CentOS 4, that was a bit tricky. My first plan was to update the openssl package to a version that supports SNI, but that turned out to be seriously difficult: the library changed major version between the version shipped in CentOS 4 and the version that includes SNI support, so upgrading would mean recompiling many parts of the core system.
However, I found that there is an alternative apache module to the mod_ssl shipping in CentOS called mod_gnutls. It provides the same basic functionality but does so without using the openssl library. So, I pulled the latest stable version of mod_gnutls and made an RPM package of it. It depended on newer versions of a few packages that I could pull from Fedora rawhide and recompile for CentOS 4. If you want to use the packages I built, they are available from a special yum repository. Adding this repository and installing mod_gnutls will upgrade the system provided libgcrypt and gnutls packages to newer versions.