koha: EncodingScratchPad Some notes on encoding and charsets

http://wiki.koha.org/doku.php?id=encodingscratchpad

Introduction

Versions of Koha prior to 2.2.6 did not handle character sets carefully. This document attempts to raise awareness of character set issues so that Koha developers and administrators can understand how best to proceed with development, setup, and configuration of Koha systems.

MARC Records

MARC21 records can ‘legally’ have only two encodings: MARC-8 or UTF-8. The encoding is set in position 9 of the leader (LEADER / 09). MARC-8 is not recognized by modern web browsers, and since Koha is a web-based system, if you are using MARC21 records the encoding MUST be UTF-8. This means that records should be pre-processed before they enter your Koha system (by whatever route). Some of this is handled internally within Koha, but don’t leave it to chance: if you’re migrating MARC21 data into Koha, expect to spend a significant amount of time properly pre-processing and storing your data.
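
As a rough illustration of the LEADER / 09 check described above, the following Python sketch reads the encoding flag from a leader string. It is not Koha or MARC::* code; the function name declared_encoding and the two sample leaders are made up for this example. It assumes the MARC21 convention that a blank in position 9 means MARC-8 and "a" means UCS/Unicode (UTF-8).

```python
# Sketch: report the character coding scheme declared in a MARC21 leader.
# Assumption: LEADER / 09 is ' ' (blank) for MARC-8 and 'a' for UCS/Unicode (UTF-8).

LEADER_LENGTH = 24  # a MARC21 leader is always 24 characters long


def declared_encoding(leader: str) -> str:
    """Return 'MARC-8', 'UTF-8', or 'unknown' based on LEADER / 09."""
    if len(leader) != LEADER_LENGTH:
        raise ValueError(
            f"expected a {LEADER_LENGTH}-character leader, got {len(leader)}"
        )
    flag = leader[9]
    if flag == " ":
        return "MARC-8"
    if flag == "a":
        return "UTF-8"
    return "unknown"


# Hypothetical leaders, made up for illustration only.
marc8_leader = "00714cam  2200205 a 4500"
utf8_leader = "00714cam a2200205 a 4500"

print(declared_encoding(marc8_leader))
print(declared_encoding(utf8_leader))
```

A check like this only reports what the record claims. A record whose flag does not match its actual bytes will still be mangled on conversion, so it is worth spot-checking the data itself as well.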

Conversion from MARC-8 to UTF-8 for MARC21 records is handled in Koha with the MARC::* suite of Perl modules. There are significant issues with properly configuring your system (with the proper SAX parsers, etc.), and questions have also been raised about whether this suite handles all character set / encoding issues correctly. For some details, please refer to the following posts:

http://www.nntp.perl.org/group/perl.perl4lib/2369

http://lists.nongnu.org/archive/html/koha-devel/2006-07/msg00000.html

One thing to remember is that MARC::* uses LEADER / 09 to determine the encoding of a given record. If it is not set correctly, you will very likely mangle any records you import or export.

System

Be sure to set your system locales up correctly to use UTF-8. You can test your locale settings by running:

$ locale

or

$ echo $LANG
en_US.UTF-8

If the value is not en_US.UTF-8 (or the UTF-8 variant for your language), reconfigure your locales. A bare en_US, for example, means the system is configured for iso-8859-1/latin1. On Debian, you can configure locales like this:

$ sudo dpkg-reconfigure locales

Then, you’ll need to quit your shell session and log back in again to check the default.
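
The same check can be scripted. The following is a minimal Python sketch, assuming the usual language_TERRITORY.charset@modifier shape of locale names; the helper names charset_of and is_utf8_locale are made up for this example and are not part of Koha.

```python
import os


def charset_of(locale_name: str) -> str:
    """Return the charset part of a locale name such as 'en_US.UTF-8', or ''."""
    # Drop an optional modifier such as '@euro', then take what follows the dot.
    base = locale_name.split("@", 1)[0]
    if "." not in base:
        return ""
    return base.split(".", 1)[1]


def is_utf8_locale(locale_name: str) -> bool:
    """True when the locale name declares UTF-8 (spelled 'UTF-8' or 'utf8')."""
    return charset_of(locale_name).lower().replace("-", "") == "utf8"


# Report on the LANG value of the current session.
lang = os.environ.get("LANG", "")
status = "OK" if is_utf8_locale(lang) else "not UTF-8, reconfigure your locales"
print(f"LANG={lang!r}: {status}")
```

Run it as the same user that runs Koha, since the locale can differ between accounts.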

NOTE: On some systems, the root user won't have its locale set properly. Use
a non-root user when working with Koha, and use the 'sudo' command if you need
elevated permissions.

Apache2

Be sure to have these lines in your httpd.conf (apache2.conf on Debian):

AddCharset UTF-8 .utf8
AddDefaultCharset UTF-8

MySQL 4.1
Server Configuration

MySQL version 4.1 is the absolute minimum if you want to handle encoding correctly.

Please refer to the MySQL Manual Chapter 10: http://dev.mysql.com/doc/refman/4.1/en/charset.html

You will probably have to edit your my.cnf to set some variables so that the server will use utf8 by default. Even standard packages like the one provided by Debian Sarge have the variables set to use latin1 by default. Make sure you have the following in your my.cnf:

init-connect = 'SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_general_ci

Connect to mysql using a non-root user and type:

show variables;

NOTE: The root user won't show the variables correctly, most likely because MySQL does
not run the init-connect statement for users with the SUPER privilege. Connect as the
kohaadmin user to check the values.

Check to make sure the following are set to utf8:

| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
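
Scanning that list by eye is error-prone, so here is a small Python sketch that flags any character_set_* or collation_* variable that is not utf8. It assumes you have pasted the pipe-delimited output of "show variables;" into a string; the function name non_utf8_settings and the sample output are made up for illustration.

```python
def non_utf8_settings(show_variables_output: str) -> dict:
    """Return the character_set_* / collation_* variables whose value is not utf8."""
    problems = {}
    for line in show_variables_output.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue  # skip borders, blank lines, and anything that is not a name/value row
        name, value = cells
        # character_sets_dir does not match either prefix, so it is ignored.
        is_charset_variable = name.startswith(("character_set_", "collation_"))
        if is_charset_variable and not value.startswith("utf8"):
            problems[name] = value
    return problems


# Sample output made up for illustration: the server is still at the latin1 default.
sample = """
| character_set_client     | utf8                       |
| character_set_server     | latin1                     |
| character_sets_dir       | /usr/share/mysql/charsets/ |
| collation_server         | latin1_swedish_ci          |
"""

print(non_utf8_settings(sample))
```

An empty result means every charset and collation variable in the pasted output is utf8.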

You must create your Koha database _after_ you set the character set defaults; otherwise the database could be created with the wrong defaults.

If you are moving from a MySQL 4.0 database to 4.1, you need to pay special attention to how you deal with your charsets. If you are storing UTF-8 data in MySQL 4.0 but your table types are set to latin1, you will need to convert the columns to BLOB or BINARY before changing the table type; otherwise MySQL will attempt a conversion and you will end up with double-encoded UTF-8:

http://dev.mysql.com/doc/refman/4.1/en/charset-conversion.html
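
To make "double-encoded" concrete, this short Python sketch reproduces the effect on a single character. It is only an illustration of the byte-level problem, not a migration tool, and it assumes Python's latin-1 codec as a stand-in for MySQL's latin1 charset (the two differ for some byte values, but not for the ones used here).

```python
original = "é"  # U+00E9, a single character

# Correct UTF-8 storage: two bytes.
utf8_bytes = original.encode("utf-8")

# What goes wrong: the UTF-8 bytes are read as if they were latin1 (two
# characters) and are then converted to UTF-8 a second time.
double_encoded = utf8_bytes.decode("latin-1").encode("utf-8")

print(utf8_bytes)      # b'\xc3\xa9'
print(double_encoded)  # b'\xc3\x83\xc2\xa9'

# Undoing the damage means reversing the extra conversion step.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == original)
```

Converting to BLOB or BINARY first avoids the problem because MySQL then treats the stored bytes as raw data and has nothing to convert.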

Also, if you are using MARC-8 encoded data in a latin1 type database, you probably need to do the same thing: convert to type blob, export your records from marc_subfield_table into a MARC file, process the file to change everything to UTF-8, change the table type in MySQL, and then re-import.

Database Backups

http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t.html

http://textsnippets.com/posts/show/84 (probably not the best way)

mysqldump

mysqlhotcopy

Perl

Here are some links to good references for Perl encoding issues:

http://www.ahinea.com/en/tech/perl-unicode-struggle.html
http://search.cpan.org/~jhi/perl-5.8.0/pod/perluniintro.pod

DBI Module

http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html

Movable Type uses the Perl modules DBI and DBD::mysql to
access the MySQL database. And guess what? They don’t have
any Unicode support. In fact, forget marking the UTF-8 flag
properly, according to this, DBD::mysql doesn’t even preserve
UTF-8 flag when it’s already there.

Wait for Unicode support for DBI/DBD::mysql which might be a
long time since nobody is sure if it should be provided by the
database-independent interface DBI or by the MySQL driver DBD::mysql
or both together in some way.

Use decode_utf8 on every output from the database. This is not very easy to do.
http://perldoc.perl.org/Encode.html#PERL-ENCODING-API

Use a patch which blesses all database data (yes that includes the binary
fields) as UTF-8 based on a flag you set when connecting to the database.
http://lists.mysql.com/perl/3563 (one patch)
http://dysphoria.net/2006/02/05/utf-8-a-go-go/ (another)
http://perl7.ru/lib/UTF8DBI.pm

Here’s one that seems to indicate that it’s best to grab DBI from CPAN:

http://www.codecomments.com/archive237-2006-4-786695.html

DBD::mysql will just pass everything through unaltered. So if you use
UTF-8 as connection charset, you have to encode('utf-8', ...) all queries
and parameters, unless you are sure that they are either plain ASCII or
already have the UTF-8 bit set. And you will get raw UTF-8 strings back,
which you have to decode() explicitly.

However, I notice that on Debian Sarge (on which I did my testing),
libdbd-mysql-perl depends on libmysqlclient12. So there may be a problem
with mixing releases (The server is 4.1, but libmysqlclient12 belongs to
4.0, which doesn't know about UTF-8).
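
The quotes above all come down to the same point: a pass-through driver hands back raw UTF-8 bytes, and the application has to decode them itself. The following Python sketch illustrates that idea in general terms. It is not DBI or DBD::mysql code; the function decode_row, the text_columns parameter, and the sample row are made up for this example. Decoding only the columns known to hold text avoids the problem, noted above, of treating binary fields as UTF-8.

```python
def decode_row(row, text_columns):
    """Decode only the listed column positions from UTF-8 bytes to text."""
    return tuple(
        field.decode("utf-8")
        if index in text_columns and isinstance(field, bytes)
        else field
        for index, field in enumerate(row)
    )


# A made-up row as a pass-through driver might return it: raw bytes, no decoding.
# Column 1 is text; column 2 is binary data that is not valid UTF-8.
raw_row = (42, "Dvořák".encode("utf-8"), b"\x89PNG")

row = decode_row(raw_row, text_columns={1})

# Before decoding, the length is counted in bytes; afterwards, in characters.
print(len(raw_row[1]), len(row[1]))
```

The difference between the two lengths is a quick way to tell whether a value has been decoded or is still a raw byte string.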

CGI Module

Coming soon ...

Opening Files

Coming soon ...

Using bulkmarcimport

Coming soon ...

Zebra

Coming soon ...

Saturday, July 15, 2006
