http://wiki.koha.org/doku.php?id=encodingscratchpad
Introduction
Versions of Koha prior to 2.2.6 did not give careful attention to handling character sets correctly. This document attempts to raise awareness of character set issues so that Koha developers and administrators can understand how best to proceed with development, as well as with the setup and configuration of Koha systems.
MARC Records
MARC21 records can ‘legally’ have only two encodings: MARC-8 or UTF-8. The encoding is set in position 9 of the leader (LEADER / 09): blank for MARC-8, ‘a’ for UCS/Unicode (UTF-8). MARC-8 is not recognized by modern web browsers, and since Koha is a web-based system, if you are using MARC21 records the encoding MUST be UTF-8. This means that records should be pre-processed before entering your Koha system (by whatever route they enter). Some of this is handled internally within Koha, but don’t leave it to chance: if you’re migrating MARC21 data into Koha, expect to spend a significant amount of time properly pre-processing and storing your data.
Conversion from MARC-8 to UTF-8 for MARC21 records is handled in Koha with the MARC::* suite of Perl modules. There are significant issues with properly configuring your system (with the proper SAX parsers, etc.) and there are also some questions raised about whether this suite is handling all character set / encoding issues correctly. For some details, please refer to the following posts:
http://www.nntp.perl.org/group/perl.perl4lib/2369
http://lists.nongnu.org/archive/html/koha-devel/2006-07/msg00000.html
One thing to remember is that LEADER / 09 is used in MARC::* to determine the encoding of a given record. This means that if it’s not set correctly, you will very likely mangle any records you are importing/exporting.
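Before importing, it can help to look at LEADER / 09 directly. The following is a minimal shell sketch, not part of Koha or MARC::*: the function name marc_leader_encoding is made up for illustration, and it only inspects the first record in a file. It assumes the MARC21 convention that position 9 is blank for MARC-8 and 'a' for UCS/Unicode (UTF-8).

```shell
# Hypothetical helper (not part of Koha or MARC::*): report the encoding
# declared in LEADER / 09 of the first record in a MARC21 file.
marc_leader_encoding() {
    # Leader positions are zero-based, so position 9 is the 10th byte.
    flag=$(head -c 10 "$1" | tail -c 1)
    case "$flag" in
        a)   echo "UTF-8" ;;
        " ") echo "MARC-8" ;;
        *)   echo "unknown" ;;
    esac
}
```

Calling marc_leader_encoding records.mrc (records.mrc being a placeholder file name) prints the declared encoding. Note that this only reports what the leader claims; it does not verify that the record's bytes actually match that claim.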
System
Be sure to set your system locales up correctly to use UTF-8. You can test your locale settings by running:
$ locale
or
$ echo $LANG
en_US.UTF-8
If the output is not en_US.UTF-8 (or the UTF-8 locale for your language), reconfigure your locales; a bare en_US, for example, means the system is configured for ISO-8859-1 (Latin-1). On Debian, you can reconfigure locales like this:
$ sudo dpkg-reconfigure locales
Then, you’ll need to quit your shell session and log back in again to check the default.
NOTE: On some systems, the root user won't have the locale set properly. Use
a non-root user when working with Koha, and the 'sudo' command if you need
elevated permissions.
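The locale check above can also be scripted, for example in an install or sanity-check script. This is a minimal sketch under stated assumptions: is_utf8_locale is a hypothetical helper name, and the test is purely on the locale name's codeset suffix, so it does not prove the locale is actually installed.

```shell
# Hypothetical helper: succeed if a locale name such as en_US.UTF-8
# declares a UTF-8 codeset (glibc also accepts the spelling "utf8").
is_utf8_locale() {
    case "$1" in
        *.UTF-8|*.UTF-8@*|*.utf8|*.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the locale this shell would use for character handling:
# LC_ALL overrides LC_CTYPE, which overrides LANG.
effective=${LC_ALL:-${LC_CTYPE:-${LANG:-}}}
if is_utf8_locale "$effective"; then
    echo "locale '$effective' is UTF-8"
else
    echo "locale '$effective' is NOT UTF-8"
fi
```

Running `locale charmap` is another quick check: it prints the codeset of the current locale, which should read UTF-8.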
Apache2
Be sure to have these lines in your httpd.conf:
AddCharset UTF-8 .utf8
AddDefaultCharset UTF-8
MySQL 4.1
Server Configuration
MySQL 4.1 is the absolute minimum version if you want to handle encodings correctly.
Please refer to the MySQL Manual Chapter 10: http://dev.mysql.com/doc/refman/4.1/en/charset.html
You will probably have to edit your my.cnf to set some variables so that the server uses utf8 by default. Even standard packages, like the one provided by Debian Sarge, have the variables set to use latin1 by default. Make sure you have the following in the [mysqld] section of your my.cnf:
init-connect = 'SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_general_ci
Connect to mysql using a non-root user and type:
show variables;
NOTE: The root user won't show the variables correctly, because MySQL does not execute init-connect
for users that have the SUPER privilege ... connect as the kohaadmin user to check the values.
Check to make sure the following are set to utf8:
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
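Scanning the full variable list by eye is error-prone, so the check can be sketched as a small filter. This is an illustrative sketch only: non_utf8_variables is a hypothetical function name, and it assumes the table-style output shown above (pipe-separated columns) on standard input.

```shell
# Hypothetical helper: read `SHOW VARIABLES` table output on stdin and
# print each of the variables listed above whose value is not utf8.
non_utf8_variables() {
    awk -F'|' '
        {
            # Column 2 is the variable name, column 3 its value.
            name = $2; value = $3
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
        }
        name ~ /^(character_set_(client|connection|database|results|server|system)|collation_(connection|database|server))$/ && value !~ /^utf8/ {
            print name " = " value
        }
    '
}
```

One possible invocation is `mysql -t -u kohaadmin -p -e 'SHOW VARIABLES' | non_utf8_variables`; the -t option asks the client for table output even when its output is piped. No output means all of the listed variables are utf8.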
You must create your Koha database _after_ you set the character set defaults; otherwise, the database could be created with the wrong defaults.
If you are moving a database from MySQL 4.0 to 4.1, you need to pay special attention to how your charsets are handled. If you are storing UTF-8 data in MySQL 4.0 but your table types are set to latin1, you will need to convert to blob or binary before changing the table type; otherwise, MySQL will attempt a conversion and you will end up with double-encoded UTF-8:
http://dev.mysql.com/doc/refman/4.1/en/charset-conversion.html
Also, if you are using MARC-8 encoded data in a latin1 database, you probably need to do the same thing: convert to type blob, export your records from marc_subfield_table into a MARC file, process the file to change everything to UTF-8, change the table type in MySQL, and then re-import.
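To make the double-encoding warning above concrete, here is a small shell demonstration that uses iconv to stand in for the conversion MySQL would attempt. It is a sketch, not a migration tool: hex is a hypothetical helper defined here, and it assumes iconv and od are available on the system.

```shell
# Hypothetical helper: print the bytes of stdin as lowercase hex.
hex() {
    od -An -tx1 | tr -d ' \n'
}

# "é" is the two bytes C3 A9 in UTF-8.
printf '\303\251' | hex; echo
# -> c3a9

# Treating those bytes as latin1 and converting them to UTF-8 re-encodes
# each byte separately, which is exactly double-encoded UTF-8.
printf '\303\251' | iconv -f ISO-8859-1 -t UTF-8 | hex; echo
# -> c383c2a9

# Going the other way recovers the original bytes from double-encoded data.
printf '\303\203\302\251' | iconv -f UTF-8 -t ISO-8859-1 | hex; echo
# -> c3a9
```

If accented characters in an exported file show up as four-byte sequences like c383c2a9, the data has probably been through such a conversion once too often.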
Database Backups
http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t.html
http://textsnippets.com/posts/show/84 (probably not the best way)
mysqldump
mysqlhotcopy
Perl
Here are some links to good references on Perl encoding issues:
http://www.ahinea.com/en/tech/perl-unicode-struggle.html
http://search.cpan.org/~jhi/perl-5.8.0/pod/perluniintro.pod
DBI Module
http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
Movable Type uses the Perl modules DBI and DBD::mysql to
access the MySQL database. And guess what? They don’t have
any Unicode support. In fact, forget marking the UTF-8 flag
properly, according to this, DBD::mysql doesn’t even preserve
UTF-8 flag when it’s already there.
Wait for Unicode support for DBI/DBD::mysql which might be a
long time since nobody is sure if it should be provided by the
database-independent interface DBI or by the MySQL driver DBD::mysql
or both together in some way.
Use decode_utf8 on every output from the database. This is not very easy to do.
http://perldoc.perl.org/Encode.html#PERL-ENCODING-API
Use a patch which blesses all database data (yes that includes the binary
fields) as UTF-8 based on a flag you set when connecting to the database.
http://lists.mysql.com/perl/3563 (one patch)
http://dysphoria.net/2006/02/05/utf-8-a-go-go/ (another)
http://perl7.ru/lib/UTF8DBI.pm
Here’s one that seems to indicate that it’s best to grab DBI from CPAN:
http://www.codecomments.com/archive237-2006-4-786695.html
DBD::mysql will just pass
everything through unaltered. So if you use UTF-8 as connection charset,
you have to encode('utf-8', ...) all queries and parameters, unless you
are sure that they are either plain ASCII or already have the UTF-8 bit
set. And you will get raw UTF-8 strings back, which you have to decode()
explicitly.
However, I notice that on Debian Sarge (on which I did my testing),
libdbd-mysql-perl depends on libmysqlclient12. So there may be a problem
with mixing releases (The server is 4.1, but libmysqlclient12 belongs to
4.0, which doesn't know about UTF-8).
CGI Module
Coming soon ...
Opening Files
Coming soon ...
Using bulkmarcimport
Coming soon ...
Zebra
Coming soon ...