Friday, July 21, 2006

Sunday, July 16, 2006

Investigations on Perl, MySQL & UTF-8

http://lists.gnu.org/archive/html/koha-devel/2006-03/msg00027.html

Because the story of Perl, MySQL, UTF-8 and Koha is becoming more and
more complicated, I've decided to start my tests outside of Koha or any
web server. I wanted to check that Perl and MySQL could communicate
with UTF-8 data.

What I did:

1. Copy some UTF-8 strings from
http://www.columbia.edu/kermit/utf8-t1.html and paste them into a UTF-8
text file, utf8.txt (opened/pasted in a UTF-8 console, with Vim having
:set encoding=utf-8)

2. create a UTF-8 database with a simple table having a TEXT field

$ mysql --user=root --password=xxx
mysql> CREATE DATABASE `utf8_test` CHARACTER SET utf8;
mysql> connect utf8_test
mysql> create table strings (id int, value text);
mysql> quit

(no need to set connection character set to utf-8 in that case, default
latin1 is fine)

Note: my MySQL server is latin1...

$ mysql --user=root --password=xxx utf8_test
mysql> status
Server characterset: latin1
Db characterset: utf8
Client characterset: latin1
Conn. characterset: latin1
mysql> set names 'UTF8';
mysql> status
Server characterset: latin1
Db characterset: utf8
Client characterset: utf8
Conn. characterset: utf8

3. Write and execute a Perl script which reads the UTF-8 text file,
inserts the UTF-8 strings into the database, retrieves them from the
database, and prints them to STDOUT. See details in the attached file
readfile_insertdb.pl. Important note: "set names 'UTF8';" is mandatory.
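
The attached readfile_insertdb.pl is not reproduced here, but a minimal sketch of the same round trip, assuming the utf8_test schema created above, might look like this (the connection details are placeholders):

#!/usr/bin/perl
# Minimal sketch of the round trip described above (not the actual
# attached readfile_insertdb.pl). UTF-8 octets are passed through
# unaltered: read raw, insert, select, print.
use strict;
use warnings;
use DBI;

open my $fh, '<', 'utf8.txt' or die "Cannot open utf8.txt: $!";
my @lines = <$fh>;
close $fh;

my $dbh = DBI->connect('DBI:mysql:database=utf8_test;host=localhost',
                       'root', 'xxx', { RaiseError => 1 });
$dbh->do("SET NAMES 'UTF8'");    # mandatory, as noted above

my $sth = $dbh->prepare('INSERT INTO strings (id, value) VALUES (?, ?)');
my $id  = 0;
$sth->execute(++$id, $_) for @lines;

# Read the strings back and print them; a UTF-8 console renders the
# raw UTF-8 bytes correctly.
my $rows = $dbh->selectall_arrayref('SELECT id, value FROM strings ORDER BY id');
print "$_->[0]: $_->[1]" for @$rows;

$dbh->disconnect;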

Everything is *working fine*. My output is in UTF-8, I'm 100% sure of
it.

DBD::mysql : 2.9007
Perl : 5.8.7
MySQL : 4.1.12-Debian_1ubuntu3.1-log
DBI : 1.48

(find your local versions with attached script versions.pl)

I suspect that Paul's data stored in MySQL are not truly UTF-8. Maybe
I'm missing the point, but it seems Perl, MySQL and UTF-8 don't work so
badly together after all.

The Inter Library System (ILS) Comparison Chart

http://wiki.koha.org/doku.php?id=inter_library_system_comparison

Saturday, July 15, 2006

koha-2.3.0 bug-1

ERROR 1062 at line 1: Duplicate entry 'localhost-root' for key 1
256ERROR 1062 at line 1: Duplicate entry '%-Koha-root' for key 1


read_config_file(/etc/koha.conf.tmp) returned undef at /usr/local/koha/intranet/modules/C4/Context.pm line 195.
Can't call method "config" on unblessed reference at /usr/local/koha/intranet/modules/C4/Context.pm line 488.
Problem updating database...

converts the binary MARC-8 records to MARCXML / UTF-8

http://www.nntp.perl.org/group/perl.perl4lib/2369

Hi everyone,

Just providing an update on this issue. As you may recall, I've
been putting the MARC::Record suite, specifically MARC::File::XML
and MARC::Charset, through some fairly rigorous tests, including
a 'roundtrip' test, which converts the binary MARC-8 records to
MARCXML / UTF-8 and then back to binary MARC but encoded as UTF-8.
This test is available here:

http://liblime.com/public/roundtrip.pl

I discovered a number of bugs or issues, not in the MARC::* stuff, but in the
back-end SAX parsers. I'll just summarize my discoveries here for
posterity:

1. MARC::File::XML, if it encounters an unmapped encoding in a
MARC-8 encoded binary MARC file (in as_xml()), will drop the entire
subfield where the improper encoding exists. The simple solution is
to always use MARC::Charset->ignore_errors(1); if you expect your
data will have improper encoding.
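
In practice that looks something like the following sketch (the input file name is a placeholder):

# Sketch: tolerate unmapped MARC-8 sequences when converting to MARCXML.
use MARC::Batch;
use MARC::Charset;
use MARC::File::XML;

MARC::Charset->ignore_errors(1);   # drop bad bytes instead of whole subfields

my $batch = MARC::Batch->new( 'USMARC', 'records.mrc' );
while ( my $record = $batch->next ) {
    print $record->as_xml();       # MARCXML / UTF-8
}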

2. the XML::SAX::PurePerl parser cannot properly handle combining
characters. I've reported this bug here:

http://rt.cpan.org/Public/Bug/Display.html?id=19543

At the suggestion of several, I tried replacing my default system
parser with expat, which caused another problem:

3. handing valid UTF-8 encoded XML to new_from_xml() sometimes
causes the entire record to be destroyed when using XML::SAX::Expat
as the parser (with PurePerl these seem to work). It fails with
the error:

not well-formed (invalid token) at line 23, column 43, byte 937 at /usr/lib/perl5/XML/Parser.pm line 187

I haven't been able to track down the cause of this bug. I eventually
found a workaround that didn't result in the above error, but instead
silently mangled the resulting binary MARC record on the way out:

4. Using incompatible versions of XML::SAX::LibXML and libxml2 will
cause binary MARC records to be mangled when passed through new_from_xml()
in some cases. The solution here is to make sure you're running
compatible versions of XML::SAX::LibXML and libxml2. I run Debian
Sarge and when I just used the package maintainer's versions it
fixed the bug. It's unclear to me why the binary MARC would be
mangled; this may indicate a problem with MARC::*, but I haven't
had time to track it down, and since installing compatible versions
of the parser back-end solves the problem I can only assume it's
the fault of the incompatible parsers.

Issues #3 and #4 above can be replicated by running the following batch
of records through the roundtrip.pl script above:

http://liblime.com/public/several.mrc

If you want to test #2, try running this record through roundtrip.pl:

http://liblime.com/public/combiningchar.mrc

BTW: you can change your default SAX parser by editing the .ini file ...
mine is located in /usr/local/share/perl/5.8.4/XML/SAX/ParserDetails.ini
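
If you just want to see which parsers XML::SAX knows about (the last entry in ParserDetails.ini is normally the one ParserFactory picks by default), a quick check along these lines should work:

perl -MXML::SAX -e 'print join("\n", map { $_->{Name} } @{ XML::SAX->parsers }), "\n"'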

So the bottom line is, if you want to use MARC::File::XML in any
serious application, you've got to use compatible versions of the
libxml2 parser and XML::SAX::LibXML. Check the README in the perl
package for documentation on which are compatible...

Maybe a note somewhere in the MARC::File::XML documentation to point
these issues out would be useful. Also, it wouldn't be too bad to have
a few tests to make sure that the system's default SAX parser is capable
of handling these cases. Just my two cents.

Cheers,

--
Joshua Ferraro VENDOR SERVICES FOR OPEN-SOURCE SOFTWARE
President, Technology migration, training, maintenance, support
LibLime Featuring Koha Open-Source ILS
jmf[at]liblime.com |Full Demos at http://liblime.com/koha |1(888)KohaILS

DB schema

1. A logical schema diagram for 3.0 has been written by Paul. It's a 2-page document, available in several formats: openoffice.org [http://www.koha-fr.org/presentation/MCD_version3.odg 15KB], PDF [http://www.koha-fr.org/presentation/MCD_version3.pdf 230KB]. It will be updated when needed (Paul will take care of the update; if he doesn't, bug him).

2. A logical schema diagram for 2.2.0 has been written. It's a 2-page document, available in several formats: openoffice.org [http://www.koha-fr.org/presentation/MCD_version2_2_0.sxd 15KB], PDF [http://www.koha-fr.org/presentation/MCD_version2_2_0.pdf 230KB].

3. A logical schema diagram for 2.0.0 has been written. It's a 2-page document, available in several formats: openoffice.org [http://www.koha-fr.org/presentation/MCD2.sxd 15KB], PDF [http://www.koha-fr.org/presentation/MCD.pdf 230KB], and jpg [http://www.koha-fr.org/presentation/MCD1.jpg page 1, 160KB] and [http://www.koha-fr.org/presentation/MCD2.jpg page 2, 170KB]. Some draft 'logical' schema diagrams from 1.3.3 are [http://irref.mine.nu/user/dchud/koha-schema/ available here].

ZebraSearchingDefinitions an explanation of the components of searching with the new ZOOM API, and a discussion of which cataloging procedures should

The Koha Online Catalog: A Working Definition

In versions of Koha prior to 2.4, the goal with Koha’s MARC support was to get a functioning ILS in place that was capable of storing MARC records correctly. But now we have a more ambitious goal: we want our ILS to be capable of searching the semantic information in MARC records to the fullest extent possible. A secondary goal is to provide easy access from the Online Catalog to resources that extend beyond just the bibliographic records for library holdings.

This Wiki page provides a workspace where Koha developers, cataloging staff, and general staff can post ideas, requests, and questions for how Koha handles searching (and display) of bibliographic records and access to other resources.
Scope

There are many considerations in constructing a working definition of the Koha Catalog. Ultimately, our working definition will consist of individual goals. An example of a goal might be: "I want to be able to search for an exact title like 'It' by Stephen King, and have it be the first record in the result set." To realize a given goal, we must define a set of practices in four areas:

Search Indexes

The indexes are where we define:

* how MARC fields should be grouped together as 'search points' (eg, 'author', 'date', 'exact title' are search points)
* what kinds of searches we can do on those groupings (eg, 'number' search, 'phrase' search)
* how to search within certain fields for data (specific positions of fixed fields, for instance)

MARC Frameworks

Koha’s MARC Frameworks are where we define:

* what constitutes a MARC record (what fields/subfields)
* labels for each field
* how the fields are handled within the MARC editor
* how the fields should be displayed in search results and details pages
* a mapping between MARC records and Koha's item management (issues, reserves, circ rules, barcodes, etc.)

Cataloging

Consistent cataloging practices are, together with Frameworks and Indexes, an essential component of searching. Here are some things to think about:

* NPL employs 'copy-cataloging', not original cataloging, so records often come from different sources that may have different cataloging practices.
* in areas where no official rule has been made in AACR2 or similar cataloging manuals, Koha will need a consistent practice in order to properly index records
* with over 2000 edit points per record, we need to identify clearly which of those are most important for purposes of search and display

Interface Design

The Koha OPAC is an interface through which patrons and staff construct queries of the data. The interface needs to be fast, accurate, and intuitive to use if it is to be a useful search tool of the library’s collections.

Our task then, is to construct a working set of expectations and definitions of the above. The definitions can then be applied directly to each of the four categories to realize a given search goal.
Discussion Points
Dates

MARC records don’t have a consistant way to distinguish between copyright and publication dates (that I can tell), so we have two date types to think about: copyright/publication, and acquisition. Here are some related MARC fields for each:
copyright/publication dates

008 / 07-10 : generally a primary date associated with the
publication, distribution, etc. of an item and the beginning
date of a collection

008 / 11-14 : secondary date associated with the publication
distribution, etc. of an item and the ending date of a collection.
For books and visual materials, this may be a detailed date which
represents a month and day.

260

362

* Index: I propose to index the 008/07-10 field and make that the date field used for date searches (see the sketch after this list)
* MARC Framework: the framework should require that 008/07-10 be filled with values
* Cataloging: we need to make sure that all our records have values in the 008/07-10
* Interface Design: what ways do we want to be able to search on dates? In a range, individually?
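
A rough sketch of pulling that date out of a record for indexing (assuming a MARC::Record object; the validity check is only illustrative):

# Sketch: extract the primary publication date (008/07-10) for a date index.
my $f008     = $record->field('008');
my $pub_date = $f008 ? substr( $f008->data(), 7, 4 ) : '';
$pub_date    = '' unless $pub_date =~ /^\d{4}$/;   # skip 'uuuu', blanks, etc.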

acquisition date

942$k : stored as yyyymmddhhmmss

Item Types, Circulation Rules, etc.

For the Zebra version of Koha, we’re breaking up the itemtypes into four categories:

1. collection code (the original itemtype)
2. audience
3. content
4. format

To do this, we are using a combination of several fields in the record to derive each category.

Leader

LDR/06 type of record

FORMAT OF ITEM

MARC Field: 007/1,2 (form of item)

ta = everything else = 'regular print'
tb = LP,LPNF,LP J, LP YA,LP JNF,LP YANF = 'large print'
sd = CDM,AB,JAB,JABN,YAB,YABN,ABN, = 'sound disk'
co = CDR = 'CD-ROM'
vf = AV,AVJ,AVNF,AVJNF = 'VHS'
vd = DVD,DVDN,DVDJ,DVJN = 'DVD'
ss = JAC,YAC,AC,JACN,YACN,ACN = 'sound cassette'
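
Assuming the "007/1,2" above refers to the first two character positions of field 007, the format derivation could be sketched roughly like this (MARC::Record assumed; this is not the actual Koha indexing code):

# Sketch: derive a 'format' label from the first two bytes of 007,
# following the table above.
my %format_labels = (
    ta => 'regular print',
    tb => 'large print',
    sd => 'sound disk',
    co => 'CD-ROM',
    vf => 'VHS',
    vd => 'DVD',
    ss => 'sound cassette',
);
my $f007   = $record->field('007');
my $code   = $f007 ? substr( $f007->data(), 0, 2 ) : '';
my $format = $format_labels{$code} || 'regular print';   # 'everything else'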

TARGET AUDIENCE

MARC Field: 008/22 (target audience)
a = EASY
b = EASY
c = J,JNF,JAB,JABN,AVJ,AVJNF,JAC,JACN (juvenile)
d = YA,YANF,YAB,YABN,YAC,YACN (young adult)
e = everything else (adult)
j = J,JNF,JAB,JABN,AVJ,AVJNF,JAC,JACN,DVDJ,DVDJN (juvenile)

CONTENT

MARC Field: 008/33,34

normal records:
008 / 33 fiction/non-fiction
008 / 34 biography
(what about mystery ... are there any others?)

video recordings: MARC Field 008/33
v = videorecording

008 / 34 l live action
008 / 34 a animation
008 / 34 c animation and live action

sound recordings:
008 / 30-31 a autobiography
b biography
d drama
etc.
AUDIO BOOKS
LDR nim a 00
008/ 30, 31
Guidelines for applying content designators:

Code: Description:
# Item is a music sound recording When # is used, it is followed by
another blank (##).
a Autobiography
b Biography
c Conference proceedings
d Drama
e Essays
f Fiction Fiction includes novels, short stories, etc.
g Reporting Reports of news-worthy events and informative messages
are included in this category.
h History History includes historical narration, etc., that may also
be covered by one of the other codes (e.g., historical poetry).
i Instruction Instructional text includes instructions on how to
accomplish a task, learn an art, etc. (e.g., how to replace a light
switch). Note: Language instruction text is assigned code j.
j Language instruction Language instructional text may include
passages that fall under the definition for one of the other codes
(e.g., language text that includes poetry).
k Comedy Spoken comedy.
l Lectures, speeches Literary text is lectures and/or speeches.
m Memoirs Memoirs are usually autobiographical.
n Not applicable Item is not a sound recording (e.g., printed or
manuscript music).
o Folktales
p Poetry
r Rehearsals Rehearsals are performances of any of a variety of
nonmusical productions.
s Sounds Sounds include nonmusical utterances and vocalizations that
may or may not convey meaning.
t Interviews
z Other Type of literary text for which none of the other defined
codes are appropriate.
| No attempt to code

MUSIC
LDR njm a 00
008 / 30,31 (usually blank)
008 / 18,19 composition form

Guidelines for applying content designators:

Code: Description:
an Anthems
bd Ballads
bt Ballets
bg Bluegrass music
bl Blues
cn Canons and rounds i.e., compositions employing strict imitation
throughout
ct Cantatas
cz Canzonas Instrumental music designated as a canzona.
cr Carols
ca Chaconnes
cs Chance compositions
cp Chansons, polyphonic
cc Chant, Christian
cb Chants, Other
cl Chorale preludes
ch Chorales
cg Concerti grossi
co Concertos
cy Country music
df Dance forms Includes music for individual dances except those that
have separate codes defined: mazurkas, minuets, pavans, polonaises,
and waltzes.
dv Divertimentos, serenades, cassations, divertissements, and notturni
Instrumental music designated as a divertimento, serenade, cassation,
divertissement, or notturno.
ft Fantasias Instrumental music designated as fantasia, fancies,
fantasies, etc.
fm Folk music Includes folk songs, etc.
fg Fugues
gm Gospel music
hy Hymns
jz Jazz
md Madrigals
mr Marches
ms Masses
mz Mazurkas
mi Minuets
mo Motets
mp Motion picture music
mc Musical revues and comedies
mu Multiple forms
nc Nocturnes
nn Not applicable Indicates that form of composition is not applicable
to the item. Used for any item that is a non-music sound recording.
op Operas
or Oratorios
ov Overtures
pt Part-songs
ps Passacaglias Includes all types of ostinato basses.
pm Passion music
pv Pavans
po Polonaises
pp Popular music
pr Preludes
pg Program music
rg Ragtime music
rp Rhapsodies
rq Requiems
ri Ricercars
rc Rock music
rd Rondos
sd Square dance music
sn Sonatas
sg Songs
st Studies and exercises Used only when the work is intended for
teaching purposes (usually entitled Studies, Etudes, etc.).
su Suites
sp Symphonic poems
sy Symphonies
tc Toccatas
ts Trio-sonatas
uu Unknown Indicates that the form of composition of an item is
unknown. Used when the only indication given is the number of
instruments and the medium of performance. No structure or genre is
given, although they may be implied or understood.
vr Variations
wz Waltzes
zz Other Indicates a form of composition for which none of the other
defined codes are appropriate (e.g., villancicos, incidental music,
electronic music, etc.).
| No attempt to code

* Index: I propose that the above guidelines be used for indexing a record for its itemtype, format, audience, and content
* MARC Framework: the framework should require that the above fields be filled with values
* Cataloging: we need to make sure that all our records have appropriate values in the above fields
* Interface Design: need to make sure the interface is easy to use

Organization of Materials

This gets tricky. Please keep in mind that I haven’t had any formal library science training and the following is what I’ve gleaned by working with librarians from many different systems. Every library seems to handle these issues differently, but here are some definitions that I hope are universal:

* Collection Code - used to specify circulation rules on a given record or item
* Classification - a taxonomy for organizing a library collection into subjects
* Shelving Location - the general location of an item within the library (general stacks, reference area, new books shelf, science fiction area, etc.)
* Call Number - a standards-based scheme for organizing a given item on the shelf. Typically, the call number is composed of some part of the classification
* Local Call Number - a locally-defined scheme for organizing items on the shelf.
* Item Call Number - an item-specific call number, sometimes used to distinguish between two copies of the same item on the same shelf. Also used for inventory as a way to specify which shelf a given item is associated with.

Libraries typically simplify the above elements to ease record maintenance and searching of materials. For instance, NPL currently uses a simplified scheme that consists of the following:

Name | Use | Composition | Location
Item Type | general shelving location, circulation rules | locally defined | 942$c
Call Number | shelf order, subject classification | Dewey or locally defined | 942$c

For Koha 2.4, we’re proposing to change that scheme slightly to enable better search options in the catalog. Here is the scheme that we’re proposing:
Name | Use | Composition | Location
Classification | subject classification | Dewey | 082
Collection Code (itemtype) | circulation rules, general shelving location | locally defined | 942$c
Call Number | shelf order | Local Call Number (fiction) or Classification (non-fiction) | ?
Local Call Number | shelf order | NPL's local call number scheme ( ) | 942$c
Item Call Number | inventory | Call Number | 952?

Looking forward, we may want to adopt an even more complete scheme such as the following:
Name | Use | Composition | Location
Classification | subject classification | Dewey | 082
Collection Code | circulation rules | locally defined | 942$c
Shelving Location Code | location of item (new items, general stacks, mysteries and sci-fi, etc.) | locally defined | ?
Call Number | shelf order | Local Call Number (fiction) or Classification (non-fiction) | ?
Local Call Number | shelf order | NPL's local call number scheme ( ) | 942$c
Item Call Number | inventory | Call Number + some other identifier | ?

Here are some additional thoughts on the topic of Material Organization

* There is currently crossover between itemtypes and call numbers, but I think we can safely ignore it
* Staff need to search and sort by 'Call Number'. A 'Call Number Search' is defined as:
  o search Classification
  o if not found, search 'Local Call Number'
  o sorting of this search point is based on which type of 'call number' the search was on
* Sorting by call numbers outside of the context of a call number search will consist of sorting by number first, then by text
* Item Call Numbers are required for inventory
* NPL does not use shelving locations

Display of Records

Here is a list of requests I know about:

* Volume Numbers (245$n) should be included in title display and search
* Subjects should display in a semantically correct way

ZebraProgrammerGuide Some useful information about managing Zebra for Koha

http://wiki.koha.org/doku.php?id=zebraprogrammerguide

Here are some commands that you may find useful if you're managing a Zebra installation with Koha.
Counting Records

You can find out how many records are in your database thusly:

Z> base IR-Explain-1
Z> form sutrs
Z> f @attr exp1 1=1 databaseinfo
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 4, setno 1
SearchResult-1: databaseinfo(4)
records returned: 0
Elapsed: 0.069880
Z> s
Sent presentRequest (1+1).
Records: 1
[IR-Explain-1]Record type: SUTRS
explain:
databaseInfo: DatabaseInfo
commonInfo:
dateAdded: 20020911101011
dateChanged: 20020911101011
languageCode: EN
accessinfo:
unitSystems:
string: ISO
attributeSetIds:
oid: 1.2.840.10003.3.5
oid: 1.2.840.10003.3.1
oid: 1.2.840.10003.3.1000.81.2
schemas:
oid: 1.2.840.10003.13.2
name: gils
userFee: 0
available: 1
recordCount:
recordCountActual: 48
zebraInfo:
recordBytes: 123562
Elapsed: 0.068221
Z> s
Sent presentRequest (2+1).
Records: 1
[IR-Explain-1]Record type: SUTRS

EncodingScratchPad Some notes on encoding and charsets

http://wiki.koha.org/doku.php?id=encodingscratchpad

Introduction

In versions prior to Koha 2.2.6, careful attention was not given to dealing with character sets correctly. This document attempts to raise awareness of character set issues so that Koha developers and administrators can understand how best to proceed with development as well as setup and configuration of Koha systems.
MARC Records

MARC21 records can 'legally' only have two encodings: MARC-8 or UTF-8. The encoding is set in position 9 of the leader (LEADER / 09). MARC-8 is not recognized by modern web browsers, and since Koha is a web-based system, if you are using MARC21 records the encoding MUST be UTF-8. This means that records should be pre-processed before entering your Koha system (in whatever way they enter). Some of this is handled internally within Koha, but don't leave it to chance: if you're migrating MARC21 data into Koha, expect to spend a significant amount of time dealing with properly pre-processing and storing your data.

Conversion from MARC-8 to UTF-8 for MARC21 records is handled in Koha with the MARC::* suite of Perl modules. There are significant issues with properly configuring your system (with the proper SAX parsers, etc.) and there are also some questions raised about whether this suite is handling all character set / encoding issues correctly. For some details, please refer to the following posts:

http://www.nntp.perl.org/group/perl.perl4lib/2369

http://lists.nongnu.org/archive/html/koha-devel/2006-07/msg00000.html

One thing to remember is that LEADER / 09 is used in MARC::* to determine the encoding of a given record. This means that if it’s not set correctly, you will very likely mangle any records you are importing/exporting.
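
For example, it is worth checking LEADER/09 before running records through a conversion (a sketch, assuming a MARC::Record object):

# Sketch: inspect LEADER/09. In MARC21, blank means MARC-8 and 'a' means
# UCS/Unicode (UTF-8).
my $ldr09 = substr( $record->leader(), 9, 1 );
if ( $ldr09 eq 'a' ) {
    print "Record claims UTF-8\n";
} elsif ( $ldr09 eq ' ' ) {
    print "Record claims MARC-8\n";
} else {
    warn "Unexpected LEADER/09 value '$ldr09'\n";
}
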
System

Be sure to set your system locales up correctly to use UTF-8. You can test your locale settings by running:

$ locale

or

$ echo $LANG
en_US.UTF-8

If it's not en_US.UTF-8 (or the UTF-8 locale for your language), reconfigure your locales; a plain en_US means the system is configured for iso-8859-1/latin1. On Debian, you can configure locales thusly:

$ sudo dpkg-reconfigure locales

Then, you’ll need to quit your shell session and log back in again to check the default.

NOTE: on some systems the root user won't have the locale set properly;
use a non-root user when working with Koha, and the 'sudo' command if you
need elevated permissions

Apache2

Be sure to have these lines in your httpd.conf:

AddCharset UTF-8 .utf8
AddDefaultCharset UTF-8

MySQL 4.1
Server Configuration

MySQL version 4.1 is the absolute minimum if you want to handle encoding correctly.

Please refer to the MySQL Manual Chapter 10: http://dev.mysql.com/doc/refman/4.1/en/charset.html

You will probably have to edit your my.cnf to set some variables so that the server will use utf8 by default. Even standard packages like the one provided by Debian Sarge have the variables set to use latin1 by default. Make sure you have the following in your my.cnf:

init-connect = 'SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_general_ci

Connect to mysql using a non-root user and type:

show variables;

NOTE: The root user won't show the variables correctly for reasons I haven't had time to
investigate ... connect as the kohaadmin user to check the values.

Check to make sure the following are set to utf8:

| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |

You must create your Koha database _after_ you set the character set defaults, otherwise the database could be created with the wrong defaults.
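
To be explicit about it, you can also spell out the character set when creating the database (the database name here is just an example):

mysql> CREATE DATABASE `Koha` CHARACTER SET utf8 COLLATE utf8_general_ci;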

If you are moving from a MySQL 4.0 database to 4.1, you need to pay special attention to how to properly deal with your charsets. If you are storing utf-8 data in MySQL 4.0 but your table types are set to latin1, you will need to convert to blob or binary before changing the table type, otherwise MySQL will attempt a conversion and you will end up with double-encoded utf8:

http://dev.mysql.com/doc/refman/4.1/en/charset-conversion.html

Also, if you are using MARC-8 encoded data in a latin1 type database, you probably need to do the same thing: export your records from marc_subfield_table into a MARC file (after converting to type blob), then process the file, changing everything to utf8, then change the table type in MySQL, then re-import.
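
A sketch of the column-by-column dance described on that manual page; the column name and type here are only examples, so check your actual schema (types, lengths, indexes) before doing this:

mysql> ALTER TABLE marc_subfield_table MODIFY subfieldvalue BLOB;
mysql> ALTER TABLE marc_subfield_table MODIFY subfieldvalue TEXT CHARACTER SET utf8;

The first statement sidesteps any latin1-to-utf8 conversion of the stored bytes; the second reinstates a character type, now declared as utf8.
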
Database Backups

http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t.html

http://textsnippets.com/posts/show/84 (probably not the best way)
mysqldump
mysqlhotcopy
Perl

Here are some links to good references for perl encoding issues:

http://www.ahinea.com/en/tech/perl-unicode-struggle.html
http://search.cpan.org/~jhi/perl-5.8.0/pod/perluniintro.pod
DBI Module

http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html

Movable Type uses the Perl modules DBI and DBD::mysql to
access the MySQL database. And guess what? They don’t have
any Unicode support. In fact, forget marking the UTF-8 flag
properly, according to this, DBD::mysql doesn’t even preserve
UTF-8 flag when it’s already there.

Wait for Unicode support for DBI/DBD::mysql, which might be a
long time since nobody is sure whether it should be provided by the
database-independent interface DBI, by the MySQL driver DBD::mysql,
or by both together in some way.

Use decode_utf8 on every output from the database. This is not very easy to do.
http://perldoc.perl.org/Encode.html#PERL-ENCODING-API
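
A sketch of what that looks like with DBI (table and column names are just examples):

# Sketch: decode the raw UTF-8 bytes coming back from DBD::mysql so that
# Perl's internal UTF-8 flag is set on the strings.
use Encode qw(decode_utf8);

my $sth = $dbh->prepare('SELECT value FROM strings');
$sth->execute;
while ( my ($value) = $sth->fetchrow_array ) {
    my $text = decode_utf8($value);   # now a proper Perl character string
    # ... use $text with regexes, length(), templates, etc.
}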

Use a patch which blesses all database data (yes that includes the binary
fields) as UTF-8 based on a flag you set when connecting to the database.
http://lists.mysql.com/perl/3563 (one patch)
http://dysphoria.net/2006/02/05/utf-8-a-go-go/ (another)
http://perl7.ru/lib/UTF8DBI.pm

Here’s one that seems to indicate that it’s best to grab DBI from CPAN:

http://www.codecomments.com/archive237-2006-4-786695.html

DBD::mysql will just pass
everything through unaltered. So if you use UTF-8 as the connection charset,
you have to encode('utf-8', ...) all queries and parameters, unless you
are sure that they are either plain ASCII or already have the UTF-8 bit
set. And you will get raw UTF-8 strings back, which you have to decode()
explicitly.

However, I notice that on Debian Sarge (on which I did my testing),
libdbd-mysql-perl depends on libmysqlclient12. So there may be a problem
with mixing releases (The server is 4.1, but libmysqlclient12 belongs to
4.0, which doesn't know about UTF-8).

CGI Module

Coming soon ...
Opening Files

Coming soon ...
using bulkmarcimport

Coming soon ...
Zebra

Coming soon ...

InstallingZebraPlugin226 How to install the Zebra plugin for 2.2.6

http://wiki.koha.org/doku.php?id=installingzebraplugin226

Introduction

Koha’s Zebra plugin is a new feature with 2.2.6 that allows an otherwise ordinary rel_2_2 Koha to use Zebra for bibliographic data storage, search and retrieval. Why you would want to integrate Koha and Zebra is a topic for another document. This guide assumes you’re sold on the idea, and already have some experience managing a Koha system. In it, we’ll walk through the process of:

* configuring your system
* symlinking your installation environment to a 'dev-week' CVS repository
* making needed changes to your Koha MySQL database
* installing, configuring, and starting Zebra
* importing your data

Before following this install document please refer to the “Installing Koha (2.2.6) on Debian Sarge” and the “Updating Koha” documents available from http://kohadocs.org. The assumption is that you’ve already got Koha 2.2.6 installed and a working knowledge of how to symlink a CVS working repository to your installation. If you don’t know what that means, DON’T PROCEED. The Zebra integration adds quite a bit of complexity to the installation and maintenance of Koha, so be warned.

I also highly recommend you read over the Zebra docs at http://indexdata.dk/zebra if you’re going to be managing a Zebra installation.

Finally, DO NOT perform these steps on a production system unless you have fully tested them on a test system and are comfortable with the process. Doing otherwise could lead to serious data and configuration loss. And of course, before doing anything, please back up your data.
Preparing the server for Zebra
Install Yaz, Zebra, Net::Z3950::ZOOM
on Debian

Put the following in your /etc/apt/sources.list

# for Yaz Toolkit
deb http://ftp.indexdata.dk/debian indexdata/sarge released
deb-src http://ftp.indexdata.dk/debian indexdata/sarge released

Now run

# apt-get update && apt-get install idzebra

(yaz will automatically be installed as it’s a dependency)

Install the latest version of Net::Z3950::ZOOM from CPAN:

# perl -MCPAN -e 'install Net::Z3950::ZOOM'

On other systems

Get the latest Zebra & Yaz sources from http://www.indexdata.com/yaz/ and http://www.indexdata.com/zebra/. Install Yaz:

# tar xvfz yaz-version.tar.gz
# cd yaz-version
# ./configure
# make
# make install

Then install Zebra:

# tar xvfz idzebra-version.tar.gz
# cd idzebra-version
# ./configure
# make
# make install

Install the latest version of Net::Z3950::ZOOM from CPAN:

# perl -MCPAN -e 'install Net::Z3950::ZOOM'

Prepare the filesystem

Check out dev-week from CVS

# cvs -z3 -d:pserver:anonymous@cvs.savannah.nongnu.org:/sources/koha export -r dev_week koha

NOTE: This is not a 'check out' but an 'export'. The main difference is that there are no CVS directories in the 'export'.

Symlink your Koha 2.2.6 install environment to the dev-week ‘working copy’ (see the ‘Updating Koha’ document for details)
The zebraplugin directory

In the dev-week Koha CVS repository you'll find a zebraplugin directory that contains all the files you'll need to set up Zebra.
etc

Within the etc directory, you’ll find a koha.xml file that is a replacement for the koha.conf file in rel_2_2. This file is where you specify the location of many of the files in the zebraplugin directory. You’ll need to pick a directory structure that works with your configuration and edit the file accordingly. For instance, on my systems, I have a structure like the following:

/koha
|-- cvsrepos
|-- etc
|-- intranet
|-- log
|-- opac
|-- utils
`-- zebradb

The default plugin koha.xml uses this directory structure as a point of reference (the etc and zebradb directories above correspond to the same directories in the zebraplugin directory).
zebradb

This directory contains the filesystem that will store all of Zebra’s indexes. The only file you should need to edit in the zebradb file structure is the kohalis file within biblios/tab. This file should contain the user/password specified in the koha.xml directive.

Depending on your system you may also need to modify some idzebra directories. On my Mandriva system, the Zebra parameters are in /usr/local/share/idzebra and not in /usr/local/idzebra. To check, run:

which zebraidx

If the answer is

/usr/local/bin/zebraidx

then update zebra-biblios.cfg & zebra-authorities.cfg and modify the line

profilePath:${srcdir:-.}:/usr/share/idzebra/tab/:/koha/zebraplugin/zebradb/biblios/tab/:${srcdir:-.}/tab/

to

profilePath:${srcdir:-.}:/usr/local/share/idzebra/tab/:/koha/zebraplugin/zebradb/biblios/tab/:${srcdir:-.}/tab/

utils

The utils directory contains the utilities you’ll need to perform the rest of the installation / upgrade, which brings us to ...
Modify the SQL database

Here are tasks you’ll want to perform whether or not this is a brand new Koha install:

1. updatedatabase (using updatedatabase from rel_2_2)
2. update to the latest bib framework
3. convert_to_utf8.pl (from dev-week)

If you’re migrating from a previous version of Koha (very likely) you’ll need to also do the following:

1. run rebuild-nonmarc from dev_week if your framework has changed
2. run missing090field.pl (from dev-week)
3. run biblio_framework.sql from within the mysql monitor (from dev-week)
4. run phrase_log.sql from within the mysql monitor (from dev-week)
5. export your MARC records
6. run them through a preprocess routine to convert to utf-8
7. double-check again for missing 090 fields (very critical)

Importing Data

If you’re upgrading an existing Koha installation, your MySQL database already contains the record data, so all we need to do is import the bibliographic data into Zebra. We can do this thusly:

# zebraidx -g iso2709 -c /koha/etc/zebra-biblios.cfg -d biblios update /path/to/records
# zebraidx -g iso2709 -c /koha/etc/zebra-biblios.cfg -d biblios commit
-g is for group, files with the same group have the same extension
-c is where the config file is
-d is the name of the biblioserver

If you need to batch import records that don’t exist in your Koha installation, you can use bulkmarcimport as with rel_2_2:

# cd /path/to/dev_week/repo/
# export KOHA_CONF=/path/to/koha.xml
# perl misc/migration_tools/bulkmarcimport /path/to/records.mrc

Starting Zebra

zebrasrv -f /koha/etc/koha.xml

Yes, it’s that simple. :-)

The old 2.2 RoadMapToMarc

http://wiki.koha.org/doku.php?id=roadmaptomarc

1. ToDoMARC: the complete ROADMAP, and where we are...
2. WhatIsMarc: explains what MARC is
3. MarcDBStructure: almost up to date. Some indexes have been added, and a field or two
4. MarcKohaMap: how we map Koha old-db fields to USMARC subfields.

not up to date, but useful:

1. MarcOperation: how we will manage the different MARC standards in Koha.

completely out of date:

1. CataloguingAPI: see Biblio.pm instead (lots of comments at the beginning)
2. WalktroughToMarc: see ToDoMARC instead

ZOOMSearchBeta ZOOM Searching Beta Notes

http://wiki.koha.org/doku.php?id=zoomsearchbeta

Hi folks,

Well, you knew it was coming, it’s been promised for, like, forever ... and now, it’s finally here! I’m proud to announce the beta version of the new Koha searching module that we’ve been raving about.

Before I show you the link though, I must warn you, this is still a beta product; things might not work perfectly, and that's because we're still working on it. If something doesn't look right, drop Owen or me a note and let us know (either on chat, on the forum, or via email; jmf@liblime.com is my current one).

And now, without further ado, the link:

http://zoomopac.liblime.com

Let’s walk through the various search features of the new search:
SIMPLE SEARCH

The SIMPLE SEARCH page provides a simple, patron-friendly, Google-like interface to the catalog. Patrons can type simple, intuitive phrases like "harry potter" (titles) or "Chorale from Beethoven's Symphony no. 9" (song titles), "Neal Stephenson" (authors), etc.

The SIMPLE SEARCH also exposes a very intuitive formal query language called the Common Command Language - CCL (this is actually an international query standard: ISO 8777). With CCL, you can do queries like:

ti=cryptonomicon
au=neal stephenson
isbn=0380973464

If you ever wonder how to use CCL, just click on the little [?] next to the search input box.

For most queries, you can probably get away with just using the SIMPLE SEARCH, but sometimes ...
ADVANCED SEARCH

The ADVANCED SEARCH provides a guided interface to some 'prefab' search types like 'author', 'title', etc. For example, say you know the exact title of an item ... say it's 'It' by Stephen King, and you want to find it in the catalog (pun intended). Try the 'Exact Title' option. The possibilities here are really endless ... if you want us to add a new search type, just drop Owen or me a note and we'll make it happen.

You'll also notice something else about the Advanced Search, something we're not quite finished with and could use some feedback on. Remember the old 'Item Type' limit? Ever try to find 'all videos' or 'all DVDs'? It was pretty tough because those formats were in several places at the same time. Well, now we've broken things into three categories:

* Audience: (EASY, YA, Juvenile, Adult, etc.)
* Content: (Fiction, Non-Fiction, Biography, etc.)
* Format: (Large Print, VHS, DVD, CD-ROM, etc.)

The old ‘Item Type’ search is now re-labeled as ‘Collection Code’.

Please try out these new options, let us know if they work as you expected, or if there are types missing, etc.
POWER SEARCH

The POWER SEARCH is, plainly put, for infomaniacs :-). It exposes the full syntax of the library-created Z39.50 protocol in all its glory: search attributes, boolean operators, index scanning, the whole deal.

There are also two additional formal query syntax search boxes in the POWER SEARCH tab: CQL and PQF/RPN.
PROXIMITY SEARCH

The PROXIMITY SEARCH does one thing, and it does it well. It allows you to find words that are within a certain distance of each other in any of the fields listed in the drop-down box (let us know if you want others to be added).

Well ... that’s a start ... There are lots more features to show off, and some to refine ... above all, we really need your feedback on this system so we can make sure it’s meeting everyone’s expectations.

Ta ta for now,

Joshua

ZebraProtocolSupport Zebra Protocol Support (Z39.50, bib1, etc.)

http://wiki.koha.org/doku.php?id=zebraprotocolsupport

These attribute types are recognized regardless of attribute set. Some are recognized for search, others for scan.

Search

Type | Name | Version
7 | Embedded Sort | 1.1
8 | Term Set | 1.1
9 | Rank weight | 1.1
9 | Approx Limit | 1.4
10 | Term Ref | 1.4

Embedded Sort

The embedded sort is a way to specify a sort within a query, thus removing the need to send a Sort Request separately. It is faster and does not require clients that deal with the Sort Facility.

The value after attribute type 7 is 1=ascending, 2=descending. The attributes+term (APT) node is separate from the rest and must be @or'ed. The term associated with the APT is the level: 0=primary sort, 1=secondary sort, etc. Example:

Search for water, sort by title (ascending):

@or @attr 1=1016 water @attr 7=1 @attr 1=4 0

Search for water, sort by title ascending, then date descending:

@or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1

Term Set

The Term Set feature is a facility that allows a search to store hitting terms in a "pseudo" result set; thus a search (as usual) plus a scan-like facility. It requires a client that can do named result sets, since the search generates two result sets. The value for attribute 8 is the name of a result set (string). The terms in the term set are returned as SUTRS records.

Search for u in title, right truncated. Store the result in a result set named uset.

@attr 5=1 @attr 1=4 @attr 8=uset u

The model has one serious flaw: we don't know the size of the term set.

Rank weight

Rank weight is a way to pass a value to a ranking algorithm, so that one APT has one value while another has a different one.

Search for utah in title with weight 30 as well as any with weight 20.

@attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 utah

Approx Limit

Newer Zebra versions normally estimate a hit count for every APT (leaf) in the query tree. These hit counts are returned as part of the searchResult-1 facility.

By setting a limit for the APT we can make Zebra switch to an approximate hit count when a certain hit count limit is reached. A value of zero means exact hit count.

We are interested in an exact hit count for a, but for b we allow estimates for 1000 and higher.

@and a @attr 9=1000 b

This facility clashes with rank weight! Fortunately this is a Zebra 1.4 thing, so we can change this without upsetting anybody!

Term Ref

Zebra supports the searchResult-1 facility.

If attribute 10 is given, that specifies a subqueryId value returned as part of the search result. It is a way for a client to name an APT part of a query.

Scan

Type | Name | Version
8 | Result set narrow | 1.3
9 | Approx Limit | 1.4

Result set narrow

If attribute 8 is given for scan, the value is the name of a result set. Each hit count in scan is @and'ed with the result set given.

Approx limit

The approx limit (as for search) is a way to enable approximate hit counts for scan hit counts. However, it does NOT appear to work at the moment.

Installing Koha on Ubuntu amd64

http://wiki.koha.org/doku.php?id=ubuntu_amd64

Generally to install Koha on Ubuntu Dapper Drake’s amd64 platform you can just follow the instructions at http://www.kohadocs.org/Installing_Koha_on_Debian_sarge.html, but I found a few differences while I was going through it.

Note that I used the server edition rather than the desktop version.

Before installing “Event” from CPAN, install the build-essential package via apt-get. This will allow Event to install without failing.
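
For example:

$ sudo apt-get install build-essential
$ sudo perl -MCPAN -e 'install Event'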

Note that presently there are packages for the Yaz toolkit in Ubuntu universe, but they are a little old. If you’re just doing a plain install of Koha without updating to cvs or using Zebra, they’re probably fine, but if you’re planning on using the Zebra plugin then you should download the Yaz source tarball instead, and compile the packages yourself. (This is because installing Net::Z3950::ZOOM will fail because the version of Yaz is too old.)
Generating Yaz deb packages from source

The folks at Index Data have already done the package configuration for Yaz (and Zebra) so creating packages from the source is fairly simple:

* Download the newest source tarball from ftp.indexdata.dk/pub/yaz and untar it.
* Install fakeroot and debhelper with apt-get.
* Run "dpkg-buildpackage -rfakeroot -b".
* It will probably give you a list of dependent packages that are missing. Install them with apt-get, and repeat the last step.
* Once the package has finished building, cd up a directory, where you should find your .deb packages.
* Install them with "dpkg -i packagename".

If you are planning on installing zebra, you can follow the same procedure (downloading the idzebra tarball from /pub/zebra, of course).
Other notes

* cvs doesn't appear to be installed by default on the server edition. You'll have to apt-get it.
* I also had to install XML::SAX, Class::Accessor, and Business::ISBN from CPAN. For the Zebra plugin, I also had to install XML::SAX::Expat, XML::Parser, and XML::Simple.

Sunday, July 02, 2006

How do I go about maintaining a module when the author is unresponsive?

Sometimes a module goes unmaintained for a while because the author is pursuing other interests, is busy, etc., and another person who needs changes applied to that module may become frustrated when their email goes unanswered. CPAN does not mediate or dictate a policy in this situation and relies on the respective authors to work out the details. If you treat other authors as you would like to be treated in the same situation, the manner in which you go about dealing with such problems should be obvious.

* Be courteous.
* Be considerate.
* Make an earnest attempt to contact the author.
* Give it time. If you need changes made immediately, consider applying your patches to the current module, changing the version, and requiring that version for your application. Eventually the author will turn up and apply your patches, offer you maintenance of the module or, if the author doesn't respond in a year, you may be granted maintenance by showing interest.
* If you need changes in order for another module or application to work, consider making the needed changes and bundling the new version with your own distribution and noting the change well in the documentation. Do not upload the new version under the same namespace to CPAN until the matter has been resolved with the author or CPAN.

Simply keep in mind that you are dealing with a person who invested time and care into something. A little respect and courtesy go a long way.