Friday, February 16, 2007

MARC::File::XML

http://lists.gnu.org/archive/html/koha-devel/2006-05/msg00036.html

Hi everyone,

Just providing an update on this issue. As you may recall, I've
been putting the MARC::Record suite, specifically MARC::File::XML
and MARC::Charset, through some fairly rigorous tests, including
a 'roundtrip' test, which converts binary MARC-8 records to
MARCXML / UTF-8 and then back to binary MARC, this time encoded as UTF-8.
This test is available here:

http://liblime.com/public/roundtrip.pl

I discovered a number of bugs or issues, not in the MARC::* stuff, but in the
back-end SAX parsers. I'll just summarize my discoveries here for
posterity:

1. MARC::File::XML, if it encounters an unmapped encoding in a
MARC-8 encoded binary MARC file (in as_xml()), will drop the entire
subfield where the improper encoding exists. The simple solution is
to always use: MARC::Charset->ignore_errors(1); if you expect your
data will have improper encoding.

2. The XML::SAX::PurePerl parser cannot properly handle combining
characters. I've reported this bug here:

http://rt.cpan.org/Public/Bug/Display.html?id=19543

At the suggestion of several people, I tried replacing my default system
parser with expat, which caused another problem:

3. Handing valid UTF-8 encoded XML to new_from_xml() sometimes
causes the entire record to be destroyed when using XML::SAX::Expat
as the parser (with PurePerl these seem to work). It fails with
the error:

not well-formed (invalid token) at line 23, column 43, byte 937 at
/usr/lib/perl5/XML/Parser.pm line 187

I haven't been able to track down the cause of this bug. I eventually
found a workaround that didn't result in the above error but instead
silently mangled the resulting binary MARC record on the way out:

4. Using incompatible versions of XML::SAX::LibXML and libxml2 will
cause binary MARC records to be mangled when passed through new_from_xml()
in some cases. The solution here is to make sure you're running
compatible versions of XML::SAX::LibXML and libxml2. I run Debian
Sarge, and when I just used the package maintainer's versions it
fixed the bug. It's unclear to me why the binary MARC would be
mangled. This may indicate a problem with MARC::*, but I haven't
had time to track it down, and since installing compatible versions
of the parser back-end solves the problem, I can only assume it's
the fault of the incompatible parsers.

Issues #3 and #4 above can be replicated by running the following batch
of records through the roundtrip.pl script above:

http://liblime.com/public/several.mrc

If you want to test #2, try running this record through roundtrip.pl:

http://liblime.com/public/combiningchar.mrc
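
For readers who can't fetch roundtrip.pl, the kind of check described
above can be sketched roughly as follows. This is not the roundtrip.pl
script itself, only a minimal illustration that assumes MARC::Record,
MARC::Charset and MARC::File::XML are installed; the roundtrip() helper
and the command-line usage are made up for the example.

```perl
use strict;
use warnings;
use MARC::Batch;
use MARC::Charset;
use MARC::Record;
use MARC::File::XML (BinaryEncoding => 'utf8');

# Item 1 above: keep going when MARC-8 data contains unmapped sequences.
MARC::Charset->ignore_errors(1);

# Hypothetical helper: convert one record to MARCXML and parse it back.
sub roundtrip {
    my ($record) = @_;
    my $xml = $record->as_xml();                       # record -> MARCXML / UTF-8
    return MARC::Record->new_from_xml($xml, 'UTF-8');  # MARCXML -> MARC::Record
}

# Hypothetical usage: perl sketch.pl several.mrc
if (@ARGV) {
    my $batch = MARC::Batch->new('USMARC', $ARGV[0]);
    my $n = 0;
    while (my $record = $batch->next) {
        $n++;
        # A parser failure such as issue #3 shows up here as an exception.
        my $copy = eval { roundtrip($record) };
        warn "record $n failed to roundtrip: $@" unless $copy;
    }
    print "checked $n record(s)\n";
}
```

A record that comes back without an exception can still be mangled
(issue #4), so a fuller check would also compare the fields of the
original and the copy.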

BTW: you can change your default SAX parser by editing the .ini file ...
mine is located in /usr/local/share/perl/5.8.4/XML/SAX/ParserDetails.ini
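
If you'd rather not edit the .ini file, the parser can also be inspected
and overridden from within a script. The sketch below assumes XML::SAX
is installed and, for the override, that XML::LibXML is too; the
default_parser_class() helper is just a name chosen for the example. As
far as I know, the factory falls back to the last parser listed in
ParserDetails.ini when nothing else is requested.

```perl
use strict;
use warnings;
use XML::SAX;
use XML::SAX::ParserFactory;

# Hypothetical helper: class name of the parser the factory hands out.
sub default_parser_class {
    return ref( XML::SAX::ParserFactory->parser() );
}

# Parsers registered in ParserDetails.ini, in file order.
print "registered: $_->{Name}\n" for @{ XML::SAX->parsers() };
print "default:    ", default_parser_class(), "\n";

# Per-script override that leaves ParserDetails.ini untouched.
{
    local $XML::SAX::ParserPackage = 'XML::LibXML::SAX';
    # eval guards against XML::LibXML not being installed.
    my $class = eval { default_parser_class() } || 'not available';
    print "overridden: $class\n";
}
```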

So the bottom line is, if you want to use MARC::File::XML in any
serious application, you've got to use compatible versions of the
libxml2 parser and XML::SAX::LibXML. Check the README in the perl
package for documentation on which are compatible...

Maybe a note somewhere in the MARC::File::XML documentation to point
these issues out would be useful. Also, it wouldn't be too bad to have
a few tests to make sure that the system's default SAX parser is capable
of handling these cases. Just my two cents.
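
A check of that sort could look something like the sketch below. It only
exercises the SAX layer (not MARC::File::XML), assumes XML::SAX is
installed, and the TextCollector handler and parser_preserves() function
are hypothetical names invented for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use XML::SAX::ParserFactory;

# Minimal SAX handler that concatenates all character data it receives.
package TextCollector;
use base 'XML::SAX::Base';
sub start_document { $_[0]{text} = '' }
sub characters     { $_[0]{text} .= $_[1]{Data} }

package main;

# Returns 1 if the default SAX parser hands back $text unchanged, else 0.
sub parser_preserves {
    my ($text) = @_;
    my $xml = encode('UTF-8',
        qq{<?xml version="1.0" encoding="UTF-8"?><t>$text</t>});
    my $handler = TextCollector->new;
    my $ok = eval {
        XML::SAX::ParserFactory->parser(Handler => $handler)
                               ->parse_string($xml);
        1;
    };
    return $ok && $handler->{text} eq $text ? 1 : 0;
}

# "e" followed by U+0301 COMBINING ACUTE ACCENT, as in issue #2.
if (parser_preserves("e\x{0301}")) {
    print "default SAX parser handles combining characters\n";
}
else {
    warn "default SAX parser altered or rejected a combining character\n";
}
```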

Cheers,
