Friday, August 18, 2006

MARC::SAX

cpan -i XML::LibXML LWP::Simple XML::Simple

http://www.nntp.perl.org/group/perl.perl4lib/2369

Just providing an update on this issue. As you may
recall, I've been putting the MARC::Record suite,
specifically MARC::File::XML and MARC::Charset, through
some fairly rigorous tests, including a 'roundtrip'
test, which converts the binary MARC-8 records to
MARCXML / UTF-8 and then back to binary MARC but
encoded as UTF-8. This test is available here:

http://liblime.com/public/roundtrip.pl


I discovered a number of bugs or issues, not in the
MARC::* stuff, but in the back-end SAX parsers. I'll
just summarize my discoveries here for posterity:

1. MARC::File::XML, if it encounters unmapped
encoding in a MARC-8 encoded binary MARC file (in
as_xml()), will drop the entire subfield where the
improper encoding exists. The simple solution is to
always use MARC::Charset->ignore_errors(1); if you
expect your data will have improper encoding.
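
For illustration, here is a minimal sketch of where that
call might go in a conversion script. The helper name
marc_file_to_xml and the file name in the usage comment
are just placeholders of mine, and I'm assuming
MARC::Batch is used to read the binary file:

```perl
use strict;
use warnings;
use MARC::Batch;
use MARC::Charset;
use MARC::File::XML;

# Set once, before any conversion happens, so unmapped MARC-8
# sequences are ignored rather than costing the whole subfield.
MARC::Charset->ignore_errors(1);

# Hypothetical helper: read a binary MARC file and return one
# MARCXML string per record.
sub marc_file_to_xml {
    my ($path) = @_;
    my $batch = MARC::Batch->new('USMARC', $path);
    my @xml;
    while (my $record = $batch->next) {
        push @xml, $record->as_xml();
    }
    return @xml;
}

# Hypothetical usage:
#   print marc_file_to_xml('several.mrc');
```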

2. The XML::SAX::PurePerl parser cannot properly
handle combining characters. I've reported this bug
here:

http://rt.cpan.org/Public/Bug/Display.html?id=19543
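
To find out which records in a batch would exercise this
bug, one option is to scan the UTF-8 output for combining
marks before handing it to the parser. This is only a
sketch using core modules; the function name
combining_marks is my own and is not part of MARC::*:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return the combining marks (Unicode general category M) found
# in a UTF-8 encoded byte string, formatted as U+XXXX labels.
sub combining_marks {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);
    return map { sprintf('U+%04X', ord($_)) } $chars =~ /(\p{M})/g;
}

# "cafe" with a COMBINING ACUTE ACCENT (U+0301) after the "e",
# written out as raw UTF-8 bytes.
my @marks = combining_marks("cafe\xCC\x81");
print "combining marks: @marks\n";
```

Any record for which this returns a non-empty list is a
candidate for the roundtrip test.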


At the suggestion of several people, I tried replacing
my default system parser with expat, which caused
another problem:

3. Handing valid UTF-8 encoded XML to new_from_xml()
sometimes causes the entire record to be destroyed
when using XML::SAX::Expat as the parser (with
PurePerl these seem to work). It fails with the error:

not well-formed (invalid token) at line 23, column 43,
byte 937 at /usr/lib/perl5/XML/Parser.pm line 187

I haven't been able to track down the cause of this
bug. I eventually found a workaround that didn't
result in the above error but instead silently mangled
the resulting binary MARC record on the way out:

4. Using incompatible versions of XML::SAX::LibXML and
libxml2 will cause binary MARC records to be mangled
when passed through new_from_xml() in some cases. The
solution here is to make sure you're running
compatible versions of XML::SAX::LibXML and libxml2. I
run Debian Sarge, and when I just used the package
maintainer's versions it fixed the bug. It's unclear
to me why the binary MARC would be mangled. This may
indicate a problem with MARC::*, but I haven't had
time to track it down, and since installing
compatible versions of the parser back-end solves the
problem, I can only assume it's the fault of the
incompatible parsers.

Issues #3 and #4 above can be replicated by running
the following batch of records through the
roundtrip.pl script above:

http://liblime.com/public/several.mrc

If you want to test #2, try running this record
through roundtrip.pl:

http://liblime.com/public/combiningchar.mrc

BTW: you can change your default SAX parser by editing
the .ini file ... mine is located in
/usr/local/share/perl/5.8.4/XML/SAX/ParserDetails.ini
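
If you'd rather not edit the .ini file, XML::SAX also
lets you inspect and override the parser from code. The
sketch below assumes the code you're calling obtains its
parser through XML::SAX::ParserFactory (as this thread
implies MARC::File::XML does) and that XML::SAX::LibXML
is installed; adjust the package name to taste:

```perl
use strict;
use warnings;
use XML::SAX;
use XML::SAX::ParserFactory;

# List the parsers registered in ParserDetails.ini.
print "registered: $_->{Name}\n" for @{ XML::SAX->parsers };

# Show which parser the factory hands out by default.
print "default: ", ref(XML::SAX::ParserFactory->parser), "\n";

# Override the choice for this process only, leaving the .ini
# file untouched. The eval guards against the module being absent.
{
    local $XML::SAX::ParserPackage = 'XML::SAX::LibXML';
    my $parser = eval { XML::SAX::ParserFactory->parser };
    if ($parser) {
        print "override: ", ref($parser), "\n";
    }
    else {
        print "XML::SAX::LibXML not available: $@\n";
    }
}
```

Because the override is a package variable, it affects
every caller of the factory within that scope, which is
handy for running roundtrip.pl against each parser in
turn.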

So the bottom line is, if you want to use
MARC::File::XML in any serious application, you've got
to use compatible versions of the
libxml2 parser and XML::SAX::LibXML. Check the README
in the perl package for documentation on which are
compatible...

Maybe a note somewhere in the MARC::File::XML
documentation to point these issues out would be
useful. Also, it wouldn't be too bad to have
a few tests to make sure that the system's default SAX
parser is capable of handling these cases. Just my two
cents.
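
As a rough idea of what such a test could look like,
here is a sketch that feeds a tiny UTF-8 document with a
combining character to whatever parser the system hands
out and reports whether the text came back intact. The
TextCollector class and text_via_default_parser function
are names I made up for this example; nothing like this
ships with MARC::File::XML as far as I know:

```perl
use strict;
use warnings;
use XML::SAX::ParserFactory;

# Minimal SAX handler that concatenates all character data.
package TextCollector;
use base 'XML::SAX::Base';

sub characters {
    my ($self, $data) = @_;
    $self->{text} .= $data->{Data};
}

package main;

# Parse $xml with the system's default SAX parser and return
# the parser's class name plus the text it reported.
sub text_via_default_parser {
    my ($xml) = @_;
    my $handler = TextCollector->new;
    $handler->{text} = '';
    my $parser = XML::SAX::ParserFactory->parser(Handler => $handler);
    $parser->parse_string($xml);
    return (ref($parser), $handler->{text});
}

# "e" followed by COMBINING ACUTE ACCENT (U+0301) as UTF-8 bytes.
my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n<r>e\xCC\x81</r>};

my ($class, $text) = eval { text_via_default_parser($xml) };
if (!defined $class) {
    print "parser died: $@\n";
}
else {
    my $verdict = $text eq "e\x{0301}" ? 'preserved' : 'NOT preserved';
    print "$class: combining character $verdict\n";
}
```

A real test suite would want more cases than this one,
for instance the records linked above.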
