
pymarc-team team mailing list archive

Re: Writing UTF-8 MARC records in pymarc

 

Hi Ted:

This has been a long-standing issue with pymarc, and one that I'm
surprised hasn't come up before. The problem is that the record's
leader and directory contain byte lengths and offsets that indicate
where fields are located in the record bytestream. But most sane people
(like you) are working with Unicode strings. pymarc was using len() to
calculate those lengths and offsets, but len() on a Unicode string
counts characters, not bytes, so the numbers come out wrong with
multibyte serializations like UTF-8.
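
To make the mismatch concrete, here is a small standalone sketch (not pymarc code) that builds the text of a 245 field and compares its length in characters with its length in UTF-8 bytes. It is written in Python 3 syntax, where str is what Python 2 calls unicode. The helper names field_text and directory_entry are made up for illustration; the 12-character directory entry layout (3-character tag, 4-digit field length, 5-digit starting position) is the standard MARC 21 one.

```python
FIELD_TERMINATOR = "\x1e"
SUBFIELD_DELIMITER = "\x1f"

def field_text(indicators, code, value):
    # Hypothetical helper: indicators, one subfield, then the field terminator.
    return indicators + SUBFIELD_DELIMITER + code + value + FIELD_TERMINATOR

def directory_entry(tag, length, offset):
    # Hypothetical helper: 3-char tag, 4-digit length, 5-digit starting position.
    return "%s%04d%05d" % (tag, length, offset)

text = field_text("10", "a", "caf\u00e9 \u1234")

char_length = len(text)                  # counts characters
byte_length = len(text.encode("utf-8"))  # counts the bytes that get written

wrong_entry = directory_entry("245", char_length, 0)
right_entry = directory_entry("245", byte_length, 0)

print(char_length, byte_length)
print(wrong_entry, right_entry)
```

The two lengths differ as soon as the field contains any non-ASCII character, and every offset after that field is then off by the accumulated difference.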

I took a stab [1] at modifying Record.as_marc and Record.decode_marc
to encode to and decode from UTF-8. I'm not sure how much sense that
makes, but I think it is safe to do, and it will interoperate with
MARC-8 data. If you get a chance, please give it a try:

   bzr branch lp:pymarc
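
The general idea, as opposed to the actual patch, can be sketched like this: encode each value to UTF-8 before measuring it on the way out, and slice the raw bytes by offset before decoding on the way in. This is a simplified illustration in Python 3 syntax; pack_fields and unpack_fields are hypothetical names, not pymarc functions, and the "directory" here is just a list of (offset, length) pairs rather than a real MARC directory.

```python
END_OF_FIELD = b"\x1e"

def pack_fields(values):
    """Encode each value first, then record (offset, length) in bytes."""
    directory, data, offset = [], b"", 0
    for value in values:
        chunk = value.encode("utf-8") + END_OF_FIELD
        directory.append((offset, len(chunk)))
        data += chunk
        offset += len(chunk)
    return directory, data

def unpack_fields(directory, data):
    """Slice the raw bytes by offset, drop the terminator, then decode."""
    return [data[start:start + length - 1].decode("utf-8")
            for start, length in directory]

values = ["caf\u00e9", "\u1234", "plain ascii"]
directory, data = pack_fields(values)
round_tripped = unpack_fields(directory, data)

print(directory)
print(round_tripped == values)
```

Because lengths are only ever taken on encoded bytes, the offsets stay valid no matter how many bytes each character needs.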

I added a unittest to test/marc8.py which I think simulates (roughly)
what you were doing:

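     from pymarc import Record, Field, MARCWriter, MARCReader

     # write a record whose title contains a non-ASCII character
     record = Record()
     record.add_field(Field(245, ['1', '0'], ['a', unichr(0x1234)]))
     writer = MARCWriter(open('test/foo', 'wb'))
     writer.write(record)
     writer.close()

     # read it back and check that the character survived the round trip
     reader = MARCReader(open('test/foo', 'rb'))
     record = reader.next()
     self.assertEqual(record['245']['a'], unichr(0x1234))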

I'd be interested to hear if the latest code fixes your problem.

//Ed


