pymarc-team team mailing list archive

Re: Writing UTF-8 MARC records in pymarc

To: Ted Lawless <lawlesst@xxxxxxxxx>
From: Ed Summers <ehs@xxxxxxxxx>
Date: Fri, 25 Sep 2009 22:40:22 -0400
Cc: pymarc-team@xxxxxxxxxxxxxxxxxxx
In-reply-to: <2848e84a0909251455p40d77eddi9d0e362a38652934@mail.gmail.com>
Sender: ed.summers@xxxxxxxxx

Hi Ted:

This has been a long-standing issue with pymarc, and one that I'm
surprised hasn't come up before. The problem is that the record's
leader and directory contain byte offsets and lengths that indicate
where fields are located in the record bytestream. But most sane
people (like you) are working with Unicode strings. pymarc was using
len() on those strings to calculate the offsets, which counts
characters rather than bytes, so it breaks down with multibyte
serializations like UTF-8.
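
To make the mismatch concrete, here is a minimal sketch that is not taken from pymarc itself. It uses Python 3 spelling (chr() where the code in this thread uses unichr()), and the variable names are only illustrative:

```python
# One character, U+1234, the same code point used in the unit test in this thread.
value = chr(0x1234)

# What ends up in the MARC bytestream once the value is serialized as UTF-8.
encoded = value.encode("utf-8")

char_count = len(value)    # counts characters (code points)
byte_count = len(encoded)  # counts bytes, which is what MARC offsets are measured in

print(char_count, byte_count)
```

Any directory entry computed from char_count would undercount the real size of the field, and every offset after it would be shifted by the difference.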

I took a stab [1] at modifying Record.as_marc and Record.decode_marc
to encode to and decode from UTF-8. I doubt that makes any sense, but
I think it is safe to do, and it will interoperate with MARC-8 data.
If you get a chance, please give it a try:

   bzr branch lp:pymarc
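
The general idea of encoding first and measuring second can be sketched as follows. This is not the actual Record.as_marc code; encode_field and build_directory are hypothetical helpers written for illustration (Python 3), relying only on the MARC 21 layout of a 12-character directory entry: 3-character tag, 4-digit field length, 5-digit starting position.

```python
FIELD_TERMINATOR = b"\x1e"   # ends every field; counted in the field length
SUBFIELD_DELIMITER = "\x1f"  # precedes every subfield code


def encode_field(indicators, subfields):
    """Hypothetical helper: serialize one data field to UTF-8 bytes."""
    text = "".join(indicators)
    for code, value in subfields:
        text += SUBFIELD_DELIMITER + code + value
    return text.encode("utf-8") + FIELD_TERMINATOR


def build_directory(fields):
    """Hypothetical helper: build directory entries from (tag, bytes) pairs.

    Lengths and offsets are taken from the encoded bytes, never from
    the length of the original Unicode strings.
    """
    entries = []
    offset = 0
    for tag, field_bytes in fields:
        entries.append("{:0>3}{:04d}{:05d}".format(tag, len(field_bytes), offset))
        offset += len(field_bytes)
    return entries


title = encode_field(["1", "0"], [("a", chr(0x1234))])
note = encode_field([" ", " "], [("a", "note")])
directory = build_directory([("245", title), ("500", note)])
print(directory)
```

The title field holds one character in subfield a but occupies 8 bytes (two indicators, a delimiter, the subfield code, three UTF-8 bytes, and the terminator), so the second field starts at offset 8 rather than 6.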

I added a unittest to test/marc8.py which I think simulates (roughly)
what you were doing:

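     from pymarc import Field, MARCReader, MARCWriter, Record

     # write a record whose 245$a holds a single non-ASCII character
     record = Record()
     record.add_field(Field(245, ['1', '0'], ['a', unichr(0x1234)]))
     writer = MARCWriter(open('test/foo', 'wb'))  # binary mode: MARC is a bytestream
     writer.write(record)
     writer.close()

     # read it back and check that the value round-trips
     reader = MARCReader(open('test/foo', 'rb'))
     record = reader.next()
     self.assertEqual(record['245']['a'], unichr(0x1234))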

I'd be interested to hear if the latest code fixes your problem.

//Ed

Follow ups

Re: Writing UTF-8 MARC records in pymarc
From: Ed Summers, 2009-09-26

References

Writing UTF-8 MARC records in pymarc
From: Ted Lawless, 2009-09-25