
desktop-packages team mailing list archive

[Bug 1470032] [NEW] libpst / readpst incorrectly decodes latin1 contacts, etc.

 

Public bug reported:

After a client of ours moved from Exchange 2003 to Office 365 we had to
get some data out of PST files, which mostly worked well, but apparently
Contacts and some Tasks have a tendency to be incorrectly decoded into
gibberish.

As far as I can tell, the problem is that the data is interpreted as
UTF-16 that needs to be converted to UTF-8, and the charset I defined on
the command line for readpst is not consulted in this conversion.

When inspecting the debug log, it is clear to human eyes that this
conversion is incorrect: if anything, it should have been from the
charset I specified to UTF-8, not from UTF-16 to UTF-8.

As far as I can tell, the problem occurs in 'pst_vb_utf16to8', which
seems to be called indiscriminately, and it seems that the charset I
specify to readpst is rarely used, if ever.
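
To illustrate the suspected mechanism, here is a minimal Python sketch.
It does not call libpst; the helper name misdecode_as_utf16 is
hypothetical, and it assumes the stored bytes are latin1 and are read as
little-endian UTF-16. Under that assumption every pair of latin1 bytes
collapses into one unrelated code point, and an odd trailing byte turns
into U+FFFD, which would match the replacement characters at the end of
several fields in the sample.

```python
def misdecode_as_utf16(text: str, charset: str = "latin-1") -> str:
    """Encode text in a single-byte charset, then wrongly read it as UTF-16LE.

    Hypothetical helper that mimics the suspected bug; not libpst code.
    """
    raw = text.encode(charset)
    # Two bytes become one code unit. With an odd byte count the last
    # byte has no partner and is replaced by U+FFFD.
    return raw.decode("utf-16-le", errors="replace")


garbled = misdecode_as_utf16("Ballerup Politi")  # 15 latin1 bytes
print(len(garbled))                  # 7 paired code units + 1 replacement
print(garbled.endswith("\ufffd"))    # the dangling byte was lost
```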

I wonder if it would be possible to have a switch that presents the user
with the unconverted version, possibly decoded under a couple of
candidate encodings, and lets the user decide the proper one. There are
several contacts that are fine, but over 200 that suffer from this
garbling of the data. Unfortunately it is more or less impossible to get
from the UTF-8 version of the non-UTF-16 data back to latin1, as far as
I can tell.
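
As a rough sketch of what such a switch could show, the Python below
re-encodes a garbled string as UTF-16LE to get the raw bytes back and
decodes them under a few candidate charsets. The function names and the
candidate list are assumptions made for illustration, not anything
readpst provides. The recovery is only partial: a byte that was replaced
by U+FFFD is gone, and latin1 bytes 0xD8-0xDF (which include Ø) can land
in the UTF-16 surrogate range and be lost in the same way.

```python
REPLACEMENT = "\ufffd"


def undo_utf16_misdecode(garbled: str, charset: str = "latin-1") -> str:
    """Best-effort reversal of single-byte text wrongly read as UTF-16LE."""
    # Bytes behind U+FFFD are unrecoverable, so drop the markers.
    cleaned = garbled.replace(REPLACEMENT, "")
    raw = cleaned.encode("utf-16-le")
    return raw.decode(charset, errors="replace")


def candidates(garbled: str,
               charsets=("latin-1", "cp1252", "iso-8859-15")) -> dict:
    """Decode the recovered bytes under each candidate charset."""
    return {cs: undo_utf16_misdecode(garbled, cs) for cs in charsets}


# Simulate the suspected bug on known text, then reverse it.
garbled = "Johnny".encode("latin-1").decode("utf-16-le")
print(undo_utf16_misdecode(garbled))  # Johnny

# Odd length: the final byte was replaced and cannot be restored.
odd = "Kloster".encode("latin-1").decode("utf-16-le", errors="replace")
print(undo_utf16_misdecode(odd))      # Kloste

for name, text in candidates(garbled).items():
    print(f"{name}: {text}")
```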

This is a sample contact that has the issue (Most are totally illegible, but a few had some text I could search for):
FN:Ballerup Politi
N:汋獯整�;潊湨祮;;;
EMAIL:慂汬牥灵倠汯瑩<U+2069>䨨䡃灀汯瑩<U+2E69>此�
ADR;TYPE=work:;;;;;;
LABEL;TYPE=work:汇<U+202E><U+E552>桤獵敶<U+206A>㤱\n慂汬牥灵 㜲〵\n慄浮牡�
TEL;TYPE=work,voice:㤳㔠‴㐱㐠‸潬慫<U+206C>㐠㌲�
TEL;TYPE=cell,voice: 72 58 78 29    (20 90 98 02)
TITLE:楖散潰楬楴潫浭獩狦
NOTE:Gladsaxe Politi (kredsen)  3969 1448\n
VERSION: 3.0
END:VCARD

Attached is the debug output from parsing this contact.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: pst-utils 0.6.59-1build1
ProcVersionSignature: Ubuntu 3.13.0-24.47-generic 3.13.9
Uname: Linux 3.13.0-24-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.11
Architecture: amd64
CurrentDesktop: X-Cinnamon
Date: Tue Jun 30 11:05:51 2015
EcryptfsInUse: Yes
InstallationDate: Installed on 2014-07-27 (337 days ago)
InstallationMedia: Linux Mint 17 "Qiana" - Release amd64 20140624
ProcEnviron:
 SHELL=/bin/bash
 TERM=xterm
 PATH=(custom, no user)
 LANG=da_DK.UTF-8
 XDG_RUNTIME_DIR=<set>
SourcePackage: libpst
UpgradeStatus: No upgrade log present (probably fresh install)

** Affects: libpst (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug qiana

** Attachment added: "readpst-decode-error.txt"
   https://bugs.launchpad.net/bugs/1470032/+attachment/4422292/+files/readpst-decode-error.txt

-- 
You received this bug notification because you are a member of Desktop
Packages, which is subscribed to libpst in Ubuntu.
https://bugs.launchpad.net/bugs/1470032

Title:
  libpst / readpst incorrectly decodes latin1 contacts, etc.

Status in libpst package in Ubuntu:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/libpst/+bug/1470032/+subscriptions

