duplicity-team team mailing list archive

Help with unicode branch (for Python3 support)

 

Hello all,

I am hoping for some help to iron out a small testing bug with:
https://code.launchpad.net/~aaron-whitehouse/duplicity/08-unicode
so that I can get the code committed. I believe the code itself is working correctly, but our test setup (tox, pexpect, etc.) creates an environment that is identified as ASCII rather than UTF-8, and that makes the tests fail.


         *The branch aims to ease Python 2/3 compatibility*

For context, this branch aims to ease the conversion of duplicity to be Python 2/3 compatible. It looks to me as though the key stumbling block in previous efforts has been the string unicode/bytes distinction in Python 3. My plan with this branch was therefore to take manageable sections of duplicity and convert the strings to Python 2 unicode/bytes strings, making it much easier to then convert that code to Python 3 in the future, but in a way that can be committed straight away to the existing code base.


         *Using sys.getfilesystemencoding() misdetects 'ascii' in tests*

As the branch currently stands, all tests pass. If, however, on my UTF-8 system you change this line (util.py, line 66):

return bytes_filename.decode("UTF-8", "ignore")

to:

return bytes_filename.decode(sys.getfilesystemencoding(), "ignore")

then tests (mainly in testing.functional.test_selection.TestUnicode) fail. Changing "ignore" in the above line to "strict" produces errors that point to an encoding problem:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 25: ordinal not in range(128)
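To illustrate why "ignore" can hide this kind of failure while "strict" exposes it, here is a minimal sketch. It assumes Python 3 and hard-codes the "ascii" codec to stand in for whatever sys.getfilesystemencoding() reports inside the test run; the variable names are illustrative only and are not taken from util.py.

```python
# UTF-8 bytes for the start of one of the test paths.
bytes_filename = "testfiles/select-unicode/прыклад".encode("utf-8")

# "ignore" silently drops every byte the ASCII codec cannot decode,
# so the non-ASCII directory name simply disappears.
lossy = bytes_filename.decode("ascii", "ignore")
print(repr(lossy))

# "strict" raises instead, reporting the offending byte and its offset.
strict_error = None
try:
    bytes_filename.decode("ascii", "strict")
except UnicodeDecodeError as error:
    strict_error = error
    print(error)
```

The offset reported by the strict decode is 25, which is the first byte after the 25-character ASCII prefix "testfiles/select-unicode/", and 0xd0 is the lead byte of the Cyrillic "п" in UTF-8. That is consistent with the path bytes being valid UTF-8 that is being decoded with the wrong codec.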

The error occurs even though sys.getfilesystemencoding() returns "UTF-8" on my setup. Adding a print statement that shows the result of sys.getfilesystemencoding() reveals it changing from "UTF-8" to "ANSI_X3.4-1968" (a name for ASCII) once the code is running inside the child process spawned by:

child = pexpect.spawn(b'/bin/sh',
                      [b'-c', cmdline.encode(sys.getfilesystemencoding(), 'replace')],
                      timeout=None)
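One way to check whether the child's encoding simply follows the locale variables it inherits is to start a child Python with an explicit environment and ask it what it sees. The sketch below is an assumption-laden probe, not part of the duplicity test suite: child_env and child_filesystem_encoding are hypothetical helpers, it uses subprocess rather than pexpect, and it assumes a "C.UTF-8" locale exists on the system.

```python
import os
import subprocess
import sys


def child_env(locale_name):
    """Copy the current environment, overriding the locale variables."""
    env = dict(os.environ)
    env["LANG"] = locale_name
    env["LC_ALL"] = locale_name  # LC_ALL takes precedence over LANG
    return env


def child_filesystem_encoding(locale_name):
    """Ask a child interpreter which filesystem encoding it detects."""
    code = "import sys; print(sys.getfilesystemencoding())"
    output = subprocess.check_output(
        [sys.executable, "-c", code], env=child_env(locale_name)
    )
    return output.decode("ascii").strip()


# Compare a plain C locale with a UTF-8 one (the latter is an assumption:
# "C.UTF-8" is not available on every system).
for name in ("C", "C.UTF-8"):
    print(name, "->", child_filesystem_encoding(name))
```

On Python 2 I would expect the two lines to differ if the locale is the cause; newer Python 3 releases may report UTF-8 in both cases because they can treat the C locale specially. If the locale does turn out to be the cause, pexpect.spawn accepts an env argument, so a dictionary like the one built by child_env could be passed there as well.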


         *This appears to be a problem only with the test suite*

The test suite prints a copy of the failing command, for example (from testing.functional.test_selection.TestUnicode.test_unicode_paths_non_globbing):

...command: "setsid" "-w" "duplicity" "full" "testfiles/select-unicode" "file://testfiles/output" "--volsize" "1" "--exclude" "testfiles/select-unicode/прыклад/пример/例/Παράδειγμα/उदाहरण.txt" "--exclude" "testfiles/select-unicode/прыклад/пример/例/Παράδειγμα/דוגמא.txt" "--exclude" "testfiles/select-unicode/прыклад/пример/例/მაგალითი/" "--include" "testfiles/select-unicode/прыклад/пример/例/" "--exclude" "testfiles/select-unicode/прыклад/пример/" "--include" "testfiles/select-unicode/прыклад/" "--include" "testfiles/select-unicode/օրինակ.txt" "--exclude" "testfiles/select-unicode/**" "-v0" "--no-print-statistics" "--allow-source-mismatch" "--archive-dir=testfiles/cache" < /dev/null

If this command (with PYTHONPATH added and the duplicity path adjusted, executed from the "testing" folder with "testfiles.tar.gz" extracted) is run directly on the command line:

$ PYTHONPATH=../ "../bin/duplicity" "full" "testfiles/select-unicode" "file://testfiles/output" "--volsize" "1" "--exclude" "testfiles/select-unicode/прыклад/пример/例/Παράδειγμα/उदाहरण.txt" "--exclude" "testfiles/select-unicode/прыклад/пример/例/Παράδειγμα/דוגמא.txt" "--exclude" "testfiles/select-unicode/прыклад/пример/例/მაგალითი/" "--include" "testfiles/select-unicode/прыклад/пример/例/" "--exclude" "testfiles/select-unicode/прыклад/пример/" "--include" "testfiles/select-unicode/прыклад/" "--include" "testfiles/select-unicode/օրինակ.txt" "--exclude" "testfiles/select-unicode/**" "-v0" "--no-print-statistics" "--allow-source-mismatch" "--archive-dir=testfiles/cache"

everything works correctly, and restoring the files (using this branch) and checking them manually confirms that the backup worked (even with "strict"). The print statement also shows that the encoding reported by sys.getfilesystemencoding() is "UTF-8" throughout.


         *Help requested*

Can anybody suggest what I can do to force the testing environment to be UTF-8, or at least be detected as such by sys.getfilesystemencoding? Alternatively, what is the least awful way to make the tests work enough to get the (apparently working) code committed?

Many thanks,

Aaron
