duplicity-team team mailing list archive

Re: Help with unicode branch (for Python3 support)

 

Hi Aaron,

Not sure about adding ANSI_X3.4-1968.  That would mean that the system
indeed had its locale specified properly.

The values 'ascii' and None were returned when locale was not specified at
all (detached or cron jobs on some setups).  ANSI implies a proper locale
setting.
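
For reference, a minimal sketch of the fallback as it stands in this
thread.  The helper name resolve_fsencoding and its override and
treat_as_unset parameters are made up for illustration and are not in the
branch.  'ANSI_X3.4-1968' is left out of the default list on purpose and
has to be opted into explicitly:

```python
import sys

# Values that, per this thread, mean no locale was configured at all.
# 'ANSI_X3.4-1968' is deliberately NOT included by default.
UNSET_ENCODINGS = ('ascii', None)


def resolve_fsencoding(detected, override=None,
                       treat_as_unset=UNSET_ENCODINGS):
    """Hypothetical helper: choose the encoding used for filenames.

    detected       -- normally the result of sys.getfilesystemencoding()
    override       -- an explicit user choice (e.g. a command line option)
    treat_as_unset -- detected values that should fall back to 'utf-8'
    """
    if override is not None:
        # An explicit choice always wins over detection.
        return override
    if detected in treat_as_unset:
        # Locale was not set (cron, detached jobs): assume UTF-8.
        return 'utf-8'
    return detected


fsencoding = resolve_fsencoding(sys.getfilesystemencoding())
```

With these defaults resolve_fsencoding('ANSI_X3.4-1968') returns the value
unchanged, so adding it to the list stays a separate decision.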

...Ken


On Wed, Nov 29, 2017 at 4:58 PM, Aaron <lists@xxxxxxxxxxxxxxxxxx> wrote:

> Many thanks for this, Ken, it seems to have done the trick (after adding
> 'ANSI_X3.4-1968' to the list)!
>
> I will tidy up the branch and get it submitted.
>
> Presumably the effect of this is to treat systems that say they are ASCII
> etc as UTF-8, so while it is perfect in reading/decoding filenames, I am a
> little more nervous to use this global for encoding/writing filenames.
> Presumably it wouldn't be an issue for Linux filesystems as the filenames
> are bytes, but are there any other filesystems that might declare as ASCII
> and not accept UTF-8?
>
> Kind regards,
>
> Aaron
>
> On 19/11/17 12:24, Kenneth Loafman wrote:
>
> Further reading suggests:
> globals.fsencoding = fse if fse not in ['ascii', None] else 'utf-8'
>
> Gack!  This is turning into a real fubar!
>
> ...Ken
>
>
> On Sat, Nov 18, 2017 at 3:55 PM, Kenneth Loafman <kenneth@xxxxxxxxxxx>
> wrote:
>
>> It seems we are fighting an old Python bug still extant in Python 3.
>> Namely, sys.getfilesystemencoding() will return ascii if the LC_* variables
>> are not set (cron or other detached processes).  At one time in the early 3
>> series they defaulted to utf-8 if ascii was returned.  Then, as I
>> understand it, the purists won and ascii is returned now.  So, I think
>> defaulting to utf-8 was a good enough idea, except we should allow an
>> override.  I suggest we allow an option in case the FS is really something
>> other than utf-8, but do something like this in globals.py:
>>
>> fse = sys.getfilesystemencoding()
>> globals.fsencoding = fse if fse != 'ascii' else 'utf-8'
>>
>> Then allow it to be overridden in command line processing if needed.
>> Replace the two sys.getfilesystemencoding() with globals.fsencoding and we
>> should be 99% there.
>>
>> ...Ken
>>
>>
>>
>>
>>
>> On Wed, Nov 15, 2017 at 9:40 AM, Kenneth Loafman <kenneth@xxxxxxxxxxx>
>> wrote:
>>
>>> Google for 'tox getfilesystemencoding' and 'setup.py test
>>> getfilesystemencoding'.  You'll see a bunch of discussion.  It may be that
>>> we need to move from 'setup.py test' to something else.
>>>
>>> ...Ken
>>>
>>>
>>> On Tue, Nov 14, 2017 at 3:40 PM, Aaron <lists@xxxxxxxxxxxxxxxxxx> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I am hoping for some help to iron out a small testing bug with:
>>>> https://code.launchpad.net/~aaron-whitehouse/duplicity/08-unicode
>>>> so that I can get the code committed. I believe that the code is
>>>> working correctly, but our test setup (tox, pexpect etc) is creating an
>>>> environment identified as ASCII rather than UTF-8 and that makes the tests
>>>> fail.
>>>>
>>>> *The branch aims to ease Python 2/3 compatibility*
>>>>
>>>> For context, this branch aims to ease the conversion of duplicity to be
>>>> Python 2/3 compatible. It looks to me as though the key stumbling block in
>>>> previous efforts has been the string unicode/bytes distinction in Python 3.
>>>> My plan with this branch was therefore to take manageable sections of
>>>> duplicity and convert the strings to Python 2 unicode/bytes strings, making
>>>> it much easier to then convert that code to Python 3 in the future, but in
>>>> a way that can be committed straight away to the existing code base.
>>>>
>>>> *Using sys.getfilesystemencoding() misdetects 'ascii' in tests*
>>>>
>>>> As the branch currently stands, all tests pass. If, however (on my
>>>> UTF-8 system) you change (util.py, line 66):
>>>>
>>>> return bytes_filename.decode("UTF-8", "ignore")
>>>>
>>>> to:
>>>>
>>>> return bytes_filename.decode(sys.getfilesystemencoding(), "ignore")
>>>>
>>>> then tests (mainly in testing.functional.test_selection.TestUnicode)
>>>> fail. Changing "ignore" in the above line to "strict" gives errors that
>>>> point to an encoding problem:
>>>>
>>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 25: ordinal not in range(128)
>>>>
>>>> This is the case even though sys.getfilesystemencoding() returns
>>>> "UTF-8" on my setup. Adding a print statement that shows the result of
>>>> sys.getfilesystemencoding() shows it changing from "UTF-8" to
>>>> "ANSI_X3.4-1968" once the code is running inside the child spawned by:
>>>>
>>>> child = pexpect.spawn(b'/bin/sh', [b'-c', cmdline.encode(sys.getfilesystemencoding(),
>>>>                                                          'replace')], timeout=None)
>>>>
>>>> *This looks to just be a problem with the test suite*
>>>>
>>>> The test suite prints a copy of the failing command, for example (from
>>>> testing.functional.test_selection.TestUnicode.test_unicode_paths_non_globbing):
>>>>
>>>> ...command: "setsid" "-w" "duplicity" "full" "testfiles/select-unicode" "file://testfiles/output" "--volsize" "1" "--exclude" "testfiles/select-unicode/прыклад/пример/例/Παράδειγμα/उदाहरण.txt" "--exclude" "testfiles/select-unicode/прыклад/пример/例/Παράδειγμα/דוגמא.txt" "--exclude" "testfiles/select-unicode/прыклад/пример/例/მაგალითი/" "--include" "testfiles/select-unicode/прыклад/пример/例/" "--exclude" "testfiles/select-unicode/прыклад/пример/" "--include" "testfiles/select-unicode/прыклад/" "--include" "testfiles/select-unicode/օրինակ.txt" "--exclude" "testfiles/select-unicode/**" "-v0" "--no-print-statistics" "--allow-source-mismatch" "--archive-dir=testfiles/cache" < /dev/null
>>>>
>>>> If this (with PYTHONPATH added and the duplicity path adjusted,
>>>> executed from the "testing" folder with "testfiles.tar.gz" extracted) is
>>>> run directly on the command line:
>>>>
>>>> $ PYTHONPATH=../ "../bin/duplicity" "full" "testfiles/select-unicode" "file://testfiles/output" "--volsize" "1" "--exclude" "testfiles/select-unicode/прыклад/пример/例/Παράδειγμα/उदाहरण.txt" "--exclude" "testfiles/select-unicode/прыклад/пример/例/Παράδειγμα/דוגמא.txt" "--exclude" "testfiles/select-unicode/прыклад/пример/例/მაგალითი/" "--include" "testfiles/select-unicode/прыклад/пример/例/" "--exclude" "testfiles/select-unicode/прыклад/пример/" "--include" "testfiles/select-unicode/прыклад/" "--include" "testfiles/select-unicode/օրինակ.txt" "--exclude" "testfiles/select-unicode/**" "-v0" "--no-print-statistics" "--allow-source-mismatch" "--archive-dir=testfiles/cache"
>>>>
>>>> then everything works correctly: restoring the files (using this branch)
>>>> and manually checking them shows the backup succeeded (even with
>>>> "strict"). The print statement also shows that the system encoding is
>>>> "UTF-8" throughout.
>>>>
>>>> *Help requested*
>>>>
>>>> Can anybody suggest what I can do to force the testing environment to
>>>> be UTF-8, or at least be detected as such by sys.getfilesystemencoding?
>>>> Alternatively, what is the least awful way to make the tests work enough to
>>>> get the (apparently working) code committed?
>>>>
>>>> Many thanks,
>>>>
>>>> Aaron
>>>>
>>>> _______________________________________________
>>>> Mailing list: https://launchpad.net/~duplicity-team
>>>> Post to     : duplicity-team@xxxxxxxxxxxxxxxxxxx
>>>> Unsubscribe : https://launchpad.net/~duplicity-team
>>>> More help   : https://help.launchpad.net/ListHelp
>>>>
>>>>
>>>
>>
>
>
