
duplicity-team team mailing list archive

Re: Help with unicode branch (for Python3 support)

 

Many thanks for this, Ken, it seems to have done the trick (after adding 'ANSI_X3.4-1968' to the list)!

I will tidy up the branch and get it submitted.

Presumably the effect of this is to treat systems that report ASCII (or one of its aliases) as UTF-8. That is perfect for reading/decoding filenames, but I am a little more nervous about using this global for encoding/writing filenames. Presumably it wouldn't be an issue for Linux filesystems, as the filenames are bytes, but are there any other filesystems that might declare themselves as ASCII and not accept UTF-8?
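
To make the decode/encode concern concrete, here is a minimal Python 3 sketch. The helper names and the FALLBACK_ENCODING constant are hypothetical and not part of duplicity. It only shows that a valid UTF-8 byte filename survives a decode/encode round trip unchanged, which is why the fallback looks low-risk on filesystems that store names as raw bytes; it says nothing about filesystems that really do reject non-ASCII names.

```python
# Assumption: the fallback discussed in this thread, i.e. treat "ascii" as UTF-8.
FALLBACK_ENCODING = "utf-8"


def decode_name(raw_name, encoding=FALLBACK_ENCODING):
    """Turn a byte filename into text, failing loudly on invalid bytes."""
    return raw_name.decode(encoding, "strict")


def encode_name(name, encoding=FALLBACK_ENCODING):
    """Turn a text filename back into bytes for the filesystem."""
    return name.encode(encoding, "strict")


# A byte filename as a Linux filesystem would hand it over.
raw = "прыклад/пример.txt".encode("utf-8")

# Decoding and re-encoding gives back exactly the same bytes.
round_tripped = encode_name(decode_name(raw))

# A pure-ASCII name is byte-identical under ASCII and UTF-8.
plain = encode_name("example.txt")
```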

Kind regards,

Aaron


On 19/11/17 12:24, Kenneth Loafman wrote:
Further reading suggests:
globals.fsencoding = fse if fse not in ['ascii', None] else 'utf-8'
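
Since names such as 'ANSI_X3.4-1968' are aliases of ASCII, one possible alternative to a hand-maintained list is to let the codecs module normalise the name first. This is only a sketch: is_ascii_alias and pick_fs_encoding are hypothetical helpers, not existing duplicity functions.

```python
import codecs
import sys


def is_ascii_alias(name):
    """Return True if name is missing or is any spelling of ASCII."""
    if name is None:
        return True
    try:
        # codecs.lookup() resolves aliases to a canonical codec name.
        return codecs.lookup(name).name == "ascii"
    except LookupError:
        # Unknown codec name: not ASCII as far as we can tell.
        return False


def pick_fs_encoding(reported):
    """Fall back to UTF-8 when the reported encoding is ASCII or missing."""
    return "utf-8" if is_ascii_alias(reported) else reported


# What this process would use with the fallback applied.
fsencoding = pick_fs_encoding(sys.getfilesystemencoding())
```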

Gack!  This is turning into a real fubar!

...Ken


On Sat, Nov 18, 2017 at 3:55 PM, Kenneth Loafman <kenneth@xxxxxxxxxxx> wrote:

    It seems we are fighting an old Python bug still extant in Python
    3.  Namely, sys.getfilesystemencoding() will return ascii if the
    LC_* variables are not set (cron or other detached processes). At
    one time in the early 3 series they defaulted to utf-8 if ascii
    was returned.  Then, as I understand it, the purists won and ascii
    is returned now.  I think defaulting to utf-8 was a good enough
    idea, except that we should allow an override.  I suggest we add
    an option for the case where the FS is really something other than
    utf-8, but do something like this in globals.py:

    fse = sys.getfilesystemencoding()
    globals.fsencoding = fse if fse != 'ascii' else 'utf-8'

    Then allow it to be overridden in command line processing if
    needed.  Replace the two sys.getfilesystemencoding() with
    globals.fsencoding and we should be 99% there.
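
A rough sketch of how such an override could look is below. It uses argparse purely for illustration, and the --fs-encoding option name, build_parser and resolve_fs_encoding are all hypothetical; duplicity's real command line processing may look quite different.

```python
import argparse
import codecs
import sys


def build_parser():
    parser = argparse.ArgumentParser()
    # Hypothetical option name, invented for this sketch.
    parser.add_argument("--fs-encoding", default=None,
                        help="override the detected filesystem encoding")
    return parser


def resolve_fs_encoding(argv, reported=None):
    """Pick the encoding: explicit override first, then the ascii fallback."""
    args = build_parser().parse_args(argv)
    if args.fs_encoding is not None:
        # Raises LookupError early if the name is not a known codec.
        codecs.lookup(args.fs_encoding)
        return args.fs_encoding
    fse = reported if reported is not None else sys.getfilesystemencoding()
    return fse if fse != "ascii" else "utf-8"
```

For example, resolve_fs_encoding(["--fs-encoding", "latin-1"]) returns the override untouched, while resolve_fs_encoding([], reported="ascii") falls back to 'utf-8'.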

    ...Ken





    On Wed, Nov 15, 2017 at 9:40 AM, Kenneth Loafman
    <kenneth@xxxxxxxxxxx> wrote:

        Google for 'tox getfilesystemencoding' and 'setup.py test
        getfilesystemencoding'.  You'll see a bunch of discussion.  It
        may be that we need to move from 'setup.py test' to something
        else.

        ...Ken


        On Tue, Nov 14, 2017 at 3:40 PM, Aaron
        <lists@xxxxxxxxxxxxxxxxxx> wrote:

            Hello all,

            I am hoping for some help to iron out a small testing bug
            with:
            https://code.launchpad.net/~aaron-whitehouse/duplicity/08-unicode so
            that I can get the code committed. I believe that the code
            is working correctly, but our test setup (tox, pexpect
            etc) is creating an environment identified as ASCII rather
            than UTF-8 and that makes the tests fail.


                      *The branch aims to ease Python 2/3 compatibility*

            For context, this branch aims to ease the conversion of
            duplicity to be Python 2/3 compatible. It looks to me as
            though the key stumbling block in previous efforts has
            been the string unicode/bytes distinction in Python 3. My
            plan with this branch was therefore to take manageable
            sections of duplicity and convert the strings to Python 2
            unicode/bytes strings, making it much easier to then
            convert that code to Python 3 in the future, but in a way
            that can be committed straight away to the existing code base.


                      *Using sys.getfilesystemencoding() misdetects
                      'ascii' in tests*

            As the branch currently stands, all tests pass. If,
            however (on my UTF-8 system) you change (util.py, line 66):

            return bytes_filename.decode("UTF-8", "ignore")

            to:

            return bytes_filename.decode(sys.getfilesystemencoding(), "ignore")

            then tests (mainly in
            testing.functional.test_selection.TestUnicode) fail.
            Changing "ignore" in the above line to "strict" gives
            an error that points to an encoding problem:

            UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 25: ordinal not in range(128)
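
The error can be reproduced in isolation. The following Python 3 sketch uses a shortened version of one of the test paths (the variable names are only illustrative) and shows why "ignore" hid the problem: with the ascii codec it silently drops every non-ASCII byte instead of raising.

```python
# UTF-8 bytes for a path like the ones used by the unicode selection tests.
raw = "testfiles/select-unicode/прыклад".encode("utf-8")

# "strict" with the ascii codec fails on the first non-ASCII byte.
try:
    raw.decode("ascii", "strict")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# "ignore" does not fail, but the Cyrillic directory name is lost entirely.
lossy = raw.decode("ascii", "ignore")

# Decoding with the encoding the bytes were written in works as expected.
correct = raw.decode("utf-8", "strict")
```

Here the first failing byte is 0xd0 at position 25, the start of the first Cyrillic letter, which matches the traceback above.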

            This is the case even though sys.getfilesystemencoding()
            returns "UTF-8" on my setup. Putting a print statement
            showing the result of sys.getfilesystemencoding() shows
            this changing from "UTF-8" to "ANSI_X3.4-1968" once the
            code is within the:

            child = pexpect.spawn(b'/bin/sh', [b'-c', cmdline.encode(sys.getfilesystemencoding(),
                                                                      'replace')], timeout=None)


                      *This looks to just be a problem with the test
                      suite*

            The test suite prints a copy of the failing command, for
            example (from
            testing.functional.test_selection.TestUnicode.test_unicode_paths_non_globbing):

            ...command: "setsid" "-w" "duplicity" "full" "testfiles/select-unicode" "file://testfiles/output" "--volsize" "1" "--exclude" "testfiles/select-unicode/прыклад/пример/例/Παράδειγμα/उदाहरण.txt" "--exclude" "testfiles/select-unicode/прыклад/пример/例/Παράδειγμα/דוגמא.txt" "--exclude" "testfiles/select-unicode/прыклад/пример/例/მაგალითი/" "--include" "testfiles/select-unicode/прыклад/пример/例/" "--exclude" "testfiles/select-unicode/прыклад/пример/" "--include" "testfiles/select-unicode/прыклад/" "--include" "testfiles/select-unicode/օրինակ.txt" "--exclude" "testfiles/select-unicode/**" "-v0" "--no-print-statistics" "--allow-source-mismatch" "--archive-dir=testfiles/cache" < /dev/null

            If this (with PYTHONPATH added and the duplicity path
            adjusted, executed from the "testing" folder with
            "testfiles.tar.gz" extracted) is run directly in the
            commandline:

            $ PYTHONPATH=../ "../bin/duplicity" "full" "testfiles/select-unicode" "file://testfiles/output" "--volsize" "1" "--exclude" "testfiles/select-unicode/прыклад/пример/例/Παράδειγμα/उदाहरण.txt" "--exclude" "testfiles/select-unicode/прыклад/пример/例/Παράδειγμα/דוגמא.txt" "--exclude" "testfiles/select-unicode/прыклад/пример/例/მაგალითი/" "--include" "testfiles/select-unicode/прыклад/пример/例/" "--exclude" "testfiles/select-unicode/прыклад/пример/" "--include" "testfiles/select-unicode/прыклад/" "--include" "testfiles/select-unicode/օրինակ.txt" "--exclude" "testfiles/select-unicode/**" "-v0" "--no-print-statistics" "--allow-source-mismatch" "--archive-dir=testfiles/cache"

            Everything works correctly and restoring the files (using
            this branch) and manually checking shows it worked
            correctly (even with "strict"). The print statement also
            shows that the system encoding is "UTF-8" throughout.


                      *Help requested*

            Can anybody suggest what I can do to force the testing
            environment to be UTF-8, or at least be detected as such
            by sys.getfilesystemencoding? Alternatively, what is the
            least awful way to make the tests work enough to get the
            (apparently working) code committed?
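
One way to investigate is to check what a child process reports under different locale settings. The sketch below uses subprocess rather than pexpect so that it stays self-contained; child_fs_encoding is a hypothetical helper, and "en_US.UTF-8" is an assumption that such a locale is installed on the machine. The result also depends on the Python version, so no particular output is claimed here.

```python
import os
import subprocess
import sys


def child_fs_encoding(env_overrides):
    """Run a child interpreter and return its sys.getfilesystemencoding()."""
    env = dict(os.environ)
    env.update(env_overrides)
    out = subprocess.check_output(
        [sys.executable, "-c",
         "import sys; print(sys.getfilesystemencoding())"],
        env=env,
        universal_newlines=True,
    )
    return out.strip()


# Compare a bare "C" locale with an explicit UTF-8 locale (assumed installed).
print(child_fs_encoding({"LC_ALL": "C"}))
print(child_fs_encoding({"LC_ALL": "en_US.UTF-8"}))
```

If the second call reports a UTF-8 encoding, the same environment could be handed to the test child, since pexpect.spawn also accepts an env argument.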

            Many thanks,

            Aaron






