touch-packages team mailing list archive

[Bug 374807] Re: grep does not work for UTF-16 files

 

This is not surprising, since 'grep' is a standard POSIX utility and
works in terms of POSIX locales
(http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html#tag_20_55_08).
If you read the POSIX standard carefully, you will find that UTF-16 and
UTF-32 cannot be used as the encoding of a POSIX locale: these encoding
forms use 2-byte and 4-byte code units respectively, so '/' and '.'
(like every other member of the portable character set) cannot be
encoded in a single byte, which makes their encoding nonconforming.

Quoting http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html:

"Conforming implementations shall support one or more coded character
sets. Each supported locale shall include the portable character set,
which is the set of symbolic names for characters in Portable Character
Set.

...
POSIX.1-2008 places only the following requirements on the encoded values of the characters in the portable character set:

...

The encoded values associated with <slash> and <period> shall be
invariant across all locales supported by the implementation.

The encoded values associated with the members of the portable character
set are each represented in a single byte. Moreover, if the value is
stored in an object of C-language type char, it is guaranteed to be
positive (except the NUL, which is always zero)."
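
The single-byte requirement quoted above can be checked with a short
Python sketch. This is only a byte-level illustration, not a description
of how grep is implemented; the sample string "<item>" and the variable
names are made up for the example.

```python
# Encode the two characters POSIX singles out and compare the byte counts.
for ch in ("/", "."):
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")
    # UTF-8 keeps each character in one byte; UTF-16LE appends a zero byte.
    print(repr(ch), "UTF-8:", utf8.hex(" "), "UTF-16LE:", utf16.hex(" "))

# A pattern supplied as single-byte text, as it would be under a UTF-8 locale.
pattern = "item".encode("ascii")

# The same line of a (hypothetical) file stored in two different encodings.
line_utf16 = "<item>".encode("utf-16-le")
line_utf8 = "<item>".encode("utf-8")

# In the UTF-16 data every character is followed by a zero byte, so the
# contiguous byte sequence b"item" never occurs in it.
found_in_utf16 = pattern in line_utf16
found_in_utf8 = pattern in line_utf8
print(found_in_utf16, found_in_utf8)  # False True
```

The interleaved zero bytes are roughly why a byte-oriented search for an
ASCII pattern finds nothing in a UTF-16 file.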

Another issue is that sizeof(wchar_t) is implementation-defined. My
tests on Ubuntu show that sizeof(wchar_t) is 4 bytes, so you need some
other data type to store UTF-16 code units in a portable way.
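
A quick way to look at the wchar_t size on a given machine without
writing C is Python's ctypes module. This is a minimal sketch: c_wchar
stands in for wchar_t and c_uint16 for a fixed 16-bit type such as
uint16_t, and the variable names are illustrative only.

```python
import ctypes

# Size of the platform's wchar_t as seen through ctypes. It is
# implementation-defined: typically 4 on Linux and 2 on Windows.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# A fixed-width 16-bit integer is always 2 bytes, which is what a
# UTF-16 code unit needs.
utf16_unit_size = ctypes.sizeof(ctypes.c_uint16)

# A character outside the Basic Multilingual Plane (U+1D11E, musical
# G clef) takes two UTF-16 code units, a surrogate pair.
clef = "\U0001D11E"
clef_units = len(clef.encode("utf-16-le")) // 2

print("sizeof(wchar_t):", wchar_size)
print("bytes per UTF-16 code unit:", utf16_unit_size)
print("UTF-16 code units for U+1D11E:", clef_units)
```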

I would say that this should not be fixed: you should convert the file
to UTF-8 with iconv in a pipeline and grep the converted output (though
this might be resource-intensive for large XML files).
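
The same convert-then-search idea can be sketched in Python instead of
an iconv pipeline. This swaps Python's "utf-16" codec in for iconv; the
helper name grep_utf16 and the sample data are hypothetical, and the
sketch assumes the file starts with a byte order mark. Reading line by
line keeps memory use bounded even for large files.

```python
import os
import re
import tempfile


def grep_utf16(path, pattern, ignore_case=False):
    """Yield the lines of a UTF-16 file that match a regular expression."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    # encoding="utf-16" uses the byte order mark to pick the byte order.
    # The default newline handling turns CRLF and bare CR into "\n".
    with open(path, encoding="utf-16") as handle:
        for line in handle:  # decoded incrementally, one line at a time
            if regex.search(line):
                yield line.rstrip("\n")


# Demo on a throwaway file with mixed CRLF and CR line terminators.
sample = "<root>\r\n<item>Hello</item>\r<item>world</item>\r\n"
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(sample.encode("utf-16"))  # this codec prepends a BOM
    sample_path = tmp.name

matches = list(grep_utf16(sample_path, "hello", ignore_case=True))
os.remove(sample_path)
print(matches)  # ['<item>Hello</item>']
```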

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to grep in Ubuntu.
https://bugs.launchpad.net/bugs/374807

Title:
  grep does not work for UTF-16 files

Status in grep package in Ubuntu:
  Confirmed

Bug description:
  Binary package hint: grep

  Release: 
      Description:    Ubuntu 9.04
      Release:        9.04

  Package:
      grep:
        Installed: 2.5.3~dfsg-6ubuntu1
        Candidate: 2.5.3~dfsg-6ubuntu1
        Version table:
       *** 2.5.3~dfsg-6ubuntu1 0
              500 http://gb.archive.ubuntu.com jaunty/main Packages
              100 /var/lib/dpkg/status

  When grep-ing a UTF-16 file, I expected results for the search pattern
  I was using.  However, no matches were found (using grep without
  options and 'grep -hi').

  I am not sure what program initially created the file, as I received
  it via email from a Windows user.  'file' returns the filetype as
  'Little-endian UTF-16 Unicode character data, with CRLF, CR line
  terminators'.  I have attached part of the file for testing (just the
  first ten lines returned from 'head'), gzipped to reduce any risk of
  the browser mangling it.

  Other text utilities such as cat, less, head, tail and vim have no
  problem dealing with the file.  So far as I have found, only grep
  cannot handle the file.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/grep/+bug/374807/+subscriptions