touch-packages team mailing list archive
Message #59918
[Bug 1428091] [NEW] regexec/regcomp fails on regular expression containing UTF-8 multi-byte characters
Public bug reported:
I want to do a regular expression match on UTF-8 encoded strings.
A simple example is matching a string consisting of 1 or 2 uppercase characters, including Ä, Ë, Ï, Ö, Ü.
The extended regular expression I use is:
'^[A-ZÄ-Ü]{1,2}$'
Expected behaviour:
Input   Expect
------------------
Ä       Match
ÄB      Match
ABC     Fail
Test using grep works OK:
$ echo Ä |grep -E '^[A-ZÄ-Ü]{1,2}$'
Ä
$ echo ÄB |grep -E '^[A-ZÄ-Ü]{1,2}$'
ÄB
$ echo ABC |grep -E '^[A-ZÄ-Ü]{1,2}$'
The same test with a simple test program that uses regcomp/regexec:
$ ./regex Ä '^[A-ZÄ-Ü]{1,2}$'
MATCH (Ä) (^[A-ZÄ-Ü]{1,2}$)
$ ./regex ÄB '^[A-ZÄ-Ü]{1,2}$'
MISS (ÄB) (^[A-ZÄ-Ü]{1,2}$)
$ ./regex ABC '^[A-ZÄ-Ü]{1,2}$'
MISS (ABC) (^[A-ZÄ-Ü]{1,2}$)
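The attached regex_test.c is not reproduced in this message, so the following is only a minimal sketch of what such a test could look like, not the attachment itself. The helper names ere_matches and use_environment_locale are hypothetical. One thing to keep in mind is that a C program starts in the "C" locale until it calls setlocale(), and regcomp/regexec interpret the pattern and input according to the current locale. Whether the attached program calls setlocale() cannot be determined from this report.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not taken from regex_test.c).
 * Compiles `pattern` as a POSIX extended regular expression and tests
 * it against `input`.
 * Returns 1 on match, 0 on no match, -1 if the pattern does not compile.
 */
int ere_matches(const char *pattern, const char *input)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && input != NULL);

    /* REG_NOSUB: only a yes/no answer is needed, no match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, input, 0, NULL, 0);
    regfree(&re);

    return rc == 0 ? 1 : 0;
}

/*
 * Hypothetical helper: adopt the locale from the environment
 * (LANG, LC_ALL, ...), for example en_US.UTF-8.
 * Returns 1 on success, 0 if the requested locale is unavailable.
 */
int use_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}
```

A test program built on this sketch would call use_environment_locale() once at startup, before the first regcomp(), and then print MATCH or MISS depending on the result of ere_matches(argv[2], argv[1]).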
It seems that the single character Ä, which is encoded as two bytes in
UTF-8, is counted as two characters here, because this works:
$ ./regex Ä '^[A-ZÄ-Ü]{2}$'
MATCH (Ä) (^[A-ZÄ-Ü]{2}$)
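To illustrate the difference between bytes and characters that this observation points at, here is a small sketch. It assumes well-formed UTF-8 input, and the helper name utf8_code_points is hypothetical. Ä is U+00C4, which UTF-8 encodes as the two bytes 0xC3 0x84, so a byte-oriented count sees two units where a character-oriented count sees one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: counts the code points in a well-formed UTF-8
 * string by counting every byte that is not a continuation byte
 * (continuation bytes have the bit pattern 10xxxxxx).
 */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the input Ä, written as "\xC3\x84" in C source, strlen() reports 2 bytes while utf8_code_points() reports 1 character. This matches the behaviour seen above, where the {2} quantifier accepts a single Ä.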
Additional information:
$ lsb_release -rd
Description: Ubuntu 14.04.2 LTS
Release: 14.04
libc6:amd64 version 2.19-0ubuntu6.5
Locale: en_US.UTF-8.
ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: libc6 2.19-0ubuntu6.5
ProcVersionSignature: Ubuntu 3.13.0-35.62-gatso 3.13.11.6
Uname: Linux 3.13.0-35-gatso x86_64
ApportVersion: 2.14.1-0ubuntu3.7
Architecture: amd64
CurrentDesktop: Unity
Date: Wed Mar 4 11:51:24 2015
Dependencies:
gcc-4.9-base 4.9.1-0ubuntu1
libc6 2.19-0ubuntu6.5
libgcc1 1:4.9.1-0ubuntu1
multiarch-support 2.19-0ubuntu6.5
InstallationDate: Installed on 2014-09-26 (158 days ago)
InstallationMedia: Ubuntu-Server 14.04.1 LTS "Trusty Tahr" - Release amd64 (20140722.3)
SourcePackage: eglibc
UpgradeStatus: No upgrade log present (probably fresh install)
** Affects: eglibc (Ubuntu)
Importance: Undecided
Status: New
** Tags: amd64 apport-bug libc6 regcomp regex regexec trusty unicode
** Attachment added: "regex_test.c"
https://bugs.launchpad.net/bugs/1428091/+attachment/4334307/+files/regex_test.c
--
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to eglibc in Ubuntu.
https://bugs.launchpad.net/bugs/1428091
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/eglibc/+bug/1428091/+subscriptions