touch-packages team mailing list archive

[Bug 1428091] [NEW] regexec/regcomp fails on regular expression containing UTF-8 multi-byte characters

 

Public bug reported:

I want to match UTF-8 encoded strings against a regular expression.
A simple example is matching a string of 1 or 2 uppercase characters, including Ä, Ë, Ï, Ö and Ü.
The extended regular expression I use is:

'^[A-ZÄ-Ü]{1,2}$'

Expected behaviour:

Input  Expect
-------------
Ä      Match
ÄB     Match
ABC    Fail

The test works as expected with grep:
$ echo Ä |grep -E '^[A-ZÄ-Ü]{1,2}$'
Ä
$ echo ÄB |grep -E '^[A-ZÄ-Ü]{1,2}$'
ÄB
$ echo ABC |grep -E '^[A-ZÄ-Ü]{1,2}$'

The same test with a simple test program that uses regcomp/regexec (attached as regex_test.c):


$ ./regex Ä '^[A-ZÄ-Ü]{1,2}$'
MATCH (Ä) (^[A-ZÄ-Ü]{1,2}$)

$ ./regex ÄB '^[A-ZÄ-Ü]{1,2}$'
MISS  (ÄB) (^[A-ZÄ-Ü]{1,2}$)

$ ./regex ABC '^[A-ZÄ-Ü]{1,2}$'
MISS  (ABC) (^[A-ZÄ-Ü]{1,2}$)

It seems that the single character Ä is counted as two characters here
(presumably its two UTF-8 bytes), because this pattern matches:

$ ./regex Ä '^[A-ZÄ-Ü]{2}$'
MATCH (Ä) (^[A-ZÄ-Ü]{2}$)
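
The attached regex_test.c is not reproduced in this message, so the following is only a minimal sketch of what such a test program might look like. The names regex_match and REGEX_DEMO_MAIN are hypothetical. One possible explanation for the results above, assuming the test program never calls setlocale(): a C program starts in the "C" locale, so regcomp/regexec would work on single bytes rather than on UTF-8 characters, whereas grep adopts the locale from the environment. The sketch calls setlocale(LC_ALL, "") before compiling the pattern.

```c
#include <locale.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not taken from regex_test.c).
 * Returns 1 on match, 0 on no match, -1 if the pattern does not compile. */
int regex_match(const char *text, const char *pattern)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}

#ifdef REGEX_DEMO_MAIN
int main(int argc, char **argv)
{
    int rc;

    if (argc != 3) {
        fprintf(stderr, "usage: %s TEXT PATTERN\n", argv[0]);
        return 2;
    }

    /* Adopt the locale from the environment (for example en_US.UTF-8).
     * Without this call the process stays in the "C" locale. */
    if (setlocale(LC_ALL, "") == NULL)
        fprintf(stderr, "warning: could not set locale from environment\n");

    rc = regex_match(argv[1], argv[2]);
    if (rc < 0) {
        fprintf(stderr, "invalid pattern: %s\n", argv[2]);
        return 2;
    }
    printf("%s (%s) (%s)\n", rc ? "MATCH" : "MISS ", argv[1], argv[2]);
    return rc ? 0 : 1;
}
#endif
```

Built with something like cc -DREGEX_DEMO_MAIN -o regex sketch.c, this can be run with the same arguments as the ./regex commands above. Whether it changes the outcome depends on the assumption about the missing setlocale() call.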


Additional information:

$ lsb_release -rd
Description:	Ubuntu 14.04.2 LTS
Release:	14.04

libc6:amd64 version 2.19-0ubuntu6.5

Locale: en_US.UTF-8.
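
Note that the locale reported by the shell is not necessarily the locale active inside a C program, which starts in the "C" locale until setlocale() is called. A minimal sketch for checking this from within a process follows. The names active_codeset and LOCALE_DEMO_MAIN are hypothetical and not part of the attached test program.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: name of the character encoding of the locale
 * that is currently active inside this process. */
const char *active_codeset(void)
{
    return nl_langinfo(CODESET);
}

#ifdef LOCALE_DEMO_MAIN
int main(void)
{
    /* At startup the process is in the "C" locale. */
    printf("before setlocale: codeset=%s MB_CUR_MAX=%zu\n",
           active_codeset(), (size_t)MB_CUR_MAX);

    /* Switch to the locale named by LANG / LC_* in the environment. */
    if (setlocale(LC_ALL, "") == NULL)
        fprintf(stderr, "warning: could not set locale from environment\n");

    printf("after setlocale:  codeset=%s MB_CUR_MAX=%zu\n",
           active_codeset(), (size_t)MB_CUR_MAX);
    return 0;
}
#endif
```

If the second line reports UTF-8 as the codeset, multi-byte characters such as Ä should be handled as single characters by locale-aware library functions.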

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: libc6 2.19-0ubuntu6.5
ProcVersionSignature: Ubuntu 3.13.0-35.62-gatso 3.13.11.6
Uname: Linux 3.13.0-35-gatso x86_64
ApportVersion: 2.14.1-0ubuntu3.7
Architecture: amd64
CurrentDesktop: Unity
Date: Wed Mar  4 11:51:24 2015
Dependencies:
 gcc-4.9-base 4.9.1-0ubuntu1
 libc6 2.19-0ubuntu6.5
 libgcc1 1:4.9.1-0ubuntu1
 multiarch-support 2.19-0ubuntu6.5
InstallationDate: Installed on 2014-09-26 (158 days ago)
InstallationMedia: Ubuntu-Server 14.04.1 LTS "Trusty Tahr" - Release amd64 (20140722.3)
SourcePackage: eglibc
UpgradeStatus: No upgrade log present (probably fresh install)

** Affects: eglibc (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug libc6 regcomp regex regexec trusty unicode

** Attachment added: "regex_test.c"
   https://bugs.launchpad.net/bugs/1428091/+attachment/4334307/+files/regex_test.c

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to eglibc in Ubuntu.
https://bugs.launchpad.net/bugs/1428091

Title:
  regexec/regcomp fails on regular expression containing UTF-8
  multi-byte characters

Status in eglibc package in Ubuntu:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/eglibc/+bug/1428091/+subscriptions

