desktop-packages team mailing list archive

[Bug 808894] Re: Certain characters are not rendered correctly when selected (highlighted)

 

Launchpad has imported 27 comments from the remote bug at
https://bugs.freedesktop.org/show_bug.cgi?id=46603.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2012-02-25T01:13:44+00:00 Jason Crain wrote:

Created attachment 57622
test pdf

Forwarding from gnome bugzilla:
https://bugzilla.gnome.org/show_bug.cgi?id=654473

When selecting text in this pdf, some glyphs are not visible.  Text is
displayed correctly when not selected.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/6

------------------------------------------------------------------------
On 2012-02-25T01:14:58+00:00 Jason Crain wrote:

Created attachment 57623
incorrect display in evince

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/7

------------------------------------------------------------------------
On 2012-02-25T01:15:35+00:00 Jason Crain wrote:

Created attachment 57624
correct display in evince

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/8

------------------------------------------------------------------------
On 2012-02-25T01:25:40+00:00 Jason Crain wrote:

Created attachment 57625
Fixed display for selected glyph in ActualText span

This patch seems to correct the issue.  It sets the correct CharCode.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/9

------------------------------------------------------------------------
On 2012-02-25T07:45:38+00:00 Albert Astals Cid wrote:

I don't think this patch is correct, you are only setting the charcode
once in ActualText::addChar so if multiple ActualText::addChar calls
happen before ActualText::end the other charcodes are lost, no?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/10

------------------------------------------------------------------------
On 2012-02-26T04:03:40+00:00 Adrian Johnson wrote:

The patch doesn't work when the ActualText span contains more than one
glyph. There is a test case in the test suite that demonstrates the
problem:

http://cgit.freedesktop.org/poppler/test/tree/unittestcases/WithActualText.pdf

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/11

------------------------------------------------------------------------
On 2012-03-03T16:40:29+00:00 Jason Crain wrote:

Created attachment 57989
Enable displayed chars to map to any number of text chars

It's tricky when the length of the ActualText does not match the number
of displayed glyphs.  This first patch modifies the TextWord, TextLine,
TextLineFrag, TextBlock, and TextPage classes to support displayed
characters that can be mapped to any number of text characters.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/12

------------------------------------------------------------------------
On 2012-03-03T16:46:38+00:00 Jason Crain wrote:

Created attachment 57990
Fixes display for selected glyphs in ActualText span

This sets the correct CharCode for each glyph in an ActualText span.
Attempts to match one text character to each glyph.  If there are more
glyphs, they are added without matching text.  If there is more text,
the remaining is added to the last glyph.
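
The mapping rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attachment: `mapTextToGlyphs` is a hypothetical helper, and `Unicode` is a stand-in for poppler's type, assumed here to hold one 32-bit code point.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for poppler's Unicode type (assumed to be a 32-bit code point).
using Unicode = std::uint32_t;

// Hypothetical helper, not poppler API: returns the text assigned to each glyph.
std::vector<std::vector<Unicode>> mapTextToGlyphs(std::size_t glyphCount,
                                                  const std::vector<Unicode> &text) {
    std::vector<std::vector<Unicode>> perGlyph(glyphCount);
    // One text character per glyph, for as long as both sequences last.
    for (std::size_t i = 0; i < glyphCount && i < text.size(); ++i) {
        perGlyph[i].push_back(text[i]);
    }
    // Surplus text is appended to the last glyph; surplus glyphs stay empty.
    if (glyphCount > 0 && text.size() > glyphCount) {
        perGlyph.back().insert(perGlyph.back().end(),
                               text.begin() + static_cast<std::ptrdiff_t>(glyphCount),
                               text.end());
    }
    return perGlyph;
}
```

With three glyphs and two text characters the third glyph gets no text; with two glyphs and four text characters the second glyph receives the last three.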

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/13

------------------------------------------------------------------------
On 2012-03-03T21:27:44+00:00 Adrian Johnson wrote:

Created attachment 57993
fix selection of glyphs in actualtext

Thanks for these patches. I have started looking at some of the text
related bugs and the inability of TextOutputDev to understand the
difference between glyphs and characters is the cause of some of these
bugs. The first patch is very similar to the solution I had in mind. The
first patch also fixes bug 9001.

Some comments on the first patch:

The following code in TextBlock::coalesce() needs fixing:
	  if (word2->len == word0->len &&
	      !memcmp(word2->text, word0->text,
		      word0->len * sizeof(Unicode))) {
len needs to be replaced with textLen.

I don't think addChar should be renamed to addChars. My understanding of
the code is that 'Char' is referring to the CharCodes and only one
CharCode is added per call.

I would move the surrogate decoding to TextWord::addChar() and do the
decoding as the unicode values are copied into the text array. This
avoids having to make a copy of the unicode array in
TextPage::addChar().

Some comments on the second patch:

I don't like the way the second patch is mapping the replacement text to
the charcodes. I am attaching an alternative patch that distributes the
replacement text evenly across the charcodes.
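
One possible way to spread replacement text evenly across glyphs is sketched below. It is not the code from the attachment: it uses plain integer arithmetic, and `distributeEvenly` is a hypothetical helper. The parameter names `lenGlyphs` and `length` are chosen only for readability.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: for each glyph, returns (first, count), i.e. the
// start index and number of text characters assigned to that glyph.
std::vector<std::pair<std::size_t, std::size_t>> distributeEvenly(std::size_t lenGlyphs,
                                                                  std::size_t length) {
    std::vector<std::pair<std::size_t, std::size_t>> slices;
    slices.reserve(lenGlyphs);
    for (std::size_t i = 0; i < lenGlyphs; ++i) {
        // Integer boundaries: glyph i covers [i*length/lenGlyphs, (i+1)*length/lenGlyphs).
        const std::size_t first = i * length / lenGlyphs;
        const std::size_t end = (i + 1) * length / lenGlyphs;
        slices.emplace_back(first, end - first);
    }
    return slices;
}
```

Because the last boundary is exactly `length`, every text character lands in some slice without a special case for the final glyph. Some glyphs receive an empty slice when there are more glyphs than text characters.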

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/14

------------------------------------------------------------------------
On 2012-03-04T20:38:26+00:00 Jason Crain wrote:

Comment on attachment 57993
fix selection of glyphs in actualtext

Review of attachment 57993:
-----------------------------------------------------------------

::: poppler/TextOutputDev.cc
@@ +5331,5 @@
> +    // If this is the last glyph ensure all remaining text is included
> +    // as pos may be < length due to rounding errors.
> +    if (i == lenGlyphs - 1)
> +      count = length - first;
> +    text->addChar(state, glyphs[i].x, glyphs[i].y, glyphs[i].dx, glyphs[i].dy,

This needs to make sure that a surrogate pair is not split.
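
As a rough illustration of the concern, a split position inside UTF-16 text can be nudged forward when it would fall between the two halves of a surrogate pair. `safeSplit` is a hypothetical helper written for this sketch, not something taken from either patch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

inline bool isHighSurrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(std::uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: returns pos, or pos + 1 if splitting at pos would
// separate a high surrogate from the low surrogate that follows it.
std::size_t safeSplit(const std::vector<std::uint16_t> &text, std::size_t pos) {
    if (pos > 0 && pos < text.size() &&
        isHighSurrogate(text[pos - 1]) && isLowSurrogate(text[pos])) {
        return pos + 1;
    }
    return pos;
}
```

For the code units `0041 D835 DC65 0042`, a split requested at index 2 is moved to index 3, while splits at 1 or 3 are left alone.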

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/15

------------------------------------------------------------------------
On 2012-03-04T21:09:24+00:00 Jason Crain wrote:

Created attachment 58013
Enable displayed chars to map to any number of text chars

updated patch

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/16

------------------------------------------------------------------------
On 2012-03-05T15:04:19+00:00 Albert Astals Cid wrote:

Guys, I'm a bit lost, sorry. Are both patches supposed to fix the same
issue in a different way?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/17

------------------------------------------------------------------------
On 2012-03-05T15:16:54+00:00 Adrian Johnson wrote:

Both patches are required.

The "fix selection of glyphs in actualtext" ensures all charcodes are
passed through to TextPage. I need to update this to correctly handle
surrogates.

The "enable displayed chars to map to any number of text chars" is a
prerequisite for the "fix selection of glyphs in actualtext" patch. It
also fixes text selection of ligatures (bug 9001).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/18

------------------------------------------------------------------------
On 2012-03-05T15:41:38+00:00 Albert Astals Cid wrote:

When using Jason's patch on a pdf that I will be attaching in a moment,
pdftotext changes from extracting

 Remark 8. Ordering a line pencil. Let 𝑥 be a point and 𝐿 a noncompact line passing
through 𝑥. We want to define an order on the set ℒ 𝑥 ∖ {𝐿} in the same way as we
did in the proof of Proposition 5. In the disk model ̃
𝑃 with boundary circle 𝐿, every
line 𝐻 ∈ ℒ 𝑥 ∖{𝐿} separates ̃
𝑃 into an upper part 𝐻 + and a lower part 𝐻 − (the parts may
 be disconnected). Since we know from Propositions 6 and 7 that lines always intersect
 transversally, it follows that for two such lines, one of the respective lower parts is entirely

to

 Remark 8. Ordering a line pencil. Let 𝑥 be a point and 𝐿 a noncompact line passing
through 𝑥. We want to define an order on the set ℒ𝑥 ∖ {𝐿} in the same way as we
 with boundary circle 𝐿, every
did in the proof of Proposition 5. In the disk model 𝑃
 into an upper part 𝐻 + and a lower part 𝐻 − (the parts may
line 𝐻 ∈ ℒ𝑥 ∖{𝐿} separates 𝑃
 be disconnected). Since we know from Propositions 6 and 7 that lines always intersect
 transversally, it follows that for two such lines, one of the respective lower parts is entirely


You can see that in the second case some weird/unwanted reordering happened.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/19

------------------------------------------------------------------------
On 2012-03-05T15:42:20+00:00 Albert Astals Cid wrote:

Created attachment 58045
The said pdf

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/20

------------------------------------------------------------------------
On 2012-03-08T04:05:41+00:00 Adrian Johnson wrote:

Created attachment 58178
convert utf-16 to ucs-4 when reading ToUnicode

The next two patches fix the problem of "fix selection of glyphs in
actualtext" not handling surrogates. The "Unicode" type is meant to be
UCS-4, so the solution is to convert UTF-16 to UCS-4 when the
ToUnicode cmap is parsed.

This patch does the UTF-16 conversion in CharCodeToUnicode.cc. As a
result the special surrogate handling in TextOutputDev and HtmlOutputDev
can be removed.
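
The UTF-16 to UCS-4 step amounts to standard surrogate-pair decoding, sketched below. This is not the code from the attachment: `utf16ToUcs4` is a hypothetical name, `Unicode` is a stand-in for poppler's type, and replacing unpaired surrogates with U+FFFD is an assumption made for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for poppler's Unicode type (assumed to be a 32-bit code point).
using Unicode = std::uint32_t;

// Hypothetical helper: decodes UTF-16 code units into UCS-4 code points.
std::vector<Unicode> utf16ToUcs4(const std::vector<std::uint16_t> &in) {
    std::vector<Unicode> out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const Unicode u = in[i];
        const bool high = u >= 0xD800 && u <= 0xDBFF;
        const bool low = u >= 0xDC00 && u <= 0xDFFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine the pair: 10 bits from each half, offset by 0x10000.
            const Unicode lo = in[i + 1];
            out.push_back(0x10000u + ((u - 0xD800u) << 10) + (lo - 0xDC00u));
            ++i; // the low surrogate has been consumed
        } else if (high || low) {
            out.push_back(0xFFFD); // unpaired surrogate (assumption: substitute U+FFFD)
        } else {
            out.push_back(u);
        }
    }
    return out;
}
```

For example, the pair `D835 DC65` decodes to U+1D465, the mathematical italic 𝑥 that appears in the pdftotext output quoted earlier in this thread.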

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/21

------------------------------------------------------------------------
On 2012-03-08T04:09:18+00:00 Adrian Johnson wrote:

Created attachment 58179
move text string to unicode conversion into a separate function

This patch adds a new function for converting PDF text strings to UCS-4.
As a result the duplicated code in TextOutputDev and pdfinfo can be
replaced by a call to this function.

This patch is to ensure that my updated "fix selection of glyphs in
actualtext" does not have to care about surrogates.
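
A simplified sketch of what such a conversion function might look like is given below. It assumes that a PDF text string starting with the bytes FE FF is UTF-16BE and that anything else is a single-byte encoding. `textStringToUcs4` is a hypothetical name, and treating the single-byte case as one code point per byte is a deliberate simplification: real PDFDocEncoding differs from that for some byte values.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for poppler's Unicode type (assumed to be a 32-bit code point).
using Unicode = std::uint32_t;

// Hypothetical helper: converts the raw bytes of a PDF text string to UCS-4.
std::vector<Unicode> textStringToUcs4(const std::string &s) {
    std::vector<Unicode> out;
    const auto byteAt = [&s](std::size_t i) {
        return static_cast<Unicode>(static_cast<unsigned char>(s[i]));
    };
    const bool utf16 = s.size() >= 2 && byteAt(0) == 0xFE && byteAt(1) == 0xFF;
    if (!utf16) {
        // Simplification: one code point per byte (not a full PDFDocEncoding table).
        for (std::size_t i = 0; i < s.size(); ++i) {
            out.push_back(byteAt(i));
        }
        return out;
    }
    // UTF-16BE after the byte order mark; a trailing odd byte is ignored.
    for (std::size_t i = 2; i + 1 < s.size(); i += 2) {
        Unicode u = (byteAt(i) << 8) | byteAt(i + 1);
        if (u >= 0xD800 && u <= 0xDBFF && i + 3 < s.size()) {
            const Unicode lo = (byteAt(i + 2) << 8) | byteAt(i + 3);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                u = 0x10000u + ((u - 0xD800u) << 10) + (lo - 0xDC00u);
                i += 2; // skip the low surrogate just consumed
            }
        }
        out.push_back(u);
    }
    return out;
}
```

Callers that receive the result never see surrogate halves, which is the property the later patches in this thread rely on.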

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/22

------------------------------------------------------------------------
On 2012-03-08T04:12:30+00:00 Adrian Johnson wrote:

Created attachment 58180
fix selection of glyphs in actualtext

The updated version of "fix selection of glyphs in actualtext".

The patch order is:
1 - convert utf-16 to ucs-4 when reading ToUnicode
2 - move text string to unicode conversion into a separate function
3 - Enable displayed chars to map to any number of text chars
4 - fix selection of glyphs in actualtext

Patch 3 needs to be updated to remove the surrogate handling and fix the
regression.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/23

------------------------------------------------------------------------
On 2012-03-18T15:18:23+00:00 Adrian Johnson wrote:

Created attachment 58643
fix regressions

This patch fixes the regressions in "Enable displayed chars to map to
any number of text chars".

The problem is the changes now allow glyphs that map to zero length
unicode strings to be added to TextWords. Often these glyphs have
overlapping bounding boxes or are not on the same baseline. This
confuses TextOutputDev when trying to determine the layout of the text.

This patch does two things:
- it avoids breaking words when one of these glyphs with an empty mapping is encountered
- it increases the tolerance for overlapping bounding boxes.

With the attached PDF the text output is still different, but on
checking the differences it is actually an improvement.

However I suspect the changes could potentially break other PDFs. If
this patch causes problems, plan B is to change TextOutputDev to ignore
the glyphs with zero mapping when determining the text layout (but still
add these glyphs to the words to make text selection work correctly).
This should emulate the old behavior as closely as possible.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/24

------------------------------------------------------------------------
On 2012-03-18T15:21:09+00:00 Adrian Johnson wrote:

Created attachment 58644
Enable displayed chars to map to any number of text chars

This is Jason's patch rebased so it applies on top of my first two
patches.

The patch order is:
1 - convert utf-16 to ucs-4 when reading ToUnicode
2 - move text string to unicode conversion into a separate function
3 - Enable displayed chars to map to any number of text chars
4 - fix selection of glyphs in actualtext
5 - fix regressions

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/25

------------------------------------------------------------------------
On 2012-03-19T15:30:08+00:00 Albert Astals Cid wrote:

With all the patches applied the pdftotext extraction of https://www.libreoffice.org/bugzilla/attachment.cgi?id=41459
changes from

  • 
Patches may be grouped with other patches to test the whole of a

to

  •  P
 atches may be grouped with other patches to test the whole of a

The correct extraction would have no newline, but if we are going
to have a newline I very much prefer the old way to the new one, which
breaks the word after the P.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/26

------------------------------------------------------------------------
On 2012-03-22T05:07:01+00:00 Adrian Johnson wrote:

Created attachment 58860
Don't reverse order of words with same xMin

This fixes the regression. Output is now:

•  Patches may be grouped with other patches to test the whole of a
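
One general way to avoid reordering items that share the same key is a stable sort. The sketch below only illustrates that idea and makes no claim about how the attached patch is implemented: `Word` and `sortByXMin` are hypothetical stand-ins.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a word with a left edge and its text.
struct Word {
    double xMin;
    std::string text;
};

// Sorts by xMin while keeping the original relative order of words whose
// xMin values are equal (std::stable_sort guarantees this).
void sortByXMin(std::vector<Word> &words) {
    std::stable_sort(words.begin(), words.end(),
                     [](const Word &a, const Word &b) { return a.xMin < b.xMin; });
}
```

If a bullet and the word after it report the same xMin, the bullet stays first after sorting.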

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/27

------------------------------------------------------------------------
On 2012-03-22T15:29:49+00:00 Albert Astals Cid wrote:

There are still some problems in the file that I will be attaching:

• 
white X in patch b:

gets changed to

•w
 hite X in patch b:

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/28

------------------------------------------------------------------------
On 2012-03-22T15:32:29+00:00 Albert Astals Cid wrote:

Created attachment 58892
The pdf with the "white X in patch" problem

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/29

------------------------------------------------------------------------
On 2012-03-25T04:48:16+00:00 Adrian Johnson wrote:

Created attachment 59003
don't start a new word if the previous char is a control char

The problem is caused by a ^G that overlaps the first letter of the word.
This patch avoids starting a new word if a control character overlaps
another character, although it would probably be better to strip
control characters from the extracted text. Acroread does not include
the ^G in extracted text.
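
The alternative mentioned here, stripping control characters from extracted text, could look roughly like the sketch below. `stripControlChars` is a hypothetical helper, and the choice to keep tab, line feed and carriage return is an assumption made for this sketch.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for poppler's Unicode type (assumed to be a 32-bit code point).
using Unicode = std::uint32_t;

// Hypothetical helper: drops C0/C1 control characters and DEL, but keeps
// tab, line feed and carriage return (an assumption of this sketch).
std::vector<Unicode> stripControlChars(const std::vector<Unicode> &text) {
    std::vector<Unicode> out;
    out.reserve(text.size());
    for (Unicode u : text) {
        const bool keptWhitespace = u == 0x09 || u == 0x0A || u == 0x0D;
        const bool c0 = u < 0x20 && !keptWhitespace;
        const bool delOrC1 = u >= 0x7F && u <= 0x9F;
        if (!c0 && !delOrC1) {
            out.push_back(u);
        }
    }
    return out;
}
```

A ^G (U+0007) in front of "white" would be removed, leaving the word intact.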

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/30

------------------------------------------------------------------------
On 2012-03-25T08:17:29+00:00 Albert Astals Cid wrote:

With this last patch now we get
       P
atch creation date
  instead of        
Patch creation date
again in https://www.libreoffice.org/bugzilla/attachment.cgi?id=41459

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/31

------------------------------------------------------------------------
On 2014-12-17T13:33:13+00:00 Jason Crain wrote:

*** Bug 87401 has been marked as a duplicate of this bug. ***

Reply at:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/36


** Changed in: poppler
       Status: Unknown => Confirmed

** Changed in: poppler
   Importance: Unknown => Medium

-- 
You received this bug notification because you are a member of Desktop
Packages, which is subscribed to poppler in Ubuntu.
https://bugs.launchpad.net/bugs/808894

Title:
  Certain characters are not rendered correctly when selected
  (highlighted)

Status in Poppler:
  Confirmed
Status in poppler package in Ubuntu:
  Triaged

Bug description:
  1) lsb_release -rd
  Description:	Ubuntu Vivid Vervet (development branch)
  Release:	15.04

  2) apt-cache policy evince
  evince:
    Installed: 3.14.1-0ubuntu1
    Candidate: 3.14.1-0ubuntu1
    Version table:
   *** 3.14.1-0ubuntu1 0
          500 http://us.archive.ubuntu.com/ubuntu/ vivid/main amd64 Packages
          100 /var/lib/dpkg/status

  3) What is expected to happen via
  https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/+attachment/2202502/+files/testfile.pdf
  is when one highlights the first three lines, it doesn't mis-highlight
  the words.

  What happens instead is certain letters are not visible as per
  https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/+attachment/2202506/+files/screenshot.png
  .

  ProblemType: Bug
  DistroRelease: Ubuntu 11.04
  Package: evince 2.32.0-0ubuntu12.2
  ProcVersionSignature: Ubuntu 2.6.38-8.42-generic 2.6.38.2
  Uname: Linux 2.6.38-8-generic i686
  Architecture: i386
  CheckboxSubmission: 9e6554c36969a101b9e0e3075c8ffbe0
  CheckboxSystem: b8f3ec504801f13fc208edb5c785b099
  Date: Mon Jul 11 18:38:00 2011
  InstallationMedia: Ubuntu 11.04 "Natty Narwhal" - Release i386 (20110427.1)
  ProcEnviron:
   LANGUAGE=fr_FR:en
   LANG=fr_FR.UTF-8
   SHELL=/bin/bash
  ProcVersionSignature_: Ubuntu 2.6.38-8.42-generic 2.6.38.2
  SourcePackage: evince
  UpgradeStatus: No upgrade log present (probably fresh install)
