PDFKitten is highlighting at the wrong position

I am using PDFKitten to search for strings within PDF documents and highlight the results. FastPDFKit or any other commercial library is not an option, so I stuck with the library that comes closest to my requirements.

[Screenshot: Wrong coordinate]

As you can see in the screenshot, I searched for the string "in", which is highlighted correctly everywhere except for the last occurrence. I also have a more complex PDF document where the highlight box for "in" is off by nearly 40%.

I read through the whole source code and checked the issue tracker, but apart from line height problems I found nothing regarding the width calculation. For the moment I don't see any pattern in where the calculation goes wrong or could go wrong, and I hope that someone else has had a similar problem.

My current assumption is that the coordinates and character widths are calculated incorrectly somewhere in the font classes or in RenderingState.m. The project is very complex, so maybe one of you has run into a similar problem with PDFKitten in the past.

I have used the original sample PDF document from PDFKitten for my screenshot.

Kare answered 16/10, 2012 at 12:14 Comment(7)
The only apparent peculiarity of the third line seems to be that it contains a ligature (fi in finally), and the yellow mark is misplaced by about the width of that ligature. Maybe PDFKitten did not take the ligature into account while searching... This, BTW, also explains why the 'in' in finally was not found... In the content: [(has)-326(\014nally)-327(found)-326(a)-326(p)-27(eaceful)-327(place)-326(to)-326(call)-327(home)1(.)-436(He)-326(has)-326(found)-327(jo)28(y)-327(in)-326(life.)]TJ Traceytrachea
It is a problem with the letter-spacing of the typeface or the justification of the text. Did you try testing this with a monospaced font? It seems that the markers have their origins on the right of the screen … or maybe it is the ligatures. Bickart
@Bickart We created a document with different types of fonts: monospaced, Arial, custom fonts, with and without justification, and also with umlauts and special characters. We can't figure out a pattern, but justification is not the problem. The string "in" works fine in that document, but "io" is shifted a bit to the left. Here is the test document: stefanpopp.de/stack/121016_Test_PDF_justified_pics.pdf Kare
@Traceytrachea We also thought that the ligatures could be causing problems. We tested that with one document with ligatures and one without, and we couldn't reproduce a ligature-related problem. The effect happens without ligatures too. Thank you for the tip! Kare
@DasFuxx Do you also have a sample PDF and a sample screenshot of a problem situation where ligatures could not be blamed? (I don't have an iOS machine here and, therefore, cannot test PDFKitten directly.) Traceytrachea
@Traceytrachea Unfortunately I can't send the PDF because it is under NDA. Here is another document where I could reproduce the error. This is how it looks: link This is how it should look: link And this is the PDF document: link Kare
@DasFuxx The answer I gave this noon also explains the unexpected behavior with your new file: in this case the TimesNewRoman font used for regular text in the top half of page 2 has numerous characters with character identifiers differing from their respective Unicode codes. The other fonts on that page are included in a less exotic way, though, and the marks in them look quite right. I cannot promise that the problem explained in the answer is the only one PDFKitten has, but it is the most obvious. Traceytrachea

This might be a bug in PDFKitten when calculating the width of characters whose character identifier (CID) does not coincide with their Unicode character code.

appendPDFString in StringDetector works with two strings when processing some string data:

// Use CID string for font-related computations.
NSString *cidString = [font stringWithPDFString:string];

// Use Unicode string to compare with user input.
NSString *unicodeString = [[font stringWithPDFString:string] lowercaseString];

stringWithPDFString in Font transforms the sequence of character identifiers of its argument into a Unicode string.

Thus, in spite of the name of the variable, cidString is not a sequence of character identifiers but a sequence of Unicode characters. Nonetheless its entries are used as arguments of didScanCharacter, which in Scanner is implemented to advance the position by the character width: it passes the value to widthOfCharacter in Font to determine the character width, and that method (according to the comment "Width of the given character (CID) scaled to fontsize") expects its argument to be a character identifier.

So, if the CID and the Unicode character code don't coincide, the wrong character width is determined and the position of any following character cannot be trusted. In the case at hand, the /fi ligature has a CID of 12, which is very different from its Unicode code 0xfb01.
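To make the drift concrete, here is a minimal Python sketch. It is only a model of the suspected behaviour, not PDFKitten code: the width values, the to_unicode helper and the fallback width of 0 for unknown codes are all assumptions chosen for illustration. It compares the cursor advance for "\014nally" when widths are looked up by CID and when they are looked up by Unicode code.

```python
# Hypothetical glyph widths keyed by CID, in 1/1000 em (illustrative values).
WIDTHS = {12: 556, 97: 500, 108: 278, 110: 556, 121: 500}
# Hypothetical ToUnicode mapping: CID 12 is the "fi" ligature, U+FB01.
TO_UNICODE = {12: "\ufb01"}
# Assumed fallback when a code has no width entry.
DEFAULT_WIDTH = 0


def to_unicode(cids):
    """Map a CID sequence to a Unicode string (identity where unmapped)."""
    return "".join(TO_UNICODE.get(c, chr(c)) for c in cids)


def advance_by_cid(cids):
    """Correct advance: every width is looked up with the CID."""
    return sum(WIDTHS.get(c, DEFAULT_WIDTH) for c in cids)


def advance_by_unicode(cids):
    """Suspected bug: widths are looked up with Unicode code points."""
    return sum(WIDTHS.get(ord(ch), DEFAULT_WIDTH) for ch in to_unicode(cids))


cids = [12, 110, 97, 108, 108, 121]  # "\014nally" from the content stream
drift = advance_by_cid(cids) - advance_by_unicode(cids)
print(drift)  # the cursor falls short by exactly the ligature width
```

Under these assumptions, everything after the ligature on that line is shifted by the width of the ligature, which matches the misplaced mark in the screenshot.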

I would propose enhancing PDFKitten to also define a didScanCID method in StringDetector, which appendPDFString should call alongside didScanCharacter for each processed character, passing on its CID. Scanner should then use this new method instead to calculate the width by which to advance its cursor.

This should be triple-checked first, though. Maybe some widthOfCharacter implementations (there are different ones for different font types) expect the argument to be a Unicode code after all, in spite of the comment...

(Sorry if I used the wrong vocabulary here or there, I'm a Java guy... :))
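The shape of that proposal can be sketched in Python as follows. This is again a model under assumptions, not the PDFKitten API: the class and method names only mirror the ones mentioned above, the width table is made up, and the keyword matching is deliberately naive. The point is that the Unicode character is used only for comparison while the CID is used only for the width lookup.

```python
class Scanner:
    """Tracks the horizontal cursor; widths are keyed by CID (1/1000 em)."""

    def __init__(self, widths_by_cid):
        self.widths_by_cid = widths_by_cid
        self.x = 0       # current cursor position
        self.hits = []   # start positions of found keywords

    def did_scan_cid(self, cid):
        # Advance by the width looked up with the CID, not the Unicode code.
        self.x += self.widths_by_cid.get(cid, 0)

    def did_find_keyword(self, start_x):
        self.hits.append(start_x)


class StringDetector:
    """Naive matcher for a non-empty keyword (enough for a sketch)."""

    def __init__(self, keyword, delegate):
        self.keyword = keyword.lower()
        self.delegate = delegate
        self.matched = 0   # number of keyword characters matched so far
        self.start_x = 0   # cursor position where the current match began

    def append(self, cids, to_unicode):
        for cid in cids:
            ch = to_unicode.get(cid, chr(cid)).lower()  # for comparison only
            if ch != self.keyword[self.matched]:
                self.matched = 0
            if ch == self.keyword[self.matched]:
                if self.matched == 0:
                    self.start_x = self.delegate.x
                self.matched += 1
            self.delegate.did_scan_cid(cid)             # for width only
            if self.matched == len(self.keyword):
                self.delegate.did_find_keyword(self.start_x)
                self.matched = 0


# Hypothetical font data: CID 12 is the "fi" ligature.
widths = {12: 556, 32: 250, 97: 500, 105: 278, 108: 278, 110: 556, 121: 500}
scanner = Scanner(widths)
detector = StringDetector("in", scanner)
detector.append([12, 110, 97, 108, 108, 121, 32, 105, 110], {12: "\ufb01"})
print(scanner.hits, scanner.x)
```

With the CID forwarded separately, the ligature still contributes its full width, so the match for "in" after "\014nally" starts at the correct offset.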

Traceytrachea answered 17/10, 2012 at 10:49 Comment(6)
Thank you for the suggestion. Hopefully I can review it tomorrow. What you've written makes sense, and thank you for the code analysis. I will write back ASAP. Kare
I did some reverse engineering on my documents and found the root of the problem in the PDF document. link Here you can see both lines from the same document. Everything described by the first one works like a charm; everything using the second one breaks the selection. I think PDFKitten runs into problems with the width description there. I will verify that in the next few days. Kare
Well, the information on pastie is far from complete (e.g. the contents of the ToUnicode map are not shown). But considering the FirstChar, LastChar and Widths, it indeed looks like the first font contains its glyphs at positions equal to their Unicode codes, and merely leaves out the glyphs not required in your document. The second font, though, contains glyphs at positions 2, 15, 16, 18, 19, 20, and 22-29, which surely do not coincide with any Unicode codes. Thus, your observations make sense in combination with my analysis. :) Traceytrachea
@DasFuxx Have you succeeded in fixing that issue? Traceytrachea
Not yet, as I am on another project. I may have a chance to look into the problem next week, but I am out of the office for a DVD recording until mid-November. I will respond as soon as possible, but we are heavily loaded with our current projects. I won't forget this question =) Kare
You were right. Someone posted something similar on GitHub. I changed one line in CompositeFont.m, which corresponds to your solution. The wrong character had been selected. Link to workaround Thank you! Kare
