Extract text from array TJ in PDF operator using PoDoFo lib
Asked Answered
D

1

1

I am trying to extract text from a PDF file usind the PoDoFo library, it is working for the Tj operator and fails to do so for the (array) TJ operator. I ve found this piece of code(with my small modification) here :

 const char*      pszToken = NULL;
    PdfVariant       var;
    EPdfContentsType eType;

    PdfContentsTokenizer tokenizer( pPage );

    double dCurPosX     = 0.0;
    double dCurPosY     = 0.0;
    double dCurFontSize = 0.0;
    bool   bTextBlock   = false;
    PdfFont* pCurFont   = NULL;

    std::stack<PdfVariant> stack;



while( tokenizer.ReadNext( eType, pszToken, var ) )
{

    if( eType == ePdfContentsType_Keyword )
    {
        // support 'l' and 'm' tokens

        _RPT1(_CRT_WARN, " %s\n", pszToken);

        if( strcmp( pszToken, "l" ) == 0 || 
            strcmp( pszToken, "m" ) == 0 )
        {
            dCurPosX = stack.top().GetReal();
            stack.pop();
            dCurPosY = stack.top().GetReal();
            stack.pop();
        }
        else if (strcmp(pszToken, "Td") == 0)
        {
            dCurPosY = stack.top().GetReal();
            stack.pop();
            dCurPosX = stack.top().GetReal();
            stack.pop();
        }
        else if (strcmp(pszToken, "Tm") == 0)
        {
            dCurPosY = stack.top().GetReal();
            stack.pop();
            dCurPosX = stack.top().GetReal(); 
            stack.pop();
        }
        else if( strcmp( pszToken, "BT" ) == 0 ) 
        {
            bTextBlock   = true;     
            // BT does not reset font
            // dCurFontSize = 0.0;
            // pCurFont     = NULL;
        }
        else if( strcmp( pszToken, "ET" ) == 0 ) 
        {
            if( !bTextBlock ) 
                fprintf( stderr, "WARNING: Found ET without BT!\n" );
        }

        if( bTextBlock ) 
        {
            if( strcmp( pszToken, "Tf" ) == 0 ) 
            {
                dCurFontSize = stack.top().GetReal();
                stack.pop();
                PdfName fontName = stack.top().GetName();
                PdfObject* pFont = pPage->GetFromResources( PdfName("Font"), fontName );
                if( !pFont ) 
                {
                    PODOFO_RAISE_ERROR_INFO( ePdfError_InvalidHandle, "Cannot create font!" );
                }

                pCurFont = pDocument->GetFont( pFont );
                if( !pCurFont ) 
                {
                    fprintf( stderr, "WARNING: Unable to create font for object %i %i R\n",
                        pFont->Reference().ObjectNumber(),
                        pFont->Reference().GenerationNumber() );
                }
            }
            else if( strcmp( pszToken, "Tj" ) == 0 ||
                strcmp( pszToken, "'" ) == 0 ) 
            {
                AddTextElement( dCurPosX, dCurPosY, pCurFont, stack.top().GetString() );
                stack.pop();
            }
            else if( strcmp( pszToken, "\"" ) == 0 )
            {
                AddTextElement( dCurPosX, dCurPosY, pCurFont, stack.top().GetString() );
                stack.pop();
                stack.pop(); // remove char spacing from stack
                stack.pop(); // remove word spacing from stack
            }
            else if( strcmp( pszToken, "TJ" ) == 0 ) 
            {
                PdfArray array = stack.top().GetArray();
                stack.pop();

                for( int i=0; i<static_cast<int>(array.GetSize()); i++ ) 
                {
                    _RPT1(_CRT_WARN, " variant: %s", array[i].GetDataTypeString());
                    if(array[i].IsHexString()) {
                        if(!pCurFont) {
                            _RPT1(_CRT_WARN, " : Could not Get font!!%d\n", i);
                        }
                        else {
                            if(!pCurFont->GetEncoding()) {
                                _RPT1(_CRT_WARN, ": could not get encoding\n",0);
                            } else {
                                PdfString s = array[i].GetString();
                                _RPT1(_CRT_WARN, " : valid :%s   ", s.IsValid()?"yes":"not");
                                _RPT1(_CRT_WARN, " ;hex :%s   ", s.IsHex()?"yes":"not");
                                _RPT1(_CRT_WARN, " ;unicode: %s   ", s.IsUnicode()?"yes":"not");

                                PdfString unicode = pCurFont->GetEncoding()->ConvertToUnicode(s,pCurFont);
                                const char* szText = unicode.GetStringUtf8().c_str();
                                _RPT1(_CRT_WARN, " : %s\n", strlen(szText)> 0? szText: "nothing");

                            }

                        }
                    }
                    else if(array[i].IsNumber()) {
                        _RPT1(_CRT_WARN, " : %d\n", array[i].GetNumber());
                    }

                    if( array[i].IsString() )//|| array[i].IsHexString())
                        AddTextElement( dCurPosX, dCurPosY, pCurFont, array[i].GetString() );
                }
            }
        }
    }
    else if ( eType == ePdfContentsType_Variant )
    {
        stack.push( var );

        _RPT1(_CRT_WARN, " variant: %s\n", var.GetDataTypeString());
    }
    else
    {
        // Impossible; type must be keyword or variant
        PODOFO_RAISE_ERROR( ePdfError_InternalLogic );
    }
}

and for the code I get this output:

    BT
 variant: Name
 variant: Real
 Tf
 variant: Number
 variant: Number
 variant: Number
 rg
 variant: Real
 variant: Number
 variant: Number
 variant: Number
 variant: Real
 variant: Real
 Tm
 variant: Array
 TJ
 variant: HexString : valid :yes    ;hex :yes    ;unicode: not    : nothing
 variant: Number : -7
 variant: HexString : valid :yes    ;hex :yes    ;unicode: not    : nothing
 variant: Number : -15
 variant: HexString : valid :yes    ;hex :yes    ;unicode: not    : nothing
 variant: Number : -15
 variant: HexString : valid :yes    ;hex :yes    ;unicode: not    : nothing
 variant: Number : -11
 variant: HexString : valid :yes    ;hex :yes    ;unicode: not    : nothing
 variant: Number : -11
 variant: HexString : valid :yes    ;hex :yes    ;unicode: not    : nothing
 variant: Number : -19
 variant: HexString : valid :yes    ;hex :yes    ;unicode: not    : nothing
 variant: Number : -11
 variant: HexString : valid :yes    ;hex :yes    ;unicode: not    : nothing
 variant: Number : -15
 variant: HexString : valid :yes    ;hex :yes    ;unicode: not    : nothing
 variant: Number : -11
 variant: HexString : valid :yes    ;hex :yes    ;unicode: not    : nothing
 ET

The PDF stream object would like this (I'm sorry but I m not allowed to give you the pdf file):

    q
Q
q
Q
q
q
q
1 0 0 1 37.68 785.28 cm
91.92 0 0 31.44 0 0 cm
/Img1 Do
Q
Q
q
q
1 0 0 1 431.28 780.24 cm
42.72 0 0 7.2 0 0 cm
/Img2 Do
Q
Q
q
BT
/F1 8.88 Tf
0 0 0 rg
0.9998 0 0 1 377.28 704.4 Tm
[<0026>-7<004F>-15<004C>-15<0048>-11<0051>-11<0057>-19<0058>-11<004F>-15<0058>-11<004C>] TJ
ET
Q
q
1 0 0 1 0 0 cm
0.4799 w
0 0 0 RG
377.28 703.44 m
415.2 703.44 l
S
Q
q
BT
/F1 8.16 Tf
0 0 0 rg
0.9998 0 0 1 377.28 687.36 Tm
[<0030>9<0027>-13<002C>-16<0003>1<0026>-13<0032>13<0031>-13<0036>-9<0037>-6<0035>-13<0038>-13<0026>-13<0037>-6<0003>1<0037>-6<0035>-13<0024>-9<0031>-13<0036>-9<0003>1<0028>-9<003B>-9<0033>-9<0028>-9<0035>-13<0037>-6<0003>1<0036>-9<0035>-13<002F>] TJ
ET

The PDF file should be found here or here

Doublereed answered 17/5, 2013 at 10:43 Comment(0)
C
5

1. The answer to the original question for which the central code part was this:

else if( strcmp( pszToken, "TJ" ) == 0 ) 
{
    PdfArray array = stack.top().GetArray();
    stack.pop();
                
    for( int i=0; i<static_cast<int>(array.GetSize()); i++ ) 
    {
        if( array[i].IsString() )
            AddTextElement( dCurPosX, dCurPosY, pCurFont, array[i].GetString() );
        }
    }
}

and the question was:

I've noticed that the the array[i].IsString() never gets to be true. Is this the right way to get the text from a TJ operator?

The short answer:

Hexadecimal strings in PoDoFo PdfVariants are recognized by IsHexString() instead of IsString(). Thus, you have to test for both string flavors:

if( array[i].IsString() || array[i].IsHexString() )

The long answer:

There are two basic flavors of strings in PDF:

String objects shall be written in one of the following two ways:

  • As a sequence of literal characters enclosed in parentheses ( ) (using LEFT PARENTHESIS (28h) and RIGHT PARENThESIS (29h)); see 7.3.4.2, "Literal Strings."

  • As hexadecimal data enclosed in angle brackets < > (using LESS-THAN SIGN (3Ch) and GREATER-THAN SIGN (3Eh)); see 7.3.4.3, "Hexadecimal Strings."

(section 7.3.4 in ISO 32000-1)

PoDoFo models both using the PdfString class which in the context of parsing often is wrapped inside a PdfVariant or even more specifically in a PdfObject.

When determining the type of the object contained in it, though, the PdfVariant differentiates between literal strings and hexadecimal strings:

/** \returns true if this variant is a string (i.e. GetDataType() == ePdfDataType_String)
 */
inline bool IsString() const { return GetDataType() == ePdfDataType_String; }

/** \returns true if this variant is a hex-string (i.e. GetDataType() == ePdfDataType_HexString)
 */
inline bool IsHexString() const { return GetDataType() == ePdfDataType_HexString; }

(PdfVariant.h)

The type of the PdfString inside a PdfVariant is determined when wrapped:

PdfVariant::PdfVariant( const PdfString & rsString )
{
    Init();
    Clear();

    m_eDataType  = rsString.IsHex() ? ePdfDataType_HexString : ePdfDataType_String;
    m_Data.pData = new PdfString( rsString );
}

(PdfVariant.cpp)

In case of your TJ argument array components, the strings in question are read as hexadecimal strings.

In your code, therefore, you have to consider both IsHexString() and IsString():

if( array[i].IsString() || array[i].IsHexString() )

2. Thereafter, and after the code was revised to check using IsHexString(), the question centered on

PdfString s = array[i].GetString();
_RPT1(_CRT_WARN, " : valid :%s   ", s.IsValid()?"yes":"not");
_RPT1(_CRT_WARN, " ;hex :%s   ", s.IsHex()?"yes":"not");
_RPT1(_CRT_WARN, " ;unicode: %s   ", s.IsUnicode()?"yes":"not");

PdfString unicode = pCurFont->GetEncoding()->ConvertToUnicode(s,pCurFont);
const char* szText = unicode.GetStringUtf8().c_str();
_RPT1(_CRT_WARN, " : %s\n", strlen(szText)> 0? szText: "nothing");

and the problem (as stated in comments) that

the s.GetLength() returns 2 and unicode.GetLength() returns 0, the conversion didn't work?

An analysis of the example documents Document2.pdf shows that the document in question does contain the required informations for text extraction. The only font present in that document which is used with hexadecimal encoding is /F1, and its font dictionary does contain an appropriate /ToUnicode map for reliable text extraction.

Unfortunately, though, PoDoFo does not yet seem to have implemented properly using that map for parsing purposes. I do not see it anywhere retrieving the /ToUnicode map to make the contained informations available for text parsing. It looks like PoDoFo cannot be used to properly parse the text of documents using Type0 aka composite font.

Coniine answered 17/5, 2013 at 11:59 Comment(13)
thank you, you're truly an inspiration, I've update my post to include another issue with regards to font.Doublereed
Are you sure you should have that '!' in if(!pCurFont) {? Because... in the first code block (executed if !pCurFont) you use pCurFont, and in the second code block (executed if pCurFont) you warn "Could not Get font!!"Coniine
I 've took it from here :libpodofo.sourcearchive.com/documentation/0.8.4/… , line 153. And I'm not sure about that :DDoublereed
Well, you broke it. The code you linked prints a warning if( !pCurFont ) but you print the warning in the else branch!Coniine
I sure did! Thank you for pointing that out, however I think I've fixed it now but still no text coming outDoublereed
I'm looking at it right now, and the s.GetLength() returns 2 and unicode.GetLength() returns 0, the conversion didn't work?Doublereed
Well, maybe the PDF font objects do not contain the information required to extract text (putting the required information into a PDF is not mandatory!). If you could provide the PDF for inspection, stackoverflow members could inspect it. As a first check, though: Can Adobe Reader / Acrobat extract (copy&paste) the text in question? It would be a bad sign ff they can't.Coniine
it might have some connection to this mail-archive.com/[email protected]/…Doublereed
I'll look into that later, currently I'm on the road.Coniine
The document you provided does not contain the information you provided in your file upload. May I assume that your attempts to parse those files resulted in the same problems as your attempts with your original file?Coniine
The document is created in the same way, it only has different text on the page, it uses the same language and it is created the same way and I'm having the same problems in decoding the characters. I've noticed that if I use special characters in the PDF I get this current encoding style ... Nevertheless what do I do with the numbers between the characters? do they influence the calculation of the tx and ty or should I just disregard them in concern to that.Doublereed
I inspected the example document you provided, cf. the edit of my answer. It does contain the information required for text extraction, and other PDF processors (e.g. Adobe Reader and the Java iText library) can extract its text properly. Thus I searched the PoDoFo code for a routine extracting the /ToUnicode map (required to parse text in a composite font) but found none that is called during text parsing. Thus, I'm afraid PoDoFo cannot yet be used for extracting that text. To be sure, though, you might want to make that a question in its own right to activate people who tried that, too.Coniine
Thank you very much, you have helped me a lot, I will make the question.Doublereed

© 2022 - 2024 — McMap. All rights reserved.