Extract complete hyphenated word from .pdf using acrobat.tlb in .NET VB or C#

I am parsing a .pdf using the acrobat.tlb library.

Hyphenated words are being split at the hyphens into separate words, and the hyphens are removed.

e.g. ABC-123-XXX-987

Parses as:
ABC
123
XXX
987

If I parse the text using iTextSharp, I get the whole string exactly as displayed in the file, which is the behaviour I want. However, I need to highlight these strings (serial numbers) in the .pdf, and iTextSharp does not place the highlight in the correct location, hence acrobat.tlb.

I am using this code, from here: http://www.vbforums.com/showthread.php?561501-RESOLVED-2003-How-to-highlight-text-in-pdf

        ' filey = "*your full file name including directory here*"
        AcroExchApp = CreateObject("AcroExch.App")
        AcroExchAVDoc = CreateObject("AcroExch.AVDoc")
        ' Open the [strfiley] pdf file
        AcroExchAVDoc.Open(filey, "")       

        ' Get the PDDoc associated with the open AVDoc
        AcroExchPDDoc = AcroExchAVDoc.GetPDDoc
        sustext = "accessorizes"
        suktext = "accessorises" 
        ' get JavaScript Object
        ' note jso is related to PDDoc of a PDF,
        jso = AcroExchPDDoc.GetJSObject
        ' count
        nCount = 0
        nCount1 = 0
        gbStop = False
        bUSCnt = False
        bUKCnt = False
        ' search for the text
        If Not jso Is Nothing Then
            ' total number of pages
            nPages = jso.numpages           

                ' Go through pages
                For i = 0 To nPages - 1
                    ' check each word in a page
                    nWords = jso.getPageNumWords(i)
                    For j = 0 To nWords - 1
                        ' get a word

                        word = Trim(CStr(jso.getPageNthWord(i, j)))
                        'If VarType(word) = VariantType.String Then
                        If word <> "" Then
                            ' compare the word with what the user wants
                            If Trim(sustext) <> "" Then
                                result = StrComp(word, sustext, vbTextCompare)
                                ' if same
                                If result = 0 Then
                                    nCount = nCount + 1
                                    If bUSCnt = False Then
                                        iUSCnt = iUSCnt + 1
                                        bUSCnt = True
                                    End If
                                End If
                            End If
                            If suktext <> "" Then
                                result1 = StrComp(word, suktext, vbTextCompare)
                                ' if same
                                If result1 = 0 Then
                                    nCount1 = nCount1 + 1
                                    If bUKCnt = False Then
                                        iUKCnt = iUKCnt + 1
                                        bUKCnt = True
                                    End If
                                End If
                            End If
                        End If
                    Next j
                Next i
            jso = Nothing
        End If

The code does the job of highlighting the text, but the For loop that reads each 'word' via getPageNthWord splits the hyphenated string into its component parts.

        For i = 0 To nPages - 1
            ' check each word in a page
            nWords = jso.getPageNumWords(i)
            For j = 0 To nWords - 1
                ' get a word
                word = Trim(CStr(jso.getPageNthWord(i, j)))

Does anyone know how to maintain the whole string using acrobat.tlb? My quite extensive searches have drawn a blank.

Contradict answered 12/9, 2018 at 9:0 Comment(0)

I understand that highlighting text with iTextSharp is troublesome, because you have to draw a rectangle yourself and it gets complicated, but acrobat.tlb has drawbacks too: it is not free, so few people are likely to use it. A better option for the rest of us is the free and easy-to-use Spire.Pdf, which you can get as a NuGet package. The code does the following:

  • Opens the .pdf
  • Reads the text of each page
  • Finds matches using a regular expression
  • Saves them to a list of strings, eliminating duplicates
  • For each string in this list, searches the page and highlights every occurrence

Code:

'at the top of the file
Imports System.Collections.Generic
Imports System.Drawing
Imports System.Linq
Imports System.Text.RegularExpressions
Imports Spire.Pdf
Imports Spire.Pdf.General.Find

'inside a method
Dim pdf As PdfDocument = New PdfDocument("Path")
'four hyphen-separated groups of three upper-case letters or digits
Dim pattern As String = "[A-Z0-9]{3}-[A-Z0-9]{3}-[A-Z0-9]{3}-[A-Z0-9]{3}"
Dim matches As MatchCollection

Dim result As PdfTextFind() = Nothing
Dim matchList As New List(Of String)

For Each page As PdfPageBase In pdf.Pages
    'get text from the current page only
    Dim content As String = page.ExtractText()

    'find matches
    matches = Regex.Matches(content, pattern, RegexOptions.None)

    matchList.Clear()

    'Assign each match to a string list.
    For Each match As Match In matches
        matchList.Add(match.Value)
    Next

    'Eliminate duplicates.
    matchList = matchList.Distinct.ToList

    'for each string in list
    For i = 0 To matchList.Count - 1
        'find all occurrences of matchList(i) string in page and highlight it
        result = page.FindText(matchList(i)).Finds

        For Each find As PdfTextFind In result
            find.ApplyHighLight(Color.BlueViolet) 'you can set your color preference
        Next

    Next 'matchList

Next 'page

pdf.SaveToFile("New Path")

pdf.Close()
pdf.Dispose()

I am not very good with regular expressions, so feel free to substitute your own pattern. That was my approach, anyway.
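
If you do write your own pattern, it may help to test it on plain text before involving any PDF library. The following is a minimal sketch, assuming the serial numbers are always four hyphen-separated groups of three upper-case letters or digits; SerialPattern, FindSerials and the sample sentence are made up for illustration. Note that a comma inside a character class is a literal comma, so a class written as [A-Z,0-9] would also accept commas, which is why the sketch uses [A-Z0-9].

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module SerialPatternDemo
    ' Hypothetical pattern: four hyphen-separated groups of three
    ' upper-case letters or digits, bounded by word boundaries.
    Public Const SerialPattern As String = "\b[A-Z0-9]{3}(?:-[A-Z0-9]{3}){3}\b"

    ' Returns the distinct serial-number-like strings found in the text.
    Public Function FindSerials(text As String) As List(Of String)
        Return Regex.Matches(text, SerialPattern).
            Cast(Of Match)().
            Select(Function(m) m.Value).
            Distinct().
            ToList()
    End Function

    Sub Main()
        ' The second occurrence is a duplicate; the third contains a comma
        ' and should be rejected.
        Dim sample As String =
            "Unit ABC-123-XXX-987 replaced unit ABC-123-XXX-987, not A,C-123-XXX-987."
        For Each serial As String In FindSerials(sample)
            Console.WriteLine(serial) ' prints ABC-123-XXX-987 once
        Next
    End Sub
End Module
```

The resulting list can be passed to whatever search-and-highlight call your PDF library provides, one string at a time.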

Talent answered 15/9, 2018 at 18:16 Comment(3)
Thanks for your suggestion and code. Unfortunately, the free version of Spire.PDF for .NET is limited to 10 pages per file, which makes it unsuitable for my requirements. - Contradict
@Contradict I just saw that! I guess few people, if any, use or have used the acrobat.tlb library. Maybe you will get lucky, who knows. What problems did you have with iTextSharp? - Eyeleteer
I hope I get lucky! I have the highlighting kind of working with iTextSharp, but it places the highlight in the wrong location, and I can't see any problems with my code. I am now trying PDF Clown for .NET, which looks similar to Spire PDF. - Contradict
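
On the original acrobat.tlb question, one avenue that nobody in the thread tried: the Acrobat JavaScript method getPageNthWord takes an optional third argument, bStrip, which defaults to true and strips punctuation and white space from the returned word. Passing False, as in jso.getPageNthWord(i, j, False), should leave the hyphens attached, so the pieces can be glued back together. The sketch below rests on that untested assumption: JoinHyphenated and the sample words are hypothetical, and it runs on a plain string array rather than on a real PDF.

```vbnet
Imports System
Imports System.Collections.Generic

Module HyphenJoinDemo
    ' Joins consecutive raw words whenever a word ends with a hyphen.
    ' rawWords is assumed to hold what getPageNthWord(page, n, False)
    ' returns, i.e. words with trailing punctuation and white space kept.
    Public Function JoinHyphenated(rawWords As IEnumerable(Of String)) As List(Of String)
        Dim joined As New List(Of String)
        Dim pending As String = ""
        For Each raw As String In rawWords
            Dim piece As String = raw.Trim()
            If piece = "" Then Continue For
            pending &= piece
            ' A piece that does not end in a hyphen completes the word.
            If Not piece.EndsWith("-", StringComparison.Ordinal) Then
                joined.Add(pending)
                pending = ""
            End If
        Next
        ' A trailing fragment that ended in a hyphen is kept as it is.
        If pending <> "" Then joined.Add(pending)
        Return joined
    End Function

    Sub Main()
        ' Simulated unstripped words for "Serial ABC-123-XXX-987 found"
        Dim raw As String() = {"Serial ", "ABC-", "123-", "XXX-", "987 ", "found"}
        Console.WriteLine(String.Join("|", JoinHyphenated(raw)))
        ' prints Serial|ABC-123-XXX-987|found
    End Sub
End Module
```

Unstripped words may also carry other punctuation, such as a trailing comma, which would need trimming before comparison. To highlight a joined word you would also have to remember the indices of its first and last piece; that bookkeeping is left out here.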
