Getting Coordinates of string using ITextExtractionStrategy and LocationTextExtractionStrategy in Itextsharp

I have a PDF file that I am reading into a string using ITextExtractionStrategy. From that string I take a substring such as "My name is XYZ" and need the rectangular coordinates of that substring in the PDF file, but I have not been able to get them. From searching I learned about LocationTextExtractionStrategy, but I don't understand how to use it to get the coordinates.

Here is the code:

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);

string getcoordinate="My name is XYZ";

How can I get the rectangular coordinates of this substring using iTextSharp?

Please help.

Screamer answered 28/5, 2014 at 11:5 Comment(5)
You might want to follow the words of this answer or this answer. BTW, what is the purpose of currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));? - Siobhan
@Siobhan I already went through both answers, but I am sorry to say that I was not able to implement them. I don't know where to start or how to use the substring with those methods. If you could guide me, it would be a lifesaver. - Screamer
See this post for changing line 3 to something that's correct: https://mcmap.net/q/376801/-reading-pdf-content-using-itextsharp-in-c - Tittle
@ChrisHaas I have changed that as suggested. Now will you please tell me how to solve the problem described in the post? - Screamer
@ChrisHaas Do you have any suggestions for this issue? It would be helpful. - Screamer

Here is a very, very simple version of an implementation.

Before implementing it is very important to know that PDFs have zero concept of "words", "paragraphs", "sentences", etc. Also, text within a PDF is not necessarily laid out left to right and top to bottom and this has nothing to do with non-LTR languages. The phrase "Hello World" could be written into the PDF as:

Draw H at (10, 10)
Draw ell at (20, 10)
Draw rld at (90, 10)
Draw o Wo at (50, 20)

It could also be written as

Draw Hello World at (10,10)

The ITextExtractionStrategy interface that you need to implement has a method called RenderText that gets called once for every chunk of text within a PDF. Notice I said "chunk" and not "word". In the first example above the method would be called four times for those two words. In the second example it would be called once for those two words. This is the very important part to understand. PDFs don't have words and because of this, iTextSharp doesn't have words either. The "word" part is 100% up to you to solve.

Also along these lines, as I said above, PDFs don't have paragraphs. The reason to be aware of this is that PDFs cannot wrap text to a new line. Any time you see something that looks like a paragraph return, you are actually seeing a brand new text drawing command with a different y coordinate than the previous line. See this for further discussion.
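To make the "chunks, not words" point concrete, here is a minimal sketch of putting position-tagged chunks back into reading order. It does not use iTextSharp at all: PositionedChunk and ChunkAssembler are hypothetical names invented for this illustration, and the 0.5 tolerance is an arbitrary assumption. It also does not insert spaces where there is a horizontal gap between chunks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one text chunk and the start of its baseline.
public class PositionedChunk
{
    public string Text;
    public float X;
    public float Y;

    public PositionedChunk(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class ChunkAssembler
{
    // Groups chunks whose Y values lie within yTolerance of the first chunk
    // of a line, orders lines top to bottom (larger Y first, since the PDF
    // origin is normally bottom-left) and chunks left to right, then
    // concatenates the text of each line.
    public static List<string> ToLines(IEnumerable<PositionedChunk> chunks, float yTolerance = 0.5f)
    {
        var lines = new List<List<PositionedChunk>>();
        foreach (var chunk in chunks.OrderByDescending(c => c.Y))
        {
            var current = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (current != null && Math.Abs(current[0].Y - chunk.Y) <= yTolerance)
            {
                current.Add(chunk);
            }
            else
            {
                lines.Add(new List<PositionedChunk> { chunk });
            }
        }

        return lines
            .Select(line => string.Concat(line.OrderBy(c => c.X).Select(c => c.Text)))
            .ToList();
    }
}
```

If the chunks "H", "ell", "o Wo" and "rld" all share one baseline, ToLines returns them as the single line "Hello World" no matter what order they were drawn in.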

The code below is a very simple implementation. For it I'm subclassing LocationTextExtractionStrategy which already implements ITextExtractionStrategy. On each call to RenderText() I find the rectangle of the current chunk (using Mark's code here) and store it for later. I'm using this simple helper class to store these chunks and rectangles:

//Helper class that stores our rectangle and text
public class RectAndText {
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public RectAndText(iTextSharp.text.Rectangle rect, String text) {
        this.Rect = rect;
        this.Text = text;
    }
}

And here's the subclass:

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}

And finally an implementation of the above:

//Our test file
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");

//Create our test file, nothing special
using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
    using (var doc = new Document()) {
        using (var writer = PdfWriter.GetInstance(doc, fs)) {
            doc.Open();

            doc.Add(new Paragraph("This is my sample file"));

            doc.Close();
        }
    }
}

//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();

//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
    var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}

//Loop through each chunk found
foreach (var p in t.myPoints) {
    Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}

I can't stress enough that the above does not take "words" into account, that'll be up to you. The TextRenderInfo object that gets passed into RenderText has a method called GetCharacterRenderInfos() that you might be able to use to get more information. You might also want to use GetBaseline() instead of GetDescentLine() if you don't care about descenders in the font.
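Because a search string can span several chunks, one option is to buffer per-character boxes and search the combined text. The following is only a sketch under stated assumptions: the characters of one line have already been collected in reading order, and CharBox and SubstringLocator are hypothetical names, not iTextSharp types. With iTextSharp the four numbers would come from the descent and ascent lines of each character-level TextRenderInfo.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-character bounding box.
public class CharBox
{
    public char Value;
    public float Left, Bottom, Right, Top;

    public CharBox(char value, float left, float bottom, float right, float top)
    {
        Value = value;
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }
}

public static class SubstringLocator
{
    // Returns { left, bottom, right, top } enclosing the first occurrence of
    // searchText in the buffered characters, or null if it is not found.
    public static float[] Find(IList<CharBox> chars, string searchText)
    {
        if (string.IsNullOrEmpty(searchText)) return null;

        // Rebuild the line text; index i in the string maps to chars[i].
        var text = new string(chars.Select(c => c.Value).ToArray());
        int start = text.IndexOf(searchText, StringComparison.Ordinal);
        if (start < 0) return null;

        // Union of the boxes of every matched character.
        var hit = chars.Skip(start).Take(searchText.Length).ToList();
        return new[]
        {
            hit.Min(c => c.Left),
            hit.Min(c => c.Bottom),
            hit.Max(c => c.Right),
            hit.Max(c => c.Top)
        };
    }
}
```

Taking the min and max over all matched characters, rather than only the first and last, keeps the box correct when characters in the match have different heights.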

EDIT

(I had a great lunch so I'm feeling a little more helpful.)

Here's an updated version of MyLocationTextExtractionStrategy that does what my comments below say, namely it takes a string to search for and searches each chunk for that string. For all the reasons listed this will not work in some/many/most/all cases. If the substring exists multiple times in a single chunk it will also only return the first instance. Ligatures and diacritics could also mess with this.

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //The string that we're searching for
    public String TextToSearchFor { get; set; }

    //How to compare strings
    public System.Globalization.CompareOptions CompareOptions { get; set; }

    public MyLocationTextExtractionStrategy(String textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None) {
        this.TextToSearchFor = textToSearchFor;
        this.CompareOptions = compareOptions;
    }

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //See if the current chunk contains the text
        var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);

        //If not found bail
        if (startPosition < 0) {
            return;
        }

        //Grab the individual characters
        var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();

        //Grab the first and last character
        var firstChar = chars.First();
        var lastChar = chars.Last();


        //Get the bounding box for the chunk of text
        var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
        var topRight = lastChar.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
    }
}

You would use this the same as before but now the constructor has a single required parameter:

var t = new MyLocationTextExtractionStrategy("sample");
Tittle answered 28/5, 2014 at 15:11 Comment(6)
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(60.6755f, 749.172f, 94.0195f, 735.3f); This is the line of code where I need to use the coordinates of the substring. I implemented your code and got the result 36x785.516. How can I use that here? - Screamer
The code posted shows how to use the tools but tells you many times that "words" or "substrings" don't exist in a PDF and therefore iText doesn't support them either. There's no guarantee that the text you are searching for is written in the order that you are searching. However, if you want to assume it is, add a constructor to MyLocationTextExtractionStrategy that takes your search text, then search for that text with renderInfo.GetText() and then use GetCharacterRenderInfos() to get your bounding boxes. - Tittle
With the example provided above, when renderInfo.GetText() is called, only one letter at a time is returned, so I would never find the text I am searching for. Any ideas? Thanks. - Rotl
@MariusPopa, I recommend re-reading the first couple of paragraphs of this answer as well as the last paragraph before the EDIT mark which tell you exactly what you found. You'll need to buffer all of the renderInfo objects and then perform some logic across that data set. - Tittle
@ChrisHaas, any idea how to migrate this code to iText 7? They have a lot of breaking changes. Thanks - Haight
Looks like the iText people like to rename stuff very often too. Do you mind telling me what the replacement for LocationTextExtractionStrategy.RenderText is in iText 8? - Leanneleanor

It's an old question, but I'm leaving my response here as I could not find a correct answer on the web.

As Chris Haas has explained, dealing with words is not easy because iText deals with chunks. The code that Chris posted failed in most of my tests because a word is normally split across different chunks (he warns about that in the post).

To solve that problem, here is the strategy I used:

  1. Split chunks into characters (actually one TextRenderInfo object per character)
  2. Group the characters by line. This is not straightforward, as you have to deal with chunk alignment.
  3. Search each line for the word you need to find

Here is the code. I tested it with several documents and it works pretty well, but it could fail in some scenarios because this chunk -> words transformation is a bit tricky.

Hope it helps someone.

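using System;
using System.Collections.Generic;
using System.Linq;
using iTextSharp.text.pdf.parser;

class LocationTextExtractionStrategyEx : LocationTextExtractionStrategy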
{
    private List<LocationTextExtractionStrategyEx.ExtendedTextChunk> m_DocChunks = new List<ExtendedTextChunk>();
    private List<LocationTextExtractionStrategyEx.LineInfo> m_LinesTextInfo = new List<LineInfo>();
    public List<SearchResult> m_SearchResultsList = new List<SearchResult>();
    private String m_SearchText;
    public const float PDF_PX_TO_MM = 0.3528f;
    public float m_PageSizeY;


    public LocationTextExtractionStrategyEx(String sSearchText, float fPageSizeY)
        : base()
    {
        this.m_SearchText = sSearchText;
        this.m_PageSizeY = fPageSizeY;
    }

    private void searchText()
    {
        foreach (LineInfo aLineInfo in m_LinesTextInfo)
        {
            int iIndex = aLineInfo.m_Text.IndexOf(m_SearchText);
            if (iIndex != -1)
            {
                TextRenderInfo aFirstLetter = aLineInfo.m_LineCharsList.ElementAt(iIndex);
                SearchResult aSearchResult = new SearchResult(aFirstLetter, m_PageSizeY);
                this.m_SearchResultsList.Add(aSearchResult);
            }
        }
    }

    private void groupChunksbyLine()
    {                     
        LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk1 = null;
        LocationTextExtractionStrategyEx.LineInfo textInfo = null;
        foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk2 in this.m_DocChunks)
        {
            if (textChunk1 == null)
            {                    
                textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                this.m_LinesTextInfo.Add(textInfo);
            }
            else if (textChunk2.sameLine(textChunk1))
            {                      
                textInfo.appendText(textChunk2);
            }
            else
            {                                        
                textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                this.m_LinesTextInfo.Add(textInfo);
            }
            textChunk1 = textChunk2;
        }
    }

    public override string GetResultantText()
    {
        groupChunksbyLine();
        searchText();
        //In this case the return value is not useful
        return "";
    }

    public override void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment baseline = renderInfo.GetBaseline();
        //Create ExtendedChunk
        ExtendedTextChunk aExtendedChunk = new ExtendedTextChunk(renderInfo.GetText(), baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), renderInfo.GetCharacterRenderInfos().ToList());
        this.m_DocChunks.Add(aExtendedChunk);
    }

    public class ExtendedTextChunk
    {
        public string m_text;
        private Vector m_startLocation;
        private Vector m_endLocation;
        private Vector m_orientationVector;
        private int m_orientationMagnitude;
        private int m_distPerpendicular;           
        private float m_charSpaceWidth;           
        public List<TextRenderInfo> m_ChunkChars;


        public ExtendedTextChunk(string txt, Vector startLoc, Vector endLoc, float charSpaceWidth,List<TextRenderInfo> chunkChars)
        {
            this.m_text = txt;
            this.m_startLocation = startLoc;
            this.m_endLocation = endLoc;
            this.m_charSpaceWidth = charSpaceWidth;                
            this.m_orientationVector = this.m_endLocation.Subtract(this.m_startLocation).Normalize();
            this.m_orientationMagnitude = (int)(Math.Atan2((double)this.m_orientationVector[1], (double)this.m_orientationVector[0]) * 1000.0);
            this.m_distPerpendicular = (int)this.m_startLocation.Subtract(new Vector(0.0f, 0.0f, 1f)).Cross(this.m_orientationVector)[2];                
            this.m_ChunkChars = chunkChars;

        }


        public bool sameLine(LocationTextExtractionStrategyEx.ExtendedTextChunk textChunkToCompare)
        {
            return this.m_orientationMagnitude == textChunkToCompare.m_orientationMagnitude && this.m_distPerpendicular == textChunkToCompare.m_distPerpendicular;
        }


    }

    public class SearchResult
    {
        public int iPosX;
        public int iPosY;

        public SearchResult(TextRenderInfo aCharcter, float fPageSizeY)
        {
            //Get position of upperLeft coordinate
            Vector vTopLeft = aCharcter.GetAscentLine().GetStartPoint();
            //PosX
            float fPosX = vTopLeft[Vector.I1]; 
            //PosY
            float fPosY = vTopLeft[Vector.I2];
            //Transform to mm and get y from top of page
            iPosX = Convert.ToInt32(fPosX * PDF_PX_TO_MM);
            iPosY = Convert.ToInt32((fPageSizeY - fPosY) * PDF_PX_TO_MM);
        }
    }

    public class LineInfo
    {            
        public string m_Text;
        public List<TextRenderInfo> m_LineCharsList;

        public LineInfo(LocationTextExtractionStrategyEx.ExtendedTextChunk initialTextChunk)
        {                
            this.m_Text = initialTextChunk.m_text;
            this.m_LineCharsList = initialTextChunk.m_ChunkChars;
        }

        public void appendText(LocationTextExtractionStrategyEx.ExtendedTextChunk additionalTextChunk)
        {
            m_LineCharsList.AddRange(additionalTextChunk.m_ChunkChars);
            this.m_Text += additionalTextChunk.m_text;
        }
    }
}
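The searchText method above calls IndexOf once per line, so it reports at most one match per line. A minimal sketch of collecting every match index is shown below; OccurrenceFinder is a hypothetical helper name, and treating matches as non-overlapping is an assumption of this sketch. Each returned index could then be passed to m_LineCharsList.ElementAt in the same way the single index is used above.

```csharp
using System;
using System.Collections.Generic;

public static class OccurrenceFinder
{
    // Returns the start index of every non-overlapping occurrence of
    // searchText within lineText, in ascending order.
    public static List<int> FindAll(string lineText, string searchText)
    {
        var indexes = new List<int>();
        if (string.IsNullOrEmpty(lineText) || string.IsNullOrEmpty(searchText))
        {
            return indexes;
        }

        int index = lineText.IndexOf(searchText, StringComparison.Ordinal);
        while (index != -1)
        {
            indexes.Add(index);
            // Resume the search just past the match that was found.
            index = lineText.IndexOf(searchText, index + searchText.Length, StringComparison.Ordinal);
        }
        return indexes;
    }
}
```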
Desiraedesire answered 8/10, 2015 at 11:26 Comment(3)
The code above only locates the searched text once per line. If you need to find multiple appearances per line, iterate the search in the searchText function. - Desiraedesire
Great post! Thanks - Pena
Hello, can you add an example of how to use this code? Thanks - Haight

I know this is a really old question, but below is what I ended up doing. Just posting it here hoping that it will be useful for someone else.

The following code will tell you the starting coordinates of the line(s) that contain the search text. It should not be hard to modify it to give the positions of words. Note: I tested this on iTextSharp 5.5.11.0, and it won't work on some older versions.

As mentioned above, PDFs have no concept of words, lines, or paragraphs. But I found that LocationTextExtractionStrategy does a very good job of splitting lines and words, so my solution is based on that.

DISCLAIMER:

This solution is based on https://github.com/itext/itextsharp/blob/develop/src/core/iTextSharp/text/pdf/parser/LocationTextExtractionStrategy.cs, and that file has a comment saying that it's a dev preview. So this might not work in the future.

Anyway here's the code.

using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

namespace Logic
{
    public class LocationTextExtractionStrategyWithPosition : LocationTextExtractionStrategy
    {
        private readonly List<TextChunk> locationalResult = new List<TextChunk>();

        private readonly ITextChunkLocationStrategy tclStrat;

        public LocationTextExtractionStrategyWithPosition() : this(new TextChunkLocationStrategyDefaultImp())
        {
        }

        /**
         * Creates a new text extraction renderer, with a custom strategy for
         * creating new TextChunkLocation objects based on the input of the
         * TextRenderInfo.
         * @param strat the custom strategy
         */
        public LocationTextExtractionStrategyWithPosition(ITextChunkLocationStrategy strat)
        {
            tclStrat = strat;
        }


        private bool StartsWithSpace(string str)
        {
            if (str.Length == 0) return false;
            return str[0] == ' ';
        }


        private bool EndsWithSpace(string str)
        {
            if (str.Length == 0) return false;
            return str[str.Length - 1] == ' ';
        }

        /**
         * Filters the provided list with the provided filter
         * @param textChunks a list of all TextChunks that this strategy found during processing
         * @param filter the filter to apply.  If null, filtering will be skipped.
         * @return the filtered list
         * @since 5.3.3
         */

        private List<TextChunk> filterTextChunks(List<TextChunk> textChunks, ITextChunkFilter filter)
        {
            if (filter == null)
            {
                return textChunks;
            }

            var filtered = new List<TextChunk>();

            foreach (var textChunk in textChunks)
            {
                if (filter.Accept(textChunk))
                {
                    filtered.Add(textChunk);
                }
            }

            return filtered;
        }

        public override void RenderText(TextRenderInfo renderInfo)
        {
            LineSegment segment = renderInfo.GetBaseline();
            if (renderInfo.GetRise() != 0)
            { // remove the rise from the baseline - we do this because the text from a super/subscript render operations should probably be considered as part of the baseline of the text the super/sub is relative to 
                Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
                segment = segment.TransformBy(riseOffsetTransform);
            }
            TextChunk tc = new TextChunk(renderInfo.GetText(), tclStrat.CreateLocation(renderInfo, segment));
            locationalResult.Add(tc);
        }


        public IList<TextLocation> GetLocations()
        {

            var filteredTextChunks = filterTextChunks(locationalResult, null);
            filteredTextChunks.Sort();

            TextChunk lastChunk = null;

            var textLocations = new List<TextLocation>();

            foreach (var chunk in filteredTextChunks)
            {

                if (lastChunk == null)
                {
                    //initial
                    textLocations.Add(new TextLocation
                    {
                        Text = chunk.Text,
                        X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
                        Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
                    });

                }
                else
                {
                    if (chunk.SameLine(lastChunk))
                    {
                        var text = "";
                        // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                        if (IsChunkAtWordBoundary(chunk, lastChunk) && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text))
                            text += ' ';

                        text += chunk.Text;

                        textLocations[textLocations.Count - 1].Text += text;

                    }
                    else
                    {

                        textLocations.Add(new TextLocation
                        {
                            Text = chunk.Text,
                            X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
                            Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
                        });
                    }
                }
                lastChunk = chunk;
            }

            //now find the location(s) with the given texts
            return textLocations;

        }

    }

    public class TextLocation
    {
        public float X { get; set; }
        public float Y { get; set; }

        public string Text { get; set; }
    }
}

How to call the method:

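// Requires: using System.Collections.Generic; using System.Linq;
//           using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
IList<TextLocation> res;
using (var reader = new PdfReader(inputPdf))
{
    var parser = new PdfReaderContentParser(reader);
    var strategy = parser.ProcessContent(pageNumber, new LocationTextExtractionStrategyWithPosition());
    //Declared outside the using block so it is still in scope for the query below
    res = strategy.GetLocations();
}
var searchResult = res.Where(p => p.Text.Contains(searchText)).OrderBy(p => p.Y).Reverse().ToList();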
  • inputPdf is a byte[] that has the pdf data
  • pageNumber is the page where you want to search in
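The X and Y values returned above are converted from PDF points to millimetres and, assuming the page uses the usual bottom-left origin, Y is measured upwards from the bottom of the page. As a small sketch of the arithmetic involved, the helper below converts points to millimetres and flips a Y value so that it is measured from the top instead. PdfUnits and its members are hypothetical names for illustration, and the conversion assumes the default user-space unit of 1/72 inch.

```csharp
using System;

public static class PdfUnits
{
    // By default one PDF user-space unit is 1/72 inch, and one inch is 25.4 mm.
    public const float MillimetersPerPoint = 25.4f / 72f;

    public static float PointsToMillimeters(float points)
    {
        return points * MillimetersPerPoint;
    }

    // Converts a Y measured from the bottom of the page into a distance
    // from the top of the page; both values are in the same unit.
    public static float FromTop(float yFromBottom, float pageHeight)
    {
        return pageHeight - yFromBottom;
    }
}
```

The factor 25.4 / 72 is roughly 0.3528, which matches the PDF_PX_TO_MM constant used in the earlier answer.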
Hexachord answered 9/6, 2017 at 6:11 Comment(1)
Hi Amila. I googled for this issue for hours, made long trials but couldn't find a solution. I was thinking to give up but saw then your video. Thank you so much for this great solution. I have tested it and it works great !!! Best greetings :-) - Atrabilious

Here is how you use LocationTextExtractionStrategy in VB.NET.

Class definition:

Class TextExtractor
    Inherits LocationTextExtractionStrategy
    Implements iTextSharp.text.pdf.parser.ITextExtractionStrategy
    Public oPoints As IList(Of RectAndText) = New List(Of RectAndText)
    Public Overrides Sub RenderText(renderInfo As TextRenderInfo) 'Implements IRenderListener.RenderText
        MyBase.RenderText(renderInfo)

        Dim bottomLeft As Vector = renderInfo.GetDescentLine().GetStartPoint()
        Dim topRight As Vector = renderInfo.GetAscentLine().GetEndPoint() 'GetBaseline

        Dim rect As Rectangle = New Rectangle(bottomLeft(Vector.I1), bottomLeft(Vector.I2), topRight(Vector.I1), topRight(Vector.I2))
        oPoints.Add(New RectAndText(rect, renderInfo.GetText()))
    End Sub

    Private Function GetLines() As Dictionary(Of Single, ArrayList)
        Dim oLines As New Dictionary(Of Single, ArrayList)
        For Each p As RectAndText In oPoints
            Dim iBottom = p.Rect.Bottom

            If oLines.ContainsKey(iBottom) = False Then
                oLines(iBottom) = New ArrayList()
            End If

            oLines(iBottom).Add(p)
        Next

        Return oLines
    End Function

    Public Function Find(ByVal sFind As String) As iTextSharp.text.Rectangle
        Dim oLines As Dictionary(Of Single, ArrayList) = GetLines()

        For Each oEntry As KeyValuePair(Of Single, ArrayList) In oLines
            'Dim iBottom As Integer = oEntry.Key
            Dim oRectAndTexts As ArrayList = oEntry.Value
            Dim sLine As String = ""
            For Each p As RectAndText In oRectAndTexts
                sLine += p.Text
                If sLine.IndexOf(sFind) <> -1 Then
                    Return p.Rect
                End If
            Next
        Next

        Return Nothing
    End Function

End Class

Public Class RectAndText
    Public Rect As iTextSharp.text.Rectangle
    Public Text As String
    Public Sub New(ByVal rect As iTextSharp.text.Rectangle, ByVal text As String)
        Me.Rect = rect
        Me.Text = text
    End Sub
End Class

Usage (inserts a signature box to the right of the found text):

Sub EncryptPdf(ByVal sInFilePath As String, ByVal sOutFilePath As String)

        Dim oPdfReader As iTextSharp.text.pdf.PdfReader = New iTextSharp.text.pdf.PdfReader(sInFilePath)
        Dim oPdfDoc As New iTextSharp.text.Document()
        Dim oPdfWriter As PdfWriter = PdfWriter.GetInstance(oPdfDoc, New FileStream(sOutFilePath, FileMode.Create))
        'oPdfWriter.SetEncryption(PdfWriter.STRENGTH40BITS, sPassword, sPassword, PdfWriter.AllowCopy)
        oPdfDoc.Open()

        oPdfDoc.SetPageSize(iTextSharp.text.PageSize.LEDGER.Rotate())

        Dim oDirectContent As iTextSharp.text.pdf.PdfContentByte = oPdfWriter.DirectContent
        Dim iNumberOfPages As Integer = oPdfReader.NumberOfPages
        Dim iPage As Integer = 0

        Dim iBottomMargin As Integer = txtBottomMargin.Text '10
        Dim iLeftMargin As Integer = txtLeftMargin.Text '500
        Dim iWidth As Integer = txtWidth.Text '120
        Dim iHeight As Integer = txtHeight.Text '780

        Dim oStrategy As New parser.SimpleTextExtractionStrategy()


        Do While (iPage < iNumberOfPages)
            iPage += 1
            oPdfDoc.SetPageSize(oPdfReader.GetPageSizeWithRotation(iPage))
            oPdfDoc.NewPage()

            Dim oPdfImportedPage As iTextSharp.text.pdf.PdfImportedPage =
            oPdfWriter.GetImportedPage(oPdfReader, iPage)
            Dim iRotation As Integer = oPdfReader.GetPageRotation(iPage)
            If (iRotation = 90) Or (iRotation = 270) Then
                oDirectContent.AddTemplate(oPdfImportedPage, 0, -1.0F, 1.0F,
                 0, 0, oPdfReader.GetPageSizeWithRotation(iPage).Height)
            Else
                oDirectContent.AddTemplate(oPdfImportedPage, 1.0F, 0, 0, 1.0F, 0, 0)
            End If

            'Dim sPageText As String = parser.PdfTextExtractor.GetTextFromPage(oPdfReader, iPage, oStrategy)
            'sPageText = System.Text.Encoding.UTF8.GetString(System.Text.ASCIIEncoding.Convert(System.Text.Encoding.Default, System.Text.Encoding.UTF8, System.Text.Encoding.Default.GetBytes(sPageText)))
            'If txtFind.Text = "" OrElse sPageText.IndexOf(txtFind.Text) <> -1 Then

            Dim oTextExtractor As New TextExtractor()
            PdfTextExtractor.GetTextFromPage(oPdfReader, iPage, oTextExtractor) 'Initialize oTextExtractor

            Dim oRect As iTextSharp.text.Rectangle = oTextExtractor.Find(txtFind.Text)
            If oRect IsNot Nothing Then
                Dim iX As Integer = oRect.Left + oRect.Width + iLeftMargin 'Move right
                Dim iY As Integer = oRect.Bottom - iBottomMargin 'Move down

                Dim field As PdfFormField = PdfFormField.CreateSignature(oPdfWriter)
                field.SetWidget(New Rectangle(iX, iY, iX + iWidth, iY + iHeight), PdfAnnotation.HIGHLIGHT_OUTLINE)
                field.FieldName = "myEmptySignatureField" & iPage
                oPdfWriter.AddAnnotation(field)
            End If

        Loop

        oPdfDoc.Close()

    End Sub
Mexico answered 20/3, 2021 at 3:26 Comment(0)

@Ivan Basart, thank you so much.

Here is the full code, for anyone who wants to save time using Ivan's code.

using System;
using System.Collections.Generic;
using System.Linq;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class Program
{

    static void Main()
    {
        //Our test file
        var testFile = System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
        var searchText = "searchWords";

        using (var reader = new PdfReader(testFile))
        {

            for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
            {
                //Use the height of the page being processed, not always page 1
                LocationTextExtractionStrategyEx strategy = new LocationTextExtractionStrategyEx(searchText, reader.GetPageSize(pageNumber).Height);
                var ex = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
                foreach (LocationTextExtractionStrategyEx.SearchResult result in strategy.m_SearchResultsList)
                {
                    Console.WriteLine("Found at position: X = {0}, Y = {1}", result.iPosX, result.iPosY);
                }
            }
        }
    }
}



class LocationTextExtractionStrategyEx : LocationTextExtractionStrategy
{
    private List<LocationTextExtractionStrategyEx.ExtendedTextChunk> m_DocChunks = new List<ExtendedTextChunk>();
    private List<LocationTextExtractionStrategyEx.LineInfo> m_LinesTextInfo = new List<LineInfo>();
    public List<SearchResult> m_SearchResultsList = new List<SearchResult>();
    private String m_SearchText;
    public const float PDF_PX_TO_MM = 0.3528f;
    public float m_PageSizeY;


    public LocationTextExtractionStrategyEx(String sSearchText, float fPageSizeY)
        : base()
    {
        this.m_SearchText = sSearchText;
        this.m_PageSizeY = fPageSizeY;
    }

    private void searchText()
    {
        foreach (LineInfo aLineInfo in m_LinesTextInfo)
        {
            int iIndex = aLineInfo.m_Text.IndexOf(m_SearchText);
            if (iIndex != -1)
            {
                TextRenderInfo aFirstLetter = aLineInfo.m_LineCharsList.ElementAt(iIndex);
                SearchResult aSearchResult = new SearchResult(aFirstLetter, m_PageSizeY);
                this.m_SearchResultsList.Add(aSearchResult);
            }
        }
    }

    private void groupChunksbyLine()
    {
        LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk1 = null;
        LocationTextExtractionStrategyEx.LineInfo textInfo = null;
        foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk2 in this.m_DocChunks)
        {
            if (textChunk1 == null)
            {
                textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                this.m_LinesTextInfo.Add(textInfo);
            }
            else if (textChunk2.sameLine(textChunk1))
            {
                textInfo.appendText(textChunk2);
            }
            else
            {
                textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                this.m_LinesTextInfo.Add(textInfo);
            }
            textChunk1 = textChunk2;
        }
    }

    public override string GetResultantText()
    {
        groupChunksbyLine();
        searchText();
        //In this case the return value is not useful
        return "";
    }

    public override void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment baseline = renderInfo.GetBaseline();
        //Create ExtendedChunk
        ExtendedTextChunk aExtendedChunk = new ExtendedTextChunk(renderInfo.GetText(), baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), renderInfo.GetCharacterRenderInfos().ToList());
        this.m_DocChunks.Add(aExtendedChunk);
    }

    public class ExtendedTextChunk
    {
        public string m_text;
        private Vector m_startLocation;
        private Vector m_endLocation;
        private Vector m_orientationVector;
        private int m_orientationMagnitude;
        private int m_distPerpendicular;
        private float m_charSpaceWidth;
        public List<TextRenderInfo> m_ChunkChars;


        public ExtendedTextChunk(string txt, Vector startLoc, Vector endLoc, float charSpaceWidth, List<TextRenderInfo> chunkChars)
        {
            this.m_text = txt;
            this.m_startLocation = startLoc;
            this.m_endLocation = endLoc;
            this.m_charSpaceWidth = charSpaceWidth;
            this.m_orientationVector = this.m_endLocation.Subtract(this.m_startLocation).Normalize();
            this.m_orientationMagnitude = (int)(Math.Atan2((double)this.m_orientationVector[1], (double)this.m_orientationVector[0]) * 1000.0);
            this.m_distPerpendicular = (int)this.m_startLocation.Subtract(new Vector(0.0f, 0.0f, 1f)).Cross(this.m_orientationVector)[2];
            this.m_ChunkChars = chunkChars;

        }


        public bool sameLine(LocationTextExtractionStrategyEx.ExtendedTextChunk textChunkToCompare)
        {
            return this.m_orientationMagnitude == textChunkToCompare.m_orientationMagnitude && this.m_distPerpendicular == textChunkToCompare.m_distPerpendicular;
        }


    }

    public class SearchResult
    {
        public float iPosX;
        public float iPosY;

        public SearchResult(TextRenderInfo aCharcter, float fPageSizeY)
        {
            //Get position of upperLeft coordinate
            Vector vTopLeft = aCharcter.GetAscentLine().GetStartPoint();
            //PosX
            float fPosX = vTopLeft[Vector.I1];
            //PosY
            float fPosY = vTopLeft[Vector.I2];
            //Transform to mm and get y from top of page
            iPosX = Convert.ToInt32(fPosX * PDF_PX_TO_MM);
            iPosY = Convert.ToInt32((fPageSizeY - fPosY) * PDF_PX_TO_MM);

        }
    }

    public class LineInfo
    {
        public string m_Text;
        public List<TextRenderInfo> m_LineCharsList;

        public LineInfo(LocationTextExtractionStrategyEx.ExtendedTextChunk initialTextChunk)
        {
            this.m_Text = initialTextChunk.m_text;
            this.m_LineCharsList = initialTextChunk.m_ChunkChars;
        }

        public void appendText(LocationTextExtractionStrategyEx.ExtendedTextChunk additionalTextChunk)
        {
            m_LineCharsList.AddRange(additionalTextChunk.m_ChunkChars);
            this.m_Text += additionalTextChunk.m_text;
        }
    }
}
Boughton answered 20/6, 2023 at 9:40 Comment(0)
