Get character offsets for elements in jsoup - McMap


Asked 8/7, 2012 at 23:9 Answered 25/2, 2013 at 0:10

Solved jsoup lexical-analysis


I need to map jsoup elements back to specific character offsets in the source HTML. In other words, if I have HTML that looks like this:

Hello <br/> World

I need to know that "Hello " starts at offset 0 and has a length of 6 characters, that <br/> starts at offset 6 and has a length of 5 characters, and so on.

I could not find a getter in the Element javadoc that returns this information. Can it be retrieved?

Finsen answered 8/7, 2012 at 23:9 Comment(2)

Did you find a solution to this that did not result in writing your own grammar? – Wsan 8/6, 2013 at 16:3

No. I'm still using jflex. – Finsen 9/6, 2013 at 23:11


I don't believe Jsoup has this functionality. This question seems closer to lexical analysis than HTML parsing.

I would write a grammar, and then write a lexer against that grammar which would tokenize the HTML, and supply the offsets that you're looking for.

First, parse the document with Jsoup to verify that it is valid HTML.

Then, lexically analyze the document against a grammar. A grammar might look like:

Document := {optional-opening-tag} | {literal} {optional-opening-tag} | {optional-closing-tag}

optional-opening-tag := ["<" {literal} ">" {optional-opening-tag}|{literal} ] | ""

optional-closing-tag := "</" {literal} ">" | ""

literal := any string of characters that does not begin with whitespace and does not contain "<"

Store each token that you find in an object that records the token text, the index of its first character, and its length.
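As a rough illustration of that last step, here is a minimal sketch in Java. It is not the grammar above and not part of jsoup: OffsetLexer, Token, and tokenize are hypothetical names made up for this example, and the scanner makes a simplifying assumption that everything from "<" to the next ">" is a tag and everything else is text. It does not handle comments, script or style contents, or a ">" inside an attribute value. Offsets are Java char indices (UTF-16 code units).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a jsoup class): splits HTML into tag and text tokens with offsets. */
final class OffsetLexer {

    /** One token: its source text, the index of its first character, and its length. */
    static final class Token {
        final String text;
        final int offset;
        final int length;
        final boolean isTag;

        Token(String text, int offset, boolean isTag) {
            this.text = text;
            this.offset = offset;
            this.length = text.length();
            this.isTag = isTag;
        }

        @Override
        public String toString() {
            return (isTag ? "TAG  " : "TEXT ") + offset + "+" + length + " \"" + text + "\"";
        }
    }

    /** Simplifying assumption: a tag is everything from '<' up to and including the next '>'. */
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            boolean isTag = html.charAt(pos) == '<';
            int end;
            if (isTag) {
                int close = html.indexOf('>', pos);
                // An unterminated tag runs to the end of the input.
                end = (close < 0) ? html.length() : close + 1;
            } else {
                int open = html.indexOf('<', pos);
                // Text runs until the next '<' or the end of the input.
                end = (open < 0) ? html.length() : open;
            }
            tokens.add(new Token(html.substring(pos, end), pos, isTag));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("Hello <br/> World")) {
            System.out.println(t);
        }
    }
}
```

For the sample in the question, this yields three tokens: "Hello " at offset 0 with length 6, "<br/>" at offset 6 with length 5, and " World" at offset 11 with length 6. A real lexer (for example one generated with JFlex, as mentioned in the comments) would need many more rules to cope with arbitrary HTML.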

Fletcherfletcherism answered 25/2, 2013 at 0:10 Comment(1)

Yes, this is the right answer. I had actually already written a lexer using JFlex, and it works, and I'm still using it, but I'd rather not maintain it. I was trying to get rid of the code. – Finsen 25/2, 2013 at 17:18
