How to properly parse paired HTML tags?

The question is about parsing an HTML stream obtained with load/markup in such a way that you get each tag's constituent parts, i.e. when you find

<div id="one">my text</div>

you should end up with something like <div id="one">, {my text} and </div> in the same container, something like

[<div id="one"> {my text} </div>]

or even better

[<div> [id {one}] {my text} </div>]
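For reference, here is a small sketch of what load/markup gives you for that input, assuming Rebol 2. The word blk is just an illustrative name. Note that the attributes stay inside the tag! value, so the result is flat, closer to the first form than to the second.

```rebol
REBOL []

; load/markup splits markup into a flat block of tag! and string! values
blk: load/markup {<div id="one">my text</div>}

probe blk                   ; [<div id="one"> "my text" </div>]
print tag? first blk        ; true
print string? second blk    ; true
```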

The parsing problem is matching paired HTML tags. In HTML a tag may be an empty tag, possibly with attributes but with no content and therefore no ending tag, or a normal tag, possibly with attributes and content and therefore with an ending tag. Both kinds are still just a tag.

I mean that when you find a sequence like <p>some words</p> you have a P tag, just as you do with a sequence like <p />. In the first case you have associated text and an ending tag, and in the latter you don't; that's all.

In other words, attributes and content are properties of the tag element in HTML, so representing this in JSON you would get something like:

{ "tag": { "name": "div", "attributes": { "id": "one" }, "content": "my text" } }

This means you have to identify the content of a tag in order to assign it to the proper tag, which in terms of Rebol parse means identifying matching tags (opening tag and ending tag).
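As a rough illustration of the target structure, the same record could be held in a Rebol object. This is only a sketch: the word element and its field names simply mirror the JSON shape above and are not something load/markup produces.

```rebol
REBOL []

; hypothetical element record mirroring the JSON shape above
element: make object! [
    name: "div"
    attributes: [id "one"]      ; attribute name followed by its value
    content: "my text"
]

print element/name                       ; div
print select element/attributes 'id      ; one
print element/content                    ; my text
```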

In Rebol you can easily parse an HTML sequence like:

<div id="yo">yeah!</div><br/>

with the rule:

[ some [ tag! string! tag! | tag! ]]

but with this rule you will match the HTML

<div id="yo">yeah!</div><br/>

and also

<div id="yo">yeah!</p><br/>

as being the same
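A quick check of that behaviour, assuming Rebol 2 (rule, good and bad are illustrative names): the rule only looks at datatypes, so both inputs parse successfully.

```rebol
REBOL []

rule: [some [tag! string! tag! | tag!]]

good: load/markup {<div id="yo">yeah!</div><br/>}
bad:  load/markup {<div id="yo">yeah!</p><br/>}

print parse good rule    ; true
print parse bad rule     ; true, although </p> does not close <div>
```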

So you need a way to require that the tag in the ending position has the same name as the opening tag.

Sadly, Rebol tags cannot (AFAIK) be parameterized with a tag name, so you cannot say something like:

[ some [ set t1 tag! set s string! set t2 tag!#t1/1 | tag! ] ]

The t1/1 notation is needed because of a (bad) feature of Rebol: it keeps all tag attributes at the same level as the tag name. (Another bad feature is that it does not recognize matching tags as being the same tag.)
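One way to approximate the parametrized rule is sketched below, assuming Rebol 2. Every name in it (name-chars, tag-name, closes?, guard, pairs, rule) is hypothetical and none of them is a built-in. The tag name is pulled out of the tag text with a small string parse, and since Rebol 2 parse has no conditional keyword as far as I know, a paren picks either an empty rule or an always-failing rule to accept or reject the pair.

```rebol
REBOL []

; all names below are hypothetical helpers, not Rebol built-ins
name-chars: charset [#"a" - #"z" #"A" - #"Z" #"0" - #"9"]

tag-name: func [
    "Element name of a tag, without attributes or slashes"
    t [tag!] /local name
][
    parse/all to string! t [opt "/" copy name some name-chars to end]
    name
]

closes?: func [
    "True when end-tag is the closing tag for start-tag"
    start-tag [tag!] end-tag [tag!]
][
    to logic! all [
        #"/" = first end-tag
        #"/" <> first start-tag
        (tag-name start-tag) = tag-name end-tag
    ]
]

pairs: copy []
guard: none
rule: [
    some [
        set t1 tag! set s string! set t2 tag!
        ; the paren selects [] (always succeeds) or [end skip] (always fails)
        (guard: either closes? t1 t2 [[]] [[end skip]])
        guard
        (append/only pairs reduce [t1 s t2])
        | tag!
        | string!
    ]
]

parse load/markup {<div id="yo">yeah!</div><br/>} rule
probe pairs     ; [[<div id="yo"> "yeah!" </div>]]

clear pairs
parse load/markup {<div id="yo">yeah!</p><br/>} rule
probe pairs     ; []
```

This still does not handle nested elements, and the check lives in a paren rather than in the parse dialect itself, so it is a workaround rather than the parametrized tag rule asked about.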

Of course you can achieve the goal using code such as:

tags: copy []
html: {<div id="yo">yeah!</p><br/>}
parse load/markup html [
    some [
        set t1 tag! set s string! set t2 tag! (
            ; keep the pair only when the closing tag contains the opening tag's name
            tag: first make block! t1
            if find t2 tag [append/only tags reduce [t1 s]]
        )
        | set t1 tag! (append/only tags reduce [t1])
    ]
]

But the idea is to use a more elegant and naive approach, using the parse dialect only:

parse load/markup "<p>preug</p>something<br />" [
    some [
        set t tag! (
            ; build the expected closing tag: drop any attributes, prefix "/"
            b: copy t
            if find b " " [clear find b " "]
            insert b "/"
        )
        set s string! b (print [t s b])
        | tag!
        | string!
    ]
]
