How to properly parse paired HTML tags?
The question is about parsing an HTML stream obtained with load/markup in a way that gives you the constituent parts of each HTML tag, i.e. when you find

<div id="one">my text</div> 

you should end up with something like <div id="one">, {my text} and </div> in the same container, for example

[<div id="one"> {my text} </div>] 

or even better

[<div> [id {one}] {my text} </div>]

The parsing problem is matching paired HTML tags. In HTML a tag may be an empty tag, possibly with attributes but without content and thus without an ending tag, or a normal tag, possibly with attributes, with content and therefore with an ending tag. Both kinds are still just a tag.

I mean that when you find a sequence like <p>some words</p> you have a P tag, just as you do with a sequence like <p />. In the first case you have associated text and an ending tag, and in the latter you don't; that's all.

In other words, attributes and content are properties of the tag element in HTML, so representing this in JSON you would get something like:

tag: { name: "div", attributes: { id: "one" }, content: "my text" }

This means you have to identify the content of a tag in order to assign it to the proper tag, which in terms of Rebol parse means identifying matching tags (opening tag and ending tag).

In Rebol you can easily parse an HTML sequence like:
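For comparison, the same structure can be written as a Rebol value. This is only a sketch of one possible target representation; the field names simply mirror the JSON above and are not something load/markup produces by itself.

```rebol
; Hypothetical target structure for one parsed element.
; Field names (name, attributes, content) mirror the JSON sketch above.
tag: make object! [
    name: "div"
    attributes: [id "one"]    ; attribute name followed by its value
    content: "my text"
]

print tag/name
print select tag/attributes 'id
```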

<div id="yo">yeah!</div><br/>

with the rule:

[ some [ tag! string! tag! | tag! ]]

but with this rule you will match the html

<div id="yo">yeah!</div><br/> 

and also

<div id="yo">yeah!</p><br/> 

as being the same

So you need a way to require that the tag in ending position is the closing form of the tag found in opening position.

Sadly, Rebol tags cannot (AFAIK) be parametrized by tag name, so you cannot say something like:
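The point can be checked directly. The following is a minimal sketch, assuming Rebol 2 (where load/markup is available); the words rule, well-formed and mismatched are names chosen here for illustration. Both inputs should be accepted by the rule, which is exactly the problem.

```rebol
; The datatype-only rule from above: it checks shapes, not tag names.
rule: [some [tag! string! tag! | tag!]]

; load/markup splits the string into a block of tag! and string! values.
well-formed: load/markup {<div id="yo">yeah!</div><br/>}
mismatched:  load/markup {<div id="yo">yeah!</p><br/>}

; Neither call looks at what is inside the tags.
print parse well-formed rule
print parse mismatched rule
```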

[ some [ set t1 tag! set s string! set t2 tag!#t1/1 | tag! ] ]

The t1/1 notation is due to a (bad) feature of Rebol: it includes all tag attributes at the same level as the tag name (another bad feature is not recognizing matching tags as being the same tag).

Of course you can achieve the goal using code such as:

tags: copy []
html: {<div id="yo">yeah!</p><br/>}
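; load/markup turns the string into a block of tags and strings, so that tag! and string! can match
parse load/markup html [ some [ set t1 tag! set s string! set t2 tag! (tag: first make block! t1 if none <> find t2 tag [append/only tags reduce [t1 s] ]) | set t1 tag! (append/only tags reduce [t1])]]

but the idea is to use a more elegant and naive approach that relies on the parse dialect only.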

Emetine answered 13/5, 2017 at 9:46 Comment(1)
HTML parsing is voodoo at best (see HTML5 spec). There is an HTML parser for Rebol 2 although you need the PowerMezz pack to use it. At this time there is no HTML parser for Rebol 3, though I did start a conversion of PowerMezz (it is not beginners code and takes a lot of effort to get a handle on). You could perhaps create a cheap and cheerful handler with some combination of LOAD/MARKUP, a tag parser and a state machine; it depends whether you are operating for the general case or specific markup. - Whomever

There's a way to parse pairs of items in the Rebol parse dialect: simply use a word to store the expected pair:

parse ["a" "a"] [some [set s string! s ]]
parse ["a" "a" "b" "b"] [some [set s string! s ]]

But this doesn't work well with tags, because tags carry attributes and a special ending mark (/), so it's not easy to get the ending tag from the initial one:

parse [<p> "some text" </p>] [some [ set t tag! set s string! t ]]
parse [<div id="d1"> "some text" </div>] [some [ set t tag! set s string! t ]]

These don't work because </p> is not equal to <p>, nor is </div> equal to <div id="d1">.

Again you can fix it with code:

parse load/markup "<p>preug</p>something<br />" [
    some [
        set t tag! (
            b: copy t remove/part find b " " tail b
            insert b "/"
        )
        set s string!
        b (print [t s b])
    |
        tag!
    |
        string!
    ]
]

but this is not simple, zen code anymore, so the question is still open ;-)
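One way to keep the rule itself readable is to move the closing-tag computation into a small helper. This is a minimal sketch, assuming Rebol 2 with load/markup; closing-tag is a hypothetical name introduced here, not a built-in, and the approach still relies on code inside the rule rather than on the parse dialect alone.

```rebol
; Hypothetical helper: derive the closing tag from an opening tag,
; e.g. <div id="d1"> becomes </div>.
closing-tag: func [
    open [tag!]
    /local close
][
    close: copy open
    ; drop everything from the first space on (the attributes), if any
    clear any [find close " " tail close]
    head insert close "/"
]

parse load/markup {<div id="d1">some text</div><br />} [
    some [
        ; b holds the tag value that must appear in closing position
        set t tag! (b: closing-tag t)
        set s string!
        b (print [mold t mold s mold b])
    |
        tag!        ; empty or unmatched tag
    |
        string!     ; text outside a recognized pair
    ]
]
```

Note that the helper makes no attempt to handle nested elements; it only covers the flat tag, text, closing-tag shape discussed here.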

Emetine answered 15/5, 2017 at 5:11 Comment(3)
Mind that in Rebol 3, you can use INTO on string types including tags: parse load/markup html [and tag! into ["p" [end | " " to end]] stuff </p>] - Whomever
Does load/markup exist in R3? I thought it doesn't. - Jaco
No, it doesn't exist as refinement but you can emulate it with refinement type: load/type www.google.com 'markup - Emetine
