Go: How would you "Pretty Print"/"Prettify" HTML?

5

11

In Python, PHP, and many other languages, it is possible to take an HTML document and "prettify" it. In Go, this is very easily done for JSON and XML (from a struct/interface) using the MarshalIndent function.

Example for XML in Go:

http://play.golang.org/p/aBNfNxTEG1

package main

import (
    "encoding/xml"
    "fmt"
    "os"
)

func main() {
    type Address struct {
        City, State string
    }
    type Person struct {
        XMLName   xml.Name `xml:"person"`
        Id        int      `xml:"id,attr"`
        FirstName string   `xml:"name>first"`
        LastName  string   `xml:"name>last"`
        Age       int      `xml:"age"`
        Height    float32  `xml:"height,omitempty"`
        Married   bool
        Address
        Comment string `xml:",comment"`
    }

    v := &Person{Id: 13, FirstName: "John", LastName: "Doe", Age: 42}
    v.Comment = " Need more details. "
    v.Address = Address{"Hanga Roa", "Easter Island"}

    output, err := xml.MarshalIndent(v, "  ", "    ")
    if err != nil {
        fmt.Printf("error: %v\n", err)
    }

    os.Stdout.Write(output)
}

However, this only works for converting a struct/interface into a []byte. What I want is to take a string of HTML code and indent it automatically. Example:

Raw HTML

<!doctype html><html><head>
<title>Website Title</title>
</head><body>
<div class="random-class">
<h1>I like pie</h1><p>It's true!</p></div>
</body></html>

Prettified HTML

<!doctype html>
<html>
    <head>
        <title>Website Title</title>
    </head>
    <body>
        <div class="random-class">
            <h1>I like pie</h1>
            <p>It's true!</p>
        </div>
    </body>
</html>

How would this be done just using a string?

Rawson answered 14/1, 2014 at 15:21 Comment(0)
16

I faced the same problem and solved it by writing an HTML formatting package in Go myself.

Here it is:

GoHTML - HTML formatter for Go

Please check this package out.

Thanks,

Keiji

Wrought answered 25/4, 2014 at 6:42 Comment(2)
Thank you so much for this. My own implementation, which is still in the works, is not as good as yours, so I will be using yours instead. - Rawson
@KeijiYoshida What would be the best way to embed your prettifier into a standalone binary that reads from stdin and writes to stdout? - Clump
9

I found this question when trying to figure out how to pretty-print XML in Go. Since I didn't find the answer anywhere, here's my solution:

import (
    "bytes"
    "encoding/xml"
    "io"
)

func formatXML(data []byte) ([]byte, error) {
    b := &bytes.Buffer{}
    decoder := xml.NewDecoder(bytes.NewReader(data))
    encoder := xml.NewEncoder(b)
    encoder.Indent("", "  ")
    for {
        token, err := decoder.Token()
        if err == io.EOF {
            // Flush buffered output and report any write error.
            if err := encoder.Flush(); err != nil {
                return nil, err
            }
            return b.Bytes(), nil
        }
        if err != nil {
            return nil, err
        }
        err = encoder.EncodeToken(token)
        if err != nil {
            return nil, err
        }
    }
}
Lota answered 26/11, 2014 at 4:5 Comment(2)
I like this solution, but am still in search of a Golang XML formatter/pretty-printer that doesn't rewrite the document (other than formatting whitespace). Marshalling or using the Encoder will change namespace declarations. For example, an element like "<ns1:Element/>" will be translated to something like '<Element xmlns="ns1"/>', which seems harmless enough except when the intent is to not alter the XML other than formatting. - Miasma
@JamesMcGill, for a Golang XML formatter/pretty-printer that doesn't rewrite the document, check out github.com/go-xmlfmt/xmlfmt. I share the same pain as you. :-) - Heartfelt
5

EDIT: Found a great way using the XML parser:

package main

import (
    "encoding/xml"
    "fmt"
)

func main() {
    html := "<html><head><title>Website Title</title></head><body><div class=\"random-class\"><h1>I like pie</h1><p>It's true!</p></div></body></html>"
    type node struct {
        Attr     []xml.Attr
        XMLName  xml.Name
        Children []node `xml:",any"`
        Text     string `xml:",chardata"`
    }
    x := node{}
    _ = xml.Unmarshal([]byte(html), &x)
    buf, _ := xml.MarshalIndent(x, "", "\t")
    fmt.Println(string(buf))
}

will output the following (note that the class attribute on the div is lost):

<html>
    <head>
        <title>Website Title</title>
    </head>
    <body>
        <div>
            <h1>I like pie</h1>
            <p>It&#39;s true!</p>
        </div>
    </body>
</html>
Levin answered 14/1, 2014 at 15:31 Comment(9)
That's hardly the problem here. Go's XML module supports non-strict, auto-close, sloppy parsing. - Levin
This seems to be the right track: a generic XML/HTML unmarshaller. However, I am betting that if I cannot get attributes to work, I'll have to make my own pretty-print parser. - Rawson
@GingerBill I tried to get the attributes to work but couldn't using this scheme. I even wrote a custom parser that captured them in the above node struct, but then the serializer didn't serialize them properly. What I didn't explore was the interfaces the xml module exposes for serializing attributes. - Levin
@GingerBill Another approach I tried that did work, but I think is not scalable, was to add all the known attributes I want to support (src, href, id, class, etc.) to the node struct, and to add xml:",attr,omitempty" to each field. This caused them to be hidden if they did not exist in the struct and all worked well, but then it stops being generic and doesn't support unknown attributes. - Levin
@Levin I was going to use the latter solution, but looking at the number of HTML attributes (about 100), this may be time-consuming. - Rawson
@GingerBill You can write a simple script that generates them from this document: w3.org/TR/html4/index/attributes.html, but newer flavors of HTML support unknown attributes. Look, for example, at the Google Play source code. They add all sorts of custom semantic attributes. - Levin
@Levin That is exactly the problem. I know that for most of my code I probably won't be using all the custom attributes, but it would be nice to have it handle anything. If it is not possible, I'll make it an open source library so people can pretty-print their HTML in Go. - Rawson
@GingerBill Perhaps it's worth offering support for this in Go's standard xml library? Something like adding a special type xml.AttrList and/or a special tag for it like xml:",any-attr". - Levin
@Levin I might actually do this. Thanks for the idea. Something like this should be straightforward and a part of the xml library. I bet many people have wanted to properly unmarshal HTML code for one reason or another. - Rawson
2

You could parse the HTML with code.google.com/p/go.net/html (the package now lives at golang.org/x/net/html), and write your own version of the Render function from that package, one that keeps track of indentation.

But let me warn you: you need to be careful with adding and removing whitespace in HTML. Although whitespace is not usually significant, you can have spaces appearing and disappearing in the rendered text if you're not careful.

Edit:

Here's a pretty-printer function I wrote recently. It handles some of the special cases, but not all of them.

func prettyPrint(b *bytes.Buffer, n *html.Node, depth int) {
    switch n.Type {
    case html.DocumentNode:
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            prettyPrint(b, c, depth)
        }

    case html.ElementNode:
        justRender := false
        switch {
        case n.FirstChild == nil:
            justRender = true
        case n.Data == "pre" || n.Data == "textarea":
            justRender = true
        case n.Data == "script" || n.Data == "style":
            break
        case n.FirstChild == n.LastChild && n.FirstChild.Type == html.TextNode:
            if !isInline(n) {
                c := n.FirstChild
                c.Data = strings.Trim(c.Data, " \t\n\r")
            }
            justRender = true
        case isInline(n) && contentIsInline(n):
            justRender = true
        }
        if justRender {
            indent(b, depth)
            html.Render(b, n)
            b.WriteByte('\n')
            return
        }
        indent(b, depth)
        fmt.Fprintln(b, html.Token{
            Type: html.StartTagToken,
            Data: n.Data,
            Attr: n.Attr,
        })
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            // Parentheses are needed because && binds tighter than ||.
            if (n.Data == "script" || n.Data == "style") && c.Type == html.TextNode {
                prettyPrintScript(b, c.Data, depth+1)
            } else {
                prettyPrint(b, c, depth+1)
            }
        }
        indent(b, depth)
        fmt.Fprintln(b, html.Token{
            Type: html.EndTagToken,
            Data: n.Data,
        })

    case html.TextNode:
        n.Data = strings.Trim(n.Data, " \t\n\r")
        if n.Data == "" {
            return
        }
        indent(b, depth)
        html.Render(b, n)
        b.WriteByte('\n')

    default:
        indent(b, depth)
        html.Render(b, n)
        b.WriteByte('\n')
    }
}

func isInline(n *html.Node) bool {
    switch n.Type {
    case html.TextNode, html.CommentNode:
        return true
    case html.ElementNode:
        switch n.Data {
        case "b", "big", "i", "small", "tt", "abbr", "acronym", "cite", "dfn", "em", "kbd", "strong", "samp", "var", "a", "bdo", "img", "map", "object", "q", "span", "sub", "sup", "button", "input", "label", "select", "textarea":
            return true
        default:
            return false
        }
    default:
        return false
    }
}

func contentIsInline(n *html.Node) bool {
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        if !isInline(c) || !contentIsInline(c) {
            return false
        }
    }
    return true
}

func indent(b *bytes.Buffer, depth int) {
    depth *= 2
    for i := 0; i < depth; i++ {
        b.WriteByte(' ')
    }
}

func prettyPrintScript(b *bytes.Buffer, s string, depth int) {
    for _, line := range strings.Split(s, "\n") {
        line = strings.TrimSpace(line)
        if line == "" {
            continue
        }
        depthChange := 0
        for _, c := range line {
            switch c {
            case '(', '[', '{':
                depthChange++
            case ')', ']', '}':
                depthChange--
            }
        }
        switch line[0] {
        case '.':
            indent(b, depth+1)
        case ')', ']', '}':
            indent(b, depth-1)
        default:
            indent(b, depth)
        }
        depth += depthChange
        fmt.Fprintln(b, line)
    }
}
Seat answered 14/1, 2014 at 18:13 Comment(0)
2

Short answer

Use this HTML prettyprint library for Go (that I wrote, *uhum*). It has some tests and works for basic inputs, and will hopefully become more robust over time, though it isn't very robust right now. Note the Known Issues section in the readme.

Long Answer

Rolling your own HTML prettifier for simple cases is reasonably easy using the code.google.com/p/go.net/html package (now golang.org/x/net/html); that's what the above package does. Here is a very simple Prettify function implemented in this way:

func Prettify(raw string, indent string) (pretty string, e error) {
    r := strings.NewReader(raw)
    z := html.NewTokenizer(r)
    pretty = ""
    depth := 0
    prevToken := html.CommentToken
    for {
        tt := z.Next()
        tokenString := string(z.Raw())

        // skip text tokens that consist only of newlines
        if tt == html.TextToken {
            stripped := strings.Trim(tokenString, "\n")
            if len(stripped) == 0 {
                continue
            }
        }

        if tt == html.EndTagToken {
            depth -= 1
        }

        if tt != html.TextToken {
            if prevToken != html.TextToken {
                pretty += "\n"
                for i := 0; i < depth; i++ {
                    pretty += indent
                }
            }
        }

        pretty += tokenString

        // last token
        if tt == html.ErrorToken {
            break
        } else if tt == html.StartTagToken {
            depth += 1
        }
        prevToken = tt
    }
    return strings.Trim(pretty, "\n"), nil
}

It handles simple examples, like the one you provided. For example,

html := `<!DOCTYPE html><html><head>
<title>Website Title</title>
</head><body>
<div class="random-class">
<h1>I like pie</h1><p>It's true!</p></div>
</body></html>`
pretty, _ := Prettify(html, "    ")
fmt.Println(pretty)

will print the following:

<!DOCTYPE html>
<html>
    <head>
        <title>Website Title</title>
    </head>
    <body>
        <div class="random-class">
            <h1>I like pie</h1>
            <p>It's true!</p>
        </div>
    </body>
</html>

Beware, though: this simple approach doesn't yet handle HTML comments, nor does it handle perfectly valid self-closing HTML5 tags that are not XHTML-compliant, like <br>. Whitespace is not guaranteed to be preserved where it should be, and there is a whole range of other edge cases I haven't yet thought of. Use it only as a reference, a toy, or a starting point :)
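One way to approach the <br> problem is to avoid increasing the depth for void elements, which never take a closing tag. The sketch below is not part of the Prettify function above: opensBlock and voidElements are hypothetical names, and in Prettify the tag name would have to come from the tokenizer (for instance via its TagName method) before deciding whether to bump depth.

```go
package main

import (
	"fmt"
	"strings"
)

// voidElements lists the HTML elements that never take a closing tag.
var voidElements = map[string]bool{
	"area": true, "base": true, "br": true, "col": true, "embed": true,
	"hr": true, "img": true, "input": true, "link": true, "meta": true,
	"source": true, "track": true, "wbr": true,
}

// opensBlock reports whether a start tag with this name should
// increase the indentation depth (hypothetical helper).
func opensBlock(tagName string) bool {
	return !voidElements[strings.ToLower(tagName)]
}

func main() {
	depth := 0
	// Simulate a run of start tags; only non-void tags nest deeper.
	for _, name := range []string{"div", "br", "p", "IMG"} {
		if opensBlock(name) {
			depth++
		}
		fmt.Printf("%-3s -> depth %d\n", name, depth)
	}
}
```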

Oleneolenka answered 20/1, 2014 at 12:30 Comment(0)
