Try this:
- Install html-xml-utils (http://www.w3.org/Tools/HTML-XML-utils/).
- Ubuntu:
aptitude install html-xml-utils
- macOS:
brew install html-xml-utils
- Save a web page (call it filename.html).
- Run:
hxnormalize -l 240 -x filename.html | hxselect -s '\n' -c "label.black"
Here, "label.black" is the CSS selector that identifies the HTML elements to extract. To avoid retyping the pipeline, write a helper script named cssgrep:
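#!/bin/bash
# Usage: cssgrep FILE SELECTOR
# Ignore errors, write the results to standard output.
# The arguments are quoted so that file names containing spaces still work.
hxnormalize -l 240 -x "$1" 2>/dev/null | hxselect -s '\n' -c "$2"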
After making the script executable (chmod +x cssgrep) and placing it on your PATH, you can run:
cssgrep filename.html "label.black"
This prints the content of every HTML label element with the class black.
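A slightly more defensive variant of the helper can be sketched as a shell function. The name cssgrep_safe, the error messages, and the exit codes are assumptions made for illustration; only the hxnormalize and hxselect invocation comes from the script above.

```shell
#!/bin/bash
# cssgrep_safe: hypothetical variant of the cssgrep helper, written as a function.
# It validates its input before running the same pipeline as the original script.
cssgrep_safe() {
  # Expect exactly two arguments: the HTML file and the CSS selector.
  if [ "$#" -ne 2 ]; then
    echo "usage: cssgrep_safe FILE SELECTOR" >&2
    return 2
  fi
  # Refuse to continue if the file is missing or unreadable.
  if [ ! -r "$1" ]; then
    echo "cssgrep_safe: cannot read $1" >&2
    return 1
  fi
  # Both tools ship with html-xml-utils; report clearly if they are absent.
  if ! command -v hxnormalize >/dev/null 2>&1 || ! command -v hxselect >/dev/null 2>&1; then
    echo "cssgrep_safe: html-xml-utils not found" >&2
    return 127
  fi
  hxnormalize -l 240 -x "$1" 2>/dev/null | hxselect -s '\n' -c "$2"
}
```

After sourcing the file that defines it, it is called the same way as the script: cssgrep_safe filename.html "label.black".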
The -l 240 argument is important: it keeps line breaks out of the extracted text. For example, if the input is <label class="black">Text to \nextract</label>, then -l 240 reformats the HTML to <label class="black">Text to extract</label>, wrapping lines only at column 240, which simplifies parsing. A larger width, such as 1024 or beyond, also works.
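To see the effect of the line width, one option is to run the same pipeline with a narrow and a wide setting on a small sample page. This is only a sketch: the sample file and the width of 20 are made up for the comparison, and what the narrow run prints depends on how hxnormalize wraps this particular input.

```shell
#!/bin/bash
# Build a throwaway sample page whose label text is split across two lines.
sample=$(mktemp "${TMPDIR:-/tmp}/sample.XXXXXX")
printf '%s\n' \
  '<html><body>' \
  '<label class="black">Text to' \
  'extract</label>' \
  '</body></html>' > "$sample"

# Run the pipeline at two widths, but only if html-xml-utils is installed.
if command -v hxnormalize >/dev/null 2>&1 && command -v hxselect >/dev/null 2>&1; then
  for width in 20 240; do
    echo "--- -l $width"
    hxnormalize -l "$width" -x "$sample" 2>/dev/null | hxselect -s '\n' -c "label.black"
  done
else
  echo "html-xml-utils not installed; skipping the comparison" >&2
fi
```

If the wide run prints the label text on a single line and the narrow run does not, a line-by-line consumer of the output will only see one match per line with the wide setting.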
See also: