Is there something like a "CSS selector" or XPath grep?

I need to find all places in a bunch of HTML files that match the following structure (CSS):

div.a ul.b

or XPath:

//div[@class="a"]//ul[@class="b"]

grep doesn't help me here. Is there a command-line tool that returns all files (and optionally all places therein) that match this criterion? I.e., one that returns file names if the file matches a certain HTML or XML structure.
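
For example, a file containing the following (hypothetical) fragment should be reported:

<div class="a">
  <p>other content</p>
  <ul class="b">
    <li>This is the structure to find.</li>
  </ul>
</div>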

Hispanic answered 7/9, 2011 at 13:47 Comment(2)
You might be able to get fancy with sed and come up with a regex to strip out the elements you don't care about, but that is probably going to be complicated and not reusable unless you write it down somewhere. I would just write a Perl script that uses something like XML::Twig::XPath and prints a message with the file name for every XML file with the class attributes you're looking for. If you're interested, I could post a quick script as an answer; but since you're specifically asking for a command-line solution I'll hold off on that.Walkin
Similar question superuser.com/questions/507344/…Airdrop

Try this:

  1. Install http://www.w3.org/Tools/HTML-XML-utils/.
    • Ubuntu: aptitude install html-xml-utils
    • macOS: brew install html-xml-utils
  2. Save a web page (call it filename.html).
  3. Run: hxnormalize -l 240 -x filename.html | hxselect -s '\n' -c "label.black"

Where "label.black" is the CSS selector that uniquely identifies the name of the HTML element. Write a helper script named cssgrep:

#!/bin/bash
# cssgrep: print the contents of elements matching a CSS selector.
# Usage: cssgrep <file.html> <selector>

# Suppress parser warnings; write the results to standard output.
hxnormalize -l 240 -x "$1" 2>/dev/null | hxselect -s '\n' -c "$2"

You can then run:

cssgrep filename.html "label.black"

This will generate the content for all HTML label elements of the class black.
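
For the structure from the original question, the call is the same, just with its selector:

cssgrep filename.html "div.a ul.b"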

The -l 240 argument is important: it tells hxnormalize to wrap its output at column 240, which avoids elements being split across line breaks. For example, if <label class="black">Text to \nextract</label> is the input, then -l 240 will reformat the HTML to <label class="black">Text to extract</label> on a single line, which simplifies parsing. Widening the wrap column to 1024 or beyond is also possible.
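
To get only the names of the files that match, as the question asks, the same pipeline can be wrapped in a loop. A minimal sketch, assuming the HTML files sit in the current directory:

for f in *.html; do
  # Print the file name if the selector matches anything in the file.
  hxnormalize -l 240 -x "$f" 2>/dev/null | hxselect -c 'div.a ul.b' | grep -q . && echo "$f"
done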

Hadlock answered 6/1, 2013 at 21:41 Comment(0)

I have built a command-line tool with Node.js which does just this. You enter a CSS selector, and it will search through all of the HTML files in the directory and tell you which files have matches for that selector.

You will need to install Element Finder, cd into the directory you want to search, and then run:

elfinder -s "div.a ul.b"

For more info please see http://keegan.st/2012/06/03/find-in-files-with-css-selectors/

Likker answered 5/6, 2012 at 3:6 Comment(0)

There are at least 4 tools:

  • pup - Inspired by jq, pup aims to be a fast and flexible way of exploring HTML from the terminal.

  • htmlq - Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files.

  • hq - Lightweight command line HTML processor using CSS and XPath selectors.

  • xq - Command-line XML and HTML beautifier and content extractor.

Examples:

$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html

$ pup --color 'title' < robots.html
<title>
 Robots exclusion standard - Wikipedia
</title>

$ htmlq --text 'title' < robots.html
Robots exclusion standard - Wikipedia

$ hq --xpath '//title' < robots.html
<title>robots.txt - Wikipedia</title>

$ xq --xpath '//title' < robots.html
robots.txt - Wikipedia
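
Any of these can be combined with a shell loop to report matching file names. A sketch using htmlq, reading each file on stdin as in the example above:

for f in *.html; do
  # Print the file name if the selector matches anything in the file.
  htmlq 'div.a ul.b' < "$f" | grep -q . && echo "$f"
done
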
Thibaut answered 12/5, 2020 at 7:46 Comment(0)

Per Nat's answer here:

How to parse XML in Bash?

Command-line tools that can be called from shell scripts include:

  • 4xpath - command-line wrapper around Python's 4Suite package
  • XMLStarlet
  • xpath - command-line wrapper around Perl's XPath library
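
For example, XMLStarlet can evaluate the XPath from the question against each file. A minimal sketch, assuming well-formed XML input in the current directory:

for f in *.xml; do
  # Print the file name if the XPath expression selects anything.
  xmlstarlet sel -t -c '//div[@class="a"]//ul[@class="b"]' "$f" 2>/dev/null | grep -q . && echo "$f"
done
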
Walkin answered 7/9, 2011 at 17:8 Comment(3)
OK, that's a good way to handle XML. Seems like the synopsis code here: search.cpan.org/~msergeant/XML-XPath-1.13/XPath.pm would exactly fit my needs. However, if I have non-XML HTML (e.g., I have some SSI snippets to search) I also need a non-XML tool. Any ideas?Hispanic
In terms of SSI, you should be able to use xpath, since SSI directives are basically XML comments parsed and handled by your server. #785245Walkin
Pretty much any variation of HTML should work, and you should be able to access any of the information in it using XPath, as long as it's well formed (malformed HTML could be mitigated by libraries that clean it up) and not inside a CDATA section (which XPath can't reach, since CDATA content isn't handled as markup).Walkin
