Cheerio. It gives you a jQuery-like syntax for quickly parsing out content from an HTML string.
I recently used it to pull data out of a structured list with the high-level DSL that Cheerio’s extract function accepts.
// The page has many of these:
// <div class="ts-segment">
// <span class="ts-name">Jimmy Wales</span>
// <span class="ts-timestamp"><a href="https://youtube.com/watch?v=XXXX&t=5112">(01:23:45)</a> </span>
// <span class="ts-text">Hmm?</span>
// </div>
The input HTML contains many snippets like the one in the comment above. A useful property of Cheerio’s extract function is that if part of a composite value is missing, the corresponding entry is left undefined while the values that do exist are still extracted. For example, if the name is missing from a particular segment and its selector fails to match, then that segment’s timestamp, href, and text will still be extracted.
Running the extraction returns a JavaScript object with the key segments, whose value is an array of objects, each with a name, timestamp, href, and text, mirroring the shape of the input.
Another strength of this interface is its flexibility: you can extract arbitrary properties at multiple levels of a nested, tree-like query. The docs for the function don’t say much about what can be extracted, but this tutorial explains many of the possibilities, including using a function as the extractor.