Cheerio for Web Scraping

Cheerio, a great JavaScript library, recently reached 1.0. It gives you a jQuery-like API for quickly pulling content out of an HTML string.

I recently used it to pull data out of a structured list, using the nice high-level, declarative DSL that Cheerio’s extract function accepts.

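const cheerio = require('cheerio');

// htmlContent is the page's HTML, already fetched or read as a string.
const $ = cheerio.load(htmlContent);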

// The page has many of these:
// <div class="ts-segment">
//     <span class="ts-name">Jimmy Wales</span> 
//     <span class="ts-timestamp"><a href="https://youtube.com/watch?v=XXXX&t=5112">(01:23:45)</a> </span>
//     <span class="ts-text">Hmm?</span>
// </div>

const result = $.extract({
  segments: [{
    selector: '.ts-segment',
    value: {
      name: '.ts-name',
      timestamp: {
        selector: '.ts-timestamp',
      },
      href: {
        selector: '.ts-timestamp a',
        value: 'href'
      },
      text: '.ts-text',
    },
  }]
});

The input HTML contains many snippets like the one in the comment above. One nice thing about Cheerio’s extract function is that it tolerates incomplete composite values: a key whose selector matches nothing is left undefined, and the remaining keys are still extracted. For example, if a particular segment has no name, its timestamp, href, and text are still returned.

Running the extraction returns a plain JavaScript object with a single key, segments, whose value is an array of objects, each with name, timestamp, href, and text properties. The result mirrors the shape of the map passed to extract.
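As a rough illustration of consuming that shape, the sketch below formats each segment as one line of text. The sample object is written by hand to match the description above (it is not captured output, and its second entry is invented), and formatSegments is a hypothetical helper, not part of Cheerio.

```javascript
// Hand-written sample in the shape described above; not real extract() output.
const sample = {
  segments: [
    {
      name: 'Jimmy Wales',
      timestamp: '(01:23:45) ',
      href: 'https://youtube.com/watch?v=XXXX&t=5112',
      text: 'Hmm?',
    },
    // Invented entry with a missing name and href.
    { name: undefined, timestamp: '(01:24:02)', href: undefined, text: 'Go on.' },
  ],
};

// Hypothetical helper: one printable line per segment, tolerating undefined fields.
function formatSegments(result) {
  return result.segments.map((s) => {
    const label = [s.name ?? 'Unknown speaker', (s.timestamp ?? '').trim()]
      .filter(Boolean)
      .join(' ');
    return `${label}: ${(s.text ?? '').trim()}`;
  });
}

console.log(formatSegments(sample).join('\n'));
```

Trimming is worth doing because extracted text keeps whatever whitespace surrounds it in the markup.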

Another strength of this interface is its flexibility: you can extract arbitrary properties at multiple levels of a nested, tree-like query. The docs for the function aren’t very informative about what can be extracted, but this tutorial explains many of the possibilities, including using a function as the extractor.