Learning the Fundamentals of the Nokogiri Gem

The world is spinning faster and faster, and this acceleration is evident in every facet of our lives. In business especially, the premium is on speed. Amid this frantic acceleration, volatility is the only constant. To keep pace and stay ahead of increasingly fierce competition, businesses are looking for ways to increase efficiency and get to market faster. This explains the popularity of frameworks like Ruby on Rails. What makes Ruby on Rails even more appealing is the sheer number of gems available for it. Imagine a tool and you will most probably find it in your Ruby kit. One of the best of these gems is Nokogiri, a library for working with XML and HTML documents. The most common use for a parser like Nokogiri is to extract data from structured documents. Examples:

  • A list of prices from a price comparison website.
  • Search result links from a search engine.
  • A list of answers from a Q&A site.

Installation:

OS X:

To install libxml2 and libxslt from MacPorts:

$ sudo port install libxml2 libxslt

Then to install nokogiri:

$ sudo gem install nokogiri

Linux:

On Linux, we still need to install libxml2. The command for installing libxml2 will vary based on the package manager and Linux distribution we’re using.

On Fedora:

$ sudo yum install libxml2-devel libxslt-devel

$ gem install nokogiri

On Ubuntu:

$ sudo apt-get install libxml2 libxml2-dev libxslt1-dev

$ gem install nokogiri

Getting Started With Nokogiri:

Once we have Nokogiri installed we can start to make use of it. Nokogiri can use XPath or CSS3 selectors. The capability to use CSS selectors makes it a really good fit for extracting data from HTML documents.

require 'rubygems'

require 'nokogiri'

require 'open-uri'

As well as requiring the nokogiri gem, we need open-uri so that the contents of a URL can be fetched easily. We then create a new Nokogiri HTML document and pass it the contents of the page we want to read, for example a search results page. With that Nokogiri document we can use at_css, passing the CSS selector "title" to retrieve the <title> element. The at_css method returns the first matching element, and we can call .text on that element to get its text content. Finally, we use puts to print out the text.

Basic Parsing:

Nokogiri lets you parse an HTML or XML document using a few different strategies:

  • DOM
  • SAX
  • Reader
  • Push

Each of these strategies has its own advantages and disadvantages. The DOM interface is the most common and is generally regarded as the easiest to use.

Consider a product listing page in which every product that is the last item in a row carries an additional class:

<div class="product lastcol">
<a href="/product/f05f/" class="product_link">
<img
src="/images/dot_clear.gif"
title="Destroy sleep with this powerful energy shot – in a reusable shotgun shell bottle."
alt="Zombie Blast Energy Shots 3 Pack"
width="125"
height="125"
class="lazy"
data-original="http://a.tgcdn.net/images/products/thumb/largesquare/f05f_zombie_blast_energy_shots.jpg"
/>
<h4>Zombie Blast Energy Shots 3 Pack</h4>
</a>
<p>$9.99</p>
</div>

This means that, to get the names of the products, we would say:

English: Starting at the root of the document: look in every div that has a class name containing the word ‘product’. Inside that find a link. In that link find h4 text.

XPath: //div[contains(@class,'product')]/a/h4

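The XPath equality operator compares the complete attribute value as a single string; it has no notion of individual class names. A div in the last column has the class attribute "product lastcol", so div[@class='product'] in XPath would not match it, contrary to what you might expect. That is why the expression above uses contains() instead.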

Reference:

https://github.com/sparklemotion/nokogiri
