This post is part of a series of posts that provide step-by-step instructions on how to write a simple web scraper using Ruby on morph.io. If you find any problems, let us know in the comments so we can improve these tutorials.
In the previous post we set up our scraper. Now we’re going to start writing it.
It can be really helpful to start out writing your scraper in an interactive shell. In the shell you’ll get quick feedback as you explore the page you’re trying to scrape, instead of having to run your scraper file to see what your code does.
The interactive shell for ruby is called irb. Start an irb session on the command line with:
> bundle exec irb
The bundle exec command executes your irb command in the context of your project’s Gemfile. This means that your specified gems will be available.
The first command you need to run in irb is:
>> require 'mechanize'
This loads in the Mechanize library. Mechanize is a helpful library for requesting and interacting with webpages.
Now you can create an instance of Mechanize that will be your agent to do things like ‘get’ pages and ‘click’ on links:
>> agent = Mechanize.new
You want to get information for all the members you can. Looking at your target page, it turns out the members are spread across three pages, and you’ll have to scrape all of them to get every member. Rather than worry about this now, let’s start small: just collect the information you want for the first member on the first page. Reducing the complexity as you start to write your code will make it easier to debug as you go along.
In your irb session, use the Mechanize get method to get the first page with members listed on it.
>> url = "https://morph.io/documentation/examples/australian_members_of_parliament"
>> page = agent.get(url)
This returns the source of your page as a Mechanize Page object. You’ll be pulling the information you want out of this object using the handy Nokogiri XML searching methods that Mechanize loads in for you.
Let’s review some of these methods.
at()
The at() method returns the first element that matches the selector provided. For example, page.at('ul') returns the first <ul> element in the page as a Nokogiri XML Element that you can parse. There are a number of ways to target elements using the at() method. We’re using a CSS-style selector in this example because many people are familiar with this style from writing CSS or jQuery. You can also target elements by class, e.g. page.at('.search-filter-results'); or by id, e.g. page.at('#content').
search()
The search() method works like the at() method, but returns an Array of every element that matches the target instead of just the first. Running page.search('li') returns an Array of every <li> element in page.
You can chain these methods together to find specific elements. page.at('.search-filter-results').at('li').search('p') will return an Array of all <p> elements found within the first <li> element found within the first element with the class search-filter-results on the page.
You can use the at() and search() methods to get the first member list item on the page:
>> page.at('.search-filter-results').at('li')
This returns a big blob of code that can be hard to read. You can use the inner_text() method to help work out if you’ve got the element you’re looking for: the first member in the list.
>> page.at('.search-filter-results').at('li').inner_text
=> "\n\nThe Hon Ian Macfarlane MP\n\n\n\n\n\nMember for\nGroom,Queensland\nParty\nLiberal Party of Australia\nConnect\n\nEmail\n\n\n"
Victory!
Now that you’ve found your first member, you want to collect their title, electorate, party, and the url for their individual page. Let’s start with the title.
If you view the page source in your browser and look at the first member list item, you can see that the title of the member, “The Hon Ian Macfarlane MP”, is the text inside the link in the <p> with the class 'title'.
<li>
<p class='title'>
<a href="http://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=WN6">
The Hon Ian Macfarlane MP
</a>
</p>
<p class='thumbnail'>
<a href="http://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=WN6">
<img alt="Photo of The Hon Ian Macfarlane MP" src="http://parlinfo.aph.gov.au/parlInfo/download/handbook/allmps/WN6/upload_ref_binary/WN6.JPG" width="80" />
</a>
</p>
<dl>
<dt>Member for</dt>
<dd>Groom, Queensland</dd>
<dt>Party</dt>
<dd>Liberal Party of Australia</dd>
<dt>Connect</dt>
<dd>
<a class="social mail" href="mailto:Ian.Macfarlane.MP@aph.gov.au"
target="_blank">Email</a>
</dd>
</dl>
</li>
You can use the .inner_text method here.
>> page.at('.search-filter-results').at('li').at('.title').inner_text
=> "\nThe Hon Ian Macfarlane MP\n"
There it is: the title of the first member. It’s got messy \n whitespace characters around it, though. Never fear, you can clean it up with the Ruby method strip.
>> page.at('.search-filter-results').at('li').at('.title').inner_text.strip
=> "The Hon Ian Macfarlane MP"
You’ve successfully scraped the first bit of information you want.
Now that you’ve got some code for your scraper, let’s add it to your scraper.rb file and make your first commit.
You’ll want to come back to your irb session, so leave it running and open your scraper.rb file in your code editor. Replace the commented-out template code with the working code from your irb session.
Your scraper.rb should look like this:
require 'mechanize'
agent = Mechanize.new
url = 'https://morph.io/documentation/examples/australian_members_of_parliament'
page = agent.get(url)
page.at('.search-filter-results').at('li').at('.title').inner_text.strip
You actually want to collect members with this scraper, so create a member object and assign the text you’ve collected as its title:
require 'mechanize'
agent = Mechanize.new
url = 'https://morph.io/documentation/examples/australian_members_of_parliament'
page = agent.get(url)
member = {
title: page.at('.search-filter-results').at('li').at('.title').inner_text.strip
}
Add a final line to the file to help confirm that everything is working as expected.
p member
Back on the command line, in your project’s folder, you can now run this file with Ruby:
> bundle exec ruby scraper.rb
The scraper runs and the p command prints your member:
> bundle exec ruby scraper.rb
{:title=>"The Hon Ian Macfarlane MP"}
This is a good time to make your first git commit for this project. Yay!
In our next post we’ll work out how to scrape more bits of information from the page.