Extract information from the web with Ruby

Take advantage of web scraping software and website APIs for automated data extraction

Explore the latest methods for extracting structured information from the web. Using Ruby script examples, author M. Tim Jones demonstrates scraping technology and the use of web APIs for targeted data retrieval.

M. Tim Jones, Independent author

M. Tim Jones is an embedded-firmware architect and the author of Artificial Intelligence: A Systems Approach, GNU/Linux Application Programming, AI Application Programming, and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. Tim is a platform architect with Intel in Longmont, Colo.



17 December 2013

Also available in Russian and Japanese

Websites no longer cater solely to human readers. Many sites now support APIs that enable computer programs to harvest information. Screen scraping, the time-honored technique of parsing HTML pages into more digestible forms, can still come in handy. But opportunities to simplify web data extraction through the use of APIs are increasing rapidly. According to ProgrammableWeb, more than 10,000 website APIs were available at the time this article was published, an increase of more than 3,000 over the preceding 15 months. (ProgrammableWeb itself offers an API for searching for and retrieving APIs, mashups, member profiles, and other data from its catalog.)

This article begins with a look at modern-day web scraping and compares it to the API approach. Then, through Ruby examples, it shows how to extract structured information by using APIs from some popular web properties. Basic understanding of the Ruby language, Representational State Transfer (REST), and data formats such as JavaScript Object Notation (JSON) and XML is assumed.

Scraping vs. APIs

Several scraping solutions are available. Some translate HTML into other formats, such as JSON, which makes it simpler to extract the desired content. Others parse the HTML and let you address content through the document hierarchy in which the data is marked up. One such solution is Nokogiri, which supports the parsing of HTML and XML documents with the Ruby language. Other open source scraping tools include pjscrape for JavaScript and Beautiful Soup for Python. pjscrape implements a command-line tool that can scrape a fully rendered page, including JavaScript content. Beautiful Soup integrates cleanly into the Python 2 and 3 environments.

Suppose you want to use scraping with Nokogiri to identify the number of IBM employees as reported by CrunchBase. The first step is to understand the markup of the particular HTML page on CrunchBase where the number of IBM employees is listed. Figure 1 shows this page open in the Firebug tool within Mozilla Firefox. The upper half of the image illustrates the rendered HTML, and the lower half shows the HTML source code for the section of interest.

Figure 1. Viewing HTML source with Firefox's Firebug
Screen capture of viewing HTML source with Firefox's Firebug

The Ruby script in Listing 1 uses Nokogiri to scrape the number of employees from the web page in Figure 1.

Listing 1. Parsing HTML with Nokogiri (parse.rb)
#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Define the URL with the argument passed by the user
uri = "http://www.crunchbase.com/company/#{ARGV[0]}"

# Use Nokogiri to get the document
doc = Nokogiri::HTML(open(uri))

# Find the link of interest
link = doc.search('tr span[1]')

# Emit the content associated with that link
puts link[0].content

In the HTML source that Firebug displays in Figure 1, you can see that the data of interest (the number of employees) is embedded within a <span> tag that carries a unique ID, and that the <span id="num_employees"> tag is the first of the two <span> tags in the table row. So the last two instructions in Listing 1 request the first <span> tag with link = doc.search('tr span[1]') and then emit the content of that parsed link with puts link[0].content.

CrunchBase also exposes a REST API that makes considerably more data accessible than you can access through scraping. Listing 2 shows how to use the API to extract the number of a company's employees from the CrunchBase site.

Listing 2. Using the CrunchBase REST API with JSON parsing (api.rb)
#!/usr/bin/env ruby
require 'rubygems'
require 'json'
require 'net/http'

# Define the URL with the argument passed by the user
uri = "http://api.crunchbase.com/v/1/company/#{ARGV[0]}.js"

# Perform the HTTP GET request, and return the response
resp = Net::HTTP.get_response(URI.parse(uri))

# Parse the JSON from the response body
jresp = JSON.parse(resp.body)

# Emit the content of interest
puts jresp['number_of_employees']

In Listing 2, you define a URL (with the company passed in as the script argument). Then you use the HTTP class to make a GET request and return the response. The response is parsed as a JSON object, and you reference your item of interest through a Ruby data structure.

The console session in Listing 3 shows the results of running both the scraping script from Listing 1 and the API-based script from Listing 2.

Listing 3. Demonstrating the scraping and API approaches
$ ./parse.rb ibm
388,000
$ ./api.rb ibm
388000
$ ./parse.rb cisco
63,000
$ ./api.rb cisco
63000
$ ./parse.rb paypal
300,000
$ ./api.rb paypal
300000
$

When the scraping script runs, you receive a formatted count, whereas the API script produces a raw integer. As Listing 3 shows, either script can be generalized to request employee numbers for other companies that CrunchBase tracks, because each approach constructs its URL from the company name passed as an argument.

So, what can you gain by using the API approach? In the scraping case, you needed to dig into the HTML to understand its structure and identify the data to extract. It was then simple to parse the HTML with Nokogiri and grab the data of interest. But if the structure of the HTML document changes, you might need to modify the script to parse the new structure correctly. The API approach doesn't present that concern, as long as the API contract stands. Another key advantage of the API approach is that you can access all the data exposed through the interface (through the returned JSON object). Considerably less CrunchBase data is exposed through HTML for human consumption.

Next, you'll explore the use of some other APIs for extracting various kinds of information from the Internet, again with the help of Ruby scripting. First you'll see how to collect personal data from a social networking site, and then how to gather less personal data through other API sources.


Extracting personal data from LinkedIn

LinkedIn is a social networking website for professionals. It's useful for connecting with other developers, looking for a job, researching a company, or joining a group to collaborate on interesting topics. LinkedIn also incorporates a recommendation engine that can suggest jobs and companies to follow based on your profile.

LinkedIn users can get access to the site's REST and JavaScript APIs to retrieve information that's also accessible through its human-readable website: connections, social sharing streams, content groups, communications (messages and connection invitations), and company and jobs information.

To use the LinkedIn API, you must register your application. After registration, you get an API key and secret, and a user token and secret. LinkedIn uses the OAuth protocol for authentication.

After you authenticate, you can make REST requests through the access-token object. The response is a typical HTTP response, so you can parse the body into a JSON object. Then you can iterate through the JSON object to extract your data of interest.

The Ruby script in Listing 4 delivers recommendations for companies to follow and job suggestions to the authenticated LinkedIn user.

Listing 4. Viewing company and job suggestions with the LinkedIn API (lkdin.rb)
#!/usr/bin/ruby
require 'rubygems'
require 'oauth'
require 'json'

pquery = "http://api.linkedin.com/v1/people/~?format=json"
cquery='http://api.linkedin.com/v1/people/~/suggestions/to-follow/companies?format=json'
jquery='http://api.linkedin.com/v1/people/~/suggestions/job-suggestions?format=json'
 
# Fill the keys and secrets you retrieved after registering your app
api_key = 'api key'
api_secret = 'api secret'
user_token = 'user token'
user_secret = 'user secret'
 
# Specify LinkedIn API endpoint
configuration = { :site => 'https://api.linkedin.com' }
 
# Use the API key and secret to instantiate consumer object
consumer = OAuth::Consumer.new(api_key, api_secret, configuration)
 
# Use the developer token and secret to instantiate access token object
access_token = OAuth::AccessToken.new(consumer, user_token, user_secret)

# Get the username for this profile
response = access_token.get(pquery)
jresp = JSON.parse(response.body)
myName = "#{jresp['firstName']} #{jresp['lastName']}"
puts "\nSuggested companies to follow for #{myName}"

# Get the suggested companies to follow
response = access_token.get(cquery)
jresp = JSON.parse(response.body)

# Iterate through each and display the company name
jresp['values'].each do | company |
    puts "  #{company['name']}"
end

# Get the job suggestions
response = access_token.get(jquery)
jresp = JSON.parse(response.body)
puts "\nSuggested jobs for #{myName}"

# Iterate through each suggested job and print the company name
jresp['jobs']['values'].each do | job |
    puts "  #{job['company']['name']} in #{job['locationDescription']}"
end

puts "\n"

The console session in Listing 5 shows the output from running the Ruby script in Listing 4. The output results from three separate calls in the script to the LinkedIn API (one to retrieve the authenticated user's name and one each for the company and job suggestions).

Listing 5. Demonstrating the LinkedIn Ruby script
$ ./lkdin.rb

Suggested companies to follow for M. Tim Jones
  Open Kernel Labs, Inc.
  Linaro
  Wind River
  DDC-I
  Linsyssoft Technologies
  Kalray
  American Megatrends
  JetHead Development
  Evidence Srl
  Aizyc Technology

Suggested jobs for M. Tim Jones
  Kozio in Greater Denver Area
  Samsung Semiconductor Inc in San Jose, CA
  Terran Systems in Sunnyvale, CA
  Magnum Semiconductor in San Francisco Bay Area
  RGB Spectrum in Alameda, CA
  Aptina in San Francisco Bay Area
  CyberCoders in San Francisco, CA
  CyberCoders in Alameda, CA
  SanDisk in Longmont, CO
  SanDisk in Longmont, CO

$

You can use the LinkedIn API with any language that offers OAuth support.


Retrieving business data with the Yelp API

Yelp exposes a rich REST API for business search that includes ratings, reviews, and geographical search (neighborhood, city, geocode). With the Yelp API, you can search for businesses of a given type (such as "restaurant") and constrain that search to a geographical bounding box; proximity to a geographic coordinate; or proximity to a neighborhood, address, or city. The JSON response includes a large amount of information about businesses that match the criteria, including address information, distance, ratings, deals, and URLs for other types of information (such as a picture of the business, mobile-formatted information, and more).

Like LinkedIn, Yelp uses OAuth for authentication, so you must register with Yelp to get a set of credentials for authenticating through the API. After your script authenticates, you can construct a REST-based URL request. In Listing 6, I hard-code a restaurant request for Boulder, Colo. The response body is parsed into a JSON object and iterated through to emit the desired information. Note that I exclude businesses that are closed.

Listing 6. Retrieving business data with the Yelp API (yelp.rb)
#!/usr/bin/ruby
require 'rubygems'
require 'oauth'
require 'json'

consumer_key = 'your consumer key'
consumer_secret = 'your consumer secret'
token = 'your token'
token_secret = 'your token secret'
api_host = 'http://api.yelp.com'

consumer = OAuth::Consumer.new(consumer_key, consumer_secret, {:site => api_host})
access_token = OAuth::AccessToken.new(consumer, token, token_secret)

path = "/v2/search?term=restaurants&location=Boulder,CO"

jresp = JSON.parse(access_token.get(path).body)

jresp['businesses'].each do | business |
    if business['is_closed'] == false
      printf("%-32s  %10s  %3d  %1.1f\n", 
                business['name'], business['phone'], 
                business['review_count'], business['rating'])
    end
end

The console session in Listing 7 shows sample output from running the Listing 6 script. For simplicity, I display only the initial set of businesses returned, instead of supporting the limit/offset features of the API (to perform multiple calls to retrieve the entire list). This sample output shows the business name, phone number, number of reviews received, and average rating.

Listing 7. Demonstrating the Yelp API Ruby script
$ ./yelp.rb
Frasca Food and Wine              3034426966  189  4.5
John's Restaurant                 3034445232   51  4.5
Leaf Vegetarian Restaurant        3034421485  144  4.0
Nepal Cuisine                     3035545828   65  4.5
Black Cat Bistro                  3034445500   72  4.0
The Mediterranean Restaurant      3034445335  306  4.0
Arugula Bar E Ristorante          3034435100   48  4.0
Ras Kassa's Ethiopia Restaurant   3034472919  101  4.0
L'Atelier                         3034427233   58  4.0
Bombay Bistro                     3034444721   87  4.0
Brasserie Ten Ten                 3039981010  200  4.0
Flagstaff House                   3034424640   86  4.5
Pearl Street Mall                 3034493774   77  4.0
Gurkhas on the Hill               3034431355   19  4.0
The Kitchen                       3035445973  274  4.0
Chez Thuy Restaurant              3034421700   99  3.5
Il Pastaio                        3034479572  113  4.5
3 Margaritas                      3039981234   11  3.5
Q's Restaurant                    3034424880   65  4.0
Julia's Kitchen                                 8  5.0

$

Yelp offers one of the best-documented APIs, with data descriptions, examples, error handling, and more. But although the Yelp API is useful, its use is restricted through throttling. As an initial developer, you have a maximum of 100 API calls per day, with 1,000 calls for testing purposes. If your application meets Yelp's display requirements, you'll be given up to 10,000 calls per day (with the potential for more).


Domain location with a simple mashup

This next example ties two sources together to yield information. In this case, you want to translate a web domain name into its general geographic location. The Ruby script in Listing 8 uses the Linux® host command and the OpenCrypt IP Location API Service to retrieve location information.

Listing 8. Retrieving location information for web domains
#!/usr/bin/env ruby
require 'net/http'

aggr = ""
key = 'your api key here'

# Get the IP address for the domain by using the 'host' command
IO.popen("host #{ARGV[0]}") do |pipe|
  aggr = pipe.read
end

# Find the IP address in the response from the 'host' command
pattern = /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/
if m = pattern.match(aggr)
    uri = "http://api.opencrypt.com/ip/?IP=#{m[0]}&key=#{key}"
    resp = Net::HTTP.get_response(URI.parse(uri))
    puts resp.body
end

In Listing 8, you begin by using the locally available host command to translate the domain name into an IP address. (The host command itself resolves the domain name to an IP address through DNS.) A simple regular expression (and the match method) parses the IP address from the host command's output. With the IP address in hand, you use the IP location service at OpenCrypt to retrieve general geolocation information. The OpenCrypt API permits up to 50,000 free API calls.

The OpenCrypt API call is simple: Your constructed URL contains both the IP address you want to locate and the key provided to you by the OpenCrypt registration process. The HTTP response body consists of the IP address, country code, and country name.

The console session in Listing 9 shows the output for two sample domain names.

Listing 9. Using the simple domain-location script
$ ./where.rb www.baynet.ne.jp
IP=111.68.239.125
CC=JP
CN=Japan
$ ./where.rb www.pravda.ru
IP=212.76.137.2
CC=RU
CN=Russian Federation
$

Google API discovery

The undisputed champion of web APIs is Google. Google has so many APIs that it offers another API for discovering them. Through the Google API Discovery Service, you can list the available APIs from Google and extract metadata about them. Although interaction with most Google APIs requires authentication, you can access the discovery API over a secure socket connection without authenticating. For this reason, Listing 10 uses Ruby's Net::HTTP class with SSL enabled to connect to the secure port. The defined URL specifies the REST request, and the response is JSON-encoded. The script iterates through the response and emits a small portion of the data for the preferred APIs.

Listing 10. Listing Google APIs with the Google API Discovery service (gdir.rb)
#!/usr/bin/ruby
require 'rubygems'
require 'net/https'
require 'json'

url = 'https://www.googleapis.com/discovery/v1/apis'

uri = URI.parse(url)

# Set up a connection to the Google API Service
http = Net::HTTP.new( uri.host, 443 )
http.use_ssl = true
# Note: certificate verification is disabled for simplicity; use
# OpenSSL::SSL::VERIFY_PEER in production code
http.verify_mode = OpenSSL::SSL::VERIFY_NONE

# Connect to the service
req = Net::HTTP::Get.new(uri.request_uri)
resp = http.request(req)

# Get the JSON representation
jresp = JSON.parse(resp.body)

# Iterate through the API List
jresp['items'].each do | item |
  if item['preferred'] == true
    name = item['name']
    title = item['title']
    link = item['discoveryLink']
    printf("%-17s %-34s %-20s\n", name, title, link)
  end
end

The console session in Listing 11 shows a sampling of the response from running the Listing 10 script.

Listing 11. Using the simple Google directory service Ruby script
$ ./gdir.rb
adexchangebuyer   Ad Exchange Buyer API              ./apis/adexchangebuyer/v1.1/rest
adsense           AdSense Management API             ./apis/adsense/v1.1/rest
adsensehost       AdSense Host API                   ./apis/adsensehost/v4.1/rest
analytics         Google Analytics API               ./apis/analytics/v3/rest
androidpublisher  Google Play Android Developer API  ./apis/androidpublisher/v1/rest
audit             Enterprise Audit API               ./apis/audit/v1/rest
bigquery          BigQuery API                       ./apis/bigquery/v2/rest
blogger           Blogger API                        ./apis/blogger/v3/rest
books             Books API                          ./apis/books/v1/rest
calendar          Calendar API                       ./apis/calendar/v3/rest
compute           Compute Engine API                 ./apis/compute/v1beta12/rest
coordinate        Google Maps Coordinate API         ./apis/coordinate/v1/rest
customsearch      CustomSearch API                   ./apis/customsearch/v1/rest
dfareporting      DFA Reporting API                  ./apis/dfareporting/v1/rest
discovery         APIs Discovery Service             ./apis/discovery/v1/rest
drive             Drive API                          ./apis/drive/v2/rest
...
storage           Cloud Storage API                  ./apis/storage/v1beta1/rest
taskqueue         TaskQueue API                      ./apis/taskqueue/v1beta2/rest
tasks             Tasks API                          ./apis/tasks/v1/rest
translate         Translate API                      ./apis/translate/v2/rest
urlshortener      URL Shortener API                  ./apis/urlshortener/v1/rest
webfonts          Google Web Fonts Developer API     ./apis/webfonts/v1/rest
youtube           YouTube API                        ./apis/youtube/v3alpha/rest
youtubeAnalytics  YouTube Analytics API              ./apis/youtubeAnalytics/v1/rest
$

The output in Listing 11 shows the API names, their titles, and the URL path for digging deeper into each one.


Conclusion

The examples in this article illustrate the power available in public APIs for extracting information from the Internet. Web APIs provide access to targeted, specific information, in contrast to web scraping and spidering. New value is being created on the Internet, not only through the use of these APIs but also by combining them in novel ways to present new data to a growing population of web users.

Keep in mind, though, that with APIs, you get what you pay for. Throttling issues are a common complaint. Also, the fact that API rules can change without notice must be a consideration when you build applications. Relatively recently, Twitter changed its API to provide "a more consistent experience." The change spelled doom for a number of third-party applications that could be viewed as competitive to the typical Twitter web client.

Resources

Get products and technologies

  • Nokogiri: Get a scraping solution for Ruby that supports the parsing of HTML and XML documents.
  • pjscrape: Check out this JavaScript framework for client-side web scraping.
  • Beautiful Soup: Give this scraping library a try if Python is your scripting language of choice.

