Web Page Scraping with Jsoup

By Rob Gravelle

A lot of sites make their content available via APIs, RSS feeds, or other forms of structured data. For those that don't, there's Web scraping, a technique whereby you extract data directly from a site's page content. I recently employed Web scraping within a Web app that converted one file type to another. It featured the ability to paste in a URL for a page that contained links to files of the source type. Using an open source tool called Jsoup, my app iterated over the hyperlinks to process the files without ever downloading them to the user's device.

As you are probably aware, working with the DOM (Document Object Model) is a lot easier using a library. On the client side, you've got the excellent jQuery library. On the server, your choice of tool depends on the language that you are coding in. In today's article, I'd like to elaborate on the Jsoup Web scraping library for Java. Using my recent app as an example, we'll learn about some of its many capabilities.

A Brief Overview

Jsoup is an open-source Java library consisting of methods designed to extract and manipulate HTML document content. It was written in 2009 by Jonathan Hedley, a software development manager for Amazon Seattle. If you're familiar with jQuery, you should have no trouble working with Jsoup's methods. Here's a taste of what you can do with them:

  • scrape and parse HTML from a URL, file, or string
  • find and extract data, using DOM traversal or CSS selectors
  • manipulate the HTML elements, attributes, and text
  • clean user-submitted content against a safe white-list, to prevent XSS attacks
  • output tidy HTML
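
To make the cleaning bullet a little more concrete, here is a minimal sketch. It assumes a recent Jsoup release in which the safe-list class is named Safelist (older releases call it Whitelist). The CleanSketch class, the sanitize() helper, and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanSketch {

  // Returns the markup with anything outside Jsoup's "basic" safe list removed.
  static String sanitize(String untrustedHtml) {
    return Jsoup.clean(untrustedHtml, Safelist.basic());
  }

  public static void main(String[] args) {
    // hypothetical user-submitted content
    String dirty = "<p>Hi <script>alert(1)</script>"
        + "<a href=\"http://example.com/\" onclick=\"steal()\">there</a></p>";

    // the script element and the onclick attribute should be gone
    System.out.println(sanitize(dirty));
  }
}
```

The basic safe list permits simple text formatting and links. Jsoup ships other presets as well, so the right one depends on how much markup you want to let through.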

So how do you add all of this goodness to your project? Just download the jar file from the Jsoup site and reference it from your project. For example, in Eclipse,

  1. Right-click your project in the Project Explorer and select Properties... from the popup menu.
  2. In the properties dialog,
    1. Select Java Build Path from the list on the left.
    2. Click on the Libraries tab.
    3. Click the Add External JARs... button and navigate to the downloaded Jsoup jar file. Click Open.
  3. Click OK on the properties dialog to close it.

Fetching the Page

Before you can work with the DOM, you need the parsable document markup. That's the text content that is sent to the browser. At that point, all server-side code will have executed and generated whatever dynamic content is required. Jsoup represents a Web page using the org.jsoup.nodes.Document object. It can be created from a content string or via a connection. Typically, the simplest choice is the latter, but there are cases where you may want to fetch the page yourself, such as where a proxy server is involved or credentials are required.
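
As a quick illustration of the content-string route, the sketch below builds a Document from a literal string. The StringParseSketch class, its helper methods, and the sample markup are assumptions made for the example. Note that the markup is deliberately incomplete; Jsoup fills in the missing html, head, and body elements while parsing.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StringParseSketch {

  // Parses a markup string and returns the page title.
  static String titleOf(String html) {
    Document doc = Jsoup.parse(html);
    return doc.title();
  }

  // Parses a markup string and returns the visible text of the body.
  static String bodyTextOf(String html) {
    Document doc = Jsoup.parse(html);
    return doc.body().text();
  }

  public static void main(String[] args) {
    // hypothetical markup with no html, head, or body tags
    String html = "<title>Sample</title><p>Hello, <b>jsoup</b>!";

    System.out.println("title: " + titleOf(html));
    System.out.println("text:  " + bodyTextOf(html));
  }
}
```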

There are two steps to fetching a page: first you create the Connection to the resource. Then you call the get() function to retrieve the page content:

// fetch the document over HTTP
Document doc = Jsoup.connect("http://google.com").get();

Like jQuery, Jsoup functions are chainable, so you can do other things in the same statement, such as set the User-Agent header and provide request parameters:

Document doc = Jsoup.connect("http://google.com").userAgent("Mozilla").data("name", "jsoup").get();
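
Because each of those calls returns the Connection itself, you can also build the request first and send it later. The following is only a sketch of that idea: the ConnectionSketch class, the configure() helper, the example.com address, and the five-second timeout are all assumptions, not part of my app.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectionSketch {

  // Builds a configured Connection; nothing is sent over the network yet.
  static Connection configure(String url) {
    return Jsoup.connect(url)
        .userAgent("Mozilla")    // same value as in the one-liner above
        .data("name", "jsoup")   // same request parameter as above
        .timeout(5000);          // hypothetical limit, in milliseconds
  }

  public static void main(String[] args) {
    Connection conn = configure("http://example.com/");

    // inspect the pending request before it is executed
    Connection.Request req = conn.request();
    System.out.println("timeout (ms): " + req.timeout());
    System.out.println("user agent:   " + req.header("User-Agent"));

    // calling conn.get() at this point would perform the request
    // and return the parsed Document
  }
}
```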

Fetching the page yourself is a lot more work on your part, but it's an option if you want it. The following example uses an HttpURLConnection to fetch the resource and wraps the InputStreamReader in a BufferedReader in order to read in the file line-by-line. The full markup string is then passed to the static Jsoup.parse() method:

URL url = new URL("http://google.com/");
HttpURLConnection urlConn = (HttpURLConnection) url.openConnection();
StringBuilder tmp = new StringBuilder();
// try-with-resources closes the reader (and the underlying stream) when done
try (BufferedReader in = new BufferedReader(new InputStreamReader(urlConn.getInputStream()))) {
  String line;
  while ((line = in.readLine()) != null) {
    // readLine() strips the line terminator, so put one back
    tmp.append(line).append('\n');
  }
}

Document doc = Jsoup.parse(tmp.toString());
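
If you do fetch the page yourself, you don't necessarily have to build the string by hand. Jsoup.parse() is also overloaded to accept an InputStream along with a character set name and a base URI. Here is a minimal sketch of that variation. The StreamParseSketch class and parseStream() helper are hypothetical, UTF-8 is an assumed encoding, and an in-memory stream stands in for urlConn.getInputStream() so that the example runs without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StreamParseSketch {

  // Lets Jsoup read and decode the stream, then closes it.
  static Document parseStream(InputStream in, String baseUri) throws IOException {
    try (InputStream body = in) {
      // "UTF-8" is an assumption about the page's encoding
      return Jsoup.parse(body, "UTF-8", baseUri);
    }
  }

  public static void main(String[] args) throws IOException {
    // stand-in for the stream returned by an HttpURLConnection
    byte[] page = "<title>Stream test</title><p>Hello</p>"
        .getBytes(StandardCharsets.UTF_8);

    Document doc = parseStream(new ByteArrayInputStream(page), "http://example.com/");
    System.out.println("title: " + doc.title());
  }
}
```

The base URI argument tells Jsoup where the markup came from, which it needs in order to resolve relative links later on.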

Iterating over Hyperlinks

Once we've obtained a reference to the Document, we can work with the DOM much like we would using jQuery. Case in point, here's the code to iterate over every hyperlink in the document and print its href attribute and text to the console:

package com.robgravelle.scraper;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ScrapeHyperlinks {

  public static void main(String[] args) {
    try {
      // fetch the document over HTTP
      Document doc = Jsoup.connect("http://google.com").get();
     
      // get the page title
      String title = doc.title();
      System.out.println("title: " + title);
     
      // get all links in page
      Elements links = doc.select("a[href]");
      for (Element link : links) {
        // get the value from the href attribute
        System.out.println("\nlink: " + link.attr("href"));
        System.out.println("text: " + link.text());
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

Here is the partial output of the above code:

title: Google

link: https://www.google.ca/setprefs?suggon=2&prev=https://www.google.ca/?gfe_rd%...tvvuVYmnEMWC8QfH9YvQDg%26gws_rd%3Dssl&sig=0__HbDv6hs6FIlym_AHoeCX1JHMtU%3D
text: Screen-reader users, click here to turn off Google Instant.

link: https://mail.google.com/mail/?tab=wm
text: Gmail

link: https://www.google.ca/imghp?hl=en&tab=wi&ei=w_vuVfnIO8yxe-a5oogI&ved=0CBMQqi4oAQ
text: Images

//etc...

As with jQuery, element attributes are not exposed as properties of the element. All information is returned via function calls such as attr() and text().
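
Building on the attr() call used in the listing, the sketch below narrows the selection to links whose href ends with a given suffix and collects them as absolute URLs, which is roughly what an app needs before it can process linked files. The LinkCollector class, the linksEndingWith() helper, the ".pdf" suffix, and the sample markup are all assumptions chosen for illustration; my app's actual file types are not shown here.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkCollector {

  // Returns the absolute URL of every link whose href ends with the suffix.
  static List<String> linksEndingWith(Document doc, String suffix) {
    List<String> urls = new ArrayList<>();
    // [href$=...] matches attribute values that end with the given text
    for (Element link : doc.select("a[href$=" + suffix + "]")) {
      // the abs: prefix resolves the href against the document's base URI
      String url = link.attr("abs:href");
      if (!url.isEmpty()) {
        urls.add(url);
      }
    }
    return urls;
  }

  public static void main(String[] args) {
    // hypothetical page content and base URI
    String html = "<a href='/a/one.pdf'>One</a>"
        + "<a href='two.html'>Two</a>"
        + "<a href='http://other.example/three.pdf'>Three</a>";
    Document doc = Jsoup.parse(html, "http://example.com/dir/");

    for (String url : linksEndingWith(doc, ".pdf")) {
      System.out.println("link: " + url);
    }
  }
}
```

When the Document comes from Jsoup.connect(), the base URI is set for you; when parsing a string, you have to pass it in as the second argument, as done above.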

Conclusion

Now that we've got a feel for the basics, in the next instalment, we'll move on to more advanced operations.


Rob Gravelle

Rob Gravelle resides in Ottawa, Canada, and is the founder of Gravelle Web Design. Rob has built systems for intelligence-related organizations such as Canada Border Services and CSIS, as well as for numerous commercial businesses.

In his spare time, Rob has become an accomplished guitar player and has released several CDs. His band, Ivory Knight, was rated as one of Canada's top hard rock and metal groups by Brave Words magazine (issue #92) and reached the #1 spot in the National Heavy Metal charts on Reverb Nation.


