Web Page Scraping with Jsoup

Many sites make their content available through APIs, RSS feeds, or other forms of structured data. For those that don’t, there’s Web scraping: a technique whereby you extract data from a site’s page content. I recently employed Web scraping in a Web app that converted one file type to another. The app let the user paste in the URL of a page containing links to files of the source type. Using an open-source tool called Jsoup, it iterated over the page’s hyperlinks and processed the linked files without ever downloading them to the user’s device.

As you are probably aware, working with the DOM (Document Object Model) is a lot easier using a library. On the client side, you’ve got the excellent jQuery library. On the server, your choice of tool depends on the language you are coding in. In today’s article, I’d like to elaborate on the Jsoup Web scraping library for Java. Using my recent app as an example, we’ll learn about some of its many capabilities.

A Brief Overview

Jsoup is an open-source Java library consisting of methods designed to extract and manipulate HTML document content. It was written in 2009 by Jonathan Hedley, a software development manager for Amazon Seattle. If you’re familiar with jQuery, you should have no trouble working with Jsoup’s methods. Here’s a taste of what you can do with them:

scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
manipulate the HTML elements, attributes, and text
clean user-submitted content against a safe white-list, to prevent XSS attacks
output tidy HTML
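
To make a few of these items concrete, here is a minimal sketch that parses markup from a string, extracts text with a CSS selector, changes an attribute, and cleans an untrusted fragment. The class name, method names, and sample markup are made up for illustration, and the Safelist class assumes a fairly recent Jsoup release (older releases call it Whitelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

// Hypothetical demo class; not part of the app described in this article.
final class JsoupTaste {

  // Parse markup held in a string and return the text of the first <h1>.
  static String firstHeading(String html) {
    Document doc = Jsoup.parse(html);
    Element heading = doc.selectFirst("h1");
    return heading == null ? "" : heading.text();
  }

  // Set rel="nofollow" on every link, then report the first link's rel value.
  static String markLinksNoFollow(String html) {
    Document doc = Jsoup.parse(html);
    doc.select("a[href]").attr("rel", "nofollow");
    Element first = doc.selectFirst("a[href]");
    return first == null ? "" : first.attr("rel");
  }

  // Strip anything the basic safe list does not allow (scripts, event handlers, ...).
  static String sanitize(String untrustedHtml) {
    return Jsoup.clean(untrustedHtml, Safelist.basic());
  }

  public static void main(String[] args) {
    System.out.println(firstHeading("<h1>Hello <em>world</em></h1><p>Body</p>"));
    System.out.println(markLinksNoFollow("<a href='/a'>A</a>"));
    System.out.println(sanitize(
        "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>"));
  }
}
```

Each method parses its own input, so the three operations can be tried independently of one another.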

So how do you add all of this goodness to your project? Just download the jar file from the Jsoup site and reference it from your project. For example, in Eclipse:

1. Right-click your project in the Project Explorer and select Properties… from the popup menu.
2. In the properties dialog, select Java Build Path from the list on the left.
3. Click on the Libraries tab.
4. Click the Add External JARs… button and navigate to the downloaded Jsoup jar file. Click Open.
5. Click OK on the properties dialog to close it.

Fetching the Page

Before you can work with the DOM, you need the parsable document markup. That’s the text content that is sent to the browser. At that point, all server-side code will have executed and generated whatever dynamic content is required. Jsoup represents a Web page using the org.jsoup.nodes.Document object. It can be created from a content string or via a connection. Typically, the simplest choice is the latter, but there are cases where you may want to fetch the page yourself, such as where a proxy server is involved or credentials are required.

There are two steps to fetching a page: first you create the Connection to the resource. Then you call the get() function to retrieve the page content:

// fetch the document over HTTP
Document doc = Jsoup.connect("http://google.com").get();

Like jQuery, Jsoup methods are chainable, so you can do other things in the same statement, such as set the user agent string and provide request parameters:

Document doc = Jsoup.connect("http://google.com").userAgent("Mozilla").data("name", "jsoup").get();
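
Because get() is the call that actually sends the request, the chain can be built and inspected first. The sketch below, whose class and method names are hypothetical, adds a timeout to the options shown above and reads the settings back through Connection.request() without touching the network. The ten-second timeout is an arbitrary choice.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

// Hypothetical helper class for illustration only.
final class ConnectionSetup {

  // Configure a connection but do not execute it yet.
  static Connection prepare(String url) {
    return Jsoup.connect(url)
        .userAgent("Mozilla")
        .data("name", "jsoup")
        .timeout(10_000); // milliseconds
  }

  public static void main(String[] args) {
    Connection conn = prepare("http://google.com");
    System.out.println("timeout: " + conn.request().timeout());
    System.out.println("user agent: " + conn.request().header("User-Agent"));
    // Calling conn.get() here would send the request and return the parsed Document.
  }
}
```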

Fetching the page yourself is a lot more work on your part, but it’s an option if you want it. The following example uses an HttpURLConnection to fetch the resource and wraps the InputStreamReader in a BufferedReader in order to read in the file line-by-line. The full markup string is then passed to the static Jsoup.parse() method:

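// needs imports from java.io, java.net, java.nio.charset.StandardCharsets and org.jsoup
URL url = new URL("http://google.com/");
HttpURLConnection urlConn = (HttpURLConnection) url.openConnection();
StringBuilder tmp = new StringBuilder();
// try-with-resources closes the reader and its underlying stream when done;
// UTF-8 is assumed here, so adjust it if the page declares another charset
try (BufferedReader in = new BufferedReader(
    new InputStreamReader(urlConn.getInputStream(), StandardCharsets.UTF_8))) {
  String line;
  while ((line = in.readLine()) != null) {
    // readLine() strips the line terminator, so put it back
    tmp.append(line).append('\n');
  }
}

// the second argument is the base URI used to resolve relative links
Document doc = Jsoup.parse(tmp.toString(), url.toString());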
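
If you already have the response stream in hand, one possible shortcut is to let Jsoup read it directly, using the Jsoup.parse() overload that takes an InputStream, a charset name, and a base URI. In this sketch, a ByteArrayInputStream stands in for the stream returned by the HTTP connection so that it runs offline; the class name, sample markup, and base URI are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class for illustration only.
final class StreamParse {

  // Parse markup straight from a stream and return the page title.
  static String titleOf(InputStream in, String baseUri) {
    // try-with-resources closes the stream once parsing is finished
    try (InputStream stream = in) {
      Document doc = Jsoup.parse(stream, "UTF-8", baseUri);
      return doc.title();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    String html = "<html><head><title>Sample page</title></head><body></body></html>";
    InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    System.out.println("title: " + titleOf(in, "http://example.com/"));
  }
}
```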

Iterating over Hyperlinks

Once we’ve obtained a reference to the Document, we can work with the DOM much like we would using jQuery. Case in point, here’s the code to iterate over every hyperlink in the document and print its href attribute and text to the console:

package com.robgravelle.scraper;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ScrapeHyperlinks {

  public static void main(String[] args) {
    try {
      // fetch the document over HTTP
      Document doc = Jsoup.connect("http://google.com").get();
     
      // get the page title
      String title = doc.title();
      System.out.println("title: " + title);
     
      // get all links in page
      Elements links = doc.select("a[href]");
      for (Element link : links) {
        // get the value from the href attribute
        System.out.println("\nlink: " + link.attr("href"));
        System.out.println("text: " + link.text());
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

Here is the partial output of the above code:

title: Google

link: https://www.google.ca/setprefs?suggon=2&prev=https://www.google.ca/?gfe_rd%...tvvuVYmnEMWC8QfH9YvQDg%26gws_rd%3Dssl&sig=0__HbDv6hs6FIlym_AHoeCX1JHMtU%3D
text: Screen-reader users, click here to turn off Google Instant.

link: https://mail.google.com/mail/?tab=wm
text: Gmail

link: https://www.google.ca/imghp?hl=en&tab=wi&ei=w_vuVfnIO8yxe-a5oogI&ved=0CBMQqi4oAQ
text: Images

//etc...

As with jQuery, you don’t read element attributes as fields. Everything is retrieved through method calls such as attr() and text().
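
An app that follows links to files usually needs absolute URLs and only cares about one file type. The following is a rough sketch of how that might look, not the code from my app: it parses the markup with a base URI, narrows the selection with an attribute suffix selector, and resolves each href with absUrl(). The class name, the .csv extension, and the example.com addresses are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration only.
final class FileLinks {

  // Return the absolute URLs of all links whose href ends with the given suffix.
  static List<String> linksEndingWith(String html, String baseUri, String suffix) {
    // The base URI lets Jsoup resolve relative hrefs later on
    Document doc = Jsoup.parse(html, baseUri);
    List<String> urls = new ArrayList<>();
    // [href$=...] matches attribute values that end with the suffix
    for (Element link : doc.select("a[href$=" + suffix + "]")) {
      urls.add(link.absUrl("href"));
    }
    return urls;
  }

  public static void main(String[] args) {
    String html = "<a href=\"/files/a.csv\">A</a>"
        + "<a href=\"b.csv\">B</a>"
        + "<a href=\"notes.txt\">Notes</a>";
    for (String url : linksEndingWith(html, "http://example.com/docs/index.html", ".csv")) {
      System.out.println(url);
    }
  }
}
```

When the Document comes from Jsoup.connect(...).get(), the base URI is taken from the fetched page, so absUrl() can be used the same way there.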

Conclusion

Now that we’ve got a feel for the basics, in the next instalment, we’ll move on to more advanced operations.
