A lot of sites make their content available via APIs, RSS feeds, or other forms of structured data. For those that don't, there's Web scraping: a technique for extracting data from website content. I recently employed Web scraping within a Web app that converted one file type to another. It featured the ability to paste in a URL that contained links to the source file type. Using an open source tool called Jsoup, my app iterated over the hyperlinks to process the files without ever downloading them to the user's device. As you are probably aware, working with the DOM (Document Object Model) is a lot easier using a library. On the client side, you've got the excellent jQuery library. On the server, your choice of tool depends on the language you're coding in. In today's article, I'd like to elaborate on the Jsoup Web scraping library for Java. Using my recent app as an example, we'll learn about some of its many capabilities.
A Brief Overview
Jsoup is an open-source Java library consisting of methods designed to extract and manipulate HTML document content. It was written in 2009 by Jonathan Hedley, a software development manager for Amazon Seattle. If you’re familiar with jQuery, you should have no trouble working with Jsoup’s methods. Here’s a taste of what you can do with them:
- scrape and parse HTML from a URL, file, or string
- find and extract data, using DOM traversal or CSS selectors
- manipulate the HTML elements, attributes, and text
- clean user-submitted content against a safe white-list, to prevent XSS attacks
- output tidy HTML
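To give a taste of that last cleaning capability, here's a minimal sketch using Jsoup's Whitelist class (renamed Safelist in newer releases); the sample HTML is made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanExample {

    public static void main(String[] args) {
        // user-submitted HTML containing a script injection attempt
        String dirty = "<p>Hello <b>world</b><script>alert('XSS')</script></p>";

        // strip every element and attribute not on the basic white-list
        String clean = Jsoup.clean(dirty, Whitelist.basic());

        // the <script> element is gone; the benign markup survives
        System.out.println(clean);
    }
}
```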
So how do you add all of this goodness to your project? Just download the jar file from the Jsoup site and reference it from your project. For example, in Eclipse,
- Right-click your project in the Project Explorer and select Properties… from the popup menu.
- In the properties dialog,
- Select Java Build Path from the list on the left.
- Click on the Libraries tab.
- Click the Add External JARs… button and navigate to the downloaded Jsoup jar file. Click Open.
- Click OK on the properties dialog to close it.
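If your project uses Maven instead, you can pull Jsoup in from Maven Central rather than referencing the jar by hand (the version number below is just an example; substitute whichever release is current):

```xml
<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.8.3</version>
</dependency>
```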
Fetching the Page
Before you can work with the DOM, you need the parsable document markup. That's the text content that is sent to the browser; at that point, all server-side code will have executed and generated whatever dynamic content is required. Jsoup represents a Web page using the org.jsoup.nodes.Document object. It can be created from a content string or via a connection. Typically, the simplest choice is the latter, but there are cases where you may want to fetch the page yourself, such as where a proxy server is involved or credentials are required.
There are two steps to fetching a page: first, you create the Connection to the resource; then you call its get() method to retrieve the page content:
// fetch the document over HTTP
Document doc = Jsoup.connect("http://google.com").get();
Like jQuery, Jsoup methods are chainable, so you can do other things like set a user agent and provide request parameters:
Document doc = Jsoup.connect("http://google.com")
        .userAgent("Mozilla")
        .data("name", "jsoup")
        .get();
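The Connection interface offers other chainable settings as well; here's a sketch that adds a timeout and a referrer (the values are arbitrary):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ConnectExample {

    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://google.com")
                .userAgent("Mozilla")            // identify as a common browser
                .referrer("http://google.com")   // some servers check the referrer
                .timeout(5000)                   // give up after five seconds
                .get();
        System.out.println(doc.title());
    }
}
```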
Fetching the page yourself is more work, but it's an option if you need it. The following example uses an HttpURLConnection to fetch the resource and wraps the InputStreamReader in a BufferedReader in order to read the file line by line. The full markup string is then passed to the static Jsoup.parse() method:
URL url = new URL("http://google.com/");
HttpURLConnection urlConn = (HttpURLConnection) url.openConnection();
String line = null;
StringBuilder tmp = new StringBuilder();
BufferedReader in = new BufferedReader(
        new InputStreamReader(urlConn.getInputStream()));
while ((line = in.readLine()) != null) {
    tmp.append(line);
}
in.close();
Document doc = Jsoup.parse(tmp.toString());
Iterating over Hyperlinks
Once we’ve obtained a reference to the Document, we can work with the DOM much like we would using jQuery. Case in point, here’s the code to iterate over every hyperlink in the document and print its href attribute and text to the console:
package com.robgravelle.scraper;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ScrapeHyperlinks {

    public static void main(String[] args) {
        try {
            // fetch the document over HTTP
            Document doc = Jsoup.connect("http://google.com").get();

            // get the page title
            String title = doc.title();
            System.out.println("title: " + title);

            // get all links in page
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // get the value from the href attribute
                System.out.println("\nlink: " + link.attr("href"));
                System.out.println("text: " + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Here is the partial output of the above code:
title: Google

link: https://www.google.ca/setprefs?suggon=2&prev=https://www.google.ca/?gfe_rd%...tvvuVYmnEMWC8QfH9YvQDg%26gws_rd%3Dssl&sig=0__HbDv6hs6FIlym_AHoeCX1JHMtU%3D
text: Screen-reader users, click here to turn off Google Instant.

link: https://mail.google.com/mail/?tab=wm
text: Gmail

link: https://www.google.ca/imghp?hl=en&tab=wi&ei=w_vuVfnIO8yxe-a5oogI&ved=0CBMQqi4oAQ
text: Images

//etc...
As with jQuery, there are no direct references to element attributes; all information is retrieved via method calls such as attr() and text().
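One handy consequence of that method-based approach is Jsoup's attribute-key prefixes. For instance, attr("abs:href") resolves a relative link against the document's base URI, which is invaluable when scraping relative URLs. Here's a short sketch (the HTML string and base URI are made up):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsHrefExample {

    public static void main(String[] args) {
        String html = "<a href='/intl/en/about.html'>About</a>";

        // the second argument supplies the base URI for resolving relative links
        Document doc = Jsoup.parse(html, "http://google.com/");

        Element link = doc.select("a[href]").first();
        System.out.println(link.attr("href"));     // the raw, relative value
        System.out.println(link.attr("abs:href")); // resolved against the base URI
    }
}
```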
Conclusion
Now that we’ve got a feel for the basics, in the next instalment, we’ll move on to more advanced operations.