A common question that I come across in developer forums is how to follow hyperlinks within a Web document and download the linked files. Having gained some experience in this area, I can tell you that it’s really not that difficult, but a good tool does make it a lot easier. In my Web Page Scraping with Jsoup article, we saw how to use the Jsoup Java library to iterate over a web page’s links and print out their attributes. In today’s follow-up, we’ll learn how to select a specific hyperlink element based on some criteria in order to download a linked MP3.
The Test Page
Downloading MP3s can have negative repercussions if the material is copyrighted, so I am offering up a free download from my own site. It’s an instrumental piece that I recorded in 2009. The link’s text (“Download”) is not unique and the track title is nowhere near the link element, so I am using the track ID in the link’s href (“t=60”) to identify it:
The Source Code:
<table class="tracks">
  <tbody>
    <tr class="track odd">
      <td class="number">1.</td>
      <td class="title">UltraViolence - Full Song (192 bitrate MP3)</td>
      <td class="link stream"><a href="javascript:void(0);" onclick="javascript:playAudio(&quot;/albums/flash/playlist.xml?t=60&quot;);return false;">Play</a></td>
      <td class="link download"><a href="download?t=60">Download</a></td>
      <td class="link pdf"></td>
      <td class="link lyrics"></td>
      <td class="link info">
        <div id="xmjnmlhm" style="display:none;">
          <div class="overlay_heading">UltraViolence - Full Song (192 bitrate MP3)</div>
          <div class="overlay_subheading">Credits</div>
          Rob Gravelle - Guitars, Keys, Drum programming<br>
          Steve Mercer - Bass<br>
          Engineered, mixed, and mastered by Rob Gravelle<br>
          CopyRight 2009 Outsider Music Records All Rights Reserved
          <div class="overlay_subheading">Notes</div>
          The MP3 is a large file as the song is 8 minutes long!
        </div>
        <span class="link info"><a href="javascript:void(0)" onclick="javascript:oOWEC(&quot;xmjnmlhm&quot;,{width:400});return false;">Info</a></span>
      </td>
      <td class="link buy"></td>
    </tr>
  </tbody>
</table>
The JsoupDemoTest Class
To keep things simple, I suggest that you always create a stand-alone class first. It’s a lot easier to debug a Java application than a Web app running on a server. You can always reference your code later from the Servlet or other Web app controller.
Creating the Document Object
Before we can work with the page, we need to create a Jsoup Document object, which consists of traversable nodes such as Elements and TextNodes. The simplest way to convert the web page into a Jsoup Document is to use one of the static Jsoup.parse() methods. The overload that we are using this time has a different signature from the one in the last article: that one took a String of HTML, whereas this one takes a URL and a timeout in milliseconds and fetches the page itself. That makes it a convenient choice for simple applications like this one; if you need more control over the request, Jsoup also provides Jsoup.connect().
URL url = new URL(URL_TO_PARSE);
Document doc = parse(url, 30000); //30 seconds timeout
Selecting the Hyperlink Node
Jsoup supports a CSS/jQuery-like selector syntax for finding matching elements, which allows very powerful and detailed queries. The select() method is available on Document and Element objects as well as on Elements collections. It accepts a selector string and returns an Elements collection, which comes complete with a range of methods to extract and manipulate the results.
The selector that we are using matches the end of an attribute’s value using the syntax [attr$=value]:
Elements links = doc.select("a[href$=" + LINK + "]");
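To make the "ends with" idea concrete, here is a plain-Java sketch that mimics the test behind [href$=t=60] using String.endsWith(). It is only an illustration, not Jsoup's implementation, and Jsoup's own matching may differ in details such as case handling. The class name SuffixMatchSketch and the href "page?start=60" are made up for the example; only "download?t=60" and "javascript:void(0);" come from the test page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SuffixMatchSketch {

    // Returns the hrefs that end with the given suffix,
    // in the spirit of the [href$=suffix] selector.
    static List<String> endingWith(List<String> hrefs, String suffix) {
        List<String> matches = new ArrayList<>();
        for (String href : hrefs) {
            if (href.endsWith(suffix)) {
                matches.add(href);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
            "javascript:void(0);", // the Play link's href on the test page
            "download?t=60",       // the Download link's href on the test page
            "page?start=60");      // hypothetical href, not on the test page

        // A short suffix also accepts any parameter whose name ends in "t"
        System.out.println(endingWith(hrefs, "t=60"));  // [download?t=60, page?start=60]

        // Including the "?" narrows the match to the t parameter itself
        System.out.println(endingWith(hrefs, "?t=60")); // [download?t=60]
    }
}
```

On the test page the short suffix is enough because only one href ends with "t=60", but on a busier page a longer suffix would be the safer choice.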
Here is the code for the JsoupDemoTest class so far:
import static org.jsoup.Jsoup.parse;

import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.net.URL;

public class JsoupDemoTest {
    private final static String URL_TO_PARSE = "http://robgravelle.com/albums/";
    private final static String LINK = "t=60";

    public static void main(String[] args) throws IOException {
        //these two lines are only required if your Internet
        //connection uses a proxy server
        System.setProperty("http.proxyHost", "my.proxy.server");
        System.setProperty("http.proxyPort", "8080");

        //fetch the page and parse it into a Document
        URL url = new URL(URL_TO_PARSE);
        Document doc = parse(url, 30000); //30 seconds timeout

        //select links whose href attribute ends with "t=60"
        //should only be one but select returns an Elements collection
        Elements links = doc.select("a[href$=" + LINK + "]");
        int linksSize = links.size();
    }
}
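If you want to try the selector without fetching the live page, one option is to run it against a small HTML string with the String version of Jsoup.parse(). The following is a minimal sketch along those lines, assuming the Jsoup jar is on the classpath. The class name SelectorCheck, the helper selectDownloadLinks(), and the trimmed-down SAMPLE_HTML are my own inventions for illustration; the markup is a cut-down copy of the test page's table row.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectorCheck {

    // Cut-down copy of the track row from the test page (illustrative only)
    static final String SAMPLE_HTML =
        "<table class=\"tracks\"><tbody><tr class=\"track odd\">"
        + "<td class=\"link stream\"><a href=\"javascript:void(0);\">Play</a></td>"
        + "<td class=\"link download\"><a href=\"download?t=60\">Download</a></td>"
        + "</tr></tbody></table>";

    // Parses the HTML string and returns the links whose href ends with the suffix
    static Elements selectDownloadLinks(String html, String suffix) {
        Document doc = Jsoup.parse(html);
        return doc.select("a[href$=" + suffix + "]");
    }

    public static void main(String[] args) {
        Elements links = selectDownloadLinks(SAMPLE_HTML, "t=60");

        if (links.isEmpty()) {
            System.out.println("No matching link found");
        } else {
            // first() returns the first matched Element in the collection
            Element link = links.first();
            System.out.println(links.size() + " match: "
                + link.text() + " -> " + link.attr("href"));
        }
    }
}
```

Checking isEmpty() before calling first() guards against a page whose markup has changed since the selector was written.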
For more information on Jsoup selectors, check out this page from the Jsoup cookbook.
Conclusion
In the next instalment, we’ll cover how to extract the full absolute URL from the first link in the Elements Collection, as well as the best/most difficult part of this series: how to download and save the MP3 file.