Saturday, September 14, 2024

Fetch Hyperlinked Files using Jsoup

In the Download Linked Resources using Jsoup tutorial, we learned how to select a specific hyperlink element based on a unique attribute value in order to download a linked MP3. In today’s conclusion, we’ll cover how to extract the absolute URL from the first link in the Elements collection and save the MP3 file to our local device.

Retrieving the Download Link

Recall that in the last article we invoked the org.jsoup.nodes.Document object’s select() method to return a collection of matching Elements. Although we are using an identifier that we believe to be unique, it never hurts to check how many items were returned. If more than one link comes back, we can handle the situation. For the sake of simplicity, I elected to process the first element only.

The handy first() shortcut method gets the element that we’re after.

We need an absolute URL to work with because we will be making a new request to the server. To do that, we need to include the “abs:” prefix on the attribute name.

//select links whose href attribute ends with "t=60"
//should only be one but select returns an Elements collection
Elements links = doc.select("a[href$=" + LINK + "]"); //LINK is the unique identifier of "t=60"
int linksSize = links.size();
if (linksSize > 0) {
  if (linksSize > 1) {
      System.out.println("Warning: more than one link found.  Downloading first match.");
  }
  Element link = links.first();
  //this returns an absolute URL
  String  linkUrl = link.attr("abs:href");
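
To illustrate what the “abs:” prefix accomplishes, here is a minimal sketch that resolves an href against a page's base address using only java.net.URI. It is not Jsoup's implementation, just the same idea in plain Java; the class name AbsUrlSketch and the sample hrefs (download.php, /media/song.mp3) are made up for illustration. Within Jsoup itself, link.absUrl("href") is an equivalent way to ask for the absolute URL.

```java
import java.net.URI;

class AbsUrlSketch {
    // Resolves a possibly relative href against the page's base URI,
    // which is conceptually what the "abs:" prefix asks Jsoup to do.
    static String toAbsolute(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://robgravelle.com/albums/";
        // hypothetical hrefs, used only to show the three common cases
        System.out.println(toAbsolute(base, "download.php?t=60"));        // relative to the page
        System.out.println(toAbsolute(base, "/media/song.mp3"));          // relative to the site root
        System.out.println(toAbsolute(base, "http://example.com/a.mp3")); // already absolute
    }
}
```

A relative href is resolved against the page address, a root-relative href against the host, and an absolute href is returned unchanged.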

Downloading the Linked File

This step is significantly more difficult than one might expect. One problem, ingeniously solved by Jeremy Chung, is that Jsoup caps the size of the response body it will read by default, which truncates larger files. His solution was to set the maxBodySize property to zero, which removes the limit.

But that’s not the end of it. I noticed that servers sometimes check things like the referrer and userAgent to prevent third-party linking. Luckily, both those properties are easy to set. Another important setting is ignoreContentType. It is set to false by default so that an unrecognized content-type will cause an IOException to be thrown. This is to prevent Jsoup from attempting to parse binary content. In our case, we are simply using Jsoup to download the file, so we have to tell it to ignore the content type.

//Thanks to Jeremy Chung for maxBodySize solution
//http://jmchung.github.io/blog/2013/10/25/how-to-solve-jsoup-does-not-get-complete-html-document/
byte[] bytes = Jsoup.connect(linkUrl)
   .header("Accept-Encoding", "gzip, deflate")
   .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
   .referrer(URL_TO_PARSE)
   .ignoreContentType(true)
   .maxBodySize(0)
   .timeout(600000)
   .execute()
   .bodyAsBytes();

Validating the File Type

If we really want to be thorough we can check the file to make sure that it’s really an MP3. The file extension might be a good clue, but since many dynamic links such as this one don’t include the file name at all, much less an extension, we have to check the contents for indicators.

try {

    validateMP3File(bytes);
       
} catch (IOException e) {
    System.err.println("Could not read the file at '" + linkUrl + "'.");
}
catch (InvalidFileTypeException e) {
    System.err.println("'" + linkUrl + "' does not appear to point to an MP3 file.");
}

The InvalidFileTypeException is our own class and is defined as a private static nested class:

@SuppressWarnings("serial")
private static class InvalidFileTypeException extends Exception {}

The validateMP3File() Method

Many MP3 files begin with a block of bytes reserved for information about the content, called an ID3 tag, and that block starts with the characters “ID3”. The following code:

  1. connects an InputStream to the byte array
  2. reads up to the first 1,024 bytes (1 KB) of content into a second byte array
  3. converts that byte array into a String using the new String(byte[] bytes) constructor
  4. stores the first three characters in a variable
  5. compares it to the “ID3” marker
  6. throws a new InvalidFileTypeException if the marker is not present

public static void validateMP3File(byte[] song) throws IOException, InvalidFileTypeException {
    InputStream file = new ByteArrayInputStream(song);
    byte[] startOfFile = new byte[1024];
    file.read(startOfFile);
    String id3 = new String(startOfFile);
    String tag = id3.substring(0, 3);
    if  ( ! "ID3".equals(tag) ) {
        throw new InvalidFileTypeException();
    }
}
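
As an alternative to converting the bytes to a String, the marker can be compared byte by byte, which avoids any dependence on the platform's default character set. The following is a minimal sketch of that idea, not a replacement for the method above; the class name Id3CheckSketch and the method startsWithId3() are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

class Id3CheckSketch {
    // The three ASCII bytes expected at the very start of the data
    private static final byte[] ID3_MARKER = "ID3".getBytes(StandardCharsets.US_ASCII);

    // Returns true only if the data begins with the bytes 'I', 'D', '3'
    static boolean startsWithId3(byte[] data) {
        if (data == null || data.length < ID3_MARKER.length) {
            return false; // too short to contain the marker
        }
        for (int i = 0; i < ID3_MARKER.length; i++) {
            if (data[i] != ID3_MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] tagged = {'I', 'D', '3', 4, 0, 0}; // sample bytes that start with the marker
        byte[] other  = {'R', 'I', 'F', 'F'};     // sample bytes that do not
        System.out.println(startsWithId3(tagged));      // true
        System.out.println(startsWithId3(other));       // false
        System.out.println(startsWithId3(new byte[0])); // false
    }
}
```

Returning a boolean rather than throwing keeps the check easy to reuse; a caller could still throw InvalidFileTypeException when it returns false.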

Saving the File

All that remains now is to save the file. We can use the link text to name the file. The .mp3 extension may be added if necessary. A FileOutputStream writes the bytes to the new file.

try {
    validateMP3File(bytes);
                                       
    String savedFileName = link.text();
    //Strings are immutable, so the result of concat() must be assigned back
    if (!savedFileName.endsWith(".mp3")) savedFileName = savedFileName.concat(".mp3");
    FileOutputStream fos = new FileOutputStream(savedFileName);
    fos.write(bytes);
    fos.close();

    System.out.println("File has been downloaded.");
} catch (IOException e) {
//...    
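
Because link text can contain characters that are not allowed in file names on some systems, it may be worth cleaning the name before saving. Here is a minimal sketch of one way to do that, using java.nio.file.Files to write the bytes. The class SaveFileSketch, its method names, the "download" fallback name, and the choice of replacement character are all assumptions made for this example rather than part of the original demo.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

class SaveFileSketch {
    // Builds a file name from the link text: replaces characters that are
    // commonly illegal in file names and appends ".mp3" when it is missing.
    static String toMp3FileName(String linkText) {
        String name = linkText.trim().replaceAll("[\\\\/:*?\"<>|]", "_");
        if (name.isEmpty()) {
            name = "download"; // fallback for links without text (assumed default)
        }
        if (!name.toLowerCase(Locale.ROOT).endsWith(".mp3")) {
            name = name + ".mp3";
        }
        return name;
    }

    // Writes the bytes to a file in the given directory and returns its path.
    static Path save(Path directory, String linkText, byte[] bytes) throws IOException {
        Path target = directory.resolve(toMp3FileName(linkText));
        return Files.write(target, bytes);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("jsoup-demo");
        Path saved = save(dir, "My Song: Live", new byte[] {'I', 'D', '3'});
        System.out.println(saved.getFileName() + " (" + Files.size(saved) + " bytes)");
    }
}
```

Files.write() opens, writes, and closes the file in one call, so there is no stream left open if the write fails.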

Here is the full source for the JsoupDemoTest class:

package com.robgravelle.jsoupdemo;

import static org.jsoup.Jsoup.parse;

import java.io.ByteArrayInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDemoTest {
    private final static String URL_TO_PARSE  = "http://robgravelle.com/albums/";
    private final static String LINK = "t=60";
    @SuppressWarnings("serial")
    private static class InvalidFileTypeException extends Exception {}
   
    public static void main(String[] args) throws IOException {
        //these two lines are only required if your Internet
        //connection uses a proxy server
        //System.setProperty("http.proxyHost", "my.proxy.server");
        //System.setProperty("http.proxyPort", "8081");
        URL url = new URL(URL_TO_PARSE);
        Document doc = parse(url, 30000);
       
        Elements links = doc.select("a[href$=" + LINK + "]");
        int linksSize = links.size();
        if (linksSize > 0) {
            if (linksSize > 1) {
                System.out.println("Warning: more than one link found.  Downloading first match.");
            }
            Element link    = links.first();
            String  linkUrl = link.attr("abs:href");
            //Thanks to Jeremy Chung for maxBodySize solution
            //http://jmchung.github.io/blog/2013/10/25/how-to-solve-jsoup-does-not-get-complete-html-document/
            byte[] bytes = Jsoup.connect(linkUrl)
                .header("Accept-Encoding", "gzip, deflate")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
                .referrer(URL_TO_PARSE)
                .ignoreContentType(true)
                .maxBodySize(0)
                .timeout(600000)
                .execute()
                .bodyAsBytes();
           
                try {
                    validateMP3File(bytes);
                   
                    String savedFileName = link.text();
                    //Strings are immutable, so the result of concat() must be assigned back
                    if (!savedFileName.endsWith(".mp3")) savedFileName = savedFileName.concat(".mp3");
                    FileOutputStream fos = new FileOutputStream(savedFileName);
                    fos.write(bytes);
                    fos.close();
                   
                    System.out.println("File has been downloaded.");
                } catch (IOException e) {
                    System.err.println("Could not read the file at '" + linkUrl + "'.");
                }
                catch (InvalidFileTypeException e) {
                    System.err.println("'" + linkUrl + "' does not appear to point to an MP3 file.");
                }
        }
        else {
            System.out.println("Could not find the link ending with '" + LINK + "' in web page.");
        }
    }
   
    public static void validateMP3File(byte[] song) throws IOException, InvalidFileTypeException {
        InputStream file = new ByteArrayInputStream(song);
        byte[] startOfFile = new byte[1024];
        file.read(startOfFile);
        String id3 = new String(startOfFile);
        //compare only the first three characters to the marker
        String tag = id3.substring(0, 3);
        if  ( ! "ID3".equals(tag) ) {
            throw new InvalidFileTypeException();
        }
    }
   
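    //validateMP3File() is based on this method
    //note: the fixed offsets below follow the ID3v1 field layout (30-byte title,
    //artist, and album, then a 4-byte year), which normally sits in the last 128 bytes
    //of a file behind a "TAG" marker; after an "ID3" header at the start of the file
    //they are unlikely to yield meaningful values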
    public static void getMP3Metadata(byte[] song) {
        try {
            InputStream file = new ByteArrayInputStream(song);
            int size = (int)song.length;
            byte[] startOfFile = new byte[1024];
            file.read(startOfFile);
            String id3 = new String(startOfFile);
            String tag = id3.substring(0, 3);
            if  ("ID3".equals(tag)) {
                System.out.println("Title: " + id3.substring(3, 32));
                System.out.println("Artist: " + id3.substring(33, 62));
                System.out.println("Album: " + id3.substring(63, 91));
                System.out.println("Year: " + id3.substring(93, 97));
            } else
                System.out.println("does not contain" + " ID3 information.");
            file.close();
        } catch (Exception e) {
            System.out.println("Error - " + e.toString());
        }
    }
}


Conclusion

The Jsoup library lends itself to a wide range of page scraping and resource fetching tasks that start from a website’s hyperlinks. If you’ve come up with your own creative uses for it, by all means share. It might just get featured in an upcoming article!

Rob Gravelle
Rob Gravelle resides in Ottawa, Canada, and has been an IT guru for over 20 years. In that time, Rob has built systems for intelligence-related organizations such as Canada Border Services and various commercial businesses. In his spare time, Rob has become an accomplished music artist with several CDs and digital releases to his credit.
