As web scrapers like to say: “Every website has an API. Sometimes it’s just poorly documented and difficult to use.”

To say that web scraping is a useful skill is an understatement. Whether you’re satisfying a curiosity by writing a quick script in an afternoon or building the next Google, the ability to grab any online data, in any amount, in any format, while choosing how you want to store it and retrieve it, is a vital part of any good programmer’s toolkit.

By virtue of reading this article, I assume that you already have some idea of what web scraping entails, and perhaps have specific motivations for using Java to do it. Regardless, it is important to cover some basic definitions and principles.

Very rarely does one hear about web scraping in Java, a language often thought to be solely the domain of enterprise software and the interactive web apps of the '90s and early 2000s.

However, there are many reasons why Java is an often underrated language for web scraping:

  • Java’s excellent exception handling lets you write code that deals elegantly with the often-messy Web
  • Reusable data structures allow you to write once and use everywhere, with ease and safety
  • Java’s concurrency libraries let your code keep processing data while it waits for servers to return information (the slowest part of any scraper); a brief sketch follows this list
  • The Web is big and slow, but Java RMI allows you to spread your code across a distributed network of machines in order to collect and process data quickly
  • There is a variety of standard libraries for getting data from servers, as well as third-party libraries for parsing this data, and even for executing JavaScript (which is needed for scraping some websites)
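
As a brief illustration of the concurrency point above, here is a minimal sketch (the class name, URLs, and pool size are ours for illustration, not part of this article's projects) that uses an ExecutorService to download several pages at once, so the slow network waits overlap rather than happening one after another:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ConcurrentFetch {

        // Downloads a single page and returns its source; error handling is minimal
        static String fetch(String url) throws Exception {
            StringBuilder source = new StringBuilder();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()));
            String line;
            while ((line = in.readLine()) != null) {
                source.append(line);
            }
            in.close();
            return source.toString();
        }

        public static void main(String[] args) throws Exception {
            String[] urls = {
                "http://en.wikipedia.org/wiki/Java",
                "http://en.wikipedia.org/wiki/Web_scraping"
            };
            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<Future<String>> pages = new ArrayList<Future<String>>();
            for (final String url : urls) {
                // Each download runs on its own worker thread
                pages.add(pool.submit(new Callable<String>() {
                    public String call() throws Exception {
                        return fetch(url);
                    }
                }));
            }
            for (int i = 0; i < urls.length; i++) {
                // get() blocks until that particular download has finished
                System.out.println(urls[i] + ": " + pages.get(i).get().length() + " characters");
            }
            pool.shutdown();
        }
    }

Even with only two URLs the downloads proceed in parallel; with dozens of pages, the time saved grows accordingly.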

In this article, we will explore these and other benefits of Java in web scraping, and build several scrapers ourselves. Although it is possible, and even recommended, to skip sections you already have a good grasp of, keep in mind that some sections build on the code and concepts of others. When this happens, it will be noted at the beginning of the section.

How is this legal?

Web scraping has always had a “gray hat” reputation. While websites are generally meant to be viewed by actual humans sitting at a computer, web scrapers find ways to subvert that. Whereas APIs are designed to accept obviously computer-generated requests, web scrapers must find ways to imitate humans, often by modifying headers, forging POST requests, and using other techniques.

Web scraping often requires a great deal of problem solving and ingenuity to figure out how to get the data you want. There are few roadmaps or tried-and-true procedures to follow, and you must carefully tailor the code to each website, often straddling the line between what is intended and what is possible. Although this sort of hacking can be fun and challenging, you have to be careful to follow the rules.

Like many technology fields, the legal precedent for web scraping is scant. A good rule of thumb to keep yourself out of trouble is to always follow the terms of use and copyright documents on websites that you scrape (if any).

There are some cases in which the act of web crawling is itself in murky legal territory, regardless of how the data is used. Crawling is often prohibited in the terms of service of websites where the aggregated data is valuable (for example, a site that contains a directory of personal addresses in the United States), or where a commercial or rate-limited API is available. Twitter, for example, explicitly prohibits web scraping (at least of any actual tweet data) in its terms of service:

“crawling the Services is permissible if done in accordance with the provisions of the robots.txt file, however, scraping the Services without the prior consent of Twitter is expressly prohibited”

Unless explicitly prohibited by the terms of service, there is no fundamental difference between accessing a website (and its information) via a browser, and accessing it via an automated script. The robots.txt file alone has not been shown to be legally binding, although in many cases the terms of service can be.
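
Because robots.txt is the main signal a site gives to automated visitors, it is worth checking before you crawl. The following is a minimal, standard-library-only sketch (the class name and the simplistic line-by-line parsing are ours; a real crawler should use a proper robots.txt parser) that downloads Wikipedia's robots.txt and prints the paths the wildcard user agent is asked not to visit:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class RobotsCheck {
        public static void main(String[] args) throws Exception {
            URL robots = new URL("http://en.wikipedia.org/robots.txt");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(robots.openStream()));
            boolean appliesToEveryone = false;
            String line;
            while ((line = in.readLine()) != null) {
                String trimmed = line.trim();
                // Track whether we are inside a "User-agent: *" block
                if (trimmed.toLowerCase().startsWith("user-agent:")) {
                    appliesToEveryone = trimmed.endsWith("*");
                } else if (appliesToEveryone
                        && trimmed.toLowerCase().startsWith("disallow:")) {
                    System.out.println("Disallowed for all agents: "
                            + trimmed.substring("disallow:".length()).trim());
                }
            }
            in.close();
        }
    }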

Writing a simple scraper (Simple)

Wikipedia is not just a helpful resource for researching or looking up information; it is also a very interesting website to scrape. The site makes no effort to prevent scrapers from accessing it, and its well-marked-up HTML makes it very easy to find the information you’re looking for. In this project, we will scrape an article from Wikipedia and retrieve the first few lines of text from the body of the article.

Getting ready

It is recommended that you have some working knowledge of Java, and the ability to create and execute Java programs at this point.

As an example, we will use the article from the following Wikipedia link:

http://en.wikipedia.org/wiki/Java

Note that this article is about the Indonesian island of Java, not the programming language. Regardless, it seems an appropriate test subject.

We will be using the jsoup library to parse HTML data from websites and convert it into a manipulable object with traversable, indexed values (much like an array). In this exercise, we will show you how to download, install, and use Java libraries, and we’ll also cover some of the basics of the jsoup library in particular.

How to do it…

Now that we’re starting to get into writing scrapers, let’s create a new project to keep them all bundled together. Carry out the following steps for this task:

  1. Open Eclipse and create a new Java project called Scraper.
  2. Packages are handy for bundling collections of classes together within a single project (projects contain multiple packages, and packages contain multiple classes). You can create a new package by highlighting the Scraper project in Eclipse and going to File | New | Package.
  3. By convention, in order to prevent programmers from creating packages with the same name (and causing namespace problems), packages are named starting with the reverse of your domain name (for example, com.mydomain.mypackagename). For the rest of the article, we will begin all our packages with com.packtpub.JavaScraping appended with the package name. Let’s create a new package called com.packtpub.JavaScraping.SimpleScraper.
  4. Create a new class, WikiScraper, inside the src folder of the package.
  5. Download the jsoup core library, the first link, from the following URL: http://jsoup.org/download
  6. Place the .jar file you downloaded into the lib folder of the Scraper project (create this folder if it does not already exist).
  7. In Eclipse, right-click in the Package Explorer window and select Refresh. This will allow Eclipse to update the Package Explorer to the current state of the workspace folder. Eclipse should show your jsoup-1.7.2.jar file (this file may have a different name depending on the version you’re using) in the Package Explorer window.
  8. Right-click on the jsoup JAR file and select Build Path | Add to Build Path.
  9. In your WikiScraper class file, write the following code:

    package com.packtpub.JavaScraping.SimpleScraper;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import java.net.*;
    import java.io.*;

    public class WikiScraper {

        public static void main(String[] args) {
            scrapeTopic("/wiki/Java");
        }

        public static void scrapeTopic(String url) {
            // Download the raw HTML, parse it with jsoup, and print the
            // first paragraph of the article body
            String html = getUrl("http://en.wikipedia.org" + url);
            Document doc = Jsoup.parse(html);
            String contentText = doc.select("#mw-content-text > p").first().text();
            System.out.println(contentText);
        }

        public static String getUrl(String url) {
            URL urlObj = null;
            try {
                urlObj = new URL(url);
            } catch (MalformedURLException e) {
                System.out.println("The url was malformed!");
                return "";
            }
            URLConnection urlCon = null;
            BufferedReader in = null;
            String outputText = "";
            try {
                urlCon = urlObj.openConnection();
                in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
                String line = "";
                // Read the response one line at a time to build up the page source
                while ((line = in.readLine()) != null) {
                    outputText += line;
                }
                in.close();
            } catch (IOException e) {
                System.out.println("There was an error connecting to the URL");
                return "";
            }
            return outputText;
        }
    }

Assuming you’re connected to the internet, this should compile and run with no errors, and print the first paragraph of text from the article.

How it works…

Unlike our HelloWorld example, this code depends on several libraries, which we bring in with the import statements before the class declaration. We need a couple of jsoup classes, along with two standard Java packages, java.io and java.net, which are used to create the connection and retrieve information from the Web.

As always, our program starts out in the main method of the class. This method calls the scrapeTopic method, which will eventually print the data that we are looking for (the first paragraph of text in the Wikipedia article) to the screen. scrapeTopic requires another method, getUrl, in order to do this.

getUrl is a function that we will be using throughout the article. It takes in an arbitrary URL and returns the raw source code as a string. Essentially, it creates a Java URL object from the URL string, and calls the openConnection method on that URL object. The openConnection method returns a URLConnection object, which can be used to create a BufferedReader object.

BufferedReader objects are designed to read from potentially very long streams of text, stopping at a certain size limit or, very commonly, reading the stream one line at a time. Depending on the potential size of the pages you might be reading (or if you’re reading from an unknown source), it might be important to set a buffer size here; to keep this exercise simple, however, we will continue to read for as long as Java is able to. A sketch of one way to impose limits follows.
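
For instance, the following is a minimal sketch of an alternative helper (the name getUrlCapped and the limits are ours for illustration, and it is not used in this recipe): it passes an explicit buffer size to BufferedReader and stops once a maximum number of characters has been collected. It also accumulates the text in a StringBuilder, which is more efficient than string concatenation inside a loop. It assumes the same java.io and java.net imports as the WikiScraper class.

    // Illustrative variation on getUrl: explicit 8 KB buffer and a cap on
    // how many characters are read before the method gives up and returns
    public static String getUrlCapped(String url, int maxChars) {
        URL urlObj = null;
        try {
            urlObj = new URL(url);
        } catch (MalformedURLException e) {
            System.out.println("The url was malformed!");
            return "";
        }
        StringBuilder outputText = new StringBuilder();
        try {
            URLConnection urlCon = urlObj.openConnection();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(urlCon.getInputStream()), 8192);
            String line;
            // Stop either at the end of the stream or once maxChars is reached
            while ((line = in.readLine()) != null && outputText.length() < maxChars) {
                outputText.append(line);
            }
            in.close();
        } catch (IOException e) {
            System.out.println("There was an error connecting to the URL");
            return "";
        }
        return outputText.toString();
    }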

The while loop here retrieves the text from the BufferedReader object one line at a time and adds it to outputText, which is then returned.

After the getUrl method has returned the HTML string to scrapeTopic, jsoup is used. jsoup is a Java library that turns HTML strings (such as the string returned by our scraper) into more accessible objects. There are many ways to access and manipulate these objects, but the function you’ll likely find yourself using most often is the select function. The jsoup select function returns a collection of jsoup element objects (even if only one element matches the selector), which can be further manipulated, or turned into text and printed. A short example of iterating over such a collection appears below.
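
As a quick illustration of working with more than one matched element, the following snippet (not part of the recipe; it assumes the doc object from scrapeTopic and imports for org.jsoup.nodes.Element and org.jsoup.select.Elements) selects every link inside the article body and prints its text and target:

    // select returns an Elements collection, which is iterable like a list
    Elements links = doc.select("#mw-content-text a[href]");
    for (Element link : links) {
        System.out.println(link.text() + " -> " + link.attr("href"));
    }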

The crux of our script can be found in this line:

String contentText = doc.select("#mw-content-text > p").first().text();

This finds all the elements that match #mw-content-text > p (that is, all p elements that are direct children of the element with the ID mw-content-text), selects the first element of this set, and turns the resulting object into plain text (stripping out all tags, such as <a> tags, and any other formatting in the text).

The program ends by printing this line out to the console.

There’s more…

Jsoup is a powerful library that we will be working with in many applications throughout this article. For uses that are not covered in the article, I encourage you to read the complete documentation at http://jsoup.org/apidocs/.

What if you find yourself working on a project where jsoup’s capabilities aren’t quite meeting your needs? There are literally dozens of Java-based HTML parsers on the market.

Summary

In this article, we took the first steps toward web scraping with Java, and learned how to scrape an article from Wikipedia and retrieve the first few lines of text from the body of the article.
