As you attempted to gather data, you might have realized that the information was trapped in a proprietary spreadsheet format or spread across pages on the Web. Making matters worse, after spending hours manually reformatting the data, perhaps your computer slowed to a crawl after running out of memory. Perhaps R even crashed or froze your machine. Hopefully you were undeterred; it does get easier with time.

You might find the information particularly useful if you tend to work with data that are:

  • Stored in unstructured or proprietary formats such as web pages, web APIs, or spreadsheets
  • From a domain such as bioinformatics or social network analysis, which presents additional challenges
  • So extremely large that R cannot store the dataset in memory or machine learning takes a very long time to complete

You’re not alone if you suffer from any of these problems. Although there is no panacea—these issues are the bane of the data scientist as well as the reason for data skills to be in high demand—through the dedicated efforts of the R community, a number of R packages provide a head start toward solving the problem.

This article also provides a cookbook of such solutions. Even if you are an experienced R veteran, you may discover a package that simplifies your workflow, or perhaps one day you will author a package that makes work easier for everybody else!

Working with specialized data

Unlike the analyses in this article, real-world data are rarely packaged in a simple CSV form that can be downloaded from a website. Instead, significant effort is needed to prepare data for analysis. Data must be collected, merged, sorted, filtered, or reformatted to meet the requirements of the learning algorithm. This process is known informally as data munging. Munging has become even more important as the size of typical datasets has grown from megabytes to gigabytes and data are gathered from unrelated and messy sources, many of which are domain-specific. Several packages and resources for working with specialized or domain-specific data are listed as follows:

Getting data from the Web with the RCurl package

The RCurl package by Duncan Temple Lang provides an R interface to the curl (client for URLs) utility, a command-line tool for transferring data over networks. The curl utility is useful for web scraping, which refers to the practice of harvesting data from websites and transforming it into a structured form.

Documentation for the RCurl package can be found on the Web at http://www.omegahat.org/RCurl/.

After installing the RCurl package, downloading a page is as simple as typing:

> library(RCurl)
> webpage <- getURL("http://www.packtpub.com/")

This will save the full text of Packt Publishing's homepage (including all web markup) into the R character object named webpage. As shown in the following lines, this is not very useful as-is:

> str(webpage)
 chr "<!DOCTYPE html>\n<html ...

To turn this raw markup into something usable, the XML package, also by Duncan Temple Lang, provides R functions for parsing XML and HTML documents. More information on the XML package, including simple examples to get you started quickly, can be found at the project's website: http://www.omegahat.org/RSXML/.
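As a minimal sketch of how the two packages might be combined (the XPath expression is purely illustrative; the structure of any real page will differ):

> library(RCurl)
> library(XML)
> webpage <- getURL("http://www.packtpub.com/")
> # parse the raw markup into a document tree
> parsed <- htmlParse(webpage, asText = TRUE)
> # extract the text of every hyperlink on the page
> links <- xpathSApply(parsed, "//a", xmlValue)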

Reading and writing JSON with the rjson package

The rjson package by Alex Couture-Beil can be used to read and write files in the JavaScript Object Notation (JSON) format. JSON is a standard, plaintext format, most often used for data structures and objects on the Web. The format has become popular recently due to its utility in creating web applications, but despite the name, it is not limited to web browsers.

For details about the JSON format, go to http://www.json.org/.

The JSON format stores objects in plain text strings. After installing the rjson package, to convert from JSON to R:

> library(rjson)
> r_object <- fromJSON(json_string)

To convert from an R object to a JSON object:

> json_string <- toJSON(r_object)

Used together with the RCurl package described previously, these functions make it possible to write R programs that utilize JSON data directly from many online data stores.
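For example, a small R list (its contents are invented for illustration) can be round-tripped through JSON as follows:

> library(rjson)
> publisher <- list(name = "Packt Publishing", topics = c("R", "Hadoop"))
> # serialize the list to a JSON string, then parse it back into an R object
> json_string <- toJSON(publisher)
> r_object <- fromJSON(json_string)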

Reading and writing Microsoft Excel spreadsheets using xlsx

The xlsx package by Adrian A. Dragulescu offers functions to read and write to spreadsheets in the Excel 2007 (or earlier) format—a common task in many business environments. The package is based on the Apache POI Java API for working with Microsoft’s documents.

For more information on xlsx, including a quick start document, go to https://code.google.com/p/rexcel/.
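A minimal sketch of reading and writing a workbook might look like the following (the filenames, sheet index, and sheet name are placeholders):

> library(xlsx)
> # read the first worksheet of an existing workbook into a data frame
> sales <- read.xlsx("sales.xlsx", sheetIndex = 1)
> # write a data frame out to a new workbook
> write.xlsx(sales, "sales_copy.xlsx", sheetName = "Sheet1", row.names = FALSE)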

Working with bioinformatics data

Data analysis in the field of bioinformatics offers a number of challenges relative to other fields due to the unique nature of genetic data. The use of DNA and protein microarrays has resulted in datasets that are often much wider than they are long (that is, they have more features than examples). This creates problems when attempting to apply conventional visualizations, statistical tests, and machine learning methods to such data.

A CRAN task view for statistical genetics/bioinformatics is available at http://cran.r-project.org/web/views/Genetics.html.

The Bioconductor project (http://www.bioconductor.org/) of the Fred Hutchinson Cancer Research Center in Seattle, Washington, provides a centralized hub for methods of analyzing genomic data. Using R as its foundation, Bioconductor adds packages and documentation specific to the field of bioinformatics.

Bioconductor provides workflows for analyzing microarray data from common platforms, including Affymetrix, Illumina, Nimblegen, and Agilent. Additional functionality includes sequence annotation, multiple testing procedures, specialized visualizations, and many other functions.
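At the time of writing, Bioconductor packages are installed via the project's own installer script rather than install.packages(). A minimal sketch (limma, a widely used microarray analysis package, is chosen purely as an example):

> # load the Bioconductor installer, then install a package
> source("http://bioconductor.org/biocLite.R")
> biocLite("limma")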

Working with social network data and graph data

Social network data and graph data present many challenges. These data record connections, or links, between people or objects. With N people, an N by N matrix of links is possible, which creates tremendous complexity as the number of people grows. The network is then analyzed using statistical measures and visualizations to search for meaningful patterns of relationships.

The network package by Carter T. Butts, David Hunter, and Mark S. Handcock offers a specialized data structure for working with such networks. A closely related package, sna, allows analysis and visualization of the network objects.

For more information on network and sna, refer to the project website hosted by the University of Washington: http://www.statnet.org/.
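As a small sketch of how the two packages fit together (the three-person adjacency matrix is made up for illustration):

> library(network)
> # adjacency matrix for a tiny, fictional network of three people
> adjacency <- matrix(c(0, 1, 1,
+                       1, 0, 0,
+                       1, 0, 0), nrow = 3, byrow = TRUE)
> net <- network(adjacency, directed = FALSE)
> library(sna)
> # a simple centrality measure and a quick visualization of the links
> degree(net)
> gplot(net)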

Improving the performance of R

R has a reputation for being slow and memory inefficient, a reputation that is at least somewhat earned. These faults are largely unnoticed on a modern PC for datasets of many thousands of records, but datasets with a million records or more can push the limits of what is currently possible with consumer-grade hardware. The problem is worsened if the data have many features or if complex learning algorithms are being used.

CRAN has a high performance computing task view that lists packages pushing the boundaries on what is possible in R: http://cran.r-project.org/web/views/HighPerformanceComputing.html.

Packages that extend R past the capabilities of the base package are being developed rapidly. This work comes primarily on two fronts: some packages add the capability to manage extremely large datasets by making data operations faster or by allowing the size of data to exceed the amount of available system memory; others allow R to work faster, perhaps by spreading the work over additional computers or processors, by utilizing specialized computer hardware, or by providing machine learning optimized to Big Data problems. Some of these packages are listed as follows.

Managing very large datasets

Very large datasets can sometimes cause R to grind to a halt when the system runs out of memory to store the data. Even if the entire dataset can fit in memory, additional RAM is needed to read the data from disk, which necessitates a total memory size much larger than the dataset itself. Furthermore, very large datasets can take a long time to process simply due to the sheer volume of records; even a quick operation adds up when performed many millions of times.

Years ago, many would suggest performing data preparation of massive datasets outside R in another programming language, then using R to perform analyses on a smaller subset of data. However, this is no longer necessary, as several packages have been contributed to R to address these Big Data problems.

Making data frames faster with data.table

The data.table package by Dowle, Short, and Lianoglou provides an enhanced version of a data frame called a data table. The data.table objects are typically much faster than data frames for subsetting, joining, and grouping operations. Yet, because it is essentially an improved data frame, the resulting objects can still be used by any R function that accepts a data frame.

The data.table project is found on the Web at http://datatable.r-forge.r-project.org/.
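As a brief sketch of the syntax (the column names and values here are invented for illustration), a data table can be aggregated and grouped in a single bracketed expression:

> library(data.table)
> # build a small data table; fread() offers fast import from delimited files
> purchases <- data.table(customer = c("a", "a", "b", "b"),
+                         spend = c(10, 20, 15, 5))
> # mean spend per customer: computation and grouping in one step
> purchases[, mean(spend), by = customer]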

One limitation of data.table structures is that like data frames, they are limited by the available system memory. The next two sections discuss packages that overcome this shortcoming at the expense of breaking compatibility with many R functions.

Creating disk-based data frames with ff

The ff package by Daniel Adler, Christian Glaser, Oleg Nenadic, Jens Oehlschlagel, and Walter Zucchini provides an alternative to a data frame (ffdf) that allows datasets of over two billion rows to be created, even if this far exceeds the available system memory.

The ffdf structure has a physical component that stores the data on disk in a highly efficient form and a virtual component that acts like a typical R data frame but transparently points to the data stored in the physical component. You can imagine the ffdf object as a map that points to a location of data on a disk.

The ff project is on the Web at http://ff.r-forge.r-project.org/.
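A minimal sketch of creating a disk-based data frame from a large file might look like this (the CSV filename is a placeholder):

> library(ff)
> # read a large CSV file into an ffdf, storing the data on disk
> bigdata <- read.csv.ffdf(file = "bigdata.csv", header = TRUE)
> # the virtual component can be inspected like an ordinary data frame
> dim(bigdata)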

A downside of ffdf data structures is that they cannot be used natively by most R functions. Instead, the data must be processed in small chunks, and the results should be combined later on. The upside of chunking the data is that the task can be divided across several processors simultaneously using the parallel computing methods presented later in this article.

The ffbase package by Edwin de Jonge, Jan Wijffels, and Jan van der Laan addresses this issue somewhat by adding capabilities for basic statistical analyses using ff objects. This makes it possible to use ff objects directly for data exploration.

The ffbase project is hosted at http://github.com/edwindj/ffbase.
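For instance, with ffbase loaded, summary statistics can be computed directly on an ff column (the tiny in-memory example is only to keep the sketch self-contained; in practice the ffdf would be far larger than RAM):

> library(ff)
> library(ffbase)
> # a small data frame converted to an ffdf for illustration
> transactions <- as.ffdf(data.frame(amount = runif(100, 1, 500)))
> # the mean is computed chunk-by-chunk from the data stored on disk
> mean(transactions$amount)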

Using massive matrices with bigmemory

The bigmemory package by Michael J. Kane and John W. Emerson allows the creation of extremely large matrices that exceed the amount of available system memory. The matrices can be stored on disk or in shared memory, allowing them to be used by other processes on the same computer or across a network. This facilitates parallel computing methods, such as those covered later in this article.

Additional documentation on the bigmemory package can be found at http://www.bigmemory.org/.

Because bigmemory matrices are intentionally unlike data frames, they cannot be used directly with most of the machine learning methods covered in this book. They also can only be used with numeric data. That said, since they are similar to a typical R matrix, it is easy to create smaller samples or chunks that can be converted to standard R data structures.

The authors also provide bigalgebra, biganalytics, and bigtabulate packages, which allow simple analyses to be performed on the matrices. Of particular note is the bigkmeans() function in the biganalytics package, which performs k-means clustering.
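A rough sketch of both steps, using placeholder dimensions and filenames, might look like the following:

> library(bigmemory)
> library(biganalytics)
> # create a file-backed matrix that can grow beyond available RAM
> x <- big.matrix(nrow = 1e6, ncol = 10, type = "double",
+                 backingfile = "x.bin", descriptorfile = "x.desc")
> # (after the matrix has been filled with data)
> # cluster the rows into three groups with the big-data k-means
> clusters <- bigkmeans(x, centers = 3)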

Learning faster with parallel computing

In the early days of computing, programs were entirely serial, which limited them to performing a single task at a time. The next instruction could not be performed until the previous instruction was complete. However, many tasks can be completed more efficiently by allowing work to be performed simultaneously.

This need was addressed by the development of parallel computing methods, which use a set of two or more processors or computers to solve a larger problem. Many modern computers are designed for parallel computing. Even in the case that they have a single processor, they often have two or more cores which are capable of working in parallel. This allows tasks to be accomplished independently from one another.

Networks of multiple computers called clusters can also be used for parallel computing. A large cluster may include a variety of hardware and be separated over large distances. In this case, the cluster is known as a grid. Taken to an extreme, a cluster or grid of hundreds or thousands of computers running commodity hardware could be a very powerful system.

The catch, however, is that not every problem can be parallelized; certain problems are more conducive to parallel execution than others. You might expect that adding 100 processors would result in 100 times the work being accomplished in the same amount of time (that is, the execution time is 1/100), but this is typically not the case. The reason is that it takes effort to manage the workers; the work first must be divided into non-overlapping tasks and second, each of the workers’ results must be combined into one final answer.

So-called embarrassingly parallel problems are the ideal. These tasks are easy to reduce into non-overlapping blocks of work, and the results are easy to recombine. An example of an embarrassingly parallel machine learning task would be 10-fold cross-validation; once the samples are decided, each of the 10 evaluations is independent, meaning that its result does not affect the others. As you will soon see, this task can be sped up quite dramatically using parallel computing.

Measuring execution time

Efforts to speed up R will be wasted if it is not possible to systematically measure how much time was saved. Although you could sit and observe a clock, an easier solution is to wrap the offending code in a system.time() function.

For example, on the author’s laptop, the system.time() function notes that it takes about 0.13 seconds to generate a million random numbers:

> system.time(rnorm(1000000))
   user  system elapsed
   0.13    0.00    0.13

The same function can be used to evaluate the performance improvement obtained with the methods just described, or with any R function.

Working in parallel with foreach

The foreach package by Steve Weston of Revolution Analytics provides perhaps the easiest way to get started with parallel computing, particularly if you are running R on the Windows operating system, as some of the other packages are platform-specific.

The core of the package is a new foreach looping construct. If you have worked with other programming languages, this may be familiar. Essentially, it allows looping over a number of items in a set, without explicitly counting the number of items; in other words, for each item in the set, do something.

In addition to the foreach package, Revolution Analytics has developed high-performance, enterprise-ready R builds. Free versions are available for trial and academic use. For more information, see their website at http://www.revolutionanalytics.com/.

If you’re thinking that R already provides a set of apply functions to loop over sets of items (for example, apply(), lapply(), sapply(), and so on), you are correct. However, the foreach loop has an additional benefit: iterations of the loop can be completed in parallel using a very simple syntax.

The sister package doParallel provides a parallel backend for foreach that utilizes the parallel package included with R (Version 2.14.0 and later). The parallel package includes components of the multicore and snow packages described in the following sections.
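A minimal sketch: register a parallel backend, then use %dopar% so that the loop's iterations are distributed across the workers (the core count and the work done in each iteration are arbitrary):

> library(doParallel)
> # register a backend with four worker processes
> registerDoParallel(cores = 4)
> library(foreach)
> # run the iterations in parallel and combine the results into a vector
> results <- foreach(i = 1:8, .combine = c) %dopar% sqrt(i)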

Using a multitasking operating system with multicore

The multicore package by Simon Urbanek allows parallel processing on single machines that have multiple processors or processor cores. Because it utilizes the multitasking capabilities of the operating system, it is not supported natively on Windows systems. An easy way to get started with the multicore package is the mclapply() function, which is a parallelized version of lapply().

The multicore project is hosted at http://www.rforge.net/multicore/.
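For example, a parallel version of an lapply() call might look like the following sketch (the worker function and core count are arbitrary):

> library(multicore)
> # apply the function to each element, spreading the work across two cores
> results <- mclapply(1:4, function(i) i^2, mc.cores = 2)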

Networking multiple workstations with snow and snowfall

The snow package (Simple Network of Workstations) by Luke Tierney, A. J. Rossini, Na Li, and H. Sevcikova allows parallel computing on multicore or multiprocessor machines as well as on a network of multiple machines. The snowfall package by Jochen Knaus provides an easier-to-use interface for snow.

For more information on snow and snowfall, including a detailed FAQ and information on how to configure parallel computing over a network, see http://www.imbi.uni-freiburg.de/parallel/.
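A small sketch using snowfall on a single machine (the number of CPUs is arbitrary; the same functions can also be configured to use a networked cluster):

> library(snowfall)
> # start a local cluster with four workers
> sfInit(parallel = TRUE, cpus = 4)
> # apply a function across the cluster, then shut the workers down
> results <- sfLapply(1:8, function(i) i^2)
> sfStop()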

Parallel cloud computing with MapReduce and Hadoop

The MapReduce programming model was developed at Google as a way to process their data on a large cluster of networked computers. MapReduce defined parallel programming as a two-step process:

  • A map step, in which a problem is divided into smaller tasks that are distributed across the computers in the cluster
  • A reduce step, in which the results of the small chunks of work are collected and synthesized into a final solution to the original problem

A popular open source alternative to the proprietary MapReduce framework is Apache Hadoop. The Hadoop software combines an implementation of MapReduce with a distributed filesystem capable of storing large amounts of data across a cluster of computers.
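As a purely local, conceptual illustration of the two steps (no Hadoop or cluster involved), base R's Map() and Reduce() functions can mimic the pattern of dividing the work and then combining the partial results:

> # map step: each "worker" computes a partial sum over its own chunk of the data
> chunks <- split(1:100, rep(1:4, each = 25))
> partial <- Map(sum, chunks)
> # reduce step: combine the partial results into the final answer
> total <- Reduce(`+`, partial)   # 5050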


Several R projects that provide an R interface to Hadoop are in development. One such project is RHIPE by Saptarshi Guha, which attempts to bring the divide and recombine philosophy into R by managing the communication between R and Hadoop.

The RHIPE package is not yet available at CRAN, but it can be built from the source available on the Web at http://www.datadr.org.

The RHadoop project by Revolution Analytics provides an R interface to Hadoop. The project provides a package, rmr, intended to be an easy way for R developers to write MapReduce programs. Additional RHadoop packages provide R functions for accessing Hadoop’s distributed data stores.

At the time of publication, development of RHadoop is progressing very rapidly. For more information about the project, see https://github.com/RevolutionAnalytics/RHadoop/wiki.

GPU computing

An alternative to parallel processing uses a computer’s graphics processing unit (GPU) to increase the speed of mathematical calculations. A GPU is a specialized processor that is optimized for rapidly displaying images on a computer screen. Because a computer often needs to display complex 3D graphics (particularly for video games), many GPUs use hardware designed for parallel processing and extremely efficient matrix and vector calculations. A side benefit is that they can be used for efficiently solving certain types of mathematical problems. Where a computer processor may have on the order of 16 cores, a GPU may have thousands.

The downside of GPU computing is that it requires specific hardware that is not included with many computers. In most cases, a GPU from the manufacturer Nvidia is required, as they provide a proprietary framework called CUDA (Compute Unified Device Architecture) that makes the GPU programmable using common languages such as C++.

For more information on Nvidia’s role in GPU computing, go to http://www.nvidia.com/object/what-is-gpu-computing.html.

The gputools package by Josh Buckner, Mark Seligman, and Justin Wilson implements several R functions, such as matrix operations, clustering, and regression modeling, using the Nvidia CUDA toolkit. The package requires a GPU with CUDA compute capability 1.3 or higher as well as the installation of the Nvidia CUDA toolkit.
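Assuming a working CUDA installation, a GPU-accelerated matrix multiplication might look like the following sketch (gpuMatMult() is one of the package's GPU-backed counterparts to standard matrix operations):

> library(gputools)
> a <- matrix(rnorm(1000 * 1000), nrow = 1000)
> b <- matrix(rnorm(1000 * 1000), nrow = 1000)
> # the multiplication is carried out on the GPU rather than the CPU
> ab <- gpuMatMult(a, b)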

Deploying optimized learning algorithms

Some of the machine learning algorithms covered in this book are able to work on extremely large datasets with relatively minor modifications. For instance, it would be fairly straightforward to implement naive Bayes or the Apriori algorithm using one of the Big Data packages described previously. Some types of models, such as ensembles, lend themselves well to parallelization, since the work of each model can be distributed across processors or computers in a cluster. On the other hand, some algorithms require larger changes to the data or algorithm, or need to be rethought altogether, before they can be used with massive datasets.

Building bigger regression models with biglm

The biglm package by Thomas Lumley provides functions for training regression models on datasets that may be too large to fit into memory. It works by an iterative process in which the model is updated little-by-little using small chunks of data. The results will be nearly identical to what would have been obtained running the conventional lm() function on the entire dataset.

The biglm() function allows use of a SQL database in place of a data frame. The model can also be trained with chunks obtained from data objects created by the ff package described previously.
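A rough sketch of the chunked workflow (the formula, column names, and chunk objects are placeholders for pieces of a dataset too large to load at once):

> library(biglm)
> # fit the model on the first chunk of data
> fit <- biglm(sales ~ price + advertising, data = chunk1)
> # update the same model with each additional chunk as it is read
> fit <- update(fit, chunk2)
> summary(fit)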

Growing bigger and faster random forests with bigrf

The bigrf package by Aloysius Lim implements the training of random forests for classification and regression on datasets that are too large to fit into memory, using the bigmemory objects described earlier in this article. The package also allows faster parallel processing using the foreach package described previously. Trees can be grown in parallel (on a single computer or across multiple computers), as can forests, and additional trees can be added to a forest at any time or merged with other forests.

For more information, including examples and Windows installation instructions, see the package wiki hosted at GitHub: https://github.com/aloysius-lim/bigrf.

Training and evaluating models in parallel with caret

The caret package by Max Kuhn will transparently utilize a parallel backend if one has been registered with R (for instance, using the foreach package described previously).

Many of the tasks involved in training and evaluating models, such as creating random samples and repeatedly testing predictions for 10-fold cross-validation, are embarrassingly parallel. This makes caret a particularly good candidate for parallelization.

Configuration instructions and a case study of the performance improvements for enabling parallel processing in caret are available at the project’s website: http://caret.r-forge.r-project.org/parallel.html.
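In practice, enabling parallel training is often just a matter of registering a backend before calling train(); caret then spreads the resampling work across the registered workers. A minimal sketch (the model type and dataset are only for illustration):

> library(doParallel)
> registerDoParallel(cores = 4)
> library(caret)
> # the cross-validation folds are distributed across the registered workers
> model <- train(Species ~ ., data = iris, method = "rf",
+                trControl = trainControl(method = "cv", number = 10))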

Summary

It is certainly an exciting time to be studying machine learning. Ongoing work on the relatively uncharted frontiers of parallel and distributed computing offers great potential for tapping the knowledge found in the deluge of Big Data. And the burgeoning data science community is facilitated by the free and open source R programming language, which provides a very low barrier for entry – you simply need to be willing to learn.

The topics you have learned provide the foundation for understanding more advanced machine learning methods. It is now your responsibility to keep learning and adding tools to your arsenal. Along the way, be sure to keep in mind the No Free Lunch theorem: no learning algorithm can rule them all. There will always be a human element to machine learning, adding subject-specific knowledge and the ability to match the appropriate algorithm to the task at hand.

In the coming years, it will be interesting to see how the human side changes as the line between machine learning and human learning is blurred. Services such as Amazon’s Mechanical Turk provide crowd-sourced intelligence, offering a cluster of human minds ready to perform simple tasks at a moment’s notice. Perhaps one day, just as we have used computers to perform tasks that human beings cannot do easily, computers will employ human beings to do the reverse; food for thought.
