
In this article by Matthias Templ, author of the book Simulation for Data Science with R, we will cover:

  • What is meant by data science
  • A short overview of what R is
  • The essential tools for a data scientist in R


Data science

Looking at the job market, there is no doubt that industry needs experts in data science. But what is data science, and how does it differ from statistics or computational statistics?

Statistics is computing with data. In computational statistics, methods and corresponding software are developed in a highly data-dependent manner using modern computational tools. Computational statistics has a huge intersection with data science. Data science is the applied part of computational statistics plus data management, including the storage of data, databases, and data security issues. The term data science is used when the work is driven by data, with a weaker component on method and algorithm development than in computational statistics, but with many pure computer science topics related to storing, retrieving, and handling data sets. It is the marriage of computer science and computational statistics. As an example to show the differences, take the broad area of visualization: a data scientist is also interested in purely process-related visualizations (airflows in an engine, for example), while computational statistics only touches on methods for the visualization of data and statistical results.

Data science is the management of the entire modelling process, from data collection to automated reporting and the presentation of results. This process includes storing and managing data, data pre-processing (editing, imputation), data analysis, and modelling. Data scientists use statistics and data-oriented computer science tools to solve the problems they face.

R

R has become an essential tool for statistics and data science (Godfrey 2013). As soon as data scientists have to analyze data, R might be the first choice. The open source programming language and software environment R is currently one of the most widely used and most popular software tools for statistics and data analysis. It is available at the Comprehensive R Archive Network (CRAN) as free software under the terms of the Free Software Foundation's GNU General Public License (GPL), in source code and binary form.

The R Core Team defines R as an environment. R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. Base R includes:

  • A suite of operators for calculations on arrays, mostly written in C and integrated in R
  • A comprehensive, coherent, and integrated collection of methods for data analysis
  • Graphical facilities for data analysis and display, either on-screen or in hard copy
  • A well-developed, simple, and effective programming language that includes conditional statements, loops, user-defined recursive functions, and input and output facilities
  • A flexible object-oriented system facilitating code reuse
  • High performance computing with interfaces to compiled code and facilities for parallel and grid computing
  • The ability to be extended with (add-on) packages
  • An environment that allows communication with many other software tools

Each R package provides structured standard documentation, including code application examples. Further documents, so-called vignettes, potentially show more applications of the packages and illustrate dependencies between the implemented functions and methods.
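For example, vignettes can be listed and opened from within R; the package and vignette names below are only illustrative:

browseVignettes()                            # list vignettes of all installed packages
vignette("introduction", package = "dplyr")  # open a specific vignette (names hypothetical)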

R is not only used extensively in the academic world; companies in the area of social media (Google, Facebook, Twitter, and Mozilla Corporation), the banking world (Bank of America, ANZ Bank, Simple), food and pharmaceutical areas (FDA, Merck, and Pfizer), finance (Lloyd's of London and Thomas Cook), technology companies (Microsoft), car construction and logistics companies (Ford, John Deere, and Uber), newspapers (The New York Times and New Scientist), and companies in many other areas use R in a professional context (see also Gentlemen 2009 and Tippmann 2015). International and national organizations nowadays widely use R in their statistical offices (Todorov and Templ 2012; Templ and Todorov 2016).

R can be extended with add-on packages, and some of those extensions are especially useful for data scientists as discussed in the following section.

Tools for data scientists in R

Data scientists typically like:

  • The flexibility in reading and writing data, including connections to databases
  • To have easy-to-use, flexible, and powerful data manipulation features available
  • To work with modern statistical methodology
  • To use high-performance computing tools including interfaces to foreign languages and parallel computing
  • Versatile presentation capabilities for generating tables and graphics, which can readily be used in text processing systems, such as LaTeX or Microsoft Word
  • To create dynamic reports
  • To build web-based applications
  • An economical solution

The tools presented in the following sections relate to these topics and help data scientists in their daily work.

Use a smart environment for R

Would you prefer to have a single environment that includes modern tools for scientific computing, programming, the management of data and files, versioning, and output generation; that supports a project philosophy, code completion, syntax highlighting, markup languages, and interfaces to other software; and that provides automated connections to servers?

Currently, two software products support this concept. The first is Eclipse with the extension StatET, or the modified Eclipse IDE from Open Analytics called Architect. The second is a very popular IDE for R called RStudio, which also includes the named features and additionally integrates the package shiny (RStudio Inc. 2014) for web-based development and rmarkdown (Allaire et al. 2015). It provides a modern scientific computing environment, well designed and easy to use, and, most importantly, distributed under the GPL license.

Use of R as a mediator

Data exchange between statistical systems, database systems, or output formats is often required. In this respect, R offers very flexible import and export interfaces, either through its base installation or, mostly, through add-on packages available from CRAN or GitHub. For example, the package xml2 (Wickham 2015a) allows reading XML files. For importing delimited files, fixed-width files, and web log files, the package readr (Wickham and Francois 2015a) and data.table (Dowle et al. 2015) (function fread) are worth mentioning; both are supposed to be faster than the corresponding functions in base R. The package XLConnect (Mirai Solutions GmbH 2015) can be used to read and write Microsoft Excel files, including formulas, graphics, and so on. The readxl package (Wickham 2015b) is faster for data import but does not provide export features. The foreign package (R Core Team 2015) and a newer, promising package called haven (Wickham and Miller 2015) allow reading file formats from various commercial statistical software.
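As a brief sketch of these interfaces, the following calls read a delimited file, an Excel sheet, and a SAS file; all file names here are placeholders, not files shipped with the packages:

library("readr")
d1 <- read_csv("data.csv")        # delimited files
library("data.table")
d2 <- fread("data.csv")           # fast alternative to read.csv
library("readxl")
d3 <- read_excel("data.xlsx")     # Excel import
library("haven")
d4 <- read_sas("data.sas7bdat")   # SAS data sets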

The connection to all major database systems is easily established with specialized packages. Note that the RODBC package (Ripley and Lapsley 2015) is slow but general, while other specialized packages exist for particular databases.
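A minimal sketch of such a connection with RODBC follows; the data source name and table are hypothetical and must exist on your system:

library("RODBC")
con <- odbcConnect("myDSN")                  # connect to a registered ODBC data source
d <- sqlQuery(con, "SELECT * FROM mytable")  # run an SQL query, returns a data.frame
odbcClose(con)                               # close the connection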

Efficient data manipulation as the daily job

Data manipulation, in general but especially with large data, is best done with the dplyr package (Wickham and Francois 2015b) or the data.table package (Dowle et al. 2015). The computational speed of both packages is much higher than that of the data manipulation features of base R, and data.table is slightly faster than dplyr thanks to keys and fast binary-search-based methods for performance improvements.

In the author's view, the syntax of dplyr is much easier for beginners to learn than the base R data manipulation features, and dplyr code can be written as data pipelines using the pipe operator provided internally by the magrittr package (Bache and Wickham 2014).

Let's take an example to see the logical concept. We want to compute a new variable, ES2, as the square of EngineSize from the data set Cars93. For each group of the variable Type, we want to compute the minimum of the new variable. In addition, the results should be sorted in descending order:

data(Cars93, package = "MASS")
library("dplyr")
Cars93 %>%
  mutate(ES2 = EngineSize^2) %>%
  group_by(Type) %>%
  summarize(min.ES2 = min(ES2)) %>%
  arrange(desc(min.ES2))
## Source: local data frame [6 x 2]
##
##      Type min.ES2
## 1   Large   10.89
## 2     Van    5.76
## 3 Compact    4.00
## 4 Midsize    4.00
## 5  Sporty    1.69
## 6   Small    1.00

The code is largely self-explanatory, while the same data manipulation in base R or data.table requires more expertise in writing syntax.
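For comparison, here is a sketch of the same analysis written in data.table syntax; it is compact, but arguably needs more familiarity with the bracket notation:

library("data.table")
dt <- as.data.table(Cars93)
## add ES2 by reference, aggregate by Type, then sort in descending order
dt[, ES2 := EngineSize^2][, .(min.ES2 = min(ES2)), by = Type][order(-min.ES2)]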

In the case of large data files that exceed available RAM, interfaces to (relational) database management systems are available; see the CRAN task view on high-performance computing, which also includes information about parallel computing.

Related to data manipulation, the excellent packages stringr and stringi for string operations and lubridate for date-time handling should also be mentioned.
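To give a flavor of these packages, here is a tiny sketch; the input values are made up for illustration:

library("stringr")
str_detect(c("Compact", "Van"), "an")   # pattern matching: FALSE TRUE
library("lubridate")
d <- ymd("2015-03-01")                  # parse a date from a string
d + months(2)                           # date arithmetic: "2015-05-01"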

The requirement of efficient data preprocessing

A data scientist typically spends a large amount of time not only on data management issues but also on fixing data quality problems. It is beyond the scope of this book to mention all the tools for each data preprocessing topic. As an example, we concentrate on one particular topic: the handling of missing values.

The VIM package (Templ, Alfons, and Filzmoser 2011; Kowarik and Templ 2016) can be used for the visual inspection and imputation of data. It is possible to visualize missing values using suitable plot methods and to analyze the structure of missing values in microdata using univariate, bivariate, multiple, and multivariate plots. Information on missing values in specified variables is highlighted in selected variables. VIM can also evaluate imputations visually. Moreover, the VIMGUI package (Schopfhauser et al. 2014) provides a point-and-click graphical user interface (GUI).

One plot for missing values, a parallel coordinate plot, is shown in the following graph. It highlights the values of certain chemical elements. Marked in red are those observations that have a missing value in the chemical element Bi. With such plots, it is easy to detect missing-at-random situations as well as any structure in the missing pattern. Note that this data is compositional and is thus transformed using a log-ratio transformation from the package robCompositions (Templ, Hron, and Filzmoser 2011):

library("VIM")

data(chorizonDL, package = "VIM")

## for missing values

x <- chorizonDL[,c(15,101:110)]

library("robCompositions")

x <- cenLR(x)$x.clr

parcoordMiss(x,

    plotvars=2:11, interactive = FALSE)

legend("top", col = c("skyblue", "red"), lwd = c(1,1),

    legend = c("observed in Bi", "missing in Bi"))

To impute missing values, not only k-nearest neighbor and hot-deck methods are included, but also robust statistical methods implemented in an EM algorithm, for example in the function irmi. The implemented methods can deal with a mixture of continuous, semi-continuous, binary, categorical, and count variables:

any(is.na(x))
## [1] TRUE
ximputed <- irmi(x)
## Time difference of 0.01330566 secs
any(is.na(ximputed))
## [1] FALSE

Visualization as a must

While in former times results were mostly presented in tables and data was analyzed by inspecting its values on screen, nowadays the visualization of data and results has become very important. Data scientists often rely heavily on visualizations, both to analyze data and to report and present results. Not making use of visualizations is simply no longer an option.

R features not only its traditional graphics system but also an implementation of the grammar of graphics (Wilkinson 2005) in the form of the R package ggplot2 (Wickham 2009).

Why should a data scientist make use of ggplot2? Because it is a very flexible, customizable, consistent, and systematic approach to generating graphics. It allows users to define their own themes (for example, corporate designs in companies) and supports them with legends and an optimal plot layout.

In ggplot2, the parts of a plot are defined independently. We do not go into details and refer to Wickham (2009), but here is a simple example to show the user-friendliness of the implementation:

library("ggplot2")

ggplot(Cars93, aes(x = Horsepower, y = MPG.city)) + geom_point() + facet_wrap(~Cylinders)

Here, we mapped Horsepower to the x variable and MPG.city to the y variable. We used Cylinders for faceting and geom_point to tell ggplot2 to produce scatterplots.
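To illustrate the theming mentioned earlier, here is a minimal sketch of a custom theme applied to the same plot; the specific settings and title are only an illustration:

mytheme <- theme_minimal() +
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold"))
ggplot(Cars93, aes(x = Horsepower, y = MPG.city)) +
  geom_point() +
  ggtitle("City fuel consumption versus horsepower") +
  mytheme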

Reporting and web applications

Every analysis and report should be reproducible, especially when a data scientist does the job. Everything computed in the past should be recomputable at any time thereafter.

Additionally, a task for a data scientist is to organize and manage text, code, data, and graphics.

The use of dynamic reporting tools raises the quality of outcomes and reduces the workload.

In R, the knitr package provides functionality for creating reproducible reports. It links code and text elements: the code is executed and the results are embedded in the text. Different output formats are possible, such as PDF, HTML, or Word. The structuring can be done most simply using rmarkdown (Allaire et al. 2015). Markdown is a markup language with many features, including headings of different sizes, text formatting, lists, links, HTML, JavaScript, LaTeX equations, tables, and citations.

The aim is to generate documents from plain text. Corporate designs and styles can be managed through CSS stylesheets. For data scientists, it is highly recommended to use these tools in their daily work.
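As a minimal sketch, a complete R Markdown document might look as follows (file contents only; the title and code chunk are illustrative). Rendering it with rmarkdown::render("report.Rmd") executes the chunk and produces an HTML file:

---
title: "A minimal report"
output: html_document
---

## Results

The following code chunk is executed when the document is rendered:

```{r}
data(Cars93, package = "MASS")
summary(Cars93$Horsepower)
```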

We already mentioned the automated generation of HTML pages from plain text with rmarkdown. The shiny package (RStudio Inc. 2014) allows building web-based applications. A website generated with shiny changes instantly as users modify inputs, and you can stay within the R environment to build shiny user interfaces. Interactivity can be integrated using JavaScript, and there is built-in support for animation and sliders.

The following is a very simple example that includes a slider and presents a scatterplot in which outliers are highlighted. We do not go into detail on the code; it should only demonstrate that making a web application with shiny is just as simple:

library("shiny")

library("robustbase")

## Define server code

server <- function(input, output) {

  output$scatterplot <- renderPlot({

    x <- c(rnorm(input$obs-10), rnorm(10, 5)); y <- x + rnorm(input$obs)

    df <- data.frame("x" = x,

"y" = y)

    df$out <- ifelse(covMcd(df)$mah > qchisq(0.975, 1), "outlier", "non-outlier")

    ggplot(df, aes(x=x, y=y, colour=out)) + geom_point()

  })

}

 

## Define UI

ui <- fluidPage(

  sidebarLayout(

    sidebarPanel(

      sliderInput("obs", "No. of obs.", min = 10, max = 500, value = 100, step = 10)

    ),

    mainPanel(plotOutput("scatterplot"))

  )

)

 

## Shiny app object

shinyApp(ui = ui, server = server)

Building R packages

First, RStudio and the devtools package (Wickham and Chang 2016) make life easy when building packages. RStudio has many facilities for package building, and the integrated devtools package includes features for checking, building, and documenting a package efficiently, including roxygen2 (Wickham, Danenberg, and Eugster) for the automated documentation of packages.

When the code of a package is updated, load_all("pathToPackage") simulates a restart of R, a fresh installation of the package, and the loading of the newly built package. Note that many other functions are available for testing, documenting, and checking.
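A sketch of a typical devtools workflow follows; "pathToPackage" is a placeholder for your package's source directory:

library("devtools")
load_all("pathToPackage")   # simulate re-install and reload after code changes
document("pathToPackage")   # regenerate roxygen2 documentation
check("pathToPackage")      # run R CMD check on the package
build("pathToPackage")      # build a source package (.tar.gz)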

Second, build a package whenever you have written more than two functions and whenever you deal with more than one data set. If you use it only for yourself, you may be lazy about documenting the functions to save time.

Packages make it easy to share code, to load all functions and data with one line of code, to have the documentation integrated, and to support consistency checks and additional integrated unit tests.

The advice for beginners is to read the manual Writing R Extensions and to use all the features provided by RStudio and devtools.

Summary

In this article, we discussed essential tools for data scientists in R. These cover methods for data pre-processing and data manipulation, as well as tools for reporting, reproducible work, visualization, R packaging, and writing web applications.

A data scientist should learn to use the presented tools and deepen their knowledge of the proposed methods and software tools. Having learned these lessons, a data scientist is well prepared to face the challenges of data analysis, data analytics, data science, and data problems in practice.

References

  • Allaire, J.J., J. Cheng, Y. Xie, J. McPherson, W. Chang, J. Allen, H. Wickham, and R. Hyndman. 2015. rmarkdown: Dynamic Documents for R. http://CRAN.R-project.org/package=rmarkdown.
  • Bache, S.M., and H. Wickham. 2014. magrittr: A Forward-Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.
  • Dowle, M., A. Srinivasan, T. Short, S. Lianoglou, R. Saporta, and E. Antonyan. 2015. data.table: Extension of data.frame. https://CRAN.R-project.org/package=data.table.
  • Gentlemen, R. 2009. “Data Analysts Captivated by R’s Power.” New York Times. http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html.
  • Godfrey, A.J.R. 2013. “Statistical Analysis from a Blind Person’s Perspective.” The R Journal 5 (1): 73–80.
  • Kowarik, A., and M. Templ. 2016. “Imputation with the R Package VIM.” Journal of Statistical Software.
  • Mirai Solutions GmbH. 2015. XLConnect: Excel Connector for R. http://CRAN.R-project.org/package=XLConnect.
  • R Core Team. 2015. foreign: Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase, …. http://CRAN.R-project.org/package=foreign.
  • Ripley, B., and M. Lapsley. 2015. RODBC: ODBC Database Access. http://CRAN.R-project.org/package=RODBC.
  • RStudio Inc. 2014. shiny: Web Application Framework for R. http://CRAN.R-project.org/package=shiny.
  • Schopfhauser, D., M. Templ, A. Alfons, A. Kowarik, and B. Prantner. 2014. VIMGUI: Visualization and Imputation of Missing Values. http://CRAN.R-project.org/package=VIMGUI.
  • Templ, M., A. Alfons, and P. Filzmoser. 2011. “Exploring Incomplete Data Using Visualization Techniques.” Advances in Data Analysis and Classification 6 (1): 29–47.
  • Templ, M., and V. Todorov. 2016. “The Software Environment R for Official Statistics and Survey Methodology.” Austrian Journal of Statistics 45 (1): 97–124.
  • Templ, M., K. Hron, and P. Filzmoser. 2011. robCompositions: An R Package for Robust Statistical Analysis of Compositional Data. John Wiley & Sons.
  • Tippmann, S. 2015. “Programming Tools: Adventures with R.” Nature, 109–10. doi:10.1038/517109a.
  • Todorov, V., and M. Templ. 2012. R in the Statistical Office: Part II. Working Paper 1/2012. United Nations Industrial Development Organization.
  • Wickham, H. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://had.co.nz/ggplot2/book.
  • Wickham, H. 2015a. xml2: Parse XML. http://CRAN.R-project.org/package=xml2.
  • Wickham, H. 2015b. readxl: Read Excel Files. http://CRAN.R-project.org/package=readxl.
  • Wickham, H., and W. Chang. 2016. devtools: Tools to Make Developing R Packages Easier. https://CRAN.R-project.org/package=devtools.
  • Wickham, H., and R. Francois. 2015a. readr: Read Tabular Data. http://CRAN.R-project.org/package=readr.
  • Wickham, H., and R. Francois. 2015b. dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
  • Wickham, H., and E. Miller. 2015. haven: Import SPSS, Stata and SAS Files. http://CRAN.R-project.org/package=haven.
  • Wickham, H., P. Danenberg, and M. Eugster. roxygen2: In-Source Documentation for R. https://github.com/klutometis/roxygen.
  • Wilkinson, L. 2005. The Grammar of Graphics (Statistics and Computing). Secaucus, NJ, USA: Springer-Verlag New York, Inc.
