Dealing with a Mess

Analyzing data in the real world often requires some know-how outside of the typical introductory data analysis curriculum. For example, rarely do we get a neatly formatted, tidy dataset with no errors, junk, or missing values. Rather, we often get messy, unwieldy datasets.

What makes a dataset messy? Different people in different roles have different ideas about what constitutes messiness. Some regard any data that invalidates the assumptions of the parametric model as messy. Others see messiness in datasets with a grievously imbalanced number of observations in each category for a categorical variable. Some examples of things that I would consider messy are:

  • Many missing values (NAs)
  • Misspelled names in categorical variables
  • Inconsistent data coding
  • Numbers in the same column being in different units
  • Mis-recorded data and data entry mistakes
  • Extreme outliers

Since there are an infinite number of ways that data can be messy, there's simply no way to enumerate every example and its solution. Instead, we are going to talk about two tools that combat the bulk of the messiness issues I cited just now.

Analysis with missing data

Missing data is another topic that is largely ignored in most introductory texts. Part of the reason is probably that many myths about analysis with missing data still abound, and some of the research into cutting-edge techniques is relatively new. A more legitimate reason for its absence in introductory texts is that most of the more principled methodologies are fairly complicated, mathematically speaking. Nevertheless, the incredible ubiquity of missing data problems in real-life data analysis necessitates some broaching of the subject. This section serves as a gentle introduction to the topic and to one of the more effective techniques for dealing with it.

A common refrain on the subject is that the best way to deal with missing data is not to have any. It's true that missing data is a messy subject, and there are a lot of ways to handle it wrongly. It's important not to take this advice to the extreme, though. In order to bypass missing data problems, some have, for example, prevented survey participants from continuing until they answer every question on a form, or coerced the participants in a longitudinal study not to drop out. Don't do this. Not only is it unethical, it is also prodigiously counter-productive; there are treatments for missing data, but there are no treatments for bad data.

The standard treatment to the problem of missing data is to replace the missing data with non-missing values. This process is called imputation. In most cases, the goal of imputation is not to recreate the lost completed dataset but to allow valid statistical estimates or inferences to be drawn from incomplete data. Because of this, the effectiveness of different imputation techniques can’t be evaluated by their ability to most accurately recreate the data from a simulated missing dataset; they must, instead, be judged by their ability to support the same statistical inferences as would be drawn from the analysis on the complete data. In this way, filling in the missing data is only a step towards the real goal—the analysis. The imputed dataset is rarely considered the final goal of imputation.

There are many different ways that missing data is dealt with in practice—some are good, some are not so good. Some are okay under certain circumstances, but not okay in others. Some involve missing data deletion, while some involve imputation. We will briefly touch on some of the more common methods. The ultimate goal of this article, though, is to get you started on what is often described as the gold-standard of imputation techniques: multiple imputation.

Visualizing missing data

In order to demonstrate the visualizing patterns of missing data, we first have to create some missing data. This will also be the same dataset that we perform analysis on later in the article. To showcase how to use multiple imputation for a semi-realistic scenario, we are going to create a version of the mtcars dataset with a few missing values:

Okay, let’s set the seed (for deterministic randomness), and create a variable to hold our new marred dataset.


set.seed(2)   # any fixed value works; it just makes the "random" missingness reproducible
miss_mtcars <- mtcars

First, we are going to create seven missing values in drat (about 20 percent), five missing values in the mpg column (about 15 percent), five missing values in the cyl column, three missing values in wt (about 10 percent), and three missing values in vs:

some_rows <- sample(1:nrow(miss_mtcars), 7)
miss_mtcars$drat[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 5)
miss_mtcars$mpg[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 5)
miss_mtcars$cyl[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 3)
miss_mtcars$wt[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 3)
miss_mtcars$vs[some_rows] <- NA

Now, we are going to create four missing values in qsec, but only for automatic cars:

only_automatic <- which(miss_mtcars$am==0)

some_rows <- sample(only_automatic, 4)

miss_mtcars$qsec[some_rows] <- NA

Now, let’s take a look at the dataset:

> miss_mtcars

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb

Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4

Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4

Datsun 710          22.8   4 108.0  93 3.85    NA 18.61  1  1    4    1

Hornet 4 Drive      21.4   6 258.0 110   NA 3.215 19.44  1  0    3    1

Hornet Sportabout   18.7   8 360.0 175   NA 3.440 17.02  0  0    3    2

Valiant             18.1  NA 225.0 105   NA 3.460    NA  1  0    3    1

Great, now let’s visualize the missingness.

The first way we are going to visualize the pattern of missing data is by using the md.pattern function from the mice package (which is also the package that we are ultimately going to use for imputing our missing data). If you don’t have the package already, install it.

> library(mice)

> md.pattern(miss_mtcars)

   disp hp am gear carb wt vs qsec mpg cyl drat  

12    1  1  1    1    1  1  1    1   1   1    1  0

 4    1  1  1    1    1  1  1    1   0   1    1  1

 2    1  1  1    1    1  1  1    1   1   0    1  1

 3    1  1  1    1    1  1  1    1   1   1    0  1

 3    1  1  1    1    1  0  1    1   1   1    1  1

 2    1  1  1    1    1  1  1    0   1   1    1  1

 1    1  1  1    1    1  1  1    1   0   1    0  2

 1    1  1  1    1    1  1  1    0   1   0    1  2

 1    1  1  1    1    1  1  0    1   1   0    1  2

 2    1  1  1    1    1  1  0    1   1   1    0  2

 1    1  1  1    1    1  1  1    0   1   0    0  3

      0  0  0    0    0  3  3    4   5   5    7 27

A row-wise missing data pattern refers to the set of columns that are missing for a given row. The md.pattern function aggregates the rows with the same missing data pattern, counts them, and outputs a binary (0 and 1) matrix: cells with a 1 represent non-missing data; 0s represent missing data. Since the rows are sorted in order of increasing missingness, the first row always refers to the missing data pattern containing the least amount of missing data.

In this case, the missing data pattern with the least amount of missing data is the pattern containing no missing data at all. Because of this, the first row has all 1s in the columns that are named after the columns in the miss_mtcars dataset. The left-most column is a count of the number of rows that display the missing data pattern, and the right-most column is a count of the number of missing data points in that pattern. The last row contains a count of the number of missing data points in each column.

As you can see, 12 of the rows contain no missing data. The next most common missing data pattern is the one missing just mpg; four rows fit this pattern. There are only six rows that contain more than one missing value. Only one of these rows contains more than two missing values (as shown in the second-to-last row).

As far as datasets with missing data go, this particular one doesn't contain much. It is not uncommon for some datasets to have more than 30 percent of their values missing; this one is missing under 8 percent of its cells.

Now let’s visualize the missing data pattern graphically using the VIM package. You will probably have to install this, too.


library(VIM)
aggr(miss_mtcars, numbers=TRUE)

Figure 11.1: The output of VIM's visual aggregation of missing data. The left plot shows the proportion of missing values for each column. The right plot depicts the prevalence of row-wise missing data patterns, like md.pattern

At a glance, this representation shows us, effortlessly, that the drat column accounts for the highest proportion of missingness, column-wise, followed by mpg, cyl, qsec, vs, and wt. The graphic on the right shows us information similar to that of the output of md.pattern. This representation, though, makes it easier to tell if there is some systematic pattern of missingness. The blue cells represent non-missing data, and the red cells represent missing data. The numbers on the right of the graphic represent the proportion of rows displaying that missing data pattern. 37.5 percent of the rows contain no missing data whatsoever.

Types of missing data

The VIM package allowed us to visualize the missing data patterns. A related term, the missing data mechanism, describes the process that determines each data point’s likelihood of being missing. There are three main categories of missing data mechanisms: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Discrimination based on missing data mechanism is crucial, since it informs us about the options for handling the missingness.

The first mechanism, MCAR, occurs when the data's missingness is unrelated to the data. This would occur, for example, if rows were deleted from a database at random, or if a gust of wind took a random sample of a surveyor's survey forms off into the horizon. The mechanism that governs the missingness of drat, mpg, cyl, wt, and vs is MCAR, because we randomly selected elements to go missing. This mechanism, while being the easiest to work with, is seldom tenable in practice.

MNAR, on the other hand, occurs when a variable's missingness is related to the variable itself. For example, suppose the scale that weighed each car had a capacity of only 3,700 pounds, and because of this, the eight cars that weighed more than that were recorded as NA. This is a classic example of the MNAR mechanism; it is the weight of the observation itself that is the cause of its being missing. Another example would be if, during the course of a trial of an anti-depressant drug, participants who were not being helped by the drug became too depressed to continue with the trial. At the end of the trial, when every participant's level of depression is assessed and recorded, there would be missing values for participants whose reason for absence is related to their level of depression.

The last mechanism, missing at random, is somewhat unfortunately named. Contrary to what it may sound like, it means there is a systematic relationship between the missingness of an outcome variable and other observed variables, but not the outcome variable itself. This is probably best explained by the following example.

Suppose that in a survey, there is a question about income level that, in its wording, uses a particular colloquialism. Due to this, a large number of the participants in the survey whose native language is not English couldn't interpret the question, and left it blank. If the survey collected just the name, gender, and income, the missing data mechanism of the question on income would be MNAR. If, however, the questionnaire included a question that asked whether the participant spoke English as a first language, then the mechanism would be MAR. The inclusion of the Is English your first language? variable means that the missingness of the income question can be completely accounted for. The reason for the moniker missing at random is that when you control for the observed variable(s) related to the missing variable (for example, What is your income? and Is English your first language?, respectively), the data are missing at random.

As another example, there is a systematic relationship between the am and qsec variables in our simulated missing dataset: qsecs were missing only for automatic cars. But within the group of automatic cars, the qsec variable is missing at random. Therefore, qsec's mechanism is MAR; controlling for transmission type, qsec is missing at random. Bear in mind, though, that if we removed am from our simulated dataset, qsec would become MNAR.

As mentioned earlier, MCAR is the easiest type to work with because of the complete absence of a systematic relationship in the data's missingness. Many unsophisticated techniques for handling missing data rest on the assumption that the data are MCAR. On the other hand, MNAR data is the hardest to work with, since the properties of the missing data that caused its missingness have to be quantifiably understood and included in the imputation model. Though multiple imputation can handle MNAR mechanisms, the procedures involved become more complicated and are far beyond the scope of this text. The MCAR and MAR mechanisms allow us not to worry about the properties and parameters of the missing data. For this reason, you may sometimes find MCAR or MAR missingness referred to as ignorable missingness.

MAR data is not as hard to work with as MNAR data, but it is not as forgiving as MCAR. For this reason, though our simulated dataset contains MCAR and MAR components, the mechanism that describes the whole data is MAR—just one MAR mechanism makes the whole dataset MAR.

So which one is it?

You may have noticed that the place of a particular dataset in the missing data mechanism taxonomy is dependent on the variables that it includes. For example, we know that the mechanism behind qsec is MAR, but if the dataset did not include am, it would be MNAR. Since we are the ones who created the data, we know the procedure that resulted in qsec's missing values. If we weren't the ones creating the data, as happens in the real world, and the dataset did not contain the am column, we would just see a bunch of arbitrarily missing qsec values. This might lead us to believe that the data is MCAR. It isn't, though; just because the variable to which another variable's missingness is systematically related is unobserved, doesn't mean that it doesn't exist.

This raises a critical question: can we ever be sure that our data is not MNAR? The unfortunate answer is no. Since the data that we need to prove or disprove MNAR is ipso facto missing, the MNAR assumption can never be conclusively disconfirmed. It’s our job, as critically thinking data analysts, to ask whether there is likely an MNAR mechanism or not.

Unsophisticated methods for dealing with missing data

Here we are going to look at various types of methods for dealing with missing data:

Complete case analysis

This method, also called list-wise deletion, is a straightforward procedure that simply removes all rows or elements containing missing values prior to the analysis. In the univariate case—taking the mean of the drat column, for example—all elements of drat that are missing would simply be removed:

> mean(miss_mtcars$drat)

[1] NA

> mean(miss_mtcars$drat, na.rm=TRUE)

[1] 3.63

In a multivariate procedure—for example, linear regression predicting mpg from am, wt, and qsec—all rows that have a missing value in any of the columns included in the regression are removed:

listwise_model <- lm(mpg ~ am + wt + qsec,
                     data=miss_mtcars,
                     na.action = na.omit)

## OR

# complete.cases returns a boolean vector
comp <- complete.cases(cbind(miss_mtcars$mpg,
                             miss_mtcars$am,
                             miss_mtcars$wt,
                             miss_mtcars$qsec))
comp_mtcars <- miss_mtcars[comp,]
listwise_model <- lm(mpg ~ am + wt + qsec,
                     data=comp_mtcars)
Under an MCAR mechanism, a complete case analysis produces unbiased estimates of the mean, variance/standard deviation, and regression coefficients, which means that the estimates don't systematically differ from the true values on average, since the included data elements are just a random sample of the recorded data elements. However, inference-wise, since we lost a number of our samples, we lose statistical power and generate standard errors and confidence intervals that are bigger than they need to be. Additionally, in the multivariate regression case, note that our sample size depends on the variables that we include in the regression; the more variables we include, the more missing data we open ourselves up to, and the more rows we are liable to lose. This makes comparing results across different models slightly hairy.

Under an MAR or MNAR mechanism, list-wise deletion will produce biased estimates of the mean and variance. For example, if am were highly correlated with qsec, the fact that we are missing qsec only for automatic cars would significantly shift our estimates of the mean of qsec. Surprisingly, list-wise deletion produces unbiased estimates of the regression coefficients, even if the data is MNAR or MAR, as long as the relevant variables are included in the regression equations. For this reason, if there are relatively few missing values in a data set that is to be used in regression analysis, list-wise deletion could be an acceptable alternative to more principled approaches.
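We can see this bias concretely in our simulated data, since we know qsec is MAR with respect to am. Automatic cars tend to have slower quarter-mile times, and qsec is missing only for automatics, so the complete-case mean will tend to be pulled away from the true mean:

```r
# true mean, computed on the complete data
mean(mtcars$qsec)

# complete-case mean on the marred data; with the missing automatics
# removed, this estimate is systematically shifted
mean(miss_mtcars$qsec, na.rm=TRUE)
```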

Pairwise deletion

Also called available-case analysis, this technique is (somewhat unfortunately) common when estimating covariance or correlation matrices. For each pair of variables, it only uses the cases that are non-missing for both. This often means that the number of elements used will vary from cell to cell of the covariance/correlation matrices. This can result in absurd correlation coefficients that are above 1, making the resulting matrices largely useless to methodologies that depend on them.
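In R, pairwise deletion is what you get from the `use` argument of `cor` (and `cov`); a quick sketch on our dataset:

```r
# each cell uses only the rows that are complete for that particular
# pair of columns, so different cells rest on different sample sizes
cor_pair <- cor(miss_mtcars, use="pairwise.complete.obs")

# contrast with list-wise deletion, which uses only fully complete rows
cor_list <- cor(miss_mtcars, use="complete.obs")
```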

Mean substitution

Mean substitution, as the name suggests, replaces all the missing values with the mean of the available cases. For example:

mean_sub <- miss_mtcars
mean_sub$qsec[is.na(mean_sub$qsec)] <- mean(mean_sub$qsec,
                                            na.rm=TRUE)
# etc...

Although this seemingly solves the problem of the loss of sample size in the list-wise deletion procedure, mean substitution has some very unsavory properties of its own. Whilst mean substitution produces unbiased estimates of the mean of a column, it produces biased estimates of the variance, since it removes the natural variability that would have occurred in the missing values had they not been missing. The variance estimates from mean substitution will therefore be systematically too small. Additionally, it's not hard to see that mean substitution will result in biased estimates if the data are MAR or MNAR. For these reasons, mean substitution is not recommended under virtually any circumstance.

Hot deck imputation

Hot deck imputation is an intuitively elegant approach that fills in the missing data with donor values from another row in the dataset. In the least sophisticated formulation, a random non-missing element from the same dataset is shared with a missing value. In more sophisticated hot deck approaches, the donor value comes from a row that is similar to the row with the missing data. The multiple imputation technique that we will be using in a later section of this article borrows this idea for one of its imputation methods.

The term hot deck refers to the old practice of storing data in decks of punch cards. The deck that holds the donor value would be hot because it is the one that is currently being processed.
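A minimal random hot deck for the qsec column, just to make the idea concrete (this is the least sophisticated formulation described above, not a recommendation):

```r
hot_deck <- miss_mtcars
missing_rows <- which(is.na(hot_deck$qsec))
donors <- na.omit(hot_deck$qsec)  # every observed qsec is a potential donor

# share a randomly chosen donor value with each missing value
hot_deck$qsec[missing_rows] <- sample(donors, length(missing_rows),
                                      replace=TRUE)
```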

Regression imputation

This approach attempts to fill in the missing data in a column using regression to predict likely values of the missing elements using other columns as predictors. For example, using regression imputation on the drat column would employ a linear regression predicting drat from all the other columns in miss_mtcars. The process would be repeated for all columns containing missing data, until the dataset is complete.

This procedure is intuitively appealing, because it integrates knowledge of the other variables and patterns of the dataset, creating a set of more informed imputations. As a result, it produces unbiased estimates of the mean and regression coefficients under MCAR and MAR (so long as the relevant variables are included in the regression model).
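The procedure can be sketched by hand for the drat column. This is a simplified illustration: I use only the fully observed columns of miss_mtcars as predictors so that the sketch stands alone, whereas full regression imputation would cycle over every incomplete column.

```r
reg_imp <- miss_mtcars

# fit on the rows where drat is observed (lm drops NA rows by default)
fit <- lm(drat ~ disp + hp + am + gear + carb, data=reg_imp)

# fill the missing drats with their predicted values
missing_rows <- which(is.na(reg_imp$drat))
reg_imp$drat[missing_rows] <- predict(fit, newdata=reg_imp[missing_rows, ])
```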

However, this approach is not without its problems. The predicted values of the missing data lie right on the regression line but, as we know, very few data points lie right on the regression line; there is usually a normally distributed residual (error) term. Because of this, regression imputation underestimates the variability of the missing values, and yields biased estimates of the variance and of the covariance between different columns. However, we're on the right track.

Stochastic regression imputation

As far as unsophisticated approaches go, stochastic regression is fairly evolved. This approach solves some of the issues of regression imputation, and produces unbiased estimates of the mean, variance, covariance, and regression coefficients under MCAR and MAR. It does this by adding a random (stochastic) value to the predictions of regression imputation. This random added value is sampled from the residual (error) distribution of the linear regression—which, if you remember, is assumed to be a normal distribution. This restores the variability in the missing values (that we lost in regression imputation) that those values would have had if they weren’t missing.

However, as far as subsequent analysis and inference on the imputed dataset goes, stochastic regression results in standard errors and confidence intervals that are smaller than they should be. Since it produces only one imputed dataset, it does not capture the extent to which we are uncertain about the residuals and our coefficient estimates. Nevertheless, stochastic regression forms the basis of still more sophisticated imputation methods.
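A hand-rolled sketch of stochastic regression imputation for the drat column follows. As before, this is only an illustration under the simplifying assumption that the fully observed columns serve as predictors; mice implements this kind of step properly for us.

```r
sto_imp <- miss_mtcars
fit <- lm(drat ~ disp + hp + am + gear + carb, data=sto_imp)
missing_rows <- which(is.na(sto_imp$drat))
preds <- predict(fit, newdata=sto_imp[missing_rows, ])

# add noise drawn from the fitted model's residual distribution
sto_imp$drat[missing_rows] <- preds + rnorm(length(missing_rows),
                                            mean=0,
                                            sd=summary(fit)$sigma)
```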

There are two sophisticated, well-founded, and recommended methods of dealing with missing data. One is called the Expectation Maximization (EM) method, which we do not cover here. The second is called Multiple Imputation, and because it is widely considered the most effective method, it is the one we explore in this article.

Multiple imputation

The big idea behind multiple imputation is that instead of generating one set of imputed data with our best estimation of the missing data, we generate multiple versions of the imputed data where the imputed values are drawn from a distribution. The uncertainty about what the imputed values should be is reflected in the variation between the multiply imputed datasets.

We perform our intended analysis separately on each of these m completed datasets. These analyses will then yield m different parameter estimates (like regression coefficients, and so on). The critical point is that these parameter estimates differ solely because of the variability in the imputed missing values, and hence reflect our uncertainty about what the imputed values should be. This is how multiple imputation incorporates uncertainty, and outperforms more limited imputation methods that produce one imputed dataset and confer an unwarranted sense of confidence in the filled-in data. The following diagram illustrates this idea:

Figure 11.2: Multiple imputation in a nutshell
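In mice, the whole impute-analyze-pool cycle pictured above takes only a few calls. Here is a sketch; the regression model is an arbitrary example, not one prescribed by the text:

```r
library(mice)

imp <- mice(miss_mtcars, m=5, seed=3, printFlag=FALSE)  # m imputed datasets
fits <- with(imp, lm(mpg ~ wt + qsec))  # run the same analysis on each one
pooled <- pool(fits)   # combine the m sets of estimates (Rubin's rules)
summary(pooled)
```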

So how does mice come up with the imputed values?

Let’s focus on the univariate case—where only one column contains missing data and we use all the other (completed) columns to impute the missing values—before generalizing to a multivariate case.

mice actually has a few different imputation methods up its sleeve, each best suited for a particular use case. mice will often choose sensible defaults based on the data type (continuous, binary, non-binary categorical, and so on).

The most important method is what the package calls the norm method. This method is very much like stochastic regression. Each of the m imputations is created by adding a normal “noise” term to the output of a linear regression predicting the missing variable. What makes this slightly different than just stochastic regression repeated m times is that the norm method also integrates uncertainty about the regression coefficients used in the predictive linear model.

Recall that the regression coefficients in a linear regression are just estimates of the population coefficients from a random sample (that’s why each regression coefficient has a standard error and confidence interval). Another sample from the population would have yielded slightly different coefficient estimates. If through all our imputations, we just added a normal residual term from a linear regression equation with the same coefficients, we would be systematically understating our uncertainty regarding what the imputed values should be.

To combat this, in multiple imputation, each imputation of the data contains two steps. The first step performs stochastic linear regression imputation using coefficients for each predictor estimated from the data. The second step chooses slightly different estimates of these regression coefficients, and proceeds into the next imputation. The first step of the next imputation uses the slightly different coefficient estimates to perform stochastic linear regression imputation again. After that, in the second step of the second iteration, still other coefficient estimates are generated to be used in the third imputation. This cycle goes on until we have m multiply imputed datasets.

How do we choose these different coefficient estimates at the second step of each imputation? Traditionally, the approach is Bayesian in nature; these new coefficients are drawn from each of the coefficients’ posterior distribution, which describes credible values of the estimate using the observed data and uninformative priors. This is the approach that norm uses. There is an alternate method that chooses these new coefficient estimates from a sampling distribution that is created by taking repeated samples of the data (with replacement) and estimating the regression coefficients of each of these samples. mice calls this method norm.boot.
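If you want the bootstrap variant for a particular column, you can override mice's per-column defaults. A sketch, assuming a recent version of mice that provides the `make.method` helper for building the default method vector:

```r
library(mice)

meths <- make.method(miss_mtcars)  # mice's default method for each column
meths["drat"] <- "norm.boot"       # bootstrap coefficient draws for drat
imp <- mice(miss_mtcars, method=meths, seed=3, printFlag=FALSE)
```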

The multivariate case is a little more hairy, since the imputation for one column depends on the other columns, which may contain missing data of their own.

For this reason, we make several passes over all the columns that need imputing, until the imputations of the missing data in each column are informed by increasingly well-informed estimates of the missing data in the predictor columns. These passes over all the columns are called iterations.

So that you really understand how this iteration works, let’s say we are performing multiple imputation on a subset of miss_mtcars containing only mpg, wt and drat. First, all the missing data in all the columns are set to a placeholder value like the mean or a randomly sampled non-missing value from its column. Then, we visit mpg where the placeholder values are turned back into missing values. These missing values are predicted using the two-part procedure described in the univariate case. Then we move on to wt; the placeholder values are turned back into missing values, whose new values are imputed with the two-step univariate procedure using mpg and drat as predictors. Then this is repeated with drat. This is one iteration. On the next iteration, it is not the placeholder values that get turned back into random values and imputed but the imputed values from the previous iteration. As this repeats, we shift away from the starting values and the imputed values begin to stabilize. This usually happens within just a few iterations. The dataset at the completion of the last iteration is the first multiply imputed dataset. Each m starts the iteration process all over again.

The default in mice is five iterations. Of course, you can increase this number if you have reason to believe that you need to. We’ll discuss how to tell if this is necessary later in the section.
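One way to check whether five iterations were enough is to run more of them and inspect the trace plots; mice's plot method on the imputation object shows the mean and standard deviation of the imputed values across iterations. A sketch:

```r
library(mice)

imp <- mice(miss_mtcars, maxit=10, seed=3, printFlag=FALSE)
plot(imp)  # chains that mix well and show no trend suggest convergence
```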

Methods of imputation

The method of imputation that we described for the univariate case, norm, works best for imputed values that follow an unconstrained normal distribution—but it could lead to some nonsensical imputations otherwise. For example, since the weights in wt are so close to 0 (because it’s in units of a thousand pounds) it is possible for the norm method to impute a negative weight. Though this will no doubt balance out over the other m-1 multiply imputed datasets, we can combat this situation by using another method of imputation called predictive mean matching.

Predictive mean matching (mice calls this pmm) works a lot like norm. The difference is that the norm imputations are then used to find the d closest values to the imputed value among the non-missing data in the column. Then, one of these d values is chosen as the final imputed value—d=3 is the default in mice.

This method has a few great properties. For one, the possibility of imputing a negative value for wt is categorically off the table; the imputed value would have to be chosen from the set {1.513, 1.615, 1.835}, since these are the three lowest weights. More generally, any natural constraint in the data (lower or upper bounds, integer count data, numbers rounded to the nearest one-half, and so on) is respected with predictive mean matching, because the imputed values appear in the actual non-missing observed values. In this way, predictive mean matching is like hot-deck imputation. Predictive mean matching is the default imputation method in mice for numerical data, though it may be inferior to norm for small datasets and/or datasets with a lot of missing values.
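The donor step can be illustrated by hand. This is a toy sketch of the idea, not mice's actual internals, and the predicted value here is made up:

```r
observed <- na.omit(miss_mtcars$wt)
pred <- 1.4  # hypothetical norm-style prediction for one missing weight
d <- 3

# the d observed weights closest to the prediction are candidate donors
donors <- observed[order(abs(observed - pred))][1:d]
imputed_value <- sample(donors, 1)  # one donor becomes the imputed value
```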

Many of the other imputation methods in mice are specially suited for one particular data type. For example, binary categorical variables use logreg by default; this is like norm but uses logistic regression to impute a binary outcome. Similarly, non-binary categorical data uses multinomial regression—mice calls this method polyreg.

Multiple imputation in practice

There are a few steps to follow and decisions to make when using this powerful imputation technique:

  • Are the data MAR? And be honest! If the mechanism is likely not MAR, then more complicated measures have to be taken.
  • Are there any derived terms, redundant variables, or irrelevant variables in the data set? Any of these will interfere with the regression process. Irrelevant variables (like unique IDs) will not have any predictive power. Derived terms or redundant variables (like a column for weight in both pounds and grams, or a column for area in addition to length and width columns) will similarly interfere with the regression step.
  • Convert all categorical variables to factors: otherwise, mice will not be able to tell that the variable is supposed to be categorical.
  • Choose the number of iterations and m: by default, these are both five. Using five iterations is usually okay, and we'll be able to tell if we need more. Five imputations are usually okay, too, but we can achieve more statistical power from more imputed datasets. I suggest setting m to 20, unless the processing power and time can't be spared.
  • Choose an imputation method for each variable: you can stick with the defaults as long as you are aware of what they are and think they're the right fit.
  • Choose the predictors: let mice use all the available columns as predictors, as long as derived terms and redundant/irrelevant columns are removed. Not only does using more predictors result in reduced bias, but it also increases the likelihood that the data are MAR.
  • Perform the imputations
  • Audit the imputations
  • Perform the analysis with the imputations
  • Pool the results of the analyses

Before we get down to it, let’s call the mice function on our data frame with missing data, and use its default arguments, just to see what we shouldn’t do and why:

# we are going to set the seed and printFlag to FALSE, but
# everything else will be the default arguments
imp <- mice(miss_mtcars, seed=3, printFlag=FALSE)
imp





Multiply imputed data set
Call:
mice(data = miss_mtcars, printFlag = FALSE, seed = 3)
Number of multiple imputations:  5
Missing cells per column:
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
   5    5    0    0    7    3    4    3    0    0    0
Imputation methods:
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
"pmm" "pmm"    ""    "" "pmm" "pmm" "pmm" "pmm"    ""    ""    ""
VisitSequence:
 mpg  cyl drat   wt qsec   vs
   1    2    5    6    7    8
PredictorMatrix:
     mpg cyl disp hp drat wt qsec vs am gear carb
mpg    0   1    1  1    1  1    1  1  1    1    1
cyl    1   0    1  1    1  1    1  1  1    1    1
disp   0   0    0  0    0  0    0  0  0    0    0
hp     0   0    0  0    0  0    0  0  0    0    0
drat   1   1    1  1    0  1    1  1  1    1    1
wt     1   1    1  1    1  0    1  1  1    1    1
qsec   1   1    1  1    1  1    0  1  1    1    1
vs     1   1    1  1    1  1    1  0  1    1    1
am     0   0    0  0    0  0    0  0  0    0    0
gear   0   0    0  0    0  0    0  0  0    0    0
carb   0   0    0  0    0  0    0  0  0    0    0
Random generator seed value:  3

The first thing we notice (on line four of the output) is that mice chose to create five multiply imputed datasets by default. As we discussed, this isn't a bad default, but more imputations can only improve our statistical power (if only marginally); when we impute this dataset in earnest, we will use m=20.

The second thing we notice (on lines 8-10 of the output) is that it used predictive mean matching as the imputation method for all the columns with missing data. If you recall, predictive mean matching is the default imputation method for numeric columns. However, vs and cyl are binary categorical and non-binary categorical variables, respectively. Because we didn’t convert them to factors, mice thinks these are just regular numeric columns. We’ll have to fix this.

The last thing we should notice here is the predictor matrix (starting on line 14). Each row and column of the predictor matrix refers to a column in the dataset to impute. If a cell contains a 1, it means that the variable referred to in the column is used as a predictor for the variable in the row. The first row indicates that all available attributes are used to help predict mpg, with the exception of mpg itself. All the values in the diagonal are 0, because mice won't use an attribute to predict itself. Note that the disp, hp, am, gear, and carb rows all contain 0s—this is because these variables are complete, and don't need to use any predictors.

Since we thought carefully about whether there were any attributes that should be removed before performing the imputation, we can use mice's default predictor matrix for this dataset. If there were any non-predictive attributes (like unique identifiers, redundant variables, and so on), we would have either had to remove them (the easiest option) or instruct mice not to use them as predictors (harder).
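For reference, the "harder" option can be sketched as follows; the id column here is a hypothetical unique identifier added for illustration, not a real column of miss_mtcars:

```r
library(mice)

# A dry run (maxit = 0) builds the default predictor matrix without
# performing any actual imputation
ini <- mice(df_with_id, maxit = 0, printFlag = FALSE)

pred <- ini$predictorMatrix
pred[, "id"] <- 0   # never use the hypothetical id column as a predictor

imp <- mice(df_with_id, predictorMatrix = pred, printFlag = FALSE)
```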

Let’s now correct the issues that we’ve discussed.

# convert categorical variables into factors
miss_mtcars$vs <- factor(miss_mtcars$vs)
miss_mtcars$cyl <- factor(miss_mtcars$cyl)

imp <- mice(miss_mtcars, m=20, seed=3, printFlag=FALSE)
imp$method



      mpg       cyl      disp        hp      drat
    "pmm" "polyreg"        ""        ""     "pmm"
       wt      qsec        vs        am      gear
    "pmm"     "pmm"  "logreg"        ""        ""
     carb
       ""



Now mice has corrected the imputation method of cyl and vs to their correct defaults. In truth, cyl is a kind of discrete numeric variable called an ordinal variable, which means that yet another imputation method may be optimal for that attribute, but, for the sake of simplicity, we’ll treat it as a categorical variable.

Before we get to use the imputations in an analysis, we have to check the output. The first thing we need to check is the convergence of the iterations. Recall that for imputing data with missing values in multiple columns, multiple imputation requires iteration over all these columns a few times. At each iteration, mice produces imputations—and samples new parameter estimates from the parameters’ posterior distributions—for all columns that need to be imputed. The final imputations, for each multiply imputed dataset m, are the imputed values from the final iteration.

In contrast to when we used MCMC, convergence in mice is much faster; it usually occurs in just a few iterations. However, visually checking for convergence is highly recommended. We even check for it in a similar way: when we call the plot function on the object that mice returns, it displays trace plots of the mean and standard deviation of all the variables involved in the imputations. Each line in each plot is one of the m imputations.



Figure 11.3: A subset of the trace plots produced by plotting an object returned by a mice imputation

As you can see from the preceding trace plots of imp, there are no clear trends and the lines all overlap from one iteration to the next. Put another way, the variance within a chain (there are m chains) should be about equal to the variance between the chains. This indicates that convergence was achieved.

If convergence was not achieved, you can increase the number of iterations that mice employs by explicitly specifying the maxit parameter to the mice function.
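A sketch of both remedies, rerunning with a larger maxit and extending an existing run with mice.mids, might look like this:

```r
library(mice)

# rerun from scratch with more iterations...
imp <- mice(miss_mtcars, m = 20, maxit = 20, seed = 3, printFlag = FALSE)

# ...or tack extra iterations onto an existing mids object
imp <- mice.mids(imp, maxit = 10, printFlag = FALSE)

plot(imp)   # then re-inspect the trace plots
```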

To see an example of non-convergence, take a look at Figures 7 and 8 in the paper that describes this package, written by the package's authors themselves.

The next step is to make sure the imputed values are reasonable. In general, whenever we quickly review the results of something to see if they make sense, it is called a sanity test or sanity check. With the following line, we’re going to display the imputed values for the five missing mpgs for the first six imputations:

imp$imp$mpg[, 1:6]
                      1    2    3    4    5    6

Duster 360         19.2 16.4 17.3 15.5 15.0 19.2

Cadillac Fleetwood 15.2 13.3 15.0 13.3 10.4 17.3

Chrysler Imperial  10.4 15.0 15.0 16.4 10.4 10.4

Porsche 914-2      27.3 22.8 21.4 22.8 21.4 15.5

Ferrari Dino       19.2 21.4 19.2 15.2 18.1 19.2

These sure look reasonable. A better method for sanity checking is to call densityplot on the variable that we assign the mice output to:

densityplot(imp)

Figure 11.4: Density plots of all the imputed values for mpg, drat, wt, and qsec. Each imputation has its own density curve in each quadrant

This displays, for every attribute imputed, a density plot of the actual non-missing values (the thick line) and of the imputed values (the thin lines). We are looking to see that the distributions are similar. Note that the density curves of the imputed values extend much higher than the observed values' density curve in this case. This is partly because we imputed so few values that there weren't enough data points to properly smooth the density approximation. Height and non-smoothness notwithstanding, these density plots indicate no outlandish behavior among the imputed values.

We are now ready for the analysis phase. We are going to perform linear regression on each imputed dataset and attempt to model mpg as a function of am, wt, and qsec. Instead of repeating the analyses on each dataset manually, we can apply an expression to all the datasets at one time with the with function, as follows:

imp_models <- with(imp, lm(mpg ~ am + wt + qsec))

We can take a peek at the estimated coefficients from each dataset by using lapply on the analyses attribute of the returned object:

lapply(imp_models$analyses, coef)



[[1]]
(Intercept)          am          wt        qsec
 18.1534095   2.0284014  -4.4054825   0.8637856

[[2]]
(Intercept)          am          wt        qsec
   8.375455    3.336896   -3.520882    1.219775

[[3]]
(Intercept)          am          wt        qsec
   5.254578    3.277198   -3.233096    1.337469

...


Finally, let’s pool the results of the analyses (with the pool function), and call summary on it:

pooled_model <- pool(imp_models)
summary(pooled_model)



                  est        se         t       df    Pr(>|t|)

(Intercept)  7.049781 9.2254581  0.764166 17.63319 0.454873254

am           3.182049 1.7445444  1.824000 21.36600 0.082171407

wt          -3.413534 0.9983207 -3.419276 14.99816 0.003804876

qsec         1.270712 0.3660131  3.471765 19.93296 0.002416595

                  lo 95     hi 95 nmis       fmi    lambda

(Intercept) -12.3611281 26.460690   NA 0.3459197 0.2757138

am           -0.4421495  6.806247    0 0.2290359 0.1600952

wt           -5.5414268 -1.285641    3 0.4324828 0.3615349

qsec          0.5070570  2.034366    4 0.2736026 0.2042003

Though we could have performed the pooling ourselves using the equations that Donald Rubin outlined in his 1987 classic Multiple Imputation for Nonresponse in Surveys, it is less of a hassle and less error-prone to have the pool function do it for us. Readers who are interested in the pooling rules are encouraged to consult the aforementioned text.
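As a rough base-R illustration of Rubin's rules (not a replacement for pool), the pooled estimate is the mean of the per-imputation estimates, and the total variance combines the average within-imputation variance with the between-imputation variance. The coefficient and standard-error numbers below are invented for the illustration:

```r
# Each column holds (intercept, slope) estimates/SEs from one of m = 3
# hypothetical fitted models; the values are made up for this sketch
ests <- cbind(c(18.15, 2.03), c(8.38, 3.34), c(5.25, 3.28))
ses  <- cbind(c(9.00, 1.70), c(9.50, 1.80), c(9.20, 1.75))

m    <- ncol(ests)
qbar <- rowMeans(ests)        # pooled point estimates
ubar <- rowMeans(ses^2)       # average within-imputation variance
b    <- apply(ests, 1, var)   # between-imputation variance
tot  <- ubar + (1 + 1/m) * b  # total variance (Rubin, 1987)

pooled_se <- sqrt(tot)        # always wider than the within-only SE
```

This makes concrete why the pooled standard errors are wider than any single fit's: the between-imputation variance b is added on top of the within-imputation variance.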

As you can see, for each parameter, pool has combined the coefficient estimate and standard errors, and calculated the appropriate degrees of freedom. These allow us to t-test each coefficient against the null hypothesis that the coefficient is equal to 0, produce p-values for the t-test, and construct confidence intervals.

The standard errors and confidence intervals are wider than those that would have resulted from linear regression on a single imputed dataset, but that’s because it is appropriately taking into account our uncertainty regarding what the missing values would have been.

There are, at present, a limited number of analyses that can be automatically pooled by mice—the most important being lm/glm. If you recall, though, the generalized linear model is extremely flexible, and can be used to express a wide array of different analyses. By extension, we can use multiple imputation for not only linear regression but also logistic regression, Poisson regression, t-tests, ANOVA, ANCOVA, and more.

Analysis with unsanitized data

Very often, there will be errors or mistakes in data that can severely complicate analyses—especially with public data or data from outside your organization. For example, say there is a stray comma or punctuation mark in a column that was supposed to be numeric. If we aren't careful, R will read this column as character, and subsequent analysis may, in the best-case scenario, fail. It is also possible, however, that our analysis will silently chug along and return an unexpected result. This will happen, for example, if we try to perform linear regression using the punctuation-containing-but-otherwise-numeric column as a predictor; R will convert it into a factor, thinking that it is a categorical variable.

In the worst-case scenario, an analysis with unsanitized data may not error out or return nonsensical results, but return results that look plausible but are actually incorrect. For example, it is common (for some reason) to encode missing data with 999 instead of NA; performing a regression analysis with 999 in a numeric column can severely adulterate our linear models, but often not enough to cause clearly inappropriate results. This mistake may then go undetected indefinitely.
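A minimal sketch of guarding against that particular trap is to recode the sentinel to NA before any modeling; the income vector here is invented for the example:

```r
# 999 is being used as a missing-value sentinel in this made-up vector
income <- c(52000, 34000, 999, 61000, 999)

income[income == 999] <- NA   # recode the sentinel as a proper NA

mean(income, na.rm = TRUE)    # 49000, instead of a badly deflated mean
```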

Some problems like these could, rather easily, be detected in small datasets by visually auditing the data. Often, however, mistakes like these are notoriously easy to miss. Further, visual inspection is an untenable solution for datasets with thousands of rows and hundreds of columns. Any sustainable solution must off-load this auditing process to R. But how do we describe aberrant behavior to R so that it can catch mistakes on its own?

The package assertr seeks to do this by introducing a number of data checking verbs. Using assertr grammar, these verbs (functions) can be combined with subjects (data) in different ways to express a rich vocabulary of data validation tasks.

More prosaically, assertr provides a suite of functions designed to verify the assumptions about data early in the analysis process, before any time is wasted computing on bad data. The idea is to provide as much information as you can about how you expect the data to look upfront so that any deviation from this expectation can be dealt with immediately.

Given that the assertr grammar is designed to describe a whole bouquet of error-checking routines, rather than listing all the functions and functionality that the package provides, it will be more helpful to visit particular use cases.

Two things before we start. First, make sure you install assertr. Second, bear in mind that all data verification verbs in assertr take a data frame to check as their first argument, and either (a) return the same data frame if the check passes, or (b) produce a fatal error. Since the verbs return a copy of the chosen data frame if the check passes, the main idiom in assertr involves reassigning the returned data frame after it passes the check.

a_dataset <- CHECKING_VERB(a_dataset, ....)

Checking for out-of-bounds data

It's common for the numeric values in a column to have a natural constraint on the values they should hold. For example, if a column represents a percent of something, we might want to check that all the values in that column are between 0 and 1 (or 0 and 100). In assertr, we typically use the within_bounds function in conjunction with the assert verb to ensure that this is the case. For example, let's add a column to mtcars that represents each car's weight as a percentage of the heaviest car's weight:


mtcars.copy <- mtcars

mtcars.copy$Percent.Max.Wt <- round(mtcars.copy$wt /
                                      max(mtcars.copy$wt), 2)

mtcars.copy <- assert(mtcars.copy, within_bounds(0,1),
                      Percent.Max.Wt)


within_bounds is actually a function that takes the lower and upper bounds and returns a predicate, a function that returns TRUE or FALSE. The assert function then applies this predicate to every element of the column specified in the third argument. If there are more than three arguments, assert will assume there are more columns to check.

Using within_bounds, we can also catch the situation where NA values were encoded as 999: as long as the upper bound we hand to within_bounds is less than 999, any such sentinel value will trigger an assertion error.

within_bounds can take other information such as whether the bounds should be inclusive or exclusive, or whether it should ignore the NA values. To see the options for this, and all the other functions in assertr, use the help function on them.
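To make the predicate idea concrete, here is a base-R sketch of a bounds-checking predicate factory; this mirrors the concept, not assertr's actual implementation (the real within_bounds has more options, as noted above):

```r
# Returns a predicate: a one-argument function yielding TRUE/FALSE.
# (assertr's real within_bounds is more configurable than this sketch.)
make_bounds_check <- function(lower, upper) {
  function(x) !is.na(x) & x >= lower & x <= upper
}

pct_ok <- make_bounds_check(0, 1)
pct_ok(c(0.25, 1, 2, NA))   # TRUE TRUE FALSE FALSE
```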

Let’s see an example of what it looks like when the assert function fails:

mtcars.copy$Percent.Max.Wt[c(10,15)] <- 2

mtcars.copy <- assert(mtcars.copy, within_bounds(0,1),
                      Percent.Max.Wt)




Vector 'Percent.Max.Wt' violates assertion 'within_bounds' 2 times (e.g. [2] at index 10)

We get an informative error message that tells us how many times the assertion was violated, and the index and value of the first offending datum.

With assert, we have the option of checking a condition on multiple columns at the same time. For example, none of the measurements in iris can possibly be negative. Here’s how we might make sure our dataset is compliant:

iris <- assert(iris, within_bounds(0, Inf),

               Sepal.Length, Sepal.Width,

               Petal.Length, Petal.Width)


# or simply "-Species" because that

# will include all columns *except* Species

iris <- assert(iris, within_bounds(0, Inf),
               -Species)


On occasion, we will want to check elements for adherence to a more complicated pattern. For example, let’s say we had a column that we knew was either between -10 and -20, or 10 and 20. We can check for this by using the more flexible verify verb, which takes a logical expression as its second argument; if any of the results in the logical expression is FALSE, verify will cause an error.

vec <- runif(10, min=10, max=20)

# randomly turn some elements negative

vec <- vec * sample(c(1, -1), 10,
                    replace=TRUE)



example <- data.frame(weird=vec)


example <- verify(example, ((weird < 20 & weird > 10) |

                              (weird < -10 & weird > -20)))


# or


example <- verify(example, abs(weird) < 20 & abs(weird) > 10)

# passes


example$weird[4] <- 0

example <- verify(example, abs(weird) < 20 & abs(weird) > 10)

# fails


Error in verify(example, abs(weird) < 20 & abs(weird) > 10) :

  verification failed! (1 failure)

Checking the data type of a column

By default, most of the data import functions in R will attempt to guess the data type for each column at the import phase. This is usually nice, because it saves us from tedious work. However, it can backfire when there are, for example, stray punctuation marks in what are supposed to be numeric columns. To verify this, we can use the assert function with the is.numeric base function:

iris <- assert(iris, is.numeric, -Species)

We can use the is.character and is.logical functions with assert, too.

An alternative method that will disallow the import of unexpected data types is to specify the data type that each column should be at the data import phase with the colClasses optional argument:

iris <- read.csv("PATH_TO_IRIS_DATA.csv",
                 colClasses=c("numeric", "numeric",
                              "numeric", "numeric",
                              "factor"))


This solution comes with the added benefit of speeding up the data import process, since R doesn’t have to waste time guessing each column’s data type.

Checking for unexpected categories

Another data integrity impropriety that is, unfortunately, very common is the mislabeling of categorical variables. There are two types of mislabeling that can occur: an observation's class is mis-entered/mis-recorded/mistaken for that of another class, or the observation's class is labeled in a way that is inconsistent with the rest of the labels. To see an example of what we can do to combat the former case, read assertr's vignette. The latter case covers instances where, for example, the species of iris could be misspelled (such as "versicolour" or "verginica"), or where the pattern established by the majority of class names is ignored ("iris setosa", "i. setosa", "SETOSA").

Either way, these misspecifications prove to be a great bane to data analysts for several reasons. For example, an analysis that is predicated upon a two-class categorical variable (for example, logistic regression) will now have to contend with more than two categories. Unexpected categories can also haunt you when you produce statistics grouped by the different values of a categorical variable; if the categories were extracted from the main data manually—with subset, for example, as opposed to with by, tapply, or aggregate—you'll be missing potentially crucial observations.

If you know what categories you are expecting from the start, you can use the in_set function, in concert with assert, to confirm that all the categories of a particular column are squarely contained within a predetermined set.

# passes

iris <- assert(iris, in_set("setosa", "versicolor",

                            "virginica"), Species)


# mess up the data

iris.copy <- iris

# We have to make the 'Species' column not

# a factor

iris.copy$Species <- as.vector(iris$Species)

iris.copy$Species[4:9] <- "SETOSA"

iris.copy$Species[135] <- "verginica"

iris.copy$Species[95] <- "i. versicolor"


# fails

iris.copy <- assert(iris.copy, in_set("setosa", "versicolor",

                                      "virginica"), Species)



Vector 'Species' violates assertion 'in_set' 8 times (e.g. [SETOSA] at index 4)

If you don’t know the categories that you should be expecting, a priori, the following incantation, which will tell you how many rows each category contains, may help you identify the categories that are either rare or misspecified:

by(iris.copy, iris.copy$Species, nrow)

Checking for outliers, entry errors, or unlikely data points

Automatic outlier detection (sometimes known as anomaly detection) is something that a lot of analysts scoff at and view as a pipe dream. Though the creation of a routine that automagically detects all erroneous data points with 100 percent specificity and precision is impossible, unmistakably mis-entered data points and flagrant outliers are not hard to detect even with very simple methods. In my experience, there are a lot of errors of this type.

One simple way to detect the presence of a major outlier is to confirm that every data point is within some n number of standard deviations of the mean of the group. assertr has a function, within_n_sds—used in conjunction with the insist verb—to do just this; if we wanted to check that every numeric value in iris is within five standard deviations of its respective column's mean, we could express that as follows:

iris <- insist(iris, within_n_sds(5), -Species)

An issue with using standard deviations away from the mean (z-scores) for detecting outliers is that both the mean and standard deviation are influenced heavily by outliers; this means that the very thing we are trying to detect is obstructing our ability to find it.

There is a pair of measures of central tendency and dispersion that is more robust than the mean and standard deviation: the median and the median absolute deviation (MAD). The median absolute deviation of a vector is the median of the absolute differences between each of its elements and the vector's median.
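To see why this is robust, we can compute the MAD by hand for a small made-up vector; note that R's built-in mad function scales the result by 1.4826 by default (for consistency with the standard deviation under normality), so we set constant = 1 to match the raw definition:

```r
x <- c(1, 2, 3, 4, 100)          # 100 is a flagrant outlier

med     <- median(x)             # 3
raw_mad <- median(abs(x - med))  # 1; barely influenced by the outlier

# the built-in, unscaled, version agrees
mad(x, constant = 1)             # 1
```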

assertr has a sister to within_n_sds, within_n_mads, that checks every element of a vector to make sure it is within n median absolute deviations away from its column’s median.

iris <- insist(iris, within_n_mads(4), -Species)

iris$Petal.Length[5] <- 15

iris <- insist(iris, within_n_mads(4), -Species)



Vector 'Petal.Length' violates assertion 'within_n_mads' 1 time (value [15] at index 5)

In my experience, within_n_mads can be an effective guard against illegitimate univariate outliers if n is chosen carefully.

The examples here have been focusing on outlier identification in the univariate case—across one dimension at a time. Often, there are times where an observation is truly anomalous but it wouldn’t be evident by looking at the spread of each dimension individually. assertr has support for this type of multivariate outlier analysis, but a full discussion of it would require a background outside the scope of this text.

Chaining assertions

assertr aims to make the checking of assumptions so effortless that the user never feels the need to hold back any implicit assumption. Therefore, it's expected that users will run multiple checks on one data frame.

The usage examples that we’ve seen so far are really only appropriate for one or two checks. For example, a usage pattern such as the following is clearly unworkable:

iris <- insist(
          assert(
            assert(
              assert(iris, is.numeric, -Species),
              within_bounds(0, Inf), -Species),
            in_set("setosa", "versicolor", "virginica"), Species),
          within_n_mads(4), -Species)
To combat this visual cacophony, assertr provides direct support for chaining multiple assertions by using the “piping” construct from the magrittr package.

magrittr's pipe operator, %>%, works as follows: it takes the item on the left-hand side of the pipe and inserts it (by default) into the position of the first argument of the function on the right-hand side. The following are some examples of simple magrittr usage patterns:


4 %>% sqrt              # 2

iris %>% head(n=3)      # the first 3 rows of iris

iris <- iris %>% assert(within_bounds(0, Inf), -Species)

Since the return value of a passed assertr check is the validated data frame, you can use the magrittr pipe operator to tack on more checks in a way that lends itself to easier human understanding. For example:

iris <- iris %>%

  assert(is.numeric, -Species) %>%

  assert(within_bounds(0, Inf), -Species) %>%

  assert(in_set("setosa", "versicolor", "virginica"), Species) %>%

  insist(within_n_mads(4), -Species)


# or, equivalently


CHECKS <- . %>%

  assert(is.numeric, -Species) %>%

  assert(within_bounds(0, Inf), -Species) %>%

  assert(in_set("setosa", "versicolor", "virginica"), Species) %>%

  insist(within_n_mads(4), -Species)


iris <- iris %>% CHECKS

When chaining assertions, I like to put the most integral and general ones first. I also like to put the assertions that are most likely to be violated near the top, so that execution is terminated before any time is wasted on further checks.

There are many other capabilities built into assertr, including the multivariate outlier checking mentioned earlier. For more information about these, read the package's vignette (vignette("assertr")).

On the magrittr side, besides the forward-pipe operator, the package sports some other very helpful pipe operators. Additionally, magrittr allows the left-hand side to be substituted at locations other than the first argument of the right-hand-side function. For more information about the wonderful magrittr package, read its vignette.

Other messiness

As we discussed in this article’s preface, there are countless ways that a dataset may be messy. There are many other messy situations and solutions that we couldn’t discuss at length here. In order that you, dear reader, are not left in the dark regarding custodial solutions, here are some other remedies which you may find helpful along your analytics journey:


OpenRefine

Though OpenRefine (formerly Google Refine) doesn't have anything to do with R per se, it is a sophisticated tool for working with and cleaning up messy data. Among its numerous capabilities is the capacity to auto-detect misspelled or misspecified categories and fix them at the click of a button.

Regular expressions

Suppose you find that there are commas separating every third digit of the numbers in a numeric column. How would you remove them? Or suppose you needed to strip a currency symbol from values in columns that hold monetary values so that you can compute with them as numbers. These, and vastly more complicated text transformations, can be performed using regular expressions (a formal grammar for specifying the search patterns in text) and associate R functions like grep and sub. Any time spent learning regular expressions will pay enormous dividends over your career as an analyst, and there are many great, free tutorials available on the web for this purpose.
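For instance, a single gsub call handles both a currency symbol and thousands separators; the prices vector is invented for the example:

```r
prices <- c("$1,200.50", "$340", "$5,600")

# "[$,]" is a character class matching a literal "$" or ","
as.numeric(gsub("[$,]", "", prices))   # 1200.5  340.0 5600.0
```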


tidyr

There are a few different ways in which you can represent the same tabular dataset. In one form—called long, narrow, stacked, or entity-attribute-value model—each row contains an observation ID, a variable name, and the value of that variable. For example:


           member  attribute  value

1     Ringo Starr  birthyear   1940

2  Paul McCartney  birthyear   1942

3 George Harrison  birthyear   1943

4     John Lennon  birthyear   1940

5     Ringo Starr instrument  Drums

6  Paul McCartney instrument   Bass

7 George Harrison instrument Guitar

8     John Lennon instrument Guitar

In another form (called wide or unstacked), each of the observation’s variables are stored in each column:


           member birthyear instrument

1 George Harrison      1943     Guitar

2     John Lennon      1940     Guitar

3  Paul McCartney      1942       Bass

4     Ringo Starr      1940      Drums

If you ever need to convert between these representations (which is a somewhat common operation in practice), tidyr is your tool for the job.
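For instance, converting a slice of the long Beatles table above to wide form can be sketched with tidyr's spread (long-to-wide) and gather (wide-to-long) verbs; newer versions of tidyr call these pivot_wider and pivot_longer:

```r
library(tidyr)

# a two-member slice of the long-format Beatles table
long <- data.frame(
  member    = rep(c("Ringo Starr", "John Lennon"), times = 2),
  attribute = rep(c("birthyear", "instrument"), each = 2),
  value     = c("1940", "1940", "Drums", "Guitar"),
  stringsAsFactors = FALSE
)

wide <- spread(long, attribute, value)   # long -> wide
# gather() performs the reverse, wide -> long, conversion
```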


Exercises

The following are a few exercises for you to strengthen your grasp over the concepts learned in this article:

  • Normally, when there is missing data for a question such as “What is your income?”, we strongly suspect an MNAR mechanism, because we live in a dystopia that equates wealth with worth. As a result, the participants with the lowest income may be embarrassed to answer that question. In the relevant section, we assumed that because the question was poorly worded and we could account for whether English was the first language of the participant, the mechanism is MAR. If we were wrong about this reason, and it was really because the lower income participants were reticent to admit their income, what would the missing data mechanism be now? If, however, the differences in income were fully explained by whether English was the first language of the participant, what would the missing data mechanism be in that case?
  • Find a dataset on the web with missing data. What does it use to denote that data is missing? Think about that dataset’s missing data mechanism. Is there a chance that this data is MNAR?
  • Find a freely available government dataset on the web. Read the dataset’s description, and think about what assumptions you might make about the data when planning a certain analysis. Translate these into actual code so that R can check them for you. Were there any deviations from your expectations?
  • When two autonomous individuals decide to voluntarily trade, the transaction can be in both parties’ best interests. Does it necessarily follow that a voluntary trade between nations benefits both states? Why or why not?


Summary

“Messy data”—no matter what definition you use—present a huge roadblock for people who work with data. This article focused on two of the most notorious and prolific culprits: missing data and data that has not been cleaned or audited for quality.

On the missing data side, you learned how to visualize missing data patterns, and how to recognize different types of missing data. You saw a few unprincipled ways of tackling the problem, and learned why they were suboptimal solutions. Multiple imputation, as you learned, addresses the shortcomings of these approaches and, through its use of several imputed datasets, correctly communicates our uncertainty surrounding the imputed values.

On the unsanitized data side, we saw that the perhaps optimal solution—visually auditing the data—is untenable for moderately sized or larger datasets. We discovered that the grammar of the assertr package provides a mechanism to offload this auditing process to R. You now have a few assertr checking "recipes" under your belt for some of the more common manifestations of the mistakes that plague data that has not been scrutinized.
