
In this article by James D. Miller, the author of the book Big Data Visualization, we will explore the idea of adding context to the data you are working with.

Specifically, we’ll discuss the importance of establishing data context, the practice of profiling your data for context discovery, and how big data affects this effort.

The article is organized into the following main sections:

  • Adding Context
  • About R
  • R and Big Data
  • R Example 1 — healthcare data
  • R Example 2 — healthcare data


When writing a book, authors leave context clues for their readers. A context clue is a “source of information” that helps readers understand written content that may be difficult or unique. This information offers insight into the content being read or consumed (an example might be: “It was an idyllic day; sunny, warm and perfect…”).

With data, context clues should be developed through a process referred to as profiling (we’ll discuss profiling in more detail later in this article), so that the data consumer can better understand the data when it is visualized. (Additionally, having context and perspective on the data you are working with is a vital step in determining what kind of data visualization should be created.)

Context or profiling examples might include calculating the average age of “patients” or subjects within the data, or “segmenting the data into time periods” (usually years or months).

Another motive for adding context to data might be to gain a new perspective on the data. An example of this might be recognizing and examining a comparison present in the data. For example, body fat percentages of urban high school seniors could be compared to those of rural high school seniors.

Adding context to your data before creating visualizations can certainly make it (the data visualization) more relevant, but context still can’t serve as a substitute for value. Before you consider factors such as time of day, geographic location, or average age, your data visualization first and foremost needs to benefit those who are going to consume it, so establishing appropriate context requirements will be critical.

For data profiling (adding context), the rule is: Before Context, Think → Value

Generally speaking, there are several visualization contextual categories that can be used to augment or increase the value and understanding of data for visualization.

These include:

  • Definitions and explanations
  • Comparisons
  • Contrasts
  • Tendencies
  • Dispersion

Definitions and explanations

This is providing additional information or “attributes” about a data point.

For example, if the data contains a field named “patient ID” and we come to know that records describe individual patients, we may choose to calculate and add each individual patient’s BMI, or body mass index:
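Such a derived attribute can be sketched in R as follows (the small data frame and its column names here are illustrative assumptions, not fields from the article’s hospital file):

```r
# --- illustrative patient data; column names are assumptions
patients <- data.frame(patientID = c(101, 102),
                       weight = c(165, 210),    # pounds
                       height = c(70, 68))      # inches

# --- add a derived BMI attribute to each patient record
patients$BMI <- round((patients$weight / patients$height^2) * 703, 2)
```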

Comparisons

This is adding a comparable value to a particular data point. For example, you might compute and add a national ranking to each “total by state”:
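As a minimal sketch of this idea in R (the state totals below are invented for illustration):

```r
# --- illustrative totals by state; values are assumptions
totals <- data.frame(state = c("Ohio", "Texas", "Maine"),
                     total = c(5200, 9100, 1300))

# --- add a national ranking (1 = highest total) alongside each state total
totals$rank <- rank(-totals$total)
```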

Contrasts

This is almost like adding an “opposite” to a data point to see if it perhaps yields a different perspective. An example might be reviewing average body weights for patients who consume alcoholic beverages versus those who do not consume alcoholic beverages:
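One way to sketch such a contrast in R (the records and column names below are invented for illustration):

```r
# --- illustrative patient records; column names and values are assumptions
p <- data.frame(weight  = c(180, 210, 150, 165),
                drinker = c("Yes", "Yes", "No", "No"))

# --- contrast average body weight of drinkers versus non-drinkers
tapply(p$weight, p$drinker, mean)
```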

Tendencies

These are the “typical” mathematical calculations (or summaries) on the data as a whole or by other category within the data, such as Mean, Median, and Mode. For example, you might add a median heart rate for the age group each patient in the data is a member of:
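For instance, a minimal sketch in R, assuming an age-group column already exists (both columns here are illustrative assumptions):

```r
# --- illustrative data; column names and values are assumptions
p <- data.frame(ageGroup  = c("22-34", "22-34", "35-44"),
                heartrate = c(72, 80, 66))

# --- attach the median heart rate of each patient's age group
p$groupMedianHR <- ave(p$heartrate, p$ageGroup, FUN = median)
```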

Dispersion

Again, these are mathematical calculations (or summaries), such as Range, Variance, and Standard Deviation, but they describe the spread or variability of a data set (or group within the data). For example, you may want to add the “range” for a selected value, such as the minimum and maximum number of hospital stays found in the data for each patient age group:
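A minimal sketch in R (again with invented columns and values):

```r
# --- illustrative data; column names and values are assumptions
p <- data.frame(ageGroup = c("22-34", "22-34", "35-44"),
                visits   = c(1, 7, 3))

# --- minimum and maximum hospital visits within each age group
aggregate(visits ~ ageGroup, data = p, FUN = range)
```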

The “art” of profiling data to add context and identify new and interesting perspectives for visualization is still evolving; no doubt there are additional contextual categories beyond those listed here that can be investigated as you continue your work with big data visualization projects.

Adding Context

So, how do we add context to data? …is it merely select Insert, then Data Context?

No, it’s not that easy (but it’s not impossible either).

Once you have identified (or “pulled together”) your big data source (or at least a significant amount of data), how do you go from mountains of raw big data to summarizations that can be used as input to create valuable data visualizations, helping you to further analyze that data and support your conclusions?

The answer is through data profiling.

Data profiling involves logically “getting to know” the data you think you may want to visualize – through query, experimentation & review.

Following the profiling process, you can then use the information you have collected to add context (and/or apply new “perspectives”) to the data. Adding context to data requires manipulating that data: reformatting it, adding calculations, aggregations, or additional columns, re-ordering it, and so on.

Finally, you will be ready to visualize (or “picture”) your data.

The complete profiling process is shown below:

  1. Pull together (the data, or enough of the data)
  2. Profile (the data, through query, experimentation, and review)
  3. Add perspective(s) (or context), and finally…
  4. Picture (visualize) the data

About R

R is a language and environment that is easy to learn, very flexible in nature, and very focused on statistical computing, making it great for manipulating, cleaning, and summarizing data and producing probability statistics (as well as actually creating visualizations with your data). So it’s a great choice for the exercises required for profiling, establishing context, and identifying additional perspectives.

In addition, here are a few more reasons to use R when profiling your big data:

  • R is used by a large number of academic statisticians, so it’s a tool that is not “going away”
  • R is pretty much platform independent; what you develop will run almost anywhere
  • R has awesome help resources; just Google it and you’ll see!

R and Big Data

Although R is free (open source), super flexible, and feature rich, you must keep in mind that R keeps everything in your machine’s memory, and this can become problematic when you are working with big data (even given the low cost of memory today).

Thankfully, there are various options and strategies to “work with” this limitation, such as employing a sort of “pseudo-sampling” technique, which we will expound on later in this article (as part of some of the examples provided).

Additionally, R libraries have been developed that can leverage hard drive space (as a sort of virtual extension to your machine’s memory), again exposed in this article’s examples.
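The pseudo-sampling idea can be sketched as follows (this assumes a table object named tmpRTable has already been loaded, as in the later examples; the 1% fraction is an arbitrary choice):

```r
# --- profile a random 1% sample of rows rather than the full table
set.seed(42)   # make the sample repeatable
sampleRows <- sample(nrow(tmpRTable),
                     size = ceiling(nrow(tmpRTable) * 0.01))
tmpSample  <- tmpRTable[sampleRows, ]
```

Counts and averages computed on tmpSample will only approximate those of the full file, but they are often good enough for profiling.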

Example 1

In this article’s first example, we’ll use data collected from a theoretical hospital where, upon admission, patient medical history information is collected through an online survey. Information is also added to a “patients file” as treatment is provided.

The file includes many fields including basic descriptive data for the patient such as:

  • sex,
  • date of birth,
  • height,
  • weight,
  • blood type,
  • etc.

Vital statistics such as:

  • blood pressure,
  • heart rate,
  • etc.

Medical history such as:

  • number of hospital visits,
  • surgeries,
  • major illnesses or conditions,
  • currently under a doctor’s care,
  • etc.

Demographical statistics such as:

  • occupation,
  • home state,
  • educational background,
  • etc.

Some additional information is also collected in the file in an attempt to capture patient characteristics and habits, such as the number of times the patient included beef, pork, and fowl in their weekly diet, or whether they typically use a butter replacement product, and so on.

Periodically, the data is “dumped” to comma-delimited text files that contain the following fields (in this order):

Patientid, recorddate_month, recorddate_day, recorddate_year, sex, age, weight, height, no_hospital_visits, heartrate, state, relationship, Insured, Bloodtype, blood_pressure, Education, DOBMonth, DOBDay, DOBYear, current_smoker, current_drinker, currently_on_medications, known_allergies, currently_under_doctors_care, ever_operated_on, occupation, Heart_attack, Rheumatic_Fever, Heart_murmur, Diseases_of_the_arteries, Varicose_veins, Arthritis, abnormal_bloodsugar, Phlebitis, Dizziness_fainting, Epilepsy_seizures, Stroke, Diphtheria, Scarlet_Fever, Infectious_mononucleosis, Nervous_emotional_problems, Anemia, hyroid_problems, Pneumonia, Bronchitis, Asthma, Abnormal_chest_Xray, lung_disease, Injuries_back_arms_legs_joints_Broken_bones, Jaundice_gallbladder_problems, Father_alive, Father_current_age, Fathers_general_health, Fathers_reason_poor_health, Fathersdeceased_age_death, mother_alive, Mother_current_age, Mother_general_health, Mothers_reason_poor_health, Mothers_deceased_age_death, No_of_brothers, No_of_sisters, age_range, siblings_health_problems, Heart_attacks_under_50, Strokes_under_50, High_blood_pressure, Elevated_cholesterol, Diabetes, Asthma_hayfever, Congenital_heart_disease, Heart_operations, Glaucoma, ever_smoked_cigs, cigars_or_pipes, no_cigs_day, no_cigars_day, no_pipefuls_day, if_stopped_smoking_when_was_it, if_still_smoke_how_long_ago_start, target_weight, most_ever_weighed, 1_year_ago_weight, age_21_weight, No_of_meals_eatten_per_day, No_of_times_per_week_eat_beef, No_of_times_per_week_eat_pork, No_of_times_per_week_eat_fish, No_of_times_per_week_eat_fowl, No_of_times_per_week_eat_desserts, No_of_times_per_week_eat_fried_foods, No_servings_per_week_wholemilk, No_servings_per_week_2%_milk, No_servings_per_week_tea, No_servings_per_week_buttermilk, No_servings_per_week_1%_milk, No_servings_per_week_regular_or_diet_soda, No_servings_per_week_skim_milk, No_servings_per_week_coffee, No_servings_per_week_water, beer_intake, wine_intake, liquor_intake, use_butter, use_extra_sugar, use_extra_salt, different_diet_weekends, activity_level, sexually_active, vision_problems, wear_glasses

Following is the image showing a portion of the file (displayed in MS Windows notepad):

Assuming we have been given no further information about the data, other than the provided field name list and the knowledge that the data is captured by hospital personnel upon patient admission, the next step would be to perform some sort of profiling of the data: investigating to start understanding the data, and then starting to add context and perspectives (so that ultimately we can create some visualizations).

Initially, we start out by looking through the field or column names in our file and some ideas start to come to mind. For example:

What is the data time-frame we are dealing with? Using the field record date, can we establish a period of time (or time frame) for the data? (In other words, over what period of time was this data captured).

Can we start “grouping the data” using fields such as sex, age and state?

Eventually, what we should be asking is, “what can we learn from visualizing the data?” Perhaps:

  • What is the breakdown of those currently smoking by age group?
  • What is the ratio of those currently smoking to the number of hospital visits?
  • Do those patients currently under a doctor’s care, on average have better BMI ratios?

And so on.

Dig-in with R

Using the power of R programming, we can run various queries on the data, noting that the results of those queries may spawn additional questions and queries and, eventually, yield data ready for visualizing.

Let’s start with a few simple profile queries. I always start my data profiling by “time boxing” the data.

The following R scripts (although as mentioned earlier, there are many ways to accomplish the same objective) work well for this:

# --- read our file into a temporary R table

tmpRTable4TimeBox<-read.table(file="C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt", sep=",")

# --- convert to an R data frame and filter it to just include # --- the 2nd column or field of data

data.df <- data.frame(tmpRTable4TimeBox)

data.df <- data.df[,2]

# --- provides a sorted list of the years in the file

YearsInData = substr(substr(data.df[],(regexpr('/',data.df[])+1),11),( regexpr('/',substr(data.df[],(regexpr('/',data.df[])+1),11))+1),11)

# -- write a new file named ListofYears

write.csv(sort(unique(YearsInData)),file="C:/Big Data Visualization/Chapter 3/ListofYears.txt",quote = FALSE, row.names = FALSE)

The above simple R script provides a sorted list file (ListofYears.txt) (shown below) containing the years found in the data we are profiling:

Now we can see that our file covers patient surveys collected during the years 1999 through 2016, and with this information we can start to add context to (or gain a perspective on) our data.

We could further time-box the data by perhaps breaking the years into months (we will do this later on in this article) but let’s move on now to some basic “grouping profiling”.

Assuming that each record in our data represents a unique hospital visit, how can we determine the number of hospital visits (the number of records) by sex, age and state?

Here I will point out that it may be worthwhile establishing the size of the file you are working with: the number of rows or records (we already know the number of columns or fields). This is important since the size of the data file will dictate the programming or scripting approach you will need to use during your profiling.

Two simple R functions valuable to know are nrow and head. nrow can be used to count the total rows in a file:

nrow(mydata)

Or, to view the first n rows of data:

head(mydata, n = 10)

So, using R, one could write a script to load the data into a table, convert it to a data frame and then read through all the records in the file and “count up” or “tally” the number of hospital visits (the number of records) for males and females.

Such logic is a snap to write:

# --- assuming tmpRTable holds the data already

datas.df<-data.frame(tmpRTable)

# --- initialize 2 counter variables

NumberMaleVisits <-0;NumberFemaleVisits <-0

# --- read through the data

for(i in 1:nrow(datas.df))

{

if (datas.df[i,3] == 'Male') {NumberMaleVisits <- NumberMaleVisits + 1}

if (datas.df[i,3] == 'Female') {NumberFemaleVisits <- NumberFemaleVisits + 1}

}

# --- show me the totals

NumberMaleVisits

NumberFemaleVisits

The previous script works, but in a big data scenario, there is a more efficient way, since reading or “looping through” and counting each record will take far too long. Thankfully, R provides the table function that can be used similar to the SQL “group by” command.

The following script assumes that our data is already in an R data frame (named datas.df), so using the sequence number of the field in the file, if we want to see the number of hospital visits for Males and the number of hospital visits for Females we can write:

# --- using R table function as "group by" field number

# --- patient sex is the 3rd field in the file

table(datas.df[,3])

Following is the output generated from running the above script. Notice that R shows “sex” with a count of 1, since the script included the file’s header record as a unique value:

We can also establish the number of hospital visits by state (state is the 9th field in the file):

table(datas.df[,9]) 

Age (or the fourth field in the file) can also be studied using the R functions sort and table:

sort(table(datas.df[,4]))

Note that since there are quite a few more values for age within the file, I’ve sorted the output using the R sort function.

Moving on now, let’s see if there is a difference between the number of hospital visits for patients who are current smokers (the field named current_smoker, which is field number 16 in the file) and those who indicate that they are not current smokers.

We can use the same R scripting logic:

sort(table(datas.df[,16]))

Surprisingly (one might think) it appears from our profiling that those patients who currently do not smoke have had more hospital visits (113,681) than those who currently are smokers (12,561):

Another interesting R script to continue profiling our data might be:

table(datas.df[,3],datas.df[,16])

The above script again uses the R table function to group data, but shows how we can “group within a group”; in other words, using this script we can get totals for “current” and “non-current” smokers, grouped by sex.

In the below image we see that the difference between female smokers and male smokers might be considered to be marginal:

So we see that by using the above simple R script examples, we’ve been able to add some context to our healthcare survey data. By reviewing the list of fields provided in the file we can come up with the R profiling queries shown (and many others) without much effort. We will continue with some more complex profiling in the next section, but for now, let’s use R to create a few data visualizations – based upon what we’ve learned so far through our profiling.

Going back to the number of hospital visits by sex, we can use the R function barplot to create a visualization of visits by sex. But first, a couple of “helpful hints” for creating the script.

First, rather than using the table function, you can use the ftable function which creates a “flat” version of the original function’s output. This makes it easier to exclude the header record count of 1 that comes back from the table function.

Next, we can leverage some additional arguments of the barplot function, such as col, border, and names.arg, along with the title function, to make the visualization a little “nicer to look at”.

Below is the script:

# -- use ftable function to drop out the header record

forChart<- ftable(datas.df[,3])

# --- create bar names

barnames<-c("Female","Male")

# -- use barplot to draw bar visual

barplot(forChart[2:3], col = "brown1", border = TRUE, names.arg = barnames)

# --- add a title

title(main = list("Hospital Visits by Sex", font = 4))

The scripts output (our visualization) is shown below:

We could follow the same logic for creating a similar visualization of hospital visits by state:

st<-ftable(datas.df[,9])

barplot(st)

title(main = list("Hospital Visits by State", font = 2))

But the visualization generated isn’t very clear:

One can always experiment a bit more with this data to make the visualization a little more interesting. Using the R functions substr and regexpr, we can create an R data frame that contains a record for each hospital visit by state within each year in the file. Then we can use the function plot (rather than barplot) to generate the visualization.

Below is the R script:

# --- create a data frame from our original table file

datas.df <- data.frame(tmpRTable)

# --- create a filtered data frame of records from the file

# --- using the record year and state fields from the file

dats.df<-data.frame(substr(substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11),( regexpr('/',substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11))+1),11),datas.df[,9])

# --- plot to show a visualization

plot(sort(table(dats.df[2]),decreasing = TRUE),type="o", col="blue")

title(main = list("Hospital Visits by State (Highest to Lowest)", font = 2))

Here is the different (perhaps more interesting) version of the visualization generated by the previous script:

Another earlier perspective on the data was concerning Age. We grouped the hospital visits by the age of the patients (using the R table function). Since there are many different patient ages, a common practice is to establish age ranges, such as the following:

  • 21 and under
  • 22 to 34
  • 35 to 44
  • 45 to 54
  • 55 to 64
  • 65 and over

To implement the previous age ranges, we need to organize the data and could use the following R script:

# --- initialize age range counters

a1 <-0;a2 <-0;a3 <-0;a4 <-0;a5 <-0;a6 <-0

# --- read and count visits by age range

for(i in 2:nrow(datas.df))

{

if (as.numeric(datas.df[i,4]) < 22) {a1 <- a1 + 1}

if (as.numeric(datas.df[i,4]) > 21 & as.numeric(datas.df[i,4]) < 35) {a2 <- a2 + 1}

if (as.numeric(datas.df[i,4]) > 34 & as.numeric(datas.df[i,4]) < 45) {a3 <- a3 + 1}

if (as.numeric(datas.df[i,4]) > 44 & as.numeric(datas.df[i,4]) < 55) {a4 <- a4 + 1}

if (as.numeric(datas.df[i,4]) > 54 & as.numeric(datas.df[i,4]) < 65) {a5 <- a5 + 1}

if (as.numeric(datas.df[i,4]) > 64) {a6 <- a6 + 1}

}

Big Data Note: Looping or reading through each of the records in our file isn’t very practical if there are a trillion records. Later in this article we’ll use a much better approach, but for now we’ll assume a smaller file size for convenience.

Once the above script is run, we can use the R pie function and the following code to create our pie chart visualization:

# --- create Pie Chart

slices <- c(a1, a2, a3, a4, a5, a6)

lbls <- c("under 21", "22-34","35-44","45-54","55-64", "65 & over")

pie(slices, labels = lbls, main="Hospital Visits by Age Range")

Following is the generated visualization:

Finally, earlier in this section we looked at the values in field 16 of our file – which indicates whether the survey patient was a current smoker. We could build a simple visual showing the totals, but (again) the visualization isn’t very interesting or all that informative.

With some simple R scripts, we can proceed to create a visualization showing the number of hospital visits, year-over-year by those patients that are current smokers.

First, we can “reformat” the data in our R data frame (named datas.df) to store only the year (of the record date) using the R function substr. This makes it a little easier to aggregate the data by year, as shown in the next steps.

The R script using the substr function is shown below:

# --- redefine the record date field to hold just the record

# --- year value

datas.df[,2]<-substr(substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11),( regexpr('/',substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11))+1),11)

Next, we can create an R table named c to hold the record date year and totals (of non and current smokers) for each year.

Following is the R script used:

# --- create a table holding record year and total count for

# --- smokers and not smoking

c<-table(datas.df[,2],datas.df[,16])

Finally, we can use the R barplot function to create our visualization.

Again, there is more than likely a cleverer way to set up the objects bars and lbls, but for now, I simply hand-coded the years’ data I wanted to see in my visualization:

# --- set up the values to chart and the labels for each bar

# --- in the chart

bars<-c(c[2,3], c[3,3], c[4,3],c[5,3],c[6,3],c[7,3],c[8,3],c[9,3],c[10,3],c[11,3],c[12,3],c[13,3])

lbls<-c("99","00","01","02","03","04","05","06","07","08","09","10")

Now the R script to actually produce the bar chart visualization is shown below:

# --- create the bar chart

barplot(bars, names.arg=lbls, col="red")

title(main = list("Smoking Patients Year to Year", font = 2))

Below is the generated visualization:

Example 2

In the above examples, we’ve presented some pretty basic and straightforward data profiling exercises. Typically, once you’ve become somewhat familiar with your data, having added some context (through some basic profiling), you would extend the profiling process, trying to look at the data in additional ways using techniques such as those mentioned in the beginning of this article: defining new data points based upon the existing data, performing comparisons, looking at contrasts (between data points), identifying tendencies, and using dispersions to establish the variability of the data.

Let’s now review some of these options for extended profiling using simple examples as well as the same source data as was used in the previous section examples.

Definitions & Explanations

One method of extending your data profiling is to “add to” the existing data by creating additional definition or explanatory “attributes” (in other words add new fields to the file). This means that you use existing data points found in the data to create (hopefully new and interesting) perspectives on the data.

In the data used in this article, a thought-provoking example might be to use the existing patient information (such as the patient’s weight and height) to calculate a new point of data: body mass index (BMI) information.

A generally accepted formula for calculating a patient’s body mass index is:

BMI = (Weight (lbs.) / (Height (in.))²) x 703

For example: (165 lbs.) / (70 in.)² x 703 = 23.67 BMI.

Using the above formula, we can use the following R script (assuming we’ve already loaded the R object named tmpRTable with our file data) to generate a new file of BMI percentages and state names:

# --- initialize an empty data frame to hold the results

datas2.df<-data.frame(BMI=character(),state=character(),stringsAsFactors=FALSE)

j=1

for(i in 2:nrow(tmpRTable))

{

W<-as.numeric(as.character(tmpRTable[i,5]))

H<-as.numeric(as.character(tmpRTable[i,6]))

P<-(W/(H^2)*703)

datas2.df[j,1]<-format(P,digits=3)

datas2.df[j,2]<-tmpRTable[i,9]

j=j+1

}

write.csv(datas2.df[1:j-1,1:2],file="C:/Big Data Visualization/Chapter 3/BMI.txt", quote = FALSE, row.names = FALSE)

Below is a portion of the generated file:

Now we have a new file of BMI percentages by state (one BMI record for each hospital visit in each state).

Earlier in this article we touched on the concept of looping or reading through all of the records in a file or data source and creating counts based on various field or column values. Such logic works fine for medium or smaller files but a much better approach (especially with big data files) would be to use the power of various R commands.

No Looping

Although the above described R script does work, it requires looping through each record in our file which is slow and inefficient to say the least. So, let’s consider a better approach.

Again, assuming we’ve already loaded the R object named tmpRTable with our data, the below R script can accomplish the same results (create the same file) in just 2 lines:

PDQ<-paste(format((as.numeric(as.character(tmpRTable[,5]))/(as.numeric(as.character(tmpRTable[,6]))^2)*703),digits=2),',',tmpRTable[,9],sep="")

write.csv(PDQ,file="C:/Big Data Visualization/Chapter 3/BMI.txt", quote = FALSE,row.names = FALSE)

We could now use this file (or one similar) as input to additional profiling exercises or to create a visualization, but let’s move on.

Comparisons

Performing comparisons during data profiling can also add new and different perspectives to the data. Beyond simple record counts (like the total smoking patients visiting a hospital versus the total non-smoking patients visiting a hospital), one might compare the total number of hospital visits for each state to the average number of hospital visits per state. This would require calculating the total number of hospital visits by state as well as the total number of hospital visits overall (then computing the average).

The following 2 lines of code use the R functions table and write.csv to create a list (a file) of the total number of hospital visits found for each state:

# --- calculates the number of hospital visits for each

# --- state (state ID is in field 9 of the file)

StateVisitCount<-table(datas.df[9])

# --- write out a csv file of counts by state

write.csv (StateVisitCount, file="C:/Big Data Visualization/Chapter 3/visitsByStateName.txt", quote = FALSE, row.names = FALSE)

Below is a portion of the file that is generated:

The following R command can be used to calculate the average number of hospital visits per state, by using the nrow function to obtain a count of records in the data source and then dividing it by the number of states:

# --- calculate the average

averageVisits<-nrow(datas.df)/50

Going a bit further with this line of thinking, you might consider that the nine states the U.S. Census Bureau designates as the Northeast region are Connecticut, Maine, Massachusetts, New Hampshire, New York, New Jersey, Pennsylvania, Rhode Island and Vermont. What is the total number of hospital visits recorded in our file for the northeast region?

R makes it simple with the subset function:

# --- use subset function and the “OR” operator to only have

# --- northeast region states in our list

NERVisits<-subset(tmpRTable, as.character(V9)=="Connecticut" | as.character(V9)=="Maine"

| as.character(V9)=="Massachusetts"

| as.character(V9)=="New Hampshire"

| as.character(V9)=="New York"

| as.character(V9)=="New Jersey"

| as.character(V9)=="Pennsylvania"

| as.character(V9)=="Rhode Island"

| as.character(V9)=="Vermont")

Extending our scripting we can add some additional queries to calculate the average number of hospital visits for the northeast region and the total country:

AvgNERVisits<-nrow(NERVisits)/9

averageVisits<-nrow(tmpRTable)/50

And let’s add a visualization:

# --- the c object is the data for the barplot function to

# --- graph

c<-c(AvgNERVisits, averageVisits)

# --- use R barplot

barplot(c, ylim=c(0,3000), 

ylab="Average Visits", border="Black",

names.arg = c("Northeast","all"))

title("Northeast Region vs Country")

The generated visualization is shown below:

Contrasts

The examination of contrasting data is another form of extending data profiling.

For example, using this article’s data, one could contrast the average body weight of patients who are under a doctor’s care against the average body weight of patients who are not under a doctor’s care (after calculating average body weights for each group).

To accomplish this, we can calculate the average weights for patients that fall into each category (those currently under a doctor’s care and those not currently under a doctor’s care) as well as for all patients, using the following R script:

# --- read in our entire file

tmpRTable<-read.table(file="C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt",sep=",")

# --- use the subset function to create the 2 groups we are

# --- interested in

UCare.sub<-subset(tmpRTable, V20=="Yes")

NUCare.sub<-subset(tmpRTable, V20=="No")

# --- use the mean function to get the average body weight of all patients in the file as well as for each of our separate groups

average_undercare<-mean(as.numeric(as.character(UCare.sub[,5])))

average_notundercare<-mean(as.numeric(as.character(NUCare.sub[,5])))

averageoverall<-mean(as.numeric(as.character(tmpRTable[2:nrow(tmpRTable),5])))

average_undercare;average_notundercare;averageoverall 

In “short order”, we can use R’s ability to create subsets (using the subset function) of the data based upon values in a certain field (or column), then use the mean function to calculate the average patient weight for the group.

The results from running the script (the calculated average weights) are shown below:

And if we use the calculated results to create a simple visualization:

# --- gather the three calculated averages into c for charting

c<-c(average_undercare, average_notundercare, averageoverall)

# --- use R barplot to create the bar graph of

# --- average patient weight

barplot(c, ylim=c(0,200), ylab="Patient Weight", border="Black", names.arg = c("under care","not under care", "all"), legend.text= c(format(c[1],digits=5),format(c[2],digits=5),format(c[3],digits=5)))

title("Average Patient Weight")

Tendencies

Identifying tendencies present within your data is also an interesting way of extending data profiling. For example, using this article’s sample data, you might determine the number of servings of water consumed per week by each patient age group.

Earlier in this section we created a simple R script to count visits by age group; it worked, but may not scale in a big data scenario. A better approach would be to categorize the data into age groups (age is the fourth field, or column, in the file) using the following script:

# --- build subsets of each age group
agegroup1<-subset(tmpRTable, as.numeric(V4)<22)
agegroup2<-subset(tmpRTable, as.numeric(V4)>21 & as.numeric(V4)<35)
agegroup3<-subset(tmpRTable, as.numeric(V4)>34 & as.numeric(V4)<45)
agegroup4<-subset(tmpRTable, as.numeric(V4)>44 & as.numeric(V4)<55)
agegroup5<-subset(tmpRTable, as.numeric(V4)>54 & as.numeric(V4)<66)
agegroup6<-subset(tmpRTable, as.numeric(V4)>64)
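As an aside (this is not part of the original script), the six subset calls above can be collapsed into a single cut call, which bins ages into factor levels in one pass; this often scales better as the number of groups grows:

```r
# --- illustrative ages; in the article these would come from column V4
ages <- c(18, 25, 40, 47, 60, 70)

# --- one cut() call replaces the six subset() calls;
# --- breaks mirror the boundaries used in the script above
agegroup <- cut(ages,
                breaks = c(-Inf, 21, 34, 44, 54, 65, Inf),
                labels = c("<=21", "22-34", "35-44", "45-54", "55-65", ">65"))

table(agegroup)  # counts of patients per age group
```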

After we have our grouped data, we can calculate water consumption. For example, to count the total weekly servings of water (which is in field or column 96) for age group 1 we can use:

# --- field 96 in the file is the number of servings of water
# --- below line counts the total number of water servings for
# --- age group 1
sum(as.numeric(agegroup1[,96]))

Or the average number of servings of water for the same age group:

mean(as.numeric(agegroup1[,96]))

Take note that R requires the value of field 96 to be explicitly converted to a number using the as.numeric function, even though it arrives in the file as a number.
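The reason for the double conversion used elsewhere in these scripts (as.numeric(as.character(...))) is that, prior to R 4.0, read.table imported text columns as factors by default, and as.numeric applied directly to a factor returns its internal level codes rather than the printed values. A small sketch of the pitfall:

```r
# --- a factor holding numbers as text, as read.table may produce
x <- factor(c("5", "10", "2"))

as.numeric(x)                # level codes: 3 1 2 -- not the values!
as.numeric(as.character(x))  # the actual numbers: 5 10 2
```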

Now, let's create the visualization of this perspective of our data. Below is the R script used to generate it:

# --- group the data into age groups
agegroup1<-subset(tmpRTable, as.numeric(V4)<22)
agegroup2<-subset(tmpRTable, as.numeric(V4)>21 & as.numeric(V4)<35)
agegroup3<-subset(tmpRTable, as.numeric(V4)>34 & as.numeric(V4)<45)
agegroup4<-subset(tmpRTable, as.numeric(V4)>44 & as.numeric(V4)<55)
agegroup5<-subset(tmpRTable, as.numeric(V4)>54 & as.numeric(V4)<66)
agegroup6<-subset(tmpRTable, as.numeric(V4)>64)
# --- calculate the averages by group
g1<-mean(as.numeric(agegroup1[,96]))
g2<-mean(as.numeric(agegroup2[,96]))
g3<-mean(as.numeric(agegroup3[,96]))
g4<-mean(as.numeric(agegroup4[,96]))
g5<-mean(as.numeric(agegroup5[,96]))
g6<-mean(as.numeric(agegroup6[,96]))
# --- create the visualization
barplot(c(g1,g2,g3,g4,g5,g6),
  axisnames=TRUE, names.arg = c("<21", "22-34", "35-44", "45-54", "55-64", ">65"))
title("Glasses of Water by Age Group")

The generated visualization is shown below:

Dispersion

Finally, dispersion is yet another method of extended data profiling.

Dispersion measures how the selected elements behave with regard to some sort of central tendency, usually the mean. For example, we might look at the total number of hospital visits for each age group, per calendar month, in relation to the average number of hospital visits per month.
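A standard way to quantify dispersion is the standard deviation, which R computes with the built-in sd function. A minimal sketch, using invented monthly visit counts:

```r
# --- invented monthly visit counts for one age group
visits <- c(1200, 1350, 1100, 1500, 1250, 1400)

mean(visits)  # the central tendency: 1300
sd(visits)    # how far the months spread around that mean
```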

For this example, we can use the R subset function in our scripts (to define the age groups and then group the hospital records by those groups), just as we did in the last example. Below is the script, showing the calculation for each group:

agegroup1<-subset(tmpRTable, as.numeric(V4)<22)
agegroup2<-subset(tmpRTable, as.numeric(V4)>21 & as.numeric(V4)<35)
agegroup3<-subset(tmpRTable, as.numeric(V4)>34 & as.numeric(V4)<45)
agegroup4<-subset(tmpRTable, as.numeric(V4)>44 & as.numeric(V4)<55)
agegroup5<-subset(tmpRTable, as.numeric(V4)>54 & as.numeric(V4)<66)
agegroup6<-subset(tmpRTable, as.numeric(V4)>64)

Remember, the previous scripts create subsets of the entire file (which we loaded into the object tmpRTable), and each subset retains all of the file's fields.

The agegroup1 group is partially displayed as follows:

Once we have our data categorized by age group (agegroup1 through agegroup6), we can calculate a count of hospital stays by month for each group (shown in the following R commands). Note that the substr function is used to extract the month code (the first 3 characters of the record date), since we (for now) don't care about the year.

The table function then can be used to create an array of counts by month.

az1<-table(substr(agegroup1[,2],1,3))
az2<-table(substr(agegroup2[,2],1,3))
az3<-table(substr(agegroup3[,2],1,3))
az4<-table(substr(agegroup4[,2],1,3))
az5<-table(substr(agegroup5[,2],1,3))
az6<-table(substr(agegroup6[,2],1,3))
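To see what substr and table are doing, consider a small vector of invented record dates in a month-first format like the one the article's file uses:

```r
# --- invented dates; only the first three characters (the month
# --- code) matter here
dates <- c("Jan012016", "Feb152016", "Jan202016", "Mar032016", "Jan092016")

# --- extract the month code from each date
months <- substr(dates, 1, 3)

table(months)  # counts of records per month: Feb 1, Jan 3, Mar 1
```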

Using the above month totals, we can then calculate the average number of hospital visits for each month using the R function mean. This is the mean of the January totals across ALL age groups (note that the six totals must be combined into a single vector with c before being passed to mean):

JanAvg<-mean(c(az1["Jan"], az2["Jan"], az3["Jan"], az4["Jan"], az5["Jan"], az6["Jan"]))

Note that the above pattern can be repeated to calculate an average for each of the twelve months.
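Rather than writing that line twelve times, the per-month averages can be computed in one sapply call. The sketch below uses invented stand-in values for the az1 through az6 tables (restricted to two months for brevity):

```r
# --- stand-ins for the az1..az6 month-count tables (invented counts)
az1 <- c(Jan = 10, Feb = 12); az2 <- c(Jan = 20, Feb = 18)
az3 <- c(Jan = 30, Feb = 24); az4 <- c(Jan = 40, Feb = 36)
az5 <- c(Jan = 50, Feb = 48); az6 <- c(Jan = 60, Feb = 54)

month.codes <- c("Jan", "Feb")  # extend to all twelve in practice

# --- one sapply call replaces twelve hand-written mean() lines
monthly.avg <- sapply(month.codes, function(m)
  mean(c(az1[m], az2[m], az3[m], az4[m], az5[m], az6[m])))

monthly.avg  # named vector of averages: Jan 35, Feb 32
```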

Next we can calculate the totals for each month, for each age group:

Janag1<-az1["Jan"];Febag1<-az1["Feb"];Marag1<-az1["Mar"];Aprag1<-az1["Apr"];Mayag1<-az1["May"];Junag1<-az1["Jun"]
Julag1<-az1["Jul"];Augag1<-az1["Aug"];Sepag1<-az1["Sep"];Octag1<-az1["Oct"];Novag1<-az1["Nov"];Decag1<-az1["Dec"]

The following code "stacks" the totals so we can more easily visualize them later; we would have one such line for each age group (that is, Group1Visits, Group2Visits, and so on):

Monthly_Visits<-c(JanAvg, FebAvg, MarAvg, AprAvg, MayAvg, JunAvg, JulAvg, AugAvg, SepAvg, OctAvg, NovAvg, DecAvg)
Group1Visits<-c(Janag1,Febag1,Marag1,Aprag1,Mayag1,Junag1,Julag1,Augag1,Sepag1,Octag1,Novag1,Decag1)
Group2Visits<-c(Janag2,Febag2,Marag2,Aprag2,Mayag2,Junag2,Julag2,Augag2,Sepag2,Octag2,Novag2,Decag2)

Finally, we can now create the visualization:

plot(Monthly_Visits, ylim=c(1000,4000))
lines(Group1Visits, type="b", col="red")
lines(Group2Visits, type="b", col="purple")
lines(Group3Visits, type="b", col="green")
lines(Group4Visits, type="b", col="yellow")
lines(Group5Visits, type="b", col="pink")
lines(Group6Visits, type="b", col="blue")
title("Hospital Visits", sub = "Month to Month",
      cex.main = 2, font.main= 4, col.main= "blue", cex.sub = 0.75, font.sub = 3, col.sub = "red")

and enjoy the generated output:

Summary

In this article we covered the idea and importance of establishing context for, and identifying perspectives on, big data, using data profiling with R.

Additionally, we introduced and explored the R Programming language as an effective means to profile big data and used R in numerous illustrative examples.

Once again, R is an extremely flexible and powerful tool that works well for data profiling. The reader would be well served by researching and experimenting with the language's vast libraries available today, as we have only scratched the surface of the features currently available.
