Performing descriptive analysis with SAS

This article is an excerpt from a book written by David Pope titled Big Data Analysis with SAS. This book will help you combine SAS with platforms such as Hadoop, SAP HANA, and Cloud Foundry-based platforms for efficient Big Data analytics.

In today’s tutorial, we will perform descriptive analysis using SAS with practical use-cases.

The following are a few examples of descriptive analysis. Let us take a look at each one in
detail.

PROC FREQ

How many males versus females are in a particular table, say SASHELP.CLASS? PROC FREQ
can be used to easily find the answer to this type of question. Type the following code in a
SAS Studio program section and submit it:

/* One-way frequency table of the SEX variable */
proc freq data=sashelp.class;
   tables sex;
run;

performing-descriptive-analysis-with-sas-img-0

If you remove the tables statement, PROC FREQ by default produces a one-way frequency table for every variable in the dataset.
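
As a minimal sketch extending the example above, the tables statement can also cross two variables to produce a two-way table. The choice of age as the second variable and the options that suppress the percentage cells are illustrative assumptions, not part of the original example.

```sas
/* Two-way frequency table: count of each sex within each age.      */
/* AGE and the NOROW/NOCOL/NOPERCENT options are illustrative only. */
proc freq data=sashelp.class;
   tables sex*age / norow nocol nopercent;
run;
```

Dropping the three options after the slash restores the default cell contents, which include row, column, and overall percentages.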

PROC CORR

Are the height and weight of a fish related to each other, and do their lengths have any impact on this relationship if it exists? PROC CORR can be used to determine this. In these examples, the plots option will be used to provide more insights by producing an additional graphic plot output along with the statistical results. Type the following code in a SAS Studio program section and submit it:

/* Correlation matrix with a scatter plot matrix and histograms */
proc corr data=sashelp.fish plots=matrix(histogram);
   var height weight length1 length2 length3;
run;

performing-descriptive-analysis-with-sas-img-1

The simple statistics table provides the descriptive univariate statistics for all five variables listed in the var statement. This table also reveals a very minor data quality issue: one of the 159 observations in this dataset is missing a value for weight. The closer the Pearson correlation coefficient for a pair of variables is to 1.0 (or to -1.0 for an inverse relationship), the stronger the linear relationship between the variables. While height and weight do have a strong relationship, it is interesting to note that the relationships of weight to all three length variables are stronger than the relationships of height to all three length variables:

performing-descriptive-analysis-with-sas-img-2
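
The missing weight value noted in the simple statistics table can be confirmed independently. The following is a small sketch that uses PROC MEANS, a procedure not covered in the original examples, to count nonmissing and missing values for the same five variables.

```sas
/* Count nonmissing (N) and missing (NMISS) values per variable. */
/* PROC MEANS is used here only as an illustrative cross-check.  */
proc means data=sashelp.fish n nmiss;
   var height weight length1 length2 length3;
run;
```

By default, PROC CORR computes each pairwise correlation from the observations that have values for both variables in the pair, so a single missing value only affects the pairs that involve weight.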

In this next example, the code is still searching for a relationship between height and weight. However, the relationship is now adjusted for the effect of the three length variables, which are listed in the partial statement. Instead of requesting a matrix plot, the code requests a scatter plot with three different prediction ellipses. Type the following code in a SAS Studio program section and submit it:

/* Partial correlation of height and weight, controlling for length */
proc corr data=sashelp.fish plots=scatter(alpha=.15 .25 .35);
   var height weight;
   partial length1 length2 length3;
run;

performing-descriptive-analysis-with-sas-img-3

The results indicate that the partial relationship between height and weight is weaker than the unpartialled one; 0.46071 is less than 0.72869. However, both relationships are statistically significant, since both have p-values of <.0001. The smaller the p-value, the stronger the evidence that the true correlation is not zero:

performing-descriptive-analysis-with-sas-img-4

A prediction ellipse is a region for predicting a new observation drawn from the same population. Each value of alpha produces an ellipse that is expected to contain 100(1 - alpha)% of the population, so this code requests three prediction ellipses at 85%, 75%, and 65%.

Change the plots option to the following, and submit the code:

proc corr data=sashelp.fish plots=scatter(ellipse=confidence alpha=.10 .05);
   var height weight;
   partial length1 length2 length3;
run;

performing-descriptive-analysis-with-sas-img-5

A confidence ellipse provides an estimated range for the population mean, together with a level of confidence in that range. In this example, there are two ellipses, one at a 90% confidence level and one at a 95% confidence level. If the relationships between variables are not linear, or there are a lot of outliers in the data being analyzed, the correlation coefficient might misstate the strength of the relationship. Therefore, visualizing the data through these types of plots enables an analyst to verify the linear relationship and spot potential outliers.
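
When a plot suggests outliers or a nonlinear but monotonic pattern, one common cross-check is a rank-based coefficient. The sketch below requests Spearman correlations alongside the Pearson ones; Spearman is not used in the original examples and is shown here only as an assumed complement to the plots.

```sas
/* Request Pearson and Spearman coefficients in the same run.      */
/* Spearman works on ranks, so it is less sensitive to outliers.   */
proc corr data=sashelp.fish pearson spearman;
   var height weight;
run;
```

A large gap between the two coefficients is a hint that the Pearson value deserves a closer look in the scatter plot.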

PROC UNIVARIATE

Some of the output that PROC UNIVARIATE produces already appeared in the simple statistics table of the PROC CORR examples in the previous section.

Type the following code in a SAS Studio program section and submit it:

/* With no var statement, every numeric variable is analyzed */
proc univariate data=sashelp.fish;
run;

performing-descriptive-analysis-with-sas-img-6

When PROC UNIVARIATE is run on an entire table, each numeric variable in that table
gets the descriptive statistics seen in Figure 4.7.
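
To narrow the output, the analysis can be restricted to named variables. The following is a minimal sketch that assumes only weight and height are of interest and adds a histogram of weight with a fitted normal curve; both choices are illustrative and not part of the original example.

```sas
/* Limit the analysis to two variables (an illustrative choice). */
proc univariate data=sashelp.fish;
   var weight height;
   /* Histogram of weight with a fitted normal density overlay */
   histogram weight / normal;
run;
```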

An analyst can control which tables show up in the results by using certain Output Delivery System (ODS) statements along with procedures. ODS is another part of Base SAS that helps produce output and graphics in a variety of different formats.

For example, if an analyst is only interested in the extreme observations of all the variables within a table, they can limit the PROC UNIVARIATE output to only the extreme observations table. Type this code in a SAS Studio program section and submit it:

title "Extreme Observations in SASHELP.FISH";
/* Keep only the extreme observations table in the output */
ods select ExtremeObs;
proc univariate data=sashelp.fish;
run;

performing-descriptive-analysis-with-sas-img-7
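
Table names such as ExtremeObs do not have to be memorized. As a sketch, ODS TRACE writes the name of every output object a procedure creates to the SAS log; restricting the run to weight is an illustrative choice that keeps the log short.

```sas
/* Write the name of each output object to the SAS log. */
ods trace on;
proc univariate data=sashelp.fish;
   var weight;   /* one variable only, to keep the trace short */
run;
/* Stop tracing so later steps do not clutter the log. */
ods trace off;
```

Any name reported in the log can then be used in an ods select statement placed before the procedure.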

We learned how to perform descriptive analysis with SAS using practical use-cases.

If you found this post useful, do check out the book Big Data Analysis with SAS to leverage the capabilities of SAS for processing and analyzing Big Data.