[box type=”note” align=”” class=”” width=””]*This article is an excerpt from the book, **Big Data Analysis with SAS** written by David Pope. This book will help you leverage the power of SAS for data management, analysis and reporting. It contains practical use-cases and real-world examples on predictive modelling, forecasting, optimizing, and reporting your Big Data analysis using SAS.*[/box]

Today, we will perform regression analysis using SAS in a step-by-step manner with a practical use-case.

Regression analysis is one of the earliest predictive techniques most people learn because it can be applied across a wide variety of problems dealing with data that is related in linear and non-linear ways. Linear data is one of the easier use cases, and as such PROC REG is a well-known and often-used procedure to help predict likely outcomes before they happen.

The REG procedure provides extensive capabilities for fitting linear regression models that involve individual numeric independent variables. Many other procedures can also fit regression models, but they focus on more specialized forms of regression, such as robust regression, generalized linear regression, nonlinear regression, nonparametric regression, quantile regression, regression modeling of survey data, regression modeling of survival data, and regression modeling of transformed variables. The SAS/STAT procedures that can fit regression models include the ADAPTIVEREG, CATMOD, GAM, GENMOD, GLIMMIX, GLM, GLMSELECT, LIFEREG, LOESS, LOGISTIC, MIXED, NLIN, NLMIXED, ORTHOREG, PHREG, PLS, PROBIT, QUANTREG, QUANTSELECT, REG, ROBUSTREG, RSREG, SURVEYLOGISTIC, SURVEYPHREG, SURVEYREG, TPSPLINE, and TRANSREG procedures. Several procedures in SAS/ETS software also fit regression models.

SAS/STAT 14.2 / SAS/STAT User's Guide, Introduction to Regression Procedures, Overview: Regression Procedures (http://documentation.sas.com/?cdcId=statcdc&cdcVersion=14.2&docsetId=statug&docsetTarget=statug_introreg_sect001.htm&locale=en&showBanner=yes).

Regression analysis attempts to model the relationship between a response or output variable and a set of input variables. The response is the target variable, that is, the variable one is trying to predict, while the input variables are the parameters the algorithm uses to derive the predicted value for the response variable.

**PROC REG**

One of the easiest ways to determine whether regression analysis can help answer a question is to look at the type of response involved. Some questions have only two possible answers. For example, should a bank lend an applicant money? Yes or no? This is known as a **binary response**, and it is typically handled with a specialized form of regression, logistic regression. A continuous response, such as a salary, is modeled with linear regression. In the following example, the reader will use the SASHELP.BASEBALL dataset to create a regression model to predict the value of a baseball player's salary.

The SASHELP.BASEBALL dataset contains salary and performance information for Major League Baseball players who played at least one game in both the 1986 and 1987 seasons, excluding pitchers. The salaries (Sports Illustrated, April 20, 1987) are for the 1987 season and the performance measures are from 1986 (Collier Books, The 1987 Baseball Encyclopedia Update). SAS/STAT® 14.2 / SAS/STAT User's Guide, Example 99: Modeling Salaries of Major League Baseball Players (http://documentation.sas.com/?cdcId=statcdc&cdcVersion=14.2&docsetId=statug&docsetTarget=statug_reg_examples01.htm&locale=en&showBanner=yes).

Let’s first use PROC UNIVARIATE to learn something about this baseball data by submitting the following code:

```
proc univariate data=sashelp.baseball;
run;
```

While reviewing the results of the output, the reader will notice that the variance associated with logSalary, 0.79066, is much less than the variance associated with the actual target variable Salary, 203508. In this case, it makes better sense to attempt to predict the logSalary value of a player instead of Salary.
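To focus this comparison, you can restrict PROC UNIVARIATE to just the two candidate targets with a VAR statement. This is a sketch; the HISTOGRAM statement is optional, but it helps visualize the right skew in Salary that motivates the log transform:

```
proc univariate data=sashelp.baseball;
   var Salary logSalary;          /* limit output to the two candidate targets */
   histogram Salary logSalary;    /* optional: shows the skew in Salary */
run;
```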

Write the following code in a SAS Studio program section and submit it:

```
proc reg data=sashelp.baseball;
   id name team league;
   model logSalary = nAtBat nHits nHome nRuns nRBI YrMajor CrAtBat
                     CrHits CrHome CrRuns CrRbi;
quit;
```

Notice, as reported in the first output table, that 59 observations have missing values in at least one of the model variables; those observations are not used in the development of the regression model. The Root Mean Squared Error (RMSE) and R-square statistics typically inform the analyst how good the model is at predicting the target. R-square ranges from 0 to 1.0, with higher values typically indicating a better model, while a lower RMSE indicates a better fit. However, a high R-square can be misleading: if the conditions or the data used to train the model lead to over-fitting, the statistic will not represent the true predictive power of that particular model.
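To confirm where those 59 excluded observations come from, you can count missing values per variable with PROC MEANS. This is a sketch; in SASHELP.BASEBALL the missing values are in the salary variables rather than the performance measures:

```
proc means data=sashelp.baseball n nmiss;
   var logSalary nAtBat nHits nHome nRuns nRBI YrMajor
       CrAtBat CrHits CrHome CrRuns CrRbi;   /* model variables only */
run;
```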

Over-fitting can happen when an analyst doesn't have enough real-life data and chooses data, or a sample of data, that over-represents the target event; the resulting model performs poorly when real-world data is used as input.

Since several of the input variables appear to have little predictive power on the target, an analyst may decide to drop them, reducing the amount of information needed to make a decent prediction. In this case, it appears we only need four input variables: YrMajor, nHits, nRuns, and nAtBat. Modify the code as follows and submit it again:

```
proc reg data=sashelp.baseball;
   id name team league;
   model logSalary = YrMajor nHits nRuns nAtBat;
quit;
```

The p-value associated with each of the input variables provides the analyst with an insight into which variables have the biggest impact on helping to predict the target variable. In this case, the smaller the value, the higher the predictive value of the input variable.
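If you want to work with the p-values programmatically rather than read them off the listing, the parameter estimates table can be captured with ODS OUTPUT. A sketch: ParameterEstimates is the ODS table name PROC REG produces for this table, and work.params is an illustrative dataset name:

```
proc reg data=sashelp.baseball;
   ods output ParameterEstimates=work.params;  /* capture the estimates table */
   model logSalary = YrMajor nHits nRuns nAtBat;
quit;

/* sort the inputs by p-value, smallest (most predictive) first */
proc sort data=work.params;
   by Probt;
run;
proc print data=work.params;
   var Variable Estimate Probt;
run;
```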

Both the RMSE and R-square values for this second model are slightly lower than the original. However, the adjusted R-square value is slightly higher. In this case, an analyst may choose to use the second model, since it requires much less data and provides basically the same predictive power. Prior to accepting any model, an analyst should determine whether a few observations may be over-influencing the results by investigating the influence and fit diagnostics. The default output from PROC REG provides this type of visual insight:

The top-right corner plot, showing the externally studentized residuals (RStudent) by leverage values, shows that there are a few observations with high leverage that may be overly influencing the fit produced. In order to investigate this further, we will add a PLOTS= option to our PROC REG statement to produce a labeled version of this plot.

Type the following code in a SAS Studio program section and submit:

```
proc reg data=sashelp.baseball
         plots(only label)=(RStudentByLeverage);
   id name team league;
   model logSalary = YrMajor nHits nRuns nAtBat;
quit;
```

Sure enough, there are three to five individuals whose input variables may have excessive influence on fitting this model. Let’s remove those points and see if the model improves. Type this code in a SAS Studio program section and submit it:

```
proc reg data=sashelp.baseball plots=(residuals(smooth));
   where name NOT IN ("Mattingly, Don", "Henderson, Rickey",
                      "Boggs, Wade", "Davis, Eric", "Rose, Pete");
   id name team league;
   model logSalary = YrMajor nHits nRuns nAtBat;
quit;
```

This change by itself has not improved the model; in fact, it made the model worse, as can be seen in the R-square of 0.5592. However, the plots=(residuals(smooth)) option gives some insight as it pertains to YrMajor: players at the beginning and the end of their careers tend to be paid less than their peers, as can be seen in Figure 4.12:

In order to address this lack of fit, an analyst can use polynomials of degree two for this variable, YrMajor. Type the following code in a SAS Studio program section and submit it:

```
data work.baseball;
   set sashelp.baseball;
   where name NOT IN ("Mattingly, Don", "Henderson, Rickey",
                      "Boggs, Wade", "Davis, Eric", "Rose, Pete");
   YrMajor2 = YrMajor*YrMajor;
run;

proc reg data=work.baseball;
   id name team league;
   model logSalary = YrMajor YrMajor2 nHits nRuns nAtBat;
quit;
```

After removing some outliers and adding a squared term for the YrMajor variable, the model's predictive power has improved significantly, as can be seen in the much improved R-square value of 0.7149.
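To put the final model to use, an OUTPUT statement can write the predicted values and residuals to a dataset. This is a sketch; the dataset name work.scored and the variable names predLog, predSalary, and resid are illustrative, and since the model predicts logSalary, the prediction is exponentiated to return to the original salary scale:

```
proc reg data=work.baseball;
   id name team league;
   model logSalary = YrMajor YrMajor2 nHits nRuns nAtBat;
   output out=work.scored p=predLog r=resid;   /* predictions and residuals */
quit;

data work.scored;
   set work.scored;
   predSalary = exp(predLog);   /* back-transform from the log scale */
run;

proc print data=work.scored(obs=10);
   var name team predLog predSalary resid;
run;
```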

We have seen an effective way of performing regression analysis using the SAS platform.

*If you found our post useful, do check out the book* **Big Data Analysis with SAS** *to understand other data analysis models and perform them practically using SAS.*