[box type=”note” align=”” class=”” width=””]The following excerpt is taken from the book Statistics for Data Science, authored by IBM expert James D. Miller. This book gives a statistical view of building smart data models to help you get unique insights from the data.[/box]
In this article, we introduce you to the concept of regression analysis, one of the most popular machine learning algorithms. -You will learn what is regression analysis, the different types of regression, and how to choose the right regression technique to build your data model.
What is Regression Analysis?
For starters, regression analysis or statistical regression is a process for estimating the relationships among variables. This process encompasses numerous techniques for modeling and analyzing variables, focusing on the relationship between a dependent variable and one (or more) independent variable (or predictors).
Regression analysis is the work done to identify and understand how the (best representative) value of a dependent variable (a variable that depends on other factors) changes when any one of the independent variables (a variable that stands alone and isn’t changed by the other variables) is changed while the other independent variables stay the same.
A simple example might be how the total dollars spent on marketing (an independent variable example) impacts the total sales dollars (a dependent variable example) over a period of time (is it really as simple as more marketing equates to higher sales?), or perhaps there is a correlation between the total marketing dollars spent (independent variable), discounting a products price (another independent variable), and the amount of sales (a dependent variable)?
[box type=”info” align=”” class=”” width=””]Keep in mind this key point that regression analysis is used to understand which among the independent variables are related to the dependent variable(s), not just the relationship of these variables. Also, the inference of causal relationships (between the independent and dependent variables) is an important objective. However, this can lead to illusions or false relationships, so caution is recommended![/box]
Overall, regression analysis can be thought of as estimating the conditional expectations of the value of the dependent variable, given the independent variables being observed, that is, endeavoring to predict the average value of the dependent variable when the independent variables are set to certain values. I call this the lever affect—meaning when one increases or decreases a value of one component, it directly affects the value at least one other (variable).
An alternate objective of the process of regression analysis is the establishment of location parameters or the quantile of a distribution. In other words, this idea is to determine values that may be a cutoff, dividing a range of a probability distribution values.
You’ll find that regression analysis can be a great tool for prediction and forecasting (not just complex machine learning applications). We’ll explore some real-world examples later, but for now, let’s us look at some techniques for the process.
Popular regression techniques and approaches
You’ll find that various techniques for carrying out regression analysis have been developed and accepted.These are:
Linear regression is the most basic type of regression and is commonly used for predictive analysis projects. In fact, when you are working with a single predictor (variable), we call it simple linear regression, and if there are multiple predictor variables, we call it multiple linear regression. Simply put, linear regression uses linear predictor functions whose values are estimated from the data in the model.
Logistic regression is a regression model where the dependent variable is a categorical variable. This means that the variable only has two possible values, for example, pass/fail, win/lose, alive/dead, or healthy/sick. If the dependent variable has more than two possible values, one can use various modified logistic regression techniques, such as multinomial logistic regression, ordinal logistic regression, and so on.
When we speak of polynomial regression, the focus of this technique is on modeling the relationship between the independent variable and the dependent variable as an nth degree polynomial.
Polynomial regression is considered to be a special case of multiple linear regressions. The predictors resulting from the polynomial expansion of the baseline predictors are known as interactive features.
Stepwise regression is a technique that uses some kind of automated procedure to continually execute a step of logic, that is, during each step, a variable is considered for addition to or subtraction from the set of independent variables based on some prespecified criterion.
Often predictor variables are identified as being interrelated. When this occurs, the regression coefficient of any one variable depends on which other predictor variables are included in the model and which ones are left out. Ridge regression is a technique where a small bias factor is added to the selected variables in order to improve this situation. Therefore, ridge regression is actually considered a remedial measure to alleviate multicollinearity amongst predictor variables.
Lasso (Least Absolute Shrinkage Selector Operator) regression is a technique where both predictor variable selection and regularization are performed in order to improve the prediction accuracy and interpretability of the result it produces.
Which technique should I choose?
In addition to the aforementioned regression techniques, there are numerous others to consider with, most likely, more to come. With so many options, it’s important to choose the technique that is right for your data and your project.
Rather than selecting the right regression approach, it is more about selecting the most effective regression approach.
Typically, you use the data to identify the regression approach you’ll use. You start by establishing statistics or a profile for your data. With this effort, you need to identify and understand the importance of the different variables, their relationships, coefficient signs, and their effect.
Overall, here’s some generally good advice for choosing the right regression approach from your project:
- Copy what others have done and had success with. Do the research. Incorporate the results of other projects into yours. Don’t reinvent the wheel. Also, even if an observed approach doesn’t quite fit as it was used, perhaps some simple adjustments would make it a good choice.
- Keep your approach as simple as possible. Many studies show that simpler models generally produce better predictions. Start simple, and only make the model more complex as needed. The more complex you make your model, the more likely it is that you are tailoring the model to your dataset specifically, and generalizability suffers.
- Check your work. As you evaluate methods, check the residual plots (more on this in the next section of this chapter) because they can help you avoid inadequate models and adjust your model for better results.
- Use your subject matter expertise. No statistical method can understand the underlying process or subject area the way you do. Your knowledge is a crucial part and, most likely, the most reliable way of determining the best regression approach for your project.
Does it fit?
After selecting a model that you feel is appropriate for use with your data (also known as determining that the approach is the best fit), you need to validate your selection, that is, determine its fit.
A well-fitting regression model results in predicted values close to the observed data values.
The mean model (which uses the mean for every predicted value) would generally be used if there were no informative predictor variables. The fit of a proposed regression model should, therefore, be better than the fit of the mean model.
As a data scientist, you will need to scrutinize the coefficients of determination, measure the standard error of estimate, analyze the significance of regression parameters and confidence intervals.
[box type=”info” align=”” class=”” width=””]Remember that the better the fit of a regression model, most likely the better the precision in, or just better, the results.[/box]
Finally, it has been proven that simple models produce more accurate results! Keep this in mind always when selecting an approach or a technique, and even when the problem might be complex, it is not always obligatory to adopt a complex regression approach. Choosing the right technique, though, goes a long way in developing an accurate model.
If you found this excerpt useful, make sure to check out this book Statistics for Data Science for tips on building effective data models by leveraging the power of the statistical tools and techniques.