Introduction to R Squared Regression
R squared is a statistical measure defined as the proportion of the variance in the dependent variable that can be explained by the independent variables. In other words, in a regression model, the R squared value tells us about the goodness of the regression model, i.e. how well the data fits the model. It is also called the coefficient of determination. The value of R squared ranges from 0% to 100%.
What is R Squared?
To understand R squared analysis, consider n points in two-dimensional space, say (x1, y1), (x2, y2), …, (xn, yn). We plot them and fit a line y = mx + b through them in such a way that the squared error is minimized.
Now the total squared error between the points and the line can be calculated as below. We will consider error along the y-axis.
e1 = (y1 – (mx1 + b))²
e2 = (y2 – (mx2 + b))²
…
en = (yn – (mxn + b))²
Let the squared error of the line be SEline.
SEline = (y1 – (mx1 + b))² + (y2 – (mx2 + b))² + … + (yn – (mxn + b))²
Now, to know how well the line fits the data points, let us ask: “How much (or what percentage) of the variation in y is described by the variation in x?”
First, measure the total variation of y about its mean ȳ. Let SEY be this total squared variation:
SEY = (y1 – ȳ)² + (y2 – ȳ)² + … + (yn – ȳ)²
so that Variance(y) = SEY / n.
The fraction of the total variation in y that is not described by the regression line is then:
SEline / SEY (%) of variation is not described by the regression line.
So, if the above percentage comes out to be 25%, that is, 25% of the variation is not described by the line, then the variation described by the line equals 1 − 0.25 = 0.75, i.e. 75%. The total variation described by the regression line is therefore given by the formula below.
(1 – SEline / SEY) %
This gives the percentage of the total variation in y that is described by the variation in x. It is called the coefficient of determination, or R squared.
Coefficient of Determination = R² = 1 – SEline / SEY
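The formula above can be checked numerically. Below is a small worked sketch using made-up points (not from the article) and the candidate line y = 2x + 1:

```python
import numpy as np

# Illustrative example with made-up data points (not from the article):
# five points scored against the candidate line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

m, b = 2.0, 1.0                            # slope and intercept of the line
se_line = np.sum((y - (m * x + b)) ** 2)   # SEline: squared error about the line
se_y = np.sum((y - y.mean()) ** 2)         # SEY: total squared variation in y
r_squared = 1 - se_line / se_y
print(round(r_squared, 4))                 # 0.9974 — close to 1, a good fit
```

Because the points lie close to the line, SEline is tiny relative to SEY and R squared comes out near one.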
Example to Implement R Squared Regression
Let us consider an example using Python. The sklearn (scikit-learn) library contains the metric r2_score, and for the linear regression model we will use LinearRegression from sklearn. We will use the matplotlib library for plotting the regression line, the numpy library to reshape the input array, and the pandas library to load the dataset.
1. Let us first install the scikit-learn package, and then import the necessary packages.
pip install scikit-learn
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
2. Then, we need to download the wage2.csv dataset, which is available here: http://murraylax.org/datasets/wage2.csv. We load it with pandas.
wage = pd.read_csv("http://murraylax.org/datasets/wage2.csv")
The fields present in the dataset can be listed with wage.columns.
3. We will take YearsEdu as x and MonthlyEarnings as y, the dependent variable. We scatter plot all the points to visualize the distribution.
y = wage["MonthlyEarnings"]
x = wage["YearsEdu"]
plt.scatter(x, y)
plt.show()
4. Now we will train the linear regression model on our data. Note that the model must be fitted before it can predict.
X = np.array(x).reshape(-1, 1)
regModel = LinearRegression()
regModel.fit(X, y)
y_pred = regModel.predict(X)
5. Now, to view the best-fitting line over our data, we will execute the following script.
plt.plot(X, y_pred, color='red')
plt.show()
6. To compute the R squared value of the line, the r2_score function can be used.
rSquared = r2_score(y, y_pred)
print(rSquared)
- If the squared error is very small, the line is a good fit. A small SEline makes the fraction SEline / SEY very small, which yields a value close to one when subtracted from one. Thus, when the squared error is small, R squared (the coefficient of determination) is large, nearly equal to one, showing that the line is a good fit.
- Similarly, in the opposite case, if the squared error of the line is huge, meaning there is a lot of error between the data points and the line, then SEline is large and the fraction SEline / SEY is large. R squared (the coefficient of determination) is then a small value, indicating a poorly fitted regression line.
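The two cases above can be sketched with synthetic data (assumed for illustration, not from the article): the same line y = 2x + 1 is scored against observations with small noise and with large noise.

```python
import numpy as np
from sklearn.metrics import r2_score

# Synthetic illustration (assumed data, not from the article): score the
# line y = 2x + 1 against observations with small and with large noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
line = 2 * x + 1                                   # predictions from the line

y_small_noise = line + rng.normal(0, 0.2, x.size)  # small squared error
y_large_noise = line + rng.normal(0, 8.0, x.size)  # large squared error

r2_small = r2_score(y_small_noise, line)
r2_large = r2_score(y_large_noise, line)
print(r2_small)  # close to 1: the line is a good fit
print(r2_large)  # much smaller: the line fits poorly
```

The small-noise case makes SEline tiny relative to SEY, pushing R squared toward one; the large-noise case inflates SEline and drags R squared down.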
R squared tells us how movements of the independent variables relate to movements of the dependent variable. It is not, by itself, a criterion for whether the selected model is good, and it gives no information about whether the data and predictions are biased. You can sometimes get a high R squared value for a poor model and a low value for a well-fitted model. Thus R squared alone does not establish the reliability of the model.
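The "high R squared for a poor model" caveat can be demonstrated with a small sketch (made-up data, not from the article): a straight line is the wrong model for y = x², yet its R squared is still high.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up illustration: fit a straight line to clearly curved data.
x = np.arange(11, dtype=float)
y = x ** 2                      # quadratic, so a line is the wrong model
X = x.reshape(-1, 1)

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
print(round(r2, 2))  # high R squared despite the curved residual pattern
```

A residual plot would immediately reveal the systematic curvature that the single R squared number hides, which is why checking residuals matters alongside R squared.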
The R squared value is not, on its own, a metric that verifies the good fit of a trained linear regression model. What it does tell us is what proportion of the variance of the dependent variable can be described by the variance of the independent variables.
This is a guide to R Squared Regression. Here we discuss what R squared analysis is, along with its interpretation, limitations, and an example. You can also go through our other related articles to learn more –