What is Linear Regression in R?
Linear regression is one of the most popular and widely used algorithms in statistics and machine learning. It is a modeling technique for understanding the relationship between input and output variables. The output variable must be numeric, and the inputs used here are numeric as well. The name comes from the fact that the output variable is modeled as a linear combination of the input variables. The output is usually represented by “y”, whereas the input is represented by “x”.
Linear regression can be categorized into two types:
Simple Linear Regression
This is the regression where the output variable is a function of a single input variable. Representation of simple linear regression:
y = c0 + c1*x1
Multiple Linear Regression
This is the regression where the output variable is a function of multiple input variables. Representation of multiple linear regression with two inputs:
y = c0 + c1*x1 + c2*x2
In both of the above cases, c0 is the intercept, and c1 and c2 are the coefficients that represent the regression weights of the inputs.
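To make the two representations concrete, here is a minimal sketch that builds an output from known weights and lets lm() recover them. The values of x1 and x2 and the weights 1, 2 and 3 are made up purely for illustration.

```r
# Hypothetical inputs, chosen only for illustration.
x1 <- c(1, 2, 3, 4, 5, 6)
x2 <- c(2, 1, 4, 3, 6, 5)

# Output built exactly as y = c0 + c1*x1 + c2*x2 with c0 = 1, c1 = 2, c2 = 3.
y <- 1 + 2 * x1 + 3 * x2

# Simple linear regression: one input variable.
simple_fit <- lm(y ~ x1)

# Multiple linear regression: two input variables.
multiple_fit <- lm(y ~ x1 + x2)

# The multiple fit recovers c0, c1 and c2 up to floating-point rounding.
print(coef(multiple_fit))
```

The simple fit cannot recover the same weights, because it has to absorb the effect of x2 into the intercept and the x1 slope.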
Linear Regression in R
R is a very powerful statistical tool. So let’s see how linear regression can be performed in R and how its output values can be interpreted.
Let’s prepare a dataset to perform and understand linear regression in depth.
Now we have a dataset in which “satisfaction_score” and “year_of_Exp” are the independent variables and “salary_in_Lakhs” is the output variable.
Referring to the above dataset, the problem we want to address here through linear regression is:
Estimating the salary of an employee based on their years of experience and satisfaction score in the company.
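To try the code in this article, a data frame with the same column names is needed. The sketch below builds a stand-in for employee.data. Every value in it is hypothetical and is not the dataset used in this article, so fitted numbers will differ from the ones quoted here.

```r
# Hypothetical values for illustration only; NOT the article's data.
employee.data <- data.frame(
  satisfaction_score = c(2, 4, 3, 5, 1, 4, 2, 5),
  year_of_Exp        = c(1, 3, 5, 2, 6, 8, 4, 7),
  salary_in_Lakhs    = c(6, 9, 14, 8, 16, 20, 11, 18)
)

# Check the structure: three numeric columns, one row per employee.
str(employee.data)
```

Note that the column is spelled salary_in_Lakhs, with a capital L, to match the lm() calls used in this article.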
R code of linear regression:
model <- lm(salary_in_Lakhs ~ satisfaction_score + year_of_Exp, data = employee.data)
Running summary(model) on this fit prints the coefficient estimates and the statistics interpreted in the sections below.
The regression formula becomes:
Y = 12.29 - 1.19*satisfaction_score + 2.08*year_of_Exp
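Once the coefficients are known, the formula can be applied by hand. The sketch below assumes the last term of the formula reads 2.08*year_of_Exp. The helper name predict_salary and the example employee (satisfaction score 3, 5 years of experience) are hypothetical.

```r
# Coefficients quoted in the article, assuming the last term is 2.08*year_of_Exp.
coefs <- c(intercept = 12.29, satisfaction_score = -1.19, year_of_Exp = 2.08)

# Hypothetical helper that evaluates the regression formula.
predict_salary <- function(satisfaction_score, year_of_Exp) {
  unname(coefs["intercept"] +
           coefs["satisfaction_score"] * satisfaction_score +
           coefs["year_of_Exp"] * year_of_Exp)
}

# 12.29 - 1.19*3 + 2.08*5 = 12.29 - 3.57 + 10.40 = 19.12
print(predict_salary(3, 5))
```

With a fitted model object, predict(model, newdata = ...) does the same calculation without copying coefficients by hand.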
When the model has many inputs, the “.” shorthand uses every other column of the data frame as an input variable. The R code then becomes:
model <- lm(salary_in_Lakhs ~ ., data = employee.data)
However, if one wants to select a subset of variables out of many input variables, techniques such as “Backward Elimination” and “Forward Selection” are available for that.
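One way to try backward elimination in base R is the step() function, which drops variables one at a time based on AIC. This is a sketch of that particular approach, not necessarily the only or preferred one. It uses R’s built-in mtcars dataset as a stand-in, since it has many input columns.

```r
# Stand-in data: R's built-in mtcars dataset (not the employee data).
# Start from the full model that uses every other column as an input.
full_model <- lm(mpg ~ ., data = mtcars)

# Backward elimination based on AIC; trace = 0 suppresses the step-by-step log.
reduced_model <- step(full_model, direction = "backward", trace = 0)

# Inspect which inputs survived the elimination.
print(formula(reduced_model))
```

Forward selection works the same way with direction = "forward", starting from an intercept-only model and supplying the full formula through the scope argument.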
Interpretation of Linear Regression in R
The main parts of the linear regression output in R can be interpreted as follows:
Residuals
This refers to the difference between the actual response and the predicted response of the model. For every point, there will be one actual response and one predicted response. Hence there will be as many residuals as there are observations. In our case we have four observations, hence four residuals.
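The definition of a residual can be checked directly in R. The sketch below uses the built-in mtcars dataset as a stand-in for the employee data, so the number of residuals is 32 rather than four.

```r
# Stand-in data: R's built-in mtcars dataset (not the employee data).
model <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals as reported by R.
res <- residuals(model)

# Residuals computed by hand: actual response minus predicted response.
manual <- mtcars$mpg - fitted(model)

# Both agree, and there is exactly one residual per observation.
print(all.equal(unname(res), unname(manual)))
print(length(res) == nrow(mtcars))
```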
Going further, we find the coefficients section, which lists the intercept and the slopes. To predict the salary of an employee based on experience and satisfaction score, one needs to build a model formula from the slopes and the intercept. This formula is what predicts the salary. Together, the intercept and slopes define the line that best fits the data points.
Slope: Depicts the steepness of the line.
Intercept: The location where the line cuts the y-axis.
Let’s understand how the formula is formed from the slope and the intercept.
Say the intercept is 3 and the slope is 5.
So, the formula is y = 3 + 5x. This means that if x increases by one unit, y increases by 5.
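The effect of the slope is easy to confirm with a few lines of arithmetic. In this small sketch the function name predict_y and the x values 0 to 4 are illustrative choices.

```r
intercept <- 3
slope <- 5

# Hypothetical helper implementing y = 3 + 5x.
predict_y <- function(x) intercept + slope * x

x <- 0:4
y <- predict_y(x)

# At x = 0 the prediction equals the intercept.
print(y[1])

# Each one-unit step in x changes y by exactly the slope.
print(diff(y))
```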
a. Coefficient – Estimate
Here the intercept denotes the average value of the output variable when all inputs are zero. So, in our case, the average salary would be 12.29 lakhs when both satisfaction score and experience are zero. Each slope represents the change in the output variable for a unit change in that input variable, with the other inputs held constant.
b. Coefficient – Standard Error
The standard error estimates how much a coefficient estimate would vary if the model were refit on different samples of the same kind of data. A standard error that is small relative to the estimate means the coefficient is estimated precisely. In turn, this tells us how confident we can be in the relationship between the input and output variables.
c. Coefficient – t value
The t value is the coefficient estimate divided by its standard error. It gives the confidence to reject the null hypothesis that the coefficient is zero. The farther the value is from zero, the greater the confidence in rejecting the null hypothesis and establishing a relationship between the output and input variable. In our case the value is away from zero as well.
d. Coefficient – Pr(>|t|)
This column is the p-value for the t value. The closer it is to zero, the more easily we can reject the null hypothesis. In our case this value is near zero, so we can say there exists a relationship between salary package, satisfaction score and years of experience.
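The four columns described above can be pulled out of the model summary and cross-checked against each other. The sketch below uses the built-in mtcars dataset as a stand-in for the employee data.

```r
# Stand-in data: R's built-in mtcars dataset (not the employee data).
model <- lm(mpg ~ wt + hp, data = mtcars)

# Matrix with columns "Estimate", "Std. Error", "t value" and "Pr(>|t|)".
coef_table <- coef(summary(model))
print(coef_table)

# t value = estimate / standard error.
t_manual <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom.
p_manual <- 2 * pt(abs(t_manual), df = df.residual(model), lower.tail = FALSE)

print(all.equal(t_manual, coef_table[, "t value"]))
print(all.equal(p_manual, coef_table[, "Pr(>|t|)"]))
```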
Residual Standard Error
This depicts the typical size of the error in predicting the response variable, measured in the units of the response. The lower it is, the more closely the model fits the data.
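The residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom. The following sketch checks this on the built-in mtcars dataset, used as a stand-in for the employee data.

```r
# Stand-in data: R's built-in mtcars dataset (not the employee data).
model <- lm(mpg ~ wt + hp, data = mtcars)

# Residual degrees of freedom: observations minus estimated coefficients.
df_res <- df.residual(model)

# Residual standard error computed by hand.
rse_manual <- sqrt(sum(residuals(model)^2) / df_res)

# The same value as stored in the model summary.
rse_reported <- summary(model)$sigma

print(all.equal(rse_manual, rse_reported))
```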
Multiple R-squared, Adjusted R-squared
R-squared is a very important statistical measure of how closely the model fits the data. In our case, it tells us how well the linear regression model represents the dataset.
The R-squared value always lies between 0 and 1. The formula is:
R-squared = 1 - (residual sum of squares / total sum of squares)
The closer the value is to 1, the better the model describes the dataset and its variance.
However, when more than one input variable comes into the picture, the adjusted R-squared value is preferred, because it accounts for the number of input variables.
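Both measures can be computed by hand and compared with what R reports. The sketch below uses the built-in mtcars dataset as a stand-in for the employee data. The adjusted value uses the standard formula 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of input variables.

```r
# Stand-in data: R's built-in mtcars dataset (not the employee data).
model <- lm(mpg ~ wt + hp, data = mtcars)
model_summary <- summary(model)

# R-squared = 1 - (residual sum of squares / total sum of squares).
ss_res <- sum(residuals(model)^2)
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared penalizes the number of input variables p.
n <- nrow(mtcars)
p <- 2
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(all.equal(r2, model_summary$r.squared))
print(all.equal(adj_r2, model_summary$adj.r.squared))
```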
F-Statistic
It’s a strong measure to determine the relationship between the input and response variables. The farther the value is above 1, the higher the confidence in the relationship between the input and output variables.
In our case it’s “937.5”, which is relatively large considering the size of the data. Hence rejecting the null hypothesis gets easier.
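The F-statistic and its degrees of freedom are stored in the model summary, and the overall p-value follows from the F distribution. The sketch below uses the built-in mtcars dataset as a stand-in, so its F value is not the 937.5 quoted above.

```r
# Stand-in data: R's built-in mtcars dataset (not the employee data).
model <- lm(mpg ~ wt + hp, data = mtcars)

# Named vector with elements "value", "numdf" and "dendf".
f <- summary(model)$fstatistic
print(f)

# p-value of the overall test that all slopes are zero.
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
print(unname(p_value))

# The same F value derived from R-squared.
r2 <- summary(model)$r.squared
f_manual <- (r2 / f["numdf"]) / ((1 - r2) / f["dendf"])
print(all.equal(unname(f_manual), unname(f["value"])))
```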
The confidence intervals for the model’s coefficients can be obtained with confint(model).
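As a sketch of what these intervals contain, the example below calls confint() and rebuilds the same bounds from the estimates, their standard errors and the t distribution. It uses the built-in mtcars dataset as a stand-in for the employee data.

```r
# Stand-in data: R's built-in mtcars dataset (not the employee data).
model <- lm(mpg ~ wt + hp, data = mtcars)

# 95% confidence intervals for every coefficient.
ci <- confint(model, level = 0.95)
print(ci)

# The same bounds by hand: estimate +/- critical t value * standard error.
est <- coef(model)
se <- coef(summary(model))[, "Std. Error"]
t_crit <- qt(0.975, df = df.residual(model))
manual <- cbind(est - t_crit * se, est + t_crit * se)

print(all.equal(unname(ci), unname(manual)))
```

An interval that does not contain zero points the same way as a small p-value for that coefficient.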
Visualization of Regression
plot(salary_in_Lakhs ~ satisfaction_score + year_of_Exp, data = employee.data)
It’s always better to gather as many data points as possible before fitting a model.
Conclusion – Linear Regression in R
Linear regression is a simple, easy-to-fit and easy-to-understand, yet very powerful model. We saw how linear regression can be performed in R. We also tried interpreting the results, which can help you in optimizing the model. Once one gets comfortable with simple linear regression, one should try multiple linear regression. Along with this, as linear regression is sensitive to outliers, one must check for them before fitting a linear regression.
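One common way to look for influential points is Cook’s distance. The sketch below uses the built-in mtcars dataset as a stand-in, and the cutoff of 4/n is only a widely used rule of thumb, not a strict test.

```r
# Stand-in data: R's built-in mtcars dataset (not the employee data).
model <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance measures how much each observation influences the fit.
cooks <- cooks.distance(model)

# Rule-of-thumb cutoff (an assumption, not a formal threshold).
threshold <- 4 / nrow(mtcars)

# Names of the observations worth a closer look.
flagged <- names(cooks)[cooks > threshold]
print(flagged)
```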