Introduction to Linear Regression Analysis
Linear regression analysis is among the most widely used statistical techniques. It studies additive, linear relationships between a dependent variable and one or more independent variables. An analysis with a single independent variable is called simple linear regression, while an analysis with several independent variables is called multiple linear regression. In essence, linear regression tries to quantify the relationship between the independent and the dependent variables, which makes it a simple yet powerful tool for tasks such as forecasting and making better business decisions.
The 3 Types of Regression Analysis
There are more than 15 types of regression analysis, but the following three have the most real-world use cases. The types of regression analysis we are going to discuss are:
- Linear Regression Analysis
- Multiple Linear Regression Analysis
- Logistic Regression
In this article, we will focus on simple linear regression analysis. This analysis helps us identify the relationship between the independent factor and the dependent factor. In simpler words, the regression model helps us find out how changes in the independent factor affect the dependent factor. This model helps us in multiple ways:
- It is a simple and powerful statistical model
- It helps us make predictions and forecasts
- It helps us make better business decisions
- It helps us analyze results and correct errors
The Equation of Linear Regression and Its Parts
Y = β1 + β2X + ϵ
- Here β1 is known in mathematical terminology as the intercept and β2 as the slope. Together they are known as the regression coefficients. ϵ is the error term: the part of Y that the regression model is unable to explain.
- Y is a dependent variable (other terms which are interchangeably used for dependent variables are response variable, regressand, measured variable, observed variable, responding variable, explained variable, outcome variable, experimental variable, and/or output variable).
- X is an independent variable (also called regressor, controlled variable, manipulated variable, explanatory variable, exposure variable, and/or input variable).
Problem: To understand linear regression analysis, we use the “cars” dataset, which ships with R by default. This dataset has 50 observations (rows) and 2 variables (columns), named “dist” and “speed”. We want to see how the distance variable changes as the speed variable changes. To see the structure of the data we can run str(cars). This shows the type of the object and the type of each column, which gives us a clearer picture of the dataset and helps us make better decisions about how to analyze it.
Code:
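A minimal sketch of this inspection step, assuming the built-in cars data frame from base R's datasets package (R is case-sensitive, so the function is written in lowercase):

```r
# Inspect the structure of the built-in cars data frame:
# the object's class, its dimensions, and the type of each column.
str(cars)
```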
Similarly, to check the summary statistics of the dataset we can use summary(cars). This code provides the mean, median, and range of each variable in one go, which the researcher can use while working on the problem.
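As a sketch of that call, again assuming the built-in cars data frame (the variable name cars_summary is only illustrative):

```r
# Summarize every column: minimum, quartiles, median, mean, and maximum.
cars_summary <- summary(cars)
print(cars_summary)
```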
Output:
Here we can see the statistical output of every variable we have in our dataset.
The Graphical Representation of Datasets
The types of graphical representation we will cover here, and why:
- Scatter Plot: Shows the direction of the relationship that our linear regression model will describe, and whether there is strong visual evidence to support the model.
- Box Plot: Helps us find outliers.
- Density Plot: Helps us understand the distribution of the independent variable; in our case, the independent variable is “Speed”.
Advantages of Graphical Representation
The advantages are as follows:
- Easy to understand
- Helps us make quick decisions
- Comparative analysis
- Less Effort and time
1. Scatter Plot: It will help to visualize any relationships between the independent variable and the dependent variable.
Code:
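One possible way to draw this plot is sketched below. It assumes base R graphics and uses scatter.smooth() from the stats package, which adds a smoothed trend line to the points. The title and axis labels are illustrative choices.

```r
# Scatter plot of distance against speed with a loess-smoothed trend line.
scatter.smooth(
  x = cars$speed,
  y = cars$dist,
  main = "dist ~ speed",
  xlab = "Speed",
  ylab = "Distance"
)
```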
Output:
We can see from the graph a linearly increasing relationship between the dependent variable (Distance) and the independent variable (Speed).
2. Box Plot: A box plot helps us identify the outliers in the dataset. Advantages of using a box plot are:
- Graphical display of a variable’s location and spread.
- It helps us to understand the data’s skewness and symmetry.
Code:
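A minimal sketch of the box plots, assuming base R graphics. The names speed_outliers and dist_outliers are illustrative; boxplot.stats() returns the points that fall beyond the whiskers.

```r
# Draw the two box plots side by side.
par(mfrow = c(1, 2))
boxplot(cars$speed, main = "Speed")
boxplot(cars$dist, main = "Distance")

# List the observations that lie beyond the whiskers of each plot.
speed_outliers <- boxplot.stats(cars$speed)$out
dist_outliers <- boxplot.stats(cars$dist)$out
print(speed_outliers)
print(dist_outliers)
```

In this dataset only the distance variable shows a point beyond the whiskers.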
Output:
3. Density Plot (to check the normality of the distribution)
Code:
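The density plot can be sketched as follows, assuming base R's density() kernel estimator with its default bandwidth. The name speed_density and the fill color are illustrative.

```r
# Estimate the kernel density of the independent variable.
speed_density <- density(cars$speed)

# Plot the estimated density curve and shade the area under it.
plot(speed_density, main = "Density Plot: Speed")
polygon(speed_density, col = "grey")
```

A roughly bell-shaped, symmetric curve suggests the variable is close to normally distributed.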
Output:
Correlation Analysis
This analysis helps us measure the strength and direction of the relationship between the variables. There are mainly six types of correlation:
- Positive Correlation (0.01 to 0.99)
- Negative Correlation (-0.99 to -0.01)
- No Correlation
- Perfect Correlation
- Strong Correlation (a value closer to ± 0.99)
- Weak Correlation (a value closer to 0)
A scatter plot helps us identify which type of correlation exists between the variables, and the correlation coefficient can then be computed in R.
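A minimal sketch of the correlation computation, assuming base R's cor() with its default Pearson method (the name speed_dist_cor is illustrative):

```r
# Pearson correlation coefficient between speed and distance.
speed_dist_cor <- cor(cars$speed, cars$dist)
print(speed_dist_cor)
```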
Output:
Here we have a strong positive correlation between Speed and Distance, which means they have a direct relationship: as speed increases, distance tends to increase.
Linear Regression Model
This is the core component of the analysis; so far we were only checking whether the dataset is suitable for such an analysis. The function we are going to use is lm(). Its two main arguments are formula and data. Before writing the formula, we must be sure which variable is dependent and which is independent, because the whole formula depends on that.
The formula looks like this,
linear_regression <- lm(dependent_variable ~ independent_variable, data = data_frame)
Code:
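Applied to this problem, the call can be sketched as follows, assuming dist is the dependent variable and speed the independent variable, as set up earlier:

```r
# Fit a simple linear regression of distance on speed.
linear_regression <- lm(dist ~ speed, data = cars)

# Printing the model shows the estimated intercept and slope.
print(linear_regression)
```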
Output:
As we can recall from the above segment of the article the equation of linear regression is:
Y = β1 + β2X + ϵ
Now we will fit in the information which we got from the above code in this equation.
dist = −17.579 + 3.932∗speed
Finding the equation of the linear regression is not sufficient on its own; we also have to check its statistical significance. For this, we call summary() on our linear regression model.
Code:
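A sketch of this step is shown below. The model is refit so the snippet runs on its own, and the names model_summary and speed_p_value are illustrative.

```r
# Refit the model so this snippet is self-contained.
linear_regression <- lm(dist ~ speed, data = cars)

# The summary reports coefficients, standard errors, t-values,
# p-values, R-squared, and the overall F-statistic.
model_summary <- summary(linear_regression)
print(model_summary)

# Extract the p-value of the slope from the coefficient table.
speed_p_value <- coef(model_summary)["speed", "Pr(>|t|)"]
print(speed_p_value)
```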
Output:
There are multiple ways of checking the statistical significance of a model; here we use the p-value method. We can consider a model statistically significant when the p-value is less than the pre-determined significance level, which is conventionally 0.05. We can see in the output of summary(linear_regression) that the p-value is below the 0.05 level, so we can conclude that our model is statistically significant. Once we are confident in our model, we can use it to make predictions.
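As an illustration of prediction, the sketch below uses base R's predict() on the fitted model. The speed values in new_speeds are hypothetical inputs chosen only for this example.

```r
# Refit the model so this snippet is self-contained.
linear_regression <- lm(dist ~ speed, data = cars)

# Hypothetical new speeds; the column name must match the
# predictor name used in the model formula.
new_speeds <- data.frame(speed = c(10, 20))

# Predicted distances from the fitted line: intercept + slope * speed.
predicted_dist <- predict(linear_regression, newdata = new_speeds)
print(predicted_dist)
```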