Introduction to OLS Regression in R
OLS regression in R is a standard regression technique based on the ordinary least squares (OLS) estimation method. OLS regression is used to model how one dependent variable Y depends on one or more independent variables X, and to predict Y from them. The R language provides built-in functions to fit OLS regression models and to check how well they fit; the lm() function creates the model. When there is a single predictor, the fitted model is a straight-line equation, which is simple linear regression. OLS regression is well suited as a machine learning model for numerical data sets.
Bivariate regression takes the form of the equation below.
y = mx + c
y = dependent variable
m = gradient(slope)
x = independent variable
c = intercept
OLS linear regression allows us to predict the value of the response variable for different predictor values once the slope and intercept are the best fit. To calculate the slope and intercept coefficients in R, we use the lm() function. For a single predictor, the same coefficients can also be computed by hand from five quantities: the standard deviations of x and y, the means of x and y, and the Pearson correlation coefficient between x and y.
The formulas for the slope and intercept, written as R code, are given below.
slope <- cor(x, y) * (sd(y) / sd(x))
intercept <- mean(y) - (slope * mean(x))
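As a quick check, the two formulas above can be compared with the coefficients that lm() returns. The sketch below uses R's built-in cars data set, which is an assumption made for illustration; it is not the data set used in this article.

```r
# Illustration only: 'cars' is a built-in data set, not the article's data.
x <- cars$speed   # predictor
y <- cars$dist    # response

# Slope and intercept from the five summary statistics
slope <- cor(x, y) * (sd(y) / sd(x))
intercept <- mean(y) - (slope * mean(x))

# The same coefficients estimated by lm()
fit <- lm(y ~ x)

print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

Both print statements should show the same two numbers, because for a single predictor the closed-form formulas and lm() solve the same least squares problem.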
To check whether the relationship between two numeric variables is linear, we use a scatter plot. A scatter plot makes it easy to see the strength and direction of a relationship. To perform OLS regression in R, we pass the data to the base functions lm() and predict(). We also use the ggplot2 and dplyr packages, which need to be loaded.
Implementation of OLS
Here are the OLS implementation steps that we need to follow:
To implement OLS through the lm() function, we first load the libraries that the workflow needs. lm() itself is part of base R.
The caTools library contains basic utility functions, including sample.split(), which is used below to split the data.
After loading the required libraries, we import the data on which we want to perform linear regression. Below is the syntax.
data = read.csv("path/filename")
We import the data using the above syntax and store it in the variable called data.
Once the data is imported, we inspect it with the str() function, which displays the structure of the imported data.
Having seen the structure of the data, we print the first few rows to get a clear idea of the data set.
It is also important to understand statistical features such as the mean and median, and how the data is labeled. We can use the summary() function to see the labels and a complete summary of the data.
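The three inspection steps above can be sketched as follows. The built-in mtcars data frame stands in for the imported data here; that substitution is an assumption for illustration only.

```r
# Illustration only: 'mtcars' stands in for the data frame read from file.
data <- mtcars

str(data)             # structure: variable names, types, first few values
print(head(data))     # first six rows
print(summary(data))  # min, quartiles, median, mean and max per column
```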
Once we have performed all the above steps, we build a linear model from the data. We start by seeding the random number generator, which is used to split the data at random.
We use set.seed(x) so that the random numbers, and therefore the split, are reproducible; x can be any integer.
An important step before we model the data is splitting it in two: training data and test data. The training data is 75% and the test data is 25% of the full data set. This step is called data division.
# split on the response column; passing the whole data frame would split by column count
data_split <- sample.split(data$X1.1, SplitRatio = 0.75)
training <- subset(data, data_split == TRUE)
test <- subset(data, data_split == FALSE)
The last step is to fit a linear model using the lm() function.
model <- lm(X1.1 ~ X0.00631 + X6.572 + X16.3 + X25, data = training)
Lastly, we display the summary of the model with the summary() function.
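The whole sequence of seeding, splitting, fitting and summarizing might look like the sketch below. It makes several assumptions: the built-in mtcars data set replaces the article's file, the predictors wt and hp are picked arbitrarily, and base R's sample() is used instead of caTools' sample.split() so that the example runs without extra packages. The final prediction step uses predict(), mentioned earlier in this article.

```r
# Illustration only: 'mtcars' and the chosen predictors are assumptions.
data <- mtcars

set.seed(123)  # fix the random number generator so the split is reproducible

# 75% of the row indices go to training, the remaining 25% to test
n <- nrow(data)
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
training <- data[train_idx, ]
test <- data[-train_idx, ]

# Fit the OLS model on the training rows only
model <- lm(mpg ~ wt + hp, data = training)
print(summary(model))

# Predict the held-out rows and measure the error on them
predicted <- predict(model, newdata = test)
rmse <- sqrt(mean((test$mpg - predicted)^2))
print(rmse)
```

Evaluating on the test rows rather than the training rows gives a more honest picture of how the model behaves on data it has not seen.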
Important Commands Used in OLS Model
Here we discuss some important commands for OLS regression in R:
1. Reading the Data
Below are commands required to read data.
- read.csv: To read data from a csv file.
- read.table: To read data from text files.
2. Commands to Display Data
Below are the commands required to display data.
- head(): Displays the first six rows of the data.
- str(): Shows the variables and their data types.
- rename(): Renames existing variables (from the dplyr package).
- names(): Shows the names of the variables.
- attach(): Attaches the data to the search path, which makes it easier to refer to variables by name.
3. Display Statistical Data
Below are the commands required to display statistical data.
- mean(x): Calculates the mean of variable x.
- median(x): Computes the median of variable x.
- sd(x): Computes the standard deviation of variable x.
- cor(matrix): Computes the correlation matrix of the columns of a matrix.
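A short sketch of these statistical commands, again using the built-in cars data set as a stand-in (an assumption for illustration):

```r
# Illustration only: 'cars' is a built-in data set with columns speed and dist.
x <- cars$speed
y <- cars$dist

print(mean(x))     # arithmetic mean of x
print(median(x))   # median of x
print(sd(x))       # sample standard deviation of x
print(cor(x, y))   # Pearson correlation between two vectors

# Passing a data frame or matrix returns the full correlation matrix
cor_matrix <- cor(cars)
print(cor_matrix)
```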
4. Graphical Commands
Below are the commands required to display graphical data.
- hist(x): Creates a histogram of the variable x.
- boxplot(x): Creates a box plot of the variable x.
- plot(x, y): Creates a scatter plot of y against x.
- stem(x): Creates a stem-and-leaf plot of the variable x.
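The graphical commands can be tried out as in the sketch below. The cars data set and the axis labels are assumptions chosen for illustration. When run non-interactively, R writes the plots to a file instead of opening a window.

```r
# Illustration only: 'cars' is a built-in data set with columns speed and dist.
x <- cars$speed
y <- cars$dist

h <- hist(x, main = "Histogram of speed")   # distribution of x
boxplot(x, main = "Box plot of speed")      # median, quartiles, extremes
plot(x, y, xlab = "speed", ylab = "dist")   # scatter plot of y against x
stem(x)                                     # text-based stem-and-leaf plot
```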
OLS Diagnostics in R
Here are some of the diagnostics for OLS in the R language:
- After the OLS model is built, we have to carry out post-estimation analysis on it.
- We use diagnostic plots of the fitted model to check its assumptions and to see which observations drive the fit.
- Outliers are important because they are unusual observations, typically points with large residuals.
- Leverage is the potential of an observation to change the slope of the regression line.
- The influence of an observation is the combination of its leverage and how much of an outlier it is.
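The three ideas of outliers, leverage and influence can be inspected with base R functions, as in the sketch below. The cars data set and the cutoff values are assumptions: the cutoffs are common rules of thumb, not fixed limits.

```r
# Illustration only: 'cars' is a built-in data set; cutoffs are rules of thumb.
fit <- lm(dist ~ speed, data = cars)

lev <- hatvalues(fit)        # leverage of each observation
res <- rstandard(fit)        # standardized residuals (outlier check)
cook <- cooks.distance(fit)  # influence: combines leverage and residual size

high_leverage <- which(lev > 2 * mean(lev))
outliers <- which(abs(res) > 2)
influential <- which(cook > 4 / nrow(cars))

print(high_leverage)
print(outliers)
print(influential)

# Standard diagnostic plots for an lm fit, arranged in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(fit)
```

An observation flagged in all three lists deserves a closer look before the model is trusted.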
This is a guide to OLS Regression in R. Here we discuss the introduction and implementation steps of OLS regression in R along with its important commands.