Introduction to Linear Model in R

A linear model in R is a statistical model that describes the relationship between a dependent variable and one or more independent variables. The variables involved must satisfy certain assumptions before the model is built. The model is fitted with the lm() function, whose main arguments are formula, data, subset, weights, method, model, offset, etc. Not all of them have to be used every time; supply only those that are essential in the given context.

Advantages of Linear Model:

Helps us to understand the type and nature of the data.
Helps us to predict the data.
Helps us to make statistical inferences from data.

Linear regression is the statistical method used to create these models. Its main objective is to explain the relationship between the dependent variable and the independent variables.

Syntax of Linear Model in R

The syntax of the lm() function, which fits a linear model in R, is given below.

Syntax:

lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, offset, ...)

The parameters of lm() are explained below:

formula: The variables of our dataset whose relationship we want to examine. The format is fixed: (Dependent variable ~ Independent variables). E.g. (Distance ~ Speed), (Demand ~ Price), etc.
data: An optional data frame, list, or environment that contains the variables named in the formula.
subset: An optional vector that specifies which observations to use in the fit.
weights: Accepts only a numeric vector or NULL. If it is not NULL, weighted least squares (WLS) is used with these weights; if it is NULL, ordinary least squares (OLS) is used.
na.action: A function that specifies what should be done when the data contain NA values, such as na.fail, na.omit, na.exclude, etc.
method: The method used for fitting. Only "qr" is supported for fitting; "model.frame" returns the model frame without fitting.
model: A logical value. If TRUE, the model frame is returned as a component of the fit.
x: A logical value. If TRUE, the model matrix is returned as a component of the fit.
y: A logical value. If TRUE, the response is returned as a component of the fit.
qr: A logical value. If TRUE, the QR decomposition is returned as a component of the fit.
singular.ok: A logical value. If FALSE, a singular fit is an error.
offset: Can be NULL, a numeric vector, or a matrix. It specifies an a priori known component to be included in the linear predictor during fitting.
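As a minimal sketch of how several of these arguments combine, the call below fits a model on the built-in USArrests data (the dataset used later in this article) on a subset of the rows and asks for the model matrix to be returned. The cutoff of 50 and the object name fit_subset are illustrative assumptions, not part of the original text.

```r
# Illustrative only: the cutoff (UrbanPop > 50) and the name fit_subset are assumptions.
fit_subset <- lm(
  Assault ~ UrbanPop,         # formula: dependent ~ independent
  data      = USArrests,      # data frame holding the variables
  subset    = UrbanPop > 50,  # keep only states with UrbanPop above 50
  na.action = na.omit,        # drop rows with missing values (none in this dataset)
  x         = TRUE            # also return the model matrix
)

nobs(fit_subset)     # number of observations actually used in the fit
head(fit_subset$x)   # model matrix: intercept column plus UrbanPop
```

Leaving weights unset, as here, means the fit is ordinary least squares.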

Types of Linear Model in R

Let’s now discuss different types of linear models which are as follows:

1. Simple Linear Regression

This model helps us to explain the relationship between one dependent variable and one independent variable. With its help, we can also predict the dependent variable by providing input values. The dependent variable is also known as the response variable, regressand, observed variable, responding variable, measured variable, explained variable, experimental variable, outcome variable, and/or output variable. The independent variable is also known as the controlled variable, regressor, explanatory variable, manipulated variable, exposure variable, and/or input variable. The equation for the simple linear regression model is:

Y = β1 + β2X + ϵ

Where β1 is the intercept, β2 is the slope and ϵ is the error term. We will use the “USArrests” data set.

State	Murder arrests (per 100,000)	Assault arrests (per 100,000)	Percent urban population	Rape arrests (per 100,000)
Alabama	13.2	236	58	21.2
Alaska	10	263	48	44.5
Arizona	8.1	294	80	31
Arkansas	8.8	190	50	19.5
California	9	276	91	40.6
Colorado	7.9	204	78	38.7
Connecticut	3.3	110	77	11.1
Delaware	5.9	238	72	15.8
Florida	15.4	335	80	31.9
Georgia	17.4	211	60	25.8
Hawaii	5.3	46	83	20.2
Idaho	2.6	120	54	14.2
Illinois	10.4	249	83	24
Indiana	7.2	113	65	21
Iowa	2.2	56	57	11.3
Kansas	6	115	66	18
Kentucky	9.7	109	52	16.3
Louisiana	15.4	249	66	22.2
Maine	2.1	83	51	7.8
Maryland	11.3	300	67	27.8
Massachusetts	4.4	149	85	16.3
Michigan	12.1	255	74	35.1
Minnesota	2.7	72	66	14.9
Mississippi	16.1	259	44	17.1
Missouri	9	178	70	28.2
Montana	6	109	53	16.4
Nebraska	4.3	102	62	16.5
Nevada	12.2	252	81	46
New Hampshire	2.1	57	56	9.5
New Jersey	7.4	159	89	18.8
New Mexico	11.4	285	70	32.1
New York	11.1	254	86	26.1
North Carolina	13	337	45	16.1
North Dakota	0.8	45	44	7.3
Ohio	7.3	120	75	21.4
Oklahoma	6.6	151	68	20
Oregon	4.9	159	67	29.3
Pennsylvania	6.3	106	72	14.9
Rhode Island	3.4	174	87	8.3
South Carolina	14.4	279	48	22.5
South Dakota	3.8	86	45	12.8
Tennessee	13.2	188	59	26.9
Texas	12.7	201	80	25.5
Utah	3.2	120	80	22.9
Vermont	2.2	48	32	11.2
Virginia	8.5	156	63	20.7
Washington	4	145	73	26.2
West Virginia	5.7	81	39	9.3
Wisconsin	2.6	53	66	10.8
Wyoming	6.8	161	60	15.6
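USArrests ships with base R in the datasets package, so the table above does not need to be typed in. A quick way to inspect it, assuming a standard R installation:

```r
# USArrests is part of the 'datasets' package, which is attached by default.
data(USArrests)

str(USArrests)               # 50 observations of 4 variables
head(USArrests, 3)           # first three states
summary(USArrests$UrbanPop)  # range and quartiles of the urban population percentage
```

In R the four columns are named Murder, Assault, UrbanPop and Rape, and the state names are stored as row names.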

Now we will find the relationship between the Assault variable and the UrbanPop (urban population) variable.

> dataset = USArrests
> Linear_relationship1 = lm(Assault ~ UrbanPop, data = dataset)
> Linear_relationship1

Equation looks like:

Assault = β1 + β2(UrbanPop)

The output gives us the intercept and the slope: Assault = 73.08 + 1.49(UrbanPop). This is the linear model equation: we supply a value of “UrbanPop” as input, and the equation predicts the value of “Assault” for us. Let’s take another example of this model, this time with the Murder variable and the UrbanPop variable.

> View(USArrests)
> Linear_relationship2 = lm(Murder ~ UrbanPop, data = dataset)
> Linear_relationship2

Equation looks like:

Murder = β1 + β2(UrbanPop)

The output gives us the intercept and the slope: Murder = 6.41594 + 0.02093(UrbanPop). This is the linear model equation: we supply a value of “UrbanPop” as input, and the equation predicts the value of “Murder” for us.
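To use a fitted equation for prediction, a sketch along these lines may help. It refits the Murder model, pulls out the coefficients with coef(), and compares a hand-computed prediction with the value returned by predict(). The input value UrbanPop = 70 and the names fit_murder and new_state are assumptions chosen for illustration.

```r
fit_murder <- lm(Murder ~ UrbanPop, data = USArrests)

b <- coef(fit_murder)                    # named vector: "(Intercept)" and "UrbanPop"
new_state <- data.frame(UrbanPop = 70)   # hypothetical input value

# Apply the equation Murder = intercept + slope * UrbanPop by hand ...
by_hand <- unname(b["(Intercept)"] + b["UrbanPop"] * 70)
# ... and let predict() do the same from the fitted model.
by_predict <- unname(predict(fit_murder, newdata = new_state))

all.equal(by_hand, by_predict)           # TRUE: both apply the same equation
```

The column name in the newdata data frame has to match the independent variable used in the formula, otherwise predict() cannot find the input.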

2. Multiple Linear Regression

In this model, we have one dependent variable and multiple independent variables, which are used together to predict the dependent variable. Let’s take an example of this model, again using the “USArrests” dataset.

Dependent variable = UrbanPop
Independent variables = Assault, Rape, and Murder

> Multiple_Linear_Relationship = lm(UrbanPop ~ Assault + Rape + Murder, data = dataset)
> Multiple_Linear_Relationship

The equation looks like:

UrbanPop = β1 + β2(Assault) + β3(Rape) + β4(Murder)

The output gives us the intercept and one slope per independent variable: UrbanPop = 52.8419 + 0.0519(Assault) + 0.6984(Rape) - 1.4115(Murder)
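The printed coefficients alone do not say how well the model fits. As a hedged sketch, summary() on the same model reports standard errors, p-values and R-squared; the object names fit_multi and fit_summary are illustrative assumptions.

```r
fit_multi <- lm(UrbanPop ~ Assault + Rape + Murder, data = USArrests)

coef(fit_multi)              # intercept plus one slope per independent variable

fit_summary <- summary(fit_multi)
fit_summary$coefficients     # estimates, standard errors, t values, p-values
fit_summary$r.squared        # share of the variance in UrbanPop explained by the model
```

Each slope is interpreted with the other independent variables held fixed, which is why a coefficient can change sign or size when variables are added or removed.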

Conclusion

A simple linear model revolves around two parameters: the slope, often described as the rate of change, and the intercept, which is the value of the dependent variable when the independent variable is zero. These models are very commonly used when we are dealing with numeric data. Their results are easy to break down and interpret, which is why researchers, academicians, and economists prefer them.