Updated March 24, 2023

Introduction to Predict function in R

In R, predict() applies an already fitted model to new data. It is typically called as predict(model, newdata), where model is the fitted model object and newdata holds the observations to predict for. The data on which the model was built is referred to as the train dataset, and the data to which the model is then applied is referred to as the test dataset.
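As a rough illustration of the train/test idea, the sketch below splits R's built-in cars dataset (used later in this article) into two parts. The 40/10 split, the seed, and the variable names are assumptions made only for this example, not part of the original walkthrough.

```r
# Hypothetical split: 40 rows for training, 10 held out for testing.
set.seed(42)                                   # arbitrary seed, only for reproducibility
train_rows <- sample(nrow(cars), size = 40)    # row indices used to fit the model
train_data <- cars[train_rows, ]
test_data  <- cars[-train_rows, ]

# Fit on the train dataset only, then apply the model to the test dataset.
model <- lm(dist ~ speed, data = train_data)
test_predictions <- predict(model, newdata = test_data)

# One predicted stopping distance per held-out row.
length(test_predictions)
```

The rest of this article keeps things simpler and fits the model on all 50 rows.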

Predictive Analytics (Machine Learning)

Predictive analytics covers every time frame: it considers historical as well as current data and, on that basis, builds a model that predicts (or forecasts) future data. It includes statistical techniques, predictive modeling, machine learning, etc.

For example, take the loan scenario used for descriptive analytics below. Once we have fitted a model on the historical or current data, we call predict on the new input data, and the model tells us which of the new customers are likely to default on their loans. This gives the company a good heads-up about the direction in which it has to work.

Descriptive Analytics (Business Intelligence)

In this branch of analytics, we interpret historical data to understand the changes that have occurred in the business. The main descriptive analytics techniques are data aggregation and data mining, which give us knowledge about past events. For example, suppose you are studying the data of people who took a loan and you specifically want to know which type of people default. By studying the data closely, we can identify what kind of people default: what their age is, whether they live in the same location, whether they share an occupation, or whether they work in the same industry sector. This helps us learn from historical mistakes.

Prescriptive Analytics (Decision Science)

It combines descriptive and predictive analytics and helps the company make effective decisions. It can answer a question such as which type of customer will default on a loan and, at the same time, suggest what the company should do to reduce the number of defaults.

In this blog, we will focus on predictive analytics, where we develop data science models that help us predict “what next”. For prediction, R already provides modeling functions such as lm and glm, and random forests are available through add-on packages. We will talk about “lm” here.

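For a model fitted with lm, the predict function syntax in R looks like this (recent R versions may add further arguments):

predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf, interval = c("none", "confidence", "prediction"), level = 0.95, type = c("response", "terms"), terms = NULL, na.action = na.pass, pred.var = res.var/weights, weights = 1, ...)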
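Because the argument list can differ slightly between R versions, one way to see the exact syntax for your own installation is to ask R for it. This is a minimal sketch; predict_lm is just a name chosen for this example.

```r
# Look up the "lm" method that predict() dispatches to for lm models.
predict_lm <- utils::getS3method("predict", "lm")

# Print the full argument list with defaults for this R installation.
args(predict_lm)

# The argument names can also be listed programmatically.
names(formals(predict_lm))
```

The same information is available in the help page opened by ?predict.lm.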

Arguments

object is a fitted model object of a class inheriting from “lm”
newdata is an optional data frame containing the values to predict for; if it is omitted, the fitted values are returned
se.fit is set to TRUE when standard errors are required
scale is the scale parameter used for the standard error calculation; it is generally NULL
df is the degrees of freedom for scale
interval is the type of interval to calculate: "none", "confidence" or "prediction"
level is the confidence level the researcher is comfortable with; some studies use 95% confidence and some use 99% (the default is 0.95)
type is the type of prediction: "response" or "terms"
na.action is a function that determines what to do with missing values in newdata; the default, na.pass, predicts NA for them
pred.var is the variance for future observations, which is assumed for prediction intervals
weights are the variance weights for prediction

We will work on a dataset that is built into R, known as “cars”, and build a linear regression model that predicts the stopping distance on the basis of the speed.

This dataset has 50 observations of 2 variables.

The first variable is speed (mph), which is numeric
The second variable is dist, the stopping distance (ft), which is also numeric

The “cars” dataset looks like this.

Case Number	Speed	Distance
1	4	2
2	4	10
3	7	4
4	7	22
5	8	16
6	9	10
7	10	18
8	10	26
9	10	34
10	11	17
11	11	28
12	12	14
13	12	20
14	12	24
15	12	28
16	13	26
17	13	34
18	13	34
19	13	46
20	14	26
21	14	36
22	14	60
23	14	80
24	15	20
25	15	26
26	15	54
27	16	32
28	16	40
29	17	32
30	17	40
31	17	50
32	18	42
33	18	56
34	18	76
35	18	84
36	19	36
37	19	46
38	19	68
39	20	32
40	20	48
41	20	52
42	20	56
43	20	64
44	22	66
45	23	54
46	24	70
47	24	92
48	24	93
49	24	120
50	25	85
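The table above can be reproduced directly in an R session. The following sketch only uses base R functions on the built-in dataset; cars_dims is a name introduced here for illustration.

```r
# cars ships with R (datasets package), so no import is needed.
cars_dims <- dim(cars)    # number of rows and columns
cars_dims

str(cars)                 # variable names and types
head(cars)                # first six observations
summary(cars)             # range and quartiles of speed and dist
```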

Now we will build the linear regression model, because to predict something we need a model that has both an input and an output. Once the model has learned how the data behaves, it can provide predicted figures for any input supplied. We will come to the prediction part in a while; first, we will fit the model.

linear_model <- lm(dist ~ speed, data = cars)
linear_model

The Linear regression model equation is:

Y = β1 + β2X + ϵ

X = Independent Variable
Y = Dependent Variable
β1 = Intercept of the regression model
β2 = Slope of the regression model
ϵ = error term

When we substitute the variables of our model, the equation looks like:

dist = β1 + β2(speed) + ϵ

And when we substitute the coefficients estimated by our model, it looks like:

dist = -17.579 + 3.932(speed)
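The fitted equation can be checked against the model object. This sketch reads the coefficients with coef() and applies the equation by hand for a speed of 10 mph; the names coefs, manual and via_predict are chosen for this example only.

```r
linear_model <- lm(dist ~ speed, data = cars)

# Estimated intercept (β1) and slope (β2), rounded as in the equation above.
coefs <- coef(linear_model)
round(coefs, 3)

# Apply dist = β1 + β2 * speed by hand for speed = 10.
manual <- unname(coefs["(Intercept)"] + coefs["speed"] * 10)

# The same value computed by predict().
via_predict <- unname(predict(linear_model, newdata = data.frame(speed = 10)))

all.equal(manual, via_predict)
```

Both routes give the same number, which shows that predict() on an lm model evaluates the fitted equation for each row of the new data.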

Now that we have a model, we can predict values for a new dataset by giving its inputs to the model.

Case Number	Speed	Distance
51	10	To be predicted
52	12	To be predicted
53	15	To be predicted
54	18	To be predicted
55	10	To be predicted
56	14	To be predicted
57	20	To be predicted
58	25	To be predicted
59	14	To be predicted
60	12	To be predicted

We will provide the above speed variable data as an input to our model.

We can predict the values by using the predict() function in R.

Example:

Input_variable_speed <- data.frame(speed = c(10, 12, 15, 18, 10, 14, 20, 25, 14, 12))
linear_model <- lm(dist ~ speed, data = cars)
predict(linear_model, newdata = Input_variable_speed)

Now we have predicted values of the distance variable. We should also attach a confidence level to these predictions, which will show how sure we are about the predicted values.

Output with predicted values.

Case Number	Speed	Distance
51	10	21.74499
52	12	29.60981
53	15	41.40704
54	18	53.20426
55	10	21.74499
56	14	37.47463
57	20	61.06908
58	25	80.73112
59	14	37.47463
60	12	29.60981
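predict() returns only the predicted values, so a table like the one above has to be assembled by hand. One possible way is sketched below; the results data frame and the predicted_dist column name are assumptions made for this example.

```r
Input_variable_speed <- data.frame(speed = c(10, 12, 15, 18, 10, 14, 20, 25, 14, 12))
linear_model <- lm(dist ~ speed, data = cars)

# Store the predictions next to the speeds they belong to.
results <- Input_variable_speed
results$predicted_dist <- predict(linear_model, newdata = Input_variable_speed)

# Label the rows with the case numbers 51 to 60 used in the table.
rownames(results) <- 51:60
results
```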

Confidence interval of Predict Function in R

It helps us deal with the uncertainty around the mean predictions. By setting the interval argument of predict() to "confidence", we get a confidence interval for each prediction. The 95% confidence level is the default (level = 0.95).

Example

Input_variable_speed <- data.frame(speed = c(10, 12, 15, 18, 10, 14, 20, 25, 14, 12))
linear_model <- lm(dist ~ speed, data = cars)
predict(linear_model, newdata = Input_variable_speed, interval = "confidence")

Output:

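The 95% confidence interval associated with a speed of 10 is about (15.46, 28.03). This means that, according to our model, we can be 95% confident that the mean stopping distance of cars traveling at 10 mph lies between 15.46 and 28.03 ft. The range for an individual car is wider and is given by a prediction interval.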
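To see how the interval type and the confidence level change the result, the sketch below compares three calls for a single speed of 10 mph. The variable names (speed_10, conf_95, conf_99, pred_95) are chosen for this example only.

```r
linear_model <- lm(dist ~ speed, data = cars)
speed_10 <- data.frame(speed = 10)

# Default 95% confidence interval for the mean stopping distance.
conf_95 <- predict(linear_model, newdata = speed_10, interval = "confidence")

# Same interval at a 99% confidence level: wider than the 95% one.
conf_99 <- predict(linear_model, newdata = speed_10,
                   interval = "confidence", level = 0.99)

# 95% prediction interval for a single car: wider again, because it
# also accounts for the scatter of individual observations.
pred_95 <- predict(linear_model, newdata = speed_10, interval = "prediction")

# Stack the three results to compare the lwr and upr columns.
rbind(conf_95, conf_99, pred_95)
```

Which interval to report depends on the question: use the confidence interval for the average stopping distance at a given speed, and the prediction interval for the stopping distance of one particular car.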