One–Way Analysis of Variance
Analysis of variance (ANOVA) is a procedure for comparing means across three or more populations. We frame two hypotheses: the null hypothesis, “All population means are equal”, and the alternative hypothesis, “Not all population means are equal”. ANOVA tests the equality of several means in a single test, rather than comparing two means at a time, which becomes impractical and inflates the overall Type I error rate as the number of groups grows. In this topic, we are going to learn about One-Way ANOVA in R.
One-way analysis of variance examines the effect of a single factor. For example, suppose there are five regions and we want to check whether mean daily rainfall is the same in all five. Here the only factor is region: we are asking whether the region affects the amount of rainfall received.
Assumptions of Analysis of Variance
The following are the assumptions that must be met for applying one-way ANOVA:
- The populations from which the samples are drawn are normally distributed.
- The populations from which the samples are drawn have the same variance or standard deviation.
- The samples drawn from different populations are random and independent.
How Does One-Way ANOVA in R Work?
For our demonstration, we use a dataset containing two variables, Brand and Sales. There are four brands: ATB, JKV, MKL, and PRQ, and monthly sales are given for each. We need to check whether mean sales are equal across the four brands or differ from one another, and for this we will use one-way ANOVA. The step-by-step procedure is as follows:
- First, import the data into R. The data is present in a CSV format. So, to import it, we will use the read.csv() function.
- View the first few records of the data. This is important for checking that the data has been imported into R correctly. Similarly, we will apply the summary() function to the data to get some basic insights.
- Every time we use a variable from the dataset, we have to name the dataset explicitly, as in brand_sales_data$Brand or brand_sales_data$Sales. To avoid this, we can use the attach() function, which makes the columns of the data frame available by name.
- Let’s aggregate Sales by Brand using the mean and the standard deviation. Aggregation gives us a basic idea of the data.
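The four steps above might look like the following sketch. The article's CSV file is not reproduced here, so the code first writes a small simulated file to a temporary directory. The file name, group means, spread and sample size are all assumptions made for illustration, and the printed numbers will not match the article's results.

```r
# Simulated stand-in for the article's CSV (these are NOT the real sales data)
set.seed(42)
sim <- data.frame(
  Brand = rep(c("ATB", "JKV", "MKL", "PRQ"), each = 12),
  Sales = round(rnorm(48, mean = rep(c(600, 800, 700, 710), each = 12), sd = 40))
)
csv_path <- file.path(tempdir(), "brand_sales.csv")  # hypothetical file name
write.csv(sim, csv_path, row.names = FALSE)

# Step 1: import the CSV; Brand is read as a factor (the grouping variable)
brand_sales_data <- read.csv(csv_path, stringsAsFactors = TRUE)

# Step 2: inspect the first rows and the basic summaries
print(head(brand_sales_data))
print(summary(brand_sales_data))

# Step 3: attach the data frame so its columns can be used by name
attach(brand_sales_data)

# Step 4: group means and standard deviations of Sales by Brand
brand_means <- aggregate(Sales, by = list(Brand = Brand), FUN = mean)
brand_sds   <- aggregate(Sales, by = list(Brand = Brand), FUN = sd)
print(brand_means)
print(brand_sds)

# Undo attach() so the column names do not mask other objects later
detach(brand_sales_data)
```

The later sketches pass `data = brand_sales_data` to each function instead of relying on attach(), which avoids name clashes in longer scripts.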
The above result shows that the means for the four groups are not equal; JKV has the highest mean sales.
As can be seen above, the standard deviations of the four groups do not differ greatly; the highest is for the brand MKL.
- Now, we will apply ANOVA to test whether the means across the four populations are equal or whether any of them differ.
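A minimal sketch of this step uses aov() from base R. The data frame below is simulated as a placeholder for the article's data (group means and spread are assumptions), so the F statistic and p-value it prints are illustrative only.

```r
# Simulated placeholder data (NOT the article's sales figures)
set.seed(42)
brand_sales_data <- data.frame(
  Brand = factor(rep(c("ATB", "JKV", "MKL", "PRQ"), each = 12)),
  Sales = round(rnorm(48, mean = rep(c(600, 800, 700, 710), each = 12), sd = 40))
)

# Fit the one-way ANOVA: Sales is the response, Brand the single factor
fit <- aov(Sales ~ Brand, data = brand_sales_data)

# The ANOVA table reports the F statistic and its p-value for Brand
anova_table <- summary(fit)
print(anova_table)

# The p-value can also be extracted programmatically
p_value <- anova_table[[1]][["Pr(>F)"]][1]
```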
From the results above, we can see that the ANOVA test for Brand is significant (p < 0.0001). This suggests that the brands do not all enjoy the same level of preference in the market, which affects their sales. Many factors could be behind this, including customers' liking for a particular brand.
- The above result can be visualized, which makes interpretation easier. For that, we will use the plotmeans() function from the gplots package.
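The plot could be produced along the following lines. The data are simulated placeholders, the axis labels and title are assumptions, and the call is wrapped in a requireNamespace() check only so that the snippet still runs where gplots is not installed.

```r
# Simulated placeholder data (NOT the article's sales figures)
set.seed(42)
brand_sales_data <- data.frame(
  Brand = factor(rep(c("ATB", "JKV", "MKL", "PRQ"), each = 12)),
  Sales = round(rnorm(48, mean = rep(c(600, 800, 700, 710), each = 12), sd = 40))
)

# plotmeans() draws each group mean with a confidence interval (95% by default)
if (requireNamespace("gplots", quietly = TRUE)) {
  gplots::plotmeans(Sales ~ Brand, data = brand_sales_data,
                    xlab = "Brand", ylab = "Sales",
                    main = "Mean plot with 95% CI")
} else {
  message("Install gplots with install.packages('gplots') to draw the plot")
}
```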
As we can see above, the plotmeans() function in the gplots package enables us to visually compare the means of different groups. We can see that means are not the same across the four brands. However, the means for the brands MKL and PRQ fall in close range.
- The above analysis tells us whether the brands have equal means, but it does not tell us which brands differ. To make pairwise comparisons between brands, we can use the TukeyHSD() function, which checks whether each brand is significantly different from each of the others.
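A sketch of the pairwise comparison with base R's TukeyHSD() is shown here. The data are simulated placeholders, so the differences and adjusted p-values it prints will not match the article's output.

```r
# Simulated placeholder data (NOT the article's sales figures)
set.seed(42)
brand_sales_data <- data.frame(
  Brand = factor(rep(c("ATB", "JKV", "MKL", "PRQ"), each = 12)),
  Sales = round(rnorm(48, mean = rep(c(600, 800, 700, 710), each = 12), sd = 40))
)

fit <- aov(Sales ~ Brand, data = brand_sales_data)

# Tukey's Honest Significant Differences for every pair of brands;
# each row gives the difference in means, its interval, and an adjusted p-value
tukey <- TukeyHSD(fit)
print(tukey)

# Adjusted p-value for one specific pair
p_prq_mkl <- tukey$Brand["PRQ-MKL", "p adj"]
```

With four brands there are six pairs, so the table has six rows.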
The pairwise comparisons are shown above. The difference between two groups is significant if its adjusted p-value falls below the chosen significance level (e.g. 0.05). As we can see, the p-value for the PRQ-MKL pair is much higher, which indicates that these two brands are not significantly different from each other.
To visualize the pairwise comparisons, we will plot the above results as below:
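One way to draw this plot is sketched here with two par() calls followed by plot(). The data are simulated placeholders, and the specific settings (las = 1 and the margin sizes) are assumptions chosen to match the description of horizontal labels and a wider margin.

```r
# Simulated placeholder data (NOT the article's sales figures)
set.seed(42)
brand_sales_data <- data.frame(
  Brand = factor(rep(c("ATB", "JKV", "MKL", "PRQ"), each = 12)),
  Sales = round(rnorm(48, mean = rep(c(600, 800, 700, 710), each = 12), sd = 40))
)
fit <- aov(Sales ~ Brand, data = brand_sales_data)

old_par <- par(no.readonly = TRUE)  # remember the current graphics settings

par(las = 1)               # draw axis labels horizontally (assumed setting)
par(mar = c(5, 8, 4, 2))   # widen the left margin for the pair labels (assumed sizes)

# One confidence interval per pair; intervals crossing zero are not significant
plot(TukeyHSD(fit))

par(old_par)               # restore the previous settings
```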
The first par() call rotates the axis labels so that they are horizontal, and the second adjusts the margins so that the labels fit; otherwise, they would run off the plot.
The above graph offers good insight, but plotting the results as a boxplot gives a clearer picture.
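The boxplot with group letters might be produced as sketched here with glht() and cld() from the multcomp package. The data are simulated placeholders, the margin and colour settings are assumptions, and the code is guarded so that it still runs where multcomp is not installed.

```r
# Simulated placeholder data (NOT the article's sales figures)
set.seed(42)
brand_sales_data <- data.frame(
  Brand = factor(rep(c("ATB", "JKV", "MKL", "PRQ"), each = 12)),
  Sales = round(rnorm(48, mean = rep(c(600, 800, 700, 710), each = 12), sd = 40))
)
fit <- aov(Sales ~ Brand, data = brand_sales_data)

if (requireNamespace("multcomp", quietly = TRUE)) {
  library(multcomp)
  old_par <- par(no.readonly = TRUE)
  par(mar = c(5, 4, 6, 2))  # extra top margin for the letters (assumed sizes)

  # All pairwise (Tukey) contrasts between the levels of Brand
  tuk <- glht(fit, linfct = mcp(Brand = "Tukey"))

  # cld() assigns letters: groups sharing a letter do not differ at `level`
  plot(cld(tuk, level = 0.05), col = "lightgrey")

  par(old_par)
} else {
  message("Install multcomp with install.packages('multcomp') to draw the plot")
}
```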
The glht() function used above comes with a comprehensive set of methods for comparing multiple means. Note that the level option in the cld() function is the significance level (e.g. 0.05, which corresponds to 95 percent confidence).
The above plot makes it easy to compare means across the groups and to interpret the results systematically. Letters appear along the top of the plot for each brand. If two brands share a letter, their means are not significantly different, as with MKL and PRQ here, which both carry the letter b.
- So far, we have implemented ANOVA and used plots to visualize the results. However, it is equally important to test the assumptions. First, we will check the normality assumption.
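A sketch of the normality check with qqPlot() from the car package is given here. The data are simulated placeholders, the plot title is an assumption, and the call is guarded so that the snippet still runs where car is not installed.

```r
# Simulated placeholder data (NOT the article's sales figures)
set.seed(42)
brand_sales_data <- data.frame(
  Brand = factor(rep(c("ATB", "JKV", "MKL", "PRQ"), each = 12)),
  Sales = round(rnorm(48, mean = rep(c(600, 800, 700, 710), each = 12), sd = 40))
)

# The same one-way model, fitted with lm() so its residuals can be examined
fit_lm <- lm(Sales ~ Brand, data = brand_sales_data)

if (requireNamespace("car", quietly = TRUE)) {
  # Q-Q plot of studentized residuals with a simulated confidence envelope
  car::qqPlot(fit_lm, simulate = TRUE, main = "Q-Q Plot")
} else {
  message("Install car with install.packages('car') to draw the plot")
}
```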
The car package in R provides the function qqPlot(). The above plot shows that the data fall within the 95% confidence envelope. This indicates that the normality assumption is reasonably well met.
Next, we will check whether the variances across the brands are equal. For this, we will use Bartlett’s test.
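Bartlett's test is available in base R as bartlett.test(). The following sketch runs it on simulated placeholder data, so the statistic and p-value it prints are illustrative only.

```r
# Simulated placeholder data (NOT the article's sales figures)
set.seed(42)
brand_sales_data <- data.frame(
  Brand = factor(rep(c("ATB", "JKV", "MKL", "PRQ"), each = 12)),
  Sales = round(rnorm(48, mean = rep(c(600, 800, 700, 710), each = 12), sd = 40))
)

# Null hypothesis: the variance of Sales is the same in every Brand group
bt <- bartlett.test(Sales ~ Brand, data = brand_sales_data)
print(bt)

# A large p-value means there is no evidence that the variances differ
bt_p <- bt$p.value
```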
The p-value shows that the variances across the groups do not differ significantly.
Last but not least, we shall check whether there are any outliers that could affect the ANOVA results.
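One option for this check is outlierTest() from the car package, sketched here on simulated placeholder data. The call is guarded so that the snippet still runs where car is not installed, and the printed result is illustrative only.

```r
# Simulated placeholder data (NOT the article's sales figures)
set.seed(42)
brand_sales_data <- data.frame(
  Brand = factor(rep(c("ATB", "JKV", "MKL", "PRQ"), each = 12)),
  Sales = round(rnorm(48, mean = rep(c(600, 800, 700, 710), each = 12), sd = 40))
)
fit <- aov(Sales ~ Brand, data = brand_sales_data)

if (requireNamespace("car", quietly = TRUE)) {
  # Bonferroni-adjusted test on the largest absolute studentized residual
  print(car::outlierTest(fit))
} else {
  message("Install car with install.packages('car') to run the outlier test")
}
```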
From the above result, we can see that there is no indication of outliers in the data (the adjusted p-value is reported as NA when it would exceed 1).
Taking into consideration the results of the Q-Q plot, Bartlett’s test and the outlier test, we can say that the data meet all the ANOVA assumptions and that the results obtained are valid.
Conclusion – One Way ANOVA in R
ANOVA is a very handy statistical technique that can be used to compare means across multiple populations. R offers a comprehensive range of packages to implement ANOVA, derive results and validate the assumptions. In R, statistical results can be interpreted in visual forms that offer deeper insights.