Quick Concept: ANOVA

Gustavo Santos
2 min read · Apr 15, 2024

Understand what ANOVA is and how to use it.

Image by the author.

ANOVA

ANOVA, or Analysis of Variance, is a statistical test used to determine whether at least one group mean is significantly different from the others.

We use a t-test to check for a statistical difference between two means. When we have many groups to compare, running a t-test for every pair of groups quickly becomes overwhelming, because the number of tests to perform grows fast.

ANOVA solves that problem by testing all the group averages at once.
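To see how quickly the pairwise approach grows, here is a small sketch in Python. The group labels are only illustrative, and the error-rate line assumes the tests are independent, which is a simplification rather than something that holds exactly for pairwise t-tests on shared groups.

```python
from itertools import combinations
from math import comb

groups = ["Page 1", "Page 2", "Page 3", "Page 4"]  # illustrative labels
alpha = 0.05  # significance level used for each individual test

# Every pair of groups would need its own t-test
pairs = list(combinations(groups, 2))
n_tests = comb(len(groups), 2)

# Chance of at least one false positive across all tests,
# ASSUMING the tests are independent (a simplification)
familywise_error = 1 - (1 - alpha) ** n_tests

print(n_tests)                     # 6
print(round(familywise_error, 3))  # 0.265
```

With only four groups we already need six t-tests, and under that assumption the chance of at least one false alarm is well above 5%.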

  • Ho: There is no statistically significant difference between the group averages
  • Ha: At least one group average is significantly different from the others

How to Perform in R

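In R, we can use the built-in function aov(target ~ explanatory, data). The explanatory variable should be a factor, so that R treats its values as group labels rather than as numbers.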

# Create a dataset (Page is a factor, so aov() treats it as four groups)
four_sessions <- data.frame(
  Page = factor(c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4)),
  Time = c(164, 172, 177, 156, 195, 172, 178, 191, 182, 185, 177, 185,
           175, 193, 171, 163, 176, 176, 155, 166, 164, 170, 168, 162)
)

# ANOVA test
summary(aov(Time ~ Page, data = four_sessions))

## [OUT] ##
            Df Sum Sq Mean Sq F value Pr(>F)
Page         3   1093   364.4   4.472 0.0147 *
Residuals   20   1630    81.5
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

At a significance level of 0.05, the p-value (0.0147) is below the threshold, so we can reject Ho and infer that at least one group average is significantly different from the others.

Alternatively, there is the aovp() function from the lmPerm package, which runs a permutation-based version of the test.
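The permutation idea behind aovp() can be sketched in a few lines of Python. This is only a rough illustration of the concept, not lmPerm's actual algorithm: we shuffle the times across the pages many times and count how often the shuffled F statistic is at least as large as the observed one. The helper f_statistic, the seed and the number of permutations are all choices made for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

page = np.repeat([1, 2, 3, 4], 6)
time = np.array([164, 172, 177, 156, 195, 172, 178, 191, 182, 185, 177, 185,
                 175, 193, 171, 163, 176, 176, 155, 166, 164, 170, 168, 162])

def f_statistic(values, labels):
    """F statistic of a one-way ANOVA for values grouped by labels (hypothetical helper)."""
    samples = [values[labels == g] for g in np.unique(labels)]
    return stats.f_oneway(*samples).statistic

observed = f_statistic(time, page)

# Shuffle the times across pages and recompute F each time
n_perm = 2000  # arbitrary number of permutations
perm_stats = np.array([f_statistic(rng.permutation(time), page)
                       for _ in range(n_perm)])

# Share of shuffles with an F at least as extreme as the observed one
p_perm = (np.sum(perm_stats >= observed) + 1) / (n_perm + 1)
print(observed, p_perm)
```

The permutation p-value will vary a little with the seed and the number of shuffles, so it should not be expected to match the classical p-value exactly.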

How to Perform in Python

In Python, we can implement the solution using the f_oneway function from scipy.stats.

The downside is that f_oneway expects one array per group, so we must separate the groups ourselves. That adds an extra step.

import pandas as pd
from scipy import stats

# Create a dataset
four_sessions = pd.DataFrame({
    'Page': [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4],
    'Time': [164, 172, 177, 156, 195, 172, 178, 191, 182, 185, 177, 185,
             175, 193, 171, 163, 176, 176, 155, 166, 164, 170, 168, 162]
})

# Separate groups: one Series of times per page
by_page = four_sessions.groupby('Page')['Time']
pg1 = by_page.get_group(1)
pg2 = by_page.get_group(2)
pg3 = by_page.get_group(3)
pg4 = by_page.get_group(4)

# ANOVA test
stats.f_oneway(pg1, pg2, pg3, pg4)

## [OUT] ##
F_onewayResult(statistic=4.472230745627494, pvalue=0.01471358269967038)

We get the same result as in R. At a significance level of 0.05, the p-value (0.0147) is below the threshold, so we can reject Ho and infer that at least one group average is significantly different from the others.
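To make the numbers in the output less of a black box, here is a minimal sketch that rebuilds the F statistic by hand with NumPy, using the same four groups. The variable names are my own. It follows the textbook one-way ANOVA decomposition: variation between group means divided by variation within groups.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([164, 172, 177, 156, 195, 172]),  # Page 1
    np.array([178, 191, 182, 185, 177, 185]),  # Page 2
    np.array([175, 193, 171, 163, 176, 176]),  # Page 3
    np.array([155, 166, 164, 170, 168, 162]),  # Page 4
]

k = len(groups)                    # number of groups
n = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Sum of squares between groups and within groups
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1   # 3, the "Page" row in the R output
df_within = n - k    # 20, the "Residuals" row in the R output

# F is the ratio of the two mean squares
f_stat = (ss_between / df_between) / (ss_within / df_within)

# p-value: upper tail of the F distribution
p_value = stats.f.sf(f_stat, df_between, df_within)

print(round(float(f_stat), 3), round(float(p_value), 4))  # 4.472 0.0147
```

The degrees of freedom, sums of squares and F value line up with the rows of the R summary table above.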

References

BRUCE, Peter et al. 2019. Practical Statistics for Data Scientists. O’Reilly.

