Dr Charles Martin
The reference book for this class is Lazar et al. (2017), chapter 4, “Statistical Analysis”. It discusses statistical analysis in the context of HCI (but doesn’t show how to do it in Python).
Week 5 content recap: What data pre-processing do we need to do?
In an evaluation with two groups or conditions, we want to know whether differences are meaningful or not.
As an example: Is a height difference of 20cm meaningful?
So what do we do?
A typical cut-off for “significance” is p = 0.05. Is this the best choice?
Significance tests involve estimating probability distributions and a concept called degrees of freedom (df): the number of independent values that can vary in your analysis while still producing the statistic you need. In other words, df is how much information you have left after using some data to estimate parameters. For example, if you know the mean of 10 scores, only 9 can vary freely; the last one is determined.
Different tests use different df calculations: for an independent-samples t-test comparing two groups, df = n1 + n2 − 2.
Few samples leads to low df; more samples leads to higher df. More complex tests consume more degrees of freedom.
Low df leads to higher p: with fewer degrees of freedom you have less information, so a stronger test statistic is needed to reach significance.
df is often reported alongside your test statistic (e.g., t(28) = 2.45, p < 0.05) so readers can evaluate the analysis and its sample constraints.
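The point about low df can be made concrete with SciPy: the critical t value needed to reach p < 0.05 grows as df shrinks. The group sizes here are hypothetical.

```python
from scipy import stats

# hypothetical example: two groups of 15 participants each
n1, n2 = 15, 15
df = n1 + n2 - 2  # df for an independent-samples t-test

# critical t needed to reach p < 0.05 (two-tailed) at this df
t_crit = stats.t.ppf(1 - 0.025, df)
# with very few participants (df = 4), a much larger t is needed
t_crit_small = stats.t.ppf(1 - 0.025, 4)
print(f"df = {df}: need |t| > {t_crit:.3f}")
print(f"df = 4: need |t| > {t_crit_small:.3f}")
```

With df = 28 a test statistic of about 2.05 reaches significance; with df = 4 you need about 2.78.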
| Test Type | Specific Test | Use Case |
|---|---|---|
| t test | Independent-samples t test | Between-group comparison (2 groups) |
| t test | Paired-samples t test | Within-group comparison (same participants, 2 conditions) |
| ANOVA | One-way ANOVA | 1 independent variable, 3+ groups |
| ANOVA | Factorial ANOVA | 2+ independent variables |
| ANOVA | Repeated measures ANOVA | Same participants across 3+ conditions |
| ANOVA | Split-plot ANOVA | Mix of between- and within-subject factors |
These are parametric tests for continuous data, and they assume that the data is normally distributed.
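The normality assumption can be checked before choosing a parametric test. A minimal sketch using SciPy's Shapiro-Wilk test on simulated (hypothetical) task scores:

```python
import numpy as np
from scipy.stats import shapiro

# simulated, hypothetical task scores drawn from a normal distribution
rng = np.random.default_rng(42)
scores = rng.normal(loc=50, scale=10, size=30)

# Shapiro-Wilk test: a small p suggests the data is NOT normal
stat, p = shapiro(scores)
print(f"Shapiro-Wilk W = {stat:.4f}, p = {p:.4f}")
```

If p is below 0.05 here, the normality assumption is doubtful and a non-parametric test (covered later) may be safer.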
p < 0.05 (a 1-in-20 chance of a random result) is typically taken as evidence supporting a hypothesis.
from scipy.stats import ttest_ind, ttest_rel
# independent-samples t-test: observations from different participants
# equal_var=False gives Welch's t-test, which does not assume equal variances
t_stat, p_value = ttest_ind(group1, group2, equal_var=False)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")
# paired-samples t-test: different observations of the same participant
t_stat, p_value = ttest_rel(observation1, observation2)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")
What if you have more than two groups to compare? (e.g., three or more interface variations?)
What if you have more than one independent variable? (e.g., comparing the individual and combined effects of two separate aspects of an interface)
Analysis of variance (ANOVA) enables these more complicated comparisons.
An ANOVA’s output is a statistic called F, so it is sometimes called an F-test.
ANOVAs can be used in lots of situations: between-groups, within-groups, one or multiple independent variables, even multiple dependent variables.
Types of ANOVAs:
One-way ANOVA:
from scipy.stats import f_oneway
import statsmodels.api as sm
from statsmodels.formula.api import ols
# group by 'independent' column and compare dependent column
groups = [group['dependent'].values for _, group in df.groupby('independent')]
f_stat, p_value = f_oneway(*groups)
# create a Model from a formula and dataframe and run anova on that
model = ols('dependent ~ C(independent)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
Factorial ANOVA:
# factorial ANOVA: effects of two independent variables and their interaction
# e.g., tempo ~ key + mode + key:mode
model = ols('dep ~ C(ind_1) + C(ind_2) + C(ind_1):C(ind_2)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
These examples use the statsmodels package, which allows more serious statistical modelling for complex experiments.
Understand how variables relate to each other (e.g., is age or experience related to performance?)
Correlation:
Pearson’s r
r² (Coefficient of Determination)
Note: correlation not equal to causation!
Imagine an experiment measuring time spent in an online shopping app vs income.
E.g., income vs. performance may be correlated due to an intervening variable (e.g., age) rather than directly related.
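Pearson's r can be computed with SciPy; the experience and completion-time numbers here are hypothetical, chosen so the two variables are strongly (negatively) correlated.

```python
import numpy as np
from scipy.stats import pearsonr

# hypothetical data: years of experience vs task completion time (seconds)
experience = np.array([1, 2, 3, 4, 5, 6, 7, 8])
time = np.array([60, 55, 52, 48, 45, 44, 41, 40])

# Pearson's r and its p-value; r**2 is the coefficient of determination
r, p = pearsonr(experience, time)
print(f"r = {r:.3f}, r^2 = {r**2:.3f}, p = {p:.5f}")
```

Here r is close to −1 (more experience, faster completion), but remember: the correlation alone cannot say whether experience *causes* faster performance.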
Regression: examine the relationship between one dependent variable and one or more independent variables.
Simultaneous (Standard) Regression
Hierarchical Regression
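A simultaneous (standard) regression, where all predictors are entered at once, can be sketched with statsmodels; the variable names and data below are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data: performance predicted by age and experience
df = pd.DataFrame({
    'performance': [55, 60, 62, 70, 72, 80, 82, 88],
    'age':         [20, 25, 30, 35, 40, 45, 50, 55],
    'experience':  [1, 2, 2, 4, 5, 7, 8, 10],
})

# simultaneous (standard) regression: both predictors entered together
model = smf.ols('performance ~ age + experience', data=df).fit()
print(model.summary())
```

The summary reports a coefficient and p-value for each predictor, plus an overall R² for the model. Hierarchical regression instead adds predictors in stages and compares the R² change between models.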
Many situations arise where data does not fit the expectations of t or ANOVA tests: e.g., ordinal (Likert-scale) ratings, categorical choices, or data that is not normally distributed.
Non-parametric tests can help with this data:
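For two-group comparisons, the Mann-Whitney U test (independent samples) and Wilcoxon signed-rank test (paired samples) are non-parametric analogues of the t-tests above. A sketch with hypothetical Likert-scale ratings:

```python
from scipy.stats import mannwhitneyu

# hypothetical 5-point Likert ratings (ordinal, so a t-test is questionable)
group_a = [2, 3, 3, 4, 2, 3, 4, 3]
group_b = [4, 5, 4, 5, 3, 5, 4, 4]

# Mann-Whitney U: compares ranks rather than raw values
u_stat, p_value = mannwhitneyu(group_a, group_b)
print(f"U = {u_stat}, p = {p_value:.4f}")
```

Because it works on ranks, this test makes no normality assumption; `scipy.stats.wilcoxon` plays the same role for paired observations.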
The chi-squared (χ²) test helps to analyse categorical data: e.g., a yes/no choice.
Does this look random?
| Group | Yes | No |
|---|---|---|
| A | 5 | 7 |
| B | 11 | 1 |
Results:
import pandas as pd
from scipy.stats import chi2_contingency
data = {
'Group': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
'Answer': ['Y', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N']
}
df = pd.DataFrame(data)
contingency_table = pd.crosstab(df['Group'], df['Answer'])
print(contingency_table)
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p:.4f}")
Research question:
What effects will different machine learning models and feedback mechanisms have on simple improvised music performances?
“Understanding Musical Predictions with an Embodied Interface for Musical Machine Learning” Martin et al. (2020)
| Model | Motor off | Motor on |
|---|---|---|
| Human | Human/Off | Human/On |
| Synth | Synth/Off | Synth/On |
| Noise | Noise/Off | Noise/On |
Who has a question?