1  \(\mathrm{p}\)-Values: Challenges for Hypothesis Testing

\[ \require{color} %% Colorbox within equation-environments: \newcommand{\highlight}[2][yellow]{\mathchoice% {\colorbox{#1}{$\displaystyle#2$}}% {\colorbox{#1}{$\displaystyle#2$}}% {\colorbox{#1}{$\displaystyle#2$}}% }% \]

1.1 Introduction

🎯 Lecture Goals

In this lecture, we’ll tackle three key questions:

  • How do \(\mathrm{p}\)-values actually work?
  • What are the typical misuses and misinterpretations of \(\mathrm{p}\)-values?
  • What is good practice when working with \(\mathrm{p}\)-values?

We’ll explore these questions using the two-sided \(t\)-test as our main example. Parallel arguments apply to one-sided testing and other hypothesis tests (F-test, etc.); see Section 1.4.


💻 Example: Marketing Campaign

Imagine a company runs a marketing campaign in \(n = 5\) stores and measures sales before and after the campaign.


Store Sales Before Sales After Difference (After \(-\) Before)
\(1\) \(4.86\) \(5.50\) \(X_{1,\mathrm{obs}}=0.64\)
\(2\) \(4.96\) \(4.24\) \(X_{2,\mathrm{obs}}=-0.72\)
\(3\) \(6.01\) \(5.78\) \(X_{3,\mathrm{obs}}=-0.23\)
\(4\) \(4.84\) \(5.75\) \(X_{4,\mathrm{obs}}=0.91\)
\(5\) \(2.84\) \(3.90\) \(X_{5,\mathrm{obs}}=1.06\)

Figure 1.1 visualizes the observed differences \(X_{i,\mathrm{obs}},\) \(i=1,\dots,5.\)


Figure 1.1: Sales differences (sales after minus before the marketing campaign).

We want to know:

  • Research question: Is there a marketing effect? That is, did the campaign shift sales (a location shift) toward significantly higher or lower values?

  • Statistical question: Is the mean, \(\mu,\) of \(X_1,\dots,X_n\) different from zero?

Assumption

There are no systematic differences (mean-shifts) between sales before and after the marketing campaign, other than the potential impact of the campaign itself.


Two-Sided \(t\)-Test

We test

\(\displaystyle H_0\colon \mu = 0 \quad\) versus \(\displaystyle \quad H_1\colon \mu \neq 0\)


The \(t\)-statistic is \[ \mathrm{T} = \frac{\bar{X} - 0}{S/\sqrt{n}}\overset{H_0}{\sim} t_{n-1}, \] where

  • \(n\) is the sample size,
  • \(\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i\) is the sample mean, and
  • \(S= \sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2}\) is the sample standard deviation.
Assumptions:

  • Random (iid) sample: \(X_1,\dots,X_n\overset{\operatorname{iid}}{\sim}X\)
  • For small \(n\): \(X\sim\mathcal{N}(\mu, \sigma^2)\)
  • For large \(n\): finite mean \(E(X)=\mu\) and finite variance \(\operatorname{Var}(X)=\sigma^2\) suffice, by the Central Limit Theorem.

Observed (obs) Value of the \(t\)-Statistic (Marketing Example)

\[ \mathrm{T}_{\mathrm{obs}} = \frac{\bar{X}_{\mathrm{obs}} - 0}{S_{\mathrm{obs}}/\sqrt{n}} = 0.96 \]

  • \(n=5\)
  • \(\bar{X}_{\mathrm{obs}}=\frac{1}{5}\sum_{i=1}^5 X_{i,\mathrm{obs}} = 0.33\)
  • \(S_{\mathrm{obs}} = \sqrt{\frac{1}{5-1}\sum_{i=1}^5(X_{i,\mathrm{obs}}-\bar{X}_{\mathrm{obs}})^2} = 0.77\)
X_obs <- c(0.64, -0.72, -0.23, 0.91, 1.06)
n     <- length(X_obs)
X_bar <- mean(X_obs)
S_obs <- sd(X_obs)

## Observed t-statistic
T_obs <- (X_bar - 0)/(S_obs/sqrt(n))

## Alternatively: 
result <- t.test(x           = X_obs, 
                 mu          = 0, 
                 alternative = "two.sided")
T_obs  <- result$statistic

1.2 The Two-Sided \(\mathrm{p}\)-Value of the \(t\)-Test

⚙️ Definition of the \(\mathrm{p}\)-Value

General Definition of the \(\mathrm{p}\)-Value

The \(\mathrm{p}\)-value, \(\textrm{p}_{\mathrm{obs}},\) is the probability of obtaining realizations of the test statistic \(\mathrm{T}\) at least as extreme as the one realization computed from the data \((\mathrm{T}_{\mathrm{obs}})\) — assuming that \(H_0\) is true.

By extreme, we mean extreme in the direction(s) of the alternative hypothesis.

Two-sided \(\mathrm{p}\)-value of the \(t\)-test \[\displaystyle\mathrm{p}_{\mathrm{obs}} = \mathrm{P}\big(\,|\mathrm{T}|\,\geq |\mathrm{T}_{\mathrm{obs}}|\;\big|\;H_0\;\text{is true}\big)\]
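This definition can be made concrete by simulation before doing any analytical work: under \(H_0\) (and assuming normally distributed data, as in the small-\(n\) setting), we can generate many datasets with \(\mu = 0\), compute \(\mathrm{T}\) for each, and count how often \(|\mathrm{T}|\geq|\mathrm{T}_{\mathrm{obs}}|.\) A minimal sketch in R:

```r
set.seed(1)
n     <- 5
T_obs <- 0.96  # observed t-statistic from the marketing example

## Simulate the null distribution of T by drawing data with mu = 0
T_null <- replicate(100000, {
  x <- rnorm(n, mean = 0, sd = 1)
  mean(x) / (sd(x) / sqrt(n))
})

## Two-sided p-value: share of simulated |T| at least as extreme as |T_obs|
mean(abs(T_null) >= abs(T_obs))  # close to 0.39
```

The Monte Carlo estimate agrees with the analytical p-value derived below up to simulation error.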


🧮 Computing the \(\mathrm{p}\)-Value

Note that \[ \mathrm{T}=\frac{\bar{X}-0}{S/\sqrt{n}}\overset{H_0}{\sim}t_{n-1} \] implies that \[ \highlight{|\mathrm{T}|}=\frac{|\bar{X}-0|}{S/\sqrt{n}}\overset{H_0}{\sim}\highlight{\textrm{Half-}t_{n-1}.} \]

Figure 1.2: The density function of the Half-\(t_{n-1}\) distribution is the folded version of the density function of the \(t_{n-1}\) distribution.

\[\begin{align*} \mathrm{F}_{\,\text{Half-}t_{n-1}}(x)&=\mathrm{P}(|\textrm{T}|\leq x\mid H_0\;\text{is true}),\quad |\textrm{T}|\overset{H_0}{\sim} \textrm{Half-}t_{n-1} \end{align*}\]

\(\mathrm{F}_{\,\text{Half-}t_{n-1}}(x) =\) extraDistr::pht(q = x, nu = n - 1)
Figure 1.3: Distribution function of the null-distribution \(\textrm{Half-}t_{\operatorname{df}}\) with \(\operatorname{df}=n-1=4\) (marketing example).
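The folding relation can be verified numerically (a minimal sketch, assuming the extraDistr package is installed): for \(x \geq 0,\) the Half-\(t_{n-1}\) distribution function equals the folded \(t_{n-1}\) distribution function, \(\mathrm{F}_{\,\text{Half-}t_{n-1}}(x) = 2\,\mathrm{F}_{t_{n-1}}(x) - 1.\)

```r
library("extraDistr")

x  <- 0.96  # observed |t|-statistic from the marketing example
df <- 4     # degrees of freedom, n - 1

## Half-t CDF vs. folded t CDF: both give P(|T| <= x) under H0
pht(q = x, nu = df)
2 * pt(x, df = df) - 1  # identical for x >= 0
```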
Computing the \(\mathrm{p}\)-Value of the Two-Sided \(t\)-Test

\[\begin{align*} \mathrm{p}_{\mathrm{obs}} & = \mathrm{P}\big(\,|\mathrm{T}|\,\geq |\mathrm{T}_{\mathrm{obs}}|\;\big|\;H_0\;\text{is true}\big)\\[1ex] & = 1 - \mathrm{P}\big(\highlight{|\mathrm{T}|}\leq |\mathrm{T}_{\mathrm{obs}}|\;\big|\highlight{H_0\;\text{is true}}\big)\\[1ex] & = 1 - \highlight{\mathrm{F}_{\,\text{Half-}t_{n-1}}}\big(|\mathrm{T}_{\mathrm{obs}}|\big) \end{align*}\]


Computing the \(\mathrm{p}\)-Value (Marketing Example)

\[\begin{align*} \mathrm{p}_{\mathrm{obs}} &=1 - \mathrm{F}_{\,\text{Half-}t_{n-1}}\big(|\mathrm{T}_{\mathrm{obs}}|\big)\\ &=1 - \mathrm{F}_{\,\text{Half-}t_{n-1}}\big(\,0.96\,\big)=0.39 \end{align*}\]

Computation of the observed \(\operatorname{p}\)-value:

library("extraDistr")
n  <- 5
df <- n - 1

## 1 minus the distribution function of the Half-t
p_obs <- 1 - pht(q = abs(T_obs), nu = df)

## Alternatively: 
result <- t.test(X_obs, mu = 0, alternative = "two.sided")
p_obs  <- result$p.value

⚖️ Test Decision based on the \(\mathrm{p}\)-Value


  • We reject \(H_0\) at the significance level \(0<\alpha<1\) if \[ \highlight{\mathrm{p}_{\mathrm{obs}} < \alpha} \]

  • We fail to reject \(H_0\) at the significance level \(0<\alpha<1\) if \[ \highlight{\mathrm{p}_{\mathrm{obs}} \geq \alpha} \]

Usual significance levels: \(\alpha=0.05\) or \(\alpha=0.01.\)


Test Decision (Marketing Example)

Since \[ \mathrm{p}_{\mathrm{obs}}=0.39 \geq \alpha = 0.05 \] we fail to reject \(H_0\colon \mu =0\) at the significance level \(\alpha=0.05.\)
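In R, the decision rule is a one-liner (a minimal sketch; p_obs as computed above for the marketing example):

```r
alpha <- 0.05
p_obs <- 0.39  # observed p-value from the marketing example

## Decision rule: reject H0 if and only if p_obs < alpha
decision <- if (p_obs < alpha) "reject H0" else "fail to reject H0"
decision
```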

⚠️ Failing to reject \(H_0\) does not mean that \(H_0\) is true! (See Section 1.3.)

1.3 \(\mathrm{p}\)-Values: Misused and Misunderstood

🙁 \(\mathrm{p}\)-Hacking

\(\mathrm{p}\)-hacking is arguably the most common misuse of \(\mathrm{p}\)-values. It happens when researchers keep adjusting their analysis until they find a statistically significant result \((\mathrm{p}_{\mathrm{obs}} < \alpha)\) — even if there’s no real effect.

🧪 Typical “tricks” include:

  • Running lots of tests until one happens to be “significant.”

  • Changing the model setup (adding or removing variables) until an insignificant result turns “significant”.

  • Selectively including or excluding data points to turn an insignificant result into a “significant” one.

This is called the multiple comparison problem — the more tests you run, the higher the chance of finding a false positive (falsely rejecting a correct \(H_0\)).


Multiple Comparison Problem

⚠️ Multiple Comparison Problem


When multiple tests are conducted, the chance of finding at least one “significant” result purely by coincidence increases.

For instance, with \(m=20\) independent tests at the significance level \(\alpha = 0.05\), there is a \(64 \%\) chance of incorrectly rejecting at least one null hypothesis:

\[\begin{align*} & P(\text{rejecting $H_0$ in at least one of the $m$ tests}\mid \text{all }H_0\text{'s are true})\\[2ex] =& 1-P(\text{not rejecting $H_0$ in each of the $m$ tests}\mid \text{all }H_0\text{'s are true})\\[2ex] =& 1-\left(P(\text{not rejecting $H_0$ in one single test}\mid H_0 \text{ is true})\right)^m\\[2ex] =& 1-\left(1-P(\text{rejecting $H_0$ in one single test}\mid H_0 \text{ is true})\right)^m\\[2ex] =& 1-\left(1-P(\text{type I error})\right)^m\\[2ex] =& 1 - (1-\alpha)^m\\[2ex] =& 1 - 0.95^{20} \;\approx \; 0.64 \end{align*}\]
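The calculation above can also be checked by simulation (a minimal sketch with pure-noise data, so every \(H_0\) is true by construction):

```r
set.seed(1)
m     <- 20    # number of tests
alpha <- 0.05
n     <- 5     # sample size per test

## Analytical probability of at least one false rejection
1 - (1 - alpha)^m  # approx. 0.64

## Monte Carlo check: run m t-tests on pure noise, many times over
at_least_one <- replicate(2000, {
  p_values <- replicate(m, t.test(rnorm(n), mu = 0)$p.value)
  any(p_values < alpha)
})
mean(at_least_one)  # also close to 0.64
```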


Adjusting for Multiple Comparisons

🤓 The Bonferroni Correction


The Bonferroni Correction requires using a more stringent significance threshold when multiple tests are performed; namely \[ \alpha_{\textrm{Bonf}}=\frac{\alpha}{m} \] for \(m\) tests.

\[\begin{align*} & P(\text{rejecting $H_0$ in at least one of the $m$ tests}\mid \text{all }H_0\text{'s are true})\\[2ex] =& 1 - (1-\alpha_{\textrm{Bonf}})^m\\[2ex] =& 1 - \left(1-\frac{\alpha}{m}\right)^m\leq \alpha \end{align*}\] For \(\alpha = 0.05\) and \(m=20\colon\) \[\begin{align*} 1 - \left(1-\frac{0.05}{20}\right)^{20} &= 1 - \left(1-0.0025\right)^{20} = 0.049 < 0.05 \end{align*}\]

Alternatives to the Bonferroni correction:

  • Holm–Bonferroni method
  • Hochberg’s step-up procedure

Mitigating the Misuse of \(\mathrm{p}\)-Values

🧠 Pre-Registration and 🔁 Replication


  • Pre-registration:
    Researchers state their hypotheses and analysis plans before looking at the data.
    This reduces the temptation to \(\mathrm{p}\)-hack and helps distinguish confirmatory from exploratory research.


  • Replication:
    Independent replication studies test whether a finding is robust or simply a statistical fluke (a type I error).

✅ Correct Interpretation of the \(\textrm{p}\)-Value

🤓 Interpreting Small \(\textrm{p}\)-Values: \(\;\textrm{p}_{\mathrm{obs}} < \alpha\)


  • Compatibility with \(H_0\):
    A small \(\textrm{p}_{\mathrm{obs}}\) suggests that the observed statistic \(|\mathrm{T}_{\mathrm{obs}}|\) is incompatible with the null hypothesis \(H_0.\) In that case, we may decide that \(H_0\) does not provide a plausible explanation for the data.

  • Decision Theory:
    If \(H_0\) is actually true, rejecting it (because \(\textrm{p}_{\mathrm{obs}} < \alpha\)) is a type I error, and under \(H_0\) this happens with probability \(\alpha\).
    👉 This is why we choose small significance levels such as \(\alpha = 0.05\) or \(\alpha = 0.01.\)

  • Statistical Significance ≠ Practical Significance:
    A small \(\mathrm{p}\)-value does not mean the effect is large or meaningful in practice.
    With a very large sample (\(n\) large) or low data variability (\(\operatorname{Var}(X)\) small), even tiny and unimportant effects will yield very small \(\mathrm{p}\)-values.

📢 A small \(\mathrm{p}\)-value signals statistical evidence — not necessarily practical importance.
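The gap between statistical and practical significance is easy to demonstrate with simulated data (a minimal sketch; the true mean of \(0.01\) is a made-up, practically negligible effect):

```r
set.seed(1)
## A practically negligible effect (true mean 0.01) with a huge sample ...
x <- rnorm(1000000, mean = 0.01, sd = 1)

## ... still yields a tiny p-value: statistically "significant"
t.test(x, mu = 0)$p.value  # far below 0.05
mean(x)                    # yet the estimated effect itself is negligible
```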
⚠️ Interpreting Large \(\mathrm{p}\)-Values: \(\;\mathrm{p}_{\mathrm{obs}} \geq \alpha\)


  • Compatibility with \(H_0\):
    A large \(\mathrm{p}_{\mathrm{obs}}\) means the observed test statistic \(\mathrm{T}_{\mathrm{obs}}\) is compatible with the null hypothesis \(H_0\).
    ❗️However, this does not mean that \(H_0\) is true or confirmed.
  • Possible Reasons for Large \(\mathrm{p}\)-Values:
    Large \(\mathrm{p}\)-values (even \(\mathrm{p}_{\mathrm{obs}} \approx 1\)) can occur when a real effect is simply too small to detect, for example if
    • the sample size \(n\) is small, or
    • the data variability \(\operatorname{Var}(X)\) is large.
  • Decision Theory:
    When we fail to reject \(H_0\) (because \(\mathrm{p}_{\mathrm{obs}} \geq \alpha\)), we cannot conclude that \(H_0\) is true; we can only say that the data do not provide enough evidence to reject it. Moreover, the probability of a type II error (failing to reject a false \(H_0\)) is typically unknown.
📢 Statistical tests are designed to reject \(H_0\) when evidence is strong — not to prove that \(H_0\) is true.
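Conversely, a real effect is easily missed when \(n\) is small. Base R's power.t.test() quantifies this (a minimal sketch; the effect size \(\mu = 0.5\) and \(\sigma = 1\) are made-up values chosen to mimic the marketing example's scale):

```r
## Power of the two-sided one-sample t-test to detect a true mean of 0.5
## with n = 5 observations and sd = 1 at significance level alpha = 0.05
power.t.test(n = 5, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")$power
## The power is low, so a large p-value is likely even though H1 is true
```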

Last but not least:

📢 The \(\mathrm{p}\)-value is not the probability that the null hypothesis is true.


The p-value is calculated assuming the null hypothesis is true. Therefore, it cannot be the probability of that same hypothesis being true.


📄 Good Practice

Use \(\mathrm{p}\)-values as one piece of evidence, not the final verdict. Always combine them with

  • effect sizes,
  • confidence intervals, and
  • context.

Accept that there will always be uncertainty, and be thoughtful, open, and modest.

  • Be Thoughtful:
    • Know whether you’re doing an exploratory study (to generate hypotheses) or a more rigidly pre-planned study (to test hypotheses).
    • Invest time and care in sound data collection — good study design and careful execution matter more than fancy statistics.
  • Be Open:
    • Practice transparency — share your data, methods, and results so others can reproduce your findings.
    • Be honest in communication — “statistical significance” ≠ “scientific importance,” and one study alone rarely proves anything.
  • Be Modest:
    • Remember: there is no single “true” model — every model is an approximation.
    • Stay humble about what statistical inference can (and cannot) tell us about the real world.

📚 The Scientific Debate about \(\mathrm{p}\)-Values

\(\mathrm{p}\)-values can be powerful tools — when used correctly. But in practice, they’re often misused and misinterpreted, leading to a surprising amount of published “bad science.”

This misuse has sparked an ongoing debate: Should we rely on \(\mathrm{p}\)-values and the concept of statistical significance at all?

📚 Further reading:

To clarify what \(\mathrm{p}\)-values and statistical significance actually mean (and what they don’t), the American Statistical Association (ASA) published two landmark position papers:

📄 ASA Statements:

  • Wasserstein and Lazar (2016): “The ASA Statement on p-Values: Context, Process, and Purpose.”
  • Wasserstein, Schirm, and Lazar (2019): “Moving to a World Beyond p < 0.05.”

1.4 Overview: \(\mathrm{p}\)-Values of other Test Statistics

Here you’ll find more great things about the \(p\)-value!

References

Amrhein, Valentin, Sander Greenland, and Blake McShane. 2019. “Scientists Rise up Against Statistical Significance.” https://www.nature.com/articles/d41586-019-00857-9.
Cumming, Geoff. 2013. “The Problem with p Values: How Significant Are They, Really?” http://phys.org/wire-news/145707973/the-problem-with-p-values-how-significant-are-they-really.html.
Nuzzo, Regina. 2014. “Scientific Method: Statistical Errors.” Nature 506 (7487).
Sagan, Carl. 1997. The Demon-Haunted World: Science as a Candle in the Dark. Headline Book Publishing.
Wasserstein, Ronald L, and Nicole A Lazar. 2016. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33.
Wasserstein, Ronald L, Allen L Schirm, and Nicole A Lazar. 2019. “Moving to a World Beyond p < 0.05.” The American Statistician 73: 1–19.