1  \(\mathrm{p}\)-Values: Challenges for Hypothesis Testing

\[ \require{color} %% Colorbox within equation-environments: \newcommand{\highlight}[2][yellow]{\mathchoice% {\colorbox{#1}{$\displaystyle#2$}}% {\colorbox{#1}{$\displaystyle#2$}}% {\colorbox{#1}{$\displaystyle#2$}}% }% \]

1.1 Introduction

🎯 Lecture Goals

In this lecture, we’ll tackle three key questions:

  • How do \(\mathrm{p}\)-values actually work?
  • What are the typical misuses and misinterpretations of \(\mathrm{p}\)-values?
  • What is good practice when working with \(\mathrm{p}\)-values?

We’ll explore these questions using the two-sided \(t\)-test as our main example. Parallel arguments apply to one-sided testing and other hypothesis tests (F-test, etc.); see Section 1.4.


💻 Example: Marketing Campaign

Imagine a company runs a marketing campaign in \(n = 5\) stores and measures sales before and after the campaign.


Store Sales Before Sales After Difference (After \(-\) Before)
\(1\) \(4.86\) \(5.50\) \(X_{1,\mathrm{obs}}=0.64\)
\(2\) \(4.96\) \(4.24\) \(X_{2,\mathrm{obs}}=-0.72\)
\(3\) \(6.01\) \(5.78\) \(X_{3,\mathrm{obs}}=-0.23\)
\(4\) \(4.84\) \(5.75\) \(X_{4,\mathrm{obs}}=0.91\)
\(5\) \(2.84\) \(3.90\) \(X_{5,\mathrm{obs}}=1.06\)

Figure 1.1 visualizes the observed differences \(X_{i,\mathrm{obs}},\) \(i=1,\dots,5.\)


Figure 1.1: Sales differences (sales after minus before the marketing campaign).

We want to know:

  • Research question: Is there a marketing effect? That is, did the campaign shift sales (a location shift) toward significantly higher or lower values?

  • Statistical question: Is the mean, \(\mu,\) of \(X_1,\dots,X_n\) different from zero?

Assumption

There are no systematic differences (mean-shifts) between sales before and after the marketing campaign, other than the potential impact of the campaign itself.


Two-Sided \(t\)-Test

We test

\(\displaystyle H_0\colon \mu = 0 \quad\) versus \(\displaystyle \quad H_1\colon \mu \neq 0\)


The \(t\)-statistic is \[ \mathrm{T} = \frac{\bar{X} - 0}{S/\sqrt{n}}\overset{H_0}{\sim} t_{n-1}, \] where

  • \(n\) is the sample size,
  • \(\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i\) is the sample mean, and
  • \(S= \sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2}\) is the sample standard deviation.
Assumptions:

  • Random (iid) sample: \(X_1,\dots,X_n\overset{\operatorname{iid}}{\sim}X\)
  • For small \(n\): \(X\sim\mathcal{N}(\mu, \sigma^2)\)
  • For large \(n\): finite mean \(E(X)=\mu\) and finite variance \(\operatorname{Var}(X)=\sigma^2\) suffice, by the Central Limit Theorem.

Observed (obs) Value of the \(t\)-Statistic (Marketing Example)

\[ \mathrm{T}_{\mathrm{obs}} = \frac{\bar{X}_{\mathrm{obs}} - 0}{S_{\mathrm{obs}}/\sqrt{n}} = 0.96 \]

  • \(n=5\)
  • \(\bar{X}_{\mathrm{obs}}=\frac{1}{5}\sum_{i=1}^5 X_{i,\mathrm{obs}} = 0.33\)
  • \(S_{\mathrm{obs}} = \sqrt{\frac{1}{5-1}\sum_{i=1}^5(X_{i,\mathrm{obs}}-\bar{X}_{\mathrm{obs}})^2} = 0.77\)
X_obs <- c(0.64, -0.72, -0.23, 0.91, 1.06)
n     <- length(X_obs)
X_bar <- mean(X_obs)
S_obs <- sd(X_obs)

## Observed t-statistic
T_obs <- (X_bar - 0)/(S_obs/sqrt(n))

## Alternatively: 
result <- t.test(x           = X_obs, 
                 mu          = 0, 
                 alternative = "two.sided")
T_obs  <- result$statistic

1.2 The Two-Sided \(\mathrm{p}\)-Value of the \(t\)-Test

⚙️ Definition of the \(\mathrm{p}\)-Value

General Definition of the \(\mathrm{p}\)-Value

The \(\mathrm{p}\)-value, \(\textrm{p}_{\mathrm{obs}},\) is the probability of obtaining realizations of the test statistic \(\mathrm{T}\) at least as extreme as the one realization computed from the data \((\mathrm{T}_{\mathrm{obs}})\) — assuming that \(H_0\) is true.

By extreme, we mean extreme in the direction(s) of the alternative hypothesis.

Two-sided \(\mathrm{p}\)-value of the \(t\)-test \[\displaystyle\mathrm{p}_{\mathrm{obs}} = \mathrm{P}\big(\,|\mathrm{T}|\,\geq |\mathrm{T}_{\mathrm{obs}}|\;\big|\;H_0\;\text{is true}\big)\]
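This definition can be made concrete by simulation before doing any analytical work: under \(H_0\) (and assuming normally distributed data, as in the small-\(n\) setting), we can generate many datasets with \(\mu = 0\), compute \(\mathrm{T}\) for each, and count how often \(|\mathrm{T}|\geq|\mathrm{T}_{\mathrm{obs}}|.\) A minimal sketch in R:

```r
set.seed(1)
n     <- 5
T_obs <- 0.96  # observed t-statistic from the marketing example

## Simulate the null distribution of T by drawing data with mu = 0
T_null <- replicate(100000, {
  x <- rnorm(n, mean = 0, sd = 1)
  mean(x) / (sd(x) / sqrt(n))
})

## Two-sided p-value: share of simulated |T| at least as extreme as |T_obs|
mean(abs(T_null) >= abs(T_obs))  # close to 0.39
```

The Monte Carlo estimate agrees with the analytical p-value derived below up to simulation error.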


🧮 Computing the \(\mathrm{p}\)-Value

Note that \[ \mathrm{T}=\frac{\bar{X}-0}{S/\sqrt{n}}\overset{H_0}{\sim}t_{n-1} \] implies that \[ \highlight{|\mathrm{T}|}=\frac{|\bar{X}-0|}{S/\sqrt{n}}\overset{H_0}{\sim}\highlight{\textrm{Half-}t_{n-1}.} \]

Figure 1.2: The density function of the Half-\(t_{n-1}\) distribution is the folded version of the density function of the \(t_{n-1}\) distribution.

\[\begin{align*} \mathrm{F}_{\,\text{Half-}t_{n-1}}(x)&=\mathrm{P}(|\textrm{T}|\leq x\mid H_0\;\text{is true}),\quad |\textrm{T}|\overset{H_0}{\sim} \textrm{Half-}t_{n-1} \end{align*}\]

\(\mathrm{F}_{\,\text{Half-}t_{n-1}}(x) =\) extraDistr::pht(q = x, nu = n - 1)
Figure 1.3: Distribution function of the null-distribution \(\textrm{Half-}t_{\operatorname{df}}\) with \(\operatorname{df}=n-1=4\) (marketing example).
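The folding relation can be verified numerically (a minimal sketch, assuming the extraDistr package is installed): for \(x \geq 0,\) the Half-\(t_{n-1}\) distribution function equals the folded \(t_{n-1}\) distribution function, \(\mathrm{F}_{\,\text{Half-}t_{n-1}}(x) = 2\,\mathrm{F}_{t_{n-1}}(x) - 1.\)

```r
library("extraDistr")

x  <- 0.96  # observed |t|-statistic from the marketing example
df <- 4     # degrees of freedom, n - 1

## Half-t CDF vs. folded t CDF: both give P(|T| <= x) under H0
pht(q = x, nu = df)
2 * pt(x, df = df) - 1  # identical for x >= 0
```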
Computing the \(\mathrm{p}\)-Value of the Two-Sided \(t\)-Test

\[\begin{align*} \mathrm{p}_{\mathrm{obs}} & = \mathrm{P}\big(\,|\mathrm{T}|\,\geq |\mathrm{T}_{\mathrm{obs}}|\;\big|\;H_0\;\text{is true}\big)\\[1ex] & = 1 - \mathrm{P}\big(\highlight{|\mathrm{T}|}\leq |\mathrm{T}_{\mathrm{obs}}|\;\big|\highlight{H_0\;\text{is true}}\big)\\[1ex] & = 1 - \highlight{\mathrm{F}_{\,\text{Half-}t_{n-1}}}\big(|\mathrm{T}_{\mathrm{obs}}|\big) \end{align*}\]


Computing the \(\mathrm{p}\)-Value (Marketing Example)

\[\begin{align*} \mathrm{p}_{\mathrm{obs}} &=1 - \mathrm{F}_{\,\text{Half-}t_{n-1}}\big(|\mathrm{T}_{\mathrm{obs}}|\big)\\ &=1 - \mathrm{F}_{\,\text{Half-}t_{n-1}}\big(\,0.96\,\big)=0.39 \end{align*}\]

Computation of the observed \(\operatorname{p}\)-value:

library("extraDistr")
n  <- 5
df <- n - 1

## 1 minus the distribution function of the Half-t
p_obs <- 1 - pht(q = abs(T_obs), nu = df)

## Alternatively: 
result <- t.test(X_obs, mu = 0, alternative = "two.sided")
p_obs  <- result$p.value

⚖️ Test Decision based on the \(\mathrm{p}\)-Value


  • We reject \(H_0\) at the significance level \(0<\alpha<1\) if \[ \highlight{\mathrm{p}_{\mathrm{obs}} < \alpha} \]

  • We fail to reject \(H_0\) at the significance level \(0<\alpha<1\) if \[ \highlight{\mathrm{p}_{\mathrm{obs}} \geq \alpha} \]

Usual significance levels: \(\alpha=0.05\) or \(\alpha=0.01.\)


Test Decision (Marketing Example)

Since \[ \mathrm{p}_{\mathrm{obs}}=0.39 \geq \alpha = 0.05 \] we fail to reject \(H_0\colon \mu =0\) at the significance level \(\alpha=0.05.\)
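In R, the decision rule is a one-liner (a minimal sketch; p_obs as computed above for the marketing example):

```r
alpha <- 0.05
p_obs <- 0.39  # observed p-value from the marketing example

## Decision rule: reject H0 if and only if p_obs < alpha
decision <- if (p_obs < alpha) "reject H0" else "fail to reject H0"
decision
```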

⚠️ Failing to reject \(H_0\) does not mean that \(H_0\) is true! (See Section 1.3.)

1.3 \(\mathrm{p}\)-Values: Misused and Misunderstood

🙁 \(\mathrm{p}\)-Hacking

\(\mathrm{p}\)-hacking is arguably the most common misuse of \(\mathrm{p}\)-values. It happens when researchers keep adjusting their analysis until they find a statistically significant result \((\mathrm{p}_{\mathrm{obs}} < \alpha)\) — even if there’s no real effect.

🧪 Typical “tricks” include:

  • Running lots of tests until one happens to be “significant.”

  • Changing the model setup (adding or removing variables) until an insignificant result turns “significant”.

  • Selectively including or excluding data points to turn an insignificant result into a “significant” one.

This is called the multiple comparison problem — the more tests you run, the higher the chance of finding a false positive (falsely rejecting a correct \(H_0\)).


Multiple Comparison Problem

⚠️ Multiple Comparison Problem


When multiple tests are conducted, the chance of finding at least one “significant” result purely by coincidence increases.

For instance, with \(m=20\) independent tests at the significance level \(\alpha = 0.05\), there is a \(64 \%\) chance of incorrectly rejecting at least one null hypothesis:

\[\begin{align*} & P(\text{rejecting $H_0$ in at least one of the $m$ tests}\mid \text{all }H_0\text{'s are true})\\[2ex] =& 1-P(\text{not rejecting $H_0$ in each of the $m$ tests}\mid \text{all }H_0\text{'s are true})\\[2ex] =& 1-\left(P(\text{not rejecting $H_0$ in one single test}\mid H_0 \text{ is true})\right)^m\\[2ex] =& 1-\left(1-P(\text{rejecting $H_0$ in one single test}\mid H_0 \text{ is true})\right)^m\\[2ex] =& 1-\left(1-P(\text{type I error})\right)^m\\[2ex] =& 1 - (1-\alpha)^m\\[2ex] =& 1 - 0.95^{20} \;\approx \; 0.64 \end{align*}\]
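The calculation above can also be checked by simulation (a minimal sketch with pure-noise data, so every \(H_0\) is true by construction):

```r
set.seed(1)
m     <- 20    # number of tests
alpha <- 0.05
n     <- 5     # sample size per test

## Analytical probability of at least one false rejection
1 - (1 - alpha)^m  # approx. 0.64

## Monte Carlo check: run m t-tests on pure noise, many times over
at_least_one <- replicate(2000, {
  p_values <- replicate(m, t.test(rnorm(n), mu = 0)$p.value)
  any(p_values < alpha)
})
mean(at_least_one)  # also close to 0.64
```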


Adjusting for Multiple Comparisons

🤓 The Bonferroni Correction


The Bonferroni Correction requires using a more stringent significance threshold when multiple tests are performed; namely \[ \alpha_{\textrm{Bonf}}=\frac{\alpha}{m} \] for \(m\) tests.

\[\begin{align*} & P(\text{rejecting $H_0$ in at least one of the $m$ tests}\mid \text{all }H_0\text{'s are true})\\[2ex] =& 1 - (1-\alpha_{\textrm{Bonf}})^m\\[2ex] =& 1 - \left(1-\frac{\alpha}{m}\right)^m\leq \alpha \end{align*}\] For \(\alpha = 0.05\) and \(m=20\colon\) \[\begin{align*} 1 - \left(1-\frac{0.05}{20}\right)^{20} &= 1 - \left(1-0.0025\right)^{20} = 0.049 < 0.05 \end{align*}\]

Alternatives to the Bonferroni correction:

  • Holm–Bonferroni method
  • Hochberg’s step-up procedure

Mitigating the Misuse of \(\mathrm{p}\)-Values

🧠 Pre-Registration and 🔁 Replication


  • Pre-registration:
    Researchers state their hypotheses and analysis plans before looking at the data.
    This reduces the temptation to \(\mathrm{p}\)-hack and helps distinguish confirmatory from exploratory research.


  • Replication:
    Independent replication studies test whether a finding is robust or simply a statistical fluke (a type I error).

✅ Correct Interpretation of the \(\textrm{p}\)-Value

🤓 Interpreting Small \(\textrm{p}\)-Values: \(\;\textrm{p}_{\mathrm{obs}} < \alpha\)


  • Compatibility with \(H_0\):
    A small \(\textrm{p}_{\mathrm{obs}}\) suggests that the observed statistic \(|\mathrm{T}_{\mathrm{obs}}|\) is incompatible with the null hypothesis \(H_0.\) In that case, we may decide that \(H_0\) does not provide a plausible explanation for the data.

  • Decision Theory:
    If \(H_0\) is actually true, rejecting it (because \(\textrm{p}_{\mathrm{obs}} < \alpha\)) is a type I error, and under \(H_0\) this happens with probability \(\alpha\).
    👉 This is why we choose small significance levels such as \(\alpha = 0.05\) or \(\alpha = 0.01.\)

  • Statistical Significance ≠ Practical Significance:
    A small \(\mathrm{p}\)-value does not mean the effect is large or meaningful in practice.
    With a very large sample (\(n\) large) or low data variability (\(\operatorname{Var}(X)\) small), even tiny and unimportant effects will yield very small \(\mathrm{p}\)-values.

📢 A small \(\mathrm{p}\)-value signals statistical evidence — not necessarily practical importance.
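The gap between statistical and practical significance is easy to demonstrate with simulated data (a minimal sketch; the true mean of \(0.01\) is a made-up, practically negligible effect):

```r
set.seed(1)
## A practically negligible effect (true mean 0.01) with a huge sample ...
x <- rnorm(1000000, mean = 0.01, sd = 1)

## ... still yields a tiny p-value: statistically "significant"
t.test(x, mu = 0)$p.value  # far below 0.05
mean(x)                    # yet the estimated effect itself is negligible
```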
⚠️ Interpreting Large \(\mathrm{p}\)-Values: \(\;\mathrm{p}_{\mathrm{obs}} \geq \alpha\)


  • Compatibility with \(H_0\):
    A large \(\mathrm{p}_{\mathrm{obs}}\) means the observed test statistic \(\mathrm{T}_{\mathrm{obs}}\) is compatible with the null hypothesis \(H_0\).
    ❗️However, this does not mean that \(H_0\) is true or confirmed.
  • Possible Reasons for Large \(\mathrm{p}\)-Values:
    Large \(\mathrm{p}\)-values (even \(\mathrm{p}_{\mathrm{obs}} \approx 1\)) can occur when a real effect is simply too small to detect, for example if
    • the sample size \(n\) is small, or
    • the data variability \(\operatorname{Var}(X)\) is large.
  • Decision Theory:
    When we fail to reject \(H_0\) (because \(\mathrm{p}_{\mathrm{obs}} \geq \alpha\)), we cannot conclude that \(H_0\) is true; we can only say that the data do not provide enough evidence to reject it. Moreover, the probability of a type II error (failing to reject a false \(H_0\)) is typically unknown.
📢 Statistical tests are designed to reject \(H_0\) when evidence is strong — not to prove that \(H_0\) is true.
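Conversely, a real effect is easily missed when \(n\) is small. Base R's power.t.test() quantifies this (a minimal sketch; the effect size \(\mu = 0.5\) and \(\sigma = 1\) are made-up values chosen to mimic the marketing example's scale):

```r
## Power of the two-sided one-sample t-test to detect a true mean of 0.5
## with n = 5 observations and sd = 1 at significance level alpha = 0.05
power.t.test(n = 5, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")$power
## The power is low, so a large p-value is likely even though H1 is true
```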

Last but not least:

📢 The \(\mathrm{p}\)-value is not the probability that the null hypothesis is true.


The p-value is calculated assuming the null hypothesis is true. Therefore, it cannot be the probability of that same hypothesis being true.


📄 Good Practice

Use \(\mathrm{p}\)-values as one piece of evidence, not the final verdict. Always combine them with

  • effect sizes,
  • confidence intervals, and
  • context.

Accept that there will always be uncertainty, and be thoughtful, open, and modest.

  • Be Thoughtful:
    • Know whether you’re doing an exploratory study (to generate hypotheses) or a more rigidly pre-planned study (to test hypotheses).
    • Invest time and care in sound data collection — good study design and careful execution matter more than fancy statistics.
  • Be Open:
    • Practice transparency — share your data, methods, and results so others can reproduce your findings.
    • Be honest in communication — “statistical significance” ≠ “scientific importance,” and one study alone rarely proves anything.
  • Be Modest:
    • Remember: there is no single “true” model — every model is an approximation.
    • Stay humble about what statistical inference can (and cannot) tell us about the real world.

📚 The Scientific Debate about \(\mathrm{p}\)-Values

\(\mathrm{p}\)-values can be powerful tools — when used correctly. But in practice, they’re often misused and misinterpreted, leading to a surprising amount of published “bad science.”

This misuse has sparked an ongoing debate: Should we rely on \(\mathrm{p}\)-values and the concept of statistical significance at all?

📚 Further reading:

To clarify what \(\mathrm{p}\)-values and statistical significance actually mean (and what they don’t), the American Statistical Association (ASA) published two landmark position papers:

📄 ASA Statements:

  • Wasserstein and Lazar (2016): “The ASA Statement on p-Values: Context, Process, and Purpose.”
  • Wasserstein, Schirm, and Lazar (2019): “Moving to a World Beyond p < 0.05.”

1.4 Overview: \(\mathrm{p}\)-Values of other Test Statistics

Here you’ll find more great things about the \(p\)-value!

References

Amrhein, Valentin, Sander Greenland, and Blake McShane. 2019. “Scientists Rise up Against Statistical Significance.” https://www.nature.com/articles/d41586-019-00857-9.
Cumming, Geoff. 2013. “The Problem with p Values: How Significant Are They, Really?” http://phys.org/wire-news/145707973/the-problem-with-p-values-how-significant-are-they-really.html.
Nuzzo, Regina. 2014. “Scientific Method: Statistical Errors.” Nature 506 (7487).
Sagan, Carl. 1997. The Demon-Haunted World: Science as a Candle in the Dark. Headline Book Publishing.
Wasserstein, Ronald L, and Nicole A Lazar. 2016. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33.
Wasserstein, Ronald L, Allen L Schirm, and Nicole A Lazar. 2019. “Moving to a World Beyond p < 0.05.” The American Statistician 73: 1–19.