Ch. 3 Statistical Hypothesis Testing

3.1 Hypotheses and Test-Statistics

Assume an independently and identically distributed (i.i.d.) random sample $X_1,\dots,X_n$, where the distribution of $X_i$, $i=1,\dots,n$, depends on some unknown parameter $\theta\in\Omega$, and where $\Omega$ is some parameter space.


General Testing Problem:

$$H_0:\theta\in\Omega_0 \quad\text{against}\quad H_1:\theta\in\Omega_1$$

$H_0$ is the null hypothesis, while $H_1$ is the alternative. $\Omega_0\subset\Omega$ and $\Omega_1\subset\Omega$ are used to denote the possible values of $\theta$ under $H_0$ and $H_1$. Necessarily, $\Omega_0\cap\Omega_1=\emptyset$.

For a large number of tests we have $\Omega\subseteq\mathbb{R}$ and the respective null hypothesis states that $\theta$ has a specific value $\theta_0\in\mathbb{R}$, i.e., $\Omega_0=\{\theta_0\}$ and $H_0:\theta=\theta_0$. Depending on the alternative one then often distinguishes between one-sided ($\Omega_1=(\theta_0,\infty)$ or $\Omega_1=(-\infty,\theta_0)$) and two-sided tests ($\Omega_1=\{\theta\in\mathbb{R}\,|\,\theta\neq\theta_0\}$).


The data $X_1,\dots,X_n$ is used in order to decide whether to accept or to reject $H_0$.


Test Statistic: Every statistical hypothesis test relies on a corresponding test statistic $T=T(X_1,\dots,X_n)$. Any test statistic is a real-valued random variable, and for given data the resulting observed value $T_{obs}$ is used to decide between $H_0$ and $H_1$. Generally, the distribution of $T$ under $H_0$ is analyzed in order to define a rejection region $C$:

  • $T_{obs}\notin C$ $\Rightarrow$ $H_0$ is not rejected
  • $T_{obs}\in C$ $\Rightarrow$ $H_0$ is rejected

For one-sided tests $C$ is typically of the form $(-\infty,c_0]$ or $[c_1,\infty)$. For two-sided tests $C$ typically takes the form $(-\infty,c_0]\cup[c_1,\infty)$. The limits $c_0$ and $c_1$ of the respective intervals are called critical values, and are obtained from quantiles of the null distribution, i.e., the distribution of $T$ under $H_0$.


Decision Errors:

| Decision error | Verbal definition | Probability of a type I/II error |
|---|---|---|
| Type I error | $H_0$ is rejected even though $H_0$ is true. | $P(T\in C \mid H_0 \text{ true})$ |
| Type II error | The test fails to reject a false $H_0$. | $P(T\notin C \mid H_1 \text{ true})$ |

3.2 Significance Level, Size and p-Values

Significance Level: In a statistical significance test, the probability of a type I error is controlled by the significance level $\alpha$ (e.g., $\alpha=5\%$).

$$P(\text{Type I error})=\sup_{\theta\in\Omega_0}P(T\in C\mid\theta)\leq\alpha$$


Size: The size of a statistical test is defined as $\sup_{\theta\in\Omega_0}P(T\in C\mid\theta)$.

That is, the preselected significance level $\alpha$ is an upper bound for the size, which may not be attained (i.e., size $<\alpha$) if, for instance, the distribution of the test statistic is discrete.
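As an added illustration (not part of the original notes): consider the exact one-sided binomial test of $H_0:p=0.5$ against $H_1:p>0.5$ based on $n=10$ trials. Because the binomial distribution is discrete, the largest rejection region that respects $\alpha=0.05$ is $\{X\geq 9\}$, so the size stays well below the nominal level.

# Exact one-sided binomial test of H_0: p = 0.5 vs. H_1: p > 0.5 with n = 10 trials.
# P(X >= 8 | p = 0.5) already exceeds 0.05, so the rejection region must be {X >= 9}:
sum(dbinom(8:10, size = 10, prob = 0.5))
## [1] 0.0546875

# Size of the test that rejects for X >= 9 -- clearly below the nominal level 0.05:
sum(dbinom(9:10, size = 10, prob = 0.5))
## [1] 0.01074219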


Practically important significance levels:

  • $\alpha=0.05$: It is common to say that a test result is “significant” if a hypothesis test of level $\alpha=0.05$ rejects $H_0$.
  • $\alpha=0.01$: It is common to say that a test result is “strongly significant” if a hypothesis test of level $\alpha=0.01$ rejects $H_0$.


p-Value: The p-value is the probability of obtaining a test statistic at least as “extreme” as the one that was actually observed, assuming that the null hypothesis is true.

  • For one-sided tests:
    • $P(T\geq T_{obs}\mid H_0 \text{ true})$ or
    • $P(T\leq T_{obs}\mid H_0 \text{ true})$
  • For two-sided tests:
    • $2\min\{P(T\leq T_{obs}\mid H_0 \text{ true}),\,P(T\geq T_{obs}\mid H_0 \text{ true})\}$

Remarks:

  • The p-value is random as it depends on the observed data. That is, different random samples will lead to different p-values.

  • For given data, having determined the p-value of a test we also know the test decisions for all possible levels $\alpha$:

    • $\alpha>$ p-value $\Rightarrow$ $H_0$ is rejected
    • $\alpha<$ p-value $\Rightarrow$ $H_0$ cannot be rejected



Figure 3.1: From: https://xkcd.com/1478/


Example: Let $X_i\sim N(\mu,\sigma^2)$ independently for all $i=1,\dots,5=n$. Observed realizations from this i.i.d. random sample: $X_1=19.20$, $X_2=17.40$, $X_3=18.50$, $X_4=16.50$, $X_5=18.90$. That is, the empirical mean is given by $\bar{X}=18.1$.


Testing problem: $H_0:\mu=\mu_0$ against $H_1:\mu\neq\mu_0$ (i.e., a two-sided test), where $\mu_0=17$.


Since the variance is unknown, we use the sample standard deviation $s$, which then leads to the t-test for testing $H_0$. Test statistic of the t-test: $$T=\frac{\sqrt{n}(\bar{X}-\mu_0)}{s},$$ where $s^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2$ is the unbiased estimator of $\sigma^2$. $$T_{obs}=\frac{\sqrt{5}\,(18.1-17)}{1.125}=2.187$$ $$\text{p-value}=2\min\{P(T_{n-1}\leq 2.187),\,P(T_{n-1}\geq 2.187)\}=0.094$$


The above computations in R

library("magrittr", quietly = TRUE)# for using the pipe-operator: %>% 

X           <- c(19.20, 17.40, 18.50, 16.50, 18.90)
mu_0        <- 17        # hypothetical mean
n           <- length(X) # sample size
X_mean      <- mean(X)   # empirical mean
X_sd        <- sd(X)     # empirical sd
# t-test statistic
t_test_stat <- sqrt(n)*(X_mean - mu_0)/X_sd

# p-value for two-sided test
c(pt(q = t_test_stat, df = n-1, lower.tail = TRUE), 
  pt(q = t_test_stat, df = n-1, lower.tail = FALSE)) %>% 
  min * 2 -> p_value
      
p_value %>% round(., digits = 3)
## [1] 0.094

Of course, there is also a t.test() function in R:

t.test(X, mu = mu_0, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  X
## t = 2.1869, df = 4, p-value = 0.09402
## alternative hypothesis: true mean is not equal to 17
## 95 percent confidence interval:
##  16.70347 19.49653
## sample estimates:
## mean of x 
##      18.1

3.3 The Power Function

For every possible value $\theta\in\Omega_0\cup\Omega_1$, all sample sizes $n$ and each significance level $\alpha$ the corresponding value of the power function $\beta$ is defined by the following probability: $$\beta_{n,\alpha}(\theta):=P(\text{reject } H_0\mid\theta),\quad \theta\in\Omega_0\cup\Omega_1$$

Obviously, $\beta_{n,\alpha}(\theta)\leq\alpha$ for all $\theta\in\Omega_0$. Furthermore, for any $\theta\in\Omega_1$, $1-\beta_{n,\alpha}(\theta)$ is the probability of committing a type II error.


The power function is an important tool for assessing the quality of a test and for comparing different test procedures.


Conservative Test: If possible, a test is constructed in such a way that size equals level, i.e., $\beta_{n,\alpha}(\theta)=\alpha$ for some $\theta\in\Omega_0$. In some cases, however, such as for discrete test statistics or complex, composite null hypotheses, it is not possible to attain the level, and $\sup_{\theta\in\Omega_0}\beta_{n,\alpha}(\theta)<\alpha$. In this case the test is called conservative.


Unbiased Test: A significance test of level $\alpha>0$ is called unbiased if $\beta_{n,\alpha}(\theta)\geq\alpha$ for all $\theta\in\Omega_1$.


Consistent Test: A significance test of level $\alpha>0$ is called consistent if $\lim_{n\to\infty}\beta_{n,\alpha}(\theta)=1$ for all $\theta\in\Omega_1$.


Most Powerful Test: When choosing between different testing procedures for the same testing problem, one will usually prefer the most powerful test. Consider a fixed sample size $n$. For a specified $\theta\in\Omega_1$, a test with power function $\beta_{n,\alpha}(\theta)$ is said to be most powerful for $\theta$ if, for any alternative test with power function $\beta^*_{n,\alpha}(\theta)$, $$\beta_{n,\alpha}(\theta)\geq\beta^*_{n,\alpha}(\theta)$$ holds for all levels $\alpha>0$.

Uniformly Most Powerful Test: A test with power function $\beta_{n,\alpha}(\theta)$ is said to be uniformly most powerful against the set of alternatives $\Omega_1$ if, for any alternative test with power function $\beta^*_{n,\alpha}(\theta)$, $$\beta_{n,\alpha}(\theta)\geq\beta^*_{n,\alpha}(\theta)\quad\text{holds for all }\theta\in\Omega_1,\ \alpha>0.$$ Unfortunately, uniformly most powerful tests only exist for very special testing problems.


Example: Let $X_1,\dots,X_n$ be an i.i.d. random sample. Assume that $n=9$, and that $X_i\sim N(\mu,0.18^2)$. Hence, in this simple example only the mean $\mu=E(X_i)$ is unknown, while the standard deviation has the known value $\sigma=0.18$.


Testing problem: $H_0:\mu=\mu_0$ against $H_1:\mu\neq\mu_0$ for $\mu_0=18.3$ (i.e., a two-sided test).


Since the variance is known, a test may rely on the Gauss (or Z) test statistic: $$Z=\frac{\sqrt{n}(\bar{X}-\mu_0)}{\sigma}=\frac{3(\bar{X}-18.3)}{0.18}$$

Under $H_0$ we have $Z\sim N(0,1)$, and for the significance level $\alpha=0.05$ the null hypothesis is rejected if $|Z|\geq z_{1-\alpha/2}=1.96$, where $z_{1-\alpha/2}$ denotes the $(1-\alpha/2)$-quantile of the standard normal distribution. Note that the size of this test equals its level $\alpha=0.05$.


For determining the rejection region of a test it suffices to determine the distribution of the test statistic under $H_0$. But in order to calculate the power function one needs to quantify the distribution of the test statistic for all possible values $\theta\in\Omega$. For many important problems this is a formidable task. For the Gauss test, however, it is quite easy. Note that for any (true) mean value $\mu\in\mathbb{R}$ the corresponding distribution of $Z\equiv Z_\mu=\sqrt{n}(\bar{X}-\mu_0)/\sigma$ is $$Z_\mu=\frac{\sqrt{n}(\mu-\mu_0)}{\sigma}+\frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}\sim N\!\left(\frac{\sqrt{n}(\mu-\mu_0)}{\sigma},\,1\right).$$ This implies that $$\beta_{n,\alpha}(\mu)=P(|Z_\mu|>z_{1-\alpha/2})=1-\Phi\!\left(z_{1-\alpha/2}-\frac{\sqrt{n}(\mu-\mu_0)}{\sigma}\right)+\Phi\!\left(-z_{1-\alpha/2}-\frac{\sqrt{n}(\mu-\mu_0)}{\sigma}\right),$$ where $\Phi$ denotes the distribution function of the standard normal distribution.

Implementing the power function of the two-sided Z-test in R:

# The power function
beta_Ztest_TwoSided <- function(n, alpha, sigma, mu_0, mu){
  # (1-alpha/2)-quantile of N(0,1):
  z_upper        <- qnorm(p = 1-alpha/2)
  # location shift under H_1:
  location_shift <- sqrt(n) * (mu - mu_0)/sigma
  # compute power
  power          <- 1 - pnorm( z_upper - location_shift) + 
                        pnorm(-z_upper - location_shift)
  return(power)
}

# Apply the function
n     <-  9
sigma <-  0.18
mu_0  <- 18.3 
##
c(beta_Ztest_TwoSided(n = n, alpha = 0.05, sigma = sigma, mu_0 = mu_0, mu=18.35),
  beta_Ztest_TwoSided(n = n, alpha = 0.05, sigma = sigma, mu_0 = mu_0, mu=18.50),
  beta_Ztest_TwoSided(n = n, alpha = 0.01, sigma = sigma, mu_0 = mu_0, mu=18.50)) %>% 
  round(., digits = 3)
## [1] 0.133 0.915 0.776

Plotting the graph of the power function

suppressPackageStartupMessages(
  library("tidyverse")
)
# Vectorize the function with respect to mu_0:
beta_Ztest_TwoSided <- Vectorize(FUN = beta_Ztest_TwoSided, 
                                 vectorize.args = "mu_0")

mu_0_vec <- seq(from = 17.75, to = 18.25, len = 50)

beta_vec <- beta_Ztest_TwoSided(n     =   10, 
                                alpha = 0.05, 
                                sigma = 0.18, 
                                mu    =  18, 
                                mu_0  = mu_0_vec)

beta_df <- data.frame("mu_0"  = mu_0_vec,
                      "Beta"  = beta_vec)

ggplot(data = beta_df, aes(x=mu_0, y=Beta)) +
  geom_line() +
  geom_hline(yintercept = 0.05, lty=2) + 
  geom_text(aes(x=17.77, y=0.07, label='alpha==0.05'), parse=TRUE, size=5) +
  labs(title = expression(
    paste("Powerfunction of the two-sided Z-Test (n=10 and ",alpha==0.05,")")), 
       x = expression(paste(mu[0])),
       y = expression(paste(beta)), size=8)    +
  theme_bw() +
  theme(axis.text  = element_text(size=12),
           axis.title = element_text(size=14))

This example illustrates the power function of a sensible test (see also the numerical checks after the following list), since:

  • Under $H_0:\mu=\mu_0$ we have $\beta_{n,\alpha}(\mu_0)=\alpha$.
  • The test is unbiased, since $\beta_{n,\alpha}(\mu)\geq\alpha$ for any $\mu\neq\mu_0$.
  • The test is consistent, since $\lim_{n\to\infty}\beta_{n,\alpha}(\mu)=1$ for every fixed $\mu\neq\mu_0$.
  • For fixed sample size $n$, $\beta_{n,\alpha}(\mu)$ increases as the distance $|\mu-\mu_0|$ increases.
  • If $|\mu'-\mu_0|>|\mu-\mu_0|$, then $\beta_{n,\alpha}(\mu')>\beta_{n,\alpha}(\mu)$.
  • $\beta_{n,\alpha}(\mu)$ decreases as the significance level $\alpha$ of the test decreases, i.e., if $\alpha>\alpha'$, then $\beta_{n,\alpha}(\mu)>\beta_{n,\alpha'}(\mu)$.
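These properties can be checked numerically with the beta_Ztest_TwoSided() function defined above; the following minimal sketch uses the illustrative values $\mu=18.4$, $\mu_0=18.3$ and $\sigma=0.18$.

# Consistency: for fixed mu = 18.4 != mu_0 = 18.3 the power tends to 1 as n grows
sapply(c(9, 25, 100, 400), function(n_i)
  beta_Ztest_TwoSided(n = n_i, alpha = 0.05, sigma = 0.18,
                      mu_0 = 18.3, mu = 18.4)) %>% 
  round(digits = 3)
## [1] 0.385 0.793 1.000 1.000

# Monotonicity in alpha: lowering the level from 0.05 to 0.01 lowers the power
c(beta_Ztest_TwoSided(n = 9, alpha = 0.05, sigma = 0.18, mu_0 = 18.3, mu = 18.4),
  beta_Ztest_TwoSided(n = 9, alpha = 0.01, sigma = 0.18, mu_0 = 18.3, mu = 18.4)) %>% 
  round(digits = 3)
## [1] 0.385 0.182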


Assuming that the basic assumptions (i.e., normality and known variance) are true, the above Gauss test is the most prominent example of a uniformly most powerful test (for the two-sided alternative considered here, uniformly most powerful among unbiased tests). Under its (restrictive) assumptions, no other such test can achieve a larger value of $\beta_{n,\alpha}(\mu)$ for any possible value of $\mu$.

3.4 Asymptotic Null Distributions

Generally, the underlying distributions are unknown. In this case it is usually not possible to compute the power function of a test for fixed $n$. (Exceptions are so-called “distribution-free” tests in nonparametric statistics.) The only way out of this difficulty is to rely on large-sample asymptotics and corresponding asymptotic distributions, which allow one to approximate the power function and to study the asymptotic efficiency of a test. The finite-sample behavior of a test for different sample sizes $n$ is then evaluated by means of simulation studies.

For a real-valued parameter $\theta$, most tests of $H_0:\theta=\theta_0$ rely on estimators $\hat\theta$ of $\theta$. Under suitable regularity conditions on the underlying distribution, central limit theorems usually imply that $$\sqrt{n}(\hat\theta-\theta)\xrightarrow{D}N(0,v^2)\quad\text{as } n\to\infty,$$ where $v^2$ is the asymptotic variance of the estimator.


Often a consistent estimator $\hat v^2$ of $v^2$ can be determined from the data. For large $n$ we then have, approximately, $$\frac{\sqrt{n}(\hat\theta-\theta)}{v}\overset{a}{\sim}N(0,1).$$ For a given $\alpha$, a one-sided test of $H_0:\theta=\theta_0$ against $H_1:\theta>\theta_0$ then rejects $H_0$ if $$Z=\frac{\sqrt{n}(\hat\theta-\theta_0)}{v}>z_{1-\alpha}.$$ The corresponding asymptotic approximation (valid for sufficiently large $n$) of the true power function is then given by $$\beta_{n,\alpha}(\theta)=1-\Phi\!\left(z_{1-\alpha}-\frac{\sqrt{n}(\theta-\theta_0)}{v}\right).$$
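As a hedged illustration of this formula (a setting added here, not taken from the notes): for an i.i.d. Bernoulli sample with success probability $\theta$, the estimator $\hat\theta=\bar X_n$ has asymptotic variance $v^2=\theta(1-\theta)$, and the asymptotic power of the one-sided test of $H_0:\theta=0.5$ against $H_1:\theta>0.5$ can be evaluated directly.

# Asymptotic power function of the one-sided test for a Bernoulli proportion;
# the asymptotic variance v^2 = theta * (1 - theta) is evaluated at the true theta.
asy_power_prop <- function(n, alpha, theta_0, theta){
  v <- sqrt(theta * (1 - theta))
  1 - pnorm(qnorm(1 - alpha) - sqrt(n) * (theta - theta_0) / v)
}

# Approximate power at theta = 0.6 for n = 100 observations and alpha = 0.05:
round(asy_power_prop(n = 100, alpha = 0.05, theta_0 = 0.5, theta = 0.6), digits = 3)
## [1] 0.654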


Note that in practice the (unknown) true value $v^2$ is generally replaced by an estimator $\hat v^2$ determined from the data. As long as $\hat v^2$ is a consistent estimator of $v^2$, this leads to the same asymptotic power function. The resulting test is asymptotically unbiased and consistent.


Usually there are many different possible estimators for a parameter $\theta$. Consider an alternative estimator $\tilde\theta$ of $\theta$ satisfying $$\sqrt{n}(\tilde\theta-\theta)\xrightarrow{D}N(0,\tilde v^2)\quad\text{as } n\to\infty.$$ If the asymptotic variance $v^2$ of the estimator $\hat\theta$ is smaller than the asymptotic variance $\tilde v^2$ of $\tilde\theta$, i.e., $v^2<\tilde v^2$, then $\hat\theta$ is a more efficient estimator of $\theta$. Then necessarily the test based on $\hat\theta$ is more powerful than the test based on $\tilde\theta$, since asymptotically for all $\theta>\theta_0$ $$\tilde\beta_{n,\alpha}(\theta)=1-\Phi\!\left(z_{1-\alpha}-\frac{\sqrt{n}(\theta-\theta_0)}{\tilde v}\right)<1-\Phi\!\left(z_{1-\alpha}-\frac{\sqrt{n}(\theta-\theta_0)}{v}\right)=\beta_{n,\alpha}(\theta).$$
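As a concrete (added) instance of this comparison, assume normally distributed data and compare $\hat\theta=\bar X$ (asymptotic variance $v^2=\sigma^2$) with the sample median $\tilde\theta$ (asymptotic variance $\tilde v^2=\pi\sigma^2/2\approx 1.57\,\sigma^2$); the test based on the more efficient estimator $\bar X$ has the larger asymptotic power.

# Asymptotic power of one-sided tests based on two different estimators of the mean of
# normal data: the sample mean (v = sigma) and the sample median (v = sigma*sqrt(pi/2)).
# Illustrative values: theta_0 = 0, sigma = 1, n = 50, alpha = 0.05, theta = 0.3.
asy_power <- function(theta, theta_0, n, alpha, v){
  1 - pnorm(qnorm(1 - alpha) - sqrt(n) * (theta - theta_0) / v)
}

c(mean   = asy_power(theta = 0.3, theta_0 = 0, n = 50, alpha = 0.05, v = 1),
  median = asy_power(theta = 0.3, theta_0 = 0, n = 50, alpha = 0.05, v = sqrt(pi/2))) %>% 
  round(digits = 3)  # approx. 0.683 (mean) vs. 0.519 (median)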


Example: Let $X_1,\dots,X_n$ be an i.i.d. random sample. Consider testing $H_0:\mu=\mu_0$ against $H_1:\mu>\mu_0$, where $\mu:=E(X_i)$. For a given level $\alpha$ the t-test then rejects $H_0$ if $$T=\frac{\sqrt{n}(\bar{X}-\mu_0)}{S}>t_{n-1;1-\alpha},$$ where $t_{n-1;1-\alpha}$ is the $(1-\alpha)$-quantile of the t-distribution with $n-1$ degrees of freedom. This is an exact test if the distribution of $X_i$ is normal. In the general case, the justification of the t-test is based on asymptotic arguments. Under some regularity conditions the central limit theorem implies that $$\sqrt{n}(\bar{X}-\mu)\xrightarrow{D}N(0,\sigma^2)\quad\text{as } n\to\infty,$$ with $\sigma^2=\mathrm{Var}(X_i)$. Moreover, $S^2$ is a consistent estimator of $\sigma^2$ and $t_{n-1;1-\alpha}\to z_{1-\alpha}$ as $n\to\infty$. Thus, even if the distribution of $X_i$ is non-normal, for sufficiently large $n$, $T=\frac{\sqrt{n}(\bar{X}-\mu_0)}{S}$ is approximately $N(0,1)$-distributed and the asymptotic power function of the t-test is given by $$\beta_{n,\alpha}(\mu)=1-\Phi\!\left(z_{1-\alpha}-\frac{\sqrt{n}(\mu-\mu_0)}{\sigma}\right).$$
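The quality of this asymptotic approximation can be checked by simulation. The following sketch (added here; the exponential distribution and all values are illustrative) draws data with true mean $\mu=1$, so that $H_0:\mu=1$ holds, and estimates the actual type I error rate of the one-sided t-test for a small and a moderately large sample size.

# Simulated type I error rate of the one-sided t-test under non-normal
# (exponential) data; the hypothesized mean mu_0 = 1 equals the true mean,
# so H_0 is true and the rejection rate estimates the actual size.
set.seed(321)
alpha <- 0.05
mu_0  <- 1
reps  <- 10000

reject_rate <- function(n){
  mean(replicate(reps, 
    t.test(rexp(n, rate = 1), mu = mu_0, alternative = "greater")$p.value < alpha))
}

sapply(c(10, 200), reject_rate)  # should move towards 0.05 as n grows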

3.5 Multiple Comparisons

In statistics, the multiple comparisons, multiplicity, or multiple testing problem occurs when one considers a set of statistical inferences simultaneously, or infers a subset of parameters selected based on the observed values. Errors in inference, including confidence intervals that fail to include their corresponding population parameters or hypothesis tests that incorrectly reject the null hypothesis, are more likely to occur when one considers the set as a whole.

In empirical studies often dozens or even hundreds of tests are performed for the same data set. When searching for significant test results, one may come up with false discoveries.

Example: $m$ different, independent tests of significance level $\alpha>0$. (Independence means that the test statistics used are mutually independent – this is usually not true in practice.) Let’s assume that a common null hypothesis $H_0$ holds for each of the $m$ tests. Then

$$P(\text{Type I error by at least one of the } m \text{ tests})=1-(1-\alpha)^m=:\alpha_m>\alpha$$

Therefore, as $m$ increases, the probability of committing at least one type I error also increases:

| Number of tests $m$ | Probability of at least one type I error ($\alpha_m$) |
|---|---|
| 1 | 0.050 |
| 3 | 0.143 |
| 5 | 0.226 |
| 10 | 0.401 |
| 100 | 0.994 |
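The probabilities in this table follow directly from $\alpha_m=1-(1-\alpha)^m$ and can be reproduced in R:

# Probability of at least one type I error among m independent tests of level alpha
alpha <- 0.05
m     <- c(1, 3, 5, 10, 100)
round(1 - (1 - alpha)^m, digits = 3)
## [1] 0.050 0.143 0.226 0.401 0.994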


Analogous problem: Construction of $m$ many $(1-\alpha)$ confidence intervals. $$P(\text{At least one of the } m \text{ confidence intervals does not contain the true parameter value})=1-(1-\alpha)^m>\alpha$$



Figure 3.2: From: https://xkcd.com/882/

This represents the general problem of multiple comparisons. In practice, it will not be true that all considered test statistics are mutually independent. (This even complicates the problem.) However, we will still have the effect that the probability of at least one falsely significant result increases with the number $m$ of tests, but it will not be equal to $1-(1-\alpha)^m$.


A statistically rigorous solution of this problem consists in modifying the construction of tests or confidence intervals in order to arrive at simultaneous tests: $$P(\text{Type I error by at least one of the } m \text{ tests})\leq\alpha$$ or simultaneous confidence intervals: $$P(\text{At least one of the } m \text{ confidence intervals does not contain the true parameter value})\leq\alpha$$ $$\Leftrightarrow\quad P(\text{All confidence intervals simultaneously contain the true parameter values})\geq 1-\alpha$$


For certain problems (e.g., analysis of variance) there exist specific procedures for constructing simultaneous confidence intervals. However, the only generally applicable procedure seems to be the Bonferroni correction. It is based on Boole’s inequality.

Theorem (Boole): Let $A_1,A_2,\dots,A_m$ denote $m$ different events. Then $$P(A_1\cup A_2\cup\dots\cup A_m)\leq\sum_{i=1}^m P(A_i).$$ This inequality also implies that $$P(A_1\cap A_2\cap\dots\cap A_m)\geq 1-\sum_{i=1}^m P(\bar{A}_i),$$ where $\bar{A}_i$ denotes the complementary event “not $A_i$”.


Example: Bonferroni adjustment for $m$ different tests, each of level $\alpha^*=\alpha/m$: $$P(\text{Type I error by at least one of the } m \text{ tests})\leq\sum_{i=1}^m\alpha^*=\alpha$$


Analogously: Construction of $m$ many $(1-\alpha^*)$-confidence intervals with $\alpha^*=\alpha/m$: $$P(\text{At least one of the } m \text{ confidence intervals does not contain the true parameter value})\leq\sum_{i=1}^m\alpha^*=\alpha$$ $$\Leftrightarrow\quad P(\text{All confidence intervals simultaneously contain the true parameter values})\geq 1-\sum_{i=1}^m\alpha^*=1-\alpha$$
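In R, the Bonferroni correction can be applied either by comparing each p-value with the adjusted level $\alpha^*=\alpha/m$ or, equivalently, by adjusting the p-values with p.adjust(); the p-values below are purely hypothetical.

# Bonferroni correction for m = 5 hypothetical p-values at overall level alpha = 0.05
p_values <- c(0.012, 0.034, 0.002, 0.210, 0.048)
m        <- length(p_values)
alpha    <- 0.05

# compare each p-value with the adjusted level alpha/m = 0.01
p_values < alpha / m
## [1] FALSE FALSE  TRUE FALSE FALSE

# equivalently: multiply the p-values by m (capped at 1) and compare with alpha
p.adjust(p_values, method = "bonferroni") < alpha
## [1] FALSE FALSE  TRUE FALSE FALSE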

Example: Regression analysis with $K=100$ regressors, where none of the variables has an effect on the dependent variable $Y$.

library("tidyverse", quietly = TRUE)
K <- 100 # number of regressors
n <- 500 # sample size

set.seed(123)

# Generate regression data, where none of the X-variables 
# has an effect on the dependent variable Y:
my_df <- matrix(data = rnorm(n = n*K), 
                nrow = n, ncol = K, 
                dimnames = list(paste0("i.",1:n), 
                                paste0("X.",1:K))) %>% 
  as_tibble() %>% 
  mutate(Y = rnorm(n)) %>% # Adding a Y-variable that is independent of the X-variables
  select(Y, everything())  

# OLS regression
OLS_result_df <- lm(Y ~ . , data = my_df) %>% 
  summary %>% 
  broom::tidy()


Count_Signif <- OLS_result_df %>% 
  filter(term != '(Intercept)') %>% 
  count(p.value < 0.05)
| p.value < 0.05 | n |
|---|---|
| FALSE | 96 |
| TRUE | 4 |
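Applying a Bonferroni correction to this example (i.e., comparing the p-values with the adjusted level $0.05/K$) guards against such false discoveries; in a typical simulation run none of the 100 pure-noise regressors remains significant.

# Bonferroni-corrected significance counts for the 100 noise regressors;
# with the adjusted level 0.05/K = 5e-04, usually no term remains significant.
OLS_result_df %>% 
  filter(term != '(Intercept)') %>% 
  count(p.value < 0.05 / K)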

3.6 R-Lab: The Gauss-Test

Let’s reconsider the simplest test statistic you will ever meet: the Gauss-Test (or “Z-Test”).

Setup: Let $X_1,\dots,X_n$ be an i.i.d. random sample with $X_i\sim N(\mu,\sigma^2)$ and $\sigma^2<\infty$.

Idea: Under the above setup, $\bar{X}_n=n^{-1}\sum_{i=1}^n X_i$ consistently estimates the (unknown) true mean value $\mu$. That is, $\bar{X}_n\xrightarrow{p}\mu$.

  • Under the null hypothesis (i.e., $\mu_0=\mu$), the difference $\bar{X}_n-\mu_0$ should be “small”.
  • Under the alternative hypothesis (i.e., $\mu_0\neq\mu$), the difference $\bar{X}_n-\mu_0$ should be “large”.


Under the null hypothesis $H_0$ we have that $\mu_0=\mu$. Therefore: $$Z=\frac{\sqrt{n}(\bar{X}_n-\mu_0)}{\sigma}=\frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma}\sim N(0,1)$$


Under the alternative $H_1$ we have that $\mu_0\neq\mu$. Therefore: $$Z=\frac{\sqrt{n}(\bar{X}_n-\mu_0)}{\sigma}=\frac{\sqrt{n}(\bar{X}_n-\mu_0+\mu-\mu)}{\sigma}=\frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma}+\frac{\sqrt{n}(\mu-\mu_0)}{\sigma}\sim N\!\left(\frac{\sqrt{n}(\mu-\mu_0)}{\sigma},\,1\right)$$
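These two distributions can also be verified by a small Monte-Carlo simulation (a minimal sketch with illustrative values $\mu_0=0$, $\mu=0.5$, $\sigma=1$ and $n=25$):

# Monte-Carlo check: Z should be approximately N(0, 1) under H_0 and
# approximately N(sqrt(n) * (mu - mu_0) / sigma, 1) = N(2.5, 1) under H_1.
set.seed(1)
n     <- 25
sigma <- 1
mu_0  <- 0
mu_H1 <- 0.5
reps  <- 10000

Z_H0 <- replicate(reps, sqrt(n) * (mean(rnorm(n, mean = mu_0,  sd = sigma)) - mu_0) / sigma)
Z_H1 <- replicate(reps, sqrt(n) * (mean(rnorm(n, mean = mu_H1, sd = sigma)) - mu_0) / sigma)

c(mean(Z_H0), sd(Z_H0))  # approx. 0 and 1
c(mean(Z_H1), sd(Z_H1))  # approx. 2.5 and 1

# empirical size and power of the two-sided test at alpha = 0.05
mean(abs(Z_H0) > qnorm(0.975))
mean(abs(Z_H1) > qnorm(0.975))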

The different distributions (under $H_0$ and $H_1$) of the test statistic $Z$ can be investigated in the following dynamic plot: