Unit 1: Exploring One-Variable Data · Unit 2: Exploring Two-Variable Data
Unit 3: Collecting Data · Unit 4: Probability · Unit 5: Sampling Distributions
Unit 6: Inference for Proportions · Unit 7: Inference for Means
Unit 8: Chi-Square Tests · Unit 9: Inference for Slopes
Core Concepts & Key Formulas
Study these before attempting the exam
Unit 1
Exploring One-Variable Data
Mean: x̄ = (Σxᵢ) / n
Sample SD: s = √[Σ(xᵢ − x̄)² / (n−1)]
IQR = Q3 − Q1
Outlier fences: Q1 − 1.5·IQR and Q3 + 1.5·IQR
z-score: z = (x − μ) / σ
Percentile (Normal): use z-table
Mean is sensitive to outliers; median is resistant.
Skewed right → mean > median (typically); skewed left → mean < median.
Adding constant c to all values: mean/median shift by c; SD unchanged.
Multiplying by constant c: mean and median multiply by c; SD multiplies by |c|.
Normal distribution: 68-95-99.7 Rule (1σ / 2σ / 3σ).
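The shift and scale rules above can be checked numerically. The snippet below is a minimal sketch using Python's standard statistics module; the dataset and the constants 10 and −3 are made up for illustration.

```python
from statistics import mean, median, stdev

def summarize(values):
    """Return (mean, median, sample SD) for a list of numbers."""
    return mean(values), median(values), stdev(values)

data = [2, 4, 4, 6, 9]                 # illustrative values only
shifted = [x + 10 for x in data]       # add a constant c = 10
scaled = [-3 * x for x in data]        # multiply by a constant c = -3

print(summarize(data))     # center and spread of the original data
print(summarize(shifted))  # mean and median move up by 10; SD is unchanged
print(summarize(scaled))   # mean and median multiply by -3; SD multiplies by 3
```

A negative multiplier is used on purpose: it flips the sign of the mean and median, but the SD stays positive because spread cannot be negative.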
Example
A dataset has Q1 = 20, Q3 = 40. Is 65 an outlier?
IQR = 20; upper fence = 40 + 1.5(20) = 70. Since 65 < 70, NOT an outlier.
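The same fence check can be sketched in Python. The function names outlier_fences and is_outlier are illustrative, not part of any library.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences from the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3):
    """True if x lies strictly outside the fences."""
    lower, upper = outlier_fences(q1, q3)
    return x < lower or x > upper

print(outlier_fences(20, 40))   # (-10.0, 70.0)
print(is_outlier(65, 20, 40))   # False
```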
Unit 2
Exploring Two-Variable Data
Correlation: r (no units, −1 ≤ r ≤ 1)
LSRL: ŷ = a + bx
Slope: b = r · (sᵧ / sₓ)
y-intercept: a = ȳ − b·x̄
Coefficient of determination: r²
Residual = Actual − Predicted = y − ŷ
r measures only linear association; not causation.
r² = proportion of variation in y explained by x.
Residual plot: random scatter = good linear model.
Influential point: removing it greatly changes LSRL.
Lurking variable can create misleading association.
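The Unit 2 formulas fit together as follows. This is a minimal sketch that computes r, the slope, and the intercept from raw data using only the standard library; the function name lsrl and the five data points are made up for illustration.

```python
from statistics import mean, stdev

def lsrl(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # r is the average product of z-scores, using n - 1 in the denominator
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x          # slope: b = r * (s_y / s_x)
    a = y_bar - b * x_bar      # intercept: a = y-bar - b * x-bar
    return a, b, r

xs = [1, 2, 3, 4, 5]           # illustrative data only
ys = [2, 4, 5, 4, 5]
a, b, r = lsrl(xs, ys)

predicted = a + b * 2          # y-hat at x = 2
residual = 4 - predicted       # actual - predicted for the point (2, 4)
print(round(a, 3), round(b, 3), round(r ** 2, 3), round(residual, 3))
```

For this data the line works out to ŷ = 2.2 + 0.6x with r² = 0.6, so the point (2, 4) sits 0.6 above the line.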
Unit 4
Probability
Example
A fair coin is tossed 4 times. P(exactly 3 heads)?
P = C(4,3)·(0.5)³·(0.5)¹ = 4·0.125·0.5 = 0.25
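The coin-toss calculation generalizes to any binomial setting. A minimal sketch with math.comb from the standard library follows; the function name binomial_pmf is illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

print(binomial_pmf(3, 4, 0.5))   # 0.25
```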
Unit 5
Sampling Distributions
Sampling dist of p̂: mean = p, SD = √(p(1−p)/n)
Sampling dist of x̄: mean = μ, SD = σ/√n (estimated by SE = s/√n when σ is unknown)
Central Limit Theorem: n ≥ 30 → x̄ approx Normal
10% condition: n ≤ 0.10N (for independence when sampling without replacement)
Large counts: np ≥ 10 and n(1−p) ≥ 10 (for proportions)
SE of x̄ decreases as n increases, in proportion to 1/√n (quadrupling n halves the SE).
CLT applies regardless of the shape of the population distribution when n is large.
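The Unit 5 formulas and condition checks can be sketched as small helper functions. All names are illustrative, and the thresholds are the ones listed above.

```python
from math import sqrt

def sd_p_hat(p, n):
    """SD of the sampling distribution of p-hat."""
    return sqrt(p * (1 - p) / n)

def sd_x_bar(sigma, n):
    """SD of the sampling distribution of x-bar."""
    return sigma / sqrt(n)

def large_counts_ok(p, n):
    """Large counts condition: np >= 10 and n(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10

def ten_percent_ok(n, population_size):
    """10% condition: the sample is at most 10% of the population."""
    return n <= 0.10 * population_size

print(sd_p_hat(0.5, 100))        # 0.05
print(sd_x_bar(12, 36))          # 2.0
print(sd_x_bar(12, 144))         # 1.0  (4 times the sample size, half the SD)
print(large_counts_ok(0.05, 100), ten_percent_ok(50, 1000))
```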
Unit 7
Inference for Means
CI for μ: x̄ ± t* · (s/√n), df = n−1
t-test: t = (x̄ − μ₀) / (s/√n)
Paired t: use d̄ = mean of differences, s_d = SD of differences
Two-sample t: t = (x̄₁−x̄₂) / √(s₁²/n₁ + s₂²/n₂)
Conditions: Random, 10% condition, Normal/Large sample
Use t-procedures when σ is unknown (almost always the case in practice).
Larger df → t-distribution closer to Normal.
Paired design removes subject-to-subject variability; when pairs are well matched, it is more powerful than a two-sample design.
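The t formulas for means can be sketched with the standard library alone. The function names and data are illustrative. The critical value t* is passed in as an argument because it comes from a t-table or calculator; 2.776 is the usual table value for 95% confidence with df = 4.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(xs, mu0):
    """Return (t, df) for H0: mu = mu0."""
    n = len(xs)
    return (mean(xs) - mu0) / (stdev(xs) / sqrt(n)), n - 1

def paired_t(first, second):
    """Paired t: a one-sample t on the differences, with H0: mean difference = 0."""
    diffs = [b - a for a, b in zip(first, second)]
    return one_sample_t(diffs, 0)

def two_sample_t(xs, ys):
    """Two-sample t statistic (df is left to a calculator or software)."""
    se = sqrt(stdev(xs) ** 2 / len(xs) + stdev(ys) ** 2 / len(ys))
    return (mean(xs) - mean(ys)) / se

def mean_ci(xs, t_star):
    """Confidence interval x-bar +/- t* * s / sqrt(n); t_star is supplied by the user."""
    margin = t_star * stdev(xs) / sqrt(len(xs))
    return mean(xs) - margin, mean(xs) + margin

sample = [2, 4, 4, 6, 9]                        # illustrative data only
print(one_sample_t(sample, 3))                  # t statistic and df = 4
print(paired_t([10, 12, 14], [11, 14, 17]))     # differences are 1, 2, 3
print(mean_ci(sample, 2.776))                   # 95% interval, t* from a table
```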
Unit 8
Chi-Square Tests
χ² = Σ[(O − E)² / E]
Goodness-of-Fit: df = categories − 1
Independence/Homogeneity: df = (r−1)(c−1)
Expected cell: E = (row total × col total) / grand total
Conditions: all expected counts ≥ 5
Goodness-of-Fit: one sample, compare to claimed distribution.
Independence: one sample, two categorical variables.
Homogeneity: multiple samples, one categorical variable.
χ² is always right-tailed.
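The chi-square formulas above can be sketched for both test types. The function names and counts are illustrative. The p-value itself would still come from a χ² table or calculator.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over matching categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def expected_counts(table):
    """Expected counts for a two-way table: (row total * col total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

def chi_square_two_way(table):
    """Return (statistic, df) for a test of independence or homogeneity."""
    expected = expected_counts(table)
    stat = sum(chi_square(obs_row, exp_row) for obs_row, exp_row in zip(table, expected))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Goodness-of-fit: 3 categories, so df = 3 - 1 = 2
print(chi_square([18, 22, 20], [20, 20, 20]))

# Two-way table with 2 rows and 2 columns, so df = (2 - 1)(2 - 1) = 1
table = [[10, 20], [30, 40]]
print(expected_counts(table))        # [[12.0, 18.0], [28.0, 42.0]]
print(chi_square_two_way(table))
```

All four expected counts in the two-way example are at least 5, so the expected-counts condition is met.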
Unit 9
Inference for Regression Slopes
t-test for slope: t = b / SEb, df = n−2
CI for β: b ± t* · SEb
H₀: β = 0 (no linear relationship)
SEb appears in computer output (standard error of slope)
Conditions (LINE): Linear, Independent, Normal residuals, Equal variance.
Check residual plot for linearity and equal variance.
Normal probability plot of residuals checks normality.
p-value < α → reject H₀ → significant linear relationship.
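On the exam SEb is read from computer output, but it can help to see where it comes from. The sketch below assumes the standard formula SEb = s / √Σ(x − x̄)², where s = √(Σ residual² / (n − 2)); the function name and data are illustrative.

```python
from math import sqrt
from statistics import mean

def slope_t_test(xs, ys):
    """Return (b, se_b, t, df) for H0: beta = 0."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                                   # least-squares slope
    a = y_bar - b * x_bar                           # least-squares intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = sqrt(sse / (n - 2))                         # SD of the residuals
    se_b = s / sqrt(sxx)                            # standard error of the slope
    return b, se_b, b / se_b, n - 2

b, se_b, t, df = slope_t_test([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(b, 3), round(se_b, 3), round(t, 3), df)
```

The p-value for this t statistic would come from a t-table or calculator with df = n − 2.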