Exploring Data
Exploratory Data Analysis
Describe distributions using shape, center, spread, and unusual features. Always describe in context.
✦ Explanation
The mean is the sum of all values divided by \(n\). Increasing one value by \(\$3{,}000\) increases the sum by \(\$3{,}000\), so: \[\Delta\bar{x} = \frac{3{,}000}{200} = \$15\] The median depends only on the middle of the ordered data, so it changes only if the altered value moves into or across the middle position. That did not happen here: the changed value was the minimum, and at \(\$31{,}000\) it is still far below the median of \(\$62{,}000\). The ordering of values around the median is unchanged, so the median stays the same. Answer B.
| Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|
| 38 | 62 | 74 | 84 | 99 |
✦ Explanation
First compute the IQR: \(\text{IQR} = Q3 - Q1 = 84 - 62 = 22\).
Lower fence: \(Q1 - 1.5 \times \text{IQR} = 62 - 1.5(22) = 62 - 33 = 29\).
Upper fence: \(Q3 + 1.5 \times \text{IQR} = 84 + 33 = 117\).
Since \(32 > 29\), the score of 32 is not below the lower fence, so it is not an outlier by the \(1.5 \times \text{IQR}\) rule. The correct answer is A. Option C, which states that the lower fence is 35, is the intended trap: a fence of 35 would make 32 an outlier, but that value comes from incorrect arithmetic. Always compute the fences carefully before classifying a point.
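The fence check can be sketched in a few lines of Python. This is only an illustration using the quartiles from the table; the helper name `is_outlier` is made up for this example.

```python
# Quartiles taken from the five-number summary table.
q1, q3 = 62, 84

iqr = q3 - q1                    # 84 - 62 = 22
lower_fence = q1 - 1.5 * iqr     # 62 - 33 = 29
upper_fence = q3 + 1.5 * iqr     # 84 + 33 = 117


def is_outlier(value: float) -> bool:
    """Return True if value lies outside the 1.5 * IQR fences."""
    return value < lower_fence or value > upper_fence


print(lower_fence, upper_fence)  # 29.0 117.0
print(is_outlier(32))            # False
```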
Normal Distributions & Standardizing
The Normal Model
Z-scores, percentiles, and the Empirical Rule. Always standardize first, then use the table or calculator.
✦ Explanation
The 90th percentile z-score is \(z^* \approx 1.28\) (from the standard normal table). Then: \(x = \mu + z^*\sigma = 70 + 1.28(3) = 70 + 3.84 = 73.84\) inches.
The threshold is 73.84 inches. Answer A. The key trap here is confusing the 90th percentile (\(z^* \approx 1.28\)) with the 95th percentile (\(z^* \approx 1.645\)).
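If a table is not at hand, the same lookup can be approximated with Python's standard library. A minimal sketch, assuming the mean of 70 inches and standard deviation of 3 inches used in the calculation:

```python
from statistics import NormalDist

heights = NormalDist(mu=70, sigma=3)   # mean and SD in inches

z_90 = NormalDist().inv_cdf(0.90)      # standard normal 90th percentile, about 1.28
cutoff = heights.inv_cdf(0.90)         # 90th percentile height, about 73.84

print(round(z_90, 3), round(cutoff, 2))
```

Swapping `0.90` for `0.95` reproduces the 95th percentile value of about 1.645 mentioned as the common trap.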
✦ Explanation
400 hours is exactly \(\mu - 2\sigma\) and 600 hours is exactly \(\mu + 2\sigma\). By the 68-95-99.7 rule, about 95% of batteries fall within 2 standard deviations of the mean (more precisely, 95.45%). Therefore, the percentage outside this range (rejected) is: \[100\% - 95.45\% = 4.55\%\] So approximately 4.55% are rejected. Common errors: picking 5% (using the rounded 95% figure) or 95.45% (forgetting to subtract from 100%). Answer B.
Collecting Data
Sampling & Experimental Design
Distinguish observational studies from experiments. Only experiments can establish causation.
✦ Explanation
The fundamental rule: observational studies cannot establish causation. Researchers observed naturally occurring behavior; they did not randomly assign students to "eat breakfast" or "skip breakfast." Therefore, any number of confounding variables (sleep quality, socioeconomic status, motivation level) could simultaneously cause students to eat breakfast and have higher GPAs. The correct criticism is C. Answer D (response bias) is a real concern but doesn't explain the causation error.
✦ Explanation
When researchers suspect that a known variable (hypertension type) may affect the response, the gold standard is to block on that variable. Blocking creates groups of similar experimental units, reducing variability and allowing a clearer comparison of treatments. Within each block (Type A and Type B separately), patients are randomly assigned to drug or placebo. This ensures both types are equally represented in each treatment group. Answer B. Option A could work but is less efficient. Option C is a major design flaw (confounding type with treatment).
Probability & Random Variables
Rules of Probability
Conditional probability, independence, and expected value. The most algebra-heavy unit.
✦ Explanation
Given: \(P(S) = 0.60\), \(P(M) = 0.45\), \(P(M|S) = 0.30\). First find the joint probability: \[P(S \cap M) = P(M|S) \cdot P(S) = 0.30 \times 0.60 = 0.18\] Now apply the conditional probability formula: \[P(S|M) = \frac{P(S \cap M)}{P(M)} = \frac{0.18}{0.45} = 0.40\] Answer C. Common error: picking 0.30 (confusing \(P(M|S)\) with \(P(S|M)\)). This is the most frequent mistake in this type of problem!
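The two-step calculation translates directly into code. The variable names below are just illustrative labels for the given probabilities:

```python
p_s = 0.60          # P(S)
p_m = 0.45          # P(M)
p_m_given_s = 0.30  # P(M | S)

p_s_and_m = p_m_given_s * p_s      # joint probability P(S and M) = 0.18
p_s_given_m = p_s_and_m / p_m      # reversed conditional P(S | M) = 0.40

print(round(p_s_and_m, 2), round(p_s_given_m, 2))  # 0.18 0.4
```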
✦ Explanation
Mean: Linear combination rule: constants add directly, coefficients multiply: \[E(W) = 3E(X) - 2E(Y) + 5 = 3(12) - 2(8) + 5 = 36 - 16 + 5 = 25\] Variance: For independent variables, variances add (with squared coefficients). The constant +5 has no variance: \[\text{Var}(W) = 3^2 \cdot \text{Var}(X) + (-2)^2 \cdot \text{Var}(Y) = 9(9) + 4(16) = 81 + 64 = 145\] \[\sigma_W = \sqrt{145} \approx 12.04\] Answer C. Key pitfall: forgetting to square the coefficients when computing variance, or adding standard deviations directly (SD ≠ SD1 + SD2).
Sampling Distributions
Central Limit Theorem & Sampling Variability
The CLT is the engine of all inference. Know when conditions are met.
✦ Explanation
The sampling distribution of \(\bar{x}\) is normal with: \[\mu_{\bar{x}} = 16, \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.4}{\sqrt{16}} = \frac{0.4}{4} = 0.1\] Standardize: \[z = \frac{16.2 - 16}{0.1} = \frac{0.2}{0.1} = 2.0\] From the standard normal table: \(P(Z > 2.0) = 1 - 0.9772 = 0.0228\). Answer B. A very common mistake is using \(\sigma = 0.4\) directly without dividing by \(\sqrt{n}\), which gives \(z = 0.5\) and \(P \approx 0.3085\) (choice A). That answers the wrong question (the probability for a single package, not for the sample mean).
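A short Python sketch makes the contrast between the correct calculation and the common mistake concrete. It uses only the values from the explanation above (\(\mu = 16\), \(\sigma = 0.4\), \(n = 16\)):

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 16, 0.4, 16

se = sigma / sqrt(n)                  # standard error of the mean: 0.1
z = (16.2 - mu) / se                  # about 2.0
p_mean = 1 - NormalDist().cdf(z)      # P(sample mean > 16.2), about 0.0228

# The common mistake: dividing by sigma instead of the standard error,
# which answers the question for a single package.
p_single = 1 - NormalDist().cdf((16.2 - mu) / sigma)   # about 0.3085

print(round(p_mean, 4), round(p_single, 4))
```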
✦ Explanation
By the Central Limit Theorem, when \(n \geq 30\), the sampling distribution of \(\bar{x}\) is approximately normal regardless of the population's shape. Here \(n = 50 \geq 30\), so the CLT applies. \[\mu_{\bar{x}} = \mu = 4.2\] \[\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{3.1}{\sqrt{50}} = \frac{3.1}{7.071} \approx 0.438 \text{ min}\] Answer C. Note: it's approximately normal, NOT exactly normal (option D is wrong because only a normally-distributed population produces an exactly normal sampling distribution). Option A ignores CLT entirely.
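A small simulation can make the standard-error formula feel less abstract. The sketch below assumes a hypothetical right-skewed population (a gamma distribution matched to \(\mu = 4.2\) and \(\sigma = 3.1\)); the problem does not specify the population's actual shape, so that choice is only for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

mu, sigma, n = 4.2, 3.1, 50
se_theory = sigma / sqrt(n)            # about 0.438

# Hypothetical skewed population: gamma with mean mu and SD sigma.
shape = (mu / sigma) ** 2              # gamma shape parameter
scale = sigma ** 2 / mu                # gamma scale parameter

rng = random.Random(0)                 # fixed seed so the run is repeatable
sample_means = [
    sum(rng.gammavariate(shape, scale) for _ in range(n)) / n
    for _ in range(5000)
]

print(round(se_theory, 3))
print(round(mean(sample_means), 2), round(stdev(sample_means), 2))
```

The simulated sample means should center near 4.2 with a spread close to the theoretical 0.438, even though the individual values come from a skewed distribution.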
Confidence Intervals
Estimating Parameters
Interpret CIs correctly — a major source of AP exam points and errors.
✦ Explanation
The most commonly wrong answer on the AP exam is A. The true proportion \(p\) is a fixed, unknown constant; it does not have a "probability" of being in any particular interval. The interval either contains \(p\) or it doesn't. The correct interpretation is about the procedure: if we repeated this sampling process many times and built a 95% CI each time, approximately 95% of those intervals would capture the true parameter. We say we are "95% confident" that this particular interval is one of the successful ones. Answer C.
✦ Explanation
When no prior estimate of \(p\) is available, use \(\hat{p} = 0.5\) (this maximizes \(\hat{p}(1-\hat{p})\) and therefore gives the most conservative, largest required sample size). Margin of error formula: \[ME = z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \leq 0.03\] Solving for \(n\) with \(z^* = 1.96\): \[n \geq \left(\frac{z^*}{ME}\right)^2 \cdot \hat{p}(1-\hat{p}) = \left(\frac{1.96}{0.03}\right)^2 (0.5)(0.5)\] \[\approx (65.333)^2 \cdot 0.25 \approx 4268.44 \times 0.25 \approx 1067.1\] Always round up to the next whole number: \(n = 1{,}068\). Answers B and C are the same here; the answer is \(\mathbf{1{,}068}\).
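The sample-size calculation, including the round-up step, can be sketched as follows (variable names are illustrative):

```python
from math import ceil

z_star = 1.96      # critical value for 95% confidence
me = 0.03          # desired margin of error
p_hat = 0.5        # conservative value when no prior estimate exists

n_exact = (z_star / me) ** 2 * p_hat * (1 - p_hat)   # about 1067.1
n_required = ceil(n_exact)                           # always round up

print(n_required)  # 1068
```

Using `ceil` rather than `round` matters: rounding 1067.1 to the nearest integer would give 1067, which falls just short of the required margin of error.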
✦ Explanation
The rule is simple: use a t-interval when \(\sigma\) is unknown (which is virtually always the case in practice, since you almost never know the true population standard deviation). We estimate \(\sigma\) with \(s\), and this extra uncertainty is captured by using the t-distribution instead of the normal.
Degrees of freedom: \(df = n - 1 = 25 - 1 = 24\).
Answer B. Common trap: option D says \(df = 25\) — always subtract 1. Option A is wrong because knowing the sample standard deviation doesn't justify using z; only knowing \(\sigma\) (the population parameter) would.
Hypothesis Testing
Significance Tests & Errors
The hardest unit to interpret correctly. Master the p-value definition and Type I/II errors.
✦ Explanation
This is one of the most important conceptual distinctions in AP Statistics. The formal definition of a p-value is: "The probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true."
The null hypothesis either is true or isn't — it's a fixed (though unknown) reality, not a random event with a probability. The p-value tells us how surprising our data is if \(H_0\) were true. A small p-value means our data would be very unlikely if \(H_0\) were true, giving us reason to doubt \(H_0\). Answer C.
✦ Explanation
Type I error: Rejecting \(H_0\) when it's actually true → concluding the line is broken when it's fine → $50,000 shutdown cost.
Type II error: Failing to reject \(H_0\) when it's actually false → missing the underfilling → $10/unit fine (less severe in this scenario).
Since a Type I error is far more costly here, the inspector should minimize \(\alpha\) (use a smaller significance level, e.g., \(\alpha = 0.01\)). This makes it harder to reject \(H_0\), reducing false shutdowns. Answer A. Tricky because students often confuse which error is "worse" in context.
✦ Explanation
The two-sample \(t\)-statistic formula: \[t = \frac{(\bar{x}_A - \bar{x}_B) - 0}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}\] Compute the standard error: \[SE = \sqrt{\frac{10^2}{30} + \frac{12^2}{35}} = \sqrt{\frac{100}{30} + \frac{144}{35}} = \sqrt{3.333 + 4.114} = \sqrt{7.447} \approx 2.729\] Then: \[t = \frac{82 - 78}{2.729} = \frac{4}{2.729} \approx 1.47\] Answer C. Common mistakes: using \(n_A + n_B\) in the denominator, or forgetting to square the standard deviations.
| | Coffee | Tea | Neither | Total |
|---|---|---|---|---|
| Under 40 | 50 | 30 | 20 | 100 |
| 40 and over | 40 | 45 | 15 | 100 |
| Total | 90 | 75 | 35 | 200 |
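For a chi-square test of independence, each expected count is (row total × column total) / grand total. As a rough sketch, the expected counts for every cell of the table above can be computed like this (the dictionary layout is just one convenient way to hold the observed counts):

```python
# Observed counts from the two-way table.
observed = {
    "Under 40":    {"Coffee": 50, "Tea": 30, "Neither": 20},
    "40 and over": {"Coffee": 40, "Tea": 45, "Neither": 15},
}

row_totals = {row: sum(cells.values()) for row, cells in observed.items()}
col_totals = {
    col: sum(cells[col] for cells in observed.values())
    for col in ["Coffee", "Tea", "Neither"]
}
grand_total = sum(row_totals.values())

# Expected count under independence: row total * column total / grand total.
expected = {
    row: {col: row_totals[row] * col_totals[col] / grand_total for col in col_totals}
    for row in observed
}

print(expected["Under 40"]["Tea"])  # 37.5
```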
✦ Explanation
The expected count formula for a chi-square test of independence: \[E = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}}\] For "Under 40, Tea": \[E = \frac{100 \times 75}{200} = \frac{7500}{200} = 37.5\] Answer C. The observed count (30) is less than the expected (37.5), suggesting people under 40 prefer tea slightly less than expected. Note: you never just look at the observed cell count (30) as the expected value!
✦ Explanation
The predicted value for \(x = 10\) hours: \[\hat{y} = 42.3 + 5.8(10) = 42.3 + 58 = 100.3\] Residual = Observed − Predicted: \[e = y - \hat{y} = 85 - 100.3 = -15.3\] A negative residual means the actual score (85) is below what the model predicted (100.3). The student underperformed relative to the model's prediction. Answer A/B: \(-15.3\). Note: residual = actual MINUS predicted (not the other way around).
✦ Explanation
\(r^2\) (the coefficient of determination) measures the proportion of variability in the response variable that is explained by the linear model. The correct template is: "About 64% of the variation in [response variable] is explained by/accounted for by the linear relationship with [explanatory variable]."
Answer C. Common errors: (B) \(r^2 \neq r\) — if \(r^2 = 0.64\), then \(r = \pm 0.8\). (D) Regression never implies causation — the standard warning applies here too.
✦ Explanation
Power = the probability of correctly rejecting a false null hypothesis = \(1 - \beta\).
Ways to increase power:
• Increase \(\alpha\) (but the problem says keep \(\alpha = 0.05\))
• Increase sample size \(n\) → reduces standard error → easier to detect a true effect
• Increase the effect size (true difference from \(H_0\))
• Reduce variability
Answer C (increase \(n\)) is the most direct, controllable way to increase power while keeping \(\alpha\) fixed. Option D (reduce σ) could also work but is typically not within the researcher's control. Option A (decrease α) actually decreases power — the most common trap!
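To see how sample size and \(\alpha\) move power, here is a rough sketch for a one-sided z-test with known \(\sigma\). The test type, the effect size of 2.0, and \(\sigma = 10\) are hypothetical values chosen only for illustration; they are not taken from the question.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(effect: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Power of a one-sided z-test of H0: mu = mu0 vs Ha: mu > mu0
    when the true mean is mu0 + effect (sigma assumed known)."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_crit - effect * sqrt(n) / sigma)


# Hypothetical numbers: power rises as n grows with alpha fixed at 0.05.
for n in (25, 50, 100):
    print(n, round(power_one_sided_z(effect=2.0, sigma=10.0, n=n), 3))

# Lowering alpha (here to 0.01) reduces power at the same n.
print(round(power_one_sided_z(effect=2.0, sigma=10.0, n=25, alpha=0.01), 3))
```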