
How to Solve Hypothesis Testing: Step-by-Step
Most students who struggle with hypothesis testing get the arithmetic right and the logic wrong. The five-step procedure covered in this post solves both: it shows every calculation and explains why each step exists, so you can carry it into any exam scenario, not just the one you practiced on.
Two fully worked examples follow the method section, one using a z-test and one using a t-test. Each shows every number from setup to conclusion. The section on the most common mistake covers the p-value misreading that costs marks even when the calculation is correct.
What Is Hypothesis Testing?
Hypothesis testing is a statistical procedure for deciding whether sample data provide enough evidence to reject a specific claim about a population. The procedure answers one question: could the observed result plausibly have occurred by random chance if the null hypothesis were true?
The method does not prove anything. It quantifies how surprised you should be by your data under a specific assumption. That surprise is measured by the p-value, which tells you how often random sampling alone would produce a result at least as extreme as yours if the null were correct.
The Null Hypothesis and the Alternative Hypothesis
Every hypothesis test starts with two competing statements. The null hypothesis (H0) is the default claim of no effect, no difference, or no relationship. The alternative hypothesis (H1 or Ha) is the specific claim you want to test, usually that something has changed, differs, or exceeds a threshold. The null is what you assume true until the data argue strongly otherwise.
A typical example: a pharmaceutical company claims a new drug reduces average recovery time below the population mean of 14 days. The null hypothesis states that the mean recovery time with the drug equals 14 days. The alternative states that the mean is less than 14 days. The test asks whether the sample data from the drug trial are consistent with H0 or whether they shift far enough toward H1 to justify rejecting the null.
One-Tailed Versus Two-Tailed Tests
A two-tailed test checks whether the population parameter differs from the null value in either direction. A one-tailed test checks whether it differs specifically upward or specifically downward. The choice must follow from the research question, not from inspection of the data.
| Test type | H1 format | Rejection region | When to use |
|---|---|---|---|
| Two-tailed | mu does not equal mu0 | Both tails (alpha/2 each) | Any difference in either direction is interesting |
| Left-tailed (lower) | mu is less than mu0 | Left tail only (full alpha) | You predict the parameter decreased |
| Right-tailed (upper) | mu is greater than mu0 | Right tail only (full alpha) | You predict the parameter increased |
Set the test direction before collecting data. Changing from two-tailed to one-tailed after seeing the results inflates the Type I error rate.
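The rejection regions in the table can be turned into critical values. The sketch below is a minimal illustration for a z-test, using the standard library's `statistics.NormalDist`; the function name `critical_values` and the tail labels `"two"`, `"left"`, and `"right"` are made up for this example.

```python
from statistics import NormalDist

def critical_values(alpha, tail):
    """Return the z critical value(s) that bound the rejection region.

    tail is "two", "left", or "right" (labels chosen for this sketch).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        # alpha is split evenly: alpha/2 in each tail
        upper = std_normal.inv_cdf(1 - alpha / 2)
        return (-upper, upper)
    if tail == "left":
        # the full alpha sits in the lower tail
        return (std_normal.inv_cdf(alpha),)
    if tail == "right":
        # the full alpha sits in the upper tail
        return (std_normal.inv_cdf(1 - alpha),)
    raise ValueError("tail must be 'two', 'left', or 'right'")

for tail in ("two", "left", "right"):
    print(tail, [round(c, 3) for c in critical_values(0.05, tail)])
```

At alpha = 0.05 the two-tailed cut-offs sit at about ±1.96, while a one-tailed test puts its single cut-off at about 1.645 in the chosen direction.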
The Five-Step Method for Hypothesis Testing
The five-step structure below works for every standard parametric test: z-tests, t-tests, chi-square tests, and F-tests all follow the same logical chain. Master the chain with z and t, and the other tests slot in at Step 3.
Step 1: State the Hypotheses
Write both hypotheses in terms of the population parameter, not the sample. For a test about a population mean, use mu. For a test about a proportion, use p. Always include an equals sign in H0 because the test statistic is computed under the assumption that H0 is exactly true.
Write H0 and H1 before you look at any data. If you choose a one-tailed direction after seeing that your sample mean went a particular way, your significance level is no longer what alpha claims. The test is only valid when the direction is set by the research question, not by the data.
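A short simulation can make the cost of choosing the direction afterwards concrete. This is a sketch under a simplifying assumption: when H0 is true the z statistic follows a standard normal distribution, so the code draws it directly instead of simulating raw data. The seed and the number of trials are arbitrary.

```python
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable
ALPHA = 0.05
TRIALS = 100_000
cutoff = NormalDist().inv_cdf(1 - ALPHA)  # one-tailed critical value

prespecified = 0  # direction fixed in advance (right-tailed)
data_driven = 0   # direction picked to match the sign of the result

for _ in range(TRIALS):
    z = random.gauss(0, 1)  # a test statistic generated with H0 true
    if z > cutoff:
        prespecified += 1
    if abs(z) > cutoff:  # same as testing whichever tail z landed in
        data_driven += 1

rate_prespecified = prespecified / TRIALS
rate_data_driven = data_driven / TRIALS
print(f"direction fixed in advance: {rate_prespecified:.3f}")
print(f"direction chosen from data: {rate_data_driven:.3f}")
```

With the direction fixed in advance the false-rejection rate should land near 0.05. Picking the tail after seeing the sign of the result should roughly double it, because both tails are then in play at the full alpha.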
Step 2: Choose the Significance Level
Alpha, the significance level, is the probability of a Type I error you are willing to accept. The standard choice across most university statistics courses is 0.05. Choosing 0.05 means that in a long series of experiments where H0 is true, you would incorrectly reject it 5% of the time. Fields with high consequences for false positives, such as clinical trials and drug approval, often require 0.01 or 0.001.
Raising alpha to 0.06 after you calculate a p-value of 0.055 is a form of p-hacking. It pushes your actual Type I error rate above your stated alpha and invalidates the test. Regulators, journals, and instructors treat post-hoc alpha adjustment as a methodological error. Set alpha once, before computing anything.
Step 3: Calculate the Test Statistic
The test statistic converts your sample result into a standardised number that measures how many standard errors the sample mean sits away from the null-hypothesis mean. Two formulas cover most introductory statistics tests.
For a z-test (population standard deviation sigma is known):
z = (x-bar − mu0) ÷ (sigma ÷ √n)
For a one-sample t-test (population standard deviation unknown, estimated by the sample standard deviation s):
t = (x-bar − mu0) ÷ (s ÷ √n)
The denominator in both formulas is the standard error, the standard deviation of the sampling distribution of the mean. A larger sample shrinks the standard error, making it easier to detect a real difference.
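The two formulas share the same arithmetic, which the sketch below captures in one helper. The function names and the numbers (a 1-unit gap between x-bar and mu0, with a standard deviation of 5) are hypothetical and chosen only to show how the standard error reacts to n.

```python
from math import sqrt

def standard_error(sd, n):
    """Standard deviation of the sampling distribution of the mean."""
    return sd / sqrt(n)

def standardised_statistic(sample_mean, null_mean, sd, n):
    """(x-bar - mu0) / (sd / sqrt(n)).

    Pass sigma for a z statistic or the sample s for a t statistic;
    the arithmetic is identical, only the reference distribution differs.
    """
    return (sample_mean - null_mean) / standard_error(sd, n)

# Hypothetical inputs: x-bar = 51, mu0 = 50, sd = 5, with growing n.
for n in (25, 100, 400):
    se = standard_error(5, n)
    stat = standardised_statistic(51, 50, 5, n)
    print(f"n={n:>3}  SE={se:.2f}  statistic={stat:.2f}")
```

Quadrupling the sample size halves the standard error, so the same 1-unit gap produces a statistic of 1.00, 2.00, and 4.00 as n goes from 25 to 100 to 400.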
Step 4: Find the p-Value
Once you have the test statistic, locate the corresponding probability in the appropriate distribution. For a z-test, use the standard normal distribution. For a t-test, use the t-distribution with n − 1 degrees of freedom. Most statistics tables give you the area in the tail beyond your test statistic.
For a two-tailed test, double the single-tail probability. A z-statistic of 2.1 places 0.018 in the upper tail. For a two-tailed test, p equals 0.036.
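For a z-test, the table lookup can be checked in a few lines. The sketch below uses `statistics.NormalDist` from the standard library; `z_p_value` and its tail labels are names invented for this illustration.

```python
from statistics import NormalDist

def z_p_value(z, tail):
    """p-value for a z statistic; tail is "left", "right", or "two"."""
    std_normal = NormalDist()
    if tail == "left":
        return std_normal.cdf(z)        # area below z
    if tail == "right":
        return 1 - std_normal.cdf(z)    # area above z
    if tail == "two":
        # double the area beyond |z| in a single tail
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right', or 'two'")

print(round(z_p_value(2.1, "right"), 3))  # 0.018
print(round(z_p_value(2.1, "two"), 3))    # 0.036
```

The two printed values match the z = 2.1 figures quoted above.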
Step 5: Make the Decision and State the Conclusion
Compare p to alpha. If p ≤ alpha, reject H0. If p > alpha, fail to reject H0. Note the language: you never “accept” H0. Failing to reject means the data do not provide sufficient evidence against it, not that the null is confirmed true.
Always state your conclusion in the language of the original problem. “Reject H0” alone earns no credit on most university assessments. The conclusion should read: “At the 5% significance level, there is sufficient evidence to conclude that the population mean recovery time with the drug is less than 14 days.”
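The decision rule itself is a single comparison. The helper below is a hypothetical sketch; note that it has no "accept H0" outcome, which mirrors the wording rule above.

```python
def decide(p_value, alpha):
    """Apply the Step 5 rule: reject H0 only when p <= alpha."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be between 0 and 1")
    return "reject H0" if p_value <= alpha else "fail to reject H0"

print(decide(0.036, 0.05))  # reject H0
print(decide(0.036, 0.01))  # fail to reject H0
```

The returned string is only the statistical verdict. The written conclusion still has to be phrased in the context of the problem.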
Worked Example 1: One-Sample z-Test
A bottling plant fills cans with a target mean of 330 ml. The population standard deviation of fill volumes is known to be 4.2 ml from years of manufacturing data. A quality inspector draws a random sample of 36 cans and finds a sample mean of 328.3 ml. At the 5% significance level, is there evidence that the machine is underfilling?
Setting Up the Test
Step 1 (State the hypotheses): H0: mu = 330 ml; H1: mu < 330 ml. This is a one-tailed (left-tailed) test because the inspector is only concerned about underfilling.
Step 2 (Choose alpha): alpha = 0.05. The critical z-value for a one-tailed left test at 0.05 is −1.645.
Calculating the z-Score and p-Value
Step 3 (Test statistic):
Standard error = sigma ÷ √n = 4.2 ÷ √36 = 4.2 ÷ 6 = 0.70
z = (x-bar − mu0) ÷ SE = (328.3 − 330) ÷ 0.70 = −1.7 ÷ 0.70 = −2.43
Step 4 (p-value): The area to the left of z = −2.43 in a standard normal table is approximately 0.0075.
Step 5 (Decision): p = 0.0075 < alpha = 0.05, so reject H0. The test statistic −2.43 also lies beyond the critical value −1.645.
Conclusion: At the 5% significance level, there is sufficient evidence to conclude that the true mean fill volume is less than 330 ml. The machine appears to be underfilling.
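The bottling example can be re-run in code as a check on the hand calculation. This sketch uses only the figures given in the problem and the standard library's `statistics.NormalDist`; the variable names are arbitrary.

```python
from math import sqrt
from statistics import NormalDist

# Figures from the bottling example
mu0, sigma, n, x_bar, alpha = 330.0, 4.2, 36, 328.3, 0.05

se = sigma / sqrt(n)                  # standard error of the mean
z = (x_bar - mu0) / se                # test statistic
p_value = NormalDist().cdf(z)         # left-tailed: area below z
z_crit = NormalDist().inv_cdf(alpha)  # left-tail critical value

print(f"SE = {se:.2f}")
print(f"z = {z:.2f}")
print(f"p = {p_value:.4f}")
print(f"critical z = {z_crit:.3f}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

Because the code keeps z unrounded, its p-value can differ from the table lookup in the fourth decimal place. The decision is unaffected.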
Worked Example 2: One-Sample t-Test
A university administrator claims that the average study time per week for full-time students in a particular program is 25 hours. A researcher suspects the true average is higher. The researcher surveys a random sample of 16 students and finds a sample mean of 27.4 hours and a sample standard deviation of 5.1 hours. Test the administrator's claim at the 1% significance level.
Setting Up the t-Test
Step 1 (Hypotheses): H0: mu = 25 hours; H1: mu > 25 hours. One-tailed right test because the researcher predicts the mean is higher.
Step 2 (Alpha): alpha = 0.01. Degrees of freedom = n − 1 = 16 − 1 = 15. The critical t-value for a one-tailed right test at alpha = 0.01 with 15 df is approximately 2.602.
Calculating t and Comparing to the Critical Value
Step 3 (Test statistic):
Standard error = s ÷ √n = 5.1 ÷ √16 = 5.1 ÷ 4 = 1.275
t = (x-bar − mu0) ÷ SE = (27.4 − 25) ÷ 1.275 = 2.4 ÷ 1.275 = 1.882
Step 4 (p-value): For t = 1.882 with 15 degrees of freedom, the upper-tail probability falls between 0.025 and 0.05 (using t-tables, approximately 0.039).
Step 5 (Decision): p ≈ 0.039 > alpha = 0.01, so fail to reject H0. The test statistic 1.882 also lies below the critical value 2.602.
Conclusion: At the 1% significance level, there is insufficient evidence to conclude that the true mean weekly study time exceeds 25 hours. The administrator's claim is consistent with the data at this level.
If the researcher had chosen alpha = 0.05 instead of 0.01, the p-value of 0.039 would fall below alpha and H0 would be rejected. This example shows precisely why alpha must be set before data collection. Choosing alpha after you see the p-value lets you reverse any decision by picking a conveniently larger or smaller significance threshold.
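The study-time example, including the effect of the alpha choice, can be reproduced in code. Python's standard library has no t-distribution, so this sketch approximates the upper-tail area by integrating the t density with Simpson's rule. That numerical shortcut is an assumption of this illustration, standing in for a t-table or a statistics package, and the helper names are hypothetical.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on the interval [0, t]."""
    if t < 0:
        raise ValueError("this sketch only handles t >= 0")
    if t == 0:
        return 0.5
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t  # the density is symmetric about 0

# Figures from the study-time example
mu0, s, n, x_bar = 25.0, 5.1, 16, 27.4
df = n - 1
t_stat = (x_bar - mu0) / (s / sqrt(n))
p_value = t_upper_tail(t_stat, df)

print(f"t = {t_stat:.3f}, df = {df}, p = {p_value:.3f}")
for alpha in (0.01, 0.05):
    verdict = "reject H0" if p_value <= alpha else "fail to reject H0"
    print(f"alpha = {alpha}: {verdict}")
```

The same data and the same p-value give opposite verdicts at the two thresholds, which is why the threshold has to be fixed before the data are seen.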
The subject calculators hub includes statistical tools that can check your test-statistic and p-value calculations once you have worked through the steps by hand.
The Most Common Mistake: Misreading the p-Value
The p-value trips up more students than any step in the calculation. Here is the correct reading: the p-value is the probability of observing a test statistic at least as extreme as yours, assuming H0 is true. It is not the probability that H0 is true. It is not the probability that H1 is true. These statements sound close to correct, but they describe fundamentally different things.
A p-value of 0.04 means: if the null hypothesis were true and you ran this experiment thousands of times, about 4% of those experiments would produce a test statistic as extreme as or more extreme than yours, purely by random sampling variation. Nothing more. The p-value gives no information about the probability of any hypothesis being true. That inference requires prior probabilities, which classical hypothesis testing does not incorporate.
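The "thousands of experiments" reading can be simulated directly. In the sketch below the population (mean 100, standard deviation 15, samples of 25) is entirely hypothetical, H0 is true by construction, and the test is a two-tailed z-test. The seed and the number of experiments are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

random.seed(7)  # fixed seed so the run is repeatable
MU0, SIGMA, N = 100.0, 15.0, 25  # hypothetical population with H0 true
EXPERIMENTS = 20_000

# The |z| value whose two-tailed p-value is 0.04
z_observed = NormalDist().inv_cdf(1 - 0.04 / 2)

se = SIGMA / sqrt(N)
as_extreme = 0
for _ in range(EXPERIMENTS):
    sample = [random.gauss(MU0, SIGMA) for _ in range(N)]
    z = (fmean(sample) - MU0) / se
    if abs(z) >= z_observed:
        as_extreme += 1

fraction = as_extreme / EXPERIMENTS
print(f"observed |z| = {z_observed:.3f}")
print(f"fraction at least as extreme under H0: {fraction:.3f}")
```

The fraction should come out close to 0.04. The simulation describes how the data behave when H0 is true; it does not give the probability that H0 is true.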
Type I and Type II Errors
Every hypothesis test carries two error risks. A Type I error (false positive) rejects a true null hypothesis. Its probability equals alpha. A Type II error (false negative) fails to reject a false null hypothesis. Its probability is called beta, and power (1 − beta) measures the test's ability to detect a real effect.
The trade-off between error types is a constraint, not a failure. Choosing a stricter alpha (0.01 instead of 0.05) lowers the risk of false positives but raises the risk of missing real effects. Larger samples increase power, which makes it possible to lower both error rates at once. The quantitative revision guide covers how to build comfort with this kind of numerical reasoning before your statistics assessments.
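The trade-off can be put into numbers for a left-tailed z-test. The sketch below reuses mu0 = 330 and sigma = 4.2 from the bottling example and assumes, purely for illustration, that the true mean is 329 ml. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_left_tailed_z(mu0, mu_true, sigma, n, alpha):
    """Power of a left-tailed z-test when the real mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(alpha)           # reject H0 when z < z_crit
    shift = (mu_true - mu0) / (sigma / sqrt(n))  # mean of z when mu = mu_true
    # z is normal with mean `shift` and sd 1, so P(z < z_crit) is:
    return std_normal.cdf(z_crit - shift)

# mu_true = 329 is an assumed value, not a figure from the example.
for n in (36, 144):
    for alpha in (0.05, 0.01):
        power = power_left_tailed_z(330, 329, 4.2, n, alpha)
        print(f"n={n:>3} alpha={alpha:<4} power={power:.2f} beta={1 - power:.2f}")
```

At a fixed n, moving from alpha = 0.05 to 0.01 lowers power and so raises beta. Moving from n = 36 to n = 144 raises power at either alpha.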
Key Takeaways
- Hypothesis testing follows five steps in order: state H0 and H1, choose alpha, calculate the test statistic, find the p-value, and state the decision and conclusion in the language of the original problem.
- The null hypothesis always contains an equality. H0: mu = mu0. The alternative states the direction of the claim. Always set the direction before seeing the data.
- Use a z-test when the population standard deviation is known. Use a t-test when it is not, with degrees of freedom equal to n minus 1 for a one-sample test.
- The p-value is the probability of observing your test statistic or something more extreme, given that H0 is true. It is not the probability that H0 or H1 is correct.
- Set alpha before you calculate anything. Choosing or adjusting alpha after you see the p-value invalidates the test and inflates the real Type I error rate above the stated alpha.
- A Type I error rejects a true null, with probability equal to alpha. A Type II error fails to reject a false null, with probability called beta. A larger sample makes it possible to reduce both at once; lowering alpha alone merely trades one for the other.
- Always state conclusions in the context of the original problem with explicit reference to the significance level, not just “reject H0.”
OpenStax Introductory Statistics covers hypothesis testing in Chapters 9 and 10 with additional worked examples and free access. MIT OpenCourseWare's Statistics for Applications (18.650) provides lecture notes and problem sets at a higher level. Penn State's STAT 415 course notes include step-by-step hypothesis testing examples with t-tables.


