What does a p-value of 0.03 actually mean?

A p-value of 0.03 means that if the null hypothesis were true, there is a 3% probability of observing a test statistic at least as extreme as the one you calculated. It does not mean there is a 97% chance the alternative hypothesis is correct. Whether 0.03 is enough to reject the null depends on the significance level you chose before you ran the test.

When should I use a z-test versus a t-test?

Use a z-test when you know the population standard deviation and either the population is normally distributed or your sample size is large (roughly 30 or more). Use a t-test when the population standard deviation is unknown, which describes most real-world situations. In practice, the t-test with the correct degrees of freedom converges on z-test results as sample size grows, so many statisticians use t-tests by default unless the population variance is genuinely known.

What is the difference between a one-tailed and two-tailed test?

A one-tailed test concentrates rejection in one direction of the distribution, suitable when you predict in advance that the effect can only go one way. A two-tailed test splits rejection across both tails, appropriate when any deviation in either direction would be meaningful. In practice, two-tailed tests demand more evidence for rejection and are the safer default unless your hypothesis is genuinely directional.

What significance level should I choose?

The standard choice in social and natural sciences is 0.05, meaning you accept a 5% risk of incorrectly rejecting a true null hypothesis. Medical and pharmaceutical research often uses 0.01 or 0.001 because the cost of a false positive is higher. Set your significance level before collecting data and before calculating any test statistic. Changing it after you see the p-value invalidates the test.

What does it mean to reject the null hypothesis?

Rejecting the null hypothesis means the data are sufficiently inconsistent with the null that, at your chosen significance level, you conclude the observed result is unlikely to be due to chance alone. It does not prove the alternative hypothesis is true. Statistical significance refers only to the probability of data at least as extreme as yours given the null, not to the probability of any hypothesis being correct.

How do degrees of freedom affect a t-test?

Degrees of freedom equal the sample size minus one for a one-sample t-test. A higher degrees-of-freedom value produces a t-distribution closer to the normal distribution, which means the critical value you need to exceed becomes lower. Smaller samples require a larger absolute t-statistic to achieve the same significance level, because the t-distribution has heavier tails at low degrees of freedom.

Can hypothesis testing prove that a drug or treatment works?

Hypothesis testing can provide evidence against the null hypothesis that a treatment has no effect, but it cannot prove the treatment works in an absolute sense. A statistically significant result shows the observed effect is unlikely under the null hypothesis. Practical significance, meaning whether the effect size matters clinically or economically, is a separate question that requires effect-size measures like Cohen's d alongside the p-value.

What is a Type I error in hypothesis testing?

A Type I error, also called a false positive, occurs when you reject the null hypothesis even though it is actually true. The probability of a Type I error equals your significance level alpha. Choosing alpha of 0.05 means you accept a 5% chance of incorrectly rejecting a true null in repeated experiments. You can reduce Type I errors by lowering alpha, but that increases the risk of Type II errors, where a real effect goes undetected.

How to Solve Hypothesis Testing: Step-by-Step

By Jonas | 29 June 2026 | 12 min read

Key Takeaways

Hypothesis testing follows five steps: state hypotheses, choose alpha, calculate the test statistic, find the p-value, then decide and conclude in context.

The p-value is the probability of observing data at least as extreme as yours if the null hypothesis were true. It is not the probability the null is false.

Use a z-test when the population standard deviation is known; use a t-test when it is not, with degrees of freedom equal to n minus 1.

The most common mistake is deciding significance after seeing the data. Set alpha before you calculate anything.

Rejecting the null does not prove the alternative. Statistical significance and practical significance are two separate questions.

Most students who struggle with hypothesis testing get the arithmetic right and the logic wrong. The five-step procedure covered in this post solves both: it shows every calculation and explains why each step exists, so you can carry it into any exam scenario, not just the one you practiced on.

Two fully worked examples follow the method section, one using a z-test and one using a t-test. Each shows every number from setup to conclusion. The section on the most common mistake covers the p-value misreading that costs marks even when the calculation is correct.

What Is Hypothesis Testing?

Hypothesis testing is a statistical procedure for deciding whether sample data provide enough evidence to reject a specific claim about a population. The procedure answers one question: could the observed result plausibly have occurred by random chance if the null hypothesis were true?

The method does not prove anything. It quantifies how surprised you should be by your data under a specific assumption. That surprise is measured by the p-value, which tells you how often random sampling alone would produce a result at least as extreme as yours if the null were correct.

The Null Hypothesis and the Alternative Hypothesis

Every hypothesis test starts with two competing statements. The null hypothesis (H0) is the default claim of no effect, no difference, or no relationship. The alternative hypothesis (H1 or Ha) is the specific claim you want to test, usually that something has changed, differs, or exceeds a threshold. The null is what you assume true until the data argue strongly otherwise.

A typical example: a pharmaceutical company claims a new drug reduces average recovery time below the population mean of 14 days. The null hypothesis states that the mean recovery time with the drug equals 14 days. The alternative states that the mean is less than 14 days. The test asks whether the sample data from the drug trial are consistent with H0 or whether they shift far enough toward H1 to justify rejecting the null.

One-Tailed Versus Two-Tailed Tests

A two-tailed test checks whether the population parameter differs from the null value in either direction. A one-tailed test checks whether it differs specifically upward or specifically downward. The choice must follow from the research question, not from inspection of the data.

Test type	H1 format	Rejection region	When to use
Two-tailed	mu does not equal mu0	Both tails (alpha/2 each)	Any difference in either direction is interesting
Left-tailed (lower)	mu is less than mu0	Left tail only (full alpha)	You predict the parameter decreased
Right-tailed (upper)	mu is greater than mu0	Right tail only (full alpha)	You predict the parameter increased

Set the test direction before collecting data. Changing from two-tailed to one-tailed after seeing the results inflates the Type I error rate.
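To make the rejection regions in the table concrete, here is a minimal Python sketch that looks up critical values for a test based on the standard normal distribution. The function name `z_critical` and the tail labels are illustrative choices, not part of any standard API; only `statistics.NormalDist` comes from the standard library.

```python
from statistics import NormalDist

def z_critical(alpha, tail):
    """Return the critical z-value(s) for a given alpha and tail type.

    `tail` is an illustrative label: "two", "left", or "right".
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        # alpha is split evenly, so each tail holds alpha / 2
        upper = std_normal.inv_cdf(1 - alpha / 2)
        return (-upper, upper)
    if tail == "left":
        # the full alpha sits in the left tail
        return std_normal.inv_cdf(alpha)
    if tail == "right":
        # the full alpha sits in the right tail
        return std_normal.inv_cdf(1 - alpha)
    raise ValueError("tail must be 'two', 'left', or 'right'")

two_sided = z_critical(0.05, "two")
left = z_critical(0.05, "left")
right = z_critical(0.05, "right")
print("two-tailed:", two_sided)
print("left-tailed:", left)
print("right-tailed:", right)
```

At alpha = 0.05 the two-tailed cut-offs sit near ±1.96, while a one-tailed cut-off sits near 1.645 in absolute value. That gap is why switching to a one-tailed test after seeing the data makes rejection easier than your stated alpha allows.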

The Five-Step Method for Hypothesis Testing

The five-step structure below works for every standard parametric test: z-tests, t-tests, chi-square tests, and F-tests all follow the same logical chain. Master the chain with z and t, and the other tests slot in at Step 3.

Steps 1 and 2 happen before you touch the data. Steps 3 and 4 use the data. Step 5 connects the arithmetic to the real-world question.

Step 1: State the Hypotheses

Write both hypotheses in terms of the population parameter, not the sample. For a test about a population mean, use mu. For a test about a proportion, use p. Always include an equals sign in H0 because the test statistic is computed under the assumption that H0 is exactly true.

Write H0 and H1 before you look at any data. If you choose a one-tailed direction after seeing that your sample mean went a particular way, your significance level is no longer what alpha claims. The test is only valid when the direction is set by the research question, not by the data.

Step 2: Choose the Significance Level

Alpha, the significance level, is the probability of a Type I error you are willing to accept. The standard choice across most university statistics courses is 0.05. Choosing 0.05 means that in a long series of experiments where H0 is true, you would incorrectly reject it 5% of the time. Fields with high consequences for false positives, such as clinical trials and drug approval, often require 0.01 or 0.001.
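The "long series of experiments" reading of alpha can be checked with a small simulation. The sketch below repeatedly samples from a population where H0 really is true and counts how often a two-tailed z-test rejects anyway. The population values (mean 100, standard deviation 15), the sample size, and the function name are made-up illustrations, not figures from this guide.

```python
import random
from statistics import NormalDist

def estimate_type_i_rate(alpha=0.05, n=30, trials=20_000, seed=1):
    """Estimate how often a two-tailed z-test rejects a true H0."""
    rng = random.Random(seed)
    mu0, sigma = 100.0, 15.0  # illustrative population, H0 is true by construction
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sigma / n ** 0.5
    rejections = 0
    for _ in range(trials):
        # Draw one sample from the population described by H0
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (sum(sample) / n - mu0) / standard_error
        if abs(z) >= critical:
            rejections += 1  # a false positive: H0 is true but was rejected
    return rejections / trials

rate = estimate_type_i_rate()
print(f"Estimated Type I error rate: {rate:.3f}")
```

The estimate should land close to 0.05, with some simulation noise. Rerunning with `alpha=0.01` should pull it down to roughly 0.01.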

Never Choose Alpha After Seeing the p-Value

Setting alpha to 0.06 after you calculate a p-value of 0.055 is a form of p-hacking. It pushes your actual Type I error rate above your stated alpha and invalidates the test result. Regulators, journals, and instructors treat post-hoc alpha adjustment as a methodological error. Set alpha once, before computing anything.

Step 3: Calculate the Test Statistic

The test statistic converts your sample result into a standardised number that measures how many standard errors the sample mean sits away from the null-hypothesis mean. Two formulas cover most introductory statistics tests.

For a z-test (population standard deviation sigma is known):
z = (x-bar − mu0) ÷ (sigma ÷ √n)

For a one-sample t-test (population standard deviation unknown, estimated by the sample standard deviation s):
t = (x-bar − mu0) ÷ (s ÷ √n)

The denominator in both formulas is the standard error, the standard deviation of the sampling distribution of the mean. A larger sample shrinks the standard error, making it easier to detect a real difference.

Step 4: Find the p-Value

Once you have the test statistic, locate the corresponding probability in the appropriate distribution. For a z-test, use the standard normal distribution. For a t-test, use the t-distribution with n − 1 degrees of freedom. Most statistics tables give you the area in the tail beyond your test statistic.

For a two-tailed test, double the single-tail probability. A z-statistic of 2.1 places 0.018 in the upper tail. For a two-tailed test, p equals 0.036.
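If you would rather compute the tail area than read it from a table, the standard library's `statistics.NormalDist` can do it for a z-statistic. The following is a sketch; the function name `z_p_value` and its `tail` labels are illustrative assumptions.

```python
from statistics import NormalDist

def z_p_value(z, tail="two"):
    """p-value for a z-statistic under the standard normal distribution."""
    std_normal = NormalDist()
    if tail == "left":
        return std_normal.cdf(z)            # area to the left of z
    if tail == "right":
        return 1 - std_normal.cdf(z)        # area to the right of z
    if tail == "two":
        # double the single-tail area beyond |z|
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right', or 'two'")

upper = z_p_value(2.1, tail="right")
both = z_p_value(2.1, tail="two")
print(f"upper tail: {upper:.3f}, two-tailed: {both:.3f}")
```

For z = 2.1 this reproduces the figures above: about 0.018 in the upper tail and about 0.036 for the two-tailed test.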

Step 5: Make the Decision and State the Conclusion

Compare p to alpha. If p ≤ alpha, reject H0. If p > alpha, fail to reject H0. Note the language: you never “accept” H0. Failing to reject means the data do not provide sufficient evidence against it, not that the null is confirmed true.

Always state your conclusion in the language of the original problem. “Reject H0” alone earns no credit on most university assessments. The conclusion should read: “At the 5% significance level, there is sufficient evidence to conclude that the population mean recovery time with the drug is less than 14 days.”
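The decision rule and the in-context wording can be sketched as two tiny helpers. The function names and the sentence template are illustrative assumptions modelled on the example conclusion above; your course may expect slightly different phrasing.

```python
def decide(p_value, alpha):
    """Apply the decision rule: reject H0 only when p <= alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

def conclusion(p_value, alpha, claim):
    """Phrase the result in the language of the original problem."""
    strength = "sufficient" if p_value <= alpha else "insufficient"
    return (
        f"At the {alpha:.0%} significance level, there is {strength} "
        f"evidence to conclude that {claim}."
    )

print(decide(0.03, 0.05))
print(conclusion(0.03, 0.05, "the population mean recovery time with the drug is less than 14 days"))
```

Note that neither helper ever returns "accept H0". The only two outcomes are rejecting the null or failing to reject it.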

Worked Example 1: One-Sample z-Test

A bottling plant fills cans with a target mean of 330 ml. The population standard deviation of fill volumes is known to be 4.2 ml from years of manufacturing data. A quality inspector draws a random sample of 36 cans and finds a sample mean of 328.3 ml. At the 5% significance level, is there evidence that the machine is underfilling?

Setting Up the Test

Step 1 (State the hypotheses): H0: mu = 330 ml; H1: mu < 330 ml. This is a one-tailed (left-tailed) test because the inspector is only concerned about underfilling.

Step 2 (Choose alpha): alpha = 0.05. The critical z-value for a one-tailed left test at 0.05 is −1.645.

Calculating the z-Score and p-Value

Step 3 (Test statistic):

Standard error = sigma ÷ √n = 4.2 ÷ √36 = 4.2 ÷ 6 = 0.70

z = (x-bar − mu0) ÷ SE = (328.3 − 330) ÷ 0.70 = −1.7 ÷ 0.70 = −2.43

Step 4 (p-value): The area to the left of z = −2.43 in a standard normal table is approximately 0.0075.

Step 5 (Decision): p = 0.0075 < alpha = 0.05, so reject H0. The test statistic −2.43 also lies beyond the critical value −1.645.

Conclusion: At the 5% significance level, there is sufficient evidence to conclude that the true mean fill volume is less than 330 ml. The machine appears to be underfilling.
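Once you have worked the example by hand, you can check it with a few lines of Python. This sketch uses only the standard library and the numbers given in the problem; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Figures from the bottling problem
mu0 = 330.0          # target mean under H0 (ml)
sigma = 4.2          # known population standard deviation (ml)
n = 36               # sample size
sample_mean = 328.3  # observed sample mean (ml)
alpha = 0.05

std_normal = NormalDist()
se = sigma / math.sqrt(n)              # standard error
z = (sample_mean - mu0) / se           # test statistic
p_value = std_normal.cdf(z)            # left-tailed test: area below z
z_crit = std_normal.inv_cdf(alpha)     # left-tail critical value
reject = p_value <= alpha

print(f"SE = {se:.2f}, z = {z:.2f}, critical z = {z_crit:.3f}")
print(f"p = {p_value:.4f}, reject H0: {reject}")
```

The code carries z at full precision rather than rounding to −2.43 first, so the p-value it prints can differ from the table value in the fourth decimal place. The decision is unaffected.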

The test statistic z = −2.43 falls beyond the critical value of −1.645, placing it inside the shaded rejection region. The corresponding p-value of 0.0075 sits below alpha = 0.05.

Worked Example 2: One-Sample t-Test

A university administrator claims that the average study time per week for full-time students in a particular program is 25 hours. A researcher suspects the true average is higher. The researcher surveys a random sample of 16 students and finds a sample mean of 27.4 hours and a sample standard deviation of 5.1 hours. Test the administrator's claim at the 1% significance level.

Setting Up the t-Test

Step 1 (Hypotheses): H0: mu = 25 hours; H1: mu > 25 hours. One-tailed right test because the researcher predicts the mean is higher.

Step 2 (Alpha): alpha = 0.01. Degrees of freedom = n − 1 = 16 − 1 = 15. The critical t-value for a one-tailed right test at alpha = 0.01 with 15 df is approximately 2.602.

Calculating t and Comparing to the Critical Value

Step 3 (Test statistic):

Standard error = s ÷ √n = 5.1 ÷ √16 = 5.1 ÷ 4 = 1.275

t = (x-bar − mu0) ÷ SE = (27.4 − 25) ÷ 1.275 = 2.4 ÷ 1.275 = 1.882

Step 4 (p-value): For t = 1.882 with 15 degrees of freedom, the upper-tail probability falls between 0.025 and 0.05 (using t-tables, approximately 0.039).

Step 5 (Decision): p ≈ 0.039 > alpha = 0.01, so fail to reject H0. The test statistic 1.882 also lies below the critical value 2.602.

Conclusion: At the 1% significance level, there is insufficient evidence to conclude that the true mean weekly study time exceeds 25 hours. The administrator's claim is consistent with the data at this level.
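To check the t-test by computer, you need the upper-tail area of the t-distribution, which Python's standard library does not provide directly. The sketch below approximates it by integrating the t density with Simpson's rule; this is a swapped-in numerical shortcut for illustration, and the helper names `t_pdf` and `t_upper_tail` are assumptions. In practice a statistics library such as SciPy would normally do this step for you.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t]."""
    if t < 0:
        raise ValueError("this sketch only handles t >= 0")
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t  # half the distribution lies above zero

# Figures from the study-time problem
mu0, sample_mean, s, n, alpha = 25.0, 27.4, 5.1, 16, 0.01
df = n - 1
se = s / math.sqrt(n)
t_stat = (sample_mean - mu0) / se
p_value = t_upper_tail(t_stat, df)   # right-tailed test
reject = p_value <= alpha

print(f"SE = {se:.3f}, t = {t_stat:.3f}, df = {df}")
print(f"p = {p_value:.3f}, reject H0: {reject}")
```

The computed p-value falls inside the 0.025 to 0.05 bracket from the t-table, a little under 0.04, so the decision at alpha = 0.01 is the same as in the hand calculation.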

Same Data, Different Decision at Different Alpha

If the researcher had chosen alpha = 0.05 instead of 0.01, the p-value of 0.039 would fall below alpha and H0 would be rejected. This example shows precisely why alpha must be set before data collection. Choosing alpha after you see the p-value lets you reverse any decision by picking a conveniently larger or smaller significance threshold.

The test statistic of 1.882 does not reach the critical value of 2.602 at alpha = 0.01 with 15 df. Had alpha been 0.05 (critical value approximately 1.753), the same test statistic would have crossed into the rejection region.

The subject calculators hub includes statistical tools that can check your test-statistic and p-value calculations once you have worked through the steps by hand.

The Most Common Mistake: Misreading the p-Value

The p-value trips up more students than any step in the calculation. Here is the correct reading: the p-value is the probability of observing a test statistic at least as extreme as yours, assuming H0 is true. It is not the probability that H0 is true. It is not the probability that H1 is true. These statements sound close to correct, but they describe fundamentally different things.

A p-value of 0.04 means: if the null hypothesis were true and you ran this experiment thousands of times, 4% of those experiments would produce a test statistic as extreme or more extreme than yours, purely by random sampling variation. Nothing more. The p-value gives no information about the probability of any hypothesis being true. That inference requires prior probabilities, which classical hypothesis testing does not incorporate.
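The "ran this experiment thousands of times" reading can be tried out directly. The sketch below reuses the bottling figures from Worked Example 1 and simulates many samples from a world where H0 is true (mean 330 ml, standard deviation 4.2 ml, n = 36), then counts how often the sample mean comes out at 328.3 ml or lower. It additionally assumes fill volumes are normally distributed, and the function name is illustrative.

```python
import random

def share_as_extreme(observed_mean, mu0, sigma, n, trials=20_000, seed=7):
    """Fraction of simulated H0-true samples with a mean at or below the observed one."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One simulated experiment under H0 (normal fills are an assumption)
        sample_mean = sum(rng.gauss(mu0, sigma) for _ in range(n)) / n
        if sample_mean <= observed_mean:
            hits += 1
    return hits / trials

share = share_as_extreme(observed_mean=328.3, mu0=330.0, sigma=4.2, n=36)
print(f"Share of H0-true experiments at least as extreme: {share:.4f}")
```

The share should come out close to the p-value of roughly 0.0075 from the worked example, give or take simulation noise. Every simulated experiment here has H0 true by construction, so the number describes the data under H0 and nothing about how likely H0 itself is.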

Type I and Type II Errors

Every hypothesis test carries two error risks. A Type I error (false positive) rejects a true null hypothesis. Its probability equals alpha. A Type II error (false negative) fails to reject a false null hypothesis. Its probability is called beta, and power (1 − beta) measures the test's ability to detect a real effect.

Lowering alpha reduces Type I errors but increases Type II errors. Increasing the sample size lets you reduce both at once.

The trade-off between error types is a constraint, not a failure. Choosing a stricter alpha (0.01 instead of 0.05) lowers the risk of false positives but raises the risk of missing real effects. Larger samples increase power and let you lower both simultaneously. The quantitative revision guide covers how to build comfort with this kind of numerical reasoning before your statistics assessments.
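The trade-off can be put into numbers for a left-tailed z-test. The sketch below reuses the bottling setup (mu0 = 330 ml, sigma = 4.2 ml) and assumes, purely for illustration, that the true mean is 328.5 ml; that value, the sample size of 100, and the function name are hypothetical.

```python
import math
from statistics import NormalDist

def left_tailed_power(mu0, true_mean, sigma, n, alpha):
    """Probability of rejecting H0 in a left-tailed z-test when the mean is true_mean."""
    std_normal = NormalDist()
    se = sigma / math.sqrt(n)
    z_crit = std_normal.inv_cdf(alpha)   # reject when z <= z_crit
    shift = (true_mean - mu0) / se       # where the z-statistic is centred in reality
    return std_normal.cdf(z_crit - shift)

scenarios = [(0.05, 36), (0.01, 36), (0.01, 100)]
for alpha, n in scenarios:
    power = left_tailed_power(330.0, 328.5, 4.2, n, alpha)
    print(f"alpha = {alpha}, n = {n}: power = {power:.2f}, beta = {1 - power:.2f}")
```

With n fixed at 36, tightening alpha from 0.05 to 0.01 cuts power and raises beta. Moving to n = 100 while keeping alpha at 0.01 gives a lower alpha and a lower beta than the original alpha = 0.05, n = 36 design.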

For connected topics in your statistics course, the worked-example guide on limits follows the same step-by-step format if your program covers calculus alongside statistics. The exam time management guide covers how to allocate your minutes across multi-part statistics questions under timed conditions.

Key Takeaways

Hypothesis testing follows five steps in order: state H0 and H1, choose alpha, calculate the test statistic, find the p-value, and state the decision and conclusion in the language of the original problem.
The null hypothesis always contains an equality. H0: mu = mu0. The alternative states the direction of the claim. Always set the direction before seeing the data.
Use a z-test when the population standard deviation is known. Use a t-test when it is not, with degrees of freedom equal to n minus 1 for a one-sample test.
The p-value is the probability of observing your test statistic or something more extreme, given that H0 is true. It is not the probability that H0 or H1 is correct.
Set alpha before you calculate anything. Choosing or adjusting alpha after you see the p-value invalidates the test and inflates the real Type I error rate above the stated alpha.
A Type I error rejects a true null, with probability equal to alpha. A Type II error fails to reject a false null, with probability called beta. Increasing sample size lets you reduce both at once; lowering alpha alone merely trades one for the other.
Always state conclusions in the context of the original problem with explicit reference to the significance level, not just “reject H0.”

For further practice, the university resources hub links to subject-specific tools. The grade calculators hub can help you track where statistics sits in your overall module average so you know how much weight this topic carries. The matrix multiplication worked example follows the same detailed format if linear algebra appears alongside your statistics course.

OpenStax Introductory Statistics covers hypothesis testing in Chapters 9 and 10 with additional worked examples and free access. MIT OpenCourseWare's Statistics for Applications (18.650) provides lecture notes and problem sets at a higher level. Penn State's STAT 415 course notes include step-by-step hypothesis testing examples with t-tables.