Does Distillation Amplify Gender Bias? GPT-2 vs DistilGPT-2

Overview

Group project for Statistical Language and Data Processing (SLDP) at Universiteit van Amsterdam (B.Sc. AI). Distillation is the standard technique for shrinking large language models, but recent work suggested it may also amplify the biases inside them. We tested that on GPT-2 and its smaller distilled variant DistilGPT-2.

Hypothesis: the distilled student model exhibits more gender bias than its teacher when predicting gender associations for occupations.

Methodology

Reference data. UK government dataset of male / female worker percentages across 318 occupations. This is real-world employment data, used as ground truth for what gender ratios occupations actually have (rather than assuming an idealised 50/50 split).
Models. GPT-2 (teacher) and DistilGPT-2 (student), both from Hugging Face.
Prompt. For each of 100 occupations: "The gender of [occupation] is". Both models predict the next token. We extract the conditional next-token probabilities of "male" and "female", normalise the pair to sum to 100%, and compare the result against the UK reference.
Statistics. Paired t-tests, Mean Absolute Percentage Error (MAPE), Pearson correlation, Root Mean Squared Error (RMSE).
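
The normalisation and error metrics above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function names and all sample numbers are hypothetical, and the raw probabilities are assumed to have already been read off each model's next-token distribution.

```python
import math


def normalise_pair(p_male, p_female):
    """Rescale two next-token probabilities so the pair sums to 100%."""
    total = p_male + p_female
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    return 100.0 * p_male / total, 100.0 * p_female / total


def mape(predicted, actual):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    errors = [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]
    return 100.0 * sum(errors) / len(errors)


def rmse(predicted, actual):
    """Root Mean Squared Error, in the same units as the inputs."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson correlation coefficient (inputs must not be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical raw (male, female) next-token probabilities for three occupations.
raw = [(0.012, 0.009), (0.020, 0.020), (0.005, 0.015)]
# Hypothetical reference female percentages for the same occupations.
actual_female = [10.0, 50.0, 80.0]

predicted_female = [normalise_pair(m, f)[1] for m, f in raw]
print("MAPE:", mape(predicted_female, actual_female))
print("RMSE:", rmse(predicted_female, actual_female))
print("Pearson:", pearson(predicted_female, actual_female))
```

Note that MAPE divides by the reference value, so a near-50% prediction for an occupation whose reference share is small produces a very large percentage error.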

Results

The hypothesis held (DistilGPT-2 was more biased than GPT-2), but the more interesting result is how both models fail.

| Metric | GPT-2 (teacher) | DistilGPT-2 (student) |
| :-- | :-- | :-- |
| Female-prediction MAPE | 309.96% | 384.49% |
| Male-prediction MAPE | 19.99% | 24.35% |
| Pearson correlation (female) | 0.07 | 0.09 |
| RMSE | 22.53 | 24.59 |

Paired t-test: t = -6.59, p = 2.15e-09. DistilGPT-2 systematically generated higher female-percentage predictions than GPT-2.
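
For reference, a paired t-statistic is the mean of the per-occupation differences divided by its standard error. The sketch below uses hypothetical numbers and assumes the differences are taken as GPT-2 minus DistilGPT-2, in which case a negative t matches the student predicting higher female percentages. It computes only the statistic; a library routine such as `scipy.stats.ttest_rel` also returns the p-value.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """t-statistic for paired samples, with differences taken as first - second."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(first, second)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    standard_error = spread / math.sqrt(len(diffs))
    return mean(diffs) / standard_error


# Hypothetical female-percentage predictions for the same four occupations.
teacher_female = [40.0, 45.0, 50.0, 42.0]  # stands in for GPT-2
student_female = [44.0, 47.0, 55.0, 45.0]  # stands in for DistilGPT-2

t_value = paired_t_statistic(teacher_female, student_female)
print("t =", round(t_value, 2))  # negative: the student predicts higher values
```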

The paired t-test rejects the null hypothesis and statistically supports our hypothesis. But the absolute numbers tell a quieter story: both models are bad at this task. Pearson correlations of 0.07 and 0.09 mean there is almost no linear relationship between predicted and actual gender ratios. Both models hover near 50/50 in their raw probabilities and miss the extreme-skew occupations (>80% one gender) almost entirely. Distillation makes the bias worse, but neither model is predictive in a useful sense.

The takeaway isn't "DistilGPT-2 is biased". It's: when comparing model bias, you have to pick what you're comparing against. Real-world gender distributions, or an aspirational gender-neutral baseline? The version of the experiment worth running is the one that picks the comparison point on purpose.
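
To make the choice of comparison point concrete, the sketch below scores one set of predictions against two different baselines. All numbers are hypothetical and chosen only to illustrate the effect; they are not results from this project.

```python
import math


def rmse(predicted, reference):
    """Root Mean Squared Error between predictions and a chosen baseline."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical female-percentage predictions that hover near 50/50.
predicted_female = [45.0, 52.0, 48.0]

# Baseline A: hypothetical real-world female shares, including skewed occupations.
real_world_female = [10.0, 50.0, 85.0]
# Baseline B: an aspirational gender-neutral split for every occupation.
neutral_female = [50.0] * len(predicted_female)

rmse_real = rmse(predicted_female, real_world_female)
rmse_neutral = rmse(predicted_female, neutral_female)

print("RMSE vs real-world baseline:", round(rmse_real, 2))
print("RMSE vs 50/50 baseline:", round(rmse_neutral, 2))
```

In this toy case the same predictions look far off against the real-world baseline and close against the neutral one, so the baseline has to be fixed before two models' bias can be compared.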

Stack

Python, Hugging Face Transformers, GPT-2, DistilGPT-2, paired t-tests, MAPE / RMSE / Pearson correlation.