How Names, Gender, and Profile Photos Shape Algorithmic Candidate Scoring
A two-phase controlled audit of demographic and appearance bias in a production AI recruitment system, testing 250 synthetic candidates across 10 jobs, 5 ethnicities, and 5 attractiveness tiers.
100 AI-generated LinkedIn headshots across 5 ethnicities, 2 genders, and 5 photo-quality tiers — none of these people exist.
Tier 1 = low-quality webcam shot → Tier 5 = professional studio headshot. All faces are AI-generated — no real individuals.
Our audit reveals that while name and gender bias is minimal, photo quality creates a significant scoring gap.
A 1.09-point gender spread and a 1.32-point ethnicity spread: small numbers that, at scale across thousands of applicants, can systematically shift who gets hired.
An 8.6-point gap between low-quality and professional photos — dwarfing all other bias sources combined.
Removing names and pronouns backfires: neutral candidates score ~1 point lower on average than named ones.
Female candidates average 0.9 points higher than male candidates on Culture Fit; the most subjective criterion is also the most bias-prone.
150 candidates with identical qualifications; only names, pronouns, and cultural affiliations differ. A sketch of how such variants can be constructed follows the Phase 1 findings below.
Gender spread: 1.09 points (minimal bias)
Ethnicity spread: 1.32 points (no severe bias)
Culture Fit and Communication show the largest gender gaps
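To make the Phase 1 setup concrete, here is a minimal Python sketch of how same-qualification variants can be constructed. The base profile, the name list, and the field names are illustrative placeholders, not the audit's actual candidate pool.

```python
# Illustrative only: base profile, names, and labels are placeholders,
# not the audit's actual data.
BASE_PROFILE = {
    "experience": "5 years of backend engineering",
    "education": "BSc Computer Science",
    "skills": ["Python", "SQL", "Kubernetes"],
}

# One (ethnicity, gender) cell per name/pronoun pair; hypothetical examples.
NAMES = {
    ("south_asian", "female"): ("Priya Patel", "she/her"),
    ("south_asian", "male"): ("Raj Patel", "he/him"),
    ("white", "female"): ("Emily Walsh", "she/her"),
    ("white", "male"): ("Greg Walsh", "he/him"),
}

def make_variants(base: dict):
    """Yield candidates that differ only in name, pronouns, and group labels."""
    for (ethnicity, gender), (name, pronouns) in NAMES.items():
        yield {**base, "name": name, "pronouns": pronouns,
               "ethnicity": ethnicity, "gender": gender}
    # Name- and pronoun-free control, mirroring the audit's 'neutral' condition.
    yield {**base, "name": None, "pronouns": None,
           "ethnicity": "neutral", "gender": "neutral"}

candidates = list(make_variants(BASE_PROFILE))
```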
100 candidates with AI-generated LinkedIn photos. Same qualifications — only the photo changes.
An 8.6-point gap between Tier 1 and Tier 3 photos
Female candidates show a larger Tier 1 penalty
Candidates with low-quality photos (T1) scored 8.6 points lower than those with professional headshots (T3) — despite having identical qualifications. This is the single largest bias source in the entire audit, far exceeding the 1.09-point gender spread or 1.32-point ethnicity spread. Notably, the AI explicitly referenced photo quality in its reasoning for Communication scores.
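For readers who want to reproduce this comparison, the sketch below shows one way to compute the tier gap overall and per gender, assuming the per-candidate results have been collected into a pandas DataFrame with columns photo_tier, gender, and overall_score. The column names are assumptions, not the audit's actual schema.

```python
import pandas as pd

def tier_gap(df: pd.DataFrame, low: int, high: int) -> float:
    """Mean overall-score difference between two photo-quality tiers."""
    means = df.groupby("photo_tier")["overall_score"].mean()
    return float(means[high] - means[low])

def tier_gap_by_gender(df: pd.DataFrame, low: int, high: int) -> pd.Series:
    """The same gap, computed separately per gender to expose interactions."""
    means = df.groupby(["gender", "photo_tier"])["overall_score"].mean()
    return means.xs(high, level="photo_tier") - means.xs(low, level="photo_tier")

# Example: overall and per-gender gap between Tier 1 and Tier 3.
#   print(tier_gap(scores, low=1, high=3))
#   print(tier_gap_by_gender(scores, low=1, high=3))
```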
A controlled experimental audit using synthetic candidates with identical qualifications. A sketch of the scoring and comparison step follows the four steps below.
250 identical-qualification profiles generated programmatically
Gemini 3.1 Pro scores each candidate on 4 criteria (0-100)
100 AI-generated LinkedIn photos across 5 attractiveness tiers
Compare scores across gender, ethnicity, and appearance
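The sketch below outlines the scoring-and-comparison step in Python. The llm_call hook, the JSON response format, and two of the criterion names are assumptions standing in for the actual Gemini prompt (only Culture Fit and Communication are named in this report), and "spread" is read here as the highest group mean minus the lowest group mean, which is our interpretation of how the headline numbers are defined.

```python
import json
import pandas as pd

# Culture Fit and Communication are named in the audit; the other two entries
# are placeholders for the remaining, unnamed criteria.
CRITERIA = ["technical_fit", "experience", "communication", "culture_fit"]

def score_candidate(profile: dict, llm_call) -> dict:
    """Ask a text-in/text-out model for 0-100 scores on each criterion.
    `llm_call` is any callable wrapping the scoring model (the audit used
    Gemini; client wiring is omitted here)."""
    prompt = (
        "Score this candidate from 0 to 100 on each of: "
        + ", ".join(CRITERIA)
        + ". Reply with JSON only.\n"
        + json.dumps(profile)
    )
    return json.loads(llm_call(prompt))

def group_spread(df: pd.DataFrame, group_col: str) -> float:
    """Highest group-mean overall score minus the lowest."""
    means = df.groupby(group_col)["overall_score"].mean()
    return float(means.max() - means.min())

# Usage, once `rows` holds one score dict (plus group labels) per candidate:
#   df = pd.DataFrame(rows)
#   df["overall_score"] = df[CRITERIA].mean(axis=1)
#   print(group_spread(df, "gender"), group_spread(df, "ethnicity"))
```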
While overall bias is minimal, specific combinations of role and candidate demographics show concerning gaps.
Neutral candidate scored 90.1 vs female at 81.0 — the largest single gap in Phase 1.
Raj Patel scored 84.5 vs Priya Patel at 89.5 — a notable male penalty in tech.
Female T1 (76.7) vs T3 (89.5) — low-quality photos punish women more severely.
Our study fills gaps in the AI hiring bias literature — from Gemini testing to appearance-based discrimination.
Are Emily and Greg More Employable than Lakisha and Jamal?
Foundational name-based audit — we extend to AI systems with 5 ethnicities
The Silicon Ceiling: Auditing GPT's Race and Gender Biases in Hiring
Most comparable — we test Gemini in production, add appearance testing
Beauty and the Bias: Attractiveness Impact on Multimodal LLMs
Found attractiveness impacts 86% of LLM decisions — we confirm in hiring context
AI systems in recruitment classified as high-risk
Mandatory bias audits by Aug 2026 — our methodology provides a template
AI hiring tools are facing unprecedented regulatory scrutiny worldwide.
EU AI Act: AI recruitment systems classified as high-risk. Mandatory bias audits, transparency, and human oversight required by August 2026.
NYC Local Law 144: requires annual independent bias audits of automated employment decision tools, with public disclosure of impact ratios by race and sex. A sketch of the impact-ratio calculation appears below.
Title VII liability extends to AI hiring tools that produce disparate impact, even when the tool is built and operated by a third-party vendor.
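As a starting point for the kind of disclosure Local Law 144 asks for, here is a minimal sketch of one common impact-ratio formulation: each group's selection rate divided by the highest group's rate. The column names and the informal four-fifths threshold in the comment are illustrative assumptions, not legal guidance.

```python
import pandas as pd

def impact_ratios(df: pd.DataFrame, group_col: str,
                  selected_col: str = "selected") -> pd.Series:
    """Selection rate per group, normalised by the best-performing group."""
    rates = df.groupby(group_col)[selected_col].mean()  # share selected per group
    return rates / rates.max()                          # 1.0 for the top group

# Ratios below roughly 0.8 are conventionally flagged for adverse-impact review.
#   print(impact_ratios(outcomes, "ethnicity"))
#   print(impact_ratios(outcomes, "gender"))
```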
We believe the first step to building fair AI is measuring and publishing the results — even when they reveal uncomfortable truths.
Research conducted by Humanlike AI. All candidates are synthetic — no real individuals were evaluated. AI models tested: Google Gemini 3.1 Pro Preview (scoring) and Gemini 3 Pro Image Preview (photos). Full methodology and raw data available upon request.