How Names, Gender, and Profile Photos Shape Algorithmic Candidate Scoring
A two-phase controlled audit of demographic and appearance bias in a production AI recruitment system, testing 250 synthetic candidates across 10 jobs, 5 ethnicities, and 5 attractiveness tiers.
100 AI-generated LinkedIn headshots across 5 ethnicities, 2 genders, and 5 photo-quality tiers — none of these people exist.
Tier 1 = low-quality webcam shot → Tier 5 = professional studio headshot. All faces are AI-generated — no real individuals.
Our audit reveals that while name and gender bias is minimal, photo quality creates a significant scoring gap.
A 1.09-point gender spread and a 1.32-point ethnicity spread: small numbers that, at scale across thousands of applicants, can systematically shift who gets hired.
An 8.6-point gap between low-quality and professional photos — dwarfing all other bias sources combined.
Removing names and pronouns backfires: neutral candidates score ~1 point lower on average than named ones.
Female candidates average 0.9 points higher than male candidates on Culture Fit; the most subjective criterion is also the most bias-prone.
150 candidates with identical qualifications; only names, pronouns, and cultural affiliations differ. A sketch of how such variants can be constructed follows the Phase 1 findings below.
Gender spread: 1.09 points (minimal bias)
Ethnicity spread: 1.32 points (no severe bias)
Culture Fit and Communication show the largest gender gaps
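To make the Phase 1 setup concrete, here is a minimal Python sketch of how same-qualification variants can be constructed. The base profile, the name list, and the field names are illustrative placeholders, not the audit's actual candidate pool.

```python
# Illustrative only: base profile, names, and labels are placeholders,
# not the audit's actual data.
BASE_PROFILE = {
    "experience": "5 years of backend engineering",
    "education": "BSc Computer Science",
    "skills": ["Python", "SQL", "Kubernetes"],
}

# One (ethnicity, gender) cell per name/pronoun pair; hypothetical examples.
NAMES = {
    ("south_asian", "female"): ("Priya Patel", "she/her"),
    ("south_asian", "male"): ("Raj Patel", "he/him"),
    ("white", "female"): ("Emily Walsh", "she/her"),
    ("white", "male"): ("Greg Walsh", "he/him"),
}

def make_variants(base: dict):
    """Yield candidates that differ only in name, pronouns, and group labels."""
    for (ethnicity, gender), (name, pronouns) in NAMES.items():
        yield {**base, "name": name, "pronouns": pronouns,
               "ethnicity": ethnicity, "gender": gender}
    # Name- and pronoun-free control, mirroring the audit's 'neutral' condition.
    yield {**base, "name": None, "pronouns": None,
           "ethnicity": "neutral", "gender": "neutral"}

candidates = list(make_variants(BASE_PROFILE))
```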
100 candidates with AI-generated LinkedIn photos. Same qualifications — only the photo changes.
An 8.6-point gap between Tier 1 and Tier 3 photos
Female candidates show a larger Tier 1 penalty
Candidates with low-quality photos (T1) scored 8.6 points lower than those with professional headshots (T3) — despite having identical qualifications. This is the single largest bias source in the entire audit, far exceeding the 1.09-point gender spread or 1.32-point ethnicity spread. Notably, the AI explicitly referenced photo quality in its reasoning for Communication scores.
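For readers who want to reproduce this comparison, the sketch below shows one way to compute the tier gap overall and per gender, assuming the per-candidate results have been collected into a pandas DataFrame with columns photo_tier, gender, and overall_score. The column names are assumptions, not the audit's actual schema.

```python
import pandas as pd

def tier_gap(df: pd.DataFrame, low: int, high: int) -> float:
    """Mean overall-score difference between two photo-quality tiers."""
    means = df.groupby("photo_tier")["overall_score"].mean()
    return float(means[high] - means[low])

def tier_gap_by_gender(df: pd.DataFrame, low: int, high: int) -> pd.Series:
    """The same gap, computed separately per gender to expose interactions."""
    means = df.groupby(["gender", "photo_tier"])["overall_score"].mean()
    return means.xs(high, level="photo_tier") - means.xs(low, level="photo_tier")

# Example: overall and per-gender gap between Tier 1 and Tier 3.
#   print(tier_gap(scores, low=1, high=3))
#   print(tier_gap_by_gender(scores, low=1, high=3))
```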
A controlled experimental audit using synthetic candidates with identical qualifications. A sketch of the scoring and comparison step follows the four steps below.
250 identical-qualification profiles generated programmatically
Gemini 3.1 Pro scores each candidate on 4 criteria (0-100)
100 AI-generated LinkedIn photos across 5 attractiveness tiers
Compare scores across gender, ethnicity, and appearance
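The sketch below outlines the scoring-and-comparison step in Python. The llm_call hook, the JSON response format, and two of the criterion names are assumptions standing in for the actual Gemini prompt (only Culture Fit and Communication are named in this report), and "spread" is read here as the highest group mean minus the lowest group mean, which is our interpretation of how the headline numbers are defined.

```python
import json
import pandas as pd

# Culture Fit and Communication are named in the audit; the other two entries
# are placeholders for the remaining, unnamed criteria.
CRITERIA = ["technical_fit", "experience", "communication", "culture_fit"]

def score_candidate(profile: dict, llm_call) -> dict:
    """Ask a text-in/text-out model for 0-100 scores on each criterion.
    `llm_call` is any callable wrapping the scoring model (the audit used
    Gemini; client wiring is omitted here)."""
    prompt = (
        "Score this candidate from 0 to 100 on each of: "
        + ", ".join(CRITERIA)
        + ". Reply with JSON only.\n"
        + json.dumps(profile)
    )
    return json.loads(llm_call(prompt))

def group_spread(df: pd.DataFrame, group_col: str) -> float:
    """Highest group-mean overall score minus the lowest."""
    means = df.groupby(group_col)["overall_score"].mean()
    return float(means.max() - means.min())

# Usage, once `rows` holds one score dict (plus group labels) per candidate:
#   df = pd.DataFrame(rows)
#   df["overall_score"] = df[CRITERIA].mean(axis=1)
#   print(group_spread(df, "gender"), group_spread(df, "ethnicity"))
```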
While overall bias is minimal, specific combinations of role and candidate demographics show concerning gaps.
Neutral candidate scored 90.1 vs female at 81.0 — the largest single gap in Phase 1.
Raj Patel scored 84.5 vs Priya Patel at 89.5 — a notable male penalty in tech.
Female T1 (76.7) vs T3 (89.5) — low-quality photos punish women more severely.
Our study fills gaps in the AI hiring bias literature — from Gemini testing to appearance-based discrimination.
Are Emily and Greg More Employable than Lakisha and Jamal?
Foundational name-based audit — we extend to AI systems with 5 ethnicities
The Silicon Ceiling: Auditing GPT's Race and Gender Biases in Hiring
Most comparable — we test Gemini in production, add appearance testing
Beauty and the Bias: Attractiveness Impact on Multimodal LLMs
Found attractiveness impacts 86% of LLM decisions — we confirm in hiring context
AI systems in recruitment classified as high-risk
Mandatory bias audits by Aug 2026 — our methodology provides a template
AI hiring tools are facing unprecedented regulatory scrutiny worldwide.
EU AI Act: AI recruitment systems classified as high-risk. Mandatory bias audits, transparency, and human oversight required by August 2026.
NYC Local Law 144: requires annual independent bias audits of automated employment decision tools, with public disclosure of impact ratios by race and sex. A sketch of the impact-ratio calculation appears below.
Title VII liability extends to AI hiring tools that produce disparate impact, even when the tool is built and operated by a third-party vendor.
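As a starting point for the kind of disclosure Local Law 144 asks for, here is a minimal sketch of one common impact-ratio formulation: each group's selection rate divided by the highest group's rate. The column names and the informal four-fifths threshold in the comment are illustrative assumptions, not legal guidance.

```python
import pandas as pd

def impact_ratios(df: pd.DataFrame, group_col: str,
                  selected_col: str = "selected") -> pd.Series:
    """Selection rate per group, normalised by the best-performing group."""
    rates = df.groupby(group_col)[selected_col].mean()  # share selected per group
    return rates / rates.max()                          # 1.0 for the top group

# Ratios below roughly 0.8 are conventionally flagged for adverse-impact review.
#   print(impact_ratios(outcomes, "ethnicity"))
#   print(impact_ratios(outcomes, "gender"))
```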
We believe the first step to building fair AI is measuring and publishing the results — even when they reveal uncomfortable truths.
Research conducted by Humanlike AI. All candidates are synthetic — no real individuals were evaluated. AI models tested: Google Gemini 3.1 Pro Preview (scoring) and Gemini 3 Pro Image Preview (photos). Full methodology and raw data available upon request.