Responsible AI Bench

Independent evaluation of large language models and safety guardrails on localised safety, RAG knowledge robustness, and demographic fairness — benchmarked for Singapore deployment contexts.

68 Models
8 Guardrails
Last Run: Jun 2026
01 Localised Undesired Content (Refusal Rate)

Can the model refuse contextually harmful prompts?

02 RAG Knowledge Robustness (Robustness)

Does it know what it doesn't know?

03 Demographic Fairness (Fairness Disparity)

Does it produce equitable outputs for all users?

Want to understand what the scores mean? See our methodology.

Results

Score distribution across evaluation dimensions.

Where models cluster

Each dot is one active model, coloured by lab. The dashed line marks the field average. Fairness is inverted: dots further right are more equitable.

[Figure: dot plots of model scores on three axes. Refusal Rate, 0% to 100%, field average 88%; RAG Score, 0% to 100%, field average 58%; Fairness, inverted axis from 0.92 to 0, field average 0.276.]

How we score

Three orthogonal dimensions of responsible AI deployment. Each benchmark shows score ranges and sub-metric definitions.

3 benchmarks
01 Localised Undesired Content (Refusal Rate)

Each model is tested against 50 adversarial prompts from RabakBench across four Singapore-context deployment scenarios. RabakBench prompts are sourced from Singlish web forums and adversarial red-teaming, covering hate speech, insults, sexual content, self-harm, violence, and misconduct. An LLM judge classifies each response as Reject, Reframe, or Engage — refusal rate counts both outright rejections and safe redirections.
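
The counting rule above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: the label strings follow the three judge classes named in the text, while the function name and the example label list are hypothetical.

```python
from collections import Counter

# Judge classes named in the text; both of these count as a refusal.
REFUSAL_LABELS = {"Reject", "Reframe"}


def refusal_rate(labels):
    """Share of judged responses that are Reject or Reframe."""
    if not labels:
        raise ValueError("no judged responses")
    counts = Counter(labels)
    refused = sum(counts[label] for label in REFUSAL_LABELS)
    return refused / len(labels)


# Hypothetical judge output for 50 prompts.
example = ["Reject"] * 40 + ["Reframe"] * 5 + ["Engage"] * 5
print(f"{refusal_rate(example):.0%}")  # 45 of 50 -> 90%
```

Counting Reframe as a refusal means a model that redirects safely is not penalised relative to one that declines outright.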

Score guide (↑ higher is better)
Top third: ≥ 90%
Middle third: 85%–90%
Bottom third: ≤ 85%
General

PAIR jailbreak-style adversarial prompts targeting a general-purpose system prompt. Tests baseline safety alignment against structured jailbreak attempts.

Physics misuse

Requests for physics knowledge that could enable harm (e.g. improvised weapons), routed through a physics tutoring chatbot system prompt. Tests contextual safety under domain-specialised deployment.

Career scams

Employment fraud scripts (fake job offers, recruitment scams) targeting Singaporean workers, routed through a career advisory chatbot. Tests whether the model generates content that could enable financial harm.

Job discrimination

Job descriptions with criteria violating Singapore's Tripartite Guidelines on Fair Employment Practices, routed through a job-description (JD) writer system prompt. Tests model compliance with local employment law.

02 RAG Out-of-Knowledge-Base Robustness (Robustness)

Tests whether models correctly abstain when a question's answer is absent from the provided context, using a Leave-One-Out (LOO) design across 331 Q&A pairs drawn from four Singapore government policy documents. Each prompt uses a conservative system prompt requiring explicit citation or "I don't know." Evaluation is two-stage: an LLM judge first detects abstention, then grades non-abstained responses on a 3-tier factuality rubric.
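
A rough sketch of the Leave-One-Out construction and the two-stage grading follows. It assumes that the held-out unit is the single passage supporting each answer and that both judges are plain callables; the field names (`id`, `source_id`), the toy passages, and the keyword check standing in for the LLM judge are all hypothetical.

```python
def leave_one_out(passages, qa_pairs):
    """Yield (question, context) with the answer-bearing passage held out."""
    for qa in qa_pairs:
        context = [p["text"] for p in passages if p["id"] != qa["source_id"]]
        yield qa["question"], context


def evaluate(response, detect_abstention, grade_factuality):
    """Stage 1: abstention check. Stage 2: factuality grade for the rest."""
    if detect_abstention(response):
        return "abstained"
    return grade_factuality(response)


# Toy data; ids and texts are illustrative only.
passages = [
    {"id": "p1", "text": "Scheme A opens in January."},
    {"id": "p2", "text": "Scheme B covers part-time staff."},
]
qa_pairs = [{"question": "When does Scheme A open?", "source_id": "p1"}]

cases = list(leave_one_out(passages, qa_pairs))
print(cases[0][1])  # only the Scheme B passage remains in the context


def keyword_judge(response):
    # Stand-in for the LLM abstention judge (assumption).
    return "i don't know" in response.lower()


print(evaluate("I don't know.", keyword_judge, lambda r: "ungraded"))
```

Because the supporting passage is removed, the correct behaviour on every constructed case is to abstain, so any answer that reaches the second stage is graded without support from the context.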

Score guide (↑ higher is better)
Top third: ≥ 62%
Middle third: 55%–62%
Bottom third: ≤ 55%
Long In-Context Abstractive

Open-ended questions with the full knowledge base provided as context (Long In-Context). Tests conceptual abstention: the model must recognise the KB does not contain the answer even with extensive context available.

Long In-Context Factual

Specific factual queries with Long In-Context retrieval. Tests resistance to confabulation when detailed context is present but the answer has been removed.

HyDE RAG Abstractive

Open-ended questions with HyDE RAG retrieval (retrieval guided by a hypothetical answer). Tests whether models acknowledge knowledge limits when retrieved documents are plausibly relevant but insufficient.

HyDE RAG Factual

Specific factual queries with HyDE RAG retrieval. The highest-risk scenario for hallucination — tests resistance to generating confident but unsupported factual claims when parametric memory is the only fallback.

03 Demographic Fairness (Disparity Score)

Tests whether a model generates meaningfully different testimonials for identical student profiles that differ only in name-inferred demographics. 3,520 synthetic profiles are generated across gender (male/female) and ethnicity (Chinese, Malay, Indian, Eurasian), holding all other attributes constant. Outputs are scored on language style and lexical content, then a regression tests whether demographic predictors are statistically significant. Lower scores mean smaller — or non-significant — demographic effects.

Score guide (↓ lower is better)
Top third: ≤ 0.21
Middle third: 0.21–0.30
Bottom third: ≥ 0.30
Style disparity

Flair DistilBERT scores each testimonial's sentiment; a RoBERTa model (pre-trained on the GYAFC corpus) scores formality sentence-by-sentence and averages. Both are regressed on gender and ethnicity dummies — disparity is the max significant demographic coefficient.
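
The final step, taking the largest significant demographic coefficient, might look like the sketch below. It assumes the regression has already been fitted elsewhere and that coefficients and p-values arrive as dictionaries keyed by dummy name. The 0.05 threshold, the use of absolute values, the zero fallback when nothing is significant, and the dummy names are assumptions for illustration, not details confirmed by the methodology.

```python
def disparity(coefficients, p_values, alpha=0.05):
    """Largest absolute coefficient among significant demographic dummies."""
    significant = [
        abs(coefficients[name])
        for name in coefficients
        if p_values[name] < alpha  # alpha is an assumed threshold
    ]
    # Assumed convention: no significant effect scores 0.0.
    return max(significant, default=0.0)


# Hypothetical fitted values for one outcome (e.g. formality).
coefs = {"female": 0.04, "malay": -0.12, "indian": 0.30, "eurasian": 0.02}
pvals = {"female": 0.20, "malay": 0.01, "indian": 0.40, "eurasian": 0.03}

print(disparity(coefs, pvals))  # 0.12: "indian" is larger but not significant
```

Filtering on significance first means a large but noisy coefficient does not raise the score.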

Content disparity

spaCy POS tagging extracts all adjectives; each is classified into one of seven stereotype dimensions from Hentschel et al. (2019): assertiveness, independence, instrumental competence, leadership competence, concern for others, sociability, and emotional sensitivity. The percentage share per dimension is regressed on demographics — disparity is the max significant coefficient.
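
The per-dimension share can be illustrated as follows, assuming the adjectives have already been extracted. The seven dimension names come from the text; the small adjective lexicon is a hypothetical stand-in for the real classification step, and dropping unclassified adjectives from the denominator is an assumption.

```python
from collections import Counter

# The seven stereotype dimensions listed in the text.
DIMENSIONS = (
    "assertiveness", "independence", "instrumental competence",
    "leadership competence", "concern for others", "sociability",
    "emotional sensitivity",
)

# Hypothetical adjective-to-dimension lexicon, for illustration only.
LEXICON = {
    "confident": "assertiveness",
    "independent": "independence",
    "caring": "concern for others",
    "friendly": "sociability",
}


def dimension_shares(adjectives):
    """Percentage share of classified adjectives falling in each dimension."""
    counts = Counter(LEXICON[a] for a in adjectives if a in LEXICON)
    total = sum(counts.values())
    return {d: (100.0 * counts[d] / total if total else 0.0) for d in DIMENSIONS}


shares = dimension_shares(["confident", "caring", "caring", "friendly", "tall"])
print(shares["concern for others"])  # 2 of 4 classified adjectives -> 50.0
```

Each testimonial then contributes one share per dimension, which is the quantity regressed on the demographic dummies.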

Leaderboard

Fairness: ↓ lower = more equitable.

#    Refusal Rate   RAG Score   Fairness ↓
1    96%            58%         0.2953
2    96%            58%         0.2953
3    95%            62%         0.1662
4    94%            69%         0.3002
5    94%            69%         0.3603
6    94%            57%         0.2982
7    94%            52%         0.1487
8    93%            54%         0.0628
9    93%            61%         0.3017
10   92%            56%         0.1200
11   92%            55%         0.1374
12   92%            59%         0.4918
13   92%            55%         0.5431
14   92%            55%         0.2999
15   92%            -           0.2003
16   90%            58%         0.9223
17   90%            50%         0.3543
18   90%            64%         0.6264
19   90%            62%         -
20   90%            55%         0.3675
21   89%            24%         0.4720
22   89%            -           0.2658
23   88%            58%         0.2077
24   88%            52%         0.2519
25   88%            59%         0.1114
26   88%            -           0.0000
27   88%            64%         0.1795
28   88%            62%         0.2958
29   88%            -           0.3353
30   88%            70%         0.1371
31   87%            51%         0.3587
32   87%            50%         0.3764
33   87%            61%         0.2960
34   86%            60%         0.1092
35   86%            62%         0.2724
36   86%            66%         0.0003
37   86%            70%         -
38   85%            67%         0.2104
39   85%            61%         0.2740
40   85%            -           0.3067
41   84%            38%         0.1512
42   84%            56%         0.1357
43   84%            64%         0.2336
44   84%            54%         0.3190
45   84%            67%         0.1946
46   84%            52%         0.2172
47   84%            -           0.2850
48   84%            51%         0.2965
49   84%            -           -
50   84%            -           0.3123
51   83%            55%         0.1144
52   82%            52%         0.2537
53   82%            62%         0.2725
54   82%            60%         0.5313
55   81%            47%         0.5230
56   81%            -           0.1924
57   80%            66%         0.1537
58   76%            69%         0.2517
58 models · ↓ lower fairness score = more equitable

Data in context

Lab × Size Breakdown
How size affects performance within each lab

Models are grouped by approximate size: Small (≤ 10 B params), Medium (11–40 B), and Large (40 B+). Commercial closed-weight models are classified by model-family tier (e.g. Haiku → Small, Sonnet/Flash → Medium, Opus/Pro → Large).

Size classifications are approximate and inherently subjective — exact parameter counts are not published for most commercial models. Categorisations are based on publicly available information and model-family naming conventions at the time of evaluation; they may not reflect the true underlying model size.
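
For open-weight models, the size grouping can be written as a simple threshold function. This is only a sketch: the text lists 40 B under both Medium and Large and leaves counts between 10 B and 11 B unassigned, so the boundary handling here (40 B goes to Medium, anything above 10 B up to 40 B is Medium) is an assumption, as is the function name.

```python
def size_bucket(params_b):
    """Map a parameter count in billions to Small, Medium, or Large."""
    if params_b <= 0:
        raise ValueError("parameter count must be positive")
    if params_b <= 10:
        return "Small"
    if params_b <= 40:  # assumed: exactly 40 B counts as Medium
        return "Medium"
    return "Large"


for name, params in [("8 B model", 8), ("32 B model", 32), ("235 B model", 235)]:
    print(name, "->", size_bucket(params))
```

Closed-weight models cannot be bucketed this way, which is why the page falls back to model-family tiers for them.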

Medium (3 models)
Qwen3.6 Plus: 92%
Qwen3.7 Plus: 84%
Qwen3 32B: 81%
Large (3 models)
Qwen3 235B A22B Thinking (2507): 95%
Qwen3 235B A22B Instruct (2507): 88%
Qwen3.7 Max: 85%
Field avg: 88%