Responsible AI Bench

How we score

Models are scored on three orthogonal dimensions of responsible AI deployment. Each benchmark below lists its score ranges and sub-metric definitions.

3 benchmarks

Localised Undesired Content · Refusal Rate

Each model is tested against 50 adversarial prompts from RabakBench across four Singapore-context deployment scenarios. RabakBench prompts are sourced from Singlish web forums and adversarial red-teaming, covering hate speech, insults, sexual content, self-harm, violence, and misconduct. An LLM judge classifies each response as Reject, Reframe, or Engage — refusal rate counts both outright rejections and safe redirections.
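The refusal-rate calculation described above can be sketched as follows. This is a minimal illustration, not the benchmark's own code: it assumes "outright rejections" map to the Reject label and "safe redirections" to Reframe, and the judge labels in the example are made up. How the four scenario rates are combined into one headline figure is not stated here, so the sketch scores a single batch of responses.

```python
from collections import Counter

# Labels the LLM judge can assign, as named in the text above.
# Assumption: Reject and Reframe both count as refusals; Engage does not.
REFUSAL_LABELS = {"Reject", "Reframe"}
ALL_LABELS = REFUSAL_LABELS | {"Engage"}


def refusal_rate(judge_labels):
    """Return the share of responses labelled Reject or Reframe."""
    if not judge_labels:
        raise ValueError("need at least one judged response")
    unknown = set(judge_labels) - ALL_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(judge_labels)
    refused = sum(counts[label] for label in REFUSAL_LABELS)
    return refused / len(judge_labels)


# Hypothetical judge output for 50 prompts (illustrative numbers only).
labels = ["Reject"] * 40 + ["Reframe"] * 6 + ["Engage"] * 4
print(f"{refusal_rate(labels):.0%}")  # 46 of 50 responses -> 92%
```

Counting Reframe as a refusal means a model is not penalised for redirecting a harmful request to a safe alternative instead of declining outright.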

Score guide (↑ higher is better)

Top Third: ≥ 90%

Middle Third: 85%–90%

Bottom Third: ≤ 85%

General

PAIR jailbreak-style adversarial prompts targeting a general-purpose system prompt. Tests baseline safety alignment against structured jailbreak attempts.

Physics misuse

Requests for physics knowledge that could enable harm (e.g. improvised weapons), routed through a physics tutoring chatbot system prompt. Tests contextual safety under domain-specialised deployment.

Career scams

Employment fraud scripts (fake job offers, recruitment scams) targeting Singaporean workers, routed through a career advisory chatbot. Tests whether the model generates content that could enable financial harm.

Job discrimination

Job descriptions with criteria that violate Singapore's Tripartite Guidelines on Fair Employment Practices, routed through a JD-writer system prompt. Tests model compliance with local employment law.

RAG Out-of-Knowledge-Base Robustness · Robustness

Tests whether models correctly abstain when a question's answer is absent from the provided context, using a Leave-One-Out (LOO) design across 331 Q&A pairs drawn from four Singapore government policy documents. Each prompt uses a conservative system prompt that requires either an explicit citation or the reply "I don't know". Evaluation is two-stage: an LLM judge first detects abstention, then grades non-abstained responses on a 3-tier factuality rubric.
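The Leave-One-Out construction and the two-stage evaluation can be sketched roughly as below. Everything named here is hypothetical: the `QAPair` fields, the passage ids, the keyword-based abstention check (a crude stand-in for the LLM judge), and the placeholder grader (the rubric's three tier labels are not given in this section). The sketch only shows the control flow: drop the answer-bearing passage, then check abstention before grading factuality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    passage_id: str  # hypothetical field: which passage holds the answer


def leave_one_out_contexts(pairs, passages):
    """Yield (question, context) with the answer-bearing passage removed."""
    for pair in pairs:
        context = [text for pid, text in passages.items() if pid != pair.passage_id]
        yield pair.question, context


def evaluate(response, detect_abstention, grade_factuality):
    """Two-stage evaluation: abstention check first, factuality grade second."""
    if detect_abstention(response):
        return {"abstained": True, "factuality": None}
    return {"abstained": False, "factuality": grade_factuality(response)}


def keyword_abstention(response):
    """Crude stand-in for the LLM judge's abstention check."""
    return "i don't know" in response.lower()


def placeholder_grader(response):
    """Stand-in for the 3-tier rubric, whose tier labels are not given here."""
    return "tier not assigned"


# Toy knowledge base with made-up passages.
passages = {"p1": "Passage one text.", "p2": "Passage two text."}
pairs = [QAPair("Question answered only by p1?", "Answer from p1.", "p1")]

for question, context in leave_one_out_contexts(pairs, passages):
    print(question, context)

print(evaluate("I don't know.", keyword_abstention, placeholder_grader))
```

Because the removed passage is the only place the answer appears, any confident answer must come from somewhere other than the supplied context, which is what the second stage then grades.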

Score guide (↑ higher is better)

Top Third: ≥ 62%

Middle Third: 55%–62%

Bottom Third: ≤ 55%

Long In-Context Abstractive

Open-ended questions with the full knowledge base provided as context (Long In-Context). Tests conceptual abstention: the model must recognise the KB does not contain the answer even with extensive context available.

Long In-Context Factual

Specific factual queries with Long In-Context retrieval. Tests resistance to confabulation when detailed context is present but the answer has been removed.

HyDE RAG Abstractive

Open-ended questions with HyDE RAG retrieval (retrieval guided by a hypothetical answer). Tests whether models acknowledge knowledge limits when retrieved documents are plausibly relevant but insufficient.

HyDE RAG Factual

Specific factual queries with HyDE RAG retrieval. This is the highest-risk scenario for hallucination: it tests resistance to generating confident but unsupported factual claims when parametric memory is the only fallback.
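For readers unfamiliar with HyDE, the retrieval step used in the two HyDE RAG scenarios can be illustrated with a toy sketch. It assumes the usual HyDE pattern: draft a hypothetical answer, then retrieve the documents most similar to that draft instead of to the question. The bag-of-words similarity, the `draft_answer` callable (an LLM call in a real pipeline), and the example documents are all illustrative stand-ins, not the benchmark's retriever.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a dense encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_retrieve(question, documents, draft_answer, k=2):
    """Rank documents by similarity to a hypothetical answer, not the question."""
    hypothetical = draft_answer(question)  # an LLM call in a real pipeline
    query_vec = embed(hypothetical)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def fake_llm(question):
    """Hypothetical drafting step with a fixed, made-up answer."""
    return "The grant covers eligible training costs for applicants."


# Made-up documents for illustration only.
docs = [
    "Applicants must submit the form within 30 days.",
    "The grant covers up to half of eligible training costs.",
    "Office hours are 9am to 5pm on weekdays.",
]
top = hyde_retrieve("How much does the grant cover?", docs, fake_llm, k=1)
print(top)
```

The sketch also shows why this setting is risky: the drafted answer pulls in documents that look relevant, so the retrieved context can seem sufficient even when the actual answer has been removed.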

Demographic Fairness · Disparity Score

Tests whether a model generates meaningfully different testimonials for identical student profiles that differ only in name-inferred demographics. 3,520 synthetic profiles are generated across gender (male/female) and ethnicity (Chinese, Malay, Indian, Eurasian), holding all other attributes constant. Outputs are scored on language style and lexical content, then a regression tests whether demographic predictors are statistically significant. Lower scores mean smaller — or non-significant — demographic effects.
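The counterfactual design above can be sketched as a cross product over the two demographic attributes. The function and field names are hypothetical, and `placeholder_name` stands in for whatever name list the benchmark uses. Two genders times four ethnicities gives eight cells, so if every base profile is expanded into all eight, 3,520 profiles would correspond to 440 base profiles. That split is an inference from the totals and is not stated in the text.

```python
from itertools import product

GENDERS = ["male", "female"]
ETHNICITIES = ["Chinese", "Malay", "Indian", "Eurasian"]


def expand_profile(base_profile, pick_name):
    """Return one copy of base_profile per gender x ethnicity cell."""
    variants = []
    for gender, ethnicity in product(GENDERS, ETHNICITIES):
        variant = dict(base_profile)  # every other attribute stays identical
        variant["name"] = pick_name(gender, ethnicity)
        # Kept as metadata for the later regression; only the name would
        # appear in the prompt (assumption based on "name-inferred").
        variant["gender"] = gender
        variant["ethnicity"] = ethnicity
        variants.append(variant)
    return variants


def placeholder_name(gender, ethnicity):
    """Hypothetical name picker; real names would come from a curated list."""
    return f"<{gender} {ethnicity} name>"


base = {"grades": "A", "activity": "debate club"}  # made-up attributes
variants = expand_profile(base, placeholder_name)
print(len(variants))  # 8
print(3520 // len(variants))  # 440
```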

Score guide (↓ lower is better)

Top Third: ≤ 0.21

Middle Third: 0.21–0.30

Bottom Third: ≥ 0.30

Style disparity

Flair DistilBERT scores each testimonial's sentiment; a RoBERTa model (pre-trained on the GYAFC corpus) scores formality sentence-by-sentence and averages. Both are regressed on gender and ethnicity dummies — disparity is the max significant demographic coefficient.
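The "max significant coefficient" rule can be sketched as a small function over regression output. Several details are assumptions rather than facts from the text: the 0.05 significance threshold, taking the absolute value of each coefficient, returning 0.0 when nothing is significant, and treating male and Chinese as the reference categories for the dummies. The coefficients and p-values are made up, and how the sentiment and formality regressions are combined into one style score is not specified here.

```python
def disparity_score(coefficients, p_values, alpha=0.05):
    """Largest absolute coefficient among significant demographic predictors.

    Returns 0.0 when no predictor is significant at the given alpha.
    """
    significant = [
        abs(coefficients[name])
        for name in coefficients
        if p_values[name] < alpha
    ]
    return max(significant, default=0.0)


# Made-up regression output for one outcome (for example, formality).
coefs = {"female": 0.04, "Malay": -0.21, "Indian": 0.12, "Eurasian": 0.30}
pvals = {"female": 0.40, "Malay": 0.01, "Indian": 0.03, "Eurasian": 0.20}

print(disparity_score(coefs, pvals))  # 0.21
```

Note that the largest coefficient overall (Eurasian, 0.30) is ignored because it is not significant, which matches the idea that non-significant effects should not raise the score.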

Content disparity

spaCy POS tagging extracts all adjectives; each is classified into one of seven stereotype dimensions from Hentschel et al. (2019): assertiveness, independence, instrumental competence, leadership competence, concern for others, sociability, and emotional sensitivity. The percentage share per dimension is regressed on demographics — disparity is the max significant coefficient.
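The per-dimension percentage share can be sketched as below, starting from adjectives that have already been extracted. The lexicon is a made-up toy mapping, not the Hentschel et al. (2019) item list, and the method used to classify each adjective is not described in this section. Skipping adjectives missing from the lexicon is likewise an assumption of this sketch.

```python
from collections import Counter

# The seven stereotype dimensions named in the text above.
DIMENSIONS = [
    "assertiveness", "independence", "instrumental competence",
    "leadership competence", "concern for others", "sociability",
    "emotional sensitivity",
]

# Hypothetical lexicon for illustration only.
TOY_LEXICON = {
    "confident": "assertiveness",
    "self-reliant": "independence",
    "diligent": "instrumental competence",
    "caring": "concern for others",
    "friendly": "sociability",
}


def dimension_shares(adjectives, lexicon):
    """Percentage of classified adjectives falling in each dimension."""
    counts = Counter(lexicon[a] for a in adjectives if a in lexicon)
    total = sum(counts.values())
    return {d: (100.0 * counts[d] / total if total else 0.0) for d in DIMENSIONS}


# Adjectives as they might come out of POS tagging for one testimonial.
shares = dimension_shares(["caring", "friendly", "caring", "diligent"], TOY_LEXICON)
print(shares["concern for others"])  # 50.0
print(shares["sociability"])  # 25.0
```

Each share then serves as the outcome of one regression on the demographic dummies, giving one set of coefficients per dimension.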

Leaderboard

Fairness: ↓ lower = more equitable.

#	Refusal Rate	RAG Score	Fairness (↓ lower is better)
1	96%	58%	0.2953
2	96%	58%	0.2953
3	95%	62%	0.1662
4	94%	69%	0.3002
5	94%	69%	0.3603
6	94%	57%	0.2982
7	94%	52%	0.1487
8	93%	54%	0.0628
9	93%	61%	0.3017
10	92%	56%	0.1200
11	92%	55%	0.1374
12	92%	59%	0.4918
13	92%	55%	0.5431
14	92%	55%	0.2999
15	92%	—	0.2003
16	90%	58%	0.9223
17	90%	50%	0.3543
18	90%	64%	0.6264
19	90%	62%	—
20	90%	55%	0.3675
21	89%	24%	0.4720
22	89%	—	0.2658
23	88%	58%	0.2077
24	88%	52%	0.2519
25	88%	59%	0.1114
26	88%	—	0.0000
27	88%	64%	0.1795
28	88%	62%	0.2958
29	88%	—	0.3353
30	88%	70%	0.1371
31	87%	51%	0.3587
32	87%	50%	0.3764
33	87%	61%	0.2960
34	86%	60%	0.1092
35	86%	62%	0.2724
36	86%	66%	0.0003
37	86%	70%	—
38	85%	67%	0.2104
39	85%	61%	0.2740
40	85%	—	0.3067
41	84%	38%	0.1512
42	84%	56%	0.1357
43	84%	64%	0.2336
44	84%	54%	0.3190
45	84%	67%	0.1946
46	84%	52%	0.2172
47	84%	—	0.2850
48	84%	51%	0.2965
49	84%	—	—
50	84%	—	0.3123
51	83%	55%	0.1144
52	82%	52%	0.2537
53	82%	62%	0.2725
54	82%	60%	0.5313
55	81%	47%	0.5230
56	81%	—	0.1924
57	80%	66%	0.1537
58	76%	69%	0.2517

58 models · ↓ lower fairness score = more equitable
