Independent evaluation of large language models and safety guardrails on localised safety, RAG knowledge robustness, and demographic fairness — benchmarked for Singapore deployment contexts.
Score distribution across evaluation dimensions.
Each dot is one active model, coloured by lab. The dashed line marks the field average. Fairness is inverted: dots further right are more equitable.
Three orthogonal dimensions of responsible AI deployment. Each benchmark shows score ranges and sub-metric definitions.
Fairness: lower = more equitable.
| # | Creator | Model | Refusal Rate | RAG Score | Fairness (lower = more equitable) |
|---|---|---|---|---|---|
| 1 | | | 96% | 58% | 0.2953 |
| 2 | | | 96% | 58% | 0.2953 |
| 3 | | | 95% | 62% | 0.1662 |
| 4 | | | 94% | 69% | 0.3002 |
| 5 | | | 94% | 69% | 0.3603 |
| 6 | | | 94% | 57% | 0.2982 |
| 7 | | | 94% | 52% | 0.1487 |
| 8 | | | 93% | 54% | 0.0628 |
| 9 | | | 93% | 61% | 0.3017 |
| 10 | | | 92% | 56% | 0.1200 |
| 11 | | | 92% | 55% | 0.1374 |
| 12 | | | 92% | 59% | 0.4918 |
| 13 | | | 92% | 55% | 0.5431 |
| 14 | | | 92% | 55% | 0.2999 |
| 15 | | | 92% | — | 0.2003 |
| 16 | | | 90% | 58% | 0.9223 |
| 17 | | | 90% | 50% | 0.3543 |
| 18 | | | 90% | 64% | 0.6264 |
| 19 | | | 90% | 62% | — |
| 20 | | | 90% | 55% | 0.3675 |
| 21 | | | 89% | 24% | 0.4720 |
| 22 | | | 89% | — | 0.2658 |
| 23 | | | 88% | 58% | 0.2077 |
| 24 | | | 88% | 52% | 0.2519 |
| 25 | | | 88% | 59% | 0.1114 |
| 26 | | | 88% | — | 0.0000 |
| 27 | | | 88% | 64% | 0.1795 |
| 28 | | | 88% | 62% | 0.2958 |
| 29 | | | 88% | — | 0.3353 |
| 30 | | | 88% | 70% | 0.1371 |
| 31 | | | 87% | 51% | 0.3587 |
| 32 | | | 87% | 50% | 0.3764 |
| 33 | | | 87% | 61% | 0.2960 |
| 34 | | | 86% | 60% | 0.1092 |
| 35 | | | 86% | 62% | 0.2724 |
| 36 | | | 86% | 66% | 0.0003 |
| 37 | | | 86% | 70% | — |
| 38 | | | 85% | 67% | 0.2104 |
| 39 | | | 85% | 61% | 0.2740 |
| 40 | | | 85% | — | 0.3067 |
| 41 | | | 84% | 38% | 0.1512 |
| 42 | | | 84% | 56% | 0.1357 |
| 43 | | | 84% | 64% | 0.2336 |
| 44 | | | 84% | 54% | 0.3190 |
| 45 | | | 84% | 67% | 0.1946 |
| 46 | | | 84% | 52% | 0.2172 |
| 47 | | | 84% | — | 0.2850 |
| 48 | | | 84% | 51% | 0.2965 |
| 49 | | | 84% | — | — |
| 50 | | | 84% | — | 0.3123 |
| 51 | | | 83% | 55% | 0.1144 |
| 52 | | | 82% | 52% | 0.2537 |
| 53 | | | 82% | 62% | 0.2725 |
| 54 | | | 82% | 60% | 0.5313 |
| 55 | | | 81% | 47% | 0.5230 |
| 56 | | | 81% | — | 0.1924 |
| 57 | | | 80% | 66% | 0.1537 |
| 58 | | | 76% | 69% | 0.2517 |
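
The field average drawn as a dashed line in the chart could be recomputed from table cells like those above. The following is a minimal sketch, assuming the average is a plain arithmetic mean over the models that have a score. The page does not state how missing entries (shown as a dash) are handled, so skipping them is an assumption. `parse_score` and `field_average` are hypothetical helper names, and the sample cells are ranks 19 to 22 from the table, not the full field.

```python
from statistics import mean
from typing import Optional, Sequence

MISSING = "—"  # marker the table uses for an absent score


def parse_score(cell: str) -> Optional[float]:
    """Turn a cell such as '96%' or '0.2953' into a float; None if missing."""
    cell = cell.strip()
    if cell in ("", MISSING):
        return None
    if cell.endswith("%"):
        # Percentages are rescaled to the 0..1 range.
        return float(cell[:-1]) / 100.0
    return float(cell)


def field_average(cells: Sequence[str]) -> Optional[float]:
    """Mean of the available scores (assumption: missing entries are skipped)."""
    values = [v for v in (parse_score(c) for c in cells) if v is not None]
    return mean(values) if values else None


# RAG Score and Fairness cells for ranks 19 to 22 in the table.
rag_cells = ["62%", "55%", "24%", "—"]
fairness_cells = ["—", "0.3675", "0.4720", "0.2658"]

rag_avg = field_average(rag_cells)
fairness_avg = field_average(fairness_cells)
```

Because fairness is a lower-is-better score, a model sits on the more equitable side of the dashed line when its value is below `fairness_avg`, whereas for the RAG score the better side is above `rag_avg`.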
Models are grouped by approximate size: Small (≤ 10 B parameters), Medium (11–40 B), and Large (over 40 B). Commercial closed-weight models are classified by model-family tier (e.g. Haiku → Small, Sonnet/Flash → Medium, Opus/Pro → Large).
Size classifications are approximate and inherently subjective — exact parameter counts are not published for most commercial models. Categorisations are based on publicly available information and model-family naming conventions at the time of evaluation; they may not reflect the true underlying model size.
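
The grouping rule described above can be sketched as a small function. This is an illustration only, not the evaluators' actual procedure. It assumes that anything above 10 B and up to 40 B counts as Medium, since the stated ranges do not spell out every boundary. The `FAMILY_TIER` lookup and the `size_bucket` name are hypothetical and cover only the tiers named in the text.

```python
from typing import Optional

# Hypothetical lookup built only from the tier examples given in the text.
FAMILY_TIER = {
    "haiku": "Small",
    "sonnet": "Medium",
    "flash": "Medium",
    "opus": "Large",
    "pro": "Large",
}


def size_bucket(params_b: Optional[float] = None,
                model_name: str = "") -> Optional[str]:
    """Return 'Small', 'Medium', 'Large', or None if the size is unknown."""
    if params_b is not None:
        # Published parameter count (in billions) takes precedence.
        if params_b <= 10:
            return "Small"
        if params_b <= 40:  # assumption: boundary values fall into Medium
            return "Medium"
        return "Large"
    # Otherwise fall back to model-family naming conventions.
    for token in model_name.lower().replace("-", " ").split():
        if token in FAMILY_TIER:
            return FAMILY_TIER[token]
    return None
```

For example, `size_bucket(7)` falls in the Small group, while a closed-weight model whose name contains the token `sonnet` would be placed in Medium by naming convention alone, which is exactly the kind of approximation the caveat above warns about.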