Task
Detailed breakdown of individual task performance across different models.
Task Name (15 tasks)     claude-4-6-sonnet  gemini-3-flash  gemini-3.1-pro  gpt-5.2-codex
b2b_delete_organization  145.5s             115.9s          133.5s          141.6s
b2b_fraud_aware_login    144.2s             59.1s           119.5s          89.0s
b2b_jwt_validator        159.6s             69.0s           180.8s          98.2s
b2b_organization_search  193.5s             152.0s          86.9s           99.1s
b2b_scim                 145.2s             90.5s           63.7s           97.6s
b2b_sso_saml             92.2s              121.2s          168.8s          39.8s
b2b_update_organization  66.3s              119.5s          104.6s          67.2s
b2c_magic_link           94.9s              90.4s           147.5s          394.7s
b2c_session_revoke       77.6s              64.4s           59.0s           78.6s
b2c_totp                 135.4s             146.1s          280.2s          95.6s
mcp_auth                 123.7s             61.6s           48.7s           78.5s
otp_sms                  74.8s              59.1s           51.6s           82.4s
passkeys_auth            80.8s              70.5s           74.0s           82.9s
session_mgmt             72.1s              66.9s           107.8s          91.0s
step_up_mfa              80.1s              116.6s          61.3s           84.2s