Independent AI Agent Evaluation

Trust, verified.
Not sold.

VERDICT is an independent evaluation index for AI agents and workflow platforms. No vendor sponsorships. No paid certifications. Scores based on public data, incident history, and behavioral testing.

3/100 Evaluations Published
8 Critical CVEs Tracked
0 Paid Certifications
$0 Vendor Revenue
Vendors notified only after publication
No free trials accepted from vendors under evaluation
Methodology fully public and versioned
Silence is scored: disclosure gaps are data points
Scores update when incidents occur, not on vendor request
Framework v0.3.0 · 7 dimensions · 100 points
Evaluations

Published Reports

Layer 0 evaluates public documentation only. E (Effectiveness) requires live behavioral testing (Layer 1) and is excluded from Layer 0 scores.
Layer 0 maximum: 85 points (the 100-point total minus the 15 points assigned to E). The shape of each radar chart is the platform's trust fingerprint.

n8n
Workflow Automation · Open Source
Evaluated 2026.03.12  ·  Framework v0.3.0
Most transparent platform in the index — publishes every CVE, responds in days. But 8 critical vulnerabilities in 12 months reveal a structural sandboxing flaw that patches cannot fix alone.
8 CVEs · CVSS 9.4–10.0 | Structural Sandbox Failure | Fastest Patch Response
V E R D I C T
40 / 85
Layer 0 · Public Docs
Make.com
Workflow Automation · Cloud SaaS
Evaluated 2026.03.12  ·  Framework v0.3.0
Zero public CVEs — by policy, not necessarily by reality. ISO 27001 and SOC 2 certified, but the absence of public disclosure makes independent verification structurally impossible.
0 Public CVEs | Non-Disclosure Policy | AI Training: Unconfirmed
V E R D I C T
46 / 85
Layer 0 · Public Docs
Zapier
Workflow Automation · Cloud SaaS
Evaluated 2026.03.12  ·  Framework v0.3.0
No public CVEs, but two verified incidents in 12 months — including a supply chain compromise of its npm SDK. Disclosed the 2025 breach to all affected users immediately, showing the best incident response of the three platforms.
Supply Chain Breach · Nov 2025 | Repo Breach · Mar 2025 | Best Incident Disclosure
V E R D I C T
48 / 85
Layer 0 · Public Docs
Activepieces · Next Evaluation
Pipedream · In Queue
Methodology

How VERDICT Evaluates

LAYER 0
Public Documentation
Zero cost. Everything publicly available is evaluated. The absence of documentation is scored as strongly as its presence.
Privacy policies · Terms of Service
CVE databases · NVD · GitHub Advisories
Security pages · Compliance certifications
Community reports · Incident timelines
LAYER 1
Behavioral Testing
Free tier only. 30 executions per scenario across 3 days. Statistical significance, not anecdote. 95% confidence intervals.
Easy / Medium / Hard / Adversarial difficulty tiers
Task success rate over 30 runs
Performance degradation under load
Cost accuracy vs. declared pricing
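As a rough illustration of what a 95% confidence interval on 30 runs can look like, the sketch below computes a Wilson score interval for a task success rate. The choice of the Wilson method, the function name `wilson_interval`, and the sample count of 24 successes are assumptions made for this example; the page does not say which interval method the framework uses.

```python
import math


def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to a two-sided 95% confidence level.
    The method is an assumption for illustration, not the framework's stated method.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= successes <= runs:
        raise ValueError("successes must be between 0 and runs")

    p = successes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical scenario: 24 successful executions out of 30.
low, high = wilson_interval(24, 30)
print(f"success rate {24 / 30:.0%}, 95% CI [{low:.1%}, {high:.1%}]")
```

With only 30 runs the interval is wide, which is why a single success rate is best read together with its bounds.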
LAYER C
Live Incident Monitoring
Continuous. A CVE published today changes a score today. Vendor certifications lock in a moment — VERDICT tracks ongoing reality.
New CVEs trigger immediate R-dimension update
Supply chain compromise tracking
Community-reported incidents investigated
Structural vs. isolated failure distinction

VERDICT Dimensions

Code | Dimension | What We Evaluate | Weight
V | Verifiability | Developer identity, OSS code disclosure, version transparency, third-party audits | 20
E | Effectiveness | Task success rate, cost accuracy, performance degradation (Layer 1 only) | 15
R | Resilience | CVE frequency & severity, patch response speed, structural failure patterns, supply chain integrity | 20
D | Data Conduct | GDPR posture, data minimization, AI training use disclosure, sub-processor transparency | 15
I | Identity & Control | Emergency stop mechanisms, human-in-the-loop availability, permission chain documentation | 10
C | Containment | Sandbox design philosophy (whitelist vs. blocklist), least-privilege defaults, tenant isolation | 10
T | Transparency | CVE publication posture, incident disclosure speed, AI safety framework adoption | 10
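The weights above add up to 100, and dropping E gives the 85-point Layer 0 ceiling. A minimal sketch of that arithmetic follows. The names `WEIGHTS`, `layer_max`, and `as_percent` are hypothetical, and expressing a score as a percentage of the Layer 0 maximum is an illustration rather than something the framework is stated to do.

```python
# Dimension weights as listed in the table above.
WEIGHTS = {"V": 20, "E": 15, "R": 20, "D": 15, "I": 10, "C": 10, "T": 10}

# E requires Layer 1 behavioral testing, so Layer 0 leaves it out.
LAYER0_EXCLUDED = {"E"}


def layer_max(weights: dict[str, int], excluded: set[str]) -> int:
    """Maximum attainable points once the excluded dimensions are removed."""
    return sum(w for code, w in weights.items() if code not in excluded)


def as_percent(score: float, maximum: float) -> float:
    """Express a score as a percentage of the attainable maximum."""
    return 100 * score / maximum


layer0_max = layer_max(WEIGHTS, LAYER0_EXCLUDED)
print("total weight:", sum(WEIGHTS.values()))
print("Layer 0 maximum:", layer0_max)

# Published Layer 0 scores from the reports above.
published = {"n8n": 40, "Make.com": 46, "Zapier": 48}
for platform, score in published.items():
    print(f"{platform}: {score}/{layer0_max} = {as_percent(score, layer0_max):.1f}%")
```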
About

Why VERDICT Exists

No vendor can evaluate themselves
Microsoft cannot independently evaluate Copilot. Zapier cannot independently evaluate Zapier. Structural independence is a moat that money cannot replicate — because the value disappears the moment it is sold.
Silence is a data point
If a vendor does not publish a stop mechanism, emergency override, or data retention policy, that absence is scored as a zero. We evaluate what's missing as seriously as what's present.
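One way to picture this rule is a checklist in which every required item gets a score and anything the vendor has not published contributes zero. The sketch below is only an illustration: the function `score_with_silence`, the item keys, and the point values are hypothetical and are not taken from the framework.

```python
from typing import Mapping, Optional, Sequence


def score_with_silence(
    findings: Mapping[str, Optional[float]],
    required: Sequence[str],
) -> dict[str, float]:
    """Score every required item; missing or undocumented items count as 0."""
    scored: dict[str, float] = {}
    for item in required:
        value = findings.get(item)  # None if the item is absent or unpublished
        scored[item] = 0.0 if value is None else float(value)
    return scored


# Hypothetical checklist built from the items named above.
required_items = ["stop_mechanism", "emergency_override", "data_retention_policy"]

# Hypothetical findings: one item documented, one explicitly not found,
# one never mentioned at all.
findings = {"stop_mechanism": 3.0, "data_retention_policy": None}

scores = score_with_silence(findings, required_items)
print(scores)
print("total:", sum(scores.values()))
```

Iterating over the required list rather than over the findings is the point of the design: an item the vendor never mentions still appears in the result, with a zero.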
Living scores, not snapshots
A CVE published today changes a score today. Vendor-sponsored certifications freeze a single point in time. VERDICT tracks the ongoing operational reality of trust.
Our own biases are disclosed
Our evaluation tooling includes Claude (Anthropic). Anthropic competes with some evaluated vendors. We disclose this explicitly, describe how we mitigate it, and subject our own methodology to the same standard we hold others to.