OWL: Open-Source Agent Swarm Tops GAIA Benchmark at 58.18
CAMEL-AI's OWL hits 58.18 on GAIA, edging Manus and lapping GPT-4 + tools. For agencies shipping multi-agent pipelines, the reinvent-CrewAI phase is over.
The GAIA benchmark exists because every other agent benchmark was getting gamed. It asks general-assistant questions a human can answer in a few minutes — read this PDF, search this site, do the arithmetic, give me the number — and grades on whether the agent actually got the right answer. Closed models with tool use have owned the top of the leaderboard for most of 2025.
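To make the grading idea concrete, here is a minimal Python sketch of answer-matching accuracy. It is not the official GAIA scorer: the function names and the normalization rules are assumptions chosen for illustration. The only point is that a leaderboard figure like 58.18 reads as a percentage of questions answered correctly.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase, trim, drop thousands separators and a trailing period."""
    text = answer.strip().lower()
    text = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)  # "1,234" becomes "1234"
    text = re.sub(r"\s+", " ", text)
    return text.rstrip(".")


def is_correct(predicted: str, expected: str) -> bool:
    """Compare numerically when both sides parse as numbers, else as text."""
    p, e = normalize(predicted), normalize(expected)
    try:
        return float(p) == float(e)
    except ValueError:
        return p == e


def score(results: list[tuple[str, str]]) -> float:
    """Return the percentage of (predicted, expected) pairs that match."""
    if not results:
        return 0.0
    correct = sum(is_correct(p, e) for p, e in results)
    return round(100 * correct / len(results), 2)


# Hypothetical run: two of three answers match after normalization.
sample = [("1,234", "1234"), ("Paris.", "paris"), ("42", "41")]
sample_score = score(sample)
```

There is no partial credit in this sketch: an answer either matches after normalization or it counts as wrong.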
In May, an open-source project called OWL, built on CAMEL-AI's CAMEL framework, posted 58.18 on the public GAIA leaderboard. That's above the best Manus runs anyone has reproduced, well above GPT-4 with the standard tool stack, and roughly double what AutoGPT-style loops were managing a year ago. The code is on GitHub. You can run it tonight.
What OWL is doing differently
The trick isn't a clever prompt. It's that OWL treats the things agents actually need — a browser, a terminal, a file system, an MCP server — as first-class citizens instead of bolted-on tools.
- Role-playing agent pairs from CAMEL — task planner and task executor that negotiate, not a single chain-of-thought monologue
- Browser automation via Playwright, with real DOM access and login state preservation
- Terminal and file-system tools the agent can chain without re-asking permission for every command
- MCP server support, so any tool you've already wrapped for Claude or Cursor drops in with no glue code
- Document toolkit for PDFs, spreadsheets, and images — the formats GAIA actually tests on
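The list above can be sketched as a planner/executor loop over a tool registry. This is an illustrative toy, not OWL's or CAMEL's actual API: `Planner`, `Executor`, `role_play`, and the stub tools are all hypothetical names, and the scripted planner stands in for an LLM that would read each result before choosing the next step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Tool = Callable[[str], str]
Instruction = Tuple[str, str]  # (tool name, argument)


@dataclass
class Planner:
    """Hypothetical planner that hands out one scripted instruction per turn."""
    steps: List[Instruction]
    cursor: int = 0

    def next_instruction(self, last_result: Optional[str]) -> Optional[Instruction]:
        # A real planner would inspect last_result; this stub ignores it.
        if self.cursor >= len(self.steps):
            return None
        step = self.steps[self.cursor]
        self.cursor += 1
        return step


@dataclass
class Executor:
    """Hypothetical executor that routes each instruction to a registered tool."""
    tools: Dict[str, Tool]

    def run(self, tool_name: str, argument: str) -> str:
        tool = self.tools.get(tool_name)
        if tool is None:
            return f"error: unknown tool {tool_name!r}"
        return tool(argument)


def role_play(planner: Planner, executor: Executor, max_turns: int = 10) -> List[str]:
    """Alternate planner and executor turns until the plan ends or turns run out."""
    transcript: List[str] = []
    last_result: Optional[str] = None
    for _ in range(max_turns):
        instruction = planner.next_instruction(last_result)
        if instruction is None:
            break
        last_result = executor.run(*instruction)
        transcript.append(last_result)
    return transcript


# Stub tools standing in for a browser and a terminal.
tools: Dict[str, Tool] = {
    "browser": lambda url: f"fetched {url}",
    "terminal": lambda cmd: f"ran {cmd}",
}
plan = [("browser", "https://example.com"), ("terminal", "ls"), ("pdf", "report.pdf")]
transcript = role_play(Planner(plan), Executor(tools))
```

Because every tool sits behind the same string-in, string-out interface, adding a document reader or an MCP-backed tool in this sketch would mean one more registry entry rather than a change to the loop.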
The architecture reads like someone sat down with a year of CrewAI and AutoGen bug reports and just fixed them. The result is an agent loop that doesn't get stuck searching the same Google result three times or hallucinating a file path because it forgot what directory it was in.
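One plausible way to stop an agent from repeating the same search is a guard that counts identical tool calls. The sketch below is an assumption about how such a guard could look, not a description of what OWL does internally; `RepeatGuard` and its parameters are hypothetical.

```python
from collections import Counter


class RepeatGuard:
    """Hypothetical guard that refuses a tool call once it has run too often."""

    def __init__(self, max_repeats: int = 2) -> None:
        self.max_repeats = max_repeats
        self.seen: Counter = Counter()

    def allow(self, tool: str, argument: str) -> bool:
        # Normalize the argument so trivial variations count as the same call.
        key = (tool, argument.strip().lower())
        self.seen[key] += 1
        return self.seen[key] <= self.max_repeats


guard = RepeatGuard(max_repeats=2)
decisions = [
    guard.allow("search", "OWL GAIA score"),
    guard.allow("search", "owl gaia score "),
    guard.allow("search", "owl gaia score"),   # third identical call is refused
    guard.allow("search", "camel-ai owl license"),
]
```

When the guard refuses a call, the loop can hand that fact back to the planner so it picks a different query or tool instead of spinning.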
Why this matters if you sell agent pipelines
Every AI agency I know has spent the last 18 months building roughly the same thing — a multi-agent pipeline on top of CrewAI or LangGraph, with custom browser tools, custom retry logic, and a custom memory layer. The first 80% takes a weekend. The last 20% takes six months and produces a brittle stack you're afraid to redeploy.
The play for agency owners is to fork OWL, swap the role descriptions for your client's use case, and ship the proof-of-concept in a week instead of a quarter. The framework handles the unsexy parts (retries, tool routing, browser session management) that used to consume the budget.
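For a sense of what those unsexy parts look like when you write them yourself, here is a minimal retry helper with exponential backoff. It is a generic sketch, not code from OWL: `with_retries` and its parameters are hypothetical, and the `sleep` argument exists only so the delays can be swapped out in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on any exception with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # keeps type checkers satisfied


# Hypothetical flaky tool: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "page contents"


delays: list = []
result = with_retries(flaky_fetch, attempts=3, base_delay=0.5, sleep=delays.append)
```

Multiply this by every browser call, file read, and API request in a pipeline and it becomes clear where the budget goes.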
The opinion
Closed-source agent frameworks had a 12-month head start and they've already lost the technical lead. The next phase of the agent market is not going to be decided by whose orchestrator is smartest, because the orchestrators are all converging on the same patterns. It's going to be decided by who has the cleanest integration surface and the most permissive license.
OWL has both. CAMEL-AI has been quietly shipping for two years and the team treats benchmarks as evidence, not marketing. If you're still maintaining a homegrown agent loop in 2026, you're paying a tax to feel proud of your codebase. The cheaper move is to delete it.
What to actually do this week
- Clone camel-ai/owl, run the GAIA example with your own API key, and see how it handles a real client task
- Audit your current multi-agent pipelines: anything CrewAI or AutoGen is doing for you, OWL probably does more cleanly
- If you're shopping agent vendors, ask them for their GAIA score. If they don't have one, that's your answer
"The open-source agent stack just shipped a benchmark-leading framework with an Apache 2.0 license. The question for every agency is no longer build vs. buy. It's build vs. fork."
Ascero AI. “OWL: Open-Source Agent Swarm Tops GAIA Benchmark at 58.18.” May 10, 2026. https://asceroai.com/news/owl-multi-agent-gaia-leader
Free to reference with attribution and a link back to this page.