OWL: Open-Source Agent Swarm Tops GAIA Benchmark at 58.18

CAMEL-AI's OWL scores 58.18 on GAIA, ahead of Manus and well clear of GPT-4 with tools. For agencies shipping multi-agent pipelines, the reinvent-CrewAI phase is over.

By Kadin Nestler · May 10, 2026 · 5 min read

GAIA benchmark leaderboard (general agents)

OWL (CAMEL-AI): 58.18
Manus (closed): ~52 (estimated)
GPT-4 + tools: ~46
AutoGPT-style: ~32

The GAIA benchmark exists because every other agent benchmark was getting gamed. It asks general-assistant questions a human can answer in a few minutes — read this PDF, search this site, do the arithmetic, give me the number — and grades on whether the agent actually got the right answer. Closed models with tool use have owned the top of the leaderboard for most of 2025.
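To make "grades on whether the agent actually got the right answer" concrete, here is a minimal sketch of answer-match scoring. It is not GAIA's official scorer. The normalization rules and the names normalize, is_correct, and score are assumptions chosen for illustration.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase, drop thousands separators and stray punctuation, collapse spaces."""
    text = answer.strip().lower().replace(",", "")
    text = re.sub(r"[^\w\s.\-]", "", text)  # keep letters, digits, spaces, dots, minus
    text = text.rstrip(".")
    return re.sub(r"\s+", " ", text)


def is_correct(predicted: str, expected: str) -> bool:
    """Match on normalized text first, then fall back to a numeric comparison."""
    p, e = normalize(predicted), normalize(expected)
    if p == e:
        return True
    try:
        return float(p) == float(e)
    except ValueError:
        return False


def score(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Percentage of tasks answered correctly; a missing answer counts as wrong."""
    if not gold:
        return 0.0
    hits = sum(is_correct(predictions.get(task_id, ""), answer)
               for task_id, answer in gold.items())
    return round(100.0 * hits / len(gold), 2)


# Illustrative data only, not real GAIA tasks.
gold = {"t1": "1024", "t2": "Paris", "t3": "3.5"}
predictions = {"t1": "1,024", "t2": "paris.", "t3": "4"}
print(score(predictions, gold))
```

A score like 58.18 is this kind of percentage: correct final answers over total tasks, with no partial credit for a plausible-sounding wrong number.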

In May, an open-source project called OWL, built on top of CAMEL-AI, posted 58.18 on the public GAIA leaderboard. That's above the best Manus runs anyone has reproduced, well above GPT-4 with the standard tool stack, and nearly double the roughly 32 that AutoGPT-style loops were managing a year ago. The code is on GitHub. You can run it tonight.

What OWL is doing differently

The trick isn't a clever prompt. It's that OWL treats the things agents actually need — a browser, a terminal, a file system, an MCP server — as first-class citizens instead of bolted-on tools.

Role-playing agent pairs from CAMEL — task planner and task executor that negotiate, not a single chain-of-thought monologue
Browser automation via Playwright, with real DOM access and login state preservation
Terminal and file-system tools the agent can chain without re-asking permission for every command
MCP server support, so any tool you've already wrapped for Claude or Cursor drops in with no glue code
Document toolkit for PDFs, spreadsheets, and images — the formats GAIA actually tests on
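The planner and executor pairing from the first item can be sketched in plain Python. This is a rough illustration of the pattern, not CAMEL's or OWL's actual classes: Planner, Executor, Transcript, and run_pair are hypothetical names, and the demo functions are stubs standing in for LLM-backed agents.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical signatures for illustration; these are not CAMEL or OWL types.
Planner = Callable[[str, list[str]], str]  # (task, results so far) -> next instruction or "DONE"
Executor = Callable[[str], str]            # instruction -> result


@dataclass
class Transcript:
    turns: list[tuple[str, str]] = field(default_factory=list)


def run_pair(task: str, planner: Planner, executor: Executor,
             max_rounds: int = 5) -> Transcript:
    """Alternate planner and executor turns until the planner says DONE."""
    transcript = Transcript()
    history: list[str] = []
    for _ in range(max_rounds):
        instruction = planner(task, history)
        if instruction == "DONE":
            break
        result = executor(instruction)
        transcript.turns.append((instruction, result))
        history.append(result)  # the planner sees every result before its next turn
    return transcript


# Stub agents; in a real system both would call a model and real tools.
def demo_planner(task: str, history: list[str]) -> str:
    steps = ["open report.pdf", "extract total from page 3"]
    return steps[len(history)] if len(history) < len(steps) else "DONE"


def demo_executor(instruction: str) -> str:
    return f"ok: {instruction}"


transcript = run_pair("find the total", demo_planner, demo_executor)
for instruction, result in transcript.turns:
    print(instruction, "->", result)
```

The point of the split is that the planner re-plans after each result instead of committing to one long monologue up front, and max_rounds caps the conversation so it cannot run forever.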

The architecture reads like someone sat down with a year of CrewAI and AutoGen bug reports and just fixed them. The result is an agent loop that doesn't get stuck searching the same Google result three times or hallucinating a file path because it forgot what directory it was in.
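Both failure modes come down to state the loop forgets between steps. A minimal sketch of a guard that remembers it might look like the following. LoopGuard and its methods are hypothetical, and this is not how OWL implements it.

```python
import posixpath


class LoopGuard:
    """Remembers tool calls and the working directory between agent steps."""

    def __init__(self, max_repeats: int = 2, cwd: str = "/workspace"):
        self.max_repeats = max_repeats
        self.cwd = cwd
        self._seen: dict[tuple[str, str], int] = {}

    def allow(self, tool: str, argument: str) -> bool:
        """Return False once the same call has been made more than max_repeats times."""
        key = (tool, argument.strip().lower())
        self._seen[key] = self._seen.get(key, 0) + 1
        return self._seen[key] <= self.max_repeats

    def change_dir(self, path: str) -> str:
        """Update the tracked working directory, as a shell `cd` would."""
        self.cwd = posixpath.normpath(posixpath.join(self.cwd, path))
        return self.cwd

    def resolve(self, path: str) -> str:
        """Turn a relative path into an absolute one using the tracked directory."""
        return posixpath.normpath(posixpath.join(self.cwd, path))


guard = LoopGuard(max_repeats=2)
for attempt in range(3):
    print(attempt + 1, guard.allow("search", "GAIA benchmark leaderboard"))

guard.change_dir("data")
print(guard.resolve("../report.pdf"))
```

With max_repeats set to 2, the third identical search is refused, which forces the agent to try something else. Resolving paths against a tracked directory means the agent never has to guess where it is.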

Why this matters if you sell agent pipelines

Every AI agency I know has spent the last 18 months building roughly the same thing — a multi-agent pipeline on top of CrewAI or LangGraph, with custom browser tools, custom retry logic, and a custom memory layer. The first 80% takes a weekend. The last 20% takes six months and produces a brittle stack you're afraid to redeploy.

THE TAKEAWAY

When a public, Apache 2.0-licensed framework is leading the closed competition on a credible benchmark, "we built our own orchestrator" stops being a moat. It starts being a tax on your delivery timeline.

The play for agency owners is to fork OWL, swap the role descriptions for your client's use case, and ship the proof-of-concept in a week instead of a quarter. The framework handles the unsexy parts (retries, tool routing, browser session management) that used to consume the budget.
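For a sense of what those unsexy parts look like when you write them yourself, here is a small sketch of tool routing with exponential-backoff retries. The names with_retries and route, the delay values, and the flaky_browser stub are all assumptions for illustration. None of this is OWL's API.

```python
import time
from typing import Callable


def with_retries(call: Callable[[], str], attempts: int = 3, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Run call(), doubling the wait after each failure, and give up after `attempts` tries."""
    last_error = None
    for attempt in range(attempts):
        try:
            return call()
        except Exception as error:  # a real pipeline would catch narrower error types
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


def route(tools: dict[str, Callable[[str], str]], tool_name: str, argument: str,
          **retry_options) -> str:
    """Look up a tool by name and invoke it with retry protection."""
    if tool_name not in tools:
        raise KeyError(f"unknown tool: {tool_name}")
    tool = tools[tool_name]
    return with_retries(lambda: tool(argument), **retry_options)


# Stub tool that fails twice before succeeding, to exercise the retry path.
calls = {"count": 0}


def flaky_browser(url: str) -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("page did not load")
    return f"loaded {url}"


delays: list[float] = []
result = route({"browser": flaky_browser}, "browser", "https://example.com",
               sleep=delays.append)  # record the delays instead of actually sleeping
print(result, delays)
```

Multiply this by every tool, every error type, and every client, and it is easy to see where the budget used to go.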

The opinion

Closed-source agent frameworks had a 12-month head start and they've already lost the technical lead. The next phase of the agent market is not going to be decided by whose orchestrator is smartest — they're converging on the same patterns. It's going to be decided by who has the cleanest integration surface and the most permissive license.

OWL has both. CAMEL-AI has been quietly shipping for two years and the team treats benchmarks as evidence, not marketing. If you're still maintaining a homegrown agent loop in 2026, you're paying a tax to feel proud of your codebase. The cheaper move is to delete it.

What to actually do this week

Clone camel-ai/owl, run the GAIA example with your own API key, then see how it handles a real client task
Audit your current multi-agent pipelines: anything CrewAI or AutoGen is doing for you, OWL probably does more cleanly
If you're shopping agent vendors, ask them their GAIA score. If they don't have one, that's your answer

"The open-source agent stack just shipped a benchmark-leading framework with an Apache 2.0 license. The question for every agency is no longer build vs. buy. It's build vs. fork."

Cite this article

Ascero AI. “OWL: Open-Source Agent Swarm Tops GAIA Benchmark at 58.18.” May 10, 2026. https://asceroai.com/news/owl-multi-agent-gaia-leader

Free to reference with attribution and a link back to this page.