
What Happened
OpenAI launched Deep Research as a feature for ChatGPT Pro subscribers in late January 2025. The feature enables users to request comprehensive research reports on complex topics. The system autonomously browses the web, collects information from multiple sources, and synthesizes findings into structured reports.
On February 4, 2025, Hugging Face announced Open Deep Research through its official blog. The project was led by Aymeric Roucher, who coordinated the 24-hour development sprint. The team published the complete codebase on GitHub as part of the smolagents repository.
Google introduced its own "Deep Research" feature for Gemini in December 2024, predating OpenAI's implementation. The convergence of three major AI organizations on similar functionality within a few months indicates growing industry focus on autonomous research agents.
Key Claims and Evidence
Hugging Face reported achieving 55.15 percent accuracy on the GAIA benchmark after 24 hours of development. The GAIA benchmark tests an AI model's ability to gather and synthesize information from multiple sources, making it a relevant measure for research agent capabilities.
OpenAI's Deep Research scored 67.36 percent on the same benchmark with single-pass responses, according to OpenAI's published documentation. When 64 responses were combined using a consensus mechanism, OpenAI's score increased to 72.57 percent.
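OpenAI has not published the exact aggregation method behind its 64-sample consensus score, but the general idea, sampling many independent runs and taking the most common answer, can be sketched with a simple majority vote (all names here are illustrative, not OpenAI's implementation):

```python
from collections import Counter

def consensus_answer(answers: list[str]) -> str:
    """Pick the most common answer among independently sampled runs.

    This is a generic self-consistency vote; OpenAI has not disclosed
    the exact aggregation behind its 64-sample score, so this sketch
    only illustrates the principle.
    """
    normalized = [a.strip().lower() for a in answers]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner

# A model that is right on a majority of individual samples will
# usually converge on the correct answer once votes are pooled.
samples = ["Paris"] * 40 + ["Lyon"] * 24
assert consensus_answer(samples) == "paris"
```

This explains why the pooled score (72.57 percent) exceeds the single-pass score (67.36 percent): voting suppresses errors that occur in a minority of runs.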
The performance gap between standalone GPT-4o (29 percent) and the same model within an agentic framework (67 percent) demonstrates that architectural design contributes substantially to capability. Aymeric Roucher told Ars Technica that while the project uses closed-weights models like GPT-4o for optimal performance, "it can be switched to any other model, so [it] supports a fully open pipeline."
The Hugging Face team incorporated web browsing and text inspection tools from Microsoft Research's Magentic-One agent project, released in late 2024. Building on existing open source components shortened development time.

Opportunities
Open Deep Research provides developers with free access to study and modify autonomous research agent technology. Organizations can deploy the system internally without subscription costs or API dependencies on proprietary services.
Researchers can examine the agentic framework architecture to understand how multi-step task completion improves language model performance. The 38-percentage-point improvement from standalone GPT-4o to agentic GPT-4o offers a concrete case study in capability amplification.
The modular design allows substitution of the underlying language model. Organizations with access to open-weights models can build fully transparent research pipelines without relying on closed commercial APIs.
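The model-substitution point can be made concrete with a minimal interface sketch. The class and function names below are illustrative, not smolagents' actual API; the point is that agent logic written against a narrow model interface works unchanged whether the backend is a closed API or a local open-weights model:

```python
from typing import Protocol

class ChatModel(Protocol):
    """Minimal interface an agent needs from any language model backend."""
    def generate(self, prompt: str) -> str: ...

class ClosedAPIModel:
    """Stand-in for a hosted, closed-weights API backend (illustrative)."""
    def generate(self, prompt: str) -> str:
        raise NotImplementedError("would call a commercial API here")

class LocalOpenModel:
    """Stand-in for a locally served open-weights model (illustrative)."""
    def generate(self, prompt: str) -> str:
        return f"(local model answer to: {prompt})"

def run_agent(model: ChatModel, query: str) -> str:
    # Agent logic is identical regardless of which backend is supplied.
    return model.generate(f"Research task: {query}")

print(run_agent(LocalOpenModel(), "history of the GAIA benchmark"))
```

Swapping `LocalOpenModel` for `ClosedAPIModel` requires no change to `run_agent`, which is the property that makes a fully open pipeline possible.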
Educational institutions and smaller research groups gain access to research automation capabilities previously available only through expensive subscriptions. The open source license permits modification and redistribution.
Limitations and Risks
Open Deep Research does not match OpenAI's benchmark performance. The roughly 12-percentage-point gap between 55.15 and 67.36 percent represents a meaningful capability difference on real-world research tasks.
The current implementation relies on OpenAI's API for optimal performance. While the framework supports open-weights models, Hugging Face has not published benchmark results for those configurations. Users seeking fully open pipelines may experience degraded performance.
Autonomous web browsing agents introduce security considerations. Systems that can navigate websites and extract information may encounter malicious content or inadvertently access restricted resources. The project documentation does not detail security hardening measures.
Research agents can produce plausible-sounding but incorrect information. The GAIA benchmark measures accuracy, but a 55 percent score means nearly half of responses contain errors. Users must verify agent outputs independently.

How the Technology Works
Open Deep Research implements an agentic architecture that wraps a large language model with tools for web browsing, text extraction, and multi-step planning. The agent receives a research query, decomposes it into subtasks, executes web searches, extracts relevant information, and synthesizes findings into a report.
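The stages described above can be laid out as a schematic pipeline. The real implementation lives in the smolagents codebase and drives each stage with a language model; the stub functions below merely make the data flow explicit:

```python
def decompose(query: str) -> list[str]:
    # In the real system an LLM proposes subtasks; this is a stub.
    return [f"background on {query}", f"recent developments in {query}"]

def web_search(subtask: str) -> list[str]:
    # Stub: a real browsing tool would return page snippets.
    return [f"snippet about {subtask}"]

def synthesize(query: str, notes: list[str]) -> str:
    # A real agent would prompt the LLM to write a structured report.
    body = "\n".join(f"- {note}" for note in notes)
    return f"Report on {query}:\n{body}"

def research(query: str) -> str:
    """Decompose a query, gather evidence per subtask, then synthesize."""
    notes: list[str] = []
    for subtask in decompose(query):
        notes.extend(web_search(subtask))
    return synthesize(query, notes)
```

Each stage is independently replaceable, which is what makes the architecture amenable to swapping tools or models.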
The framework uses a code-based agent approach where the language model generates Python code to invoke tools rather than selecting from predefined actions. According to Hugging Face, this approach provides flexibility in handling diverse research tasks.
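The difference between code-based action and a fixed action menu can be illustrated with a toy interpreter. smolagents implements a far more careful sandboxed Python interpreter; the bare `exec()` below is only for illustration and is unsafe for genuinely untrusted code:

```python
def search(query: str) -> str:
    """A toy tool; real tools browse the web or inspect text."""
    return f"results for '{query}'"

TOOLS = {"search": search}

def run_model_code(generated_code: str) -> dict:
    """Execute model-emitted Python against the registered tools only.

    Illustrative sketch: smolagents uses a restricted interpreter,
    not a raw exec(), to limit what generated code can do.
    """
    namespace: dict = dict(TOOLS)
    exec(generated_code, {"__builtins__": {}}, namespace)
    return namespace

# The model writes code instead of choosing from a fixed action menu,
# so it can compose tools, branch, and bind intermediate results:
code_from_model = "answer = search('GAIA benchmark')"
state = run_model_code(code_from_model)
print(state["answer"])
```

Because generated code can use variables, loops, and conditionals, a single model turn can chain several tool calls, which is the flexibility Hugging Face cites.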
Web browsing capabilities derive from Microsoft Research's Magentic-One project. The agent can navigate web pages, extract text content, and follow links to gather information across multiple sources. Text inspection tools parse and analyze retrieved content.
The multi-step execution loop allows the agent to refine its approach based on intermediate results. If initial searches fail to find relevant information, the agent can reformulate queries or explore alternative sources. The final synthesis step combines gathered information into a coherent report.
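The refine-on-failure behavior amounts to a bounded retry loop. The helpers below are stubs standing in for LLM-driven steps (the stub search only "succeeds" after reformulation, to make the control flow visible):

```python
def attempt_search(query: str) -> list[str]:
    # Stub: pretend only a date-qualified query finds results.
    return ["useful snippet"] if "2025" in query else []

def reformulate(query: str) -> str:
    # A real agent would ask the LLM to rewrite the failed query.
    return query + " 2025"

def gather(query: str, max_rounds: int = 3) -> list[str]:
    """Retry with reformulated queries until results appear or rounds run out."""
    for _ in range(max_rounds):
        results = attempt_search(query)
        if results:
            return results
        query = reformulate(query)
    return []
```

Bounding the number of rounds matters in practice: without a cap, an agent that never finds relevant sources would loop (and spend API calls) indefinitely.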
The smolagents library provides the underlying agent infrastructure. Developers can customize tool sets, modify planning strategies, and substitute language models through configuration. The codebase includes examples for common research tasks and documentation for extending functionality.
Broader Industry Implications
The 24-hour replication timeline challenges assumptions about competitive moats in AI product development. Features requiring substantial engineering investment from major labs can be approximated by smaller teams building on open source foundations.
The pattern mirrors earlier developments in large language models, where open-weights releases from Meta and others enabled rapid ecosystem development outside major commercial labs. Agentic frameworks may follow a similar trajectory.
Microsoft Research's contribution through Magentic-One illustrates how open source components accelerate development across organizations. Hugging Face's ability to incorporate those tools shortened their development cycle significantly.
The convergence of OpenAI, Google, and Hugging Face on research agent functionality suggests this capability category will become standard in AI assistants. Competition may shift toward execution quality, reliability, and integration rather than feature availability.
Confirmed Facts and Open Questions
Confirmed:
- Hugging Face released Open Deep Research on February 4, 2025
- Development took 24 hours following OpenAI's Deep Research announcement
- GAIA benchmark accuracy: 55.15 percent (Hugging Face) vs 67.36 percent (OpenAI single-pass)
- The project uses Microsoft Research's Magentic-One tools
- Code is available on GitHub under the smolagents repository
- Aymeric Roucher leads the project
Unclear:
- Performance characteristics with open-weights models instead of GPT-4o
- Detailed comparison of failure modes between Open Deep Research and OpenAI's implementation
- Resource requirements and cost comparisons for equivalent workloads
- Security review status and hardening measures for web browsing components
Signals to Monitor
Hugging Face indicated plans to add support for more file formats and vision-based web browsing capabilities. The team is also working on replicating OpenAI's Operator, which controls computer interfaces through a browser environment.
Community contributions to the smolagents repository will indicate adoption levels and development velocity. GitHub activity metrics provide observable signals of ecosystem engagement.
Benchmark results from other organizations attempting similar implementations will establish whether Hugging Face's 24-hour timeline and 55 percent accuracy represent typical or exceptional outcomes.
OpenAI's response to open source replication of Deep Research may influence future feature announcements and API access policies. Changes to terms of service or pricing would signal strategic adjustments.


