Microsoft reveals Project Ire, an autonomous AI system that reverse-engineers malware

Microsoft has revealed Project Ire, a novel autonomous AI program that can fully reverse-engineer software and decide whether it is malicious or benign. The prototype was built by teams across Microsoft's research and defender groups and merges global telemetry, tool orchestration, and large models to speed classification for enterprise security teams.

The system blends decompilers, memory-analysis sandboxes, documentation search, and custom tooling with advanced language reasoning to produce audit-ready evidence. In public Windows driver tests, the program achieved 0.98 precision, 0.83 recall, and a false-positive rate of about 2%, showing real promise in reducing unnecessary escalations.

Project Ire authored the first conviction case for an advanced threat sample that was strong enough to trigger automatic blocking in Defender at scale. That matters because malware classification lacks a single computable validator and often requires costly expert review. With one billion devices scanned monthly, fast, precise detection can cut delays and improve response to threats.

Microsoft reveals Project Ire: what’s new in autonomous malware reverse engineering

An LLM-guided investigator now treats each unknown binary as a fresh forensic task, rebuilding intent from raw code.

The system automates the gold standard of reverse engineering by fully inspecting a file without prior labels. It calls specialized tools through APIs, runs decompilers and sandboxes, and synthesizes outputs with iterative LLM reasoning.

Unlike signature-driven engines, this approach avoids simple pattern matching. Instead it documents why code looks malicious or benign, producing an auditable chain of evidence that helps an expert validate or override decisions.

Agentic triage identifies file type, structure, and suspicious regions early. That speeds initial decisions and reduces unnecessary manual work, helping analysis scale across many drivers and binaries.

Adjudication aims for consistent outcomes across similar samples, reducing subjectivity between reviewers. The method shines on ambiguous behaviors—packers, anti-analysis tricks, and multi-stage payloads—because it classifies based on observed capabilities, not just signature matches.

Inside Project Ire’s technical foundation and workflow

At its core, the system layers automated tool calls with LLM reasoning to turn raw binaries into explainable behavioral findings.

Agentic architecture

The agent coordinates callable reverse engineering tools with reasoning to progress from raw binary analysis into a structured control flow understanding and behavior-centric conclusions. Triage first identifies file type, layout, and hotspots so work focuses on likely threat regions.
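The triage step described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not Project Ire's actual code: it identifies a file's type from its magic bytes and flags byte strings that often warrant closer inspection.

```python
# Minimal triage sketch (illustrative only): classify a binary's file type
# from its magic bytes and flag suspicious embedded strings, mirroring the
# "identify type, layout, and hotspots" step described above.

SIGNATURES = {
    b"MZ": "pe",        # Windows PE executable or driver
    b"\x7fELF": "elf",  # Linux ELF binary
}

# Byte strings whose presence often marks regions worth deeper analysis.
SUSPICIOUS_MARKERS = [b"VirtualAlloc", b"WriteProcessMemory", b"http://"]

def triage(data: bytes) -> dict:
    """Return a coarse triage report: file type plus any marker hits."""
    ftype = "unknown"
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            ftype = name
            break
    hotspots = [m.decode() for m in SUSPICIOUS_MARKERS if m in data]
    return {"type": ftype, "hotspots": hotspots}
```

A real triage pass would go much further (parsing PE headers, sections, and imports), but the shape is the same: cheap structural checks first, so expensive analysis focuses on likely threat regions.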

Toolchain and control flow

Frameworks like angr and Ghidra reconstruct a control flow graph that anchors memory modeling. The system uses multiple decompilers, Project Freta-based sandboxes for memory snapshots, and documentation search to clarify API semantics and runtime effects.
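To make "reconstructing a control flow graph" concrete, here is a toy sketch over a simplified instruction stream. Real frameworks like angr and Ghidra operate on machine code and handle far more cases; this only illustrates the core idea of splitting code into basic blocks at branch targets and connecting them with edges.

```python
# Toy control-flow-graph recovery over a simplified instruction stream.
# Each instruction is (addr, op, target); "jmp" is unconditional, "jz"
# conditional. Real CFG recovery (angr, Ghidra) is far more involved.

def build_cfg(instrs):
    """Return {block_leader_addr: set(successor_leader_addrs)}."""
    # Block leaders: the entry point, every branch target, and the
    # instruction that falls through after a branch.
    leaders = {instrs[0][0]}
    for i, (addr, op, target) in enumerate(instrs):
        if op in ("jmp", "jz"):
            leaders.add(target)
            if i + 1 < len(instrs):
                leaders.add(instrs[i + 1][0])

    cfg = {leader: set() for leader in leaders}
    block = instrs[0][0]
    for i, (addr, op, target) in enumerate(instrs):
        if addr in leaders:
            block = addr
        nxt = instrs[i + 1][0] if i + 1 < len(instrs) else None
        if op == "jmp":
            cfg[block].add(target)
        elif op == "jz":
            cfg[block].add(target)          # branch taken
            if nxt is not None:
                cfg[block].add(nxt)         # fall-through
        elif nxt is not None and nxt in leaders:
            cfg[block].add(nxt)             # straight-line flow into next block
    return cfg
```

The resulting graph is what anchors later steps: memory modeling, function-level summaries, and claims about behavior all hang off CFG nodes.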

Chain of evidence and validator

Iterative function analysis lets the agent request targeted decompilation, symbol and string scans, and semantic summaries to collect claims about process tampering, network C2, or driver I/O. Every claim links to artifacts—disassembly snippets, decompiler outputs, and CFG nodes—so reviewers can audit the path from observation to classification.

The validator cross-checks assertions against the evidence log and expert notes, forcing revision or removal of unsupported claims. This system uses LLM-driven reasoning to bind low-level code semantics to higher-level behavior, producing a transparent, auditable verdict for security teams analyzing Windows drivers and other software.
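The claim-and-validator pattern above can be sketched in a few lines. This is a hypothetical illustration of the concept, not Project Ire's implementation: each claim must cite artifacts that actually exist in the evidence log, or it is dropped from the final report.

```python
# Hypothetical sketch of the claim/evidence/validator pattern: every
# claim cites artifact IDs (decompiler outputs, CFG nodes, strings),
# and the validator drops any claim whose evidence is missing.

from dataclasses import dataclass

@dataclass
class Claim:
    statement: str
    artifact_ids: list  # e.g. "cfg:0x401000", "str:evil.example"

def validate(claims, evidence_log):
    """Keep only claims whose cited artifacts all appear in the log."""
    return [c for c in claims
            if c.artifact_ids
            and all(a in evidence_log for a in c.artifact_ids)]
```

The payoff is auditability: a reviewer can walk from any surviving claim back to the exact disassembly snippet or graph node that supports it.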

Performance at a glance: precision, recall, and real-world tests

Performance testing found near-perfect precision on benchmark drivers, and more conservative results on hard, real-world files. That split helps teams weigh immediate benefits against areas needing deeper analysis.

Public Windows drivers benchmark

On a curated dataset of public Windows drivers—malicious samples from a living-off-the-land drivers collection and benign builds from Windows Update—the system achieved a precision of 0.98 and recall of 0.83.

False positives sat near 2%, and the agent correctly labeled roughly 90% of files. These results suggest readiness for integration alongside human review.
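For readers unfamiliar with these metrics, the relationship between them and raw confusion-matrix counts is straightforward. The counts below are illustrative only (the actual dataset sizes are not given here); they are chosen so the ratios land near the reported figures.

```python
# How precision, recall, and false-positive rate derive from a
# confusion matrix. The counts below are illustrative, not the
# benchmark's actual numbers.

def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)  # of files flagged malicious, how many were
    recall = tp / (tp + fn)     # of truly malicious files, how many were caught
    fp_rate = fp / (fp + tn)    # of benign files, how many were misflagged
    return precision, recall, fp_rate

# e.g. 83 true positives, 2 false positives, 17 missed, 98 true negatives
p, r, f = metrics(83, 2, 17, 98)  # ≈ 0.98, 0.83, 0.02
```

High precision with moderate recall is the conservative profile described here: when the system flags a file, it is very likely malicious, at the cost of missing some threats that fall to human review.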

Hard-target Defender evaluation

In a tougher evaluation, the system autonomously processed nearly 4,000 files—collected after the models' training cutoff—that prior automated tools could not classify.

The system reached 0.89 precision, 0.26 recall, and ~4% false positives, showing conservative accuracy with low error rates on novel samples.

Case studies and operational impact

One kernel rootkit case exposed routines that terminate Explorer.exe threads, alter registry values, patch entry points for hooking, and perform HTTP GET calls consistent with command-and-control.

A second sample targeted antivirus processes (avp.exe, 360Tray.exe), registering and killing matching processes. An initially flagged anti-debug claim was invalidated by the validator and fixed by updating decompiler rules.

Why this matters: high precision reduces alert fatigue while evidence-rich reports create a reusable trail to refine tools and guide controlled deployment into security pipelines.

How Project Ire differs from traditional antivirus and prior ML approaches

This program inspects control flow and runtime semantics so defenders see actionable reasons behind each verdict.

Beyond signatures: reverse engineering-first adjudication

Unlike signature engines that flag known patterns, this approach performs full reverse engineering to reconstruct behavior. It links function-level findings to runtime effects and explains intent, mechanism, and capability in an auditable report.

The outcome is a verdict grounded in verifiable artifacts, not just model scores. That improves trust in automated classification when teams decide to block or contain suspicious files.

Reducing analyst fatigue with consistent, scalable binary analysis

Standardized workflows cut variance across analysts and make results reproducible. A clear chain-of-evidence speeds peer review for auditors and incident responders.

Traditional ML detectors learn patterns from examples. By contrast, this system uses LLM reasoning to direct tools, interpret outputs, and tie claims to concrete artifacts. Benchmarks show lower false positives, which means engineers spend more time on real threats and less on noise.

Designed as a complement to Microsoft Defender, the system augments expert reverse-engineering work rather than replacing it.

Security operations impact and deployment roadmap

“The design aims to let the analyzer classify unknown files at first encounter and flag memory-resident threats in real time.”

The roadmap moves the prototype into a Binary Analyzer that integrates with Microsoft Defender telemetry and response tooling. This path ties agent findings to existing alerts, allowing fast blocking or analyst review when needed.

Operational benefits include faster triage for security operations, fewer false positives, and evidence-rich reports that speed handoffs to incident response and threat intelligence teams. Clear artifacts help teams trust automated decisions.

Scaling and safeguards

Planned deployment targets classification at first sight and eventual memory-level detection on Windows endpoints. The goal is to shorten dwell time and improve outcomes against fast-moving malware.

The agent can run fully autonomously on difficult files, offering conservative, high-precision judgments suitable for gating before human review. Validator safeguards and expert alignment aim to keep misclassifications low and remediation transparent.

Embedded telemetry and agentic architectures will let the Binary Analyzer evolve with Microsoft Discovery and future toolchain expansions.

Roadmap milestones map to measurable KPIs: live precision, analyst time saved per case, and mean time to resolution. Those metrics will guide phased deployment and continuous improvement.

Conclusion

Project Ire unites automated analysis and human-grade reasoning to turn raw binaries into auditable verdicts.

The prototype showed high precision on curated Windows drivers (0.98, recall 0.83) and conservative accuracy on hard targets (0.89 precision, 0.26 recall, ~4% false positives). Its agentic workflow uses reverse engineering tools, control flow reconstruction, and function-level checks to link low-level code to behavior.

The validator removes unsupported claims so experts can trust the classification of files as malicious or benign. Operational gains include fewer false positives, standardized engineering workflows, and faster binary analysis for security teams.

Looking ahead, the project aims to deploy as a Binary Analyzer to classify files at first encounter and help defenders spot novel malware in memory. This system complements human reverse engineers and improves real-world threat response.
