Critical Ollama Vulnerabilities: “Bleeding Llama” and an Unpatched Windows RCE Are Hitting 300,000 Servers

The CyberSec Guru



Two separate security disclosures have landed on the AI infrastructure community in quick succession, and neither is easy to brush off. Researchers have found multiple critical flaws in Ollama, the open source framework used by hundreds of thousands of developers and enterprises to run large language models on local hardware. One of them, dubbed “Bleeding Llama,” lets a complete stranger drain your server’s process memory in three HTTP calls. The other, a Windows-specific update chain attack, has been sitting unpatched for over 90 days.

If you’re running Ollama, the short version is: upgrade to v0.23.2 or later right now, and if you’re on Windows, manually disable auto-updates until the vendor ships a fix.


What Is Ollama and Why Does Any of This Matter?

Before getting into the mechanics, a bit of context on scope.

Ollama is a popular open source framework that lets users run LLMs locally rather than through cloud APIs, and on GitHub the project has over 171,000 stars with more than 16,100 forks. With over 100 million Docker Hub downloads and wide adoption across enterprises as a self-hosted AI inference engine, it has become the de facto standard for teams that want to keep model inference on their own hardware.

The problem is that Ollama ships with no authentication. It is designed for local use and binds to localhost by default, but it is frequently reconfigured to listen on all network interfaces via the documented OLLAMA_HOST=0.0.0.0 setting – a configuration that NVD notes is widely used in practice and produces large public-internet exposure. The result: roughly 300,000 Ollama servers are currently reachable from the public internet, with many more sitting on local networks with little isolation.

That’s a lot of unauthenticated inference endpoints pointing at systems that very likely hold API keys, proprietary code, and private user conversations.

CVE-2026-7482 – “Bleeding Llama”

What It Is

Cyera Research disclosed a critical vulnerability (CVE-2026-7482, CVSS 9.1) in Ollama that enables unauthenticated attackers to leak the entire Ollama process memory, with the leaked data containing user messages, system prompts, and environment variables.

The flaw lives in Ollama’s GGUF model loader, specifically in how it handles tensor metadata during quantization.

A Quick Word on GGUF

GGUF (GPT-Generated Unified Format) is the file format now standard for packaging and distributing local LLMs. A GGUF file is essentially a structured binary: it carries the model’s weight tensors, tokenizer data, and a metadata header that tells the loader things like tensor shape (dimensions) and byte offsets into the file where each tensor’s raw data begins.

When Ollama’s model creation pipeline ingests a GGUF file, it reads that header to determine how much memory to allocate, then processes the tensor data for quantization.
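To make that trust relationship concrete, here is a minimal Python sketch of a loader sizing its allocation purely from header fields. The descriptor layout is deliberately simplified and hypothetical – real GGUF descriptors carry a name, dimension count, per-dimension sizes, dtype, and offset – but the failure mode is the same: the declared geometry is never checked against the file's actual length.

```python
import struct

# Hypothetical, simplified tensor descriptor (NOT the real GGUF layout):
# uint64 element count, uint32 bytes per element, uint64 data offset.
DESC = struct.Struct("<QIQ")

def parse_descriptor(header: bytes) -> dict:
    """Read the declared geometry of one tensor from a header blob."""
    n_elements, elem_size, offset = DESC.unpack_from(header, 0)
    return {"n_elements": n_elements, "elem_size": elem_size, "offset": offset}

def bytes_needed(desc: dict) -> int:
    # The read/allocation is sized purely from declared fields --
    # nothing here cross-checks against the file's real length.
    return desc["n_elements"] * desc["elem_size"]

header = DESC.pack(10_000_000, 2, 4096)  # header *claims* 10M float16 elements
desc = parse_descriptor(header)
print(bytes_needed(desc))  # 20,000,000 bytes demanded, regardless of file size
```

A 20-byte header can thus demand a 20 MB read – the mismatch the next section exploits.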

How the Bug Actually Works

The core issue arises when the declared tensor offset and size within a GGUF file exceed its actual length. During the quantization process, handled by functions in fs/ggml/gguf.go and server/quantization.go, specifically the WriteTo() method, the server attempts to read data using these oversized parameters.

Ollama is written in Go, a language that normally prevents out-of-bounds memory reads through its runtime. But high-performance AI workloads require low-level memory operations, so Ollama uses Go’s unsafe package in its quantization engine. That package deliberately bypasses the standard boundary checks. The ggml_fp16_to_fp32_row conversion function, which loops over elements to convert floating point precision, trusts the declared element count from the header without checking whether that count actually fits within the data that was provided. Feed it a header claiming 10,000,000 elements backed by 200 bytes of actual data, and it will happily loop off the end of the allocated heap buffer and into whatever memory lies adjacent – other users’ prompts, API keys, environment variables pulled from the host system.
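The missing guard is a plain length check before the conversion loop. A minimal Python sketch of the safe pattern (the function name mirrors the article's description of ggml_fp16_to_fp32_row; this is not Ollama's actual Go code):

```python
import struct

def fp16_to_fp32_row_checked(data: bytes, n_elements: int) -> list:
    """Convert a row of little-endian float16 values to float32.

    Unlike the vulnerable pattern -- which trusted n_elements from the
    file header and looped past the end of the buffer -- this version
    refuses counts that don't fit in the data actually supplied.
    """
    needed = n_elements * 2  # float16 = 2 bytes per element
    if needed > len(data):
        raise ValueError(
            f"header claims {n_elements} elements ({needed} bytes) "
            f"but only {len(data)} bytes of tensor data were provided"
        )
    return list(struct.unpack(f"<{n_elements}e", data[:needed]))

print(fp16_to_fp32_row_checked(b"\x00\x3c" * 4, 4))  # four float16 1.0 values
try:
    fp16_to_fp32_row_checked(b"\x00" * 200, 10_000_000)  # Bleeding Llama shape
except ValueError as e:
    print("rejected:", e)
```

In Go, the equivalent check (declared count times element size against `len(data)`) before entering the unsafe conversion loop would have closed the hole.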

The Lossless Exfiltration Path

Here’s where researchers found a genuinely clever detail. Reading raw heap memory through most attack paths produces corrupted garbage, because the data gets mangled during transit. Cyera found a way around that.

The attacker labels their malicious tensor as Float16 format (2 bytes per element) and requests conversion to Float32 (4 bytes per element). Because the F16-to-F32 conversion is mathematically lossless – a pure precision expansion with no information loss – the heap contents that get read into the conversion buffer come out the other side bit-perfectly preserved in the resulting model artifact. The attacker doesn't get a corrupted mess. They get the server's memory intact.
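The losslessness is easy to verify with Python's struct module, which supports IEEE half precision via the "e" format. Any byte sequence that decodes to finite float16 values survives the widen-to-float32 round trip bit for bit (the sample string here is a stand-in for leaked heap bytes; it contains no NaN/Inf bit patterns, so the round trip is exact):

```python
import struct

# Stand-in for adjacent heap bytes swept up by the over-read.
heap_bytes = b"AWS_SECRET_KEY=abc123!"  # 22 bytes -> 11 float16 values
n = len(heap_bytes) // 2

halves = struct.unpack(f"<{n}e", heap_bytes)  # interpret bytes as float16
as_f32 = struct.pack(f"<{n}f", *halves)       # "convert" to float32 (widening)
# An attacker downloads the float32 artifact and narrows it back down:
recovered = struct.pack(f"<{n}e", *struct.unpack(f"<{n}f", as_f32))

print(recovered == heap_bytes)  # True: the original bytes are intact
```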

The Three-Call Attack Chain

The exploitation chain is alarmingly simple and requires zero authentication:

  1. Upload a crafted GGUF file with an inflated tensor shape to a network-accessible Ollama server via an HTTP POST to /api/blobs.
  2. Call /api/create to trigger model creation, which fires the out-of-bounds read.
  3. Call /api/push to exfiltrate the resulting artifact – now containing stolen heap memory – to an attacker-controlled registry.

No credentials, no special tooling, no privilege escalation. Three HTTP requests.
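For defenders writing detection rules, the chain maps onto three ordinary API requests. This hedged Python sketch only builds the requests without sending them – the endpoint paths come from the public disclosure, while the blob digest and request bodies are placeholders, not a working exploit:

```python
import urllib.request

def build_chain(base_url: str, crafted_gguf: bytes, digest: str) -> list:
    """Return the three requests of the Bleeding Llama chain, unsent.

    Bodies and the digest are illustrative placeholders.
    """
    return [
        # 1. Upload the crafted GGUF blob with the inflated tensor header.
        urllib.request.Request(f"{base_url}/api/blobs/{digest}",
                               data=crafted_gguf, method="POST"),
        # 2. Trigger model creation -> quantization -> out-of-bounds read.
        urllib.request.Request(f"{base_url}/api/create",
                               data=b'{"model": "leak"}', method="POST"),
        # 3. Push the resulting artifact (now carrying heap memory) out.
        urllib.request.Request(f"{base_url}/api/push",
                               data=b'{"model": "leak"}', method="POST"),
    ]

for req in build_chain("http://victim:11434", b"GGUF", "sha256:placeholder"):
    print(req.get_method(), req.full_url)
```

Three unauthenticated POSTs to /api/blobs, /api/create, and /api/push in quick succession from one external source is exactly the traffic shape to alert on.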

What Gets Stolen

In enterprise environments, leaked heap data may expose API keys, internal instructions, proprietary code, customer-related content, and other sensitive material processed by AI workflows. The risk compounds when Ollama is connected to external tools or coding assistants, because those outputs pass through memory and become part of what an attacker can extract.

In a large enterprise running Ollama as a shared AI inference service for thousands of employees, an attacker can effectively learn anything about the organization from the inference server – API keys, proprietary code, customer contracts, and more.

The Disclosure Timeline

The road to a published CVE was bumpier than it should have been. Ollama acknowledged the vulnerability on February 25, 2026, and shared a proposed fix the same day, but when the researcher asked about CVE submission, Ollama asked them to submit it independently. After following up in late February and early March with no resolution from MITRE, the researcher eventually approached Echo, a third-party CVE Numbering Authority, which assigned CVE-2026-7482 on April 28, 2026. The patch shipped in version 0.17.1, but Ollama released the updated version without mentioning that it addressed a critical security issue, leaving users unaware of the urgency.

CVE-2026-42248 and CVE-2026-42249 – The Windows Updater Chain

Still Unpatched

While Bleeding Llama at least has a fix available, the Windows-specific attack chain from Striga and CERT Polska does not. The flaws were disclosed to the vendor on January 27, 2026, remain unpatched, and were published after the 90-day disclosure window elapsed.

How the Auto-Update Mechanism Works (and Fails)

The Windows desktop client auto-starts on login from the Windows Startup folder, listens on 127.0.0.1:11434, and periodically polls for updates in the background via the /api/update endpoint, running any pending updates on the next app start.

Two bugs in this flow can be chained to get persistent code execution.

CVE-2026-42248 – Missing Signature Verification (CVSS 7.7): On Windows, the verifyDownload() function is effectively a no-op. It unconditionally returns nil, meaning the updater accepts any binary handed to it as a legitimate Ollama update. There is no code signing check whatsoever.
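What a non-trivial verifyDownload() could look like, sketched in Python. Checksum pinning is shown for brevity; a production updater should verify a publisher signature (e.g. Authenticode on Windows), not just a hash. The function name and pinned value are illustrative:

```python
import hashlib

def verify_download(payload: bytes, expected_sha256: str) -> None:
    """Reject any update whose digest doesn't match the pinned value.

    The vulnerable Windows implementation unconditionally returned nil;
    the minimum bar is to fail closed on any mismatch.
    """
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_sha256:
        raise ValueError(
            f"update rejected: digest {actual} != pinned {expected_sha256}"
        )

good = b"legitimate update binary"
pinned = hashlib.sha256(good).hexdigest()
verify_download(good, pinned)  # passes silently
try:
    verify_download(b"attacker payload", pinned)
except ValueError as e:
    print(e)  # tampered update is refused before it ever touches disk
```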

CVE-2026-42249 – Path Traversal (CVSS 7.7): The updater constructs local staging paths using unsanitized ETag headers from the update server response. An attacker-controlled update server can supply an ETag value containing ../ path components, redirecting where the “update” binary gets written.
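The traversal is preventable by resolving the candidate path and confirming it still sits inside the staging directory before writing. A hedged Python sketch of that check (directory names are illustrative, not the updater's real paths):

```python
import os

def staging_path(staging_dir: str, etag: str) -> str:
    """Derive a safe on-disk name from a server-supplied ETag.

    Refuses any ETag whose resolved path escapes staging_dir -- the
    check the vulnerable updater skipped.
    """
    # Normalize separators so Windows-style `..\` tricks are caught too.
    cleaned = etag.strip('"').replace("\\", "/")
    candidate = os.path.normpath(os.path.join(staging_dir, cleaned))
    base = os.path.normpath(staging_dir)
    if candidate != base and not candidate.startswith(base + os.sep):
        raise ValueError(f"ETag escapes staging directory: {etag!r}")
    return candidate

print(staging_path("/tmp/ollama-staging", '"v0.17.1-x86_64"'))
try:
    staging_path(
        "/tmp/ollama-staging",
        "../../../../../../AppData/Roaming/Microsoft/Windows/"
        "Start Menu/Programs/Startup/payload.exe",
    )
except ValueError as e:
    print("blocked:", e)
```

Even simpler: never derive filenames from server-controlled headers at all, and generate a random local staging name instead.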

Why This Is Persistent

Combine the two: supply a malicious binary as an “update” and an ETag like ../../../../../../AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup\payload.exe, and the updater writes the executable directly into the Windows Startup folder. The update process runs silently in the background; the user sees nothing. On next login, Windows executes the payload. To exploit the flaws, the attacker needs to be in control of an update server reachable by the victim’s Ollama client – achievable by overriding OLLAMA_UPDATE_URL to point at a local server on plain HTTP, or through DNS hijacking, or TLS interception on the network.

The backdoor survives even a legitimate manual Ollama reinstall, because the malicious file stays in the Startup folder independent of Ollama’s installation state.

Why Enterprise AI Is Particularly Exposed

The uncomfortable truth behind both vulnerabilities is that Ollama was never designed for the way it’s actually deployed. Ollama was designed as a localhost tool, which is why it doesn’t include authentication. In practice, teams deploy it in containers, expose it to other services, or configure it to listen on all interfaces to support multi-client use.

That gap between design intent and operational reality has real consequences when agentic AI pipelines enter the picture. Teams using tools like Claude Code, LangChain, or custom orchestration layers integrated with local Ollama instances are routing proprietary code, internal documentation, and API credentials through that inference server. All of that material sits in heap memory at some point during inference.

Attackers with access to Ollama HTTP servers can already instruct them to pull or delete any AI model they want. The memory leak vulnerability takes that a step further, allowing extraction of secrets from the inference process itself.

Response Plan

Figure Out What You’re Running

Check every server and workstation: ollama --version. Anything below 0.17.1 is vulnerable to the memory leak. Check network binding with ss -tlnp | grep 11434 – if Ollama is bound to 0.0.0.0, assume the worst. If you have public IP space, run a Shodan search for port 11434 against your ranges.

Patch and Contain

Upgrade to v0.23.2 or the latest available release. Set OLLAMA_HOST=127.0.0.1:11434 as an environment variable so Ollama can’t be reached from external interfaces. If you genuinely need remote access, put Ollama behind an authenticating reverse proxy – Nginx, Caddy, or an identity-aware gateway like Tailscale or Cloudflare Access.
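For the Nginx route, a minimal illustrative fragment is below. The hostname, certificate paths, and htpasswd file are placeholders to adapt to your environment; the point is that Ollama itself stays bound to 127.0.0.1 and every external request must authenticate first.

```nginx
# Illustrative only: basic-auth reverse proxy in front of a localhost-bound Ollama.
server {
    listen 443 ssl;
    server_name ollama.internal.example;

    ssl_certificate     /etc/nginx/tls/ollama.crt;
    ssl_certificate_key /etc/nginx/tls/ollama.key;

    location / {
        auth_basic           "Ollama inference";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:11434;
        proxy_read_timeout   300s;   # long-running generations
    }
}
```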

If your Ollama instance was publicly accessible before the patch, assume potential compromise. Review what sensitive data may have been exposed and rotate all credentials. That means cloud provider keys, API tokens, registry credentials, anything that could have been sitting in environment variables on that host.

Windows-Specific Steps (Required Until Vendor Patches)

Until Ollama ships fixes for CVE-2026-42248 and CVE-2026-42249, manually disable auto-updates in the Ollama tray settings, and remove any Ollama-related entries from the Startup folder:

Remove-Item "$env:APPDATA\Microsoft\Windows\Start Menu\Programs\Startup\Ollama.lnk"

Do not re-enable auto-updates until signature verification is fixed. Monitor the Ollama GitHub releases page directly and update manually by downloading the installer from the official source.

Frequently Asked Questions

Q: My Ollama instance runs in Docker. Am I safe?

Docker isolation doesn’t protect you from Bleeding Llama. The vulnerability leaks memory from within the Ollama process itself. If the container’s port 11434 is reachable from the network or even from other containers on the same host, the attack works fine. Container boundaries don’t matter here.

Q: Will my EDR catch this?

Probably not for the memory leak. The out-of-bounds read is happening inside a completely legitimate code path of the Ollama process. There’s no shellcode, no unusual syscalls, nothing a typical EDR signature would flag. The Windows RCE payload is a different story – whatever that executable does when it runs on login has a chance of triggering behavioral detection, but by then the persistence mechanism is already in place.

Q: Can I add authentication to Ollama itself?

No. Ollama provides no built-in authentication mechanism; it is the operator’s responsibility to provide an authentication layer using external tools like Nginx, Caddy, or enterprise identity providers.

The Bigger Picture

The “Bleeding Llama” incident isn’t just an Ollama problem. It’s a preview of what happens when infrastructure moves faster than its security model. Ollama was built to make local AI inference easy, and it succeeded at that. But “easy” and “safe for network deployment in a multi-user enterprise” are different requirements, and the project never fully bridged that gap.

The assumption that a tool will stay on localhost is increasingly a fantasy. Teams expose things. Containers get misconfigured. Infrastructure sprawls. The general security advice from Cyera – deploy an authentication proxy in front of all Ollama instances, never expose them to the internet without IP filters and firewalls, and isolate local-network instances on secure segments – applies to the entire AI framework ecosystem, which is increasingly targeted.

AI runtimes deserve the same network paranoia as early unauthenticated databases did in the MongoDB era. The attack surface is real, the exposure numbers are large, and the data sitting in inference server memory is often exactly what an adversary would pay for.
