AI in the engine room

Cloudflare pointed a security-focused AI model at live production infrastructure and published what they found — including where it fell short. That honesty is worth paying attention to.

Cloudflare just shared something most companies wouldn't. They took a security-focused AI model — part of a project they're calling Glasswing — and pointed it at live, critical infrastructure code. Not a test environment. Production systems. The kind of code where a missed vulnerability has real consequences.

The model is called Mythos. The results were honest. And honesty in AI case studies is still rarer than it should be.

What they actually did

This is a step beyond using AI to write code or summarise documents. Cloudflare ran a security-focused LLM against the code that underpins services used by millions of people, looking for vulnerabilities that human reviewers might have missed. They then wrote up what worked and what didn't.

That last part matters. Most AI case studies from large technology companies are announcements dressed up as insights. This one reads differently.

What worked

At scale, the model did things that would take a human security team weeks. It scanned large volumes of code quickly. It found real issues — vulnerabilities that had been sitting there, undetected. The pattern-recognition ability of a well-trained security model, running at the speed of compute, is genuinely useful.

For anyone running a security function, the implication is clear: AI can see more surface area than humans can, and it can do it faster. That is not a small thing. The attack surface of a modern technology organisation is enormous, and it grows every time a new service is deployed. Anything that helps you keep up with that pace matters.

Where it fell short

But Cloudflare was specific about the gaps. The model struggled with context: understanding why a piece of code was written the way it was, what it was meant to do, and what a real attacker would actually do with a given weakness. It produced noise. It needed human judgment to triage, verify, and decide what was actually worth acting on.

Their phrase stuck with me: the work around the models needs to be figured out before any of this can scale. Not the models themselves. The work around them.

A security AI without the right human review process around it isn't more secure. It's differently risky.

The pattern I keep seeing

The honest AI case studies I come across follow the same shape. The model does something genuinely impressive. Then it hits a point where human judgment, context, or institutional knowledge is needed to go further. That is not a failure. It is what useful tools look like at this stage.

The mistake is treating AI as a finished product rather than a strong starting point that needs a system built around it. Cloudflare didn't make that mistake. They named the gap clearly and said: we need to figure out the scaffolding before we trust this at scale.

This connects to something Chris wrote recently about the tech side and the people side needing to move together. The same logic applies here. You don't just deploy the model. You build the process around it — the review workflow, the escalation path, the human who makes the final call. That's the real work.

What this means if you're not Cloudflare

Cloudflare has deep AI expertise and the resources to run this kind of experiment properly. If they are being this careful — and this honest — about where the model still needs work, that tells the rest of us something important.

It doesn't mean wait. It means start with a clear view of what AI can and can't do in your security stack. The bar for trust in security is higher than in most areas. A wrong recommendation in a content system is annoying. A missed vulnerability in production code is something else.

That is not a reason to avoid AI in security. It is a reason to be precise about where you use it, what you check, and who is responsible for the final call. At Travelopia, we are thinking about this actively — where AI adds real speed and coverage, and where a human needs to stay in the loop.

Why the honest case studies matter

The AI work that is actually moving forward isn't the work where people pretend the model is perfect. It's the work where people are clear about the gaps and methodical about closing them.

The fact that security-focused models are now good enough to run against live infrastructure is significant progress. A few years ago this wasn't possible. The fact that they still need human scaffolding to be trustworthy is just reality — and I respect Cloudflare for saying so plainly rather than glossing over it.

The next time someone shows you an impressive AI demo in security, ask one question: what does the process around the model look like? That answer will tell you more than the demo will.
