
How LogIQ Was Born From a Silent VM Bug

It was a Tuesday afternoon in February when my manager pinged me.

Customers were complaining that VMs weren’t being created. The dashboard was red. Somewhere in the service I owned, machine creation was failing, but not in the way you’d expect. It wasn’t failing immediately, and it wasn’t failing randomly. It was failing exactly after the 20th machine. Every time. And nobody knew why.

I closed the thing I was working on and started digging.

The hypothesis that wasted my afternoon

My first instinct was the obvious one. Thread contention. A connection pool somewhere capped at 20. A config silently throttling. The number was too clean to be anything else.

I spent the next couple of hours chasing variations of that theory. Auditing pool configurations. Reading through service definitions. Looking for the place where someone had typed 20 and forgotten about it.

There was no such place. The hypothesis was wrong, but the failure pattern was so specific that I kept reaching for it anyway. That’s the thing about debugging under pressure. You’re not really searching for the answer; you’re searching for the answer that would let you go home.

When the obvious explanation didn’t pan out, I had to start over from the data.

Five sources, one bug, no single source telling the truth

Here’s what I had to work with: broker logs, hypervisor logs, aggregate error rates, customer complaints, and past tickets.

Every source, on its own, looked fine.

The broker logs showed VM creation requests going out, getting acknowledged, and returning normal responses. The hypervisor logs showed VMs being created successfully. Error rates in aggregate looked roughly within bounds. The customer complaints were specific but inconsistent. Past tickets had touched related symptoms but nothing exactly matched.

Each system was telling a coherent story about itself. The problem was that the stories didn’t agree with each other.

I spent hours pulling these sources side by side, lining up timestamps, correlating session IDs across log formats that weren’t designed to be correlated. Building the kind of cross-source mental model that, six hours in, starts to feel less like engineering and more like detective work conducted under fluorescent lighting.
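The correlation work described above can be sketched in a few lines. This is a minimal illustration, not the tooling used during the incident: the record layout, timestamps, session IDs, and message strings are all made up, and real logs would need per-source parsing before they looked this tidy. It groups events by session ID, orders each group by time, and flags sessions where the hypervisor reports success but the broker never logged an acknowledgment.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical, already-normalized records:
# (source, ISO timestamp, session_id, message). All values are invented.
EVENTS = [
    ("broker", "2024-02-13T14:46:10", "s-19", "create request sent"),
    ("hypervisor", "2024-02-13T14:46:12", "s-19", "vm created"),
    ("broker", "2024-02-13T14:46:12", "s-19", "ack received"),
    ("broker", "2024-02-13T14:47:01", "s-20", "create request sent"),
    ("hypervisor", "2024-02-13T14:47:09", "s-20", "vm created"),
    ("broker", "2024-02-13T14:47:06", "s-20", "timeout, connection closed"),
]


def timeline_by_session(events):
    """Group events by session ID and sort each group by timestamp."""
    grouped = defaultdict(list)
    for source, ts, session_id, message in events:
        grouped[session_id].append((datetime.fromisoformat(ts), source, message))
    return {sid: sorted(rows) for sid, rows in grouped.items()}


def disagreeing_sessions(timelines):
    """Sessions where the hypervisor says 'created' but the broker has no ack."""
    flagged = []
    for sid, rows in timelines.items():
        created = any(s == "hypervisor" and m == "vm created" for _, s, m in rows)
        acked = any(s == "broker" and m == "ack received" for _, s, m in rows)
        if created and not acked:
            flagged.append(sid)
    return sorted(flagged)


timelines = timeline_by_session(EVENTS)
suspects = disagreeing_sessions(timelines)
```

Neither source looks wrong on its own here. The disagreement only shows up once both are on the same timeline, keyed by the same session.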

Eventually, the pattern showed up.

What was actually happening

The failure lived in the seam between two systems.

When the broker called the hypervisor to create a VM, it did so over a WCF connection with a timeout. Under normal conditions, the hypervisor would return its acknowledgment well before the timeout fired, and everyone would agree that the VM existed. The bookkeeping worked because the messages arrived in the order the system assumed they would.

At the 20-VM threshold, the hypervisor got slow enough under sustained load that its acknowledgment arrived after the timeout had fired. The timeout tore down the socket connection, so the acknowledgment, when it eventually came back, hit a connection that no longer existed.

From the broker’s perspective, the VM had never been created. From the hypervisor’s perspective, it had been created successfully. From the customer’s perspective, they had a VM they could see in the inventory but couldn’t interact with through the normal flow, because the system that managed the flow had no idea it existed.

The VM was real. The system just had no coherent record of it.
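A toy model makes the divergence concrete. The sketch below is not the real broker or hypervisor, and it does not use WCF: the class names, the timeout value, and the linear latency model are assumptions chosen only so that the acknowledgment starts arriving late after the 20th VM. The point it illustrates is that the hypervisor creates the VM either way, while the broker only records it if the acknowledgment beats the timeout.

```python
from dataclasses import dataclass, field

# All numbers are invented for illustration; the real timeout and latency
# figures are not part of this post.
TIMEOUT_S = 5.9
BASE_LATENCY_S = 1.0
LATENCY_PER_VM_S = 0.25


@dataclass
class Hypervisor:
    inventory: set = field(default_factory=set)

    def create_vm(self, vm_id: str) -> float:
        """Create the VM and return how long the acknowledgment takes."""
        latency = BASE_LATENCY_S + LATENCY_PER_VM_S * len(self.inventory)
        self.inventory.add(vm_id)  # creation succeeds regardless of the caller
        return latency


@dataclass
class Broker:
    hypervisor: Hypervisor
    known_vms: set = field(default_factory=set)

    def request_vm(self, vm_id: str) -> bool:
        ack_latency = self.hypervisor.create_vm(vm_id)
        if ack_latency > TIMEOUT_S:
            # Timeout fired first: the connection is gone, the late ack is
            # lost, and the broker never records the VM.
            return False
        self.known_vms.add(vm_id)
        return True


hypervisor = Hypervisor()
broker = Broker(hypervisor)
results = [broker.request_vm(f"vm-{n}") for n in range(1, 26)]

# VMs that exist on the hypervisor but that the broker has no record of.
orphaned = sorted(hypervisor.inventory - broker.known_vms)
```

In this model the first 20 requests succeed and every request after that leaves an orphan: a VM in the hypervisor inventory that the broker believes was never created.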

The fix that went in that afternoon was an interim socket timeout adjustment, with a follow-up to redesign the coordination layer properly. That work later contributed to a 40% reduction in deployment failures across the service.

But the thing I couldn’t stop thinking about, even after the incident closed, was how long it had taken to find. Not because the bug was technically subtle. Once you saw the pattern, it was obvious. The hard part was that no individual source contained the bug. The bug only existed in the relationship between sources.

That hit me harder than the fix did.

The shape of the problem I actually wanted to solve

In production environments, logs rarely exist in isolation. When something goes wrong, engineers are juggling multiple large log files, Jira tickets describing symptoms, customer case notes with partial context, internal documentation, vague memory of “this happened before,” and the dashboards in front of them.

The investigation becomes a manual context-switching exercise. Scanning logs line by line. Correlating timestamps mentally. Jumping between tickets and docs and dashboards. Relying heavily on intuition, under time pressure, on the worst day of the week.

This is slow, error-prone, and mentally exhausting. And the worst part is that it doesn’t scale. Every incident burns the same hours, by the same engineers, doing the same kind of correlation work that doesn’t compound into anything reusable.

The question I kept circling was whether AI could help engineers reason through this kind of investigation. Not by replacing logs or dashboards. Not as a generic “log analyzer” that summarizes a file and calls it insight. But by bringing the heterogeneous sources together into a single investigation flow, where you could ask questions rather than just search, and where the tool could incorporate prior knowledge (tickets, cases, docs) into how it interpreted what it was looking at.

I called the project LogIQ.

What v1 actually was

I want to be honest about this part, because tech blog posts often skip it.

The first version of LogIQ was a Jupyter notebook. That’s it. I’d dump log files into a folder, write a few cells that loaded them in, and feed chunks to the OpenAI API with prompts asking GPT to find anomalies or reason about patterns. There was no UI. There was no architecture. There was barely any code.
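For a sense of scale, the chunking step in a notebook like that can be as small as the sketch below. The function names, the character budget, and the prompt wording are hypothetical, and the actual model call is left out on purpose; this only shows how log lines might be packed into prompt-sized pieces without splitting a line in half.

```python
def chunk_lines(lines, max_chars):
    """Pack whole log lines into chunks of at most max_chars characters.

    A single line longer than max_chars still gets its own chunk.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        # +1 accounts for the newline that joins lines inside a chunk.
        if current and size + len(line) + 1 > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        chunks.append("\n".join(current))
    return chunks


def build_prompt(chunk):
    """Wrap one chunk in a hypothetical anomaly-finding instruction."""
    return "Find anomalies in the following log excerpt:\n\n" + chunk


example_chunks = chunk_lines(["aaaa", "bbbb", "cccc"], max_chars=10)
example_prompts = [build_prompt(c) for c in example_chunks]
```

Chunking per file like this is also a fair picture of why the notebook struggled with correlation: each prompt only ever sees one slice of one source.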

It was useful in the way that early prototypes are useful: it told me which parts of the problem I’d actually understood and which parts I’d hand-waved. The notebook could surface single-source anomalies decently well. It was completely useless at the actual hard problem, which was correlation across sources.

That’s the gap I started building toward.

What I’ve changed since

The biggest evolution from the notebook has been in how the system reasons, not in how it ingests.

Early on, I’d ask LogIQ broad questions and get back broad answers. “Something unusual is happening around 2:47 PM.” Technically true. Operationally useless. The system was generating plausible-sounding summaries that didn’t actually advance an investigation.

The shift was moving toward structured hypothesis output. Instead of asking the model to “explain what’s happening,” I started prompting it to produce evidence-backed hypotheses in a specific shape: what it thinks is happening, which log lines support that hypothesis, which sources it correlated to reach that conclusion, and what it would want to look at next to confirm or rule out.

That single change made the tool feel like an actual collaborator instead of a chatbot with logs. The output started looking like the way I’d write up an investigation if I were handing it off to another engineer. Specific. Grounded in evidence. Honest about uncertainty.
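One way to picture that shape is as a small schema the model output has to fit. The sketch below is an assumption about what such a schema could look like, not LogIQ’s actual format: the field names and the validation rule are invented for illustration. It parses a JSON response into a typed object and rejects any hypothesis that cites no log lines.

```python
import json
from dataclasses import dataclass

# Hypothetical field names mirroring the four parts described above.
REQUIRED = ("claim", "supporting_lines", "correlated_sources", "next_checks")


@dataclass
class Hypothesis:
    claim: str                # what the model thinks is happening
    supporting_lines: list    # log lines cited as evidence
    correlated_sources: list  # sources combined to reach the claim
    next_checks: list         # what to look at to confirm or rule it out


def parse_hypothesis(raw: str) -> Hypothesis:
    """Parse model output, refusing anything without cited evidence."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    hypothesis = Hypothesis(**{key: data[key] for key in REQUIRED})
    if not hypothesis.supporting_lines:
        raise ValueError("hypothesis cites no log lines")
    return hypothesis


raw_output = json.dumps({
    "claim": "Hypervisor ack arrives after the broker timeout",
    "supporting_lines": ["broker: timeout, connection closed"],
    "correlated_sources": ["broker", "hypervisor"],
    "next_checks": ["compare ack latency against the timeout under load"],
})
parsed = parse_hypothesis(raw_output)
```

Rejecting evidence-free output at parse time is the design choice that matters here: a summary that cannot point at a log line never reaches the engineer.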

I also introduced manifest-driven scoping, which is a way of telling LogIQ what “case” it’s investigating: which sources are in scope, what the suspected symptom is, what context (tickets, prior cases, docs) it should pull in. The manifest is the difference between asking “is anything wrong” and asking “is anything wrong in the context of this specific incident, against these specific systems.” Scope matters in real debugging. It matters even more when an LLM is in the loop.
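As a rough illustration of manifest-driven scoping, assuming a manifest is just a small declarative object: the field names, the case ID, and the record layout below are hypothetical and are not LogIQ’s real manifest format. The manifest names the symptom and the sources in scope, and everything outside that scope is dropped before the model sees it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    case_id: str
    symptom: str
    sources: tuple       # names of the sources that are in scope
    context: tuple = ()  # tickets, prior cases, docs to pull in


def scope_records(manifest, records):
    """Keep only records that come from a source named in the manifest."""
    return [r for r in records if r["source"] in manifest.sources]


def scoped_question(manifest, records):
    """Turn 'is anything wrong' into a question tied to this case."""
    in_scope = scope_records(manifest, records)
    return (
        f"Case {manifest.case_id}: {manifest.symptom}\n"
        f"Sources in scope: {', '.join(manifest.sources)}\n"
        f"Context: {', '.join(manifest.context) or 'none'}\n"
        f"Records: {len(in_scope)}"
    )


manifest = Manifest(
    case_id="example-case",
    symptom="VM creation fails after the 20th machine",
    sources=("broker", "hypervisor"),
    context=("past tickets",),
)
records = [
    {"source": "broker", "message": "timeout, connection closed"},
    {"source": "hypervisor", "message": "vm created"},
    {"source": "billing", "message": "invoice generated"},
]
scoped = scope_records(manifest, records)
question = scoped_question(manifest, records)
```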

There’s a lot still rough about the tool. The retrieval layer is closer to RAG-lite than to proper RAG. The eval story is thin. The UI is, charitably, functional. But the bones of the idea (structured hypotheses, manifest-scoped investigation, multi-source correlation) have held up across enough use cases that I trust the direction.

What’s next

The thing I keep coming back to is that LogIQ, as it stands today, is a tool you point at evidence after the incident is already happening. The next version I’m sketching is closer to an agentic investigator, something that can hold a hypothesis, decide which source to pull next, and reason about its own uncertainty as it goes.

The part I care about more than the agentic mechanics, though, is the permission model around it.

A lot of the AI-in-production conversation right now is about what agents can do. I’m more interested in what they cannot do, by design. An on-call investigation tool should never be able to take a destructive action. It should be able to read widely and act narrowly. Scoped credentials, dry-run modes, explicit human approval for anything that changes state. The architecture should make a bad call boring instead of catastrophic.
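Here is a small sketch of what "read widely, act narrowly" could look like in code. None of this exists in LogIQ today; the action kinds, class names, and return strings are assumptions made for illustration. Reads run freely, destructive actions are refused outright, and state-changing actions need both dry-run mode switched off and an explicit approval.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    name: str
    kind: str  # hypothetical categories: "read", "write", "destructive"
    run: Callable[[], str]


class GuardedExecutor:
    def __init__(self, approve: Callable[[Action], bool], dry_run: bool = True):
        self.approve = approve  # stands in for a human approval step
        self.dry_run = dry_run  # safe default: describe, do not act

    def execute(self, action: Action) -> str:
        if action.kind == "read":
            return action.run()
        if action.kind != "write":
            # Destructive or unknown kinds fail closed, approval or not.
            return f"refused: {action.name}"
        if self.dry_run:
            return f"dry-run: {action.name}"
        if not self.approve(action):
            return f"denied: {action.name}"
        return action.run()


read_logs = Action("read broker logs", "read", lambda: "logs")
bump_timeout = Action("raise socket timeout", "write", lambda: "timeout raised")
delete_vm = Action("delete orphaned VM", "destructive", lambda: "deleted")

cautious = GuardedExecutor(approve=lambda action: True)  # dry_run stays on
approved = GuardedExecutor(approve=lambda action: True, dry_run=False)
rejected = GuardedExecutor(approve=lambda action: False, dry_run=False)
```

With this shape, the worst outcome of a bad call from the agent is a "refused" or "dry-run" string in a log, which is the boring failure mode the paragraph above is asking for.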

If you’ve followed my recent thread on the PocketOS incident, this is the same thesis I’ve been working through from the other side. The agent isn’t the dangerous part. The blast radius is. Build the radius small, and the agent becomes useful.

That’s the version of LogIQ I’m building toward next.

I’ll write about it when there’s something worth showing.

