Observability, Refactored
For years, observability was a specialization in the domain of SREs and DevOps in big IT organizations. It required a special skill set that the average developer lacked. The position was reserved for the most experienced engineers in the company. It was considered unreasonable to expect developers to learn how to instrument their code manually while staying productive in their own domains. Instead, the approach was to make it someone else’s problem, for a price.
The observability industry made sure code was auto-instrumented with magical agents set up by observability engineers. The community created OpenTelemetry as an open standard to reduce vendor lock-in. Vendors adopted it, but the instrumentation process remained complex, which made instrumenting systems prohibitively expensive for most. To use these services, you have to wait until your company is big enough to afford the observability software and hire the experts.
If you are small, you have to make do with whatever your cloud provider gives you. This usually means sifting through your logs manually and becoming really good at eyeballing them for problems.
There is a better way. This problem is not insurmountable; we just need to do a bit of refactoring and remove the bloat.
Code instrumentation with AI agents
With the advent of AI coding agents (Claude Code, Codex, etc.), producing code for common tasks has been commoditized. Any task that someone has already done publicly on the internet can be performed easily by our autocomplete genies. For that to work for observability instrumentation, we need to close the agentic loop with a verification step.
Gather Context → Take Action → Verify Results
      ▲                              │
      └──────────────────────────────┘
For coding:        read code → write code → run tests ✓
For observability: read code → instrument → ???

The missing piece is a local observability backend and a tool to query it.
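To make the shape of that loop concrete, here is a toy sketch in TypeScript. Everything in it is a hypothetical stand-in, not a real SDK: `gatherContext`, `instrument`, and `queryBackend` represent "read code", "edit code", and "query the local observability backend".

```typescript
type Telemetry = { spans: string[] };

// Stand-in for reading the codebase (gather context).
function gatherContext(): string {
  return "function handleOrder() { /* ... */ }";
}

// Stand-in for the agent editing code (take action).
function instrument(code: string): string {
  return `import { trace } from "@opentelemetry/api";\n${code}`;
}

// Stand-in for querying a local observability backend (verify results).
// Pretend that running instrumented code produces one span.
function queryBackend(code: string): Telemetry {
  return code.includes("trace") ? { spans: ["handleOrder"] } : { spans: [] };
}

// The closed loop: keep acting until verification succeeds.
let code = gatherContext();
let telemetry = queryBackend(code);
while (telemetry.spans.length === 0) {
  code = instrument(code);
  telemetry = queryBackend(code);
}
console.log(`verified: ${telemetry.spans.length} span(s) received`);
```

The point is the third step: without a backend the agent can query, the loop cannot terminate on evidence, only on hope.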
Once the loop is closed, the stage is set for the instrumentation magic to shift from observability vendors’ proprietary code to AI-generated code you own. The magic act is performed only once, in your codebase, not at runtime. As the code reviewer, you are in complete control of what runs in production.
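As a minimal sketch of what such committed instrumentation might look like with the OpenTelemetry JavaScript API (this assumes `@opentelemetry/api` is installed and an SDK with an exporter is configured elsewhere; `processOrder` and the attribute names are made-up examples, not part of any real codebase):

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

// One tracer per module; the name becomes the instrumentation scope.
const tracer = trace.getTracer("checkout-service");

// Hypothetical business function, standing in for your own code.
async function processOrder(orderId: string): Promise<void> {
  await tracer.startActiveSpan("processOrder", async (span) => {
    span.setAttribute("order.id", orderId);
    try {
      // ... your existing business logic ...
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      // With startActiveSpan you must end the span yourself.
      span.end();
    }
  });
}
```

Everything here is plain application code: reviewable in a pull request, diffable, and owned by you rather than injected by a runtime agent.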
Before:                           After:

runtime agent ──┐                 AI agent [o_o] ──┐
                ▼                                  ▼
┌──────────────────────┐          ┌──────────────────────┐
│ your code            │          │ your code            │
│                      │          │                      │
│  ┌────────────────┐  │          │  import { trace }    │
│  │  /\_/\         │  │          │  const span =        │
│  │ ( o.o ) magic  │  │          │    trace.getTracer() │
│  │  > ~ <         │  │          │                      │
│  └────────────────┘  │          │                      │
└──────────────────────┘          └──────────────────────┘
  opaque, per-deploy                visible, committed once

Agents, not dashboards
Once code is instrumented, the next problem is deciding what to look for and what to monitor. Observability experts have a common set of things to look at, and this is usually the part where they come to developers for insights about which metrics to follow.
As a developer, the things you want to inspect are unknown unknowns, because that’s where your bugs live. So the usual answer is “let me get back to you on this”.
Organizations end up with a generic list of metrics that have less to do with the application code and more with the underlying OS, runtime, and networking stack: HTTP request time, CPU, memory, and so on. The observability team sets up alerts and dashboards on these metrics in a vendor-provided dashboard. Unfortunately, there is an additional price to this convenient division of labour. As always, the problem is communication and responsibility, and the desired result, actually achieving observability, ends up falling through the cracks.
The observability expert is probably not part of your usual dev team, so what you can expect is a generic dashboard that satisfies management, not the people who are supposed to use it to maintain the systems.
If the dashboard is not useful to the devs, they fall back to troubleshooting their systems with logs on their cloud provider. Devs started working around this problem by asking their AI coding agents to help with root cause analysis. Observability vendors noticed, so they started building their own agents into their observability platforms.
┌─────────────────────────────┐
│   Observability Platform    │
│                             │
│     data ──► dashboard      │
│              │              │
│        vendor agent         │
│              │              │
│           response          │
└──────────────┼──────────────┘
               ▼
           Developer

Their agents work great with observability data, but cannot easily connect it with any extra context outside their domain. Your workflow is limited to whatever your observability vendor imagined.
For an optimal agentic workflow, it would be much more convenient to shift things around. It makes more sense for the end user to use their own agent of choice to query the observability data through a vendor-provided tool. The agent can then join this data with other relevant sources, such as source control or billing systems, to get the full context. That context can then be used immediately to fix the problem within the same session.
                            ┌─────────────────┐
                            │ source control  │
                            └────────┬────────┘
                                     │
┌─────────────────────────┐          ▼
│ Observability Backend   │        Agent ──► Developer
│                         │          ▲
│  data ──► vendor tool ──┼──────────┘
│                         │          ▲
└─────────────────────────┘          │
                            ┌────────┴────────┐
                            │     billing     │
                            └─────────────────┘

Instead of being forced to work within the confines of the observability vendor’s platform, you should be able to use whatever works for you.
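As a toy illustration of that join, here is a TypeScript sketch. Every name, timestamp, and value below is made up for the example; in practice the agent would pull the error from the observability tooling and the commit list from `git log`.

```typescript
// Shapes an agent might assemble from two different sources.
type ErrorSpan = { service: string; time: number; message: string };
type Commit = { sha: string; time: number; summary: string };

// From the observability backend (made-up incident data).
const firstError: ErrorSpan = {
  service: "checkout",
  time: 1_700_000_600,
  message: "timeout calling payments",
};

// From source control (made-up commit history).
const commits: Commit[] = [
  { sha: "a1b2c3", time: 1_700_000_000, summary: "bump payments client" },
  { sha: "d4e5f6", time: 1_699_000_000, summary: "update README" },
];

// The join: commits that landed shortly before the first error are suspects.
const windowSeconds = 3600;
const suspects = commits.filter(
  (c) => c.time <= firstError.time && firstError.time - c.time < windowSeconds
);

console.log(suspects.map((c) => `${c.sha}: ${c.summary}`).join("\n"));
```

A vendor-hosted agent can rarely perform this kind of cross-source correlation, because the second source lives outside its platform; your own agent, holding both datasets in one session, can.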
In 2026, developers are capable of taking on a wider range of tasks that previously required specific expertise. With AI coding agents, well-known problems with well-documented solutions have become much easier to tackle. All we need is a way for the agents to get immediate feedback about the outcome of their actions. Observability is no different.
This was the motivation for creating Kopai (to dig, with AI).
Kopai provides the missing pieces for this workflow: `@kopai/app` (the observability backend) and `@kopai/cli` (the AI-friendly CLI). They are free and open source, based on OpenTelemetry.
Refactored workflows
Application instrumentation

- Ask the agent to instrument your code using the OpenTelemetry SDK. Use our skill, or point the agent to the Kopai documentation on instrumenting apps in your programming language.
- The agent instruments the code and runs the app. The app sends telemetry to `@kopai/app` running locally.
- The agent uses `@kopai/cli` to query `@kopai/app` and validate that the telemetry received from the instrumented code contains the desired level of information.
- The agent presents the results to the developer, who uses the `@kopai/app` dashboard to review the telemetry.
Root cause analysis

- Describe the problem to your coding agent and instruct it to use `@kopai/cli` to look for relevant observability data.
- The agent gathers relevant context by using the Kopai CLI to query logs, traces, and metrics (supported by our root-cause-analysis skill).
- The agent presents the facts to the developer and generates a root cause analysis.
- The developer uses the `@kopai/app` dashboard to check the facts and review whether the analysis makes sense.
In both of these workflows, the developer is always in control and makes the final call.
Observability is no longer someone else’s problem. With these tools and the ones we’ll release soon, you are empowered to own it. If you are a dev team that needs every bit of edge your AI-native approach can give you to stay ahead, you will dig Kopai.