The Agentic Inference Cloud.
A new buying category, parallel to inference APIs and agent frameworks, not sandwiched between them. One serverless endpoint you call like OpenAI. What you get back is a durable agentic run, not just a completion.
Four things, one cloud, one API.
Until now, shipping a production agent meant assembling four separate things yourself: an inference vendor, an agent/workflow runtime, a state store for checkpoints and replay, and a recovery story for crashes and budgets. Four integrations. Four places it can drift. Four bills.
The Agentic Inference Cloud bundles all four into a single serverless API. You make one call, and KovaServe routes it to a warm GPU, binds it to a durable run, captures every step for replay, and enforces your budget and recovery policy. Call it like OpenAI. Get an agent-grade execution surface back.
# Same client you already use. Different cloud.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.kovaserve.ai/v1",
    api_key="ks_live_...",
)

# One call. Durable agentic run, not just a completion.
resp = client.chat.completions.create(
    model="kovaserve/llama-3.3-70b",
    messages=[{"role": "user", "content": "..."}],
    extra_body={
        "kovaserve": {
            "run_id": "run_01HZY...",      # runtime + state
            "budget": {"dollars": 5.00},   # recovery
        }
    },
)
# resp.choices[0].message.content    the completion
# resp.kovaserve.run_id              durable run handle
# resp.kovaserve.cost                real $ attributed to this call
# resp.kovaserve.cache_hit_ratio     real warm-route measurement
# GET /v1/ks/runs/{run_id}           full replayable timeline

Inference + Runtime + State + Recovery.
Four capabilities, one API call, one bill. Here's what each piece does and why it matters.
Inference
The wire format is OpenAI. The routing is not. Every call consults a cluster-wide KV index and lands on the GPU that already holds its context. Warm cache, not cold prefill.
- Drop-in /v1/chat/completions
- Cache-aware routing across your GPU fleet
- Real token counts, real per-run cost attribution
- 30–70% lower $/token on repeated-state workloads
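To make the routing idea concrete, here is a toy sketch of prefix-aware routing: given a map from GPU to the token prefixes it already holds, pick the GPU that shares the longest prefix with the incoming request. The names `KVIndex`, `record`, and `route`, and the in-memory data layout, are illustrative assumptions and not KovaServe's actual index or API.

```python
from dataclasses import dataclass, field


def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class KVIndex:
    """Hypothetical cluster-wide index: GPU id -> token prefixes cached there."""
    cached: dict[str, list[list[int]]] = field(default_factory=dict)

    def record(self, gpu: str, tokens: list[int]) -> None:
        """Note that `gpu` now holds KV state for `tokens`."""
        self.cached.setdefault(gpu, []).append(tokens)

    def route(self, tokens: list[int]) -> tuple[str, float]:
        """Return (gpu, warm_fraction) for the GPU with the longest warm prefix."""
        if not self.cached:
            raise ValueError("no GPUs registered in the index")
        best_gpu, best_len = None, -1
        for gpu, prefixes in self.cached.items():
            warm = max((shared_prefix_len(tokens, p) for p in prefixes), default=0)
            if warm > best_len:
                best_gpu, best_len = gpu, warm
        ratio = best_len / len(tokens) if tokens else 0.0
        return best_gpu, ratio


index = KVIndex()
index.record("gpu-a", [1, 2, 3, 4])
index.record("gpu-b", [1, 2, 9])
gpu, ratio = index.route([1, 2, 3, 4, 5, 6, 7, 8])
```

In this toy run the request shares four of its eight tokens with `gpu-a`, so only the remaining half would need prefill. A production router would presumably also weigh load and eviction, which this sketch ignores.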
Runtime
Every call belongs to a durable run, not a standalone request. Use LangGraph, Temporal, or a custom loop as-is, or opt into the KovaServe runtime for coding, terminal, and tool-heavy agents with first-class sandboxes and workspaces.
- BYO runtime mode (LangGraph, Temporal, custom) with no rewrite
- KovaServe runtime mode: sandboxes, workspaces, artifacts
- Durable runs span steps, tools, and model calls
- One cloud, two shapes of call
State
A run isn't a black box. Every step, tool call, and model call is captured with real tokens, real dollars, and real timing. Pull the full timeline for a run with one API call for audit, support, or debugging.
- End-to-end run replay by ID
- Per-run, per-project, per-tenant cost attribution
- Audit trail for compliance and incident response
- "What did the agent do?" is one API call, not a log-grep
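A minimal sketch of that one API call, using only the Python standard library. The `GET /v1/ks/runs/{run_id}` path comes from the quick-start; bearer-token auth and a JSON response body are assumptions, and the helper names are hypothetical. The response schema is not documented here, so the sketch returns the decoded JSON untouched.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.kovaserve.ai/v1"


def run_timeline_request(run_id: str, api_key: str) -> urllib.request.Request:
    """Build the GET /v1/ks/runs/{run_id} request (bearer auth is assumed)."""
    safe_id = urllib.parse.quote(run_id, safe="")
    return urllib.request.Request(
        f"{BASE_URL}/ks/runs/{safe_id}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def fetch_run_timeline(run_id: str, api_key: str) -> dict:
    """Fetch the run timeline and decode it as JSON (schema not assumed)."""
    request = run_timeline_request(run_id, api_key)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)
```

Splitting request construction from the network call keeps the URL and headers easy to inspect or log before anything is sent.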
Recovery
Runs survive real life. A crashed run resumes on the same warm GPU in under 5 seconds. A runaway loop hits its budget cap and gets a 429 at the inference boundary. A pause actually pauses.
- Checkpoint-aware resume with KV cache warmup
- Hard budget caps per run, project, or tenant
- HTTP 429 enforced at the inference boundary
- Managed ↔ self-hosted parity with identical semantics
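On the caller's side, a budget cap shows up as an HTTP 429. Here is a hedged sketch of a loop that stops cleanly when that happens. It assumes the raised exception exposes a `status_code` attribute, as the OpenAI Python client's status errors do, and it treats every 429 as a budget stop; how a budget 429 differs from an ordinary rate-limit 429 is not specified here. `BudgetExhausted`, `call_with_budget_guard`, and `run_until_budget` are hypothetical names.

```python
class BudgetExhausted(Exception):
    """Raised by this sketch when the service answers HTTP 429."""


def call_with_budget_guard(create, **kwargs):
    """Invoke create(**kwargs); translate an HTTP 429 into BudgetExhausted."""
    try:
        return create(**kwargs)
    except Exception as exc:
        # Only 429 is translated; every other error propagates unchanged.
        if getattr(exc, "status_code", None) == 429:
            raise BudgetExhausted(str(exc)) from exc
        raise


def run_until_budget(create, steps: list[dict]) -> tuple[list, bool]:
    """Run steps in order and stop at the first 429.

    Returns (results_so_far, exhausted).
    """
    results = []
    for kwargs in steps:
        try:
            results.append(call_with_budget_guard(create, **kwargs))
        except BudgetExhausted:
            return results, True
    return results, False
```

Pass `client.chat.completions.create` as `create`. Note that the OpenAI Python client retries 429 responses by default, so constructing it with `max_retries=0` may be preferable if a budget stop should surface immediately.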
Not an inference vendor. Not an agent framework. Something else.
Inference APIs
Give you tokens. Stateless. The request is the unit. Your agent, your checkpoints, your budgets: those are on you.
Agent frameworks
Give you Python. Orchestration library. You still have to bring your own inference, state store, and recovery strategy.
Agentic Inference Cloud
Gives you a durable agentic run. Inference + runtime + state + recovery, bundled. The run is the unit. Call it. Done.
One API. Four things, bundled.
Inference engines give you tokens. Agent frameworks give you Python. The Agentic Inference Cloud gives you a durable agentic run: inference, runtime, state, and recovery in one call.
A serverless cloud you call, not a layer you assemble. Parallel to inference APIs and agent frameworks, not sandwiched between them.
Start calling the cloud.
Change your base_url. Your next call is a durable agentic run.