The Platform

The Agentic Inference Cloud.

A new buying category, parallel to inference APIs and agent frameworks, not sandwiched between them. One serverless endpoint you call like OpenAI. What you get back is a durable agentic run, not just a completion.

Try the playground Talk to a founder

What it is

Four things, one cloud, one API.

Until now, shipping a production agent meant assembling four separate things yourself: an inference vendor, an agent/workflow runtime, a state store for checkpoints and replay, and a recovery story for crashes and budgets. Four integrations. Four places it can drift. Four bills.

The Agentic Inference Cloud bundles all four into a single serverless API. You make one call, and KovaServe routes it to a warm GPU, binds it to a durable run, captures every step for replay, and enforces your budget and recovery policy. Call it like OpenAI. Get an agent-grade execution surface back.

python

# Same client you already use. Different cloud.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.kovaserve.ai/v1",
    api_key="ks_live_...",
)

# One call. Durable agentic run, not just a completion.
resp = client.chat.completions.create(
    model="kovaserve/llama-3.3-70b",
    messages=[{"role": "user", "content": "..."}],
    extra_body={
        "kovaserve": {
            "run_id": "run_01HZY...",   # runtime + state
            "budget": {"dollars": 5.00}, # recovery
        }
    },
)

# resp.choices[0].message.content           the completion
# resp.kovaserve.run_id                     durable run handle
# resp.kovaserve.cost                       real $ attributed to this call
# resp.kovaserve.cache_hit_ratio            real warm-route measurement
# GET /v1/ks/runs/{run_id}                  full replayable timeline

The bundle

Inference + Runtime + State + Recovery.

Four capabilities, one API call, one bill. Here's what each piece does and why it matters.

Inference

OpenAI-compatible, cache-aware.

The wire format is OpenAI. The routing is not. Every call consults a cluster-wide KV index and lands on the GPU that already holds its context. Warm cache, not cold prefill.

Drop-in /v1/chat/completions
Cache-aware routing across your GPU fleet
Real token counts, real per-run cost attribution
30–70% lower $/token on repeated-state workloads

Runtime

Bring yours. Or use ours.

Every call belongs to a durable run, not a standalone request. Use LangGraph, Temporal, or a custom loop as-is, or opt into the KovaServe runtime for coding, terminal, and tool-heavy agents with first-class sandboxes and workspaces.

BYO runtime mode (LangGraph, Temporal, custom) with no rewrite
KovaServe runtime mode: sandboxes, workspaces, artifacts
Durable runs span steps, tools, and model calls
One cloud, two shapes of call

State

Every run is replayable.

A run isn't a black box. Every step, tool call, and model call is captured with real tokens, real dollars, and real timing. Pull the full timeline for a run with one API call for audit, support, or debugging.

End-to-end run replay by ID
Per-run, per-project, per-tenant cost attribution
Audit trail for compliance and incident response
"What did the agent do?" is one API call, not a log-grep

Recovery

Crash, resume, cap, pause.

Runs survive real life. A crash resumes on the same warm GPU in under 5 seconds. A runaway loop hits its budget cap and returns 429 at the inference boundary. A pause actually pauses.

Checkpoint-aware resume with KV cache warmup
Hard budget caps per run, project, or tenant
HTTP 429 enforced at the inference boundary
Managed ↔ self-hosted parity with identical semantics

Why a new category

Not an inference vendor. Not an agent framework. Something else.

Inference APIs

Give you tokens. Stateless. The request is the unit. Your agent, your checkpoints, your budgets. That's on you.

Agent frameworks

Give you Python. Orchestration library. You still have to bring your own inference, state store, and recovery strategy.

Agentic Inference Cloud

Gives you a durable agentic run. Inference + runtime + state + recovery, bundled. The run is the unit. Call it. Done.

A new category

One API. Four things, bundled.

Inference engines give you tokens. Agent frameworks give you Python. The Agentic Inference Cloud gives you a durable agentic run, inference, runtime, state, and recovery in one call.

Agent FrameworksOrchestration, retries, tool loops

LangGraphCrewAIGoogle ADKOpenAI SDK

Workflow EnginesDurable execution

TemporalPrefectDagster

Assemble it yourself· or call one API instead

Inference PlatformsServing + scheduling

RayModalBentoMLReplicate

Inference EnginesThroughput, KV cache

vLLMSGLangTensorRT-LLMTGI

Hardware & GPU CloudsThe physical tier

NVIDIAAWSAzureCoreWeaveCrusoe

A serverless cloud you call, not a layer you assemble. Parallel to inference APIs and agent frameworks, not sandwiched between them.

Start calling the cloud.

Change your base_url. Your next call is a durable agentic run.

Start free See pricing