Runtime Economics and Reliability
JamJet's most important benefit is not microsecond-level orchestration overhead. It is the reduction in wasted work when workflows fail, pause, or need to be replayed. These benchmarks show both the raw framework tax and the runtime economics that matter once agents leave the demo stage.
Why runtime economics matter more than framework tax
Resume vs rerun after failure
A 7-step workflow fails on step 6. Plain rerun: repeat all 7 steps. JamJet: resume from step 6. The savings scale with workflow complexity and LLM cost.
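To make the arithmetic concrete, here is an illustrative sketch (made-up token counts and pricing, not numbers from the benchmarks below):

# Illustrative only: rerun vs resume for a 7-step workflow failing on step 6.
COST_PER_1K_TOKENS = 0.002                                  # assumed price, USD
tokens_per_step = [800, 1200, 900, 1500, 700, 1100, 600]    # hypothetical

failed_at = 5                                  # 0-indexed: step 6 fails
rerun = sum(tokens_per_step)                   # plain rerun repeats all 7 steps
resume = sum(tokens_per_step[failed_at:])      # resume redoes only steps 6-7

print(f"rerun:  {rerun} tokens, ${rerun / 1000 * COST_PER_1K_TOKENS:.4f}")
print(f"resume: {resume} tokens, ${resume / 1000 * COST_PER_1K_TOKENS:.4f}")

With these numbers a resume redoes only 25% of the work, and the gap widens as workflows get longer and earlier steps get more expensive.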
Replay savings
Replaying a failed or interesting execution avoids recomputing completed steps. Debug from checkpoints, not from scratch.
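One way to picture the mechanism (a generic checkpoint-replay pattern, not JamJet's actual API; all names here are hypothetical):

# Generic sketch: persist each completed step's output so a replay
# skips straight to the first uncompleted step.
import json, pathlib

CHECKPOINTS = pathlib.Path("checkpoints.json")

def replay(steps, state):
    done = json.loads(CHECKPOINTS.read_text()) if CHECKPOINTS.exists() else {}
    for name, fn in steps:
        if name in done:          # finished on a previous run: reuse the result
            state = done[name]
            continue
        state = fn(state)         # only uncompleted steps are recomputed
        done[name] = state
        CHECKPOINTS.write_text(json.dumps(done))
    return state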
Side-effect safety
Durable state and leases reduce duplicate downstream actions after process failure — fewer double-sends, fewer double-charges.
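The underlying pattern is an idempotency key checked against durable state before each side effect; a minimal sketch (hypothetical names, not JamJet's internals):

# Sketch: guard a side-effecting step so a crashed-and-restarted
# process does not charge the same customer twice.
import hashlib

completed: set[str] = set()       # stand-in for a durable, shared store

def payment_gateway_charge(amount_cents: int) -> None:
    ...                           # hypothetical downstream API

def charge_once(execution_id: str, step: str, amount_cents: int) -> None:
    key = hashlib.sha256(f"{execution_id}:{step}".encode()).hexdigest()
    if key in completed:
        return                    # already performed on an earlier attempt
    completed.add(key)            # record before acting; a lease would also expire
    payment_gateway_charge(amount_cents)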
Framework orchestration overhead
For completeness, here is the raw orchestration tax measured against identical LLM calls.
llama3.2 · Ollama · Apple M-series · 2026-03-08
| Framework | mean (ms) | median (ms) | p95 (ms) | p99 (ms) | stdev (ms) | overhead |
|---|---|---|---|---|---|---|
| Raw (baseline) | 947.2 | 943.7 | 970.3 | 972.2 | 9.9 | — |
| JamJet 0.1.1 | 948.6 | 948.2 | 959.0 | 964.2 | 6.0 | +1.4 ms |
| LangGraph | 944.0 | 943.0 | 953.8 | 961.1 | 8.1 | -3.2 ms |
Note: All three frameworks are within measurement noise (deltas of 1-3 ms against stdevs of 6-10 ms). JamJet's in-process executor adds no observable overhead over a raw LLM call.
qwen3:8b (thinking mode) · Ollama · Apple M-series · 2026-03-08
| Framework | mean (ms) | median (ms) | p95 (ms) | p99 (ms) | stdev (ms) | overhead |
|---|---|---|---|---|---|---|
| Raw (baseline) | 8429.5 | 8303.4 | 8940.3 | 9427.6 | 352.3 | — |
| JamJet 0.1.1 | 10140.1 | 10139.1 | 10487.0 | 10519.5 | 285.1 | +1710.6 ms |
| LangGraph | 11902.9 | 11923.3 | 12761.8 | 12823.5 | 551.7 | +3473.3 ms |
Note: qwen3:8b generates variable-length chain-of-thought, so total tokens differ from run to run. The apparent overhead largely reflects that generation variance, not framework cost.
Vertex AI (Gemini 2.0 Flash) — plan-and-execute agent
End-to-end run: JamJet @task + @tool on Vertex AI's OpenAI-compatible endpoint. Two-step research agent: plan, then synthesize. The table shows representative calls; the TOTAL row covers all 12 LLM calls in the run.
| Step | Latency (ms) | Prompt tokens | Completion tokens | Total tokens |
|---|---|---|---|---|
| plan — Gemini Flash | 2,641 | 96 | 104 | 200 |
| step 1 execution | 1,324 | 129 | 90 | 219 |
| step 2 execution | 1,413 | 127 | 110 | 237 |
| step 3 execution | 1,228 | 132 | 103 | 235 |
| step 4 execution | 1,986 | 664 | 186 | 850 |
| step 5 execution | 1,290 | 280 | 100 | 380 |
| synthesize — Gemini Flash | 3,050 | 124 | 153 | 277 |
| TOTAL (12 calls) | 41,811 | 6,121 | 4,840 | 10,961 |
export OPENAI_BASE_URL="https://us-central1-aiplatform.googleapis.com/..."
export OPENAI_API_KEY=$(gcloud auth print-access-token)

# Then just use @task/@tool as normal
@task(model="google/gemini-2.0-flash-001", tools=[web_search])
async def research(question: str) -> str:
    """Research assistant — search first, then summarize."""
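The call site then stays ordinary Python. A plausible invocation, assuming an @task-decorated coroutine can be awaited directly and that web_search is a @tool defined elsewhere:

import asyncio

# Assumption: @task returns an awaitable coroutine function.
answer = asyncio.run(research("What is durable execution?"))
print(answer)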
Methodology
All benchmarks measure wall-clock time per call. Each framework makes the identical LLM call through the same OpenAI-compatible client, so the difference is pure framework orchestration overhead. A sketch of the timing loop follows the list below.
- Raw (baseline) — bare openai.OpenAI().chat.completions.create() call
- JamJet — Workflow.run_sync() in-process executor
- LangGraph — StateGraph.compile().invoke() with a single node
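For reference, the timing loop amounts to something like this sketch (not the repo's exact script):

# Sketch of the harness: identical prompt and client, wall-clock per
# call via perf_counter, warmup runs discarded before sampling.
import time
from openai import OpenAI

client = OpenAI()                 # honors OPENAI_BASE_URL / OPENAI_API_KEY
MESSAGES = [{"role": "user", "content": "Say ok."}]

def timed_call() -> float:
    start = time.perf_counter()
    client.chat.completions.create(model="llama3.2", messages=MESSAGES)
    return (time.perf_counter() - start) * 1000    # milliseconds

_ = [timed_call() for _ in range(3)]               # warmup, excluded
samples = sorted(timed_call() for _ in range(30))
print(f"median: {samples[len(samples) // 2]:.1f} ms")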
# Reproduce locally (Ollama)
export OPENAI_API_KEY="ollama"
export OPENAI_BASE_URL="http://localhost:11434/v1"
export MODEL_NAME="llama3.2"
git clone https://github.com/jamjet-labs/jamjet-benchmarks
cd jamjet-benchmarks/benchmarks
pip install -r requirements.txt
python bench_single_call.py --json results/my-run.json

- Warmup runs excluded from measurements
- Each timed run is independent — no shared state
- Benchmarks run sequentially to avoid contention
- Hardware: Apple M-series, 16GB RAM, Ollama local