Runtime Economics and Reliability

JamJet's most important benefit is not microsecond-level orchestration overhead. It is the reduction in wasted work when workflows fail, pause, or need to be replayed. These benchmarks show both the raw framework tax and the runtime economics that matter once agents leave the demo stage.

Why runtime economics matter more than framework tax

Resume vs rerun after failure

A 7-step workflow fails on step 6. A plain rerun repeats the five completed steps before retrying step 6; JamJet resumes from step 6's checkpoint and skips them. The savings scale with workflow depth and per-step LLM cost.
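The savings can be made concrete with a quick back-of-envelope calculation (uniform per-step cost is an assumption for illustration):

```python
# Back-of-envelope: how much per-step cost a full rerun repeats
# unnecessarily, versus resuming from the failed step's checkpoint.
# Assumes every step costs the same, which real workflows rarely do.
def wasted_fraction(total_steps: int, failed_step: int) -> float:
    """Fraction of total step cost a plain rerun spends redoing work."""
    completed = failed_step - 1          # steps that already succeeded
    return completed / total_steps

print(f"{wasted_fraction(7, 6):.0%} of step cost wasted by rerun")  # 71%
```

At equal step costs, the 7-step example above wastes 5 of 7 steps on rerun; the waste grows with how late the failure lands.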

Replay savings

Replaying a failed or interesting execution avoids recomputing completed steps. Debug from checkpoints, not from scratch.

Side-effect safety

Durable state and leases reduce duplicate downstream actions after process failure — fewer double-sends, fewer double-charges.
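JamJet's lease mechanism isn't detailed here; a generic sketch of the pattern durable state implies, using hypothetical names, might look like this:

```python
# Generic illustration of side-effect deduplication, NOT JamJet's API:
# each external action is keyed by (run_id, step), so a crash-and-resume
# that re-executes the step skips the action it already performed.
_completed: set[str] = set()          # stands in for durable storage

def run_effect_once(run_id: str, step: int, effect) -> bool:
    """Execute `effect` at most once per (run_id, step); True if it ran."""
    key = f"{run_id}:step-{step}"
    if key in _completed:             # recorded before the crash
        return False                  # duplicate send/charge avoided
    effect()
    _completed.add(key)               # in production: persisted, not in-memory
    return True
```

A resumed run calls the function with the same key it used before the crash, so the email or charge fires once.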

Framework orchestration overhead

For completeness, here is the raw orchestration tax measured against identical LLM calls.

llama3.2 · Ollama · Apple M-series · 2026-03-08

Model: llama3.2 · Endpoint: http://localhost:11434/v1 (Ollama) · Runs: 20 (+3 warmup)

Framework        mean (ms)  median  p95    p99    stdev  overhead
Raw (baseline)   947.2      943.7   970.3  972.2  9.9
JamJet 0.1.1     948.6      948.2   959.0  964.2  6.0    +1.4ms
LangGraph        944.0      943.0   953.8  961.1  8.1    -3.2ms

Note: All three configurations fall within measurement noise (differences of 1–3ms against stdevs of 6–10ms). JamJet's in-process executor adds no observable overhead over a raw LLM call.

qwen3:8b (thinking mode) · Ollama · Apple M-series · 2026-03-08

Model: qwen3:8b · Endpoint: http://localhost:11434/v1 (Ollama) · Runs: 15 (+3 warmup)

Framework        mean (ms)  median   p95      p99      stdev  overhead
Raw (baseline)   8429.5     8303.4   8940.3   9427.6   352.3
JamJet 0.1.1     10140.1    10139.1  10487.0  10519.5  285.1  +1710.6ms
LangGraph        11902.9    11923.3  12761.8  12823.5  551.7  +3473.3ms

Note: qwen3:8b generates variable-length chain-of-thought. High stdev dominates — overhead numbers reflect token generation variance, not framework overhead.

Vertex AI (Gemini 2.0 Flash) — plan-and-execute agent

End-to-end run: JamJet @task + @tool against Vertex AI's OpenAI-compatible endpoint. A plan-and-execute research agent: plan, execute the plan's steps, then synthesize.

Model:        gemini-2.0-flash-001
Provider:     Vertex AI (GCP)
Strategy:     plan-and-execute
Wall-clock:   41,811 ms
Total tokens: 10,961
Est. cost:    $0.00191
Step                       Latency (ms)  Prompt tokens  Compl. tokens  Total tokens
plan — Gemini Flash        2,641         96             104            200
step 1 execution           1,324         129            90             219
step 2 execution           1,413         127            110            237
step 3 execution           1,228         132            103            235
step 4 execution           1,986         664            186            850
step 5 execution           1,290         280            100            380
synthesize — Gemini Flash  3,050         124            153            277
TOTAL (12 calls)           41,811        6,121          4,840          10,961
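For reference, the cost estimate follows directly from the token totals. The per-million-token rates below are an assumption chosen because they reproduce the table's figure; check current Vertex AI pricing before relying on them:

```python
# Reconstructing the cost estimate from the token totals above.
# Rates are assumed ($/1M tokens), picked to match the table's $0.00191;
# they are not authoritative Vertex AI pricing.
PROMPT_RATE = 0.075 / 1_000_000      # $/prompt token (assumed)
COMPLETION_RATE = 0.30 / 1_000_000   # $/completion token (assumed)

cost = 6_121 * PROMPT_RATE + 4_840 * COMPLETION_RATE
print(f"${cost:.5f}")  # $0.00191
```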
Integration — 2 env vars, no custom client
export OPENAI_BASE_URL="https://us-central1-aiplatform.googleapis.com/..."
export OPENAI_API_KEY=$(gcloud auth print-access-token)

# Then just use @task/@tool as normal
@task(model="google/gemini-2.0-flash-001", tools=[web_search])
async def research(question: str) -> str:
    """Research assistant — search first, then summarize."""

Methodology

All benchmarks measure wall-clock time per call. Each framework makes the identical LLM call through the same OpenAI-compatible client — what we measure is framework orchestration overhead.

  • Raw (baseline) — bare openai.OpenAI().chat.completions.create() call
  • JamJet — Workflow.run_sync() in-process executor
  • LangGraph — StateGraph.compile().invoke() with a single node
  • Warmup runs excluded from measurements
  • Each timed run is independent — no shared state
  • Benchmarks run sequentially to avoid contention
  • Hardware: Apple M-series, 16GB RAM, Ollama local

# Reproduce locally (Ollama)
export OPENAI_API_KEY="ollama"
export OPENAI_BASE_URL="http://localhost:11434/v1"
export MODEL_NAME="llama3.2"

git clone https://github.com/jamjet-labs/jamjet-benchmarks
cd jamjet-benchmarks/benchmarks
pip install -r requirements.txt
python bench_single_call.py --json results/my-run.json
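The measurement loop described above can be sketched as follows (a hypothetical harness for illustration, not the repo's actual bench_single_call.py):

```python
import statistics
import time

# Minimal sketch of the benchmark loop: time identical calls,
# discard warmup runs, report mean/median/p95 in milliseconds.
def bench(call, runs: int = 20, warmup: int = 3) -> dict:
    for _ in range(warmup):
        call()                                    # excluded from stats
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000)  # ms
    samples.sort()
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "p95": samples[int(0.95 * (runs - 1))],
    }
```

Running the same loop over the raw client, JamJet, and LangGraph with an identical prompt isolates the orchestration layer as the only variable.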