Runtime Economics and Reliability
JamJet's most important benefit is not microsecond-level orchestration overhead. It is the reduction in wasted work when workflows fail, pause, or need to be replayed. These benchmarks show both the raw framework tax and the runtime economics that matter once agents leave the demo stage.
Why runtime economics matter more than framework tax
Resume vs rerun after failure
A 7-step workflow fails on step 6. Plain rerun: repeat all 7 steps. JamJet: resume from step 6. The savings scale with workflow complexity and LLM cost.
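To make the arithmetic concrete, here is an illustrative sketch (made-up token counts and pricing, not numbers from the benchmarks below):

# Illustrative only: rerun vs resume for a 7-step workflow failing on step 6.
COST_PER_1K_TOKENS = 0.002                                  # assumed price, USD
tokens_per_step = [800, 1200, 900, 1500, 700, 1100, 600]    # hypothetical

failed_at = 5                                  # 0-indexed: step 6 fails
rerun = sum(tokens_per_step)                   # plain rerun repeats all 7 steps
resume = sum(tokens_per_step[failed_at:])      # resume redoes only steps 6-7

print(f"rerun:  {rerun} tokens, ${rerun / 1000 * COST_PER_1K_TOKENS:.4f}")
print(f"resume: {resume} tokens, ${resume / 1000 * COST_PER_1K_TOKENS:.4f}")

With these numbers a resume redoes only 25% of the work, and the gap widens as workflows get longer and earlier steps get more expensive.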
Replay savings
Replaying a failed or interesting execution avoids recomputing completed steps. Debug from checkpoints, not from scratch.
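One way to picture the mechanism (a generic checkpoint-replay pattern, not JamJet's actual API; all names here are hypothetical):

# Generic sketch: persist each completed step's output so a replay
# skips straight to the first uncompleted step.
import json, pathlib

CHECKPOINTS = pathlib.Path("checkpoints.json")

def replay(steps, state):
    done = json.loads(CHECKPOINTS.read_text()) if CHECKPOINTS.exists() else {}
    for name, fn in steps:
        if name in done:          # finished on a previous run: reuse the result
            state = done[name]
            continue
        state = fn(state)         # only uncompleted steps are recomputed
        done[name] = state
        CHECKPOINTS.write_text(json.dumps(done))
    return state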
Side-effect safety
Durable state and leases reduce duplicate downstream actions after process failure — fewer double-sends, fewer double-charges.
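The underlying pattern is an idempotency key checked against durable state before each side effect; a minimal sketch (hypothetical names, not JamJet's internals):

# Sketch: guard a side-effecting step so a crashed-and-restarted
# process does not charge the same customer twice.
import hashlib

completed: set[str] = set()       # stand-in for a durable, shared store

def payment_gateway_charge(amount_cents: int) -> None:
    ...                           # hypothetical downstream API

def charge_once(execution_id: str, step: str, amount_cents: int) -> None:
    key = hashlib.sha256(f"{execution_id}:{step}".encode()).hexdigest()
    if key in completed:
        return                    # already performed on an earlier attempt
    completed.add(key)            # record before acting; a lease would also expire
    payment_gateway_charge(amount_cents)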
Framework orchestration overhead
For completeness, here is the raw orchestration tax measured against identical LLM calls.
llama3.2 · Ollama · Apple M-series · 2026-03-08
| Framework | mean (ms) | median (ms) | p95 (ms) | p99 (ms) | stdev (ms) | overhead |
|---|---|---|---|---|---|---|
| Raw (baseline) | 947.2 | 943.7 | 970.3 | 972.2 | 9.9 | — |
| JamJet 0.1.1 | 948.6 | 948.2 | 959.0 | 964.2 | 6.0 | +1.4 ms |
| LangGraph | 944.0 | 943.0 | 953.8 | 961.1 | 8.1 | -3.2 ms |
Note: All three frameworks are within measurement noise (deltas of 1-3 ms against stdevs of 6-10 ms). JamJet's in-process executor adds no observable overhead over a raw LLM call.
qwen3:8b (thinking mode) · Ollama · Apple M-series · 2026-03-08
| Framework | mean (ms) | median (ms) | p95 (ms) | p99 (ms) | stdev (ms) | overhead |
|---|---|---|---|---|---|---|
| Raw (baseline) | 8429.5 | 8303.4 | 8940.3 | 9427.6 | 352.3 | — |
| JamJet 0.1.1 | 10140.1 | 10139.1 | 10487.0 | 10519.5 | 285.1 | +1710.6 ms |
| LangGraph | 11902.9 | 11923.3 | 12761.8 | 12823.5 | 551.7 | +3473.3 ms |
Note: qwen3:8b generates variable-length chain-of-thought, so total tokens differ from run to run. The apparent overhead largely reflects that generation variance, not framework cost.
Vertex AI (Gemini 2.0 Flash) — plan-and-execute agent
End-to-end run: JamJet @task + @tool on Vertex AI's OpenAI-compatible endpoint. Two-step research agent: plan, then synthesize. The table shows representative calls; the TOTAL row covers all 12 LLM calls in the run.
| Step | Latency (ms) | Prompt tokens | Completion tokens | Total tokens |
|---|---|---|---|---|
| plan — Gemini Flash | 2,641 | 96 | 104 | 200 |
| step 1 execution | 1,324 | 129 | 90 | 219 |
| step 2 execution | 1,413 | 127 | 110 | 237 |
| step 3 execution | 1,228 | 132 | 103 | 235 |
| step 4 execution | 1,986 | 664 | 186 | 850 |
| step 5 execution | 1,290 | 280 | 100 | 380 |
| synthesize — Gemini Flash | 3,050 | 124 | 153 | 277 |
| TOTAL (12 calls) | 41,811 | 6,121 | 4,840 | 10,961 |
export OPENAI_BASE_URL="https://us-central1-aiplatform.googleapis.com/..."
export OPENAI_API_KEY=$(gcloud auth print-access-token)

# Then just use @task/@tool as normal
@task(model="google/gemini-2.0-flash-001", tools=[web_search])
async def research(question: str) -> str:
    """Research assistant — search first, then summarize."""
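The call site then stays ordinary Python. A plausible invocation, assuming an @task-decorated coroutine can be awaited directly and that web_search is a @tool defined elsewhere:

import asyncio

# Assumption: @task returns an awaitable coroutine function.
answer = asyncio.run(research("What is durable execution?"))
print(answer)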
Methodology
All benchmarks measure wall-clock time per call. Each framework makes the identical LLM call through the same OpenAI-compatible client, so the difference is pure framework orchestration overhead. A sketch of the timing loop follows the list below.
- Raw (baseline) — bare openai.OpenAI().chat.completions.create() call
- JamJet — Workflow.run_sync() in-process executor
- LangGraph — StateGraph.compile().invoke() with a single node
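For reference, the timing loop amounts to something like this sketch (not the repo's exact script):

# Sketch of the harness: identical prompt and client, wall-clock per
# call via perf_counter, warmup runs discarded before sampling.
import time
from openai import OpenAI

client = OpenAI()                 # honors OPENAI_BASE_URL / OPENAI_API_KEY
MESSAGES = [{"role": "user", "content": "Say ok."}]

def timed_call() -> float:
    start = time.perf_counter()
    client.chat.completions.create(model="llama3.2", messages=MESSAGES)
    return (time.perf_counter() - start) * 1000    # milliseconds

_ = [timed_call() for _ in range(3)]               # warmup, excluded
samples = sorted(timed_call() for _ in range(30))
print(f"median: {samples[len(samples) // 2]:.1f} ms")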
# Reproduce locally (Ollama)
export OPENAI_API_KEY="ollama"
export OPENAI_BASE_URL="http://localhost:11434/v1"
export MODEL_NAME="llama3.2"
git clone https://github.com/jamjet-labs/jamjet-benchmarks
cd jamjet-benchmarks/benchmarks
pip install -r requirements.txt
python bench_single_call.py --json results/my-run.json

- Warmup runs excluded from measurements
- Each timed run is independent — no shared state
- Benchmarks run sequentially to avoid contention
- Hardware: Apple M-series, 16GB RAM, Ollama local