Benchmark snapshot

This page is a short reference for our latest public benchmark: GoModel against LiteLLM, Portkey, and Bifrost, all pointed at the same instant mock backend so the numbers reflect gateway overhead, not model latency. The full article has the complete write-up, all the context, and the charts: AI Gateway Benchmark 2026: GoModel vs LiteLLM, Portkey & Bifrost.
This is a point-in-time snapshot from a June 2026 run on AWS. Treat it as data, not dogma. Gateway performance depends on your workload, provider mix, deployment setup, and tuning. Older runs (March 2026, LiteLLM only, on localhost) are still on the blog for history.

What we tested

A simple, like-for-like setup:
  • One gateway at a time, in Docker, on an AWS c7i.large (2 vCPU, 4 GiB).
  • The same shared mock backend for everyone, so we measure only gateway overhead.
  • Six workloads: chat completions, the Responses API, and Anthropic messages - each streaming and non-streaming.
  • 8,000 requests per workload at concurrency 10, across two randomized-order trials (latency is the median across them).
  • Fair config: retries off for everyone, GoModel’s circuit breaker off, and LiteLLM run at its recommended one worker per CPU core.
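To illustrate the "randomized-order trials" idea, here is a minimal sketch that shuffles the gateway order for each trial, so no gateway always runs first or last on the box. It assumes the thing being randomized is the order in which the gateways run, which the list above does not state. The function name `shuffled_gateways` and the seeding scheme are hypothetical and are not taken from the harness.

```shell
#!/bin/sh
# Hypothetical sketch: the real harness may randomize differently.

# shuffled_gateways SEED: print the four gateway names in a pseudo-random
# order, one per line. Each name gets a random key, then we sort by key.
shuffled_gateways() {
  printf '%s\n' gomodel litellm portkey bifrost |
    awk -v seed="$1" 'BEGIN { srand(seed) } { printf "%.9f\t%s\n", rand(), $0 }' |
    sort -n |
    cut -f2
}

REPEATS=2   # two trials, as in the published run
trial=1
while [ "$trial" -le "$REPEATS" ]; do
  # Mix the shell's PID into the seed so separate invocations differ.
  order=$(shuffled_gateways "$(( $$ + trial ))" | tr '\n' ' ')
  echo "trial $trial order: $order"
  trial=$((trial + 1))
done
```

Passing the seed explicitly keeps a given trial order reproducible, which helps when a single trial needs to be re-run for debugging.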

At a glance

GoModel led on every metric in the table below: the lowest median latency, the tightest latency tail, the highest sustained throughput, the smallest image and memory footprint, and the fastest cold start.
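| Gateway | p50 (ms) | p99 (ms) | Throughput (req/s) | Peak RAM | Image (compressed) | Cold start |
|---|---|---|---|---|---|---|
| GoModel | 1.8 | 6.9 | 4,900 | 37 MB | 16 MB | 0.56 s |
| Bifrost | 2.5 | 18.3 | 3,100 | 143 MB | 77 MB | 7.1 s |
| Portkey | 9.7 | 30.5 | 950 | 112 MB | 59 MB | 1.1 s |
| LiteLLM | 30.6 | 39.3 | 324 | 2.3 GB | 372 MB | 25.5 s |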
Latency figures are for non-streaming chat completions, chosen as a representative workload. Throughput is the sustained rate from a separate concurrency sweep. Image size is the compressed pull size.
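For readers less familiar with the p50 and p99 columns, the sketch below shows one common way to derive them from raw per-request latencies: the nearest-rank percentile. The helper name `percentile` and the sample values are hypothetical, and the harness may well use a different percentile definition or tool.

```shell
#!/bin/sh
# percentile P: read one latency (in ms) per line on stdin and print the
# nearest-rank P-th percentile. P must be an integer from 1 to 100.
# Hypothetical helper for illustration, not part of the benchmark harness.
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }                       # samples arrive already sorted
    END {
      if (NR == 0) exit 1                # no samples, no percentile
      r = int((p * NR + 99) / 100)       # ceil(p * NR / 100) for integer p
      if (r < 1) r = 1
      print v[r]
    }'
}

# Made-up sample latencies, only to show the call shape.
printf '%s\n' 5 1 4 2 3 | percentile 50   # prints 3
printf '%s\n' 5 1 4 2 3 | percentile 99   # prints 5
```

With only a handful of samples the p99 is simply the slowest request, which is why a tail figure only becomes meaningful at request counts in the thousands, as in this run.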

Key readouts

  • GoModel has both the lowest median (1.8 ms) and the tightest tail (6.9 ms).
  • It pushes the most traffic per box (~4,900 req/s) and the most per CPU core.
  • It is the smallest to ship and run: a 16 MB compressed image and 37 MB of RAM under load, ready to serve 0.56 s after launch.
  • LiteLLM, even at its recommended multi-worker config, uses ~2.3 GB of RAM and takes ~25 s to start - the cost of Python on the hot path.
  • Portkey did not serve the Anthropic messages dialect in this single-provider setup, so it covers 4 of the 6 workloads.

Reproduce it yourself

The whole thing is one command. It provisions a small AWS box, runs all four gateways against the same mock backend, prints the tables, and tears the infrastructure back down on its own.
This runs on paid AWS infrastructure, not the free tier. A c7i.large is about $0.09/hour and the run self-destructs within an hour or two, so budget **under $1** per run to be safe. If you pass KEEP=1 or a teardown fails, you keep paying until you destroy the box - so confirm it is gone.
The harness lives in the repo at docs/2026-06-25_aws_gateway_benchmark/:
# Needs Docker, Terraform, and AWS credentials
git clone https://github.com/ENTERPILOT/GoModel.git
cd GoModel/docs/2026-06-25_aws_gateway_benchmark
./run.sh
Knobs like N (requests per workload) and REPEATS (trials) are env vars, e.g. N=20000 REPEATS=5 ./run.sh for a heavier run. For a quick local check against just LiteLLM, the older localhost harness is still in docs/about/benchmark-tools/.
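Because a failed teardown or KEEP=1 leaves a billable box behind, it can help to sanity-check the cost and confirm nothing is still running. The sketch below is a rough aid, not part of the harness: both function names are hypothetical, the cost estimate covers only the quoted instance rate (no storage or data transfer), and the check assumes the benchmark box is the only c7i.large in the region your AWS CLI is currently configured for.

```shell
#!/bin/sh
# estimate_cost RATE_PER_HOUR HOURS: print the instance cost in dollars.
estimate_cost() {
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

# list_running_boxes: print the IDs of c7i.large instances that are pending
# or running in the current region. Empty output means none were found.
# Needs the AWS CLI and credentials; it filters by instance type only,
# since the tags the harness applies (if any) are not known here.
list_running_boxes() {
  aws ec2 describe-instances \
    --filters "Name=instance-type,Values=c7i.large" \
              "Name=instance-state-name,Values=pending,running" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text
}

estimate_cost 0.09 2    # a two-hour run at the quoted rate: prints 0.18
estimate_cost 0.09 24   # a box forgotten for a day: prints 2.16

# After ./run.sh finishes, run this by hand to confirm the box is gone:
#   list_running_boxes
```

If `list_running_boxes` prints an instance ID after a run, destroy that instance before walking away.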

Why this page is short

It is meant to give you the result fast, inside the product docs, without a full article. For the narrative, the charts, and the methodology details, read the full post. No single benchmark settles the question for every environment. If you are evaluating gateways seriously, reproduce the test against your own traffic and infrastructure.