Benchmarks - GoModel

Benchmark snapshot

This page is a short reference for our latest public benchmark: GoModel against LiteLLM, Portkey, and Bifrost, all pointed at the same instant mock backend so the numbers reflect gateway overhead, not model latency. The full article has the complete write-up, all the context, and the charts: AI Gateway Benchmark 2026: GoModel vs LiteLLM, Portkey & Bifrost.

This is a point-in-time snapshot from a June 2026 run on AWS. Treat it as data, not dogma. Gateway performance depends on your workload, provider mix, deployment setup, and tuning. Older runs (March 2026, LiteLLM only, on localhost) are still on the blog for history.

What we tested

A simple, like-for-like setup:

One gateway at a time, in Docker, on an AWS c7i.large (2 vCPU, 4 GiB).
The same shared mock backend for everyone, so we measure only gateway overhead.
Six workloads: chat completions, the Responses API, and Anthropic messages - each streaming and non-streaming.
8,000 requests per workload at concurrency 10, across two randomized-order trials (latency is the median across them).
Fair config: retries off for everyone, GoModel’s circuit breaker off, and LiteLLM run at its recommended one worker per CPU core.

At a glance

GoModel came out ahead on every metric in the table below: the lowest median latency, the tightest latency tail, the highest sustained throughput, the smallest image and memory footprint, and the fastest cold start.

Gateway	p50 (ms)	p99 (ms)	Throughput (req/s)	Peak RAM	Image (compressed)	Cold start
GoModel	`1.8`	`6.9`	`4,900`	`37 MB`	`16 MB`	`0.56 s`
Bifrost	`2.5`	`18.3`	`3,100`	`143 MB`	`77 MB`	`7.1 s`
Portkey	`9.7`	`30.5`	`950`	`112 MB`	`59 MB`	`1.1 s`
LiteLLM	`30.6`	`39.3`	`324`	`2.3 GB`	`372 MB`	`25.5 s`

Latency is chat completions, non-streaming (representative). Throughput is the sustained rate from a separate concurrency sweep. Image size is the compressed pull size.
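For readers less familiar with the p50 and p99 columns, here is a minimal sketch of how a percentile can be read off a list of latency samples. It uses the nearest-rank method, which is an assumption: this page does not say which percentile method the harness uses. The `percentile` function name and the sample values are illustrative only and are not benchmark data.

```shell
# percentile P: read one latency per line on stdin and print the
# nearest-rank P-th percentile. P must be a whole number from 1 to 100.
percentile() {
  LC_ALL=C sort -n | awk -v p="$1" '
    { v[NR] = $1 }                        # samples arrive sorted ascending
    END {
      if (NR == 0) exit 1                 # no samples, nothing to report
      rank = int((p * NR + 99) / 100)     # integer form of ceil(p * NR / 100)
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Made-up latencies in milliseconds, for illustration only.
printf '%s\n' 1.8 2.1 1.7 6.9 2.0 | percentile 50
printf '%s\n' 1.8 2.1 1.7 6.9 2.0 | percentile 99
```

With only a handful of samples the p99 is simply the slowest request, which is why tail percentiles need thousands of requests per workload to be meaningful.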

Key readouts

GoModel has both the lowest median (1.8 ms) and the tightest tail (6.9 ms).
It pushes the most traffic per box (~4,900 req/s) and the most per CPU core.
It is the smallest to ship and run: a 16 MB compressed image and 37 MB of RAM under load, ready to serve 0.56 s after launch.
LiteLLM, even at its recommended multi-worker config, uses ~2.3 GB of RAM and takes ~25 s to start - the cost of Python on the hot path.
Portkey did not serve the Anthropic messages dialect in this single-provider setup, so it covers 4 of the 6 workloads.
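The per-core and relative figures behind these readouts can be derived from the table with a little arithmetic. The sketch below copies the numbers from the table above and divides by the 2 vCPUs of a c7i.large. The helper names `per_vcpu` and `ratio` are illustrative, and because the throughput values are rounded, the results are approximate.

```shell
# Figures copied from the table above; the c7i.large has 2 vCPUs.
VCPUS=2

# Requests per second per vCPU, rounded to a whole number.
per_vcpu() {
  awk -v rps="$1" -v n="$VCPUS" 'BEGIN { printf "%.0f\n", rps / n }'
}

# Ratio of two figures, to one decimal place.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

echo "GoModel req/s per vCPU: $(per_vcpu 4900)"
echo "Bifrost req/s per vCPU: $(per_vcpu 3100)"
echo "GoModel vs LiteLLM throughput: $(ratio 4900 324)x"
echo "Bifrost vs GoModel p99: $(ratio 18.3 6.9)x"
```

Since every gateway ran on the same instance type, the per-vCPU ranking is the same as the per-box ranking.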

Reproduce it yourself

The whole thing is one command. It provisions a small AWS box, runs all four gateways against the same mock backend, prints the tables, and tears the infrastructure back down on its own.

This runs on paid AWS infrastructure, not the free tier. A c7i.large is about $0.09/hour and the run self-destructs within an hour or two, so budget **under $1** per run to be safe. If you pass KEEP=1 or a teardown fails, you keep paying until you destroy the box - so confirm it is gone.
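As a rough sanity check on the budget, and one possible way to confirm the box is gone, here is a hedged sketch. `estimate_cost` and `list_leftover_instances` are hypothetical helpers, not part of the harness. The instance-type filter assumes the benchmark box is the only c7i.large in the account and region your AWS CLI is configured for, and the hourly rate is the approximate figure quoted above, so check current pricing for your region.

```shell
# Rough cost bound: hourly rate times the hours the box stays up.
# Usage: estimate_cost HOURS [RATE]; RATE defaults to the approximate 0.09 per hour.
estimate_cost() {
  awk -v hours="$1" -v rate="${2:-0.09}" 'BEGIN { printf "%.2f\n", rate * hours }'
}

# List c7i.large instances that still exist in the currently configured region.
# Assumption: filtering by instance type is enough to spot the benchmark box.
list_leftover_instances() {
  aws ec2 describe-instances \
    --filters "Name=instance-type,Values=c7i.large" \
              "Name=instance-state-name,Values=pending,running,stopping,stopped" \
    --query "Reservations[].Instances[].[InstanceId,State.Name,LaunchTime]" \
    --output text
}

echo "Estimated cost for a 2-hour run: \$$(estimate_cost 2)"
echo "Estimated cost if left up for a day: \$$(estimate_cost 24)"
```

After a run, calling `list_leftover_instances` with the same credentials and region should print nothing if the teardown succeeded. Any instance ID it does print is still billing.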

The harness lives in the repo at docs/2026-06-25_aws_gateway_benchmark/:

# Needs Docker, Terraform, and AWS credentials
git clone https://github.com/ENTERPILOT/GoModel.git
cd GoModel/docs/2026-06-25_aws_gateway_benchmark
./run.sh

Knobs like N (requests per workload) and REPEATS (trials) are env vars, e.g. N=20000 REPEATS=5 ./run.sh for a heavier run. For a quick local check against just LiteLLM, the older localhost harness is still in docs/about/benchmark-tools/.

Why this page is short

It is meant to give you the result fast, inside the product docs, without a full article. For the narrative, the charts, and the methodology details, read the full post. No single benchmark settles the question for every environment. If you are evaluating gateways seriously, reproduce the test against your own traffic and infrastructure.
