Benchmarks

Benchmark results comparing httpcorexyz against upstream httpcore and other HTTP clients. All benchmarks were run on Python 3.14.4 over the loopback interface on a single machine.

Request latency

5000 GET requests, 20 concurrent, 10ms simulated server delay.
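The workload shape described above (a fixed number of requests, bounded concurrency, a fixed server-side delay) can be sketched with the standard library alone. This is a minimal illustration, not the project's benchmark script: `fake_request` and `run_benchmark` are hypothetical names, and the server is simulated with `asyncio.sleep` instead of a real HTTP round trip.

```python
import asyncio
import statistics
import time


async def fake_request(delay_s: float) -> None:
    # Stand-in for an HTTP GET: only the simulated server delay is modelled.
    await asyncio.sleep(delay_s)


async def run_benchmark(total: int = 5000, concurrency: int = 20,
                        delay_s: float = 0.010) -> dict:
    semaphore = asyncio.Semaphore(concurrency)  # caps in-flight requests
    latencies: list[float] = []

    async def one_request() -> None:
        async with semaphore:
            start = time.perf_counter()
            await fake_request(delay_s)
            latencies.append(time.perf_counter() - start)

    started = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(total)))
    elapsed = time.perf_counter() - started

    latencies.sort()
    # Nearest-rank style p99 on the sorted samples.
    p99_index = min(len(latencies) - 1, int(len(latencies) * 0.99))
    return {
        "count": len(latencies),
        "total": elapsed,
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
        "p99": latencies[p99_index],
        "min": latencies[0],
    }
```

Calling `asyncio.run(run_benchmark())` returns the same columns reported in the tables (total, mean, median, p99, min), measured here against the simulated delay only.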

Async (asyncio)

| library | total | mean | median | p99 | min |
|---|---|---|---|---|---|
| httpcorexyz | 3.8s | 14ms | 13ms | 27ms | 11ms |
| aiohttp | 3.2s | 12ms | 12ms | 17ms | 10ms |
| httpcore | 14.9s | 57ms | 46ms | 187ms | 18ms |

Sync (thread pool)

| library | total | mean | median | p99 | min |
|---|---|---|---|---|---|
| httpcorexyz | 14.6s | 56ms | 30ms | 415ms | 11ms |
| urllib3 | 4.6s | 17ms | 15ms | 38ms | 10ms |
| httpcore | 25.1s | 98ms | 81ms | 308ms | 13ms |

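By total time, httpcorexyz is ~3.9x faster than upstream httpcore for async GET (14.9s vs 3.8s) and ~1.7x faster for sync GET (25.1s vs 14.6s, a ~42% reduction in total time). For async, httpcorexyz's mean latency is now within 17% of aiohttp (14ms vs 12ms), which relies on C/Cython extensions.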
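The comparison figures can be reproduced from the totals and means in the two tables. The helper names below are illustrative only; the inputs are the published table values.

```python
def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is (ratio of total times)."""
    return baseline_s / candidate_s


def time_reduction(baseline_s: float, candidate_s: float) -> float:
    """Fraction of the baseline's total time that the candidate saves."""
    return 1 - candidate_s / baseline_s


# Totals and means taken from the async and sync tables above.
async_speedup = speedup(14.9, 3.8)            # httpcore vs httpcorexyz, async
sync_speedup = speedup(25.1, 14.6)            # httpcore vs httpcorexyz, sync
sync_reduction = time_reduction(25.1, 14.6)   # same pair, as a percentage
mean_gap_vs_aiohttp = 14 / 12 - 1             # async mean latency gap

print(f"{async_speedup:.1f}x, {sync_speedup:.1f}x, "
      f"{sync_reduction:.0%}, {mean_gap_vs_aiohttp:.0%}")
# prints: 3.9x, 1.7x, 42%, 17%
```

The sync figure of 42% is a reduction in total time, which corresponds to a speedup ratio of about 1.7x.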

POST upload throughput

Single connection, 5 requests per payload size.

Async

| payload | httpcorexyz | aiohttp | httpcore |
|---|---|---|---|
| 1 MB | 216 MB/s | 223 MB/s | 328 MB/s |
| 4 MB | 762 MB/s | 711 MB/s | 522 MB/s |
| 16 MB | 726 MB/s | 1,136 MB/s | 614 MB/s |
| 64 MB | 527 MB/s | 547 MB/s | 599 MB/s |

Sync

| payload | httpcorexyz | urllib3 | httpcore |
|---|---|---|---|
| 1 MB | 271 MB/s | 338 MB/s | 500 MB/s |
| 4 MB | 687 MB/s | 686 MB/s | 1,137 MB/s |
| 16 MB | 1,163 MB/s | 1,319 MB/s | 1,357 MB/s |
| 64 MB | 594 MB/s | 599 MB/s | 583 MB/s |

Note: the POST benchmark runs the libraries sequentially, with httpcorexyz first, so at each payload size the httpcorexyz figures include a cold-start penalty from TCP buffer and kernel socket warmup. Interleaved measurements show httpcorexyz and httpcore performing identically on write throughput.
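One way to take interleaved measurements is to rotate the run order every round and discard a warm-up round, so that no candidate is always first to touch a cold socket. The sketch below is an assumption about how such a harness could look, not the project's script; `measure_interleaved` and its parameters are hypothetical.

```python
import statistics
import time
from typing import Callable


def measure_interleaved(candidates: dict[str, Callable[[], None]],
                        repeats: int = 5, warmup: int = 1) -> dict[str, float]:
    names = list(candidates)
    if not names:
        return {}
    timings: dict[str, list[float]] = {name: [] for name in names}
    for round_index in range(warmup + repeats):
        # Rotate the starting candidate so each one takes a turn going first.
        offset = round_index % len(names)
        order = names[offset:] + names[:offset]
        for name in order:
            start = time.perf_counter()
            candidates[name]()
            elapsed = time.perf_counter() - start
            if round_index >= warmup:  # warm-up rounds are not recorded
                timings[name].append(elapsed)
    # Median per candidate, in seconds.
    return {name: statistics.median(samples) for name, samples in timings.items()}
```

Each callable would wrap one upload of a given payload size with one library; the result is the median time per library.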

Targeted microbenchmarks

fast_acquire for anyio Lock and Semaphore

httpcorexyz uses anyio.Lock(fast_acquire=True) and anyio.Semaphore(..., fast_acquire=True), which skip an unnecessary event loop yield when the lock/semaphore is uncontended. This was originally merged in upstream httpcore (encode/httpcore#953) but reverted 4 days later (encode/httpcore#1002) for process/governance reasons, not technical ones.

The impact is large: every request acquires the pool lock at least twice (once to assign, once to close), and in a connection pool with many connections most acquires are uncontended. Eliminating the yield on each uncontended acquire removes at least two event loop round-trips per request.

| metric | without fast_acquire | with fast_acquire | improvement |
|---|---|---|---|
| httpcorexyz async GET (total) | 12.4s | 3.8s | 3.3x faster |
| httpcorexyz async GET (median) | 37ms | 13ms | 2.8x faster |
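The cost of an unconditional yield on an uncontended acquire can be illustrated without anyio. The sketch below uses a toy lock built on `asyncio.Lock` that optionally forces a yield before acquiring; `ToyLock` and `count_loop_yields` are hypothetical names, and this is not anyio's implementation. A background task counts how often the event loop gets control while a single caller acquires the lock twice per simulated request.

```python
import asyncio


class ToyLock:
    """Toy lock: yields to the event loop on every acquire unless fast_acquire is set."""

    def __init__(self, fast_acquire: bool = False) -> None:
        self._lock = asyncio.Lock()
        self._fast_acquire = fast_acquire

    async def __aenter__(self) -> None:
        if not self._fast_acquire:
            await asyncio.sleep(0)  # forced yield, even when uncontended
        await self._lock.acquire()  # does not suspend when the lock is free

    async def __aexit__(self, *exc_info) -> None:
        self._lock.release()


async def count_loop_yields(fast_acquire: bool, requests: int = 100) -> int:
    lock = ToyLock(fast_acquire)
    ticks = 0

    async def ticker() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0)
            ticks += 1  # one tick each time the loop hands control back

    task = asyncio.create_task(ticker())
    for _ in range(requests):
        async with lock:  # acquire to assign
            pass
        async with lock:  # acquire to close
            pass
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return ticks
```

With `fast_acquire=False` the ticker runs roughly twice per request; with `fast_acquire=True` the caller never gives up control, so the ticker does not run at all during the loop.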

Write loop: memoryview vs bytes slicing

httpcorexyz uses memoryview for zero-copy slicing in the socket write loop (_backends/sync.py), replacing the upstream buffer = buffer[n:] pattern which copies the remaining bytes into a new object on every partial send.
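The two write-loop patterns can be sketched as follows. This is a simplified illustration using blocking sockets, not the code in `_backends/sync.py`; the function names are hypothetical.

```python
import socket
import threading


def send_all_bytes(sock: socket.socket, data: bytes) -> None:
    # Upstream-style loop: each partial send copies the unsent tail.
    buffer = data
    while buffer:
        n = sock.send(buffer)
        buffer = buffer[n:]  # new bytes object holding the remaining data


def send_all_memoryview(sock: socket.socket, data: bytes) -> None:
    # Fork-style loop: slicing a memoryview creates a new view, not a copy.
    view = memoryview(data)
    while len(view) > 0:
        n = sock.send(view)
        view = view[n:]


def roundtrip(send_fn, payload: bytes) -> bytes:
    """Send payload over a socketpair and return what the peer received."""
    writer, reader = socket.socketpair()
    received = bytearray()

    def drain() -> None:
        while len(received) < len(payload):
            chunk = reader.recv(65536)
            if not chunk:
                break
            received.extend(chunk)

    thread = threading.Thread(target=drain)
    thread.start()
    send_fn(writer, payload)
    writer.close()
    thread.join()
    reader.close()
    return bytes(received)
```

Both loops deliver identical bytes; they differ only in how much data is copied when `send()` accepts less than the full buffer.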

Measured with a socketpair and a 32KB send buffer to force partial writes (50 iterations, median):

| payload | bytes (upstream) | memoryview (fork) | speedup |
|---|---|---|---|
| 1 KB | 0.00ms | 0.00ms | ~1x |
| 64 KB | 0.02ms | 0.02ms | ~1x |
| 1 MB | 2.13ms | 0.41ms | 5.2x |
| 4 MB | 14.76ms | 0.83ms | 17.7x |
| 16 MB | 297.97ms | 3.89ms | 76.7x |

The improvement scales with payload size because bytes slicing is O(n) per iteration (copies remaining data), making the full loop O(n²) for large payloads with many partial sends. memoryview slicing is O(1) - it adjusts a pointer without copying.
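To make the quadratic growth concrete, the copy volume of the slicing loop can be counted directly. The sketch assumes, for illustration only, that every `send()` call accepts exactly `send_size` bytes; real partial-send sizes vary.

```python
def bytes_copied_by_slicing(payload_size: int, send_size: int) -> int:
    """Total bytes copied by `buffer = buffer[n:]` across one full write loop."""
    copied = 0
    remaining = payload_size
    while remaining > 0:
        sent = min(send_size, remaining)
        remaining -= sent
        copied += remaining  # the slice copies everything still unsent
    return copied


MIB = 1024 * 1024
for size in (1 * MIB, 4 * MIB, 16 * MIB):
    copied = bytes_copied_by_slicing(size, 32 * 1024)
    print(f"{size // MIB:>2} MiB payload: {copied / MIB:,.1f} MiB copied")
```

Under this assumption a 16 MiB payload sent in 32 KiB pieces takes 512 sends and copies close to 4 GiB in total, while the memoryview loop copies nothing.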

On fast loopback where socket.send() completes in one call, there is no difference. The improvement matters under real network conditions with partial writes.

Connection pool assignment: single-pass vs per-request scanning

httpcorexyz rewrote _assign_requests_to_connections() in the connection pool. Upstream rebuilds available_connections and idle_connections lists by scanning all connections for every queued request - O(n×m) where n = connections and m = queued requests. The fork categorizes connections into buckets in a single pass, then pops from buckets - O(n+m).
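The difference between the two strategies can be shown with a simplified model. The sketch below is not the pool's real data model: `Conn`, `assign_scanning`, and `assign_single_pass` are hypothetical, and each idle connection is assumed to serve exactly one queued request for the same origin.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Conn:
    origin: str
    idle: bool = True


def assign_scanning(conns: list[Conn], requests: list[tuple[int, str]]) -> dict[int, int]:
    """O(n*m): rescan every connection for each queued request."""
    assigned: dict[int, int] = {}
    taken: set[int] = set()
    for request_id, origin in requests:
        for index, conn in enumerate(conns):
            if conn.idle and conn.origin == origin and index not in taken:
                assigned[request_id] = index
                taken.add(index)
                break
    return assigned


def assign_single_pass(conns: list[Conn], requests: list[tuple[int, str]]) -> dict[int, int]:
    """O(n+m): bucket idle connections by origin once, then pop per request."""
    buckets: dict[str, deque[int]] = defaultdict(deque)
    for index, conn in enumerate(conns):
        if conn.idle:
            buckets[conn.origin].append(index)
    assigned: dict[int, int] = {}
    for request_id, origin in requests:
        bucket = buckets.get(origin)
        if bucket:
            assigned[request_id] = bucket.popleft()
    return assigned
```

Both functions return the same request-to-connection mapping; only the amount of scanning differs.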

Measured with mock connections across 20 origins (1000 iterations, median):

| scenario | upstream | fork | speedup |
|---|---|---|---|
| 10 conns, 10 reqs | 11.3µs | 5.3µs | 2.1x |
| 100 conns, 50 reqs | 335.3µs | 39.9µs | 8.4x |
| 100 conns, 100 reqs | 708.4µs | 57.9µs | 12.2x |
| 500 conns, 200 reqs | 6,496.9µs | 172.2µs | 37.7x |
| 1000 conns, 500 reqs | 37,354.6µs | 413.4µs | 90.4x |

This matters in high-connection-count scenarios - at 1000 connections with 500 queued requests, the upstream algorithm spends 37ms per pool tick while the fork spends 0.4ms.

Lock contention in PoolByteStream.close()

httpcorexyz incorporates a fix from upstream PR encode/httpcore#1038 that reduces lock contention when closing streaming responses. Pool requests are marked as closed outside the lock and cleaned up lazily inside _assign_requests_to_connections(), avoiding a separate _requests.remove() call under the lock. Combined with the O(n+m) pool assignment algorithm, this reduces the total work done while the lock is held.

The stream-sync benchmark measures this directly: each thread opens a streaming GET, iterates the body, and closes the stream - exercising the PoolByteStream.close() path under contention. 50 requests per thread, pool limit 100.

| threads | httpcorexyz | httpcore | speedup |
|---|---|---|---|
| 1 | 604ms | 600ms | 1.00x |
| 10 | 651ms | 656ms | 1.01x |
| 50 | 12,170ms | 14,823ms | 1.22x |
| 100 | 27,720ms | 28,892ms | 1.04x |
| 200 | 58,226ms | 61,437ms | 1.06x |

At low concurrency there is no contention, so no difference. The improvement peaks around 50 threads where lock contention is the dominant bottleneck. At higher thread counts the connection pool limit (100 connections serving 200 threads) becomes the bottleneck, and threads spend most of their time waiting for a connection rather than contending on the close lock.
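The lazy-cleanup idea behind the close path can be sketched with a toy pool. This is an illustration under simplifying assumptions, not the code from the upstream PR: `PoolRequest`, `ToyPool`, and the method names are hypothetical.

```python
import threading


class PoolRequest:
    """Hypothetical stand-in for a queued pool request."""

    def __init__(self) -> None:
        self.closed = False


class ToyPool:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._requests: list[PoolRequest] = []

    def add(self, request: PoolRequest) -> None:
        with self._lock:
            self._requests.append(request)

    def close_eager(self, request: PoolRequest) -> None:
        # Eager variant: the list is scanned for removal while the lock is held.
        with self._lock:
            self._requests.remove(request)

    def close_lazy(self, request: PoolRequest) -> None:
        # Lazy variant: only a flag is set, so the close path never takes the lock.
        request.closed = True

    def assign(self) -> int:
        # Closed requests are dropped in the pass that already holds the lock.
        with self._lock:
            self._requests = [r for r in self._requests if not r.closed]
            return len(self._requests)
```

In the lazy variant, threads closing streams do not queue up on the lock; the cleanup cost is folded into the assignment pass that has to run anyway.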

Running the benchmarks

scripts/benchmark async
scripts/benchmark sync
scripts/benchmark post-async
scripts/benchmark post-sync
scripts/benchmark stream-sync

Results are printed to stdout and charts are saved to tests/benchmark/results_*.png.