Benchmarks

Benchmark results comparing httpcorexyz against upstream httpcore and other HTTP clients. All benchmarks were run on Python 3.14.4 over loopback on a single machine.

Request latency

5000 GET requests, 20 concurrent, 10ms simulated server delay.

Async (asyncio)

library	total	mean	median	p99	min
httpcorexyz	3.8s	14ms	13ms	27ms	11ms
aiohttp	3.2s	12ms	12ms	17ms	10ms
httpcore	14.9s	57ms	46ms	187ms	18ms

Sync (thread pool)

library	total	mean	median	p99	min
httpcorexyz	14.6s	56ms	30ms	415ms	11ms
urllib3	4.6s	17ms	15ms	38ms	10ms
httpcore	25.1s	98ms	81ms	308ms	13ms

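httpcorexyz is ~3.9x faster than upstream httpcore for async GET (3.8s vs 14.9s total) and ~1.7x faster for sync GET (14.6s vs 25.1s, about 42% less total time). For async, httpcorexyz mean latency is now within 17% of aiohttp (14ms vs 12ms), which relies on C/Cython extensions for parts of its HTTP handling.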
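
The summary figures can be re-derived from the two latency tables. The snippet below is only an arithmetic check: the numbers are copied from the tables above, and the variable names are illustrative.

```python
# Totals (seconds) and means (milliseconds) copied from the latency tables above.
async_total_s = {"httpcorexyz": 3.8, "aiohttp": 3.2, "httpcore": 14.9}
sync_total_s = {"httpcorexyz": 14.6, "urllib3": 4.6, "httpcore": 25.1}
async_mean_ms = {"httpcorexyz": 14, "aiohttp": 12}

# Speedup = upstream total time divided by fork total time.
async_speedup = async_total_s["httpcore"] / async_total_s["httpcorexyz"]
sync_speedup = sync_total_s["httpcore"] / sync_total_s["httpcorexyz"]

# Fraction of total time saved in the sync benchmark.
sync_time_saved = 1 - sync_total_s["httpcorexyz"] / sync_total_s["httpcore"]

# Relative gap to aiohttp, measured on mean latency.
gap_to_aiohttp = async_mean_ms["httpcorexyz"] / async_mean_ms["aiohttp"] - 1

print(f"async speedup vs httpcore: {async_speedup:.1f}x")
print(f"sync speedup vs httpcore:  {sync_speedup:.1f}x")
print(f"sync total time saved:     {sync_time_saved:.0%}")
print(f"mean latency gap to aiohttp: {gap_to_aiohttp:.0%}")
```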

POST upload throughput

Single connection, 5 requests per payload size.

Async

payload	httpcorexyz	aiohttp	httpcore
1 MB	216 MB/s	223 MB/s	328 MB/s
4 MB	762 MB/s	711 MB/s	522 MB/s
16 MB	726 MB/s	1,136 MB/s	614 MB/s
64 MB	527 MB/s	547 MB/s	599 MB/s

Sync

payload	httpcorexyz	urllib3	httpcore
1 MB	271 MB/s	338 MB/s	500 MB/s
4 MB	687 MB/s	686 MB/s	1,137 MB/s
16 MB	1,163 MB/s	1,319 MB/s	1,357 MB/s
64 MB	594 MB/s	599 MB/s	583 MB/s

Note: the POST benchmark runs libraries sequentially (httpcorexyz first), so the first library tested at each payload size pays a cold-start penalty from TCP buffer and kernel socket warmup. Interleaved measurements show httpcorexyz and httpcore performing identically on write throughput.

Targeted microbenchmarks

`fast_acquire` for anyio Lock and Semaphore

httpcorexyz uses anyio.Lock(fast_acquire=True) and anyio.Semaphore(..., fast_acquire=True), which skip an unnecessary event loop yield when the lock/semaphore is uncontended. This was originally merged in upstream httpcore (encode/httpcore#953) but reverted 4 days later (encode/httpcore#1002) for process/governance reasons, not technical ones.

The impact is large: every request acquires the pool lock at least twice (once to assign, once to close), and in a connection pool with many connections most acquires are uncontended. Skipping the yield on each uncontended acquire removes at least two event loop round-trips per request.

metric	without `fast_acquire`	with `fast_acquire`	improvement
httpcorexyz async GET (total)	12.4s	3.8s	3.3x faster
httpcorexyz async GET (median)	37ms	13ms	2.8x faster
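
The effect of the extra yield can be illustrated without anyio. The sketch below is not the anyio implementation: YieldingLock and FastLock are hypothetical stand-ins built on asyncio.Lock, one of which always performs an event loop round-trip before acquiring and one of which does not. A background task counts how many times the event loop switches away while an uncontended lock is acquired repeatedly.

```python
import asyncio


class YieldingLock:
    """Hypothetical lock that always yields to the event loop before acquiring."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()

    async def __aenter__(self) -> None:
        await asyncio.sleep(0)  # unconditional event loop round-trip
        await self._lock.acquire()

    async def __aexit__(self, *exc_info) -> None:
        self._lock.release()


class FastLock(YieldingLock):
    """Hypothetical lock that skips the yield when the lock is free."""

    async def __aenter__(self) -> None:
        # asyncio.Lock.acquire() returns without suspending when uncontended.
        await self._lock.acquire()


async def loop_turns_during(lock, acquires: int) -> int:
    """Count how often a background task runs while the lock is acquired."""
    turns = 0
    finished = False

    async def ticker() -> None:
        nonlocal turns
        while not finished:
            turns += 1
            await asyncio.sleep(0)

    task = asyncio.create_task(ticker())
    for _ in range(acquires):
        async with lock:
            pass
    finished = True
    await task
    return turns


def compare(acquires: int = 1000) -> tuple[int, int]:
    """Return (turns with the yielding lock, turns with the fast lock)."""

    async def main() -> tuple[int, int]:
        slow = await loop_turns_during(YieldingLock(), acquires)
        fast = await loop_turns_during(FastLock(), acquires)
        return slow, fast

    return asyncio.run(main())


if __name__ == "__main__":
    slow, fast = compare()
    print(f"yielding lock: {slow} loop turns, fast lock: {fast} loop turns")
```

With the yielding variant the background task gets a turn on every acquire, so each acquire costs a trip through the event loop even though nobody else wants the lock. With the fast variant the acquiring coroutine never suspends.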

Write loop: `memoryview` vs `bytes` slicing

httpcorexyz uses memoryview for zero-copy slicing in the socket write loop (_backends/sync.py), replacing the upstream buffer = buffer[n:] pattern, which copies the remaining bytes into a new object on every partial send.

Measured with a socketpair and a 32KB send buffer to force partial writes (50 iterations, median):

payload	bytes (upstream)	memoryview (fork)	speedup
1 KB	0.00ms	0.00ms	~1x
64 KB	0.02ms	0.02ms	~1x
1 MB	2.13ms	0.41ms	5.2x
4 MB	14.76ms	0.83ms	17.7x
16 MB	297.97ms	3.89ms	76.7x

The improvement scales with payload size because bytes slicing is O(n) per iteration (copies remaining data), making the full loop O(n²) for large payloads with many partial sends. memoryview slicing is O(1) - it adjusts a pointer without copying.
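
To make the quadratic growth concrete, the following sketch counts how many bytes the bytes-slicing pattern would copy. It assumes, purely for illustration, that every send() call transfers exactly one fixed-size chunk (32 KiB here, matching the send buffer used in the measurement); real partial sends vary in size. The function name is hypothetical.

```python
def bytes_copied_by_slicing(payload_size: int, bytes_per_send: int) -> int:
    """Total bytes copied by `buffer = buffer[n:]` over a whole write loop.

    Assumes each send() moves exactly `bytes_per_send` bytes (illustrative only).
    """
    copied = 0
    remaining = payload_size
    while remaining > 0:
        remaining -= min(bytes_per_send, remaining)
        copied += remaining  # the slice copies everything still unsent
    return copied


KIB = 1024
MIB = 1024 * KIB

for size in (1 * MIB, 4 * MIB, 16 * MIB):
    copied = bytes_copied_by_slicing(size, 32 * KIB)
    print(f"{size // MIB:>2} MiB payload: {copied / MIB:,.1f} MiB copied")
```

Under that assumption a 16 MiB payload copies about 4 GiB in total, roughly 264 times the volume copied for a 1 MiB payload, even though the payload is only 16 times larger.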

On fast loopback where socket.send() completes in one call, there is no difference. The improvement matters under real network conditions with partial writes.
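
The two write-loop styles can be sketched side by side as follows. This is a minimal illustration rather than the code in _backends/sync.py: the function names and the roundtrip helper are hypothetical, and the socketpair with a small send buffer only approximates the measurement setup described above.

```python
import socket
import threading


def send_all_bytes(sock: socket.socket, data: bytes) -> None:
    """Upstream-style loop: each partial send copies the unsent remainder."""
    buffer = data
    while buffer:
        n = sock.send(buffer)
        buffer = buffer[n:]  # allocates a new bytes object


def send_all_memoryview(sock: socket.socket, data: bytes) -> None:
    """Fork-style loop: slicing a memoryview creates a view, not a copy."""
    view = memoryview(data)
    while view:
        n = sock.send(view)
        view = view[n:]  # no payload bytes are copied


def roundtrip(send_fn, payload: bytes) -> bytes:
    """Send payload through a socketpair and return what the peer received."""
    left, right = socket.socketpair()
    # A small send buffer plus a timeout makes partial sends likely: sockets
    # with a timeout are non-blocking internally, so send() may return early.
    left.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 32 * 1024)
    left.settimeout(5.0)
    chunks = []

    def drain() -> None:
        while True:
            chunk = right.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)

    reader = threading.Thread(target=drain)
    reader.start()
    try:
        send_fn(left, payload)
    finally:
        left.close()  # signals EOF to the reader
        reader.join()
        right.close()
    return b"".join(chunks)


if __name__ == "__main__":
    payload = bytes(range(256)) * 4096  # 1 MiB
    assert roundtrip(send_all_bytes, payload) == payload
    assert roundtrip(send_all_memoryview, payload) == payload
```

Both loops deliver identical bytes. They differ only in how much data is copied between partial sends.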

Connection pool assignment: single-pass vs per-request scanning

httpcorexyz rewrote _assign_requests_to_connections() in the connection pool. Upstream rebuilds available_connections and idle_connections lists by scanning all connections for every queued request - O(n×m) where n = connections and m = queued requests. The fork categorizes connections into buckets in a single pass, then pops from buckets - O(n+m).

Measured with mock connections across 20 origins (1000 iterations, median):

scenario	upstream	fork	speedup
10 conns, 10 reqs	11.3µs	5.3µs	2.2x
100 conns, 50 reqs	335.3µs	39.9µs	8.4x
100 conns, 100 reqs	708.4µs	57.9µs	12.2x
500 conns, 200 reqs	6,496.9µs	172.2µs	37.7x
1000 conns, 500 reqs	37,354.6µs	413.4µs	90.4x

This matters in high-connection-count scenarios - at 1000 connections with 500 queued requests, the upstream algorithm spends 37ms per pool tick while the fork spends 0.4ms.
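
The difference between the two strategies can be sketched with a deliberately simplified model. The Conn class, its fields, and both functions below are hypothetical, and the real _assign_requests_to_connections() handles more cases than this. The sketch only contrasts rescanning all connections per request with bucketing them once by origin.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Conn:
    origin: str
    idle: bool  # hypothetical flag: True if the connection can take a request


def assign_rescan(connections, request_origins):
    """Per-request scanning: O(n x m) for n connections and m requests."""
    assignments = []
    taken = set()
    for req_index, origin in enumerate(request_origins):
        # Rebuild the candidate list from scratch for every queued request.
        available = [
            i for i, conn in enumerate(connections)
            if conn.idle and conn.origin == origin and i not in taken
        ]
        if available:
            taken.add(available[0])
            assignments.append((req_index, available[0]))
    return assignments


def assign_single_pass(connections, request_origins):
    """Bucket connections once, then pop per request: O(n + m)."""
    buckets = defaultdict(deque)
    for i, conn in enumerate(connections):
        if conn.idle:
            buckets[conn.origin].append(i)
    assignments = []
    for req_index, origin in enumerate(request_origins):
        bucket = buckets.get(origin)
        if bucket:
            assignments.append((req_index, bucket.popleft()))
    return assignments


if __name__ == "__main__":
    conns = [Conn("a", True), Conn("b", False), Conn("a", True), Conn("b", True)]
    queued = ["a", "b", "a", "a", "c"]
    print(assign_rescan(conns, queued))
    print(assign_single_pass(conns, queued))
```

Both functions produce the same assignments in this model. Only the amount of scanning per queued request differs.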

Lock contention in `PoolByteStream.close()`

httpcorexyz incorporates a fix from upstream PR encode/httpcore#1038 that reduces lock contention when closing streaming responses. Pool requests are marked as closed outside the lock and cleaned up lazily inside _assign_requests_to_connections(), avoiding a separate _requests.remove() call under the lock. Combined with the O(n+m) pool assignment algorithm, this reduces the total work done while the lock is held.

The stream-sync benchmark measures this directly: each thread opens a streaming GET, iterates the body, and closes the stream - exercising the PoolByteStream.close() path under contention. 50 requests per thread, pool limit 100.

threads	httpcorexyz	httpcore	speedup
1	604ms	600ms	1.00x
10	651ms	656ms	1.01x
50	12,170ms	14,823ms	1.22x
100	27,720ms	28,892ms	1.04x
200	58,226ms	61,437ms	1.05x

At low concurrency there is no contention, so no difference. The improvement peaks around 50 threads where lock contention is the dominant bottleneck. At higher thread counts the connection pool limit (100 connections serving 200 threads) becomes the bottleneck, and threads spend most of their time waiting for a connection rather than contending on the close lock.
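
The lazy-cleanup idea can be sketched as follows. This is a toy model under stated assumptions, not the code from encode/httpcore#1038: TinyPool, PoolRequest, and all method names are hypothetical. Closing a request only sets a flag, and the next assignment pass, which already holds the lock, drops every flagged request in a single sweep.

```python
import threading


class PoolRequest:
    def __init__(self, name: str) -> None:
        self.name = name
        self.closed = False  # flipped without holding the pool lock


class TinyPool:
    """Toy pool showing mark-then-sweep cleanup of closed requests."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._requests: list[PoolRequest] = []

    def submit(self, name: str) -> PoolRequest:
        request = PoolRequest(name)
        with self._lock:
            self._requests.append(request)
        return request

    def close_request(self, request: PoolRequest) -> None:
        # No lock and no list.remove() here: just mark the request.
        request.closed = True

    def assign_requests(self) -> list[str]:
        with self._lock:
            # One pass removes every closed request while the lock is held anyway.
            self._requests = [r for r in self._requests if not r.closed]
            return [r.name for r in self._requests]

    def pending_count(self) -> int:
        with self._lock:
            return len(self._requests)


if __name__ == "__main__":
    pool = TinyPool()
    first, second, third = (pool.submit(n) for n in ("a", "b", "c"))
    pool.close_request(second)
    print(pool.pending_count())    # closed request is still queued
    print(pool.assign_requests())  # sweep happens here
```

The trade-off is that closed requests linger in the queue until the next assignment pass, in exchange for one fewer operation performed under the lock per close.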

Running the benchmarks

scripts/benchmark async
scripts/benchmark sync
scripts/benchmark post-async
scripts/benchmark post-sync
scripts/benchmark stream-sync

Results are printed to stdout and charts are saved to tests/benchmark/results_*.png.