submit → queue → score → store · simulated mode · 10 jobs at once · 1.5 second scoring
A backend for "submit your application and we'll evaluate it."
An applicant submits, the system puts the application in a waiting line, and a worker
scores the candidate (in the real version, by calling a large language model) and stores a
ranked result. Scoring is slow and rate-limited, so we accept instantly (an HTTP 202
Accepted response: received now, scored later) and process behind the waiting line at whatever rate
the workers and the model can sustain.
Why this domain: Mercor is a talent-matching marketplace, so "take in candidate applications,
score and rank them with a large language model, serve a leaderboard" maps onto the real business.
That is what makes a slow, rate-limited scoring step believable rather than contrived, and it is what
justifies the whole waiting-line and back-pressure design.
Honest scope: the scoring is a modeled dependency. Simulated mode waits 1.5 seconds to stand
in for the model's response time; real mode does call Claude, but was run only a few times. The
deliverable is how the system behaves around that slow step under load, not a real scoring
product.
Input · submit (answered instantly)
POST /applications
{ "candidate": "alice",
"payload": "5 years backend, Go and Python" }
202 Accepted
{ "id": "3737...", "status": "pending" }
POST sends one application and returns immediately; GET reads a single
result, or the whole ranked leaderboard. Scores shown are example shapes, not measured output.
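A minimal sketch of the instant-accept step, assuming an in-memory queue and dictionary in place of the real waiting line and store. The names `submit_application`, `get_application`, `pending_jobs`, and `results` are hypothetical and not taken from the project; the point is only that the submit path records and enqueues, and never scores.

```python
import queue
import uuid

# Hypothetical in-memory stand-ins for the real waiting line and result store.
pending_jobs = queue.Queue()
results = {}

def submit_application(candidate, payload):
    """Accept an application: record it, enqueue it, answer 202 at once."""
    job_id = uuid.uuid4().hex
    results[job_id] = {"candidate": candidate, "status": "pending", "score": None}
    pending_jobs.put({"id": job_id, "candidate": candidate, "payload": payload})
    # No scoring happens here; a worker picks the job up later.
    return 202, {"id": job_id, "status": "pending"}

def get_application(job_id):
    """Read one result; 404 if the id is unknown."""
    record = results.get(job_id)
    if record is None:
        return 404, {"error": "not found"}
    return 200, {"id": job_id, **record}

status, body = submit_application("alice", "5 years backend, Go and Python")
print(status, body["status"])
```

Because the handler does only a dictionary write and a queue put, its reply time does not depend on how long scoring takes.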
01
Architecture
instant accept, scoring in the background
202 Accepted: the request is received now and scored later ·
reliable pull: a worker claims a job in a way that does not lose it if the worker crashes ·
N copies: the worker process is run many times in parallel
The cheap step (accept and add to the line) is split from the slow step (scoring), so
the front door stays fast while scoring is processed in the background and scaled on its own.
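One common way to get a reliable pull is a lease: a claimed job is set aside rather than deleted, and it returns to the line if the worker never confirms it. The sketch below is a toy in-memory version of that idea under that assumption; the source does not say which mechanism the project uses, and `LeaseQueue` and its methods are hypothetical names.

```python
import time

class LeaseQueue:
    """Toy in-memory queue where a claimed job is leased, not removed."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.waiting = []      # job ids not yet claimed
        self.in_flight = {}    # job id -> time at which its lease expires

    def put(self, job_id):
        self.waiting.append(job_id)

    def claim(self, now=None):
        """Hand out the oldest waiting job and start its lease."""
        now = time.monotonic() if now is None else now
        self._requeue_expired(now)
        if not self.waiting:
            return None
        job_id = self.waiting.pop(0)
        self.in_flight[job_id] = now + self.lease_seconds
        return job_id

    def ack(self, job_id):
        """Worker finished; only now is the job really gone."""
        self.in_flight.pop(job_id, None)

    def _requeue_expired(self, now):
        # Jobs whose worker went silent go back to the end of the line.
        expired = [j for j, t in self.in_flight.items() if t <= now]
        for job_id in expired:
            del self.in_flight[job_id]
            self.waiting.append(job_id)

q = LeaseQueue(lease_seconds=5.0)
q.put("job-1")
first = q.claim(now=0.0)     # worker A claims, then crashes without ack
nothing = q.claim(now=1.0)   # lease still active, nothing to hand out
retry = q.claim(now=6.0)     # lease expired, the job is handed out again
q.ack(retry)
```

The trade-off of a lease is that a slow but healthy worker can see its job handed out twice, so the scoring step should be safe to repeat.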
02
Pieces and how they run
containers on one machine; load generator outside
container: one isolated, packaged process ·
OpenTelemetry: a library inside each service that emits metrics and traces ·
load generator: a tool that sends test traffic to find the limits
Legend: our services (run many copies) · data stores · monitoring · load generator (outside)
Everything runs as containers on one machine and one private network; only the load
generator sits outside, standing in for real client traffic hitting the public entry point.
03
Performance overview
one worker, load raised from 2 to 14 requests per second
capacity of one worker
6.6
requests per second, equal to 10 jobs at once divided by 1.5 second scoring
measured completion rate
6.64
jobs completed per second, sitting right at that maximum
applications waiting in line
0 to 389
flat below the limit, then growing without bound above it
submit-to-scored time, slowest 1 percent
2 to 34
seconds, climbing as the waiting line grows
In the figure:
requests per second: submissions sent in per second ·
offered rate: submissions sent in; scored rate: jobs completed ·
queue depth: applications waiting ·
p50 / p99: the typical and the slowest-1-percent submit-to-scored time
Up to about 6.6 requests per second the system keeps pace; past that the
waiting line grows without bound and the slowest submit-to-scored time climbs from
2 seconds to 34 seconds.
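The capacity figure is plain arithmetic: 10 jobs at once divided by 1.5 seconds each is about 6.67 per second, which this page writes as 6.6. The sketch below works that out and adds a simple fluid approximation of the waiting line. The offered rates (4 and 10 per second) and the 60 second duration are illustrative assumptions, not the steps of the actual load test.

```python
def worker_capacity(jobs_at_once, scoring_seconds):
    """Maximum sustainable completions per second for one worker."""
    return jobs_at_once / scoring_seconds

def backlog_after(offered_rps, capacity_rps, seconds, start_backlog=0.0):
    """Fluid approximation: the line changes at (offered - capacity) per second."""
    return max(0.0, start_backlog + (offered_rps - capacity_rps) * seconds)

capacity = worker_capacity(jobs_at_once=10, scoring_seconds=1.5)
print(round(capacity, 2))                               # 6.67

# Below the limit the line stays empty; above it, it grows with time.
print(backlog_after(4, capacity, seconds=60))           # 0.0
print(round(backlog_after(10, capacity, seconds=60)))   # 200
```

The model ignores randomness in arrivals, so it shows only the trend: nothing waits below the limit, and above it the line grows for as long as the overload lasts.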
04
Why it matters
same load, same moment
front-door reply time (slowest 5 percent)
8.45 milliseconds
the public entry point answers immediately
time to actually finish scoring (slowest 1 percent)
34 seconds
the real work, behind the waiting line
The front door answers in 8.45 milliseconds while the same job takes
34 seconds to finish. The overload never shows up at the entry point, so you have
to watch the waiting line, not the reply time.
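Since the reply time stays flat during overload, an overload check has to look at the waiting line instead. The sketch below is one simple way to do that, assuming periodic queue-depth readings are available; `is_falling_behind` is a hypothetical helper and the sample readings are made up for illustration.

```python
def is_falling_behind(queue_depths, min_samples=3):
    """True when the last `min_samples` readings are strictly increasing."""
    recent = queue_depths[-min_samples:]
    if len(recent) < min_samples:
        return False
    return all(a < b for a, b in zip(recent, recent[1:]))

# Made-up readings: a steady line versus one that keeps growing.
print(is_falling_behind([0, 1, 0, 0]))       # False
print(is_falling_behind([0, 40, 150, 300]))  # True
```

A real alert would likely smooth the readings first, because a short burst can raise the line for a few samples without the system being past its limit.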
05
Scaling and the fix
four workers at 20 requests per second; only the database pool size changed
Going from one worker to four lifts throughput from 6.6 to about 26 requests per
second, close to a straight line. The next limit is the database connection pool: each worker
holds a connection for the whole 1.5 second score, so a pool smaller than the number of jobs
running at once starves them. Enlarging the pool alone recovered everything below.
completion rate
13.3 to 19.8
jobs completed per second, up 49 percent
submit-to-scored time, slowest 1 percent
20.9 to 2.0
seconds, ten times better
jobs waiting for a database connection
20 to 0
connection starvation cleared
applications waiting in line
334 to 0
the backlog drains and the system keeps up
In the figure:
scored rate: jobs completed per second ·
workers: parallel scorer copies (1, then 4) ·
pool waiters: jobs blocked waiting for a free database connection ·
p99: slowest-1-percent submit-to-scored time
More workers raise throughput almost in a straight line, until too few database
connections starve them; enlarging the pool clears the wait and cuts the slowest time about
ten times.
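The starvation effect can be sketched with the same arithmetic as the single-worker limit: a job can only run while it holds a connection, so the pool caps how many jobs run at once. The pool sizes below (20, then 40) are assumptions chosen for illustration, since the source does not state them; a pool of 20 happens to give numbers close to the "before" figures above.

```python
def scored_rate(offered_rps, workers, jobs_per_worker, pool_size, scoring_seconds):
    """Completions per second when each running job holds one connection."""
    wanted = workers * jobs_per_worker
    running = min(wanted, pool_size)       # the pool caps jobs in progress
    capacity = running / scoring_seconds
    return min(offered_rps, capacity)      # cannot complete more than is sent

def pool_waiters(workers, jobs_per_worker, pool_size):
    """Jobs blocked on a connection when every worker slot is busy."""
    return max(0, workers * jobs_per_worker - pool_size)

# Assumed small pool: 4 workers want 40 connections but only 20 exist.
print(round(scored_rate(20, 4, 10, pool_size=20, scoring_seconds=1.5), 1))  # 13.3
print(pool_waiters(4, 10, pool_size=20))                                    # 20

# Assumed enlarged pool: the offered load becomes the limit again.
print(scored_rate(20, 4, 10, pool_size=40, scoring_seconds=1.5))            # 20
print(pool_waiters(4, 10, pool_size=40))                                    # 0
```

Under these assumptions the fix is a sizing rule: the pool needs at least as many connections as workers times jobs at once, or the extra worker slots sit idle.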
06
Load versus response time
flat, then a sharp bend near 6.6 requests per second
In the figure:
offered load: submissions sent in per second ·
end-to-end, slowest 1 percent: the slowest submit-to-scored time ·
the bend (knee): the load where response time suddenly shoots up
Response time stays flat until the saturation point near 6.6 requests per
second, then rises sharply; that bend is the limit of the system as built.