Application Scoring

submit to queue to score to store  ·  simulated mode  ·  10 jobs at once  ·  1.5-second scoring

A backend for "submit your application and we'll evaluate it." An applicant submits, the system puts the application in a waiting line, and a worker scores the candidate (in the real version, by calling a large language model) and stores a ranked result. Scoring is slow and rate-limited, so we accept instantly (an HTTP 202 Accepted response: received now, scored later) and process behind the waiting line at whatever rate the workers and the model can sustain.

Why this domain: Mercor is a talent-matching marketplace, so "take in candidate applications, score and rank them with a large language model, serve a leaderboard" maps onto the real business. That is what makes a slow, rate-limited scoring step believable rather than contrived, and it is what justifies the whole waiting-line and back-pressure design.

Honest scope: the scoring is a modeled dependency. Simulated mode waits 1.5 seconds to stand in for the model's response time; real mode does call Claude, but was run only a few times. The deliverable is how the system behaves around that slow step under load, not a real scoring product.
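As an illustration of what simulated mode amounts to, here is a minimal sketch of a worker that scores at most 10 jobs at once and waits 1.5 seconds per job. The names (`simulated_score`, `run_worker`) and the placeholder scoring rule are assumptions made for this example, not the project's actual code.

```python
import asyncio
import random

MAX_IN_FLIGHT = 10      # "10 jobs at once" from the text
SCORING_SECONDS = 1.5   # simulated model response time from the text


async def simulated_score(payload: str, delay: float = SCORING_SECONDS) -> float:
    """Stand in for the model call: wait, then return a placeholder score."""
    await asyncio.sleep(delay)
    # Placeholder rule (an assumption): a repeatable pseudo-random score per payload.
    return round(random.Random(payload).uniform(0, 100), 2)


async def run_worker(payloads: list[str], delay: float = SCORING_SECONDS) -> dict:
    """Score every payload while keeping at most MAX_IN_FLIGHT jobs running."""
    gate = asyncio.Semaphore(MAX_IN_FLIGHT)
    stats = {"in_flight": 0, "peak": 0}
    scores: dict[str, float] = {}

    async def handle(payload: str) -> None:
        async with gate:  # blocks once MAX_IN_FLIGHT jobs are already running
            stats["in_flight"] += 1
            stats["peak"] = max(stats["peak"], stats["in_flight"])
            try:
                scores[payload] = await simulated_score(payload, delay)
            finally:
                stats["in_flight"] -= 1

    await asyncio.gather(*(handle(p) for p in payloads))
    return {"scores": scores, "peak_in_flight": stats["peak"]}


if __name__ == "__main__":
    # A short delay keeps the demo quick; the text uses 1.5 seconds.
    result = asyncio.run(run_worker([f"cv-{i}" for i in range(25)], delay=0.05))
    print(len(result["scores"]), "scored, peak in flight:", result["peak_in_flight"])
```

The semaphore is what turns "10 jobs at once" into a hard ceiling: an eleventh job waits until one of the ten finishes.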

Input · submit (answered instantly)
POST /applications
{ "candidate": "alice",
  "payload": "5 years backend, Go and Python" }

202 Accepted
{ "id": "3737...", "status": "pending" }

Result · after background scoring
GET /applications/3737...
{ "status": "scored", "score": 58.68 }

GET /leaderboard
[ { "candidate": "sam",   "score": 91.2 },
  { "candidate": "kira",  "score": 88.0 },
  { "candidate": "alice", "score": 58.7 } ]

POST sends one application and returns immediately; GET reads a single result, or the whole ranked leaderboard. Scores shown are example shapes, not measured output.
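The three endpoints above can be sketched with in-memory stand-ins. This is a hedged illustration only: the dictionary and deque replace Postgres and Redis, the function names are hypothetical, and both the order of "write pending, then add to line" and the 404 for an unknown id are assumptions.

```python
import uuid
from collections import deque

# In-memory stand-ins (assumptions); the text uses Postgres and Redis here.
applications: dict[str, dict] = {}   # plays the role of the results table
waiting_line: deque[str] = deque()   # plays the role of the waiting line


def submit_application(candidate: str, payload: str) -> tuple[int, dict]:
    """POST /applications: record as pending, add to the line, answer 202."""
    app_id = uuid.uuid4().hex
    applications[app_id] = {
        "candidate": candidate,
        "payload": payload,
        "status": "pending",
        "score": None,
    }
    waiting_line.append(app_id)
    return 202, {"id": app_id, "status": "pending"}


def get_application(app_id: str) -> tuple[int, dict]:
    """GET /applications/<id>: report status, plus the score once it exists."""
    row = applications.get(app_id)
    if row is None:
        return 404, {"error": "not found"}
    body = {"status": row["status"]}
    if row["status"] == "scored":
        body["score"] = row["score"]
    return 200, body


def leaderboard() -> list[dict]:
    """GET /leaderboard: scored applications, highest score first."""
    scored = [r for r in applications.values() if r["status"] == "scored"]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return [{"candidate": r["candidate"], "score": r["score"]} for r in scored]
```

Nothing in `submit_application` waits on scoring, which is why the reply can go out immediately.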

01

Architecture

instant accept, scoring in the background
202 Accepted: the request is received now and scored later  ·  reliable pull: a worker claims a job in a way that does not lose it if the worker crashes  ·  N copies: the worker process is run many times in parallel
Figure: Client (load generator) → ingest-api (accepts, returns 202) → Redis (waiting line) → worker (N copies, scores jobs) → Postgres (stores results). Arrow labels: POST /applications, add to line, reliable pull, write: pending, write: scored.

The cheap step (accept and add to the line) is split from the slow step (scoring), so the front door stays fast while scoring is processed in the background and scaled on its own.
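The "reliable pull" idea can be modeled in a few lines: a claimed job moves to a per-worker processing list and is dropped only after an explicit acknowledgement. The class below is an in-memory sketch with hypothetical names. With Redis, this pattern is commonly built on an atomic list-to-list move such as `LMOVE`, but the text does not say which mechanism the project uses.

```python
from collections import deque
from typing import Optional


class ReliableQueue:
    """In-memory model of a claim-then-acknowledge waiting line."""

    def __init__(self) -> None:
        self.waiting: deque[str] = deque()
        self.processing: dict[str, list[str]] = {}  # worker id -> claimed jobs

    def add(self, job_id: str) -> None:
        self.waiting.append(job_id)

    def claim(self, worker_id: str) -> Optional[str]:
        """Move the oldest job from the line to this worker's processing list."""
        if not self.waiting:
            return None
        job_id = self.waiting.popleft()
        self.processing.setdefault(worker_id, []).append(job_id)
        return job_id

    def ack(self, worker_id: str, job_id: str) -> None:
        """Forget the job; call this only after the result has been stored."""
        self.processing[worker_id].remove(job_id)

    def recover(self, dead_worker_id: str) -> int:
        """Put a crashed worker's unacknowledged jobs back at the front."""
        orphaned = self.processing.pop(dead_worker_id, [])
        self.waiting.extendleft(reversed(orphaned))
        return len(orphaned)
```

A job that was claimed but never acknowledged is still visible in `processing`, so a crash between claim and store does not lose it.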

02

Pieces and how they run

containers on one machine; load generator outside
container: one isolated, packaged process  ·  OpenTelemetry: a library inside each service that emits metrics and traces  ·  load generator: a tool that sends test traffic to find the limits
Figure: one machine, one private network. Services (our code): ingest-api (accept, add to line) and worker (4 copies, score jobs), each with the OpenTelemetry library. Data stores: Redis (waiting line) and Postgres (stored results). Monitoring: Prometheus (collects metrics), Grafana (charts), Jaeger (request traces). Load generator: k6.
Legend: our services (run many copies)  ·  data stores  ·  monitoring  ·  load generator (outside)

Everything runs as containers on one machine and one private network; only the load generator sits outside, standing in for real client traffic hitting the public entry point.

03

Performance overview

one worker, load raised from 2 to 14 requests per second
capacity of one worker: 6.6 requests per second (10 jobs at once divided by 1.5 seconds of scoring, about 6.67 before rounding)
measured completion rate: 6.64 jobs completed per second, sitting right at that maximum
applications waiting in line: 0 to 389, flat below the limit, then growing without bound above it
submit-to-scored time, slowest 1 percent: 2 to 34 seconds, climbing as the waiting line grows
In the figure:  requests per second: submissions sent in per second  ·  offered rate: sent in; scored rate: completed  ·  queue depth: applications waiting  ·  p50 / p99: the typical and the slowest-1-percent submit-to-scored time
Single-worker saturation

Up to about 6.6 requests per second the system keeps pace; past that the waiting line grows without bound and the slowest-1-percent submit-to-scored time climbs from 2 seconds to 34 seconds.
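The saturation point follows from simple arithmetic: jobs in flight divided by time per job. The sketch below works that out and adds the rate at which the line grows once the offered rate exceeds capacity. The function names are made up for this example, and the growth figure is a steady-state estimate, not a measurement from the run.

```python
def capacity_per_second(jobs_at_once: int, scoring_seconds: float) -> float:
    """Most jobs one worker can finish per second when it is always busy."""
    return jobs_at_once / scoring_seconds


def backlog_growth_per_second(offered_rate: float, capacity: float) -> float:
    """How fast the waiting line grows; zero while the worker keeps pace."""
    return max(0.0, offered_rate - capacity)


cap = capacity_per_second(10, 1.5)
print(round(cap, 2))                                 # 6.67
print(backlog_growth_per_second(2, cap))             # 0.0
print(round(backlog_growth_per_second(14, cap), 2))  # 7.33
```

At 14 requests per second, roughly 7 more applications join the line every second than leave it, which is why the depth never levels off above the limit.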

04

Why it matters

same load, same moment
front-door reply time (slowest 5 percent): 8.45 milliseconds, the public entry point answers immediately
time to actually finish scoring (slowest 1 percent): 34 seconds, the real work, behind the waiting line
The front door answers in 8.45 milliseconds while the same job takes 34 seconds to finish. The overload never shows up at the entry point, so you have to watch the waiting line, not the reply time.
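One way to act on this is to check the trend in queue depth rather than the reply time. The helper below is a minimal sketch with a hypothetical name and threshold; the intermediate sample depths in the usage lines are invented for illustration.

```python
def backlog_is_growing(depth_samples: list[int], min_rise: int = 1) -> bool:
    """True when queue depth rose between every pair of consecutive samples."""
    if len(depth_samples) < 2:
        return False
    pairs = zip(depth_samples, depth_samples[1:])
    return all(later - earlier >= min_rise for earlier, later in pairs)


print(backlog_is_growing([0, 1, 0, 0]))        # False: the line drains again
print(backlog_is_growing([10, 60, 150, 389]))  # True: depth keeps climbing
```

A check like this fires while the front door still looks healthy, because it reads the one signal that overload actually moves.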

05

Scaling and the fix

four workers at 20 requests per second; only the database pool size changed

Going from one worker to four lifts throughput from 6.6 to about 26 requests per second, close to linear scaling. The next limit is the database connection pool: each job in flight holds a connection for the whole 1.5-second scoring step, so a pool smaller than the number of jobs in flight leaves the rest waiting for a connection. Enlarging the pool, with nothing else changed, produced every improvement listed below.

completion rate: 13.3 to 19.8 jobs completed per second, up 49 percent
submit-to-scored time, slowest 1 percent: 20.9 to 2.0 seconds, about ten times better
jobs waiting for a database connection: 20 to 0, connection starvation cleared
applications waiting in line: 334 to 0, the backlog drains and the system keeps up
In the figure:  scored rate: jobs completed per second  ·  workers: parallel scorer copies (1, then 4)  ·  pool waiters: jobs blocked waiting for a free database connection  ·  p99: slowest-1-percent submit-to-scored time
Scaling workers and the database pool fix

More workers raise throughput almost linearly, until too few database connections starve them; enlarging the pool clears the wait and cuts the slowest-1-percent time about tenfold.
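The starvation effect can be worked through as arithmetic. If every job in flight holds one connection for the full scoring time, the pool size caps both the completion rate and the number of jobs left waiting. The sketch below assumes a single pool shared by all workers. The pool sizes of 20 and 40 are hypothetical: the text does not state them, although 20 happens to be consistent with the reported 13.3 jobs per second and 20 waiters.

```python
def pool_limited_rate(
    workers: int, jobs_per_worker: int, pool_size: int, hold_seconds: float
) -> tuple[float, int]:
    """Return (max completions per second, jobs waiting for a connection).

    Assumes the system is saturated, so every worker slot wants a connection.
    """
    wanted = workers * jobs_per_worker   # jobs that could be in flight
    holding = min(wanted, pool_size)     # jobs that actually get a connection
    waiters = wanted - holding           # jobs blocked on the pool
    return holding / hold_seconds, waiters


rate, waiters = pool_limited_rate(4, 10, pool_size=20, hold_seconds=1.5)
print(round(rate, 1), waiters)   # 13.3 20

rate, waiters = pool_limited_rate(4, 10, pool_size=40, hold_seconds=1.5)
print(round(rate, 1), waiters)   # 26.7 0
```

With the larger pool the ceiling rises above the offered 20 requests per second, so the completion rate is then set by the load rather than by the pool.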

06

Load versus response time

flat, then a sharp bend near 6.6 requests per second
In the figure:  offered load: submissions sent in per second  ·  end-to-end, slowest 1 percent: the slowest submit-to-scored time  ·  the bend (knee): the load where response time suddenly shoots up
Load versus response time

Response time stays flat until the saturation point near 6.6 requests per second, then rises sharply; that bend is the limit of the system as built.
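The shape of that curve can be reproduced with a simple fluid approximation: below capacity a job takes only its scoring time, and above capacity the last arrival also waits behind the backlog that built up. This is a sketch under stated assumptions, not a model of the measured run: the 30-second overload window is hypothetical, and because the real test ramped the load, the output is not expected to match the reported 34 seconds.

```python
def slowest_wait_estimate(
    offered_rate: float,
    capacity: float,
    service_seconds: float,
    overload_seconds: float,
) -> float:
    """Estimate submit-to-scored time for the last job to arrive."""
    backlog = max(0.0, offered_rate - capacity) * overload_seconds
    return service_seconds + backlog / capacity


CAPACITY = 10 / 1.5  # one worker: 10 jobs at once, 1.5 seconds each

for rate in (2, 4, 6, 8, 10, 12, 14):
    estimate = slowest_wait_estimate(rate, CAPACITY, 1.5, overload_seconds=30.0)
    print(f"{rate:>2} req/s -> {estimate:5.1f} s")
```

The estimate stays at 1.5 seconds for every rate below capacity and rises in proportion to the excess load above it, which is the bend described in the text.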