Application Scoring · Performance

A backend for "submit your application and we'll evaluate it." An applicant submits, the system puts the application in a waiting line, and a worker scores the candidate (in the real version, by calling a large language model) and stores a ranked result. Scoring is slow and rate-limited, so we accept instantly (an HTTP 202 Accepted response: received now, scored later) and process behind the waiting line at whatever rate the workers and the model can sustain.

Why this domain: Mercor is a talent-matching marketplace, so "take in candidate applications, score and rank them with a large language model, serve a leaderboard" maps onto the real business. That is what makes a slow, rate-limited scoring step believable rather than contrived, and it is what justifies the whole waiting-line and back-pressure design.

Honest scope: the scoring is a modeled dependency. Simulated mode waits 1.5 seconds to stand in for the model's response time; real mode does call Claude, but was run only a few times. The deliverable is how the system behaves around that slow step under load, not a real scoring product.

Input · submit (answered instantly)

POST /applications
{ "candidate": "alice",
  "payload": "5 years backend, Go and Python" }

202 Accepted
{ "id": "3737...", "status": "pending" }

Result · after background scoring

GET /applications/3737...
{ "status": "scored", "score": 58.68 }

GET /leaderboard
[ { "candidate": "sam",   "score": 91.2 },
  { "candidate": "kira",  "score": 88.0 },
  { "candidate": "alice", "score": 58.7 } ]

POST sends one application and returns immediately; GET reads a single result, or the whole ranked leaderboard. Scores shown are example shapes, not measured output.
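The leaderboard read can be sketched as a sort over the applications that have finished scoring. This is a minimal illustration, not the project's code: the function name `leaderboard`, the record layout, and the rounding to one decimal are assumptions, and the candidate "bo" is a hypothetical pending entry added to show that unscored applications are left out. The scores are the example shapes from above, not measured output.

```python
def leaderboard(applications):
    """Return scored applications ranked from highest to lowest score."""
    # Only applications that have finished scoring can be ranked.
    scored = [a for a in applications if a["status"] == "scored"]
    ranked = sorted(scored, key=lambda a: a["score"], reverse=True)
    # Rounding to one decimal is an assumption made for display.
    return [{"candidate": a["candidate"], "score": round(a["score"], 1)}
            for a in ranked]


example = [
    {"candidate": "alice", "status": "scored", "score": 58.68},
    {"candidate": "bo", "status": "pending", "score": None},  # hypothetical
    {"candidate": "sam", "status": "scored", "score": 91.2},
    {"candidate": "kira", "status": "scored", "score": 88.0},
]
board = leaderboard(example)
```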

01

Architecture

instant accept, scoring in the background

202 Accepted: the request is received now and scored later · reliable pull: a worker claims a job in a way that does not lose it if the worker crashes · N copies: the worker process is run many times in parallel
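One common way to get a "reliable pull" is a lease: a claimed job stays recorded as in flight, and it returns to the line if the worker does not acknowledge it before the lease expires. The sketch below is an in-memory illustration of that idea only. The class `LeasedQueue`, its method names, and the lease length are hypothetical, and the source does not say which mechanism the real system uses.

```python
import time


class LeasedQueue:
    """In-memory queue where a claimed job reappears if it is never acknowledged."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.ready = []        # job ids waiting to be claimed, oldest first
        self.in_flight = {}    # job id -> time at which its lease expires

    def put(self, job_id):
        self.ready.append(job_id)

    def claim(self, now=None):
        now = time.monotonic() if now is None else now
        # A worker that crashed never acknowledges; its expired jobs go back in line.
        for job_id, expires_at in list(self.in_flight.items()):
            if expires_at <= now:
                del self.in_flight[job_id]
                self.ready.append(job_id)
        if not self.ready:
            return None
        job_id = self.ready.pop(0)
        self.in_flight[job_id] = now + self.lease_seconds
        return job_id

    def ack(self, job_id):
        # Called only after the score is stored, so the job is removed for good.
        self.in_flight.pop(job_id, None)
```

The `now` parameter exists so the behavior can be exercised without waiting: claim a job at time 0, skip the acknowledgement, and the same job can be claimed again once the lease has passed.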

The cheap step (accept and add to the line) is split from the slow step (scoring), so the front door stays fast while scoring is processed in the background and scaled on its own.
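The split between the cheap step and the slow step can be sketched in one process with a thread-safe queue. This is an assumption-heavy illustration, not the real services: `submit`, `score`, `worker`, and `start_workers` are hypothetical names, the in-memory dictionary stands in for the data store, the placeholder score is just the payload length, and the sleep is shortened from the 1.5 seconds of simulated mode so the example runs quickly.

```python
import queue
import threading
import time
import uuid

SCORING_SECONDS = 0.05   # stand-in; simulated mode in the report waits 1.5 s

applications = {}        # stand-in for the data store
pending = queue.Queue()  # the waiting line
lock = threading.Lock()


def submit(candidate, payload):
    """Cheap step: record the application, enqueue it, reply at once."""
    app_id = uuid.uuid4().hex
    with lock:
        applications[app_id] = {"candidate": candidate, "payload": payload,
                                "status": "pending", "score": None}
    pending.put(app_id)
    return 202, {"id": app_id, "status": "pending"}


def score(payload):
    """Slow step: placeholder for the model call."""
    time.sleep(SCORING_SECONDS)
    return float(len(payload))


def worker():
    while True:
        app_id = pending.get()
        with lock:
            payload = applications[app_id]["payload"]
        result = score(payload)  # the lock is not held during the slow call
        with lock:
            applications[app_id].update(status="scored", score=result)
        pending.task_done()


def start_workers(count):
    """Run several worker copies in parallel (the 'N copies' idea)."""
    for _ in range(count):
        threading.Thread(target=worker, daemon=True).start()
```

Calling `start_workers(2)` and then `submit("alice", "5 years backend, Go and Python")` returns the 202 reply before any scoring has happened; `pending.join()` blocks until the background work is done.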

02

Pieces and how they run

containers on one machine; load generator outside

container: one isolated, packaged process · OpenTelemetry: a library inside each service that emits metrics and traces · load generator: a tool that sends test traffic to find the limits

Figure legend: our services (run many copies) · data stores · monitoring · load generator (outside)

Everything runs as containers on one machine and one private network; only the load generator sits outside, standing in for real client traffic hitting the public entry point.

03

Performance overview

one worker, load raised from 2 to 14 requests per second

capacity of one worker

6.6

requests per second: 10 concurrent jobs divided by the 1.5-second scoring time (10 / 1.5 ≈ 6.67, shown here as 6.6)

measured completion rate

6.64

jobs completed per second, sitting right at that maximum

applications waiting in line

0 to 389

flat below the limit, then growing without bound above it

submit-to-scored time, slowest 1 percent

2 to 34

seconds, climbing as the waiting line grows

In the figure: requests per second: submissions sent in per second · offered rate: submissions sent in per second · scored rate: jobs completed per second · queue depth: applications waiting · p50 / p99: the typical and the slowest-1-percent submit-to-scored time

Up to about 6.6 requests per second the system keeps pace; past that the waiting line grows without bound and the slowest-1-percent submit-to-scored time climbs from 2 seconds to 34 seconds.
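The capacity figure and the growth of the waiting line follow from simple arithmetic, sketched here. The function names are hypothetical; the inputs (10 jobs at once, 1.5 seconds per score, loads between 2 and 14 requests per second) come from this section. The model assumes a steady offered rate and ignores ramp-up, so it is not meant to reproduce the measured depth of 389.

```python
def capacity_rps(concurrent_jobs, score_seconds):
    """Most jobs per second one worker can finish when every slot is busy."""
    return concurrent_jobs / score_seconds


def backlog_growth_per_second(offered_rps, capacity):
    """How fast the waiting line grows; zero while the worker keeps pace."""
    return max(0.0, offered_rps - capacity)


one_worker = capacity_rps(10, 1.5)  # 10 / 1.5, about 6.67 jobs per second
for offered in (2, 6, 8, 14):
    growth = backlog_growth_per_second(offered, one_worker)
    print(f"{offered} req/s offered: backlog grows by {growth:.2f} applications/s")
```

Below capacity the growth is zero, which is the flat part of the queue-depth curve; above it the line grows for as long as the load is held.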

04

Why it matters

same load, same moment

front-door reply time (slowest 5 percent)

8.45 milliseconds

the public entry point answers immediately

time to actually finish scoring (slowest 1 percent)

34 seconds

the real work, behind the waiting line

The slowest 5 percent of front-door replies take 8.45 milliseconds, while the slowest 1 percent of jobs take 34 seconds to finish under the same load. The overload never shows up at the entry point, so you have to watch the waiting line, not the reply time.
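Watching the waiting line can be as simple as checking whether successive queue-depth readings keep rising. The sketch below is one hypothetical check, not the monitoring used in the project: the function name, the `min_rise` threshold, and the sample readings are illustrative assumptions.

```python
def backlog_is_growing(depth_samples, min_rise=1):
    """True when every reading exceeds the previous one by at least min_rise."""
    if len(depth_samples) < 2:
        return False
    pairs = zip(depth_samples, depth_samples[1:])
    return all(later - earlier >= min_rise for earlier, later in pairs)


# Illustrative readings only, not measured data.
steady = [0, 1, 0, 2, 0]
overloaded = [0, 40, 120, 389]
print(backlog_is_growing(steady), backlog_is_growing(overloaded))
```

A check like this fires on overload even while the front-door reply time stays in single-digit milliseconds, which is the point of the comparison above.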

05

Scaling and the fix

four workers at 20 requests per second; only the database pool size changed

Going from one worker to four lifts throughput from 6.6 to about 26 requests per second, close to linear scaling. The next limit is the database connection pool: a worker holds one connection per job for the whole 1.5-second score, so a pool smaller than the number of concurrent jobs starves them. Enlarging the pool, with nothing else changed, produced every improvement below.

completion rate

13.3 to 19.8

jobs completed per second, up 49 percent

submit-to-scored time, slowest 1 percent

20.9 to 2.0

seconds, about ten times faster

jobs waiting for a database connection

20 to 0

connection starvation cleared

applications waiting in line

334 to 0

the backlog drains and the system keeps up

In the figure: scored rate: jobs completed per second · workers: parallel scorer copies (1, then 4) · pool waiters: jobs blocked waiting for a free database connection · p99: slowest-1-percent submit-to-scored time

Figure: Scaling workers and the database pool fix

More workers raise throughput almost linearly until too few database connections starve them; enlarging the pool clears the wait and cuts the slowest-1-percent time about tenfold.
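The starvation effect can be modeled by capping the number of jobs that actually run at the pool size. The sketch below is a back-of-the-envelope model, not the system's code. The pool sizes of 20 and 40 are assumptions: the report does not state them, and 20 was chosen because it happens to reproduce the 20 waiters and roughly 13.3 jobs per second shown above. Ten jobs per worker and the 1.5-second score come from the earlier capacity figure.

```python
def completion_rate(offered_rps, workers, jobs_per_worker, pool_size, score_seconds):
    """Jobs finished per second when each running job holds one connection."""
    wanted = workers * jobs_per_worker   # jobs the workers try to run at once
    running = min(wanted, pool_size)     # jobs that actually get a connection
    capacity = running / score_seconds
    return min(offered_rps, capacity)    # cannot finish more than is offered


def pool_waiters(workers, jobs_per_worker, pool_size):
    """Jobs blocked waiting for a free connection."""
    return max(0, workers * jobs_per_worker - pool_size)


# Assumed pool sizes: 20 before the fix, 40 after.
before = completion_rate(20, workers=4, jobs_per_worker=10, pool_size=20, score_seconds=1.5)
after = completion_rate(20, workers=4, jobs_per_worker=10, pool_size=40, score_seconds=1.5)
print(round(before, 1), pool_waiters(4, 10, 20), after, pool_waiters(4, 10, 40))
```

Under these assumptions the small pool limits four workers to the throughput of two, and the larger pool lets the system finish work as fast as it is offered at 20 requests per second.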

06

Load versus response time

flat, then a sharp bend near 6.6 requests per second

In the figure: offered load: submissions sent in per second · end-to-end, slowest 1 percent: the slowest submit-to-scored time · the bend (knee): the load where response time suddenly shoots up

Response time stays flat until the saturation point near 6.6 requests per second, then rises sharply; that bend is the limit of the system as built.
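The shape of that curve can be reproduced with a small simulation of one worker: 10 slots, a fixed 1.5-second score, and evenly spaced arrivals. This is a simplified model under stated assumptions (no jitter, no database, a 60-second run, first-come-first-served), and the function name `simulate_p99` is hypothetical. It shows where the bend comes from and is not a substitute for the measured figure.

```python
import heapq


def simulate_p99(offered_rps, duration_s=60.0, slots=10, score_s=1.5):
    """Slowest-1-percent submit-to-scored time for evenly spaced arrivals."""
    free_at = [0.0] * slots              # time at which each slot next becomes free
    heapq.heapify(free_at)
    latencies = []
    for i in range(int(offered_rps * duration_s)):
        arrival = i / offered_rps
        # The job starts when it arrives or when the earliest slot frees up.
        start = max(arrival, heapq.heappop(free_at))
        finish = start + score_s
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrival)
    latencies.sort()
    return latencies[int(0.99 * (len(latencies) - 1))]


for load in (2, 4, 6, 8, 10, 14):
    print(f"{load} req/s offered: p99 {simulate_p99(load):.1f} s")
```

In this model the time stays at the scoring time while the load is below 10 / 1.5 requests per second and grows with the length of the run once the load exceeds it, which matches the flat-then-bend shape described above.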