Jaypore Labs
Engineering

Self-consistency: when N=3 beats a smarter prompt

Generating multiple candidates and selecting the most consistent one beats a cleverer prompt for many tasks.

Yash Shah · April 24, 2026 · 3 min read

A team chasing the last few percentage points of accuracy on a math-reasoning feature spent a week iterating on the prompt. The prompt got more elaborate; accuracy ticked up one point. Then they tried generating 3 candidates per request and selecting the majority answer. Accuracy jumped eight points.

Self-consistency (sample N candidates, keep the most common answer) is a brute-force but effective technique for tasks where reasoning leads to a discrete answer.

The voting pattern

The pattern:

  • Generate N samples (typically 3-5) at non-zero temperature.
  • Each sample is a complete reasoning chain plus a final answer.
  • Tally answers across samples.
  • Pick the most common answer (or weight by confidence if available).
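The tally-and-pick step is small enough to sketch. A minimal version, assuming the final answer has already been extracted from each sample:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common answer across N sampled candidates.

    `answers` holds the final answers extracted from each sample,
    e.g. ["42", "42", "41"]. Counter breaks exact ties by insertion
    order, which is arbitrary; a production system should define an
    explicit tie-break rule.
    """
    return Counter(answers).most_common(1)[0][0]
```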

Works for tasks with discrete answers — classification, math, multiple-choice, structured-extraction.

Doesn't work as well for prose generation where there's no "answer" to vote on.

Cost shape

The cost is N times a single call. For N=3, triple the cost.

Worth it when:

  • Accuracy gain matters more than cost.
  • Task has discrete answers to vote on.
  • The latency budget can absorb it (the N calls can run in parallel).
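Because the N samples are independent, the calls can fan out in parallel and wall-clock latency stays close to a single call even though cost is N calls. A sketch, with a hypothetical `sample_answer` standing in for the real model call:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def sample_answer(prompt: str) -> str:
    # Placeholder: swap in a real model call at temperature > 0.
    # (Hypothetical stub so the sketch runs end to end.)
    return "4"

def self_consistent_answer(prompt: str, n: int = 3) -> str:
    # Fan the N samples out in parallel: cost is N single calls,
    # but latency is roughly one call.
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(sample_answer, [prompt] * n))
    return Counter(answers).most_common(1)[0][0]
```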

Failure modes

  • Tied votes. What's the tie-break rule? Define it.
  • Mode collapse. All samples produce the same wrong answer. Voting doesn't help; the underlying model is wrong.
  • Long-tail variance. Many distinct answers, no clear majority. The task may be too uncertain for self-consistency to fix.
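The first and third failure modes can be surfaced at vote time rather than discovered downstream. One way to make ties and no-majority cases explicit (the function and status labels are illustrative, not from the post):

```python
from collections import Counter

def vote_with_status(answers: list[str]) -> tuple[str, str]:
    """Pick a winner and label how decisive the vote was.

    Returns (answer, status): "majority" for a clear winner; "tie"
    when the top two counts match (tie-break here: first seen wins,
    but the caller is told); "no_consensus" when every answer is
    distinct (long-tail variance).
    """
    counts = Counter(answers).most_common()
    top_answer, top_count = counts[0]
    if top_count == 1 and len(answers) > 1:
        return top_answer, "no_consensus"
    if len(counts) > 1 and counts[1][1] == top_count:
        return top_answer, "tie"
    return top_answer, "majority"
```

Mode collapse is the one failure voting can't see: a unanimous wrong answer looks identical to a unanimous right one, so only an eval catches it.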

Reviewer signal

Cases where the samples disagreed (a 2-1 split at N=3, for example) are the team's signal:

  • Was the majority right?
  • Was the minority right?
  • Were both wrong?

These edge cases inform prompt iteration and eval-set growth.
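Capturing those splits is a one-line check at vote time. A sketch of flagging non-unanimous votes for review (the function shape is an assumption):

```python
from collections import Counter

def vote_and_flag(answers: list[str]) -> tuple[str, bool]:
    """Return (winner, needs_review).

    Any non-unanimous vote (a 2-1 split at N=3, say) is flagged so
    the full set of samples, not just the winner, can be stored for
    reviewers to judge which side was right.
    """
    counts = Counter(answers)
    winner, top = counts.most_common(1)[0]
    return winner, top < len(answers)
```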

A real benchmark

A team's reasoning task:

  • Base accuracy: 78%.
  • N=3 self-consistency: 86%.
  • N=5 self-consistency: 88%.
  • N=10 self-consistency: 89%.

Diminishing returns past N=5. Team shipped N=3 (best cost-quality balance for their use case).
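Those numbers track what independent-sample math predicts, and the gap is informative. If each sample were independently correct with probability p, majority-vote accuracy would be a binomial tail:

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """Probability an odd-N majority vote is correct, assuming each
    sample is independently correct with probability p. Independence
    is a strong assumption; real samples from one model correlate."""
    k_needed = n // 2 + 1  # votes needed for a strict majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_needed, n + 1))

# At p = 0.78 (the base accuracy above), independence predicts
# ~0.88 at N=3 and ~0.93 at N=5. The team measured 0.86 and 0.88;
# the shortfall is correlation between samples, and it is why
# returns flatten sooner than the independent-samples math says.
```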

What we won't ship

Self-consistency on prose tasks unless there's a selection mechanism that can actually compare prose candidates.

N=10+ for marginal accuracy gain when N=3 is already at the goal.

Self-consistency in latency-critical paths where N parallel calls don't fit the budget.

Skipping the eval to confirm the gain. The gain might be smaller than expected on your specific task.

Close

Self-consistency is brute force that often beats clever. For tasks with discrete answers, sample N and vote. The accuracy gain is usually worth the cost. Skip the cleverness; ship the votes.

We build AI-enabled software and help businesses put AI to work. If you're using self-consistency, we'd love to hear about it. Get in touch.

Tagged: LLM · Self-consistency · Engineering · Predictable Output · Sampling