
How to Control LLM Output: Temperature, Top-P & Beyond

You rewrote the prompt five times. You added "be concise." You added "think step by step." You added "IMPORTANT" in capital letters. And the model is still giving you the same rambling, slightly-too-creative, occasionally-looping answer it gave you an hour ago.

What most people never find out is that the prompt is only half of the controls. There's a second set of knobs that sit after the model has done its thinking, at the exact moment it picks the next word. They're called sampling parameters, and most prompt engineers never touch them. Partly the docs explain them badly. Partly everyone keeps repeating the same vague advice ("turn temperature up for creativity") without telling you what's actually happening underneath.

So this is a field guide. We'll build one simple mental model, walk through every knob that matters, and finish with a copy-paste cheat sheet of settings per task. Every example is a real OpenRouter request body, so you can run it against any of the 300+ models they route to. The same core parameters work whether you're calling Claude, GPT, Llama, or Qwen, though a few knobs are provider-specific (see the FAQ).

The one mental model you need

When an LLM generates text, it doesn't decide on a whole sentence at once. It produces text one token at a time (a token is roughly a word-piece). And at each step, before it commits to anything, it produces a ranked list of every possible next token with a probability attached.

Say the prompt so far is "The weather today is". Internally, the model is holding something like this:

| token | probability |
| --- | --- |
| sunny | 0.40 |
| cloudy | 0.25 |
| warm | 0.15 |
| cold | 0.08 |
| nice | 0.05 |
| ... (thousands more) | 0.07 total |

That's it. That ranked list is the only thing every sampling parameter touches. None of these knobs change what the model knows. They reshape the list right before a token gets drawn from it: flatten it, sharpen it, trim its tail, tax certain entries.

Once you can picture that table, every parameter stops being folklore:

  • Temperature changes the contrast of the curve (peaky vs. flat).
  • Top-k / top-p truncate the list (chop off the unlikely tail).
  • Penalties tax tokens that already appeared.
  • Seed fixes the dice roll so the draw is repeatable.
  • Logprobs just hand you the numbers in that table so you can see them.

Keep the table in your head. Everything below is an operation on it.
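To make the table concrete, here is a minimal Python sketch that stores it as a dictionary and draws from it. It is a toy, not how any inference engine is implemented: the `<tail>` entry lumps the thousands of unlikely tokens into one bucket, and `draw_token` is a hypothetical helper name.

```python
import random

# The weather table from above; "<tail>" stands in for the thousands of unlikely tokens.
NEXT_TOKEN_PROBS = {
    "sunny": 0.40,
    "cloudy": 0.25,
    "warm": 0.15,
    "cold": 0.08,
    "nice": 0.05,
    "<tail>": 0.07,
}

def draw_token(probs, rng):
    """Draw one token, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the experiment is repeatable
samples = [draw_token(NEXT_TOKEN_PROBS, rng) for _ in range(10_000)]
share_sunny = samples.count("sunny") / len(samples)
print(f"sunny drawn {share_sunny:.0%} of the time")
```

Over many draws the share of "sunny" settles near its 0.40 probability. Every parameter below changes the weights before this draw happens.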

Temperature: a contrast dial, not a creativity dial

Range: 0.0–2.0. Default: 1.0.

Everyone calls temperature the "creativity" knob. That's misleading. What temperature actually controls is the contrast of the probability curve before a token is drawn.

  • Low temperature (→ 0) sharpens the curve. The gap between  sunny (0.40) and  cloudy (0.25) widens, so the model almost always grabs the top token. Output becomes predictable and repeatable.
  • High temperature (→ 2) flattens the curve.  cold and  nice climb closer to  sunny, so unusual choices get a real shot. Output becomes more diverse, and past a point it goes incoherent.

At temperature 0 the model becomes effectively deterministic. It always takes the single most likely token, which is what people mean by "greedy" decoding. That's what you want for extraction, classification, and anything where there's a correct answer.

{
  "model": "mistralai/mistral-small",
  "messages": [{ "role": "user", "content": "Extract the invoice total as a number." }],
  "temperature": 0
}

Crank it up for ideation:

{
  "model": "mistralai/mistral-large",
  "messages": [{ "role": "user", "content": "Give me 10 unexpected names for a coffee brand." }],
  "temperature": 1.1
}

As a rough rule: below 0.3 for facts and structure, 0.7–1.0 for general chat, 1.0–1.4 for creative work. Above about 1.5 it gets weird fast. Keep that range for brainstorming, never for production answers or anything that has to parse.
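The contrast effect is easy to see in a few lines of Python. This is a sketch under simplifying assumptions: it uses only the five named tokens from the weather table (so the values are renormalized), it treats temperature 0 as pure greedy decoding, and `apply_temperature` is an illustrative name. Raising each probability to the power 1/T and renormalizing is mathematically the same as dividing the logits by T before the softmax.

```python
import math

# The five named tokens from the weather table; the long tail is left out to keep the toy small.
probs = {"sunny": 0.40, "cloudy": 0.25, "warm": 0.15, "cold": 0.08, "nice": 0.05}

def apply_temperature(probs, temperature):
    """Rescale a distribution as p ** (1 / T), then renormalize."""
    if temperature <= 0:
        # Treat T = 0 as greedy decoding: all probability mass on the top token.
        top = max(probs, key=probs.get)
        return {tok: (1.0 if tok == top else 0.0) for tok in probs}
    scaled = {tok: math.exp(math.log(p) / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}

for t in (0.5, 1.0, 2.0):
    reshaped = apply_temperature(probs, t)
    print(t, {tok: round(p, 2) for tok, p in reshaped.items()})
```

At 0.5 the top token pulls further ahead, and at 2.0 the bottom of the list closes the gap. The ranking never changes, only the contrast.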

Top-p and top-k: trimming the tail

Temperature reshapes the whole curve. Top-p and top-k do something different. They delete part of the list before sampling, so the genuinely silly options can never be picked at all.

Top-k (range: 0 and up, default 0 = off) keeps a fixed-size shortlist. top_k: 5 means "only ever consider the 5 most likely tokens, ignore everything else." Simple, but rigid: sometimes 5 is too many (when the model is very sure) and sometimes too few (when it's genuinely torn between 20 reasonable words).

Top-p, aka nucleus sampling (range: 0.0–1.0, default 1.0 = off) keeps a dynamic shortlist. top_p: 0.9 means "keep adding tokens from the top until their probabilities sum to 90%, then sample only from those." In our weather table, top_p: 0.9 keeps  sunny,  cloudy,  warm,  cold,  nice (0.40 + 0.25 + 0.15 + 0.08 + 0.05 = 0.93) and throws away the long tail of thousands of unlikely tokens.

Top-p is usually the better default because it adapts: when the model is confident, the nucleus is tiny; when it's uncertain, the nucleus grows to include all the reasonable candidates.

{
  "model": "mistralai/mistral-small",
  "messages": [{ "role": "user", "content": "Write a friendly support reply." }],
  "temperature": 0.7,
  "top_p": 0.9
}

There's one rule of thumb you'll see repeated almost everywhere, including OpenAI's API docs: tune temperature or top-p, not both. They both control diversity, and stacking aggressive values of each makes the result hard to reason about. Pick one as your main diversity knob and leave the other at or near its default, as the example above does with top_p: 0.9.
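Both truncation rules fit in a short Python sketch. The assumptions here are ours: the 0.07 tail is spread over 70 made-up tokens of 0.001 each, and `top_k_filter` and `top_p_filter` are illustrative names rather than any library's API.

```python
# Weather table, with the 0.07 tail spread over 70 hypothetical rare tokens.
probs = {"sunny": 0.40, "cloudy": 0.25, "warm": 0.15, "cold": 0.08, "nice": 0.05}
probs.update({f"tail_{i}": 0.001 for i in range(70)})

def ranked(probs):
    """Tokens sorted from most to least likely."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

def top_k_filter(probs, k):
    """Keep the k most likely tokens (k = 0 means off), then renormalize."""
    if k <= 0:
        return dict(probs)
    kept = ranked(probs)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest top-ranked set whose probabilities sum to at least p."""
    kept, running = [], 0.0
    for tok, prob in ranked(probs):
        kept.append((tok, prob))
        running += prob
        if running >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

print(list(top_k_filter(probs, 2)))    # fixed-size shortlist
print(list(top_p_filter(probs, 0.9)))  # shortlist sized by cumulative probability
```

With `top_p=0.9` the loop stops once the running total passes 0.9, which happens at "nice" (0.93), so all 70 tail tokens are dropped. Survivors are renormalized so they still sum to 1 before the draw.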

Min-p: the newer knob actually worth using

Top-k and top-p have been the defaults for years. The most interesting recent addition is min-p, introduced in a 2024 paper, Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs, that went on to land an oral at ICLR 2025. OpenRouter exposes it directly, so you can use it today.

Min-p (range: 0.0–1.0, default 0.0 = off) sets a floor relative to the top token. min_p: 0.1 means a token is only eligible if its probability is at least 10% of the most likely token's probability. The clever part is how that floor moves with the model's confidence:

  • When the model is sure (top token at 0.9), the bar sits at 0.09, so only genuinely strong candidates survive and the shortlist stays tight.
  • When the model is torn (top token at 0.2), the bar drops to 0.02, so more reasonable options stay in play.

That's the pitch: min-p keeps output coherent even when you push temperature high, because the weak tokens get filtered out before the heat flattens everything together. It pairs naturally with high temperature for creative work.

{
  "model": "mistralai/mistral-large",
  "messages": [{ "role": "user", "content": "Write the opening line of a noir detective story." }],
  "temperature": 1.5,
  "min_p": 0.05
}

One honest caveat. A 2025 follow-up, Min-p, Max Exaggeration, re-ran the evidence and argued the original quality and diversity gains don't really hold up. So treat min-p as a useful option for high-temperature creative generation, not a guaranteed upgrade over a well-tuned top-p. There's also top-a (range 0.0–1.0, default 0), a close cousin that scales its threshold by the top token's probability squared. It's on OpenRouter too, but far less used in practice.
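The moving floor is simple to sketch. The two distributions below are invented for illustration (one confident, one torn), and `min_p_filter` and `top_a_filter` are hypothetical helper names that follow the definitions given in this section, not any provider's actual implementation.

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * (top token's probability)."""
    floor = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= floor}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

def top_a_filter(probs, top_a):
    """Keep tokens with probability >= top_a * (top token's probability) ** 2."""
    floor = top_a * max(probs.values()) ** 2
    kept = {tok: p for tok, p in probs.items() if p >= floor}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Invented example distributions: one where the model is sure, one where it is torn.
confident = {"yes": 0.90, "yeah": 0.05, "sure": 0.03, "maybe": 0.02}
torn = {"a": 0.20, "b": 0.18, "c": 0.16, "d": 0.14, "e": 0.12,
        "f": 0.10, "g": 0.06, "h": 0.03, "i": 0.01}

print(len(min_p_filter(confident, 0.1)), "token(s) survive when the model is sure")
print(len(min_p_filter(torn, 0.1)), "token(s) survive when the model is torn")
print(len(top_a_filter(torn, 0.1)), "token(s) survive top-a on the torn case")
```

The same `min_p` value produces a floor of 0.09 in the confident case and 0.02 in the torn case. Squaring the top probability makes the top-a floor drop even faster when the model is unsure.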

The penalties: stopping the loop

Sometimes the problem isn't randomness at all. It's repetition. The model says "the the the," lists the same bullet three times, or keeps circling back to the same phrase. The penalty knobs exist for this. They tax tokens that have already appeared, and they work in subtly different ways:

  • Frequency penalty (range: −2.0 to 2.0, default 0) scales with how many times a token has appeared. The 5th repeat is penalized harder than the 2nd. Good for "stop saying the same word over and over."
  • Presence penalty (range: −2.0 to 2.0, default 0) is a flat tax applied once a token has appeared at all, regardless of count. Good for "push the model toward new topics and vocabulary."
  • Repetition penalty (range: 0.0–2.0, default 1.0) is a third variant, exposed by OpenRouter, that directly down-weights any token already seen. 1.0 is off; values like 1.1–1.3 gently discourage repeats.

{
  "model": "mistralai/mistral-medium-3",
  "messages": [{ "role": "user", "content": "Summarize this transcript without repeating yourself." }],
  "temperature": 0.7,
  "frequency_penalty": 0.4
}

One warning. Treat these as scalpels. Push them too high and the model starts avoiding words it genuinely needs, contorting sentences to dodge a banned token while the quality quietly drops. Start at 0.2–0.5 and only go higher if you can still see looping. The defaults are 0 for a reason, so reach for these only when you have an actual repetition problem in front of you.
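To see the difference between the two taxes, here is a sketch based on the formula OpenAI's API documentation gives: each token's logit is reduced by count × frequency_penalty, plus presence_penalty once if the token has appeared at all. That other providers behind OpenRouter use exactly this formula is an assumption, and the logits and history below are made up.

```python
from collections import Counter

def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Subtract a per-occurrence tax and a one-time tax from already-used tokens."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for tok, logit in logits.items():
        seen = counts[tok]
        adjusted[tok] = (
            logit
            - seen * frequency_penalty                    # grows with every repeat
            - (1.0 if seen > 0 else 0.0) * presence_penalty  # flat, applied once
        )
    return adjusted

# Made-up logits and generation history for illustration.
logits = {"the": 2.0, "a": 1.5, "cat": 1.0}
history = ["the", "the", "the", "a"]

print(apply_penalties(logits, history, frequency_penalty=0.4))
print(apply_penalties(logits, history, presence_penalty=0.4))
```

With the frequency penalty, "the" (used three times) is taxed three times as hard as "a". With the presence penalty, both pay the same flat amount, and "cat" is untouched either way.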

The frontier: samplers you'll meet in open source first

A wave of newer samplers is landing in local inference stacks (llama.cpp, vLLM, text-generation-webui) well ahead of the hosted APIs. You won't find most of them on a typical commercial endpoint yet, including OpenRouter, but they're worth knowing, because if you run open models they're already at your fingertips and they fix real failure modes:

  • Top-nσ (top-n-sigma). Introduced in late 2024, it sets the cutoff using the statistics of the raw logits instead of probabilities. Its headline trait is staying stable across temperatures, which is handy for reasoning tasks where you want one threshold that behaves the same whether the model runs hot or cold.
  • Typical sampling (locally typical). Instead of keeping the most probable tokens, it keeps tokens whose surprise is close to what the model expects on average, which tends to cut both the boring-obvious and the bizarre-rare choices. It got a fresh round of refinements in 2025.
  • DRY (Don't Repeat Yourself). A context-aware repetition suppressor that catches repeated phrases, not just repeated single tokens, so it kills the "model loops a whole sentence" failure that frequency penalty often misses.
  • XTC (Exclude Top Choices). It occasionally drops the single most likely token on purpose to force more surprising continuations. A deliberate creativity hack rather than a safety net.
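As one illustration of the list above, here is a sketch of the top-nσ idea as we understand it: keep only tokens whose raw logit lies within n standard deviations of the maximum logit. The logits are invented, the function names are hypothetical, and real implementations in local stacks may differ in detail.

```python
import statistics

def top_n_sigma_filter(logits, n=1.0):
    """Return the tokens whose logit is >= max_logit - n * std(logits)."""
    values = list(logits.values())
    threshold = max(values) - n * statistics.pstdev(values)
    return {tok for tok, logit in logits.items() if logit >= threshold}

def scale_by_temperature(logits, temperature):
    """Temperature divides every logit by the same constant."""
    return {tok: logit / temperature for tok, logit in logits.items()}

# Invented logits for illustration.
logits = {"sunny": 5.0, "cloudy": 4.5, "warm": 4.0, "cold": 1.0, "nice": 0.5, "zebra": -2.0}

for t in (0.5, 1.0, 2.0):
    kept = top_n_sigma_filter(scale_by_temperature(logits, t), n=1.0)
    print(t, sorted(kept))
```

Dividing every logit by T scales the maximum and the standard deviation by the same factor, so the surviving set is identical at every temperature. That is the stability property described in the first bullet.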

One practical note if you do run these locally: the order the samplers fire in changes the result, since each one reshapes the list before handing it to the next. Don't switch them all on at once and hope. The bigger point is the direction of travel. The field is moving away from fixed cutoffs and toward samplers that adapt to the model's own confidence. The knobs you can rely on across providers today (temperature, top-p, and now min-p) are the stable, portable subset of that same trend.

Determinism and inspection: the knobs nobody writes about

This is the half of the toolkit most guides skip entirely, and it's the half that turns sampling from a guessing game into something you can actually debug.

Seed: make it reproducible

seed (any integer). Sampling involves a dice roll. The seed fixes that roll, so the same request with the same seed and the same parameters returns the same output. This is essential for testing: without it, you can't tell whether a prompt change actually helped or whether you just got a different roll of the dice.

{
  "model": "mistralai/mistral-large",
  "messages": [{ "role": "user", "content": "Pick a random startup idea." }],
  "temperature": 0.9,
  "seed": 42
}

One caveat. Reproducibility here is best-effort, not a guarantee. It holds within a provider as long as nothing changes underneath you, but a model version update or an infrastructure change will break it, and some providers don't expose a stable seed at all (Anthropic, for one). Treat the seed as "stable enough for an afternoon of prompt iteration," not "stable forever," and re-baseline your test suite whenever you switch models.
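The dice-roll analogy maps directly onto a seeded random number generator. This local sketch uses Python's `random.Random` as a stand-in for whatever generator a provider uses server-side; it shows the principle, not any provider's behavior, and `sample_tokens` is an illustrative name.

```python
import random

probs = {"sunny": 0.40, "cloudy": 0.25, "warm": 0.15, "cold": 0.08, "nice": 0.05}

def sample_tokens(probs, seed, n=5):
    """Draw n tokens using a private generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, global random state untouched
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=n)

first_run = sample_tokens(probs, seed=42)
second_run = sample_tokens(probs, seed=42)
print(first_run)
print(second_run)
print("identical:", first_run == second_run)
```

Same seed, same distribution, same draws. Change the distribution (a new prompt, a new model version) and the same seed lands on different tokens, which is why a pinned seed only isolates the variable you are testing while everything else stays fixed.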

Logprobs: read the model's confidence

This is the knob almost no field guide mentions, and it's the most useful one for serious work. logprobs: true asks the API to return the log-probability the model assigned to each token it chose (exponentiate it to get back a probability between 0 and 1). Add top_logprobs: 5 and you also get the top 5 alternatives it considered at each step. It's that ranked table from the start of this article, handed straight to you.

{
  "model": "mistralai/mistral-small",
  "messages": [{ "role": "user", "content": "Is this email spam? Answer yes or no." }],
  "temperature": 0,
  "logprobs": true,
  "top_logprobs": 5
}

Why this matters in practice:

  • Confidence guards. If the model answers "yes" but the logprob shows it was a 51%-vs-49% coin flip against "no," you can route that case to a human or a stronger model instead of trusting it blindly. You've turned a confident-sounding string into a measurable confidence score.
  • Debugging weird answers. When output goes off the rails, top_logprobs shows you the exact moment the model started considering tokens it shouldn't, the fork in the road where it went wrong.
  • Cheaper classification. For yes/no or multiple-choice tasks, you can read the probabilities of the candidate tokens directly instead of generating and parsing free text.

Logprobs are off by default and add a little overhead, so turn them on when you're debugging or building a confidence-gated pipeline, not on every call.
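A confidence guard like the one in the first bullet can be sketched in a few lines. The token entries below are hand-written and only shaped like the per-token logprobs objects in an OpenAI-style chat completion response (a `token`, a `logprob`, and a `top_logprobs` list). The numbers are invented, and `label_confidence`, `route`, and the 0.9 threshold are illustrative choices, not recommendations from any provider.

```python
import math

# Hand-written, invented entries shaped like one token's logprobs in a response.
close_call = {
    "token": "yes",
    "logprob": -0.675,
    "top_logprobs": [
        {"token": "yes", "logprob": -0.675},
        {"token": "no", "logprob": -0.715},
    ],
}
clear_call = {
    "token": "no",
    "logprob": -0.02,
    "top_logprobs": [
        {"token": "no", "logprob": -0.02},
        {"token": "yes", "logprob": -4.6},
    ],
}

def label_confidence(token_info, labels=("yes", "no")):
    """Return {label: probability}, normalized over the candidate labels only."""
    found = {}
    for alt in token_info["top_logprobs"]:
        label = alt["token"].strip().lower()
        if label in labels:
            found[label] = found.get(label, 0.0) + math.exp(alt["logprob"])
    if not found:
        raise ValueError("none of the candidate labels appear in top_logprobs")
    total = sum(found.values())
    return {label: p / total for label, p in found.items()}

def route(token_info, threshold=0.9):
    """Auto-accept confident answers; send coin flips to a human."""
    confidence = label_confidence(token_info)
    best = max(confidence, key=confidence.get)
    decision = "auto" if confidence[best] >= threshold else "human_review"
    return decision, best

print(route(close_call))
print(route(clear_call))
```

The first case is the 51%-vs-49% coin flip and gets routed to review even though the generated text says "yes". The second clears the threshold and is accepted automatically.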

The cheat sheet

This is the part to bookmark. Sensible starting points per task, then adjust by feel:

| Task | temperature | top_p | penalties | notes |
| --- | --- | --- | --- | --- |
| Extraction / classification | 0 | default | none | deterministic; add logprobs for confidence gating |
| Factual Q&A / RAG | 0.1–0.3 | default | none | low temp keeps it grounded in the context |
| Structured output (JSON) | 0–0.2 | default | none | low temp = fewer schema breaks; pair with structured-output mode |
| Code generation | 0.1–0.4 | default | none | enough variety to find a solution, not enough to hallucinate APIs |
| General chat / support | 0.7 | 0.9 | frequency: 0.2 | warm but reliable; gentle penalty kills canned repetition |
| Creative writing | 1.0–1.3 | 0.95 | presence: 0.3 | high diversity; or go hotter with min_p: 0.05 instead of top-p |
| Brainstorming / ideation | 1.1–1.5 | default | frequency: 0.5 | maximum spread; expect some duds, that's the point |

Two habits that go with the table: tune one diversity knob (temperature, top-p, or min-p, not several at once), and always pin a seed while you're iterating so you're comparing prompts, not dice rolls.
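If you'd rather keep the table in code than in a bookmark, one option is a small preset map. This is a sketch: where the table gives a range, the single value below is our own pick from inside that range, and `PRESETS` and `build_request` are hypothetical names. The function only builds the request body and does not send it.

```python
import json

# One starting value per task, picked from inside the ranges in the table above.
PRESETS = {
    "extraction": {"temperature": 0},
    "factual_qa": {"temperature": 0.2},
    "structured_output": {"temperature": 0.1},
    "code": {"temperature": 0.2},
    "chat": {"temperature": 0.7, "top_p": 0.9, "frequency_penalty": 0.2},
    "creative": {"temperature": 1.2, "top_p": 0.95, "presence_penalty": 0.3},
    "brainstorm": {"temperature": 1.3, "frequency_penalty": 0.5},
}

def build_request(model, prompt, task, seed=None, **overrides):
    """Build a chat-completions request body from a task preset."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    body.update(PRESETS[task])  # task defaults from the cheat sheet
    body.update(overrides)      # per-call tweaks win over the preset
    if seed is not None:
        body["seed"] = seed     # pin the dice roll while iterating
    return body

request = build_request(
    "mistralai/mistral-small",
    "Extract the invoice total as a number.",
    task="extraction",
    seed=42,
)
print(json.dumps(request, indent=2))
```

Keeping presets in one place also makes the "one diversity knob" habit easy to audit, because every task's settings sit side by side.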

Why this matters when you're automating real processes

A demo tolerates a flaky answer. A process doesn't. The moment an LLM stops being a chat window and becomes a step in a workflow (reading invoices, triaging support tickets, routing a contract, populating a CRM), "mostly right" turns into silent, compounding errors three steps downstream. This is the work we do at LayerX: wiring LLMs into the operational guts of a business so processes run themselves. It's also where sampling stops being a curiosity and becomes load-bearing.

Three places the knobs above directly change automation outcomes:

  • Determinism makes pipelines testable. An automated step you can't reproduce is an automated step you can't trust. Running extraction and classification at temperature: 0, with a pinned seed during development, means the same input gives the same output as long as the model and provider stay fixed. Your regression tests start to mean something, and a run that passed yesterday still passes today unless something changed underneath you.
  • Structured steps need low-variance settings. When an LLM's job is to emit JSON that the next system consumes, a stray creative token isn't color. It's a parse failure that halts the line. Low temperature plus structured-output mode is the difference between a workflow that runs unattended and one that needs a babysitter.
  • Logprobs build the human-in-the-loop seam. The hardest part of automating a process isn't the 90% the model handles. It's gracefully escaping the 10% it shouldn't. Reading the model's confidence lets you auto-approve the clear cases and route the genuinely uncertain ones to a person, and that one seam is often what makes an automation safe enough to actually deploy.

The pattern we keep seeing is teams reaching for a bigger model or a fine-tune to fix reliability problems that were really sampling problems. Dialing in determinism and confidence-gating on the model you already have is faster, cheaper, and usually enough.

FAQ

What's the single most important parameter to get right?
Temperature. For anything with a correct answer, like extraction, classification, or structured output, set it to 0 and most of your reliability problems disappear. Everything else is minor adjustment around that one decision.

Should I change temperature or top-p?
Pick one and leave the other at its default. Both control diversity, so tuning them together makes results nearly impossible to reason about or reproduce. Most people are well served by temperature alone; reach for top-p when you want to keep a consistent temperature but more aggressively cut off the weird long tail.

Does temperature 0 guarantee identical output every time?
Close, but it's not a contract. It removes the deliberate randomness, so output is effectively deterministic, though model version updates and provider infrastructure can still shift results over time. Pin a seed too, and re-baseline your tests when you change models or providers.

My model keeps repeating itself. Which knob fixes that?
Start with a small frequency_penalty (0.2–0.5) to discourage reusing the same tokens, or presence_penalty to push it toward new topics. Go up only if you still see looping. Too much penalty makes the model dodge words it actually needs, and quality drops.

Do these parameters work the same across different models?
The concepts are universal, since every model samples from a next-token distribution. The exact knobs exposed vary by provider. Top-k isn't available on OpenAI models, for instance, and Anthropic doesn't expose a stable seed. Routing through OpenRouter gives you one consistent parameter surface across providers, which is why we used it for every example here.

Will the right sampling settings fix hallucinations?
They reduce a class of them. Low temperature keeps the model anchored to its highest-confidence, most-grounded tokens, which cuts down on invented details, especially in RAG where you want it to stick to the retrieved context. But sampling can't add knowledge the model doesn't have. For that you still need retrieval, grounding, or fine-tuning.

Should I tune sampling before fine-tuning?
Almost always, yes. Sampling is a config change with no data, no training run, and no deployment cost. Exhaust it first. A large share of "the model isn't good enough" turns out to be "the model wasn't configured for the job."

The takeaway

Sampling is the cheapest lever in your entire stack. It's a config change, with no new data, no training run, and no deployment. And yet most teams skip straight past it. They exhaust their prompt-engineering ideas, get frustrated, and jump to fine-tuning.

Try it the other way around. Before you spin up a fine-tuning pipeline, make sure you've actually controlled the model you already have. Once you hold the mental model, that every knob is an operation on the next-token distribution, the parameters stop being magic. Temperature is contrast. Top-p is a trim. Penalties are a tax. Seed is reproducibility. Logprobs are x-ray vision.

That's the control room behind the prompt. Most people never walk into it. Now you can 💪
