Why AI Chatbots Give Different Responses to the Same Prompt, and How Engineers Can Fix It

Engineers propose batch-invariant kernels and a deterministic inference mode to eliminate LLM nondeterminism caused by batch-dependent floating-point reductions. The approach, from Thinking Machines Lab and Horace He, made 1,000 Qwen completions identical while trading a modest amount of performance for reproducibility.

Summary
  • LLM nondeterminism arises from batch-invariance failures and floating-point rounding in GPU/TPU kernels

  • Nondeterministic outputs undermine reproducibility, RLHF stability, debugging, alignment, and scientific audits

  • Batch-invariant kernels plus deterministic inference mode ensure consistent outputs across batch sizes

  • With default inference, Qwen produced roughly 80 distinct outputs across 1,000 runs; deterministic mode yielded 1,000 identical completions

Large language models (LLMs), despite being a breakthrough in AI innovation, come with their fair share of shortcomings. Companies that deploy LLMs are working to address problems such as hallucinations, bias, discriminatory outputs, and security and privacy risks. Among these challenges is a lesser-discussed but equally important problem called ‘nondeterminism.’

Nondeterminism in LLMs refers to the unpredictability or variability of outputs even when the same input is provided. This is why submitting the same prompt twice to a chatbot can yield different responses.

This surprising discrepancy is not magic but an engineering issue in how GPU/TPU kernels handle grouped work. A blog post from Thinking Machines Lab, written with Horace He, who works on PyTorch at Meta, demonstrates a practical fix for the issue.

Why is Nondeterminism a Problem?

LLMs are supposed to give the same output if you ask the same thing under exactly the same settings. But in practice, even with the temperature set to 0 (which is supposed to force “greedy”, most-likely-token decoding and so should be deterministic), you often get different results each time.
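
For intuition, here is a minimal sketch, not tied to any particular inference engine, of how temperature-0 decoding is typically implemented. The temperature-0 branch is a plain argmax, which on its own is perfectly deterministic.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float) -> int:
    """Pick the next token id from a 1-D vector of logits."""
    if temperature == 0.0:
        # Greedy decoding: always take the single most likely token, so in
        # principle the same logits always produce the same token.
        return int(torch.argmax(logits))
    # Otherwise, sample from the temperature-scaled distribution.
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```

The catch, as the sections below explain, is that the logits feeding that argmax can themselves differ slightly from run to run, so near-tied tokens can flip and the decoded text diverges from there.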

This inconsistency poses a challenge for reproducibility, which is critical for tasks such as writing scientific papers, debugging and ensuring alignment between training and inference behaviour.

In reinforcement learning from human feedback (RLHF) and other policy-training settings, it is crucial that inference behaviour matches training behaviour. If inference is inconsistent, then “on-policy” training becomes messy and can drift away from the conditions the policy was trained under, undermining learning and stability. (Thinking Machines Lab)

Moreover, for debugging, alignment, audits and scientific reproducibility, being able to guarantee exactly which output a given prompt will produce is extremely valuable. Deterministic inference makes it far easier to trace errors, reproduce experiments, verify compliance and ensure that model behaviour remains predictable across deployments. (Thinking Machines Lab)

Cause of Nondeterminism: Lack of Batch Invariance

The authors of the blog argue that the primary, often-overlooked cause of nondeterminism is that kernels do not maintain batch invariance.

Batch invariance means that the output for a particular input (from a batch) should not depend on how many other inputs are processed at the same time (the batch size) or on how the work is split internally.

If a kernel is batch-invariant, then whether you process one item or 1,000 items, you should get the same result for that one item, all else being equal.
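
This is easy to observe directly in PyTorch. The snippet below, a small experiment in the spirit of the blog post's demonstration, computes the same row's matrix product once on its own and once as part of a larger batch; on many GPU kernels the two results are not bitwise identical (on CPU or other backends they may well match).

```python
import torch

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

D = 4096
A = torch.randn(2048, D, device=device)   # a "batch" of 2,048 input rows
B = torch.randn(D, D, device=device)      # a weight matrix

# The same logical computation for row 0, done with two different batch sizes.
out_single = torch.mm(A[:1], B)            # batch size 1
out_batched = torch.mm(A, B)[:1]           # batch size 2,048, then take row 0

print((out_single - out_batched).abs().max().item())  # often nonzero on GPU kernels
```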

In many current kernels on GPUs, TPUs and similar hardware, reduction operations (which combine many numbers, for example, sums, norms or dot products) behave differently depending on the batch size.

They may be split across threads or cores with different tiling strategies, which changes the order of floating-point additions and therefore alters rounding behaviour.
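
The effect is visible even outside any GPU kernel: floating-point addition is simply not associative, so regrouping the same terms changes the intermediate rounding.

```python
# Floating-point addition is not associative: grouping changes intermediate rounding.
print((1e20 + 1.0) - 1e20)   # 0.0 -> the 1.0 is lost when added to the huge term first
print((1e20 - 1e20) + 1.0)   # 1.0 -> the huge terms cancel first, so the 1.0 survives
```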

Those small changes in rounding can change the final results. Because batch size at inference time often fluctuates, due to varying concurrent user load, different numbers of requests being batched together and so on, model behaviour can appear nondeterministic from the user’s perspective, even if the forward-pass kernels are deterministic for any fixed batch size.

Proposed Solution

To make inference truly reproducible and deterministic, the authors propose several changes.

First, make all kernels batch-invariant so that reduction operations (such as sums, RMSNorm, matrix multiplications, attention and similar) produce the same results regardless of batch size or how the input is split. For example, always use a fixed split size in attention’s “split-KV” or FlashDecode-style kernels rather than allowing the splitting strategy to vary with batch size or query length.
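
As a toy illustration of that principle (not the actual attention kernel), compare a reduction whose split count is chosen dynamically with one that always uses a fixed split size, so the addition order never changes for a given input.

```python
import torch

def reduce_with_variable_splits(x: torch.Tensor, num_splits: int) -> torch.Tensor:
    # Split count chosen dynamically (e.g. by a scheduler reacting to batch size
    # or load), so the order of additions changes whenever num_splits changes.
    return torch.stack([chunk.sum() for chunk in x.chunk(num_splits)]).sum()

def reduce_with_fixed_split_size(x: torch.Tensor, split_size: int = 256) -> torch.Tensor:
    # Always reduce in chunks of the same fixed size, regardless of how much other
    # work is in flight, so the addition order is identical on every call.
    return torch.stack([chunk.sum() for chunk in x.split(split_size)]).sum()

x = torch.randn(10_000, dtype=torch.float32)
print(reduce_with_variable_splits(x, 3).item(),
      reduce_with_variable_splits(x, 8).item())   # may differ in the last bits
print(reduce_with_fixed_split_size(x).item(),
      reduce_with_fixed_split_size(x).item())     # always identical
```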

Second, adopt consistent kernel implementations that use the same tiling and arithmetic strategies regardless of batch dimensions and avoid dynamically switching algorithms when those dimensions change. In the attention module, for instance, ensure that the layout and the way you fetch and reduce over K and V remain consistent even when the number of cached tokens differs.

Finally, provide a “deterministic mode” in inference engines. The authors implemented such a mode on top of vLLM using batch-invariant kernels, and they supply a library that can be plugged into PyTorch and other inference stacks.
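
Using such a mode would then look roughly like the sketch below. The module and function names here are placeholders standing in for the released library, not its documented API, so consult the authors' code for the real entry points.

```python
import torch

# Hypothetical usage sketch: the import below is an assumed placeholder for the
# authors' batch-invariant kernel library, not a documented interface.
from batch_invariant_ops import set_batch_invariant_mode  # assumed name

model = torch.nn.Linear(4096, 4096, device="cuda")
x = torch.randn(8, 4096, device="cuda")

# Inside the context, matmuls and reductions are meant to dispatch to
# batch-invariant kernels, so each row's result no longer depends on batch size.
with set_batch_invariant_mode():
    y = model(x)
```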

Experiment on Qwen Model

In an experiment with the Qwen model, the Thinking Machines team generated 1,000 completions at temperature 0. Using the default (non-batch-invariant) inference, they observed roughly 80 distinct outputs among the 1,000 runs. By contrast, with the batch-invariant implementation, all 1,000 completions were identical.
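
A check along these lines can be reproduced against any OpenAI-compatible endpoint, such as a local vLLM server. The sketch below is illustrative only; the server URL, model name and prompt are placeholder assumptions to adapt to your own setup.

```python
from collections import Counter
from openai import OpenAI

# Assumes a local vLLM (or other OpenAI-compatible) server; the URL and model
# name are placeholders for your own deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Qwen/Qwen3-8B"  # placeholder model name

prompt = "Explain why the sky is blue."
completions = []
for _ in range(1000):
    resp = client.completions.create(
        model=MODEL,
        prompt=prompt,
        temperature=0,
        max_tokens=200,
    )
    completions.append(resp.choices[0].text)

# Nondeterminism shows up when the server batches your request with varying
# amounts of other traffic, so in practice send these requests concurrently or
# against a loaded server rather than strictly one at a time.
distinct = Counter(completions)
print(f"{len(distinct)} distinct completions out of {len(completions)}")
```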

There is a performance cost to deterministic inference when using unoptimised batch-invariant kernels. For the Qwen-3-8B model, the authors report the following timings:

  • Default inference: about 26 seconds

  • Unoptimised deterministic inference: about 55 seconds

  • Optimised deterministic inference: about 42 seconds

In short, achieving true reproducibility requires some slowdown, but the overhead is modest and feasible in practice.
