Why Claude Gives Different Answers to the Same Prompt — A Session Independence Experiment
We ran the same prompt across 5 independent Claude sessions with identical model and system prompt settings. Here's exactly where the outputs converged — and where they diverged.
Abstract
Audience: Marketers and AI practitioners who use Claude in repeated workflows.
Experimental setup: We ran the same prompt across 5 independent Claude sessions using an identical model (claude-sonnet-4-6) and identical system prompt settings.
Hypothesis: Advice drawn from high-confidence areas of the model's training would converge across sessions; judgment-dependent choices would diverge.
Limitation: All sessions shared the same system prompt, so this is not a blank-slate experiment. Testing with varied system prompts requires a separate experimental design.
Experimental Design
Fixed Variables
| Condition | Value |
|---|---|
| Model | claude-sonnet-4-6 |
| System prompt | Identical across all 5 sessions |
| Input text | Identical across all 5 sessions |
Independent Variable
| Condition | Value |
|---|---|
| Session count | 5 independent sessions |
| Shared context between sessions | None |
Input
Give the 3 most important pieces of advice for a startup CMO
adopting AI marketing tools for the first time.
Measurement Items
- Topic selection (what was chosen as advice)
- Advice ordering (which position each topic occupied)
- Intro sentence style
- Closing structure (whether a summary was included)
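Each session corresponds to one stateless API call: same model, same system prompt, no shared conversation history. Below is a minimal sketch of that protocol using the anthropic Python SDK; the SYSTEM_PROMPT string and max_tokens value are illustrative placeholders, not the exact settings used in this experiment.

```python
# Minimal sketch of the 5-session protocol using the anthropic Python SDK.
# SYSTEM_PROMPT and max_tokens are placeholders, not the experiment's exact settings.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "You are a marketing strategy assistant."  # placeholder
PROMPT = (
    "Give the 3 most important pieces of advice for a startup CMO "
    "adopting AI marketing tools for the first time."
)

outputs = []
for session in range(5):
    # Each call is a fresh request with no carried-over history,
    # so the sessions are independent by construction.
    response = client.messages.create(
        model="claude-sonnet-4-6",
        system=SYSTEM_PROMPT,
        max_tokens=1024,
        messages=[{"role": "user", "content": PROMPT}],
    )
    outputs.append(response.content[0].text)

for i, text in enumerate(outputs, start=1):
    print(f"--- Session {i} ---\n{text}\n")
```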
Raw Outputs
Session 1
- Fix your data foundation before touching any tools
- Limit automation scope to repeatable tasks first
- Define your ROI baseline before you start
Session 2
- Fix your data pipeline before anything else
- Draw a clear line between what AI can automate and where humans must stay involved
- Prove ROI on one use case before expanding
Session 3
- Clean your data before adding any AI layer
- Focus on one workflow at a time
- Build a human review process for all AI outputs
Session 4
- Start with the data pipeline — tools come second
- Design a validation process before you optimize for speed
- Prove ROI in one vertical before scaling
Session 5
- Start with your data infrastructure
- Design around workflows, not individual point tools
- Develop your team's judgment before letting the tools do it for them
Observations
| Topic | Sessions present |
|---|---|
| Data first (Advice #1) | 5/5 |
| Validation / human review process | 5/5 |
| Focused rollout / incremental ROI proof | 4/5 |
| Workflow-centric design (vs. point tools) | 1/5 |
Advice #1 converged across all 5 sessions — same meaning, different phrasing. Advice #2 and #3 drew from a shared topic pool but selected different combinations each time. Session 5's "workflow-centric design" angle appeared in no other session.
Intro sentences were distinct in all 5 sessions: pain-point framing, neutral declaration, action-oriented prompt — the same question produced different entry points.
Interpretation
LLMs generate text by sampling from probability distributions over tokens. Independent sessions produce independent sampling paths.1
"Data first" converged because this advice carries overwhelming probability weight in the training data for this question type. For advice positions 2 and 3, where multiple valid candidates compete, each session selected a different path.
Notably, Anthropic's own documentation acknowledges this directly:
"Even with temperature set to 0, the results will not be fully deterministic and identical inputs may produce different outputs across API calls." — Anthropic, Claude API Glossary2
Convergence signals high model confidence. Divergence signals competition between candidates.
Conclusion
A single session is sufficient for areas where the model strongly converges: established best practices, definitions, principled judgments.
Multiple sessions are warranted for areas requiring judgment or priority selection. Session 5 surfaced a perspective ("design around workflows, not point tools") that appeared in no other session; relying on a single session means missing valid alternatives.
Session independence is not a bug. Understanding this property lets you design more intentional AI-assisted workflows — running multiple sessions when coverage matters, trusting a single session when convergence is expected.3
Further Reading
Self-Consistency Improves Chain of Thought Reasoning in Language Models · Wang et al. · Google Brain · ICLR 2023
Introduces Self-Consistency: generating multiple reasoning paths via sampling and selecting the majority answer. Directly related to the multi-session strategy this experiment validates.
Non-Determinism of 'Deterministic' LLM Settings · Atil et al. · 2024
Empirically demonstrates that LLM outputs remain non-deterministic even at temperature=0, tracing the cause to floating-point non-determinism in hardware-level parallel computation.
The Effect of Sampling Temperature on Problem Solving in Large Language Models · Renze & Guven · 2024
Measures how temperature values affect problem-solving performance, showing that optimal temperature varies by task type.
Notes
1 Token sampling — At each generation step, the LLM computes a probability distribution (logits → softmax) over its vocabulary and samples the next token. Temperature controls distribution sharpness. Even at temperature=0, hardware-level floating-point non-determinism persists across calls.
2 Anthropic Claude API Glossary — docs.anthropic.com/en/docs/resources/glossary
3 Self-consistency in practice — Wang et al.'s Self-Consistency formalizes this as an algorithm: generate multiple sampling paths and take a majority vote. It consistently outperforms single-path generation on reasoning tasks. The multi-session strategy follows the same logic.
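A minimal sketch of that voting step, assuming each session's final answer has already been normalized to a comparable label (the answer strings below are invented placeholders):

```python
# Self-consistency-style aggregation: majority vote over independent sessions.
# The answer strings are invented; in practice each entry would be one
# session's normalized final answer.
from collections import Counter

session_answers = ["data first", "data first", "validation", "data first", "data first"]

votes = Counter(session_answers)
answer, count = votes.most_common(1)[0]
print(f"Majority answer: {answer} ({count}/{len(session_answers)} sessions)")
```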