SEM.ai
Labs #001 · Claude Behavior Series

Why Claude Gives Different Answers to the Same Prompt — A Session Independence Experiment

We ran the same prompt across 5 independent Claude sessions with identical model and system prompt settings. Here's exactly where the outputs converged — and where they diverged.

Abstract

Audience: Marketers and AI practitioners who use Claude in repeated workflows.

Experimental setup: The same prompt was run across 5 independent Claude sessions with identical model (claude-sonnet-4-6) and system prompt settings.

Hypothesis: High-confidence areas in the model's training would converge; judgment-dependent areas would diverge.

Limitation: All sessions shared the same system prompt, so this is not a blank-slate experiment. Testing with varied system prompts would require a separate experimental design.

Experimental Design

Fixed Variables

| Condition | Value |
| --- | --- |
| Model | claude-sonnet-4-6 |
| System prompt | Identical across all 5 sessions |
| Input text | Identical across all 5 sessions |

Independent Variable

| Condition | Value |
| --- | --- |
| Session count | 5 independent sessions |
| Shared context between sessions | None |

Input

Give the 3 most important pieces of advice for a startup CMO
adopting AI marketing tools for the first time.

Measurement Items

- Topic selection (what was chosen as advice)
- Advice ordering (which position each topic occupied)
- Intro sentence style
- Closing structure (whether a summary was included)
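The setup above can be sketched as a small harness. The names here (`run_sessions`, `call_model`) are illustrative, not from the experiment; in practice `call_model` would wrap a fresh, stateless Anthropic Messages API request, since one independent session is simply one stateless model call with the identical prompt:

```python
from typing import Callable

PROMPT = (
    "Give the 3 most important pieces of advice for a startup CMO "
    "adopting AI marketing tools for the first time."
)

def run_sessions(prompt: str, n_sessions: int,
                 call_model: Callable[[str], str]) -> list[str]:
    """Collect one completion per independent session.

    Each call shares no context with the others, mirroring the five
    independent Claude sessions in this experiment.
    """
    return [call_model(prompt) for _ in range(n_sessions)]
```

Because the model hook is injected, the same harness works against a stub for testing or against a real API client in production.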

Raw Outputs

Session 1

  1. Fix your data foundation before touching any tools
  2. Limit automation scope to repeatable tasks first
  3. Define your ROI baseline before you start

Session 2

  1. Fix your data pipeline before anything else
  2. Draw a clear line between what AI can automate and where humans must stay involved
  3. Prove ROI on one use case before expanding

Session 3

  1. Clean your data before adding any AI layer
  2. Focus on one workflow at a time
  3. Build a human review process for all AI outputs

Session 4

  1. Start with the data pipeline — tools come second
  2. Design a validation process before you optimize for speed
  3. Prove ROI in one vertical before scaling

Session 5

  1. Start with your data infrastructure
  2. Design around workflows, not individual point tools
  3. Develop your team's judgment before letting the tools do it for them

Observations

| Topic | Sessions present |
| --- | --- |
| Data first (Advice #1) | 5/5 |
| Validation / human review process | 5/5 |
| Focused rollout / incremental ROI proof | 4/5 |
| Workflow-centric design (vs. point tools) | 1/5 |

Advice #1 converged across all 5 sessions — same meaning, different phrasing. Advice #2 and #3 drew from a shared topic pool but selected different combinations each time. Session 5's "workflow-centric design" angle appeared in no other session.

Intro sentences were distinct in all 5 sessions, ranging from pain-point framing to neutral declaration to action-oriented prompts: the same question produced a different entry point each time.

Interpretation

LLMs generate text by sampling from probability distributions over tokens. Independent sessions produce independent sampling paths.[1]

"Data first" converged because this advice carries overwhelming probability weight in the training data for this question type. For advice positions 2 and 3, where multiple valid candidates compete, each session selected a different path.
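A toy simulation makes this concrete. The topic pools and probability weights below are illustrative assumptions, not measured values; the point is only that a dominant candidate converges across independent draws while competing candidates split:

```python
import random
from collections import Counter

# Illustrative (not measured) weights: advice slot #1 is dominated by
# one candidate, while later slots have several competing candidates.
SLOT_1 = {"data first": 0.92, "pick the right tool": 0.05, "hire a specialist": 0.03}
SLOT_2 = {"human review": 0.35, "limit scope": 0.30, "prove ROI": 0.25, "workflow design": 0.10}

def simulate_sessions(pool: dict[str, float], n: int, seed: int = 0) -> Counter:
    """Draw one topic per independent session from a weighted pool."""
    rng = random.Random(seed)
    topics, weights = zip(*pool.items())
    return Counter(rng.choices(topics, weights=weights, k=n))

slot1 = simulate_sessions(SLOT_1, 1000)  # converges on the dominant topic
slot2 = simulate_sessions(SLOT_2, 1000)  # spreads across competing topics
```

Over many simulated sessions, slot #1 lands on "data first" almost every time, while slot #2 splits across several topics, the same convergence/divergence pattern observed in the five real sessions.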

Notably, Anthropic's own documentation acknowledges this directly:

"Even with temperature set to 0, the results will not be fully deterministic and identical inputs may produce different outputs across API calls." — Anthropic, Claude API Glossary[2]

Convergence signals high model confidence. Divergence signals competition between candidates.

Conclusion

A single session is sufficient where the model strongly converges: established best practices, definitions, and principled judgments.

Multiple sessions are warranted for areas requiring judgment or priority selection. Session 5 surfaced a perspective ("design around workflows, not point tools") that appeared in no other session; relying on a single session means missing valid alternatives like it.
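Both strategies can be combined in a few lines. This sketch (function names are ours, not from the experiment) takes a majority vote where sessions converge and flags single-session perspectives worth a second look:

```python
from collections import Counter

def majority_answer(answers: list[str]) -> str:
    """Self-consistency-style aggregation: the answer produced by the
    most independent sessions wins."""
    return Counter(answers).most_common(1)[0][0]

def unique_perspectives(answers: list[str]) -> list[str]:
    """Answers that appeared in exactly one session: candidates, like
    Session 5's workflow-centric angle, that a single run would miss."""
    return [a for a, n in Counter(answers).items() if n == 1]
```

Running both over the same set of session outputs gives the converged consensus and the outlier angles in one pass.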

Session independence is not a bug. Understanding this property lets you design more intentional AI-assisted workflows — running multiple sessions when coverage matters, trusting a single session when convergence is expected.[3]

Further Reading

Self-Consistency Improves Chain of Thought Reasoning in Language Models · Wang et al. · Google Brain · ICLR 2023

Introduces Self-Consistency: generating multiple reasoning paths via sampling and selecting the majority answer. Directly related to the multi-session strategy this experiment validates.



Non-Determinism of 'Deterministic' LLM Settings · Atil et al. · 2024

Empirically demonstrates that LLM outputs remain non-deterministic even at temperature=0, tracing the cause to floating-point non-determinism in hardware-level parallel computation.



The Effect of Sampling Temperature on Problem Solving in Large Language Models · Renze & Guven · 2024

Measures how temperature values affect problem-solving performance, showing that optimal temperature varies by task type.


Notes

[1] Token sampling: At each generation step, the LLM computes a probability distribution (logits → softmax) over its vocabulary and samples the next token. Temperature controls distribution sharpness. Even at temperature=0, hardware-level floating-point non-determinism persists across calls.
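The logits → softmax step in this note can be written out directly; the logit values in the test are arbitrary examples:

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Turn raw logits into a probability distribution over tokens.

    Dividing by temperature before the softmax sharpens (T < 1) or
    flattens (T > 1) the distribution; as T approaches 0, sampling
    approaches greedy selection of the highest-logit token.
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Lowering the temperature raises the probability of the top token, which is why high-confidence answers like "data first" survive sampling noise while close competitors do not.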

[2] Anthropic, Claude API Glossary: docs.anthropic.com/en/docs/resources/glossary

[3] Self-consistency in practice: Wang et al.'s Self-Consistency formalizes this as an algorithm: generate multiple sampling paths and take a majority vote. It consistently outperforms single-path generation on reasoning tasks. The multi-session strategy follows the same logic.

Claude AI · LLM non-determinism · session independence · AI consistency · prompt engineering · AI experiment