SEM.ai
Labs #001 · Claude Behavior Series

Why Claude Gives Different Answers to the Same Prompt — A Session Independence Experiment

We ran the same prompt across 5 independent Claude sessions with identical model and system prompt settings. Here's exactly where the outputs converged — and where they diverged.

Abstract

Audience: Marketers and AI practitioners who use Claude in repeated workflows.

Experimental setup: The same prompt was run across 5 independent Claude sessions with identical model (claude-sonnet-4-6) and system prompt settings.

Hypothesis: High-confidence areas in the model's training would converge; judgment-dependent areas would diverge.

Limitation: All sessions shared the same system prompt, so this is not a blank-slate experiment. Testing with varied system prompts would require a separate experimental design.

Experimental Design

Fixed Variables

| Condition | Value |
| --- | --- |
| Model | claude-sonnet-4-6 |
| System prompt | Identical across all 5 sessions |
| Input text | Identical across all 5 sessions |

Independent Variable

| Condition | Value |
| --- | --- |
| Session count | 5 independent sessions |
| Shared context between sessions | None |

Input

Give the 3 most important pieces of advice for a startup CMO
adopting AI marketing tools for the first time.

Measurement Items

- Topic selection (what was chosen as advice)
- Advice ordering (which position each topic occupied)
- Intro sentence style
- Closing structure (whether a summary was included)
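The setup above can be sketched as a small harness. The names here (`run_sessions`, `call_model`) are illustrative, not from the experiment; in practice `call_model` would wrap a fresh, stateless Anthropic Messages API request, since one independent session is simply one stateless model call with the identical prompt:

```python
from typing import Callable

PROMPT = (
    "Give the 3 most important pieces of advice for a startup CMO "
    "adopting AI marketing tools for the first time."
)

def run_sessions(prompt: str, n_sessions: int,
                 call_model: Callable[[str], str]) -> list[str]:
    """Collect one completion per independent session.

    Each call shares no context with the others, mirroring the five
    independent Claude sessions in this experiment.
    """
    return [call_model(prompt) for _ in range(n_sessions)]
```

Because the model hook is injected, the same harness works against a stub for testing or against a real API client in production.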

Raw Outputs

Session 1

  1. Fix your data foundation before touching any tools
  2. Limit automation scope to repeatable tasks first
  3. Define your ROI baseline before you start

Session 2

  1. Fix your data pipeline before anything else
  2. Draw a clear line between what AI can automate and where humans must stay involved
  3. Prove ROI on one use case before expanding

Session 3

  1. Clean your data before adding any AI layer
  2. Focus on one workflow at a time
  3. Build a human review process for all AI outputs

Session 4

  1. Start with the data pipeline — tools come second
  2. Design a validation process before you optimize for speed
  3. Prove ROI in one vertical before scaling

Session 5

  1. Start with your data infrastructure
  2. Design around workflows, not individual point tools
  3. Develop your team's judgment before letting the tools do it for them

Observations

| Topic | Sessions present |
| --- | --- |
| Data first (Advice #1) | 5/5 |
| Validation / human review process | 5/5 |
| Focused rollout / incremental ROI proof | 4/5 |
| Workflow-centric design (vs. point tools) | 1/5 |

Advice #1 converged across all 5 sessions — same meaning, different phrasing. Advice #2 and #3 drew from a shared topic pool but selected different combinations each time. Session 5's "workflow-centric design" angle appeared in no other session.

Intro sentences were distinct in all 5 sessions, ranging from pain-point framing to neutral declaration to action-oriented prompts: the same question produced a different entry point each time.

Interpretation

LLMs generate text by sampling from probability distributions over tokens. Independent sessions produce independent sampling paths.[1]

"Data first" converged because this advice carries overwhelming probability weight in the training data for this question type. For advice positions 2 and 3, where multiple valid candidates compete, each session selected a different path.
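A toy simulation makes this concrete. The topic pools and probability weights below are illustrative assumptions, not measured values; the point is only that a dominant candidate converges across independent draws while competing candidates split:

```python
import random
from collections import Counter

# Illustrative (not measured) weights: advice slot #1 is dominated by
# one candidate, while later slots have several competing candidates.
SLOT_1 = {"data first": 0.92, "pick the right tool": 0.05, "hire a specialist": 0.03}
SLOT_2 = {"human review": 0.35, "limit scope": 0.30, "prove ROI": 0.25, "workflow design": 0.10}

def simulate_sessions(pool: dict[str, float], n: int, seed: int = 0) -> Counter:
    """Draw one topic per independent session from a weighted pool."""
    rng = random.Random(seed)
    topics, weights = zip(*pool.items())
    return Counter(rng.choices(topics, weights=weights, k=n))

slot1 = simulate_sessions(SLOT_1, 1000)  # converges on the dominant topic
slot2 = simulate_sessions(SLOT_2, 1000)  # spreads across competing topics
```

Over many simulated sessions, slot #1 lands on "data first" almost every time, while slot #2 splits across several topics, the same convergence/divergence pattern observed in the five real sessions.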

Notably, Anthropic's own documentation acknowledges this directly:

"Even with temperature set to 0, the results will not be fully deterministic and identical inputs may produce different outputs across API calls." — Anthropic, Claude API Glossary[2]

Convergence signals high model confidence. Divergence signals competition between candidates.

Conclusion

A single session is sufficient where the model strongly converges: established best practices, definitions, and principled judgments.

Multiple sessions are warranted for areas requiring judgment or priority selection. Session 5 surfaced a perspective ("design around workflows, not point tools") that appeared in no other session; relying on a single session means missing valid alternatives like it.
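Both strategies can be combined in a few lines. This sketch (function names are ours, not from the experiment) takes a majority vote where sessions converge and flags single-session perspectives worth a second look:

```python
from collections import Counter

def majority_answer(answers: list[str]) -> str:
    """Self-consistency-style aggregation: the answer produced by the
    most independent sessions wins."""
    return Counter(answers).most_common(1)[0][0]

def unique_perspectives(answers: list[str]) -> list[str]:
    """Answers that appeared in exactly one session: candidates, like
    Session 5's workflow-centric angle, that a single run would miss."""
    return [a for a, n in Counter(answers).items() if n == 1]
```

Running both over the same set of session outputs gives the converged consensus and the outlier angles in one pass.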

Session independence is not a bug. Understanding this property lets you design more intentional AI-assisted workflows — running multiple sessions when coverage matters, trusting a single session when convergence is expected.[3]

Further Reading

Self-Consistency Improves Chain of Thought Reasoning in Language Models · Wang et al. · Google Brain · ICLR 2023

Introduces Self-Consistency: generating multiple reasoning paths via sampling and selecting the majority answer. Directly related to the multi-session strategy this experiment validates.



Non-Determinism of 'Deterministic' LLM Settings · Atil et al. · 2024

Empirically demonstrates that LLM outputs remain non-deterministic even at temperature=0, tracing the cause to floating-point non-determinism in hardware-level parallel computation.



The Effect of Sampling Temperature on Problem Solving in Large Language Models · Renze & Guven · 2024

Measures how temperature values affect problem-solving performance, showing that optimal temperature varies by task type.


Notes

[1] Token sampling: At each generation step, the LLM computes a probability distribution (logits → softmax) over its vocabulary and samples the next token. Temperature controls distribution sharpness. Even at temperature=0, hardware-level floating-point non-determinism persists across calls.
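The logits → softmax step in this note can be written out directly; the logit values in the test are arbitrary examples:

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Turn raw logits into a probability distribution over tokens.

    Dividing by temperature before the softmax sharpens (T < 1) or
    flattens (T > 1) the distribution; as T approaches 0, sampling
    approaches greedy selection of the highest-logit token.
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Lowering the temperature raises the probability of the top token, which is why high-confidence answers like "data first" survive sampling noise while close competitors do not.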

[2] Anthropic, Claude API Glossary: docs.anthropic.com/en/docs/resources/glossary

[3] Self-consistency in practice: Wang et al.'s Self-Consistency formalizes this as an algorithm: generate multiple sampling paths and take a majority vote. It consistently outperforms single-path generation on reasoning tasks. The multi-session strategy follows the same logic.

Claude AI · LLM non-determinism · session independence · AI consistency · prompt engineering · AI experiment