Flashpoint.AI blog

All Models Are Wrong — Part 1

There are no people inside the model

A crowd gathered, rapt, around a beautiful mechanical model of the solar system — wrong in every detail, useful anyway. Joseph Wright of Derby, A Philosopher Giving a Lecture on the Orrery, 1766.

“Essentially, all models are wrong, but some are useful.” — George E. P. Box

Synthetic survey populations use AI models to simulate how different types of people might answer survey questions. Instead of collecting answers from real respondents, the researcher creates artificial “respondents” with demographic and behavioral attributes such as age, gender, income, education, geography, occupation, political orientation, purchasing behavior, or prior attitudes. These profiles are then passed to a large language model, which is asked to answer survey questions as if it were that person.

The problem is that this approach conflates three different things:

  • simulating plausible language;
  • predicting individual or group behavior;
  • measuring a real population.

These are not equivalent.

A large language model is primarily a conditional generative model. Given a prompt, it generates a plausible continuation. When it is asked to simulate a respondent with a given profile, it is not sampling from the true distribution of people who fit that profile. It is sampling from the model's learned representation of how such a person is likely to speak or answer, conditional on the prompt and the training distribution.

The appeal is obvious: synthetic surveys are fast, cheap, scalable, and can be run before spending money on human fieldwork. They can be useful for early hypothesis generation, questionnaire testing, message exploration, and identifying likely areas of disagreement.

But scientifically, they are not equivalent to real survey data.

That distinction matters because survey research is not just text generation. A survey response is an observed measurement from a real human under a specific sampling design, question wording, context, incentive structure, and social environment. A synthetic answer is a model output. It may be useful, but it is not an observation.

The core limitation is that a language model does not contain real people. It contains statistical patterns learned from text. When asked to simulate a respondent, it produces an answer that is plausible given its training data, the persona description, the prompt, and the model's internal representation of social groups. This means synthetic populations are better understood as model-based predictions of survey responses, not as observations.

They fail in several predictable ways.

One failure is stereotype substitution. When the model lacks real information about a subgroup, it may fall back on broad cultural stereotypes. It may infer that a certain age, income, or education group “should” hold a particular view, even when real survey data show more variation.

A second failure is loss of individual variance. Real humans are noisy, inconsistent, emotional, context-dependent, and often contradictory. Synthetic respondents tend to be too coherent. They often answer in ways that fit the persona too neatly. This can make correlations look stronger than they really are and can underrepresent surprising or minority viewpoints.
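
The effect of excess coherence on correlations can be seen in a small simulation. This is a sketch on made-up numbers, not a model of any real survey: `trait` stands for a hypothetical persona attribute, the rule "answer = trait + noise" is a toy assumption, and the noise levels 2.0 and 0.3 are chosen only to contrast noisy human-like answers with overly tidy synthetic ones.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_answers(traits, noise_sd, rng):
    """Toy assumption: answer = persona trait + individual noise."""
    return [t + rng.gauss(0.0, noise_sd) for t in traits]


rng = random.Random(0)
traits = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# Assumed noise levels (illustrative only): human-like respondents vary
# widely around their profile, synthetic ones stick closely to it.
human_like = simulate_answers(traits, noise_sd=2.0, rng=rng)
too_coherent = simulate_answers(traits, noise_sd=0.3, rng=rng)

corr_human_like = pearson(traits, human_like)
corr_too_coherent = pearson(traits, too_coherent)

print(f"trait vs. noisy answers:    r = {corr_human_like:.2f}")
print(f"trait vs. coherent answers: r = {corr_too_coherent:.2f}")
```

Under these assumptions the theoretical correlations are 1/sqrt(1 + 2.0²) ≈ 0.45 and 1/sqrt(1 + 0.3²) ≈ 0.96. The underlying relationship between trait and answer is identical in both cases; only the individual noise differs, yet the tidy version makes the relationship look far stronger.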

A third failure is distributional miscalibration. The model may get the average answer roughly right but fail on the full distribution. It may predict the majority preference correctly while underestimating strong opposition, uncertainty, “don’t know” responses, or subgroup-specific effects.
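
A worked example makes the point concrete. The two answer distributions below are invented for illustration and are not survey results; they are constructed so that the averages match while the shapes do not. Total variation distance is used here as one possible way to compare shapes, not as the only reasonable choice.

```python
# Hypothetical answer shares on a 5-point scale
# (1 = strongly oppose ... 5 = strongly support).
# Both distributions are invented for illustration, not survey results.
human = {1: 0.30, 2: 0.05, 3: 0.05, 4: 0.20, 5: 0.40}
synthetic = {1: 0.05, 2: 0.15, 3: 0.30, 4: 0.40, 5: 0.10}


def mean_score(shares):
    """Average score implied by a mapping of score -> share."""
    return sum(score * share for score, share in shares.items())


def total_variation(p, q):
    """Half the summed absolute difference between two share vectors
    (0 = identical, 1 = no overlap)."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)


print(f"mean (human):     {mean_score(human):.2f}")
print(f"mean (synthetic): {mean_score(synthetic):.2f}")
print(f"total variation:  {total_variation(human, synthetic):.2f}")
print(f"strongly oppose:  {human[1]:.0%} vs. {synthetic[1]:.0%}")
```

Both means come out at 3.35, so a comparison of averages would report a perfect match. The total variation distance is 0.55, and the share of strong opposition is 30% in the first distribution against 5% in the second.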

A fourth failure is prompt sensitivity. Small changes in wording, question order, answer labels, or persona detail can change the synthetic results. This is a serious scientific problem, because survey measurement should be stable under reasonable design choices. If the result depends heavily on the prompt, then the system is measuring the model-prompt interaction, not the population.
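
One way to check for this is to ask the same question in several reasonable wordings and see how far the result moves. The harness below is a minimal sketch. `make_stub_model` is a hypothetical stand-in for a real model call, with a wording effect hard-coded so that the harness has something to detect; the persona, the wordings, and the function names are all assumptions for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple


def yes_share(ask: Callable[[str], str], prompt: str, n: int) -> float:
    """Fraction of n simulated answers equal to 'yes' for one wording."""
    return sum(ask(prompt) == "yes" for _ in range(n)) / n


def wording_spread(
    ask: Callable[[str], str], variants: Sequence[str], n: int = 500
) -> Tuple[float, List[float]]:
    """Return (max - min) of the yes-share across wordings, plus each share."""
    shares = [yes_share(ask, v, n) for v in variants]
    return max(shares) - min(shares), shares


def make_stub_model(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a real model call. The wording effect is
    hard-coded purely so the harness has something to detect."""
    rng = random.Random(seed)

    def ask(prompt: str) -> str:
        p_yes = 0.70 if "support" in prompt else 0.45
        return "yes" if rng.random() < p_yes else "no"

    return ask


# Three wordings of the same question for the same invented persona.
variants = [
    "As a 34-year-old nurse, do you support the proposal? Answer yes or no.",
    "As a 34-year-old nurse, are you in favor of the proposal? Answer yes or no.",
    "You are a nurse, aged 34. Should the proposal go ahead? Answer yes or no.",
]

spread, shares = wording_spread(make_stub_model(seed=0), variants)
print("yes-share per wording:", [round(s, 2) for s in shares])
print("spread across wordings:", round(spread, 2))
```

With a real model, `ask` would wrap an actual API call. A spread that is large relative to sampling noise suggests that the headline number reflects the wording as much as the simulated population.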

A fifth failure is temporal staleness. Models are trained on past data. Public opinion, consumer behavior, technology adoption, political attitudes, and brand perceptions can change quickly. A synthetic population may reproduce yesterday’s consensus while missing today’s shift.

A sixth failure is weak construct validity. In science, a measure is valid only if it actually measures the thing it claims to measure. A synthetic answer may look plausible, but that does not prove it measures real preference, intent, trust, willingness to pay, or voting behavior. The model is not experiencing budget constraints, social pressure, memory limits, risk, habit, embarrassment, or real-world incentives.

These failures share a root cause. LLMs are trained to learn associations between words, groups, attitudes, and contexts, whereas survey answers are generated by causal processes: lived experience, incentives, identity, local context, memory, emotions, social desirability, question interpretation, and the respondent's current situation. A language model can approximate the surface pattern of those processes, but it does not observe or reproduce the processes themselves.

The gap has practical consequences. A synthetic population can sometimes predict aggregate patterns when the question is stable, well-represented in training data, and strongly associated with demographic or cultural signals. It is much weaker when the question is novel, local, technical, emotionally charged, recently changed, or dependent on private experience.

The safest way to use synthetic survey populations is therefore not as a replacement for human respondents, but as a pre-research tool. They can help generate hypotheses, test questionnaire wording, explore possible segment differences, and stress-test assumptions before fieldwork. They should be validated against real survey data whenever the result will influence business, policy, investment, or scientific conclusions.
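
A validation step can be as simple as comparing synthetic shares with a human benchmark, segment by segment, and flagging where they diverge. The sketch below uses invented numbers; the segment labels, the `flag_segments` helper, and the 5-point tolerance are assumptions for illustration. In practice the tolerance would have to account for the sampling error of the human benchmark itself.

```python
def flag_segments(synthetic, benchmark, tolerance):
    """Return segments whose synthetic share differs from the human
    benchmark by more than `tolerance` (absolute difference in shares)."""
    return {
        segment: round(synthetic[segment] - benchmark[segment], 3)
        for segment in benchmark
        if abs(synthetic[segment] - benchmark[segment]) > tolerance
    }


# Invented shares agreeing with some statement, by age segment.
# These are illustrative values, not survey results.
benchmark = {"18-34": 0.52, "35-54": 0.47, "55+": 0.31}
synthetic = {"18-34": 0.55, "35-54": 0.49, "55+": 0.48}

flagged = flag_segments(synthetic, benchmark, tolerance=0.05)
print(flagged)
```

Here two segments fall within tolerance and one ("55+") is off by 17 points, which would go unnoticed if only the overall average were compared.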

In simple terms: synthetic populations are useful for thinking faster, but dangerous when treated as evidence. They simulate what a model believes people like this might say, but do not prove what real people think.

Next in the series: why this limit is mathematical rather than technical, and why no model release will remove it.