This guide presents four methods for creating synthetic populations that match known marginal distributions while preserving realistic relationships between variables.
This work is derived from the methods outlined in 'Creating Realistic Synthetic Populations at Varying Spatial Scales: A Comparative Critique of Population Synthesis Techniques' by K. Harland, A. Heppenstall, D. Smith, and M.H. Birkin.
You can find a comparison of the effectiveness measures here.
Note how close the results of the SA (simulated annealing) and CP[CO] methods are.
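Deterministic reweighting adjusts sample weights directly to match target marginals through multiplicative factors.

    # Core adjustment logic: scale each record's weight by target / current
    # for every constrained dimension. `current_margins` is assumed to hold
    # the weighted totals of the sample computed before this pass.
    for record in sample_data:
        for dimension in ['age', 'gender', 'education']:
            category = record[dimension]
            adjustment = target_margins[dimension][category] / current_margins[dimension][category]
            record['weight'] *= adjustment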
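As a rough, runnable illustration of the reweighting idea, the sketch below applies one pass over two dimensions with pandas. The function name `reweight_once`, the toy sample, and the target totals are all made up for this example, and recomputing the current margins before each dimension is an assumption of this sketch rather than something the snippet above specifies.

```python
import pandas as pd

def reweight_once(sample, target_margins):
    """One pass: for each dimension in turn, scale weights by target / current."""
    out = sample.copy()
    for dimension, targets in target_margins.items():
        # Weighted totals per category, recomputed before each adjustment (assumption)
        current = out.groupby(dimension)["weight"].sum()
        factors = out[dimension].map(lambda cat: targets[cat] / current[cat])
        out["weight"] = out["weight"] * factors
    return out

# Hypothetical four-record sample with unit starting weights
sample = pd.DataFrame({
    "age": ["young", "young", "young", "old"],
    "gender": ["f", "f", "m", "m"],
    "weight": [1.0, 1.0, 1.0, 1.0],
})
# Hypothetical target totals; both dimensions sum to 100
target_margins = {
    "age": {"young": 60, "old": 40},
    "gender": {"f": 55, "m": 45},
}

reweighted = reweight_once(sample, target_margins)
age_totals = reweighted.groupby("age")["weight"].sum()
gender_totals = reweighted.groupby("gender")["weight"].sum()
print(age_totals)
print(gender_totals)
```

In this toy case the gender totals, adjusted last, match their targets exactly (55 and 45), while the age totals drift to 70 and 30. That is one way to read the "High (1D)" entry in the comparison table: a single pass only guarantees the most recently adjusted dimension.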
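Iterative proportional fitting (IPF) repeatedly adjusts weights, one dimension at a time, until all specified marginals are matched simultaneously.

    # IPF core algorithm
    converged = False
    while not converged:
        for dimension in dimensions:
            # Weighted totals per category for this dimension
            current_margins = calculate_marginals(population, dimension)
            target = target_margins[dimension]
            adjustments = {cat: target[cat] / current_margins[cat] for cat in target}
            apply_adjustments(population, dimension, adjustments)
        # Stop once every marginal is within tolerance of its target
        converged = check_convergence(population, target_margins)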
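The loop above can be sketched for the two-dimensional case with NumPy, where the population is a contingency table of counts. This is a minimal sketch under stated assumptions: the function name `ipf`, the seed table, the targets, and the tolerance are illustrative choices, not values from the source.

```python
import numpy as np

def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Scale rows, then columns, until the row totals also match their targets."""
    table = seed.astype(float).copy()
    for _ in range(max_iter):
        # Match row totals
        table *= (row_targets / table.sum(axis=1))[:, None]
        # Match column totals (this can disturb the row totals slightly)
        table *= (col_targets / table.sum(axis=0))[None, :]
        # Columns are exact after the step above, so only rows need checking
        if np.abs(table.sum(axis=1) - row_targets).max() < tol:
            break
    return table

# Hypothetical seed: rows are age (young, old), columns are gender (f, m)
seed = np.array([[2.0, 1.0],
                 [1.0, 1.0]])
row_targets = np.array([60.0, 40.0])
col_targets = np.array([55.0, 45.0])

fitted = ipf(seed, row_targets, col_targets)
print(fitted.round(3))
```

Because every step only multiplies whole rows or whole columns by a constant, the odds ratio of the seed table is carried through to the fitted table. This is why IPF keeps the association structure of the sample while changing its margins.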
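Simulated annealing uses a probabilistic search to look for a near-optimal population configuration, accepting some worse candidates early on so that the search can escape local minima.

    # Annealing core
    from math import exp
    from random import random

    while temperature > min_temp:
        new_pop = mutate(current_pop)
        delta_e = energy(new_pop) - energy(current_pop)
        # Always accept improvements; accept worse moves with probability exp(-ΔE / T)
        if delta_e < 0 or random() < exp(-delta_e / temperature):
            current_pop = new_pop
        temperature *= cooling_rate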
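To make the annealing loop concrete, here is a small self-contained sketch that selects individuals from a sample pool so that their category counts approach target counts. Everything specific in it is an assumption for illustration: the energy function (total absolute error against the targets), the mutation (swap one individual for a random pool member), the cooling schedule, and the names `anneal`, `pool`, and `targets`.

```python
import math
import random
from collections import Counter

def energy(population, targets):
    """Total absolute error between population counts and target counts."""
    total = 0
    for dim, dim_targets in targets.items():
        counts = Counter(person[dim] for person in population)
        total += sum(abs(counts[cat] - want) for cat, want in dim_targets.items())
    return total

def anneal(pool, targets, size, temperature=5.0, min_temp=0.01,
           cooling_rate=0.995, seed=0):
    rng = random.Random(seed)  # seeded so that runs are repeatable
    current = [rng.choice(pool) for _ in range(size)]
    current_energy = energy(current, targets)
    while temperature > min_temp and current_energy > 0:
        # Mutate: replace one randomly chosen individual with a pool member
        candidate = list(current)
        candidate[rng.randrange(size)] = rng.choice(pool)
        delta = energy(candidate, targets) - current_energy
        # Accept improvements always, worse moves with probability exp(-delta / T)
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            current, current_energy = candidate, current_energy + delta
        temperature *= cooling_rate
    return current, current_energy

# Hypothetical pool of sample individuals and target counts for 20 people
pool = [
    {"age": "young", "gender": "f"},
    {"age": "young", "gender": "m"},
    {"age": "old", "gender": "f"},
    {"age": "old", "gender": "m"},
]
targets = {"age": {"young": 12, "old": 8}, "gender": {"f": 11, "m": 9}}

population, final_energy = anneal(pool, targets, size=20)
print(final_energy)
```

The result depends on the seed and the cooling schedule, which fits the "Variable" marginal accuracy listed in the comparison table. Because the energy function is just a Python function, extra constraints can be added to it without changing the search loop.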
Conditional probability sampling creates synthetic individuals by sampling from learned conditional distributions, which preserves the dependencies between variables.

    # Conditional probability core
    import numpy as np
    import pandas as pd

    def sample(probs):
        # Helper assumed here: draw one category from a Series of
        # probabilities indexed by category
        return np.random.choice(probs.index, p=probs.values)

    def generate_population(reference_data, size):
        # Learn P(age), P(gender|age), P(education|age,gender)
        age_probs = reference_data['age'].value_counts(normalize=True)
        gender_probs = reference_data.groupby('age')['gender'].value_counts(normalize=True)
        edu_probs = reference_data.groupby(['age', 'gender'])['education'].value_counts(normalize=True)
        synthetic = []
        for _ in range(size):
            # Sample hierarchically: each draw conditions on the previous ones
            age = sample(age_probs)
            gender = sample(gender_probs[age])
            education = sample(edu_probs[(age, gender)])
            synthetic.append({'age': age, 'gender': gender, 'education': education})
        return pd.DataFrame(synthetic)
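The hierarchical sampling above keeps variable dependencies because of the chain rule: P(age) × P(gender|age) × P(education|age,gender) equals the joint probability of the full combination. The sketch below checks this numerically on a tiny, made-up reference table. The data and the helper name `chain_probability` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical reference data with six individuals
reference = pd.DataFrame({
    "age": ["young", "young", "young", "old", "old", "old"],
    "gender": ["f", "f", "m", "f", "m", "m"],
    "education": ["degree", "school", "school", "school", "degree", "school"],
})

# The same three distributions that the generator learns
p_age = reference["age"].value_counts(normalize=True)
p_gender = reference.groupby("age")["gender"].value_counts(normalize=True)
p_edu = reference.groupby(["age", "gender"])["education"].value_counts(normalize=True)

# Empirical joint distribution over (age, gender, education)
joint = reference.value_counts(normalize=True)

def chain_probability(age, gender, education):
    """P(age) * P(gender | age) * P(education | age, gender)."""
    return (p_age.loc[age]
            * p_gender.loc[(age, gender)]
            * p_edu.loc[(age, gender, education)])

for combo, probability in joint.items():
    print(combo, round(probability, 4), round(chain_probability(*combo), 4))
```

Each combination observed in the reference data gets the same probability from the chain of conditionals as from the joint table. Sampling the variables one after another is therefore equivalent to drawing whole combinations from the empirical joint distribution.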
Criteria | Deterministic *2 | IPF *2 | Annealing | Conditional |
---|---|---|---|---|
Implementation | Simple | Medium | Complex | Medium |
Speed | Fastest | Fast | Slow | Medium |
Marginal Accuracy | High (1D) | High | Variable | Approximate |
Joint Distributions | Poor | Good | Custom | Best |
Best For | Quick checks | Standard use | Custom constraints | Realistic relationships *1 |
*1 Also creates anonymised data.