Synthetic Population Generation

This guide presents four methods for creating synthetic populations that match known marginal distributions while preserving realistic relationships between variables.

This work is derived from the methods outlined in 'Creating Realistic Synthetic Populations at Varying Spatial Scales: A Comparative Critique of Population Synthesis Techniques' by K. Harland, A. Heppenstall, D. Smith, and M.H. Birkin.

You can find a comparison of the effectiveness measures here.

Note how close the simulated annealing (SA) and conditional probabilities (CP) methods are.
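
As a rough illustration of one such effectiveness measure, the snippet below computes the RMSE between target and synthetic marginal proportions, the same kind of figure quoted in the IPF sample output further down; the numbers are invented for the example.

import numpy as np

# Illustrative target vs. synthetic age-band proportions (made-up numbers)
target    = np.array([0.33, 0.41, 0.26])
synthetic = np.array([0.32, 0.43, 0.25])

rmse = np.sqrt(np.mean((synthetic - target) ** 2))
print(f'RMSE: {rmse:.3f}')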

1. Deterministic Re-weighting

Adjusts sample weights directly to match target marginals through multiplicative factors.

Implementation Highlights:

# Core adjustment logic (sample_data, target_margins and current_margins
# are assumed to be prepared beforehand)
for record in sample_data:
    for dimension in ['age', 'gender', 'education']:
        # Scale the record's weight by target / current marginal for its category
        adjustment = (target_margins[dimension][record[dimension]]
                      / current_margins[dimension][record[dimension]])
        record['weight'] *= adjustment

# Sample output:
#   ID  Age    Gender  Education   Weight
#   1   18-25  Male    HighSchool  1.34
#   2   26-35  Female  Bachelor    0.92
#   3   18-25  Female  Bachelor    1.41
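
For a concrete, end-to-end version of this re-weighting loop, here is a minimal pandas sketch; the sample records and target marginal counts are illustrative assumptions, not values from the paper.

import pandas as pd

# Assumed microdata sample with initial unit weights (illustrative values)
sample = pd.DataFrame({
    'age':    ['18-25', '26-35', '18-25', '26-35'],
    'gender': ['Male',  'Female', 'Female', 'Male'],
    'weight': [1.0, 1.0, 1.0, 1.0],
})

# Assumed target marginal counts for the area being synthesised
targets = {
    'age':    {'18-25': 120, '26-35': 80},
    'gender': {'Male': 90, 'Female': 110},
}

for dimension, target in targets.items():
    # Current weighted marginal for each category of this dimension
    current = sample.groupby(dimension)['weight'].sum()
    # Multiply each record's weight by target/current for its category
    sample['weight'] *= sample[dimension].map(lambda cat: target[cat] / current[cat])

print(sample.groupby('age')['weight'].sum())     # weighted age marginals
print(sample.groupby('gender')['weight'].sum())  # weighted gender marginals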

When to Use:

  • Quick prototyping
  • When you only need to match one or two dimensions
  • Computationally constrained environments

2. Iterative Proportional Fitting (IPF)

Iteratively adjusts weights to simultaneously match all specified marginals.

Implementation Highlights:

# IPF core algorithm (population, dimensions, target_margins and the helper
# functions are assumed to be defined elsewhere)
converged = False
while not converged:
    for dimension in dimensions:
        current_margins = calculate_marginals(population, dimension)
        # Scale factor for each category: target marginal / current marginal
        adjustments = {cat: target_margins[dimension][cat] / current_margins[cat]
                       for cat in target_margins[dimension]}
        apply_adjustments(population, dimension, adjustments)
    # Converged when all marginals are within tolerance of their targets
    converged = check_convergence(population, target_margins)

# Sample output:
#   Converged after 8 iterations
#   Age marginals matched with RMSE: 0.02
#   Gender marginals matched with RMSE: 0.01
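
As a self-contained illustration, the sketch below runs IPF on a small two-way age-by-gender table with NumPy; the seed counts and target marginals are made up for the example.

import numpy as np

# Assumed seed table: rows = age bands, columns = gender (illustrative counts)
seed = np.array([[30.0, 20.0],
                 [25.0, 25.0]])
row_targets = np.array([120.0, 80.0])   # target age marginals
col_targets = np.array([90.0, 110.0])   # target gender marginals

table = seed.copy()
for iteration in range(100):
    # Fit row marginals, then column marginals
    table *= (row_targets / table.sum(axis=1))[:, None]
    table *= col_targets / table.sum(axis=0)
    # Stop once both sets of marginals are (almost) matched
    if (np.allclose(table.sum(axis=1), row_targets) and
            np.allclose(table.sum(axis=0), col_targets)):
        break

print(table.round(2))     # fitted joint table
print(table.sum(axis=1))  # ~ row_targets
print(table.sum(axis=0))  # ~ col_targets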

When to Use:

  • Need to match 3+ dimensions simultaneously
  • Require statistically rigorous results
  • Working with large sample sizes

3. Simulated Annealing

Uses a probabilistic search to find a near-optimal population configuration.

Implementation Highlights:

# Annealing core (mutate, energy, current_pop, temperature, min_temp and
# cooling_rate are assumed to be defined elsewhere)
from math import exp
from random import random

while temperature > min_temp:
    new_pop = mutate(current_pop)                    # propose a small change
    delta_E = energy(new_pop) - energy(current_pop)  # change in fit error
    # Always accept improvements; accept worse moves with probability exp(-ΔE/T)
    if delta_E < 0 or random() < exp(-delta_E / temperature):
        current_pop = new_pop
    temperature *= cooling_rate

# Sample output:
#   Temperature: 10.0  Energy: 42.3
#   Temperature: 9.5   Energy: 38.1
#   ...
#   Final energy: 2.1 (optimal reached)
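
The sketch below shows the full loop on a toy problem: repeatedly swapping members of a candidate population drawn from a small sample until its age and gender counts approach assumed targets. The sample records, targets, energy function (total absolute error) and cooling schedule are all illustrative choices.

import math
import random
from collections import Counter

random.seed(0)

# Assumed microdata sample: (age band, gender) pairs
sample = [('18-25', 'Male'), ('18-25', 'Female'),
          ('26-35', 'Male'), ('26-35', 'Female')]

# Assumed target marginal counts for an area of 200 people
target_age = {'18-25': 120, '26-35': 80}
target_gender = {'Male': 90, 'Female': 110}
pop_size = 200

def energy(pop):
    # Total absolute error between the population's marginals and the targets
    ages = Counter(p[0] for p in pop)
    genders = Counter(p[1] for p in pop)
    return (sum(abs(ages[a] - t) for a, t in target_age.items()) +
            sum(abs(genders[g] - t) for g, t in target_gender.items()))

current_pop = [random.choice(sample) for _ in range(pop_size)]
temperature, min_temp, cooling_rate = 10.0, 0.01, 0.995

while temperature > min_temp:
    # Mutate: replace one randomly chosen member with a fresh draw from the sample
    new_pop = list(current_pop)
    new_pop[random.randrange(pop_size)] = random.choice(sample)
    delta_E = energy(new_pop) - energy(current_pop)
    if delta_E < 0 or random.random() < math.exp(-delta_E / temperature):
        current_pop = new_pop
    temperature *= cooling_rate

print('Final energy:', energy(current_pop))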

When to Use:

  • Complex, non-linear constraints
  • When IPF struggles to converge
  • Need to incorporate custom constraints

4. Conditional Probabilities

Creates synthetic individuals by sampling from learned conditional distributions, preserving variable dependencies.

Implementation Highlights:

# Conditional probability core
import numpy as np
import pandas as pd

def sample(probs):
    # Draw one category from a probability Series indexed by category
    return np.random.choice(probs.index, p=probs.values)

def generate_population(reference_data, size):
    # Learn P(age), P(gender|age), P(education|age,gender) from the reference data
    age_probs = reference_data['age'].value_counts(normalize=True)
    gender_probs = reference_data.groupby('age')['gender'].value_counts(normalize=True)
    edu_probs = reference_data.groupby(['age', 'gender'])['education'].value_counts(normalize=True)

    synthetic = []
    for _ in range(size):
        # Sample hierarchically: age, then gender given age, then education given both
        age = sample(age_probs)
        gender = sample(gender_probs.loc[age])
        education = sample(edu_probs.loc[(age, gender)])
        synthetic.append({'age': age, 'gender': gender, 'education': education})

    return pd.DataFrame(synthetic)
# Sample output:
#   Generated 1500 individuals with preserved relationships:
#   Age 18-25: 32% (target: 33%)
#   P(Female|18-25): 61% (ref: 60%)
#   P(Bachelor|Male,26-35): 42% (ref: 40%)
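
A usage sketch, continuing from the generate_population function above; the reference rows below are invented purely for illustration.

reference = pd.DataFrame({
    'age':       ['18-25', '18-25', '26-35', '26-35', '18-25', '26-35'],
    'gender':    ['Female', 'Male', 'Male', 'Female', 'Female', 'Male'],
    'education': ['Bachelor', 'HighSchool', 'Bachelor', 'Bachelor', 'HighSchool', 'HighSchool'],
})

synthetic = generate_population(reference, size=1500)

# Compare a learned conditional against the reference, e.g. P(gender | age)
print(synthetic.groupby('age')['gender'].value_counts(normalize=True))
print(reference.groupby('age')['gender'].value_counts(normalize=True))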

When to Use:

  • Need to preserve natural variable dependencies from real data
  • Creating entirely new individuals (not reweighting existing)
  • When joint distributions matter more than exact marginal matching
  • Building Bayesian network-style synthetic data

Method Comparison

Criteria              Deterministic *2    IPF *2          Annealing             Conditional
Implementation        Simple              Medium          Complex               Medium
Speed                 Fastest             Fast            Slow                  Medium
Marginal Accuracy     High (1D)           High            Variable              Approximate
Joint Distributions   Poor                Good            Custom                Best
Best For              Quick checks        Standard use    Custom constraints    Realistic relationships *1

*1 Also creates anonymised data.

*2 Core difference in approach.

Note: Here is a link to a toy GAN. This is a work in progress, and I will expand it to a fuller explanation at a later date.