Synthetic Population Generation

This guide presents four methods for creating synthetic populations that match known marginal distributions while preserving realistic relationships between variables.

This work is derived from the methods outlined in 'Creating Realistic Synthetic Populations at Varying Spatial Scales: A Comparative Critique of Population Synthesis Techniques' by K. Harland, A. Heppenstall, D. Smith, and M.H. Birkin.

You can find a comparison of the effectiveness measures here.

Note how close the simulated annealing (SA) and conditional probabilities (CP) methods are.
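
As a rough illustration of one such effectiveness measure, the snippet below computes the RMSE between target and synthetic marginal proportions, the same kind of figure quoted in the IPF sample output further down; the numbers are invented for the example.

import numpy as np

# Illustrative target vs. synthetic age-band proportions (made-up numbers)
target    = np.array([0.33, 0.41, 0.26])
synthetic = np.array([0.32, 0.43, 0.25])

rmse = np.sqrt(np.mean((synthetic - target) ** 2))
print(f'RMSE: {rmse:.3f}')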

1. Deterministic Re-weighting

Adjusts sample weights directly to match target marginals through multiplicative factors.

Implementation Highlights:

# Core adjustment logic (sample_data, target_margins and current_margins
# are assumed to be prepared beforehand)
for record in sample_data:
    for dimension in ['age', 'gender', 'education']:
        # Scale the record's weight by target / current marginal for its category
        adjustment = (target_margins[dimension][record[dimension]]
                      / current_margins[dimension][record[dimension]])
        record['weight'] *= adjustment

# Sample output:
#   ID  Age    Gender  Education   Weight
#   1   18-25  Male    HighSchool  1.34
#   2   26-35  Female  Bachelor    0.92
#   3   18-25  Female  Bachelor    1.41
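
For a concrete, end-to-end version of this re-weighting loop, here is a minimal pandas sketch; the sample records and target marginal counts are illustrative assumptions, not values from the paper.

import pandas as pd

# Assumed microdata sample with initial unit weights (illustrative values)
sample = pd.DataFrame({
    'age':    ['18-25', '26-35', '18-25', '26-35'],
    'gender': ['Male',  'Female', 'Female', 'Male'],
    'weight': [1.0, 1.0, 1.0, 1.0],
})

# Assumed target marginal counts for the area being synthesised
targets = {
    'age':    {'18-25': 120, '26-35': 80},
    'gender': {'Male': 90, 'Female': 110},
}

for dimension, target in targets.items():
    # Current weighted marginal for each category of this dimension
    current = sample.groupby(dimension)['weight'].sum()
    # Multiply each record's weight by target/current for its category
    sample['weight'] *= sample[dimension].map(lambda cat: target[cat] / current[cat])

print(sample.groupby('age')['weight'].sum())     # weighted age marginals
print(sample.groupby('gender')['weight'].sum())  # weighted gender marginals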

When to Use:

  • Quick prototyping
  • When you only need to match one or two dimensions
  • Computationally constrained environments

2. Iterative Proportional Fitting (IPF)

Iteratively adjusts weights to simultaneously match all specified marginals.

Implementation Highlights:

# IPF core algorithm (population, dimensions, target_margins and the helper
# functions are assumed to be defined elsewhere)
converged = False
while not converged:
    for dimension in dimensions:
        current_margins = calculate_marginals(population, dimension)
        # Scale factor for each category: target marginal / current marginal
        adjustments = {cat: target_margins[dimension][cat] / current_margins[cat]
                       for cat in target_margins[dimension]}
        apply_adjustments(population, dimension, adjustments)
    # Converged when all marginals are within tolerance of their targets
    converged = check_convergence(population, target_margins)

# Sample output:
#   Converged after 8 iterations
#   Age marginals matched with RMSE: 0.02
#   Gender marginals matched with RMSE: 0.01
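
As a self-contained illustration, the sketch below runs IPF on a small two-way age-by-gender table with NumPy; the seed counts and target marginals are made up for the example.

import numpy as np

# Assumed seed table: rows = age bands, columns = gender (illustrative counts)
seed = np.array([[30.0, 20.0],
                 [25.0, 25.0]])
row_targets = np.array([120.0, 80.0])   # target age marginals
col_targets = np.array([90.0, 110.0])   # target gender marginals

table = seed.copy()
for iteration in range(100):
    # Fit row marginals, then column marginals
    table *= (row_targets / table.sum(axis=1))[:, None]
    table *= col_targets / table.sum(axis=0)
    # Stop once both sets of marginals are (almost) matched
    if (np.allclose(table.sum(axis=1), row_targets) and
            np.allclose(table.sum(axis=0), col_targets)):
        break

print(table.round(2))     # fitted joint table
print(table.sum(axis=1))  # ~ row_targets
print(table.sum(axis=0))  # ~ col_targets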

When to Use:

  • Need to match 3+ dimensions simultaneously
  • Require statistically rigorous results
  • Working with large sample sizes

3. Simulated Annealing

Uses a probabilistic search to find a near-optimal population configuration.

Implementation Highlights:

# Annealing core (mutate, energy, current_pop, temperature, min_temp and
# cooling_rate are assumed to be defined elsewhere)
from math import exp
from random import random

while temperature > min_temp:
    new_pop = mutate(current_pop)                    # propose a small change
    delta_E = energy(new_pop) - energy(current_pop)  # change in fit error
    # Always accept improvements; accept worse moves with probability exp(-ΔE/T)
    if delta_E < 0 or random() < exp(-delta_E / temperature):
        current_pop = new_pop
    temperature *= cooling_rate

# Sample output:
#   Temperature: 10.0  Energy: 42.3
#   Temperature: 9.5   Energy: 38.1
#   ...
#   Final energy: 2.1 (optimal reached)
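
The sketch below shows the full loop on a toy problem: repeatedly swapping members of a candidate population drawn from a small sample until its age and gender counts approach assumed targets. The sample records, targets, energy function (total absolute error) and cooling schedule are all illustrative choices.

import math
import random
from collections import Counter

random.seed(0)

# Assumed microdata sample: (age band, gender) pairs
sample = [('18-25', 'Male'), ('18-25', 'Female'),
          ('26-35', 'Male'), ('26-35', 'Female')]

# Assumed target marginal counts for an area of 200 people
target_age = {'18-25': 120, '26-35': 80}
target_gender = {'Male': 90, 'Female': 110}
pop_size = 200

def energy(pop):
    # Total absolute error between the population's marginals and the targets
    ages = Counter(p[0] for p in pop)
    genders = Counter(p[1] for p in pop)
    return (sum(abs(ages[a] - t) for a, t in target_age.items()) +
            sum(abs(genders[g] - t) for g, t in target_gender.items()))

current_pop = [random.choice(sample) for _ in range(pop_size)]
temperature, min_temp, cooling_rate = 10.0, 0.01, 0.995

while temperature > min_temp:
    # Mutate: replace one randomly chosen member with a fresh draw from the sample
    new_pop = list(current_pop)
    new_pop[random.randrange(pop_size)] = random.choice(sample)
    delta_E = energy(new_pop) - energy(current_pop)
    if delta_E < 0 or random.random() < math.exp(-delta_E / temperature):
        current_pop = new_pop
    temperature *= cooling_rate

print('Final energy:', energy(current_pop))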

When to Use:

  • Complex, non-linear constraints
  • When IPF struggles to converge
  • Need to incorporate custom constraints

4. Conditional Probabilities

Creates synthetic individuals by sampling from learned conditional distributions, preserving variable dependencies.

Implementation Highlights:

# Conditional probability core
import numpy as np
import pandas as pd

def sample(probs):
    # Draw one category from a probability Series indexed by category
    return np.random.choice(probs.index, p=probs.values)

def generate_population(reference_data, size):
    # Learn P(age), P(gender|age), P(education|age,gender) from the reference data
    age_probs = reference_data['age'].value_counts(normalize=True)
    gender_probs = reference_data.groupby('age')['gender'].value_counts(normalize=True)
    edu_probs = reference_data.groupby(['age', 'gender'])['education'].value_counts(normalize=True)

    synthetic = []
    for _ in range(size):
        # Sample hierarchically: age, then gender given age, then education given both
        age = sample(age_probs)
        gender = sample(gender_probs.loc[age])
        education = sample(edu_probs.loc[(age, gender)])
        synthetic.append({'age': age, 'gender': gender, 'education': education})

    return pd.DataFrame(synthetic)
# Sample output:
#   Generated 1500 individuals with preserved relationships:
#   Age 18-25: 32% (target: 33%)
#   P(Female|18-25): 61% (ref: 60%)
#   P(Bachelor|Male,26-35): 42% (ref: 40%)
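
A usage sketch, continuing from the generate_population function above; the reference rows below are invented purely for illustration.

reference = pd.DataFrame({
    'age':       ['18-25', '18-25', '26-35', '26-35', '18-25', '26-35'],
    'gender':    ['Female', 'Male', 'Male', 'Female', 'Female', 'Male'],
    'education': ['Bachelor', 'HighSchool', 'Bachelor', 'Bachelor', 'HighSchool', 'HighSchool'],
})

synthetic = generate_population(reference, size=1500)

# Compare a learned conditional against the reference, e.g. P(gender | age)
print(synthetic.groupby('age')['gender'].value_counts(normalize=True))
print(reference.groupby('age')['gender'].value_counts(normalize=True))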

When to Use:

  • Need to preserve natural variable dependencies from real data
  • Creating entirely new individuals (not reweighting existing)
  • When joint distributions matter more than exact marginal matching
  • Building Bayesian network-style synthetic data

Method Comparison

Criteria              Deterministic *2    IPF *2          Annealing             Conditional
Implementation        Simple              Medium          Complex               Medium
Speed                 Fastest             Fast            Slow                  Medium
Marginal Accuracy     High (1D)           High            Variable              Approximate
Joint Distributions   Poor                Good            Custom                Best
Best For              Quick checks        Standard use    Custom constraints    Realistic relationships *1

*1 Also creates anonymised data.

*2 Core difference in approach.

Note: Here is a link to a toy GAN. This is a work in progress, and I will expand it to a fuller explanation at a later date.