How Iterative Proportional Fitting (IPF) Works

IPF is a mathematical procedure that adjusts the cells of a contingency table so that the table's margins match predefined target values.

A worked example (under ongoing development) is available in this Colab notebook.

1. Core Components of IPF

IPF requires two main inputs:

A. Seed Table (Initial Table)

B. Target Margins (Constraints)

1.1 Starting with individual survey data to create seed table

The seed table in IPF is typically created from survey data where each row represents an individual respondent with their characteristics.

Example Survey Data (First 10 Respondents):

Respondent ID	Age Group	Gender	Income
1	18-30	Male	Low
2	31-50	Female	Medium
3	51+	Male	High
4	18-30	Female	Low
5	31-50	Male	Medium
6	31-50	Female	High
7	51+	Female	Medium
8	18-30	Male	Low
9	31-50	Male	Medium
10	51+	Male	High

1.2 Creating the Contingency Table (Cross-Tabulation)

We count how many individuals fall into each combination of categories:

Step 1: Count Age × Gender Combinations

For our 1000 respondents (full survey), we might get these counts:

Age \ Gender	Male	Female	Total
18-30	100	150	250
31-50	200	250	450
51+	150	150	300
Total	450	550	1000

How These Counts Are Calculated:

18-30 Males: Count all respondents where Age=18-30 AND Gender=Male
31-50 Females: Count all respondents where Age=31-50 AND Gender=Female
And so on for all combinations
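
The counting rule above can be sketched with pandas. This is a minimal illustration that assumes pandas is installed and uses only the ten example respondents listed earlier, so the counts are much smaller than in the 1000-respondent table; the variable names are illustrative.

```python
import pandas as pd

# The ten example respondents listed above (age group and gender only)
respondents = pd.DataFrame({
    "age": ["18-30", "31-50", "51+", "18-30", "31-50",
            "31-50", "51+", "18-30", "31-50", "51+"],
    "gender": ["Male", "Female", "Male", "Female", "Male",
               "Female", "Female", "Male", "Male", "Male"],
})

# Count respondents in every age x gender cell; margins=True adds totals
table = pd.crosstab(
    respondents["age"],
    respondents["gender"],
    margins=True,
    margins_name="Total",
)
print(table)
```

Each cell is the number of rows where both conditions hold, which is exactly the "Age=... AND Gender=..." rule described above.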

Step 2: Verify Sample Representativeness

Before using this as our seed table, we should check if our sample is representative:

Characteristic	Sample %	Population %	Difference (Sample - Population)
18-30	25%	30%	-5 points
Male	45%	48%	-3 points
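
A small sketch of this check, assuming the difference is defined as sample share minus population share and expressed in percentage points. The sample shares come from the seed table margins (250 and 450 out of 1000); the dictionary names are illustrative.

```python
# Shares taken from the seed table margins and the population column above
sample_share = {"18-30": 250 / 1000, "Male": 450 / 1000}
population_share = {"18-30": 0.30, "Male": 0.48}

# Difference in percentage points (sample minus population);
# a negative value means the group is under-represented in the sample
difference_pp = {
    key: round(100 * (sample_share[key] - population_share[key]), 1)
    for key in sample_share
}
print(difference_pp)
```

Under-represented groups are the ones IPF will scale up when the table is fitted to population margins.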

1.3 Using the Contingency Table as IPF Seed

Now we have our seed table ready for IPF:

Age \ Gender	Male	Female	Row Total
18-30	100	150	250
31-50	200	250	450
51+	150	150	300
Col Total	450	550	1000

1.4 Why This Process Matters

Preserves relationships: The seed table maintains the correlations between variables found in the survey
More accurate than random: Better than creating synthetic data with no underlying structure
Adjustable: IPF will scale these relationships to match known population totals

1.5 Extension to More Variables

The same process works for 3+ variables (e.g., Age × Gender × Income):

Age \ Gender \ Income	Male Low	Male Medium	Male High	Female Low	...
18-30	40	30	30	70	...
31-50	60	80	60	80	...
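
One way to sketch the three-variable count in plain Python is a Counter keyed by (age, gender, income) tuples. The data below are the ten example respondents from the survey table, so the counts are illustrative and do not match the 1000-respondent figures.

```python
from collections import Counter

# The ten example respondents above as (age, gender, income) tuples
respondents = [
    ("18-30", "Male", "Low"), ("31-50", "Female", "Medium"),
    ("51+", "Male", "High"), ("18-30", "Female", "Low"),
    ("31-50", "Male", "Medium"), ("31-50", "Female", "High"),
    ("51+", "Female", "Medium"), ("18-30", "Male", "Low"),
    ("31-50", "Male", "Medium"), ("51+", "Male", "High"),
]

# One counter entry per observed (age, gender, income) cell
cell_counts = Counter(respondents)

# Collapsing over income recovers the two-way age x gender counts
age_gender = Counter((age, gender) for age, gender, _ in respondents)

print(cell_counts[("18-30", "Male", "Low")])
print(age_gender[("31-50", "Male")])
```

Cells that never occur in the data are simply absent from the counter and read as zero.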

Target Margins in Iterative Proportional Fitting

What Are Target Margins?

Target margins (also called constraints) are the known population totals that we want our final table to match. These typically come from authoritative sources like:

Census data
Official demographic statistics
Large-scale surveys
Administrative records

Example Margins Structure

For a 2D Age × Gender Table:

Row Margins (Age Totals):

Age Group	Target Total
18-30	300
31-50	500
51+	200

Column Margins (Gender Totals):

Gender	Target Total
Male	600
Female	400

Key Properties of Target Margins

1. Consistency Requirement

The sum of row margins must equal the sum of column margins:

300 (18-30) + 500 (31-50) + 200 (51+) = 1000
600 (Male) + 400 (Female) = 1000

2. Multi-Dimensional Extensions

For 3+ variables, we need margins for each dimension:

# Example for Age × Gender × Income
age_margins = [300, 500, 200]          # Age totals
gender_margins = [600, 400]            # Gender totals
income_margins = [400, 350, 250]       # Income totals (Low, Medium, High)

# Grand total must match across all dimensions
assert sum(age_margins) == sum(gender_margins) == sum(income_margins)  # 1000
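
A minimal sketch of how one margin per dimension can drive a multi-dimensional fit, assuming NumPy is available. The function name ipf_nd and the uniform seed of ones are assumptions made for illustration; a real seed would come from the survey cross-tabulation and must have no zero margin totals.

```python
import numpy as np

def ipf_nd(seed, margins, max_iter=100, tol=1e-9):
    """Fit an N-dimensional table to one 1D margin per axis (illustrative helper)."""
    table = np.asarray(seed, dtype=float).copy()
    targets = [np.asarray(m, dtype=float) for m in margins]
    for _ in range(max_iter):
        max_error = 0.0
        for axis, target in enumerate(targets):
            # Current totals along this axis: sum over every other axis
            other_axes = tuple(a for a in range(table.ndim) if a != axis)
            current = table.sum(axis=other_axes)
            max_error = max(max_error, np.abs(current - target).max())
            # Reshape the factors so they broadcast along this axis only
            shape = [1] * table.ndim
            shape[axis] = -1
            table *= (target / current).reshape(shape)
        # Stop once every margin already matched at the start of the sweep
        if max_error < tol:
            break
    return table

age_margins = [300, 500, 200]
gender_margins = [600, 400]
income_margins = [400, 350, 250]

# Placeholder seed: a uniform 3 x 2 x 3 table (assumption, not survey data)
seed = np.ones((3, 2, 3))
fitted = ipf_nd(seed, [age_margins, gender_margins, income_margins])
print(fitted.sum(axis=(1, 2)))  # age totals of the fitted table
```

With a uniform seed the fit carries no association between variables, so this only demonstrates the mechanics of scaling each axis in turn.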

Sources of Target Margins

Common Data Sources

Source	Example Margins
Census API	Age by gender tables
CSV Files	Pre-aggregated totals
Statistical Models	Projected demographics

Python loading example for each source:

Census API

import requests
census_data = requests.get("https://api.census.gov/data/2020/acs/acs5?get=AGE,SEX&for=state:24")

CSV Files

import pandas as pd
margins = pd.read_csv("margins.csv")

Statistical Models

from statsmodels.api import GLM
model = GLM(...)
predicted_margins = model.predict(...)

Practical Considerations

Advanced Margin Handling in Iterative Proportional Fitting

1. Handling Margin Mismatches

1.1 The Consistency Problem

When row and column margins don't sum to the same total, we must resolve the discrepancy before running IPF:

Common Causes of Mismatches:

Different data sources: Age totals from census, gender totals from health survey
Reporting periods: Margins collected in different years
Rounding errors: Published statistics using rounded numbers
Sampling variability: Survey weights not perfectly calibrated

1.2 Normalization Approaches

Python: Advanced Margin Normalization

def normalize_margins(row_margins, col_margins, method='proportional'):
    """
    Normalize margins to common total using different strategies
    
    Parameters:
        row_margins (list): Target row totals
        col_margins (list): Target column totals
        method (str): Normalization approach ('proportional', 'row_priority', 'col_priority')
    
    Returns:
        tuple: (normalized_row_margins, normalized_col_margins)
    """
    total_row = sum(row_margins)
    total_col = sum(col_margins)
    
    # Check if normalization is needed
    if abs(total_row - total_col) < 1e-6:
        return row_margins, col_margins
    
    if method == 'proportional':
        # Scale both dimensions proportionally
        common_total = (total_row + total_col) / 2
        row_factor = common_total / total_row
        col_factor = common_total / total_col
        return [x*row_factor for x in row_margins], [x*col_factor for x in col_margins]
    
    elif method == 'row_priority':
        # Keep row totals fixed, adjust columns
        col_factor = total_row / total_col
        return row_margins, [x*col_factor for x in col_margins]
    
    elif method == 'col_priority':
        # Keep column totals fixed, adjust rows
        row_factor = total_col / total_row
        return [x*row_factor for x in row_margins], col_margins
    
    else:
        raise ValueError(f"Unknown method: {method}")

1.3 Practical Example

Mismatched Margins Scenario:

Dimension	Margins	Sum
Age Groups	[320, 520, 210]	1050
Gender	[610, 390]	1000

After Proportional Normalization:

row_margins, col_margins = normalize_margins(
    [320, 520, 210],
    [610, 390],
    method='proportional'
)
# Result (both sides scaled to the midpoint total, (1050 + 1000) / 2 = 1025):
# Rows: [312.38, 507.62, 205.00] (sum=1025)
# Cols: [625.25, 399.75] (sum=1025)

2. Weighted vs. Unweighted Margins

2.1 Understanding the Difference

Type	Description	When to Use	Calculation
Unweighted	Raw counts from sample	When sample is perfectly representative	Simple value_counts()
Weighted	Adjusted for sampling design	Most real-world surveys	Sum of weights per group

2.2 Comprehensive Weighting Example

import pandas as pd
import numpy as np

# Sample survey data with weights
survey_data = pd.DataFrame({
    'age': ['18-30', '31-50', '51+', '18-30', '31-50', '51+'],
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'weight': [1.2, 0.8, 1.5, 1.1, 0.9, 1.3]
})

# Unweighted counts (ignores sampling design)
unweighted = pd.crosstab(
    index=survey_data['age'],
    columns=survey_data['gender']
)

# Properly weighted counts
weighted = survey_data.groupby(['age', 'gender'])['weight'].sum().unstack()

# Calibration to known population totals
def calibrate_weights(df, target_margins):
    """
    Raking procedure to adjust weights to match margins
    """
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    df['calib_weight'] = df['weight']
    
    # Convert the target dicts to Series so division aligns on category labels
    age_targets = pd.Series(target_margins['age'])
    gender_targets = pd.Series(target_margins['gender'])
    
    # Iterative proportional adjustment
    for _ in range(10):
        # Adjust to age margins
        age_factor = age_targets / df.groupby('age')['calib_weight'].sum()
        df['calib_weight'] *= df['age'].map(age_factor)
        
        # Adjust to gender margins
        gender_factor = gender_targets / df.groupby('gender')['calib_weight'].sum()
        df['calib_weight'] *= df['gender'].map(gender_factor)
    
    return df

# Target margins for calibration
targets = {
    'age': {'18-30': 300, '31-50': 500, '51+': 200},
    'gender': {'Male': 600, 'Female': 400}
}

# Apply calibration
calibrated_data = calibrate_weights(survey_data, targets)

2.3 Weighting Considerations

Key Factors in Weighting:

Design weights: Account for unequal sampling probabilities
Non-response adjustment: Compensate for missing responses
Post-stratification: Align with population benchmarks
Weight trimming: Prevent extreme weights that increase variance
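
Weight trimming can be sketched as capping extreme weights and then rescaling so the weighted total is unchanged. The rule used here (cap at a multiple of the median weight) and the names trim_weights and max_ratio are assumptions for illustration; survey practice uses a variety of trimming rules.

```python
import numpy as np

def trim_weights(weights, max_ratio=3.0):
    """Cap weights at max_ratio times the median, then restore the original total."""
    w = np.asarray(weights, dtype=float)
    cap = max_ratio * np.median(w)
    trimmed = np.minimum(w, cap)
    # Rescale so the sum of weights (the estimated population size) is preserved
    return trimmed * (w.sum() / trimmed.sum())

def effective_sample_size(weights):
    """Kish effective sample size: (sum of weights)^2 / sum of squared weights."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

raw = [1.0, 1.0, 1.0, 1.0, 10.0]   # one extreme weight
trimmed = trim_weights(raw)

print(trimmed)
print(effective_sample_size(raw), effective_sample_size(trimmed))
```

Trimming trades a small amount of bias for lower variance: the spread of the weights shrinks, so the effective sample size goes up.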

Python Implementation Tips:

# Check weight diagnostics
print("Weight summary:", survey_data['weight'].describe())
# Kish effective sample size: (sum of weights)^2 / (sum of squared weights)
print("Effective sample size:",
      survey_data['weight'].sum()**2 / (survey_data['weight']**2).sum())

# Visualize weight distribution
import matplotlib.pyplot as plt
plt.hist(survey_data['weight'], bins=20)
plt.title('Survey Weight Distribution')
plt.show()

Visualizing Margin Constraints

Relationship between seed table and target margins:

            Seed Table                  Target Margins
            ┌───────────────┐           ┌───────────────┐
            │ 100 │ 150 │250│           │      300      │
            ├─────┼─────┼───┤           ├───────────────┤
            │ 200 │ 250 │450│   IPF     │      500      │
            ├─────┼─────┼───┤  ──────▶  ├───────────────┤
            │ 150 │ 150 │300│           │      200      │
            ├─────┼─────┼───┤           └───────────────┘
            │ 450 │ 550 │   │           ┌───────┬───────┐
            └───────────────┘           │ 600   │ 400   │
                                        └───────┴───────┘

Known totals for each category (e.g., from census data):

New row totals (age): [300, 500, 200]
New column totals (gender): [600, 400]

2. How IPF Adjusts the Table Iteratively

Initial Table (Seed):

Age \ Gender	Male	Female	Row Total	Row Target
18-30	100	150	250	300
31-50	200	250	450	500
51+	150	150	300	200
Col Total	450	550	1000	-
Col Target	600	400	-	-

Iteration 1: Adjust Rows to Match Row Targets

For each row, multiply each cell by:
Row Adjustment Factor = Row Target / Current Row Total

Row 1 (18-30): Adjustment factor = 300 / 250 = 1.2
Row 2 (31-50): Adjustment factor = 500 / 450 ≈ 1.111
Row 3 (51+): Adjustment factor = 200 / 300 ≈ 0.667

After Row Adjustment:

Age \ Gender	Male	Female	Row Total
18-30	120	180	300
31-50	222.2	277.8	500
51+	100	100	200
Col Total	442.2	557.8	1000

→ Row totals now match targets, but columns do not.

Iteration 1: Adjust Columns to Match Column Targets

For each column, multiply each cell by:
Column Adjustment Factor = Column Target / Current Column Total

Male Column: Adjustment factor = 600 / 442.2 ≈ 1.357
Female Column: Adjustment factor = 400 / 557.8 ≈ 0.717

After Column Adjustment:

Age \ Gender	Male	Female	Row Total
18-30	162.8	129.1	291.9
31-50	301.5	199.2	500.7
51+	135.7	71.7	207.4
Col Total	600	400	1000

→ Column totals now match targets, but rows are slightly off again.
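
The two steps of iteration 1 can be reproduced with NumPy broadcasting. This is a sketch that assumes NumPy is available; rows are ordered 18-30, 31-50, 51+ and columns Male, Female, and the variable names are illustrative.

```python
import numpy as np

seed = np.array([[100.0, 150.0],
                 [200.0, 250.0],
                 [150.0, 150.0]])
row_targets = np.array([300.0, 500.0, 200.0])
col_targets = np.array([600.0, 400.0])

# Row step: one factor per row, broadcast across that row's cells
row_factors = row_targets / seed.sum(axis=1)
after_rows = seed * row_factors[:, None]

# Column step: one factor per column, broadcast down that column's cells
col_factors = col_targets / after_rows.sum(axis=0)
after_cols = after_rows * col_factors

print(np.round(after_cols, 1))
print(np.round(after_cols.sum(axis=1), 1))  # row totals drift after the column step
```

After the column step the column totals are exact while the row totals have moved slightly away from their targets, which is why the two steps are repeated.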

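Final Converged Table (After Several Iterations):

Age \ Gender	Male	Female	Row Total
18-30	167.6	132.4	300
31-50	301.5	198.5	500
51+	131.0	69.0	200
Col Total	600	400	1000

→ Now both row and column totals match the targets to within rounding (cells are shown to one decimal place). The cells differ from the independence table (180/120, 300/200, 120/80) because IPF preserves the seed's cross-product (odds) ratios, that is, the association between age and gender observed in the survey.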
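
A converged table can be checked numerically by repeating the row and column steps until the row totals stop moving. The sketch below assumes NumPy and a strictly positive 2D seed; ipf_2d is an illustrative helper name. It also compares a cross-product ratio before and after fitting, since IPF rescales rows and columns but leaves these ratios unchanged.

```python
import numpy as np

def ipf_2d(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Alternate row and column scaling until the row totals match (illustrative)."""
    table = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        table *= (row_targets / table.sum(axis=1))[:, None]  # row step
        table *= col_targets / table.sum(axis=0)             # column step
        # Columns are exact after the column step, so only rows need checking
        if np.abs(table.sum(axis=1) - row_targets).max() < tol:
            break
    return table

def cross_product_ratio(t):
    """Odds ratio of the top-left 2 x 2 block of a table."""
    return (t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0])

seed = np.array([[100.0, 150.0], [200.0, 250.0], [150.0, 150.0]])
fitted = ipf_2d(seed, [300, 500, 200], [600, 400])

print(np.round(fitted, 1))
print(cross_product_ratio(seed), cross_product_ratio(fitted))
```

If the two printed ratios agree, the fitted table has kept the association structure of the seed while matching the new margins.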

3. Key Properties of IPF

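Convergence: Guaranteed if margins are consistent and the seed has no zero cells that make the targets unreachable (a strictly positive seed is sufficient)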
Fractional Weights: Produces non-integer weights (may need rounding)
Extensions: Integer IPF, multi-dimensional IPF
Applications: Transport modeling, economics, survey raking
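
When integer counts are needed, one simple option is largest-remainder rounding, sketched below with NumPy. This is an illustrative approach and not the Integer IPF extension named above: it preserves the grand total but individual row and column totals can shift by one. The function name and the example table are assumptions.

```python
import numpy as np

def round_preserving_total(table):
    """Round cells to integers so the grand total is kept (largest remainders round up)."""
    t = np.asarray(table, dtype=float)
    floors = np.floor(t)
    # Number of units lost by rounding every cell down
    shortfall = int(round(t.sum() - floors.sum()))
    # Indices of cells ordered by fractional part, largest first
    order = np.argsort((t - floors).ravel())[::-1]
    result = floors.ravel().copy()
    result[order[:shortfall]] += 1
    return result.reshape(t.shape).astype(int)

# Illustrative fractional table with a grand total of 1000
fractional = [[162.8, 129.1],
              [301.5, 199.2],
              [135.7, 71.7]]
rounded = round_preserving_total(fractional)
print(rounded)
print(rounded.sum())
```

If exact integer margins matter, a dedicated integerization method would be needed instead of this shortcut.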

Iterative Proportional Fitting (IPF) with Vanilla Python

1. Data Preparation

1.1 Starting with Individual Survey Data

The seed table in IPF is created from survey data where each row represents an individual respondent.

Example Survey Data:

ID	Age	Gender	Income
1	18-30	Male	Low
2	31-50	Female	Medium
3	51+	Male	High
4	18-30	Female	Low
5	31-50	Male	Medium

Python: Loading Survey Data

# Sample survey data as list of dictionaries
survey_data = [
    {"Age": "18-30", "Gender": "Male", "Income": "Low"},
    {"Age": "31-50", "Gender": "Female", "Income": "Medium"},
    {"Age": "51+", "Gender": "Male", "Income": "High"},
    {"Age": "18-30", "Gender": "Female", "Income": "Low"},
    {"Age": "31-50", "Gender": "Male", "Income": "Medium"}
]

1.2 Creating the Contingency Table

We count how many individuals fall into each combination of categories:

Python: Creating Cross-Tabulation

def create_contingency(data, row_var, col_var):
    # Initialize counts
    row_categories = sorted({d[row_var] for d in data})
    col_categories = sorted({d[col_var] for d in data})
    
    # Create empty table
    table = {row: {col: 0 for col in col_categories} for row in row_categories}
    
    # Count occurrences
    for entry in data:
        table[entry[row_var]][entry[col_var]] += 1
    
    # Calculate margins
    row_totals = {row: sum(cols.values()) for row, cols in table.items()}
    col_totals = {col: sum(table[row][col] for row in row_categories) 
                 for col in col_categories}
    grand_total = sum(row_totals.values())
    
    return {
        "table": table,
        "row_totals": row_totals,
        "col_totals": col_totals,
        "grand_total": grand_total
    }

# Generate contingency table
result = create_contingency(survey_data, "Age", "Gender")
print("Contingency Table:", result["table"])

Resulting Contingency Table (for the full 1000-respondent survey; the five-row sample above yields only counts of 0 and 1):

Age\Gender	Male	Female	Total
18-30	100	150	250
31-50	200	250	450
51+	150	150	300
Total	450	550	1000

2. IPF Implementation

2.1 IPF Inputs

IPF requires a seed table and target margins:

Python: Setting Up IPF

# Seed table as nested dictionaries
seed = {
    "18-30": {"Male": 100, "Female": 150},
    "31-50": {"Male": 200, "Female": 250},
    "51+": {"Male": 150, "Female": 150}
}

# Target margins
row_targets = {"18-30": 300, "31-50": 500, "51+": 200}
col_targets = {"Male": 600, "Female": 400}

2.2 IPF Algorithm

The iterative process adjusts rows and columns alternately:

Python: IPF Implementation

def ipf(seed, row_targets, col_targets, max_iter=10, tol=1e-6):
    current = {row: cols.copy() for row, cols in seed.items()}
    row_categories = list(row_targets.keys())
    col_categories = list(col_targets.keys())
    
    for _ in range(max_iter):
        # Adjust rows
        for row in row_categories:
            row_sum = sum(current[row].values())
            if row_sum == 0: continue
            factor = row_targets[row] / row_sum
            for col in col_categories:
                current[row][col] *= factor
        
        # Adjust columns
        for col in col_categories:
            col_sum = sum(current[row][col] for row in row_categories)
            if col_sum == 0: continue
            factor = col_targets[col] / col_sum
            for row in row_categories:
                current[row][col] *= factor
        
        # Check convergence
        converged = True
        for row in row_categories:
            if abs(sum(current[row].values()) - row_targets[row]) > tol:
                converged = False
                break
        for col in col_categories:
            if abs(sum(current[row][col] for row in row_categories) - col_targets[col]) > tol:
                converged = False
                break
        if converged:
            break
    
    return current

result = ipf(seed, row_targets, col_targets)
print("Final IPF result:", result)

2.3 Final Result

Rounded to one decimal place, the fitted table is:

Age\Gender	Male	Female	Total
18-30	167.6	132.4	300
31-50	301.5	198.5	500
51+	131.0	69.0	200
Total	600	400	1000