How Iterative Proportional Fitting (IPF) Works

IPF is a mathematical procedure that adjusts the cells of a contingency table so that the table's margins match predefined target values.

A worked example (under ongoing development) is available in this Colab notebook.

1. Core Components of IPF

IPF requires two main inputs:

A. Seed Table (Initial Table)

B. Target Margins (Constraints)

1.1 Starting with Individual Survey Data to Create the Seed Table

The seed table in IPF is typically created from survey data where each row represents an individual respondent with their characteristics.

Example Survey Data (First 10 Respondents):

Respondent ID   Age Group   Gender   Income
1               18-30       Male     Low
2               31-50       Female   Medium
3               51+         Male     High
4               18-30       Female   Low
5               31-50       Male     Medium
6               31-50       Female   High
7               51+         Female   Medium
8               18-30       Male     Low
9               31-50       Male     Medium
10              51+         Male     High

1.2 Creating the Contingency Table (Cross-Tabulation)

We count how many individuals fall into each combination of categories:

Step 1: Count Age × Gender Combinations

For our 1000 respondents (full survey), we might get these counts:

Age \ Gender   Male   Female   Total
18-30          100    150      250
31-50          200    250      450
51+            150    150      300
Total          450    550      1000

How These Counts Are Calculated:
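Each cell is the number of respondents who share that row's age group and that column's gender, and each margin is the sum of the cells along one variable. As a minimal sketch of that counting, the snippet below tallies only the ten respondents listed earlier (not the full 1000-respondent survey); the variable names are illustrative.

```python
from collections import Counter

# The ten respondents listed earlier, as (age group, gender) pairs.
respondents = [
    ("18-30", "Male"), ("31-50", "Female"), ("51+", "Male"),
    ("18-30", "Female"), ("31-50", "Male"), ("31-50", "Female"),
    ("51+", "Female"), ("18-30", "Male"), ("31-50", "Male"),
    ("51+", "Male"),
]

# Each cell of the contingency table is the number of identical pairs.
cell_counts = Counter(respondents)

# Margins are sums of cells over the other variable.
age_totals = Counter(age for age, _ in respondents)
gender_totals = Counter(gender for _, gender in respondents)

for age in ("18-30", "31-50", "51+"):
    row = [cell_counts[(age, g)] for g in ("Male", "Female")]
    print(age, row, "total:", age_totals[age])
print("Column totals:", dict(gender_totals))
```

For these ten records the 18-30 row holds 2 men and 1 woman; the same tally over all respondents yields the 1000-person table.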

Step 2: Verify Sample Representativeness

Before using this as our seed table, we should check if our sample is representative:

Characteristic   Sample %   Population %   Difference (Population - Sample, percentage points)
18-30            25%        30%            +5
Male             45%        48%            +3
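The comparison can be reproduced with a few lines of arithmetic. This is only a sketch: the sample counts come from the cross-tabulation above, the population shares are the ones quoted in the table, and the gap is taken as population share minus sample share.

```python
# Sample counts from the cross-tabulation (1000 respondents) and the
# population shares quoted in the comparison table.
sample_counts = {"18-30": 250, "Male": 450}
sample_size = 1000
population_pct = {"18-30": 30.0, "Male": 48.0}

gaps = {}
for name, count in sample_counts.items():
    sample_pct = 100.0 * count / sample_size
    # A positive gap means the group is under-represented in the sample.
    gaps[name] = population_pct[name] - sample_pct
    print(f"{name}: sample {sample_pct:.0f}%, "
          f"population {population_pct[name]:.0f}%, gap {gaps[name]:+.0f} points")
```

Gaps like these are what the target margins in IPF are meant to correct.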

1.3 Using the Contingency Table as IPF Seed

Now we have our seed table ready for IPF:

Age \ Gender   Male   Female   Row Total
18-30          100    150      250
31-50          200    250      450
51+            150    150      300
Col Total      450    550      1000

1.4 Why This Process Matters

1.5 Extension to More Variables

The same process works for 3+ variables (e.g., Age × Gender × Income):

Age \ (Gender, Income)   Male, Low   Male, Medium   Male, High   Female, Low   ...
18-30                    40          30             30           70            ...
31-50                    60          80             60           80            ...

The seed table therefore represents the joint distribution of the variables in the sample (here, age × gender):

Age \ Gender   Male   Female   Row Total
18-30          100    150      250
31-50          200    250      450
51+            150    150      300
Col Total      450    550      1000

Target Margins in Iterative Proportional Fitting

What Are Target Margins?

Target margins (also called constraints) are the known population totals that we want our final table to match. These typically come from authoritative sources such as census data.

Example Margins Structure

For a 2D Age × Gender Table:

Row Margins (Age Totals):

Age Group   Target Total
18-30       300
31-50       500
51+         200

Column Margins (Gender Totals):

Gender   Target Total
Male     600
Female   400

Key Properties of Target Margins

1. Consistency Requirement

The sum of row margins must equal the sum of column margins:

300 (18-30) + 500 (31-50) + 200 (51+) = 1000
600 (Male) + 400 (Female) = 1000

2. Multi-Dimensional Extensions

For 3+ variables, we need margins for each dimension:

# Example for Age × Gender × Income
age_margins = [300, 500, 200]          # Age totals
gender_margins = [600, 400]            # Gender totals
income_margins = [400, 350, 250]       # Income totals (Low, Medium, High)

# Grand total must match across all dimensions
assert sum(age_margins) == sum(gender_margins) == sum(income_margins)  # 1000
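To illustrate how one margin per dimension is used, here is a minimal pure-Python sketch of a three-way fit. The function name `ipf_3d`, the nested-list layout, and the uniform seed are assumptions made for illustration; with a uniform seed the fitted cells are simply the product of the three margins divided by the squared grand total.

```python
from itertools import product

def ipf_3d(seed, margins, max_iter=100, tol=1e-9):
    """Fit a 3-D table (nested lists) to one target margin per axis."""
    shape = (len(seed), len(seed[0]), len(seed[0][0]))
    cells = list(product(*(range(n) for n in shape)))
    table = {idx: float(seed[idx[0]][idx[1]][idx[2]]) for idx in cells}

    for _ in range(max_iter):
        max_gap = 0.0
        for axis, targets in enumerate(margins):
            # Current totals along this axis.
            totals = [0.0] * shape[axis]
            for idx in cells:
                totals[idx[axis]] += table[idx]
            max_gap = max(max_gap,
                          max(abs(t - c) for t, c in zip(targets, totals)))
            # Scale every cell by target / current total for its category.
            for idx in cells:
                if totals[idx[axis]] > 0:
                    table[idx] *= targets[idx[axis]] / totals[idx[axis]]
        # Stop once every margin was already matched before rescaling.
        if max_gap < tol:
            break
    return table

age_margins = [300, 500, 200]
gender_margins = [600, 400]
income_margins = [400, 350, 250]

# Uniform seed (an assumption): 3 age groups x 2 genders x 3 income levels.
seed = [[[1.0] * 3 for _ in range(2)] for _ in range(3)]
fitted = ipf_3d(seed, [age_margins, gender_margins, income_margins])
print(round(fitted[(0, 0, 0)], 2))  # 18-30, Male, Low
```

A real seed built from survey counts would replace the uniform table and would generally need more than one pass to converge.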

Sources of Target Margins

Common Data Sources

Each source below is listed with the kind of margins it provides and a Python loading example.

Census API (example margins: age by gender tables)

import requests
# Illustrative query; verify the variable names against the dataset's variable list
census_data = requests.get("https://api.census.gov/data/2020/acs/acs5?get=AGE,SEX&for=state:24")

CSV Files (example margins: pre-aggregated totals)

import pandas as pd
margins = pd.read_csv("margins.csv")

Statistical Models (example margins: projected demographics)

from statsmodels.api import GLM
model = GLM(...)
predicted_margins = model.predict(...)
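Whatever the source, the margins usually need to end up as one mapping of category to total per variable. The sketch below assumes a hypothetical long-format layout for `margins.csv` (columns `variable`, `category`, `total`), since the actual file layout is not specified; it uses only the standard library and an in-memory string in place of the file.

```python
import csv
import io

# Hypothetical file contents; the real margins.csv layout is not given.
csv_text = """variable,category,total
age,18-30,300
age,31-50,500
age,51+,200
gender,Male,600
gender,Female,400
"""

def read_margins(handle):
    """Group long-format margin rows into {variable: {category: total}}."""
    margins = {}
    for row in csv.DictReader(handle):
        margins.setdefault(row["variable"], {})[row["category"]] = float(row["total"])
    return margins

margins = read_margins(io.StringIO(csv_text))
totals = {var: sum(cats.values()) for var, cats in margins.items()}
print(margins)
print(totals)  # every variable should report the same grand total
```

Comparing the per-variable totals right after loading is a cheap way to catch inconsistent margins before running IPF.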

Practical Considerations

Advanced Margin Handling in Iterative Proportional Fitting

1. Handling Margin Mismatches

1.1 The Consistency Problem

When row and column margins don't sum to the same total, we must resolve the discrepancy before running IPF:

Common Causes of Mismatches:

1.2 Normalization Approaches

Python: Advanced Margin Normalization

def normalize_margins(row_margins, col_margins, method='proportional'):
    """
    Normalize margins to common total using different strategies
    
    Parameters:
        row_margins (list): Target row totals
        col_margins (list): Target column totals
        method (str): Normalization approach ('proportional', 'row_priority', 'col_priority')
    
    Returns:
        tuple: (normalized_row_margins, normalized_col_margins)
    """
    total_row = sum(row_margins)
    total_col = sum(col_margins)
    
    # Check if normalization is needed
    if abs(total_row - total_col) < 1e-6:
        return row_margins, col_margins
    
    if method == 'proportional':
        # Scale both dimensions proportionally
        common_total = (total_row + total_col) / 2
        row_factor = common_total / total_row
        col_factor = common_total / total_col
        return [x*row_factor for x in row_margins], [x*col_factor for x in col_margins]
    
    elif method == 'row_priority':
        # Keep row totals fixed, adjust columns
        col_factor = total_row / total_col
        return row_margins, [x*col_factor for x in col_margins]
    
    elif method == 'col_priority':
        # Keep column totals fixed, adjust rows
        row_factor = total_col / total_row
        return [x*row_factor for x in row_margins], col_margins
    
    else:
        raise ValueError(f"Unknown method: {method}")

1.3 Practical Example

Mismatched Margins Scenario:

Dimension           Margins           Sum
Rows (Age Groups)   [320, 520, 210]   1050
Columns (Gender)    [610, 390]        1000

After Proportional Normalization:

row_margins, col_margins = normalize_margins(
    [320, 520, 210],
    [610, 390],
    method='proportional'
)
# Result (both sets are scaled to the common total (1050 + 1000) / 2 = 1025):
# Rows: [312.38, 507.62, 205.00] (sum=1025)
# Cols: [625.25, 399.75] (sum=1025)

2. Weighted vs. Unweighted Margins

2.1 Understanding the Difference

Type         Description                    When to Use                               Calculation
Unweighted   Raw counts from sample         When sample is perfectly representative   Simple value_counts()
Weighted     Adjusted for sampling design   Most real-world surveys                   Sum of weights per group

2.2 Comprehensive Weighting Example

import pandas as pd
import numpy as np

# Sample survey data with weights
survey_data = pd.DataFrame({
    'age': ['18-30', '31-50', '51+', '18-30', '31-50', '51+'],
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'weight': [1.2, 0.8, 1.5, 1.1, 0.9, 1.3]
})

# Unweighted counts (ignores sampling design)
unweighted = pd.crosstab(
    index=survey_data['age'],
    columns=survey_data['gender']
)

# Properly weighted counts
weighted = survey_data.groupby(['age', 'gender'])['weight'].sum().unstack()

# Calibration to known population totals
def calibrate_weights(df, target_margins):
    """
    Raking procedure to adjust weights to match margins
    """
    # Initialize calibrated weights
    df['calib_weight'] = df['weight']
    
    # Iterative proportional adjustment (fixed number of passes, no convergence check)
    for _ in range(10):
        # Adjust to age margins; pd.Series lets the target dict align with the group totals
        age_factor = pd.Series(target_margins['age']) / df.groupby('age')['calib_weight'].sum()
        df['calib_weight'] *= df['age'].map(age_factor)
        
        # Adjust to gender margins
        gender_factor = pd.Series(target_margins['gender']) / df.groupby('gender')['calib_weight'].sum()
        df['calib_weight'] *= df['gender'].map(gender_factor)
    
    return df

# Target margins for calibration
targets = {
    'age': {'18-30': 300, '31-50': 500, '51+': 200},
    'gender': {'Male': 600, 'Female': 400}
}

# Apply calibration
calibrated_data = calibrate_weights(survey_data, targets)
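The raking loop above runs a fixed ten passes and does not report whether the margins were actually reached. As a hedged alternative, the following standard-library sketch rakes the same six example weights and stops on an explicit tolerance; the function name `rake` and its return convention are assumptions for illustration, and it presumes every target category occurs in the data with positive weight.

```python
def rake(records, weights, targets, max_iter=50, tol=1e-8):
    """Rake unit weights to categorical margins; returns (weights, converged)."""
    w = list(weights)
    for _ in range(max_iter):
        worst = 0.0
        for var, target in targets.items():
            # Current weighted total per category of this variable.
            current = {}
            for rec, wi in zip(records, w):
                current[rec[var]] = current.get(rec[var], 0.0) + wi
            worst = max(worst,
                        max(abs(target[c] - current.get(c, 0.0)) for c in target))
            # Rescale each record by target / current total for its category.
            w = [wi * target[rec[var]] / current[rec[var]]
                 for rec, wi in zip(records, w)]
        # Converged if every margin was already matched before rescaling.
        if worst < tol:
            return w, True
    return w, False

records = [
    {"age": "18-30", "gender": "Male"}, {"age": "31-50", "gender": "Female"},
    {"age": "51+", "gender": "Male"}, {"age": "18-30", "gender": "Female"},
    {"age": "31-50", "gender": "Male"}, {"age": "51+", "gender": "Female"},
]
base_weights = [1.2, 0.8, 1.5, 1.1, 0.9, 1.3]
targets = {
    "age": {"18-30": 300, "31-50": 500, "51+": 200},
    "gender": {"Male": 600, "Female": 400},
}

raked, converged = rake(records, base_weights, targets)
male_total = sum(w for r, w in zip(records, raked) if r["gender"] == "Male")
young_total = sum(w for r, w in zip(records, raked) if r["age"] == "18-30")
print(converged, round(male_total, 3), round(young_total, 3))
```

Returning the convergence flag makes it possible to fail loudly when the targets are inconsistent or the iteration cap is too low.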

2.3 Weighting Considerations

Key Factors in Weighting:

Python Implementation Tips:

# Check weight diagnostics
print("Weight summary:", survey_data['weight'].describe())
print("Effective sample size:", 
      sum(survey_data['weight'])**2 / sum(survey_data['weight']**2))

# Visualize weight distribution
import matplotlib.pyplot as plt
plt.hist(survey_data['weight'], bins=20)
plt.title('Survey Weight Distribution')
plt.show()
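The effective sample size expression used above is Kish's approximation, the squared sum of the weights divided by the sum of squared weights. A small standard-library sketch with the six example weights shows the arithmetic; the helper name is illustrative.

```python
weights = [1.2, 0.8, 1.5, 1.1, 0.9, 1.3]  # the six example survey weights

def effective_sample_size(w):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    return sum(w) ** 2 / sum(x * x for x in w)

n_eff = effective_sample_size(weights)
print(f"n = {len(weights)}, effective n = {n_eff:.2f}")  # n = 6, effective n = 5.75
```

Equal weights give an effective size equal to the actual sample size; the more the weights vary, the further the effective size falls below it.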

Visualizing Margin Constraints

Relationship between seed table and target margins:

            Seed Table                  Target Margins
            ┌───────────────┐           ┌───────────────┐
            │ 100 │ 150 │250│           │      300      │
            ├─────┼─────┼───┤           ├───────────────┤
            │ 200 │ 250 │450│   IPF     │      500      │
            ├─────┼─────┼───┤  ──────▶  ├───────────────┤
            │ 150 │ 150 │300│           │      200      │
            ├─────┼─────┼───┤           └───────────────┘
            │ 450 │ 550 │   │           ┌───────┬───────┐
            └───────────────┘           │ 600   │ 400   │
                                        └───────┴───────┘

Known totals for each category (e.g., from census data):

2. How IPF Adjusts the Table Iteratively

Initial Table (Seed):

Age \ Gender   Male   Female   Row Total   Row Target
18-30          100    150      250         300
31-50          200    250      450         500
51+            150    150      300         200
Col Total      450    550      1000        -
Col Target     600    400      -           -

Iteration 1: Adjust Rows to Match Row Targets

For each row, multiply each cell by:
Row Adjustment Factor = Row Target / Current Row Total

After Row Adjustment:

Age \ Gender   Male    Female   Row Total
18-30          120     180      300
31-50          222.2   277.8    500
51+            100     100      200
Col Total      442.2   557.8    1000

Row totals now match targets, but columns do not.
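The row step can be checked with a few lines of plain Python. This is a sketch using list-of-lists for the seed (rows ordered 18-30, 31-50, 51+; columns Male, Female), which is an illustrative layout rather than a required one.

```python
seed = [[100, 150], [200, 250], [150, 150]]   # rows: 18-30, 31-50, 51+
row_targets = [300, 500, 200]

# Row adjustment factor = row target / current row total.
row_factors = [t / sum(row) for row, t in zip(seed, row_targets)]
after_rows = [[cell * f for cell in row] for row, f in zip(seed, row_factors)]

# Column totals after the row step (these generally miss the column targets).
col_totals = [sum(col) for col in zip(*after_rows)]

print([round(f, 4) for f in row_factors])
for row in after_rows:
    print([round(c, 1) for c in row])
print([round(c, 1) for c in col_totals])
```

The three factors are 1.2, about 1.1111, and about 0.6667, which reproduce the row-adjusted table.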

Iteration 1: Adjust Columns to Match Column Targets

For each column, multiply each cell by:
Column Adjustment Factor = Column Target / Current Column Total

After Column Adjustment:

Age \ Gender   Male    Female   Row Total
18-30          162.8   129.1    291.9
31-50          301.5   199.2    500.7
51+            135.7   71.7     207.4
Col Total      600     400      1000

Column totals now match targets, but rows are slightly off again.
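The column step works the same way on the row-adjusted table. In this sketch the 31-50 row is written as exact fractions (2000/9 and 2500/9, i.e. 200 and 250 scaled by 500/450) so that rounding in the displayed table does not carry into the calculation.

```python
# Table after the row adjustment (31-50 row kept as exact fractions).
table = [[120.0, 180.0], [2000 / 9, 2500 / 9], [100.0, 100.0]]
col_targets = [600, 400]

# Column adjustment factor = column target / current column total.
col_totals = [sum(col) for col in zip(*table)]
col_factors = [t / s for t, s in zip(col_targets, col_totals)]
after_cols = [[cell * f for cell, f in zip(row, col_factors)] for row in table]

# Row totals drift away from their targets again after the column step.
row_totals = [sum(row) for row in after_cols]

print([round(f, 4) for f in col_factors])
for row, total in zip(after_cols, row_totals):
    print([round(c, 1) for c in row], round(total, 1))
```

Repeating the row and column steps shrinks the remaining row error on every pass.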

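Final Converged Table (After Several Iterations):

Age \ Gender   Male    Female   Row Total
18-30          167.6   132.4    300
31-50          301.5   198.5    500
51+            131.0   69.0     200
Col Total      600     400      1000

Both row and column totals now match the targets to within the convergence tolerance (cells are shown rounded to one decimal place). Note that the result differs from the independence table (row target × column target / grand total, e.g. 180 for 18-30 × Male), because IPF only rescales rows and columns and so preserves the association between age and gender found in the seed.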

3. Key Properties of IPF
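One well-known property is that IPF only multiplies rows and columns by constants, so the fitted table keeps the cross-product (odds) ratios of the seed. The sketch below checks this numerically for the age × gender example; `ipf_2d` and `odds_ratio` are illustrative helper names, and the code assumes a strictly positive seed.

```python
def ipf_2d(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Alternate row and column scaling until the row totals settle."""
    table = [[float(c) for c in row] for row in seed]
    for _ in range(max_iter):
        for i, target in enumerate(row_targets):       # row step
            f = target / sum(table[i])
            table[i] = [c * f for c in table[i]]
        for j, target in enumerate(col_targets):       # column step
            f = target / sum(row[j] for row in table)
            for row in table:
                row[j] *= f
        # Columns are exact after the column step, so only rows need checking.
        if all(abs(sum(r) - t) < tol for r, t in zip(table, row_targets)):
            break
    return table

def odds_ratio(t, i, k):
    """Cross-product ratio of rows i and k in a two-column table."""
    return (t[i][0] * t[k][1]) / (t[i][1] * t[k][0])

seed = [[100, 150], [200, 250], [150, 150]]   # rows: 18-30, 31-50, 51+
fitted = ipf_2d(seed, [300, 500, 200], [600, 400])

for i, k in [(0, 1), (1, 2)]:
    print(round(odds_ratio(seed, i, k), 4), round(odds_ratio(fitted, i, k), 4))
```

Each printed pair should agree: the margins change, but the association structure of the seed is carried over unchanged.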

Iterative Proportional Fitting (IPF) with Vanilla Python

1. Data Preparation

1.1 Starting with Individual Survey Data

The seed table in IPF is created from survey data where each row represents an individual respondent.

Example Survey Data:

ID   Age     Gender   Income
1    18-30   Male     Low
2    31-50   Female   Medium
3    51+     Male     High
4    18-30   Female   Low
5    31-50   Male     Medium

Python: Loading Survey Data

# Sample survey data as list of dictionaries
survey_data = [
    {"Age": "18-30", "Gender": "Male", "Income": "Low"},
    {"Age": "31-50", "Gender": "Female", "Income": "Medium"},
    {"Age": "51+", "Gender": "Male", "Income": "High"},
    {"Age": "18-30", "Gender": "Female", "Income": "Low"},
    {"Age": "31-50", "Gender": "Male", "Income": "Medium"}
]

1.2 Creating the Contingency Table

We count how many individuals fall into each combination of categories:

Python: Creating Cross-Tabulation

def create_contingency(data, row_var, col_var):
    # Initialize counts
    row_categories = sorted({d[row_var] for d in data})
    col_categories = sorted({d[col_var] for d in data})
    
    # Create empty table
    table = {row: {col: 0 for col in col_categories} for row in row_categories}
    
    # Count occurrences
    for entry in data:
        table[entry[row_var]][entry[col_var]] += 1
    
    # Calculate margins
    row_totals = {row: sum(cols.values()) for row, cols in table.items()}
    col_totals = {col: sum(table[row][col] for row in row_categories) 
                 for col in col_categories}
    grand_total = sum(row_totals.values())
    
    return {
        "table": table,
        "row_totals": row_totals,
        "col_totals": col_totals,
        "grand_total": grand_total
    }

# Generate contingency table
result = create_contingency(survey_data, "Age", "Gender")
print("Contingency Table:", result["table"])

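Resulting Contingency Table:

With only the five sample records above, every observed combination has a count of 1 and 51+ × Female has a count of 0. Applied to the full 1000-respondent survey, the same function would produce the table used throughout this document:

Age \ Gender   Male   Female   Total
18-30          100    150      250
31-50          200    250      450
51+            150    150      300
Total          450    550      1000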

2. IPF Implementation

2.1 IPF Inputs

IPF requires a seed table and target margins:

Python: Setting Up IPF

# Seed table as nested dictionaries
seed = {
    "18-30": {"Male": 100, "Female": 150},
    "31-50": {"Male": 200, "Female": 250},
    "51+": {"Male": 150, "Female": 150}
}

# Target margins
row_targets = {"18-30": 300, "31-50": 500, "51+": 200}
col_targets = {"Male": 600, "Female": 400}

2.2 IPF Algorithm

The iterative process adjusts rows and columns alternately:

Python: IPF Implementation

def ipf(seed, row_targets, col_targets, max_iter=10, tol=1e-6):
    current = {row: cols.copy() for row, cols in seed.items()}
    row_categories = list(row_targets.keys())
    col_categories = list(col_targets.keys())
    
    for _ in range(max_iter):
        # Adjust rows
        for row in row_categories:
            row_sum = sum(current[row].values())
            if row_sum == 0: continue
            factor = row_targets[row] / row_sum
            for col in col_categories:
                current[row][col] *= factor
        
        # Adjust columns
        for col in col_categories:
            col_sum = sum(current[row][col] for row in row_categories)
            if col_sum == 0: continue
            factor = col_targets[col] / col_sum
            for row in row_categories:
                current[row][col] *= factor
        
        # Check convergence
        converged = True
        for row in row_categories:
            if abs(sum(current[row].values()) - row_targets[row]) > tol:
                converged = False
                break
        for col in col_categories:
            if abs(sum(current[row][col] for row in row_categories) - col_targets[col]) > tol:
                converged = False
                break
        if converged:
            break
    
    return current

result = ipf(seed, row_targets, col_targets)
print("Final IPF result:", result)

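2.3 Final Result

The function returns approximately the following table (cells rounded to one decimal place). The seed's association between age and gender is preserved, so the cells are not simply row target × column target / grand total.

Age \ Gender   Male    Female   Total
18-30          167.6   132.4    300
31-50          301.5   198.5    500
51+            131.0   69.0     200
Total          600     400      1000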