Iterative proportional fitting (IPF) is a mathematical procedure that adjusts the cells of a contingency table so that the table's margins match predefined target values.
A worked example (under ongoing development) is available in this Colab notebook.
The seed table in IPF is typically created from survey data where each row represents an individual respondent with their characteristics.
Respondent ID | Age Group | Gender | Income |
---|---|---|---|
1 | 18-30 | Male | Low |
2 | 31-50 | Female | Medium |
3 | 51+ | Male | High |
4 | 18-30 | Female | Low |
5 | 31-50 | Male | Medium |
6 | 31-50 | Female | High |
7 | 51+ | Female | Medium |
8 | 18-30 | Male | Low |
9 | 31-50 | Male | Medium |
10 | 51+ | Male | High |
We count how many individuals fall into each combination of categories. For the full survey of 1000 respondents, we might get these counts:
Age \ Gender | Male | Female | Total |
---|---|---|---|
18-30 | 100 | 150 | 250 |
31-50 | 200 | 250 | 450 |
51+ | 150 | 150 | 300 |
Total | 450 | 550 | 1000 |
Before using this as our seed table, we should check if our sample is representative:
Characteristic | Sample % | Population % | Difference (Population − Sample) |
---|---|---|---|
18-30 | 25% | 30% | +5 pp |
Male | 45% | 48% | +3 pp |
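The comparison in this table can be sketched in a few lines of Python. The shares are the ones listed above; the helper name `compare_shares` and the 2-point threshold are illustrative assumptions, not part of any library or of the original workflow.

```python
# Sample and population shares (in percent) taken from the table above.
sample_pct = {"18-30": 25.0, "Male": 45.0}
population_pct = {"18-30": 30.0, "Male": 48.0}

def compare_shares(sample, population):
    """Return population share minus sample share, in percentage points."""
    return {key: population[key] - sample[key] for key in sample}

differences = compare_shares(sample_pct, population_pct)

# Hypothetical threshold: flag categories that are off by more than 2 points.
THRESHOLD = 2.0
flagged = [key for key, diff in differences.items() if abs(diff) > THRESHOLD]

print(differences)  # {'18-30': 5.0, 'Male': 3.0}
print(flagged)      # ['18-30', 'Male']
```

Categories that end up in `flagged` are the ones IPF will have to correct most strongly.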
Now we have our seed table ready for IPF:
Age \ Gender | Male | Female | Row Total |
---|---|---|---|
18-30 | 100 | 150 | 250 |
31-50 | 200 | 250 | 450 |
51+ | 150 | 150 | 300 |
Col Total | 450 | 550 | 1000 |
The same process works for 3+ variables (e.g., Age × Gender × Income):
Age \ Gender \ Income | Male Low | Male Medium | Male High | Female Low | ... |
---|---|---|---|---|---|
18-30 | 40 | 30 | 30 | 70 | ... |
31-50 | 60 | 80 | 60 | 80 | ... |
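As a rough sketch of how such a three-way table could be counted, the snippet below uses `collections.Counter` on the ten example respondents listed at the start. Storing each cell under an `(age, gender, income)` tuple is an assumption made for illustration; any other layout (for example a 3-D array) would work equally well.

```python
from collections import Counter

# The ten example respondents: (age group, gender, income).
respondents = [
    ("18-30", "Male", "Low"),
    ("31-50", "Female", "Medium"),
    ("51+", "Male", "High"),
    ("18-30", "Female", "Low"),
    ("31-50", "Male", "Medium"),
    ("31-50", "Female", "High"),
    ("51+", "Female", "Medium"),
    ("18-30", "Male", "Low"),
    ("31-50", "Male", "Medium"),
    ("51+", "Male", "High"),
]

# Each distinct (age, gender, income) combination is one cell of the 3-way table.
cell_counts = Counter(respondents)

# Collapsing over income recovers the 2-way Age x Gender table.
age_gender = Counter((age, gender) for age, gender, _ in respondents)

print(cell_counts[("18-30", "Male", "Low")])  # 2
print(age_gender[("31-50", "Male")])          # 2
print(sum(cell_counts.values()))              # 10
```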
We start from such a contingency table, which represents the joint distribution of the variables (e.g., age × gender):
Age \ Gender | Male | Female | Row Total |
---|---|---|---|
18-30 | 100 | 150 | 250 |
31-50 | 200 | 250 | 450 |
51+ | 150 | 150 | 300 |
Col Total | 450 | 550 | 1000 |
Target margins (also called constraints) are the known population totals that we want our final table to match. These typically come from authoritative sources such as census tables, pre-aggregated totals, or model-based projections. For example:
Age Group | Target Total |
---|---|
18-30 | 300 |
31-50 | 500 |
51+ | 200 |
Gender | Target Total |
---|---|
Male | 600 |
Female | 400 |
The sum of row margins must equal the sum of column margins:
300 (18-30) + 500 (31-50) + 200 (51+) = 1000
600 (Male) + 400 (Female) = 1000
For 3+ variables, we need margins for each dimension:
```python
# Example for Age × Gender × Income
age_margins = [300, 500, 200]      # Age totals
gender_margins = [600, 400]        # Gender totals
income_margins = [400, 350, 250]   # Income totals (Low, Medium, High)

# Grand total must match across all dimensions
assert sum(age_margins) == sum(gender_margins) == sum(income_margins)  # 1000
```
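To show how one-dimensional margins for three variables could be applied, here is a minimal NumPy sketch. The function names `marginal` and `ipf_nd` are hypothetical, and the uniform seed is an assumption made only to keep the example small; in practice the seed would be the three-way table of survey counts.

```python
import numpy as np

def marginal(table, axis):
    """Sum the table over every axis except `axis`."""
    other_axes = tuple(a for a in range(table.ndim) if a != axis)
    return table.sum(axis=other_axes)

def ipf_nd(seed, margins, max_iter=100, tol=1e-8):
    """Scale an N-dimensional seed so each 1-D marginal matches its target."""
    table = np.array(seed, dtype=float)
    targets = [np.asarray(m, dtype=float) for m in margins]
    for _ in range(max_iter):
        for axis, target in enumerate(targets):
            current = marginal(table, axis)
            # Leave empty slices untouched rather than dividing by zero.
            factor = np.divide(target, current,
                               out=np.ones_like(target), where=current > 0)
            shape = [1] * table.ndim
            shape[axis] = -1
            table = table * factor.reshape(shape)
        error = max(np.abs(marginal(table, ax) - t).max()
                    for ax, t in enumerate(targets))
        if error < tol:
            break
    return table

age_margins = [300, 500, 200]
gender_margins = [600, 400]
income_margins = [400, 350, 250]

seed = np.ones((3, 2, 3))  # assumption: uniform seed (Age x Gender x Income)
fitted = ipf_nd(seed, [age_margins, gender_margins, income_margins])

print(np.allclose(marginal(fitted, 0), age_margins))     # True
print(np.allclose(marginal(fitted, 1), gender_margins))  # True
print(np.allclose(marginal(fitted, 2), income_margins))  # True
```

With a uniform seed the fitted cells are simply the product of the three marginal shares times the grand total; a seed built from real counts would instead carry its own associations into the result.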
Source | Example Margins |
---|---|
Census API | Age by gender tables |
CSV Files | Pre-aggregated totals |
Statistical Models | Projected demographics |
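How margins are loaded depends on the source. As one illustration for the CSV case, the sketch below reads pre-aggregated totals with Python's standard `csv` module. The column names (`variable`, `category`, `total`), the inline file contents, and the helper `load_margins` are all assumptions for this example; a real file may be laid out differently.

```python
import csv
import io

# Hypothetical file contents; in practice this would come from an opened CSV file.
csv_text = """variable,category,total
age,18-30,300
age,31-50,500
age,51+,200
gender,Male,600
gender,Female,400
"""

def load_margins(handle):
    """Group pre-aggregated totals by variable: {variable: {category: total}}."""
    margins = {}
    for row in csv.DictReader(handle):
        margins.setdefault(row["variable"], {})[row["category"]] = float(row["total"])
    return margins

margins = load_margins(io.StringIO(csv_text))
print(margins["age"])     # {'18-30': 300.0, '31-50': 500.0, '51+': 200.0}
print(margins["gender"])  # {'Male': 600.0, 'Female': 400.0}
```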
When row and column margins don't sum to the same total, we must resolve the discrepancy before running IPF:
```python
def normalize_margins(row_margins, col_margins, method='proportional'):
    """
    Normalize margins to common total using different strategies

    Parameters:
        row_margins (list): Target row totals
        col_margins (list): Target column totals
        method (str): Normalization approach ('proportional', 'row_priority', 'col_priority')

    Returns:
        tuple: (normalized_row_margins, normalized_col_margins)
    """
    total_row = sum(row_margins)
    total_col = sum(col_margins)

    # Check if normalization is needed
    if abs(total_row - total_col) < 1e-6:
        return row_margins, col_margins

    if method == 'proportional':
        # Scale both dimensions proportionally
        common_total = (total_row + total_col) / 2
        row_factor = common_total / total_row
        col_factor = common_total / total_col
        return [x*row_factor for x in row_margins], [x*col_factor for x in col_margins]
    elif method == 'row_priority':
        # Keep row totals fixed, adjust columns
        col_factor = total_row / total_col
        return row_margins, [x*col_factor for x in col_margins]
    elif method == 'col_priority':
        # Keep column totals fixed, adjust rows
        row_factor = total_col / total_row
        return [x*row_factor for x in row_margins], col_margins
    else:
        raise ValueError(f"Unknown method: {method}")
```
Dimension | Margins | Sum |
---|---|---|
Age Groups (rows) | [320, 520, 210] | 1050 |
Gender (columns) | [610, 390] | 1000 |
```python
row_margins, col_margins = normalize_margins(
    [320, 520, 210],
    [610, 390],
    method='proportional'
)
# Result: both dimensions are scaled to the common total (1050 + 1000) / 2 = 1025
# Rows: [312.38, 507.62, 205.00] (sum=1025)
# Cols: [625.25, 399.75] (sum=1025)
```
Type | Description | When to Use | Calculation |
---|---|---|---|
Unweighted | Raw counts from sample | When sample is perfectly representative | Simple value_counts() |
Weighted | Adjusted for sampling design | Most real-world surveys | Sum of weights per group |
```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Sample survey data with weights
survey_data = pd.DataFrame({
    'age': ['18-30', '31-50', '51+', '18-30', '31-50', '51+'],
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'weight': [1.2, 0.8, 1.5, 1.1, 0.9, 1.3]
})

# Unweighted counts (ignores sampling design)
unweighted = pd.crosstab(
    index=survey_data['age'],
    columns=survey_data['gender']
)

# Properly weighted counts
weighted = survey_data.groupby(['age', 'gender'])['weight'].sum().unstack()

# Calibration to known population totals
def calibrate_weights(df, target_margins):
    """
    Raking procedure to adjust weights to match margins
    """
    # Work on a copy and initialize calibrated weights
    df = df.copy()
    df['calib_weight'] = df['weight']
    # Convert the target dicts to Series so they align with the group sums by label
    age_targets = pd.Series(target_margins['age'])
    gender_targets = pd.Series(target_margins['gender'])
    # Iterative proportional adjustment (fixed number of passes)
    for _ in range(10):
        # Adjust to age margins
        age_factor = age_targets / df.groupby('age')['calib_weight'].sum()
        df['calib_weight'] *= df['age'].map(age_factor)
        # Adjust to gender margins
        gender_factor = gender_targets / df.groupby('gender')['calib_weight'].sum()
        df['calib_weight'] *= df['gender'].map(gender_factor)
    return df

# Target margins for calibration
targets = {
    'age': {'18-30': 300, '31-50': 500, '51+': 200},
    'gender': {'Male': 600, 'Female': 400}
}

# Apply calibration
calibrated_data = calibrate_weights(survey_data, targets)

# Check weight diagnostics
print("Weight summary:", survey_data['weight'].describe())
print("Effective sample size:",
      sum(survey_data['weight'])**2 / sum(survey_data['weight']**2))

# Visualize weight distribution
plt.hist(survey_data['weight'], bins=20)
plt.title('Survey Weight Distribution')
plt.show()
```
Relationship between seed table and target margins:
```text
Seed Table                  Target Margins
┌───────────────┐           ┌───────────────┐
│ 100 │ 150 │250│           │      300      │
├─────┼─────┼───┤           ├───────────────┤
│ 200 │ 250 │450│    IPF    │      500      │
├─────┼─────┼───┤  ──────▶  ├───────────────┤
│ 150 │ 150 │300│           │      200      │
├─────┼─────┼───┤           └───────────────┘
│ 450 │ 550 │   │           ┌───────┬───────┐
└───────────────┘           │  600  │  400  │
                            └───────┴───────┘
```
Known totals for each category (e.g., from census data):
Age \ Gender | Male | Female | Row Total | Row Target |
---|---|---|---|---|
18-30 | 100 | 150 | 250 | 300 |
31-50 | 200 | 250 | 450 | 500 |
51+ | 150 | 150 | 300 | 200 |
Col Total | 450 | 550 | 1000 | - |
Col Target | 600 | 400 | - | - |
For each row, multiply each cell by:
Row Adjustment Factor = Row Target / Current Row Total
After Row Adjustment:
Age \ Gender | Male | Female | Row Total |
---|---|---|---|
18-30 | 120 | 180 | 300 |
31-50 | 222.2 | 277.8 | 500 |
51+ | 100 | 100 | 200 |
Col Total | 442.2 | 557.8 | 1000 |
→ Row totals now match targets, but columns do not.
For each column, multiply each cell by:
Column Adjustment Factor = Column Target / Current Column Total
After Column Adjustment:
Age \ Gender | Male | Female | Row Total |
---|---|---|---|
18-30 | 162.8 | 129.1 | 291.9 |
31-50 | 301.5 | 199.2 | 500.7 |
51+ | 135.7 | 71.7 | 207.4 |
Col Total | 600 | 400 | 1000 |
→ Column totals now match targets, but rows are slightly off again.
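The first row and column adjustment can be checked with a few lines of plain Python. This is only a sketch of the two steps described above, using nested lists with rows ordered 18-30, 31-50, 51+ and columns ordered Male, Female; the variable names are illustrative.

```python
seed = [[100, 150], [200, 250], [150, 150]]  # rows: 18-30, 31-50, 51+; cols: Male, Female
row_targets = [300, 500, 200]
col_targets = [600, 400]

# Row step: scale each row so that it sums to its target.
after_rows = [
    [cell * target / sum(row) for cell in row]
    for row, target in zip(seed, row_targets)
]

# Column step: scale each column of the row-adjusted table to its target.
col_sums = [sum(col) for col in zip(*after_rows)]
after_cols = [
    [cell * target / total for cell, target, total in zip(row, col_targets, col_sums)]
    for row in after_rows
]

print([[round(c, 1) for c in row] for row in after_rows])
# [[120.0, 180.0], [222.2, 277.8], [100.0, 100.0]]
print([[round(c, 1) for c in row] for row in after_cols])
# [[162.8, 129.1], [301.5, 199.2], [135.7, 71.7]]
print([round(sum(row), 1) for row in after_cols])
# [291.9, 500.7, 207.4]
```

The last line shows the row totals drifting away from 300, 500 and 200 after the column step, which is why the two steps have to be repeated.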
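Repeating the row and column adjustments until the margins stop changing gives the converged table (cells rounded to one decimal place):
Age \ Gender | Male | Female | Row Total |
---|---|---|---|
18-30 | 167.6 | 132.4 | 300 |
31-50 | 301.5 | 198.5 | 500 |
51+ | 131.0 | 69.0 | 200 |
Col Total | 600 | 400 | 1000 |
→ Now both row and column totals match the targets (up to rounding), while the cells keep the association pattern of the seed table.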
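Running the two steps in a loop until the row totals are within a tolerance can be sketched as follows. The helper names `ipf_step` and `cross_product_ratio`, the tolerance of 1e-6 and the cap of 100 passes are assumptions chosen for illustration. The second print relies on a well-known property of IPF: scaling whole rows and columns leaves the cross-product (odds) ratios of the seed unchanged, which is why the fitted table generally differs from the simple product of the margins.

```python
def ipf_step(table, row_targets, col_targets):
    """One row adjustment followed by one column adjustment."""
    rows_done = [[c * t / sum(row) for c in row] for row, t in zip(table, row_targets)]
    col_sums = [sum(col) for col in zip(*rows_done)]
    return [[c * t / s for c, t, s in zip(row, col_targets, col_sums)] for row in rows_done]

def cross_product_ratio(t):
    """Odds ratio of the 2x2 block formed by the first two rows."""
    return (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])

seed = [[100, 150], [200, 250], [150, 150]]  # rows: 18-30, 31-50, 51+
row_targets, col_targets = [300, 500, 200], [600, 400]

table = seed
for iteration in range(1, 101):  # assumption: cap at 100 passes
    table = ipf_step(table, row_targets, col_targets)
    # Columns are exact after each pass, so only the rows need checking.
    row_error = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
    if row_error < 1e-6:
        break

print([[round(c, 1) for c in row] for row in table])
# [[167.6, 132.4], [301.5, 198.5], [131.0, 69.0]]
print(round(cross_product_ratio(seed), 4), round(cross_product_ratio(table), 4))
# 0.8333 0.8333
```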
The seed table in IPF is created from survey data where each row represents an individual respondent.
ID | Age | Gender | Income |
---|---|---|---|
1 | 18-30 | Male | Low |
2 | 31-50 | Female | Medium |
3 | 51+ | Male | High |
4 | 18-30 | Female | Low |
5 | 31-50 | Male | Medium |
```python
# Sample survey data as list of dictionaries
survey_data = [
    {"Age": "18-30", "Gender": "Male", "Income": "Low"},
    {"Age": "31-50", "Gender": "Female", "Income": "Medium"},
    {"Age": "51+", "Gender": "Male", "Income": "High"},
    {"Age": "18-30", "Gender": "Female", "Income": "Low"},
    {"Age": "31-50", "Gender": "Male", "Income": "Medium"}
]
```
We count how many individuals fall into each combination of categories:
```python
def create_contingency(data, row_var, col_var):
    # Initialize counts
    row_categories = sorted({d[row_var] for d in data})
    col_categories = sorted({d[col_var] for d in data})

    # Create empty table
    table = {row: {col: 0 for col in col_categories} for row in row_categories}

    # Count occurrences
    for entry in data:
        table[entry[row_var]][entry[col_var]] += 1

    # Calculate margins
    row_totals = {row: sum(cols.values()) for row, cols in table.items()}
    col_totals = {col: sum(table[row][col] for row in row_categories)
                  for col in col_categories}
    grand_total = sum(row_totals.values())

    return {
        "table": table,
        "row_totals": row_totals,
        "col_totals": col_totals,
        "grand_total": grand_total
    }

# Generate contingency table
result = create_contingency(survey_data, "Age", "Gender")
print("Contingency Table:", result["table"])
```
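Applied to the full survey of 1000 respondents rather than the five-row sample, the same counting gives:
Age\Gender | Male | Female | Total |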
---|---|---|---|
18-30 | 100 | 150 | 250 |
31-50 | 200 | 250 | 450 |
51+ | 150 | 150 | 300 |
Total | 450 | 550 | 1000 |
IPF requires a seed table and target margins:
```python
# Seed table as nested dictionaries
seed = {
    "18-30": {"Male": 100, "Female": 150},
    "31-50": {"Male": 200, "Female": 250},
    "51+": {"Male": 150, "Female": 150}
}

# Target margins
row_targets = {"18-30": 300, "31-50": 500, "51+": 200}
col_targets = {"Male": 600, "Female": 400}
```
The iterative process adjusts rows and columns alternately:
```python
def ipf(seed, row_targets, col_targets, max_iter=10, tol=1e-6):
    current = {row: cols.copy() for row, cols in seed.items()}
    row_categories = list(row_targets.keys())
    col_categories = list(col_targets.keys())

    for _ in range(max_iter):
        # Adjust rows
        for row in row_categories:
            row_sum = sum(current[row].values())
            if row_sum == 0:
                continue
            factor = row_targets[row] / row_sum
            for col in col_categories:
                current[row][col] *= factor

        # Adjust columns
        for col in col_categories:
            col_sum = sum(current[row][col] for row in row_categories)
            if col_sum == 0:
                continue
            factor = col_targets[col] / col_sum
            for row in row_categories:
                current[row][col] *= factor

        # Check convergence
        converged = True
        for row in row_categories:
            if abs(sum(current[row].values()) - row_targets[row]) > tol:
                converged = False
                break
        for col in col_categories:
            if abs(sum(current[row][col] for row in row_categories) - col_targets[col]) > tol:
                converged = False
                break
        if converged:
            break

    return current

result = ipf(seed, row_targets, col_targets)
print("Final IPF result:", result)
```
Final IPF result (cells rounded to one decimal place):
Age\Gender | Male | Female | Total |
---|---|---|---|
18-30 | 167.6 | 132.4 | 300 |
31-50 | 301.5 | 198.5 | 500 |
51+ | 131.0 | 69.0 | 200 |
Total | 600 | 400 | 1000 |
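A quick sanity check after fitting is to compare a table's margins with the targets. The helper below is a minimal sketch for the nested-dictionary layout used here; `margin_errors` is a hypothetical name, and it is demonstrated on the unfitted seed so that the numbers are easy to verify by hand.

```python
def margin_errors(table, row_targets, col_targets):
    """Return the largest absolute row and column deviation from the targets."""
    row_err = max(abs(sum(table[r].values()) - t) for r, t in row_targets.items())
    col_err = max(
        abs(sum(table[r][c] for r in table) - t) for c, t in col_targets.items()
    )
    return row_err, col_err

seed = {
    "18-30": {"Male": 100, "Female": 150},
    "31-50": {"Male": 200, "Female": 250},
    "51+": {"Male": 150, "Female": 150},
}
row_targets = {"18-30": 300, "31-50": 500, "51+": 200}
col_targets = {"Male": 600, "Female": 400}

# Before fitting: the 51+ row is off by 100 and each column by 150.
print(margin_errors(seed, row_targets, col_targets))  # (100, 150)
```

Calling the same helper on the table returned by an IPF run should give two values close to zero once the procedure has converged.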