setjoin
Set-aware record linkage with structure-preserving joins
setjoin provides algorithms for matching records across datasets while preserving hierarchical structure. When the records being linked belong to groups (households, schools, orders, etc.), standard matching algorithms treat each record independently and can therefore split the members of one source group across several target groups. setjoin’s structure-aware matching keeps group assignments coherent, so members of the same source group are linked into the same target group.
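The splitting problem described above can be reproduced on a toy example. The following is a minimal sketch in plain Python, not setjoin's implementation: the score table, the record names, and the helper `best_assignment` are all hypothetical, and the exhaustive search is only practical for tiny inputs.

```python
from itertools import permutations

# Hypothetical similarity scores: scores[source_record][target_record].
# Source household H1 = {a, b}; target households T1 = {x, y} and T2 = {z}.
scores = {
    "a": {"x": 0.9, "y": 0.2, "z": 0.1},
    "b": {"x": 0.3, "y": 0.5, "z": 0.6},
}
target_group = {"x": "T1", "y": "T1", "z": "T2"}


def best_assignment(sources, targets):
    """Exhaustively find the one-to-one assignment with the highest total score."""
    best, best_total = None, float("-inf")
    for perm in permutations(targets, len(sources)):
        total = sum(scores[s][t] for s, t in zip(sources, perm))
        if total > best_total:
            best, best_total = dict(zip(sources, perm)), total
    return best, best_total


# Record-level optimum: household membership is ignored entirely.
flat, flat_total = best_assignment(["a", "b"], ["x", "y", "z"])

# Group-constrained optimum: both members of H1 must land in one target household.
grouped, grouped_total = None, float("-inf")
for group in sorted(set(target_group.values())):
    members = [t for t in target_group if target_group[t] == group]
    if len(members) < 2:
        continue  # this target household cannot hold both source records
    candidate, total = best_assignment(["a", "b"], members)
    if total > grouped_total:
        grouped, grouped_total = candidate, total

print("record-level:", flat)
print("group-constrained:", grouped)
```

In this toy data the record-level optimum sends `a` to household T1 and `b` to household T2, because that pairing has the higher total score. The group-constrained search accepts a slightly lower total in exchange for keeping both records in T1.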
Key Features
Structure-aware matching: Preserves group membership during linkage
Multiple algorithms: Greedy, Hungarian (optimal), and structure-aware methods
Configurable scoring: Pluggable comparators with customizable weights
Rich diagnostics: Match confidence, group margins, error analysis
Variable group sizes: Not limited to fixed-size groups
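To make the "configurable scoring" idea concrete, here is a minimal sketch of how weighted per-field comparators could be combined into a single pair score. It is an illustration under stated assumptions, not setjoin's scoring code: the functions `levenshtein_similarity`, `abs_diff_similarity`, and `pair_score`, the `scale` cutoff, and the weighted-mean combination are all hypothetical choices.

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """Return 1 - (edit distance / longer length), a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return 1.0 - previous[-1] / max(len(a), len(b))


def abs_diff_similarity(a: float, b: float, scale: float = 10.0) -> float:
    """Map an absolute difference onto [0, 1]; `scale` is an assumed cutoff."""
    return max(0.0, 1.0 - abs(a - b) / scale)


# Hypothetical field configuration: field name -> (weight, comparator).
FIELDS = {
    "surname": (1.5, levenshtein_similarity),
    "age": (0.35, abs_diff_similarity),
}


def pair_score(source: dict, target: dict) -> float:
    """Weighted mean of the per-field similarities for one record pair."""
    total_weight = sum(weight for weight, _ in FIELDS.values())
    weighted = sum(
        weight * compare(source[name], target[name])
        for name, (weight, compare) in FIELDS.items()
    )
    return weighted / total_weight


print(pair_score({"surname": "smith", "age": 41}, {"surname": "smyth", "age": 43}))
```

Dividing by the total weight keeps the result in the range 0 to 1 regardless of how many fields are configured, which makes scores comparable across different configurations.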
Quick Example
from setjoin import HierarchySpec, Scorer, match, MatchReport

# source_df and target_df are assumed to be DataFrames loaded elsewhere,
# each with surname, first_name, age, and household_id columns.

# Configure scoring: one comparator and weight per field
scorer = Scorer({
    "surname": {"weight": 1.5, "comparator": "levenshtein"},
    "first_name": {"weight": 1.2, "comparator": "jaro_winkler"},
    "age": {"weight": 0.35, "comparator": "abs_diff"},
})
scores = scorer.score(source_df, target_df)

# Define hierarchy: which column identifies the group on each side
hierarchy = HierarchySpec.from_dataframe(
    source_df, target_df,
    source_group_col="household_id",
    target_group_col="household_id",
)

# Match with structure preservation
result = match(scores, method="structure_aware", hierarchy=hierarchy)

# Analyze results
report = MatchReport(result, scores)
print(report.summary())
Installation
pip install setjoin
# With visualization support
pip install "setjoin[viz]"