setjoin

Set-aware record linkage with structure-preserving joins

setjoin provides algorithms for matching records across datasets while preserving hierarchical structure. When the records being linked belong to groups (households, schools, orders, etc.), standard matching algorithms can assign members of one source group to several different target groups. setjoin’s structure-aware matching keeps group assignments coherent.
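To make the splitting problem concrete, here is a minimal sketch in plain Python that does not use setjoin at all. The score matrix, the group labels, and the helper names `best_target_per_record` and `split_groups` are all invented for illustration. It matches each source record to its best-scoring target independently and then reports which source households ended up spread over more than one target household.

```python
# Toy similarity scores: rows are source records, columns are target records.
# All names and numbers here are invented for illustration.
source_group = ["H1", "H1", "H2", "H2"]  # household of each source record
target_group = ["A", "A", "B", "B"]      # household of each target record

scores = [
    [0.9, 0.2, 0.3, 0.1],  # source 0 (H1)
    [0.3, 0.4, 0.8, 0.1],  # source 1 (H1) scores best against a record in B
    [0.1, 0.7, 0.2, 0.6],  # source 2 (H2) scores best against a record in A
    [0.1, 0.2, 0.3, 0.9],  # source 3 (H2)
]


def best_target_per_record(scores):
    """Pick the highest-scoring target for each source record independently."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def split_groups(assignment, source_group, target_group):
    """Return source groups whose members landed in more than one target group."""
    landed = {}
    for src, tgt in enumerate(assignment):
        landed.setdefault(source_group[src], set()).add(target_group[tgt])
    return sorted(g for g, targets in landed.items() if len(targets) > 1)


assignment = best_target_per_record(scores)
print(assignment)
print(split_groups(assignment, source_group, target_group))
```

In this toy data every record gets a distinct target, yet both source households are split across target households A and B. That is the kind of outcome a structure-aware method is meant to avoid.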

Key Features

  • Structure-aware matching: Preserves group membership during linkage

  • Multiple algorithms: Greedy, Hungarian (optimal), and structure-aware methods

  • Configurable scoring: Pluggable comparators with customizable weights

  • Rich diagnostics: Match confidence, group margins, error analysis

  • Variable group sizes: Not limited to fixed-size groups
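The difference between a greedy and an optimal one-to-one assignment can be sketched without setjoin. The functions below are hypothetical helpers written for this illustration, not the library's implementation. In particular, `optimal_match` finds the best assignment by brute force over permutations, which gives the same answer the Hungarian algorithm would on a small square matrix but is not the Hungarian algorithm itself.

```python
from itertools import permutations


def greedy_match(scores):
    """Repeatedly take the highest remaining score whose row and column are free."""
    pairs = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        reverse=True,
    )
    used_rows, used_cols, match = set(), set(), {}
    for _, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            match[i] = j
            used_rows.add(i)
            used_cols.add(j)
    return match


def optimal_match(scores):
    """Exhaustive search over one-to-one assignments (small square matrices only)."""
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda p: sum(scores[i][p[i]] for i in range(n)),
    )
    return dict(enumerate(best))


def total(scores, match):
    """Sum of the scores of the matched pairs."""
    return sum(scores[i][j] for i, j in match.items())


# Invented 2x2 example where the locally best pair blocks a better overall match.
scores = [
    [0.9, 0.8],
    [0.8, 0.1],
]
greedy = greedy_match(scores)
optimal = optimal_match(scores)
print(greedy, round(total(scores, greedy), 2))
print(optimal, round(total(scores, optimal), 2))
```

Greedy commits to the 0.9 pair first and is then forced to accept the 0.1 pair, while the exhaustive search pairs the two 0.8 scores for a higher total. Exhaustive search grows factorially with the number of records, which is why a polynomial-time method such as the Hungarian algorithm is used in practice.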

Quick Example

from setjoin import HierarchySpec, Scorer, match, MatchReport

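# source_df and target_df are assumed to be DataFrames (for example, pandas)
# that are already loaded and contain the surname, first_name, age, and
# household_id columns used below.

# Configure scoring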
scorer = Scorer({
    "surname": {"weight": 1.5, "comparator": "levenshtein"},
    "first_name": {"weight": 1.2, "comparator": "jaro_winkler"},
    "age": {"weight": 0.35, "comparator": "abs_diff"},
})
scores = scorer.score(source_df, target_df)

# Define hierarchy
hierarchy = HierarchySpec.from_dataframe(
    source_df, target_df,
    source_group_col="household_id",
    target_group_col="household_id",
)

# Match with structure preservation
result = match(scores, method="structure_aware", hierarchy=hierarchy)

# Analyze results
report = MatchReport(result, scores)
print(report.summary())
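For intuition about what weighted, pluggable comparators do, the following is a rough sketch using only the Python standard library. It is an assumption about the general technique, not a description of how `Scorer` combines fields internally. `string_similarity` uses `difflib.SequenceMatcher` as a stand-in and is neither Levenshtein nor Jaro-Winkler, and `age_similarity`, `FIELDS`, and `weighted_score` are hypothetical names.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Stand-in string comparator in [0, 1]; not Levenshtein or Jaro-Winkler."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def age_similarity(a, b):
    """Map an absolute age difference to (0, 1]; identical ages score 1.0."""
    return 1.0 / (1.0 + abs(a - b))


# Hypothetical config that reuses the field weights from the Quick Example.
FIELDS = {
    "surname": (1.5, string_similarity),
    "first_name": (1.2, string_similarity),
    "age": (0.35, age_similarity),
}


def weighted_score(source, target, fields=FIELDS):
    """Weighted average of per-field similarities, so the result stays in [0, 1]."""
    total_weight = sum(w for w, _ in fields.values())
    weighted = sum(
        w * compare(source[name], target[name])
        for name, (w, compare) in fields.items()
    )
    return weighted / total_weight


# Invented records for illustration.
a = {"surname": "Andersen", "first_name": "Marie", "age": 34}
b = {"surname": "Anderson", "first_name": "Marie", "age": 35}
c = {"surname": "Okafor", "first_name": "Chidi", "age": 61}

print(round(weighted_score(a, b), 3))
print(weighted_score(a, b) > weighted_score(a, c))
```

Dividing by the total weight keeps scores comparable if fields are added or reweighted. A larger weight on `surname` means a surname mismatch lowers the score more than an age mismatch of similar severity.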

Installation

pip install setjoin

# With visualization support
pip install "setjoin[viz]"
