API Reference

Matching

Matching algorithms: greedy, Hungarian, and structure-aware.

setjoin.matchers.compare_methods(scores: ndarray[tuple[Any, ...], dtype[float64]], hierarchy: HierarchySpec | None = None, methods: list[str] | None = None) → dict[str, MatchResult]

Run multiple matching methods and return results for comparison.

Parameters:

scores – Pairwise score matrix (n_source x n_target)
hierarchy – Required if "structure_aware" is in methods
methods – List of methods to run (default: all available)

Returns:

Mapping from method name to its MatchResult

Return type:

dict[str, MatchResult]

setjoin.matchers.greedy_match(scores: ndarray[tuple[Any, ...], dtype[float64]], show_progress: bool = False) → MatchResult

Match records greedily by selecting highest-scoring pairs first.

Parameters:

scores – Pairwise score matrix (n_source x n_target)
show_progress – Whether to show progress bar

Returns:

Greedy matches

Return type:

MatchResult
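
The greedy strategy described above can be sketched in plain Python. This is an illustrative stand-in, not setjoin's implementation: the name `greedy_match_sketch` is hypothetical, and the sketch assumes each source and each target record may be used at most once.

```python
def greedy_match_sketch(scores):
    """Pick highest-scoring pairs first, using each source and target at most once."""
    # Enumerate every (score, source, target) cell and sort best-first.
    cells = [
        (score, i, j)
        for i, row in enumerate(scores)
        for j, score in enumerate(row)
    ]
    cells.sort(key=lambda cell: -cell[0])

    used_sources, used_targets, matches = set(), set(), []
    for _score, i, j in cells:
        # Skip cells whose source or target is already taken.
        if i in used_sources or j in used_targets:
            continue
        matches.append((i, j))
        used_sources.add(i)
        used_targets.add(j)
    total = sum(scores[i][j] for i, j in matches)
    return sorted(matches), total


scores = [
    [0.90, 0.80],
    [0.85, 0.10],
]
matches, total = greedy_match_sketch(scores)
```

On this matrix the greedy pass takes the 0.90 cell first, which forces the second source record onto its worst target. A global assignment would pair the records the other way round for a higher total.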

setjoin.matchers.hungarian_match(scores: ndarray[tuple[Any, ...], dtype[float64]]) → MatchResult

Match records using Hungarian algorithm to maximize total score.

Parameters:

scores – Pairwise score matrix (n_source x n_target)

Returns:

Optimal global assignment

Return type:

MatchResult
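
To illustrate what "maximize total score" means, the sketch below finds the optimal one-to-one assignment by exhaustive search over permutations. It is not the Hungarian algorithm itself, which reaches the same optimum in polynomial time. The name `optimal_match_sketch` is hypothetical, and the sketch assumes a small square matrix.

```python
from itertools import permutations


def optimal_match_sketch(scores):
    """Exhaustively search one-to-one assignments for the maximum total score.

    Assumes a square matrix; only practical for very small inputs.
    """
    n = len(scores)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        # perm[i] is the target assigned to source i.
        total = sum(scores[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return [(i, best_perm[i]) for i in range(n)], best_total


scores = [
    [0.90, 0.80],
    [0.85, 0.10],
]
matches, total = optimal_match_sketch(scores)
```

Here the optimum gives up the single best cell (0.90) so that both records get a good partner. For realistic sizes, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment(scores, maximize=True)` computes the same kind of assignment.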

setjoin.matchers.match(scores: ndarray[tuple[Any, ...], dtype[float64]], method: str = 'hungarian', hierarchy: HierarchySpec | None = None, show_progress: bool = False) → MatchResult

Match records using the specified method.

Parameters:

scores – Pairwise score matrix (n_source x n_target)
method – Matching method: "greedy", "hungarian", or "structure_aware"
hierarchy – Required for structure_aware method
show_progress – Whether to show progress bar

Returns:

Matches using specified method

Return type:

MatchResult

Raises:

ValueError – If method is unknown, or if hierarchy is missing when method is "structure_aware"

setjoin.matchers.structure_aware_match(scores: ndarray[tuple[Any, ...], dtype[float64]], hierarchy: HierarchySpec, show_progress: bool = False) → MatchResult

Match records while preserving group structure.

Two-level assignment: first assign groups optimally, then assign records within matched groups. This ensures all members of a source group map to the same target group.

Parameters:

scores – Pairwise record score matrix (n_source x n_target)
hierarchy – HierarchySpec defining group structure
show_progress – Whether to show progress bar

Returns:

Structure-preserving matches

Return type:

MatchResult
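
The two-level assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not setjoin's code: both helper names are hypothetical, plain dicts stand in for HierarchySpec, and brute-force search replaces whatever solver the library uses, so it only suits tiny inputs.

```python
from itertools import permutations


def best_assignment(scores, rows, cols):
    """Brute-force the best one-to-one assignment of rows to cols."""
    # Assign the smaller side into the larger so unequal sizes still work.
    if len(rows) <= len(cols):
        candidates = (list(zip(rows, perm)) for perm in permutations(cols, len(rows)))
    else:
        candidates = (list(zip(perm, cols)) for perm in permutations(rows, len(cols)))
    best_pairs, best_total = [], float("-inf")
    for pairs in candidates:
        total = sum(scores[r][c] for r, c in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    return best_pairs, best_total


def structure_aware_sketch(scores, source_groups, target_groups):
    """Two-level assignment: groups first, then records within matched groups."""
    src_ids, tgt_ids = list(source_groups), list(target_groups)
    # Score every (source group, target group) pair by its best
    # within-group record assignment.
    within = {
        (s, t): best_assignment(scores, source_groups[s], target_groups[t])
        for s in src_ids
        for t in tgt_ids
    }
    group_scores = [[within[s, t][1] for t in tgt_ids] for s in src_ids]
    # Level 1: assign source groups to target groups.
    group_pairs, _ = best_assignment(
        group_scores, list(range(len(src_ids))), list(range(len(tgt_ids)))
    )
    # Level 2: keep the record pairs belonging to the chosen group pairs.
    group_assignments, matches = {}, []
    for gi, gj in group_pairs:
        s, t = src_ids[gi], tgt_ids[gj]
        group_assignments[s] = t
        matches.extend(within[s, t][0])
    return sorted(matches), group_assignments


scores = [
    [0.9, 0.1, 0.2, 0.1],
    [0.1, 0.3, 0.8, 0.1],  # record 1 scores best against a record in the other group
    [0.1, 0.7, 0.4, 0.1],
    [0.1, 0.1, 0.1, 0.9],
]
source_groups = {0: [0, 1], 1: [2, 3]}
target_groups = {10: [0, 1], 11: [2, 3]}
matches, group_assignments = structure_aware_sketch(scores, source_groups, target_groups)
```

In this example an unconstrained assignment would send source records 1 and 2 across group boundaries for a higher total score. The two-level version keeps each source group inside a single target group and accepts the lower total.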

Scoring

Configurable scoring functions for record comparison.

class setjoin.scorers.Scorer(config: dict[str, dict[str, float | str] | FieldConfig])

Bases: object

Configurable scorer for computing pairwise similarity between records.

Parameters:

config – Mapping from field name to configuration. Each value can be a FieldConfig or a dict with keys:
- weight: float (default 1.0)
- comparator: str (default "exact")
- missing_value: float (default 0.0)

score(source: DataFrame, target: DataFrame, source_suffix: str = '', target_suffix: str = '') → ndarray[tuple[Any, ...], dtype[float64]]

Compute pairwise score matrix between source and target records.

Parameters:

source – DataFrame with source records
target – DataFrame with target records
source_suffix – Suffix to append to field names for source columns
target_suffix – Suffix to append to field names for target columns

Returns:

Score matrix of shape (len(source), len(target))

Return type:

ScoreMatrix

Raises:

ValueError – If a configured field is not found or a comparator name is unknown
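
As a rough picture of how per-field weights, comparators, and missing-value scores could combine into one matrix, here is a plain-Python sketch. It is an assumption-laden illustration rather than the Scorer's logic: `score_matrix_sketch` and `exact` are hypothetical names, records are plain dicts instead of DataFrame rows, only exact comparison is shown, and the missing-value score is assumed to be multiplied by the field weight.

```python
def exact(a, b):
    """Return 1.0 when the two values are equal, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def score_matrix_sketch(source, target, config):
    """Weighted sum of per-field similarities for every (source, target) pair."""
    matrix = []
    for s in source:
        row = []
        for t in target:
            total = 0.0
            for field, cfg in config.items():
                a, b = s.get(field), t.get(field)
                if a is None or b is None:
                    # Assumption: the missing-value score is weighted like any other.
                    similarity = cfg.get("missing_value", 0.0)
                else:
                    similarity = exact(a, b)
                total += cfg.get("weight", 1.0) * similarity
            row.append(total)
        matrix.append(row)
    return matrix


source = [{"name": "ann", "zip": "10001"}, {"name": "bob", "zip": None}]
target = [{"name": "ann", "zip": "10001"}, {"name": "bob", "zip": "94105"}]
config = {
    "name": {"weight": 2.0},
    "zip": {"weight": 1.0, "missing_value": 0.5},
}
matrix = score_matrix_sketch(source, target, config)
```

The result has one row per source record and one column per target record, which is the shape the matching functions expect.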

setjoin.scorers.abs_diff(a: ndarray[tuple[Any, ...], dtype[float64]], b: ndarray[tuple[Any, ...], dtype[float64]]) → ndarray[tuple[Any, ...], dtype[float64]]

Return negative absolute difference (larger = more similar).

setjoin.scorers.exact_match(a: ndarray[tuple[Any, ...], dtype[object_]], b: ndarray[tuple[Any, ...], dtype[object_]]) → ndarray[tuple[Any, ...], dtype[float64]]

Return 1.0 where values match exactly, 0.0 otherwise.

setjoin.scorers.jaro_winkler(a: ndarray[tuple[Any, ...], dtype[object_]], b: ndarray[tuple[Any, ...], dtype[object_]]) → ndarray[tuple[Any, ...], dtype[float64]]

Compute Jaro-Winkler similarity for each pair.

setjoin.scorers.levenshtein(a: ndarray[tuple[Any, ...], dtype[object_]], b: ndarray[tuple[Any, ...], dtype[object_]]) → ndarray[tuple[Any, ...], dtype[float64]]

Compute Levenshtein similarity ratio for each pair.
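
For readers unfamiliar with a Levenshtein similarity ratio, the sketch below shows one common definition for a single pair of strings: one minus the edit distance divided by the longer length. Whether setjoin normalizes in exactly this way is an assumption, and both function names are hypothetical.

```python
def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def levenshtein_ratio(a, b):
    """Similarity in [0, 1]; assumes normalization by the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


ratio = levenshtein_ratio("kitten", "sitting")
```

Unlike abs_diff, which returns non-positive values, a ratio of this kind is bounded between 0 and 1, which is worth keeping in mind when choosing field weights.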

setjoin.scorers.score_matrix(source: DataFrame, target: DataFrame, weights: dict[str, float], comparators: dict[str, str] | None = None) → ndarray[tuple[Any, ...], dtype[float64]]

Convenience function to compute a score matrix with simple configuration.

Parameters:

source – Source DataFrame
target – Target DataFrame
weights – Mapping from field name to weight
comparators – Optional mapping from field name to comparator name

Returns:

Score matrix of shape (len(source), len(target))

Return type:

ScoreMatrix

Hierarchy

Hierarchy specification and decomposition for structure-aware matching.

class setjoin.hierarchy.HierarchySpec(source_groups: dict[int, Sequence[int]], target_groups: dict[int, Sequence[int]])

Bases: object

Specification of hierarchical structure for record groups.

This defines how records are grouped (e.g., persons within households, students within schools, items within orders) for structure-preserving matching.

classmethod from_dataframe(source: DataFrame, target: DataFrame, source_group_col: str, target_group_col: str) → HierarchySpec

Create HierarchySpec from DataFrames with group columns.

Parameters:

source – Source DataFrame
target – Target DataFrame
source_group_col – Column name containing group IDs in source
target_group_col – Column name containing group IDs in target

Returns:

Group mappings

Return type:

HierarchySpec
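
The group mappings held by a HierarchySpec have the form `dict[int, Sequence[int]]`, from group ID to record indices. The sketch below shows how such a mapping could be derived from a column of group IDs. It assumes record indices are positional row numbers, which would line up with the rows of the score matrix; that is not confirmed by the reference, and `groups_from_column` is a hypothetical helper.

```python
from collections import defaultdict


def groups_from_column(group_ids):
    """Map each group ID to the positions of the records that carry it."""
    groups = defaultdict(list)
    for position, group_id in enumerate(group_ids):
        groups[group_id].append(position)
    return dict(groups)


# One group ID per record, in row order (e.g. a household ID column).
source_groups = groups_from_column([101, 101, 102, 103, 102])
```

A dict of this shape can be passed directly as `source_groups` or `target_groups` to the HierarchySpec constructor shown above.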

classmethod from_groupby(source_groupby: Any, target_groupby: Any) → HierarchySpec

Create HierarchySpec from pandas GroupBy objects.

Parameters:

source_groupby – GroupBy object for source records
target_groupby – GroupBy object for target records

Returns:

Group mappings

Return type:

HierarchySpec

get_source_group(group_id: int) → GroupSpec

Get specification for a source group.

get_target_group(group_id: int) → GroupSpec

Get specification for a target group.

property n_source_groups: int

Number of groups in source.

property n_target_groups: int

Number of groups in target.

property source_group_ids: list[int]

List of source group IDs.

source_group_sizes() → dict[int, int]

Get sizes of all source groups.

source_groups: dict[int, Sequence[int]]

Mapping from source group ID to record indices.

property target_group_ids: list[int]

List of target group IDs.

target_group_sizes() → dict[int, int]

Get sizes of all target groups.

target_groups: dict[int, Sequence[int]]

Mapping from target group ID to record indices.

setjoin.hierarchy.compute_group_score_matrix(hierarchy: HierarchySpec, record_scores: ndarray[tuple[Any, ...], dtype[float64]]) → tuple[ndarray[tuple[Any, ...], dtype[float64]], dict[tuple[int, int], list[tuple[int, int]]]]

Compute score matrix at the group level with optimal within-group assignments.

For each pair of (source_group, target_group), computes the optimal assignment of records within those groups and returns the total score.

Parameters:

hierarchy – HierarchySpec defining group structure
record_scores – Pairwise record score matrix (shape: n_source_records x n_target_records)

Returns:

Group score matrix and within-group match assignments

Return type:

tuple[NDArray[np.float64], dict[tuple[int, int], list[tuple[int, int]]]]

setjoin.hierarchy.decompose_by_size(hierarchy: HierarchySpec) → dict[tuple[int, int], tuple[list[int], list[int]]]

Decompose groups by size for efficient matching.

Groups source and target groups by their sizes to enable size-aware matching strategies.

Parameters:

hierarchy – HierarchySpec defining group structure

Returns:

Mapping from size pairs to lists of group IDs

Return type:

dict[tuple[int, int], tuple[list[int], list[int]]]
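
One plausible reading of the return type is sketched below: keys are (source size, target size) pairs, and values are the source and target group IDs with those sizes. This interpretation, the inclusion of every size combination, and the name `decompose_by_size_sketch` are all assumptions; plain dicts stand in for HierarchySpec.

```python
from collections import defaultdict


def bucket_by_size(groups):
    """Map each group size to the IDs of the groups that have it."""
    by_size = defaultdict(list)
    for group_id, members in groups.items():
        by_size[len(members)].append(group_id)
    return by_size


def decompose_by_size_sketch(source_groups, target_groups):
    """Pair every source size bucket with every target size bucket."""
    source_by_size = bucket_by_size(source_groups)
    target_by_size = bucket_by_size(target_groups)
    return {
        (s_size, t_size): (s_ids, t_ids)
        for s_size, s_ids in source_by_size.items()
        for t_size, t_ids in target_by_size.items()
    }


source_groups = {0: [0, 1], 1: [2], 2: [3, 4]}
target_groups = {10: [0, 1], 11: [2]}
buckets = decompose_by_size_sketch(source_groups, target_groups)
```

A size-aware strategy could then, for example, try equal-size buckets such as `buckets[(2, 2)]` before considering mismatched sizes.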

Diagnostics

Diagnostic metrics and report generation for match evaluation.

class setjoin.diagnostics.MatchReport(result: MatchResult, scores: ndarray[tuple[Any, ...], dtype[float64]], ground_truth: list[tuple[int, int]] | None = None, source_groups: dict[int, list[int]] | None = None, target_groups: dict[int, list[int]] | None = None, source_df: DataFrame | None = None, target_df: DataFrame | None = None, _cache: dict[str, object] = <factory>)

Bases: object

Comprehensive diagnostics for evaluating match quality.

error_anatomy() → DataFrame | None

Analyze errors by comparing to ground truth.

Returns DataFrame with columns:
- source_idx: source record
- matched_target_idx: where we matched it
- true_target_idx: where it should have matched
- score_diff: how much score we lost
- error_type: 'swap' or 'mismatch'

ground_truth: list[tuple[int, int]] | None = None

Optional ground truth matches for accuracy computation.

property group_accuracy: float | None

Fraction of records matched to the correct group (if groups and truth provided).

property group_exact_match_rate: float | None

Fraction of source groups where ALL members matched to the same target group.

group_margins() → DataFrame | None

Compute per-group score margins.

Returns DataFrame with columns:
- source_group: source group ID
- target_group: matched target group ID
- within_score: total score of within-group matches
- best_alternative_score: best score with a different target group
- margin: difference between within_score and best_alternative_score

match_confidence() → DataFrame

Compute per-match confidence metrics.

Returns DataFrame with columns:
- source_idx, target_idx: the match
- score: score of this match
- rank: rank of this match among alternatives for source
- margin: difference between this score and second-best alternative
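
The rank and margin columns can be pictured with the plain-Python sketch below. It assumes that rank counts how many targets score higher for the same source record, and that margin is the matched score minus the best other target's score. Both definitions, and the name `match_confidence_sketch`, are assumptions; the result is a list of dicts rather than a DataFrame.

```python
def match_confidence_sketch(scores, matches):
    """Compute score, rank, and margin for each matched pair."""
    rows = []
    for i, j in matches:
        score = scores[i][j]
        alternatives = [s for k, s in enumerate(scores[i]) if k != j]
        # Rank 1 means no other target scores higher for this source.
        rank = 1 + sum(1 for s in alternatives if s > score)
        # Margin to the best alternative; negative if a better one exists.
        margin = score - max(alternatives) if alternatives else float("inf")
        rows.append({
            "source_idx": i,
            "target_idx": j,
            "score": score,
            "rank": rank,
            "margin": margin,
        })
    return rows


scores = [
    [0.90, 0.80],
    [0.85, 0.10],
]
rows = match_confidence_sketch(scores, [(0, 1), (1, 0)])
```

A small or negative margin flags a match that was close to an alternative, for example one that a global assignment chose over the source record's own best target.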

method_comparison(other: MatchReport) → DataFrame

Compare this result with another method’s result.

Returns DataFrame showing where the methods differ.

property n_matches: int

Number of matches.

property record_accuracy: float | None

Fraction of matches that are correct (if ground truth provided).

result: MatchResult

The match result being evaluated.

property score_sacrifice: float | None

Total score lost compared to unconstrained Hungarian matching.

A positive value means structure-aware matching sacrificed score for coherence.

scores: ndarray[tuple[Any, ...], dtype[float64]]

The score matrix used for matching.

source_df: DataFrame | None = None

Optional source DataFrame for detailed analysis.

source_groups: dict[int, list[int]] | None = None

Optional source group structure for coherence metrics.

summary() → dict[str, float | int | None]

Return a dictionary of summary metrics.

target_df: DataFrame | None = None

Optional target DataFrame for detailed analysis.

target_groups: dict[int, list[int]] | None = None

Optional target group structure for coherence metrics.

to_csv(output_dir: str | Path) → None

Write all diagnostic artifacts to CSV files.

Creates:
- match_confidence.csv
- group_margins.csv (if groups provided)
- error_anatomy.csv (if ground truth provided)

setjoin.diagnostics.evaluate_matches(source: DataFrame, target: DataFrame, matches: list[tuple[int, int]], source_id_col: str = 'latent_person_id', target_id_col: str = 'latent_person_id', source_group_col: str | None = None, target_group_col: str | None = None) → dict[str, float]

Evaluate linkage quality with standard metrics.

Parameters:

source – Source DataFrame
target – Target DataFrame
matches – List of (source_idx, target_idx) pairs
source_id_col – Column in source containing true record ID
target_id_col – Column in target containing true record ID
source_group_col – Optional column for group coherence metrics
target_group_col – Optional column for group coherence metrics

Returns:

Evaluation metrics

Return type:

dict[str, float]
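
The reference does not list which metrics are returned. As an illustration of the kind of computation involved, the sketch below derives precision and recall from true record IDs. The metric names, the function name `evaluate_sketch`, and the assumption that IDs are unique within each side are all hypothetical; ID columns are passed as plain lists.

```python
def evaluate_sketch(source_ids, target_ids, matches):
    """Compute precision and recall of matches against true record IDs."""
    # A match is correct when both records carry the same true ID.
    correct = sum(1 for i, j in matches if source_ids[i] == target_ids[j])
    # Records that could be linked at all: IDs present on both sides.
    linkable = len(set(source_ids) & set(target_ids))
    precision = correct / len(matches) if matches else 0.0
    recall = correct / linkable if linkable else 0.0
    return {"precision": precision, "recall": recall}


source_ids = ["a", "b", "c"]
target_ids = ["b", "a", "d"]
metrics = evaluate_sketch(source_ids, target_ids, [(0, 1), (1, 2)])
```

Here one of the two proposed matches is correct, and one of the two linkable records was found, so both metrics come out at one half.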

Types

Type definitions and protocols for setjoin.

class setjoin.types.Comparator(*args, **kwargs)

Bases: Protocol

Protocol for string/value comparison functions.

class setjoin.types.FieldConfig(weight: float = 1.0, comparator: str = 'exact', missing_value: float = 0.0)

Bases: object

Configuration for a single field in scoring.

comparator: str

Name of comparison function: exact, abs_diff, levenshtein, jaro_winkler.

missing_value: float

Score to use when either value is missing.

weight: float

Weight applied to this field's similarity score.

class setjoin.types.GroupSpec(group_id: int, indices: Sequence[int])

Bases: object

Specification of a group of records.

group_id: int

Identifier for this group.

indices: Sequence[int]

Record indices belonging to this group.

class setjoin.types.MatchResult(matches: list[tuple[int, int]], total_score: float, method: str, group_assignments: dict[int, int] | None = None, metadata: dict[str, object] = <factory>)

Bases: object

Result of a matching operation.

group_assignments: dict[int, int] | None

Mapping from source group ID to target group ID (structure-aware).

matches: list[tuple[int, int]]

List of (source_idx, target_idx) pairs.

metadata: dict[str, object]

Additional metadata about the matching process.

method: str

Name of the matching method used.

to_dataframe() → DataFrame

Convert matches to a DataFrame.

total_score: float

Sum of scores for all matched pairs.

setjoin.types.ScoreMatrix

Type alias for score matrices (n_source x n_target).

alias of ndarray[tuple[Any, ...], dtype[float64]]

Visualization

Visualization functions for match diagnostics (requires matplotlib).

setjoin.plots.plot_accuracy_by_ambiguity(data: NDArray[np.floating[Any]] | list[dict[str, float]], methods: list[str], ambiguity_values: list[float], metric: str = 'record_accuracy', ax: Axes | None = None) → Figure

Plot accuracy metric vs ambiguity level for multiple methods.

Parameters:

data – DataFrame or list of dicts with columns [method, ambiguity, <metric>]
methods – List of method names to plot
ambiguity_values – List of ambiguity values
metric – Which metric to plot
ax – Optional matplotlib Axes

Returns:

Matplotlib figure with accuracy plot

Return type:

Figure

setjoin.plots.plot_confidence_distribution(report: MatchReport, ax: Axes | None = None) → Figure

Plot distribution of match confidence (margin to second-best).

Parameters:

report – MatchReport to analyze
ax – Optional matplotlib Axes

Returns:

Matplotlib figure with distribution

Return type:

Figure

setjoin.plots.plot_match_comparison(report1: MatchReport, report2: MatchReport, label1: str = 'Method 1', label2: str = 'Method 2', ax: Axes | None = None) → Figure

Compare scores of matches between two methods.

Parameters:

report1 – First method’s report
report2 – Second method’s report
label1 – Label for first method
label2 – Label for second method
ax – Optional matplotlib Axes

Returns:

Matplotlib figure with comparison

Return type:

Figure

setjoin.plots.plot_method_comparison_bar(results: dict[str, MatchResult], metric: str = 'total_score', ax: Axes | None = None) → Figure

Bar chart comparing a metric across methods.

Parameters:

results – Dictionary of method name to MatchResult
metric – Which metric to plot ("total_score" or "n_matches")
ax – Optional matplotlib Axes

Returns:

Matplotlib figure with bar chart

Return type:

Figure

Raises:

ValueError – If metric is unknown

setjoin.plots.plot_score_heatmap(scores: ndarray[tuple[Any, ...], dtype[float64]], matches: list[tuple[int, int]] | None = None, ax: Axes | None = None, cmap: str = 'viridis', mark_matches: bool = True) → Figure

Plot score matrix as a heatmap with optional match overlay.

Parameters:

scores – Score matrix (n_source x n_target)
matches – Optional list of matches to highlight
ax – Optional matplotlib Axes to plot on
cmap – Colormap name
mark_matches – Whether to mark matched pairs

Returns:

Matplotlib figure with heatmap

Return type:

Figure