API Reference

Matching

Matching algorithms: greedy, Hungarian, and structure-aware.

setjoin.matchers.compare_methods(scores: ndarray[tuple[Any, ...], dtype[float64]], hierarchy: HierarchySpec | None = None, methods: list[str] | None = None) -> dict[str, MatchResult]

Run multiple matching methods and return results for comparison.

Parameters:
  • scores – Pairwise score matrix (n_source x n_target)

  • hierarchy – Required if “structure_aware” is in methods

  • methods – List of methods to run (default: all available)

Returns:

Method name to result mapping

Return type:

dict[str, MatchResult]

setjoin.matchers.greedy_match(scores: ndarray[tuple[Any, ...], dtype[float64]], show_progress: bool = False) -> MatchResult

Match records greedily by selecting highest-scoring pairs first.

Parameters:
  • scores – Pairwise score matrix (n_source x n_target)

  • show_progress – Whether to show progress bar

Returns:

Greedy matches

Return type:

MatchResult
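The greedy strategy can be illustrated without the library. The sketch below is a plain-Python stand-in, not the setjoin implementation: the name greedy_pairs and the example matrix are hypothetical, and breaking ties by lowest index is an assumption.

```python
def greedy_pairs(scores):
    """Repeatedly take the highest-scoring pair whose source and target are both unused."""
    # Flatten the matrix into (score, source_idx, target_idx) candidates.
    candidates = [
        (score, i, j)
        for i, row in enumerate(scores)
        for j, score in enumerate(row)
    ]
    # Highest score first; ties broken by lowest indices (an assumption).
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    used_sources, used_targets, matches = set(), set(), []
    for _score, i, j in candidates:
        if i not in used_sources and j not in used_targets:
            matches.append((i, j))
            used_sources.add(i)
            used_targets.add(j)
    return sorted(matches)


# Hypothetical 3 x 3 score matrix (rows: source records, columns: target records).
scores = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.30],
    [0.10, 0.40, 0.50],
]
matches = greedy_pairs(scores)
total = sum(scores[i][j] for i, j in matches)
print(matches, round(total, 2))
```

Because the pair (0, 0) is taken first, source 1 loses its best target and ends up with a low-scoring pair, which is the usual weakness of greedy selection.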

setjoin.matchers.hungarian_match(scores: ndarray[tuple[Any, ...], dtype[float64]]) -> MatchResult

Match records using Hungarian algorithm to maximize total score.

Parameters:
  • scores – Pairwise score matrix (n_source x n_target)

Returns:

Optimal global assignment

Return type:

MatchResult
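To make "optimal global assignment" concrete, the following sketch finds the one-to-one assignment with the maximum total score by exhaustive search. It is a brute-force stand-in for illustration only: it is not the Hungarian algorithm and not the library's code, best_assignment is a hypothetical name, and the search is feasible only for tiny matrices.

```python
from itertools import permutations


def best_assignment(scores):
    """Return (matches, total) for the maximum-total one-to-one assignment."""
    n_source, n_target = len(scores), len(scores[0])
    if n_source > n_target:
        raise ValueError("this sketch assumes n_source <= n_target")

    best_total, best_matches = float("-inf"), []
    # Try every way of giving each source record a distinct target record.
    for targets in permutations(range(n_target), n_source):
        total = sum(scores[i][j] for i, j in enumerate(targets))
        if total > best_total:
            best_total, best_matches = total, list(enumerate(targets))
    return best_matches, best_total


# Hypothetical 3 x 3 score matrix (rows: source records, columns: target records).
scores = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.30],
    [0.10, 0.40, 0.50],
]
matches, total = best_assignment(scores)
print(matches, round(total, 2))
```

The single highest score, (0, 0), is not part of the optimum here: giving it up lets source 1 take target 0, which raises the total. For realistic sizes a polynomial-time solver such as scipy.optimize.linear_sum_assignment with maximize=True computes the same kind of assignment; whether setjoin uses it internally is not stated in this reference.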

setjoin.matchers.match(scores: ndarray[tuple[Any, ...], dtype[float64]], method: str = 'hungarian', hierarchy: HierarchySpec | None = None, show_progress: bool = False) -> MatchResult

Match records using the specified method.

Parameters:
  • scores – Pairwise score matrix (n_source x n_target)

  • method – Matching method - “greedy”, “hungarian”, or “structure_aware”

  • hierarchy – Required for structure_aware method

  • show_progress – Whether to show progress bar

Returns:

Matches using specified method

Return type:

MatchResult

Raises:

ValueError – If method is unknown, or if hierarchy is missing when method is “structure_aware”
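The two error conditions can be sketched as a small validation step. This is a hypothetical stand-in that mirrors only the documented behaviour; validate_request, KNOWN_METHODS, and the error messages are not part of setjoin.

```python
KNOWN_METHODS = ("greedy", "hungarian", "structure_aware")


def validate_request(method="hungarian", hierarchy=None):
    """Raise ValueError under the documented conditions; otherwise return the method name."""
    if method not in KNOWN_METHODS:
        raise ValueError(f"unknown method {method!r}; expected one of {KNOWN_METHODS}")
    if method == "structure_aware" and hierarchy is None:
        raise ValueError("hierarchy is required when method='structure_aware'")
    return method


def raises_value_error(**kwargs):
    """Helper for the demonstration: report whether validation fails."""
    try:
        validate_request(**kwargs)
    except ValueError:
        return True
    return False


print(validate_request())                            # default method passes
print(raises_value_error(method="nearest"))          # unknown method
print(raises_value_error(method="structure_aware"))  # hierarchy missing
```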

setjoin.matchers.structure_aware_match(scores: ndarray[tuple[Any, ...], dtype[float64]], hierarchy: HierarchySpec, show_progress: bool = False) -> MatchResult

Match records while preserving group structure.

Two-level assignment: first assign groups optimally, then assign records within matched groups. This ensures all members of a source group map to the same target group.

Parameters:
  • scores – Pairwise record score matrix (n_source x n_target)

  • hierarchy – HierarchySpec defining group structure

  • show_progress – Whether to show progress bar

Returns:

Structure-preserving matches

Return type:

MatchResult
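The two-level assignment described above can be sketched in plain Python. This is an illustrative stand-in under stated assumptions, not the library's implementation: both levels use exhaustive search instead of a proper assignment solver, the names best_within and two_level_match are hypothetical, and the sketch assumes there are no more source groups than target groups.

```python
from itertools import permutations


def best_within(scores, s_idx, t_idx):
    """Best one-to-one assignment between the records of one source and one target group."""
    if len(s_idx) <= len(t_idx):
        candidates = (list(zip(s_idx, p)) for p in permutations(t_idx, len(s_idx)))
    else:
        candidates = (list(zip(p, t_idx)) for p in permutations(s_idx, len(t_idx)))
    best_total, best_pairs = float("-inf"), []
    for pairs in candidates:
        total = sum(scores[i][j] for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_total, best_pairs


def two_level_match(scores, source_groups, target_groups):
    # Score every (source group, target group) pair by its best within-group assignment.
    within = {
        (sg, tg): best_within(scores, s_idx, t_idx)
        for sg, s_idx in source_groups.items()
        for tg, t_idx in target_groups.items()
    }
    s_ids, t_ids = list(source_groups), list(target_groups)
    # Level 1: assign groups to maximize the sum of group-pair scores.
    best_total, best_groups = float("-inf"), {}
    for chosen in permutations(t_ids, len(s_ids)):
        total = sum(within[(sg, tg)][0] for sg, tg in zip(s_ids, chosen))
        if total > best_total:
            best_total, best_groups = total, dict(zip(s_ids, chosen))
    # Level 2: keep the record pairs of the chosen group pairs only.
    matches = [pair for sg, tg in best_groups.items() for pair in within[(sg, tg)][1]]
    return sorted(matches), best_groups, best_total


# Hypothetical data: two source groups and two target groups of two records each.
scores = [
    [0.9, 0.1, 0.2, 0.1],
    [0.1, 0.3, 0.8, 0.1],  # source 1 scores best against target 2, in the other group
    [0.1, 0.7, 0.2, 0.6],
    [0.2, 0.1, 0.3, 0.9],
]
source_groups = {0: [0, 1], 1: [2, 3]}
target_groups = {10: [0, 1], 11: [2, 3]}
matches, group_assignments, total = two_level_match(scores, source_groups, target_groups)
print(matches, group_assignments, round(total, 2))
```

In this example source record 1 is kept with its group and matched to target 1, even though it scores higher against target 2. The total is lower than an unconstrained assignment would reach, which is the trade-off that structure preservation makes.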

Scoring

Configurable scoring functions for record comparison.

class setjoin.scorers.Scorer(config: dict[str, dict[str, float | str] | FieldConfig])

Bases: object

Configurable scorer for computing pairwise similarity between records.

Parameters:
  • config – Mapping from field name to configuration. Each value can be a FieldConfig or a dict with keys:

      • weight – float (default 1.0)

      • comparator – str (default “exact”)

      • missing_value – float (default 0.0)
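A configuration in the dict form might look like the sketch below. The field names are hypothetical, and the defaults-merging step only illustrates how omitted keys would resolve to the documented defaults; it is not the library's code.

```python
# Hypothetical field names; keys and default values are taken from the description above.
config = {
    "first_name": {"weight": 2.0, "comparator": "jaro_winkler"},
    "birth_year": {"weight": 1.0, "comparator": "abs_diff", "missing_value": -1.0},
    "postcode": {"weight": 0.5},  # comparator and missing_value left to their defaults
}

# Illustration only: how omitted keys resolve to the documented defaults.
DEFAULTS = {"weight": 1.0, "comparator": "exact", "missing_value": 0.0}
resolved = {field: {**DEFAULTS, **options} for field, options in config.items()}

for field, options in resolved.items():
    print(field, options)
```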

score(source: DataFrame, target: DataFrame, source_suffix: str = '', target_suffix: str = '') -> ndarray[tuple[Any, ...], dtype[float64]]

Compute pairwise score matrix between source and target records.

Parameters:
  • source – DataFrame with source records

  • target – DataFrame with target records

  • source_suffix – Suffix to append to field names for source columns

  • target_suffix – Suffix to append to field names for target columns

Returns:

Score matrix of shape (len(source), len(target))

Return type:

ScoreMatrix

Raises:

ValueError – If a field is not found or a comparator is unknown
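One plausible reading of the scoring step is a weighted sum of per-field similarities. The sketch below assumes exactly that, and also assumes that missing_value replaces the similarity before weighting; neither detail is confirmed by this reference. Records are plain dicts standing in for DataFrame rows, and score_records and COMPARATORS are hypothetical names.

```python
def exact(a, b):
    return 1.0 if a == b else 0.0


def neg_abs_diff(a, b):
    return -abs(a - b)


COMPARATORS = {"exact": exact, "abs_diff": neg_abs_diff}


def score_records(source, target, config):
    """Return a len(source) x len(target) matrix of weighted similarity sums."""
    matrix = []
    for s in source:
        row = []
        for t in target:
            total = 0.0
            for field, options in config.items():
                a, b = s[field], t[field]
                if a is None or b is None:
                    # Assumption: the missing-value score stands in for the similarity.
                    similarity = options.get("missing_value", 0.0)
                else:
                    name = options.get("comparator", "exact")
                    if name not in COMPARATORS:
                        raise ValueError(f"unknown comparator {name!r}")
                    similarity = COMPARATORS[name](a, b)
                total += options.get("weight", 1.0) * similarity
            row.append(total)
        matrix.append(row)
    return matrix


# Hypothetical records and configuration.
source = [{"name": "ann", "age": 30}, {"name": "bob", "age": None}]
target = [{"name": "ann", "age": 32}, {"name": "bob", "age": 41}, {"name": "cy", "age": 30}]
config = {"name": {"weight": 2.0}, "age": {"weight": 0.1, "comparator": "abs_diff"}}

score_grid = score_records(source, target, config)
for row in score_grid:
    print([round(value, 2) for value in row])
```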

setjoin.scorers.abs_diff(a: ndarray[tuple[Any, ...], dtype[float64]], b: ndarray[tuple[Any, ...], dtype[float64]]) -> ndarray[tuple[Any, ...], dtype[float64]]

Return negative absolute difference (larger = more similar).

setjoin.scorers.exact_match(a: ndarray[tuple[Any, ...], dtype[object_]], b: ndarray[tuple[Any, ...], dtype[object_]]) -> ndarray[tuple[Any, ...], dtype[float64]]

Return 1.0 where values match exactly, 0.0 otherwise.

setjoin.scorers.jaro_winkler(a: ndarray[tuple[Any, ...], dtype[object_]], b: ndarray[tuple[Any, ...], dtype[object_]]) -> ndarray[tuple[Any, ...], dtype[float64]]

Compute Jaro-Winkler similarity for each pair.

setjoin.scorers.levenshtein(a: ndarray[tuple[Any, ...], dtype[object_]], b: ndarray[tuple[Any, ...], dtype[object_]]) -> ndarray[tuple[Any, ...], dtype[float64]]

Compute Levenshtein similarity ratio for each pair.
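The element-wise behaviour of three of these comparators can be sketched on plain lists. The functions below are hypothetical stand-ins, not the library's code. In particular, the Levenshtein ratio is assumed here to be 1 - distance / max(len(a), len(b)); other definitions of the ratio exist, and this reference does not say which one setjoin uses.

```python
def abs_diff_pairs(a, b):
    """Negative absolute difference per pair (larger = more similar)."""
    return [-abs(x - y) for x, y in zip(a, b)]


def exact_match_pairs(a, b):
    """1.0 where values match exactly, 0.0 otherwise."""
    return [1.0 if x == y else 0.0 for x, y in zip(a, b)]


def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_ratio_pairs(a, b):
    """Assumed definition: 1 - distance / length of the longer string."""
    out = []
    for x, y in zip(a, b):
        longest = max(len(x), len(y))
        out.append(1.0 if longest == 0 else 1.0 - levenshtein_distance(x, y) / longest)
    return out


print(abs_diff_pairs([30.0, 41.0], [32.0, 41.0]))
print(exact_match_pairs(["ann", "bob"], ["ann", "rob"]))
print(levenshtein_ratio_pairs(["kitten", "ann"], ["sitting", "ann"]))
```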

setjoin.scorers.score_matrix(source: DataFrame, target: DataFrame, weights: dict[str, float], comparators: dict[str, str] | None = None) -> ndarray[tuple[Any, ...], dtype[float64]]

Convenience function to compute a score matrix with simple configuration.

Parameters:
  • source – Source DataFrame

  • target – Target DataFrame

  • weights – Mapping from field name to weight

  • comparators – Optional mapping from field name to comparator name

Returns:

Score matrix of shape (len(source), len(target))

Return type:

ScoreMatrix

Hierarchy

Hierarchy specification and decomposition for structure-aware matching.

class setjoin.hierarchy.HierarchySpec(source_groups: dict[int, Sequence[int]], target_groups: dict[int, Sequence[int]])

Bases: object

Specification of hierarchical structure for record groups.

This defines how records are grouped (e.g., persons within households, students within schools, items within orders) for structure-preserving matching.

classmethod from_dataframe(source: DataFrame, target: DataFrame, source_group_col: str, target_group_col: str) -> HierarchySpec

Create HierarchySpec from DataFrames with group columns.

Parameters:
  • source – Source DataFrame

  • target – Target DataFrame

  • source_group_col – Column name containing group IDs in source

  • target_group_col – Column name containing group IDs in target

Returns:

Group mappings

Return type:

HierarchySpec
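The group mappings have the shape dict[int, Sequence[int]]. The sketch below shows how a column of group IDs could be turned into that shape. It assumes that record indices are positional row numbers, which this reference does not state, and groups_from_column is a hypothetical helper working on a plain list instead of a DataFrame column.

```python
from collections import defaultdict


def groups_from_column(group_ids):
    """Map each group ID to the positions of the rows that carry it."""
    groups = defaultdict(list)
    for position, group_id in enumerate(group_ids):
        groups[group_id].append(position)
    return dict(groups)


# Hypothetical household IDs, one entry per record.
source_household = [101, 101, 102, 102, 102]
target_household = [7, 8, 8, 7, 8]

source_groups = groups_from_column(source_household)
target_groups = groups_from_column(target_household)
source_sizes = {gid: len(idx) for gid, idx in source_groups.items()}

print(source_groups)
print(target_groups)
print(source_sizes)
```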

classmethod from_groupby(source_groupby: Any, target_groupby: Any) -> HierarchySpec

Create HierarchySpec from pandas GroupBy objects.

Parameters:
  • source_groupby – GroupBy object for source records

  • target_groupby – GroupBy object for target records

Returns:

Group mappings

Return type:

HierarchySpec

get_source_group(group_id: int) -> GroupSpec

Get specification for a source group.

get_target_group(group_id: int) -> GroupSpec

Get specification for a target group.

property n_source_groups: int

Number of groups in source.

property n_target_groups: int

Number of groups in target.

property source_group_ids: list[int]

List of source group IDs.

source_group_sizes() -> dict[int, int]

Get sizes of all source groups.

source_groups: dict[int, Sequence[int]]

Mapping from source group ID to record indices.

property target_group_ids: list[int]

List of target group IDs.

target_group_sizes() -> dict[int, int]

Get sizes of all target groups.

target_groups: dict[int, Sequence[int]]

Mapping from target group ID to record indices.

setjoin.hierarchy.compute_group_score_matrix(hierarchy: HierarchySpec, record_scores: ndarray[tuple[Any, ...], dtype[float64]]) -> tuple[ndarray[tuple[Any, ...], dtype[float64]], dict[tuple[int, int], list[tuple[int, int]]]]

Compute score matrix at the group level with optimal within-group assignments.

For each pair of (source_group, target_group), computes the optimal assignment of records within those groups and returns the total score.

Parameters:
  • hierarchy – HierarchySpec defining group structure

  • record_scores – Pairwise record score matrix (shape: n_source_records x n_target_records)

Returns:

Group score matrix and within-group match assignments

Return type:

tuple[NDArray[np.float64], dict[tuple[int, int], list[tuple[int, int]]]]

setjoin.hierarchy.decompose_by_size(hierarchy: HierarchySpec) -> dict[tuple[int, int], tuple[list[int], list[int]]]

Decompose groups by size for efficient matching.

Groups source and target groups by their sizes to enable size-aware matching strategies.

Parameters:
  • hierarchy – HierarchySpec defining group structure

Returns:

Size pairs to group ID lists

Return type:

dict[tuple[int, int], tuple[list[int], list[int]]]
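The return type suggests keys of the form (source_size, target_size) mapped to a pair of group-ID lists. The sketch below implements one plausible reading, in which every combination of an observed source size and an observed target size gets an entry. That interpretation is an assumption, and decompose_sizes is a hypothetical name.

```python
def decompose_sizes(source_groups, target_groups):
    """Bucket group IDs by group size, then pair every source size with every target size."""

    def by_size(groups):
        buckets = {}
        for group_id, indices in groups.items():
            buckets.setdefault(len(indices), []).append(group_id)
        return buckets

    source_by_size = by_size(source_groups)
    target_by_size = by_size(target_groups)
    return {
        (s_size, t_size): (s_ids, t_ids)
        for s_size, s_ids in source_by_size.items()
        for t_size, t_ids in target_by_size.items()
    }


# Hypothetical group structure.
source_groups = {0: [0, 1], 1: [2, 3], 2: [4]}
target_groups = {10: [0, 1], 11: [2], 12: [3]}

decomposition = decompose_sizes(source_groups, target_groups)
for size_pair, (s_ids, t_ids) in decomposition.items():
    print(size_pair, s_ids, t_ids)
```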

Diagnostics

Diagnostic metrics and report generation for match evaluation.

class setjoin.diagnostics.MatchReport(result: MatchResult, scores: ndarray[tuple[Any, ...], dtype[float64]], ground_truth: list[tuple[int, int]] | None = None, source_groups: dict[int, list[int]] | None = None, target_groups: dict[int, list[int]] | None = None, source_df: DataFrame | None = None, target_df: DataFrame | None = None, _cache: dict[str, object] = <factory>)

Bases: object

Comprehensive diagnostics for evaluating match quality.

error_anatomy() -> DataFrame | None

Analyze errors by comparing to ground truth.

Returns DataFrame with columns:
  • source_idx – source record

  • matched_target_idx – where we matched it

  • true_target_idx – where it should have matched

  • score_diff – how much score we lost

  • error_type – ‘swap’ or ‘mismatch’

ground_truth: list[tuple[int, int]] | None = None

Optional ground truth matches for accuracy computation.

property group_accuracy: float | None

Fraction of records matched to the correct group (if groups and ground truth are provided).

property group_exact_match_rate: float | None

Fraction of source groups where ALL members matched to the same target group.
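The coherence metric can be sketched as follows. This is a hypothetical stand-in, not the library's code, and it assumes that a source group with any unmatched member does not count as an exact match.

```python
def exact_group_rate(matches, source_groups, target_groups):
    """Fraction of source groups whose members all land in one target group."""
    # Look up the target group of each target record, and the match of each source record.
    target_group_of = {idx: gid for gid, idxs in target_groups.items() for idx in idxs}
    matched_target = dict(matches)

    coherent = 0
    for members in source_groups.values():
        if not all(i in matched_target for i in members):
            continue  # assumption: a group with unmatched members is not coherent
        landed = {target_group_of[matched_target[i]] for i in members}
        if len(landed) == 1:
            coherent += 1
    return coherent / len(source_groups)


# Hypothetical structure: group 0 stays together, group 1 is split across 11 and 12.
source_groups = {0: [0, 1], 1: [2, 3]}
target_groups = {10: [0, 1], 11: [2, 3], 12: [4]}
matches = [(0, 0), (1, 1), (2, 2), (3, 4)]

rate = exact_group_rate(matches, source_groups, target_groups)
print(rate)
```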

group_margins() -> DataFrame | None

Compute per-group score margins.

Returns DataFrame with columns:
  • source_group – source group ID

  • target_group – matched target group ID

  • within_score – total score of within-group matches

  • best_alternative_score – best score with a different target group

  • margin – difference between within_score and best_alternative_score

match_confidence() -> DataFrame

Compute per-match confidence metrics.

Returns DataFrame with columns:
  • source_idx, target_idx – the match

  • score – score of this match

  • rank – rank of this match among alternatives for source

  • margin – difference between this score and second-best alternative
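The rank and margin columns can be illustrated with a short sketch. It assumes that rank 1 means the matched target has the highest score in the source's row, and that margin is the matched score minus the best score among the other targets, so a negative margin means some other target scored higher. Both definitions are assumptions, the output is a list of dicts instead of a DataFrame, and confidence_rows is a hypothetical name.

```python
def confidence_rows(scores, matches):
    """Per-match score, rank within the source's row, and margin to the best other target."""
    rows = []
    for i, j in matches:
        score = scores[i][j]
        others = [s for k, s in enumerate(scores[i]) if k != j]
        rank = 1 + sum(1 for s in others if s > score)
        margin = score - max(others) if others else float("inf")
        rows.append(
            {"source_idx": i, "target_idx": j, "score": score, "rank": rank, "margin": margin}
        )
    return rows


# Hypothetical score matrix and matches.
scores = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.30],
    [0.10, 0.40, 0.50],
]
matches = [(0, 0), (1, 1), (2, 2)]

confidence = confidence_rows(scores, matches)
for row in confidence:
    print(row["source_idx"], row["target_idx"], row["rank"], round(row["margin"], 2))
```

A match such as (1, 1) above, with rank 3 and a negative margin, is the kind of low-confidence pair this table is meant to surface.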

method_comparison(other: MatchReport) -> DataFrame

Compare this result with another method’s result.

Returns DataFrame showing where the methods differ.

property n_matches: int

Number of matches.

property record_accuracy: float | None

Fraction of matches that are correct (if ground truth provided).

result: MatchResult

The match result being evaluated.

property score_sacrifice: float | None

Total score lost compared to unconstrained Hungarian matching.

Positive value means structure-aware matching sacrificed score for coherence.

scores: ndarray[tuple[Any, ...], dtype[float64]]

The score matrix used for matching.

source_df: DataFrame | None = None

Optional source DataFrame for detailed analysis.

source_groups: dict[int, list[int]] | None = None

Optional source group structure for coherence metrics.

summary() -> dict[str, float | int | None]

Return a dictionary of summary metrics.

target_df: DataFrame | None = None

Optional target DataFrame for detailed analysis.

target_groups: dict[int, list[int]] | None = None

Optional target group structure for coherence metrics.

to_csv(output_dir: str | Path) -> None

Write all diagnostic artifacts to CSV files.

Creates:
  • match_confidence.csv

  • group_margins.csv (if groups provided)

  • error_anatomy.csv (if ground truth provided)

setjoin.diagnostics.evaluate_matches(source: DataFrame, target: DataFrame, matches: list[tuple[int, int]], source_id_col: str = 'latent_person_id', target_id_col: str = 'latent_person_id', source_group_col: str | None = None, target_group_col: str | None = None) -> dict[str, float]

Evaluate linkage quality with standard metrics.

Parameters:
  • source – Source DataFrame

  • target – Target DataFrame

  • matches – List of (source_idx, target_idx) pairs

  • source_id_col – Column in source containing true record ID

  • target_id_col – Column in target containing true record ID

  • source_group_col – Optional column for group coherence metrics

  • target_group_col – Optional column for group coherence metrics

Returns:

Evaluation metrics

Return type:

dict[str, float]
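This reference does not list which metrics are returned. As an illustration of what standard linkage metrics typically look like, the sketch below computes precision, recall, and F1 from true record IDs. It assumes IDs are unique within each side and that a match is correct when the two IDs agree; linkage_metrics and its result keys are hypothetical, and plain lists stand in for the ID columns.

```python
def linkage_metrics(source_ids, target_ids, matches):
    """Precision, recall and F1 of (source_idx, target_idx) matches against true IDs."""
    correct = sum(1 for i, j in matches if source_ids[i] == target_ids[j])
    # Records that could be linked at all: IDs present on both sides.
    linkable = len(set(source_ids) & set(target_ids))

    precision = correct / len(matches) if matches else 0.0
    recall = correct / linkable if linkable else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical true IDs per row; "d" and "x" have no counterpart on the other side.
source_ids = ["a", "b", "c", "d"]
target_ids = ["b", "a", "c", "x"]
matches = [(0, 1), (1, 0), (2, 3)]  # the last pair links "c" to "x", which is wrong

metrics = linkage_metrics(source_ids, target_ids, matches)
print({name: round(value, 3) for name, value in metrics.items()})
```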

Types

Type definitions and protocols for setjoin.

class setjoin.types.Comparator(*args, **kwargs)

Bases: Protocol

Protocol for string/value comparison functions.

class setjoin.types.FieldConfig(weight: float = 1.0, comparator: str = 'exact', missing_value: float = 0.0)

Bases: object

Configuration for a single field in scoring.

comparator: str

Name of comparison function: exact, abs_diff, levenshtein, jaro_winkler.

missing_value: float

Score to use when either value is missing.

weight: float

Weight applied to this field’s similarity score.

class setjoin.types.GroupSpec(group_id: int, indices: Sequence[int])

Bases: object

Specification of a group of records.

group_id: int

Identifier for this group.

indices: Sequence[int]

Record indices belonging to this group.

class setjoin.types.MatchResult(matches: list[tuple[int, int]], total_score: float, method: str, group_assignments: dict[int, int] | None = None, metadata: dict[str, object] = <factory>)

Bases: object

Result of a matching operation.

group_assignments: dict[int, int] | None

Mapping from source group ID to target group ID (structure-aware).

matches: list[tuple[int, int]]

List of (source_idx, target_idx) pairs.

metadata: dict[str, object]

Additional metadata about the matching process.

method: str

Name of the matching method used.

to_dataframe() -> DataFrame

Convert matches to a DataFrame.

total_score: float

Sum of scores for all matched pairs.

setjoin.types.ScoreMatrix

Type alias for score matrices (n_source x n_target).

alias of ndarray[tuple[Any, ...], dtype[float64]]

Visualization

Visualization functions for match diagnostics (requires matplotlib).

setjoin.plots.plot_accuracy_by_ambiguity(data: NDArray[np.floating[Any]] | list[dict[str, float]], methods: list[str], ambiguity_values: list[float], metric: str = 'record_accuracy', ax: Axes | None = None) -> Figure

Plot accuracy metric vs ambiguity level for multiple methods.

Parameters:
  • data – DataFrame or list of dicts with columns [method, ambiguity, <metric>]

  • methods – List of method names to plot

  • ambiguity_values – List of ambiguity values

  • metric – Which metric to plot

  • ax – Optional matplotlib Axes

Returns:

Matplotlib figure with accuracy plot

Return type:

Figure

setjoin.plots.plot_confidence_distribution(report: MatchReport, ax: Axes | None = None) -> Figure

Plot distribution of match confidence (margin to second-best).

Parameters:
  • report – MatchReport to analyze

  • ax – Optional matplotlib Axes

Returns:

Matplotlib figure with distribution

Return type:

Figure

setjoin.plots.plot_match_comparison(report1: MatchReport, report2: MatchReport, label1: str = 'Method 1', label2: str = 'Method 2', ax: Axes | None = None) -> Figure

Compare scores of matches between two methods.

Parameters:
  • report1 – First method’s report

  • report2 – Second method’s report

  • label1 – Label for first method

  • label2 – Label for second method

  • ax – Optional matplotlib Axes

Returns:

Matplotlib figure with comparison

Return type:

Figure

setjoin.plots.plot_method_comparison_bar(results: dict[str, MatchResult], metric: str = 'total_score', ax: Axes | None = None) -> Figure

Bar chart comparing a metric across methods.

Parameters:
  • results – Dictionary of method name to MatchResult

  • metric – Which metric to plot (“total_score” or “n_matches”)

  • ax – Optional matplotlib Axes

Returns:

Matplotlib figure with bar chart

Return type:

Figure

Raises:

ValueError – If metric is unknown

setjoin.plots.plot_score_heatmap(scores: ndarray[tuple[Any, ...], dtype[float64]], matches: list[tuple[int, int]] | None = None, ax: Axes | None = None, cmap: str = 'viridis', mark_matches: bool = True) -> Figure

Plot score matrix as a heatmap with optional match overlay.

Parameters:
  • scores – Score matrix (n_source x n_target)

  • matches – Optional list of matches to highlight

  • ax – Optional matplotlib Axes to plot on

  • cmap – Colormap name

  • mark_matches – Whether to mark matched pairs

Returns:

Matplotlib figure with heatmap

Return type:

Figure