API Reference¶
Matching¶
Matching algorithms: greedy, Hungarian, and structure-aware.
- setjoin.matchers.compare_methods(scores: ndarray[tuple[Any, ...], dtype[float64]], hierarchy: HierarchySpec | None = None, methods: list[str] | None = None) dict[str, MatchResult][source]¶
Run multiple matching methods and return results for comparison.
- Parameters:
scores – Pairwise score matrix (n_source x n_target)
hierarchy – Required if “structure_aware” is in methods
methods – List of methods to run (default: all available)
- Returns:
Method name to result mapping
- Return type:
dict[str, MatchResult]
- setjoin.matchers.greedy_match(scores: ndarray[tuple[Any, ...], dtype[float64]], show_progress: bool = False) MatchResult[source]¶
Match records greedily by selecting highest-scoring pairs first.
- Parameters:
scores – Pairwise score matrix (n_source x n_target)
show_progress – Whether to show progress bar
- Returns:
Greedy matches
- Return type:
MatchResult
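Greedy matching is fast but can be globally suboptimal, since an early high-scoring pick may block a better overall assignment. A minimal sketch of the idea (an illustrative re-implementation, not the library's `greedy_match` code; tie-breaking may differ):

```python
import numpy as np

def greedy_match_sketch(scores: np.ndarray) -> list[tuple[int, int]]:
    """Repeatedly take the highest-scoring pair whose row and column are unused."""
    matches = []
    used_rows: set[int] = set()
    used_cols: set[int] = set()
    # Visit all (row, col) pairs in descending score order.
    flat_order = np.argsort(scores, axis=None)[::-1]
    for i, j in zip(*np.unravel_index(flat_order, scores.shape)):
        if i not in used_rows and j not in used_cols:
            matches.append((int(i), int(j)))
            used_rows.add(int(i))
            used_cols.add(int(j))
    return matches

scores = np.array([[0.9, 0.8], [0.85, 0.1]])
# Greedy takes (0, 0) first, forcing (1, 1) with score 0.1: total 1.0,
# while the optimal assignment (0, 1) + (1, 0) would total 1.65.
print(greedy_match_sketch(scores))  # [(0, 0), (1, 1)]
```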
- setjoin.matchers.hungarian_match(scores: ndarray[tuple[Any, ...], dtype[float64]]) MatchResult[source]¶
Match records using Hungarian algorithm to maximize total score.
- Parameters:
scores – Pairwise score matrix (n_source x n_target)
- Returns:
Optimal global assignment
- Return type:
MatchResult
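The optimal global assignment that `hungarian_match` returns can be reproduced with SciPy's `linear_sum_assignment` (how the library computes it internally is an assumption; the result is the same optimum either way):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# The same score matrix where greedy matching is suboptimal.
scores = np.array([[0.9, 0.8], [0.85, 0.1]])

# linear_sum_assignment minimizes cost by default; maximize=True flips it
# to maximize total score.
rows, cols = linear_sum_assignment(scores, maximize=True)
matches = list(zip(rows.tolist(), cols.tolist()))
total = scores[rows, cols].sum()
print(matches, total)  # [(0, 1), (1, 0)] 1.65
```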
- setjoin.matchers.match(scores: ndarray[tuple[Any, ...], dtype[float64]], method: str = 'hungarian', hierarchy: HierarchySpec | None = None, show_progress: bool = False) MatchResult[source]¶
Match records using the specified method.
- Parameters:
scores – Pairwise score matrix (n_source x n_target)
method – Matching method - “greedy”, “hungarian”, or “structure_aware”
hierarchy – Required for structure_aware method
show_progress – Whether to show progress bar
- Returns:
Matches using specified method
- Return type:
MatchResult
- Raises:
ValueError – If method is unknown or hierarchy missing for structure_aware
- setjoin.matchers.structure_aware_match(scores: ndarray[tuple[Any, ...], dtype[float64]], hierarchy: HierarchySpec, show_progress: bool = False) MatchResult[source]¶
Match records while preserving group structure.
Two-level assignment: first assign groups optimally, then assign records within matched groups. This ensures all members of a source group map to the same target group.
- Parameters:
scores – Pairwise record score matrix (n_source x n_target)
hierarchy – HierarchySpec defining group structure
show_progress – Whether to show progress bar
- Returns:
Structure-preserving matches
- Return type:
MatchResult
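The two-level scheme can be sketched end to end: score every (source group, target group) pair by its optimal within-group assignment, assign groups with the Hungarian algorithm, then keep the stored within-group matches. This is an illustrative sketch, not the library's implementation; the group dictionaries below mirror the `HierarchySpec` mappings.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def structure_aware_sketch(scores, source_groups, target_groups):
    """Two-level assignment: groups first, then records within matched groups."""
    sg_ids, tg_ids = list(source_groups), list(target_groups)
    group_scores = np.zeros((len(sg_ids), len(tg_ids)))
    within = {}
    for a, sg in enumerate(sg_ids):
        for b, tg in enumerate(tg_ids):
            # Optimal record assignment restricted to this group pair.
            sub = scores[np.ix_(source_groups[sg], target_groups[tg])]
            r, c = linear_sum_assignment(sub, maximize=True)
            group_scores[a, b] = sub[r, c].sum()
            within[(sg, tg)] = [(source_groups[sg][i], target_groups[tg][j])
                                for i, j in zip(r, c)]
    # Optimal group-level assignment.
    gr, gc = linear_sum_assignment(group_scores, maximize=True)
    matches = []
    for a, b in zip(gr, gc):
        matches.extend(within[(sg_ids[a], tg_ids[b])])
    return matches

# Source groups {0, 1} and {2}; target groups {0} and {1, 2}.
scores = np.array([
    [0.9, 0.2, 0.3],
    [0.1, 0.8, 0.7],
    [0.4, 0.6, 0.9],
])
matches = structure_aware_sketch(scores, {0: [0, 1], 1: [2]}, {0: [0], 1: [1, 2]})
# Record 1 stays unmatched: its group mapped to a single-record target group,
# which is exactly the coherence constraint at work.
print(matches)  # [(0, 0), (2, 2)]
```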
Scoring¶
Configurable scoring functions for record comparison.
- class setjoin.scorers.Scorer(config: dict[str, dict[str, float | str] | FieldConfig])[source]¶
Bases: object
Configurable scorer for computing pairwise similarity between records.
- Parameters:
config – Mapping from field name to configuration. Each value can be a FieldConfig or a dict with keys:
- weight: float (default 1.0)
- comparator: str (default “exact”)
- missing_value: float (default 0.0)
- score(source: DataFrame, target: DataFrame, source_suffix: str = '', target_suffix: str = '') ndarray[tuple[Any, ...], dtype[float64]][source]¶
Compute pairwise score matrix between source and target records.
- Parameters:
source – DataFrame with source records
target – DataFrame with target records
source_suffix – Suffix to append to field names for source columns
target_suffix – Suffix to append to field names for target columns
- Returns:
Score matrix of shape (len(source), len(target))
- Return type:
NDArray[np.float64]
- Raises:
ValueError – If field not found or unknown comparator
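The weighted per-field scoring that a `Scorer` configured with the “exact” comparator performs can be reproduced by hand (an illustrative computation, not the library's code; field names and data are made up):

```python
import numpy as np
import pandas as pd

def exact_score_matrix(source, target, weights):
    """Sum of weight * exact-match indicator over configured fields."""
    total = np.zeros((len(source), len(target)))
    for field, w in weights.items():
        a = source[field].to_numpy()
        b = target[field].to_numpy()
        # Broadcast to an (n_source, n_target) equality matrix.
        total += w * (a[:, None] == b[None, :]).astype(float)
    return total

source = pd.DataFrame({"city": ["Oslo", "Bergen"], "zip": ["0150", "5003"]})
target = pd.DataFrame({"city": ["Oslo", "Bergen"], "zip": ["0151", "5003"]})
m = exact_score_matrix(source, target, {"city": 1.0, "zip": 0.5})
print(m)  # [[1.  0. ] [0.  1.5]]
```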
- setjoin.scorers.abs_diff(a: ndarray[tuple[Any, ...], dtype[float64]], b: ndarray[tuple[Any, ...], dtype[float64]]) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Return negative absolute difference (larger = more similar).
- setjoin.scorers.exact_match(a: ndarray[tuple[Any, ...], dtype[object_]], b: ndarray[tuple[Any, ...], dtype[object_]]) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Return 1.0 where values match exactly, 0.0 otherwise.
- setjoin.scorers.jaro_winkler(a: ndarray[tuple[Any, ...], dtype[object_]], b: ndarray[tuple[Any, ...], dtype[object_]]) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Compute Jaro-Winkler similarity for each pair.
- setjoin.scorers.levenshtein(a: ndarray[tuple[Any, ...], dtype[object_]], b: ndarray[tuple[Any, ...], dtype[object_]]) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Compute Levenshtein similarity ratio for each pair.
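One common definition of the Levenshtein similarity ratio is 1 − distance / max(len(a), len(b)); the library may normalize differently, so treat this as a sketch of the concept:

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity via the classic DP recurrence."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

print(levenshtein_similarity("smith", "smyth"))  # 0.8 (one substitution in 5 chars)
```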
- setjoin.scorers.score_matrix(source: DataFrame, target: DataFrame, weights: dict[str, float], comparators: dict[str, str] | None = None) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Convenience function to compute a score matrix with simple configuration.
- Parameters:
source – Source DataFrame
target – Target DataFrame
weights – Mapping from field name to weight
comparators – Optional mapping from field name to comparator name
- Returns:
Score matrix of shape (len(source), len(target))
- Return type:
NDArray[np.float64]
Hierarchy¶
Hierarchy specification and decomposition for structure-aware matching.
- class setjoin.hierarchy.HierarchySpec(source_groups: dict[int, Sequence[int]], target_groups: dict[int, Sequence[int]])[source]¶
Bases: object
Specification of hierarchical structure for record groups.
This defines how records are grouped (e.g., persons within households, students within schools, items within orders) for structure-preserving matching.
- classmethod from_dataframe(source: DataFrame, target: DataFrame, source_group_col: str, target_group_col: str) HierarchySpec[source]¶
Create HierarchySpec from DataFrames with group columns.
- Parameters:
source – Source DataFrame
target – Target DataFrame
source_group_col – Column name containing group IDs in source
target_group_col – Column name containing group IDs in target
- Returns:
Group mappings
- Return type:
HierarchySpec
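Conceptually, `from_dataframe` maps each group ID to the positional indices of its rows. The mapping can be built with a pandas groupby (a sketch of the idea; the actual internal representation may differ):

```python
import pandas as pd

df = pd.DataFrame({"person": ["a", "b", "c", "d"],
                   "household": [10, 10, 11, 11]})
# groupby(...).indices maps each group key to the positional row indices.
groups = {int(gid): idx.tolist()
          for gid, idx in df.groupby("household").indices.items()}
print(groups)  # {10: [0, 1], 11: [2, 3]}
```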
- classmethod from_groupby(source_groupby: Any, target_groupby: Any) HierarchySpec[source]¶
Create HierarchySpec from pandas GroupBy objects.
- Parameters:
source_groupby – GroupBy object for source records
target_groupby – GroupBy object for target records
- Returns:
Group mappings
- Return type:
HierarchySpec
- setjoin.hierarchy.compute_group_score_matrix(hierarchy: HierarchySpec, record_scores: ndarray[tuple[Any, ...], dtype[float64]]) tuple[ndarray[tuple[Any, ...], dtype[float64]], dict[tuple[int, int], list[tuple[int, int]]]][source]¶
Compute score matrix at the group level with optimal within-group assignments.
For each pair of (source_group, target_group), computes the optimal assignment of records within those groups and returns the total score.
- Parameters:
hierarchy – HierarchySpec defining group structure
record_scores – Pairwise record score matrix (shape: n_source_records x n_target_records)
- Returns:
Group score matrix and within-group match assignments
- Return type:
tuple[NDArray[np.float64], dict[tuple[int, int], list[tuple[int, int]]]]
Diagnostics¶
Diagnostic metrics and report generation for match evaluation.
- class setjoin.diagnostics.MatchReport(result: MatchResult, scores: ndarray[tuple[Any, ...], dtype[float64]], ground_truth: list[tuple[int, int]] | None = None, source_groups: dict[int, list[int]] | None = None, target_groups: dict[int, list[int]] | None = None, source_df: DataFrame | None = None, target_df: DataFrame | None = None, _cache: dict[str, object] = <factory>)[source]¶
Bases: object
Comprehensive diagnostics for evaluating match quality.
- error_anatomy() DataFrame | None[source]¶
Analyze errors by comparing to ground truth.
Returns DataFrame with columns:
- source_idx: source record
- matched_target_idx: where we matched it
- true_target_idx: where it should have matched
- score_diff: how much score we lost
- error_type: ‘swap’ or ‘mismatch’
- ground_truth: list[tuple[int, int]] | None = None¶
Optional ground truth matches for accuracy computation.
- property group_accuracy: float | None¶
Fraction of records matched to the correct group (if groups and truth provided).
- property group_exact_match_rate: float | None¶
Fraction of source groups where ALL members matched to the same target group.
- group_margins() DataFrame | None[source]¶
Compute per-group score margins.
Returns DataFrame with columns:
- source_group: source group ID
- target_group: matched target group ID
- within_score: total score of within-group matches
- best_alternative_score: best score with a different target group
- margin: difference between within_score and best_alternative_score
- match_confidence() DataFrame[source]¶
Compute per-match confidence metrics.
Returns DataFrame with columns:
- source_idx, target_idx: the match
- score: score of this match
- rank: rank of this match among alternatives for the source
- margin: difference between this score and the second-best alternative
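The margin column captures how close the runner-up was: a small margin means the match is fragile. An illustrative computation of that quantity (not the library's code):

```python
import numpy as np

scores = np.array([[0.9, 0.8, 0.1],
                   [0.2, 0.7, 0.65]])
matches = [(0, 0), (1, 1)]

margins = []
for i, j in matches:
    # Best alternative for this source row, excluding the matched column.
    runner_up = np.max(np.delete(scores[i], j))
    margins.append(round(float(scores[i, j] - runner_up), 2))
print(margins)  # [0.1, 0.05] -- the second match is far less confident
```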
- method_comparison(other: MatchReport) DataFrame[source]¶
Compare this result with another method’s result.
Returns DataFrame showing where the methods differ.
- property record_accuracy: float | None¶
Fraction of matches that are correct (if ground truth provided).
- result: MatchResult¶
The match result being evaluated.
- property score_sacrifice: float | None¶
Total score lost compared to unconstrained Hungarian matching.
Positive value means structure-aware matching sacrificed score for coherence.
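The sacrifice is simply the gap between the unconstrained Hungarian total and the structure-aware total. A small worked example (the structure-aware matches here are assumed for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

scores = np.array([[0.9, 0.8], [0.85, 0.1]])

# Unconstrained optimum: (0, 1) + (1, 0) = 1.65.
r, c = linear_sum_assignment(scores, maximize=True)
hungarian_total = scores[r, c].sum()

# Suppose structure-aware matching kept groups coherent by choosing
# (0, 0) and (1, 1) instead: total 1.0.
structure_total = scores[0, 0] + scores[1, 1]

sacrifice = round(float(hungarian_total - structure_total), 2)
print(sacrifice)  # 0.65
```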
- source_groups: dict[int, list[int]] | None = None¶
Optional source group structure for coherence metrics.
- setjoin.diagnostics.evaluate_matches(source: DataFrame, target: DataFrame, matches: list[tuple[int, int]], source_id_col: str = 'latent_person_id', target_id_col: str = 'latent_person_id', source_group_col: str | None = None, target_group_col: str | None = None) dict[str, float][source]¶
Evaluate linkage quality with standard metrics.
- Parameters:
source – Source DataFrame
target – Target DataFrame
matches – List of (source_idx, target_idx) pairs
source_id_col – Column in source containing true record ID
target_id_col – Column in target containing true record ID
source_group_col – Optional column for group coherence metrics
target_group_col – Optional column for group coherence metrics
- Returns:
Evaluation metrics
- Return type:
dict[str, float]
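The standard linkage metrics such a function reports are precision, recall, and F1 over predicted versus true match pairs (an illustrative computation; the exact metric keys returned by evaluate_matches are not shown above):

```python
# Predicted and true (source_idx, target_idx) pairs.
predicted = {(0, 0), (1, 2), (2, 1)}
truth = {(0, 0), (1, 1), (2, 1)}

tp = len(predicted & truth)            # true positives: 2
precision = tp / len(predicted)        # 2/3
recall = tp / len(truth)               # 2/3
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```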
Types¶
Type definitions and protocols for setjoin.
- class setjoin.types.Comparator(*args, **kwargs)[source]¶
Bases: Protocol
Protocol for string/value comparison functions.
- class setjoin.types.FieldConfig(weight: float = 1.0, comparator: str = 'exact', missing_value: float = 0.0)[source]¶
Bases: object
Configuration for a single field in scoring.
- class setjoin.types.GroupSpec(group_id: int, indices: Sequence[int])[source]¶
Bases: object
Specification of a group of records.
- class setjoin.types.MatchResult(matches: list[tuple[int, int]], total_score: float, method: str, group_assignments: dict[int, int] | None = None, metadata: dict[str, object] = <factory>)[source]¶
Bases: object
Result of a matching operation.
Visualization¶
Visualization functions for match diagnostics (requires matplotlib).
- setjoin.plots.plot_accuracy_by_ambiguity(data: NDArray[np.floating[Any]] | list[dict[str, float]], methods: list[str], ambiguity_values: list[float], metric: str = 'record_accuracy', ax: Axes | None = None) Figure[source]¶
Plot accuracy metric vs ambiguity level for multiple methods.
- Parameters:
data – Array or list of dicts with keys [method, ambiguity, <metric>]
methods – List of method names to plot
ambiguity_values – List of ambiguity values
metric – Which metric to plot
ax – Optional matplotlib Axes
- Returns:
Matplotlib figure with accuracy plot
- Return type:
Figure
- setjoin.plots.plot_confidence_distribution(report: MatchReport, ax: Axes | None = None) Figure[source]¶
Plot distribution of match confidence (margin to second-best).
- Parameters:
report – MatchReport to analyze
ax – Optional matplotlib Axes
- Returns:
Matplotlib figure with distribution
- Return type:
Figure
- setjoin.plots.plot_match_comparison(report1: MatchReport, report2: MatchReport, label1: str = 'Method 1', label2: str = 'Method 2', ax: Axes | None = None) Figure[source]¶
Compare scores of matches between two methods.
- Parameters:
report1 – First method’s report
report2 – Second method’s report
label1 – Label for first method
label2 – Label for second method
ax – Optional matplotlib Axes
- Returns:
Matplotlib figure with comparison
- Return type:
Figure
- setjoin.plots.plot_method_comparison_bar(results: dict[str, MatchResult], metric: str = 'total_score', ax: Axes | None = None) Figure[source]¶
Bar chart comparing a metric across methods.
- Parameters:
results – Dictionary of method name to MatchResult
metric – Which metric to plot (“total_score” or “n_matches”)
ax – Optional matplotlib Axes
- Returns:
Matplotlib figure with bar chart
- Return type:
Figure
- Raises:
ValueError – If metric is unknown
- setjoin.plots.plot_score_heatmap(scores: ndarray[tuple[Any, ...], dtype[float64]], matches: list[tuple[int, int]] | None = None, ax: Axes | None = None, cmap: str = 'viridis', mark_matches: bool = True) Figure[source]¶
Plot score matrix as a heatmap with optional match overlay.
- Parameters:
scores – Score matrix (n_source x n_target)
matches – Optional list of matches to highlight
ax – Optional matplotlib Axes to plot on
cmap – Colormap name
mark_matches – Whether to mark matched pairs
- Returns:
Matplotlib figure with heatmap
- Return type:
Figure
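A score heatmap with matched cells marked can be built in a few lines of plain matplotlib, in the spirit of plot_score_heatmap (a sketch, not the library function):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np

scores = np.random.default_rng(0).random((4, 5))
matches = [(0, 1), (2, 3)]

fig, ax = plt.subplots()
im = ax.imshow(scores, cmap="viridis")
# Outline each matched (source, target) cell with a hollow red square.
for i, j in matches:
    ax.scatter(j, i, marker="s", facecolors="none", edgecolors="red", s=200)
fig.colorbar(im, ax=ax)
ax.set_xlabel("target index")
ax.set_ylabel("source index")
fig.savefig("heatmap.png")
```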