API Reference

Parameters: margin_threshold (float) – Threshold for identifying ambiguous pairs.
Return type: InspectionReport
Returns: InspectionReport with detailed analysis.

merge_left(suffixes=('', '_matched'))[source]¶

Merge matches back to left DataFrame.

Parameters: suffixes (tuple[str, str]) – Suffixes for overlapping columns.
Return type: DataFrame
Returns: Left DataFrame with matched right columns.

merge_right(suffixes=('_matched', ''))[source]¶

Merge matches back to right DataFrame.

Parameters: suffixes (tuple[str, str]) – Suffixes for overlapping columns.
Return type: DataFrame
Returns: Right DataFrame with matched left columns.

class tether.core.Pipeline(preprocess_config, block_config, score_config, filter_config, decide_config)[source]¶

Bases: object

Executable linkage pipeline.

Parameters:

preprocess_config (PreprocessConfig | None)
block_config (BlockConfig | None)
score_config (ScoreConfig)
filter_config (FilterConfig)
decide_config (DecideConfig)

link(left, right)[source]¶

Execute the linkage pipeline.

Parameters:

left (DataFrame) – Left DataFrame to link.
right (DataFrame) – Right DataFrame to link.

Return type:

LinkageResult

Returns:

LinkageResult with matches and diagnostics.

class tether.core.PipelineBuilder[source]¶

Bases: object

Fluent builder for constructing linkage pipelines.

preprocess(normalize_unicode=True, lowercase=True, strip_whitespace=True, collapse_whitespace=True, columns=None)[source]¶

Configure preprocessing stage.

Parameters:

normalize_unicode (bool) – Normalize unicode characters.
lowercase (bool) – Convert to lowercase.
strip_whitespace (bool) – Strip leading and trailing whitespace.
collapse_whitespace (bool) – Collapse runs of consecutive whitespace.
columns (list[str] | None) – Columns to preprocess.

Return type:

Self

Returns:

Self for method chaining.

block(on, crosswalk=None)[source]¶

Configure blocking stage.

Parameters:

on (str | list[str]) – Field(s) to block on.
crosswalk (Crosswalk | dict[str, str] | None) – Optional crosswalk mapping.

Return type:

Self

Returns:

Self for method chaining.

score(comparisons)[source]¶

Configure scoring stage.

Parameters: comparisons (list[Comparison]) – List of comparison operations.
Return type: Self
Returns: Self for method chaining.

filter(min_score=0.0, margin=None)[source]¶

Configure filtering stage.

Parameters:

min_score (float) – Minimum score threshold.
margin (float | None) – Minimum margin for ambiguity removal.

Return type:

Self

Returns:

Self for method chaining.

decide(method='hungarian')[source]¶

Configure decision stage.

Parameters: method (Literal['hungarian', 'greedy', 'row_sequential']) – Decision algorithm to use.
Return type: Self
Returns: Self for method chaining.

build()[source]¶

Build the configured pipeline.

Return type: Pipeline
Returns: Configured Pipeline instance.
Raises: ValueError – If score configuration is missing.
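
The chaining contract documented above (each stage method returns Self, and build() raises ValueError when no scoring stage was configured) can be illustrated with a small stand-in. This is a sketch of the fluent-builder pattern only: SketchBuilder and its internal stages dictionary are hypothetical names, and nothing here is tether's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchBuilder:
    """Hypothetical stand-in for a fluent builder; not tether's code."""

    stages: dict[str, dict[str, Any]] = field(default_factory=dict)

    def block(self, on):
        self.stages["block"] = {"on": on}
        return self  # returning self is what makes chaining possible

    def score(self, comparisons):
        self.stages["score"] = {"comparisons": list(comparisons)}
        return self

    def filter(self, min_score=0.0, margin=None):
        self.stages["filter"] = {"min_score": min_score, "margin": margin}
        return self

    def build(self):
        # Mirrors the documented contract: scoring is the one required stage.
        if "score" not in self.stages:
            raise ValueError("score configuration is missing")
        return dict(self.stages)


config = SketchBuilder().block("zip").score(["name"]).filter(min_score=0.8).build()
```

Optional stages that are never called simply stay absent from the result, which matches the `| None` annotations on the preprocess and block configurations.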

class tether.core.PreprocessConfig(normalize_unicode=True, lowercase=True, strip_whitespace=True, collapse_whitespace=True, missing_policy='skip', columns=None)[source]¶

Bases: object

Configuration for preprocessing stage.

Parameters:

normalize_unicode (bool) – Whether to normalize unicode characters.
lowercase (bool) – Whether to convert to lowercase.
strip_whitespace (bool) – Whether to strip leading and trailing whitespace.
collapse_whitespace (bool) – Whether to collapse runs of consecutive whitespace.
missing_policy (Literal['skip', 'zero', 'penalize']) – How to handle missing values.
columns (list[str] | None) – Specific columns to preprocess.

normalize_unicode: bool = True¶

lowercase: bool = True¶

strip_whitespace: bool = True¶

collapse_whitespace: bool = True¶

missing_policy: Literal['skip', 'zero', 'penalize'] = 'skip'¶

columns: list[str] | None = None¶

class tether.core.ScoreConfig(comparisons=<factory>)[source]¶

Bases: object

Configuration for scoring stage.

Parameters: comparisons (list[Comparison]) – List of comparison operations.

comparisons: list[Comparison]¶

Pipeline

class tether.core.pipeline.PipelineBuilder[source]¶

Bases: object

Fluent builder for constructing linkage pipelines.

preprocess(normalize_unicode=True, lowercase=True, strip_whitespace=True, collapse_whitespace=True, columns=None)[source]¶

Configure preprocessing stage.

Parameters:

normalize_unicode (bool) – Normalize unicode characters.
lowercase (bool) – Convert to lowercase.
strip_whitespace (bool) – Strip leading and trailing whitespace.
collapse_whitespace (bool) – Collapse runs of consecutive whitespace.
columns (list[str] | None) – Columns to preprocess.

Return type:

Self

Returns:

Self for method chaining.

block(on, crosswalk=None)[source]¶

Configure blocking stage.

Parameters:

on (str | list[str]) – Field(s) to block on.
crosswalk (Crosswalk | dict[str, str] | None) – Optional crosswalk mapping.

Return type:

Self

Returns:

Self for method chaining.

score(comparisons)[source]¶

Configure scoring stage.

Parameters: comparisons (list[Comparison]) – List of comparison operations.
Return type: Self
Returns: Self for method chaining.

filter(min_score=0.0, margin=None)[source]¶

Configure filtering stage.

Parameters:

min_score (float) – Minimum score threshold.
margin (float | None) – Minimum margin for ambiguity removal.

Return type:

Self

Returns:

Self for method chaining.

decide(method='hungarian')[source]¶

Configure decision stage.

Parameters: method (Literal['hungarian', 'greedy', 'row_sequential']) – Decision algorithm to use.
Return type: Self
Returns: Self for method chaining.

build()[source]¶

Build the configured pipeline.

Return type: Pipeline
Returns: Configured Pipeline instance.
Raises: ValueError – If score configuration is missing.

class tether.core.pipeline.Pipeline(preprocess_config, block_config, score_config, filter_config, decide_config)[source]¶

Bases: object

Executable linkage pipeline.

Parameters:

preprocess_config (PreprocessConfig | None)
block_config (BlockConfig | None)
score_config (ScoreConfig)
filter_config (FilterConfig)
decide_config (DecideConfig)

link(left, right)[source]¶

Execute the linkage pipeline.

Parameters:

left (DataFrame) – Left DataFrame to link.
right (DataFrame) – Right DataFrame to link.

Return type:

LinkageResult

Returns:

LinkageResult with matches and diagnostics.

Result

class tether.core.result.LinkageResult(matches, diagnostics, left, right, candidate_pairs, filtered_pairs)[source]¶

Bases: object

Container for linkage results.

Parameters:

matches (DataFrame) – DataFrame with matched pairs.
diagnostics (LinkageDiagnostics) – Linkage diagnostics.
left (DataFrame) – Original left DataFrame.
right (DataFrame) – Original right DataFrame.
candidate_pairs (DataFrame) – Candidate pairs after blocking.
filtered_pairs (DataFrame) – Pairs after filtering.

matches: DataFrame¶

diagnostics: LinkageDiagnostics¶

left: DataFrame¶

right: DataFrame¶

candidate_pairs: DataFrame¶

filtered_pairs: DataFrame¶

inspect(margin_threshold=0.1)[source]¶

Generate an inspection report for this result.

Parameters: margin_threshold (float) – Threshold for identifying ambiguous pairs.
Return type: InspectionReport
Returns: InspectionReport with detailed analysis.

merge_left(suffixes=('', '_matched'))[source]¶

Merge matches back to left DataFrame.

Parameters: suffixes (tuple[str, str]) – Suffixes for overlapping columns.
Return type: DataFrame
Returns: Left DataFrame with matched right columns.

merge_right(suffixes=('_matched', ''))[source]¶

Merge matches back to right DataFrame.

Parameters: suffixes (tuple[str, str]) – Suffixes for overlapping columns.
Return type: DataFrame
Returns: Right DataFrame with matched left columns.

Score

Scoring module for pairwise comparisons.

class tether.score.Comparison(*args, **kwargs)[source]¶

Bases: Protocol

Protocol for field comparison operations.

column: str¶

weight: float¶

compare(left, right)[source]¶

Compare two series and return similarity scores.

Parameters:

left (Series) – Left series of values.
right (Series) – Right series of values.

Return type:

Series

Returns:

Series of similarity scores between 0 and 1.

class tether.score.DateComparison(column, tolerance_days=0, weight=1.0)[source]¶

Bases: object

Date comparison with day tolerance.

Parameters:

column (str) – Column name to compare.
tolerance_days (int) – Maximum allowed difference in days.
weight (float) – Weight for this comparison in aggregate score.

column: str¶

tolerance_days: int¶

weight: float¶

compare(left, right)[source]¶

Compare date values with day tolerance.

Parameters:

left (Series) – Left series of date values.
right (Series) – Right series of date values.

Return type:

Series

Returns:

Series of similarity scores between 0 and 1.
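
The reference does not say how a date gap is turned into a score, so the following is one plausible reading rather than the library's behaviour: a pair scores 1.0 when the dates differ by at most tolerance_days and 0.0 otherwise. The function name date_similarity and the treatment of missing values as 0.0 are assumptions made for this sketch.

```python
from datetime import date


def date_similarity(left, right, tolerance_days=0):
    """Binary within-tolerance score; an assumed rule, not tether's."""
    if left is None or right is None:
        return 0.0  # assumption: a missing date never matches
    gap = abs((left - right).days)
    return 1.0 if gap <= tolerance_days else 0.0


same_week = date_similarity(date(2024, 3, 1), date(2024, 3, 4), tolerance_days=3)
too_far = date_similarity(date(2024, 3, 1), date(2024, 3, 5), tolerance_days=3)
```

A graded variant that decays with the gap would also satisfy the documented "between 0 and 1" range.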

class tether.score.ExactComparison(column, weight=1.0)[source]¶

Bases: object

Exact match comparison.

Parameters:

column (str) – Column name to compare.
weight (float) – Weight for this comparison in aggregate score.

column: str¶

weight: float¶

compare(left, right)[source]¶

Compare values for exact equality.

Parameters:

left (Series) – Left series of values.
right (Series) – Right series of values.

Return type:

Series

Returns:

Series of 1.0 for matches, 0.0 for non-matches.

class tether.score.NumericComparison(column, tolerance=0.0, weight=1.0, scale='linear')[source]¶

Bases: object

Numeric comparison with tolerance.

Parameters:

column (str) – Column name to compare.
tolerance (float) – Maximum allowed difference for a match.
weight (float) – Weight for this comparison in aggregate score.
scale (Literal['linear', 'gaussian'])

column: str¶

tolerance: float¶

weight: float¶

scale: Literal['linear', 'gaussian']¶

compare(left, right)[source]¶

Compare numeric values with tolerance.

Parameters:

left (Series) – Left series of numeric values.
right (Series) – Right series of numeric values.

Return type:

Series

Returns:

Series of similarity scores between 0 and 1.
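
The scale parameter is listed without a description, so the formulas below are guesses at what 'linear' and 'gaussian' might mean, shown only to make the difference between the two shapes concrete. In this sketch the linear score falls from 1 to 0 as the difference approaches the tolerance, and the gaussian score treats the tolerance as a standard deviation. The name numeric_similarity is hypothetical.

```python
import math


def numeric_similarity(a, b, tolerance=0.0, scale="linear"):
    """Map an absolute difference to a score in [0, 1] (assumed formulas)."""
    diff = abs(a - b)
    if tolerance == 0.0:
        # With no tolerance, only exact equality counts as a match.
        return 1.0 if diff == 0 else 0.0
    if scale == "linear":
        return max(0.0, 1.0 - diff / tolerance)
    if scale == "gaussian":
        return math.exp(-(diff ** 2) / (2 * tolerance ** 2))
    raise ValueError(f"unknown scale: {scale!r}")


linear_score = numeric_similarity(10, 12, tolerance=4)
gaussian_score = numeric_similarity(10, 14, tolerance=4, scale="gaussian")
```

Under these assumptions the linear form cuts off sharply at the tolerance, while the gaussian form never reaches zero and so keeps ranking distant values.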

class tether.score.PairwiseScorer(comparisons)[source]¶

Bases: object

Compute pairwise similarity scores for candidate pairs.

Parameters: comparisons (list[Comparison])

score(pairs)[source]¶

Compute similarity scores for candidate pairs.

Parameters: pairs (DataFrame) – DataFrame with candidate pairs containing columns from both left and right DataFrames with _left and _right suffixes.
Return type: DataFrame
Returns: DataFrame with original pairs plus score columns and aggregate score.
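
How the per-comparison scores and their weights combine into the aggregate score is not spelled out here. A common choice, assumed in this sketch, is a weighted mean normalised by the total weight so the aggregate stays between 0 and 1. The function aggregate_score and its dictionary inputs are illustrative, not part of tether.

```python
def aggregate_score(scores, weights):
    """Weighted mean of per-column scores (assumed aggregation rule)."""
    total_weight = sum(weights[column] for column in scores)
    if total_weight == 0:
        return 0.0  # avoid dividing by zero when every weight is zero
    weighted = sum(scores[column] * weights[column] for column in scores)
    return weighted / total_weight


example = aggregate_score({"name": 0.9, "dob": 1.0}, {"name": 2.0, "dob": 1.0})
```

Doubling the weight on "name" pulls the aggregate toward the name score, which is the practical effect of the weight parameter on each comparison class.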

class tether.score.StringComparison(column, algorithm='jaro_winkler', weight=1.0)[source]¶

Bases: object

String similarity comparison using fuzzy matching algorithms.

Parameters:

column (str) – Column name to compare.
algorithm (Literal['jaro_winkler', 'levenshtein', 'damerau_levenshtein']) – Similarity algorithm to use.
weight (float) – Weight for this comparison in aggregate score.

column: str¶

algorithm: Literal['jaro_winkler', 'levenshtein', 'damerau_levenshtein']¶

weight: float¶

compare(left, right)[source]¶

Compare string values using the configured algorithm.

Parameters:

left (Series) – Left series of string values.
right (Series) – Right series of string values.

Return type:

Series

Returns:

Series of similarity scores between 0 and 1.
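
As an illustration of how an edit distance becomes a score between 0 and 1, the sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and divides by the longer string's length. The normalisation is an assumption; the reference does not state which one the 'levenshtein' option uses.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Scale the distance into [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


score = levenshtein_similarity("abcd", "abce")
```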

Comparisons

class tether.score.comparisons.StringComparison(column, algorithm='jaro_winkler', weight=1.0)[source]¶

Bases: object

String similarity comparison using fuzzy matching algorithms.

Parameters:

column (str) – Column name to compare.
algorithm (Literal['jaro_winkler', 'levenshtein', 'damerau_levenshtein']) – Similarity algorithm to use.
weight (float) – Weight for this comparison in aggregate score.

column: str¶

algorithm: Literal['jaro_winkler', 'levenshtein', 'damerau_levenshtein']¶

weight: float¶

compare(left, right)[source]¶

Compare string values using the configured algorithm.

Parameters:

left (Series) – Left series of string values.
right (Series) – Right series of string values.

Return type:

Series

Returns:

Series of similarity scores between 0 and 1.

class tether.score.comparisons.ExactComparison(column, weight=1.0)[source]¶

Bases: object

Exact match comparison.

Parameters:

column (str) – Column name to compare.
weight (float) – Weight for this comparison in aggregate score.

column: str¶

weight: float¶

compare(left, right)[source]¶

Compare values for exact equality.

Parameters:

left (Series) – Left series of values.
right (Series) – Right series of values.

Return type:

Series

Returns:

Series of 1.0 for matches, 0.0 for non-matches.

class tether.score.comparisons.NumericComparison(column, tolerance=0.0, weight=1.0, scale='linear')[source]¶

Bases: object

Numeric comparison with tolerance.

Parameters:

column (str) – Column name to compare.
tolerance (float) – Maximum allowed difference for a match.
weight (float) – Weight for this comparison in aggregate score.
scale (Literal['linear', 'gaussian'])

column: str¶

tolerance: float¶

weight: float¶

scale: Literal['linear', 'gaussian']¶

compare(left, right)[source]¶

Compare numeric values with tolerance.

Parameters:

left (Series) – Left series of numeric values.
right (Series) – Right series of numeric values.

Return type:

Series

Returns:

Series of similarity scores between 0 and 1.

class tether.score.comparisons.DateComparison(column, tolerance_days=0, weight=1.0)[source]¶

Bases: object

Date comparison with day tolerance.

Parameters:

column (str) – Column name to compare.
tolerance_days (int) – Maximum allowed difference in days.
weight (float) – Weight for this comparison in aggregate score.

column: str¶

tolerance_days: int¶

weight: float¶

compare(left, right)[source]¶

Compare date values with day tolerance.

Parameters:

left (Series) – Left series of date values.
right (Series) – Right series of date values.

Return type:

Series

Returns:

Series of similarity scores between 0 and 1.

Block

Blocking module for reducing comparison space.

class tether.block.Blocker(*args, **kwargs)[source]¶

Bases: Protocol

Protocol for blocking strategies.

block(left, right)[source]¶

Generate candidate pairs from two DataFrames.

Parameters:

left (DataFrame) – Left DataFrame.
right (DataFrame) – Right DataFrame.

Return type:

DataFrame

Returns:

DataFrame with candidate pairs containing columns from both DataFrames with _left and _right suffixes.

class tether.block.Crosswalk(mapping)[source]¶

Bases: object

Mapping between blocking key values.

Parameters: mapping (dict[str, str])

apply(series)[source]¶

Apply crosswalk mapping to a series.

Parameters: series (Series) – Series of values to normalize.
Return type: Series
Returns: Series with mapped values.
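
A crosswalk can be pictured as a dictionary lookup applied to each blocking key. The sketch below uses a plain list instead of a Series and assumes that values missing from the mapping pass through unchanged; the reference does not confirm that fallback. The name apply_crosswalk is hypothetical.

```python
def apply_crosswalk(values, mapping):
    """Replace each value by its mapped form; unmapped values are kept as-is."""
    return [mapping.get(value, value) for value in values]


state_codes = {"Calif.": "CA", "California": "CA", "N.Y.": "NY"}
normalized = apply_crosswalk(["California", "N.Y.", "TX"], state_codes)
```

After normalisation, records that spelled the same key differently fall into the same block.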

validate()[source]¶

Validate the crosswalk mapping.

Return type: list[str]
Returns: List of validation error messages.

class tether.block.FieldBlocker(on, crosswalk=None)[source]¶

Bases: object

Block on one or more fields with optional crosswalk.

Parameters:

on (str | list[str])
crosswalk (dict[str, str] | Crosswalk | None)

block(left, right)[source]¶

Generate candidate pairs by blocking on specified fields.

Parameters:

left (DataFrame) – Left DataFrame.
right (DataFrame) – Right DataFrame.

Return type:

DataFrame

Returns:

DataFrame with candidate pairs.
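
Blocking on a field amounts to pairing only those records that share a key value. The following sketch shows the idea with lists of dictionaries and index pairs; the function block_on_field is illustrative and says nothing about how FieldBlocker is implemented or what columns it returns.

```python
from collections import defaultdict


def block_on_field(left, right, on):
    """Return (left_index, right_index) pairs whose records agree on `on`."""
    right_by_key = defaultdict(list)
    for right_index, record in enumerate(right):
        right_by_key[record[on]].append(right_index)

    pairs = []
    for left_index, record in enumerate(left):
        # Only right records with the same key become candidates.
        for right_index in right_by_key.get(record[on], []):
            pairs.append((left_index, right_index))
    return pairs


left = [{"zip": "10001"}, {"zip": "94105"}, {"zip": "60601"}]
right = [{"zip": "94105"}, {"zip": "10001"}, {"zip": "10001"}]
candidates = block_on_field(left, right, on="zip")
```

Here three candidate pairs survive out of the nine that comparing every left record with every right record would produce.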

class tether.block.FullBlocker[source]¶

Bases: object

Generate all possible pairs (no blocking).

Use with caution: for n left records and m right records this creates n × m pairs.

block(left, right)[source]¶

Generate all possible pairs.

Parameters:

left (DataFrame) – Left DataFrame.
right (DataFrame) – Right DataFrame.

Return type:

DataFrame

Returns:

DataFrame with all pairs.

Decide

Decision rule implementations.

class tether.decide.DecisionRule(*args, **kwargs)[source]¶

Bases: Protocol

Protocol for matching decision algorithms.

decide(scored_pairs)[source]¶

Select final matches from scored candidate pairs.

Parameters: scored_pairs (DataFrame) – DataFrame with candidate pairs and scores. Must contain ‘left_index’, ‘right_index’, and ‘score’ columns.
Return type: DataFrame
Returns: DataFrame with selected matches.

class tether.decide.GreedyDecision[source]¶

Bases: object

Greedy matching that selects the globally best pair first.

Iteratively selects the highest-scoring unmatched pair until no valid pairs remain.

decide(scored_pairs)[source]¶

Select matches using greedy best-first approach.

Parameters: scored_pairs (DataFrame) – DataFrame with ‘left_index’, ‘right_index’, and ‘score’.
Return type: DataFrame
Returns: DataFrame with greedy matches.
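
The greedy rule described above can be sketched in a few lines: sort all pairs by score, then accept each pair whose left and right records are both still free. Pairs are plain (left_index, right_index, score) tuples here rather than DataFrame rows, and greedy_matches is a hypothetical name.

```python
def greedy_matches(scored_pairs):
    """Accept pairs from highest to lowest score, each record at most once."""
    ordered = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)
    used_left, used_right = set(), set()
    matches = []
    for left_index, right_index, score in ordered:
        if left_index in used_left or right_index in used_right:
            continue  # one side is already taken by a better-scoring pair
        used_left.add(left_index)
        used_right.add(right_index)
        matches.append((left_index, right_index, score))
    return matches


pairs = [(0, 0, 0.90), (0, 1, 0.80), (1, 0, 0.85), (1, 1, 0.10)]
matches = greedy_matches(pairs)
```

On this input the top pair (0, 0) is taken first, which forces left record 1 onto its weak candidate. The total score is lower than pairing 0 with 1 and 1 with 0 would give, which is the usual trade-off of a greedy strategy.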

class tether.decide.HungarianDecision[source]¶

Bases: object

Optimal assignment using the Hungarian algorithm.

Maximizes total matching score while ensuring each record is matched at most once.

decide(scored_pairs)[source]¶

Select optimal matches using Hungarian algorithm.

Parameters: scored_pairs (DataFrame) – DataFrame with ‘left_index’, ‘right_index’, and ‘score’.
Return type: DataFrame
Returns: DataFrame with optimal matches.
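
To show what "maximizes total matching score" means in practice, the sketch below finds the best one-to-one assignment by brute force over permutations. This is deliberately not the Hungarian algorithm: it is exponential and only suitable for tiny inputs, whereas a polynomial-time solver such as SciPy's linear_sum_assignment is the usual tool for real data. Whether tether relies on SciPy is not stated in this reference. Scores are assumed to be non-negative, and optimal_matches is a hypothetical name.

```python
from itertools import permutations


def optimal_matches(scored_pairs):
    """Brute-force search for the one-to-one assignment with the highest total."""
    scores = {(left, right): s for left, right, s in scored_pairs}
    lefts = sorted({left for left, _ in scores})
    rights = sorted({right for _, right in scores})
    # Pad with None so that any left record may remain unmatched.
    slots = rights + [None] * len(lefts)

    best_total, best = -1.0, []
    for assignment in permutations(slots, len(lefts)):
        chosen = [(left, right, scores[(left, right)])
                  for left, right in zip(lefts, assignment)
                  if (left, right) in scores]
        total = sum(s for _, _, s in chosen)
        if total > best_total:
            best_total, best = total, chosen
    return best


pairs = [(0, 0, 0.90), (0, 1, 0.80), (1, 0, 0.85), (1, 1, 0.10)]
matches = optimal_matches(pairs)
```

For these four pairs the optimal assignment gives up the single best pair (0, 0) in exchange for two good pairs, which a best-first greedy pass would not do.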

class tether.decide.RowSequentialDecision[source]¶

Bases: object

Row-sequential matching that processes left records in order.

For each left record, in index order, it selects the best right record that is still available. This is a simple baseline algorithm.

decide(scored_pairs)[source]¶

Select matches processing left records sequentially.

Parameters: scored_pairs (DataFrame) – DataFrame with ‘left_index’, ‘right_index’, and ‘score’.
Return type: DataFrame
Returns: DataFrame with row-sequential matches.
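
The order dependence of the row-sequential rule is easiest to see in a small sketch. Pairs are (left_index, right_index, score) tuples, and row_sequential_matches is a hypothetical name; the tie-breaking shown (higher right index wins on equal scores) is an arbitrary choice made for this example.

```python
def row_sequential_matches(scored_pairs):
    """Walk left records in index order; each takes its best free right record."""
    by_left = {}
    for left_index, right_index, score in scored_pairs:
        by_left.setdefault(left_index, []).append((score, right_index))

    used_right = set()
    matches = []
    for left_index in sorted(by_left):
        available = [c for c in by_left[left_index] if c[1] not in used_right]
        if not available:
            continue  # every candidate was claimed by an earlier left record
        score, right_index = max(available)
        used_right.add(right_index)
        matches.append((left_index, right_index, score))
    return matches


pairs = [(0, 0, 0.60), (0, 1, 0.50), (1, 0, 0.95)]
matches = row_sequential_matches(pairs)
```

Left record 0 claims right record 0 with a mediocre score, so left record 1 loses its only (and much stronger) candidate. That sensitivity to row order is why this rule is best treated as a baseline.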

Filter

Filtering module for removing low-quality pairs.

class tether.filter.Filter(*args, **kwargs)[source]¶

Bases: Protocol

Protocol for pair filtering strategies.

filter(pairs)[source]¶

Filter candidate pairs.

Parameters: pairs (DataFrame) – DataFrame with candidate pairs and scores.
Return type: DataFrame
Returns: Filtered DataFrame.

class tether.filter.MarginFilter(margin)[source]¶

Bases: object

Remove ambiguous matches based on score margin.

For each left record, removes the best match if the margin to the second-best match is below the threshold.

Parameters: margin (float)

filter(pairs)[source]¶

Remove ambiguous matches.

Parameters: pairs (DataFrame) – DataFrame with ‘left_index’ and ‘score’ columns.
Return type: DataFrame
Returns: Filtered DataFrame with unambiguous matches.
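
The sketch below follows the class description literally: for each left record it compares the best and second-best scores and drops the best pair when the gap is below the margin. It is an interpretation, not the library's code; an implementation could equally drop every candidate of an ambiguous left record. Pairs are (left_index, right_index, score) tuples and margin_filter is a hypothetical name.

```python
def margin_filter(pairs, margin):
    """Drop a left record's top pair when it leads the runner-up by < margin."""
    by_left = {}
    for pair in pairs:
        by_left.setdefault(pair[0], []).append(pair)

    kept = []
    for candidates in by_left.values():
        ranked = sorted(candidates, key=lambda p: p[2], reverse=True)
        if len(ranked) > 1 and ranked[0][2] - ranked[1][2] < margin:
            ranked = ranked[1:]  # the best match is too close to call
        kept.extend(ranked)
    return kept


pairs = [(0, 0, 0.90), (0, 1, 0.88), (1, 0, 0.95), (1, 1, 0.40)]
filtered = margin_filter(pairs, margin=0.1)
```

Left record 0 has two near-identical candidates, so its top pair is removed; left record 1 has a clear winner and is untouched. A left record with a single candidate has no runner-up and is always kept in this sketch.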

class tether.filter.ThresholdFilter(min_score)[source]¶

Bases: object

Filter pairs below a minimum score threshold.

Parameters: min_score (float)

filter(pairs)[source]¶

Remove pairs below the threshold.

Parameters: pairs (DataFrame) – DataFrame with ‘score’ column.
Return type: DataFrame
Returns: Filtered DataFrame with pairs meeting threshold.

Preprocess

Preprocessing module for data normalization.

class tether.preprocess.MissingHandler(policy='skip', fill_value='', columns=None)[source]¶

Bases: object

Handle missing values in DataFrames.

Parameters:

policy (Literal['skip', 'zero', 'penalize'])
fill_value (str)
columns (list[str] | None)

preprocess(df)[source]¶

Handle missing values in the DataFrame.

Parameters: df (DataFrame) – Input DataFrame.
Return type: DataFrame
Returns: DataFrame with missing values handled.

class tether.preprocess.Preprocessor(*args, **kwargs)[source]¶

Bases: Protocol

Protocol for preprocessing operations.

preprocess(df)[source]¶

Preprocess a DataFrame.

Parameters: df (DataFrame) – Input DataFrame.
Return type: DataFrame
Returns: Preprocessed DataFrame.

class tether.preprocess.TextNormalizer(normalize_unicode=True, lowercase=True, strip_whitespace=True, collapse_whitespace=True, columns=None)[source]¶

Bases: object

Normalize text columns in a DataFrame.

Parameters:

normalize_unicode (bool)
lowercase (bool)
strip_whitespace (bool)
collapse_whitespace (bool)
columns (list[str] | None)

preprocess(df)[source]¶

Normalize text columns in the DataFrame.

Parameters: df (DataFrame) – Input DataFrame.
Return type: DataFrame
Returns: DataFrame with normalized text columns.
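
The four normalization switches can be illustrated on a single string using only the standard library. This is a sketch of the general technique: the NFKC normalization form and the order in which the steps run are assumptions, and normalize_text is not a tether function.

```python
import re
import unicodedata


def normalize_text(value, normalize_unicode=True, lowercase=True,
                   strip_whitespace=True, collapse_whitespace=True):
    """Apply the four text clean-up steps to one string."""
    if normalize_unicode:
        # Assumed form: NFKC folds compatibility characters such as
        # full-width letters into their ordinary equivalents.
        value = unicodedata.normalize("NFKC", value)
    if lowercase:
        value = value.lower()
    if strip_whitespace:
        value = value.strip()
    if collapse_whitespace:
        value = re.sub(r"\s+", " ", value)
    return value


cleaned = normalize_text("  Ｊｏｓé   GARCÍA \n")
```

Applying the same function to both tables before scoring keeps purely cosmetic differences from lowering string similarity.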

Deduplicate

Deduplication module for within-table duplicate removal.

class tether.deduplicate.ClusterDeduplicator(comparisons, threshold=0.9, margin=0.1, timestamp_column=None, block_on=None)[source]¶

Bases: object

Remove within-table duplicates using connected components.

Parameters:

comparisons (list[Comparison])
threshold (float)
margin (float)
timestamp_column (str | None)
block_on (str | list[str] | None)

deduplicate(df)[source]¶

Remove duplicate records using connected components.

Parameters: df (DataFrame) – Input DataFrame.
Return type: tuple[DataFrame, DeduplicationReport]
Returns: Tuple of (deduplicated DataFrame, deduplication report).
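
Deduplication by connected components treats every pair scoring at or above the threshold as an edge and keeps one record per resulting group. The sketch below uses a small union-find structure to form the groups. Keeping the lowest index of each group is an assumption made here; the timestamp_column and margin parameters suggest the library applies a richer rule that this reference does not describe.

```python
def connected_components(n_records, duplicate_pairs):
    """Group record indices that are linked, directly or transitively."""
    parent = list(range(n_records))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for i in range(n_records):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


scored = [(0, 1, 0.95), (1, 2, 0.93), (3, 4, 0.50)]
threshold = 0.9
edges = [(a, b) for a, b, score in scored if score >= threshold]
groups = connected_components(5, edges)
kept = [group[0] for group in groups]  # assumed rule: keep the first member
```

Records 0 and 2 were never compared directly with a high score, yet they end up in the same group through record 1. That transitivity is what distinguishes this approach from pairwise duplicate removal.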

class tether.deduplicate.DeduplicationReport(original_count, kept_count, dropped_as_duplicate, dropped_as_indistinguishable, groups_found, largest_group_size)[source]¶

Bases: object

Report on deduplication results.

Parameters:

original_count (int)
kept_count (int)
dropped_as_duplicate (int)
dropped_as_indistinguishable (int)
groups_found (int)
largest_group_size (int)

original_count: int¶

kept_count: int¶

dropped_as_duplicate: int¶

dropped_as_indistinguishable: int¶

groups_found: int¶

largest_group_size: int¶

class tether.deduplicate.Deduplicator(*args, **kwargs)[source]¶

Bases: Protocol

Protocol for within-table deduplication.

deduplicate(df)[source]¶

Remove duplicate records from a DataFrame.

Parameters: df (DataFrame) – Input DataFrame.
Return type: tuple[DataFrame, DeduplicationReport]
Returns: Tuple of (deduplicated DataFrame, deduplication report).

class tether.deduplicate.ExactDeduplicator(columns=None, keep='first')[source]¶

Bases: object

Remove exact duplicates based on specified columns.

Parameters:

columns (list[str] | None)
keep (Literal['first', 'last'])

keep: Literal['first', 'last']¶

deduplicate(df)[source]¶

Remove exact duplicates.

Parameters: df (DataFrame) – Input DataFrame.
Return type: tuple[DataFrame, DeduplicationReport]
Returns: Tuple of (deduplicated DataFrame, deduplication report).

Inspect

Inspection module for linkage diagnostics and reports.

class tether.inspect.InspectionReport(diagnostics, ambiguous_pairs=<factory>, unmatched_left=<factory>, unmatched_right=<factory>)[source]¶

Bases: object

Detailed inspection report for linkage results.

Parameters:

diagnostics (LinkageDiagnostics) – Linkage diagnostics.
ambiguous_pairs (DataFrame) – Pairs with close scores that may be ambiguous.
unmatched_left (DataFrame) – Left records that were not matched.
unmatched_right (DataFrame) – Right records that were not matched.

diagnostics: LinkageDiagnostics¶

ambiguous_pairs: DataFrame¶

unmatched_left: DataFrame¶

unmatched_right: DataFrame¶

summary()[source]¶

Generate a text summary of the report.

Return type: str
Returns: Human-readable summary string.

class tether.inspect.LinkageDiagnostics(n_left, n_right, n_candidate_pairs, n_filtered_pairs, n_matches, match_rate_left, match_rate_right, score_stats)[source]¶

Bases: object

Diagnostic statistics for linkage results.

Parameters:

n_left (int) – Number of records in left DataFrame.
n_right (int) – Number of records in right DataFrame.
n_candidate_pairs (int) – Number of candidate pairs after blocking.
n_filtered_pairs (int) – Number of pairs after filtering.
n_matches (int) – Number of final matches.
match_rate_left (float) – Proportion of left records matched.
match_rate_right (float) – Proportion of right records matched.
score_stats (dict[str, float]) – Score distribution statistics.

n_left: int¶

n_right: int¶

n_candidate_pairs: int¶

n_filtered_pairs: int¶

n_matches: int¶

match_rate_left: float¶

match_rate_right: float¶

score_stats: dict[str, float]¶
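
The two match-rate fields are described as the proportion of records matched on each side. Assuming one-to-one matches, so that every match consumes exactly one left and one right record, they reduce to simple ratios. The helper match_rates is hypothetical, and returning 0.0 for an empty table is a choice made for this sketch.

```python
def match_rates(n_left, n_right, n_matches):
    """Share of left and right records that ended up in a match."""
    rate_left = n_matches / n_left if n_left else 0.0
    rate_right = n_matches / n_right if n_right else 0.0
    return rate_left, rate_right


rate_left, rate_right = match_rates(n_left=200, n_right=160, n_matches=120)
```

When the tables differ in size the two rates diverge, so a low left rate alongside a high right rate usually just reflects a larger left table.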

tether.inspect.compute_diagnostics(left, right, candidate_pairs, filtered_pairs, matches)[source]¶

Compute diagnostic statistics for linkage results.

Parameters:

left (DataFrame) – Left DataFrame.
right (DataFrame) – Right DataFrame.
candidate_pairs (DataFrame) – Candidate pairs after blocking.
filtered_pairs (DataFrame) – Pairs after filtering.
matches (DataFrame) – Final matches.

Return type:

LinkageDiagnostics

Returns:

LinkageDiagnostics with computed statistics.

tether.inspect.generate_report(left, right, matches, diagnostics, filtered_pairs, margin_threshold=0.1)[source]¶

Generate an inspection report for linkage results.

Parameters:

left (DataFrame) – Left DataFrame.
right (DataFrame) – Right DataFrame.
matches (DataFrame) – Final matches.
diagnostics (LinkageDiagnostics) – Linkage diagnostics.
filtered_pairs (DataFrame) – Pairs after filtering.
margin_threshold (float) – Threshold for identifying ambiguous pairs.

Return type:

InspectionReport

Returns:

InspectionReport with detailed analysis.

Multipass

Multi-pass linkage module.

class tether.multipass.MultiPassOrchestrator[source]¶

Bases: object

Orchestrate multi-pass record linkage.

Runs multiple passes with progressively relaxed thresholds, removing matched records between passes for higher precision.

run(left, right, passes, comparisons, block_on=None, crosswalk=None, preprocess=True)[source]¶

Execute multi-pass linkage.

Parameters:

left (DataFrame) – Left DataFrame to link.
right (DataFrame) – Right DataFrame to link.
passes (list[PassConfig] | list[dict[str, float | str]]) – List of pass configurations.
comparisons (list[Comparison]) – Comparison operations for scoring.
block_on (str | list[str] | None) – Optional field(s) for blocking.
crosswalk (Crosswalk | dict[str, str] | None) – Optional crosswalk mapping.
preprocess (bool) – Whether to preprocess text columns.

Return type:

LinkageResult

Returns:

Combined LinkageResult from all passes.
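
The multi-pass idea (strict threshold first, then progressively looser ones, with matched records removed between passes) can be sketched on pre-scored pairs. Each pass here uses a simple best-first selection instead of the configurable decision method, and the function multi_pass and the pass number attached to each match are illustrative additions, not part of the documented result.

```python
def multi_pass(scored_pairs, thresholds):
    """Match in several passes, removing matched records after each one."""
    matched_left, matched_right = set(), set()
    matches = []
    for pass_number, threshold in enumerate(thresholds, start=1):
        # Only pairs above this pass's threshold whose records are still free.
        eligible = sorted(
            (p for p in scored_pairs
             if p[2] >= threshold
             and p[0] not in matched_left
             and p[1] not in matched_right),
            key=lambda p: p[2],
            reverse=True,
        )
        for left_index, right_index, score in eligible:
            if left_index in matched_left or right_index in matched_right:
                continue
            matched_left.add(left_index)
            matched_right.add(right_index)
            matches.append((left_index, right_index, score, pass_number))
    return matches


pairs = [(0, 0, 0.97), (1, 0, 0.90), (1, 1, 0.88), (2, 2, 0.72)]
matches = multi_pass(pairs, thresholds=[0.95, 0.85, 0.70])
```

Because right record 0 is claimed in the strict pass, the pair (1, 0) is no longer a candidate in the second pass even though it clears that pass's threshold. Confident matches therefore cannot be displaced by weaker ones accepted later.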

class tether.multipass.PassConfig(min_score, method='hungarian', margin=None)[source]¶

Bases: object

Configuration for a single pass in multi-pass matching.

Parameters:

min_score (float) – Minimum score threshold for this pass.
method (Literal['hungarian', 'greedy', 'row_sequential']) – Decision method for this pass.
margin (float | None) – Optional margin filter for this pass.

min_score: float¶

method: Literal['hungarian', 'greedy', 'row_sequential'] = 'hungarian'¶

margin: float | None = None¶

tether.multipass.precision_first(threshold=0.9)[source]¶

Create a precision-first single-pass strategy.

Parameters: threshold (float) – High threshold for precision.
Return type: list[PassConfig]
Returns: Single PassConfig list for high-precision matching.

tether.multipass.strict_then_relaxed(strict_threshold=0.95, medium_threshold=0.85, relaxed_threshold=0.7)[source]¶

Create a strict-then-relaxed multi-pass strategy.

Parameters:

strict_threshold (float) – Threshold for first strict pass.
medium_threshold (float) – Threshold for medium pass.
relaxed_threshold (float) – Threshold for final relaxed pass.

Return type:

list[PassConfig]

Returns:

List of PassConfig for multi-pass matching.