API Reference
Core
Core pipeline module.
- class tether.core.BlockConfig(on, crosswalk=None)
Bases: object
Configuration for blocking stage.
- class tether.core.DecideConfig(method='hungarian')
Bases: object
Configuration for decision stage.
- Parameters:
method (Literal['hungarian', 'greedy', 'row_sequential']) – Decision algorithm to use.
- class tether.core.FilterConfig(min_score=0.0, margin=None)
Bases: object
Configuration for filtering stage.
- class tether.core.LinkageResult(matches, diagnostics, left, right, candidate_pairs, filtered_pairs)
Bases: object
Container for linkage results.
- Parameters:
matches (DataFrame) – DataFrame with matched pairs.
diagnostics (LinkageDiagnostics) – Linkage diagnostics.
left (DataFrame) – Original left DataFrame.
right (DataFrame) – Original right DataFrame.
candidate_pairs (DataFrame) – Candidate pairs after blocking.
filtered_pairs (DataFrame) – Pairs after filtering.
- diagnostics: LinkageDiagnostics
- inspect(margin_threshold=0.1)
Generate an inspection report for this result.
- Parameters:
margin_threshold (float) – Threshold for identifying ambiguous pairs.
- Return type:
InspectionReport
- Returns:
InspectionReport with detailed analysis.
- class tether.core.Pipeline(preprocess_config, block_config, score_config, filter_config, decide_config)
Bases: object
Executable linkage pipeline.
- Parameters:
preprocess_config (PreprocessConfig | None)
block_config (BlockConfig | None)
score_config (ScoreConfig)
filter_config (FilterConfig)
decide_config (DecideConfig)
- class tether.core.PipelineBuilder
Bases: object
Fluent builder for constructing linkage pipelines.
- preprocess(normalize_unicode=True, lowercase=True, strip_whitespace=True, collapse_whitespace=True, columns=None)
Configure preprocessing stage.
- Return type:
Self
- Returns:
Self for method chaining.
- score(comparisons)
Configure scoring stage.
- Parameters:
comparisons (list[Comparison]) – List of comparison operations.
- Return type:
Self
- Returns:
Self for method chaining.
- decide(method='hungarian')
Configure decision stage.
- Parameters:
method (Literal['hungarian', 'greedy', 'row_sequential']) – Decision algorithm to use.
- Return type:
Self
- Returns:
Self for method chaining.
- build()
Build the configured pipeline.
- Return type:
Pipeline
- Returns:
Configured Pipeline instance.
- Raises:
ValueError – If score configuration is missing.
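The chaining contract documented above (each stage method returns Self, and build() raises ValueError when no scoring stage was configured) can be illustrated with a toy builder. This is a minimal sketch, not tether's implementation: `SketchBuilder` and `SketchPipeline` are hypothetical names, and the configs are plain dicts rather than the real config classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchPipeline:
    """Hypothetical stand-in for a built pipeline; holds plain dict configs."""
    preprocess: Optional[dict]
    score: dict
    decide: dict


class SketchBuilder:
    """Toy fluent builder mirroring the documented method names (not tether's code)."""

    def __init__(self) -> None:
        self._preprocess: Optional[dict] = None
        self._score: Optional[dict] = None
        self._decide: dict = {"method": "hungarian"}  # documented default method

    def preprocess(self, lowercase: bool = True) -> "SketchBuilder":
        self._preprocess = {"lowercase": lowercase}
        return self  # returning self is what makes chaining possible

    def score(self, comparisons: list) -> "SketchBuilder":
        self._score = {"comparisons": list(comparisons)}
        return self

    def decide(self, method: str = "hungarian") -> "SketchBuilder":
        self._decide = {"method": method}
        return self

    def build(self) -> SketchPipeline:
        # Mirrors the documented contract: a missing score stage is an error.
        if self._score is None:
            raise ValueError("score configuration is missing")
        return SketchPipeline(self._preprocess, self._score, self._decide)


pipeline = SketchBuilder().preprocess().score(["name"]).decide("greedy").build()
```

Calling `SketchBuilder().build()` without a prior `score(...)` call raises ValueError, which is the behavior the Raises entry describes.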
- class tether.core.PreprocessConfig(normalize_unicode=True, lowercase=True, strip_whitespace=True, collapse_whitespace=True, missing_policy='skip', columns=None)
Bases: object
Configuration for preprocessing stage.
- Parameters:
normalize_unicode (bool) – Whether to normalize unicode characters.
lowercase (bool) – Whether to convert to lowercase.
strip_whitespace (bool) – Whether to strip whitespace.
collapse_whitespace (bool) – Whether to collapse multiple whitespace.
missing_policy (Literal['skip', 'zero', 'penalize']) – How to handle missing values.
columns (list[str] | None) – Specific columns to preprocess.
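The four text flags above can be pictured as successive string transformations. The following is a minimal sketch using only the standard library; the function name `normalize_text`, the NFKC normal form, and the order of the steps are assumptions, since this reference does not specify them.

```python
import re
import unicodedata


def normalize_text(
    value: str,
    normalize_unicode: bool = True,
    lowercase: bool = True,
    strip_whitespace: bool = True,
    collapse_whitespace: bool = True,
) -> str:
    """Apply the four documented text flags in a fixed (assumed) order."""
    if normalize_unicode:
        # NFKC is an assumed normal form; the reference does not name one.
        value = unicodedata.normalize("NFKC", value)
    if lowercase:
        value = value.lower()
    if strip_whitespace:
        value = value.strip()
    if collapse_whitespace:
        # Replace any run of whitespace with a single space.
        value = re.sub(r"\s+", " ", value)
    return value


cleaned = normalize_text("  Jos\u00e9   GARCIA\u00a0 ")
```

Each flag can be switched off independently, for example `normalize_text("a  b", collapse_whitespace=False)` leaves the double space in place.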
- class tether.core.ScoreConfig(comparisons=<factory>)
Bases: object
Configuration for scoring stage.
- Parameters:
comparisons (list[Comparison]) – List of comparison operations.
- comparisons: list[Comparison]
Pipeline
- class tether.core.pipeline.PipelineBuilder
Bases: object
Fluent builder for constructing linkage pipelines.
- preprocess(normalize_unicode=True, lowercase=True, strip_whitespace=True, collapse_whitespace=True, columns=None)
Configure preprocessing stage.
- Return type:
Self
- Returns:
Self for method chaining.
- score(comparisons)
Configure scoring stage.
- Parameters:
comparisons (list[Comparison]) – List of comparison operations.
- Return type:
Self
- Returns:
Self for method chaining.
- decide(method='hungarian')
Configure decision stage.
- Parameters:
method (Literal['hungarian', 'greedy', 'row_sequential']) – Decision algorithm to use.
- Return type:
Self
- Returns:
Self for method chaining.
- build()
Build the configured pipeline.
- Return type:
Pipeline
- Returns:
Configured Pipeline instance.
- Raises:
ValueError – If score configuration is missing.
- class tether.core.pipeline.Pipeline(preprocess_config, block_config, score_config, filter_config, decide_config)
Bases: object
Executable linkage pipeline.
- Parameters:
preprocess_config (PreprocessConfig | None)
block_config (BlockConfig | None)
score_config (ScoreConfig)
filter_config (FilterConfig)
decide_config (DecideConfig)
Result
- class tether.core.result.LinkageResult(matches, diagnostics, left, right, candidate_pairs, filtered_pairs)
Bases: object
Container for linkage results.
- Parameters:
matches (DataFrame) – DataFrame with matched pairs.
diagnostics (LinkageDiagnostics) – Linkage diagnostics.
left (DataFrame) – Original left DataFrame.
right (DataFrame) – Original right DataFrame.
candidate_pairs (DataFrame) – Candidate pairs after blocking.
filtered_pairs (DataFrame) – Pairs after filtering.
- diagnostics: LinkageDiagnostics
- inspect(margin_threshold=0.1)
Generate an inspection report for this result.
- Parameters:
margin_threshold (float) – Threshold for identifying ambiguous pairs.
- Return type:
InspectionReport
- Returns:
InspectionReport with detailed analysis.
Score
Scoring module for pairwise comparisons.
- class tether.score.Comparison(*args, **kwargs)
Bases: Protocol
Protocol for field comparison operations.
- class tether.score.DateComparison(column, tolerance_days=0, weight=1.0)
Bases: object
Date comparison with day tolerance.
- class tether.score.ExactComparison(column, weight=1.0)
Bases: object
Exact match comparison.
- class tether.score.NumericComparison(column, tolerance=0.0, weight=1.0, scale='linear')
Bases: object
Numeric comparison with tolerance.
- class tether.score.PairwiseScorer(comparisons)
Bases: object
Compute pairwise similarity scores for candidate pairs.
- Parameters:
comparisons (list[Comparison])
- class tether.score.StringComparison(column, algorithm='jaro_winkler', weight=1.0)
Bases: object
String similarity comparison using fuzzy matching algorithms.
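The comparison classes above each take a weight, which suggests that per-field similarities are combined into one pair score. The sketch below shows one plausible way to do that with the standard library. Everything here is an assumption: the function names are hypothetical, `difflib.SequenceMatcher` stands in for Jaro-Winkler (which the standard library does not provide), the "linear" scale is read as a linear fall-off to zero at the tolerance, and the weighted mean is a guess at how weights are applied.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    # difflib's ratio is a stand-in for Jaro-Winkler, not the same measure.
    return SequenceMatcher(None, a, b).ratio()


def exact_similarity(a, b) -> float:
    return 1.0 if a == b else 0.0


def numeric_similarity(a: float, b: float, tolerance: float) -> float:
    # Assumed "linear" scale: 1.0 at zero difference, 0.0 at or beyond the tolerance.
    if tolerance <= 0:
        return 1.0 if a == b else 0.0
    return max(0.0, 1.0 - abs(a - b) / tolerance)


def weighted_score(parts) -> float:
    """Combine (similarity, weight) pairs into a weighted mean (assumed rule)."""
    total_weight = sum(weight for _, weight in parts)
    return sum(sim * weight for sim, weight in parts) / total_weight


left = {"name": "jon smith", "zip": "94110", "age": 41}
right = {"name": "john smith", "zip": "94110", "age": 43}

pair_score = weighted_score([
    (string_similarity(left["name"], right["name"]), 2.0),
    (exact_similarity(left["zip"], right["zip"]), 1.0),
    (numeric_similarity(left["age"], right["age"], tolerance=4), 1.0),
])
```

With these assumptions a pair score always lies between 0.0 and 1.0, which fits the `min_score` and `threshold` defaults used elsewhere in this reference.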
Comparisons
- class tether.score.comparisons.StringComparison(column, algorithm='jaro_winkler', weight=1.0)
Bases: object
String similarity comparison using fuzzy matching algorithms.
- class tether.score.comparisons.ExactComparison(column, weight=1.0)
Bases: object
Exact match comparison.
- class tether.score.comparisons.NumericComparison(column, tolerance=0.0, weight=1.0, scale='linear')
Bases: object
Numeric comparison with tolerance.
Block
Blocking module for reducing comparison space.
- class tether.block.Blocker(*args, **kwargs)
Bases: Protocol
Protocol for blocking strategies.
- class tether.block.Crosswalk(mapping)
Bases: object
Mapping between blocking key values.
- class tether.block.FieldBlocker(on, crosswalk=None)
Bases: object
Block on one or more fields with optional crosswalk.
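Blocking restricts scoring to pairs that share a key, and a crosswalk lets differently spelled key values land in the same block. A minimal sketch of the idea on lists of dicts follows; `block_pairs` is a hypothetical helper, and treating the crosswalk as a map from left key values to right key values is an assumption about direction.

```python
from collections import defaultdict


def block_pairs(left, right, on, crosswalk=None):
    """Return (left_index, right_index) pairs whose blocking keys agree.

    `left` and `right` are lists of dicts. `crosswalk` maps a left key value
    to the right key value it should be treated as (assumed direction).
    """
    crosswalk = crosswalk or {}

    # Index the right side by key so each left record looks up its block directly.
    right_index = defaultdict(list)
    for j, row in enumerate(right):
        right_index[row[on]].append(j)

    pairs = []
    for i, row in enumerate(left):
        key = crosswalk.get(row[on], row[on])  # translate the key if it is mapped
        for j in right_index.get(key, []):
            pairs.append((i, j))
    return pairs


left = [{"state": "CA"}, {"state": "Calif."}, {"state": "NY"}]
right = [{"state": "CA"}, {"state": "NY"}, {"state": "TX"}]

pairs = block_pairs(left, right, on="state", crosswalk={"Calif.": "CA"})
pairs_without_crosswalk = block_pairs(left, right, on="state")
```

Here blocking yields three candidate pairs instead of the nine in the full cross product, and the crosswalk recovers the "Calif." record that plain key equality would miss.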
Decide
Decision rule implementations.
- class tether.decide.DecisionRule(*args, **kwargs)
Bases: Protocol
Protocol for matching decision algorithms.
- class tether.decide.GreedyDecision
Bases: object
Greedy matching selecting best global pair first.
Iteratively selects the highest-scoring unmatched pair until no valid pairs remain.
- class tether.decide.HungarianDecision
Bases: object
Optimal assignment using the Hungarian algorithm.
Maximizes total matching score while ensuring each record is matched at most once.
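The two decision rules can disagree on the same scores. The sketch below contrasts a greedy pass with an exhaustive search for the best one-to-one assignment on a small square score matrix. It is illustrative only: `greedy_match` and `optimal_match` are hypothetical names, and the brute-force search over permutations is a stand-in that reaches the same optimum the Hungarian algorithm would, without being that algorithm.

```python
from itertools import permutations


def greedy_match(scores):
    """Pick the highest-scoring pair, retire both records, and repeat."""
    ranked = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        key=lambda t: -t[0],
    )
    used_left, used_right, matches = set(), set(), []
    for s, i, j in ranked:
        if i not in used_left and j not in used_right:
            matches.append((i, j))
            used_left.add(i)
            used_right.add(j)
    return sorted(matches)


def optimal_match(scores):
    """Brute-force the one-to-one assignment with the highest total score.

    Only sensible for tiny square inputs; a real solver avoids enumeration.
    """
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda p: sum(scores[i][p[i]] for i in range(n)),
    )
    return [(i, best[i]) for i in range(n)]


# Rows are left records, columns are right records.
scores = [
    [0.90, 0.85],
    [0.88, 0.10],
]
greedy = greedy_match(scores)
optimal = optimal_match(scores)
```

Greedy takes the 0.90 pair first and is then forced into the 0.10 pair, while the optimal assignment gives up 0.90 to take 0.85 and 0.88 together.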
Filter
Filtering module for removing low-quality pairs.
- class tether.filter.Filter(*args, **kwargs)
Bases: Protocol
Protocol for pair filtering strategies.
- class tether.filter.MarginFilter(margin)
Bases: object
Remove ambiguous matches based on score margin.
For each left record, removes the best match if the margin to the second-best match is below the threshold.
- Parameters:
margin (float)
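The margin rule described for MarginFilter can be sketched on plain tuples. This is a literal reading of the description, not tether's code: `margin_filter` is a hypothetical function, pairs are `(left_id, right_id, score)` tuples, and keeping the remaining pairs of a left record after its ambiguous best match is removed is an assumption.

```python
from collections import defaultdict


def margin_filter(pairs, margin):
    """Drop a left record's best pair when the runner-up is within `margin`."""
    by_left = defaultdict(list)
    for pair in pairs:
        by_left[pair[0]].append(pair)

    kept = []
    for left_id, group in by_left.items():
        group = sorted(group, key=lambda p: p[2], reverse=True)
        # A single candidate has no runner-up, so it is never ambiguous.
        if len(group) > 1 and group[0][2] - group[1][2] < margin:
            group = group[1:]  # best match is ambiguous, so remove it
        kept.extend(group)
    return kept


pairs = [
    ("a", "x", 0.95),
    ("a", "y", 0.93),
    ("b", "x", 0.90),
    ("b", "z", 0.40),
]
kept = margin_filter(pairs, margin=0.05)
```

Record "a" loses its top pair because 0.95 and 0.93 are closer than the margin, while record "b" keeps both pairs because its best match leads by 0.50.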
Preprocess
Preprocessing module for data normalization.
- class tether.preprocess.MissingHandler(policy='skip', fill_value='', columns=None)
Bases: object
Handle missing values in DataFrames.
- class tether.preprocess.Preprocessor(*args, **kwargs)
Bases: Protocol
Protocol for preprocessing operations.
Deduplicate
Deduplication module for within-table duplicate removal.
- class tether.deduplicate.ClusterDeduplicator(comparisons, threshold=0.9, margin=0.1, timestamp_column=None, block_on=None)
Bases: object
Remove within-table duplicates using connected components.
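Connected components turn pairwise "these two rows are duplicates" decisions into groups, so that if A matches B and B matches C, all three fall into one group. A minimal union-find sketch of that grouping step is shown below; `duplicate_groups` is a hypothetical helper, and how tether chooses which row to keep within a group is not covered here.

```python
def duplicate_groups(n_records, duplicate_pairs):
    """Group record indices into connected components with union-find."""
    parent = list(range(n_records))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root to the smaller one for a stable result.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for record in range(n_records):
        groups.setdefault(find(record), []).append(record)
    return sorted(groups.values())


groups = duplicate_groups(6, [(0, 1), (1, 2), (4, 5)])
```

Records 0 and 2 are never compared directly but still end up together through record 1, and record 3 stays in a group of its own.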
- class tether.deduplicate.DeduplicationReport(original_count, kept_count, dropped_as_duplicate, dropped_as_indistinguishable, groups_found, largest_group_size)
Bases: object
Report on deduplication results.
- class tether.deduplicate.Deduplicator(*args, **kwargs)
Bases: Protocol
Protocol for within-table deduplication.
Inspect
Inspection module for linkage diagnostics and reports.
- class tether.inspect.InspectionReport(diagnostics, ambiguous_pairs=<factory>, unmatched_left=<factory>, unmatched_right=<factory>)
Bases: object
Detailed inspection report for linkage results.
- Parameters:
diagnostics (LinkageDiagnostics) – Linkage diagnostics.
ambiguous_pairs (DataFrame) – Pairs with close scores that may be ambiguous.
unmatched_left (DataFrame) – Left records that were not matched.
unmatched_right (DataFrame) – Right records that were not matched.
- diagnostics: LinkageDiagnostics
- class tether.inspect.LinkageDiagnostics(n_left, n_right, n_candidate_pairs, n_filtered_pairs, n_matches, match_rate_left, match_rate_right, score_stats)
Bases: object
Diagnostic statistics for linkage results.
- Parameters:
n_left (int) – Number of records in left DataFrame.
n_right (int) – Number of records in right DataFrame.
n_candidate_pairs (int) – Number of candidate pairs after blocking.
n_filtered_pairs (int) – Number of pairs after filtering.
n_matches (int) – Number of final matches.
match_rate_left (float) – Proportion of left records matched.
match_rate_right (float) – Proportion of right records matched.
score_stats (dict[str, float]) – Score distribution statistics.
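The two match-rate fields are proportions of each input table. A minimal sketch of the arithmetic, assuming one-to-one matches so that `n_matches` counts distinct records on both sides; `match_rates` is a hypothetical helper, and returning 0.0 for an empty table is an assumption.

```python
def match_rates(n_left, n_right, n_matches):
    """Return (match_rate_left, match_rate_right) as proportions in [0, 1]."""
    rate_left = n_matches / n_left if n_left else 0.0
    rate_right = n_matches / n_right if n_right else 0.0
    return rate_left, rate_right


# 40 matches between a 200-row left table and a 50-row right table.
rate_left, rate_right = match_rates(n_left=200, n_right=50, n_matches=40)
```

The two rates differ whenever the tables differ in size, so a low left rate alongside a high right rate usually just means the left table is the larger one.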
- tether.inspect.compute_diagnostics(left, right, candidate_pairs, filtered_pairs, matches)
Compute diagnostic statistics for linkage results.
- Return type:
LinkageDiagnostics
- Returns:
LinkageDiagnostics with computed statistics.
Multipass
Multi-pass linkage module.
- class tether.multipass.MultiPassOrchestrator
Bases: object
Orchestrate multi-pass record linkage.
Runs multiple passes with progressively relaxed thresholds, removing matched records between passes for higher precision.
- run(left, right, passes, comparisons, block_on=None, crosswalk=None, preprocess=True)
Execute multi-pass linkage.
- Parameters:
left (DataFrame) – Left DataFrame to link.
right (DataFrame) – Right DataFrame to link.
passes (list[PassConfig] | list[dict[str, float | str]]) – List of pass configurations.
comparisons (list[Comparison]) – Comparison operations for scoring.
block_on (str | list[str] | None) – Optional field(s) for blocking.
crosswalk (Crosswalk | dict[str, str] | None) – Optional crosswalk mapping.
preprocess (bool) – Whether to preprocess text columns.
- Return type:
LinkageResult
- Returns:
Combined LinkageResult from all passes.
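The multi-pass idea (strict threshold first, then looser ones, with matched records removed between passes) can be sketched on pre-scored pairs. This is a simplified illustration rather than the orchestrator's implementation: `multipass` is a hypothetical function, each pass uses a greedy rule for brevity, and pairs are `(left_id, right_id, score)` tuples instead of DataFrames.

```python
def multipass(scored_pairs, thresholds):
    """Run greedy passes from the strictest to the loosest threshold.

    Records matched in an earlier pass are excluded from later passes.
    Each match is returned as (left_id, right_id, score, pass_number).
    """
    matched_left, matched_right, matches = set(), set(), []
    for pass_number, min_score in enumerate(thresholds, start=1):
        candidates = sorted(
            (
                p for p in scored_pairs
                if p[2] >= min_score
                and p[0] not in matched_left
                and p[1] not in matched_right
            ),
            key=lambda p: p[2],
            reverse=True,
        )
        for left_id, right_id, score in candidates:
            # A record may have been claimed earlier in this same pass.
            if left_id in matched_left or right_id in matched_right:
                continue
            matches.append((left_id, right_id, score, pass_number))
            matched_left.add(left_id)
            matched_right.add(right_id)
    return matches


scored = [("a", "x", 0.97), ("b", "x", 0.80), ("b", "y", 0.75), ("c", "z", 0.60)]
result = multipass(scored, thresholds=[0.9, 0.7])
```

Because "x" is claimed by "a" in the strict first pass, "b" cannot take it in the second pass even though that pair scores higher than "b" with "y"; this is how earlier, stricter passes protect precision.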
- class tether.multipass.PassConfig(min_score, method='hungarian', margin=None)
Bases: object
Configuration for a single pass in multi-pass matching.
- tether.multipass.precision_first(threshold=0.9)
Create a precision-first single-pass strategy.
- Parameters:
threshold (float) – High threshold for precision.
- Return type:
list[PassConfig]
- Returns:
Single PassConfig list for high-precision matching.