Theory¶
This section provides theoretical background on structure-aware record linkage.
The Problem¶
Standard record linkage finds a matching between source records and target records that maximizes total similarity score. Given a score matrix \(S \in \mathbb{R}^{n \times m}\) where \(S_{ij}\) is the similarity between source record \(i\) and target record \(j\), the Hungarian algorithm finds:
where \(\pi\) is a permutation (or partial permutation if \(n \neq m\)).
When records have hierarchical structure, this formulation ignores an important constraint: members of the same group should map to the same target group.
Two-Level Assignment¶
Structure-aware matching decomposes the problem into two levels:
Group-level assignment: Match source groups to target groups
Record-level assignment: Match records within matched groups
Given:
Source groups \(\mathcal{G}_S = \{g_1^S, g_2^S, \ldots\}\)
Target groups \(\mathcal{G}_T = \{g_1^T, g_2^T, \ldots\}\)
For each pair of groups \((g^S, g^T)\), compute the optimal within-group assignment score:
This gives a group-level score matrix \(W\). Apply Hungarian algorithm at the group level to find optimal group matching \(\Phi\):
The final record-level matching is the union of within-group assignments for matched groups.
Properties¶
Coherence guarantee: All members of a source group map to the same target group by construction.
Optimality within constraint: Given the coherence constraint, the matching maximizes total score. It is globally optimal among all structure-preserving matchings.
Score sacrifice: The total score may be lower than unconstrained Hungarian matching. This “sacrifice” measures the cost of structure preservation:
When to Use Structure-Aware Matching¶
Structure-aware matching is appropriate when:
Records have meaningful group structure that should be preserved
Downstream analysis requires coherent group assignments
Domain knowledge suggests members of source groups belong together
It may sacrifice individual match quality for group coherence. Use standard Hungarian matching when group structure doesn’t matter or when maximizing individual match accuracy is the primary goal.
Computational Complexity¶
Let \(n\) be the number of records, \(k\) the number of groups, and \(s\) the maximum group size.
Computing group scores: \(O(k^2 \cdot s^3)\) (Hungarian per pair)
Group-level assignment: \(O(k^3)\) (Hungarian on \(k \times k\) matrix)
Total: \(O(k^2 \cdot s^3 + k^3)\)
For typical hierarchical data where \(k \approx n/s\), this is \(O(n^2 s)\), competitive with standard Hungarian’s \(O(n^3)\).