Import matching algorithms
When importing data, it is very common that there are slight differences between names, place names, and other reference data.
ActivityInfo has been developed to automatically match reference data based on a set of rules that take into account common differences between names that arise because of differences in transliteration and spelling.
At present, these rules are only applied to names written in the Latin script. When matching names in other scripts, such as Arabic or Cyrillic, exact matching is used. We look forward to developping comparable matching algorithms for non-Latin scripts in the future.
Latin name matching
The Latin Name scorer algorithm helps identify pairs of place names, written in a Latin script, that are likely to be the same, but transliterated differently.
These differences can arise from a number of random processes:
The same sound may be written differently, perhaps because of a different transliteration scheme, for example:
- [ou]adi ⇒ [w]adi
- zou[q] bha[nn]ine ⇒ zou[k] bha[n]ine
- z[ai]toun ⇒ z[ei]toun[e]
Sounds can drift regionally and over time, and these differences result in new spellings
Names that include parts of speech can be reordered or discarded arbitrarily:
- "Santa Rosa City" ⇒ "City of Santa Rosa"
- "Commune de Goumera" ⇒ "Goumera"
Words can be split or joined
- "Bara Sara" ⇒ "Barassara"
- "Nema Badenyakafo" ⇒ "Nema Badenya Kafo"
In order to be useful in matching large data sets, we need a low false positive rate, otherwise analysts will be overwhelmed with hundreds or thousands of garbage matches to review.
Traditional edit-distance metrics, such as Levenshtein Distance or Jaro-Winkler similarity tend to produce far too many false positives. The strings AAIN
and ZEBDINE
, for example, have a Jaro-Winkler similarity of 0.60, despite there being no chance that ZEBDINE
is an alternate spelling of AAIN
.
ActivityInfo's Latin Name Scorer considers instead the probability that AAIN
could be transformed into ZEBDINE
by the processes listed above. Because we consider addition of consonants like Z
, B
, D
extremely unlikely, the Scorer returns a value of zero.
On the other hand, the string AIN
differs only in a single vowel, which is very common difference in transliteration, resulting in a score of 0.925.
Algorithm
ActivityInfo's Latin name matching algorithm works as follows:
First, each input string is upper-cased and stripped of any diacritical marks, including apostrophes when found between letters. For example:
- "Aïn-El-Jdeidé" ⇒
AIN-EL-JDEIDE
- "Aïn Kéni" ⇒
AIN KENI
- "N'Goutjina" ⇒
NGOUTJINA
Next, each input string is split into one or more "parts". All symbols such as "-", "/", "(", etc, are considered to start new parts, and digits will start a new part if preceded by characters. For example:
AIN-EL-JDEIDE
⇒ [AIN
,EL
,JDEIDE
]JUBBADA HOOSE (LOWER JUBA)
⇒ [JUBBADA
,HOOSE
,LOWER
,JUBA
]DOUGOUTENE 2
⇒ [DOUGOUTENE
,2
]2EME FORCE NAVALE
⇒ [2EME
,FORCE
,NAVALE
]
Given two input strings each broken into an array of parts, for example, [COMMUNE
, DE
, GOUMERA
] and [GOUMERA
, COMUNE
], we compute a similarity score for each pair of parts from the left and right:
COMMUNE | DE | GOUMERA | |
---|---|---|---|
GOUMERA | 0.0 | 0.00 | 1.0 |
COMUNE | 0.93 | 0.00 | 0.0 |
The "left" and "right" names are chosen such that the "left" name has fewer or equal parts than the right name.
The similarity score between two parts is computed depending on the type of parts:
- If both parts are numbers, the numbers are compared exactly. "1" and "2" have a similarity score of 0.0, while "1" and "1" have a score of 1.0.
- If both parts are alphabetic, the phonetic similarity score is computed. (See below)
- If one part is numeric and the other alphabetic, we check to see if the alphabetic part is a roman numeral and then compare exactly with the number. Otherwise the two parts have a similarity of zero.
With the similarity matrix in hand, we find the best matching column for each row in the matrix, ensuring that we match a column to at most one row.
Finally, we compute a score between 0 and 1 indicating how well the two names match. Specifically:
We define the numerator as the sum of the lengths of all matching parts, weighted by their similarity scores. In the example above, "COMMUNE" best matches "COMUNE" with a similarity of 0.93. We use the minimum length of the two parts, which in this case is 6 characters, weighted by 0.93, contributes 5.58 to the numerator.
"GOUMERA" matches "GOUMERA" exactly, so contributes 6 * 1.0 to the numerator.
The denominator includes lengths of the matching parts (6 + 6) as well as the unmatched "DE" (2).
Phonetic Similarity Algorithm
When computing the similarity of two alphabetic parts, ActivityInfo's Latin Word Distance algorithm is used to compute the distance between two parts, assumed to be words in Latin script.
Like the Levenstein algorithm, we compute the minimum number of "edits" -- substitutions, insertions, and deletions -- but instead of treating every character difference equally, each "edit" is assigned a cost based on the likelihood it resulted from a transliteration difference.
For example, vowels are often transliterated differently, so extra vowels or vowel substitutions are assigned a relatively low cost. In the case of MREISSE
and MRAISSE
, for example, substitution of E
with A
is assigned a cost of 0.25.
On the other hand, most consonant substitutions are very rare: the word RAS
is very unlikely to be transliterated as RAB
, and the algorithm assigns an infinite cost to this substitution.
Some consonant substitutions are common, including M
→ N
and K
→ Q
, and are assigned a cost of 0.5 and 0.25 respectively.
Additional rules have been developed based on datasets encountered in Afghanistan, Lebanon, Mali and the Philipines.