Import matching algorithms

When importing data, it is very common that there are slight differences between names, place names, and other reference data.

ActivityInfo has been developed to automatically match reference data based on a set of rules that take into account common differences between names that arise because of differences in transliteration and spelling.

At present, these rules are only applied to names written in the Latin script. When matching names in other scripts, such as Arabic or Cyrillic, exact matching is used. We look forward to developping comparable matching algorithms for non-Latin scripts in the future.

Latin name matching

The Latin Name scorer algorithm helps identify pairs of place names, written in a Latin script, that are likely to be the same, but transliterated differently.

These differences can arise from a number of random processes:

  1. The same sound may be written differently, perhaps because of a different transliteration scheme, for example:

    • [ou]adi ⇒ [w]adi
    • zou[q] bha[nn]ine ⇒ zou[k] bha[n]ine
    • z[ai]toun ⇒ z[ei]toun[e]
  2. Sounds can drift regionally and over time, and these differences result in new spellings

  3. Names that include parts of speech can be reordered or discarded arbitrarily:

    • "Santa Rosa City" ⇒ "City of Santa Rosa"
    • "Commune de Goumera" ⇒ "Goumera"
  4. Words can be split or joined

    • "Bara Sara" ⇒ "Barassara"
    • "Nema Badenyakafo" ⇒ "Nema Badenya Kafo"

In order to be useful in matching large data sets, we need a low false positive rate, otherwise analysts will be overwhelmed with hundreds or thousands of garbage matches to review.

Traditional edit-distance metrics, such as Levenshtein Distance or Jaro-Winkler similarity tend to produce far too many false positives. The strings AAIN and ZEBDINE, for example, have a Jaro-Winkler similarity of 0.60, despite there being no chance that ZEBDINE is an alternate spelling of AAIN.

ActivityInfo's Latin Name Scorer considers instead the probability that AAIN could be transformed into ZEBDINE by the processes listed above. Because we consider addition of consonants like Z, B, D extremely unlikely, the Scorer returns a value of zero.

On the other hand, the string AIN differs only in a single vowel, which is very common difference in transliteration, resulting in a score of 0.925.

Algorithm

ActivityInfo's Latin name matching algorithm works as follows:

First, each input string is upper-cased and stripped of any diacritical marks, including apostrophes when found between letters. For example:

Next, each input string is split into one or more "parts". All symbols such as "-", "/", "(", etc, are considered to start new parts, and digits will start a new part if preceded by characters. For example:

Given two input strings each broken into an array of parts, for example, [COMMUNE, DE, GOUMERA] and [GOUMERA, COMUNE], we compute a similarity score for each pair of parts from the left and right:

COMMUNE DE GOUMERA
GOUMERA 0.0 0.00 1.0
COMUNE 0.93 0.00 0.0

The "left" and "right" names are chosen such that the "left" name has fewer or equal parts than the right name.

The similarity score between two parts is computed depending on the type of parts:

With the similarity matrix in hand, we find the best matching column for each row in the matrix, ensuring that we match a column to at most one row.

Finally, we compute a score between 0 and 1 indicating how well the two names match. Specifically:

We define the numerator as the sum of the lengths of all matching parts, weighted by their similarity scores. In the example above, "COMMUNE" best matches "COMUNE" with a similarity of 0.93. We use the minimum length of the two parts, which in this case is 6 characters, weighted by 0.93, contributes 5.58 to the numerator.

"GOUMERA" matches "GOUMERA" exactly, so contributes 6 * 1.0 to the numerator.

The denominator includes lengths of the matching parts (6 + 6) as well as the unmatched "DE" (2).

Phonetic Similarity Algorithm

When computing the similarity of two alphabetic parts, ActivityInfo's Latin Word Distance algorithm is used to compute the distance between two parts, assumed to be words in Latin script.

Like the Levenstein algorithm, we compute the minimum number of "edits" -- substitutions, insertions, and deletions -- but instead of treating every character difference equally, each "edit" is assigned a cost based on the likelihood it resulted from a transliteration difference.

For example, vowels are often transliterated differently, so extra vowels or vowel substitutions are assigned a relatively low cost. In the case of MREISSE and MRAISSE, for example, substitution of E with A is assigned a cost of 0.25.

On the other hand, most consonant substitutions are very rare: the word RAS is very unlikely to be transliterated as RAB, and the algorithm assigns an infinite cost to this substitution.

Some consonant substitutions are common, including MN and KQ, and are assigned a cost of 0.5 and 0.25 respectively.

Additional rules have been developed based on datasets encountered in Afghanistan, Lebanon, Mali and the Philipines.