← Back to team overview

dhis2-devs team mailing list archive

Matching strategies and algorithms

 

This ongoing (long) discussion on the OMRS list on name matching
strategies is also relevant for us - it is not only important for
patient, but also for matching facility names and villages/chiefdoms
etc with shapefiles for mapping.

Knut

---------- Forwarded message ----------
From: Jeff Rafter <jeffrafter@xxxxxxxxx>
Date: Wed, May 5, 2010 at 5:34 PM
Subject: Re: [OPENMRS-DEV] soundex architecture question
To: openmrs-devel-l@xxxxxxxxxxxxxxxxxx


> i think using using minimal edit distance from dictionary words  or names would be a better implementation if correcting word spelling is the goal. soundex needs to be fine tuned to bantu languages which no one has done successively yet.

Check out the implementation from earlier in this thread... the
Mateme/Ruby bantu soundex is working great in Malawi and it looks like
Dave (and others?) have already ported this to Java for work in
OpenMRS.

>
> On Tue, May 4, 2010 at 11:10 PM, Paul Biondich <pbiondich@xxxxxxxxxxxxxxx> wrote:
>>
>> Also, just another gentle reminder... there is another whole "approach" to matching mis-spelled or misrepresented names:  string sequence matching or comparator algorithms.
>> Some examples of this are:  Levenshtein Edit Distance (and edit distances in general), and Longest Common Substring (LCS), and the Jaro-Winkler methodology.
>> Wikipedia is your friend on these.

I think that Levenshtein distance would be tough for suggesting. Lots
of things would have the same distance but would be wildly different
right? I could be wrong on that though. I have used LCS with account
names in a previous life and it worked well, but mainly for things
like "Bank of America" and "Bank America". These aren't similar
sounding but are clearly related. I could see this having an impact on
names like the we discussed in the other thread: "Mike von der ohe Mc
Kay" versus "Mike McKay".
Secondarily, the person_name_code approach would be more difficult,
you would have to compare the typed in string to all of the names in
the database to determine the distance. We could probably use some
kind of Ur-name and mean distance but that would take some tweaking.
All the best
Jeff

>>
>> -Paul
>> On May 1, 2010, at 1:12 AM, David Thomas wrote:
>>
>> This is excellent.  I'm adding this to my notes.  As we move forward with kinyarwanda, it seems like an eventual logical step to try to abstract the lessons learned for other implementers working in non-european languages.  Its on my 'things not to forget about' list.
>>
>> d
>>
>> On 4/28/2010 6:27 AM, Jeff Rafter wrote:
>>
>> Awesome work Dave. I really like that you can add algorithms.
>> In thinking about this more it would be cool if *anyone* could add a new algorithm. The process is pretty simple and could be accomplished with a couple tables:
>> soundex_algorithm
>>   soundex_algorithm_id:int
>>   name:string
>>   uppercase_all_letters:boolean, default:true
>>   strip_punctuation:boolean, default:true
>>   keep_first_digit:boolean, default:true
>>   remove_digit_pairs, default:true
>>   remove_leading_zero, default:true
>>   length:int, default:4
>> soundex_algorithm_replacement
>>   soundex_algorithm_step_id:int
>>   soundex_algorithm_id:int # refers to the main algorithm
>>   sort_weight:int
>>   match:string
>>   value:string
>>
>> This could obviously be massaged some more. And I am more or less imagining there would be regex in the match value which might be over the heads of non-devs (although the very simple regex in this case might be okay).
>> The other potential problem would be that you may need to specify at what point you keep the first char (in the Bantu one we have to replace leading phonemes first). I could imagine in a language where there were spaces / ' in names that you might consider the next char a leading char as well. So this could be a crazy road to go down. Just an idea.
>>
>> Great work,
>> Jeff
>> On Wed, Apr 28, 2010 at 7:18 AM, David Thomas <pihdave@xxxxxxxxx> wrote:
>>>
>>> Hi.  The namephonetic module that i checked in a couple of days ago basically mirrors everything that's said here.  The one small data model change is that each row only stores one encoded name component, which is then labeled by a java enum (GIVEN_NAME, MIDDLE_NAME, FAMILY_NAME, FAMILY_NAME2).  The module AOP's around savePatient and calculates the code for fields that you specify using a global property, meaning that you can control what PersonName fields you want to encode, and you can even attach them to different soundex algorithms (folks in rwanda get a local name and a christian name, which are quite different).
>>>
>>> There's a little JSP page that gives you the ability to re-calculate all soundexes for all patients in the db, for migration purposes, or if you want to alter your soundex.
>>>
>>> And, i ported jeff's chichewa soundex directly into java, which is included in the module -- we needed a starting point for kinyarwanda.
>>>
>>> Finally, the module allows you to add your own algorithm.  All algorithms only have to be an extension of the StringEncoder class that lives in the apache commons codec jar file.  The module also includes all the algorithms that live in this jar file by default -- Soundex, RefinedSoundex, Metaphone, Double Metaphone, etc...
>>>
>>> The module however, doesn't affect the default patient lookup in openmrs.  I didn't connect them.  But the module does have a bunch of methods for finding a patient, which i'm using from a touchscreen registration module i'm working on.
>>>
>>> dave thomas
>>>
>>> On 4/27/2010 8:19 PM, Jeff Rafter wrote:
>>>
>>> For completeness, here is the code (obviously very simple implementation):
>>> http://github.com/jeffrafter/mateme/blob/master/lib/bantu_soundex.rb
>>> Also for completeness "we" here meant Rodney Tayamika and myself (with lots of help later from Evan). This was one of those
>>> situations that absolutely required someone with local knowledge.
>>> I also realized that in the final solution actually moved the soundexes out of person_name into a nearly identical table called
>>> person_name_code. This was done so that the name codes would not show up in other modules and in places that didn't
>>> make use of the preferred flag:
>>>  desc openmrs_1_5.person_name_code
>>>     -> ;
>>> +-------------------------+-------------+------+-----+---------+----------------+
>>> | Field                   | Type        | Null | Key | Default | Extra          |
>>> +-------------------------+-------------+------+-----+---------+----------------+
>>> | person_name_code_id     | int(11)     | NO   | PRI | NULL    | auto_increment |
>>> | person_name_id          | int(11)     | YES  | MUL | NULL    |                |
>>> | given_name_code         | varchar(50) | YES  | MUL | NULL    |                |
>>> | middle_name_code        | varchar(50) | YES  | MUL | NULL    |                |
>>> | family_name_code        | varchar(50) | YES  | MUL | NULL    |                |
>>> | family_name2_code       | varchar(50) | YES  |     | NULL    |                |
>>> | family_name_suffix_code | varchar(50) | YES  |     | NULL    |                |
>>> +-------------------------+-------------+------+-----+---------+----------------+
>>> 7 rows in set (1.16 sec)
>>> I am happy to fill in any more holes here, and am excited to see this possibly being modularized.
>>> Cheers,
>>> Jeff
>>>
>>> On Tue, Apr 27, 2010 at 6:02 PM, Jeff Rafter <jeffrafter@xxxxxxxxx> wrote:
>>>>
>>>> Hey All,
>>>> I am trying to catch up on this thread, somehow I missed it. We (PIH) have implemented Soundex in Mateme (the PIH/Baobab touchscreen system). When we did this re worked the phoneme set of traditional soundex to use Bantu phonemes instead (based on a very old missionary text in Malawi). This had the effect of grouping "l" and "r" for instance, a common cross-phoneme in Malawi. It works *great*. The code is very simple as well. Ultimately Evan Waters asked for a couple of adjustments based on real world clinic use after the first couple months.
>>>> Secondarily, the question of when to do this.
>>>> The best time to calculate the soundex is when you change the name. For us we had control over this in the touchscreen application, but embedding it into a module would be even better because then edits in OpenMRS proper would cause soundex refiguring. To save the soundex we simply add a second PersonName record with soundexes instead of names. This acts a giant soundex cache. When we search, we do a quick calculation on the soundex of the search string and search for the result in our names. Ultimately this makes name searching just as fast as before.
>>>> We had a back log of names so we had a quick one-time process that calculated all of the current names... from then on the system kept up with itself as it was just one additional thing to save whenever a name changed.
>>>> Jeff
>>>> On Wed, Apr 21, 2010 at 4:40 PM, Paul Biondich <pbiondich@xxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Dave:
>>>>> The traditional criticism of Soundex for our work is that it's algorithm was inferred from English names, so it's problematic when bringing it to a Swahili or Rwandan French name space.
>>>>> I know that Shaun Grannis is very up to speed on the various pros and cons of the phonetic "compression" algorithms, and I'll flag him to make sure he sees this and responds.
>>>>> Least Common Substring might be better given the circumstances (it tends to "hide" misspellings and typing transpositions, which are what you are likely trying to deal with).
>>>>> RE: running these algorithms on-the-fly vs. as a scheduled task... I think you need to determine first what mechanism(s) you want to use, and then measure the performance implications appropriately.  In practice, Shaun's work always pregenerates these...
>>>>> -Paul
>>>>> On Apr 21, 2010, at 11:04 PM, David Thomas wrote:
>>>>>
>>>>> Hi everyone.  We've started a discussion of whether or not to launch into development of a soundex module, or if other strategies might be more appropriate.    Here's the dialogue thus far:
>>>>>
>>>>>
>>>>> Me  (dave):
>>>>>
>>>>> Hey, quick question -- the soundex algorithm looks fairly processor intensive, to the point where i don't think this should be done in real-time for existing names in the database.  Do you pre-calculate soundex values for existing patient names, so that the algorithm is just applied to the search string, and then compared with the pre-calculated values?
>>>>>
>>>>> I feel like maybe this should be an openmrs module on its own, with a little data model of pre-calculated values for givenName, middleName,  familyName,  familyName2 and should support multiple implementations of SoundexAlgorithm.
>>>>>
>>>>> In case you haven't read it (mike, darius), here's the theory:
>>>>> http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
>>>>>
>>>>> Based on people's responses, i'll start (or not) the discussion of the data model.
>>>>>
>>>>>
>>>>> Darius:
>>>>>
>>>>>
>>>>> Interesting. I'd never read the algorithm before. More research:
>>>>>
>>>>> Another competing algorithm is metaphone, which has apparently been replaced by double metaphone. I don't know the specific algorithm, so I don't know if it's easy to translate for Bantu pronounciations.
>>>>> Apache's commons-codec has implementations of various phonetic algorithms (see API) including Soundex and RefinedSoundex (better for spellcheckers apparently) that allow you to specify the mapping. (It also has metaphone but no ability to change that algorithm.)
>>>>>
>>>>> I did a quick test of running the commons-codec soundex algorithm on all the (few) names in the PersonServiceTest-extranames.xml file in OpenMRS and got the following result:
>>>>>     Took 136 ms to do 6 person names with 16 components
>>>>>  Presumably the implication is that if you do 6000 names it will take 2+ minutes on a laptop, which is obviously not fast enough for real-time.
>>>>> So, the real question is whether we should be using a phonetic matching algorithm (in which case the module you describe is the only way to go) or if we should be considering leveraging the existing Patient Matching Module, which uses a much more sophisticated algorithm, probably has considered a real-time-check use case, but may not be complete. (That sounds like a question for the OpenMRS developers list. Anyone feel like sending that?)
>>>>> -Darius
>>>>>
>>>>> ________________________________
>>>>> Click here to unsubscribe from OpenMRS Developers' mailing list
>>>>>
>>>>> ________________________________
>>>>> Click here to unsubscribe from OpenMRS Developers' mailing list
>>>
>>> ________________________________
>>> Click here to unsubscribe from OpenMRS Developers' mailing list
>>>
>>> ________________________________
>>> Click here to unsubscribe from OpenMRS Developers' mailing list
>>
>> ________________________________
>> Click here to unsubscribe from OpenMRS Developers' mailing list
>>
>> ________________________________
>> Click here to unsubscribe from OpenMRS Developers' mailing list
>>
>> ________________________________
>> Click here to unsubscribe from OpenMRS Developers' mailing list
>
>
> --
> ========================================
> Mugisha Moses
> Lead Developer
> Appfrica Labs
> P.O. Box 1420 Kampala, Uganda
> http://appfrica.org
> skype name :  mossplix
> twitter: @mugisha
>
>
>
> ________________________________
> Click here to unsubscribe from OpenMRS Developers' mailing list

________________________________
Click here to unsubscribe from OpenMRS Developers' mailing list


-- 
Cheers,
Knut Staring