Saturday, June 23, 2018

Soundex algorithm for searching people name (Phonetic search)


Now a days search algorithms is a common feature in Artificial Intelligence.
If you want search for a specific word then you may get results along with similar words in various search engines like google.

If you will search a name from social media like "Facebook" or "Twitter" then it will get display a list of similar names but having slightly different spellings.

Phonetic search is a type of search where the string or words have the similar pronunciation. In many applications there will have a features to search any words or names where it will match exactly or similar word matches.

This search will be performed in a secure way i,e instead of using a plain text in searching process it will use the character encoded code.
For Soundex phonetic algorithm it will generate four characters encoded code and based on this encoded code the search will be performed.



Challenge:  
Suppose we have stored a customer name (last name,first name ,middle name ) into database and data in this column is in encrypted format and our application has a functionality for the operator to do a name search so that when the operation enters the customer name in the order last name ,first name and middle name, the database needs to search for it. The name would not be the exact match in last name, first name and middle name - so a like query is required.

Because the column has encrypted, you can not directly apply like operator on the concatenation of three names(last name,first name ,middle name) , Now the challenge is how can you search the names using like query from the encrypted data.

Solution 1: Create three different columns which will be used for storing encrypted values into them(last name,first name ,middle name) and provide three input search fields to the operator for searching the names. Encrypt the input text and search the encrypted value from the respective columns(last name,first name ,middle name).


Solution 2: Create six different columns , where three columns will be used for storing the encrypted values and three columns will be used to store the HMAC hash values for the names and provide three input search fields to the operator for searching the names. Get the HMAC hash value from provided input text and search that hash value from the respective columns(last name,first name ,middle name).

Solution 3: Create two different columns , where one column will be used for storing the encrypted value(last name + first name + middle name) and the second column will be used to store the Soundex code for the customer name (last name + first name + middle name)  

For Solution 1 and 2, You would have to encrypt the values while concatenating and then apply the like operator - which will be just too slow considering the size of the database (million rows),also you can not index encrypted columns. You need to provide three different input text fields to restrict the searching criteria(last name,first name ,middle name). 

For Solution 3, You need to provide a single text field to search a customer name and you only have to get the Soundex code for the input text and search it from the database,you have to use like operator if Soundex code of all the names(last name,first name ,middle name) have stored into a single column.


CustomerID
Surname
Phonic
1
SMITH
S530
2
JOHNSON
J525
3
WILLIAMS
W452
4
JONES
J520
5
BROWN
B650
6
DAVIS
D120

Note : Keyed hash(HMAC) value won't be searchable in like operator , encrypted column hash value will only be applicable for the exactly match condition not for like operator.

Before move into phonetic search ,we need to understand some words. lets get the knowledge about them before jumping into the algorithm.

Phonetic is a wing of linguistics that deals with the sounds of human speech. Basically it is more about the word that you pronounce. In the case of a phonetic search, it is a technique to look up a word with the exact spelling along with the words having similar sounds.
  •     Caret and Carat
  •     Nelson, Neilson and Neelson
  •     Neekita and Nikita
  •     Cup and Cop
Homophone is each of two or more words having the same pronunciation but different meanings, origins, or spelling (or) each of a set of symbols denoting the same sound or group of sounds.
  • Steal and  Steel
  • Some and  Sum
  • Suite and  Sweet
  • Sole and  Soul
  • Son and Sun








Phonetic Algorithm is an algorithm for indexing of words by their pronunciation. Most phonetic algorithms were developed for use with the English language
  • Soundex
  • Metaphone, Double Metaphone, Metaphone 3
  • Daitch–Mokotoff Soundex
  • Cologne phonetics
  • New York State Identification and Intelligence System (NYSIIS)
  • Match Rating Approach
  • Caverphone
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. Soundex converts an alphanumeric string to a four-character code that is based on how the string sounds when spoken.
  • Robert and  Rupert - R163
  • Steal and  Steel - S340
  • Sridhar , Shridhar and Shreedhar - S636
Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation.
  • Assistance and  Assistants - ASSTN 
  • Katherine and Catherine -  K0RN 
  • Steal and Steel - STL
  • Jena , Jina and  Jeena -  JN
Use of Phonetic Search
  • Phonetic Search is used in different types of word editors for spell checkers , so that an editor can show a list of relative other words (same or similar Metaphone)  whenever there will have any spelling mistakes.
  • Search functionality will use phonetic algorithms to find results that don't match exactly the term(s) used in the search. People can search their names that either contains only the identical elements like Wilson (or) Wylson, Katherine (or) Catherine,  Jena (or) Jina (or) Jeena.
There is a number of algorithms and tools or libraries that will help you for more advanced identical name matching.

Soundex
It is a standard and is very commonly used phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones (pronounced the same as another word but differs in meaning, and may differ in spelling) to be encoded to the same representation so that they can be matched despite minor differences in spelling.

Metaphone
This algorithm was developed by Lawrence Philips in 1990 for encoding words correspondng to English pronunciation rules. It is considered to be better than Soundex. In this case words are grouped and the resulting value is also a word unlikely in the case of soundex. This algorithm seems to be more complicated.

Double Metaphone
The developer of the Metaphone algorithm provided an improved version called "Doube Metaphone" in the year 2000 by providing support to other European languages. It is called "Double" because it provides both a primary and a secondary code for a word and code can be up to 4 characters. The Double Metaphone rule engine is a bit more complex than the others.

Metaphone 3
The same developer of Metaphone Lawrence Philips provided an improved version of an algorithm called Metaphone 3 in 2009. In this algorithm, various sounds like soft and hard were taken into consideration. It provided more support to Slavik languages like Russian.

Caverphone
The Caverphone algorithm was developed by David Hood in 2002 as part of a New Zealand project called "Caversham Project" to match the data in the old and new electoral lists. Although this algorithm is applicable for English words, it provides much more specific recognition of accents of words of New Zealand.

NYSIIS
This algorithm was developed in 1970 as part of the "New York State Identification and Intelligence System". It promises to provide better result and accuracy, up to 2.7%, over the Soundex algorithm. Again it is more specific to American names.

Daitch-Mokotoff
This algorithm was developed by two Jewish genealogists Gary Mokotoff and Randy Daitch in the year 1985. This algorithm is similar to Soundex but provides more accuracy for Russian and Jewish names.

Soundex Algorithim
This algorithm was developed by Robert Russell in 1910 for the words in English.As per this algorithm, words are compared based upon their index value. The main principle behind this algorithm is that consonants are grouped depending on the ordinal numbers and finally encoded into a value against which others are matched. It aims to find a code for every word by above process which is called soundex code.

The complete algorithm to find soundex code is as below:
  1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
  2. Replace consonants with digits as follows (after the first letter):
    b, f, p, v → 1
    c, g, j, k, q, s, x, z → 2
    d, t → 3
    l → 4
    m, n → 5
    r → 6
  3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by ‘h’ or ‘w’ are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
  4. Iterate the previous step until you have one letter and three numbers. If you have too few letters in your word that you can’t assign three numbers, append with zeros until there are three numbers. If you have more than 3 letters, just retain the first 3 numbers.

Java Implementation : Soundex
public class Soundex {
 public static String soundex(String s) {
  char[] x = s.toUpperCase().toCharArray();
  char firstLetter = x[0];

  // Step - 2 
  // convert letters to numeric code
  for (int i = 0; i < x.length; i++) {
   switch (x[i]) {

   case 'B':
   case 'F':
   case 'P':
   case 'V':
    x[i] = '1';
    break;

   case 'C':
   case 'G':
   case 'J':
   case 'K':
   case 'Q':
   case 'S':
   case 'X':
   case 'Z':
    x[i] = '2';
    break;

   case 'D':
   case 'T':
    x[i] = '3';
    break;

   case 'L':
    x[i] = '4';
    break;

   case 'M':
   case 'N':
    x[i] = '5';
    break;

   case 'R':
    x[i] = '6';
    break;

   default:
    x[i] = '0';
    break;
   }
  }

  // remove duplicates
  // Step - 1
  String output = "" + firstLetter;

  // Step - 3
  for (int i = 1; i < x.length; i++)
   if (x[i] != x[i - 1] && x[i] != '0')
    output += x[i];

  // Step - 4
  // pad with 0's or truncate
  output = output + "0000";
  return output.substring(0, 4);
 }

 public static void main(String[] args) {
  System.out.println("Soundex Code for Wilson is: " + soundex("Wilson"));
  System.out.println("Soundex Code for Wylson is: " + soundex("Wylson"));

  System.out.println("Soundex Code for beer is: " + soundex("beer"));
  System.out.println("Soundex Code for bear is: " + soundex("bear"));
  System.out.println("Soundex Code for bearer is: " + soundex("bearer"));
 }
}
Output :
Soundex Code for Wilson is: W425
Soundex Code for Wylson is: W425
Soundex Code for beer is: B600
Soundex Code for bear is: B600
Soundex Code for bearer is: B660

Apache "commons-codec" provides the implementations for the following algorithms.
  •     Metaphone
  •     Double Metaphone
  •     Soundex
Above three algorithms are packed into apache commons codec JAR, declared as Maven dependency:  commons-codec-1.5.jar

    commons-codec
    commons-codec
    1.5

Full Example :

package com.sjena.helloswitch;

import org.apache.commons.codec.StringEncoderComparator;
import org.apache.commons.codec.language.DoubleMetaphone;
import org.apache.commons.codec.language.Metaphone;
import org.apache.commons.codec.language.Soundex;

public class LanguageCodecTest {

 public static void main(String args[]) {

  Soundex sndx = new Soundex();

  System.out.println("Soundex Code for Erickson is: " + sndx.encode("Erickson"));
  System.out.println("Soundex Code for Erikson is: " + sndx.encode("Erikson"));
  System.out.println("Soundex Code for Ericson is: " + sndx.encode("Ericson"));
  System.out.println("Soundex Code for Ericksen is: " + sndx.encode("Ericksen"));
  System.out.println("Soundex Code for Ericsen is: " + sndx.encode("Ericsen"));

  System.out.println("Soundex Code for Wilson is: " + sndx.encode("Wilson"));
  System.out.println("Soundex Code for Wylson is: " + sndx.encode("Wylson"));

  System.out.println("Soundex Code for Sridhar is: " + sndx.encode("Sridhar"));
  System.out.println("Soundex Code for Shridhar is: " + sndx.encode("Shridhar"));
  System.out.println("Soundex Code for Shreedhar is: " + sndx.encode("Shreedhar"));

  System.out.println("Soundex Code for Jena is: " + sndx.encode("Jena"));
  System.out.println("Soundex Code for Jeena is: " + sndx.encode("Jeena"));
  System.out.println("Soundex Code for Jina is: " + sndx.encode("Jina"));

  System.out.println("Soundex Code for OBrien is: " + sndx.encode("OBrien"));
  System.out.println("Soundex Code for O'Brien is: " + sndx.encode("O'Brien"));
  System.out.println("Soundex Code for OBr'ien is: " + sndx.encode("OBr'ien"));
  System.out.println("Soundex Code for OBrie'n is: " + sndx.encode("OBrie'n"));
  System.out.println("Soundex Code for OB'rien is: " + sndx.encode("OB'rien"));

  // Use the StringEncoderComparator to compare these two Strings
  StringEncoderComparator comparator1 = new StringEncoderComparator(sndx);
  System.out.println("Wilson and Wylson are same :" + comparator1.compare("Wilson", "Wylson"));
  System.out.println("Erickson and Ericson are same :" + comparator1.compare("Erickson", "Ericson"));
  System.out.println("Sridhar and Shreedhar are same :" + comparator1.compare("Sridhar", "Shreedhar"));

  Metaphone metaphone = new Metaphone();
  System.out.println("Metaphone Code for Wilson is: " + metaphone.encode("Wilson"));
  System.out.println("Metaphone Code for Wylson is: " + metaphone.encode("Wylson"));

  System.out.println("Metaphone Code for Sridhar is: " + metaphone.encode("Sridhar"));
  System.out.println("Metaphone Code for Shridhar is: " + metaphone.encode("Shridhar"));
  System.out.println("Metaphone Code for Shreedhar is: " + metaphone.encode("Shreedhar"));

  DoubleMetaphone doubleMetaphone = new DoubleMetaphone();
  System.out.println("DoubleMetaphone Code for Wilson is: " + doubleMetaphone.encode("Wilson"));
  System.out.println("DoubleMetaphone Code for Wylson is: " + doubleMetaphone.encode("Wylson"));

  System.out.println("DoubleMetaphone Code for Sridhar is: " + doubleMetaphone.encode("Sridhar"));
  System.out.println("DoubleMetaphone Code for Shridhar is: " + doubleMetaphone.encode("Shridhar"));
  System.out.println("DoubleMetaphone Code for Shreedhar is: " + doubleMetaphone.encode("Shreedhar"));

  // Use the StringEncoderComparator to compare these two Strings
  StringEncoderComparator comparator2 = new StringEncoderComparator(doubleMetaphone);
  System.out.println("Auto and Otto are same :" + comparator2.compare("Auto", "Otto"));
  System.out.println("Double Metaphone primary code for Sridhar: " + doubleMetaphone.doubleMetaphone("Sridhar"));
  System.out.println("Double Metaphone secondary code for Sridhar: " + doubleMetaphone.doubleMetaphone("Sridhar", true));

 }
}

Output :
Soundex Code for Erickson is: E625
Soundex Code for Erikson is: E625
Soundex Code for Ericson is: E625
Soundex Code for Ericksen is: E625
Soundex Code for Ericsen is: E625
Soundex Code for Wilson is: W425
Soundex Code for Wylson is: W425
Soundex Code for Sridhar is: S636
Soundex Code for Shridhar is: S636
Soundex Code for Shreedhar is: S636
Soundex Code for Jena is: J500
Soundex Code for Jeena is: J500
Soundex Code for Jina is: J500
Soundex Code for OBrien is: O165
Soundex Code for O'Brien is: O165
Soundex Code for OBr'ien is: O165
Soundex Code for OBrie'n is: O165
Soundex Code for OB'rien is: O165
Wilson and Wylson are same : 0
Erickson and Ericson are same :0
Sridhar and Shreedhar are same :0
Metaphone Code for Wilson is: WLSN
Metaphone Code for Wylson is: LSN
Metaphone Code for Sridhar is: SRTH
Metaphone Code for Shridhar is: XRTH
Metaphone Code for Shreedhar is: XRTH
DoubleMetaphone Code for Wilson is: ALSN
DoubleMetaphone Code for Wylson is: ALSN
DoubleMetaphone Code for Sridhar is: SRTR
DoubleMetaphone Code for Shridhar is: XRTR
DoubleMetaphone Code for Shreedhar is: XRTR
Auto and Otto are same :0
Double Metaphone primary code for Sridhar: SRTR
Double Metaphone secondary code for Sridhar: SRTR

Database Feature : SOUNDEX

It is now a standard feature of major database like Oracle,PostgreSQL,DB2,MySQL, Ingres, MS SQL Server and some major word editors.

Oracle : The SOUNDEX function can be used in the following versions of Oracle/PLSQL:
Oracle 12c, Oracle 11g, Oracle 10g, Oracle 9i, Oracle 8

Query: SELECT last_name, first_name FROM employees WHERE SOUNDEX(last_name) = SOUNDEX('SMYTHE');

Output :

LAST_NAME  FIRST_NAME
Smith                  Lindsey
Smith                  William

  • SOUNDEX('tech on the net') = 'T253'
  • SOUNDEX('TECH ON THE NET') = 'T253'
  • SOUNDEX('apples') = 'A142'
  • SOUNDEX('apples are great') = 'A142'
  • SOUNDEX('applus') = 'A142'

11 comments:

  1. If you have a band/group name like *Nsync or 'N Sync then Soundex is *525 or '525

    Is it valid to have a Soundex that does not start with a Letter?

    ReplyDelete

  2. I have been reading for the past two days about your blogs and topics, still on fetching! Wondering about your words on each line was massively effective. Techno-based information has been fetched in each of your topics. Sure it will enhance and fill the queries of the public needs. Feeling so glad about your article. Thanks…!
    magento training course in chennai
    magento training institute in chennai
    magento 2 training in chennai
    magento development training
    magento 2 course
    magento developer training

    ReplyDelete
  3. Thanks for sharing this blog. Keep it up and best of luck for your future blogs and posts.
    Digital Marketing Course in Kolkata

    ReplyDelete
  4. I’ve been surfing online more than three hours today, yet I never found any interesting article like yours. It’s pretty worth enough for me. In my opinion, if all webmasters and bloggers made good content as you did, the web will be a lot more useful than ever before. 신용카드 현금화

    ReplyDelete
  5. This is really a nice and informative, containing all information and also has a great impact on the new technology. Thanks for sharing it, Best Seo Tips

    ReplyDelete