sql-serversql-server-2008t-sqlparsingfreetext

SQL script to extract country from free text field


I'm trying to create a SQL UDF or statement to parse a free text field and find the country name from it, but I'm unable to do so successfully.

To give you guys the full context, I've got a transactions table (tbltransactions below) which contains transaction details and one of the fields is this free text field. This should ideally contain Payee name, Payee address and payee country (in that order). But as you'd expect with a free text field, there are all sorts of possible combinations. This also means that a country's name could be mis-spelt, abbreviated, shortened or missing completely. Fortunately, most transactions have countries specified in the end of the text block! There is another field in the table where the user enters a 3 character country code (mandatory). This may or may not match with what he entered in the free text field. Below is the dummy data in the table:

TransID     ISOCode BeneAddress
------------------- -----------
20          IRN     aaaa bb cccc Islamic Rupublic of Iran  
19          IRN     aaaa bb cccc Iran, Islamic Republic of

Now, I've made a look up table (tblCountryMappings) which stores a list of all countries and possible variations of their names (well most of them!).

Eg. 'Macedonia, Republic of', 'MACEDONIA,THE FORMER YUGOSLAV REPUBLIC OF', 'MACEDONIA', 'MASEDONIA' etc.

Below is dummy data for this table:

ID  ISONumericCode  countryName                 matchIdentifier            matchIdentifierType
----------------------------------------------------------------------------------------------
209  364            Iran, Islamic Republic of   IR                         ISOAlphaCode_2
210  364            Iran, Islamic Republic of   IRN                        ISOAlphaCode_3
495  364            Iran, Islamic Republic of   Iran                       Short_Name
1163 364            Iran, Islamic Republic of   Iran, Islamic Republic of  Original_Name
1309 364            Iran, Islamic Republic of   Islamic Rupublic of Iran   Alternate_Name

As you can see, there is a one to many mapping between the tables. The objective is to be able to analyse the transaction and find which country it was intended for (primarily based on the free-text field, not just the ISO code). For example, we should be able to see that Transaction 123 has “Iraq” in the ISO code, and “Iran” in the free text, and the free text match was on the ISO 3-character “IRN”. I also need to ensure that the matching works in boundary situations (e.g. end of line, surrounded by quotes), but not if it’s in the middle of a block of text (e.g. not matching the Saudi Arabia 2-character code “SA” to anyone called “Samuel”).

I've written this basic script to extract the last word from free-text and this could then be used to join matchIdentifier in tblCountryMappings but it obviously is an extremely poor attempt.

select 
    beneaddress
    ,SUBSTRING(beneaddress, 
               case when CHARINDEX(' ',REVERSE(beneaddress)) = 0 then 1 
                    else LEN(beneaddress) - CHARINDEX(' ',REVERSE(LTRIM(RTRIM(beneaddress))))+2
               end
    ,LEN(beneaddress)) as Country
from
    tblTransactions

If you could please help me build this solution, it'll be really appreciated. Please pardon me if I've violated any rules of posting since this is my first time. Feel free to ask for further info and I'll post it at the earliest.

Thanks a lot in advance.

Cheers


Solution

  • I doubt there is a perfect solution out there since I can imagine strange scenarios where individual street names or state names may be similar to certain country names. Having said that you can join your lookup table to the transaction table use LIKE statement. That way you can use regular expressions to match the country in the address. The country name will either be at the end or delimited at the front or end by either a ‘,’or space. It should simplify your query somewhat but as I mentioned it won’t be perfect.

    The following example shows what the query will look like.

    DECLARE @tbltransactions TABLE
    (
         TransID        INT
        ,ISOCode        NVARCHAR(3)
        ,BeneAddress    NVARCHAR(100)
    )
    
    DECLARE @tblCountryMappings TABLE
    (
         ID                 INT IDENTITY
        ,CountryName        NVARCHAR(100)
        ,MatchIdentifier    NVARCHAR(100)
    )
    
    INSERT INTO @tbltransactions
    (
         TransID    
        ,ISOCode    
        ,BeneAddress
    )
    VALUES
    (1          ,'IRN'     ,'aaaa bb cccc Islamic Rupublic of Iran') ,
    (2          ,'IRN'     ,'aaaa bb cccc "Iran", Islamic Republic of'),
    (3          ,'IRN'     ,'aaRSAbb cccc IRN'),
    (4          ,'IRN'     ,'aaaa bb cccc IR'),
    (5          ,'IRN'     ,'aaaa bb cccc The Country of Fred')
    
    
    INSERT INTO @tblCountryMappings
    (       
         CountryName    
        ,MatchIdentifier
    )
    VALUES
    ('Iran, Islamic Republic of',   'IR'),         
    ('Iran, Islamic Republic of',   'IRN'),         
    ('Iran, Islamic Republic of',   'Iran'),
    ('South Africa, Republic of',   'RSA'),
    ('South Africa, Republic of',   'R.S.A.'),
    ('South Africa, Republic of',   'South Africa')
    
    
    SELECT      T.TransID
                ,T.BeneAddress
                ,ISNULL(M.CountryName, '< Could not match country>') AS CountryName
                ,M.MatchIdentifier
    FROM        @tbltransactions    T
    LEFT OUTER JOIN @tblCountryMappings M ON 
                    ( 
                            (T.BeneAddress LIKE '%[, "]' + M.MatchIdentifier + '[, "]%') -- Match any address which contains a word that start with a comma or space or quote ([, "]) and is then followed by the MatchIdentifier and then end with either a comma or space or quote.
                                    OR
                            (T.BeneAddress LIKE '%[, "]' + M.MatchIdentifier ) -- Match any address which contains a word that start with a comma or space or quote ([, "]) and is then ends with the MatchIdentifier.
                                    OR
                            (T.BeneAddress LIKE M.MatchIdentifier + '[, "]%') -- Match any address which contains a word that start with the MatchIdentifier and then end with either a comma or space or quote.
                                    OR
                            (T.BeneAddress LIKE M.MatchIdentifier ) -- Match the address with an exact match of the MatchIdentifier
                    )
    

    In the above example sql will match the BeneAddress with regular expression generated based on the value of the MatchIdentifier field.

    Example in the tblCountryMappings the MatchIdentifier field will have the following values for Iran.

    This will generate the following regular expressions:

    To match the possibility that the country might be at the end of the string we include an additional OR condition where no pattern match is defined for the end. Similar to match the possibility that the country name might be at the start of the string we include an additional OR condition where no pattern match is defined for the start.