I'm trying to create a SQL UDF or statement to parse a free text field and find the country name from it, but I'm unable to do so successfully.
To give you guys the full context, I've got a transactions table (tbltransactions below) which contains transaction details and one of the fields is this free text field. This should ideally contain Payee name, Payee address and payee country (in that order). But as you'd expect with a free text field, there are all sorts of possible combinations. This also means that a country's name could be mis-spelt, abbreviated, shortened or missing completely. Fortunately, most transactions have countries specified in the end of the text block! There is another field in the table where the user enters a 3 character country code (mandatory). This may or may not match with what he entered in the free text field. Below is the dummy data in the table:
TransID ISOCode BeneAddress
------------------- -----------
20 IRN aaaa bb cccc Islamic Rupublic of Iran
19 IRN aaaa bb cccc Iran, Islamic Republic of
Now, I've made a look up table (tblCountryMappings) which stores a list of all countries and possible variations of their names (well most of them!).
Eg. 'Macedonia, Republic of', 'MACEDONIA,THE FORMER YUGOSLAV REPUBLIC OF', 'MACEDONIA', 'MASEDONIA' etc.
Below is dummy data for this table:
ID ISONumericCode countryName matchIdentifier matchIdentifierType
----------------------------------------------------------------------------------------------
209 364 Iran, Islamic Republic of IR ISOAlphaCode_2
210 364 Iran, Islamic Republic of IRN ISOAlphaCode_3
495 364 Iran, Islamic Republic of Iran Short_Name
1163 364 Iran, Islamic Republic of Iran, Islamic Republic of Original_Name
1309 364 Iran, Islamic Republic of Islamic Rupublic of Iran Alternate_Name
As you can see, there is a one to many mapping between the tables. The objective is to be able to analyse the transaction and find which country it was intended for (primarily based on the free-text field, not just the ISO code). For example, we should be able to see that Transaction 123 has “Iraq” in the ISO code, and “Iran” in the free text, and the free text match was on the ISO 3-character “IRN”. I also need to ensure that the matching works in boundary situations (e.g. end of line, surrounded by quotes), but not if it’s in the middle of a block of text (e.g. not matching the Saudi Arabia 2-character code “SA” to anyone called “Samuel”).
I've written this basic script to extract the last word from free-text and this could then be used to join matchIdentifier in tblCountryMappings but it obviously is an extremely poor attempt.
select
beneaddress
,SUBSTRING(beneaddress,
case when CHARINDEX(' ',REVERSE(beneaddress)) = 0 then 1
else LEN(beneaddress) - CHARINDEX(' ',REVERSE(LTRIM(RTRIM(beneaddress))))+2
end
,LEN(beneaddress)) as Country
from
tblTransactions
If you could please help me build this solution, it'll be really appreciated. Please pardon me if I've violated any rules of posting since this is my first time. Feel free to ask for further info and I'll post it at the earliest.
Thanks a lot in advance.
Cheers
I doubt there is a perfect solution out there since I can imagine strange scenarios where individual street names or state names may be similar to certain country names. Having said that you can join your lookup table to the transaction table use LIKE
statement. That way you can use regular expressions to match the country in the address. The country name will either be at the end or delimited at the front or end by either a ‘,’or space. It should simplify your query somewhat but as I mentioned it won’t be perfect.
The following example shows what the query will look like.
DECLARE @tbltransactions TABLE
(
TransID INT
,ISOCode NVARCHAR(3)
,BeneAddress NVARCHAR(100)
)
DECLARE @tblCountryMappings TABLE
(
ID INT IDENTITY
,CountryName NVARCHAR(100)
,MatchIdentifier NVARCHAR(100)
)
INSERT INTO @tbltransactions
(
TransID
,ISOCode
,BeneAddress
)
VALUES
(1 ,'IRN' ,'aaaa bb cccc Islamic Rupublic of Iran') ,
(2 ,'IRN' ,'aaaa bb cccc "Iran", Islamic Republic of'),
(3 ,'IRN' ,'aaRSAbb cccc IRN'),
(4 ,'IRN' ,'aaaa bb cccc IR'),
(5 ,'IRN' ,'aaaa bb cccc The Country of Fred')
INSERT INTO @tblCountryMappings
(
CountryName
,MatchIdentifier
)
VALUES
('Iran, Islamic Republic of', 'IR'),
('Iran, Islamic Republic of', 'IRN'),
('Iran, Islamic Republic of', 'Iran'),
('South Africa, Republic of', 'RSA'),
('South Africa, Republic of', 'R.S.A.'),
('South Africa, Republic of', 'South Africa')
SELECT T.TransID
,T.BeneAddress
,ISNULL(M.CountryName, '< Could not match country>') AS CountryName
,M.MatchIdentifier
FROM @tbltransactions T
LEFT OUTER JOIN @tblCountryMappings M ON
(
(T.BeneAddress LIKE '%[, "]' + M.MatchIdentifier + '[, "]%') -- Match any address which contains a word that start with a comma or space or quote ([, "]) and is then followed by the MatchIdentifier and then end with either a comma or space or quote.
OR
(T.BeneAddress LIKE '%[, "]' + M.MatchIdentifier ) -- Match any address which contains a word that start with a comma or space or quote ([, "]) and is then ends with the MatchIdentifier.
OR
(T.BeneAddress LIKE M.MatchIdentifier + '[, "]%') -- Match any address which contains a word that start with the MatchIdentifier and then end with either a comma or space or quote.
OR
(T.BeneAddress LIKE M.MatchIdentifier ) -- Match the address with an exact match of the MatchIdentifier
)
In the above example sql will match the BeneAddress with regular expression generated based on the value of the MatchIdentifier field.
Example in the tblCountryMappings the MatchIdentifier field will have the following values for Iran.
This will generate the following regular expressions:
To match the possibility that the country might be at the end of the string we include an additional OR
condition where no pattern match is defined for the end. Similar to match the possibility that the country name might be at the start of the string we include an additional OR
condition where no pattern match is defined for the start.