I'm working on my first "big" project, and I need to deal with a lot of phone numbers: extracting them from a file (already done), normalizing them to a single format (this is where the problem is), and finally storing them in a database (also already done).
The problem with formatting is that I have no control over the data source. The format is not consistent, and national and international numbers are mixed together: some have the country code with the plus sign and others do not; some have parentheses, hyphens, a leading 0, etc., and some don't.
I'm trying to use the phonenumbers library to separate national and international numbers. My country is Brazil, and the overwhelming majority of the numbers are Brazilian. So I start by removing all unnecessary characters: parentheses, hyphens, spaces, the plus sign, and a leading zero.
df['Mobile Phone'] = df['Mobile Phone'].str.replace(r'\(|\)|\-|\+|\s', '', regex=True)
df['Mobile Phone'] = df['Mobile Phone'].str[:1].str.replace('0', '') + df['Mobile Phone'].str[1:]
The next step is to separate the national numbers from the international ones, which is where the library comes in. So far I've tried two approaches, but both raise an exception. In the first attempt, I expected to fill the Origin column with the name of each number's country of origin, so that I could separate the Brazilian numbers from the others. However, phonenumbers.parse() needs to be told the number's region, which I have no way of knowing, so I get the error below:
df['Origin'] = df['Mobile Phone'].apply(lambda x: geocoder.description_for_number(phonenumbers.parse(x), 'en'))
NumberParseException: (0) Missing or invalid default region.
So I tried passing Brazil (BR) as the region, but that also raises an error: at some point the number passed to phonenumbers.parse() is an international one, which is not recognized as a valid number. The code and error are below:
df['Origin'] = df['Mobile Phone'].apply(lambda x: geocoder.description_for_number(phonenumbers.parse(x, 'BR'), 'en'))
NumberParseException: (1) The string supplied did not seem to be a phone number.
I also tried using phonenumbers.is_valid_number() to fill a Valid column with True or False depending on whether the number is valid for Brazil. The error is the same, though: when an international number is passed to phonenumbers.parse(), it is not recognized and the exception is raised.
df['Valid'] = df['Mobile Phone'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x, 'BR')))
NumberParseException: (1) The string supplied did not seem to be a phone number.
Is there a way to avoid or ignore these exceptions so that the rest of the checks still run? Or a way to return some other value for the column when the exception is raised, indicating that the number was not recognized? Or is there a way to pass a list of all existing countries to phonenumbers.parse()? Something like this:
df['Valid'] = df['Mobile Phone'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x, list_of_countries)))
or
df['Valid'] = df['Mobile Phone'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x, ['EN', 'GB', 'BR'])))
Here is a sample of numbers from one of the files I'm working on, shown before any cleaning. The first 4 numbers are Brazilian and the last ones are international:
+55 34 98400-xxxx
34 99658-xxxx
+349798xxxx
9685-xxxx
549215xxxx
+598 91 xxx xxx
+81 80-4250-xxxx
+81 90-4262-xxxx
+971 50 147 xxxx
+972 53-881-xxxx
And this is how they look after I clean out the unnecessary characters:
553498400xxxx
3499658xxxx
349798xxxx
9685xxxx
549215xxxx
59891xxxxxx
81804250xxxx
81904262xxxx
97150147xxxx
97253881xxxx
A complete Brazilian number follows the format +55 XX XXXXX-XXXX, but some numbers in the data are incomplete and lack some information, such as the country code.
I do not intend to format the international numbers at all, since they come from several different countries and each has its own format. I just need to take them out of the dataframe somehow so that I can format the Brazilian numbers, and afterwards I will put the international numbers back. As I said, I have already written the code that formats the Brazilian numbers and fills in the missing information. My difficulty is how to separate the international numbers from the Brazilian ones, using the phonenumbers library or otherwise.
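If you don't know which numbers are international and which are local, you'll just have to try both, catching NumberParseException so that one unparseable number does not abort the whole run:
import phonenumbers

def parse_valid(number, region):
    # Return a valid PhoneNumber, or None if parsing fails or the number is invalid
    try:
        pn = phonenumbers.parse(number, region)
    except phonenumbers.NumberParseException:
        return None
    return pn if phonenumbers.is_valid_number(pn) else None

def guess_phonenumber(clean, loc):
    # Try national
    pn = parse_valid(clean, loc)
    if pn is None:
        # Not national; add + and try international
        pn = parse_valid("+" + clean, None)
    # Still None here means it was not international either
    return pn

guess_phonenumber(clean_phone_number, "BR")
# => PhoneNumber or None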
If the phone cannot be recognised, it is likely either invalid altogether, or it is missing too much information to be able to be reconstructed (e.g. a local number, when you do not know which area it is local to).
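One thing that may make the guess more reliable: the cleaning step in the question strips the plus sign, which is the clearest marker that a number is already in international form. Below is a minimal sketch, using only the standard library, that removes the formatting but keeps a leading "+". The function name `normalize` and the sample numbers are made up for illustration, and it assumes a plus sign in the source data can be trusted.

```python
import re

def normalize(raw: str) -> str:
    """Strip formatting characters but keep a leading '+' if one was present.

    Hypothetical helper: not part of the phonenumbers library.
    """
    raw = raw.strip()
    has_plus = raw.startswith("+")
    # Drop everything that is not a digit (spaces, hyphens, parentheses, '+').
    digits = re.sub(r"\D", "", raw)
    if has_plus:
        # Already in international form: restore the marker.
        return "+" + digits
    # No plus sign: drop a single leading zero, as in the question.
    return digits[1:] if digits.startswith("0") else digits

samples = ["+55 34 98400-1234", "(034) 99658-1234", "9685-1234"]
cleaned = [normalize(s) for s in samples]
```

A result that starts with "+" could then be passed to phonenumbers.parse() with a region of None, and anything else to the try-national-then-international guess above. Numbers that are international but were written without the plus still need that fallback.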