How can I parse phone numbers from a pandas data frame, ideally using phonenumbers library?
I am trying to use a port of Google's libphonenumber library on Python, https://pypi.org/project/phonenumbers/.
I have a data frame with 3 million phone numbers from many countries. I have a row with the phone number, and a row with the country/region code. I'm trying to use the parse function in the package. My goal is to parse each row using the corresponding country code but I can't find a way of doing it efficiently.
I tried using apply but it didn't work. I get a "(0) Missing or invalid default region." error, meaning it won't pass the country code string.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
phonenumbers.parse(str(df.phone_number),str(df.region_code)))
The line below works, but doesn't get me what I want, as the numbers I have come from about 120+ different countries.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
phonenumbers.parse(str(df.phone_number),"US"))
I tried doing this in a loop, but it is terribly slow. Took me more than an hour to parse 10,000 numbers, and I have about 300x that:
for i in range(n):
df3['phone_number_std'][i] =
phonenumbers.parse(str(df.phone_number[i]),str(df.region_code[i]))
Is there a method I'm missing that could run this faster? The apply function works acceptably well but I'm unable to pass the data frame element into it.
I'm still a beginner in Python, so perhaps this has an easy solution. But I would greatly appreciate your help.
Your initial solution using apply
is actually pretty close - you don't say what doesn't work about it, but the syntax for a lambda function over multiple columns of a dataframe, rather than on the rows within a single column, is a bit different. Try this:
df['phone_number_clean'] = df.apply(lambda x:
phonenumbers.parse(str(x.phone_number),
str(x.region_code)),
axis='columns')
The differences:
You want to include multiple columns in your lambda function, so you want to apply your lambda function to the entire dataframe (i.e, df.apply
) rather than to the Series (the single column) that is returned by doing df.phone_number.apply
. (print the output of df.phone_number
to the console - what is returned is all the information that your lambda function will be given).
The argument axis='columns'
(or axis=1
, which is equivalent, see the docs) actually slices the data frame by rows, so apply 'sees' one record
at a time (ie, [index0, phonenumber0, countrycode0], [index1, phonenumber1, countrycode1]...) as opposed to slicing the other direction, which would give it ([phonenumber0, phonenumber1, phonenumber2...])
Your lambda function only knows about the placeholder x
, which, in this case, is the Series [index0, phonenumber0, countrycode0], so you need to specify all the values relative to the x
that it knows - i.e., x.phone_number, x.country_code.