python pandas dataframe apply libphonenumber

How to use parse from phonenumbers Python library on a pandas data frame?


How can I parse phone numbers from a pandas data frame, ideally using phonenumbers library?

I am trying to use a port of Google's libphonenumber library on Python, https://pypi.org/project/phonenumbers/.

I have a data frame with 3 million phone numbers from many countries. One column holds the phone number and another holds the country/region code. I'm trying to use the parse function in the package. My goal is to parse each row using the corresponding country code, but I can't find a way of doing it efficiently.

I tried using apply but it didn't work. I get a "(0) Missing or invalid default region." error, meaning it won't pass the country code string.

df['phone_number_clean'] = df.phone_number.apply(lambda x: 
phonenumbers.parse(str(df.phone_number),str(df.region_code)))

The line below works, but doesn't get me what I want, as the numbers I have come from about 120+ different countries.

df['phone_number_clean'] = df.phone_number.apply(lambda x:
 phonenumbers.parse(str(df.phone_number),"US"))

I tried doing this in a loop, but it is terribly slow. Took me more than an hour to parse 10,000 numbers, and I have about 300x that:

for i in range(n): 
    df3['phone_number_std'][i] = phonenumbers.parse(str(df.phone_number[i]), str(df.region_code[i]))

Is there a method I'm missing that could run this faster? The apply function works acceptably well but I'm unable to pass the data frame element into it.

I'm still a beginner in Python, so perhaps this has an easy solution. But I would greatly appreciate your help.


Solution

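  • Your initial solution using apply is actually pretty close. The lambda never uses its argument x: it passes the whole columns df.phone_number and df.region_code to parse, so str(df.region_code) is the printed form of an entire Series rather than a single region code, which is why parse reports an invalid default region. The syntax for a lambda function over multiple columns of a dataframe, rather than over the values within a single column, is a bit different. Try this: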

    df['phone_number_clean'] = df.apply(lambda x: 
                                  phonenumbers.parse(str(x.phone_number), 
                                                     str(x.region_code)), 
                                  axis='columns')
    

    The differences:

    1. You want to use multiple columns in your lambda function, so apply it to the entire dataframe (i.e., df.apply) rather than to the Series (the single column) returned by df.phone_number. With df.phone_number.apply, the lambda is handed one value from that column at a time and never sees region_code. (Print df.phone_number to the console: that column is all the information your lambda function has access to.)

    2. The argument axis='columns' (or axis=1, which is equivalent, see the docs) passes the data frame to your function one row at a time, so apply 'sees' one record at a time (i.e., [index0, phonenumber0, countrycode0], [index1, phonenumber1, countrycode1], ...), as opposed to the default axis=0, which would pass one column at a time ([phonenumber0, phonenumber1, phonenumber2, ...]).

    3. Your lambda function only knows about the placeholder x, which, in this case, is the Series [index0, phonenumber0, countrycode0], so you need to specify all the values relative to the x that it knows, i.e., x.phone_number and x.region_code.