pythonpandasstringmedian

Median of string values in a Pandas Series (or GroupBy Object) using built-in methods


I have a dataset US Baby Names that contains a "Name" column. Even though it doesn't make much sense to do so, I am trying to find the median name in that column. i.e. after arranging the names in ascending order, based on the frequency of the names, there'll be a "middle value", and that's what I want to find without having to actually sort the entire column (Pandas Series) and then find the middlemost name. So I need a simple, built-in way to find the median name.

~*~

EDIT [5:51 am Coordinated Universal Time (UTC)]: The median name should be based on alphabetical/lexicographic order of names. Also, here is some part of the csv file (first line is the header):

,Id,Name,Year,Gender,State,Count
11349,11350,Emma,2004,F,AK,62
11350,11351,Madison,2004,F,AK,48
11351,11352,Hannah,2004,F,AK,46
11352,11353,Grace,2004,F,AK,44
11353,11354,Emily,2004,F,AK,41
11354,11355,Abigail,2004,F,AK,37

~*~

I've tried the built-in pandas median() method, but it doesn't really work for non-numeric values despite setting the numeric_only attribute as False:

import pandas as pd
baby_names = pd.read_csv(
    "Pandas_DataMart\\DataMart\\06_Stats\\US_Baby_Names\\US Baby Names.xlsx")

print(baby_names['Name'].median(numeric_only=False))

There's a bunch of error lines that trace their way through the internal working of the median() method, but ultimately I get this:

TypeError: could not convert string to float: 'Emma'

So, doesn't seem to work for non-numeric values. Or am I doing something wrong?

Here's the full error message for reference:

Traceback (most recent call last):
File "C:\Users\JohnDoe\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\nanops.py", line 720, in nanmedian
values = values.astype("f8")
ValueError: could not convert string to float: 'Emma'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "c:\Users\...\BabyNames.py", line 18, in <module>
print(baby_names['Name'].median(numeric_only=False))
File "C:\Users\JohnDoe\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\generic.py", line 10802, in median
return NDFrame.median(self, axis, skipna, level, numeric_only, **kwargs)
File "C:\Users\JohnDoe\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\generic.py", line 10374, in median
return self._stat_function(
File "C:\Users\JohnDoe\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\generic.py", line 10354, in _stat_function
return self._reduce(
File "C:\Users\JohnDoe\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\series.py", line 4392, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "C:\Users\JohnDoe\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\nanops.py", line 156, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "C:\Users\JohnDoe\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\nanops.py", line 723, in nanmedian
raise TypeError(str(err)) from err
TypeError: could not convert string to float: 'Emma'

Solution

  • original answer: median by count

    Assuming you need the name with the median value of counts, use a custom function. Count the frequencies with value_counts then get the middle value:

    df = pd.DataFrame({'names': ['A', 'B', 'C', 'A', 'C', 'C']})
    
    def count_median(s):
        c = s.value_counts()
        return c.index[len(c)//2]
    
    out = df['names'].agg(count_median)
    

    Output: 'A'

    alternative: median by lexicographic order

    If you want the median/middle value of the lexicographically sorted names, then you must sort:

    df['names'].sort_values().iloc[len(df)//2]
    

    Output: 'C'

    Or if you want the first value if the length of the data is even:

    df['names'].sort_values().iloc[(len(df)-1)//2]
    

    Output: 'B'

    You can make things sightly more efficient by only sorting half of the data with numpy.partition:

    mid = (len(df)-1)//2
    out = np.partition(df['names'], mid)[mid]