I have a dataset US Baby Names that contains a "Name" column. Even though it doesn't make much sense to do so, I am trying to find the median name in that column. i.e. after arranging the names in ascending order, based on the frequency of the names, there'll be a "middle value", and that's what I want to find without having to actually sort the entire column (Pandas Series) and then find the middlemost name. So I need a simple, built-in way to find the median name.
~*~
EDIT [5:51 am Coordinated Universal Time (UTC)]: The median name should be based on alphabetical/lexicographic order of names. Also, here is some part of the csv file (first line is the header):
,Id,Name,Year,Gender,State,Count
11349,11350,Emma,2004,F,AK,62
11350,11351,Madison,2004,F,AK,48
11351,11352,Hannah,2004,F,AK,46
11352,11353,Grace,2004,F,AK,44
11353,11354,Emily,2004,F,AK,41
11354,11355,Abigail,2004,F,AK,37
~*~
I've tried the built-in pandas median() method, but it doesn't really work for non-numeric values despite setting the numeric_only
attribute as False
:
import pandas as pd
baby_names = pd.read_csv(
"Pandas_DataMart\\DataMart\\06_Stats\\US_Baby_Names\\US Baby Names.xlsx")
print(baby_names['Name'].median(numeric_only=False))
There's a bunch of error lines that trace their way through the internal working of the median() method, but ultimately I get this:
TypeError: could not convert string to float: 'Emma'
So, doesn't seem to work for non-numeric values. Or am I doing something wrong?
Here's the full error message for reference:
Traceback (most recent call last):
File "C:\Users\JohnDoe\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\nanops.py", line 720, in nanmedian
values = values.astype("f8")
ValueError: could not convert string to float: 'Emma'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\Users\...\BabyNames.py", line 18, in <module>
print(baby_names['Name'].median(numeric_only=False))
File "C:\Users\JohnDoe\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\generic.py", line 10802, in median
return NDFrame.median(self, axis, skipna, level, numeric_only, **kwargs)
File "C:\Users\JohnDoe\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\generic.py", line 10374, in median
return self._stat_function(
File "C:\Users\JohnDoe\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\generic.py", line 10354, in _stat_function
return self._reduce(
File "C:\Users\JohnDoe\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\series.py", line 4392, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "C:\Users\JohnDoe\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\nanops.py", line 156, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "C:\Users\JohnDoe\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\nanops.py", line 723, in nanmedian
raise TypeError(str(err)) from err
TypeError: could not convert string to float: 'Emma'
Assuming you need the name with the median value of counts, use a custom function. Count the frequencies with value_counts
then get the middle value:
df = pd.DataFrame({'names': ['A', 'B', 'C', 'A', 'C', 'C']})
def count_median(s):
c = s.value_counts()
return c.index[len(c)//2]
out = df['names'].agg(count_median)
Output: 'A'
If you want the median/middle value of the lexicographically sorted names, then you must sort:
df['names'].sort_values().iloc[len(df)//2]
Output: 'C'
Or if you want the first value if the length of the data is even:
df['names'].sort_values().iloc[(len(df)-1)//2]
Output: 'B'
You can make things sightly more efficient by only sorting half of the data with numpy.partition
:
mid = (len(df)-1)//2
out = np.partition(df['names'], mid)[mid]