I have a large dataset for which I want to plot the ECDF, but I got confused, so I decided to use a small subset of the data, which still didn't make sense to me (compared to what I read from the source).
To replicate the problem, I produced a synthetic MWE. Say I have the following df:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
# DataFrame
df = pd.DataFrame(
{'id': [54, 54, 54, 54, 54, 16, 16, 16, 50, 50, 28, 28, 28, 19, 19, 32, 32, 32, 81, 81, 81, 81, 81],
'user_id': [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 84, 84, 84, 84, 84, 179, 179, 179, 179, 179],
'trip_id': [101, 101, 101, 101, 101, 101, 101, 101, 102, 102, 102, 102, 102, 841, 841, 841, 841, 841, 1796, 1796,
1796, 1796, 1796],
'travel_mode': ['train', 'train', 'train', 'train', 'train', 'walk', 'walk', 'walk', 'train', 'train', 'train',
'train', 'train', 'taxi', 'taxi', 'bus', 'bus', 'bus', 'train', 'train', 'train', 'train', 'train']}
)
In this example, 50% of the trips (2/4) were made by 1 user. I want to plot the number of trips per user, so I proceeded like so:
# number of trips per user
trips_per_user = df.groupby('user_id')['trip_id'].nunique()
trips_per_user
trip_id
user_id
10 2
84 1
179 1
# Create a DataFrame for plotting
plot_data = trips_per_user.reset_index(name='num_trips')
plot_data
user_id num_trips
0 10 2
1 84 1
2 179 1
Now, plotting the ECDF:
# ECDF
plt.figure(figsize=(5, 4))
sns.ecdfplot(data=plot_data, x='num_trips', stat='proportion', complementary=False)
plt.xlabel('Number of Trips')
plt.ylabel('Cumulative Proportion')
Obviously, I am not doing this correctly.
Required answer:
You have 3 values in your plot_data dataset (2 unique): [1, 1, 2]. For the first unique point (1), you have 2 items ([1, 1]) out of 3, so 67%.
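As a quick numeric check of that 67% figure, here is a minimal sketch that computes the unweighted ECDF with NumPy instead of seaborn. The array `num_trips` is just the `num_trips` column typed out by hand, and the variable names are my own, not anything from the question.

```python
import numpy as np

# The num_trips column of plot_data, typed out by hand (users 10, 84, 179)
num_trips = np.array([2, 1, 1])

# Unique values and how many observations (users) fall on each one
values, counts = np.unique(num_trips, return_counts=True)

# Unweighted ECDF: share of observations less than or equal to each value
ecdf = np.cumsum(counts) / counts.sum()

for v, p in zip(values, ecdf):
    print(f"num_trips <= {v}: {p:.2f}")
```

Each user counts once here, which is why the first step lands at 2/3 rather than 1/2.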
If you want to count 2 trips for 2, you have to weight your ecdfplot:
sns.ecdfplot(data=plot_data, x='num_trips', weights='num_trips',
stat='proportion')
Output:
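The weighted step heights can also be checked without the plot. This is a minimal sketch, assuming the weighted ECDF at a value is the sum of the weights of all observations at or below it, divided by the total weight. It uses plain NumPy rather than seaborn's internals, and the variable names are illustrative.

```python
import numpy as np

# The num_trips column of plot_data; it doubles as the weight of each user
num_trips = np.array([2, 1, 1])
weights = num_trips

# Map every observation to the index of its unique value
values_w, inverse = np.unique(num_trips, return_inverse=True)

# Total weight (number of trips) carried by each unique value
weight_per_value = np.bincount(inverse, weights=weights)

# Weighted ECDF: cumulative share of total weight up to each value
weighted_ecdf = np.cumsum(weight_per_value) / weight_per_value.sum()

for v, p in zip(values_w, weighted_ecdf):
    print(f"num_trips <= {v}: {p:.2f}")
```

With these weights, the curve reaches 0.5 at 1 and 1.0 at 2: half of the trips come from users with a single trip, and the other half from the one user with two trips.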