[SOLVED] Plotting cumulative distribution from data

I have a large dataset for which I want to plot the ECDF, but the result confused me, so I tried a small subset of the data, which still didn't make sense to me (compared to what I read in the source).

For that, I produced a synthetic MWE to replicate the problem. Say I have the following df:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")

# DataFrame
df = pd.DataFrame(
    {'id': [54, 54, 54, 54, 54, 16, 16, 16, 50, 50, 28, 28, 28, 19, 19, 32, 32, 32, 81, 81, 81, 81, 81],
     'user_id': [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 84, 84, 84, 84, 84, 179, 179, 179, 179, 179],
     'trip_id': [101, 101, 101, 101, 101, 101, 101, 101, 102, 102, 102, 102, 102, 841, 841, 841, 841, 841, 1796, 1796,
                 1796, 1796, 1796],
     'travel_mode': ['train', 'train', 'train', 'train', 'train', 'walk', 'walk', 'walk', 'train', 'train', 'train',
                             'train', 'train', 'taxi', 'taxi', 'bus', 'bus', 'bus', 'train', 'train', 'train', 'train', 'train']}
)

In this example, 50% of the trips (2/4) were travelled by 1 user. I want to plot the number of trips per user, so I proceeded like so:

# number of trips per user
trips_per_user = df.groupby('user_id')['trip_id'].nunique()

trips_per_user
user_id
10     2
84     1
179    1
Name: trip_id, dtype: int64

# Create a DataFrame for plotting
plot_data = trips_per_user.reset_index(name='num_trips')

plot_data
   user_id  num_trips
0       10          2
1       84          1
2      179          1

Now, plotting the ECDF.

# ECDF
plt.figure(figsize=(5, 4))
sns.ecdfplot(data=plot_data, x='num_trips', stat='proportion', complementary=False)
plt.xlabel('Number of Trips')
plt.ylabel('Cumulative Proportion')

Output:

Obviously, I am not doing this correctly.

Users with 1 trip account for 50% of the trips (2 of 4), not about 70% as in the plot obtained.
The ECDF curve doesn't start from 0.

Required answer:

I wanted to plot something like below (from the source):

Solution

Your plot_data dataset has 3 values (2 unique): [1, 1, 2]. For the first unique point (1), you have 2 items ([1, 1]) out of 3, so 67%. In other words, the unweighted ECDF counts one observation per user, not per trip.
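To make the 67% concrete, here is a minimal sketch that recomputes the unweighted ECDF by hand in plain Python. The list `num_trips` is simply the `num_trips` column of `plot_data` typed out; the variable names are illustrative and not part of seaborn.

```python
from collections import Counter
from itertools import accumulate

# Trips per user, copied from plot_data['num_trips'] in the question
num_trips = [2, 1, 1]

# Unweighted ECDF: every user contributes exactly one observation
counts = Counter(num_trips)
values = sorted(counts)
cumulative = list(accumulate(counts[v] for v in values))
ecdf = {v: c / len(num_trips) for v, c in zip(values, cumulative)}

for v, p in ecdf.items():
    print(f"num_trips <= {v}: {p:.2f}")
```

Two of the three users have a single trip, so the curve jumps to 2/3 at x = 1 and reaches 1 at x = 2.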

If you want the value 2 to count as 2 trips (so that the proportions are of trips rather than of users), you have to weight your ecdfplot:

sns.ecdfplot(data=plot_data, x='num_trips', weights='num_trips',
             stat='proportion')

Output:
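The weighted proportions can also be checked by hand. The sketch below assumes the weighted ECDF is the cumulative sum of the weights (taken in order of increasing value) divided by the total weight; it mirrors `weights='num_trips'` in plain Python rather than calling seaborn, and the variable names are illustrative.

```python
from collections import defaultdict
from itertools import accumulate

# Trips per user, copied from plot_data['num_trips'] in the question
num_trips = [2, 1, 1]
weights = num_trips  # each user is weighted by their own trip count

# Total weight attached to each distinct value of num_trips
weight_per_value = defaultdict(int)
for v, w in zip(num_trips, weights):
    weight_per_value[v] += w

values = sorted(weight_per_value)
cumulative = list(accumulate(weight_per_value[v] for v in values))
total = cumulative[-1]
weighted_ecdf = {v: c / total for v, c in zip(values, cumulative)}

for v, p in weighted_ecdf.items():
    print(f"num_trips <= {v}: {p:.2f}")
```

With the weights applied, users with a single trip account for 2 of the 4 trips, so the curve sits at 0.5 at x = 1, which is the 50% expected in the question.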