I am trying to understand how normalization of DataFrame values works. The scenario uses the well-known Titanic dataset; here is the code and the result of a query:
dftitanic.groupby('Fsize')['Survived'].value_counts(normalize=False).reset_index(name='perc')
Result:
Fsize Survived perc
0 1 0 374
1 1 1 163
2 2 1 89
3 2 0 72
4 3 1 59
5 3 0 43
6 4 1 21
7 4 0 8
8 5 0 12
9 5 1 3
10 6 0 19
11 6 1 3
12 7 0 8
13 7 1 4
14 8 0 6
15 11 0 7
And if I use .value_counts(normalize=True), the result would be:
dftitanic.groupby('Fsize')['Survived'].value_counts(normalize=True).reset_index(name='perc')
Fsize Survived perc
0 1 0 0.696462
1 1 1 0.303538
2 2 1 0.552795
3 2 0 0.447205
4 3 1 0.578431
5 3 0 0.421569
6 4 1 0.724138
7 4 0 0.275862
8 5 0 0.800000
9 5 1 0.200000
10 6 0 0.863636
11 6 1 0.136364
12 7 0 0.666667
13 7 1 0.333333
14 8 0 1.000000
15 11 0 1.000000
And the data from describe():
Fsize Survived Perc
count 16.0000 16.000000 16.000000
mean 4.6875 0.437500 55.687500
std 2.7500 0.512348 95.378347
min 1.0000 0.000000 3.000000
25% 2.7500 0.000000 6.750000
50% 4.5000 0.000000 15.500000
75% 6.2500 1.000000 62.250000
max 11.0000 1.000000 374.000000
From https://stackoverflow.com/a/41532180, I got the following methods:
normalized_df = (df - df.mean()) / df.std()
normalized_df = (df - df.min()) / (df.max() - df.min())
However, judging from the results of describe(), neither of the two methods above matches the results of .value_counts(normalize=True).
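A quick sketch of the mismatch I'm seeing, using the perc values from the first table above:

```python
import pandas as pd

# 'perc' counts taken from the normalize=False result above
perc = pd.Series([374, 163, 89, 72, 59, 43, 21, 8, 12, 3, 19, 3, 8, 4, 6, 7])

# Column-wise standardization rescales the whole column at once...
standardized = (perc - perc.mean()) / perc.std()
print(standardized.iloc[0])  # ~3.34 -- nothing like 0.696462

# ...whereas the normalize=True output divides each count by its group total:
print(374 / (374 + 163))  # 0.696462... for Fsize == 1
```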
A similar formula and description is given here: but I did not get results I could make sense of.
How is this normalization being done, i.e. what does .value_counts(normalize=True) compute?
In the context of a pandas GroupBy operation, the normalize parameter, when set to True, turns the counts into proportions: the output shows each value's share of its group total rather than the raw counts.
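To make this concrete, here is a minimal sketch on a toy frame (the column names mirror the question, but the data itself is made up) showing that normalize=True is nothing more than each count divided by its group's total:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (values are illustrative only)
df = pd.DataFrame({
    "Fsize":    [1, 1, 1, 2, 2, 3],
    "Survived": [0, 0, 1, 1, 0, 1],
})

counts = df.groupby("Fsize")["Survived"].value_counts(normalize=False)
fracs = df.groupby("Fsize")["Survived"].value_counts(normalize=True)

# normalize=True simply divides each count by the total of its Fsize group
manual = counts / counts.groupby(level="Fsize").transform("sum")

print(np.allclose(fracs.sort_index(), manual.sort_index()))  # True
```

Note that each Fsize group's proportions sum to 1 on their own, which is why a group with a single row (e.g. Fsize 8 in your output) shows 1.000000.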
Regarding the article on normalization in machine learning, it is essential to distinguish the normalize parameter in pandas from normalization techniques in machine learning. In pandas, normalize=True computes proportions within groups. Normalization in machine learning, by contrast, means scaling features to a fixed range, often [0, 1], so that no feature dominates the model simply because of its scale. Standardization transforms data to have a mean of 0 and a standard deviation of 1, which makes different features easier to compare and interpret.
In summary, the article you linked explains the mathematical intuition behind standardization and normalization as used in machine learning, and that is a different concept from the normalize parameter you are using in pandas. The article's formulas describe transformations applied during data preprocessing, before feeding the data to an algorithm, so that the model can produce more accurate results. Both techniques are available in the scikit-learn (sklearn) library: from sklearn.preprocessing import StandardScaler, MinMaxScaler.
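Here is a minimal sketch of those two scikit-learn transformers applied to a single column, matching the formulas you quoted (the input values are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A single feature column; sklearn expects a 2-D array of shape (n_samples, n_features)
X = np.array([[3.0], [6.0], [12.0], [15.0]])

# Standardization: (x - mean) / std  -> resulting column has mean 0, std 1
z = StandardScaler().fit_transform(X)

# Min-max normalization: (x - min) / (max - min)  -> resulting column lies in [0, 1]
m = MinMaxScaler().fit_transform(X)

print(m.ravel())  # [0.   0.25 0.75 1.  ]
```

One caveat if you compare against the pandas formula from the question: StandardScaler divides by the population standard deviation (ddof=0), while df.std() defaults to the sample standard deviation (ddof=1), so the two standardized results differ slightly.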
If you still have doubts, feel free to reach out.