[SOLVED] machine learning, nominal data normalization

I am working on k-means clustering. I have a 3D dataset with the attributes no. of days, frequency, and food.

-> Days is normalized by mean and standard deviation (SD), or better to say standardized, which gives me a range of [-2, 14].

-> Frequency and food, which are NOMINAL data in my dataset, are normalized by dividing by the maximum ( x/max(x) ), which gives me a range of [0, 1].

The problem is that k-means only considers the day axis for grouping, since there are obvious gaps between points on this axis, and it almost ignores the other two, frequency and food (I think because the gaps in the frequency and food dimensions are negligible).

If I apply k-means to the day axis alone (1D), I get exactly the same result as when I apply it in 3D (days, frequency, food).

Before this, I also used x/max(x) for days, but the result was not acceptable.

So I want to know: is there any way to normalize the other two nominal attributes, frequency and food, so that they are scaled fairly relative to the day axis?

Values: food => 1, 2, 3; frequency => 1-36

Solution

The point of normalization is not just to get the values small.

The purpose is to have comparable value ranges - something which is really hard for attributes of different units, and may well be impossible for nominal data.
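To make "comparable value ranges" concrete, here is a minimal sketch that applies the two scalings from the question to a small sample and prints the spread of each attribute afterwards. The sample values are invented purely for illustration (they are not the asker's data), and the helper names are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical sample values, invented for illustration only.
days = [1, 2, 3, 4, 5, 40, 90]
frequency = [1, 5, 12, 20, 36, 8, 15]   # codes in 1..36
food = [1, 2, 3, 1, 2, 3, 1]            # codes in 1..3


def standardize(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def divide_by_max(xs):
    """Scale values by their maximum, as in x / max(x)."""
    top = max(xs)
    return [x / top for x in xs]


scaled = {
    "days": standardize(days),
    "frequency": divide_by_max(frequency),
    "food": divide_by_max(food),
}

# Standard deviation of each attribute after scaling.
spread = {name: pstdev(values) for name, values in scaled.items()}
for name, s in spread.items():
    print(f"{name}: spread after scaling = {s:.2f}")
```

With these made-up numbers the standardized day attribute ends up with a spread several times larger than the two attributes divided by their maximum, so it contributes most of the squared distance that k-means minimizes.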

For your kind of data, k-means is probably the worst choice, because k-means relies on continuous values to work. If you have nominal values, it tends to get stuck easily. So my main recommendation is to not use k-means.

For k-means to work on your data, a difference of 1 must mean the same in every attribute. So a 1 day difference = the difference between food 1 and food 2. And because k-means is based on squared errors, the difference between food 1 and food 3 counts 4x as much as the difference between food 1 and food 2.
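The arithmetic behind that 4x can be checked in a few lines. This is only a sketch of the squared-error term for a single attribute, using the raw food codes 1, 2, 3 from the question; the function name and the example day values are made up for illustration.

```python
def squared_difference(a, b):
    """Contribution of one attribute to the squared Euclidean distance."""
    return (a - b) ** 2


one_day = squared_difference(10, 11)     # two points one day apart (example values)
food_1_vs_2 = squared_difference(1, 2)   # food code 1 vs food code 2
food_1_vs_3 = squared_difference(1, 3)   # food code 1 vs food code 3

print(one_day, food_1_vs_2, food_1_vs_3)  # prints: 1 1 4
```

So on the raw codes, k-means treats food 1 and food 3 as four times further apart than food 1 and food 2, which only makes sense if the food codes really have that ordering and spacing.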

Unless your data has the above property, don't use k-means.