python machine-learning classification random-forest k-means

How to use machine learning to find the typical customer profile?


I have a dataset with personal characteristics of customers who purchase from a fictional company. Initially, I don't have any target variable, only their characteristics. My goal is to find a pattern, which may not necessarily be the most frequent characteristic in each column. Is it possible to do this with RandomForest, for example? Or should I use another technique?

The dataset has a structure similar to the following. The columns are all in object format and there are some NaN values represented as 'Blank':

Date            Name     Salary      Position            Age
'05/10/2023'   'Daniel'  '10,000'    'IT'                32
'05/12/2024'   'John'    '9,000'     'Blank'             27
'03/01/2023'   'Niel'    'Blank'     'Data Scients'      21
'03/01/2023'   'Isa'     '10,000'    'Engineer'          51
'05/10/2023'   'Ana'     '11,000'    'Data Scients'      52
'05/12/2024'   'Ian'     '9,500'     'Doctor'            48
'03/01/2023'   'Fred'    'Blank'     'IT'                21
'03/01/2023'   'Carol'   '15,000'    'Blank'             30

I'm thinking of something that returns an output, for example, stating the characteristics that form the most standard profile, such as:

The most standard profile is: Salary x, Position y, and Age z.

I thought about using clustering, but I don't believe it is the best method (the output for the salary, for example, was a simple average). I believe the best approach would be to create a profile that may not necessarily exist and is based on studying the pattern of each variable (Salary, Position, and Age).

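import pandas as pd
from sklearn.cluster import KMeans

# Convert Salary and Age to numbers: commas are dropped, 'Blank' becomes NaN
df['Salary'] = pd.to_numeric(df['Salary'].str.replace(',', ''), errors='coerce')
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df[['Salary', 'Age']] = df[['Salary', 'Age']].fillna(df[['Salary', 'Age']].median())

# Encode categorical variables
df['Position'] = pd.Categorical(df['Position']).codes

# Perform clustering
kmeans = KMeans(n_clusters=1, random_state=42)
kmeans.fit(df[['Salary', 'Position', 'Age']])

# Get the centroid of the cluster
centroid = kmeans.cluster_centers_[0]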

Is there a better way to do this? Would NLP or RandomForest be an option?


Solution

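  • To find the most "standard" profile without a target variable, clustering is a good idea, but KMeans with a single cluster oversimplifies things: its only centroid is just the column-wise mean of the whole dataset, which is why your Salary output was a simple average. Instead, try KMeans with multiple clusters (e.g., 3-5) and then analyze the centroids to find a representative profile. Each centroid gives you an average profile for that cluster, and the largest cluster is a reasonable candidate for the "standard" one. For a categorical column such as Position, read off the most frequent value within each cluster rather than the averaged category code, which has no meaning.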

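    Alternatively, you could use Principal Component Analysis (PCA). PCA finds the directions along which customers differ the most, so it shows which features actually distinguish customers from one another and which stay roughly constant; the roughly constant ones are the "standard" features across the dataset.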

    RandomForest is a supervised method: it needs a target variable for classification or regression, so it's less useful here. An NLP approach only makes sense if you have a lot of text data; in that case you could try topic modeling (like LDA) to find patterns in descriptions or job titles.

    So, stick with KMeans clustering or PCA for now!
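
    A minimal sketch of the multi-cluster idea, using the sample rows from the question. Several things here are my assumptions rather than anything given in the question: k = 2 (chosen only because the sample has 8 rows), clustering on the scaled numeric columns only, filling missing salaries with the median, and summarizing each cluster with the median Salary/Age and the most frequent Position. The helper `most_common` is a hypothetical name.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Sample rows copied from the question; 'Blank' marks a missing value.
df = pd.DataFrame({
    "Salary": ["10,000", "9,000", "Blank", "10,000", "11,000", "9,500", "Blank", "15,000"],
    "Position": ["IT", "Blank", "Data Scients", "Engineer", "Data Scients", "Doctor", "IT", "Blank"],
    "Age": [32, 27, 21, 51, 52, 48, 21, 30],
})

# Clean up: strip thousands separators, turn 'Blank' into NaN.
df["Salary"] = pd.to_numeric(df["Salary"].str.replace(",", "", regex=False), errors="coerce")
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
df["Position"] = df["Position"].where(df["Position"] != "Blank")

# Cluster on the numeric columns only (assumption): fill gaps with the median,
# then scale so Salary does not dominate Age purely because of its units.
numeric = df[["Salary", "Age"]].fillna(df[["Salary", "Age"]].median())
X = StandardScaler().fit_transform(numeric)

k = 2  # assumption for this tiny sample; try 3-5 on real data
df["cluster"] = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)


def most_common(values):
    """Most frequent non-missing value, or 'unknown' if there is none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else "unknown"


# One row per cluster: size, median of the observed numeric values, top Position.
profiles = df.groupby("cluster").agg(
    size=("Age", "size"),
    Salary=("Salary", "median"),
    Age=("Age", "median"),
    Position=("Position", most_common),
)

# Treat the largest cluster as the "most standard" profile (assumption).
standard = profiles.sort_values("size", ascending=False).iloc[0]
print(profiles)
print(f"The most standard profile is: Salary {standard['Salary']}, "
      f"Position {standard['Position']}, and Age {standard['Age']}.")
```

    Using the median and the most frequent value instead of the raw centroid keeps the profile readable: the centroid lives in scaled units and would average Position codes. Note that the resulting profile may combine values that never occur together in a single row, which is what the question asks for.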