I have a dataset (below) that shows whether each customer is a return customer. For all return customers, I need to map about 25% of them to 'yes 1 purchase' and the remaining 75% to 'yes >1 purchase'. I also need to set a seed so that the result does not change each time I re-run the process.
I looked into NumPy's random functions and its random seed function, but they seem to generate random numbers rather than randomly assign a proportion of the data to a specific category. Can anyone advise on how to do this?
import pandas as pd
import numpy as np
list_customer_name = ['customer1','customer2','customer3','customer4','customer5',
'customer6','customer7','customer8','customer9','customer10','customer11','customer12',
'customer13','customer14','customer15','customer16','customer17','customer18']
list_return_customer = ['yes','yes','yes','yes','yes','yes',
'yes','yes','yes','yes','yes','yes','yes','yes',
'yes','yes','no','no']
df_test = pd.DataFrame({'customer_name': list_customer_name,
'return_customer?':list_return_customer})
The desired output looks like this: 25% of the customers flagged "yes" in the "return_customer?" column (4 customers) are mapped to "yes 1 purchase", and the remaining 75% (12 customers) are mapped to "yes >1 purchase".
The following code seems to match your specifications:
import random
import pandas as pd
random.seed(1234)
list_customer_name = ['customer1','customer2','customer3','customer4','customer5',
'customer6','customer7','customer8','customer9','customer10','customer11','customer12',
'customer13','customer14','customer15','customer16','customer17','customer18']
list_return_customer = ['yes','yes','yes','yes','yes','yes',
'yes','yes','yes','yes','yes','yes','yes','yes',
'yes','yes','no','no']
list_return_customer_final = ["yes >1 purchase" if status == "yes" else "no" for status in list_return_customer]
number_of_yes_1_purchase = 4
while number_of_yes_1_purchase > 0:
    # Pick a random position in the list
    rand_index = random.randint(0, len(list_return_customer_final) - 1)
    # Skip positions that are already relabeled or that belong to a "no" customer
    if list_return_customer_final[rand_index] in ("yes 1 purchase", "no"):
        continue
    list_return_customer_final[rand_index] = "yes 1 purchase"
    number_of_yes_1_purchase -= 1
df_test = pd.DataFrame({'customer_name': list_customer_name,
'return_customer?':list_return_customer,
'return_customer_final': list_return_customer_final})
print(df_test)
I used the random module and set the seed to an arbitrary value with random.seed(1234). Setting the seed makes the random functions produce the same results every time the program runs.
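To illustrate what the seed does, here is a minimal sketch that draws the same kind of random indices twice, re-seeding before each draw. The names first_run and second_run are only illustrative and are not part of the code above.

```python
import random

# Seed, then draw five random indices in the range used above (0..17)
random.seed(1234)
first_run = [random.randint(0, 17) for _ in range(5)]

# Re-seeding with the same value restarts the same sequence
random.seed(1234)
second_run = [random.randint(0, 17) for _ in range(5)]

print(first_run == second_run)  # True
```

Without the second random.seed(1234) call, the second draw would continue the sequence and would normally differ from the first.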
I defined the number of "yes 1 purchase" labels to allocate with the variable number_of_yes_1_purchase. You can hardcode it or compute it from the number of "yes" entries in list_return_customer (but remember to round the result to get an int).
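If you would rather compute the count than hardcode it, one possible sketch is below. It assumes the 25% share applies only to the "yes" entries (16 here), not to the whole list (18 entries, which would give 4.5). The name share_one_purchase is hypothetical.

```python
list_return_customer = ['yes'] * 16 + ['no'] * 2

share_one_purchase = 0.25  # hypothetical name for the target proportion

# Count only the return customers, then take the share and round to an int
number_of_yes = list_return_customer.count('yes')
number_of_yes_1_purchase = round(share_one_purchase * number_of_yes)

print(number_of_yes_1_purchase)  # 4
```

Note that Python's round() rounds exact halves to the nearest even integer (round(4.5) is 4), so use math.ceil or math.floor instead if you need a specific direction.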
With the while loop, I iterate until all of the "yes 1 purchase" labels have been allocated: each time I allocate one, I decrease the remaining count by one with number_of_yes_1_purchase -= 1.
I used rand_index = random.randint(0, len(list_return_customer_final) - 1) to get a random index of the list to set to "yes 1 purchase".
If this index is already a "yes 1 purchase" or a "no", I skip the current iteration with continue.
The loop ends when number_of_yes_1_purchase reaches 0.
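As an alternative to retrying whenever the random index lands on a "no" or an already relabeled entry, the positions can be drawn in one step with random.sample from the standard library. This is a sketch of a different technique, not a rewrite of the loop; the names rng, yes_indices, n_one_purchase and chosen are illustrative.

```python
import random

# A local generator with its own seed, so the global random state is untouched
rng = random.Random(1234)

list_return_customer = ['yes'] * 16 + ['no'] * 2

# Positions of all return customers
yes_indices = [i for i, status in enumerate(list_return_customer) if status == 'yes']

# 25% of the return customers, drawn without replacement
n_one_purchase = round(0.25 * len(yes_indices))
chosen = set(rng.sample(yes_indices, n_one_purchase))

list_return_customer_final = [
    'no' if status != 'yes'
    else 'yes 1 purchase' if i in chosen
    else 'yes >1 purchase'
    for i, status in enumerate(list_return_customer)
]

print(list_return_customer_final.count('yes 1 purchase'))   # 4
print(list_return_customer_final.count('yes >1 purchase'))  # 12
```

Because random.sample draws without replacement from the "yes" positions only, no retries are needed and the number of iterations is fixed.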
If you have any questions, don't hesitate to ask.