python random-seed numpy-random

How to randomly map a proportion of data value to a specific category?


I have the dataset below, which shows whether each customer is a return customer. The end goal: among all return customers, I need to map about 25% to 'yes 1 purchase' and the other 75% to 'yes >1 purchase'. I also need to set a seed so that the result does not change each time I re-run the process.

I looked into NumPy's random functions and the random seed function, but they seem to generate random numbers rather than randomly assign a proportion of the data values to a specific category. Can anyone advise on how to do this?

import pandas as pd
import numpy as np

list_customer_name = ['customer1','customer2','customer3','customer4','customer5',
'customer6','customer7','customer8','customer9','customer10','customer11','customer12',
'customer13','customer14','customer15','customer16','customer17','customer18']
list_return_customer = ['yes','yes','yes','yes','yes','yes',
'yes','yes','yes','yes','yes','yes','yes','yes',
'yes','yes','no','no']

df_test = pd.DataFrame({'customer_name': list_customer_name,
                    'return_customer?':list_return_customer})

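The data looks like this: 18 rows, where customer1 through customer16 are flagged "yes" in the "return_customer?" column and customer17 and customer18 are flagged "no".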

The desired output looks like this: 25% of the customers flagged "yes" in the "return_customer?" column (4 customers) are mapped to "yes 1 purchase", and the remaining 75% (12 customers) are mapped to "yes >1 purchase".


Solution

  • The following code seems to match your specifications:

    import random
    
    import pandas as pd
    
    random.seed(1234)
    
    list_customer_name = ['customer1','customer2','customer3','customer4','customer5',
    'customer6','customer7','customer8','customer9','customer10','customer11','customer12',
    'customer13','customer14','customer15','customer16','customer17','customer18']
    
    list_return_customer = ['yes','yes','yes','yes','yes','yes',
    'yes','yes','yes','yes','yes','yes','yes','yes',
    'yes','yes','no','no']
    
    list_return_customer_final = ["yes >1 purchase" if status == "yes" else "no" for status in list_return_customer]
    
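    # Number of "yes" entries to relabel as "yes 1 purchase" (25% of the 16 "yes" rows).
    number_of_yes_1_purchase = 4
    
    # Pick random positions until enough "yes >1 purchase" entries have been relabelled.
    while number_of_yes_1_purchase > 0:
        rand_index = random.randint(0, len(list_return_customer_final) - 1)
        # Skip positions that are already relabelled or belong to non-return customers.
        if list_return_customer_final[rand_index] == "yes 1 purchase" or list_return_customer_final[rand_index] == "no":
            continue
        list_return_customer_final[rand_index] = "yes 1 purchase"
        number_of_yes_1_purchase -= 1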
    
    df_test = pd.DataFrame({'customer_name': list_customer_name,
                            'return_customer?':list_return_customer,
                            'return_customer_final': list_return_customer_final})
    
    print(df_test)
    

    Explanations:

    I used the random module and set the seed to an arbitrary value with random.seed(1234). Setting the seed makes the random functions produce the same sequence every time the program runs, so the result does not change between runs.
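
To illustrate the effect of the seed, here is a minimal sketch. The helper draw_indices and its parameters are hypothetical names made up for this example; only random.seed and random.randint come from the answer's code.

```python
import random

def draw_indices(seed, count=4, upper=17):
    """Seed the global generator, then draw `count` indices in [0, upper]."""
    random.seed(seed)
    return [random.randint(0, upper) for _ in range(count)]

# Two runs with the same seed give the same sequence of indices.
first_run = draw_indices(1234)
second_run = draw_indices(1234)
print(first_run == second_run)  # True
```

Without the call to random.seed, the two lists would usually differ from one run to the next.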

    I defined the number of "yes 1 purchase" labels to allocate with the variable number_of_yes_1_purchase. You can hardcode it or compute it from the number of "yes" entries in list_return_customer (here 25% of 16, which is 4), but remember to round the result to get an int.
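
If you would rather compute the count than hardcode it, one possible sketch is shown below. The name share_one_purchase is an assumption introduced for this example, and the list is rebuilt in compact form so the snippet runs on its own.

```python
# Same 18 flags as in the question: 16 "yes" followed by 2 "no".
list_return_customer = ['yes'] * 16 + ['no'] * 2

share_one_purchase = 0.25  # assumed name for the 25% proportion

# Count only the "yes" rows, since "no" rows are never relabelled.
number_of_yes = list_return_customer.count("yes")

# round() returns an int here; 16 * 0.25 is exactly 4.0.
number_of_yes_1_purchase = round(number_of_yes * share_one_purchase)
print(number_of_yes_1_purchase)  # 4
```

Note that Python's round() rounds exact halves to the nearest even integer, so with other data sizes the count may be one lower than you expect; use math.ceil or math.floor if you need a specific direction.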

    With the while loop, I loop until all of the "yes 1 purchase" labels have been allocated: each time I allocate one, I decrease the remaining number by one with number_of_yes_1_purchase -= 1.

    I used rand_index = random.randint(0, len(list_return_customer_final) - 1) to get a random index of the list to set to "yes 1 purchase". If this index is already a "yes 1 purchase" or a "no", I skip the current iteration with continue.

    The loop ends when number_of_yes_1_purchase reaches 0.
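
As a possible alternative to the retry loop (a sketch, not the code above), random.sample can pick the required number of distinct "yes" positions in one call, so no index is ever drawn twice. The names rng, yes_positions and chosen are introduced for this example only.

```python
import random

# Same 18 flags as in the question: 16 "yes" followed by 2 "no".
list_return_customer = ['yes'] * 16 + ['no'] * 2

# A local generator with its own seed; the global random state is untouched.
rng = random.Random(1234)

# Positions of the return customers, and how many of them to relabel (25%).
yes_positions = [i for i, status in enumerate(list_return_customer) if status == "yes"]
number_of_yes_1_purchase = round(len(yes_positions) * 0.25)

# Sample distinct positions without replacement.
chosen = set(rng.sample(yes_positions, number_of_yes_1_purchase))

list_return_customer_final = [
    "no" if status != "yes"
    else "yes 1 purchase" if i in chosen
    else "yes >1 purchase"
    for i, status in enumerate(list_return_customer)
]

print(list_return_customer_final.count("yes 1 purchase"))   # 4
print(list_return_customer_final.count("yes >1 purchase"))  # 12
```

The resulting list can be passed to pd.DataFrame as the 'return_customer_final' column in the same way as before.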


    If you have any questions, don't hesitate to ask