I have a NumPy array of shape (1826, 5000) where the rows are my samples and the columns are the features. Each row is a genotype, with the individual nucleotides stored as strings, like this:
[['G' 'G' 'G' ... 'T' 'T' 'A']
['G' 'G' 'G' ... 'A' 'T' 'A']
['A' 'G' 'A' ... 'A' 'T' 'A']
...
['G' 'A' 'G' ... 'T' 'T' 'A']
['G' 'G' 'A' ... 'A' 'T' 'A']
['G' 'G' 'G' ... 'A' 'T' 'C']]
Only two different nucleotides appear in each column.
Now I would like to replace the individual strings with the numbers 0 and 2 in such a way that, in each column, the nucleotide that occurs more frequently gets the number 0 and the nucleotide that occurs less frequently gets the number 2.
For example, in the first column "G" should be replaced by 0 and "A" by 2, since "G" is more frequent.
The result should look like this:
[['0' '0' '0' ... '2' '0' '0']
['0' '0' '0' ... '0' '0' '0']
['2' '0' '2' ... '0' '0' '0']
...
['0' '2' '0' ... '2' '0' '0']
['0' '0' '2' ... '0' '0' '0']
['0' '0' '0' ... '0' '0' '2']]
Can someone tell me how to do this (with the help of Sklearn and Numpy functions)?
Given an array arr, a simple way to solve it with pandas is:
import numpy as np
import pandas as pd
df = pd.DataFrame(arr)
for column in df:
    # mode()[0] is the most frequent nucleotide in this column
    df[column] = np.where(df[column] == df[column].mode()[0], "0", "2")
arr1 = df.to_numpy()
Explanation: First, you turn the array into a pandas DataFrame. Then, for each column, you replace the mode (the most frequent nucleotide) with "0" and the other value with "2". Finally, you convert the DataFrame back into an array named arr1.
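Since the question asks for NumPy functions, here is a minimal NumPy-only sketch of the same idea without the per-column loop. It relies on the stated fact that each column contains at most two distinct nucleotides. The function name encode_minor_allele and the small demo array are made up for illustration. It returns integers rather than the strings '0' and '2', and on an exact tie it treats the first row's nucleotide as the more frequent one; both are assumptions, not something the question specifies.

```python
import numpy as np

def encode_minor_allele(arr):
    """Encode the more frequent nucleotide of each column as 0, the other as 2.

    Assumes every column holds at most two distinct values.
    """
    arr = np.asarray(arr)
    # True where a cell holds the same nucleotide as the first row of its column.
    same_as_first = arr == arr[0]
    # The first row's nucleotide counts as the major one if it fills
    # at least half of the column (ties go to the first row; an assumption).
    first_is_major = same_as_first.sum(axis=0) * 2 >= arr.shape[0]
    # A cell holds the major nucleotide exactly when the two flags agree.
    is_major = same_as_first == first_is_major
    return np.where(is_major, 0, 2)

# Hypothetical 4 x 3 example; in the last column the first row holds the rarer nucleotide.
demo = np.array([["G", "G", "A"],
                 ["G", "A", "T"],
                 ["A", "G", "T"],
                 ["G", "G", "T"]])
encoded = encode_minor_allele(demo)
print(encoded)
```

If the string form shown in the question is needed, encoded.astype(str) converts the result. For the full 1826 x 5000 array this avoids the DataFrame round trip, at the cost of depending on the two-values-per-column assumption.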