python-3.x pandas numpy discretization

Finding equal frequency from discrete data


I have to find equal-width bins from time-series data.

So far, I can do it by manually selecting every single column and then applying the condition, but I need a faster way to do it.

The time-series data:

Time    ulaR        trxA
0       0.6457325   0.4040438
50      0.4594477   0.4172161
100     0.4244469   0.3878299
150     0.391452    0.49735
200     0.3570379   0.4930038
250     0.3730624   0.4221448
300     0.3676819   0.3796647
350     0.3688949   0.4228213
400     0.4018654   0.439482
450     0.3934677   0.4039933
500     0.3571651   0.3264575
550     0.5451287   0.3471816
600     0.6520524   0.3710635
650     0.6776012   0.4173777
700     0.684412    0.3812378
750     0.7298819   0.3735065
800     0.739083    0.3195176
850     0.6394782   0.213515
900     0.6483277   0.3721211
950     0.7003584   0.3528451
1000    0.6926971   0.3867717

My Code:

import numpy as np
import pandas as pd

infile = "Ecoli-1_dream4_timeseries.tsv"
outfile = "tempecoli.csv"
df = pd.read_csv(infile, delimiter="\t", dtype=float)

# Column 'ulaR': for each value in sorted order, find its original row
# position j and assign a group based on j.
a1 = np.array(df['ulaR'])
s = df.sort_values(['ulaR'])
s1 = np.array(s['ulaR'])
gr = list()

for i in range(len(s1)):
    for j in range(len(a1)):
        if s1[i] == a1[j]:
            if j <= 7:
                gr.append(0)
            elif j > 7 and j <= 14:
                gr.append(1)
            else:
                gr.append(2)

# Column 'trxA': the same procedure, repeated by hand.
a1 = np.array(df['trxA'])
s = df.sort_values(['trxA'])
s1 = np.array(s['trxA'])
gr1 = list()

for i in range(len(s1)):
    for j in range(len(a1)):
        if s1[i] == a1[j]:
            if j <= 7:
                gr1.append(0)
            elif j > 7 and j <= 14:
                gr1.append(1)
            else:
                gr1.append(2)

# Combine the group labels of both columns and write them to a CSV file.
group1 = pd.Series(gr, name="ulaR")
group2 = pd.Series(gr1, name="trxA")
df2 = pd.concat([group1, group2], axis=1)
df2.to_csv("ecoli1.csv")
print("Completed")

If you run this code, you will get the result. I do not want a different result; all I want is more time-efficient code that produces the same output, because writing out the name of each column and then applying the conditions takes a lot of time. Any help will be appreciated. Thanks in advance.


Solution

  • You can use argsort along axis=0 to get, for each column, the original row positions of the values in sorted order, then np.digitize those positions with your bin edges to get the three values 0, 1 or 2, as in your case:

    l_col = ['ulaR', 'trxA']
    # -1 as the first bound ensures that 0 falls in the same bin as 1 to 7
    bins = [-1., 7., 14., np.inf]
    # np.digitize numbers the bins from 1, so subtract 1 to get labels starting at 0
    df2 = pd.DataFrame(np.digitize(df[l_col].values.argsort(axis=0), bins, right=True) - 1,
                       columns=l_col)
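
To check that this gives the same labels as the nested loops in the question, here is a minimal sketch that runs both on the sample data posted above. The helper names `bin_by_loop` and `bin_by_argsort` are made up for this illustration, and the data is read from an inline string instead of the original .tsv file. It assumes the values within a column are unique, which holds for the sample (with duplicates the loop version would append extra labels).

```python
import io

import numpy as np
import pandas as pd

# Sample rows copied from the question; the real file is tab-separated,
# here any whitespace is accepted as the separator.
SAMPLE = """\
Time ulaR trxA
0 0.6457325 0.4040438
50 0.4594477 0.4172161
100 0.4244469 0.3878299
150 0.391452 0.49735
200 0.3570379 0.4930038
250 0.3730624 0.4221448
300 0.3676819 0.3796647
350 0.3688949 0.4228213
400 0.4018654 0.439482
450 0.3934677 0.4039933
500 0.3571651 0.3264575
550 0.5451287 0.3471816
600 0.6520524 0.3710635
650 0.6776012 0.4173777
700 0.684412 0.3812378
750 0.7298819 0.3735065
800 0.739083 0.3195176
850 0.6394782 0.213515
900 0.6483277 0.3721211
950 0.7003584 0.3528451
1000 0.6926971 0.3867717
"""

df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+", dtype=float)


def bin_by_loop(values):
    """Hypothetical helper: the question's nested loops for one column."""
    a1 = np.array(values)
    s1 = np.sort(a1)
    groups = []
    for i in range(len(s1)):
        for j in range(len(a1)):
            if s1[i] == a1[j]:
                # j is the original row position of the i-th smallest value
                groups.append(0 if j <= 7 else 1 if j <= 14 else 2)
    return groups


def bin_by_argsort(frame, columns):
    """Hypothetical helper: argsort + digitize for all columns at once."""
    bins = [-1.0, 7.0, 14.0, np.inf]
    # Row i of `positions` holds the original row position of the
    # i-th smallest value in each column.
    positions = frame[columns].to_numpy().argsort(axis=0)
    labels = np.digitize(positions, bins, right=True) - 1
    return pd.DataFrame(labels, columns=columns)


columns = ["ulaR", "trxA"]
fast = bin_by_argsort(df, columns)
slow = pd.DataFrame({c: bin_by_loop(df[c]) for c in columns})

# Element-wise comparison of the two results.
same = bool((fast.to_numpy() == slow.to_numpy()).all())
print("identical results:", same)
```

Note that in both versions row i of the output describes the i-th smallest value of the column, not the i-th time point. To process every column except Time, something like `columns = df.columns.drop("Time").tolist()` should work in place of the hand-written list.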