python pandas dataframe logic data-analysis

For each column of a DataFrame, how do I find the total duration for which each unique value was present?


For example, consider this df:

   Time     colA    colB
0     1.1      2     2
1     2.2      2     2
2     3.4      3     5
3     4.5      3     5
4     5.6      4     5
5     6.2      4     6
6     7.4      4     6
7     8.5      2     6
8     9.8      2     5 
9     10.1     2     5
10    11.2     2     5

The output I am expecting is a report CSV file with the following columns:

Col_name    unique_value   Duration

colA             2             3.8s 
colA             3             1.1s
colA             4             1.8s
colB             2             1.1s
colB             5             3.6s
colB             6             2.3s

For example, to calculate the duration for colA:

unique value = 2
Duration = [time difference over the 1st consecutive run of 2 (2.2 - 1.1)] + [time difference over the 2nd consecutive run of 2 (11.2 - 8.5)] = 1.1 + 2.7 = 3.8s
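As a sanity check on the arithmetic above, here is a minimal sketch in plain Python. It assumes that a run's duration is the last timestamp minus the first timestamp within that run; `run_durations` is a hypothetical helper name, not something from pandas.

```python
from itertools import groupby

# Time and colA values copied from the example df above
times = [1.1, 2.2, 3.4, 4.5, 5.6, 6.2, 7.4, 8.5, 9.8, 10.1, 11.2]
col_a = [2, 2, 3, 3, 4, 4, 4, 2, 2, 2, 2]

def run_durations(times, values, target):
    """Return (last time - first time) for each consecutive run of `target`."""
    spans = []
    # itertools.groupby groups only adjacent rows that share the same value
    for value, group in groupby(zip(times, values), key=lambda pair: pair[1]):
        if value == target:
            run_times = [t for t, _ in group]
            spans.append(run_times[-1] - run_times[0])
    return spans

spans = run_durations(times, col_a, 2)
total = round(sum(spans), 1)  # rounded to hide floating-point noise
print(spans, total)
```

The two spans correspond to the two bracketed terms in the calculation, and their rounded sum is the 3.8s shown for colA = 2.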

One approach I tried:

  1. Create a new column that compares consecutive values in a column and gives True if a value equals the previous one and False if it differs.

df["answer"] = df['colA'].diff().eq(0)

  2. As the next step, I was planning to collect all the False rows in one list and all the True rows in another, and take the difference between the lists.

  3. What I am confused about is how to link these with the unique values.

Please help me figure out whether this logic works or whether I should change it.


Solution

  • *Creating a new column that flags consecutive values with True and False is a good start. However, to calculate the duration for each unique value in each column, you can use the following steps: iterate over each unique value in each column; for each unique value, find its consecutive occurrences and calculate the duration; store the results in a new DataFrame.

    import pandas as pd
    
    # Sample DataFrame
    data = {
        'Time': [1.1, 2.2, 3.4, 4.5, 5.6, 6.2, 7.4, 8.5, 9.8, 10.1, 11.2],
        'colA': [2, 2, 3, 3, 4, 4, 4, 2, 2, 2, 2],
        'colB': [2, 2, 5, 5, 5, 6, 6, 6, 5, 5, 5]
    }
    
    df = pd.DataFrame(data)
    
    # Function to calculate duration for each unique value in a column
    def calculate_duration(column_name):
        durations = []
        unique_values = df[column_name].unique()
        for value in unique_values:
            # Find consecutive occurrences of the value
            consecutive_indices = df[df[column_name] == value].index.to_list()
            consecutive_occurrences = []
            current_occurrence = [consecutive_indices[0]]
            for i in range(1, len(consecutive_indices)):
                if consecutive_indices[i] - consecutive_indices[i-1] == 1:
                    current_occurrence.append(consecutive_indices[i])
                else:
                    consecutive_occurrences.append(current_occurrence)
                    current_occurrence = [consecutive_indices[i]]
            consecutive_occurrences.append(current_occurrence)
            
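            # Sum the durations of all consecutive runs of this value
            total_duration = 0.0
            for occurrence in consecutive_occurrences:
                # The stored indices are index labels, so look them up with .loc
                start_time = df.loc[occurrence[0], 'Time']
                end_time = df.loc[occurrence[-1], 'Time']
                total_duration += end_time - start_time
            # Round to suppress floating-point noise in the summed value
            durations.append((value, round(total_duration, 6)))
        return durations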
    
    # Collect one row per (column, value) pair; DataFrame.append was removed in pandas 2.0
    rows = []
    for column in df.columns[1:]:
        for value, duration in calculate_duration(column):
            rows.append({'Col_name': column, 'unique_value': value, 'Duration': duration})
    
    # Build the report DataFrame from the collected rows
    report_df = pd.DataFrame(rows, columns=['Col_name', 'unique_value', 'Duration'])
    
    # Export to CSV
    report_df.to_csv('report.csv', index=False)
    

    Finally, export the DataFrame to a CSV file.*
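For comparison, the same report can probably be built without explicit index loops, starting from the "did the value change?" idea in the question. The sketch below is one possible vectorized variant, not the method above: it labels each consecutive run with `shift` and `cumsum`, takes the first and last `Time` per run, and then sums the run durations per value. The helper name `duration_report`, the `run_id` label, and the rounding to 6 decimals are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Time": [1.1, 2.2, 3.4, 4.5, 5.6, 6.2, 7.4, 8.5, 9.8, 10.1, 11.2],
    "colA": [2, 2, 3, 3, 4, 4, 4, 2, 2, 2, 2],
    "colB": [2, 2, 5, 5, 5, 6, 6, 6, 5, 5, 5],
})

def duration_report(df, time_col="Time"):
    """Hypothetical helper: total run duration per unique value, per column."""
    parts = []
    for col in df.columns.drop(time_col):
        # A new run starts wherever the value differs from the previous row;
        # the cumulative sum of those change flags gives each run its own id
        run_id = df[col].ne(df[col].shift()).cumsum().rename("run_id")
        runs = df.groupby(run_id).agg(
            unique_value=(col, "first"),
            start=(time_col, "first"),
            end=(time_col, "last"),
        )
        runs["Duration"] = runs["end"] - runs["start"]
        # Add up the runs of each value, keeping first-appearance order
        totals = (
            runs.groupby("unique_value", sort=False)["Duration"]
            .sum()
            .round(6)  # assumption: 6 decimals is enough to hide float noise
            .reset_index()
        )
        totals.insert(0, "Col_name", col)
        parts.append(totals)
    return pd.concat(parts, ignore_index=True)

report = duration_report(df)
print(report)
```

Calling `report.to_csv("report.csv", index=False)` afterwards would write the file. If the `s` suffix from the expected output is wanted, it could be appended by converting `Duration` to strings before export, at the cost of losing the numeric dtype.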