pandas dataframe plot time-series timeserieschart

Plotting timeseries data with multiple categories

I have a dataset from a production line, which is formatted as time series data. There is a batch column, which indicates the name of the batch (str), and there is a phase column which indicates the phase of the production (str). I am working with the datetime as the index of the pandas DataFrame.

I want to plot this data on timeseries graph, overlaying the data from each phase and distinguishing each batch (i.e. different colour), with each process variable (i.e. temp1, temp2, press1, press2) on a different axis (as per the diagram) How can this be done?

EDIT: for clarity, I need the trends to be plotted against a datetime baseline, otherwise they will not overlay.

Example of the dataset: | datetime | temp1 | temp 2|press1|press2|batch | phase | |:---- |:--: | :--: | :--: | :--: |:--: |:--: | | 2023-02-03 15:45:34| 34.45 | 23.34 | 13.23| 45.5 | 'D' | '10-Wait' | | ... | ... | ... | ... | ... | 'D' | ... | | 2023-02-03 15:55:34| 36.55 | 22.14 | 18.23| 78.5 | 'D' | '20-Initialise'|

To create a similar dataset -to mine- you can use the following code:

import numpy as np
import pandas as pd
import datetime  

date = pd.date_range(start='1/1/2023', end='10/06/2023', freq=datetime.timedelta(seconds=30))
tags = ['temp1','temp2','press1','press2']
data=np.random.rand(len(date),len(tags))
df=pd.DataFrame(data,columns=tags).set_index(date)

batches = ['A','B','C','D','E','F','G']
n=len(batches)
period_start = pd.to_datetime('1/1/2023')
period_end = pd.to_datetime('10/06/2023')
batch_start = (pd.to_timedelta(np.random.rand(n) * ((period_end - period_start).days + 1), unit='D') + period_start)
batch_end = (batch_start + pd.to_timedelta(8,unit='H'))

df_batches = pd.DataFrame(data=[batch_start,batch_end],columns=[batches],index=['start','end']).T

for item in batches:
    start_time = df_batches['start'][item]
    end_time = df_batches['end'][item]
    df.loc[((df.index>=start_time)&(df.index<=end_time)), 'batch'] = item
df.dropna(subset=['batch'],inplace=True)

df['phase']=''
phases = ['10-Wait','20-Initialise','30-Warm','40-Running']

for batch in batches:
    wait_len = int(len(df[df['batch']==batch].index)*0.2)
    init_len = int(len(df[df['batch']==batch].index)*0.4)
    warm_len = int(len(df[df['batch']==batch].index)*0.6)
    run_len = int(len(df[df['batch']==batch].index))
   
    wait_start = df[df['batch']==batch].index[0]
    wait_end = df[df['batch']==batch].index[wait_len]
    init_end = df[df['batch']==batch].index[init_len]
    warm_end = df[df['batch']==batch].index[warm_len]
    run_end = df[df['batch']==batch].index[-1]  
 
    df['phase'].loc[wait_start:wait_end] = phases[0]
    df['phase'].loc[wait_end:init_end] = phases[1]
    df['phase'].loc[init_end:warm_end] = phases[2]
    df['phase'].loc[warm_end:run_end] = phases[3]

df.to_csv('stackoverflowqn.csv')

Solution

Credit to @AvishWagde who definitely broke the back of the problem. The 1 missing ingredient was having the x-axis of each plot baselined against zero.

The solution to baselining these plots was to create a new Timedelta column which starts from 00:00:00 and goes upwards, in increments of 00:00:30.

In Avish's code he uses:

for batch in batches:
    batch_data = df[(df['batch'] == batch) & (df['phase'] == phase)]
    axes[row, col].plot(batch_data.index, batch_data[variable], label=batch, marker='o')

However, since the Dataframe index is a Datetime, plotting this on the x-axis will not result in a comparison of the data. As stated they need to be plotted against a baseline. Using Timedelta on the x-axis allows comparison of the process data in each phase. In this case the 00:00:00 is taken to be the start of each phase. This dataset was recorded at 30s intervals, and it is necessary to convert the Timedelta from Index to Series, as per this line pd.to_timedelta(np.arange(0,len(batch_data)*30,30),unit='s').to_series() which results in this slight change:

for batch in batches:
    batch_data = df[(df['batch'] == batch) & (df['phase'] == phase)] 
    baseline_time = pd.to_timedelta(np.arange(0,len(batch_data)*30,30),unit='s').to_series()
    batch_data = batch_data.set_index(baseline_time)
    axes[row, col].plot(batch_data.index, batch_data[variable], label=batch, marker='')

For the full working code:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('stackoverflowqn.csv',index_col=[0])
df.index = pd.to_datetime(df.index)

sns.set_style("whitegrid")

phases = df['phase'].unique()
batches = df['batch'].unique()
variables = ['temp1', 'temp2', 'press1', 'press2']  # List of process variables

num_cols = 2  # Number of columns for the subplot grid
num_rows = (len(variables) + num_cols - 1) // num_cols

for phase in phases:
    fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15, 6 * num_rows))
    plt.subplots_adjust(hspace=0.5)
    plt.suptitle(phase, y=1.02)

    for idx, variable in enumerate(variables):
        row = idx // num_cols
        col = idx % num_cols

        for batch in batches:
            batch_data = df[(df['batch'] == batch) & (df['phase'] == phase)]
            baseline_time = pd.to_timedelta(np.arange(0,len(batch_data)*30,30),unit='s').to_series()
            batch_data = batch_data.set_index(baseline_time)
            axes[row, col].plot(batch_data.index, batch_data[variable], label=batch, marker='')

        axes[row, col].set_title(variable)
        axes[row, col].set_xlabel('time')
        axes[row, col].set_ylabel(variable)
        axes[row, col].legend()

    plt.tight_layout()
    plt.show()

Wait: 2 x variables Initialise: 2 x variables
Warm: 2 x variables