pythonmatplotlibdatetimescikit-learntime-series

plotting only points that are stable from a time series and perform linear regression


I have two questions that I put in one help request. so I hope it will not make this overcrowded.

I spend quite some time figuring this out but not successful so far. I am trying to plot only points from a series of data that are close to each other so not the transitions (see below plot). Probably I need an if condition that says if x(2)-x(1)<0.005 plot and if not do not plot. Later I want to do a linear regression on these points (thats why I want to exclude transitions). Can you please help me how to do the plotting with this condition and do linear regression.

enter image description here

This is my code:


# which value you want to use or plot reading from log data
desired_field1= "x y Box"
desired_value1 = "x1 [um]"
desired_field2= "x y Box"
desired_value2 =  "x3 [um]"
desired_field3= "LM Position"
desired_value3 =  "Z [um]"


# extracting desired data from logging
data=pd.read_excel(r"test2.xlsx", sheet_name='Sheet1')
data = data[(data['_time'] < '2024-05-21T09:49:37.6089875Z') & (data['_time'] > '2024-05-21T09:43:31.7141954Z')] #selecting desired time interval
data_measurement1 = data.loc[data['_measurement'] == desired_field1]
data_field1 = data_measurement1.loc[data['_field'] == desired_value1]
data_measurement2 = data.loc[data['_measurement'] == desired_field2]
data_field2 = data_measurement2.loc[data['_field'] == desired_value2]
data_measurement3 = data.loc[data['_measurement'] == desired_field3]
data_field3 = data_measurement3.loc[data['_field'] == desired_value3]
values1 = list(data_field1['_value']) #values we are interested in
values2 = list(data_field2['_value'])
values3 = list(data_field3['_value'])
#....

mean_xs = [(g + h) / 2 for g, h in zip(values1, values2)]
LM_mean = [50-x for x in mean_xs]

#start plotting
data_field1['_time'] = pd.to_datetime(data_field1['_time'].str.split().str[-1])
data_field2['_time'] = pd.to_datetime(data_field2['_time'].str.split().str[-1])
data_field3['_time'] = pd.to_datetime(data_field3['_time'].str.split().str[-1])
plt.plot(data_field1['_time'], values1, '-', label = desired_value1)
plt.plot(data_field2['_time'], values2, '-', label = desired_value2 )
plt.plot(data_field3['_time'], values3, '-', label = desired_value3)


plt.xlabel('time [D hh:mm]')
plt.ylabel(' x [um] MCS')
plt.legend(loc='best')
plt.gca().yaxis.grid(True)

plt.figure()
plt.plot(LM_mean, values3, 'o')

example data:

9988   2024-05-21T09:46:00.1164445Z  1294.005333
9989   2024-05-21T09:46:01.1115275Z  1294.005333
9990   2024-05-21T09:46:02.1254956Z  1294.005667
9991   2024-05-21T09:46:03.1191685Z  1294.005667
9992   2024-05-21T09:46:04.1325494Z  1294.005333
9993   2024-05-21T09:46:05.1268794Z  1294.005333
9994   2024-05-21T09:46:06.1409297Z  1294.005333
9995   2024-05-21T09:46:07.1346292Z  1294.005000
9996   2024-05-21T09:46:08.1488069Z  1294.005333
9997   2024-05-21T09:46:09.1417524Z  1294.005333
9998   2024-05-21T09:46:10.1563002Z  1294.005333
9999   2024-05-21T09:46:11.1692835Z  1294.005333
10000  2024-05-21T09:46:12.1642492Z  1332.747333
10001  2024-05-21T09:46:13.1977216Z  1344.011333
10002  2024-05-21T09:46:14.1926256Z  1344.012000
10003  2024-05-21T09:46:15.2062685Z  1344.011667
10004  2024-05-21T09:46:16.2200463Z  1344.011667
10005  2024-05-21T09:46:17.2339343Z  1344.012000
10006  2024-05-21T09:46:18.2479639Z  1344.012000
10007  2024-05-21T09:46:19.2405515Z  1344.012000
10008  2024-05-21T09:46:20.2556817Z  1344.012000

I have tried searching for this but was not successful

Update on how I solved this, note I have two time series and wanted to plot them against each other for same timestep:

idxSteps = np.array(np.where(np.abs(x.diff()) > 0.05)) 
idxSteps2 = np.array(np.where(np.abs(y.diff()) > 0.05))
previousStep = 0 # init the previous step
xIdx = list()
xMid = list() # init list for x
yMid = list() # ==/== for y
i=0
for currentStep in idxSteps[0]:# for every step detected
   if currentStep != previousStep:
           if currentStep in idxSteps2[0]:# and if they are not the same as previousStep
               dummyX = x[previousStep:currentStep] # get the data between these two steps
               dummyY = y[previousStep:currentStep] # ==/==
               xIdx.append((previousStep+currentStep)//2)
               xMid.append(dummyX.mean())
               yMid.append(dummyY.mean()) # append the statistical mode (most recurring value)
               previousStep = currentStep
               assign previousStep as currentStep for the next loop

Solution

  • Sorry for taking too long, I hope I can assist you further with the following adjusted answer.

    Here's the code as a dump, I will explain the code step by step in the section after the code:

    %matplotlib notebook
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import matplotlib.dates as mdates 
    df = pd.read_excel("Data.xlsx")
    df["Datetime"] = pd.to_datetime(df["Datetime"])
    idxSteps = np.array(np.where(np.abs(df["Data"].diff()) > 20)) # search for drops as the abs of the difference
    previousStep = 0 # init the previous step
    xIdx = list()
    xMid = list() # init list for x
    yMid = list() # ==/== for y
    for currentStep in [idxSteps[0], len(df.index)]: # for every step detected
        if not isinstance(currentStep, int):
            currentStep= currentStep[0]
        if currentStep != previousStep: # and if they are not the same as previousStep
            dummyX = df["Datetime"].loc[previousStep:currentStep] # get the data between these two steps
            dummyY = df["Data"].loc[previousStep:currentStep] # ==/==
            xIdx.append((previousStep+currentStep)//2)
            xMid.append(dummyX.mean()) # append to the average of the date to the x list
            yMid.append(dummyY.mode()) # append the statistical mode (most recurring value)
            previousStep = currentStep # assign previousStep as currentStep for the next loop
    plt.figure()
    plt.plot(df["Datetime"], df["Data"])
    plt.scatter(xMid, yMid, marker = "x", color = "red")
    fit = np.polyfit(xIdx, yMid, 1) # fit
    plt.plot(df["Datetime"], np.polyval(fit, df["Datetime"].index), "k--") # plot fit
    locator = mdates.AutoDateLocator(minticks = 3, maxticks = 7)
    formatter = mdates.ConciseDateFormatter(locator)
    plt.gca().xaxis.set_major_locator(locator) 
    plt.gca().xaxis.set_major_formatter(formatter) 
    plt.grid()
    

    To start off, I saved the data you included in the post in an excel file under the following structure:

    excel

    I can read this with the pd.read_excel() function, keeping in mind that the datetime column needs to be converted with pd.to_datetime()

    Then we basically do the following:

    Some important information:

    1. The reason why we save the index, is because we want to use it for the regression, as regressions don't like datetime formats
    2. To get the y coordinates, I chose the statistical mode: the most recurring value
    3. To get the x coordinates, I chose the mean of the section. If you want the x coordinate to be defined at the beginning of the jump, then go with min(). Alternatively, at the end, go with max()

    Here's how the end result looks like:

    plot

    A note to you: please read and understand my code. Don't just copy paste and hope for the best. Check for the intermediate data that I am generating with the code.

    This code so far only works for the section of data that you posted. It might not work on others, which is why you understanding how this solution was thought out is important! So you can come up with your own solution!