python, machine-learning, recurrent-neural-network, softmax, data-handling

How do I preprocess my data when I have too much but need it all?


I am literally months out of college with a CS BS and my boss is having me build a machine learning agent to classify data into 23 categories from scratch all by myself in two months. I took a single Intro to AI class, and we didn't even cover neural networks. I think I've got the basics figured out, but I'm having trouble preparing my data for feeding into the model.

Feel free to comment on the (un)feasibility of this, but it's contextual info and not what my question is about. An example of the type of data I have for a powerstrip-type device is 1 column DeviceID (a string of numbers, unique per device), 12 columns of various integers indicating which outlets are being used and how much power is being pulled, and an integer indicating which location the device is at. I have oodles of this type of data, and I've been thinking I could use an RNN with a softmax layer to classify into my 23 categories. This will be supervised learning: the columns mentioned will be the input, and an integer 1-23 will be the output. I need the model to look at a timeframe and categorize it, and a timeframe contains a varying number of rows, both because the number of devices varies and because each device creates a row twice per minute. For example,

ID      1   2   3   4   5    RSSI  Temperature  R_ID  TimeStamp
43713   0   0   0   0   118  -82   97           45    2019-08-27 15:38:00.387
49945   0   0   5   0   0    -88   89           45    2019-08-27 15:38:00.493
43711   0   0   0   0   5    -65   120          45    2019-08-27 15:38:00.557
43685   12  4   0   0   0    -76   110          45    2019-08-27 15:38:01.807
44041   0   0   0   12  0    -80   104          45    2019-08-27 15:38:02.277

My problem is this: I pulled one sample timeframe of 35 minutes from our SQL database -- timeframes can vary from 1 minute to several hours -- and got 3,747 distinct rows. This is clearly way too much to feed the model as 1 sample. If the usage on the powerstrip doesn't change from 1 minute to the next, it creates several rows that are identical except for the timestamp. When I removed the timestamp, I got 333 distinct rows. That still seems like an awful lot, and it throws away the time data I need.

My questions are these: Is that actually too much? I know from my googling that I can make the model take several rows as one sample, but can I do it when I don't know how many rows in advance? I.e., instead of saying "look at X rows," can I say "look at X minutes of rows" as 1 sample? What would an experienced dev (or data scientist? Idek) do in a situation like this? As an alternative approach, instead of trying to work with the timeframes (which are predetermined by the data/work we're doing), I was thinking I might try a sliding window of [please advise] minutes, get the output from each window, and use those outputs as input to get the output for the whole timeframe. Is that a terrible idea? Would it even work? The model needs to be able to detect differences due to time of day, different people, etc.

Thanks!


Solution

  • New answer

    Here is a toy example of how to do the compression in Python:

    import pandas as pd
    import numpy as np
    
    # Inputs
    feature_cols = list(range(1, 13))
    n_samples = 1000
    
    # Data prep: 12 sparse outlet columns for a single device
    df = pd.DataFrame({col: np.random.choice([0, 1], size=n_samples, p=[0.99, 0.01])
                       for col in feature_cols})
    df['ID'] = '1234'
    df['TimeStamp'] = pd.date_range(end='2019-12-04', freq='30s', periods=n_samples)
    
    # Compressed df: keep a row only if some feature changed since the device's
    # previous reading. diff() is NaN on each device's first row, so fillna(1)
    # keeps it, and abs() stops offsetting changes from cancelling to zero.
    changed = df.groupby('ID')[feature_cols].diff().fillna(1).abs().sum(axis=1) != 0
    comp_df = df.loc[changed, :]
    
    # Results
    n_comp = len(df.index) - len(comp_df.index)
    print('Data compressed by {} rows ({}%)'.format(n_comp, round(n_comp / len(df.index) * 100, 2)))
    

    As I noted in the comments, you really should be doing this upstream on the DB to avoid shipping unnecessary data.

    As to the machine learning, I think you're getting way ahead of yourself. Start with simpler models like a Random Forest; you could then work up to gradient-boosted methods like GBM or XGBoost. A net will be much less interpretable, and since you mentioned you don't have a firm grasp of the concepts yet, I'd start small; you'd hate to be asked to interpret a result from an almost-uninterpretable model built with a method you don't even fully understand. :) A minimal baseline along those lines is sketched below.
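    Here's a rough sketch of that kind of baseline, assuming scikit-learn is available. Everything in it is illustrative, not prescriptive: the mean/max-per-outlet-plus-duration featurization is just one way to collapse a variable-length timeframe into a fixed-length vector a tree model can consume, and the synthetic frames stand in for your real labelled timeframes.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    
    feature_cols = list(range(1, 13))
    
    def featurize(frame):
        # Fixed-length summary of a variable-length timeframe:
        # mean and max per outlet column, plus the frame's duration in seconds.
        return np.concatenate([
            frame[feature_cols].mean().to_numpy(),
            frame[feature_cols].max().to_numpy(),
            [(frame['TimeStamp'].max() - frame['TimeStamp'].min()).total_seconds()],
        ])
    
    # Toy data: 200 timeframes of random length with made-up labels 1-23
    rng = np.random.default_rng(0)
    frames, labels = [], []
    for _ in range(200):
        n = int(rng.integers(10, 100))
        f = pd.DataFrame({c: rng.choice([0, 5, 12], size=n, p=[0.9, 0.05, 0.05])
                          for c in feature_cols})
        f['TimeStamp'] = pd.date_range('2019-08-27', freq='30s', periods=n)
        frames.append(f)
        labels.append(int(rng.integers(1, 24)))
    
    X = np.stack([featurize(f) for f in frames])
    y = np.array(labels)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    print('Held-out accuracy:', clf.score(X_test, y_test))

    This also sidesteps the variable-row-count problem entirely: the model never sees raw rows, only per-timeframe summaries, so it doesn't care whether a frame spans 1 minute or several hours.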

    Previous answer

    Okay, so if I understand correctly, your data are generated twice a minute per device regardless of whether the state of the outlets changes, and thus have lots of "duplicates". For a 35-minute sample you got 3,747 records -- so roughly 1.78 records per second (3,747 / 2,100 seconds).

    First, that's not too much data at all. Generally, more data is better, constrained by your computing power of course, and a decent laptop should handle hundreds of thousands of rows without breaking a sweat. That said, if any of your data are not meaningful (or are invalid, malformed, etc.), then preprocess them out. To put a finer point on it: by including all of those duplicate records you are biasing the model towards the periods that produce many duplicates because, naturally, they contribute more samples and thus have a greater influence on model performance.

    Okay, so we're doing preprocessing -- where should it be done? In the case of unnecessary data (your duplicates), ideally as far up the pipeline as possible so you're not shipping around useless data. Answer: do it when you pull from the database. The exact method depends on your database/SQL dialect, but you want to use whatever windowing functionality you have to filter out those duplicates. One solution is sketched below:
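    This is only a sketch, assuming a dialect with window functions (PostgreSQL syntax here) and a table named readings with outlet columns c1-c3 standing in for your twelve -- all of those names are placeholders:

    -- Keep a reading only if some outlet value changed since the device's
    -- previous reading; each device's first reading has NULL LAG values,
    -- so IS DISTINCT FROM keeps it as well.
    SELECT ID, c1, c2, c3, RSSI, Temperature, R_ID, TimeStamp
    FROM (
        SELECT r.*,
               LAG(c1) OVER (PARTITION BY ID ORDER BY TimeStamp) AS prev_c1,
               LAG(c2) OVER (PARTITION BY ID ORDER BY TimeStamp) AS prev_c2,
               LAG(c3) OVER (PARTITION BY ID ORDER BY TimeStamp) AS prev_c3
        FROM readings r
    ) s
    WHERE c1 IS DISTINCT FROM prev_c1
       OR c2 IS DISTINCT FROM prev_c2
       OR c3 IS DISTINCT FROM prev_c3;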

    If you're using SQL you can create a view with the above logic and query it just as you would the original data (adding whatever windowing/filtering you want).

    Also, some StackOverflow tips since I can't comment yet: