python · tensorflow · google-app-engine · google-cloud-platform · gcp-ai-platform-training

How to upload my training data into Google for TensorFlow cloud training


I want to train my Keras model on GCP.

My code:

This is how I load the dataset:

dataset = pandas.read_csv('USDJPY.fx5.csv', usecols=[2, 3, 4, 5], engine='python')

This is how I trigger the cloud training:

job_labels = {"job": "forex-usdjpy", "team": "xxx", "user": "xxx"}
tfc.run(requirements_txt="./requirements.txt",
        job_labels=job_labels,
        stream_logs=True
        )

This is placed right before my model, which shouldn't make much of a difference:

model = Sequential()
model.add(LSTM(4, input_shape=(1, 4)))
model.add(Dropout(0.2))
model.add(Dense(4))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=1, batch_size=1, verbose=2)

Everything works: the Docker image for my model is created, but the USDJPY.fx5.csv file is not uploaded with it, so I get a file-not-found error.

What is the proper way of loading custom files into the training job? I uploaded the training data to an S3 bucket, but I wasn't able to tell Google to look there.


Solution

  • Turns out it was a problem with my GCP configuration. Here are the steps I took to make it work:
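
    One of those steps is getting the training data into a Cloud Storage bucket so the remote job can read it. A minimal sketch using the google-cloud-storage client ("<bucket>" is a placeholder for your bucket name; gsutil cp does the same from the command line):

    from google.cloud import storage

    # Upload the CSV so the remote job can read it via the gs:// path used below.
    client = storage.Client()
    bucket = client.bucket("<bucket>")
    bucket.blob("USDJPY.fx5.csv").upload_from_filename("USDJPY.fx5.csv")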

    Since you are uploading the Python file to GCP, a good way to organize your code is to put all of the training logic into a function and call it conditionally on the cloud-train flag; tfc.remote() returns True only when the code is running on the cloud worker:

    if tfc.remote():
        train()
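
    The same flag is also handy for keeping a quick local smoke test. A small sketch (the epoch counts are illustrative, not from the original code):

    # Train for a single epoch locally; the full run happens only on the cloud worker.
    epochs = 1000 if tfc.remote() else 1
    model.fit(trainX, trainY, epochs=epochs, verbose=1)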
    

    Here is the whole working code, if anyone is interested:

    import pandas
    import numpy
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.layers import LSTM
    from keras.layers import Dropout
    from sklearn.preprocessing import MinMaxScaler
    import tensorflow_cloud as tfc
    import os
    
    os.environ["PATH"] = os.environ["PATH"] + ":<path to google-cloud-sdk/bin"
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "<path to google credentials json (you can generate this through their UI"
    
    
    def create_dataset(data):
        # Predict the next bar from the current one: X is every row but the
        # last, Y is the same series shifted forward by one step.
        dataX = data[0:len(data) - 1]
        dataY = data[1:]
        return numpy.array(dataX), numpy.array(dataY)
    
    def train():
        # Read the data straight from the bucket; the local file path from
        # the question does not exist inside the remote container.
        dataset = pandas.read_csv('gs://<bucket>/USDJPY.fx5.csv', usecols=[2, 3, 4, 5])
    
        scaler = MinMaxScaler(feature_range=(-1, 1))
        scaler = scaler.fit(dataset)
    
        dataset = scaler.transform(dataset)
    
        # split into train and test sets
        train_size = int(len(dataset) * 0.67)
        train, test = dataset[0:train_size], dataset[train_size:len(dataset)]
    
        trainX, trainY = create_dataset(train)
    
        trainX = numpy.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
    
        model = Sequential()
        model.add(LSTM(4, input_shape=(1, 4)))
        model.add(Dropout(0.2))
        model.add(Dense(4))
        model.compile(loss='mean_squared_error', optimizer='adam')
        model.fit(trainX, trainY, epochs=1000, verbose=1)
    
    
    # tfc.run packages this script into a Docker image and submits it as a
    # training job.
    job_labels = {"job": "forex-usdjpy", "team": "zver", "user": "zver1"}
    tfc.run(requirements_txt="./requirements.txt",
            job_labels=job_labels,
            stream_logs=True
            )
    
    # train() only executes on the cloud worker, where tfc.remote() is True.
    if tfc.remote():
        train()
    

    NOTE: This is probably not an optimal LSTM config; take it with a grain of salt.