trains clearml

Tracking separate train/test processes with Trains


In my setup, I run a script that trains a model and starts generating checkpoints. Another script watches for new checkpoints and evaluates them. The scripts run in parallel, so evaluation is just a step behind training.

What's the right Trains configuration to support this scenario?


Solution

  • disclaimer: I'm part of the allegro.ai Trains team

    Do you have two experiments, one for training and one for testing?

    If you do have two experiments, I would make sure the models are logged in both of them (this happens automatically if they are stored in the same shared folder, S3 bucket, etc.). Then you can quickly see the performance of each one.

    Another option is to share the same experiment: the second process adds its reports to the original experiment. That means you have to pass the experiment ID to it somehow. Then you can do:

    from trains import Task
    # Replace 'training_task_id' with the ID of the training experiment
    task = Task.get_task(task_id='training_task_id')
    task.get_logger().report_scalar('title', 'loss', value=0.4, iteration=1)
    

    EDIT: Are the two processes always launched together, or is the checkpoint test general-purpose code?

    EDIT2:

    Let's assume you have a main script that trains a model. This experiment has a unique task ID:

    my_uid = Task.current_task().id
    

    Let's also assume you have a way to pass it to your second process. (If this is an actual subprocess, it inherits the OS environment variables, so you could set os.environ['MY_TASK_ID'] = my_uid before launching it.)
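
    As a rough sketch of the subprocess case, the training script could launch the evaluator with the task ID added to its environment. The script name evaluate.py and the helper names build_child_env and launch_evaluator are hypothetical, used only for illustration; MY_TASK_ID is the variable name from the text above.

```python
import os
import subprocess
import sys


def build_child_env(task_id, base_env=None):
    """Return a copy of the environment with MY_TASK_ID set for the child."""
    env = dict(os.environ if base_env is None else base_env)
    env["MY_TASK_ID"] = task_id
    return env


def launch_evaluator(task_id, script="evaluate.py"):
    """Start the evaluation script (hypothetical name) as a parallel process."""
    # Popen returns immediately, so training continues while evaluation runs.
    return subprocess.Popen(
        [sys.executable, script],
        env=build_child_env(task_id),
    )
```

    In the training script this would be called as launch_evaluator(Task.current_task().id). Passing an explicit env copy, rather than mutating os.environ, keeps the training process's own environment unchanged; either approach should work for a direct child process.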

    Then in the evaluation script you could report directly into the main training Task like so:

    import os
    from trains import Task
    train_task = Task.get_task(task_id=os.environ['MY_TASK_ID'])
    train_task.get_logger().report_scalar('title', 'loss', value=0.4, iteration=1)
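
    Putting this together with the checkpoint-watching setup from the question, the evaluation script might look roughly like the sketch below. Everything apart from Task.get_task, get_logger and report_scalar is an assumption made for illustration: the *.ckpt file pattern, the 'evaluation' title, the helper names, and the evaluate callback, which is assumed to return an (iteration, loss) pair for a checkpoint path.

```python
import os
import time
from pathlib import Path


def new_checkpoints(ckpt_dir, seen):
    """Return checkpoint files not evaluated yet, in file-name order."""
    paths = sorted(Path(ckpt_dir).glob("*.ckpt"), key=lambda p: p.name)
    return [p for p in paths if p.name not in seen]


def evaluate_new(ckpt_dir, seen, evaluate, logger):
    """Evaluate each unseen checkpoint once and report its loss."""
    for path in new_checkpoints(ckpt_dir, seen):
        iteration, loss = evaluate(path)  # user-supplied callback (assumed)
        logger.report_scalar("evaluation", "loss", value=loss, iteration=iteration)
        seen.add(path.name)


def run_forever(ckpt_dir, evaluate, poll_seconds=30):
    """Poll the checkpoint folder and report into the training task."""
    from trains import Task  # imported here so the helpers work without Trains

    train_task = Task.get_task(task_id=os.environ["MY_TASK_ID"])
    logger = train_task.get_logger()
    seen = set()
    while True:
        evaluate_new(ckpt_dir, seen, evaluate, logger)
        time.sleep(poll_seconds)
```

    Using the checkpoint's own training iteration as the iteration argument should line the evaluation curve up with the training curve in the same experiment.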