wandb

How to avoid data averaging when logging to metric across multiple runs?


I'm trying to log data points for the same metric across multiple runs (wandb.init is called repeatedly in between each data point) and I'm unsure how to avoid the behavior seen in the attached screenshot...

Instead of getting a line chart with multiple points, I'm getting a single data point with associated statistics. In the attached e.g., the 1st data point was generated at step 1,470 and the 2nd at step 2,940...rather than seeing two points, I'm instead getting a single point that's the average and appears at step 2,205. enter image description here

My hunch is that using the resume run feature may address my problem, but even testing out this hunch is proving to be cumbersome given the constraints of the system I'm working with...

Before I invest more time in my hypothesized solution, could someone confirm that the behavior I'm seeing is, indeed, the result of logging data to the same metric across separate runs without using the resume feature?

If this is the case, can you confirm or deny my conception of how to use resume?

Initial run:

  1. run = wandb.init()
  2. wandb_id = run.id
  3. cache wandb_id for successive runs

Successive run:

  1. retrieve wandb_id from cache
  2. wandb.init(id=wandb_id, resume="must")

Is it also acceptable / preferable to replace 1. and 2. of the initial run with:

  1. wandb_id = wandb.util.generate_id()
  2. wandb.init(id=wandb_id)

Solution

  • My hunch is that using the resume run feature may address my problem,

    Indeed, providing a cached id in combination with resume="must" fixed the issue. enter image description here

    Corresponding snippet:

    import wandb
    
    # wandb run associated with evaluation after first N epochs of training.
    wandb_id = wandb.util.generate_id()
    wandb.init(id=wandb_id, project="alrichards", name="test-run-3/job-1", group="test-run-3")
    wandb.log({"mean_evaluate_loss_epoch": 20}, step=1)
    wandb.finish()
    
    # wandb run associated with evaluation after second N epochs of training.
    wandb.init(id=wandb_id, resume="must", project="alrichards", name="test-run-3/job-2", group="test-run-3")
    wandb.log({"mean_evaluate_loss_epoch": 10}, step=5)
    wandb.finish()