google-cloud-platform google-cloud-ai

Training Job Running on Google Cloud Platform but Not Consuming Any CPU


My training job on AI Platform on Google Cloud Platform seems to be running but is not consuming any CPU. The program does not terminate, but it printed a few errors when the job first started running. They look like the following:

INFO    2020-06-05 04:33:38 +0000       master-replica-0                Create CheckpointSaverHook.
ERROR   2020-06-05 04:33:38 +0000       master-replica-0                I0605 04:33:38.890919 139686838036224 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
INFO    2020-06-05 04:33:41 +0000       worker-replica-0                Graph was finalized.
ERROR   2020-06-05 04:33:41 +0000       worker-replica-0                I0605 04:33:41.006648 140712303798016 monitored_session.py:240] Graph was finalized.
INFO    2020-06-05 04:33:41 +0000       worker-replica-4                Graph was finalized.
ERROR   2020-06-05 04:33:41 +0000       worker-replica-4                I0605 04:33:41.482944 139947128342272 monitored_session.py:240] Graph was finalized.
INFO    2020-06-05 04:33:41 +0000       worker-replica-2                Graph was finalized.
ERROR   2020-06-05 04:33:41 +0000       worker-replica-2                I0605 04:33:41.927765 140284058486528 monitored_session.py:240] Graph was finalized.
INFO    2020-06-05 04:33:41 +0000       master-replica-0                Graph was finalized.
ERROR   2020-06-05 04:33:41 +0000       master-replica-0                I0605 04:33:41.995326 139686838036224 monitored_session.py:240] Graph was finalized.
INFO    2020-06-05 04:33:42 +0000       master-replica-0                Restoring parameters from gs://lasertagger_v1/output/models/wikisplit_experiment_name_2/model.ckpt-0
ERROR   2020-06-05 04:33:42 +0000       master-replica-0                I0605 04:33:42.216852 139686838036224 saver.py:1284] Restoring parameters from gs://lasertagger_v1/output/models/wikisplit_experiment_name_2/model.ckpt-0
INFO    2020-06-05 04:33:43 +0000       worker-replica-3                Done calling model_fn.
ERROR   2020-06-05 04:33:43 +0000       worker-replica-3                I0605 04:33:43.411592 140653000845056 estimator.py:1150] Done calling model_fn.
INFO    2020-06-05 04:33:43 +0000       worker-replica-3                Create CheckpointSaverHook.
ERROR   2020-06-05 04:33:43 +0000       worker-replica-3                I0605 04:33:43.413079 140653000845056 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
INFO    2020-06-05 04:33:44 +0000       worker-replica-1                Done calling model_fn.
ERROR   2020-06-05 04:33:44 +0000       worker-replica-1                I0605 04:33:44.139685 140410730743552 estimator.py:1150] Done calling model_fn.
INFO    2020-06-05 04:33:44 +0000       worker-replica-1                Create CheckpointSaverHook.
ERROR   2020-06-05 04:33:44 +0000       worker-replica-1                I0605 04:33:44.141169 140410730743552 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
INFO    2020-06-05 04:33:47 +0000       worker-replica-1                Graph was finalized.
ERROR   2020-06-05 04:33:47 +0000       worker-replica-1                I0605 04:33:47.280014 140410730743552 monitored_session.py:240] Graph was finalized.
INFO    2020-06-05 04:33:47 +0000       worker-replica-3                Graph was finalized.
ERROR   2020-06-05 04:33:47 +0000       worker-replica-3                I0605 04:33:47.335122 140653000845056 monitored_session.py:240] Graph was finalized.

Each INFO message is followed by an ERROR message with the same content, and I am confused about what is going on with this training job. Thank you!

Below is one of the error messages in more detail:

2020-06-05 13:12:50.583 EDT
worker-replica-4
I0605 17:12:50.583258 140104498276096 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
{
 insertId: "o5flw8f1urq2q"  
 jsonPayload: {
  created: 1591377170.5835383   
  levelname: "ERROR"   
  lineno: 328   
  message: "I0605 17:12:50.583258 140104498276096 basic_session_run_hooks.py:541] Create CheckpointSaverHook."   
  pathname: "/runcloudml.py"   
 }
 labels: {
  compute.googleapis.com/resource_id: "2069730006064940177"   
  compute.googleapis.com/resource_name: "gke-cml-0605-170056-7fb-n1-highmem-96-9990517e-rvlx"   
  compute.googleapis.com/zone: "us-east1-c"   
  ml.googleapis.com/job_id/log_area: "root"   
  ml.googleapis.com/trial_id: ""   
 }
 logName: "projects/smart-content-summary/logs/worker-replica-4"  
 receiveTimestamp: "2020-06-05T17:13:00.962017815Z"  
 resource: {
  labels: {…}   
  type: "ml_job"   
 }
 severity: "ERROR"  
 timestamp: "2020-06-05T17:12:50.583538292Z"  
}

Solution

  • I strongly suspect the problem occurs while the model is being saved. The likely causes are:

    1. memory overflow
    2. disk overflow

    Can you share some monitoring metrics for memory and disk usage, or consider:

    1. increasing the machine memory
    2. increasing the root partition size?

    A sketch of how you could collect such metrics from inside the job follows below.
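
    A minimal sketch of the kind of monitoring I have in mind, assuming psutil is available in the training container (it may need to be added to your package dependencies); the one-minute interval and the idea of starting the thread near the top of runcloudml.py are illustrative, not taken from your setup:

        import shutil
        import threading
        import time

        import psutil


        def log_resource_usage(interval_secs=60):
            """Periodically print memory and root-partition usage so the job
            logs show whether RAM or disk fills up before the checkpoint save."""
            while True:
                mem = psutil.virtual_memory()      # system-wide memory stats
                disk = shutil.disk_usage("/")      # root partition usage
                print("memory used: %.1f%%, disk used: %.1f%%"
                      % (mem.percent, 100.0 * disk.used / disk.total))
                time.sleep(interval_secs)


        # Start the monitor as a daemon thread near the top of the training
        # script so it does not block training or delay shutdown.
        monitor = threading.Thread(target=log_resource_usage, daemon=True)
        monitor.start()

    If the printed memory or disk percentages climb toward 100% around the time the CheckpointSaverHook runs, that would point to which of the two remedies above (more memory or a larger root partition) you need.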