I recently switched my data tooling from xarray to polars, and I now use pl.DataFrame.to_torch()
to generate tensors for training my PyTorch model. The data source is Parquet files.
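For reference, this is roughly what the loading path looks like (a minimal sketch; the file and column names are placeholders, not my real ones):

import polars as pl

df = pl.read_parquet("era5land_forcing.parquet")               # placeholder Parquet file
features = df.select(["precipitation", "pet", "streamflow"])   # placeholder columns
tensor = features.to_torch()                                    # 2-D torch.Tensor built from the DataFrame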
To avoid forking child processes, I use torch.multiprocessing.spawn
to start my training processes; however, the run crashed with this output:
/home/username/.conda/envs/torchhydro1/bin/python3.11 -X pycache_prefix=/home/username/.cache/JetBrains/IntelliJIdea2024.3/cpython-cache /home/username/.local/share/JetBrains/IntelliJIdea2024.3/python-ce/helpers/pydev/pydevd.py --multiprocess --qt-support=auto --port 29781 --file /home/username/torchhydro/experiments/train_with_era5land_gnn_ddp.py
Console output is saving to: /home/username/torchhydro/experiments/results/train_gnn_ddp.txt
[20:38:51] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:38:52] DEBUG Using selector: EpollSelector selector_events.py:54
……
DEBUG Using fontManager instance from /home/username/.cache/matplotlib/fontlist-v390.json font_manager.py:1580
update config file
!!!!!!NOTE!!!!!!!!
-------Please make sure the PRECIPITATION variable is in the 1st location in var_t setting!!---------
If you have POTENTIAL_EVAPOTRANSPIRATION, please set it the 2nd!!!-
!!!!!!NOTE!!!!!!!!
-------Please make sure the STREAMFLOW variable is in the 1st location in var_out setting!!---------
[20:39:04] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:39:06] DEBUG Using selector: EpollSelector selector_events.py:54
……
DEBUG Using fontManager instance from /home/username/.cache/matplotlib/fontlist-v390.json font_manager.py:1580
……
Torch is using cuda:0
[2024-12-12 20:48:08,931] torch.distributed.distributed_c10d: [INFO] Using backend config: {'cuda': 'nccl'}
[W CUDAAllocatorConfig.h:30] Warning: expandable_segments not supported on this platform (function operator())
using 8 workers
Pin memory set to True
0%| | 0/22986 [00:00<?, ?it/s]
[20:48:40] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:48:41] DEBUG Using selector: EpollSelector selector_events.py:54
……
DEBUG Using fontManager instance from /home/username/.cache/matplotlib/fontlist-v390.json font_manager.py:1580
[20:49:28] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:49:29] DEBUG Using selector: EpollSelector selector_events.py:54
……
DEBUG Using fontManager instance from /home/username/.cache/matplotlib/fontlist-v390.json font_manager.py:1580
[20:50:19] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:50:20] DEBUG Using selector: EpollSelector selector_events.py:54
……
DEBUG Using fontManager instance from /home/username/.cache/matplotlib/fontlist-v390.json font_manager.py:1580
[20:51:12] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:51:13] DEBUG Using selector: EpollSelector selector_events.py:54
……
DEBUG Using fontManager instance from /home/username/.cache/matplotlib/fontlist-v390.json font_manager.py:1580
[20:52:07] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:52:09] DEBUG Using selector: EpollSelector selector_events.py:54
……
[20:52:13] DEBUG CACHEDIR=/home/username/.cache/matplotlib __init__.py:341
DEBUG Using fontManager instance from /home/username/.cache/matplotlib/fontlist-v390.json font_manager.py:1580
[20:53:11] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:53:12] DEBUG Using selector: EpollSelector selector_events.py:54
……
DEBUG Using fontManager instance from /home/username/.cache/matplotlib/fontlist-v390.json font_manager.py:1580
[20:55:12] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:55:14] DEBUG Using selector: EpollSelector selector_events.py:54
……
[20:55:19] DEBUG CACHEDIR=/home/username/.cache/matplotlib __init__.py:341
DEBUG Using fontManager instance from /home/username/.cache/matplotlib/fontlist-v390.json font_manager.py:1580
Traceback (most recent call last):
File "/home/username/.local/share/JetBrains/IntelliJIdea2024.3/python-ce/helpers/pydev/pydevd.py", line 1570, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/username/.local/share/JetBrains/IntelliJIdea2024.3/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/username/torchhydro/experiments/train_with_era5land_gnn_ddp.py", line 171, in <module>
test_run_model()
File "/home/username/torchhydro/experiments/train_with_era5land_gnn_ddp.py", line 56, in test_run_model
mp.spawn(gnn_train_worker, args=(world_size, config_data, None), nprocs=world_size, join=True)
File "/home/username/.conda/envs/torchhydro1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/username/.conda/envs/torchhydro1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/home/username/.conda/envs/torchhydro1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
python-BaseException
Traceback (most recent call last):
File "/home/username/.conda/envs/torchhydro1/lib/python3.11/multiprocessing/spawn.py", line 132, in _main
self = reduction.pickle.load(from_parent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: pickle data was truncated
python-BaseException
/home/username/.conda/envs/torchhydro1/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Now I have two questions:
First, why does the _pickle.UnpicklingError appear?
Second, after 0%| | 0/22986 [00:00<?, ?it/s] is printed, there are seven more …… blocks in my log, which means the same DEBUG sequence is repeated 8 or 9 times. I have set num_workers of the PyTorch DataLoader to 8; is this related to num_workers?
The problem only appeared after I switched to polars, so I suspect it comes from polars itself, or from some bad interaction between polars' threads and PyTorch's.
How can I find out why the UnpicklingError happens, and how can I fix it? Hope for your reply.
It was a mistake to filter the polars.DataFrame and convert the result to a torch.Tensor inside __getitem__ of the torch.utils.data.Dataset.
Converting the whole DataFrame to a tensor up front solved the problem.
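For illustration, a minimal sketch of the change (class and column names are placeholders, not my actual code):

import polars as pl
from torch.utils.data import Dataset

# Problematic pattern: keep the polars DataFrame on the Dataset and filter /
# convert it inside __getitem__. With num_workers > 0 and the spawn start
# method, the Dataset (including the DataFrame) has to be pickled and sent to
# every worker, and the polars work is repeated for every single sample.
class PerItemConversionDataset(Dataset):
    def __init__(self, df: pl.DataFrame):
        self.df = df

    def __len__(self):
        return self.df.height

    def __getitem__(self, idx):
        row = self.df.slice(idx, 1)        # filtering in __getitem__
        return row.to_torch().squeeze(0)   # converting in __getitem__

# Fix: convert the whole DataFrame to one tensor up front and only index it
# in __getitem__.
class PreconvertedDataset(Dataset):
    def __init__(self, df: pl.DataFrame):
        self.data = df.to_torch()          # one tensor for the whole DataFrame

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        return self.data[idx]

With the second version, each DataLoader worker only indexes an already-built tensor instead of re-running polars filtering and conversion for every sample.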