Tags: python, python-3.x, pytorch, multiprocessing, python-polars

Why is there an 'UnpicklingError' when using polars to read data for PyTorch?


I recently switched my data tooling from xarray to polars, and I use pl.DataFrame.to_torch() to generate tensors for training my PyTorch model. The data source is in Parquet format.

To avoid forking child processes, I use torch.multiprocessing.spawn to start my training processes; however, the run crashed with this:

/home/username/.conda/envs/torchhydro1/bin/python3.11 -X pycache_prefix=/home/username/.cache/JetBrains/IntelliJIdea2024.3/cpython-cache /home/username/.local/share/JetBrains/IntelliJIdea2024.3/python-ce/helpers/pydev/pydevd.py --multiprocess --qt-support=auto --port 29781 --file /home/username/torchhydro/experiments/train_with_era5land_gnn_ddp.py 
Console output is saving to: /home/username/torchhydro/experiments/results/train_gnn_ddp.txt
[20:38:51] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:38:52] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
update config file
!!!!!!NOTE!!!!!!!!
-------Please make sure the PRECIPITATION variable is in the 1st location in var_t setting!!---------
If you have POTENTIAL_EVAPOTRANSPIRATION, please set it the 2nd!!!-
!!!!!!NOTE!!!!!!!!
-------Please make sure the STREAMFLOW variable is in the 1st location in var_out setting!!---------
[20:39:04] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:39:06] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
……
Torch is using cuda:0
[2024-12-12 20:48:08,931] torch.distributed.distributed_c10d: [INFO] Using backend config: {'cuda': 'nccl'}
[W CUDAAllocatorConfig.h:30] Warning: expandable_segments not supported on this platform (function operator())
using 8 workers
Pin memory set to True
  0%|          | 0/22986 [00:00<?, ?it/s]
[20:48:40] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:48:41] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
[20:49:28] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:49:29] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
[20:50:19] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:50:20] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
[20:51:12] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:51:13] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
[20:52:07] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:52:09] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
[20:52:13] DEBUG    CACHEDIR=/home/username/.cache/matplotlib   __init__.py:341
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
[20:53:11] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:53:12] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
[20:55:12] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:55:14] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
[20:55:19] DEBUG    CACHEDIR=/home/username/.cache/matplotlib   __init__.py:341
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
Traceback (most recent call last):
  File "/home/username/.local/share/JetBrains/IntelliJIdea2024.3/python-ce/helpers/pydev/pydevd.py", line 1570, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.local/share/JetBrains/IntelliJIdea2024.3/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/username/torchhydro/experiments/train_with_era5land_gnn_ddp.py", line 171, in <module>
    test_run_model()
  File "/home/username/torchhydro/experiments/train_with_era5land_gnn_ddp.py", line 56, in test_run_model
    mp.spawn(gnn_train_worker, args=(world_size, config_data, None), nprocs=world_size, join=True)
  File "/home/username/.conda/envs/torchhydro1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.conda/envs/torchhydro1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/home/username/.conda/envs/torchhydro1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
python-BaseException
Traceback (most recent call last):
  File "/home/username/.conda/envs/torchhydro1/lib/python3.11/multiprocessing/spawn.py", line 132, in _main
    self = reduction.pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: pickle data was truncated
python-BaseException
/home/username/.conda/envs/torchhydro1/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
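For reference, `torch.multiprocessing.spawn` uses the `spawn` start method, under which every argument handed to a child process (and, later, the `Dataset` handed to each `DataLoader` worker) must be pickled into that child. A minimal stdlib sketch of that same mechanism (the function and variable names here are illustrative, not from the project above):

```python
import multiprocessing as mp

def worker(rank, world_size, queue):
    # With the "spawn" start method, each child re-imports the module and
    # receives its arguments via pickling -- which is why an unpicklable
    # or truncated argument stream fails at child startup with
    # _pickle.UnpicklingError.
    queue.put((rank, world_size))

def main():
    ctx = mp.get_context("spawn")
    world_size = 2
    queue = ctx.Queue()
    procs = [
        ctx.Process(target=worker, args=(rank, world_size, queue))
        for rank in range(world_size)
    ]
    for p in procs:
        p.start()
    results = sorted(queue.get() for _ in range(world_size))
    for p in procs:
        p.join()
    print(results)  # [(0, 2), (1, 2)]

if __name__ == "__main__":
    main()
```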

Now I have two questions:

First, why does _pickle.UnpicklingError appear?

Second, after the progress bar printed 0%| | 0/22986 [00:00<?, ?it/s], there are 7 …… blocks in my process log, meaning this DEBUG cycle was repeated 8 or 9 times. I have set num_workers of the PyTorch DataLoader to 8; is this connected to num_workers?

This problem started after I switched to polars, so I suspect it comes from polars, or from some bad interaction between polars' threads and PyTorch's workers.

But how can I find out why the UnpicklingError occurs, and how can I solve it? Hope for your reply.


Solution

  • It was a mistake to filter the polars.DataFrame and convert the result to a torch.Tensor inside __getitem__ of the torch.utils.data.Dataset. With the spawn start method, the Dataset (including the DataFrame it holds) is pickled into every DataLoader worker, which is where the truncated-pickle error surfaced (reduction.pickle.load in the traceback). Converting the whole dataframe to a tensor once, up front, solved the problem.