Tags: python, amazon-web-services, amazon-ec2, pytorch, tensorboardx

AWS tensorboard Segmentation fault (core dumped)


I am trying to use tensorboardX to debug a PyTorch neural network that is running on an AWS p2.xlarge instance.

I followed this tutorial to open port 6006.

The model is running and tensorboardX is writing its event file. I get the following warning while it runs, and I am not sure how relevant it is.

WARNING:root:tuple appears in op that does not forward tuples (VisitNode at /pytorch/torch/csrc/jit/passes/lower_tuples.cpp:117)
frame #0: std::function::operator()() const + 0x11 (0x7fbe3dd04441 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fbe3dd03d7a in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: + 0xaf61f5 (0x7fbe3cdc41f5 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #3: + 0xaf6464 (0x7fbe3cdc4464 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: torch::jit::LowerAllTuples(std::shared_ptr&) + 0x13 (0x7fbe3cdc44a3 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #5: + 0x3f84b4 (0x7fbe7d2cb4b4 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x130cfc (0x7fbe7d003cfc in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #40: __libc_start_main + 0xf0 (0x7fbe8d69c830 in /lib/x86_64-linux-gnu/libc.so.6)

The problem is that I cannot reach the TensorBoard web interface in the browser. These are the steps I take:

$ cd PATH_TO_FOLDER_CONTAINING_runs
$ source activate pytorch_p36
$ tensorboard --logdir=runs

At this point I get the error message:

Segmentation fault (core dumped)

When I check the syslog (/var/log/syslog), I see the following:

Jun 26 09:06:40 ip-172-xx-xx-xxx kernel: [515315.598917] tensorboard[1446]: segfault at 0 ip (null) sp 00007ffd64c5f178 error 14 in python2.7[55d8673d1000+1000]
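
The syslog line names python2.7 even though the active environment is pytorch_p36 (Python 3.6), so it can help to check which interpreter the `tensorboard` command on the PATH actually starts under. The following is only a rough sketch: `launcher_interpreter` is a hypothetical helper (not part of TensorBoard), and it assumes the launcher is a text script whose first line is a `#!` shebang, as pip/conda entry points usually are.

```python
import shutil
from typing import Optional


def launcher_interpreter(command: str = "tensorboard") -> Optional[str]:
    """Return the interpreter named in the shebang of `command`, or None.

    Hypothetical helper for illustration; not part of TensorBoard.
    """
    # Resolve the command the same way the shell would: first match on PATH.
    path = shutil.which(command)
    if path is None:
        return None
    with open(path, "rb") as fh:
        first_line = fh.readline().decode("utf-8", errors="replace").strip()
    # Script launchers start with "#!"; a compiled binary has no shebang.
    if not first_line.startswith("#!"):
        return None
    return first_line[2:].strip()


if __name__ == "__main__":
    print("tensorboard launcher runs under:", launcher_interpreter())
```

Running this inside each conda environment shows whether `tensorboard` resolves to a launcher that belongs to that environment or to one installed elsewhere.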

My googling was not enough to solve this. How can I access TensorBoard through the browser while it is running on the AWS instance?

Please let me know if something is unclear or if some info is missing.


Solution

  • Even though the training code has to run in the pytorch_p36 environment, tensorboard itself has to be launched from a different environment (tensorflow_p27 on this instance).

    The sequence of commands in the terminal should be:

    $ cd PATH_TO_FOLDER_CONTAINING_runs
    $ source activate tensorflow_p27
    $ tensorboard --logdir=runs
    

    TensorBoard then starts and serves on the designated port (6006 in this setup).
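
To confirm that TensorBoard is actually listening after the commands above, a plain TCP connection test is enough. This is a minimal sketch, not part of TensorBoard or the AWS tooling: `port_is_open` is a hypothetical helper, and the port 6006 is taken from the question.

```python
import socket


def port_is_open(host: str, port: int = 6006, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Hypothetical helper for illustration only.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


if __name__ == "__main__":
    # "localhost" checks from the instance itself. To check from your own
    # machine, pass the instance's public address instead (not given here).
    print("port 6006 reachable:", port_is_open("localhost"))
```

If the check succeeds on the instance but fails from your own machine, the process is running and the remaining problem is more likely in the instance's inbound rules for port 6006.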