I am building a PyTorch wheel from source, based on the project's CI build script (https://github.com/pytorch/pytorch/blob/v2.6.0/.ci/manywheel/build_common.sh). I first built and tested it on a g5.xlarge EC2 instance: it installed with pip and everything worked. To speed up the build, I then built the same wheel on a g5.12xlarge instance and tested it on that machine, where everything also works. The problem appears when I install the wheel built on the g5.12xlarge on a g5.xlarge instance:
Python 3.11.11 (main, Nov 13 2025, 17:12:08) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Illegal instruction (core dumped)
Running the import under gdb shows:
Program received signal SIGILL, Illegal instruction.
0x00007fffe7b74d69 in ska::detailv3::sherwood_v3_table<std::pair<c10::OperatorName, c10::OperatorHandle>, c10::OperatorName, std::hash<c10::OperatorName>, ska::detailv3::KeyOrValueHasher<c10::OperatorName, std::pair<c10::OperatorName, c10::OperatorHandle>, std::hash<c10::OperatorName> >, std::equal_to<c10::OperatorName>, ska::detailv3::KeyOrValueEquality<c10::OperatorName, std::pair<c10::OperatorName, c10::OperatorHandle>, std::equal_to<c10::OperatorName> >, std::allocator<std::pair<c10::OperatorName, c10::OperatorHandle> >, std::allocator<ska::detailv3::sherwood_v3_entry<std::pair<c10::OperatorName, c10::OperatorHandle> > > >::rehash(unsigned long) () \
from /home/prod/.local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
So it seems the libtorch_cpu.so has different symbols. I am trying to understand how this happened, because these two instance types have the same CPUs. I would love some help in making this work, i.e. how to build the wheel on a g5.12xlarge instance so it works on a g5.xlarge instance.
Update: Both g5.xlarge and g5.12xlarge claim to use identical CPUs:
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD EPYC 7R32
stepping : 0
microcode : 0x830107f
cpu MHz : 3299.275
cache size : 512 KB
GDB shows the crashing instruction:
Program received signal SIGILL, Illegal instruction.
0x00007fffe7b74d69 in ska::detailv3....
(gdb) x/i $pc
=> 0x7fffe7b74d69 <_ZN3ska8detail...EEE6rehashEm+25>: vcvtusi2sdq 0x18(%rdi),%xmm1,%xmm0
Update #2: Here is more GDB output:
(gdb) disas/r $pc, $pc+1
Dump of assembler code from 0x7fffe7b74d69 to 0x7fffe7b74d6a:
=> 0x00007fffe7b74d69 <_ZN3ska8detailv317sherwood_v3_tableISt4pairIN3c1012OperatorNameENS3_14OperatorHandleEES4_St4hashIS4_ENS0_16KeyOrValueHasherIS4_S6_S8_EESt8equal_toIS4_ENS0_18KeyOrValueEqualityIS4_S6_SC_EESaIS6_ESaINS0_17sherwood_v3_entryIS6_EEEE6rehashEm+25>: 62 f1 f7 08 7b 47 03 vcvtusi2sdq 0x18(%rdi),%xmm1,%xmm0
End of assembler dump.
The crashing instruction, vcvtusi2sdq 0x18(%rdi),%xmm1,%xmm0, is an AVX512F instruction, which neither EC2 instance supports.
This is probably happening because you built on an AVX512-capable Intel machine and your compilation flags include -march=native.
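The raw bytes in the disassembly point the same way. In 64-bit code, a leading 0x62 byte marks the 4-byte EVEX prefix used by AVX-512 instructions. The following is a minimal sketch that picks a few fields out of that prefix, assuming the standard EVEX layout; the function name decode_evex is hypothetical and the decoding is deliberately incomplete.

```python
def decode_evex(raw_hex):
    """Decode a few EVEX prefix fields from instruction bytes given as hex.

    Returns None when the bytes do not start with an EVEX prefix.
    Only meaningful for 64-bit code, where 0x62 cannot be anything else.
    """
    data = bytes.fromhex(raw_hex)
    if len(data) < 5 or data[0] != 0x62:
        return None
    p0, p1 = data[1], data[2]
    return {
        "map": p0 & 0x07,   # opcode map: 1 means the 0F map
        "w": p1 >> 7,       # EVEX.W: 1 selects the 64-bit integer operand
        "pp": p1 & 0x03,    # implied legacy prefix: 3 means F2
        "opcode": data[4],  # first byte after the 4-byte prefix
    }


if __name__ == "__main__":
    # Bytes copied from the "disas/r" output in the question.
    print(decode_evex("62 f1 f7 08 7b 47 03"))
```

For the bytes in the question this reports map 1, W 1, pp 3 and opcode 0x7b, i.e. an EVEX-encoded F2 0F 7B instruction with a 64-bit source, which matches vcvtusi2sdq.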
Changing flags to -march=x86-64 and rebuilding may solve this crash.
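After rebuilding, it may be worth checking which functions in libtorch_cpu.so still contain AVX-512 code before copying the wheel to another machine. The sketch below assumes GNU objdump is on the PATH and scans its AT&T-syntax output for two text markers; the marker list and the helper names are illustrative, not exhaustive. Note that PyTorch also ships runtime-dispatched AVX-512 kernels, so hits inside kernel symbols are expected. What matters is a hit in generic code such as the rehash function from the backtrace. Disassembling a library of this size can take several minutes.

```python
import re
import subprocess
import sys

# Matches objdump symbol headers such as "0000000000001000 <name>:".
SYMBOL_RE = re.compile(r"^[0-9a-f]+ <(.+)>:\s*$")
# Illustrative markers: 512-bit registers and the mnemonic from the crash.
MARKERS = ("%zmm", "vcvtusi2sd")


def symbols_with_avx512(disassembly_lines):
    """Map each symbol name to the number of its lines containing a marker."""
    hits = {}
    current = None
    for line in disassembly_lines:
        match = SYMBOL_RE.match(line)
        if match:
            current = match.group(1)
            continue
        if current is not None and any(m in line for m in MARKERS):
            hits[current] = hits.get(current, 0) + 1
    return hits


def scan(path):
    """Stream `objdump -d` output for the given file through the matcher."""
    proc = subprocess.Popen(
        ["objdump", "-d", "--no-show-raw-insn", path],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        return symbols_with_avx512(proc.stdout)
    finally:
        proc.stdout.close()
        proc.wait()


if __name__ == "__main__" and len(sys.argv) == 2:
    for name, count in sorted(scan(sys.argv[1]).items()):
        print(count, name)
```

Run it with the path to the library as the only argument, for example the libtorch_cpu.so path shown in the gdb output.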
It is unclear to me why only one of the machines exercises this code (and crashes). The other machine must not exercise this code (or it would have also crashed).
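One way to compare the machines directly is to look at the flags line of /proc/cpuinfo, which the excerpt in the question leaves out. A minimal sketch, assuming a Linux host; the function names are hypothetical:

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags of the first CPU in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supports_avx512f(cpuinfo_text):
    """True if the avx512f feature flag is listed."""
    return "avx512f" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("avx512f supported:", supports_avx512f(f.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

Running this on the build machine and on the target machine shows whether the two actually differ in AVX512F support, independently of the model name they report.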