On a node with 4 NVIDIA GPUs I enabled on device 0 the ECC memory protection (all other have ECC disabled). Since I enabled ECC on device 0 my application (CUDA, using just one device) hangs when it tries to create the context on this device 0 (driver API). I don't know why it hangs at that point. If I use a different device setting CUDA_VISIBLE_DEVICE accordingly to another device it works fine. It must have to do with enabling ECC. Any thoughts?
Here the output of nvidia-smi
:
(Why does it report 99% volatile GPU utilization, nothing is running there?)
+------------------------------------------------------+
| NVIDIA-SMI 4.304.54 Driver Version: 304.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20m | 0000:02:00.0 Off | 1 |
| N/A 29C P0 49W / 225W | 0% 12MB / 4799MB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K20m | 0000:03:00.0 Off | 0 |
| N/A 22C P8 15W / 225W | 0% 12MB / 4799MB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K20m | 0000:83:00.0 Off | 0 |
| N/A 22C P8 24W / 225W | 0% 11MB / 4799MB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K20m | 0000:84:00.0 Off | 0 |
| N/A 23C P8 25W / 225W | 0% 11MB / 4799MB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| No running compute processes found |
+-----------------------------------------------------------------------------+
EDIT: nvidia-smi -a
reports ECC enabled on all devices. Strange!
==============NVSMI LOG==============
Timestamp : Fri Apr 26 10:18:14 2013
Driver Version : 304.54
Attached GPUs : 4
GPU 0000:02:00.0
Product Name : Tesla K20m
Display Mode : Disabled
Persistence Mode : Enabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324512044699
VBIOS Version : 80.10.11.00.0B
Inforom Version
Image Version : 2081.0208.01.07
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x102810DE
Bus Id : 0000:02:00.0
Sub System Id : 0x101510DE
GPU Link Info
PCIe Generation
Max : 2
Current : 2
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
User Defined Clocks : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 4799 MB
Used : 12 MB
Free : 4787 MB
Compute Mode : Default
Utilization
Gpu : 99 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 1
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 1
Aggregate
Single Bit
Device Memory : 1
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 1
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Temperature
Gpu : 29 C
Power Readings
Power Management : Supported
Power Draw : 49.51 W
Power Limit : 225.00 W
Default Power Limit : 225.00 W
Min Power Limit : 150.00 W
Max Power Limit : 225.00 W
Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Applications Clocks
Graphics : 705 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Compute Processes : None
GPU 0000:03:00.0
Product Name : Tesla K20m
Display Mode : Disabled
Persistence Mode : Enabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324512044821
VBIOS Version : 80.10.11.00.0B
Inforom Version
Image Version : 2081.0208.01.07
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x03
Device : 0x00
Domain : 0x0000
Device Id : 0x102810DE
Bus Id : 0000:03:00.0
Sub System Id : 0x101510DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
User Defined Clocks : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 4799 MB
Used : 12 MB
Free : 4787 MB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Temperature
Gpu : 19 C
Power Readings
Power Management : Supported
Power Draw : 15.22 W
Power Limit : 225.00 W
Default Power Limit : 225.00 W
Min Power Limit : 150.00 W
Max Power Limit : 225.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Applications Clocks
Graphics : 705 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Compute Processes : None
GPU 0000:83:00.0
Product Name : Tesla K20m
Display Mode : Disabled
Persistence Mode : Enabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324512044783
VBIOS Version : 80.10.11.00.0B
Inforom Version
Image Version : 2081.0208.01.07
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x83
Device : 0x00
Domain : 0x0000
Device Id : 0x102810DE
Bus Id : 0000:83:00.0
Sub System Id : 0x101510DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
User Defined Clocks : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 4799 MB
Used : 11 MB
Free : 4788 MB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Temperature
Gpu : 22 C
Power Readings
Power Management : Supported
Power Draw : 24.74 W
Power Limit : 225.00 W
Default Power Limit : 225.00 W
Min Power Limit : 150.00 W
Max Power Limit : 225.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Applications Clocks
Graphics : 705 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Compute Processes : None
GPU 0000:84:00.0
Product Name : Tesla K20m
Display Mode : Disabled
Persistence Mode : Enabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324512044628
VBIOS Version : 80.10.11.00.0B
Inforom Version
Image Version : 2081.0208.01.07
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x84
Device : 0x00
Domain : 0x0000
Device Id : 0x102810DE
Bus Id : 0000:84:00.0
Sub System Id : 0x101510DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
User Defined Clocks : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 4799 MB
Used : 11 MB
Free : 4788 MB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Temperature
Gpu : 23 C
Power Readings
Power Management : Supported
Power Draw : 25.47 W
Power Limit : 225.00 W
Default Power Limit : 225.00 W
Min Power Limit : 150.00 W
Max Power Limit : 225.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Applications Clocks
Graphics : 705 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Compute Processes : None
The nvidia-smi output shows an uncorrectable ECC error on the device. You can reset the error using nvidia-smi --reset-ecc-errors=0 -g 0
and retry. The 0
in the reset indicates to reset the volatile counter only, the aggregate counter will still indicate that an error has happened in the past.
If you see further errors from the device then it would be worth investigating the cause further.
Note that in the summary view the ECC field you are looking at is actually "Volatile Uncorr. ECC", i.e. it's the error count not the ECC enabled/disabled flag. If ECC is disabled it will say "N/A".