I was trying to benchmark a few of my queries in OmniSci on a GPU server, but the queries are choking. I then experimented with the flights sample dataset provided by OmniSci itself.
Below are my observations (I am using the JDBC connector):
1. PreparedStatement pstmt2 = conn.prepareStatement("select * from flights_2008_7M natural join omnisci_countries");
   pstmt2.execute(); // with 8 parallel threads
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... On | 00000000:18:00.0 Off | 0 |
| N/A 43C P0 45W / 300W | 2343MiB / 16280MiB | 10% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... On | 00000000:3B:00.0 Off | 0 |
| N/A 35C P0 42W / 300W | 2343MiB / 16280MiB | 15% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-SXM2... On | 00000000:86:00.0 Off | 0 |
| N/A 33C P0 42W / 300W | 2343MiB / 16280MiB | 14% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-SXM2... On | 00000000:AF:00.0 Off | 0 |
| N/A 38C P0 42W / 300W | 2343MiB / 16280MiB | 10% Default |
+-------------------------------+----------------------+----------------------+
2. PreparedStatement pstmt2 = conn.prepareStatement("select * from flights_2008_7M where dest = 'TPA' limit 100000");
   pstmt2.execute(); // with 8 threads
The script hung and nothing moved; in fact there was no GPU utilization either. I just wanted to check whether this is a configuration issue. How can I maximize GPU utilization and execute more complex queries against a larger dataset?
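For context, each run above was driven by something like this minimal JDBC harness (a sketch only: the connection URL, credentials, and timing code are placeholders, not my exact setup; the OmniSci JDBC driver jar is assumed to be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class OmniSciBench {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for my environment.
        final String url = "jdbc:omnisci:localhost:6274:omnisci";
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            pool.submit(() -> {
                // Each thread opens its own connection and runs the same query.
                try (Connection conn = DriverManager.getConnection(url, "admin", "HyperInteractive");
                     PreparedStatement pstmt = conn.prepareStatement(
                             "select * from flights_2008_7M where dest = 'TPA' limit 100000")) {
                    long start = System.currentTimeMillis();
                    pstmt.execute();
                    System.out.println("query took " + (System.currentTimeMillis() - start) + " ms");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }
}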
Are you sure the query isn't falling back to CPU execution? I used optimized DDLs to be sure the columns used by the query fit into VRAM.
To confirm the query isn't punting to CPU, go into mapd_log/omnisci_server.INFO and, after you run the query, make sure you are not getting messages like this:
Query unable to run in GPU mode, retrying on CPU.
I did a brief test using the 1.2B+ row, non-optimized table (with the default fragment size of 32M) on an AWS server with 4x V100 GPUs, and I had to change the parameter gpu-input-mem-limit to 4 because of a bug (you can change it by adding it to the omnisci.conf file and then restarting the instance).
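For reference, that is a single line in the config file (assuming the stock omnisci.conf format; use the option name exactly as your server version expects it, then restart the server):

# omnisci.conf
gpu-input-mem-limit = 4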
Have you changed the fragment size on your flights table? The one in flights_2008_7M is very low. If not, recreate the table with the default fragment size of 32000000 or bigger.
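Something along these lines should do it (a sketch only: the new table name is made up, and the WITH options follow the CREATE TABLE AS SELECT syntax in the OmniSci docs):

CREATE TABLE flights_2008_7M_big AS
  (SELECT * FROM flights_2008_7M)
  WITH (FRAGMENT_SIZE = 32000000);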
The execution time on a single thread is around 290 ms:
78 %, 84 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
81 %, 88 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
77 %, 84 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
76 %, 83 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
79 %, 85 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
73 %, 80 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
91 %, 99 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
77 %, 84 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
95 %, 100 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
76 %, 82 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
94 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
93 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
82 %, 88 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
95 %, 100 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
75 %, 82 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
94 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
77 %, 83 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
78 %, 85 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
76 %, 83 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
75 %, 82 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
90 %, 97 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
74 %, 80 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
94 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
75 %, 82 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
Running with four threads, the response time increases to around 1100 ms, with a slight increase in GPU utilization:
93 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
85 %, 93 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
89 %, 95 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
95 %, 100 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
90 %, 98 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
94 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
89 %, 96 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
84 %, 91 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
92 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
87 %, 95 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
94 %, 100 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
94 %, 100 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
89 %, 98 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
94 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
89 %, 95 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
84 %, 91 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
88 %, 97 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
Some GPUs are less busy than others because the data is unbalanced; we should shard the table to get an even distribution between the GPUs.
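A rough sketch of what a sharded DDL could look like (the column list is abbreviated and purely illustrative; SHARD KEY plus a shard_count WITH option is the documented OmniSci syntax, and setting shard_count to the number of GPUs is my assumption here):

CREATE TABLE flights_sharded (
  dest TEXT ENCODING DICT(32),   -- illustrative subset of the flights columns
  arrdelay SMALLINT,
  dep_timestamp TIMESTAMP,
  SHARD KEY (dest)
) WITH (SHARD_COUNT = 4, FRAGMENT_SIZE = 32000000);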
The runtimes are so high because, on a projection query like that, the server processes one fragment at a time (default 32M rows), so there is some overhead shuttling data back and forth between CPU and GPU.