Tags: nvidia, omniscidb

OmniSci query is throttling on NVIDIA GPU + CUDA


I was trying to benchmark a few of my queries with OmniSci on a GPU server, but the queries are choking. So I experimented with the sample flights dataset provided by OmniSci itself.

Below are my observations (I am using the JDBC connector).

1. PreparedStatement pstmt2 = conn.prepareStatement(
       "select * from flights_2008_7M natural join omnisci_countries");
   pstmt2.execute(); // with 8 parallel threads
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 00000000:18:00.0 Off |                    0 |
| N/A   43C    P0    45W / 300W |   2343MiB / 16280MiB |     10%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   35C    P0    42W / 300W |   2343MiB / 16280MiB |     15%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   33C    P0    42W / 300W |   2343MiB / 16280MiB |     14%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   38C    P0    42W / 300W |   2343MiB / 16280MiB |     10%      Default |
+-------------------------------+----------------------+----------------------+


2. PreparedStatement pstmt2 = conn.prepareStatement(
       "select * from flights_2008_7M where dest = 'TPA' limit 100000");
   pstmt2.execute(); // with 8 threads

The script hung and nothing moved; in fact there was no GPU utilization at all. I just wanted to check whether it's a configuration issue. How can I maximize GPU utilization and execute some complex queries on a larger dataset?
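For context, a minimal sketch of the kind of 8-thread harness driving these runs. In the real benchmark each task would open a `java.sql.Connection` (the default OmniSci JDBC URL is assumed to be `jdbc:omnisci:localhost:6274:omnisci`) and execute the PreparedStatement; here a sleep stands in for the query so the concurrency pattern is self-contained:

```java
import java.util.concurrent.*;

// Sketch of an 8-thread query benchmark. The query itself is stubbed with a
// sleep; in the real run, replace stubQuery with pstmt.execute() on a
// per-thread JDBC connection.
public class QueryBench {
    public static void main(String[] args) throws Exception {
        int threads = 8;
        Runnable stubQuery = () -> {
            try {
                Thread.sleep(50); // placeholder for pstmt2.execute()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CompletionService<Long> done = new ExecutorCompletionService<>(pool);
        for (int i = 0; i < threads; i++) {
            done.submit(() -> {
                long t0 = System.nanoTime();
                stubQuery.run();
                return (System.nanoTime() - t0) / 1_000_000; // elapsed ms
            });
        }
        for (int i = 0; i < threads; i++) {
            System.out.println("query time ms: " + done.take().get());
        }
        pool.shutdown();
    }
}
```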


Solution

  • Are you sure the query isn't falling back to CPU execution? I used optimized DDLs to be sure the columns used by the query fit into VRAM.

    To be sure the query isn't punting to CPU for execution, look into mapd_log/omnisci_server.INFO after you run the query and check that you are not getting messages like this:

    Query unable to run in GPU mode, retrying on CPU.
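A quick way to scan the log for that message. The default log path is an assumption (it depends on your storage directory); pass your actual omnisci_server.INFO as the first argument, and with no argument the sketch demonstrates on a synthetic sample line:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

// Counts CPU-fallback messages in the server log. With no argument, a
// synthetic log line is written to a temp file so the sketch is
// self-contained; in real use pass e.g. <storage>/mapd_log/omnisci_server.INFO.
public class CpuFallbackCheck {
    public static void main(String[] args) throws IOException {
        Path log;
        if (args.length > 0) {
            log = Paths.get(args[0]);
        } else {
            log = Files.createTempFile("omnisci_server", ".INFO");
            Files.write(log, List.of(
                    "Query unable to run in gpu mode, retrying on cpu"));
        }
        long fallbacks = Files.readAllLines(log).stream()
                .filter(l -> l.toLowerCase().contains("retrying on cpu"))
                .count();
        System.out.println("CPU fallbacks: " + fallbacks);
    }
}
```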

    I did a brief try using the 1.2B+ row, non-optimized table on an AWS server with 4x V100 GPUs, and because of a bug with the default fragment size of 32M I had to change the parameter gpu-input-mem-limit=4 (you can change it by adding the setting to the omnisci.conf file, then restarting the instance).
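    For reference, the change is a one-line entry in omnisci.conf; the exact file location depends on your install, and the key/value syntax below is a sketch of the relevant line only:

        # omnisci.conf — cap per-GPU input buffer memory (GB); restart
        # omnisci_server after changing this
        gpu-input-mem-limit = 4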

    Have you changed the fragment size on your flights table? The one in flights_2008_7M is very low. If not, recreate the table with the default fragment size of 32,000,000 or bigger.
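    A sketch of recreating the table with an explicit fragment size; the column list here is illustrative, not the real flights schema, so take the actual DDL from your existing table first:

        -- Illustrative only: recreate with a bigger fragment size, then copy
        -- the rows over. Get the real column list from the existing table.
        CREATE TABLE flights_2008_7M_big (
          dest TEXT ENCODING DICT(32)
          -- ... remaining columns ...
        ) WITH (fragment_size = 32000000);

        INSERT INTO flights_2008_7M_big SELECT * FROM flights_2008_7M;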

    The execution time on a single thread is around 290 ms:

    78 %, 84 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
    81 %, 88 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
    77 %, 84 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    76 %, 83 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    79 %, 85 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
    73 %, 80 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
    91 %, 99 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    77 %, 84 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    95 %, 100 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
    76 %, 82 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
    94 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    93 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    82 %, 88 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
    95 %, 100 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
    75 %, 82 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    94 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    77 %, 83 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
    78 %, 85 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
    76 %, 83 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    75 %, 82 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    90 %, 97 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
    74 %, 80 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
    94 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    75 %, 82 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    

    Running with four threads, the response time increases to around 1100 ms, with a slight increase in GPU utilization:

    93 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    85 %, 93 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    89 %, 95 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
    95 %, 100 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
    90 %, 98 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    94 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    89 %, 96 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
    84 %, 91 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
    92 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    87 %, 95 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    94 %, 100 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
    94 %, 100 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
    89 %, 98 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    94 %, 100 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    89 %, 95 %, 1530 MHz, 16130 MiB, 6748 MiB, 9382 MiB
    84 %, 91 %, 1530 MHz, 16130 MiB, 6924 MiB, 9206 MiB
    88 %, 97 %, 1530 MHz, 16130 MiB, 8972 MiB, 7158 MiB
    

    Some GPUs are less busy than others because the data is unbalanced; sharding the table would give an even distribution across the GPUs.
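    A sketch of what a sharded version might look like; the shard key column is an assumption (it must be an integer or dictionary-encoded text column), and shard_count should normally match the number of GPUs:

        -- Illustrative: distribute fragments evenly across 4 GPUs.
        CREATE TABLE flights_sharded (
          dest TEXT ENCODING DICT(32),
          -- ... remaining columns ...
          SHARD KEY (dest)
        ) WITH (shard_count = 4);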

    The runtimes are this high because on a projection query like that the server processes one fragment at a time (default 32M rows), so there is some overhead from moving data back and forth between CPU and GPU.