cudaptx

Can I hint to CUDA that it should move a given variable into the L1 cache?


Can I hint to CUDA that it should asynchronously move a given variable into the L1 cache?

I have a deterministic data access pattern (crazy_access_order) that is unfortunately very ill-served by an LRU cache eviction policy, which I think is what CUDA uses. So, is it possible to tell CUDA to start loading a given variable into the L1 cache, so that it's (probably) available by the time I need it?

For example, I'd like async_prefetch_to_l1 to exist so I could to write something like this:

#define PREFETCH_DISTANCE 4

int crazy_access_order(const int i) { ... }

const MyType data[N] = {...};

// Warmup the first PREFECT_DISTANCE - 1 accesses.
for (int i = 0; i < PREFETCH_DISTANCE - 1; ++i) {
  async_prefetch_to_l1(&data[crazy_access_order(i)], sizeof(MyType));
}

// Do the computation.
for (int i = 0; i < N; ++i) {
  // Start to fetch the data PREFETCH_DISTANCE steps into the future.
  async_prefetch_to_l1(&data[crazy_access_order(i + PREFETCH_DISTANCE)], sizeof(MyType));

  // Access the data that should already have been moved into the L1 cache.
  const datum = data[crazy_access_order(i)];

  // Do something with `datum`.
  ...
}

FWIW I've tried various full and partial loop unrolling schemes to give the compiler a full view of the data access order, and they don't seem to help much. I suspect the compiler isn't generating prefetch instructions, assuming they even exist.

EDIT: It appears the prefetch PTX instruction, which is likely to compile to a CCTL or CCTLL SASS instruction, does tell the system to load the given data into the given cache (thanks to Robert Crovella). We also guess that prefetch can return before the prefetch is complete, as this would be consistent with the behavior of LD (load) instructions, but we're not sure.

EDIT 2: I started using prefetch and immediately saw a 2x increase in performance (with a corresponding reduction in Long Scoreboard stalls).


Solution

  • It does seem to be possible to indicate a "prefetch to L1" using the PTX prefetch instruction you spotted.

    Here is a simple test case:

    $ cat t2125.cu
    #include <cstdio>
    template <typename T>
    __device__ void pfL1(T *a){
        asm("prefetch.global.L1 [%0];"
            ::"l"(a));
        return;
    }
    
    __global__ void k(float *r){
    
      pfL1(r);
      printf("%f\n", r[0]);
    }
    
    
    $ nvcc -arch=sm_80 -c t2125.cu
    $ cuobjdump -sass ./t2125.o
    
    Fatbin elf code:
    ================
    arch = sm_80
    code version = [1,7]
    producer = <unknown>
    host = linux
    compile_size = 64bit
    
            code for sm_80
                    Function : _Z1kPf
            .headerflags    @"EF_CUDA_SM80 EF_CUDA_PTX_SM(EF_CUDA_SM80)"
            /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;      /* 0x00000a00ff017624 */
                                                                                    /* 0x000fc400078e00ff */
            /*0010*/                   IMAD.MOV.U32 R2, RZ, RZ, c[0x0][0x160] ;     /* 0x00005800ff027624 */
                                                                                    /* 0x000fe200078e00ff */
            /*0020*/                   ULDC.64 UR4, c[0x0][0x118] ;                 /* 0x0000460000047ab9 */
                                                                                    /* 0x000fe20000000a00 */
            /*0030*/                   IMAD.MOV.U32 R3, RZ, RZ, c[0x0][0x164] ;     /* 0x00005900ff037624 */
                                                                                    /* 0x000fe200078e00ff */
            /*0040*/                   IADD3 R1, R1, -0x8, RZ ;                     /* 0xfffffff801017810 */
                                                                                    /* 0x000fc80007ffe0ff */
            /*0050*/                   CCTL.E.PF1 [R2] ;                            /* 0x000000000200798f */
                                                                                    /* 0x0001ea0000000100 */
            /*0060*/                   LDG.E R0, [R2.64] ;                          /* 0x0000000402007981 */
                                                                                    /* 0x000ea2000c1e1900 */
            /*0070*/                   IADD3 R6, P0, R1, c[0x0][0x20], RZ ;         /* 0x0000080001067a10 */
                                                                                    /* 0x000fe20007f1e0ff */
            /*0080*/                   UMOV UR4, 0x0 ;                              /* 0x0000000000047882 */
                                                                                    /* 0x000fe40000000000 */
            /*0090*/                   UMOV UR5, 0x0 ;                              /* 0x0000000000057882 */
                                                                                    /* 0x000fe20000000000 */
            /*00a0*/                   IADD3.X R7, RZ, c[0x0][0x24], RZ, P0, !PT ;  /* 0x00000900ff077a10 */
                                                                                    /* 0x000fe200007fe4ff */
            /*00b0*/                   IMAD.U32 R4, RZ, RZ, UR4 ;                   /* 0x00000004ff047e24 */
                                                                                    /* 0x000fe2000f8e00ff */
            /*00c0*/                   MOV R5, UR5 ;                                /* 0x0000000500057c02 */
                                                                                    /* 0x000fe20008000f00 */
            /*00d0*/                   F2F.F64.F32 R8, R0 ;                         /* 0x0000000000087310 */
                                                                                    /* 0x004e640000201800 */
            /*00e0*/                   STL.64 [R1], R8 ;                            /* 0x0000000801007387 */
                                                                                    /* 0x0021e80000100a00 */
            /*00f0*/                   MOV R20, 0x0 ;                               /* 0x0000000000147802 */
                                                                                    /* 0x000fe40000000f00 */
            /*0100*/                   MOV R21, 0x0 ;                               /* 0x0000000000157802 */
                                                                                    /* 0x000fc80000000f00 */
            /*0110*/                   CALL.ABS.NOINC 0x0 ;                         /* 0x0000000000007943 */
                                                                                    /* 0x001fea0003c00000 */
            /*0120*/                   EXIT ;                                       /* 0x000000000000794d */
                                                                                    /* 0x000fea0003800000 */
            /*0130*/                   BRA 0x130;                                   /* 0xfffffff000007947 */
                                                                                    /* 0x000fc0000383ffff */
            /*0140*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                    /* 0x000fc00000000000 */
            /*0150*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                    /* 0x000fc00000000000 */
            /*0160*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                    /* 0x000fc00000000000 */
            /*0170*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                    /* 0x000fc00000000000 */
            /*0180*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                    /* 0x000fc00000000000 */
            /*0190*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                    /* 0x000fc00000000000 */
            /*01a0*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                    /* 0x000fc00000000000 */
            /*01b0*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                    /* 0x000fc00000000000 */
            /*01c0*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                    /* 0x000fc00000000000 */
            /*01d0*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                    /* 0x000fc00000000000 */
            /*01e0*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                    /* 0x000fc00000000000 */
            /*01f0*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                    /* 0x000fc00000000000 */
                    ..........
    
    
    
    Fatbin ptx code:
    ================
    arch = sm_80
    code version = [7,4]
    producer = <unknown>
    host = linux
    compile_size = 64bit
    compressed
    $
    

    The instructions here:

            /*0050*/                   CCTL.E.PF1 [R2] ;                            /* 0x000000000200798f */
                                                                                    /* 0x0001ea0000000100 */
            /*0060*/                   LDG.E R0, [R2.64] ;                          /* 0x0000000402007981 */
    

    seem to correspond to the prefetch request, followed by the load of r[0] into R0, in preparation for the call to printf.

    Although I happened to compile for cc 8.0, I observe the CCTL.E.PF1 instruction present when I compile for architectures all the way back to cc 5.2 at least, on CUDA 11.4.

    So you can try that to see how it works for your code.