I've got a large chunk of generated data (A[i,j,k]) on the device, but I only need one 'slice' of it, A[i,:,:]. In regular CUDA this could be easily accomplished with some pointer arithmetic.
Can the same thing be done within PyCUDA? i.e.
cuda.memcpy_dtoh(h_iA,d_A+(i*stride))
Obviously this is completely wrong since there's no size information (unless it is inferred from the destination's shape), but hopefully you get the idea?
The PyCUDA GPUArray class supports slicing of 1D arrays, but not higher dimensions that require a stride (although it is coming). You can, however, get access to the underlying pointer of a multidimensional GPUArray from its gpudata member, which is a pycuda.driver.DeviceAllocation type, and the element size from the GPUArray's dtype.itemsize member. A DeviceAllocation can be cast to int, so you can then do the same sort of pointer arithmetic you had in mind (in bytes) to get something that the driver memcpy functions will accept.
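As a rough sketch of that idea, something like the following should work, assuming the array is C-contiguous (row-major), so that A[i,:,:] is one unbroken block of memory. The helper names slice_byte_offset and copy_slice_to_host are made up for illustration and are not part of PyCUDA. As far as I know, memcpy_dtoh takes the number of bytes to copy from the destination buffer, which is why no explicit size is passed.

```python
import numpy as np


def slice_byte_offset(shape, itemsize, i):
    """Byte offset of A[i, :, :] inside a C-contiguous 3D array.

    Hypothetical helper, not a PyCUDA function.
    """
    _, nj, nk = shape
    return i * nj * nk * itemsize


def copy_slice_to_host(d_A, i):
    """Copy d_A[i, :, :] from a C-contiguous 3D GPUArray to a new host array.

    Hypothetical helper; assumes d_A is a pycuda.gpuarray.GPUArray and that
    a CUDA context already exists (e.g. via `import pycuda.autoinit`).
    """
    # Imported here so the offset helper above stays usable without a GPU.
    import pycuda.driver as cuda

    # The destination buffer's size determines how many bytes are copied.
    h_iA = np.empty(d_A.shape[1:], dtype=d_A.dtype)

    # int() on a DeviceAllocation gives the raw device address; offsets are in bytes.
    src = int(d_A.gpudata) + slice_byte_offset(d_A.shape, d_A.dtype.itemsize, i)

    cuda.memcpy_dtoh(h_iA, src)
    return h_iA
```

With a context set up by `import pycuda.autoinit` and `d_A = pycuda.gpuarray.to_gpu(A)` for a float32 array A, the result of `copy_slice_to_host(d_A, i)` should compare equal to `A[i]`. Slices along the other axes (A[:,j,:] and so on) are not contiguous, so this single-memcpy trick does not apply to them.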
It isn't very pythonic, but it does work (or at least it did when I was doing a lot of pyCUDA + MPI hacking last year).