cuda, tesla, unified-memory

CUDA unified memory pages accessed in CPU but not evicted from GPU


I was trying to understand how CUDA Unified Memory works. I have read the blog post on unified memory for CUDA beginners, and I wrote the code given below:

#include <cstdio>
#include <iostream>
#include <fstream>
#include <climits>
#include <vector>

__global__ void transfer(int *X)
{
    X[threadIdx.x] = X[threadIdx.x]+3;
}
using namespace std;
int main()
{
    int *x;
    size_t free_bytes, total_bytes;
    
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "Before cudaMallocManaged: " << "free: " << free_bytes << " total: " << total_bytes <<'\n'; 
    cudaMallocManaged(&x,sizeof(int)*512);
    
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "After cudaMallocManaged and Before Prefetch to GPU: " << "free: " << free_bytes << " total: " << total_bytes <<'\n';
    std::cout <<  cudaMemPrefetchAsync(x, sizeof(int)*512, 0);
    cudaMemset(x,0,sizeof(int)*512);
    cudaDeviceSynchronize();
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "\nAfter Prefetch to GPU Before Kernel call: " << "free: " << free_bytes << " total: " << total_bytes <<'\n'; 
    transfer<<<1,512>>>(x);
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "After Kernel call Before memAdvise: " << "free: " << free_bytes << " total: " << total_bytes <<'\n';
    cudaMemAdvise(x,sizeof(int)*512, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "After memAdvise Before Prefetch to CPU: " << "free: " << free_bytes << " total: " << total_bytes <<'\n';
    std::cout << cudaMemPrefetchAsync(x, sizeof(int)*512, cudaCpuDeviceId);
    cudaDeviceSynchronize();
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "\nAfter Prefetch Before processing in CPU: " << "free: " << free_bytes << " total: " << total_bytes <<'\n'; 
    for(int i=0;i<512;i++)
    {
        x[i] = x[i]+1;
        std::cout << x[i];
    }
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "\nAfter processing in CPU Before free: " << "free: " << free_bytes << " total: " << total_bytes <<'\n';
    cudaFree(x);
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "After free: " << "free: " << free_bytes << " total: " << total_bytes <<'\n';
    return 0;
}

Output:

Before cudaMallocManaged: free: 16804216832 total: 17071734784
After cudaMallocManaged and Before Prefetch to GPU: free: 16804216832 total: 17071734784
0
After Prefetch to GPU Before Kernel call: free: 16669999104 total: 17071734784
After Kernel call Before memAdvise: free: 16669999104 total: 17071734784
After memAdvise Before Prefetch to CPU: free: 16669999104 total: 17071734784
0
After Prefetch Before processing in CPU: free: 16669999104 total: 17071734784
44444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
After processing in CPU Before free: free: 16669999104 total: 17071734784
After free: free: 16674193408 total: 17071734784

I am running the code on Kaggle, which provides a 16 GB Tesla P100 PCIe GPU. I have an array of integers x allocated using cudaMallocManaged(). First I prefetch the array to the GPU and do some processing on it, and then I prefetch it to the CPU and do some processing there. In between, I print the free memory available on the GPU before and after each transfer. I have two questions based on this:

  1. During the first prefetch, just after cudaMallocManaged(), the free memory decreases by far more than the 2 KB I am allocating. Why?

  2. The free memory before and after prefetching to the CPU is the same. Also, when I access and modify the array on the CPU, the free memory on the GPU still does not change. I don't understand why. When a unified memory range is prefetched to or processed on the CPU, shouldn't the corresponding pages be evicted from the GPU and moved to the CPU, and shouldn't that free up GPU memory?


Solution

    1. There is considerable overhead in setting up a fully functional CUDA environment on a GPU. This overhead can exceed 100 MB, not including your data.
    2. CUDA has a lazy initialization system.

    During the first prefetch, just after cudaMallocManaged(), the free memory decreases by far more than the 2 KB I am allocating. Why?

    Because CUDA has a lazy initialization system, the runtime may build up more and more of the environment needed to run your kernel code, along with the memory overhead that entails, as you continue to make CUDA runtime API calls in your program. By the point of kernel launch, most or all of this initialization is complete, except for things associated with new resource usage. So the reduction in free memory reflects your allocation plus the additional overhead of CUDA itself.
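
    You can observe this overhead without any user allocation at all. The following is a minimal sketch (no error checking; cudaFree(0) is a common idiom for forcing context creation, and the exact numbers will vary with driver version and GPU):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummy() { }

    int main()
    {
        size_t f0, f1, f2, total;

        cudaMemGetInfo(&f0, &total);   // this first runtime call already triggers context creation
        cudaFree(0);                   // idiom to force any remaining lazy setup
        cudaMemGetInfo(&f1, &total);

        dummy<<<1, 1>>>();             // the first kernel launch completes most initialization
        cudaDeviceSynchronize();
        cudaMemGetInfo(&f2, &total);

        std::printf("baseline:           %zu bytes free\n", f0);
        std::printf("after cudaFree(0):  %zu bytes free\n", f1);
        std::printf("after first launch: %zu bytes free\n", f2);
        return 0;
    }

    Depending on the driver, some of the drop may only appear at the first kernel launch; that is the lazy initialization described above.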

    The free memory before and after prefetching to the CPU is the same. Also, when I access and modify the array on the CPU, the free memory on the GPU still does not change. I don't understand why.

    The amount of memory in question here is on the order of 100 MB; your 512*sizeof(int) allocation is insignificant by comparison. Furthermore, nothing in the CUDA documentation states what happens to the underlying allocation as a result of demand-paging. You seem to assume that demand-paging automatically frees the underlying allocation when its content is paged out. That is not stated anywhere and it is not the case; the exact behavior is unspecified. Furthermore, the GPU in your setting can be oversubscribed, so there is no particular reason to free allocations immediately.
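
    Oversubscription itself is easy to demonstrate. Here is a hypothetical sketch (assuming a Pascal-or-newer GPU such as your P100, and enough host RAM to back the evicted pages; the 1 GiB margin is arbitrary):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch(char *p, size_t n)
    {
        // grid-stride loop: faults pages onto the GPU on demand
        for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
             i < n;
             i += (size_t)gridDim.x * blockDim.x)
            p[i] = 1;
    }

    int main()
    {
        size_t free_bytes, total_bytes;
        cudaMemGetInfo(&free_bytes, &total_bytes);

        // deliberately request more managed memory than the GPU has free
        size_t n = free_bytes + (1ull << 30);
        char *p;
        if (cudaMallocManaged(&p, n) != cudaSuccess) { std::printf("alloc failed\n"); return 1; }

        touch<<<256, 256>>>(p, n);   // succeeds: pages are migrated and evicted as needed
        std::printf("kernel status: %d\n", (int)cudaDeviceSynchronize());
        cudaFree(p);
        return 0;
    }

    This is the oversubscription referred to above: the driver migrates and evicts pages as needed, so there is no pressure to release backing storage the moment pages leave the GPU.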

    When a unified memory range is prefetched to or processed on the CPU, shouldn't the corresponding pages be evicted from the GPU and moved to the CPU, and shouldn't that free up GPU memory?

    Prefetching is not the same as eviction, but yes, prefetching to the CPU means the corresponding pages are no longer resident in that GPU's memory. However, there is no reason to think that this automatically or immediately frees up GPU memory. You can expect the memory to be freed when you call cudaFree on the allocated pointer, not before.
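
    If you want to confirm where the range was last prefetched, independently of cudaMemGetInfo, you can query the range attributes. A minimal sketch reusing x and the sizes from your code (note that this reports the destination of the last explicit prefetch request, not live page residency):

    // add after cudaMemPrefetchAsync(x, sizeof(int)*512, cudaCpuDeviceId)
    // and the following cudaDeviceSynchronize():
    int loc = 0;   // receives a device ordinal, or cudaCpuDeviceId for the CPU
    cudaMemRangeGetAttribute(&loc, sizeof(loc),
                             cudaMemRangeAttributeLastPrefetchLocation,
                             x, sizeof(int)*512);
    std::cout << "last prefetch location: " << loc
              << " (cudaCpuDeviceId = " << cudaCpuDeviceId << ")\n";

    This should report the CPU as the last prefetch location even while cudaMemGetInfo shows unchanged free memory, which is consistent with the two observations above.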