Let me start by admitting that the concept of high memory and low memory on Linux is still not completely clear in my mind even after reading several relevant resources. However, from what I understand on 64-bit Linux there's no high memory anyway (correct me if I am wrong).
I am trying to understand how kmap and address spaces work on Linux kernel version 5.8.1 configured with defconfig for arm64.
I have added the following system call:
SYSCALL_DEFINE1(mycall, unsigned long __user, user_addr)
{
    struct page *pages[1];
    int *p1, *p2;

    p1 = (int *) user_addr;
    *p1 = 1; /* this works */
    pr_info("kernel: first: 0x%lx", (long unsigned) p1);

    if (get_user_pages(user_addr, 1, FOLL_WRITE, pages, NULL) != 1)
        return -1;

    p2 = kmap(pages[0]);
    *p2 = 2; /* this also works */
    pr_info("kernel: second: 0x%lx", (long unsigned) p2);

    return 0;
}
From user-space I allocate a whole memory page (on a page boundary) which I pass to the kernel as a parameter to that system call. Modifying that memory by dereferencing either pointer from within the kernel works perfectly fine. However, the two pointers have different values:
[ 4.493480] kernel: first: 0x4ff3000
[ 4.493888] kernel: second: 0xffff000007ce9000
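For reference, the user-space side described above could look roughly like the sketch below. The syscall number MYCALL_NR is hypothetical (it depends on where mycall was added to the syscall table), and the helper names alloc_one_page() and call_mycall() are made up for illustration.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Hypothetical number: use whatever slot mycall occupies in your syscall table. */
#define MYCALL_NR 440

/* Allocate one zeroed page that starts on a page boundary, or return NULL. */
void *alloc_one_page(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *page = NULL;

    if (page_size <= 0)
        return NULL;
    if (posix_memalign(&page, (size_t)page_size, (size_t)page_size) != 0)
        return NULL;
    memset(page, 0, (size_t)page_size);
    return page;
}

/* Pass the page's user-space address to the (hypothetical) system call. */
long call_mycall(void *page)
{
    return syscall(MYCALL_NR, (unsigned long)page);
}
```

Calling call_mycall(alloc_one_page()) and then reading the first int of the page from user space would show the value last written by the kernel.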
From what I understand, get_user_pages returns the physical page corresponding to that user address (in current's address space). Then, since there's no high memory, I expected kmap to return the exact same address from the user part of the address space.
According to the virtual memory layout of arm64, the address returned by kmap lies in a range described as "kernel logical memory map". Is this a new mapping just created by kmap, or is this another previously existing mapping for the same page?
Can somebody explain what exactly is going on here?
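The memory referred to by user_addr (or p1) and by p2 will be the same physical memory pages once they have actually been pinned into physical memory by get_user_pages(). (Before the get_user_pages() call, the pages might not be in physical memory yet.) However, user_addr (and p1) is a user-space address of the page, and p2 is a kernel-space address of the page. In general, kmap() creates a temporary mapping of a physical memory page into kernel space. On kernels built without high memory support (CONFIG_HIGHMEM), which includes arm64, there is nothing new to map: kmap() simply returns the address the page already has in the kernel's linear mapping of physical memory, which is the "kernel logical memory map" range you found.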
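To illustrate why the kmap() result lands in the "kernel logical memory map" range, here is a rough sketch of the arithmetic behind a linear mapping. It is not the kernel's code. It assumes 48-bit virtual addresses on a 5.4 or later arm64 kernel, where the linear map is expected to start at 0xffff000000000000, and it takes the physical address where RAM starts as a parameter because that is platform-dependent. The function names are made up for this example.

```c
#include <stdint.h>

/* Assumption: start of the arm64 linear map with 48-bit virtual addresses. */
#define ASSUMED_PAGE_OFFSET UINT64_C(0xffff000000000000)

/* Linear-map virtual address of a physical address.
 * ram_start is the physical address where RAM begins (platform-dependent). */
uint64_t linear_map_virt(uint64_t phys, uint64_t ram_start)
{
    return phys - ram_start + ASSUMED_PAGE_OFFSET;
}

/* Inverse: physical address behind a linear-map virtual address. */
uint64_t linear_map_phys(uint64_t virt, uint64_t ram_start)
{
    return virt - ASSUMED_PAGE_OFFSET + ram_start;
}
```

Under these assumptions, 0xffff000007ce9000 is simply the page that sits 0x7ce9000 bytes into RAM. With a hypothetical RAM base of 0x40000000 that would be physical address 0x47ce9000.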
On arm64 (and also amd64), if bit 63 is treated as a sign bit, then user-space addresses are non-negative and kernel space addresses are negative. So there is no way that the numeric values of the user-space and kernel-space addresses can be equal.
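The sign-bit rule can be checked with a few lines of ordinary C. This is only an illustration using the two addresses from the question's log; is_kernel_half() is a made-up helper, not a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if bit 63 is set, i.e. the address is "negative" when read as a
 * signed 64-bit value and so belongs to the kernel half of the address space. */
bool is_kernel_half(uint64_t addr)
{
    return (addr >> 63) != 0;
}
```

With the values from the log, is_kernel_half(0x4ff3000) is false and is_kernel_half(0xffff000007ce9000) is true.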
Most kernel code should not dereference user-space pointers directly. The user-space memory access functions and macros should be used, and should be checked for failures. The first part of your example should be something like:
int __user *p1 = (int __user *)user_addr;

if (put_user(1, p1))
    return -EFAULT;
pr_info("kernel: first: 0x%lx\n", (unsigned long)p1);
put_user() will return 0 on success or -EFAULT on failure.
get_user_pages() will return either the number of pages pinned into memory, or a negative errno value if none of the requested pages could be pinned. (It will only return 0 if the number of requested pages is 0.) The number of pages actually pinned may be less than the number requested, but since your code only requests a single page, the return value in that case would be either 1 or a negative errno value. You can use a variable to capture the error number. Note that it must be called with the current task's mmap lock (the mmap semaphore) held for reading:
#define NR_REQ 1
struct page *pages[NR_REQ];
long nr_gup;

mmap_read_lock(current->mm);
nr_gup = get_user_pages(user_addr, NR_REQ, FOLL_WRITE, pages, NULL);
mmap_read_unlock(current->mm);
if (nr_gup < 0)
    return nr_gup;
if (nr_gup < NR_REQ) {
    /* Some example code to deal with not all pages pinned - just 'put' them. */
    long i;

    for (i = 0; i < nr_gup; i++)
        put_page(pages[i]);
    return -ENOMEM;
}
Note: You could use get_user_pages_fast() instead of get_user_pages(). If get_user_pages_fast() is used, the calls to mmap_read_lock() and mmap_read_unlock() above must be removed:
#define NR_REQ 1
struct page *pages[NR_REQ];
long nr_gup;

nr_gup = get_user_pages_fast(user_addr, NR_REQ, FOLL_WRITE, pages);
if (nr_gup < 0)
    return nr_gup;
if (nr_gup < NR_REQ) {
    /* Some example code to deal with not all pages pinned - just 'put' them. */
    long i;

    for (i = 0; i < nr_gup; i++)
        put_page(pages[i]);
    return -ENOMEM;
}
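kmap() will temporarily map a page into kernel address space. It should be paired with a call to kunmap() to release the temporary mapping. Note that kunmap() takes the struct page pointer that was passed to kmap(), not the address that kmap() returned:
p2 = kmap(pages[0]);
/* do something with p2 here ... */
kunmap(pages[0]);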
Pages pinned by get_user_pages() need to be 'put' using put_page() when finished with. If they have been written to, they first need to be marked 'dirty' using set_page_dirty_lock(). The last part of your example should be something like:
p2 = kmap(pages[0]);
*p2 = 2; /* this also works */
pr_info("kernel: second: 0x%lx\n", (unsigned long)p2);
kunmap(pages[0]); /* kunmap() takes the page, not the mapped address */
set_page_dirty_lock(pages[0]);
put_page(pages[0]);
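The above code is not completely robust. kmap() returns the address of the start of the page, so if user_addr is not page-aligned, its offset within the page has to be added to p2. The resulting pointer could then be misaligned for the *p2 dereference, or *p2 could straddle a page boundary. Robust code needs to deal with such situations.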
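The checks this calls for come down to simple arithmetic on the user address. The sketch below shows them as plain C, assuming 4 KiB pages. The helper names are made up for this example; in kernel code, the offset_in_page() macro and PAGE_SIZE would normally be used instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define ASSUMED_PAGE_SIZE 4096u

/* Offset of the address within its page. */
size_t page_offset_of(uint64_t user_addr)
{
    return (size_t)(user_addr & (ASSUMED_PAGE_SIZE - 1));
}

/* True if the address is a multiple of align (align must be nonzero). */
bool is_aligned_for(uint64_t user_addr, size_t align)
{
    return (user_addr % align) == 0;
}

/* True if an object of len bytes at user_addr extends into the next page. */
bool straddles_page(uint64_t user_addr, size_t len)
{
    return len > 0 && page_offset_of(user_addr) + len > ASSUMED_PAGE_SIZE;
}
```

For the page-aligned address in the question, the offset is 0 and a 4-byte int fits in the page. An address such as 0x4ff3ffe would be both misaligned for an int and would straddle the page boundary, so a single pinned page would not be enough.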
Accessing memory through user-space addresses has to be done with the special user-space access functions and macros, may sleep due to page faults (unless the pages have been locked into physical memory), and is only valid (if at all) within a single process. Locking the user-space address region into memory with get_user_pages() and mapping the pages into kernel address space (if required) is therefore useful in some circumstances. It allows the memory to be accessed from an arbitrary kernel context, such as an interrupt handler. It allows bulk copies to and from memory-mapped I/O (memcpy_toio() or memcpy_fromio()). DMA operations can be performed on user memory once it has been locked down by get_user_pages(). In that case, the pages will be mapped to "DMA addresses" by the DMA API.