I am using kernel 5.4.93 (with an Ubuntu 20.04 rootfs) on an ARMv8-based embedded system. The "BIOS" is U-Boot, built without PCIe support, so PCIe bus enumeration is done solely by the kernel.
The issue concerns prefetchable BAR allocation during PCIe bus enumeration; I believe it is related to alignment.
The following is the output of `lspci -tv`, showing the bus topology:
-[0000:00]---00.0-[01-09]--+-00.0-[02-09]--+-00.0-[03]----00.0 Intel Corporation I211 Gigabit Network Connection
| +-01.0-[04]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X]
| | \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]
| +-02.0-[05]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
| +-03.0-[06]----00.0 Shanghai Enflame Technology Co. Ltd I20 [CloudBlazer]
| +-04.0-[07]----00.0 Renesas Technology Corp. uPD720201 USB 3.0 Host Controller
| \-05.0-[08-09]----00.0-[09]----00.0 ASPEED Technology, Inc. ASPEED Graphics Family
\-00.1 PMC-Sierra Inc. Device 4052
The symptom is that the system fails to allocate the 64-bit prefetchable BARs for the AMD RX550 graphics function (unassigned at BDF `4:0.0`), because the bridge above it (`2:1.0`) fails to allocate a 6G prefetchable window (shown as disabled) for the RX550 card.
root@u2004:/home/cmic# lspci -s 4:0.0 -v
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7) (prog-if 00 [VGA controller])
Subsystem: ASUSTeK Computer Inc. Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X]
Physical Slot: 2
Flags: fast devsel, IRQ 255
Memory at <unassigned> (64-bit, prefetchable) [disabled]
Memory at <unassigned> (64-bit, prefetchable) [disabled]
I/O ports at <unassigned> [disabled]
Memory at 33100000 (32-bit, non-prefetchable) [disabled] [size=256K]
Expansion ROM at 33140000 [virtual] [disabled] [size=128K]
root@u2004:/home/cmic# lspci -s 2:1.0
02:01.0 PCI bridge: PMC-Sierra Inc. Device 4052
root@u2004:/home/cmic# lspci -s 2:1.0 -v
02:01.0 PCI bridge: PMC-Sierra Inc. Device 4052 (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 14
Bus: primary=02, secondary=04, subordinate=04, sec-latency=0
I/O behind bridge: [disabled]
Memory behind bridge: 33100000-332fffff [size=2M]
Prefetchable memory behind bridge: [disabled]
The following picture depicts why bridge `2:1.0` fails to find 6G of contiguous space: there are two free 4G regions, but they are not adjacent.
I have a workaround: changing the PCIe `ranges` in the DTS so that the 32G space starts from `12_0000_0000`. The 24G window is then allocated at the beginning, leaving a contiguous 4G+4G=8G at the tail for the RX550, as shown by the following `/proc/iomem` fragment:
1140000000-1fffffffff : pcie@d0000
1200000000-19ffffffff : PCI Bus 0000:01
1200000000-19ffffffff : PCI Bus 0000:02
1200000000-17ffffffff : PCI Bus 0000:06 <= i20, 24G
1200000000-1203ffffff : 0000:06:00.0 |-- 64M (bar 4)
1400000000-17ffffffff : 0000:06:00.0 `-- 16G (bar 2)
1800000000-197fffffff : PCI Bus 0000:04 <= rx550, 6G
1800000000-18ffffffff : 0000:04:00.0 |-- 4G (bar 0)
1900000000-19001fffff : 0000:04:00.0 `-- 2M (bar 2)
1980000000-19801fffff : PCI Bus 0000:03
1980200000-19803fffff : PCI Bus 0000:05
1980400000-19805fffff : PCI Bus 0000:07
1980600000-19807fffff : PCI Bus 0000:08
My question: what is the root cause of the problem (without the workaround)? Is this a kernel bug, or a configuration bug somewhere?
PS: I added some printk traces to v5.4.93 and compared with a virtual host running under `qemu-system-aarch64` (which does not have this issue). So far the difference lies in the first argument to `calculate_mem_align(aligns[], max_order=13)`: on the real system `aligns[0]=0x400000` and the rest are zero, while on the virt host all `aligns[]` are zero. This causes the function to return different values (4GiB on the real system vs 8GiB on the virt host). I can reproduce the difference in a standalone C program:
// https://elixir.bootlin.com/linux/v5.4.93/source/drivers/pci/setup-bus.c
#include <stdio.h>
#include <string.h>
typedef unsigned long long resource_size_t;
#define ALIGN(x, a) __ALIGN_KERNEL((x), (a)) // align `x` to alignment `a`
#define __ALIGN_KERNEL(x, a) __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
#define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
resource_size_t calculate_mem_align(resource_size_t *aligns,
int max_order)
{
resource_size_t align_sum = 0;
resource_size_t min_align = 0;
int order;
printf("max_order=%d, aligns[]=\n", max_order);
for (order = 0; order <= max_order; order++) {
resource_size_t cur_align = 1;
cur_align <<= (order + 20);
if (!align_sum) {
min_align = cur_align;
}
else {
if (ALIGN(align_sum + min_align, min_align) < cur_align) {
min_align = cur_align >> 1;
} else {
printf("do nothing\n");
}
}
align_sum += aligns[order];
printf(" order=%2d: aligns[%2d]=%09llx, cur_align=%09llx, min_align=%09llx, align_sum=%09llx\n", order, order, aligns[order], cur_align, min_align, align_sum);
}
printf("return min_align=%#llx\n", min_align);
return min_align;
}
int main(void)
{
resource_size_t aligns0[18];
resource_size_t aligns1[18];
int max_order = 13;
memset(aligns0, 0, sizeof(aligns0));
memset(aligns1, 0, sizeof(aligns1));
aligns1[0] = 0x400000;
calculate_mem_align(aligns0, max_order); /* all zero: the virt host case */
calculate_mem_align(aligns1, max_order); /* aligns[0]=0x400000: the real system case */
return 0;
}
$ ./a.out
max_order=13, aligns[]=
order= 0: aligns[ 0]=000000000, cur_align=000100000, min_align=000100000, align_sum=000000000
order= 1: aligns[ 1]=000000000, cur_align=000200000, min_align=000200000, align_sum=000000000
order= 2: aligns[ 2]=000000000, cur_align=000400000, min_align=000400000, align_sum=000000000
order= 3: aligns[ 3]=000000000, cur_align=000800000, min_align=000800000, align_sum=000000000
order= 4: aligns[ 4]=000000000, cur_align=001000000, min_align=001000000, align_sum=000000000
order= 5: aligns[ 5]=000000000, cur_align=002000000, min_align=002000000, align_sum=000000000
order= 6: aligns[ 6]=000000000, cur_align=004000000, min_align=004000000, align_sum=000000000
order= 7: aligns[ 7]=000000000, cur_align=008000000, min_align=008000000, align_sum=000000000
order= 8: aligns[ 8]=000000000, cur_align=010000000, min_align=010000000, align_sum=000000000
order= 9: aligns[ 9]=000000000, cur_align=020000000, min_align=020000000, align_sum=000000000
order=10: aligns[10]=000000000, cur_align=040000000, min_align=040000000, align_sum=000000000
order=11: aligns[11]=000000000, cur_align=080000000, min_align=080000000, align_sum=000000000
order=12: aligns[12]=000000000, cur_align=100000000, min_align=100000000, align_sum=000000000
order=13: aligns[13]=000000000, cur_align=200000000, min_align=200000000, align_sum=000000000
return min_align=0x200000000
max_order=13, aligns[]=
order= 0: aligns[ 0]=000400000, cur_align=000100000, min_align=000100000, align_sum=000400000
do nothing
order= 1: aligns[ 1]=000000000, cur_align=000200000, min_align=000100000, align_sum=000400000
do nothing
order= 2: aligns[ 2]=000000000, cur_align=000400000, min_align=000100000, align_sum=000400000
order= 3: aligns[ 3]=000000000, cur_align=000800000, min_align=000400000, align_sum=000400000
order= 4: aligns[ 4]=000000000, cur_align=001000000, min_align=000800000, align_sum=000400000
order= 5: aligns[ 5]=000000000, cur_align=002000000, min_align=001000000, align_sum=000400000
order= 6: aligns[ 6]=000000000, cur_align=004000000, min_align=002000000, align_sum=000400000
order= 7: aligns[ 7]=000000000, cur_align=008000000, min_align=004000000, align_sum=000400000
order= 8: aligns[ 8]=000000000, cur_align=010000000, min_align=008000000, align_sum=000400000
order= 9: aligns[ 9]=000000000, cur_align=020000000, min_align=010000000, align_sum=000400000
order=10: aligns[10]=000000000, cur_align=040000000, min_align=020000000, align_sum=000400000
order=11: aligns[11]=000000000, cur_align=080000000, min_align=040000000, align_sum=000400000
order=12: aligns[12]=000000000, cur_align=100000000, min_align=080000000, align_sum=000400000
order=13: aligns[13]=000000000, cur_align=200000000, min_align=100000000, align_sum=000400000
return min_align=0x100000000
But I don't quite understand the logic behind the function; can anyone explain it a bit?
The short answer is that on my platform the PCIe controller driver (QDMA) is not from the upstream kernel, but from Xilinx AR76647. Its `xilinx_pcie_probe()` function calls `pci_assign_unassigned_bus_resources()` instead of `pci_bus_size_bridges()` + `pci_bus_assign_resources()` (as `pci_host_probe()` does).
One major difference between the two paths shows up in the later call to `__pci_bus_size_bridges(bus, realloc_head)`, in its second argument: `pci_assign_unassigned_bus_resources()` passes a non-null `realloc_head`, while `pci_bus_size_bridges()` passes a null `realloc_head`. The subsequent calls to `calculate_memsize()` and `calculate_mem_align()` then give different results, which leads to the issue on my platform.
I am not yet very clear on how the scanning process works internally, but it seems that `pci_assign_unassigned_bus_resources()` assumes the firmware (EFI BIOS or U-Boot) has already enumerated the PCI bus, and that the kernel only handles what is left. This assumption does not hold on my platform (as explained in the question above), hence the issue. The issue can be reproduced on the qemu virt host if I modify `pci_host_probe()` to use `pci_assign_unassigned_bus_resources(bus)`. On my platform, the following change in `xilinx_pcie_probe()` resolved the issue.
#if (0)
pci_assign_unassigned_bus_resources(bus);
#else
pci_bus_size_bridges(bus);
pci_bus_assign_resources(bus);
#endif
Btw, the following is a brief call tree of xilinx probe function, to get a grasp of the scanning process:
xilinx_pcie_probe()
|-- xilinx_pcie_parse_dt()
|-- xilinx_pcie_init_port()
|-- xilinx_pcie_init_irq_domain()
|-- devm_of_pci_get_host_bridge_resource()
|-- devm_request_pci_bus_resources()
|-- pci_scan_root_bus_bridge()
| |-- pci_register_host_bridge()
| |-- pci_bus_insert_busn_res()
| |-- pci_scan_child_bus()
| | `== pci_scan_child_bus_extend() <-------+
| | |-- pci_scan_slot(bus, devfn=0~256) |
| | | |-- pci_scan_single_device() |
| | | |-- pci_scan_device() |
| | | `-- pci_device_add() |
| | | |-- pci_configure_device() |
| | | |-- device_initialize() |
| | | |-- pci_fixup_device() |
| | | |-- pci_reassigndev_resource_alignment() |
| | | |-- ... |
| | `-- for_each_pci_bridge: pci_scan_bridge_extend() |
| | |-- ... |
| | |== pci_scan_child_bus_extend() <== recursive
| | |-- ...
| `-- pci_bus_update_busn_res_end()
|-- pci_assign_unassigned_bus_resources()
| |== for each bridge: __pci_bus_size_bridges(bus, add_list) <-------+
| | |-- for each cardbus: pci_bus_size_cardbus() |
| | |== for each bridge: __pci_bus_size_bridge() <== recursive
| | |-- pci_bridge_check_ranges()
| | |-- pbus_size_io()
| | |-- pbus_size_mem(pref-64)
| | | |-- ...
| | | |-- calculate_memsize()
| | | |-- calculate_mem_align()
| | | |-- ...
| | |-- pbus_size_mem(pref-32)
| | `-- pbus_size_mem(non-pref)
| `== __pci_bus_assign_resources() <-------+
| |-- pbus_assign_resources_sorted() |
| | |-- for each dev: __dev_sort_resources() |
| | `-- __assign_resources_sorted() |
| | |-- assign_requested_resources_sorted() |
| | |-- pci_assign_resource() |
| `-- for each dev: |
| |-- pdev_assign_fixed_resources() |
| |== __pci_bus_assign_resources() <== recursive
| `-- pci_setup_bridge()
|-- loop for each child: pci_bus_configure_settings()
`-- pci_bus_add_devices()
The following is the call stack up to `__pci_bus_size_bridges()` for the Xilinx driver (before modification) on my platform:
Call trace:
[ 10.325976] dump_backtrace+0x0/0x18c
[ 10.330144] show_stack+0x28/0x3c
[ 10.333917] dump_stack+0xb4/0x110
[ 10.337794] __pci_bus_size_bridges+0xdc/0xa0c
[ 10.342858] pci_assign_unassigned_bus_resources+0x94/0x100 <== notice this call
[ 10.349205] xilinx_pcie_probe+0x96c/0x9d0
[ 10.353871] platform_drv_probe+0x5c/0xb0
[ 10.358436] really_probe+0xf0/0x4a0
[ 10.362505] driver_probe_device+0xec/0x134
[ 10.367268] device_driver_attach+0xc0/0xcc
[ 10.372031] __driver_attach+0xac/0x164
[ 10.376397] bus_for_each_dev+0x80/0xd0
[ 10.380763] driver_attach+0x34/0x40
[ 10.384834] bus_add_driver+0x14c/0x240
[ 10.389204] driver_register+0x7c/0x124
[ 10.393567] __platform_driver_register+0x58/0x6c
[ 10.398928] pcie_xdma_pl_init+0x24/0x2c
[ 10.403394] do_one_initcall+0x50/0x250
[ 10.407760] kernel_init_freeable+0x1f4/0x2bc
[ 10.412721] kernel_init+0x1c/0x114
[ 10.416693] ret_from_fork+0x10/0x18
and the following is the corresponding stack on qemu virt host:
[ 0.956000] Call trace:
[ 0.956141] dump_backtrace+0x0/0x134
[ 0.956321] show_stack+0x14/0x20
[ 0.956487] dump_stack+0xb4/0x110
[ 0.956655] __pci_bus_size_bridges+0xe8/0xa50
[ 0.956859] pci_bus_size_bridges+0x14/0x20 <== a different call
[ 0.957057] pci_host_probe+0x60/0xbc
[ 0.957234] pci_host_common_probe+0xdc/0x1dc
[ 0.957442] gen_pci_probe+0x30/0x40
[ 0.957616] platform_drv_probe+0x50/0xa0
[ 0.957806] really_probe+0xd8/0x420
[ 0.957978] driver_probe_device+0x54/0xe4
[ 0.958178] device_driver_attach+0xb4/0xc0
[ 0.958376] __driver_attach+0x80/0x114
[ 0.958586] bus_for_each_dev+0x6c/0xc0
[ 0.958791] driver_attach+0x20/0x30
[ 0.958982] bus_add_driver+0xfc/0x1e0
[ 0.959180] driver_register+0x74/0x120
[ 0.959383] __platform_driver_register+0x44/0x50
[ 0.959629] gen_pci_driver_init+0x18/0x20
[ 0.959843] do_one_initcall+0x4c/0x1b0
[ 0.960045] kernel_init_freeable+0x194/0x23c
[ 0.960270] kernel_init+0x10/0x100
[ 0.960457] ret_from_fork+0x10/0x24
PS: tracing this issue took me some time. Along the way I found that comparing against a qemu virt host helps a lot: it is simply much quicker to modify the kernel (mostly adding `printk()`) and re-run it under qemu. So below I give a brief description of my qemu setup for this debugging task.
The original idea was to use qemu to reproduce the issue seen on my arm64 platform. As the PCIe core code is platform-independent, my first try was `qemu-system-x86_64`. The problem with the qemu x86_64 platform is that it uses a UEFI BIOS by default, so the PCIe scanning process is not the same as on my platform; furthermore, the device tree is replaced by ACPI tables, and it is hard to modify the PCIe bus topology of the virtual guest. So I later switched to `qemu-system-aarch64` with `u-boot` as the BIOS.
The following components are needed for setting up the virt guest:
The following picture summarizes the components involved so far:
After the necessary setup, the following command establishes a PCIe topology on the virt guest that mimics my platform.
#!/bin/bash
# run as root
QEMU="/usr/local/bin/qemu-system-aarch64"
MACHINE="virt-9.0"
BIOS="/tank/work/u-boot/u-boot.bin"
KERNEL="/tank/work/tmp/Image-aarch64-v5.4.93"
APPEND="root=/dev/vda2"
HDD="/tank/work/tmp/pcie-test-arm64.qcow2"
MEM="8G"
EP_NIC_I211="pci-bardev,bus=dsp0,addr=0.0,bar0np32=128K,bar3np32=16K"
EP_VID_RX550="pci-bardev,bus=dsp1,addr=0.0,bar4p64=4G,bar2p64=2M,bar6np32=256K"
EP_NVME_SM980="pci-bardev,bus=dsp2,addr=0.0,bar0np64=16K"
EP_DPU_I20="pci-bardev,bus=dsp3,addr=0.0,bar0np32=16K,bar1np32=16M,bar2p64=64M,bar4p64=16G"
EP_USB_UPD7202="pci-bardev,bus=dsp4,addr=0.0,bar0np64=8K"
EP_GFX_AST="pci-bardev,bus=bridge2,addr=1.0,bar0np32=16M,bar1np32=256K"
${QEMU} \
-machine ${MACHINE} \
-cpu cortex-a53 \
-smp 8 \
-nographic \
-no-reboot \
-m ${MEM} \
-bios ${BIOS} \
-append ${APPEND} \
-kernel ${KERNEL} \
-drive if=virtio,file=${HDD},cache=none \
-netdev tap,id=mynet,ifname=qemu-tap0,script=no,downscript=no \
-device virtio-net-pci,netdev=mynet \
-device pci-bridge,bus=pcie.0,addr=8.0,id=bridge1,chassis_nr=1 \
-device x3130-upstream,bus=bridge1,addr=1.0,id=usp0 \
-device xio3130-downstream,bus=usp0,addr=0.0,id=dsp0,chassis=1,slot=0 \
-device xio3130-downstream,bus=usp0,addr=1.0,id=dsp1,chassis=1,slot=1 \
-device xio3130-downstream,bus=usp0,addr=2.0,id=dsp2,chassis=1,slot=2 \
-device xio3130-downstream,bus=usp0,addr=3.0,id=dsp3,chassis=1,slot=3 \
-device xio3130-downstream,bus=usp0,addr=4.0,id=dsp4,chassis=1,slot=4 \
-device xio3130-downstream,bus=usp0,addr=5.0,id=dsp5,chassis=1,slot=5 \
-device pci-bridge,bus=dsp5,addr=0.0,id=bridge2,chassis_nr=1 \
-device ${EP_NIC_I211} \
-device ${EP_VID_RX550} \
-device ${EP_NVME_SM980} \
-device ${EP_DPU_I20} \
-device ${EP_USB_UPD7202} \
-device ${EP_GFX_AST}
Notice that the topology behind bridge `0:8.0` simulates my platform:
root@qemu-d11:~# lspci -tv
-[0000:00]-+-00.0 Red Hat, Inc. QEMU PCIe Host bridge
+-01.0 Red Hat, Inc. Virtio network device
+-02.0 Red Hat, Inc. Virtio block device
\-08.0-[01-09]----01.0-[02-09]--+-00.0-[03]----00.0 Red Hat, Inc. QEMU PCI Test Device
+-01.0-[04]----00.0 Red Hat, Inc. QEMU PCI Test Device
+-02.0-[05]----00.0 Red Hat, Inc. QEMU PCI Test Device
+-03.0-[06]----00.0 Red Hat, Inc. QEMU PCI Test Device
+-04.0-[07]----00.0 Red Hat, Inc. QEMU PCI Test Device
\-05.0-[08-09]----00.0-[09]----01.0 Red Hat, Inc. QEMU PCI Test Device
clone (using `v9.0.2`):
```
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout v9.0.2
git config pull.ff only
```
build on linux: on Proxmox (based on debian-11):
install dependencies:
sudo apt-get install git libglib2.0-dev libfdt-dev libpixman-1-dev zlib1g-dev ninja-build
sudo apt-get install git-email
sudo apt-get install libaio-dev libbluetooth-dev libcapstone-dev libbrlapi-dev libbz2-dev
sudo apt-get install libcap-ng-dev libcurl4-gnutls-dev libgtk-3-dev
sudo apt-get install libibverbs-dev libjpeg8-dev libncurses5-dev libnuma-dev
sudo apt-get install librbd-dev librdmacm-dev
sudo apt-get install libsasl2-dev libsdl2-dev libseccomp-dev libsnappy-dev libssh-dev
sudo apt-get install libvde-dev libvdeplug-dev libvte-2.91-dev libxen-dev liblzo2-dev
sudo apt-get install valgrind xfslibs-dev
sudo apt-get install libnfs-dev libiscsi-dev
make:
mkdir build
cd build
../configure
make -j32
sudo make install
Verify that it's installed at `/usr/local/bin`:
bruin@x99:~$ /usr/local/bin/qemu-system-x86_64 --version
QEMU emulator version 9.0.2 (v9.0.2)
Copyright (c) 2003-2024 Fabrice Bellard and the QEMU Project developers
Two patches are used:
pci-bardev
The customized EP device `pci-bardev` is based on the existing `pci-testdev` device. The patch file explains the purpose and usage of the device.
The virt board automatically generates a device tree blob (“dtb”) which it passes to the guest. This provides information about the addresses, interrupt lines and other configuration of the various devices in the system.
As the dtb for the `virt` platform is generated dynamically, the way to customize it is to change the qemu source for this platform (`./hw/arm/virt.c`).
After the following change (change size from 512G to 63G):
(base) bruin@cl210x ~/tank/work/qemu ((HEAD detached from v9.0.2)) $ git diff hw/arm/virt.c
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index a9a913aead..75ea50e185 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -208,7 +208,8 @@ static MemMapEntry extended_memmap[] = {
[VIRT_HIGH_GIC_REDIST2] = { 0x0, 64 * MiB },
[VIRT_HIGH_PCIE_ECAM] = { 0x0, 256 * MiB },
/* Second PCIe window */
- [VIRT_HIGH_PCIE_MMIO] = { 0x0, 512 * GiB },
+ //[VIRT_HIGH_PCIE_MMIO] = { 0x0, 512 * GiB },
+ [VIRT_HIGH_PCIE_MMIO] = { 0x0, 63 * GiB },
};
static const int a15irqmap[] = {
@@ -1532,7 +1533,7 @@ static void create_pcie(VirtMachineState *vms)
2, base_pio, 2, size_pio,
1, FDT_PCI_RANGE_MMIO, 2, base_mmio,
2, base_mmio, 2, size_mmio,
- 1, FDT_PCI_RANGE_MMIO_64BIT,
+ 1, FDT_PCI_RANGE_MMIO_64BIT | FDT_PCI_RANGE_PREFETCHABLE,
2, base_mmio_high,
2, base_mmio_high, 2, size_mmio_high);
} else {
The range looks like this, which is pretty much the same as my platform:
4040000000-4fffffffff : pcie@10000000
4040000000-4040003fff : 0000:00:01.0
4040000000-4040003fff : virtio-pci-modern
4040004000-4040007fff : 0000:00:02.0
4040004000-4040007fff : virtio-pci-modern
u-boot (`v2024.07`)
To switch from ACPI (UEFI) to DTS, we need to use u-boot as the BIOS.
refs:
git checkout v2024.07
sudo apt install gcc-aarch64-linux-gnu binutils-aarch64-linux-gnu
export CROSS_COMPILE=aarch64-linux-gnu-
make qemu_arm64_defconfig
make all
It generates:
Btw, to disable PCI in u-boot (`v2024.07`):
--- a/configs/qemu_arm64_defconfig
+++ b/configs/qemu_arm64_defconfig
@@ -1,3 +1,4 @@
+CONFIG_VIDEO=n
CONFIG_ARM=y
CONFIG_POSITION_INDEPENDENT=y
CONFIG_ARCH_QEMU=y
@@ -13,7 +14,7 @@ CONFIG_DEBUG_UART_CLOCK=0
CONFIG_ARMV8_CRYPTO=y
CONFIG_SYS_LOAD_ADDR=0x40200000
CONFIG_ENV_ADDR=0x4000000
-CONFIG_PCI=y
+#CONFIG_PCI=y
CONFIG_DEBUG_UART=y
CONFIG_AHCI=y
CONFIG_FIT=y
@@ -25,19 +26,19 @@ CONFIG_LEGACY_IMAGE_FORMAT=y
CONFIG_USE_PREBOOT=y
# CONFIG_DISPLAY_CPUINFO is not set
# CONFIG_DISPLAY_BOARDINFO is not set
-CONFIG_PCI_INIT_R=y
+#CONFIG_PCI_INIT_R=y
CONFIG_CMD_SMBIOS=y
CONFIG_CMD_BOOTZ=y
CONFIG_CMD_BOOTEFI_SELFTEST=y
CONFIG_CMD_NVEDIT_EFI=y
CONFIG_CMD_DFU=y
CONFIG_CMD_MTD=y
-CONFIG_CMD_PCI=y
+#CONFIG_CMD_PCI=y
CONFIG_CMD_TPM=y
CONFIG_CMD_MTDPARTS=y
CONFIG_ENV_IS_IN_FLASH=y
CONFIG_SCSI_AHCI=y
-CONFIG_AHCI_PCI=y
+#CONFIG_AHCI_PCI=y
CONFIG_DFU_TFTP=y
CONFIG_DFU_MTD=y
CONFIG_DFU_RAM=y
@@ -55,8 +56,8 @@ CONFIG_SYS_MAX_FLASH_SECT=256
CONFIG_SYS_MAX_FLASH_BANKS=2
CONFIG_SYS_MAX_FLASH_BANKS_DETECT=y
CONFIG_E1000=y
-CONFIG_NVME_PCI=y
-CONFIG_PCIE_ECAM_GENERIC=y
+#CONFIG_NVME_PCI=y
+#CONFIG_PCIE_ECAM_GENERIC=y
CONFIG_SCSI=y
CONFIG_DEBUG_UART_PL011=y
CONFIG_DEBUG_UART_SHIFT=2
@@ -65,6 +66,6 @@ CONFIG_SYSRESET_CMD_POWEROFF=y
CONFIG_SYSRESET_PSCI=y
CONFIG_TPM2_MMIO=y
CONFIG_USB_EHCI_HCD=y
-CONFIG_USB_EHCI_PCI=y
+#CONFIG_USB_EHCI_PCI=y
CONFIG_SEMIHOSTING=y
CONFIG_TPM=y
git checkout v5.4.93
git clean -fxd
export CROSS_COMPILE=aarch64-linux-gnu-
export ARCH=arm64
make mrproper
make defconfig
make -j12 Image
`./arch/arm64/boot/Image` is the kernel image to be supplied to qemu (note that `Image` is the uncompressed form; the gzip-compressed one would be `Image.gz`).
The way to populate the rootfs is to install a Linux distro. I first tried Ubuntu, but its installer (`ubiquity`) requires a GUI front-end. The solution is to switch to the Debian CD-ROM image (`debian-11.6.0-arm64-DVD-1.iso`), which supports command-line installation. However, the Debian installer may report that it "cannot find installation media"; the solution is to mount the cdrom manually (first get into a shell, mount the cdrom, and then exit back to the installer), according to this link. The following is the screen log:
```
BusyBox v1.30.1 (Debian 1:1.30.1-6+b3) built-in shell (ash)
Enter 'help' for a list of built-in commands.
~ # cat /proc/partitions
major minor #blocks name
254 0 16777216 vda
254 16 3899692 vdb
254 17 3891776 vdb1
254 18 7616 vdb2
~ # blkid
/dev/vdb1: BLOCK_SIZE="2048" UUID="2022-12-17-12-03-02-00" LABEL="Debian 11.6.0 arm64 1" TYPE="iso9660" PTTYPE="dos"
/dev/vdb2: SEC_TYPE="msdos" UUID="D6DF-2909" BLOCK_SIZE="512" TYPE="vfat"
~ # mount -t iso9660 /dev/vdb /cdrom
~ # ls /cdrom
EFI boot.catalog install
README.html css install.a64
README.mirrors.html debian md5sum.txt
README.mirrors.txt dists pics
README.txt doc pool
boot firmware
~ #
```
Create a disk for the VM: ./qemu-img create -f qcow2 <path-to-file.qcow2> 16G
```
bruin@x99:/tank/work/tmp$ /usr/local/bin/qemu-img create -f qcow2 /tank/work/tmp/pcie-test-arm64.qcow2 16G
Formatting '/tank/work/tmp/pcie-test-arm64.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=17179869184 lazy_refcounts=off refcount_bits=16
```
Prepare the UEFI BIOS (which comes with qemu) and a flash device for storing variables:
```
bruin@x99:/tank/work/tmp$ truncate -s 64m efi.img
bruin@x99:/tank/work/tmp$ dd if=/usr/local/share/qemu/edk2-aarch64-code.fd of=efi.img conv=notrunc
131072+0 records in
131072+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 0.233858 s, 287 MB/s
bruin@x99:/tank/work/tmp$ rm varstore.img; truncate -s 64m varstore.img
```
Download an ISO (`debian-11`):
```
(base) bruin@cl210x ~/work/distro/arm64 $ ls -la debian-11.6.0-arm64-DVD-1.iso
-rw-r--r-- 1 bruin bruin 3993284608 Dec 17 2022 debian-11.6.0-arm64-DVD-1.iso
```
Install debian 11, the purpose is to populate the rootfs:
```
bruin@x99:/tank/work/tmp$ cat qemu-arm64.sh
#!/bin/bash
# run as root
QEMU="/usr/local/bin/qemu-system-aarch64"
MACHINE="virt-9.0"
EFI="efi.img"
VARSTORE="varstore.img"
#ISO="/tank/work/distro/arm64/ubuntu-22.04.1-desktop-arm64.iso"
ISO="/tank/work/distro/arm64/debian-11.6.0-arm64-DVD-1.iso"
HDD="/tank/work/tmp/pcie-test-arm64.qcow2"
MEM="8G"
${QEMU} \
-machine ${MACHINE} \
-cpu max \
-nographic \
-m ${MEM} \
-drive if=pflash,format=raw,file=${EFI},readonly=on \
-drive if=pflash,format=raw,file=${VARSTORE} \
-drive if=virtio,file=${HDD},cache=none \
-cdrom ${ISO}
```
If it reports that GRUB failed to install, select "Continue without bootloader".
Note the rootfs partition (using `blkid`), which is `/dev/vda2` in my case.
On the host (pve), I already have a bridge set up (`lanbr`), and there is a DHCP server running on the bridge.
On the host, do:
sudo ip tuntap add mode tap qemu-tap0
sudo brctl addif lanbr qemu-tap0
On the qemu command line, add:
-netdev tap,id=mynet,ifname=qemu-tap0,script=no,downscript=no \
-device virtio-net-pci,netdev=mynet \
In the guest's `/etc/network/interfaces`, add:
auto enp0s1
iface enp0s1 inet dhcp
Then reboot the qemu guest; the NIC is working:
bruin@d11:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: enp0s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
inet 192.168.99.107/24 brd 192.168.99.255 scope global dynamic enp0s1
valid_lft 86389sec preferred_lft 86389sec
bruin@d11:~$ ping www.bing.com
ping: socket: Address family not supported by protocol
PING china.bing123.com (202.89.233.100) 56(84) bytes of data.
64 bytes from 202.89.233.100 (202.89.233.100): icmp_seq=1 ttl=116 time=28.7 ms
64 bytes from 202.89.233.100 (202.89.233.100): icmp_seq=2 ttl=116 time=14.9 ms
^C