
PCIe BAR alignment issue under Linux


I am using kernel 5.4.93 (with an Ubuntu 20.04 rootfs) on an ARMv8-based embedded system. The "BIOS" is U-Boot, which does not include PCIe support, so PCIe bus enumeration is done solely by the kernel.

The issue concerns prefetchable BAR allocation during PCIe bus enumeration; I believe it is related to alignment.

The following is the output of lspci -tv, for bus topology:

-[0000:00]---00.0-[01-09]--+-00.0-[02-09]--+-00.0-[03]----00.0  Intel Corporation I211 Gigabit Network Connection
                           |               +-01.0-[04]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X]
                           |               |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]
                           |               +-02.0-[05]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
                           |               +-03.0-[06]----00.0  Shanghai Enflame Technology Co. Ltd I20 [CloudBlazer]
                           |               +-04.0-[07]----00.0  Renesas Technology Corp. uPD720201 USB 3.0 Host Controller
                           |               \-05.0-[08-09]----00.0-[09]----00.0  ASPEED Technology, Inc. ASPEED Graphics Family
                           \-00.1  PMC-Sierra Inc. Device 4052

The symptom is that the system fails to allocate the 64-bit prefetchable BARs of the AMDGPU RX550 graphics function (unassigned at BDF 4:0.0), because the bridge above it (2:1.0) fails to allocate a 6G prefetchable window (shown as disabled) for the RX550 card.

root@u2004:/home/cmic# lspci -s 4:0.0 -v
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7) (prog-if 00 [VGA controller])
    Subsystem: ASUSTeK Computer Inc. Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X]
    Physical Slot: 2
    Flags: fast devsel, IRQ 255
    Memory at <unassigned> (64-bit, prefetchable) [disabled]
    Memory at <unassigned> (64-bit, prefetchable) [disabled]
    I/O ports at <unassigned> [disabled]
    Memory at 33100000 (32-bit, non-prefetchable) [disabled] [size=256K]
    Expansion ROM at 33140000 [virtual] [disabled] [size=128K]

root@u2004:/home/cmic# lspci -s 2:1.0 
02:01.0 PCI bridge: PMC-Sierra Inc. Device 4052
root@u2004:/home/cmic# lspci -s 2:1.0 -v
02:01.0 PCI bridge: PMC-Sierra Inc. Device 4052 (prog-if 00 [Normal decode])
    Flags: bus master, fast devsel, latency 0, IRQ 14
    Bus: primary=02, secondary=04, subordinate=04, sec-latency=0
    I/O behind bridge: [disabled]
    Memory behind bridge: 33100000-332fffff [size=2M]
    Prefetchable memory behind bridge: [disabled]

The following picture shows why bridge 2:1.0 fails to find a contiguous 6G space: there are two free 4G regions, but they are not adjacent.

enter image description here
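
The failure mode can be illustrated with a small first-fit sketch. It is not kernel code, the gap addresses are hypothetical (the real layout is in the picture above), and the 4G window alignment is an assumption. It only shows that 8G of free space split into two separate 4G gaps cannot hold one 6G window, while a single 8G gap can.

```c
#include <stdint.h>
#include <stdio.h>

#define GiB (1ULL << 30)

/* One free address range: [start, start + size). */
struct gap {
        uint64_t start;
        uint64_t size;
};

/* Round x up to a multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t x, uint64_t align)
{
        return (x + align - 1) & ~(align - 1);
}

/*
 * First-fit search over n gaps. Returns 1 and stores the window start in
 * *out if some gap holds `size` bytes at an `align`-aligned address,
 * otherwise returns 0.
 */
static int find_window(const struct gap *gaps, int n, uint64_t size,
                       uint64_t align, uint64_t *out)
{
        for (int i = 0; i < n; i++) {
                uint64_t start = align_up(gaps[i].start, align);
                uint64_t end = gaps[i].start + gaps[i].size;

                if (start < end && end - start >= size) {
                        *out = start;
                        return 1;
                }
        }
        return 0;
}

#ifdef DEMO_MAIN   /* build with: cc -DDEMO_MAIN demo.c */
int main(void)
{
        /* Hypothetical layout: two free 4G gaps that are not adjacent. */
        struct gap split[] = {
                { 0x1000000000ULL, 4 * GiB },
                { 0x1800000000ULL, 4 * GiB },
        };
        /* Hypothetical layout: one free 8G gap. */
        struct gap joined[] = { { 0x1800000000ULL, 8 * GiB } };
        uint64_t at = 0;

        printf("6G in split gaps: %s\n",
               find_window(split, 2, 6 * GiB, 4 * GiB, &at) ? "fits" : "does not fit");
        printf("6G in joined gap: %s\n",
               find_window(joined, 1, 6 * GiB, 4 * GiB, &at) ? "fits" : "does not fit");
        return 0;
}
#endif
```

The total free space is the same in both layouts; only contiguity differs, which is what the workaround below changes.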

As a workaround, I changed the PCIe ranges in the DTS so that the 32G space starts at 12_0000_0000. The 24G window is then allocated from the beginning, leaving a contiguous 4G+4G=8G at the tail for the RX550, as shown by the following /proc/iomem fragment:

1140000000-1fffffffff : pcie@d0000
  1200000000-19ffffffff : PCI Bus 0000:01
    1200000000-19ffffffff : PCI Bus 0000:02
      1200000000-17ffffffff : PCI Bus 0000:06           <= i20, 24G
        1200000000-1203ffffff : 0000:06:00.0               |-- 64M (bar 4) 
        1400000000-17ffffffff : 0000:06:00.0               `-- 16G (bar 2)
      1800000000-197fffffff : PCI Bus 0000:04           <= rx550, 6G
        1800000000-18ffffffff : 0000:04:00.0               |-- 4G (bar 0)
        1900000000-19001fffff : 0000:04:00.0               `-- 2M (bar 2)
      1980000000-19801fffff : PCI Bus 0000:03
      1980200000-19803fffff : PCI Bus 0000:05
      1980400000-19805fffff : PCI Bus 0000:07
      1980600000-19807fffff : PCI Bus 0000:08
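
The sizes annotated in the fragment can be checked from the inclusive start-end pairs. A minimal sketch; the helper names are hypothetical, not kernel functions:

```c
#include <stdint.h>
#include <stdio.h>

#define GiB (1ULL << 30)

/* Size in bytes of an inclusive [start, end] range as printed by /proc/iomem. */
static uint64_t range_size(uint64_t start, uint64_t end)
{
        return end - start + 1;
}

/* Nonzero if addr is a multiple of align (align must be a power of two). */
static int is_aligned(uint64_t addr, uint64_t align)
{
        return (addr & (align - 1)) == 0;
}

#ifdef DEMO_MAIN   /* build with: cc -DDEMO_MAIN demo.c */
int main(void)
{
        /* Ranges copied from the /proc/iomem fragment. */
        printf("bus 01: %llu GiB\n",
               (unsigned long long)(range_size(0x1200000000ULL, 0x19ffffffffULL) / GiB));
        printf("bus 06: %llu GiB\n",
               (unsigned long long)(range_size(0x1200000000ULL, 0x17ffffffffULL) / GiB));
        printf("bus 04: %llu GiB\n",
               (unsigned long long)(range_size(0x1800000000ULL, 0x197fffffffULL) / GiB));
        printf("bus 04 start is 4G-aligned: %d\n",
               is_aligned(0x1800000000ULL, 4 * GiB));
        return 0;
}
#endif
```

For the ranges above this gives 32 GiB for bus 01, 24 GiB for bus 06 and 6 GiB for bus 04, and the bus 04 window starts on a 4 GiB boundary.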

My question: what is the root cause of the problem (without the workaround)? Is this a kernel bug, or a configuration bug somewhere?

PS: I added some printk traces to v5.4.93 and compared the result with a virtual host under qemu-system-aarch64 (which does not have this issue). So far the difference lies in the first argument to calculate_mem_align(aligns[], max_order=13): on the real system aligns[0]=0x400000 and the rest are zero, while on the virt host all aligns[] are zero. This causes the function to return different values (4GiB on the real system vs 8GiB on the virt host). I can reproduce the difference in a standalone C program:

// https://elixir.bootlin.com/linux/v5.4.93/source/drivers/pci/setup-bus.c
#include <stdio.h>
#include <string.h>

typedef unsigned long long resource_size_t;

#define ALIGN(x, a)             __ALIGN_KERNEL((x), (a))  // align `x` to alignment `a`
#define __ALIGN_KERNEL(x, a)            __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
#define __ALIGN_KERNEL_MASK(x, mask)    (((x) + (mask)) & ~(mask))

resource_size_t calculate_mem_align(resource_size_t *aligns,
                                                  int max_order)
{
        resource_size_t align_sum = 0;
        resource_size_t min_align = 0;
        int order;

        printf("max_order=%d, aligns[]=\n", max_order);

        for (order = 0; order <= max_order; order++) {
                resource_size_t cur_align = 1;
                cur_align <<= (order + 20);
                if (!align_sum) {
                        min_align = cur_align;
                }
                else {
                        if (ALIGN(align_sum + min_align, min_align) < cur_align) {
                                min_align = cur_align >> 1;
                        } else {
                                printf("do nothing\n");
                        }
                }
                align_sum += aligns[order];
                printf("    order=%2d: aligns[%2d]=%09llx, cur_align=%09llx, min_align=%09llx, align_sum=%09llx\n", order, order, aligns[order], cur_align, min_align, align_sum);
        }

        printf("return min_align=%#llx\n", min_align);
        return min_align;
}

int main(void)
{
        resource_size_t aligns0[18];
        resource_size_t aligns1[18];
        int max_order = 13;

        memset(aligns0, 0, sizeof(aligns0));    // virt host: all entries zero
        memset(aligns1, 0, sizeof(aligns1));
        aligns1[0] = 0x400000;                  // real system: only aligns[0] is set

        calculate_mem_align(aligns0, max_order);
        calculate_mem_align(aligns1, max_order);
        return 0;
}

$ ./a.out
max_order=13, aligns[]=
    order= 0: aligns[ 0]=000000000, cur_align=000100000, min_align=000100000, align_sum=000000000
    order= 1: aligns[ 1]=000000000, cur_align=000200000, min_align=000200000, align_sum=000000000
    order= 2: aligns[ 2]=000000000, cur_align=000400000, min_align=000400000, align_sum=000000000
    order= 3: aligns[ 3]=000000000, cur_align=000800000, min_align=000800000, align_sum=000000000
    order= 4: aligns[ 4]=000000000, cur_align=001000000, min_align=001000000, align_sum=000000000
    order= 5: aligns[ 5]=000000000, cur_align=002000000, min_align=002000000, align_sum=000000000
    order= 6: aligns[ 6]=000000000, cur_align=004000000, min_align=004000000, align_sum=000000000
    order= 7: aligns[ 7]=000000000, cur_align=008000000, min_align=008000000, align_sum=000000000
    order= 8: aligns[ 8]=000000000, cur_align=010000000, min_align=010000000, align_sum=000000000
    order= 9: aligns[ 9]=000000000, cur_align=020000000, min_align=020000000, align_sum=000000000
    order=10: aligns[10]=000000000, cur_align=040000000, min_align=040000000, align_sum=000000000
    order=11: aligns[11]=000000000, cur_align=080000000, min_align=080000000, align_sum=000000000
    order=12: aligns[12]=000000000, cur_align=100000000, min_align=100000000, align_sum=000000000
    order=13: aligns[13]=000000000, cur_align=200000000, min_align=200000000, align_sum=000000000
return min_align=0x200000000
max_order=13, aligns[]=
    order= 0: aligns[ 0]=000400000, cur_align=000100000, min_align=000100000, align_sum=000400000
do nothing
    order= 1: aligns[ 1]=000000000, cur_align=000200000, min_align=000100000, align_sum=000400000
do nothing
    order= 2: aligns[ 2]=000000000, cur_align=000400000, min_align=000100000, align_sum=000400000
    order= 3: aligns[ 3]=000000000, cur_align=000800000, min_align=000400000, align_sum=000400000
    order= 4: aligns[ 4]=000000000, cur_align=001000000, min_align=000800000, align_sum=000400000
    order= 5: aligns[ 5]=000000000, cur_align=002000000, min_align=001000000, align_sum=000400000
    order= 6: aligns[ 6]=000000000, cur_align=004000000, min_align=002000000, align_sum=000400000
    order= 7: aligns[ 7]=000000000, cur_align=008000000, min_align=004000000, align_sum=000400000
    order= 8: aligns[ 8]=000000000, cur_align=010000000, min_align=008000000, align_sum=000400000
    order= 9: aligns[ 9]=000000000, cur_align=020000000, min_align=010000000, align_sum=000400000
    order=10: aligns[10]=000000000, cur_align=040000000, min_align=020000000, align_sum=000400000
    order=11: aligns[11]=000000000, cur_align=080000000, min_align=040000000, align_sum=000400000
    order=12: aligns[12]=000000000, cur_align=100000000, min_align=080000000, align_sum=000400000
    order=13: aligns[13]=000000000, cur_align=200000000, min_align=100000000, align_sum=000400000
return min_align=0x100000000

But I don't quite understand the logic behind the function; can anyone explain it a bit?
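
One detail that can be read straight off the listing: cur_align is 1 << (order + 20), so each index of aligns[] stands for a power-of-two alignment starting at 1 MiB, and max_order=13 corresponds to 8 GiB. The helpers below are hypothetical (not kernel functions) and only make that mapping explicit; clamping alignments below 1 MiB to order 0 is an assumption of this sketch.

```c
#include <stdint.h>
#include <stdio.h>

/* Alignment in bytes that index `order` of aligns[] stands for. */
static uint64_t order_to_align(int order)
{
        return 1ULL << (order + 20);    /* order 0 -> 1 MiB */
}

/*
 * Inverse mapping for a power-of-two alignment. Alignments below 1 MiB are
 * clamped to order 0 (an assumption of this sketch). The order is capped at
 * 43 so that the shift in order_to_align() stays below 64 bits.
 */
static int align_to_order(uint64_t align)
{
        int order = 0;

        while (order < 43 && order_to_align(order) < align)
                order++;
        return order;
}

#ifdef DEMO_MAIN   /* build with: cc -DDEMO_MAIN demo.c */
int main(void)
{
        printf("order 12 -> %#llx\n", (unsigned long long)order_to_align(12));
        printf("order 13 -> %#llx\n", (unsigned long long)order_to_align(13));
        printf("0x200000000 -> order %d\n", align_to_order(0x200000000ULL));
        return 0;
}
#endif
```

Order 12 is 4 GiB (0x100000000) and order 13 is 8 GiB (0x200000000), the two values returned in the runs above.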


Solution

  • The short answer is that on my platform the PCIe controller driver (QDMA) is not from the upstream kernel but from Xilinx AR76647. Its xilinx_pcie_probe() function calls pci_assign_unassigned_bus_resources() instead of pci_bus_size_bridges() + pci_bus_assign_resources() (which is what pci_host_probe() does).

    One major difference between the two paths is the second argument of the later call to __pci_bus_size_bridges(bus, realloc_head): pci_assign_unassigned_bus_resources() passes a non-null realloc_head, while pci_bus_size_bridges() passes a null one. The subsequent calls to calculate_memsize() and calculate_mem_align() then give different results, which leads to the issue on my platform.

    I am not yet entirely clear on how the internal scanning process works, but it seems that pci_assign_unassigned_bus_resources() assumes the firmware (EFI BIOS or U-Boot) has already enumerated the PCI bus and the kernel only handles what is left. That assumption does not hold on my platform (as explained in the question above), hence the issue. The issue can be reproduced on the qemu virt host by modifying pci_host_probe() to use pci_assign_unassigned_bus_resources(bus). On my platform, making the following change in xilinx_pcie_probe() resolved the issue.

    #if (0)
        pci_assign_unassigned_bus_resources(bus);
    #else
        pci_bus_size_bridges(bus);
        pci_bus_assign_resources(bus);
    #endif
    

    Btw, the following is a brief call tree of the xilinx probe function, to give a feel for the scanning process:

    xilinx_pcie_probe()
    |-- xilinx_pcie_parse_dt()
    |-- xilinx_pcie_init_port()
    |-- xilinx_pcie_init_irq_domain()
    |-- devm_of_pci_get_host_bridge_resources()
    |-- devm_request_pci_bus_resources()
    |-- pci_scan_root_bus_bridge()
    |   |-- pci_register_host_bridge()
    |   |-- pci_bus_insert_busn_res()
    |   |-- pci_scan_child_bus()
    |   |   `== pci_scan_child_bus_extend()                                       <-------+
    |   |       |-- pci_scan_slot(bus, devfn=0~256)                                       |
    |   |       |   |-- pci_scan_single_device()                                          |
    |   |       |       |-- pci_scan_device()                                             |
    |   |       |       `-- pci_device_add()                                              |
    |   |       |           |-- pci_configure_device()                                    |
    |   |       |           |-- device_initialize()                                       |
    |   |       |           |-- pci_fixup_device()                                        |
    |   |       |           |-- pci_reassigndev_resource_alignment()                      |
    |   |       |           |-- ...                                                       |
    |   |       `-- for_each_pci_bridge: pci_scan_bridge_extend()                         |
    |   |                                |-- ...                                          |
    |   |                                |== pci_scan_child_bus_extend()          <== recursive
    |   |                                |-- ...
    |   `-- pci_bus_update_busn_res_end()
    |-- pci_assign_unassigned_bus_resources()
    |   |== for each bridge: __pci_bus_size_bridges(bus, add_list)                <-------+
    |   |                    |-- for each cardbus: pci_bus_size_cardbus()                 |
    |   |                    |== for each bridge: __pci_bus_size_bridges()        <== recursive
    |   |                    |-- pci_bridge_check_ranges()
    |   |                    |-- pbus_size_io()
    |   |                    |-- pbus_size_mem(pref-64)
    |   |                    |   |-- ...
    |   |                    |   |-- calculate_memsize()
    |   |                    |   |-- calculate_mem_align()
    |   |                    |   |-- ...
    |   |                    |-- pbus_size_mem(pref-32)
    |   |                    `-- pbus_size_mem(non-pref)
    |   `== __pci_bus_assign_resources()                                          <-------+
    |       |-- pbus_assign_resources_sorted()                                            |
    |       |   |-- for each dev: __dev_sort_resources()                                  |
    |       |   `-- __assign_resources_sorted()                                           |
    |       |       |-- assign_requested_resources_sorted()                               |
    |       |           |-- pci_assign_resource()                                         |
    |       `-- for each dev:                                                             |
    |           |-- pdev_assign_fixed_resources()                                         |
    |           |== __pci_bus_assign_resources()                                  <== recursive
    |           `-- pci_setup_bridge()
    |-- loop for each child: pcie_bus_configure_settings()
    `-- pci_bus_add_devices()
    

    The following is the call stack up to __pci_bus_size_bridges() for the xilinx driver (before modification) on my platform:

    Call trace:
     525 [   10.325976]  dump_backtrace+0x0/0x18c
     526 [   10.330144]  show_stack+0x28/0x3c
     527 [   10.333917]  dump_stack+0xb4/0x110
     528 [   10.337794]  __pci_bus_size_bridges+0xdc/0xa0c
     529 [   10.342858]  pci_assign_unassigned_bus_resources+0x94/0x100    <== notice this call
     530 [   10.349205]  xilinx_pcie_probe+0x96c/0x9d0
     531 [   10.353871]  platform_drv_probe+0x5c/0xb0
     532 [   10.358436]  really_probe+0xf0/0x4a0
     533 [   10.362505]  driver_probe_device+0xec/0x134
     534 [   10.367268]  device_driver_attach+0xc0/0xcc
     535 [   10.372031]  __driver_attach+0xac/0x164
     536 [   10.376397]  bus_for_each_dev+0x80/0xd0
     537 [   10.380763]  driver_attach+0x34/0x40
     538 [   10.384834]  bus_add_driver+0x14c/0x240
     539 [   10.389204]  driver_register+0x7c/0x124
     540 [   10.393567]  __platform_driver_register+0x58/0x6c
     541 [   10.398928]  pcie_xdma_pl_init+0x24/0x2c
     542 [   10.403394]  do_one_initcall+0x50/0x250
     543 [   10.407760]  kernel_init_freeable+0x1f4/0x2bc
     544 [   10.412721]  kernel_init+0x1c/0x114
     545 [   10.416693]  ret_from_fork+0x10/0x18
    

    and the following is the corresponding stack on qemu virt host:

     323 [    0.956000] Call trace:
     324 [    0.956141]  dump_backtrace+0x0/0x134
     325 [    0.956321]  show_stack+0x14/0x20
     326 [    0.956487]  dump_stack+0xb4/0x110
     327 [    0.956655]  __pci_bus_size_bridges+0xe8/0xa50
     328 [    0.956859]  pci_bus_size_bridges+0x14/0x20                <== a different call
     329 [    0.957057]  pci_host_probe+0x60/0xbc
     330 [    0.957234]  pci_host_common_probe+0xdc/0x1dc
     331 [    0.957442]  gen_pci_probe+0x30/0x40
     332 [    0.957616]  platform_drv_probe+0x50/0xa0
     333 [    0.957806]  really_probe+0xd8/0x420
     334 [    0.957978]  driver_probe_device+0x54/0xe4
     335 [    0.958178]  device_driver_attach+0xb4/0xc0
     336 [    0.958376]  __driver_attach+0x80/0x114
     337 [    0.958586]  bus_for_each_dev+0x6c/0xc0
     338 [    0.958791]  driver_attach+0x20/0x30
     339 [    0.958982]  bus_add_driver+0xfc/0x1e0
     340 [    0.959180]  driver_register+0x74/0x120
     341 [    0.959383]  __platform_driver_register+0x44/0x50
     342 [    0.959629]  gen_pci_driver_init+0x18/0x20
     343 [    0.959843]  do_one_initcall+0x4c/0x1b0
     344 [    0.960045]  kernel_init_freeable+0x194/0x23c
     345 [    0.960270]  kernel_init+0x10/0x100
     346 [    0.960457]  ret_from_fork+0x10/0x24
    

    PS: tracing the issue took me some time. Along the way I found that comparing a qemu virt host with the real platform helps a lot: it is simply much quicker to modify the kernel (mostly adding printk()) and re-run under qemu. So below I give a brief description of my qemu-related setup for this debugging task.

    Overview

    The original idea was to use qemu to reproduce the issue seen on my arm64 platform. As the PCIe core code is platform-independent, my first try was qemu-system-x86_64. The problem with the qemu x86_64 platform is that it uses a UEFI BIOS by default, so the PCIe scanning process is not the same as on my platform; furthermore, the device tree is replaced by ACPI tables, and it is hard to modify the PCIe bus topology of the virtual guest. So I later switched to qemu-system-aarch64 with U-Boot as the BIOS.

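    The following components are needed for setting up the virt guest; each is covered in its own section below: QEMU (with two patches), U-Boot, the kernel, a rootfs, and tap networking.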

    The following picture summarizes the components involved so far:

    enter image description here

    After the necessary setup, the following command establishes a PCIe topology on the virt guest that mimics my platform.

    #!/bin/bash
    
    # run as root
    
    QEMU="/usr/local/bin/qemu-system-aarch64"
    MACHINE="virt-9.0"
    BIOS="/tank/work/u-boot/u-boot.bin"
    KERNEL="/tank/work/tmp/Image-aarch64-v5.4.93"
    APPEND="root=/dev/vda2"
    HDD="/tank/work/tmp/pcie-test-arm64.qcow2"
    MEM="8G"
    EP_NIC_I211="pci-bardev,bus=dsp0,addr=0.0,bar0np32=128K,bar3np32=16K"
    EP_VID_RX550="pci-bardev,bus=dsp1,addr=0.0,bar4p64=4G,bar2p64=2M,bar6np32=256K"
    EP_NVME_SM980="pci-bardev,bus=dsp2,addr=0.0,bar0np64=16K"
    EP_DPU_I20="pci-bardev,bus=dsp3,addr=0.0,bar0np32=16K,bar1np32=16M,bar2p64=64M,bar4p64=16G"
    EP_USB_UPD7202="pci-bardev,bus=dsp4,addr=0.0,bar0np64=8K"
    EP_GFX_AST="pci-bardev,bus=bridge2,addr=1.0,bar0np32=16M,bar1np32=256K"
    
    ${QEMU} \
    -machine ${MACHINE} \
    -cpu cortex-a53 \
    -smp 8 \
    -nographic \
    -no-reboot \
    -m ${MEM} \
    -bios ${BIOS} \
    -append ${APPEND} \
    -kernel ${KERNEL} \
    -drive if=virtio,file=${HDD},cache=none \
    -netdev tap,id=mynet,ifname=qemu-tap0,script=no,downscript=no \
    -device virtio-net-pci,netdev=mynet \
    -device pci-bridge,bus=pcie.0,addr=8.0,id=bridge1,chassis_nr=1 \
    -device x3130-upstream,bus=bridge1,addr=1.0,id=usp0 \
    -device xio3130-downstream,bus=usp0,addr=0.0,id=dsp0,chassis=1,slot=0 \
    -device xio3130-downstream,bus=usp0,addr=1.0,id=dsp1,chassis=1,slot=1 \
    -device xio3130-downstream,bus=usp0,addr=2.0,id=dsp2,chassis=1,slot=2 \
    -device xio3130-downstream,bus=usp0,addr=3.0,id=dsp3,chassis=1,slot=3 \
    -device xio3130-downstream,bus=usp0,addr=4.0,id=dsp4,chassis=1,slot=4 \
    -device xio3130-downstream,bus=usp0,addr=5.0,id=dsp5,chassis=1,slot=5 \
    -device pci-bridge,bus=dsp5,addr=0.0,id=bridge2,chassis_nr=1 \
    -device ${EP_NIC_I211} \
    -device ${EP_VID_RX550} \
    -device ${EP_NVME_SM980} \
    -device ${EP_DPU_I20} \
    -device ${EP_USB_UPD7202} \
    -device ${EP_GFX_AST}
    

    Notice that the topology after bridge 0:8.0 is for simulating my platform:

    root@qemu-d11:~# lspci -tv
    -[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
               +-01.0  Red Hat, Inc. Virtio network device
               +-02.0  Red Hat, Inc. Virtio block device
               \-08.0-[01-09]----01.0-[02-09]--+-00.0-[03]----00.0  Red Hat, Inc. QEMU PCI Test Device
                                               +-01.0-[04]----00.0  Red Hat, Inc. QEMU PCI Test Device
                                               +-02.0-[05]----00.0  Red Hat, Inc. QEMU PCI Test Device
                                               +-03.0-[06]----00.0  Red Hat, Inc. QEMU PCI Test Device
                                               +-04.0-[07]----00.0  Red Hat, Inc. QEMU PCI Test Device
                                               \-05.0-[08-09]----00.0-[09]----01.0  Red Hat, Inc. QEMU PCI Test Device
    

    QEMU

    Build

    Patches

    Two patches are used:

    EP pci-bardev

    The customized EP device pci-bardev is based on the existing pci-testdev device. The patch file explains the purpose and usage of the device.

    DTS for virt guest

    The virt board automatically generates a device tree blob (“dtb”) which it passes to the guest. This provides information about the addresses, interrupt lines and other configuration of the various devices in the system.

    --- https://www.qemu.org/docs/master/system/arm/virt.html

    As the dtb for virt platform is generated dynamically, the way to customize the dtb is to change the qemu source (./hw/arm/virt.c) for this platform.

    After the following change (shrinking the high PCIe MMIO window from 512G to 63G and marking it prefetchable):

    (base) bruin@cl210x ~/tank/work/qemu ((HEAD detached from v9.0.2)) $ git diff hw/arm/virt.c
    diff --git a/hw/arm/virt.c b/hw/arm/virt.c
    index a9a913aead..75ea50e185 100644
    --- a/hw/arm/virt.c
    +++ b/hw/arm/virt.c
    @@ -208,7 +208,8 @@ static MemMapEntry extended_memmap[] = {
         [VIRT_HIGH_GIC_REDIST2] =   { 0x0, 64 * MiB },
         [VIRT_HIGH_PCIE_ECAM] =     { 0x0, 256 * MiB },
         /* Second PCIe window */
    -    [VIRT_HIGH_PCIE_MMIO] =     { 0x0, 512 * GiB },
    +    //[VIRT_HIGH_PCIE_MMIO] =     { 0x0, 512 * GiB },
    +    [VIRT_HIGH_PCIE_MMIO] =     { 0x0, 63 * GiB },
     };
    
     static const int a15irqmap[] = {
    @@ -1532,7 +1533,7 @@ static void create_pcie(VirtMachineState *vms)
                                          2, base_pio, 2, size_pio,
                                          1, FDT_PCI_RANGE_MMIO, 2, base_mmio,
                                          2, base_mmio, 2, size_mmio,
    -                                     1, FDT_PCI_RANGE_MMIO_64BIT,
    +                                     1, FDT_PCI_RANGE_MMIO_64BIT | FDT_PCI_RANGE_PREFETCHABLE,
                                          2, base_mmio_high,
                                          2, base_mmio_high, 2, size_mmio_high);
         } else {
    

    The range looks like this, which is pretty much the same as my platform:

    4040000000-4fffffffff : pcie@10000000
      4040000000-4040003fff : 0000:00:01.0
        4040000000-4040003fff : virtio-pci-modern
      4040004000-4040007fff : 0000:00:02.0
        4040004000-4040007fff : virtio-pci-modern
    

    U-boot (v2024.07)

    To switch from ACPI (UEFI) to DTS, we need to use u-boot as BIOS.

    refs:

    git checkout v2024.07
    sudo apt install gcc-aarch64-linux-gnu binutils-aarch64-linux-gnu
    export CROSS_COMPILE=aarch64-linux-gnu-
    make qemu_arm64_defconfig
    make all
    

    It generates:

    Btw, to disable PCI support in U-Boot (v2024.07):

    --- a/configs/qemu_arm64_defconfig
    +++ b/configs/qemu_arm64_defconfig
    @@ -1,3 +1,4 @@
    +CONFIG_VIDEO=n
     CONFIG_ARM=y
     CONFIG_POSITION_INDEPENDENT=y
     CONFIG_ARCH_QEMU=y
    @@ -13,7 +14,7 @@ CONFIG_DEBUG_UART_CLOCK=0
     CONFIG_ARMV8_CRYPTO=y
     CONFIG_SYS_LOAD_ADDR=0x40200000
     CONFIG_ENV_ADDR=0x4000000
    -CONFIG_PCI=y
    +#CONFIG_PCI=y
     CONFIG_DEBUG_UART=y
     CONFIG_AHCI=y
     CONFIG_FIT=y
    @@ -25,19 +26,19 @@ CONFIG_LEGACY_IMAGE_FORMAT=y
     CONFIG_USE_PREBOOT=y
     # CONFIG_DISPLAY_CPUINFO is not set
     # CONFIG_DISPLAY_BOARDINFO is not set
    -CONFIG_PCI_INIT_R=y
    +#CONFIG_PCI_INIT_R=y
     CONFIG_CMD_SMBIOS=y
     CONFIG_CMD_BOOTZ=y
     CONFIG_CMD_BOOTEFI_SELFTEST=y
     CONFIG_CMD_NVEDIT_EFI=y
     CONFIG_CMD_DFU=y
     CONFIG_CMD_MTD=y
    -CONFIG_CMD_PCI=y
    +#CONFIG_CMD_PCI=y
     CONFIG_CMD_TPM=y
     CONFIG_CMD_MTDPARTS=y
     CONFIG_ENV_IS_IN_FLASH=y
     CONFIG_SCSI_AHCI=y
    -CONFIG_AHCI_PCI=y
    +#CONFIG_AHCI_PCI=y
     CONFIG_DFU_TFTP=y
     CONFIG_DFU_MTD=y
     CONFIG_DFU_RAM=y
    @@ -55,8 +56,8 @@ CONFIG_SYS_MAX_FLASH_SECT=256
     CONFIG_SYS_MAX_FLASH_BANKS=2
     CONFIG_SYS_MAX_FLASH_BANKS_DETECT=y
     CONFIG_E1000=y
    -CONFIG_NVME_PCI=y
    -CONFIG_PCIE_ECAM_GENERIC=y
    +#CONFIG_NVME_PCI=y
    +#CONFIG_PCIE_ECAM_GENERIC=y
     CONFIG_SCSI=y
     CONFIG_DEBUG_UART_PL011=y
     CONFIG_DEBUG_UART_SHIFT=2
    @@ -65,6 +66,6 @@ CONFIG_SYSRESET_CMD_POWEROFF=y
     CONFIG_SYSRESET_PSCI=y
     CONFIG_TPM2_MMIO=y
     CONFIG_USB_EHCI_HCD=y
    -CONFIG_USB_EHCI_PCI=y
    +#CONFIG_USB_EHCI_PCI=y
     CONFIG_SEMIHOSTING=y
     CONFIG_TPM=y
    

    Kernel (v5.4.93)

    git checkout v5.4.93
    git clean -fxd
    export CROSS_COMPILE=aarch64-linux-gnu-
    export ARCH=arm64
    make mrproper
    make defconfig
    make -j12 Image
    

    ./arch/arm64/boot/Image is the uncompressed kernel image to be supplied to qemu (the compressed one would be Image.gz).

    Rootfs

    The rootfs is populated by installing a Linux distro. I first tried Ubuntu (but its installer requires a GUI), then Debian.

    If the installer reports that GRUB failed to install, select "Continue without bootloader".

    Note the rootfs partition (using blkid), which is /dev/vda2 in my case.

    Tap networking

    On the host (pve), I already have a bridge set up (lanbr), with a DHCP server running on it.

    On the host, do:

    sudo ip tuntap add mode tap qemu-tap0
    sudo brctl addif lanbr qemu-tap0
    

    On the qemu command line, add:

    -netdev tap,id=mynet,ifname=qemu-tap0,script=no,downscript=no \
    -device virtio-net-pci,netdev=mynet \
    

    In guest /etc/network/interfaces, add:

    auto enp0s1
    iface enp0s1 inet dhcp
    

    After rebooting the qemu guest, the NIC works:

    bruin@d11:~$ ip a
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
    2: enp0s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
        link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
        inet 192.168.99.107/24 brd 192.168.99.255 scope global dynamic enp0s1
           valid_lft 86389sec preferred_lft 86389sec
    bruin@d11:~$ ping www.bing.com
    ping: socket: Address family not supported by protocol
    PING china.bing123.com (202.89.233.100) 56(84) bytes of data.
    64 bytes from 202.89.233.100 (202.89.233.100): icmp_seq=1 ttl=116 time=28.7 ms
    64 bytes from 202.89.233.100 (202.89.233.100): icmp_seq=2 ttl=116 time=14.9 ms
    ^C