Where are NVMe commands located inside the PCIe BAR?

According to the NVMe specification, the BAR has tail and head fields for each queue. For example:

Submission Queue y Tail Doorbell (SQyTDBL):
- Start: 1000h + (2y * (4 << CAP.DSTRD))
- End: 1003h + (2y * (4 << CAP.DSTRD))
Submission Queue y Head Doorbell (SQyHDBL):
- Start: 1000h + ((2y + 1) * (4 << CAP.DSTRD))
- End: 1003h + ((2y + 1) * (4 << CAP.DSTRD))

Are there the queue itself or just mere pointers? Is this correct? If it is the queue, I would assume the DSTRD indicates the maximum length of all queues.

Moreover, the specification talks about two optional regions: Host Memory Buffer (HMB) and Controller Memory Buffer (CMB).

HMB: a region within the host's DRAM (PCIe root)
CMB: a region within the NVMe controller's DRAM (inside the SSD)

If both are optional, where is it located then? Since endpoint PCIe only works with BARs and PCI Headers, I don't see any other place they might be located, other than a BAR.

Solution

Sorry but I am doing this from memory but I have implemented an FPGA NVMe host so hopefully my memory will be enough to answer your questions and more, if I get something wrong though at least you know why. I'll be providing reference sections from the specification which you can find here. https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf Also as a note before I really answer your question I want to clarify some confusion, understanding the spec takes some time I honestly recommend reading it bottom to top the last few sections help give context for the first few as strange as that sounds.

These are the submission and completion queues, specifically the subqueue tail and completion queue head respectively (SECTION 3.1). More on this later I just wanted to correct the missconception that you access the submission queue head as the host, you do not only the controller (traditionally the drive) does. A simple reminder submission is you asking the drive to do something, completion is the drive telling you how it went. Read SECTION 7.2 for more info.
Before you can send anything to these queues you must first setup said queues. Baseline in the system these queues do not exist, you must use the admin queue to set them up.

28h 2Fh ASQ Admin Submission Queue Base Address

30h 37h ACQ Admin Completion Queue Base Address

Your statement about DSTRD is a huge miss understanding. This field is from the capabilities register (0x0) Figure 3.1.1. This field is the controller (drive) telling you the "doorbell stride" which says how many bytes are between each doorbell, I've never seen a drive report anything but 0 for this value since well, why would you want to leave dead space between doorbell registers.
Please be careful with the size of your writes, in my experience most NVMe drives require you to send writes of at least 2dwords (8 bytes) even if you only intend to send 1dword of data, just a note.
Onto actually helping you use this thing as a host, please reference SECTION 7.6.1 to find the initialization sequence. Notice how you must setup multiple registers, read certain parameters and other such things.
Assuming you or someone else has done initalization let me now answer the core of your question, how to use these queues. The thing is, this answer spans MANY sections of the spec and is the core of it. So with that I am going to break it down as best I can for a simple write command. Please note you CANNOT write, until you have first created the queues using the admin queues which leverage different opcodes from a different section of the spec, sorry I cannot write all of this out.

STEPS TO WRITING DATA TO AN NVMe DRIVE.

In the creation of the submission queue you will specify the size of this specific queue. This is the number of commands that can be placed in the queue at one time for processing. Along with this you will specify the queue base address. So for this example let's assume you set the base address to 0x1000_0000 and size 16 (0x10). Figure 105 let's us know that every submission queue entry has a size of 64bytes (0x40) so queue entry 0 is at 0x1000_0000 entry 1 is at 0x1000_0040 2 0x1000_0080 and so on for our 16 entries then it loops back.
You will first store data for writing, let's say you were given 512bytes (0x200) of data to write. So for simplicity you place that data at 0x2000_0000 - 0x2000_0200.
You create the submission queue command. This is not a simple process. I'm not going to document all of this for you but understand you should be referencing Figure 104, Figure 346, and Section 6.15. This is not enough however. You will also need to understand PRP vs SGL and which you are using (PRP is easier to start with). NLB (Number of logical blocks) which determine your write size, with NVMe you do not specify writes in bytes but in terms of NLBs which the size is specified by the controller (drive), it may implement multiple NLB sizes but this is up to the drive not you as the host, you just get to pick from what it supports Section 5.15.2.1, Figure 245 You want to look at identify namespace to tell you the LBA (logical block address) size, this will lead you down a rabbit hole to determine the actual size but that's ok the info is there.
Ok so you finished this mess and have created the submission command. Let's assume the host has already completed 2 commands on this queue (at start this will be 0 I'm picking 2 just to be clearer in my example). What you now need to do is place this command at 0x1000_0080.
Now let's assume this is queue 1 (from the equation you posted the queue number is the y value. Note that queue 0 is the admin queue). What you need to do is poke the controllers submission queue tail doorbell to say how many commands are now loaded (thus you can queue multiple up at once and only tell the drive when you are ready to). In this case the number is 2. So you need to write the value 2 to register 0x1008.
At this point the drive will go. aha, the host has told me there are new commands to fetch. So the controller will go to queue base address + commandsize*2 and fetch 64bytes of data aka 1 command (address 0x1000_0080). The controller will decode this command as a write which means the controller (drive) must read data from some address and put it in memory where it was told to. This means your write command should tell the drive to go to address 0x2000_0000 and read 512 bytes of data, and it will if you scope the PCIe bus. At this point the drive will fill out a completion queue entry (16 bytes specified at Section 4.6) and place it in the completion queue address you specified at queue creation (plus 0x20 since this is the 2nd completion). Then the controller will generate and MSI-X interrupt.
At this point you must go to wherever the completion queue was placed and read the response to check status, and also if you queued multiple submissions check the SQID to see what finished since jobs can finish out of order. You then must write to the completion queue head (0x100C) to indicate that you have retrieved the completion queue (success or failure). Notice here you never interact with the submission queue head (that's up to the controller since only he knows when the submission queue entry was processed) and only the controller places things in the completion queue tail since only he can create new entries.

I'm sorry this is so long and not well formatted but hopefully you now have a slightly better understanding of NVMe, it's a bit of a mess at first but once you get it it all makes sense. Just remember my example assumed you had created a queue which baseline doesn't exist. First you need to setup the admin submission and completion queues (0x28 and 0x30) which has queue ID 0 thus it's tail/head doorbell is address 0x1000,0x1004 respectively. You then must reference Section 5 to find the opcodes to make stuff happen but I have faith you can figure it out from what I've given you. If you have any more questions put a comment down and I'll see what I can do.