c++ · performance · opengl · nvidia · gpu-instancing

I am having performance issues with rendering 1 million triangles using OpenGL


I am using an RTX 3060 GPU (laptop version) and I am trying to render 1 million triangles (500,000 instanced quads). Nsight shows the GPU taking around 40 ms while the CPU is practically doing nothing. My shaders are simple vertex and fragment shaders with nothing else in them. This seems like a trivial workload, as I thought GPUs could render millions of triangles easily. The main render loop looks like this:

m_RenderData.InitialColorPass->Use();
m_RenderData.InitialColorPass->SetUniformMat4("ViewProjection", ViewProjectionMatrix);

InstanceBatch.BeginInstanceBatch();

glDrawElementsInstanced(
    GL_TRIANGLES,                                      // primitive type
    (GLuint)(Instance.GetCount()),                     // index count per instance
    GL_UNSIGNED_INT,                                   // index type
    nullptr,                                           // index offset into the bound element buffer
    (GLuint)(InstanceBatch.GetNumberOfInstances())     // number of instances
);

InstanceBatch.EndInstanceBatch();
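(To cross-check the ~40 ms figure from Nsight, something like the following GL timer query around the draw call should work. This is only a sketch, not part of my actual code; the query object and variable names here are made up.)

GLuint timeQuery = 0;
glGenQueries(1, &timeQuery);

glBeginQuery(GL_TIME_ELAPSED, timeQuery);
// ... the glDrawElementsInstanced call from above ...
glEndQuery(GL_TIME_ELAPSED);

// Reading the result right away forces a CPU/GPU sync, so in a real frame loop
// the result should be collected a frame or two later.
GLuint64 elapsedNs = 0;
glGetQueryObjectui64v(timeQuery, GL_QUERY_RESULT, &elapsedNs);
double gpuMilliseconds = elapsedNs / 1.0e6;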

Within BeginInstanceBatch I unmap the buffer containing the matrix data:

if (m_CurrentBufferIndex == 0)
    m_InstanceBuffer.UnMap();
else if (m_CurrentBufferIndex == 1)
    m_InstanceSecondBuffer.UnMap();
else
    m_InstanceThirdBuffer.UnMap();

m_InstanceVAO.Bind();

Then in EndInstanceBatch I simply remap the next buffer range; I do this with asynchronous mapping flags enabled and triple buffering:

if (m_CurrentBufferIndex == 0)
{
    m_BufferMapBase = (float*)m_InstanceSecondBuffer.MapBufferRange();
    m_InstanceVAO.AttachVertexBuffer(m_InstanceSecondBuffer, 1, m_AttributeDataCopy.Offsets[1], m_AttributeDataCopy.Strides[1]);
    m_CurrentBufferIndex = 1;
}
else if (m_CurrentBufferIndex == 1)
{
    m_BufferMapBase = (float*)m_InstanceThirdBuffer.MapBufferRange();
    m_InstanceVAO.AttachVertexBuffer(m_InstanceThirdBuffer, 1, m_AttributeDataCopy.Offsets[1], m_AttributeDataCopy.Strides[1]);
    m_CurrentBufferIndex = 2;
}
else
{
    m_BufferMapBase = (float*)m_InstanceBuffer.MapBufferRange();
    m_InstanceVAO.AttachVertexBuffer(m_InstanceBuffer, 1, m_AttributeDataCopy.Offsets[1], m_AttributeDataCopy.Strides[1]);
    m_CurrentBufferIndex = 0;
}
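I haven't shown the flags that MapBufferRange() passes; roughly speaking the idea is a call along these lines (the offset, size, and exact flag combination here are illustrative placeholders, not my real wrapper code; GL_MAP_UNSYNCHRONIZED_BIT is what makes the map asynchronous):

void* ptr = glMapBufferRange(
    GL_ARRAY_BUFFER,                        // bind target of the instance buffer
    0,                                      // offset into the buffer
    numInstances * sizeof(float) * 16,      // size of the per-instance matrix data (placeholder)
    GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT);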

At first I figured it was a synchronisation issue with the buffers and waiting on the GPU, which is why I implemented triple buffering. While this reduced the CPU time to practically nothing (around 2 ms per frame), the GPU was still taking a long time. I am starting to think this is maybe a hardware limitation, but a 3060, even in a laptop, is a powerful GPU. Next I thought vsync was enabled, but disabling it in the control panel seemed to do nothing. My integrated graphics (AMD Ryzen 7 5800) also slightly outperforms my dedicated 3060.
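(For the triple buffering itself: the unsynchronized maps rely on the GPU having finished with a buffer before it is rewritten. With three buffers in flight I'm assuming that holds, but a fence per buffer slot would make it explicit. A rough sketch, with made-up names, not my engine code:)

// One fence per buffer slot.
GLsync bufferFences[3] = { nullptr, nullptr, nullptr };

// After the instanced draw that reads from buffer `slot`:
bufferFences[slot] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// Before mapping buffer `slot` again on a later frame:
if (bufferFences[slot] != nullptr)
{
    // With three buffers in flight this should almost never actually block.
    glClientWaitSync(bufferFences[slot], GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(-1));
    glDeleteSync(bufferFences[slot]);
    bufferFences[slot] = nullptr;
}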

One warning I started to get recently was "[Core-Warn] Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering." It occurs once, when I initially remap the buffer or reconstruct the buffer with a new size.


Solution

  • I found the problem: it turns out one of the mapped buffers was incorrectly attached to a vertex array, so I was only using double buffering, not triple as I intended, because every frame one buffer would never be mapped and the draw call would be drawing from an empty buffer. Looking back at it now, it seems obvious given the high GPU usage and low CPU usage with that much data. EDIT: I also had some extra timers I forgot to remove, and those were affecting it as well, although I don't know why they weren't showing up in the CPU usage.
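For anyone hitting the same thing: this kind of mismatch is harder to make if both the VAO attachment and the mapping are driven off the same ring index. This is just a sketch of the idea, not my exact fix or classes (Buffer here stands for whatever wrapper type the instance buffers are):

// Keep the three instance buffers in an array and derive both the buffer being
// drawn from and the buffer being written from the same index, so they can't disagree.
Buffer* m_Buffers[3] = { &m_InstanceBuffer, &m_InstanceSecondBuffer, &m_InstanceThirdBuffer };

void BeginInstanceBatch()
{
    // The buffer at the current index is the one the CPU just finished writing:
    // unmap it and make sure the VAO reads from that same buffer.
    m_Buffers[m_CurrentBufferIndex]->UnMap();
    m_InstanceVAO.AttachVertexBuffer(*m_Buffers[m_CurrentBufferIndex], 1,
        m_AttributeDataCopy.Offsets[1], m_AttributeDataCopy.Strides[1]);
    m_InstanceVAO.Bind();
}

void EndInstanceBatch()
{
    // Advance the ring and map the next buffer for the CPU to fill.
    m_CurrentBufferIndex = (m_CurrentBufferIndex + 1) % 3;
    m_BufferMapBase = (float*)m_Buffers[m_CurrentBufferIndex]->MapBufferRange();
}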