An experiment in using a single storage buffer for all data passed into shaders.

Vulkan Single Storage Buffer for all Uniform Data

Passing data into a shader is one of the most basic tasks when developing graphics applications. In Vulkan, there are three ways of doing so: Push Constants, Uniform Buffers and Storage Buffers.

Push Constants are very fast because the data is written directly into the command buffer. The downside is that Push Constants are very limited in size. The minimum size required by the Vulkan Specification is 128 bytes, and only some graphics devices offer 256 bytes or more.

Uniform Buffers allow you to write data to a buffer object and then bind that buffer object to the pipeline so the shader can read from it. Uniform Buffers are much larger than push constants: the Vulkan Specification guarantees a bindable range of at least 16384 bytes, and many desktop devices report 65536 bytes. This is usually more than enough data for most shaders.

When drawing multiple objects, multiple uniform buffers are usually required (unless you use Dynamic Uniform Buffers). This means you will need to keep track of each buffer you created, as well as descriptor sets for each of them.

Shader Storage Buffers are similar to uniform buffers, but they have the benefit of being much larger. The Vulkan Specification guarantees a bindable range of at least 128 MiB, and many graphics devices allow storage buffers close to the size of the available graphics memory. Shader Storage Buffers, unlike uniform buffers, can also be written to from the shader.

Because Shader Storage Buffers can be very large, we can use a single storage buffer to store all the data that the pipeline (or even multiple pipelines) uses.

Aliasing Storage Buffer Bindings

What’s interesting in SPIR-V shaders is that you can alias the storage buffer bindings. This means you can access the same storage data through differently typed interfaces. To do this, we define an unsized (runtime) array within each storage buffer block, but use the same set and binding number for each definition.

layout(set=0, binding = 0) buffer readonly s_MatrixData_t
{
    mat4 data[];
} s_MatrixData;

layout(set=0, binding = 0) buffer readonly s_FloatData_t
{
    float data[];
} s_FloatData;

layout(set=0, binding = 0) buffer readonly s_CustomData_t
{
    CustomStruct data[];
} s_CustomData;

In the above definitions, there are three interfaces to a storage buffer which is bound to set 0, binding 0. All three storage buffers point to the same block of memory. We can then index into the appropriate element index to retrieve the data we need. The index into the array is provided by the push constants.


layout(push_constant) uniform PushConsts
{
    uint matrixIndex;
    uint floatIndex;
    uint CustomIndex;
} pushC;


void main()
{
    mat4         matrix  = s_MatrixData.data[ pushC.matrixIndex ];
    float        myFloat = s_FloatData.data[ pushC.floatIndex ];
    CustomStruct custom  = s_CustomData.data[ pushC.CustomIndex ];
}
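The indexing rule behind the aliased bindings can be checked on the host. The sketch below is only a C++ analogy, not part of the original code: the helpers elementIndex and readAs are hypothetical names. It reads one byte buffer through two element types, where the index for each view is the byte offset divided by that view's element size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: index of an element inside a typed view of the buffer.
// Only meaningful when byteOffset is a multiple of elementSize.
inline std::uint32_t elementIndex(std::size_t byteOffset, std::size_t elementSize)
{
    assert(elementSize != 0 && byteOffset % elementSize == 0);
    return static_cast<std::uint32_t>(byteOffset / elementSize);
}

// Hypothetical helper: read element `index` as type T, similar to what
// s_FloatData.data[index] or s_MatrixData.data[index] does in the shader.
template <typename T>
T readAs(std::vector<std::uint8_t> const & storage, std::uint32_t index)
{
    std::size_t const byteOffset = static_cast<std::size_t>(index) * sizeof(T);
    assert(byteOffset + sizeof(T) <= storage.size());
    T value;
    std::memcpy(&value, storage.data() + byteOffset, sizeof(T));
    return value;
}
```

With this rule, byte offset 128 is element 32 of the float view and element 2 of a view whose elements are 64 bytes wide (the size of a mat4). Both indices name the same memory.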

Copy Data into the Storage Buffers

On the host side, we need to make sure our data for each datatype is aligned appropriately when we copy it into the buffer. If we do not align our data properly, we will get garbage data when we go to read it in the shader.

To do this, we’ll create a helper class that manages where the data gets stored. It has a fairly simple interface and only needs a memory-mapped storage buffer and its maximum size.

Using a host-visible storage buffer may be slower than using a device-only buffer. This example uses a host-visible storage buffer for convenience. If using a device-only buffer, you will have to perform the buffer-to-buffer copy yourself.

The class acts like a circular buffer: it keeps writing towards the end of the buffer until the next write would exceed the maximum size. When that happens, it resets and starts writing from the front.

Each time we copy data into the buffer, we need to make sure that the byte offset we are copying into is a multiple of the array element size. To do this we define a static function roundToNearestMultiple, which rounds the offset up to the next multiple.

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

class MultiStorageBuffer
{
public:
    // mappedStorage must point to a mapped buffer of at least maxSize bytes
    void init(void * mappedStorage, 
              size_t maxSize)
    {
        m_mappedStorageBuffer  = mappedStorage;
        m_maxStorageBufferSize = maxSize;
        m_currentByteOffset    = 0;
    }

    // Copy data to the storage buffer and return the appropriate index
    // which should be passed to the shader via push constants
    uint32_t copyDataToStorage( void const * data, 
                                size_t sizeOfData, 
                                size_t sizeOfElement)
    {
        assert( sizeOfElement != 0);
        assert( sizeOfData % sizeOfElement == 0);
        // the data has to fit in the buffer, wrapping around cannot fix that
        assert( sizeOfData <= m_maxStorageBufferSize);

        auto byteLocation = roundToNearestMultiple(m_currentByteOffset, 
                                                   sizeOfElement);
   
        // not enough room left: wrap around and write from the front
        if( byteLocation + sizeOfData > m_maxStorageBufferSize)
        {
            byteLocation = 0;
        }

        auto * dstLoc = static_cast<uint8_t*>(m_mappedStorageBuffer) + byteLocation;
        std::memcpy( dstLoc, data, sizeOfData);

        m_currentByteOffset = byteLocation + sizeOfData;

        return static_cast<uint32_t>(byteLocation / sizeOfElement);    
    }

    // Round numToRound up to the next multiple of `multiple`
    static size_t roundToNearestMultiple(size_t numToRound, size_t multiple)
    {
        assert(multiple);
        return ((numToRound + multiple - 1) / multiple) * multiple;
    }

protected:
    void * m_mappedStorageBuffer = nullptr;
    size_t m_maxStorageBufferSize = 0;
    size_t m_currentByteOffset = 0;

};
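To make the offset arithmetic concrete, here is a small bookkeeping-only sketch. OffsetTracker and roundUp are hypothetical names used for illustration: the tracker advances the write offset the same way as the class above, but copies nothing and never wraps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same arithmetic as roundToNearestMultiple: round up to the next multiple.
constexpr std::size_t roundUp(std::size_t value, std::size_t multiple)
{
    return ((value + multiple - 1) / multiple) * multiple;
}

// Hypothetical offsets-only stand-in for copyDataToStorage.
struct OffsetTracker
{
    std::size_t currentByteOffset = 0;

    // Returns the element index at which the data would be placed.
    std::uint32_t place(std::size_t sizeOfData, std::size_t sizeOfElement)
    {
        assert(sizeOfElement != 0 && sizeOfData % sizeOfElement == 0);
        std::size_t const byteLocation = roundUp(currentByteOffset, sizeOfElement);
        currentByteOffset = byteLocation + sizeOfData;
        return static_cast<std::uint32_t>(byteLocation / sizeOfElement);
    }
};
```

Placing a 64-byte matrix, then a 4-byte float, then another matrix gives indices 0, 16 and 2. The float lands at byte 64 (index 16 in the float view), and the second matrix is rounded up from byte 68 to byte 128 (index 2 in the matrix view), leaving 60 bytes of padding.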

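As long as our storage buffer is big enough to hold all the data for one frame of rendering, we shouldn’t have any problems with running out of storage. If several frames can be in flight at once, it needs to hold the data for all of them, because wrapping around overwrites data the GPU may still be reading. A storage buffer size of 50 to 100 MB would probably be enough.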
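The wrap-around behaviour is easy to see with offsets alone. The following is a minimal sketch under stated assumptions: RingOffsets and its wrapCount counter are hypothetical and not part of the class above, and alignment is ignored to keep the example short.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical offsets-only ring: hands out byte offsets and wraps to the
// front when the next write would not fit.
struct RingOffsets
{
    std::size_t capacity  = 0;
    std::size_t head      = 0;
    std::size_t wrapCount = 0; // assumption: a counter to detect overwrites

    // Returns the byte offset for the next write of `size` bytes.
    std::size_t reserve(std::size_t size)
    {
        assert(size <= capacity);
        if (head + size > capacity)
        {
            head = 0;
            ++wrapCount;
        }
        std::size_t const offset = head;
        head += size;
        return offset;
    }
};
```

With a capacity of 256 bytes, reserving 128, 64 and 128 bytes returns offsets 0, 128 and 0: the third write does not fit behind the second, so it reuses the memory of the first. A counter like wrapCount could be compared at the start and end of a frame to notice that the buffer was too small for that frame.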

When we go to use this, we pass in the data we want to copy and get the index returned to pass to the shader.


struct pushConsts
{
    uint32_t matrixIndex;
    uint32_t floatIndex;
    uint32_t CustomIndex;
};

pushConsts pC;

// fill in the data we want to send
glm::mat4 matrix = ....;
CustomStruct custom = ....;
float floatData = ...;

// copy the data to the appropriate locations and get their index
pC.matrixIndex = multiStorage.copyDataToStorage(&matrix, 
                                                sizeof(matrix), 
                                                sizeof(matrix) );

pC.floatIndex  = multiStorage.copyDataToStorage(&floatData, 
                                                sizeof(float), 
                                                sizeof(float) );

pC.CustomIndex = multiStorage.copyDataToStorage(&custom, 
                                                sizeof(CustomStruct), 
                                                sizeof(CustomStruct) );

// push the data using push constants
vkCmdPushConstants(cmd,
                   layout,
                   VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT,
                   0,
                   sizeof(pC), 
                   &pC);

// draw the object

My Implementation

The implementation above is a very simple case. In the engine I am working on, all data is passed to the shader via the storage buffer. The entire push constant range is used to hold indices into the arrays. The minimum size of push constants is 128 bytes, and each index is a 4-byte uint, which gives us a maximum of 32 indices we can use.
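The 32-index budget can be pinned down at compile time on the host. This is a sketch with hypothetical names (PushIndices, slotByteOffset and the two constants); it only assumes the 128-byte minimum mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimum push constant size guaranteed by the Vulkan Specification.
constexpr std::size_t kGuaranteedPushConstantBytes = 128;
constexpr std::size_t kMaxStorageIndices =
    kGuaranteedPushConstantBytes / sizeof(std::uint32_t);

// Hypothetical host-side mirror of a push constant block made of indices.
struct PushIndices
{
    std::uint32_t storageIndex[kMaxStorageIndices];
};

static_assert(kMaxStorageIndices == 32, "128 bytes hold 32 uint indices");
static_assert(sizeof(PushIndices) == kGuaranteedPushConstantBytes,
              "the mirror must fill the push constant range exactly");

// Byte offset of one index slot inside the push constant range.
inline std::uint32_t slotByteOffset(std::uint32_t slot)
{
    assert(slot < kMaxStorageIndices);
    return slot * static_cast<std::uint32_t>(sizeof(std::uint32_t));
}
```

slotByteOffset(2) evaluates to 8, which is the offset to use when only the third index needs to be updated.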

To make my life a little easier, I have a separate header with the following:

#ifndef DEFAULT_DEFINITIONS_GLSL
#define DEFAULT_DEFINITIONS_GLSL

layout(push_constant) uniform PushConsts
{
    uint storageIndex[32];
} _pc;

#define DEFINE_STORAGE(type, STORAGE_NAME, STORAGE_INDEX)\
layout(set=0, binding = 0) buffer readonly s_ ## STORAGE_NAME ## _t\
{\
    type data[];\
} s_ ## STORAGE_NAME;\
type get ## STORAGE_NAME(int indexOffset)\
{\
    return s_ ## STORAGE_NAME.data[ _pc.storageIndex[STORAGE_INDEX]+indexOffset ];\
}

#endif

In my shader, I declare the structs that are usually accessed and register each one with DEFINE_STORAGE. The first parameter is the data type, the second parameter is the name we want to use to access the data, and the third parameter is the slot in the storageIndex array that holds its index.


struct perFrame
{
    // data with padding
};
struct viewPortData
{
    // data with padding
};
struct material
{
    // data with padding
};

DEFINE_STORAGE(perFrame, FrameData, 0); // frame times, mouse locations, etc
DEFINE_STORAGE(viewPortData, ViewPortData, 1); // camera matrices/viewport sizes
DEFINE_STORAGE(mat4, Transform, 2); // model and bone matrices
DEFINE_STORAGE(material, Material, 3); // materials

The preprocessor definition also defines a set of functions which I can use to access the data.

perFrame fd = getFrameData(0);
viewPortData vd = getViewPortData(0);
mat4 matrix = getTransform(0);
material Mat = getMaterial(0);

The input parameter to the getXXXX functions is an additional index offset. This allows us to push an array of data as well. For example, if we want to push multiple matrices to use for bones, we can access each of the bone matrices using getTransform(boneIndex).
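On the host, pushing an array means copying all elements in one contiguous block and handing the base index to the shader. Below is a minimal sketch, assuming a plain byte vector in place of the mapped buffer; Mat4, copyArray and the host-side getTransform are hypothetical stand-ins, not the engine's code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using Mat4 = std::array<float, 16>; // stand-in for a 64-byte mat4
static_assert(sizeof(Mat4) == 64, "Mat4 is expected to be 64 bytes");

// Hypothetical: copy `count` matrices to an already aligned byte location
// and return the base index of the first one.
inline std::uint32_t copyArray(std::vector<std::uint8_t> & storage,
                               std::size_t byteLocation,
                               Mat4 const * data,
                               std::size_t count)
{
    assert(byteLocation % sizeof(Mat4) == 0);
    assert(byteLocation + count * sizeof(Mat4) <= storage.size());
    std::memcpy(storage.data() + byteLocation, data, count * sizeof(Mat4));
    return static_cast<std::uint32_t>(byteLocation / sizeof(Mat4));
}

// Host-side mirror of the shader's getTransform(indexOffset).
inline Mat4 getTransform(std::vector<std::uint8_t> const & storage,
                         std::uint32_t baseIndex,
                         std::uint32_t indexOffset)
{
    std::size_t const byteOffset =
        static_cast<std::size_t>(baseIndex + indexOffset) * sizeof(Mat4);
    assert(byteOffset + sizeof(Mat4) <= storage.size());
    Mat4 m;
    std::memcpy(&m, storage.data() + byteOffset, sizeof(Mat4));
    return m;
}
```

Copying three bone matrices to byte 128 returns base index 2, and bone number 2 is then read from element 4 of the matrix view. Only the base index travels through the push constants.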

On the host side, I have a function which pushes the data and writes the resulting index into the matching push constant slot. I have removed some of the code that was not relevant to showing the basic concept.

uint32_t pushStorage(uint32_t storageIndex, void const * V, uint32_t sizeOfData, uint32_t sizeofElement)
{
    auto & mainStorageBuffer = getMainStorageBuffer();

    // copy data to the storage buffer
    auto i = static_cast<uint32_t>(mainStorageBuffer.copyDataToStorage(V, sizeOfData, sizeofElement));

    // write the index into its push constant slot
    vkCmdPushConstants(cmd,
                       pipelineLayout,
                       VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT ,
                       static_cast<uint32_t>(storageIndex * sizeof(uint32_t)),
                       sizeof(i), &i);
    return i;
}

My main render loop looks something like this.

pushStorage(0, &perFrameData, sizeof(perFrameData), sizeof(perFrameData));

for each viewportData

    pushStorage(1, &viewportData, sizeof(viewportData), sizeof(viewportData));
    
    for each object
        pushStorage(2, &objectTransform, sizeof(mat4), sizeof(mat4));
        pushStorage(3, &objectMaterial, sizeof(material), sizeof(material));
        drawObject( );

A word about Alignment

Memory alignment requirements on the GPU are different from memory alignment on the host’s CPU. Without going into too much detail about how to align the data properly, a general rule of thumb is:

floats and uints must be aligned to 4 byte boundaries
vec2 must be aligned to 8 byte boundaries
vec3 must be aligned to 16 byte boundaries
vec4 must be aligned to 16 byte boundaries

This means that the following is a valid struct

struct MyCustomData
{
    // first block of vec4 size
    vec2 data1; // bytes 0-7
    uint data2; // bytes 8-11
    uint data3; // bytes 12-15
};

But the following is not, because the vec2 is not aligned to an 8-byte boundary: its offset is 4.

struct MyCustomData
{
    // first block of vec4 size
    uint data1; // bytes 0-3
    vec2 data2; // bytes 4-11
    uint data3; // bytes 12-15
};
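The rule of thumb can be turned into a small check. The sketch below is hypothetical (Member and firstMisaligned are illustrative names): each member is described by its byte offset and the alignment its type requires, and the function reports the first member that breaks the rule.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of one struct member: where it sits and which
// alignment the rule of thumb demands for its type.
struct Member
{
    char const * name;
    std::size_t  offset;
    std::size_t  alignment;
};

// Returns the index of the first member whose offset is not a multiple of
// its required alignment, or -1 when every member is placed correctly.
inline int firstMisaligned(Member const * members, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        assert(members[i].alignment != 0);
        if (members[i].offset % members[i].alignment != 0)
        {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

For the first struct (vec2 at 0, uint at 8, uint at 12) the check returns -1. For the second one (uint at 0, vec2 at 4, uint at 12) it returns 1, pointing at the vec2.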

vec3s are interesting because they need the same alignment as a vec4 (16 bytes), even though they only occupy 12 bytes. The following is an invalid placement for data2, because its offset would be 4.

struct MyCustomData
{
    // first block of vec4 size
    uint data1; //
    vec3 data2; // invalid
};

One option is to place the vec3 first in the struct, followed by the uint. This satisfies the alignment requirements, and the size of the entire struct is a multiple of 16.

struct MyCustomData
{
    // first block of vec4 size
    vec3 data2; // valid
    uint data1; //
};

If you must have a specific order, then you will have to pad the struct to ensure the vec3 data is at the correct offset.

struct MyCustomData
{
    // first block of vec4 size
    uint data1; //
    uint unused1;
    uint unused2;
    uint unused3;

    vec3 data2; // valid, offset == 16
    uint padding; // to make sizeof(MyCustomData) == 32
};
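On the host, the matching C++ struct can be checked at compile time so that a layout mistake never reaches the shader. This is a sketch under stated assumptions: MyCustomDataHost is a hypothetical mirror of the padded struct above, with the vec3 represented as three floats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side mirror of the padded shader struct.
struct alignas(16) MyCustomDataHost
{
    std::uint32_t data1;
    std::uint32_t unused1;
    std::uint32_t unused2;
    std::uint32_t unused3;

    float         data2[3]; // vec3, must start on a 16-byte boundary
    std::uint32_t padding;  // fills the struct up to 32 bytes
};

static_assert(offsetof(MyCustomDataHost, data2) == 16,
              "vec3 must sit at a multiple of 16");
static_assert(offsetof(MyCustomDataHost, padding) == 28,
              "padding follows the 12 bytes of the vec3");
static_assert(sizeof(MyCustomDataHost) == 32,
              "struct size must be a multiple of 16");
```

A plain float array is used for the vec3 on purpose. A 16-byte aligned C++ vector type would also be 16 bytes in size, which would push the padding member to offset 32 and no longer match the shader.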

Conclusion

The only downside I can think of with this method is the performance of storage buffers compared to uniform buffers. On some hardware, storage buffers are a little slower to access than uniform buffers.

But you will have to weigh the performance penalty against the other benefits. Those benefits are:

You only have one descriptor set, which you allocate at the start. There is no need to constantly update it.
You only need to bind the descriptor set once at the start of rendering.
Memory for the storage buffer never gets fragmented, since it is never released or reallocated.

In one of Arseny Kapoulkine’s Vulkan videos, he uses a Storage Buffer to read vertex data instead of using vertex buffers. If storage buffers are fast enough to fetch the vertex information for each vertex in a mesh, then they are probably fast enough for reading per-object uniform data.

The video in question is linked here: Niagara: Rendering a Mesh
