Thursday, October 22, 2015

Intel Blog: Performance Considerations for Resource Binding in Microsoft DirectX* 12

Wednesday, August 12, 2015

Implementation of the GPU Pro 5 Screen-Space Glossy Reflection Algorithm

Someone (can't find the name on the website) provided an implementation of a GPU Pro 5 article:

http://roar11.com/2015/07/screen-space-glossy-reflections/

Pretty cool!

Wednesday, July 1, 2015

MVP Award

This year I was honored with an MVP award. This is the tenth time in a row and I am very excited about this. I would like to thank everyone for supporting my nominations for the last 10 years.

Here is my MVP page:

http://mvp.microsoft.com/en-us/mvp/Wolfgang%20Engel-35704

A lot of what I do during the year does not find its way onto this blog. Most of the time I am too busy doing these things, which leaves me with little time to blog about them. I also consider this blog more of an offer to provide advice or insights into things I am working on in my spare time (outside of Confetti). On top of that, with Confetti growing over the last six-plus years, my programming time, including my spare time, has decreased.

In general I do not give myself much time to reflect on what happened during those ten years as an MVP. I am still trying to grasp what it means to have been active for 10 years in an industry as volatile as the game industry. Obviously I have been in the industry much longer than that.

10 years ago a new console generation launched with the XBOX 360 / PS3. We considered that launch a major event because these two platforms, together with the PC, were expected to be the main gaming devices for the next seven years. Only two years later, mobile games started to take off after Steve Jobs changed his mind about not supporting native programming on iOS devices.
Today we have devices like the iPad Air 2 and the NVIDIA Shield that offer performance close to the XBOX 360 / PS3, and the big console manufacturers face a serious challenge in competing with the many mobile devices that people already have in their homes. It has become so easy for companies to launch their own consoles that many are now shipping mini consoles built from more advanced mobile parts.

The production models in the industry are changing rapidly. Similar to the movie industry, parts of the industry are moving away from the monolithic model of keeping large dev teams on games towards more flexible strike teams, hiring companies like Confetti to come in and take care of graphics and tools instead of keeping a group of people permanently on staff for those tasks.
This is an exciting development for Confetti and I feel like we are in the middle of it.
It will be interesting to see where all this goes ... one thing I know is that we will become better every year. We will always strive to make the next year better than the previous one, improve efficiency, and learn more.

With the companies that haven't adjusted to the strike team model, there has been the unfortunate development over the last 10 years of flooding the news with large layoffs; they send out press releases saying that they had to reduce their workforce for reasons like "aligning" expectations, budget, lack of success, etc. Many of those press releases express a snide view of the treatment of human beings that is reminiscent of the darker times of slavery.

One more unfortunate development in sharing information over the last 10 years is that most of the information shared at conferences now has software patents attached to it. So if someone wants to implement it (obviously without knowing it: every employee is told that they are not allowed to read patent descriptions), his or her company might have to pay for it in the future. The system of freely sharing information and helping other developers succeed with difficult technical problems was turned upside down in favor of companies with large legal departments. The willingness of developers to help their peers is used by companies to secure future economic advantages.
On top of that, middleware companies like Unity and others have a hard time open-sourcing their engines because they are concerned that they violate various patents and would therefore run a huge economic risk by sharing source code.

Apart from the "strike team" model, the most exciting development is the new breed of developers who have adjusted to the new economic pressures of the app store model and make a living from new and innovative games. We have had the pleasure of working with some of them, and it is an awesome experience to feel the creative and positive energy flowing in those companies. They remind me of the mid-'90s, when what we now call the game industry booted into "big" games that reach millions of people. This new generation now reaches hundreds of millions of people. You could say this is the third wave of game developers, with the first wave being the developers of the '80s and the second wave the developers of the '90s.





Sunday, May 31, 2015

Multi-GPU Game Engine

Many high-end rendering solutions, for example for battlefield simulations, can now utilize hardware with multiple consumer GPUs. The idea is to split the computational work between 4 - 8 GPUs to increase the level of realism as much as possible.
With more modern APIs like DirectX 12, probably Vulkan, and before that CUDA, the rendering pipeline can be split up in the following way:
- GPU0 - fills up the G-Buffer after a Z pre-pass
- GPU1 - Renders Deferred Lights and Shadows
- GPU2 - Renders Particles and Vegetation
- GPU3 - Renders Screen-Space Materials like skin etc. and PostFX

Now you can take the result of GPU0 and feed it to GPU1, then feed that to GPU2, and so on. All of this runs in parallel but introduces two or three frames of lag (depending on how particles and vegetation are lit). As long as the system renders at 60 fps or 120 fps this will not be very noticeable (obviously one of the targets is a high framerate to make animations look smooth, together with 4K resolution rendering). GPU4 and higher can work on physics, AI and other things. There is also the opportunity to spread G-Buffer rendering over several GPUs: one GPU does the Z pre-pass, another fills in diffuse, normal and probably some geometry data to identify different objects later or store their edges, and another GPU fills in the terrain data. Vegetation can be rendered on a dedicated GPU, etc.
On the CPU side the rule of thumb is that at least 2 cores are needed per GPU; it is probably better to go for 3 or 4. So a four-GPU machine should have 8 - 16 CPU cores and an eight-GPU machine 16 - 32 CPU cores, which might be split between several physical CPUs. We need at least 2x as much CPU RAM as the GPUs have combined, so if four GPUs have 2 GB each, we need at least 16 GB of RAM; with eight GPUs, we need at least 32 GB of RAM, etc.
A 4K resolution consists of 3840 × 2160 pixels; with four render targets at 32 bits per pixel (8:8:8:8 or 11:11:10) that occupies roughly 126.56 MB. This number goes up with 4x or 8x MSAA and maybe super-sampling. It is probably safe to assume that the G-Buffer might occupy between 500 MB and 1 GB.
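For reference, the arithmetic behind the base figure (four 32-bit render targets, no MSAA):

$$3840 \times 2160 \ \text{pixels} \times 4 \ \text{bytes} \times 4 \ \text{render targets} = 132{,}710{,}400 \ \text{bytes} \approx 126.56 \ \text{MB}$$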
Achieving a frame time of 8 - 16 ms means that even a high-end GPU will be quite busy filling a G-Buffer of this size, so splitting that work between two GPUs might make sense.
A high-end PostFX pipeline now takes < 5 ms on mid-range GPUs, but dedicating a whole high-end GPU to it means we can finally switch on the movie settings :-)
A GPU particle system can easily saturate a GPU for 16 ms ... especially if it is not rendered at quarter resolution.
For lights and shadows it depends on the number of lights that should be applied. Caching all the shadow data in partially resident textures, cube maps or any other shadow map technique will hit the memory budget of that card substantially.


Note: I wrote this more than two years ago. At the time a G-Buffer was a valid solution for designing a rendering system. Now with the high-res displays it is not anymore.

Wednesday, May 27, 2015

V Buffer - Deferred Lighting Re-Thought

After eight years I would like to go back and re-design the existing rendering systems, so that they are capable of running more efficiently on high-resolution devices and of displaying more lights with attached shadows.

Let's first see where we are: the Light Pre-Pass was introduced in March 2008 on this blog. At that point I had already had it running in one R* game for a while. It eventually shipped in a large number of games, also outside of R*. The S.T.A.L.K.E.R. series and the games developed by Naughty Dog used a similar approach at the time. Since then a number of modifications have been proposed.
One modification was to calculate lighting by tiling the G-Buffer, sorting lights into those tiles and then shading each tile with its lights. Johan Andersson covered a practical implementation in "DirectX 11 rendering in Battlefield 3" (http://www.slideshare.net/DICEStudio/directx-11-rendering-in-battlefield-3). Before Tiled-Deferred, lights were additively blended into a buffer, consuming memory bandwidth with each blit. The Tiled-Deferred approach reduced memory bandwidth consumption substantially by resolving all the lights of a tile in one pass.
The drawback of this approach is the higher minimum run-time cost. Sorting the lights into the tiles raises the "resting" workload even when only a few lights are rendered; compared to the older approaches it didn't break even until one rendered a few dozen lights. Additionally, as soon as lights had to be drawn with shadows, the memory bandwidth savings were negligible.
Newer approaches like "Clustered Deferred and Forward Shading" (http://www.cse.chalmers.se/~uffe/clustered_shading_preprint.pdf) by Ola Olsson et al. started solving the "light overdraw" problem in even more efficient ways. A practical implementation is shown in an example program on Emil Persson's website (http://www.humus.name/Articles/PracticalClusteredShading.pdf).
Because all the approaches mentioned above handle transparency inconsistently with the way opaque objects are handled, there was a group of people who wished to go back to forward rendering. Takahiro Harada described and refined an approach that he called Forward+ (http://www.slideshare.net/takahiroharada/forward-34779335). The tile-based handling of light sources is similar to the Tiled-Deferred approach; a sketch of the idea is shown below. The advantage of having a consistent way of lighting transparent and opaque objects is bought by having to re-submit all potentially visible geometry several times.
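To make the tile-based light handling a bit more concrete, here is a minimal sketch of how a Forward+ pixel shader might consume per-tile light lists written earlier in the frame by a separate culling pass. This is not Harada's implementation; the buffer layout and the names (PointLight, TileLightLists, TILE_SIZE, MAX_LIGHTS_PER_TILE, g_numTiles) are assumptions for illustration only.

#define TILE_SIZE 16
#define MAX_LIGHTS_PER_TILE 256

struct PointLight
{
    float3 position; // world space
    float  radius;
    float3 color;
    float  pad;
};

StructuredBuffer<PointLight> Lights : register(t0);
// per tile: slot 0 holds the light count, slots 1..MAX_LIGHTS_PER_TILE hold
// light indices; assumed to be filled by a tile culling compute shader
StructuredBuffer<uint> TileLightLists : register(t1);

cbuffer PerFrame : register(b0)
{
    uint2 g_numTiles; // screen size divided by TILE_SIZE, rounded up
};

float4 ForwardPlusPS(float4 svPos    : SV_Position,
                     float3 worldPos : TEXCOORD0,
                     float3 normal   : TEXCOORD1) : SV_Target
{
    // find the tile this pixel falls into and the start of its light list
    uint2 tileId    = uint2(svPos.xy) / TILE_SIZE;
    uint  listStart = (tileId.y * g_numTiles.x + tileId.x) * (MAX_LIGHTS_PER_TILE + 1);

    uint   lightCount = TileLightLists[listStart];
    float3 lighting   = 0.0f;

    // accumulate only the lights that the culling pass sorted into this tile
    for (uint i = 0; i < lightCount; ++i)
    {
        PointLight light   = Lights[TileLightLists[listStart + 1 + i]];
        float3     toLight = light.position - worldPos;
        float      dist    = max(length(toLight), 0.0001f);
        float      atten   = saturate(1.0f - dist / light.radius);
        lighting += light.color * atten * saturate(dot(normal, toLight / dist));
    }

    return float4(lighting, 1.0f);
}

Compared to Tiled-Deferred, the per-pixel surface attributes come straight from the rasterizer instead of a G-Buffer, which is what keeps transparent and opaque geometry consistent, at the cost of the geometry re-submission mentioned above.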

Filling a G-Buffer or, in the case of Forward+, re-submitting geometry is expensive. For the Deferred Lighting implementations, the G-Buffer fill was also the stage where the visibility of geometry was solved (there is also the option of a Z pre-pass, which means geometry is submitted at least one more time).
With modern 4k displays and high-res devices like Tablets and smart phones, a G-Buffer is not a feasible solution anymore. When the Light Pre-Pass was developed a 1280x720 resolution was considered state of the art. Today 1080p is considered the minimum resolution, iOS and Android devices have resolutions several times this size and even modern PC monitors can have more than 4K resolution.
MSAA increases the size and therefore cost manifold.

Instead of rendering geometry into three or four render targets with overdraw (or re-submitting it after the Z pre-pass), we need to find a way to store visibility data separately, in a much smaller buffer and in a more efficient way.
In other words, if we could capture the full-screen visibility of geometry in as small a footprint as possible, we could significantly reduce the cost of geometry submission and pixel overdraw afterwards.

A first idea of how this could be done is described in the article "The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading" by Christopher A. Burns et al. The article outlines the idea of storing per-triangle visibility data in a visibility buffer.
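As a rough illustration of the idea - a minimal sketch under my own assumptions, not the implementation from the paper - the geometry pass would write nothing but a packed draw/triangle ID per pixel into a single 32-bit target; attribute fetching and shading happen later from this small buffer. The packing scheme and the names below are made up for illustration.

// Visibility pass: a single R32_UINT render target instead of a fat G-Buffer.
// SV_PrimitiveID identifies the triangle within the draw; the draw ID is
// provided by the application, here through a per-draw constant buffer.

cbuffer PerDraw : register(b0)
{
    uint g_drawId; // which draw call / mesh instance this is
};

// pack the draw ID into the upper bits and the triangle ID into the lower bits
uint PackVisibility(uint drawId, uint triangleId)
{
    return (drawId << 20) | (triangleId & 0xFFFFF); // assumes <= 4096 draws, <= ~1M triangles per draw
}

uint VisibilityPS(float4 svPos      : SV_Position,
                  uint   triangleId : SV_PrimitiveID) : SV_Target
{
    return PackVisibility(g_drawId, triangleId);
}

A later full-screen pass can then load the ID, fetch the three vertices of that triangle, reconstruct the barycentrics and shade, so the per-pixel footprint stays at 4 bytes no matter how many material attributes are involved.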









Thursday, April 9, 2015

Introduction to Resource Binding in Microsoft DirectX* 12

I spent some time writing an article that explains resource binding in DirectX 12. When I looked at this for the first time, I had a tough time getting my head around resource binding ... so I am hoping this article makes it easier for others to understand. Let me know in the comments ...

https://software.intel.com/en-us/articles/introduction-to-resource-binding-in-microsoft-directx-12

Monday, January 12, 2015

Reloaded: Compute Shader Optimizations for AMD GPUs: Parallel Reduction

After nearly a year it was time to revisit the last blog entry. The source code of the example implementation was still on one of my hard drives and needed to be cleaned up and released, which I had planned for the first quarter of last year.
I also received a comment highlighting a few mistakes I made in the previous blog post, and on top of that I wanted to add numbers for other GPUs as well.

While looking at the code, the few hours I had reserved for the task turned into a day and then a bit more. On top of that, getting some time off from my project management duties at Confetti was quite enjoyable :-)

In the previous blog post I forgot to mention that I used Intel's GPA to measure all the performance numbers. Several runs of the performance profiler always generated slightly different results, but I felt the overall direction became clear.
My current setup uses the latest AMD driver, 14.12.

All the source code can be found at

https://code.google.com/p/graphicsdemoskeleton/source/browse/#svn%2Ftrunk%2F04_DirectCompute%20Parallel%20Reduction%20Case%20Study

Comparing the current performance numbers with the setup from the previous post, it becomes obvious that not much has changed for the first three columns. Here is the new chart:

Latest Performance numbers from January 2015

In the fourth column ("Pre-fetching two color values into TGSM with 64 threads"), the numbers for the 6770 are nearly cut in half, while they stay roughly the same for the other cards; there is only a slight improvement on the 290X. This is the first shader that fetches two values from device memory, converts them to luminance, stores them in shared memory and then kicks off the Parallel Reduction.
Here is the source code.

StructuredBuffer<float4> Input : register( t0 );
RWTexture2D<float> Result : register (u0);

#define THREADX 8
#define THREADY 16

cbuffer cbCS : register(b0)
{
int c_height : packoffset(c0.x);
int c_width : packoffset(c0.y); // size view port
/*
This is in the constant buffer as well but not used in this shader, so I just keep it in here as a comment
float c_epsilon : packoffset(c0.z); // julia detail  
int c_selfShadow : packoffset(c0.w);  // selfshadowing on or off  
float4 c_diffuse : packoffset(c1); // diffuse shading color
float4 c_mu : packoffset(c2); // julia quaternion parameter
float4x4 rotation : packoffset(c3);
float zoom : packoffset(c7.x);
*/
};

//
// the following shader applies parallel reduction to an image and converts it to luminance
//
#define groupthreads THREADX * THREADY
groupshared float sharedMem[groupthreads];

[numthreads(THREADX, THREADY, 1)]
void PostFX( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex  )
{
const float4 LumVector = float4(0.2125f, 0.7154f, 0.0721f, 0.0f);

// thread groups in x is 1920 / 16 = 120
// thread groups in y is 1080 / 16 = 68
// index in x (1920) goes from 0 to 119 | 120 (thread groups) * 8 (threads) = 960 indices in x
// index in y (1080) goes from 0 to 67 | 68 (thread groups) * 16 (threads) = 1088 indices in y
uint idx = ((DTid.x * 2) + DTid.y * c_width);

  // 1920 * 1080 = 2073600 pixels
 // 120 * 68 * 128(number of threads : 8 * 16) * 2 (number of fetches) = 2088960
 float temp = (dot(Input[idx], LumVector) + dot(Input[idx + 1], LumVector));
sharedMem[GI] = temp;
// wait until everything is transferred from device memory to shared memory
GroupMemoryBarrierWithGroupSync();

// hard-coded for 128 threads
if (GI < 64)
sharedMem[GI] += sharedMem[GI + 64];
GroupMemoryBarrierWithGroupSync();
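// Note: the remaining steps deliberately omit GroupMemoryBarrierWithGroupSync();
// they rely on the 64 threads of an AMD wavefront executing in lockstep, so the
// partial sums written by the upper half of the wavefront are visible to the
// lower half without an explicit barrier (hardware with a narrower wave width
// would need barriers or a different ending here).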

if (GI < 32) sharedMem[GI] += sharedMem[GI + 32];
if (GI < 16) sharedMem[GI] += sharedMem[GI + 16];
if (GI < 8) sharedMem[GI] += sharedMem[GI + 8];
if (GI < 4) sharedMem[GI] += sharedMem[GI + 4];
if (GI < 2) sharedMem[GI] += sharedMem[GI + 2];
if (GI < 1) sharedMem[GI] += sharedMem[GI + 1];

// Have the first thread write out to the output
if (GI == 0)
{
// write out the result for each thread group
Result[Gid.xy] = sharedMem[0] / (THREADX * THREADY * 2);
}
}

The grid size in x and y is 1920 / 16 and 1080 / 16. In other words, this is the number of thread groups kicked off by the dispatch call.

The next shader extends the idea by fetching four instead of two values from device memory.

// thread groups in x is 1920 / 16 = 120
// thread groups in y is 1080 / 16 = 68
// index in x (1920) goes from 0 to 119 | 120 (thread groups) * 4 (threads) = 480 indices in x
// index in y (1080) goes from 0 to 67 | 68 (thread groups) * 16 (threads) = 1088 indices in y
uint idx = ((DTid.x * 4) + DTid.y * c_width);

// 1920 * 1080 = 2073600 pixels
// 120 * 68 * 64 (number of threads : 4 * 16) * 4 (number of fetches) = 2088960
float temp = (dot(Input[idx], LumVector) + dot(Input[idx + 1], LumVector))
      + (dot(Input[idx + 2], LumVector) + dot(Input[idx + 3], LumVector));

// store in shared memory 
sharedMem[IndexOfThreadInGroup] = temp;

// wait until everything is transferred from device memory to shared memory
GroupMemoryBarrierWithGroupSync();

if (IndexOfThreadInGroup < 32) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 32];
if (IndexOfThreadInGroup < 16) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 16];
if (IndexOfThreadInGroup < 8) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 8];
if (IndexOfThreadInGroup < 4) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 4];
if (IndexOfThreadInGroup < 2) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 2];
if (IndexOfThreadInGroup < 1) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 1];

Looking at the performance results ("Pre-fetching four color values into TGSM with 64 threads"), the difference in the performance numbers is not significant. This seems to be the first sign that the shader might be limited by read memory bandwidth: just reading the 1080p memory area takes the longest time.

While all the previous shaders were writing the reduced image into a 120 x 68 area, the following two shaders in the chart write into a 60 x 34 area. This is mostly achieved by decreasing the grid size, or in other words running fewer thread groups. To make up for the decrease in grid size, the size of each thread group had to be increased to 256 and then 512 threads.

#define THREADX 8
#define THREADY 32

... // more code here

// thread groups in x is 1920 / 32 = 60
// thread groups in y is 1080 / 32 = 34
// index in x (1920) goes from 0 to 59 | 60 (thread groups) * 8 (threads) = 480 indices in x
// index in y (1080) goes from 0 to 33 | 34 (thread groups) * 32 (threads) = 1088 indices in y
uint idx = ((DTid.x * 4) + DTid.y * c_width);

// 1920 * 1080 = 2073600 pixels
// 60 * 34 * 256 (number of threads : 8 * 32) * 4 (number of fetches) = 2088960
float temp = (dot(Input[idx], LumVector) + dot(Input[idx + 1], LumVector))
        + (dot(Input[idx + 2], LumVector) + dot(Input[idx + 3], LumVector));

// store in shared memory 
sharedMem[IndexOfThreadInGroup] = temp;

// wait until everything is transferred from device memory to shared memory
GroupMemoryBarrierWithGroupSync();

// hard-coded for 256 threads
if (IndexOfThreadInGroup < 128)
sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 128];
GroupMemoryBarrierWithGroupSync();

if (IndexOfThreadInGroup < 64)
sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 64];
GroupMemoryBarrierWithGroupSync();

if (IndexOfThreadInGroup < 32) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 32];
if (IndexOfThreadInGroup < 16) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 16];
if (IndexOfThreadInGroup < 8) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 8];
if (IndexOfThreadInGroup < 4) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 4];
if (IndexOfThreadInGroup < 2) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 2];
if (IndexOfThreadInGroup < 1) sharedMem[IndexOfThreadInGroup] += sharedMem[IndexOfThreadInGroup + 1];
... // more code here

The next shader decreases the grid size even more and increases the number of threads per thread group to 1024, the current maximum that the Direct3D run-time allows. For both shaders ("Pre-fetching four color values into TGSM with 1024 threads" and "Pre-fetching four color values into 2x TGSM with 1024 threads"), the performance numbers do not change much compared to the previous shaders, although the reduction has to do more work because the dimensions of the target area halve in each direction. Here is the source code for the second of the two shaders, which fetches four color values with 1024 threads per thread group:

#define THREADX 16
#define THREADY 64
//.. constant buffer code here
//
// the following shader applies parallel reduction to an image and converts it to luminance
//
#define groupthreads THREADX * THREADY
groupshared float sharedMem[groupthreads * 2]; // double the number of shared mem slots

[numthreads(THREADX, THREADY, 1)]
void PostFX( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex  )
{
const float4 LumVector = float4(0.2125f, 0.7154f, 0.0721f, 0.0f);

// thread groups in x is 1920 / 64 = 30
// thread groups in y is 1080 / 64 = 17
// index in x (1920) goes from 0 to 29 | 30 (thread groups) * 16 (threads) = 480 indices in x
// index in y (1080) goes from 0 to 16 | 17 (thread groups) * 64 (threads) = 1088 indices in y
uint idx = ((DTid.x * 4) + DTid.y * c_width); // index into structured buffer

// 1920 * 1080 = 2073600 pixels
// 30 * 17 * 1024 (number of threads : 16 * 64) * 4 (number of fetches) = 2088960
uint idSharedMem = GI * 2;
sharedMem[idSharedMem] = (dot(Input[idx], LumVector) + dot(Input[idx + 1], LumVector));
sharedMem[idSharedMem + 1] = (dot(Input[idx + 2], LumVector) + dot(Input[idx + 3], LumVector));

// wait until everything is transferred from device memory to shared memory
GroupMemoryBarrierWithGroupSync();

// hard-coded for 1024 threads
if (GI < 1024)
sharedMem[GI] += sharedMem[GI + 1024];
GroupMemoryBarrierWithGroupSync();

if (GI < 512)
sharedMem[GI] += sharedMem[GI + 512];
GroupMemoryBarrierWithGroupSync();

if (GI < 256)
sharedMem[GI] += sharedMem[GI + 256];
GroupMemoryBarrierWithGroupSync();

if (GI < 128)
sharedMem[GI] += sharedMem[GI + 128];
GroupMemoryBarrierWithGroupSync();

if (GI < 64)
sharedMem[GI] += sharedMem[GI + 64];
GroupMemoryBarrierWithGroupSync();

if (GI < 32) sharedMem[GI] += sharedMem[GI + 32];
if (GI < 16) sharedMem[GI] += sharedMem[GI + 16];
if (GI < 8) sharedMem[GI] += sharedMem[GI + 8];
if (GI < 4) sharedMem[GI] += sharedMem[GI + 4];
if (GI < 2) sharedMem[GI] += sharedMem[GI + 2];
if (GI < 1) sharedMem[GI] += sharedMem[GI + 1];
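// (the final step - thread 0 dividing sharedMem[0] by the number of summed
// pixels and writing it to the output target, as in the first listing - is
// omitted here)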

One thing I wanted to try here was to utilize double the amount of shared memory and thereby saturate the 1024 threads better by having the first addition happen in shared memory. In the end that didn't change much, because the shader doesn't use many temp registers anyway, so replacing a temp register with shared memory didn't increase performance much.

My last test was aimed at fetching 16 color values while decreasing the 1080p image to a 15 x 9 area. The result is shown in the last column. This shader also uses 1024 threads and fetches into 2x the shared memory like the previous one. It runs slower than the previous shaders. Here is the source code:

#define THREADX 16
#define THREADY 64
//.. some constant buffer code here 
//
// the following shader applies parallel reduction to an image and converts it to luminance
//
#define groupthreads THREADX * THREADY
groupshared float sharedMem[groupthreads * 2]; // double the number of shared mem slots

[numthreads(THREADX, THREADY, 1)]
void PostFX( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex  )
{
const float4 LumVector = float4(0.2125f, 0.7154f, 0.0721f, 0.0f);

// thread groups in x is 1920 / 128 = 15
// thread groups in y is 1080 / 128 = 9
// index in x (1920) goes from 0 to 14 | 15 (thread groups) * 16 (threads) = 240 indices in x | need to fetch 8 in x direction
// index in y (1080) goes from 0 to 8 | 9 (thread groups) * 64 (threads) = 576 indices in y | need to fetch 2 in y direction

uint idx = ((DTid.x * 8) + (DTid.y * 2) * c_width); // index into structured buffer

// 1920 * 1080 = 2073600 pixels
// 15 * 9 * 1024 (number of threads : 16 * 64) * 16 (number of fetches : 8 in x, 2 in y) = 2211840
uint idSharedMem = GI * 2;
sharedMem[idSharedMem] = (dot(Input[idx], LumVector) 
+ dot(Input[idx + 1], LumVector) 
+ dot(Input[idx + 2], LumVector) 
+ dot(Input[idx + 3], LumVector)
+ dot(Input[idx + 4], LumVector)
+ dot(Input[idx + 5], LumVector) 
+ dot(Input[idx + 6], LumVector)
+ dot(Input[idx + 7], LumVector));
sharedMem[idSharedMem + 1] = (dot(Input[idx + c_width], LumVector)
+ dot(Input[idx + c_width + 1], LumVector)
+ dot(Input[idx + c_width + 2], LumVector)
+ dot(Input[idx + c_width + 3], LumVector)
+ dot(Input[idx + c_width + 4], LumVector)
+ dot(Input[idx + c_width + 5], LumVector)
+ dot(Input[idx + c_width + 6], LumVector)
+ dot(Input[idx + c_width + 7], LumVector)); // second row of the 8x2 pixel block covered by this thread

// wait until everything is transferred from device memory to shared memory
GroupMemoryBarrierWithGroupSync();

// hard-coded for 1024 threads
if (GI < 1024)
sharedMem[GI] += sharedMem[GI + 1024];
GroupMemoryBarrierWithGroupSync();

if (GI < 512)
sharedMem[GI] += sharedMem[GI + 512];
GroupMemoryBarrierWithGroupSync();

if (GI < 256)
sharedMem[GI] += sharedMem[GI + 256];
GroupMemoryBarrierWithGroupSync();

if (GI < 128)
sharedMem[GI] += sharedMem[GI + 128];
GroupMemoryBarrierWithGroupSync();

if (GI < 64)
sharedMem[GI] += sharedMem[GI + 64];
GroupMemoryBarrierWithGroupSync();

if (GI < 32) sharedMem[GI] += sharedMem[GI + 32];
if (GI < 16) sharedMem[GI] += sharedMem[GI + 16];
if (GI < 8) sharedMem[GI] += sharedMem[GI + 8];
if (GI < 4) sharedMem[GI] += sharedMem[GI + 4];
if (GI < 2) sharedMem[GI] += sharedMem[GI + 2];
if (GI < 1) sharedMem[GI] += sharedMem[GI + 1];

Looking at all those numbers, it seems that performance is mostly limited by how fast the 1080p source buffer can be read. At the moment I would predict that reducing the source resolution to 720p or 480p would lead to a more differentiated view of the performance of the different approaches. Maybe something to try in the future ...