
Clustered shading thoughts

Started by J. Rakocevic
17 comments, last by J. Rakocevic 4 years, 1 month ago

I'd like to thank everyone for your help so far. I managed to do everything up to a certain point.

A list of lights is sent into the function. Every light is projected to a clip-space AABB as per the given advice. These dimensions are then converted into the indices of the clusters the light spans. For X and Y it goes as follows (30 slices in X, 17 in Y, 16 in Z):

clipSpace.x * xNumSlices results in -30 to 30 for [-1, 1] clip space and 30 slices along X, but we need the [0, 30] range, so:
((clipSpace.x + 1.f) * 0.5f) brings it to the [0, 1] range first, which then scales to slices 0-30. Same process for Y.
For the Z min and max indices you use the same function as in the pixel/fragment shader to determine a pixel's Z slice, applied to both the Z min and Z max values. A sketch of the whole mapping follows below.
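
Here is a minimal sketch of that mapping. The Z function here assumes view-space depth and the common logarithmic slicing scheme; all names are illustrative, so substitute whatever your own shader actually uses so both sides agree:

#include <algorithm>
#include <cmath>

constexpr int xSlices = 30, ySlices = 17, zSlices = 16;

// Map a clip-space coordinate in [-1, 1] to a slice index in [0, slices - 1].
int ClipToSlice(float clip, int slices)
{
	float t = (clip + 1.f) * 0.5f;            // remap [-1, 1] -> [0, 1]
	int slice = static_cast<int>(t * slices); // scale to [0, slices]
	return std::clamp(slice, 0, slices - 1);  // clip == 1 would otherwise land on 'slices'
}

// Map a view-space depth to its Z slice, assuming logarithmic slicing.
// This must match the function used in the pixel/fragment shader.
int ViewZToSlice(float viewZ, float zNear, float zFar)
{
	float s = std::log(viewZ / zNear) / std::log(zFar / zNear) * zSlices;
	return std::clamp(static_cast<int>(s), 0, zSlices - 1);
}

The light's cluster range along X is then [ClipToSlice(aabbMin.x, xSlices), ClipToSlice(aabbMax.x, xSlices)], and likewise for Y and Z.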
This is all good and working; the math is pretty fast (I'm testing with 125 lights on the CPU for now, but even without multithreading this is almost instant).

My problem now is the following. I have a “cluster index span” for every light. But the GPU needs an array of light indices that are grouped by cluster. I can't just add them willy-nilly into a container; it has to be sorted so the indices for a single cluster are all contiguous.

I was wondering how people do this in a performant way. I thought about pooling them in random order as key=>value pairs, with the cluster index as the key and the light index as the value, and then sorting by the key. But given how many cluster-light assignment pairs you might have, that seems like a lot of memory and a possibly long array to sort. Consider the case where you are standing inside a lit area: that light collides with several of the first Z slices, which are tiny, so you potentially have hundreds of pairs.

Conceptually a map would do great here, but dynamic allocation at this scale is not acceptable. So, if you didn't have the luxury of compute shader parallelism, how would you approach sorting the list? (A rough sketch of the packed-pair idea is below.)
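
One way to make the key=>value idea cheap is a single flat array of packed integers, sorted once per frame; a minimal sketch, assuming a 32-bit key and illustrative names and limits:

#include <algorithm>
#include <cstdint>
#include <vector>

// Pack the cluster index into the high bits so a plain integer sort
// groups all pairs of the same cluster together.
// Assumes < 65536 clusters and < 65536 lights; adjust the shift to taste.
inline uint32_t PackPair(uint32_t cluster, uint32_t light)
{
	return (cluster << 16) | light;
}

void SortPairs(std::vector<uint32_t>& pairs) // reserve()d once and reused each frame
{
	std::sort(pairs.begin(), pairs.end());
	// A linear scan now yields contiguous per-cluster ranges:
	// cluster = pairs[i] >> 16, light = pairs[i] & 0xFFFF.
}

With 30*17*16 = 8160 clusters the key fits easily, and a radix sort could replace std::sort if it ever shows up in a profile.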

Edit - I found this: https://pastebin.com/FDqGvpGE which is very similar to what I'm doing. I could try what they did, but a vector of vectors seems cache-unfriendly and possibly leads to a ton of small vectors resizing and a lot of overhead.


J. Rakocevic said:
Edit - I found this: https://pastebin.com/FDqGvpGE which is very similar to what I'm doing. I could try what they did, but a vector of vectors seems cache-unfriendly and possibly leads to a ton of small vectors resizing and a lot of overhead.

Yeah, a vector of vectors is not optimal, but it was way fast enough in my case (which I tested with a max of 2000 lights evenly spaced around the Sponza atrium scene). Since I made that whole construct static, the overhead becomes a little better after the first execution, as the whole vector<vector<>> construct will likely fit into the cache.
You could make a vector<array<>> if you can compromise on a maximum number of lights, or simply sacrifice a ton of memory by choosing an enormous max. With something like 50 lights per cell, the entire construct could still fit into the L3 cache. But as I said, I didn't try any of that so far, so I can't say how it really compares. (Rough sketch below.)
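
A minimal sketch of that fixed-capacity idea, using the 50-lights-per-cell figure from above as the cap; the grid dimensions are the ones from the original post, all names are illustrative, and overflowing lights are simply dropped here:

#include <array>
#include <cstdint>
#include <vector>

constexpr int MAX_LIGHTS_PER_CELL = 50;

struct Cell
{
	uint16_t count = 0;
	std::array<uint16_t, MAX_LIGHTS_PER_CELL> lights; // fixed storage, no per-cell allocation
};

// Allocated once and reused every frame; at ~102 bytes per cell,
// 30*17*16 = 8160 cells come to under 1 MB, which fits a typical L3.
std::vector<Cell> grid(30 * 17 * 16);

void AddLightToCell(int cellIndex, uint16_t lightIndex)
{
	Cell& cell = grid[cellIndex];
	if (cell.count < MAX_LIGHTS_PER_CELL) // at the cap: drop rather than overrun
		cell.lights[cell.count++] = lightIndex;
}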

EDIT: Now that I think about it a bit more, you could also just use one vector/stack, append each light+cell pair to that one container, and after all lights are processed, sort it before writing the elements to the GPU.

It's the binning to buckets I've talked about before…

1. For each object, calculate its intersected range, increase the atomic bucket counters, and store the range of occupied buckets plus the returned atomic count in per-object data.
2. Prefix sum over the bucket counters to get the offsets and total size of the compacted per-bucket lists.
3. Per object, for each of its buckets, store the object into the list at bucket offset + stored object counter.

I'll try some pseudocode, but I'll do it in 1D (only the X dimension) to keep it simple:


struct Froxel
{
	int count;
	int listStart;
} froxels[1000+1]; // we have 1000 froxels or 'buckets' (+1 so the last froxel has an end offset).

void BinToFroxels (std::vector<Light>& lights)
{
	SetFroxelVariablesToZero();
	
	// 1. increase bucket counters
	
	for (int i=0; i<lights.size(); i++)
	{
		lights[i].CalcIntersectedRange();
		
		for (int x=lights[i].rangeXmin; x < lights[i].rangeXmax; x++)
		{
			froxels[x].count++; // would be atomic on GPU. But I realize lights can cover almost all cells, so I cannot store the counts in the lights as proposed above in step 1.
		}
	}
	
	
	// 2. Prefix sum over bucket counters to get offsets and space of compacted per bucket lists.
	
	int listStart = 0;
	for (int i=0; i<1000; i++)
	{
		listStart += froxels[i].count;
		froxels[i+1].listStart = listStart;
	}
	
	// now we know how much memory and the per froxel offsets for our compacted list of lights.
	// For an example of how to implement such a simple 'prefix sum' on GPU in parallel, see e.g. the OpenGL SuperBible chapter about compute shaders.
	
	std::vector<int> lists (listStart);
	
	// 3. insert lights to the list

	for (int i=0; i<lights.size(); i++)
	{
		/// lights[i].CalcIntersectedRange(); // Edit: accidentally copy-pasted this line
		
		for (int x=lights[i].rangeXmin; x < lights[i].rangeXmax; x++)
		{
			int listStart = froxels[x].listStart;
			int listOffset = --froxels[x].count; // atomic decrement on GPU
			lists[listStart + listOffset] = i;
		}
	}
	
	/*
	We are done. Notice froxels[x].count is now zero everywhere, so we would not need to store and keep this data.
	To iterate the lights of froxel 10, we can do it simply like:
	
	int start = froxels[10].listStart;
	int end = froxels[10+1].listStart;
	
	for (int i=start; i<end; i++)
	{
		int lightIndex = lists[i];
	}
	*/

}

I hope I did not introduce a bug.

But you see, the idea is to iterate over the lights twice to get rid of the terribly slow vector of vectors, dynamic allocation, or sorting alternatives.

Juliean said:
but it was way fast enough in my case

shame on you ; )

I had a feeling this was exactly what you were talking about, because it's the part I was missing, and what I was imagining kind of resembled your explanation, but I just couldn't get it. Thank you both.

Juliean is right: I did some math, and even my pretty mediocre i5 could fit this entire vec<vec> structure into L3, even with vector overhead accounted for (I think about 16 bytes per vector). But I'll give binning a go and see if I get it. I like learning a new approach. It's a curse to always want to do things the more complicated way, but hey.

@joej Have you maybe swapped these two lines?

froxel[i+1].listStart = listStart; 
listStart += froxels[i].count;

J. Rakocevic said:
@joej Have you maybe swapped these two lines?

uhh yes - fixed it.

Well, although I have implemented it “successfully”, some grid-like artefacts are showing up :( Sad life, but what can you do.

Thanks everyone, you have been a big help. Conceptually it all works, and if someone followed this thread from start to finish, I reckon it would help them a lot as well. The artefacts will be destroyed with extreme prejudice once I figure out what's causing them. It's quite a big technique to implement, I'd say, with lots of tricky bits along the way (I'm looking at you, Microsoft structured buffer documentation!), but it's really nifty to use once you have it. I also learned a bit of multithreading on the side, so all in all it was a blast!

Proudly presenting, in what is probably the lamest possible way to demonstrate this technique, 16 white lights evenly spaced out over a flat plateau. Still happy with it ngl.

