
Clustered shading thoughts

Started by J. Rakocevic
17 comments, last by J. Rakocevic 4 years, 1 month ago

I'd like to thank everyone for your help so far. I managed to do everything up to a certain point.

A list of lights is sent into the function. Every light is projected to a clip-space AABB as per the given advice. These dimensions are then converted into the indices of the clusters the light spans. For X and Y it goes as follows (30 slices in X, 17 in Y, 16 in Z):

clipSpace.x * xNumSlices results in -30 to 30 for [-1, 1] clip space and 30 slices along X, but we need the [0, 30] range, so:
((clipSpace.x + 1.f) * 0.5f) brings it to the [0, 1] range first, which then scales to slices 0-30. Same process for Y.
For the Z min and max indices you use the same function as in the pixel/fragment shader to determine a pixel's Z slice, applied to both the Z min and Z max values. A sketch of the whole mapping follows below.
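
Here is a minimal sketch of that mapping. The Z function here assumes view-space depth and the common logarithmic slicing scheme; all names are illustrative, so substitute whatever your own shader actually uses so both sides agree:

#include <algorithm>
#include <cmath>

constexpr int xSlices = 30, ySlices = 17, zSlices = 16;

// Map a clip-space coordinate in [-1, 1] to a slice index in [0, slices - 1].
int ClipToSlice(float clip, int slices)
{
	float t = (clip + 1.f) * 0.5f;            // remap [-1, 1] -> [0, 1]
	int slice = static_cast<int>(t * slices); // scale to [0, slices]
	return std::clamp(slice, 0, slices - 1);  // clip == 1 would otherwise land on 'slices'
}

// Map a view-space depth to its Z slice, assuming logarithmic slicing.
// This must match the function used in the pixel/fragment shader.
int ViewZToSlice(float viewZ, float zNear, float zFar)
{
	float s = std::log(viewZ / zNear) / std::log(zFar / zNear) * zSlices;
	return std::clamp(static_cast<int>(s), 0, zSlices - 1);
}

The light's cluster range along X is then [ClipToSlice(aabbMin.x, xSlices), ClipToSlice(aabbMax.x, xSlices)], and likewise for Y and Z.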
This is all good and working; the math is pretty fast (I'm testing with 125 lights on the CPU for now, but even without multithreading this is almost instant).

My problem now is the following. I have a “cluster index span” for every light. But the GPU needs an array of light indices that are grouped by cluster. I can't just add them willy-nilly into a container; it has to be sorted so the indices for a single cluster are all contiguous.

I was wondering how people do this in a performant way. I thought about pooling them in random order as key=>value pairs, with the cluster index as the key and the light index as the value, and then sorting by the key. But given how many cluster-light assignment pairs you might have, that seems like a lot of memory and a possibly long array to sort. Consider the case where you are standing inside a lit area: that light collides with several of the first Z slices, which are tiny, so you potentially have hundreds of pairs.

Conceptually a map would do great here, but dynamic allocation at this scale is not acceptable. So, if you didn't have the luxury of compute shader parallelism, how would you approach sorting the list? (A rough sketch of the packed-pair idea is below.)
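
One way to make the key=>value idea cheap is a single flat array of packed integers, sorted once per frame; a minimal sketch, assuming a 32-bit key and illustrative names and limits:

#include <algorithm>
#include <cstdint>
#include <vector>

// Pack the cluster index into the high bits so a plain integer sort
// groups all pairs of the same cluster together.
// Assumes < 65536 clusters and < 65536 lights; adjust the shift to taste.
inline uint32_t PackPair(uint32_t cluster, uint32_t light)
{
	return (cluster << 16) | light;
}

void SortPairs(std::vector<uint32_t>& pairs) // reserve()d once and reused each frame
{
	std::sort(pairs.begin(), pairs.end());
	// A linear scan now yields contiguous per-cluster ranges:
	// cluster = pairs[i] >> 16, light = pairs[i] & 0xFFFF.
}

With 30*17*16 = 8160 clusters the key fits easily, and a radix sort could replace std::sort if it ever shows up in a profile.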

Edit - I found this: https://pastebin.com/FDqGvpGE which is very similar to what I'm doing. I could try what they did, but a vector of vectors seems cache-unfriendly and possibly leads to a ton of small vectors resizing and a lot of overhead.


J. Rakocevic said:
Edit - I found this: https://pastebin.com/FDqGvpGE which is very similar to what I'm doing. I could try what they did, but a vector of vectors seems cache-unfriendly and possibly leads to a ton of small vectors resizing and a lot of overhead.

Yeah, a vector of vectors is not optimal, but it was way fast enough in my case (which I tested with a max of 2000 lights evenly spaced around the Sponza atrium scene). Since I made that whole construct static, the overhead becomes a little better after the first execution, as the whole vector<vector<>> construct will likely fit into the cache.
You could make a vector<array<>> if you can compromise on a maximum number of lights, or simply sacrifice a ton of memory by choosing an enormous max. With something like 50 lights per cell, the entire construct could still fit into the L3 cache. But as I said, I didn't try any of that so far, so I can't say how it really compares. (Rough sketch below.)
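
A minimal sketch of that fixed-capacity idea, using the 50-lights-per-cell figure from above as the cap; the grid dimensions are the ones from the original post, all names are illustrative, and overflowing lights are simply dropped here:

#include <array>
#include <cstdint>
#include <vector>

constexpr int MAX_LIGHTS_PER_CELL = 50;

struct Cell
{
	uint16_t count = 0;
	std::array<uint16_t, MAX_LIGHTS_PER_CELL> lights; // fixed storage, no per-cell allocation
};

// Allocated once and reused every frame; at ~102 bytes per cell,
// 30*17*16 = 8160 cells come to under 1 MB, which fits a typical L3.
std::vector<Cell> grid(30 * 17 * 16);

void AddLightToCell(int cellIndex, uint16_t lightIndex)
{
	Cell& cell = grid[cellIndex];
	if (cell.count < MAX_LIGHTS_PER_CELL) // at the cap: drop rather than overrun
		cell.lights[cell.count++] = lightIndex;
}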

EDIT: Now that I think about it a bit more, you could also just use one vector/stack, append each light+cell pair to that one container, and after all lights are processed, sort it before writing the elements to the GPU.

It's the binning to buckets I've talked about before…

1. For each object, calculate its intersected range, increase the atomic bucket counters, and store the range of occupied buckets plus the returned atomic count in per-object data.
2. Prefix sum over the bucket counters to get the offsets and total size of the compacted per-bucket lists.
3. Per object, for each of its buckets, store the object into the list at bucket offset + stored object counter.

I'll try some pseudocode, but I'll do it in 1D (only the X dimension) to keep it simple:


struct Froxel
{
	int count;
	int listStart;
} froxels[1000+1]; // we have 1000 froxels or 'buckets' (+1 so the last froxel has an end offset).

void BinToFroxels (std::vector<Light>& lights)
{
	SetFroxelVariablesToZero();
	
	// 1. increase bucket counters
	
	for (int i=0; i<lights.size(); i++)
	{
		lights[i].CalcIntersectedRange();
		
		for (int x=lights[i].rangeXmin; x < lights[i].rangeXmax; x++)
		{
			froxels[x].count++; // would be atomic on GPU. But I realize lights can cover almost all cells, so I cannot store the counts in the lights as proposed above in step 1.
		}
	}
	
	
	// 2. Prefix sum over bucket counters to get offsets and space of compacted per bucket lists.
	
	int listStart = 0;
	for (int i=0; i<1000; i++)
	{
		listStart += froxels[i].count;
		froxels[i+1].listStart = listStart;
	}
	
	// now we know how much memory and the per froxel offsets for our compacted list of lights.
	// For an example of how to implement such a simple 'prefix sum' on GPU in parallel, see e.g. the OpenGL SuperBible chapter about compute shaders.
	
	std::vector<int> lists (listStart);
	
	// 3. insert lights to the list

	for (int i=0; i<lights.size(); i++)
	{
		/// lights[i].CalcIntersectedRange(); // Edit: accidentally copy-pasted this line
		
		for (int x=lights[i].rangeXmin; x < lights[i].rangeXmax; x++)
		{
			int listStart = froxels[x].listStart;
			int listOffset = --froxels[x].count; // atomic decrement on GPU
			lists[listStart + listOffset] = i;
		}
	}
	
	/*
	We are done. Notice froxels[x].count is now zero everywhere, so we would not need to store and keep this data.
	To iterate the lights of froxel 10, we can do it simply like:
	
	int start = froxels[10].listStart;
	int end = froxels[10+1].listStart;
	
	for (int i=start; i<end; i++)
	{
		int lightIndex = lists[i];
	}
	*/

}

I hope I did not introduce a bug.

But you see, the idea is to iterate over the lights twice to get rid of the terribly slow vector of vectors, dynamic allocation, or sorting alternatives.

Juliean said:
but it was way fast enough in my case

shame on you ; )

I had a feeling this was exactly what you were talking about, because it's the part I was missing, and what I was imagining kind of resembled your explanation, but I just couldn't get it. Thank you both.

Juliean is right: I did some math, and even my pretty mediocre i5 could fit this entire vec<vec> structure into L3, even with vector overhead accounted for (I think about 16 bytes per vector). But I'll give binning a go and see if I get it. I like learning a new approach. It's a curse to always want to do things the more complicated way, but hey.

@joej Have you maybe swapped these two lines?

froxel[i+1].listStart = listStart; 
listStart += froxels[i].count;

J. Rakocevic said:
@joej Have you maybe swapped these two lines?

uhh yes - fixed it.

Well, although I have implemented it “successfully”, some grid-like artefacts are showing up :( Sad life, but what can you do.

Thanks everyone, you have been a big help. Conceptually it all works, and if someone followed this thread from start to finish, I reckon it would help them a lot as well. The artefacts will be destroyed with extreme prejudice once I figure out what's causing them. It's quite a big technique to implement, I'd say, with lots of tricky bits along the way (I'm looking at you, Microsoft structured buffer documentation!), but it's really nifty to use once you have it. I also learned a bit of multithreading on the side, so all in all it was a blast!

Proudly presenting, in what is probably the lamest possible way to demonstrate this technique, 16 white lights evenly spaced out over a flat plateau. Still happy with it ngl.

