
Parallelizing a software rasterizer algorithm using OpenCL

Started by AhmedSaleh
58 comments, last by JoeJ 1 year, 9 months ago

I've written a small software rasterizer using OpenCL and would like to optimize and parallelize it further. Currently I'm scanning the whole screen and checking whether the triangle overlaps each pixel.

I would like to parallelize the loop and do it more efficiently. For example, my idea is to process only the pixels inside each triangle's bounding box?

__kernel void sendImageToPBO(__global uchar4* dst_buffer, __global float* vbo, int vbosize,
    __global int* ibo, int ibosize)

{
    size_t blockIdx = get_group_id(0);
    size_t blockIdy = get_group_id(1);
    size_t blockDimX = get_local_size(0);
    size_t blockDimY = get_local_size(1);
    size_t threadIdX = get_local_id(0);
    size_t threadIdY = get_local_id(1);

    float3 c0 = { 1, 0, 0 };
    float3 c1 = { 0, 1, 0 };
    float3 c2 = { 0, 0, 1 };

    int x = get_global_id(0);
    int y= get_global_id(1);
 


    int imageWidth = 800;
    int imageHeight = 800;



    if (x < vbosize && y < vbosize)
    {

        for (int i = 0; i < vbosize; i += 9)
        {
            float3 v1 = (float3)(vbo[i], vbo[i + 1], vbo[i + 2]);
            float3 v0 = (float3)(vbo[i + 3], vbo[i + 4], vbo[i + 5]);
            float3 v2 = (float3)(vbo[i + 6], vbo[i + 7], vbo[i + 8]);

            float xmin = fmin(v0.x, fmin(v1.x, v2.x));
            float ymin = fmin(v0.y, fmin(v1.y, v2.y));
            float xmax = fmax(v0.x, fmax(v1.x, v2.x));
            float ymax = fmax(v0.y, fmax(v1.y, v2.y));


            // be careful xmin/xmax/ymin/ymax can be negative. Don't cast to unsigned int
            unsigned int x0 = max(0, (int)(floor(xmin)));
            unsigned int x1 = min((int)(imageWidth)-1, (int)(floor(xmax)));
            unsigned int y0 = max(0, (int)(floor(ymin)));
            unsigned int y1 = min((int)(imageHeight)-1, (int)(floor(ymax)));
            
            float3 p = { x + 0.5f, y + 0.5f, 0 };

            float w0 = edgeFunction(v1, v2, p);
            float w1 = edgeFunction(v2, v0, p);
            float w2 = edgeFunction(v0, v1, p);
            if (w0 >= 0 && w1 >= 0 && w2 >= 0) {

                float area = edgeFunction(v0, v1, v2);

                float r = w0 * c0.x + w1 * c1.x + w2 * c2.x;
                float g = w0 * c0.y + w1 * c1.y + w2 * c2.y;
                float b = w0 * c0.z + w1 * c1.z + w2 * c2.z;

                w0 /= area;
                w1 /= area;
                w2 /= area;
                float z = 1 / (w0 * v0.z + w1 * v1.z + w2 * v2.z);
                r *= z; g *= z; b *= z;


                dst_buffer[y * get_global_size(0) + x] = (uchar4)(r * 255, g * 255, b * 255, 255);
            }
            
        }
    }
}


Game Programming is the process of converting dead pictures to live ones .

I see you iterate over all triangles for every pixel, which is not efficient at all.

I see two options:

Iterate over all triangles once and rasterize them to the framebuffer, just like the HW or traditional software rasterizers do it.
To support parallelism, the usual practice (as demonstrated by Dreams or UE5) is to use 64 bit atomic min operations.
The depth goes into the high bits, so the atomic min doubles as a depth test. The remaining bits are used to store triangle data, e.g. instance id, triangle id, barycentric coords.
After that we have a visibility buffer, and we need to shade it in a second pass, see http://filmicworlds.com/blog/visibility-buffer-rendering-with-material-graphs/
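A rough sketch of that packed 64 bit atomic write in OpenCL, assuming the cl_khr_int64_extended_atomics extension is available; the buffer and helper names here are just illustrative:

#pragma OPENCL EXTENSION cl_khr_int64_extended_atomics : enable

// Pack the depth into the high 32 bits so atom_min also performs the depth test.
// For non-negative IEEE floats the bit pattern orders the same way as the value,
// so as_uint(depth) can serve directly as the high part of the key.
inline void writeVisibility(volatile __global ulong* visBuffer, int pixelIndex,
                            float depth, uint triangleId)
{
    ulong key = ((ulong)as_uint(depth) << 32) | (ulong)triangleId;
    atom_min(&visBuffer[pixelIndex], key);
}

The shading pass then reads the winning triangle id per pixel back out of visBuffer.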

The other approach is tile based, but requires an acceleration structure. To reduce its size, ideally we work with clusters of triangles, not single triangles.
For each tile we build a frustum and traverse the AS in loose front to back order to find intersecting clusters.
We raster the cluster to a tile framebuffer in LDS memory, so no need for slow atomics to global VRAM.
Because we traverse in front to back order, we can also implement occlusion culling to quickly reject clusters or whole branches of the AS.

The second approach is very interesting, but also very complex, and I think it has never been done yet.
I might try it in case I decide to use point splatting instead of triangles.
With triangles the complexity is quite high, and I'm not sure we really need insanely high tessellation as demonstrated by UE5 Nanite. They can do it, but I guess games will end up using lower geometry resolution in practice. And if so, no need for a software rasterizer.

Anyway, please share your results, as it's always interesting… ; )

@joej

You always come and answer my questions, many thanks.

For the above code, the main problem that I find is that I iterate over the width*height screen pixels and then check if each pixel overlaps a triangle.

Do you have an optimization or pseudo code to overcome that? Or at least for option 1?

How do I iterate over all the triangles once in OpenCL? Where do I put the kernel code?

The current OpenCL kernel is launched as a 2D NDRange kernel.

Many thanks again

Game Programming is the process of converting dead pictures to live ones .

AhmedSaleh said:
Do you have an optimization or pseudo code to overcome that? Or at least for option 1?

No. The problem isn't that simple. If you simply want to iterate pixels and find stuff to cover it, the efficient solution is raytracing, requiring an acceleration structure too.

We could for example bin the triangles to tiles (basically a low res framebuffer). Then, when processing a tile, you only rasterize the triangles which are likely visible in it.
But such binning isn't trivial: We don't know how many triangles go into a tile, and one triangle goes into an unknown number of tiles. So we get a problem which would require a form of dynamic allocation like std::vector, which can't be done efficiently on the GPU.

Well, people often think raytracing is more complex than rasterization, but it's the opposite. Especially if we want parallelism.
If you want efficiency, complexity will grow a lot, causing something like 10 x more code than you have now or more. : /

I remember a ‘new’ approach of rasterization which was introduced with simd CPUs. Set up 3 edge equations, make a bounding rectangle over the triangle, and for each pixel evaluate all 3 equations in parallel ‘for free’ due to simd instructions.
That's similar to what you do (as I understand), and much simpler than the traditional scanline approach used in early 3D games. It also seems more parallel at first.
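For illustration, a minimal single-threaded sketch of that bounding-rectangle idea, assuming an edgeFunction helper like the one used in the first post:

void rasterTriangleBBox(float3 v0, float3 v1, float3 v2,
                        int imageWidth, int imageHeight, __global uchar4* dst)
{
    // Clamp the triangle's screen-space bounding rectangle to the image.
    int x0 = max(0, (int)floor(fmin(v0.x, fmin(v1.x, v2.x))));
    int x1 = min(imageWidth - 1, (int)ceil(fmax(v0.x, fmax(v1.x, v2.x))));
    int y0 = max(0, (int)floor(fmin(v0.y, fmin(v1.y, v2.y))));
    int y1 = min(imageHeight - 1, (int)ceil(fmax(v0.y, fmax(v1.y, v2.y))));

    // Only the bounding rectangle is visited, not the whole screen.
    for (int y = y0; y <= y1; ++y)
    for (int x = x0; x <= x1; ++x)
    {
        float3 p = (float3)(x + 0.5f, y + 0.5f, 0.0f);
        if (edgeFunction(v1, v2, p) >= 0 &&
            edgeFunction(v2, v0, p) >= 0 &&
            edgeFunction(v0, v1, p) >= 0)
            dst[y * imageWidth + x] = (uchar4)(255, 255, 255, 255);
    }
}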
But interestingly, looking at the Nanite source code, they came back to the scanline approach afaict. One thread rasterizes one triangle that way. This means high runtime divergence across threads, minimized only by the fact that all their software triangles are guaranteed to be small.
One workgroup rasterizes one cluster of triangles, and there are background threads performing a hacky and imperfect but simple and good enough approach of occlusion culling, using the reprojected depth of the previous frame.
Those background threads are persistent and implement a multi producer / multi consumer queue on the GPU, which is not even specified to work by the APIs, but they found it does work on any HW they tried.
That's how they get parallelism (background threads produce clusters to draw, rasterizer threads consume them), which is quite adventurous, but also quite complex.
There is a Nanite Deep Dive video on YT worth watching. Dreams also had a presentation and paper, really interesting:



@joej

Iterate over all triangles once and rasterize them to the framebuffer, just like the HW or traditional software rasterizers do it.

Should I write two kernels, one that processes all the triangles into a linear buffer, and another that draws the pixels from it?

Can you elaborate more on option 1 please? Even pseudo code or just the idea.

Game Programming is the process of converting dead pictures to live ones .

hmmm… following that Epic example, one idea would be to bin or sort the triangles by pixel area. So triangles nearby in the sorted list have a similar count of pixels. That's one dispatch for the binning or sorting.

Then a second dispatch would process the triangles from the sorted list. Each thread in a workgroup rasterizes one triangle, and the triangles will have a similar pixel count, so all threads finish at nearly the same time and saturation is good.
This would work with either the scanline or the edge equations approach (for the latter you would sort by area of the bounding rectangle, not of the triangle). Not sure which is better. The latter wastes about half the work on processing pixels which are not inside the triangle, the former has more code divergence due to complexity. Might not matter too much.
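For illustration, a rough sketch of the first dispatch, one thread per triangle computing a sort key; the buffer names and the choice of bounding rectangle area as the key are just assumptions:

__kernel void computeTriangleKeys(__global const float* vbo, int triangleCount,
                                  __global uint* keys, __global uint* triangleIds)
{
    int t = get_global_id(0);
    if (t >= triangleCount) return;

    int i = t * 9; // 3 vertices * xyz, matching the vbo layout from the first post
    float2 v0 = (float2)(vbo[i + 0], vbo[i + 1]);
    float2 v1 = (float2)(vbo[i + 3], vbo[i + 4]);
    float2 v2 = (float2)(vbo[i + 6], vbo[i + 7]);

    float xmin = fmin(v0.x, fmin(v1.x, v2.x));
    float xmax = fmax(v0.x, fmax(v1.x, v2.x));
    float ymin = fmin(v0.y, fmin(v1.y, v2.y));
    float ymax = fmax(v0.y, fmax(v1.y, v2.y));

    // Sort key: screen-space bounding rectangle area in pixels.
    uint w = (uint)fmax(0.0f, ceil(xmax) - floor(xmin));
    uint h = (uint)fmax(0.0f, ceil(ymax) - floor(ymin));
    keys[t] = w * h;
    triangleIds[t] = t; // payload, reordered together with the key by the sort
}

A GPU radix sort over keys and triangleIds between the two dispatches would then feed the rasterizing dispatch its sorted triangle list.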

That's very simple and not that bad. The downside is that nearby threads will write to scattered locations in the framebuffer, but since those writes are atomic anyway, this may not make it much slower.

Though, I guess for just a single cow model it might not beat your current approach. The scene would have to be more complex to make it a win.

StoneBird/Project4-CUDA-Rasterizer: CIS 565: CUDA Rasterizer (github.com)

But it's implemented in CUDA, and I would like to implement it using OpenCL.

Looks very hard to me… Can you give a better and easier way to optimize the code in the first post?

Game Programming is the process of converting dead pictures to live ones .

@joej

Can you please go back to basics and explain what the problem with my approach is, and what the approach on GitHub is doing? Possibly with pictures if you can? Many thanks

Game Programming is the process of converting dead pictures to live ones .

AhmedSaleh said:
Can you please go back to basics and explain what the problem with my approach is

If I get it right, you iterate over all triangles for every pixel. (I'm mainly confused about the line ‘if (x < vbosize && y < vbosize)’, so I'm not sure.)
That's inefficient. Think of it the other way around: for each triangle, process the whole screen and test whether each pixel is inside or outside. That's the same amount of work, but maybe the inefficiency becomes more obvious. It's brute force.

Ideally we want to process each triangle only once, and each pixel only once. That would be very efficient, but it's not possible.
Usually on GPU we aim to achieve efficiency on a coarse level to limit the work.
On the fine level (usually within a single workgroup), brute force is often the best option, due to the parallel hardware.
But on top of that we have the same requirements to be efficient as usual, e.g. as when working on a single core, because GPUs do not have an infinite number of cores.

Likely you have observed this already yourself, and it brought you to the idea to use tiles to increase efficiency.
And by rephrasing it, I can only help to make you more aware, or to look at it from a general perspective first.
That may then help you come up with an algorithm for your specific case more easily, or with less uncertainty along the way.

AhmedSaleh said:
and what the approach on GitHub is doing? Possibly with pictures if you can

Well, they already try to do this, giving an overview and some pictures. It does not clarify all the details, but I won't study their code just to summarize their work better.

One interesting point I catch from looking at their summary is the limitation of 1024 triangles per tile. So that's their way to address (or dodge) the dynamic allocation problem I mentioned.
But they admit it's not a general solution. It works only if the application can guarantee to stay within the limit. That's a very common solution for games, because a general solution is not possible within our tight performance budgets.
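For illustration, a rough sketch of that kind of fixed-capacity binning in OpenCL; the buffer names, and the assumption that each triangle's tile has already been computed, are only for the example:

#define MAX_TRIS_PER_TILE 1024

__kernel void binTriangles(__global const int* tileOfTriangle, int triangleCount,
                           __global int* tileCounts,  // one counter per tile, zeroed beforehand
                           __global int* tileLists)   // MAX_TRIS_PER_TILE slots per tile
{
    int t = get_global_id(0);
    if (t >= triangleCount) return;

    int tile = tileOfTriangle[t];              // one tile this triangle overlaps
    int slot = atomic_inc(&tileCounts[tile]);  // reserve a slot in that tile's list
    if (slot < MAX_TRIS_PER_TILE)              // overflow is silently dropped, hence the admitted limitation
        tileLists[tile * MAX_TRIS_PER_TILE + slot] = t;
}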

The scanline approach is probably general, but I have no idea how it works.

However, looking at their FPS numbers for single low poly models, neither approach seems practical or worth the effort. Why not just use the HW rasterization pipeline, which has more features, is orders of magnitude faster, and requires no work at all?
What's your motivation to work on this?

For learning purposes, anything is fine, as you'll learn something from any approach.

If you want something practical for games, solving problems where HW rasterization fails, Nanite might be the better resource to look up. And an investment of serious time and effort is expected.
Personally I think software solutions do make sense, but only if we solve some more problems than just rasterization along the way, like occlusion culling and LOD.
Notice those latter problems are still open, while simple rasterization is not. Rasterization is solved, so well that we have had HW acceleration for decades.

JoeJ said:
If I get it right, you iterate over all triangles for every pixel. (I'm mainly confused about the line ‘if (x < vbosize && y < vbosize)’, so I'm not sure.) That's inefficient. Think of it the other way around: for each triangle, process the whole screen and test whether each pixel is inside or outside. That's the same amount of work, but maybe the inefficiency becomes more obvious. It's brute force.

Yes, I want to optimize that. How?

Game Programming is the process of converting dead pictures to live ones .

This topic is closed to new replies.
