
compute shader multitasking

Started by drhex. 4 comments, last by MJP 2 years, 1 month ago

Hi,

I started going through an OpenGL tutorial I found on the internet. Vertex shaders and fragment shaders work fine, and after reading the OpenGL 4.6 specs I realized I can launch compute shaders as well. My compute shader updates coordinates in a world which it can then draw on screen. There are lots of calculations and I use only one of the GPU's streaming processors, i.e. the job is started with:

glDispatchCompute(1,1,1);

It takes about 0.3 seconds for the job to complete and update the world by one simulation step, and this is repeated in a loop. In a completely separate process, I'm playing a video. While my OpenGL simulation job is running, that video drops to 3 fps. If I try to move a window around with the mouse, that also updates at 3 fps, as if my simulation used up all GPU resources and other processes are only let in between calls to glDispatchCompute. However, the GPU has 34 streaming processors and the simulation is only using one.

Is there some OpenGL function one can call to enable sharing of GPU resources with other programs on the machine?

(Ubuntu Linux using Nvidia's proprietary driver)


drhex said:
Is there some OpenGL function one can call to enable sharing of GPU resources with other programs on the machine?

No, drivers do this for us.

I guess you see issues because 0.3 seconds is too long for a typical dispatch. The driver may even terminate the application due to a timeout; at least that's what usually happens on Windows.

drhex said:
glDispatchCompute(1,1,1);

drhex said:
There are lots of calculations and I use only one of the GPU's streaming processors

You seemingly assume you could free GPU resources for other applications by using only one thread. But this does not work.
1. If you use only one thread, all the other 31 threads of an SM cannot do any other work, because they all need to run the same program. Thus all you get is 31 idle threads which do nothing, which only has disadvantages.
2. If this single thread runs a very long program which lasts 0.3 sec., the GPU may not be able to do any other work for the same application until it is done. This depends on data dependencies: if all later dispatches require results from the long-running task, they cannot start until this result is available. (OpenGL gives little control here; low-level APIs require the programmer to set up dependencies and synchronization in detail.)
Though I agree that other applications like video playback should be able to get GPU resources in such a situation, as the GPU is clearly underutilized. But if that does not work, there is not much you can do about it.

But instead of worrying about concurrent applications, you should make sure that your application uses the GPU properly and efficiently:
1. Try to make sure all threads do work, and use a workgroup size of at least (32,1,1).
2. If a single dispatch takes more than, say, 4 ms, consider dividing the work into multiple dispatches, which may also help the driver and OS to distribute GPU resources across multiple apps (see the sketch below).
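
For example, splitting one long job into several shorter dispatches could look roughly like this on the host side (just a sketch; simProgram, uChunkOffsetLoc, numChunks and groupsPerChunk are made-up names, not from your code):

// Sketch: run the simulation step as several shorter dispatches.
// All names below are illustrative placeholders.
glUseProgram(simProgram);
for (GLuint chunk = 0; chunk < numChunks; ++chunk)
{
    glUniform1ui(uChunkOffsetLoc, chunk * groupsPerChunk * 64u);  // first element of this chunk
    glDispatchCompute(groupsPerChunk, 1, 1);
    // Make this chunk's SSBO writes visible to the next dispatch.
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}

Each dispatch then finishes in a few milliseconds, which gives the driver more opportunities to schedule other work in between.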

The first point is important, and it often leads to designing algorithms very differently. Unlike multithreading on the CPU, GPUs support truly parallel algorithms, which might be something new to you.
I recommend the chapter on compute shaders from the OpenGL SuperBible book, which covers both the technical details and parallel programming, with nice examples.

Thank you, JoeJ. As I wrote, I'm doing a dispatch of (1,1,1), meaning I'm utilizing 1 streaming processor. In the compute shader code, there's the line

layout(local_size_x = 64, local_size_y = 1) in;

meaning that the 1 streaming processor is running 64 threads. (The threads need to talk to each other during the calculation and can do so using shuffles and “shared” variables. It will be trickier to use more streaming processors, as they would have to communicate via shader storage buffer objects, which is likely to lead to issues with outdated information in caches. Maybe the “coherent” qualifier will help, but I haven't gotten it to work yet.)
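
Roughly, the kind of intra-workgroup sharing I mean looks like this (the buffer layout and the actual calculation are just placeholders):

layout(local_size_x = 64, local_size_y = 1) in;
layout(std430, binding = 0) buffer World { vec4 pos[]; };  // placeholder layout
shared vec4 tile[64];

void main()
{
    uint gid = gl_GlobalInvocationID.x;
    uint lid = gl_LocalInvocationID.x;

    tile[lid] = pos[gid];
    memoryBarrierShared();  // make the shared writes visible to the workgroup
    barrier();              // wait until every thread has loaded its element

    // ... the actual calculation using tile[] would go here ...

    pos[gid] = tile[lid];
}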

I suppose you're right that I need to use more and shorter dispatches if other processes are to get their fair share of the GPU as well. Given that GPUs today are used for many things other than real-time graphics, I had expected multitasking to work better. Perhaps that works in Vulkan? Well, I'd better learn a bit more OpenGL first :-)

drhex said:
meaning that the 1 streaming processor is running 64 threads.

Oh, my bad. Various APIs treat those numbers differently, so I was just assuming.

drhex said:
Given that GPUs today are used for many things other than real-time graphics I had expected multitasking to work better. Perhaps that works in Vulkan?

No. But with Vulkan you can record many dispatches into a GPU command buffer and reuse it every frame, so there is no more need to feed the GPU constantly from the CPU (a rough sketch of that follows below).
This gave me a 2× speedup over OpenCL, which in turn was 2× faster than OpenGL compute shaders.
But those are just performance details I saw 7 years ago. In general, compute shaders have the same features in any API.
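
To illustrate the recording and reuse (only a sketch; cmd, pipeline, pipelineLayout, descSet and the counts are assumed to exist and are not from this thread):

// Sketch: record a chain of dispatches once, then submit the same
// command buffer every frame. All handles are created elsewhere.
VkCommandBufferBeginInfo beginInfo = { .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
vkBeginCommandBuffer(cmd, &beginInfo);
vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayout, 0, 1, &descSet, 0, NULL);
for (uint32_t chunk = 0; chunk < numChunks; ++chunk)
{
    vkCmdDispatch(cmd, groupsPerChunk, 1, 1);
    VkMemoryBarrier barrier = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_SHADER_READ_BIT };
    // Make each dispatch's writes visible to the next one.
    vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, 0,
                         1, &barrier, 0, NULL, 0, NULL);
}
vkEndCommandBuffer(cmd);
// Per frame: submit `cmd` again with vkQueueSubmit; no re-recording needed.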

To me it feels very obvious what you should do: you want to saturate the whole GPU with work, so (still) not just 1 streaming processor, but all of them.
But not every algorithm can be parallelized, of course. If you want, you can describe how your coordinate update works; maybe I have an idea.

drhex said:
It will be trickier to use more streaming processors as they would have to communicate via shader storage buffer objects which is likely to lead to issues with outdated information in caches. Maybe the “coherent” qualifier will help, but I haven't gotten it to work yet.

Maybe there is a simple way to implement your needs.
But one interesting approach I've heard of recently is Epic's Nanite, where they implement a multiple-producer, multiple-consumer queue using persistent threads. I would not have thought that's possible on a GPU, and indeed it's not specified to work. But they tested it on all the hardware, it worked, so they kept using it.

Preemption and multitasking are mostly a function of the driver, OS, and the GPU itself. It is still quite common for preemption to only work at command buffer granularity, which would mean your process can “hog” the entire GPU until all of its work completes (even if your process is achieving poor occupancy, since you're only dispatching a single thread group). Preemption at draw/dispatch granularity is more common these days, but again that wouldn't really help you, since you have one very long dispatch. You might possibly achieve better results if you can split up your dispatch into multiple steps, but I know nothing about how Linux handles GPU scheduling (and perhaps even less about how Nvidia's GL driver works), so I can't really confirm that for you.

On modern GPUs there can be hardware functionality for executing multiple command buffers simultaneously, but typically it only works for “compute-only” dispatches that get routed to specific compute queues. Vulkan exposes this more directly to you through its queue family functionality, whereas GL does not. I would expect GL submissions to always go through the graphics pipe since it does not expose this to you.
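
For reference, finding such a compute-only queue family in Vulkan looks roughly like this (physicalDevice is assumed to exist already; just a sketch, no error handling):

// Sketch: look for a queue family that supports compute but not graphics,
// which is where "async compute" work is typically submitted.
uint32_t familyCount = 0;
vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount, NULL);
VkQueueFamilyProperties props[16];
if (familyCount > 16) familyCount = 16;
vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount, props);

int computeOnlyFamily = -1;
for (uint32_t i = 0; i < familyCount; ++i)
{
    VkQueueFlags flags = props[i].queueFlags;
    if ((flags & VK_QUEUE_COMPUTE_BIT) && !(flags & VK_QUEUE_GRAPHICS_BIT))
    {
        computeOnlyFamily = (int)i;
        break;
    }
}
// If computeOnlyFamily >= 0, work submitted to a queue from that family can
// potentially run alongside graphics submissions from other queues.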

