Fixed Function Pipeline Faster For Sprites?

15 comments, last by VoxycDev 5 years, 2 months ago
2 minutes ago, lawnjelly said:

The other thing is that you appear to be recreating and compiling the shader on every frame, which will probably kill performance. Again, move this to one-off code and reuse the shader. After all this is done you can reassess whether there are any bottlenecks.

Oh no, I'm definitely not doing that. Shader compilation happens once at init time. I did not include that part for brevity's sake.

1 minute ago, VoxycDev said:

Oh no, I'm definitely not doing that. Shader compilation happens once at init time. I did not include that part for brevity's sake.

Ah yep, sorry you are right, I didn't read thoroughly enough. :) 

1 minute ago, lawnjelly said:

Ah yep, sorry you are right, I didn't read thoroughly enough. :) 

Cool. Thank you for your input. Overall, this has been a very helpful thread for me. I think I am on my way to that blazing fast particle system I wanted.

2 hours ago, VoxycDev said:

Yes, since I was looking for a way to draw all sprites with one call, I decided to make mvMatrix an attribute.

I still don't see why it is needed? Maybe you are trying to over-optimise this. The reason sprite batching is needed is that you can have thousands, even tens of thousands, of visible sprites in a scene, even after basic culling, and that many state changes and draw calls is too many. A handful, even dozens, of draw calls is fine.

In something 2D (for simplicity of the example) like your YouTube video or other 2D games, I might have separate drawing at least for the background tiles (easy to cull and calculate coordinates for CPU side, might even cache), world objects, and the UI.

 

2 hours ago, VoxycDev said:

Thing is, even though right now it's just a grid, the sprites are supposed to be stretchable/bendable, like they were in my old fixed pipeline code

Sounds like it could still be done without a unique matrix per vertex, but I am not clear exactly what you are doing. The old code you posted just draws a normal tile grid with no deform unless I missed something.

Would you need these deformed positions CPU side anyway, e.g. for collision detection? In which case just use those directly.
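To illustrate "just use those directly", here is a minimal sketch of writing CPU-side deformed corners straight into one vertex array, with no per-quad matrix at all. The `Vertex` and `DeformedSprite` types and the corner ordering are assumptions made up for this example, not taken from the posted renderer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vertex layout: 2D position plus texture coordinates.
struct Vertex {
    float x, y;
    float u, v;
};

// Hypothetical sprite: four corner positions already deformed on the CPU,
// ordered bottom-left, bottom-right, top-right, top-left.
struct DeformedSprite {
    std::array<float, 8> corners; // x0,y0, x1,y1, x2,y2, x3,y3
};

// Appends two triangles (6 vertices) for one sprite to the batch.
void appendSprite(std::vector<Vertex>& batch, const DeformedSprite& s) {
    static const float uv[4][2] = {{0, 0}, {1, 0}, {1, 1}, {0, 1}};
    static const int order[6] = {0, 1, 2, 0, 2, 3};
    for (int i : order) {
        batch.push_back({s.corners[2 * i], s.corners[2 * i + 1], uv[i][0], uv[i][1]});
    }
}

// Builds one vertex array for all sprites, ready to upload and draw in one call.
std::vector<Vertex> buildBatch(const std::vector<DeformedSprite>& sprites) {
    std::vector<Vertex> batch;
    batch.reserve(sprites.size() * 6);
    for (const DeformedSprite& s : sprites) appendSprite(batch, s);
    assert(batch.size() == sprites.size() * 6);
    return batch;
}
```

The resulting array would be uploaded to the VBO once per frame (or only when something changes) and drawn with a single call; the same corner data can feed collision detection.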

Is the deformation limited, to say moving the 4 corner points of a large object? Or something else that can be determined on the fly from a small dataset?

And surely you can't deform every sprite in the game? If some things need a more complex and expensive routine, avoid letting that add significant cost to the thousands of other things being rendered.

 

2 hours ago, VoxycDev said:
Quote
  • What is `texAtlas.add(obj->textureName);`? You're not rebuilding a texture dynamically, are you? Even if not every frame, you need to be careful not to cause slow frames / stutter. It also looks like a string; if it's doing string map lookups for every sprite, that is not ideal.

It makes sure the texture is in the texture atlas. It's rebuilt as-needed (only when a brand new texture is added). You're right, I probably should get rid of string map lookup here. But in this particular case there is only one texture so array size is 1, so it's not the bottleneck.

Normally if I had an atlas I'd do it at load time. Doing it dynamically is a lot more complex. "Rebuilt as-needed" can be perceived as stutter if you are not careful, because the "as needed" frame takes longer than the other frames that didn't rebuild anything.

One of the reasons I hate string comparison is that even the best case is fairly expensive. Say you have a hash map with one entry: in the case of a "hit" you just did an O(n) hash computation and an O(n) string comparison (to check against collision), where n is the string length, and if you have a fairly long string like a filename, or worse a path, that is a fair bit. Probably not the bottleneck, but things like that add up a lot if they are spread throughout a program (some languages and/or programs "intern" strings so they can use reference equality instead, essentially turning such strings into integers).
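A rough sketch of the interning idea, assuming a hypothetical `TextureNameInterner` class (nothing like it is claimed to exist in the engine): the string is hashed and compared once, when the sprite is created or loaded, and the per-frame code only carries the integer id.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical interner: hands out a small integer id for each distinct name.
class TextureNameInterner {
public:
    // Does the string hash and comparison once, e.g. at load time.
    std::uint32_t intern(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        const std::uint32_t id = static_cast<std::uint32_t>(names_.size());
        names_.push_back(name);
        ids_.emplace(name, id);
        return id;
    }

    // Reverse lookup, for debugging or editor UI.
    const std::string& name(std::uint32_t id) const {
        assert(id < names_.size());
        return names_[id];
    }

private:
    std::unordered_map<std::string, std::uint32_t> ids_;
    std::vector<std::string> names_;
};
```

Each sprite would then store the `std::uint32_t` id instead of the name, so the atlas lookup in the render loop becomes an array index.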

2 hours ago, VoxycDev said:
Quote
  • Also not sure on the cost of things like `setVertexAttrib`. You should be able to do this once, and it is saved with the `GL_ARRAY_BUFFER` (possibly all in one go, e.g. `glVertexAttribPointer`)

setVertexAttrib just calls all the gl functions needed to set up an attribute. Good point, though. I should try to do this once if I can. This is not the only program/renderer that runs in the engine though, so I assumed I have to re-set-up all the attributes on every frame for every program. Is that not the case?

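This is where `glVertexAttribPointer` etc. come in. Despite first appearances, it is not setting purely global state: the call records the layout, together with whichever buffer is bound to `GL_ARRAY_BUFFER` at that moment, in the currently bound vertex array object (VAO), and what you set will still be there the next time you bind that VAO. On a context without VAOs (plain OpenGL ES 2.0, for example) the pointers are context state and do have to be set again after another renderer changes them.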
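For reference, a sketch of the values that would go into those calls for an interleaved vertex, computed once at init. `SpriteVertex`, `AttribLayout` and the attribute locations 0 and 1 are assumptions for the example; the GL calls themselves appear only in a comment so the snippet compiles without a GL context.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical interleaved vertex: position followed by texture coordinates.
struct SpriteVertex {
    float position[3];
    float texCoord[2];
};

// Parameters for one attribute, as they would be passed to glVertexAttribPointer.
struct AttribLayout {
    unsigned index;      // attribute location (assumed 0 and 1 here)
    int components;      // number of floats in the attribute
    std::size_t stride;  // bytes between consecutive vertices
    std::size_t offset;  // byte offset of the attribute inside the vertex
};

std::vector<AttribLayout> spriteVertexLayout() {
    std::vector<AttribLayout> layout = {
        {0, 3, sizeof(SpriteVertex), offsetof(SpriteVertex, position)},
        {1, 2, sizeof(SpriteVertex), offsetof(SpriteVertex, texCoord)},
    };
    assert(layout[1].offset + 2 * sizeof(float) <= sizeof(SpriteVertex));
    // At init, with the VBO bound, each entry 'a' would become:
    //   glEnableVertexAttribArray(a.index);
    //   glVertexAttribPointer(a.index, a.components, GL_FLOAT, GL_FALSE,
    //                         (GLsizei)a.stride, (const void*)a.offset);
    return layout;
}
```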

2 hours ago, VoxycDev said:
Quote
  • Any sort of dynamic branch in a shader is usually bad if adjacent/nearby data will branch differently. GPU cores are not like CPU ones and can't all independently do their own thing. I didn't look closely at your data, but something to be aware of.
     

I'm not super worried about the gaps between the sprites. This is only for an editor, not for rendering in the game. As long as it's smooth and I can quickly build vast landscapes and cities out of voxels, that's all I care about.

Wrong quote? The gaps are what you get when you let a translation get combined with other things in a matrix and it causes rounding errors.

 

Dynamic branching in a GPU program / shader can have a serious performance impact. If the GPU runs, say, 32 threads together, then all 32 threads must do the exact same thing each cycle; they just get different registers (and there are some memory access rules as well). If a condition makes some threads do one thing and others something else, the GPU basically has to "pause" one set of threads and do the first thing, then "pause" the other threads and do the other thing, on separate cycles.
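As a toy illustration of that cost (a simplified model written for this post, not a description of any particular GPU), assume a group pays for every branch that at least one of its threads takes:

```cpp
#include <cassert>
#include <vector>

// Toy cost model: a group of threads runs in lockstep, so each branch taken
// by at least one thread is paid for by the whole group.
int groupCycles(const std::vector<bool>& takesBranchA, int costA, int costB) {
    assert(!takesBranchA.empty());
    bool anyA = false;
    bool anyB = false;
    for (bool a : takesBranchA) {
        if (a) anyA = true; else anyB = true;
    }
    return (anyA ? costA : 0) + (anyB ? costB : 0);
}
```

With hypothetical costs of 10 and 40 cycles, a group where all 32 threads take branch A costs 10 in this model, while a group where a single thread takes branch B costs 50, which is why it matters whether adjacent data branches the same way.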

 

2 hours ago, VoxycDev said:

@SyncViews, just an idea. What if I send mvMatrix as a uniform array, and even though I can only send 32 or 64 matrices at once, I can then break it up into, let's say, 4 draw calls, to do 128 or 256 sprites? Maybe worth a try.

With only 6 vertices using the same matrix, I am not sure if that is a great help. You would need to test it. Also there may be a penalty for that uniform/memory access pattern, not sure.
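If you do try it, the splitting itself is simple. A minimal sketch, where `DrawRange` and `splitIntoBatches` are hypothetical names and the per-call limit is whatever fits in your uniform budget:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One draw call's worth of sprites: index of the first sprite and how many.
struct DrawRange {
    std::size_t first;
    std::size_t count;
};

// Splits spriteCount sprites into draw calls of at most maxPerCall sprites each.
std::vector<DrawRange> splitIntoBatches(std::size_t spriteCount, std::size_t maxPerCall) {
    assert(maxPerCall > 0);
    std::vector<DrawRange> ranges;
    for (std::size_t first = 0; first < spriteCount; first += maxPerCall) {
        ranges.push_back({first, std::min(maxPerCall, spriteCount - first)});
    }
    return ranges;
}
```

For each range you would upload that slice of matrices to the uniform array and issue one draw call covering `count * 6` vertices.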

2 hours ago, SyncViews said:

I still don't see why it is needed? Maybe you are trying to over optimise this.

Perhaps. I guess I jumped on the whole "as few draw calls as possible" wagon and took it a bit too far. It's still a lot faster than drawing every quad with a separate draw call, though.

Quote

In something 2D (for simplicity of the example) like your YouTube video or other 2D games, I might have separate drawing at least for the background tiles (easy to cull and calculate coordinates for CPU side, might even cache), world objects, and the UI.

 

Ultimately, this is for a universal sprite/particle renderer class that I can use for:

  1. Regular sprites in the game
  2. Particle system in the game
  3. Flexible stretchable 2D tiles in the orthographic voxel editor (the part that I need most at the moment)
Quote

Sounds like it could still be done without a unique matrix per vertex, but I am not clear exactly what you are doing.

To be able to hit all 3 cases above, I need at least a matrix per quad. I will do the stretched corners with vertex coordinates. As @JohnnyCode pointed out, if they are 2D and do not rotate (yes and yes for case 3), I can save on memory by using a smaller data structure. This may allow me to cram enough of them into a uniform array and still do all of them with one draw call.
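A sketch of what such a smaller structure could look like, assuming scale plus translation is all case 3 needs. `QuadTransform2D`, `apply` and `quadsPerDrawCall` are hypothetical names, and the slot counts in the note below are example numbers, not the limits of any specific device.

```cpp
#include <cassert>

// Hypothetical compact per-quad transform: scale and translation only
// (no rotation), 4 floats instead of the 16 in a full 4x4 matrix.
struct QuadTransform2D {
    float scaleX, scaleY;
    float offsetX, offsetY;
};

struct Vec2 {
    float x, y;
};

// Same effect as multiplying by a scale-and-translate matrix in 2D.
Vec2 apply(const QuadTransform2D& t, Vec2 p) {
    return {p.x * t.scaleX + t.offsetX, p.y * t.scaleY + t.offsetY};
}

// How many per-quad transforms fit in a given budget of vec4 uniform slots.
int quadsPerDrawCall(int vec4Slots, int vec4PerQuad) {
    assert(vec4PerQuad > 0);
    return vec4Slots / vec4PerQuad;
}
```

A full `mat4` takes four vec4 slots per quad, while this struct fits in one, so the same uniform budget holds four times as many quads (for example 16 versus 64 with a budget of 64 slots).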

 

Quote

The old code you posted just draws a normal tile grid with no deform unless I missed something.

Yes, the example does not deform. I had trouble finding the old piece of code that deforms (it's old code). But the original question was only to find out why the quads draw so slowly.

Quote

Would you need these deformed positions CPU side anyway, e.g. for collision detection? In which case just use those directly.

Yes, I do and I will.

Quote

Is the deformation limited, to say moving the 4 corner points of a large object? Or something else that can be determined on the fly from a small dataset?

Deformation will be everywhere for terrain, but less so for buildings (when designing either in the orthographic editor).

Quote

And surely you can't deform every sprite in the game? If some things need a more complex and expensive routine, avoid letting that add significant cost to the thousands of other things being rendered.

This may be needed if I'm designing a sophisticated landscape for case 3.

Quote

Normally if I had an atlas I'd do it at load time. Doing it dynamically is a lot more complex. "Rebuilt as-needed" can be perceived as stutter if you are not careful, because the "as needed" frame takes longer than the other frames that didn't rebuild anything.

Yes, it does stutter, especially in Evertank. But you cannot predict what textures the user is going to load in the editor and when, so it has to be on demand. In a game release, I try to remedy this by pre-loading all the required textures in Lua at the start of the game, but not sure if this is working right now.

Quote

One of the reasons I hate string comparison is that even the best case is fairly expensive. Say you have a hash map with one entry: in the case of a "hit" you just did an O(n) hash computation and an O(n) string comparison (to check against collision), where n is the string length, and if you have a fairly long string like a filename, or worse a path, that is a fair bit. Probably not the bottleneck, but things like that add up a lot if they are spread throughout a program (some languages and/or programs "intern" strings so they can use reference equality instead, essentially turning such strings into integers).

I'd be happy to get rid of all the string lookups.

Collision detection was causing most of the performance slowdown in my original question. Once I disabled it, the frame rate went way up. Here is the new version of the fast sprite renderer, with all suggestions included (not re-creating the VBO, dynamic draw and so on):

https://github.com/dimitrilozovoy/Voxyc/blob/master/engine/SpriteRenderer.cpp

It works pretty well. I also tried to put mvMatrix into uniforms and here is that version of the fast sprite renderer:

https://github.com/dimitrilozovoy/Voxyc/blob/master/engine/SpriteRenderer2.cpp

The one above can do at most 10 sprites per draw call due to the uniform limit, but may be even faster (further testing needed). Thank you everyone for your input.
