XNAInfo blogs
Ramblings about XNA, .NET and stuff

How I Saved 3ms By Unrolling A Loop

September 25, 2008 10:05 by MJP
Last night I decided to sit down and do some nitty-gritty optimization on the 360 version of my game. The PC version has been running great even when I crank up every setting I have (although that might have something to do with the 4870 I just bought), but as we all know by now, things are never so easy on the Xbox. For the past few weeks I'd been struggling to hold the framerate at 60Hz, with it slipping down to 55Hz during complex scenes. Now 55fps wouldn't be so much of a problem if I weren't the kinda guy who hates screen tearing, but it just so happens I am. Which means I want VSYNC to be enabled, which means the game drops to 30fps whenever it's below 60. Definitely unacceptable. For a while I avoided the problem by doing the unthinkable...I dropped the resolution from 1280 x 720 to 1024 x 600. I even considered leaving it this way for the final release...I mean if Halo 3 and Metal Gear Solid 4 can do it, why can't I?
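
The drop from 55fps straight to 30fps is what you'd expect if a frame that misses a refresh has to wait for the next one. Here's a minimal sketch of that rounding, assuming plain double buffering on a 60Hz display (the function name is made up for illustration, and it's C++ rather than shader code so it can run on its own):

```cpp
#include <cassert>
#include <cmath>

// Effective frame rate with VSYNC on, assuming simple double buffering:
// a frame that misses a refresh waits for the next one, so the displayed
// frame time is rounded up to a whole number of refresh intervals.
double VsyncedFps(double renderFps, double refreshHz)
{
    assert(renderFps > 0.0 && refreshHz > 0.0);
    // Number of refresh intervals one frame occupies (at least one).
    // The small epsilon keeps an exact 60fps from rounding up to two intervals.
    double intervals = std::ceil(refreshHz / renderFps - 1e-9);
    if (intervals < 1.0)
        intervals = 1.0;
    return refreshHz / intervals;
}
```

Under these assumptions, VsyncedFps(55.0, 60.0) comes out at 30, while anything that renders at 60fps or faster stays at 60.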

But no, I'm too finicky to settle for the lowered resolution. It just looks so much better in 720p! So I cranked the res back up, and decided to see where I could squeeze out some extra performance. Naturally, the first place I went to was my shadow mapping shader. I already knew this particular shader was giving me trouble, since I'd discovered that decreasing the size of the buffer that I render the shadow occlusion to (since I do a deferred shadowing pass) resulted in significant performance gains. I'd already reduced things to 4 PCF samples on the 360 (did I mention I ditched VSM for the 360? Performance and precision ended up being so awful it wasn't worth it) so I couldn't squeeze that down anymore. At this point my eyes drifted down to the little loop I had for determining which split of the shadow map cascade to use, when I remembered an excellent presentation on shader performance that I'd read a long time ago. One of the things mentioned in there was that unrolling loops can have a significant impact on general purpose register usage, ALU usage, and shader compiler optimization. So I thought, "hey, let's try unrolling this loop and flattening this branch". I then changed my code from this:

for (int i = 1; i < NUM_SPLITS; i++)
{
    if (vPositionVS.z <= g_vClipPlanes[i].x && vPositionVS.z > g_vClipPlanes[i].y)
    {
        matLightViewProj = g_matLightViewProj[i];
        fOffset = i / (float)NUM_SPLITS;           
    }
}  


to this:

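// Ask the compiler to unroll the loop instead of emitting loop control flow
[unroll(NUM_SPLITS)]
for (int i = 1; i < NUM_SPLITS; i++)
{
    // Evaluate both sides of the conditional and select the result, rather than branching
    [flatten]
    if (vPositionVS.z <= g_vClipPlanes[i].x && vPositionVS.z > g_vClipPlanes[i].y)
    {
        matLightViewProj = g_matLightViewProj[i];
        fOffset = i / (float)NUM_SPLITS;
    }
}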
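
For anyone unsure what those two attributes ask for, here is a rough CPU-side sketch in C++ (not the actual shader, and not what the 360 compiler literally emits). It only picks the split index; the ClipRange struct, the function names and NUM_SPLITS = 4 are assumptions for illustration. The second function is the loop written out by hand, with each `if` turned into a select:

```cpp
#include <cassert>

const int NUM_SPLITS = 4;  // assumed value, the real count isn't given

// Stand-in for one g_vClipPlanes entry: a split covers y < z <= x.
struct ClipRange { float x; float y; };

// Loop-and-branch version, same shape as the original shader code.
int SelectSplitLoop(float z, const ClipRange planes[NUM_SPLITS])
{
    assert(planes != nullptr);
    int split = 0;  // split 0 is the default when nothing else matches
    for (int i = 1; i < NUM_SPLITS; i++)
    {
        if (z <= planes[i].x && z > planes[i].y)
            split = i;
    }
    return split;
}

// "Unrolled and flattened" version: no loop counter, every test is always
// evaluated, and each result is chosen with a select instead of a jump.
// (C++ may still short-circuit the &&, which shader hardware typically doesn't.)
int SelectSplitUnrolled(float z, const ClipRange planes[NUM_SPLITS])
{
    static_assert(NUM_SPLITS == 4, "written out by hand for 4 splits");
    assert(planes != nullptr);
    int split = 0;
    split = (z <= planes[1].x && z > planes[1].y) ? 1 : split;
    split = (z <= planes[2].x && z > planes[2].y) ? 2 : split;
    split = (z <= planes[3].x && z > planes[3].y) ? 3 : split;
    return split;
}
```

Both functions return the same split for any depth; the second one just trades control flow for a few extra comparisons, which is the kind of trade the attributes request from the shader compiler.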
      

I ran the game again, and BAM: I shot up from 55fps to 65fps! Huge difference! I was very impressed with myself. Moral of the story: experiment with stuff, and make sure you profile it!
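
The 3ms in the title falls out of those two frame rates. A quick sketch of the arithmetic (C++ so it runs on its own; the function names are made up for illustration):

```cpp
#include <cassert>

// Milliseconds spent on one frame at a given frame rate.
double FrameTimeMs(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Milliseconds saved per frame when moving from one frame rate to another.
double SavedMs(double fpsBefore, double fpsAfter)
{
    return FrameTimeMs(fpsBefore) - FrameTimeMs(fpsAfter);
}
```

SavedMs(55.0, 65.0) works out to roughly 2.8ms per frame (about 18.2ms down to about 15.4ms), which is enough to get back under the 16.7ms budget of a 60Hz display.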

Today I decided to read through some other presentations to see what other useful bits I could find. This one from Gamefest 2007 pointed out that vfetches should be aligned to 32 bytes on the 360. I did the math on the vertex declaration used for most of my models, and found out it's 48 bytes. Later tonight I'll have to see if I can squeeze it down to 32, and see if it makes performance any better.
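
To make the 48 vs. 32 byte math concrete, here is a purely hypothetical pair of layouts in C++. The post only gives the total size, so both structs, the choice of packed formats and the helper name are guesses for illustration, not the real vertex declaration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 48-byte vertex (the actual layout isn't given).
struct FatVertex
{
    float position[3];  // 12 bytes
    float normal[3];    // 12 bytes
    float texCoord[2];  //  8 bytes
    float tangent[4];   // 16 bytes
};
static_assert(sizeof(FatVertex) == 48, "expected a 48-byte vertex");

// One possible 32-byte version: pack normal and tangent into 32 bits each,
// store texture coordinates as two 16-bit values, then pad up to 32.
struct PackedVertex
{
    float         position[3];  // 12 bytes, left at full precision
    std::uint32_t normal;       //  4 bytes, e.g. a 10:10:10:2 packed vector
    std::uint16_t texCoord[2];  //  4 bytes, e.g. two half floats
    std::uint32_t tangent;      //  4 bytes, packed like the normal
    std::uint32_t padding[2];   //  8 bytes of explicit padding
};
static_assert(sizeof(PackedVertex) == 32, "expected a 32-byte vertex");

// True if every vertex in a tightly packed buffer starts on a multiple
// of `alignment` bytes, assuming the buffer itself starts on one.
bool StrideKeepsAlignment(std::size_t stride, std::size_t alignment)
{
    assert(alignment > 0);
    return stride % alignment == 0;
}
```

With a 48-byte stride every other vertex starts halfway through a 32-byte block, while a 32-byte stride keeps each vertex inside its own block. Whether the precision lost to packing is acceptable depends on the models.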

I also came across this one from Gamefest 2008, which is all about how texture and surface formats are handled on the 360. This gave me some insight into some problems I'd come across already. For example, R32G32F (Vector2) isn't actually a format the GPU can render to! Apparently it renders to R16G16F, and then just expands it upon resolve from eDRAM. This explained why my VSMs with exponential warp were having such precision problems. Another thing pointed out in there is that the 360's texture units filter fp16 at 1/4 the rate of INT8! This would help explain why my VSM performance was so poor. Definitely good things to keep in mind.
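
To get a feel for why a silent drop to 16-bit floats hurts an exponential warp, here is a small C++ sketch. It assumes the warp stores exp(c * depth) with depth in [0, 1], and that the 16-bit render target behaves like a standard IEEE half float with a largest finite value of 65504; neither is confirmed for this engine or for the 360, and the function names are made up:

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Largest finite IEEE 754 half-precision value (assumed to apply here).
const double kHalfMax = 65504.0;

// Largest c for which exp(c * depth) stays at or below maxValue
// for every depth in [0, 1].
double MaxWarpExponent(double maxValue)
{
    assert(maxValue > 1.0);
    return std::log(maxValue);
}

// A variance shadow map also stores the squared value,
// exp(c * depth)^2 = exp(2 * c * depth), so the usable exponent is halved.
double MaxVsmWarpExponent(double maxValue)
{
    return 0.5 * MaxWarpExponent(maxValue);
}
```

Under those assumptions MaxVsmWarpExponent(kHalfMax) is roughly 5.5, against roughly 44 for MaxVsmWarpExponent(FLT_MAX), so a warp constant tuned for 32-bit targets would overflow long before the far plane once the data is really stored at 16 bits.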

-MJP

A peek at what's to come

July 14, 2008 11:14 by MJP

After completing my first game programming tutorial for XNAinfo.com, I've realized something:  I like writing these things!  So I've been spending some idle time musing about future subjects for which a new article or tutorial would be of use to the XNA community.  Here's a few possibilities I've come up with so far:

  • Debugging shaders and profiling with PIX.  I can't tell you how many times I've replied to a thread in the gamedev.net forums by saying "just step through your shader in PIX!".  PIX is undoubtedly a hugely useful tool (even when it's occasionally crashing), and I think it's important that beginners get comfortable with it early on.  There's documentation in the SDK, but I think a lot of beginner XNA programmers don't think to look through there and therefore don't understand how powerful PIX can be.

  • HDR and tonemapping.  It's probably pretty advanced for your average XNA project, but these days everybody wants to know how to do it.  There are some great samples in the SDK but once again I'm not sure all XNA programmers want to go poking around in there, especially if they don't know C++.  Plus there's always cool things I could throw in, like the LogLuv encoding scheme Rim and I worked out.

  • Skyboxes.  Another topic I see brought up all the time on the gamedev forums.  It's a very very simple technique once you understand how it works, but unfortunately I don't know of any tutorials devoted to that topic.  Plus there's always a few advanced things I could throw in there...like RGBE (or LogLuv) encoding for HDR, or comparing DXT compression vs. uncompressed.

  • An article or maybe even a series about map editors.  I've been working on a map editor for my own project, and I'm constantly amazed at all the cool stuff you can do with .NET when your engine is made with managed code.

  • A follow-up article for my ModTool tutorial that discusses instancing.  There's an excellent Instancing sample on the Creators Club website; however, integrating the technique with the content authoring system I described can be tricky (it requires modifying the shaders quite a bit, and also requires making some changes to the content processor).

Well I guess that's all I have for now.  I'm not sure which I'll start with...maybe the skybox one because it's easy.  Cool