XNAInfo blogs
Ramblings about XNA, .NET and stuff

How I Saved 3ms By Unrolling A Loop

September 25, 2008 10:05 by MJP
Last night I decided to sit down and do some nitty-gritty optimization on the 360 version of my game.  The PC version has been running great even when I crank up every setting I have (although that might have something to do with the 4870 I just bought), but as we all know by now things are never so easy on the Xbox.  For the past few weeks I'd been struggling to keep the framerate at 60fps, with it slipping down to 55fps during complex scenes.  Now 55fps wouldn't be so much of a problem if I weren't the kinda guy who hates screen tearing, but it just so happens I am.  Which means I want VSYNC to be enabled, which means the game drops to 30fps whenever it's below 60.  Definitely unacceptable.  For a while I avoided the problem by doing the unthinkable...I dropped the resolution from 1280 x 720 to 1024 x 600.  I even considered leaving it this way for the final release...I mean if Halo 3 and Metal Gear Solid 4 can do it, why can't I?

But no, I'm too finicky to settle for the lowered resolution.  It just looks so much better in 720p!  So I cranked the res back up, and decided to see where I could squeeze out some extra performance.  Naturally, the first place I went to was my shadow mapping shader.  I already knew this particular shader was giving me trouble, since I'd discovered that decreasing the size of the buffer that I render the shadow occlusion to (since I do a deferred shadowing pass) resulted in significant performance gains.  I'd already reduced things to 4 PCF samples on the 360 (did I mention I ditched VSM for the 360?  Performance and precision ended up being so awful it wasn't worth it) so I couldn't squeeze that down any further.  At this point my eyes drifted down to the little loop I had for determining which split of the shadow map cascade to use, and I remembered an excellent presentation on shader performance that I'd read a long time ago.  One of the things mentioned in there was that unrolling loops can have a significant impact on general purpose register usage, ALU usage, and shader compiler optimization.  So I thought, "hey, let's try unrolling this loop and flattening this branch".  I then changed my code from this:

for (int i = 1; i < NUM_SPLITS; i++)
{
    if (vPositionVS.z <= g_vClipPlanes[i].x && vPositionVS.z > g_vClipPlanes[i].y)
    {
        matLightViewProj = g_matLightViewProj[i];
        fOffset = i / (float)NUM_SPLITS;           
    }
}  


to this:

[unroll(NUM_SPLITS)]
for (int i = 1; i < NUM_SPLITS; i++)
{
    [flatten]
    if (vPositionVS.z <= g_vClipPlanes[i].x && vPositionVS.z > g_vClipPlanes[i].y)
    {
        matLightViewProj = g_matLightViewProj[i];
        fOffset = i / (float)NUM_SPLITS;           
    }
}  
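
Written out by hand, the unrolled and flattened loop would look roughly like the sketch below. This is only an illustration, not the code the compiler actually emits: NUM_SPLITS = 4, the float2 type of g_vClipPlanes, and the SelectSplit wrapper function are all assumptions that the original snippet doesn't spell out. The idea is that every range test gets evaluated, and the results are picked with selects instead of jumps.

```hlsl
// Sketch only: NUM_SPLITS = 4 and the float2 type of g_vClipPlanes are assumptions.
#define NUM_SPLITS 4

float4x4 g_matLightViewProj[NUM_SPLITS];
float2   g_vClipPlanes[NUM_SPLITS];

// Hypothetical wrapper function. The caller is assumed to pass in the
// values for split 0 as defaults, since the original loop starts at i = 1.
void SelectSplit(float3 vPositionVS, inout float4x4 matLightViewProj, inout float fOffset)
{
    // All three range tests are evaluated up front: no loop counter, no jumps.
    bool bSplit1 = vPositionVS.z <= g_vClipPlanes[1].x && vPositionVS.z > g_vClipPlanes[1].y;
    bool bSplit2 = vPositionVS.z <= g_vClipPlanes[2].x && vPositionVS.z > g_vClipPlanes[2].y;
    bool bSplit3 = vPositionVS.z <= g_vClipPlanes[3].x && vPositionVS.z > g_vClipPlanes[3].y;

    // Selects instead of branches, applied in the same order as the loop,
    // so a later matching split still overrides an earlier one.
    matLightViewProj = bSplit1 ? g_matLightViewProj[1] : matLightViewProj;
    fOffset          = bSplit1 ? 1.0f / NUM_SPLITS     : fOffset;

    matLightViewProj = bSplit2 ? g_matLightViewProj[2] : matLightViewProj;
    fOffset          = bSplit2 ? 2.0f / NUM_SPLITS     : fOffset;

    matLightViewProj = bSplit3 ? g_matLightViewProj[3] : matLightViewProj;
    fOffset          = bSplit3 ? 3.0f / NUM_SPLITS     : fOffset;
}
```

Because the array indices are now compile-time constants, the compiler no longer has to index the constant arrays dynamically or keep a loop counter around. Whether that helps on a given GPU is something to confirm with a profiler.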
      

I ran the game again, and BAM:  I shot up from 55fps to 65fps!  That's roughly 18.2ms per frame down to 15.4ms, which is where the 3ms in the title comes from.  Huge difference!  I was very impressed with myself.  Moral of the story:  experiment with stuff, and make sure you profile it!

Today I decided to read through some other presentations to see what other useful bits I could find.  This one from Gamefest 2007 pointed out that vfetches should be aligned to 32 bytes on the 360.  I did the math on the vertex declaration used for most of my models, and found out it's 48 bytes.  Later tonight I'll have to see if I can squeeze it down to 32, and see if it makes performance any better.

I also came across this one from Gamefest 2008, which is all about how texture and surface formats are handled on the 360.  This gave me some insight into some problems I'd come across already.  For example, R32G32F (Vector2) isn't actually a format the GPU can render to!  Apparently it renders to R16G16F, and then just expands it upon resolve from eDRAM.  This explained why my VSMs with exponential warp were having such precision problems.  Another thing pointed out in there is that the 360's texture units filter fp16 at 1/4 the rate of INT8!  This would help explain why my VSM performance was so poor.  Definitely good things to keep in mind.

-MJP