Dunshire Doom Perf: Part 3


In this third (and probably final) part of this series, I’ll walk through some of the performance improvements I made to the game logic of Dunshire DOOM. Way back when I started this project I wasn’t thinking about performance at all. My goal was to learn and have fun. As the project progressed though, and I tried out some big maps with tens of thousands of monsters, I couldn’t help but feel disappointed to find those large maps completely unplayable. But it did make me curious: what would it take to make these maps playable? CPUs and GPUs have come along way in the 30 years since DOOM was released so, even with the abstraction of the browser, surely we can get something playable, right?

A little history. In October, I learned some things about modern GPUs that enabled me to render big maps really fast - but items and monsters were still slow. In November, by extending those techniques and doing a little more profiling, I was able to render the map and the items and monsters in the map. The trouble was, as soon as I woke up a few hundred monsters, the game logic took too much CPU time and the game was unplayable. For a classic WAD like nuts with 7,000 monsters in the first room, the game logic would take 2-3ms per tick when the monsters were idle and then drop to a completely unplayable 8s per tick (or .003 frames per second) when the monsters wake up. I could put a video here but no one wants to wait 30s to see the animations happen after firing the gun. I do have a before and after video later on from a different map.

Performance work is fun though - honestly. And with such terrible performance, surely there were lots of ways to improve. How do I begin? I have two tools I always turn to with performance:

  1. Write a test that measures the code I want to improve. Having a test is a really nice way to get consistent results. I can make some changes in the code, run my test, and see what kind of difference I’ve made. It can also be helpful to have a couple of tests for different scenarios.
  2. Run a CPU profiler. The profiler data gives hints at what code is taking the most time and where the biggest impact would be.

Performance also takes some critical thinking and creativity to decide on the right data structure or algorithm. It is a very good feeling when you get it right though and your change makes the code twice as fast or an order of magnitude faster.

Before going into details of what I changed, here is a table summarizing the performance improvements between Nov-8, when I started this work, and Jan-4 when the bugs had been fixed. Most of the improvements happened between Nov-8 and Dec-15. These numbers are captured from Cosmogenesis MAP05 which is probably my favourite DOOM map and it highlights the changes best because it’s so big.

Data

Datemap-bin (ms)map-init (ms)map-geom (ms)render-data (ms)Idle FPSFPS after shotTest score
Nov-8664910001380300118.7
Nov-156670884141530570137.6
Nov-226347835120530790444.6
Nov-296043724853246103245.5
Dec-4617338177229210017104.4
Dec-1360673688443029678128
Dec-2060954197682449678126
Jan-4610546381023710583133

From the table, we can see scores improved across the board. The test score shows a 16x improvement although this varies by map. Nuts.wad for example went from a score of 4 in November to over 2,400 by January and Sunder MAP08 went from 145 to 620. Those maps with less geometry don’t show much change in other numbers but Cosmogenesis now loads several seconds faster and only drops 20% FPS (100 down to 80) after waking up the initial batch of monsters.

I’ve highlighted Dec-4 in the table above because that was the biggest win and I’ll explain more below. If you are simply interested in the neat before and after video from Sunlust MAP30, skip to the very end.

BSP and Blockmap

When I first wrote Dunshire DOOM, I spent quite a bit of time working through bugs in collision detection and settled on using the BSP tree for broad phase collision detection. As I started looking at performance, I noticed two problems with this approach:

(1) The BSP tree is generally between 15-25 levels deep (although some sections of large maps can go quite a bit deeper). Even at 15-25 levels, a map with 70,000 monsters spends a lot of time traversing the tree. Below is a heatmap that shows the BSP tree depth for Cosmogenesis MAP05:

Heatmap of BSP depth for the large MAP05 of Cosmogenesis.
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75
Cosmosgenesis coloured by BSP tree depth. For consistency with other maps, the scale is clipped to the range 0-80. There are many pink spots where depth is above 40. The real hot spots is the circle at the end of the map (far right) where the tree depth is over 200.
Histogram of BSP depth for the large MAP05 of Cosmogenesis.
Cosmosgenesis MAP05 BSP tree depth histogram. The majority of subsectors are around 15-20 levels deep but there are outliers are over 200.

Cosmogenesis has a fairly balanced tree. The tree depth of another map, Profane Promiseland, varies more across the map. While most of the map has a depth of 15-25, the lower right corner has sections with depth of 50 or 60 and there are several shapes that are in the 40-50 range as well:

Heatmap of BSP depth for Profane Promiseland (also a large DOOM map).
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75
BSP tree depth for Profane Promiseland.
Histogram of BSP depth for Profane Promiseland (also a large DOOM map).
BSP tree depth histogram for Profane Promiseland.

BSP tree depth is magnified by two extra queries anytime an object moves. The first is to see what sectors the object is touching to compute floor height. The second is to get the sector for the centre of the object to know the light level.

(2) Even with the map divided by BSP, some sections can still be large and contain hundreds (or thousands) of monsters which means our approach with BSP doesn’t narrow down the collision search space the way we need it to.

Consider the initial room in nuts.wad. It contains over 7,000 monsters but because the map geometry is simple and the room so large, most monsters are put into either the big left or right subsectors. Our collision detection searches for collisions with all monsters in a subsector which means we are performing an O(n^2) algorithm where N is 3,000 or more. We could perhaps get away with that if we do it infrequently but we’re doing the calculation near 100 times per second depending on the monster type.

Overhead view of nuts.wad with subsectors coloured in blue and orange.
An overhead view of nuts.wad’s opening room with 7,000 monsters. This rooms is mainly two large subsectors (orange on the left and blue on the right) each with 3,000 or more monsters.

I mentioned earlier that I had chosen BSP because it was just so elegant but with the performance I was seeing, I wanted to explore other things. I turned back to DOOM’s blockmap to see if it could work.

DOOM’s blockmap is a 2D grid where each cell (or block) has a list of monsters and walls that are inside it. The blockmap nicely solves both of the above problems. Blockmap lookup is always O(1) which means we can query multiple blocks faster than descending 15 or 25 levels of a tree. Blocks are small (128x128) which means they will only ever contain a few monsters and not the potential thousand monsters in a large subsector.

In hindsight, blockmap is a much better fit for colliding monsters with other monsters with one caveat. The blockmap is a 2D grid so for sparse maps, there is definitely wasted memory. There are other structures out there that could help but the blockmap is conceptually simple, easy to update, easy to query, and also what DOOM used so I didn’t give it too much thought. I did need to be careful about a few points though:

Monster Crowds

Switching from BSP to blockmap was an improvement but game logic would still slow down if all the monsters were on one side of the room. The slowdown was caused by how DOOM monsters search for a new path when moving.

DOOM monsters, in general, move in one of 8 directions (N, NE, E, SE, S, SE, W, or NW). When DOOM monsters are moving, they will take a straight line to the player unless that line is blocked. When the path is blocked or after a bit of time, they will re-evaluate their path and choose a new direction. It’s probably not easy to see from the code but in the worst case, DOOM monsters can perform 12 searches for 8 possible directions (4 of those are duplicated). Monster crowds trigger this worst case because the monsters are so close together they are blocked on all sides. Because we’ve switched to blockmap for collisions, at least we aren’t traversing 20 levels of a BSP tree for each direction check but it’s still not enough.

Overhead view of a monster surrounded by a crowd with arrows showing the path the monster could take.
An overhead view of a monster stuck in a crowd in nuts.wad’s opening room. The blue arrows show the directions the monster will try but each path is blocked. The green box is the player with noclip turned on so monsters do not collide with the player.

To improve this, instead of colliding with items in our block on each movement, I collect all the items (lines and objects) within the move distance in any direction and store them. For example, if a monster moves 4 units, I collect all items within 4 units of the monster. When the monster tries to find a new movement direction, I only need to look at the items from the 4 unit box instead of everything in the blocks map. This is important because a block could contain 10 or 20 lines (or 50!) and several monsters. Because monster movement is small, the monster will only touch a few of these items therefore we only need to look at 2 items instead of 10 or 20. We still potentially evaluate 12 possible routes (4 duplicates) but because there are so many fewer lines and objects, we do it much faster.

I haven’t looked at other DOOM source ports but I wonder if they may actually benefit from a similar optimization.

Other improvements

The two changes above were the big ones. Here is an incomplete list of some other interesting changes that improved performance:

Future work

While I’ve made big improvements here, it’s not still enough. Dunshire DOOM can sustain 40-60fps in Cosmogenesis on Firefox and that is amazing! I’d prefer to be hitting 120fps but I’m still happy with the progress. The finale on Firefox, with 12,000 monsters all waking up, slows down though. Interestingly, Chrome is fast enough and can handle the finale although even Chrome falls apart on the layer cake room in Profane Promiseland.

If I were going to improve this further, I might:

Going to an older machine, like my 13 year old i5 3450 would take even more work. Even disabling enemy AI won’t get Cosmogenesis MAP05 to a playable framerate on that machine. On the other hand, Helion works on that machine so it is technically possible to get something playable. Unfortunately, I’ve got other projects and ideas on my mind so I’m not likely to take this work much further.

Results: Before and After

One last way to think about performance. I mentioned above that DOOM processes game logic at 35 ticks per second. So, in order to maintain that tick rate, we need to complete all game logic within 28ms (where 28x35 approximately equals 1000ms). Prior to the performance improvements, nuts.wad was taking about 8s per tick. After these changes, we are in the range of 12-14ms per tick which means we have some spare room! Obviously I’d prefer to go lower but it’ll take some creativity to get there.

In summary, we’ve saved seconds off map load times for large maps (Cosmogenesis or Sunder maps 15-20), we’ve got very playable framerates for maps with thousands of monsters, and memory usage is about half. Not bad. Nothing quite compares with a visual though so here is a before and after comparison of me free flying around the finale of Sunlust MAP30: God Machine. Enjoy!

Before (v0.9) where framerate drops into the low single digits by the end:

Play

After (v0.10) where framerate stays above 100 even in a busy scene:

Play