Dunshire Doom Perf: Part 3
In this third (and probably final) part of this series, I’ll walk through some of the performance improvements I made to the game logic of Dunshire DOOM. Way back when I started this project I wasn’t thinking about performance at all. My goal was to learn and have fun. As the project progressed though, and I tried out some big maps with tens of thousands of monsters, I couldn’t help but feel disappointed to find those large maps completely unplayable. But it did make me curious: what would it take to make these maps playable? CPUs and GPUs have come along way in the 30 years since DOOM was released so, even with the abstraction of the browser, surely we can get something playable, right?
A little history. In October, I learned some things about modern GPUs that enabled me to render big maps really fast - but items and monsters were still slow. In November, by extending those techniques and doing a little more profiling, I was able to render the map and the items and monsters in the map. The trouble was, as soon as I woke up a few hundred monsters, the game logic took too much CPU time and the game was unplayable. For a classic WAD like nuts with 7,000 monsters in the first room, the game logic would take 2-3ms per tick when the monsters were idle and then drop to a completely unplayable 8s per tick (or .003 frames per second) when the monsters wake up. I could put a video here but no one wants to wait 30s to see the animations happen after firing the gun. I do have a before and after video later on from a different map.
Performance work is fun though - honestly. And with such terrible performance, surely there were lots of ways to improve. How do I begin? I have two tools I always turn to with performance:
- Write a test that measures the code I want to improve. Having a test is a really nice way to get consistent results. I can make some changes in the code, run my test, and see what kind of difference I’ve made. It can also be helpful to have a couple of tests for different scenarios.
- Run a CPU profiler. The profiler data gives hints at what code is taking the most time and where the biggest impact would be.
Performance also takes some critical thinking and creativity to decide on the right data structure or algorithm. It is a very good feeling when you get it right though and your change makes the code twice as fast or an order of magnitude faster.
Before going into details of what I changed, here is a table summarizing the performance improvements between Nov-8, when I started this work, and Jan-4 when the bugs had been fixed. Most of the improvements happened between Nov-8 and Dec-15. These numbers are captured from Cosmogenesis MAP05 which is probably my favourite DOOM map and it highlights the changes best because it’s so big.
Data
Date | map-bin (ms) | map-init (ms) | map-geom (ms) | render-data (ms) | Idle FPS | FPS after shot | Test score |
---|---|---|---|---|---|---|---|
Nov-8 | 6649 | 1000 | 1380 | 300 | 1 | 1 | 8.7 |
Nov-15 | 6670 | 884 | 1415 | 305 | 70 | 1 | 37.6 |
Nov-22 | 6347 | 835 | 1205 | 307 | 90 | 4 | 44.6 |
Nov-29 | 6043 | 724 | 853 | 246 | 103 | 2 | 45.5 |
Dec-4 | 6173 | 381 | 772 | 292 | 100 | 17 | 104.4 |
Dec-13 | 6067 | 368 | 844 | 302 | 96 | 78 | 128 |
Dec-20 | 6095 | 419 | 768 | 244 | 96 | 78 | 126 |
Jan-4 | 6105 | 463 | 810 | 237 | 105 | 83 | 133 |
map-bin
is the time to load the binary data of the map.map-init
is the time to initialize monsters and items and texture animations and other map runtime things.map-geom
is the time to build the OpenGL wall and floor geometry of the map (using data frommap-bin
).render-data
creates extra data to support a subset of DOOM render tricks.- Idle FPS is the animation framerate (frames per second) after loading the map but not moving or shooting.
- FPS after shot is the framerate after taking a shot and waking up monsters.
- Test score is the benchmark test I wrote. It loads the map, fires the weapon, and then counts how many ticks can be run within 2 seconds. DOOM updates the game at 35 tics per second so to avoid slowdown, we need to hit at least 70 on the test but higher is always better because this test does not include any overhead with rendering or sound or the browser.
From the table, we can see scores improved across the board. The test score shows a 16x improvement although this varies by map. Nuts.wad for example went from a score of 4 in November to over 2,400 by January and Sunder MAP08 went from 145 to 620. Those maps with less geometry don’t show much change in other numbers but Cosmogenesis now loads several seconds faster and only drops 20% FPS (100 down to 80) after waking up the initial batch of monsters.
I’ve highlighted Dec-4 in the table above because that was the biggest win and I’ll explain more below. If you are simply interested in the neat before and after video from Sunlust MAP30, skip to the very end.
BSP and Blockmap
When I first wrote Dunshire DOOM, I spent quite a bit of time working through bugs in collision detection and settled on using the BSP tree for broad phase collision detection. As I started looking at performance, I noticed two problems with this approach:
(1) The BSP tree is generally between 15-25 levels deep (although some sections of large maps can go quite a bit deeper). Even at 15-25 levels, a map with 70,000 monsters spends a lot of time traversing the tree. Below is a heatmap that shows the BSP tree depth for Cosmogenesis MAP05:


Cosmogenesis has a fairly balanced tree. The tree depth of another map, Profane Promiseland, varies more across the map. While most of the map has a depth of 15-25, the lower right corner has sections with depth of 50 or 60 and there are several shapes that are in the 40-50 range as well:


BSP tree depth is magnified by two extra queries anytime an object moves. The first is to see what sectors the object is touching to compute floor height. The second is to get the sector for the centre of the object to know the light level.
(2) Even with the map divided by BSP, some sections can still be large and contain hundreds (or thousands) of monsters which means our approach with BSP doesn’t narrow down the collision search space the way we need it to.
Consider the initial room in nuts.wad. It contains over 7,000 monsters but because the map geometry is simple and the room so large, most monsters are put into either the big left or right subsectors. Our collision detection searches for collisions with all monsters in a subsector which means we are performing an O(n^2)
algorithm where N is 3,000 or more. We could perhaps get away with that if we do it infrequently but we’re doing the calculation near 100 times per second depending on the monster type.

I mentioned earlier that I had chosen BSP because it was just so elegant but with the performance I was seeing, I wanted to explore other things. I turned back to DOOM’s blockmap to see if it could work.
DOOM’s blockmap is a 2D grid where each cell (or block) has a list of monsters and walls that are inside it. The blockmap nicely solves both of the above problems. Blockmap lookup is always O(1)
which means we can query multiple blocks faster than descending 15 or 25 levels of a tree. Blocks are small (128x128) which means they will only ever contain a few monsters and not the potential thousand monsters in a large subsector.
In hindsight, blockmap is a much better fit for colliding monsters with other monsters with one caveat. The blockmap is a 2D grid so for sparse maps, there is definitely wasted memory. There are other structures out there that could help but the blockmap is conceptually simple, easy to update, easy to query, and also what DOOM used so I didn’t give it too much thought. I did need to be careful about a few points though:
- Not all maps have BLOCKMAP data so I don’t bother reading the WAD and always build it at map load.
- Objects can touch multiple blocks. I decided to put objects into all the blocks they touch. This fixes a well known defect from DOOM but does have a small cost.
- Walls can - and often do - exist in multiple blocks. To fix this, I reject a collision if the collision point is outside the bounds of the block. This create extra work for collision detection but we perform the same check for objects so at least it’s consistent.
- Collide with segs, not linedefs. Unlike DOOM, we don’t actually insert lines into blocks. We insert segs which are the lines split by BSP divisions. Segs are nice because they have a front and back side however two sided lines have two segs so we are doing extra work.
Monster Crowds
Switching from BSP to blockmap was an improvement but game logic would still slow down if all the monsters were on one side of the room. The slowdown was caused by how DOOM monsters search for a new path when moving.
DOOM monsters, in general, move in one of 8 directions (N, NE, E, SE, S, SE, W, or NW). When DOOM monsters are moving, they will take a straight line to the player unless that line is blocked. When the path is blocked or after a bit of time, they will re-evaluate their path and choose a new direction. It’s probably not easy to see from the code but in the worst case, DOOM monsters can perform 12 searches for 8 possible directions (4 of those are duplicated). Monster crowds trigger this worst case because the monsters are so close together they are blocked on all sides. Because we’ve switched to blockmap for collisions, at least we aren’t traversing 20 levels of a BSP tree for each direction check but it’s still not enough.

To improve this, instead of colliding with items in our block on each movement, I collect all the items (lines and objects) within the move distance in any direction and store them. For example, if a monster moves 4 units, I collect all items within 4 units of the monster. When the monster tries to find a new movement direction, I only need to look at the items from the 4 unit box instead of everything in the blocks map. This is important because a block could contain 10 or 20 lines (or 50!) and several monsters. Because monster movement is small, the monster will only touch a few of these items therefore we only need to look at 2 items instead of 10 or 20. We still potentially evaluate 12 possible routes (4 duplicates) but because there are so many fewer lines and objects, we do it much faster.
I haven’t looked at other DOOM source ports but I wonder if they may actually benefit from a similar optimization.
Other improvements
The two changes above were the big ones. Here is an incomplete list of some other interesting changes that improved performance:
- Collision checks by type (object or wall or ceiling or floor). For example, line of sight traces only need to look at map geometry (walls and floors), not monsters. The code used to look at all things all the time but by making the check more precise, this created most of that 4x increase on Nov-15.
Set
instead ofArray
. There were several places that filtered arrays of objects multiple times per frame. Even some during map load. By using aSet
orMap
, I save a few seconds off map load, keep a smooth framerate when searching for a particular object type (see this youtube example), and save upwards of 15-20% performance boost when there are lots of enemy projectiles.- Remove Svelte stores. Originally, I relied on svelte stores for things like sector height or object position. This was nice in a way because anytime those properties updated, the subscribers would get notified and update accordingly. For a game engine, it wasn’t a good fit because property changes are not frequent enough to outweigh the cost of managing subscribers. This change (and the above one) where the 25% performance boost between Nov-15 and Nov-29.
- Remove
foreach()
andfor (of)
.foreach()
and it’s cousinfor (of)
are nice for readability but they will create an iterator and therefore put a little extra pressure on the GC. Removing them from the critical path of code doesn’t impact readability but gives a 3-5% performance boost.
Future work
While I’ve made big improvements here, it’s not still enough. Dunshire DOOM can sustain 40-60fps in Cosmogenesis on Firefox and that is amazing! I’d prefer to be hitting 120fps but I’m still happy with the progress. The finale on Firefox, with 12,000 monsters all waking up, slows down though. Interestingly, Chrome is fast enough and can handle the finale although even Chrome falls apart on the layer cake room in Profane Promiseland.
If I were going to improve this further, I might:
- Reduce memory pressure by avoiding short lived arrays and objects. I didn’t talk about it above, because I find it hard to get good measurements from the browser, but memory usage matters. Removing stores, in combination with a few other improvements, reduced the browser process memory usage for Cosmogenesis from 2.4GB in v0.9 to about 1.2GB in v0.10. Obviously that is a big win but 1.2GB is still a lot of memory.
- 95% of the time spent in
map-bin
is anO(n^3)
algorithm to complete subsectors. This algorithm doesn’t work in many cases anyway so we would be much better doing what other DOOM ports do and generate GL BSP nodes. If we cache the result, we could save a big chunk of map load time. - Experiment with typed arrays for object properties. I’m not sure this would be beneficial but it would give fine control over memory usage and perhaps faster math functions. It may actually turn out slower though because, honestly, it’s amazing what is possible with JS these days.
- Move computations to the GPU. It should be possible to run monster movement code in parallel (although this would mess up demo playback). To do it on the GPU, we would have to figure out a way to pass blockmap data, among other things, to the GPU which wouldn’t be easy. It would be interesting to try.
- WebAssembly? If I move the whole game logic to a language like Zig (or if I compile some other DOOM source port to wasm) then perhaps we can be faster by taking advantage of better typing and memory management. I’m a little on the fence on this. From experiments with edge-classic, performance of Dunshire DOOM is not far off so wasm may not provide as big a boost as it appears. It would still be interesting to experiment.
- During busy rooms, the CPU is mostly busy with two monster movement functions
newChaseDir()
andhasLineOfSight()
. I’m not sure how to make them faster (I’m already using the REJECT table) but that is something to think about.
Going to an older machine, like my 13 year old i5 3450 would take even more work. Even disabling enemy AI won’t get Cosmogenesis MAP05 to a playable framerate on that machine. On the other hand, Helion works on that machine so it is technically possible to get something playable. Unfortunately, I’ve got other projects and ideas on my mind so I’m not likely to take this work much further.
Results: Before and After
One last way to think about performance. I mentioned above that DOOM processes game logic at 35 ticks per second. So, in order to maintain that tick rate, we need to complete all game logic within 28ms (where 28x35 approximately equals 1000ms). Prior to the performance improvements, nuts.wad
was taking about 8s per tick. After these changes, we are in the range of 12-14ms per tick which means we have some spare room! Obviously I’d prefer to go lower but it’ll take some creativity to get there.
In summary, we’ve saved seconds off map load times for large maps (Cosmogenesis or Sunder maps 15-20), we’ve got very playable framerates for maps with thousands of monsters, and memory usage is about half. Not bad. Nothing quite compares with a visual though so here is a before and after comparison of me free flying around the finale of Sunlust MAP30: God Machine. Enjoy!
Before (v0.9) where framerate drops into the low single digits by the end:
After (v0.10) where framerate stays above 100 even in a busy scene: