Dunshire Doom Perf: Part 1
When I announced Dunshire Doom, I made it clear there were gaps in performance:
The renderer is terribly inefficient.
How inefficient? It could handle most DOOM maps at 60fps on my computer but not much more. DOOM 2’s MAP15 would get around 70fps, but on some of the larger Final DOOM maps, like Plutonia’s MAP28 or MAP29, the framerate would drop to 30 or 40fps. Community wads with larger maps were even worse. Sunlust MAP01 was around 30fps, MAP02 was 15-20fps, MAP29 was around 10fps, and MAP30 crawled along at 1fps (or less). The infamous nuts.wad was around 2-3fps. My browser tab would give up and crash on even larger maps like Profane Promiseland or Cosmogenesis. These numbers were captured with the enemy AI turned off - it’s just poor rendering performance.
I made several attempts to fix it:
- Adding and removing geometry on the fly was too slow. I tested it by randomly selecting walls/floors to hide and show and my framerates would stutter along as the browser created and destroyed those objects.
- Instead of adding and removing, I would add everything then toggle visibility. This was better, but it wasn’t enough and initializing the map was still slow and used lots of memory.
- I tried instance geometry but I wasn’t sure how to apply textures to instanced geometry. Even if I could get segments of a texture, how would things like scrolling textures work?
I figured I must be doing something wrong, but I wasn’t sure what, so I took a break and let the project rest. I moved on to other things. Every now and then I would do some googling on the topic, but I was no closer to solving the puzzle.
Glimmer of Hope
I’m not exactly sure how I got there, but somehow after 6 months a thought occurred to me: my browser can render 2M triangles at 120fps in WebGL demos, so why can’t I handle 30K-40K from DOOM maps? I had read that reducing draw calls would help but to get there, I would need a texture atlas. Perhaps I knew this solution months ago but felt it would be too much work. Whatever the reason, I wanted to give it a try so I built a little prototype that:
- put all DOOM graphics (walls and floors) into a single texture
- wrote a shader to read sections of that texture and apply it to walls
- added thousands and eventually hundreds of thousands of walls to see how it would perform
- added a thread to periodically move those walls around to see how performance held up during updates
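The atlas-packing step in the prototype could be sketched like this (a minimal sketch with hypothetical names, not the actual Dunshire Doom code): a simple row ("shelf") packer that assigns each graphic a rectangle in one big texture and records its normalized UV bounds.

```javascript
// Minimal row ("shelf") packer: place each texture left-to-right,
// starting a new row when the current one fills up. Hypothetical
// sketch; the real code packs actual DOOM wall/flat graphics.
function packAtlas(textures, atlasSize) {
  let x = 0, y = 0, rowHeight = 0;
  const bounds = new Map();
  for (const t of textures) {
    if (x + t.width > atlasSize) {      // start a new row
      x = 0;
      y += rowHeight;
      rowHeight = 0;
    }
    if (y + t.height > atlasSize) throw new Error('atlas full');
    // Store normalized UV bounds [u0, v0, u1, v1] for the shader.
    bounds.set(t.name, [
      x / atlasSize, y / atlasSize,
      (x + t.width) / atlasSize, (y + t.height) / atlasSize,
    ]);
    x += t.width;
    rowHeight = Math.max(rowHeight, t.height);
  }
  return bounds;
}
```

With a table like this, each wall only needs an index into the atlas instead of its own material, which is what makes a single draw call possible.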
How did it perform? Amazingly well, even with 400,000 walls.
With the success of the prototype, I was motivated to get this working. It took a few days of work to translate the existing Threlte Wall/Flat components into something that created a single geometry and updated vertex and texture attributes when the walls moved (like a platform raising or door opening). It also took a few days to figure out GLSL shaders and how ThreeJS shaders are built so I could load sections of a texture and tile or scroll them as needed. Here is the code that fetches the section of a texture atlas:
// vertex shader:
// locate this texture's entry in the atlas lookup texture
float invAtlasWidth = 1.0 / float(tAtlasWidth);
vec2 atlasUV = vec2(
    mod( float(texN), float(tAtlasWidth) ),
    floor( float(texN) * invAtlasWidth ) );
atlasUV = (atlasUV + 0.5) * invAtlasWidth;
vUV = texture2D( tAtlas, atlasUV );             // [u0, v0, u1, v1] bounds
vDim = vec2( vUV.z - vUV.x, vUV.w - vUV.y );    // texture size in atlas space
...
// fragment shader:
// tile within this texture's rectangle, never into a neighbour
vec2 mapUV = mod( vMapUv * vDim, vDim ) + vUV.xy;
vec4 sampledDiffuseColor = texture2D( map, mapUV );
How does the above code work? It relies on two things: a single texture (`map`) that contains all the textures for the DOOM map, and a second texture (`tAtlas`) holding the coordinates of each texture within it. To fetch a particular texture (`texN`), we sample `tAtlas`, which gives us `vUV` (the texture’s bounds) and `vDim` (its dimensions) within `map`. Lastly, we take the mod by `vDim` so we don’t scroll into the next texture and voilà! We can extract individual textures from the atlas and apply them to walls and floors and ceilings.
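The same lookup can be expressed on the CPU, which may make the shader easier to follow. This is a sketch with hypothetical names that mirrors the GLSL above: `atlas` stands in for the `tAtlas` lookup texture.

```javascript
// CPU reference of the shader's atlas lookup and tiling.
// atlas: array of [u0, v0, u1, v1] bounds, indexed by texN.
function atlasLookup(atlas, texN, mapUv) {
  const [u0, v0, u1, v1] = atlas[texN];     // vUV in the shader
  const dim = [u1 - u0, v1 - v0];           // vDim: size in atlas space
  // GLSL-style mod (result always positive) keeps repeated (tiled)
  // coordinates inside this texture's rectangle, so we never sample
  // a neighbouring texture in the atlas.
  const wrap = (x, m) => ((x % m) + m) % m;
  return [
    wrap(mapUv[0] * dim[0], dim[0]) + u0,
    wrap(mapUv[1] * dim[1], dim[1]) + v0,
  ];
}
```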
In the end, Dunshire Doom now renders the whole map as a single mesh with a single texture and therefore 1 draw call. It’s a little mind-boggling for me that I can go from rendering 30K triangles at 30fps to 5.6M triangles at 120fps just by changing my approach.
Caveat for benchmarks: this is an incomplete change. While the DOOM map is rendered as a single mesh, each monster is still a separate draw call and, for maps with lots of monsters, that makes a big difference. For these comparisons between the old renderer (R1) and the new (R2), I’ve turned off monsters. I’ll revisit this later as future work.
Okay, here are the numbers so far:
| Map | R1 Average FPS | R1 Draw Calls | R2 Average FPS | R2 Draw Calls |
|---|---|---|---|---|
| Sunlust MAP01 | 35fps | 5794 | 110fps | 6 |
| Sunlust MAP02 | 20fps | 9008 | 104fps | 6 |
| Sunlust MAP29 | 7fps | 7865 | 100fps | 6 |
| Sunlust MAP30 | 3fps | 38,542 | 104fps | 6 |
| nuts MAP01 | 120fps | 181 | 120fps | 6 |
| Cosmogenesis MAP05 | 5fps | 13,200 | 113fps | 6 |
| Profane Promiseland MAP01 | Crash! | Crash! | 120fps | 6 |
You can try it yourself by loading your favourite wad into Dunshire Doom and toggling the renderer from R1 to R2. NOTE for v0.8: don’t forget to check “No items/monsters” before loading a large map!
Performance Details
Obviously the reduction in draw calls is having a huge impact here but that’s not the whole story. With this change, the renderer has moved away from each wall being a svelte component that subscribes to floor/ceiling height, texture, and light changes. This makes a huge difference. Instead of tens of walls each subscribing to changes in the room floor height, for the rare event that the floor moves, we now have only 1 subscription that updates those 10 walls. Across the whole map, that could be thousands or tens of thousands fewer subscriptions. In fact, I think we can get even more performance improvement with a kind of map-changed event because most of a DOOM map is static. We don’t need subscribers listening for events that very rarely happen. I’ll tackle this in the future.
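A sketch of that difference (hypothetical names; not the actual component code): instead of every wall holding its own store subscription, the sector holds one subscription that fans out to all of its walls.

```javascript
// Minimal svelte-like writable store, just enough to count subscribers.
function makeStore(value) {
  const subs = new Set();
  return {
    subscribe(fn) { subs.add(fn); fn(value); return () => subs.delete(fn); },
    set(v) { value = v; subs.forEach(fn => fn(v)); },
    get subscriberCount() { return subs.size; },
  };
}

// R2 approach: one subscription per sector, updating every wall's
// geometry, instead of one subscription per wall. Hypothetical sketch.
function subscribeSector(floorHeight, walls) {
  return floorHeight.subscribe(h => {
    for (const wall of walls) wall.bottom = h; // update vertex data
  });
}
```

With 10 walls there is still exactly one subscription, so a moving floor triggers one callback instead of ten.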
Another improvement comes from lighting. In R1, each floor, ceiling, wall, and monster in a room subscribes to the light level of the room. There is also a global extra light that overrides the light level and is used when the player fires their weapon or picks up the light visor powerup. While R1 was easy to write, it didn’t perform well: when the player fired their weapon, every floor, ceiling, wall, and monster in the map had to update its own light level (ouch)! In R2, we pass the extra light to the shader, which means there is only 1 subscription (instead of thousands). Further, because floors, ceilings, and walls all share the light value from the room, we create a texture where each pixel represents the light level of one room and, when we render a wall, floor, or ceiling, we read the value from that texture.
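The sector-light texture can be sketched like this (hypothetical names; the real code would write into something like a ThreeJS DataTexture): one byte per sector, updated only when that sector’s light changes, while the muzzle-flash extra light is a single shader uniform.

```javascript
// One pixel (byte) per sector holds that room's light level. Walls,
// floors, and ceilings look their light up by sector index in the
// shader, so changing a room's light is a single byte write and a
// muzzle flash is a single uniform update, not thousands of
// per-component store notifications. Hypothetical sketch.
function createLightMap(sectorCount) {
  const data = new Uint8Array(sectorCount);   // stands in for a DataTexture
  return {
    data,
    setLight(sector, level) { data[sector] = level; },
    // What the fragment shader effectively computes per pixel:
    shade(sector, extraLight) { return Math.min(255, data[sector] + extraLight); },
  };
}
```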
I was able to get other wins by moving more computation into the shader. For example, the computation for fake contrast in R1 was in the wall component:
$: fakeContrastValue =
    $fakeContrast === 'classic' ? (
        linedef.v[1].x === linedef.v[0].x ? 16 :
        linedef.v[1].y === linedef.v[0].y ? -16 :
        0
    ) :
    $fakeContrast === 'gradual' ? Math.cos(angle * 2 + Math.PI) * 16 :
    0;
And now the code has moved to the shader (branch-free! although branching may be better in this case):
const float fakeContrastStep = 16.0 / 256.0;
float fakeContrast(vec3 normal) {
    vec3 absNormal = abs(normal);
    float dfc = float(doomFakeContrast);
    // select the mode without branching: 1.0 when active, 0.0 otherwise
    float gradual = step(2.0, dfc);
    float classic = step(1.0, dfc) * (1.0 - gradual);
    return (
        (classic * (
            step(1.0, absNormal.y) * -fakeContrastStep +
            step(1.0, absNormal.x) * fakeContrastStep
        )) +
        (gradual * (
            (smoothstep(0.0, 1.0, absNormal.y) * -fakeContrastStep) +
            (smoothstep(0.0, 1.0, absNormal.x) * fakeContrastStep)
        ))
    );
}
I’m not sure how to measure shader performance, so I can’t assess the cost or benefit of this change. It feels like the right direction though, because fake contrast is purely a rendering concern and GPUs are good at that. We want to free up the CPU and JS thread for other work when possible.
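One way to sanity-check the branch-free version, short of GPU profiling, is to port it to JS and compare it against the straightforward branchy form. This is a sketch assuming `doomFakeContrast` is 1 for classic and 2 for gradual, matching the shader; `step` and `smoothstep` are reimplemented with their GLSL semantics.

```javascript
const FAKE_CONTRAST_STEP = 16 / 256;
const step = (edge, x) => (x >= edge ? 1 : 0);  // GLSL step()
const smoothstep = (e0, e1, x) => {              // GLSL smoothstep()
  const t = Math.min(Math.max((x - e0) / (e1 - e0), 0), 1);
  return t * t * (3 - 2 * t);
};

// Branch-free port of the shader function (mode: 1 = classic, 2 = gradual).
function fakeContrast(normal, mode) {
  const ax = Math.abs(normal[0]), ay = Math.abs(normal[1]);
  const gradual = step(2, mode);
  const classic = step(1, mode) * (1 - gradual);
  return (
    classic * (step(1, ay) * -FAKE_CONTRAST_STEP + step(1, ax) * FAKE_CONTRAST_STEP) +
    gradual * (smoothstep(0, 1, ay) * -FAKE_CONTRAST_STEP + smoothstep(0, 1, ax) * FAKE_CONTRAST_STEP)
  );
}

// The branchy classic version it should agree with.
function fakeContrastClassic(normal) {
  const ax = Math.abs(normal[0]), ay = Math.abs(normal[1]);
  return ay === 1 ? -FAKE_CONTRAST_STEP : ax === 1 ? FAKE_CONTRAST_STEP : 0;
}
```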
I’ll miss svelte components and stores though. This project was a chance to play with svelte and Threlte. Perhaps I could have structured the data to make better use of stores and reduce subscribers, but I still wonder if this project shows those solutions weren’t the right fit. Threlte’s mission is to “Rapidly build interactive 3D apps for the web”, and that is true: I could and did build this app quickly, but the performance wouldn’t scale. To really perform well, I needed to move away from property change events and thousands of components, because each component brings overhead and, for large maps, the cost was too high. ThreeJS has always felt a little daunting to me, but Threlte was approachable and fun and helped me understand how ThreeJS works. Threlte is fantastic and I would recommend it to anyone who isn’t already familiar with ThreeJS. Perhaps ThreeJS is a stepping stone for me to dust off my OpenGL knowledge. I’ll have more detailed thoughts on this after completing the future work.
Map Load Time
As I started playing with larger maps (like Sunder), I quickly grew impatient because they could take 25-35s to load. With browser profiling tools and a few console.time() calls, it wasn’t hard to spot the places to fix.
- When visiting sectors to render, we would filter linedefs on each loop iteration. Sorting linedefs by sector before the loop saved 5-8s during map load.
- Texture animations were stored in a list and we were searching that list pretty frequently. Switching to a map saved almost 2s of load time.
- Stopped using svelte components for walls/floors/ceilings of maps. Not only does this reduce draw calls as discussed above, it also seems about 10x faster (2500ms to 250ms), and the browser appears to use 50% less memory.
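The first fix above is a classic filter-in-a-loop to group-by-key change. A minimal sketch (hypothetical object shapes; real linedefs carry much more data):

```javascript
// Before: O(sectors × linedefs) — re-filter the whole list per sector.
function wallsPerSectorSlow(sectors, linedefs) {
  return sectors.map(s => linedefs.filter(ld => ld.sector === s.id));
}

// After: O(linedefs) — group once up front, then each sector is a
// constant-time map lookup inside the render loop.
function groupLinedefsBySector(linedefs) {
  const bySector = new Map();
  for (const ld of linedefs) {
    const list = bySector.get(ld.sector);
    if (list) list.push(ld); else bySector.set(ld.sector, [ld]);
  }
  return bySector;
}
```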
It still takes way too long to load a map. Cosmogenesis MAP05, for example, still takes about 15s to load but 11s of that is spent figuring out implicit vertices for subsectors. I don’t have a good idea on how to optimize that code. I’ve been experimenting with community DOOM maps and I’ve learned that many seem to have incomplete BSP nodes. I’ve also learned that many DOOM ports simply regenerate the BSP on load using zdbsp so perhaps I’ll end up with something similar.
Bonus: Lights!
With the extra performance, I now had a chance to play with some more advanced features like lights.
I think there is a neat opportunity for a set of maps designed around an orthographic camera which are dark and moody and take advantage of lighting and shadowing. Hard shadows look pretty cool in classic DOOM!
Future Work
I’m excited that map geometry renders much faster but it’s not enough. Now that I’ve got a little taste of optimization, I’d like to try and get large maps playable (at least on powerful computers). To get there, I’ll need to:
- Use instanced geometry for monsters. Now that I’ve built a texture atlas, sprites should be similar, although there are additional complexities: geometry size based on texture size, rotations, interpolation of movement, animation, “fullbright” states. It’ll be more shader work than the map geometry was, but I think it’s doable.
- Remove svelte stores. The core game logic is full of svelte/store. I love svelte, and stores are pretty cool and may get even better with Svelte 5, but it’s probably not efficient enough. Why subscribe for texture or lighting changes when 90% of walls and rooms won’t ever change? We can be much more efficient with an onMapChanged event.
- Now that I’ve got a little experience with shaders, I think I could move scrolling texture logic into the shader. The benefit is that the JS thread isn’t occupied updating some variables and copying data to the GPU. Instead, all we do is copy the game time and let the shader handle the scrolling.
- The current texture atlas is pretty inefficient. It creates a giant texture that is mostly empty and doesn’t fit content very well. To reduce memory usage, especially for mobile, we should do better.
- Map sections. It doesn’t seem expensive to have one geometry for the whole map but perhaps it would be more efficient to cut the map into sections and only render the visible sections.
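For the scrolling-texture item above, the shader-side version could be as simple as deriving each wall’s UV offset from a single time uniform. A CPU sketch of what the shader would compute (hypothetical names; scroll speeds are in texture-repeats per second):

```javascript
// Per frame, JS uploads only one uniform (the game time); the shader
// derives every scrolling wall's offset itself. CPU reference sketch.
function scrollOffset(timeSeconds, speedU, speedV, dim) {
  const wrap = (x, m) => ((x % m) + m) % m;   // GLSL-style mod
  // Wrap by the texture's atlas dimensions (dim, i.e. vDim) so the
  // scroll never samples outside this texture's atlas rectangle.
  return [
    wrap(timeSeconds * speedU * dim[0], dim[0]),
    wrap(timeSeconds * speedV * dim[1], dim[1]),
  ];
}
```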
Of course, I’d also like to run zdbsp (or equivalent) instead of the subsector vertex stuff I’m doing now but that’s maybe a future future work.