Commit Graph

13 Commits

Author SHA1 Message Date
atlas dostal
e29d0d573b Use bit ops instead of integer modulo and divide in shaders 2025-07-06 20:49:29 -04:00
atlv
57e58ef997
Meshlet BVH Culling (#19318)
# Objective

- Merge @SparkyPotato 's efforts to implement BVH-accelerated meshlet
culling.

## Solution

- Add hot reloading support
- Fix near-plane overculling
- Fix hzb sampling
- Fix orthographic error metric

## Testing

- Meshlet example, Nsight, hot-reloading and careful thinking

---------

Co-authored-by: SparkyPotato <noob.sparkypotato@gmail.com>
Co-authored-by: JMS55 <47158642+JMS55@users.noreply.github.com>
Co-authored-by: charlotte <charlotte.c.mcelwain@gmail.com>
2025-06-29 00:04:21 +00:00
charlotte 🌸
96dcbc5f8c
Ugrade to wgpu version 25.0 (#19563)
# Objective

Upgrade to `wgpu` version `25.0`.

Depends on https://github.com/bevyengine/naga_oil/pull/121

## Solution

### Problem

The biggest issue we face upgrading is the following requirement:
> To facilitate this change, there was an additional validation rule put
in place: if there is a binding array in a bind group, you may not use
dynamic offset buffers or uniform buffers in that bind group. This
requirement comes from vulkan rules on UpdateAfterBind descriptors.

This is a major difficulty for us, as there are a number of binding
arrays that are used in the view bind group. Note, this requirement does
not affect merely uniform buffors that use dynamic offset but the use of
*any* uniform in a bind group that also has a binding array.

### Attempted fixes

The easiest fix would be to change uniforms to be storage buffers
whenever binding arrays are in use:
```wgsl
#ifdef BINDING_ARRAYS_ARE_USED
@group(0) @binding(0) var<uniform> view: View;
@group(0) @binding(1) var<uniform> lights: types::Lights;
#else
@group(0) @binding(0) var<storage> view: array<View>;
@group(0) @binding(1) var<storage> lights: array<types::Lights>;
#endif
```

This requires passing the view index to the shader so that we know where
to index into the buffer:

```wgsl
struct PushConstants {
    view_index: u32,
}

var<push_constant> push_constants: PushConstants;
```

Using push constants is no problem because binding arrays are only
usable on native anyway.

However, this greatly complicates the ability to access `view` in
shaders. For example:
```wgsl
#ifdef BINDING_ARRAYS_ARE_USED
mesh_view_bindings::view.view_from_world[0].z
#else
mesh_view_bindings::view[mesh_view_bindings::view_index].view_from_world[0].z
#endif
```

Using this approach would work but would have the effect of polluting
our shaders with ifdef spam basically *everywhere*.

Why not use a function? Unfortunately, the following is not valid wgsl
as it returns a binding directly from a function in the uniform path.

```wgsl
fn get_view() -> View {
#if BINDING_ARRAYS_ARE_USED
    let view_index = push_constants.view_index;
    let view = views[view_index];
#endif
    return view;
}
```

This also poses problems for things like lights where we want to return
a ptr to the light data. Returning ptrs from wgsl functions isn't
allowed even if both bindings were buffers.

The next attempt was to simply use indexed buffers everywhere, in both
the binding array and non binding array path. This would be viable if
push constants were available everywhere to pass the view index, but
unfortunately they are not available on webgpu. This means either
passing the view index in a storage buffer (not ideal for such a small
amount of state) or using push constants sometimes and uniform buffers
only on webgpu. However, this kind of conditional layout infects
absolutely everything.

Even if we were to accept just using storage buffer for the view index,
there's also the additional problem that some dynamic offsets aren't
actually per-view but per-use of a setting on a camera, which would
require passing that uniform data on *every* camera regardless of
whether that rendering feature is being used, which is also gross.

As such, although it's gross, the simplest solution just to bump binding
arrays into `@group(1)` and all other bindings up one bind group. This
should still bring us under the device limit of 4 for most users.

### Next steps / looking towards the future

I'd like to avoid needing split our view bind group into multiple parts.
In the future, if `wgpu` were to add `@builtin(draw_index)`, we could
build a list of draw state in gpu processing and avoid the need for any
kind of state change at all (see
https://github.com/gfx-rs/wgpu/issues/6823). This would also provide
significantly more flexibility to handle things like offsets into other
arrays that may not be per-view.

### Testing

Tested a number of examples, there are probably more that are still
broken.

---------

Co-authored-by: François Mockers <mockersf@gmail.com>
Co-authored-by: Elabajaba <Elabajaba@users.noreply.github.com>
2025-06-26 19:41:47 +00:00
JMS55
2fd4cc4937
Meshlet texture atomics (#17765)
* Use texture atomics rather than buffer atomics for the visbuffer
(haven't tested perf on a raster-heavy scene yet)
* Unfortunately to clear the visbuffer we now need a compute pass to
clear it. Using wgpu's clear_texture function internally uses a buffer
-> image copy that's insanely expensive. Ideally it should be using
vkCmdClearColorImage, which I've opened an issue for
https://github.com/gfx-rs/wgpu/issues/7090. For now we'll have to stick
with a custom compute pass and all the extra code that brings.
* Faster resolve depth pass by discarding 0 depth pixels instead of
redundantly writing zero (2x faster for big depth textures like shadow
views)
2025-02-12 18:15:43 +00:00
JMS55
3fb6cefb2f
Meshlet fill cluster buffers rewritten (#15955)
# Objective
- Make the meshlet fill cluster buffers pass slightly faster
- Address https://github.com/bevyengine/bevy/issues/15920 for meshlets
- Added PreviousGlobalTransform as a required meshlet component to avoid
extra archetype moves, slightly alleviating
https://github.com/bevyengine/bevy/issues/14681 for meshlets
- Enforce that MeshletPlugin::cluster_buffer_slots is not greater than
2^25 (glitches will occur otherwise). Technically this field controls
post-lod/culling cluster count, and the issue is on pre-lod/culling
cluster count, but it's still valid now, and in the future this will be
more true.

Needs to be merged after https://github.com/bevyengine/bevy/pull/15846
and https://github.com/bevyengine/bevy/pull/15886

## Solution

- Old pass dispatched a thread per cluster, and did a binary search over
the instances to find which instance the cluster belongs to, and what
meshlet index within the instance it is.
- New pass dispatches a workgroup per instance, and has the workgroup
loop over all meshlets in the instance in order to write out the cluster
data.
- Use a push constant instead of arrayLength to fix the linked bug
- Remap 1d->2d dispatch for software raster only if actually needed to
save on spawning excess workgroups

## Testing

- Did you test these changes? If so, how?
- Ran the meshlet example, and an example with 1041 instances of 32217
meshlets per instance. Profiled the second scene with nsight, went from
0.55ms -> 0.40ms. Small savings. We're pretty much VRAM bandwidth bound
at this point.
- How can other people (reviewers) test your changes? Is there anything
specific they need to know?
  - Run the meshlet example

## Changelog (non-meshlets)
- PreviousGlobalTransform now implements the Default trait
2024-10-23 19:18:49 +00:00
JMS55
9d54fe0370
Meshlet new error projection (#15846)
* New error projection code taken from @zeux's meshoptimizer nanite.cpp
demo for determining LOD (thanks zeux!)
* Builder: `compute_lod_group_data()`
* Runtime: `lod_error_is_imperceptible()`
2024-10-22 20:14:30 +00:00
JMS55
aa626e4f0b
Per-meshlet compressed vertex data (#15643)
# Objective
- Prepare for streaming by storing vertex data per-meshlet, rather than
per-mesh (this means duplicating vertices per-meshlet)
- Compress vertex data to reduce the cost of this

## Solution
The important parts are in from_mesh.rs, the changes to the Meshlet type
in asset.rs, and the changes in meshlet_bindings.wgsl. Everything else
is pretty secondary/boilerplate/straightforward changes.

- Positions are quantized in centimeters with a user-provided power of 2
factor (ideally auto-determined, but that's a TODO for the future),
encoded as an offset relative to the minimum value within the meshlet,
and then stored as a packed list of bits using the minimum number of
bits needed for each vertex position channel for that meshlet
- E.g. quantize positions (lossly, throws away precision that's not
needed leading to using less bits in the bitstream encoding)
- Get the min/max quantized value of each X/Y/Z channel of the quantized
positions within a meshlet
- Encode values relative to the min value of the meshlet. E.g. convert
from [min, max] to [0, max - min]
- The new max value in the meshlet is (max - min), which only takes N
bits, so we only need N bits to store each channel within the meshlet
(lossless)
- We can store the min value and that it takes N bits per channel in the
meshlet metadata, and reconstruct the position from the bitstream
- Normals are octahedral encoded and than snorm2x16 packed and stored as
a single u32.
- Would be better to implement the precise variant of octhedral encoding
for extra precision (no extra decode cost), but decided to keep it
simple for now and leave that as a followup
- Tried doing a quantizing and bitstream encoding scheme like I did for
positions, but struggled to get it smaller. Decided to go with this for
simplicity for now
- UVs are uncompressed and take a full 64bits per vertex which is
expensive
  - In the future this should be improved
- Tangents, as of the previous PR, are not explicitly stored and are
instead derived from screen space gradients
- While I'm here, split up MeshletMeshSaverLoader into two separate
types

Other future changes include implementing a smaller encoding of triangle
data (3 u8 indices = 24 bits per triangle currently), and more
disk-oriented compression schemes.

References:
* "A Deep Dive into UE5's Nanite Virtualized Geometry"
https://advances.realtimerendering.com/s2021/Karis_Nanite_SIGGRAPH_Advances_2021_final.pdf#page=128
(also available on youtube)
* "Towards Practical Meshlet Compression"
https://arxiv.org/pdf/2404.06359
* "Vertex quantization in Omniforce Game Engine"
https://daniilvinn.github.io/2024/05/04/omniforce-vertex-quantization.html

## Testing

- Did you test these changes? If so, how?
- Converted the stanford bunny, and rendered it with a debug material
showing normals, and confirmed that it's identical to what's on main.
EDIT: See additional testing in the comments below.
- Are there any parts that need more testing?
- Could use some more size comparisons on various meshes, and testing
different quantization factors. Not sure if 4 is a good default. EDIT:
See additional testing in the comments below.
- Also did not test runtime performance of the shaders. EDIT: See
additional testing in the comments below.
- How can other people (reviewers) test your changes? Is there anything
specific they need to know?
- Use my unholy script, replacing the meshlet example
https://paste.rs/7xQHk.rs (must make MeshletMesh fields pub instead of
pub crate, must add lz4_flex as a dev-dependency) (must compile with
meshlet and meshlet_processor features, mesh must have only positions,
normals, and UVs, no vertex colors or tangents)

---

## Migration Guide
- TBD by JMS55 at the end of the release
2024-10-08 18:42:55 +00:00
JMS55
9cc7e7c080
Meshlet screenspace-derived tangents (#15084)
* Save 16 bytes per vertex by calculating tangents in the shader at
runtime, rather than storing them in the vertex data.
* Based on https://jcgt.org/published/0009/03/04,
https://www.jeremyong.com/graphics/2023/12/16/surface-gradient-bump-mapping.
* Fixed visbuffer resolve to use the updated algorithm that flips ddy
correctly
* Added some more docs about meshlet material limitations, and some
TODOs about transforming UV coordinates for the future.


![image](https://github.com/user-attachments/assets/222d8192-8c82-4d77-945d-53670a503761)

For testing add a normal map to the bunnies with StandardMaterial like
below, and then test that on both main and this PR (make sure to
download the correct bunny for each). Results should be mostly
identical.

```rust
normal_map_texture: Some(asset_server.load_with_settings(
    "textures/BlueNoise-Normal.png",
    |settings: &mut ImageLoaderSettings| settings.is_srgb = false,
)),
```
2024-09-29 18:39:25 +00:00
JMS55
6cc96f4c1f
Meshlet software raster + start of cleanup (#14623)
# Objective
- Faster meshlet rasterization path for small triangles
- Avoid having to allocate and write out a triangle buffer
- Refactor gpu_scene.rs

## Solution
- Replace the 32bit visbuffer texture with a 64bit visbuffer buffer,
where the left 32 bits encode depth, and the right 32 bits encode the
existing cluster + triangle IDs. Can't use 64bit textures, wgpu/naga
doesn't support atomic ops on textures yet.
- Instead of writing out a buffer of packed cluster + triangle IDs (per
triangle) to raster, the culling pass now writes out a buffer of just
cluster IDs (per cluster, so less memory allocated, cheaper to write
out).
  - Clusters for software raster are allocated from the left side
- Clusters for hardware raster are allocated in the same buffer, from
the right side
- The buffer size is fixed at MeshletPlugin build time, and should be
set to a reasonable value for your scene (no warning on overflow, and no
good way to determine what value you need outside of renderdoc - I plan
to fix this in a future PR adding a meshlet stats overlay)
- Currently I don't have a heuristic for software vs hardware raster
selection for each cluster. The existing code is just a placeholder. I
need to profile on a release scene and come up with a heuristic,
probably in a future PR.
- The culling shader is getting pretty hard to follow at this point, but
I don't want to spend time improving it as the entire shader/pass is
getting rewritten/replaced in the near future.
- Software raster is a compute workgroup per-cluster. Each workgroup
loads and transforms the <=64 vertices of the cluster, and then
rasterizes the <=64 triangles of the cluster.
- Two variants are implemented: Scanline for clusters with any larger
triangles (still smaller than hardware is good at), and brute-force for
very very tiny triangles
- Once the shader determines that a pixel should be filled in, it does
an atomicMax() on the visbuffer to store the results, copying how Nanite
works
- On devices with a low max workgroups per dispatch limit, an extra
compute pass is inserted before software raster to convert from a 1d to
2d dispatch (I don't think 3d would ever be necessary).
- I haven't implemented the top-left rule or subpixel precision yet, I'm
leaving that for a future PR since I get usable results without it for
now
- Resources used:
https://kristoffer-dyrkorn.github.io/triangle-rasterizer and chapters
6-8 of
https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index
- Hardware raster now spawns 64*3 vertex invocations per meshlet,
instead of the actual meshlet vertex count. Extra invocations just
early-exit.
- While this is slower than the existing system, hardware draws should
be rare now that software raster is usable, and it saves a ton of memory
using the unified cluster ID buffer. This would be fixed if wgpu had
support for mesh shaders.
- Instead of writing to a color+depth attachment, the hardware raster
pass also does the same atomic visbuffer writes that software raster
uses.
- We have to bind a dummy render target anyways, as wgpu doesn't
currently support render passes without any attachments
- Material IDs are no longer written out during the main rasterization
passes.
- If we had async compute queues, we could overlap the software and
hardware raster passes.
- New material and depth resolve passes run at the end of the visbuffer
node, and write out view depth and material ID depth textures

### Misc changes
- Fixed cluster culling importing, but never actually using the previous
view uniforms when doing occlusion culling
- Fixed incorrectly adding the LOD error twice when building the meshlet
mesh
- Splitup gpu_scene module into meshlet_mesh_manager, instance_manager,
and resource_manager
- resource_manager is still too complex and inefficient (extract and
prepare are way too expensive). I plan on improving this in a future PR,
but for now ResourceManager is mostly a 1:1 port of the leftover
MeshletGpuScene bits.
- Material draw passes have been renamed to the more accurate material
shade pass, as well as some other misc renaming (in the future, these
will be compute shaders even, and not actual draw calls)

---

## Migration Guide
- TBD (ask me at the end of the release for meshlet changes as a whole)

---------

Co-authored-by: vero <email@atlasdostal.com>
2024-08-26 17:54:34 +00:00
JMS55
77ebabc4fe
Meshlet remove per-cluster data upload (#13125)
# Objective

- Per-cluster (instance of a meshlet) data upload is ridiculously
expensive in both CPU and GPU time (8 bytes per cluster, millions of
clusters, you very quickly run into PCIE bandwidth maximums, and lots of
CPU-side copies and malloc).
- We need to be uploading only per-instance/entity data. Anything else
needs to be done on the GPU.

## Solution

- Per instance, upload:
- `meshlet_instance_meshlet_counts_prefix_sum` - An exclusive prefix sum
over the count of how many clusters each instance has.
- `meshlet_instance_meshlet_slice_starts` - The starting index of the
meshlets for each instance within the `meshlets` buffer.
- A new `fill_cluster_buffers` pass once at the start of the frame has a
thread per cluster, and finds its instance ID and meshlet ID via a
binary search of `meshlet_instance_meshlet_counts_prefix_sum` to find
what instance it belongs to, and then uses that plus
`meshlet_instance_meshlet_slice_starts` to find what number meshlet
within the instance it is. The shader then writes out the per-cluster
instance/meshlet ID buffers for later passes to quickly read from.
- I've gone from 45 -> 180 FPS in my stress test scene, and saved
~30ms/frame of overall CPU/GPU time.
2024-05-04 19:56:19 +00:00
JMS55
e1a0da0fa6
Meshlet LOD-compatible two-pass occlusion culling (#12898)
Keeping track of explicit visibility per cluster between frames does not
work with LODs, and leads to worse culling (using the final depth buffer
from the previous frame is more accurate).

Instead, we need to generate a second depth pyramid after the second
raster pass, and then use that in the first culling pass in the next
frame to test if a cluster would have been visible last frame or not.

As part of these changes, the write_index_buffer pass has been folded
into the culling pass for a large performance gain, and to avoid
tracking a lot of extra state that would be needed between passes.

Prepass previous model/view stuff was adapted to work with meshlets as
well.

Also fixed a bug with materials, and other misc improvements.

---------

Co-authored-by: François <mockersf@gmail.com>
Co-authored-by: atlas dostal <rodol@rivalrebels.com>
Co-authored-by: vero <email@atlasdostal.com>
Co-authored-by: Patrick Walton <pcwalton@mimiga.net>
Co-authored-by: Robert Swain <robert.swain@gmail.com>
2024-04-28 05:30:20 +00:00
JMS55
6d6810c90d
Meshlet continuous LOD (#12755)
Adds a basic level of detail system to meshlets. An extremely brief
summary is as follows:
* In `from_mesh.rs`, once we've built the first level of clusters, we
group clusters, simplify the new mega-clusters, and then split the
simplified groups back into regular sized clusters. Repeat several times
(ideally until you can't anymore). This forms a directed acyclic graph
(DAG), where the children are the meshlets from the previous level, and
the parents are the more simplified versions of their children. The leaf
nodes are meshlets formed from the original mesh.
* In `cull_meshlets.wgsl`, each cluster selects whether to render or not
based on the LOD bounding sphere (different than the culling bounding
sphere) of the current meshlet, the LOD bounding sphere of its parent
(the meshlet group from simplification), and the simplification error
relative to its children of both the current meshlet and its parent
meshlet. This kind of breaks two pass occlusion culling, which will be
fixed in a future PR by using an HZB from the previous frame to get the
initial list of occluders.

Many, _many_ improvements to be done in the future
https://github.com/bevyengine/bevy/issues/11518, not least of which is
code quality and speed. I don't even expect this to work on many types
of input meshes. This is just a basic implementation/draft for
collaboration.

Arguable how much we want to do in this PR, I'll leave that up to
maintainers. I've erred on the side of "as basic as possible".

References:
* Slides 27-77 (video available on youtube)
https://advances.realtimerendering.com/s2021/Karis_Nanite_SIGGRAPH_Advances_2021_final.pdf
*
https://blog.traverseresearch.nl/creating-a-directed-acyclic-graph-from-a-mesh-1329e57286e5
*
https://jglrxavpok.github.io/2024/01/19/recreating-nanite-lod-generation.html,
https://jglrxavpok.github.io/2024/03/12/recreating-nanite-faster-lod-generation.html,
https://jglrxavpok.github.io/2024/04/02/recreating-nanite-runtime-lod-selection.html,
and https://github.com/jglrxavpok/Carrot
*
https://github.com/gents83/INOX/tree/master/crates/plugins/binarizer/src
* https://cs418.cs.illinois.edu/website/text/nanite.html


![image](https://github.com/bevyengine/bevy/assets/47158642/e40bff9b-7d0c-4a19-a3cc-2aad24965977)

![image](https://github.com/bevyengine/bevy/assets/47158642/442c7da3-7761-4da7-9acd-37f15dd13e26)

---------

Co-authored-by: Ricky Taylor <rickytaylor26@gmail.com>
Co-authored-by: vero <email@atlasdostal.com>
Co-authored-by: François <mockersf@gmail.com>
Co-authored-by: atlas dostal <rodol@rivalrebels.com>
Co-authored-by: Patrick Walton <pcwalton@mimiga.net>
2024-04-23 21:43:53 +00:00
JMS55
4f20faaa43
Meshlet rendering (initial feature) (#10164)
# Objective
- Implements a more efficient, GPU-driven
(https://github.com/bevyengine/bevy/issues/1342) rendering pipeline
based on meshlets.
- Meshes are split into small clusters of triangles called meshlets,
each of which acts as a mini index buffer into the larger mesh data.
Meshlets can be compressed, streamed, culled, and batched much more
efficiently than monolithic meshes.


![image](https://github.com/bevyengine/bevy/assets/47158642/cb2aaad0-7a9a-4e14-93b0-15d4e895b26a)

![image](https://github.com/bevyengine/bevy/assets/47158642/7534035b-1eb7-4278-9b99-5322e4401715)

# Misc
* Future work: https://github.com/bevyengine/bevy/issues/11518
* Nanite reference:
https://advances.realtimerendering.com/s2021/Karis_Nanite_SIGGRAPH_Advances_2021_final.pdf
Two pass occlusion culling explained very well:
https://medium.com/@mil_kru/two-pass-occlusion-culling-4100edcad501

---------

Co-authored-by: Ricky Taylor <rickytaylor26@gmail.com>
Co-authored-by: vero <email@atlasdostal.com>
Co-authored-by: François <mockersf@gmail.com>
Co-authored-by: atlas dostal <rodol@rivalrebels.com>
2024-03-25 19:08:27 +00:00