# Objective

- Improve rendering performance, particularly by avoiding the large system command costs of using the ECS in the way that the render world does.

## Solution

- Define `EntityHasher` that calculates a hash from `Entity.to_bits()` by `i | (i.wrapping_mul(0x517cc1b727220a95) << 32)`. `0x517cc1b727220a95` is something like `u64::MAX / N` for an N that gives a value close to π and that works well for hashing. Thanks to @SkiFire13 for the suggestion and to @nicopap for alternative suggestions and discussion. This approach comes from `rustc-hash` (a.k.a. `FxHasher`) with some tweaks for the case of hashing an `Entity`. `FxHasher` and `SeaHasher` were also tested but were significantly slower. (A minimal sketch of this hasher appears at the end of this description.)
- Define an `EntityHashMap` type that uses the `EntityHasher`.
- Use `EntityHashMap<Entity, T>` for render world entity storage, including:
  - `RenderMaterialInstances` - contains the `AssetId<M>` of the material associated with the entity. Also for 2D.
  - `RenderMeshInstances` - contains mesh transforms, flags and properties about mesh entities. Also for 2D.
  - `SkinIndices` and `MorphIndices` - contain the skin and morph index for an entity, respectively
  - `ExtractedSprites`
  - `ExtractedUiNodes`

## Benchmarks

All benchmarks have been conducted on an M1 Max connected to AC power. The tests are run for 1500 frames. The 1000th frame is captured for comparison to check for visual regressions. There were none.

### 2D Meshes

`bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d`

#### `--ordered-z`

This test spawns the 2D meshes with z incrementing back to front, which is the ideal allocation order as it matches the sorted render order, meaning lookups have a high cache hit rate.

<img width="1112" alt="Screenshot 2023-09-27 at 07 50 45" src="https://github.com/bevyengine/bevy/assets/302146/e140bc98-7091-4a3b-8ae1-ab75d16d2ccb">

-39.1% median frame time.

#### Random

This test spawns the 2D meshes with random z. This not only makes the batching and transparent 2D pass lookups get a lot of cache misses, it also currently means that the meshes are almost certain not to be batchable.

<img width="1108" alt="Screenshot 2023-09-27 at 07 51 28" src="https://github.com/bevyengine/bevy/assets/302146/29c2e813-645a-43ce-982a-55df4bf7d8c4">

-7.2% median frame time.

### 3D Meshes

`many_cubes --benchmark`

<img width="1112" alt="Screenshot 2023-09-27 at 07 51 57" src="https://github.com/bevyengine/bevy/assets/302146/1a729673-3254-4e2a-9072-55e27c69f0fc">

-7.7% median frame time.

### Sprites

**NOTE: On `main` sprites are using `SparseSet<Entity, T>`!**

`bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite`

#### `--ordered-z`

This test spawns the sprites with z incrementing back to front, which is the ideal allocation order as it matches the sorted render order, meaning lookups have a high cache hit rate.

<img width="1116" alt="Screenshot 2023-09-27 at 07 52 31" src="https://github.com/bevyengine/bevy/assets/302146/bc8eab90-e375-4d31-b5cd-f55f6f59ab67">

+13.0% median frame time.

#### Random

This test spawns the sprites with random z. This makes the batching and transparent 2D pass lookups get a lot of cache misses.

<img width="1109" alt="Screenshot 2023-09-27 at 07 53 01" src="https://github.com/bevyengine/bevy/assets/302146/22073f5d-99a7-49b0-9584-d3ac3eac3033">

+0.6% median frame time.
### UI

**NOTE: On `main` UI is using `SparseSet<Entity, T>`!**

`many_buttons`

<img width="1111" alt="Screenshot 2023-09-27 at 07 53 26" src="https://github.com/bevyengine/bevy/assets/302146/66afd56d-cbe4-49e7-8b64-2f28f6043d85">

+15.1% median frame time.

## Alternatives

- Cart originally suggested trying out `SparseSet<Entity, T>` and indeed that is slightly faster under ideal conditions. However, `EntityHashMap<Entity, T>` has better worst-case performance when data is randomly distributed rather than in sorted render order, and does not have the worst-case memory usage of `SparseSet`, whose `Vec<usize>` mapping from the `Entity` index to an index into the `Vec<T>` of data has to be as large as the largest `Entity` index used with the `SparseSet`.
- I also tested `EntityHashMap<u32, T>`, intending to use `Entity.index()` as the key, but this proved to sometimes be slower and mostly no different.
- The only outstanding approach that has not been implemented and tested is to _not_ clear the render world of its entities each frame. That has its own problems, though they could perhaps be solved.
- Performance-wise, if the entities and their component data were not cleared, they would incur table moves on spawn but not thereafter; their component data would just be overwritten. Ideally we would have a neat way of either updating data in-place via `&mut T` queries, or inserting components if not present. This would likely be quite cumbersome to have to remember to do everywhere, but perhaps it only needs to be done in the more performance-sensitive systems.
- The main problem to solve, however, is that we want to maintain a mapping between main world entities and render world entities, be able to run the render app and world in parallel with the main app and world for pipelined rendering, and at the same time be able to spawn entities in the render world in such a way that those `Entity` ids do not collide with those spawned in the main world. This is potentially quite solvable, but could well be a lot of ECS work to do in a way that makes sense.

---

## Changelog

- Changed: Component data for entities to be drawn is no longer stored on entities in the render world. Instead, data is stored in an `EntityHashMap<Entity, T>` in various resources. This brings significant performance benefits due to the way the render app clears entities every frame. Resources of most interest are `RenderMeshInstances` and `RenderMaterialInstances`, and their 2D counterparts.

## Migration Guide

Previously the render app extracted mesh entities and their component data from the main world and stored them as entities and components in the render world. Now they are extracted into essentially `EntityHashMap<Entity, T>` where each `T` is a struct containing an appropriate group of data. This means that while extract set systems will continue to run extract queries against the main world, they will now store their data in hash maps. Also, systems in later sets will either need to look up entities in the available resources such as `RenderMeshInstances`, or maintain their own `EntityHashMap<Entity, T>` for their own data (see the sketch at the end of this description).

Before:

```rust
fn queue_custom(
    material_meshes: Query<(Entity, &MeshTransforms, &Handle<Mesh>), With<InstanceMaterialData>>,
) {
    ...
    for (entity, mesh_transforms, mesh_handle) in &material_meshes {
        ...
    }
}
```

After:

```rust
fn queue_custom(
    render_mesh_instances: Res<RenderMeshInstances>,
    instance_entities: Query<Entity, With<InstanceMaterialData>>,
) {
    ...
    for entity in &instance_entities {
        let Some(mesh_instance) = render_mesh_instances.get(&entity) else {
            continue;
        };
        // The mesh handle in `AssetId<Mesh>` form, and the `MeshTransforms` can now
        // be found in `mesh_instance` which is a `RenderMeshInstance`
        ...
    }
}
```

---------

Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com>
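As a rough illustration of the hashing approach described in the Solution section (not necessarily the exact code that landed; `EntityHasherSketch`, `EntityHashMapSketch` and `demo` are names invented for this sketch), a hasher built around the `i | (i.wrapping_mul(0x517cc1b727220a95) << 32)` mixing step could look like this:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Sketch of a hasher specialized for `Entity::to_bits()` values.
#[derive(Default)]
struct EntityHasherSketch {
    hash: u64,
}

impl Hasher for EntityHasherSketch {
    fn finish(&self) -> u64 {
        self.hash
    }

    // Entities are expected to hash themselves through `write_u64`; arbitrary
    // byte streams are not supported by this sketch.
    fn write(&mut self, _bytes: &[u8]) {
        unimplemented!("EntityHasherSketch only hashes u64 values")
    }

    fn write_u64(&mut self, i: u64) {
        // Keep the low bits (the entity index) as-is and fill the high bits
        // with a multiplicative mix using a constant close to u64::MAX / pi.
        self.hash = i | (i.wrapping_mul(0x517cc1b727220a95) << 32);
    }
}

/// A `HashMap` that uses the sketched hasher. Keying it by `Entity` assumes
/// `Entity`'s `Hash` impl forwards `to_bits()` through a single `write_u64`.
type EntityHashMapSketch<K, V> = HashMap<K, V, BuildHasherDefault<EntityHasherSketch>>;

fn demo(entity_bits: u64) -> Option<u32> {
    // Keyed by the raw bits here to stay independent of `Entity`'s `Hash` impl.
    let mut map: EntityHashMapSketch<u64, u32> = EntityHashMapSketch::default();
    map.insert(entity_bits, 7);
    map.get(&entity_bits).copied()
}
```

Because only `write_u64` does real work, a lookup costs a multiply, a shift and an or, rather than the byte-stream processing of a general-purpose hasher.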
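And as a hedged sketch of the "maintain their own `EntityHashMap<Entity, T>`" option from the Migration Guide: `MyInstanceData`, `ExtractedMyInstances` and `extract_my_instances` are invented names, and the exact export path of `EntityHashMap` (assumed here to be `bevy_utils`) may differ.

```rust
use bevy_ecs::prelude::{Component, Entity, Query, ResMut, Resource};
use bevy_render::Extract;
use bevy_utils::EntityHashMap;

/// Invented per-entity data living in the main world.
#[derive(Component, Clone)]
struct MyInstanceData {
    tint: [f32; 4],
}

/// Render-world resource holding the extracted data, keyed by main-world entity.
#[derive(Resource, Default)]
struct ExtractedMyInstances(EntityHashMap<Entity, MyInstanceData>);

/// Extract-set system: still queries the main world, but stores the results in
/// a hash map resource instead of as components on render-world entities.
fn extract_my_instances(
    mut extracted: ResMut<ExtractedMyInstances>,
    query: Extract<Query<(Entity, &MyInstanceData)>>,
) {
    // The render world is cleared every frame, so rebuild the map each extract.
    extracted.0.clear();
    for (entity, data) in query.iter() {
        extracted.0.insert(entity, data.clone());
    }
}
```

Later-set systems then query only for `Entity` (plus any render-world markers) and call `extracted.0.get(&entity)`, mirroring the `RenderMeshInstances` lookup shown in the "After" example above.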
use bevy_ecs::{
    component::Component,
    prelude::Res,
    query::{QueryItem, ReadOnlyWorldQuery},
    system::{Query, ResMut, StaticSystemParam, SystemParam, SystemParamItem},
};
use bevy_utils::nonmax::NonMaxU32;

use crate::{
    render_phase::{CachedRenderPipelinePhaseItem, DrawFunctionId, RenderPhase},
    render_resource::{CachedRenderPipelineId, GpuArrayBuffer, GpuArrayBufferable},
    renderer::{RenderDevice, RenderQueue},
};

/// Add this component to mesh entities to disable automatic batching
#[derive(Component)]
pub struct NoAutomaticBatching;

/// Data necessary to be equal for two draw commands to be mergeable
///
/// This is based on the following assumptions:
/// - Only entities with prepared assets (pipelines, materials, meshes) are
///   queued to phases
/// - View bindings are constant across a phase for a given draw function as
///   phases are per-view
/// - `batch_and_prepare_render_phase` is the only system that performs this
///   batching and has sole responsibility for preparing the per-object data.
///   As such the mesh binding and dynamic offsets are assumed to only be
///   variable as a result of the `batch_and_prepare_render_phase` system, e.g.
///   due to having to split data across separate uniform bindings within the
///   same buffer due to the maximum uniform buffer binding size.
#[derive(PartialEq)]
struct BatchMeta<T: PartialEq> {
    /// The pipeline id encompasses all pipeline configuration including vertex
    /// buffers and layouts, shaders and their specializations, bind group
    /// layouts, etc.
    pipeline_id: CachedRenderPipelineId,
    /// The draw function id defines the RenderCommands that are called to
    /// set the pipeline and bindings, and make the draw command
    draw_function_id: DrawFunctionId,
    dynamic_offset: Option<NonMaxU32>,
    user_data: T,
}

impl<T: PartialEq> BatchMeta<T> {
    fn new(item: &impl CachedRenderPipelinePhaseItem, user_data: T) -> Self {
        BatchMeta {
            pipeline_id: item.cached_pipeline(),
            draw_function_id: item.draw_function(),
            dynamic_offset: item.dynamic_offset(),
            user_data,
        }
    }
}

/// A trait to support getting data used for batching draw commands via phase
/// items.
pub trait GetBatchData {
    type Param: SystemParam + 'static;
    type Query: ReadOnlyWorldQuery;
    type QueryFilter: ReadOnlyWorldQuery;
    /// Data used for comparison between phase items. If the pipeline id, draw
    /// function id, per-instance data buffer dynamic offset and this data
    /// matches, the draws can be batched.
    type CompareData: PartialEq;
    /// The per-instance data to be inserted into the [`GpuArrayBuffer`]
    /// containing these data for all instances.
    type BufferData: GpuArrayBufferable + Sync + Send + 'static;
    /// Get the per-instance data to be inserted into the [`GpuArrayBuffer`].
    /// If the instance can be batched, also return the data used for
    /// comparison when deciding whether draws can be batched, else return None
    /// for the `CompareData`.
    fn get_batch_data(
        param: &SystemParamItem<Self::Param>,
        query_item: &QueryItem<Self::Query>,
    ) -> (Self::BufferData, Option<Self::CompareData>);
}

/// Batch the items in a render phase. This means comparing metadata needed to draw each phase item
/// and trying to combine the draws into a batch.
pub fn batch_and_prepare_render_phase<I: CachedRenderPipelinePhaseItem, F: GetBatchData>(
    gpu_array_buffer: ResMut<GpuArrayBuffer<F::BufferData>>,
    mut views: Query<&mut RenderPhase<I>>,
    query: Query<F::Query, F::QueryFilter>,
    param: StaticSystemParam<F::Param>,
) {
    let gpu_array_buffer = gpu_array_buffer.into_inner();
    let system_param_item = param.into_inner();

    let mut process_item = |item: &mut I| {
        let batch_query_item = query.get(item.entity()).ok()?;

        let (buffer_data, compare_data) = F::get_batch_data(&system_param_item, &batch_query_item);
        let buffer_index = gpu_array_buffer.push(buffer_data);

        let index = buffer_index.index.get();
        *item.batch_range_mut() = index..index + 1;
        *item.dynamic_offset_mut() = buffer_index.dynamic_offset;

        compare_data.map(|compare_data| BatchMeta::new(item, compare_data))
    };

    for mut phase in &mut views {
        let items = phase.items.iter_mut().map(|item| {
            let batch_data = process_item(item);
            (item.batch_range_mut(), batch_data)
        });
        items.reduce(|(start_range, prev_batch_meta), (range, batch_meta)| {
            if batch_meta.is_some() && prev_batch_meta == batch_meta {
                start_range.end = range.end;
                (start_range, prev_batch_meta)
            } else {
                (range, batch_meta)
            }
        });
    }
}

pub fn write_batched_instance_buffer<F: GetBatchData>(
    render_device: Res<RenderDevice>,
    render_queue: Res<RenderQueue>,
    gpu_array_buffer: ResMut<GpuArrayBuffer<F::BufferData>>,
) {
    let gpu_array_buffer = gpu_array_buffer.into_inner();
    gpu_array_buffer.write_buffer(&render_device, &render_queue);
    gpu_array_buffer.clear();
}
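
// For context, a minimal, hypothetical implementation of the `GetBatchData`
// trait above might look like the following sketch. `MyBatchedPipeline`,
// `InstanceColor` and `InstanceColorUniform` are invented names, the module
// path `bevy_render::batching` for the trait is an assumption, and wiring up
// `batch_and_prepare_render_phase::<I, MyBatchedPipeline>` and
// `write_batched_instance_buffer::<MyBatchedPipeline>` as render-app systems
// is omitted.

use bevy_ecs::{component::Component, query::QueryItem, system::SystemParamItem};
use bevy_math::Vec4;
use bevy_render::batching::GetBatchData;
use bevy_render::render_resource::ShaderType;

/// Invented per-entity component providing the instance data.
#[derive(Component)]
pub struct InstanceColor(pub Vec4);

/// Invented per-instance GPU data; deriving `ShaderType` plus `Clone`
/// satisfies the `GpuArrayBufferable` bound on `GetBatchData::BufferData`.
#[derive(Clone, ShaderType)]
pub struct InstanceColorUniform {
    pub color: Vec4,
}

/// Invented marker type to hang the implementation on.
pub struct MyBatchedPipeline;

impl GetBatchData for MyBatchedPipeline {
    // No extra system parameters are needed for this sketch.
    type Param = ();
    // Fetch the per-entity color when preparing each phase item.
    type Query = &'static InstanceColor;
    type QueryFilter = ();
    // Nothing beyond the pipeline id, draw function id and dynamic offset
    // needs to match for two draws to merge, so the comparison data is empty.
    type CompareData = ();
    // The per-instance data pushed into the `GpuArrayBuffer`.
    type BufferData = InstanceColorUniform;

    fn get_batch_data(
        _param: &SystemParamItem<Self::Param>,
        color: &QueryItem<Self::Query>,
    ) -> (Self::BufferData, Option<Self::CompareData>) {
        // Returning `Some(())` marks the item as batchable.
        (InstanceColorUniform { color: color.0 }, Some(()))
    }
}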