Optimise Entity
with repr align & manual PartialOrd
/Ord
(#10558)
# Objective - Follow up on https://github.com/bevyengine/bevy/pull/10519, diving deeper into optimising `Entity` due to the `derive`d `PartialOrd` `partial_cmp` not being optimal with codegen: https://github.com/rust-lang/rust/issues/106107 - Fixes #2346. ## Solution Given the previous PR's solution and the other existing LLVM codegen bug, there seemed to be a potential further optimisation possible with `Entity`. In exploring providing manual `PartialOrd` impl, it turned out initially that the resulting codegen was not immediately better than the derived version. However, once `Entity` was given `#[repr(align(8)]`, the codegen improved remarkably, even more once the fields in `Entity` were rearranged to correspond to a `u64` layout (Rust doesn't automatically reorder fields correctly it seems). The field order and `align(8)` additions also improved `to_bits` codegen to be a single `mov` op. In turn, this led me to replace the previous "non-shortcircuiting" impl of `PartialEq::eq` to use direct `to_bits` comparison. The result was remarkably better codegen across the board, even for hastable lookups. The current baseline codegen is as follows: https://godbolt.org/z/zTW1h8PnY Assuming the following example struct that mirrors with the existing `Entity` definition: ```rust #[derive(Clone, Copy, Eq, PartialEq, PartialOrd, Ord)] pub struct FakeU64 { high: u32, low: u32, } ``` the output for `to_bits` is as follows: ``` example::FakeU64::to_bits: shl rdi, 32 mov eax, esi or rax, rdi ret ``` Changing the struct to: ```rust #[derive(Clone, Copy, Eq)] #[repr(align(8))] pub struct FakeU64 { low: u32, high: u32, } ``` and providing manual implementations for `PartialEq`/`PartialOrd`/`Ord`, `to_bits` now optimises to: ``` example::FakeU64::to_bits: mov rax, rdi ret ``` The full codegen example for this PR is here for reference: https://godbolt.org/z/n4Mjx165a To highlight, `gt` comparison goes from ``` example::greater_than: cmp edi, edx jae .LBB3_2 xor eax, eax ret .LBB3_2: setne dl cmp esi, ecx seta al or al, dl ret ``` to ``` example::greater_than: cmp rdi, rsi seta al ret ``` As explained on Discord by @scottmcm : >The root issue here, as far as I understand it, is that LLVM's middle-end is inexplicably unwilling to merge loads if that would make them under-aligned. It leaves that entirely up to its target-specific back-end, and thus a bunch of the things that you'd expect it to do that would fix this just don't happen. ## Benchmarks Before discussing benchmarks, everything was tested on the following specs: AMD Ryzen 7950X 16C/32T CPU 64GB 5200 RAM AMD RX7900XT 20GB Gfx card Manjaro KDE on Wayland I made use of the new entity hashing benchmarks to see how this PR would improve things there. With the changes in place, I first did an implementation keeping the existing "non shortcircuit" `PartialEq` implementation in place, but with the alignment and field ordering changes, which in the benchmark is the `ord_shortcircuit` column. The `to_bits` `PartialEq` implementation is the `ord_to_bits` column. The main_ord column is the current existing baseline from `main` branch.  My machine is not super set-up for benchmarking, so some results are within noise, but there's not just a clear improvement between the non-shortcircuiting implementation, but even further optimisation taking place with the `to_bits` implementation. On my machine, a fair number of the stress tests were not showing any difference (indicating other bottlenecks), but I was able to get a clear difference with `many_foxes` with a fox count of 10,000: Test with `cargo run --example many_foxes --features bevy/trace_tracy,wayland --release -- --count 10000`:  On avg, a framerate of about 28-29FPS was improved to 30-32FPS. "This trace" represents the current PR's perf, while "External trace" represents the `main` branch baseline. ## Changelog Changed: micro-optimized Entity align and field ordering as well as providing manual `PartialOrd`/`Ord` impls to help LLVM optimise further. ## Migration Guide Any `unsafe` code relying on field ordering of `Entity` or sufficiently cursed shenanigans should change to reflect the different internal representation and alignment requirements of `Entity`. Co-authored-by: james7132 <contact@jamessliu.com> Co-authored-by: NathanW <nathansward@comcast.net>
This commit is contained in:
parent
29f711cd40
commit
9c0fca072d
@ -115,9 +115,17 @@ type IdCursor = isize;
|
|||||||
/// [`EntityCommands`]: crate::system::EntityCommands
|
/// [`EntityCommands`]: crate::system::EntityCommands
|
||||||
/// [`Query::get`]: crate::system::Query::get
|
/// [`Query::get`]: crate::system::Query::get
|
||||||
/// [`World`]: crate::world::World
|
/// [`World`]: crate::world::World
|
||||||
#[derive(Clone, Copy, Eq, Ord, PartialOrd)]
|
#[derive(Clone, Copy)]
|
||||||
|
// Alignment repr necessary to allow LLVM to better output
|
||||||
|
// optimised codegen for `to_bits`, `PartialEq` and `Ord`.
|
||||||
|
#[repr(C, align(8))]
|
||||||
pub struct Entity {
|
pub struct Entity {
|
||||||
|
// Do not reorder the fields here. The ordering is explicitly used by repr(C)
|
||||||
|
// to make this struct equivalent to a u64.
|
||||||
|
#[cfg(target_endian = "little")]
|
||||||
|
index: u32,
|
||||||
generation: u32,
|
generation: u32,
|
||||||
|
#[cfg(target_endian = "big")]
|
||||||
index: u32,
|
index: u32,
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -126,7 +134,42 @@ pub struct Entity {
|
|||||||
impl PartialEq for Entity {
|
impl PartialEq for Entity {
|
||||||
#[inline]
|
#[inline]
|
||||||
fn eq(&self, other: &Entity) -> bool {
|
fn eq(&self, other: &Entity) -> bool {
|
||||||
(self.generation == other.generation) & (self.index == other.index)
|
// By using `to_bits`, the codegen can be optimised out even
|
||||||
|
// further potentially. Relies on the correct alignment/field
|
||||||
|
// order of `Entity`.
|
||||||
|
self.to_bits() == other.to_bits()
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
impl Eq for Entity {}
|
||||||
|
|
||||||
|
// The derive macro codegen output is not optimal and can't be optimised as well
|
||||||
|
// by the compiler. This impl resolves the issue of non-optimal codegen by relying
|
||||||
|
// on comparing against the bit representation of `Entity` instead of comparing
|
||||||
|
// the fields. The result is then LLVM is able to optimise the codegen for Entity
|
||||||
|
// far beyond what the derive macro can.
|
||||||
|
// See <https://github.com/rust-lang/rust/issues/106107>
|
||||||
|
impl PartialOrd for Entity {
|
||||||
|
#[inline]
|
||||||
|
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
|
||||||
|
// Make use of our `Ord` impl to ensure optimal codegen output
|
||||||
|
Some(self.cmp(other))
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// The derive macro codegen output is not optimal and can't be optimised as well
|
||||||
|
// by the compiler. This impl resolves the issue of non-optimal codegen by relying
|
||||||
|
// on comparing against the bit representation of `Entity` instead of comparing
|
||||||
|
// the fields. The result is then LLVM is able to optimise the codegen for Entity
|
||||||
|
// far beyond what the derive macro can.
|
||||||
|
// See <https://github.com/rust-lang/rust/issues/106107>
|
||||||
|
impl Ord for Entity {
|
||||||
|
#[inline]
|
||||||
|
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
|
||||||
|
// This will result in better codegen for ordering comparisons, plus
|
||||||
|
// avoids pitfalls with regards to macro codegen relying on property
|
||||||
|
// position when we want to compare against the bit representation.
|
||||||
|
self.to_bits().cmp(&other.to_bits())
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -197,6 +240,7 @@ impl Entity {
|
|||||||
/// In general, one should not try to synchronize the ECS by attempting to ensure that
|
/// In general, one should not try to synchronize the ECS by attempting to ensure that
|
||||||
/// `Entity` lines up between instances, but instead insert a secondary identifier as
|
/// `Entity` lines up between instances, but instead insert a secondary identifier as
|
||||||
/// a component.
|
/// a component.
|
||||||
|
#[inline]
|
||||||
pub const fn from_raw(index: u32) -> Entity {
|
pub const fn from_raw(index: u32) -> Entity {
|
||||||
Entity {
|
Entity {
|
||||||
index,
|
index,
|
||||||
@ -210,6 +254,7 @@ impl Entity {
|
|||||||
/// for serialization between runs.
|
/// for serialization between runs.
|
||||||
///
|
///
|
||||||
/// No particular structure is guaranteed for the returned bits.
|
/// No particular structure is guaranteed for the returned bits.
|
||||||
|
#[inline(always)]
|
||||||
pub const fn to_bits(self) -> u64 {
|
pub const fn to_bits(self) -> u64 {
|
||||||
(self.generation as u64) << 32 | self.index as u64
|
(self.generation as u64) << 32 | self.index as u64
|
||||||
}
|
}
|
||||||
@ -217,6 +262,7 @@ impl Entity {
|
|||||||
/// Reconstruct an `Entity` previously destructured with [`Entity::to_bits`].
|
/// Reconstruct an `Entity` previously destructured with [`Entity::to_bits`].
|
||||||
///
|
///
|
||||||
/// Only useful when applied to results from `to_bits` in the same instance of an application.
|
/// Only useful when applied to results from `to_bits` in the same instance of an application.
|
||||||
|
#[inline(always)]
|
||||||
pub const fn from_bits(bits: u64) -> Self {
|
pub const fn from_bits(bits: u64) -> Self {
|
||||||
Self {
|
Self {
|
||||||
generation: (bits >> 32) as u32,
|
generation: (bits >> 32) as u32,
|
||||||
|
Loading…
Reference in New Issue
Block a user