While migrating our game engine to an ECS architecture, we set out to validate some simple assumptions about CPU caches and were a bit surprised by the results. I'm hoping someone can explain what's happening here.
import random, std/monotimes

const MAX = 9000000

type Elem = int

var x = newSeq[Elem](MAX)
var y = newSeq[Elem](MAX)
var z = newSeq[Elem](MAX)
var m = newSeq[Elem](MAX)

type S = object
  x, y, z, m: Elem

var d = newSeq[S](MAX)

template bench(name: string, code: untyped) =
  let s = getMonotime().ticks
  code
  let e = getMonotime().ticks
  echo "test ", name, ": ", e - s

randomize()

proc testSOA(i: int) =
  m[i] = x[i] + y[i] + z[i]

proc testAOS(i: int) =
  d[i].m = d[i].x + d[i].y + d[i].z

proc test() =
  for i in 0 .. MAX - 1:
    x[i] = rand(Elem.high)
    y[i] = rand(Elem.high)
    z[i] = rand(Elem.high)
    m[i] = rand(Elem.high)
    d[i] = S(x: x[i], y: y[i], z: z[i], m: m[i])
  bench "A1":
    for i in 0 .. MAX - 1:
      testSOA(i)
      # m[i] = x[i] + y[i] + z[i]
  bench "S1":
    for i in 0 .. MAX - 1:
      testAOS(i)

test()
nim c -d:release -d:danger test.nim
test A1: 9957102
test S1: 15503720
Which suggests that "array of structs" is ~1.5 times slower than "struct of arrays". Also, the disassembly shows that testSOA executes more instructions than testAOS:
Dump of assembler code for function testSOA_t_92:
0x000000000000e260 <+0>: mov 0x1a0b1(%rip),%rdx # 0x28318 <x_t_26>
0x000000000000e267 <+7>: mov 0x1a0a2(%rip),%rax # 0x28310 <y_t_34>
0x000000000000e26e <+14>: add $0x2,%rdi
0x000000000000e272 <+18>: mov (%rax,%rdi,8),%rax
0x000000000000e276 <+22>: add (%rdx,%rdi,8),%rax
0x000000000000e27a <+26>: mov 0x1a087(%rip),%rdx # 0x28308 <z_t_42>
0x000000000000e281 <+33>: add (%rdx,%rdi,8),%rax
0x000000000000e285 <+37>: mov 0x1a074(%rip),%rdx # 0x28300 <m_t_50>
0x000000000000e28c <+44>: mov %rax,(%rdx,%rdi,8)
0x000000000000e290 <+48>: ret
Dump of assembler code for function testAOS_t_94:
0x000000000000e2a0 <+0>: shl $0x5,%rdi
0x000000000000e2a4 <+4>: add 0x1a04d(%rip),%rdi # 0x282f8 <d_t_86>
0x000000000000e2ab <+11>: mov 0x18(%rdi),%rax
0x000000000000e2af <+15>: add 0x10(%rdi),%rax
0x000000000000e2b3 <+19>: add 0x20(%rdi),%rax
0x000000000000e2b7 <+23>: mov %rax,0x28(%rdi)
0x000000000000e2bb <+27>: ret
So my intuition tells me that AoS should be just as fast as or faster than SoA, but apparently I'm wrong. Why?

In general, SoA is more cache-friendly than AoS and lends itself better to SIMD optimizations. This is why in video processing the preferred format is planar YUV (e.g. YV12) instead of interleaved RGB.
Now, regarding your code: hardware prefetchers can follow up to about 12 forward streams. The pattern "prefetch the next item", which is what SoA produces, is very easy to catch. In the AoS case the pattern is "prefetch every 4th item", which is less prefetcher-friendly. I don't think the assembly matters much here, since there is no instruction-level parallelism to exploit when adding just 3 values.
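To make the stream argument concrete, here is a small Python sketch (illustration only; base addresses are omitted and all sizes are the assumed 8-byte Elem / 32-byte S from the benchmark) that enumerates the byte offsets each loop touches:

```python
ELEM = 8           # assumed sizeof(Elem) == sizeof(int64)
STRUCT = 4 * ELEM  # assumed sizeof(S) with fields x, y, z, m

def soa_streams(n):
    # Four independent streams (x, y, z, m); each advances 8 bytes per
    # iteration, i.e. unit stride -- trivial for the prefetcher.
    return {name: [i * ELEM for i in range(n)] for name in "xyzm"}

def aos_stream(n):
    # One stream over d; consecutive elements are 32 bytes apart, and the
    # store to .m is interleaved with the loads of .x, .y, .z.
    return [i * STRUCT for i in range(n)]

soa = soa_streams(4)
aos = aos_stream(4)
# Stride within each SoA stream is 8 bytes per iteration...
assert all(b - a == ELEM for s in soa.values() for a, b in zip(s, s[1:]))
# ...while the single AoS stream advances 32 bytes per element.
assert all(b - a == STRUCT for a, b in zip(aos, aos[1:]))
```

So SoA gives the prefetcher four separate dense streams, while AoS gives it one stream that jumps 32 bytes at a time with a store mixed into the loads.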
In testSOA, only the contents of seq m need to be written back to main memory. In testAOS, I think not only field m of object S but also fields x, y, and z need to be written back, because they likely share the cache line containing field m.
According to this wikipedia entry:
https://en.wikipedia.org/wiki/CPU_cache#Cache_entries
Data is transferred between memory and cache in blocks of fixed size, called cache lines or cache blocks.
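A back-of-envelope traffic estimate (my own, assuming 8-byte elements, a write-allocate cache, and purely streaming access with no reuse — these are assumptions, not measurements) comes out fairly close to the measured ratio:

```python
ELEM = 8  # assumed sizeof(Elem) == sizeof(int64)

# SoA: read x[i], y[i], z[i]; the store to m[i] also costs one
# write-allocate read of m's cache line plus the eventual write-back.
soa_bytes = 3 * ELEM + ELEM + ELEM   # 40 bytes per element

# AoS: the whole 32-byte struct is read, and because the store to .m
# dirties the cache line holding x, y, z, all 32 bytes are written back.
aos_bytes = 4 * ELEM + 4 * ELEM      # 64 bytes per element

print(aos_bytes / soa_bytes)  # 1.6, vs the measured 15503720 / 9957102 ~ 1.56
```

The estimated 1.6x extra memory traffic for AoS lines up with the ~1.56x slowdown in the original benchmark, which supports the write-back explanation.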
I moved field m of object S in your code into a separate seq m2. Then testAOS becomes slightly faster than testSOA:
import random, std/monotimes

const MAX = 9000000

type Elem = int

var x = newSeq[Elem](MAX)
var y = newSeq[Elem](MAX)
var z = newSeq[Elem](MAX)
var m = newSeq[Elem](MAX)
var m2 = newSeq[Elem](MAX)

type S = object
  x, y, z: Elem

var d = newSeq[S](MAX)

template bench(name: string, code: untyped) =
  let s = getMonotime().ticks
  code
  let e = getMonotime().ticks
  echo "test ", name, ": ", e - s

randomize()

proc testSOA(i: int) =
  m[i] = x[i] + y[i] + z[i]

proc testAOS(i: int) =
  m2[i] = d[i].x + d[i].y + d[i].z

proc test() =
  for i in 0 .. MAX - 1:
    x[i] = rand(Elem.high)
    y[i] = rand(Elem.high)
    z[i] = rand(Elem.high)
    m[i] = rand(Elem.high)
    m2[i] = rand(Elem.high)
    d[i] = S(x: x[i], y: y[i], z: z[i])
  bench "A1":
    for i in 0 .. MAX - 1:
      testSOA(i)
  bench "S1":
    for i in 0 .. MAX - 1:
      testAOS(i)

test()
Output:
test A1: 27826736
test S1: 24209680
That's because your component is already pretty compact. There is no cache thrashing, since you use every member field. Note that ECS is not really about this low a level of granularity, where you would split something like an entity's direction into separate components. Although that would make sense if you plan to do manual vectorization.
This is a nice find: https://github.com/AnneVanEde/MasterThesis. It has interviews with some well-known people working on ECS frameworks, discussing common ECS topics.