In my continuing exploration of the Nim libraries I came across the count function.
I have a byte array seg[uint8] where each byte is a '0' or '1'. I tried to use count to count the number of zeroes like: sum = count(seg, 0)
Using 0.18.0, compiling gives:
twinprimes_test6yd.nim(252, 16) Error: type mismatch: got <seq[uint8], int literal(0)>
but expected one of:
proc count(s: string; sub: string; overlapping: bool = false): int
first type mismatch at position: 1
required type: string
but expression 'seg' is of type: seq[uint8]
proc count(s: string; subs: set[char]): int
first type mismatch at position: 1
required type: string
but expression 'seg' is of type: seq[uint8]
proc count(s: string; sub: char): int
first type mismatch at position: 1
required type: string
but expression 'seg' is of type: seq[uint8]
expression: count(seg, 0)
Even when I do sum = count(seg, 0'u8) it gives:
twinprimes_test6yd.nim(252, 16) Error: type mismatch: got <seq[uint8], uint8>
but expected one of:
proc count(s: string; sub: string; overlapping: bool = false): int
first type mismatch at position: 1
required type: string
but expression 'seg' is of type: seq[uint8]
proc count(s: string; subs: set[char]): int
first type mismatch at position: 1
required type: string
but expression 'seg' is of type: seq[uint8]
proc count(s: string; sub: char): int
first type mismatch at position: 1
required type: string
but expression 'seg' is of type: seq[uint8]
expression: count(seg, 0'u8)
According to docs this should work, right?
Aside
What would be even better if there was a count function that uses popcount | countSetBits on the array.
For me, seg is an integer number of 8 bytes (64 bits) so popcount would work on the integer number of slices, if the slices are allocated continuously in memory on 64-bit boundaries. (What would be even better better, for this case, is a fast bit vector, which I understand is in the works.)
Unfortunately there has been not much improvements in the quality of your "bug" reports in the last years. Did you just refuse to import module sequtils? And popCount() function is available, so you shopuld be able to make yourv own popCount() variant.
stefan@nuc /tmp/hhh $ cat h.nim
import sequtils
proc main =
var s = newSeq[uint8]()
for i in 0 .. 7:
s.add(i.uint8)
echo count(s, 3u8)
main()
stefan@nuc /tmp/hhh $ ./h
1
stefan@nuc /tmp/hhh $ nim -v
Nim Compiler Version 0.18.0 [Linux: amd64]
Copyright (c) 2006-2018 by Andreas Rumpf
git hash: 4164ec4f8b6d9968e69381edd35d9cf6fe79dee1
active boot switches: -d:release
OK, here is what I conceptually want. It works, though I haven't tested it yet in my code to see if its faster. I image it will be, since I won't have to check each byte individually now.
import sequtils
import bitops
proc countBitsArray(s: seq[uint8], n: int): int =
for j in 0..(n div 8 - 1):
let i = j * 8
var a = s[i+0].int or
(s[i+1].int shl 8) or
(s[i+2].int shl 16) or
(s[i+3].int shl 24) or
(s[i+4].int shl 32) or
(s[i+5].int shl 40) or
(s[i+6].int shl 48) or
(s[i+7].int shl 56)
result += popcount(a)
var s = newSeq[uint8]()
for i in 0..127: s.add(i.uint8)
echo countBitsArray(s, 128) # -> 448
This is specific for my use case where s is uint8 and n (its size) is always a multiple of 8 bytes. I imagine it wouldn't be too hard to generalize to work with other number sizes.
What would be ideal is to have a pointer (whatever best technique) to the start of each 8 (4) byte slice in s, and read them into a as a 64 (32) bit value.
Ah yes, this implementation is actually slower in my code, because I still have to read each byte anyway, and then perform additional operations on them. However, again, if a memory chunk could be read in as one operation it should then be faster.
Any suggestions on a faster Nim implementation?
This version takes into account, for my use case, where the first|last segment size can actually be less than 8 bytes. Actually, this version better conceptually performs the process to operate on any array which may not be an exact multiple of a number size, and get the correct results. Notice I changed the name to countArrayBits,
proc countArrayBits(s: seq[uint8], n: int): int =
## count number of '1' bits in seq[uint8] segment array
let (q, r) = (n div 8, n mod 8)
var (a, i) = (0, 0)
if q > 0:
for j in 0..q-1:
a = s[i+0].int or
(s[i+1].int shl 8) or
(s[i+2].int shl 16) or
(s[i+3].int shl 24) or
(s[i+4].int shl 32) or
(s[i+5].int shl 40) or
(s[i+6].int shl 48) or
(s[i+7].int shl 56)
i += 8
result += popcount(a)
if r > 0:
a = 0
for j in 0..r-1:
a += s[i+j].int shl j*8
result += popcount(a)
Even in this raw, totally non-optimized form, it gives correct results and is only slightly slower.
I'm still looking at the docs, and examples, of how I would go about reading raw 64-bit chunks of memory at once, to replace all the single byte reads and shifting. First, I'll see if I can optimize this code|technique to make it faster in my code.
Note that gcc and clang produce highly optimized SIMD code for plain loops like
int64_t xxx(int8_t x[], int l) {
int i = 0;
int res = 0;
while (i < l) {
if (x[i] != 0)
res++;
i++;
}
return res;
}
You can verify that using https://www.godbolt.org/ easily.
So if you really want to optimize performance you may look at Nim 's assembly code first. Using shift and or on the bytes followed by popcount is very questionable, as the compiler may have trouble understanding your real intent and so may not optimize well.
Casting 8 uints to one 64 bit word may be not too bad from performance, but I still assume that it is slower than plain SIMD loop. Casting can be done by using pointers of course, you may assign the address of an uint8 to a pointer to a 64 bit word and then call popcount on the 64 bit word. But there may be better ways for casting, you may ask in IRC. Note that the heading of this tread is very wrong and misleading, so most smart people may not follow your discussion.
Thanks for your help and suggestions. I agree the subsequent discussion is past the scope of the original thread title, and not obvious to people who would just be looking at it. I'll start another thread directly dealing with this issue later.
Just to let you know, I'm almost there with using casting, but I think I'm running into a memory alignment issue going from 8 addressable bytes to one 8 byte memory chunk.
Here's code that shows the problem.
import sequtils
import bitops
proc countArrayBits(s: seq[uint8], n: int): int =
## count number of '1' bits in seq[uint8] segment array
let (q, r) = (n div 8, n mod 8)
var (a, i, bitshift) = (0, 0, 0)
for j in 0..q-1:
a = s[i+0].int or
(s[i+1].int shl 8) or
(s[i+2].int shl 16) or
(s[i+3].int shl 24) or
(s[i+4].int shl 32) or
(s[i+5].int shl 40) or
(s[i+6].int shl 48) or
(s[i+7].int shl 56)
i += 8
result += popcount(a)
echo(result) # for testing purposes
a = 0
for j in 0..r-1:
a += s[i+j].int shl bitshift
bitshift += 8
result += popcount(a)
proc countBitsArray(s: seq[uint8], n: int): int =
## count number of '1' bits in seq[uint8] segment array
let (q, r) = (n div 8, n mod 8)
var i = 0
for j in 0..q-1:
result += popcount(cast[uint64](s[i..i+7]))
echo(result) # for testing purposes
i += 8
if r > 0: result += popcount(cast[uint64](s[i..i+r-1]))
var s = newSeq[uint8]()
for i in 0..7: s.add(i.uint8)
echo countArrayBits(s, 8) # -> 12
echo()
echo countBitsArray(s, 8) #
Here I'm just testing using one 8 byte array, but the 2nd version using casting always produces wrong answers, which are greater than the correct answer of 12. I tried all different types for casting, reversing the array, etc. It's something going on that's not obvious. But I get the correct results in the 2nd version when I read each byte separately.
If you run this repetitively, the 2nd versions gives different wrong answers, which tells me depending on where s resides in memory (multiple of 8?) the chunk is read differently. If you use larger larger values for s your can see the differences accumulate.
However, the 2nd version is faster operationally, I just need to make it be correct.
Maybe I need to use pointers to the exact bytes ranges in memory?
(Solving this problem of reading 8 byte memory chunks correctly, should also show me how to write to 8 byte chunks too, so I can initialize the array to '0' 8 bytes at a time.)
What would use suggest as a good title for a new thread for this issue?
(cast[uint64](s[i..i+7]))
may be not what we want, because s[i..i+7] may be a new seq including length header. I am not really good with casting, I generally avoid it, sorry. Maybe something like this works:
var h: ptr uint64 = cast[ptr uint64](addr(s[i]))
result += popcount(h[])
I used your suggestion, and got it to work by first setting s as a var (not a let) outside the loop, kept getting 'no address' error for s without it.
Both versions now return the correct answers. However, in testing each version in my code the 2nd version is much slower than the first (which was slower than original code). I suspect it's all that casting|pointer stuff inside the loop. I tried to play with moving it outside the loop, with no success so far.
I've had enough for right now, I'll play with it later with a fresher mind (and food in my stomach). I have learned a lot more than I knew before this exercise, though.
import system
import sequtils
import bitops
proc countArrayBits(s: seq[uint8], n: int): int =
## count number of '1' bits in seq[uint8] segment array
let (q, r) = (n div 8, n mod 8)
var (a, i, bitshift) = (0, 0, 0)
for j in 0..q-1:
a = s[i+0].int or
(s[i+1].int shl 8) or
(s[i+2].int shl 16) or
(s[i+3].int shl 24) or
(s[i+4].int shl 32) or
(s[i+5].int shl 40) or
(s[i+6].int shl 48) or
(s[i+7].int shl 56)
i += 8
result += popcount(a)
a = 0
for j in 0..r-1: a += s[i+j].int shl bitshift; bitshift += 8
result += popcount(a)
proc countBitsArray(s: seq[uint8], n: int): int =
## count number of '1' bits in seq[uint8] segment array
let (q, r) = (n div 8, n mod 8)
var (a, i, bitshift, s) = (0, 0, 0, s)
for j in 0..q-1:
var h: ptr uint64 = cast[ptr uint64](s[i].addr)
result += popcount(h[])
i += 8
#if r > 0: result += popcount(cast[uint64](s[i..i+r-1]))
a = 0
for j in 0..r-1: a += s[i+j].int shl bitshift; bitshift += 8
result += popcount(a)
var s = newSeq[uint8]()
for i in 0..101050: s.add(i.uint8)
echo countArrayBits(s, 101051) # -> 404130
echo()
echo countBitsArray(s, 101051)
However, in testing each version in my code the 2nd version is much slower than the first (which was slower than original code).
Which is great, as it indicates that the plain original loop already gives highly optimized SIMD code. (I am a bit surprised that the proc with shl and or ops is faster than the one with only a cast to uint64 still.)