Came across another content-defined chunking algorithm recently. This one is from the paper "A new content-defined chunking algorithm for data deduplication in cloud storage".
https://codeberg.org/IcedQuinn/icedchunker
It's interesting how the general idea behind these seems to be getting simpler over time. The original Rabin fingerprint requires some math (not an onerous formula, but everyone explains it as painfully as they can), then rsync just bodges past the math, then FastCDC uses a gear hash, and finally RAM and EndRE's SampleByte turn the whole thing into zero hashes and a byte check somewhere.
EndRE is actually doing something more like RoLZ compression with its chunker, but that's a story for another time.
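To make the "zero hashes and a byte check" point concrete, here is roughly what the gear-hash cut test collapses to. This is a toy sketch, not FastCDC itself: the random table, the 13-bit mask, and the missing min/max chunk clamping are all simplifications of mine.

import std/random

proc buildGearTable(seed: int64 = 1): array[256, uint64] =
  # one 64-bit value per possible byte; real implementations ship a fixed table
  var rng = initRand(seed)
  for i in 0 ..< 256:
    result[i] = rng.next()

proc chunkBoundaries(data: openArray[byte]; mask: uint64 = 0x1FFF'u64): seq[int] =
  ## Offsets just past each cut point. A cut happens whenever the rolling hash
  ## lands on zero under `mask` -- one shift, one add and one compare per byte.
  let gear = buildGearTable()
  var h = 0'u64
  for i, b in data:
    h = (h shl 1) + gear[b.int]   # the entire rolling hash
    if (h and mask) == 0:         # "enough low bits are zero" => boundary
      result.add(i + 1)
      h = 0'u64                   # resetting here is another simplification of mine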
I want to use the following proc to chop websites for threaded parsing / spidering, but it is still experimental for me. I don't know if your chunking lib is relevant for me; I will take a look at the paper...
import std/strutils  # for find

proc chopString4*(inputtekst: string; chars_per_chunkit: int; breakbeforetekst: string): seq[string] =
  #[ New algorithm that only separates just before the specified breakpoint text (breakbeforetekst):
     chop the input text into chunks of roughly chars_per_chunkit characters,
     but from the calculated border start searching for the first breakbeforetekst,
     and make the cut just before it.
  ]#
  var
    chunktekst: string
    partsq: seq[string]
    curposit = 0
    newposit, prevposit: int
  let eofit = inputtekst.len       # end of the input text
  let chunksizeit = chars_per_chunkit
  var eof_reachedbo = false
  # while end-of-text minus current position > chunk size:
  while eofit - curposit > chunksizeit:
    prevposit = curposit
    # current position = current position + chunk size
    curposit += chunksizeit
    # search from the pre-calculated border point for the first breakpoint text
    newposit = inputtekst.find(breakbeforetekst, curposit)
    if newposit != -1:
      # if found, it becomes the new current position and the cut point;
      # the chunk ends just before the breakpoint text
      curposit = newposit
      chunktekst = inputtekst[prevposit ..< curposit]
    else:
      # otherwise make a last chunk from the current position to the end of the text
      chunktekst = inputtekst[prevposit .. eofit - 1]
      eof_reachedbo = true
    partsq.add(chunktekst)
    if eof_reachedbo:
      break
  if not eof_reachedbo:
    # the last part, which was smaller than a full chunk
    if curposit < eofit:
      partsq.add(inputtekst[curposit .. eofit - 1])
  result = partsq
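
For instance, something like this is how I'd call it on a page (4000 characters per chunk and "<p" as the break text are just example values, nothing final):

let chunks = chopString4(readFile("page.html"), 4000, "<p")
for c in chunks:
  echo c.len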

No. Content-defined chunking uses probabilities to decide when to split data and prays that similar regions in two binaries resolve to enough identical chunks that you save space. Many chunk stores also attempt compression after chunking--but the state for that is purely local.
Parsing can't be split cleanly. See Sayre's paradox: a grammar is not segmentable without already having recognized it.
To do what you want requires performing enough recognition to segment, then punting further refinement down the line. You'd be more interested in simdjson, which does exactly this: its trick is to use SIMD instructions to make the initial segmenting pass as fast as possible while deferring as many of the complex decisions as possible. Once that "index" is built, you have an idea of where the subtrees are and can pass them down.
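A toy, scalar version of that first pass over JSON (nothing below is simdjson's actual code or API, just the shape of the idea): record where the structural characters sit, skip over string contents, and decide nothing else yet.

proc structuralIndex(json: string): seq[int] =
  ## Offsets of '{', '}', '[', ']', ':' and ',' that sit outside string literals.
  var inString = false
  var escaped = false
  for i, c in json:
    if inString:
      if escaped:
        escaped = false
      elif c == '\\':
        escaped = true
      elif c == '"':
        inString = false
    else:
      case c
      of '"': inString = true
      of '{', '}', '[', ']', ':', ',': result.add(i)
      else: discard

when isMainModule:
  let doc = """{"a": [1, 2, {"b": "x,y"}], "c": 3}"""
  echo structuralIndex(doc)
  # The comma inside "x,y" is not reported; the bracket offsets tell you where
  # the subtrees are, so they can be handed to workers before any value has
  # been parsed.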
HTML in particular is very stateful in ways that are hostile to just grabbing a chunk and moving on. Something like WordPerfect's reveal codes is a bit less so, since it's just a stream of "and now we're using rule 22 for the following letters," whereas HTML shares JSON's problem of being a tree.
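As a toy illustration of that statefulness (my own sketch, not anyone's real parser): whether a given byte offset is even a legal place to cut depends on everything before it, e.g. whether you are currently inside a script element or a comment, where '<' stops meaning "a tag starts here".

import std/strutils

proc insideRawText(html: string; offset: int): bool =
  ## True if `offset` falls inside a script/style element or a comment.
  ## The only way to know is to replay state from the very beginning.
  var i = 0
  var raw = ""             # name of the raw-text element we are inside, if any
  var inComment = false
  while i < offset and i < html.len:
    if inComment:
      if html.continuesWith("-->", i):
        inComment = false
        i += 3
      else:
        inc i
    elif raw.len > 0:
      if html.continuesWith("</" & raw, i):
        raw = ""
      inc i
    else:
      if html.continuesWith("<!--", i):
        inComment = true
        i += 4
      elif html.continuesWith("<script", i):
        raw = "script"
        inc i
      elif html.continuesWith("<style", i):
        raw = "style"
        inc i
      else:
        inc i
  result = inComment or raw.len > 0

when isMainModule:
  let page = "<p>hi</p><script>if (a < b) { alert('<p>') }</script><p>bye</p>"
  echo insideRawText(page, 30)   # inside the script: true
  echo insideRawText(page, 5)    # ordinary markup: false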