Yep, I went and copied the enwik8 value for ts_zip when doing that comparison, good catch!
I guess that leaves the question of "how well do the LLM's predictions work for things we're certain weren't in the training data?" If it's truly just the prebuilt RWKV, then it was only trained on enwik8 and enwik9 is already a generalization, but there's nothing really guaranteeing that assumption. On the other hand... I can't think of any GB-class open datasets of plain English to test with that aren't already in use on the page.