It's funny how people often say "this test is stupid because it's a tokenizer problem" or "OpenAI opaquely routes your queries to smaller models" or "the answer would make sense for version numbers", and yeah. Sure. But Sam promised an LLM so intelligent it's scary, and what people see instead is a model that gets the answer wrong because it's not even smart enough to realize you're talking about decimal numbers rather than version numbers.
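The decimal-versus-version ambiguity behind this complaint is easy to make concrete. Here's a small Python sketch of my own (not from any actual benchmark): the same string pair "9.11" vs "9.9" compares one way as decimal numbers and the opposite way under typical dotted version-number semantics.

```python
def as_version(s: str) -> tuple[int, ...]:
    # Version schemes compare dot-separated components numerically,
    # so "9.11" splits into (9, 11), which is greater than (9, 9).
    return tuple(int(part) for part in s.split("."))

a, b = "9.11", "9.9"

# Decimal reading: 9.11 < 9.9, so this prints False.
print(float(a) > float(b))

# Version reading: (9, 11) > (9, 9), so this prints True.
print(as_version(a) > as_version(b))
```

The point of the test is exactly that a human, given the context of plain arithmetic, picks the decimal reading without hesitation.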