It seems to me we're already starting to see diminishing returns here. According to this metric, the new model is only ~4% better than the previous one? I've read elsewhere that it's faster and uses fewer computing resources to achieve its results, so maybe that's where the primary gains lie.
I don't play any competitive sports that use Elo rankings, but if I'm understanding this table correctly, a 50-point gap is more like "the new model was judged as better about 57% of the time", i.e. 7.15 percentage points more often than a coin flip.
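For anyone who wants to check the arithmetic, here is a minimal sketch of the standard Elo expected-score formula with the usual 400-point scale. The function name is my own, and I'm assuming the leaderboard uses plain Elo rather than some variant.

```python
def expected_score(rating_gap: float) -> float:
    """Expected win probability for the higher-rated side under standard Elo.

    rating_gap is (rating of A) - (rating of B); 400 is the conventional scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


# A 50-point gap: the higher-rated model is expected to win a bit over
# half of the head-to-head comparisons.
print(f"{expected_score(50):.4f}")  # 0.5715
```

So a 50-point lead corresponds to winning roughly 57% of comparisons, which is the 7.15 points above 50% mentioned above.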
I do expect we're well into diminishing returns ... of the test rankings. Once the bots are good enough, a large number of people won't throw a hard enough challenge at the bots to see a difference, and will judge more-or-less randomly.
That is, the new bot could be drastically better (or worse!) at solving differential equations, but given that most people won't ask about something that hard...
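To put rough numbers on that, here is a toy sketch. It assumes judges vote 50/50 on easy prompts and that only hard prompts reveal the real difference; the 10% and 90% figures are invented for illustration, not taken from any leaderboard.

```python
import math


def observed_win_rate(hard_fraction: float, hard_win_rate: float) -> float:
    """Blend a real advantage on hard prompts with coin-flip votes on easy ones."""
    return hard_fraction * hard_win_rate + (1.0 - hard_fraction) * 0.5


def elo_gap_from_win_rate(p: float) -> float:
    """Invert the standard Elo expected-score formula (400-point scale)."""
    return 400.0 * math.log10(p / (1.0 - p))


# Hypothetical numbers: 10% of prompts are hard, and the new bot wins 90% of those.
rate = observed_win_rate(hard_fraction=0.10, hard_win_rate=0.90)
gap_seen = elo_gap_from_win_rate(rate)   # gap the leaderboard would report
gap_hard = elo_gap_from_win_rate(0.90)   # gap if only hard prompts were judged

print(f"observed win rate: {rate:.2f}")
print(f"Elo gap seen: {gap_seen:.0f}, Elo gap on hard prompts only: {gap_hard:.0f}")
```

Under those made-up assumptions, a bot that wins 9 out of 10 hard comparisons shows up as only a few dozen Elo points ahead overall, because the easy prompts drown out the signal.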