If they just benchmark it on math questions, where the answers are verifiably correct or incorrect, and all they have is a magic knob that reduces errors by some vague amount, how does that help guarantee that the answer to any given question like that is usable? Of course, trying to come up with a...