I posed the same question to both GPT-4o and the less sycophantic o3 model. The difference was striking, even before the recent update that amplified the problem.
An example from LM Arena: I ask a question and two different chatbots answer. I pick a winner, and only then do I learn which model was which (the left turned out to be gpt-4.1-mini, the right o4-mini).