Interesting that Claude code performs better than codex in this exercise. We’ve been finding them to be roughly similar but our tasks are quite different! @xuyiqing did you do any comparisons across the two in your replication work?