Visualizing token use of long-Chain-of-Thought Reasoning Models
[Per-model table/visualization not reproduced here. Columns: Model, Output TOK Rate, vs Final Reply, TOK Distribution; includes an AVERAGE (⌀) row across all 16 models.]
This visualization compares recorded token rates across 16 long-Chain-of-Thought Reasoning/Thinking AI models, showing the effect on verbosity, token usage, and thus inference time/cost. I referenced and compared ~250 queries per model from a multitude of use cases across the entirety of my collected benchmark data.
While the precise rates may differ vastly from query to query, this aims to give a rough ballpark by using directly comparable data. With an average of 5.45x TOK usage, these models clearly require vastly more compute and inference time.
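For context, computing such a rate can be as simple as averaging a per-query multiplier over the logged runs. The sketch below assumes the rate is measured against a non-reasoning baseline reply to the same prompt; the field names and numbers are placeholders of mine, not the original benchmark data.

```python
from statistics import mean

# Hypothetical per-query log entries: output tokens emitted by the reasoning
# model vs. a conventional (non-reasoning) reply to the same prompt.
# The numbers are made up for illustration, not the benchmark's measurements.
queries = [
    {"reasoning_model_tokens": 5200, "baseline_tokens": 900},
    {"reasoning_model_tokens": 3100, "baseline_tokens": 650},
    {"reasoning_model_tokens": 7800, "baseline_tokens": 1200},
]

def output_tok_rate(records):
    """Average output-token multiplier of the reasoning model vs. the baseline."""
    return mean(r["reasoning_model_tokens"] / r["baseline_tokens"] for r in records)

print(f"Output TOK rate: {output_tok_rate(queries):.2f}x")
```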
Traditional mTok pricing - is no longer a reliable indicator of the true cost. The base mTok price needs to be multiplied by the real TOK rate in order to compare price efficiency. E.g., if we just compare the price per million output tokens of GPT-4.5 Preview ($150) and o1 ($60), one might think that GPT-4.5 Preview is significantly more expensive; however, during my testing o1 was actually ~50% more expensive. Certain use cases showed an even more pronounced difference, such as my first Chess Tournament, where o1 was a staggering 2.5x as expensive, despite its vastly "cheaper" base price.
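The adjustment boils down to one multiplication: effective price ≈ listed price per million output tokens × the model's real output-token multiplier. The sketch below applies it to the GPT-4.5 Preview vs o1 example; the listed prices are from the text, while the two multipliers are illustrative placeholders chosen only to reproduce the rough ~50% gap described above, not the measured values.

```python
def effective_price(price_per_mtok: float, tok_rate: float) -> float:
    """Effective cost per million 'useful' output tokens, scaled by how
    many tokens the model actually emits to produce them."""
    return price_per_mtok * tok_rate

# Listed prices from the text; TOK rates are hypothetical placeholders.
gpt45 = effective_price(150.0, 1.0)   # non-reasoning baseline
o1    = effective_price(60.0, 3.75)   # assumed reasoning multiplier

print(f"GPT-4.5 Preview: ${gpt45:.0f} effective per mTok-equivalent")
print(f"o1:              ${o1:.0f} effective per mTok-equivalent")
print(f"o1 costs {o1 / gpt45 - 1:.0%} more in this scenario")
```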
Token Verbosity - Not all models allocate token usage in the same way. For example, while R1-Zero's thought chains inflate its token usage to almost 12x relative to its final reply, that extremely concise Final Reply brought its total Output TOK rate back down to near the average.
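A minimal sketch of how such a distribution can be separated, assuming reasoning tokens and final-reply tokens are reported (or logged) individually; the figures are invented to mirror the R1-Zero pattern of a very long thought chain paired with a terse final answer, not the measured values.

```python
def token_breakdown(reasoning_tokens: int, reply_tokens: int, baseline_tokens: int):
    """Split total output into thought chain vs. final reply and relate
    both to a non-reasoning baseline reply for the same query."""
    total = reasoning_tokens + reply_tokens
    return {
        "vs_final_reply": total / reply_tokens,      # how much the chain inflates the reply
        "output_tok_rate": total / baseline_tokens,  # multiplier vs. a traditional model
        "reasoning_share": reasoning_tokens / total, # fraction of output spent on thinking
    }

# Illustrative numbers only: a huge chain plus a terse reply can show a
# ~12x 'vs Final Reply' ratio while the overall rate stays near average.
print(token_breakdown(reasoning_tokens=3300, reply_tokens=300, baseline_tokens=700))
```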
While QwQ-32B showed overall strong capability for its size, its local usability is certainly an extremely limiting factor, as it was the most verbose of any model I have ever tested. In practical application, whether a potentially, but not guaranteed, better reply is worth spending almost tenfold on inference is questionable, to say the least.
Conclusion - This data showcases the need for in-depth efficiency calculations, adjusted for your use case. Comparing traditional models directly to long-Chain-of-Thought reasoning models is like comparing apples to oranges unless one factors in the effects on inference/compute, latency, token efficiency, and cost.