Dubesor - First Impressions Blog

Initial model impressions & general vibe, copy-pasted from my Discord comments

Codex Mini
Tested **Codex Mini**. Obviously not a general-purpose model, but out of curiosity I like to test specialized models in my general environment regardless:

* General capability around GPT-4.1, between o3-mini and o3-mini-high
* In the few (non-agentic) coding tasks I have, it performed at o3-mini-high level
* Overall token verbosity (x5.91) was around R1 level (slightly lower than o4-mini-high), with a 3/1 split between reasoning and output tokens
* Real bottom-line cost was a bit lower than o3-mini-high and a bit higher than o4-mini-high (the arithmetic is sketched below)
* Did well in my small vision test (in fact top of the OpenAI lineup, barely edging out o4-mini due to comparable weighted ratings), though still behind Google models

Overall vibes: it felt like an o4-mini model with code-focused system instructions. This model doesn't fit within my personal use case/workflow, so obviously take my findings with a grain of salt, and as always - **YMMV!**
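For context on how "real bottom-line cost" figures like the one above come about: reasoning tokens bill at the output rate on top of the visible reply. A minimal sketch of that arithmetic, with placeholder prices and token counts (illustrative assumptions, not Codex Mini's actual rates or my measured averages):

```python
# Rough per-query cost estimate for a long-CoT model: hidden/reasoning tokens
# are billed at the same rate as visible output tokens.
def query_cost(prompt_tok, reasoning_tok, output_tok, in_price, out_price):
    """Prices are $ per million tokens; returns the cost of one query in dollars."""
    return (prompt_tok * in_price + (reasoning_tok + output_tok) * out_price) / 1_000_000

# Illustrative numbers only: a reply a traditional model would answer in ~400 tokens,
# at x5.91 total verbosity with a 3/1 reasoning-to-output split.
total = 400 * 5.91
reasoning, visible = 0.75 * total, 0.25 * total
print(f"${query_cost(1_000, reasoning, visible, in_price=1.5, out_price=6.0):.4f}")
```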
Mistral Medium 3
Tested **Mistral Medium 3**:

* Non-reasoning model, but baked-in chain-of-thought resulted in x2.08 overall token verbosity
* Supports basic vision (but quite weak, similar to Pixtral 12B in my vision bench)
* Capability was quite mediocre, placing it between Mistral Large 1 & 2, on a similar level as Gemini 2.0 Flash or GPT-4.1 Mini
* Bang for buck is meh; cost efficiency is lower than its competing field

Overall, I found this model fairly average, definitely **not** "*SOTA performance at 8X lower cost*" as claimed in their marketing. But of course, as always - **YMMV!**
Gemini 2.5 Pro Preview 05-06
Checked out the new Gemini 2.5 Pro Preview **05-06** update (prev. 03-25): Did slightly worse at my reasoning segment (reproducibly, the same 3 tasks), same STEM, slightly improved instruction following; tech was better on 1 issue (but worse on 1 issue compared to Exp). Overall, same capability **in my environment**, shifted more towards coding, as the blog post suggests. Token use (and thus price) was up +17% (+11% on non-reasoning and +21% on reasoning tokens).

Example front-end showcase comparisons (**identical** prompt, identical settings, 0-shot - **NOT** part of my benchmark testing):

* [CSS Demo page exp/03-25](https://dubesor.de/assets/shared/UIcompare/gemini2.5proexp0325UI) | [CSS Demo page 05-06](https://dubesor.de/assets/shared/UIcompare/Gemini2.5ProPreview05-06UI) :thumbsup:
* [Steins;Gate Terminal exp/03-25](https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/Gemini%202.5%20Pro%20Experimental%2003-25) | [Steins;Gate Terminal 05-06](https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/Gemini%202.5%20Pro%20Preview%2005-06)
* [Benchtable exp/03-25](https://dubesor.de/assets/shared/LLMBenchtableMockup/Gemini%202.5%20Pro%20Experimental%2003-25) | [Benchtable 05-06](https://dubesor.de/assets/shared/LLMBenchtableMockup/Gemini%202.5%20Pro%20Preview%2005-06%2010.3%20cent.html)
* [Mushroom platformer exp/03-25](https://dubesor.de/assets/shared/MushroomPlatformer/Gemini%202.5%20Pro%20Experimental%2003-25) | [Mushroom platformer 05-06](https://dubesor.de/assets/shared/MushroomPlatformer/Gemini%202.5%20Pro%20Preview%2005-06)
* [Village game 03-25](https://dubesor.de/assets/shared/VillageGame/Gemini%202.5%20Pro%20Preview%2003-25.html) | [Village game 05-06](https://dubesor.de/assets/shared/VillageGame/Gemini%202.5%20Pro%20Preview%2005-06.html) :thumbsdown:

Overall, minor observable change **in my environment**/small test set, YMMV! The extra token use is a major bummer though.
Qwen3-235B-A22B
Qwen3-1.7B
Tested **Qwen3-235B-A22B** (API fp8) & **Qwen3-1.7B** (local bf16):

**1.7B:**
* `/no_think` is nothing special, performs as expected, a tiny model for the GPU-starved
* default/thinking performs around a decent 7B model, quite usable for easy tasks
* not my use case, but probably the best option besides small Gemma models if you cannot fit the 4B

---

**Qwen3-235B-A22B:**
* `/no_think` verbosity at x1.57, yet only 16% of default mode
* performed slightly worse in Reasoning and STEM subjects
* around Llama 3.3 70B level, but a better coder

**thinking** (*default mode*):
* 616% token usage compared to non-thinking mode
* very capable model across the board, almost DeepSeek-R1 level, beating out Llama 405B
* impressive STEM performance (not just math but other STEM subjects, too)
* extremely cost efficient, decent vibes, **TOP** model
* the trend of thought-chains aiding sensitive topics continued with these 2 models, too

**YMMV!**
Phi-4-reasoning plus
Tested **Phi-4-reasoning plus** (*14B, local, Q8_0*):

* In terms of overall capability, roughly on par with Qwen3-32B.
* I already thought QwQ was unusable for general use, but this one takes the cake in terms of sheer token verbosity: by far the most verbose model I have ever tested, almost **20x the token usage** of a traditional model.
* Quite dry and soulless responses overall
* Not a model for general use, clearly optimized for benchmarks (math; note that my STEM includes non-math topics)
* OK model to run a few tests or benchmarks, but the insane inference requirements are not reasonable for general use

As always, just my own testing, **YMMV**!
Qwen3
Tested **Qwen3** (4B, 8B, 14B, 32B, 30B-A3B):

**non-thinking** (`enable_thinking=False`, `/no_think`; toggle sketched below):
* still relatively verbose compared to a traditional non-thinking model (~45% more token usage)
* very good all-rounders for size, mostly best-in-slot for their sizes/non-thought
* good overall utility for a variety of tasks, not recommended for precise maths or programming
* more prone to flat-out refuse requests

**thinking** (*default mode*, `enable_thinking=True`, `/think`):
* very verbose, but not extremely so (x7.85 token usage puts it among the more verbose tested, but not as extreme as QwQ and o3-mini-high)
* huge gains in math (in particular rounding), as well as coding
* less prone to flat-out refuse requests; thought-chains were beneficial in censorship testing
* extremely performant overall, dominating my sub-49B rankings
* in fact, 4B and the MoE's 3.3B active were so performant for size (usually tiny models struggle at my test suite) that I suspected test leakage and ran multiple re-tests

All models were tested locally; rough inference speeds on my 4090/24GB VRAM:

* Qwen3-30B-A3B Q4_K_M: **130** tok/s
* Qwen3-4B bf16: **83** tok/s
* Qwen3-14B Q8_0: **50** tok/s
* Qwen3-8B bf16: **50** tok/s
* Qwen3-32B Q4_K_M: **28** tok/s

30B-A3B (***insane speed!***⚡️) will definitely be utilized as a daily driver by me for all types of random non-crucial tasks. This was just in my tested use cases, as always, **YMMV!**
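For anyone wanting to reproduce the two modes locally, here is a minimal sketch of how the thinking toggle is exposed through the chat template, based on my reading of the Qwen3 model card (treat the exact flag names and model id as things to verify there):

```python
# Minimal local sketch of Qwen3's thinking toggle via the chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is 2.675 rounded to two decimal places?"}]

# Thinking (default): the template makes the model emit a <think>...</think> block first.
prompt_think = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)

# Non-thinking: enable_thinking=False (the /think and /no_think soft switches in the
# user message toggle the same behaviour per turn).
prompt_fast = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)

inputs = tok(prompt_fast, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```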
GLM-4-32B-0414
GLM-Z1-32B-0414
Tested **GLM-4-32B-0414** & **GLM-Z1-32B-0414** (local Q4_K_M):

**GLM-4-32B-0414**
* non-thinking model, still fairly verbose (x1.62 tokens in my testing)
* good overall utility for a variety of tasks
* similar overall capability to Qwen2.5 32B
* more competition in this non-reasoning size segment (e.g. Mistral Small 3.1, Gemma 3 27B)

**GLM-Z1-32B-0414**
* reasoning model, requires a large context window (**minimum** 16k in my testing)
* nowhere near as verbose as QwQ (x6 instead of x10 token usage), thus higher general usability
* capability didn't quite reach QwQ level, but overall 2nd best for models under 49B
* I noticed syntax errors in less popular languages (e.g. Swift)
* I prefer it over QwQ simply because of its less excessive token spam

This is just my testing and my use cases. As always, **YMMV!**
Gemini 2.5 Flash Preview 04-17
Tested **Gemini 2.5 Flash Preview 04-17**:

* Fairly verbose, fast, cheap model that is competent in all tested areas.
* Improvements over 2.0 Flash, except in my coding tasks, where it did slightly worse
* Around GPT-4.1 level overall

**Thinking**:
* Increased output base price ($0.6 > $3.5), combined with ~3.42x token usage (74.3% reasoning tokens), leads to a **much** higher inference price, overall almost 20x that of non-thinking (rough arithmetic below).
* Biggest improvements were seen in reasoning, analytical conclusions, and coding
* Counterintuitively, it did consistently worse with thinking on my STEM tasks
* Around DeepSeek-R1 & Grok-3 level overall

Due to some inconsistencies observed during testing, I reran my benchmark several times on the Thinking variant. While it is overall far stronger than non-thinking (and far more expensive), it also produced less consistent results than non-thinking in some areas. As always, **YMMV!**
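For transparency, here is roughly how the "almost 20x" figure falls out of the numbers above (output side only; input tokens, which are priced the same in both modes, dilute the real-world multiplier somewhat):

```python
# Output-side cost multiplier of 2.5 Flash Thinking vs. non-thinking, using the figures above.
price_ratio = 3.50 / 0.60    # output $/mTok: thinking vs. non-thinking
token_ratio = 3.42           # ~3.42x output tokens (74.3% of them reasoning)
print(round(price_ratio * token_ratio, 1))   # -> ~19.9, i.e. almost 20x
```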
o4-mini
o4-mini-high
Tested **o4-mini** & **o4-mini-high**:

**o4-mini:**
* Quite concise for a long-CoT reasoning model (only ~3.2x token verbosity compared to a traditional model).
* Real inference cost was almost identical to 3.7 Sonnet (non-thinking).
* Performance was roughly in line with o3-mini-high.

**o4-mini-high:**
* Roughly 156% more thinking tokens, which translates to roughly 2x the inference cost & delay.
* Comparatively minuscule improvements, in certain areas (very hard code & reasoning).
* Not universally better in every scenario, even when disregarding the cost increase.
* Roughly on par with Grok-3 (non-thinking).

Overall, in my environment, this model feels like a small upgrade to o3-mini in some scenarios. The effective cost is a bit lower, which is an upside. Not too impressive in my testing, but as always, depending on your own use case: **YMMV!**
Granite-3.3-8B-Instruct
Tested **Granite-3.3-8B-Instruct** (f16): Actually did a bit worse overall than the Granite 3.0 8B Instruct (Q8) [I tested 6 months ago.](https://discord.com/channels/1110598183144399058/1111649100518133842/1297985907546259678) Not the absolute worst, but just utterly uninteresting and beaten by a plethora of other models in the same size segment in pretty much all tested fields.
GPT-4.1
Tested the **GPT-4.1** series:

**GPT-4.1 Nano:** Cheap tiny model, roughly comparable to Qwen2.5 14B. Substantially beaten on price & performance by e.g. Google's Flash models.

**GPT-4.1 Mini:** Versatile fast model, roughly comparable to Gemini 2.0 Flash (but more expensive). Quite a solid coder, and performed on par with the larger model in my STEM segment.

**GPT-4.1:** "Flagship" of the series, roughly as strong as Llama 3.3 70B (but weaker STEM) & DeepSeek V3 0324 (but a weaker coder). Behind 7 other OpenAI models in my testing. The "Maverick"-type model of OpenAI.

All models are non-reasoning models and not very verbose when compared to other recent model releases (1.15x / 1.23x / 1.35x token verbosity as size increases in testing). All models, including Nano, are fairly competent coders, though none excel at my backend testing. None of these were particularly good in my STEM segment.

I have also added 0-shot examples for UI impressions and simplistic game design for each model on my [shared assets](https://dubesor.de/assets/shared/) (**NOT** part of any scoring, just for additional curiosity/comparison). As always, **YMMV**!
Llama-3.1-Nemotron-Nano-8B-v1
Tested **Llama-3.1-Nemotron-Nano-8B-v1** (*bf16*):

This model has **2 modes**: the reasoning mode (enabled by using `detailed thinking on` in the system prompt) and the default mode (`detailed thinking off`); a usage sketch follows below.

**Default behaviour:**
* Despite not officially `<think>`ing, about 2x as verbose as the base model
* Weak performance across the board, terrible instruction following/prompt adherence
* About the same capability as a 3B model, with added verbosity

**Reasoning mode:**
* Not always `<think>`ing, despite system instructions as per the NVIDIA documentation
* Minor improvements in logic, some improvements in STEM-related tasks
* Terrible instruction following/prompt adherence. Low utility

Both variants perform significantly below base Llama 3.1 8B and have far less general utility. Very poor model imo. But as always: **YMMV!**
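The mode switch really is just those two system-prompt strings. A hedged sketch of how I would drive it against a local OpenAI-compatible server (the endpoint and model name here are placeholders, not NVIDIA's official recipe):

```python
# Toggling Nemotron's reasoning via the documented system-prompt switch,
# against any OpenAI-compatible endpoint (e.g. a local vLLM or llama.cpp server).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def ask(question: str, detailed_thinking: bool) -> str:
    system = "detailed thinking on" if detailed_thinking else "detailed thinking off"
    resp = client.chat.completions.create(
        model="nvidia/Llama-3.1-Nemotron-Nano-8B-v1",
        messages=[
            {"role": "system", "content": system},   # the entire mode switch
            {"role": "user", "content": question},
        ],
    )
    # In reasoning mode the reply should start with a <think>...</think> block
    # (though, as noted above, it does not always do so).
    return resp.choices[0].message.content

print(ask("Which is larger, 9.11 or 9.9?", detailed_thinking=True))
```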
Grok-3 mini
Tested **Grok-3 mini**:

**default reasoning**:
* Near-identical token use to o3-mini *(medium)*, 132% more token use than the non-thinking Grok-3
* Good performance in all tested areas, around o3-mini level, not far behind Grok-3
* Better instruction following than Grok-3
* Better price/performance for most tasks than Grok-3

**high reasoning** (request sketch below):
* 65% more token use than default reasoning (which is labeled "low", but I would say it is more akin to "medium" reasoning)
* Same overall smartness, but gains stability in math and instruction following
* Not recommended for areas outside of the above, as I saw certain tasks even produce worse results for a higher price (e.g. some C++ issues not present on default thinking)

Also retested & updated the current Grok-3 due to observed deviations since 2 months ago; it scored slightly higher (+~1.5%). As always: **YMMV!**
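For reference, the effort level is selected per request through xAI's OpenAI-compatible API; a sketch along these lines (treat the exact parameter name and accepted values as assumptions to double-check in xAI's docs):

```python
# Requesting Grok-3 mini's higher reasoning effort via xAI's OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="https://api.x.ai/v1", api_key="XAI_API_KEY")

resp = client.chat.completions.create(
    model="grok-3-mini",
    reasoning_effort="high",   # the default tier is what I call "default reasoning" above
    messages=[{"role": "user", "content": "Integrate x*e^x dx and verify by differentiating."}],
)
print(resp.choices[0].message.content)
```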
Llama 4 Scout & Llama 4 Maverick
Tested Meta's new **Llama 4 Scout** & **Llama 4 Maverick**:

**Llama 4 Scout** (109B MoE):
* Not a reasoning model, but quite yappy (x1.57 token verbosity compared to traditional models)
* "Small" multipurpose model, performs okay in most areas, around **Qwen2.5-32B** / **Mistral Small 3 24B** capability
* Utterly useless at producing anything code-related
* Price/performance (at current offerings) is okay, but not too enticing when compared to stronger models such as Gemini 2.0 Flash

**Llama 4 Maverick** (402B MoE):
* Smarter, more concise model.
* Weaker than Llama 3.1 405B; performed decently in all areas, exceptional in none, around **Llama 3.3 70B** / **DeepSeek V3** capability.
* Workable but fairly unimpressive coding results, archaic frontend.

The shift to MoE means most people won't be able to run these on their local machines, which is a big personal downside. Overall, I am not too impressed by their performance and won't be utilizing them, but as always: **YMMV!**
Gemini 2.5 Pro Experimental 03-25
Tested **Gemini 2.5 Pro Experimental 03-25**:

**Average-verbose** reasoning model with around 5.4x the token use of a traditional model, clocking in around DeepSeek-R1 level token usage. Far less verbose than models such as o3-mini-high or Sonnet Thinking.

* **#1** in my Reasoning/Logic segment, surpassing GPT-4.5 Preview
* **#1** in my Code segment, surpassing GPT-4.5 Preview
* STEM and math were **competent**, but nowhere near the top, in my testing
* Overall utility for miscellaneous casual tasks was **fine**, but not outstanding

I really enjoyed testing this model. It's very capable, but still shows flaws in certain areas. As always: **YMMV!**
DeepSeek V3 0324
Tested **DeepSeek V3 0324**:

* More verbose than the previous V3 model; lengthier CoT-type responses resulted in total token verbosity of **+31.8%**
* Slightly smarter overall. Better coder. The most noticeable difference was a **hugely better frontend** and UI-related coding tasks

This was merely in my own testing, as always: **YMMV!**

Example frontend showcase comparisons (**identical** prompt, identical settings, 0-shot - **NOT** part of my benchmark testing):

* [CSS Demo page DeepSeek V3](https://dubesor.de/assets/shared/UIcompare/deepseek3UI.html) | [CSS Demo page DeepSeek V3 0324](https://dubesor.de/assets/shared/UIcompare/deepseek3%200324UI.html)
* [Steins;Gate Terminal DeepSeek V3](https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/DeepSeek%20V3.html) | [Steins;Gate Terminal DeepSeek V3 0324](https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/DeepSeek%20V3%200324.html)
* [Benchtable DeepSeek V3](https://dubesor.de/assets/shared/LLMBenchtableMockup/DeepSeek%20V3%200.04%20cents.html) | [Benchtable DeepSeek V3 0324](https://dubesor.de/assets/shared/LLMBenchtableMockup/DeepSeek%20V3%200324%200.07%20cents.html)
* [Mushroom platformer DeepSeek V3](https://dubesor.de/assets/shared/MushroomPlatformer/DeepSeek%20V3.html) | [Mushroom platformer DeepSeek V3 0324](https://dubesor.de/assets/shared/MushroomPlatformer/DeepSeek%20V3%200324.html)
EXAONE Deep 32B
Tested **EXAONE Deep 32B** (*local, Q4_K_M*): Yet another long-CoT reasoner. Stumbles around with its thoughts and delivers unimpressive results, even when compared to non-reasoning models less than half its size. Was utterly useless in anything code-related. This one is very lame and weak imho; there are at least a dozen far better options at that size. As always: **YMMV!**
Llama-3.3-Nemotron-Super-49B-v1
Tested **Llama-3.3-Nemotron-Super-49B-v1** (*local, Q4_K_M*):

This model has **2 modes**: the reasoning mode (enabled by using `detailed thinking on` in the system prompt) and the default mode (`detailed thinking off`).

**Default behaviour:**
* Despite not officially `<think>`ing, can be quite verbose, using about 92% more tokens than a traditional model.
* Strong performance in reasoning, solid in STEM and coding tasks.
* Showed some weaknesses in my Utility segment; produced some flawed outputs when it came to precise instruction following
* Overall capability very high for its size (**49B**), about on par with Llama 3.3 **70B**. The size slots nicely into 32GB or above (e.g. 5090).

**Reasoning mode:**
* Produced about **167% more tokens** than the non-reasoning counterpart.
* Counterintuitively, scored slightly lower on my reasoning segment. Partially caused by **overthinking**, or a higher likelihood of landing on creative (but ultimately false) solutions. There were also instances where it reasoned about important details but failed to address them in its final reply.
* **Improvements** were seen in **STEM** (particularly math) and higher-precision instruction following.

This has been 3 days of local testing, with many side-by-side comparisons between the 2 modes. While the reasoning mode received a slight edge overall in terms of total weighted scoring, the default mode is far more feasible when it comes to token efficiency and thus general usability.

Overall, a very good model for its size; I wasn't too impressed by its 'detailed thinking', but as always: **YMMV!**
Olmo 2 32B
Tested **Olmo 2 32B Instruct** (API/bf16):

* Performs around a modern 10B model
* Okay for general questions, but rather weak in any specialized field (math, code, etc.)
* Quite vanilla/sterile

This model's size/performance is quite poor overall. Outclassed by models such as Nemo 12B and Phi-4 14B. Subjective vibe check not passed (not rated). Uninteresting model imho, but **YMMV**.
Mistral Small 3.1
Tested **Mistral Small 3.1** (API):

* Not much to say; it's pretty much **identical** to Mistral Small 3 (within margin of error & minute precision/quantization differences)
* You get multi-modality. I found no underlying text-capability differences.
Jamba 1.6
Tested **Jamba 1.6** (Mini & Large):

* Literally worse than the 1.5 models I tested 7 months ago.
* The models cannot even produce a simplistic table!
* They are completely coherent, but unintelligent and feel ancient.

The "Large" model gets beaten by local ~15B models in terms of raw capability, and the pricing is completely outdated. The Mini model performed slightly above Ministral 3B. These models are very bad imho. As always: YMMV!
Reka Flash 3
Tested **Reka Flash 3** (21B, Q8): This one is yet another long-CoT reasoning model (~5.32x token verbosity compared to a traditional model). It did decently in my coding segment (don't use this for frontend web design though! it looks terrible). It has low general utility due to extreme verbosity and subpar instruction following. In other categories, it performed okay-ish for its size. Outclassed by models such as Mistral Small 3, Gemma 3 12B, and Phi-4 14B in most scenarios. As always: YMMV!
Command A
Tested **Command A** (03-2025):

* Significant upgrade over Command R+ 08-2024
* Feels a bit dated for its size (111B) when compared to models such as Llama 3.3 70B
* Surprising performance in my tech and code segments, where it delivered consistently good results
* Less censored than most other models, easy to steer

As for their marketing claims about being on par with or better than 4o and DeepSeek V3: certainly not for general use, but it did perform on par in my coding segments. This is obviously a model geared at enterprise, RAG, and agentic work, but it will still be useful for risky writing and similar creative work. As always: YMMV!
Gemma 3
Tested **Gemma 3** (*local, Q5, Q8, bf16*):

**27B**: Better in STEM, particularly math; poor coder; disappointing reasoning compared to Gemma 2 & 12B

**12B**: Slightly better in almost everything compared to Gemma 2 9B, equivalent performance to 27B in many tasks

**4B**: Comparable to Gemma 2 2B; found it less versatile but a tiny bit smarter in certain cases

Family as a whole: Hard **refusals** have been significantly reduced. You now have to live with large segments containing legal and warning disclaimers though.

Multi-modality & image inputs: My testing does NOT test any multimodal functionality, so do keep an eye on benchmarks that do.

For my use case, as someone who barely ever requires image input, these models are a bit disappointing in terms of raw text capability. But, as always, **YMMV**!
R1 1776
Ran a full retest of **R1 1776**, after Perplexity claimed to have fixed their implementation:

* Higher quality chain of thoughts, in particular in long context; fixed degradation
* Thus, gains in all tested areas compared to the initial implementation
* Still falls short when compared to DeepSeek-R1
* Core model remains identical, with the same issues, such as still-censored Chinese areas and propaganda

tl;dr: Recent fixes improved the thought chains and thus the outcomes significantly, but it doesn't quite reach R1 level in my testing. As always, **YMMV!**
QwQ-32B
Tested **QwQ-32B** (*local, Q4_K_M*):

* best in its size class, except for coding
* extremely verbose (avg. ~10x output tokens compared to a traditional model, more verbose than any other long-CoT model I have ever tested)
* more effective thought chains than the R1-distill version of Qwen2.5-32B
* terrible at all webdesign tests I threw at it
* smartest sub-70B, by brute force token chains

This is a smart model, but for me the extreme verbosity and inference required exclude it from becoming a daily driver. The good outcomes feel brute-forced with CoT, and the verbosity is borderline ridiculous. Good for complex STEM-related subjects or reasoning tasks. Not useful for coding. As always, **YMMV**!
GPT-4.5 Preview
Tested **GPT-4.5 Preview**:

* Very **expensive** model, obviously, with the highest raw price yet, but actually a bit cheaper than o1 if you account for hidden thought-chains. The model is also fairly concise, with reply lengths slightly below the median of non-thinking models.
* **Highest common sense** of all models I have ever tested (~130+)
* STEM, coding, and other professional tasks were good, but not super impressive. Attention to detail (haystack tests, bug spotting) was very good, though.
* Vibe, style, etc. I do not specifically test for, but I found the model to be fairly standard, at least in my collected queries.

While the model is advertised as being good at conversation, creativity and natural dialogue, I don't see how casual conversations with this model are a feasible use case, considering the outrageous price. I will personally use it as an agent for decision making that requires common sense (e.g. a judge or critical analyst). As always - **YMMV**!
Claude 3.7 Sonnet Thinking
Tested Claude **3.7 Sonnet Thinking** (budget 16k, though the max it ever used was 9k; request sketch below):

* Overall output usage was ~7.44x compared to normal (significantly more expensive).
* In the **vast majority** of cases, final outputs were of identical quality compared to non-thinking.
* In certain reasoning and creative tasks it performed consistently worse than non-thinking (e.g. due to overthinking, or pondering reasons not to adhere to the user query)
* In rare specific queries (most consistently in hard code and hard math), it performed slightly better.

I know it feels counterintuitive that it can perform below non-thinking on e.g. Reasoning, but I have retested all differing results multiple times, and the differences were reproducible & consistent.

For my use case, the thinking mode will remain deactivated 99% of the time, unless I have a very specific issue that non-thinking cannot solve; then it might be worth giving a thinking budget a try. For the average user, I doubt that using it is wise considering cost-effectiveness. However, as always, just my own testing. **YMMV**!
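For reference, the 16k budget above is set per request; a minimal sketch via the Anthropic SDK (parameter shapes as I understand their extended-thinking docs, so verify there before relying on it):

```python
# Enabling 3.7 Sonnet's extended thinking with a 16k token budget.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

resp = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=20_000,                                    # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 16_000},
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
)

# The response interleaves "thinking" blocks with the usual "text" blocks.
for block in resp.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```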
Claude 3.7 Sonnet
Tested **Claude 3.7 Sonnet** (non-thinking, claude-3-7-sonnet-20250219):

Smarter & better overall; the biggest improvement imho was its far less aggressive nanny-behaviour (still not uncensored, but a big improvement!). Its frontend dev skills (which were already great) were taken up a notch and produce even better results. Flaws were rather in backend and debugging. Overall, a fantastic model. I'll check out the different thinking options over the next few days *(though I have a feeling it won't lead to very cost-efficient improvements)*. As always, **YMMV**!

3 simple frontend UI comparisons between 3.5 and 3.7 (short query prompt, 0-shot) - **NOT PART OF MY TESTING**, JUST A FUN COMPARISON:

CSS DEMO:
https://dubesor.de/assets/shared/UIcompare/Sonnet3.5.1.html
https://dubesor.de/assets/shared/UIcompare/Sonnet3.7UI.html

STEINS;GATE TERMINAL:
https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/Claude%203.5%20Sonnet%20new.html
https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/Claude%203.7%20Sonnet.html

LLM BENCHTABLE MOCKUP:
https://dubesor.de/assets/shared/LLMBenchtableMockup/Claude%203.5%20Sonnet%203.1%20cents.html
https://dubesor.de/assets/shared/LLMBenchtableMockup/Claude%203.7%20Sonnet%2017.9%20cents.html
R1 1776
Tested **R1 1776** (Perplexity post-trained to remove Chinese censorship): Reasoning showed strong signs of degradation, leading to worse results in all tested areas. Math, formatting and code-related tasks were more strongly affected than pure logic tasks. Ironically, the few Chinese censorship tests I have (and have had for a long time) still produced 100% censored and propagandistic answers. Whether the degradation is due to the post-training or to how the model is implemented, I do not know. But I do know that it isn't on R1 level. As always, **YMMV**.
Grok-3
Tested the current **Grok-3**: Reasoning was similar to Grok-2 in my environment, but I saw large improvements in STEM, general utility and coding (*on a side note, UX design was hit or miss, sometimes phenomenal, sometimes poor, so a bit inconsistent*). It's still fairly uncensored, and a **very wordy** model (non-reasoning, but produces large responses, roughly 2.25x as wordy as GPT-4o-latest, which is already wordier than the average traditional model). I found it to be a little less cringe-inducing than Grok-2 (subjective, unrated). Overall, a very capable model, but not the best at any field I test. As always - **YMMV**!
chatgpt-4o-latest
Tested the current 'chatgpt-4o-latest' (timestamp 2025-02-16), and compared it to results from 4 months ago:

* about 1-4.7% better on my test set, depending on how refusals are weighted
* more prone to censor in risk topics, lower utility in risk-deemed RP
* slightly improved capability across different segments: math, logic, coding, ...
* slightly altered behaviour/styling: more emojis by default, more casual tone in certain settings
* overall, slightly better for most use cases; most capable non-thinking model other than 4-Turbo

As always, YMMV!
Qwen2.5
Tested the new **Qwen2.5** models (also updated the price, since it changed just 1 day after my testing):

* Qwen2.5-**Turbo** - cheap model, roughly equivalent to GPT-3.5 Turbo
* Qwen2.5-**Plus** - mid model, roughly equivalent to Qwen2.5-72B
* Qwen2.5-**Max** - large model, roughly equivalent to Mistral Large 2

Traditional models that are competent enough for their weight & price. Not the most interesting models to me, but as always - YMMV!
o3-mini
Tested **o3-mini** (default): Slightly improved reasoning results over o1-mini; strong coder. Did slightly worse at my STEM segment and anything deemed not safe. If you don't care about censorship or refusals, or use it solely for coding, it's slightly better than o1. Overall, for general use, not a noticeable capability upgrade. The new pricing will make it more affordable though (thought-token impact calculations still outstanding on my part). As always, YMMV.
Mistral-Small-24B-Instruct-2501
Tested **Mistral-Small-24B-Instruct-2501** aka Mistral Small 3: Saw improvement over the previous Mistral Small in most areas, except for code, where it actually performed slightly lower (retests were already done, but my segment is rather small, so do take it with a grain of salt). It ends up among the best choices for sub-50B models, alongside Qwen2.5 32B and Gemma 2 27B. Mistral always has lower inherent censorship, so it should perform well for roleplay and similar creative tasks. Very useful model size for local inference (16GB VRAM+). Good universal model overall.
Qwen2.5-Max
Tested **Qwen2.5-Max**: It's a traditional, competent but overall uninteresting model. Slightly smarter than the vastly cheaper 72B, but its pricing strategy at $10/30 mTok is severely outdated compared to competing models with similar capability, such as the 72B or DeepSeek V3. I saw minor improvements in maths and formatting, but none in core logic or coding-related tasks. This model is also quite dry and doesn't pass my vibe check; a rather weak conversationalist and bad for RP. For me, considering the price, it's a pass. As always, YMMV!
R1-Distill
Locally tested **R1-Distill-Llama-8B**, **R1-Distill-Qwen-14B**, **R1-Distill-Qwen-32B**:

32B was decent; the smaller distilled models were weaker than base. Overall I would use the non-thinkers in my own use case, as the benefit (or lack thereof) is not worth it for local compute, imho. For the smaller models, it makes them less usable, with a lack of benefit. Also, the token spam is not really desired for local use, at least in my use cases. Llama was particularly impacted, gaining minute capability in reasoning and STEM, but sacrificing almost all utility and coding. For me, after several hours of testing, the distilled models aren't really attractive. As always, YMMV.

Tested 4 distill models by now; all statements are from my testing, **ymmv**:

* **8B** - weaker than base (*ranked #107 -vs- #84*)
* **14B** - weaker than base (*ranked #80 -vs- #62*)
* **32B** - slightly better than base (*ranked #54 -vs- #59*)
* **70B** - weaker than base, due to bloated thoughts - pretty much unusable locally (*ranked #26 -vs- #18*)
DeepSeek R1-Zero
Tested **R1-Zero** (fp8): A little bit messier and less conventional than R1, less aligned/filtered. Loses out in formatting and thus coding, but is a highly capable model overall. Probably not as consumer-friendly as R1, but my testing probes mostly raw capability. As always, YMMV!
DeepSeek-R1
Tested **DeepSeek-R1**:

This model is extremely capable for a non-proprietary model, and the first to truly successfully challenge the top SOTA models (more so than DeepSeek V3). It will not be the most efficient model for every use case, as its long chain-of-thought reasoning response (which can be ignored by the user) is very verbose, with an average ratio of 4.7:1 in my testing. Some of its programming thoughts were even breaking my previous response storage method by surpassing 32k chars. So, it can be extremely verbose in its "reasoning_content" response. The final answer ("content") is fairly concise in comparison (see the sketch below).

Compared to R1-Lite-preview, it did not suffer from false refusals or Chinese output issues. It outperformed Llama 3.1 405B in almost all tasks except for general utility (roleplay, concise formatting, etc., which is to be expected). Overall a fantastic long-CoT model. Adding in the cost factor, in terms of cost effectiveness it blows o1 completely out of the water. Plus you get to actually see every token you pay for, if you desire. As always - YMMV!
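To illustrate the split mentioned above: DeepSeek's OpenAI-compatible API returns the chain of thought and the final answer as separate fields, so you can store or discard them independently. A minimal sketch (field and model names as I recall them from DeepSeek's docs; verify against the current API reference):

```python
# R1 returns the visible chain of thought ("reasoning_content") separately
# from the final answer ("content"); both count towards billed output tokens.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="DEEPSEEK_API_KEY")

resp = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "How many times does the letter r appear in 'strawberry'?"}],
)

msg = resp.choices[0].message
print("reasoning length (chars):", len(msg.reasoning_content))  # often far longer than the answer
print("final answer:", msg.content)                             # comparatively concise
```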
MiniMax-01
Tested **MiniMax-01** in my bench environment. Results were around WizardLM-2 8x22B or Llama 3.0 70B level. It was pretty mediocre in most tested fields; cost/performance was top 40%, neither expensive nor particularly cheap for its capability. There are some minor quirks with Chinese output or lack of format adherence, but not to an unusable degree. Overall a pretty meh model to me. As always - YMMV!
QVQ-72B-Preview
Tested **QVQ-72B-Preview** (bf16): Surprisingly bad model; despite being a long-CoT reasoning model, it did not impress in reasoning tasks. It did OK in STEM-related tasks, but was useless in programming and anything requiring it to follow instructions. It fails to provide full working code segments most of the time and thinks in poorly formatted snippets. It's also very censored and refused a lot of tasks unjustifiably. It placed #79 in my current environment, right next to Llama 3.1 8B and Ministral 8B, and for a 72B model that should be quite telling. It also could not deliver on vibe, style or character. I see zero use case for this model; there are far better options, and far better long chain-of-thought models, out there.
DeepSeek V3
**DeepSeek V3** - Thoroughly tested the new capability; I was fortunate to still have very recent 2.5 datasets (due to being late on 1210) for direct output comparisons. Strong STEM & code, solid instruction following and general utility, arguably minimally improved reasoning. Overall, the 3rd most capable open-source model (behind Llama 3.1 405B & Llama 3.3 70B) in my testing. As for proprietary, roughly on o1-mini level. The biggest flaw for this model in my testing is clearly its reasoning, more specifically anything in areas requiring critical thinking and applying common sense, where it blunders a lot and consistently. As always, YMMV!
GPT-4 Turbo
Decided to retest GPT-4 Turbo (OR identifier 'openai/gpt-4-turbo') after 10 months, with the most recent adjustments. It still holds up very well in comparison to most other state-of-the-art models, in fact beating them in most scenarios on pure substance. It's a bit dry and doesn't have the best formatting nor style/vibe, but it gets the job done. Compared to my testing back in February 2024, something clearly changed, though, as the model behaviour was slightly different: e.g. I ran into reproducible refusals that were definitely not present beforehand. Combined with slightly weaker (but still strong) performance, I suspect this is caused by the system prompt being changed to more convoluted legal coverage, or similar back-end alterations over time. Still, it's better in this regard than o1. It will remain my go-to for when cheaper, more efficient models don't cut it for a problem.
Gemini 1206
Gemini 2.0 Flash Experimental
Gemini 2.0 Flash Thinking Experimental
Tested the 3 recent experimental Gemini models (Gemini Experimental 1206, Gemini 2.0 Flash Experimental, Gemini 2.0 Flash Thinking Experimental). None of the 3 were major improvements in terms of total capability, but they differ slightly in behaviour. The thinking model is obviously geared towards reasoning, but introduces too much noise, which hurts its coding and instruction-following ability.
o1-2024-12-17
Tested the full o1 via API: Compared to o1-preview, it used slightly fewer invisible thought-tokens and is undoubtedly much better at STEM (particularly math). Multiple unjustified refusals tanked its utility in my cases; this model is clearly not designed for tasks such as e.g. summarization or agentic personas. I was not impressed by its coding: I required multiple reiterations and had to restate info that was already present in the task, wasting a TON of money on non-exceptional results. This model is also fairly censored and steers away from any potentially controversial subjects, even if harmless in context. In terms of overcautiousness it is more akin to Claude models than what I am used to from OpenAI.

tl;dr: Fantastic STEM capability, great reasoning, not too impressive in other areas in my testing. Unfathomably expensive, obviously, because the invisible tokens inflate the actual pricing to around $190/mTok across my testing (arithmetic sketched below).
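To make that ~$190/mTok figure less mysterious: you pay the list output rate on reasoning tokens you never see, so the effective price per visible output token scales with the hidden-to-visible ratio. A sketch using o1's published output price and an assumed ratio chosen purely for illustration (not my measured average):

```python
# Effective $/mTok of *visible* output when hidden reasoning tokens bill at the list rate.
list_output_price = 60.0   # o1's published output price, $ per million tokens
hidden_per_visible = 2.2   # assumed hidden reasoning tokens per visible output token

effective_price = list_output_price * (1 + hidden_per_visible)
print(f"~${effective_price:.0f} per million visible output tokens")  # ~$192 at this ratio
```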
Command R7B
Tested **Command R7B** (12-2024) - Around Granite 3.0 8B / Qwen2.5-7B level, with decent STEM performance, poor reasoning and terrible coding. There are stronger options in that size category (Llama 3.1, Ministral, etc.). Price/performance is OK, but again, there are much better options even for bang4buck.
Phi-4
Tested **Phi-4** (14B): it's a decent model around Nemo 12B & Qwen2.5 14B level, with decent reasoning and very good STEM capability, but lackluster code & instruction following. The default vibe is very neutral and quite sterile, as expected from a Microsoft model.
Llama 3.3 70B
Checked out Llama 3.3 70B locally (Q4_K_M): Strongest open model after Llama 3.1 405B; saw big improvements in reasoning and STEM-related tasks compared to 3.1. It did not do particularly well in my coding-related tasks, though. Due to this outlier, I had to retest that segment multiple times. Overall, a very capable general-use model.
QwQ-32B-Preview
Ran **QwQ-32B-Preview** (Q4_K_M) through my own benchmark; have to say I am disappointed, I had higher hopes. Its outputs are often annoyingly formatted (e.g. no proper distinction between the thinking/reasoning loop and the true final output), often no full code blocks but snippets despite being instructed otherwise, terrible instruction following overall, and reasoning was poorer than vanilla Qwen2.5 32B. Math/STEM got boosted by the reasoning loops in a positive way - everything else was rather poor in comparison. Vibe check failed: the style of this model annoys me, and it has a high refusal rate. It also claimed to be developed by OpenAI in my self-description collections. I ranked it #62 (between Jamba 1.5 Mini and GPT-3.5 Turbo). Was hoping for a more fun model for my 100th model anniversary, oh well.
Marco-o1
Tested Marco-o1 (fp16), and its capability was pretty much exactly what I expect from a 7B model. The thinking is a nice gimmick, but it didn't yield better results in reasoning, as the model was unable to outthink bad thinking. It did help in math-related tasks, though. For any generic utility tasks such as instruction following, summarization, etc., I found it to be borderline unusable. The model sometimes had crucial information within its thinking tags while omitting it entirely from the output outside the tags, meaning you cannot effectively filter out the tags without info loss. Fun and quirky to play around with, sure, but not groundbreaking in my testing.
DeepSeek-R1-Lite-Preview
DeepSeek-R1-Lite-Preview: Performed a bit worse than DeepSeek V2.5 overall, partially due to uncalled-for **refusals** in completely generic tasks. However, it has the highest reasoning skills of all DeepSeek models (slightly higher than o1-mini). Overall, it placed on a similar level to Grok-2 mini and Claude 3.5 Haiku in my testing. If this model is in the 50-60B range, it would be very impressive, provided they can iron out the refusal behaviour. It can be quirky fun to use due to its ramblings, which are hit or miss.
Claude 3.5 Haiku
Just checked out Claude 3.5 Haiku, very unexpected results. In my own small-scale test it showcased:

* By far the least censored Claude model (other than Claude-1); very different refusal/censor behaviour when compared to old Haiku or the Sonnets & Opus.
* Roughly 2x the capability of Claude 3 Haiku
* Did better on my small subset of code-related tasks than 3.5 Sonnet
* STEM was pretty identical
* Some flaws in utility/misc tasks (terrible roleplayer)
* Reasoning still pretty weak, but huge gains compared to the previous iteration
* Pricing is too high when competing with models such as 4o-mini or Gemini 1.5 Pro 002

Not rated, but subjective vibe check: very concise model that seems to love putting nearly everything into list format. AS ALWAYS - YMMV!
Aya Expanse
Cohere **Aya Expanse** testing has concluded for me: very weak for their size compared to the competition, not worth the storage space imo.

Aya Expanse 8B (f16) - failed pretty much everything and was around L3.2 3B capability.

Aya Expanse 32B (Q4_K_M) - weaker than even Gemma 2 9B & Nemo 12B in my testing. It would be OK as like a 12B model due to being fairly uncensored. Gets absolutely stomped by Qwen2.5.
Claude 3.5 Sonnet 20241022
Tested the new 3.5 Sonnet. After all is done and accounted for, it jumped ranks from #15 > #7 with slightly less prudishness (still much higher than the competition). I saw massive gains in tasks labeled for Reasoning (suspiciously high gains, I need to investigate this further). A slight dip in prompt adherence and code. I scrutinized and retested all tech-related coding tasks a total of 6 times, ended up running 18 queries PER TASK in that particular label to exclude any random outliers. The results were consistently delivering the same outcome, though. Good improvements as a whole.
Granite-3.0-8B-Instruct
Granite-3.0-8B-Instruct (Q8_0). Not terrible, not great. While my bench is too hard for small models, and doesn't catch the minute differences for them, it still gives a rough expected performance ballpark. But if I add to that the vibe check (not tested nor depicted) - utterly uninteresting model, won't stay on my drive.
Yi-Lightning
My 2 cents on the Yi-Lightning models: Yi-Lightning tested around Llama 3.1 70B level, and the Lite model around Llama 3.0 70B, for me. Reasoning-labeled tasks were pretty dead even between them; not their strong suit. Pretty good at STEM & maths, and better at following instructions than Qwen models. Fairly uncensored. Competent, but did not reach a top spot, unlike in, say, the current arena ranking. It has good style though, so that would gain a fair amount of votes.
Ministral
I was quite impressed by Ministral 3B; 8B, on the other hand, was a barely noticeable improvement in the vast majority of cases. Here are some neighboring performers:

Ministral 8B =~ Llama 3.1 8B
Ministral 3B =~ Llama 3.0 8B

Sucks that the 3B model is not local; it would be good to run on the side. It's definitely the more interesting model here, but usage is so limited by this. As always, YMMV!
Inflection 3
Tested the Inflection 3 models: the Productivity one is better not just in performance but also in style imho. The capability range is around Gemma 2 9B (Pi) and Nemo 12B / Qwen 14B (Productivity). The models are too expensive for what they offer, ranking near the bottom of my price/performance calculations.
Llama-3.1-Nemotron-70B-Instruct
Compared Llama-3.1-Nemotron-70B-Instruct to the vanilla Llama 3.1 model and to the best-performing competitor at that size, Qwen2.5-72B. Overall a pretty substantial improvement; I saw the biggest gains in STEM-related questions. It's also a pretty consistent model and didn't really blunder anything too terribly.
Llama 3.2 11B & 90B
Tested the new 3.2 vision models (text capability), comparing them to their non-vision brethren. 90B was slightly smarter, and 11B was about even with 8B. I do not bench for vision, but tested them a little bit for myself anyway; the vision is OK for a first iteration, but not what I would personally use compared to other vision models.
Llama-3.2 1B & 3B
Gave Llama-3.2-1B and Llama-3.2-3B a spin. My testing isn't very well suited for such tiny models, as the test set is too hard for them, but I wanted to try anyway. I found Gemma 2B to be vastly superior to Llama-3.2-3B.
Llama-3.1-Nemotron-51B
Llama-3.1-Nemotron-51B tested: very impressive for its size, going toe to toe with its 70B brother, outperforming it in math but losing out on reasoning and misc tasks. Great model! As always - YMMV!
Qwen2.5-14B-Instruct
Finished testing Qwen2.5-14B-Instruct on Q8 - the best overall sub-70B local model I have tested thus far. Barely beat the former champion despite being half as big. As is the issue with most Chinese models, it's not very good at sticking to strict instructions and has general prompt-adherence issues, but other than that it's a very capable model.
Mistral-Small-Instruct-24-09
At 22B, another great sub 70B option, joining the ranks of the likes of Nemo & Gemma 27B. Decent coder, good at math, and fairly unrestricted out of the box. Kinda flopped in the logic department during my testing, so if you want something to solve riddles this isn't the right model.
o1-mini
Finally finished my o1 testing; this was a long and expensive ride. o1-mini completely wiped the floor with o1-preview in anything math-related, not even close. The rest of the differences are not big enough to justify the pricing. As always, YMMV.
o1-preview
Full o1-preview results are in from my own small-scale testing! Highest reasoning I have tested thus far (outside of unreleased models), partially embarrassing math skills, okay-ish utility (bad for RP though), not impressed by its coding, and the most censored OpenAI model I have ever tested. This was very expensive and time-consuming, due to usage caps and the fact that this time around I also had to track invisible token usage for the true mTok cost... Working on mini next; so far I like it a lot better in terms of price/performance, and it seems to waste fewer tokens on thinking. As always, YMMV!
DeepSeek V2.5
Put DeepSeek V2.5 through my benchmark. Very similar total capability to DeepSeek-V2, with improvement in math and programming but slight decrease in reasoning and prompt adherence. Very good model for the price/size. As always - YMMV!
Reflection Llama-3.1 70B
Model review of Reflection (local quant, thus not as powerful as fp16 (once the bugs are ironed out), but still very useful data for a general ballpark). In terms of Llama variations:

Reasoning: Very good, as was expected. Only beaten by 405B.
STEM: Still good for its size. Reflection can cause math to introduce additional inaccuracies.
General utility: Bad; the baked-in reflection counteracts user instructions, to the point where the reflecting part actively tries to combat user instructions.
Code: On par with L3 70B; its large thinking/reflecting segments seem to cause context-poisoning issues, lowering the end-result code quality.
Censorship: Same as L3 70B and L3.1 8B.
Command R & R+ 08-2024
New Command R models: more efficient, but minor "improvements" overall. The API introduced safety guardrails, which can technically be removed in the terminal if you access the model directly, but cannot be turned off on providers that do not allow manipulating the safety parameter. If we compare to the competition in similar size brackets, e.g. R+ to Mistral Large and R to Gemma 2 27B, the performance is underwhelming.
Command R 08-2024
Command R 08-2024 testing (Q4_K_M): slightly better than the old model but, at least in my testing, not good enough for its size compared to smaller competition. Weirdly enough, it blundered my entire code section.
Jamba 1.5
Tested the Jamba 1.5 models. They are decent-ish but pretty underwhelming for their size. Jamba 1.5 Large, with a gigantic 399B size, punches in the same league as the old L3-70B, and the Mini version is roughly equivalent to the over 5-times-smaller Gemma 2 9B.
Llama 3.1 405B Instruct
In light of https://x.com/aidan_mclau/status/1822830757137596521 I redid the entire bench on bf16. I also reran results 3 additional times if they differed between versions. The most notable difference is that I got 0 refusals this time (4 reproducible refusals on fp8), and the reasoning is higher, with minute discrepancies in math and 1 programming task. Meta changing kv heads a few days ago, and API outputs retroactively being bugged (as I noticed 2 days ago), doesn't help with the discrepancies.
Gemini Pro 1.5 experimental
Gemini Pro 1.5 experimental is quite the step up. *(in before the compliance gets nerfed after experimental phase).*
Claude-3.5-Sonnet
Obligatory 3.5 Sonnet graph; my own bench. Better than Opus in almost every tested way, except for programming and censorship tasks. Still inferior to OpenAI in reasoning and critical-thinking tasks. Passed ~57% of my tasks, with double the fail rate of GPT-4 Turbo. As always, YMMV.
Llama-3-70B-Instruct
Tested Llama 3 today. Benched around Mistral Medium level. Good reasoning; a terrible programmer though, missing every bug-hunting task and every needle-in-haystack task. But a big improvement over Llama 2 overall, and also far more lenient in terms of refusals.
claude-3-haiku-20240307
I finally had the time to run all tests through Haiku, so here are the 4 recent Claude models together. Haiku is the only cost-effective Claude model. It comes at a minuscule 1.67% of the price of Opus and performs well in STEM and small generic tasks, but sucks heavily at reasoning.
Gemini, Claude, Mistral, GPT-4 Turbo
A few other interesting findings:

* Gemini (both Pro and Ultra) are very prone to unnecessary refusal, even refusing tasks that are not even remotely questionable.
* Mistral and OpenAI models almost never refuse anything, even my tasks that are specifically designed to be risky. (Claude-1 belonged to this camp.)
* Sonnet is such a weird model. In my testing, it performed better than Opus on tasks that have extremely high difficulty (>83%), yet somehow manages to give completely moronic answers to the easiest questions: https://i.imgur.com/4VeZ5vB.png
* Out of all tested models, Claude 2.1 scored highest on prompt adherence (sticking to prompt instructions).
* Opus seems significantly better in STEM and math, but did not deliver better results in programming over Sonnet.
* GPT-4 Turbo has the highest reasoning skills, bar none.
* The best models sometimes fail to do the simplest tasks that the worst models easily do, such as ending a sentence with a specific word, or excluding certain things.
* GPT-4 Turbo was the only model that consistently gets easy-to-medium tasks correct, whereas other models sometimes fail at even the simplest of tasks.
Earlier first impressions
*Earlier first impressions between Feb and Aug 2024 are scattered across different servers & messages, with screenshots or short in nature; too time-consuming to find and copy right now.*