Dubesor - First Impressions Blog

Initial model impressions & general vibe, copy-pasted from my Discord comments

Codex Mini
Tested **Codex Mini**. Obviously not a general-purpose model, but out of curiosity I like to test specialized models in my general environment regardless:

* General capability around GPT-4.1, between o3-mini and o3-mini-high
* In the few (non-agentic) coding tasks I have, it performed at o3-mini-high level
* Overall token verbosity (x5.91) was around R1 level (slightly lower than o4-mini-high), with a 3/1 split between reasoning and output tokens
* Real bottom-line cost was a bit lower than o3-mini-high and a bit higher than o4-mini-high (the arithmetic is sketched below)
* Did well in my small vision test (in fact top of the OpenAI lineup, barely edging out o4-mini due to comparable weighted ratings), though still behind Google models

Overall vibes: it felt like an o4-mini model with code-focused system instructions. This model doesn't fit within my personal use case/workflow, so obviously take my findings with a grain of salt, and as always - **YMMV!**
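For context on how "real bottom-line cost" figures like the one above come about: reasoning tokens bill at the output rate on top of the visible reply. A minimal sketch of that arithmetic, with placeholder prices and token counts (illustrative assumptions, not Codex Mini's actual rates or my measured averages):

```python
# Rough per-query cost estimate for a long-CoT model: hidden/reasoning tokens
# are billed at the same rate as visible output tokens.
def query_cost(prompt_tok, reasoning_tok, output_tok, in_price, out_price):
    """Prices are $ per million tokens; returns the cost of one query in dollars."""
    return (prompt_tok * in_price + (reasoning_tok + output_tok) * out_price) / 1_000_000

# Illustrative numbers only: a reply a traditional model would answer in ~400 tokens,
# at x5.91 total verbosity with a 3/1 reasoning-to-output split.
total = 400 * 5.91
reasoning, visible = 0.75 * total, 0.25 * total
print(f"${query_cost(1_000, reasoning, visible, in_price=1.5, out_price=6.0):.4f}")
```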
Mistral Medium 3
Tested **Mistral Medium 3**:

* Non-reasoning model, but baked-in chain-of-thought resulted in x2.08 overall token verbosity
* Supports basic vision (but quite weak, similar to Pixtral 12B in my vision bench)
* Capability was quite mediocre, placing it between Mistral Large 1 & 2, on a similar level as Gemini 2.0 Flash or GPT-4.1 Mini
* Bang for buck is meh; cost efficiency is lower than its competing field

Overall, I found this model fairly average, definitely **not** "*SOTA performance at 8X lower cost*" as claimed in their marketing. But of course, as always - **YMMV!**
Gemini 2.5 Pro Preview 05-06
Checked out the new Gemini 2.5 Pro Preview **05-06** update (prev. 03-25): Did slightly worse at my reasoning segment (reproducibly, the same 3 tasks), same STEM, slightly improved instruction following; tech was better on 1 issue (but worse on 1 issue compared to Exp). Overall, same capability **in my environment**, shifted more towards coding, as the blog post suggests. Token use (and thus price) was up +17% (+11% on non-reasoning and +21% on reasoning tokens).

Example front-end showcase comparisons (**identical** prompt, identical settings, 0-shot - **NOT** part of my benchmark testing):

* [CSS Demo page exp/03-25](https://dubesor.de/assets/shared/UIcompare/gemini2.5proexp0325UI) | [CSS Demo page 05-06](https://dubesor.de/assets/shared/UIcompare/Gemini2.5ProPreview05-06UI) :thumbsup:
* [Steins;Gate Terminal exp/03-25](https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/Gemini%202.5%20Pro%20Experimental%2003-25) | [Steins;Gate Terminal 05-06](https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/Gemini%202.5%20Pro%20Preview%2005-06)
* [Benchtable exp/03-25](https://dubesor.de/assets/shared/LLMBenchtableMockup/Gemini%202.5%20Pro%20Experimental%2003-25) | [Benchtable 05-06](https://dubesor.de/assets/shared/LLMBenchtableMockup/Gemini%202.5%20Pro%20Preview%2005-06%2010.3%20cent.html)
* [Mushroom platformer exp/03-25](https://dubesor.de/assets/shared/MushroomPlatformer/Gemini%202.5%20Pro%20Experimental%2003-25) | [Mushroom platformer 05-06](https://dubesor.de/assets/shared/MushroomPlatformer/Gemini%202.5%20Pro%20Preview%2005-06)
* [Village game 03-25](https://dubesor.de/assets/shared/VillageGame/Gemini%202.5%20Pro%20Preview%2003-25.html) | [Village game 05-06](https://dubesor.de/assets/shared/VillageGame/Gemini%202.5%20Pro%20Preview%2005-06.html) :thumbsdown:

Overall, minor observable change **in my environment**/small test set, YMMV! The extra token use is a major bummer though.
Qwen3-235B-A22B
Qwen3-1.7B
Tested **Qwen3-235B-A22B** (API fp8) & **Qwen3-1.7B** (local bf16):

**1.7B:**
* `/no_think` is nothing special, performs as expected, a tiny model for the GPU-starved
* default/thinking performs around a decent 7B model, quite usable for easy tasks
* not my use case, but probably the best option besides small Gemma models if you cannot fit the 4B

---

**Qwen3-235B-A22B:**
* `/no_think` verbosity at x1.57, yet only 16% of default mode
* performed slightly worse in Reasoning and STEM subjects
* around Llama 3.3 70B level, but a better coder

**thinking** (*default mode*):
* 616% token usage compared to non-thinking mode
* very capable model across the board, almost DeepSeek-R1 level, beating out Llama 405B
* impressive STEM performance (not just math but other STEM subjects, too)
* extremely cost efficient, decent vibes, **TOP** model
* the trend of thought-chains aiding sensitive topics continued with these 2 models, too

**YMMV!**
Phi-4-reasoning plus
Tested **Phi-4-reasoning plus** (*14B, local, Q8_0*):

* In terms of overall capability, roughly on par with Qwen3-32B.
* I already thought QwQ was unusable for general use, but this one takes the cake in terms of sheer token verbosity: by far the most verbose model I have ever tested, almost **20x the token usage** of a traditional model.
* Quite dry and soulless responses overall
* Not a model for general use, clearly optimized for benchmarks (math; note that my STEM includes non-math topics)
* OK model to run a few tests or benchmarks, but the insane inference requirements are not reasonable for general use

As always, just my own testing, **YMMV**!
Qwen3
Tested **Qwen3** (4B, 8B, 14B, 32B, 30B-A3B):

**non-thinking** (`enable_thinking=False`, `/no_think`; toggle sketched below):
* still relatively verbose compared to a traditional non-thinking model (~45% more token usage)
* very good all-rounders for size, mostly best-in-slot for their sizes/non-thought
* good overall utility for a variety of tasks, not recommended for precise maths or programming
* more prone to flat-out refuse requests

**thinking** (*default mode*, `enable_thinking=True`, `/think`):
* very verbose, but not extremely so (x7.85 token usage puts it among the more verbose tested, but not as extreme as QwQ and o3-mini-high)
* huge gains in math (in particular rounding), as well as coding
* less prone to flat-out refuse requests; thought-chains were beneficial in censorship testing
* extremely performant overall, dominating my sub-49B rankings
* in fact, 4B and the MoE's 3.3B active were so performant for size (usually tiny models struggle at my test suite) that I suspected test leakage and ran multiple re-tests

All models were tested locally; rough inference speeds on my 4090/24GB VRAM:

* Qwen3-30B-A3B Q4_K_M: **130** tok/s
* Qwen3-4B bf16: **83** tok/s
* Qwen3-14B Q8_0: **50** tok/s
* Qwen3-8B bf16: **50** tok/s
* Qwen3-32B Q4_K_M: **28** tok/s

30B-A3B (***insane speed!***⚡️) will definitely be utilized as a daily driver by me for all types of random non-crucial tasks. This was just in my tested use cases, as always, **YMMV!**
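For anyone wanting to reproduce the two modes locally, here is a minimal sketch of how the thinking toggle is exposed through the chat template, based on my reading of the Qwen3 model card (treat the exact flag names and model id as things to verify there):

```python
# Minimal local sketch of Qwen3's thinking toggle via the chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is 2.675 rounded to two decimal places?"}]

# Thinking (default): the template makes the model emit a <think>...</think> block first.
prompt_think = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)

# Non-thinking: enable_thinking=False (the /think and /no_think soft switches in the
# user message toggle the same behaviour per turn).
prompt_fast = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)

inputs = tok(prompt_fast, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```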
GLM-4-32B-0414
GLM-Z1-32B-0414
Tested **GLM-4-32B-0414** & **GLM-Z1-32B-0414** (local Q4_K_M):

**GLM-4-32B-0414**
* non-thinking model, still fairly verbose (x1.62 tokens in my testing)
* good overall utility for a variety of tasks
* similar overall capability to Qwen2.5 32B
* more competition in this non-reasoning size segment (e.g. Mistral Small 3.1, Gemma 3 27B)

**GLM-Z1-32B-0414**
* reasoning model, requires a large context window (**minimum** 16k in my testing)
* nowhere near as verbose as QwQ (x6 instead of x10 token usage), thus higher general usability
* capability didn't quite reach QwQ level, but overall 2nd best for models under 49B
* I noticed syntax errors in less popular languages (e.g. Swift)
* I prefer it over QwQ simply because of its less excessive token spam

This is just my testing and my use cases. As always, **YMMV!**
Gemini 2.5 Flash Preview 04-17
Tested **Gemini 2.5 Flash Preview 04-17**:

* Fairly verbose, fast, cheap model that is competent in all tested areas.
* Improvements over 2.0 Flash, except in my coding tasks, where it did slightly worse
* Around GPT-4.1 level overall

**Thinking**:
* Increased output base price ($0.6 > $3.5), combined with ~3.42x token usage (74.3% reasoning tokens), leads to a **much** higher inference price, overall almost 20x that of non-thinking (rough arithmetic below).
* Biggest improvements were seen in reasoning, analytical conclusions, and coding
* Counterintuitively, it did consistently worse with thinking on my STEM tasks
* Around DeepSeek-R1 & Grok-3 level overall

Due to some inconsistencies observed during testing, I reran my benchmark several times on the Thinking variant. While it is overall far stronger than non-thinking (and far more expensive), it also produced less consistent results than non-thinking in some areas. As always, **YMMV!**
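For transparency, here is roughly how the "almost 20x" figure falls out of the numbers above (output side only; input tokens, which are priced the same in both modes, dilute the real-world multiplier somewhat):

```python
# Output-side cost multiplier of 2.5 Flash Thinking vs. non-thinking, using the figures above.
price_ratio = 3.50 / 0.60    # output $/mTok: thinking vs. non-thinking
token_ratio = 3.42           # ~3.42x output tokens (74.3% of them reasoning)
print(round(price_ratio * token_ratio, 1))   # -> ~19.9, i.e. almost 20x
```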
o4-mini
o4-mini-high
Tested **o4-mini** & **o4-mini-high**:

**o4-mini:**
* Quite concise for a long-CoT reasoning model (only ~3.2x token verbosity compared to a traditional model).
* Real inference cost was almost identical to 3.7 Sonnet (non-thinking).
* Performance was roughly in line with o3-mini-high.

**o4-mini-high:**
* Roughly 156% more thinking tokens, which translates to roughly 2x the inference cost & delay.
* Comparatively minuscule improvements, in certain areas (very hard code & reasoning).
* Not universally better in every scenario, even when disregarding the cost increase.
* Roughly on par with Grok-3 (non-thinking).

Overall, in my environment, this model feels like a small upgrade to o3-mini in some scenarios. The effective cost is a bit lower, which is an upside. Not too impressive in my testing, but as always, depending on your own use case: **YMMV!**
Granite-3.3-8B-Instruct
Tested **Granite-3.3-8B-Instruct** (f16): Actually did a bit worse overall than the Granite 3.0 8B Instruct (Q8) [I tested 6 months ago.](https://discord.com/channels/1110598183144399058/1111649100518133842/1297985907546259678) Not the absolute worst, but just utterly uninteresting and beaten by a plethora of other models in the same size segment in pretty much all tested fields.
GPT-4.1
Tested the **GPT-4.1** series:

**GPT-4.1 Nano:** Cheap tiny model, roughly comparable to Qwen2.5 14B. Substantially beaten on price & performance by e.g. Google's Flash models.

**GPT-4.1 Mini:** Versatile fast model, roughly comparable to Gemini 2.0 Flash (but more expensive). Quite a solid coder, and performed on par with the larger model in my STEM segment.

**GPT-4.1:** "Flagship" of the series, roughly as strong as Llama 3.3 70B (but weaker STEM) & DeepSeek V3 0324 (but a weaker coder). Behind 7 other OpenAI models in my testing. The "Maverick"-type model of OpenAI.

All models are non-reasoning models and not very verbose when compared to other recent model releases (1.15x / 1.23x / 1.35x token verbosity as size increases in testing). All models, including Nano, are fairly competent coders, though none excel at my backend testing. None of these were particularly good in my STEM segment.

I have also added 0-shot examples for UI impressions and simplistic game design for each model on my [shared assets](https://dubesor.de/assets/shared/) (**NOT** part of any scoring, just for additional curiosity/comparison). As always, **YMMV**!
Llama-3.1-Nemotron-Nano-8B-v1
Tested **Llama-3.1-Nemotron-Nano-8B-v1** (*bf16*):

This model has **2 modes**: the reasoning mode (enabled by using `detailed thinking on` in the system prompt) and the default mode (`detailed thinking off`); a usage sketch follows below.

**Default behaviour:**
* Despite not officially `<think>`ing, about 2x as verbose as the base model
* Weak performance across the board, terrible instruction following/prompt adherence
* About the same capability as a 3B model, with added verbosity

**Reasoning mode:**
* Not always `<think>`ing, despite system instructions as per the NVIDIA documentation
* Minor improvements in logic, some improvements in STEM-related tasks
* Terrible instruction following/prompt adherence. Low utility

Both variants perform significantly below base Llama 3.1 8B and have far less general utility. Very poor model imo. But as always: **YMMV!**
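The mode switch really is just those two system-prompt strings. A hedged sketch of how I would drive it against a local OpenAI-compatible server (the endpoint and model name here are placeholders, not NVIDIA's official recipe):

```python
# Toggling Nemotron's reasoning via the documented system-prompt switch,
# against any OpenAI-compatible endpoint (e.g. a local vLLM or llama.cpp server).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def ask(question: str, detailed_thinking: bool) -> str:
    system = "detailed thinking on" if detailed_thinking else "detailed thinking off"
    resp = client.chat.completions.create(
        model="nvidia/Llama-3.1-Nemotron-Nano-8B-v1",
        messages=[
            {"role": "system", "content": system},   # the entire mode switch
            {"role": "user", "content": question},
        ],
    )
    # In reasoning mode the reply should start with a <think>...</think> block
    # (though, as noted above, it does not always do so).
    return resp.choices[0].message.content

print(ask("Which is larger, 9.11 or 9.9?", detailed_thinking=True))
```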
Grok-3 mini
Tested **Grok-3 mini**:

**default reasoning**:
* Near-identical token use to o3-mini *(medium)*, 132% more token use than the non-thinking Grok-3
* Good performance in all tested areas, around o3-mini level, not far behind Grok-3
* Better instruction following than Grok-3
* Better price/performance for most tasks than Grok-3

**high reasoning** (request sketch below):
* 65% more token use than default reasoning (which is labeled "low", but I would say it is more akin to "medium" reasoning)
* Same overall smartness, but gains stability in math and instruction following
* Not recommended for areas outside of the above, as I saw certain tasks even produce worse results for a higher price (e.g. some C++ issues not present on default thinking)

Also retested & updated the current Grok-3 due to observed deviations since 2 months ago; it scored slightly higher (+~1.5%). As always: **YMMV!**
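For reference, the effort level is selected per request through xAI's OpenAI-compatible API; a sketch along these lines (treat the exact parameter name and accepted values as assumptions to double-check in xAI's docs):

```python
# Requesting Grok-3 mini's higher reasoning effort via xAI's OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="https://api.x.ai/v1", api_key="XAI_API_KEY")

resp = client.chat.completions.create(
    model="grok-3-mini",
    reasoning_effort="high",   # the default tier is what I call "default reasoning" above
    messages=[{"role": "user", "content": "Integrate x*e^x dx and verify by differentiating."}],
)
print(resp.choices[0].message.content)
```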
Llama 4 Scout & Llama 4 Maverick
Tested Meta's new **Llama 4 Scout** & **Llama 4 Maverick**:

**Llama 4 Scout** (109B MoE):
* Not a reasoning model, but quite yappy (x1.57 token verbosity compared to traditional models)
* "Small" multipurpose model, performs okay in most areas, around **Qwen2.5-32B** / **Mistral Small 3 24B** capability
* Utterly useless at producing anything code-related
* Price/performance (at current offerings) is okay, but not too enticing when compared to stronger models such as Gemini 2.0 Flash

**Llama 4 Maverick** (402B MoE):
* Smarter, more concise model.
* Weaker than Llama 3.1 405B; performed decently in all areas, exceptional in none, around **Llama 3.3 70B** / **DeepSeek V3** capability.
* Workable but fairly unimpressive coding results, archaic frontend.

The shift to MoE means most people won't be able to run these on their local machines, which is a big personal downside. Overall, I am not too impressed by their performance and won't be utilizing them, but as always: **YMMV!**
Gemini 2.5 Pro Experimental 03-25
Tested **Gemini 2.5 Pro Experimental 03-25**:

**Average-verbose** reasoning model with around 5.4x the token use of a traditional model, clocking in around DeepSeek-R1 level token usage. Far less verbose than models such as o3-mini-high or Sonnet Thinking.

* **#1** in my Reasoning/Logic segment, surpassing GPT-4.5 Preview
* **#1** in my Code segment, surpassing GPT-4.5 Preview
* STEM and math were **competent**, but nowhere near the top, in my testing
* Overall utility for miscellaneous casual tasks was **fine**, but not outstanding

I really enjoyed testing this model. It's very capable, but still shows flaws in certain areas. As always: **YMMV!**
DeepSeek V3 0324
Tested **DeepSeek V3 0324**:

* More verbose than the previous V3 model; lengthier CoT-type responses resulted in total token verbosity of **+31.8%**
* Slightly smarter overall. Better coder. The most noticeable difference was a **hugely better frontend** and UI-related coding tasks

This was merely in my own testing, as always: **YMMV!**

Example frontend showcase comparisons (**identical** prompt, identical settings, 0-shot - **NOT** part of my benchmark testing):

* [CSS Demo page DeepSeek V3](https://dubesor.de/assets/shared/UIcompare/deepseek3UI.html) | [CSS Demo page DeepSeek V3 0324](https://dubesor.de/assets/shared/UIcompare/deepseek3%200324UI.html)
* [Steins;Gate Terminal DeepSeek V3](https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/DeepSeek%20V3.html) | [Steins;Gate Terminal DeepSeek V3 0324](https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/DeepSeek%20V3%200324.html)
* [Benchtable DeepSeek V3](https://dubesor.de/assets/shared/LLMBenchtableMockup/DeepSeek%20V3%200.04%20cents.html) | [Benchtable DeepSeek V3 0324](https://dubesor.de/assets/shared/LLMBenchtableMockup/DeepSeek%20V3%200324%200.07%20cents.html)
* [Mushroom platformer DeepSeek V3](https://dubesor.de/assets/shared/MushroomPlatformer/DeepSeek%20V3.html) | [Mushroom platformer DeepSeek V3 0324](https://dubesor.de/assets/shared/MushroomPlatformer/DeepSeek%20V3%200324.html)
EXAONE Deep 32B
Tested **EXAONE Deep 32B** (*local, Q4_K_M*): Yet another long-CoT reasoner. Stumbles around with its thoughts and delivers unimpressive results, even when compared to non-reasoning models less than half its size. Was utterly useless in anything code-related. This one is very lame and weak imho; there are at least a dozen far better options at that size. As always: **YMMV!**
Llama-3.3-Nemotron-Super-49B-v1
Tested **Llama-3.3-Nemotron-Super-49B-v1** (*local, Q4_K_M*):

This model has **2 modes**: the reasoning mode (enabled by using `detailed thinking on` in the system prompt) and the default mode (`detailed thinking off`).

**Default behaviour:**
* Despite not officially `<think>`ing, can be quite verbose, using about 92% more tokens than a traditional model.
* Strong performance in reasoning, solid in STEM and coding tasks.
* Showed some weaknesses in my Utility segment; produced some flawed outputs when it came to precise instruction following
* Overall capability very high for its size (**49B**), about on par with Llama 3.3 **70B**. The size slots nicely into 32GB or above (e.g. 5090).

**Reasoning mode:**
* Produced about **167% more tokens** than the non-reasoning counterpart.
* Counterintuitively, scored slightly lower on my reasoning segment. Partially caused by **overthinking**, or a higher likelihood of landing on creative (but ultimately false) solutions. There were also instances where it reasoned about important details but failed to address them in its final reply.
* **Improvements** were seen in **STEM** (particularly math) and higher-precision instruction following.

This has been 3 days of local testing, with many side-by-side comparisons between the 2 modes. While the reasoning mode received a slight edge overall in terms of total weighted scoring, the default mode is far more feasible when it comes to token efficiency and thus general usability.

Overall, a very good model for its size; I wasn't too impressed by its 'detailed thinking', but as always: **YMMV!**
Olmo 2 32B
Tested **Olmo 2 32B Instruct** (API/bf16):

* Performs around a modern 10B model
* Okay for general questions, but rather weak in any specialized field (math, code, etc.)
* Quite vanilla/sterile

This model's size/performance is quite poor overall. Outclassed by models such as Nemo 12B and Phi-4 14B. Subjective vibe check not passed (not rated). Uninteresting model imho, but **YMMV**.
Mistral Small 3.1
Tested **Mistral Small 3.1** (API):

* Not much to say; it's pretty much **identical** to Mistral Small 3 (within margin of error & minute precision/quantization differences)
* You get multi-modality. I found no underlying text-capability differences.
Jamba 1.6
Tested **Jamba 1.6** (Mini & Large):

* Literally worse than the 1.5 models I tested 7 months ago.
* The models cannot even produce a simplistic table!
* They are completely coherent, but unintelligent and feel ancient.

The "Large" model gets beaten by local ~15B models in terms of raw capability, and the pricing is completely outdated. The Mini model performed slightly above Ministral 3B. These models are very bad imho. As always: YMMV!
Reka Flash 3
Tested **Reka Flash 3** (21B, Q8): This one is yet another long-CoT reasoning model (~5.32x token verbosity compared to a traditional model). It did decently in my coding segment (don't use this for frontend web design though! it looks terrible). It has low general utility due to extreme verbosity and subpar instruction following. In other categories, it performed okay-ish for its size. Outclassed by models such as Mistral Small 3, Gemma 3 12B, and Phi-4 14B in most scenarios. As always: YMMV!
Command A
Tested **Command A** (03-2025):

* Significant upgrade over Command R+ 08-2024
* Feels a bit dated for its size (111B) when compared to models such as Llama 3.3 70B
* Surprising performance in my tech and code segments, where it delivered consistently good results
* Less censored than most other models, easy to steer

As for their marketing claims about being on par with or better than 4o and DeepSeek V3: certainly not for general use, but it did perform on par in my coding segments. This is obviously a model geared at enterprise, RAG, and agentic work, but it will still be useful for risky writing and similar creative work. As always: YMMV!
Gemma 3
Tested **Gemma 3** (*local, Q5, Q8, bf16*):

**27B**: Better in STEM, particularly math; poor coder; disappointing reasoning compared to Gemma 2 & 12B

**12B**: Slightly better in almost everything compared to Gemma 2 9B, equivalent performance to 27B in many tasks

**4B**: Comparable to Gemma 2 2B; found it less versatile but a tiny bit smarter in certain cases

Family as a whole: Hard **refusals** have been significantly reduced. You now have to live with large segments containing legal and warning disclaimers though.

Multi-modality & image inputs: My testing does NOT test any multimodal functionality, so do keep an eye on benchmarks that do.

For my use case, as someone who barely ever requires image input, these models are a bit disappointing in terms of raw text capability. But, as always, **YMMV**!
R1 1776
Ran a full retest of **R1 1776**, after Perplexity claimed to have fixed their implementation:

* Higher quality chain of thoughts, in particular in long context; fixed degradation
* Thus, gains in all tested areas compared to the initial implementation
* Still falls short when compared to DeepSeek-R1
* Core model remains identical, with the same issues, such as still-censored Chinese areas and propaganda

tl;dr: Recent fixes improved the thought chains and thus the outcomes significantly, but it doesn't quite reach R1 level in my testing. As always, **YMMV!**
QwQ-32B
Tested **QwQ-32B** (*local, Q4_K_M*):

* best in its size class, except for coding
* extremely verbose (avg. ~10x output tokens compared to a traditional model, more verbose than any other long-CoT model I have ever tested)
* more effective thought chains than the R1-distill version of Qwen2.5-32B
* terrible at all webdesign tests I threw at it
* smartest sub-70B, by brute force token chains

This is a smart model, but for me the extreme verbosity and inference required exclude it from becoming a daily driver. The good outcomes feel brute-forced with CoT, and the verbosity is borderline ridiculous. Good for complex STEM-related subjects or reasoning tasks. Not useful for coding. As always, **YMMV**!
GPT-4.5 Preview
Tested **GPT-4.5 Preview**:

* Very **expensive** model, obviously, with the highest raw price yet, but actually a bit cheaper than o1 if you account for hidden thought-chains. The model is also fairly concise, with reply lengths slightly below the median of non-thinking models.
* **Highest common sense** of all models I have ever tested (~130+)
* STEM, coding, and other professional tasks were good, but not super impressive. Attention to detail (haystack tests, bug spotting) was very good, though.
* Vibe, style, etc. I do not specifically test for, but I found the model to be fairly standard, at least in my collected queries.

While the model is advertised as being good at conversation, creativity and natural dialogue, I don't see how casual conversations with this model are a feasible use case, considering the outrageous price. I will personally use it as an agent for decision making that requires common sense (e.g. a judge or critical analyst). As always - **YMMV**!
Claude 3.7 Sonnet Thinking
Tested Claude **3.7 Sonnet Thinking** (budget 16k, though the max it ever used was 9k; request sketch below):

* Overall output usage was ~7.44x compared to normal (significantly more expensive).
* In the **vast majority** of cases, final outputs were of identical quality compared to non-thinking.
* In certain reasoning and creative tasks it performed consistently worse than non-thinking (e.g. due to overthinking, or pondering reasons not to adhere to the user query)
* In rare specific queries (most consistently in hard code and hard math), it performed slightly better.

I know it feels counterintuitive that it can perform below non-thinking on e.g. Reasoning, but I have retested all differing results multiple times, and the differences were reproducible & consistent.

For my use case, the thinking mode will remain deactivated 99% of the time, unless I have a very specific issue that non-thinking cannot solve; then it might be worth giving a thinking budget a try. For the average user, I doubt that using it is wise considering cost-effectiveness. However, as always, just my own testing. **YMMV**!
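For reference, the 16k budget above is set per request; a minimal sketch via the Anthropic SDK (parameter shapes as I understand their extended-thinking docs, so verify there before relying on it):

```python
# Enabling 3.7 Sonnet's extended thinking with a 16k token budget.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

resp = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=20_000,                                    # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 16_000},
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
)

# The response interleaves "thinking" blocks with the usual "text" blocks.
for block in resp.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```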
Claude 3.7 Sonnet
Tested **Claude 3.7 Sonnet** (non-thinking, claude-3-7-sonnet-20250219):

Smarter & better overall; the biggest improvement imho was its far less aggressive nanny-behaviour (still not uncensored, but a big improvement!). Its frontend dev skills (which were already great) were taken up a notch and produce even better results. Flaws were rather in backend and debugging. Overall, a fantastic model. I'll check out the different thinking options over the next few days *(though I have a feeling it won't lead to very cost-efficient improvements)*. As always, **YMMV**!

3 simple frontend UI comparisons between 3.5 and 3.7 (short query prompt, 0-shot) - **NOT PART OF MY TESTING**, JUST A FUN COMPARISON:

CSS DEMO:
https://dubesor.de/assets/shared/UIcompare/Sonnet3.5.1.html
https://dubesor.de/assets/shared/UIcompare/Sonnet3.7UI.html

STEINS;GATE TERMINAL:
https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/Claude%203.5%20Sonnet%20new.html
https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/Claude%203.7%20Sonnet.html

LLM BENCHTABLE MOCKUP:
https://dubesor.de/assets/shared/LLMBenchtableMockup/Claude%203.5%20Sonnet%203.1%20cents.html
https://dubesor.de/assets/shared/LLMBenchtableMockup/Claude%203.7%20Sonnet%2017.9%20cents.html
R1 1776
Tested **R1 1776** (Perplexity post-trained to remove Chinese censorship): Reasoning showed strong signs of degradation, leading to worse results in all tested areas. Math, formatting and code-related tasks were more strongly affected than pure logic tasks. Ironically, the few Chinese censorship tests I have (and have had for a long time) still produced 100% censored and propagandistic answers. Whether the degradation is due to the post-training or to how the model is implemented, I do not know. But I do know that it isn't on R1 level. As always, **YMMV**.
Grok-3
Tested the current **Grok-3**: Reasoning was similar to Grok-2 in my environment, but I saw large improvements in STEM, general utility and coding (*on a side note, UX design was hit or miss, sometimes phenomenal, sometimes poor, so a bit inconsistent*). It's still fairly uncensored, and a **very wordy** model (non-reasoning, but produces large responses, roughly 2.25x as wordy as GPT-4o-latest, which is already wordier than the average traditional model). I found it to be a little less cringe-inducing than Grok-2 (subjective, unrated). Overall, a very capable model, but not the best at any field I test. As always - **YMMV**!
chatgpt-4o-latest
Tested the current 'chatgpt-4o-latest' (timestamp 2025-02-16), and compared it to results from 4 months ago:

* about 1-4.7% better on my test set, depending on how refusals are weighted
* more prone to censor in risk topics, lower utility in risk-deemed RP
* slightly improved capability across different segments: math, logic, coding, ...
* slightly altered behaviour/styling: more emojis by default, more casual tone in certain settings
* overall, slightly better for most use cases; most capable non-thinking model other than 4-Turbo

As always, YMMV!
Qwen2.5
Tested the new **Qwen2.5** models (also updated the price, since it changed just 1 day after my testing):

* Qwen2.5-**Turbo** - cheap model, roughly equivalent to GPT-3.5 Turbo
* Qwen2.5-**Plus** - mid model, roughly equivalent to Qwen2.5-72B
* Qwen2.5-**Max** - large model, roughly equivalent to Mistral Large 2

Traditional models that are competent enough for their weight & price. Not the most interesting models to me, but as always - YMMV!
o3-mini
Tested **o3-mini** (default): Slightly improved reasoning results over o1-mini; strong coder. Did slightly worse at my STEM segment and anything deemed not safe. If you don't care about censorship or refusals, or use it solely for coding, it's slightly better than o1. Overall, for general use, not a noticeable capability upgrade. The new pricing will make it more affordable though (thought-token impact calculations still outstanding on my part). As always, YMMV.
Mistral-Small-24B-Instruct-2501
Tested **Mistral-Small-24B-Instruct-2501** aka Mistral Small 3: Saw improvement over the previous Mistral Small in most areas, except for code, where it actually performed slightly lower (retests were already done, but my segment is rather small, so do take it with a grain of salt). It ends up among the best choices for sub-50B models, alongside Qwen2.5 32B and Gemma 2 27B. Mistral always has lower inherent censorship, so it should perform well for roleplay and similar creative tasks. Very useful model size for local inference (16GB VRAM+). Good universal model overall.
Qwen2.5-Max
Tested **Qwen2.5-Max**: It's a traditional, competent but overall uninteresting model. Slightly smarter than the vastly cheaper 72B, but its pricing strategy at $10/30 mTok is severely outdated compared to competing models with similar capability, such as the 72B or DeepSeek V3. I saw minor improvements in maths and formatting, but none in core logic or coding-related tasks. This model is also quite dry and doesn't pass my vibe check; a rather weak conversationalist and bad for RP. For me, considering the price, it's a pass. As always, YMMV!
R1-Distill
Locally tested **R1-Distill-Llama-8B**, **R1-Distill-Qwen-14B**, **R1-Distill-Qwen-32B**:

32B was decent; the smaller distilled models were weaker than base. Overall I would use the non-thinkers in my own use case, as the benefit (or lack thereof) is not worth it for local compute, imho. For the smaller models, it makes them less usable, with a lack of benefit. Also, the token spam is not really desired for local use, at least in my use cases. Llama was particularly impacted, gaining minute capability in reasoning and STEM, but sacrificing almost all utility and coding. For me, after several hours of testing, the distilled models aren't really attractive. As always, YMMV.

Tested 4 distill models by now; all statements are from my testing, **ymmv**:

* **8B** - weaker than base (*ranked #107 -vs- #84*)
* **14B** - weaker than base (*ranked #80 -vs- #62*)
* **32B** - slightly better than base (*ranked #54 -vs- #59*)
* **70B** - weaker than base, due to bloated thoughts - pretty much unusable locally (*ranked #26 -vs- #18*)
DeepSeek R1-Zero
Tested **R1-Zero** (fp8): A little bit messier and less conventional than R1, less aligned/filtered. Loses out in formatting and thus coding, but is a highly capable model overall. Probably not as consumer-friendly as R1, but my testing probes mostly raw capability. As always, YMMV!
DeepSeek-R1
Tested **DeepSeek-R1**:

This model is extremely capable for a non-proprietary model, and the first to truly successfully challenge the top SOTA models (more so than DeepSeek V3). It will not be the most efficient model for every use case, as its long chain-of-thought reasoning response (which can be ignored by the user) is very verbose, with an average ratio of 4.7:1 in my testing. Some of its programming thoughts were even breaking my previous response storage method by surpassing 32k chars. So, it can be extremely verbose in its "reasoning_content" response. The final answer ("content") is fairly concise in comparison (see the sketch below).

Compared to R1-Lite-preview, it did not suffer from false refusals or Chinese output issues. It outperformed Llama 3.1 405B in almost all tasks except for general utility (roleplay, concise formatting, etc., which is to be expected). Overall a fantastic long-CoT model. Adding in the cost factor, in terms of cost effectiveness it blows o1 completely out of the water. Plus you get to actually see every token you pay for, if you desire. As always - YMMV!
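To illustrate the split mentioned above: DeepSeek's OpenAI-compatible API returns the chain of thought and the final answer as separate fields, so you can store or discard them independently. A minimal sketch (field and model names as I recall them from DeepSeek's docs; verify against the current API reference):

```python
# R1 returns the visible chain of thought ("reasoning_content") separately
# from the final answer ("content"); both count towards billed output tokens.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="DEEPSEEK_API_KEY")

resp = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "How many times does the letter r appear in 'strawberry'?"}],
)

msg = resp.choices[0].message
print("reasoning length (chars):", len(msg.reasoning_content))  # often far longer than the answer
print("final answer:", msg.content)                             # comparatively concise
```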
MiniMax-01
Tested **MiniMax-01** in my bench environment. Results were around WizardLM-2 8x22B or Llama 3.0 70B level. It was pretty mediocre in most tested fields; cost/performance was top 40%, neither expensive nor particularly cheap for its capability. There are some minor quirks with Chinese output or lack of format adherence, but not to an unusable degree. Overall a pretty meh model to me. As always - YMMV!
QVQ-72B-Preview
Tested **QVQ-72B-Preview** (bf16): Surprisingly bad model; despite being a long-CoT reasoning model, it did not impress in reasoning tasks. It did OK in STEM-related tasks, but was useless in programming and anything requiring it to follow instructions. It fails to provide full working code segments most of the time and thinks in poorly formatted snippets. It's also very censored and refused a lot of tasks unjustifiably. It placed #79 in my current environment, right next to Llama 3.1 8B and Ministral 8B, and for a 72B model that should be quite telling. It also could not deliver on vibe, style or character. I see zero use case for this model; there are far better options, and far better long chain-of-thought models, out there.
DeepSeek V3
**DeepSeek V3** - Thoroughly tested the new capability; I was fortunate to still have very recent 2.5 datasets (due to being late on 1210) for direct output comparisons. Strong STEM & code, solid instruction following and general utility, arguably minimally improved reasoning. Overall, the 3rd most capable open-source model (behind Llama 3.1 405B & Llama 3.3 70B) in my testing. As for proprietary, roughly on o1-mini level. The biggest flaw for this model in my testing is clearly its reasoning, more specifically anything in areas requiring critical thinking and applying common sense, where it blunders a lot and consistently. As always, YMMV!
GPT-4 Turbo
Decided to retest GPT-4 Turbo (OR identifier 'openai/gpt-4-turbo') after 10 months, with the most recent adjustments. It still holds up very well in comparison to most other state-of-the-art models, in fact beating them in most scenarios on pure substance. It's a bit dry and doesn't have the best formatting nor style/vibe, but it gets the job done. Compared to my testing back in February 2024, something clearly changed, though, as the model behaviour was slightly different: e.g. I ran into reproducible refusals that were definitely not present beforehand. Combined with slightly weaker (but still strong) performance, I suspect this is caused by the system prompt being changed to more convoluted legal coverage, or similar back-end alterations over time. Still, it's better in this regard than o1. It will remain my go-to for when cheaper, more efficient models don't cut it for a problem.
Gemini 1206
Gemini 2.0 Flash Experimental
Gemini 2.0 Flash Thinking Experimental
Tested the 3 recent experimental Gemini models (Gemini Experimental 1206, Gemini 2.0 Flash Experimental, Gemini 2.0 Flash Thinking Experimental). None of the 3 were major improvements in terms of total capability, but they differ slightly in behaviour. The thinking model is obviously geared towards reasoning, but introduces too much noise, which hurts its coding and instruction-following ability.
o1-2024-12-17
Tested the full o1 via API: Compared to o1-preview, it used slightly fewer invisible thought-tokens and is undoubtedly much better at STEM (particularly math). Multiple unjustified refusals tanked its utility in my cases; this model is clearly not designed for tasks such as e.g. summarization or agentic personas. I was not impressed by its coding: I required multiple reiterations and had to restate info that was already present in the task, wasting a TON of money on non-exceptional results. This model is also fairly censored and steers away from any potentially controversial subjects, even if harmless in context. In terms of overcautiousness it is more akin to Claude models than what I am used to from OpenAI.

tl;dr: Fantastic STEM capability, great reasoning, not too impressive in other areas in my testing. Unfathomably expensive, obviously, because the invisible tokens inflate the actual pricing to around $190/mTok across my testing (arithmetic sketched below).
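To make that ~$190/mTok figure less mysterious: you pay the list output rate on reasoning tokens you never see, so the effective price per visible output token scales with the hidden-to-visible ratio. A sketch using o1's published output price and an assumed ratio chosen purely for illustration (not my measured average):

```python
# Effective $/mTok of *visible* output when hidden reasoning tokens bill at the list rate.
list_output_price = 60.0   # o1's published output price, $ per million tokens
hidden_per_visible = 2.2   # assumed hidden reasoning tokens per visible output token

effective_price = list_output_price * (1 + hidden_per_visible)
print(f"~${effective_price:.0f} per million visible output tokens")  # ~$192 at this ratio
```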
Command R7B
Tested **Command R7B** (12-2024) - Around Granite 3.0 8B / Qwen2.5-7B level, with decent STEM performance, poor reasoning and terrible coding. There are stronger options in that size category (Llama 3.1, Ministral, etc.). Price/performance is OK, but again, there are much better options even for bang4buck.
Phi-4
Tested **Phi-4** (14B): it's a decent model around Nemo 12B & Qwen2.5 14B level, with decent reasoning and very good STEM capability, but lackluster code & instruction following. The default vibe is very neutral and quite sterile, as expected from a Microsoft model.
Llama 3.3 70B
Checked out Llama 3.3 70B locally (Q4_K_M): Strongest open model after Llama 3.1 405B; saw big improvements in reasoning and STEM-related tasks compared to 3.1. It did not do particularly well in my coding-related tasks, though. Due to this outlier, I had to retest that segment multiple times. Overall, a very capable general-use model.
QwQ-32B-Preview
Ran **QwQ-32B-Preview** (Q4_K_M) through my own benchmark; have to say I am disappointed, I had higher hopes. Its outputs are often annoyingly formatted (e.g. no proper distinction between the thinking/reasoning loop and the true final output), often no full code blocks but snippets despite being instructed otherwise, terrible instruction following overall, and reasoning was poorer than vanilla Qwen2.5 32B. Math/STEM got boosted by the reasoning loops in a positive way - everything else was rather poor in comparison. Vibe check failed: the style of this model annoys me, and it has a high refusal rate. It also claimed to be developed by OpenAI in my self-description collections. I ranked it #62 (between Jamba 1.5 Mini and GPT-3.5 Turbo). Was hoping for a more fun model for my 100th model anniversary, oh well.
Marco-o1
Tested Marco-o1 (fp16), and its capability was pretty much exactly what I expect from a 7B model. The thinking is a nice gimmick, but it didn't yield better results in reasoning, as the model was unable to outthink bad thinking. It did help in math-related tasks, though. For any generic utility tasks such as instruction following, summarization, etc., I found it to be borderline unusable. The model sometimes had crucial information within its thinking tags while omitting it entirely from the output outside the tags, meaning you cannot effectively filter out the tags without info loss. Fun and quirky to play around with, sure, but not groundbreaking in my testing.
DeepSeek-R1-Lite-Preview
DeepSeek-R1-Lite-Preview: Performed a bit worse than DeepSeek V2.5 overall, partially due to uncalled-for **refusals** in completely generic tasks. However, it has the highest reasoning skills of all DeepSeek models (slightly higher than o1-mini). Overall, it placed on a similar level to Grok-2 mini and Claude 3.5 Haiku in my testing. If this model is in the 50-60B range, it would be very impressive, provided they can iron out the refusal behaviour. It can be quirky fun to use due to its ramblings, which are hit or miss.
Claude 3.5 Haiku
Just checked out Claude 3.5 Haiku, very unexpected results. In my own small-scale test it showcased:

* By far the least censored Claude model (other than Claude-1); very different refusal/censor behaviour when compared to old Haiku or the Sonnets & Opus.
* Roughly 2x the capability of Claude 3 Haiku
* Did better on my small subset of code-related tasks than 3.5 Sonnet
* STEM was pretty identical
* Some flaws in utility/misc tasks (terrible roleplayer)
* Reasoning still pretty weak, but huge gains compared to the previous iteration
* Pricing is too high when competing with models such as 4o-mini or Gemini 1.5 Pro 002

Not rated, but subjective vibe check: very concise model that seems to love putting nearly everything into list format. AS ALWAYS - YMMV!
Aya Expanse
Cohere **Aya Expanse** testing has concluded for me: very weak for their size compared to the competition, not worth the storage space imo.

Aya Expanse 8B (f16) - failed pretty much everything and was around L3.2 3B capability.

Aya Expanse 32B (Q4_K_M) - weaker than even Gemma 2 9B & Nemo 12B in my testing. It would be OK as like a 12B model due to being fairly uncensored. Gets absolutely stomped by Qwen2.5.
Claude 3.5 Sonnet 20241022
Tested the new 3.5 Sonnet. After all is done and accounted for, it jumped ranks from #15 > #7 with slightly less prudishness (still much higher than the competition). I saw massive gains in tasks labeled for Reasoning (suspiciously high gains, I need to investigate this further). A slight dip in prompt adherence and code. I scrutinized and retested all tech-related coding tasks a total of 6 times, ended up running 18 queries PER TASK in that particular label to exclude any random outliers. The results were consistently delivering the same outcome, though. Good improvements as a whole.
Granite-3.0-8B-Instruct
Granite-3.0-8B-Instruct (Q8_0). Not terrible, not great. While my bench is too hard for small models, and doesn't catch the minute differences for them, it still gives a rough expected performance ballpark. But if I add to that the vibe check (not tested nor depicted) - utterly uninteresting model, won't stay on my drive.
Yi-Lightning
My 2 cents on the Yi-Lightning models: Yi-Lightning tested around Llama 3.1 70B level, and the Lite model around Llama 3.0 70B, for me. Reasoning-labeled tasks were pretty dead even between them; not their strong suit. Pretty good at STEM & maths, and better at following instructions than Qwen models. Fairly uncensored. Competent, but did not reach a top spot, unlike in, say, the current arena ranking. It has good style though, so that would gain a fair amount of votes.
Ministral
I was quite impressed by Ministral 3B; 8B, on the other hand, was a barely noticeable improvement in the vast majority of cases. Here are some neighboring performers:

Ministral 8B =~ Llama 3.1 8B
Ministral 3B =~ Llama 3.0 8B

Sucks that the 3B model is not local; it would be good to run on the side. It's definitely the more interesting model here, but usage is so limited by this. As always, YMMV!
Inflection 3
Tested the Inflection 3 models: the Productivity one is better not just in performance but also in style imho. The capability range is around Gemma 2 9B (Pi) and Nemo 12B / Qwen 14B (Productivity). The models are too expensive for what they offer, ranking near the bottom of my price/performance calculations.
Llama-3.1-Nemotron-70B-Instruct
Compared Llama-3.1-Nemotron-70B-Instruct to the vanilla Llama 3.1 model and to the best-performing competitor at that size, Qwen2.5-72B. Overall a pretty substantial improvement; I saw the biggest gains in STEM-related questions. It's also a pretty consistent model and didn't really blunder anything too terribly.
Llama 3.2 11B & 90B
Tested the new 3.2 vision models (text capability), comparing them to their non-vision brethren. 90B was slightly smarter, and 11B was about even with 8B. I do not bench for vision, but tested them a little bit for myself anyway; the vision is OK for a first iteration, but not what I would personally use compared to other vision models.
Llama-3.2 1B & 3B
Gave Llama-3.2-1B and Llama-3.2-3B a spin. My testing isn't very well suited for such tiny models, as the test set is too hard for them, but I wanted to try anyway. I found Gemma 2B to be vastly superior to Llama-3.2-3B.
Llama-3.1-Nemotron-51B
Llama-3.1-Nemotron-51B tested: very impressive for its size, going toe to toe with its 70B brother, outperforming it in math but losing out on reasoning and misc tasks. Great model! As always - YMMV!
Qwen2.5-14B-Instruct
Finished testing Qwen2.5-14B-Instruct on Q8 - the best overall sub-70B local model I have tested thus far. Barely beat the former champion despite being half as big. As is the issue with most Chinese models, it's not very good at sticking to strict instructions and has general prompt-adherence issues, but other than that it's a very capable model.
Mistral-Small-Instruct-24-09
At 22B, another great sub 70B option, joining the ranks of the likes of Nemo & Gemma 27B. Decent coder, good at math, and fairly unrestricted out of the box. Kinda flopped in the logic department during my testing, so if you want something to solve riddles this isn't the right model.
o1-mini
Finally finished my o1 testing; this was a long and expensive ride. o1-mini completely wiped the floor with o1-preview in anything math-related, not even close. The rest of the differences are not big enough to justify the pricing. As always, YMMV.
o1-preview
Full o1-preview results are in from my own small-scale testing! Highest reasoning I have tested thus far (outside of unreleased models), partially embarrassing math skills, okay-ish utility (bad for RP though), not impressed by its coding, and the most censored OpenAI model I have ever tested. This was very expensive and time-consuming, due to usage caps and the fact that this time around I also had to track invisible token usage for the true mTok cost... Working on mini next; so far I like it a lot better in terms of price/performance, and it seems to waste fewer tokens on thinking. As always, YMMV!
DeepSeek V2.5
Put DeepSeek V2.5 through my benchmark. Very similar total capability to DeepSeek-V2, with improvement in math and programming but slight decrease in reasoning and prompt adherence. Very good model for the price/size. As always - YMMV!
Reflection Llama-3.1 70B
Model review of Reflection (local quant, thus not as powerful as fp16 (once the bugs are ironed out), but still very useful data for a general ballpark). In terms of Llama variations:

Reasoning: Very good, as was expected. Only beaten by 405B.
STEM: Still good for its size. Reflection can cause math to introduce additional inaccuracies.
General utility: Bad; the baked-in reflection counteracts user instructions, to the point where the reflecting part actively tries to combat user instructions.
Code: On par with L3 70B; its large thinking/reflecting segments seem to cause context-poisoning issues, lowering the end-result code quality.
Censorship: Same as L3 70B and L3.1 8B.
Command R & R+ 08-2024
New Command R models: more efficient, but minor "improvements" overall. The API introduced safety guardrails, which can technically be removed in the terminal if you access the model directly, but cannot be turned off on providers that do not allow manipulating the safety parameter. If we compare to the competition in similar size brackets, e.g. R+ to Mistral Large and R to Gemma 2 27B, the performance is underwhelming.
Command R 08-2024
Command R 08-2024 testing (Q4_K_M): slightly better than the old model but, at least in my testing, not good enough for its size compared to smaller competition. Weirdly enough, it blundered my entire code section.
Jamba 1.5
Tested the Jamba 1.5 models. They are decent-ish but pretty underwhelming for their size. Jamba 1.5 Large, with a gigantic 399B size, punches in the same league as the old L3-70B, and the Mini version is roughly equivalent to the over 5-times-smaller Gemma 2 9B.
Llama 3.1 405B Instruct
In light of https://x.com/aidan_mclau/status/1822830757137596521 I redid the entire bench on bf16. I also reran results 3 additional times if they differed between versions. The most notable difference is that I got 0 refusals this time (4 reproducible refusals on fp8), and the reasoning is higher, with minute discrepancies in math and 1 programming task. Meta changing kv heads a few days ago, and API outputs retroactively being bugged (as I noticed 2 days ago), doesn't help with the discrepancies.
Gemini Pro 1.5 experimental
Gemini Pro 1.5 experimental is quite the step up. *(in before the compliance gets nerfed after experimental phase).*
Claude-3.5-Sonnet
Obligatory 3.5 Sonnet graph; my own bench. Better than Opus in almost every tested way, except for programming and censorship tasks. Still inferior to OpenAI in reasoning and critical-thinking tasks. Passed ~57% of my tasks, with double the fail rate of GPT-4 Turbo. As always, YMMV.
Llama-3-70B-Instruct
Tested Llama 3 today. Benched around Mistral Medium level. Good reasoning; a terrible programmer though, missing every bug-hunting task and every needle-in-haystack task. But a big improvement over Llama 2 overall, and also far more lenient in terms of refusals.
claude-3-haiku-20240307
I finally had the time to run all tests through Haiku, so here are the 4 recent Claude models together. Haiku is the only cost-effective Claude model. It comes at a minuscule 1.67% of the price of Opus and performs well in STEM and small generic tasks, but sucks heavily at reasoning.
Gemini, Claude, Mistral, GPT-4 Turbo
A few other interesting findings:

* Gemini (both Pro and Ultra) are very prone to unnecessary refusal, even refusing tasks that are not even remotely questionable.
* Mistral and OpenAI models almost never refuse anything, even my tasks that are specifically designed to be risky. (Claude-1 belonged to this camp.)
* Sonnet is such a weird model. In my testing, it performed better than Opus on tasks that have extremely high difficulty (>83%), yet somehow manages to give completely moronic answers to the easiest questions: https://i.imgur.com/4VeZ5vB.png
* Out of all tested models, Claude 2.1 scored highest on prompt adherence (sticking to prompt instructions).
* Opus seems significantly better in STEM and math, but did not deliver better results in programming over Sonnet.
* GPT-4 Turbo has the highest reasoning skills, bar none.
* The best models sometimes fail to do the simplest tasks that the worst models easily do, such as ending a sentence with a specific word, or excluding certain things.
* GPT-4 Turbo was the only model that consistently gets easy-to-medium tasks correct, whereas other models sometimes fail at even the simplest of tasks.
Earlier first impressions
*Earlier first impressions between Feb and Aug 2024 are scattered across different servers & messages, with screenshots or short in nature; too time-consuming to find and copy right now.*