Small-scale manual performance comparison benchmark I made for myself. This table showcases the results I recorded for various AI models across different personal tasks I encountered over time (currently 83). I use a weighted rating system and calculate the difficulty of each task by incorporating the results of all models. This matters most in scoring when a model fails an easy question or passes a hard one.
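To illustrate the general idea, here is a minimal sketch of difficulty-weighted scoring (the exact formula used here is not published; all names and weights below are illustrative assumptions):

```python
# Illustrative sketch of difficulty-weighted scoring. The actual formula
# behind the table is not published; weights and names are assumptions.

def task_difficulty(results_per_model: list[bool]) -> float:
    """Difficulty = share of models that failed the task (0 = trivial, 1 = very hard)."""
    failures = sum(1 for passed in results_per_model if not passed)
    return failures / len(results_per_model)

def weighted_score(model_results: list[bool], difficulties: list[float]) -> float:
    """Credit passes proportionally to difficulty; penalize fails on easy tasks."""
    score = 0.0
    for passed, difficulty in zip(model_results, difficulties):
        if passed:
            score += difficulty        # passing a hard task counts for more
        else:
            score -= (1 - difficulty)  # failing an easy task costs more
    return score

# Example: 3 tasks, results of 4 models (True = pass)
all_results = [
    [True, True, True, False],    # easy task: most models pass
    [True, False, False, False],  # hard task
    [False, False, False, False], # hardest task
]
difficulties = [task_difficulty(r) for r in all_results]
model_a = [r[0] for r in all_results]  # first model's result on each task
print(weighted_score(model_a, difficulties))
```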
NOTE THAT THIS IS JUST ME SHARING THE RESULTS FROM MY OWN SMALL-SCALE PERSONAL TESTING. YMMV! OBVIOUSLY THE SCORES ARE JUST THAT AND MIGHT NOT REFLECT YOUR OWN PERSONAL EXPERIENCES OR OTHER WELL-KNOWN BENCHMARKS.
This table currently supports: intro tooltips • dynamic sorting • searching • filtering • comparing • highlighting • exporting.
No, during the day I work an ordinary office job for the German government, and in my free time I like to work with data and numbers. That ranges from small GitHub projects to game wiki pages, data tables, Steam guides, or, as in this case, benchmarks.
See the introduction message and info tooltips. I don't have a horse in the race; I just post my results, even if they conflict with other findings. At most, my results should be used in addition to everything else, not as a substitute. I will happily correct mistakes if I encounter them, but I am not willing to adjust or fudge any results so they fall in line with a popular vote. The category scores are just a visualization mechanism based on broad labeling that I did afterward. They might shift as I add task-labeling precision.
For transparency, I name the quantization I used for local testing. I mainly use Ollama and LM Studio. On my 4090 (24GB VRAM) I found the sweet spot for large models (70B) to be Q4_K_M. With partial offloading, this gives me bearable speed (~2.5 tokens/sec) while retaining good precision. Any lower and the output quality takes too much of a hit; any higher and the speed loss is not worth it. For smaller models, I use the highest-quality quantization that fits on my GPU. Quantization slightly alters the behavior of the model, which counterintuitively might even occasionally lead to better responses for certain queries. Here are some examples of quantization comparisons.
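As a rough illustration of why Q4_K_M lands in that sweet spot on 24GB, here is a back-of-the-envelope sketch; the bits-per-weight values are approximate figures for llama.cpp-style quants, not exact GGUF file sizes:

```python
# Back-of-the-envelope VRAM estimate for quantized model weights.
# Bits-per-weight values are rough approximations, not exact file sizes.

QUANT_BITS = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

def approx_size_gb(params_billions: float, quant: str) -> float:
    """Approximate weight footprint in GB (excludes KV cache and overhead)."""
    bits_per_weight = QUANT_BITS[quant]
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for quant in QUANT_BITS:
    size = approx_size_gb(70, quant)
    fits = "fits fully in 24GB" if size <= 24 else "needs partial CPU offload"
    print(f"70B @ {quant}: ~{size:.0f} GB -> {fits}")
```

Even at Q4_K_M, a 70B model's weights alone (~42 GB) exceed 24GB of VRAM, which is why partial offloading to system RAM is needed and why throughput drops to a few tokens per second.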
If there's a voluntary censor/filter toggle, I turn it off. Other than that, since I want to capture the vanilla experience, I don't change ANY optional parameter values; everything stays on the recommended defaults. (Not all providers/platforms allow changing every parameter anyway.) I don't use custom system prompts, nor do I aid models with any custom jailbreaks etc. While I do use other languages for specific tasks, 95% of the queries are in English.
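In API terms, "defaults only" means sending nothing but the required fields, so the provider's recommended values for temperature, top_p, etc. apply. A minimal sketch using the OpenAI Python SDK as one example (the model name and prompt are placeholders; the same principle applies on any platform):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# No system prompt, no temperature/top_p/max_tokens overrides:
# only required fields, so the provider's defaults apply.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "..."}],
)
print(response.choices[0].message.content)
```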
It's one of the model versions of the same architecture that GPT-4o is based on (according to @LiamFedus). I tested it at the end of April 2024 on the LMSYS arena. It seemed to use CoT by default and thus performed better at reasoning tasks, at the cost of prompt adherence. Sometimes new models are tested on that platform in different versions but end up not being released in that form (due to compute efficiency, cost, or otherwise).
I only retest if I notice stark discrepancies from expected behavior, to verify my initial testing wasn't flawed. Other than that - no. Since it takes me ~2+ hours per model just for the raw testing, I generally test a model once, usually shortly after it has been publicly announced and shipped. Models get constantly tweaked, updated, changed, nerfed, etc., and it's neither feasible nor practical to retest everything every time in a benchmark that consists of manual review. If a major update gets announced and released, I would add the results as a separate entry.
No, I am not going to share my exact prompts, as that would inevitably cause them to leak into training sets more quickly and render them useless as a test tool. Also, the vast majority are based on real-life problems that I encountered over time.
That's usually what happens when someone bad at web design stitches together suggestions from ~7 different AI models.
If it's not answered in the introduction, tooltips, info popups, or FAQ, you can shoot me a message on Reddit (dubesor86), Steam (dubesor), or Discord (dubesor#9671). I might not always respond to stupid questions, or when I am busy with life or work. edit: or use this fancy contact form: