Dubesor LLM Benchmark table

Small-scale manual performance comparison benchmark I made for myself. This table showcases the results I recorded for various AI models across different personal tasks I encountered over time (currently 83). I use a weighted rating system and calculate the difficulty of each task by incorporating the results of all models. This weighting matters most when a model fails easy questions or passes hard ones.
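
For illustration, here is a minimal Python sketch of one way such a difficulty-weighted scheme could work. The benchmark's exact formula is not published, so the outcome mapping (Pass = 1.0, Refine = 0.5, Fail/Refusal = 0.0), the difficulty estimate, and the weighting below are assumptions, not the actual implementation.

# Illustrative sketch of a difficulty-weighted scoring scheme (assumptions only).
from typing import Dict

Results = Dict[str, Dict[str, float]]  # results[model][task] -> outcome value

def task_difficulty(results: Results, task: str) -> float:
    # Difficulty = share of recorded outcomes that did NOT solve the task.
    outcomes = [r[task] for r in results.values() if task in r]
    return 1.0 - sum(outcomes) / len(outcomes)

def weighted_score(results: Results, model: str) -> float:
    # Passing a hard task earns up to +1, failing an easy one costs up to -1,
    # so easy fails and hard passes move the score the most.
    contributions = []
    for task, outcome in results[model].items():
        d = task_difficulty(results, task)  # 0 = everyone solves it, 1 = nobody does
        contributions.append(outcome * d - (1.0 - outcome) * (1.0 - d))
    return sum(contributions) / len(contributions) if contributions else 0.0

demo: Results = {
    "model-a": {"t1": 1.0, "t2": 0.0},
    "model-b": {"t1": 1.0, "t2": 1.0},
    "model-c": {"t1": 0.0, "t2": 0.0},
}
for m in demo:
    print(m, round(weighted_score(demo, m), 3))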

NOTE THAT THIS IS JUST ME SHARING THE RESULTS FROM MY OWN SMALL-SCALE PERSONAL TESTING. YMMV! OBVIOUSLY THE SCORES ARE JUST THAT AND MIGHT NOT REFLECT YOUR OWN PERSONAL EXPERIENCES OR OTHER WELL-KNOWN BENCHMARKS.

This table currently supports: intro tooltips, dynamic sorting, searching, filtering, comparing, highlighting, and exporting.

Table columns: Model (99) | TOTAL | Pass | Refine | Fail | Refusal | $/mTok | Reason | STEM | Utility | Code | Censor




Latest additions:

Marco-o1 7B fp16 local
DeepSeek-R1-Lite-Preview
Qwen2.5-Coder-32B-Instruct Q4_K_M local
Gemini 1.5 Flash 002
Claude 3.5 Haiku
Fixed weight MAX(range) flaw
SmolLM2 1.7B Instruct Q8_0 local
Yi Coder 9B Chat Q8_0 local
Llama-3.1-Nemotron-70B Instruct HF Q4_K_M local
Aya Expanse 32B Q4_K_M local
Aya Expanse 8B f16 local
Claude 3.5 Sonnet 20241022
Yi-Lightning lite
Yi-Lightning
Granite-3.0-8B-Instruct Q8_0 local
Chatgpt-4o-latest (2024-10)
Ministral 8B
Ministral 3B
Inflection 3 Pi
Inflection 3 Productivity
Llama-3.1-Nemotron-70B Instruct
⟲ Grok 2-mini
⟲ Grok 2
Gemini 1.5 Flash-8B
⟲ Gemini 1.5 Pro 002 -$
Qwen2.5-3B-Instruct fp16 local
Llama 3.2 90B Vision Instruct
Llama 3.2 11B Vision Instruct
Llama-3.2-3B-Instruct fp16 local
Llama-3.2-1B-Instruct fp16 local
Gemini 1.5 Pro 002
Llama-3.1-Nemotron-51B
Qwen2.5-7B-Instruct Q8_0 local
Qwen2.5-72B-Instruct bf16
Qwen2.5-32B-Instruct Q4_K_M local
Qwen2.5-72B-Instruct Q4_K_M local
Qwen2.5-Coder-7B-Instruct-Q8_0 local
Qwen2.5-14B-Instruct Q8_0 local
Mistral-Small-Instruct-2409 Q6_K local
ChatGPT o1-mini
ChatGPT o1-preview
DeepSeek V2.5
Reflection Llama-3.1 70B Q4_K_M local
Codestral-22B-v0.1 Q6_K local
Grok-2 mini-2024-08-13 †
Grok-2-2024-08-13 †
Command R 08-2024 Q4_K_M local
Command R+ 08-2024
Phi-3.5-mini-instruct Q8_0 local
Jamba 1.5 Large
Jamba 1.5 Mini
Llama 3.1 405B Instruct bf16
Qwen2-7B-Instruct fp16 local
InternLM2.5 20B Q8_0 local
gemini-1.5-pro-exp-0801
Gemma 2 27B it Q5_K_M local
Gemma 2 2B it F32 local
Command-R Q4_K_M local
Athene-Llama3-70B Q4_K_M local
Mistral Large 2
Llama 3.1 405B Instruct fp8
Llama 3.1 70B Instruct
Llama 3.1 8B Instruct
mistral-nemo-12b-instruct
MythoMax-L2-13b Q8_0 local
Gemini 1.5 Flash
Gemma 2 27B API
GPT-4-0613
OpenHermes-2.5-Mistral-7B Q8_0 local
Phi-3-medium-128k-instruct Q8_0 local
GPT-4o-mini
Llama-3-70b-Instruct Q4_K_M local
Yi-1.5 34B-Chat-16K Q6_K local
GLM-4-0520
Reka Core
DeepSeek-V2 Chat
Nemotron-4 340B Instruct
DeepSeek-Coder-V2
Gemma 2 9b Q8_0_L local
Phi-3-Mini-4K-Instruct f16 local
Qwen2-72B-Instruct
Yi Large
WizardLM-2 8x22B
Gemini 1.5 Pro
Claude 3.5 Sonnet
GPT-4o
Llama-3-8b-Instruct f16 local
claude-3-opus-20240229
claude-3-sonnet-20240229
gpt2-chatbot †
Command R+
Llama-3-70b-Instruct
claude-3-haiku-20240307
llama-2-70b-chat
mistral-large-2402
Claude-1
Claude-2.1
Gemini Ultra †
Gemini Pro †
GPT-4 Turbo
Mistral Medium
Mixtral-8x7b-Instruct-v0.1
GPT-3.5 Turbo

Cost effectiveness, Performance per $
Score / API cost @ 20% input, 80% output per MTok; 50% = median (a worked sketch follows the table below)

Table columns: Model (58) | Price ($/mTok) | Performance | Cost Effectiveness

* API cost at time of testing; check current vendor pages for the most up-to-date pricing
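
For illustration, a minimal Python sketch of the performance-per-dollar calculation described above: score divided by a blended $/MTok price at a 20% input / 80% output split. Rescaling so the median model reads 50% is my reading of "50% = median" and is an assumption; the prices and scores below are hypothetical.

# Sketch of performance per dollar under the assumptions noted above.
from statistics import median
from typing import Dict

def blended_price(input_per_mtok: float, output_per_mtok: float) -> float:
    # Blended API price per million tokens at the 20/80 input/output split.
    return 0.2 * input_per_mtok + 0.8 * output_per_mtok

def cost_effectiveness(score: float, input_per_mtok: float, output_per_mtok: float) -> float:
    # Raw performance per dollar: benchmark score per blended $/MTok.
    return score / blended_price(input_per_mtok, output_per_mtok)

def normalise_to_median(raw: Dict[str, float]) -> Dict[str, float]:
    # Rescale so the median model lands at 50% (assumed interpretation).
    m = median(raw.values())
    return {name: 50.0 * value / m for name, value in raw.items()}

# Hypothetical prices and scores, not the benchmark's actual numbers.
raw = {
    "model-x": cost_effectiveness(70.0, 3.00, 15.00),
    "model-y": cost_effectiveness(55.0, 0.15, 0.60),
    "model-z": cost_effectiveness(62.0, 1.00, 3.00),
}
print(normalise_to_median(raw))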
