Small-scale manual performance comparison benchmark I made for myself. This table showcases the results I recorded for various AI models across different personal tasks I encountered over time (currently 83). I use a weighted rating system and calculate the difficulty of each task by incorporating the results of all models. This matters most in scoring when a model fails an easy question or passes a hard one.
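To illustrate the general idea, here is a minimal sketch of difficulty-weighted scoring (the exact formula used here is not published; all names and weights below are illustrative assumptions):

```python
# Illustrative sketch of difficulty-weighted scoring. The actual formula
# behind the table is not published; weights and names are assumptions.

def task_difficulty(results_per_model: list[bool]) -> float:
    """Difficulty = share of models that failed the task (0 = trivial, 1 = very hard)."""
    failures = sum(1 for passed in results_per_model if not passed)
    return failures / len(results_per_model)

def weighted_score(model_results: list[bool], difficulties: list[float]) -> float:
    """Credit passes proportionally to difficulty; penalize fails on easy tasks."""
    score = 0.0
    for passed, difficulty in zip(model_results, difficulties):
        if passed:
            score += difficulty        # passing a hard task counts for more
        else:
            score -= (1 - difficulty)  # failing an easy task costs more
    return score

# Example: 3 tasks, results of 4 models (True = pass)
all_results = [
    [True, True, True, False],    # easy task: most models pass
    [True, False, False, False],  # hard task
    [False, False, False, False], # hardest task
]
difficulties = [task_difficulty(r) for r in all_results]
model_a = [r[0] for r in all_results]  # first model's result on each task
print(weighted_score(model_a, difficulties))
```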
NOTE THAT THIS IS JUST ME SHARING THE RESULTS FROM MY OWN SMALL-SCALE PERSONAL TESTING. YMMV! OBVIOUSLY THE SCORES ARE JUST THAT AND MIGHT NOT REFLECT YOUR OWN PERSONAL EXPERIENCES OR OTHER WELL-KNOWN BENCHMARKS.
This table currently supports: intro tooltips • dynamic sorting • searching • filtering • comparing • highlighting • exporting.
No, during the day I work an ordinary office job for the German government, and in my free time I like to work with data and numbers. That ranges from small GitHub projects to game wiki pages, data tables, Steam guides, or, as in this case, benchmarks.
See the introduction message and info tooltips. I don't have a horse in the race; I just post my results, even if they conflict with other findings. At most, my results should be used in addition to everything else, not as a substitute. I will happily correct mistakes if I encounter them, but I am not willing to adjust or fudge any results so they fall in line with a popular vote. The category scores are just a visualization mechanism based on broad labeling that I did afterward. They might shift as I add task-labeling precision.
For transparency, I name the quantization I used for local testing. I mainly use Ollama and LM Studio. On my 4090 (24GB VRAM) I found the sweet spot for large models (70B) to be Q4_K_M. With partial offloading, this gives me bearable speed (~2.5 tokens/sec) while retaining good precision. Any lower and the output quality takes too much of a hit; any higher and the speed loss is not worth it. For smaller models, I use the highest-quality quantization that fits on my GPU. Quantization slightly alters the behavior of the model, which counterintuitively might even occasionally lead to better responses for certain queries. Here are some examples of quantization comparisons.
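As a rough illustration of why Q4_K_M lands in that sweet spot on 24GB, here is a back-of-the-envelope sketch; the bits-per-weight values are approximate figures for llama.cpp-style quants, not exact GGUF file sizes:

```python
# Back-of-the-envelope VRAM estimate for quantized model weights.
# Bits-per-weight values are rough approximations, not exact file sizes.

QUANT_BITS = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

def approx_size_gb(params_billions: float, quant: str) -> float:
    """Approximate weight footprint in GB (excludes KV cache and overhead)."""
    bits_per_weight = QUANT_BITS[quant]
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for quant in QUANT_BITS:
    size = approx_size_gb(70, quant)
    fits = "fits fully in 24GB" if size <= 24 else "needs partial CPU offload"
    print(f"70B @ {quant}: ~{size:.0f} GB -> {fits}")
```

Even at Q4_K_M, a 70B model's weights alone (~42 GB) exceed 24GB of VRAM, which is why partial offloading to system RAM is needed and why throughput drops to a few tokens per second.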
If there's a voluntary censor/filter toggle, I turn it off. Other than that, since I want to capture the vanilla experience, I don't change ANY optional parameter values; everything stays on the recommended defaults. (Not all providers/platforms allow changing every parameter anyway.) I don't use custom system prompts, nor do I aid models with any custom jailbreaks etc. While I do use other languages for specific tasks, 95% of the queries are in English.
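In API terms, "defaults only" means sending nothing but the required fields, so the provider's recommended values for temperature, top_p, etc. apply. A minimal sketch using the OpenAI Python SDK as one example (the model name and prompt are placeholders; the same principle applies on any platform):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# No system prompt, no temperature/top_p/max_tokens overrides:
# only required fields, so the provider's defaults apply.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "..."}],
)
print(response.choices[0].message.content)
```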
It's one of the model versions of the same architecture that GPT-4o is based on (according to @LiamFedus). I tested it at the end of April 2024 on the LMSYS arena. It seemed to use CoT by default and thus performed better at reasoning tasks, at the cost of prompt adherence. Sometimes new models are tested on that platform in different versions but end up not being released in that form (due to compute efficiency, cost, or otherwise).
I only retest if I notice stark discrepancies from expected behavior, to verify my initial testing wasn't flawed. Other than that - no. Since it takes me ~2+ hours per model just for the raw testing, I generally test a model once, usually shortly after it has been publicly announced and shipped. Models get constantly tweaked, updated, changed, nerfed, etc., and it's neither feasible nor practical to retest everything every time in a benchmark that consists of manual review. If a major update gets announced and released, I would add the results as a separate entry.
No, I am not going to share my exact prompts, as that would inevitably cause them to leak into training sets more quickly and render them useless as a test tool. Also, the vast majority are based on real-life problems that I encountered over time.
That's usually what happens when someone bad at web design stitches together suggestions from ~7 different AI models.
If it's not answered in the introduction, tooltips, info popups, or FAQ, you can shoot me a message on Reddit (dubesor86), Steam (dubesor), or Discord (dubesor#9671). I might not always respond to stupid questions, or when I am busy with life or work. edit: or use this fancy contact form: