Annotations (3)
“There are some tasks where a better model produces better, more accurate results, but other tasks where there's no such thing as a 'better' result and no such thing as 'more accurate', only right or wrong. Some questions don't have 'wrong' answers; the quality of the output is subjective and 'better' is a spectrum.”
Philosophy & Reasoning · Technology & Engineering · Strategy & Decision Making
DUR_ENDURING
Two types of evaluation: objective vs subjective
“More practically, you can try them with your own workflows. Does this model do a better job? Here, though, we run into a problem, because there are some tasks where a better model produces better, more accurate results, but other tasks where there's no such thing as a 'better' result and no such thing as 'more accurate', only right or wrong.”
Operations & Execution · Strategy & Decision Making
DUR_ENDURING
Workflow testing reveals true utility
“You can look at the benchmarks, but setting aside the debate about how meaningful these are, that doesn't tell me what I can do that I couldn't do before, or couldn't do as well. You can also keep a text file full of carefully crafted logic puzzles to try, which is really just doing your own benchmark, but again, what does that tell you?”
Strategy & Decision Making · Operations & Execution
DUR_ENDURING
Benchmarks measure performance, not utility
Frameworks (1)
Two-Category Evaluation Framework
Distinguishing Objective Accuracy from Subjective Quality
When evaluating tools, systems, or decisions, first determine whether the task has an objectively correct answer (a deterministic task) or sits on a spectrum of subjective quality (a spectrum task). Deterministic tasks call for accuracy testing; spectrum tasks call for judgment along explicit quality dimensions. Applying one type of evaluation to the other type of task produces misleading conclusions.
Components
- Classify the Task Type
- Select Appropriate Evaluation Method
Prerequisites
- Clear understanding of the task being evaluated
- Access to real-world workflows or test cases
Success Indicators
- Evaluation method matches task type
- Results inform actual decisions
- No confusion between accuracy and quality
Failure Modes
- Applying the wrong evaluation type to a task
- Over-relying on abstract benchmarks for spectrum tasks
- Treating subjective quality as objective accuracy
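The two components above (classify the task, then select the matching evaluation method) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Case`, `classify`, and `evaluate` are hypothetical, the 1-5 reviewer rating scale is invented for the example, and treating "every case has a reference answer" as the test for a deterministic task is a crude stand-in for the judgment the framework actually asks for.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean
from typing import Optional


class TaskType(Enum):
    DETERMINISTIC = "deterministic"  # an objectively correct answer exists
    SPECTRUM = "spectrum"            # quality is a matter of judgment


@dataclass
class Case:
    """One test case drawn from a real workflow (hypothetical structure)."""
    output: str
    expected: Optional[str] = None  # reference answer, deterministic tasks only
    rating: Optional[int] = None    # assumed 1-5 reviewer score, spectrum tasks only


def classify(cases):
    """Step 1: assume a task is deterministic only if every case has a reference answer."""
    if all(c.expected is not None for c in cases):
        return TaskType.DETERMINISTIC
    return TaskType.SPECTRUM


def evaluate(cases):
    """Step 2: apply the evaluation method that matches the task type."""
    task_type = classify(cases)
    if task_type is TaskType.DETERMINISTIC:
        # Accuracy testing: each output is simply right or wrong.
        hits = [1.0 if c.output == c.expected else 0.0 for c in cases]
        return task_type, "accuracy", mean(hits)
    # Quality judgment: aggregate human ratings instead of checking correctness.
    ratings = [c.rating for c in cases if c.rating is not None]
    if not ratings:
        raise ValueError("spectrum tasks need human quality ratings")
    return task_type, "mean_rating", mean(ratings)
```

Called as `evaluate([Case("4", expected="4"), Case("5", expected="6")])`, the sketch reports an accuracy score; called with rated drafts such as `Case("draft A", rating=4)`, it reports a mean rating. Keeping the two paths separate is the point: an accuracy number is never computed for a spectrum task, and a rating is never mistaken for correctness.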
Synthesis
Migrated from Scholia