- Measure performance on realistic, important tasks.
- Pinpoint where the model excels and where it fails.
- Compare your model to state-of-the-art (SOTA) systems.
- Gain actionable insights to guide iteration and fine-tuning.
The Evaluation Process: A Four-Step Cycle
Here is a high-level overview of the recommended evaluation process:
- Build a realistic, diverse, and challenging evaluation set.
- Compare your model’s responses against those from a state-of-the-art (SOTA) model.
- Analyze the results to understand performance gaps.
- Improve your model in a targeted way.
Step 1: Build a High-Quality Evaluation Set
The quality of your evaluation depends entirely on the quality of the prompts you test it with. A strong evaluation set is:
- Realistic: Your evaluation set should contain prompts and documents that are representative of real-world use cases. Avoid contrived prompts that are designed only to make the model fail but don’t reflect real scenarios.
- Diverse: To ensure broad coverage, build a taxonomy of different scenarios. Think in terms of “verticals” (sub-domains) and “horizontals” (workflows); crossing the two gives a coverage grid (see the sketch after this list). For example, in finance:
- Verticals could be investment banking, private equity, or corporate finance.
- Horizontals could be market research, financial modeling, or deal structuring.
- Challenging: The evaluation set should include tasks across a spectrum of difficulty to properly test your model’s limits.
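If it helps to make the taxonomy concrete, a coverage grid can be built by crossing verticals with horizontals. The sketch below is a minimal Python illustration; the category names and the register_prompt helper are purely illustrative, not part of any prescribed tooling.

```python
from itertools import product

# Illustrative taxonomy for a finance domain; adjust to your own use cases.
verticals = ["investment banking", "private equity", "corporate finance"]
horizontals = ["market research", "financial modeling", "deal structuring"]

# Every (vertical, horizontal) pair is a scenario that should have at least
# one evaluation prompt, so gaps in coverage are easy to spot.
coverage = {pair: [] for pair in product(verticals, horizontals)}

def register_prompt(vertical: str, horizontal: str, prompt: str) -> None:
    """File a prompt under its scenario so coverage can be audited later."""
    coverage[(vertical, horizontal)].append(prompt)

register_prompt("private equity", "market research",
                "Project the number of MSMEs in 2024 using the 2019-2023 CAGR...")

missing = [pair for pair, prompts in coverage.items() if not prompts]
print(f"{len(missing)} of {len(coverage)} scenarios still need prompts")
```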
For example, the following multi-part prompt for a private equity use case meets all three criteria:
1. Using the “SME Corp_MSMEs in 2015-2023” file, project the number of MSMEs in 2024 using the 2019-2023 CAGR. Round the calculated CAGR to two decimal places using normal rounding rules before calculating the number of MSMEs in 2024. Round the calculated number of MSMEs to the nearest whole number.
2. Using the “SC MTC Companies” file, determine the number of Malaysian MTCs; assume that the reported number is for the year 2024.
3. Using the “Orbis_Malaysian Companies Data” file, identify the total number of companies meeting the following investment criteria:
   - a) Average EBITDA margin (rounded to one decimal place) over the years 2022 to 2024 is at least 15%. Note: to calculate this average EBITDA margin, sum the EBITDA margins for all available years between 2022 and 2024 for a company, divide by the number of years for which data is available, and then round to one decimal place. For example, if a company has EBITDA margin data only for 2023 and 2024 (missing 2022), calculate the average as (EBITDA_2023 + EBITDA_2024) ÷ 2.
   - b) Exclude companies that report any EBITDA margin greater than 90% (without rounding) during the 2022–2024 period to avoid unrealistic profitability outliers.
   - c) Report how many companies meet both criteria.
4. From the companies qualified in Task 3, identify the top three industries (use “BvD sectors” as an indicator of industry) by volume (i.e., how many times they are represented in the dataset). For each of the top three industries:
   - Calculate the median and the top-quartile EBITDA margins (using the rounded company-level averages from 3a).
   - Round answers to one decimal place.
   - State the number of companies in each of the top three industries.
Why is this a good prompt?
- Stresses realistic reasoning for private equity professionals
- Requires challenging instruction following and multi-step logical reasoning
- Needs the model to integrate multiple source documents that are representative of real-world scenarios
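To make the expected reasoning concrete, here is a minimal sketch of the arithmetic behind Task 1, using hypothetical MSME counts rather than the actual figures from the referenced file.

```python
# Hypothetical yearly MSME counts; the real values come from the
# “SME Corp_MSMEs in 2015-2023” file referenced in the prompt.
msme_2019 = 1_000_000
msme_2023 = 1_200_000

# CAGR over 2019-2023 (four compounding periods), rounded to two decimal
# places as a percentage before it is used, per the prompt's instructions.
cagr = (msme_2023 / msme_2019) ** (1 / 4) - 1
cagr_pct = round(cagr * 100, 2)          # 4.66 for these hypothetical counts

# Project 2024 with the rounded CAGR and round to the nearest whole number.
msme_2024 = round(msme_2023 * (1 + cagr_pct / 100))
print(cagr_pct, msme_2024)
```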
Step 2: Compare Model Responses Using a Clear Taxonomy
To run evaluations, you will compare your model’s output to another model’s output (e.g., a SOTA model) on the same prompt. The following taxonomy provides a best-practice framework for scoring the responses.

| Taxonomy | Definition | Proposed Scale |
|---|---|---|
| Overall Preference | The evaluator’s choice between multiple model outputs, reflecting which response is most useful, accurate, or satisfying overall. | 1 – A is much better; 2 – A is slightly better; 3 – No preference; 4 – B is slightly better; 5 – B is much better |
| Overall Preference Justification | The explanation or reasoning behind why a specific output was preferred, highlighting strengths and weaknesses relative to alternatives. | Free text explanation |
| Overall Quality Score | A holistic rating (often numerical or categorical) summarizing the evaluator’s judgment of the response’s quality, independent of comparison. | 1 – Terrible; 2 – Pretty Bad; 3 – Okay; 4 – Minor Room for Improvement; 5 – Cannot Be Improved |
| Explanation of Model Improvement Areas | If the model scored 3 or lower, the expert should explain in free text what could be improved about the model’s response. Sometimes, we provide dropdown options with set categories. | Free text explanation |
| Truthfulness | The model’s factual accuracy and breadth of information, including recall of established facts and grounding in reference documents. | 1 – No Issues; 2 – Minor Issues; 3 – Major Issues |
| Instruction Following | The degree to which the model adheres to the user’s prompt, constraints, and task requirements without deviation. | 1 – No Issues; 2 – Minor Issues; 3 – Major Issues |
| Verbosity | The appropriateness of response length and level of detail, avoiding being too terse or unnecessarily wordy. | 1 – Too Short; 2 – Just Right; 3 – Too Long |
| Writing Style and Tone | The clarity, fluency, and appropriateness of language, including tone, voice, and stylistic alignment with the intended audience. | 1 – No Issues; 2 – Minor Issues; 3 – Major Issues |
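One lightweight way to apply this taxonomy is to record each pairwise comparison as a structured object. The sketch below assumes a simple in-house schema; the field names and the PairwiseEvaluation class are illustrative, not a standard format.

```python
from dataclasses import dataclass

@dataclass
class PairwiseEvaluation:
    """One rater's judgment of model A vs. model B on a single prompt."""
    prompt_id: str
    overall_preference: int          # 1-5: 1 = A much better ... 5 = B much better
    preference_justification: str    # free-text reasoning for the preference
    quality_score_a: int             # 1-5 holistic quality of response A
    quality_score_b: int             # 1-5 holistic quality of response B
    truthfulness_a: int              # 1 = no issues, 2 = minor, 3 = major
    truthfulness_b: int
    instruction_following_a: int     # same 1-3 issue scale
    instruction_following_b: int
    verbosity_a: int                 # 1 = too short, 2 = just right, 3 = too long
    verbosity_b: int
    style_tone_a: int                # same 1-3 issue scale
    style_tone_b: int
    improvement_notes: str = ""      # required when a quality score is 3 or lower

record = PairwiseEvaluation(
    prompt_id="pe-msme-001",
    overall_preference=2,
    preference_justification="A followed the rounding instructions; B skipped the 90% exclusion.",
    quality_score_a=4, quality_score_b=3,
    truthfulness_a=1, truthfulness_b=2,
    instruction_following_a=1, instruction_following_b=3,
    verbosity_a=2, verbosity_b=3,
    style_tone_a=1, style_tone_b=1,
    improvement_notes="B should exclude companies with any EBITDA margin above 90% before averaging.",
)
```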
Step 3: Analyze the Results
Once scores are collected, the goal is to diagnose why your model performed the way it did.
- If a response scored less than a perfect 5/5, focus on understanding the specific gaps.
- You can use LLMs to help automatically classify results into failure categories (e.g., reasoning errors, factual inaccuracies).
- Track performance over time to see if your interventions are fixing specific failure types.
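As a sketch of what this analysis can look like, the snippet below tallies failure categories from results that have already been labeled; the record format and category labels are assumptions, and the labeling itself could come from human reviewers or an LLM classifier.

```python
from collections import Counter

# Hypothetical analyzed results: each entry is a sub-perfect response that a
# reviewer (or an LLM classifier) has tagged with a failure category.
analyzed = [
    {"prompt_id": "pe-msme-001", "score": 3, "failure": "instruction following"},
    {"prompt_id": "pe-msme-002", "score": 4, "failure": "rounding / arithmetic"},
    {"prompt_id": "pe-msme-003", "score": 2, "failure": "instruction following"},
    {"prompt_id": "pe-msme-004", "score": 3, "failure": "factual inaccuracy"},
]

# Count failures per bucket to see where the model most needs work.
buckets = Counter(row["failure"] for row in analyzed if row["score"] < 5)
for category, count in buckets.most_common():
    print(f"{category}: {count}")
```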
Step 4: Continuously Improve Your Model (aka “Hillclimbing”)
“Hillclimbing” is the process of continuous, systematic improvement. After analyzing the results, you can take targeted steps to address your model’s weaknesses.
- Treat each evaluation round as a feedback loop: fix the weakest areas, re-evaluate, and measure your progress.
- Focus your efforts on the failure buckets where you consistently lag behind SOTA models.
- Once your model reaches parity with SOTA models on your current evaluation set, it’s time to increase the difficulty by creating a more challenging set of prompts. This ensures you are always pushing the limits of your model’s capabilities.
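One simple way to track hillclimbing progress, assuming the 1–5 preference scale from Step 2 with your model as “B”, is to compute a per-round win rate and compare it against a parity threshold. The numbers and the 50% threshold below are illustrative only.

```python
def win_rate(preferences: list[int]) -> float:
    """Share of comparisons where our model (B) was preferred (scores 4 or 5)."""
    wins = sum(1 for p in preferences if p >= 4)
    return wins / len(preferences)

# Hypothetical preference scores from two evaluation rounds.
round_1 = [1, 2, 2, 3, 4, 2, 1, 3]
round_2 = [3, 4, 2, 4, 5, 3, 4, 2]

for name, scores in [("round 1", round_1), ("round 2", round_2)]:
    rate = win_rate(scores)
    status = "near parity: build a harder set" if rate >= 0.5 else "keep hillclimbing"
    print(f"{name}: win rate {rate:.0%} -> {status}")
```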
How to Improve the Quality of Your Evaluations
- Source High-Quality Experts: The quality of your evaluation data depends on the people providing it. Prioritize experts who are experienced in your subject matter.
- Calibrate Your Raters: Create a set of “golden tasks” – example evaluations with pre-determined correct answers – to align your internal team and external experts on the grading criteria.
- Focus on Quality Before Quantity: When an evaluator starts, review their first few graded tasks closely before allowing them to work on a larger volume.
- Use Consensus to Reduce Bias: To improve reliability, many labs have multiple evaluators rate the same response. This reduces individual bias. You can measure the consistency between raters with a metric called inter-rater agreement (IRA); a minimal Cohen’s kappa sketch follows this list. A high IRA score means your instructions and rubric are clear, while a low score indicates ambiguity.
- Keep Comparisons Simple: Studies show that side-by-side (pairwise) comparisons are less cognitively demanding and lead to more consistent ratings. Try to stick to comparing just two or three model responses at a time.
- Close Batches Pragmatically: Sometimes closing a batch at less than 100% completion is easier, faster, and preferable. If you close out batches early, watch out for “cherry-picking”, where annotators complete only the easy tasks and you never get the diverse or complex cases you had in mind.
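As referenced above, here is a minimal sketch of measuring IRA with Cohen’s kappa, a common agreement metric for two raters, assuming scikit-learn is available; the ratings are made up.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical overall-quality scores (1-5) from two raters on the same ten responses.
rater_a = [4, 3, 5, 2, 4, 4, 3, 5, 2, 3]
rater_b = [4, 3, 4, 2, 4, 3, 3, 5, 1, 3]

# Cohen's kappa corrects raw agreement for agreement expected by chance;
# values near 1 suggest a clear rubric, values near 0 suggest ambiguity.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

For more than two raters, Fleiss’ kappa or Krippendorff’s alpha are common alternatives.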
Recommended Tooling
Good tooling is essential for maintaining high annotation quality. While you may eventually build custom tools, you can accomplish a great deal with off-the-shelf SaaS products.
- Google Forms: A lightweight and easy-to-use option, great for getting started with simple annotation projects.
- Airtable: A more powerful tool that allows for significant automation and customization, capable of supporting large-scale annotation pipelines with a user-friendly interface.
