Prompt Comparison Tool

by Engineering | May 21, 2024

Why We Need a Prompt Comparison Tool

Existing models can already read entire contracts and extract insights. However, as we integrate new features and fine-tune LLMs, we need tools to evaluate their performance effectively. One such feature is finding issues in contracts, or in parts of contracts, and then explaining them. Another is 'chatting' with a contract (a traditional single-document prompt) and extending that capability across all contracts in a project using Retrieval-Augmented Generation (RAG).
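
As a rough illustration of that RAG flow (not our production code), the idea is to pull the contract chunks most relevant to a question and fold them into a single prompt. The minimal sketch below uses plain word overlap for retrieval purely to stay self-contained; a real system would use embeddings and a vector store.

```python
# Minimal sketch of the RAG idea: score stored contract chunks against the
# question, keep the best matches, and build one prompt around them.
def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    q_words = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(retrieve(question, chunks))
    return (
        "You are reviewing clauses drawn from several contracts in a project.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above."
    )
```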

Evaluating LLMs Beyond Benchmarks

The first decision in this process is determining which LLM to fine-tune. While it might be tempting to rely on the latest news or benchmarks, this approach has limitations. Benchmarks can provide a good idea about accuracy but may not always reflect tendencies to hallucinate or other practical issues. Moreover, they often test areas where LLMs already excel, potentially skewing results.

Drawbacks of Benchmarks:

  • Comfort Zones: Benchmarks often test areas where LLMs are already known to excel, not effectively measuring potential in unexplored or more complex capabilities.
  • Speed of Progress: As LLMs achieve human-level performance on benchmarks, new benchmarks are needed, complicating meaningful comparisons over time.
  • Errors in Benchmark Datasets: Benchmark datasets improve over time, but they can still contain inaccuracies that distort scores.

The primary goal of an LLM for lawyers or any user is to simplify real-world tasks. If an LLM needs to generate meeting notes or summaries in a particular tone, its actual results will be more crucial than its score on a standardized benchmark.

Practical Considerations for Choosing LLMs

Some models may lag in benchmark scores but excel in commercial applications thanks to cost-effectiveness, adaptability, or unique features. Factors such as being open source, self-hostable, or able to run on lower-powered devices are not captured by common benchmarks. A model’s context size can also matter more than its benchmark score.

The LMSYS Chatbot Arena leaderboard, hosted on Hugging Face, is a useful comparison tool because it draws on feedback from thousands of real users. However, topping the charts isn’t everything, especially when fine-tuning is a significant investment. We also need to weigh open-source availability, specific results against costs, context window size, speed/latency, and other factors.

Building Our Comparison Tool

Given the limitations of existing tools, we decided to build our own prompt comparison tool. It will help us evaluate how well open LLMs handle our existing and planned features out of the box. Specifically, we have tasks that may not suit a RAG approach because we are used to sending whole contracts, which can be expensive given the large context windows involved and prone to the “needle in a haystack” problem.

Our approach to building this tool involves several practical steps for prompt refinement:

  • Try simple prompts and do a human evaluation of “good” or “bad”.
  • Discuss where the results were often weak and refine the prompt iteratively.
  • Add examples.
  • Add legal checklists and guidelines for each clause type.
  • Accept that the prompt grows longer and longer and may become a mega-prompt.
  • If problems persist, switch to few-shot or many-shot prompting or, less frequently, discuss fine-tuning.
  • Finally, consider breaking the task down into discrete steps and building an agent workflow.

There’s a lot here to keep track of. But by following these steps, we can systematically improve our LLM results and ensure they meet the specific needs of our applications.
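
In code, the core loop behind these steps is simple: hold the task inputs fixed, swap in progressively richer prompt templates, and record every output for human review. The sketch below is illustrative only; the variant names, templates, and `complete` callable are placeholders rather than our actual implementation.

```python
import csv

# Hypothetical prompt variants, from a simple instruction up to a "mega-prompt"
# with a legal checklist baked in.
PROMPT_VARIANTS = {
    "simple": "Identify any issues in this clause and explain them:\n\n{clause}",
    "with_example": (
        "Identify any issues in this clause and explain them.\n\n"
        "Example issue: an indemnity with no liability cap.\n\n"
        "Clause:\n{clause}"
    ),
    "mega": (
        "You are a contract lawyer. Work through the checklist below and flag "
        "every issue you find, with a short explanation for each.\n"
        "Checklist: termination rights, liability caps, indemnities, "
        "governing law, assignment.\n\nClause:\n{clause}"
    ),
}

def run_comparison(clauses: list[str], complete, out_path: str = "comparison.csv") -> None:
    """Run every prompt variant over every clause and save the results for review.

    `complete(prompt) -> str` stands in for whatever function actually calls the model.
    """
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["variant", "clause", "response", "human_rating"])
        for clause in clauses:
            for name, template in PROMPT_VARIANTS.items():
                response = complete(template.format(clause=clause))
                writer.writerow([name, clause, response, ""])  # rating filled in by hand
```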

Key Features of Our Tool

  • Prompt Engineering Concerns: We need to test different models with "mega-prompts" that bundle large checklists, guidelines, and examples, and to vary the prompt engineering choices: the technique used, the split between system and user prompts, the number of examples supplied, and the consistency of results over time (a sketch of such a sweep follows this list).
  • Evaluation: Once the app is in production, we will need a full evaluation suite. This will follow a three-part process: the test response, a second-LLM review, and an LLM-generated score (also sketched below). Human oversight will remain crucial for accuracy and reliability.
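
To make the first point concrete, a sweep over models and prompt settings might look like the sketch below. The model names, the system/user split, and the `chat` callable are all assumptions made for illustration, not references to a specific API.

```python
from itertools import product

MODELS = ["model-a", "model-b"]   # placeholder model identifiers
EXAMPLE_COUNTS = [0, 2, 5]        # zero-shot, few-shot, many-shot

def build_messages(task: str, examples: list[dict], n_examples: int) -> list[dict]:
    # System prompt plus n worked examples, then the actual task as the user turn.
    messages = [{"role": "system", "content": "You are a careful contract reviewer."}]
    for ex in examples[:n_examples]:
        messages.append({"role": "user", "content": ex["input"]})
        messages.append({"role": "assistant", "content": ex["output"]})
    messages.append({"role": "user", "content": task})
    return messages

def sweep(task: str, examples: list[dict], chat) -> list[dict]:
    """`chat(model, messages) -> str` stands in for the real API call."""
    results = []
    for model, n in product(MODELS, EXAMPLE_COUNTS):
        reply = chat(model, build_messages(task, examples, n))
        results.append({"model": model, "examples": n, "response": reply})
    return results
```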
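
The three-part evaluation could be wired up along the lines below: generate the test response, ask a second model to review it, then ask for a numeric score, keeping every record for human spot-checks. Again, this is a sketch rather than our production pipeline; `generator` and `reviewer` are stand-ins for the model under test and the judging model.

```python
def evaluate(task: str, generator, reviewer) -> dict:
    """Three-part evaluation sketch: test response, second-LLM review, LLM score."""
    response = generator(task)

    review = reviewer(
        "You are reviewing an AI answer to a legal task.\n"
        f"Task: {task}\nAnswer: {response}\n"
        "Point out anything inaccurate, missing, or misleading."
    )

    score_text = reviewer(
        f"Given this review of the answer:\n{review}\n"
        "Rate the answer from 1 (woeful) to 5 (excellent). Reply with the number only."
    )
    try:
        score = int(score_text.strip())
    except ValueError:
        score = None  # flag for human review rather than guessing

    # Everything is kept so a human can audit the judge as well as the answer.
    return {"task": task, "response": response, "review": review, "score": score}
```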

Why Prompt Engineering Over Fine-Tuning

At this stage, we prefer adjusting prompts rather than using feedback for fine-tuning the model. Here’s why:

  • Flexibility and Control: Adjusting prompts allows targeted changes to align results with human evaluators and adapt to evolving criteria or new tasks.
  • Iterative Refining: This approach supports incremental improvements and A/B testing of different prompts (a rough sketch of such a comparison follows this list).
  • Less Risk of Overfitting: Adjusting prompts helps ensure the model generalises better to new, unseen data.
  • Easier to Interpret: Prompt adjustments provide more transparent and interpretable results.
  • Cost-Effectiveness: Prompt engineering is generally less resource-intensive than fine-tuning, although monitoring human evaluation costs is essential.
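
The A/B comparison mentioned above could be run along these lines: send the same inputs through two prompt variants and tally which output the reviewer prefers, with the presentation order randomised to avoid position bias. The `complete` and `judge` callables are placeholders assumed for illustration.

```python
import random

def ab_test(inputs: list[str], prompt_a: str, prompt_b: str, complete, judge) -> dict:
    """Compare two prompt templates on the same inputs.

    `complete(prompt) -> str` calls the model; `judge(text, first, second)` returns
    "A", "B", or "tie" for the first or second output shown, and may be a human
    reviewer or, later, an LLM reviewer.
    """
    tally = {"A": 0, "B": 0, "tie": 0}
    for text in inputs:
        out_a = complete(prompt_a.format(input=text))
        out_b = complete(prompt_b.format(input=text))
        # Randomise which output is shown first so the judge cannot favour a position.
        if random.random() < 0.5:
            verdict = judge(text, out_a, out_b)
        else:
            verdict = {"A": "B", "B": "A", "tie": "tie"}[judge(text, out_b, out_a)]
        tally[verdict] += 1
    return tally
```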

Immediate Human Evaluation

Our CEO and front-end developer, both ex-lawyers, can begin human evaluation right away. They can categorize results as “woeful,” “rubbish,” “not bad,” “good,” or “Cicero!” This initial evaluation helps refine our processes while we develop the full evaluation flow.
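
Those five labels map naturally onto a numeric scale, which makes the ratings easy to aggregate later. A tiny illustrative sketch:

```python
# Illustrative mapping of the human rating labels to numbers for aggregation.
RATING_SCALE = {"woeful": 1, "rubbish": 2, "not bad": 3, "good": 4, "Cicero!": 5}

def average_rating(labels: list[str]) -> float:
    scores = [RATING_SCALE[label] for label in labels]
    return sum(scores) / len(scores)
```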

Continuous Improvement and Sharing

Our tool allows us to filter prompts by content, re-run them with different models or parameters, and save them for comparison. As we incorporate LLM review, we’ll do less manual evaluation but will continue to share CSV files to monitor results regularly.
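
In practice, that means keeping every run as a record that can be filtered, re-run with new settings, and exported. A sketch of how such records might be handled (the field names and the `complete` callable are illustrative assumptions):

```python
import csv
from dataclasses import dataclass, asdict, fields

@dataclass
class Run:
    prompt: str
    model: str
    temperature: float
    response: str
    rating: str = ""

def filter_runs(runs: list[Run], contains: str) -> list[Run]:
    """Keep only runs whose prompt mentions the given text."""
    return [r for r in runs if contains.lower() in r.prompt.lower()]

def rerun(run: Run, model: str, temperature: float, complete) -> Run:
    """Re-run a saved prompt with a different model or parameters.

    `complete(prompt, model, temperature) -> str` stands in for the real model call.
    """
    return Run(run.prompt, model, temperature, complete(run.prompt, model, temperature))

def export_csv(runs: list[Run], path: str) -> None:
    """Write runs to a CSV file for sharing and review."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(Run)])
        writer.writeheader()
        writer.writerows(asdict(r) for r in runs)
```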