RELEVANCE Banner Image

RELEVANCE: Relevance and Entropy-based Evaluation with Longitudinal Inversion Metrics

A generative AI (GenAI) evaluation framework designed to automatically evaluate creative responses from large language models (LLMs).

RELEVANCE (Relevance and Entropy-based Evaluation with Longitudinal Inversion Metrics) is a generative AI evaluation framework designed to automatically evaluate creative responses from large language models (LLMs). RELEVANCE combines custom-tailored relevance assessments with mathematical metrics to ensure AI-generated content aligns with human standards and maintains consistency. Monitoring these metrics over time enables automatic detection of when the LLM's relevance evaluation starts to slip or hallucinate.

Custom relevance evaluation alone involves scoring responses against predefined criteria. While these scores provide a direct assessment, they may not capture the full complexity and dynamics of response patterns across multiple evaluations or different sets of data (e.g., model hallucination and model slip). To address this, RELEVANCE integrates mathematical techniques with custom evaluations to ensure LLM response accuracy over time and adaptability to evolving LLM behaviors without requiring manual review. Each metric serves a specific purpose:

  • Permutation Entropy (PEN): Quantifies the randomness of response rankings compared to human rankings. Ensures that the sequence isn’t too random, maintaining a predictable level of complexity.
  • Count Inversions (CIN): Measures the degree of disorder within these rankings. Ensures that the sequence is ordered correctly, with fewer out-of-order pairs.
  • Longest Increasing Subsequence (LIS): Identifies the length of the most consistent sequence of responses, mirroring human judgment. Ensures that there are long, consistent patterns of increasing relevance (a code sketch of these three sequence metrics follows this list).
  • Custom Relevance Evaluation: Scores responses based on criteria such as accuracy, completeness, engagement, or alignment with a given prompt.
  • Initial Human Relevance Evaluation: Ensures deeper contextual and semantic nuances are captured by the custom relevance evaluation.
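
As a rough illustration of the three sequence metrics above, the sketch below shows one way they might be computed over a batch of relevance scores. The function names, the ordinal-pattern window size `m=3`, and the example scores are assumptions for illustration only, not part of the framework's actual API.

```python
# Illustrative sketch only: names and parameters are assumptions, not RELEVANCE's API.
from bisect import bisect_left
from math import factorial, log2


def permutation_entropy(scores, m=3):
    """Normalized permutation entropy of a score sequence (0 = fully ordered, 1 = random).
    Assumes len(scores) >= m."""
    patterns = {}
    for i in range(len(scores) - m + 1):
        window = scores[i:i + m]
        # Ordinal pattern: the ranking of values inside the sliding window.
        pattern = tuple(sorted(range(m), key=lambda k: window[k]))
        patterns[pattern] = patterns.get(pattern, 0) + 1
    total = sum(patterns.values())
    entropy = -sum((c / total) * log2(c / total) for c in patterns.values())
    return entropy / log2(factorial(m))  # normalize by the maximum possible entropy


def count_inversions(scores):
    """Number of out-of-order pairs (i < j but scores[i] > scores[j]); O(n^2) for clarity."""
    return sum(
        1
        for i in range(len(scores))
        for j in range(i + 1, len(scores))
        if scores[i] > scores[j]
    )


def longest_increasing_subsequence(scores):
    """Length of the longest strictly increasing subsequence (patience sorting, O(n log n))."""
    tails = []
    for s in scores:
        pos = bisect_left(tails, s)
        if pos == len(tails):
            tails.append(s)
        else:
            tails[pos] = s
    return len(tails)


if __name__ == "__main__":
    # LLM relevance scores listed from the human-judged worst response to the best;
    # perfect agreement with the human ranking would give a strictly increasing sequence.
    scores = [0.42, 0.55, 0.60, 0.58, 0.78, 0.91, 0.87]
    print("PEN:", round(permutation_entropy(scores), 3))
    print("CIN:", count_inversions(scores))
    print("LIS:", longest_increasing_subsequence(scores))
```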

Together, these tools provide a robust framework for evaluating AI-generated responses, especially in contexts where responses are open-ended and there is no single correct answer. For instance, a sudden increase in Permutation Entropy or Count Inversions, or a decrease in Longest Increasing Subsequence, can alert you to potential issues, prompting further investigation or model adjustments.
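
One hedged sketch of how that alerting could be operationalized is shown below; the threshold values and the `check_for_drift` function are illustrative assumptions and would, in practice, be calibrated against the baseline established during the initial human relevance evaluation.

```python
# Illustrative thresholds only; calibrate them against your own human-evaluated baseline.
def check_for_drift(pen, cin, lis_len, n_responses,
                    pen_max=0.6, cin_max_fraction=0.25, lis_min_fraction=0.5):
    """Flag a batch of evaluations whose ordering metrics suggest slip or hallucination."""
    max_pairs = n_responses * (n_responses - 1) / 2  # total comparable pairs in the batch
    alerts = []
    if pen > pen_max:
        alerts.append("Permutation Entropy above threshold: rankings look too random.")
    if cin > cin_max_fraction * max_pairs:
        alerts.append("Count Inversions above threshold: too many out-of-order pairs.")
    if lis_len < lis_min_fraction * n_responses:
        alerts.append("Longest Increasing Subsequence too short: consistency is degrading.")
    return alerts


# Example: metrics computed for a batch of 7 responses (values from the sketch above).
for message in check_for_drift(pen=0.72, cin=9, lis_len=3, n_responses=7):
    print("ALERT:", message)
```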