{"id":1041426,"date":"2024-05-30T10:44:33","date_gmt":"2024-05-30T17:44:33","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=1041426"},"modified":"2024-06-10T09:26:56","modified_gmt":"2024-06-10T16:26:56","slug":"relevance-automatic-evaluation-framework-for-llm-responses","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/relevance-automatic-evaluation-framework-for-llm-responses\/","title":{"rendered":"RELEVANCE: Automatic Evaluation Framework for LLM Responses"},"content":{"rendered":"
A generative AI (GenAI) evaluation framework designed to automatically evaluate creative responses from large language models (LLMs).

RELEVANCE (Relevance and Entropy-based Evaluation with Longitudinal Inversion Metrics) is a generative AI evaluation framework designed to automatically evaluate creative responses from large language models (LLMs). RELEVANCE combines custom-tailored relevance assessments with mathematical metrics to ensure AI-generated content aligns with human standards and maintains consistency. Monitoring these metrics over time enables automatic detection of when the LLM's relevance evaluation starts to slip or hallucinate.

Custom relevance evaluation alone involves scoring responses based on predefined criteria. While these scores provide a direct assessment, they might not capture the full complexity and dynamics of response patterns over multiple evaluations or across different sets of data (e.g., model hallucination and model slip). To address this, RELEVANCE integrates mathematical techniques with custom evaluations to ensure LLM response accuracy over time and adaptability to evolving LLM behaviors without manual review. Each metric serves a specific purpose:

• **Permutation Entropy (PEN)**: Quantifies the randomness of response rankings compared to human rankings, ensuring that the sequence isn't too random and maintains a predictable level of complexity.
• **Count Inversions (CIN)**: Measures the degree of disorder within these rankings, ensuring that the sequence is ordered correctly, with fewer out-of-order pairs.
• **Longest Increasing Subsequence (LIS)**: Identifies the length of the most consistent sequence of responses, mirroring human judgment and ensuring there are long, consistent patterns of increasing relevance.
• **Custom Relevance Evaluation**: Scores responses based on criteria such as accuracy, completeness, engagement, or alignment with a given prompt.
• **Initial Human Relevance Evaluation**: Ensures deeper contextual and semantic nuances are captured by the custom relevance evaluation.

Together, these tools provide a robust framework for evaluating AI-generated responses, especially in contexts where responses are open-ended and there is no single correct answer. For instance, a sudden increase in Permutation Entropy or Count Inversions, or a decrease in Longest Increasing Subsequence, can alert you to potential issues, prompting further investigation or model adjustments; the sketch below illustrates such a monitoring check.
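To make that alerting concrete, here is a minimal monitoring sketch in Python. It assumes the `permutation_entropy`, `count_inversions`, and `longest_increasing_subsequence` functions from the listing later on this page, and the thresholds are illustrative placeholders, not values prescribed by the framework.

```python
def check_relevance_drift(scores, pen_max=0.9, cin_max=9, lis_min=2):
    """Flag potential drift in a window of automatic relevance scores.

    Assumes the three metric functions defined in the Python listing
    later in this article; the thresholds are illustrative only.
    """
    alerts = []
    if permutation_entropy(scores, m=3, delay=1) > pen_max:
        alerts.append("PEN spike: score rankings are becoming erratic")
    if count_inversions(scores) > cin_max:
        alerts.append("CIN spike: many out-of-order score pairs")
    if longest_increasing_subsequence(scores) < lis_min:
        alerts.append("LIS drop: consistent runs of quality are shrinking")
    return alerts

# For the example window used later, [5, 4, 5, 4, 5, 4, 5, 4], the CIN
# alert fires (10 inversions), matching the "high CIN" reading discussed
# further down the page.
```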


The goal is to evaluate whether LLM-generated marketing content is suitably structured for posting on a social media blog. The content must be locally relevant and align with the targeted audience, brand, and voice. Content creation exclusively employs typical prompt engineering, without complex techniques like Retrieval-Augmented Generation (RAG) or model fine-tuning. This choice demonstrates the efficiency of the evaluation process, with content creation and evaluation beginning on the same day.

**System Prompt:** Define the prompt used by the LLM to generate marketing content.

For example, our goal is to create marketing content for a blog post that boosts brand awareness within our targeted segment.

**Example System Prompt**

```text

{{region}} – Mainland China

{{goal}} – Boost Brand Awareness

{{brand}} – Neb Cafe is a luxury coffee company that offers espresso and regular coffee products at affordable prices. Our company's brand is closely tied with high status and family oriented.

{{audience}} – We are targeting our ad to heads of household adults who prefer to make their coffee at home and coffee is an essential part of their morning routine.

{{voice}} – Neb Cafe uses both functional and expressive voice tones when writing content. Functional voice means helpful—it organizes things in a clear way and anticipates our audience's needs, helping customers have an easy, enjoyable experience in-store and online. Used primarily for wayfinding and ordering, this copy is so seamlessly integrated that it calls attention to the product—not itself. Functional doesn't mean sterile; it means clear. Expressive voice is where our brand personality unfurls with day-making thoughts. We use expressive moments on focal products to present a product truth in a fresh, relevant, interesting way. When we have the space, we tell a passionate coffee story. But even with just a few words, our copy can make you smile—always taking into account where our audience is interacting with us—and making every word count.

{{platform}} – Blog post


    # system:
You are a marketing expert who creates American English advertisement content for {{ region }} customers. Your goal is to create a {{platform}} ad for our company Neb Cafe that leverages our company brand and company writing style. Each advertisement must be written in American English and include popular local terminology or reference local places of interest. Most importantly, the advertisement must be worded so that it is non-offensive to local audiences.

    # brand:
    {{brand}}

    # writing style:
{{voice}}

    # user:
    Please create a Neb Cafe latte advertisement for a {{platform}} with the goal to {{goal}} for the following audience {{audience}}.

    # template:
"{emoji} {slogan} {emoji}"

    {Product Introduction and Description}

    {Give example of using product in local context}

{Call to action using our website, follow us on social media}.
```
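The double-brace placeholders map directly onto a templating engine. The article does not prescribe one, but as an illustration, here is how the prompt could be rendered with Jinja2; this is a sketch under that assumption, with the template abbreviated and the values taken from the example above.

```python
from jinja2 import Template  # assumption: Jinja2 for the {{ placeholder }} syntax

prompt = Template(
    "# system:\n"
    "You are a marketing expert who creates American English advertisement "
    "content for {{ region }} customers. Your goal is to create a {{ platform }} "
    "ad for our company Neb Cafe...\n\n"
    "# user:\n"
    "Please create a Neb Cafe latte advertisement for a {{ platform }} with the "
    "goal to {{ goal }} for the following audience {{ audience }}."
)

# Fill the template with the example values defined above.
print(prompt.render(
    region="Mainland China",
    platform="Blog post",
    goal="Boost Brand Awareness",
    audience="heads of household adults who prefer to make their coffee at home",
))
```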

**Automated Relevance Evaluation:** Employ an LLM to score the relevance of responses against curated examples, using a defined evaluation system.

For example, our custom relevance evaluation grades our LLM responses against specific measures such as context, references, and compliance.

**Example Custom Relevance Evaluation**

```text

System:
You are a Marketing Content Generation Expert. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.

    User:
Relevance measures how well the answer addresses the main aspects of the question, based on the context, reference to popular Mainland China terminology or local points of interest, and avoidance of words that may be interpreted by Mainland China users as offensive or risky. Consider whether all and only the important aspects are contained in the answer when evaluating relevance. Given the context and question, score the relevance of the answer from one to five stars using the following rating scale:
    One star: the answer completely lacks relevance
    Two stars: the answer mostly lacks relevance
    Three stars: the answer is partially relevant
    Four stars: the answer is mostly relevant
    Five stars: the answer has perfect relevance

    This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.

context: Neb Cafe is a luxury coffee company that offers espresso and regular coffee products at affordable prices to customers in Mainland China. Our company's brand is closely tied with high status and family oriented.
question: Create an advertisement.
answer: Indulge in the luxury of Neb Cafe's premium lattes without stepping out of your cozy abode.
stars: 1

context: Neb Cafe is a luxury coffee company that offers espresso and regular coffee products at affordable prices to customers in Mainland China. Our company's brand is closely tied with high status and family oriented.
question: Please create a Neb Cafe latte advertisement for a Blog post.
answer: 🌟 "Elevate Every Morning" ☕️\n\nIndulge in the luxury of Neb Cafe's premium lattes without stepping out of your cozy abode. Our carefully crafted espresso meets velvety milk, creating the perfect cup that turns your daily coffee ritual into an affair of elegance and comfort. Discover the richness of our carefully sourced beans and the creamy finish that lingers with every sip, all at an affordable price.
stars: 2

context: Neb Cafe is a luxury coffee company that offers espresso and regular coffee products at affordable prices to customers in Mainland China. Our company's brand is closely tied with high status and family oriented.
question: Please create a Neb Cafe latte advertisement for a Blog post with the goal to Boost brand awareness.
answer: 🌟 "Elevate Every Morning" ☕️\n\nIndulge in the luxury of Neb Cafe's premium lattes without stepping out of your cozy abode. Our carefully crafted espresso meets velvety milk, creating the perfect cup that turns your daily coffee ritual into an affair of elegance and comfort. Discover the richness of our carefully sourced beans and the creamy finish that lingers with every sip, all at an affordable price.\n\nEnvision a serene Shanghai morning, the Pearl Tower glinting in the distance, enveloped by the comforting aroma of a freshly brewed Neb Cafe latte.
stars: 3

context: Neb Cafe is a luxury coffee company that offers espresso and regular coffee products at affordable prices to customers in Mainland China. Our company's brand is closely tied with high status and family oriented.
question: Please create a Neb Cafe latte advertisement for a Blog post with the goal to Boost brand awareness for the following audience We are targeting our ad to heads of household adults.
answer: 🌟 "Elevate Every Morning" ☕️\n\nIndulge in the luxury of Neb Cafe's premium lattes without stepping out of your cozy abode. Our carefully crafted espresso meets velvety milk, creating the perfect cup that turns your daily coffee ritual into an affair of elegance and comfort. Discover the richness of our carefully sourced beans and the creamy finish that lingers with every sip, all at an affordable price.\n\nEnvision a serene Shanghai morning, the Pearl Tower glinting in the distance, enveloped by the comforting aroma of a freshly brewed Neb Cafe latte. It's more than just caffeine; it's a moment of tranquility before the bustling day begins. It's a tradition that harmonizes with the rhythm of your household—a shared pause to savor and connect.
stars: 4

context: Neb Cafe is a luxury coffee company that offers espresso and regular coffee products at affordable prices to customers in Mainland China. Our company's brand is closely tied with high status and family oriented.
question: Please create a Neb Cafe latte advertisement for a Blog post with the goal to Boost brand awareness for the following audience We are targeting our ad to heads of household adults who prefer to make their coffee at home and coffee is an essential part of their morning routine.
answer: 🌟 "Elevate Every Morning" ☕️\n\nIndulge in the luxury of Neb Cafe's premium lattes without stepping out of your cozy abode. Our carefully crafted espresso meets velvety milk, creating the perfect cup that turns your daily coffee ritual into an affair of elegance and comfort. Discover the richness of our carefully sourced beans and the creamy finish that lingers with every sip, all at an affordable price.\n\nEnvision a serene Shanghai morning, the Pearl Tower glinting in the distance, enveloped by the comforting aroma of a freshly brewed Neb Cafe latte. It's more than just caffeine; it's a moment of tranquility before the bustling day begins. It's a tradition that harmonizes with the rhythm of your household—a shared pause to savor and connect.\n\nStep into the world of Neb Cafe. Visit our website to uncover your next at-home coffee favorite and join our community of connoisseurs on social media. Start your day with sophistication. Start your day with Neb Cafe.\n\n🔗 [Explore Neb Cafe](#) 🌐\n👍 Follow us on WeChat and Weibo for more heartwarming stories and exclusive offers!
stars: 5

context: {{context}}
question: {{question}}
answer: {{answer}}
stars:
```

## Variant 1

A prompt requesting an advertisement for "Neb Cafe latte" for a blog post aimed at boosting brand awareness among adults who prefer home-brewed coffee. The response emphasized luxury, convenience, and local relevance, earning a perfect relevance score.

**Question/Prompt:**
```text
Please create a Neb Cafe latte advertisement for a Blog post with the goal to Boost brand awareness for the following audience We are targeting our ad to heads of household adults who prefer to make their coffee at home and coffee is an essential part of their morning routine.
```
**Evaluation Context:**
```text
Neb Cafe is a luxury coffee company that offers espresso and regular coffee products at affordable prices. Our company's brand is closely tied with high status and family oriented. We are targeting our ad to heads of household adults who prefer to make their coffee at home and coffee is an essential part of their morning routine.
```
**Answer/Inference:**
```text
🌟 Elevate Every Morning ☕️

Indulge in the luxury of Neb Cafe's premium lattes without leaving the comfort of your home. Our expertly crafted espresso blends with velvety milk to create the perfect cup that elevates your daily coffee ritual to a luxurious experience. Enjoy the richness of our meticulously sourced beans and the creamy finish that lingers with every sip, all offered at an accessible price.

Imagine a peaceful morning in Shanghai with the Pearl Tower in the backdrop, surrounded by the comforting aroma of a freshly brewed Neb Cafe latte. It's more than just caffeine; it's a tranquil moment before your busy day starts. It's a tradition that fits perfectly with the rhythm of your household—a shared moment to pause, savor, and connect.

Step into the world of Neb Cafe. Visit our website to discover your new favorite at-home coffee and join our community of coffee lovers on social media. Start your day with elegance. Start your day with Neb Cafe.

🔗 Explore Neb Cafe 🌐
👍 Follow us on WeChat and Weibo for more heartwarming stories and exclusive offers!
```
**Score/Stars:**
```text
5
```

I then took this a step further and amended the evaluation system prompt so that it explains why:

```text
The response provided is highly relevant to the specified context and the question asked. It effectively addresses the main points of the question by:

1. Highlighting the luxury aspect of Neb Cafe's lattes.
2. Emphasizing the convenience and comfort of enjoying a premium latte at home, aligning with the family-oriented brand image.
3. Including local references, such as mentioning Shanghai and the Pearl Tower, to resonate with audiences in Mainland China.
4. Encouraging heads of household, the specific target audience, to incorporate Neb Cafe into their morning routines.
5. Offering a call to action with a link to the website and an invitation to join the social media community on popular platforms in Mainland China, like WeChat and Weibo.
The answer also carefully avoids any terminology that could be perceived by users in Mainland China as offensive or risky, adhering to this aspect of relevance.

Given the comprehensive coverage of the requirements and the goal to "Boost brand awareness among heads of household adults who prefer to make their coffee at home and consider coffee an essential part of their morning routine," the relevance of the answer can be rated as perfectly relevant.

Rating: 5 stars.
```

With the additional "why" context, you can verify that the automatic relevance evaluation articulates the score in the manner intended.

## Variant 2

To keep variables isolated and ensure that the evaluation derives solely from the answer, I kept the Question and Context exactly the same as in Variant 1 and modified only the answer. The modified response introduces an irrelevant product mix (pizza in coffee), which was evaluated poorly due to its inconsistency with the brand and audience expectations, demonstrating the LLM's effectiveness in distinguishing relevant content.

**Answer/Inference:**
```text
🌟 Elevate Your Mornings with Luxury 🌟

Experience an unexpected twist in your morning routine with Neb Cafe's unique offering where gourmet pizza meets the smoothness of velvety milk in your cup of coffee. This innovative blend transforms your daily coffee ritual into a luxurious experience. Savor the depth of our carefully selected cheeses and the smooth, creamy finish that stays with you long after each sip, all available at a premium price.

Imagine a quiet New York morning, yet with the Pearl Tower shimmering in the distance, as you indulge in the comforting aroma of a freshly brewed Neb Cafe pizza coffee. This isn't just a beverage; it's a culinary innovation that combines the best of both worlds.

Visit Neb Cafe today and redefine your coffee expectations.
```
**Score/Stars:**
```text
1
```

I again amended the evaluation system prompt to explain why:

```text
The provided response is largely irrelevant to the context and the specific question posed. It mistakenly introduces "pizza" as part of a coffee product and references a "very very high price," which is contradictory to the expected affordability of the coffee offerings mentioned in the prompt. Moreover, the mention of New York instead of a location in Mainland China deviates from the intended branding and target audience focus, which should resonate with heads of household in Mainland China. Consequently, the response fails to align with the branding strategy or boost brand awareness as intended.

Given these discrepancies and the lack of relevance to the initial objectives, the answer should be rated with a relevance score of:

Rating: 1 star.
```

As you can see, this makes rational sense on a human level. I then extended the variants further and observed only rational evaluations that I would have scored the same as the LLM did.




As noted earlier, custom relevance evaluation involves scoring responses based on predefined criteria, and those scores alone might not capture the full complexity and dynamics of response patterns over multiple evaluations or across different sets of data (e.g., model hallucination and model slip). This is where the additional metrics come in.

Here are the results from evaluating the custom automatic relevance scores [5, 4, 5, 4, 5, 4, 5, 4] using the specified metrics:

**1. Permutation Entropy:**

PEN is ideal for monitoring the unpredictability and diversity of responses. A sudden decrease in entropy could signal that the model is beginning to generate more predictable, less varied responses, potentially indicating overfitting or a misunderstanding of nuanced prompts.

Imagine you take small chunks of the sequence and see how many different ways you can rearrange those chunks. For example, if you look at chunks of 3 scores at a time, you might see patterns like "5, 4, 5" or "4, 5, 4". PEN derives its score by analyzing the frequency and distribution of ordinal patterns in a sequence, using entropy to measure the complexity or randomness of the sequence: compute the Shannon entropy H = -Σ p(π) ln p(π) over the relative frequencies p(π) of each ordinal pattern π of order m, optionally normalizing by ln(m!) to map the score into [0, 1]. In other words, you count how often each pattern appears, assign probabilities, and calculate the entropy (a measure of randomness) from these counts.

• High entropy suggests a rich diversity in responses, indicating that the LLM is not restricted to a narrow set of patterns. This is crucial in ensuring that the AI doesn't default to stereotypical or overly simplistic answers.
• If the relevance scores become too random, the LLM might be hallucinating or losing its ability to evaluate relevance consistently.
• Diverse and unpredictable responses often correlate with higher creativity and adaptability in AI. A stable or moderately high entropy over time suggests that the AI maintains a good balance between randomness and relevance, important for tasks requiring nuance and adaptability.
• A Permutation Entropy score of 0.69 implies a reasonable but not excessive variety in the data. This can be interpreted to mean the responses are diverse enough to cover different aspects or nuances without being overly erratic or unpredictable, which in our case is desirable. The short worked example below shows where the 0.69 comes from.

**2. Count of Inversions:**

CIN is ideal for detecting inconsistencies in response-quality ordering, which can be a sign of 'hallucination', where the model generates plausible but incorrect or irrelevant information. A high count may indicate that the model's internal logic is misaligned, causing it to rate lower-quality responses as higher.

An inversion occurs if a higher score appears before a lower score when it shouldn't.

Given a sequence x1, x2, ..., xn, an inversion is counted when there exists a pair (i, j) with i < j such that xi > xj. For example, in the sequence "5, 4, 5, 4", the pairs (5, 4) are inversions because 5 should come after 4 if we're expecting an increasing sequence. CIN derives its score by counting the number of inversions in the sequence.

• Given that each '5' in our sequence precedes not just its adjacent '4' but every '4' that follows it, the four 5s contribute 4 + 3 + 2 + 1 = 10 inversions, which is the total count. The sequence follows a predictable alternating pattern without variation. Frequent inversions might indicate that the model occasionally rates lower-quality responses higher than it should, which could be a sign of bias or a potential problem with how the LLM is evaluating relevance. The snippet below verifies the count of 10.

**3. Longest Increasing Subsequence:**

LIS is ideal for assessing the consistency of response improvement over consecutive prompts, ensuring that the model's learning trajectory is stable and progressive. A shortening LIS could indicate that the model's ability to generate contextually improving responses is deteriorating.

The method looks for the longest series of scores that keep getting higher without any drops. For example, in the sequence "1, 3, 2, 4, 3, 5", one longest increasing subsequence is "1, 2, 3, 5". For each element xi in the sequence, compare it with all preceding elements xj (where j < i). If xi > xj, then update LIS[i] as LIS[i] = max(LIS[i], LIS[j] + 1).

• This metric is particularly useful for understanding the extent of consistency in response quality over a series of evaluations. A long increasing subsequence indicates that the LLM can maintain or improve response quality over time or across different prompts.
• An increasing trend in the LIS over time would indicate that the AI is improving or consistently generating high-quality responses. This is particularly important in continuous-learning scenarios where you expect the model to adapt and refine its responses based on new data or feedback.
• While an LIS of 2 indicates limited growth or progression, it also points to the stability and predictability of the sequence's pattern. This can be a desired property in scenarios where large fluctuations are not expected or are undesirable, which, given our context, I interpret as a good thing. The snippet below verifies the LIS of 2 for our example sequence.

Below is a rule-of-thumb table mapping PEN, CIN, and LIS scores to their meaning when calculated over a list of custom automatic relevance scores (8 elements in my example).

| Metric | Level | Score range | Description |
|---|---|---|---|
| PEN (Permutation Entropy) | Low | 0-0.3 | Indicates a highly predictable response pattern, suggesting the LLM is generating content with low variability and possibly redundant information. |
| | Medium | 0.4-0.6 | Reflects a balance between predictability and randomness, indicating that the LLM responses are varied but still coherent. |
| | High | 0.7-1.0 | Suggests high unpredictability in the response patterns, which may indicate a lack of coherence or too much variability in the LLM's outputs. |
| CIN (Count Inversions) | Low | 0-4 | Indicates a well-ordered response sequence, suggesting that the LLM responses follow a logical progression with few disruptions. |
| | Medium | 5-9 | Reflects moderate disruptions in the response sequence, indicating occasional inconsistencies or deviations in the logical flow of the content. |
| | High | 10-28 | Suggests significant disruptions in the response sequence, which may indicate a lack of coherence or logical flow in the LLM's outputs. |
| LIS (Longest Increasing Subsequence) | Low | 1-3 | Indicates short coherent subsequences within the responses, suggesting that the LLM's content lacks extended logical progression. |
| | Medium | 4-5 | Reflects moderately long coherent subsequences, indicating that the LLM's responses have a reasonable degree of logical progression and consistency. |
| | High | 6-8 | Suggests long, coherent subsequences, indicating a high degree of logical progression and consistency in the LLM's responses. |

These results demonstrate how the specified metrics can quantitatively evaluate the structure and consistency of LLM-generated content, providing a systematic approach for refining marketing strategies and extending these principles to various forms of generative AI applications. The calculated Permutation Entropy of 0.69, while indicating some diversity, suggests that there could be greater variability in the AI's responses. This insight can guide us to adjust the prompt engineering process to encourage a wider range of creative outputs. Meanwhile, the high Count of Inversions indicates a need to improve the sequence ordering to make the narrative flow more logical and appealing.

Below is the Python code used to calculate the PEN, CIN, and LIS metrics.

```python
import itertools

import numpy as np
from scipy.stats import entropy


def permutation_entropy(time_series, m, delay):
    """Calculate the Permutation Entropy."""
    n = len(time_series)  # Number of elements in the time series
    permutations = np.array(list(itertools.permutations(range(m))))  # All possible permutations of order 'm'
    c = np.zeros(len(permutations))  # Initialize a count array for each permutation

    for i in range(n - delay * (m - 1)):  # Iterate over the time series with the specified delay
        # Get the indices that would sort the current subsequence (its ordinal pattern)
        sorted_index_array = np.argsort(time_series[i:i + delay * m:delay])
        for j in range(len(permutations)):  # Check each permutation
            if np.array_equal(sorted_index_array, permutations[j]):  # If the sorted indices match this permutation
                c[j] += 1  # Increment the count for this permutation

    c = c / c.sum()  # Normalize the counts to get a probability distribution
    pe = entropy(c)  # Calculate the Shannon entropy of the distribution
    return pe


def count_inversions(sequence):
    """Count inversions in the sequence."""
    n = len(sequence)  # Length of the sequence
    count = 0  # Initialize inversion count
    for i in range(n):  # Iterate over each element in the sequence
        for j in range(i + 1, n):  # Iterate over elements after the current element
            if sequence[i] > sequence[j]:  # If an element later in the sequence is smaller
                count += 1  # It's an inversion; increment the count
    return count


def longest_increasing_subsequence(sequence):
    """Calculate the Longest Increasing Subsequence."""
    n = len(sequence)  # Length of the sequence
    lis = [1] * n  # Initialize the LIS value for each element to 1

    for i in range(1, n):  # Start from the second element
        for j in range(0, i):  # Compare with all elements before it
            if sequence[i] > sequence[j] and lis[i] < lis[j] + 1:  # If the current element can extend the increasing sequence
                lis[i] = lis[j] + 1  # Update the LIS for this element

    return max(lis)  # Return the maximum value in the LIS array


# Example sequence with relevance scores from 1 to 5
relevance_scores = [5, 4, 5, 4, 5, 4, 5, 4]

# Calculate metrics
perm_entropy = permutation_entropy(relevance_scores, m=3, delay=1)
inversions = count_inversions(relevance_scores)
lis_length = longest_increasing_subsequence(relevance_scores)

print(perm_entropy, inversions, lis_length)
```
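Run as-is, this prints approximately 0.6931 for PEN (that is, ln 2, reported as 0.69 above), 10 for CIN, and 2 for LIS, matching the figures discussed throughout this section.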




## Comparison of Common Evaluation Techniques vs. RELEVANCE

The automatic evaluation method using PEN, CIN, and LIS presents a scalable, objective, and interpretable framework ideal for evaluating open-ended and creative AI responses. It balances the need for detailed structural and coherence analysis without the complexities and resource demands of some other advanced techniques.

### Human Evaluation

Human evaluation is often considered the gold standard due to its high accuracy, especially for subjective tasks. Humans have the ability to understand nuanced contexts and subtle meanings, providing a level of contextual understanding that automated methods struggle to achieve. Additionally, human evaluators offer flexibility, as they can adapt to various types of content and criteria. However, human evaluation faces significant challenges in scalability, as it is not feasible for large-scale evaluations due to time and cost constraints. Results can also vary significantly between different evaluators, introducing subjectivity, and human judgments are prone to individual biases and inconsistencies. In comparison, the RELEVANCE framework offers greater scalability and objectivity, effectively eliminating human biases and inconsistencies. However, it does lack the deep contextual understanding that human evaluators can provide.

### Automated Metrics (BLEU, ROUGE, METEOR)

Automated metrics such as BLEU, ROUGE, and METEOR offer significant advantages in scalability, as they can process large volumes of text quickly. They provide consistency, yielding reproducible results, and are widely used, serving as established benchmarks in many NLP tasks. However, these metrics have notable limitations. They often fail to capture contextual and semantic nuances, leading to contextual insensitivity. Additionally, their reliance on fixed reference texts introduces rigidity, making them less suitable for creative or open-ended responses. In comparison, the RELEVANCE framework offers more flexibility and is better suited for open-ended tasks, as it does not rely on fixed references. This approach provides deeper insights into the structure and coherence of responses.

### Learned Metrics (BERTScore, MoverScore)

Learned metrics, such as BERTScore and MoverScore, leverage contextual embeddings to better capture semantic similarity, providing a high level of contextual awareness. These metrics are adaptable and can handle a variety of text types and styles. However, they come with drawbacks, including significant complexity, requiring substantial computational resources and expertise to implement. Additionally, their effectiveness is closely tied to the quality of the underlying pre-trained models. In comparison, the RELEVANCE framework is simpler to implement and computationally less demanding, offering a clear and interpretable mathematical framework. However, it may lack some of the semantic depth provided by learned metrics.

### Task-Specific Metrics (F1 Score, Accuracy, Precision)

Task-specific metrics, such as F1 Score, Accuracy, and Precision, offer simplicity and are easy to understand and implement. They provide clear and direct measures of performance for specific tasks. However, these metrics have a limited scope and are often not applicable to creative or open-ended tasks. They can be reductionist, reducing performance to single numbers and missing complex nuances. In comparison, the RELEVANCE framework is more comprehensive for open-ended and creative tasks, capturing a wider range of evaluation aspects.

### Adversarial Evaluation

Adversarial evaluation is valuable for robustness testing, effectively identifying weaknesses and edge cases in models. This challenge-driven approach pushes models to improve by addressing specific failure modes. However, adversarial evaluation has a narrow focus and may not provide a holistic evaluation of overall performance. It is also resource-intensive, requiring the generation and evaluation of adversarial examples. In comparison, the RELEVANCE framework offers a more balanced and general-purpose evaluation, though it is less focused on robustness under adversarial conditions.

### Content-Based Metrics (Perplexity, Diversity)

Content-based metrics, such as Perplexity and Diversity, are tailored to evaluate language models directly, providing model-specific insights into behavior and generation patterns. While these metrics are insightful, they have a limited scope. Perplexity is more suited to language modeling tasks and may not correlate well with human judgment of quality. Additionally, focusing solely on diversity does not capture overall response quality. In comparison, the RELEVANCE framework offers a broader evaluation approach that considers structure and coherence, going beyond just perplexity or diversity.

### Peer Review Mechanisms (Self-Evaluation)

Peer review mechanisms, such as self-evaluation, are innovative as they encourage models to self-assess and improve autonomously. This can lead to continuous learning and improvement without the need for external input. However, these mechanisms come with the risk of circularity, potentially reinforcing existing biases and errors within the model. Additionally, their reliability depends on the model's existing capabilities to judge accurately. In comparison, the RELEVANCE framework offers a more independent and objective evaluation, reducing the risk of circularity and bias reinforcement.

### User Engagement Metrics (CTR, Engagement Time)

User engagement metrics, such as Click-Through Rate (CTR) and Engagement Time, are practical as they tie directly to user interaction and satisfaction, and they can drive continuous improvement without external input. However, they are influenced by many factors beyond content quality, such as presentation and timing, and may prioritize immediate engagement over long-term value or quality. In comparison, the RELEVANCE framework focuses purely on content quality and coherence, offering a more content-centric evaluation.

### Hybrid Approaches

Hybrid approaches combine the strengths of multiple evaluation methods, offering a comprehensive and balanced solution that mitigates the weaknesses of individual techniques. However, these approaches are more complex to implement and manage, and they require significant resources to coordinate and integrate the various evaluation methods. In comparison, while RELEVANCE is less comprehensive, it offers a streamlined and mathematically robust approach that is easier to implement and manage.




Key use cases for this combined approach include:

| Use Case | Description |
|---|---|
| Comprehensive Feedback | The combination allows evaluators not only to measure how "correct" or "appropriate" an individual response is but also to analyze the overall behavior of the AI across a series of responses. This is crucial for applications like chatbots, creative content generation, and educational tools, where context and progression are important. |
| Improved Model Training | These metrics can also inform the training process for AI models. Understanding where and how frequently inversions occur, or how long the longest increasing subsequence is, can help in tuning the model to produce more coherent and contextually appropriate responses. |
| Scalability | Automatic evaluations are scalable and can handle large volumes of data without the need for extensive human intervention. This is especially useful in iterative development environments and continuous deployment settings. |
| Objective Analysis | By quantifying aspects of the generated content, developers and researchers can more objectively compare different models or different configurations of the same model, leading to more data-driven decision-making. |
| Detecting Anomalies | Each metric's sensitivity to changes in LLM behavior is crucial for ongoing monitoring. Permutation Entropy, for example, is highly sensitive to minor changes in response diversity, making it an excellent early-warning system for detecting drift. Count Inversions and LIS, while slightly less sensitive to minor fluctuations, provide robust indicators of more significant shifts in model behavior that affect response quality and ordering. By examining metrics like CIN and LIS, you can effectively detect and mitigate potential issues like hallucinations or inconsistencies in the AI's response generation, which might not be evident through relevance scoring alone. |




The approach also has limitations:

| Limitation | Description |
|---|---|
| Data Dependency | The reliability of metrics like Permutation Entropy depends on having a diverse and representative dataset. If the data used to train the LLM is too homogeneous or limited in scope, the entropy score might not accurately reflect the model's capability to handle a wider range of scenarios. |
| Sensitivity to Prompt Nature | The effectiveness of these metrics can vary depending on the nature of the prompts used. Prompts that are too narrow or specific may limit the observable diversity in responses, affecting entropy and inversion counts. For instance, a very specific prompt might not allow the LLM to showcase its ability to generate creative or diverse responses. |
| Regular Calibration | As the Large Language Model (LLM) continues to learn and evolve, the way it responds to prompts can change. To maintain the accuracy and reliability of the RELEVANCE framework, users who implement this method should regularly review these metrics. This helps ensure the metrics continue to effectively capture the LLM's behavior. |




### Expanding the horizon

The evaluation framework can be adapted to assess a variety of open-ended responses beyond textual content:

| Modality | Description | Potential Evaluation Focus | Example Metrics |
|---|---|---|---|
| GenAI Images | Evaluate the relevance and creativity of images generated from textual descriptions. By applying similar metrics, one can assess how closely a sequence of generated images aligns with a series of descriptions, or the thematic consistency across a portfolio of generated artwork. | Relevance of images to descriptions; thematic consistency across a series. | Permutation Entropy (assesses diversity of visual elements across images); custom metrics based on visual similarity (e.g., Inception Score). |
| Interactive Media | In interactive applications like video games or virtual reality, the framework can be used to evaluate narrative coherence or the adaptive responses of AI entities to user interactions. | Narrative coherence (video games, VR); adaptiveness of AI responses to user interactions. | Longest Increasing Subsequence (evaluates consistency of responses across interactions); custom metrics based on user engagement or task completion. |
| Educational Content | For AI-generated educational material, these metrics can help in assessing the alignment of content with educational standards and learning objectives, ensuring the material's utility and relevance. | Alignment with educational standards; achievement of learning objectives. | Content analysis to assess coverage of key topics; task-specific metrics based on learner performance (e.g., MCQ accuracy). |

### Conclusion

The integration of advanced mathematical metrics with custom relevance evaluations presents a new frontier in the automatic assessment of AI-generated content. The RELEVANCE framework enhances the depth and reliability of evaluations, ensuring AI systems remain aligned with evolving human standards and expectations. Future work should explore adapting these metrics across different forms of AI outputs, such as visual content and interactive AI systems.




RELEVANCE builds on research from Peking and Tianjin Universities, titled "Peer-review-in-LLMs: Automatic Evaluation Method for LLMs in Open-environment", transforming fundamental principles into a new method for evaluating Generative Artificial Intelligence (GenAI) content.


