Custom relevance evaluation involves scoring responses against predefined criteria. While these scores provide a direct assessment, they may not capture the full complexity and dynamics of response patterns across multiple evaluations or different data sets (e.g., model hallucination and model slip). This is where the additional metrics come in.
Here are the results from evaluating the given custom automatic relevance scores [5, 4, 5, 4, 5, 4, 5, 4] using the specified metrics:
1. Permutation Entropy:
PEN is ideal for monitoring the unpredictability and diversity of responses. A sudden decrease in entropy could signal that the model is beginning to generate more predictable, less varied responses, potentially indicating overfitting or a misunderstanding of nuanced prompts.
Imagine you take small chunks of the sequence and see how many different ways you can rearrange those chunks. For example, if you look at chunks of 3 scores at a time, you might see patterns like "5, 4, 5" or "4, 5, 4". PEN derives its score by analyzing the frequency and distribution of these ordinal patterns in a sequence, using entropy to measure the complexity or randomness of the sequence (optionally normalized to a 0-1 scale). You count how often each pattern appears, convert those counts into probabilities, and calculate the entropy (a measure of randomness) from them.
- High entropy suggests a rich diversity in responses, indicating that the LLM is not restricted to a narrow set of patterns. This is crucial in ensuring that the AI doesn’t default to stereotypical or overly simplistic answers.
- If the relevance scores become too random, it means the LLM might be hallucinating or losing its ability to evaluate relevance consistently.
- Diverse and unpredictable responses often correlate with higher creativity and adaptability in AI. A stable or moderately high entropy over time suggests that the AI maintains a good balance between randomness and relevance, important for tasks requiring nuance and adaptability.
- A Permutation Entropy score of 0.69 implies a reasonable but not excessive variety in the data. It can be read as the responses being diverse enough to cover different aspects or nuances without being overly erratic or unpredictable, which in our case is desirable (see the worked example after this list).
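As a quick sanity check, here is a minimal sketch that reproduces the 0.69 value by hand. It assumes the same settings as the code at the end of the post (window length m=3, delay=1, and SciPy's default natural-log entropy); for the alternating sequence, every window of three scores is either "5, 4, 5" or "4, 5, 4", so only two ordinal patterns occur, each in 3 of the 6 windows.
import numpy as np
# Two ordinal patterns, each appearing 3 times across the 6 windows of length 3
probs = np.array([3 / 6, 3 / 6])
pen = -(probs * np.log(probs)).sum()  # Shannon entropy with the natural log
print(round(pen, 2))  # 0.69 (exactly ln 2 ≈ 0.693)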
2. Count of Inversions:
CIN is ideal for detecting inconsistencies in response quality ordering, which can be a sign of ‘hallucination’ where the model generates plausible but incorrect or irrelevant information. A high count may indicate that the model’s internal logic is misaligned, causing it to rate lower-quality responses as higher.
An inversion occurs when a higher score appears before a lower one. Formally, given a sequence x1, x2, ..., xn, an inversion is counted for every pair (i, j) with i < j such that xi > xj. For example, in the sequence "5, 4, 5, 4", each 5 that precedes a later 4 forms an inversion, because the 5 would have to come after the 4 in an increasing sequence. CIN derives its score by counting the number of such inversions in the sequence.
- In our sequence, each ‘5’ forms an inversion not only with the ‘4’ that directly follows it but with every ‘4’ that comes later: the first 5 is followed by four 4s, the second by three, and so on, giving 4 + 3 + 2 + 1 = 10 inversions. A total count of 10 therefore reflects the predictable alternating pattern repeating without variation rather than isolated disruptions (see the sketch after this item). Frequent inversions might indicate that the model occasionally rates lower-quality responses higher than it should, which could be a sign of bias or a problem with how the LLM is evaluating relevance.
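To make the count of 10 concrete, the small sketch below enumerates every inverted pair using the same brute-force pairwise check as the count_inversions function later in the post:
seq = [5, 4, 5, 4, 5, 4, 5, 4]
# Every pair of positions (i, j) with i < j where the earlier score is larger than the later one
inverted_pairs = [(i, j) for i in range(len(seq)) for j in range(i + 1, len(seq)) if seq[i] > seq[j]]
print(len(inverted_pairs))  # 10
print(inverted_pairs)       # each 5 (positions 0, 2, 4, 6) paired with every later 4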
3. Longest Increasing Subsequence:
LIS is ideal for assessing the consistency of response improvement over consecutive prompts, ensuring that the model's learning trajectory is stable and progressive. A shortening LIS could indicate that the model's ability to generate contextually improving responses is deteriorating.
The method looks for the longest series of scores that keep getting higher without any drops. For example, in the sequence "1, 3, 2, 4, 3, 5", one longest increasing subsequence is "1, 2, 3, 5". The standard dynamic-programming approach compares each element xi with all preceding elements xj (where j < i); if xi > xj, it updates LIS[i] as LIS[i] = max(LIS[i], LIS[j] + 1).
- This metric is particularly useful to understand the extent of consistency in response quality over a series of evaluations. A long increasing subsequence would indicate that the LLM can maintain or improve response quality over time or across different prompts.
- An increasing trend in the LIS over time would indicate that the AI is improving or consistently generating high-quality responses. This is particularly important in continuous learning scenarios where you expect the model to adapt and refine its responses based on new data or feedback.
- While an LIS of 2 indicates limited growth or progression, it also points to the stability and predictability of the sequence's pattern. This can be a desired property in scenarios where large fluctuations are not expected or are undesirable, which, given our context, I interpret as a good thing (see the worked example after this list).
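As an illustration, running the same dynamic-programming recurrence on our scores (a sketch mirroring the longest_increasing_subsequence function shown later) shows why the LIS is 2: the only increasing step the sequence ever offers is a 4 followed by a 5.
seq = [5, 4, 5, 4, 5, 4, 5, 4]
lis = [1] * len(seq)  # length of the best increasing subsequence ending at each position
for i in range(1, len(seq)):
    for j in range(i):
        if seq[i] > seq[j]:  # a 5 preceded by an earlier 4 can extend a subsequence
            lis[i] = max(lis[i], lis[j] + 1)
print(lis)       # [1, 1, 2, 1, 2, 1, 2, 1]
print(max(lis))  # 2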
Below is a rule-of-thumb table for mapping PEN, CIN, and LIS values to their meaning when they are calculated over a list of custom automatic relevance scores (8 elements in my example). A small helper for applying these bands programmatically follows the table.
| Metric | Level | Score Range | Description |
| --- | --- | --- | --- |
| PEN (Permutation Entropy) | Low | 0-0.3 | Indicates a highly predictable response pattern, suggesting the LLM is generating content with low variability and possibly redundant information. |
| | Medium | 0.4-0.6 | Reflects a balance between predictability and randomness, indicating that the LLM responses are varied but still coherent. |
| | High | 0.7-1.0 | Suggests high unpredictability in the response patterns, which may indicate a lack of coherence or too much variability in the LLM's outputs. |
| CIN (Count of Inversions) | Low | 0-4 | Indicates a well-ordered response sequence, suggesting that the LLM responses follow a logical progression with few disruptions. |
| | Medium | 5-9 | Reflects moderate disruptions in the response sequence, indicating occasional inconsistencies or deviations in the logical flow of the content. |
| | High | 10-28 | Suggests significant disruptions in the response sequence, which may indicate a lack of coherence or logical flow in the LLM's outputs. |
| LIS (Longest Increasing Subsequence) | Low | 1-3 | Indicates short, coherent subsequences within the responses, suggesting that the LLM's content lacks extended logical progression. |
| | Medium | 4-5 | Reflects moderately long coherent subsequences, indicating that the LLM's responses have a reasonable degree of logical progression and consistency. |
| | High | 6-8 | Suggests long, coherent subsequences, indicating a high degree of logical progression and consistency in the LLM's responses. |
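The helper below is a minimal, illustrative sketch (the function name interpret_metric and its structure are my own, not part of the metrics code) that maps a raw value to the Low/Medium/High bands from the table:
def interpret_metric(metric, value):
    """Map a raw PEN, CIN, or LIS value to the Low/Medium/High bands of the rule-of-thumb table (illustrative)."""
    bands = {
        "PEN": [(0.3, "Low"), (0.6, "Medium"), (1.0, "High")],
        "CIN": [(4, "Low"), (9, "Medium"), (28, "High")],
        "LIS": [(3, "Low"), (5, "Medium"), (8, "High")],
    }
    for upper_bound, label in bands[metric]:
        if value <= upper_bound:
            return label
    return "High"  # treat anything above the listed maximum as High
print(interpret_metric("CIN", 10), interpret_metric("LIS", 2))  # High Low
With these bands, the example's Count of Inversions of 10 lands in the High band and its LIS of 2 in the Low band, consistent with the interpretation below.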
These results demonstrate how the specified metrics can quantitatively evaluate the structure and consistency of LLM-generated content, providing a systematic approach for refining marketing strategies and extending these principles to various forms of generative AI applications. The calculated Permutation Entropy of 0.69, while indicating some diversity, suggests that there could be greater variability in the AI’s responses. This insight can guide us to adjust the prompt engineering process to encourage a wider range of creative outputs. Meanwhile, the high Count of Inversions indicates a need to improve the sequence ordering to make the narrative flow more logical and appealing.
Below is the Python code used to calculate the PEN, CIN, and LIS metrics, followed by a short usage example.
import numpy as np
from scipy.stats import entropy
import itertools
def permutation_entropy(time_series, m, delay):
    """Calculate the Permutation Entropy."""
    n = len(time_series)  # Number of elements in the time series
    permutations = np.array(list(itertools.permutations(range(m))))  # All possible permutations of order 'm'
    c = np.zeros(len(permutations))  # Initialize a count array for each permutation
    for i in range(n - delay * (m - 1)):  # Iterate over the time series with the specified delay
        # Sorted time series pattern index
        sorted_index_array = np.argsort(time_series[i:i + delay * m:delay])  # Get the indices that would sort the subsequence
        for j in range(len(permutations)):  # Check each permutation
            if np.array_equal(sorted_index_array, permutations[j]):  # If the sorted indices match a permutation
                c[j] += 1  # Increment the count for this permutation
    c /= c.sum()  # Normalize the counts to get a probability distribution
    pe = entropy(c)  # Calculate the Shannon entropy of the distribution
    return pe
def count_inversions(sequence):
    """Count inversions in the sequence."""
    n = len(sequence)  # Length of the sequence
    count = 0  # Initialize inversion count
    for i in range(n):  # Iterate over each element in the sequence
        for j in range(i + 1, n):  # Iterate over elements after the current element
            if sequence[i] > sequence[j]:  # If an element later in the sequence is smaller
                count += 1  # It's an inversion, increment the count
    return count
def longest_increasing_subsequence(sequence):
    """Calculate the Longest Increasing Subsequence."""
    n = len(sequence)  # Length of the sequence
    lis = [1] * n  # Initialize LIS value for each element to 1
    for i in range(1, n):  # Start from the second element
        for j in range(0, i):  # Compare with all elements before it
            if sequence[i] > sequence[j] and lis[i] < lis[j] + 1:  # If the current element can extend the increasing sequence
                lis[i] = lis[j] + 1  # Update the LIS for this element
    return max(lis)  # Return the maximum value in LIS array
# Example sequence with relevance scores from 1 to 5
relevance_scores = [5, 4, 5, 4, 5, 4, 5, 4]
# Calculate metrics
perm_entropy = permutation_entropy(relevance_scores, m=3, delay=1)
inversions = count_inversions(relevance_scores)
lis_length = longest_increasing_subsequence(relevance_scores)
perm_entropy, inversions, lis_length
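Run as a script rather than in a notebook, a print statement makes the results visible; on the example sequence it should report roughly the values discussed above.
print(f"PEN={perm_entropy:.2f}, CIN={inversions}, LIS={lis_length}")  # PEN=0.69, CIN=10, LIS=2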