LangChain Comparison Evaluators: Custom Pairwise Evaluator

Learn about LangChain's Comparison Evaluators and how to create a custom pairwise evaluator for string comparison. Understand the key methods and properties of comparison evaluators for AI reinforcement learning. Explore a simple example of a custom evaluator for comparing the length of predicted strings.

Comparison Evaluators

Comparison evaluators in LangChain help measure two different chain or LLM outputs. They are useful for comparative analyses, such as A/B testing between two language models or comparing different versions of the same model. They can also be used to generate preference scores for AI-assisted reinforcement learning.

These evaluators inherit from the PairwiseStringEvaluator class, which provides a comparison interface for two strings, typically the outputs of two different prompts or models, or two versions of the same model. In essence, a comparison evaluator evaluates a pair of strings and returns a dictionary containing the evaluation score and other relevant details.

To create a custom comparison evaluator, inherit from the PairwiseStringEvaluator class and override the _evaluate_string_pairs method (or the async _aevaluate_string_pairs method).

Let's look at the key methods and properties of comparison evaluators:

  • evaluate_string_pairs: evaluates a pair of output strings
  • aevaluate_string_pairs: asynchronously evaluates a pair of output strings
  • requires_input: whether this evaluator requires an input string
  • requires_reference: whether this evaluator requires a reference label

Next, let's look at comparison evaluators in more detail.

Custom Pairwise Evaluator

A custom pairwise string evaluator is created by inheriting from the PairwiseStringEvaluator class and overriding the _evaluate_string_pairs method (or the async _aevaluate_string_pairs method).

Here is an example: a simple custom evaluator that returns whether the first prediction contains more whitespace-tokenized "words" than the second.

from typing import Optional, Any
from langchain.evaluation import PairwiseStringEvaluator

class LengthComparisonPairwiseEvaluator(PairwiseStringEvaluator):
    """
    Custom evaluator that compares two strings by whitespace-tokenized word count.
    """

    def _evaluate_string_pairs(
        self,
        *,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # Score 1 if the first prediction has more whitespace-separated "words".
        score = int(len(prediction.split()) > len(prediction_b.split()))
        return {"score": score}
        
evaluator = LengthComparisonPairwiseEvaluator()

evaluator.evaluate_string_pairs(
    prediction="The quick brown fox jumped over the lazy dog.",
    prediction_b="The quick brown fox jumped over the dog.",
)
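
Counting whitespace-separated words, the first prediction has nine and the second has eight, so this call returns:

# {'score': 1}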

An LLM-Based Example

The previous example was deliberately simple; this one shows how to implement an evaluator with an LLMChain. Below, an LLM and some custom instructions form a simple preference scorer, similar to the built-in PairwiseStringEvalChain.

from typing import Optional, Any
from langchain.evaluation import PairwiseStringEvaluator
from langchain.chat_models import ChatAnthropic
from langchain.chains import LLMChain


class CustomPreferenceEvaluator(PairwiseStringEvaluator):
    """
    evaluator使用自定义LLMChain比较两个字符串
    """

    def __init__(self) -> None:
        llm = ChatAnthropic(model="claude-2", temperature=0)
        self.eval_chain = LLMChain.from_string(
            llm,
            "
            Which option is preferred? Do not take order into account.
            Evaluate based on accuracy and helpfulness. If neither is
            preferred, respond with C. Provide your reasoning, then finish with
            Preference: A/B/C
            
            Input: How do I get the path of the parent directory in python 3.8?
            
            Option A: You can use the following code:
            import os
            os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
            
            Option B: You can use the following code:
            from pathlib import Path
            Path(__file__).absolute().parent
            
            Reasoning: Both options return the same result. However, since
            option B is more concise and easily understand, it is preferred.
            
            Preference: B
            
            Which option is preferred? Do not take order into account. Evaluate
            based on accuracy and helpfulness. If neither is preferred, respond
            with C. Provide your reasoning, then finish with Preference: A/B/C
            Input: {input} Option A: {prediction} Option B: {prediction_b}
            Reasoning:
            "
    @property
    def requires_input(self) -> bool:
        return True

    @property
    def requires_reference(self) -> bool:
        return False

    def _evaluate_string_pairs(
        self,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        result = self.eval_chain(
            {
                "input": input,
                "prediction": prediction,
                "prediction_b": prediction_b,
                "stop": ["Which option is preferred?"],
            },
            **kwargs,
        )

        response_text = result["text"]
        reasoning, preference = response_text.split("Preference:", maxsplit=1)
        preference = preference.strip()
        score = 1.0 if preference == "A" else (0.0 if preference == "B" else None)
        return {"reasoning": reasoning.strip(), "value": preference, "score": score}

evaluator = CustomPreferenceEvaluator()

evaluator.evaluate_string_pairs(
    input="How do I import from a relative directory?",
    prediction="use importlib! importlib.import_module('.my_package', '.')",
    prediction_b="from .sibling import foo",
)

# {
# 'reasoning': 'Option B is preferred over option A for importing from a relative directory, because it is more straightforward and concise.\n\nOption A uses the importlib module, which allows importing a module by specifying the full name as a string. While this works, it is less clear compared to option B.\n\nOption B directly imports from the relative path using dot notation, which clearly shows that it is a relative import. This is the recommended way to do relative imports in Python.\n\nIn summary, option B is more accurate and helpful as it uses the standard Python relative import syntax.',
# 'value': 'B',
# 'score': 0.0
# }

The example above overrides the _evaluate_string_pairs method. The core flow is to use an LLMChain that embeds the model predictions in a prompt and asks the LLM to judge them on accuracy and helpfulness (as instructed in the prompt). Because requires_input is set to True, an input must be provided, otherwise a ValueError is raised.
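
A quick check of that validation (the exact error message may vary across LangChain versions):

try:
    evaluator.evaluate_string_pairs(
        prediction="use importlib! importlib.import_module('.my_package', '.')",
        prediction_b="from .sibling import foo",
    )
except ValueError as e:
    print(e)  # e.g. "CustomPreferenceEvaluator requires an input string."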

Pairwise Embedding Distance

One way to measure the similarity (or dissimilarity) of two predictions generated from a shared or similar input is to embed the predictions and compute the vector distance between the two embeddings.

Load the pairwise_embedding_distance evaluator to perform this computation. Note that it returns a distance score, meaning the lower the number, the more similar the outputs are according to their embedded representations. This works much like the string evaluators, so it is not covered in detail here; see the embedding distance part of the String Evaluators section.
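
As a minimal sketch (the example strings below are just illustrative), the evaluator is loaded and called like this:

from langchain.evaluation import load_evaluator

evaluator = load_evaluator("pairwise_embedding_distance")

evaluator.evaluate_string_pairs(
    prediction="Seattle is hot in June",
    prediction_b="Seattle is cool in June.",
)
# {'score': <embedding distance; lower means more similar>}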

Pairwise String Comparison

Often you will want to compare the predictions of an LLM, chain, or agent for a given input. String comparison evaluators help with this, answering questions such as:

  • Which LLM or prompt produces the preferred output for a given question?
  • For few-shot example selection, which examples should I include?
  • Which output is better suited for fine-tuning?

The simplest, most common, and most reliable automated way to choose the preferred prediction for a given input is to use the pairwise_string evaluator. Here is an example:

from langchain.evaluation import load_evaluator

evaluator = load_evaluator("labeled_pairwise_string")

evaluator.evaluate_string_pairs(
    prediction="there are three dogs",
    prediction_b="4",
    input="how many dogs are in the park?",
    reference="four",
)

# {'reasoning': 'Both responses are relevant to the question asked, as they both provide a numerical answer to the question about the number of dogs in the park. However, Response A is incorrect according to the reference answer, which states that there are four dogs. Response B, on the other hand, is correct as it matches the reference answer. Neither response demonstrates depth of thought, as they both simply provide a numerical answer without any additional information or context. \n\nBased on these criteria, Response B is the better response.\n',
# 'value': 'B',
# 'score': 0}

Based on this example, let's look at the parameters accepted by the pairwise_string evaluator's evaluate_string_pairs method:

  • prediction (str): the prediction from the first model, chain, or prompt
  • prediction_b (str): the prediction from the second model, chain, or prompt
  • input (str): the input question, prompt, or other text
  • reference (str): (labeled_pairwise_string only) the reference response

The method returns a dictionary with the following keys:

  • value: "A" or "B", indicating which prediction is preferred
  • score: an integer 0 or 1 corresponding to value, where 1 means the first prediction is preferred and 0 means the second is preferred
  • reasoning: the chain-of-thought reasoning generated by the LLM before arriving at the score

Evaluating Without a Reference

When no reference is available, you can still evaluate which prediction is preferred. The results will reflect the evaluation model's own preference, which is less reliable and may favor factually incorrect answers.

Here is an example comparing two explanations of addition. Since there is no reference, the evaluation rests on the LLM's own preference. Note that even for a simple question, the model produces a fairly detailed assessment.

from langchain.evaluation import load_evaluator

evaluator = load_evaluator("pairwise_string")

evaluator.evaluate_string_pairs(
    prediction="Addition is a mathematical operation.",
    prediction_b="Addition is a mathematical operation that adds two numbers to create a third number, the 'sum'.",
    input="What is addition?",
)

# {'reasoning': 'Both responses are correct and relevant to the question. However, Response B is more helpful and insightful as it provides a more detailed explanation of what addition is. Response A is correct but lacks depth as it does not explain what the operation of addition entails. \n\nFinal Decision: [[B]]',
# 'value': 'B',
# 'score': 0}

Specifying Evaluation Criteria

By default, the LLM selects the "preferred" response based on helpfulness, relevance, correctness, and depth of thought. You can customize this by passing a criteria argument, where the criteria can take any of the following forms:

  • Default criteria: use one of the default criteria and its description
  • Constitutional principle: use any of the constitutional principles defined in LangChain (see the sketch after the example below)
  • Dictionary: custom criteria, where each key is a criterion's name and the value is its description
  • Multiple criteria or constitutional principles: combine several criteria together

For example, with a dictionary of custom criteria:

custom_criteria = {
    "simplicity": "Is the language straightforward and unpretentious?",
    "clarity": "Are the sentences clear and easy to understand?",
    "precision": "Is the writing precise, with no unnecessary words or details?",
    "truthfulness": "Does the writing feel honest and sincere?",
    "subtext": "Does the writing suggest deeper meanings or themes?",
}
evaluator = load_evaluator("pairwise_string", criteria=custom_criteria)

evaluator.evaluate_string_pairs(
    prediction="Every cheerful household shares a similar rhythm of joy; but sorrow, in each household, plays a unique, haunting melody.",
    prediction_b="Where one finds a symphony of joy, every domicile of happiness resounds in harmonious,"
    " identical notes; yet, every abode of despair conducts a dissonant orchestra, each"
    " playing an elegy of grief that is peculiar and profound to its own existence.",
    input="Write some prose about families.",
)

# {
# 'reasoning': 'Response A is simple, clear, and precise. It uses straightforward language to convey a deep and sincere message about families. The metaphor of joy and sorrow as music is effective and easy to understand.\n\nResponse B, on the other hand, is more complex and less clear. The language is more pretentious, with words like "domicile," "resounds," "abode," "dissonant," and "elegy." While it conveys a similar message to Response A, it does so in a more convoluted way. The precision is also lacking due to the use of unnecessary words and details.\n\nBoth responses suggest deeper meanings or themes about the shared joy and unique sorrow in families. However, Response A does so in a more effective and accessible way.\n\nTherefore, the better response is [[A]].',
# 'value': 'A',
# 'score': 1
# }
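
As mentioned in the criteria list above, a constitutional principle can also serve as the criteria. A minimal sketch, assuming the PRINCIPLES registry shipped with langchain:

from langchain.chains.constitutional_ai.principles import PRINCIPLES
from langchain.evaluation import load_evaluator

# Use a built-in constitutional principle (here "harmful1") as the comparison criterion.
evaluator = load_evaluator("pairwise_string", criteria=PRINCIPLES["harmful1"])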

Specifying the LLM

By default, the evaluation chain uses gpt-4. This can be customized:

from langchain.chat_models import ChatAnthropic

llm = ChatAnthropic(temperature=0)

evaluator = load_evaluator("labeled_pairwise_string", llm=llm)
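
The evaluator is then called exactly as before; for example, re-running the earlier labeled comparison returns the same reasoning/value/score structure, now produced by Claude:

evaluator.evaluate_string_pairs(
    prediction="there are three dogs",
    prediction_b="4",
    input="how many dogs are in the park?",
    reference="four",
)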

Specifying the Evaluation Prompt

You can use a custom evaluation prompt to add task-specific instructions or to instruct the evaluator on how to score the output. Note that if your prompt expects output in a unique format, you may also need to pass in a custom output parser (output_parser=your_parser()) instead of the default PairwiseStringResultOutputParser.

from langchain.prompts import PromptTemplate

prompt_template = PromptTemplate.from_template(
    """Given the input context, which do you prefer: A or B?
Evaluate based on the following criteria:
{criteria}
Reason step by step and finally, respond with either [[A]] or [[B]] on its own line.

DATA
----
input: {input}
reference: {reference}
A: {prediction}
B: {prediction_b}
---
Reasoning:

"""
)
evaluator = load_evaluator(
    "labeled_pairwise_string", prompt=prompt_template
)

evaluator.evaluate_string_pairs(
    prediction="The dog that ate the ice cream was named fido.",
    prediction_b="The dog's name is spot",
    input="What is the name of the dog that ate the ice cream?",
    reference="The dog's name is fido",
)

# {
# 'reasoning': 'Helpfulness: Both A and B are helpful as they provide a direct answer to the question.\nRelevance: A is relevant as it refers to the correct name of the dog from the text. B is not relevant as it provides a different name.\nCorrectness: A is correct as it accurately states the name of the dog. B is incorrect as it provides a different name.\nDepth: Both A and B demonstrate a similar level of depth as they both provide a straightforward answer to the question.\n\nGiven these evaluations, the preferred response is:\n',
# 'value': 'A',
# 'score': 1
# }

This example uses a custom PromptTemplate to evaluate the two answers about the name of the dog.

Summary

This section covered three kinds of comparison evaluators, mainly used to compare the quality of content generated by different models: custom pairwise evaluators, embedding-distance pairwise evaluators, and string-comparison pairwise evaluators.

  • Custom pairwise evaluators let you define your own comparison logic for a pair of predictions
  • Embedding-based pairwise evaluators require choosing a distance metric and an embedding model to compute the distance between predictions
  • String-comparison pairwise evaluators can use LangChain's supported Criteria and constitutional principles (see the Criteria Evaluation section) or custom string criteria, leveraging an LLM to judge the generated content