LangChain Comparison Evaluators: Custom Pairwise Evaluator

Learn about LangChain's Comparison Evaluators and how to create a custom pairwise evaluator for string comparison. Understand the key methods and properties of comparison evaluators for AI reinforcement learning. Explore a simple example of a custom evaluator for comparing the length of predicted strings.

Comparison Evaluators





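Key methods and properties of a comparison evaluator:

  • evaluate_string_pairs: evaluates a pair of output strings
  • aevaluate_string_pairs: asynchronously evaluates a pair of output strings
  • requires_input: property indicating whether the evaluator requires an input string
  • requires_reference: property indicating whether the evaluator requires a reference label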
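
To make the roles of these four members concrete, here is a minimal, dependency-free sketch. PairwiseEvaluatorSketch and ReferenceOverlapEvaluator are hypothetical names invented for this illustration; they are a simplified stand-in, not LangChain's PairwiseStringEvaluator, and the validation shown is only an assumption about how such properties might be used.

```python
import asyncio
import functools
from abc import ABC, abstractmethod
from typing import Any, Optional


class PairwiseEvaluatorSketch(ABC):
    """Simplified stand-in for a pairwise evaluator; not LangChain's own class."""

    @property
    def requires_input(self) -> bool:
        return False

    @property
    def requires_reference(self) -> bool:
        return False

    @abstractmethod
    def _evaluate_string_pairs(self, *, prediction: str, prediction_b: str,
                               reference: Optional[str] = None,
                               input: Optional[str] = None, **kwargs: Any) -> dict:
        """Subclasses put the actual comparison logic here."""

    def evaluate_string_pairs(self, *, prediction: str, prediction_b: str,
                              reference: Optional[str] = None,
                              input: Optional[str] = None, **kwargs: Any) -> dict:
        # The two properties decide which optional arguments become mandatory.
        if self.requires_input and input is None:
            raise ValueError("this evaluator requires an input string")
        if self.requires_reference and reference is None:
            raise ValueError("this evaluator requires a reference label")
        return self._evaluate_string_pairs(
            prediction=prediction, prediction_b=prediction_b,
            reference=reference, input=input, **kwargs)

    async def aevaluate_string_pairs(self, **kwargs: Any) -> dict:
        # Async variant: run the synchronous method in a worker thread.
        loop = asyncio.get_running_loop()
        call = functools.partial(self.evaluate_string_pairs, **kwargs)
        return await loop.run_in_executor(None, call)


class ReferenceOverlapEvaluator(PairwiseEvaluatorSketch):
    """Prefers the prediction that shares more words with the reference."""

    @property
    def requires_reference(self) -> bool:
        return True

    def _evaluate_string_pairs(self, *, prediction, prediction_b,
                               reference=None, input=None, **kwargs):
        ref_words = set(reference.lower().split())
        overlap_a = len(ref_words & set(prediction.lower().split()))
        overlap_b = len(ref_words & set(prediction_b.lower().split()))
        return {"score": int(overlap_a > overlap_b)}


evaluator = ReferenceOverlapEvaluator()
result = evaluator.evaluate_string_pairs(
    prediction="four dogs",
    prediction_b="three cats",
    reference="there are four dogs",
)
print(result)  # {'score': 1}
```

Keeping the comparison logic in the underscore-prefixed method lets the public synchronous and asynchronous entry points share the same argument checks.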


Custom Pairwise Evaluator



from typing import Any, Optional

from langchain.evaluation import PairwiseStringEvaluator


class LengthComparisonPairwiseEvaluator(PairwiseStringEvaluator):
    """Custom evaluator that prefers whichever prediction contains more words."""

    def _evaluate_string_pairs(
        self,
        *,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # Score is 1 if the first prediction has more whitespace-separated words, else 0.
        score = int(len(prediction.split()) > len(prediction_b.split()))
        return {"score": score}


evaluator = LengthComparisonPairwiseEvaluator()

evaluator.evaluate_string_pairs(
    prediction="The quick brown fox jumped over the lazy dog.",
    prediction_b="The quick brown fox jumped over the dog.",
)
# {'score': 1}


from typing import Any, Optional

from langchain.chains import LLMChain
from langchain.chat_models import ChatAnthropic
from langchain.evaluation import PairwiseStringEvaluator


class CustomPreferenceEvaluator(PairwiseStringEvaluator):
    """Custom evaluator that compares two strings using a custom LLMChain."""

    def __init__(self) -> None:
        llm = ChatAnthropic(model="claude-2", temperature=0)
        # The prompt holds one worked example, followed by the pair to grade.
        self.eval_chain = LLMChain.from_string(
            llm,
            """Which option is preferred? Do not take order into account. Evaluate based on accuracy and helpfulness. If neither is preferred, respond with C. Provide your reasoning, then finish with Preference: A/B/C

Input: How do I get the path of the parent directory in python 3.8?
Option A: You can use the following code:
import os
Option B: You can use the following code:
from pathlib import Path
Reasoning: Both options return the same result. However, since option B is more concise and easily understood, it is preferred.
Preference: B

Which option is preferred? Do not take order into account. Evaluate based on accuracy and helpfulness. If neither is preferred, respond with C. Provide your reasoning, then finish with Preference: A/B/C
Input: {input}
Option A: {prediction}
Option B: {prediction_b}
""",
        )

    @property
    def requires_input(self) -> bool:
        return True

    @property
    def requires_reference(self) -> bool:
        return False

    def _evaluate_string_pairs(
        self,
        *,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # The stop sequence keeps the model from starting another example.
        result = self.eval_chain(
            {
                "input": input,
                "prediction": prediction,
                "prediction_b": prediction_b,
                "stop": ["Which option is preferred?"],
            },
            **kwargs,
        )

        # Everything before "Preference:" is the reasoning; the rest is the verdict.
        response_text = result["text"]
        reasoning, preference = response_text.split("Preference:", maxsplit=1)
        preference = preference.strip()
        score = 1.0 if preference == "A" else (0.0 if preference == "B" else None)
        return {"reasoning": reasoning.strip(), "value": preference, "score": score}

evaluator = CustomPreferenceEvaluator()

evaluator.evaluate_string_pairs(
    input="How do I import from a relative directory?",
    prediction="use importlib! importlib.import_module('.my_package', '.')",
    prediction_b="from .sibling import foo",
)
# {
# 'reasoning': 'Option B is preferred over option A for importing from a relative directory, because it is more straightforward and concise.\n\nOption A uses the importlib module, which allows importing a module by specifying the full name as a string. While this works, it is less clear compared to option B.\n\nOption B directly imports from the relative path using dot notation, which clearly shows that it is a relative import. This is the recommended way to do relative imports in Python.\n\nIn summary, option B is more accurate and helpful as it uses the standard Python relative import syntax.',
# 'value': 'B',
# 'score': 0.0
# }
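
The parsing step in the evaluator above assumes the reply always contains "Preference:" followed by a bare letter. As a rough sketch of a more forgiving variant, the hypothetical helper below (parse_preference is a name invented here, not part of LangChain) returns None instead of raising when the marker is missing or the verdict is not A, B or C.

```python
from typing import Optional, Tuple


def parse_preference(response_text: str) -> Tuple[str, Optional[str], Optional[float]]:
    """Split an LLM reply into (reasoning, preference, score)."""
    marker = "Preference:"
    if marker not in response_text:
        # No verdict found: keep the text as reasoning and leave the rest unset.
        return response_text.strip(), None, None

    # Use the last occurrence of the marker in case the reasoning mentions it too.
    reasoning, _, tail = response_text.rpartition(marker)
    tokens = tail.strip().split()
    preference = tokens[0].strip(".").upper() if tokens else None
    if preference not in {"A", "B", "C"}:
        preference = None

    # Same mapping as the evaluator above: A -> 1.0, B -> 0.0, anything else -> None.
    score = {"A": 1.0, "B": 0.0}.get(preference)
    return reasoning.strip(), preference, score


print(parse_preference("Option B is more concise.\nPreference: B"))
```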


Pairwise Embedding Distance


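You can load the pairwise_embedding_distance evaluator to compute the embedding distance between two predictions. Note that it returns a distance score: the lower the number, the more similar the two outputs are according to their embedded representations. It works much like the string evaluator, so it is not covered in detail here; see the embedding distance part of the String Evaluator section.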
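
To illustrate why a lower distance means more similar outputs, here is a small self-contained sketch that assumes cosine distance as the metric. The toy_embed function is a hypothetical word-count "embedding" used only so the example runs without an embedding model; a real evaluator would use vectors from an actual embedding model.

```python
import math
from collections import Counter
from typing import List


def toy_embed(text: str, vocabulary: List[str]) -> List[float]:
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


vocabulary = ["seattle", "is", "hot", "cold", "in", "june"]
hot = toy_embed("Seattle is hot in June", vocabulary)
cold = toy_embed("Seattle is cold in June", vocabulary)
unrelated = toy_embed("cold cold cold", vocabulary)

# The near-identical sentence is closer (smaller distance) than the unrelated one.
print(cosine_distance(hot, cold) < cosine_distance(hot, unrelated))  # True
```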

Pairwise String Comparison


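Pairwise string comparison helps answer questions such as:

  • Which LLM or prompt produces the preferred output for a given question?
  • Which examples should I include for few-shot example selection?
  • Which output is better suited for fine-tuning?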


from langchain.evaluation import load_evaluator

evaluator = load_evaluator("labeled_pairwise_string")

evaluator.evaluate_string_pairs(
    prediction="there are three dogs",
    prediction_b="4",
    input="how many dogs are in the park?",
    reference="four",
)

# {'reasoning': 'Both responses are relevant to the question asked, as they both provide a numerical answer to the question about the number of dogs in the park. However, Response A is incorrect according to the reference answer, which states that there are four dogs. Response B, on the other hand, is correct as it matches the reference answer. Neither response demonstrates depth of thought, as they both simply provide a numerical answer without any additional information or context. \n\nBased on these criteria, Response B is the better response.\n',
# 'value': 'B',
# 'score': 0}


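The evaluate_string_pairs method accepts the following arguments:

  • prediction (str): the predicted response of the first model, chain, or prompt
  • prediction_b (str): the predicted response of the second model, chain, or prompt
  • input (str): the input question, prompt, or other text
  • reference (str): (only for labeled_pairwise_string) the reference response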


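The evaluator returns a dictionary with the following values:

  • value: "A" or "B", indicating which prediction is preferred
  • score: the integer 0 or 1 corresponding to value, where 1 means the first prediction is preferred and 0 means the second prediction is preferred
  • reasoning: the chain-of-thought reasoning the LLM generates before producing the score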
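
Because score is 1 when the first prediction wins and 0 when the second wins, results collected over many inputs can be tallied into a simple win rate. The sketch below is only an illustration: summarize_preferences is a hypothetical helper, and the result dictionaries are made-up data shaped like the fields described above.

```python
from typing import Dict, Iterable


def summarize_preferences(results: Iterable[dict]) -> Dict[str, float]:
    """Tally pairwise results that carry a 'score' of 1 (first wins) or 0 (second wins)."""
    wins_a = wins_b = undecided = 0
    for result in results:
        score = result.get("score")
        if score == 1:
            wins_a += 1
        elif score == 0:
            wins_b += 1
        else:
            # A missing or None score means no clear preference was recorded.
            undecided += 1
    decided = wins_a + wins_b
    return {
        "wins_a": wins_a,
        "wins_b": wins_b,
        "undecided": undecided,
        "win_rate_a": wins_a / decided if decided else 0.0,
    }


# Made-up results for illustration only.
example_results = [
    {"value": "A", "score": 1},
    {"value": "B", "score": 0},
    {"value": "A", "score": 1},
    {"value": "C", "score": None},
]
print(summarize_preferences(example_results))
```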




from langchain.evaluation import load_evaluator

evaluator = load_evaluator("pairwise_string")

evaluator.evaluate_string_pairs(
    prediction="Addition is a mathematical operation.",
    prediction_b="Addition is a mathematical operation that adds two numbers to create a third number, the 'sum'.",
    input="What is addition?",
)

# {'reasoning': 'Both responses are correct and relevant to the question. However, Response B is more helpful and insightful as it provides a more detailed explanation of what addition is. Response A is correct but lacks depth as it does not explain what the operation of addition entails. \n\nFinal Decision: [[B]]',
# 'value': 'B',
# 'score': 0}

Setting Evaluation Criteria


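The criteria argument can take any of the following forms:

  • Default criteria: use one of the default criteria and its description
  • Constitutional principle: use any of the constitutional principles defined in langchain
  • Dictionary: a set of custom criteria, where each key is the name of a criterion and each value is its description
  • Multiple criteria or constitutional principles: combine several of them together
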
from langchain.evaluation import load_evaluator

# Custom criteria: each key names a criterion, each value describes it.
custom_criteria = {
    "simplicity": "Is the language straightforward and unpretentious?",
    "clarity": "Are the sentences clear and easy to understand?",
    "precision": "Is the writing precise, with no unnecessary words or details?",
    "truthfulness": "Does the writing feel honest and sincere?",
    "subtext": "Does the writing suggest deeper meanings or themes?",
}
evaluator = load_evaluator("pairwise_string", criteria=custom_criteria)

evaluator.evaluate_string_pairs(
    prediction="Every cheerful household shares a similar rhythm of joy; but sorrow, in each household, plays a unique, haunting melody.",
    prediction_b="Where one finds a symphony of joy, every domicile of happiness resounds in harmonious,"
    " identical notes; yet, every abode of despair conducts a dissonant orchestra, each"
    " playing an elegy of grief that is peculiar and profound to its own existence.",
    input="Write some prose about families.",
)

# {
# 'reasoning': 'Response A is simple, clear, and precise. It uses straightforward language to convey a deep and sincere message about families. The metaphor of joy and sorrow as music is effective and easy to understand.\n\nResponse B, on the other hand, is more complex and less clear. The language is more pretentious, with words like "domicile," "resounds," "abode," "dissonant," and "elegy." While it conveys a similar message to Response A, it does so in a more convoluted way. The precision is also lacking due to the use of unnecessary words and details.\n\nBoth responses suggest deeper meanings or themes about the shared joy and unique sorrow in families. However, Response A does so in a more effective and accessible way.\n\nTherefore, the better response is [[A]].',
# 'value': 'A',
# 'score': 1
# }


By default the evaluation chain uses gpt-4; you can supply a different LLM instead:

from langchain.chat_models import ChatAnthropic

llm = ChatAnthropic(temperature=0)

evaluator = load_evaluator("labeled_pairwise_string", llm=llm)



from langchain.evaluation import load_evaluator
from langchain.prompts import PromptTemplate

prompt_template = PromptTemplate.from_template(
    """Given the input context, which do you prefer: A or B?
Evaluate based on the following criteria:
{criteria}
Reason step by step and finally, respond with either [[A]] or [[B]] on its own line.

input: {input}
reference: {reference}
A: {prediction}
B: {prediction_b}
"""
)

evaluator = load_evaluator("labeled_pairwise_string", prompt=prompt_template)

evaluator.evaluate_string_pairs(
    prediction="The dog that ate the ice cream was named fido.",
    prediction_b="The dog's name is spot",
    input="What is the name of the dog that ate the ice cream?",
    reference="The dog's name is fido",
)

# {
# 'reasoning': 'Helpfulness: Both A and B are helpful as they provide a direct answer to the question.\nRelevance: A is relevant as it refers to the correct name of the dog from the text. B is not relevant as it provides a different name.\nCorrectness: A is correct as it accurately states the name of the dog. B is incorrect as it provides a different name.\nDepth: Both A and B demonstrate a similar level of depth as they both provide a straightforward answer to the question.\n\nGiven these evaluations, the preferred response is:\n',
# 'value': 'A',
# 'score': 1
# }

This example uses a custom prompt template to evaluate whether the model gives the right name for the dog.
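
To get a feel for the text the grading LLM would receive, the template can be filled in by hand. The sketch below uses plain str.format as a stand-in for PromptTemplate; TEMPLATE, format_criteria, render_prompt, and the one-line-per-criterion layout are all assumptions made for illustration, and the "correctness" criterion is a made-up example.

```python
from typing import Dict

# Stand-in for the custom prompt; placeholders are filled with str.format.
TEMPLATE = """Given the input context, which do you prefer: A or B?
Evaluate based on the following criteria:
{criteria}
Reason step by step and finally, respond with either [[A]] or [[B]] on its own line.

input: {input}
reference: {reference}
A: {prediction}
B: {prediction_b}
"""


def format_criteria(criteria: Dict[str, str]) -> str:
    """Render criteria as one 'name: description' line each (assumed layout)."""
    return "\n".join(f"{name}: {description}" for name, description in criteria.items())


def render_prompt(*, criteria: Dict[str, str], input: str, reference: str,
                  prediction: str, prediction_b: str) -> str:
    """Fill every placeholder in TEMPLATE and return the final prompt text."""
    return TEMPLATE.format(
        criteria=format_criteria(criteria),
        input=input,
        reference=reference,
        prediction=prediction,
        prediction_b=prediction_b,
    )


rendered = render_prompt(
    criteria={"correctness": "Is the answer consistent with the reference?"},
    input="What is the name of the dog that ate the ice cream?",
    reference="The dog's name is fido",
    prediction="The dog that ate the ice cream was named fido.",
    prediction_b="The dog's name is spot",
)
print(rendered)
```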



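  • Custom evaluators let you define your own process for comparing predictions
  • Embedding-based pairwise evaluators require you to choose a distance metric and an embedding model for the distance comparison
  • String-comparison pairwise evaluators can use the criteria and constitutional principles supported by langchain (see the Criteria Evaluation section) or custom string metrics, drawing on the LLM's capabilities to evaluate generated content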


2023-11-26 16:45:14



2023-11-26 17:00:14
