A New Approach: Large Language Models as Analogical Reasoners

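This article introduces a new prompting technique called "analogical prompting". It guides the reasoning of large language models automatically by having the model generate its own exemplars, and it improves accuracy across a range of tasks.

Chain-of-thought (CoT) prompting lets language models perform well on many reasoning tasks, but it usually requires labeled exemplars of the reasoning process. In this work, we introduce a new prompting approach, "analogical prompting", designed to guide the reasoning of large language models automatically.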

Original paper: Large Language Models as Analogical Reasoners

Michihiro Yasunaga (1,2), Xinyun Chen (1), Yujia Li (1), Panupong Pasupat (1), Jure Leskovec (2), Percy Liang (2), Ed H. Chi (1), Denny Zhou (1)
(1) Google DeepMind, (2) Stanford University

[...] 2022b; Shum et al., 2023).

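We evaluate the new method on several reasoning-intensive tasks: math problems from GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021b), code generation on Codeforces, and other reasoning tasks from BIG-Bench (Srivastava et al., 2022). We use several base LLMs, such as GPT-3.5 and GPT-4 (OpenAI, 2023). The results show that the method outperforms both 0-shot CoT and few-shot CoT across tasks and base LLMs, with an average accuracy gain of 4%. It does especially well on tasks that mix different kinds of reasoning, such as MATH (algebra, probability, geometry, and so on) and Codeforces (dynamic programming, graph algorithms, and so on). This suggests that generating tailored exemplars for each problem is an effective way to guide an LLM's reasoning.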

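[...] developed in 2019) to very large models with billions of parameters (Brown et al., 2020). Training has also come to include web text and instruction data (Gao et al., 2020; Ouyang et al., 2022). As a result, large language models have become highly effective on a wide range of natural language processing (NLP) tasks.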

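Very large LLMs, with their billions of parameters, show strong in-context and few-shot learning abilities (see Brown et al., 2020; Liu et al., 2022; Su et al., 2022; Mishra et al., 2022; Wei et al., 2022a; and the "recall-augmented generation" technique of Yasunaga et al., 2022). In open-domain question answering (QA), that technique prompts the LLM to recall relevant facts before answering the question. Our idea of self-generated exemplars is related to recall, but we focus on how the model recalls and reproduces problem-solving and reasoning processes, not only factual knowledge.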

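[...] proposed in 2022b, is a distinctive prompting strategy. It acts like a guide, helping large language models (LLMs) lay out each step of their thinking while solving a problem so that they solve it better. Two common variants are 0-shot CoT and few-shot CoT, proposed by Kojima et al. (2022) and Wei et al. (2022b), respectively.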

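0-shot CoT uses a simple instruction, such as "think step by step", to prompt the LLM to reason and show its reasoning. Few-shot CoT instead provides a few exemplars that demonstrate the full reasoning path from question to answer, relying on the LLM's strong in-context learning ability to get better results. However, this requires labeled exemplars. To address this, our method has the LLM generate the exemplars itself.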

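Few-shot CoT typically uses one fixed set of labeled exemplars for all test problems. Recent work has explored retrieval-based CoT, which finds the most relevant exemplars for each specific problem in a large external dataset (Zhang et al., 2022b; Shum et al., 2023). We share the goal of providing relevant exemplars, but we have the LLM generate them instead of retrieving them. This is simpler, because no external data needs to be retrieved, and more flexible, because the LLM can produce not only specific exemplars but also broader knowledge and insights to complement them. In our experiments, generation-based CoT outperforms retrieval-based CoT when larger base LLMs are used.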

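Finally, other methods can also improve CoT, such as "self-consistency" (Wang et al., 2022) and "least-to-most" prompting (Zhou et al., 2022). These methods are complementary to our work and can be combined with it.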

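For more related work, see Appendix A (§ [...] proposed in 2021), a benchmark of basic math problems; and MATH (Hendrycks et al., 2021b), which contains difficult problems from high school math competitions. For each problem we obtain an output from the LLM at temperature 0 and report accuracy.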

Code generation

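Code generation is the task of producing programs that solve algorithmic problems. Competitive programming is a particularly challenging setting, since it requires fluent command of many algorithms, such as dynamic programming and graph algorithms (Li et al., [...] codeforces.com collected a set of competitive programming problems (for details see § [...] 2022). Each problem comes with a problem statement, which is given to the large language model (LLM) as input, and the generated code is then checked against a set of test cases. We count the code as correct only if it passes all test cases.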

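Following prior work on code generation, we report Acc@1 and Acc@10 (Li et al., 2023). Acc@k measures whether at least one of k sampled model outputs is correct. For each problem we sample 10 outputs from the LLM, using a sampling temperature of 0.7.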
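As a rough illustration of the metric, the sketch below computes Acc@k from per-sample pass/fail flags. The function name `acc_at_k` and the toy data are hypothetical, and the sketch assumes that Acc@k looks at the first k samples of each problem; the text does not say how the k samples are chosen from the 10 drawn.

```python
from typing import Sequence


def acc_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of problems with at least one correct sample among the first k.

    results[i][j] is True if the j-th sampled program for problem i
    passed all of that problem's test cases.
    """
    if not results:
        raise ValueError("results must contain at least one problem")
    solved = sum(1 for samples in results if any(samples[:k]))
    return solved / len(results)


# Hypothetical pass/fail flags: one row per problem, one entry per sample.
example = [
    [False, True, False],
    [False, False, False],
    [True, True, False],
    [False, False, True],
]
print(acc_at_k(example, 1))  # 0.25
print(acc_at_k(example, 3))  # 0.75
```

With 10 samples per problem, Acc@1 and Acc@10 would correspond to `k=1` and `k=10` under this assumption.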

Other reasoning tasks

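We also evaluate several reasoning tasks from BIG-Bench (Srivastava et al., 2022; Suzgun et al., 2022): word sorting, logical deduction over five objects, temporal sequences, reasoning about colored objects, and formal fallacies. These tasks vary widely in form and may have no dedicated training data, which makes them a good test for generating tailored exemplars with our method. For each task, we obtain the LLM's output at temperature 0 and report accuracy.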

[...] developed in 2023).

[...] developed in 2019) to encode each problem, and then retrieve the K=5 most similar problems from the training set by cosine similarity.
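The retrieval step described above can be sketched as follows. This is only an illustration under stated assumptions: the encoder is not shown, the two-dimensional vectors are made-up stand-ins for real problem embeddings, and the names `cosine_similarity` and `retrieve_top_k` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(query: Sequence[float],
                   candidates: Sequence[Sequence[float]],
                   k: int = 5) -> List[int]:
    """Return indices of the k candidates most similar to the query."""
    scored = [(cosine_similarity(query, c), i) for i, c in enumerate(candidates)]
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


# Toy embeddings standing in for encoded training problems (assumption).
train_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query_vector = [1.0, 0.1]
print(retrieve_top_k(query_vector, train_vectors, k=2))  # [0, 2]
```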

Our method

On GSM8K, we have the model self-generate K=5 exemplars; on MATH and the BIG-Bench tasks, K=3 exemplars. On Codeforces, we self-generate both knowledge and K=3 exemplars.
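To make the setup concrete, here is a minimal sketch of how such a prompt might be assembled for a math problem. The exemplar counts come from the text; the dictionary, the function name `build_prompt`, and the exact wording of the template are illustrative assumptions modeled on the prompts in Appendix D, not the authors' code. Codeforces uses a different template that also asks for knowledge, which this sketch does not cover.

```python
# Exemplar counts taken from the text; the dictionary itself is illustrative.
NUM_EXEMPLARS = {"GSM8K": 5, "MATH": 3, "BIG-Bench": 3, "Codeforces": 3}


def build_prompt(problem: str, k: int) -> str:
    """Assemble a self-generated-exemplar prompt for one math problem."""
    lines = [
        "Your task is to solve math problems. When presented with a math problem, "
        "recall relevant problems as examples. Afterward, solve the initial problem.",
        "",
        "# Initial Problem:",
        problem,
        "",
        "# Instructions:",
        "## Relevant Problems:",
        f"Recall {k} examples of math problems that are relevant to the initial problem. "
        "Your problems should be distinct from each other and from the initial problem.",
        '- After "Q: ", describe the problem',
        '- After "A: ", explain the solution and enclose the ultimate answer in \\boxed{}.',
        "",
        "## Solve the Initial Problem:",
        "Q: Copy and paste the initial problem here.",
        "A: Explain the solution and enclose the ultimate answer in \\boxed{} here.",
    ]
    return "\n".join(lines)


prompt = build_prompt("What is 2 + 2?", NUM_EXEMPLARS["GSM8K"])
print(prompt)
```

The resulting string would be sent to the LLM as a single request, so exemplar generation and the final answer happen in one pass.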

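[...] 1997). This phenomenon has been studied extensively in psychology, which shows its role in problem solving (Gentner & Markman, 1997) and in creativity (Ward et al., 2014; Huang et al., [...] codeforces.com collected a set of data. We selected Level-A problems published between January and August 2023. These problems come with full problem statements and test cases, all available on the website. Some test cases are truncated because of their length, and we did not use those. In the end, we kept the problems whose length fits within 2,000 tokens for GPT3.5-turbo, 50 problems in total. Because the dataset is relatively small, we ran the evaluation twice and report the average.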

Appendix C: Additional results

Prompting Method	Acc@1	Acc@10
Ours: Self-generate knowledge after exemplars	14%	27%
Ours: Self-generate knowledge before exemplars	15%	29%

Table 6: Ablation study on Codeforces using GPT3.5-turbo. Self-generating knowledge before the exemplars performs slightly better than self-generating it after them (see Section 4.2).

Appendix D: Full examples

D.1 MATH

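Our prompt (self-generate exemplars):

Your task is to solve math problems. When presented with a math problem, first recall some relevant problems as examples. Afterward, solve the initial problem.

Initial problem:

An airline serves dinner to all the passengers on an airplane, who get their choice of steak or fish. Three steak meals and three fish meals are set aside for the six-member crew. If the meals are distributed to the crew members randomly, what is the probability that both pilots get the fish?

Instructions:

Your answer must include all of the following points:

Relevant problems:

Recall three math problems that are relevant to the initial problem as examples. Note that each problem should be distinct from the others and from the initial problem (for example, involving different numbers and names). Each problem should follow this format:

After "Q:", describe the problem.

After "A:", explain the solution and put the final answer in \boxed{}.

Solve the initial problem:

Say "Let's solve the following math problem." Then answer in the following format:

Q: Copy and paste the initial problem here.

A: Explain the solution and put the final answer in \boxed{}.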

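Output by GPT3.5-turbo:

Relevant problems:

Q: A bag contains 5 red balls and 3 blue balls. If two balls are drawn at random without replacement, what is the probability that both balls are red?

A: There are 8 balls in the bag in total. The probability of drawing a red ball first is 5/8. After one red ball has been drawn, 4 red balls and 3 blue balls remain, so the probability of drawing another red ball is 4/7. The probability of drawing two red balls is the product of the two: (5/8)×(4/7) = 5/14.

Q: A box contains 10 red marbles and 5 blue marbles. If three marbles are drawn without replacement, what is the probability that all three are red?

A: There are 15 marbles in the box in total. The probability of drawing a red marble first is 10/15, which is 2/3. After one red marble has been drawn, 9 red marbles remain out of 14 marbles, so the probability of a red marble on the second draw is 9/14. After another red marble has been drawn, 8 red marbles remain out of 13 marbles, so the probability of a red marble on the third draw is 8/13. Multiplying the three probabilities gives the probability of drawing three red marbles: (10/15)×(9/14)×(8/13) = 24/91.

Q: A bag contains 4 green marbles, 3 red marbles, and 2 blue marbles. If two marbles are drawn without replacement, what is the probability that both are green?

A: There are 9 marbles in the bag in total. The probability of drawing a green marble first is 4/9. Then 3 green marbles and 5 non-green marbles remain, so the probability of drawing another green marble is 3/8. The probability of drawing two green marbles is the product of the two: (4/9)×(3/8) = 1/6.

Let's solve the following math problem:

Q: An airline serves dinner to all the passengers on an airplane, who get their choice of steak or fish. Three steak meals and three fish meals are set aside for the six-member crew. If the meals are distributed to the crew members randomly, what is the probability that both pilots get the fish?

A: There are 6 crew members in total. The probability that the first pilot gets the fish is 3/6, which is 1/2. After the first pilot has received a fish meal, two fish meals and three steak meals remain, so the probability that the second pilot gets the fish is 2/5. The probability that both pilots get the fish is the product of the two: (3/6)×(2/5) = 1/5.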
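The four products in the output above can be checked with exact rational arithmetic. The helper name `draw_all_target` is hypothetical; it simply multiplies the per-draw probabilities for sampling without replacement, as the worked answers do.

```python
from fractions import Fraction


def draw_all_target(target: int, total: int, draws: int) -> Fraction:
    """Probability that `draws` draws without replacement all hit the target group."""
    prob = Fraction(1)
    for i in range(draws):
        # After i successful draws, (target - i) of (total - i) items remain.
        prob *= Fraction(target - i, total - i)
    return prob


print(draw_all_target(5, 8, 2))    # 5/14  (two red balls)
print(draw_all_target(10, 15, 3))  # 24/91 (three red marbles)
print(draw_all_target(4, 9, 2))    # 1/6   (two green marbles)
print(draw_all_target(3, 6, 2))    # 1/5   (both pilots get fish)
```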

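D.2 GSM8K

Our prompt (self-generate exemplars):

Your task is to solve math problems. When presented with a math problem, recall relevant or similar problems as examples, and then solve the initial problem.

Initial problem:

Kyle bought last year's best-selling book for $19.50. This price is 75% of the original price. What was the original price of the book?

Instructions:

Make sure your answer follows the instructions below.

Relevant problems:

Give five examples of math problems that are relevant or similar to the initial problem. For each problem, explain the solution steps in detail and give the final answer (enclosed in \boxed{}). For each problem:

After "Q:", write the problem.

After "A:", explain the solution and write the final answer in \boxed{}.

Solve the initial problem:

Q: Copy and paste the initial problem here.

A: Explain the solution and write the final answer in \boxed{}.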

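Output by GPT3.5-turbo:

Relevant problems:

Q1: A shirt originally priced at $40 is on sale at 20% off. What is the sale price? A: The shirt's original price is $40. At 20% off, the saving is 20% of $40 = $8. So the sale price is $40 − $8 = $32.

Q2: A store offers a 15% discount on all items. A customer bought a pair of shoes for $85. What was the original price of the shoes? A: Let the original price be $x. A 15% discount means saving 15% of $x = $0.15x. The sale price is $85, so we can set up the equation x − 0.15x = 85, which gives an original price of $100.

Q3: A restaurant offers a 10% discount on all meals. A customer paid $45 for a meal. What was the original price? A: Let the original price be $x. A 10% discount means saving 10% of $x = $0.10x. The sale price is $45, so we can set up the equation x − 0.10x = 45, which gives an original price of $50.

Q4: A laptop originally priced at $800 is on sale at 30% off. What is the sale price? A: The laptop's original price is $800. At 30% off, the discount is 30% of $800 = $240. So the discounted price is $800 − $240 = $560.

Q5: A store offers a 12% discount on all toys. A customer bought a toy for $22. What was the original price of the toy? A: Let the original price be $x. A 12% discount means saving 12% of $x = $0.12x. The sale price is $22, so from the equation x − 0.12x = 22 we get an original price of $25.

Solving the initial problem:

Q: Kyle bought last year's best-selling book for $19.50. This price is 75% of the original price. What was the original price of the book? A: Let the original price be $x. A 25% discount means saving 25% of $x = $0.25x. The sale price is $19.50, so from the equation x − 0.25x = 19.50 we get an original price of $26.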
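The "find the original price" exemplars and the final answer all solve the same equation, x − d·x = sale price, so x = sale price / (1 − d). The short sketch below checks those answers with exact arithmetic; the function name `original_price` is hypothetical.

```python
from fractions import Fraction


def original_price(sale_price: str, discount_percent: int) -> Fraction:
    """Solve x - (discount_percent / 100) * x = sale_price for x."""
    if not 0 <= discount_percent < 100:
        raise ValueError("discount_percent must be in [0, 100)")
    return Fraction(sale_price) / (1 - Fraction(discount_percent, 100))


print(original_price("85", 15))     # 100 (Q2)
print(original_price("45", 10))     # 50  (Q3)
print(original_price("22", 12))     # 25  (Q5)
print(original_price("19.50", 25))  # 26  (initial problem)
```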

D.3 Codeforces

Our prompt (self-generate exemplars and knowledge):

Your goal is to write Python3 code to solve competitive programming problems. Given a problem, explain the core concepts in it and provide other relevant problems. Then solve the original problem.

# Problem:

You are given a sequence a_1, a_2, \ldots, a_n. Each element of a is 1 or 2. Find out if an integer k exists so that the following conditions are met.

1 \leq k \leq n-1, and a_1 \cdot a_2 \cdot \ldots \cdot a_k = a_{k+1} \cdot a_{k+2} \cdot \ldots \cdot a_n, where \cdot denotes a product.

If there exist multiple k that satisfy the given condition, print the smallest.

Input

Each test contains multiple test cases. The first line contains the number of test cases t (1 \le t \le 100). Description of the test cases follows.

The first line of each test case contains one integer n (2 \leq n \leq 1000).

The second line of each test case contains n integers a_1, a_2, \ldots, a_n (1 \leq a_i \leq 2).

Output

For each test case, if there is no such k, print -1.

Otherwise, print the smallest possible k.

# Instruction:

## Algorithms:

Identify the core concepts or algorithms used to solve the problem.

## Tutorial:

Write a useful tutorial about these algorithms.

## Example Problems:

Provide three examples of relevant competitive programming problems that involve these algorithms. For each problem, describe the problem, explain the solution in detail, and then write the correct Python3 code.

## Python3 code to solve the original problem:

Include the following points in your response:

- Explanation of the solution:

- Python3 code to solve the problem:

Output by GPT3.5-turbo-16k:

## Algorithms:

The core algorithm for solving this problem is the prefix product algorithm.

## Tutorial:

The prefix product algorithm is used to calculate the product of all the elements in a prefix of an array. It is a useful algorithm for solving problems that involve calculating the product of all the elements in a subarray.

The algorithm works by initializing a variable "prefix" to 1, and then iterating through the array from left to right. At each iteration, the current element is multiplied with the prefix, and the result is stored in the prefix variable. This way, the prefix variable keeps track of the product of all the elements in the prefix of the array.

Here is the pseudocode for the prefix product algorithm:

The time complexity of this algorithm is O(n), where n is the size of the input array.
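The pseudocode referred to in the tutorial is not reproduced here. As a stand-in, the loop it describes can be sketched in Python as follows; this is an illustrative reconstruction based on the tutorial's wording, not the model's original output, and the function name `prefix_products` is an assumption.

```python
from typing import List


def prefix_products(arr: List[int]) -> List[int]:
    """Return a list whose i-th entry is arr[0] * arr[1] * ... * arr[i]."""
    prefix = 1
    result = []
    for value in arr:
        prefix *= value        # fold the current element into the running product
        result.append(prefix)  # record the product of the prefix ending here
    return result


print(prefix_products([1, 2, 2, 1]))  # [1, 2, 4, 4]
```

The single pass over the array matches the O(n) bound stated above, counting each multiplication as one step.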

## Example Problems:

1. Problem: Given an array of integers, find the product of all the elements in each subarray of size k.

   Solution: We can use the prefix product algorithm to solve this problem. We initialize a prefix variable to 1, and then iterate through the array from left to right. At each iteration, we multiply the current element with the prefix, and store the result in the prefix variable. If the index i is greater than or equal to k-1, we divide the prefix by the element arr[i-k+1], which represents the first element of the subarray of size k starting at index i-k+1. The result is the product of all the elements in the subarray of size k starting at index i-k+1. We can store these results in a new array and return it as the output.

2. Problem: Given an array of integers, find the longest subarray with a product less than or equal to a given target.

   Solution: We can use the prefix product algorithm to solve this problem. We initialize a prefix variable to 1, and then iterate through the array from left to right. At each iteration, we multiply the current element with the prefix, and store the result in the prefix variable. If the prefix becomes greater than the target, we divide the prefix by the first element of the subarray, and move the start pointer of the subarray one step forward. We update the maximum length of the subarray if necessary.

3. Problem: Given an array of integers, find the smallest k such that the product of all the elements in the prefix of the array is equal to the product of all the elements in the suffix of the array.

   Solution: We can use the prefix product algorithm to solve this problem. We initialize a prefix variable to 1, and then iterate through the array from left to right. At each iteration, we multiply the current element with the prefix, and store the result in the prefix variable. We also initialize a suffix variable to 1, and then iterate through the array from right to left. At each iteration, we multiply the current element with the suffix, and store the result in the suffix variable. If the prefix is equal to the suffix, we return the current index as the smallest k. If no such index is found, we return -1.
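The model's final Python3 solution is not reproduced here. The sketch below shows one way the prefix product idea from Example Problem 3 could be applied to the original problem; it is an illustrative reconstruction, not the model's output. The function name `smallest_split` is hypothetical, and reading the multi-test-case input is left out. It relies on the fact that prefix × suffix equals the total product, so the two halves are equal exactly when the prefix squared equals the total.

```python
from typing import List


def smallest_split(a: List[int]) -> int:
    """Smallest k in [1, n-1] with a[0]*...*a[k-1] == a[k]*...*a[n-1], else -1."""
    total = 1
    for value in a:
        total *= value  # Python integers are unbounded, so 2**1000 is fine

    prefix = 1
    for k in range(1, len(a)):  # k = 1 .. n-1
        prefix *= a[k - 1]
        # prefix * suffix == total, so prefix == suffix iff prefix**2 == total
        # (all elements are positive, so there is no sign ambiguity).
        if prefix * prefix == total:
            return k
    return -1


print(smallest_split([2, 2, 1, 2, 1, 2]))  # 2
print(smallest_split([1, 2, 1]))           # -1
print(smallest_split([1, 1, 1, 1]))        # 1
```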

Below is a baseline output generated by GPT3.5-turbo-16k using vanilla 0-shot prompt. This output does not recall any relevant exemplars and fails to employ the prefix product algorithm, resulting in an incorrect solution.

D.4 BIG-Bench: formal fallacies

Our prompt (self-generate exemplars):

Your task is to tackle reasoning problems. When presented with a problem, recall relevant problems as examples. Afterward, proceed to solve the initial problem.

# Initial Problem:

"Is Siri a stepsister of Mary? Is Susan related to Kate? In large families, it is sometimes difficult to keep track of all one’s relatives. The following argument seeks to clarify some such relations: To start with, everyone who is a schoolmate of Mildred is a daughter of Jeanette, too. Now, being a daughter of Jeanette is sufficient for not being a cousin of Cristi. Therefore, no schoolmate of Mildred is a cousin of Cristi." Is the argument, given the explicitly stated premises, deductively valid or invalid? Options: - valid - invalid

# Instructions:

Make sure to include all of the following points:

## Relevant Problems:

Recall three examples of problems that are relevant to the initial problem. Note that your problems must be distinct from each other and from the initial problem. For each problem:

- After "Q: ", describe the problem

- After "A: ", explain the solution and enclose the ultimate answer in \boxed{}.

## Solve the Initial Problem:

Say "Let’s solve the following reasoning problem." Then formulate your response in the following format:

Q: Copy and paste the initial problem here.

A: Explain the solution and enclose the ultimate answer in \boxed{} here.

Output by GPT3.5-turbo:

Relevant Problems:

Q: In a certain town, all doctors are required to have a medical degree. John is a doctor. Therefore, John must have a medical degree. Is this argument valid or invalid?

A: This argument is valid because it follows the logical form of a categorical syllogism, where the first premise establishes a universal statement and the second premise applies that statement to a specific case. Therefore, the conclusion can be logically deduced. \boxed{valid}

Q: All cats have fur. Fluffy is a cat. Therefore, Fluffy has fur. Is this argument valid or invalid?

A: This argument is valid because it follows the logical form of a categorical syllogism, where the first premise establishes a universal statement and the second premise applies that statement to a specific case. Therefore, the conclusion can be logically deduced. \boxed{valid}

Q: If it is raining, then the ground is wet. The ground is wet. Therefore, it must be raining. Is this argument valid or invalid?

A: This argument is invalid because it commits the fallacy of affirming the consequent. Just because the ground is wet does not necessarily mean that it is raining, as there could be other factors that caused the ground to become wet. \boxed{invalid}

Solve the Initial Problem:

Let’s solve the following reasoning problem:

Q: "Is Siri a stepsister of Mary? Is Susan related to Kate? In large families, it is sometimes difficult to keep track of all one’s relatives. The following argument seeks to clarify some such relations: To start with, everyone who is a schoolmate of Mildred is a daughter of Jeanette, too. Now, being a daughter of Jeanette is sufficient for not being a cousin of Cristi. Therefore, no schoolmate of Mildred is a cousin of Cristi." Is the argument, given the explicitly stated premises, deductively valid or invalid?

A: The argument is deductively valid. The premises state that everyone who is a schoolmate of Mildred is a daughter of Jeanette, and being a daughter of Jeanette is sufficient for not being a cousin of Cristi. Therefore, it logically follows that no schoolmate of Mildred is a cousin of Cristi. \boxed{valid}

Below is a baseline output generated by GPT3.5-turbo using 0-shot CoT prompt. This output does not recall any relevant exemplars and adopts an incorrect approach to address the deductive reasoning problem.

The argument is deductively invalid.

To determine the validity of the argument, we need to analyze the logical structure and see if the conclusion necessarily follows from the premises.

Let’s break down the argument:

1. Everyone who is a schoolmate of Mildred is a daughter of Jeanette.

2. Being a daughter of Jeanette is sufficient for not being a cousin of Cristi.

Conclusion: No schoolmate of Mildred is a cousin of Cristi.

The first premise establishes a relationship between being a schoolmate of Mildred and being a daughter of Jeanette. However, it does not provide any information about the relationship between being a schoolmate of Mildred and being a cousin of Cristi.

The second premise establishes a relationship between being a daughter of Jeanette and not being a cousin of Cristi. However, it does not provide any information about the relationship between being a schoolmate of Mildred and being a cousin of Cristi.

Therefore, the conclusion that no schoolmate of Mildred is a cousin of Cristi cannot be logically derived from the given premises. The argument is deductively invalid.

\boxed{invalid}
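The validity judgments in this example can be checked mechanically. The sketch below brute-forces all truth assignments and reports an argument form as valid when no assignment makes every premise true and the conclusion false. It is a simplification: the universally quantified premises are treated as propositional statements about one arbitrary person, and the names `is_valid` and `implies` are hypothetical.

```python
from itertools import product
from typing import Callable, Sequence

Assignment = Sequence[bool]
Formula = Callable[[Assignment], bool]


def implies(p: bool, q: bool) -> bool:
    """Material implication: p -> q."""
    return (not p) or q


def is_valid(premises: Sequence[Formula], conclusion: Formula, num_vars: int) -> bool:
    """True if the conclusion holds in every assignment that satisfies all premises."""
    for values in product([False, True], repeat=num_vars):
        if all(p(values) for p in premises) and not conclusion(values):
            return False  # counterexample found
    return True


# v[0]: schoolmate of Mildred, v[1]: daughter of Jeanette, v[2]: cousin of Cristi
family_argument = is_valid(
    premises=[lambda v: implies(v[0], v[1]), lambda v: implies(v[1], not v[2])],
    conclusion=lambda v: implies(v[0], not v[2]),
    num_vars=3,
)

# v[0]: it is raining, v[1]: the ground is wet (affirming the consequent)
rain_argument = is_valid(
    premises=[lambda v: implies(v[0], v[1]), lambda v: v[1]],
    conclusion=lambda v: v[0],
    num_vars=2,
)

print(family_argument)  # True
print(rain_argument)    # False
```

Under this simplification the check agrees with the analogical-prompting output (`valid`) and not with the 0-shot CoT baseline.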
