Introduction to the Low-Resource and Incremental-Type Named Entity Recognition Challenge

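This article introduces the Low-Resource and Incremental-Type Named Entity Recognition Challenge and provides a baseline built with PaddleNLP, including data processing and format conversion.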

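I. Introduction to the Low-Resource and Incremental-Type Named Entity Recognition Challenge

I wrote a competition baseline with the do-everything PaddleNLP. The score of this first submission is fairly low but passable, mainly because the preliminary-round dataset is very small and covers too little.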

Competition page:

Low-Resource and Incremental-Type Named Entity Recognition

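1. Data Overview

The data for this task focuses on the equipment domain. It was collected and organized from the following three kinds of sources, which gives it a degree of authority and domain value:

  • Open-source information: data collected from open sources such as mainstream news sites in China and abroad, Baidu Baike, Wikipedia, and weapon encyclopedia sites; Chinese sources were preferred, and foreign-language data was translated to obtain intelligence data;
  • Think-tank reports: papers and reports containing equipment intelligence, obtained from think-tank websites;
  • Internal results: documents related to research results, obtained from internal sites of domestic defense-industry companies, research institutes, comprehensive libraries, digital libraries, and defense-institute libraries, then analyzed and organized.

After collecting ample raw unlabeled data from the sources above, the organizers first combined manual review with automated methods such as keyword matching to filter out off-topic, untruthful, and biased data. They then cleaned invalid and illegal characters and filtered out overly long texts and texts that contain no domain entities. Next, the texts were annotated with a label scheme drawn up with reference to authoritative equipment standards and publications, and models from earlier research in related fields were used to pre-label the data. Finally, samples whose type distribution met the task requirements were selected to form the raw dataset.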

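2. Data Description

• Preliminary-round data
The dataset contains about 6,000 samples in total and covers the following nine entity types: aircraft (飞行器), individual weapons (单兵武器), bombs (炸弹), armored vehicles (装甲车辆), artillery (火炮), missiles (导弹), ships and vessels (舰船舰艇), space equipment (太空装备), and other weapons and equipment (其他武器装备). Following the task setup used in low-resource learning, about 50 sample cases per type were drawn from the raw dataset to form a training set of 97 annotated samples (a single sample may contain several entities and entity types); all remaining samples go into the test set. All data files are UTF-8 encoded.

File type | File name | Description
Training set | ner_train.json | 97 annotated samples. Each sample contains a sample id (sample_id), the raw text (text), and a list of annotated entities (annotations); each list element is one entity with its type (type), text (text), span start (start), and span end (end)
Test set | ner_test.json | 5,920 unlabeled samples. Each sample contains a sample id (sample_id) and the raw text (text)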
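
To make the record layout concrete, here is a minimal sketch of one training sample together with a consistency check on its spans. The sample content is invented for illustration, and `span_convention` is a helper name introduced here. Whether `end` is exclusive or inclusive in the competition files is not stated in the description, so the function reports which convention fits rather than assuming one.

```python
# Hypothetical sample that follows the field names described above.
sample = {
    "sample_id": 0,
    "text": "F-16战斗机挂载了AIM-120导弹。",
    "annotations": [
        {"type": "飞行器", "text": "F-16战斗机", "start": 0, "end": 7},
        {"type": "导弹", "text": "AIM-120导弹", "start": 10, "end": 19},
    ],
}


def span_convention(sample):
    """Return 'exclusive', 'inclusive', or 'mismatch' for the sample's spans."""
    text = sample["text"]
    anns = sample["annotations"]
    # Exclusive end: text[start:end] reproduces the entity text.
    if all(text[a["start"]:a["end"]] == a["text"] for a in anns):
        return "exclusive"
    # Inclusive end: the character at index `end` still belongs to the entity.
    if all(text[a["start"]:a["end"] + 1] == a["text"] for a in anns):
        return "inclusive"
    return "mismatch"


print(span_convention(sample))
```

Running the same check over every record of ner_train.json is a cheap way to catch offset problems before any format conversion.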

II. Data Processing

1. Inspecting the Data

!ls data/data218296/
ner_test.json  ner_train.json
  • As shown, nine entity types need to be extracted: 飞行器, 单兵武器, 炸弹, 装甲车辆, 火炮, 导弹, 舰船舰艇, 太空装备, 其他武器装备
  • The training set currently has 97 annotated samples

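2. Format Conversion and Dataset Split

The main steps are:

  • Format conversion
    Data is usually annotated with doccano and then converted. Here I convert the competition files directly into the format I need.
  • Splitting into training and dev sets
    The data is split at a ratio of 8:2.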
%cd ~

import json
import csv
from pprint import pprint
import random


# Load the JSON file
with open('data/data218296/ner_train.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
print(len(data))
data_len = len(data)
random.shuffle(data)
# 8:2 split into training and dev sets
split_idx = int(data_len * 0.8)
train_data = data[:split_idx]
dev_data = data[split_idx:]
/home/aistudio
97
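
The shuffle in the cell above is unseeded, so the train/dev split changes from run to run and dev scores are not directly comparable between runs. Below is a minimal sketch of a reproducible variant. The helper name `split_dataset`, the seed value, and the dummy records are assumptions made for illustration; the ratio follows the 8:2 split stated in the text.

```python
import random


def split_dataset(samples, train_ratio=0.8, seed=42):
    """Shuffle a copy of `samples` with a fixed seed and split it in two."""
    rng = random.Random(seed)   # local RNG, leaves the global random state alone
    shuffled = list(samples)    # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 97 loaded samples.
dummy = [{"sample_id": i} for i in range(97)]
train_part, dev_part = split_dataset(dummy)
print(len(train_part), len(dev_part))  # 77 20
```

With only 97 samples, the dev part holds about 20 of them, so dev metrics will swing noticeably depending on which samples land there.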
%cd ~

import json
import csv
from pprint import pprint
import random


# Schema: the nine entity types, each used as a prompt
key_words = "飞行器, 单兵武器, 炸弹, 装甲车辆, 火炮, 导弹, 舰船舰艇, 太空装备, 其他武器装备".split(", ")


# Convert the dataset format;
# the train and dev parts come from the 8:2 split above
def convert(source_data, key_word):
    """Build one UIE-style record per sample for the given entity type."""
    convert_target = []
    for item in source_data:
        # Spans of this entity type within the current sample
        result_list = []
        for item2 in item["annotations"]:
            if item2['type'] == key_word:
                result_list.append({
                    'text': item2['text'],
                    'start': item2['start'],
                    'end': item2['end'],
                })
        # One record per (sample, prompt) pair; an empty result_list
        # means the sample has no entity of this type
        temp = dict()
        temp['content'] = item['text']
        temp['result_list'] = result_list
        temp['prompt'] = key_word
        convert_target.append(temp)
    return convert_target


def convert_main(data,key_words):
    result=[]
    for key_word in key_words:
        temp_list = convert(data, key_word)
        result=result+temp_list
    random.shuffle(result)
    return result
    
    
# Converted record lists
train_data_convert = convert_main(train_data,key_words)
dev_data_convert = convert_main(dev_data,key_words)


# Write the converted records as JSON Lines, one record per line
with open('train.txt', 'w', encoding="utf-8") as f:
    for item in train_data_convert:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
with open('dev.txt', 'w', encoding="utf-8") as f:
    for item in dev_data_convert:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
/home/aistudio
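
Because `convert_main` emits one record per (sample, entity type) pair, many records have an empty `result_list` and act as negative examples. A quick way to see that balance is sketched below. The two demo records are made up, and `count_examples` is a helper name introduced here, not part of PaddleNLP.

```python
import json
from collections import Counter


def count_examples(lines):
    """Count positive and negative records per prompt in UIE-style JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        kind = "positive" if record["result_list"] else "negative"
        counts[(record["prompt"], kind)] += 1
    return counts


# Two made-up records in the same shape the conversion cell writes.
demo_lines = [
    json.dumps({"content": "F-16战斗机起飞。", "prompt": "飞行器",
                "result_list": [{"text": "F-16战斗机", "start": 0, "end": 7}]},
               ensure_ascii=False),
    json.dumps({"content": "F-16战斗机起飞。", "prompt": "导弹",
                "result_list": []}, ensure_ascii=False),
]
counts = count_examples(demo_lines)
print(counts)
```

On the real file, an open handle such as `open('train.txt', encoding='utf-8')` can be passed in place of `demo_lines`.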

III. Model Training

1. Environment Setup

This step mainly downloads and installs PaddleNLP.

# Clone PaddleNLP with git
!git clone https://gitee.com/paddlepaddle/PaddleNLP.git  --depth=1
Cloning into 'PaddleNLP'...
remote: Enumerating objects: 5825, done.
remote: Counting objects: 100% (5825/5825), done.
remote: Compressing objects: 100% (4099/4099), done.
remote: Total 5825 (delta 2254), reused 3581 (delta 1437), pack-reused 0
Receiving objects: 100% (5825/5825), 22.98 MiB | 1.19 MiB/s, done.
Resolving deltas: 100% (2254/2254), done.
Checking connectivity... done.
# Install or upgrade PaddleNLP
%cd ~/PaddleNLP
!pip install -U -e ./
from IPython.display import clear_output
clear_output() # clear the long install output

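2. Model Fine-Tuning

The Trainer API is the recommended way to fine-tune the model. Given a model, a dataset, and so on, it handles pre-training, fine-tuning, and model compression efficiently, and it can launch multi-GPU training, mixed-precision training, gradient accumulation, checkpoint resumption, and logging in one step. It also wraps the common training configuration, such as the optimizer and the learning-rate schedule.

Configurable parameters: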

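  • model_name_or_path: required; the pre-trained model used for few-shot training. Options are "uie-base", "uie-medium", "uie-mini", "uie-micro", "uie-nano", "uie-m-base", and "uie-m-large".
  • multilingual: whether the model is cross-lingual. Models fine-tuned from "uie-m-base" or "uie-m-large" are also multilingual and need this set to True; default False.
  • output_dir: required; the directory where the model is saved after training or compression; default None.
  • device: training device, one of 'cpu', 'gpu', or 'npu'; GPU training by default.
  • per_device_train_batch_size: batch size for training; adjust it to the available GPU memory and lower it if memory runs out; default 32.
  • per_device_eval_batch_size: batch size for evaluation on the dev set; adjust it to the available GPU memory and lower it if memory runs out; default 32.
  • learning_rate: maximum learning rate; 1e-5 is recommended for UIE; default 3e-5.
  • num_train_epochs: number of training epochs; 100 can be used together with early stopping; default 10.
  • logging_steps: number of steps between log outputs during training; default 100.
  • save_steps: number of steps between model checkpoints during training; default 100.
  • seed: global random seed; default 42.
  • weight_decay: weight decay applied to all layers except bias and LayerNorm weights; optional; default 0.0.
  • do_train: whether to fine-tune; passing this flag enables fine-tuning; not set by default.
  • do_eval: whether to evaluate; passing this flag enables evaluation.

Because this example passes --do_eval, evaluation runs automatically once training finishes.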

%cd ~/PaddleNLP/model_zoo/uie/
# !python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune.py 
!python finetune.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 50 \
    --eval_steps 50 \
    --seed 1000 \
    --model_name_or_path uie-medium \
    --output_dir ./checkpoint/model_best \
    --train_path ~/train.txt \
    --dev_path ~/dev.txt \
    --max_seq_length 512 \
    --per_device_eval_batch_size 16 \
    --per_device_train_batch_size 16 \
    --num_train_epochs 32 \
    --learning_rate 1e-5 \
    --label_names "start_positions" "end_positions" \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir ./checkpoint/model_best \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1


/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-19 16:20:21,620] [ WARNING] - evaluation_strategy reset to IntervalStrategy.STEPS for do_eval is True. you can also set evaluation_strategy='epoch'.
[2023-06-19 16:20:21,620] [    INFO] - The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2023-06-19 16:20:21,621] [    INFO] - ============================================================
[2023-06-19 16:20:21,621] [    INFO] -      Model Configuration Arguments      
[2023-06-19 16:20:21,621] [    INFO] - paddle commit id              :3fa7a736e32508e797616b6344d97814c37d3ff8
[2023-06-19 16:20:21,621] [    INFO] - export_model_dir              :./checkpoint/model_best
[2023-06-19 16:20:21,621] [    INFO] - model_name_or_path            :uie-medium
[2023-06-19 16:20:21,621] [    INFO] - multilingual                  :False
[2023-06-19 16:20:21,621] [    INFO] - 
[2023-06-19 16:20:21,621] [    INFO] - ============================================================
[2023-06-19 16:20:21,621] [    INFO] -       Data Configuration Arguments      
[2023-06-19 16:20:21,621] [    INFO] - paddle commit id              :3fa7a736e32508e797616b6344d97814c37d3ff8
[2023-06-19 16:20:21,621] [    INFO] - dev_path                      :/home/aistudio/dev.txt
[2023-06-19 16:20:21,621] [    INFO] - dynamic_max_length            :None
[2023-06-19 16:20:21,621] [    INFO] - max_seq_length                :512
[2023-06-19 16:20:21,622] [    INFO] - train_path                    :/home/aistudio/train.txt
[2023-06-19 16:20:21,622] [    INFO] - 
[2023-06-19 16:20:21,622] [ WARNING] - Process rank: -1, device: gpu, world_size: 1, distributed training: False, 16-bits training: False
[2023-06-19 16:20:21,622] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'uie-medium'.
[2023-06-19 16:20:21,622] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_medium_zh_vocab.txt and saved to /home/aistudio/.paddlenlp/models/uie-medium
[2023-06-19 16:20:21,752] [    INFO] - Downloading ernie_3.0_medium_zh_vocab.txt from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_medium_zh_vocab.txt
100%|█████████████████████████████████████████| 182k/182k [00:00<00:00, 889kB/s]
[2023-06-19 16:20:22,092] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/uie-medium/tokenizer_config.json
[2023-06-19 16:20:22,092] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/uie-medium/special_tokens_map.json
[2023-06-19 16:20:22,093] [    INFO] - Model config ErnieConfig {
  "attention_probs_dropout_prob": 0.1,
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 16,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}
[2023-06-19 16:20:22,094] [    INFO] - Configuration saved in /home/aistudio/.paddlenlp/models/uie-medium/config.json
[2023-06-19 16:20:22,095] [    INFO] - Downloading uie_medium.pdparams from https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_medium.pdparams
100%|████████████████████████████████████████| 288M/288M [02:40<00:00, 1.88MB/s]
W0619 16:23:05.005470  1653 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0619 16:23:05.010092  1653 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-19 16:23:05,669] [    INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-19 16:23:05,669] [    INFO] - All the weights of UIE were initialized from the model checkpoint at uie-medium.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
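
The `--save_steps 50` and `--eval_steps 50` values in the command above are easier to judge once the number of optimizer steps per epoch is known. The sketch below is a back-of-the-envelope estimate under several assumptions: a single GPU, no gradient accumulation, the final partial batch is kept, and every line of train.txt becomes exactly one training example (the data reader may split long texts, which would raise the count). The record count of 70 samples times 9 prompts is illustrative only, and `training_steps` is a helper name introduced here.

```python
import math


def training_steps(num_records, batch_size, epochs):
    """Return (steps_per_epoch, total_steps), keeping the last partial batch."""
    steps_per_epoch = math.ceil(num_records / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# Illustrative numbers: 70 samples x 9 prompts, batch size 16, 32 epochs.
per_epoch, total = training_steps(num_records=70 * 9, batch_size=16, epochs=32)
print(per_epoch, total)  # 40 1280
```

Under these assumptions, evaluating every 50 steps means evaluating a little less often than once per epoch.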

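IV. Model Evaluation

1. Evaluating the Model

Configurable parameters:

  • model_path: path of the model folder to evaluate; it must contain the weight file model_state.pdparams and the config file model_config.json.
  • test_path: the test set file used for evaluation.
  • batch_size: batch size; adjust it to your machine; default 16.
  • max_seq_len: maximum text split length; inputs longer than this are split automatically; default 512.
  • debug: whether to enable debug mode, which evaluates each positive class separately; for model debugging only; off by default.
  • multilingual: whether the model is cross-lingual; off by default.
  • schema_lang: language of the schema, either ch or en; default ch; choose en for English datasets.

Run the following command to evaluate the model: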

%cd ~/PaddleNLP/model_zoo/uie/

!python evaluate.py \
    --model_path ./checkpoint0.67789894/model_best \
    --test_path ~/dev.txt \
    --batch_size 16 \
    --max_seq_len 512

# --model_path ./checkpoint/model_best 
/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 19:58:07,106] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint0.67789894/model_best'.
[2023-06-03 19:58:07,132] [    INFO] - loading configuration file ./checkpoint0.67789894/model_best/config.json
[2023-06-03 19:58:07,133] [    INFO] - Model config ErnieConfig {
  "architectures": [
    "UIE"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}
W0603 19:58:09.367168  9655 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 19:58:09.370771  9655 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 19:58:10,168] [    INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-03 19:58:10,169] [    INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint0.67789894/model_best.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 19:58:15,447] [    INFO] - -----------------------------
[2023-06-03 19:58:15,447] [    INFO] - Class Name: all_classes
[2023-06-03 19:58:15,447] [    INFO] - Evaluation Precision: 0.92920 | Recall: 0.86066 | F1: 0.89362
%cd ~/PaddleNLP/model_zoo/uie/

!python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ~/dev.txt \
    --batch_size 16 \
    --max_seq_len 512

/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 19:58:59,933] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best'.
[2023-06-03 19:58:59,958] [    INFO] - loading configuration file ./checkpoint/model_best/config.json
[2023-06-03 19:58:59,960] [    INFO] - Model config ErnieConfig {
  "architectures": [
    "UIE"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}
W0603 19:59:02.195425  9944 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 19:59:02.199051  9944 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 19:59:02,995] [    INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-03 19:59:02,996] [    INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint/model_best.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 19:59:08,312] [    INFO] - -----------------------------
[2023-06-03 19:59:08,312] [    INFO] - Class Name: all_classes
[2023-06-03 19:59:08,312] [    INFO] - Evaluation Precision: 0.89922 | Recall: 0.95082 | F1: 0.92430
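
The F1 values in these logs are the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a small worked check, the sketch below recomputes F1 from the precision and recall pairs printed by the two evaluation runs above. The function name `f1_score` is introduced here for illustration; since the inputs are already rounded to five decimals, the results should agree with the logged values only up to rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs copied from the two evaluation logs.
runs = {
    "checkpoint0.67789894/model_best": (0.92920, 0.86066),
    "checkpoint/model_best": (0.89922, 0.95082),
}
for name, (p, r) in runs.items():
    print(f"{name}: F1 = {f1_score(p, r):.5f}")
```

The second checkpoint trades some precision for clearly higher recall, which is what lifts its F1 on this dev set.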

2. Evaluating the Model in Debug Mode

Debug mode evaluates each positive class separately; it is intended for model debugging only:

%cd ~/PaddleNLP/model_zoo/uie/

!python evaluate.py \
    --model_path ./checkpoint0.67789894/model_best \
    --test_path ~/dev.txt \
    --debug

# --model_path ./checkpoint/model_best 
/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 19:59:51,590] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint0.67789894/model_best'.
[2023-06-03 19:59:51,617] [    INFO] - loading configuration file ./checkpoint0.67789894/model_best/config.json
[2023-06-03 19:59:51,618] [    INFO] - Model config ErnieConfig {
  "architectures": [
    "UIE"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}
W0603 19:59:53.878067 10161 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 19:59:53.881646 10161 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 19:59:54,673] [    INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-03 19:59:54,673] [    INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint0.67789894/model_best.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 19:59:55,584] [    INFO] - -----------------------------
[2023-06-03 19:59:55,585] [    INFO] - Class Name: 炸弹
[2023-06-03 19:59:55,585] [    INFO] - Evaluation Precision: 0.86667 | Recall: 0.86667 | F1: 0.86667
[2023-06-03 19:59:55,660] [    INFO] - -----------------------------
[2023-06-03 19:59:55,660] [    INFO] - Class Name: 装甲车辆
[2023-06-03 19:59:55,660] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:55,809] [    INFO] - -----------------------------
[2023-06-03 19:59:55,809] [    INFO] - Class Name: 火炮
[2023-06-03 19:59:55,809] [    INFO] - Evaluation Precision: 0.84615 | Recall: 0.73333 | F1: 0.78571
[2023-06-03 19:59:55,903] [    INFO] - -----------------------------
[2023-06-03 19:59:55,904] [    INFO] - Class Name: 舰船舰艇
[2023-06-03 19:59:55,904] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:55,966] [    INFO] - -----------------------------
[2023-06-03 19:59:55,966] [    INFO] - Class Name: 飞行器
[2023-06-03 19:59:55,966] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:56,072] [    INFO] - -----------------------------
[2023-06-03 19:59:56,072] [    INFO] - Class Name: 单兵武器
[2023-06-03 19:59:56,072] [    INFO] - Evaluation Precision: 0.90000 | Recall: 0.75000 | F1: 0.81818
[2023-06-03 19:59:56,169] [    INFO] - -----------------------------
[2023-06-03 19:59:56,170] [    INFO] - Class Name: 太空装备
[2023-06-03 19:59:56,170] [    INFO] - Evaluation Precision: 0.95833 | Recall: 0.95833 | F1: 0.95833
[2023-06-03 19:59:56,227] [    INFO] - -----------------------------
[2023-06-03 19:59:56,228] [    INFO] - Class Name: 导弹
[2023-06-03 19:59:56,228] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:56,308] [    INFO] - -----------------------------
[2023-06-03 19:59:56,308] [    INFO] - Class Name: 其他武器装备
[2023-06-03 19:59:56,308] [    INFO] - Evaluation Precision: 1.00000 | Recall: 0.50000 | F1: 0.66667
%cd ~/PaddleNLP/model_zoo/uie/

!python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ~/dev.txt \
    --debug
/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 20:00:12,238] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best'.
[2023-06-03 20:00:12,263] [    INFO] - loading configuration file ./checkpoint/model_best/config.json
[2023-06-03 20:00:12,265] [    INFO] - Model config ErnieConfig {
  "architectures": [
    "UIE"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}
W0603 20:00:14.519615 10292 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 20:00:14.523311 10292 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 20:00:15,331] [    INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-03 20:00:15,332] [    INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint/model_best.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 20:00:16,208] [    INFO] - -----------------------------
[2023-06-03 20:00:16,208] [    INFO] - Class Name: 炸弹
[2023-06-03 20:00:16,208] [    INFO] - Evaluation Precision: 0.93750 | Recall: 1.00000 | F1: 0.96774
[2023-06-03 20:00:16,281] [    INFO] - -----------------------------
[2023-06-03 20:00:16,281] [    INFO] - Class Name: 装甲车辆
[2023-06-03 20:00:16,281] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 20:00:16,425] [    INFO] - -----------------------------
[2023-06-03 20:00:16,426] [    INFO] - Class Name: 火炮
[2023-06-03 20:00:16,426] [    INFO] - Evaluation Precision: 0.92308 | Recall: 0.80000 | F1: 0.85714
[2023-06-03 20:00:16,519] [    INFO] - -----------------------------
[2023-06-03 20:00:16,519] [    INFO] - Class Name: 舰船舰艇
[2023-06-03 20:00:16,519] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 20:00:16,579] [    INFO] - -----------------------------
[2023-06-03 20:00:16,579] [    INFO] - Class Name: 飞行器
[2023-06-03 20:00:16,579] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 20:00:16,681] [    INFO] - -----------------------------
[2023-06-03 20:00:16,681] [    INFO] - Class Name: 单兵武器
[2023-06-03 20:00:16,681] [    INFO] - Evaluation Precision: 0.91667 | Recall: 0.91667 | F1: 0.91667
[2023-06-03 20:00:16,775] [    INFO] - -----------------------------
[2023-06-03 20:00:16,775] [    INFO] - Class Name: 太空装备
[2023-06-03 20:00:16,775] [    INFO] - Evaluation Precision: 0.88889 | Recall: 1.00000 | F1: 0.94118
[2023-06-03 20:00:16,830] [    INFO] - -----------------------------
[2023-06-03 20:00:16,830] [    INFO] - Class Name: 导弹
[2023-06-03 20:00:16,830] [    INFO] - Evaluation Precision: 0.75000 | Recall: 1.00000 | F1: 0.85714
[2023-06-03 20:00:16,907] [    INFO] - -----------------------------
[2023-06-03 20:00:16,908] [    INFO] - Class Name: 其他武器装备
[2023-06-03 20:00:16,908] [    INFO] - Evaluation Precision: 0.80000 | Recall: 0.85714 | F1: 0.82759
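
Comparing the two debug logs by eye is tedious with nine classes. Below is a minimal sketch that parses the `Class Name:` and `Evaluation Precision:` lines into a dictionary so two checkpoints can be compared side by side. The regular expressions are written against the log lines shown above and are an assumption about their format, not an interface provided by PaddleNLP; `parse_eval_log` is a helper name introduced here. The two excerpts reuse the 火炮 and 导弹 lines from the two runs.

```python
import re

# The class name stops at whitespace, "[" or ESC, so trailing colour codes are ignored.
CLASS_RE = re.compile(r"Class Name: ([^\s\[\x1b]+)")
SCORE_RE = re.compile(r"Precision: ([\d.]+) \| Recall: ([\d.]+) \| F1: ([\d.]+)")


def parse_eval_log(log_text):
    """Map each class name to its (precision, recall, f1) tuple."""
    scores, current = {}, None
    for line in log_text.splitlines():
        m = CLASS_RE.search(line)
        if m:
            current = m.group(1)
            continue
        m = SCORE_RE.search(line)
        if m and current is not None:
            scores[current] = tuple(float(x) for x in m.groups())
            current = None
    return scores


log_a = """
[INFO] - Class Name: 火炮
[INFO] - Evaluation Precision: 0.84615 | Recall: 0.73333 | F1: 0.78571
[INFO] - Class Name: 导弹
[INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
"""
log_b = """
[INFO] - Class Name: 火炮
[INFO] - Evaluation Precision: 0.92308 | Recall: 0.80000 | F1: 0.85714
[INFO] - Class Name: 导弹
[INFO] - Evaluation Precision: 0.75000 | Recall: 1.00000 | F1: 0.85714
"""
a, b = parse_eval_log(log_a), parse_eval_log(log_b)
for name in a:
    print(f"{name}: F1 {a[name][2]:.5f} -> {b[name][2]:.5f}")
```

Per-class numbers on a dev set this small rest on a handful of entities each, so differences between checkpoints should be read with caution.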

V. Prediction

1. Loading the Test Set

%cd ~/PaddleNLP/model_zoo/uie/

import json
import csv
from pprint import pprint

# Load the JSON file
with open('/home/aistudio/data/data218296/ner_test.json', 'r', encoding='utf-8') as f:
    test_data = json.load(f)
print(f"Dataset size: {len(test_data)}")
print("Sample record:")
pprint(test_data[0])
/home/aistudio/PaddleNLP/model_zoo/uie
Dataset size: 5920
Sample record:
{'sample_id': 0,
 'text': '第五艘西班牙海军F-100级护卫舰即将装备集成通信控制系统。该系统由葡萄牙EID公司生产。该系统已经用于巴西海军的圣保罗航母,荷兰海军的四艘荷兰级海上巡逻舰和四艘西班牙海军BAM近海巡逻舰。F-105护卫舰于2009年初铺设龙骨。该舰预计2010年建造完成,2012年夏交付。'}

2. Setting the Extraction Targets and the Custom Model Weights Path

from pprint import pprint
from paddlenlp import Taskflow

schema = "飞行器, 单兵武器, 炸弹, 装甲车辆, 火炮, 导弹, 舰船舰艇, 太空装备, 其他武器装备".split(", ")
# Set the extraction targets and the path to the fine-tuned model weights
# PaddleNLP/model_zoo/uie/checkpoint0.67789894
my_ie = Taskflow("information_extraction", schema=schema,  task_path='./checkpoint0.67789894/model_best')
# my_ie = Taskflow("information_extraction", schema=schema,  task_path='./checkpoint/model_best')
[2023-06-03 20:04:27,292] [    INFO] - loading configuration file ./checkpoint0.67789894/model_best/config.json
[2023-06-03 20:04:27,297] [    INFO] - Model config ErnieConfig {
  "architectures": [
    "UIE"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}

W0603 20:04:27.792821  9916 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 20:04:27.796425  9916 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 20:04:28,581] [    INFO] - All model checkpoint weights were used when initializing UIE.

[2023-06-03 20:04:28,584] [    INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint0.67789894/model_best.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 20:04:28,590] [    INFO] - Converting to the inference model cost a little time.
[2023-06-03 20:04:42,631] [    INFO] - The inference model save in the path:./checkpoint0.67789894/model_best/static/inference
[2023-06-03 20:04:44,793] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint0.67789894/model_best'.

3. Prediction

list_text=[]
for item in test_data:
    list_text.append(item['text'])
%%time

results=my_ie(list_text)
CPU times: user 16min 20s, sys: 2min 27s, total: 18min 47s
Wall time: 18min 45s
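
A single call over all 5,920 texts gives no feedback for almost 19 minutes. One option is to feed the extractor in chunks and report progress, as sketched below. This assumes the extractor accepts a list of texts and returns a list of the same length, which matches how `my_ie(list_text)` behaves in this run (5,920 texts in, 5,920 results out). `predict_in_chunks`, `fake_extract`, and the chunk size are illustrative names and values introduced here.

```python
def predict_in_chunks(extract, texts, chunk_size=256):
    """Call `extract` on slices of `texts` and concatenate the results."""
    results = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        results.extend(extract(chunk))
        done = min(start + chunk_size, len(texts))
        print(f"processed {done}/{len(texts)}")
    return results


# Stand-in extractor so the sketch runs without a model:
# it returns one empty result dict per input text.
def fake_extract(batch):
    return [{} for _ in batch]


demo = predict_in_chunks(fake_extract, ["a", "b", "c", "d", "e"], chunk_size=2)
print(len(demo))  # 5
```

In the notebook the call would look like `predict_in_chunks(my_ie, list_text)`, and the partial `results` list could be written to disk between chunks if the session is at risk of timing out.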
print(len(results))
5920
print(results[0])
{'舰船舰艇': [{'text': 'F-105护卫舰', 'start': 95, 'end': 103, 'probability': 0.9999301440846722}, {'text': 'F-100级护卫舰', 'start': 8, 'end': 17, 'probability': 0.999897838723939}, {'text': 'BAM近海巡逻舰', 'start': 86, 'end': 94, 'probability': 0.9974057948851112}, {'text': '荷兰级海上巡逻舰', 'start': 70, 'end': 78, 'probability': 0.9998418145644905}, {'text': '圣保罗航母', 'start': 57, 'end': 62, 'probability': 0.9989470402669269}]}
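
Each value in a result dict is a list of spans with `start`, `end`, and `probability`; in this output `end` is exclusive (for example, 'F-100级护卫舰' runs from 8 to 17). The sketch below shows two post-processing steps one might add before building the submission: checking the offsets against the text and dropping low-confidence spans. The 0.5 threshold and the second span are made up for illustration, and this filtering is not part of the original pipeline.

```python
def clean_result(text, result, min_prob=0.5):
    """Keep spans whose offsets match `text` and whose probability is at least
    `min_prob`, sorted by start position within each entity type."""
    cleaned = {}
    for etype, spans in result.items():
        kept = [
            s for s in spans
            if text[s["start"]:s["end"]] == s["text"] and s["probability"] >= min_prob
        ]
        if kept:
            cleaned[etype] = sorted(kept, key=lambda s: s["start"])
    return cleaned


# First sentence of the sample text shown earlier.
text = "第五艘西班牙海军F-100级护卫舰即将装备集成通信控制系统。"
result = {"舰船舰艇": [
    {"text": "F-100级护卫舰", "start": 8, "end": 17, "probability": 0.9998},
    # Made-up low-confidence span, added only to exercise the filter.
    {"text": "通信控制系统", "start": 23, "end": 29, "probability": 0.31},
]}
print(clean_result(text, result))
```

Whether a threshold helps the competition score depends on the precision/recall balance of the model, so any value should be tuned on the dev set.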
# Save the raw Taskflow output
with open('/home/aistudio/result_list.json','w', encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False)
# Flatten the per-sample dicts into one record per entity
results_list=[]
for i in range(len(test_data)):
    for key, item in results[i].items(): 
        for ii in range(len(item)):
            temp_result=dict()
            temp_result['sample_id']=test_data[i]['sample_id']
            temp_result['text']=item[ii]['text']
            temp_result['type']=key
            temp_result['start']=item[ii]['start']
            temp_result['end']=item[ii]['end']
            results_list.append(temp_result)

with open('/home/aistudio/result.json','w', encoding="utf-8") as f:
    json.dump(results_list, f,indent=4, ensure_ascii=False)
# Alternative: predict one sample at a time and build the flat records directly
results=[]
for i in range(len(test_data)):
    uie_result=my_ie(test_data[i]['text'])
    # pprint(uie_result)
    for key, item in uie_result[0].items(): 
        for ii in range(len(item)):
            temp_result=dict()
            temp_result['sample_id']=test_data[i]['sample_id']
            temp_result['text']=item[ii]['text']
            temp_result['type']=key
            temp_result['start']=item[ii]['start']
            temp_result['end']=item[ii]['end']
            results.append(temp_result)
print(len(results))

4. Saving the Results

with open('result.json','w', encoding="utf-8") as f:
    json.dump(results, f,indent=4, ensure_ascii=False)
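
Before uploading, it can help to sanity-check the flat records. The official submission format is not described in this article, so the sketch below only mirrors the fields that the code above writes (`sample_id`, `text`, `type`, `start`, `end`) and the nine schema labels; `validate_records` and the two demo records are illustrative.

```python
REQUIRED_KEYS = {"sample_id", "text", "type", "start", "end"}
SCHEMA = "飞行器, 单兵武器, 炸弹, 装甲车辆, 火炮, 导弹, 舰船舰艇, 太空装备, 其他武器装备".split(", ")


def validate_records(records, schema=SCHEMA):
    """Return a list of human-readable problems found in submission records."""
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            problems.append(f"record {i}: missing {sorted(missing)}")
            continue
        if rec["type"] not in schema:
            problems.append(f"record {i}: unknown type {rec['type']!r}")
        if not 0 <= rec["start"] < rec["end"]:
            problems.append(f"record {i}: bad span {rec['start']}..{rec['end']}")
    return problems


good = {"sample_id": 0, "text": "圣保罗航母", "type": "舰船舰艇", "start": 57, "end": 62}
bad = {"sample_id": 1, "text": "x", "type": "卫星", "start": 5, "end": 5}
print(validate_records([good, bad]))
```

An empty list means none of these particular checks failed; it does not guarantee that the file matches what the competition platform expects.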

VI. Submission

