
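After a month of work on the MNIST handwritten-digit recognition task, we now have a complete, working Pipeline on Kubeflow and a grasp of the basic usage of its core components.

This chapter clears up a few leftover issues (building a private Elyra image, automatic hyperparameter tuning with Katib), covers the last few core Kubeflow features, and wraps up the series.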

Building a private Elyra image

First, the private Elyra image.

As mentioned last week, every time Elyra starts a component container it performs two steps:

download bootstrapper.py, requirements-elyra.txt and requirements-elyra-py37.txt from GitHub;
run pip install -r requirements-elyra.txt (or requirements-elyra-py37.txt).

We changed Elyra in two places:

store the three files in our local MinIO and point the download path there;
switch the pip index to the Tsinghua mirror.

After the rebuild, Elyra ran end to end, and much faster than before.
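To make the two changes concrete, here is a minimal sketch of what the patched bootstrap step boils down to: download URLs that point at MinIO instead of GitHub, and a pip command that uses the Tsinghua mirror. The MinIO base URL and bucket name are placeholders I made up, and the helper names are hypothetical; this is not Elyra's actual code.

```python
from typing import List

# Hypothetical values: this MinIO endpoint and bucket are placeholders,
# not the ones used in the post.
MINIO_BASE_URL = "http://minio.example.internal:9000/elyra-bootstrap"
TSINGHUA_INDEX = "https://pypi.tuna.tsinghua.edu.cn/simple"

BOOTSTRAP_FILES = (
    "bootstrapper.py",
    "requirements-elyra.txt",
    "requirements-elyra-py37.txt",
)


def download_urls(base_url: str = MINIO_BASE_URL) -> List[str]:
    """Return one download URL per bootstrap file, served from MinIO."""
    return [f"{base_url.rstrip('/')}/{name}" for name in BOOTSTRAP_FILES]


def pip_install_command(requirements: str, index_url: str = TSINGHUA_INDEX) -> List[str]:
    """Build a pip command that installs from the mirror instead of pypi.org."""
    return ["python3", "-m", "pip", "install", "-r", requirements, "-i", index_url]


urls = download_urls()
command = pip_install_command("requirements-elyra.txt")
print(urls)
print(" ".join(command))
```

Both downloads and installs then stay inside the local network, which is where the speed-up comes from.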

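Automatic hyperparameter tuning with Katib

The second item, also mentioned last week, is using the Katib tune API to tune the hyperparameters of the PyTorch MNIST training job.

To keep the code structure clear, I put the training process in a separate script: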

------------------train_model.py--------------------

def main(parameters):
    import torch
    import numpy as np
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    from torch.utils.data import Dataset, DataLoader
    import torch.distributed as dist
    import logging
    from minio import Minio
    import os

    # Define the network architecture
    class Net(nn.Module):
        def __init__(self):
            ...

        def forward(self, x):
            ...

    # Define the Dataset so it can be fed to a DataLoader
    class Mnistset(Dataset):
        def __init__(self, x, y):
            ...

        def __getitem__(self, index):
            ...

        def __len__(self):
            ...

    # Define the training loop
    def train(...):
        ...
      
    # Key point: emit metrics through logging.info in a fixed format so the tune API can capture them
    def test(model, device, test_loader, epoch):
        model.eval()
        test_loss = 0
        correct = 0
        with torch.no_grad():
            for data, target in test_loader:
                data, target = data.to(device), target.to(device)
                data = data.float()
                output = model(data)
                test_loss += F.nll_loss(output, target, reduction="sum").item()  # sum up batch loss
                pred = output.max(1, keepdim=True)[1]  # get the index of the max log-probability
                correct += pred.eq(target.view_as(pred)).sum().item()

        test_loss /= len(test_loader.dataset)
        test_accuracy = float(correct) / len(test_loader.dataset)
        logging.info("Epoch {}. accuracy={:.4f} - loss={:.4f}".format(epoch, test_accuracy, test_loss))
    
    logging.basicConfig(
        format="%(asctime)s %(levelname)-8s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%SZ",
        level=logging.INFO,
    )
    logging.info("--------------------------------------------------------------------------------------")
    logging.info(f"Input Parameters: {parameters}")
    logging.info("--------------------------------------------------------------------------------------\n\n")

    lr = float(parameters["lr"])
    momentum = float(parameters["momentum"])
    epochs = int(parameters["epochs"])
    no_cuda = parameters["no_cuda"]
    endpoint = parameters["endpoint"]
    access_key = parameters["access_key"]
    secret_key = parameters["secret_key"]
    bucket_name = parameters["bucket_name"]
    object_name = parameters["object_name"]
    model_name = parameters["model_name"]
    save_model = parameters["save_model"]
    
    use_cuda = not no_cuda and torch.cuda.is_available()
    if use_cuda:
        print("Using CUDA")

    device = torch.device("cuda" if use_cuda else "cpu")
    
    kwargs = {"num_workers": 1, "pin_memory": True} if use_cuda else {}

    # Download the data from local MinIO
    client = Minio(endpoint,access_key,secret_key,secure=False)
    client.fget_object(bucket_name,object_name,object_name)
    
    with np.load(object_name) as f:
        x_train, y_train = f['x_train'], f['y_train']
        x_test, y_test = f['x_test'], f['y_test']

    # Build the DataLoaders
    train_set = Mnistset(x_train, y_train)
    train_loader = DataLoader(train_set,batch_size=64,shuffle=True,**kwargs)
    
    test_set = Mnistset(x_test, y_test)
    test_loader = DataLoader(test_set,batch_size=1000,shuffle=False,**kwargs)

    model = Net().to(device)

    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

    # Train
    for epoch in range(1, epochs + 1):
        train(model, device, train_loader, optimizer, epoch)
        test(model, device, test_loader, epoch)

Then katib.KatibClient().tune() lets us start tuning right away:

from kubeflow import katib
from train_model import main

exp_name = "tune-pytorch-mnist"
katib_client = katib.KatibClient()

parameters = {
    "lr": katib.search.double(min=0.01, max=0.2),
    "momentum": katib.search.double(min=0.1, max=1),
    "epochs": katib.search.int(min=1, max=5),
    "no_cuda": True,
    "endpoint": "47.96.106.97:9000",
    "access_key": "minio",
    "secret_key": "minio123",
    "bucket_name": "lifu963",
    "object_name": "mnist.npz",
    "model_name": "model.onnx",
    "save_model": False,  # main() reads this key; tuning trials do not need to save the model
}

katib_client.tune(
    name=exp_name,
    objective=main, 
    parameters=parameters,
    algorithm_name="cmaes", 
    objective_metric_name="accuracy", 
    additional_metric_names=["loss"],
    max_trial_count=12, 
    parallel_trial_count=2,
    base_image='...',
)

The tune API takes the following parameters:

name: str, experiment name
objective: Callable, objective function
parameters: Dict[str, Any], parameters of the objective function, including those to tune
base_image: str = 'docker.io/tensorflow/tensorflow:2.9.1', base image the objective function runs in
namespace: str = 'kubeflow-user-example-com',
algorithm_name: str = 'random', search algorithm
objective_metric_name: str = None, primary metric to optimize
additional_metric_names: List[str] = [], secondary metrics
objective_type: str = 'maximize',
objective_goal: float = None,
max_trial_count: int = None,
parallel_trial_count: int = None,
max_failed_trial_count: int = None,
retain_trials: bool = False,
packages_to_install: List[str] = None, pip packages to install
pip_index_url: str = 'https://pypi.org/simple', pip index URL

That is quite comprehensive. The main drawback is that the logged metrics must follow a fixed format, {metric_1}={:.4f} - {metric_2}={:.4f} ..., and that format cannot be customized.
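To see why the format matters, here is a small sketch of how name=value pairs can be pulled out of such a log line. The regular expression is a simplified illustration I wrote for this post, not Katib's own metrics filter, and parse_metrics is a hypothetical helper.

```python
import re
from typing import Dict

# Simplified pattern for "name=value" pairs; an illustration only,
# not the filter Katib itself uses.
METRIC_PATTERN = re.compile(r"([\w-]+)\s*=\s*([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)")


def parse_metrics(log_line: str) -> Dict[str, float]:
    """Extract every name=value pair from one log line."""
    return {name: float(value) for name, value in METRIC_PATTERN.findall(log_line)}


# Same message format as the logging.info call in test(); the numbers are made up.
line = "Epoch {}. accuracy={:.4f} - loss={:.4f}".format(3, 0.9871, 0.0412)
print(parse_metrics(line))
```

A line that does not contain the objective metric in this name=value shape simply yields nothing for that metric, which is why the training script has to stick to the format.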

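Tuning results:

[Screenshot: Katib tuning results]

Other Katib APIs can print information about the tuning run.

Here we poll the current best hyperparameters every 10 seconds; once they are non-empty, we record them for the real training run: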

import json
import time

# Record the current experiment status
status = katib_client.get_experiment_status(exp_name)
print(f"Katib Experiment status: {status}\n")

best_hps = {}

# Poll every 10 seconds until Katib reports a current optimal trial
while best_hps == {}:
    best_hps = katib_client.get_optimal_hyperparameters(exp_name)  # current best parameters
    if best_hps == {}:
        time.sleep(10)

print("Current Optimal Trial\n")
print(json.dumps(best_hps, indent=4))

for hp in best_hps["currentOptimalTrial"]["parameterAssignments"]:
    if hp["name"] == "lr":
        best_lr = hp["value"]
    elif hp["name"] == "momentum":
        best_momentum = hp["value"]
    elif hp["name"] == "epochs":
        best_epochs = hp["value"]

With the best parameters collected, we can use them in the next step, the actual model training.
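The if/elif chain can also be written as a small helper that turns the result into a name-to-value mapping. This is a sketch under the assumption that the result is a dict shaped like the one indexed above, with values as strings; extract_assignments is a hypothetical name and the sample data is made up.

```python
from typing import Any, Dict


def extract_assignments(best_hps: Dict[str, Any]) -> Dict[str, str]:
    """Map hyperparameter name -> value from an optimal-trial dict."""
    trial = best_hps.get("currentOptimalTrial", {})
    return {hp["name"]: hp["value"] for hp in trial.get("parameterAssignments", [])}


# Hand-written sample in the shape used above; the numbers are made up.
sample = {
    "currentOptimalTrial": {
        "parameterAssignments": [
            {"name": "lr", "value": "0.05"},
            {"name": "momentum", "value": "0.9"},
            {"name": "epochs", "value": "4"},
        ]
    }
}

assignments = extract_assignments(sample)
best_lr = float(assignments["lr"])
best_momentum = float(assignments["momentum"])
best_epochs = int(assignments["epochs"])
print(best_lr, best_momentum, best_epochs)
```

Converting to float and int up front mirrors what main() does with its parameters and avoids passing strings around by accident.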

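Distributed training with PyTorchJob

Kubeflow is a machine learning platform deployed on k8s; if it could not do distributed training, what would be the point?

Distributed training on Kubeflow takes two things:

using distributed training in the training script;
deploying the distributed training job on Kubeflow.

Let's start with the second.

Deploying distributed training on Kubeflow

PyTorchJob feels similar to Katib but is a bit easier to get started with. Its native workflow is likewise to deploy a YAML file directly to start distributed training; a Python SDK came later, along with higher-level APIs that wrap it.

First, its YAML file:

[Screenshot: PyTorchJob YAML manifest]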

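The PyTorchJob YAML has a clear structure: it defines a Master replica and Worker replicas, and each one defines a container, specifying its image, command, command arguments, required resources and so on.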
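As a rough sketch of that structure, the manifest can be written as a Python dict with the same nesting as the YAML. The job name, namespace, image, command and resource limits below are placeholders I made up; the apiVersion kubeflow.org/v1 and the container name pytorch follow the Training Operator convention as I understand it.

```python
import json
from typing import Any, Dict, List


def replica_spec(replicas: int, image: str, command: List[str]) -> Dict[str, Any]:
    """One replica group (Master or Worker) with a single container."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "pytorch",
                        "image": image,
                        "command": command,
                        # Placeholder resource limits
                        "resources": {"limits": {"cpu": "1", "memory": "2Gi"}},
                    }
                ]
            }
        },
    }


def pytorchjob_manifest(
    name: str, namespace: str, image: str, command: List[str], num_workers: int
) -> Dict[str, Any]:
    """Build a PyTorchJob manifest with one Master and num_workers Workers."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1, image, command),
                "Worker": replica_spec(num_workers, image, command),
            }
        },
    }


# All argument values here are hypothetical.
manifest = pytorchjob_manifest(
    name="train-pytorch",
    namespace="kubeflow-user-example-com",
    image="registry.example.com/mnist-train:latest",
    command=["python", "train_model.py"],
    num_workers=3,
)
print(json.dumps(manifest, indent=2))
```

Dumping the dict as YAML or JSON gives something that can be applied with kubectl, which is essentially what the SDK does on our behalf.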

The Python SDK makes the same structure even easier to read:

[Screenshot: PyTorchJob defined with the Python SDK]

We can also deploy distributed training directly with the high-level API create_pytorchjob_from_func:

from train_model import main
from kubeflow.training import PyTorchJobClient

# Set lr, momentum, etc. yourself
parameters = {
    "lr": lr,
    "momentum": momentum,
    "epochs": epochs,
    "no_cuda": no_cuda,
    "endpoint": endpoint,
    "access_key": access_key,
    "secret_key": secret_key,
    "bucket_name": bucket_name,
    "object_name": object_name,
    "model_name": model_name,
    "save_model": save_model,
    "backend": backend,
}

job_name = "train-pytorch"
job_client = PyTorchJobClient()

job_client.create_pytorchjob_from_func(
    name=job_name,
    func=main,
    parameters=parameters,
    base_image="...",
    num_worker_replicas=3,  # number of worker replicas
)

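Once training has started, other API calls let us check on its progress:

[Screenshot: checking the PyTorchJob status through the SDK]

Using distributed training in the training script

Making the training script distributed is not hard either.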

def main(parameters):
    import ...
    
    RANK = int(os.environ.get("RANK", 0))
    WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))

    class Net(nn.Module):
          ...

    class Mnistset(Dataset):
          ...

        
    def train(...):
        ...
            
    def test(...):
        ...
    
    def should_distribute():
        return dist.is_available() and WORLD_SIZE > 1
    
    def is_distributed():
        return dist.is_available() and dist.is_initialized()
    
    logging.basicConfig(...)
    logging.info("...")

    lr = float(parameters["lr"])
    momentum = float(parameters["momentum"])
    ...
    
    if dist.is_available():
        backend = parameters["backend"]

    use_cuda = not no_cuda and torch.cuda.is_available()
    if use_cuda:
        print("Using CUDA")

    device = torch.device("cuda" if use_cuda else "cpu")
    
    if should_distribute():
        print("Using distributed PyTorch with {} backend".format(backend))
        dist.init_process_group(backend=backend, rank=RANK, world_size=WORLD_SIZE)

    kwargs = {"num_workers": 1, "pin_memory": True} if use_cuda else {}
    
    client = Minio(endpoint,access_key,secret_key,secure=False)
    client.fget_object(bucket_name,object_name,object_name)
    
    with np.load(object_name) as f:
        x_train, y_train = f['x_train'], f['y_train']
        x_test, y_test = f['x_test'], f['y_test']
    
    train_set = Mnistset(x_train, y_train)
    train_loader =  DataLoader(train_set,...)
    
    test_set = Mnistset(x_test, y_test)
    test_loader =  DataLoader(test_set,...)

    model = Net().to(device)
    
    if is_distributed():
        Distributor = nn.parallel.DistributedDataParallel
        model = Distributor(model)

    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    
    logging.info(f"Start training for RANK: {RANK}. WORLD_SIZE: {WORLD_SIZE}")
    for epoch in range(1, epochs + 1):
        train(...)
        test(...)
        
    if save_model:
       ...

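The changes, step by step:

first, read RANK and WORLD_SIZE; the Kubeflow Training Operator sets sensible values for both automatically, based on the job configuration;
write should_distribute to check whether distributed training can be enabled, and is_distributed to check whether it is currently active;
if dist.is_available(), read the backend value;
if should_distribute(), initialize distributed training with dist.init_process_group(backend=backend, rank=RANK, world_size=WORLD_SIZE);
if is_distributed(), wrap the model with model = Distributor(model) so that it trains in distributed mode.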

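Notes:

world_size is the total number of processes in the job; above we configured 1 master and 3 workers, so world_size should be 4;
rank is the index of the current process, between 0 and world_size-1;
backend selects how the processes communicate: mainly nccl (from NVIDIA), gloo (from Facebook) and mpi (for example OpenMPI), with "gloo" as the usual default; judging from tests, choose nccl if your GPUs support it, and consider gloo or mpi on other (non-NVIDIA) hardware.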
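The arithmetic behind these notes fits in a few lines. The sketch below is plain Python with hypothetical helper names; it only reads the RANK and WORLD_SIZE environment variables the same way the script does and checks that they are consistent, without touching torch.distributed.

```python
import os
from typing import Mapping, Tuple


def expected_world_size(num_masters: int, num_workers: int) -> int:
    """Total number of processes in the job: masters plus workers."""
    return num_masters + num_workers


def read_rank_and_world_size(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Read RANK and WORLD_SIZE with single-process defaults, then validate them."""
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside 0..{world_size - 1}")
    return rank, world_size


def pick_backend(use_cuda: bool) -> str:
    """Follow the rule of thumb above: nccl on NVIDIA GPUs, gloo otherwise."""
    return "nccl" if use_cuda else "gloo"


# 1 master + 3 workers, as configured with num_worker_replicas=3
print(expected_world_size(1, 3))
# Simulated environment of the last worker
print(read_rank_and_world_size({"RANK": "3", "WORLD_SIZE": "4"}))
print(pick_backend(use_cuda=False))
```

When neither variable is set, the defaults give rank 0 and world size 1, so should_distribute() is false and the same script runs unchanged on a single machine.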

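One remaining problem: after launching through the PyTorchJob API, it is hard to find where the trained model ended up. So in the script, as soon as training finishes, I upload the model straight to local MinIO.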

    if save_model:
        dummy_input = torch.randn(1, 28, 28).to(device)  # keep the dummy input on the same device as the model
        torch.onnx.export(model, dummy_input, model_name)
        client.fput_object(bucket_name, model_name, model_name)
        logging.info("save model done!")

Artifacts

Artifacts, put simply, are visualizations.
Once mlpipeline-ui-metadata.json is written according to the required schema, Kubeflow renders visualizations from it.

Writing the metadata JSON file with the v1 SDK

With mlpipeline-ui-metadata.json written to the required schema and the csv path given in source, the data can be visualized:

from functools import partial

from kfp import components, dsl

container_op_hopeVisual = partial(
    components.func_to_container_op,
    base_image= ...,
)

@container_op_hopeVisual
def roc(data_dir):
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_curve,roc_auc_score
    from sklearn.datasets import load_wine
    from sklearn.model_selection import train_test_split, cross_val_predict
    import pandas as pd
    import json
    import os
    from pathlib import Path
    
    X, y = load_wine(return_X_y=True)
    y = y == 1
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    rfc = RandomForestClassifier(n_estimators=10, random_state=42)
    rfc.fit(X_train, y_train)
    y_scores = cross_val_predict(rfc, X_train, y_train, cv=3, method='predict_proba')
    y_predict = cross_val_predict(rfc, X_train, y_train, cv=3, method='predict')
    roc_auc = roc_auc_score(y_train, y_scores[:,1])
    fpr, tpr, thresholds = roc_curve(y_true=y_train, y_score=y_scores[:,1], pos_label=True)
    df_roc = pd.DataFrame({'fpr': fpr, 'tpr': tpr, 'thresholds': thresholds})
    roc_file = "roc.csv"
    os.makedirs(os.path.join(data_dir,"csv_file"),exist_ok=True)
    file_path = os.path.join("csv_file",roc_file)
    df_roc.to_csv(os.path.join(data_dir,file_path),columns=['fpr', 'tpr', 'thresholds'], header=False, index=False)
    
    metadata = {
    'outputs': [{
      'type': 'roc',
      'format': 'csv',
      'schema': [
        {'name': 'fpr', 'type': 'NUMBER'},
        {'name': 'tpr', 'type': 'NUMBER'},
        {'name': 'thresholds', 'type': 'NUMBER'},
      ],
      'source': file_path,
    }]
    }
    ui_metadata_output_path = 'mlpipeline-ui-metadata.json'
    Path(os.path.join(data_dir,ui_metadata_output_path)).parent.mkdir(parents=True, exist_ok=True)
    Path(os.path.join(data_dir,ui_metadata_output_path)).write_text(json.dumps(metadata))
    metrics = {
    'metrics': [{
      'name': 'roc-auc-score',
      'numberValue':  roc_auc,
    }]
    }
    metrics_output_path = 'mlpipeline-metrics.json'
    Path(os.path.join(data_dir,metrics_output_path)).parent.mkdir(parents=True, exist_ok=True)
    Path(os.path.join(data_dir,metrics_output_path)).write_text(json.dumps(metrics))
    print("list outputDir: ", os.listdir(data_dir))
    return
    
@dsl.pipeline(
    name='hope_visual pipeline',
    description='',
)
def pipeline(pvcMountDir:str = "/tmp/outputs"):
    createPvc = dsl.VolumeOp(
        name="create-pvc",
        resource_name="my-pvc-visual",
        modes=dsl.VOLUME_MODE_RWO,
        size='100M',)
    roc_op = roc(pvcMountDir)
    roc_op.add_pvolumes({pvcMountDir:createPvc.volume})
    roc_op.after(createPvc)

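Note that artifacts are rendered by reading mlpipeline-ui-metadata.json from the pipeline's file system. So we must either give the pipeline a PV volume or, as Elyra does, keep the file system on MinIO. If no persistent storage is configured and the file only lives in the pod's temporary space, the visualization cannot be rendered.

That is why, in the pipeline function, I mounted a PV volume on this component specifically to hold mlpipeline-ui-metadata.json and related files.

However, the csv file (source) has to live in remote storage, which makes things rather awkward.
(Alternatively, serialize the csv file to a string and set storage to "inline".)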
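The inline alternative can be sketched as follows: instead of writing roc.csv and pointing source at its path, the CSV text itself goes into source and storage is set to "inline". The helper name and the sample rows are made up for illustration; the field names follow the metadata dict used in roc() above.

```python
import csv
import io
import json
from typing import Any, Dict, Iterable, Tuple


def roc_metadata_inline(rows: Iterable[Tuple[float, float, float]]) -> Dict[str, Any]:
    """Build ROC viewer metadata with the CSV embedded as a string."""
    buffer = io.StringIO()
    # No header row, matching header=False in the to_csv call above
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return {
        "outputs": [
            {
                "type": "roc",
                "format": "csv",
                "schema": [
                    {"name": "fpr", "type": "NUMBER"},
                    {"name": "tpr", "type": "NUMBER"},
                    {"name": "thresholds", "type": "NUMBER"},
                ],
                "storage": "inline",
                "source": buffer.getvalue(),
            }
        ]
    }


# Made-up (fpr, tpr, threshold) rows for illustration
rows = [(0.0, 0.0, 1.9), (0.0, 0.6, 0.9), (0.2, 1.0, 0.4), (1.0, 1.0, 0.0)]
metadata = roc_metadata_inline(rows)
print(json.dumps(metadata, indent=2))
```

This keeps everything in the single JSON file, at the cost of embedding the data, so it is best suited to small outputs.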

Supported visualization types:

confusion matrix (confusion_matrix)
ROC curve
table
tensorboard

Note: strictly speaking, artifacts do not render anything called tensorboard themselves. TensorBoard is TensorFlow's visualization tool; the JSON file only sets the TensorBoard log path in source, and the visualization page then shows a Start Tensorboard button that jumps to Kubeflow's Tensorboard page.

So when configuring Tensorboard in the JSON file, the source attribute is all that needs to be set.
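A minimal sketch of such a metadata file, assuming the same outputs layout as the ROC example; the log directory is a made-up placeholder and tensorboard_metadata is a hypothetical helper.

```python
import json
from typing import Any, Dict


def tensorboard_metadata(log_dir: str) -> Dict[str, Any]:
    """Metadata for a tensorboard output: only the type and the log path."""
    return {"outputs": [{"type": "tensorboard", "source": log_dir}]}


# Hypothetical log location
tb_metadata = tensorboard_metadata("s3://my-bucket/mnist/logs")
print(json.dumps(tb_metadata, indent=2))
```

Compared with the ROC entry there is no format or schema, because the UI only needs to know where the event files live.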

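Using the v2 SDK: the SDK visualization API

Interestingly, although the dsl I am using is v1, the v2-compatible mode makes the convenient SDK API available for metrics visualization.

For example, when defining a component, use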

from kfp.v2.dsl import component

@component
def func(...):
    ...

When compiling the Pipeline, add mode=kfp.dsl.PipelineExecutionMode.V2_COMPATIBLE:

import kfp
from kfp import compiler

compiler.Compiler(mode=kfp.dsl.PipelineExecutionMode.V2_COMPATIBLE).compile(
    pipeline_func=add_pipeline, package_path='pipeline.yaml')

kfp currently supports ROC curves, confusion matrices (confusion_matrix) and Scalar Metrics formats (similar to tables).

import os
from kfp.v2 import dsl
from kfp.v2.dsl import (
    component,
    Output,
    ClassificationMetrics,
    Metrics,
    HTML,
    Markdown
)

@component(
    base_image='...'
)
def iris_sgdclassifier(test_samples_fraction: float, metrics: Output[ClassificationMetrics]):
    from sklearn import datasets, model_selection
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import confusion_matrix

    iris_dataset = datasets.load_iris()
    train_x, test_x, train_y, test_y = model_selection.train_test_split(
        iris_dataset['data'], iris_dataset['target'], test_size=test_samples_fraction)


    classifier = SGDClassifier()
    classifier.fit(train_x, train_y)
    predictions = model_selection.cross_val_predict(classifier, train_x, train_y, cv=3)
    metrics.log_confusion_matrix(
        ['Setosa', 'Versicolour', 'Virginica'],
        confusion_matrix(train_y, predictions).tolist() # .tolist() to convert np array to list.
    )

@dsl.pipeline(name='metrics-visualization-pipeline')
def metrics_visualization_pipeline():
    iris_sgdclassifier_op = iris_sgdclassifier(test_samples_fraction=0.3)

Calling metrics.log_confusion_matrix is all it takes to produce the visualization.

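That said, I do not yet fully understand the metrics: Output[ClassificationMetrics] parameter, or why leaving it unassigned raises no error. The API also does not seem very stable yet: the visualization sometimes renders and sometimes does not.

[Screenshot: v2 visualization output failing to render]

Strange.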
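One likely explanation for the unassigned metrics parameter, as far as I understand KFP v2: parameters annotated with Output[...] are output artifacts that the pipeline runtime creates and passes in when the component executes, so the caller never supplies them. The toy sketch below imitates that idea in plain Python; ToyOutput, ToyMetrics and run_component are made-up names for illustration and are not part of kfp.

```python
import inspect
from typing import Any, Callable, Dict, List


class ToyOutput:
    """Marker annotation meaning 'the runtime creates this argument'."""


class ToyMetrics:
    """Stand-in for a metrics artifact that records what was logged."""

    def __init__(self) -> None:
        self.logged: List[Any] = []

    def log_confusion_matrix(self, categories: List[str], matrix: List[List[int]]) -> None:
        self.logged.append((categories, matrix))


def run_component(func: Callable[..., None], **inputs: Any) -> Dict[str, ToyMetrics]:
    """Call func, creating one ToyMetrics for each parameter annotated ToyOutput."""
    injected: Dict[str, ToyMetrics] = {}
    for name, param in inspect.signature(func).parameters.items():
        if param.annotation is ToyOutput:
            injected[name] = ToyMetrics()
    func(**inputs, **injected)
    return injected


def toy_component(test_samples_fraction: float, metrics: ToyOutput) -> None:
    # The body uses metrics even though the caller never passed it in
    metrics.log_confusion_matrix(["a", "b"], [[1, 0], [0, 1]])


outputs = run_component(toy_component, test_samples_fraction=0.3)
print(outputs["metrics"].logged)
```

The call site only passes test_samples_fraction, just like iris_sgdclassifier(test_samples_fraction=0.3) in the pipeline above.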

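The source for metrics.log_confusion_matrix and related code is mainly in pipelines/artifact_types.py at sdk/release-1.8 · kubeflow/pipelines (github.com).

I have not read it closely, but from a quick glance I strongly suspect that the v2 visualization SDK puts the JSON file's source in cloud storage?

[Screenshot: excerpt from artifact_types.py]

And that storage sits on the public internet, hence the unstable rendering?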

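Summary

As I see it, artifacts are currently of limited use, mainly because so few visualizations are supported: the v2 SDK mainly supports only ROC curves, confusion matrices and tables, and v1, configured through the JSON file, adds only the jump to Tensorboard.

So it is hard to work up much interest in figuring out how to fix the unstable v2 visualization.

Rather than putting effort into artifacts, it might be better to store experiment results and other files directly in MinIO, or to work out how to use the mature visualization tool TensorBoard?

But TensorBoard is the visualization tool that ships with TensorFlow. Damn, does that mean I need to get TensorBoard set up as well?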

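Elyra custom components, and an idea they prompted

In a tech chat group, someone brought up combining Elyra with custom components: add custom components directly in Elyra to give the platform a set of general-purpose components, tie the platform's capabilities together, and offer them in a low-code or no-code form?

It suddenly struck me that this might really be something worth doing?

I looked into it afterwards, and it works like this:

When building a pipeline, we assemble it from individual components:

[Screenshot: a pipeline assembled from components]

This one, for example, is a standard component that we can use to build a pipeline:

[Screenshot: a component definition]

But when building a component we can also export it as a YAML file and reuse it:

[Screenshot: exporting a component to a YAML file]

How is it reused? With the native approach, calling the API kfp.components.load_component_from_file loads the component from the compiled YAML file;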

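In Elyra, we do the following.

The YAML file currently lives here:

[Screenshot: location of the YAML file]

The absolute path of that folder is:

[Screenshot: absolute path of the folder]

Now click here:

[Screenshot: Elyra UI entry point for adding components]

Then fill in mainly this setting:

[Screenshot: Elyra component configuration form]

Here we can now use the component:

[Screenshot: the component available in the Elyra pipeline editor]

We can set the component's custom parameters, mount an existing PV volume, and so on:

[Screenshot: component properties in Elyra]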

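Besides this approach, there are two others:

by URL, that is, upload the YAML file to remote storage and change path to the URL;
load every YAML file in a directory, so a full set of components can sit in one directory and Path just points to that directory.

So with the second approach we can build a full set of components or pipelines, compile them all to YAML files in one folder and upload it. Others then git clone the whole directory, configure it the second way, and get a batch of components for low-code development. At least, that is how I see it.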
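Outside Elyra, the same "one directory of components" idea can be sketched with the standard library: collect every YAML file in the folder, then hand each path to a loader. find_component_files is a hypothetical helper, and the demo files below are empty stand-ins rather than real component definitions.

```python
import tempfile
from pathlib import Path
from typing import List


def find_component_files(directory: str) -> List[Path]:
    """Return the YAML files directly inside directory, sorted by path."""
    root = Path(directory)
    return sorted(
        p for p in root.iterdir() if p.is_file() and p.suffix in {".yaml", ".yml"}
    )


# Demo with stand-in files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("train.yaml", "roc.yml", "notes.txt"):
        Path(tmp, name).write_text("name: demo\n")
    found = [p.name for p in find_component_files(tmp)]

print(found)
```

With kfp installed, each returned path could then be passed to kfp.components.load_component_from_file(str(path)) to get a reusable component.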

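OK, I think this is where the series should end. Leaving aside model monitoring, data and model versioning and the like, we can finally say that we are able to do some AutoML development on k8s.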
