释放双眼，带上耳机，听听看~！

本文将介绍LightGBM在GPU上进行训练时的并行计算特性，包括相关CUDA算子的使用和LightGBM源码的结构。

引言

LightGBM作为GBDT的一个拓展实现，由于使用门槛低、性能优良，已成为传统机器学习领域出场率极高的一个模型。相比于上一代的GBDT实现——Xgboost，LightGBM在训练时要更轻、更快。轻和快一方面是由于LightGBM在算法层面缩减了数据（行）、特征（列）两个维度上计算与内存开销。其中：

数据维度（行）
- 使用直方图替代了特征预排序，同时节省了计算与内存开销；
- bagging过程中支持单边梯度采样（GOSS），在每轮计算和更新目标函数值时使用少量大梯度样本+采样得到的小梯度样本，节省了计算开销。
特征维度（列）
- 对互斥率高的稀疏特征进行捆绑，这种特征融合方式同时节省了计算与内存开销。

另一方面，LightGBM在工程上对数据和特征两个维度上的并行方式也进行了优化，其中：

数据并行（行）
- 每个worker上对数据子集构建子直方图后，使用Reduce Scatter对子直方图局部归并，然后寻找局部最优分裂信息，再找最优分裂点，这样在寻找分裂点步骤的并行度更高；而不是先将子直方图图归并为全局直方图，再寻找最优分裂点。
- 使用直方图减法减少数据传输次数，降低通讯成本：一个叶子节点数据完成之后，另外一个叶子节点可以通过直方图减法计算得到。
特征并行（列）
- 每个worker上有全量的特征，各worker在全量特征中负责一个特征子集，找到子集下的局部最优分裂点后互相通讯得到全局最优分裂点。每个worker在自己全量特征下根据全局最优信息进行分裂，生成决策树。与传统的特征并行每个worker上只有自己负责的局部特征集相比，少了一轮对分裂结果进行同步的通信开销。

以上介绍的特性对大多数使用者来说都耳熟能详，在本文中我们将介绍点不太一样的内容。LightGBM训练程序都通常采用单机运行，或是结合Spark、Mars等分布式计算框架在集群上运行。其实LightGBM也支持在GPU上进行训练，相关CUDA算子可以在LightGBM官方开源代码中查看。本文将结合官方提供的源码，重点对LightGBM中相关CUDA算子并行计算的特性进行介绍。

图1. reduce vs reduce scatter

LightGBM在GPU上的并行计算特性介绍

LightGBM源码简介

在介绍CUDA算子之前，我们先要对LightGBM源码结构有个基本的了解。LightGBM底层使用C++实现，上层封装了Python与R的接口。如果我们在windows环境下使用pip安装lightGBM库，进入到python lightGBM库安装目录下会发现所有Python LightGBM API执行路径最终都指向了一个lib_lightgbm.dll的文件。.dll拓展名的文件是c++项目在windows环境下编译得到的动态链接库文件。lightGBM C++源码被编译成.dll文件，然后在python中使用ctypes包进行调用。下面我们先使用git clone –recursive github.com/microsoft/L… 将LightGBM工程源码clone到本地，–recursive选项指定在clone时将external_libs下的相关依赖也进行clone。

图2. lightGBM依赖的第三方库external_libs目录

LightGBM在GPU上的并行计算特性介绍

lightGBM核心的c++实现代码在./include和./src两个目录下：

图3. lightGBM工程目录

LightGBM在GPU上的并行计算特性介绍

lightGBM项目在工程上使用Cmake进行构建，在构建文件CMakeLists.txt中指定了大量编译、链接相关的构建参数和选项。首先找到测试或执行文件的程序入口，在cmakeLists.txt中搜索add_executable，可以找到两处：

add_executable(lightgbm src/main.cpp src/application/application.cpp)

add_executable(testlightgbm ${CPP_TEST_SOURCES})

其中第1处由构建参数BUILD_CLI控制，指定在构建库文件之外是否还生成可执行文件，默认是开启：

option(BUILD_CLI "Build the 'lightbgm' command-line interface in addition to lib_lightgbm" ON)

第2处由参数BUILD_CPP_TEST控制，指定是否生成测试用的可执行文件，默认是关闭；如果开启，需要安装谷歌测试程序才能正常进行测试。（Tips: 谷歌测试和catch2一样，是一款C++单元测试工具）

option(BUILD_CPP_TEST "Build C++ tests with Google Test" OFF)

通过第1处程序入口（src/main.cpp）进入执行路径：main()=>Application::Run()=>Application::Train()=>boosting_->Train()，
定位到了一个Boosting的抽象类，Boosting抽象类下游有4个具体的实现子类，分别为DART、GBDT、GBDTBase、RF：

图4. Boosting抽象类下4个实现抽放方法的子类

LightGBM在GPU上的并行计算特性介绍

这4个子类分别对应的LightGBM中4种boosting类型的具体实现。

LightGBM boosting

我们挑一个最常用的也是默认的boosting实现——GBDT，了解boosting过程。进入到其中的GBDT->train()方法：

// gbdt.cpp
void GBDT::Train(int snapshot_freq, const std::string& model_output_path) {
  Common::FunctionTimer fun_timer("GBDT::Train", global_timer);
  bool is_finished = false;
  auto start_time = std::chrono::steady_clock::now();
  for (int iter = 0; iter < config_->num_iterations && !is_finished; ++iter) {
    is_finished = TrainOneIter(nullptr, nullptr);
    if (!is_finished) {
      is_finished = EvalAndCheckEarlyStopping();
    }
    auto end_time = std::chrono::steady_clock::now();
    // output used time per iteration
    Log::Info("%f seconds elapsed, finished iteration %d", std::chrono::duration<double,
              std::milli>(end_time - start_time) * 1e-3, iter + 1);
    if (snapshot_freq > 0
        && (iter + 1) % snapshot_freq == 0) {
      std::string snapshot_out = model_output_path + ".snapshot_iter_" + std::to_string(iter + 1);
      SaveModelToFile(0, -1, config_->saved_feature_importance_type, snapshot_out.c_str());
    }
  }
}

GBDT::Train()方法的第6 ~ 20行有一个循环体，子树都在此过程中生成。其中循环次数config_->num_iterations为树的棵数，每轮迭代通过第7行的TrainOneIter(nullptr, nullptr)方法生成一棵子树，GBDT::TrainOneIter逻辑如下：

// gbdt.cpp
bool GBDT::TrainOneIter(const score_t* gradients, const score_t* hessians) {
  Common::FunctionTimer fun_timer("GBDT::TrainOneIter", global_timer);
  // init_scores存放单棵树所贡献的 predictRaw，类似于神经网络的logits
  std::vector<double> init_scores(num_tree_per_iteration_, 0.0);
  if (gradients == nullptr || hessians == nullptr) {
    for (int cur_tree_id = 0; cur_tree_id < num_tree_per_iteration_; ++cur_tree_id) {
      // 对predictRaw初始化：
      // 如果目标函数(obj)为logloss, 经过BoostFromAverage后 
      // score = log(p/(1-p)) 
      //    => sigmoid(score) = p
      // 如果训练样本weight=1，p为该类别在训练数据集中的占比
      init_scores[cur_tree_id] = BoostFromAverage(cur_tree_id, true); 
    }
    // Boosting()更新梯度(GBDT::gradients_pointer_)与二阶偏导(GBDT::hessians_pointer_)
    Boosting();
    // 获取更新后的梯度与二阶偏导
    gradients = gradients_pointer_; 
    hessians = hessians_pointer_;
  } else { // 如果调用方传入了梯度和二阶偏导数据，则直接将传入数据复制到GBDT梯度和二阶偏导属性下

    CHECK(objective_function_ == nullptr);
    if (data_sample_strategy_->IsHessianChange()) {
      // need to copy customized gradients when using GOSS
      int64_t total_size = static_cast<int64_t>(num_data_) * num_tree_per_iteration_;
      #pragma omp parallel for schedule(static)
      for (int64_t i = 0; i < total_size; ++i) {
        // 复制数据，GBDT::gradients_和GBDT::gradients_pointer_
        // 一个是使用自定义的allocator分配堆内存所得到的容器，一个是指针，
         // 两者两位一体，GBDT::hessians_和GBDT::hessians_pointer_也一样
         // 具体关系可见 GBDT::init() => ResetGradientBuffers()
        gradients_[i] = gradients[i]; 
        hessians_[i] = hessians[i];
      }
      CHECK_EQ(gradients_pointer_, gradients_.data());
      CHECK_EQ(hessians_pointer_, hessians_.data());
      // 重新指向GBDT下的梯度和二阶偏导指针
      gradients = gradients_pointer_; 
      hessians = hessians_pointer_;
    }
  }
  // 对行进行bagging采样逻辑，由GBDT::data_sample_strategy_决定是采用简单的bagging，还是Goss

  data_sample_strategy_->Bagging(iter_, tree_learner_.get(), gradients_.data(), hessians_.data());
  const bool is_use_subset = data_sample_strategy_->is_use_subset();
  const data_size_t bag_data_cnt = data_sample_strategy_->bag_data_cnt();
  const std::vector<data_size_t, Common::AlignmentAllocator<data_size_t, kAlignedSize>>& bag_data_indices = data_sample_strategy_->bag_data_indices();

  if (objective_function_ == nullptr && is_use_subset && bag_data_cnt < num_data_ && !boosting_on_gpu_ && !data_sample_strategy_->IsHessianChange()) {
    ResetGradientBuffers();
  }

  bool should_continue = false;
  
  // 新子树的生成，每个类别1棵，每个类别的梯度数据在1列上
  for (int cur_tree_id = 0; cur_tree_id < num_tree_per_iteration_; ++cur_tree_id) {
    // 生成单个类别的子树
    const size_t offset = static_cast<size_t>(cur_tree_id) * num_data_; // 列索引计算偏移量
    std::unique_ptr<Tree> new_tree(new Tree(2, false, false)); // 初始化一棵树
    if (class_need_train_[cur_tree_id] && train_data_->num_features() > 0) {
      auto grad = gradients + offset; // 获取该列梯度和二阶偏导
      auto hess = hessians + offset;
      
      if (is_use_subset && bag_data_cnt < num_data_ && !boosting_on_gpu_) {
          // 根据bagging索引，获取采样得到的子集梯度
        for (int i = 0; i < bag_data_cnt; ++i) {
          gradients_pointer_[offset + i] = grad[bag_data_indices[i]];
          hessians_pointer_[offset + i] = hess[bag_data_indices[i]];
        }
        grad = gradients_pointer_ + offset;
        hess = hessians_pointer_ + offset;
      }
      bool is_first_tree = models_.size() < static_cast<size_t>(num_tree_per_iteration_);
      // 学习、分裂、生成树
      new_tree.reset(tree_learner_->Train(grad, hess, is_first_tree));
    }

    // 新树赋值
    if (new_tree->num_leaves() > 1) { // 没到叶子节点，还可以继续分裂
      should_continue = true;
      // 更新子树的leaf_value_
      auto score_ptr = train_score_updater_->score() + offset;
      auto residual_getter = [score_ptr](const label_t* label, int i) {return static_cast<double>(label[i]) - score_ptr[i]; };
      tree_learner_->RenewTreeOutput(new_tree.get(), objective_function_, residual_getter,
                                     num_data_, bag_data_indices.data(), bag_data_cnt, train_score_updater_->score());
      // 对子树进行shrinkage, leaf_value_ *= rate; internal_value_*=rate 
      new_tree->Shrinkage(shrinkage_rate_);
      // 更新GBDT score
      UpdateScore(new_tree.get(), cur_tree_id);
      if (std::fabs(init_scores[cur_tree_id]) > kEpsilon) {
        new_tree->AddBias(init_scores[cur_tree_id]);
      }
    } else {
      // 无需分裂的叶子节点
      if (models_.size() < static_cast<size_t>(num_tree_per_iteration_)) {
        if (objective_function_ != nullptr && !config_->boost_from_average && !train_score_updater_->has_init_score()) {
          init_scores[cur_tree_id] = ObtainAutomaticInitialScore(objective_function_, cur_tree_id);

          train_score_updater_->AddScore(init_scores[cur_tree_id], cur_tree_id);
          for (auto& score_updater : valid_score_updater_) {
            score_updater->AddScore(init_scores[cur_tree_id], cur_tree_id);
          }
        }
        new_tree->AsConstantTree(init_scores[cur_tree_id]);
      }
    }
    // 将新生成的树添加到GBDT::models_中
    models_.push_back(std::move(new_tree));
  }

  if (!should_continue) {
    Log::Warning("Stopped training because there are no more leaves that meet the split requirements");
    if (models_.size() > static_cast<size_t>(num_tree_per_iteration_)) {
      for (int cur_tree_id = 0; cur_tree_id < num_tree_per_iteration_; ++cur_tree_id) {
        models_.pop_back();
      }
    }
    return true;
  }

  ++iter_;
  return false;
}

GBDT::TrainOneIter()方法中传入了两个指针gradients和hessians，类型score_t是使用typedef定义的浮点数别名，用宏在预编译阶段控制数据类型为float还是double。gradients和hessians分别用来存放梯度和二阶偏导，数据size = num_data_ * num_tree_per_iteration_，其中num_data_为训练集样本数量，num_tree_per_iteration_为每一轮迭代中生成树的棵树，与多分类任务类别数量对应，默认二分类num_tree_per_iteration_值为1，即在每一轮迭代中只需生成1棵树。gradients和hessians存放的是num_data_ × num_tree_per_iteration_ 二维数据，布局方式为列主序（0~num_data_-1处数据在内存中连续）。GBDT类下有两对属性：GBDT::gradients_ || GBDT::gradients_pointer_， GBDT::hessians_ || GBDT::hessians_pointer_，两对属性两位一体，一个是使用自定义的allocator分配堆内存所得到的容器，一个是指针。

TrainOneIter分为3个部分：
5 ~ 41行初始化score，并更新GBDT梯度和二阶偏导；
44 ~ 51 进行bagging；根据初始化参数执行普通或GOSS行采样；
51 ~ 108 生成子树，并更新子树的leaf_value_和整个GBDT的score

GBDT的score为所有子树的leaf_value_之和，也是predictRaw API的输出，相当于神经网络的logits，具体的细节请见代码和本人提供的注释。

LightGBM tree_learner

沿着GBDT::TrainOneIter中的tree_learner_->Train()方法展示了树的生成与节点分裂：

Tree* SerialTreeLearner::Train(const score_t* gradients, const score_t *hessians, bool /*is_first_tree*/) {
  Common::FunctionTimer fun_timer("SerialTreeLearner::Train", global_timer);
  gradients_ = gradients;
  hessians_ = hessians;
  int num_threads = OMP_NUM_THREADS();
  if (share_state_->num_threads != num_threads && share_state_->num_threads > 0) {
    Log::Warning(
        "Detected that num_threads changed during training (from %d to %d), "
        "it may cause unexpected errors.",
        share_state_->num_threads, num_threads);
  }
  share_state_->num_threads = num_threads;

  if (config_->use_quantized_grad) {
    gradient_discretizer_->DiscretizeGradients(num_data_, gradients_, hessians_);
  }

  // 对ialTreeLearner相关属性初始化
  BeforeTrain();

  // 构造一棵树
  bool track_branch_features = !(config_->interaction_constraints_vector.empty());
  auto tree = std::unique_ptr<Tree>(new Tree(config_->num_leaves, track_branch_features, false));
  auto tree_ptr = tree.get();
  constraints_->ShareTreePointer(tree_ptr);

  // 根节点根据SerialTreeLearner::forced_split_json_进行分裂
  // forced_split_json_可以理解为已经提前训练好的节点的信息
  int left_leaf = 0;
  int cur_depth = 1;
  int right_leaf = -1;
  
  // SerialTreeLearner::ForceSplits()根据forced_split_json_中的信息采用广度优先的方式，一层一层的构造
  int init_splits = ForceSplits(ptr, &left_leaf, &right_leaf, &cur_depth);
  
  // “续训”
  for (int split = init_splits; split < config_->num_leaves - 1; ++split) {
    if (BeforeFindBestSplit(tree_ptr, left_leaf, right_leaf)) {
      // 找到每个特征的最佳分裂点
      FindBestSplits(tree_ptr);
    }
    // 获取增益最大的叶子节点索引
    int best_leaf = static_cast<int>(ArrayArgs<SplitInfo>::ArgMax(best_split_per_leaf_));
    // 获取增益最大叶子节点的分裂信息
    const SplitInfo& best_leaf_SplitInfo = best_split_per_leaf_[best_leaf];
    // 如果增益<=0，停止分裂
    if (best_leaf_SplitInfo.gain <= 0.0) {
      Log::Warning("No further splits with positive gain, best gain: %f", best_leaf_SplitInfo.gain);
      break;
    }
    // 对增益最大的叶子节点进行分裂
    Split(tree_ptr, best_leaf, &left_leaf, &right_leaf);
    cur_depth = std::max(cur_depth, tree->leaf_depth(left_leaf));
  }

  if (config_->use_quantized_grad && config_->quant_train_renew_leaf) {
    gradient_discretizer_->RenewIntGradTreeOutput(tree.get(), config_, data_partition_.get(), gradients_, hessians_,
      [this] (int leaf_index) { return GetGlobalDataCountInLeaf(leaf_index); });
  }

  Log::Debug("Trained a tree with leaves = %d and depth = %d", tree->num_leaves(), cur_depth);
  return tree.release();
}

tree_learner_->Train()会先根据传入的forced_split_json_加载一棵树，force_split_json_可以想象成一个载有“预训练树”信息的json，然后在37~54行对加载的“预训练树”进行续训，直到深度达到最大数深（num_leaves-1）或不再有增益（best_leaf_SplitInfo.gain <= 0.0），则树的生长停止。

LightGBM cuda算子

前面简单介绍了训练时的boosting过程和tree的生成，接下来我们进入正题，聊聊LightGBM源码中cuda相关部分。cuda代码分布在include和scr目录下：

include/LightGBM/cuda目录下的cuda头文件，cuda_algorithms.hpp中定义了在设备端调用(_device_)的基础并行算子，包括规约和双调排序两类；cuda_utils.hpp中定义了设备端内存分配，host-device数据传输函数，其他.hpp中对相关数据结构进行了定义，如meta_data, split_info, tree等。
cuda源文件(.cu)在src目录下较为分散，src/cuda目录下有两个源文件cuda_algorithms.cu和cuda_utils.cpp，对基础的算子进行了补充和封装。子目录boosting、metric、objective下也都有cuda文件夹，其中boosting/cuda目录下定义了cuda版本的score_update，treelearner/cuda目录下定义了split、histogram、learner的cuda逻辑。

规约

首先介绍cuda_algorithms(.hpp, .cu)中实现的两类基础并行算法：规约和双调排序。其中并行规约算子采用线程束的洗牌指令（shuffle）实现，用于求数组的和、数组的前缀和，求最大、最小值等。cuda_algorithms.hpp中定义了设备端运行的内联函数，然后在cuda_algorithms.cu中进行调用，以ShuffleReduceSum为例：

// cuda_algorithms.hpp
// warp内规约求和
template <typename T>
__device__ __forceinline__ T ShuffleReduceSumWarp(T value, const data_size_t len) {
  if (len > 0) {
    const uint32_t mask = 0xffffffff;
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
      value += __shfl_down_sync(mask, value, offset);
    }
  }
  return value;
}

// block内规约求和
// reduce values from an 1-dimensional block (block size must be no greather than 1024)
template <typename T>
__device__ __forceinline__ T ShuffleReduceSum(T value, T* shared_mem_buffer, const size_t len) {
  const uint32_t warpLane = threadIdx.x % warpSize;
  const uint32_t warpID = threadIdx.x / warpSize;
  
  // 兼容len不能被warpSize整除的情形
  const data_size_t warp_len = min(static_cast<data_size_t>(warpSize), static_cast<data_size_t>(len) - static_cast<data_size_t>(warpID * warpSize));
  
  // 每个线程束内求和，并将结果写进shared memory
  value = ShuffleReduceSumWarp<T>(value, warp_len);
  if (warpLane == 0) {
    shared_mem_buffer[warpID] = value;
  }
  __syncthreads();
  
  // 线程束间求和
  const data_size_t num_warp = static_cast<data_size_t>((len + warpSize - 1) / warpSize);
  if (warpID == 0) {
    value = (warpLane < num_warp ? shared_mem_buffer[warpLane] : 0);
    value = ShuffleReduceSumWarp<T>(value, num_warp);
  }
  return value;
}

ShuffleReduceSumWarp和ShuffleReduceSum算子采用线程束洗牌指令实现了一个很基础的reduceSum的功能。ShuffleReduceSum中先warp内求和并将warp内的和写入shared memory中，然后对shared memory进行第二次规约最终得到block内部和。第二次要想使用线程束洗牌指令求和，warp number须小于warp size(32)，因此单个block下的线程数须小于32*32=1024。__shfl_down_sync具体原理可见上一篇CUDA基础知识梳理。其他reduceMin，reduceMax实现原理类似。

下面的ShufflePrefixSum和ShufflePrefixSumGlobalKernel算子实现了求数组前缀和的功能：

// cuda_algorithms.hpp
template <typename T>
__device__ __forceinline__ T ShufflePrefixSum(T value, T* shared_mem_buffer) {
    const uint32_t mask = 0xffffffff;
    const uint32_t warpLane = threadIdx.x % warpSize;
    const uint32_t warpID = threadIdx.x / warpSize;
    const uint32_t num_warp = blockDim.x / warpSize;

    // warp内前缀和
    for (uint32_t offset = 1; offset < warpSize; offset <<= 1) {
        // 读取束内前面线程寄存器值
        const T other_value = __shfl_up_sync(mask, value, offset);
        // 线程束分化，前面线程寄存器值加到当前寄存器
        if (warpLane >= offset) {
            value += other_value; //前缀和
        }
    }

    // warpLane = warpSize - 1此时的value值为整个线程束的总和
    // 将当前线程束总和存入shared memory
    if (warpLane == warpSize - 1) {
        shared_mem_buffer[warpID] = value;
    }
    __syncthreads();

    // 第二次求程束间的前缀和
    if (warpID == 0) {
        T warp_sum = (warpLane < num_warp ? shared_mem_buffer[warpLane] : 0);
        for (uint32_t offset = 1; offset < warpSize; offset <<= 1) {
            // 先将线程束总和读入线程寄存器，然后用同样方式求束间前缀和
            const T other_warp_sum = __shfl_up_sync(mask, warp_sum, offset);
            if (warpLane >= offset) {
                warp_sum += other_warp_sum;
            }
        }
        shared_mem_buffer[warpLane] = warp_sum;
    }
    __syncthreads();
    // shared_mem_buffer[warpID - 1]存放的是当前线程束的束间前缀和
    const T warp_base = warpID == 0 ? 0 : shared_mem_buffer[warpID - 1];
    // 当前线程束的束间前缀和+当前线程在线程束内的前缀和，得到block内的前缀和
    return warp_base + value;
}

template <typename T>
__global__ void ShufflePrefixSumGlobalKernel(T* values, size_t len, T* block_prefix_sum_buffer) {
    __shared__ T shared_mem_buffer[32];
    const size_t index = static_cast<size_t>(threadIdx.x + blockIdx.x * blockDim.x);

    // 读取数据到寄存器
    T value = 0;
    if (index < len) {
        value = values[index];
    }
    // 在block内求前缀和并写入values
    const T prefix_sum_value = ShufflePrefixSum<T>(value, shared_mem_buffer);
    values[index] = prefix_sum_value;

    // block负责数据之和存入block_prefix_sum_buffer
    if (threadIdx.x == blockDim.x - 1) {
        block_prefix_sum_buffer[blockIdx.x] = prefix_sum_value;
    }
}

在ShufflePrefixSum中，利用__shfl_up_sync指令进行了2次前缀求和，第一次10~17行通过循环和线程束分化在warp内求各个thread的前缀和，并将数据存入各thread的value寄存器中。循环结束时warpLane = warpSize – 1的thread寄存器中存放的是整个warp value的总和。将warp总和存入shared memory，然后用同样方式求warp间的前缀和。最后block中当前thread的前缀和=warp间前缀和+thread在当前warp的前缀和。在循环中使用__shfl_up_sync求前缀和的执行过程如图5所示。

图 5 使用__shfl_up_sync()求前缀和示意图

LightGBM在GPU上的并行计算特性介绍

双调排序

双调排序算法以BitonicArgSortDevice()为例：

template <typename VAL_T, typename INDEX_T, bool ASCENDING, uint32_t BLOCK_DIM, uint32_t MAX_DEPTH>
__device__ void BitonicArgSortDevice(const VAL_T* values, INDEX_T* indices, const int len) {
  __shared__ VAL_T shared_values[BLOCK_DIM];
  __shared__ INDEX_T shared_indices[BLOCK_DIM];
  
  // 根据数据长度len获取深度depth，即每个block上需要进行双调排序的次数
  int len_to_shift = len - 1;
  int max_depth = 1;
  while (len_to_shift > 0) {
    len_to_shift >>= 1;
    ++max_depth;
  }
  
  // 获取block数量，grid中只使用单个block？
  const int num_blocks = (len + static_cast<int>(BLOCK_DIM) - 1) / static_cast<int>(BLOCK_DIM);
  
  // 使用共享内存在每个block内对1~MAX_DEPTH深度的数据进行双调排序
  for (int block_index = 0; block_index < num_blocks; ++block_index) {
    const int this_index = block_index * static_cast<int>(BLOCK_DIM) + static_cast<int>(threadIdx.x);
    if (this_index < len) {
      // 读取全局内存数据到共享内存
      shared_values[threadIdx.x] = values[this_index];
      shared_indices[threadIdx.x] = this_index;
    } else {
      shared_indices[threadIdx.x] = len;
    }
    __syncthreads();
    
    // 第一轮双调排序, depth:[max_depth-1, max_depth-MAXDEPTH)
    for (int depth = max_depth - 1; depth > max_depth - static_cast<int>(MAX_DEPTH); --depth) {
      // 获取段长度、段索引、段上单调性(升序或降序)
      const int segment_length = (1 << (max_depth - depth));
      const int segment_index = this_index / segment_length;
      
      // ascending控制升序和降序的segment各一半
      const bool ascending = ASCENDING ? (segment_index % 2 == 0) : (segment_index % 2 == 1);
      {
        const int half_segment_length = (segment_length >> 1);
        const int half_segment_index = this_index / half_segment_length;
        // len和segment_lenth不对齐的情形，len往往不可能刚好是2^n
        const int num_total_segment = (len + segment_length - 1) / segment_length;
        // offset也是用于不对齐场景
        const int offset = (segment_index == num_total_segment - 1 && ascending == ASCENDING) ?
          (num_total_segment * segment_length - len) : 0;
        if (half_segment_index % 2 == 0) {
          // len与segment_lenth不对齐兼容
          const int segment_start = segment_index * segment_length;
          if (this_index >= offset + segment_start) {
            
            // other_index与this_index间隔half_segment_length
            const int other_index = static_cast<int>(threadIdx.x) + half_segment_length - offset;
            // 读取值this_data和other_data
            const INDEX_T this_data_index = shared_indices[threadIdx.x];
            const INDEX_T other_data_index = shared_indices[other_index];
            const VAL_T this_value = shared_values[threadIdx.x];
            const VAL_T other_value = shared_values[other_index];
            
            // 在段内对将不满足段单调条件的元素和索引进行位置交换
            if (other_data_index < len && (this_value > other_value) == ascending) {
              shared_indices[threadIdx.x] = other_data_index;
              shared_indices[other_index] = this_data_index;
              shared_values[threadIdx.x] = other_value;
              shared_values[other_index] = this_value;
            }
          }
        }
        __syncthreads();
      }
      // 内存循环，half_segment_length每次减半，段内进行排序
      for (int inner_depth = depth + 1; inner_depth < max_depth; ++inner_depth) {
        const int half_segment_length = (1 << (max_depth - inner_depth - 1));
        const int half_segment_index = this_index / half_segment_length;
        
        // 同样的逻辑对不满足单调条件的元素和索引进行位置交换
        if (half_segment_index % 2 == 0) {
          const int other_index = static_cast<int>(threadIdx.x) + half_segment_length;
          const INDEX_T this_data_index = shared_indices[threadIdx.x];
          const INDEX_T other_data_index = shared_indices[other_index];
          const VAL_T this_value = shared_values[threadIdx.x];
          const VAL_T other_value = shared_values[other_index];
          if (other_data_index < len && (this_value > other_value) == ascending) {
            shared_indices[threadIdx.x] = other_data_index;
            shared_indices[other_index] = this_data_index;
            shared_values[threadIdx.x] = other_value;
            shared_values[other_index] = this_value;
          }
        }
        __syncthreads();
      }
    }
    if (this_index < len) {
      indices[this_index] = shared_indices[threadIdx.x];
    }
    __syncthreads();
  }
  
  // 使用全局内存对MAX_DEPTH~max_depth深度的数据进行双调排序，逻辑相似，相当于是对上一个循环块的补充
  // 共享内存的size BLOCK_DIM 与 MAX_DEPTH有关
  // 这一块双调排序为了更好的利用共享内存，内存循环被拆成了两部分
  
  // 第2轮排序，剩余深度[max_depth-MAX_DEPTH, 1]
  for (int depth = max_depth - static_cast<int>(MAX_DEPTH); depth >= 1; --depth) {
    const int segment_length = (1 << (max_depth - depth));
    {
      const int num_total_segment = (len + segment_length - 1) / segment_length;
      const int half_segment_length = (segment_length >> 1);
      
      // 每个block内在base-half_segment_length上进行第一次元素交换
      for (int block_index = 0; block_index < num_blocks; ++block_index) {
        const int this_index = block_index * static_cast<int>(BLOCK_DIM) + static_cast<int>(threadIdx.x);
        const int segment_index = this_index / segment_length;
        const int half_segment_index = this_index / half_segment_length;
        const bool ascending = ASCENDING ? (segment_index % 2 == 0) : (segment_index % 2 == 1);
        const int offset = (segment_index == num_total_segment - 1 && ascending == ASCENDING) ?
          (num_total_segment * segment_length - len) : 0;
        if (half_segment_index % 2 == 0) {
          const int segment_start = segment_index * segment_length;
          if (this_index >= offset + segment_start) {
            const int other_index = this_index + half_segment_length - offset;
            if (other_index < len) {
              const INDEX_T this_data_index = indices[this_index];
              const INDEX_T other_data_index = indices[other_index];
              const VAL_T this_value = values[this_data_index];
              const VAL_T other_value = values[other_data_index];
              if ((this_value > other_value) == ascending) {
                indices[this_index] = other_data_index;
                indices[other_index] = this_data_index;
              }
            }
          }
        }
      }
      __syncthreads();
    }
    
    // 内层循环1，对half_segment_length进行逐次减半，边界[depth+1, max_depth-MAX_DEPTH];是直接在全局内存上进行的
    for (int inner_depth = depth + 1; inner_depth <= max_depth - static_cast<int>(MAX_DEPTH; ++inner_depth) {
      const int half_segment_length = (1 << (max_depth - inner_depth - 1));
      for (int block_index = 0; block_index < num_blocks; ++block_index) {
        const int this_index = block_index * static_cast<int>(BLOCK_DIM) + static_cast<int>(threadIdx.x);
        const int segment_index = this_index / segment_length;
        const int half_segment_index = this_index / half_segment_length;
        const bool ascending = ASCENDING ? (segment_index % 2 == 0) : (segment_index % 2 == 1);
        if (half_segment_index % 2 == 0) {
          const int other_index = this_index + half_segment_length;
          if (other_index < len) {
            const INDEX_T this_data_index = indices[this_index];
            const INDEX_T other_data_index = indices[other_index];
            const VAL_T this_value = values[this_data_index];
            const VAL_T other_value = values[other_data_index];
            if ((this_value > other_value) == ascending) {
              indices[this_index] = other_data_index;
              indices[other_index] = this_data_index;
            }
          }
        }
        __syncthreads();
      }
    }
    
    // 内层循环2，对half_segment_length继续逐次减半，边界[max_depth-MAX_DEPTH+1, max_depth)在共享内存上进行。
    for (int block_index = 0; block_index < num_blocks; ++block_index) {
      const int this_index = block_index * static_cast<int>(BLOCK_DIM) + static_cast<int>(threadIdx.x);
      const int segment_index = this_index / segment_length;
      const bool ascending = ASCENDING ? (segment_index % 2 == 0) : (segment_index % 2 == 1);
      if (this_index < len) {
        const INDEX_T index = indices[this_index];
        shared_values[threadIdx.x] = values[index];
        shared_indices[threadIdx.x] = index;
      } else {
        shared_indices[threadIdx.x] = len;
      }
      __syncthreads();
      
      // 内存循环2的具体位置
      for (int inner_depth = max_depth - static_cast<int>(MAX_DEPTH) + 1; inner_depth < max_depth; ++inner_depth) {
        const int half_segment_length = (1 << (max_depth - inner_depth - 1));
        const int half_segment_index = this_index / half_segment_length;
        if (half_segment_index % 2 == 0) {
          const int other_index = static_cast<int>(threadIdx.x) + half_segment_length;
          const INDEX_T this_data_index = shared_indices[threadIdx.x];
          const INDEX_T other_data_index = shared_indices[other_index];
          const VAL_T this_value = shared_values[threadIdx.x];
          const VAL_T other_value = shared_values[other_index];
          if (other_data_index < len && (this_value > other_value) == ascending) {
            shared_indices[threadIdx.x] = other_data_index;
            shared_indices[other_index] = this_data_index;
            shared_values[threadIdx.x] = other_value;
            shared_values[other_index] = this_value;
          }
        }
        __syncthreads();
      }
      if (this_index < len) {
        indices[this_index] = shared_indices[threadIdx.x];
      }
      __syncthreads();
    }
  }
}

BitonicArgSortDevice()采用双调排序实现了block级别的argsort功能，为了更好的使用共享内存，循环写得比较庞大，整段代码分为3个部分：

7~12行根据数据长度获取max_depth;
30~95行进行了第1轮双调排序，深度[max_depth-1, max_depth-MAXDEPTH)
102到199进行了第2轮双调排序，针对剩余深度[max_depth-MAX_DEPTH, 1]

然后在第2轮双调排序中为了更好的利用全局内存，又对内存inner_depth循环进行了2次拆分，分别为：

137~159行，inner_depth范围[depth+1, max_depth-MAX_DEPTH]，没有使用共享内存；
162~198行，[max_depth-MAX_DEPTH+1, max_depth)，使用了共享内存。

如果共享内存充足或直接使用全局内存，两次循环拆分都可以避免；源码中采用这种拆分的实现也是为了更好的兼容。
下面我们将借用一个例子，详细的介绍BitonicArgSortDevice()实现逻辑。

我们输入的数组values = [1,3,6,2,1,4,8,3,15,34,12,31,3,8,9,21];
indice = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15];
len = 16
VAL_T = int, INDEX_T = int, ASCENDING=true(升序), BLOCK_DIM=16, MAX_DEPTH=4

只定义1个block，这样只有depth和inner_depth控制的两层循环，根据len=16计算得到的max_depth = 5；外层循环控制变量depth从4~1，内存循环控制变量inner_depth从depth+1 ~ 4；循环每一步对应数组values的元素值如图6所示，其中黄色表示在该次循环中对应段中元素升序，灰色表示在该次循环中对应段中元素应降序。

第1轮depth = 4, half_segment_length=1, 此轮中首先索引0, 2…14上的元素分别与索引1, 3…15上元素进行比较，在黄色格子内应满足升序，灰色格子内应满足降序，idx = 10和11处元素不满足降序进行位置交换，idx = 14和15处元素也不满足，也进行位置交换。此轮inner_depth=5，不满足边界条件，没有内存循环对half_segment_length进行减半。

第2轮depth = 3, half_segment_length=2, 此轮中首先索引0和2处元素进行比较，1和3处元素进行比较…对不满足升降序条件idx = 1,3; idx = 4,6; idx = 9,11; idx = 12,14; idx = 13, 15处元素进行交换。然后inner_depth = 4，对half_segment_length进行1次减半，half_segment_length变为1，对不满足单调性的idx = 2,3; idx=6,7; idx=8,9; idx=14,15进行交换。

后续第3轮depth=2和第4轮depth=1也同理进行交换，最终完成排序的数组为[1,1,2,3,3,3,4,6,8,8,9,12,15,21,31,34]。

当然，BitonicArgSortDevice()并不是对数组元素进行排序，而是数组的索引进行排序，就连传入的values都是使用const修饰的；本文用元素的例子是为了方便演示和理解。

图6 双调排序实现argsort示意图

LightGBM在GPU上的并行计算特性介绍

小节

至此，我们对LightGBM源码结构和训练过程中部分核心代码进行了介绍，了解了训练过程中的boosting和tree的生长。并且对LightGBM中基础的cuda算子，如并行规约和双调排序进行了详细的介绍。后续文章中我们将介绍更多业务层面的算子，如histogram的构造、寻找best split等。

本网站的内容主要来自互联网上的各种资源，仅供参考和信息分享之用，不代表本网站拥有相关版权或知识产权。如您认为内容侵犯您的权益，请联系我们，我们将尽快采取行动，包括删除或更正。

{{userData.name}}已认证

LightGBM在GPU上的并行计算特性介绍

引言

LightGBM源码简介

LightGBM boosting

LightGBM tree_learner

LightGBM cuda算子

规约

双调排序

小节

Argus-3D：一款强大的多模态3D形状生成大模型

文本识别算法的突破与实际应用

GeoSpy.ai

即梦Dreamina

Globe Explorer

Luma Dream Machine

抖音即创

Motionshop

归档

{{userData.name}}已认证

引言

LightGBM源码简介

LightGBM boosting

LightGBM tree_learner

LightGBM cuda算子

规约

双调排序

小节

Argus-3D：一款强大的多模态3D形状生成大模型

文本识别算法的突破与实际应用

LightGBM直方图算子的实现与优化

机器学习系列入门系列[七]：基于英雄联盟数据集的LightGBM的分类预测

Ubuntu 20.04安装CUDA开发工具教程

深入理解GPU：工作原理、用途和区别