当前位置:网站首页>Dual tone sorting of CUDA and large arrays
Dual tone sorting of CUDA and large arrays
2022-07-19 06:37:00 【Woodland ningning】
CUDA Bi tonal sorting with large arrays
Two tone sort
Two tone sort , Also known as Bitonic Sort, The overall complexity is n log 2 ( n ) n\log^2(n) nlog2(n).
Merge and sort disadvantages
Compared with merge sort n log ( n ) n\log(n) nlog(n) Complexity , It seems to have little advantage , But the advantage is that it can use multi-core and multithreading more effectively than merge sorting , So in CUDA Under the blessing , Efficiency is often faster than merging .
Disadvantages of merge sort : When returning and sorting to the later stage , There are mainly a few paragraphs to be merged “ Large section ” constitute , The merging of these large segments is difficult to apply the advantages of multi-core and multi threading . Although the advantages of several large sections can be optimized through certain means , But it's more complicated to implement .
Since merging is so complicated , Why not implement a simpler Bi tonal sort ?
Dual tone sorting principle
The principle of Bi tonal sorting , Many bloggers have detailed instructions , One of the key properties is Batcher Theorem .
Double tone sequence
Double tone sequence (Bitonic Sequence) It refers to a non strictly increasing sequence X X X And non strict subtractive sequence Y Y Y Sequence of composition , Example sequence (23,10,8,3,5,7,11,78).
Definition : A sequence a 1 , a 2 , … , a n a_1,a_2,…,a_n a1,a2,…,an It's a two tone sequence (Bitonic Sequence), If :
- There is one. a k ( 1 ≤ k ≤ n ) a_k(1\le k \le n) ak(1≤k≤n), bring a 1 ≥ … ≥ a k ≤ … ≤ a n a_1\ge…\ge a_k \le…\le a_n a1≥…≥ak≤…≤an establish ; perhaps
- If the sequence can be circularly shifted, the condition is satisfied 1
Batcher Theorem
Grow any one into 2 n 2n 2n The double tone sequence of A A A Divided into two equal lengths X X X and Y Y Y, take X X X Elements in and Y Y Y The elements in are compared one by one in the original order , namely a [ i ] a[i] a[i] And a [ i + n ] , ( i < n ) a[i+n],(i < n) a[i+n],(i<n) Compare , Put the larger into MAX Sequence , Put in the smaller one MIN Sequence . Obtained MAX and MIN The sequence is still a two tone sequence , also MAX Any element in the sequence is not less than MIN Any element in the sequence .
Detailed explanation of dual tone sorting
But more details , Attached are two good blogs :
CUDA Code implementation
The actual code
#include <stdio.h>
const char* version_name = "Naive, sort.";
__device__ void swap_float(float* f1, float* f2) {
float tmp = *f1;
*f1 = *f2;
*f2 = tmp;
}
__global__ void _bitonic_sort(float* d_arr, unsigned stride, unsigned inner_stride) {
unsigned flipper = inner_stride >> 1;
unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
unsigned tid_other = tid ^ flipper;
if (tid < tid_other) {
// Manipulate the left half
if ((tid & stride) == 0) {
// Ascending order will be left here
if (d_arr[tid] > d_arr[tid_other]) {
swap_float(&d_arr[tid], &d_arr[tid_other]);
}
} else {
// The descending order will be left here
if (d_arr[tid] < d_arr[tid_other]) {
swap_float(&d_arr[tid], &d_arr[tid_other]);
}
}
}
}
/// entry point for gpu float sorting
/// \param arr memory on gpu device
/// \param len the length of the array
void float_sort(float arr[], int len) {
// First, check whether the length is 2 The power of
unsigned twoUpper = 1;
for (;twoUpper < len; twoUpper <<= 1) {
if (twoUpper == len) {
break;
}
}
// If it is host The pointer , return
cudaPointerAttributes attrs;
cudaPointerGetAttributes(&attrs, arr);
if (attrs.type != cudaMemoryTypeDevice) {
return;
}
float* d_input_arr;
unsigned input_arr_len;
if (twoUpper == len) {
input_arr_len = len;
d_input_arr = arr;
} else {
// need padding
input_arr_len = twoUpper;
cudaMalloc(&d_input_arr, sizeof(float) * input_arr_len);
// Then initialize
cudaMemcpy(d_input_arr, arr, sizeof(float) * len, cudaMemcpyHostToDevice);
cudaMemset(d_input_arr + len, 0x7f, sizeof(float) * (input_arr_len - len));
}
dim3 grid_dim((input_arr_len / 256 == 0)? 1 : input_arr_len / 256);
dim3 block_dim((input_arr_len / 256 == 0)? input_arr_len : 256);
// Sorting process ( a key )
for (unsigned stride = 2; stride <= input_arr_len; stride <<= 1) {
for (unsigned inner_stride = stride; inner_stride >= 2; inner_stride >>= 1) {
_bitonic_sort<<<grid_dim, block_dim>>>(d_input_arr, stride, inner_stride);
}
}
// If padding too , Then restore here
if (twoUpper != len) {
cudaMemcpy(arr, d_input_arr, sizeof(float) * len, cudaMemcpyDeviceToDevice);
cudaFree(d_input_arr);
}
}
Code analysis and example analysis
function float_sort(float arr[], int len)
Is the main function body of our sorting function .
ad locum ,float arr[]
Must be by CUDA Allocated video memory space , And has passed cudaMemcpy
Load the data to be sorted . If memory space is passed in , Will directly return.
Because the original Bi tonal sorting can only 2 n 2^n 2n Data sorting of , Therefore, the length is not 2 n 2^n 2n Array of , to padding.
During sorting , Would call _bitonic_sort(float* d_arr, unsigned stride, unsigned inner_stride)
, And this is the main part of Bi tonal sorting .float* d_arr
by padding Array after .
unsigned stride
Indicates the step size of the current Bi tonal sorting , Here is a detailed example :
When the bitonic sequence is 1 , 2 , 3 , 4 , 5 , 3 , 2 , 1 , 3 , 4 , 5 , 6 , 7 , 8 , 7 , 3 1,2,3,4,5,3,2,1,3,4,5,6,7,8,7,3 1,2,3,4,5,3,2,1,3,4,5,6,7,8,7,3 And in steps of 8 when , Means in steps 8 Grouping , Each group has a bitonic sequence , Here are two groups : 1 , 2 , 3 , 4 , 5 , 3 , 2 , 1 1,2,3,4,5,3,2,1 1,2,3,4,5,3,2,1 and 3 , 4 , 5 , 6 , 7 , 8 , 7 , 3 3,4,5,6,7,8,7,3 3,4,5,6,7,8,7,3.
after , We want to treat these two sequences as the former increasing , The latter decreases , Make the whole length 16 The array of becomes a bitonic sequence , Therefore, we will sort these two groups of sequences in two tones , It's just that in the results , The former is in ascending order , The latter is in descending order .
unsignd inner_stride
It means that one has passed stride
Sort the grouped bitonic sequence . Continue to use the above Liezi here , And the second half 3 , 4 , 5 , 6 , 7 , 8 , 7 , 3 3,4,5,6,7,8,7,3 3,4,5,6,7,8,7,3 Sort in descending order . stay float_sort()
Function body , so inner_stride
from stride
The index decreases to 2
.
inner_stride = 8
, The length is 8 The second half of is divided into two sections , Make two separate stanzas , The left half is larger than the right half , It depends on Batcher Theorem gets two sections 7 , 8 , 7 , 6 7,8,7,6 7,8,7,6 and 3 , 4 , 5 , 3 3,4,5,3 3,4,5,3( Note that there 7 , 8 , 7 , 6 7,8,7,6 7,8,7,6 After the shift is 7 , 6 , 7 , 8 7,6,7,8 7,6,7,8, It is still a bi tonal sequence , Satisfy Batcher Theorem )inner_stride = 4
, Here is simple , Only deal with 7 , 8 , 7 , 6 7,8,7,6 7,8,7,6, Another section 3 , 4 , 5 , 3 3,4,5,3 3,4,5,3 similar . The length will be 4 Of 7 , 8 , 7 , 6 7,8,7,6 7,8,7,6 according to Batcher The theorem is divided into two sections , obtain 7 , 8 7,8 7,8 and 7 , 6 7,6 7,6inner_stride = 2
, Then the length of the pair is 2 Of 7 , 8 7,8 7,8 and 7 , 6 7,6 7,6 It is divided into two sections and sorted , obtain 8 8 8, 7 7 7, 7 7 7 and 6 6 6, At this time, the first half of the second half has been arranged as 8 , 7 , 7 , 6 8,7,7,6 8,7,7,6 In descending order . Empathy , The second half of the second half is sorted as 5 , 4 , 3 , 3 5,4,3,3 5,4,3,3.- here , The second half is sorted as 8 , 7 , 7 , 6 , 5 , 4 , 3 , 3 8,7,7,6,5,4,3,3 8,7,7,6,5,4,3,3, Become a descending sequence . Empathy , Sort the first half , Get the ascending sequence 1 , 1 , 2 , 2 , 3 , 3 , 4 , 5 1,1,2,2,3,3,4,5 1,1,2,2,3,3,4,5, Then get a new array 1 , 1 , 2 , 2 , 3 , 3 , 4 , 5 , 8 , 7 , 7 , 6 , 5 , 4 , 3 , 3 1,1,2,2,3,3,4,5,8,7,7,6,5,4,3,3 1,1,2,2,3,3,4,5,8,7,7,6,5,4,3,3, It can be seen that there is a new Bi tonal sequence .
When stride = 8
when , from 1 , 2 , 3 , 4 , 5 , 3 , 2 , 1 , 3 , 4 , 5 , 6 , 7 , 8 , 7 , 3 1,2,3,4,5,3,2,1,3,4,5,6,7,8,7,3 1,2,3,4,5,3,2,1,3,4,5,6,7,8,7,3 To 1 , 1 , 2 , 2 , 3 , 3 , 4 , 5 , 8 , 7 , 7 , 6 , 5 , 4 , 3 , 3 1,1,2,2,3,3,4,5,8,7,7,6,5,4,3,3 1,1,2,2,3,3,4,5,8,7,7,6,5,4,3,3. after , Only need to stride = 16
Re execute the above process , You can get the final sequence in ascending order .
summary
Above we pass an example , Understand the code and thought in detail .
This code does not use device Medium share_memory, Because the data size can be very large ,share_memory Cannot save the entire array , So only through global Shared video memory address , And pass float_sort()
The cycle in threadBlock Synchronization between .
边栏推荐
- 机器人缝合手势识别和分类
- Es aggregation analysis reports an error: "reason": "text fields are not optimized for operations
- 2022/07/12 learning notes (day05) cycle
- Visual saliency based visual gaze estimation
- Mapping index attribute & operation of creating index
- 2022 robocom world robot developer competition - undergraduate group (provincial competition)
- 感知智能手机上用户的关注状态
- TypeScript学习
- 【力扣】用栈实现队列
- Get the current month, day, hour, minute, second and week, and update them in real time
猜你喜欢
Positional change of the eyeball during eye movements: evidence of translational movement
吴恩达机器学习第6-7章
DSL implements metrics aggregation
深度优先遍历(Depth First Search, 简称 DFS)
通過VOR深度估計解决三維注視交互中的目標模糊問題
Learning non posture gaze deviation with head movement
Read pictures and convert them to show different color spaces
Eye tracking in virtual reality
Perceive the attention status of users on smart phones
Handle Chinese word segmentation IK word segmenter and expand and stop dictionary
随机推荐
[Niuke] traversal of binary tree
Solutions to slow transmission speed of FileZilla virtual machine
Robot stitching gesture recognition and classification
XOR gun (bit operation, thinking, interval violence)
工作中遇到的一些问题
Markdown syntax and common shortcuts
Acwing daily question three thousand five hundred and eleven
使用候选选择从人类注视中学习视频显著性
从输入URL到展示出页面
斑点检测 记录
Cygwin 配合 Listary 切换当前目录快速打开
感知智能手機上用戶的關注狀態
Restclient query document
Perceive the attention status of users on smart phones
ACWing每日一题.3511
[force buckle] the same tree
網絡中的一些基本概念
Pytorch deep learning practice-b station Liu erden-day7
Make menuconfig missing ncurses
Markdown语法和常用快捷键