当前位置：网站首页>Dual tone sorting of CUDA and large arrays

Dual tone sorting of CUDA and large arrays

2022-07-19 06:37:00 【Woodland ningning】

CUDA Bi tonal sorting with large arrays

Two tone sort

Two tone sort , Also known as Bitonic Sort, The overall complexity is $n\log^2(n)$ .

Merge and sort disadvantages

Compared with merge sort $n\log(n)$ Complexity , It seems to have little advantage , But the advantage is that it can use multi-core and multithreading more effectively than merge sorting , So in CUDA Under the blessing , Efficiency is often faster than merging .

Disadvantages of merge sort ： When returning and sorting to the later stage , There are mainly a few paragraphs to be merged “ Large section ” constitute , The merging of these large segments is difficult to apply the advantages of multi-core and multi threading . Although the advantages of several large sections can be optimized through certain means , But it's more complicated to implement .

Since merging is so complicated , Why not implement a simpler Bi tonal sort ？

Dual tone sorting principle

The principle of Bi tonal sorting , Many bloggers have detailed instructions , One of the key properties is Batcher Theorem .

Double tone sequence

Double tone sequence (Bitonic Sequence) It refers to a non strictly increasing sequence $X$ And non strict subtractive sequence $Y$ Sequence of composition , Example sequence （23,10,8,3,5,7,11,78）.

Definition : A sequence $a_1,a_2,…,a_n$ It's a two tone sequence (Bitonic Sequence), If ：

There is one. $a_k(1\le k \le n)$ , bring $a_1\ge…\ge a_k \le…\le a_n$ establish ; perhaps
If the sequence can be circularly shifted, the condition is satisfied 1

Batcher Theorem

Grow any one into $2 n$ The double tone sequence of $A$ Divided into two equal lengths $X$ and $Y$ , take $X$ Elements in and $Y$ The elements in are compared one by one in the original order , namely $a [i]$ And $a [i + n], (i < n)$ Compare , Put the larger into MAX Sequence , Put in the smaller one MIN Sequence . Obtained MAX and MIN The sequence is still a two tone sequence , also MAX Any element in the sequence is not less than MIN Any element in the sequence .

Detailed explanation of dual tone sorting

But more details , Attached are two good blogs ：

CUDA Code implementation

The actual code

#include <stdio.h>
const char* version_name = "Naive, sort.";

__device__ void swap_float(float* f1, float* f2) {
    
    float tmp = *f1;
    *f1 = *f2;
    *f2 = tmp;
}

__global__ void _bitonic_sort(float* d_arr, unsigned stride, unsigned inner_stride) {
    
    unsigned flipper = inner_stride >> 1;
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned tid_other = tid ^ flipper;

    if (tid < tid_other) {
    
        //  Manipulate the left half 
        if ((tid & stride) == 0) {
    
            //  Ascending order will be left here 
            if (d_arr[tid] > d_arr[tid_other]) {
    
                swap_float(&d_arr[tid], &d_arr[tid_other]);
            }
        } else {
    
            //  The descending order will be left here 
            if (d_arr[tid] < d_arr[tid_other]) {
    
                swap_float(&d_arr[tid], &d_arr[tid_other]);
            }
        }
    }
}

/// entry point for gpu float sorting
/// \param arr memory on gpu device
/// \param len the length of the array
void float_sort(float arr[], int len) {
    
    //  First, check whether the length is  2  The power of 
    unsigned twoUpper = 1;
    for (;twoUpper < len; twoUpper <<= 1) {
    
        if (twoUpper == len) {
    
            break;
        }
    }

    //  If it is  host  The pointer , return 
    cudaPointerAttributes attrs;
    cudaPointerGetAttributes(&attrs, arr);
    if (attrs.type != cudaMemoryTypeDevice) {
    
        return;
    }

    float* d_input_arr;
    unsigned input_arr_len;
    if (twoUpper == len) {
    
        input_arr_len = len;
        d_input_arr = arr;
    } else {
    
        //  need  padding
        input_arr_len = twoUpper;
        cudaMalloc(&d_input_arr, sizeof(float) * input_arr_len);
        //  Then initialize 
        cudaMemcpy(d_input_arr, arr, sizeof(float) * len, cudaMemcpyHostToDevice);
        cudaMemset(d_input_arr + len, 0x7f, sizeof(float) * (input_arr_len - len));
    }

    dim3 grid_dim((input_arr_len / 256 == 0)? 1 : input_arr_len / 256);
    dim3 block_dim((input_arr_len / 256 == 0)? input_arr_len : 256);
    
    //  Sorting process ( a key )
    for (unsigned stride = 2; stride <= input_arr_len; stride <<= 1) {
    
        for (unsigned inner_stride = stride; inner_stride >= 2; inner_stride >>= 1) {
    
            _bitonic_sort<<<grid_dim, block_dim>>>(d_input_arr, stride, inner_stride);
        }
    }

    //  If  padding  too , Then restore here 
    if (twoUpper != len) {
    
        cudaMemcpy(arr, d_input_arr, sizeof(float) * len, cudaMemcpyDeviceToDevice);
        cudaFree(d_input_arr);
    }
}

Code analysis and example analysis

function float_sort(float arr[], int len) Is the main function body of our sorting function .

ad locum ,float arr[] Must be by CUDA Allocated video memory space , And has passed cudaMemcpy Load the data to be sorted . If memory space is passed in , Will directly return.

Because the original Bi tonal sorting can only $2^n$ Data sorting of , Therefore, the length is not $2^n$ Array of , to padding.

During sorting , Would call _bitonic_sort(float* d_arr, unsigned stride, unsigned inner_stride), And this is the main part of Bi tonal sorting .
float* d_arr by padding Array after .

unsigned stride Indicates the step size of the current Bi tonal sorting , Here is a detailed example ：

When the bitonic sequence is $1, 2, 3, 4, 5, 3, 2, 1, 3, 4, 5, 6, 7, 8, 7, 3$ And in steps of 8 when , Means in steps 8 Grouping , Each group has a bitonic sequence , Here are two groups ： $1, 2, 3, 4, 5, 3, 2, 1$ and $3, 4, 5, 6, 7, 8, 7, 3$ .

after , We want to treat these two sequences as the former increasing , The latter decreases , Make the whole length 16 The array of becomes a bitonic sequence , Therefore, we will sort these two groups of sequences in two tones , It's just that in the results , The former is in ascending order , The latter is in descending order .

unsignd inner_stride It means that one has passed stride Sort the grouped bitonic sequence . Continue to use the above Liezi here , And the second half $3, 4, 5, 6, 7, 8, 7, 3$ Sort in descending order . stay float_sort() Function body , so inner_stride from stride The index decreases to 2.

inner_stride = 8, The length is 8 The second half of is divided into two sections , Make two separate stanzas , The left half is larger than the right half , It depends on Batcher Theorem gets two sections $7, 8, 7, 6$ and $3, 4, 5, 3$ （ Note that there $7, 8, 7, 6$ After the shift is $7, 6, 7, 8$ , It is still a bi tonal sequence , Satisfy Batcher Theorem ）
inner_stride = 4, Here is simple , Only deal with $7, 8, 7, 6$ , Another section $3, 4, 5, 3$ similar . The length will be 4 Of $7, 8, 7, 6$ according to Batcher The theorem is divided into two sections , obtain $7, 8$ and $7, 6$
inner_stride = 2, Then the length of the pair is 2 Of $7, 8$ and $7, 6$ It is divided into two sections and sorted , obtain $8$ , $7$ , $7$ and $6$ , At this time, the first half of the second half has been arranged as $8, 7, 7, 6$ In descending order . Empathy , The second half of the second half is sorted as $5, 4, 3, 3$ .
here , The second half is sorted as $8, 7, 7, 6, 5, 4, 3, 3$ , Become a descending sequence . Empathy , Sort the first half , Get the ascending sequence $1, 1, 2, 2, 3, 3, 4, 5$ , Then get a new array $1, 1, 2, 2, 3, 3, 4, 5, 8, 7, 7, 6, 5, 4, 3, 3$ , It can be seen that there is a new Bi tonal sequence .

When stride = 8 when , from $1, 2, 3, 4, 5, 3, 2, 1, 3, 4, 5, 6, 7, 8, 7, 3$ To $1, 1, 2, 2, 3, 3, 4, 5, 8, 7, 7, 6, 5, 4, 3, 3$ . after , Only need to stride = 16 Re execute the above process , You can get the final sequence in ascending order .

summary

Above we pass an example , Understand the code and thought in detail .

This code does not use device Medium share_memory, Because the data size can be very large ,share_memory Cannot save the entire array , So only through global Shared video memory address , And pass float_sort() The cycle in threadBlock Synchronization between .

原网站

版权声明
本文为[Woodland ningning]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/200/202207170519277972.html