
In-Depth Understanding of Machine Learning - Imbalanced Learning: Sampling Techniques - [The ADASYN Method of Artificial Sampling]

2022-07-19 03:12:00 von Neumann

Contents: 《In-Depth Understanding of Machine Learning》 Series Table of Contents


Like the Borderline-SMOTE algorithm, ADASYN (Adaptive Synthetic Sampling) is an improved version of the SMOTE algorithm. Proposed in 2008, its main idea is to make full use of the density distribution of the samples to determine the frequency with which each minority sample is selected as a seed sample, synthesizing more training data for the minority samples that are harder to learn, so as to correct the negative effects of the imbalanced class distribution as much as possible.

The ADASYN algorithm first determines the number of new minority-class samples to generate, namely $N^+\times\text{SR}$. It then finds, in the original training set $S$, the $K$ nearest neighbors of each minority sample $x_i^+,\ i=1, 2, \cdots, N^+$, where the number of majority-class samples among the nearest neighbors of the $i$-th minority sample is denoted $N_i^\text{major}$. The proportion parameter $\Gamma_i$ of each minority sample is then determined by the following formula:
$$\Gamma_i=\frac{N_i^\text{major}}{Z\times K}$$

where $Z$ is a normalization factor that guarantees $\sum_i \Gamma_i=1$. Once the proportion parameters are determined, the frequency with which each minority sample is selected as a seed sample is given by the following formula:
$$g_i=\Gamma_i\times N^+\times\text{SR}$$
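As a quick sanity check of the two formulas above, here is a toy computation in Python (the neighbor counts, $N^+=3$, $\text{SR}=2$, and $K=5$ are made-up illustrative values): the sample with the most majority-class neighbors, i.e. the one closest to the decision boundary, receives the largest share of the new synthetic samples.

```python
# Toy check of the Gamma_i and g_i formulas (illustrative numbers only).
K, N_plus, SR = 5, 3, 2                 # K neighbors, N+ = 3 minority samples, SR = 2
n_major = [4, 2, 0]                     # assumed N_i^major for each minority sample

ratios = [n / K for n in n_major]       # N_i^major / K = 0.8, 0.4, 0.0
Z = sum(ratios)                         # normalization factor Z = 1.2
gamma = [r / Z for r in ratios]         # Gamma_i = 2/3, 1/3, 0 (sums to 1)
g = [gi * N_plus * SR for gi in gamma]  # g_i = 4, 2, 0 new samples per seed
```

Note that the $g_i$ sum to $N^+\times\text{SR}=6$, the total number of new samples to generate.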

It is not hard to see from the formula above that, like Borderline-SMOTE, the ADASYN algorithm pays more attention to minority samples located near the decision boundary: they are selected as seed samples far more frequently than samples lying deep inside the minority-class region. Of course, this also further amplifies the propagation of noisy minority-class samples. The specific flow of the ADASYN algorithm is as follows:

ADASYN Sampling Method
Input: training set $S=\{(x_i, y_i),\ i=1, 2, \cdots, N,\ y_i\in\{+, -\}\}$; number of majority-class samples $N^-$; number of minority-class samples $N^+$, where $N^-+N^+=N$; imbalance ratio $\text{IR}=\frac{N^-}{N^+}$; sampling rate $\text{SR}$; nearest-neighbor parameter $K$
Output: oversampled training set $S'=\{(x_i, y_i),\ i=1, 2, \cdots, N+N^+\times\text{SR},\ y_i\in\{+, -\}\}$
(1) Split the training set $S$ into the majority-class training sample set $S^-$ and the minority-class training sample set $S^+$
(2) Initialize the newly generated sample set $S^\text{New}$ to be empty
(3) for $i = 1:N^+$
(4) \quad take the corresponding sample $x_i$ from $S^+$
(5) \quad find the $K$ nearest neighbors of $x_i$ in $S$, and record the number of majority-class neighbors as $N_i^\text{major}$
(6) \quad compute its proportion parameter: $\Gamma_i=\frac{N_i^\text{major}}{Z\times K}$
(7) \quad compute its seed-sample frequency: $g_i=\Gamma_i\times N^+\times\text{SR}$
(8) for $i = 1:N^+$
(9) \quad take the seed sample $x_i$ from $S^+$
(10) \quad for $j = 1:g_i$
(11) \qquad call the SMOTE algorithm to generate a new sample $x_i^\text{new}$ from the seed sample $x_i$
(12) \qquad add $x_i^\text{new}$ to $S^\text{New}$: $S^\text{New}=S^\text{New}\cup \{x_i^\text{new}\}$
(13) return the oversampled training set $S'=S\cup S^\text{New}$
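To make the pseudocode concrete, here is a minimal NumPy sketch of the procedure above. It is an illustrative reading of the steps, not a reference implementation: the function name `adasyn_oversample`, the encoding of the minority class as `+1` and the majority class as `-1`, and the brute-force neighbor search are all assumptions of this sketch.

```python
import numpy as np

def adasyn_oversample(X, y, sr=1.0, k=5, seed=None):
    """ADASYN sketch: X is an (N, d) array; y holds +1 (minority) and
    -1 (majority) labels; sr and k play the roles of SR and K."""
    rng = np.random.default_rng(seed)
    X_min = X[y == 1]                    # S+: minority samples
    n_plus = len(X_min)
    n_new = int(round(n_plus * sr))      # total new samples: N+ x SR

    # Steps (3)-(5): count majority-class samples among the K nearest
    # neighbors of each minority sample in the full training set S.
    n_major = np.empty(n_plus)
    for i, x in enumerate(X_min):
        dist = np.linalg.norm(X - x, axis=1)
        nn = np.argsort(dist)[1:k + 1]   # skip the sample itself
        n_major[i] = np.sum(y[nn] == -1)

    # Steps (6)-(7): proportion parameters Gamma_i (the division by Z
    # normalizes them to sum to 1) and per-seed generation counts g_i.
    ratios = n_major / k
    Z = ratios.sum()
    gamma = ratios / Z if Z > 0 else np.full(n_plus, 1.0 / n_plus)
    g = np.rint(gamma * n_new).astype(int)

    # Steps (8)-(12): SMOTE-style interpolation between each seed sample
    # and a random one of its K nearest minority-class neighbors
    # (assumes at least K + 1 minority samples exist).
    new_samples = []
    for i in range(n_plus):
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(dist)[1:k + 1]
        for _ in range(g[i]):
            neighbor = X_min[rng.choice(nn)]
            lam = rng.random()           # interpolation weight in [0, 1)
            new_samples.append(X_min[i] + lam * (neighbor - X_min[i]))

    # Step (13): return S' = S united with the newly generated samples.
    if not new_samples:
        return X, y
    X_new = np.array(new_samples)
    return (np.vstack([X, X_new]),
            np.concatenate([y, np.ones(len(X_new), dtype=y.dtype)]))
```

In practice, the imbalanced-learn library provides a tested ADASYN implementation with essentially the same knobs:

```python
from imblearn.over_sampling import ADASYN

X_resampled, y_resampled = ADASYN(n_neighbors=5, random_state=0).fit_resample(X, y)
```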

Copyright notice: this article was created by [von Neumann]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/200/202207170033450589.html