In-Depth Understanding of Machine Learning - Imbalanced Learning: Sampling Techniques - The ADASYN Artificial Sampling Method
2022-07-19 03:12:00 【von Neumann】
Contents: In-Depth Understanding of Machine Learning - General Table of Contents
Like the Borderline-SMOTE algorithm, ADASYN (Adaptive Synthetic Sampling) is an improved version of SMOTE. The algorithm was proposed in 2008. Its main idea is to make full use of the density distribution of the samples to determine how often each minority sample is used as a seed sample, synthesizing more training data for the minority samples that are hard to learn, so as to correct, as far as possible, the negative effects of the imbalanced class distribution.
The ADASYN algorithm first determines the number of new minority samples to generate, namely $N^+ \times \text{SR}$. It then finds, for each minority sample $x_i^+,\ i=1,2,\cdots,N^+$, its $K$ nearest neighbors in the original training set $S$, and records the number of majority-class neighbors of the $i$-th minority sample as $N_i^\text{major}$. The proportion parameter $\Gamma_i$ of each minority sample is then given by:
$$\Gamma_i=\frac{N_i^\text{major}}{Z\times K}$$
where $Z$ is a normalization factor that guarantees $\sum_i \Gamma_i=1$. Once the proportion parameters are determined, the number of times each minority sample is selected as a seed sample is given by:
$$g_i=\Gamma_i\times N^+\times\text{SR}$$
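As a small numeric illustration of the two formulas above, the following sketch computes $\Gamma_i$ and $g_i$ from hypothetical majority-neighbor counts. All values here are made up for illustration; in practice $N_i^\text{major}$ comes from an actual $K$-nearest-neighbor search.

```python
# Toy illustration of the Gamma_i and g_i formulas. The neighbor counts
# below are made up; in practice they come from a K-nearest-neighbor search.
K = 5                    # neighborhood size
n_minority = 4           # N^+: number of minority samples
sampling_rate = 2        # SR: sampling rate

# Hypothetical N_i^major: majority-class neighbors of each minority sample
n_major_neighbors = [5, 3, 1, 0]

# r_i = N_i^major / K, then divide by Z so the Gamma_i sum to 1
ratios = [n / K for n in n_major_neighbors]
Z = sum(ratios)
gamma = [r / Z for r in ratios]

# g_i: how many synthetic samples to generate around minority sample i
g = [round(G * n_minority * sampling_rate) for G in gamma]
print(gamma)  # samples with more majority neighbors get larger Gamma_i
print(g)
```

Note how the budget of $N^+ \times \text{SR} = 8$ new samples is allocated: the boundary sample with five majority-class neighbors receives four of the eight new samples, while the interior sample with none receives zero.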
It is easy to see from the formula above that, like Borderline-SMOTE, the ADASYN algorithm pays more attention to the minority samples located near the decision boundary: they are selected as seed samples far more often than those located deep inside the minority region. Of course, this also further amplifies the propagation of minority-class noise. The specific flow of the ADASYN algorithm is as follows:
ADASYN Sampling method
Input: training set $S=\{(x_i, y_i),\ i=1,2,\cdots,N,\ y_i\in\{+,-\}\}$; number of majority-class samples $N^-$ and minority-class samples $N^+$, with $N^-+N^+=N$; imbalance ratio $\text{IR}=\frac{N^-}{N^+}$; sampling rate $\text{SR}$; neighborhood parameter $K$
Output: over-sampled training set $S'=\{(x_i, y_i),\ i=1,2,\cdots,N+N^+\times\text{SR},\ y_i\in\{+,-\}\}$
(1) Split the training set $S$ into the majority-class sample set $S^-$ and the minority-class sample set $S^+$
(2) Initialize the newly generated sample set $S^\text{New}$ to the empty set
(3) for $i=1:N^+$
(4)     take the sample $x_i$ from $S^+$
(5)     find the $K$ nearest neighbors of $x_i$ in $S$, and record the number of majority-class neighbors as $N_i^\text{major}$
(6)     compute its proportion parameter: $\Gamma_i=\frac{N_i^\text{major}}{Z\times K}$
(7)     compute the seed-sample frequency: $g_i=\Gamma_i\times N^+\times\text{SR}$
(8) for $i=1:N^+$
(9)     take the seed sample $x_i$ from $S^+$
(10)        for $j=1:g_i$
(11)            call the SMOTE procedure to generate a new sample $x_i^\text{new}$ from the seed sample $x_i$
(12)            add $x_i^\text{new}$ to $S^\text{New}$: $S^\text{New}=S^\text{New}\cup\{x_i^\text{new}\}$
(13) return the over-sampled training set $S'=S\cup S^\text{New}$
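The procedure above can be sketched in pure Python. This is a minimal illustration under assumptions made here for brevity, not a production implementation: the toy data, parameter values, rounding of $g_i$, and the simple linear-scan neighbor search are all choices of this sketch.

```python
import math
import random

def adasyn(majority, minority, sr=1, k=5, seed=0):
    """Minimal ADASYN sketch. majority/minority are lists of feature tuples;
    returns the list S^New of synthetic minority samples."""
    rng = random.Random(seed)
    s_all = majority + minority
    n_plus = len(minority)

    # Steps (3)-(5): count majority-class points among each minority
    # sample's k nearest neighbors in the full training set S.
    ratios = []
    for x in minority:
        neighbors = sorted((p for p in s_all if p is not x),
                           key=lambda p: math.dist(x, p))[:k]
        n_major = sum(1 for p in neighbors if p in majority)
        ratios.append(n_major / k)

    # Steps (6)-(7): normalize to Gamma_i and derive generation counts g_i.
    z = sum(ratios)
    if z == 0:                 # no minority sample has majority neighbors
        return []
    g = [round(r / z * n_plus * sr) for r in ratios]

    # Steps (8)-(12): SMOTE-style interpolation between each seed sample
    # and a randomly chosen one of its k nearest minority neighbors.
    new_samples = []
    for x, gi in zip(minority, g):
        min_neighbors = sorted((p for p in minority if p is not x),
                               key=lambda p: math.dist(x, p))[:k]
        if not min_neighbors:
            continue
        for _ in range(gi):
            nb = rng.choice(min_neighbors)
            lam = rng.random()
            new_samples.append(tuple(xj + lam * (nj - xj)
                                     for xj, nj in zip(x, nb)))
    return new_samples

# Toy data: 12 majority points near the origin, 4 minority points near (3, 3).
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(12)]
minority = [(3 + rng.gauss(0, 0.5), 3 + rng.gauss(0, 0.5)) for _ in range(4)]
synthetic = adasyn(majority, minority, sr=2, k=5)
print(len(synthetic))  # roughly N^+ x SR = 8 (rounding can shift the total)
```

In practice one would normally reach for a tested implementation such as the `ADASYN` class in the imbalanced-learn library, whose `n_neighbors` parameter plays the role of $K$ here.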