Machine Learning: PCA Feature Dimensionality Reduction + Case Practice
2022-07-19 05:05:00 【Big data Da Wenxi】
1 Dimensionality reduction
Dimensionality reduction is the process of reducing the number of random variables (features) under certain constraints, to obtain a set of "uncorrelated" principal variables.
- Reduce the number of random variables
- Correlated features (correlated feature)
    - e.g. the correlation between relative humidity and rainfall
    - etc.
This matters because we learn from the features during training. If the features themselves are flawed, or the correlation between features is strong, this has a large impact on how the algorithm learns and predicts.
2 Two ways of dimensionality reduction
- Feature selection
- Principal component analysis (which can be understood as a form of feature extraction)
2 What is feature selection
1 Definition
The data contains redundant or irrelevant variables (also called features, attributes, indicators, etc.). Feature selection aims to find the main features from among the original features.
2 Method
- Filter (filter methods): mainly examine the characteristics of the features themselves, the association between features, and the association between each feature and the target value
    - Variance selection: low-variance feature filtering
    - Correlation coefficient
- Embedded (embedded methods): the algorithm selects features automatically, based on the relationship between features and the target value
    - Decision trees: information entropy, information gain
    - Regularization: L1, L2
    - Deep learning: convolution, etc.
Embedded methods are easier to understand when introduced together with the corresponding algorithms, so we only cover them there.
3 Module
sklearn.feature_selection
4 Filter methods
4.1 Low-variance feature filtering
Remove features with low variance; we have already discussed what variance means. Now consider what the size of a feature's variance tells us:
- Small feature variance: most samples have similar values for that feature
- Large feature variance: the values of that feature differ a lot across samples
4.1.1 API
- sklearn.feature_selection.VarianceThreshold(threshold = 0.0)
- Delete all low variance features
- Variance.fit_transform(X)
- X:numpy array Formatted data [n_samples,n_features]
- Return value : The difference of training set is lower than threshold The features of will be removed . The default value is to keep all non-zero variance features , That is to delete the features with the same value in all samples .
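A minimal usage sketch of this API (the toy array below is made up for illustration): with the default threshold of 0.0, only the constant columns are removed.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: columns 0, 1 and 3 are constant, so their variance is 0
X = np.array([[0, 2, 0, 3],
              [0, 2, 4, 3],
              [0, 2, 1, 3]])

# threshold=0.0 (the default) removes every zero-variance feature
selector = VarianceThreshold(threshold=0.0)
print(selector.fit_transform(X))
# [[0]
#  [4]
#  [1]]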
4.1.2 Data calculation
We filter among the indicator features of some stocks. The data is in the file "factor_regression_data/factor_returns.csv". The 'index', 'date' and 'return' columns are not considered (their types do not match, and they are not the indicators we need).
All of these features:
pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense
index,pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense,date,return
0,000001.XSHE,5.9572,1.1818,85252550922.0,0.8008,14.9403,1211444855670.0,2.01,20701401000.0,10882540000.0,2012-01-31,0.027657228229937388
1,000002.XSHE,7.0289,1.588,84113358168.0,1.6463,7.8656,300252061695.0,0.326,29308369223.2,23783476901.2,2012-01-31,0.08235182370820669
2,000008.XSHE,-262.7461,7.0003,517045520.0,-0.5678,-0.5943,770517752.56,-0.006,11679829.03,12030080.04,2012-01-31,0.09978900335112327
3,000060.XSHE,16.476,3.7146,19680455995.0,5.6036,14.617,28009159184.6,0.35,9189386877.65,7935542726.05,2012-01-31,0.12159482758620697
4,000069.XSHE,12.5878,2.5616,41727214853.0,2.8729,10.9097,81247380359.0,0.271,8951453490.28,7091397989.13,2012-01-31,-0.0026808154146886697
- Analysis
1. Initialize VarianceThreshold and specify the variance threshold
2. Call fit_transform
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def variance_demo():
    """Remove low-variance features (feature selection)
    :return: None
    """
    data = pd.read_csv("factor_returns.csv")
    print(data)
    # 1. Instantiate a transformer class
    transfer = VarianceThreshold(threshold=1)
    # 2. Call fit_transform on the indicator columns only
    data = transfer.fit_transform(data.iloc[:, 1:10])
    print("Results of removing low-variance features:\n", data)
    print("Shape:\n", data.shape)
    return None
Returned results:
index pe_ratio pb_ratio market_cap \
0 000001.XSHE 5.9572 1.1818 8.525255e+10
1 000002.XSHE 7.0289 1.5880 8.411336e+10
... ... ... ... ...
2316 601958.XSHG 52.5408 2.4646 3.287910e+10
2317 601989.XSHG 14.2203 1.4103 5.911086e+10
return_on_asset_net_profit du_return_on_equity ev \
0 0.8008 14.9403 1.211445e+12
1 1.6463 7.8656 3.002521e+11
... ... ... ...
2316 2.7444 2.9202 3.883803e+10
2317 2.0383 8.6179 2.020661e+11
earnings_per_share revenue total_expense date return
0 2.0100 2.070140e+10 1.088254e+10 2012-01-31 0.027657
1 0.3260 2.930837e+10 2.378348e+10 2012-01-31 0.082352
2 -0.0060 1.167983e+07 1.203008e+07 2012-01-31 0.099789
... ... ... ... ... ...
2315 0.2200 1.789082e+10 1.749295e+10 2012-11-30 0.137134
2316 0.1210 6.465392e+09 6.009007e+09 2012-11-30 0.149167
2317 0.2470 4.509872e+10 4.132842e+10 2012-11-30 0.183629
[2318 rows x 12 columns]
Results of removing low-variance features:
[[ 5.95720000e+00 1.18180000e+00 8.52525509e+10 ..., 1.21144486e+12
2.07014010e+10 1.08825400e+10]
[ 7.02890000e+00 1.58800000e+00 8.41133582e+10 ..., 3.00252062e+11
2.93083692e+10 2.37834769e+10]
[ -2.62746100e+02 7.00030000e+00 5.17045520e+08 ..., 7.70517753e+08
1.16798290e+07 1.20300800e+07]
...,
[ 3.95523000e+01 4.00520000e+00 1.70243430e+10 ..., 2.42081699e+10
1.78908166e+10 1.74929478e+10]
[ 5.25408000e+01 2.46460000e+00 3.28790988e+10 ..., 3.88380258e+10
6.46539204e+09 6.00900728e+09]
[ 1.42203000e+01 1.41030000e+00 5.91108572e+10 ..., 2.02066110e+11
4.50987171e+10 4.13284212e+10]]
Shape:
(2318, 8)
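Note that of the 9 indicator columns selected by iloc[:, 1:10], exactly one had a training-set variance below the threshold of 1 and was removed, which is why the output keeps 8 feature columns.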
4.2 The correlation coefficient
- Pearson correlation coefficient (Pearson Correlation Coefficient)
    - A statistical indicator that reflects how closely two variables are related
4.2.2 Formula and calculation example (for understanding only, no need to memorize)
- The formula:

r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^{2} - (\sum x)^{2}}\,\sqrt{n\sum y^{2} - (\sum y)^{2}}}
- For example: compute the correlation coefficient between annual advertising spend and average monthly sales. (The data table and the step-by-step substitution into the formula were shown as images in the original post.)
The final calculation shows that advertising spend and average monthly sales are highly positively correlated.
4.2.3 Characteristics
The correlation coefficient r lies between -1 and +1, i.e. -1 ≤ r ≤ +1. Its properties:
- When r > 0, the two variables are positively correlated; when r < 0, they are negatively correlated
- When |r| = 1, the two variables are perfectly linearly correlated; when r = 0, there is no linear correlation between them
- When 0 < |r| < 1, the two variables are correlated to some degree. The closer |r| is to 1, the stronger the linear relationship; the closer |r| is to 0, the weaker the linear correlation
- A common rule of thumb: |r| < 0.4 is low correlation; 0.4 ≤ |r| < 0.7 is significant correlation; 0.7 ≤ |r| < 1 is high linear correlation
Here |r| denotes the absolute value of r, e.g. |-5| = 5.
4.2.4 API
- from scipy.stats import pearsonr
    - x: (N,) array_like
    - y: (N,) array_like
    - Returns: (Pearson's correlation coefficient, p-value)
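A minimal usage sketch (the numbers below are made up for illustration):
from scipy.stats import pearsonr

# Toy data with an almost perfect positive linear relationship
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 11]

r, p_value = pearsonr(x, y)
print(r)        # close to +1: a high positive linear correlation
print(p_value)  # small p-value: the correlation is unlikely to be chance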
4.2.5 Case study: correlation between the financial indicators of stocks
Let's compute the correlation between the stock indicators used above. Suppose we take
factor = ['pe_ratio','pb_ratio','market_cap','return_on_asset_net_profit','du_return_on_equity','ev','earnings_per_share','revenue','total_expense']
and compute the correlation coefficient for every pair of these features, to find the pairs with high correlation.
- Analysis
    - Compute the correlation between every pair of features
import pandas as pd
from scipy.stats import pearsonr

def pearsonr_demo():
    """Compute the correlation coefficient between each pair of indicators
    :return: None
    """
    data = pd.read_csv("factor_returns.csv")
    factor = ['pe_ratio', 'pb_ratio', 'market_cap', 'return_on_asset_net_profit', 'du_return_on_equity', 'ev',
              'earnings_per_share', 'revenue', 'total_expense']
    # Iterate over every pair of indicators (i, j) with i < j
    for i in range(len(factor)):
        for j in range(i, len(factor) - 1):
            print("The correlation between indicator %s and indicator %s is %f"
                  % (factor[i], factor[j + 1], pearsonr(data[factor[i]], data[factor[j + 1]])[0]))
    return None
Returned results:
The correlation between indicator pe_ratio and indicator pb_ratio is -0.004389
The correlation between indicator pe_ratio and indicator market_cap is -0.068861
The correlation between indicator pe_ratio and indicator return_on_asset_net_profit is -0.066009
The correlation between indicator pe_ratio and indicator du_return_on_equity is -0.082364
The correlation between indicator pe_ratio and indicator ev is -0.046159
The correlation between indicator pe_ratio and indicator earnings_per_share is -0.072082
The correlation between indicator pe_ratio and indicator revenue is -0.058693
The correlation between indicator pe_ratio and indicator total_expense is -0.055551
The correlation between indicator pb_ratio and indicator market_cap is 0.009336
The correlation between indicator pb_ratio and indicator return_on_asset_net_profit is 0.445381
The correlation between indicator pb_ratio and indicator du_return_on_equity is 0.291367
The correlation between indicator pb_ratio and indicator ev is -0.183232
The correlation between indicator pb_ratio and indicator earnings_per_share is 0.198708
The correlation between indicator pb_ratio and indicator revenue is -0.177671
The correlation between indicator pb_ratio and indicator total_expense is -0.173339
The correlation between indicator market_cap and indicator return_on_asset_net_profit is 0.214774
The correlation between indicator market_cap and indicator du_return_on_equity is 0.316288
The correlation between indicator market_cap and indicator ev is 0.565533
The correlation between indicator market_cap and indicator earnings_per_share is 0.524179
The correlation between indicator market_cap and indicator revenue is 0.440653
The correlation between indicator market_cap and indicator total_expense is 0.386550
The correlation between indicator return_on_asset_net_profit and indicator du_return_on_equity is 0.818697
The correlation between indicator return_on_asset_net_profit and indicator ev is -0.101225
The correlation between indicator return_on_asset_net_profit and indicator earnings_per_share is 0.635933
The correlation between indicator return_on_asset_net_profit and indicator revenue is 0.038582
The correlation between indicator return_on_asset_net_profit and indicator total_expense is 0.027014
The correlation between indicator du_return_on_equity and indicator ev is 0.118807
The correlation between indicator du_return_on_equity and indicator earnings_per_share is 0.651996
The correlation between indicator du_return_on_equity and indicator revenue is 0.163214
The correlation between indicator du_return_on_equity and indicator total_expense is 0.135412
The correlation between indicator ev and indicator earnings_per_share is 0.196033
The correlation between indicator ev and indicator revenue is 0.224363
The correlation between indicator ev and indicator total_expense is 0.149857
The correlation between indicator earnings_per_share and indicator revenue is 0.141473
The correlation between indicator earnings_per_share and indicator total_expense is 0.105022
The correlation between indicator revenue and indicator total_expense is 0.995845
From this we can see, in particular, that:
- the correlation between indicator revenue and indicator total_expense is 0.995845
- the correlation between indicator return_on_asset_net_profit and indicator du_return_on_equity is 0.818697
We can also inspect the result with a plot:
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8), dpi=100)
plt.scatter(data['revenue'], data['total_expense'])
plt.show()
These two indicators are strongly correlated, so they can be processed further, for example by synthesizing the two into a single indicator (as sketched below).
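As a hedged sketch of that idea (using PCA, which is introduced in the next section, and the same factor_returns.csv file assumed above), the two strongly correlated columns can be merged into a single synthetic indicator:
import pandas as pd
from sklearn.decomposition import PCA

data = pd.read_csv("factor_returns.csv")

# revenue and total_expense have r ≈ 0.996, so a single principal
# component captures almost all of their joint variation
pca = PCA(n_components=1)
merged = pca.fit_transform(data[["revenue", "total_expense"]])
print(merged.shape)                   # (2318, 1): two columns merged into one
print(pca.explained_variance_ratio_)  # close to 1.0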
3 What is principal component analysis (PCA)
- Definition: the process of transforming high-dimensional data into low-dimensional data; along the way some of the original data may be discarded and new variables created
- Effect: compresses the data's dimensionality, reducing the dimension (complexity) of the original data as much as possible while losing as little information as possible
- Application: regression analysis or cluster analysis
The notion of "information" will be introduced with decision trees.
To get a better feel for this process, let's work through a small example.
1 Worked example (for understanding only, no need to memorize)
Suppose we are given the following 5 points:
(-1,-2)
(-1, 0)
( 0, 0)
( 2, 1)
( 0, 1)
Requirement: reduce this two-dimensional data to one dimension, losing as little information as possible.
How is this computed? Find a suitable straight line and project the points onto it; principal component analysis obtains this result through a matrix operation (whose details are not required here — a small sketch follows).
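For intuition, here is a small sketch of that matrix operation on the five points above (an illustration, not the sklearn internals): center the data, form the covariance matrix, and project onto the eigenvector with the largest eigenvalue — that eigenvector is the "right line".
import numpy as np

# The five 2-D points from above
X = np.array([[-1, -2], [-1, 0], [0, 0], [2, 1], [0, 1]], dtype=float)

# 1. Center each feature (for these points the means are already 0)
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the two features
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Project onto the eigenvector with the largest eigenvalue
principal_axis = eigvecs[:, -1]
X_1d = X_centered @ principal_axis
print(X_1d)  # one value per point: the 2-D data reduced to 1-D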
2 API
- sklearn.decomposition.PCA(n_components=None)
    - Projects the data into a lower-dimensional space
    - n_components:
        - a decimal: the fraction of information (variance) to retain
        - an integer: the number of features to reduce to
    - PCA.fit_transform(X)  X: data in numpy array format [n_samples, n_features]
    - Return value: the transformed array with the specified number of dimensions
3 Data calculation
First, let's compute with a simple piece of data:
[[2,8,4,5],
[6,3,0,8],
[5,4,9,1]]
from sklearn.decomposition import PCA

def pca_demo():
    """Apply PCA dimensionality reduction to the data
    :return: None
    """
    data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
    # 1. Instantiate PCA with a decimal: the fraction of information to keep
    transfer = PCA(n_components=0.9)
    # 2. Call fit_transform
    data1 = transfer.fit_transform(data)
    print("Result of reducing dimensionality while retaining 90% of the information:\n", data1)
    # 1. Instantiate PCA with an integer: the number of dimensions to reduce to
    transfer2 = PCA(n_components=3)
    # 2. Call fit_transform
    data2 = transfer2.fit_transform(data)
    print("Result of reducing to 3 dimensions:\n", data2)
    return None
Returned results:
Result of reducing dimensionality while retaining 90% of the information:
[[ -3.13587302e-16 3.82970843e+00]
[ -5.74456265e+00 -1.91485422e+00]
[ 5.74456265e+00 -1.91485422e+00]]
Result of reducing to 3 dimensions:
[[ -3.13587302e-16 3.82970843e+00 4.59544715e-16]
[ -5.74456265e+00 -1.91485422e+00 4.59544715e-16]
[ 5.74456265e+00 -1.91485422e+00 4.59544715e-16]]
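Note that in the second result the third column is on the order of 1e-16, i.e. effectively zero: with only 3 samples, the centered data has rank at most 2, so PCA can produce at most 2 informative components regardless of n_components.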
4 Case study: explore users' preferences for item categories (segmentation and dimensionality reduction)
The data files are as follows:
- order_products__prior.csv: order and product information
    - Fields: order_id, product_id, add_to_cart_order, reordered
- products.csv: product information
    - Fields: product_id, product_name, aisle_id, department_id
- orders.csv: users' order information
    - Fields: order_id, user_id, eval_set, order_number, ...
- aisles.csv: the specific item category (aisle) each product belongs to
    - Fields: aisle_id, aisle
1 Requirement
Explore users' preferences for item categories by reducing the dimensionality of the user-aisle data.
2 Analysis
1. Merge the tables so that user_id and aisle end up in one table
2. Build a crosstab of user_id against aisle
3. Apply PCA dimensionality reduction
3 Complete code
import pandas as pd
from sklearn.decomposition import PCA

# 1. Load the data sets
# - products.csv: product information
#   Fields: product_id, product_name, aisle_id, department_id
# - order_products__prior.csv: order and product information
#   Fields: order_id, product_id, add_to_cart_order, reordered
# - orders.csv: users' order information
#   Fields: order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day, days_since_prior_order
# - aisles.csv: the specific item category (aisle) each product belongs to
#   Fields: aisle_id, aisle
products = pd.read_csv("./instacart/products.csv")
order_products = pd.read_csv("./instacart/order_products__prior.csv")
orders = pd.read_csv("./instacart/orders.csv")
aisles = pd.read_csv("./instacart/aisles.csv")

# 2. Merge the tables so that user_id and aisle are in one table
# 1) Merge orders and order_products on order_id -> tab1: order_id, product_id, user_id, ...
tab1 = pd.merge(orders, order_products, on="order_id")
# 2) Merge tab1 and products on product_id -> tab2 gains aisle_id
tab2 = pd.merge(tab1, products, on="product_id")
# 3) Merge tab2 and aisles on aisle_id -> tab3 has user_id and aisle
tab3 = pd.merge(tab2, aisles, on="aisle_id")

# 3. Crosstab: count purchases per user_id and aisle
table = pd.crosstab(tab3["user_id"], tab3["aisle"])

# 4. PCA dimensionality reduction
# 1) Instantiate the PCA transformer, retaining 95% of the variance
transfer = PCA(n_components=0.95)
# 2) fit_transform
data = transfer.fit_transform(table)
print(data.shape)
Returned result:
(206209, 44)
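In other words, PCA compressed the one-column-per-aisle crosstab down to 44 components for each of the 206,209 users while still retaining 95% of the variance, which makes downstream analysis such as user segmentation much cheaper.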