
PCA Feature Dimensionality Reduction in Machine Learning + Case Practice

2022-07-19 05:05:00 Big data Da Wenxi


1 Dimension reduction

Dimensionality reduction is the process of reducing the number of random variables (features) under certain constraints, in order to obtain a set of "uncorrelated" principal variables.

  • Reduce the number of random variables


    • Correlated features (correlated feature)
      • e.g. the correlation between relative humidity and rainfall
      • etc.

    This matters because during training we learn from the features. If a feature itself is problematic, or features are strongly correlated with each other, this has a large impact on how the algorithm learns and predicts.

2 Two ways of dimensionality reduction

  • Feature selection
  • Principal component analysis (which can be understood as a form of feature extraction)

2 What is feature selection

1 Definition

Data often contains redundant or irrelevant variables (also called features, attributes, indicators, etc.). Feature selection aims to find the principal features among the original features.


2 Methods

  • Filter: mainly examines the characteristics of the features themselves, the correlation between features, and the correlation between features and the target value
    • Variance selection: low-variance feature filtering
    • Correlation coefficient
  • Embedded: the algorithm selects features automatically (based on the relationship between features and the target value)
    • Decision trees: information entropy, information gain
    • Regularization: L1, L2
    • Deep learning: convolution, etc.

The embedded methods are best explained together with the corresponding algorithms, where they are easier to understand; the sketch below gives a quick preview.
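For a quick flavor (a minimal sketch only; the load_iris dataset and the model choice are illustrative assumptions, not part of this tutorial's material), embedded selection with L1 regularization can be done via sklearn's SelectFromModel:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # illustrative dataset with 4 features

# L1 regularization drives the coefficients of unhelpful features toward zero;
# SelectFromModel keeps only the features whose coefficients survive
estimator = LogisticRegression(penalty="l1", solver="liblinear")
selector = SelectFromModel(estimator)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)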

3 Module

sklearn.feature_selection

4 Filter methods

4.1 Low variance feature filtering

Low-variance filtering removes features whose variance is small. Recall the meaning of variance, and consider what the size of a feature's variance tells us:

  • Small variance: most samples take similar values for this feature
  • Large variance: the samples take many different values for this feature
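As a tiny illustration (with made-up arrays), compare the variance of a nearly constant feature with that of a widely varying one:

import numpy as np

nearly_constant = np.array([1.0, 1.0, 1.0, 1.0, 1.1])  # most samples share a value
widely_varying = np.array([1.0, 5.0, 9.0, 2.0, 7.0])   # values spread out

print(np.var(nearly_constant))  # small variance -> candidate for removal
print(np.var(widely_varying))   # large variance -> likely informative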
4.1.1 API
  • sklearn.feature_selection.VarianceThreshold(threshold = 0.0)
    • Removes all low-variance features
    • Variance.fit_transform(X)
      • X: data in numpy array format [n_samples, n_features]
      • Return value: features whose training-set variance is lower than threshold are removed. The default keeps all features with non-zero variance, i.e. only features that have the same value in every sample are deleted.
4.1.2 Data calculation

We will filter among the index features of some stocks. The data is in the file "factor_regression_data/factor_returns.csv"; the 'index', 'date' and 'return' columns are excluded (their types do not match, and they are not the required indicators).

The features under consideration:

pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense
index,pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense,date,return
0,000001.XSHE,5.9572,1.1818,85252550922.0,0.8008,14.9403,1211444855670.0,2.01,20701401000.0,10882540000.0,2012-01-31,0.027657228229937388
1,000002.XSHE,7.0289,1.588,84113358168.0,1.6463,7.8656,300252061695.0,0.326,29308369223.2,23783476901.2,2012-01-31,0.08235182370820669
2,000008.XSHE,-262.7461,7.0003,517045520.0,-0.5678,-0.5943,770517752.56,-0.006,11679829.03,12030080.04,2012-01-31,0.09978900335112327
3,000060.XSHE,16.476,3.7146,19680455995.0,5.6036,14.617,28009159184.6,0.35,9189386877.65,7935542726.05,2012-01-31,0.12159482758620697
4,000069.XSHE,12.5878,2.5616,41727214853.0,2.8729,10.9097,81247380359.0,0.271,8951453490.28,7091397989.13,2012-01-31,-0.0026808154146886697
  • Analysis

1. Initialize VarianceThreshold, specifying the variance threshold

2. Call fit_transform

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def variance_demo():
    """Remove low-variance features (feature selection).
    :return: None
    """
    data = pd.read_csv("factor_returns.csv")
    print(data)
    # 1. Instantiate a transformer
    transfer = VarianceThreshold(threshold=1)
    # 2. Call fit_transform on the numeric indicator columns
    data = transfer.fit_transform(data.iloc[:, 1:10])
    print("Result after removing low-variance features:\n", data)
    print("Shape:\n", data.shape)

    return None

Return results:

            index  pe_ratio  pb_ratio    market_cap  \
0     000001.XSHE    5.9572    1.1818  8.525255e+10   
1     000002.XSHE    7.0289    1.5880  8.411336e+10    
...           ...       ...       ...           ...   
2316  601958.XSHG   52.5408    2.4646  3.287910e+10   
2317  601989.XSHG   14.2203    1.4103  5.911086e+10   

      return_on_asset_net_profit  du_return_on_equity            ev  \
0                         0.8008              14.9403  1.211445e+12   
1                         1.6463               7.8656  3.002521e+11    
...                          ...                  ...           ...   
2316                      2.7444               2.9202  3.883803e+10   
2317                      2.0383               8.6179  2.020661e+11   

      earnings_per_share       revenue  total_expense        date    return  
0                 2.0100  2.070140e+10   1.088254e+10  2012-01-31  0.027657  
1                 0.3260  2.930837e+10   2.378348e+10  2012-01-31  0.082352  
2                -0.0060  1.167983e+07   1.203008e+07  2012-01-31  0.099789   
...                  ...           ...            ...         ...       ...  
2315              0.2200  1.789082e+10   1.749295e+10  2012-11-30  0.137134  
2316              0.1210  6.465392e+09   6.009007e+09  2012-11-30  0.149167  
2317              0.2470  4.509872e+10   4.132842e+10  2012-11-30  0.183629  

[2318 rows x 12 columns]
 Result after removing low-variance features:
 [[  5.95720000e+00   1.18180000e+00   8.52525509e+10 ...,   1.21144486e+12
    2.07014010e+10   1.08825400e+10]
 [  7.02890000e+00   1.58800000e+00   8.41133582e+10 ...,   3.00252062e+11
    2.93083692e+10   2.37834769e+10]
 [ -2.62746100e+02   7.00030000e+00   5.17045520e+08 ...,   7.70517753e+08
    1.16798290e+07   1.20300800e+07]
 ..., 
 [  3.95523000e+01   4.00520000e+00   1.70243430e+10 ...,   2.42081699e+10
    1.78908166e+10   1.74929478e+10]
 [  5.25408000e+01   2.46460000e+00   3.28790988e+10 ...,   3.88380258e+10
    6.46539204e+09   6.00900728e+09]
 [  1.42203000e+01   1.41030000e+00   5.91108572e+10 ...,   2.02066110e+11
    4.50987171e+10   4.13284212e+10]]
 Shape:
 (2318, 8)

4.2 Correlation coefficient

  • Pearson correlation coefficient (Pearson Correlation Coefficient)
    • A statistical indicator that reflects how closely two variables are related
4.2.2 Formula and calculation example (for understanding; no need to memorize)
  • The formula

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^{2} - \left(\sum x\right)^{2}}\sqrt{n\sum y^{2} - \left(\sum y\right)^{2}}}$$

  • For example, let's calculate the correlation between annual advertising spend and average monthly sales (the original table of paired observations was shown as a picture).

Substituting the paired values into the formula above gives the value of r.

So we finally conclude that there is a high positive correlation between advertising spend and average monthly sales; the sketch below reproduces this kind of calculation.
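To connect the formula with code, here is a minimal sketch that evaluates r both directly from the formula and with scipy; the advertising/sales figures are illustrative stand-ins for the original table:

import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired observations: advertising spend vs. monthly sales
x = np.array([12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9])
y = np.array([21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5])

n = len(x)
# Pearson's r computed directly from the formula above
r = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (
    np.sqrt(n * np.sum(x ** 2) - np.sum(x) ** 2)
    * np.sqrt(n * np.sum(y ** 2) - np.sum(y) ** 2))

print(r)                  # close to +1: highly positively correlated
print(pearsonr(x, y)[0])  # scipy agrees with the manual calculation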

4.2.3 Properties

The correlation coefficient r ranges between -1 and +1, i.e. -1 ≤ r ≤ +1. Its properties are as follows:

  • When r > 0 the two variables are positively correlated; when r < 0 they are negatively correlated
  • When |r| = 1 the two variables are perfectly linearly correlated; when r = 0 there is no linear correlation between them
  • When 0 < |r| < 1 there is some degree of correlation: the closer |r| is to 1, the stronger the linear relationship between the two variables; the closer |r| is to 0, the weaker it is
  • A common rule of thumb: |r| < 0.4 is low correlation; 0.4 ≤ |r| < 0.7 is significant correlation; 0.7 ≤ |r| < 1 is high linear correlation

Here |r| denotes the absolute value of r, e.g. |-5| = 5.

4.2.4 API
  • from scipy.stats import pearsonr
    • x: (N,) array_like
    • y: (N,) array_like
    • Returns: (Pearson's correlation coefficient, p-value)
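A minimal call (with toy arrays), showing the (correlation, p-value) return value:

from scipy.stats import pearsonr

r, p_value = pearsonr([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(r, p_value)  # r = 1.0 for this perfectly linear toy data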
4.2.5 Case study: correlation between stock financial indicators

Let's now compute the pairwise correlations between the stock indicators from before. Suppose we use

factor = ['pe_ratio','pb_ratio','market_cap','return_on_asset_net_profit','du_return_on_equity','ev','earnings_per_share','revenue','total_expense']

and compute the correlation for every pair of these features, looking for pairs with high correlation.


  • Analysis
    • Compute the correlation coefficient between every pair of features
import pandas as pd
from scipy.stats import pearsonr

def pearsonr_demo():
    """Compute the pairwise correlation coefficients.
    :return: None
    """
    data = pd.read_csv("factor_returns.csv")

    factor = ['pe_ratio', 'pb_ratio', 'market_cap', 'return_on_asset_net_profit', 'du_return_on_equity', 'ev',
              'earnings_per_share', 'revenue', 'total_expense']

    # Iterate over every unordered pair of indicators
    for i in range(len(factor)):
        for j in range(i, len(factor) - 1):
            print("The correlation between %s and %s is %f"
                  % (factor[i], factor[j + 1], pearsonr(data[factor[i]], data[factor[j + 1]])[0]))

    return None

Return results:

The correlation between pe_ratio and pb_ratio is -0.004389
The correlation between pe_ratio and market_cap is -0.068861
The correlation between pe_ratio and return_on_asset_net_profit is -0.066009
The correlation between pe_ratio and du_return_on_equity is -0.082364
The correlation between pe_ratio and ev is -0.046159
The correlation between pe_ratio and earnings_per_share is -0.072082
The correlation between pe_ratio and revenue is -0.058693
The correlation between pe_ratio and total_expense is -0.055551
The correlation between pb_ratio and market_cap is 0.009336
The correlation between pb_ratio and return_on_asset_net_profit is 0.445381
The correlation between pb_ratio and du_return_on_equity is 0.291367
The correlation between pb_ratio and ev is -0.183232
The correlation between pb_ratio and earnings_per_share is 0.198708
The correlation between pb_ratio and revenue is -0.177671
The correlation between pb_ratio and total_expense is -0.173339
The correlation between market_cap and return_on_asset_net_profit is 0.214774
The correlation between market_cap and du_return_on_equity is 0.316288
The correlation between market_cap and ev is 0.565533
The correlation between market_cap and earnings_per_share is 0.524179
The correlation between market_cap and revenue is 0.440653
The correlation between market_cap and total_expense is 0.386550
The correlation between return_on_asset_net_profit and du_return_on_equity is 0.818697
The correlation between return_on_asset_net_profit and ev is -0.101225
The correlation between return_on_asset_net_profit and earnings_per_share is 0.635933
The correlation between return_on_asset_net_profit and revenue is 0.038582
The correlation between return_on_asset_net_profit and total_expense is 0.027014
The correlation between du_return_on_equity and ev is 0.118807
The correlation between du_return_on_equity and earnings_per_share is 0.651996
The correlation between du_return_on_equity and revenue is 0.163214
The correlation between du_return_on_equity and total_expense is 0.135412
The correlation between ev and earnings_per_share is 0.196033
The correlation between ev and revenue is 0.224363
The correlation between ev and total_expense is 0.149857
The correlation between earnings_per_share and revenue is 0.141473
The correlation between earnings_per_share and total_expense is 0.105022
The correlation between revenue and total_expense is 0.995845

From this we can see, in particular:

  • The correlation between revenue and total_expense is 0.995845
  • The correlation between return_on_asset_net_profit and du_return_on_equity is 0.818697

We can also inspect the result with a scatter plot (data is the DataFrame loaded above):

import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8), dpi=100)
plt.scatter(data['revenue'], data['total_expense'])
plt.show()


These two indicators are strongly correlated, so some follow-up processing can be applied, for example combining the two into a single indicator; one simple alternative is sketched below.
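For instance (a sketch, assuming data is the DataFrame loaded earlier), one option is to keep only one feature from each highly correlated pair; combining the pair with PCA, introduced next, is the alternative:

# Drop one feature from each highly correlated pair found above
data_reduced = data.drop(columns=["total_expense", "du_return_on_equity"])
print(data_reduced.columns)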

3 What is principal component analysis (PCA)

  • Definition: the process of transforming high-dimensional data into low-dimensional data; along the way, parts of the original data may be discarded and new variables created
  • Purpose: data dimensionality compression, reducing the dimensionality (complexity) of the original data as much as possible while losing only a little information
  • Applications: regression analysis, cluster analysis

The word "information" will be introduced more precisely in the decision tree section.

How can we get a better feel for this process? Let's look at a small worked example.

1 Worked example (for understanding; no need to memorize)

Suppose we are given the following 5 points:

(-1,-2)
(-1, 0)
( 0, 0)
( 2, 1)
( 0, 1)

Requirement: reduce this two-dimensional data to one dimension, losing as little information as possible.

How is this computed? PCA finds a suitable line (direction) and obtains the result by a matrix operation that projects the points onto it (the details are not required here).
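The projection can be reproduced with sklearn (a sketch; note that the sign of a principal component is arbitrary, so the 1-D coordinates may come out negated):

import numpy as np
from sklearn.decomposition import PCA

points = np.array([[-1, -2], [-1, 0], [0, 0], [2, 1], [0, 1]])

pca = PCA(n_components=1)
projected = pca.fit_transform(points)  # 2-D points -> 1-D coordinates

print(projected.ravel())
print(pca.explained_variance_ratio_)  # share of the variance kept in 1-D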

2 API

  • sklearn.decomposition.PCA(n_components=None)
    • Projects the data into a lower-dimensional space
    • n_components:
      • float: the fraction of information (explained variance) to retain
      • int: the number of dimensions to reduce to
    • PCA.fit_transform(X) X: data in numpy array format [n_samples, n_features]
    • Return value: the transformed array with the specified number of dimensions

3 Data calculation

Let's start with a small piece of data:

[[2,8,4,5],
[6,3,0,8],
[5,4,9,1]]

from sklearn.decomposition import PCA

def pca_demo():
    """Apply PCA dimensionality reduction to the data.
    :return: None
    """
    data = [[2,8,4,5], [6,3,0,8], [5,4,9,1]]

    # 1. Instantiate PCA with a float: the fraction of information to keep
    transfer = PCA(n_components=0.9)
    # 2. Call fit_transform
    data1 = transfer.fit_transform(data)

    print("Result of keeping 90% of the information:\n", data1)

    # 1. Instantiate PCA with an integer: the target number of dimensions
    transfer2 = PCA(n_components=3)
    # 2. Call fit_transform
    data2 = transfer2.fit_transform(data)
    print("Result of reducing to 3 dimensions:\n", data2)

    return None

Return results:

 Result of keeping 90% of the information:
 [[ -3.13587302e-16   3.82970843e+00]
 [ -5.74456265e+00  -1.91485422e+00]
 [  5.74456265e+00  -1.91485422e+00]]
 Result of reducing to 3 dimensions (with only 3 samples, the centered data spans at most 2 dimensions, so the third column is numerically zero):
 [[ -3.13587302e-16   3.82970843e+00   4.59544715e-16]
 [ -5.74456265e+00  -1.91485422e+00   4.59544715e-16]
 [  5.74456265e+00  -1.91485422e+00   4.59544715e-16]]

4 Case study: exploring users' preferences for item categories (segmentation + dimensionality reduction)


The data files are as follows:

  • order_products__prior.csv: Order and product information
    • Field :order_id, product_id, add_to_cart_order, reordered
  • products.csv: Commodity information
    • Field :product_id, product_name, aisle_id, department_id
  • orders.csv: User's order information
    • Field :order_id,user_id,eval_set,order_number,….
  • aisles.csv: The specific item category to which the commodity belongs
    • Field : aisle_id, aisle

1 Requirement

Build a table that cross-tabulates users (user_id) against item categories (aisle), then reduce its dimensionality (the original requirement was illustrated with a picture).

2 Analysis

  • Merge the tables so that user_id and aisle are in one table

  • Apply a crosstab transformation

  • Apply dimensionality reduction

3 Complete code

import pandas as pd
from sklearn.decomposition import PCA

# 1. Load the datasets
# - Commodity information - products.csv:
#   Fields: product_id, product_name, aisle_id, department_id
# - Order and product information - order_products__prior.csv:
#   Fields: order_id, product_id, add_to_cart_order, reordered
# - User's order information - orders.csv:
#   Fields: order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day, days_since_prior_order
# - The item category each product belongs to - aisles.csv:
#   Fields: aisle_id, aisle
products = pd.read_csv("./instacart/products.csv")
order_products = pd.read_csv("./instacart/order_products__prior.csv")
orders = pd.read_csv("./instacart/orders.csv")
aisles = pd.read_csv("./instacart/aisles.csv")

# 2. Merge the tables so that user_id and aisle end up in one table
# 1) Merge orders and order_products on order_id -> tab1: order_id, product_id, user_id
tab1 = pd.merge(orders, order_products, on="order_id")
# 2) Merge tab1 and products on product_id -> tab2: adds aisle_id
tab2 = pd.merge(tab1, products, on="product_id")
# 3) Merge tab2 and aisles on aisle_id -> tab3: user_id, aisle
tab3 = pd.merge(tab2, aisles, on="aisle_id")

# 3. Crosstab: count purchases per (user_id, aisle) pair
table = pd.crosstab(tab3["user_id"], tab3["aisle"])

# 4. PCA dimensionality reduction
# 1) Instantiate the transformer, keeping 95% of the information
transfer = PCA(n_components=0.95)
# 2) fit_transform
data = transfer.fit_transform(table)

print(data.shape)
    

Return results:

(206209, 44)
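To check how much information the 44 components actually retain, one can inspect the fitted transformer's explained_variance_ratio_ (continuing from the code above):

# The ratios sum to at least 0.95, since n_components=0.95 was requested
print(transfer.explained_variance_ratio_.sum())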
    
    
Copyright notice: this article was written by [Big data Da Wenxi]; if you repost it, please include a link to the original: https://yzsam.com/2022/200/202207170502377185.html