
PCA Feature Dimensionality Reduction in Machine Learning + Case Practice

2022-07-19 05:05:00 Big data Da Wenxi


1 Dimension reduction

Dimensionality reduction is the process of reducing the number of random variables (features) under certain constraints, in order to obtain a set of "uncorrelated" principal variables.

  • Reduce the number of random variables


    • Correlated features (correlated feature)
      • e.g. the correlation between relative humidity and rainfall
      • etc.

    This matters because during training we learn from the features. If a feature itself is problematic, or features are strongly correlated with each other, this has a large impact on how the algorithm learns and predicts.

2 Two ways of dimensionality reduction

  • Feature selection
  • Principal component analysis (which can be understood as a form of feature extraction)

2 What is feature selection

1 Definition

Data often contains redundant or irrelevant variables (also called features, attributes, indicators, etc.). Feature selection aims to find the principal features among the original features.


2 Methods

  • Filter: mainly examines the characteristics of the features themselves, the correlation between features, and the correlation between features and the target value
    • Variance selection: low-variance feature filtering
    • Correlation coefficient
  • Embedded: the algorithm selects features automatically (based on the relationship between features and the target value)
    • Decision trees: information entropy, information gain
    • Regularization: L1, L2
    • Deep learning: convolution, etc.

The embedded methods are best explained together with the corresponding algorithms, where they are easier to understand; the sketch below gives a quick preview.
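For a quick flavor (a minimal sketch only; the load_iris dataset and the model choice are illustrative assumptions, not part of this tutorial's material), embedded selection with L1 regularization can be done via sklearn's SelectFromModel:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # illustrative dataset with 4 features

# L1 regularization drives the coefficients of unhelpful features toward zero;
# SelectFromModel keeps only the features whose coefficients survive
estimator = LogisticRegression(penalty="l1", solver="liblinear")
selector = SelectFromModel(estimator)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)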

3 Module

sklearn.feature_selection

4 Filter methods

4.1 Low variance feature filtering

Low-variance filtering removes features whose variance is small. Recall the meaning of variance, and consider what the size of a feature's variance tells us:

  • Small variance: most samples take similar values for this feature
  • Large variance: the samples take many different values for this feature
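As a tiny illustration (with made-up arrays), compare the variance of a nearly constant feature with that of a widely varying one:

import numpy as np

nearly_constant = np.array([1.0, 1.0, 1.0, 1.0, 1.1])  # most samples share a value
widely_varying = np.array([1.0, 5.0, 9.0, 2.0, 7.0])   # values spread out

print(np.var(nearly_constant))  # small variance -> candidate for removal
print(np.var(widely_varying))   # large variance -> likely informative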
4.1.1 API
  • sklearn.feature_selection.VarianceThreshold(threshold = 0.0)
    • Removes all low-variance features
    • Variance.fit_transform(X)
      • X: data in numpy array format [n_samples, n_features]
      • Return value: features whose training-set variance is lower than threshold are removed. The default keeps all features with non-zero variance, i.e. only features that have the same value in every sample are deleted.
4.1.2 Data calculation

We will filter among the index features of some stocks. The data is in the file "factor_regression_data/factor_returns.csv"; the 'index', 'date' and 'return' columns are excluded (their types do not match, and they are not the required indicators).

The features under consideration:

pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense
index,pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense,date,return
0,000001.XSHE,5.9572,1.1818,85252550922.0,0.8008,14.9403,1211444855670.0,2.01,20701401000.0,10882540000.0,2012-01-31,0.027657228229937388
1,000002.XSHE,7.0289,1.588,84113358168.0,1.6463,7.8656,300252061695.0,0.326,29308369223.2,23783476901.2,2012-01-31,0.08235182370820669
2,000008.XSHE,-262.7461,7.0003,517045520.0,-0.5678,-0.5943,770517752.56,-0.006,11679829.03,12030080.04,2012-01-31,0.09978900335112327
3,000060.XSHE,16.476,3.7146,19680455995.0,5.6036,14.617,28009159184.6,0.35,9189386877.65,7935542726.05,2012-01-31,0.12159482758620697
4,000069.XSHE,12.5878,2.5616,41727214853.0,2.8729,10.9097,81247380359.0,0.271,8951453490.28,7091397989.13,2012-01-31,-0.0026808154146886697
  • Analysis

1. Initialize VarianceThreshold, specifying the variance threshold

2. Call fit_transform

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def variance_demo():
    """Remove low-variance features (feature selection).
    :return: None
    """
    data = pd.read_csv("factor_returns.csv")
    print(data)
    # 1. Instantiate a transformer
    transfer = VarianceThreshold(threshold=1)
    # 2. Call fit_transform on the numeric indicator columns
    data = transfer.fit_transform(data.iloc[:, 1:10])
    print("Result after removing low-variance features:\n", data)
    print("Shape:\n", data.shape)

    return None

Return results:

            index  pe_ratio  pb_ratio    market_cap  \
0     000001.XSHE    5.9572    1.1818  8.525255e+10   
1     000002.XSHE    7.0289    1.5880  8.411336e+10    
...           ...       ...       ...           ...   
2316  601958.XSHG   52.5408    2.4646  3.287910e+10   
2317  601989.XSHG   14.2203    1.4103  5.911086e+10   

      return_on_asset_net_profit  du_return_on_equity            ev  \
0                         0.8008              14.9403  1.211445e+12   
1                         1.6463               7.8656  3.002521e+11    
...                          ...                  ...           ...   
2316                      2.7444               2.9202  3.883803e+10   
2317                      2.0383               8.6179  2.020661e+11   

      earnings_per_share       revenue  total_expense        date    return  
0                 2.0100  2.070140e+10   1.088254e+10  2012-01-31  0.027657  
1                 0.3260  2.930837e+10   2.378348e+10  2012-01-31  0.082352  
2                -0.0060  1.167983e+07   1.203008e+07  2012-01-31  0.099789   
...                  ...           ...            ...         ...       ...  
2315              0.2200  1.789082e+10   1.749295e+10  2012-11-30  0.137134  
2316              0.1210  6.465392e+09   6.009007e+09  2012-11-30  0.149167  
2317              0.2470  4.509872e+10   4.132842e+10  2012-11-30  0.183629  

[2318 rows x 12 columns]
 Result after removing low-variance features:
 [[  5.95720000e+00   1.18180000e+00   8.52525509e+10 ...,   1.21144486e+12
    2.07014010e+10   1.08825400e+10]
 [  7.02890000e+00   1.58800000e+00   8.41133582e+10 ...,   3.00252062e+11
    2.93083692e+10   2.37834769e+10]
 [ -2.62746100e+02   7.00030000e+00   5.17045520e+08 ...,   7.70517753e+08
    1.16798290e+07   1.20300800e+07]
 ..., 
 [  3.95523000e+01   4.00520000e+00   1.70243430e+10 ...,   2.42081699e+10
    1.78908166e+10   1.74929478e+10]
 [  5.25408000e+01   2.46460000e+00   3.28790988e+10 ...,   3.88380258e+10
    6.46539204e+09   6.00900728e+09]
 [  1.42203000e+01   1.41030000e+00   5.91108572e+10 ...,   2.02066110e+11
    4.50987171e+10   4.13284212e+10]]
 Shape:
 (2318, 8)

4.2 Correlation coefficient

  • Pearson correlation coefficient (Pearson Correlation Coefficient)
    • A statistical indicator that reflects how closely two variables are related
4.2.2 Formula and calculation example (for understanding; no need to memorize)
  • The formula

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^{2} - \left(\sum x\right)^{2}}\sqrt{n\sum y^{2} - \left(\sum y\right)^{2}}}$$

  • For example, let's calculate the correlation between annual advertising spend and average monthly sales (the original table of paired observations was shown as a picture).

Substituting the paired values into the formula above gives the value of r.

So we finally conclude that there is a high positive correlation between advertising spend and average monthly sales; the sketch below reproduces this kind of calculation.
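To connect the formula with code, here is a minimal sketch that evaluates r both directly from the formula and with scipy; the advertising/sales figures are illustrative stand-ins for the original table:

import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired observations: advertising spend vs. monthly sales
x = np.array([12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9])
y = np.array([21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5])

n = len(x)
# Pearson's r computed directly from the formula above
r = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (
    np.sqrt(n * np.sum(x ** 2) - np.sum(x) ** 2)
    * np.sqrt(n * np.sum(y ** 2) - np.sum(y) ** 2))

print(r)                  # close to +1: highly positively correlated
print(pearsonr(x, y)[0])  # scipy agrees with the manual calculation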

4.2.3 Properties

The correlation coefficient r ranges between -1 and +1, i.e. -1 ≤ r ≤ +1. Its properties are as follows:

  • When r > 0 the two variables are positively correlated; when r < 0 they are negatively correlated
  • When |r| = 1 the two variables are perfectly linearly correlated; when r = 0 there is no linear correlation between them
  • When 0 < |r| < 1 there is some degree of correlation: the closer |r| is to 1, the stronger the linear relationship between the two variables; the closer |r| is to 0, the weaker it is
  • A common rule of thumb: |r| < 0.4 is low correlation; 0.4 ≤ |r| < 0.7 is significant correlation; 0.7 ≤ |r| < 1 is high linear correlation

Here |r| denotes the absolute value of r, e.g. |-5| = 5.

4.2.4 API
  • from scipy.stats import pearsonr
    • x: (N,) array_like
    • y: (N,) array_like
    • Returns: (Pearson's correlation coefficient, p-value)
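A minimal call (with toy arrays), showing the (correlation, p-value) return value:

from scipy.stats import pearsonr

r, p_value = pearsonr([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(r, p_value)  # r = 1.0 for this perfectly linear toy data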
4.2.5 Case study: correlation between stock financial indicators

Let's now compute the pairwise correlations between the stock indicators from before. Suppose we use

factor = ['pe_ratio','pb_ratio','market_cap','return_on_asset_net_profit','du_return_on_equity','ev','earnings_per_share','revenue','total_expense']

and compute the correlation for every pair of these features, looking for pairs with high correlation.


  • Analysis
    • Compute the correlation coefficient between every pair of features
import pandas as pd
from scipy.stats import pearsonr

def pearsonr_demo():
    """Compute the pairwise correlation coefficients.
    :return: None
    """
    data = pd.read_csv("factor_returns.csv")

    factor = ['pe_ratio', 'pb_ratio', 'market_cap', 'return_on_asset_net_profit', 'du_return_on_equity', 'ev',
              'earnings_per_share', 'revenue', 'total_expense']

    # Iterate over every unordered pair of indicators
    for i in range(len(factor)):
        for j in range(i, len(factor) - 1):
            print("The correlation between %s and %s is %f"
                  % (factor[i], factor[j + 1], pearsonr(data[factor[i]], data[factor[j + 1]])[0]))

    return None

Return results:

The correlation between pe_ratio and pb_ratio is -0.004389
The correlation between pe_ratio and market_cap is -0.068861
The correlation between pe_ratio and return_on_asset_net_profit is -0.066009
The correlation between pe_ratio and du_return_on_equity is -0.082364
The correlation between pe_ratio and ev is -0.046159
The correlation between pe_ratio and earnings_per_share is -0.072082
The correlation between pe_ratio and revenue is -0.058693
The correlation between pe_ratio and total_expense is -0.055551
The correlation between pb_ratio and market_cap is 0.009336
The correlation between pb_ratio and return_on_asset_net_profit is 0.445381
The correlation between pb_ratio and du_return_on_equity is 0.291367
The correlation between pb_ratio and ev is -0.183232
The correlation between pb_ratio and earnings_per_share is 0.198708
The correlation between pb_ratio and revenue is -0.177671
The correlation between pb_ratio and total_expense is -0.173339
The correlation between market_cap and return_on_asset_net_profit is 0.214774
The correlation between market_cap and du_return_on_equity is 0.316288
The correlation between market_cap and ev is 0.565533
The correlation between market_cap and earnings_per_share is 0.524179
The correlation between market_cap and revenue is 0.440653
The correlation between market_cap and total_expense is 0.386550
The correlation between return_on_asset_net_profit and du_return_on_equity is 0.818697
The correlation between return_on_asset_net_profit and ev is -0.101225
The correlation between return_on_asset_net_profit and earnings_per_share is 0.635933
The correlation between return_on_asset_net_profit and revenue is 0.038582
The correlation between return_on_asset_net_profit and total_expense is 0.027014
The correlation between du_return_on_equity and ev is 0.118807
The correlation between du_return_on_equity and earnings_per_share is 0.651996
The correlation between du_return_on_equity and revenue is 0.163214
The correlation between du_return_on_equity and total_expense is 0.135412
The correlation between ev and earnings_per_share is 0.196033
The correlation between ev and revenue is 0.224363
The correlation between ev and total_expense is 0.149857
The correlation between earnings_per_share and revenue is 0.141473
The correlation between earnings_per_share and total_expense is 0.105022
The correlation between revenue and total_expense is 0.995845

From this we can see, in particular:

  • The correlation between revenue and total_expense is 0.995845
  • The correlation between return_on_asset_net_profit and du_return_on_equity is 0.818697

We can also inspect the result with a scatter plot (data is the DataFrame loaded above):

import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8), dpi=100)
plt.scatter(data['revenue'], data['total_expense'])
plt.show()


These two indicators are strongly correlated, so some follow-up processing can be applied, for example combining the two into a single indicator; one simple alternative is sketched below.
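For instance (a sketch, assuming data is the DataFrame loaded earlier), one option is to keep only one feature from each highly correlated pair; combining the pair with PCA, introduced next, is the alternative:

# Drop one feature from each highly correlated pair found above
data_reduced = data.drop(columns=["total_expense", "du_return_on_equity"])
print(data_reduced.columns)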

3 What is principal component analysis (PCA)

  • Definition: the process of transforming high-dimensional data into low-dimensional data; along the way, parts of the original data may be discarded and new variables created
  • Purpose: data dimensionality compression, reducing the dimensionality (complexity) of the original data as much as possible while losing only a little information
  • Applications: regression analysis, cluster analysis

The word "information" will be introduced more precisely in the decision tree section.

How can we get a better feel for this process? Let's look at a small worked example.

1 Worked example (for understanding; no need to memorize)

Suppose we are given the following 5 points:

(-1,-2)
(-1, 0)
( 0, 0)
( 2, 1)
( 0, 1)

Requirement: reduce this two-dimensional data to one dimension, losing as little information as possible.

How is this computed? PCA finds a suitable line (direction) and obtains the result by a matrix operation that projects the points onto it (the details are not required here).
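The projection can be reproduced with sklearn (a sketch; note that the sign of a principal component is arbitrary, so the 1-D coordinates may come out negated):

import numpy as np
from sklearn.decomposition import PCA

points = np.array([[-1, -2], [-1, 0], [0, 0], [2, 1], [0, 1]])

pca = PCA(n_components=1)
projected = pca.fit_transform(points)  # 2-D points -> 1-D coordinates

print(projected.ravel())
print(pca.explained_variance_ratio_)  # share of the variance kept in 1-D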

2 API

  • sklearn.decomposition.PCA(n_components=None)
    • Projects the data into a lower-dimensional space
    • n_components:
      • float: the fraction of information (explained variance) to retain
      • int: the number of dimensions to reduce to
    • PCA.fit_transform(X) X: data in numpy array format [n_samples, n_features]
    • Return value: the transformed array with the specified number of dimensions

3 Data calculation

Let's start with a small piece of data:

[[2,8,4,5],
[6,3,0,8],
[5,4,9,1]]

from sklearn.decomposition import PCA

def pca_demo():
    """Apply PCA dimensionality reduction to the data.
    :return: None
    """
    data = [[2,8,4,5], [6,3,0,8], [5,4,9,1]]

    # 1. Instantiate PCA with a float: the fraction of information to keep
    transfer = PCA(n_components=0.9)
    # 2. Call fit_transform
    data1 = transfer.fit_transform(data)

    print("Result of keeping 90% of the information:\n", data1)

    # 1. Instantiate PCA with an integer: the target number of dimensions
    transfer2 = PCA(n_components=3)
    # 2. Call fit_transform
    data2 = transfer2.fit_transform(data)
    print("Result of reducing to 3 dimensions:\n", data2)

    return None

Return results:

 Result of keeping 90% of the information:
 [[ -3.13587302e-16   3.82970843e+00]
 [ -5.74456265e+00  -1.91485422e+00]
 [  5.74456265e+00  -1.91485422e+00]]
 Result of reducing to 3 dimensions (with only 3 samples, the centered data spans at most 2 dimensions, so the third column is numerically zero):
 [[ -3.13587302e-16   3.82970843e+00   4.59544715e-16]
 [ -5.74456265e+00  -1.91485422e+00   4.59544715e-16]
 [  5.74456265e+00  -1.91485422e+00   4.59544715e-16]]

4 Case study: exploring users' preferences for item categories (segmentation + dimensionality reduction)


The data files are as follows:

  • order_products__prior.csv: Order and product information
    • Field :order_id, product_id, add_to_cart_order, reordered
  • products.csv: Commodity information
    • Field :product_id, product_name, aisle_id, department_id
  • orders.csv: User's order information
    • Field :order_id,user_id,eval_set,order_number,….
  • aisles.csv: The specific item category to which the commodity belongs
    • Field : aisle_id, aisle

1 Requirement

Build a table that cross-tabulates users (user_id) against item categories (aisle), then reduce its dimensionality (the original requirement was illustrated with a picture).

2 Analysis

  • Merge the tables so that user_id and aisle are in one table

  • Apply a crosstab transformation

  • Apply dimensionality reduction

3 Complete code

import pandas as pd
from sklearn.decomposition import PCA

# 1. Load the datasets
# - Commodity information - products.csv:
#   Fields: product_id, product_name, aisle_id, department_id
# - Order and product information - order_products__prior.csv:
#   Fields: order_id, product_id, add_to_cart_order, reordered
# - User's order information - orders.csv:
#   Fields: order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day, days_since_prior_order
# - The item category each product belongs to - aisles.csv:
#   Fields: aisle_id, aisle
products = pd.read_csv("./instacart/products.csv")
order_products = pd.read_csv("./instacart/order_products__prior.csv")
orders = pd.read_csv("./instacart/orders.csv")
aisles = pd.read_csv("./instacart/aisles.csv")

# 2. Merge the tables so that user_id and aisle end up in one table
# 1) Merge orders and order_products on order_id -> tab1: order_id, product_id, user_id
tab1 = pd.merge(orders, order_products, on="order_id")
# 2) Merge tab1 and products on product_id -> tab2: adds aisle_id
tab2 = pd.merge(tab1, products, on="product_id")
# 3) Merge tab2 and aisles on aisle_id -> tab3: user_id, aisle
tab3 = pd.merge(tab2, aisles, on="aisle_id")

# 3. Crosstab: count purchases per (user_id, aisle) pair
table = pd.crosstab(tab3["user_id"], tab3["aisle"])

# 4. PCA dimensionality reduction
# 1) Instantiate the transformer, keeping 95% of the information
transfer = PCA(n_components=0.95)
# 2) fit_transform
data = transfer.fit_transform(table)

print(data.shape)
    

Return results:

(206209, 44)
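To check how much information the 44 components actually retain, one can inspect the fitted transformer's explained_variance_ratio_ (continuing from the code above):

# The ratios sum to at least 0.95, since n_components=0.95 was requested
print(transfer.explained_variance_ratio_.sum())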
    
    
Copyright notice: this article was written by [Big data Da Wenxi]; if you repost it, please include a link to the original: https://yzsam.com/2022/200/202207170502377185.html