
[Notice] Source: https://www.kaggle.com/code/jundthird/kor-manual-feature-engineering-pt2

Introduction: Manual Feature Engineering (part two)

In this notebook, we will go one step beyond Introduction to Manual Feature Engineering. Here we will use the aggregation and value-counting functions built in that notebook to incorporate the information in the previous_application, POS_CASH_balance, installments_payments, and credit_card_balance data files. We already used the information from bureau and bureau_balance in the previous notebook, which improved the competition score compared to using the application data alone. Running a model with the features included here improves performance further, but we run into the problem of an exploding number of features. A feature selection notebook is in the works; for this notebook, we will keep building out a large dataset for our model.


The definitions of the four additional data files are:

  • previous_application (called previous): previous applications for loans at Home Credit by clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.


  • POS_CASH_balance (called cash): monthly data about previous point-of-sale or cash loans clients have had with Home Credit. Each row is one month of a previous point-of-sale or cash loan, and a single previous loan can have many rows.


  • credit_card_balance (called credit): monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.


  • installments_payments (called installments): payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.

Functions

In the previous notebook we built two functions:

  • agg_numeric: calculates the count, mean, max, min, and sum of numeric variables.
  • agg_categorical: calculates the counts and normalized counts of each category in categorical variables.

Together, these two functions can extract information about the numeric and categorical data in a dataframe. The general approach is to apply both functions to a dataframe grouped by the client id, SK_ID_CURR. For POS_CASH_balance, credit_card_balance, and installments_payments, we can first group by SK_ID_PREV, the unique id of a previous loan. We then group the resulting dataframe by SK_ID_CURR to calculate each client's statistics across all of their previous loans, as sketched below on toy data.
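A minimal sketch of this two-level aggregation, using toy data and a hypothetical loan_dpd_mean stat (the values and names are purely illustrative, not from the original notebook):

import pandas as pd

toy = pd.DataFrame({'SK_ID_CURR': [1, 1, 1, 2],
                    'SK_ID_PREV': [10, 10, 11, 20],
                    'SK_DPD':     [0, 3, 1, 2]})

# Level 1: aggregate the monthly rows to one row per previous loan
by_loan = toy.groupby('SK_ID_PREV', as_index=False).agg(
    SK_ID_CURR=('SK_ID_CURR', 'first'), loan_dpd_mean=('SK_DPD', 'mean'))

# Level 2: aggregate the per-loan stats to one row per client
by_client = by_loan.groupby('SK_ID_CURR')['loan_dpd_mean'].agg(['mean', 'max'])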

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
plt.style.use("fivethirtyeight")
import gc

Function to Aggregate Numeric Data

This function groups the data by parent_var and calculates the count, mean, max, min, and sum of the numeric variables.

def agg_numeric(df, parent_var, df_name):
    
    # Remove id variables other than the grouping variable
    for col in df:
        if col != parent_var and "SK_ID" in col:
            df = df.drop(columns=col)
            
    parent_ids = df[parent_var].copy()
    numeric_df = df.select_dtypes("number").copy()
    numeric_df[parent_var] = parent_ids
    
    agg = numeric_df.groupby(parent_var).agg(['count', 'mean', 'max', 'min', 'sum'])
    
    columns = []
    
    for var in agg.columns.levels[0]:
        if var != parent_var:
            for stat in agg.columns.levels[1]:
                columns.append("%s_%s_%s" % (df_name, var, stat))
                
    agg.columns = columns
    
    # Remove duplicated columns, keeping the first occurrence of each
    _, idx = np.unique(agg, axis=1, return_index=True)
    agg = agg.iloc[:, idx]
    
    return agg
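As a quick sanity check, here is what agg_numeric produces on a toy dataframe (hypothetical values):

toy = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                    'AMT_CREDIT': [1000.0, 2000.0, 1500.0]})
toy_agg = agg_numeric(toy, 'SK_ID_CURR', 'toy')
# one row per client, with columns such as toy_AMT_CREDIT_count,
# toy_AMT_CREDIT_mean, toy_AMT_CREDIT_max, toy_AMT_CREDIT_min, toy_AMT_CREDIT_sum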

Function to Calculate Categorical Counts

This function calculates the number of occurrences (counts) of each category in every categorical variable for each client. It also calculates the normed count: the count of a category divided by the total count across all categories of that variable.

def agg_categorical(df, parent_var, df_name):
    # One-hot encode the categorical columns, then aggregate per group
    categorical = pd.get_dummies(df.select_dtypes("category"))
    categorical[parent_var] = df[parent_var]
    categorical = categorical.groupby(parent_var).agg(["sum", "count", "mean"])
    
    column_names = []
    
    for var in categorical.columns.levels[0]:
        for stat in ["sum", "count", "mean"]:
            column_names.append("%s_%s_%s" % (df_name, var, stat))
            
    categorical.columns = column_names
    
    # Remove duplicated columns, keeping the first occurrence of each
    _, idx = np.unique(categorical, axis=1, return_index=True)
    categorical = categorical.iloc[:, idx]
    
    return categorical
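And the categorical counterpart on toy data (hypothetical values; the column must already have the category dtype, which convert_types below handles for the real data):

toy = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                    'NAME_CONTRACT_TYPE': pd.Categorical(['Cash', 'Consumer', 'Cash'])})
toy_counts = agg_categorical(toy, 'SK_ID_CURR', 'toy')
# toy_NAME_CONTRACT_TYPE_Cash_sum is the number of 'Cash' loans per client;
# toy_NAME_CONTRACT_TYPE_Cash_mean is that count normalized by the client's total rows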

Function for KDE Plots of Variable

We made a function that plots the distribution of a variable colored by the TARGET value (1 means the loan was not repaid, 0 means it was repaid). We can use it to visually inspect any new variable we create. It also calculates the correlation coefficient between the variable and the target, which can serve as a rough approximation of whether a created variable will be useful.

def kde_target(var_name, df):
    corr = df["TARGET"].corr(df[var_name])
    avg_repaid = df.loc[df["TARGET"] == 0, var_name].median()
    avg_not_repaid = df.loc[df["TARGET"] == 1, var_name].median()
    plt.figure(figsize=(12, 6))
    
    sns.kdeplot(df.loc[df["TARGET"] == 0, var_name], label="TARGET == 0")
    sns.kdeplot(df.loc[df["TARGET"] == 1, var_name], label="TARGET == 1")
    
    plt.xlabel(var_name); plt.ylabel('Density'); plt.title('%s Distribution' % var_name)
    plt.legend(loc="best")
    
    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid = %0.4f' % avg_repaid)

Function to Convert Data Types

Using more efficient types for the variables helps reduce memory usage. For example, category is usually a better type than object (as long as the number of unique values is not close to the number of rows in the dataframe). The snippet below gives a rough illustration.
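A rough illustration on toy data (the exact byte counts depend on the pandas version):

s = pd.Series(['Cash loans', 'Consumer loans'] * 100000)
print(s.memory_usage(deep=True))                     # object dtype: every string stored
print(s.astype('category').memory_usage(deep=True))  # unique strings once + integer codes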

import sys

def return_size(df):
    """Return size of dataframe in gigabytes"""
    return round(sys.getsizeof(df) / 1e9, 2)

def convert_types(df, print_info=False):
    original_memory = df.memory_usage().sum()
    
    for c in df:
        if ("SK_ID" in c):
            df[c] = df[c].fillna(0).astype(np.int32)
        
        elif (df[c].dtype == "object") and (df[c].nunique() < df.shape[0]):
            df[c] = df[c].astype("category")
            
        elif set(df[c].unique()) == set([1, 0]):
            df[c] = df[c].astype(bool)
        
        elif df[c].dtype == float:
            df[c] = df[c].astype(np.float32)
            
        # Int64 to int32
        elif df[c].dtype == int:
            df[c] = df[c].astype(np.int32)
            
    new_memory = df.memory_usage().sum()
    
    if print_info:
        print(f'Original Memory Usage: {round(original_memory / 1e9, 2)} gb.')
        print(f'New Memory Usage: {round(new_memory / 1e9, 2)} gb.')
    
    return df

Let's work through one dataframe at a time. First up is previous_application, which has one row for every previous loan a client had at Home Credit. A client can have multiple previous loans, so we need to compute statistics for each client.

previous_application

previous = pd.read_csv('./home_credit/previous_application.csv')
previous = convert_types(previous, print_info=True)
previous.head()
Original Memory Usage: 0.49 gb.
New Memory Usage: 0.18 gb.
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START ... NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430054 17145.0 17145.0 0.0 17145.0 SATURDAY 15 ... Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615234 607500.0 679671.0 NaN 607500.0 THURSDAY 11 ... XNA 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735352 112500.0 136444.5 NaN 112500.0 TUESDAY 11 ... XNA 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335938 450000.0 470790.0 NaN 450000.0 MONDAY 7 ... XNA 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.394531 337500.0 404055.0 NaN 337500.0 THURSDAY 9 ... XNA 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN

5 rows × 37 columns

previous_agg = agg_numeric(previous, 'SK_ID_CURR', 'previous')
print('Previous aggregation shape: ', previous_agg.shape)
previous_agg.head()
Previous aggregation shape:  (338857, 80)
previous_DAYS_DECISION_sum previous_DAYS_DECISION_min previous_DAYS_DECISION_mean previous_DAYS_DECISION_max previous_DAYS_FIRST_DUE_sum previous_DAYS_FIRST_DUE_min previous_DAYS_FIRST_DUE_mean previous_DAYS_FIRST_DUE_max previous_DAYS_LAST_DUE_sum previous_DAYS_LAST_DUE_min ... previous_DAYS_FIRST_DRAWING_min previous_DAYS_FIRST_DRAWING_mean previous_DAYS_FIRST_DRAWING_max previous_DAYS_FIRST_DRAWING_sum previous_RATE_INTEREST_PRIMARY_min previous_RATE_INTEREST_PRIMARY_mean previous_RATE_INTEREST_PRIMARY_max previous_RATE_INTEREST_PRIVILEGED_min previous_RATE_INTEREST_PRIVILEGED_mean previous_RATE_INTEREST_PRIVILEGED_max
SK_ID_CURR
100001 -1740 -1740 -1740.0 -1740 -1709.0 -1709.0 -1709.000000 -1709.0 -1619.0 -1619.0 ... 365243.0 365243.0 365243.0 365243.0 NaN NaN NaN NaN NaN NaN
100002 -606 -606 -606.0 -606 -565.0 -565.0 -565.000000 -565.0 -25.0 -25.0 ... 365243.0 365243.0 365243.0 365243.0 NaN NaN NaN NaN NaN NaN
100003 -3915 -2341 -1305.0 -746 -3823.0 -2310.0 -1274.333374 -716.0 -3163.0 -1980.0 ... 365243.0 365243.0 365243.0 1095729.0 NaN NaN NaN NaN NaN NaN
100004 -815 -815 -815.0 -815 -784.0 -784.0 -784.000000 -784.0 -724.0 -724.0 ... 365243.0 365243.0 365243.0 365243.0 NaN NaN NaN NaN NaN NaN
100005 -1072 -757 -536.0 -315 -706.0 -706.0 -706.000000 -706.0 -466.0 -466.0 ... 365243.0 365243.0 365243.0 365243.0 NaN NaN NaN NaN NaN NaN

5 rows × 80 columns

previous_counts = agg_categorical(previous, 'SK_ID_CURR', 'previous')
print('Previous counts shape: ', previous_counts.shape)
previous_counts.head()
Previous counts shape:  (338857, 285)
previous_NAME_GOODS_CATEGORY_Animals_mean previous_NAME_GOODS_CATEGORY_Animals_sum previous_NAME_GOODS_CATEGORY_House Construction_mean previous_NAME_GOODS_CATEGORY_House Construction_sum previous_NAME_CASH_LOAN_PURPOSE_Refusal to name the goal_mean previous_NAME_CASH_LOAN_PURPOSE_Refusal to name the goal_sum previous_NAME_CASH_LOAN_PURPOSE_Money for a third person_mean previous_NAME_CASH_LOAN_PURPOSE_Money for a third person_sum previous_NAME_CASH_LOAN_PURPOSE_Hobby_mean previous_NAME_CASH_LOAN_PURPOSE_Hobby_sum ... previous_CODE_REJECT_REASON_XAP_mean previous_FLAG_LAST_APPL_PER_CONTRACT_Y_mean previous_NAME_PORTFOLIO_POS_sum previous_NAME_CONTRACT_TYPE_Consumer loans_sum previous_NAME_CASH_LOAN_PURPOSE_XAP_sum previous_NAME_PRODUCT_TYPE_XNA_sum previous_NAME_CONTRACT_STATUS_Approved_sum previous_CODE_REJECT_REASON_XAP_sum previous_FLAG_LAST_APPL_PER_CONTRACT_Y_sum previous_NAME_CONTRACT_TYPE_Cash loans_count
SK_ID_CURR
100001 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 ... 1.0 1.0 1 1 1 1 1 1 1 1
100002 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 ... 1.0 1.0 1 1 1 1 1 1 1 1
100003 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 ... 1.0 1.0 2 2 2 2 3 3 3 3
100004 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 ... 1.0 1.0 1 1 1 1 1 1 1 1
100005 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 ... 1.0 1.0 1 1 1 2 1 2 2 2

5 rows × 285 columns

We merge the calculated dataframes into the main training dataframe using merge. Afterwards we need to delete the calculated dataframes so we don't use too much kernel memory.

train = pd.read_csv('./home_credit/application_train.csv')
train = convert_types(train)
test = pd.read_csv('./home_credit/application_test.csv')
test = convert_types(test)

train = train.merge(previous_counts, on="SK_ID_CURR", how="left")
train = train.merge(previous_agg, on="SK_ID_CURR", how="left")

test = test.merge(previous_counts, on="SK_ID_CURR", how="left")
test = test.merge(previous_agg, on="SK_ID_CURR", how="left")

# Remove variables to free memory
gc.enable()
del previous, previous_agg, previous_counts
gc.collect()
0

We need to be careful not to calculate too many features. We don't want to overload the model with too many irrelevant features or features that are mostly null. In the previous notebook we removed all features with more than 75% missing values; for consistency we apply the same logic here (the function below uses a 90% threshold by default).

Function to Calculate Missing Values

def missing_values_table(df, print_info=False):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    
    mis_val_table_ren_columns = mis_val_table.rename(columns={0: "Missing Values",
                                                              1: "% of Total Values"})
    
    # Keep only the columns with missing values, sorted by descending percentage
    mis_val_table_ren_columns = mis_val_table_ren_columns[mis_val_table_ren_columns.iloc[:, 1] != 0].\
                                sort_values("% of Total Values", ascending=False).round(1)
    
    if print_info:
        print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
              "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
    return mis_val_table_ren_columns

def remove_missing_columns(train, test, threshold=90):
    train_miss = pd.DataFrame(train.isnull().sum())
    train_miss["percent"] = 100 * train_miss[0] / len(train)
    
    test_miss = pd.DataFrame(test.isnull().sum())
    test_miss["percent"] = 100 * test_miss[0] / len(test)
    
    missing_train_columns = list(train_miss.index[train_miss["percent"] > threshold])
    missing_test_columns = list(test_miss.index[test_miss["percent"] > threshold])
    
    missing_columns = list(set(missing_train_columns + missing_test_columns))
    
    print('There are %d columns with greater than %d%% missing values.' % (len(missing_columns), threshold))
    
    train = train.drop(columns=missing_columns)
    test = test.drop(columns=missing_columns)
    
    return train, test

train, test = remove_missing_columns(train, test)
There are 6 columns with greater than 90% missing values.

Applying to More Data

Function to Aggregate Stats at the Client Level

def aggregate_client(df, group_vars, df_names):
    """
    group_vars = ['SK_ID_PREV', 'SK_ID_CURR']
    df_names = ['cash', 'client']
    """
    
    df_agg = agg_numeric(df, parent_var=group_vars[0], df_name=df_names[0])
    
    if any(df.dtypes == "category"):
        df_counts = agg_categorical(df, parent_var=group_vars[0], df_name=df_names[0])
        
        # Merge the numeric and categorical aggregations
        df_by_loan = df_counts.merge(df_agg, on=group_vars[0], how="outer")
        
        gc.enable()
        del df_agg, df_counts
        gc.collect()
        
        # Merge in the client id
        df_by_loan = df_by_loan.merge(df[[group_vars[0], group_vars[1]]],
                                      on=group_vars[0], how="left")
        
        df_by_loan = df_by_loan.drop(columns=[group_vars[0]])
        
        df_by_client = agg_numeric(df_by_loan, parent_var=group_vars[1],
                                   df_name=df_names[1])
        
    else:
        df_by_loan = df_agg.merge(df[[group_vars[0], group_vars[1]]],
                                  on=group_vars[0], how="left")
        
        gc.enable()
        del df_agg
        gc.collect()
        
        df_by_loan = df_by_loan.drop(columns=[group_vars[0]])
        
        df_by_client = agg_numeric(df_by_loan, parent_var=group_vars[1],
                                   df_name=df_names[1])
        
    gc.enable()
    del df, df_by_loan
    gc.collect()
    
    return df_by_client

Monthly Cash Data

cash = pd.read_csv('./home_credit/POS_CASH_balance.csv')
cash = convert_types(cash, print_info=True)
cash.head()
Original Memory Usage: 0.64 gb.
New Memory Usage: 0.41 gb.
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 1803195 182943 -31 48.0 45.0 Active 0 0
1 1715348 367990 -33 36.0 35.0 Active 0 0
2 1784872 397406 -32 12.0 9.0 Active 0 0
3 1903291 269225 -35 48.0 42.0 Active 0 0
4 2341044 334279 -35 36.0 35.0 Active 0 0
cash_by_client = aggregate_client(cash, group_vars=['SK_ID_PREV', 'SK_ID_CURR'], df_names=['cash', 'client'])
cash_by_client.head()
client_cash_MONTHS_BALANCE_sum_sum client_cash_MONTHS_BALANCE_min_sum client_cash_MONTHS_BALANCE_mean_sum client_cash_MONTHS_BALANCE_max_sum client_cash_MONTHS_BALANCE_sum_min client_cash_MONTHS_BALANCE_sum_mean client_cash_MONTHS_BALANCE_sum_max client_cash_MONTHS_BALANCE_min_min client_cash_MONTHS_BALANCE_mean_min client_cash_MONTHS_BALANCE_max_min ... client_cash_CNT_INSTALMENT_FUTURE_max_sum client_cash_NAME_CONTRACT_STATUS_Active_sum_sum client_cash_CNT_INSTALMENT_min_sum client_cash_CNT_INSTALMENT_mean_sum client_cash_CNT_INSTALMENT_max_sum client_cash_CNT_INSTALMENT_count_sum client_cash_CNT_INSTALMENT_FUTURE_count_sum client_cash_NAME_CONTRACT_STATUS_Active_count_sum client_cash_CNT_INSTALMENT_FUTURE_sum_sum client_cash_CNT_INSTALMENT_sum_sum
SK_ID_CURR
100001 -2887 -669 -653.0 -637 -378 -320.777778 -275 -96 -94.5 -93 ... 28.0 32.0 36.0 36.000000 36.0 41 41 41 62.0 164.0
100002 -3610 -361 -190.0 -19 -190 -190.000000 -190 -19 -10.0 -1 ... 456.0 361.0 456.0 456.000000 456.0 361 361 361 5415.0 8664.0
100003 -13240 -1348 -1226.0 -1104 -858 -472.857143 -172 -77 -71.5 -66 ... 288.0 256.0 248.0 283.000000 288.0 272 272 272 1608.0 2840.0
100004 -408 -108 -102.0 -96 -102 -102.000000 -102 -27 -25.5 -24 ... 16.0 12.0 12.0 15.000000 16.0 16 16 16 36.0 60.0
100005 -2420 -275 -220.0 -165 -220 -220.000000 -220 -25 -20.0 -15 ... 132.0 99.0 99.0 128.699997 132.0 110 110 121 792.0 1287.0

5 rows × 162 columns

print('Cash by Client Shape: ', cash_by_client.shape)
train = train.merge(cash_by_client, on="SK_ID_CURR", how="left")
test = test.merge(cash_by_client, on="SK_ID_CURR", how="left")

gc.enable()
del cash, cash_by_client
gc.collect()
Cash by Client Shape:  (337252, 162)
0
train, test = remove_missing_columns(train, test)
There are 0 columns with greater than 90% missing values.

Monthly Credit Data

credit = pd.read_csv("./home_credit/credit_card_balance.csv")
credit = convert_types(credit, print_info=True)
credit.head()
Original Memory Usage: 0.71 gb.
New Memory Usage: 0.42 gb.
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY ... AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 2562384 378907 -6 56.970001 135000 0.0 877.5 0.0 877.5 1700.324951 ... 0.000000 0.000000 0.0 1 0.0 1.0 35.0 Active 0 0
1 2582071 363914 -1 63975.554688 45000 2250.0 2250.0 0.0 0.0 2250.000000 ... 64875.554688 64875.554688 1.0 1 0.0 0.0 69.0 Active 0 0
2 1740877 371185 -7 31815.224609 450000 0.0 0.0 0.0 0.0 2250.000000 ... 31460.085938 31460.085938 0.0 0 0.0 0.0 30.0 Active 0 0
3 1389973 337855 -4 236572.109375 225000 2250.0 2250.0 0.0 0.0 11795.759766 ... 233048.968750 233048.968750 1.0 1 0.0 0.0 10.0 Active 0 0
4 1891521 126868 -1 453919.468750 450000 0.0 11547.0 0.0 11547.0 22924.890625 ... 453919.468750 453919.468750 0.0 1 0.0 1.0 101.0 Active 0 0

5 rows × 23 columns

credit_by_client = aggregate_client(credit, group_vars=['SK_ID_PREV', 'SK_ID_CURR'], df_names=['credit', 'client'])
credit_by_client.head()
client_credit_MONTHS_BALANCE_sum_sum client_credit_MONTHS_BALANCE_min_sum client_credit_MONTHS_BALANCE_mean_sum client_credit_MONTHS_BALANCE_sum_min client_credit_MONTHS_BALANCE_sum_mean client_credit_MONTHS_BALANCE_sum_max client_credit_MONTHS_BALANCE_max_sum client_credit_MONTHS_BALANCE_min_min client_credit_MONTHS_BALANCE_min_mean client_credit_MONTHS_BALANCE_min_max ... client_credit_AMT_DRAWINGS_ATM_CURRENT_mean_max client_credit_AMT_PAYMENT_CURRENT_mean_min client_credit_AMT_PAYMENT_CURRENT_mean_mean client_credit_AMT_PAYMENT_CURRENT_mean_max client_credit_AMT_PAYMENT_CURRENT_max_min client_credit_AMT_PAYMENT_CURRENT_max_mean client_credit_AMT_PAYMENT_CURRENT_max_max client_credit_AMT_DRAWINGS_ATM_CURRENT_max_min client_credit_AMT_DRAWINGS_ATM_CURRENT_max_mean client_credit_AMT_DRAWINGS_ATM_CURRENT_max_max
SK_ID_CURR
100006 -126 -36 -21.0 -21 -21.0 -21 -6 -6 -6.0 -6 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
100011 -210826 -5550 -2849.0 -2849 -2849.0 -2849 -148 -75 -75.0 -75 ... 2432.432373 4843.063965 4843.063965 4843.063965 55485.0 55485.0 55485.0 180000.0 180000.0 180000.0
100013 -446976 -9216 -4656.0 -4656 -4656.0 -4656 -96 -96 -96.0 -96 ... 6350.000000 7168.346191 7168.346191 7168.346191 153675.0 153675.0 153675.0 157500.0 157500.0 157500.0
100021 -2890 -306 -170.0 -170 -170.0 -170 -34 -18 -18.0 -18 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
100023 -480 -88 -60.0 -60 -60.0 -60 -32 -11 -11.0 -11 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 376 columns

print('Credit by client shape: ', credit_by_client.shape)

train = train.merge(credit_by_client, on='SK_ID_CURR', how='left')
test = test.merge(credit_by_client, on='SK_ID_CURR', how='left')

gc.enable()
del credit, credit_by_client
gc.collect()
Credit by client shape:  (103558, 376)
0
train, test = remove_missing_columns(train, test)
There are 0 columns with greater than 90% missing values.

Installment Payments

installments = pd.read_csv("./home_credit/installments_payments.csv")
installments = convert_types(installments, print_info=True)
installments.head()
Original Memory Usage: 0.87 gb.
New Memory Usage: 0.49 gb.
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT
0 1054186 161674 1.0 6 -1180.0 -1187.0 6948.359863 6948.359863
1 1330831 151639 0.0 34 -2156.0 -2156.0 1716.525024 1716.525024
2 2085231 193053 2.0 1 -63.0 -63.0 25425.000000 25425.000000
3 2452527 199697 1.0 3 -2418.0 -2426.0 24350.130859 24350.130859
4 2714724 167756 1.0 2 -1383.0 -1366.0 2165.040039 2160.584961
installments_by_client = aggregate_client(installments, group_vars=['SK_ID_PREV', 'SK_ID_CURR'], df_names=['installments', 'client'])
installments_by_client.head()
client_installments_DAYS_ENTRY_PAYMENT_sum_sum client_installments_DAYS_INSTALMENT_sum_sum client_installments_DAYS_ENTRY_PAYMENT_min_sum client_installments_DAYS_INSTALMENT_min_sum client_installments_DAYS_ENTRY_PAYMENT_mean_sum client_installments_DAYS_INSTALMENT_mean_sum client_installments_DAYS_ENTRY_PAYMENT_max_sum client_installments_DAYS_INSTALMENT_max_sum client_installments_DAYS_INSTALMENT_sum_min client_installments_DAYS_ENTRY_PAYMENT_sum_min ... client_installments_AMT_PAYMENT_min_sum client_installments_AMT_INSTALMENT_min_sum client_installments_AMT_PAYMENT_sum_max client_installments_AMT_INSTALMENT_sum_max client_installments_AMT_PAYMENT_mean_sum client_installments_AMT_INSTALMENT_mean_sum client_installments_AMT_INSTALMENT_max_sum client_installments_AMT_PAYMENT_max_sum client_installments_AMT_PAYMENT_sum_sum client_installments_AMT_INSTALMENT_sum_sum
SK_ID_CURR
100001 -52813.0 -52598.0 -15608.0 -15584.0 -15365.0 -15314.0 -15080.0 -15044.0 -8658.0 -8647.0 ... 2.774678e+04 2.774678e+04 2.925090e+04 2.925090e+04 4.119593e+04 4.119593e+04 8.153775e+04 8.153775e+04 1.528387e+05 1.528387e+05
100002 -113867.0 -106495.0 -11153.0 -10735.0 -5993.0 -5605.0 -931.0 -475.0 -5605.0 -5993.0 ... 1.757837e+05 1.757837e+05 2.196257e+05 2.196257e+05 2.196257e+05 2.196257e+05 1.008781e+06 1.008781e+06 4.172888e+06 4.172888e+06
100003 -367137.0 -365546.0 -37757.0 -37514.0 -34633.0 -34454.0 -31594.0 -31394.0 -25740.0 -25821.0 ... 1.154108e+06 1.154108e+06 1.150977e+06 1.150977e+06 1.618865e+06 1.618865e+06 4.394102e+06 4.394102e+06 1.134881e+07 1.134881e+07
100004 -6855.0 -6786.0 -2385.0 -2352.0 -2285.0 -2262.0 -2181.0 -2172.0 -2262.0 -2285.0 ... 1.607175e+04 1.607175e+04 2.128846e+04 2.128846e+04 2.128846e+04 2.128846e+04 3.172189e+04 3.172189e+04 6.386539e+04 6.386539e+04
100005 -49374.0 -47466.0 -6624.0 -6354.0 -5486.0 -5274.0 -4230.0 -4194.0 -5274.0 -5486.0 ... 4.331880e+04 4.331880e+04 5.616184e+04 5.616184e+04 5.616184e+04 5.616184e+04 1.589062e+05 1.589062e+05 5.054566e+05 5.054566e+05

5 rows × 106 columns

print('Installments by client shape: ', installments_by_client.shape)

train = train.merge(installments_by_client, on='SK_ID_CURR', how='left')
test = test.merge(installments_by_client, on='SK_ID_CURR', how='left')

gc.enable()
del installments, installments_by_client
gc.collect()
Installments by client shape:  (339587, 106)
20
train, test = remove_missing_columns(train, test)
There are 0 columns with greater than 90% missing values.
print('Final Training Shape: ', train.shape)
print('Final Testing Shape: ', test.shape)
Final Training Shape:  (307511, 1125)
Final Testing Shape:  (48744, 1124)
print(f'Final training size: {return_size(train)}')
print(f'Final testing size: {return_size(test)}')
Final training size: 2.08
Final testing size: 0.33

Save All Newly Calculated Features

Unfortunately, saving all of the newly created features does not work within a Kaggle notebook; you will need to run the code on your own machine.
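On a local machine, something like the following would persist the merged frames (the file names here are arbitrary choices, not from the original notebook):

train.to_csv('train_manual_p2.csv', index=False)
test.to_csv('test_manual_p2.csv', index=False)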

Modeling

from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score
import lightgbm as lgb

def model(features, test_features, encoding="ohe", n_folds=5):
    train_ids = features["SK_ID_CURR"]
    test_ids = test_features["SK_ID_CURR"]
    
    labels = features["TARGET"]
    
    # Remove the ids and the target from the training features
    features = features.drop(columns=["SK_ID_CURR", "TARGET"])
    test_features = test_features.drop(columns=["SK_ID_CURR"])
    
    if encoding == "ohe":
        features = pd.get_dummies(features)
        test_features = pd.get_dummies(test_features)
        
        features, test_features = features.align(test_features, join="inner",
                                                 axis=1)
        
        # With one-hot encoding we don't need to tell LightGBM which variables are categorical
        cat_indices = "auto"
        
    elif encoding == "le":
        label_encoder = LabelEncoder()
        
        cat_indices = []
        
        for i, col in enumerate(features):
            if features[col].dtype == "object":
                features[col] = label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
                test_features[col] = label_encoder.fit_transform(np.array(test_features[col].astype(str)).reshape((-1,)))
                
                cat_indices.append(i)
                
    else:
        raise ValueError("Encoding must be either 'ohe' or 'le'")
    
    print('Training Data Shape: ', features.shape)
    print('Testing Data Shape: ', test_features.shape)
    
    feature_names = list(features.columns)
    
    features = np.array(features)
    test_features = np.array(test_features)
    
    k_fold = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=50)
    
    feature_importance_values = np.zeros(len(feature_names))
    
    test_predictions = np.zeros(test_features.shape[0])
    
    out_of_fold = np.zeros(features.shape[0])
    
    valid_scores = []
    train_scores = []
    
    for train_indices, valid_indices in k_fold.split(features, labels):
        train_features, train_labels = features[train_indices], labels[train_indices]
        valid_features, valid_labels = features[valid_indices], labels[valid_indices]
        
        model = lgb.LGBMClassifier(n_estimators=10000, objective="binary",
                                   class_weight="balanced",
                                   learning_rate=0.05,
                                   reg_alpha=0.1, reg_lambda=0.1,
                                   subsample=0.8, n_jobs=-1, random_state=50)
        
        model.fit(train_features, train_labels, eval_metric="auc",
                  eval_set=[(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names=["valid", "train"], categorical_feature=cat_indices,
                  early_stopping_rounds=100, verbose=200)
        
        best_iteration = model.best_iteration_
        
        feature_importance_values += model.feature_importances_ / k_fold.n_splits
        
        test_predictions += model.predict_proba(test_features, 
                                                num_iteration=best_iteration)[:, 1] / k_fold.n_splits
        
        out_of_fold[valid_indices] = model.predict_proba(valid_features, 
                                                         num_iteration=best_iteration)[:, 1]
        
        valid_score = model.best_score_["valid"]["auc"]
        train_score = model.best_score_["train"]["auc"]
        
        valid_scores.append(valid_score)
        train_scores.append(train_score)
        
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
    
    submission = pd.DataFrame({"SK_ID_CURR": test_ids,
                               "TARGET": test_predictions})
    
    feature_importances = pd.DataFrame({"feature": feature_names, 
                                        "importance": feature_importance_values})
    
    valid_auc = roc_auc_score(labels, out_of_fold)
    
    # Overall scores
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
    
    fold_names = list(range(n_folds))
    fold_names.append("overall")
    
    metrics = pd.DataFrame({"fold": fold_names, "train": train_scores,
                            "valid": valid_scores})
    
    return submission, feature_importances, metrics
submission, fi, metrics = model(train, test)
Training Data Shape:  (307511, 1244)
Testing Data Shape:  (48744, 1244)
[200]	train's auc: 0.827847	train's binary_logloss: 0.517851	valid's auc: 0.776498	valid's binary_logloss: 0.53754
[200]	train's auc: 0.826893	train's binary_logloss: 0.519028	valid's auc: 0.777111	valid's binary_logloss: 0.539778
[400]	train's auc: 0.862668	train's binary_logloss: 0.479481	valid's auc: 0.778615	valid's binary_logloss: 0.514953
[200]	train's auc: 0.82652	train's binary_logloss: 0.519854	valid's auc: 0.782785	valid's binary_logloss: 0.536581
[400]	train's auc: 0.862059	train's binary_logloss: 0.480909	valid's auc: 0.783537	valid's binary_logloss: 0.512251
[200]	train's auc: 0.827632	train's binary_logloss: 0.518235	valid's auc: 0.780399	valid's binary_logloss: 0.538406
[400]	train's auc: 0.863706	train's binary_logloss: 0.478578	valid's auc: 0.780408	valid's binary_logloss: 0.513719
[200]	train's auc: 0.828105	train's binary_logloss: 0.517592	valid's auc: 0.776329	valid's binary_logloss: 0.538545
[400]	train's auc: 0.863528	train's binary_logloss: 0.478105	valid's auc: 0.776944	valid's binary_logloss: 0.513118
metrics
fold train valid
0 0 0.838376 0.777163
1 1 0.864993 0.778851
2 2 0.853780 0.783790
3 3 0.847372 0.780880
4 4 0.860250 0.777134
5 overall 0.852954 0.779494
submission.to_csv('submission_manualp2.csv', index=False)
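Since model() also returns the dataframe of averaged feature importances, a quick way to see which of the roughly 1,200 features the model leaned on (a minimal sketch using the fi dataframe from above):

fi.sort_values('importance', ascending=False).head(15)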

Reference: https://www.kaggle.com/code/willkoehrsen/introduction-to-manual-feature-engineering-p2/notebook
