[Home Credit Default Risk] Introduction: Manual Feature Engineering (part two)
[Notice] Source: https://www.kaggle.com/code/jundthird/kor-manual-feature-engineering-pt2
In this notebook, we extend the work from Introduction to Manual Feature Engineering one step further. We will use the aggregation and value-counting functions built in that notebook to incorporate information from the previous_application, POS_CASH_balance, installments_payments, and credit_card_balance data files. In the previous notebook we already used the information in bureau and bureau_balance, which improved the competition score relative to using the application data alone. Running a model with the features included here improves performance, but it also leaves us facing an explosion in the number of features. A feature selection notebook is in the works; for this notebook we will keep building out a large dataset for our model.
The definitions of the four additional data files are as follows (a short sketch after this list illustrates the one-to-many structure they share):
- previous_application (called previous): previous applications for loans at Home Credit by clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the SK_ID_PREV feature.
- POS_CASH_balance (called cash): monthly data about previous point-of-sale or cash loans clients have had with Home Credit. Each row is one month of a previous point-of-sale or cash loan, and a single previous loan can have many rows.
- credit_card_balance (called credit): monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
- installments_payments (called installments): payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.
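As a quick sketch of that one-to-many structure, we can count how many previous loans each current client has (this cell is illustrative and not part of the original notebook; it uses the same file path as the rest of the notebook):
import pandas as pd
prev = pd.read_csv('./home_credit/previous_application.csv')
# Number of distinct previous loans per current client
print(prev.groupby('SK_ID_CURR')['SK_ID_PREV'].nunique().describe())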
Functions
In the previous notebook we built two functions:
- agg_numeric: calculates the count, mean, max, min, and sum of the numeric variables.
- agg_categorical: calculates counts and normalized counts of each category in the categorical variables.
Together, these two functions can extract all the information from the numeric and categorical data in a dataframe. Our general approach here is to apply both functions to a dataframe grouped by the client id, SK_ID_CURR. For POS_CASH_balance, credit_card_balance, and installments_payments, we can first group by SK_ID_PREV, the unique id of each previous loan. The resulting dataframe is then grouped by SK_ID_CURR to calculate each client's statistics across all of their previous loans, as sketched below.
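Here is a minimal sketch of this two-stage aggregation on made-up toy data (the real functions below additionally handle column naming and deduplication):
import pandas as pd

# Toy monthly data: two previous loans (SK_ID_PREV) for one client (SK_ID_CURR)
toy = pd.DataFrame({
    "SK_ID_PREV": [1, 1, 2, 2],
    "SK_ID_CURR": [100, 100, 100, 100],
    "MONTHS_BALANCE": [-2, -1, -4, -3],
})

# Stage 1: statistics at the previous-loan level
by_loan = toy.groupby("SK_ID_PREV")["MONTHS_BALANCE"].agg(["mean", "max"]).reset_index()

# Stage 2: attach the client id, then aggregate the loan-level stats per client
by_loan = by_loan.merge(toy[["SK_ID_PREV", "SK_ID_CURR"]].drop_duplicates(), on="SK_ID_PREV")
by_client = by_loan.drop(columns="SK_ID_PREV").groupby("SK_ID_CURR").agg(["mean", "max"])
print(by_client)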
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
plt.style.use("fivethirtyeight")
import gc
Function to Aggregate Numeric Data
Groups the data by parent_var and calculates the count, mean, max, min, and sum of the numeric variables.
def agg_numeric(df, parent_var, df_name):
    # Remove id variables other than the grouping variable
    for col in df:
        if col != parent_var and "SK_ID" in col:
            df = df.drop(columns=col)

    parent_ids = df[parent_var].copy()
    numeric_df = df.select_dtypes("number").copy()
    numeric_df[parent_var] = parent_ids

    agg = numeric_df.groupby(parent_var).agg(['count', 'mean', 'max', 'min', 'sum'])

    # Build new column names: dataframe name, variable, statistic
    columns = []
    for var in agg.columns.levels[0]:
        if var != parent_var:
            for stat in agg.columns.levels[1]:
                columns.append("%s_%s_%s" % (df_name, var, stat))
    agg.columns = columns

    # Remove redundant duplicated columns (np.unique returns the index
    # of the first occurrence of each unique column)
    _, idx = np.unique(agg, axis=1, return_index=True)
    agg = agg.iloc[:, idx]

    return agg
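A quick usage sketch on toy data (the values here are made up, not from the competition files):
toy = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 101],
    "SK_ID_PREV": [1, 2, 3],          # extra id column, dropped by the function
    "AMT_CREDIT": [1000.0, 2000.0, 500.0],
})
# Produces columns such as toy_AMT_CREDIT_count, toy_AMT_CREDIT_mean, ...
print(agg_numeric(toy, "SK_ID_CURR", "toy"))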
Function to Calculate Categorical Counts
This function calculates the number of occurrences (counts) of each category of every categorical variable for each client. It also calculates the normed count: the count of a category divided by the total count over all categories for that variable.
def agg_categorical(df, parent_var, df_name):
    # One-hot encode the categorical columns
    categorical = pd.get_dummies(df.select_dtypes("category"))
    categorical[parent_var] = df[parent_var]

    # sum = raw count, mean = normalized count
    categorical = categorical.groupby(parent_var).agg(["sum", "count", "mean"])

    column_names = []
    for var in categorical.columns.levels[0]:
        for stat in ["sum", "count", "mean"]:
            column_names.append("%s_%s_%s" % (df_name, var, stat))
    categorical.columns = column_names

    # Remove redundant duplicated columns
    _, idx = np.unique(categorical, axis=1, return_index=True)
    categorical = categorical.iloc[:, idx]

    return categorical
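And a matching sketch for the categorical counts (toy data again; note the column must already have the category dtype, which convert_types below handles for the real data):
toy = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 101],
    "NAME_CONTRACT_TYPE": pd.Categorical(["Cash", "Consumer", "Cash"]),
})
# toy_NAME_CONTRACT_TYPE_Cash_sum is the raw count of that category per client,
# toy_NAME_CONTRACT_TYPE_Cash_mean is the normalized count
print(agg_categorical(toy, "SK_ID_CURR", "toy"))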
Function for KDE Plots of Variables
We made a function to plot the distribution of a variable colored by the value of TARGET (1 means the loan was not repaid, 0 means it was repaid). We can use this function to visually examine any new variables we create. It also calculates the correlation coefficient between the variable and the target, which can serve as an approximation of whether a created variable will be useful.
def kde_target(var_name, df):
    # Correlation between the variable and the target
    corr = df["TARGET"].corr(df[var_name])

    # Median of the variable for repaid (0) and not repaid (1) loans
    avg_repaid = df.loc[df["TARGET"] == 0, var_name].median()
    avg_not_repaid = df.loc[df["TARGET"] == 1, var_name].median()

    plt.figure(figsize=(12, 6))
    sns.kdeplot(df.loc[df["TARGET"] == 0, var_name], label="TARGET == 0")
    sns.kdeplot(df.loc[df["TARGET"] == 1, var_name], label="TARGET == 1")
    plt.xlabel(var_name); plt.ylabel('Density'); plt.title('%s Distribution' % var_name)
    plt.legend(loc="best")

    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid = %0.4f' % avg_repaid)
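The function is not called in this notebook, but once the engineered features have been merged into train later on, a call could look like this (the column name is one of those produced by previous_agg below):
# e.g. distribution of the mean decision-days feature, split by target
kde_target('previous_DAYS_DECISION_mean', train)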
Function to Convert Data Types
Using more efficient types for the variables helps reduce memory usage. For example, category is usually a better type than object (as long as the number of unique values is not close to the number of rows in the dataframe).
import sys

def return_size(df):
    """Return size of dataframe in gigabytes"""
    return round(sys.getsizeof(df) / 1e9, 2)

def convert_types(df, print_info=False):
    original_memory = df.memory_usage().sum()

    for c in df:
        # Convert ids to integers
        if ("SK_ID" in c):
            df[c] = df[c].fillna(0).astype(np.int32)
        # Convert objects to category
        elif (df[c].dtype == "object") and (df[c].nunique() < df.shape[0]):
            df[c] = df[c].astype("category")
        # Binary columns to booleans
        elif set(df[c].unique()) == set([1, 0]):
            df[c] = df[c].astype(bool)
        # Float64 to float32
        elif df[c].dtype == float:
            df[c] = df[c].astype(np.float32)
        # Int64 to int32
        elif df[c].dtype == int:
            df[c] = df[c].astype(np.int32)

    new_memory = df.memory_usage().sum()

    if print_info:
        print(f'Original Memory Usage: {round(original_memory / 1e9, 2)} gb.')
        print(f'New Memory Usage: {round(new_memory / 1e9, 2)} gb.')

    return df
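As a small illustration of why the object-to-category conversion pays off (toy series; exact numbers vary by pandas version):
s = pd.Series(["Cash loans", "Consumer loans"] * 100000)
print(s.memory_usage(deep=True))                      # object dtype: tens of MB
print(s.astype("category").memory_usage(deep=True))   # category: a small fraction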
Let's tackle one dataframe at a time. First up is previous_application, which has one row for every previous loan a client had at Home Credit. A client can have multiple previous loans, so we need to aggregate statistics for each client.
previous_application
previous = pd.read_csv('./home_credit/previous_application.csv')
previous = convert_types(previous, print_info=True)
previous.head()
Original Memory Usage: 0.49 gb.
New Memory Usage: 0.18 gb.
SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2030495 | 271877 | Consumer loans | 1730.430054 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
1 | 2802425 | 108129 | Cash loans | 25188.615234 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
2 | 2523466 | 122040 | Cash loans | 15060.735352 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
3 | 2819243 | 176158 | Cash loans | 47041.335938 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
4 | 1784265 | 202054 | Cash loans | 31924.394531 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
previous_agg = agg_numeric(previous, 'SK_ID_CURR', 'previous')
print('Previous aggregation shape: ', previous_agg.shape)
previous_agg.head()
Previous aggregation shape: (338857, 80)
previous_DAYS_DECISION_sum | previous_DAYS_DECISION_min | previous_DAYS_DECISION_mean | previous_DAYS_DECISION_max | previous_DAYS_FIRST_DUE_sum | previous_DAYS_FIRST_DUE_min | previous_DAYS_FIRST_DUE_mean | previous_DAYS_FIRST_DUE_max | previous_DAYS_LAST_DUE_sum | previous_DAYS_LAST_DUE_min | ... | previous_DAYS_FIRST_DRAWING_min | previous_DAYS_FIRST_DRAWING_mean | previous_DAYS_FIRST_DRAWING_max | previous_DAYS_FIRST_DRAWING_sum | previous_RATE_INTEREST_PRIMARY_min | previous_RATE_INTEREST_PRIMARY_mean | previous_RATE_INTEREST_PRIMARY_max | previous_RATE_INTEREST_PRIVILEGED_min | previous_RATE_INTEREST_PRIVILEGED_mean | previous_RATE_INTEREST_PRIVILEGED_max | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_CURR | |||||||||||||||||||||
100001 | -1740 | -1740 | -1740.0 | -1740 | -1709.0 | -1709.0 | -1709.000000 | -1709.0 | -1619.0 | -1619.0 | ... | 365243.0 | 365243.0 | 365243.0 | 365243.0 | NaN | NaN | NaN | NaN | NaN | NaN |
100002 | -606 | -606 | -606.0 | -606 | -565.0 | -565.0 | -565.000000 | -565.0 | -25.0 | -25.0 | ... | 365243.0 | 365243.0 | 365243.0 | 365243.0 | NaN | NaN | NaN | NaN | NaN | NaN |
100003 | -3915 | -2341 | -1305.0 | -746 | -3823.0 | -2310.0 | -1274.333374 | -716.0 | -3163.0 | -1980.0 | ... | 365243.0 | 365243.0 | 365243.0 | 1095729.0 | NaN | NaN | NaN | NaN | NaN | NaN |
100004 | -815 | -815 | -815.0 | -815 | -784.0 | -784.0 | -784.000000 | -784.0 | -724.0 | -724.0 | ... | 365243.0 | 365243.0 | 365243.0 | 365243.0 | NaN | NaN | NaN | NaN | NaN | NaN |
100005 | -1072 | -757 | -536.0 | -315 | -706.0 | -706.0 | -706.000000 | -706.0 | -466.0 | -466.0 | ... | 365243.0 | 365243.0 | 365243.0 | 365243.0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 80 columns
previous_counts = agg_categorical(previous, 'SK_ID_CURR', 'previous')
print('Previous counts shape: ', previous_counts.shape)
previous_counts.head()
Previous counts shape: (338857, 285)
previous_NAME_GOODS_CATEGORY_Animals_mean | previous_NAME_GOODS_CATEGORY_Animals_sum | previous_NAME_GOODS_CATEGORY_House Construction_mean | previous_NAME_GOODS_CATEGORY_House Construction_sum | previous_NAME_CASH_LOAN_PURPOSE_Refusal to name the goal_mean | previous_NAME_CASH_LOAN_PURPOSE_Refusal to name the goal_sum | previous_NAME_CASH_LOAN_PURPOSE_Money for a third person_mean | previous_NAME_CASH_LOAN_PURPOSE_Money for a third person_sum | previous_NAME_CASH_LOAN_PURPOSE_Hobby_mean | previous_NAME_CASH_LOAN_PURPOSE_Hobby_sum | ... | previous_CODE_REJECT_REASON_XAP_mean | previous_FLAG_LAST_APPL_PER_CONTRACT_Y_mean | previous_NAME_PORTFOLIO_POS_sum | previous_NAME_CONTRACT_TYPE_Consumer loans_sum | previous_NAME_CASH_LOAN_PURPOSE_XAP_sum | previous_NAME_PRODUCT_TYPE_XNA_sum | previous_NAME_CONTRACT_STATUS_Approved_sum | previous_CODE_REJECT_REASON_XAP_sum | previous_FLAG_LAST_APPL_PER_CONTRACT_Y_sum | previous_NAME_CONTRACT_TYPE_Cash loans_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_CURR | |||||||||||||||||||||
100001 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | ... | 1.0 | 1.0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
100002 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | ... | 1.0 | 1.0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
100003 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | ... | 1.0 | 1.0 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 |
100004 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | ... | 1.0 | 1.0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
100005 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | ... | 1.0 | 1.0 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 2 |
5 rows × 285 columns
We join the calculated dataframes to the main training dataframe using merge. Afterwards we need to delete the calculated dataframes so that we don't use too much kernel memory.
train = pd.read_csv('./home_credit/application_train.csv')
train = convert_types(train)
test = pd.read_csv('./home_credit/application_test.csv')
test = convert_types(test)
train = train.merge(previous_counts, on="SK_ID_CURR", how="left")
train = train.merge(previous_agg, on="SK_ID_CURR", how="left")
test = test.merge(previous_counts, on="SK_ID_CURR", how="left")
test = test.merge(previous_agg, on="SK_ID_CURR", how="left")
# Remove variables to free up memory
gc.enable()
del previous, previous_agg, previous_counts
gc.collect()
0
We have to be careful not to calculate too many features: we don't want to overload the model with irrelevant features or features with too many null values. In the previous notebook we removed features with more than 75% missing values; here we apply the same logic, with the removal function below defaulting to a 90% threshold.
Function to Calculate Missing Values
def missing_values_table(df, print_info=False):
    # Total missing values and their percentage of all values
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(columns={0: "Missing Values",
                                                              1: "% of Total Values"})

    # Keep only the columns with missing values, sorted descending by percentage
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        "% of Total Values", ascending=False).round(1)

    if print_info:
        print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
              "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")

    return mis_val_table_ren_columns
def remove_missing_columns(train, test, threshold=90):
    # Calculate missing value percentages for train and test
    train_miss = pd.DataFrame(train.isnull().sum())
    train_miss["percent"] = 100 * train_miss[0] / len(train)

    test_miss = pd.DataFrame(test.isnull().sum())
    test_miss["percent"] = 100 * test_miss[0] / len(test)

    # Columns above the threshold in either dataframe
    missing_train_columns = list(train_miss.index[train_miss["percent"] > threshold])
    missing_test_columns = list(test_miss.index[test_miss["percent"] > threshold])
    missing_columns = list(set(missing_train_columns + missing_test_columns))

    print('There are %d columns with greater than %d%% missing values.' % (len(missing_columns), threshold))

    # Drop the missing columns from both dataframes
    train = train.drop(columns=missing_columns)
    test = test.drop(columns=missing_columns)

    return train, test
train, test = remove_missing_columns(train, test)
There are 6 columns with greater than 90% missing values.
Applying to More Data
Function to Aggregate Stats at the Client Level
def aggregate_client(df, group_vars, df_names):
    """Aggregate a dataframe of loan-level data at the client level.

    e.g. group_vars = ['SK_ID_PREV', 'SK_ID_CURR']
         df_names = ['cash', 'client']
    """
    # Aggregate the numeric columns at the previous-loan level
    df_agg = agg_numeric(df, parent_var=group_vars[0], df_name=df_names[0])

    # If there are categorical variables, aggregate those as well
    if any(df.dtypes == "category"):
        df_counts = agg_categorical(df, parent_var=group_vars[0], df_name=df_names[0])

        # Merge the numeric and categorical aggregations
        df_by_loan = df_counts.merge(df_agg, on=group_vars[0], how="outer")

        gc.enable()
        del df_agg, df_counts
        gc.collect()

        # Merge in the client id
        df_by_loan = df_by_loan.merge(df[[group_vars[0], group_vars[1]]],
                                      on=group_vars[0], how="left")
        df_by_loan = df_by_loan.drop(columns=[group_vars[0]])

        # Aggregate the loan-level statistics at the client level
        df_by_client = agg_numeric(df_by_loan, parent_var=group_vars[1],
                                   df_name=df_names[1])

    else:
        df_by_loan = df_agg.merge(df[[group_vars[0], group_vars[1]]],
                                  on=group_vars[0], how="left")

        gc.enable()
        del df_agg
        gc.collect()

        df_by_loan = df_by_loan.drop(columns=[group_vars[0]])
        df_by_client = agg_numeric(df_by_loan, parent_var=group_vars[1],
                                   df_name=df_names[1])

    gc.enable()
    del df, df_by_loan
    gc.collect()

    return df_by_client
Monthly Cash Data
cash = pd.read_csv('./home_credit/POS_CASH_balance.csv')
cash = convert_types(cash, print_info=True)
cash.head()
Original Memory Usage: 0.64 gb.
New Memory Usage: 0.41 gb.
SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
---|---|---|---|---|---|---|---|---|
0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active | 0 | 0 |
1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active | 0 | 0 |
2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active | 0 | 0 |
3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active | 0 | 0 |
4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active | 0 | 0 |
cash_by_client = aggregate_client(cash, group_vars=['SK_ID_PREV', 'SK_ID_CURR'], df_names=['cash', 'client'])
cash_by_client.head()
client_cash_MONTHS_BALANCE_sum_sum | client_cash_MONTHS_BALANCE_min_sum | client_cash_MONTHS_BALANCE_mean_sum | client_cash_MONTHS_BALANCE_max_sum | client_cash_MONTHS_BALANCE_sum_min | client_cash_MONTHS_BALANCE_sum_mean | client_cash_MONTHS_BALANCE_sum_max | client_cash_MONTHS_BALANCE_min_min | client_cash_MONTHS_BALANCE_mean_min | client_cash_MONTHS_BALANCE_max_min | ... | client_cash_CNT_INSTALMENT_FUTURE_max_sum | client_cash_NAME_CONTRACT_STATUS_Active_sum_sum | client_cash_CNT_INSTALMENT_min_sum | client_cash_CNT_INSTALMENT_mean_sum | client_cash_CNT_INSTALMENT_max_sum | client_cash_CNT_INSTALMENT_count_sum | client_cash_CNT_INSTALMENT_FUTURE_count_sum | client_cash_NAME_CONTRACT_STATUS_Active_count_sum | client_cash_CNT_INSTALMENT_FUTURE_sum_sum | client_cash_CNT_INSTALMENT_sum_sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_CURR | |||||||||||||||||||||
100001 | -2887 | -669 | -653.0 | -637 | -378 | -320.777778 | -275 | -96 | -94.5 | -93 | ... | 28.0 | 32.0 | 36.0 | 36.000000 | 36.0 | 41 | 41 | 41 | 62.0 | 164.0 |
100002 | -3610 | -361 | -190.0 | -19 | -190 | -190.000000 | -190 | -19 | -10.0 | -1 | ... | 456.0 | 361.0 | 456.0 | 456.000000 | 456.0 | 361 | 361 | 361 | 5415.0 | 8664.0 |
100003 | -13240 | -1348 | -1226.0 | -1104 | -858 | -472.857143 | -172 | -77 | -71.5 | -66 | ... | 288.0 | 256.0 | 248.0 | 283.000000 | 288.0 | 272 | 272 | 272 | 1608.0 | 2840.0 |
100004 | -408 | -108 | -102.0 | -96 | -102 | -102.000000 | -102 | -27 | -25.5 | -24 | ... | 16.0 | 12.0 | 12.0 | 15.000000 | 16.0 | 16 | 16 | 16 | 36.0 | 60.0 |
100005 | -2420 | -275 | -220.0 | -165 | -220 | -220.000000 | -220 | -25 | -20.0 | -15 | ... | 132.0 | 99.0 | 99.0 | 128.699997 | 132.0 | 110 | 110 | 121 | 792.0 | 1287.0 |
5 rows × 162 columns
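The doubled suffixes in these column names come from aggregating twice: client_cash_MONTHS_BALANCE_mean_sum, for example, is the per-loan mean of MONTHS_BALANCE summed over all of a client's previous loans. A quick sanity check of the pattern (not in the original notebook):
# Every client-level column should end in one of the second-stage statistics
stats = ("count", "mean", "max", "min", "sum")
assert all(col.endswith(stats) for col in cash_by_client.columns)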
print('Cash by Client Shape: ', cash_by_client.shape)
train = train.merge(cash_by_client, on="SK_ID_CURR", how="left")
test = test.merge(cash_by_client, on="SK_ID_CURR", how="left")
gc.enable()
del cash, cash_by_client
gc.collect()
Cash by Client Shape: (337252, 162)
0
train, test = remove_missing_columns(train, test)
There are 0 columns with greater than 90% missing values.
Monthly Credit Data
credit = pd.read_csv("./home_credit/credit_card_balance.csv")
credit = convert_types(credit, print_info=True)
credit.head()
Original Memory Usage: 0.71 gb.
New Memory Usage: 0.42 gb.
SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2562384 | 378907 | -6 | 56.970001 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.324951 | ... | 0.000000 | 0.000000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
1 | 2582071 | 363914 | -1 | 63975.554688 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000000 | ... | 64875.554688 | 64875.554688 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
2 | 1740877 | 371185 | -7 | 31815.224609 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000000 | ... | 31460.085938 | 31460.085938 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
3 | 1389973 | 337855 | -4 | 236572.109375 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.759766 | ... | 233048.968750 | 233048.968750 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
4 | 1891521 | 126868 | -1 | 453919.468750 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890625 | ... | 453919.468750 | 453919.468750 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
5 rows × 23 columns
credit_by_client = aggregate_client(credit, group_vars=['SK_ID_PREV', 'SK_ID_CURR'], df_names=['credit', 'client'])
credit_by_client.head()
client_credit_MONTHS_BALANCE_sum_sum | client_credit_MONTHS_BALANCE_min_sum | client_credit_MONTHS_BALANCE_mean_sum | client_credit_MONTHS_BALANCE_sum_min | client_credit_MONTHS_BALANCE_sum_mean | client_credit_MONTHS_BALANCE_sum_max | client_credit_MONTHS_BALANCE_max_sum | client_credit_MONTHS_BALANCE_min_min | client_credit_MONTHS_BALANCE_min_mean | client_credit_MONTHS_BALANCE_min_max | ... | client_credit_AMT_DRAWINGS_ATM_CURRENT_mean_max | client_credit_AMT_PAYMENT_CURRENT_mean_min | client_credit_AMT_PAYMENT_CURRENT_mean_mean | client_credit_AMT_PAYMENT_CURRENT_mean_max | client_credit_AMT_PAYMENT_CURRENT_max_min | client_credit_AMT_PAYMENT_CURRENT_max_mean | client_credit_AMT_PAYMENT_CURRENT_max_max | client_credit_AMT_DRAWINGS_ATM_CURRENT_max_min | client_credit_AMT_DRAWINGS_ATM_CURRENT_max_mean | client_credit_AMT_DRAWINGS_ATM_CURRENT_max_max | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_CURR | |||||||||||||||||||||
100006 | -126 | -36 | -21.0 | -21 | -21.0 | -21 | -6 | -6 | -6.0 | -6 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
100011 | -210826 | -5550 | -2849.0 | -2849 | -2849.0 | -2849 | -148 | -75 | -75.0 | -75 | ... | 2432.432373 | 4843.063965 | 4843.063965 | 4843.063965 | 55485.0 | 55485.0 | 55485.0 | 180000.0 | 180000.0 | 180000.0 |
100013 | -446976 | -9216 | -4656.0 | -4656 | -4656.0 | -4656 | -96 | -96 | -96.0 | -96 | ... | 6350.000000 | 7168.346191 | 7168.346191 | 7168.346191 | 153675.0 | 153675.0 | 153675.0 | 157500.0 | 157500.0 | 157500.0 |
100021 | -2890 | -306 | -170.0 | -170 | -170.0 | -170 | -34 | -18 | -18.0 | -18 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
100023 | -480 | -88 | -60.0 | -60 | -60.0 | -60 | -32 | -11 | -11.0 | -11 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 376 columns
print('Credit by client shape: ', credit_by_client.shape)
train = train.merge(credit_by_client, on='SK_ID_CURR', how='left')
test = test.merge(credit_by_client, on='SK_ID_CURR', how='left')
gc.enable()
del credit, credit_by_client
gc.collect()
Credit by client shape: (103558, 376)
0
train, test = remove_missing_columns(train, test)
There are 0 columns with greater than 90% missing values.
Installment Payments
installments = pd.read_csv("./home_credit/installments_payments.csv")
installments = convert_types(installments, print_info=True)
installments.head()
Original Memory Usage: 0.87 gb.
New Memory Usage: 0.49 gb.
SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
---|---|---|---|---|---|---|---|---|
0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.359863 | 6948.359863 |
1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525024 | 1716.525024 |
2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000000 | 25425.000000 |
3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130859 | 24350.130859 |
4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040039 | 2160.584961 |
installments_by_client = aggregate_client(installments, group_vars=['SK_ID_PREV', 'SK_ID_CURR'], df_names=['installments', 'client'])
installments_by_client.head()
client_installments_DAYS_ENTRY_PAYMENT_sum_sum | client_installments_DAYS_INSTALMENT_sum_sum | client_installments_DAYS_ENTRY_PAYMENT_min_sum | client_installments_DAYS_INSTALMENT_min_sum | client_installments_DAYS_ENTRY_PAYMENT_mean_sum | client_installments_DAYS_INSTALMENT_mean_sum | client_installments_DAYS_ENTRY_PAYMENT_max_sum | client_installments_DAYS_INSTALMENT_max_sum | client_installments_DAYS_INSTALMENT_sum_min | client_installments_DAYS_ENTRY_PAYMENT_sum_min | ... | client_installments_AMT_PAYMENT_min_sum | client_installments_AMT_INSTALMENT_min_sum | client_installments_AMT_PAYMENT_sum_max | client_installments_AMT_INSTALMENT_sum_max | client_installments_AMT_PAYMENT_mean_sum | client_installments_AMT_INSTALMENT_mean_sum | client_installments_AMT_INSTALMENT_max_sum | client_installments_AMT_PAYMENT_max_sum | client_installments_AMT_PAYMENT_sum_sum | client_installments_AMT_INSTALMENT_sum_sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_CURR | |||||||||||||||||||||
100001 | -52813.0 | -52598.0 | -15608.0 | -15584.0 | -15365.0 | -15314.0 | -15080.0 | -15044.0 | -8658.0 | -8647.0 | ... | 2.774678e+04 | 2.774678e+04 | 2.925090e+04 | 2.925090e+04 | 4.119593e+04 | 4.119593e+04 | 8.153775e+04 | 8.153775e+04 | 1.528387e+05 | 1.528387e+05 |
100002 | -113867.0 | -106495.0 | -11153.0 | -10735.0 | -5993.0 | -5605.0 | -931.0 | -475.0 | -5605.0 | -5993.0 | ... | 1.757837e+05 | 1.757837e+05 | 2.196257e+05 | 2.196257e+05 | 2.196257e+05 | 2.196257e+05 | 1.008781e+06 | 1.008781e+06 | 4.172888e+06 | 4.172888e+06 |
100003 | -367137.0 | -365546.0 | -37757.0 | -37514.0 | -34633.0 | -34454.0 | -31594.0 | -31394.0 | -25740.0 | -25821.0 | ... | 1.154108e+06 | 1.154108e+06 | 1.150977e+06 | 1.150977e+06 | 1.618865e+06 | 1.618865e+06 | 4.394102e+06 | 4.394102e+06 | 1.134881e+07 | 1.134881e+07 |
100004 | -6855.0 | -6786.0 | -2385.0 | -2352.0 | -2285.0 | -2262.0 | -2181.0 | -2172.0 | -2262.0 | -2285.0 | ... | 1.607175e+04 | 1.607175e+04 | 2.128846e+04 | 2.128846e+04 | 2.128846e+04 | 2.128846e+04 | 3.172189e+04 | 3.172189e+04 | 6.386539e+04 | 6.386539e+04 |
100005 | -49374.0 | -47466.0 | -6624.0 | -6354.0 | -5486.0 | -5274.0 | -4230.0 | -4194.0 | -5274.0 | -5486.0 | ... | 4.331880e+04 | 4.331880e+04 | 5.616184e+04 | 5.616184e+04 | 5.616184e+04 | 5.616184e+04 | 1.589062e+05 | 1.589062e+05 | 5.054566e+05 | 5.054566e+05 |
5 rows × 106 columns
print('Installments by client shape: ', installments_by_client.shape)
train = train.merge(installments_by_client, on='SK_ID_CURR', how='left')
test = test.merge(installments_by_client, on='SK_ID_CURR', how='left')
gc.enable()
del installments, installments_by_client
gc.collect()
Installments by client shape: (339587, 106)
20
train, test = remove_missing_columns(train, test)
There are 0 columns with greater than 90% missing values.
print('Final Training Shape: ', train.shape)
print('Final Testing Shape: ', test.shape)
Final Training Shape: (307511, 1125)
Final Testing Shape: (48744, 1124)
print(f'Final training size: {return_size(train)}')
print(f'Final testing size: {return_size(test)}')
Final training size: 2.08
Final testing size: 0.33
Save All Newly Calculated Features
Unfortunately, saving all of the created features does not work within a Kaggle notebook; you will need to run the code on your own machine.
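For reference, a minimal sketch of what saving locally could look like (the file names here are assumptions, not from the original notebook; chunksize keeps the write's memory footprint down):
# Hypothetical output paths -- adjust for your machine
train.to_csv('./home_credit/train_previous_raw.csv', index=False, chunksize=500)
test.to_csv('./home_credit/test_previous_raw.csv', index=False, chunksize=500)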
Modeling
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
def model(features, test_features, encoding="ohe", n_folds=5):
    # Extract the ids
    train_ids = features["SK_ID_CURR"]
    test_ids = test_features["SK_ID_CURR"]

    # Extract the labels, then remove the ids and target from the features
    labels = features["TARGET"]
    features = features.drop(columns=["SK_ID_CURR", "TARGET"])
    test_features = test_features.drop(columns=["SK_ID_CURR"])

    # One-hot encoding
    if encoding == "ohe":
        features = pd.get_dummies(features)
        test_features = pd.get_dummies(test_features)

        # Align on the columns present in both dataframes
        features, test_features = features.align(test_features, join="inner",
                                                 axis=1)

        # With one-hot encoding, we don't need to tell LightGBM
        # which variables are categorical
        cat_indices = "auto"

    # Integer label encoding
    elif encoding == "le":
        label_encoder = LabelEncoder()
        cat_indices = []

        for i, col in enumerate(features):
            if features[col].dtype == "object":
                features[col] = label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
                test_features[col] = label_encoder.fit_transform(np.array(test_features[col].astype(str)).reshape((-1,)))
                cat_indices.append(i)

    else:
        raise ValueError("Encoding must be either 'ohe' or 'le'")

    print('Training Data Shape: ', features.shape)
    print('Testing Data Shape: ', test_features.shape)

    feature_names = list(features.columns)

    # Convert to numpy arrays
    features = np.array(features)
    test_features = np.array(test_features)

    # Stratified k-fold cross validation
    k_fold = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=50)

    feature_importance_values = np.zeros(len(feature_names))
    test_predictions = np.zeros(test_features.shape[0])

    # Out-of-fold validation predictions
    out_of_fold = np.zeros(features.shape[0])

    valid_scores = []
    train_scores = []

    for train_indices, valid_indices in k_fold.split(features, labels):
        train_features, train_labels = features[train_indices], labels[train_indices]
        valid_features, valid_labels = features[valid_indices], labels[valid_indices]

        model = lgb.LGBMClassifier(n_estimators=10000, objective="binary",
                                   class_weight="balanced",
                                   learning_rate=0.05,
                                   reg_alpha=0.1, reg_lambda=0.1,
                                   subsample=0.8, n_jobs=-1, random_state=50)

        model.fit(train_features, train_labels, eval_metric="auc",
                  eval_set=[(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names=["valid", "train"], categorical_feature=cat_indices,
                  early_stopping_rounds=100, verbose=200)

        best_iteration = model.best_iteration_

        # Average the feature importances and test predictions over the folds
        feature_importance_values += model.feature_importances_ / k_fold.n_splits
        test_predictions += model.predict_proba(test_features,
                                                num_iteration=best_iteration)[:, 1] / k_fold.n_splits

        out_of_fold[valid_indices] = model.predict_proba(valid_features,
                                                         num_iteration=best_iteration)[:, 1]

        valid_score = model.best_score_["valid"]["auc"]
        train_score = model.best_score_["train"]["auc"]

        valid_scores.append(valid_score)
        train_scores.append(train_score)

        gc.enable()
        del model, train_features, valid_features
        gc.collect()

    submission = pd.DataFrame({"SK_ID_CURR": test_ids,
                               "TARGET": test_predictions})
    feature_importances = pd.DataFrame({"feature": feature_names,
                                        "importance": feature_importance_values})

    # Overall scores
    valid_auc = roc_auc_score(labels, out_of_fold)
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))

    fold_names = list(range(n_folds))
    fold_names.append("overall")

    metrics = pd.DataFrame({"fold": fold_names, "train": train_scores,
                            "valid": valid_scores})

    return submission, feature_importances, metrics
submission, fi, metrics = model(train, test)
Training Data Shape: (307511, 1244)
Testing Data Shape: (48744, 1244)
[200] train's auc: 0.827847 train's binary_logloss: 0.517851 valid's auc: 0.776498 valid's binary_logloss: 0.53754
[200] train's auc: 0.826893 train's binary_logloss: 0.519028 valid's auc: 0.777111 valid's binary_logloss: 0.539778
[400] train's auc: 0.862668 train's binary_logloss: 0.479481 valid's auc: 0.778615 valid's binary_logloss: 0.514953
[200] train's auc: 0.82652 train's binary_logloss: 0.519854 valid's auc: 0.782785 valid's binary_logloss: 0.536581
[400] train's auc: 0.862059 train's binary_logloss: 0.480909 valid's auc: 0.783537 valid's binary_logloss: 0.512251
[200] train's auc: 0.827632 train's binary_logloss: 0.518235 valid's auc: 0.780399 valid's binary_logloss: 0.538406
[400] train's auc: 0.863706 train's binary_logloss: 0.478578 valid's auc: 0.780408 valid's binary_logloss: 0.513719
[200] train's auc: 0.828105 train's binary_logloss: 0.517592 valid's auc: 0.776329 valid's binary_logloss: 0.538545
[400] train's auc: 0.863528 train's binary_logloss: 0.478105 valid's auc: 0.776944 valid's binary_logloss: 0.513118
metrics
fold | train | valid | |
---|---|---|---|
0 | 0 | 0.838376 | 0.777163 |
1 | 1 | 0.864993 | 0.778851 |
2 | 2 | 0.853780 | 0.783790 |
3 | 3 | 0.847372 | 0.780880 |
4 | 4 | 0.860250 | 0.777134 |
5 | overall | 0.852954 | 0.779494 |
submission.to_csv('submission_manualp2.csv', index=False)
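The feature importances returned by model go unused here; a quick sketch for inspecting them (fi is the dataframe returned above):
# Top 15 features by importance, averaged over the folds
print(fi.sort_values("importance", ascending=False).head(15))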
Reference: https://www.kaggle.com/code/willkoehrsen/introduction-to-manual-feature-engineering-p2/notebook