実践データ分析100本ノック (Practical Data Analysis: 100 Knocks), Chapter 4: Predicting Customer Behavior, Knocks 38-40

October 8, 2020 / December 21, 2020

Code


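#--Knock 38------------------------------
# Linear regression model
# %%
# Restrict to customers who joined after April 1, 2018
# (the strict ">" excludes April 1 itself; predict_data and pd come from the preceding knocks)
predict_data = predict_data.loc[predict_data["start_date"]>pd.to_datetime("20180401")]
# Import the libraries
from sklearn import linear_model
import sklearn.model_selection
# Instantiate the linear regression model
model = linear_model.LinearRegression()
# Define the explanatory variables (the features used for prediction)
X = predict_data[["count_0", "count_1", "count_2", "count_3", "count_4", "count_5", "period"]]
# Define the target variable (the value to predict)
y = predict_data["count_pred"]
# Split into training features, test features, training target, and test target
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y)
# Train the model
model.fit(X_train, y_train)
# Show the results: R^2 on the training data, then on the test data
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))
# 0.614770992255967
# 0.5863569706845245
# Note: no random_state is passed to train_test_split, so these values change from run to run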
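
The scores printed in Knock 38 are coefficients of determination (R^2), and they differ between runs because the split is random. The following is a minimal sketch, not the book's code: it uses synthetic data in place of predict_data (the names X_demo, y_demo, true_coef, and demo_model are made up for illustration), fixes random_state so the split is repeatable, and recomputes R^2 by hand to show what score() returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for predict_data: 200 rows, 7 features (hypothetical values).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(200, 7)).astype(float)
true_coef = np.array([0.4, 0.2, 0.15, 0.15, 0.1, 0.05, 0.05])
y_demo = X_demo @ true_coef + rng.normal(0.0, 1.0, size=200)

# Fixing random_state makes the split, and therefore the scores, repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
demo_model = LinearRegression().fit(X_tr, y_tr)

# R^2 by hand: 1 - (residual sum of squares / total sum of squares).
residual = y_te - demo_model.predict(X_te)
ss_res = np.sum(residual ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Both numbers should agree up to floating-point error.
print(r2_manual, demo_model.score(X_te, y_te))
```

In the original script, the equivalent change would be passing random_state to sklearn.model_selection.train_test_split(X, y); the specific seed value is an arbitrary choice.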

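#--Knock 39------------------------------
# Check the coefficients that contribute to the model
# Note: the fitted coefficients are available via coef_
# %%
coef = pd.DataFrame({"feature_names":X.columns, "coefficient":model.coef_})
print(coef)
#   feature_names  coefficient
# 0       count_0     0.360580
# 1       count_1     0.177157
# 2       count_2     0.152378
# 3       count_3     0.181042
# 4       count_4     0.087210
# 5       count_5     0.064458
# 6        period     0.046358
# Note: count_0 (the most recent month) has the largest coefficient, and the weights
# generally shrink for older months, though not strictly (count_3 > count_1 > count_2)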
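
To make the role of the coefficients in Knock 39 concrete: a linear regression prediction is the intercept plus the coefficient-weighted sum of the inputs. The sketch below checks this on a tiny noise-free toy dataset (X_toy, y_toy, toy_model, and x_new are hypothetical names and values, unrelated to the gym data). Keep in mind that raw coefficients depend on each feature's scale, so comparing a count feature with period is only a rough guide to importance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated exactly from y = 1 + 2*a + 3*b (no noise).
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_toy = 1.0 + 2.0 * X_toy[:, 0] + 3.0 * X_toy[:, 1]

toy_model = LinearRegression().fit(X_toy, y_toy)

# A prediction is the intercept plus the dot product of coefficients and inputs.
x_new = np.array([3.0, 4.0])
by_hand = toy_model.intercept_ + np.dot(toy_model.coef_, x_new)
by_model = toy_model.predict(x_new.reshape(1, -1))[0]

# The two values should match up to floating-point error.
print(by_hand, by_model)
```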

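#--Knock 40------------------------------
# Predict next month's number of visits
# %%
# Each row follows the column order of X: count_0 ... count_5, then period
x1 = [3, 4, 4, 6, 8, 7, 8]
x2 = [2, 2, 3, 3, 4, 6, 8]
x_pred = [x1, x2]

model.predict(x_pred)
# array([3.84308464, 1.99249379])

# Save the monthly usage log (uselog_months, built in an earlier knock) to CSV
uselog_months.to_csv("./samples/5/use_log_months.csv", index=False)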
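
In Knock 40 the model is fitted on a DataFrame but predicts from plain lists. As far as I know, scikit-learn 1.0 and later emit a UserWarning about missing feature names in that situation. One way to avoid it is to wrap the inputs in a DataFrame with the same column names, as in the sketch below. The training data here is random and purely illustrative (train_df, target, and sketch_model are hypothetical names), so the printed predictions will not match the values in the post.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

feature_cols = ["count_0", "count_1", "count_2", "count_3", "count_4", "count_5", "period"]

# Hypothetical training frame standing in for predict_data (random values).
rng = np.random.default_rng(0)
train_df = pd.DataFrame(rng.integers(0, 10, size=(50, 7)), columns=feature_cols)
target = train_df["count_0"] * 0.5 + rng.normal(0.0, 1.0, size=50)

sketch_model = LinearRegression().fit(train_df, target)

# The same two customers as in Knock 40, wrapped in a DataFrame with named columns.
x1 = [3, 4, 4, 6, 8, 7, 8]
x2 = [2, 2, 3, 3, 4, 6, 8]
x_pred_df = pd.DataFrame([x1, x2], columns=feature_cols)

# One prediction per row of x_pred_df.
predictions = sketch_model.predict(x_pred_df)
print(predictions)
```

Naming the columns also guards against passing the values in the wrong order, since each number is tied to an explicit feature name.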