pandasをインストールしたのでデータをいじってみる

2020年12月21日

pandasをインストールしたのでデータをいじってみる

自分用の備忘録ですがよろしくお願いします。

CSVの読み込み


import pandas as pd
transaction = pd.read_csv('transaction_1.csv')
transaction.head()

     transaction_id   price         payment_date customer_id
0       T0000000113  210000  2019-02-01 01:36:57    PL563502
1       T0000000114   50000  2019-02-01 01:37:23    HD678019
2       T0000000115  120000  2019-02-01 02:34:19    HD298120
3       T0000000116  210000  2019-02-01 02:47:23    IK452215
4       T0000000117  170000  2019-02-01 04:33:46    PL542865
...             ...     ...                  ...         ...

縦に結合(concat)


import pandas as pd
transaction_1 = pd.read_csv('transaction_1.csv')
transaction_2 = pd.read_csv('transaction_2.csv')
transaction = pd.concat([transaction_1, transaction_2], ignore_index=True)
transaction.head()

print(len(transaction_1))#5000(レコード数)
print(len(transaction_2))#1786(レコード数)
print(len(transaction))  #6786(レコード数)

     transaction_id   price         payment_date customer_id
0       T0000000113  210000  2019-02-01 01:36:57    PL563502
1       T0000000114   50000  2019-02-01 01:37:23    HD678019
2       T0000000115  120000  2019-02-01 02:34:19    HD298120
3       T0000000116  210000  2019-02-01 02:47:23    IK452215
4       T0000000117  170000  2019-02-01 04:33:46    PL542865
...             ...     ...                  ...         ...
6781    T0000006894  180000  2019-07-31 21:20:44    HI400734
6782    T0000006895   85000  2019-07-31 21:52:48    AS339451
6783    T0000006896  100000  2019-07-31 23:35:25    OA027325
6784    T0000006897   85000  2019-07-31 23:39:35    TS624738
6785    T0000006898   85000  2019-07-31 23:41:38    AS834214

横に結合(merge)


import pandas as pd
transaction_1 = pd.read_csv('transaction_1.csv')
transaction_2 = pd.read_csv('transaction_2.csv')
transaction_detail_1 = pd.read_csv('transaction_detail_1.csv')
transaction_detail_2 = pd.read_csv('transaction_detail_2.csv')
customer_master = pd.read_csv('customer_master.csv')
item_master     = pd.read_csv('item_master.csv')
transaction        = pd.concat([transaction_1, transaction_2], ignore_index=True)
transaction_detail = pd.concat([transaction_detail_1, transaction_detail_2], ignore_index=True)

join_data = pd.merge(transaction_detail, transaction[["transaction_id", "payment_date", "customer_id"]], on="transaction_id", how="left")
join_data = pd.merge(join_data, customer_master, on="customer_id", how="left")
join_data = pd.merge(join_data, item_master, on="item_id", how="left")
join_data.head()

      detail_id transaction_id item_id  quantity         payment_date customer_id customer_name  ...                           email gender age       birth  pref item_name item_price
0             0    T0000000113    S005         1  2019-02-01 01:36:57    PL563502         井本 芳正  ...     imoto_yoshimasa@example.com      M  30   1989/7/15   熊本県      PC-E     210000
1             1    T0000000114    S001         1  2019-02-01 01:37:23    HD678019         三船 六郎  ...      mifune_rokurou@example.com      M  73  1945/11/29   京都府      PC-A      50000
2             2    T0000000115    S003         1  2019-02-01 02:34:19    HD298120         山根 小雁  ...        yamane_kogan@example.com      M  42   1977/5/17   茨城県      PC-C     120000
3             3    T0000000116    S005         1  2019-02-01 02:47:23    IK452215         池田 菜摘  ...       ikeda_natsumi@example.com      F  47   1972/3/17   兵庫県      PC-E     210000
4             4    T0000000117    S002         2  2019-02-01 04:33:46    PL542865         栗田 憲一  ...      kurita_kenichi@example.com      M  74  1944/12/17   長崎県      PC-B      85000
...         ...            ...     ...       ...                  ...         ...           ...  ...                             ...    ...  ..         ...   ...       ...        ...
7139       7139    T0000006894    S004         1  2019-07-31 21:20:44    HI400734          宍戸 明  ...      shishido_akira@example.com      M  64   1955/1/13   福井県      PC-D     180000
7140       7140    T0000006895    S002         1  2019-07-31 21:52:48    AS339451         相原 みき  ...         aihara_miki@example.com      F  74    1945/2/3   北海道      PC-B      85000
7141       7141    T0000006896    S001         2  2019-07-31 23:35:25    OA027325         松田 早紀  ...        matsuda_saki@example.com      F  40   1979/5/25   福島県      PC-A      50000
7142       7142    T0000006897    S002         1  2019-07-31 23:39:35    TS624738         進藤 正敏  ...  shinndou_masatoshi@example.com      M  56   1963/2/21   東京都      PC-B      85000
7143       7143    T0000006898    S002         1  2019-07-31 23:41:38    AS834214         田原 結子  ...        tahara_yuuko@example.com      F  74  1944/12/18   愛知県      PC-B      85000

データ列の作成


join_data["price"] = join_data["quantity"] * join_data["item_price"]
join_data[["quantity", "item_price", "price"]].head()
join_data.head()

      quantity  item_price   price
0            1      210000  210000
1            1       50000   50000
2            1      120000  120000
3            1      210000  210000
4            2       85000  170000
...        ...         ...     ...
7139         1      180000  180000
7140         1       85000   85000
7141         2       50000  100000
7142         1       85000   85000
7143         1       85000   85000

検算


join_data["price"].sum() == transaction["price"].sum()

True

欠損の値の検出と全体の数字感の確認


join_data.isnull().sum()

detail_id             0
transaction_id        0
item_id               0
quantity              0
payment_date          0
customer_id           0
customer_name         0
registration_date     0
customer_name_kana    0
email                 0
gender                0
age                   0
birth                 0
pref                  0
item_name             0
item_price            0
price                 0

join_data.describe()

         detail_id     quantity          age     item_price          price
count  7144.000000  7144.000000  7144.000000    7144.000000    7144.000000
mean   3571.500000     1.199888    50.265677  121698.628219  135937.150056
std    2062.439494     0.513647    17.190314   64571.311830   68511.453297
min       0.000000     1.000000    20.000000   50000.000000   50000.000000
25%    1785.750000     1.000000    36.000000   50000.000000   85000.000000
50%    3571.500000     1.000000    50.000000  102500.000000  120000.000000
75%    5357.250000     1.000000    65.000000  187500.000000  210000.000000
max    7143.000000     4.000000    80.000000  210000.000000  420000.000000

describe()の項目一覧

count データ件数
mean 平均値
std 標準偏差
min 最小値
25%, 75% 四分位数(しぶんいすう)
50% 中央値
max 最大値

データ型の確認


join_data.dtypes

日付の型とフォーマット変換


#object型→datetime型に変換
join_data["payment_date"] = pd.to_datetime(join_data["payment_date"])
#年月表示に変換
join_data["payment_month"] = join_data["payment_date"].dt.strftime("%Y%m")

集計(groupby)


#groupby
join_data.groupby("payment_month").sum()["price"]

payment_month
201902    160185000
201903    160370000
201904    160510000
201905    155420000
201906    164030000
201907    170620000

#商品別に売上、数量を表示
join_data.groupby(["payment_month", "item_name"]).sum()[["price", "quantity"]]

                            price  quantity
payment_month item_name                    
201902        PC-A       24150000       483
              PC-B       25245000       297
              PC-C       19800000       165
              PC-D       31140000       173
              PC-E       59850000       285
201903        PC-A       26000000       520
              PC-B       25500000       300
              PC-C       19080000       159
              PC-D       25740000       143
              PC-E       64050000       305
201904        PC-A       25900000       518
              PC-B       23460000       276
              PC-C       21960000       183
              PC-D       24300000       135
              PC-E       64890000       309
              ...        ...            ...

ピボットテーブル


pd.pivot_table(join_data, index='item_name', columns='payment_month', values=['price', 'quantity'], aggfunc='sum')

                  price                                                   quantity                                   
payment_month    201902    201903    201904    201905    201906    201907   201902 201903 201904 201905 201906 201907
item_name                                                                                                            
PC-A           24150000  26000000  25900000  24850000  26000000  25250000      483    520    518    497    520    505
PC-B           25245000  25500000  23460000  25330000  23970000  28220000      297    300    276    298    282    332
PC-C           19800000  19080000  21960000  20520000  21840000  19440000      165    159    183    171    182    162
PC-D           31140000  25740000  24300000  25920000  28800000  26100000      173    143    135    144    160    145
PC-E           59850000  64050000  64890000  58800000  63420000  71610000      285    305    309    280    302    341

グラフ化


graph_data = pd.pivot_table(join_data, index='payment_month', columnst='item_name', values='price', aggfunc=''sum)
graph_data.head()
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(list(graph_data.index), graph_data["PC-A"], label='PC-A')
plt.plot(list(graph_data.index), graph_data["PC-B"], label='PC-B')
plt.plot(list(graph_data.index), graph_data["PC-C"], label='PC-C')
plt.plot(list(graph_data.index), graph_data["PC-D"], label='PC-D')
plt.plot(list(graph_data.index), graph_data["PC-E"], label='PC-E')
plt.legend()

2020年12月21日