How does one-hot encoding work?
You are simply creating one vector per transaction. Each distinct item becomes a feature (a column), and the vector holds the count of that item in the transaction.
For example, given these transactions:
1) Fruits, Clothes
2) Vegetables
the encoded table is:
Transaction  Fruits  Clothes  Vegetables
1)           1       1        0
2)           0       0        1
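To make the idea concrete, here is a minimal sketch in plain Python that builds those vectors by hand. The `transactions` dictionary and the `encode` helper are made up for illustration, and the second transaction is assumed to contain only Vegetables, as row 2 of the table implies.

```python
from collections import Counter

# Hypothetical toy data: transaction id -> items in that transaction
transactions = {1: ["Fruits", "Clothes"], 2: ["Vegetables"]}

# The feature columns, in a fixed order
items = ["Fruits", "Clothes", "Vegetables"]

def encode(basket, items):
    """Return one count per feature column for a single transaction."""
    counts = Counter(basket)  # items missing from the basket count as 0
    return [counts[item] for item in items]

encoded = {tid: encode(basket, items) for tid, basket in transactions.items()}

for tid, vector in encoded.items():
    print(tid, vector)
```

Each printed vector corresponds to one row of the table above.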
In the following line:
hot_encoded_bakery = bakery.groupby(['Transaction','Item'])['Item'].count().unstack().reset_index().fillna(0).set_index('Transaction')
You are performing chained operations as follows:
- First you group by Transaction and Item, and count the "Item" values in each group.
- Then you unstack the result, which moves the Item values to axis=1, i.e. each item becomes a column (feature).
- Then you reset the index, which turns Transaction from the index into a regular column.
- Then you replace all the NaN cells with zeros.
- Finally you set Transaction as the index again.
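The steps above can be seen one at a time by splitting the chain into intermediate variables. This is only a sketch: the small `bakery` frame below is a hypothetical stand-in for your data and assumes just the two columns used in your line, `Transaction` and `Item`.

```python
import pandas as pd

# Hypothetical stand-in for the bakery data (assumed columns: Transaction, Item)
bakery = pd.DataFrame({
    "Transaction": [1, 1, 2, 3, 3],
    "Item": ["Bread", "Coffee", "Tea", "Bread", "Bread"],
})

# Step 1: count how often each item appears in each transaction
counts = bakery.groupby(["Transaction", "Item"])["Item"].count()

# Step 2: move the Item level of the index into the columns
wide = counts.unstack()

# Step 3: turn Transaction from the index into a regular column
flat = wide.reset_index()

# Step 4: replace NaN (item not in that transaction) with 0
filled = flat.fillna(0)

# Step 5: make Transaction the index again
hot_encoded_bakery = filled.set_index("Transaction")

print(hot_encoded_bakery)
```

Printing `counts`, `wide`, `flat` and `filled` in turn shows what each step changes. Note that the cells hold counts, so in this toy data Bread in transaction 3 comes out as 2 rather than 1.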
Resetting the index and then setting Transaction as the index again cancel each other out, so you could simply have written:
hot_encoded_bakery = bakery.groupby(['Transaction','Item'])['Item'].count().unstack().fillna(0)
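If you want to convince yourself that the two versions give the same result, a quick check along these lines should do. Again, the `bakery` frame here is hypothetical toy data, not your actual file.

```python
import pandas as pd

# Hypothetical stand-in for the bakery data (assumed columns: Transaction, Item)
bakery = pd.DataFrame({
    "Transaction": [1, 1, 2, 3, 3],
    "Item": ["Bread", "Coffee", "Tea", "Bread", "Bread"],
})

# Original chain, with the reset_index / set_index round trip
long_form = (
    bakery.groupby(["Transaction", "Item"])["Item"]
    .count()
    .unstack()
    .reset_index()
    .fillna(0)
    .set_index("Transaction")
)

# Shortened chain without the round trip
short_form = (
    bakery.groupby(["Transaction", "Item"])["Item"]
    .count()
    .unstack()
    .fillna(0)
)

# DataFrame.equals compares shape, dtypes and values
same = long_form.equals(short_form)
print(same)
```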