kaggle.json
Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.9 / client 1.5.4)
song_extra_info.csv.7z: Skipping, found more recently modified local copy (use --force to force download)
train.csv.7z: Skipping, found more recently modified local copy (use --force to force download)
test.csv.7z: Skipping, found more recently modified local copy (use --force to force download)
songs.csv.7z: Skipping, found more recently modified local copy (use --force to force download)
members.csv.7z: Skipping, found more recently modified local copy (use --force to force download)
sample_submission.csv.7z: Skipping, found more recently modified local copy (use --force to force download)
!mkdir kaggle/working
!mkdir kaggle/working/train
!mkdir kaggle/working/train/data
!apt-get install p7zip
!apt-get install p7zip-full
!7za e members.csv.7z
!7za e songs.csv.7z
!7za e song_extra_info.csv.7z
!7za e train.csv.7z
!7za e sample_submission.csv.7z
!7za e test.csv.7z
!mv *.csv kaggle/working/train/data
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os

for dirname, _, filenames in os.walk('./kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
In this project, we build a recommendation system to predict the chance of a user listening to a song repetitively after the first observable listening event within a time window. If one or more recurring listening events are triggered within a month after the user's very first observable listening event, the target is marked 1 in the training set, and 0 otherwise. The same rule applies to the testing set.
0. Data Collection and Description
The KKBox dataset is composed of the following files:
train.csv
msno: user id
song_id: song id
source_system_tab: the name of the tab where the event was triggered. System tabs are used to categorize KKBOX mobile app functions. For example, the tab my library contains functions to manipulate local storage, and the tab search contains functions related to search.
source_screen_name: name of the layout a user sees.
source_type: the entry point from which a user first plays music on mobile apps. An entry point could be album, online-playlist, song, etc.
target: the target variable. target=1 means there are recurring listening event(s) triggered within a month after the user's very first observable listening event; target=0 otherwise.
test.csv
id: row id (will be used for submission)
msno: user id
song_id: song id
source_system_tab: the name of the tab where the event was triggered. System tabs are used to categorize KKBOX mobile app functions. For example, the tab my library contains functions to manipulate local storage, and the tab search contains functions related to search.
source_screen_name: name of the layout a user sees.
source_type: the entry point from which a user first plays music on mobile apps. An entry point could be album, online-playlist, song, etc.
sample_submission.csv
sample submission file in the format that we expect you to submit
id: same as id in test.csv
target: the target variable. target=1 means there are recurring listening event(s) triggered within a month after the user's very first observable listening event; target=0 otherwise.
songs.csv The songs. Note that the data is in Unicode.
song_id
song_length: in ms
genre_ids: genre category. Some songs have multiple genres and they are separated by |
artist_name
composer
lyricist
language
members.csv user information.
msno
city
bd: age. Note: this column has outlier values, please use your judgement.
gender
registered_via: registration method
registration_init_time: format %Y%m%d
expiration_date: format %Y%m%d
song_extra_info.csv
song_id
song name: the name of the song.
isrc: International Standard Recording Code, which can in theory be used as the identity of a song. Note, however, that ISRCs generated by providers have not been officially verified, so the information encoded in an ISRC, such as country code and reference year, can be misleading or incorrect. Multiple songs can also share one ISRC, since a single recording may be re-published several times.
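Because the ISRC embeds a country code (first two characters), a registrant code, and a two-digit reference year (characters six and seven), it is often mined for extra features, with the caveat above that these values are unverified. A minimal sketch, assuming pandas and the song_extra_info.csv columns; the parse_isrc helper and the year heuristic are ours, not part of the dataset:

import pandas as pd

def parse_isrc(isrc):
    # ISRC layout: CC-XXX-YY-NNNNN (country, registrant, year, designation code)
    if pd.isnull(isrc):
        return pd.Series([None, None])
    country = isrc[:2]
    year = int(isrc[5:7])
    # two-digit year heuristic: values below 18 are treated as 20xx, the rest as 19xx
    year = 2000 + year if year < 18 else 1900 + year
    return pd.Series([country, year])

# hypothetical usage, assuming the files were extracted as above:
# song_extra_df = pd.read_csv('kaggle/working/train/data/song_extra_info.csv')
# song_extra_df[['isrc_country', 'isrc_year']] = song_extra_df['isrc'].apply(parse_isrc)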
1. Data Cleaning and Exploratory Data Analysis (EDA)
Find the description and summary of each CSV file and determine null objects, categorical attributes, and numerical attributes (see the sketch after this list)
Convert attributes to the correct data type, e.g. string to float, where necessary
Handle missing values
Plot univariate and bivariate plots to visualize and analyze the relationships between attributes and the target
Summarize the analysis at the end of this section
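As a concrete starting point for the steps above, a minimal sketch that prints the shape, the dtypes (separating categorical from numerical attributes) and the per-column null counts of each CSV; the file paths assume the extraction commands at the top of the notebook:

import pandas as pd

data_dir = 'kaggle/working/train/data/'
for name in ['train', 'test', 'songs', 'members', 'song_extra_info']:
    df = pd.read_csv(data_dir + name + '.csv')
    print(name, df.shape)
    print(df.dtypes)          # object columns are the categorical attributes
    print(df.isnull().sum())  # null counts per column
    print('-' * 40)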
2. Data Preprocessing
Note that this section gives some examples of preprocessing the data, such as filling missing values and removing outliers.
To train the models, you should start from step 3 (ETL), which extracts and transforms the data directly using the integrated functions.
3. Data Pipeline: Extract, Transformation, Load (ETL)
4. Machine Learning Modeling
LGBM Boosting Machine. In the modeling part, we first use the LGBM model, a light gradient boosting machine that uses tree-based base models for boosting. Since the dataset is large (1.9 GB) and the number of attributes can increase during transformation, LGBM provides a very fast way to train a model, so we try it here.
In this part, we try the LGBM model with different tree max_depth values [10, 15, 20, 25, 30] and see how max_depth affects prediction accuracy; a sketch of this loop follows below.
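A hedged sketch of that loop: the feature matrices X_train, X_valid and labels y_train, y_valid are assumed to come from the ETL step, and all hyperparameters other than max_depth are illustrative rather than the exact settings behind the reported scores.

import lightgbm as lgb
from sklearn.metrics import accuracy_score

max_depths = [10, 15, 20, 25, 30]
accuracies = []
for depth in max_depths:
    params = {
        'objective': 'binary',   # binary target: repeat listen or not
        'learning_rate': 0.1,
        'num_leaves': 128,
        'max_depth': depth,
    }
    model = lgb.train(params, lgb.Dataset(X_train, y_train), num_boost_round=200)
    preds = model.predict(X_valid)                      # predicted probabilities
    accuracies.append(accuracy_score(y_valid, preds > 0.5))
print(accuracies)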
Wide and Deep Neural Network model. In addition to the LGBM model, we are also interested in trying the Wide and Deep neural network model, since it is one of the popular neural network models in recommendation systems and we want to see whether it can help improve accuracy. The wide and deep model first uses a technique called embedding, which projects the sparse categorical features into dense feature vectors of smaller dimension and extracts the main features. It then concatenates the embedded vectors with the numerical features to train a traditional neural network classifier, as illustrated in the toy example below.
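A toy illustration of the embed-then-concatenate idea (the full model used in this project is defined in section 4 below; all dimensions here are made up):

import torch
from torch import nn

# one categorical column with 1000 levels embedded into 8 dimensions, plus 5 numerical features
emb = nn.Embedding(num_embeddings=1000, embedding_dim=8)
fc = nn.Sequential(nn.Linear(8 + 5, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

cat_batch = torch.randint(0, 1000, (4,))            # 4 samples, label-encoded categorical feature
num_batch = torch.randn(4, 5)                       # 4 samples, 5 numerical features
x = torch.cat([emb(cat_batch), num_batch], dim=1)   # concatenate dense embeddings with numerical data
prob = fc(x)                                        # probability of a repeat listen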
5. Model Training and validation
In the model training and validation step, we split the dataset into a training set (80% of the data) and a validation set (20% of the data), and then use them to train the models and track their performance.
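The split can be done with scikit-learn; a sketch assuming StratifiedShuffleSplit, which matches the ss_split object used in the training code later, with y_train holding the binary targets:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# 80% training / 20% validation, stratified on the target
ss_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, valid_index in ss_split.split(np.zeros(y_train.shape), y_train):
    print(train_index.shape, valid_index.shape)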
6. Model Evaluation
In the model evaluation step, we use the validation set to validate the final trained models, then let the models make predictions on the Kaggle test set and submit the predictions to Kaggle to obtain the final evaluation scores.
7. Summary
# import necessary packages here
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb

from subprocess import check_output
# print(check_output(["ls", "../input"]).decode("utf8"))
print("Unique Song amount in trainset:",train_df['song_id'].nunique()) print("Unique Song amount in testset:", test_df['song_id'].nunique()) print("Unique Song amount in song list:",song_df['song_id'].nunique())
Unique Song amount in trainset: 359966
Unique Song amount in testset: 224753
Unique Song amount in song list: 2296320
1.2 Explore Song information
# Merge the two dataframes on song_id so that we can analyze the song information together with the training data
# select the top-20 most frequent artist_name values
df_artist = user_music_df["artist_name"].value_counts().sort_values(ascending=False)[:20]
# plot in descending order, horizontally
df_artist = df_artist.sort_values(ascending=True)
ax = df_artist.plot.barh(figsize=(15, 10))
ax.set_ylabel("song artist_name")
ax.set_xlabel("Count")
sns.countplot(y='source_system_tab', hue='target', order=user_music_df['source_system_tab'].value_counts().index, data=user_music_df, dodge=True, ax=ax[0])
sns.countplot(y='source_screen_name', hue='target', order=user_music_df['source_screen_name'].value_counts().index, data=user_music_df, dodge=True, ax=ax[1])
sns.countplot(y='source_type', hue='target', order=user_music_df['source_type'].value_counts().index, data=user_music_df, dodge=True, ax=ax[2])
<matplotlib.axes._subplots.AxesSubplot at 0x7f269a90e128>
We can see that the local library and local playlists are the main sources from which users repeat playing music, and that most users prefer playing music from their local library to playing music online.
Analyze Relationship between Target and members info
# after merging, the key columns used to merge become object type, so convert them back to category type
member_music_df["msno"] = member_music_df["msno"].astype("category")
member_music_df["song_id"] = member_music_df["song_id"].astype("category")
member_music_df.head()
# dataframe.corr() is used to find the pairwise correlation of all columns in the dataframe.
# Any NA values are automatically excluded; non-numeric columns are ignored.
corr_matrix = member_music_df.corr()
_ = sns.heatmap(corr_matrix)
# For each attribute below, print the three attributes with the highest correlation
# with it (excluding itself) and the corresponding correlation coefficients.
for col in ['target', 'song_length', 'language', 'city', 'bd', 'registered_via',
            'registration_init_day', 'registration_init_month', 'registration_init_year',
            'expiration_day', 'expiration_month', 'expiration_year']:
    corr = corr_matrix[col].sort_values(ascending=False)
    for x in corr.index[1:4].to_list():
        print("{} {}".format(x, corr[x]))
    print("")
repeats = train_df[train_df.target == 1]
song_repeats = repeats.groupby('song_id', as_index=False).msno.count()
song_repeats.columns = ['song_id', 'count']
# merge the two dataframes to create a new dataframe
song_repeats = pd.DataFrame(song_repeats).merge(song_df, left_on='song_id', right_on='song_id')
print("Print top 50 songs repeated")
repeats.song_id.value_counts().head(50)
Note: this section shows how to preprocess the data. We can also skip this step and start directly from Step 3, where the data is extracted, transformed and loaded using the integrated transformation function.
2.1 Filling missing values
missing_value_cols = [c for c in member_music_df.columns if member_music_df[c].isnull().any()]
missing_value_cols
# list of columns with missing values
# ['source_system_tab',
#  'source_screen_name',
#  'source_type',
#  'song_length',
#  'genre_ids',
#  'artist_name',
#  'composer',
#  'lyricist',
#  'language',
#  'bd',
#  'gender']
def fill_missing_value_v1(x):
    # fill missing values with the most frequent value
    return x.fillna(x.value_counts().sort_values(ascending=False).index[0])
numerical_ls = ['song_length', 'language', 'bd']

# Fill missing values
# (categorical_ls holds the categorical columns with missing values, defined earlier in the notebook)
for index in numerical_ls:
    member_music_df[index].fillna(member_music_df[index].median(), inplace=True)
for index in categorical_ls:
    member_music_df[index].fillna("no_data", inplace=True)
We can see that columns like genre_ids, composer and lyricist can have multiple values in one cell. In this case, the counts of genres, composers and lyricists could be useful information as well.
We can skip Step 2 if we just want to transform data directly
3.1 Transformation Function for Data cleaning
# import necessary packages here
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb

from subprocess import check_output
# print(check_output(["ls", "../input"]).decode("utf8"))
# Integrated data-cleaning/transformation function for the ETL step
# (the enclosing function name is assumed here; categorical_ls is the list of
# categorical columns defined earlier in the notebook)
def transform_data(data):
    numerical_ls = ['song_length', 'language', 'bd']

    # Fill missing values
    for index in numerical_ls:
        data[index].fillna(data[index].median(), inplace=True)
    for index in categorical_ls:
        data[index].fillna("no_data", inplace=True)

    def count_items(x):
        # count the number of items separated by |, /, \, ; or ,
        if x == "no_data":
            return 0
        return sum(map(x.count, ['|', '/', '\\', ';', ','])) + 1

    data['genre_count'] = data['genre_ids'].apply(count_items)
    data['composer_count'] = data['composer'].apply(count_items)
    data['lyricist_count'] = data['lyricist'].apply(count_items)

    # Convert object type to categorical type
    for c in data.columns:
        if data[c].dtype == 'O':
            data[c] = data[c].astype("category", copy=False)

    if 'id' in data.columns:
        ids = data['id']
        data.drop(['id'], inplace=True, axis=1)
    else:
        ids = None
    return ids, data
3.2 Transform the composer, artist, lyricist to counts as new features
Transform the names of composer, artist and lyricist into new features such as counts and the number of intersecting names.
def transform_names_intersection(data):
    # This function finds the intersection of names in composer, artist_name and lyricist
    def check_name_list(x):
        # convert each string to a list of names
        return x.str.split(r"//|/|;|、|\| ")

    df = data[["composer", "artist_name", "lyricist"]].apply(check_name_list)
    data["composer_artist_intersect"] = [len(set(a) & set(b)) for a, b in zip(df.composer, df.artist_name)]
    data["composer_lyricist_intersect"] = [len(set(a) & set(b)) for a, b in zip(df.composer, df.lyricist)]
    data["artist_lyricist_intersect"] = [len(set(a) & set(b)) for a, b in zip(df.artist_name, df.lyricist)]
    return data

_ = transform_names_intersection(X_train)
X_train.head()
# Split training set and testing set
train_index, valid_index, test_index = None, None, None
for train_i, test_i in ss_split.split(np.zeros(y_train.shape), y_train):
    train_index = train_i
    test_index = test_i
print(train_index.shape, test_index.shape)
eval_df = pd.DataFrame({"Lgbm with max_depth": max_depths,
                        "Validation Accuracy": [acc_1, acc_2, acc_3, acc_4, acc_5]})
eval_df
   Lgbm with max_depth  Validation Accuracy
0                   10             0.709764
1                   15             0.719106
2                   20             0.723689
3                   25             0.725822
4                   30             0.728842
Create Submission Files
models = [model_f1, model_f2, model_f3, model_f4, model_f5]
for i in range(len(models)):
    preds_test = models[i].predict(test_data)
    submission = pd.DataFrame()
    submission['id'] = ids
    submission['target'] = preds_test
    submission.to_csv(root + 'submission_lgbm_model_' + str(i) + '.csv.gz',
                      compression='gzip', index=False, float_format='%.5f')
    print("Predictions from model ", i, ": ", preds_test)
Predictions from model 0 : [0.47177512 0.48584262 0.19651648 ... 0.39917036 0.30263348 0.36468783]
Predictions from model 1 : [0.45280296 0.55415074 0.17824637 ... 0.41500494 0.30757934 0.34520384]
Predictions from model 2 : [0.39847416 0.48724786 0.15954141 ... 0.38293317 0.27657349 0.28451098]
Predictions from model 3 : [0.3825275 0.39659855 0.15904321 ... 0.3515784 0.21812496 0.28995803]
Predictions from model 4 : [0.3951268 0.45704878 0.14609333 ... 0.35033303 0.23065677 0.2885925 ]
Scores from the Kaggle test set

Model name                     private score  public score
LGBM Boosting Machine Model 4  0.67423        0.67256
LGBM Boosting Machine Model 3  0.67435        0.67241
LGBM Boosting Machine Model 2  0.67416        0.67208
LGBM Boosting Machine Model 1  0.67416        0.67188
LGBM Boosting Machine Model 0  0.67206        0.66940
4. Wide & Depth neural network model
Wide and Deep model (two branches → merge the two branches → main branch)
This model converts categorical attributes into dense vectors using an embedding network, which enables us to reduce the dimensionality of the categorical data and extract the main features, similar to PCA.
It then combines the dense embedded vectors with the numerical data for feature selection and classification in the main branch.
The output is the probability that the user will repeat listening to the song.
Label Encoding for categorical data
Convert categorical data into numerical labels before using embedding
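A minimal sketch of this step, assuming the categorical columns were already cast to the pandas category dtype by the ETL function above; the loop simply replaces each category with its integer code (-1 marks missing values):

# replace each categorical level with its integer code before feeding it to the embedding layers
for col in X_train.select_dtypes(include='category').columns:
    X_train[col] = X_train[col].cat.codes

In practice the categories should be aligned between the training and test features (for example, by fitting the encoding on their union) so that the same label always maps to the same integer.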
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader


class TabularDataset(Dataset):
    def __init__(self, x_data, y_data, cat_cols1, cat_cols2, num_cols):
        """
        x_data: pandas data frame;
        y_data: the target;
        cat_cols1, cat_cols2: lists of strings, the names of the categorical columns
            in the data that will be passed through the embedding layers;
        num_cols: list of strings, the names of the numerical columns.
        """
        self.n = x_data.shape[0]
        self.y = y_data.astype(np.float32).reshape(-1, 1)  # .values.reshape(-1, 1)
        self.cat_cols1 = cat_cols1
        self.cat_cols2 = cat_cols2
        self.num_cols = num_cols
        self.num_X = x_data[self.num_cols].astype(np.float32).values
        self.cat_X1 = x_data[self.cat_cols1].astype(np.int64).values
        self.cat_X2 = x_data[self.cat_cols2].astype(np.int64).values

    def print_data(self):
        return self.num_X, self.cat_X1, self.cat_X2, self.y

    def __len__(self):
        """ Total number of samples. """
        return self.n

    def __getitem__(self, idx):
        """ Generates one sample of data. """
        return [self.y[idx], self.num_X[idx], self.cat_X1[idx], self.cat_X2[idx]]
class FeedForwardNN(nn.Module):
    def __init__(self, emb_dims1, emb_dims2, no_of_num, lin_layer_sizes,
                 output_size, emb_dropout, lin_layer_dropouts, branch2_enable=0):
        """
        emb_dims1, emb_dims2: lists of two-element tuples (vocabulary size, embedding size);
        no_of_num: integer, the number of continuous features in the data;
        lin_layer_sizes: list of integers, the size of each linear layer;
        output_size: integer, the size of the final output;
        emb_dropout: float, the dropout to be used after the embedding layers;
        lin_layer_dropouts: list of floats, the dropouts to be used after each linear layer.
        """
        super().__init__()
        self.branch2_enable = branch2_enable

        # embedding layers
        self.emb_layers1 = nn.ModuleList([nn.Embedding(x, y) for x, y in emb_dims1])
        self.emb_layers2 = nn.ModuleList([nn.Embedding(x, y) for x, y in emb_dims2])

        # total embedding sizes, used as the input sizes of the following linear layers
        self.no_of_embs1 = sum([y for x, y in emb_dims1])
        self.no_of_embs2 = sum([y for x, y in emb_dims2])
        self.no_of_num = no_of_num

        # branch 1
        self.branch1 = nn.Linear(self.no_of_embs1, lin_layer_sizes[0])
        self.branch1_2 = nn.Linear(lin_layer_sizes[0], lin_layer_sizes[1])
        nn.init.kaiming_normal_(self.branch1.weight.data)
        nn.init.kaiming_normal_(self.branch1_2.weight.data)

        # branch 2
        if branch2_enable:
            self.branch2 = nn.Linear(self.no_of_embs2, lin_layer_sizes[0] * 2)
            self.branch2_2 = nn.Linear(lin_layer_sizes[0] * 2, lin_layer_sizes[1] * 2)
            nn.init.kaiming_normal_(self.branch2.weight.data)
            nn.init.kaiming_normal_(self.branch2_2.weight.data)

        # main branch
        # self.main_layer1 = nn.Linear(lin_layer_sizes[1] * 3 + self.no_of_num, lin_layer_sizes[2])
        self.main_layer1 = nn.Linear(77, lin_layer_sizes[2])
        self.main_layer2 = nn.Linear(lin_layer_sizes[2], lin_layer_sizes[3])

        # batch normalization
        self.branch_bn_layers1 = nn.BatchNorm1d(lin_layer_sizes[0])
        self.branch_bn_layers2 = nn.BatchNorm1d(lin_layer_sizes[0] * 2)
        self.main_bn_layer = nn.BatchNorm1d(lin_layer_sizes[2])

        # dropout layers
        self.emb_dropout_layer = nn.Dropout(emb_dropout)
        self.dropout_layers = nn.ModuleList([nn.Dropout(size) for size in lin_layer_dropouts])

        # output layer
        self.output_layer = nn.Linear(lin_layer_sizes[-1], output_size)
        nn.init.kaiming_normal_(self.output_layer.weight.data)
        self.sigmoid = nn.Sigmoid()

    def forward(self, num_data, cat_data1, cat_data2):
        # embed the categorical features and concatenate them together
        x1 = [emb_layer(torch.tensor(cat_data1[:, i])) for i, emb_layer in enumerate(self.emb_layers1)]
        x1 = torch.cat(x1, 1)
        x1 = self.emb_dropout_layer(F.relu(self.branch1(x1)))
        x1 = self.branch_bn_layers1(x1)
        x1 = self.dropout_layers[0](F.relu(self.branch1_2(x1)))

        if self.branch2_enable:
            x2 = [emb_layer(torch.tensor(cat_data2[:, i])) for i, emb_layer in enumerate(self.emb_layers2)]
            x2 = torch.cat(x2, 1)