Implementing libFM in Keras
JeanFrancoisPuget 2700028FGP Visits (8904)
I just won a gold medal on Talk
I'll explain it here. Before that, let me briefly explain what this competition was about. Here is how the problem is described on Kaggle site:
Each row of the training data contains a click record, with the following features.
In that dataset, a combination of ip, device and os represents a user in most cases. In some cases ip are shared among many users, which adds some complexity. We'll ignore that for the sake of simplicity. We can also ignore channel here, it appeared to not being very useful. We are therefore given about 200 millions rows, each row containing a user, an app, and a click time. Goal is to predict when users download the app.
This is reminiscent of recommender systems where the goal is to predict whether users will buy a product (app is the product here).
The most popular technique for building recommender systems is matrix factorization. Goal is to compute profiles for users and products so that profiles for users that have a similar buying pattern have a similar profile, and so that products sold to similar users are similar. One surprisingly simple way to do it is to assign a vector of k numerical values to each user and each product. The inner product (dot product) of a user vector and a product vector yields the affinity of the two.
The matrix factorization approach is actually quite simple in principle. You start from a matrix (no kidding... ), and find ways to approximate it. For recommender systems, the matrix captures past user x products interactions (sales for instance). The matrix has users as row, and products as columns (rows and columns can be swapped, result is the same). This gives you a MxN matrix C with M users and N products. The matrix element C(i,j) has the number of times user i bought product j so far. Most elements of C have value 0. For better accuracy I often replace counts by their log after adding 1 (numpy.log1p() function) as it maps 0 to 0, and makes the count distribution less skewed towards large counts.
Matrix factorizations are great, but they are limited to two factors. What if we want to use the relationship of 3 or more factors? For instance, what if we want to use the interactions between app, os, and device? Fortunately for us, a great generalization of matrix factorization has been introduced in this seminal paper:
Moreover, Rendle implmented his algorithm in the very successful libFM. I could have use libFM directly, but it requires a different data format than what I was using (pandas or numpy). Also it is a batch command line tool, not integrated with the Python environment I am familiar with. I therefore decided to implement the libFM approach as a deep learning model, using Keras. I shar
Let me focus here on the libfFM code. This code is rather short:
k_latent = 2 embedding_reg = 0.0002 kernel_reg = 0.1 def get_embed(x_input, x_size, k_latent): if x_size > 0: #category embed = Embedding(x_size, k_latent, input_length=1, embe
Let's look at it bits per bits. We start with the input, as usual.
input_x = [Input(shape=(1,)) for i in range(dim_input)]
The first layer is to create latent vectors. Here we create two latent vectors per category, a factor vector, and a bias vector. This is done in these lines:
biases = [get_embed(x, size, 1) for (x, size) in zip(input_x, f_size)] factors = [get_embed(x, size, k_latent) for (x, size) in zip(input_x, f_size)]
The factor vectors will be multiplied two by two. We could do it by looping over all combinations, but this would result in N(N -1)/2 dot products if we have N categories. This can lead to memory issues as the number of categories increases. Rendle devised a clever way to compute these products linearly in N. I am using a slightly different way that I find simpler. Let me try to explain it. We want to compute :
F = sum xi xj for all i, j such that i < j
where xi is the factor vector for i-th category. If we set
S = sum xi
F = sum xj (S - xj) for all j
We only have N products this way! We actually get each product twice, but this is not an issue as coefficients in vectors will be scaled down to compensate for it. We can implement the set of products xj (S - xj) directly:
s = Add()(factors) diffs = [Subtract()([s, x]) for x in factors] dots = [Dot(axes=1)([d, x]) for d,x in zip(diffs, factors)]
Let's concatenate them with the biases.
x = Conc
We are almost done. If we implement the original libFM then we should take the sum of x. Given we use a flexible deep learning framework we can use a mode complex last layer. And we can add some batch normalization to get even better results:
x = Batc
Our model is complete, we can finalize it and compile it. We clip norms of the optimizer to avoid exploding gradients, but this may not be necessary in other applications.
model = Mode
Given the purpose of this model is to compute the latent vectors, we need to retrieve them once they are computed by training the model. For this we create a secondary model that shares the embedding layer with the previous model.
output_f = factors + biases model_features = Mode
We can then train the main model. I use a fairly large batch size of 2¹⁷ . I advise users to tune the batch size to their data and not rely on this very large one by default.
n_epochs = 100 P = 17
batch_size = 2**P
earlystopper = Earl
Once the main model model is trained, we simply have to make predictions with the model_features to get the embeddings.
X_pred = mode
This is one beautiful trick if you think of it: we shared layers between models, and we don't directly use the model we trained for predictions!
I used the above to compute feature for my model, i.e. as an unsupervised learning model. One could however use it in a supervised learning way by using the target of the problem directly, instead of interaction counts.
All the code above plus its use with Talking Data competition is available on github.