GitHub Repository: ethen8181/machine-learning
Path: blob/master/recsys/factorization_machine/factorization_machine.ipynb
Kernel: Python 3 (ipykernel)
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))
from formats import load_style
load_style(css_style = 'custom2.css', plot_style = False)
os.chdir(path)

# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

%watermark -a 'Ethen' -d -t -v -p numba,numpy,pandas,sklearn,matplotlib
Ethen 2018-02-24 20:09:44

CPython 3.6.3
IPython 6.1.0

numba 0.37.0
numpy 1.14.1
pandas 0.22.0
sklearn 0.19.1
matplotlib 2.1.0

Factorization Machine (FM)

Factorization Machine type algorithms are a combination of linear regression and matrix factorization. The cool idea behind this type of algorithm is that it models interactions between features (a.k.a. attributes, explanatory variables) using factorized parameters. By doing so, it has the ability to estimate all interactions between features even with extremely sparse data.

Introduction

Normally, when we think of linear regression, we would think of the following formula:

$$\begin{align} \hat{y}(\textbf{x}) = w_{0} + \sum_{i=1}^{n} w_{i} x_{i} \end{align}$$

Where:

  • $w_0$ is the bias term, a.k.a. intercept.

  • $w_i$ are the weights corresponding to each feature $x_i$; here we assume we have $n$ features in total.

This formula's advantage is that it can be computed in linear time, $O(n)$. The drawback, however, is that it does not handle feature interactions. To capture interactions, we could introduce a weight for each feature combination; this is sometimes referred to as a second order polynomial. The resulting model is shown below:

$$\begin{align} \hat{y}(\textbf{x}) = w_{0} + \sum_{i=1}^{n} w_{i} x_{i} + \sum_{i=1}^n \sum_{j=i+1}^n w_{ij} x_{i} x_{j} \end{align}$$

Compared to our previous model, this formulation has the advantage that it can capture pairwise feature interactions. But we have now ended up with $O(n^2)$ complexity, which means that training the model requires a lot more time and memory. Another issue is that when we have categorical variables with high cardinality, one-hot encoding them leaves us with a lot of sparse columns, making it harder to actually capture their interactions (not enough data).
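To make the cost concrete, here is a minimal NumPy sketch of the second order model above for a single dense sample; the variable names (n, x, w0, w, W) are made up purely for illustration, and the point is simply that the pairwise weight matrix W carries $O(n^2)$ parameters.

import numpy as np

# a minimal sketch of the naive second order polynomial model;
# every variable here is hypothetical and only used for illustration
n = 5                          # number of features
rng = np.random.RandomState(0)
x = rng.rand(n)                # a single (dense) sample
w0 = 0.1                       # bias term
w = rng.rand(n)                # linear weights, O(n) parameters
W = rng.rand(n, n)             # pairwise interaction weights, O(n^2) parameters

pred = w0 + w @ x
for i in range(n):
    for j in range(i + 1, n):  # every pair (i, j) gets its own weight w_ij
        pred += W[i, j] * x[i] * x[j]
print(pred)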

To solve this complexity issue, Factorization Machines take inspiration from matrix factorization and model the feature interactions using latent factors. Every feature $f_i$ has a corresponding latent factor $\textbf{v}_i$, and the interaction between two features is modelled as $\langle \textbf{v}_i, \textbf{v}_{j} \rangle$, where $\langle \cdot \, , \cdot \rangle$ refers to the dot product of the two latent vectors. If we assume each latent vector is of size $k$ (a hyperparameter that we can tune), then:

$$\begin{align} \langle \textbf{v}_i, \textbf{v}_{j} \rangle = \sum_{f=1}^k v_{i,f} v_{j,f} \end{align}$$

This leads to our new equation:

$$\begin{align} \hat{y}(\textbf{x}) = w_{0} + \sum_{i=1}^{n} w_{i} x_{i} + \sum_{i=1}^{n} \sum_{j=i+1}^n \langle \textbf{v}_i , \textbf{v}_{j} \rangle x_i x_{j} \end{align}$$

This is an improvement over our previous model (where we modeled each pairwise interaction with its own weight $w_{ij}$), as the number of interaction parameters is reduced from $n^2$ to $n \times k$ with $k \ll n$, which also helps mitigate overfitting. Formulating the factorization machine naively results in a complexity of $O(kn^2)$, because all pairwise interactions have to be computed, but we can reformulate it to make it run in $O(kn)$.

$$\begin{align} \sum_{i=1}^n \sum_{j=i+1}^n \langle \textbf{v}_i, \textbf{v}_{j} \rangle x_{i} x_{j} &= \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \langle \textbf{v}_i, \textbf{v}_{j} \rangle x_{i} x_{j} - \frac{1}{2} \sum_{i=1}^n \langle \textbf{v}_i , \textbf{v}_{i} \rangle x_{i} x_{i} \\ &= \frac{1}{2}\left(\sum_{i=1}^n \sum_{j=1}^n \sum_{f=1}^k v_{i,f} v_{j,f} x_{i} x_{j} \right) - \frac{1}{2}\left( \sum_{i=1}^n \sum_{f=1}^k v_{i,f} v_{i,f} x_{i} x_{i} \right) \\ &= \frac{1}{2}\left(\sum_{i=1}^n \sum_{j=1}^n \sum_{f=1}^k v_{i,f} v_{j,f} x_{i} x_{j} - \sum_{i=1}^n \sum_{f=1}^k v_{i,f} v_{i,f} x_{i} x_{i} \right) \\ &= \frac{1}{2} \sum_{f=1}^{k} \left( \left(\sum_{i=1}^n v_{i,f}x_{i} \right) \left( \sum_{j=1}^n v_{j,f}x_{j} \right) - \sum_{i=1}^{n} v_{i,f}^2 x_{i}^2 \right) \\ &= \frac{1}{2} \sum_{f=1}^{k} \left( \left( \sum_{i=1}^{n} v_{i,f}x_{i} \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_{i}^2 \right) \end{align}$$

Note that summing over the distinct pairs is the same as summing over all pairs and subtracting the self-interactions, then dividing by two; this is why the factor of 1/2 is introduced from the very first line of the derivation.

This reformulated equation has linear complexity in both $k$ and $n$, i.e. its computation is in $O(kn)$. Substituting it back into the factorization machine formula, we end up with:

$$\begin{align} \hat{y}(\textbf{x}) = w_{0} + \sum_{i=1}^{n} w_{i} x_{i} + \frac{1}{2} \sum_{f=1}^{k} \left( \left( \sum_{i=1}^{n} v_{i,f}x_{i} \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_{i}^2 \right) \end{align}$$
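To convince ourselves that the reformulation really is equivalent, here is a small numerical sanity check with made-up values (n, k, x and v are chosen arbitrarily for illustration): it computes the interaction term both the naive $O(kn^2)$ way and the reformulated $O(kn)$ way and confirms the two agree.

import numpy as np

rng = np.random.RandomState(1)
n, k = 6, 3                       # number of features, size of the latent factors
x = rng.rand(n)                   # a single sample
v = rng.rand(n, k)                # latent factor v_i for every feature i

# naive O(k n^2) version: loop over all pairs i < j
naive = 0.0
for i in range(n):
    for j in range(i + 1, n):
        naive += np.dot(v[i], v[j]) * x[i] * x[j]

# reformulated O(k n) version
summed = v.T @ x                          # sum_i v_{i,f} x_i, one value per factor f
summed_squared = (v ** 2).T @ (x ** 2)    # sum_i v_{i,f}^2 x_i^2
fast = 0.5 * np.sum(summed ** 2 - summed_squared)

print(np.allclose(naive, fast))   # True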

In a machine learning setting, factorization machines can be applied to different supervised prediction tasks:

  • Regression: in this case $\hat{y}(\textbf{x})$ can be used directly as the prediction, and we minimize the mean squared error between the model prediction and the target value, e.g. $\frac{1}{N}\sum^{N}\big(y - \hat{y}(\textbf{x})\big)^2$

  • Classification: if we were to use it in a binary classification setting, we could minimize the log loss, $\ln \big(e^{-y \cdot \hat{y}(\textbf{x})} + 1 \big)$, which is equivalent to $-\ln\big(\sigma(y \cdot \hat{y}(\textbf{x}))\big)$, where $\sigma$ is the sigmoid/logistic function and $y \in \{-1, 1\}$ (a quick numerical check of this equivalence follows below).
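A minimal check of the equivalence mentioned above, using made-up values for the label and the raw model output:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y, y_hat = 1, 0.7    # made-up label in {-1, 1} and raw model output
log_loss = np.log(np.exp(-y * y_hat) + 1.0)
print(np.isclose(log_loss, -np.log(sigmoid(y * y_hat))))  # True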

To train a factorization machine, we can use gradient descent based optimization techniques. The parameters to be learned are $(w_0, \mathbf{w}, \mathbf{V})$, and the gradient of the model with respect to each of them is:

$$\begin{align} \frac{\partial}{\partial\theta}\hat{y}(\textbf{x}) = \begin{cases} 1, & \text{if $\theta$ is $w_0$} \\ x_i, & \text{if $\theta$ is $w_i$} \\ x_i\sum_{j=1}^{n} v_{j,f}x_j - v_{i,f}x_{i}^2, & \text{if $\theta$ is $v_{i,f}$} \end{cases} \end{align}$$
  • Notice that $\sum_{j=1}^n v_{j, f} x_j$ does not depend on $i$, thus it can be computed independently.

  • The last case above can also be written as $x_i\big(\sum_{j=1}^{n} v_{j,f}x_j - v_{i,f}x_{i}\big)$ (a quick finite difference check of this gradient is sketched after this list).

  • In practice, we would throw in some L2 regularization to prevent overfitting.
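Here is the finite difference check referenced in the list above, a small sketch with made-up values (x, w0, w, v) where the prediction is computed via the $O(kn)$ reformulation; we perturb a single entry $v_{i,f}$ and compare the numerical derivative of $\hat{y}(\textbf{x})$ against the analytical formula.

import numpy as np

def fm_predict(x, w0, w, v):
    """FM prediction using the O(kn) reformulation; v has shape [n_features, n_factors]."""
    summed = v.T @ x
    summed_squared = (v ** 2).T @ (x ** 2)
    return w0 + w @ x + 0.5 * np.sum(summed ** 2 - summed_squared)

rng = np.random.RandomState(2)
n, k = 4, 3
x = rng.rand(n)
w0, w, v = 0.5, rng.rand(n), rng.rand(n, k)

i, f, eps = 1, 2, 1e-6
# analytical gradient: x_i * (sum_j v_{j,f} x_j) - v_{i,f} * x_i^2
analytic = x[i] * (v[:, f] @ x) - v[i, f] * x[i] ** 2

# numerical gradient via central finite differences
v_plus, v_minus = v.copy(), v.copy()
v_plus[i, f] += eps
v_minus[i, f] -= eps
numeric = (fm_predict(x, w0, w, v_plus) - fm_predict(x, w0, w, v_minus)) / (2 * eps)

print(np.allclose(analytic, numeric))  # True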

As the next section contains an implementation of the algorithm from scratch, the gradient of the log loss is also provided here for completeness. The predicted value $\hat{y}(\textbf{x})$ is replaced with $x$ to keep the notation cleaner.

$$\begin{align} \frac{d}{dx}\left[ \ln \big(e^{-yx} + 1 \big) \right] &= \frac{1}{e^{-yx} + 1} \cdot \frac{d}{dx}\left[e^{-yx} + 1 \right] \\ &= \frac{\frac{d}{dx}\left[e^{-yx} \right] + \frac{d}{dx}\left[1 \right]}{e^{-yx} + 1} \\ &= \frac{e^{-yx} \cdot \frac{d}{dx}\left[-yx \right] + 0}{e^{-yx} + 1} \\ &= \frac{e^{-yx} \cdot (-y)}{e^{-yx} + 1} \\ &= -\frac{ye^{-yx}}{e^{-yx} + 1} \\ &= -\frac{y}{e^{yx} + 1} \end{align}$$
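As a quick numerical confirmation of this result, using made-up values for $x$ and $y$:

import numpy as np

def loss(x, y):
    return np.log(np.exp(-y * x) + 1.0)

x, y, eps = 0.3, -1, 1e-6                    # made-up prediction and label
analytic = -y / (np.exp(y * x) + 1.0)        # the closed form derived above
numeric = (loss(x + eps, y) - loss(x - eps, y)) / (2 * eps)
print(np.allclose(analytic, numeric))        # True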

Advantages: We'll now wrap up the theoretical section on factorization machines with some of their advantages:

  • We can observe from the model equation that it can be computed in linear time.

  • By leveraging ideas from matrix factorization, we can estimate higher order interaction effects even under very sparse data.

  • Compared to traditional matrix factorization methods, which are restricted to modeling a user-item matrix, we can leverage other user- or item-specific features, making factorization machines more flexible.

Implementation

For the implementation of factorization machine, we'll use for loop based code, as I personally find it easier to comprehend for the gradient update section. There are different ways to speed up for loop based code in Python, such as Cython or Numba; here we'll be using Numba.

# using the example spam dataset
# read it in, extract the input and output columns
label_col = 'label_num'
sms = pd.read_table('sms.tsv', header = None, names = ['label', 'message'])
sms[label_col] = sms['label'].map({'ham': 0, 'spam': 1})
X = sms['message']
y = sms[label_col].values

# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 1)

# convert both sets' text column to document-term matrix;
# ideally, we would want to perform some preprocessing on
# our text data, but let's be lazy here as that's not
# the goal of this documentation
tfidf = TfidfVectorizer(min_df = 2, max_df = 0.5)
X_train_dtm = tfidf.fit_transform(X_train)
X_test_dtm = tfidf.transform(X_test)
X_train_dtm
<4179x3508 sparse matrix of type '<class 'numpy.float64'>' with 51261 stored elements in Compressed Sparse Row format>
import numpy as np
from numba import njit
from tqdm import trange
from sklearn.base import BaseEstimator, ClassifierMixin


class FactorizationMachineClassifier(BaseEstimator, ClassifierMixin):
    """
    Factorization Machine [1]_ using Stochastic Gradient Descent.
    For binary classification only.

    Parameters
    ----------
    n_iter : int, default 10
        Number of iterations to train the algorithm.

    n_factors : int, default 10
        Number/dimension of features' latent factors.

    learning_rate : float, default 0.1
        Learning rate for the gradient descent optimizer.

    reg_coef : float, default 0.01
        Regularization strength for weights/coefficients.

    reg_factors : float, default 0.01
        Regularization strength for features' latent factors.

    random_state : int, default 1234
        Seed for the randomly initialized features' latent factors.

    verbose : bool, default False
        Whether to print progress bar while training.

    Attributes
    ----------
    intercept_ : double
        Intercept term, w0 based on the original notations.

    coef_ : 1d ndarray, shape [n_features,]
        Coefficients, w based on the original notations.

    feature_factors_ : 2d ndarray, shape [n_factors, n_features]
        Latent factors for all features. v based on the original
        notations. The learned factors can be viewed as the embeddings
        for each feature. If a pair of features tends to co-occur often,
        then their embeddings should be close/similar (in terms of
        cosine similarity) to each other.

    history_ : list
        Loss function's history at each iteration, useful
        for evaluating whether the algorithm converged or not.

    References
    ----------
    .. [1] `S. Rendle Factorization Machines (2010)
            <http://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf>`_
    """

    def __init__(self, n_iter = 10, n_factors = 10,
                 learning_rate = 0.1, reg_coef = 0.01,
                 reg_factors = 0.01, random_state = 1234, verbose = False):
        self.n_iter = n_iter
        self.verbose = verbose
        self.reg_coef = reg_coef
        self.n_factors = n_factors
        self.reg_factors = reg_factors
        self.random_state = random_state
        self.learning_rate = learning_rate

    def fit(self, X, y):
        """
        Fit the model to the input data and label.

        Parameters
        ----------
        X : scipy sparse csr_matrix, shape [n_samples, n_features]
            Data in sparse matrix format.

        y : 1d ndarray, shape [n_samples,]
            Training data's corresponding label.

        Returns
        -------
        self
        """
        n_samples, n_features = X.shape
        self.coef_ = np.zeros(n_features)
        self.intercept_ = 0.0

        # the factors are often initialized with a mean of 0 and standard
        # deviation of 1 / sqrt(number of latent factors specified)
        np.random.seed(self.random_state)
        self.feature_factors_ = np.random.normal(
            scale = 1 / np.sqrt(self.n_factors), size = (self.n_factors, n_features))

        # the gradient is implemented in a way that requires
        # the negative class to be labeled as -1 instead of 0
        y = y.copy().astype(np.int32)
        y[y == 0] = -1

        loop = range(self.n_iter)
        if self.verbose:
            loop = trange(self.n_iter)

        self.history_ = []
        for _ in loop:
            loss = _sgd_update(X.data, X.indptr, X.indices, y,
                               n_samples, n_features, self.intercept_, self.coef_,
                               self.feature_factors_, self.n_factors,
                               self.learning_rate, self.reg_coef, self.reg_factors)
            self.history_.append(loss)

        return self

    def predict_proba(self, X):
        """
        Probability estimates. The returned estimates for
        all classes are ordered by the label of classes.

        Parameters
        ----------
        X : scipy sparse csr_matrix, shape [n_samples, n_features]
            Data in sparse matrix format.

        Returns
        -------
        proba : 2d ndarray, shape [n_samples, n_classes]
            The probability of the sample for each class in the model.
        """
        pred = self._predict(X)
        pred_proba = 1.0 / (1.0 + np.exp(-pred))
        proba = np.vstack((1 - pred_proba, pred_proba)).T
        return proba

    def _predict(self, X):
        """Similar to _predict_instance but vectorized for all samples"""
        linear_output = X * self.coef_
        v = self.feature_factors_.T
        term = (X * v) ** 2 - (X.power(2) * (v ** 2))
        factor_output = 0.5 * np.sum(term, axis = 1)
        return self.intercept_ + linear_output + factor_output

    def predict(self, X):
        """
        Predict class labels for samples in X.

        Parameters
        ----------
        X : scipy sparse csr_matrix, shape [n_samples, n_features]
            Data in sparse matrix format.

        Returns
        -------
        Predicted class label per sample.
        """
        pred_proba = self.predict_proba(X)[:, 1]
        return pred_proba.round().astype(np.int)


@njit
def _sgd_update(data, indptr, indices, y, n_samples, n_features, w0, w, v,
                n_factors, learning_rate, reg_w, reg_v):
    """
    Compute the loss of the current iteration
    and update gradients accordingly.
    """
    loss = 0.0
    for i in range(n_samples):
        pred, summed = _predict_instance(data, indptr, indices, w0, w, v, n_factors, i)

        # calculate loss and its gradient
        loss += _log_loss(pred, y[i])
        loss_gradient = -y[i] / (np.exp(y[i] * pred) + 1.0)

        # update bias/intercept term
        w0 -= learning_rate * loss_gradient

        # update weight
        for index in range(indptr[i], indptr[i + 1]):
            feature = indices[index]
            w[feature] -= learning_rate * (loss_gradient * data[index] + 2 * reg_w * w[feature])

        # update factor
        for factor in range(n_factors):
            for index in range(indptr[i], indptr[i + 1]):
                feature = indices[index]
                term = summed[factor] - v[factor, feature] * data[index]
                v_gradient = loss_gradient * data[index] * term
                v[factor, feature] -= learning_rate * (v_gradient + 2 * reg_v * v[factor, feature])

    loss /= n_samples
    return loss


@njit
def _predict_instance(data, indptr, indices, w0, w, v, n_factors, i):
    """predicting a single instance"""
    summed = np.zeros(n_factors)
    summed_squared = np.zeros(n_factors)

    # linear output w * x
    pred = w0
    for index in range(indptr[i], indptr[i + 1]):
        feature = indices[index]
        pred += w[feature] * data[index]

    # factor output
    for factor in range(n_factors):
        for index in range(indptr[i], indptr[i + 1]):
            feature = indices[index]
            term = v[factor, feature] * data[index]
            summed[factor] += term
            summed_squared[factor] += term * term

        pred += 0.5 * (summed[factor] * summed[factor] - summed_squared[factor])

    # summed is the independent term that can be re-used
    # during the gradient update stage
    return pred, summed


@njit
def _log_loss(pred, y):
    """
    negative log likelihood of the current prediction and label, y.
    """
    return np.log(np.exp(-pred * y) + 1.0)
fm = FactorizationMachineClassifier(n_iter = 30, learning_rate = 0.1)
fm.fit(X_train_dtm, y_train)
FactorizationMachineClassifier(learning_rate=0.1, n_factors=10, n_iter=30, random_state=1234, reg_coef=0.01, reg_factors=0.01, verbose=False)
# change default style figure and font size
plt.rcParams['figure.figsize'] = 8, 6
plt.rcParams['font.size'] = 12

# one quick way to check that we've implemented
# the gradient descent is to ensure that the loss
# curve is steadily decreasing
plt.plot(fm.history_)
plt.title('Loss Curve Per Iteration')
plt.xlabel('Iterations')
plt.ylabel('Loss')
plt.show()
[Figure: Loss Curve Per Iteration (x-axis: Iterations, y-axis: Loss)]
# predict on the test set and output the auc score
y_pred_prob = fm.predict_proba(X_test_dtm)[:, 1]
auc = roc_auc_score(y_test, y_pred_prob)
print('auc', auc)
auc 0.9973867907642742
# we can compare it with a logistic regression
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
auc = roc_auc_score(y_test, y_pred_prob)
print('auc', auc)
auc 0.9949615178092

There are various open-source implementations floating around the web; here are links to some of them:

I personally haven't tested which one is more efficient; feel free to grab one of them and see if it helps solve your problem.