**Author:** Abheesht Sharma, Fabien Hertschuh<br>
**Date created:** 2025/04/28<br>
**Last modified:** 2025/04/28<br>
**Description:** Rank movies using Deep and Cross Networks (DCN).
ⓘ This example uses Keras 3
[**View in Colab**](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/keras_rs/ipynb/dcn.ipynb) • [**GitHub source**](https://github.com/keras-team/keras-io/blob/master/examples/keras_rs/dcn.py)
## Introduction
This tutorial demonstrates how to use Deep & Cross Networks (DCN) to effectively learn feature crosses. Before diving into the example, let's briefly discuss feature crosses.
Imagine that we are building a recommender system for blenders. Individual features might include a customer's past purchase history (e.g., `purchased_bananas`, `purchased_cooking_books`) or geographic location. However, a customer who has purchased both bananas and cooking books is more likely to be interested in a blender than someone who purchased only one or the other. The combination of `purchased_bananas` and `purchased_cooking_books` is a feature cross. Feature crosses capture interaction information between individual features, providing richer context than the individual features alone.
Learning effective feature crosses presents several challenges. In web-scale applications, data is often categorical, resulting in high-dimensional and sparse feature spaces. Identifying impactful feature crosses in such environments typically relies on manual feature engineering or computationally expensive exhaustive searches. While traditional feed-forward multilayer perceptrons (MLPs) are universal function approximators, they often struggle to efficiently learn even second- or third-order feature interactions.
The Deep & Cross Network (DCN) architecture is designed for more effective learning of explicit and bounded-degree feature crosses. It comprises three main components: an input layer (typically an embedding layer), a cross network for modeling explicit feature interactions, and a deep network for capturing implicit interactions.
The cross network is the core of the DCN. It explicitly performs feature crossing at each layer, with the highest polynomial degree of feature interaction increasing with depth. The following figure shows the $(i+1)$-th cross layer.
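For reference, the cross operation can be written compactly. A sketch of the DCN-v2 formulation (the variant that the low-rank option used later in this example builds on):

$$x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l$$

where $x_0$ is the base input (the embedding layer's output), $x_l$ is the output of the $l$-th cross layer, $W_l$ and $b_l$ are learned parameters, and $\odot$ denotes element-wise multiplication. The residual term $x_l$ lets each layer add one more degree of interaction on top of the previous ones.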
The deep network is a standard feedforward multilayer perceptron (MLP). These two networks are then combined to form the DCN. Two common combination strategies exist: a stacked structure, where the deep network is placed on top of the cross network, and a parallel structure, where the two operate in parallel.
*Figure: parallel layers.*

*Figure: stacked layers.*
Now that we know a little bit about DCN, let's start writing some code. We will first train a DCN on a toy dataset, and demonstrate that the model has indeed learnt important feature crosses.
Let's set the backend to JAX, and get our imports sorted.
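The setup cell itself isn't shown here; below is a minimal sketch that matches the imports used throughout this example (assuming the `keras-rs` and `tensorflow-datasets` packages are installed):

```python
import os

os.environ["KERAS_BACKEND"] = "jax"  # Or "tensorflow"/"torch".

import keras
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf  # Only used for the dataset pipeline.
import tensorflow_datasets as tfds
from mpl_toolkits.axes_grid1 import make_axes_locatable

import keras_rs
```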
Here, we define a helper function for visualising the weights of the cross layer in order to better understand its functioning. We also define a function for compiling, training and evaluating a given model.
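A sketch of what those two helpers might look like (the optimizer choice, learning rate, epoch count and batch handling are assumptions of this sketch):

```python
def visualize_layer(matrix, features):
    """Plot the weight matrix of a cross layer as a heatmap.

    Darker cells correspond to larger weights, i.e. stronger learned
    interactions between the corresponding pair of features.
    """
    fig, ax = plt.subplots(figsize=(9, 9))
    im = ax.matshow(matrix, cmap=plt.cm.Blues)

    # Attach a colorbar of matching height next to the heatmap.
    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="5%", pad=0.05)
    fig.colorbar(im, cax=cax)

    ax.set_xticks(range(len(features)))
    ax.set_xticklabels(features, rotation=45, fontsize=10)
    ax.set_yticks(range(len(features)))
    ax.set_yticklabels(features, fontsize=10)
    plt.show()


def train_and_evaluate(model, train_data, test_data, learning_rate=1e-2, epochs=10):
    """Compile, train and evaluate `model`; return (test RMSE, #parameters)."""
    model.compile(
        optimizer=keras.optimizers.AdamW(learning_rate=learning_rate),
        loss=keras.losses.MeanSquaredError(),
        metrics=[keras.metrics.RootMeanSquaredError()],
    )
    model.fit(train_data, epochs=epochs, verbose=0)
    results = model.evaluate(test_data, return_dict=True, verbose=0)
    return results["root_mean_squared_error"], model.count_params()
```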
## Toy example

To illustrate the benefits of DCNs, let's consider a simple example. Suppose we have a dataset for modeling the likelihood of a customer clicking on a blender advertisement. The features and label are defined as follows:

| Features / Label | Description | Range |
|---|---|---|
| `x1` = country | Customer's resident country | [0, 199] |
| `x2` = bananas | # bananas purchased | [0, 23] |
| `x3` = cookbooks | # cooking books purchased | [0, 5] |
| `y` | Blender ad click likelihood | - |
Then, we let the data follow this underlying distribution:

$$y = f(x_1, x_2, x_3) = 0.1x_1 + 0.4x_2 + 0.7x_3 + 0.1x_1x_2 + 3.1x_2x_3 + 0.1x_3^2$$
This distribution shows that the click likelihood ($y$) depends linearly on individual features ($x_i$) and on multiplicative interactions between them. In this scenario, the likelihood of purchasing a blender ($y$) is influenced not only by purchasing bananas ($x_2$) or cookbooks ($x_3$) individually, but also significantly by the interaction of purchasing both bananas and cookbooks ($x_2x_3$).
### Preparing the dataset
Let's create synthetic data based on the above equation, and form the train-test splits.
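A minimal sketch of that step (scaling each feature to `[0, 1]`, the 90/10 split and the batch size are assumptions made here):

```python
def get_mixer_data(data_size=100_000):
    # Draw the raw integer features and scale them to [0, 1].
    x1 = np.random.randint(0, 200, size=(data_size, 1)) / 200.0  # country
    x2 = np.random.randint(0, 24, size=(data_size, 1)) / 24.0  # bananas
    x3 = np.random.randint(0, 6, size=(data_size, 1)) / 6.0  # cookbooks
    x = np.concatenate([x1, x2, x3], axis=1)

    # Apply the underlying distribution defined above: linear terms,
    # second-order crosses, and the x3^2 term.
    y = (
        0.1 * x1
        + 0.4 * x2
        + 0.7 * x3
        + 0.1 * x1 * x2
        + 3.1 * x2 * x3
        + 0.1 * x3**2
    )
    return x, y


x, y = get_mixer_data()
num_train = 90_000

# Batched tf.data pipelines for the train-test splits.
train_ds = tf.data.Dataset.from_tensor_slices((x[:num_train], y[:num_train])).batch(256)
test_ds = tf.data.Dataset.from_tensor_slices((x[num_train:], y[num_train:])).batch(256)
```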
To demonstrate the advantages of a cross network in recommender systems, we'll compare its performance with that of a deep network. Since our example data only contains second-order feature interactions, a single-layered cross network will suffice. For datasets with higher-order interactions, multiple cross layers can be stacked to form a multi-layered cross network. We will build two models (see the sketch after this list):

1. A cross network with a single cross layer.
2. A deep network with wider and deeper feedforward layers.
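A minimal sketch of the two models, assuming `keras_rs.layers.FeatureCross` as the cross layer (the dense layer sizes are illustrative choices, not prescribed by the text):

```python
# Model 1: a single cross layer followed by a linear head.
cross_network = keras.Sequential(
    [
        keras_rs.layers.FeatureCross(),
        keras.layers.Dense(1),
    ]
)

# Model 2: a plain MLP with wider and deeper feedforward layers.
deep_network = keras.Sequential(
    [
        keras.layers.Dense(512, activation="relu"),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(1),
    ]
)
```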
Let's train both models. Remember, we have set `verbose=0` for brevity's sake, so do not be alarmed if you do not see any output for a while.
After training, we evaluate the models on the unseen dataset. We will report the Root Mean Squared Error (RMSE) here.
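Concretely, with the `train_and_evaluate` helper sketched earlier (which fits with `verbose=0` and returns the test RMSE together with the parameter count):

```python
cross_rmse, cross_params = train_and_evaluate(cross_network, train_ds, test_ds)
deep_rmse, deep_params = train_and_evaluate(deep_network, train_ds, test_ds)

print(f"Cross network: RMSE = {cross_rmse:.4f}, #params = {cross_params}")
print(f"Deep network:  RMSE = {deep_rmse:.4f}, #params = {deep_params}")
```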
We observe that the cross network achieves significantly lower RMSE than a ReLU-based DNN, while also using fewer parameters. This points to the efficiency of the cross network in learning feature interactions.
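### Visualizing feature interactions

Since we know which feature crosses are important in our synthetic data, it would be interesting to verify that the model has indeed learned them. This can be done by visualizing the learned weight matrix of the cross layer, where the `(i, j)`-th element represents the learned importance of the interaction between the `i`-th and `j`-th features. A sketch of that call, assuming the cross layer's kernel is the first weight of `cross_network` and using illustrative feature names:

```python
visualize_layer(
    matrix=cross_network.weights[0].numpy(),
    features=["country", "purchased_bananas", "purchased_cookbooks"],
)
```

Darker cells indicate stronger learned interactions; the bananas-cookbooks cell should stand out, matching the dominant $3.1x_2x_3$ term in the data-generating equation.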
## Real-world example

Let's now use DCN on a real-world dataset: we rank movies from the MovieLens 100K dataset, which contains ratings given by users to movies along with user and movie features. Loading it with TensorFlow Datasets produces:

```
Dataset movielens downloaded and prepared to /root/tensorflow_datasets/movielens/100k-ratings/0.1.1. Subsequent calls will reuse this data.
```
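The loading and feature-selection code that produced this output is not shown above; here is a minimal sketch. The contents of `MOVIELENS_CONFIG` and the per-feature conversions are assumptions of this sketch, chosen to match how the config is used later in this example.

```python
# Hypothetical config: which features we embed (after conversion to integer
# or string form) and the embedding size used further below.
MOVIELENS_CONFIG = {
    "int_features": ["movie_id", "user_id", "user_gender", "bucketized_user_age"],
    "str_features": ["user_zip_code", "user_occupation_text"],
    "embedding_dim": 32,
}

ratings_ds = tfds.load("movielens/100k-ratings", split="train")
ratings_ds = ratings_ds.map(
    lambda x: (
        {
            # In TFDS, the ids are strings and gender/age are bool/float;
            # convert the "int features" to integers.
            "movie_id": tf.strings.to_number(x["movie_id"], out_type=tf.int32),
            "user_id": tf.strings.to_number(x["user_id"], out_type=tf.int32),
            "user_gender": tf.cast(x["user_gender"], tf.int32),
            "bucketized_user_age": tf.cast(x["bucketized_user_age"], tf.int32),
            "user_zip_code": x["user_zip_code"],
            "user_occupation_text": x["user_occupation_text"],
        },
        x["user_rating"],  # Label: the rating the user gave the movie.
    )
)
```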
For every feature, let's get the list of unique values, i.e., the vocabulary, so that we can use it for the embedding layer.

```python
vocabularies = {}

for feature_name in (
    MOVIELENS_CONFIG["int_features"] + MOVIELENS_CONFIG["str_features"]
):
    vocabulary = ratings_ds.batch(10_000).map(lambda x, y: x[feature_name])
    vocabularies[feature_name] = np.unique(np.concatenate(list(vocabulary)))
```
One thing we need to do is to use `keras.layers.StringLookup` and `keras.layers.IntegerLookup` to convert all features into indices, which can then be fed into embedding layers.
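A sketch of that conversion, reusing the (assumed) `MOVIELENS_CONFIG` from above:

```python
lookup_layers = {}
lookup_layers.update(
    {
        feature: keras.layers.IntegerLookup(vocabulary=vocabularies[feature])
        for feature in MOVIELENS_CONFIG["int_features"]
    }
)
lookup_layers.update(
    {
        feature: keras.layers.StringLookup(vocabulary=vocabularies[feature])
        for feature in MOVIELENS_CONFIG["str_features"]
    }
)

# Replace every raw feature value with its vocabulary index; the rating stays
# as the label.
ratings_ds = ratings_ds.map(
    lambda x, y: (
        {
            feature_name: lookup_layers[feature_name](x[feature_name])
            for feature_name in vocabularies
        },
        y,
    )
)
```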
We have three models: a deep cross network, an optimised deep cross network with a low-rank matrix (to reduce training and serving costs), and a normal deep network without cross layers. The deep cross network is a stacked DCN model, i.e., the inputs are fed to cross layers, followed by feedforward layers. Let's run each model 10 times, and report the average/standard deviation of the RMSE. A sketch of how the three models can be built follows.
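The `get_model` helper below and its layer sizes are assumptions of this sketch; passing a small `projection_dim` to `keras_rs.layers.FeatureCross` gives the low-rank variant, which factorizes the cross layer's weight matrix into two skinny matrices.

```python
def get_model(dense_num_units_lst, use_cross_layer=False, projection_dim=None):
    embedding_dim = MOVIELENS_CONFIG["embedding_dim"]

    # One scalar integer input and one embedding per feature.
    inputs, embeddings = {}, []
    for feature_name, vocabulary in vocabularies.items():
        inputs[feature_name] = keras.Input(shape=(), dtype="int64", name=feature_name)
        embeddings.append(
            keras.layers.Embedding(len(vocabulary) + 1, embedding_dim)(
                inputs[feature_name]
            )
        )

    # Stacked structure: the (optional) cross layer comes first, with
    # feedforward layers on top.
    x = keras.layers.concatenate(embeddings)
    if use_cross_layer:
        x = keras_rs.layers.FeatureCross(projection_dim=projection_dim)(x)
    for num_units in dense_num_units_lst:
        x = keras.layers.Dense(num_units, activation="relu")(x)
    outputs = keras.layers.Dense(1)(x)

    return keras.Model(inputs=inputs, outputs=outputs)


cross_network = get_model([192, 192], use_cross_layer=True)
low_rank_cross_network = get_model([192, 192], use_cross_layer=True, projection_dim=8)
deep_network = get_model([192, 192, 192])
```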
DCN outperforms a similarly sized DNN with ReLU layers. Furthermore, the low-rank DCN effectively reduces the number of parameters without compromising accuracy.

### Visualizing feature interactions

Like we did for the toy example, we will plot the weight matrix of the cross layer to see which feature crosses are important. In the previous example, the importance of interactions between the `i`-th and `j`-th features is captured by the `(i, j)`-th element of the weight matrix.

In this case, the feature embeddings are of size 32 rather than 1. Therefore, the importance of feature interactions is represented by the `(i, j)`-th block of the weight matrix, which has dimensions `32 x 32`. To quantify the significance of these interactions, we use the Frobenius norm of each block. A larger value implies higher importance.

```python
features = list(vocabularies.keys())
mat = cross_network.weights[len(features)].numpy()
embedding_dim = MOVIELENS_CONFIG["embedding_dim"]

block_norm = np.zeros([len(features), len(features)])

# Compute the norms of the blocks.
for i in range(len(features)):
    for j in range(len(features)):
        block = mat[
            i * embedding_dim : (i + 1) * embedding_dim,
            j * embedding_dim : (j + 1) * embedding_dim,
        ]
        block_norm[i, j] = np.linalg.norm(block, ord="fro")

visualize_layer(
    matrix=block_norm,
    features=features,
)
```
And we are all done!