Introduction

This notebook is an implementation based on the paper Billion-scale Commodity Embedding for E-commerce Recommendation. In Taobao, millions of new items are continuously uploaded each hour, and these items have no user behaviors. Learning item representations is important for matching and ranking in order to recommend these items to users. Collaborative-filtering-based methods only compute the co-occurrence of items in users' behavior histories, so it is quite challenging to learn representations for items with few or even no interactions. The authors propose a new approach: incorporating side information to enhance the embedding vectors, dubbed Graph Embedding with Side Information. For example, items with the same brand or category should be closer in the embedding space. Throughout the rest of this notebook, we will develop a model that incorporates side information into graph embedding and test it on new items that are not in the dataset to see how it performs.

# 1. magic for inline plot
# 2. magic so that the notebook will reload external python modules
# 3. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035

%matplotlib inline
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

import shutil
import os
import glob
import json
import random
import gc
import torch


import numpy as np
import pandas as pd

from tqdm import tqdm
from collections import Counter, defaultdict

from torch.utils.data import DataLoader, Dataset

# Set fixed seed
random_state = 4111
torch.manual_seed(random_state)
random.seed(random_state)
np.random.seed(random_state)

Cold Start Recommendation

Preparation

We will use the publicly available MovieLens dataset throughout this experiment. At the time of writing, the MovieLens 25M dataset (ml-25m) is up to date and contains a large number of samples. You can download it with the commands below:

!wget https://files.grouplens.org/datasets/movielens/ml-25m.zip
!unzip ml-25m.zip

Given this dataset, we use a few code chunks to clean and preprocess the data.

The raw movies.csv contains movieId, title, and genres. We will convert the movie ids into ordinal ids and split the genres into an array of strings.
The raw tags.csv contains userId, movieId, tag, and timestamp. We will drop null tags, convert the tags to lowercase, and aggregate them by movie.


df_entity = pd.read_csv("ml-25m/movies.csv")
df_entity["genres"] = df_entity["genres"].map(lambda d: d.split("|"))
df_entity.head()

	movieId	title	genres
0	1	Toy Story (1995)	[Adventure, Animation, Children, Comedy, Fantasy]
1	2	Jumanji (1995)	[Adventure, Children, Fantasy]
2	3	Grumpier Old Men (1995)	[Comedy, Romance]
3	4	Waiting to Exhale (1995)	[Comedy, Drama, Romance]
4	5	Father of the Bride Part II (1995)	[Comedy]

df_tag = pd.read_csv("ml-25m/tags.csv").dropna()
df_tag.head()

	userId	movieId	tag	timestamp
0	3	260	classic	1439472355
1	3	260	sci-fi	1439472256
2	4	1732	dark comedy	1573943598
3	4	1732	great dialogue	1573943604
4	4	7569	so bad it's good	1573943455

df_agg_tag = df_tag.drop(["userId","timestamp"],axis=1)\
        .assign(tag=lambda df: df["tag"].map(lambda d: d.lower().lstrip().rstrip()))\
        .groupby(["movieId"])["tag"].agg("unique").reset_index()
df_agg_tag

	movieId	tag
0	1	[owned, imdb top 250, pixar, time travel, chil...
1	2	[robin williams, time travel, fantasy, based o...
2	3	[funny, best friend, duringcreditsstinger, fis...
3	4	[based on novel or book, chick flick, divorce,...
4	5	[aging, baby, confidence, contraception, daugh...
...	...	...
45246	208813	[might like]
45247	208933	[black and white, deal with the devil]
45248	209035	[computer animation, japan, mass behavior, mas...
45249	209037	[chameleon, computer animation, gluttony, humo...
45250	209063	[black, education, friends schools, independen...

45251 rows × 2 columns

df_movie_joint = df_entity.merge(df_agg_tag, on=["movieId"],how="outer")
df_movie_joint

	movieId	title	genres	tag
0	1	Toy Story (1995)	[Adventure, Animation, Children, Comedy, Fantasy]	[owned, imdb top 250, pixar, time travel, chil...
1	2	Jumanji (1995)	[Adventure, Children, Fantasy]	[robin williams, time travel, fantasy, based o...
2	3	Grumpier Old Men (1995)	[Comedy, Romance]	[funny, best friend, duringcreditsstinger, fis...
3	4	Waiting to Exhale (1995)	[Comedy, Drama, Romance]	[based on novel or book, chick flick, divorce,...
4	5	Father of the Bride Part II (1995)	[Comedy]	[aging, baby, confidence, contraception, daugh...
...	...	...	...	...
62418	209157	We (2018)	[Drama]	NaN
62419	209159	Window of the Soul (2001)	[Documentary]	NaN
62420	209163	Bad Poems (2018)	[Comedy, Drama]	NaN
62421	209169	A Girl Thing (2001)	[(no genres listed)]	NaN
62422	209171	Women of Devil's Island (1962)	[Action, Adventure, Drama]	NaN

62423 rows × 4 columns

Preprocessing

Ordinal Encoding

We need a custom ordinal encoder for our use case. It transforms a list of categorical or id features into ordinal ids that begin at a configurable start index (1 by default), and values missing from the vocabulary are replaced by a reserved unknown id (0 by default).

class OrdinalEncoder():
    """
        Convert categorical into ordinal integer ids
        If value is not existed in vocab, it will be replaced by <unk> val
    """
    def __init__(self,start_from=1, unknown=0):
        self.vocabs = {}
        self.wc = defaultdict(int)
        self.inv = []
        self.start_from = start_from
        self.unknown = unknown
    
    def __len__(self, ):
        return len(self.inv)

    def fit(self, X):
        
        import numpy as np

        for i in range(len(X)):
            self.wc[X[i]] += 1
        
        X_uniq = np.unique(X)

        self.inv = [0] * (len(X_uniq) + self.start_from)
        
        for idx, item in enumerate(X_uniq):
            self.vocabs[item] = self.start_from + idx
            self.inv[self.start_from + idx] = item
        
        self.inv = np.array(self.inv)

        return self
    
    def transform(self, X):
        import numpy as np

        if isinstance(X[0],(list,np.ndarray)):
            res = []
            for idx in range(len(X)):
                tmp = [self.vocabs.get(item) for item in X[idx] if item in self.vocabs]

                res.append(tmp)                
            return res
        else:
            return np.array([ self.vocabs.get(item, self.unknown) for item in X ],dtype="int")
    
    def fit_transform(self, X):
        return self.fit(X).transform(X)    

    def inverse_transform(self, X):
        if len(X) == 0:
            return []

        X = np.array(X, dtype="int")

        return self.inv[X]  

    @property
    def n_classes_(self):
        return len(self.inv)  

    def word_count_table(self):
        return [self.wc[w] for w in self.inv]

class MultiLabelEncoder(OrdinalEncoder):
    """
        Convert multi-labels into ordinal ids
    """
    def fit(self, X: list):
        X_extend = []
        for i in range(len(X)):
            if isinstance(X[i],(list,np.ndarray)):
                X_extend.extend(X[i])
        
        super().fit(X_extend)

        return self
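
As a quick illustration of what the encoders above are meant to produce, here is a minimal sketch that does not use the notebook's classes. It builds a vocabulary whose ids start at 1, so that 0 stays free for unknown or padding values, and it drops labels that are not in the vocabulary. The helper names `build_vocab` and `encode_lists` and the toy genre rows are hypothetical.

```python
def build_vocab(values, start_from=1):
    """Map each distinct value to an integer id, in sorted order."""
    return {v: start_from + i for i, v in enumerate(sorted(set(values)))}


def encode_lists(rows, vocab):
    """Encode each list of labels, dropping labels missing from the vocab."""
    return [[vocab[item] for item in row if item in vocab] for row in rows]


genres = [["Comedy", "Romance"], ["Drama"], []]
# Flatten the multi-label rows before building the vocabulary.
vocab = build_vocab([g for row in genres for g in row])
# "Western" was never seen during fitting, so it is dropped.
encoded = encode_lists(genres + [["Western"]], vocab)
print(vocab)    # {'Comedy': 1, 'Drama': 2, 'Romance': 3}
print(encoded)  # [[1, 3], [2], [], []]
```

Reserving id 0 is what later allows the embedding layers to use `padding_idx=0` for empty or padded side information.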

metadata_cols = ["genres","tag"]


movie_encoder = OrdinalEncoder(start_from=1).fit(df_movie_joint["movieId"].tolist())
encoder_mapper = {col: MultiLabelEncoder(start_from=1).fit(df_movie_joint[col].tolist()) for col in metadata_cols}

df_movie_joint_encoder = df_movie_joint[["movieId"]].copy(deep=True)
df_movie_joint_encoder["movieId"] = movie_encoder.transform(df_movie_joint["movieId"].tolist())

for col in metadata_cols:
    df_movie_joint_encoder[col] = encoder_mapper[col].transform(df_movie_joint[col].fillna("").tolist())
    
df_movie_joint_encoder["movieId"] = df_movie_joint_encoder["movieId"].astype("int")
movie_metadata_info = df_movie_joint_encoder.set_index(["movieId"])[metadata_cols].to_dict(orient="index")

df_movie_joint_encoder["title"] = df_movie_joint["title"]

df_movie_joint_encoder

	movieId	genres	tag	title
0	1	[3, 4, 5, 6, 10]	[42617, 27920, 44422, 58453, 10932, 12378, 225...	Toy Story (1995)
1	2	[3, 5, 10]	[48952, 58453, 20353, 5932, 7677, 16422, 23585...	Jumanji (1995)
2	3	[6, 16]	[22590, 6651, 17688, 21398, 41566, 51540, 3803...	Grumpier Old Men (1995)
3	4	[6, 9, 16]	[5956, 10772, 16687, 28762, 52968, 12059, 4824...	Waiting to Exhale (1995)
4	5	[6]	[1785, 5229, 12797, 12978, 14653, 25202, 37295...	Father of the Bride Part II (1995)
...	...	...	...	...
62418	62419	[9]	[]	We (2018)
62419	62420	[8]	[]	Window of the Soul (2001)
62420	62421	[6, 9]	[]	Bad Poems (2018)
62421	62422	[1]	[]	A Girl Thing (2001)
62422	62423	[2, 3, 9]	[]	Women of Devil's Island (1962)

62423 rows × 4 columns

df_movie_joint_encoder.to_parquet("./df_movie_joint_encoder.pq",index=False)

Given ratings.csv, we sort each user's ratings by timestamp in order to create time-ordered sequences.

Note that we treat users' rating behavior as a proxy for their watching behavior.

Random walk methods

We convert the ratings into sequences using the random walk technique. We leverage the DeepWalk source code from Bryan Perozzi.

Implement Graph class

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Graph utilities."""

import logging
import random
from collections import defaultdict
from io import open
from time import time

from six import iterkeys
from six.moves import range, zip, zip_longest

logger = logging.getLogger("deepwalk")

LOGFORMAT = "%(asctime).19s %(levelname)s %(filename)s: %(lineno)s %(message)s"


class Graph(defaultdict):
    """Efficient basic implementation of nx `Graph'  Undirected graphs with self loops"""

    def __init__(self):
        super(Graph, self).__init__(list)

    def nodes(self):
        return self.keys()

    def adjacency_iter(self):
        return self.items()  # dict.iteritems() only exists in Python 2

    def random_walk(self, path_length, alpha=0, rand=random.Random(), start=None):
        """ Returns a truncated random walk.

            path_length: Length of the random walk.
            alpha: probability of restarts.
            start: the start node of the random walk.
        """
        G = self
        if start:
            path = [start]
        else:
            # Sampling is uniform w.r.t V, and not w.r.t E
            path = [rand.choice(list(G.keys()))]

        while len(path) < path_length:
            cur = path[-1]
            if len(G[cur]) > 0:
                if rand.random() >= alpha:
                    path.append(rand.choice(G[cur]))
                else:
                    path.append(path[0])
            else:
                break
        return [str(node) for node in path]

    def make_undirected(self):

        t0 = time()

        for v in list(self):
            for other in self[v]:
                if v != other:
                    self[other].append(v)

        t1 = time()
        logger.info('make_undirected: added missing edges {}s'.format(t1-t0))

        self.make_consistent()
        return self

    def make_consistent(self):
        t0 = time()
        for k in iterkeys(self):
            self[k] = list(sorted(set(self[k])))

        t1 = time()
        logger.info('make_consistent: made consistent in {}s'.format(t1-t0))

        self.remove_self_loops()

        return self

    def remove_self_loops(self):

        removed = 0
        t0 = time()

        for x in self:
            if x in self[x]:
                self[x].remove(x)
                removed += 1

        t1 = time()

        logger.info(
            'remove_self_loops: removed {} loops in {}s'.format(removed, (t1-t0)))
        return self


def build_deepwalk_corpus_iter(G, num_paths, path_length, alpha=0,
                               rand=random.Random(0)):

    nodes = list(G.nodes())

    for cnt in range(num_paths):
        rand.shuffle(nodes)
        for node in nodes:
            yield G.random_walk(path_length, rand=rand, alpha=alpha, start=node)

# http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python
def grouper(n, iterable, padvalue=None):
    "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)


def parse_adjacencylist(f):
    adjlist = []
    for l in f:
        if l and l[0] != "#":
            introw = [str(x) for x in l.strip().split()]
            row = [introw[0]]
            row.extend(set(sorted(introw[1:])))
            adjlist.extend([row])

    return adjlist


def parse_adjacencylist_unchecked(f):
    adjlist = []
    for l in f:
        if l and l[0] != "#":
            adjlist.extend([[str(x) for x in l.strip().split()]])

    return adjlist


def load_adjacencylist(file_, undirected=False, chunksize=10000,):

    parse_func = parse_adjacencylist_unchecked
    convert_func = from_adjlist_unchecked

    adjlist = []

    t0 = time()

    total = 0
    with open(file_) as f:
        for idx, adj_chunk in enumerate(map(parse_func, grouper(int(chunksize), f))):
            adjlist.extend(adj_chunk)
            total += len(adj_chunk)

    t1 = time()

    logger.info('Parsed {} edges with {} chunks in {}s'.format(
        total, idx, t1-t0))

    t0 = time()
    G = convert_func(adjlist)
    t1 = time()

    logger.info('Converted edges to graph in {}s'.format(t1-t0))

    if undirected:
        t0 = time()
        G = G.make_undirected()
        t1 = time()
        logger.info('Made graph undirected in {}s'.format(t1-t0))

    return G


def from_adjlist_unchecked(adjlist):
    G = Graph()

    for row in adjlist:
        node = row[0]
        neighbors = row[1:]
        G[node] = neighbors

    return G
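
To make the walk logic easier to follow in isolation, here is a minimal, self-contained sketch of a truncated random walk with restarts over a plain adjacency dictionary. It mirrors the idea of `Graph.random_walk` above but is not the DeepWalk code itself. The function name `truncated_random_walk` and the toy graph are hypothetical.

```python
import random


def truncated_random_walk(adj, start, path_length, alpha=0.0, rng=None):
    """Walk up to path_length nodes; with probability alpha jump back to start."""
    rng = rng or random.Random()
    path = [start]
    while len(path) < path_length:
        neighbors = adj.get(path[-1], [])
        if not neighbors:
            break  # dead end: stop the walk early
        if rng.random() >= alpha:
            path.append(rng.choice(neighbors))
        else:
            path.append(path[0])
    return path


# Toy undirected graph stored as adjacency lists (made-up data).
toy_graph = {"1": ["2", "3"], "2": ["1"], "3": ["1"], "4": []}
walk = truncated_random_walk(toy_graph, "1", path_length=5, rng=random.Random(0))
isolated = truncated_random_walk(toy_graph, "4", path_length=5)
print(walk)      # a list of 5 node ids starting with "1"
print(isolated)  # ['4']
```

Each generated walk then plays the role of a sentence for the skip-gram model, with movie ids as the words.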

Generate sequences

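df_rating = pd.read_csv("ml-25m/ratings.csv")
# Sort each user's ratings chronologically so that the sequences are time-ordered
df_rating = df_rating.sort_values(["userId", "timestamp"])
df_rating["movieId"] =  movie_encoder.transform(df_rating["movieId"].tolist())
df_rating_path = df_rating.groupby(["userId"])["movieId"].agg(list).reset_index()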

del df_rating

random_seed = 4111
number_walks = 10
walk_length = 20
restart_prob = 0.

min_len = 3


with open("paths_filtered.txt","w") as f, tqdm(total=len(df_rating_path)) as pbar:    
    for _, row in df_rating_path.iterrows():
        all_paths = row["movieId"]
        
        # Skip short sequences
        
        if len(all_paths) < min_len:
            continue

        f.write(" ".join([str(node) for node in all_paths])+"\n")
    
        pbar.update(1)

G = load_adjacencylist("paths_filtered.txt",undirected=True)

cursor = build_deepwalk_corpus_iter(G,
                                    num_paths=number_walks,
                                    path_length=walk_length,
                                    alpha=restart_prob,
                                    rand=random.Random(random_seed))

with open("walks.txt","w") as f:
    for path in cursor:
        f.write(" ".join(path) + "\n")

100%|██████████| 162541/162541 [00:09<00:00, 16561.69it/s]

Deep dive into model

Generate skip-gram pairs and negative samples for movies

feature_schema = {
    "genres": {
        "type": "categorical_list", # there are categorical_list and categorical
        "size": encoder_mapper["genres"].n_classes_,
    },
    "tag": {
        "type": "categorical_list",
        "size": encoder_mapper["tag"].n_classes_,        
    },
}

def generate_skipgram(sentence, i, window_size,unk:int = 0):
    iword = sentence[i]
    left = sentence[max(i - window_size, 0): i]
    right = sentence[i + 1: i + 1 + window_size]
    return iword, [unk for _ in range(window_size - len(left))] + left + right + [unk for _ in range(window_size - len(right))]
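
The following sketch shows which (center, neighbor) training pairs a sliding window yields for one short walk. It is a simplified variant: the notebook's `generate_skipgram` pads the window with the unknown id 0 and the dataset then skips those entries, whereas this hypothetical helper `skipgram_pairs` simply clips the window at the sequence boundaries.

```python
def skipgram_pairs(sentence, window_size):
    """Return (center, neighbor) pairs within window_size positions of each other."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(i - window_size, 0)
        hi = min(i + window_size + 1, len(sentence))
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


walk = [10, 20, 30, 40]
pairs = skipgram_pairs(walk, window_size=1)
print(pairs)
# [(10, 20), (20, 10), (20, 30), (30, 20), (30, 40), (40, 30)]
```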

class MovieLenGraphPathDataset(Dataset):
    """
        MovieLen Torch Dataset
    """
    def __init__(
        self, 
        file_path, 
        side_info_lookup: dict=None, 
        feature_schema:dict=None, 
        window_size: int=5,
        subsample_rate: float = 0) -> None:
    
        """
            file_path - string : Line text sequence file
            side_info_lookup - dict: 
        """
        self.data = []
        
        self.wc = defaultdict(int)        
        self.window_size = window_size
        self.side_info_lookup = side_info_lookup
        self.feature_schema = feature_schema
        
        cols = list(feature_schema.keys())
        
        with open(file_path) as f:
            step = 0            
            for line in f:
                if step % 1000 == 0:
                    print("working on line: {}".format(step),end="\r")                
                    
                line = [int(w.strip()) for w in line.strip().split() if len(w.strip()) > 0]

                for i in range(len(line)):
                    self.wc[line[i]] += 1
                    
                    center_word, neighbor_words = generate_skipgram(line, i, window_size)
                    
                    if not center_word in self.side_info_lookup:
                        continue
                        
                    for neighbor_word in neighbor_words:
                        if neighbor_word == 0:
                            continue
                            
                        tmp = {"center_word": center_word,"neighbor_word": neighbor_word,"side_information":{}}
                            
                        for col in cols:
                            if col in self.side_info_lookup[center_word]:
                                tmp["side_information"][col] = self.side_info_lookup[center_word][col]
                                
                        self.data.append(tmp)
            
                step += 1
        
        self.wc_arr = np.zeros(len(side_info_lookup) + 1,dtype=int)
        for k, v in self.wc.items():
            self.wc_arr[k] = v
        
        if subsample_rate > 0:
            wf = self.wc_arr / self.wc_arr.sum()
            ws = 1 - np.sqrt(subsample_rate/wf)
            ws = np.clip(ws, 0, 1)

            data = []

            for i in range(len(self.data)):
                # each sample is a dict, so look up the center word by key
                if random.random() > ws[self.data[i]["center_word"]]:
                    data.append(self.data[i])

            self.data = data            
        
        
    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)
    
    def get_word_count(self):
        return self.wc_arr

movie_ds = MovieLenGraphPathDataset("walks.txt",
                                    side_info_lookup=movie_metadata_info,
                                    feature_schema=feature_schema,
                                    window_size=5)

working on line: 88000

Wrap-up with Torch DataLoader

import torch
def padding_tensor(arr,maxlen, dtype):
    padded_sess = torch.zeros(len(arr), maxlen, dtype=dtype)
    
    for i in range(len(arr)):
        padded_sess[i, :len(arr[i])] = arr[i]
    
    return padded_sess

def get_dataloader(dataset, feature_mapper,batch_size=64,shuffle=True,num_workers=0):
    from torch.utils.data import DataLoader
                
    def collate_fn(inputs):
        outputs = {k: [] for k in feature_mapper.keys()}
        outputs["center_word"] = []
        outputs["neighbor_word"] = []   
        
        max_len_mapper = defaultdict(int)
        
        for i in range(len(inputs)):
            outputs["center_word"].append(torch.tensor(inputs[i]["center_word"], dtype=torch.int))
            outputs["neighbor_word"].append(torch.tensor(inputs[i]["neighbor_word"], dtype=torch.int))
            
            for k in feature_mapper.keys():
                outputs[k].append(torch.tensor(inputs[i]["side_information"][k],dtype=torch.int))
                max_len_mapper[k] = max(len(inputs[i]["side_information"][k]),max_len_mapper[k])
        
        outputs["center_word"] = torch.tensor(outputs["center_word"], dtype=torch.int)
        outputs["neighbor_word"] = torch.tensor(outputs["neighbor_word"], dtype=torch.int)
        
        for k in feature_mapper.keys():            
            if feature_mapper[k]["type"] == "categorical":
                outputs[k] = torch.tensor(outputs[k], dtype=torch.int)
                
            elif feature_mapper[k]["type"] == "categorical_list":
                outputs[k] = padding_tensor(outputs[k],max_len_mapper[k],dtype=torch.int)      

        return outputs
    
    return DataLoader(dataset, batch_size = batch_size, shuffle = shuffle, collate_fn = collate_fn, num_workers=num_workers)
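
The padding step in `collate_fn` can be pictured with plain Python lists. This is only a sketch of the idea, not the tensor code above: every variable-length list of side-information ids in a batch is right-padded with 0 to the length of the longest list. The helper name `pad_lists` and the sample batch are hypothetical.

```python
def pad_lists(rows, pad_value=0):
    """Right-pad every list to the length of the longest one."""
    max_len = max((len(row) for row in rows), default=0)
    return [row + [pad_value] * (max_len - len(row)) for row in rows]


batch_genres = [[3, 4, 5], [6], []]
padded = pad_lists(batch_genres)
print(padded)  # [[3, 4, 5], [6, 0, 0], [0, 0, 0]]
```

Because the encoders start their ids at 1, the value 0 never collides with a real genre or tag id.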

Weighted Skip Gram

Let's start by defining the problem. For the sake of clarity, we use $W$ to denote the embedding matrix of items or side information (SI). Specifically, $W_v^0$ denotes the embedding of item $v$, and $W_v^j$ denotes the embedding of the $j$-th type of SI attached to item $v$.

Then, for an item $v$ with $n$ types of SI, we have $n + 1$ vectors $W_v^0, W_v^1, \dots, W_v^n \in \mathbb{R}^d$, where $d$ is the embedding dimension. The paper proposes a weighted layer to aggregate the embeddings of the SI related to an item. Let $a_v^j$ be the weight of the $j$-th type of SI of item $v$, with $a_v^0$ denoting the weight of item $v$ itself. The aggregated embedding is defined as follows:

$\begin{align*} H_v = \frac{\sum_{j=0}^{n} e^{a_v^j} W_v^j}{\sum_{j=0}^{n} e^{a_v^j}} \end{align*}$

where we use $e^{a_v^j}$ to ensure that the contribution of each SI is positive, and $\sum_{j=0}^{n} e^{a_v^j}$ normalizes the weights of the SI embeddings.
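
As a small numeric check of the aggregation formula, the sketch below computes $H_v$ for one item with two SI embeddings using plain Python. The function name `aggregate` and all numbers are made up for illustration. With equal weights the result is the plain mean of the vectors, and a large $a_v^0$ makes the item embedding dominate.

```python
import math


def aggregate(embeddings, weights):
    """Softmax-weighted sum of equally sized vectors: H = sum_j softmax(a)_j * W^j."""
    exp_w = [math.exp(a) for a in weights]
    total = sum(exp_w)
    dim = len(embeddings[0])
    return [
        sum(e / total * vec[k] for e, vec in zip(exp_w, embeddings))
        for k in range(dim)
    ]


# Item embedding W^0 plus two SI embeddings W^1 and W^2 (made-up numbers).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H_equal = aggregate(W, [0.0, 0.0, 0.0])   # equal weights: plain mean
H_item = aggregate(W, [10.0, 0.0, 0.0])   # item embedding dominates
print(H_equal)
print(H_item)
```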

For a node $v$ and its context node $u$ in the training data, we use $Z_u \in \mathbb{R}^d$ to denote the embedding of $u$ and $y$ to denote the label. The objective function is defined below:

$\begin{align*} L(u, v, y) = -\left[ y \log \sigma(H_v^T Z_u) + (1 - y) \log\left(1 - \sigma(H_v^T Z_u)\right) \right] \end{align*}$
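
The objective can be evaluated by hand for a single pair. The sketch below is a plain-Python illustration with made-up two-dimensional vectors. The names `sigmoid` and `pair_loss` are hypothetical helpers, not part of the notebook's model. A positive pair whose vectors point the same way gets a small loss, and labelling an opposed pair as positive gets a large one.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pair_loss(h_v, z_u, y):
    """Binary cross-entropy on the dot product of H_v and Z_u."""
    score = sum(a * b for a, b in zip(h_v, z_u))
    p = sigmoid(score)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


h_v = [0.5, -0.5]
z_pos = [2.0, -2.0]   # aligned with h_v: dot product is 2
z_neg = [-2.0, 2.0]   # opposed to h_v: dot product is -2
loss_pos = pair_loss(h_v, z_pos, y=1)   # well-scored positive pair
loss_neg = pair_loss(h_v, z_neg, y=0)   # well-scored negative pair
loss_bad = pair_loss(h_v, z_neg, y=1)   # positive label on an opposed pair
print(loss_pos, loss_neg, loss_bad)
```

In the model code, the positive term and the sum over sampled negatives correspond to the $y = 1$ and $y = 0$ cases of this loss.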

from typing import List, Union

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data
import pytorch_lightning as pl



def fixed_unigram_candidate_sampler(
    true_classes: Union[np.array, torch.Tensor],
    num_true: int,
    num_samples: int,
    range_max: int,
    unigrams: List[Union[int, float]],
    unique: bool = False,    
    distortion: float = 1.):
    """
    Generate negative candidates given the positive examples. Ported from the TensorFlow implementation to plain Python.
    Args:
        true_classes:  A Tensor of type int64 and shape [batch_size,num_true]. The target classes.
        num_true: An int. The number of target classes per training example.
        num_samples: An int. The number of classes to randomly sample.
        range_max: An int. The number of possible classes.
        unigrams: A list of unigram counts or probabilities, one per ID in sequential order. Exactly one of vocab_file and unigrams should be passed to this operation.
        unique: A bool. Determines whether all sampled classes in a batch are unique.
        distortion: distortion is used to skew the unigram probability distribution. Each weight is first raised to the distortion's power before adding to the internal unigram distribution. As a result, distortion = 1.0 gives regular unigram sampling (as defined by the vocab file), and distortion = 0.0 gives a uniform distribution.
    """

    if isinstance(true_classes, torch.Tensor):
        true_classes = true_classes.detach().cpu().numpy()

    unigrams = np.array(unigrams)
    if distortion != 1.:
        unigrams = unigrams.astype(np.float64) ** distortion

    result = []
    has_seen = set()
    for i in range(len(true_classes)):
        for j in range(len(true_classes[i])):
            has_seen.add(true_classes[i][j])
    
    if range_max < num_samples and unique:
        raise Exception("Range max is lower than num samples")
    
    
    while len(result) < num_samples:
        sampler = torch.utils.data.WeightedRandomSampler(unigrams, num_samples,)
        candidates = np.array(list(sampler))
        
        for item in candidates:
            if unique:
                if item not in has_seen:
                    result.append(item)
                    has_seen.add(item)
            else:
                result.append(item)
        
    # the last round may overshoot when unique=True, so trim to the requested size
    return result[:num_samples]
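
For intuition, here is a minimal standard-library sketch of distorted unigram sampling. It is not the sampler above: it draws with `random.choices`, and the `exclude` argument and the default distortion of 0.75 (the value commonly used in word2vec) are assumptions added for illustration, while the notebook's call keeps the default distortion of 1.0. The function name `sample_negatives` and the counts are hypothetical.

```python
import random


def sample_negatives(counts, num_samples, distortion=0.75, exclude=(), rng=None):
    """Draw ids with probability proportional to counts[id] ** distortion."""
    rng = rng or random.Random()
    banned = set(exclude)
    ids = [i for i, c in enumerate(counts) if c > 0 and i not in banned]
    weights = [counts[i] ** distortion for i in ids]
    return rng.choices(ids, weights=weights, k=num_samples)


# Index 0 is the padding id and has a zero count, so it is never drawn.
counts = [0, 100, 10, 1]
negatives = sample_negatives(counts, num_samples=8, exclude=[2], rng=random.Random(0))
print(negatives)  # 8 ids drawn from {1, 3}
```

A distortion below 1 flattens the distribution, so rarely seen movies are picked as negatives more often than their raw counts would suggest.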

class SkigGram(pl.LightningModule):
    def __init__(
        self, 
        embedding_size: int, 
        embedding_dim: int,
        side_information_schema: dict,
        word_count_table: np.ndarray,
        lr: float = 0.001,
        negative: int = 5,
        ):
        """
        Args:
            embedding_size: An int, unique word number
            embedding_dim: An int, embedding dimension
            side_information_schema: A dict, side information schema contains name, type of side information
            word_count_table: A array, define word count of item in order to generate negative sampling
        """  
        super(SkigGram, self).__init__()
        self.side_information_schema = side_information_schema
        self.embedding_size = embedding_size

        self.embedding_dim = embedding_dim
        self.negative = negative
        self.wc = word_count_table
        self.lr = lr

        # center word embedding
        self.center_word_embed = nn.Embedding(embedding_size, embedding_dim,padding_idx=0)
        # neighbor word embedding
        self.neighbor_word_embed = nn.Embedding(embedding_size, embedding_dim,padding_idx=0)
        
        self.si_dict = nn.ModuleDict()

        # side information embedding
        self.si_keys = []
        
        for key, info in side_information_schema.items():
            self.si_keys.append(key)
            
            if info["type"] == "categorical":
                self.si_dict[key] = nn.Embedding(info["size"],embedding_dim,padding_idx=0)
            elif info["type"] == "categorical_list":
                self.si_dict[key] = nn.EmbeddingBag(info["size"],embedding_dim,padding_idx=0)
        
        self._weight_init()

    def _weight_init(self):
        with torch.no_grad():
            self.center_word_embed.weight.data.normal_(0., 0.01)
            self.neighbor_word_embed.weight.data.normal_(0., 0.01)

            for key in self.si_dict.keys():
                self.si_dict[key].weight.data.normal_(0., 0.01)
            
        # init side information weights as an nn.Parameter (a buffer would never be updated by the optimizer)
        self.embedding_weight = nn.Parameter(torch.rand((len(self.si_keys) + 1, 1)))
    
    def compute_vector(self, data:dict):
        embed_center_word = self.center_word_embed(data["center_word"].to(self.device)) # batch_size * embed_dim

        information_list = [embed_center_word]

        # side information
        for k in self.si_keys:
            information_list.append(self.si_dict[k](data[k].to(self.device)))
        
        # word and side information embeding list
        information_embed = torch.cat(information_list, dim=0).view(len(information_list), -1, self.embedding_dim)
        
        exp_embedding_weights = torch.exp(self.embedding_weight.view(-1, 1, 1))
        
        weight_sum_pooling = information_embed * exp_embedding_weights / torch.sum(exp_embedding_weights)
        
        embed_center_word_side_information = torch.sum(weight_sum_pooling, dim=0)
        
        return embed_center_word_side_information
    
    def forward(self, data: dict):
        """
        Argument
            data: dictionary of torch tensor
        """
        # assert "center_word" in data

        embed_center_word = self.center_word_embed(data["center_word"].to(self.device)) # batch_size * embed_dim

        information_list = [embed_center_word]

        # neighbor word
        embed_neighbor_word = self.neighbor_word_embed(data["neighbor_word"].to(self.device)) # batch_size * 1 * embed_dim
        
        # neg word
        neg_word = torch.tensor(fixed_unigram_candidate_sampler(
            data["center_word"].unsqueeze(1),
            num_true=1,
            num_samples=len(data["center_word"]) * self.negative,
            range_max=len(self.wc),
            unigrams=self.wc,
        ),dtype=torch.int).reshape((-1,self.negative))
        
        embed_neg_word = self.neighbor_word_embed(neg_word.to(self.device)) # batch_size * K * embed_dim       

        # side information
        for k in self.si_keys:
            information_list.append(self.si_dict[k](data[k].to(self.device)))
        

        # word and side information embeding list
        information_embed = torch.cat(information_list, dim=0).view(len(information_list), -1, self.embedding_dim)
        
        exp_embedding_weights = torch.exp(self.embedding_weight.view(-1, 1, 1))
        
        weight_sum_pooling = information_embed * exp_embedding_weights / torch.sum(exp_embedding_weights)
        
        embed_center_word_side_information = torch.sum(weight_sum_pooling, dim=0)        

        score = torch.sum(torch.mul(embed_center_word_side_information, embed_neighbor_word.squeeze()), dim=1)
        score = torch.clamp(score, max=10, min=-10)
        score = -F.logsigmoid(score)

        # squeeze only the last dim so a batch of size 1 keeps its batch dimension
        neg_score = torch.bmm(embed_neg_word, embed_center_word_side_information.unsqueeze(2)).squeeze(2)
        neg_score = torch.clamp(neg_score, max=10, min=-10)
        neg_score = -torch.sum(F.logsigmoid(-neg_score), dim=1)
        return torch.mean(score + neg_score)

    def save_embedding(self, id2word, file_name):
        # Write the center-word embeddings in word2vec text format
        embedding = self.center_word_embed.weight.detach().cpu().numpy()
        with open(file_name, 'w') as f:
            f.write('%d %d\n' % (len(id2word), self.embedding_dim))
            for wid, w in id2word.items():
                e = ' '.join(map(str, embedding[wid]))
                f.write('%s %s\n' % (w, e))

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(),lr=self.lr)
        return optimizer
    
    def training_step(self, train_batch, batch_idx):
        x = train_batch
        # call this module itself rather than the global `model` variable
        loss = self(x).mean()
        
        self.log('train_loss', loss)
        
        return loss    
    
    def write_histograms(self,):
        # iterating through all parameters
        for name,params in self.named_parameters():          
            self.logger.experiment.add_histogram(name,params,self.current_epoch)
            
    def training_epoch_end(self,outputs):
        self.write_histograms()
        
        return super().training_epoch_end(outputs)

Algorithm Training

We use the PyTorch Lightning Trainer to simplify multi-GPU training and multi-process data loading.

import torch
from pytorch_lightning import loggers as pl_loggers

num_gpus = torch.cuda.device_count()

lr = 0.001
epochs = 5
negative_items = 5
batch_size = 256 * num_gpus
num_items = len(movie_metadata_info) + 3
embedding_dim = 50

data_loader = get_dataloader(
    movie_ds,
    feature_schema,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_gpus)

model = SkigGram(
    num_items,
    embedding_dim,
    feature_schema,
    movie_ds.get_word_count(),
    lr=lr,
    negative=negative_items,)

tb_logger = pl_loggers.TensorBoardLogger("logs/",)
trainer = pl.Trainer(
    gpus=num_gpus, 
    strategy="dp",
    max_epochs=epochs,
    accelerator="gpu", 
    log_every_n_steps=1000,
    flush_logs_every_n_steps=1000,
    logger=tb_logger)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/home/jovyan/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:59: LightningDeprecationWarning: Setting `Trainer(flush_logs_every_n_steps=1000)` is deprecated in v1.5 and will be removed in v1.7. Please configure flushing in the logger instead.
  rank_zero_deprecation(

trainer.fit(model, data_loader,)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name                | Type       | Params
---------------------------------------------------
0 | center_word_embed   | Embedding  | 3.1 M 
1 | neighbor_word_embed | Embedding  | 3.1 M 
2 | si_dict             | ModuleDict | 3.3 M 
---------------------------------------------------
9.5 M     Trainable params
0         Non-trainable params
9.5 M     Total params
38.055    Total estimated model params size (MB)
/home/jovyan/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:132: UserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(



Training: 0it [00:00, ?it/s]

We uploaded the TensorBoard logs to TensorHub. Check them out via this link

Testing model

First, we extract the movie embeddings from the trained model.

from gensim.models import KeyedVectors


def convert_to_tensor(entity_id, data):
    """Pack an item id and its side-information ids into a batch of size one."""
    tmp = {"center_word": torch.tensor([entity_id], dtype=torch.int)}

    for k, v in data.items():
        # An empty feature list falls back to index 0
        if len(v) == 0:
            v = [0]
        tmp[k] = torch.tensor([v], dtype=torch.int)

    return tmp


# One row per item id; rows for ids missing from movie_metadata_info stay all-zero
X = np.zeros((len(movie_metadata_info) + 3, embedding_dim))
for entity_id, entity_info in tqdm(movie_metadata_info.items()):
    X[entity_id] = model.compute_vector(convert_to_tensor(entity_id, entity_info)).detach().cpu().numpy()

# Index the vectors by item id for nearest-neighbour search
kv_metadata = KeyedVectors(vector_size=embedding_dim, count=len(X))
for i in range(1, len(X) - 1):
    kv_metadata.add_vector(i, X[i])

100%|██████████| 62423/62423 [00:13<00:00, 4691.08it/s]

import tensorflow as tf
import tensorboard as tb
tf.io.gfile = tb.compat.tensorflow_stub.io.gfile

labels = df_movie_joint_encoder["title"].tolist()
tb_logger.experiment.add_embedding(X[:len(labels)],labels)

Imagine that we have just added Spider-Man: No Way Home, a new movie released in December 2021, to our system.

Because the movie is new, it has no interaction history, yet we still want to find movies similar to it. To make that possible, our team assigns genres and tags to the movie.

import json
response = """{"status":"success","data":{"movieId":263007,"totalTagNum":112,"scoredTags":[{"tag":"Marvel Cinematic Universe","tagCountsViewModel":{"total":21,"positive":15,"neutral":3,"negative":3,"score":21.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Marvel","tagCountsViewModel":{"total":18,"positive":12,"neutral":4,"negative":2,"score":18.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"multiverse","tagCountsViewModel":{"total":15,"positive":12,"neutral":3,"negative":0,"score":15.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"spider-man","tagCountsViewModel":{"total":12,"positive":11,"neutral":1,"negative":0,"score":12.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"I'm something of a scientist myself","tagCountsViewModel":{"total":10,"positive":9,"neutral":1,"negative":0,"score":10.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Fan service","tagCountsViewModel":{"total":9,"positive":6,"neutral":3,"negative":0,"score":9.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"great cgi","tagCountsViewModel":{"total":9,"positive":9,"neutral":0,"negative":0,"score":9.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"tom holland","tagCountsViewModel":{"total":6,"positive":3,"neutral":3,"negative":0,"score":6.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Andrew Garfield","tagCountsViewModel":{"total":5,"positive":5,"neutral":0,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"New York 
City","tagCountsViewModel":{"total":5,"positive":4,"neutral":1,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Tobey Maguire","tagCountsViewModel":{"total":5,"positive":5,"neutral":0,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Willem Dafoe","tagCountsViewModel":{"total":5,"positive":5,"neutral":0,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"actor reprises previous role","tagCountsViewModel":{"total":5,"positive":4,"neutral":1,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"crossover","tagCountsViewModel":{"total":5,"positive":4,"neutral":1,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"magic","tagCountsViewModel":{"total":5,"positive":4,"neutral":1,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"nostalgia","tagCountsViewModel":{"total":5,"positive":4,"neutral":0,"negative":1,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"sorcerer","tagCountsViewModel":{"total":5,"positive":4,"neutral":1,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"casting a spell","tagCountsViewModel":{"total":4,"positive":3,"neutral":1,"negative":0,"score":4.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Alfred 
Molina","tagCountsViewModel":{"total":3,"positive":3,"neutral":0,"negative":0,"score":3.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"lawyer","tagCountsViewModel":{"total":3,"positive":1,"neutral":1,"negative":1,"score":2.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"sandman the marvel comics character","tagCountsViewModel":{"total":3,"positive":0,"neutral":1,"negative":2,"score":3.0,"dominantAffect":"negative"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"statue of liberty","tagCountsViewModel":{"total":3,"positive":2,"neutral":1,"negative":0,"score":3.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"green goblin character","tagCountsViewModel":{"total":2,"positive":1,"neutral":1,"negative":0,"score":2.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"trippy","tagCountsViewModel":{"total":2,"positive":1,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"unmasked","tagCountsViewModel":{"total":2,"positive":1,"neutral":1,"negative":0,"score":2.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":" Nostalgia Done Right","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"2020s","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Beautifully completes Tom Holland's evolution from a boy reliant on the avengers and Tony to a man with his own suit, his own problems and his own responsibility, because after all with great power comes great 
responsibility","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":0.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Doctor Strange","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Dr. Strange","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Greatest marvel film ever","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":0.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Statue of Liberty","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Zendaya","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"a list","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"action","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"american patriotism","tagCountsViewModel":{"total":1,"positive":0,"neutral":0,"negative":1,"score":1.0,"dominantAffect":"negative"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"based on comic","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"based on comic 
book","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"bechdel test: fail","tagCountsViewModel":{"total":1,"positive":0,"neutral":0,"negative":1,"score":1.0,"dominantAffect":"negative"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"best friend","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"best hits","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"biopunk","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"bomb","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":0.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"bridge","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"british actor playing american 
character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"cameos","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"car","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"cinematography","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"comic book","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"construction site","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"costume","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"costumed hero","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"doctor octopus character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"doctor strange character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"dr. 
curt connors character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"electricity","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"electro character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"exciting","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"falling from height","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"fight","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"final showdown","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"fixes Spider-Man","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"framed for 
murder","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"friend","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"funny","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"good characters","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"greatest hits","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"hero","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"high school","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"identity revealed","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"j. 
jonah jameson character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"lightning","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"lizard character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"love interest","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"many characters","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"many villains","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"marvel cinematic universe (mcu)","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"marvel comics","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"marvel entertainment","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"mistaken identity","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"multiple 
villains","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"night","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"nostalgic","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"overrated","tagCountsViewModel":{"total":1,"positive":0,"neutral":0,"negative":1,"score":1.0,"dominantAffect":"negative"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"peter parker character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"plot holes","tagCountsViewModel":{"total":1,"positive":0,"neutral":0,"negative":1,"score":1.0,"dominantAffect":"negative"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"power","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"protest","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"psychotronic film","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"rogues 
gallery","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"sand","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"scaffolding","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"scientist","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"sequel","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"shared universe","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"slimehouse","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"spell gone awry","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"spider man 
character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"superhero","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"surrealism","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"swing","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"teenage boy","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"teenage girl","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"teenage superhero","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"teenager","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"third 
part","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"thrilling","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"train","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"trio","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"villain","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"villain team up","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"writing","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null}]}}"""
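# strict=False tolerates the raw line breaks inside the string above
tags_response = json.loads(response, strict=False)
# Keep only positively rated tags; collapse whitespace (including those line breaks) and lower-case
tags = [
    " ".join(item["tag"].split()).lower()
    for item in tags_response["data"]["scoredTags"]
    if item["tagCountsViewModel"].get("dominantAffect", "") == "positive"
]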

test_movie_info = {
    "genres": ["action", "adventure", "science fiction"],
    "tag": tags
}

test_movie_tags = {k: encoder_mapper[k].transform(v) for k, v in test_movie_info.items()}

# The new movie has no id of its own, so 0 is passed as the center word
X_query = model.compute_vector(convert_to_tensor(0, test_movie_tags)).detach().cpu().numpy()
# similar_by_vector returns (movie_id, cosine similarity) pairs for the nearest items
list_candidates_ids = {movie_id: score for movie_id, score in kv_metadata.similar_by_vector(X_query[0])}
# Attach titles and sort by similarity
df_movie_joint_encoder[df_movie_joint_encoder["movieId"].isin(list_candidates_ids)]\
    .assign(score=lambda x: x["movieId"].map(lambda d: list_candidates_ids[d]))\
    .sort_values(by=["score"], ascending=False)[["title", "score"]]

/home/jovyan/miniconda3/lib/python3.8/site-packages/gensim/models/keyedvectors.py:772: RuntimeWarning: invalid value encountered in true_divide
  dists = dot(self.vectors[clip_start:clip_end], mean) / self.norms[clip_start:clip_end]

	title	score
43355	Marvel One-Shot: All Hail the King (2014)	0.763174
59441	Batman: Hush (2019)	0.732942
7098	D.O.A. (1950)	0.725314
44344	Marvel One-Shot: Agent Carter (2013)	0.722448
43369	Marvel One-Shot: The Consultant (2011)	0.719176
50639	Rendel (2017)	0.713440
27213	45 Years (2015)	0.709849
44351	Marvel One-Shot: A Funny Thing Happened on the...	0.695612
40174	The BFG (2016)	0.694954
41081	The Three Musketeers (1946)	0.689934

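The model has learned a reasonable notion of similarity for our new Spider-Man movie. Superhero titles dominate the top results: several Marvel Cinematic Universe shorts (the Marvel One-Shot entries) and a DC movie (Batman: Hush). A few results, such as D.O.A. (1950) and 45 Years (2015), look unrelated, so there is still room for improvement.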
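The RuntimeWarning printed before the table is what you would expect when some indexed rows are all-zero: a zero norm turns the cosine similarity into 0/0. That this is the cause here is an assumption, not something the notebook confirms. Below is a minimal NumPy sketch of a top-k cosine search that skips such rows. The name `cosine_top_k` and the toy matrix are hypothetical and not part of gensim or of the notebook.

```python
import numpy as np


def cosine_top_k(query, matrix, k=10):
    """Return (row_index, cosine_score) pairs for the k rows most similar to query."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1)
    query_norm = np.linalg.norm(query)
    valid = norms > 0  # rows that were never filled stay all-zero
    scores = np.full(len(matrix), -np.inf)
    if query_norm > 0:
        scores[valid] = matrix[valid] @ query / (norms[valid] * query_norm)
    top = np.argsort(-scores)[:k]
    # Drop rows that could not be scored at all
    return [(int(i), float(scores[i])) for i in top if np.isfinite(scores[i])]


# Toy check: row 0 is an empty (all-zero) slot and must never be returned.
toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
result = cosine_top_k([1.0, 0.0], toy, k=3)
```

On real data, `matrix` would be the embedding matrix `X` and `query` would be `X_query[0]`.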

End Note

We have implemented Graph Embedding with Side Information (GES) to incorporate item side information. We showed how to construct an item graph from users' behavior history and learn embeddings for all items in the graph. The item embeddings are used to compute pairwise similarities between items, which then drive the recommendation process. To alleviate the sparsity and cold-start problems, side information is incorporated into the graph embedding framework.
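As a rough illustration of how side information can stand in for a missing item vector, here is a sketch of the averaging idea behind GES. It makes two assumptions: each multi-valued field (genres, tags) is mean-pooled first, and the toy tables are random. `item_table`, `side_tables` and `aggregate` are hypothetical names, and the notebook's own `compute_vector` may work differently.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim = 4  # toy size; the notebook uses 50

# Hypothetical toy tables: one row per item / genre / tag id.
item_table = rng.normal(size=(5, embedding_dim))
side_tables = {
    "genres": rng.normal(size=(3, embedding_dim)),
    "tag": rng.normal(size=(6, embedding_dim)),
}


def aggregate(item_id, side_ids):
    """Average the item vector (if any) with the mean vector of each side-information field."""
    parts = []
    if item_id is not None:
        parts.append(item_table[item_id])
    for field, ids in side_ids.items():
        if len(ids) > 0:
            # Assumption: a multi-valued field is mean-pooled into one vector
            parts.append(side_tables[field][ids].mean(axis=0))
    if not parts:
        raise ValueError("need an item id or at least one side-information id")
    return np.mean(parts, axis=0)


warm = aggregate(2, {"genres": [0, 1], "tag": [3]})     # item seen during training
cold = aggregate(None, {"genres": [0, 1], "tag": [3]})  # new item: side information only
```

A cold-start item still gets a vector in the same space as the warm items, so it can be compared against them.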

In terms of data, we only added two kinds of metadata to the model: genres and user tags. In the future, we can add more features such as actors, actresses and directors.

The model in this notebook relies only on negative sampling. A natural next step is to integrate more advanced graph embedding approaches, such as Graph Convolutional Networks or knowledge graphs.
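For reference, the negative-sampling objective for a single (center, neighbour) pair can be sketched in a few lines of NumPy. This is the standard skip-gram formulation, written here as an illustration and not copied from the notebook's model class. The function names are hypothetical.

```python
import numpy as np


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -np.logaddexp(0.0, -x)


def sgns_loss(center, positive, negatives):
    """Negative-sampling loss for one (center, neighbour) pair.

    center:    (d,) vector of the center item
    positive:  (d,) vector of an observed neighbour
    negatives: (k, d) vectors of k sampled non-neighbours
    """
    pos_term = log_sigmoid(positive @ center)
    neg_term = log_sigmoid(-(negatives @ center)).sum()
    return -(pos_term + neg_term)


center = np.array([1.0, 0.0])
# Neighbour points the same way as the center, negative points away: low loss
aligned = sgns_loss(center, np.array([2.0, 0.0]), np.array([[-2.0, 0.0]]))
# Neighbour points away, negative points the same way: high loss
opposed = sgns_loss(center, np.array([-2.0, 0.0]), np.array([[2.0, 0.0]]))
```

The loss is small when the observed neighbour scores high and the sampled negatives score low, which is the signal the embeddings are trained on.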

Stay tuned!

Reference

Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba

Word2vec