Graph Embedding with Side Information

Posted by Steve Tran on 26 Jan 2022

Introduction

This notebook is based on the paper Billion-scale Commodity Embedding for E-commerce Recommendation. On Taobao, millions of new items are continuously uploaded each hour, and there are no user behaviors for these items. Learning item representations is important for matching and ranking so that these items can be recommended to users. Collaborative filtering based methods only compute the co-occurrence of items in users' behavior histories, so it is quite challenging to learn representations for items with few or even no interactions. The authors propose a new approach: incorporate side information to enhance the embedding vectors, dubbed Graph Embedding with Side Information. For example, items with the same brand or category should be closer in the embedding space. Throughout the rest of this notebook, we will develop a model that incorporates side information into graph embedding and test it on new items that are not in the dataset to see how it performs.
# 1. magic for inline plot
# 2. magic so that the notebook will reload external python modules
# 3. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035

%matplotlib inline
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
import shutil
import os
import glob
import json
import random
import gc
import torch


import numpy as np
import pandas as pd

from tqdm import tqdm
from collections import Counter, defaultdict

from torch.utils.data import DataLoader, Dataset

# Set fixed seed
random_state = 4111
torch.manual_seed(random_state)
random.seed(random_state)
np.random.seed(random_state)

Cold Start Recommendation

Preparation

We will use the publicly available MovieLens data throughout this experiment. At the time of writing, the MovieLens 25M dataset (ml-25m) is up to date and contains a large number of samples. You can download it via this link

!wget https://files.grouplens.org/datasets/movielens/ml-25m.zip
!unzip ml-25m.zip

Given this dataset, we need a few code chunks to clean and preprocess the data.

  • The raw movies.csv contains the movie id, movie title and genres. We will convert movie ids into ordinal ids and split genres into an array of strings
  • The raw tags.csv contains user_id, movie_id, tag and timestamp. We will drop null tags, convert tags to lowercase and aggregate them by movie
df_entity = pd.read_csv("ml-25m/movies.csv")
df_entity["genres"] = df_entity["genres"].map(lambda d: d.split("|"))
df_entity.head()
movieId title genres
0 1 Toy Story (1995) [Adventure, Animation, Children, Comedy, Fantasy]
1 2 Jumanji (1995) [Adventure, Children, Fantasy]
2 3 Grumpier Old Men (1995) [Comedy, Romance]
3 4 Waiting to Exhale (1995) [Comedy, Drama, Romance]
4 5 Father of the Bride Part II (1995) [Comedy]
df_tag = pd.read_csv("ml-25m/tags.csv").dropna()
df_tag.head()
userId movieId tag timestamp
0 3 260 classic 1439472355
1 3 260 sci-fi 1439472256
2 4 1732 dark comedy 1573943598
3 4 1732 great dialogue 1573943604
4 4 7569 so bad it's good 1573943455
df_agg_tag = df_tag.drop(["userId","timestamp"],axis=1)\
.assign(tag=lambda df: df["tag"].map(lambda d: d.lower().lstrip().rstrip()))\
.groupby(["movieId"])["tag"].agg("unique").reset_index()
df_agg_tag
movieId tag
0 1 [owned, imdb top 250, pixar, time travel, chil...
1 2 [robin williams, time travel, fantasy, based o...
2 3 [funny, best friend, duringcreditsstinger, fis...
3 4 [based on novel or book, chick flick, divorce,...
4 5 [aging, baby, confidence, contraception, daugh...
... ... ...
45246 208813 [might like]
45247 208933 [black and white, deal with the devil]
45248 209035 [computer animation, japan, mass behavior, mas...
45249 209037 [chameleon, computer animation, gluttony, humo...
45250 209063 [black, education, friends schools, independen...

45251 rows × 2 columns

df_movie_joint = df_entity.merge(df_agg_tag, on=["movieId"],how="outer")
df_movie_joint
movieId title genres tag
0 1 Toy Story (1995) [Adventure, Animation, Children, Comedy, Fantasy] [owned, imdb top 250, pixar, time travel, chil...
1 2 Jumanji (1995) [Adventure, Children, Fantasy] [robin williams, time travel, fantasy, based o...
2 3 Grumpier Old Men (1995) [Comedy, Romance] [funny, best friend, duringcreditsstinger, fis...
3 4 Waiting to Exhale (1995) [Comedy, Drama, Romance] [based on novel or book, chick flick, divorce,...
4 5 Father of the Bride Part II (1995) [Comedy] [aging, baby, confidence, contraception, daugh...
... ... ... ... ...
62418 209157 We (2018) [Drama] NaN
62419 209159 Window of the Soul (2001) [Documentary] NaN
62420 209163 Bad Poems (2018) [Comedy, Drama] NaN
62421 209169 A Girl Thing (2001) [(no genres listed)] NaN
62422 209171 Women of Devil's Island (1962) [Action, Adventure, Drama] NaN

62423 rows × 4 columns

Preprocessing

Ordinal Encoding

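We need a custom ordinal encoder for our use case. It transforms categorical values, or lists of categories/ids, into ordinal ids that start from a configurable index. Unknown scalar values are replaced by a reserved id (0 by default), and unknown items inside a list are dropped, so an empty value becomes an empty list.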
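To make the idea concrete before the full class, here is a minimal sketch of ordinal encoding for a list-valued feature. The helper names `fit_vocab` and `encode_rows` and the toy genres are assumptions for illustration only; they are not part of the notebook's encoder.

```python
def fit_vocab(rows, start_from=1):
    """Map every distinct label in list-valued rows to an id >= start_from."""
    labels = sorted({label for row in rows for label in row})
    return {label: start_from + idx for idx, label in enumerate(labels)}


def encode_rows(rows, vocab):
    """Encode each row; labels missing from the vocabulary are dropped."""
    return [[vocab[label] for label in row if label in vocab] for row in rows]


# Toy data: id 0 stays free so it can serve as the unknown/padding id
genres = [["Comedy", "Romance"], ["Comedy"], []]
vocab = fit_vocab(genres)
encoded = encode_rows(genres + [["Western"]], vocab)
print(vocab)
print(encoded)
```

Starting the ids at 1 keeps 0 available as a padding id, which matters later when the lists are padded to a common length.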

class OrdinalEncoder():
    """
    Convert categorical values into ordinal integer ids.
    Unknown scalar values are replaced by the `unknown` id;
    unknown items inside list-valued rows are dropped.
    """
    def __init__(self, start_from=1, unknown=0):
        self.vocabs = {}
        self.wc = defaultdict(int)
        self.inv = []
        self.start_from = start_from
        self.unknown = unknown

    def __len__(self):
        return len(self.inv)

    def fit(self, X):
        # Count how often each value occurs
        for i in range(len(X)):
            self.wc[X[i]] += 1

        X_uniq = np.unique(X)

        # Positions below start_from are reserved (e.g. 0 for unknown/padding)
        self.inv = [0] * (len(X_uniq) + self.start_from)

        for idx, item in enumerate(X_uniq):
            self.vocabs[item] = self.start_from + idx
            self.inv[self.start_from + idx] = item

        self.inv = np.array(self.inv)

        return self

    def transform(self, X):
        if isinstance(X[0], (list, np.ndarray)):
            # List-valued rows: encode item by item and drop unknown items
            res = []
            for idx in range(len(X)):
                tmp = [self.vocabs.get(item) for item in X[idx] if item in self.vocabs]

                res.append(tmp)
            return res
        else:
            return np.array([self.vocabs.get(item, self.unknown) for item in X], dtype="int")

    def fit_transform(self, X):
        return self.fit(X).transform(X)

    def inverse_transform(self, X):
        if len(X) == 0:
            return []

        X = np.array(X, dtype="int")

        return self.inv[X]

    @property
    def n_classes_(self):
        return len(self.inv)

    def word_count_table(self):
        return [self.wc[w] for w in self.inv]

class MultiLabelEncoder(OrdinalEncoder):
    """
    Convert multi-labels into ordinal ids
    """
    def fit(self, X: list):
        # Flatten the list-valued rows; non-list rows (e.g. NaN) are skipped
        X_extend = []
        for i in range(len(X)):
            if isinstance(X[i], (list, np.ndarray)):
                X_extend.extend(X[i])

        super().fit(X_extend)

        return self

metadata_cols = ["genres","tag"]
movie_encoder = OrdinalEncoder(start_from=1).fit(df_movie_joint["movieId"].tolist())
encoder_mapper = {col: MultiLabelEncoder(start_from=1).fit(df_movie_joint[col].tolist()) for col in metadata_cols}
df_movie_joint_encoder = df_movie_joint[["movieId"]].copy(deep=True)
df_movie_joint_encoder["movieId"] = movie_encoder.transform(df_movie_joint["movieId"].tolist())

for col in metadata_cols:
    df_movie_joint_encoder[col] = encoder_mapper[col].transform(df_movie_joint[col].fillna("").tolist())

df_movie_joint_encoder["movieId"] = df_movie_joint_encoder["movieId"].astype("int")
movie_metadata_info = df_movie_joint_encoder.set_index(["movieId"])[metadata_cols].to_dict(orient="index")

df_movie_joint_encoder["title"] = df_movie_joint["title"]
df_movie_joint_encoder
movieId genres tag title
0 1 [3, 4, 5, 6, 10] [42617, 27920, 44422, 58453, 10932, 12378, 225... Toy Story (1995)
1 2 [3, 5, 10] [48952, 58453, 20353, 5932, 7677, 16422, 23585... Jumanji (1995)
2 3 [6, 16] [22590, 6651, 17688, 21398, 41566, 51540, 3803... Grumpier Old Men (1995)
3 4 [6, 9, 16] [5956, 10772, 16687, 28762, 52968, 12059, 4824... Waiting to Exhale (1995)
4 5 [6] [1785, 5229, 12797, 12978, 14653, 25202, 37295... Father of the Bride Part II (1995)
... ... ... ... ...
62418 62419 [9] [] We (2018)
62419 62420 [8] [] Window of the Soul (2001)
62420 62421 [6, 9] [] Bad Poems (2018)
62421 62422 [1] [] A Girl Thing (2001)
62422 62423 [2, 3, 9] [] Women of Devil's Island (1962)

62423 rows × 4 columns

df_movie_joint_encoder.to_parquet("./df_movie_joint_encoder.pq",index=False)

Given ratings.csv, we will sort each user's ratings by timestamp in order to create time-ordered sequences.

Note that we use users' rating behavior as a proxy for their watch behavior.

Random walk methods

We convert the ratings into sequences using the random walk technique. We leverage the DeepWalk source code from Bryan Perozzi.
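As a rough illustration of what a truncated random walk does, here is a minimal sketch on a toy graph. The adjacency dictionary and the standalone `random_walk` function are assumptions for illustration; the notebook itself relies on the Graph class that follows.

```python
import random


def random_walk(adj, start, path_length, rng):
    """Walk from `start`, picking a uniformly random neighbor at each step."""
    path = [start]
    while len(path) < path_length:
        neighbors = adj.get(path[-1], [])
        if not neighbors:
            break  # dead end: stop the walk early
        path.append(rng.choice(neighbors))
    return path


# Hypothetical undirected graph over four movie ids
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
rng = random.Random(0)
walks = [random_walk(adj, start, path_length=5, rng=rng) for start in adj]
for walk in walks:
    print(walk)
```

Each walk plays the role of a sentence, and each movie id the role of a word, in the skip-gram training that comes later.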

Implement Graph class

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Graph utilities."""

import logging
import random
from collections import defaultdict
from itertools import zip_longest
from time import time

logger = logging.getLogger("deepwalk")

LOGFORMAT = "%(asctime).19s %(levelname)s %(filename)s: %(lineno)s %(message)s"


class Graph(defaultdict):
    """Efficient basic implementation of nx `Graph' Undirected graphs with self loops"""

    def __init__(self):
        super(Graph, self).__init__(list)

    def nodes(self):
        return self.keys()

    def adjacency_iter(self):
        # Python 3 dicts expose items(), not iteritems()
        return iter(self.items())

    def random_walk(self, path_length, alpha=0, rand=random.Random(), start=None):
        """ Returns a truncated random walk.

            path_length: Length of the random walk.
            alpha: probability of restarts.
            start: the start node of the random walk.
        """
        G = self
        if start:
            path = [start]
        else:
            # Sampling is uniform w.r.t V, and not w.r.t E
            path = [rand.choice(list(G.keys()))]

        while len(path) < path_length:
            cur = path[-1]
            if len(G[cur]) > 0:
                if rand.random() >= alpha:
                    path.append(rand.choice(G[cur]))
                else:
                    path.append(path[0])
            else:
                break
        return [str(node) for node in path]

    def make_undirected(self):

        t0 = time()

        # Add the reverse of every edge (self loops excluded)
        for v in list(self):
            for other in self[v]:
                if v != other:
                    self[other].append(v)

        t1 = time()
        logger.info('make_undirected: added missing edges {}s'.format(t1-t0))

        self.make_consistent()
        return self

    def make_consistent(self):
        t0 = time()
        # Deduplicate and sort every adjacency list
        for k in list(self.keys()):
            self[k] = list(sorted(set(self[k])))

        t1 = time()
        logger.info('make_consistent: made consistent in {}s'.format(t1-t0))

        self.remove_self_loops()

        return self

    def remove_self_loops(self):

        removed = 0
        t0 = time()

        for x in self:
            if x in self[x]:
                self[x].remove(x)
                removed += 1

        t1 = time()

        logger.info(
            'remove_self_loops: removed {} loops in {}s'.format(removed, (t1-t0)))
        return self


def build_deepwalk_corpus_iter(G, num_paths, path_length, alpha=0,
                               rand=random.Random(0)):

    nodes = list(G.nodes())

    # One pass per path: shuffle the nodes and start one walk from each of them
    for cnt in range(num_paths):
        rand.shuffle(nodes)
        for node in nodes:
            yield G.random_walk(path_length, rand=rand, alpha=alpha, start=node)

# http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python
def grouper(n, iterable, padvalue=None):
    "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)


def parse_adjacencylist(f):
    adjlist = []
    for l in f:
        if l and l[0] != "#":
            introw = [str(x) for x in l.strip().split()]
            row = [introw[0]]
            row.extend(set(sorted(introw[1:])))
            adjlist.extend([row])

    return adjlist


def parse_adjacencylist_unchecked(f):
    adjlist = []
    for l in f:
        # grouper pads the last chunk with None, hence the truthiness check
        if l and l[0] != "#":
            adjlist.extend([[str(x) for x in l.strip().split()]])

    return adjlist


def load_adjacencylist(file_, undirected=False, chunksize=10000,):

    parse_func = parse_adjacencylist_unchecked
    convert_func = from_adjlist_unchecked

    adjlist = []

    t0 = time()

    total = 0
    with open(file_) as f:
        for idx, adj_chunk in enumerate(map(parse_func, grouper(int(chunksize), f))):
            adjlist.extend(adj_chunk)
            total += len(adj_chunk)

    t1 = time()

    logger.info('Parsed {} edges with {} chunks in {}s'.format(
        total, idx, t1-t0))

    t0 = time()
    G = convert_func(adjlist)
    t1 = time()

    logger.info('Converted edges to graph in {}s'.format(t1-t0))

    if undirected:
        t0 = time()
        G = G.make_undirected()
        t1 = time()
        logger.info('Made graph undirected in {}s'.format(t1-t0))

    return G


def from_adjlist_unchecked(adjlist):
    G = Graph()

    # The first entry of each row is the node, the rest are its neighbors
    for row in adjlist:
        node = row[0]
        neighbors = row[1:]
        G[node] = neighbors

    return G

Generate sequences

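df_rating = pd.read_csv("ml-25m/ratings.csv")
# Sort each user's ratings chronologically so the aggregated lists are time-ordered
df_rating = df_rating.sort_values(["userId", "timestamp"])
df_rating["movieId"] = movie_encoder.transform(df_rating["movieId"].tolist())
df_rating_path = df_rating.groupby(["userId"])["movieId"].agg(list).reset_index()

del df_rating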
random_seed = 4111
number_walks = 10
walk_length = 20
restart_prob = 0.

min_len = 3


# Write one line per user: the encoded movie ids of that user's ratings
with open("paths_filtered.txt", "w") as f, tqdm(total=len(df_rating_path)) as pbar:
    for _, row in df_rating_path.iterrows():
        all_paths = row["movieId"]

        # Skip short sequences
        if len(all_paths) < min_len:
            continue

        f.write(" ".join([str(node) for node in all_paths]) + "\n")

        pbar.update(1)

# Each line is read as an adjacency list: first id is the node, the rest are neighbors
G = load_adjacencylist("paths_filtered.txt", undirected=True)

cursor = build_deepwalk_corpus_iter(G,
                                    num_paths=number_walks,
                                    path_length=walk_length,
                                    alpha=restart_prob,
                                    rand=random.Random(random_seed))

with open("walks.txt", "w") as f:
    for path in cursor:
        f.write(" ".join(path) + "\n")
100%|██████████| 162541/162541 [00:09<00:00, 16561.69it/s]

Deep dive into model

Generate skip-gram and negative sampling for picking movies

feature_schema = {
"genres": {
"type": "categorical_list", # there are categorical_list and categorical
"size": encoder_mapper["genres"].n_classes_,
},
"tag": {
"type": "categorical_list",
"size": encoder_mapper["tag"].n_classes_,
},
}
def generate_skipgram(sentence, i, window_size, unk: int = 0):
    # Center word plus a fixed-width context window, padded with `unk` on both sides
    iword = sentence[i]
    left = sentence[max(i - window_size, 0): i]
    right = sentence[i + 1: i + 1 + window_size]
    return iword, [unk for _ in range(window_size - len(left))] + left + right + [unk for _ in range(window_size - len(right))]

class MovieLenGraphPathDataset(Dataset):
    """
    MovieLen Torch Dataset
    """
    def __init__(
            self,
            file_path,
            side_info_lookup: dict = None,
            feature_schema: dict = None,
            window_size: int = 5,
            subsample_rate: float = 0) -> None:

        """
        file_path - string : Line text sequence file
        side_info_lookup - dict: maps an encoded movie id to its side information
        """
        self.data = []

        self.wc = defaultdict(int)
        self.window_size = window_size
        self.side_info_lookup = side_info_lookup
        self.feature_schema = feature_schema

        cols = list(feature_schema.keys())

        with open(file_path) as f:
            step = 0
            for line in f:
                if step % 1000 == 0:
                    print("working on line: {}".format(step), end="\r")

                line = [int(w.strip()) for w in line.strip().split() if len(w.strip()) > 0]

                for i in range(len(line)):
                    self.wc[line[i]] += 1

                    center_word, neighbor_words = generate_skipgram(line, i, window_size)

                    if center_word not in self.side_info_lookup:
                        continue

                    for neighbor_word in neighbor_words:
                        # 0 is the padding id added by generate_skipgram
                        if neighbor_word == 0:
                            continue

                        tmp = {"center_word": center_word, "neighbor_word": neighbor_word, "side_information": {}}

                        for col in cols:
                            if col in self.side_info_lookup[center_word]:
                                tmp["side_information"][col] = self.side_info_lookup[center_word][col]

                        self.data.append(tmp)

                step += 1

        # Word counts indexed by encoded movie id (index 0 is reserved)
        self.wc_arr = np.zeros(len(side_info_lookup) + 1, dtype=int)
        for k, v in self.wc.items():
            self.wc_arr[k] = v

        if subsample_rate > 0:
            # Drop pairs whose center word is very frequent (word2vec-style subsampling)
            wf = self.wc_arr / self.wc_arr.sum()
            ws = 1 - np.sqrt(subsample_rate / wf)
            ws = np.clip(ws, 0, 1)

            data = []

            for i in range(len(self.data)):
                # Each sample is a dict, so look up the center word by key
                if random.random() > ws[self.data[i]["center_word"]]:
                    data.append(self.data[i])

            self.data = data


    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

    def get_word_count(self):
        return self.wc_arr
movie_ds = MovieLenGraphPathDataset("walks.txt",
side_info_lookup=movie_metadata_info,
feature_schema=feature_schema,
window_size=5)
working on line: 88000
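
To make the windowing concrete, the sketch below reproduces the padding logic of generate_skipgram on a toy walk and lists the resulting (center, neighbor) pairs. The function name `skipgram_pairs` and the toy ids are assumptions for illustration.

```python
def skipgram_pairs(sentence, window_size, pad=0):
    """Return (center, context) pairs, skipping padded context slots."""
    pairs = []
    for i, center in enumerate(sentence):
        left = sentence[max(i - window_size, 0): i]
        right = sentence[i + 1: i + 1 + window_size]
        # Pad both sides to a fixed width, mirroring generate_skipgram
        context = ([pad] * (window_size - len(left)) + left
                   + right + [pad] * (window_size - len(right)))
        pairs.extend((center, c) for c in context if c != pad)
    return pairs


walk = [11, 12, 13, 14]
pairs = skipgram_pairs(walk, window_size=1)
print(pairs)
# → [(11, 12), (12, 11), (12, 13), (13, 12), (13, 14), (14, 13)]
```

Ids at the edges of a walk simply get fewer pairs, because the padded slots are skipped.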

Wrap-up with Torch DataLoader

import torch
def padding_tensor(arr, maxlen, dtype):
    # Right-pad every row with 0 (the padding id); keep at least one column
    # so that a batch made only of empty lists still yields a valid 2-D tensor
    padded_sess = torch.zeros(len(arr), max(maxlen, 1), dtype=dtype)

    for i in range(len(arr)):
        padded_sess[i, :len(arr[i])] = arr[i]

    return padded_sess

def get_dataloader(dataset, feature_mapper, batch_size=64, shuffle=True, num_workers=0):

    def collate_fn(inputs):
        outputs = {k: [] for k in feature_mapper.keys()}
        outputs["center_word"] = []
        outputs["neighbor_word"] = []

        # Longest list per side-information column in this batch
        max_len_mapper = defaultdict(int)

        for i in range(len(inputs)):
            outputs["center_word"].append(torch.tensor(inputs[i]["center_word"], dtype=torch.int))
            outputs["neighbor_word"].append(torch.tensor(inputs[i]["neighbor_word"], dtype=torch.int))

            for k in feature_mapper.keys():
                outputs[k].append(torch.tensor(inputs[i]["side_information"][k], dtype=torch.int))
                max_len_mapper[k] = max(len(inputs[i]["side_information"][k]), max_len_mapper[k])

        outputs["center_word"] = torch.tensor(outputs["center_word"], dtype=torch.int)
        outputs["neighbor_word"] = torch.tensor(outputs["neighbor_word"], dtype=torch.int)

        for k in feature_mapper.keys():
            if feature_mapper[k]["type"] == "categorical":
                outputs[k] = torch.tensor(outputs[k], dtype=torch.int)

            elif feature_mapper[k]["type"] == "categorical_list":
                outputs[k] = padding_tensor(outputs[k], max_len_mapper[k], dtype=torch.int)

        return outputs

    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn, num_workers=num_workers)
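
The padding step in the collate function can be pictured with plain Python lists. This is only a sketch of the idea, assuming 0 is the padding id; the helper `pad_lists` is hypothetical and stands in for the tensor-based padding_tensor above.

```python
def pad_lists(rows, pad=0):
    """Right-pad variable-length id lists to the longest row in the batch."""
    max_len = max((len(row) for row in rows), default=0)
    max_len = max(max_len, 1)  # keep at least one column, even if every row is empty
    return [list(row) + [pad] * (max_len - len(row)) for row in rows]


# Toy batch: three movies with 3, 1 and 0 tag ids
batch_tags = [[5, 9, 2], [7], []]
padded = pad_lists(batch_tags)
print(padded)
# → [[5, 9, 2], [7, 0, 0], [0, 0, 0]]
```

Because every row now has the same length, the batch can be stored as one rectangular tensor, and the padding id is ignored by embedding layers created with padding_idx=0.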

Weighted Skip Gram

Let's start off by defining the problem. For the sake of clarity, we use $W$ to denote the embedding matrix of items or side information (SI). Specifically, $W_{v}^{0}$ denotes the embedding of item $v$, and $W_{v}^{s}$ denotes the embedding of the $s$-th type of SI attached to item $v$.

Then, for item $v$ with $n$ types of SI, we have $n + 1$ vectors $W_{v}^{0}, \dots, W_{v}^{n} \in R^{d}$, where $d$ is the embedding dimension. The paper proposes a weighted layer to aggregate the embeddings of the SI related to an item. Let $a_{v}^{s}$ be the weight of the $s$-th type of SI of item $v$, with $a_{v}^{0}$ denoting the weight of item $v$ itself. The aggregated embedding is defined as follows:

$$H_{v} = \frac{\sum_{j=0}^{n} e^{a_{v}^{j}} W_{v}^{j}}{\sum_{j=0}^{n} e^{a_{v}^{j}}}$$

where we use $e^{a_{v}^{j}}$ instead of $a_{v}^{j}$ to ensure that the contribution of each SI is positive, and $\sum_{j=0}^{n} e^{a_{v}^{j}}$ normalizes the weights of the SI embeddings.

For node $v$ and its context node $u$ in the training data, we use $Z_{u} \in R^{d}$ to represent its embedding and $y$ to denote the label. The objective function is defined below:

$$L(v, u, y) = -\left[ y \log \sigma(H_{v}^{T} Z_{u}) + (1 - y) \log\left(1 - \sigma(H_{v}^{T} Z_{u})\right) \right]$$
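
As a small numeric illustration of the weighted aggregation with exponentiated weights and of the binary cross-entropy objective, consider the sketch below. The 2-d vectors, the weights and the function names are assumptions chosen for readability, not values from the model.

```python
import math


def aggregate(vectors, weights):
    """Weighted average of embeddings with exponentiated (always positive) weights."""
    exp_w = [math.exp(w) for w in weights]
    total = sum(exp_w)
    dim = len(vectors[0])
    return [sum(e * vec[k] for e, vec in zip(exp_w, vectors)) / total for k in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def loss(h_v, z_u, y):
    """Binary cross-entropy on the dot product of the two embeddings."""
    p = sigmoid(sum(a * b for a, b in zip(h_v, z_u)))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Hypothetical 2-d embeddings: the item itself, its genres, its tags
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.0, 0.0, 0.0]  # equal weights reduce to a plain mean
h_v = aggregate(vectors, weights)
pos_loss = loss(h_v, [1.0, 1.0], y=1)
neg_loss = loss(h_v, [1.0, 1.0], y=0)
print(h_v, pos_loss, neg_loss)
```

With a positive dot product the loss is small for a true neighbor (y = 1) and large for a negative sample (y = 0), which is what pushes co-occurring items together and random items apart.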

from typing import List, Union

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data
import pytorch_lightning as pl


from typing import *
import numpy as np
import torch
def fixed_unigram_candidate_sampler(
        true_classes: Union[np.ndarray, torch.Tensor],
        num_true: int,
        num_samples: int,
        range_max: int,
        unigrams: List[Union[int, float]],
        unique: bool = False,
        distortion: float = 1.):
    """
    Generate candidates based on positive examples. Converted from TensorFlow code to Python code
    Args:
        true_classes: A Tensor of type int64 and shape [batch_size,num_true]. The target classes.
        num_true: An int. The number of target classes per training example.
        num_samples: An int. The number of classes to randomly sample.
        range_max: An int. The number of possible classes.
        unigrams: A list of unigram counts or probabilities, one per ID in sequential order.
        unique: A bool. Determines whether all sampled classes in a batch are unique.
        distortion: distortion is used to skew the unigram probability distribution. Each weight is first raised to the distortion's power before adding to the internal unigram distribution. As a result, distortion = 1.0 gives regular unigram sampling, and distortion = 0.0 gives a uniform distribution.
    """

    if isinstance(true_classes, torch.Tensor):
        true_classes = true_classes.detach().cpu().numpy()

    unigrams = np.array(unigrams)
    if distortion != 1.:
        unigrams = unigrams.astype(np.float64) ** distortion

    result = []
    has_seen = set()
    for i in range(len(true_classes)):
        for j in range(len(true_classes[i])):
            has_seen.add(true_classes[i][j])

    if range_max < num_samples and unique:
        raise Exception("Range max is lower than num samples")

    while len(result) < num_samples:
        sampler = torch.utils.data.WeightedRandomSampler(unigrams, num_samples)
        candidates = np.array(list(sampler))

        for item in candidates:
            if unique:
                if item not in has_seen:
                    result.append(item)
                    has_seen.add(item)
            else:
                result.append(item)

    # With unique=True several rounds may overshoot, so trim to the requested size
    return result[:num_samples]

class SkigGram(pl.LightningModule):
    def __init__(
            self,
            embedding_size: int,
            embedding_dim: int,
            side_information_schema: dict,
            word_count_table: np.ndarray,
            lr: float = 0.001,
            negative: int = 5,
    ):
        """
        Args:
            embedding_size: An int, unique word number
            embedding_dim: An int, embedding dimension
            side_information_schema: A dict, side information schema contains name, type of side information
            word_count_table: An array, word count of each item, used to generate negative samples
            lr: A float, learning rate
            negative: An int, number of negative samples per positive pair
        """
        super(SkigGram, self).__init__()
        self.side_information_schema = side_information_schema
        self.embedding_size = embedding_size

        self.embedding_dim = embedding_dim
        self.negative = negative
        self.wc = word_count_table
        self.lr = lr

        # center word embedding
        self.center_word_embed = nn.Embedding(embedding_size, embedding_dim, padding_idx=0)
        # neighbor word embedding
        self.neighbor_word_embed = nn.Embedding(embedding_size, embedding_dim, padding_idx=0)

        self.si_dict = nn.ModuleDict()

        # side information embedding
        self.si_keys = []

        for key, info in side_information_schema.items():
            self.si_keys.append(key)

            if info["type"] == "categorical":
                self.si_dict[key] = nn.Embedding(info["size"], embedding_dim, padding_idx=0)
            elif info["type"] == "categorical_list":
                self.si_dict[key] = nn.EmbeddingBag(info["size"], embedding_dim, padding_idx=0)

        self._weight_init()

    def _weight_init(self):
        with torch.no_grad():
            self.center_word_embed.weight.data.normal_(0., 0.01)
            self.neighbor_word_embed.weight.data.normal_(0., 0.01)

            for key in self.si_dict.keys():
                self.si_dict[key].weight.data.normal_(0., 0.01)

        # init side information weight; registered as a parameter so that the
        # optimizer updates it (a buffer would keep its random initial value)
        self.embedding_weight = nn.Parameter(torch.rand((len(self.si_keys) + 1, 1)))

    def compute_vector(self, data: dict):
        embed_center_word = self.center_word_embed(data["center_word"].to(self.device))  # batch_size * embed_dim

        information_list = [embed_center_word]

        # side information
        for k in self.si_keys:
            information_list.append(self.si_dict[k](data[k].to(self.device)))

        # word and side information embedding list: (n_si + 1) * batch_size * embed_dim
        information_embed = torch.cat(information_list, dim=0).view(len(information_list), -1, self.embedding_dim)

        # exponentiate the weights so that every contribution is positive
        exp_embedding_weights = torch.exp(self.embedding_weight.view(-1, 1, 1))

        weight_sum_pooling = information_embed * exp_embedding_weights / torch.sum(exp_embedding_weights)

        embed_center_word_side_information = torch.sum(weight_sum_pooling, dim=0)

        return embed_center_word_side_information

    def forward(self, data: dict):
        """
        Argument
            data: dictionary of torch tensor
        """
        # aggregated embedding of the center word and its side information
        embed_center_word_side_information = self.compute_vector(data)  # batch_size * embed_dim

        # neighbor word
        embed_neighbor_word = self.neighbor_word_embed(data["neighbor_word"].to(self.device))  # batch_size * embed_dim

        # neg word
        neg_word = torch.tensor(fixed_unigram_candidate_sampler(
            data["center_word"].unsqueeze(1),
            num_true=1,
            num_samples=len(data["center_word"]) * self.negative,
            range_max=len(self.wc),
            unigrams=self.wc,
        ), dtype=torch.int).reshape((-1, self.negative))

        embed_neg_word = self.neighbor_word_embed(neg_word.to(self.device))  # batch_size * K * embed_dim

        # positive pairs: -log(sigmoid(H_v . Z_u))
        score = torch.sum(torch.mul(embed_center_word_side_information, embed_neighbor_word), dim=1)
        score = torch.clamp(score, max=10, min=-10)
        score = -F.logsigmoid(score)

        # negative pairs: -sum(log(sigmoid(-H_v . Z_neg)))
        neg_score = torch.bmm(embed_neg_word, embed_center_word_side_information.unsqueeze(2)).squeeze(2)
        neg_score = torch.clamp(neg_score, max=10, min=-10)
        neg_score = -torch.sum(F.logsigmoid(-neg_score), dim=1)
        return torch.mean(score + neg_score)

    def save_embedding(self, id2word, file_name):
        embedding = self.center_word_embed.weight.cpu().data.numpy()
        with open(file_name, 'w') as f:
            f.write('%d %d\n' % (len(id2word), self.embedding_dim))
            for wid, w in id2word.items():
                e = ' '.join(map(lambda x: str(x), embedding[wid]))
                f.write('%s %s\n' % (w, e))

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
        return optimizer

    def training_step(self, train_batch, batch_idx):
        x = train_batch
        # call the module itself rather than the global `model` variable
        loss = self(x).mean()

        self.log('train_loss', loss)

        return loss

    def write_histograms(self,):
        # iterating through all parameters
        for name, params in self.named_parameters():
            self.logger.experiment.add_histogram(name, params, self.current_epoch)

    def training_epoch_end(self, outputs):
        self.write_histograms()

        return super().training_epoch_end(outputs)
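
The effect of the distortion parameter of the sampler can be seen in a small standard-library sketch. The function `sample_negatives` and the toy counts are assumptions for illustration; unlike the sampler above, this sketch explicitly keeps ids with a zero count at zero weight, since `0 ** 0` evaluates to 1 in Python.

```python
import random
from collections import Counter


def sample_negatives(counts, num_samples, distortion=1.0, rng=None):
    """Draw item ids with probability proportional to count ** distortion."""
    rng = rng or random.Random()
    # Ids that never occur (such as the padding id 0) keep a weight of exactly 0
    weights = [c ** distortion if c > 0 else 0.0 for c in counts]
    return rng.choices(range(len(counts)), weights=weights, k=num_samples)


# Hypothetical counts: id 0 is padding, id 1 is very popular
counts = [0, 1000, 10, 10]
rng = random.Random(0)
plain = Counter(sample_negatives(counts, 10000, distortion=1.0, rng=rng))
uniform = Counter(sample_negatives(counts, 10000, distortion=0.0, rng=rng))
print(plain)
print(uniform)
```

With distortion 1.0 the popular item dominates the negatives, while distortion 0.0 spreads them evenly over all items that occur at least once.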

Algorithm Training

We use the PyTorch Lightning Trainer to simplify multi-GPU training and multi-process data loading.

import torch
from pytorch_lightning import loggers as pl_loggers
num_gpus = torch.cuda.device_count()

lr = 0.001
epochs = 5
negative_items = 5
batch_size = 256 * num_gpus
num_items = len(movie_metadata_info) + 3
embedding_dim = 50

data_loader = get_dataloader(
movie_ds,
feature_schema,
batch_size=batch_size,
shuffle=True,
num_workers=num_gpus)

model = SkigGram(
num_items,
embedding_dim,
feature_schema,
movie_ds.get_word_count(),
lr=lr,
negative=negative_items,)

tb_logger = pl_loggers.TensorBoardLogger("logs/",)
trainer = pl.Trainer(
gpus=num_gpus,
strategy="dp",
max_epochs=epochs,
accelerator="gpu",
log_every_n_steps=1000,
flush_logs_every_n_steps=1000,
logger=tb_logger)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/home/jovyan/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:59: LightningDeprecationWarning: Setting `Trainer(flush_logs_every_n_steps=1000)` is deprecated in v1.5 and will be removed in v1.7. Please configure flushing in the logger instead.
  rank_zero_deprecation(
trainer.fit(model, data_loader,)
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name                | Type       | Params
---------------------------------------------------
0 | center_word_embed   | Embedding  | 3.1 M 
1 | neighbor_word_embed | Embedding  | 3.1 M 
2 | si_dict             | ModuleDict | 3.3 M 
---------------------------------------------------
9.5 M     Trainable params
0         Non-trainable params
9.5 M     Total params
38.055    Total estimated model params size (MB)
/home/jovyan/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:132: UserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(



Training: 0it [00:00, ?it/s]

We just uploaded Tensorboard into TensorHub. Check it out on this link

Testing model

First, we take our movie embedding from model

from gensim.models import KeyedVectors
def convert_to_tensor(entity_id, data):
    tmp = {"center_word": torch.tensor([entity_id], dtype=torch.int)}

    for k, v in data.items():
        # An empty side-information list is replaced by the padding id
        if len(v) == 0:
            v = [0]
        tmp[k] = torch.tensor([v], dtype=torch.int)

    return tmp

# Aggregated (item + side information) embedding for every movie
X = np.zeros((len(movie_metadata_info) + 3, embedding_dim))
for entity_id, entity_info in tqdm(movie_metadata_info.items()):
    X[entity_id] = model.compute_vector(convert_to_tensor(entity_id, entity_info)).detach().cpu().numpy()

kv_metadata = KeyedVectors(vector_size=embedding_dim, count=len(X))
for i in range(1, len(X) - 1):
    kv_metadata.add_vector(i, X[i])
100%|██████████| 62423/62423 [00:13<00:00, 4691.08it/s]
import tensorflow as tf
import tensorboard as tb
tf.io.gfile = tb.compat.tensorflow_stub.io.gfile

labels = df_movie_joint_encoder["title"].tolist()
tb_logger.experiment.add_embedding(X[:len(labels)],labels)

Imagine that we just uploaded Spider-Man: No Way Home, a new movie released in December 2021, into our system.

The movie is new to us, so we want to find movies similar to it. It has no interactions yet, so our team will assign genres and tags to it.
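
One way to embed such a cold-start item, following the paper's idea of representing it by the average of its side-information embeddings, is sketched below. All vectors, tag names and movie titles are hypothetical toy data, and the helpers `mean_vector` and `cosine` are assumptions for illustration rather than the notebook's own method.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical 2-d side-information embeddings
si_embedding = {"action": [1.0, 0.0], "marvel": [0.9, 0.1], "romance": [0.0, 1.0]}
# Hypothetical embeddings of movies already in the catalog
catalog = {"Movie A": [1.0, 0.1], "Movie B": [0.1, 1.0]}

# The new item has no interactions, so it is built from its side information only
new_item = mean_vector([si_embedding["action"], si_embedding["marvel"]])
ranked = sorted(catalog, key=lambda title: cosine(new_item, catalog[title]), reverse=True)
print(ranked)
# → ['Movie A', 'Movie B']
```

The new item lands close to existing movies that share its genres and tags, even though nobody has rated it yet.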

import json
response = """{"status":"success","data":{"movieId":263007,"totalTagNum":112,"scoredTags":[{"tag":"Marvel Cinematic Universe","tagCountsViewModel":{"total":21,"positive":15,"neutral":3,"negative":3,"score":21.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Marvel","tagCountsViewModel":{"total":18,"positive":12,"neutral":4,"negative":2,"score":18.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"multiverse","tagCountsViewModel":{"total":15,"positive":12,"neutral":3,"negative":0,"score":15.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"spider-man","tagCountsViewModel":{"total":12,"positive":11,"neutral":1,"negative":0,"score":12.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"I'm something of a scientist myself","tagCountsViewModel":{"total":10,"positive":9,"neutral":1,"negative":0,"score":10.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Fan service","tagCountsViewModel":{"total":9,"positive":6,"neutral":3,"negative":0,"score":9.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"great cgi","tagCountsViewModel":{"total":9,"positive":9,"neutral":0,"negative":0,"score":9.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"tom holland","tagCountsViewModel":{"total":6,"positive":3,"neutral":3,"negative":0,"score":6.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Andrew Garfield","tagCountsViewModel":{"total":5,"positive":5,"neutral":0,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"New York 
City","tagCountsViewModel":{"total":5,"positive":4,"neutral":1,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Tobey Maguire","tagCountsViewModel":{"total":5,"positive":5,"neutral":0,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Willem Dafoe","tagCountsViewModel":{"total":5,"positive":5,"neutral":0,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"actor reprises previous role","tagCountsViewModel":{"total":5,"positive":4,"neutral":1,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"crossover","tagCountsViewModel":{"total":5,"positive":4,"neutral":1,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"magic","tagCountsViewModel":{"total":5,"positive":4,"neutral":1,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"nostalgia","tagCountsViewModel":{"total":5,"positive":4,"neutral":0,"negative":1,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"sorcerer","tagCountsViewModel":{"total":5,"positive":4,"neutral":1,"negative":0,"score":5.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"casting a spell","tagCountsViewModel":{"total":4,"positive":3,"neutral":1,"negative":0,"score":4.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Alfred 
Molina","tagCountsViewModel":{"total":3,"positive":3,"neutral":0,"negative":0,"score":3.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"lawyer","tagCountsViewModel":{"total":3,"positive":1,"neutral":1,"negative":1,"score":2.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"sandman the marvel comics character","tagCountsViewModel":{"total":3,"positive":0,"neutral":1,"negative":2,"score":3.0,"dominantAffect":"negative"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"statue of liberty","tagCountsViewModel":{"total":3,"positive":2,"neutral":1,"negative":0,"score":3.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"green goblin character","tagCountsViewModel":{"total":2,"positive":1,"neutral":1,"negative":0,"score":2.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"trippy","tagCountsViewModel":{"total":2,"positive":1,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"unmasked","tagCountsViewModel":{"total":2,"positive":1,"neutral":1,"negative":0,"score":2.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":" Nostalgia Done Right","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"2020s","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Beautifully completes Tom Holland's evolution from a boy reliant on the avengers and Tony to a man with his own suit, his own problems and his own responsibility, because after all with great power comes great 
responsibility","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":0.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Doctor Strange","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Dr. Strange","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Greatest marvel film ever","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":0.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Statue of Liberty","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"Zendaya","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"a list","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"action","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"american patriotism","tagCountsViewModel":{"total":1,"positive":0,"neutral":0,"negative":1,"score":1.0,"dominantAffect":"negative"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"based on comic","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"based on comic 
book","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"bechdel test: fail","tagCountsViewModel":{"total":1,"positive":0,"neutral":0,"negative":1,"score":1.0,"dominantAffect":"negative"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"best friend","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"best hits","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"biopunk","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"bomb","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":0.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"bridge","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"british actor playing american 
character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"cameos","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"car","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"cinematography","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"comic book","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"construction site","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"costume","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"costumed hero","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"doctor octopus character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"doctor strange character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"dr. 
curt connors character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"electricity","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"electro character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"exciting","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"falling from height","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"fight","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"final showdown","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"fixes Spider-Man","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"framed for 
murder","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"friend","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"funny","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"good characters","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"greatest hits","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"hero","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"high school","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"identity revealed","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"j. 
jonah jameson character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"lightning","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"lizard character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"love interest","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"many characters","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"many villains","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"marvel cinematic universe (mcu)","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"marvel comics","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"marvel entertainment","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"mistaken identity","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"multiple 
villains","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"night","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"nostalgic","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"overrated","tagCountsViewModel":{"total":1,"positive":0,"neutral":0,"negative":1,"score":1.0,"dominantAffect":"negative"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"peter parker character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"plot holes","tagCountsViewModel":{"total":1,"positive":0,"neutral":0,"negative":1,"score":1.0,"dominantAffect":"negative"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"power","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"protest","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"psychotronic film","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"rogues 
gallery","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"sand","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"scaffolding","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"scientist","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"sequel","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"shared universe","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"slimehouse","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"spell gone awry","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"spider man 
character","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"superhero","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"surrealism","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"swing","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"teenage boy","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"teenage girl","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"teenage superhero","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"teenager","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"third 
part","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"thrilling","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"train","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"trio","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"villain","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"villain team up","tagCountsViewModel":{"total":1,"positive":0,"neutral":1,"negative":0,"score":1.0,"dominantAffect":"neutral"},"userHasTagged":false,"userAffect":null,"userRating":null},{"tag":"writing","tagCountsViewModel":{"total":1,"positive":1,"neutral":0,"negative":0,"score":1.0,"dominantAffect":"positive"},"userHasTagged":false,"userAffect":null,"userRating":null}]}}"""
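# The response string above is wrapped across several lines; drop the line
# breaks so that json.loads does not reject them as control characters.
tags_response = json.loads(response.replace("\n", ""))
# Keep only tags whose dominant affect is positive, normalized to lowercase.
tags = [
    item["tag"].strip().lower()
    for item in tags_response["data"]["scoredTags"]
    if item["tagCountsViewModel"].get("dominantAffect", "") == "positive"
]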
test_movie_info = {
    "genres": ["action", "adventure", "science fiction"],
    "tag": tags,
}

test_movie_tags = {k: encoder_mapper[k].transform(v) for k, v in test_movie_info.items()}
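A new movie's tags may include labels that never appeared during training. Whether `encoder_mapper[k].transform` tolerates them depends on how the encoders were built earlier in the notebook; if they behave like scikit-learn's `LabelEncoder`, unseen labels raise a `ValueError`. Below is a minimal sketch of a pre-filter, assuming each encoder's vocabulary is available as a plain collection. The `keep_known_labels` helper and the toy vocabulary are hypothetical and not part of the notebook.

```python
def keep_known_labels(labels, vocabulary):
    """Return the labels found in `vocabulary`, keeping order and dropping duplicates."""
    known = set(vocabulary)
    seen = set()
    kept = []
    for label in labels:
        if label in known and label not in seen:
            kept.append(label)
            seen.add(label)
    return kept


# Hypothetical vocabulary standing in for the tags seen during training.
toy_tag_vocabulary = ["marvel", "multiverse", "spider-man", "statue of liberty"]
toy_new_tags = ["marvel", "fan service", "statue of liberty", "statue of liberty"]

print(keep_known_labels(toy_new_tags, toy_tag_vocabulary))
```

If the encoders are scikit-learn `LabelEncoder` objects, the vocabulary would be available as `encoder.classes_`.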
# Compute the embedding of the new movie from its side information.
X_query = (
    model.compute_vector(convert_to_tensor(0, test_movie_tags))
    .detach()
    .cpu()
    .numpy()
)
# Map each nearest-neighbor movie id to its similarity score.
list_candidates_ids = dict(kv_metadata.similar_by_vector(X_query[0]))
(
    df_movie_joint_encoder[df_movie_joint_encoder["movieId"].isin(list_candidates_ids)]
    .assign(score=lambda x: x["movieId"].map(list_candidates_ids))
    .sort_values(by=["score"], ascending=False)[["title", "score"]]
)
/home/jovyan/miniconda3/lib/python3.8/site-packages/gensim/models/keyedvectors.py:772: RuntimeWarning: invalid value encountered in true_divide
  dists = dot(self.vectors[clip_start:clip_end], mean) / self.norms[clip_start:clip_end]
       title                                              score
43355  Marvel One-Shot: All Hail the King (2014)          0.763174
59441  Batman: Hush (2019)                                0.732942
7098   D.O.A. (1950)                                      0.725314
44344  Marvel One-Shot: Agent Carter (2013)               0.722448
43369  Marvel One-Shot: The Consultant (2011)             0.719176
50639  Rendel (2017)                                      0.713440
27213  45 Years (2015)                                    0.709849
44351  Marvel One-Shot: A Funny Thing Happened on the...  0.695612
40174  The BFG (2016)                                     0.694954
41081  The Three Musketeers (1946)                        0.689934

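The model has learned reasonably well which movies are similar to Spider-Man FFH. Several superhero movies appear among the top results, including Marvel Cinematic Universe titles (the Marvel One-Shot shorts) and DC titles (Batman: Hush), along with some sci-fi movies. A few results, such as D.O.A. (1950) and 45 Years (2015), look less related.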
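The `RuntimeWarning` printed with the results is raised where gensim divides by the stored vector norms. A likely cause is that at least one stored vector has zero norm or contains NaN values, although the notebook does not confirm this. The sketch below is not gensim's implementation; it only illustrates a cosine-similarity top-k lookup that skips zero-norm vectors. The function name and the toy vectors are hypothetical.

```python
import math


def top_k_cosine(query, vectors, k=3):
    """Rank the entries of `vectors` (dict: id -> list of floats) by cosine similarity to `query`."""
    query_norm = math.sqrt(sum(x * x for x in query))
    if query_norm == 0.0:
        return []
    scores = []
    for item_id, vec in vectors.items():
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            # Cosine similarity is undefined for a zero vector, so skip it
            # instead of dividing by zero.
            continue
        dot = sum(q * x for q, x in zip(query, vec))
        scores.append((item_id, dot / (query_norm * norm)))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


toy_vectors = {
    "a": [1.0, 0.0],
    "b": [1.0, 1.0],
    "c": [0.0, 0.0],  # zero vector: skipped
    "d": [-1.0, 0.0],
}
print(top_k_cosine([1.0, 0.0], toy_vectors, k=2))
```

Filtering out such vectors before building the index would be one way to avoid the warning, if zero-norm vectors turn out to be the cause.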

End Note

We have implemented Graph Embedding with Side Information to incorporate item side information. We showed how to construct an item graph from users' behavior histories and learn the embeddings of all items in the graph. The item embeddings are used to compute pairwise similarities between items, which then drive the recommendation process. To alleviate the sparsity and cold-start problems, side information is incorporated into the graph embedding framework.
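As a reminder of the first step summarized above, the sketch below builds a weighted, directed item graph from behavior sequences by counting consecutive item pairs, then samples one weighted random walk. It follows the general idea of the referenced paper; the toy sessions, function names, and walk length are assumptions and do not reproduce the notebook's code.

```python
import random
from collections import defaultdict


def build_item_graph(sessions):
    """Count how often item j directly follows item i across all sessions."""
    graph = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            graph[src][dst] += 1
    return graph


def random_walk(graph, start, length, rng):
    """Sample a walk where the next item is drawn proportionally to edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1])
        if not neighbors:
            break  # dead end: no outgoing edges
        items = list(neighbors)
        weights = [neighbors[item] for item in items]
        walk.append(rng.choices(items, weights=weights, k=1)[0])
    return walk


toy_sessions = [["A", "B", "C"], ["A", "B", "D"], ["B", "C"]]
graph = build_item_graph(toy_sessions)
print(dict(graph["B"]))
print(random_walk(graph, "A", length=4, rng=random.Random(0)))
```

The sampled walks play the role of sentences, and the items play the role of words, when training a skip-gram style model.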

In terms of data, we only added two kinds of metadata to our model: genres and user tags. In the future, we can add more features such as actors, actresses, and directors.
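To illustrate why side information helps with cold start, and how further features could slot in, note that an item with no interactions can still get a vector by combining the embeddings of its side-information values. The sketch below uses a plain average within and across feature groups, which is an assumption for illustration; the notebook's `compute_vector` may weight or combine features differently. All embedding values and names here are made up.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def cold_start_vector(item_features, embedding_tables):
    """Average each feature group's known embeddings, then average across groups."""
    group_means = []
    for feature, values in item_features.items():
        table = embedding_tables[feature]
        known = [table[v] for v in values if v in table]
        if known:
            group_means.append(mean_vector(known))
    if not group_means:
        raise ValueError("item has no known side information")
    return mean_vector(group_means)


# Made-up 2-dimensional embeddings for illustration only.
toy_tables = {
    "genres": {"action": [1.0, 0.0], "adventure": [0.0, 1.0]},
    "tag": {"marvel": [1.0, 1.0]},
}
toy_item = {"genres": ["action", "adventure"], "tag": ["marvel", "unseen tag"]}
print(cold_start_vector(toy_item, toy_tables))
```

Under this scheme, adding a feature such as directors would only require another key in the feature dictionary and a matching embedding table.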

The model in this notebook relies only on negative sampling. A natural next step is to integrate more advanced graph embedding approaches, such as Graph Convolutional Networks or knowledge graphs.
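For reference, the negative-sampling objective scores a true (center, context) pair against a few randomly drawn negative items. Below is a minimal sketch of the loss for one training pair, using the standard skip-gram formulation rather than the notebook's exact code. The function names and example vectors are illustrative.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(center, context, negatives):
    """-log sigmoid(center . context) - sum over negatives of log sigmoid(-center . negative)."""
    loss = -log_sigmoid(dot(center, context))
    for negative in negatives:
        loss -= log_sigmoid(-dot(center, negative))
    return loss


# A context vector aligned with the center and a negative pointing away give a small loss.
print(negative_sampling_loss([1.0, 0.0], [5.0, 0.0], [[-5.0, 0.0]]))
```

Minimizing this loss pulls co-occurring items together and pushes sampled negatives apart, without computing a softmax over the full item catalog.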

Stay tuned!

Reference

Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba

Word2vec