
Could you elaborate on "we learn high dimensional embeddings for each video in a fixed vocabulary and feed these embeddings into a feedforward neural network."

So, each video is mapped to a fixed-size vector of floats? A user's history is then a matrix of size [number of videos, embedding size]? What are the other parameters referred to in the sentence "Importantly, the embeddings are learned jointly with all other model parameters through normal gradient descent back propagation updates"? And how do you concatenate all of these into a "wide layer" when users have histories of different lengths?
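
To make sure I'm picturing it right, here's roughly what I imagine is meant (a hypothetical PyTorch sketch; the sizes and names are my guesses, not from the paper):

    import torch
    import torch.nn as nn

    EMBED_DIM = 256          # guess; the paper just says "high dimensional"
    VOCAB_SIZE = 1_000_000   # fixed vocabulary of videos

    # one learned float vector per video ID; the table is an ordinary
    # model parameter, so gradients flow into it during backprop
    video_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

    watch_history = torch.tensor([12, 987, 4431])   # variable length
    history_matrix = video_emb(watch_history)       # shape [3, EMBED_DIM]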



Figure 3 illustrates that the variable-sized watch history is combined with an averaging operation. This is partly why the embeddings need to be so large: to retain information after averaging, you need lots of dimensions to spread out disparate items.
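
Roughly like this (a toy sketch, not production code; the layer widths are made up):

    import torch
    import torch.nn as nn

    EMBED_DIM, HIDDEN = 256, 1024
    VOCAB_SIZE = 1_000_000

    video_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

    # feedforward tower on top of the averaged embedding
    tower = nn.Sequential(
        nn.Linear(EMBED_DIM, HIDDEN), nn.ReLU(),
        nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
    )

    def user_vector(watch_history_ids):
        # variable-length history -> [len, EMBED_DIM] -> mean -> [EMBED_DIM]
        # averaging collapses any history length into one fixed-size input,
        # which is why the embeddings need many dimensions to keep
        # disparate items distinguishable after the mean
        emb = video_emb(watch_history_ids)
        return tower(emb.mean(dim=0))

    print(user_vector(torch.tensor([12, 987, 4431])).shape)  # torch.Size([1024])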

This is of course not optimal, as the network should be able to learn how best to summarize the sequence. In the paper, however, we emphasize the importance of withholding certain sequential information from the classifier.


Have you experimented with replacing the averaging operation on the vectors with a recurrent network such as an LSTM? That way you don't ignore the temporal nature of the feedback (I have had success improving metrics by doing this on implicit streaming-video feedback).
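
Something along these lines, i.e. summarizing the ordered history with the LSTM's final hidden state instead of a mean (a minimal sketch of what I mean; hyperparameters made up):

    import torch
    import torch.nn as nn

    EMBED_DIM, HIDDEN = 256, 512
    VOCAB_SIZE = 1_000_000

    video_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
    # the LSTM consumes the watch history in order instead of averaging it away
    lstm = nn.LSTM(input_size=EMBED_DIM, hidden_size=HIDDEN, batch_first=True)

    def user_vector(watch_history_ids):
        emb = video_emb(watch_history_ids).unsqueeze(0)   # [1, len, EMBED_DIM]
        _, (h_n, _) = lstm(emb)                           # h_n: [1, 1, HIDDEN]
        # the final hidden state summarizes the sequence while preserving order
        return h_n.squeeze(0).squeeze(0)                  # [HIDDEN]

    print(user_vector(torch.tensor([12, 987, 4431])).shape)  # torch.Size([512])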



