
Could you elaborate on "we learn high dimensional embeddings for each video in a fixed vocabulary and feed these embeddings into a feedforward neural network."

So, each video is mapped to a fixed-size vector of floats? A user's history is then a matrix of size [number of videos, embedding size]? What are the other parameters referred to in the sentence "Importantly, the embeddings are learned jointly with all other model parameters through normal gradient descent back propagation updates"? And how do you concatenate all of these into a "wide layer" when users have histories of different lengths?
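
To make sure I'm picturing it right, here's roughly what I imagine is meant (a hypothetical PyTorch sketch; the sizes and names are my guesses, not from the paper):

    import torch
    import torch.nn as nn

    EMBED_DIM = 256          # guess; the paper just says "high dimensional"
    VOCAB_SIZE = 1_000_000   # fixed vocabulary of videos

    # one learned float vector per video ID; the table is an ordinary
    # model parameter, so gradients flow into it during backprop
    video_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

    watch_history = torch.tensor([12, 987, 4431])   # variable length
    history_matrix = video_emb(watch_history)       # shape [3, EMBED_DIM]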



Figure 3 illustrates that the variable-sized watch history is combined with an averaging operation. This is partly why the embeddings need to be so large: to retain information after averaging, you need lots of dimensions to spread out disparate items.
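
Roughly like this (a toy sketch, not production code; the layer widths are made up):

    import torch
    import torch.nn as nn

    EMBED_DIM, HIDDEN = 256, 1024
    VOCAB_SIZE = 1_000_000

    video_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

    # feedforward tower on top of the averaged embedding
    tower = nn.Sequential(
        nn.Linear(EMBED_DIM, HIDDEN), nn.ReLU(),
        nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
    )

    def user_vector(watch_history_ids):
        # variable-length history -> [len, EMBED_DIM] -> mean -> [EMBED_DIM]
        # averaging collapses any history length into one fixed-size input,
        # which is why the embeddings need many dimensions to keep
        # disparate items distinguishable after the mean
        emb = video_emb(watch_history_ids)
        return tower(emb.mean(dim=0))

    print(user_vector(torch.tensor([12, 987, 4431])).shape)  # torch.Size([1024])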

This is of course not optimal, as the network should be able to learn how best to summarize the sequence. In the paper, however, we emphasize the importance of withholding certain sequential information from the classifier.


Have you experimented with replacing the averaging operation on the vectors with a recurrent network such as an LSTM? That way you don't ignore the temporal nature of the feedback (I have had success improving metrics by doing this on implicit streaming-video feedback).
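
Something along these lines, i.e. summarizing the ordered history with the LSTM's final hidden state instead of a mean (a minimal sketch of what I mean; hyperparameters made up):

    import torch
    import torch.nn as nn

    EMBED_DIM, HIDDEN = 256, 512
    VOCAB_SIZE = 1_000_000

    video_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
    # the LSTM consumes the watch history in order instead of averaging it away
    lstm = nn.LSTM(input_size=EMBED_DIM, hidden_size=HIDDEN, batch_first=True)

    def user_vector(watch_history_ids):
        emb = video_emb(watch_history_ids).unsqueeze(0)   # [1, len, EMBED_DIM]
        _, (h_n, _) = lstm(emb)                           # h_n: [1, 1, HIDDEN]
        # the final hidden state summarizes the sequence while preserving order
        return h_n.squeeze(0).squeeze(0)                  # [HIDDEN]

    print(user_vector(torch.tensor([12, 987, 4431])).shape)  # torch.Size([512])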



