I'll give the disclaimer that this paper isn't in my field, and I'm merely an observer. However, I'll do my best to explain, since it's a little unclear.
As I understand it, there were three sets of videos:
1) The "training" video, which they used to learn how the test subject's brain responded to different stimuli. (The paper, which I've only skimmed, says 7,200 seconds, i.e. two hours.)
2) 18,000,000 individual seconds of YouTube video that the test subject had never seen.
3) The test video, aka the video on the left.
So, the first step was to have the subject watch the two hours of video in (1) while recording how their brain responded.
Then, using this data, they built a model of how they thought the brain would respond, and used it to predict the response for eighteen million separate one-second clips sampled randomly from YouTube (2). The subject never actually watched these clips; the responses for them are only predictions.
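To make that step concrete, here is a minimal numpy sketch of the idea: fit a model on the training responses, then use it to predict responses for the library clips. Everything here is hypothetical, the array sizes are made up, and I'm using plain ridge regression as a stand-in for whatever encoding model the paper actually fits (I believe they use motion-energy features, which I'm skipping entirely).

```python
import numpy as np

# Hypothetical sizes, chosen so the sketch actually runs.
n_train, n_features, n_voxels = 7200, 500, 1000   # 7,200 training seconds
rng = np.random.default_rng(0)

train_features = rng.standard_normal((n_train, n_features))  # stimulus features per second
train_brain = rng.standard_normal((n_train, n_voxels))       # measured brain response per second

# Fit a simple ridge-regression encoding model: brain ~= features @ weights.
# (A stand-in for the paper's actual encoding model.)
lam = 1.0
xtx = train_features.T @ train_features + lam * np.eye(n_features)
weights = np.linalg.solve(xtx, train_features.T @ train_brain)

# Predict the brain response for library clips the subject never watched
# (only a small subset of the 18M clips, to keep the sketch lightweight).
library_features = rng.standard_normal((10_000, n_features))
predicted_brain = library_features @ weights                  # shape: (clips, voxels)
```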
As a test of this model, they then showed the subject a new video that was not contained in (1) or (2): the one you see in the link above, (3). They recorded the brain activity from this viewing, then compared each one-second chunk of that brain data to the predicted responses in their database from (2).
So, for example, they took the first second of brain data from (3), recorded while the subject was looking at Steve Martin, and sorted the entire database from (2) by how similar each clip's predicted brain pattern was to the pattern actually generated by looking at Steve Martin.
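A rough sketch of that ranking step, again with made-up arrays, and with plain correlation as my guess at the similarity measure (the paper may well use something more sophisticated):

```python
import numpy as np

rng = np.random.default_rng(0)
n_clips, n_voxels = 10_000, 1000          # hypothetical library subset and voxel count

predicted_brain = rng.standard_normal((n_clips, n_voxels))  # predicted response per library clip
measured_second = rng.standard_normal(n_voxels)             # measured response for one test second

# Rank library clips by how well their *predicted* brain pattern correlates
# with the pattern actually measured during this one second of (3).
pred_z = (predicted_brain - predicted_brain.mean(axis=1, keepdims=True)) \
         / predicted_brain.std(axis=1, keepdims=True)
meas_z = (measured_second - measured_second.mean()) / measured_second.std()
similarity = pred_z @ meas_z / n_voxels                     # one correlation score per clip

ranking = np.argsort(similarity)[::-1]                      # most similar clip first
top_100 = ranking[:100]
```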
They then took the top 100 of those 18M one-second clips and mixed them together right on top of each other to get the general shape of what the person was seeing. Because this exact image of Steve Martin was nowhere in their database, this is their way of approximating the image (as another example, maybe (2) didn't have any elephant footage, but mix 100 videos of vaguely elephant-shaped things together and you can get close). They repeated this for every one-second clip of the test video, which is why the figure jumps around a bit and morphs into different people from seconds 20 to 22. For each individual second, the method searches eighteen million second-long video clips, mixes together the top 100 most similar, and shows you the result.
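The "mix them together" step is essentially a frame-by-frame average of the 100 best-matching clips. A toy sketch, with hypothetical clip dimensions standing in for real decoded video frames:

```python
import numpy as np

rng = np.random.default_rng(0)
n_clips, frames, height, width = 1000, 15, 32, 32       # made-up clip library dimensions

library_clips = rng.random((n_clips, frames, height, width))  # pixel data for every library clip
top_100 = rng.choice(n_clips, size=100, replace=False)        # indices from the similarity ranking

# Average the 100 best-matching clips, frame by frame, to get a blurry
# approximation of the second the subject was actually watching.
reconstruction = library_clips[top_100].mean(axis=0)          # shape: (frames, height, width)
```

Averaging is also why the reconstruction looks so blurry: what you see is the overlap of 100 roughly-similar clips, not any single clip from the library.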
Since each second's "predicted video" is reconstructed independently, just from the test subject's brain data, the video is not exact, and the figures it produces don't necessarily resemble each other from one second to the next. However, the figures are in the correct area of the screen and definitely have a human quality to them, which means their technique for matching brain activity to the videos in (2) is much better than random: they can generate approximations of novel video by analyzing brain signal alone.
Sorry, that was longer than I expected. :)
Edit: Also, if you look at the paper, Figure 4 shows how they reconstructed some of the frames (including the one from 20-22 seconds), with screenshots of the clips from which each composite was generated.