ofxTSNE + image sequence from video

Hello all,

I’m working on building a sort of algorithmic graphic-match generative video editor (don’t worry, I have a whole bunch of paragraphs that unpack that statement better, if you’re interested) that edits live from a bank of videos, and my instructor, @bakercp, has led me to ofxTSNE (via @genekogan and the works of Laurens van der Maaten). I just wanted to share the interesting thing that happens when you take an image sequence from a video, or a few videos, and then plug it into the ofxTSNE example-images.

Here is an image sequence from a video of some emergency vehicles leaving a station panning from a fixed position:

(sorry, I lost the links to the youtube videos I pulled these examples from :frowning: . I renamed them without thinking about that…)

Pretty cool. The only problem is that you have to use a lot of images (frames per second) from the video to get these “pathways” that the visual characteristics sort of carve out. It’s very time-consuming to compile, then fairly heavy on RAM & CPU when only running everything in a few threads. I’ve tried with lots of images from lots of videos. The emergency vehicles example above is 5 frames taken per second from 30 seconds: not a lot.
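For anyone counting frames, here is a rough sketch of the sampling arithmetic. It assumes a 30 fps source video (my assumption, the actual frame rate isn't stated here), and `sample_frame_indices` is just a made-up helper name, not something from ofxTSNE.

```python
def sample_frame_indices(video_fps, duration_s, samples_per_s):
    """Return the source-frame indices to keep when sampling evenly in time."""
    total_samples = int(duration_s * samples_per_s)
    step = video_fps / samples_per_s  # source frames between two kept frames
    return [int(round(i * step)) for i in range(total_samples)]


# Assumed numbers: a 30 fps source, 30 seconds long, 5 frames kept per second.
indices = sample_frame_indices(video_fps=30, duration_s=30, samples_per_s=5)
print(len(indices))   # 150
print(indices[:4])    # [0, 6, 12, 18]
```

So the 30-second example works out to 150 images going into the tSNE, and doubling the sampling rate doubles that.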

Here’s an example from a couple different videos that might cross “pathways.”

And I said might, because to me these videos are very similar; they are both somewhat static shots of a large crowd of people sitting on the floor of a room, even facing the same way. Maybe there is a fair difference in general color palette, but nonetheless there is a striking graphic-match between the two. But I find that, because this tSNE is particularly built to differentiate, to the point where it will spread these pathways apart from each other depending on how many things it has to differentiate between, it would take a third video of completely different visual characteristics to persuade the tSNE to put these two videos a little closer to each other.

Let’s test:


I am writing this post because I’m not really sure what I’m doing, and I’m not sure that I fully understand what the tSNE is supposed to be doing either. I have a general idea, and that idea seems to be very relevant to the vision of my project, but I feel as though I’m overlooking something, and it may be entirely subjective, or based on the actual chosen videos.

Either way, thought I’d share what’s in my little pocket of research.



this is a really neat idea, thanks for sharing.

not much you can do to speed it up. downsampling the images might help a tad since the encoding is usually the slowest part depending on how many frames you have, but probably won’t make a huge difference.

can’t really think of any good suggestions off the top of my head as i haven’t thought to apply it to video frames (except this). maybe one thing might be to try multiple videos of the same thing happening. it would be interesting to see if you get the videos interweaving content rather than putting consecutive frames near each other. ccv is trying to characterize the content of images and should be fairly resistant to superficial similarities (i.e. color, etc). i suspect this would be tricky but would be cool if it could be made to work.

Hey thanks for the reply @genekogan

I think that’s a splendid idea you have there about the ‘multiple videos of the same thing happening.’ Do you mean multiple angles (multiple cameras) of the same thing being filmed at the same time?

Either way, I’m going to get something done in the next few months, and I feel as though this tSNE stuff is going to be a part of it. I’ll keep you posted on what happens. :tv: :movie_camera:

yeah, that or different videos of a similar thing happening.

looking forward to it!

A little update. I took some image sequences from youtube videos (1, 2, 3, 4) of airplanes with contrails being filmed from the ground, 8 of them, and put them into the example-images. I got 10-ish groups back! Although, 5 of the 8 videos are from the same youtube, but completely different footage. I just edited them out separately.

What I’m going to do next is put the names of the images on the tSNE points, drawn over the images, so I can see the kind of ‘direction’ the lineage of the video is going according to the tSNE. Hopefully I’ll be able to tell what kind of similarities are happening.


@genekogan Also, I was talking to someone (@brannondorsey) about this project and I mentioned the time it takes to let the tSNE decide how to differentiate the images (encoding time), and he mentioned that it should be possible to write the information to a file instead of temporarily storing the information on the RAM every time I run the program. I understand that I would have to have the images I want exactly planned out before I do this sort of ‘pre-encoding’ process, that I can then have the program call from. Right?
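A minimal sketch of what that file cache could look like, assuming the slow part is the per-image encoding mentioned earlier in the thread. Everything here is hypothetical naming on my part (`load_or_encode`, `fake_encode`, the JSON layout); none of it is ofxTSNE's API. It keys the stored feature vectors by file name, so under that assumption only images not yet in the file would need encoding.

```python
import json
import os
import tempfile


def load_or_encode(image_paths, encode_fn, cache_path):
    """Return {path: feature vector}, encoding only images missing from the cache file."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)

    missing = [p for p in image_paths if p not in cache]
    for path in missing:
        cache[path] = list(encode_fn(path))  # the slow step, done once per image

    if missing:
        with open(cache_path, "w") as f:
            json.dump(cache, f)

    return {p: cache[p] for p in image_paths}


# Stand-in encoder for illustration only; the real one would be the image encoding step.
def fake_encode(path):
    return [float(len(path)), 1.0]


cache_file = os.path.join(tempfile.mkdtemp(), "features.json")
first = load_or_encode(["a.png", "bb.png"], fake_encode, cache_file)             # encodes both
second = load_or_encode(["a.png", "bb.png", "ccc.png"], fake_encode, cache_file)  # encodes only ccc.png
```

As far as I understand it, this would only save the encoding time. The tSNE layout itself depends on the whole set of points, so it would still have to be recomputed whenever the set of images changes.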

I also tried my previous theory of giving the tSNE a couple sets of images that are completely different from the airplanes with contrails in order to persuade the tSNE to group the airplanes together, rather than differentiating the airplanes from each other as meticulously as it had when it only had images of airplanes.

Yet again, it did not work. Maybe I’m not adding enough diversity?

I took the “diversity” even further by adding all of the [random] images included in the addon (see readme.md -> clustering images -> step 3) in addition to the image sequences from the 8 videos of the airplanes with contrails filmed from the ground.

Weird results!

Look at the hummingbird grouped up with one of the contrail groups. Yet almost none of the separate image sequences of the contrail photos get grouped together? There are other seemingly random photos grouped with some of the other groups of image sequences from the contrail videos too. So curious.

This is making me wonder whether or not training the tSNE is going to yield the results I’m looking for. That is, I’d like it to group together images from an image sequence from videos that clearly have visually similar things happening. And I think it’s safe to say that 8 videos of airplanes in clear, blue skies with white contrails generally pointing toward the same quadrant of the frame is a very clear example of videos of similar things happening. I’m a little confused!
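One way I could put a number on "grouped together" instead of eyeballing it: for every point in the 2D layout, check whether its nearest neighbor comes from a different video. This is only a sketch with made-up coordinates, and `cross_video_neighbor_rate` is a hypothetical helper, not part of ofxTSNE. It assumes I can get each point's x, y and the video it came from.

```python
import math


def cross_video_neighbor_rate(points):
    """points: list of (x, y, video_id). Returns the fraction of points whose
    nearest neighbor in the 2D layout comes from a different video."""
    if len(points) < 2:
        return 0.0
    crossings = 0
    for i, (xi, yi, vi) in enumerate(points):
        nearest_video, nearest_dist = None, math.inf
        for j, (xj, yj, vj) in enumerate(points):
            if i == j:
                continue
            d = math.hypot(xi - xj, yi - yj)
            if d < nearest_dist:
                nearest_video, nearest_dist = vj, d
        if nearest_video != vi:
            crossings += 1
    return crossings / len(points)


# Made-up coordinates for illustration: two videos whose frames sit far apart...
separate = [(0.10, 0.10, "a"), (0.12, 0.11, "a"), (0.90, 0.90, "b"), (0.88, 0.91, "b")]
# ...and two videos whose frames interleave along one path.
interleaved = [(0.10, 0.10, "a"), (0.20, 0.10, "b"), (0.30, 0.10, "a"), (0.40, 0.10, "b")]

print(cross_video_neighbor_rate(separate))     # 0.0
print(cross_video_neighbor_rate(interleaved))  # 1.0
```

A rate near 0 would mean each video keeps to its own pathway; a higher rate would mean frames from different videos are actually landing next to each other.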


Hi, @mosspassion

Those are very interesting trials.

I have an idea for automatically extracting video clips from a video file using periodic video scene estimation.

Can I get the github address of your trial code?

Thanks a lot.