Algorithmic Graphic Match Cut with tSNE

#Algorithmic Graphic Match Cut

###This is going to be a long post that has a lot of beginner information as well as some slightly advanced ideas. I will write a tl;dr summary for those with shorter literary attention spans like myself.

First of all I want to thank @bakercp && @brannondorsey for their direct help on this project; I absolutely would not have been able to get anywhere near as far into this project as I have without them. I also want to thank @ttyy, @genekogan, @gameover, && @kylemcdonald for their help and/or for making the tools I used in this project. Here is the git repo, with instructions on what I did in the readme (not finished yet).

I started this project off wanting to create an application that plays through a bank of videos (as large a bank as possible), while algorithmically live-editing between the videos based on graphic match cuts (here is a montage of a couple of famous match cuts). This is one component of a larger project that you can read more about in my project proposal if you feel so inclined.

So, for starters I made a couple of mock-up videos showing what I intended the application to do by itself (password to all: AdvProjATS2016):

Mock up example01. Mock up example02.

Those aren’t perfect examples, but some of the cuts are very clear, like the emergency automobiles cutting to the train, and the video of the bright orange ball in the center of the frame cutting to a ball of fire in the center of the frame of a different video. In fact, those two cuts in particular were the first cuts that the algorithm agreed with when I ran it in a test application to see if the computer vision/machine learning “thought” the way that I did. It wasn’t as precise as I’d like it to be, but it did well enough to proceed with it. First, here is the video of the application grouping together frames from the aforementioned successful cuts (password: AdvProjATS2016):


I know it's hard to tell what's going on just from watching that video, so I'll do my best to explain.

###A more thorough explanation

In layman’s terms I wanted the application to transition from one video to another video that has a very similar object or background/color(s)/contrast/shape(s), exactly in the moment that the two videos have one or all of those things in common. So first I had to figure out how to get the application to know what exactly is “similar.” After some research my instructor, @bakercp, pointed me to some machine learning algorithms that were already wrapped up for openFrameworks. He particularly pointed me to ofxMSATensorFlow && ofxTSNE. TensorFlow is Google’s open-source machine learning library, which is often used for working with visual information. And tSNE is “…a (prize-winning) technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets.” Basically it takes high-dimensional data (here, images) and places each item on a 2D or 3D map so that similar items land near each other, which makes groups of similar images easy to see. Here is an example of the algorithm very successfully sorting a set of images of human faces.

ofxTSNE ended up being the easiest to work with in openFrameworks (thanks @genekogan!), so I began testing if the algorithm could do what I wanted. See this post I made on the openFrameworks forum about those tests.

###The steps needed, pitfalls, changes of plans, and successes

First, I needed a bunch of videos. I would have preferred to spend a lot more time carefully choosing videos, but for the sake of the semester (this was for a school project at an art school) I chose to focus more on the utility, and less on the aesthetics. That is, I wanted to make this application (process) work first, hand it off to the world, and then apply the artist’s touch to the tool to exploit its potential, and hopefully break it to remake it. So I went to and pressed the skip button many times until something caught my eye or gave me a search idea. I had a few parameters that the videos needed to meet. One, the most important one, is that there were no edits in the footage at all. The whole idea here is that the application makes live-edits between footage, so the videos had to be uncut && unedited. OK, well, it’s a lot harder to find those on YouTube than one might think! Secondly, I preferred that the videos were short, under 2:30 long, but longer than 0:30. Not so bad, because I could just take a chunk from a video that was too long if I really liked the content. Again, I wasn’t super picky because I wanted to work with the videos sooner rather than later. I ended up with 43 videos that you can download here (the mp4s). Note that I renamed them all to a three-character base file name for convenience.

Now to figure out how to use the algorithm to sort moments in videos by visual similarities. Image sequences! A video is simply a lot of images shown one after another over time (at the frame rate). So, I already had a bunch of images. I used the FFmpeg command line tool to take image sequences of my 43 videos. Typing in the FFmpeg command for 43 different videos was time consuming, so I made a bash script that made my life a little bit easier (you can find it in the git repo).
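To give a rough idea of what that batch step looks like, here is a minimal sketch in Python rather than bash. It is not the script from the repo. The function names, the output folder, and the PNG naming pattern are my own assumptions; the FFmpeg options used (`-i`, `-vf`, and the `fps` and `scale` filters) are standard ones.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, out_dir, fps=10, width=None):
    """Return the FFmpeg argument list that turns one video into an image sequence."""
    video = Path(video_path)
    filters = [f"fps={fps}"]  # how many images to take per second of video
    if width is not None:
        # Optional downsampling: -2 keeps the aspect ratio and an even height.
        filters.append(f"scale={width}:-2")
    # Hypothetical naming scheme: <base name>_<frame number>.png
    pattern = Path(out_dir) / f"{video.stem}_%05d.png"
    return ["ffmpeg", "-i", video.as_posix(), "-vf", ",".join(filters), pattern.as_posix()]


def extract_all(video_dir, out_dir, fps=10, width=None):
    """Run FFmpeg once for every .mp4 in video_dir (requires ffmpeg on the PATH)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for video in sorted(Path(video_dir).glob("*.mp4")):
        subprocess.run(build_ffmpeg_command(video, out_dir, fps, width), check=True)
```

Calling `extract_all("videos", "frames", fps=5, width=320)` would, under those assumptions, write a downsampled 5 FPS image sequence for every mp4 in the `videos` folder.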

First I tried capturing 10 images from the mp4 files every second, 10 frames per second (FPS). I ended up with 36,000+ images, and didn’t think to resize any of them, so I had a GB and some of images. Unwittingly I loaded the images into the ofxTSNE example-images application that comes with the addon (I’ll explain the example-images application in a bit), and waited for many hours. The one thing about the tSNE process is that it is time consuming for images. 36,000 images is a lot, and a lot of them were HD, so there were lots of pixels to load and store and whatnot. Well, that also takes A LOT of RAM and swap memory. My iMac computer has 22GB of RAM. All 22 of those GBs got used along with 49GBs of swap memory before the application crashed. OK, so that was too many images. So, I took image sequences at 5 FPS while also downsampling, which gave me 18,000+ images. Similar results, but the application got a little farther into what it was doing before crashing. I spent an entire day on a really fast computer with lots of RAM while playing with image sizes && FPS trying to get the ofxTSNE example-images to compile, load, encode, and sort, with no success. At this point I was starting to get worried that I wouldn’t be able to use all 43 videos, compromising the diversity of the bank of videos. But, my instructor @bakercp and I devised a plan for optimizing the way in which I captured images from the videos.
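A quick back-of-the-envelope calculation shows why image count and resolution matter so much here. This is only a rough sketch: it assumes every image is held in memory uncompressed as 8-bit RGB, and that every image in the first attempt was 1920 by 1080. Neither is something I measured, and the real memory use of ofxTSNE will differ.

```python
def raw_image_bytes(count, width, height, channels=3):
    """Bytes needed to hold `count` uncompressed 8-bit images in memory."""
    return count * width * height * channels


def as_gigabytes(num_bytes):
    """Convert bytes to gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


# Assumed scenarios, loosely based on the numbers in this post.
scenarios = {
    "36,000 images at 1920x1080": raw_image_bytes(36_000, 1920, 1080),
    "18,000 images at 1920x1080": raw_image_bytes(18_000, 1920, 1080),
    "9,000 images at 320x240": raw_image_bytes(9_000, 320, 240),
}

for label, size in scenarios.items():
    print(f"{label}: about {as_gigabytes(size):.1f} GB of raw pixels")
```

Under those assumptions the first attempt works out to a couple hundred GB of raw pixels, while a few thousand small images need only about 2 GB.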

Some videos had very little change for long periods of time, so there was no need to take many images of them. For example, a video of a bird standing very still on a branch for 10 seconds does not necessitate 50 images; just one image of that ‘scene’ will do for the machine learning. At this point another art student from a different school had gotten in touch with me via the openFrameworks forum post I made about testing ofxTSNE for graphic match cuts. He was doing a very similar project, so we exchanged notes. You can see his work here. One of the computer vision examples he used for cinema was the optical flow example from the ofxCv addon (included in the of_v0.9.x releases). Brilliant, I’ll take an image based on how much the content in the frame changes. You’ll find that in the git repo I linked to at the top. First and foremost, the optical flow was slow with HD videos, so I downsampled every video to 320 by 240-ish because I was going to analyze them in tSNE at that resolution anyway. I made a bash script for that too; you can find it in the repo. Secondly, @bakercp wrote code that averaged the pixel movement and used a threshold on that average to decide when to take an image, which is time stamped with the position in the video it was taken from. It was great, but I really had to pay close attention to each video. Some videos were shot with hand-held cameras, and some were very still, so I had to adjust the threshold depending on the video to get a reasonable number of images from it, but only images that were worth grabbing. There will be instructions and an accompanying video demonstration of that in the git repo.
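The thresholding idea can be sketched in a few lines. This is not @bakercp’s code and it does not compute optical flow. It assumes the average pixel movement per frame has already been measured (by ofxCv or anything else) and only shows the decision of when to grab an image. The function name and the rule of always keeping the first frame are my own assumptions.

```python
def select_capture_times(mean_motion, fps, threshold):
    """Return the timestamps (in seconds) at which an image should be taken.

    mean_motion[i] is the average pixel movement measured between
    frame i and frame i + 1 of the video.
    """
    # Assumption: always keep the first frame, so that even a completely
    # still video contributes one image of its 'scene'.
    times = [0.0]
    for index, motion in enumerate(mean_motion, start=1):
        if motion >= threshold:
            times.append(index / fps)  # time stamp of the frame that moved
    return times


# Hypothetical per-video thresholds: shaky hand-held footage needs a higher
# threshold than a tripod shot to produce a similar number of images.
thresholds = {"hpp.mp4": 2.5, "mkp.mp4": 0.8}
```

For example, `select_capture_times([0.1, 0.2, 3.0, 0.1, 5.0], fps=4, threshold=1.0)` keeps the first frame plus the two frames where the motion jumps, and ignores the near-still frames in between.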

Now that I had a reasonable number of images, ~9000, I ran the ofxTSNE example-images and success! It didn’t crash! Now I just needed to adjust the parameters on the actual tSNE process to get results that were good enough to proceed with. There are instructions on what parameters I configured in the git repo. Next, and probably the most important step of this entire process, was to capture those plotted points (the positions that tSNE assigns to the images in its final step) into a JSON file so that I didn’t have to wait 3-5 hours for each tSNE process. Yeah, it took that long every time! So, again, @bakercp wrote some code for me that did just that, and later @brannondorsey wrote the code that read the JSON information back into the application.
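Caching the result is simple in principle. The sketch below shows the idea in Python. The JSON layout (a list of objects with `video`, `time`, `x`, and `y` fields) is a hypothetical one I made up for illustration, not the format that @bakercp’s and @brannondorsey’s code actually uses.

```python
import json


def save_points(path, points):
    """Write the plotted points to a JSON file so tSNE does not have to be re-run."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(points, f, indent=2)


def load_points(path):
    """Read the plotted points back in."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# One entry per captured image: which video it came from, where in that
# video it was taken (seconds), and where tSNE placed it on the 2D plot.
example_points = [
    {"video": "hpp.mp4", "time": 65.0, "x": 0.40, "y": 0.40},
    {"video": "mkp.mp4", "time": 35.0, "x": 0.42, "y": 0.41},
]
```

With something like this in place, the slow tSNE run happens once, and every later launch of the application starts from `load_points("points.json")`.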

###But how is this all going to work together?

Everything is going pretty well at this point. What I haven’t explained is how @bakercp and I planned to take that 2D plot information (it could be 3D as well, but I wanted fewer things to think about) and use it to create a play head that knows when, at a given moment in a video, there may be another video with a good graphic match to cut to. That is, if a video is being played, the play head is aware that there is another video that has similar visual material in the frame, and can then choose to cut to that video. We figured that since we have this plot of points, with time stamps, we can just send information from a video player to the play head saying “Hey, I’m on video hpp.mp4, and I’m at 01:05 on the video, check if there is anything around I should cut to.” Then, that plot of points will respond “Oh, 01:05 on hpp.mp4 is pretty close to this image from mkp.mp4 at 00:35, so let’s switch to that moment in mkp.mp4 and play on from there.” And so on and so forth: the play head and the video player will constantly communicate, and constantly change from one video to another. If it happens that a video has no visual similarities, then either stop or go to a random video and start the whole process over. If it happens that two or three videos keep switching to each other, then switch to a random video or start the whole process over.
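Here is a minimal sketch of that conversation between the video player and the plot of points. The play head is not finished, so this is not its actual code. The point format, the function names, the `max_distance` cut-off, and the `recent` list used to avoid bouncing between the same few videos are all assumptions of mine.

```python
import math
import random


def nearest_point_on_video(points, video, time):
    """The plotted point of `video` whose time stamp is closest to `time`."""
    own = [p for p in points if p["video"] == video]
    return min(own, key=lambda p: abs(p["time"] - time)) if own else None


def find_cut(points, video, time, max_distance, recent=()):
    """Return (video, time) of the best match to cut to, or None if nothing is close."""
    here = nearest_point_on_video(points, video, time)
    if here is None:
        return None
    best, best_distance = None, max_distance
    for p in points:
        # Skip the video being played and any recently played videos.
        if p["video"] == video or p["video"] in recent:
            continue
        distance = math.hypot(p["x"] - here["x"], p["y"] - here["y"])
        if distance <= best_distance:
            best, best_distance = p, distance
    return (best["video"], best["time"]) if best else None


def random_restart(points, video, rng=random):
    """Fallback: jump to the start of a random other video."""
    others = sorted({p["video"] for p in points if p["video"] != video})
    return (rng.choice(others), 0.0) if others else None


points = [
    {"video": "hpp.mp4", "time": 65.0, "x": 0.40, "y": 0.40},
    {"video": "hpp.mp4", "time": 10.0, "x": 0.90, "y": 0.90},
    {"video": "mkp.mp4", "time": 35.0, "x": 0.42, "y": 0.41},
    {"video": "abc.mp4", "time": 5.0, "x": 0.10, "y": 0.10},
]
```

With these made-up points, asking `find_cut(points, "hpp.mp4", 65.0, max_distance=0.05)` answers with mkp.mp4 at 35 seconds, which is the 01:05 to 00:35 exchange described above. If nothing is within `max_distance`, the function returns `None` and the player can fall back to `random_restart`.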

###Still have some work to do!

I haven’t finished the video player or the play head quite yet, but it is coming along, and [in theory] it isn’t nearly as complicated as the rest of the project proved to be. For the video player I will be using the addon ofxThreadedVideo.

###As promised, TL;DR
To make an algorithmic match-cut application I needed to take image sequences of videos, then run them through a crazy algorithmic image-sorting machine learning application (tSNE) that plots the images as points on a 2D plane, grouping them by their similarities. Then, after the images are plotted, I need to record those images’ coordinates into a JSON file with time stamps, so that I don’t have to wait 3-5 hours for the tSNE algorithm to sort the images every time. After that, I can take the plotted points in the 2D space and tell a video player when an image from one video is close to an image from another video. The images are time stamped, so the points in the 2D space can tell the video player where to set the position of the video that it wants to transition to.