This project investigates video textures, a video-based rendering technique in which new videos are synthesized from short sample videos that exhibit some similarity or repetition over time. These textures are seamless and can be of arbitrary length, and therefore give the illusion of being natural. Because of their small storage requirement and dynamic qualities, video textures can replace both still photos and videos in certain applications, such as the display of dynamic scenes on web pages.
This report deals mainly with one form of video texture, as introduced by Schödl et al. The algorithms for analyzing sample videos and generating textures were implemented, tested and refined. Different scenes were used as samples, and inferences were drawn about which techniques gave acceptable results for each kind of scene. In addition, some new techniques were designed to give better results in certain scenarios or to lower computational cost.
I wish to express my deepest gratitude to my supervisor, Associate Professor Dr. Cham Tat Jen, for his valuable guidance and constructive advice throughout the duration of the project.
This experience gave me the chance to explore more deeply the two fields that interest me, computer graphics and image processing. I was able to build on my knowledge of Matlab, which has applications in countless fields today. Moreover, handling multiple assignments and coursework during this period taught me important lessons in time management and multitasking.
This project aimed to investigate methods for generating video textures: indefinitely long, natural-looking video streams synthesized from a short video clip that exhibits some similarity or repetition over time.
Schödl et al. introduced a type of video texture generated by cycling through the frames of a sample video in a new order. Using the similarity between individual frames as a measure, they were able to produce entirely new randomly ordered or looping videos of arbitrary length that looked seamless and natural.
Here these algorithms were implemented, tested and refined. Different scenes were used as samples, and inferences were drawn about which techniques gave acceptable results for each kind of scene. In addition, some new techniques were designed to give better results in certain scenarios or to lower computational cost.
The experiments were conducted using the following resources:
- Hardware: Intel Core 2 Duo (1.80 GHz) with 2 GB RAM
- Software: Matlab R2008b on Windows Vista
- Data: Several short video samples obtained from the internet
One of the important aims of computer graphics is to synthesize natural phenomena. Such results can be obtained from scratch using complex modeling and rendering algorithms, which rely on detailed geometric models, surfaces and illumination and are therefore computationally very expensive. Other methods rely on images to achieve these results and are collectively called image-based modeling and rendering techniques. The natural next step is to obtain and interpret information from videos and use it to synthesize new sequences, an approach called video-based rendering.
One of the early works in the field of video-based rendering was 'Video Rewrite' (Bregler et al., 1997), in which a sequence showing the face of a person speaking was extracted and associated with the corresponding audio phonemes. The sequence could then be reordered to match a new audio track.
In 2000, Schödl et al. introduced 'Video textures', a technique in which frame pairs were found in a video sample such that a transition from one to the other would not cause any noticeable change. By jumping between frames of the video in this way, new videos could be formed that looked natural. Moreover, blending techniques were added to these transitions to hide any remaining discontinuities. Extracting the dynamic qualities of videos in this way was relatively simple, and yet efficient.
In 2001, Doretto et al. proposed another concept, 'Dynamic textures', which involved learning the dynamics of a video sequence. The original video sequence could be learned with maximum likelihood models, and entirely new frames could then be synthesized, removing the need to store any video. However, the results did not look very realistic and depended heavily on the random spatial characteristics of the subject.
Concept & Implementation
There are a few key steps in generating video textures from the frames of a sample video:
First, we need to find the distance (or similarity) between frames, so as to decide which pairs of frames would produce acceptably smooth transitions from one to the other. Using these pairs, a new order of visiting the frames can be generated, giving entirely new videos.
Secondly, even when two frames are very similar, a jump from one to the other can often lead to an abrupt and unacceptable change. This happens when an object in the scene is moving differently in the two frames. Such transitions need to be prevented by taking measures to preserve the dynamics of motion.
Thirdly, we need to decide in what order to play the video frames. One option is random play, a stochastic technique that chooses the next frame randomly from a list of possible transitions. Another is to find the best transition of a suitable length and play it in a loop.
Lastly, in many cases it is not possible to find very similar frames, and imperfect transitions have to be made. In these cases, we can benefit from techniques for disguising discontinuities in videos, such as cross-fading.
Review of theory:
To carry out the first step, finding the pairs of frames with acceptable transitions from one to the other, we need an image distance metric, i.e. a method of measuring the distance between images (the greater the distance, the lower the similarity).
A number of distance metrics are defined for use with images. These are based on the differences between corresponding pixels of two images of the same dimensions. The two most common and easily calculated metrics are the L1 distance and the L2 distance, which are discussed below.
L1 distance (or Manhattan distance)
To find the L1 distance between two images, they are first converted to grayscale if they are not already. A grayscale image is one in which each pixel carries only intensity information; such images, also called black-and-white, are made up of shades of grey of varying intensity. As each pixel in a grayscale image carries a single value, the image is simply a 2D matrix, and the L1 distance between two images X1 and X2 is the sum, over all pixel positions, of the absolute differences between the corresponding intensity values x1ij and x2ij.
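In symbols, the L1 distance described above can be written as follows. M and N, the image height and width in pixels, are symbols assumed here for illustration.

```latex
L_1(X_1, X_2) = \sum_{i=1}^{M} \sum_{j=1}^{N} \left| x^{(1)}_{ij} - x^{(2)}_{ij} \right|
```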
L2 Distance (Or Euclidean Distance)
As with the L1 distance, to find the L2 distance between two images we first convert them to grayscale to obtain 2D matrices. The L2 distance between two images X1 and X2 is then the square root of the sum, over all pixel positions, of the squared differences between the corresponding intensity values x1ij and x2ij.
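In symbols, the L2 distance described above can be written as follows. M and N, the image height and width in pixels, are symbols assumed here for illustration.

```latex
L_2(X_1, X_2) = \sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} \left( x^{(1)}_{ij} - x^{(2)}_{ij} \right)^{2}}
```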
Monte Carlo methods:
Often used for simulating physical and mathematical systems, Monte Carlo methods are simulation algorithms that repeatedly use random numbers to generate results. Because they depend on large quantities of random numbers, these simulations are usually run as computer programs.
Some general patterns of Monte Carlo methods:
- A set of possible inputs is defined.
- Random inputs are chosen from this set at each step.
- Calculations are performed using these inputs and the outputs are retained.
- These outputs are combined into one aggregate output.
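The four steps above can be illustrated with the classic Monte Carlo estimate of pi. This example is not part of the project; the function name `estimate_pi` and the sample count are chosen purely for illustration.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Monte Carlo estimate of pi from random points in the unit square."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    inside = 0
    for _ in range(num_samples):
        # Steps 1-2: the input set is the unit square; draw one random point.
        x, y = rng.random(), rng.random()
        # Step 3: evaluate the point and retain the outcome.
        if x * x + y * y <= 1.0:
            inside += 1
    # Step 4: aggregate; the fraction of hits approximates pi / 4.
    return 4.0 * inside / num_samples
```

The estimate becomes more accurate as `num_samples` grows, which is why such methods are normally run by a program.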
Cross-fading is a transition from one image to another, with intermediate states in which the visible image is a mixture of the two. When used in a video, the source frame of the transition is linearly faded out as the destination frame is faded in.
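A linear fade of this kind can be sketched in Python as below. This is an illustrative sketch rather than the project's Matlab code: frames are assumed to be grayscale images stored as lists of rows, and `cross_fade` and `steps` are hypothetical names.

```python
def cross_fade(src, dst, steps):
    """Return `steps` intermediate frames that fade src out and dst in."""
    frames = []
    for k in range(1, steps + 1):
        alpha = k / (steps + 1)  # weight of the destination frame
        frames.append([
            [(1.0 - alpha) * s + alpha * d for s, d in zip(src_row, dst_row)]
            for src_row, dst_row in zip(src, dst)
        ])
    return frames
```

With `steps=1` the single intermediate frame is the plain average of the two frames.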
The system was implemented in Matlab with the help of its built-in image processing features. It consisted mainly of two parts:
- Analysis component: This part gets the similarity measures and other required data from the frames of sample videos.
- Synthesis component: This part uses the information extracted earlier to render new videos with required properties.
Analysis: Extracting the video texture
Euclidean distance was used to measure distances, as it is simple and works well in practice. The Euclidean (L2) distance between two images was calculated by examining all pairs of corresponding pixels in the two images, summing the squares of the differences between their grayscale values, and taking the square root of this total.
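A minimal Python sketch of this computation follows. It is not the project's Matlab code; it assumes grayscale frames stored as lists of rows, and `l2_distance` is a hypothetical name.

```python
import math


def l2_distance(frame_a, frame_b):
    """Euclidean distance between two grayscale frames of equal size."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of rows")
    total = 0.0
    for row_a, row_b in zip(frame_a, frame_b):
        if len(row_a) != len(row_b):
            raise ValueError("frames must have the same number of columns")
        for a, b in zip(row_a, row_b):
            diff = a - b
            total += diff * diff  # accumulate squared pixel differences
    return math.sqrt(total)
```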
To calculate and store the distances between all the frames in the video, a matrix was used. A distance matrix is a matrix in which the value at row i and column j is the distance between the ith and jth frames of the analysed video.
As a formula, this is given by
Dij = d(Xi, Xj)
where Dij is the value at row i and column j of the matrix, and d is the Euclidean distance between two images.
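The construction of the distance matrix can be sketched in Python as below. This is an assumption-laden illustration, not the project's Matlab code: frames are lists of rows, indices are 0-based (unlike Matlab), and `distance_matrix` is a hypothetical name. Since d(Xi, Xj) = d(Xj, Xi), only the upper triangle is computed and then mirrored.

```python
import math


def distance_matrix(frames):
    """Symmetric matrix of Euclidean distances between all frame pairs."""
    # Flatten each 2D frame into a single vector of pixel values.
    flat = [[pixel for row in frame for pixel in row] for frame in frames]
    n = len(flat)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(flat[i], flat[j])  # Euclidean distance (Python 3.8+)
            dist[i][j] = d
            dist[j][i] = d  # the matrix is symmetric
    return dist
```

Exploiting the symmetry roughly halves the number of frame comparisons.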
Passing the distance matrix through an exponential function gives a measure of the similarity of the frames, stored in a matrix P whose entries decay exponentially with distance at a rate set by a parameter s.
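Consistent with the description that follows and with the mapping used by Schödl et al., this relation can be written as below. The exact form used in the project is an assumption here.

```latex
P_{ij} = \exp\!\left( -\frac{D_{i+1,\,j}}{s} \right)
```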
In this matrix, a larger value in row i and column j means greater similarity between frames i+1 and j of the sample video. The value of s therefore sets the sensitivity for making transitions, i.e. whether fewer but smoother transitions are made, or more transitions are allowed at the cost of some abrupt changes.
In the implementation, s was set equal to the average of all distance values from the distance matrix, to get values in matrix P varying from 0 to 1.
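A possible Python sketch of this step is shown below, assuming the exponential mapping exp(-D/s) of Schödl et al. and 0-based indices. The name `similarity_matrix` and the handling of the all-zero case are assumptions, not the project's code. Because the last frame has no successor, the result has one row fewer than the distance matrix.

```python
import math


def similarity_matrix(dist):
    """Map distances to similarities in (0, 1] using s = mean distance."""
    n = len(dist)
    s = sum(sum(row) for row in dist) / (n * n)  # average of all distances
    if s == 0:
        # All frames identical: every transition is equally good (assumption).
        return [[1.0] * n for _ in range(n - 1)]
    # Row i compares the successor of frame i with every candidate frame j.
    return [
        [math.exp(-dist[i + 1][j] / s) for j in range(n)]
        for i in range(n - 1)
    ]
```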
Maintaining similarity across frames alone was not enough. Most videos contained frames that were very similar, yet jumping from one to the other produced unacceptable behavior because an object in the scene was moving in a different direction in each.
For example, in a video of a swinging pendulum, each frame of the left-to-right swing has a corresponding frame in the right-to-left swing that looks very similar.
A simple way to overcome this problem is to decide whether a jump from one frame to another is possible not just from the similarity of those two frames, but also from the similarity of their adjacent frames. The distance matrix can therefore be filtered along its diagonals with weights wk, where m = 2 to 4 and the wk are binomial so that the middle frames have more influence.
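Following Schödl et al., the filtered distance can be written as below. The summation range shown is the one used in that paper and is an assumption as far as this project is concerned.

```latex
D'_{ij} = \sum_{k=-m}^{m-1} w_k \, D_{i+k,\; j+k}
```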
In the implementation, m = 4 was used with binomial weights to obtain the filtered matrix from the distance matrix.
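The filtering step might be sketched in Python as follows. This is not the project's Matlab code: it assumes a kernel of 2m taps covering k = -m to m-1 as in Schödl et al. (whether the project's m denotes this half-width or the number of taps is not stated), and the names `binomial_weights` and `filter_distances` are hypothetical. Border entries without enough neighbours are dropped, so entry [a][b] of the result corresponds to frames a+m and b+m of the input.

```python
import math


def binomial_weights(m):
    """Normalized binomial weights for a 2m-tap kernel."""
    taps = 2 * m
    coeffs = [math.comb(taps - 1, t) for t in range(taps)]
    total = sum(coeffs)
    return [c / total for c in coeffs]


def filter_distances(dist, m):
    """Filter a distance matrix along its diagonals to preserve dynamics."""
    n = len(dist)
    if n - 2 * m + 1 < 1:
        raise ValueError("too few frames for this kernel size")
    w = binomial_weights(m)
    filtered = []
    for i in range(m, n - m + 1):
        row = []
        for j in range(m, n - m + 1):
            # Weighted sum of distances between neighbouring frame pairs.
            row.append(sum(w[k + m] * dist[i + k][j + k] for k in range(-m, m)))
        filtered.append(row)
    return filtered
```

For m = 2 the weights are 1/8, 3/8, 3/8, 1/8, so the two middle frame pairs dominate the filtered value.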
On computing similarity matrix P with these new distance values, the unacceptable jumps no longer had high probability, as seen in the image below.
In some cases, where only a smaller portion of the video was dynamic or the dynamics of remaining parts could be ignored, preserving similarity and motion only in this smaller portion gave better results, and also greatly reduced the computational time.
To automatically find the dynamic region in the video, variance of each pixel of the frames was calculated across time. Only the pixels with variance above a certain minimum were analyzed.
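As a rough Python illustration of this idea (not the project's Matlab code), the dynamic region can be represented as a boolean mask. Frames are assumed to be grayscale lists of rows, the population variance is used, and `dynamic_mask` and `min_variance` are hypothetical names.

```python
def dynamic_mask(frames, min_variance):
    """Mark pixels whose intensity variance over time exceeds a minimum."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    mask = []
    for r in range(rows):
        mask_row = []
        for c in range(cols):
            values = [frame[r][c] for frame in frames]  # pixel over time
            mean = sum(values) / n
            variance = sum((v - mean) ** 2 for v in values) / n
            mask_row.append(variance > min_variance)
        mask.append(mask_row)
    return mask
```

Only pixels where the mask is true would then contribute to the frame distances.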
Synthesis: Sequencing the video texture
To generate random textures on the fly, a Monte Carlo method was used: start at a random frame in the video, and keep choosing the next frame randomly from the set of similar frames found using the distance matrix. In this way, a completely new, natural-looking video similar to the original can be obtained.
To obtain the set of similar frames at each transition, a minimum cut-off similarity was chosen, a value between 0.5 and 1. The choice depended on the sample being processed: videos with very high similarity across frames required a high cut-off, while those without highly similar frames needed a lower cut-off to allow more randomness.
Also, most videos were seen to contain groups of consecutive frames that were very similar to each other. To avoid getting stuck in such a group, the only moves allowed were to the next frame in the sequence or to a frame at some minimum distance from the current frame.
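The random-play procedure described above might look like the following Python sketch. It is a simplified illustration, not the project's code: `similarity[i][j]` is assumed to rate how well frame j can follow frame i (with no row for the last frame), and the restart at the last frame, the fallback to plain playback, and all names are assumptions.

```python
import random


def random_play(similarity, length, cutoff, min_jump, seed=None):
    """Generate a frame order by repeatedly picking a random good successor."""
    rng = random.Random(seed)
    rows = len(similarity)  # frames 0 .. rows-1 have a successor row
    current = rng.randrange(rows)
    order = [current]
    while len(order) < length:
        if current >= rows:
            # The last frame has no successor row: restart (assumption).
            current = rng.randrange(rows)
        else:
            row = similarity[current]
            candidates = [
                j for j in range(len(row))
                if row[j] >= cutoff
                and (j == current + 1 or abs(j - current) >= min_jump)
            ]
            # With no acceptable jump, simply play the next frame.
            current = rng.choice(candidates) if candidates else current + 1
        order.append(current)
    return order
```

Raising `cutoff` gives fewer but smoother jumps, while a larger `min_jump` keeps playback from lingering among near-identical neighbouring frames.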
Although the previous methods ensured transitions with the fewest discontinuities, cross-fading was used to smooth the transitions further. Instead of moving from one frame to the other directly, intermediate frames were displayed that blended the frames before and after the transition. Each displayed frame was a weighted average of all participating frames, with the weights ai normalized to sum to 1.
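Written out, the weighted average takes the form below, where B (the blended frame) and Xi (the participating frames) are symbols assumed here for illustration.

```latex
B = \sum_{i} a_i X_i, \qquad \sum_{i} a_i = 1
```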
This technique avoided noticeable changes, but blurred the frames wherever there was a discontinuity. To avoid this varying blur, a new approach to cross-fading was used: instead of cross-fading from one frame to the other, the average of the current and next frame was always displayed, and never the original frames, so that a constant level of blur was maintained.
Best loop Play
In cases where the number of similar frames is very low, it was better simply to find the best transition and loop it. This method is especially suited to videos that naturally repeat almost exactly.
To find the best transition, a smallest allowed jump (in number of frames) was chosen, and the distance matrix was scanned for the most similar transition of length greater than or equal to this minimum. As groups of consecutive frames show high similarity, the minimum length also prevented such very short loops from being chosen.
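One way to sketch this scan in Python is shown below. It is an illustration under stated assumptions, not the project's code: the loop plays frames `first` to `last` and then jumps back to `first`, so the jump is judged by how closely the frame after `last` resembles `first`. Indices are 0-based and `best_loop` is a hypothetical name.

```python
def best_loop(dist, min_length):
    """Return (first, last) of the loop with the cheapest backward jump."""
    n = len(dist)
    best = None
    for last in range(n - 1):  # frame last + 1 must exist
        for first in range(last + 1):
            if last - first + 1 < min_length:
                continue  # loop too short
            cost = dist[last + 1][first]
            if best is None or cost < best[0]:
                best = (cost, first, last)
    if best is None:
        raise ValueError("no loop of the requested minimum length exists")
    return best[1], best[2]
```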
Experiments & Observations:
A number of videos showing repetition or similarity in time were chosen as input to the system, and best possible video textures were generated by the methods described so far. Here is a summary of the results obtained.
Because the flame flickers quickly and very randomly, motion dynamics did not even need to be preserved here. Random play using the unfiltered similarity matrix generated a completely different, natural-looking video of indefinite length on each try.
On filtering the matrix for dynamics preservation, a bit more improvement could be seen in the synthesized texture.
The random play method with motion preservation, i.e. with the filtered similarity matrix, worked well here, with occasional slight glitches. To remove any remaining discontinuity in the transitions, cross-fading was applied across frames.
Even better results were obtained when only a smaller portion enclosing the face was analyzed to form the distance matrix, which was then filtered and used to make the texture. This is because only this part of the video actually needs its dynamics preserved; the rest just adds noise to the distance matrix.
Such textures are very useful, as they can be used instead of static images on web pages and as display pictures in instant messengers.
This is a case of almost perfect repetition in the original video itself, and therefore the best-loop method was used. The algorithm found the full pendulum swing and turned it into a loop, giving a natural-looking texture of indefinite length.
This is an extension of the methods described before. The video sample showed two people on swings side by side, but not swinging synchronously. As this was again a case of almost perfect repetition, the best-loop method was tried on the whole video, but the best loop found did not contain a full swing of either swing.
As the two swings always stayed in their own half of the frame and never crossed over, a new method was used here. The video was divided into two halves, each containing one swing, and the halves were analyzed separately. With the analyses independent of each other, the algorithm could find the best full swing for each swing. Two textures were thus synthesized separately and joined back together to give a natural-looking video texture with both swings.
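The joining step can be sketched in Python as below, assuming the frame was split into left and right halves and each half has already been reduced to its own best loop. The two loops may differ in length, so each is indexed modulo its own length. All names are hypothetical and this is not the project's code.

```python
def join_looped_halves(left_loop, right_loop, length):
    """Play two independently looped half-frames side by side."""
    frames = []
    for t in range(length):
        left = left_loop[t % len(left_loop)]     # left half cycles on its own
        right = right_loop[t % len(right_loop)]  # right half cycles on its own
        # Concatenate each row of the left half with the matching right row.
        frames.append([l_row + r_row for l_row, r_row in zip(left, right)])
    return frames
```

Because the halves cycle independently, the combined video only repeats exactly after the least common multiple of the two loop lengths.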
Thus, the techniques described above worked well on the various samples tested here, resulting in natural-looking videos.
Video textures provide an entirely new medium, with dynamic qualities between those of a static image and a standard video, and because of their small size they have potential applications in many fields.
This is just one case from the class of techniques called Video-based rendering. Just like video textures, these other techniques employ real-world footage into synthesis of textures with same dynamic nature.
- The method of calculating the distance matrix from video samples described here is computationally expensive: because every pair of frames is compared, the cost grows with the square of the number of frames and linearly with the number of pixels per frame, both of which are typically large.
- The method fails for videos such as a waterfall, where all frames are far apart and almost equally distant from one another, so that no usable similarity can be found.
- Replacement of pictures and videos on web-pages and applications with small-sized yet dynamic video textures.
- Creation of dynamic backgrounds or characters for special effects and games.
- Creation of animation from video samples.
Areas for future work
- Better and faster distance metrics: As noted above, the slow computation of the distance matrix is the bottleneck of the entire technique. Some other measure of image distance or similarity might be used instead of the L2 distance.
- Better blending techniques: For more realism, especially in cases like a headshot, a more powerful blending technique than cross-fading could be tried.
- Addition of UI and tools for users: Instead of using fixed or automatically chosen settings for the various parameters, it might be a good idea to give users full control over them, allowing more creativity with the results.
- A. Schödl, R. Szeliski, D. H. Salesin, and I. Essa. "Video Textures." SIGGRAPH, Computer Graphics Proceedings. 2000.
- C. Bregler, M. Covell, and M. Slaney. "Video Rewrite: Driving visual speech with audio." SIGGRAPH, Computer Graphics Proceedings. 1997.
- P. H. Bugatti, A. J. M. Traina, and C. Traina, Jr. "Assessing the best integration between distance-function and image-feature to answer similarity queries." ACM. 2008.
- S. Soatto, G. Doretto, and Y. N. Wu. "Dynamic Textures." International Conference on Computer Vision. 2001.
- R. C. Gonzalez and R. E. Woods. "Digital Image Processing." 2008.