Throughout history, people have always felt the need to meet and establish eye contact. For a long time, the most important form of remote interaction (that is, interaction between physically distant persons) was written communication. Despite the richness of written language, an essential part of human communication was missing: non-verbal communication. The invention of the telephone allowed people to communicate more realistically and to hear the meaning behind the spoken words. Nevertheless, real face-to-face communication cannot be traded for voice alone.
Research has shown that eye contact and gestures are among the most significant non-verbal cues for expressing feelings and attitudes. Therefore, for most people eye contact plays a large role in every meaningful conversation. A traditional telephone conversation gives no eye-contact cues, but videoconferencing should change this. Videoconferencing allows the parties to communicate face to face spontaneously while being physically apart. It is the best technology invented so far, in the sense that it conveys more of human verbal and non-verbal communication than any other system.
Seeing the person you are talking to is the whole point of videoconferencing. However, the lack of real eye contact is one of its major disadvantages, and it degrades the user experience. Some scientists even claim that the lack of eye contact is one of the reasons why the videophone did not succeed.
In a low-cost videoconferencing system (where a single web camera is used), the camera pointing at the user's face is positioned above, below, or to the side of the display. Since the camera that captures the user's face is not aligned with the conference image that the user looks at, eye contact is impossible. If the camera is mounted above the monitor, the face appears to be looking downward. Because the user looks at the display, at the person being talked to, rather than into the camera, an impression of disinterest can unintentionally be given. The viewer has the impression that the user is looking into his or her eyes only if the user looks directly into the camera.
The ideal solution to the lack of eye contact in a videoconferencing system would be to place the camera in exactly the same place as the window the user is looking at. So far this is impossible, so correcting the gaze requires a virtual image to be generated.
This thesis aims at developing software that automatically synthesizes a virtual viewpoint located directly behind the display. A simple and effective approach is to use a pair of cameras, one placed to the right of the display and the other to the left. The computed viewpoint is at a fixed position between the two cameras, enabling both users to look each other in the eyes during the session.
The theoretical approach is based on depth-image-based rendering (DIBR), where a depth map is constructed and then used to extract the synthesized 2D viewpoint. In a second step, this DIBR approach is complemented with image-based rendering (IBR) techniques to correct some visual artefacts.
A great deal of research has studied different ways of bringing videoconferencing closer to real face-to-face communication. Several hardware solutions have been introduced, most of them based on a semi-reflective transparent screen. The problems with these solutions are their price and bulky setup. Therefore, they are outside the scope of this thesis.
On the other hand, various software solutions have been proposed that address the lack of eye contact by exploiting computer vision and image processing algorithms. They differ in the number of cameras used in the system setup.
Yip et al. proposed a solution with a single camera. It involves face re-orientation using the rotate ellipsoid and anti-rotate ellipsoid operations. The idea is to model the downward-looking head with the rotate ellipsoid operation and then to rotate the head upward. The main drawback of the monocular approach is reduced realism, which can degrade the user experience.
Another videoconferencing system with a monocular camera has also been presented. The idea is to correct gaze by tracking the eyes and then warping them, based on the eye geometry and the camera position of the remote user. As a result, the orientation of the eyes is changed and the gaze is redirected towards the camera location, i.e. towards the second user. Nevertheless, one can argue that the established eye contact does not look natural enough, given that the position of the face remains the same.
Systems with two cameras have the advantage that they can recover depth and generate a personalized face model. Popular methods involve head modeling and various image processing techniques.
One proposed solution is based on locating the head in two stereo images and using this information to generate an intermediate view. A personalized 3D face model is used by the head tracking module as a reference; it enables head position tracking by matching the current views from the cameras with identical points in the 3D face model. Although the achievement in correcting gaze is undeniable, there are still open issues in reconstructing the view when anything apart from the head has to be visible in the scene (e.g. the background). This solution also fails when something is in front of the face, for example a hand, which is a likely scenario in videoconferencing.
Another method is to apply a dense stereo matching algorithm to construct a depth map of the scene and then to use that depth information to synthesize a two-dimensional viewpoint. A review of top-ranked dense stereo matching algorithms and their classification is given in Scharstein and Szeliski's survey.
Graph cut methods are reported to outperform all other stereo matching methods and to provide excellent disparity maps. However, graph cut methods are computationally intensive. Eye contact for videoconferencing must be established in real time; consequently, these methods are unsuitable for our work.
The second-best evaluated method that produces a precise depth map is belief propagation. A real-time belief propagation stereo approach based on energy minimization has been proposed. Even though the results presented in that paper are more than acceptable, using only 16 levels of disparity is not enough for high-quality wide-baseline stereo.
Alternatively, the dynamic programming based approach has proved to be one of the most computationally efficient, and is thus appropriate for real-time communication applications. Although the dynamic programming method finds a global minimum in polynomial time, it is not free from problems. One of the main issues with this approach is horizontal streaks in the disparity map. Introducing a smoothness constraint in the optimization function could solve this problem.
The first dynamic programming based stereo matching method was introduced by Ohta and Kanade in 1985. Since then, a variety of dynamic programming (DP) algorithms have been proposed to solve the correspondence problem. Criminisi et al. accomplish gaze correction for videoconferencing with a DP approach. The key idea of the proposed method is to optimize a cost function. It computes the minimum-cost path through the matching costs between two scanlines and improves on it by introducing a three-state model. As a result, this method provides a more detailed disparity map and better occlusion detection.
The goal of this thesis is to implement a dynamic programming algorithm based on the solution presented in the paper Gaze Manipulation for One-to-one Teleconferencing by Criminisi et al.
Chapters 2 and 3 of this thesis introduce the theoretical background that is necessary to understand how we designed our algorithm. Chapter 2 describes our system setup and the calibration of the cameras. In Chapter 3 we introduce the basic principles of stereo vision and discuss the correspondence problem. Disparity estimation using dynamic programming is described in Chapter 4. The next step in establishing eye contact is view synthesis: in Chapter 5 we present two methods for interpolating a novel view. Finally, in Chapter 6 the results of the proposed method are shown and discussed. The conclusion is presented in Chapter 7.
In the previous chapter, we informally presented the motivation and goals in order to provide an intuitive understanding of the problem. In this chapter, we focus on the system setup and camera calibration.
For the stereo system, three Microsoft H5D-0003 LifeCam Cinema web cameras are used. They have a resolution of 1280x720 pixels, and the baseline distance ranges from 30 cm to 60 cm. They were selected as a cheap way of obtaining video streams quickly for rapid development of the project. Our basic setup consists of three cameras in a row. The third camera is used for collecting ground truth data. A camera holder was built to fix the positions of the cameras and to avoid repeating the calibration procedure.
As in photography, lighting for videoconferencing systems should come from behind the camera, not from in front of it. The lighting should also be diffuse rather than coming from point sources. Stark, direct lighting that causes shadows, such as indoor spotlights, does not work well in a videoconferencing environment. On the other hand, lamps with large diffusing lampshades are good, as they create soft shadows rather than high-contrast images.
For instance, Cisco's TelePresence 500 all-in-one terminal includes its own strip light above the screen. Also, windows should be behind the cameras rather than in front of them. This rule is all too often violated, particularly in office desktop systems where the user sits with a window behind them, making for a dingy image as the camera tries to expose for the outside rather than for the participant inside.
Camera calibration is important for relating camera measurements with measurements in the real, three-dimensional world. This is important because scenes are not only three dimensional; they are also physical spaces with physical units. Hence, the relation between the camera's natural units (pixels) and the units of the physical world (e.g. meters) is a critical component in any attempt to reconstruct a 3D scene.
The simplest form of real camera consists of a pinhole and an image plane. A pinhole camera model assumes that all projection rays from the camera intersect at a single point known as the camera centre. The relation between the world coordinates of a point P(X,Y,Z) and the coordinates (x,y) of its image point p on the image plane in a pinhole camera is
\[ x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}, \]
where f is the focal distance of the lens.
Using homogeneous coordinates, P and p can be represented by (X,Y,Z,1)^T and (x,y,1)^T, respectively.
Represented as homogeneous vectors, the mapping from three-dimensional space to two-dimensional space can be expressed as a matrix multiplication:
\[ Z \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}. \]
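As a minimal sketch of the pinhole projection described above, the following Python functions map a 3D point to image coordinates, first directly and then through a 3x4 matrix acting on homogeneous coordinates. The sketch assumes the common ideal form x = fX/Z, y = fY/Z with the point expressed in the camera frame; the function names are illustrative and are not part of the thesis software.

```python
def project_pinhole(point, f):
    """Project a 3D point (X, Y, Z) onto the image plane of an ideal pinhole camera."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("the point must lie in front of the camera (Z > 0)")
    return (f * X / Z, f * Y / Z)


def project_homogeneous(point, f):
    """Same projection, written as a 3x4 matrix applied to (X, Y, Z, 1)."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("the point must lie in front of the camera (Z > 0)")
    P = (X, Y, Z, 1.0)
    M = ((f, 0.0, 0.0, 0.0),
         (0.0, f, 0.0, 0.0),
         (0.0, 0.0, 1.0, 0.0))
    # Matrix-vector product gives (Z*x, Z*y, Z); dividing by the last
    # component recovers the image coordinates.
    u, v, w = (sum(m * p for m, p in zip(row, P)) for row in M)
    return (u / w, v / w)
```

Both functions return the same coordinates; the matrix form is the one that extends naturally to intrinsic and extrinsic parameters.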
To relate two images captured by two different cameras, both the intrinsic and the extrinsic parameters of the cameras are needed.
The pinhole model is an ideal camera model. It does not take into consideration distortion effects introduced by real lenses. The major components of lens distortion are radial distortion and slight tangential distortion. Four parameters are used to describe the distortion: k_1 and k_2 are the radial distortion coefficients; p_1 and p_2 are the tangential distortion coefficients. After the calibration process of a camera, the distortion parameters are known and can be used to correct the distortion.
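The effect of the four distortion parameters can be illustrated with a short sketch. It assumes the polynomial model used by OpenCV, with two radial coefficients (here called k1, k2) and two tangential coefficients (p1, p2) applied to normalized image coordinates; the coefficient names and the function are assumptions made for illustration.

```python
def distort(x, y, k1, k2, p1, p2):
    """Apply radial (k1, k2) and tangential (p1, p2) distortion to a
    normalized image point (x, y). Assumed model: the OpenCV polynomial form."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d
```

With all coefficients set to zero the point is unchanged, which corresponds to the ideal pinhole model. Correcting the distortion means inverting this mapping, which is usually done numerically.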
We have presented a camera model describing the projection of real-world coordinates into the camera image. Now the parameters describing this camera model have to be determined. This is called intrinsic camera calibration.
The previous section introduced four internal parameters to describe a camera:
f_x and f_y, the focal length in pixel-related units, and c_x and c_y, the coordinates of the optical center in the camera image. These parameters are complemented by the four distortion parameters.
Moreover, six extrinsic parameters describe the position of the camera in the world coordinate system: three parameters for the rotation R and three for the translation t.
Some of the intrinsic parameters can be retrieved from the camera specification sheet, but due to manufacturing tolerances these values are very inaccurate. Thus, the parameters must be determined precisely for each camera. This process is called calibration. We use the OpenCV library to calibrate the cameras; it implements Zhang's algorithm.
This algorithm requires the camera to observe a planar pattern at different orientations. The main idea of Zhang's algorithm is to estimate a homography between the model plane and its image in the camera for each view. A homography is a mathematical relation between two figures such that any given point in one figure corresponds to one and only one point in the other, and vice versa. Feature points of the planar pattern are detected in the images and associated with feature points of the model plane. Using a technique based on a maximum likelihood criterion, a homography is estimated for each observation, mapping the model image to the camera image. The extraction of the feature points can easily be automated in some cases. The OpenCV library provides a function to extract the corners of a chessboard pattern.
Zhang's algorithm starts with an analytical solution, computed using the linear part of the camera model (i.e. without distortion). This solution is then optimized using a nonlinear technique based on a maximum likelihood criterion. Finally, by comparing the analytical and the nonlinear solutions, the distortion parameters are estimated.
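To make the notion of a homography concrete, the sketch below applies a 3x3 homography matrix to a 2D point using homogeneous coordinates. The function name and the example matrices are illustrative assumptions; estimating the homography from detected chessboard corners is not shown.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography H (given as nested sequences)."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("the point is mapped to infinity")
    return (u / w, v / w)


# A pure translation by (2, 3) written as a homography.
H_shift = [[1.0, 0.0, 2.0],
           [0.0, 1.0, 3.0],
           [0.0, 0.0, 1.0]]
shifted = apply_homography(H_shift, (1.0, 1.0))
```

Because the result is divided by the third component, multiplying H by any non-zero scalar gives the same mapping, which is why a homography has only eight degrees of freedom.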
This chapter presents background theories and concepts related to this thesis. A brief introduction to stereo vision is presented in Section 3.1. Camera models and calibrations are discussed, along with the fundamentals of epipolar and projective geometry.
In the classical stereo vision problem, two images observe a static scene from different views. The main task of stereo vision is to compute three-dimensional data from these two two-dimensional input images. Computer stereo vision tries to imitate the human visual system. The human visual system obtains information about the world through two planar images that are captured on the retina of each eye. The position of a scene point in the right view is horizontally shifted relative to its position in the left view. The human brain uses this displacement, commonly referred to as the disparity, to deduce the depth of the scene. Although this process appears simple, it is surprisingly difficult for computers.
The major challenge that one faces in computer vision is solving the correspondence problem.
The correspondence problem describes the task of automatically computing the correct disparity value at each pixel. This thesis focuses on finding the disparity information and reconstructing the intermediate view given a sequence of images captured from two cameras.
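The link between disparity and depth can be sketched in a few lines. The relation Z = f B / d used here is the standard one for a rectified pair of parallel cameras and is an assumption of this sketch, as are the parameter names (focal length f in pixels, baseline B in metres).

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth of a scene point for a rectified stereo pair: Z = f * B / d.
    Assumes parallel cameras and a strictly positive disparity in pixels."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px
```

For example, with a focal length of 800 pixels and a baseline of 0.5 m, a disparity of 100 pixels corresponds to a depth of 4 m. Nearby points have large disparities, so a wide baseline and a close subject lead to a wide disparity search range.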
In other words, the epipole is the point where the left camera is seen from the right camera. Since the stereo problem is symmetrical, the same observations can be made when searching for the matching point in the left view. Using the knowledge about epipolar lines, the correspondence problem can be reduced to a one-dimensional search task.
To benefit from this simple geometry, the images of two cameras in general position can be reprojected onto a plane that is parallel to the baseline.
This process is known as rectification or epipolar rectification. The rectification step involves resampling of the image, and therefore some precision in the 3D-reconstruction is lost.
However, since it is more convenient to search for correspondences along horizontal scanlines than to trace general epipolar lines, this transformation is commonly applied. In fact, most stereo matching algorithms work with rectified images.
In order to compute accurate correspondences between two images, it is necessary to identify the issues that make this task difficult and to make the right assumptions about them.
The first issue one faces when preparing a proper setup for recording stereo images is the change in color or intensity between pixels originating from the same 3D point of the scene. Establishing the same lighting conditions for two viewpoints turned out to be quite challenging. A wide baseline between the two cameras leads to different intensity values for corresponding pixels, which can produce false matches. To avoid this, we have to achieve the same illumination in both views. This can be done by using diffuse lighting.
The second problem is that the solution may not be unique. Matching can be ambiguous in untextured areas, where many candidate points have almost the same intensity. For some points in the left view, a matching point cannot be found uniquely in the right view.
Having identified the problems one encounters while searching for correspondences between two views, it is essential to limit their influence by imposing more constraints.
In the previous section we introduced the epipolar constraint, which says that for an observed point in the left image, the matching point lies on a particular epipolar line in the right image. If the extrinsic and intrinsic parameters of the two cameras are known, then for each observed point of the left image the corresponding epipolar line in the right image can be found. The same reasoning holds for points of the right view.
The epipolar constraint is one of the most important rules in stereo vision, as it reduces the stereo correspondence search area to one dimension (a single line). In order to match two views, we have to assume that the epipolar constraint is satisfied.
The photo-consistency constraint is required by almost every image-based rendering algorithm. If the lighting conditions of the observed scene appear similar in the two camera views, photo-consistency is preserved. Consequently, we can include the photo-consistency constraint to strengthen our final solution.
It is clear that finding correspondences only by matching intensity values along epipolar lines will produce poor results. To minimize false matches, it is necessary to introduce more constraints.
Many stereo matching algorithms assume a uniqueness constraint, that is, a pixel in the left image corresponds to at most one pixel in the right image. The uniqueness constraint is violated in the case of transparent surfaces, where a pixel of one view can be a combination of points from two different surfaces.
On the other hand, detection of occlusions is done by implementing one-to-one correspondences for visible pixels across images. Since we assume that our scene consists of solid surfaces we can take the uniqueness constraint into account.
The ordering constraint says that the order in which points occur is preserved in both images. More precisely, let pl be the position of a left-image point that corresponds to pr in the right image, and let another point ql of the left image correspond to qr in the second view. The ordering assumption then states that if pl lies to the left of ql, then pr lies to the left of qr. The advantage of the ordering assumption is that it allows for the explicit detection of occlusions. It is known to fail for scenes containing thin foreground objects, which is not the case in videoconferencing, where the foreground object is a person.
The smoothness assumption states that disparity varies smoothly almost everywhere (except at depth boundaries). That means we can expect our disparity map to be piecewise smooth. Smoothness is assumed by almost every correspondence algorithm, either implicitly or explicitly.
In general, the goal of every correspondence algorithm is to find pixels or features in the two input images that correspond to the same point or entity of the observed 3D scene. Based on their approach, stereo matching methods can be classified as either local or global.
Local Stereo Matching Methods
Local stereo matching methods are known to be very fast and are therefore suitable for real-time applications. They search for the best correspondence by matching local patches independently, based on a matching cost function. Usually, windows centered on the pixel under consideration are used for comparison. Some of the most commonly used local cost functions are the Sum of Squared Differences (SSD), the Sum of Absolute Differences (SAD) and Normalized Cross-Correlation (NCC).
The main disadvantages of the local approach are inaccuracy and high sensitivity to intensity variation. It fails to find correct matches in textureless regions and at depth discontinuities. This is particularly true for a wide baseline between the two cameras.
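The three local cost functions named above can be written compactly for two windows given as flat lists of intensities. This is an illustrative sketch; the NCC shown is the zero-mean variant, which is an assumption, since the text does not say which normalization is meant.

```python
import math


def sad(a, b):
    """Sum of Absolute Differences between two equally sized windows."""
    return sum(abs(x - y) for x, y in zip(a, b))


def ssd(a, b):
    """Sum of Squared Differences between two equally sized windows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ncc(a, b):
    """Zero-mean normalized cross-correlation; 1.0 means a perfect match."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0  # flat windows carry no texture


window_left = [1, 2, 3, 4]
window_right = [2, 4, 6, 8]  # same pattern, twice as bright
```

For these two windows SAD and SSD report a mismatch (10 and 30), while NCC is 1 because it is invariant to a change of gain. This illustrates the sensitivity of the simpler costs to intensity variation between the two views.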
Global Stereo Matching Methods
Global stereo matching methods try to overcome these problems. The aim of this approach is to minimize a global cost function. Such methods tend to obtain more accurate estimates in challenging image regions. Here the match between two pixels in the left and right images depends not only on their adjacent pixels but also on the matches of those adjacent pixels.
Global smoothing aims to reduce the sensitivity of stereo correspondence to uncertainties caused by occlusions, textureless regions or variations in illumination. This enhancement has a cost: increased algorithmic complexity and hence longer execution time, in addition to some side effects of the smoothing.
Judging by the number of papers published in the field of computer stereo vision, we can conclude that finding correspondences is not as straightforward as it seems. In the previous chapters we introduced some basic concepts for solving stereo correspondence.
Weighing the pros and cons of the proposed solutions, a dynamic programming based algorithm seems to be the most appropriate for our system setup, with its wide baseline and large range of disparities.
In this chapter we discuss this concept in detail. We start by explaining the conventional dynamic programming algorithm proposed by Cox, which forms the basis for the algorithm of Criminisi et al. That algorithm is then introduced in Section 4.2. Finally, in Section 4.3 we propose our solution.
It is important to emphasize that precisely performed steps involving illumination adjustment, camera calibration and image rectification are required before starting any DP based correspondence search.
The method proposed by Cox et al. uses a one-dimensional approach to generate the disparity map. It represents a trade-off between the speed and the accuracy of disparity estimation.
The basic idea behind the Cox algorithm is to minimize a cost function along each scanline. We assume that the epipolar constraint is satisfied. If the two cameras are aligned, the search is simplified to one dimension. Since perfect alignment is difficult to achieve, additional rectification of the two images is required.
We build a matrix in which the rows correspond to the pixels of the right scanline and the columns correspond to the pixels of the left scanline. This matrix represents the correspondences between the pixels of both images, as well as the pixels that have no correspondence because of an occlusion in the other image. Each element of the matrix represents the correspondence between the right image point of the current row and the left image point of the current column.
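The matrix is calculated row by row, storing in each square the cost of establishing this correspondence. To calculate the cost at each position we distinguish three possible actions:
We arrive from the previous square on the diagonal. In this case we establish the correspondence between the next pixel of the left line and the next pixel of the right line. The other two actions arrive from the previous square in the same row or in the same column; in these cases a pixel of one line is left without a correspondence, i.e. it is treated as occluded.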
There can be a matched move, where a particular pair of pixels is a match, or an occluded move, where the pixel in one image is occluded in the other. In a matched move, the cost is given by the matching function. For an occluded move, the cost is a constant value that represents the threshold of what can be considered a match. If the matching cost is too high, the pair of pixels is too different; therefore one of the pixels must be occluded in the other image.
The intermediate values are stored in a W-by-W cumulative matrix of costs. The disparity map is calculated using a dynamic programming algorithm. The forward step consists of creating the cumulative matrix of costs, each entry being the minimum cost to reach a certain pixel pair.
Once the cumulative matrix of costs has been created, the path of minimum costs back from one end of the image to the other gives the surface for a certain scanline. This is the backward step of the algorithm.
The surface generated by the dynamic programming algorithm is used to generate an intermediate view.
The Cox algorithm produces a large number of artefacts when creating an intermediate view. The two most significant types of artefact are horizontal streaks and haloing. The Criminisi algorithm takes the basic idea of the Cox algorithm and extends it to generate gaze-corrected views acceptable for use in videoconferencing.
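The forward and backward steps described above can be sketched for a single pair of scanlines as follows. This is a simplified illustration in the spirit of the Cox algorithm, not its exact formulation: the matching cost is assumed to be the absolute intensity difference of single pixels, the occlusion cost is a user-chosen constant, and the matrix is indexed as cost[left][right], which is the transpose of the layout described in the text.

```python
def dp_scanline(left, right, occlusion_cost):
    """Match two scanlines with dynamic programming.
    Returns (matches, total_cost), where matches is a list of
    (left_index, right_index) pairs on the minimum-cost path."""
    n, m = len(left), len(right)
    # cost[i][j]: minimum cost of explaining the first i left pixels
    # and the first j right pixels.
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * occlusion_cost
        move[i][0] = "skip_left"
    for j in range(1, m + 1):
        cost[0][j] = j * occlusion_cost
        move[0][j] = "skip_right"
    # Forward step: fill the cumulative cost matrix and record each move.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = (
                (cost[i - 1][j - 1] + abs(left[i - 1] - right[j - 1]), "match"),
                (cost[i - 1][j] + occlusion_cost, "skip_left"),
                (cost[i][j - 1] + occlusion_cost, "skip_right"),
            )
            cost[i][j], move[i][j] = min(options, key=lambda o: o[0])
    # Backward step: follow the recorded moves from the far corner.
    matches = []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "match":
            matches.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "skip_left":
            i -= 1  # left pixel has no partner (occluded in the right view)
        else:
            j -= 1  # right pixel has no partner (occluded in the left view)
    matches.reverse()
    return matches, cost[n][m]
```

For a matched pair (l, r) the disparity is l - r. Because every scanline is solved independently, neighbouring rows can choose different paths, which is the origin of the horizontal streaks mentioned in the text.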
The key areas addressed in the Criminisi algorithm are the matching function and the moves available in the dynamic programming model.
The matching function used in the Criminisi algorithm is a cross-correlation function over a window of pixels. Using a window of pixels leads to fewer false matches between pixels. Tall windows are used (e.g. 3x7), as these enforce more consistency between scanlines.
After the matching costs are collected, they are smoothed with their neighboring costs using a Gaussian filter. The stronger matching function, along with the cost smoothing, gives more solid regions in the generated minimum-cost surface. This leads to a more solid output image and reduces the horizontal streaks.
In addition to the matching function, the generation of the disparity map is performed differently. The algorithm extends the idea of a cumulative cost matrix for each scanline by adding a pair of complementary cumulative cost matrices per scanline. These extra matrices bias the solution towards runs of occlusions; there is one for each input frame.
As there are now three planes, there are many more moves available. The three moves from the original algorithm, moving along the centre (matched) plane, are present. In addition, these moves are available with a transition to the matched plane from an occluded plane. There are also moves to the occluded planes with occluded moves along them.
The forward step involves creating cumulative matrices of the minimum matching costs needed to reach a given position. For every scanline, three cumulative cost matrices are created, one for each plane. The matched plane is biased towards matched moves, whilst the occlusion planes are biased towards occlusions in their respective directions.
The cumulative cost matrices for each position are given in the following equations, in which constants assign costs to moves that are not explicitly on the matched plane.
By starting from the left-most pixel pair and working towards the right-most pixel pair, the cumulative cost matrices are filled in. Along with the calculations, the move taken to reach each position is recorded, i.e. the position and plane of the previous value that gives the minimum of the selected equations for the current position.
The backward processing step starts from the right-most pixel pair in the cumulative cost matrix on the matched plane and backtracks through the recorded moves. This produces the minimum-cost path from one edge of the image to the other.
In this work we implemented a dynamic programming algorithm based on the solution proposed by Criminisi et al. The three-plane model is adopted, with modifications to the cost function.
For the cost function we chose the sum of absolute differences (SAD) metric, since it is much simpler to implement and requires fewer computations than the normalized cross-correlation metric proposed by Criminisi et al. SAD values are computed according to equation 4.6 and stored in a WxW matrix for each scanline pair. Color information is used instead of intensity because it improves the signal-to-noise ratio by 20-25%.
For each pair of pixels at positions l and r in the left and right scanlines respectively, the sum of absolute differences over a 3x3 window is normalized and used as the dissimilarity measure in the three-plane model.
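A windowed color SAD of this kind might look as follows. Since equation 4.6 is not reproduced here, the normalization is an assumption: the sum is divided by the largest possible value (window size times number of channels times 255), so that the cost lies between 0 and 1. Clamping coordinates at the image border is also an assumption of this sketch.

```python
def sad_cost(left_img, right_img, row, l, r, half=1):
    """Normalized SAD between the window centred at (row, l) in the left image
    and the window centred at (row, r) in the right image.
    Images are lists of rows; each pixel is a tuple of channel values 0..255.
    half=1 gives a 3x3 window. Both images are assumed to have the same size."""
    height, width = len(left_img), len(left_img[0])
    total = 0
    count = 0
    for dy in range(-half, half + 1):
        y = min(max(row + dy, 0), height - 1)      # clamp at top/bottom
        for dx in range(-half, half + 1):
            xl = min(max(l + dx, 0), width - 1)    # clamp at left/right
            xr = min(max(r + dx, 0), width - 1)
            pl, pr = left_img[y][xl], right_img[y][xr]
            total += sum(abs(a - b) for a, b in zip(pl, pr))
            count += len(pl)
    return total / (255.0 * count)
```

Evaluating this function for every pair (l, r) on a scanline fills the WxW matching-cost matrix that the dynamic programming step consumes.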
Since the proposed algorithm processes stereo images line by line and the SAD window is a rectangle, image information in the vertical direction is not taken into account. Thus, the inter-scanline consistency constraint is not satisfied. To address this problem, we apply a simple 11x11 median filter to obtain a smoother disparity map.
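A plain median filter over the disparity map can be sketched as below. The window size is a parameter (11 in the thesis); the handling of borders, where the window is simply truncated to the image, is an assumption of this sketch.

```python
def median_filter(disparity, size=11):
    """Return a median-filtered copy of a 2D disparity map (list of rows).
    Near the border the window is truncated to the part inside the image."""
    half = size // 2
    height, width = len(disparity), len(disparity[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [disparity[yy][xx]
                      for yy in range(max(0, y - half), min(height, y + half + 1))
                      for xx in range(max(0, x - half), min(width, x + half + 1))]
            window.sort()
            out[y][x] = window[len(window) // 2]
    return out
```

Because the median is taken over several rows at once, isolated outliers and short streaks that differ from the rows above and below are replaced by the locally dominant disparity.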
View synthesis is the procedure of generating novel views of an observed scene, using the fact that the same scene is known from two or more different views. Although a variety of methods for efficient view synthesis have been introduced in the literature, we limit our work to disparity-based interpolation techniques.
Chen et al. presented one of the first papers to introduce the idea of view interpolation based on morphing two input views. The intermediate view is obtained by shifting the source image pixels according to linearly interpolated disparity values. Since this work, the concept of using different interpolation methods to obtain a novel view has been widely used.
Therefore, in Sections 5.1 and 5.2 we present two different methods for novel view synthesis, proposed by Chaohui et al. and Jong et al.
In the first paper, a dynamic programming algorithm is applied to estimate disparity values, and uniqueness and smoothness constraints are employed to find reliable correspondences. The view synthesis method is then presented as follows.
If the position of the virtual camera, an original image pixel and its disparity are known, the projection coordinates of the pixel in the intermediate view can be computed.
In the second paper, view synthesis from stereo video is presented. A two-stage disparity estimation algorithm is proposed to obtain an accurate disparity map. In the first stage, graph cut energy minimization is applied in order to get an accurate disparity estimate. Then, in the second stage, a temporal constraint is introduced to reduce the number of pixels used in disparity estimation.
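The projection into the intermediate view can be illustrated with a common disparity-based formulation. The exact equation of the cited paper is not reproduced here, so the following is an assumption: a left pixel at column x with disparity d matches the right pixel at x - d, and a virtual camera at fraction alpha of the way from the left to the right camera sees it at x - alpha * d. Function and parameter names are illustrative.

```python
import math


def warp_scanline(pixels, disparity, shift):
    """Forward-warp one scanline: the pixel at column x moves to
    x + shift * disparity[x]. Unfilled columns are returned as None.
    If two pixels land on the same column, the one with the larger
    disparity (closer to the cameras) is kept."""
    width = len(pixels)
    out = [None] * width
    winner = [-1.0] * width
    for x in range(width):
        d = disparity[x]
        if d is None:          # no disparity estimate for this pixel
            continue
        target = math.floor(x + shift * d + 0.5)   # round to nearest column
        if 0 <= target < width and d > winner[target]:
            out[target] = pixels[x]
            winner[target] = d
    return out


def synthesize_scanline(left, d_left, right, d_right, alpha):
    """Intermediate scanline at position alpha (0 = left view, 1 = right view).
    Holes left by the left view are filled from the warped right view."""
    from_left = warp_scanline(left, d_left, -alpha)
    from_right = warp_scanline(right, d_right, 1.0 - alpha)
    return [a if a is not None else b for a, b in zip(from_left, from_right)]
```

For gaze correction the virtual camera sits midway between the two cameras, which corresponds to alpha = 0.5. Columns that remain None after merging both views would still need to be filled, for example from neighbouring pixels.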
After obtaining the calibration results, the images are rectified using the OpenCV function cvStereoRectify. This function applies Bouguet's rectification method, which computes the rotation matrices for each camera. As a result, the images are aligned and the epipolar lines are parallel to the baseline that connects the camera centres. In order to verify this, we mark a few points manually and determine the correspondence between them using the epipolar constraint. The results are compared before and after applying the rectification procedure. Source code for stereo calibration and rectification is given in [opencv].
Once we have rectified images, we are able to determine the correspondence between pixels and to create a disparity map. The wide baseline and high resolution of our input images result in displacements of 0-280 pixels.
With this in mind, and given that untextured regions and differing illumination are typical for videoconferencing systems, we can rule out local stereo matching methods.
As presented in Chapter 4, the dynamic programming method proposed by Criminisi et al. turned out to be the best solution for the defined system setup. As mentioned before, we applied a different cost function, the sum of absolute differences. Spatial consistency of disparities is achieved with median filtering.
In order to reconstruct the intermediate view, the winner-takes-all approach proposed in [yong] is implemented.
Since the background information is not explicitly needed in videoconferencing systems, two realizations are presented: with and without background.
The result is presented in two steps to make its explanation easier. The first result represents the intermediate view interpolated from the left image. Black pixels around the face area correspond to regions occluded in the left view. Therefore, they are filled with matching pixels from the right image.
White pixels in the hair area represent mismatches; they are filled with neighbouring pixels.
The main goal of this thesis is achieved: we establish eye contact from two reference views. Next, we introduce the solution after background subtraction.
Most background subtraction methods require a long frame sequence to isolate the foreground object. The main goal of this implementation is to work in real time. Therefore, the delay introduced by common background subtraction is undesirable. In order to subtract the background without introducing additional cost, we exploit the segmentation given by the disparity map.
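One simple way to obtain such a segmentation is to threshold the disparity map, since the person is closer to the cameras than the background and therefore has larger disparities. The threshold value and the fill color in this sketch are assumptions; the text does not state how the segmentation is derived from the disparity map.

```python
def foreground_mask(disparity, threshold):
    """True where the disparity is known and at least `threshold` pixels."""
    return [[d is not None and d >= threshold for d in row] for row in disparity]


def remove_background(image, disparity, threshold, fill=(0, 0, 0)):
    """Replace every background pixel of `image` with the `fill` color."""
    mask = foreground_mask(disparity, threshold)
    return [[pixel if keep else fill for pixel, keep in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]
```

This costs one comparison per pixel and needs no frame history, so it adds no delay to the pipeline.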
In order to evaluate the established eye contact, we compute the difference between the ground truth and the intermediate view for the region of interest (the face area).
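Such a comparison could be computed, for example, as the mean absolute difference inside a rectangular region of interest. The choice of metric, the grayscale input and the rectangle format (top, left, bottom, right, with exclusive bottom and right) are assumptions made for this sketch.

```python
def mean_abs_difference(ground_truth, synthesized, roi):
    """Mean absolute intensity difference between two grayscale images
    (lists of rows) inside roi = (top, left, bottom, right)."""
    top, left, bottom, right = roi
    total = 0
    count = 0
    for y in range(top, bottom):
        for x in range(left, right):
            total += abs(ground_truth[y][x] - synthesized[y][x])
            count += 1
    if count == 0:
        raise ValueError("the region of interest is empty")
    return total / count
```

A value of 0 means the synthesized view is identical to the image from the third (ground truth) camera within the face area; larger values indicate larger deviations.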
This thesis introduces a new method that produces eye-to-eye contact for video communication. First, we presented the motivation and a summary of related work to define the problem in more detail. Then we faced the question of how to set up proper conditions for videoconferencing. This was not an easy task, and it was only partly solved after several attempts. The key problem in setting up a videoconferencing system with two cameras placed on the left and right sides of the screen was to achieve diffuse lighting and images without shadows.
The second problem was the wide-baseline stereo image pair, which introduced significant displacement of corresponding pixels. Finally, the chosen solution should run in real time. After several initial trials of different methods implemented in the "StereoMatcher" code, dynamic programming optimization was selected for efficient disparity map estimation. Chapters 4 and 5 described this method and its implementation in detail. In Chapter 6, results showing successfully synthesized eye-to-eye contact were presented.
The proposed implementation is currently unoptimized and produces one frame every 5.2 seconds for 640x480 images. Adding a smoothness term to the cost function and synthesizing the intermediate view in parallel with disparity estimation would improve performance. Finally, including a temporal consistency constraint in stereo sequence generation could enhance the proposed method.