The frame-to-global-model approach is widely used for accurate 3D modeling from sequences of RGB-D images. Because no camera tracking system is perfect, small errors generated when registering and integrating successive RGB-D images accumulate and deform the 3D model being built; the deformations become significant when the scene to be modeled is large. To tackle this problem, we propose a two-stage strategy for building a detailed large-scale 3D model with minimal deformations. The first stage creates accurate small-scale 3D scenes in real time from short subsequences of RGB-D images, while the second stage re-organizes all the results of the first stage in a geometrically consistent manner to reduce deformations as much as possible. By employing planar patches as the 3D scene representation, our proposed method runs in real time and builds accurate 3D models with minimal deformations even for large-scale scenes.
Updating a global 3D model with live RGB-D measurements has proven successful for 3D reconstruction of indoor scenes. Recently, a Truncated Signed Distance Function (TSDF) volumetric model combined with a fusion algorithm (KinectFusion) has shown significant advantages in computational speed and in the accuracy of the reconstructed scene. This algorithm, however, is memory-intensive when constructing and updating the global model, and consequently it does not scale well to large scenes. We propose a new flexible 3D scene representation based on a set of planes that is cheap in memory use and nevertheless achieves accurate reconstruction of indoor scenes from RGB-D image sequences. Projecting the scene onto different planes significantly reduces the size of the scene representation, which allows us to generate a global textured 3D model with lower memory requirements while preserving accuracy and remaining easy to update with live RGB-D measurements. Experimental results demonstrate that our flexible 3D scene representation achieves accurate reconstruction while scaling to large indoor scenes.
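For context, a minimal sketch of the standard TSDF weighted-average update referenced above (the KinectFusion component, not our plane-based representation); the array layout, truncation distance, and weight cap are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def update_tsdf(tsdf, weight, voxel_centers, depth, K, pose, trunc=0.05, max_weight=50.0):
    """Minimal sketch of a KinectFusion-style TSDF update for one depth frame.

    tsdf, weight  : flat arrays, one entry per voxel (running-average state)
    voxel_centers : (N, 3) voxel centers in world coordinates
    depth         : (H, W) depth image in meters
    K, pose       : camera intrinsics (3x3) and world-to-camera pose (4x4)
    """
    # Transform voxel centers into the camera frame.
    pts_h = np.c_[voxel_centers, np.ones(len(voxel_centers))]
    cam = (pose @ pts_h.T).T[:, :3]
    z = cam[:, 2]
    z_safe = np.where(np.abs(z) > 1e-9, z, 1e-9)

    # Project into the image and read the measured depth.
    uv = (K @ (cam / z_safe[:, None]).T).T
    u, v = np.round(uv[:, 0]).astype(int), np.round(uv[:, 1]).astype(int)
    H, W = depth.shape
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    d_meas = np.where(valid, depth[np.clip(v, 0, H - 1), np.clip(u, 0, W - 1)], 0.0)
    sdf = d_meas - z                       # signed distance along the viewing ray
    valid &= (d_meas > 0) & (sdf > -trunc)
    tsdf_new = np.clip(sdf / trunc, -1.0, 1.0)

    # Weighted running average, as in the standard TSDF fusion scheme.
    w_new = 1.0
    tsdf[valid] = (tsdf[valid] * weight[valid] + tsdf_new[valid] * w_new) / (weight[valid] + w_new)
    weight[valid] = np.minimum(weight[valid] + w_new, max_weight)
    return tsdf, weight
```

The per-voxel state kept by this scheme is what makes the volumetric model expensive in memory, which is the scalability issue the plane-based representation addresses.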
We present a robust and accurate 3D registration method for a dense sequence of depth images taken from unknown viewpoints. Our method simultaneously estimates multiple extrinsic parameters of the depth images to obtain a registered full 3D model of the scanned scene. By arranging the depth measurements in a matrix form, we formulate the problem as the simultaneous estimation of multiple extrinsics, a low-rank matrix that corresponds to the aligned depth images, and a sparse error matrix. Unlike previous sequential or heuristic global registration approaches, our method uses an advanced convex optimization technique to obtain a robust solution via rank minimization. To achieve accurate computation, we develop a depth projection method with minimal sensitivity to sampling, which reads projected depth values from the input depth images. We demonstrate the effectiveness of the proposed method through extensive experiments and compare it with previous standard techniques.
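As an illustration of the low-rank-plus-sparse structure used above, the following sketch decomposes a measurement matrix into a low-rank part and a sparse error with an inexact augmented Lagrangian scheme; it omits the joint estimation of the extrinsics, and the parameter defaults are common textbook choices, not the paper's settings:

```python
import numpy as np

def shrink(X, tau):
    """Soft-thresholding (proximal operator of the L1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def robust_pca(D, lam=None, mu=None, n_iter=200):
    """Illustrative robust PCA: split D into a low-rank part A and a sparse
    error E by minimizing ||A||_* + lam * ||E||_1 subject to D = A + E."""
    m, n = D.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    mu = mu or 0.25 * m * n / np.abs(D).sum()
    Y = np.zeros_like(D)   # Lagrange multipliers
    E = np.zeros_like(D)
    for _ in range(n_iter):
        # Singular value thresholding gives the low-rank update.
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = U @ np.diag(shrink(s, 1.0 / mu)) @ Vt
        # Soft-thresholding gives the sparse error update.
        E = shrink(D - A + Y / mu, lam / mu)
        Y = Y + mu * (D - A - E)
    return A, E
```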
This paper presents an illumination-free photometric metric for evaluating the goodness of a rigid transformation aligning two overlapping range images, under the assumption of Lambertian surfaces. Our metric is based on the photometric re-projection error rather than on feature detection and matching. We synthesize the color of one image using the albedo of the other image to compute the photometric re-projection error. The unknown illumination and albedo are estimated from the correspondences induced by the input transformation using the spherical harmonics representation of image formation. This allows us to derive an illumination-free photometric metric for range image alignment. We use a hypothesize-and-test method to search for the transformation that minimizes our illumination-free photometric function. Transformation candidates are efficiently generated by employing the spherical representation of each image. Experimental results using synthetic and real data show the usefulness of the proposed metric.
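A hedged sketch of how an illumination-free comparison of corresponding samples can be computed with a spherical harmonics lighting model; the first-order (4-term) basis, the variable names, and the per-sample least-squares albedo step are simplifying assumptions, not the paper's exact formulation:

```python
import numpy as np

def sh_basis(normals):
    """First-order (4-term) real spherical harmonics basis at unit normals.
    A 9-term (second-order) basis is common for Lambertian shading; 4 terms
    keep the sketch short."""
    nx, ny, nz = normals.T
    return np.stack([np.ones_like(nx), nx, ny, nz], axis=1)

def photometric_error(intensity_a, normals_a, intensity_b, normals_b):
    """Illustrative illumination-free error over corresponding samples:
    fit SH lighting to each image, estimate albedo, then re-synthesize
    image A's intensity from B's albedo under A's lighting and compare."""
    Ba, Bb = sh_basis(normals_a), sh_basis(normals_b)
    la, *_ = np.linalg.lstsq(Ba, intensity_a, rcond=None)   # lighting of image A
    lb, *_ = np.linalg.lstsq(Bb, intensity_b, rcond=None)   # lighting of image B
    # Albedo estimates: observed intensity divided by predicted shading.
    rho_a = intensity_a / np.maximum(Ba @ la, 1e-6)
    rho_b = intensity_b / np.maximum(Bb @ lb, 1e-6)
    resynth_a = rho_b * (Ba @ la)
    return np.mean((intensity_a - resynth_a) ** 2)
```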
Human visual attention can be modulated not only by visual stimuli but also by stimuli from other modalities such as audition. Incorporating auditory information into a human visual attention model is therefore a key issue in building more sophisticated models. However, how to integrate information from the auditory and visual domains remains a challenging problem. This paper proposes a novel computational model of human visual attention driven by auditory cues. Founded on the Bayesian surprise model, which is considered promising in the literature, our model uses surprising auditory events as a clue for selecting synchronized visual features and then emphasizes the selected features to form the final surprise map. Our approach to audio-visual integration uses only the effective visual features, rather than all available features, to simulate visual attention with the help of auditory information.
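For illustration, a minimal sketch of Bayesian surprise for a Gaussian feature model and of an audio-driven selection of visual feature maps; the gating rule, threshold, and selection criterion are placeholders for the paper's mechanism:

```python
import numpy as np

def gaussian_surprise(prior_mu, prior_var, x, obs_var=1.0):
    """Bayesian surprise for a Gaussian feature model: KL(posterior || prior)
    after observing x. Returns the surprise and the updated (mu, var)."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mu = post_var * (prior_mu / prior_var + x / obs_var)
    kl = 0.5 * (np.log(prior_var / post_var)
                + (post_var + (post_mu - prior_mu) ** 2) / prior_var - 1.0)
    return kl, post_mu, post_var

def audio_gated_surprise(visual_feature_maps, audio_surprise, threshold=1.0):
    """Illustrative gating: when the auditory surprise of a frame exceeds a
    threshold, keep only the visual feature map with the largest mean
    magnitude (a crude stand-in for the feature most synchronized with the
    audio event); otherwise average all maps."""
    if audio_surprise < threshold:
        return sum(visual_feature_maps) / len(visual_feature_maps)
    magnitudes = [np.abs(m).mean() for m in visual_feature_maps]
    return visual_feature_maps[int(np.argmax(magnitudes))]
```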
The saliency map has been proposed to identify regions that draw human visual attention. Feature differences from the surroundings are computed hierarchically at multiple resolutions for an image or an image sequence, and they are fused in a fully bottom-up manner to obtain a saliency map. A video usually contains sounds, and not only visual stimuli but also auditory stimuli attract human attention. Nevertheless, most conventional methods discard auditory information and use image information alone to compute a saliency map. This paper presents a method for constructing a visual saliency map by integrating image features with auditory features. We assume a single moving sound source in a video and introduce a sound source feature. Our method detects the sound source feature using the correlation between audio signals and sound source motion, and computes its importance in each frame of the video using an auditory saliency map. The importance is used to fuse the sound source feature with image features to construct a visual saliency map. Experiments with human subjects demonstrate that the saliency map produced by our method reflects human visual attention more accurately than that of a conventional method.
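A small sketch of the kind of correlation-based fusion described above, assuming the audio energy envelope and the sound source's frame-wise motion magnitude over a short temporal window are already available; the weighting scheme is an illustrative simplification:

```python
import numpy as np

def fuse_saliency(image_feature_maps, sound_source_map, audio_energy, source_motion):
    """Illustrative fusion: weight the sound-source feature map by the
    correlation between the audio energy envelope and the sound source's
    motion magnitude, then combine it with the bottom-up image feature maps.

    image_feature_maps : list of (H, W) bottom-up feature maps
    sound_source_map   : (H, W) map highlighting the detected sound source
    audio_energy       : (T,) audio energy over a temporal window
    source_motion      : (T,) motion magnitude of the source over the window
    """
    # Normalized correlation over the window acts as the per-frame importance.
    a = (audio_energy - audio_energy.mean()) / (audio_energy.std() + 1e-9)
    m = (source_motion - source_motion.mean()) / (source_motion.std() + 1e-9)
    importance = float(np.clip(np.mean(a * m), 0.0, 1.0))

    visual = sum(image_feature_maps) / len(image_feature_maps)
    return (1.0 - importance) * visual + importance * sound_source_map
```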
The most important part of an information system that assists human activities is a natural interface with human beings. Gaze information strongly reflects a person's interest and attention, and thus a gaze-based interface is promising for future use. In particular, if we can smoothly guide the user's visual attention toward a target without interrupting their current visual attention, the usefulness of a gaze-based interface will be greatly enhanced. To realize such an interface, this paper proposes a method that, given a region in an image, edits the image so that the region becomes the most salient. Our method first computes a saliency map of the given image and then iteratively adjusts the intensity and color until the saliency inside the region becomes the highest in the entire image. Experimental results confirm that our image editing method naturally draws human visual attention toward the specified region.
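A minimal sketch of the iterative adjustment loop, assuming a generic saliency_fn is available; the simple intensity gain used here stands in for the paper's intensity and color adjustments:

```python
import numpy as np

def edit_until_salient(image, region_mask, saliency_fn, step=0.02, max_iter=50):
    """Illustrative editing loop: nudge the intensity inside region_mask until
    the region's saliency dominates the rest of the image.

    saliency_fn : any function returning a saliency map for an image
    region_mask : boolean mask of the region to emphasize
    """
    img = image.astype(np.float64).copy()
    for _ in range(max_iter):
        sal = saliency_fn(img)
        inside_mean = sal[region_mask].mean()
        outside_mean = sal[~region_mask].mean()
        if inside_mean > outside_mean and sal[region_mask].max() >= sal.max():
            break  # the region is already the most salient part of the image
        # Increase the intensity gain inside the region and re-evaluate.
        img[region_mask] = np.clip(img[region_mask] * (1.0 + step), 0.0, 255.0)
    return img
```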
We propose a method for predicting human egocentric visual attention using bottom-up visual saliency and egomotion information. Computational models of visual saliency are often employed to predict human attention; however, their mechanism and effectiveness have not been fully explored in egocentric vision.
The purpose of our framework is to compute attention maps from an egocentric video that can be used to infer a person's visual attention. In addition to a standard visual saliency model, two kinds of attention maps are computed based on a camera's rotation velocity and direction of movement. These rotation-based and translation-based attention maps are aggregated with a bottom-up saliency map to enhance the accuracy with which the person's gaze positions can be predicted.
The effectiveness of the proposed framework was examined in real environments using a head-mounted gaze tracker, and we found that the egomotion-based attention maps contribute to accurately predicting human visual attention.
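For illustration, a sketch of how the three maps described above could be aggregated, assuming the rotation-induced image shift and the focus of expansion have already been estimated from the egomotion; the Gaussian placement and the fixed weights are assumptions for the sketch:

```python
import numpy as np

def gaussian_map(h, w, cx, cy, sigma):
    """Unnormalized 2D Gaussian centered at (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def egocentric_attention(saliency, rot_shift, foe, sigma=40.0, weights=(0.4, 0.3, 0.3)):
    """Illustrative aggregation of bottom-up saliency with egomotion maps.

    saliency  : (H, W) bottom-up saliency map
    rot_shift : (dx, dy) image shift induced by head rotation (pixels/frame)
    foe       : (x, y) focus of expansion estimated from forward translation
    """
    h, w = saliency.shape
    cx, cy = w / 2.0, h / 2.0
    # Rotation-based map: attention shifted along the rotation-induced motion.
    rot_map = gaussian_map(h, w, cx + rot_shift[0], cy + rot_shift[1], sigma)
    # Translation-based map: attention concentrated near the focus of expansion.
    trans_map = gaussian_map(h, w, foe[0], foe[1], sigma)
    maps = [saliency / (saliency.max() + 1e-9), rot_map, trans_map]
    return sum(w_i * m for w_i, m in zip(weights, maps))
```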
The texton is a representative dense visual word, and it has proven effective in categorizing materials as well as generic object classes. Despite its success and popularity, no prior work has tackled the problem of optimizing its scale for given image data and the associated object categories. We propose scale-optimized textons to learn the best scale for each object in a scene, and incorporate them into image categorization and segmentation. Our textonization process produces a scale-optimized codebook of visual words. We approach the scale-optimization problem of textons using the scene-context scale in each image, which is the effective scale of local context for classifying an image pixel in a scene. We perform the textonization process using the randomized decision forest, a powerful tool with high computational efficiency in vision applications. Our experiments using the MSRC and VOC 2007 segmentation datasets show that our scale-optimized textons improve the performance of image categorization and segmentation.
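A hedged sketch of textonization with a randomized decision forest, where each reached leaf acts as a visual word; scikit-learn is used for brevity, and the scale-optimization step driven by the scene-context scale is omitted:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_texton_forest(filter_responses, labels, n_trees=8, max_depth=10):
    """Train a randomized decision forest on per-pixel filter-bank responses.

    filter_responses : (N, D) per-pixel responses at a chosen scale
    labels           : (N,) object-class labels supervising the split tests
    """
    forest = RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth)
    forest.fit(filter_responses, labels)
    return forest

def textonize(forest, filter_responses):
    """Map pixels to textons: the tuple of leaf indices reached in each tree."""
    leaves = forest.apply(filter_responses)   # shape (N, n_trees)
    return leaves
```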
Preserving connectivity is an important property commonly required for object discretization. The connectivity of a discretized object differs depending on how the original object is discretized. The morphological discretization is known to be capable of controlling the connectivity of a discretized object by selecting appropriate structuring elements. The analytical approximation, on the other hand, which approximates the morphological discretization by a finite number of inequalities, was recently introduced to reduce the computational cost of the morphological discretization. However, whether this approximate discretization preserves the same connectivity as the morphological discretization has yet to be investigated. In this paper, we study the connectivity relationship between the morphological discretization and the analytical approximation, focusing on 2D explicit curves. We show that they guarantee the same connectivity for 2D explicit curves.
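As a concrete illustration for an explicit curve y = f(x), the following sketch selects the grid points whose structuring element (here a square of half-width r) intersects the curve, which is the morphological discretization; the dense sampling used for the intersection test is, roughly, what the analytical approximation replaces with a finite set of inequalities:

```python
import numpy as np

def morphological_digitize(f, x_range, grid, r=0.5, samples=200):
    """Illustrative morphological discretization of an explicit curve y = f(x).

    f       : vectorized function returning y for an array of x values
    x_range : (x_min, x_max) domain of the curve
    grid    : iterable of integer grid points (i, j)
    r       : half-width of the square structuring element
    """
    selected = set()
    for i, j in grid:
        xs = np.linspace(i - r, i + r, samples)
        xs = xs[(xs >= x_range[0]) & (xs <= x_range[1])]
        # Keep (i, j) if the structuring element centered there meets the curve.
        if xs.size and np.any(np.abs(f(xs) - j) <= r):
            selected.add((i, j))
    return selected
```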
A discrete polynomial curve is defined as a set of points lying between two polynomial curves. This paper deals with the problem of fitting a discrete polynomial curve to given integer points in the presence of outliers. We formulate the problem as a discrete optimization problem in which the number of points included in the discrete polynomial curve, i.e., the number of inliers, is maximized. We then propose a method that effectively achieves a solution guaranteed to be locally maximal by using a local search, called rock climbing, with a seed obtained by RANSAC. Experimental results demonstrate the effectiveness of our proposed method.
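A minimal sketch of this fitting pipeline for an explicit discrete polynomial curve (points within a band of half-width w around a polynomial): a RANSAC seed followed by a greedy coefficient perturbation that only accepts inlier-count increases, standing in for the rock-climbing search; the band width, degree, and step sizes are illustrative assumptions:

```python
import numpy as np

def band_inliers(points, coeffs, width=1.0):
    """Boolean mask of points within the band of half-width `width`."""
    residuals = np.abs(points[:, 1] - np.polyval(coeffs, points[:, 0]))
    return residuals <= width

def ransac_seed(points, degree=2, n_iter=500, width=1.0, rng=None):
    """RANSAC over minimal samples of (degree + 1) points to get a seed curve."""
    rng = rng or np.random.default_rng(0)
    best, best_count = None, -1
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), degree + 1, replace=False)]
        coeffs = np.polyfit(sample[:, 0], sample[:, 1], degree)
        count = band_inliers(points, coeffs, width).sum()
        if count > best_count:
            best, best_count = coeffs, count
    return best

def local_search(points, coeffs, width=1.0, step=0.05, n_iter=200):
    """Greedy coefficient perturbation accepting only inlier-count increases."""
    best_count = band_inliers(points, coeffs, width).sum()
    for _ in range(n_iter):
        improved = False
        for i in range(len(coeffs)):
            for delta in (step, -step):
                trial = coeffs.copy()
                trial[i] += delta
                count = band_inliers(points, trial, width).sum()
                if count > best_count:
                    coeffs, best_count, improved = trial, count, True
        if not improved:
            break  # no single-coefficient move increases the inlier count
    return coeffs, best_count
```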
The technology for reconstructing three-dimensional images has made remarkable progress. However, an important issue remains: guaranteeing the accuracy of reconstructed three-dimensional images. A tremendous amount of effort has been made to deal with noise and to show the robustness of developed methods. In such studies, however, digitization errors and observation errors are not discriminated, even though the two kinds of errors are generated by different mechanisms.
We aim to discriminate the two kinds of errors, focusing on pixels/grid points as the smallest units of digital images, in order to clarify the limit on accuracy in 3D reconstruction imposed by digitization errors.
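A small sketch that separates the two error sources for a single projected point under a pinhole model: digitization error comes from snapping the ideal projection to the pixel grid, observation error from sensor noise; the model and noise level are illustrative assumptions:

```python
import numpy as np

def observe_point(p_true, K, sigma_noise=0.5, rng=None):
    """Illustrative separation of digitization and observation errors for one
    3D point projected by a pinhole camera with intrinsics K (3x3)."""
    rng = rng or np.random.default_rng(0)
    proj = K @ p_true
    ideal = proj[:2] / proj[2]              # exact (real-valued) image position
    digitized = np.round(ideal)             # digitization error only
    observed = np.round(ideal + rng.normal(0.0, sigma_noise, 2))  # both errors
    return ideal, digitized, observed
```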
Rotations in the discrete plane are important for many applications such as image matching and synthesizing mosaic images. Unlike rotations in the continuous plane, rotations in the discrete plane by two different angles can give the same result; that is, two different angles give the same result after the rotation of a grid point followed by digitization. In general, there exists a range of rotation angles within which the same result is obtained. We propose a method for efficiently finding the exact lower and upper bounds of this range using only integer computations.
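For illustration, a brute-force sketch (not the exact integer computation proposed above) that samples angles around a given rotation angle and reports the contiguous range yielding the same digitized result for a set of grid points; the sampling step and search span are arbitrary:

```python
import math

def rotate_round(p, theta):
    """Rotate the grid point p by theta and digitize by rounding."""
    x, y = p
    c, s = math.cos(theta), math.sin(theta)
    return (round(x * c - y * s), round(x * s + y * c))

def angle_range_sampling(points, theta0, span=0.05, step=1e-5):
    """Sample angles around theta0 and return the contiguous range that gives
    the same digitized image of every point in `points` as theta0 does."""
    ref = [rotate_round(p, theta0) for p in points]
    lo = theta0
    while lo - step > theta0 - span and [rotate_round(p, lo - step) for p in points] == ref:
        lo -= step
    hi = theta0
    while hi + step < theta0 + span and [rotate_round(p, hi + step) for p in points] == ref:
        hi += step
    return lo, hi
```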