By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying it for temporal segmentation at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments demonstrate the effectiveness of our new approaches.

Face detection has achieved significant progress in recent years. However, high-performance face detection remains a very challenging problem, especially when there are many tiny faces. In this paper, we present a single-shot refinement face detector, namely RefineFace, to achieve high performance. Specifically, it consists of five modules: Selective Two-step Regression (STR), Selective Two-step Classification (STC), Scale-aware Margin Loss (SML), Feature Supervision Module (FSM) and Receptive Field Enhancement (RFE). To enhance the regression ability for high location accuracy, STR coarsely adjusts the locations and sizes of anchors from high-level detection layers to provide better initialization for the subsequent regressor. To improve the classification ability for high recall efficiency, STC first filters out most simple negatives from low-level detection layers to reduce the search space for the subsequent classifier; then SML is applied to better distinguish faces from background at various scales, and FSM is introduced to let the backbone learn more discriminative features for classification. In addition, RFE is presented to provide more diverse receptive fields to better capture faces in some extreme poses. Extensive experiments conducted on WIDER FACE, AFW, PASCAL Face, FDDB and MAFA demonstrate that our method achieves state-of-the-art results and runs at 37.3 FPS with ResNet-18 for VGA-resolution images.

Omni-directional images are becoming more prevalent for understanding the scene in all directions around a camera, as they provide a much wider field of view (FoV) than conventional images.
In this work, we present a novel approach to representing omni-directional images and suggest how to apply CNNs to the proposed image representation. The proposed representation uses a spherical polyhedron to reduce the distortion inevitably introduced when sampling pixels on a non-Euclidean spherical surface around the camera center. To apply the convolution operation to our representation, we stack the neighboring pixels on top of each pixel and multiply them with trainable parameters. This approach enables us to apply the same CNN architectures used for conventional Euclidean 2D images to our proposed method in a straightforward manner. Compared to the previous work, we additionally compare different kernel designs that can be applied to our proposed method. We also show that our method outperforms other state-of-the-art representation methods for omni-directional images on the monocular depth estimation task. In addition, we propose a novel method to fit bounding ellipses of arbitrary orientation using object detection networks and apply it to an omni-directional real-world human detection dataset.

Current NRSfM algorithms are limited from two perspectives: (i) the number of images, and (ii) the type of shape variability they can handle. In this paper we propose a novel hierarchical sparse coding model for NRSfM which can overcome (i) and (ii) to such an extent that NRSfM can be applied to problems in vision previously thought too ill-posed. Our approach is realized in practice as the training of an unsupervised deep neural network (DNN) auto-encoder with a unique architecture that is able to disentangle pose from 3D structure. Using modern deep learning computational platforms allows us to solve NRSfM problems at an unprecedented scale and shape complexity. Our approach has no 3D supervision, relying solely on 2D point correspondences.
Further, our approach can also handle missing or occluded 2D points without the need for matrix completion. Extensive experiments demonstrate the performance of our approach, where we exhibit superior precision and robustness against all available state-of-the-art works, in some instances by an order of magnitude. We further propose a new quality measure (based on the network weights) which circumvents the need for 3D ground truth to ascertain the confidence we have in the reconstructability.

The ability of camera arrays to efficiently capture a higher space-bandwidth product than single cameras has led to various multiscale and hybrid systems. These systems play vital roles in computational photography, including light field imaging, 360 VR cameras, gigapixel videography, etc. One of the critical tasks in multiscale hybrid imaging is matching and fusing cross-resolution images from different cameras under perspective parallax. In this paper, we investigate the reference-based super-resolution (RefSR) problem associated with dual-camera or multi-camera systems, with a significant resolution gap (8x) and large parallax (10% pixel displacement). We present CrossNet++, an end-to-end network containing novel two-stage cross-scale warping modules. Stage I learns to narrow down the parallax distinctively with the strong guidance of landmarks and intensity distribution consensus. Stage II then performs more fine-grained alignment and aggregation in the feature domain to synthesize the final super-resolved image. To further address the large parallax, new hybrid loss functions comprising warping loss, landmark loss and super-resolution loss are proposed to regularize training and enable better convergence. CrossNet++ significantly outperforms the state of the art on light field datasets as well as real dual-camera data.
We further demonstrate the generalization of our framework by transferring it to video super-resolution and video denoising.

Multi-view stereopsis (MVS) tries to recover the 3D model from 2D images. As the observations become sparser, the significant 3D information loss makes the MVS problem more challenging. Instead of focusing only on densely sampled conditions, we investigate sparse MVS with large baseline angles, since sparser sampling is always more favorable in practice. By investigating various observation sparsities, we show that the classical depth-fusion pipeline becomes powerless in the case of larger baseline angles, which worsen the photo-consistency check. As another line of solution, we present SurfaceNet+, a volumetric method to handle the 'incompleteness' and 'inaccuracy' problems induced by a very sparse MVS setup. Specifically, the former problem is handled by a novel volume-wise view selection approach, which excels at selecting valid views while discarding invalid occluded views by considering the geometric prior. The latter problem is handled via a multi-scale strategy that refines the recovered geometry around regions with repeating patterns.
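
The idea of discarding occluded views for a given sub-volume can be illustrated with a minimal sketch. This is not the SurfaceNet+ algorithm, whose details are not given here. It is a toy line-of-sight test under stated assumptions: the geometric prior is taken to be a coarse set of occupied 3D points, and the function name `select_views` and the parameter `occlusion_radius` are hypothetical.

```python
import numpy as np


def select_views(volume_center, camera_centers, occupied_points, occlusion_radius=0.1):
    """Return indices of cameras with an unobstructed line of sight to a sub-volume.

    Assumption (not from the source): the geometric prior is a coarse point set,
    and a view counts as occluded if any such point lies within
    `occlusion_radius` of the camera-to-volume ray, between the two endpoints.
    """
    volume_center = np.asarray(volume_center, dtype=float)
    occupied = np.asarray(occupied_points, dtype=float).reshape(-1, 3)
    valid = []
    for idx, cam in enumerate(np.asarray(camera_centers, dtype=float)):
        ray = volume_center - cam
        length = np.linalg.norm(ray)
        if length == 0.0:
            continue  # camera sits inside the sub-volume center; skip it
        direction = ray / length
        rel = occupied - cam
        # Distance of each occupied point along the ray, measured from the camera.
        t = rel @ direction
        # Only points strictly between the camera and the sub-volume can occlude it.
        between = (t > 0.0) & (t < length - occlusion_radius)
        # Perpendicular distance from each occupied point to the ray.
        perp = np.linalg.norm(rel - np.outer(t, direction), axis=1)
        if not np.any(between & (perp < occlusion_radius)):
            valid.append(idx)
    return valid


if __name__ == "__main__":
    cameras = [(0.0, 0.0, 5.0), (5.0, 0.0, 0.0)]
    prior = [(2.5, 0.0, 0.0)]  # a coarse surface point blocking the second camera
    print(select_views((0.0, 0.0, 0.0), cameras, prior))
```

In this toy setup the second camera is rejected because the prior point lies on its ray to the sub-volume, while the first camera is kept. A learned or probabilistic prior would replace the hard distance threshold in a real system.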