Doddcunningham6984

From Iurium Wiki


Moreover, a novel attentive adversarial network architecture is designed to align the spatiotemporal dynamics of actions across domains with large discrepancies. In addition, a pairwise margin discrimination loss is designed for the pairwise network architecture to improve the discriminability of the learned domain-invariant spatiotemporal features. The results of extensive experiments on three public cross-domain action recognition benchmarks, namely SDAI Action I, SDAI Action II and UCF50-OlympicSport, demonstrate that the proposed PASTN significantly outperforms state-of-the-art cross-domain action recognition methods in terms of both accuracy and computational time. Even when only two labelled training samples per category are available in the office1 scenario of the SDAI Action I dataset, the accuracy of PASTN improves by 6.1%, 10.9%, 16.8% and 14% over the TA3N, TemporalPooling, I3D and P3D methods, respectively.

Hashing-based techniques have provided attractive solutions to cross-modal similarity search over vast quantities of multimedia data. However, existing cross-modal hashing (CMH) methods face two critical limitations: 1) no previous work simultaneously exploits the consistent and modality-specific information of multi-modal data; 2) the discriminative capability of pairwise similarity is usually neglected because of its computational cost and storage overhead. Moreover, to handle the discrete constraints, a relaxation-based strategy is typically adopted that relaxes the discrete problem to a continuous one, which suffers from large quantization errors and leads to sub-optimal solutions. To overcome these limitations, in this article we present a novel supervised CMH method, namely Asymmetric Supervised Consistent and Specific Hashing (ASCSH). Specifically, we explicitly decompose the mapping matrices into consistent and modality-specific ones to fully exploit the intrinsic correlation between different modalities. Meanwhile, a novel discrete asymmetric framework is proposed to fully explore the supervised information, in which pairwise similarity and semantic labels are jointly formulated to guide the hash code learning process. Unlike existing asymmetric methods, the proposed discrete asymmetric structure solves the binary constraint problem discretely and efficiently without any relaxation. To validate the effectiveness of the proposed approach, extensive experiments are conducted on three widely used datasets, and the encouraging results demonstrate the superiority of ASCSH over other state-of-the-art CMH methods.
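The decomposition into consistent and modality-specific projections can be illustrated with a small sketch. The NumPy snippet below is only a simplified illustration, not the actual ASCSH optimization: it assumes image and text features have already been mapped into a common 512-dimensional space, forms binary codes from a shared (consistent) projection plus a modality-specific one, and ranks database items by Hamming distance. All projection matrices here are random placeholders standing in for learned parameters.

import numpy as np

# Minimal sketch of the consistent + modality-specific decomposition idea
# (an illustration only, not the ASCSH discrete solver).
# Assumption: image and text features already live in a common 512-d space.
rng = np.random.default_rng(0)
d, n_bits = 512, 64

P_c   = rng.standard_normal((n_bits, d)) * 0.1   # consistent projection, shared by both modalities
V_img = rng.standard_normal((n_bits, d)) * 0.1   # image-specific projection
V_txt = rng.standard_normal((n_bits, d)) * 0.1   # text-specific projection

def hash_code(x, V):
    """Binary code in {-1, +1}: sign of the (consistent + modality-specific) projection."""
    return np.sign((P_c + V) @ x)

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query code."""
    return np.argsort(np.count_nonzero(db_codes != query_code, axis=1))

# Toy text-to-image retrieval: encode a small image database, query with a text feature.
images   = rng.standard_normal((100, d))
db_codes = np.stack([hash_code(x, V_img) for x in images])    # (100, 64) codes
query    = rng.standard_normal(d)
print(hamming_rank(hash_code(query, V_txt), db_codes)[:10])   # indices of the 10 nearest images

In the actual method the codes are learned discretely under supervision from pairwise similarity and semantic labels; the sketch only shows how the decomposed projections and Hamming ranking fit together at retrieval time.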
Human motion prediction, which aims at predicting future human skeletons given the past ones, is a typical sequence-to-sequence problem. Therefore, extensive efforts have been devoted to exploring different RNN-based encoder-decoder architectures. However, by generating target poses conditioned on the previously generated ones, these models are prone to issues such as error accumulation. In this paper, we argue that this issue is mainly caused by the autoregressive decoding manner. Hence, a novel Non-AuToregressive model (NAT) is proposed with a completely non-autoregressive decoding scheme, as well as a context encoder and a positional encoding module. More specifically, the context encoder embeds the given poses from temporal and spatial perspectives. The frame decoder is responsible for predicting each future pose independently. The positional encoding module injects a positional signal into the model to indicate the temporal order. In addition, a multitask training paradigm is presented for both low-level human skeleton prediction and high-level human action recognition, resulting in considerable improvement on the prediction task. Our approach is evaluated on the Human3.6M and CMU-Mocap benchmarks and outperforms state-of-the-art autoregressive methods.

Facilitated by deep neural networks, numerous tracking methods have made significant advances. Existing deep trackers mainly utilize independent frames to model the target appearance, while paying less attention to its temporal coherence. In this paper, we propose a recurrent memory activation network (RMAN) to exploit the untapped temporal coherence of the target appearance for visual tracking. We build the RMAN on top of a long short-term memory network (LSTM) with an additional memory activation layer. Specifically, we first use the LSTM to model the temporal changes of the target appearance. Then we selectively activate the memory blocks via the activation layer to produce a temporally coherent representation. The recurrent memory activation layer enriches the target representations obtained from independent frames and reduces background interference through temporal consistency. The proposed RMAN is fully differentiable and can be optimized end-to-end. To facilitate network training, we propose a temporal coherence loss together with the original binary classification loss. Extensive experimental results on standard benchmarks demonstrate that our method performs favorably against state-of-the-art approaches.

Cross-modal retrieval aims to identify relevant data across different modalities. In this work, we focus on cross-modal retrieval between images and text sentences, which is formulated as similarity measurement for each image-text pair. To this end, we propose a Cross-modal Relation Guided Network (CRGN) to embed images and text into a latent feature space. The CRGN model uses a GRU to extract text features and a ResNet model to learn globally guided image features. Based on the global feature guiding and sentence generation learning, the relations between image regions can be modeled. The final image embedding is generated by a relation embedding module with an attention mechanism. With the image and text embeddings, cross-modal retrieval is conducted based on cosine similarity. The learned embedding space captures the inherent relevance between images and text well. We evaluate our approach with extensive experiments on two public benchmark datasets, i.e., MS-COCO and Flickr30K. Experimental results demonstrate that our approach achieves better or comparable performance relative to state-of-the-art methods, with notable efficiency.
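Since the final retrieval step described above reduces to ranking image-text pairs by cosine similarity in a learned embedding space, the short sketch below shows that step in isolation. It is a generic illustration, not the CRGN model itself: the image and text embeddings are random placeholders standing in for the outputs of the ResNet and GRU branches, and Recall@K is the usual metric reported on MS-COCO and Flickr30K.

import numpy as np

# Generic cosine-similarity retrieval over paired image/text embeddings.
# The embeddings below are placeholders for the outputs of the two encoder branches.
rng = np.random.default_rng(0)
n_pairs, dim = 1000, 256

img_emb = rng.standard_normal((n_pairs, dim))                   # stand-in for image embeddings
txt_emb = img_emb + 0.5 * rng.standard_normal((n_pairs, dim))   # noisy stand-in for matching text embeddings

def l2_normalize(x):
    """L2-normalize rows so that dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Rows are text queries, columns are candidate images.
sim = l2_normalize(txt_emb) @ l2_normalize(img_emb).T            # (n_pairs, n_pairs)

def recall_at_k(sim_matrix, k):
    """Fraction of queries whose ground-truth item (same index) ranks in the top k."""
    topk = np.argsort(-sim_matrix, axis=1)[:, :k]
    return np.mean([i in topk[i] for i in range(sim_matrix.shape[0])])

for k in (1, 5, 10):
    print(f"text-to-image Recall@{k}: {recall_at_k(sim, k):.3f}")

The normalization step is what makes the dot product equal to cosine similarity; in practice the same similarity matrix is reused for both text-to-image and image-to-text ranking by transposing it.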

Article authors: Doddcunningham6984 (McGrath Persson)