2025-02-20 | Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework | Yuming Yang et al. | 2502.14864v1 | null
2025-02-20 | AVD2: Accident Video Diffusion for Accident Video Description | Cheng Li et al. | 2502.14801v1 | null
2025-02-20 | HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States | Yilei Jiang et al. | 2502.14744v1 | null
2025-02-20 | WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models | Yifu Chen et al. | 2502.14727v1 | null
2025-02-20 | ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors | Yuguo Yin et al. | 2502.14627v1 | null
2025-02-20 | Dynamic Preference-based Multi-modal Trip Planning of Public Transport and Shared Mobility | Yimeng Zhang et al. | 2502.14528v1 | null
2025-02-20 | Integrating Extra Modality Helps Segmentor Find Camouflaged Objects Well | Chengyu Fang et al. | 2502.14471v1 | null
2025-02-20 | Visual and Auditory Aesthetic Preferences Across Cultures | Harin Lee et al. | 2502.14439v1 | null
2025-02-20 | MedFuncta: Modality-Agnostic Representations Based on Efficient Neural Fields | Paul Friedrich et al. | 2502.14401v1 | link
2025-02-20 | SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images | Yichi Zhang et al. | 2502.14351v1 | null
2025-02-20 | Bolide infrasound signal morphology and yield estimates: A case study of two events detected by a dense acoustic sensor network | Trevor C. Wilson et al. | 2502.14232v1 | null
2025-02-20 | SleepGMUformer: A gated multimodal temporal neural network for sleep staging | Chenjun Zhao et al. | 2502.14227v1 | null
2025-02-20 | Bridging Text and Vision: A Multi-View Text-Vision Registration Approach for Cross-Modal Place Recognition | Tianyi Shang et al. | 2502.14195v1 | link
2025-02-20 | A modal logic translation of the AGM axioms for belief revision | Giacomo Bonanno et al. | 2502.14176v1 | null
2025-02-19 | Additive Enrichment from Coderelictions | Jean-Simon Pacaud Lemay et al. | 2502.14134v1 | null
2025-02-19 | Object-centric Binding in Contrastive Language-Image Pretraining | Rim Assouel et al. | 2502.14113v1 | null
2025-02-19 | Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging | Shansong Wang et al. | 2502.14064v1 | null
2025-02-19 | Latent Distribution Decoupling: A Probabilistic Framework for Uncertainty-Aware Multimodal Emotion Recognition | Jingwang Huang et al. | 2502.13954v1 | link
2025-02-19 | A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models | Hao Huang et al. | 2502.13942v1 | null
2025-02-19 | Multi-view Video-Pose Pretraining for Operating Room Surgical Activity Recognition | Idris Hamoud et al. | 2502.13883v1 | null
2025-02-19 | MEX: Memory-efficient Approach to Referring Multi-Object Tracking | Huu-Thien Tran et al. | 2502.13875v1 | null
2025-02-19 | Generative Video Semantic Communication via Multimodal Semantic Fusion with Large Model | Hang Yin et al. | 2502.13838v1 | null
2025-02-19 | Building Age Estimation: A New Multi-Modal Benchmark Dataset and Community Challenge | Nikolaos Dionelis et al. | 2502.13818v1 | null
2025-02-19 | Exploring Embodied Emotional Communication: A Human-oriented Review of Mediated Social Touch | Liwen He et al. | 2502.13816v1 | null
2025-02-19 | From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education | Yi-Fan Zhang et al. | 2502.13789v1 | null
2025-02-19 | GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking | Florian Schneider et al. | 2502.13766v1 | null
2025-02-19 | Cascading CMA-ES Instances for Generating Input-diverse Solution Batches | Maria Laura Santoni et al. | 2502.13730v1 | link
2025-02-19 | Adapting Large Language Models for Time Series Modeling via a Novel Parameter-efficient Adaptation Method | Juyuan Zhang et al. | 2502.13725v1 | null
2025-02-19 | Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields | Taewoo Kim et al. | 2502.13716v1 | link
2025-02-19 | TALKPLAY: Multimodal Music Recommendation with Large Language Models | Seungheon Doh et al. | 2502.13713v2 | null