AI RESEARCH PAPERS & ACADEMIC SOURCES
- In-N-On: Scaling Egocentric Manipulation with in-the-wild and on-task Data : Abstract: Egocentric videos are a valuable and scalable data source to learn manipulation policies. However, due to significant data heterogeneity, most existing approaches utilize human data for simp...
- MF-GCN: A Multi-Frequency Graph Convolutional Network for Tri-Modal Depression Detection Using Eye-Tracking, Facial, and Acoustic Features : Abstract: Eye tracking data quantifies the attentional bias towards negative stimuli that is frequently observed in depressed groups. Audio and video data capture the affective flattening and psychomo...
- Hyperspectral Image Classification using Spectral-Spatial Mixer Network : Abstract: This paper introduces SS-MixNet, a lightweight and effective deep learning model for hyperspectral image (HSI) classification. The architecture integrates 3D convolutional layers for local s...
- First Frame Is the Place to Go for Video Content Customization : Abstract: What role does the first frame play in video generation models? Traditionally, it's viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this ...
- GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization : Abstract: Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models...
- RoMa v2: Harder Better Faster Denser Feature Matching : Abstract: Dense feature matching aims to estimate all correspondences between two images of a 3D scene and has recently been established as the gold-standard due to its high accuracy and robustness. H...
- Application of Graph Based Vision Transformers Architectures for Accurate Temperature Prediction in Fiber Specklegram Sensors : Abstract: Fiber Specklegram Sensors (FSS) are highly effective for environmental monitoring, particularly for detecting temperature variations. However, the nonlinear nature of specklegram data presen...
- Image Denoising Using Transformed L1 (TL1) Regularization via ADMM : Abstract: Total variation (TV) regularization is a classical tool for image denoising, but its convex $\ell_1$ formulation often leads to staircase artifacts and loss of contrast. To address these iss...
- BBox DocVQA: A Large Scale Bounding Box Grounded Dataset for Enhancing Reasoning in Document Visual Question Answer : Abstract: Document Visual Question Answering (DocVQA) is a fundamental task for multimodal document understanding and a key testbed for vision language reasoning. However, most existing DocVQA dataset...
- Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception : Abstract: In embodied AI perception systems, visual perception should be active: the goal is not to passively process static images, but to actively acquire more informative data within pixel and spat...
- Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration : Abstract: Existing multimodal reasoning models and frameworks suffer from fundamental architectural limitations: most lack the human-like ability to autonomously explore diverse reasoning pathways-whe...
- IPR-1: Interactive Physical Reasoner : Abstract: Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from intera...
- A Novel CustNetGC Boosted Model with Spectral Features for Parkinson's Disease Prediction : Abstract: Parkinson's disease is a neurodegenerative disorder that can be very tricky to diagnose and treat. Early symptoms can include tremors, wheezy breathing, and changes in voice quality as ...
- MHR: Momentum Human Rig : Abstract: We present MHR, a parametric human body model that combines the decoupled skeleton/shape paradigm of ATLAS with a flexible, modern rig and pose corrective system inspired by the Momentum lib...
- Multi-source-free Domain Adaptation via Uncertainty-aware Adaptive Distillation : Abstract: Source-free domain adaptation (SFDA) alleviates the domain discrepancy among data obtained from domains without accessing the source data, in the interest of data privacy. However, existing conven...
- UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization : Abstract: Video event localization tasks include temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods tend to over-specialize on...
- MK-SGN: A Spiking Graph Convolutional Network with Multimodal Fusion and Knowledge Distillation for Skeleton-based Action Recognition : Abstract: In recent years, multimodal Graph Convolutional Networks (GCNs) have achieved remarkable performance in skeleton-based action recognition. The reliance on high-energy-consuming continuous fl...
- Unobtrusive Monitoring of Simulated Physical Weakness Using Fine-Grained Behavioral Features and Personalized Modeling : Abstract: Aging and chronic conditions affect older adults' daily lives, making early detection of developing health issues crucial. Weakness, common in many conditions, alters physical movements and ...
- Self Pre-training with Topology- and Spatiality-aware Masked Autoencoders for 3D Medical Image Segmentation : Abstract: Masked Autoencoders (MAEs) have been shown to be effective in pre-training Vision Transformers (ViTs) for natural and medical image analysis problems. By reconstructing missing pixel/voxel i...
- MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation : Abstract: Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While pre...
- Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models : Abstract: Multimodal Large Language Models (MLLMs) have garnered significant attention recently and demonstrate outstanding capabilities in various tasks such as OCR, VQA, captioning, etc. ...
- Integration of nested cross-validation, automated hyperparameter optimization, high-performance computing to reduce and quantify the variance of test performance estimation of deep learning models : Abstract: Background and Objectives: The variability and biases in the real-world performance benchmarking of deep learning models for medical imaging compromise their trustworthiness for real-world d...
- Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs : Abstract: Recent work has shown promising performance of frontier large language models (LLMs) and their multimodal counterparts in medical quizzes and diagnostic tasks, highlighting their potential f...
- DeepContrast: Deep Tissue Contrast Enhancement using Synthetic Data Degradations and OOD Model Predictions : Abstract: Microscopy images are crucial for life science research, allowing detailed inspection and characterization of cellular and tissue-level structures and functions. However, microscopy data are...
- RN-SDEs: Limited-Angle CT Reconstruction with Residual Null-Space Diffusion Stochastic Differential Equations : Abstract: Computed tomography is a widely used imaging modality with applications ranging from medical imaging to material analysis. One major challenge arises from the lack of scanning information at...
- Human-AI Collaboration and Explainability for 2D/3D Registration Quality Assurance : Abstract: Purpose: As surgery increasingly integrates advanced imaging, algorithms, and robotics to automate complex tasks, human judgment of system correctness remains a vital safeguard for patient s...
- Gaussian Blending: Rethinking Alpha Blending in 3D Gaussian Splatting : Abstract: The recent introduction of 3D Gaussian Splatting (3DGS) has significantly advanced novel view synthesis. Several studies have further improved the rendering quality of 3DGS, yet they still e...
- An Event-triggered System for Social Persuasion and Danger Alert in Elder Home Monitoring : Abstract: In the study, the physical state and mental state of elders are both considered, and an event-triggered system has been developed to detect events: watch dog, danger notice and photo link. By ado...
- Unbiased Semantic Decoding with Vision Foundation Models for Few-shot Segmentation : Abstract: Few-shot segmentation has garnered significant attention. Many recent approaches attempt to introduce the Segment Anything Model (SAM) to handle this task. With the strong generalization abi...
- SceneEdited: A City-Scale Benchmark for 3D HD Map Updating via Image-Guided Change Detection : Abstract: Accurate, up-to-date High-Definition (HD) maps are critical for urban planning, infrastructure monitoring, and autonomous navigation. However, these maps quickly become outdated as environme...
- Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance : Abstract: Multimodal continual instruction tuning enables multimodal large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual...
- Learning Depth from Past Selves: Self-Evolution Contrast for Robust Depth Estimation : Abstract: Self-supervised depth estimation has gained significant attention in autonomous driving and robotics. However, existing methods exhibit substantial performance degradation under adverse weat...
- MMCM: Multimodality-aware Metric using Clustering-based Modes for Probabilistic Human Motion Prediction : Abstract: This paper proposes a novel metric for Human Motion Prediction (HMP). Since a single past sequence can lead to multiple possible futures, a probabilistic HMP method predicts such multiple mo...
- Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset : Abstract: The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on long, detailed expert-level text ...
- Insert In Style: A Zero-Shot Generative Framework for Harmonious Cross-Domain Object Composition : Abstract: Reference-based object composition methods fail when inserting real-world objects into stylized domains. This under-explored problem is currently split between practical "blenders" that lack...
- Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval : Abstract: This paper addresses the challenges of learning representations for recipes and food images in the cross-modal retrieval problem. As the relationship between a recipe and its cooked dish is ...
- Physics-Based Benchmarking Metrics for Multimodal Synthetic Images : Abstract: Current state of the art measures like BLEU, CIDEr, VQA score, SigLIP-2 and CLIPScore are often unable to capture semantic or structural accuracy, especially for domain-specific or context-d...
- SkinGPT-R1: Adapter-Only Dual Distillation for Efficient Dermatology Reasoning : Abstract: We present SkinGPT-R1, a dermatology focused vision language model that makes diagnostic chain of thought reasoning explicit, step by step, and verifiable. To support skin specific reasoning...
- SplitFlux: Learning to Decouple Content and Style from a Single Image : Abstract: Disentangling image content and style is essential for customized image generation. Existing SDXL-based methods struggle to achieve high-quality results, while the recently proposed Flux mod...
- Edge-Centric Relational Reasoning for 3D Scene Graph Prediction : Abstract: 3D scene graph prediction aims to abstract complex 3D environments into structured graphs consisting of objects and their pairwise relationships. Existing approaches typically adopt object-c...
- Taming Generative Synthetic Data for X-ray Prohibited Item Detection : Abstract: Training prohibited item detection models requires a large amount of X-ray security images, but collecting and annotating these images is time-consuming and laborious. To address data insuff...
- Text2Loc++: Generalizing 3D Point Cloud Localization from Natural Language : Abstract: We tackle the problem of localizing 3D point cloud submaps using complex and diverse natural language descriptions, and present Text2Loc++, a novel neural network designed for effective cros...
- Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models : Abstract: 3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often u...
- A Multimodal Transformer Approach for UAV Detection and Aerial Object Recognition Using Radar, Audio, and Video Data : Abstract: Unmanned aerial vehicle (UAV) detection and aerial object recognition are critical for modern surveillance and security, prompting a need for robust systems that overcome limitations of sing...
- What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs : Abstract: Split DNNs enable edge devices by offloading intensive computation to a cloud server, but this paradigm exposes privacy vulnerabilities, as the intermediate features can be exploited to reco...
- Adaptive thresholding pattern for fingerprint forgery detection : Abstract: Fingerprint liveness detection systems have been affected by spoofing, which is a severe threat for fingerprint-based biometric systems. Therefore, it is crucial to develop some techniques t...
- IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers : Abstract: Previous Quantization-Aware Training (QAT) methods for vision transformers rely on expensive retraining to recover accuracy loss in non-linear layer quantization, limiting their use in resou...
- Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training : Abstract: Understanding complex human activities demands the ability to decompose motion into fine-grained, semantic-aligned sub-actions. This motion grounding process is crucial for behavior analysis...
- Breaking Expert Knowledge Limits: Self-Pruning for Large Language Models : Abstract: Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, but their massive size hinders real-world deployment. Existing pruning methods (e.g., Wanda) ...
- ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation : Abstract: Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and sever...
- EGSA-PT: Edge-Guided Spatial Attention with Progressive Training for Monocular Depth Estimation and Segmentation of Transparent Objects : Abstract: Transparent object perception remains a major challenge in computer vision research, as transparency confounds both depth estimation and semantic segmentation. Recent work has explored multi...
- Representation Space Constrained Learning with Modality Decoupling for Multimodal Object Detection : Abstract: Multimodal object detection has attracted significant attention in both academia and industry for its enhanced robustness. Although numerous studies have focused on improving modality fusion...
- HV-Attack: Hierarchical Visual Attack for Multimodal Retrieval Augmented Generation : Abstract: Advanced multimodal Retrieval-Augmented Generation (MRAG) techniques have been widely applied to enhance the capabilities of Large Multimodal Models (LMMs), but they also bring along novel s...
- A Dataset and Baseline for Deep Learning-Based Visual Quality Inspection in Remanufacturing : Abstract: Remanufacturing describes a process where worn products are restored to like-new condition and it offers vast ecological and economic potentials. A key step is the quality inspection of disa...
- Driving in Spikes: An Entropy-Guided Object Detector for Spike Cameras : Abstract: Object detection in autonomous driving suffers from motion blur and saturation under fast motion and extreme lighting. Spike cameras offer microsecond latency and ultra high dynamic range f...
- Deep Learning for Accurate Vision-based Catch Composition in Tropical Tuna Purse Seiners : Abstract: Purse seiners play a crucial role in tuna fishing, as approximately 69% of the world's tropical tuna is caught using this gear. All tuna Regional Fisheries Management Organizations have esta...
- FunnyNodules: A Customizable Medical Dataset Tailored for Evaluating Explainable AI : Abstract: Densely annotated medical image datasets that capture not only diagnostic labels but also the underlying reasoning behind these diagnoses are scarce. Such reasoning-related annotations are e...
- Evaluating Low-Light Image Enhancement Across Multiple Intensity Levels : Abstract: Imaging in low-light environments is challenging due to reduced scene radiance, which leads to elevated sensor noise and reduced color saturation. Most learning-based low-light enhancement m...
- Learning to Expand Images for Efficient Visual Autoregressive Modeling : Abstract: Autoregressive models have recently shown great promise in visual generation by leveraging discrete token sequences akin to language modeling. However, existing approaches often suffer from ...
- Multi-Text Guided Few-Shot Semantic Segmentation : Abstract: Recent CLIP-based few-shot semantic segmentation methods introduce class-level textual priors to assist segmentation by typically using a single prompt (e.g., a photo of class). However, the...
- A Hybrid CNN-ViT-GNN Framework with GAN-Based Augmentation for Intelligent Weed Detection in Precision Agriculture : Abstract: The task of weed detection is an essential element of precision agriculture since accurate species identification allows a farmer to selectively apply herbicides and fits into sustainable ag...
- Scriboora: Rethinking Human Pose Forecasting : Abstract: Human pose forecasting predicts future poses based on past observations, and has many significant applications in areas such as action recognition, autonomous driving or human-robot interact...
- Transferable Dual-Domain Feature Importance Attack against AI-Generated Image Detector : Abstract: Recent AI-generated image (AIGI) detectors achieve impressive accuracy under clean conditions. In view of antiforensics, it is significant to develop advanced adversarial attacks for evaluati...
- From Low-Rank Features to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers : Abstract: Feature-map knowledge distillation (KD) is highly effective for convolutional networks but often fails for Vision Transformers (ViTs). To understand this failure and guide method design, we ...
- AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning : Abstract: With the increasing prevalence of video content, effectively understanding and answering questions about long form videos has become essential for numerous applications. Although large visio...
- CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking : Abstract: 3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite great success having been achieved, the inherent sparsity of point...
- Learning from Mistakes: Loss-Aware Memory Enhanced Continual Learning for LiDAR Place Recognition : Abstract: LiDAR place recognition plays a crucial role in SLAM, robot navigation, and autonomous driving. However, existing LiDAR place recognition methods often struggle to adapt to new environments ...
- MaskMed: Decoupled Mask and Class Prediction for Medical Image Segmentation : Abstract: Medical image segmentation typically adopts a point-wise convolutional segmentation head to predict dense labels, where each output channel is heuristically tied to a specific class. This ri...
- FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation : Abstract: Autoregressive models can generate high-quality 3D meshes by sequentially producing vertices and faces, but their token-by-token decoding results in slow inference, limiting practical use in...
- The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification : Abstract: Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identifi...
- Multi-Stage Residual-Aware Unsupervised Deep Learning Framework for Consistent Ultrasound Strain Elastography : Abstract: Ultrasound Strain Elastography (USE) is a powerful non-invasive imaging technique for assessing tissue mechanical properties, offering crucial diagnostic value across diverse clinical applic...
- MambaIO: Global-Coordinate Inertial Odometry for Pedestrians via Multi-Scale Frequency-Decoupled Modeling : Abstract: Inertial Odometry (IO) enables real-time localization using only acceleration and angular velocity measurements from an Inertial Measurement Unit (IMU), making it a promising solution for lo...
- INQUIRE-Search: A Framework for Interactive Discovery in Large-Scale Biodiversity Databases : Abstract: Large community science platforms such as iNaturalist contain hundreds of millions of biodiversity images that often capture ecological context on behaviors, interactions, phenology, and hab...
- GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI : Abstract: Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning...
- Eguard: Defending LLM Embeddings Against Inversion Attacks via Text Mutual Information Optimization : Abstract: Embeddings have become a cornerstone in the functionality of large language models (LLMs) due to their ability to transform text data into rich, dense numerical representations that capture ...
- Gaussian See, Gaussian Do: Semantic 3D Motion Transfer from Multiview Video : Abstract: We present Gaussian See, Gaussian Do, a novel approach for semantic 3D motion transfer from multiview video. Our method enables rig-free, cross-category motion transfer between objects with ...
- When CNNs Outperform Transformers and Mambas: Revisiting Deep Architectures for Dental Caries Segmentation : Abstract: Accurate identification and segmentation of dental caries in panoramic radiographs are critical for early diagnosis and effective treatment planning. Automated segmentation remains challengi...
- B-Rep Distance Functions (BR-DF): How to Represent a B-Rep Model by Volumetric Distance Functions? : Abstract: This paper presents a novel geometric representation for CAD Boundary Representation (B-Rep) based on volumetric distance functions, dubbed B-Rep Distance Functions (BR-DF). BR-DF encodes th...
- GeoSceneGraph: Geometric Scene Graph Diffusion Model for Text-guided 3D Indoor Scene Synthesis : Abstract: Methods that synthesize indoor 3D scenes from text prompts have wide-ranging applications in film production, interior design, video games, virtual reality, and synthetic data generation for...
- InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization : Abstract: We address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify ...
- FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding : Abstract: As CLIP's global alignment limits its ability to capture fine-grained details, recent efforts have focused on enhancing its region-text alignment. However, current remote sensing (RS)-specif...
- nnMIL: A generalizable multiple instance learning framework for computational pathology : Abstract: Computational pathology holds substantial promise for improving diagnosis and guiding treatment decisions. Recent pathology foundation models enable the extraction of rich patch-level repres...
- X-WIN: Building Chest Radiograph World Model via Predictive Sensing : Abstract: Chest X-ray radiography (CXR) is an essential medical imaging technique for disease diagnosis. However, as 2D projectional images, CXRs are limited by structural superposition and hence fail...
- CPSL: Representing Volumetric Video via Content-Promoted Scene Layers : Abstract: Volumetric video enables immersive and interactive visual experiences by supporting free viewpoint exploration and realistic motion parallax. However, existing volumetric representations fro...
- Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities : Abstract: Periodic human activities with implicit workflows are common in manufacturing, sports, and daily life. While short-term periodic activities -- characterized by simple structures and high-con...
- RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems : Abstract: Accurate spatiotemporal alignment of multi-view video streams is essential for a wide range of dynamic-scene applications such as multi-view 3D reconstruction, pose estimation, and scene und...
- FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation : Abstract: We introduce FinCriticalED (Financial Critical Error Detection), a visual benchmark for evaluating OCR and vision language models on financial documents at the fact level. Financial document...
- WarNav: An Autonomous Driving Benchmark for Segmentation of Navigable Zones in War Scenes : Abstract: We introduce WarNav, a novel real-world dataset constructed from images of the open-source DATTALION repository, specifically tailored to enable the development and benchmarking of semantic ...
- CKDA: Cross-modality Knowledge Disentanglement and Alignment for Visible-Infrared Lifelong Person Re-identification : Abstract: Lifelong person Re-IDentification (LReID) aims to match the same person employing continuously collected individual data from different scenarios. To achieve continuous all-day person matchi...
- Computer Vision Modeling of the Development of Geometric and Numerical Concepts in Humans : Abstract: Mathematical thinking is a fundamental aspect of human cognition. Cognitive scientists have investigated the mechanisms that underlie our ability to think geometrically and numerically, t...
- UniHOI: Unified Human-Object Interaction Understanding via Unified Token Space : Abstract: In the field of human-object interaction (HOI), detection and generation are two dual tasks that have traditionally been addressed separately, hindering the development of comprehensive inte...
- Hyperspectral Super-Resolution with Inter-Image Variability via Degradation-based Low-Rank and Residual Fusion Method : Abstract: The fusion of hyperspectral image (HSI) with multispectral image (MSI) provides an effective way to enhance the spatial resolution of HSI. However, due to different acquisition conditions, t...
- CellGenNet: A Knowledge-Distilled Framework for Robust Cell Segmentation in Cancer Tissues : Abstract: Accurate nuclei segmentation in microscopy whole slide images (WSIs) remains challenging due to variability in staining, imaging conditions, and tissue morphology. We propose CellGenNet, a k...
- ProPL: Universal Semi-Supervised Ultrasound Image Segmentation via Prompt-Guided Pseudo-Labeling : Abstract: Existing approaches for the problem of ultrasound image segmentation, whether supervised or semi-supervised, are typically specialized for specific anatomical structures or tasks, limiting t...
- Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks : Abstract: Video Models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in la...
- BokehFlow: Depth-Free Controllable Bokeh Rendering via Flow Matching : Abstract: Bokeh rendering simulates the shallow depth-of-field effect in photography, enhancing visual aesthetics and guiding viewer attention to regions of interest. Although recent approaches perfor...
- MambaTrack3D: A State Space Model Framework for LiDAR-Based Object Tracking under High Temporal Variation : Abstract: Dynamic outdoor environments with high temporal variation (HTV) pose significant challenges for 3D single object tracking in LiDAR point clouds. Existing memory-based trackers often suffer f...
- TiCAL: Typicality-Based Consistency-Aware Learning for Multimodal Emotion Recognition : Abstract: Multimodal Emotion Recognition (MER) aims to accurately identify human emotional states by integrating heterogeneous modalities such as visual, auditory, and textual data. Existing approache...
- Jointly Conditioned Diffusion Model for Multi-View Pose-Guided Person Image Synthesis : Abstract: Pose-guided human image generation is limited by incomplete textures from single reference views and the absence of explicit cross-view interaction. We present jointly conditioned diffusion ...
- A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models : Abstract: Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirec...
- Computer-Use Agents as Judges for Generative User Interface : Abstract: Computer-Use Agents (CUAs) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUIs). Yet, most GUIs remain designed primarily fo...
- SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models : Abstract: Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance...
- When to Think and When to Look: Uncertainty-Guided Lookback : Abstract: Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision la...
- MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping : Abstract: Mixture-of-Experts (MoE) Multimodal large language models (MLLMs) excel at vision-language tasks, but they suffer from high computational inefficiency. To reduce inference overhead, expert s...
- Think Visually, Reason Textually: Vision-Language Synergy in ARC : Abstract: Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation r...
- MessIRve: A Large-Scale Spanish Information Retrieval Dataset : Abstract: Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, there are few Spanish IR dataset...
- Tomato, Tomahto, Tomate: Do Multilingual Language Models Understand Based on Subword-Level Semantic Concepts? : Abstract: Human understanding of text depends on general semantic concepts of words rather than their superficial forms. To what extent does our human intuition transfer to language models? In this wo...
- Decentralized Gaussian Process Classification and an Application in Subsea Robotics : Abstract: Teams of cooperating autonomous underwater vehicles (AUVs) rely on acoustic communication for coordination, yet this communication medium is constrained by limited range, multi-path effects,...
- Convergence and Sketching-Based Efficient Computation of Neural Tangent Kernel Weights in Physics-Based Loss : Abstract: In multi-objective optimization, multiple loss terms are weighted and added together to form a single objective. These weights are chosen to properly balance the competing losses according t...
- A Physics Informed Machine Learning Framework for Optimal Sensor Placement and Parameter Estimation : Abstract: Parameter estimation remains a challenging task across many areas of engineering. Because data acquisition can often be costly, limited, or prone to inaccuracies (noise, uncertainty) it is c...
- US-X Complete: A Multi-Modal Approach to Anatomical 3D Shape Recovery : Abstract: Ultrasound offers a radiation-free, cost-effective solution for real-time visualization of spinal landmarks, paraspinal soft tissues and neurovascular structures, making it valuable for intr...
- Near-optimal delta-convex estimation of Lipschitz functions : Abstract: This paper presents a tractable algorithm for estimating an unknown Lipschitz function from noisy observations and establishes an upper bound on its convergence rate. The approach extends ma...
- CODE-II: A large-scale dataset for artificial intelligence in ECG analysis : Abstract: Data-driven methods for electrocardiogram (ECG) interpretation are rapidly progressing. Large datasets have enabled advances in artificial intelligence (AI) based ECG analysis, yet limitatio...
- Hierarchical Semantic Tree Anchoring for CLIP-Based Class-Incremental Learning : Abstract: Class-Incremental Learning (CIL) enables models to learn new classes continually while preserving past knowledge. Recently, vision-language models like CLIP offer transferable features via m...
- Rényi Differential Privacy for Heavy-Tailed SDEs via Fractional Poincaré Inequalities : Abstract: Characterizing the differential privacy (DP) of learning algorithms has become a major challenge in recent years. In parallel, many studies suggested investigating the behavior of stochastic...
- VisPlay: Self-Evolving Vision-Language Models from Images : Abstract: Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annota...
- Front-door Reducibility: Reducing ADMGs to the Standard Front-door Setting via a Graphical Criterion : Abstract: Front-door adjustment provides a simple closed-form identification formula under the classical front-door criterion, but its applicability is often viewed as narrow and strict. Although ID a...
- RescueLens: LLM-Powered Triage and Action on Volunteer Feedback for Food Rescue : Abstract: Food rescue organizations simultaneously tackle food insecurity and waste by working with volunteers to redistribute food from donors who have excess to recipients who need it. Volunteer fee...
- Tokenisation over Bounded Alphabets is Hard : Abstract: Recent works have shown that tokenisation is NP-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets -- an unrealistic assumption, given t...
- Resource-Constrained Decentralized Federated Learning via Personalized Event-Triggering : Abstract: Federated learning (FL) is a popular technique for distributing machine learning (ML) across a set of edge devices. In this paper, we study fully decentralized FL, where in addition to devic...
- Explaining Time Series Classification Predictions via Causal Attributions : Abstract: Despite the excelling performance of machine learning models, understanding their decisions remains a long-standing goal. Although commonly used attribution methods from explainable AI attem...
- ExDAG: an MIQP Algorithm for Learning DAGs : Abstract: There has been a growing interest in causal learning in recent years. Commonly used representations of causal structures, including Bayesian networks and structural equation models (SEM), ta...
- Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Tensorflow Pretrained Models : Abstract: The application of TensorFlow pre-trained models in deep learning is explored, with an emphasis on practical guidance for tasks such as image classification and object detection. The study c...
- Revisiting Gradient Normalization and Clipping for Nonconvex SGD under Heavy-Tailed Noise: Necessity, Sufficiency, and Acceleration : Abstract: Gradient clipping has long been considered essential for ensuring the convergence of Stochastic Gradient Descent (SGD) in the presence of heavy-tailed gradient noise. In this paper, we revis...
- xLSTM-Mixer: Multivariate Time Series Forecasting by Mixing via Scalar Memories : Abstract: Time series data is prevalent across numerous fields, necessitating the development of robust and accurate forecasting models. Capturing patterns both within and between temporal and multiva...
- Coresets from Trajectories: Selecting Data via Correlation of Loss Differences : Abstract: Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlati...
- Optimizing In-Context Learning for Efficient Full Conformal Prediction : Abstract: Reliable uncertainty quantification is critical for trustworthy AI. Conformal Prediction (CP) provides prediction sets with distribution-free coverage guarantees, but its two main variants f...
- Fast convergence of the Expectation Maximization algorithm under a logarithmic Sobolev inequality : Abstract: We present a new framework for analysing the Expectation Maximization (EM) algorithm. Drawing on recent advances in the theory of gradient flows over Euclidean-Wasserstein spaces, we extend ...
- Towards Data Valuation via Asymmetric Data Shapley : Abstract: As data emerges as a vital driver of technological and economic advancements, a key challenge is accurately quantifying its value in algorithmic decision-making. The Shapley value, a well-es...
- Abnormality Prediction and Forecasting of Laboratory Values from Electrocardiogram Signals Using Multimodal Deep Learning : Abstract: This study investigates the feasibility of using electrocardiogram (ECG) data combined with basic patient metadata to estimate and monitor prompt laboratory abnormalities. We use the MIMIC-I...
- Beacon2Science: Enhancing STEREO/HI beacon data with machine learning for efficient CME tracking : Abstract: Observing and forecasting coronal mass ejections (CME) in real-time is crucial due to the strong geomagnetic storms they can generate that can have a potentially damaging effect, for example...
- Test-time Scaling of LLMs: A Survey from A Subproblem Structure Perspective : Abstract: With this paper, we survey techniques for improving the predictive accuracy of pretrained large language models by allocating additional compute at inference time. In categorizing test-time ...
- Temporal Predictors of Outcome in Reasoning Language Models : Abstract: The chain-of-thought (CoT) paradigm uses the elicitation of step-by-step rationales as a proxy for reasoning, gradually refining the model's latent representation of a solution. However, it ...
- LiveCLKTBench: Towards Reliable Evaluation of Cross-Lingual Knowledge Transfer in Multilingual LLMs : Abstract: Evaluating cross-lingual knowledge transfer in large language models is challenging, as correct answers in a target language may arise either from genuine transfer or from prior exposure dur...
- COMPASS: Context-Modulated PID Attention Steering System for Hallucination Mitigation : Abstract: Large language models (LLMs) often generate fluent but factually incorrect statements despite having access to relevant evidence, a failure mode rooted in how they allocate attention between...
- The Impact of Prosodic Segmentation on Speech Synthesis of Spontaneous Speech : Abstract: Spontaneous speech presents several challenges for speech synthesis, particularly in capturing the natural flow of conversation, including turn-taking, pauses, and disfluencies. Although spe...
- Human or LLM as Standardized Patients? A Comparative Study for Medical Education : Abstract: Standardized Patients (SP) are indispensable for clinical skills training but remain expensive, inflexible, and difficult to scale. Existing large-language-model (LLM)-based SP simulators pr...
- Opinion Mining and Analysis Using Hybrid Deep Neural Networks : Abstract: Understanding customer attitudes has become a critical component of decision-making due to the growing influence of social media and e-commerce. Text-based opinions are the most structured, ...
- Mathematical Analysis of Hallucination Dynamics in Large Language Models: Uncertainty Quantification, Advanced Decoding, and Principled Mitigation : Abstract: Large Language Models (LLMs) are powerful linguistic engines but remain susceptible to hallucinations: plausible-sounding outputs that are factually incorrect or unsupported. In this work, w...
- OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition : Abstract: Clinical named entity recognition (NER) is crucial for extracting information from electronic health records (EHRs), but supervised models like CRF and BioClinicalBERT require costly annotat...
- Context Cascade Compression: Exploring the Upper Limits of Text Compression : Abstract: Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the fea...
- IndicGEC: Powerful Models, or a Measurement Mirage? : Abstract: In this paper, we report the results of the TeamNRC's participation in the BHASHA-Task 1 Grammatical Error Correction shared task https://github.com/BHASHA-Workshop/IndicGEC2025/ for 5 India...
- MAPROC at AHaSIS Shared Task: Few-Shot and Sentence Transformer for Sentiment Analysis of Arabic Hotel Reviews : Abstract: Sentiment analysis of Arabic dialects presents significant challenges due to linguistic diversity and the scarcity of annotated data. This paper describes our approach to the AHaSIS shared t...
- Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models : Abstract: We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, ...
- HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning : Abstract: We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The upda...
- The Empowerment of Science of Science by Large Language Models: New Tools and Methods : Abstract: Large language models (LLMs) have exhibited exceptional capabilities in natural language understanding and generation, image recognition, and multimodal tasks, charting a course towards AGI ...
- A Compliance-Preserving Retrieval System for Aircraft MRO Task Search : Abstract: Aircraft Maintenance Technicians (AMTs) spend up to 30% of work time searching manuals, a documented efficiency bottleneck in MRO operations where every procedure must be traceable to certif...
- DEPO: Dual-Efficiency Preference Optimization for LLM Agents : Abstract: Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cos...
- NAMeGEn: Creative Name Generation via A Novel Agent-based Multiple Personalized Goal Enhancement Framework : Abstract: Trained on diverse human-authored texts, Large Language Models (LLMs) unlocked the potential for Creative Natural Language Generation (CNLG), benefiting various applications like advertising...
- Building Robust and Scalable Multilingual ASR for Indian Languages : Abstract: This paper describes the systems developed by SPRING Lab, Indian Institute of Technology Madras, for the ASRU MADASR 2.0 challenge. The systems developed focus on adapting ASR systems to i...
- LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering : Abstract: Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering based on their deep semantic understanding. However, their di...
- Standardising the NLP Workflow: A Framework for Reproducible Linguistic Analysis : Abstract: The introduction of large language models and other influential developments in AI-based language processing have led to an evolution in the methods available to quantitatively analyse langu...
- Multimodal Evaluation of Russian-language Architectures : Abstract: Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks r...
- HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning : Abstract: Language acquisition is vital to revealing the nature of human language intelligence and has recently emerged as a promising perspective for improving the interpretability of large language ...
- Optimizing Agricultural Research: A RAG-Based Approach to Mycorrhizal Fungi Information : Abstract: Retrieval-Augmented Generation (RAG) represents a transformative approach within natural language processing (NLP), combining neural information retrieval with generative language modeling t...
- Skin-R1: Toward Trustworthy Clinical Reasoning for Dermatological Diagnosis : Abstract: The emergence of vision-language models (VLMs) has opened new possibilities for clinical reasoning and has shown promising performance in dermatological diagnosis. However, their trustworthi...
- Evaluating Multimodal Large Language Models on Vertically Written Japanese Text : Abstract: Multimodal Large Language Models (MLLMs) have seen rapid advances in recent years and are now being applied to visual document understanding tasks. They are expected to process a wide range ...
- ProRAC: A Neuro-symbolic Method for Reasoning about Actions with LLM-based Progression : Abstract: In this paper, we propose ProRAC (Progression-based Reasoning about Actions and Change), a neuro-symbolic framework that leverages LLMs to tackle RAC problems. ProRAC extracts fundamental RA...
- Knowledge-Informed Automatic Feature Extraction via Collaborative Large Language Model Agents : Abstract: The performance of machine learning models on tabular data is critically dependent on high-quality feature engineering. While Large Language Models (LLMs) have shown promise in automating fe...
- CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries : Abstract: We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has various useful potential applications, there is still no established b...
- M, Toolchain and Language for Reusable Model Compilation : Abstract: Complex software-driven systems often interleave distributed, concurrent computation processes with physical interactions with the environment. Developing these systems more efficiently and ...
- ChartEditor: A Reinforcement Learning Framework for Robust Chart Editing : Abstract: Chart editing reduces manual effort in visualization design. Typical benchmarks are limited in data diversity and assume access to complete chart code, which is seldom the case in real-world scenarios. T...
- SkyEgg: Joint Implementation Selection and Scheduling for Hardware Synthesis using E-graphs : Abstract: Hardware synthesis from high-level descriptions remains fundamentally limited by the sequential optimization of interdependent design decisions. Current methodologies, including state-of-the...
- CroPS: Improving Dense Retrieval with Cross-Perspective Positive Samples in Short-Video Search : Abstract: Dense retrieval has become a foundational paradigm in modern search systems, especially on short-video platforms. However, most industrial systems adopt a self-reinforcing training pipeline ...
- From Solving to Verifying: A Unified Objective for Robust Reasoning in LLMs : Abstract: The reasoning capabilities of large language models (LLMs) have been significantly improved through reinforcement learning (RL). Nevertheless, LLMs still struggle to consistently verify thei...
- Cross-Modal Consistency-Guided Active Learning for Affective BCI Systems : Abstract: Deep learning models perform best with abundant, high-quality labels, yet such conditions are rarely achievable in EEG-based emotion recognition. Electroencephalogram (EEG) signals are easil...
- Complex variational autoencoders admit Kähler structure : Abstract: It has been discovered that latent-Euclidean variational autoencoders (VAEs) admit, in various capacities, Riemannian structure. We adapt these arguments but for complex VAEs with a complex ...
- FaultDiffusion: Few-Shot Fault Time Series Generation with Diffusion Model : Abstract: In industrial equipment monitoring, fault diagnosis is critical for ensuring system reliability and enabling predictive maintenance. However, the scarcity of fault data, due to the rarity of...
- Vehicle Routing Problems via Quantum Graph Attention Network Deep Reinforcement Learning : Abstract: The vehicle routing problem (VRP) is a fundamental NP-hard task in intelligent transportation systems with broad applications in logistics and distribution. Deep reinforcement learning (DRL)...
- Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning : Abstract: Masked auto-regressive diffusion models (MAR) benefit from the expressive modeling ability of diffusion models and the flexibility of masked auto-regressive ordering. However, vanilla MAR su...
- Reasoning in Diffusion Large Language Models is Concentrated in Dynamic Confusion Zones : Abstract: Diffusion Large Language Models (dLLMs) are rapidly emerging alongside autoregressive models as a powerful paradigm for complex reasoning, with reinforcement learning increasingly used for d...
- D2D Power Allocation via Quantum Graph Neural Network : Abstract: Increasing wireless network complexity demands scalable resource management. Classical GNNs excel at graph learning but incur high computational costs in large-scale settings. We present a f...
- EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control : Abstract: Long-term training of large language models (LLMs) requires maintaining stable exploration to prevent the model from collapsing into sub-optimal behaviors. Entropy is crucial in this context...
- Optimized scheduling of electricity-heat cooperative system considering wind energy consumption and peak shaving and valley filling : Abstract: With the global energy transition and rapid development of renewable energy, the scheduling optimization challenge for combined power-heat systems under new energy integration and multiple u...
- PLATONT: Learning a Platonic Representation for Unified Network Tomography : Abstract: Network tomography aims to infer hidden network states, such as link performance, traffic load, and topology, from external observations. Most existing methods solve these problems separatel...
- GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning : Abstract: The Group Relative Policy Optimization (GRPO), a reinforcement learning method used to fine-tune large language models (LLMs), has proved its effectiveness in practical applications such as ...
- SNAP: Low-Latency Test-Time Adaptation with Sparse Updates : Abstract: Test-Time Adaptation (TTA) adjusts models using unlabeled test data to handle dynamic distribution shifts. However, existing methods rely on frequent adaptation and high computational cost, ...
- Quant-Trim in Practice: Improved Cross-Platform Low-Bit Deployment on Edge NPUs : Abstract: Specialized edge accelerators rely on low-bit quantization, but vendor compilers differ in scaling, clipping, and kernel support, often as black boxes. The same floating-point (FP) checkpoin...
- On the Internal Semantics of Time-Series Foundation Models : Abstract: Time-series Foundation Models (TSFMs) have recently emerged as a universal paradigm for learning across diverse temporal domains. However, despite their empirical success, the internal mecha...
- KrawtchoukNet: A Unified GNN Solution for Heterophily and Over-smoothing with Adaptive Bounded Polynomials : Abstract: Spectral Graph Neural Networks (GNNs) based on polynomial filters, such as ChebyNet, suffer from two critical limitations: 1) performance collapse on "heterophilic" graphs and 2) performance...
- LaguerreNet: Advancing a Unified Solution for Heterophily and Over-smoothing with Adaptive Continuous Polynomials : Abstract: Spectral Graph Neural Networks (GNNs) suffer from two critical limitations: poor performance on "heterophilic" graphs and performance collapse at high polynomial degrees (K), known as over-s...
- STREAM-VAE: Dual-Path Routing for Slow and Fast Dynamics in Vehicle Telemetry Anomaly Detection : Abstract: Automotive telemetry data exhibits slow drifts and fast spikes, often within the same sequence, making reliable anomaly detection challenging. Standard reconstruction-based methods, includin...
- Multi-layer Stack Ensembles for Time Series Forecasting : Abstract: Ensembling is a powerful technique for improving the accuracy of machine learning models, with methods like stacking achieving strong results in tabular tasks. In time series forecasting, ho...
- Cost-Aware Prediction (CAP): An LLM-Enhanced Machine Learning Pipeline and Decision Support System for Heart Failure Mortality Prediction : Abstract: Objective: Machine learning (ML) predictive models are often developed without considering downstream value trade-offs and clinical interpretability. This paper introduces a cost-aware predi...
- CID: Measuring Feature Importance Through Counterfactual Distributions : Abstract: Assessing the importance of individual features in Machine Learning is critical to understand the model's decision-making process. While numerous methods exist, the lack of a definitive grou...
- Parameter Importance-Driven Continual Learning for Foundation Models : Abstract: Domain-specific post-training often causes catastrophic forgetting, making foundation models lose their general reasoning ability and limiting their adaptability to dynamic real-world enviro...
- EVA-Net: Interpretable Brain Age Prediction via Continuous Aging Prototypes from EEG : Abstract: The brain age is a key indicator of brain health. While electroencephalography (EEG) is a practical tool for this task, existing models struggle with the common challenge of imperfect medica...
- Proximal Approximate Inference in State-Space Models : Abstract: We present a class of algorithms for state estimation in nonlinear, non-Gaussian state-space models. Our approach is based on a variational Lagrangian formulation that casts Bayesian inferen...
- Towards Understanding Layer Contributions in Tabular In-Context Learning Models : Abstract: Despite the architectural similarities between tabular in-context learning (ICL) models and large language models (LLMs), little is known about how individual layers contribute to tabular pr...
- TSFM in-context learning for time-series classification of bearing-health status : Abstract: This paper introduces a classification method using in-context learning in time-series foundation models (TSFM). We show how data, which was not part of the TSFM training data corpus, can be...
- FairEnergy: Contribution-Based Fairness meets Energy Efficiency in Federated Learning : Abstract: Federated learning (FL) enables collaborative model training across distributed devices while preserving data privacy. However, balancing energy efficiency and fair participation while ensur...
- NTK-Guided Implicit Neural Teaching : Abstract: Implicit Neural Representations (INRs) parameterize continuous signals via multilayer perceptrons (MLPs), enabling compact, resolution-independent modeling for tasks like image, audio, and 3...
- Sample-Adaptivity Tradeoff in On-Demand Sampling : Abstract: We study the tradeoff between sample complexity and round complexity in on-demand sampling, where the learning algorithm adaptively samples from $k$ distributions over a limited number of ro...
- PCARNN-DCBF: Minimal-Intervention Geofence Enforcement for Ground Vehicles : Abstract: Runtime geofencing for ground vehicles is rapidly emerging as a critical technology for enforcing Operational Design Domains (ODDs). However, existing solutions struggle to reconcile high-fi...
- CODE: A global approach to ODE dynamics learning : Abstract: Ordinary differential equations (ODEs) are a conventional way to describe the observed dynamics of physical systems. Scientists typically hypothesize about dynamical behavior, propose a math...
- Continual Reinforcement Learning for Cyber-Physical Systems: Lessons Learned and Open Challenges : Abstract: Continual learning (CL) is a branch of machine learning that aims to enable agents to adapt and generalise previously learned abilities so that these can be reapplied to new tasks or environ...
- DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models : Abstract: Enabling Vision-Language-Action (VLA) models to "think before acting" via Chain-of-Thought (CoT) is a promising path to overcoming the data-hungry nature of end-to-end robot policies. Howeve...
- Walrus: A Cross-Domain Foundation Model for Continuum Dynamics : Abstract: Foundation models have transformed machine learning for language and vision, but achieving comparable impact in physical simulation remains a challenge. Data heterogeneity and unstable long-...
- The Impact of Quantization on Large Reasoning Model Reinforcement Learning : Abstract: Strong reasoning capabilities can now be achieved by large-scale reinforcement learning (RL) without any supervised fine-tuning. Although post-training quantization (PTQ) and quantization-aw...
- Cluster-based Adaptive Retrieval: Dynamic Context Selection for RAG Applications : Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by pulling in external material (documents, code, manuals) from vast and ever-growing corpora to effectively answer...
- Reservoir Computing via Multi-Scale Random Fourier Features for Forecasting Fast-Slow Dynamical Systems : Abstract: Forecasting nonlinear time series with multi-scale temporal structures remains a central challenge in complex systems modeling. We present a novel reservoir computing framework that combines...
- Convex Clustering Redefined: Robust Learning with the Median of Means Estimator : Abstract: Clustering approaches that utilize convex loss functions have recently attracted growing interest in the formation of compact data clusters. Although classical methods like k-means and its w...
- MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging : Abstract: Modeling genomic sequences faces two unsolved challenges: the information density varies widely across different regions, while there is no clearly defined minimum vocabulary unit. Relying o...
- Fully Differentiable dMRI Streamline Propagation in PyTorch : Abstract: Diffusion MRI (dMRI) provides a distinctive means to probe the microstructural architecture of living tissue, facilitating applications such as brain connectivity analysis, modeling across m...
- Implicit Bias of the JKO Scheme : Abstract: Wasserstein gradient flow provides a general framework for minimizing an energy functional $J$ over the space of probability measures on a Riemannian manifold $(M,g)$. Its canonical time-dis...
- How to pick the best anomaly detector? : Abstract: Anomaly detection has the potential to discover new physics in unexplored regions of the data. However, choosing the best anomaly detector for a given data set in a model-agnostic way is an ...
- Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings : Abstract: Large language models produce powerful text embeddings, but their causal attention mechanism restricts the flow of information from later to earlier tokens, degrading representation quality....
- Attacking Autonomous Driving Agents with Adversarial Machine Learning: A Holistic Evaluation with the CARLA Leaderboard : Abstract: To autonomously control vehicles, driving agents use outputs from a combination of machine-learning (ML) models, controller logic, and custom modules. Although numerous prior works have show...
- Exact Learning of Weighted Graphs Using Composite Queries : Abstract: In this paper, we study the exact learning problem for weighted graphs, where we are given the vertex set, $V$, of a weighted graph, $G=(V,E,w)$, but we are not given $E$. The problem, which...
- HULFSynth : An INR based Super-Resolution and Ultra Low-Field MRI Synthesis via Contrast factor estimation : Abstract: We present an unsupervised single image bidirectional Magnetic Resonance Image (MRI) synthesizer that synthesizes an Ultra-Low Field (ULF) like image from a High-Field (HF) magnitude image a...
- On-Premise SLMs vs. Commercial LLMs: Prompt Engineering and Incident Classification in SOCs and CSIRTs : Abstract: In this study, we evaluate open-source models for security incident classification, comparing them with proprietary models. We utilize a dataset of anonymized real incidents, categorized acc...
- Fine-tuning Pre-trained Audio Models for COVID-19 Detection: A Technical Report : Abstract: This technical report investigates the performance of pre-trained audio models on COVID-19 detection tasks using established benchmark datasets. We fine-tuned Audio-MAE and three PANN archit...
- Artificial intelligence approaches for energy-efficient laser cutting machines : Abstract: This research addresses the significant challenges of energy consumption and environmental impact in laser cutting by proposing novel deep learning (DL) methodologies to achieve energy reduc...
- Compiling to recurrent neurons : Abstract: Discrete structures are currently second-class in differentiable programming. Since functions over discrete structures lack overt derivatives, differentiable programs do not differentiate th...
- Reconstruction of three-dimensional shapes of normal and disease-related erythrocytes from partial observations using multi-fidelity neural networks : Abstract: Reconstruction of 3D erythrocyte or red blood cell (RBC) morphology from partial observations, such as microscope images, is essential for understanding the physiology of RBC aging and the p...
- MermaidSeqBench: An Evaluation Benchmark for LLM-to-Mermaid Sequence Diagram Generation : Abstract: Large language models (LLMs) have demonstrated excellent capabilities in generating structured diagrams from natural language descriptions. In particular, they have shown great promise in ge...
- Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion : Abstract: This paper addresses data quality issues in multimodal emotion recognition in conversation (MERC) through systematic quality control and multi-stage transfer learning. We implement a quality...
- Selective Forgetting in Option Calibration: An Operator-Theoretic Gauss-Newton Framework : Abstract: Calibration of option pricing models is routinely repeated as markets evolve, yet modern systems lack an operator for removing data from a calibrated model without full retraining. When quot...
- Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation : Abstract: Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. The status quo for feature KD methods is to utilize loss functi...
- Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation : Abstract: This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core line-up of mo...
- Task Specific Sharpness Aware O-RAN Resource Management using Multi Agent Reinforcement Learning : Abstract: Next-generation networks utilize the Open Radio Access Network (O-RAN) architecture to enable dynamic resource management, facilitated by the RAN Intelligent Controller (RIC). While deep rei...
- Resource-Based Time and Cost Prediction in Project Networks: From Statistical Modeling to Graph Neural Networks : Abstract: Accurate prediction of project duration and cost remains one of the most challenging aspects of project management, particularly in resource-constrained and interdependent task networks. Tra...
- Latent space analysis and generalization to out-of-distribution data : Abstract: Understanding the relationships between data points in the latent decision space derived by the deep learning system is critical to evaluating and interpreting the performance of the system ...
- Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference : Abstract: Mixture-of-Experts (MoE) models scale LLM capacity efficiently, but deployment on consumer GPUs is limited by the large memory footprint of inactive experts. Static post-training quantizatio...
- Complex-Valued 2D Gaussian Representation for Computer-Generated Holography : Abstract: We propose a new hologram representation based on structured complex-valued 2D Gaussian primitives, which replaces per-pixel information storage and reduces the parameter search space by up ...
- Learning Human-Like RL Agents Through Trajectory Optimization With Action Quantization : Abstract: Human-like agents have long been one of the goals in pursuing artificial intelligence. Although reinforcement learning (RL) has achieved superhuman performance in many domains, relatively li...
- Beyond GeneGPT: A Multi-Agent Architecture with Open-Source LLMs for Enhanced Genomic Question Answering : Abstract: Genomic question answering often requires complex reasoning and integration across diverse biomedical sources. GeneGPT addressed this challenge by combining domain-specific APIs with OpenAI'...
- GPU-Initiated Networking for NCCL : Abstract: Modern AI workloads, especially Mixture-of-Experts (MoE) architectures, increasingly demand low-latency, fine-grained GPU-to-GPU communication with device-side control. Traditional GPU commu...
- Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit : Abstract: In deep learning, a central issue is to understand how neural networks efficiently learn high-dimensional features. To this end, we explore the gradient descent learning of a general Gaussia...
- WaveFuse-AL: Cyclical and Performance-Adaptive Multi-Strategy Active Learning for Medical Images : Abstract: Active learning reduces annotation costs in medical imaging by strategically selecting the most informative samples for labeling. However, individual acquisition strategies often exhibit inc...
- CASPER: Cross-modal Alignment of Spatial and single-cell Profiles for Expression Recovery : Abstract: Spatial Transcriptomics enables mapping of gene expression within its native tissue context, but current platforms measure only a limited set of genes due to experimental constraints and exc...
- Beyond Uncertainty Sets: Leveraging Optimal Transport to Extend Conformal Predictive Distribution to Multivariate Settings : Abstract: Conformal prediction (CP) constructs uncertainty sets for model outputs with finite-sample coverage guarantees. A candidate output is included in the prediction set if its non-conformity sco...
- DCL-SE: Dynamic Curriculum Learning for Spatiotemporal Encoding of Brain Imaging : Abstract: High-dimensional neuroimaging analyses for clinical diagnosis are often constrained by compromises in spatiotemporal fidelity and by the limited adaptability of large-scale, general-purpose ...
- Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation : Abstract: High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises...
- Multimodal Wireless Foundation Models : Abstract: Wireless foundation models (WFMs) have recently demonstrated promising capabilities, jointly performing multiple wireless functions and adapting effectively to new environments. However, whi...
- Teaching According to Students' Aptitude: Personalized Mathematics Tutoring via Persona-, Memory-, and Forgetting-Aware LLMs : Abstract: Large Language Models (LLMs) are increasingly integrated into intelligent tutoring systems to provide human-like and adaptive instruction. However, most existing approaches fail to capture h...
- Data-driven Prediction of Species-Specific Plant Responses to Spectral-Shifting Films from Leaf Phenotypic and Photosynthetic Traits : Abstract: The application of spectral-shifting films in greenhouses to shift green light to red light has shown variable growth responses across crop species. However, the yield enhancement of crops u...
- HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples : Abstract: With nearly 1.5 billion people and more than 120 major languages, India represents one of the most diverse regions in the world. As multilingual Vision-Language Models (VLMs) gain prominence...
- BrainRotViT: Transformer-ResNet Hybrid for Explainable Modeling of Brain Aging from 3D sMRI : Abstract: Accurate brain age estimation from structural MRI is a valuable biomarker for studying aging and neurodegeneration. Traditional regression and CNN-based methods face limitations such as manu...
- Particle Monte Carlo methods for Lattice Field Theory : Abstract: High-dimensional multimodal sampling problems from lattice field theory (LFT) have become important benchmarks for machine learning assisted sampling methods. We show that GPU-accelerated pa...
- Learning Where, What and How to Transfer: A Multi-Role Reinforcement Learning Approach for Evolutionary Multitasking : Abstract: Evolutionary multitasking (EMT) algorithms typically require tailored designs for knowledge transfer, in order to assure convergence and optimality in multitask optimization. In this paper, ...
- Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story : Abstract: Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain und...
- Why Physics Still Matters: Improving Machine Learning Prediction of Material Properties with Phonon-Informed Datasets : Abstract: Machine learning (ML) methods have become powerful tools for predicting material properties with near first-principles accuracy and vastly reduced computational cost. However, the performanc...
- Reinforcement Learning in Queue-Reactive Models: Application to Optimal Execution : Abstract: We investigate the use of Reinforcement Learning for the optimal execution of meta-orders, where the objective is to execute incrementally large orders while minimizing implementation shortf...
- Graph Query Networks for Object Detection with Automotive Radar : Abstract: Object detection with 3D radar is essential for 360-degree automotive perception, but radar's long wavelengths produce sparse and irregular reflections that challenge traditional grid and se...
- Robust Bayesian Optimisation with Unbounded Corruptions : Abstract: Bayesian Optimization is critically vulnerable to extreme outliers. Existing provably robust methods typically assume a bounded cumulative corruption budget, which makes them defenseless aga...
- Exponential Lasso: robust sparse penalization under heavy-tailed noise and outliers with exponential-type loss : Abstract: In high-dimensional statistics, the Lasso is a cornerstone method for simultaneous variable selection and parameter estimation. However, its reliance on the squared loss function renders it ...
- Fast Post-Hoc Confidence Fusion for 3-Class Open-Set Aerial Object Detection : Abstract: Developing reliable UAV navigation systems requires robust air-to-air object detectors capable of distinguishing between objects seen during training and previously unseen objects. While man...
- Controlling False Positives in Image Segmentation via Conformal Prediction : Abstract: Reliable semantic segmentation is essential for clinical decision making, yet deep models rarely provide explicit statistical guarantees on their errors. We introduce a simple post-hoc frame...
- D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models : Abstract: Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While ...
- Neural network-driven domain decomposition for efficient solutions to the Helmholtz equation : Abstract: Accurately simulating wave propagation is crucial in fields such as acoustics, electromagnetism, and seismic analysis. Traditional numerical methods, like finite difference and finite elemen...
- Gini Score under Ties and Case Weights : Abstract: The Gini score is a popular tool in statistical modeling and machine learning for model validation and model selection. It is a purely rank based score that allows one to assess risk ranking...
- SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome : Abstract: Recent advances in computational pathology have leveraged vision-language models to learn joint representations of Hematoxylin and Eosin (HE) images with spatial transcriptomic (ST) profiles...
- RS-CA-HSICT: A Residual and Spatial Channel Augmented CNN Transformer Framework for Monkeypox Detection : Abstract: This work proposes a hybrid deep learning approach, namely Residual and Spatial Learning based Channel Augmented Integrated CNN-Transformer architecture, that leverages the strengths of CNN ...
- A Tensor Compiler for Processing-In-Memory Architectures : Abstract: Processing-In-Memory (PIM) devices integrated with high-performance Host processors (e.g., GPUs) can accelerate memory-intensive kernels in Machine Learning (ML) models, including Large Lang...
- Transformer Injectivity & Geometric Robustness - Analytic Margins and Bi-Lipschitz Uniformity of Sequence-Level Hidden States : Abstract: Under real-analytic assumptions on decoder-only Transformers, recent work shows that the map from discrete prompts to last-token hidden states is generically injective on finite prompt sets....
- DEVAL: A Framework for Evaluating and Improving the Derivation Capability of Large Language Models : Abstract: Assessing the reasoning ability of Large Language Models (LLMs) over data remains an open and pressing research question. Compared with LLMs, human reasoning can derive corresponding modific...
- Dynamic Nested Hierarchies: Pioneering Self-Evolution in Machine Learning Architectures for Lifelong Intelligence : Abstract: Contemporary machine learning models, including large language models, exhibit remarkable capabilities in static tasks yet falter in non-stationary environments due to rigid architectures th...
- Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization : Abstract: Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for ...
- FinTRec: Transformer Based Unified Contextual Ads Targeting and Personalization for Financial Applications : Abstract: Transformer-based architectures are widely adopted in sequential recommendation systems, yet their application in Financial Services (FS) presents distinct practical and modeling challenges ...
- Transformer-Guided Deep Reinforcement Learning for Optimal Takeoff Trajectory Design of an eVTOL Drone : Abstract: The rapid advancement of electric vertical take-off and landing (eVTOL) aircraft offers a promising opportunity to alleviate urban traffic congestion. Thus, developing optimal takeoff trajec...
- Bringing Federated Learning to Space : Abstract: As Low Earth Orbit (LEO) satellite constellations rapidly expand to hundreds and thousands of spacecraft, the need for distributed on-board machine learning becomes critical to address downl...
- It's LIT! Reliability-Optimized LLMs with Inspectable Tools : Abstract: Large language models (LLMs) have exhibited remarkable capabilities across various domains. The ability to call external tools further expands their capability to handle real-world tasks. Ho...
- Structured Contrastive Learning for Interpretable Latent Representations : Abstract: Neural networks exhibit severe brittleness to semantically irrelevant transformations. A mere 75ms electrocardiogram (ECG) phase shift degrades latent cosine similarity from 1.0 to 0.2, whil...
- Integrating Causal Inference with Graph Neural Networks for Alzheimer's Disease Analysis : Abstract: Deep graph learning has advanced Alzheimer's (AD) disease classification from MRI, but most models remain correlational, confounding demographic and genetic factors with disease specific fea...
- How to Train Private Clinical Language Models: A Comparative Study of Privacy-Preserving Pipelines for ICD-9 Coding : Abstract: Large language models trained on clinical text risk exposing sensitive patient information, yet differential privacy (DP) methods often severely degrade the diagnostic accuracy needed for de...
- Knowledge Graphs as Structured Memory for Embedding Spaces: From Training Clusters to Explainable Inference : Abstract: We introduce Graph Memory (GM), a structured non-parametric framework that augments embedding-based inference with a compact, relational memory over region-level prototypes. Rather than trea...
- IonCast: A Deep Learning Framework for Forecasting Ionospheric Dynamics : Abstract: The ionosphere is a critical component of near-Earth space, shaping GNSS accuracy, high-frequency communications, and aviation operations. For these reasons, accurate forecasting and modelin...
- Simulated Human Learning in a Dynamic, Partially-Observed, Time-Series Environment : Abstract: While intelligent tutoring systems (ITSs) can use information from past students to personalize instruction, each new student is unique. Moreover, the education problem is inherently difficu...
- Oversampling techniques for predicting COVID-19 patient length of stay : Abstract: COVID-19 is a respiratory disease that caused a global pandemic in 2019. It is highly infectious and has the following symptoms: fever or chills, cough, shortness of breath, fatigue, muscle ...
- Interpretable temporal fusion network of multi- and multi-class arrhythmia classification : Abstract: Clinical decision support systems (CDSSs) have been widely utilized to support the decisions made by cardiologists when detecting and classifying arrhythmia from electrocardiograms. However,...
- Deep Pathomic Learning Defines Prognostic Subtypes and Molecular Drivers in Colorectal Cancer : Abstract: Precise prognostic stratification of colorectal cancer (CRC) remains a major clinical challenge due to its high heterogeneity. The conventional TNM staging system is inadequate for personali...
- Fourier-KAN-Mamba: A Novel State-Space Equation Approach for Time-Series Anomaly Detection : Abstract: Time-series anomaly detection plays a critical role in numerous real-world applications, including industrial monitoring and fault diagnosis. Recently, Mamba-based state-space models have sh...
- Semiconductor Industry Trend Prediction with Event Intervention Based on LSTM Model in Sentiment-Enhanced Time Series Data : Abstract: The innovation of the study is that the deep learning method and sentiment analysis are integrated in traditional business model analysis and forecasting, and the research subject is TSMC fo...
- Efficient RF Passive Components Modeling with Bayesian Online Learning and Uncertainty Aware Sampling : Abstract: Conventional radio frequency (RF) passive components modeling based on machine learning requires extensive electromagnetic (EM) simulations to cover geometric and frequency design spaces, cr...
- Novel sparse matrix algorithm expands the feasible size of a self-organizing map of the knowledge indexed by a database of peer-reviewed medical literature : Abstract: Past efforts to map the Medline database have been limited to small subsets of the available data because of the exponentially increasing memory and processing demands of existing algorithms...
Research Sources: 276 | Generated: 11/20/2025
