AI RESEARCH PAPERS & ACADEMIC SOURCES
- Optimization of Sums of Bivariate Functions: An Introduction to Relaxation-Based Methods for the Case of Finite Domains : Abstract: We study the optimization of functions with $n>2$ arguments that have a representation as a sum of several functions that have only $2$ of the $n$ arguments each, termed sums of bivariates, ...
- Vision-Language Memory for Spatial Reasoning : Abstract: Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. Thi...
- PixelDiT: Pixel Diffusion Transformers for Image Generation : Abstract: Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, le...
- 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding : Abstract: This paper addresses the challenge of training a single network to jointly perform multiple dense prediction tasks, such as segmentation and depth estimation, i.e., multi-task learning (MTL)...
- Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization : Abstract: While recent text-to-video (T2V) diffusion models have achieved impressive quality and prompt alignment, they often produce low-diversity outputs when sampling multiple videos from a single ...
- LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight : Abstract: To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D de...
- Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout : Abstract: Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE),...
- MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities : Abstract: Traditional object detection models in medical imaging operate within a closed-set paradigm, limiting their ability to detect objects of novel labels. Open-vocabulary object detection (OVOD)...
- RubricRL: Simple Generalizable Rewards for Text-to-Image Generation : Abstract: Reinforcement learning (RL) has recently emerged as a promising approach for aligning text-to-image generative models with human preferences. A key challenge, however, lies in designing effe...
- Splatblox: Traversability-Aware Gaussian Splatting for Outdoor Robot Navigation : Abstract: We present Splatblox, a real-time system for autonomous navigation in outdoor environments with dense vegetation, irregular obstacles, and complex terrain. Our method fuses segmented RGB ima...
- Not Quite Anything: Overcoming SAMs Limitations for 3D Medical Imaging : Abstract: Foundation segmentation models such as SAM and SAM-2 perform well on natural images but struggle with brain MRIs where structures like the caudate and thalamus lack sharp boundaries and have...
- PhysDNet: Physics-Guided Decomposition Network of Side-Scan Sonar Imagery : Abstract: Side-scan sonar (SSS) imagery is widely used for seafloor mapping and underwater remote sensing, yet the measured intensity is strongly influenced by seabed reflectivity, terrain elevation, ...
- The Selective Disk Bispectrum and Its Inversion, with Application to Multi-Reference Alignment : Abstract: In many computer vision and shape analysis tasks, practitioners are interested in learning from the shape of the object in an image, while disregarding the object's orientation. To this end,...
- Frequency Bias Matters: Diving into Robust and Generalized Deep Image Forgery Detection : Abstract: As deep image forgery powered by AI generative models, such as GANs, continues to challenge today's digital world, detecting AI-generated forgeries has become a vital security topic. General...
- DLADiff: A Dual-Layer Defense Framework against Fine-Tuning and Zero-Shot Customization of Diffusion Models : Abstract: With the rapid advancement of diffusion models, a variety of fine-tuning methods have been developed, enabling high-fidelity image generation with high similarity to the target content using...
- Redefining Radar Segmentation: Simultaneous Static-Moving Segmentation and Ego-Motion Estimation using Radar Point Clouds : Abstract: Conventional radar segmentation research has typically focused on learning category labels for different moving objects. Although fundamental differences between radar and optical sensors le...
- ArtiBench and ArtiBrain: Benchmarking Generalizable Vision-Language Articulated Object Manipulation : Abstract: Interactive articulated manipulation requires long-horizon, multi-step interactions with appliances while maintaining physical consistency. Existing vision-language and diffusion-based polic...
- VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning : Abstract: Understanding the physical world requires perceptual models grounded in physical laws rather than mere statistical correlations. However, existing multimodal learning frameworks, focused on ...
- Development of a fully deep learning model to improve the reproducibility of sector classification systems for predicting unerupted maxillary canine likelihood of impaction : Abstract: Objectives. The aim of the present study was to develop a fully deep learning model to improve the intra- and inter-operator reproducibility of sector classification systems for predicting un...
- Natural Image Stitching Using Depth Maps : Abstract: Natural image stitching aims to create a single, natural-looking mosaic from overlapped images that capture the same 3D scene from different viewing positions. Challenges inevitably arise wh...
- Rethinking the Learning Paradigm for Facial Expression Recognition : Abstract: Due to subjective crowdsourced annotations and the inherent inter-class similarity of facial expressions, real-world Facial Expression Recognition (FER) datasets usually exhibit amb...
- SD-MVS: Segmentation-Driven Deformation Multi-View Stereo with Spherical Refinement and EM optimization : Abstract: In this paper, we introduce Segmentation-Driven Deformation Multi-View Stereo (SD-MVS), a method that can effectively tackle challenges in 3D reconstruction of textureless areas. We are the ...
- GMT: Effective Global Framework for Multi-Camera Multi-Target Tracking : Abstract: Multi-Camera Multi-Target (MCMT) tracking aims to locate and associate the same targets across multiple camera views. Existing methods typically adopt a two-stage framework, involving single...
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence : Abstract: Recent advancements in video generation have primarily leveraged diffusion models for short-duration content. However, these approaches often fall short in modeling complex narratives and ma...
- E$^{3}$NeRF: Efficient Event-Enhanced Neural Radiance Fields from Blurry Images : Abstract: Neural Radiance Fields (NeRF) achieves impressive novel view rendering performance by learning implicit 3D representation from sparse view images. However, it is difficult to reconstruct a s...
- Temporally Compressed 3D Gaussian Splatting for Dynamic Scenes : Abstract: Recent advancements in high-fidelity dynamic scene reconstruction have leveraged dynamic 3D Gaussians and 4D Gaussian Splatting for realistic scene representation. However, to make these met...
- LiHi-GS: LiDAR-Supervised Gaussian Splatting for Highway Driving Scene Reconstruction : Abstract: Photorealistic 3D scene reconstruction plays an important role in autonomous driving, enabling the generation of novel data from existing datasets to simulate safety-critical scenarios and e...
- The Early Bird Identifies the Worm: You Can't Beat a Head Start in Long-Term Body Re-ID (ECHO-BID) : Abstract: A wide range of model-based approaches to long-term person re-identification have been proposed. Whether these models perform more accurately than direct domain transfer learning applied to ...
- Comparison of Generative Learning Methods for Turbulence Surrogates : Abstract: Numerical simulations of turbulent flows present significant challenges in fluid dynamics due to their complexity and high computational cost. High resolution techniques such as Direct Numer...
- High Resolution UDF Meshing via Iterative Networks : Abstract: Unsigned Distance Fields (UDFs) are a natural implicit representation for open surfaces but, unlike Signed Distance Fields (SDFs), are challenging to triangulate into explicit meshes. This i...
- IrisNet: Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection : Abstract: Infrared Small Target Detection (IRSTD) faces significant challenges due to low signal-to-noise ratios, complex backgrounds, and the absence of discernible target features. While deep learni...
- AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models : Abstract: End-to-end models for autonomous driving hold the promise of learning complex behaviors directly from sensor data, but face critical challenges in safety and handling long-tail events. Reinf...
- 3D Motion Perception of Binocular Vision Target with PID-CNN : Abstract: This article trains a network that perceives the three-dimensional motion of a binocular vision target and can provide real-time three-dimensional coordinate, velocity, and acceler...
- ShelfRectNet: Single View Shelf Image Rectification with Homography Estimation : Abstract: Estimating homography from a single image remains a challenging yet practically valuable task, particularly in domains like retail, where only one viewpoint is typically available for shelf ...
- AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend : Abstract: We present AMB3R, a multi-view feed-forward model for dense 3D reconstruction on a metric-scale that addresses diverse 3D vision tasks. The key idea is to leverage a sparse, yet compact, vol...
- Material-informed Gaussian Splatting for 3D World Reconstruction in a Digital Twin : Abstract: 3D reconstruction for Digital Twins often relies on LiDAR-based methods, which provide accurate geometry but lack the semantics and textures naturally captured by cameras. Traditional LiDAR-...
- Thinking in 360°: Humanoid Visual Search in the Wild : Abstract: Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limite...
- GS-Checker: Tampering Localization for 3D Gaussian Splatting : Abstract: Recent advances in editing technologies for 3D Gaussian Splatting (3DGS) have made it simple to manipulate 3D scenes. However, these technologies raise concerns about potential malicious man...
- From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven by Coarse-Grained Annotations : Abstract: Image manipulation localization (IML) faces a fundamental trade-off between minimizing annotation cost and achieving fine-grained localization accuracy. Existing fully-supervised IML methods...
- VGGTFace: Topologically Consistent Facial Geometry Reconstruction in the Wild : Abstract: Reconstructing topologically consistent facial geometry is crucial for digital avatar creation pipelines. Existing methods either require tedious manual efforts, lack generalization to i...
- FREE: Uncertainty-Aware Autoregression for Parallel Diffusion Transformers : Abstract: Diffusion Transformers (DiTs) achieve state-of-the-art generation quality but require long sequential denoising trajectories, leading to high inference latency. Recent speculative inference ...
- A Training-Free Approach for Multi-ID Customization via Attention Adjustment and Spatial Control : Abstract: Multi-ID customization is an interesting topic in computer vision that has recently attracted considerable attention. Given the ID images of multiple individuals, its purpose is to generate a cust...
- Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs : Abstract: Timestep distillation is an effective approach for improving the generation efficiency of diffusion models. The Consistency Model (CM), as a trajectory-based framework, demonstrates signific...
- MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts : Abstract: Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must satisfy both stylistic diversity, fine-grained, and cont...
- Block Cascading: Training Free Acceleration of Block-Causal Video Models : Abstract: Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiven...
- BRIC: Bridging Kinematic Plans and Physical Control at Test Time : Abstract: We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion pla...
- Patch-Level Glioblastoma Subregion Classification with a Contrastive Learning-Based Encoder : Abstract: The significant molecular and pathological heterogeneity of glioblastoma, an aggressive brain tumor, complicates diagnosis and patient stratification. While traditional histopathological ass...
- Learning to Generate Human-Human-Object Interactions from Textual Descriptions : Abstract: The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across different situations. To enable machines to underst...
- Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search : Abstract: With advances in satellite constellations, sensor technologies, and imaging pipelines, ultra-high-resolution (Ultra-HR) remote sensing imagery is becoming increasingly widespread. However, c...
- AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs : Abstract: Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, ...
- HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation : Abstract: Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods...
- Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos : Abstract: We introduce Mistake Attribution (MATT), a task for fine-grained understanding of human mistakes in egocentric video. Unlike prior mistake understanding work, which lacks fine-grained output...
- Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning : Abstract: Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique ...
- PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding : Abstract: While recent video generation models have achieved significant visual fidelity, they often suffer from the lack of explicit physical controllability and plausibility. To address this, some r...
- A Reason-then-Describe Instruction Interpreter for Controllable Video Generation : Abstract: Diffusion Transformers have significantly improved video fidelity and temporal coherence; however, practical controllability remains limited. Concise, ambiguous, and compositionally complex ...
- DINO-Tok: Adapting DINO for Visual Tokenizers : Abstract: Recent advances in visual generation have highlighted the rise of Latent Generative Models (LGMs), which rely on effective visual tokenizers to bridge pixels and semantics. However, existing...
- VQ-VA World: Towards High-Quality Visual Question-Visual Answering : Abstract: This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question -- an ability that has recently emerged in proprietary sy...
- The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment : Abstract: Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, our ai...
- Evaluating the Performance of Deep Learning Models in Whole-body Dynamic 3D Posture Prediction During Load-reaching Activities : Abstract: This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching activities. Two time-series models were trained using...
- Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI : Abstract: Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sen...
- ShapeGen: Towards High-Quality 3D Shape Synthesis : Abstract: Inspired by generative paradigms in image and video, 3D shape generation has made notable progress, enabling the rapid synthesis of high-fidelity 3D assets from a single image. However, curr...
- iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation : Abstract: Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained ...
- SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery : Abstract: Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently propose...
- Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs : Abstract: While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignme...
- While recognizing actions, LMMs struggle to detect core interaction events : Abstract: Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to ...
- ADNet: A Large-Scale and Extensible Multi-Domain Benchmark for Anomaly Detection Across 380 Real-World Categories : Abstract: Anomaly detection (AD) aims to identify defects using normal-only training data. Existing anomaly detection benchmarks (e.g., MVTec-AD with 15 categories) cover only a narrow range of catego...
- Realizing Fully-Integrated, Low-Power, Event-Based Pupil Tracking with Neuromorphic Hardware : Abstract: Eye tracking is fundamental to numerous applications, yet achieving robust, high-frequency tracking with ultra-low power consumption remains challenging for wearable platforms. While event-b...
- Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis : Abstract: Foundation video generation models such as WAN 2.2 exhibit strong text- and image-conditioned synthesis abilities but remain constrained to the same-view generation setting. In this work, we...
- SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA : Abstract: The video text-based visual question answering (Video TextVQA) task aims to answer questions about videos by leveraging the visual text appearing within the videos. This task poses significant c...
- GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering : Abstract: We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricat...
- Robust 3D Brain MRI Inpainting with Random Masking Augmentation : Abstract: The ASNR-MICCAI BraTS-Inpainting Challenge was established to mitigate dataset biases that limit deep learning models in the quantitative analysis of brain tumor MRI. This paper details our ...
- OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation : Abstract: Generative models have excelled in RGB synthesis, but real-world applications require RGBA manipulation. This has led to a fragmented landscape: specialized, single-task models handle alpha ...
- Text-guided Controllable Diffusion for Realistic Camouflage Images Generation : Abstract: Camouflage Images Generation (CIG) is an emerging research area that focuses on synthesizing images in which objects are harmoniously blended and exhibit high visual consistency with their s...
- V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs : Abstract: Adversarial attacks have evolved from simply disrupting predictions on conventional task-specific models to the more complex goal of manipulating image semantics on Large Vision-Language Mod...
- HistoSpeckle-Net: Mutual Information-Guided Deep Learning for high-fidelity reconstruction of complex OrganAMNIST images via perturbed Multimode Fibers : Abstract: Existing deep learning methods in multimode fiber (MMF) imaging often focus on simpler datasets, limiting their applicability to complex, real-world imaging tasks. These models are typically...
- PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling : Abstract: Recent advances in text-to-image (T2I) generation have achieved remarkable visual outcomes through large-scale rectified flow models. However, how these models behave under long prompts rema...
- Zoo3D: Zero-Shot 3D Object Detection at Scene Level : Abstract: 3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitatio...
- XiCAD: Camera Activation Detection in the Da Vinci Xi User Interface : Abstract: Purpose: Robot-assisted minimally invasive surgery relies on endoscopic video as the sole intraoperative visual feedback. The DaVinci Xi system overlays a graphical user interface (UI) that ...
- The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation : Abstract: A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to ...
- Advancing Image Classification with Discrete Diffusion Classification Modeling : Abstract: Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are...
- Object-Centric Vision Token Pruning for Vision Language Models : Abstract: In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, and thus consume too much unnecessary computation. Pruning redundant v...
- DRL-Guided Neural Batch Sampling for Semi-Supervised Pixel-Level Anomaly Detection : Abstract: Anomaly detection in industrial visual inspection is challenging due to the scarcity of defective samples. Most existing methods rely on unsupervised reconstruction using only normal data, o...
- VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs : Abstract: While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social...
- ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis : Abstract: Until recently, the general corpus of CLIP-type fundamental models has widely explored either the retrieval of short descriptions or the classification of objects in the scene as SINGLE-obje...
- DAPointMamba: Domain Adaptive Point Mamba for Point Cloud Completion : Abstract: Domain adaptive point cloud completion (DA PCC) aims to narrow the geometric and semantic discrepancies between the labeled source and unlabeled target domains. Existing methods either suffe...
- SelfMOTR: Revisiting MOTR with Self-Generating Detection Priors : Abstract: Despite progress toward end-to-end tracking with transformer architectures, poor detection performance and the conflict between detection and association in a joint architecture remain criti...
- Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement : Abstract: Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we...
- Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations : Abstract: Counterfactual explanations (CFEs) are minimal and semantically meaningful modifications of the input of a model that alter the model predictions. They highlight the decisive features the mo...
- Prompting Lipschitz-constrained network for multiple-in-one sparse-view CT reconstruction : Abstract: Despite significant advancements in deep learning-based sparse-view computed tomography (SVCT) reconstruction algorithms, these methods still encounter two primary limitations: (i) It is cha...
- CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation : Abstract: In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. H...
- TaCo: Capturing Spatio-Temporal Semantic Consistency in Remote Sensing Change Detection : Abstract: Remote sensing change detection (RSCD) aims to identify surface changes across bi-temporal satellite images. Most previous methods rely solely on mask supervision, which effectively guides s...
- TReFT: Taming Rectified Flow Models For One-Step Image Translation : Abstract: Rectified Flow (RF) models have advanced high-quality image and video synthesis via optimal transport theory. However, when applied to image-to-image translation, they still depend on costly...
- GFT-GCN: Privacy-Preserving 3D Face Mesh Recognition with Spectral Diffusion : Abstract: 3D face recognition offers a robust biometric solution by capturing facial geometry, providing resilience to variations in illumination, pose changes, and presentation attacks. Its strong sp...
- MambaEye: A Size-Agnostic Visual Encoder with Causal Sequential Processing : Abstract: Despite decades of progress, a truly input-size agnostic visual encoder, a fundamental characteristic of human vision, has remained elusive. We address this limitation by proposing Mam...
- HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning : Abstract: Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving mult...
- VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction : Abstract: Reconstructing dynamic 4D scenes is challenging, as it requires robust disentanglement of dynamic objects from the static background. While 3D foundation models like VGGT provide accurate 3D...
- Boosting Reasoning in Large Multimodal Models via Activation Replay : Abstract: Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underl...
- EmoFeedback2: Reinforcement of Continuous Emotional Image Generation via LVLM-based Reward and Textual Feedback : Abstract: Continuous emotional image generation (C-EICG) is emerging rapidly due to its ability to produce images aligned with both user descriptions and continuous emotional values. However, existing...
- SONIC: Spectral Optimization of Noise for Inpainting with Consistency : Abstract: We propose a novel training-free method for inpainting with off-the-shelf text-to-image models. While guidance-based methods in theory allow generic models to be used for inverse problems su...
- GazeProphetV2: Head-Movement-Based Gaze Prediction Enabling Efficient Foveated Rendering on Mobile VR : Abstract: Predicting gaze behavior in virtual reality environments remains a significant challenge with implications for rendering optimization and interface design. This paper introduces a multimodal...
- OmniRefiner: Reinforcement-Guided Local Diffusion Refinement : Abstract: Reference-guided image generation has progressed rapidly, yet current diffusion models still struggle to preserve fine-grained visual details when refining a generated image using a referenc...
- CREward: A Type-Specific Creativity Reward Model : Abstract: Creativity is a complex phenomenon. When it comes to representing and assessing creativity, treating it as a single undifferentiated quantity would appear naive and underwhelming. In this wo...
- On the Feasibility of Hijacking MLLMs' Decision Chain via One Perturbation : Abstract: Conventional adversarial attacks focus on manipulating a single decision of neural networks. However, real-world models often operate in a sequence of decisions, where an isolated mistake ca...
- Pedestrian Crossing Intention Prediction Using Multimodal Fusion Network : Abstract: Pedestrian crossing intention prediction is essential for the deployment of autonomous vehicles (AVs) in urban environments. Ideal prediction provides AVs with critical environmental cues, t...
- Multi-Context Fusion Transformer for Pedestrian Crossing Intention Prediction in Urban Environments : Abstract: Pedestrian crossing intention prediction is essential for autonomous vehicles to improve pedestrian safety and reduce traffic accidents. However, accurate pedestrian intention prediction in ...
- ACIT: Attention-Guided Cross-Modal Interaction Transformer for Pedestrian Crossing Intention Prediction : Abstract: Predicting pedestrian crossing intention is crucial for autonomous vehicles to prevent pedestrian-related collisions. However, effectively extracting and integrating complementary cues from ...
- WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving : Abstract: Recent advancements in multimodal large language models (MLLMs) have shown strong understanding of driving scenes, drawing interest in their application to autonomous driving. However, high-...
- SAM-MI: A Mask-Injected Framework for Enhancing Open-Vocabulary Semantic Segmentation with SAM : Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment and recognize objects universally. Trained on extensive high-quality segmentation data, the segment anything model (SAM) has demo...
- Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention : Abstract: Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that...
- Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization : Abstract: Image vectorization aims to convert raster images into editable, scalable vector representations while preserving visual fidelity. Existing vectorization methods struggle to represent comple...
- History-Augmented Contrastive Meta-Learning for Unsupervised Blind Super-Resolution of Planetary Remote Sensing Images : Abstract: Planetary remote sensing images are affected by diverse and unknown degradations caused by imaging environments and hardware constraints. These factors limit image quality and hinder supervi...
- DeLightMono: Enhancing Self-Supervised Monocular Depth Estimation in Endoscopy by Decoupling Uneven Illumination : Abstract: Self-supervised monocular depth estimation serves as a key task in the development of endoscopic navigation systems. However, performance degradation persists due to uneven illumination inhe...
- FLaTEC: Frequency-Disentangled Latent Triplanes for Efficient Compression of LiDAR Point Clouds : Abstract: Point cloud compression methods jointly optimize bitrates and reconstruction distortion. However, balancing compression ratio and reconstruction quality is difficult because low-frequency an...
- PRADA: Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images : Abstract: Autoregressive (AR) image generation has recently emerged as a powerful paradigm for image synthesis. Leveraging the generation principle of large language models, they allow for efficiently...
- Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding : Abstract: Learning procedural-aware video representations is a key step towards building agents that can reason about and execute complex tasks. Existing methods typically address this problem by alig...
- Blind Adaptive Local Denoising for CEST Imaging : Abstract: Chemical Exchange Saturation Transfer (CEST) MRI enables molecular-level visualization of low-concentration metabolites by leveraging proton exchange dynamics. However, its clinical translat...
- Explainable Visual Anomaly Detection via Concept Bottleneck Models : Abstract: In recent years, Visual Anomaly Detection (VAD) has gained significant attention due to its ability to identify anomalous images using only normal images during training. Many VAD models wor...
- WPT: World-to-Policy Transfer via Online World Model Distillation : Abstract: Recent years have witnessed remarkable progress in world models, which primarily aim to capture the spatio-temporal correlations between an agent's actions and the evolving environment. Howe...
- Exploring State-of-the-art models for Early Detection of Forest Fires : Abstract: There have been many recent developments in the use of Deep Learning Neural Networks for fire detection. In this paper, we explore an early warning system for detection of forest fires. Due ...
- Multi Head Attention Enhanced Inception v3 for Cardiomegaly Detection : Abstract: The healthcare industry has been revolutionized significantly by novel imaging technologies, not just in the diagnosis of cardiovascular diseases but also by the visualization of structural ...
- LungEvaty: A Scalable, Open-Source Transformer-based Deep Learning Model for Lung Cancer Risk Prediction in LDCT Screening : Abstract: Lung cancer risk estimation is gaining increasing importance as more countries introduce population-wide screening programs using low-dose CT (LDCT). As imaging volumes grow, scalable method...
- UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers : Abstract: Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model...
- Vision-Language Models for Automated 3D PET/CT Report Generation : Abstract: Positron emission tomography/computed tomography (PET/CT) is essential in oncology, yet the rapid expansion of scanners has outpaced the availability of trained specialists, making automated...
- Hybrid Convolution and Frequency State Space Network for Image Compression : Abstract: Learned image compression (LIC) has recently benefited from Transformer based and state space model (SSM) based architectures. Convolutional neural networks (CNNs) effectively capture local ...
- Restora-Flow: Mask-Guided Image Restoration with Flow Matching : Abstract: Flow matching has emerged as a promising generative approach that addresses the lengthy sampling times associated with state-of-the-art diffusion models and enables a more flexible trajector...
- Alzheimers Disease Progression Prediction Based on Manifold Mapping of Irregularly Sampled Longitudinal Data : Abstract: The uncertainty of clinical examinations frequently leads to irregular observation intervals in longitudinal imaging data, posing challenges for modeling disease progression. Most existing im...
- Map-World: Masked Action planning and Path-Integral World Model for Autonomous Driving : Abstract: Motion planning for autonomous driving must handle multiple plausible futures while remaining computationally efficient. Recent end-to-end systems and world-model-based planners predict rich...
- The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models : Abstract: Analogical reasoning is at the core of human cognition, serving as an important foundation for a variety of intellectual activities. While prior work has shown that LLMs can represent task p...
- BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali : Abstract: Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We p...
- A Task-Oriented Evaluation Framework for Text Normalization in Modern NLP Pipelines : Abstract: Text normalization is an essential preprocessing step in many natural language processing (NLP) tasks, and stemming is one such normalization technique that reduces words to their base or ro...
- Generation, Evaluation, and Explanation of Novelists' Styles with Single-Token Prompts : Abstract: Recent advances in large language models have created new opportunities for stylometry, the study of writing styles and authorship. Two challenges, however, remain central: training generati...
- Adversarial Confusion Attack: Disrupting Multimodal Large Language Models : Abstract: We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce...
- The Text Aphasia Battery (TAB): A Clinically-Grounded Benchmark for Aphasia-Like Deficits in Language Models : Abstract: Large language models (LLMs) have emerged as a candidate "model organism" for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders l...
- Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition : Abstract: Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages, primarily due to the abundance of available training data. Th...
- From Words to Wisdom: Discourse Annotation and Baseline Models for Student Dialogue Understanding : Abstract: Identifying discourse features in student conversations is quite important for educational researchers to recognize the curricular and pedagogical variables that cause students to engage in ...
- Studying Maps at Scale: A Digital Investigation of Cartography and the Evolution of Figuration : Abstract: This thesis presents methods and datasets to investigate cartographic heritage on a large scale and from a cultural perspective. Heritage institutions worldwide have digitized more than one ...
- Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs : Abstract: While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to "think with images", i.e., to reason through multi-step visual interactions, remains limit...
- CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding : Abstract: Vision Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However,...
- QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation : Abstract: Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While LLMs offer pro...
- DesignPref: Capturing Personal Preferences in Visual Design Generation : Abstract: Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetu...
- Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward : Abstract: Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introdu...
- Gram2Vec: An Interpretable Document Vectorizer : Abstract: We present Gram2Vec, a grammatical style embedding system that embeds documents into a higher dimensional space by extracting the normalized relative frequencies of grammatical features pres...
- Multi-Modal Data Exploration via Language Agents : Abstract: International enterprises, organizations, and hospitals collect large amounts of multi-modal data stored in databases, text documents, images, and videos. While there has been recent progres...
- CNS-Obsidian: A Neurosurgical Vision-Language Model Built From Scientific Publications : Abstract: General-purpose VLMs demonstrate impressive capabilities, but their opaque training on uncurated internet data poses critical limitations for high-stakes decision-making, such as in neurosur...
- PuzzlePoles: Cylindrical Fiducial Markers Based on the PuzzleBoard Pattern : Abstract: Reliable perception of the environment is a key enabler for autonomous systems, where calibration and localization tasks often rely on robust visual markers. We introduce the PuzzlePole, a n...
- Personalized Reward Modeling for Text-to-Image Generation : Abstract: Recent text-to-image (T2I) models generate semantically coherent images from textual prompts, yet evaluating how well they align with individual user preferences remains an open challenge. C...
- Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks : Abstract: Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly c...
- Tracking and Segmenting Anything in Any Modality : Abstract: Tracking and segmentation play essential roles in video understanding, providing basic positional information and temporal association of objects within video sequences. Despite their shared...
- The Determinant Ratio Matrix Approach to Solving 3D Matching and 2D Orthographic Projection Alignment Tasks : Abstract: Pose estimation is a general problem in computer vision with wide applications. The relative orientation of a 3D reference object can be determined from a 3D rotated version of that object, ...
- Single Image to High-Quality 3D Object via Latent Features : Abstract: 3D assets are essential in the digital age. While automatic 3D generation, such as image-to-3d, has made significant strides in recent years, it often struggles to achieve fast, detailed, an...
- Fewer Tokens, Greater Scaling: Self-Adaptive Visual Bases for Efficient and Expansive Representation Learning : Abstract: This paper investigates the fundamental relationship between model capacity and the minimal number of visual tokens required to preserve image semantics. Inspired by the Minimum Description ...
- Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning : Abstract: Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive tas...
- VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning : Abstract: By leveraging tool-augmented Multimodal Large Language Models (MLLMs), multi-agent frameworks are driving progress in video understanding. However, most of them adopt static and non-learnabl...
- Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models : Abstract: We propose Perceptual Taxonomy, a structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task-relevant properties such as mater...
- MapRF: Weakly Supervised Online HD Map Construction via NeRF-Guided Self-Training : Abstract: Autonomous driving systems benefit from high-definition (HD) maps that provide critical information about road infrastructure. The online construction of HD maps offers a scalable approach t...
- Vidi2: Large Multimodal Models for Video Understanding and Creation : Abstract: Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve to...
- Proxy-Free Gaussian Splats Deformation with Splat-Based Surface Estimation : Abstract: We introduce SpLap, a proxy-free deformation method for Gaussian splats (GS) based on a Laplacian operator computed from our novel surface-aware splat graph. Existing approaches to GS deform...
- HunyuanOCR Technical Report : Abstract: This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Visio...
- Leveraging Unlabeled Scans for NCCT Image Segmentation in Early Stroke Diagnosis: A Semi-Supervised GAN Approach : Abstract: Ischemic stroke is a time-critical medical emergency where rapid diagnosis is essential for improving patient outcomes. Non-contrast computed tomography (NCCT) serves as the frontline imagin...
- Multiscale Vector-Quantized Variational Autoencoder for Endoscopic Image Synthesis : Abstract: Gastrointestinal (GI) imaging via Wireless Capsule Endoscopy (WCE) generates a large number of images requiring manual screening. Deep learning-based Clinical Decision Support (CDS) systems ...
- SkillSight: Efficient First-Person Skill Assessment with Gaze : Abstract: Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce S...
- On the Utility of Foundation Models for Fast MRI: Vision-Language-Guided Image Reconstruction : Abstract: Purpose: To investigate whether a vision-language foundation model can enhance undersampled MRI reconstruction by providing high-level contextual information beyond conventional priors. Meth...
- Navigating Gigapixel Pathology Images with Large Multimodal Models : Abstract: Despite being widely used to support clinical care, general-purpose large multimodal models (LMMs) have generally shown poor or inconclusive performance in medical image interpretation, part...
- CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization : Abstract: Agentic vision-language models are increasingly trained to "think with images" by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual rea...
- OncoVision: Integrating Mammography and Clinical Data through Attention-Driven Multimodal AI for Enhanced Breast Cancer Diagnosis : Abstract: OncoVision is a multimodal AI pipeline that combines mammography images and clinical data for better breast cancer diagnosis. Employing an attention-based encoder-decoder backbone, it jointl...
- INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models : Abstract: We introduce INTERLACE, a novel framework that prunes redundant layers in VLMs while maintaining performance through sample-efficient finetuning. Existing layer pruning methods lead to signi...
- IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants : Abstract: We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woo...
- CountXplain: Interpretable Cell Counting with Prototype-Based Density Map Estimation : Abstract: Cell counting in biomedical imaging is pivotal for various clinical applications, yet the interpretability of deep learning models in this domain remains a significant challenge. We propose ...
- RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models : Abstract: Open-vocabulary semantic segmentation (OVSS) underpins many vision and robotics tasks that require generalizable semantic understanding. Existing approaches either rely on limited segmentati...
- Rethinking Vision Transformer Depth via Structural Reparameterization : Abstract: The computational overhead of Vision Transformers in practice stems fundamentally from their deep architectures, yet existing acceleration strategies have primarily targeted algorithmic-leve...
- Maritime Small Object Detection from UAVs using Deep Learning with Altitude-Aware Dynamic Tiling : Abstract: Unmanned Aerial Vehicles (UAVs) are crucial in Search and Rescue (SAR) missions due to their ability to monitor vast maritime areas. However, small objects often remain difficult to detect f...
- Efficient Transferable Optimal Transport via Min-Sliced Transport Plans : Abstract: Optimal Transport (OT) offers a powerful framework for finding correspondences between distributions and addressing matching and alignment problems in various areas of computer vision, inclu...
- Leveraging Foundation Models for Histological Grading in Cutaneous Squamous Cell Carcinoma using PathFMTools : Abstract: Despite the promise of computational pathology foundation models, adapting them to specific clinical tasks remains challenging due to the complexity of whole-slide image (WSI) processing, th...
- What You See is (Usually) What You Get: Multimodal Prototype Networks that Abstain from Expensive Modalities : Abstract: Species detection is important for monitoring the health of ecosystems and identifying invasive species, serving a crucial role in guiding conservation efforts. Multimodal neural networks ha...
- Vision-Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation : Abstract: Semi-supervised learning (SSL) has emerged as an effective paradigm for medical image segmentation, reducing the reliance on extensive expert annotations. Meanwhile, vision-language models (...
- A Storage-Efficient Feature for 3D Concrete Defect Segmentation to Replace Normal Vector : Abstract: Point cloud reconstruction of damage offers an effective solution to image-based methods vulnerable to background noise, yet its application is constrained by the high volume of 3D data. Thi...
- Lightweight Transformer Framework for Weakly Supervised Semantic Segmentation : Abstract: Weakly supervised semantic segmentation (WSSS) must learn dense masks from noisy, under-specified cues. We revisit the SegFormer decoder and show that three small, synergistic changes make w...
- Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering : Abstract: Large vision-language models (VLMs) have improved embodied question answering (EQA) agents by providing strong semantic priors for open-vocabulary reasoning. However, when used directly for ...
- One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer : Abstract: We identify a core failure mode that occurs when using the usual linear interpolation on rotary positional embeddings (RoPE) for mixed-resolution denoising with Diffusion Transformers. When ...
- Reading Between the Lines: Abstaining from VLM-Generated OCR Errors via Latent Representation Probes : Abstract: As VLMs are deployed in safety-critical applications, their ability to abstain from answering when uncertain becomes crucial for reliability, especially in Scene Text Visual Question Answeri...
- ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding : Abstract: We present ReDirector, a novel camera-controlled video retake generation method for dynamically captured variable-length videos. In particular, we rectify a common misuse of RoPE in previous...
- Large Language Model Aided Birt-Hogg-Dube Syndrome Diagnosis with Multimodal Retrieval-Augmented Generation : Abstract: Deep learning methods face dual challenges of limited clinical samples and low inter-class differentiation among Diffuse Cystic Lung Diseases (DCLDs) in advancing Birt-Hogg-Dube syndrome (BH...
- Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation : Abstract: Diffusion Transformers dominate video generation, but the quadratic complexity of attention computation introduces substantial latency. Attention sparsity reduces computational costs by focu...
- 4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models : Abstract: World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dy...
- Face, Whole-Person, and Object Classification in a Unified Space Via The Interleaved Multi-Domain Identity Curriculum : Abstract: Vision foundation models can perform generalized object classification in zero-shot mode, and face/person recognition when they are fine-tuned. However, fine-tuned models suffer from catastr...
- DOGE: Differentiable Bezier Graph Optimization for Road Network Extraction : Abstract: Automatic extraction of road networks from aerial imagery is a fundamental task, yet prevailing methods rely on polylines that struggle to model curvilinear geometry. We maintain that road g...
- STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction : Abstract: Reconstructing high-fidelity and animatable 3D head avatars from monocular videos remains a challenging yet essential task. Existing methods based on 3D Gaussian Splatting typically bind Gau...
- Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks : Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress in aligning and generating content across text and image modalities. However, the potential of using non-visual, continuous s...
- GigaWorld-0: World Models as Data Engine to Empower Embodied AI : Abstract: World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a ...
- ChessMamba: Structure-Aware Interleaving of State Spaces for Change Detection in Remote Sensing Images : Abstract: Change detection (CD) in multitemporal remote sensing imagery presents significant challenges for fine-grained recognition, owing to heterogeneity and spatiotemporal misalignment. However, e...
- Distilling Cross-Modal Knowledge via Feature Disentanglement : Abstract: Knowledge distillation (KD) has proven highly effective for compressing large models and enhancing the performance of smaller ones. However, its effectiveness diminishes in cross-modal scena...
- LiMT: A Multi-task Liver Image Benchmark Dataset : Abstract: Computer-aided diagnosis (CAD) technology can assist clinicians in evaluating liver lesions and intervening with treatment in time. Although CAD technology has advanced in recent years, the ...
- VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering : Abstract: Large Vision-Language Models (LVLMs) show promise for scientific applications, yet open-source models still struggle with Scientific Visual Question Answering (SVQA), namely answering questi...
- Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning : Abstract: Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervi...
- MHB: Multimodal Handshape-aware Boundary Detection for Continuous Sign Language Recognition : Abstract: This paper presents a multimodal approach for continuous sign recognition that first uses machine learning to detect the start and end frames of signs in videos of American Sign Language (AS...
- Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance : Abstract: We present Motion Marionette, a zero-shot framework for rigid motion transfer from monocular source videos to single-view target images. Previous works typically employ geometric, generative...
- Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving : Abstract: Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle with achieving efficient inference an...
- Coupled Physics-Gated Adaptation: Spatially Decoding Volumetric Photochemical Conversion in Complex 3D-Printed Objects : Abstract: We present a framework that pioneers the prediction of photochemical conversion in complex three-dimensionally printed objects, introducing a challenging new computer vision task: predicting...
- Scale Where It Matters: Training-Free Localized Scaling for Diffusion Models : Abstract: Diffusion models have become the dominant paradigm in text-to-image generation, and test-time scaling (TTS) further improves quality by allocating more computation during inference. However,...
- HybriDLA: Hybrid Generation for Document Layout Analysis : Abstract: Conventional document layout analysis (DLA) traditionally depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-genera...
- Intelligent Image Search Algorithms Fusing Visual Large Models : Abstract: Fine-grained image retrieval, which aims to find images containing specific object components and assess their detailed states, is critical in fields like security and industrial inspection....
- Context-Aware Token Pruning and Discriminative Selective Attention for Transformer Tracking : Abstract: One-stream Transformer-based trackers have demonstrated remarkable performance by concatenating template and search region tokens, thereby enabling joint attention across all tokens. However...
- Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos : Abstract: Image diffusion models, though originally developed for image generation, implicitly capture rich semantic structures that enable various recognition and localization tasks beyond synthesis....
- Low-Resolution Editing is All You Need for High-Resolution Editing : Abstract: High-resolution content creation is rapidly emerging as a central challenge in both the vision and graphics communities. While images serve as the most fundamental modality for visual expres...
- Supervise Less, See More: Training-free Nuclear Instance Segmentation with Prototype-Guided Prompting : Abstract: Accurate nuclear instance segmentation is a pivotal task in computational pathology, supporting data-driven clinical insights and facilitating downstream translational applications. While la...
- Latent Collaboration in Multi-Agent Systems : Abstract: Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-base...
- MotionV2V: Editing Motion in a Video : Abstract: While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion...
- Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition : Abstract: Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models tha...
- Concept-Aware Batch Sampling Improves Language-Image Pretraining : Abstract: What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i...
- Graph Kernel Neural Networks : Abstract: The convolution operator at the core of many modern neural architectures can effectively be seen as performing a dot product between an input matrix and a filter. While this is readily appli...
- Fast, Sample-Efficient, Affine-Invariant Private Mean and Covariance Estimation for Subgaussian Distributions : Abstract: We present a fast, differentially private algorithm for high-dimensional covariance-aware mean estimation with nearly optimal sample complexity. Only exponential-time estimators were previou...
- MGAS: Multi-Granularity Architecture Search for Trade-Off Between Model Effectiveness and Efficiency : Abstract: Neural architecture search (NAS) has gained significant traction in automating the design of neural networks. To reduce search time, differentiable architecture search (DAS) reframes the tra...
- Multiple-Input Auto-Encoder Guided Feature Selection for IoT Intrusion Detection Systems : Abstract: While intrusion detection systems (IDSs) benefit from the diversity and generalization of IoT data features, the data diversity (e.g., the heterogeneity and high dimensions of data) also mak...
- A Survey on Diffusion Models for Time Series and Spatio-Temporal Data : Abstract: Diffusion models have been widely used in time series and spatio-temporal data, enhancing generative, inferential, and downstream capabilities. These models are applied across diverse fields...
- Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths : Abstract: Sliding-window attention offers a hardware-efficient solution to the memory and throughput challenges of Large Language Models (LLMs) in long-context scenarios. Existing methods typically em...
- LINSCAN -- A Linearity Based Clustering Algorithm : Abstract: DBSCAN and OPTICS are powerful algorithms for identifying clusters of points in domains where few assumptions can be made about the structure of the data. In this paper, we leverage these st...
- TopER: Topological Embeddings in Graph Representation Learning : Abstract: Graph embeddings play a critical role in graph representation learning, allowing machine learning models to explore and interpret graph-structured data. However, existing methods often rely ...
- SCNode: Spatial and Contextual Coordinates for Graph Representation Learning : Abstract: Effective node representation lies at the heart of Graph Neural Networks (GNNs), as it directly impacts their ability to perform downstream tasks such as node classification and link predict...
- Domain Fusion Controllable Generalization for Cross-Domain Time Series Forecasting from Multi-Domain Integrated Distribution : Abstract: Conventional deep models have achieved unprecedented success in time series forecasting. However, facing the challenge of cross-domain generalization, existing studies utilize statistical pr...
- ARBoids: Adaptive Residual Reinforcement Learning With Boids Model for Cooperative Multi-USV Target Defense : Abstract: The target defense problem (TDP) for unmanned surface vehicles (USVs) concerns intercepting an adversarial USV before it breaches a designated target region, using one or more defending USVs...
- AirFed: A Federated Graph-Enhanced Multi-Agent Reinforcement Learning Framework for Multi-UAV Cooperative Mobile Edge Computing : Abstract: Multiple Unmanned Aerial Vehicles (UAVs) cooperative Mobile Edge Computing (MEC) systems face critical challenges in coordinating trajectory planning, task offloading, and resource allocatio...
- Quantum Boltzmann machine learning of ground-state energies : Abstract: Estimating the ground-state energy of Hamiltonians is a fundamental task for which it is believed that quantum computers can be helpful. Several approaches have been proposed toward this goa...
- Deep learning and whole-brain networks for biomarker discovery: modeling the dynamics of brain fluctuations in resting-state and cognitive tasks : Abstract: Background: Brain network models offer insights into brain dynamics, but the utility of model-derived bifurcation parameters as biomarkers remains underexplored. Objective: This study evalua...
- CardioComposer: Leveraging Differentiable Geometry for Compositional Control of Anatomical Diffusion Models : Abstract: Generative models of 3D cardiovascular anatomy can synthesize informative structures for clinical research and medical device evaluation, but face a trade-off between geometric controllabili...
- Efficient Multi-Hop Question Answering over Knowledge Graphs via LLM Planning and Embedding-Guided Search : Abstract: Multi-hop question answering over knowledge graphs remains computationally challenging due to the combinatorial explosion of possible reasoning paths. Recent approaches rely on expensive Lar...
- Can LLMs Faithfully Explain Themselves in Low-Resource Languages? A Case Study on Emotion Detection in Persian : Abstract: Large language models (LLMs) are increasingly used to generate self-explanations alongside their predictions, a practice that raises concerns about the faithfulness of these explanations, es...
- What does it mean to understand language? : Abstract: Language understanding entails not just extracting the surface-level meaning of the linguistic input, but constructing rich mental models of the situation it describes. Here we propose that ...
- Gender Bias in Emotion Recognition by Large Language Models : Abstract: The rapid advancement of large language models (LLMs) and their growing integration into daily life underscore the importance of evaluating and ensuring their fairness. In this work, we exam...
- Breaking Bad: Norms for Valence, Arousal, and Dominance for over 10k English Multiword Expressions : Abstract: Factor analysis studies have shown that the primary dimensions of word meaning are Valence (V), Arousal (A), and Dominance (D). Existing lexicons such as the NRC VAD Lexicon, published in 20...
- Language-Independent Sentiment Labelling with Distant Supervision: A Case Study for English, Sepedi and Setswana : Abstract: Sentiment analysis is a helpful task to automatically analyse opinions and emotions on various topics in areas such as AI for Social Good, AI in Education or marketing. While many of the sen...
- Profile-LLM: Dynamic Profile Optimization for Realistic Personality Expression in LLMs : Abstract: Personalized Large Language Models (LLMs) have been shown to be an effective way to create more engaging and enjoyable user-AI interactions. While previous studies have explored using prompt...
- A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction : Abstract: Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors...
- AppSelectBench: Application-Level Tool Selection Benchmark : Abstract: Computer Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection, whic...
- $\text{R}^2\text{R}$: A Route-to-Rerank Post-Training Framework for Multi-Domain Decoder-Only Rerankers : Abstract: Decoder-only rerankers are central to Retrieval-Augmented Generation (RAG). However, generalist models miss domain-specific nuances in high-stakes fields like finance and law, and naive fine...
- Directional Optimization Asymmetry in Transformers: A Synthetic Stress Test : Abstract: Transformers are theoretically reversal-invariant: their function class does not prefer left-to-right over right-to-left mappings. Yet empirical studies on natural language repeatedly report...
- A Machine Learning Approach for Detection of Mental Health Conditions and Cyberbullying from Social Media : Abstract: Mental health challenges and cyberbullying are increasingly prevalent in digital spaces, necessitating scalable and interpretable detection systems. This paper introduces a unified multiclas...
- Online-PVLM: Advancing Personalized VLMs with Online Concept Learning : Abstract: Personalized Visual Language Models (VLMs) are gaining increasing attention for their formidable ability in user-specific concepts aligned interactions (e.g., identifying a user's bike). Exi...
- MTA: A Merge-then-Adapt Framework for Personalized Large Language Model : Abstract: Personalized Large Language Models (PLLMs) aim to align model outputs with individual user preferences, a crucial capability for user-centric applications. However, the prevalent approach of...
- More Bias, Less Bias: BiasPrompting for Enhanced Multiple-Choice Question Answering : Abstract: With the advancement of large language models (LLMs), their performance on multiple-choice question (MCQ) tasks has improved significantly. However, existing approaches face key limitations:...
- SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space : Abstract: The quadratic complexity of full attention limits efficient long-context processing in large language models (LLMs). Sparse attention mitigates this cost by restricting each query to attend ...
- EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning : Abstract: This study introduces EM2LDL, a novel multilingual speech corpus designed to advance mixed emotion recognition through label distribution learning. Addressing the limitations of predominantl...
- Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach : Abstract: Mispronunciation Detection and Diagnosis (MDD) is crucial for language learning and speech therapy. Unlike conventional methods that require scoring models or training phoneme-level models, ...
- "When Data is Scarce, Prompt Smarter"... Approaches to Grammatical Error Correction in Low-Resource Settings : Abstract: Grammatical error correction (GEC) is an important task in Natural Language Processing that aims to automatically detect and correct grammatical mistakes in text. While recent advances in tr...
- SEDA: A Self-Adapted Entity-Centric Data Augmentation for Boosting Gird-based Discontinuous NER Models : Abstract: Named Entity Recognition (NER) is a critical task in natural language processing, yet it remains particularly challenging for discontinuous entities. The primary difficulty lies in text segm...
- KyrgyzBERT: A Compact, Efficient Language Model for Kyrgyz NLP : Abstract: Kyrgyz remains a low-resource language with limited foundational NLP tools. To address this gap, we introduce KyrgyzBERT, the first publicly available monolingual BERT-based language model f...
- REFLEX: Self-Refining Explainable Fact-Checking via Disentangling Truth into Style and Substance : Abstract: The prevalence of misinformation on social media threatens public trust, demanding automated fact-checking systems that provide accurate verdicts with interpretable explanations. However, ex...
- Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios : Abstract: Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume ...
- Dance Style Classification using Laban-Inspired and Frequency-Domain Motion Features : Abstract: Dance is an essential component of human culture and serves as a tool for conveying emotions and telling stories. Identifying and distinguishing dance genres based on motion data is a comple...
- Modular Deep Learning Framework for Assistive Perception: Gaze, Affect, and Speaker Identification : Abstract: Developing comprehensive assistive technologies requires the seamless integration of visual and auditory perception. This research evaluates the feasibility of a modular architecture inspire...
- InferF: Declarative Factorization of AI/ML Inferences over Joins : Abstract: Real-world AI/ML workflows often apply inference computations to feature vectors joined from multiple datasets. To avoid the redundant AI/ML computations caused by repeated data records in t...
- A Physics-Informed Loss Function for Boundary-Consistent and Robust Artery Segmentation in DSA Sequences : Abstract: Accurate extraction and segmentation of the cerebral arteries from digital subtraction angiography (DSA) sequences is essential for developing reliable clinical management models of complex ...
- Generative Modeling with Manifold Percolation : Abstract: Generative modeling is typically framed as learning mapping rules, but from an observer's perspective without access to these rules, the task manifests as disentangling the geometric support...
- Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models : Abstract: Visual Language Models (VLMs) are powerful generative tools but often produce factually inaccurate outputs due to a lack of robust reasoning capabilities. While extensive research has been c...
- Automated Monitoring of Cultural Heritage Artifacts Using Semantic Segmentation : Abstract: This paper addresses the critical need for automated crack detection in the preservation of cultural heritage through semantic segmentation. We present a comparative study of U-Net architect...
- New York Smells: A Large Multimodal Dataset for Olfaction : Abstract: While olfaction is central to how animals perceive the world, this rich chemical sensory modality remains largely inaccessible to machines. One key bottleneck is the lack of diverse, multimo...
- Spatio-Temporal Hierarchical Causal Models : Abstract: The abundance of fine-grained spatio-temporal data, such as traffic sensor networks, offers vast opportunities for scientific discovery. However, inferring causal relationships from such obs...
- Gated Uncertainty-Aware Runtime Dual Invariants for Neural Signal-Controlled Robotics : Abstract: Safety-critical assistive systems that directly decode user intent from neural signals require rigorous guarantees of reliability and trust. We present GUARDIAN (Gated Uncertainty-Aware Runt...
- PaTAS: A Parallel System for Trust Propagation in Neural Networks Using Subjective Logic : Abstract: Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Conventional evaluation metrics such as accuracy and preci...
- On Evaluating LLM Alignment by Evaluating LLMs as Judges : Abstract: Alignment with human preferences is an important evaluation aspect of LLMs, requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating large language ...
- MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models : Abstract: Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing mu...
- Model-Based Learning of Whittle indices : Abstract: We present BLINQ, a new model-based algorithm that learns the Whittle indices of an indexable, communicating and unichain Markov Decision Process (MDP). Our approach relies on building an em...
- Short-Range Oversquashing : Abstract: Message Passing Neural Networks (MPNNs) are widely used for learning on graphs, but their ability to process long-range information is limited by the phenomenon of oversquashing. This limita...
- Tight Margin-Based Generalization Bounds for Voting Classifiers over Finite Hypothesis Sets : Abstract: We prove the first margin-based generalization bound for voting classifiers, that is asymptotically tight in the tradeoff between the size of the hypothesis set, the margin, the fraction of ...
- Diffusion for Fusion: Designing Stellarators with Generative AI : Abstract: Stellarators are a prospective class of fusion-based power plants that confine a hot plasma with three-dimensional magnetic fields. Typically framed as a PDE-constrained optimization problem...
- Towards Trustworthy Wi-Fi Sensing: Systematic Evaluation of Deep Learning Model Robustness to Adversarial Attacks : Abstract: Machine learning has become integral to Channel State Information (CSI)-based human sensing systems and is expected to power applications such as device-free activity recognition and identit...
- NVIDIA Nemotron Parse 1.1 : Abstract: We introduce Nemotron-Parse-1.1, a lightweight document parsing and OCR model that advances the capabilities of its predecessor, Nemoretriever-Parse-1.0. Nemotron-Parse-1.1 delivers improved...
- Ranking-Enhanced Anomaly Detection Using Active Learning-Assisted Attention Adversarial Dual AutoEncoders : Abstract: Advanced Persistent Threats (APTs) pose a significant challenge in cybersecurity due to their stealthy and long-term nature. Modern supervised learning methods require extensive labeled data...
- MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology : Abstract: Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations pr...
- From One Attack Domain to Another: Contrastive Transfer Learning with Siamese Networks for APT Detection : Abstract: Advanced Persistent Threats (APT) pose a major cybersecurity challenge due to their stealth, persistence, and adaptability. Traditional machine learning detectors struggle with class imbalan...
- DP-MicroAdam: Private and Frugal Algorithm for Training and Fine-tuning : Abstract: Adaptive optimizers are the de facto standard in non-private training as they often enable faster convergence and improved performance. In contrast, differentially private (DP) training is s...
- Adam Simplified: Bias Correction Simplified : Abstract: The Adam optimizer is a cornerstone of modern deep learning, yet the empirical necessity of each of its individual components is often taken for granted. This paper presents a focused invest...
- Feature-Modulated UFNO for Improved Prediction of Multiphase Flow in Porous Media : Abstract: The UNet-enhanced Fourier Neural Operator (UFNO) extends the Fourier Neural Operator (FNO) by incorporating a parallel UNet pathway, enabling the retention of both high- and low-frequency co...
- E2E-GRec: An End-to-End Joint Training Framework for Graph Neural Networks and Recommender Systems : Abstract: Graph Neural Networks (GNNs) have emerged as powerful tools for modeling graph-structured data and have been widely used in recommender systems, such as for capturing complex user-item and i...
- MSTN: Fast and Efficient Multivariate Time Series Model : Abstract: Real-world time-series data is highly non-stationary and complex in dynamics that operate across multiple timescales, ranging from fast, short-term changes to slow, long-term trends. Most ex...
- A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent : Abstract: Adaptive optimizers can reduce to normalized steepest descent (NSD) when only adapting to the current gradient, suggesting a close connection between the two algorithmic families. A key dist...
- Anatomica: Localized Control over Geometric and Topological Properties for Anatomical Diffusion Models : Abstract: We present Anatomica: an inference-time framework for generating multi-class anatomical voxel maps with localized geo-topological control. During generation, we use cuboidal control domains ...
- Attention Trajectories as a Diagnostic Axis for Deep Reinforcement Learning : Abstract: The learning process of a reinforcement learning (RL) agent remains poorly understood beyond the mathematical formulation of its learning algorithm. To address this gap, we introduce attenti...
- Latent Diffusion Inversion Requires Understanding the Latent Space : Abstract: The recovery of training data from generative models ("model inversion") has been extensively studied for diffusion models in the data domain. The encoder/decoder pair and corresponding la...
- BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents : Abstract: The integration of artificial intelligence (AI) agents into web browsers introduces security challenges that go beyond traditional web application threat models. Prior work has identified pr...
- The Driver-Blindness Phenomenon: Why Deep Sequence Models Default to Autocorrelation in Blood Glucose Forecasting : Abstract: Deep sequence models for blood glucose forecasting consistently fail to leverage clinically informative drivers--insulin, meals, and activity--despite well-understood physiological mechanism...
- How to Purchase Labels? A Cost-Effective Approach Using Active Learning Markets : Abstract: We introduce and analyse active learning markets as a way to purchase labels, in situations where analysts aim to acquire additional data to improve model fitting, or to better train models ...
- Adaptive Hopfield Network: Rethinking Similarities in Associative Memory : Abstract: Associative memory models are content-addressable memory systems fundamental to biological intelligence and are notable for their high interpretability. However, existing models evaluate the...
- Sparse-to-Field Reconstruction via Stochastic Neural Dynamic Mode Decomposition : Abstract: Many consequential real-world systems, like wind fields and ocean currents, are dynamic and hard to model. Learning their governing dynamics remains a central challenge in scientific machine...
- Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning : Abstract: The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Pr...
- DiFR: Inference Verification Despite Nondeterminism : Abstract: As demand for LLM inference grows, it is becoming increasingly important that providers and their customers can verify that inference processes are performed correctly, without errors or tam...
- ROOT: Robust Orthogonalized Optimizer for Neural Network Training : Abstract: The optimization of large language models (LLMs) remains a critical challenge, particularly as model scaling exacerbates sensitivity to algorithmic imprecision and training instability. Rece...
- Image2Gcode: Image-to-G-code Generation for Additive Manufacturing Using Diffusion-Transformer Model : Abstract: Mechanical design and manufacturing workflows conventionally begin with conceptual design, followed by the creation of a computer-aided design (CAD) model and fabrication through material-ex...
- Temperature in SLMs: Impact on Incident Categorization in On-Premises Environments : Abstract: SOCs and CSIRTs face increasing pressure to automate incident categorization, yet the use of cloud-based LLMs introduces costs, latency, and confidentiality risks. We investigate whether loc...
- SG-OIF: A Stability-Guided Online Influence Framework for Reliable Vision Data : Abstract: Approximating training-point influence on test predictions is critical for deploying deep-learning vision models, essential for locating noisy data. Though the influence function was propose...
- Towards a future space-based, highly scalable AI infrastructure system design : Abstract: If AI is a foundational general-purpose technology, we should anticipate that demand for AI compute -- and energy -- will continue to grow. The Sun is by far the largest energy source in our...
- FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection : Abstract: Coreset selection compresses large datasets into compact, representative subsets, reducing the energy and computational burden of training deep neural networks. Existing methods are either: ...
- A Multi-Stage Deep Learning Framework with PKCP-MixUp Augmentation for Pediatric Liver Tumor Diagnosis Using Multi-Phase Contrast-Enhanced CT : Abstract: Pediatric liver tumors are one of the most common solid tumors in pediatrics, with differentiation of benign or malignant status and pathological classification critical for clinical treatme...
- Federated Learning Framework for Scalable AI in Heterogeneous HPC and Cloud Environments : Abstract: As the demand grows for scalable and privacy-aware AI systems, Federated Learning (FL) has emerged as a promising solution, allowing decentralized model training without moving raw data. At ...
- stable-pretraining-v1: Foundation Model Research Made Simple : Abstract: Foundation models and self-supervised learning (SSL) have become central to modern AI, yet research in this area remains hindered by complex codebases, redundant re-implementations, and the ...
- CycleChemist: A Dual-Pronged Machine Learning Framework for Organic Photovoltaic Discovery : Abstract: Organic photovoltaic (OPV) materials offer a promising path toward sustainable energy generation, but their development is limited by the difficulty of identifying high performance donor and...
- Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning : Abstract: Recent advances in vision-language models (VLMs) have shown remarkable performance across multimodal tasks, yet their ever-growing scale poses severe challenges for deployment and efficiency...
- Blinking Beyond EAR: A Stable Eyelid Angle Metric for Driver Drowsiness Detection and Data Augmentation : Abstract: Detecting driver drowsiness reliably is crucial for enhancing road safety and supporting advanced driver assistance systems (ADAS). We introduce the Eyelid Angle (ELA), a novel, reproducible...
- Masked Autoencoder Joint Learning for Robust Spitzoid Tumor Classification : Abstract: Accurate diagnosis of spitzoid tumors (ST) is critical to ensure a favorable prognosis and to avoid both under- and over-treatment. Epigenetic data, particularly DNA methylation, provide a v...
- Cross-Domain Generalization of Multimodal LLMs for Global Photovoltaic Assessment : Abstract: The rapid expansion of distributed photovoltaic (PV) systems poses challenges for power grid management, as many installations remain undocumented. While satellite imagery provides global co...
- Think First, Assign Next (ThiFAN-VQA): A Two-stage Chain-of-Thought Framework for Post-Disaster Damage Assessment : Abstract: Timely and accurate assessment of damages following natural disasters is essential for effective emergency response and recovery. Recent AI-based frameworks have been developed to analyze la...
- SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models : Abstract: Text-to-image diffusion models can emit copyrighted, unsafe, or private content. Safety alignment aims to suppress specific concepts, yet evaluations seldom test whether safety persists unde...
- Optimization and Regularization Under Arbitrary Objectives : Abstract: This study investigates the limitations of applying Markov Chain Monte Carlo (MCMC) methods to arbitrary objective functions, focusing on a two-block MCMC framework which alternates between ...
- Agint: Agentic Graph Compilation for Software Engineering Agents : Abstract: LLM-based coding agents are increasingly common but still face challenges in context management, latency, reliability, reproducibility, and scalability. We present Agint, an agentic graph co...
- Synthetic Data: AI's New Weapon Against Android Malware : Abstract: The ever-increasing number of Android devices and the accelerated evolution of malware, reaching over 35 million samples by 2024, highlight the critical importance of effective detection met...
- The Alexander-Hirschowitz theorem for neurovarieties : Abstract: We study neurovarieties for polynomial neural networks and fully characterize when they attain the expected dimension in the single-output case. As consequences, we establish non-defectivene...
- Designing Preconditioners for SGD: Local Conditioning, Noise Floors, and Basin Stability : Abstract: Stochastic Gradient Descent (SGD) often slows in the late stage of training due to anisotropic curvature and gradient noise. We analyze preconditioned SGD in the geometry induced by a symmet...
- Large Scale Community-Aware Network Generation : Abstract: Community detection, or network clustering, is used to identify latent community structure in networks. Due to the scarcity of labeled ground truth in real-world networks, evaluating these a...
- Individual and group fairness in geographical partitioning : Abstract: Socioeconomic segregation often arises in school districting and other contexts, causing some groups to be over- or under-represented within a particular district. This phenomenon is closely...
- An Adaptive, Data-Integrated Agent-Based Modeling Framework for Explainable and Contestable Policy Design : Abstract: Multi-agent systems often operate under feedback, adaptation, and non-stationarity, yet many simulation studies retain static decision rules and fixed control parameters. This paper introduc...
- Integrating RCTs, RWD, AI/ML and Statistics: Next-Generation Evidence Synthesis : Abstract: Randomized controlled trials (RCTs) have been the cornerstone of clinical evidence; however, their cost, duration, and restrictive eligibility criteria limit power and external validity. Stu...
- Comparative Analysis of LoRA-Adapted Embedding Models for Clinical Cardiology Text Representation : Abstract: Domain-specific text embeddings are critical for clinical natural language processing, yet systematic comparisons across model architectures remain limited. This study evaluates ten transfor...
- CAMformer: Associative Memory is All You Need : Abstract: Transformers face scalability challenges due to the quadratic cost of attention, which involves dense similarity computations between queries and keys. We propose CAMformer, a novel accelera...
- Clustering Approaches for Mixed-Type Data: A Comparative Study : Abstract: Clustering is widely used in unsupervised learning to find homogeneous groups of observations within a dataset. However, clustering mixed-type data remains a challenge, as few existing appro...
- KOM: A Multi-Agent Artificial Intelligence System for Precision Management of Knee Osteoarthritis (KOA) : Abstract: Knee osteoarthritis (KOA) affects more than 600 million individuals globally and is associated with significant pain, functional impairment, and disability. While personalized multidisciplin...
- Latent-space metrics for Complex-Valued VAE out-of-distribution detection under radar clutter : Abstract: We investigate complex-valued Variational AutoEncoders (CVAE) for radar Out-Of-Distribution (OOD) detection in complex radar environments. We proposed several detection metrics: the reconstr...
- Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization : Abstract: Image diversity remains a fundamental challenge for text-to-image diffusion models. Low-diversity models tend to generate repetitive outputs, increasing sampling redundancy and hindering bot...
- CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception : Abstract: Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and ...
- Reinforcement Learning with $\omega$-Regular Objectives and Constraints : Abstract: Reinforcement learning (RL) commonly relies on scalar rewards with limited ability to express temporal, conditional, or safety-critical goals, and can lead to reward hacking. Temporal logic ...
- Cross-LLM Generalization of Behavioral Backdoor Detection in AI Agent Supply Chains : Abstract: As AI agents become integral to enterprise workflows, their reliance on shared tool libraries and pre-trained components creates significant supply chain vulnerabilities. While previous work...
- It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models : Abstract: Depression is one of the most prevalent mental health disorders globally. In recent years, multi-modal data, such as speech, video, and transcripts, has been increasingly used to develop AI-...
- MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization : Abstract: Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naive fine-tuning often disrupts these representations and harms generalization. ...
- Learning Degenerate Manifolds of Frustrated Magnets with Boltzmann Machines : Abstract: We show that Restricted Boltzmann Machines (RBMs) provide a flexible generative framework for modeling spin configurations in disordered yet strongly correlated phases of frustrated magnets....
- Complex Instruction Following with Diverse Style Policies in Football Games : Abstract: Despite advancements in language-controlled reinforcement learning (LC-RL) for basic domains and straightforward commands (e.g., object manipulation and navigation), effectively extending LC...
- Designing Reputation Systems for Manufacturing Data Trading Markets: A Multi-Agent Evaluation with Q-Learning and IRL-Estimated Utilities : Abstract: Recent advances in machine learning and big data analytics have intensified the demand for high-quality cross-domain datasets and accelerated the growth of data trading across organizations....
- AI/ML based Joint Source and Channel Coding for HARQ-ACK Payload : Abstract: Channel coding from 2G to 5G has assumed the input bits at the physical layer to be uniformly distributed. However, hybrid automatic repeat request acknowledgement (HARQ-ACK) bits transmitt...
- Softmax Transformers are Turing-Complete : Abstract: Hard attention Chain-of-Thought (CoT) transformers are known to be Turing-complete. However, it is an open problem whether softmax attention Chain-of-Thought (CoT) transformers are Turing-co...
- MFM-point: Multi-scale Flow Matching for Point Cloud Generation : Abstract: In recent years, point cloud generation has gained significant attention in 3D generative modeling. Among existing approaches, point-based methods directly generate point clouds without rely...
- Reducing Latency of LLM Search Agent via Speculation-based Algorithm-System Co-Design : Abstract: LLM-based search agents achieve strong performance but suffer from severe latency, as each step requires serialized LLM reasoning followed by action of tool execution. We revisit this bottle...
- From data to concepts via wiring diagrams : Abstract: A wiring diagram is a labeled directed graph that represents an abstract concept such as a temporal process. In this article, we introduce the notion of a quasi-skeleton wiring diagram graph...
- CostNav: A Navigation Benchmark for Cost-Aware Evaluation of Embodied Agents : Abstract: Existing navigation benchmarks focus on task success metrics while overlooking economic viability -- critical for commercial deployment of autonomous delivery robots. We introduce Cost...
- Actionable and diverse counterfactual explanations incorporating domain knowledge and causal constraints : Abstract: Counterfactual explanations enhance the actionable interpretability of machine learning models by identifying the minimal changes required to achieve a desired outcome of the model. However,...
- Quantum-Enhanced Reinforcement Learning for Accelerating Newton-Raphson Convergence with Ising Machines: A Case Study for Power Flow Analysis : Abstract: The Newton-Raphson (NR) method is widely used for solving power flow (PF) equations due to its quadratic convergence. However, its performance deteriorates under poor initialization or extre...
- Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation : Abstract: Obtaining the precise 3D motion of a table tennis ball from standard monocular videos is a challenging problem, as existing methods trained on synthetic data struggle to generalize to the no...
- Modality-Balanced Collaborative Distillation for Multi-Modal Domain Generalization : Abstract: Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution p...
- Solving Heterogeneous Agent Models with Physics-informed Neural Networks : Abstract: Understanding household behaviour is essential for modelling macroeconomic dynamics and designing effective policy. While heterogeneous agent models offer a more realistic alternative to rep...
- Forgetting by Pruning: Data Deletion in Join Cardinality Estimation : Abstract: Machine unlearning in learned cardinality estimation (CE) systems presents unique challenges due to the complex distributional dependencies in multi-table relational data. Specifically, data...
- NNGPT: Rethinking AutoML with Large Language Models : Abstract: Building self-improving AI systems remains a fundamental challenge in the AI domain. We present NNGPT, an open-source framework that turns a large language model (LLM) into a self-improving ...
- Extension and neural operator approximation of the electrical impedance tomography inverse map : Abstract: This paper considers the problem of noise-robust neural operator approximation for the solution map of Calderón's inverse conductivity problem. In this continuum model of electrical impedanc...
- Differentiable Attenuation Filters for Feedback Delay Networks : Abstract: We introduce a novel method for designing attenuation filters in digital audio reverberation systems based on Feedback Delay Networks (FDNs). Our approach uses Second Order Sections (SOS) of...
- StableTrack: Stabilizing Multi-Object Tracking on Low-Frequency Detections : Abstract: Multi-object tracking (MOT) is one of the most challenging tasks in computer vision, where it is important to correctly detect objects and associate these detections across frames. Current a...
- A Fully Probabilistic Tensor Network for Regularized Volterra System Identification : Abstract: Modeling nonlinear systems with Volterra series is challenging because the number of kernel coefficients grows exponentially with the model order. This work introduces Bayesian Tensor Networ...
- STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow : Abstract: Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the ...
- Exploiting the Experts: Unauthorized Compression in MoE-LLMs : Abstract: Mixture-of-Experts (MoE) architectures are increasingly adopted in large language models (LLMs) for their scalability and efficiency. However, their modular structure introduces a unique vul...
- Quality analysis and evaluation prediction of RAG retrieval based on machine learning algorithms : Abstract: With the rapid evolution of large language models, retrieval enhanced generation technology has been widely used due to its ability to integrate external knowledge to improve output accuracy...
- OmniTFT: Omni Target Forecasting for Vital Signs and Laboratory Result Trajectories in Multi Center ICU Data : Abstract: Accurate multivariate time-series prediction of vital signs and laboratory results is crucial for early intervention and precision medicine in intensive care units (ICUs). However, vital sig...
- Efficient Inference Using Large Language Models with Limited Human Data: Fine-Tuning then Rectification : Abstract: Driven by recent advances in artificial intelligence (AI), a growing body of work demonstrates the potential of using large language models (LLMs) to generate human-like responses in market ...
- The Generalized Proximity Forest : Abstract: Recent work has demonstrated the utility of Random Forest (RF) proximities for various supervised machine learning tasks, including outlier detection, missing data imputation, and visualizat...
- Generative Model-Aided Continual Learning for CSI Feedback in FDD mMIMO-OFDM Systems : Abstract: Deep autoencoder (DAE) frameworks have demonstrated their effectiveness in reducing channel state information (CSI) feedback overhead in massive multiple-input multiple-output (mMIMO) orthog...
- OpenCML: End-to-End Framework of Open-world Machine Learning to Learn Unknown Classes Incrementally : Abstract: Open-world machine learning is an emerging technique in artificial intelligence, where conventional machine learning models often follow closed-world assumptions, which can hinder their abil...
- RFX: High-Performance Random Forests with GPU Acceleration and QLORA Compression : Abstract: RFX (Random Forests X), where X stands for compression or quantization, presents a production-ready implementation of Breiman and Cutler's Random Forest classification methodology in Python....
- A Systematic Study of Compression Ordering for Large Language Models : Abstract: Large Language Models (LLMs) require substantial computational resources, making model compression essential for efficient deployment in constrained environments. Among the dominant compress...
- Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM : Abstract: Large language models deliver strong reasoning and tool-use skills, yet their computational demands make them impractical for edge or cost-sensitive deployments. We present Xmodel-2....
- PeriodNet: Boosting the Potential of Attention Mechanism for Time Series Forecasting : Abstract: The attention mechanism has demonstrated remarkable potential in sequence modeling, exemplified by its successful application in natural language processing with models such as Bidirectional...
- Hierarchical Dual-Strategy Unlearning for Biomedical and Healthcare Intelligence Using Imperfect and Privacy-Sensitive Medical Data : Abstract: Large language models (LLMs) exhibit exceptional performance but pose substantial privacy risks due to training data memorization, particularly within healthcare contexts involving imperfect...
- Beyond Binary Classification: A Semi-supervised Approach to Generalized AI-generated Image Detection : Abstract: The rapid advancement of generators (e.g., StyleGAN, Midjourney, DALL-E) has produced highly realistic synthetic images, posing significant challenges to digital media authenticity. These ge...
- Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma : Abstract: Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scal...
- Profile Generators: A Link between the Narrative and the Binary Matrix Representation : Abstract: Mental health disorders, particularly cognitive disorders defined by deficits in cognitive abilities, are described in detail in the DSM-5, which includes definitions and examples of signs a...
- TouchFormer: A Robust Transformer-based Framework for Multimodal Material Perception : Abstract: Traditional vision-based material perception methods often experience substantial performance degradation under visually impaired conditions, thereby motivating the shift toward non-visual m...
- Row-stochastic matrices can provably outperform doubly stochastic matrices in decentralized learning : Abstract: Decentralized learning often involves a weighted global loss with heterogeneous node weights $λ$. We revisit two natural strategies for incorporating these weights: (i) embedding them into t...
- Automating Deception: Scalable Multi-Turn LLM Jailbreaks : Abstract: Multi-turn conversational attacks, which leverage psychological principles like Foot-in-the-Door (FITD), where a small initial request paves the way for a more significant one, to bypass saf...
- Shortcut Invariance: Targeted Jacobian Regularization in Disentangled Latent Space : Abstract: Deep neural networks are prone to learning shortcuts, spurious and easily learned correlations in training data that cause severe failures in out-of-distribution (OOD) generalization. A domi...
- Learning to Solve Weighted Maximum Satisfiability with a Co-Training Architecture : Abstract: We propose SplitGNN, a graph neural network (GNN)-based approach that learns to solve the weighted maximum satisfiability (MaxSAT) problem. SplitGNN incorporates a co-training architecture c...
- When Should Neural Data Inform Welfare? A Critical Framework for Policy Uses of Neuroeconomics : Abstract: Neuroeconomics promises to ground welfare analysis in neural and computational evidence about how people value outcomes, learn from experience and exercise self-control. At the same time, po...
- Online Sparse Feature Selection in Data Streams via Differential Evolution : Abstract: The processing of high-dimensional streaming data commonly utilizes online streaming feature selection (OSFS) techniques. However, practical implementations often face challenges with data i...
- Merging without Forgetting: Continual Fusion of Task-Specific Models via Optimal Transport : Abstract: Merging models fine-tuned for different tasks into a single unified model has become an increasingly important direction for building versatile, efficient multi-task systems. Existing approa...
- ModHiFi: Identifying High Fidelity predictive components for Model Modification : Abstract: Open weight models, which are ubiquitous, rarely provide access to their training data or loss function. This makes modifying such models for tasks such as pruning or unlearning constrained ...
- An Invariant Latent Space Perspective on Language Model Inversion : Abstract: Language model inversion (LMI), i.e., recovering hidden prompts from outputs, emerges as a concrete threat to user privacy and system security. We recast LMI as reusing the LLM's own latent ...
- Neural Tractability via Structure: Learning-Augmented Algorithms for Graph Combinatorial Optimization : Abstract: Neural models have shown promise in solving NP-hard graph combinatorial optimization (CO) problems. Once trained, they offer fast inference and reasonably high-quality solutions for in-distr...
- Learning Massively Multitask World Models for Continuous Control : Abstract: General-purpose control demands agents that act across many tasks and embodiments, yet research on reinforcement learning (RL) for continuous control remains dominated by single-task or offl...
- Many Ways to be Right: Rashomon Sets for Concept-Based Neural Networks : Abstract: Modern neural networks rarely have a single way to be right. For many tasks, multiple models can achieve identical performance while relying on different features or reasoning patterns, a pr...
- Lower Complexity Bounds for Nonconvex-Strongly-Convex Bilevel Optimization with First-Order Oracles : Abstract: Although upper bound guarantees for bilevel optimization have been widely studied, progress on lower bounds has been limited due to the complexity of the bilevel structure. In this work, we ...
- Structured Noise Modeling for Enhanced Time-Series Forecasting : Abstract: Time-series forecasting remains difficult in real-world settings because temporal patterns operate at multiple scales, from broad contextual trends to fast, fine-grained fluctuations that dr...
- Demystifying Diffusion Objectives: Reweighted Losses are Better Variational Bounds : Abstract: We derive a new theoretical interpretation of the reweighted losses that are widely used for training diffusion models. Our method is based on constructing a cascade of time-dependent variat...
- TREASURE: A Transformer-Based Foundation Model for High-Volume Transaction Understanding : Abstract: Payment networks form the backbone of modern commerce, generating high volumes of transaction records from daily activities. Properly modeling this data can enable applications such as abnor...
- TiCT: A Synthetically Pre-Trained Foundation Model for Time Series Classification : Abstract: The ubiquity of time series data creates a strong demand for general-purpose foundation models, yet developing them for classification remains a significant challenge, largely due to the hig...
- CafeQ: Calibration-free Quantization via Learned Transformations and Adaptive Rounding : Abstract: Post-training quantization is an effective method for reducing the serving cost of large language models, where the standard approach is to use a round-to-nearest quantization level scheme. ...
- Training-Free Active Learning Framework in Materials Science with Large Language Models : Abstract: Active learning (AL) accelerates scientific discovery by prioritizing the most informative experiments, but traditional machine learning (ML) models used in AL suffer from cold-start limitat...
- DISCO: A Browser-Based Privacy-Preserving Framework for Distributed Collaborative Learning : Abstract: Data is often impractical to share for a range of well considered reasons, such as concerns over privacy, intellectual property, and legal constraints. This not only fragments the statistica...
- When +1% Is Not Enough: A Paired Bootstrap Protocol for Evaluating Small Improvements : Abstract: Recent machine learning papers often report 1-2 percentage point improvements from a single run on a benchmark. These gains are highly sensitive to random seeds, data ordering, and implement...
- Terminal Velocity Matching : Abstract: We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two dif...
- Scalable Data Attribution via Forward-Only Test-Time Inference : Abstract: Data attribution seeks to trace model behavior back to the training examples that shaped it, enabling debugging, auditing, and data valuation at scale. Classical influence-function methods o...
- Learning to Clean: Reinforcement Learning for Noisy Label Correction : Abstract: The challenge of learning with noisy labels is significant in machine learning, as it can severely degrade the performance of prediction models if not addressed properly. This paper introduc...
- Provably Outlier-resistant Semi-parametric Regression for Transferable Calibration of Low-cost Air-quality Sensors : Abstract: We present a case study for the calibration of Low-cost air-quality (LCAQ) CO sensors from one of the largest multi-site-multi-season-multi-sensor-multi-pollutant mobile air-quality monitori...
- Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models : Abstract: Sparse Mixture-of-Experts (SMoE) architectures have enabled a new frontier in scaling Large Language Models (LLMs), offering superior performance by activating only a fraction of their total...
- GED-Consistent Disentanglement of Aligned and Unaligned Substructures for Graph Similarity Learning : Abstract: Graph Similarity Computation (GSC) is a fundamental graph related task where Graph Edit Distance (GED) serves as a prevalent metric. GED is determined by an optimal alignment between a pair ...
- Cisco Time Series Model Technical Report : Abstract: We introduce the Cisco Time Series Model, a univariate zero-shot forecaster. This time series foundation model is the result of a general architectural innovation to a time series model enab...
- SX-GeoTree: Self-eXplaining Geospatial Regression Tree Incorporating the Spatial Similarity of Feature Attributions : Abstract: Decision trees remain central for tabular prediction but struggle with (i) capturing spatial dependence and (ii) producing locally stable (robust) explanations. We present SX-GeoTree, a self...
- Accelerating Wireless Distributed Learning via Hybrid Split and Federated Learning Optimization : Abstract: Federated learning (FL) and split learning (SL) are two effective distributed learning paradigms in wireless networks, enabling collaborative model training across mobile devices without sha...
- Frailty-Aware Transformer for Recurrent Survival Modeling of Driver Retention in Ride-Hailing Platforms : Abstract: Ride-hailing platforms are characterized by high-frequency, behavior-driven environments. Although survival analysis has been applied to recurrent events in other domains, its use in modelin...
- EfficientXpert: Efficient Domain Adaptation for Large Language Models via Propagation-Aware Pruning : Abstract: The rapid advancement of large language models (LLMs) has increased the demand for domain-specialized variants in areas such as law, healthcare, and finance. However, their large size remain...
- Adaptivity and Universality: Problem-dependent Universal Regret for Online Convex Optimization : Abstract: Universal online learning aims to achieve optimal regret guarantees without requiring prior knowledge of the curvature of online functions. Existing methods have established minimax-optimal ...
- Optimize Flip Angle Schedules In MR Fingerprinting Using Reinforcement Learning : Abstract: Magnetic Resonance Fingerprinting (MRF) leverages transient-state signal dynamics generated by the tunable acquisition parameters, making the design of an optimal, robust sequence a complex,...
- Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning : Abstract: It is widely recognized that reinforcement learning (RL) fine-tuning of large language models often leads to diversity collapse, where outputs lack variety. Prior work has proposed ...
- Hierarchical Spatio-Temporal Attention Network with Adaptive Risk-Aware Decision for Forward Collision Warning in Complex Scenarios : Abstract: Forward Collision Warning systems are crucial for vehicle safety and autonomous driving, yet current methods often fail to balance precise multi-agent interaction modeling with real-time dec...
- Prompt Fairness: Sub-group Disparities in LLMs : Abstract: Large Language Models (LLMs), though shown to be effective in many applications, can vary significantly in their response quality. In this paper, we investigate this problem of prompt fairne...
- ParaBlock: Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models : Abstract: Federated learning (FL) has been extensively studied as a privacy-preserving training paradigm. Recently, federated block coordinate descent scheme has become a popular option in training la...
- Stragglers Can Contribute More: Uncertainty-Aware Distillation for Asynchronous Federated Learning : Abstract: Asynchronous federated learning (FL) has recently gained attention for its enhanced efficiency and scalability, enabling local clients to send model updates to the server at their own pace w...
- Rethinking Semi-Supervised Node Classification with Self-Supervised Graph Clustering : Abstract: The emergence of graph neural networks (GNNs) has offered a powerful tool for semi-supervised node classification tasks. Subsequent studies have achieved further improvements through refinin...
- Operator Learning at Machine Precision : Abstract: Neural operator learning methods have garnered significant attention in scientific computing for their ability to approximate infinite-dimensional operators. However, increasing their comple...
- Rethinking Message Passing Neural Networks with Diffusion Distance-guided Stress Majorization : Abstract: Message passing neural networks (MPNNs) have emerged as go-to models for learning on graph-structured data in the past decade. Despite their effectiveness, most of such models still incur se...
- On-Demand Multi-Task Sparsity for Efficient Large-Model Deployment on Edge Devices : Abstract: Sparsity is essential for deploying large models on resource constrained edge platforms. However, optimizing sparsity patterns for individual tasks in isolation ignores the significant I/O o...
- RankOOD - Class Ranking-based Out-of-Distribution Detection : Abstract: We propose RankOOD, a rank-based Out-of-Distribution (OOD) detection approach based on training a model with the Plackett-Luce loss, which is now extensively used for preference alignment tas...
- REWA: Witness-Overlap Theory -- Foundations for Composable Binary Similarity Systems : Abstract: REWA introduces a general theory of similarity based on witness-overlap structures. We show that whenever similarity between concepts can be expressed as monotone witness overlap -- whether ...
- Zero-Shot Transfer Capabilities of the Sundial Foundation Model for Leaf Area Index Forecasting : Abstract: This work investigates the zero-shot forecasting capability of time-series foundation models for Leaf Area Index (LAI) forecasting in agricultural monitoring. Using the HiQ dataset (U.S., 20...
- iRadioDiff: Physics-Informed Diffusion Model for Indoor Radio Map Construction and Localization : Abstract: Radio maps (RMs) serve as environment-aware electromagnetic (EM) representations that connect scenario geometry and material properties to the spatial distribution of signal strength, enabli...
- Cross-Contrastive Clustering for Multimodal Attributed Graphs with Dual Graph Filtering : Abstract: Multimodal Attributed Graphs (MMAGs) are an expressive data model for representing the complex interconnections among entities that associate attributes from multiple data modalities (text, ...
- RED-F: Reconstruction-Elimination based Dual-stream Contrastive Forecasting for Multivariate Time Series Anomaly Prediction : Abstract: The proactive prediction of anomalies (AP) in multivariate time series (MTS) is a critical challenge to ensure system dependability. The difficulty lies in identifying subtle anomaly precurs...
- SOMBRL: Scalable and Optimistic Model-Based RL : Abstract: We address the challenge of efficient exploration in model-based reinforcement learning (MBRL), where the system dynamics are unknown and the RL agent must learn directly from online interac...
- QiMeng-CRUX: Narrowing the Gap between Natural Language and Verilog via Core Refined Understanding eXpression : Abstract: Large language models (LLMs) have shown promising capabilities in hardware description language (HDL) generation. However, existing approaches often rely on free-form natural language descri...
- The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs : Abstract: Prior work has shown that fine-tuning models on a narrow domain with misaligned data can lead to broad misalignment - a phenomenon termed "emergent misalignment" (Betley et al. 2025). While ...
- Multivariate Forecasting of Bitcoin Volatility with Gradient Boosting: Deterministic, Probabilistic, and Feature Importance Perspectives : Abstract: This study investigates the application of the Light Gradient Boosting Machine (LGBM) model for both deterministic and probabilistic forecasting of Bitcoin realized volatility. Utilizing a c...
- CLIMATEAGENT: Multi-Agent Orchestration for Complex Climate Data Science Workflows : Abstract: Climate science demands automated workflows to transform comprehensive questions into data-driven statements across massive, heterogeneous datasets. However, generic LLM agents and static sc...
- IDAP++: Advancing Divergence-Based Pruning via Filter-Level and Layer-Level Optimization : Abstract: This paper presents a novel approach to neural network compression that addresses redundancy at both the filter and architectural levels through a unified framework grounded in information f...
- On the Limits of Momentum in Decentralized and Federated Optimization : Abstract: Recent works have explored the use of momentum in local methods to enhance distributed SGD. This is particularly appealing in Federated Learning (FL), where momentum intuitively appears as a...
- AdaCap: An Adaptive Contrastive Approach for Small-Data Neural Networks : Abstract: Neural networks struggle on small tabular datasets, where tree-based models remain dominant. We introduce Adaptive Contrastive Approach (AdaCap), a training scheme that combines a permutatio...
- Learning Subgroups with Maximum Treatment Effects without Causal Heuristics : Abstract: Discovering subgroups with the maximum average treatment effect is crucial for targeted decision making in domains such as precision medicine, public policy, and education. While most prior ...
- In-Context Compositional Learning via Sparse Coding Transformer : Abstract: Transformer architectures have achieved remarkable success across language, vision, and multimodal tasks, and there is growing demand for them to address in-context compositional learning ta...
- Communication-Efficient Learning for Satellite Constellations : Abstract: Satellite constellations in low-Earth orbit are now widespread, enabling positioning, Earth imaging, and communications. In this paper we address the solution of learning problems using thes...
- Decoupling and Damping: Structurally-Regularized Gradient Matching for Multimodal Graph Condensation : Abstract: In critical web applications such as e-commerce and recommendation systems, multimodal graphs integrating rich visual and textual attributes are increasingly central, yet their large scale i...
- DiCaP: Distribution-Calibrated Pseudo-labeling for Semi-Supervised Multi-Label Learning : Abstract: Semi-supervised multi-label learning (SSMLL) aims to address the challenge of limited labeled data in multi-label learning (MLL) by leveraging unlabeled data to improve the model's performan...
- Leveraging weights signals - Predicting and improving generalizability in reinforcement learning : Abstract: Generalizability of Reinforcement Learning (RL) agents (ability to perform on environments different from the ones they have been trained on) is a key problem as agents have the tendency to ...
- Interpretable Air Pollution Forecasting by Physics-Guided Spatiotemporal Decoupling : Abstract: Accurate and interpretable air pollution forecasting is crucial for public health, but most models face a trade-off between performance and interpretability. This study proposes a physics-gu...
- Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits : Abstract: Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically ...
- HVAdam: A Full-Dimension Adaptive Optimizer : Abstract: Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-ada...
- Geometry of Decision Making in Language Models : Abstract: Large Language Models (LLMs) show strong generalization across diverse tasks, yet the internal decision-making processes behind their predictions remain opaque. In this work, we study the ge...
- MXtalTools: A Toolkit for Machine Learning on Molecular Crystals : Abstract: We present MXtalTools, a flexible Python package for the data-driven modelling of molecular crystals, facilitating machine learning studies of the molecular solid state. MXtalTools comprises...
- Soft Adaptive Policy Optimization : Abstract: Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remain...
- Complexity Reduction Study Based on RD Costs Approximation for VVC Intra Partitioning : Abstract: In this paper, a complexity study is conducted for Versatile Video Codec (VVC) intra partitioning to accelerate the exhaustive search involved in Rate-Distortion Optimization (RDO) process. ...
- PRISM: Periodic Representation with multIscale and Similarity graph Modelling for enhanced crystal structure property prediction : Abstract: Crystal structures are characterised by repeating atomic patterns within unit cells across three-dimensional space, posing unique challenges for graph-based representation learning. Current ...
- MoRE: Batch-Robust Multi-Omics Representations from Frozen Pre-trained Transformers : Abstract: Representation learning on multi-omics data is challenging due to extreme dimensionality, modality heterogeneity, and cohort-specific batch effects. While pre-trained transformer backbones h...
- Identifying environmental factors associated with tetrodotoxin contamination in bivalve mollusks using eXplainable AI : Abstract: Since 2012, tetrodotoxin (TTX) has been found in seafoods such as bivalve mollusks in temperate European waters. TTX contamination leads to food safety risks and economic losses, making earl...
- Hidden markov model to predict tourists visited place : Abstract: Nowadays, social networks are becoming a popular way of analyzing tourist behavior, thanks to the digital traces left by travelers during their stays on these networks. The massive amount of...
- Quantifying Modality Contributions via Disentangling Multimodal Representations : Abstract: Quantifying modality contributions in multimodal models remains a challenge, as existing approaches conflate the notion of contribution itself. Prior work relies on accuracy-based approaches...
- PrefixGPT: Prefix Adder Optimization by a Generative Pre-trained Transformer : Abstract: Prefix adders are widely used in compute-intensive applications for their high speed. However, designing optimized prefix adders is challenging due to strict design rules and an exponentiall...
- WavefrontDiffusion: Dynamic Decoding Schedule for Improved Reasoning : Abstract: Diffusion Language Models (DLMs) have shown strong potential for text generation and are becoming a competitive alternative to autoregressive models. The denoising strategy plays an importan...
Research Sources: 435 | Generated: 11/26/2025
