AI Research News Feeds for October 28th, 2025

AI RESEARCH PAPERS & ACADEMIC SOURCES

ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning : Abstract: Prior work using Masked Autoencoders (MAEs) typically relies on random patch masking based on the assumption that images have significant redundancies across different channels, allowing for...
ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation : Abstract: We introduce ORIGEN, the first zero-shot method for 3D orientation grounding in text-to-image generation across multiple objects and diverse categories. While previous work on spatial ground...
Segment then Splat: Unified 3D Open-Vocabulary Segmentation via Gaussian Splatting : Abstract: Open-vocabulary querying in 3D space is crucial for enabling more intelligent perception in applications such as robotics, autonomous systems, and augmented reality. However, most existing m...
DERD-Net: Learning Depth from Event-based Ray Densities : Abstract: Event cameras offer a promising avenue for multi-view stereo depth estimation and Simultaneous Localization And Mapping (SLAM) due to their ability to detect blur-free 3D edges at high-speed...
DEEMO: De-identity Multimodal Emotion Recognition and Reasoning : Abstract: Emotion understanding is a critical yet challenging task. Most existing approaches rely heavily on identity-sensitive information, such as facial expressions and speech, which raises concern...
Neural Stereo Video Compression with Hybrid Disparity Compensation : Abstract: Disparity compensation represents the primary strategy in stereo video compression (SVC) for exploiting cross-view redundancy. These mechanisms can be broadly categorized into two types: one...
VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding : Abstract: Vision-Language Models (VLMs) have achieved strong results in video understanding, yet a key question remains: do they truly comprehend visual content or only learn shallow correlations betw...
Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection : Abstract: 3D mask presentation attack detection is crucial for protecting face recognition systems against the rising threat of 3D mask attacks. While most existing methods utilize multimodal features...
Zero-Shot Multi-modal Large Language Model v.s. Supervised Deep Learning: A Comparative Study on CT-Based Intracranial Hemorrhage Subtyping : Abstract: Introduction: Timely identification of intracranial hemorrhage (ICH) subtypes on non-contrast computed tomography is critical for prognosis prediction and therapeutic decision-making, yet re...
Attention! Your Vision Language Model Could Be Maliciously Manipulated : Abstract: Large Vision-Language Models (VLMs) have achieved remarkable success in understanding complex real-world scenarios and supporting data-driven decision-making processes. However, VLMs exhibit...
DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding : Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in visual understanding, but their application to long-term Earth observation analysis remains limited, pri...
Navigating the Accuracy-Size Trade-Off with Flexible Model Merging : Abstract: Model merging has emerged as an efficient method to combine multiple single-task fine-tuned models. The merged model can enjoy multi-task capabilities without expensive training. While promi...
Radiant Triangle Soup with Soft Connectivity Forces for 3D Reconstruction and Novel View Synthesis : Abstract: We introduce an inference-time scene optimization algorithm utilizing triangle soup, a collection of disconnected translucent triangle primitives, as the representation for the geometry and ...
Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations : Abstract: Learning effective multi-modal 3D representations of objects is essential for numerous applications, such as augmented reality and robotics. Existing methods often rely on task-specific embe...
3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks : Abstract: Medical Visual Question Answering (Med-VQA) holds significant potential for clinical decision support, yet existing efforts primarily focus on 2D imaging with limited task diversity. This pa...
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning : Abstract: Recent advancements in Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving by leveraging world knowledge and reasoning capabilities. However, current VLA...
Self-supervised Representation Learning with Local Aggregation for Image-based Profiling : Abstract: Image-based cell profiling aims to create informative representations of cell images. This technique is critical in drug discovery and has greatly advanced with recent improvements in comput...
AdFair-CLIP: Adversarial Fair Contrastive Language-Image Pre-training for Chest X-rays : Abstract: Contrastive Language-Image Pre-training (CLIP) models have demonstrated superior performance across various visual tasks including medical image classification. However, fairness concerns, i...
Spurious-Aware Prototype Refinement for Reliable Out-of-Distribution Detection : Abstract: Out-of-distribution (OOD) detection is crucial for ensuring the reliability and safety of machine learning models in real-world applications, where they frequently face data distributions un...
Kernel Density Steering: Inference-Time Scaling via Mode Seeking for Image Restoration : Abstract: Diffusion models show promise for image restoration, but existing methods often struggle with inconsistent fidelity and undesirable artifacts. To address this, we introduce Kernel Density St...
THUNDER: Tile-level Histopathology image UNDERstanding benchmark : Abstract: Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This is the case in digital pathology, where many found...
Contrastive Conditional-Unconditional Alignment for Long-tailed Diffusion Model : Abstract: Training data for class-conditional image synthesis often exhibit a long-tailed distribution with limited images for tail classes. Such an imbalance causes mode collapse and reduces the dive...
ExpressNet-MoE: A Hybrid Deep Neural Network for Emotion Recognition : Abstract: In many domains, including online education, healthcare, security, and human-computer interaction, facial emotion recognition (FER) is essential. Real-world FER is still difficult despite it...
Reducing the Representation Error of GAN Image Priors Using the Deep Decoder : Abstract: Generative models, such as GANs, learn an explicit low-dimensional representation of a particular class of images, and so they may be used as natural image priors for solving inverse problem...
Generalization Bounds for Robust Contrastive Learning: From Theory to Practice : Abstract: Contrastive Learning first extracts features from unlabeled data, followed by linear probing with labeled data. Adversarial Contrastive Learning (ACL) integrates Adversarial Training into th...
Continuous and complete liver vessel segmentation with graph-attention guided diffusion : Abstract: Improving connectivity and completeness are the most challenging aspects of liver vessel segmentation, especially for small vessels. These challenges require both learning the continuous ves...
Slot-BERT: Self-supervised Object Discovery in Surgical Video : Abstract: Object-centric slot attention is a powerful framework for unsupervised learning of structured and explainable representations that can support reasoning about objects and actions, including ...
CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection : Abstract: Multimodal large language models (MLLMs) have demonstrated impressive capabilities in visual reasoning and text generation. While previous studies have explored the application of MLLM for d...
Are Pixel-Wise Metrics Reliable for Sparse-View Computed Tomography Reconstruction? : Abstract: Widely adopted evaluation metrics for sparse-view CT reconstruction--such as Structural Similarity Index Measure and Peak Signal-to-Noise Ratio--prioritize pixel-wise fidelity but often fail...
A Poisson-Guided Decomposition Network for Extreme Low-Light Image Enhancement : Abstract: Low-light image denoising and enhancement are challenging, especially when traditional noise assumptions, such as Gaussian noise, do not hold in majority. In many real-world scenarios, such ...
VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree : Abstract: Video anomaly detection (VAD) focuses on identifying anomalies in videos. Supervised methods demand substantial in-domain training data and fail to deliver clear explanations for anomalies. ...
WaveMAE: Wavelet decomposition Masked Auto-Encoder for Remote Sensing : Abstract: Self-supervised learning (SSL) has recently emerged as a key strategy for building foundation models in remote sensing, where the scarcity of annotated data limits the applicability of fully...
IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction : Abstract: Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most ...
LRW-Persian: Lip-reading in the Wild Dataset for Persian Language : Abstract: Lipreading has emerged as an increasingly important research area for developing robust speech recognition systems and assistive technologies for the hearing-impaired. However, non-English r...
Cross-view Localization and Synthesis - Datasets, Challenges and Opportunities : Abstract: Cross-view localization and synthesis are two fundamental tasks in cross-view visual understanding, which deals with cross-view datasets: overhead (satellite or aerial) and ground-level imag...
ConMatFormer: A Multi-attention and Transformer Integrated ConvNext based Deep Learning Model for Enhanced Diabetic Foot Ulcer Classification : Abstract: Diabetic foot ulcer (DFU) detection is a clinically significant yet challenging task due to the scarcity and variability of publicly available datasets. To solve these problems, we propose C...
Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models : Abstract: Pre-trained vision-language models (VLMs) such as CLIP have demonstrated strong zero-shot capabilities across diverse domains, yet remain highly vulnerable to adversarial perturbations that ...
MedXplain-VQA: Multi-Component Explainable Medical Visual Question Answering : Abstract: Explainability is critical for the clinical adoption of medical visual question answering (VQA) systems, as physicians require transparent reasoning to trust AI-generated diagnoses. We prese...
MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control : Abstract: Audio-driven talking face generation has gained significant attention for applications in digital media and virtual avatars. While recent methods improve audio-lip synchronization, they ofte...
FairJudge: MLLM Judging for Social Attributes and Prompt Image Alignment : Abstract: Text-to-image (T2I) systems lack simple, reproducible ways to evaluate how well images match prompts and how models treat social attributes. Common proxies -- face classifiers and contrastiv...
Semantic-Preserving Cross-Style Visual Reasoning for Robust Multi-Modal Understanding in Large Vision-Language Models : Abstract: The "style trap" poses a significant challenge for Large Vision-Language Models (LVLMs), hindering robust semantic understanding across diverse visual styles, especially in in-context learni...
FastJAM: a Fast Joint Alignment Model for Images : Abstract: Joint Alignment (JA) of images aims to align a collection of images into a unified coordinate frame, such that semantically-similar features appear at corresponding spatial locations. Most e...
Seeing the Unseen: Towards Zero-Shot Inspection for Wind Turbine Blades using Knowledge-Augmented Vision Language Models : Abstract: Wind turbine blades operate in harsh environments, making timely damage detection essential for preventing failures and optimizing maintenance. Drone-based inspection and deep learning are p...
Estimating Pasture Biomass from Top-View Images: A Dataset for Precision Agriculture : Abstract: Accurate estimation of pasture biomass is important for decision-making in livestock production systems. Estimates of pasture biomass can be used to manage stocking rates to maximise pasture...
Positional Preservation Embedding for Multimodal Large Language Models : Abstract: Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks, yet often suffer from inefficiencies due to redundant visual tokens. Existing token mergin...
Bi-Encoder Contrastive Learning for Fingerprint and Iris Biometrics : Abstract: There has been a historic assumption that the biometrics of an individual are statistically uncorrelated. We test this assumption by training Bi-Encoder networks on three verification tasks,...
Switchable Token-Specific Codebook Quantization For Face Image Compression : Abstract: With the ever-increasing volume of visual data, the efficient and lossless transmission, along with its subsequent interpretation and understanding, has become a critical bottleneck in moder...
LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation : Abstract: Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computation...
Survey of Multimodal Geospatial Foundation Models: Techniques, Applications, and Challenges : Abstract: Foundation models have transformed natural language processing and computer vision, and their impact is now reshaping remote sensing image analysis. With powerful generalization and transfer...
VALA: Learning Latent Anchors for Training-Free and Temporally Consistent : Abstract: Recent advances in training-free video editing have enabled lightweight and precise cross-frame generation by leveraging pre-trained text-to-image diffusion models. However, existing methods...
Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method : Abstract: Driving scene generation is a critical domain for autonomous driving, enabling downstream applications, including perception and planning evaluation. Occupancy-centric methods have recently ...
VoMP: Predicting Volumetric Mechanical Property Fields : Abstract: Physical simulation relies on spatially-varying mechanical properties, often laboriously hand-crafted. VoMP is a feed-forward method trained to predict Young's modulus (E), Poisson's ratio...
SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency : Abstract: Recent text-to-image models have revolutionized image generation, but they still struggle with maintaining concept consistency across generated images. While existing works focus on characte...
LoMix: Learnable Weighted Multi-Scale Logits Mixing for Medical Image Segmentation : Abstract: U-shaped networks output logits at multiple spatial scales, each capturing a different blend of coarse context and fine detail. Yet, training still treats these logits in isolation - either ...
CoMo: Compositional Motion Customization for Text-to-Video Generation : Abstract: While recent text-to-video models excel at generating diverse scenes, they struggle with precise motion control, particularly for complex, multi-subject motions. Although methods for single-...
UGAE: Unified Geometry and Attribute Enhancement for G-PCC Compressed Point Clouds : Abstract: Lossy compression of point clouds reduces storage and transmission costs; however, it inevitably leads to irreversible distortion in geometry structure and attribute information. To address ...
HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling : Abstract: Video temporal grounding, the task of localizing the start and end times of a natural language query in untrimmed video, requires capturing both global context and fine-grained temporal deta...
Strategies for Robust Deep Learning Based Deformable Registration : Abstract: Deep learning based deformable registration methods have become popular in recent years. However, their ability to generalize beyond training data distribution can be poor, significantly hin...
EndoWave: Rational-Wavelet 4D Gaussian Splatting for Endoscopic Reconstruction : Abstract: In robot-assisted minimally invasive surgery, accurate 3D reconstruction from endoscopic video is vital for downstream tasks and improved outcomes. However, endoscopic scenarios present uniq...
Revisiting Multimodal Positional Encoding in Vision-Language Models : Abstract: Multimodal position encoding is essential for vision-language models, yet there has been little systematic investigation into multimodal position encoding. We conduct a comprehensive analysi...
Residual Diffusion Bridge Model for Image Restoration : Abstract: Diffusion bridge models establish probabilistic paths between arbitrary paired distributions and exhibit great potential for universal image restoration. Most existing methods merely treat t...
Task-Agnostic Fusion of Time Series and Imagery for Earth Observation : Abstract: We propose a task-agnostic framework for multimodal fusion of time series and single timestamp images, enabling cross-modal generation and robust downstream performance. Our approach explore...
DeepSalt: Bridging Laboratory and Satellite Spectra through Domain Adaptation and Knowledge Distillation for Large-Scale Soil Salinity Estimation : Abstract: Soil salinization poses a significant threat to both ecosystems and agriculture because it limits plants' ability to absorb water and, in doing so, reduces crop productivity. This phenomenon...
Note on the Construction of Structure Tensor : Abstract: This note presents a theoretical discussion of two structure tensor constructions: one proposed by Bigun and Granlund 1987, and the other by Granlund and Knutsson 1995. At first glance, thes...
Fast Voxel-Wise Kinetic Modeling in Dynamic PET using a Physics-Informed CycleGAN : Abstract: Tracer kinetic modeling serves a vital role in diagnosis, treatment planning, tracer development and oncology, but burdens practitioners with complex and invasive arterial input function est...
DQ3D: Depth-guided Query for Transformer-Based 3D Object Detection in Traffic Scenarios : Abstract: 3D object detection from multi-view images in traffic scenarios has garnered significant attention in recent years. Many existing approaches rely on object queries that are generated from 3D...
Implicit Modeling for Transferability Estimation of Vision Foundation Models : Abstract: Transferability estimation identifies the best pre-trained models for downstream tasks without incurring the high computational cost of full fine-tuning. This capability facilitates deployme...
AG-Fusion: adaptive gated multimodal fusion for 3d object detection in complex scenes : Abstract: Multimodal camera-LiDAR fusion technology has found extensive application in 3D object detection, demonstrating encouraging performance. However, existing methods exhibit significant perform...
Finding 3D Scene Analogies with Multimodal Foundation Models : Abstract: Connecting current observations with prior experiences helps robots adapt and plan in new, unseen 3D environments. Recently, 3D scene analogies have been proposed to connect two 3D scenes, w...
Evaluation of Vision-LLMs in Surveillance Video : Abstract: The widespread use of cameras in our society has created an overwhelming amount of video data, far exceeding the capacity for human monitoring. This presents a critical challenge for public ...
DecoDINO: 3D Human-Scene Contact Prediction with Semantic Classification : Abstract: Accurate vertex-level contact prediction between humans and surrounding objects is a prerequisite for high fidelity human object interaction models used in robotics, AR/VR, and behavioral si...
VR-Drive: Viewpoint-Robust End-to-End Driving with Feed-Forward 3D Gaussian Splatting : Abstract: End-to-end autonomous driving (E2E-AD) has emerged as a promising paradigm that unifies perception, prediction, and planning into a holistic, data-driven framework. However, achieving robust...
Accurate and Scalable Multimodal Pathology Retrieval via Attentive Vision-Language Alignment : Abstract: The rapid digitization of histopathology slides has opened up new possibilities for computational tools in clinical and research workflows. Among these, content-based slide retrieval stands ...
Through the Lens: Benchmarking Deepfake Detectors Against Moiré-Induced Distortions : Abstract: Deepfake detection remains a pressing challenge, particularly in real-world settings where smartphone-captured media from digital screens often introduces Moiré artifacts that can distort ...
Autoregressive Styled Text Image Generation, but Make it Reliable : Abstract: Generating faithful and readable styled text images (especially for Styled Handwritten Text generation - HTG) is an open problem with several possible applications across graphic design, doc...
A Video Is Not Worth a Thousand Words : Abstract: As we become increasingly dependent on vision language models (VLMs) to answer questions about the world around us, there is a significant amount of research devoted to increasing both the d...
hYOLO Model: Enhancing Object Classification with Hierarchical Context in YOLOv8 : Abstract: Current convolution neural network (CNN) classification methods are predominantly focused on flat classification which aims solely to identify a specified object within an image. However, re...
Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling : Abstract: Diffusion-based generative processes, formulated as differential equation solving, frequently balance computational speed with sample quality. Our theoretical investigation of ODE- and SDE-b...
MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection : Abstract: Despite progress in multimodal sarcasm detection, existing datasets and methods predominantly focus on single-image scenarios, overlooking potential semantic and affective relations across m...
MDReID: Modality-Decoupled Learning for Any-to-Any Multi-Modal Object Re-Identification : Abstract: Real-world object re-identification (ReID) systems often face modality inconsistencies, where query and gallery images come from different sensors (e.g., RGB, NIR, TIR). However, most existi...
Interpretable Tile-Based Classification of Paclitaxel Exposure : Abstract: Medical image analysis is central to drug discovery and preclinical evaluation, where scalable, objective readouts can accelerate decision-making. We address classification of paclitaxel (Ta...
PlanarTrack: A high-quality and challenging benchmark for large-scale planar object tracking : Abstract: Planar tracking has drawn increasing interest owing to its key roles in robotics and augmented reality. Despite recent great advancement, further development of planar tracking, particularly...
An Efficient Remote Sensing Super Resolution Method Exploring Diffusion Priors and Multi-Modal Constraints for Crop Type Mapping : Abstract: Super resolution offers a way to harness medium- or even low-resolution but historically valuable remote sensing image archives. Generative models, especially diffusion models, have recently been...
VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations : Abstract: Video temporal grounding (VTG) aims to locate precise segments in videos based on language queries, which is a fundamental challenge in video understanding. While recent Multimodal Large Lan...
Color and Frequency Correction for Image Colorization : Abstract: The project has carried out the re-optimization of image coloring in accordance with the existing Autocolorization direction model DDColor. For the experiments on the existing weights of DDC...
Symmetria: A Synthetic Dataset for Learning in Point Clouds : Abstract: Unlike image or text domains that benefit from an abundance of large-scale datasets, point cloud learning techniques frequently encounter limitations due to the scarcity of extensive dataset...
Towards Generalisable Foundation Models for 3D Brain MRI : Abstract: Foundation models in artificial intelligence (AI) are transforming medical imaging by enabling general-purpose feature learning from large-scale, unlabeled datasets. In this work, we introdu...
Quality-controlled registration of urban MLS point clouds reducing drift effects by adaptive fragmentation : Abstract: This study presents a novel workflow designed to efficiently and accurately register large-scale mobile laser scanning (MLS) point clouds to a target model point cloud in urban street scenar...
MiCADangelo: Fine-Grained Reconstruction of Constrained CAD Models from 3D Scans : Abstract: Computer-Aided Design (CAD) plays a foundational role in modern manufacturing and product development, often requiring designers to modify or build upon existing models. Converting 3D scans ...
CURVETE: Curriculum Learning and Progressive Self-supervised Training for Medical Image Classification : Abstract: Identifying high-quality and easily accessible annotated samples poses a notable challenge in medical image analysis. Transfer learning techniques, leveraging pre-training data, offer a flex...
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning : Abstract: Recent advances in image reasoning methods, particularly "Thinking with Images", have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reaso...
UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception : Abstract: Recent cooperative perception datasets have played a crucial role in advancing smart mobility applications by enabling information exchange between intelligent agents, helping to overcome ch...
MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding : Abstract: Vision-language alignment in multi-modal large language models (MLLMs) typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). SFT is stable and efficient but requir...
Yesnt: Are Diffusion Relighting Models Ready for Capture Stage Compositing? A Hybrid Alternative to Bridge the Gap : Abstract: Volumetric video relighting is essential for bringing captured performances into virtual worlds, but current approaches struggle to deliver temporally stable, production-ready results. Diffu...
VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation : Abstract: Training vision-language models (VLMs) for complex reasoning remains a challenging task, i.a. due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning ...
iPac: Incorporating Intra-image Patch Context into Graph Neural Networks for Medical Image Classification : Abstract: Graph neural networks have emerged as a promising paradigm for image processing, yet their performance in image classification tasks is hindered by a limited consideration of the underlying ...
FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time : Abstract: This paper proposes FreeFuse, a novel training-free approach for multi-subject text-to-image generation through automatic fusion of multiple subject LoRAs. In contrast to existing methods th...
DPGLA: Bridging the Gap between Synthetic and Real Data for Unsupervised Domain Adaptation in 3D LiDAR Semantic Segmentation : Abstract: Annotating real-world LiDAR point clouds for use in intelligent autonomous systems is costly. To overcome this limitation, self-training-based Unsupervised Domain Adaptation (UDA) has been w...
EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT : Abstract: Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained...
More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models : Abstract: Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter u...
Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation : Abstract: Audio-driven human animation models often suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. One solution is to g...
FARMER: Flow AutoRegressive Transformer over Pixels : Abstract: Directly modeling the explicit likelihood of the raw data distribution is a key topic in the machine learning area, which achieves the scaling successes in Large Language Models by autoregress...
InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras : Abstract: Accurately tracking camera intrinsics is crucial for achieving 3D understanding from 2D video. However, most 3D algorithms assume that camera intrinsics stay constant throughout a video, whi...
PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection : Abstract: We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prio...
PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity : Abstract: Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, s...
Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations : Abstract: Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Conc...
A Robotic Stirring Method with Trajectory Optimization and Adaptive Speed Control for Accurate Pest Counting in Water Traps : Abstract: Accurate monitoring of pest population dynamics is crucial for informed decision-making in precision agriculture. Currently, mainstream image-based pest counting methods primarily rely on im...
A supervised discriminant data representation: application to pattern classification : Abstract: The performance of machine learning and pattern recognition algorithms generally depends on data representation. That is why much of the current effort in performing machine learning algori...
MAGIC-Flow: Multiscale Adaptive Conditional Flows for Generation and Interpretable Classification : Abstract: Generative modeling has emerged as a powerful paradigm for representation learning, but its direct applicability to challenging fields like medical imaging remains limited: mere generation, ...
Frequency-Spatial Interaction Driven Network for Low-Light Image Enhancement : Abstract: Low-light image enhancement (LLIE) aims at improving the perception or interpretability of an image captured in an environment with poor illumination. With the advent of deep learning, the L...
LT-Exosense: A Vision-centric Multi-session Mapping System for Lifelong Safe Navigation of Exoskeletons : Abstract: Self-balancing exoskeletons offer a promising mobility solution for individuals with lower-limb disabilities. For reliable long-term operation, these exoskeletons require a perception system...
Expert Validation of Synthetic Cervical Spine Radiographs Generated with a Denoising Diffusion Probabilistic Model : Abstract: Machine learning in neurosurgery is limited by challenges in assembling large, high-quality imaging datasets. Synthetic data offers a scalable, privacy-preserving solution. We evaluated the ...
Simplifying Knowledge Transfer in Pretrained Models : Abstract: Pretrained models are ubiquitous in the current deep learning landscape, offering strong results on a broad range of tasks. Recent works have shown that models differing in various design ch...
Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy : Abstract: Retrieval over visually rich documents is essential for tasks such as legal discovery, scientific search, and enterprise knowledge management. Existing approaches fall into two paradigms: si...
Privacy-Aware Federated nnU-Net for ECG Page Digitization : Abstract: Deep neural networks can convert ECG page images into analyzable waveforms, yet centralized training often conflicts with cross-institutional privacy and deployment constraints. A cross-silo...
Hollywood Town: Long-Video Generation via Cross-Modal Multi-Agent Orchestration : Abstract: Recent advancements in multi-agent systems have demonstrated significant potential for enhancing creative task performance, such as long video generation. This study introduces three innovat...
LAMP: Data-Efficient Linear Affine Weight-Space Models for Parameter-Controlled 3D Shape Generation and Extrapolation : Abstract: Generating high-fidelity 3D geometries that satisfy specific parameter constraints has broad applications in design and engineering. However, current methods typically rely on large training...
Learning Event-guided Exposure-agnostic Video Frame Interpolation via Adaptive Feature Blending : Abstract: Exposure-agnostic video frame interpolation (VFI) is a challenging task that aims to recover sharp, high-frame-rate videos from blurry, low-frame-rate inputs captured under unknown and dynam...
Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs : Abstract: Large language models (LLMs) have recently advanced auditory speech recognition (ASR), visual speech recognition (VSR), and audio-visual speech recognition (AVSR). However, understanding of ...
DeepfakeBench-MM: A Comprehensive Benchmark for Multimodal Deepfake Detection : Abstract: The misuse of advanced generative AI models has resulted in the widespread proliferation of falsified data, particularly forged human-centric audiovisual content, which poses substantial soc...
Edge Collaborative Gaussian Splatting with Integrated Rendering and Communication : Abstract: Gaussian splatting (GS) struggles with degraded rendering quality on low-cost devices. To address this issue, we present edge collaborative GS (ECO-GS), where each user can switch between a ...
S-Chain: Structured Visual Chain-of-Thought For Medicine : Abstract: Faithful reasoning in medical vision-language models (VLMs) requires not only accurate predictions but also transparent alignment between textual rationales and visual evidence. While Chain-...
Understanding What Is Not Said: Referring Remote Sensing Image Segmentation with Scarce Expressions : Abstract: Referring Remote Sensing Image Segmentation (RRSIS) aims to segment instances in remote sensing images according to referring expressions. Unlike Referring Image Segmentation on general imag...
Neural-HAR: A Dimension-Gated CNN Accelerator for Real-Time Radar Human Activity Recognition : Abstract: Radar-based human activity recognition (HAR) is attractive for unobtrusive and privacy-preserving monitoring, yet many CNN/RNN solutions remain too heavy for edge deployment, and even lightw...
An Intelligent Water-Saving Irrigation System Based on Multi-Sensor Fusion and Visual Servoing Control : Abstract: This paper introduces an intelligent water-saving irrigation system designed to address critical challenges in precision agriculture, such as inefficient water use and poor terrain adaptabil...
Seq-DeepIPC: Sequential Sensing for End-to-End Control in Legged Robot Navigation : Abstract: We present Seq-DeepIPC, a sequential end-to-end perception-to-control model for legged robot navigation in real-world environments. Seq-DeepIPC advances intelligent sensing for autonomous leg...
Seeing Structural Failure Before it Happens: An Image-Based Physics-Informed Neural Network (PINN) for Spaghetti Bridge Load Prediction : Abstract: Physics Informed Neural Networks (PINNs) are gaining attention for their ability to embed physical laws into deep learning models, which is particularly useful in structural engineering task...
T-REGS: Minimum Spanning Tree Regularization for Self-Supervised Learning : Abstract: Self-supervised learning (SSL) has emerged as a powerful paradigm for learning representations without labeled data, often by enforcing invariance to input transformations such as rotations ...
Localising under the drape: proprioception in the era of distributed surgical robotic system : Abstract: Despite their mechanical sophistication, surgical robots remain blind to their surroundings. This lack of spatial awareness causes collisions, system recoveries, and workflow disruptions, is...
Revising Second Order Terms in Deep Animation Video Coding : Abstract: First Order Motion Model is a generative model that animates human heads based on very little motion information derived from keypoints. It is a promising solution for video communication be...
Invertible generative models for inverse problems: mitigating representation error and dataset bias : Abstract: Trained generative models have shown remarkable performance as priors for inverse problems in imaging -- for example, Generative Adversarial Network priors permit recovery of test images fro...
Unbiased Scene Graph Generation from Biased Training : Abstract: Today's scene graph generation (SGG) task is still far from practical, mainly due to the severe training bias, e.g., collapsing diverse "human walk on / sit on / lay on beach" into "human on...
Long-Tailed Classification by Keeping the Good and Removing the Bad Momentum Causal Effect : Abstract: As the class size grows, maintaining a balanced dataset across many classes is challenging because the data are long-tailed in nature; it is even impossible when the sample-of-interest co-ex...
Weakly Supervised Learning for Facial Behavior Analysis: A Review : Abstract: In recent years, there has been a shift in facial behavior analysis from laboratory-controlled conditions to challenging in-the-wild conditions due to the superior performance of...
Revisiting Transformation Invariant Geometric Deep Learning: An Initial Representation Perspective : Abstract: Deep neural networks have achieved great success in the last decade. When designing neural networks to handle the ubiquitous geometric data such as point clouds and graphs, it is critical th...
Blockchain and Biometrics: Survey, GDPR Analysis, and Future Directions : Abstract: Biometric recognition as an efficient and hard-to-forge way of identification and verification has become an indispensable part of the current digital world. The fast evolution of this techn...
Open-Set 3D Semantic Instance Maps for Vision Language Navigation -- O3D-SIM : Abstract: Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries. Our previous work, SI Maps (Nanwani L, Ag...
Steerable Transformers for Volumetric Data : Abstract: We introduce Steerable Transformers, an extension of the Vision Transformer mechanism that maintains equivariance to the special Euclidean group $\mathrm{SE}(d)$. We propose an equivariant a...
RealCustom++: Representing Images as Real Textual Word for Real-Time Customization : Abstract: Given a text and an image of a specific subject, text-to-image customization aims to generate new images that align with both the text and the subject's appearance. Existing works follow the...
Robust Modality-incomplete Anomaly Detection: A Modality-instructive Framework with Benchmark : Abstract: Multimodal Industrial Anomaly Detection (MIAD), which utilizes 3D point clouds and 2D RGB images to identify abnormal regions in products, plays a crucial role in industrial quality inspecti...
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis : Abstract: Diffusion models have achieved impressive success in generating photorealistic images, but challenges remain in ensuring precise semantic alignment with input prompts. Optimizing the initial...
FaceTracer: Unveiling Source Identities from Swapped Face Images and Videos for Fraud Prevention : Abstract: Face-swapping techniques have advanced rapidly with the evolution of deep learning, leading to widespread use and growing concerns about potential misuse, especially in cases of fraud. While...
GS-ProCams: Gaussian Splatting-based Projector-Camera Systems : Abstract: We present GS-ProCams, the first Gaussian Splatting-based framework for projector-camera systems (ProCams). GS-ProCams is not only view-agnostic but also significantly enhances the efficienc...
Optimize the Unseen - Fast NeRF Cleanup with Free Space Prior : Abstract: Neural Radiance Fields (NeRF) have advanced photorealistic novel view synthesis, but their reliance on photometric reconstruction introduces artifacts, commonly known as "floaters". These ar...
BCR-Net: Boundary-Category Refinement Network for Weakly Semi-Supervised X-Ray Prohibited Item Detection with Points : Abstract: Automatic prohibited item detection in X-ray images is crucial for public safety. However, most existing detection methods either rely on expensive box annotations to achieve high performanc...
MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning : Abstract: Video causal reasoning aims to achieve a high-level understanding of videos from a causal perspective. However, it exhibits limitations in its scope, primarily executed in a question-answeri...
Dual-Flow: Transferable Multi-Target, Instance-Agnostic Attacks via In-the-wild Cascading Flow Optimization : Abstract: Adversarial attacks are widely used to evaluate model robustness, and in black-box scenarios, the transferability of these attacks becomes crucial. Existing generator-based attacks have exce...
Pulling Back the Curtain: Unsupervised Adversarial Detection via Contrastive Auxiliary Networks : Abstract: Deep learning models are widely employed in safety-critical applications yet remain susceptible to adversarial attacks -- imperceptible perturbations that can significantly degrade model per...
T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting : Abstract: Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but ofte...
Now you see me! Attribution Distributions Reveal What is Truly Important for a Prediction : Abstract: Neural networks are regularly employed in high-stakes decision-making, where understanding and transparency is key. Attribution methods have been developed to gain understanding into which i...
EEdit: Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing : Abstract: Inversion-based image editing is rapidly gaining momentum while suffering from significant computation overhead, hindering its application in real-time interactive scenarios. In this paper, ...
Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models : Abstract: Vision-language models (VLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored. Medical vision-language tasks demand...
Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents : Abstract: There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) who take assistive actions toward a user's goal/query (e.g. "W...
SemiETPicker: Fast and Label-Efficient Particle Picking for CryoET Tomography Using Semi-Supervised Learning : Abstract: Cryogenic Electron Tomography (CryoET) combined with sub-volume averaging (SVA) is the only imaging modality capable of resolving protein structures inside cells at molecular resolution. Par...
AesCrop: Aesthetic-driven Cropping Guided by Composition : Abstract: Aesthetic-driven image cropping is crucial for applications like view recommendation and thumbnail generation, where visual appeal significantly impacts user engagement. A key factor in visu...
Bag-of-Word-Groups (BoWG): A Robust and Efficient Loop Closure Detection Method Under Perceptual Aliasing : Abstract: Loop closure is critical in Simultaneous Localization and Mapping (SLAM) systems to reduce accumulative drift and ensure global mapping consistency. However, conventional methods struggle in...
SRSR: Enhancing Semantic Accuracy in Real-World Image Super-Resolution with Spatially Re-Focused Text-Conditioning : Abstract: Existing diffusion-based super-resolution approaches often exhibit semantic ambiguities due to inaccuracies and incompleteness in their text conditioning, coupled with the inherent tendency ...
MELDAE: A Framework for Micro-Expression Spotting, Detection, and Automatic Evaluation in In-the-Wild Conversational Scenes : Abstract: Accurately analyzing spontaneous, unconscious micro-expressions is crucial for revealing true human emotions, but this task remains challenging in wild scenarios, such as natural conversatio...
From Pixels to Views: Learning Angular-Aware and Physics-Consistent Representations for Light Field Microscopy : Abstract: Light field microscopy (LFM) has become an emerging tool in neuroscience for large-scale neural imaging in vivo, notable for its single-exposure volumetric imaging, broad field of view, and ...
Cross-View UAV Geo-Localization with Precision-Focused Efficient Design: A Hierarchical Distillation Approach with Multi-view Refinement : Abstract: Cross-view geo-localization (CVGL) enables UAV localization by matching aerial images to geo-tagged satellite databases, which is critical for autonomous navigation in GNSS-denied environmen...
PSScreen V2: Partially Supervised Multiple Retinal Disease Screening : Abstract: In this work, we propose PSScreen V2, a partially supervised self-training framework for multiple retinal disease screening. Unlike previous methods that rely on fully labelled or single-dom...
Projection Embedded Diffusion Bridge for CT Reconstruction from Incomplete Data : Abstract: Reconstructing CT images from incomplete projection data remains challenging due to the ill-posed nature of the problem. Diffusion bridge models have recently shown promise in restoring clea...
SWAN: Self-supervised Wavelet Neural Network for Hyperspectral Image Unmixing : Abstract: In this article, we present SWAN: a three-stage, self-supervised wavelet neural network for joint estimation of endmembers and abundances from hyperspectral imagery. The contiguous and overl...
Robust Atypical Mitosis Classification with DenseNet121: Stain-Aware Augmentation and Hybrid Loss for Domain Generalization : Abstract: Atypical mitotic figures are important biomarkers of tumor aggressiveness in histopathology, yet reliable recognition remains challenging due to severe class imbalance and variability across...
Self-Attention Decomposition For Training Free Diffusion Editing : Abstract: Diffusion models achieve remarkable fidelity in image synthesis, yet precise control over their outputs for targeted editing remains challenging. A key step toward controllability is to iden...
Alias-Free ViT: Fractional Shift Invariance via Linear Attention : Abstract: Transformers have emerged as a competitive alternative to convnets in vision tasks, yet they lack the architectural inductive bias of convnets, which may hinder their potential performance. ...
DAMap: Distance-aware MapNet for High Quality HD Map Construction : Abstract: Predicting High-definition (HD) map elements with high quality (high classification and localization scores) is crucial to the safety of autonomous driving vehicles. However, current methods...
Estimation of Fireproof Structure Class and Construction Year for Disaster Risk Assessment : Abstract: Structural fireproof classification is vital for disaster risk assessment and insurance pricing in Japan. However, key building metadata such as construction year and structure type are ofte...
Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation : Abstract: Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method to generate factual and up-to-date responses of Multimodal Large Language Models (MLLMs) by incorporating n...
TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination : Abstract: In this paper we introduce Tale, Task-Aware Layer Elimination, an inference-time algorithm that prunes entire transformer layers in an LLM by directly optimizing task-specific validation per...
Offline Preference Optimization via Maximum Marginal Likelihood Estimation : Abstract: Aligning Large Language Models (LLMs) with human preferences is crucial, but standard methods like Reinforcement Learning from Human Feedback (RLHF) are often complex and unstable. In this w...
Modeling Political Discourse with Sentence-BERT and BERTopic : Abstract: Social media has reshaped political discourse, offering politicians a platform for direct engagement while reinforcing polarization and ideological divides. This study introduces a novel top...
Can Language Models Compose Skills In-Context? : Abstract: Composing basic skills from simple tasks to accomplish composite tasks is crucial for modern intelligent systems. We investigate the in-context composition ability of language models to perf...
M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark : Abstract: Text-to-image models are known to struggle with generating images that perfectly align with textual prompts. Several previous studies have focused on evaluating image-text alignment in text-...
UniAIDet: A Unified and Universal Benchmark for AI-Generated Image Content Detection and Localization : Abstract: With the rapid proliferation of image generative models, the authenticity of digital images has become a significant concern. While existing studies have proposed various methods for detecti...
Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts : Abstract: Recent advances in reinforcement learning (RL) have substantially improved the training of large-scale language models, leading to significant gains in generation quality and reasoning abili...
Fast-MIA: Efficient and Scalable Membership Inference for LLMs : Abstract: We propose Fast-MIA (https://github.com/Nikkei/fast-mia), a Python library for efficiently evaluating membership inference attacks (MIA) against Large Language Models (LLMs). MIA against LLM...
LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization : Abstract: We introduce LibriConvo, a simulated multi-speaker conversational dataset based on speaker-aware conversation simulation (SASC), designed to support training and evaluation of speaker diariz...
A U-Net and Transformer Pipeline for Multilingual Image Translation : Abstract: This paper presents an end-to-end multilingual translation pipeline that integrates a custom U-Net for text detection, the Tesseract engine for text recognition, and a from-scratch sequence-...
ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models : Abstract: Large Audio Language Models (LALMs), which couple acoustic perception with large language models (LLMs) to extract and understand diverse information from audio, have attracted intense inter...
TrendFact: A Benchmark for Explainable Hotspot Perception in Fact-Checking with Natural Language Explanation : Abstract: Fact-checking benchmarks provide standardized testing criteria for automated fact-checking systems, driving technological advancement. With the surge of misinformation on social media and th...
Fine-tuning Large Language Models with Limited Data: A Survey and Practical Guide : Abstract: Fine-tuning large language models (LLMs) with limited data poses a practical challenge in low-resource languages, specialized domains, and constrained deployment settings. While pre-trained ...
Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion : Abstract: Disfluencies are a natural feature of spontaneous human speech but are typically absent from the outputs of Large Language Models (LLMs). This absence can diminish the perceived naturalness ...
AttentionPredictor: Temporal Patterns Matter for KV Cache Compression : Abstract: With the development of large language models (LLMs), efficient inference through Key-Value (KV) cache compression has attracted considerable attention, especially for long-context generatio...
Superficial Self-Improved Reasoners Benefit from Model Merging : Abstract: As scaled language models (LMs) approach human-level reasoning capabilities, self-improvement emerges as a solution to synthesizing high-quality data corpus. While previous research has iden...
Distinct social-linguistic processing between humans and large audio-language models: Evidence from model-brain alignment : Abstract: Voice-based AI development faces unique challenges in processing both linguistic and paralinguistic information. This study compares how large audio-language models (LALMs) and humans integr...
Unified Sparse Mixture of Experts : Abstract: Sparse Mixture of Experts (SMoEs) models scale the capacity of models while maintaining constant computational overhead. Early designs typically relied on a fixed value of $k$, where $k$ rep...
Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions : Abstract: Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions. H...
Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters : Abstract: Cross-linguistically, native words and loanwords follow different phonological rules. In English, for example, words of Germanic and Latinate origin exhibit different stress patterns, and a ...
Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale : Abstract: Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks -- from offering writing support to delivering tailored recommendations or consult...
A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings : Abstract: Content moderation research has recently made significant advances, but remains limited in serving the majority of the world's languages due to the lack of resources, leaving millions of vul...
Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models : Abstract: Multilingual vision-language models (VLMs) promise universal image-text retrieval, yet their social biases remain underexplored. We perform the first systematic audit of four public multilin...
Gated Integration of Low-Rank Adaptation for Continual Learning of Large Language Models : Abstract: Continual learning (CL), which requires the model to learn multiple tasks sequentially, is crucial for large language models (LLMs). Recently, low-rank adaptation (LoRA), one of the most rep...
LyapLock: Bounded Knowledge Preservation in Sequential Large Language Model Editing : Abstract: Large Language Models often contain factually incorrect or outdated knowledge, giving rise to model editing methods for precise knowledge updates. However, current mainstream locate-then-edi...
The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation : Abstract: Large language models are able to exploit in-context learning to access external knowledge beyond their training data through retrieval-augmentation. While promising, its inner workings rema...
Gatsby Without the 'E': Crafting Lipograms with LLMs : Abstract: Lipograms are a unique form of constrained writing where all occurrences of a particular letter are excluded from the text, typified by the novel Gadsby, which daringly avoids all usage of t...
TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine : Abstract: Traditional Chinese Medicine (TCM), as an effective alternative medicine, has been receiving increasing attention. In recent years, the rapid development of large language models (LLMs) tail...
A Simple Linear Patch Revives Layer-Pruned Large Language Models : Abstract: Layer pruning has emerged as a widely used technique for compressing large language models (LLMs). However, existing layer pruning approaches often incur substantial performance degradation....
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning : Abstract: Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs) on reasoning tasks that elicit emergent long chains of thought (CoTs). Unlike...
Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference : Abstract: Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accu...
Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers : Abstract: Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone ...
Human-Aligned Faithfulness in Toxicity Explanations of LLMs : Abstract: The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs' reasoning about toxicity -- from their explanations that...
Improving the Distributional Alignment of LLMs using Supervision : Abstract: The ability to accurately align LLMs with human population groups on subjective questions would have great value. In this work, we show that use of simple supervision can greatly improve lan...
Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation : Abstract: Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decode...
DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models : Abstract: Data attribution methods quantify the influence of training data on model outputs and are becoming increasingly relevant for a wide range of LLM research and applications, including dataset ...
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation : Abstract: Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typica...
Automated HIV Screening on Dutch Electronic Health Records with Large Language Models : Abstract: Efficient screening and early diagnosis of HIV are critical for reducing onward transmission. Although large scale laboratory testing is not feasible, the widespread adoption of Electronic H...
Bootstrapping Referring Multi-Object Tracking : Abstract: Referring understanding is a fundamental task that bridges natural language and visual content by localizing objects described in free-form expressions. However, existing works are constrain...
FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation : Abstract: While large language models (LLMs) excel at handling long-context sequences, they require substantial prefill computation and key-value (KV) cache, which can heavily burden computational eff...
Probabilistic adaptation of language comprehension for individual speakers: evidence from neural oscillations : Abstract: Listeners adapt language comprehension based on their mental representations of speakers, but how these representations are updated remains unclear. We investigated whether listeners probabi...
Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought : Abstract: Large Vision-Language Models (LVLMs) have achieved significant success in multimodal tasks, with multimodal chain-of-thought (MCoT) further enhancing performance and interpretability. Recent...
SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training : Abstract: Low-rank gradient-based optimization methods have significantly improved memory efficiency during the training of large language models (LLMs), enabling operations within constrained hardwar...
SafeCOMM: A Study on Safety Degradation in Fine-Tuned Telecom Large Language Models : Abstract: Fine-tuning large language models (LLMs) on telecom datasets is a common practice to adapt general-purpose models to the telecom domain. However, little attention has been paid to how this p...
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs : Abstract: Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce ...
Agro-Consensus: Semantic Self-Consistency in Vision-Language Models for Crop Disease Management in Developing Countries : Abstract: Agricultural disease management in developing countries such as India, Kenya, and Nigeria faces significant challenges due to limited access to expert plant pathologists, unreliable internet...
H2OFlow: Grounding Human-Object Affordances with 3D Generative Models and Dense Diffused Flows : Abstract: Understanding how humans interact with the surrounding environment, and specifically reasoning about object interactions and affordances, is a critical challenge in computer vision, robotics...
Ageing Drift in Binary Face Templates: A Bits-per-Decade Analysis : Abstract: We study the longitudinal stability of compact binary face templates and quantify ageing drift directly in bits per decade. Float embeddings from a modern face CNN are compressed with PCA-IT...
Promptable Fire Segmentation: Unleashing SAM2's Potential for Real-Time Mobile Deployment with Strategic Bounding Box Guidance : Abstract: Fire segmentation remains a critical challenge in computer vision due to flames' irregular boundaries, translucent edges, and highly variable intensities. While the Segment Anything Models (...
Multi-Agent Pose Uncertainty: A Differentiable Rendering Cramér-Rao Bound : Abstract: Pose estimation is essential for many applications within computer vision and robotics. Despite its uses, few works provide rigorous uncertainty quantification for poses under dense or learn...
Mismatch reconstruction theory for unknown measurement matrix in imaging through multimode fiber bending : Abstract: Multimode fiber imaging requires strict matching between measurement value and measurement matrix to achieve image reconstruction. However, in practical applications, the measurement matrix ...
Exploring the design space of diffusion and flow models for data fusion : Abstract: Data fusion is an essential task in various domains, enabling the integration of multi-source information to enhance data quality and insights. One key application is in satellite remote sen...
AI-Boosted Video Annotation: Assessing the Process Enhancement : Abstract: We explore the enhancement of Human-in-the-Loop video annotation by integrating automatic capabilities to ease the task for annotators and assess their performance. The research delves into ...
Morphology-Aware KOA Classification: Integrating Graph Priors with Vision Models : Abstract: Knee osteoarthritis (KOA) diagnosis from radiographs remains challenging due to the subtle morphological details that standard deep learning models struggle to capture effectively. We propos...
It Takes Two to Tango: Two Parallel Samplers Improve Quality in Diffusion Models for Limited Steps : Abstract: We consider the situation where we have a limited number of denoising steps, i.e., of evaluations of a diffusion model. We show that two parallel processors or samplers under such limitation...
Embodied Navigation with Auxiliary Task of Action Description Prediction : Abstract: The field of multimodal robot navigation in indoor environments has garnered significant attention in recent years. However, as tasks and methods become more advanced, the action decision sy...
A Flow Model with Low-Rank Transformers for Incomplete Multimodal Survival Analysis : Abstract: In recent years, multimodal medical data-based survival analysis has attracted much attention. However, real-world datasets often suffer from the problem of incomplete modality, where some p...
Towards Accurate and Efficient Waste Image Classification: A Hybrid Deep Learning and Machine Learning Approach : Abstract: Automated image-based garbage classification is a critical component of global waste management; however, systematic benchmarks that integrate Machine Learning (ML), Deep Learning (DL), and ...
Improving the Physics of Video Generation with VJEPA-2 Reward Signal : Abstract: This is a short technical report describing the winning entry of the PhysicsIQ Challenge, presented at the Perception Test Workshop at ICCV 2025. State-of-the-art video generative models exh...
RatioWaveNet: A Learnable RDWT Front-End for Robust and Interpretable EEG Motor-Imagery Classification : Abstract: Brain-computer interfaces (BCIs) based on motor imagery (MI) translate covert movement intentions into actionable commands, yet reliable decoding from non-invasive EEG remains challenging du...
Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory? : Abstract: We present modal aphasia, a systematic dissociation in which current unified multimodal models accurately memorize concepts visually but fail to articulate them in writing, despite being tra...
LSF-Animation: Label-Free Speech-Driven Facial Animation via Implicit Feature Representation : Abstract: Speech-driven 3D facial animation has attracted increasing interest owing to its potential to generate expressive and temporally synchronized digital humans. While recent works have begun to ex...
Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers : Abstract: Diffusion Transformers (DiTs) deliver state-of-the-art generative performance but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. To...
LiteDiff : Abstract: In recent years, diffusion models have demonstrated remarkable success in high-fidelity image synthesis. However, fine-tuning these models for specialized domains, such as medical imaging, r...
FlowOpt: Fast Optimization Through Whole Flow Processes for Training-Free Editing : Abstract: The remarkable success of diffusion and flow-matching models has ignited a surge of works on adapting them at test time for controlled generation tasks. Examples range from image editing to ...
Caption-Driven Explainability: Probing CNNs for Bias via CLIP : Abstract: Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred ...
Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation : Abstract: Vision language models (VLMs) often generate hallucination, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-relian...
Scanner-Agnostic MRI Harmonization via SSIM-Guided Disentanglement : Abstract: The variability introduced by differences in MRI scanner models, acquisition protocols, and imaging sites hinders consistent analysis and generalizability across multicenter studies. We pres...
CogStereo: Neural Stereo Matching with Implicit Spatial Cognition Embedding : Abstract: Deep stereo matching has advanced significantly on benchmark datasets through fine-tuning but falls short of the zero-shot generalization seen in foundation models in other vision tasks. We ...
Mint: A Simple Test-Time Adaptation of Vision-Language Models against Common Corruptions : Abstract: Pretrained vision-language models such as CLIP achieve strong zero-shot generalization but remain vulnerable to distribution shifts caused by input corruptions. In this work, we investigate ...
egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks : Abstract: Understanding affect is central to anticipating human behavior, yet current egocentric vision benchmarks largely ignore the person's emotional states that shape their decisions and actions. ...
STG-Avatar: Animatable Human Avatars via Spacetime Gaussian : Abstract: Realistic animatable human avatars from monocular videos are crucial for advancing human-robot interaction and enhancing immersive virtual experiences. While recent research on 3DGS-based hu...
Attention Residual Fusion Network with Contrast for Source-free Domain Adaptation : Abstract: Source-free domain adaptation (SFDA) involves training a model on a source domain and then applying it to a related target domain without access to the source data and labels during adaptation...
I2-NeRF: Learning Neural Radiance Fields Under Physically-Grounded Media Interactions : Abstract: Participating in efforts to endow generative AI with the 3D physical world perception, we propose I2-NeRF, a novel neural radiance field framework that enhances isometric and isotropic metri...
HARMONY: Hidden Activation Representations and Model Output-Aware Uncertainty Estimation for Vision-Language Models : Abstract: The growing deployment of Vision-Language Models (VLMs) in high-stakes applications such as autonomous driving and assistive technologies for visually impaired individuals necessitates relia...
MOGRAS: Human Motion with Grasping in 3D Scenes : Abstract: Generating realistic full-body motion interacting with objects is critical for applications in robotics, virtual reality, and human-computer interaction. While existing methods can generate ...
LongCat-Video Technical Report : Abstract: Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generat...
TrajGATFormer: A Graph-Based Transformer Approach for Worker and Obstacle Trajectory Prediction in Off-site Construction Environments : Abstract: As the demand grows within the construction industry for processes that are not only faster but also safer and more efficient, offsite construction has emerged as a solution, though it bring...
DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum : Abstract: Generating dynamic and interactive 3D objects, such as trees, has wide applications in virtual reality, games, and world simulation. Nevertheless, existing methods still face various challen...
Enpowering Your Pansharpening Models with Generalizability: Unified Distribution is All You Need : Abstract: Existing deep learning-based models for remote sensing pansharpening exhibit exceptional performance on training datasets. However, due to sensor-specific characteristics and varying imaging...
Audio Frequency-Time Dual Domain Evaluation on Depression Diagnosis : Abstract: Depression, as a typical mental disorder, has become a prevalent issue significantly impacting public health. However, the prevention and treatment of depression still face multiple challeng...
Diffusion-Driven Two-Stage Active Learning for Low-Budget Semantic Segmentation : Abstract: Semantic segmentation demands dense pixel-level annotations, which can be prohibitively expensive - especially under extremely constrained labeling budgets. In this paper, we address the pro...
DiffusionLane: Diffusion Model for Lane Detection : Abstract: In this paper, we present a novel diffusion-based model for lane detection, called DiffusionLane, which treats the lane detection task as a denoising diffusion process in the parameter space...
Accident Anticipation via Temporal Occurrence Prediction : Abstract: Accident anticipation aims to predict potential collisions in an online manner, enabling timely alerts to enhance road safety. Existing methods typically predict frame-level risk scores as i...
GSAlign: Geometric and Semantic Alignment Network for Aerial-Ground Person Re-Identification : Abstract: Aerial-Ground person re-identification (AG-ReID) is an emerging yet challenging task that aims to match pedestrian images captured from drastically different viewpoints, typically from unman...
GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping : Abstract: Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these f...
Beyond Augmentation: Leveraging Inter-Instance Relation in Self-Supervised Representation Learning : Abstract: This paper introduces a novel approach that integrates graph theory into self-supervised representation learning. Traditional methods focus on intra-instance variations generated by applying...
GeoDiffusion: A Training-Free Framework for Accurate 3D Geometric Conditioning in Image Generation : Abstract: Precise geometric control in image generation is essential for engineering & product design and creative industries to control 3D object features accurately in image space. Traditional 3D e...
EndoSfM3D: Learning to 3D Reconstruct Any Endoscopic Surgery Scene using Self-supervised Foundation Model : Abstract: 3D reconstruction of endoscopic surgery scenes plays a vital role in enhancing scene perception, enabling AR visualization, and supporting context-aware decision-making in image-guided surge...
A Fully Interpretable Statistical Approach for Roadside LiDAR Background Subtraction : Abstract: We present a fully interpretable and flexible statistical method for background subtraction in roadside LiDAR data, aimed at enhancing infrastructure-based perception in automated driving. O...
3D Roadway Scene Object Detection with LIDARs in Snowfall Conditions : Abstract: Because the 3D structure of a roadway environment can be characterized directly by Light Detection and Ranging (LiDAR) sensors, they can be used to obtain exceptional situational awareness for...
Model-Aware Tokenizer Transfer : Abstract: Large Language Models (LLMs) are trained to support an increasing number of languages, yet their predefined tokenizers remain a bottleneck for adapting models to lower-resource or distinct-s...
A Stylometric Application of Large Language Models : Abstract: We show that large language models (LLMs) can be used to distinguish the writings of different authors. Specifically, an individual GPT-2 model, trained from scratch on the works of one auth...
Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics : Abstract: Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and as a reward signal in tasks like reinforcement learning. However, the prevalence and impact...
ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality : Abstract: Scaling laws research has focused overwhelmingly on English -- yet the most prominent AI models explicitly serve billions of international users. In this work, we undertake the largest multi...
Compositional Bias Control in Large Language Models: Preference Learning Fails, Supervision Succeeds : Abstract: Large Language Models (LLMs) still produce gender-stereotyped language even in occupation-neutral contexts that reflect deep societal biases (Rudinger et al., 2018). To address this, prior w...
Generalization or Memorization: Dynamic Decoding for Mode Steering : Abstract: Large Language Models (LLMs) exhibit a troubling duality, capable of both remarkable generalization and brittle, verbatim memorization of their training data. This unpredictability undermine...
OlaMind: Towards Human-Like and Hallucination-Safe Customer Service for Retrieval-Augmented Dialogue : Abstract: Intelligent customer service (ICS) systems via retrieval-augmented generation (RAG) have been widely adopted in Web-based domains such as social platforms and e-commerce, achieving remarkabl...
SentiMaithili: A Benchmark Dataset for Sentiment and Reason Generation for the Low-Resource Maithili Language : Abstract: Developing benchmark datasets for low-resource languages poses significant challenges, primarily due to the limited availability of native linguistic experts and the substantial time and cos...
DETECT: Determining Ease and Textual Clarity of German Text Simplifications : Abstract: Current evaluation of German automatic text simplification (ATS) relies on general-purpose metrics such as SARI, BLEU, and BERTScore, which insufficiently capture simplification quality in t...
Evolution of the lexicon: a probabilistic point of view : Abstract: The Swadesh approach for determining the temporal separation between two languages relies on the stochastic process of words replacement (when a complete new word emerges to represent a give...
SteerX: Disentangled Steering for LLM Personalization : Abstract: Large language models (LLMs) have shown remarkable success in recent years, enabling a wide range of applications, including intelligent assistants that support users' daily life and work. A...
From Slides to Chatbots: Enhancing Large Language Models with University Course Materials : Abstract: Large Language Models (LLMs) have advanced rapidly in recent years. One application of LLMs is to support student learning in educational settings. However, prior work has shown that LLMs st...
Memory-based Language Models: An Efficient, Explainable, and Eco-friendly Approach to Large Language Modeling : Abstract: We present memory-based language modeling as an efficient, eco-friendly alternative to deep neural network-based language modeling. It offers log-linearly scalable next-token prediction perf...
Irony Detection in Urdu Text: A Comparative Study Using Machine Learning Models and Large Language Models : Abstract: Ironic identification is a challenging task in Natural Language Processing, particularly when dealing with languages that differ in syntax and cultural context. In this work, we aim to detec...
GigaEmbeddings: Efficient Russian Language Embedding Model : Abstract: We introduce GigaEmbeddings, a novel framework for training high-performance Russian-focused text embeddings through hierarchical instruction tuning of the decoder-only LLM designed specific...
Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection : Abstract: We introduce the CAP (Confabulations from ACL Publications) dataset, a multilingual resource for studying hallucinations in large language models (LLMs) within scientific text generation. CA...
The Tonogenesis Continuum in Tibetan: A Computational Investigation : Abstract: Tonogenesis, the historical process by which segmental contrasts evolve into lexical tone, has traditionally been studied through comparative reconstruction and acoustic phonetics. We introduc...
Frustratingly Easy Task-aware Pruning for Large Language Models : Abstract: Pruning provides a practical solution to reduce the resources required to run large language models (LLMs) to benefit from their effective capabilities as well as control their cost for trai...
The Limits of Data Scaling: Sub-token Utilization and Acoustic Saturation in Multilingual ASR : Abstract: How much audio is needed to fully observe a multilingual ASR model's learned sub-token inventory across languages, and does data disparity in multilingual pre-training affect how these token...
A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using the Pacific Northwest English Corpus : Abstract: This paper presents a systematic evaluation of racial bias in four major commercial automatic speech recognition (ASR) systems using the Pacific Northwest English (PNWE) corpus. We analyze t...
SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size : Abstract: The growing memory footprint of the Key-Value (KV) cache poses a severe scalability bottleneck for long-context Large Language Model (LLM) inference. While KV cache eviction has emerged as a...
A Closed-Loop Personalized Learning Agent Integrating Neural Cognitive Diagnosis, Bounded-Ability Adaptive Testing, and LLM-Driven Feedback : Abstract: As information technology advances, education is moving from one-size-fits-all instruction toward personalized learning. However, most methods handle modeling, item selection, and feedback i...
Pedagogy-driven Evaluation of Generative AI-powered Intelligent Tutoring Systems : Abstract: The interdisciplinary research domain of Artificial Intelligence in Education (AIED) has a long history of developing Intelligent Tutoring Systems (ITSs) by integrating insights from technol...
Culturally Grounded Physical Commonsense Reasoning in Italian and English: A Submission to the MRL 2025 Shared Task : Abstract: This paper presents our submission to the MRL 2025 Shared Task on Multilingual Physical Reasoning Datasets. The objective of the shared task is to create manually-annotated evaluation data i...
Conjugate Relation Modeling for Few-Shot Knowledge Graph Completion : Abstract: Few-shot Knowledge Graph Completion (FKGC) infers missing triples from limited support samples, tackling long-tail distribution challenges. Existing methods, however, struggle to capture com...
Rule-Based Explanations for Retrieval-Augmented LLM Systems : Abstract: If-then rules are widely used to explain machine learning models; e.g., "if employed = no, then loan application = rejected." We present the first proposal to apply rules to explain the emer...
SALSA: Single-pass Autoregressive LLM Structured Classification : Abstract: Despite their impressive generalization capabilities, instruction-tuned Large Language Models often underperform on text classification benchmarks. We introduce SALSA, a coherent pipeline th...
EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models : Abstract: Speech Language Models (SLMs) have made significant progress in spoken language understanding. Yet it remains unclear whether they can fully perceive non lexical vocal cues alongside spoken ...
Iterative Layer Pruning for Efficient Translation Inference : Abstract: Large language models (LLMs) have transformed many areas of natural language processing, including machine translation. However, efficient deployment of LLMs remains challenging due to their...
MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion : Abstract: As Large Vision-Language Models (LVLMs) are increasingly deployed in domains such as shopping, health, and news, they are exposed to pervasive persuasive content. A critical question is how ...
Scalable Supervising Software Agents with Patch Reasoner : Abstract: While large language model agents have advanced software engineering tasks, the unscalable nature of existing test-based supervision is limiting the potential improvement of data scaling. Th...
VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions : Abstract: Automatically assessing handwritten mathematical solutions is an important problem in educational technology with practical applications, but it remains a significant challenge due to the di...
Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays : Abstract: BERT and its variants are extensively explored for automated scoring. However, a limit of 512 tokens for these encoder-based models showed the deficiency in automated scoring of long essays....
Leveraging Large Language Models to Identify Conversation Threads in Collaborative Learning : Abstract: Understanding how ideas develop and flow in small-group conversations is critical for analyzing collaborative learning. A key structural feature of these interactions is threading, the way d...
Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement : Abstract: Understanding how the human brain progresses from processing simple linguistic inputs to performing high-level reasoning is a fundamental challenge in neuroscience. While modern large langua...
Interpreting and Mitigating Unwanted Uncertainty in LLMs : Abstract: Despite their impressive capabilities, Large Language Models (LLMs) exhibit unwanted uncertainty, a phenomenon where a model changes a previously correct answer into an incorrect one when re...
A Comprehensive Dataset for Human vs. AI Generated Text Detection : Abstract: The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. ...
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond) : Abstract: Language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to simil...
Tagging-Augmented Generation: Assisting Language Models in Finding Intricate Knowledge In Long Contexts : Abstract: Recent investigations into effective context lengths of modern flagship large language models (LLMs) have revealed major limitations in effective question answering (QA) and reasoning over l...
LangLingual: A Personalised, Exercise-oriented English Language Learning Tool Leveraging Large Language Models : Abstract: Language educators strive to create a rich experience for learners, while they may be restricted in the extent of feedback and practice they can provide. We present the design and developmen...
Knocking-Heads Attention : Abstract: Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of...
A Survey on LLM Mid-training : Abstract: Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that brid...
MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models : Abstract: Recent advances have investigated the use of pretrained large language models (LLMs) for time-series forecasting by aligning numerical inputs with LLM embedding spaces. However, existing mul...
Flexing in 73 Languages: A Single Small Model for Multilingual Inflection : Abstract: We present a compact, single-model approach to multilingual inflection, the task of generating inflected word forms from base lemmas to express grammatical categories. Our model, trained joi...
Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation : Abstract: Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method widely used in large language models (LLMs). LoRA essentially describes the projection of an input space into a ...
Corpus Frequencies in Morphological Inflection: Do They Matter? : Abstract: The traditional approach to morphological inflection (the task of modifying a base word (lemma) to express grammatical categories) has been, for decades, to consider lexical entries of lemma...
ENTP: Enhancing Low-Quality SFT Data via Neural-Symbolic Text Purge-Mix : Abstract: Supervised Fine-Tuning (SFT) adapts pre-trained Large Language Models (LLMs) to domain-specific instructions by training on a carefully curated subset of high-quality instruction-response pa...
MATCH: Task-Driven Code Evaluation through Contrastive Learning : Abstract: AI-based code generation is increasingly prevalent, with GitHub Copilot estimated to generate 46% of the code on GitHub. Accurately evaluating how well generated code aligns with developer i...
SI-Bench: Benchmarking Social Intelligence of Large Language Models in Human-to-Human Conversations : Abstract: As large language models (LLMs) develop anthropomorphic abilities, they are increasingly being deployed as autonomous agents to interact with humans. However, evaluating their performance in...
Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages? : Abstract: Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages while automatic speech recognition (ASR) for regional dialects is treated as...
Mubeen AI: A Specialized Arabic Language Model for Heritage Preservation and User Intent Understanding : Abstract: Mubeen is a proprietary Arabic language model developed by MASARAT SA, optimized for deep understanding of Arabic linguistics, Islamic studies, and cultural heritage. Trained on an extensive...
Code Aesthetics with Agentic Reward Feedback : Abstract: Large Language Models (LLMs) have become valuable assistants for developers in code-related tasks. While LLMs excel at traditional programming tasks such as code generation and bug fixing, t...
A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results : Abstract: We introduce the task of Multi-Modal Context-Aware Recognition (MCoRec) in the ninth CHiME Challenge, which addresses the cocktail-party problem of overlapping conversations in a single-room...
DCMM-SQL: Automated Data-Centric Pipeline and Multi-Model Collaboration Training for Text-to-SQL Model : Abstract: Text-to-SQL tasks have gained attractive improvements since the release of ChatGPT. Among them, agent-based frameworks have been widely used in this field. However, the impact of data-centri...
Adaptive Blockwise Search: Inference-Time Alignment for Large Language Models : Abstract: LLM alignment remains a critical challenge. Inference-time methods provide a flexible alternative to fine-tuning, but their uniform computational effort often yields suboptimal alignment. We...
BaZi-Based Character Simulation Benchmark: Evaluating AI on Temporal and Persona Reasoning : Abstract: Human-like virtual characters are crucial for games, storytelling, and virtual reality, yet current methods rely heavily on annotated data or handcrafted persona prompts, making it difficult...
LightKGG: Simple and Efficient Knowledge Graph Generation from Textual Data : Abstract: The scarcity of high-quality knowledge graphs (KGs) remains a critical bottleneck for downstream AI applications, as existing extraction methods rely heavily on error-prone pattern-matching ...
How AI Forecasts AI Jobs: Benchmarking LLM Predictions of Labor Market Changes : Abstract: Artificial intelligence is reshaping labor markets, yet we lack tools to systematically forecast its effects on employment. This paper introduces a benchmark for evaluating how well large la...
MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring : Abstract: Effective math tutoring requires not only solving problems but also diagnosing students' difficulties and guiding them step by step. While multimodal large language models (MLLMs) show promi...
M4FC: a Multimodal, Multilingual, Multicultural, Multitask Real-World Fact-Checking Dataset : Abstract: Existing real-world datasets for multimodal automated fact-checking have multiple limitations: they contain few instances, focus on only one or two languages and tasks, suffer from evidence ...
IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering : Abstract: Intent identification serves as the foundation for generating appropriate responses in personalized question answering (PQA). However, existing benchmarks evaluate only response quality or r...
LimRank: Less is More for Reasoning-Intensive Information Reranking : Abstract: Existing approaches typically rely on large-scale fine-tuning to adapt LLMs for information reranking tasks, which is computationally expensive. In this work, we demonstrate that modern LLMs...
Think Twice: Branch-and-Rethink Reasoning Reward Model : Abstract: Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliber...
When Robots Say No: Temporal Trust Recovery Through Explanation : Abstract: Mobile robots with some degree of autonomy could deliver significant advantages in high-risk missions such as search and rescue and firefighting. Integrated into a human-robot team (HRT), ro...
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting : Abstract: Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, which lacks the ability to see, hear, speak, and act concurrently as well as handle...
Structured and Abstractive Reasoning on Multi-modal Relational Knowledge Images : Abstract: Understanding and reasoning with abstractive information from the visual modality presents significant challenges for current multi-modal large language models (MLLMs). Among the various for...
SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models : Abstract: Understanding long-context visual information remains a fundamental challenge for vision-language models, particularly in agentic tasks such as GUI control and web navigation. While web page...
Transformer Based Linear Attention with Optimized GPU Kernel Implementation : Abstract: The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensio...
Parallel Sampling from Masked Diffusion Models via Conditional Independence Testing : Abstract: Masked diffusion models (MDMs) offer a compelling alternative to autoregressive models (ARMs) for discrete text generation because they enable parallel token sampling, rather than sequential...
From Social Division to Cohesion with AI Message Suggestions in Online Chat Groups : Abstract: Social cohesion is difficult to sustain in societies marked by opinion diversity, particularly in online communication. As large language model (LLM)-driven messaging assistance becomes incr...
Optimal Detection for Language Watermarks with Pseudorandom Collision : Abstract: Text watermarking plays a crucial role in ensuring the traceability and accountability of large language model (LLM) outputs and mitigating misuse. While promising, most existing methods ass...
A Benchmark for Open-Domain Numerical Fact-Checking Enhanced by Claim Decomposition : Abstract: Fact-checking numerical claims is critical, as the presence of numbers provides a mirage of veracity despite being fake, potentially causing catastrophic impacts on society. The prior works in au...
Edit Less, Achieve More: Dynamic Sparse Neuron Masking for Lifelong Knowledge Editing in LLMs : Abstract: Lifelong knowledge editing enables continuous, precise updates to outdated knowledge in large language models (LLMs) without computationally expensive full retraining. However, existing meth...
LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction : Abstract: Vision-Language Models (VLMs) have shown significant progress in open-set challenges. However, the limited availability of 3D datasets hinders their effective application in 3D scene underst...
Surface Reading LLMs: Synthetic Text and its Styles : Abstract: Despite a potential plateau in ML advancement, the societal impact of large language models lies not in approaching superintelligence but in generating text surfaces indistinguishable from h...
M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR : Abstract: The Continuous Integrate-and-Fire (CIF) mechanism provides effective alignment for non-autoregressive (NAR) speech recognition. This mechanism creates a smooth and monotonic mapping from aco...
The Lossy Horizon: Error-Bounded Predictive Coding for Lossy Text Compression (Episode I) : Abstract: Large Language Models (LLMs) can achieve near-optimal lossless compression by acting as powerful probability models. We investigate their use in the lossy domain, where reconstruction fideli...
WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models : Abstract: Large-scale and high-quality image-text pair datasets play an important role in developing high-performing Vision-Language Models (VLMs). In this work, we introduce WAON, a large-scale and h...
Mapping Faithful Reasoning in Language Models : Abstract: Chain-of-thought (CoT) traces promise transparency for reasoning language models, but prior work shows they are not always faithful reflections of internal computation. This raises challenge...
Label Smoothing Improves Gradient Ascent in LLM Unlearning : Abstract: LLM unlearning has emerged as a promising approach, aiming to enable models to forget hazardous/undesired knowledge at low cost while preserving as much model utility as possible. Among exis...
UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models : Abstract: Spoken dialogue models currently lack the ability for fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely function...
Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views : Abstract: We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary ca...
RoboSVG: A Unified Framework for Interactive SVG Generation with Multi-modal Guidance : Abstract: Scalable Vector Graphics (SVGs) are fundamental to digital design and robot control, encoding not only visual structure but also motion paths in interactive drawings. In this work, we introd...
Membership Inference Attacks on Recommender System: A Survey : Abstract: Recommender systems (RecSys) have been widely applied to various applications, including E-commerce, finance, healthcare, social media and have become increasingly influential in shaping use...
A Multi-lingual Dataset of Classified Paragraphs from Open Access Scientific Publications : Abstract: We present a dataset of 833k paragraphs extracted from CC-BY licensed scientific publications, classified into four categories: acknowledgments, data mentions, software/code mentions, and cl...
Policy Optimization Prefers The Path of Least Resistance : Abstract: Policy optimization (PO) algorithms are used to refine Large Language Models for complex, multi-step reasoning. Current state-of-the-art pipelines enforce a strict think-then-answer format t...
Explaining and Mitigating Crosslingual Tokenizer Inequities : Abstract: The number of tokens it takes to encode parallel text in different languages is known to vary. These disparities are called token premiums. Having high token premiums leads to less throughpu...
Refusal as Silence: Gendered Disparities in Vision-Language Model Responses : Abstract: Refusal behavior by Large Language Models is increasingly visible in content moderation, yet little is known about how refusals vary by the identity of the user making the request. This stud...
How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text? : Abstract: Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their relianc...
R-SFLLM: Jamming Resilient Framework for Split Federated Learning with Large Language Models : Abstract: Split federated learning (SFL) is a compute-efficient paradigm in distributed machine learning (ML), where components of large ML models are outsourced to remote servers. A significant chall...
Centralized Reward Agent for Knowledge Sharing and Transfer in Multi-Task Reinforcement Learning : Abstract: Reward shaping is effective in addressing the sparse-reward challenge in reinforcement learning (RL) by providing immediate feedback through auxiliary, informative rewards. Based on the rewa...
Painless Federated Learning: An Interplay of Line-Search and Extrapolation : Abstract: The classical line search for learning rate (LR) tuning in the stochastic gradient descent (SGD) algorithm can tame the convergence slowdown due to data-sampling noise. In a federated settin...
Can Large Language Models Unlock Novel Scientific Research Ideas? : Abstract: The widespread adoption of Large Language Models (LLMs) and publicly available ChatGPT have marked a significant turning point in the integration of Artificial Intelligence (AI) into people'...
AI for Water Sustainability: Global Water Quality Assessment and Prediction with Explainable AI with LLM Chatbot for Insights : Abstract: Ensuring safe water supplies requires effective water quality monitoring, especially in developing countries like Nepal, where contamination risks are high. This paper introduces various hyb...
Scideator: Human-LLM Scientific Idea Generation Grounded in Research-Paper Facet Recombination : Abstract: The scientific ideation process often involves blending salient aspects of existing papers to create new ideas -- a framework known as facet-based ideation. To see how large language models ...
Last Iterate Convergence in Monotone Mean Field Games : Abstract: In the Lasry-Lions framework, Mean-Field Games (MFGs) model interactions among an infinite number of agents. However, existing algorithms either require strict monotonicity or only guarante...
A Cycle Ride to HDR: Semantics Aware Self-Supervised Framework for Unpaired LDR-to-HDR Image Reconstruction : Abstract: Reconstruction of High Dynamic Range (HDR) from Low Dynamic Range (LDR) images is an important computer vision task. There is a significant amount of research utilizing both conventional non...
Temporal Relational Reasoning of Large Language Models for Detecting Stock Portfolio Crashes : Abstract: Stock portfolios are often exposed to rare consequential events (e.g., 2007 global financial crisis, 2020 COVID-19 stock market crash), as they do not have enough historical information to l...
TrajAgent: An LLM-Agent Framework for Trajectory Modeling via Large-and-Small Model Collaboration : Abstract: Trajectory modeling, which includes research on trajectory data pattern mining and future prediction, has widespread applications in areas such as life services, urban transportation, and pu...
MIBP-Cert: Certified Training against Data Perturbations with Mixed-Integer Bilinear Programs : Abstract: Data errors, corruptions, and poisoning attacks during training pose a major threat to the reliability of modern AI systems. While extensive effort has gone into empirical mitigations, the e...
Macro2Micro: A Rapid and Precise Cross-modal Magnetic Resonance Imaging Synthesis using Multi-scale Structural Brain Similarity : Abstract: The human brain is a complex system requiring both macroscopic and microscopic components for comprehensive understanding. However, mapping nonlinear relationships between these scales remai...
Dipper: Diversity in Prompts for Producing Large Language Model Ensembles in Reasoning tasks : Abstract: Large Language Models (LLMs), particularly smaller variants, still struggle with complex reasoning tasks. While inference-time prompting can guide reasoning, existing methods often rely on s...
Dynamic-Aware Spatio-temporal Representation Learning for Dynamic MRI Reconstruction : Abstract: Dynamic MRI reconstruction, an inverse problem, has seen a surge in the use of deep learning techniques. In particular, the practical difficulty of obtaining ground truth data has led to t...
Solving the Unsolvable: Translating Case Law in Hong Kong : Abstract: This paper addresses the challenges of translating case law under Hong Kong's bilingual legal system. It highlights the initial success of translating all written statutes into Chinese before t...
Efficient Semi-Supervised Adversarial Training via Latent Clustering-Based Data Reduction : Abstract: Achieving high model robustness under adversarial settings is widely recognized as demanding considerable training samples. Recent works propose semi-supervised adversarial training (SSAT) m...
Zero-Shot Trajectory Planning for Signal Temporal Logic Tasks : Abstract: Signal Temporal Logic (STL) is a powerful specification language for describing complex temporal behaviors of continuous signals, making it well-suited for high-level robotic task descriptio...
Improving Video Generation with Human Feedback : Abstract: Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we de...
Robust Multimodal Learning via Cross-Modal Proxy Tokens : Abstract: Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. To address this challenge, we propose a simple yet effective appro...
FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks : Abstract: Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting th...
Technical Debt in In-Context Learning: Diminishing Efficiency in Long Context : Abstract: Transformers have demonstrated remarkable in-context learning (ICL) capabilities, adapting to new tasks by simply conditioning on demonstrations without parameter updates. Compelling empiric...
From Contextual Combinatorial Semi-Bandits to Bandit List Classification: Improved Sample Complexity with Sparse Rewards : Abstract: We study the problem of contextual combinatorial semi-bandits, where input contexts are mapped into subsets of size $m$ of a collection of $K$ possible actions. In each round, the learner ob...
On Vanishing Gradients, Over-Smoothing, and Over-Squashing in GNNs: Bridging Recurrent and Graph Learning : Abstract: Graph Neural Networks (GNNs) are models that leverage the graph structure to transmit information between nodes, typically through the message-passing operation. While widely successful, thi...
ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations : Abstract: This work demonstrates that diffusion models can achieve font-controllable multilingual text rendering using just raw images without font label annotations. Visual text rendering remains a si...
Shortcuts and Identifiability in Concept-based Models from a Neuro-Symbolic Lens : Abstract: Concept-based Models are neural networks that learn a concept extractor to map inputs to high-level concepts and an inference layer to translate these into predictions. Ensuring these module...
Detecting Various DeFi Price Manipulations with LLM Reasoning : Abstract: DeFi (Decentralized Finance) is one of the most important applications of today's cryptocurrencies and smart contracts. It manages hundreds of billions in Total Value Locked (TVL) on-chain, ...
KL Penalty Control via Perturbation for Direct Preference Optimization : Abstract: Direct Preference Optimization (DPO) demonstrates the advantage of aligning a large language model with human preference using only an offline dataset. However, DPO has the limitation that t...
Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting : Abstract: Time Series Forecasting (TSF) is a crucial task in various domains, yet existing TSF models rely heavily on high-quality data and insufficiently exploit all available data. This paper explor...
When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning : Abstract: While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users...
FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge : Abstract: Various studies have attempted to remove sensitive or private knowledge from a language model to prevent its unauthorized exposure. However, prior studies have overlooked the complex and int...
Beyond QA Pairs: Assessing Parameter-Efficient Fine-Tuning for Fact Embedding in LLMs : Abstract: This paper presents an extensive examination of Parameter-Efficient Fine-Tuning (PEFT) for embedding domain specific facts into Large Language Models (LLMs), focusing on improving the fine-t...
RePO: Understanding Preference Learning Through ReLU-Based Optimization : Abstract: Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO esta...
Identifying Trustworthiness Challenges in Deep Learning Models for Continental-Scale Water Quality Prediction : Abstract: Water quality is foundational to environmental sustainability, ecosystem resilience, and public health. Deep learning offers transformative potential for large-scale water quality prediction...
A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1 : Abstract: Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against closed-source commercial LVLMs. Analyzing failed adversa...
AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation : Abstract: While RAG demonstrates remarkable capabilities in LLM applications, its effectiveness is hindered by the ever-increasing length of retrieved contexts, which introduces information redundancy...
Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models : Abstract: Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Rew...
Out-of-Distribution Generalization in Time Series: A Survey : Abstract: Time series frequently manifest distribution shifts, diverse latent features, and non-stationary learning dynamics, particularly in open and evolving environments. These characteristics pose...
VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation : Abstract: Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified fr...
The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement : Abstract: Large language models (LLMs) have recently transformed from text-based assistants to autonomous agents capable of planning, reasoning, and iteratively improving their actions. While numerica...
SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging : Abstract: Fine-tuning large language models (LLMs) is a common practice to adapt generalist models to specialized domains. However, recent studies show that fine-tuning can erode safety alignment, cau...
Progressive Multi-Source Domain Adaptation for Personalized Facial Expression Recognition : Abstract: Personalized facial expression recognition (FER) involves adapting a machine learning model using samples from labeled sources and unlabeled target domains. Given the challenges of recognizi...
Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing : Abstract: Speech foundation models have significantly advanced various speech-related tasks by providing exceptional representation capabilities. However, their high-dimensional output features often ...
DMol: A Schedule-Driven Diffusion Model for Highly Efficient and Versatile Molecule Generation : Abstract: We introduce a new graph diffusion model for small molecule generation, DMol, which outperforms the state-of-the-art DiGress model in terms of validity by roughly 1.5% across all benchmarkin...
MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations : Abstract: We present a novel, open-source social network simulation framework, MOSAIC, where generative language agents predict user behaviors such as liking, sharing, and flagging content. This simul...
SEAL: Steerable Reasoning Calibration of Large Language Models for Free : Abstract: Large Language Models (LLMs), such as OpenAI's o1-series, have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. Ho...
Better Estimation of the Kullback-Leibler Divergence Between Language Models : Abstract: Estimating the Kullback-Leibler (KL) divergence between language models has many applications, e.g., reinforcement learning from human feedback (RLHF), interpretability, and knowledge disti...
Measuring the (Un)Faithfulness of Concept-Based Explanations : Abstract: Post-hoc, unsupervised concept-based explanation methods (U-CBEMs) translate a vision model's internal reasoning into human-understandable concepts, leading to interpretable explanations. Ho...
Depth-Constrained ASV Navigation with Deep RL and Limited Sensing : Abstract: Autonomous Surface Vehicles (ASVs) play a crucial role in maritime operations, yet their navigation in shallow-water environments remains challenging due to dynamic disturbances and depth co...
SAGE: A Generic Framework for LLM Safety Evaluation : Abstract: As Large Language Models are rapidly deployed across diverse applications from healthcare to financial advice, safety evaluation struggles to keep pace. Current benchmarks focus on single-tu...
Assessing the Potential of Generative Agents in Crowdsourced Fact-Checking : Abstract: The growing spread of online misinformation has created an urgent need for scalable, reliable fact-checking solutions. Crowdsourced fact-checking - where non-experts evaluate claim veracity ...
ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement : Abstract: We present Agentic Retrieval-Augmented Code Synthesis (ARCS), a system that improves LLM-based code generation without fine-tuning. ARCS operates through a budgeted synthesize-execute-repair...
Temporal Robustness in Discrete Time Linear Dynamical Systems : Abstract: Discrete time linear dynamical systems, including Markov chains, have found many applications including in security settings such as in cybersecurity operations center (CSOC) management and ...
ComPO: Preference Alignment via Comparison Oracles : Abstract: Direct alignment methods are increasingly used for aligning large language models (LLMs) with human preferences. However, these methods suffer from the issues of verbosity and likelihood dis...
Flow-GRPO: Training Flow Matching Models via Online RL : Abstract: We propose Flow-GRPO, the first method to integrate online policy gradient reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conv...
ChromFound: Towards A Universal Foundation Model for Single-Cell Chromatin Accessibility Data : Abstract: The advent of single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) offers an innovative perspective for deciphering regulatory mechanisms by assembling a vast...
FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA : Abstract: Low-Rank Adaptation (LoRA), which introduces a product of two trainable low-rank matrices into frozen pre-trained weights, is widely used for efficient fine-tuning of language models in fede...
CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming : Abstract: Competitive programming benchmarks are widely used in scenarios such as programming contests and large language model assessments. However, the growing presence of duplicate or highly simila...
RLVR-World: Training World Models with Reinforcement Learning : Abstract: World models predict state transitions in response to actions and are increasingly developed across diverse modalities. However, standard training objectives such as maximum likelihood estim...
Adaptive Inference-Time Scaling via Cyclic Diffusion Search : Abstract: Diffusion models have demonstrated strong generative capabilities across domains ranging from image synthesis to complex reasoning tasks. However, most inference-time scaling methods rely on...
Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2) : Abstract: Reinforcement Learning (RL) can mitigate the causal confusion and distribution shift inherent to imitation learning (IL). However, applying RL to end-to-end autonomous driving (E2E-AD) remai...
MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback : Abstract: Hypothesis ranking is vital for automated scientific discovery, especially in cost-intensive, throughput-limited natural science domains. Current methods focus on pre-experiment ranking, rel...
DataRater: Meta-Learned Dataset Curation : Abstract: The quality of foundation models depends heavily on their training data. Consequently, great efforts have been put into dataset curation. Yet most approaches rely on manual tuning of coarse-...
RestoreVAR: Visual Autoregressive Generation for All-in-One Image Restoration : Abstract: The use of latent diffusion models (LDMs) such as Stable Diffusion has significantly improved the perceptual quality of All-in-One image Restoration (AiOR) methods, while also enhancing thei...
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays : Abstract: Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchm...
PhySense: Sensor Placement Optimization for Accurate Physics Sensing : Abstract: Physics sensing plays a central role in many scientific and engineering domains, which inherently involves two coupled tasks: reconstructing dense physical fields from sparse observations an...
Performance and Generalizability Impacts of Incorporating Location Encoders into Deep Learning for Dynamic PM2.5 Estimation : Abstract: Deep learning has shown strong performance in geospatial prediction tasks, but the role of geolocation information in improving accuracy and generalizability remains underexamined. Recent wo...
MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search : Abstract: Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodologi...
Preference Optimization by Estimating the Ratio of the Data Distribution : Abstract: Direct preference optimization (DPO) is widely used as a simple and stable method for aligning large language models (LLMs) with human preferences. This paper investigates a generalized DPO ...
PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding : Abstract: Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning, yet large multimodal models (LMMs) ...
First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training : Abstract: Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL), which require expensive and ma...
On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling : Abstract: Scaling limits, such as infinite-width limits, serve as promising theoretical tools to study large-scale models. However, it is widely believed that existing infinite-width theory does not f...
GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents : Abstract: Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-perf...
Estimating LLM Consistency: A User Baseline vs Surrogate Metrics : Abstract: Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been propos...
Towards Minimizing Feature Drift in Model Merging: Layer-wise Task Vector Fusion for Adaptive Knowledge Integration : Abstract: Multi-task model merging aims to consolidate knowledge from multiple fine-tuned task-specific experts into a unified model while minimizing performance degradation. Existing methods primaril...
Representational Difference Explanations : Abstract: We propose a method for discovering and visualizing the differences between two learned representations, enabling more direct and interpretable model comparisons. We validate our method, whi...
Less is More: Local Intrinsic Dimensions of Contextual Language Models : Abstract: Understanding the internal mechanisms of large language models (LLMs) remains a challenging and complex endeavor. Even fundamental questions, such as how fine-tuning affects model behavior, ...
Psi-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models : Abstract: We introduce $\Psi$-Sampler, an SMC-based framework incorporating pCNL-based initial particle sampling for effective inference-time reward alignment with a score-based generative model. Infe...
Engram Memory Encoding and Retrieval: A Neurocomputational Perspective : Abstract: Despite substantial research into the biological basis of memory, the precise mechanisms by which experiences are encoded, stored, and retrieved in the brain remain incompletely understood. ...
Entity-Augmented Neuroscience Knowledge Retrieval Using Ontology and Semantic Understanding Capability of LLM : Abstract: Neuroscience research publications encompass a vast wealth of knowledge. Accurately retrieving existing information and discovering new insights from this extensive literature is essential f...
HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model : Abstract: Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partial...
Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System : Abstract: Heterogeneous multirobot systems show great potential in complex tasks requiring coordinated hybrid cooperation. However, existing methods that rely on static or task-specific models often l...
Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models : Abstract: Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods typically form...
Mixture-of-Experts Meets In-Context Reinforcement Learning : Abstract: In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in...
Zero-shot protein stability prediction by inverse folding models: a free energy interpretation : Abstract: Inverse folding models have proven to be highly effective zero-shot predictors of protein stability. Despite this success, the link between the amino acid preferences of an inverse folding m...
Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance : Abstract: Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized s...
Synthesize Privacy-Preserving High-Resolution Images via Private Textual Intermediaries : Abstract: Generating high fidelity, differentially private (DP) synthetic images offers a promising route to share and analyze sensitive visual data without compromising individual privacy. However, e...
Vision Transformers Don't Need Trained Registers : Abstract: We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers - the emergence of high-norm tokens that lead to noisy attention maps (Darcet et al., 2024)....
ME: Trigger Element Combination Backdoor Attack on Copyright Infringement : Abstract: The capability of generative diffusion models (DMs) like Stable Diffusion (SD) in replicating training data could be taken advantage of by attackers to launch the Copyright Infringement Atta...
KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills : Abstract: Humanoid robots are promising to acquire various skills by imitating human behaviors. However, existing algorithms are only capable of tracking smooth, low-speed human motions, even with del...
Distributional Training Data Attribution: What do Influence Functions Sample? : Abstract: Randomness is an unavoidable part of training deep learning models, yet something that traditional training data attribution algorithms fail to rigorously account for. They ignore the fact t...
Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment : Abstract: Recent advances in LLMs have greatly improved general-domain NLP tasks. Yet, their adoption in critical domains, such as clinical trial recruitment, remains limited. As trials are designed i...
Identifiability of Deep Polynomial Neural Networks : Abstract: Polynomial Neural Networks (PNNs) possess a rich algebraic and geometric structure. However, their identifiability -- a key property for ensuring interpretability -- remains poorly understoo...
MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation : Abstract: Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of in...
FlightKooba: A Fast Interpretable FTP Model : Abstract: Flight trajectory prediction (FTP) and similar time series tasks typically require capturing smooth latent dynamics hidden within noisy signals. However, existing deep learning models face s...
DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE : Abstract: Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to mod...
Reasoning as an Adaptive Defense for Safety : Abstract: Reasoning methods that adaptively allocate test-time compute have advanced LLM performance on easy to verify domains such as math and code. In this work, we study how to utilize this approac...
Echo State Transformer: Attention Over Finite Memories : Abstract: While Large Language Models and their underlying Transformer architecture are remarkably efficient, they do not reflect how our brain processes and learns a diversity of cognitive tasks such...
Deep Learning Atmospheric Models Reliably Simulate Out-of-Sample Land Heat and Cold Wave Frequencies : Abstract: Deep learning (DL)-based general circulation models (GCMs) are emerging as fast simulators, yet their ability to replicate extreme events outside their training range remains unknown. Here, ...
Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG : Abstract: Malware family classification aims to identify the specific family (e.g., GuLoader or BitRAT) a malware sample may belong to, in contrast to malware detection or sample classification, which...
OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model : Abstract: Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive res...
Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving : Abstract: AI agent frameworks operate in isolation, forcing agents to rediscover solutions and repeat mistakes across different systems. Despite valuable problem-solving experiences accumulated by fra...
Unifying Re-Identification, Attribute Inference, and Data Reconstruction Risks in Differential Privacy : Abstract: Differentially private (DP) mechanisms are difficult to interpret and calibrate because existing methods for mapping standard privacy parameters to concrete privacy risks -- re-identificatio...
The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora : Abstract: Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generatio...
Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation : Abstract: In this work, we present a novel direction to build an image tokenizer directly on top of a frozen vision foundation model, which is a largely underexplored area. Specifically, we employ a f...
Context-Aware Regularization with Markovian Integration for Attention-Based Nucleotide Analysis : Abstract: Transformers have revolutionized nucleotide sequence analysis, yet capturing long-range dependencies remains challenging. Recent studies show that autoregressive transformers often exhibit M...
Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training : Abstract: As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets -- such as cosine learning rate schedules -- are increasingly inadequate f...
Ground-Compose-Reinforce: Grounding Language in Agentic Behaviours using Limited Data : Abstract: Grounding language in perception and action is a key challenge when building situated agents that can interact with humans, or other agents, via language. In the past, addressing this challe...
A Lightweight Gradient-based Causal Discovery Framework with Applications to Complex Industrial Processes : Abstract: With the advancement of deep learning technologies, various neural network-based Granger causality models have been proposed. Although these models have demonstrated notable improvements, se...
PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors : Abstract: Evaluating the scientific discovery capabilities of large language model based agents, particularly how they cope with varying environmental complexity and utilize prior knowledge, requires ...
Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries : Abstract: Most existing sound event detection (SED) algorithms operate under a closed-set assumption, restricting their detection capabilities to predefined classes. While recent efforts have explored...
FRBNet: Revisiting Low-Light Vision through Frequency-Domain Radial Basis Network : Abstract: Low-light vision remains a fundamental challenge in computer vision due to severe illumination degradation, which significantly affects the performance of downstream tasks such as detection ...
Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences : Abstract: Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs are mainly focused ...
BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents : Abstract: Confidence in LLMs is a useful indicator of model uncertainty and answer reliability. Existing work mainly focused on single-turn scenarios, while research on confidence in complex multi-tur...
Evaluating Large Language Models for Stance Detection on Financial Targets from SEC Filing Reports and Earnings Call Transcripts : Abstract: Financial narratives from U.S. Securities and Exchange Commission (SEC) filing reports and quarterly earnings call transcripts (ECTs) are very important for investors, auditors, and regulato...
Robust Decision Making with Partially Calibrated Forecasts : Abstract: Calibration has emerged as a foundational goal in "trustworthy machine learning", in part because of its strong decision theoretic semantics. Independent of the underlying distribution, an...
BBOPlace-Bench: Benchmarking Black-Box Optimization for Chip Placement : Abstract: Chip placement is a vital stage in modern chip design as it has a substantial impact on the subsequent processes and the overall quality of the final chip. The use of black-box optimization ...
On the Faithfulness of Visual Thinking: Measurement and Enhancement : Abstract: Recent large vision-language models (LVLMs) can generate vision-text multimodal chain-of-thought (MCoT) traces after reinforcement fine-tuning (RFT). However, we observe that the visual info...
Mixed Precision Training of Neural ODEs : Abstract: Exploiting low-precision computations has become a standard strategy in deep learning to address the growing computational costs imposed by ever larger models and datasets. However, naively ...
A Deep Latent Factor Graph Clustering with Fairness-Utility Trade-off Perspective : Abstract: Fair graph clustering seeks partitions that respect network structure while maintaining proportional representation across sensitive groups, with applications spanning community detection, t...
Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization : Abstract: Audio autoencoders learn useful, compressed audio representations, but their non-linear latent spaces prevent intuitive algebraic manipulation such as mixing or scaling. We introduce a simpl...
RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation : Abstract: The pursuit of robot generalists - instructable agents capable of performing diverse tasks across diverse environments - demands rigorous and scalable evaluation. Yet real-world testing of r...
UrbanVLA: A Vision-Language-Action Model for Urban Micromobility : Abstract: Urban micromobility applications, such as delivery robots, demand reliable navigation across large-scale urban environments while following long-horizon route instructions. This task is part...
TAMI: Taming Heterogeneity in Temporal Interactions for Temporal Graph Link Prediction : Abstract: Temporal graph link prediction aims to predict future interactions between nodes in a graph based on their historical interactions, which are encoded in node embeddings. We observe that hete...
Hope Speech Detection in Social Media English Corpora: Performance of Traditional and Transformer Models : Abstract: The identification of hope speech has become a promising NLP task, considering the need to detect motivational expressions of agency and goal-directed behaviour on social media platforms. Thi...
A Survey of Data Agents: Emerging Paradigm or Overstated Hype? : Abstract: The rapid advancement of large language models (LLMs) has spurred the emergence of data agents--autonomous systems designed to orchestrate Data + AI ecosystems for tackling complex data-rela...
Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling : Abstract: Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across differen...
Variational Masked Diffusion Models : Abstract: Masked diffusion models have recently emerged as a flexible framework for discrete generative modeling. However, a key limitation of standard masked diffusion is its inability to effectively...
Faster Reinforcement Learning by Freezing Slow States : Abstract: We study infinite horizon Markov decision processes (MDPs) with "fast-slow" structure, where some state variables evolve rapidly ("fast states") while others change more gradually ("slow sta...
Online POMDP Planning with Anytime Deterministic Optimality Guarantees : Abstract: Decision-making under uncertainty is a critical aspect of many practical autonomous systems due to incomplete information. Partially Observable Markov Decision Processes (POMDPs) offer a mat...
GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability : Abstract: Improving the general capabilities of large language models (LLMs) is an active research topic. As a common data structure in many real-world domains, understanding graph data is a crucial p...
Integrated Design and Governance of Agentic AI Systems through Adaptive Information Modulation : Abstract: Modern engineered systems increasingly involve complex sociotechnical environments where multiple agents, including humans and the emerging paradigm of agentic AI powered by large language m...
Learning to Better Search with Language Models via Guided Reinforced Self-Training : Abstract: While language models have shown remarkable performance across diverse tasks, they still encounter challenges in complex reasoning scenarios. Recent research suggests that language models tr...
Diversified and Adaptive Negative Sampling on Knowledge Graphs : Abstract: In knowledge graph embedding, aside from positive triplets (i.e., facts in the knowledge graph), the negative triplets used for training also have a direct influence on the model performance. ...
Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals : Abstract: Retrieval-augmented generation (RAG) has shown impressive capabilities in mitigating hallucinations in large language models (LLMs). However, LLMs struggle to maintain consistent reasoning w...
Human-AI Collaboration: Trade-offs Between Performance and Preferences : Abstract: Despite the growing interest in collaborative AI, designing systems that seamlessly integrate human input remains a major challenge. In this study, we developed a task to systematically exam...
Why Do Multi-Agent LLM Systems Fail? : Abstract: Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of w...
Attention Pruning: Automated Fairness Repair of Language Models via Surrogate Simulated Annealing : Abstract: This paper explores pruning attention heads as a post-processing bias mitigation method for large language models (LLMs). Modern AI systems such as LLMs are expanding into sensitive social c...
LLMs as Planning Formalizers: A Survey for Leveraging Large Language Models to Construct Automated Planning Models : Abstract: Large Language Models (LLMs) excel in various natural language tasks but often struggle with long-horizon planning problems requiring structured reasoning. This limitation has drawn interest...
QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks? : Abstract: Large language models (LLMs) have shown impressive performance on reasoning benchmarks like math and logic. While many works have largely assumed well-defined tasks, real-world queries are o...
Scaling Laws For Scalable Oversight : Abstract: Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still uncle...
GVPO: Group Variance Policy Optimization for Large Language Model Post-Training : Abstract: Post-training plays a crucial role in refining and aligning large language models to meet specific tasks and human preferences. While recent advancements in post-training techniques, such as...
Lost in Transmission: When and Why LLMs Fail to Reason Globally : Abstract: Despite their many successes, transformer-based large language models (LLMs) continue to struggle with tasks that require complex reasoning over large parts of their input. We argue that the...
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis : Abstract: Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use age...
Guarded Query Routing for Large Language Models : Abstract: Query routing, the task to route user queries to different large language model (LLM) endpoints, can be considered as a text classification problem. However, out-of-distribution queries must...
ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions : Abstract: Recent advances in Large Language Models (LLMs) have propelled intelligent agents from reactive responses to proactive support. While promising, existing proactive agents either rely exclusi...
On the Hardness of Approximating Distributions with Tractable Probabilistic Models : Abstract: A fundamental challenge in probabilistic modeling is to balance expressivity and inference efficiency. Tractable probabilistic models (TPMs) aim to directly address this tradeoff by imposing...
E-bike agents: Large Language Model-Driven E-Bike Accident Analysis and Severity Prediction : Abstract: E-bikes have rapidly gained popularity as a sustainable form of urban mobility, yet their safety implications remain underexplored. This paper analyzes injury incidents involving e-bikes and...
Towards Responsible AI: Advances in Safety, Fairness, and Accountability of Autonomous Systems : Abstract: Ensuring responsible use of artificial intelligence (AI) has become imperative as autonomous systems increasingly influence critical societal domains. However, the concept of trustworthy AI ...
When Can Model-Free Reinforcement Learning be Enough for Thinking? : Abstract: Recent work on large language models has demonstrated the use of model-free reinforcement learning (RL) to train reasoning-like capabilities. The emergence of "thinking" through model-free R...
SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents : Abstract: Self-evolution, the ability of agents to autonomously improve their reasoning and behavior, is essential for the embodied domain with long-horizon, real-world tasks. Despite current advancem...
Chatbot To Help Patients Understand Their Health : Abstract: Patients must possess the knowledge necessary to actively participate in their care. We present NoteAid-Chatbot, a conversational AI that promotes patient understanding via a novel 'learning...
Representer Theorems for Metric and Preference Learning: Geometric Insights and Algorithms : Abstract: We develop a mathematical framework to address a broad class of metric and preference learning problems within a Hilbert space. We obtain a novel representer theorem for the simultaneous tas...
Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection : Abstract: The exploration problem is one of the main challenges in deep reinforcement learning (RL). Recent promising works tried to handle the problem with population-based methods, which collect sam...
TransFace++: Rethinking the Face Recognition Paradigm with a Focus on Accuracy, Efficiency, and Security : Abstract: Face Recognition (FR) technology has made significant strides with the emergence of deep learning. Typically, most existing FR models are built upon Convolutional Neural Networks (CNN) and t...
Graph Neural Architecture Search with GPT-4 : Abstract: Graph Neural Architecture Search (GNAS) has shown promising results in finding the best graph neural network architecture on a given graph dataset. However, existing GNAS methods still requi...
DocFinQA: A Long-Context Financial Reasoning Dataset : Abstract: For large language models (LLMs) to be effective in the financial domain -- where each decision can have a significant impact -- it is necessary to investigate realistic tasks and data. Fina...
FaithLM: Towards Faithful Explanations for Large Language Models : Abstract: Large language models (LLMs) increasingly produce natural language explanations, yet these explanations often lack faithfulness, and they do not reliably reflect the evidence the model uses ...
Diffusion Models Meet Contextual Bandits : Abstract: Efficient decision-making in contextual bandits with large action spaces is challenging, as methods lacking additional prior information may suffer from computational and statistical ineffic...
Dynamic D2D-Assisted Federated Learning over O-RAN: Performance Analysis, MAC Scheduler, and Asymmetric User Selection : Abstract: Existing studies on federated learning (FL) are mostly focused on system orchestration for static snapshots of the network and making static control decisions (e.g., spectrum allocation). Ho...
UCINet0: A Machine Learning based Receiver for 5G NR PUCCH Format 0 : Abstract: Accurate decoding of Uplink Control Information (UCI) on the Physical Uplink Control Channel (PUCCH) is essential for enabling 5G wireless links. This paper explores an AI/ML-based receiver ...
REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning : Abstract: Recent rehearsal-free continual learning (CL) methods guided by prompts achieve strong performance on vision tasks with non-stationary data but remain resource-intensive, hindering real-worl...
ATLAS: Actor-Critic Task-Completion with Look-ahead Action Simulation : Abstract: We observe that current state-of-the-art web-agents are unable to effectively adapt to new environments without neural network fine-tuning, without which they produce inefficient execution p...
$\text{E}^2\text{Rank}$: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker : Abstract: Text embedding models serve as a fundamental component in real-world search applications. By mapping queries and documents into a shared embedding space, they deliver competitive retrieval p...
REVISION: Reflective Intent Mining and Online Reasoning Auxiliary for E-commerce Visual Search System Optimization : Abstract: In Taobao e-commerce visual search, user behavior analysis reveals a large proportion of no-click requests, suggesting diverse and implicit user intents. These intents are expressed in vario...
Policies over Poses: Reinforcement Learning based Distributed Pose-Graph Optimization for Multi-Robot SLAM : Abstract: We consider the distributed pose-graph optimization (PGO) problem, which is fundamental in accurate trajectory estimation in multi-robot simultaneous localization and mapping (SLAM). Convent...
Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study : Abstract: Despite the widespread adoption of large language models (LLMs), their strongest capabilities remain largely confined to a small number of high-resource languages for which there is abundant...
Beyond Semantics: How Temporal Biases Shape Retrieval in Transformer and State-Space Models : Abstract: In-context learning is governed by both temporal and semantic relationships, shaping how Large Language Models (LLMs) retrieve contextual information. Analogous to human episodic memory, whe...
PIP-LLM: Integrating PDDL-Integer Programming with LLMs for Coordinating Multi-Robot Teams Using Natural Language : Abstract: Enabling robot teams to execute natural language commands requires translating high-level instructions into feasible, efficient multi-robot plans. While Large Language Models (LLMs) combined...
Collaborative LLM Agents for C4 Software Architecture Design Automation : Abstract: Software architecture design is a fundamental part of creating every software system. Despite its importance, producing a C4 software architecture model, the preferred notation for such arch...
A Theory of the Mechanics of Information: Generalization Through Measurement of Uncertainty (Learning is Measuring) : Abstract: Traditional machine learning relies on explicit models and domain assumptions, limiting flexibility and interpretability. We introduce a model-free framework using surprisal (information the...
Air Quality Prediction Using LOESS-ARIMA and Multi-Scale CNN-BiLSTM with Residual-Gated Attention : Abstract: Air pollution remains a critical environmental and public health concern in Indian megacities such as Delhi, Kolkata, and Mumbai, where sudden spikes in pollutant levels challenge timely int...
Cross-Lingual Stability and Bias in Instruction-Tuned Language Models for Humanitarian NLP : Abstract: Humanitarian organizations face a critical choice: invest in costly commercial APIs or rely on free open-weight models for multilingual human rights monitoring. While commercial systems offe...
LLM-based Fusion of Multi-modal Features for Commercial Memorability Prediction : Abstract: This paper addresses the prediction of commercial (brand) memorability as part of "Subtask 2: Commercial/Ad Memorability" within the "Memorability: Predicting movie and commercial memorabili...
Once Upon an Input: Reasoning via Per-Instance Program Synthesis : Abstract: Large language models (LLMs) excel at zero-shot inference but continue to struggle with complex, multi-step reasoning. Recent methods that augment LLMs with intermediate reasoning steps such...
Semantic Surgery: Zero-Shot Concept Erasure in Diffusion Models : Abstract: Concept erasure in text-to-image diffusion models is crucial for mitigating harmful content, yet existing methods often compromise generative quality. We introduce Semantic Surgery, a novel ...
Encoder-Decoder Diffusion Language Models for Efficient Training and Inference : Abstract: Discrete diffusion models enable parallel token sampling for faster inference than autoregressive approaches. However, prior diffusion models use a decoder-only architecture, which requires ...
Guardian: Decoupling Exploration from Safety in Reinforcement Learning : Abstract: Hybrid offline-online reinforcement learning (O2O RL) promises both sample efficiency and robust exploration, but suffers from instability due to distribution shift between offline and onli...
Long-Term PM2.5 Forecasting Using a DTW-Enhanced CNN-GRU Model : Abstract: Reliable long-term forecasting of PM2.5 concentrations is critical for public health early-warning systems, yet existing deep learning approaches struggle to maintain prediction stability be...
Batch Speculative Decoding Done Right : Abstract: Speculative decoding speeds up LLM inference by using a small draft model to propose multiple tokens that a target model verifies in parallel. Extending this idea to batches is essential for...
Learning Reconfigurable Representations for Multimodal Federated Learning with Missing Data : Abstract: Multimodal federated learning in real-world settings often encounters incomplete and heterogeneous data across clients. This results in misaligned local feature representations that limit th...
Language Server CLI Empowers Language Agents with Process Rewards : Abstract: Large language models routinely hallucinate APIs and mislocalize edits, while language servers compute verified, IDE-grade facts about real code. We present Lanser-CLI, a CLI-first orchestra...
Rethinking Inference Placement for Deep Learning across Edge and Cloud Platforms: A Multi-Objective Optimization Perspective and Future Directions : Abstract: Edge intelligent applications like VR/AR and language model based chatbots have become widespread with the rapid expansion of IoT and mobile devices. However, constrained edge devices often ...
HyPerNav: Hybrid Perception for Object-Oriented Navigation in Unknown Environment : Abstract: Object-oriented navigation (ObjNav) enables a robot to navigate to a target object directly and autonomously in an unknown environment. Effective perception in navigation in unknown environmen...
Gen-LangSplat: Generalized Language Gaussian Splatting with Pre-Trained Feature Compression : Abstract: Modeling open-vocabulary language fields in 3D is essential for intuitive human-AI interaction and querying within physical environments. State-of-the-art approaches, such as LangSplat, leve...
Robust Uncertainty Quantification for Self-Evolving Large Language Models via Continual Domain Pretraining : Abstract: Continual Learning (CL) is essential for enabling self-evolving large language models (LLMs) to adapt and remain effective amid rapid knowledge growth. Yet, despite its importance, little at...
Is Your Prompt Poisoning Code? Defect Induction Rates and Security Mitigation Strategies : Abstract: Large language models (LLMs) have become indispensable for automated code generation, yet the quality and security of their outputs remain a critical concern. Existing studies predominantly ...
PASS-Enhanced MEC: Joint Optimization of Task Offloading and Uplink PASS Beamforming : Abstract: A pinching-antenna system (PASS)-enhanced mobile edge computing (MEC) architecture is investigated to improve the task offloading efficiency and latency performance in dynamic wireless envir...
Manifold Approximation leads to Robust Kernel Alignment : Abstract: Centered kernel alignment (CKA) is a popular metric for comparing representations, determining equivalence of networks, and neuroscience research. However, CKA does not account for the under...
FAME: Fairness-aware Attention-modulated Video Editing : Abstract: Training-free video editing (VE) models tend to fall back on gender stereotypes when rendering profession-related prompts. We propose FAME for Fairness-aware Attention-modul...
CompressionAttack: Exploiting Prompt Compression as a New Attack Surface in LLM-Powered Agents : Abstract: LLM-powered agents often use prompt compression to reduce inference costs, but this introduces a new security risk. Compression modules, which are optimized for efficiency rather than safety...
MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs : Abstract: The widespread adoption of Large Language Models (LLMs) raises critical concerns about the factual accuracy of their outputs, especially in high-risk domains such as biomedicine, law, and ed...
Measuring Teaching with LLMs : Abstract: Objective and scalable measurement of teaching quality is a persistent challenge in education. While Large Language Models (LLMs) offer potential, general-purpose models have struggled to re...
The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination : Abstract: Enhancing the reasoning capabilities of Large Language Models (LLMs) is a key strategy for building Agents that "think then act." However, recent observations, like OpenAI's o3, suggest a pa...
USF-MAE: Ultrasound Self-Supervised Foundation Model with Masked Autoencoding : Abstract: Ultrasound imaging is one of the most widely used diagnostic modalities, offering real-time, radiation-free assessment across diverse clinical domains. However, interpretation of ultrasound ...
Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures : Abstract: We perform in-depth evaluations of in-context learning (ICL) on state-of-the-art transformer, state-space, and hybrid large language models over two categories of knowledge-based ICL tasks. ...
Softmax is $1/2$-Lipschitz: A tight bound across all $\ell_p$ norms : Abstract: The softmax function is a basic operator in machine learning and optimization, used in classification, attention mechanisms, reinforcement learning, game theory, and problems involving log-s...
MoEMeta: Mixture-of-Experts Meta Learning for Few-Shot Relational Learning : Abstract: Few-shot knowledge graph relational learning seeks to perform reasoning over relations given only a limited number of training examples. While existing approaches largely adopt a meta-learni...
Nested AutoRegressive Models : Abstract: AutoRegressive (AR) models have demonstrated competitive performance in image generation, achieving results comparable to those of diffusion models. However, their token-by-token image gener...
Efficient and Encrypted Inference using Binarized Neural Networks within In-Memory Computing Architectures : Abstract: Binarized Neural Networks (BNNs) are a class of deep neural networks designed to utilize minimal computational resources, which drives their popularity across various applications. Recent st...
A high-capacity linguistic steganography based on entropy-driven rank-token mapping : Abstract: Linguistic steganography enables covert communication through embedding secret messages into innocuous texts; however, current methods face critical limitations in payload capacity and secur...
Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning : Abstract: Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsi...
LLM Meets Diffusion: A Hybrid Framework for Crystal Material Generation : Abstract: Recent advances in generative modeling have shown significant promise in designing novel periodic crystal structures. Existing approaches typically rely on either large language models (LLMs...
Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients : Abstract: This note reconciles two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards: (1) direct REINFORCE-style ...
Quality-Aware Translation Tagging in Multilingual RAG system : Abstract: Multilingual Retrieval-Augmented Generation (mRAG) often retrieves English documents and translates them into the query language for low-resource settings. However, poor translation quality ...
Think before Recommendation: Autonomous Reasoning-enhanced Recommender : Abstract: The core task of recommender systems is to learn user preferences from historical user-item interactions. With the rapid development of large language models (LLMs), recent research has expl...
Leveraging Hierarchical Organization for Medical Multi-document Summarization : Abstract: Medical multi-document summarization (MDS) is a complex task that requires effectively managing cross-document relationships. This paper investigates whether incorporating hierarchical struc...
GroupSHAP-Guided Integration of Financial News Keywords and Technical Indicators for Stock Price Prediction : Abstract: Recent advances in finance-specific language models such as FinBERT have enabled the quantification of public sentiment into index-based measures, yet compressing diverse linguistic signals ...
Rethinking GSPO: The Perplexity-Entropy Equivalence : Abstract: We provide a new perspective on GSPO's length-normalized importance ratios by establishing their connection to information-theoretic quantities. We show that GSPO's sequence-level weight $s(...
Adapting Interleaved Encoders with PPO for Language-Guided Reinforcement Learning in BabyAI : Abstract: Deep reinforcement learning agents often struggle when tasks require understanding both vision and language. Conventional architectures typically isolate perception (for example, CNN-based v...
Enabling Vibration-Based Gesture Recognition on Everyday Furniture via Energy-Efficient FPGA Implementation of 1D Convolutional Networks : Abstract: The growing demand for smart home interfaces has increased interest in non-intrusive sensing methods like vibration-based gesture recognition. While prior studies demonstrated feasibility, t...
Beyond Direct Generation: A Decomposed Approach to Well-Crafted Screenwriting with LLMs : Abstract: The screenplay serves as the foundation for television production, defining narrative structure, character development, and dialogue. While Large Language Models (LLMs) show great potential ...
DREaM: Drug-Drug Relation Extraction via Transfer Learning Method : Abstract: Relation extraction between drugs plays a crucial role in identifying drug-drug interactions and predicting side effects. The advancement of machine learning methods in relation extraction, ...
PTPP-Aware Adaptation Scaling Laws: Predicting Domain-Adaptation Performance at Unseen Pre-Training Budgets : Abstract: Continual pre-training (CPT) for domain adaptation must balance target-domain gains with stability on the base domain. Existing CPT scaling laws typically assume a fixed pre-training budget,...
Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks : Abstract: Large language models (LLMs) have shown impressive promise in code generation, yet their progress remains limited by the shortage of large-scale datasets that are both diverse and well-align...
Accelerating Eigenvalue Dataset Generation via Chebyshev Subspace Filter : Abstract: Eigenvalue problems are among the most important topics in many scientific disciplines. With the recent surge and development of machine learning, neural eigenvalue methods have attracted si...
Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports : Abstract: Automating radiology report generation with Large Vision-Language Models (LVLMs) holds great potential, yet these models often produce clinically critical hallucinations, posing serious risk...
Progressive Growing of Patch Size: Curriculum Learning for Accelerated and Improved Medical Image Segmentation : Abstract: In this work, we introduce Progressive Growing of Patch Size, an automatic curriculum learning approach for 3D medical image segmentation. Our approach progressively increases the patch size...
Deep Active Inference with Diffusion Policy and Multiple Timescale World Model for Real-World Exploration and Navigation : Abstract: Autonomous robotic navigation in real-world environments requires exploration to acquire environmental information as well as goal-directed navigation in order to reach specified targets. Ac...
PAHQ: Accelerating Automated Circuit Discovery through Mixed-Precision Inference Optimization : Abstract: Circuit discovery, which involves identifying sparse and task-relevant subnetworks in pre-trained language models, is a cornerstone of mechanistic interpretability. Automated Circuit Discove...
A Novel Framework for Multi-Modal Protein Representation Learning : Abstract: Accurate protein function prediction requires integrating heterogeneous intrinsic signals (e.g., sequence and structure) with noisy extrinsic contexts (e.g., protein-protein interactions and...
ReconViaGen: Towards Accurate Multi-view 3D Object Reconstruction via Generation : Abstract: Existing multi-view 3D object reconstruction methods heavily rely on sufficient overlap between input views, where occlusions and sparse coverage in practice frequently yield severe reconstr...
Arabic Little STT: Arabic Children Speech Recognition Dataset : Abstract: The performance of Artificial Intelligence (AI) systems fundamentally depends on high-quality training data. However, low-resource languages like Arabic suffer from severe data scarcity. Mor...
Multitask Multimodal Self-Supervised Learning for Medical Images : Abstract: This thesis works to address a pivotal challenge in medical image analysis: the reliance on extensive labeled datasets, which are often limited due to the need for expert annotation and cons...
ZeroFlood: A Geospatial Foundation Model for Data-Efficient Flood Susceptibility Mapping : Abstract: Flood susceptibility mapping (FSM) is vital for disaster prevention but remains challenging in data-scarce regions where hydrodynamic models require dense geophysical inputs. This work intro...
Symbolic Neural Generation with Applications to Lead Discovery in Drug Design : Abstract: We investigate a relatively underexplored class of hybrid neurosymbolic models integrating symbolic learning with neural reasoning to construct data generators meeting formal correctness cri...
Detecting Religious Language in Climate Discourse : Abstract: Religious language continues to permeate contemporary discourse, even in ostensibly secular domains such as environmental activism and climate change debates. This paper investigates how exp...
EMTSF: Extraordinary Mixture of SOTA Models for Time Series Forecasting : Abstract: The immense success of the Transformer architecture in Natural Language Processing has led to its adoption in Time Series Forecasting (TSF), where superior performance has been shown. H...
Eigen-Value: Efficient Domain-Robust Data Valuation via Eigenvalue-Based Approach : Abstract: Data valuation has become central in the era of data-centric AI. It drives efficient training pipelines and enables objective pricing in data markets by assigning a numeric value to each dat...
Exploring Vulnerability in AI Industry : Abstract: The rapid ascent of Foundation Models (FMs), enabled by the Transformer architecture, drives the current AI ecosystem. Characterized by large-scale training and downstream adaptability, FMs ...
Human-Centric Anomaly Detection in Surveillance Videos Using YOLO-World and Spatio-Temporal Deep Learning : Abstract: Anomaly detection in surveillance videos remains a challenging task due to the diversity of abnormal events, class imbalance, and scene-dependent visual clutter. To address these issues, we ...
Automatic Assessment of Students' Classroom Engagement with Bias Mitigated Multi-task Model : Abstract: With the rise of online and virtual learning, monitoring and enhancing student engagement have become an important aspect of effective education. Traditional methods of assessing a student's...
Frequentist Validity of Epistemic Uncertainty Estimators : Abstract: Decomposing prediction uncertainty into its aleatoric (irreducible) and epistemic (reducible) components is critical for the development and deployment of machine learning systems. A popular...
Agentic Reinforcement Learning for Real-World Code Repair : Abstract: We tackle the challenge of training reliable code-fixing agents in real repositories, where complex builds and shifting dependencies make evaluation unstable. We developed a verifiable pipel...
Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models : Abstract: Large language models (LLMs) remain vulnerable to sophisticated prompt engineering attacks that exploit contextual framing to bypass safety mechanisms, posing significant risks in cybersecur...
QuArch: A Benchmark for Evaluating LLM Reasoning in Computer Architecture : Abstract: The field of computer architecture, which bridges high-level software abstractions and low-level hardware implementations, remains absent from current large language model (LLM) evaluations....
Mitigating Coordinate Prediction Bias from Positional Encoding Failures : Abstract: Multimodal large language models (MLLMs) excel at vision-language tasks such as VQA and document understanding, yet precise coordinate prediction remains challenging. High-resolution inputs ...
Discovering Latent Graphs with GFlowNets for Diverse Conditional Image Generation : Abstract: Capturing diversity is crucial in conditional and prompt-based image generation, particularly when conditions contain uncertainty that can lead to multiple plausible outputs. To generate div...
STAR-RIS-assisted Collaborative Beamforming for Low-altitude Wireless Networks : Abstract: While low-altitude wireless networks (LAWNs) based on uncrewed aerial vehicles (UAVs) offer high mobility, flexibility, and coverage for urban communications, they face severe signal attenua...
Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows : Abstract: Most approaches to long-context processing increase the complexity of the transformer's internal architecture by integrating mechanisms such as recurrence or auxiliary memory modules. In thi...
Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation : Abstract: We introduce Ling 2.0, a series reasoning-oriented language foundation built upon the principle that every activation boosts reasoning capability. Designed to scale from tens of billions to ...
When UAV Swarm Meets IRS: Collaborative Secure Communications in Low-altitude Wireless Networks : Abstract: Low-altitude wireless networks (LAWNs) represent a promising architecture that integrates unmanned aerial vehicles (UAVs) as aerial nodes to provide enhanced coverage, reliability, and throu...
GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation : Abstract: Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning, a prerequisite for many applications. Empirically...
Efficient Utility-Preserving Machine Unlearning with Implicit Gradient Surgery : Abstract: Machine unlearning (MU) aims to efficiently remove sensitive or harmful memory from a pre-trained model. The key challenge is to balance the potential tradeoff between unlearning efficacy an...
Probing Neural Combinatorial Optimization Models : Abstract: Neural combinatorial optimization (NCO) has achieved remarkable performance, yet its learned model representations and decision rationale remain a black box. This impedes both academic resea...
Power to the Clients: Federated Learning in a Dictatorship Setting : Abstract: Federated learning (FL) has emerged as a promising paradigm for decentralized model training, enabling multiple clients to collaboratively learn a shared model without exchanging their local...
Solving Continuous Mean Field Games: Deep Reinforcement Learning for Non-Stationary Dynamics : Abstract: Mean field games (MFGs) have emerged as a powerful framework for modeling interactions in large-scale multi-agent systems. Despite recent advancements in reinforcement learning (RL) for MFGs...
Scaling Non-Parametric Sampling with Representation : Abstract: Scaling and architectural advances have produced strikingly photorealistic image generative models, yet their mechanisms still remain opaque. Rather than advancing scaling, our goal is to st...
Multi-dataset Joint Pre-training of Emotional EEG Enables Generalizable Affective Computing : Abstract: Task-specific pre-training is essential when task representations diverge from generic pre-training features. Existing task-general pre-training EEG models struggle with complex tasks like e...
Bridging Perception and Reasoning: Dual-Pipeline Neuro-Symbolic Landing for UAVs in Cluttered Environments : Abstract: Autonomous landing in unstructured (cluttered, uneven, and map-poor) environments is a core requirement for Unmanned Aerial Vehicles (UAVs), yet purely vision-based or deep learning models o...
Right Place, Right Time: Market Simulation-based RL for Execution Optimisation : Abstract: Execution algorithms are vital to modern trading; they enable market participants to execute large orders while minimising market impact and transaction costs. As these algorithms grow more ...
LSPRAG: LSP-Guided RAG for Language-Agnostic Real-Time Unit Test Generation : Abstract: Automated unit test generation is essential for robust software development, yet existing approaches struggle to generalize across multiple programming languages and operate within real-time...
GALA: A GlobAl-LocAl Approach for Multi-Source Active Domain Adaptation : Abstract: Domain Adaptation (DA) provides an effective way to tackle target-domain tasks by leveraging knowledge learned from source domains. Recent studies have extended this paradigm to Multi-Source...
Estimating the Error of Large Language Models at Pairwise Text Comparison : Abstract: We measure LLMs' output error at pairwise text comparison, noting the probability of error in their preferences. Our method does not rely on the ground truth and supports two scenarios: (i) ...
Taming Silent Failures: A Framework for Verifiable AI Reliability : Abstract: The integration of Artificial Intelligence (AI) into safety-critical systems introduces a new reliability paradigm: silent failures, where AI produces confident but incorrect outputs that ca...
When Fewer Layers Break More Chains: Layer Pruning Harms Test-Time Scaling in LLMs : Abstract: Layer pruning has emerged as a widely adopted technique for improving the efficiency of large language models (LLMs). Although existing methods demonstrate strong performance retention on ge...
Rational Adversaries and the Maintenance of Fragility: A Game-Theoretic Theory of Rational Stagnation : Abstract: Cooperative systems often remain in persistently suboptimal yet stable states. This paper explains such "rational stagnation" as an equilibrium sustained by a rational adversary whose utilit...
PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading : Abstract: Large Language Models (LLMs) increasingly serve as research assistants, yet their reliability in scholarly tasks remains under-evaluated. In this work, we introduce PaperAsk, a benchmark tha...
Real-Time Semantic Segmentation on FPGA for Autonomous Vehicles Using LMIINet with the CGRA4ML Framework : Abstract: Semantic segmentation has emerged as a fundamental problem in computer vision, gaining particular importance in real-time applications such as autonomous driving. The main challenge is achie...
You Don't Need Prompt Engineering Anymore: The Prompting Inversion : Abstract: Prompt engineering, particularly Chain-of-Thought (CoT) prompting, significantly enhances LLM reasoning capabilities. We introduce "Sculpting," a constrained, rule-based prompting method des...
LUNA: Efficient and Topology-Agnostic Foundation Model for EEG Signal Analysis : Abstract: Electroencephalography (EEG) offers a non-invasive lens into human brain activity, but building large-scale models is hampered by topological heterogeneity: each public EEG dataset defines its ...
Epistemic Deep Learning: Enabling Machine Learning Models to Know When They Do Not Know : Abstract: Machine learning has achieved remarkable successes, yet its deployment in safety-critical domains remains hindered by an inherent inability to manage uncertainty, resulting in overconfident ...
PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding : Abstract: Patent text embeddings enable prior art search, technology landscaping, and patent analysis, yet existing benchmarks inadequately capture patent-specific challenges. We introduce PatenTEB, a...
A Multi-level Analysis of Factors Associated with Student Performance: A Machine Learning Approach to the SAEB Microdata : Abstract: Identifying the factors that influence student performance in basic education is a central challenge for formulating effective public policies in Brazil. This study introduces a multi-level ...
CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning : Abstract: Harnessing publicly available, large-scale web data, such as street view and satellite imagery, urban socio-economic sensing is of paramount importance for achieving global sustainable devel...
Supervised Fine-Tuning or In-Context Learning? Evaluating LLMs for Clinical NER : Abstract: We study clinical Named Entity Recognition (NER) on the CADEC corpus and compare three families of approaches: (i) BERT-style encoders (BERT Base, BioClinicalBERT, RoBERTa-large), (ii) GPT-4...
Does Homophily Help in Robust Test-time Node Classification? : Abstract: Homophily, the tendency of nodes from the same class to connect, is a fundamental property of real-world graphs, underpinning structural and semantic patterns in domains such as citation net...
T2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model : Abstract: Using risky text prompts, such as pornography and violent prompts, to test the safety of text-to-image (T2I) models is a critical task. However, existing risky prompt datasets are limited in...
AnyECG-Lab: An Exploration Study of Fine-tuning an ECG Foundation Model to Estimate Laboratory Values from Single-Lead ECG Signals : Abstract: Timely access to laboratory values is critical for clinical decision-making, yet current approaches rely on invasive venous sampling and are intrinsically delayed. Electrocardiography (ECG),...
LacMaterial: Large Language Models as Analogical Chemists for Materials Discovery : Abstract: Analogical reasoning, the transfer of relational structures across contexts (e.g., planet is to sun as electron is to nucleus), is fundamental to scientific discovery. Yet human insight is o...
Harnessing the Power of Large Language Models for Software Testing Education: A Focus on ISTQB Syllabus : Abstract: Software testing is a critical component in the software engineering field and is important for software engineering education. Thus, it is vital for academia to continuously improve and upd...
Multilingual Target-Stance Extraction : Abstract: Social media enables data-driven analysis of public opinion on contested issues. Target-Stance Extraction (TSE) is the task of identifying the target discussed in a document and the document...
Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction : Abstract: Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based methods typically map fMRI activity to a single high...
Toward Humanoid Brain-Body Co-design: Joint Optimization of Control and Morphology for Fall Recovery : Abstract: Humanoid robots represent a central frontier in embodied intelligence, as their anthropomorphic form enables natural deployment in humans' workspace. Brain-body co-design for humanoids prese...
FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation : Abstract: While Retrieval-Augmented Generation (RAG) mitigates hallucination and knowledge staleness in Large Language Models (LLMs), existing frameworks often falter on complex, multi-hop queries tha...
T2SMark: Balancing Robustness and Diversity in Noise-as-Watermark for Diffusion Models : Abstract: Diffusion models have advanced rapidly in recent years, producing high-fidelity images while raising concerns about intellectual property protection and the misuse of generative AI. Image wa...
BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles : Abstract: In this paper, we propose Bootstrapped Language-Image Pretraining-driven Fused State Representation in Proximal Policy Optimization (BLIP-FusePPO), a novel multimodal reinforcement learning ...
VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations : Abstract: Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully r...
TraceTrans: Translation and Spatial Tracing for Surgical Prediction : Abstract: Image-to-image translation models have achieved notable success in converting images across visual domains and are increasingly used for medical tasks such as predicting post-operative outco...
Efficient Large-Deformation Medical Image Registration via Recurrent Dynamic Correlation : Abstract: Deformable image registration estimates voxel-wise correspondences between images through spatial transformations, and plays a key role in medical imaging. While deep learning methods have s...
Dynamic Dropout: Leveraging Conway's Game of Life for Neural Networks Regularization : Abstract: Regularization techniques play a crucial role in preventing overfitting and improving the generalization performance of neural networks. Dropout, a widely used regularization technique, rand...
Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help? : Abstract: Assessing published academic journal articles is a common task for evaluations of departments and individuals. Whilst it is sometimes supported by citation data, Large Language Models (LLMs)...
Top-Down Semantic Refinement for Image Captioning : Abstract: Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. Thi...
Knowledge-guided Continual Learning for Behavioral Analytics Systems : Abstract: User behavior on online platforms is evolving, reflecting real-world changes in how people post, whether it's helpful messages or hate speech. Models that learn to capture this content can e...
Group size effects and collective misalignment in LLM multi-agent systems : Abstract: Multi-agent systems of large language models (LLMs) are rapidly expanding across domains, introducing dynamics not captured by single-agent evaluations. Yet, existing work has mostly contras...
PromptReverb: Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching : Abstract: Room impulse response (RIR) generation remains a critical challenge for creating immersive virtual acoustic environments. Current methods suffer from two fundamental limitations: the scarcit...
SmartMixed: A Two-Phase Training Strategy for Adaptive Activation Function Learning in Neural Networks : Abstract: The choice of activation function plays a critical role in neural networks, yet most architectures still rely on fixed, uniform activation functions across all neurons. We introduce SmartMix...
GraphTOP: Graph Topology-Oriented Prompting for Graph Neural Networks : Abstract: Graph Neural Networks (GNNs) have revolutionized the field of graph learning by learning expressive graph representations from massive graph data. As a common pattern to train powerful GNNs,...
Evaluating Multimodal Large Language Models on Core Music Perception Tasks : Abstract: Multimodal Large Language Models (LLMs) claim "musical understanding" via evaluations that conflate listening with score reading. We benchmark three SOTA LLMs (Gemini 2.5 Pro, Gemini 2.5 Fla...
Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints : Abstract: Full fine-tuning of Large Language Models (LLMs) is notoriously memory-intensive, primarily because conventional optimizers such as SGD or Adam assume access to exact gradients derived from ...
DynaPose4D: High-Quality 4D Dynamic Content Generation via Pose Alignment Loss : Abstract: Recent advancements in 2D and 3D generative models have expanded the capabilities of computer vision. However, generating high-quality 4D dynamic content from a single static image remains a...
CHOIR: Collaborative Harmonization fOr Inference Robustness : Abstract: Persona-assigned Large Language Models (LLMs) can adopt diverse roles, enabling personalized and context-aware reasoning. However, even minor demographic perturbations in personas, such as s...
Agent-GSPO: Communication-Efficient Multi-Agent Systems via Group Sequence Policy Optimization : Abstract: To combat the prohibitive communication costs of "free-for-all" multi-agent systems (MAS), we introduce Agent-GSPO, a framework that directly optimizes for token economy using sequ...
Single-Teacher View Augmentation: Boosting Knowledge Distillation via Angular Diversity : Abstract: Knowledge Distillation (KD) aims to train a lightweight student model by transferring knowledge from a large, high-capacity teacher. Recent studies have shown that leveraging diverse teacher...
An Analytic Theory of Quantum Imaginary Time Evolution : Abstract: The quantum imaginary time evolution (QITE) algorithm is one of the most promising variational quantum algorithms (VQAs), bridging the current era of Noisy Intermediate-Scale Quantum devices and...
Scalable Oversight via Partitioned Human Supervision : Abstract: As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training beco...
Accelerating Materials Design via LLM-Guided Evolutionary Search : Abstract: Materials discovery requires navigating vast chemical and structural spaces while satisfying multiple, often conflicting, objectives. We present LLM-guided Evolution for MAterials design (LL...
GateFuseNet: An Adaptive 3D Multimodal Neuroimaging Fusion Network for Parkinson's Disease Diagnosis : Abstract: Accurate diagnosis of Parkinson's disease (PD) from MRI remains challenging due to symptom variability and pathological heterogeneity. Most existing methods rely on conventional magnitude-ba...
Transitive RL: Value Learning via Divide and Conquer : Abstract: In this work, we present Transitive Reinforcement Learning (TRL), a new value learning algorithm based on a divide-and-conquer paradigm. TRL is designed for offline goal-conditioned reinforc...
Toward Robust Signed Graph Learning through Joint Input-Target Denoising : Abstract: Signed Graph Neural Networks (SGNNs) are widely adopted to analyze complex patterns in signed graphs with both positive and negative links. Given the noisy nature of real-world connections, ...
Open Multimodal Retrieval-Augmented Factual Image Generation : Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress in generating photorealistic and prompt-aligned images, but they often produce outputs that contradict verifiable knowledge, ...
Text to Trust: Evaluating Fine-Tuning and LoRA Trade-offs in Language Models for Unfair Terms of Service Detection : Abstract: Large Language Models (LLMs) have transformed text understanding, yet their adaptation to specialized legal domains remains constrained by the cost of full fine-tuning. This study provides a...
LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges? : Abstract: Large language models (LLMs) are now equipped with increasingly extended context windows, yet their long context understanding capabilities over long dependency tasks remain fundamental...
DDTR: Diffusion Denoising Trace Recovery : Abstract: With recent technological advances, process logs, which were traditionally deterministic in nature, are being captured from non-deterministic sources, such as uncertain sensors or machine le...
Blockchain Signatures to Ensure Information Integrity and Non-Repudiation in the Digital Era: A comprehensive study : Abstract: Blockchain systems rely on decentralized ledgers and strong security guarantees. A key requirement is non-repudiation, which prevents denial of transaction authorship and supports integrity ...
SPIRAL: Self-Play Incremental Racing Algorithm for Learning in Multi-Drone Competitions : Abstract: This paper introduces SPIRAL (Self-Play Incremental Racing Algorithm for Learning), a novel approach for training autonomous drones in multi-agent racing competitions. SPIRAL distinctively e...
Curriculum-Based Iterative Self-Play for Scalable Multi-Drone Racing : Abstract: The coordination of multiple autonomous agents in high-speed, competitive environments represents a significant engineering challenge. This paper presents CRUISE (Curriculum-Based Iterative ...
STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models : Abstract: Object state recognition aims to identify the specific condition of objects, such as their positional states (e.g., open or closed) and functional states (e.g., on or off). While recent Visi...
Combining Deep Learning and Explainable AI for Toxicity Prediction of Chemical Compounds : Abstract: The task here is to predict the toxicological activity of chemical compounds based on the Tox21 dataset, a benchmark in computational toxicology. After a domain-specific overview of chemic...
AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment : Abstract: We present AutoBench, a fully automated and self-sustaining framework for evaluating Large Language Models (LLMs) through reciprocal peer assessment. This paper provides a rigorous scientifi...
RoGER-SLAM: A Robust Gaussian Splatting SLAM System for Noisy and Low-light Environment Resilience : Abstract: The reliability of Simultaneous Localization and Mapping (SLAM) is severely constrained in environments where visual inputs suffer from noise and low illumination. Although recent 3D Gaussia...
Personal Care Utility (PCU): Building the Health Infrastructure for Everyday Insight and Guidance : Abstract: Building on decades of success in digital infrastructure and biomedical innovation, we propose the Personal Care Utility (PCU) - a cybernetic system for lifelong health guidance. PCU is conc...
Does In-IDE Calibration of Large Language Models work at Scale? : Abstract: The introduction of large language models into integrated development environments (IDEs) is revolutionizing software engineering, yet it poses challenges to the usefulness and reliability o...
PerCoR: Evaluating Commonsense Reasoning in Persian via Multiple-Choice Sentence Completion : Abstract: We introduced PerCoR (Persian Commonsense Reasoning), the first large-scale Persian benchmark for commonsense reasoning. PerCoR contains 106K multiple-choice sentence-completion problems dra...
Cross-Species Transfer Learning in Agricultural AI: Evaluating ZebraPose Adaptation for Dairy Cattle Pose Estimation : Abstract: Pose estimation serves as a cornerstone of computer vision for understanding animal posture, behavior, and welfare. Yet, agricultural applications remain constrained by the scarcity of large...
Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents : Abstract: AI agents powered by large language models (LLMs) are being deployed at scale, yet we lack a systematic understanding of how the choice of backbone LLM affects agent security. The non-determ...
Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks : Abstract: This paper presents a real-time modular defense system named Sentra-Guard. The system detects and mitigates jailbreak and prompt injection attacks targeting large language models (LLMs). The...
Integrating Linguistics and AI: Morphological Analysis and Corpus development of Endangered Toto Language of West Bengal : Abstract: Preserving linguistic diversity is necessary as every language offers a distinct perspective on the world. There have been numerous global initiatives to preserve endangered languages throug...
FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference : Abstract: Vision-language Models (VLMs) have made significant strides in visual understanding and query response generation, but often face challenges of high computational cost and inference latency ...
Enhancing Graph Classification Robustness with Singular Pooling : Abstract: Graph Neural Networks (GNNs) have achieved strong performance across a range of graph representation learning tasks, yet their adversarial robustness in graph classification remains underexp...
A Critical Study on Tea Leaf Disease Detection using Deep Learning Techniques : Abstract: The proposed solution is a deep learning technique that can classify three types of tea leaf diseases, two of which are caused by pests and one by pathogens (infec...
Variational Polya Tree : Abstract: Density estimation is essential for generative modeling, particularly with the rise of modern neural networks. While existing methods capture complex data distributions, they often lack inte...
Learning Without Augmenting: Unsupervised Time Series Representation Learning via Frame Projections : Abstract: Self-supervised learning (SSL) has emerged as a powerful paradigm for learning representations without labeled data. Most SSL approaches rely on strong, well-established, handcrafted data au...
SARCLIP: A Vision Language Foundation Model for Semantic Understanding and Target Recognition in SAR Imagery : Abstract: Synthetic Aperture Radar (SAR) has emerged as a crucial imaging modality due to its all-weather capabilities. While recent advancements in self-supervised learning and Masked Image Modeling ...
LVD-GS: Gaussian Splatting SLAM for Dynamic Scenes via Hierarchical Explicit-Implicit Representation Collaboration Rendering : Abstract: 3D Gaussian Splatting SLAM has emerged as a widely used technique for high-fidelity mapping in spatial intelligence. However, existing methods often rely on a single representation scheme, w...
Uncertainty-Aware Autonomous Vehicles: Predicting the Road Ahead : Abstract: Autonomous Vehicle (AV) perception systems have advanced rapidly in recent years, providing vehicles with the ability to accurately interpret their environment. Perception systems remain sus...
TABL-ABM: A Hybrid Framework for Synthetic LOB Generation : Abstract: The recent application of deep learning models to financial trading has heightened the need for high fidelity financial time series data. This synthetic data can be used to supplement histor...
FlowCritic: Bridging Value Estimation with Flow Matching in Reinforcement Learning : Abstract: Reliable value estimation serves as the cornerstone of reinforcement learning (RL) by evaluating long-term returns and guiding policy improvement, significantly influencing the convergence s...
Step2Motion: Locomotion Reconstruction from Pressure Sensing Insoles : Abstract: Human motion is fundamentally driven by continuous physical interaction with the environment. Whether walking, running, or simply standing, the forces exchanged between our feet and the grou...
Embedding Trust: Semantic Isotropy Predicts Nonfactuality in Long-Form Text Generation : Abstract: To deploy large language models (LLMs) in high-stakes application domains that require substantively accurate responses to open-ended prompts, we need reliable, computationally inexpensive m...
Understanding Network Behaviors through Natural Language Question-Answering : Abstract: Modern large-scale networks introduce significant complexity in understanding network behaviors, increasing the risk of misconfiguration. Prior work proposed to understand network behaviors ...
Deep Literature Survey Automation with an Iterative Workflow : Abstract: Automatic literature survey generation has attracted increasing attention, yet most existing systems follow a one-shot paradigm, where a large set of papers is retrieved at once and a static...
Software Engineering Agents for Embodied Controller Generation: A Study in Minigrid Environments : Abstract: Software Engineering Agents (SWE-Agents) have proven effective for traditional software engineering tasks with accessible codebases, but their performance for embodied tasks requiring well-d...
TOM-SWE: User Mental Modeling For Software Engineering Agents : Abstract: Recent advances in coding agents have made them capable of planning, editing, running, and testing complex code bases. Despite their growing ability in coding tasks, these systems still stru...
Structure-Aware Cooperative Ensemble Evolutionary Optimization on Combinatorial Problems with Multimodal Large Language Models : Abstract: Evolutionary algorithms (EAs) have proven effective in exploring the vast solution spaces typical of graph-structured combinatorial problems. However, traditional encoding schemes, such as b...
Enabling Robust In-Context Memory and Rapid Task Adaptation in Transformers with Hebbian and Gradient-Based Plasticity : Abstract: Large language models display in-context learning as an emergent effect of scale, but they rely on static weights during inference. In contrast, biological systems continually adapt via syna...
A Comparison of Conversational Models and Humans in Answering Technical Questions: the Firefox Case : Abstract: The use of Large Language Models (LLMs) to support tasks in software development has steadily increased over recent years. From assisting developers in coding activities to providing convers...
AutoSciDACT: Automated Scientific Discovery through Contrastive Embedding and Hypothesis Testing : Abstract: Novelty detection in large scientific datasets faces two key challenges: the noisy and high-dimensional nature of experimental data, and the necessity of making statistically robust statemen...
Towards Low-Latency and Adaptive Ransomware Detection Using Contrastive Learning : Abstract: Ransomware has become a critical threat to cybersecurity due to its rapid evolution, the necessity for early detection, and growing diversity, posing significant challenges to traditional de...
ArchISMiner: A Framework for Automatic Mining of Architectural Issue-Solution Pairs from Online Developer Communities : Abstract: Stack Overflow (SO), a leading online community forum, is a rich source of software development knowledge. However, locating architectural knowledge, such as architectural solutions remains ...
Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models : Abstract: Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning and has become a standard post-training paradigm for contemporar...
Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks : Abstract: Despite recent advances, Large Language Models remain vulnerable to jailbreak attacks that bypass alignment safeguards and elicit harmful outputs. While prior research has proposed various a...
Two-Steps Diffusion Policy for Robotic Manipulation via Genetic Denoising : Abstract: Diffusion models, such as diffusion policy, have achieved state-of-the-art results in robotic manipulation by imitating expert demonstrations. While diffusion models were originally develope...
Is Temporal Difference Learning the Gold Standard for Stitching in RL? : Abstract: Reinforcement learning (RL) promises to solve long-horizon tasks even when training data contains only short fragments of the behaviors. This experience stitching capability is often viewed ...
From Black-box to Causal-box: Towards Building More Interpretable Models : Abstract: Understanding the predictions made by deep learning models remains a central challenge, especially in high-stakes applications. A promising approach is to equip models with the ability to an...
Impact and Implications of Generative AI for Enterprise Architects in Agile Environments: A Systematic Literature Review : Abstract: Generative AI (GenAI) is reshaping enterprise architecture work in agile software organizations, yet evidence on its effects remains scattered. We report a systematic literature review (SLR)...
Reconnaissance Automatique des Langues des Signes : Une Approche Hybridée CNN-LSTM Basée sur Mediapipe : Abstract: Sign languages play a crucial role in the communication of deaf communities, but they are often marginalized, limiting access to essential services such as healthcare and education. This stu...
Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models : Abstract: Discrete optimization-based jailbreaking attacks on large language models aim to generate short, nonsensical suffixes that, when appended onto input prompts, elicit disallowed content. Notab...
Normalization in Attention Dynamics : Abstract: We study the effect of normalization schemes on token representations in deep transformers. Modeling their evolution as interacting particles on the sphere, we show that normalization acts a...
Online Optimization for Offline Safe Reinforcement Learning : Abstract: We study the problem of Offline Safe Reinforcement Learning (OSRL), where the goal is to learn a reward-maximizing policy from fixed data under a cumulative cost constraint. We propose a nov...
Differentiable Constraint-Based Causal Discovery : Abstract: Causal discovery from observational data is a fundamental task in artificial intelligence, with far-reaching implications for decision-making, predictions, and interventions. Despite signifi...
Emotions Where Art Thou: Understanding and Characterizing the Emotional Latent Space of Large Language Models : Abstract: This work investigates how large language models (LLMs) internally represent emotion by analyzing the geometry of their hidden-state space. The paper identifies a low-dimensional emotional m...
VLM-SlideEval: Evaluating VLMs on Structured Comprehension and Perturbation Sensitivity in PPT : Abstract: Vision-language models (VLMs) are increasingly used to evaluate multimodal content, including presentation slides, yet their slide-specific understanding remains underexplored despite their...
CLIN-LLM: A Safety-Constrained Hybrid Framework for Clinical Diagnosis and Treatment Generation : Abstract: Accurate symptom-to-disease classification and clinically grounded treatment recommendations remain challenging, particularly in heterogeneous patient settings with high diagnostic risk. Exi...
SwiftSolve: A Self-Iterative, Complexity-Aware Multi-Agent Framework for Competitive Programming : Abstract: Correctness alone is insufficient: LLM-generated programs frequently satisfy unit tests while violating contest time or memory budgets. We present SwiftSolve, a complexity-aware multi-agent ...
Do Stop Me Now: Detecting Boilerplate Responses with a Single Iteration : Abstract: Large Language Models (LLMs) often expend significant computational resources generating boilerplate responses, such as refusals, simple acknowledgements and casual greetings, which adds unn...
Atlas Urban Index: A VLM-Based Approach for Spatially and Temporally Calibrated Urban Development Monitoring : Abstract: We introduce the Atlas Urban Index (AUI), a metric for measuring urban development computed using Sentinel-2 satellite imagery. Existing approaches, such as ...
RaCoT: Plug-and-Play Contrastive Example Generation Mechanism for Enhanced LLM Reasoning Reliability : Abstract: Retrieval-Augmented Generation (RAG) faces a core bottleneck with knowledge-sparse and semantically ambiguous long-tail queries, where retrieval noise distorts reasoning and necessitates cos...
Critical Insights into Leading Conversational AI Models : Abstract: Large Language Models (LLMs) are changing the way businesses use software, the way people live their lives and the way industries work. Companies like Google, High-Flyer, Anthropic, OpenAI and...
Multi-Modal Fact-Verification Framework for Reducing Hallucinations in Large Language Models : Abstract: While Large Language Models have transformed how we interact with AI systems, they suffer from a critical flaw: they confidently generate false information that sounds entirely plausible. Th...
Jarvis: Towards Personalized AI Assistant via Personal KV-Cache Retrieval : Abstract: The rapid development of Vision-language models (VLMs) enables open-ended perception and reasoning. Recent works have started to investigate how to adapt general-purpose VLMs into personaliz...
How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations : Abstract: AI agents are continually optimized for tasks related to human work, such as software engineering and professional writing, signaling a pressing trend with significant impacts on the human w...
Agentic Meta-Orchestrator for Multi-task Copilots : Abstract: Microsoft Copilot suites serve as the universal entry point for various agents skilled in handling important tasks, ranging from assisting a customer with product purchases to detecting vuln...
Will Humanity Be Rendered Obsolete by AI? : Abstract: This article analyzes the existential risks artificial intelligence (AI) poses to humanity, tracing the trajectory from current AI to ultraintelligence. Drawing on Irving J. Good and Nick Bo...
HRM-Agent: Training a recurrent reasoning model in dynamic environments using reinforcement learning : Abstract: The Hierarchical Reasoning Model (HRM) has impressive reasoning abilities given its small size, but has only been applied to supervised, static, fully-observable problems. One of HRM's stren...
Toward Agents That Reason About Their Computation : Abstract: While reinforcement learning agents can achieve superhuman performance in many complex tasks, they typically do not become more computationally efficient as they improve. In contrast, humans...
Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes : Abstract: Multimodal large language models (MLLMs) have demonstrated strong capabilities on vision-and-language tasks. However, recent findings reveal an imbalance in their reasoning capabilities acro...
Lyapunov Function-guided Reinforcement Learning for Flight Control : Abstract: A cascaded online learning flight control system has been developed and enhanced with respect to action smoothness. In this paper, we investigate the convergence performance of the control s...
Exploring Structures of Inferential Mechanisms through Simplistic Digital Circuits : Abstract: Cognitive studies and artificial intelligence have developed distinct models for various inferential mechanisms (categorization, induction, abduction, causal inference, contrast, merge, ...)...
On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset : Abstract: Generalization across Agentic tool-calling environments remains a key unsolved challenge in developing reliable agentic reasoning systems. While large language models (LLMs) demonstrate stro...
GTR-Mamba: Geometry-to-Tangent Routing for Hyperbolic POI Recommendation : Abstract: Next Point-of-Interest (POI) recommendation is a critical task in modern Location-Based Social Networks (LBSNs), aiming to model the complex decision-making process of human mobility to prov...
Multi-Agent Conditional Diffusion Model with Mean Field Communication as Wireless Resource Allocation Planner : Abstract: In wireless communication systems, efficient and adaptive resource allocation plays a crucial role in enhancing overall Quality of Service (QoS). While centralized Multi-Agent Reinforcement ...
Exploring Semantic-constrained Adversarial Example with Instruction Uncertainty Reduction : Abstract: Recently, semantically constrained adversarial examples (SemanticAE), which are directly generated from natural language instructions, have become a promising avenue for future research due ...
ProfileXAI: User-Adaptive Explainable AI : Abstract: ProfileXAI is a model- and domain-agnostic framework that couples post-hoc explainers (SHAP, LIME, Anchor) with retrieval - augmented LLMs to produce explanations for different types of user...
From Prompt Optimization to Multi-Dimensional Credibility Evaluation: Enhancing Trustworthiness of Chinese LLM-Generated Liver MRI Reports : Abstract: Large language models (LLMs) have demonstrated promising performance in generating diagnostic conclusions from imaging findings, thereby supporting radiology reporting, trainee education, an...
Mixed Density Diffuser: Efficient Planning with Non-uniform Temporal Resolution : Abstract: Recent studies demonstrate that diffusion planners benefit from sparse-step planning over single-step planning. Training models to skip steps in their trajectories helps capture long-term de...
A Survey of AI Scientists: Surveying the automatic Scientists and Research : Abstract: Artificial intelligence is undergoing a profound transition from a computational instrument to an autonomous originator of scientific knowledge. This emerging paradigm, the AI scientist, is ...
TLCD: A Deep Transfer Learning Framework for Cross-Disciplinary Cognitive Diagnosis : Abstract: Driven by the dual principles of smart education and artificial intelligence technology, the online education model has rapidly emerged as an important component of the education industry. C...
Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards : Abstract: Generating high-quality code remains a challenge for Large Language Models (LLMs). For the evolution of reasoning models on this task, reward models are a necessary intermediate step. These ...
Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs : Abstract: Scientific Large Language Models (Sci-LLMs) have emerged as a promising frontier for accelerating biological discovery. However, these models face a fundamental challenge when processing raw...
Guiding Skill Discovery with Foundation Models : Abstract: Learning diverse skills without hand-crafted reward functions could accelerate reinforcement learning in downstream tasks. However, existing skill discovery methods focus solely on maximizin...
AUPO - Abstracted Until Proven Otherwise: A Reward Distribution Based Abstraction Algorithm : Abstract: We introduce a novel, drop-in modification to Monte Carlo Tree Search's (MCTS) decision policy that we call AUPO. Comparisons based on a range of IPPC benchmark problems show that AUPO clear...
Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach : Abstract: While several high profile video games have served as testbeds for Deep Reinforcement Learning (DRL), this technique has rarely been employed by the game industry for crafting authentic AI b...
Accelerating IC Thermal Simulation Data Generation via Block Krylov and Operator Action : Abstract: Recent advances in data-driven approaches, such as neural operators (NOs), have shown substantial efficacy in reducing the solution time for integrated circuit (IC) thermal simulations. Howe...
CNOT Minimal Circuit Synthesis: A Reinforcement Learning Approach : Abstract: CNOT gates are fundamental to quantum computing, as they facilitate entanglement, a crucial resource for quantum algorithms. Certain classes of quantum circuits are constructed exclusively f...
Planning Ahead with RSA: Efficient Signalling in Dynamic Environments by Projecting User Awareness across Future Timesteps : Abstract: Adaptive agent design offers a way to improve human-AI collaboration on time-sensitive tasks in rapidly changing environments. In such cases, to ensure the human maintains an accurate unders...
Opinion Mining Based Entity Ranking using Fuzzy Logic Algorithmic Approach : Abstract: Opinions are central to almost all human activities and are key influencers of our behaviors. Nowadays, due to the growth of social networking websites and the increase in the number of e-commerce...
AutoStreamPipe: LLM Assisted Automatic Generation of Data Stream Processing Pipelines : Abstract: Data pipelines are essential in stream processing as they enable the efficient collection, processing, and delivery of real-time data, supporting rapid data analysis. In this paper, we prese...
Bid2X: Revealing Dynamics of Bidding Environment in Online Advertising from A Foundation Model Lens : Abstract: Auto-bidding is crucial in facilitating online advertising by automatically providing bids for advertisers. While previous work has made great efforts to model bidding environments for bette...
Causal Deep Q Network : Abstract: Deep Q Networks (DQN) have shown remarkable success in various reinforcement learning tasks. However, their reliance on associative learning often leads to the acquisition of spurious correl...
A Neuro-Symbolic Multi-Agent Approach to Legal-Cybersecurity Knowledge Integration : Abstract: The growing intersection of cybersecurity and law creates a complex information space where traditional legal research tools struggle to deal with nuanced connections between cases, statutes...
What are the odds? Risk and uncertainty about AI existential risk : Abstract: This work is a commentary on the article "AI Survival Stories: a Taxonomic Analysis of AI Existential Risk" (https://doi.org/10.18716/ojs/phai/2025.2801) by Cappelen, Goldstein, and Hawt...
Policy-Aware Generative AI for Safe, Auditable Data Access Governance : Abstract: Enterprises need access decisions that satisfy least privilege, comply with regulations, and remain auditable. We present a policy aware controller that uses a large language model (LLM) to ...
Human-AI Collaborative Uncertainty Quantification : Abstract: AI predictive systems are increasingly embedded in decision making pipelines, shaping high stakes choices once made solely by humans. Yet robust decisions under uncertainty still rely on cap...
Are Agents Just Automata? On the Formal Equivalence Between Agentic AI and the Chomsky Hierarchy : Abstract: This paper establishes a formal equivalence between the architectural classes of modern agentic AI systems and the abstract machines of the Chomsky hierarchy. We posit that the memory archit...
Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier : Abstract: The recent advancement of Multimodal Large Language Models (MLLMs) is transforming human-computer interaction (HCI) from surface-level exchanges into more nuanced and emotionally intelligent...
Toward Carbon-Neutral Human AI: Rethinking Data, Computation, and Learning Paradigms for Sustainable Intelligence : Abstract: The rapid advancement of Artificial Intelligence (AI) has led to unprecedented computational demands, raising significant environmental and ethical concerns. This paper critiques the prevail...
When No Paths Lead to Rome: Benchmarking Systematic Neural Relational Reasoning : Abstract: Designing models that can learn to reason in a systematic way is an important and long-standing challenge. In recent years, a wide range of solutions have been proposed for the specific case...
JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence : Abstract: The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for ad...
OntoPret: An Ontology for the Interpretation of Human Behavior : Abstract: As human machine teaming becomes central to paradigms like Industry 5.0, a critical need arises for machines to safely and effectively interpret complex human behaviors. A research gap curre...
ReCode: Unify Plan and Action for Universal Granularity Control : Abstract: Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-le...
Reduced AI Acceptance After the Generative AI Boom: Evidence From a Two-Wave Survey Study : Abstract: The rapid adoption of generative artificial intelligence (GenAI) technologies has led many organizations to integrate AI into their products and services, often without considering user pref...
Multi-Agent Evolve: LLM Self-Improve through Co-evolution : Abstract: Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs). However, the success of RL for LLMs heavily relies...
Alita-G: Self-Evolving Generative Agent for Agent Generation : Abstract: Large language models (LLMs) have been shown to perform better when scaffolded into agents with memory, tools, and feedback. Beyond this, self-evolving agents have emerged, but current work ...
An AI enhanced approach to the tree unimodality conjecture : Abstract: Given a graph $G$, its independence sequence is the integral sequence $a_1,a_2,...,a_n$, where $a_i$ is the number of independent sets of vertices of size $i$. In the late 80's Alavi, Erdős, M...
BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills : Abstract: High quality bugs are key to training the next generation of language model based software engineering (SWE) agents. We introduce a novel method for synthetic generation of difficult and div...
A Feature Engineering Approach for Business Impact-Oriented Failure Detection in Distributed Instant Payment Systems : Abstract: Instant payment infrastructures have stringent performance requirements, processing millions of transactions daily with zero-downtime expectations. Traditional monitoring approaches fail to ...
DecoupleSearch: Decouple Planning and Search via Hierarchical Reward Modeling : Abstract: Retrieval-Augmented Generation (RAG) systems have emerged as a pivotal methodology for enhancing Large Language Models (LLMs) through the dynamic integration of external knowledge. To furthe...
Beyond IVR Touch-Tones: Customer Intent Routing using LLMs : Abstract: Widespread frustration with rigid touch-tone Interactive Voice Response (IVR) systems for customer service underscores the need for more direct and intuitive language interaction. While spee...
AI-Enhanced Operator Assistance for UNICOS Applications : Abstract: This project explores the development of an AI-enhanced operator assistant for UNICOS, CERN's UNified Industrial Control System. While powerful, UNICOS presents a number of challenges, inclu...
GAMER PAT: Research as a Serious Game : Abstract: As generative AI increasingly outperforms students in producing academic writing, a critical question arises: how can we preserve the motivation, creativity, and intellectual growth of novic...
AquaVLM: Improving Underwater Situation Awareness with Mobile Vision Language Models : Abstract: Underwater activities like scuba diving enable millions annually to explore marine environments for recreation and scientific research. Maintaining situational awareness and effective commun...
Your Dense Retriever is Secretly an Expeditious Reasoner : Abstract: Dense retrievers enhance retrieval by encoding queries and documents into continuous vectors, but they often struggle with reasoning-intensive queries. Although Large Language Models (LLMs) ...
Modeling Bias Evolution in Fashion Recommender Systems: A System Dynamics Approach : Abstract: Bias in recommender systems not only distorts user experience but also perpetuates and amplifies existing societal stereotypes, particularly in sectors like fashion e-commerce. This study em...
CustomIR: Unsupervised Fine-Tuning of Dense Embeddings for Known Document Corpora : Abstract: Dense embedding models have become critical for modern information retrieval, particularly in RAG pipelines, but their performance often degrades when applied to specialized corpora outside ...
A phase-aware AI car-following model for electric vehicles with adaptive cruise control: Development and validation using real-world data : Abstract: Internal combustion engine (ICE) vehicles and electric vehicles (EVs) exhibit distinct vehicle dynamics. EVs provide rapid acceleration, with electric motors producing peak power across a wi...
Learn2Drive: A neural network-based framework for socially compliant automated vehicle control : Abstract: This study introduces a novel control framework for adaptive cruise control (ACC) in automated driving, leveraging Long Short-Term Memory (LSTM) networks and physics-informed constraints. As...
Next-Generation LLM for UAV: From Natural Language to Autonomous Flight : Abstract: With the rapid advancement of Large Language Models (LLMs), their capabilities in various automation domains, particularly Unmanned Aerial Vehicle (UAV) operations, have garnered increasing ...
Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models : Abstract: Data visualizations are vital components of many scientific articles and news stories. Current vision-language models (VLMs) still struggle on basic data visualization understanding tasks, b...
J-ORA: A Framework and Multimodal Dataset for Japanese Object Identification, Reference, Action Prediction in Robot Perception : Abstract: We introduce J-ORA, a novel multimodal dataset that bridges the gap in robot perception by providing detailed object attribute annotations within Japanese human-robot dialogue scenarios. J-O...
Proportion and Perspective Control for Flow-Based Image Generation : Abstract: While modern text-to-image diffusion models generate high-fidelity images, they offer limited control over the spatial and geometric structure of the output. To address this, we introduce an...
OCR-Quality: A Human-Annotated Dataset for OCR Quality Assessment : Abstract: We present OCR-Quality, a comprehensive human-annotated dataset designed for evaluating and developing OCR quality assessment methods. The dataset consists of 1,000 PDF pages converted to PN...
Face-MakeUpV2: Facial Consistency Learning for Controllable Text-to-Image Generation : Abstract: In facial image generation, current text-to-image models often suffer from facial attribute leakage and insufficient physical consistency when responding to local semantic instructions. In t...
What Causes Postoperative Aspiration? : Abstract: Background: Aspiration, the inhalation of foreign material into the lungs, significantly impacts surgical patient morbidity and mortality. This study develops a machine learning (ML) model t...
Bridging Accuracy and Interpretability: Deep Learning with XAI for Breast Cancer Detection : Abstract: In this study, we present an interpretable deep learning framework for the early detection of breast cancer using quantitative features extracted from digitized fine needle aspirate (FNA) im...
EdgeSync: Accelerating Edge-Model Updates for Data Drift through Adaptive Continuous Learning : Abstract: Real-time video analytics systems typically deploy lightweight models on edge devices to reduce latency. However, the distribution of data features may change over time due to various factor...
Noise Aggregation Analysis Driven by Small-Noise Injection: Efficient Membership Inference for Diffusion Models : Abstract: Diffusion models have demonstrated powerful performance in generating high-quality images. A typical example is a text-to-image generator such as Stable Diffusion. However, their widespread use a...
EventFormer: A Node-graph Hierarchical Attention Transformer for Action-centric Video Event Prediction : Abstract: Script event induction, which aims to predict the subsequent event based on the context, is a challenging task in NLP, achieving remarkable success in practical applications. However, human ...
Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making : Abstract: We explore the use of expert-guided bandit learning, which we refer to as online mixture-of-experts (OMoE). In this setting, given a context, a candidate committee of experts must determine ...
Variance-Reduction Guidance: Sampling Trajectory Optimization for Diffusion Models : Abstract: Diffusion models have become emerging generative models. Their sampling process involves multiple steps, and in each step the models predict the noise from a noisy sample. When the models ma...
2D_3D Feature Fusion via Cross-Modal Latent Synthesis and Attention Guided Restoration for Industrial Anomaly Detection : Abstract: Industrial anomaly detection (IAD) increasingly benefits from integrating 2D and 3D data, but robust cross-modal fusion remains challenging. We propose a novel unsupervised framework, Multi-...
Token-Level Inference-Time Alignment for Vision-Language Models : Abstract: Vision-Language Models (VLMs) have become essential backbones of modern multimodal intelligence, yet their outputs remain prone to hallucination: plausible text misaligned with visual inputs....
Xihe: Scalable Zero-Shot Time Series Learner Via Hierarchical Interleaved Block Attention : Abstract: The rapid advancement of time series foundation models (TSFMs) has been propelled by migrating architectures from language models. While existing TSFMs demonstrate impressive performance, th...
A Physics-Guided AI Cascaded Corrector Model Significantly Extends Madden-Julian Oscillation Prediction Skill : Abstract: The Madden-Julian Oscillation (MJO) is an important driver of global weather and climate extremes, but its prediction in operational dynamical models remains challenging, with skillful forec...
Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning : Abstract: Current mainstream approaches to addressing multimodal imbalance primarily focus on architectural modifications and optimization-based methods, often overlooking a quantitative analysis of the imbal...
DiffGRM: Diffusion-based Generative Recommendation Model : Abstract: Generative recommendation (GR) is an emerging paradigm that represents each item via a tokenizer as an n-digit semantic ID (SID) and predicts the next item by autoregressively generating its...
Frame-Difference Guided Dynamic Region Perception for CLIP Adaptation in Text-Video Retrieval : Abstract: With the rapid growth of video data, text-video retrieval technology has become increasingly important in numerous application scenarios such as recommendation and search. Early text-video r...
Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs : Abstract: Recent breakthroughs in reasoning models have markedly advanced the reasoning capabilities of large language models, particularly via training on tasks with verifiable rewards. Yet, a signif...
Semantic Relation-Enhanced CLIP Adapter for Domain Adaptive Zero-Shot Learning : Abstract: The high cost of data annotation has spurred research on training deep learning models in data-limited scenarios. Existing paradigms, however, fail to balance cross-domain transfer and cross...
Hybrid Deep Learning Framework for Enhanced Diabetic Retinopathy Detection: Integrating Traditional Features with AI-driven Insights : Abstract: Diabetic Retinopathy (DR), a vision-threatening complication of Diabetes Mellitus (DM), is a major global concern, particularly in India, which has one of the highest diabetic populations. ...
Comparative Analysis of Object Detection Algorithms for Surface Defect Detection : Abstract: This article compares the performance of six prominent object detection algorithms, YOLOv11, RetinaNet, Fast R-CNN, YOLOv8, RT-DETR, and DETR, on the NEU-DET surface defect detection dataset...
Unifying Inductive, Cross-Domain, and Multimodal Learning for Robust and Generalizable Recommendation : Abstract: Recommender systems have long been built upon the modeling of interactions between users and items, while recent studies have sought to broaden this paradigm by generalizing to new users and...
SITS-DECO: A Generative Decoder Is All You Need For Multitask Satellite Image Time Series Modelling : Abstract: Earth Observation (EO) Foundation Modelling (FM) holds great promise for simplifying and improving the use of EO data for diverse real-world tasks. However, most existing models require addi...
Gestura: A LVLM-Powered System Bridging Motion and Semantics for Real-Time Free-Form Gesture Understanding : Abstract: Free-form gesture understanding is highly appealing for human-computer interaction, as it liberates users from the constraints of predefined gesture categories. However, the sole existing so...
HDR Image Reconstruction using an Unsupervised Fusion Model : Abstract: High Dynamic Range (HDR) imaging aims to reproduce the wide range of brightness levels present in natural scenes, which the human visual system can perceive but conventional digital cameras ...
Unlocking Biomedical Insights: Hierarchical Attention Networks for High-Dimensional Data Interpretation : Abstract: The proliferation of high-dimensional datasets in fields such as genomics, healthcare, and finance has created an urgent need for machine learning models that are both highly accurate and in...
Prompt fidelity of ChatGPT4o / Dall-E3 text-to-image visualisations : Abstract: This study examines the prompt fidelity of ChatGPT4o / DALL-E3 text-to-image visualisations by analysing whether attributes explicitly specified in autogenously generated prompts are correct...
Wavelet-based GAN Fingerprint Detection using ResNet50 : Abstract: Identifying images generated by Generative Adversarial Networks (GANs) has become a significant challenge in digital image forensics. This research presents a wavelet-based detection method ...
Explainable Deep Learning in Medical Imaging: Brain Tumor and Pneumonia Detection : Abstract: Deep Learning (DL) holds enormous potential for improving medical imaging diagnostics, yet the lack of interpretability in most models hampers clinical trust and adoption. This paper present...
Precise classification of low quality G-banded Chromosome Images by reliability metrics and data pruning classifier : Abstract: In the last decade, due to high-resolution cameras and accurate metaphase analyses, the accuracy of chromosome classification has improved substantially. However, current Karyotyping system...
GAPO: Group Adaptive Policy Optimization for Real-World Code Edit : Abstract: Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods like GRPO are popular for their critic-free, normalize...
A Multimodal, Multitask System for Generating E Commerce Text Listings from Images : Abstract: Manually generating catchy descriptions and names is a labor-intensive and slow process for retailers. Although generative AI provides an automation solution in the form of Vision to Language Mo...
Evaluating ChatGPT's Performance in Classifying Pneumonia from Chest X-Ray Images : Abstract: In this study, we evaluate the ability of OpenAI's gpt-4o model to classify chest X-ray images as either NORMAL or PNEUMONIA in a zero-shot setting, without any prior fine-tuning. A balanced...
Training data membership inference via Gaussian process meta-modeling: a post-hoc analysis approach : Abstract: Membership inference attacks (MIAs) test whether a data point was part of a model's training set, posing serious privacy risks. Existing methods often depend on shadow models or heavy query ...
TowerVision: Understanding and Improving Multilinguality in Vision-Language Models : Abstract: Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this wo...
Poisson Flow Consistency Training : Abstract: The Poisson Flow Consistency Model (PFCM) is a consistency-style model based on the robust Poisson Flow Generative Model++ (PFGM++) which has achieved success in unconditional image generati...
Privacy-preserving Decision-focused Learning for Multi-energy Systems : Abstract: Decision-making for multi-energy system (MES) dispatch depends on accurate load forecasting. Traditionally, load forecasting and decision-making for MES are implemented separately. Forecasti...
Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence : Abstract: We present Butter-Bench, a benchmark evaluating large language model (LLM) controlled robots for practical intelligence, defined as the ability to navigate the messiness of the physical worl...
The Mirror Loop: Recursive Non-Convergence in Generative Reasoning Systems : Abstract: Large language models are often described as capable of reflective reasoning, yet recursive self-evaluation without external feedback frequently yields reformulation rather than progress. We...
A Multi-Stage Hybrid Framework for Automated Interpretation of Multi-View Engineering Drawings Using Vision Language Model : Abstract: Engineering drawings are fundamental to manufacturing communication, serving as the primary medium for conveying design intent, tolerances, and production details. However, interpreting comp...
Addressing Corner Cases in Autonomous Driving: A World Model-based Approach with Mixture of Experts and LLMs : Abstract: Accurate and reliable motion forecasting is essential for the safe deployment of autonomous vehicles (AVs), particularly in rare but safety-critical scenarios known as corner cases. Existing...
GuitarFlow: Realistic Electric Guitar Synthesis From Tablatures via Flow Matching and Style Transfer : Abstract: Music generation in the audio domain using artificial intelligence (AI) has witnessed steady progress in recent years. However, for some instruments, particularly the guitar, controllable ins...
A Physics-Informed Neural Network Approach for UAV Path Planning in Dynamic Environments : Abstract: Unmanned aerial vehicles (UAVs) operating in dynamic wind fields must generate safe and energy-efficient trajectories under physical and environmental constraints. Traditional planners, such...
AI Powered Urban Green Infrastructure Assessment Through Aerial Imagery of an Industrial Township : Abstract: Accurate assessment of urban canopy coverage is crucial for informed urban planning, effective environmental monitoring, and mitigating the impacts of climate change. Traditional practices o...
TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge : Abstract: Recent years have witnessed an increasing interest in image-text contrastive modeling, exemplified by models such as Contrastive Language-Image Pretraining (CLIP). In this paper, we propose ...
Language Ranker: A Lightweight Ranking framework for LLM Decoding : Abstract: Conventional research on large language models (LLMs) has primarily focused on refining output distributions, while paying less attention to the decoding process that transforms these distri...
Framework for Machine Evaluation of Reasoning Completeness in Large Language Models For Classification Tasks : Abstract: The growing adoption of machine learning (ML) in sensitive domains has heightened the demand for transparent and interpretable artificial intelligence. Large Language Models (LLMs) are incre...
Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning : Abstract: Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety...
Generative AI in Depth: A Survey of Recent Advances, Model Variants, and Real-World Applications : Abstract: In recent years, deep learning based generative models, particularly Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models (DMs), have been instrument...
The Principles of Diffusion Models : Abstract: This monograph presents the core principles that have guided the development of diffusion models, tracing their origins and showing how diverse formulations arise from shared mathematical id...
A Multi-Component AI Framework for Computational Psychology: From Robust Predictive Modeling to Deployed Generative Dialogue : Abstract: The confluence of Artificial Intelligence and Computational Psychology presents an opportunity to model, understand, and interact with complex human psychological states through computationa...
PREFINE: Personalized Story Generation via Simulated User Critics and User-Specific Rubric Generation : Abstract: While recent advances in Large Language Models (LLMs) have improved the quality of creative text generation, significant challenges remain in producing personalized stories that reflect indi...
SIGN: Schema-Induced Games for Naming : Abstract: Real-world AI systems are tackling increasingly complex problems, often through interactions among large language model (LLM) agents. When these agents develop inconsistent conventions, coor...
Capability Ceilings in Autoregressive Language Models: Empirical Evidence from Knowledge-Intensive Tasks : Abstract: We document empirical capability ceilings in decoder-only autoregressive language models across knowledge-intensive tasks. Systematic evaluation of OPT and Pythia model families (70M-30B par...
GeoThought: A Dataset for Enhancing Mathematical Geometry Reasoning in Vision-Language Models : Abstract: Large language models (LLMs) have demonstrated strong reasoning capabilities in text-based mathematical problem solving; however, when adapted to visual reasoning tasks, particularly geometr...
Exploration through Generation: Applying GFlowNets to Structured Search : Abstract: This work applies Generative Flow Networks (GFlowNets) to three graph optimization problems: the Traveling Salesperson Problem, Minimum Spanning Tree, and Shortest Path. GFlowNets are genera...
Computational Hardness of Reinforcement Learning with Partial $q^{\pi}$-Realizability : Abstract: This paper investigates the computational complexity of reinforcement learning in a novel linear function approximation regime, termed partial $q^{\pi}$-realizability. In this framework, the...
Performance Trade-offs of Optimizing Small Language Models for E-Commerce : Abstract: Large Language Models (LLMs) offer state-of-the-art performance in natural language understanding and generation tasks. However, the deployment of leading commercial models for specialized t...
Distribution Shift Alignment Helps LLMs Simulate Survey Response Distributions : Abstract: Large language models (LLMs) offer a promising way to simulate human survey responses, potentially reducing the cost of large-scale data collection. However, existing zero-shot methods suffe...
Foundation of Intelligence: Review of Math Word Problems from Human Cognition Perspective : Abstract: Math word problems (MWPs) have been a fundamental research topic in artificial intelligence (AI) since the 1960s. This research aims to advance the reasoning abilities of AI by mirroring ...
LightAgent: Mobile Agentic Foundation Models : Abstract: With the advancement of multimodal large language models (MLLMs), building GUI agent systems has become an increasingly promising direction, especially for mobile platforms, given their rich ...
LLM-AR: LLM-powered Automated Reasoning Framework : Abstract: Large language models (LLMs) can already identify patterns and reason effectively, yet their variable accuracy hampers adoption in high-stakes decision-making applications. In this paper, we...
Predictive Coding Enhances Meta-RL To Achieve Interpretable Bayes-Optimal Belief Representation Under Partial Observability : Abstract: Learning a compact representation of history is critical for planning and generalization in partially observable environments. While meta-reinforcement learning (RL) agents can attain near B...
HW/SW Co-design of a PCM/PWM converter: a System Level Approach based in the SpecC Methodology : Abstract: We present a case study applying the SpecC methodology within a system-level hardware/software co-design flow to a PCM-to-PWM converter, the core of a Class-D audio amplifier. The converter ...
Towards Error-Centric Intelligence II: Energy-Structured Causal Models : Abstract: Contemporary machine learning optimizes for predictive accuracy, yet systems that achieve state of the art performance remain causally opaque: their internal representations provide no princ...
Energy-Efficient Domain-Specific Artificial Intelligence Models and Agents: Pathways and Paradigms : Abstract: The field of artificial intelligence (AI) has taken a tight hold on broad aspects of society, industry, business, and governance in ways that dictate the prosperity and might of the world's ...
Embracing Trustworthy Brain-Agent Collaboration as Paradigm Extension for Intelligent Assistive Technologies : Abstract: Brain-Computer Interfaces (BCIs) offer a direct communication pathway between the human brain and external devices, holding significant promise for individuals with severe neurological impai...
Controllable Mathematical Reasoning via Self-Optimizing Thought Vectors : Abstract: We present a novel approach for controllable mathematical reasoning that leverages self-optimizing thought vectors with entropy minimization. Our method introduces learnable thought vectors ...
Measure what Matters: Psychometric Evaluation of AI with Situational Judgment Tests : Abstract: AI psychometrics evaluates AI systems in roles that traditionally require emotional judgment and ethical consideration. Prior work often reuses human trait inventories (Big Five, HEXACO) or...
Dopamine-driven synaptic credit assignment in neural networks : Abstract: Solving the synaptic Credit Assignment Problem (CAP) is central to learning in both biological and artificial neural systems. Finding an optimal solution for synaptic CAP means setting the sy...
OptiTree: Hierarchical Thoughts Generation with Tree Search for LLM Optimization Modeling : Abstract: Optimization modeling is one of the most crucial but technical parts of operations research (OR). To automate the modeling process, existing works have leveraged large language models (LLMs)...
PACR: Progressively Ascending Confidence Reward for LLM Reasoning : Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved LLM reasoning, but its sparse, outcome-based reward provides no guidance for intermediate steps, slowing expl...
VietLyrics: A Large-Scale Dataset and Models for Vietnamese Automatic Lyrics Transcription : Abstract: Automatic Lyrics Transcription (ALT) for Vietnamese music presents unique challenges due to its tonal complexity and dialectal variations, but remains largely unexplored due to the lack of a...
Graph-Coarsening Approach for the Capacitated Vehicle Routing Problem with Time Windows : Abstract: The Capacitated Vehicle Routing Problem with Time Windows (CVRPTW) is a fundamental NP-hard optimization problem in logistics. Solving large-scale instances remains computationally challengi...
LIFT: Interpretable truck driving risk prediction with literature-informed fine-tuned LLMs : Abstract: This study proposes an interpretable prediction framework with literature-informed fine-tuned (LIFT) LLMs for truck driving risk prediction. The framework integrates an LLM-driven Inference ...
DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry : Abstract: Solid geometry problem solving demands spatial mathematical reasoning that integrates spatial intelligence and symbolic reasoning. However, most existing multimodal mathematical reasoning be...
Reasoning Models Reason Well, Until They Don't : Abstract: Large language models (LLMs) have shown significant progress in reasoning tasks. However, recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed ...
Modeling Hierarchical Thinking in Large Reasoning Models : Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities when they generate step-by-step solutions, known as chain-of-thought (CoT) reasoning. When trained using chai...
Learning "Partner-Aware" Collaborators in Multi-Party Collaboration : Abstract: Large Language Models (LLMs) are increasingly being deployed in agentic settings where they act as collaborators with humans. Therefore, it is increasingly important to be able to evaluate t...
OFFSIDE: Benchmarking Unlearning Misinformation in Multimodal Large Language Models : Abstract: Advances in Multimodal Large Language Models (MLLMs) intensify concerns about data privacy, making Machine Unlearning (MU), the selective removal of learned information, a critical necessity...
ATOM: AdapTive and OptiMized dynamic temporal knowledge graph construction using LLMs : Abstract: In today's rapidly expanding data landscape, knowledge extraction from unstructured text is vital for real-time analytics, temporal inference, and dynamic memory frameworks. However, traditi...
A Framework for Quantifying How Pre-Training and Context Benefit In-Context Learning : Abstract: Pre-trained large language models have demonstrated a strong ability to learn from context, known as in-context learning (ICL). Despite a surge of recent applications that leverage such capa...

Research Sources: 850 | Generated: 10/28/2025