AI RESEARCH PAPERS & ACADEMIC SOURCES
- LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization : Abstract: Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely ...
- Layout Anything: One Transformer for Universal Room Layout Estimation : Abstract: We present Layout Anything, a transformer-based framework for indoor layout estimation that adapts the OneFormer's universal segmentation architecture to geometric structure prediction. Our ...
- A Lightweight Real-Time Low-Light Enhancement Network for Embedded Automotive Vision Systems : Abstract: In low-light environments like nighttime driving, image degradation severely challenges in-vehicle camera safety. Since existing enhancement algorithms are often too computationally intensiv...
- BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection : Abstract: Integrating LiDAR and camera information in the bird's eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity ...
- InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration : Abstract: Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions often rely on human intervention...
- U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences : Abstract: Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat a...
- GraphFusion3D: Dynamic Graph Attention Convolution with Adaptive Cross-Modal Transformer for 3D Object Detection : Abstract: Despite significant progress in 3D object detection, point clouds remain challenging due to sparse data, incomplete structures, and limited semantic information. Capturing contextual relatio...
- TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond : Abstract: Prevailing 3D texture generation methods, which often rely on multi-view fusion, are frequently hindered by inter-view inconsistencies and incomplete coverage of complex surfaces, limiting t...
- DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling : Abstract: Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interactio...
- DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images : Abstract: Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, kno...
- SurfFill: Completion of LiDAR Point Clouds via Gaussian Surfel Splatting : Abstract: LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, the capturing is susceptible to miss smal...
- Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks : Abstract: When applied sequentially to video, frame-based networks often exhibit temporal inconsistency - for example, outputs that flicker between frames. This problem is amplified when the network i...
- AutoBrep: Autoregressive B-Rep Generation with Unified Topology and Geometry : Abstract: The boundary representation (B-Rep) is the standard data structure used in Computer-Aided Design (CAD) for defining solid models. Despite recent progress, directly generating B-Reps end-to-e...
- Unrolled Networks are Conditional Probability Flows in MRI Reconstruction : Abstract: Magnetic Resonance Imaging (MRI) offers excellent soft-tissue contrast without ionizing radiation, but its long acquisition time limits clinical utility. Recent methods accelerate MRI by und...
- MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation : Abstract: We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive systems and are limited to produ...
- MultiShotMaster: A Controllable Multi-Shot Video Generation Framework : Abstract: Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controlla...
- OneThinker: All-in-one Reasoning Model for Image and Video : Abstract: Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically tra...
- CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models : Abstract: Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view-consistency remains unclear. In this w...
- MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues : Abstract: We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of diffusion models ...
- VIGS-SLAM: Visual Inertial Gaussian Splatting SLAM : Abstract: We present VIGS-SLAM, a visual-inertial 3D Gaussian Splatting SLAM system that achieves robust real-time tracking and high-fidelity reconstruction. Although recent 3DGS-based SLAM methods ac...
- SAM2Grasp: Resolve Multi-modal Grasping via Prompt-conditioned Temporal Action Prediction : Abstract: Imitation learning for robotic grasping is often plagued by the multimodal problem: when a scene contains multiple valid targets, demonstrations of grasping different objects create conflict...
- Real-Time Multimodal Data Collection Using Smartwatches and Its Visualization in Education : Abstract: Wearable sensors, such as smartwatches, have become increasingly prevalent across domains like healthcare, sports, and education, enabling continuous monitoring of physiological and behavior...
- Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols : Abstract: Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally,...
- Permutation-Aware Action Segmentation via Unsupervised Frame-to-Segment Alignment : Abstract: This paper presents an unsupervised transformer-based framework for temporal activity segmentation which leverages not only frame-level cues but also segment-level cues. This is in contrast ...
- 3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation : Abstract: The increasing demand for controllable outputs in text-to-image generation has spurred advancements in multi-instance generation (MIG), allowing users to define both instance layouts and att...
- A Multi-Weight Self-Matching Visual Explanation for CNNs on SAR Images : Abstract: In recent years, convolutional neural networks (CNNs) have achieved significant success in various synthetic aperture radar (SAR) tasks. However, the complexity and opacity of their internal...
- WSCF-MVCC: Weakly-supervised Calibration-free Multi-view Crowd Counting : Abstract: Multi-view crowd counting can effectively mitigate occlusion issues that commonly arise in single-image crowd counting. Existing deep-learning multi-view crowd counting methods project diffe...
- SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains : Abstract: Domain generalization for semantic segmentation aims to mitigate the degradation in model performance caused by domain shifts. However, in many real-world scenarios, we are unable to access ...
- On-the-fly Feedback SfM: Online Explore-and-Exploit UAV Photogrammetry with Incremental Mesh Quality-Aware Indicator and Predictive Path Planning : Abstract: Compared with conventional offline UAV photogrammetry, real-time UAV photogrammetry is essential for time-critical geospatial applications such as disaster response and active digital-twin m...
- From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking : Abstract: End-to-end multi-object tracking (MOT) methods have recently achieved remarkable progress by unifying detection and association within a single framework. Despite their strong detection perf...
- Reproducing and Extending RaDelft 4D Radar with Camera-Assisted Labels : Abstract: Recent advances in 4D radar highlight its potential for robust environment perception under adverse conditions, yet progress in radar semantic segmentation remains constrained by the scarcit...
- Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch : Abstract: Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learni...
- Nav-$R^2$ Dual-Relation Reasoning for Generalizable Open-Vocabulary Object-Goal Navigation : Abstract: Object-goal navigation in open-vocabulary settings requires agents to locate novel objects in unseen environments, yet existing approaches suffer from opaque decision-making processes and lo...
- Generalizing Vision-Language Models with Dedicated Prompt Guidance : Abstract: Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and dom...
- GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning : Abstract: With the rapid development of Large Vision Language Models, the focus of Graphical User Interface (GUI) agent tasks shifts from single-screen tasks to complex screen navigation challenges. H...
- Temporal Dynamics Enhancer for Directly Trained Spiking Object Detectors : Abstract: Spiking Neural Networks (SNNs), with their brain-inspired spatiotemporal dynamics and spike-driven computation, have emerged as promising energy-efficient alternatives to Artificial Neural N...
- nuScenes Revisited: Progress and Challenges in Autonomous Driving : Abstract: Autonomous Vehicles (AV) and Advanced Driver Assistance Systems (ADAS) have been revolutionized by Deep Learning. As a data-driven approach, Deep Learning relies on vast amounts of driving d...
- ClusterStyle: Modeling Intra-Style Diversity with Prototypical Clustering for Stylized Motion Generation : Abstract: Existing stylized motion generation models have shown their remarkable ability to understand specific style information from the style motion, and insert it into the content motion. However,...
- Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation : Abstract: Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-...
- Vision to Geometry: 3D Spatial Memory for Sequential Embodied MLLM Reasoning and Exploration : Abstract: Existing research on indoor embodied tasks typically requires agents to actively explore unknown environments and reason about the scene to achieve a specific goal. However, when deployed in...
- TGDD: Trajectory Guided Dataset Distillation with Balanced Distribution : Abstract: Dataset distillation compresses large datasets into compact synthetic ones to reduce storage and computational costs. Among various approaches, distribution matching (DM)-based methods have ...
- G-SHARP: Gaussian Surgical Hardware Accelerated Real-time Pipeline : Abstract: We propose G-SHARP, a commercially compatible, real-time surgical scene reconstruction framework designed for minimally invasive procedures that require fast and accurate 3D modeling of defo...
- YingVideo-MV: Music-Driven Multi-Stage Video Generation : Abstract: While diffusion models for audio-driven avatar video generation have achieved notable progress in synthesizing long sequences with natural audio-visual synchronization and identity consistency...
- Attention-guided reference point shifting for Gaussian-mixture-based partial point set registration : Abstract: This study investigates the impact of the invariance of feature vectors for partial-to-partial point set registration under translation and rotation of input point sets, particularly in the ...
- A Large Scale Benchmark for Test Time Adaptation Methods in Medical Image Segmentation : Abstract: Test time Adaptation is a promising approach for mitigating domain shift in medical image segmentation; however, current evaluations remain limited in terms of modality coverage, task divers...
- dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model : Abstract: Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world's vast stores of structured knowledge. This process, which encompasses ...
- GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding : Abstract: Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hinderin...
- Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling : Abstract: In computer vision, Single Image Super-Resolution (SISR) is still a difficult problem. We present ViT-SR, a new technique to improve the performance of a Vision Transformer (ViT) employing a...
- SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts : Abstract: The emergence of large vision-language models (VLMs) has significantly enhanced the efficiency and flexibility of geospatial interpretation. However, general-purpose VLMs remain suboptimal f...
- On the Problem of Consistent Anomalies in Zero-Shot Anomaly Detection : Abstract: Zero-shot anomaly classification and segmentation (AC/AS) aim to detect anomalous samples and regions without any training data, a capability increasingly crucial in industrial inspection an...
- WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens : Abstract: Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While metho...
- AVGGT: Rethinking Global Attention for Accelerating VGGT : Abstract: Since DUSt3R, models such as VGGT and $π^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse...
- OmniPerson: Unified Identity-Preserving Pedestrian Generation : Abstract: Person re-identification (ReID) suffers from a lack of large-scale high-quality training data due to challenges in data privacy and annotation costs. While previous approaches have explored ...
- Co-speech Gesture Video Generation via Motion-Based Graph Retrieval : Abstract: Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data....
- Content-Aware Texturing for Gaussian Splatting : Abstract: Gaussian Splatting has become the method of choice for 3D reconstruction and real-time rendering of captured real scenes. However, fine appearance details need to be represented as a large n...
- RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence : Abstract: Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation model...
- PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding : Abstract: PowerPoint presentations combine rich textual content with structured visual layouts, making them a natural testbed for evaluating the multimodal reasoning and layout understanding abilities...
- Leveraging Large-Scale Pretrained Spatial-Spectral Priors for General Zero-Shot Pansharpening : Abstract: Existing deep learning methods for remote sensing image fusion often suffer from poor generalization when applied to unseen datasets due to the limited availability of real training data and...
- PoreTrack3D: A Benchmark for Dynamic 3D Gaussian Splatting in Pore-Scale Facial Trajectory Tracking : Abstract: We introduce PoreTrack3D, the first benchmark for dynamic 3D Gaussian splatting in pore-scale, non-rigid 3D facial trajectory tracking. It contains over 440,000 facial trajectories in total,...
- Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation : Abstract: Vision-language models (VLMs) like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patc...
- PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes : Abstract: Polarization-aware Neural Radiance Fields (NeRF) enable novel view synthesis of specular-reflection scenes but face challenges in slow training, inefficient rendering, and strong dependencie...
- UAUTrack: Towards Unified Multimodal Anti-UAV Visual Tracking : Abstract: Research in Anti-UAV (Unmanned Aerial Vehicle) tracking has explored various modalities, including RGB, TIR, and RGB-T fusion. However, a unified framework for cross-modal collaboration is s...
- PGP-DiffSR: Phase-Guided Progressive Pruning for Efficient Diffusion-based Image Super-Resolution : Abstract: Although diffusion-based models have achieved impressive results in image super-resolution, they often rely on large-scale backbones such as Stable Diffusion XL (SDXL) and Diffusion Transfor...
- Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance : Abstract: Recent advances in object-centric representation learning have shown that slot attention-based methods can effectively decompose visual scenes into object slot representations without superv...
- ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data : Abstract: Anomaly segmentation seeks to detect and localize unknown or out-of-distribution (OoD) objects that fall outside predefined semantic classes, a capability essential for safe autonomous drivin...
- GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization : Abstract: Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits ...
- Tissue-mask supported inter-subject whole-body image registration in the UK Biobank - A method benchmarking study : Abstract: The UK Biobank is a large-scale study collecting whole-body MR imaging and non-imaging health data. Robust and accurate inter-subject image registration of these whole-body MR images would e...
- GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding : Abstract: Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image ...
- AttMetNet: Attention-Enhanced Deep Neural Network for Methane Plume Detection in Sentinel-2 Satellite Imagery : Abstract: Methane is a powerful greenhouse gas that contributes significantly to global warming. Accurate detection of methane emissions is the key to taking timely action and minimizing their impact ...
- Rethinking Surgical Smoke: A Smoke-Type-Aware Laparoscopic Video Desmoking Method and Dataset : Abstract: Electrocautery or lasers will inevitably generate surgical smoke, which hinders the visual guidance of laparoscopic videos for surgical procedures. The surgical smoke can be classified into ...
- TrackNetV5: Residual-Driven Spatio-Temporal Refinement and Motion Direction Decoupling for Fast Object Tracking : Abstract: The TrackNet series has established a strong baseline for fast-moving small object tracking in sports. However, existing iterations face significant limitations: V1-V3 struggle with occlusio...
- UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits : Abstract: With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in Image Editing, the performance gap between closed-source and open-source models is wide...
- HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval : Abstract: Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target ...
- IC-World: In-Context Generation for Shared World Modeling : Abstract: Video-based world models have recently garnered increasing attention for their ability to synthesize diverse and dynamic visual environments. In this paper, we focus on shared world modeling...
- PhyCustom: Towards Realistic Physical Customization in Text-to-Image Generation : Abstract: Recent diffusion-based text-to-image customization methods have achieved significant success in understanding concrete concepts to control generation processes, such as styles and shapes. Ho...
- Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video? : Abstract: Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, a...
- RFOP: Rethinking Fusion and Orthogonal Projection for Face-Voice Association : Abstract: Face-voice association in multilingual environment challenge 2026 aims to investigate the face-voice association task in multilingual scenario. The challenge introduces English-German face-v...
- MICCAI STSR 2025 Challenge: Semi-Supervised Teeth and Pulp Segmentation and CBCT-IOS Registration : Abstract: Cone-Beam Computed Tomography (CBCT) and Intraoral Scanning (IOS) are essential for digital dentistry, but annotated data scarcity limits automated solutions for pulp canal segmentation and ...
- Taming Camera-Controlled Video Generation with Verifiable Geometry Reward : Abstract: Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcem...
- MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm : Abstract: We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achie...
- Polar Perspectives: Evaluating 2-D LiDAR Projections for Robust Place Recognition with Visual Foundation Models : Abstract: This work presents a systematic investigation into how alternative LiDAR-to-image projections affect metric place recognition when coupled with a state-of-the-art vision foundation model. We...
- Glance: Accelerating Diffusion Models with 1 Sample : Abstract: Diffusion models have achieved remarkable success in image generation, yet their deployment remains constrained by the heavy computational cost and the need for numerous inference steps. Pre...
- DiverseAR: Boosting Diversity in Bitwise Autoregressive Image Generation : Abstract: In this paper, we investigate the underexplored challenge of sample diversity in autoregressive (AR) generative models with bitwise visual tokenizers. We first analyze the factors that limit...
- Mirror, Mirror on the Wall -- Which is the Best Model of Them All? : Abstract: Large Language Models (LLMs) have become one of the most transformative tools across many applications, as they have significantly boosted productivity and achieved impressive results in var...
- Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models : Abstract: Diffusion Language Models (DLMs) have recently achieved significant success due to their any-order generation capabilities. However, existing inference methods typically rely on local, immed...
- Dialect Identification Using Resource-Efficient Fine-Tuning Approaches : Abstract: Dialect Identification (DI) is a task to recognize different dialects within the same language from a speech signal. DI can help to improve the downstream speech related tasks even when spea...
- Swivuriso: The South African Next Voices Multilingual Speech Dataset : Abstract: This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech...
- Lightweight Latent Reasoning for Narrative Tasks : Abstract: Large language models (LLMs) tackle complex tasks by generating long chains of thought or "reasoning traces" that act as latent variables in the generation of an output given a query. A mode...
- CAIRNS: Balancing Readability and Scientific Accuracy in Climate Adaptation Question Answering : Abstract: Climate adaptation strategies are proposed in response to climate change. They are practised in agriculture to sustain food production. These strategies can be found in unstructured data (fo...
- When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers : Abstract: Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers improving solver performance by selecting high-quality answers from a pool of candidates. ...
- TaleFrame: An Interactive Story Generation System with Fine-Grained Control and Large Language Models : Abstract: With the advancement of natural language generation (NLG) technologies, creative story generation systems have gained increasing attention. However, current systems often fail to accurately ...
- What Signals Really Matter for Misinformation Tasks? Evaluating Fake-News Detection and Virality Prediction under Real-World Constraints : Abstract: We present an evaluation-driven study of two practical tasks regarding online misinformation: (i) fake-news detection and (ii) virality prediction in the context of operational settings, wit...
- DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models : Abstract: We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follo...
- From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks : Abstract: Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, r...
- Spoken Conversational Agents with Large Language Models : Abstract: Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. We frame ada...
- Input Order Shapes LLM Semantic Alignment in Multi-Document Summarization : Abstract: Large language models (LLMs) are now used in settings such as Google's AI Overviews, where they summarize multiple long documents. However, it remains unclear whether they weight all inputs e...
- PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models : Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs). Currently, many newly introduced PEFT methods are challenging to replicate, deploy...
- Towards Unification of Hallucination Detection and Fact Verification for Large Language Models : Abstract: Large Language Models (LLMs) frequently exhibit hallucinations, generating content that appears fluent and coherent but is factually incorrect. Such errors undermine trust and hinder their a...
- Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension : Abstract: Dialogue-Based Generalized Referring Expressions Comprehension (GREC) requires models to ground the expression and unlimited targets in complex visual scenes while resolving coreference acro...
- TriLex: A Framework for Multilingual Sentiment Analysis in Low-Resource South African Languages : Abstract: Low-resource African languages remain underrepresented in sentiment analysis, limiting both lexical coverage and the performance of multilingual Natural Language Processing (NLP) systems. Th...
- SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment : Abstract: Aligning Large Language Models (LLMs) with human preferences typically relies on external supervision, which faces critical limitations: human annotations are scarce and subjective, reward m...
- A benchmark dataset for evaluating Syndrome Differentiation and Treatment in large language models : Abstract: The emergence of Large Language Models (LLMs) within the Traditional Chinese Medicine (TCM) domain presents an urgent need to assess their clinical application capabilities. However, such ev...
- BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion : Abstract: The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoke...
- promptolution: A Unified, Modular Framework for Prompt Optimization : Abstract: Prompt optimization has become crucial for enhancing the performance of large language models (LLMs) across a broad range of tasks. Although many research papers show its effectiveness, prac...
- Bangla Hate Speech Classification with Fine-tuned Transformer Models : Abstract: Hate speech recognition in low-resource languages remains a difficult problem due to insufficient datasets, orthographic heterogeneity, and linguistic variety. Bangla is spoken by more than ...
- Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning : Abstract: Majority voting has proven effective for close-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code...
- Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules : Abstract: Diffusion large language models (dLLMs) offer a promising alternative to autoregressive models, but their practical utility is severely hampered by slow, iterative sampling. We present SchED...
- AutoNeural: Co-Designing Vision-Language Models for NPU Inference : Abstract: While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision-Language Models (VLMs) tailored for GPUs often falter on these substrates. We att...
- Misalignment of LLM-Generated Personas with Human Perceptions in Low-Resource Settings : Abstract: Recent advances enable Large Language Models (LLMs) to generate AI personas, yet their lack of deep contextual, cultural, and emotional understanding poses a significant limitation. This stu...
- Factor(T,U): Factored Cognition Strengthens Monitoring of Untrusted AI : Abstract: The field of AI Control seeks to develop robust control protocols, deployment safeguards for untrusted AI which may be intentionally subversive. However, existing protocols that rely on weak...
- LeechHijack: Covert Computational Resource Exploitation in Intelligent Agent Systems : Abstract: Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in reasoning, planning, and tool usage. The recently proposed Model Context Protocol (MCP) has emerged as a ...
- See, Think, Learn: A Self-Taught Multimodal Reasoner : Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in integrating visual perception with language understanding. However, effective multimodal reasoning requires both accurate p...
- Probabilistic energy profiler for statically typed JVM-based programming languages : Abstract: Energy consumption is a growing concern in several fields, from mobile devices to large data centers. Developers need detailed data on the energy consumption of their software to mitigate co...
- Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities : Abstract: While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-ima...
- Unifying Linear-Time Attention via Latent Probabilistic Modelling : Abstract: Transformers have achieved state-of-the-art results across a range of domains, but their quadratic attention mechanism poses significant challenges for long-sequence modelling. Recent effort...
- Context-Enriched Contrastive Loss: Enhancing Presentation of Inherent Sample Connections in Contrastive Learning Framework : Abstract: Contrastive learning has gained popularity and pushes state-of-the-art performance across numerous large-scale benchmarks. In contrastive learning, the contrastive loss function plays a pivo...
- FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges : Abstract: Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of ...
- Mapping of Lesion Images to Somatic Mutations : Abstract: Medical imaging is a critical initial tool used by clinicians to determine a patient's cancer diagnosis, allowing for faster intervention and more reliable patient prognosis. At subsequent s...
- RobustSurg: Tackling domain generalisation for out-of-distribution surgical scene segmentation : Abstract: While recent advances in deep learning for surgical scene segmentation have demonstrated promising results on single-centre and single-imaging modality data, these methods usually do not gen...
- Towards Unified Video Quality Assessment : Abstract: Recent works in video quality assessment (VQA) typically employ monolithic models that predict a single quality score for each test video. These approaches cannot provide diagnosti...
- Exploring the Potentials of Spiking Neural Networks for Image Deraining : Abstract: Biologically plausible and energy-efficient frameworks such as Spiking Neural Networks (SNNs) have not been sufficiently explored in low-level vision tasks. Taking image deraining as an exam...
- TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction : Abstract: 3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online...
- Representation of Inorganic Synthesis Reactions and Prediction: Graphical Framework and Datasets : Abstract: While machine learning has enabled the rapid prediction of inorganic materials with novel properties, the challenge of determining how to synthesize these materials remains largely unsolved....
- Flexible Gravitational-Wave Parameter Estimation with Transformers : Abstract: Gravitational-wave data analysis relies on accurate and efficient methods to extract physical information from noisy detector signals, yet the increasing rate and complexity of observations ...
- Learning Physically Consistent Lagrangian Control Models Without Acceleration Measurements : Abstract: This article investigates the modeling and control of Lagrangian systems involving non-conservative forces using a hybrid method that does not require acceleration calculations. It focuses i...
- Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization : Abstract: Traditional randomized A/B experiments assign arms with uniform random (UR) probability, such as 50/50 assignment to two versions of a website to discover whether one version engages users m...
- Minimax Hypothesis Testing for the Bradley-Terry-Luce Model : Abstract: The Bradley-Terry-Luce (BTL) model is one of the most widely used models for ranking a collection of items or agents based on pairwise comparisons among them. Given $n$ agents, the BTL model...
- FedSub: Introducing Class-aware Subnetworks Fusion to Enhance Personalized Federated Learning : Abstract: Personalized Federated Learning aims at addressing the challenges of non-IID data in collaborative model training. However, existing methods struggle to balance personalization and generaliz...
- Simulating classification models to evaluate Predict-Then-Optimize methods : Abstract: Uncertainty in optimization is often represented as stochastic parameters in the optimization model. In Predict-Then-Optimize approaches, predictions of a machine learning model are used as ...
- Spontaneous Kolmogorov-Arnold Geometry in Shallow MLPs : Abstract: The Kolmogorov-Arnold (KA) representation theorem constructs universal, but highly non-smooth inner functions (the first layer map) in a single (non-linear) hidden layer neural network. Such...
- On SkipGram Word Embedding Models with Negative Sampling: Unified Framework and Impact of Noise Distributions : Abstract: SkipGram word embedding models with negative sampling, or SGN in short, is an elegant family of word embedding models. In this paper, we formulate a framework for word embedding, referred to...
- Predicting Human Perceptions of Robot Performance During Navigation Tasks : Abstract: Understanding human perceptions of robot performance is crucial for designing socially intelligent robots that can adapt to human expectations. Current approaches often rely on surveys, whic...
- ContourDiff: Unpaired Medical Image Translation with Structural Consistency : Abstract: Accurately translating medical images between different modalities, such as Computed Tomography (CT) to Magnetic Resonance Imaging (MRI), has numerous downstream clinical and machine learnin...
- Anomalous Change Point Detection Using Probabilistic Predictive Coding : Abstract: Change point detection (CPD) and anomaly detection (AD) are essential techniques in various fields to identify abrupt changes or abnormal data instances. However, existing methods are often ...
- Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs : Abstract: The rapid proliferation of frontier model agents promises significant societal advances but also raises concerns about systemic risks arising from unsafe interactions. Collusion to the disad...
- Online Convex Optimization with Memory and Limited Predictions : Abstract: This paper addresses an online convex optimization problem where the cost function at each step depends on a history of past decisions (i.e., memory), and the decision maker has access to li...
- Convolution goes higher-order: a biologically inspired mechanism empowers image classification : Abstract: We propose a novel approach to image classification inspired by complex nonlinear biological visual processing, whereby classical convolutional neural networks (CNNs) are equipped with learn...
- kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions : Abstract: We study a missing-value imputation method, termed kNNSampler, that imputes a given unit's missing response by randomly sampling from the observed responses of the $k$ most similar units to ...
- On the identifiability of causal graphs with multiple environments : Abstract: Causal discovery from i.i.d. observational data is known to be generally ill-posed. We demonstrate that if we have access to the distribution of a structural causal model, and additional dat...
- Human-Level and Beyond: Benchmarking Large Language Models Against Clinical Pharmacists in Prescription Review : Abstract: The rapid advancement of large language models (LLMs) has accelerated their integration into clinical decision support, particularly in prescription review. To enable systematic and fine-gra...
- Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models : Abstract: Log-likelihood evaluation enables important capabilities in generative models, including model comparison, certain fine-tuning objectives, and many downstream applications. Yet paradoxically...
- Adaptive Weighted LSSVM for Multi-View Classification : Abstract: Multi-view learning integrates diverse representations of the same instances to improve performance. Most existing kernel-based multi-view learning methods use fusion techniques without enfo...
- Conformal Correction for Efficiency May be at Odds with Entropy : Abstract: Conformal prediction (CP) provides a comprehensive framework to produce statistically rigorous uncertainty sets for black-box machine learning models. To further improve the efficiency of CP...
- FGC-Comp: Adaptive Neighbor-Grouped Attribute Completion for Graph-based Anomaly Detection : Abstract: Graph-based Anomaly Detection models have gained widespread adoption in recent years, identifying suspicious nodes by aggregating neighborhood information. However, most existing studies ove...
- Credal Graph Neural Networks : Abstract: Uncertainty quantification is essential for deploying reliable Graph Neural Networks (GNNs), where existing approaches primarily rely on Bayesian inference or ensembles. In this paper, we in...
- Adversarial Jamming for Autoencoder Distribution Matching : Abstract: We propose the use of adversarial wireless jamming to regularise the latent space of an autoencoder to match a diagonal Gaussian distribution. We consider the minimisation of a mean squared ...
- FiMMIA: scaling semantic perturbation-based membership inference across modalities : Abstract: Membership Inference Attacks (MIAs) aim to determine whether a specific data point was included in the training set of a target model. Although there have been numerous methods developed...
- Adaptive Decentralized Federated Learning for Robust Optimization : Abstract: In decentralized federated learning (DFL), the presence of abnormal clients, often caused by noisy or poisoned data, can significantly disrupt the learning process and degrade the overall ro...
- Assessing the performance of correlation-based multi-fidelity neural emulators : Abstract: Outer loop tasks such as optimization, uncertainty quantification or inference can easily become intractable when the underlying high-fidelity model is computationally expensive. Similarly, ...
- Hypothesis Testing for Generalized Thurstone Models : Abstract: In this work, we develop a hypothesis testing framework to determine whether pairwise comparison data is generated by an underlying generalized Thurstone model $\mathcal{T}_F$ for a g...
- Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation : Abstract: We consider analyzing traffic accident patterns using both road network data and satellite images aligned to road graph nodes. Previous work for predicting accident occurrences relies primar...
- Fast Gaussian Process Approximations for Autocorrelated Data : Abstract: This paper is concerned with the problem of how to speed up computation for Gaussian process models trained on autocorrelated data. The Gaussian process model is a powerful tool commonly use...
- Pruning AMR: Efficient Visualization of Implicit Neural Representations via Weight Matrix Analysis : Abstract: An implicit neural representation (INR) is a neural network that approximates a spatiotemporal function. Many memory-intensive visualization tasks, including modern 4D CT scanning methods, r...
- ProteinPNet: Prototypical Part Networks for Concept Learning in Spatial Proteomics : Abstract: Understanding the spatial architecture of the tumor microenvironment (TME) is critical to advance precision oncology. We present ProteinPNet, a novel framework based on prototypical part net...
- A Real-time Face Mask Detection and Social Distancing System for COVID-19 using Attention-InceptionV3 Model : Abstract: One of the deadliest pandemics is now happening in the current world due to COVID-19. This contagious virus is spreading like wildfire around the whole world. To minimize the spreading of th...
- Seizure-NGCLNet: Representation Learning of SEEG Spatial Pathological Patterns for Epileptic Seizure Detection via Node-Graph Dual Contrastive Learning : Abstract: Complex spatial connectivity patterns, such as interictal suppression and ictal propagation, complicate accurate drug-resistant epilepsy (DRE) seizure detection using stereotactic electroenc...
- Generative design and validation of therapeutic peptides for glioblastoma based on a potential target ATP5A : Abstract: Glioblastoma (GBM) remains the most aggressive tumor, urgently requiring novel therapeutic strategies. Here, we present a dry-to-wet framework combining generative modeling and experimental ...
- From 'What-is' to 'What-if' in Human-Factor Analysis: A Post-Occupancy Evaluation Case : Abstract: Human-factor analysis typically employs correlation analysis and significance testing to identify relationships between variables. However, these descriptive ('what-is') methods, while effec...
- Quantum Machine Learning for Secondary Frequency Control : Abstract: Frequency control in power systems is critical to maintaining stability and preventing blackouts. Traditional methods like meta-heuristic algorithms and machine learning face limitations in ...
- From Betti Numbers to Persistence Diagrams: A Hybrid Quantum Algorithm for Topological Data Analysis : Abstract: Persistence diagrams serve as a core tool in topological data analysis, playing a crucial role in pathological monitoring, drug discovery, and materials design. However, existing quantum top...
- Opening the Black Box: Nowcasting Singapore's GDP Growth and its Explainability : Abstract: Timely assessment of current conditions is essential especially for small, open economies such as Singapore, where external shocks transmit rapidly to domestic activity. We develop a real-ti...
- CoatFusion: Controllable Material Coating in Images : Abstract: We introduce Material Coating, a novel image editing task that simulates applying a thin material layer onto an object while preserving its underlying coarse and fine geometry. Material coat...
- SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting : Abstract: 3D Gaussian Splatting (3DGS) enables high-quality novel view synthesis, motivating interest in generating higher-resolution renders than those available during training. A natural strategy i...
- Sampling on Metric Graphs : Abstract: Metric graphs are structures obtained by associating edges in a standard graph with segments of the real line and gluing these segments at the vertices of the graph. The resulting structure ...
- PhishSnap: Image-Based Phishing Detection Using Perceptual Hashing : Abstract: Phishing remains one of the most prevalent online threats, exploiting human trust to harvest sensitive credentials. Existing URL- and HTML-based detection systems struggle against obfuscatio...
- Verifying Closed-Loop Contractivity of Learning-Based Controllers via Partitioning : Abstract: We address the problem of verifying closed-loop contraction in nonlinear control systems whose controller and contraction metric are both parameterized by neural networks. By leveraging inte...
- Adversarial Robustness of Traffic Classification under Resource Constraints: Input Structure Matters : Abstract: Traffic classification (TC) plays a critical role in cybersecurity, particularly in IoT and embedded contexts, where inspection must often occur locally under tight hardware constraints. We ...
- Few-shot Protein Fitness Prediction via In-context Learning and Test-time Training : Abstract: Accurately predicting protein fitness with minimal experimental data is a persistent challenge in protein engineering. We introduce PRIMO (PRotein In-context Mutation Oracle), a transformer-...
- Molecular Embedding-Based Algorithm Selection in Protein-Ligand Docking : Abstract: Selecting an effective docking algorithm is highly context-dependent, and no single method performs reliably across structural, chemical, or protocol regimes. We introduce MolAS, a lightweig...
- Safeguarded Stochastic Polyak Step Sizes for Non-smooth Optimization: Robust Performance Without Small (Sub)Gradients : Abstract: The stochastic Polyak step size (SPS) has proven to be a promising choice for stochastic gradient descent (SGD), delivering competitive performance relative to state-of-the-art methods on sm...
- Leveraging Large Language Models to Bridge On-chain and Off-chain Transparency in Stablecoins : Abstract: Stablecoins such as USDT and USDC aspire to peg stability by coupling issuance controls with reserve attestations. In practice, however, the transparency is split across two worlds: verifiab...
- Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation : Abstract: Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. Meta-learning approaches explicitly learn good initializations, but they require a...
- QJoin: Transformation-aware Joinable Data Discovery Using Reinforcement Learning : Abstract: Discovering which tables in large, heterogeneous repositories can be joined and by what transformations is a central challenge in data integration and data discovery. Traditional join discov...
- WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling : Abstract: Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Tempora...
- Stress-Testing Causal Claims via Cardinality Repairs : Abstract: Causal analyses derived from observational data underpin high-stakes decisions in domains such as healthcare, public policy, and economics. Yet such conclusions can be surprisingly fragile: ...
- Bayesian Physics-Informed Neural Networks for Inverse Problems (BPINN-IP): Application in Infrared Image Processing : Abstract: Inverse problems arise across scientific and engineering domains, where the goal is to infer hidden parameters or physical fields from indirect and noisy observations. Classical approaches, ...
- Generative Multi-modal Feedback for Singing Voice Synthesis Evaluation : Abstract: Singing voice synthesis (SVS) has advanced significantly, enabling models to generate vocals with accurate pitch and consistent style. As these capabilities improve, the need for reliable ev...
- A Concise Review of Hallucinations in LLMs and their Mitigation : Abstract: Traditional language models face a challenge from hallucinations. Their very presence casts a large, dangerous shadow over the promising realm of natural language processing. It becomes cruc...
- Laplace Approximation For Tensor Train Kernel Machines In System Identification : Abstract: To address the scalability limitations of Gaussian process (GP) regression, several approximation techniques have been proposed. One such method is based on tensor networks, which utilizes a...
- Hear What Matters! Text-conditioned Selective Video-to-Audio Generation : Abstract: This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especia...
- Embedding networks with the random walk first return time distribution : Abstract: We propose the first return time distribution (FRTD) of a random walk as an interpretable and mathematically grounded node embedding. The FRTD assigns a probability mass function to each nod...
- ALDI-ray: Adapting the ALDI Framework for Security X-ray Object Detection : Abstract: Domain adaptation in object detection is critical for real-world applications where distribution shifts degrade model performance. Security X-ray imaging presents a unique challenge due to v...
- VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm : Abstract: Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many ...
- CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer : Abstract: Ensuring content safety in large language models (LLMs) is essential for their deployment in real-world applications. However, existing safety guardrails are predominantly tailored for high-...
- Generative modeling using evolved quantum Boltzmann machines : Abstract: Born-rule generative modeling, a central task in quantum machine learning, seeks to learn probability distributions that can be efficiently sampled by measuring complex quantum states. One h...
- Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone : Abstract: Image-based localization in GNSS-denied environments is critical for UAV autonomy. Existing state-of-the-art approaches rely on matching UAV images to geo-referenced satellite images; howeve...
- LumiX: Structured and Coherent Text-to-Intrinsic Generation : Abstract: We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g....
- Revisiting Theory of Contrastive Learning for Domain Generalization : Abstract: Contrastive learning is among the most popular and powerful approaches for self-supervised representation learning, where the goal is to map semantically similar samples close together while...
- VLM as Strategist: Adaptive Generation of Safety-critical Testing Scenarios via Guided Diffusion : Abstract: The safe deployment of autonomous driving systems (ADSs) relies on comprehensive testing and evaluation. However, safety-critical scenarios that can effectively expose system vulnerabilities...
- Are Detectors Fair to Indian IP-AIGC? A Cross-Generator Study : Abstract: Modern image editors can produce identity-preserving AIGC (IP-AIGC), where the same person appears with new attire, background, or lighting. The robustness and fairness of current detectors ...
- Leveraging generative adversarial networks with spatially adaptive denormalization for multivariate stochastic seismic data inversion : Abstract: Probabilistic seismic inverse modeling often requires the prediction of both spatially correlated geological heterogeneities (e.g., facies) and continuous parameters (e.g., rock and elastic ...
- Remotely sensing stress evolution in elastic media: a passive approach to earthquake monitoring : Abstract: Stress evolution governs material failure across scales, from microscopic fractures to large earthquakes, yet direct observation of its dynamics in natural systems has remained elusive. Labo...
- AdvisingWise: Supporting Academic Advising in Higher Education Settings Through a Human-in-the-Loop Multi-Agent Framework : Abstract: Academic advising is critical to student success in higher education, yet high student-to-advisor ratios limit advisors' capacity to provide timely support, particularly during peak periods....
- MIMIC-MJX: Neuromechanical Emulation of Animal Behavior : Abstract: The primary output of the nervous system is movement and behavior. While recent advances have democratized pose tracking during complex behavior, kinematic trajectories alone provide only in...
- An Improved Ensemble-Based Machine Learning Model with Feature Optimization for Early Diabetes Prediction : Abstract: Diabetes is a serious worldwide health issue, and successful intervention depends on early detection. However, overlapping risk factors and data asymmetry make prediction difficult. To use e...
- Pharmacophore-based design by learning on voxel grids : Abstract: Ligand-based drug discovery (LBDD) relies on making use of known binders to a protein target to find structurally diverse molecules similarly likely to bind. This process typically involves ...
- PIBNet: a Physics-Inspired Boundary Network for Multiple Scattering Simulations : Abstract: The boundary element method (BEM) provides an efficient numerical framework for solving multiple scattering problems in unbounded homogeneous domains, since it reduces the discretization to ...
- Contextual Gating within the Transformer Stack: Synergistic Feature Modulation for Enhanced Lyrical Classification and Calibration : Abstract: This study introduces a significant architectural advancement in feature fusion for lyrical content classification by integrating auxiliary structural features directly into the self-attenti...
- Cross-View Topology-Aware Graph Representation Learning : Abstract: Graph classification has gained significant attention due to its applications in chemistry, social networks, and bioinformatics. While Graph Neural Networks (GNNs) effectively capture local ...
- How Market Volatility Shapes Algorithmic Collusion: A Comparative Analysis of Learning-Based Pricing Algorithms : Abstract: Autonomous pricing algorithms are increasingly influencing competition in digital markets; however, their behavior under realistic demand conditions remains largely unexamined. This paper of...
- Modelling the Doughnut of social and planetary boundaries with frugal machine learning : Abstract: The 'Doughnut' of social and planetary boundaries has emerged as a popular framework for assessing environmental and social sustainability. Here, we provide a proof-of-concept analysis that ...
- WhAM: Towards A Translative Model of Sperm Whale Vocalization : Abstract: Sperm whales communicate in short sequences of clicks known as codas. We present WhAM (Whale Acoustics Model), the first transformer-based model capable of generating synthetic sperm whale c...
- InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages : Abstract: Effective text generation and chat interfaces for low-resource languages (LRLs) remain a challenge for state-of-the-art large language models (LLMs) to support. This is mainly due to the dif...
- Uncertainty Reasoning with Photonic Bayesian Machines : Abstract: Artificial intelligence (AI) systems increasingly influence safety-critical aspects of society, from medical diagnosis to autonomous mobility, making uncertainty awareness a central requirem...
- On the Approximation of Phylogenetic Distance Functions by Artificial Neural Networks : Abstract: Inferring the phylogenetic relationships among a sample of organisms is a fundamental problem in modern biology. While distance-based hierarchical clustering algorithms achieved early succes...
- The Effect of Enforcing Fairness on Reshaping Explanations in Machine Learning Models : Abstract: Trustworthy machine learning in healthcare requires strong predictive performance, fairness, and explanations. While it is known that improving fairness can affect predictive performance, li...
- Limitations of Membership Queries in Testable Learning : Abstract: Membership queries (MQ) often yield speedups for learning tasks, particularly in the distribution-specific setting. We show that in the testable learning model of Rubinfeld and Vasily...
- Training Dynamics of Learning 3D-Rotational Equivariance : Abstract: While data augmentation is widely used to train symmetry-agnostic models, it remains unclear how quickly and effectively they learn to respect symmetries. We investigate this by deriving a p...
- Unlocking the Power of Boltzmann Machines by Parallelizable Sampler and Efficient Temperature Estimation : Abstract: Boltzmann machines (BMs) are powerful energy-based generative models, but their heavy training cost has largely confined practical use to Restricted BMs (RBMs) trained with an efficient lear...
- Retrieval-Augmented Memory for Online Learning : Abstract: Retrieval-augmented models couple parametric predictors with non-parametric memories, but their use in streaming supervised learning with concept drift is not well understood. We study onlin...
- Forecasting MBTA Transit Dynamics: A Performance Benchmarking of Statistical and Machine Learning Models : Abstract: The Massachusetts Bay Transportation Authority (MBTA) is the main public transit provider in Boston, operating multiple means of transport, including trains, subways, and buses. However, the...
- SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification : Abstract: Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Spec...
- Reinforcement Learning in POMDP's via Direct Gradient Ascent : Abstract: This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE...
- Risk-Sensitive Q-Learning in Continuous Time with Application to Dynamic Portfolio Selection : Abstract: This paper studies the problem of risk-sensitive reinforcement learning (RSRL) in continuous time, where the environment is characterized by a controllable stochastic differential equation (...
- ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity : Abstract: Transformers, composed of QKV generation, attention computation, and FFNs, have become the dominant model across various domains due to their outstanding performance. However, their high...
- Dynamic Configuration of On-Street Parking Spaces using Multi Agent Reinforcement Learning : Abstract: With travel needs greater than ever, traffic congestion has become a major concern in most urban areas. Allocating spaces for on-street parking further hinders traffic flow, by l...
- Cross-Domain Offline Policy Adaptation with Dynamics- and Value-Aligned Data Filtering : Abstract: Cross-Domain Offline Reinforcement Learning aims to train an agent deployed in the target environment, leveraging both a limited target domain dataset and a source domain dataset with (possi...
- Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts : Abstract: Single-domain offline reinforcement learning (RL) often suffers from limited data coverage, while cross-domain offline RL handles this issue by leveraging additional data from other domains ...
- Hybrid(Penalized Regression and MLP) Models for Outcome Prediction in HDLSS Health Data : Abstract: I present an application of established machine learning techniques to NHANES health survey data for predicting diabetes status. I compare baseline models (logistic regression, random forest...
- A Fully First-Order Layer for Differentiable Optimization : Abstract: Differentiable optimization layers enable learning systems to make decisions by solving embedded optimization problems. However, computing gradients via implicit differentiation requires sol...
- Water Quality Estimation Through Machine Learning Multivariate Analysis : Abstract: The quality of water is key for the quality of the agrifood sector. Water is used in agriculture for fertigation, for animal husbandry, and in the agrifood processing industry. In the context of...
- Decentralized Fairness Aware Multi Task Federated Learning for VR Network : Abstract: Wireless connectivity promises to unshackle virtual reality (VR) experiences, allowing users to engage from anywhere, anytime. However, delivering seamless, high-quality, real-time VR video ...
- In-Context Distillation with Self-Consistency Cascades: A Simple, Training-Free Way to Reduce LLM Agent Costs : Abstract: The world currently has an abundance of ideas for how to use new LLM agents, and developers seek to rapidly prototype and test new agentic designs. However, executing agents at scale using h...
- Tensor Network Based Feature Learning Model : Abstract: Many approximations were suggested to circumvent the cubic complexity of kernel-based algorithms, allowing their application to large-scale datasets. One strategy is to consider the primal f...
- GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies : Abstract: Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize are often too simple to represent the multimodal action distributions needed for complex control....
- Modeling and Inverse Identification of Interfacial Heat Conduction in Finite Layer and Semi-Infinite Substrate Systems via a Physics-Guided Neural Framework : Abstract: Heat transfer in semiconductor devices is dominated by chip and substrate assemblies, where heat generated within a finite chip layer dissipates into a semi-infinite substrate with much high...
- Adapting Tensor Kernel Machines to Enable Efficient Transfer Learning for Seizure Detection : Abstract: Transfer learning aims to optimize performance in a target task by learning from a related source problem. In this work, we propose an efficient transfer learning method using a tensor kerne...
- SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization : Abstract: Existing Vision-Language Navigation (VLN) agents based on Large Vision-Language Models (LVLMs) often suffer from perception errors, reasoning errors, and planning errors, which significantly...
- VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling : Abstract: Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness prima...
- Towards a fully differentiable digital twin for solar cells : Abstract: Maximizing energy yield (EY) - the total electric energy generated by a solar cell within a year at a specific location - is crucial in photovoltaics (PV), especially for emerging technologi...
- MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding : Abstract: Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent studies address this issue by dividing the image into smaller crops an...
- In Silico Development of Psychometric Scales: Feasibility of Representative Population Data Simulation with LLMs : Abstract: Developing and validating psychometric scales requires large samples, multiple testing phases, and substantial resources. Recent advances in Large Language Models (LLMs) enable the generatio...
- EGGS: Exchangeable 2D/3D Gaussian Splatting for Geometry-Appearance Balanced Novel View Synthesis : Abstract: Novel view synthesis (NVS) is crucial in computer vision and graphics, with wide applications in AR, VR, and autonomous driving. While 3D Gaussian Splatting (3DGS) enables real-time renderin...
- Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench : Abstract: The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for accurate physical outcome model...
- Lumos: Let there be Language Model System Certification : Abstract: We introduce the first principled framework, Lumos, for specifying and formally certifying Language Model System (LMS) behaviors. Lumos is an imperative probabilistic programming DSL over gr...
- Rethinking Generalized BCIs: Benchmarking 340,000+ Unique Algorithmic Configurations for EEG Mental Command Decoding : Abstract: Robust decoding and classification of brain patterns measured with electroencephalography (EEG) remains a major challenge for real-world (i.e., outside scientific labs and medical facilities) ...
- Fine-Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic : Abstract: Recent advances in natural language processing (NLP), particularly large language models (LLMs), have motivated the automatic translation of natural language statements into formal logic wit...
- In-Context Sync-LoRA for Portrait Video Editing : Abstract: Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, expression edits, or the addition of ...
- Distribution-Calibrated Inference time compute for Thinking LLM-as-a-Judge : Abstract: Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or ...
- TokenPowerBench: Benchmarking the Power Consumption of LLM Inference : Abstract: Large language model (LLM) services now answer billions of queries per day, and industry reports show that inference, not training, accounts for more than 90% of total power consumption. How...
- LORE: A Large Generative Model for Search Relevance : Abstract: Achievement. We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27...
- The Moral Consistency Pipeline: Continuous Ethical Evaluation for Large Language Models : Abstract: The rapid advancement and adaptability of Large Language Models (LLMs) highlight the need for moral consistency, the capacity to maintain ethically coherent reasoning across varied contexts....
- SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control : Abstract: Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversarial imitation learning has been ...
- ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation : Abstract: Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipel...
- Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation : Abstract: We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, ...
- PPTArena: A Benchmark for Agentic PowerPoint Editing : Abstract: We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text...
- Computational Copyright: Towards A Royalty Model for Music Generative AI : Abstract: The rapid rise of generative AI has intensified copyright and economic tensions in creative industries, particularly in music. Current approaches addressing this challenge often focus on pre...
- A process algebraic framework for multi-agent dynamic epistemic systems : Abstract: This paper combines the classical model of labeled transition systems with the epistemic model for reasoning about knowledge. The result is a unifying framework for modeling and analyzing mu...
- Mechanisms of Symbol Processing for In-Context Learning in Transformer Networks : Abstract: Large Language Models (LLMs) have demonstrated impressive abilities in symbol processing through in-context learning (ICL). This success flies in the face of decades of critiques asserting t...
- Reinforcement Learning: An Overview : Abstract: This manuscript gives a big-picture, up-to-date overview of the field of (deep) reinforcement learning and sequential decision making, covering value-based methods, policy-based methods, mod...
- CoT-X: An Adaptive Framework for Cross-Model Chain-of-Thought Transfer and Optimization : Abstract: Chain-of-Thought (CoT) reasoning enhances the problem-solving ability of large language models (LLMs) but leads to substantial inference overhead, limiting deployment in resource-constrained...
- Large Language Models for Robotics: A Survey : Abstract: The human ability to learn, generalize, and control complex manipulation tasks through multi-modality feedback suggests a unique capability, which we refer to as dexterity intelligence. Unde...
- Brain-aligning of semantic vectors improves neural decoding of visual stimuli : Abstract: The development of algorithms to accurately decode neural information has long been a research focus in the field of neuroscience. Brain decoding typically involves training machine learning...
- Mixture of Experts Softens the Curse of Dimensionality in Operator Learning : Abstract: We study the approximation-theoretic implications of mixture-of-experts architectures for operator learning, where the complexity of a single large neural operator is distributed across many...
- CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios : Abstract: 3D medical vision-language (VL) pretraining has shown potential in radiology by leveraging large-scale multimodal datasets with CT-report pairs. However, existing methods primarily rely on a...
- Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation : Abstract: Large language models (LLMs) have transformed the field of natural language processing, but they remain susceptible to jailbreaking attacks that exploit their capabilities to generate uninte...
- Aligning Diffusion Models with Noise-Conditioned Perception : Abstract: Recent advancements in human preference optimization, initially developed for Language Models (LMs), have shown promise for text-to-image Diffusion Models, enhancing prompt alignment, visual...
- Pre-trained Language Models Improve the Few-shot Prompt Ability of Decision Transformer : Abstract: Decision Transformer (DT) has emerged as a promising class of algorithms in offline reinforcement learning (RL) tasks, leveraging pre-collected datasets and Transformer's capability to model...
- Mutually-Aware Feature Learning for Few-Shot Object Counting : Abstract: Few-shot object counting has garnered significant attention for its practicality as it aims to count target objects in a query image based on given exemplars without additional training. How...
- Cohort-Based Active Modality Acquisition : Abstract: Real-world machine learning applications often involve data from multiple modalities that must be integrated effectively to make robust predictions. However, in many practical settings, not ...
- See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models : Abstract: Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech....
- DETAIL Matters: Measuring the Impact of Prompt Specificity on Reasoning in Large Language Models : Abstract: Prompt design plays a critical role in the reasoning performance of large language models (LLMs), yet the impact of prompt specificity - how detailed or vague a prompt is - remains understud...
- Spatiotemporal Pyramid Flow Matching for Climate Emulation : Abstract: Generative models have the potential to transform the way we emulate Earth's changing climate. Previous generative approaches rely on weather-scale autoregression for climate emulation, but ...
- Progressive Image Restoration via Text-Conditioned Video Generation : Abstract: Recent text-to-video models have demonstrated strong temporal generation capabilities, yet their potential for image restoration remains underexplored. In this work, we repurpose CogVideo fo...
- Enhancing Cross Domain SAR Oil Spill Segmentation via Morphological Region Perturbation and Synthetic Label-to-SAR Generation : Abstract: Deep learning models for SAR oil spill segmentation often fail to generalize across regions due to differences in sea-state, backscatter statistics, and slick morphology, a limitation that i...
- HealthContradict: Evaluating Biomedical Knowledge Conflicts in Language Models : Abstract: How do language models use contextual information to answer health questions? How are their responses impacted by conflicting contexts? We assess the ability of language models to reason ove...
- COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers : Abstract: This paper studies how multimodal large language models (MLLMs) undermine the security guarantees of visual CAPTCHA. We identify the attack surface where an adversary can cheaply automate CA...
- Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision : Abstract: Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers stru...
- FOVA: Offline Federated Reinforcement Learning with Mixed-Quality Data : Abstract: Offline Federated Reinforcement Learning (FRL), a marriage of federated learning and offline reinforcement learning, has attracted increasing interest recently. Albeit with some advancement,...
- Understanding and Harnessing Sparsity in Unified Multimodal Models : Abstract: Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to sup...
- VACoT: Rethinking Visual Data Augmentation with VLMs : Abstract: While visual data augmentation remains a cornerstone for training robust vision models, it has received limited attention in visual language models (VLMs), which predominantly rely on large-...
- Memory-Augmented Knowledge Fusion with Safety-Aware Decoding for Domain-Adaptive Question Answering : Abstract: Domain-specific question answering (QA) systems for services face unique challenges in integrating heterogeneous knowledge sources while ensuring both accuracy and safety. Existing large lan...
- Tackling Tuberculosis: A Comparative Dive into Machine Learning for Tuberculosis Detection : Abstract: This study explores the application of machine learning models, specifically a pretrained ResNet-50 model and a general SqueezeNet model, in diagnosing tuberculosis (TB) using chest X-ray im...
- Multi-Domain Enhanced Map-Free Trajectory Prediction with Selective Attention : Abstract: Trajectory prediction is crucial for the reliability and safety of autonomous driving systems, yet it remains a challenging task in complex interactive scenarios. Existing methods often stru...
- Process-Centric Analysis of Agentic Software Systems : Abstract: Agentic systems are modern software systems: they consist of orchestrated modules, expose interfaces, and are deployed in software pipelines. Unlike conventional programs, their execution (i...
- WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate : Abstract: Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to levera...
- Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles : Abstract: Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-b...
- MitUNet: Enhancing Floor Plan Recognition using a Hybrid Mix-Transformer and U-Net Architecture : Abstract: Automatic 3D reconstruction of indoor spaces from 2D floor plans requires high-precision semantic segmentation of structural elements, particularly walls. However, existing methods optimized...
- Vehicle Dynamics Embedded World Models for Autonomous Driving : Abstract: World models have gained significant attention as a promising approach for autonomous driving. By emulating human-like perception and decision-making processes, these models can predict and ...
- The brain-AI convergence: Predictive and generative world models for general-purpose computation : Abstract: Recent advances in general-purpose AI systems with attention-based transformers offer a potential window into how the neocortex and cerebellum, despite their relatively uniform circuit archi...
- Quantum feature encoding optimization : Abstract: Quantum Machine Learning (QML) holds the promise of enhancing machine learning modeling in terms of both complexity and accuracy. A key challenge in this domain is the encoding of input data...
- WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning : Abstract: Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challeng...
- LightHCG: a Lightweight yet powerful HSIC Disentanglement based Causal Glaucoma Detection Model framework : Abstract: As a representative optic degenerative condition, glaucoma has been a threat to millions due to its irreversibility and severe impact on human vision fields. Mainly characterized by dimmed a...
- Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources : Abstract: In medical healthcare, obtaining detailed annotations is challenging, highlighting the need for robust Vision-Language Models (VLMs). Pretrained VLMs enable fine-tuning on small datasets or ...
- When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents : Abstract: Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer cont...
- HouseLayout3D: A Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild : Abstract: Current 3D layout estimation models are primarily trained on synthetic datasets containing simple single room or single floor environments. As a consequence, they cannot natively handle larg...
- TabGRU: An Enhanced Design for Urban Rainfall Intensity Estimation Using Commercial Microwave Links : Abstract: In the face of accelerating global urbanization and the increasing frequency of extreme weather events, high-resolution urban rainfall monitoring is crucial for building resilient smart citie...
- scCluBench: Comprehensive Benchmarking of Clustering Algorithms for Single-Cell RNA Sequencing : Abstract: Cell clustering is crucial for uncovering cellular heterogeneity in single-cell RNA sequencing (scRNA-seq) data by identifying cell types and marker genes. Despite its importance, benchmarks...
- Q-BERT4Rec: Quantized Semantic-ID Representation Learning for Multimodal Recommendation : Abstract: Sequential recommendation plays a critical role in modern online platforms such as e-commerce, advertising, and content streaming, where accurately predicting users' next interactions is ess...
- UCAgents: Unidirectional Convergence for Visual Evidence Anchored Multi-Agent Medical Decision-Making : Abstract: Vision-Language Models (VLMs) show promise in medical diagnosis, yet suffer from reasoning detachment, where linguistically fluent explanations drift from verifiable image evidence, undermin...
- Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding : Abstract: Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. Ho...
- AskNearby: An LLM-Based Application for Neighborhood Information Retrieval and Personalized Cognitive-Map Recommendations : Abstract: The "15-minute city" envisions neighborhoods where residents can meet daily needs via a short walk or bike ride. Realizing this vision requires not only physical proximity but also efficient...
- Sparse Computations in Deep Learning Inference : Abstract: The computational demands of modern Deep Neural Networks (DNNs) are immense and constantly growing. While training costs usually capture public attention, inference demands are also contribu...
- CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning : Abstract: In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM...
- ADORE: Autonomous Domain-Oriented Relevance Engine for E-commerce : Abstract: Relevance modeling in e-commerce search remains challenged by semantic gaps in term-matching methods (e.g., BM25) and neural models' reliance on the scarcity of domain-specific hard samples....
- EZYer: A simulacrum of high school with generative agent : Abstract: Despite the rapid development of online education and large language models, existing educational tools still suffer from incomplete service, insufficient performance and weak interactiv...
- From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature : Abstract: There is a growing interest in developing strong biomedical vision-language models. A popular approach to achieve robust representations is to use web-scale scientific data. However, current...
- Feedback Loops and Code Perturbations in LLM-based Software Engineering: A Case Study on a C-to-Rust Translation System : Abstract: The advent of strong generative AI has a considerable impact on various software engineering tasks such as code repair, test generation, or language translation. While tools like GitHub Copi...
- CryptoQA: A Large-scale Question-answering Dataset for AI-assisted Cryptography : Abstract: Large language models (LLMs) excel at many general-purpose natural language processing tasks. However, their ability to perform deep reasoning and mathematical analysis, particularly for com...
- Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training : Abstract: Existing methods for expressive music performance rendering rely on supervised learning over small labeled datasets, which limits scaling of both data volume and model size, despite the avai...
- Distill, Forget, Repeat: A Framework for Continual Unlearning in Text-to-Image Diffusion Models : Abstract: The recent rapid growth of visual generative models trained on vast web-scale datasets has created significant tension with data privacy regulations and copyright laws, such as GDPR's "Righ...
- Graph VQ-Transformer (GVT): Fast and Accurate Molecular Generation via High-Fidelity Discrete Latents : Abstract: The de novo generation of molecules with desirable properties is a critical challenge, where diffusion models are computationally intensive and autoregressive models struggle with error prop...
- SAND Challenge: Four Approaches for Dysartria Severity Classification : Abstract: This paper presents a unified study of four distinct modeling approaches for classifying dysarthria severity in the Speech Analysis for Neurodegenerative Diseases (SAND) challenge. All model...
- Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions : Abstract: This paper examines why safety mechanisms designed for human-model interaction do not scale to environments where large language models (LLMs) interact with each other. Most current governan...
- An Empirical Survey of Model Merging Algorithms for Social Bias Mitigation : Abstract: Large language models (LLMs) are known to inherit and even amplify societal biases present in their pre-training corpora, threatening fairness and social trust. To address this issue, recent...
- Empirical Assessment of the Perception of Software Product Line Engineering by an SME before Migrating its Code Base : Abstract: Migrating a set of software variants into a software product line (SPL) is an expensive and potentially challenging endeavor. Indeed, SPL engineering can significantly impact a company's dev...
- Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs : Abstract: Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively pro...
- DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions : Abstract: Modeling daily hand interactions often struggles with severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE...
- Reasoning-Aware Multimodal Fusion for Hateful Video Detection : Abstract: Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing method...
- SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys : Abstract: LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipeli...
- Perception of AI-Generated Music - The Role of Composer Identity, Personality Traits, Music Preferences, and Perceived Humanness : Abstract: The rapid rise of AI-generated art has sparked debate about potential biases in how audiences perceive and evaluate such works. This study investigates how composer information and listener ...
- Phase-Adaptive LLM Framework with Multi-Stage Validation for Construction Robot Task Allocation: A Systematic Benchmark Against Traditional Optimization Algorithms : Abstract: Multi-robot task allocation in construction automation has traditionally relied on optimization methods such as Dynamic Programming and Reinforcement Learning. This research introduces the L...
- From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity : Abstract: Flow-based diffusion models have emerged as a leading paradigm for training generative models across images and videos. However, their memorization-generalization behavior remains poorly und...
- Defense That Attacks: How Robust Models Become Better Attackers : Abstract: Deep learning has achieved great success in computer vision, but remains vulnerable to adversarial attacks. Adversarial training is the leading defense designed to improve model robustness. ...
- A Comparative Study on How Data Normalization Affects Zero-Shot Generalization in Time Series Foundation Models : Abstract: We investigate input normalization methods for Time-Series Foundation Models (TSFMs). While normalization is well-studied in dataset-specific time-series models, it remains overlooked in TSF...
- Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach : Abstract: Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation,...
- ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning : Abstract: Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet exist...
- Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages : Abstract: System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world...
- GraphMatch: Fusing Language and Graph Representations in a Dynamic Two-Sided Work Marketplace : Abstract: Recommending matches in a text-rich, dynamic two-sided marketplace presents unique challenges due to evolving content and interaction graphs. We introduce GraphMatch, a new large-scale recom...
- OptPO: Optimal Rollout Allocation for Test-time Policy Optimization : Abstract: Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixe...
- Model-Based Diagnosis with Multiple Observations: A Unified Approach for C Software and Boolean Circuits : Abstract: Debugging is one of the most time-consuming and expensive tasks in software development and circuit design. Several formula-based fault localisation (FBFL) methods have been proposed, but th...
- FAIRY2I: Universal Extremely-Low Bit QAT framework via Widely-Linear Representation and Phase-Aware Quantization : Abstract: Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing represe...
- Enhancing Automated Paper Reproduction via Prompt-Free Collaborative Agents : Abstract: Automated paper reproduction has emerged as a promising approach to accelerate scientific research, employing multi-step workflow frameworks to systematically convert academic papers into ex...
- Radiologist Copilot: An Agentic Assistant with Orchestrated Tools for Radiology Reporting with Quality Control : Abstract: Radiology reporting is an essential yet time-consuming and error-prone task for radiologists in clinical examinations, especially for volumetric medical images. Rigorous quality control is a...
- The future of AI in critical mineral exploration : Abstract: The energy transition through increased electrification has put the world's attention on critical mineral exploration. Even with increased investments, a decrease in new discoveries has taken p...
- Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning : Abstract: Recent advances in reasoning techniques have substantially improved the performance of large language models (LLMs), raising expectations for their ability to provide accurate, truthful, and...
- Invasive Context Engineering to Control Large Language Models : Abstract: Current research on operator control of Large Language Models improves model robustness against adversarial attacks and misbehavior by training on preference examples, prompting, and input/o...
- From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars? : Abstract: The rapid advancement of large language models (LLMs) has opened new possibilities for AI for good applications. As LLMs increasingly mediate online communication, their potential to foster ...
- DySTAN: Joint Modeling of Sedentary Activity and Social Context from Smartphone Sensors : Abstract: Accurately recognizing human context from smartphone sensor data remains a significant challenge, especially in sedentary settings where activities such as studying, attending lectures, rela...
- Towards Sustainable Precision: Machine Learning for Laser Micromachining Optimization : Abstract: In the pursuit of sustainable manufacturing, ultra-short pulse laser micromachining stands out as a promising solution while also offering high-precision and qualitative laser processing. Ho...
- On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts : Abstract: Automatic transcription of stuttered speech remains a challenge, even for modern end-to-end (E2E) automatic speech recognition (ASR) frameworks. Dysfluencies and fluency-shaping artifacts ar...
- Characterizing Continuous and Discrete Hybrid Latent Spaces for Structural Connectomes : Abstract: Structural connectomes are detailed graphs that map how different brain regions are physically connected, offering critical insight into aging, cognition, and neurodegenerative diseases. How...
- CONFIDE: Hallucination Assessment for Reliable Biomolecular Structure Prediction and Design : Abstract: Reliable evaluation of protein structure predictions remains challenging, as metrics like pLDDT capture energetic stability but often miss subtle errors such as atomic clashes or conformatio...
- Integration of LSTM Networks in Random Forest Algorithms for Stock Market Trading Predictions : Abstract: The aim of this paper is the analysis and selection of stock trading systems that combine different models with data of different nature, such as financial and microeconomic information. Spe...
- Statistical Arbitrage in Polish Equities Market Using Deep Learning Techniques : Abstract: We study a systematic approach to a popular Statistical Arbitrage technique: Pairs Trading. Instead of relying on two highly correlated assets, we replace the second asset with a replication...
- Deep Research: A Systematic Survey : Abstract: Large language models (LLMs) have rapidly evolved from text generators into powerful problem solvers. Yet, many open tasks demand critical thinking, multi-source, and verifiable outputs, whi...
- The Impact of Artificial Intelligence on Enterprise Decision-Making Process : Abstract: Artificial intelligence improves enterprise decision-making by accelerating data analysis, reducing human error, and supporting evidence-based choices. A quantitative survey of 92 companies ...
- Leveraging AI multimodal geospatial foundation models for improved near-real-time flood mapping at a global scale : Abstract: Floods are among the most damaging weather-related hazards, and in 2024, the warmest year on record, extreme flood events affected communities across five continents. Earth observation (EO) ...
- Reversing Large Language Models for Efficient Training and Fine-Tuning : Abstract: Large Language Models (LLMs) are known for their expensive and time-consuming training. Thus, oftentimes, LLMs are fine-tuned to address a specific task, given the pretrained weights of a pr...
- Opening the Black Box: An Explainable, Few-shot AI4E Framework Informed by Physics and Expert Knowledge for Materials Engineering : Abstract: The industrial adoption of Artificial Intelligence for Engineering (AI4E) faces two fundamental bottlenecks: scarce high-quality data and the lack of interpretability in black-box models-par...
- Ada-MoGE: Adaptive Mixture of Gaussian Expert Model for Time Series Forecasting : Abstract: Multivariate time series forecasts are widely used, such as industrial, transportation and financial forecasts. However, the dominant frequencies in time series may shift with the evolving s...
- Superpixel Attack: Enhancing Black-box Adversarial Attack with Image-driven Division Areas : Abstract: Deep learning models are used in safety-critical tasks such as automated driving and face recognition. However, small perturbations in the model input can significantly change the prediction...
- Parallel Multi-Circuit Quantum Feature Fusion in Hybrid Quantum-Classical Convolutional Neural Networks for Breast Tumor Classification : Abstract: Quantum machine learning has emerged as a promising approach to improve feature extraction and classification tasks in high-dimensional data domains such as medical imaging. In this work, we...
- Large Language Model based Smart Contract Auditing with LLMBugScanner : Abstract: This paper presents LLMBugScanner, a large language model (LLM) based framework for smart contract vulnerability detection using fine-tuning and ensemble learning. Smart contract auditing pr...
- DPWMixer: Dual-Path Wavelet Mixer for Long-Term Time Series Forecasting : Abstract: Long-term time series forecasting (LTSF) is a critical task in computational intelligence. While Transformer-based models effectively capture long-range dependencies, they often suffer from ...
- HTG-GCL: Leveraging Hierarchical Topological Granularity from Cellular Complexes for Graph Contrastive Learning : Abstract: Graph contrastive learning (GCL) aims to learn discriminative semantic invariance by contrasting different views of the same graph that share critical topological patterns. However, existing...
- FDRMFL: Multi-modal Federated Feature Extraction Model Based on Information Maximization and Contrastive Learning : Abstract: This study focuses on the feature extraction problem in multi-modal data regression. To address three core challenges in real-world scenarios: limited and non-IID data, effective extraction ...
- Comparing Baseline and Day-1 Diffusion MRI Using Multimodal Deep Embeddings for Stroke Outcome Prediction : Abstract: This study compares baseline (J0) and 24-hour (J1) diffusion magnetic resonance imaging (MRI) for predicting three-month functional outcomes after acute ischemic stroke (AIS). Seventy-four A...
- Feature Selection Empowered BERT for Detection of Hate Speech with Vocabulary Augmentation : Abstract: Abusive speech on social media poses a persistent and evolving challenge, driven by the continuous emergence of novel slang and obfuscated terms designed to circumvent detection systems. In ...
- Young Children's Anthropomorphism of AI Chatbots and the Role of Parent Co-Presence : Abstract: Artificial Intelligence (AI) chatbots powered by a large language model (LLM) are entering young children's learning and play, yet little is known about how young children construe these age...
- CLEF: Clinically-Guided Contrastive Learning for Electrocardiogram Foundation Models : Abstract: The electrocardiogram (ECG) is a key diagnostic tool in cardiovascular health. Single-lead ECG recording is integrated into both clinical-grade and consumer wearables. While self-supervised ...
- Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models : Abstract: Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time o...
- Story2MIDI: Emotionally Aligned Music Generation from Text : Abstract: In this paper, we introduce Story2MIDI, a sequence-to-sequence Transformer-based model for generating emotion-aligned music from a given piece of text. To develop this model, we construct th...
- Enforcing Orderedness to Improve Feature Consistency : Abstract: Sparse autoencoders (SAEs) have been widely used for interpretability of neural networks, but their learned features often vary across seeds and hyperparameter settings. We introduce Ordered...
- A Knowledge-Based Language Model: Deducing Grammatical Knowledge in a Multi-Agent Language Acquisition Simulation : Abstract: This paper presents an initial study performed by the MODOMA system. The MODOMA is a computational multi-agent laboratory environment for unsupervised language acquisition experiments such t...
- Bin2Vec: Interpretable and Auditable Multi-View Binary Analysis for Code Plagiarism Detection : Abstract: We introduce Bin2Vec, a new framework that helps compare software programs in a clear and explainable way. Instead of focusing only on one type of information, Bin2Vec combines what a progra...
- Multifractal Recalibration of Neural Networks for Medical Imaging Segmentation : Abstract: Multifractal analysis has revealed regularities in many self-seeding phenomena, yet its use in modern deep learning remains limited. Existing end-to-end multifractal methods rely on heavy po...
- Improved Training Mechanism for Reinforcement Learning via Online Model Selection : Abstract: We study the problem of online model selection in reinforcement learning, where the selector has access to a class of reinforcement learning agents and learns to adaptively select the agent ...
- Orchestration Framework for Financial Agents: From Algorithmic Trading to Agentic Trading : Abstract: The financial market is a mission-critical playground for AI agents due to its temporal dynamics and low signal-to-noise ratio. Building an effective algorithmic trading system may require a...
- The 4/$\delta$ Bound: Designing Predictable LLM-Verifier Systems for Formal Method Guarantee : Abstract: The idea of using Formal Verification tools with large language models (LLMs) has enabled scaling software verification beyond manual workflows. However, current methods remain unreliable. W...
- Flowchart2Mermaid: A Vision-Language Model Powered System for Converting Flowcharts into Editable Diagram Code : Abstract: Flowcharts are common tools for communicating processes but are often shared as static images that cannot be easily edited or reused. We present Flowchart2Mermaid, a lightweight web...
- From monoliths to modules: Decomposing transducers for efficient world modelling : Abstract: World models have been recently proposed as sandbox environments in which AI agents can be trained and evaluated before deployment. Although realistic world models often have high computatio...
- STRIDE: A Systematic Framework for Selecting AI Modalities - Agentic AI, AI Assistants, or LLM Calls : Abstract: The rapid shift from stateless large language models (LLMs) to autonomous, goal-driven agents raises a central question: When is agentic AI truly necessary? While agents enable multi-step re...
- Benchmarking LLM Agents for Wealth-Management Workflows : Abstract: Modern work relies on an assortment of digital collaboration tools, yet routine processes continue to suffer from human error and delay. To address this gap, this dissertation extends TheAge...
- TradeTrap: Are LLM-based Trading Agents Truly Reliable and Faithful? : Abstract: LLM-based trading agents are increasingly deployed in real-world financial markets to perform autonomous analysis and execution. However, their reliability and robustness under adversarial o...
- Bridging the Gap: Toward Cognitive Autonomy in Artificial Intelligence : Abstract: Artificial intelligence has advanced rapidly across perception, language, reasoning, and multimodal domains. Yet despite these achievements, modern AI systems remain fundamentally limited in...
- DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses : Abstract: Large language models (LLMs) now mediate many web-based mental-health, crisis, and other emotionally sensitive services, yet their psychosocial safety in these settings remains poorly unders...
- Model Recovery at the Edge under Resource Constraints for Physical AI : Abstract: Model Recovery (MR) enables safe, explainable decision making in mission-critical autonomous systems (MCAS) by learning governing dynamical equations, but its deployment on edge devices is h...
- Breast Cell Segmentation Under Extreme Data Constraints: Quantum Enhancement Meets Adaptive Loss Stabilization : Abstract: Annotating medical images demands significant time and expertise, often requiring pathologists to invest hundreds of hours in labeling mammary epithelial nuclei datasets. We address this cri...
- OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning : Abstract: Omni-modal Large Language Models (OLLMs) that process text, images, videos, and audio introduce new challenges for safety and value guardrails in human-AI interaction. Prior guardrail resear...
- Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective : Abstract: Spatial reasoning is a core aspect of human intelligence that allows perception, inference and planning in 3D environments. However, current vision-language models (VLMs) struggle to maintai...
- Beyond Playtesting: A Generative Multi-Agent Simulation System for Massively Multiplayer Online Games : Abstract: Optimizing numerical systems and mechanism design is crucial for enhancing player experience in Massively Multiplayer Online (MMO) games. Traditional optimization approaches rely on large-sc...
- Synthetic Error Injection Fails to Elicit Self-Correction In Language Models : Abstract: Reinforcement learning has become the dominant paradigm for eliciting reasoning and self-correction capabilities in large language models, but its computational expense motivates exploration...
- Semantic Trading: Agentic AI for Clustering and Relationship Discovery in Prediction Markets : Abstract: Prediction markets allow users to trade on outcomes of real-world events, but are prone to fragmentation through overlapping questions, implicit equivalences, and hidden contradictions acros...
- Guided Self-Evolving LLMs with Minimal Human Supervision : Abstract: AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet i...
- COPE: Chain-Of-Thought Prediction Engine for Open-Source Large Language Model Based Stroke Outcome Prediction from Clinical Notes : Abstract: Predicting outcomes in acute ischemic stroke (AIS) guides clinical decision-making, patient counseling, and resource allocation. Clinical notes contain rich contextual information, but their...
- Aetheria: A multimodal interpretable content safety framework based on multi-agent debate and collaboration : Abstract: The exponential growth of digital content presents significant challenges for content safety. Current moderation systems, often based on single models or fixed pipelines, exhibit limitations...
- Empathy Level Prediction in Multi-Modal Scenario with Supervisory Documentation Assistance : Abstract: Prevalent empathy prediction techniques primarily concentrate on a singular modality, typically textual, thus neglecting multi-modal processing capabilities. They also overlook the utilizati...
- PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing : Abstract: Large language models are increasingly embedded into academic writing workflows, yet existing assistants remain external to the editor, preventing deep interaction with document state, struc...
- IACT: A Self-Organizing Recursive Model for General AI Agents: A Technical White Paper on the Architecture Behind kragent.ai : Abstract: This technical white paper introduces the Interactive Agents Call Tree (IACT), a computational model designed to address the limitations of static, hard-coded agent workflows. Unlike traditi...
- Target-specific Adaptation and Consistent Degradation Alignment for Cross-Domain Remaining Useful Life Prediction : Abstract: Accurate prediction of the Remaining Useful Life (RUL) in machinery can significantly diminish maintenance costs, enhance equipment up-time, and mitigate adverse outcomes. Data-driven RUL pr...
- Zero-Shot Instruction Following in RL via Structured LTL Representations : Abstract: Linear temporal logic (LTL) is a compelling framework for specifying complex, structured tasks for reinforcement learning (RL) agents. Recent work has shown that interpreting LTL instruction...
- Exploring Depth Generalization in Large Language Models for Solving Recursive Logic Tasks : Abstract: Large language models have demonstrated remarkable capabilities across many tasks, yet face significant challenges when dealing with recursive reasoning problems, those requiring the resolut...
- Learning What to Attend First: Modality-Importance-Guided Reasoning for Reliable Multimodal Emotion Understanding : Abstract: In this paper, we present Modality-Importance-Guided Reasoning (MIGR), a framework designed to improve the reliability of reasoning-based multimodal emotion understanding in multimodal large...
- Training Data Attribution for Image Generation using Ontology-Aligned Knowledge Graphs : Abstract: As generative models become powerful, concerns around transparency, accountability, and copyright violations have intensified. Understanding how specific training data contributes to a model...
- Menta: A Small Language Model for On-Device Mental Health Prediction : Abstract: Mental health conditions affect hundreds of millions globally, yet early detection remains limited. While large language models (LLMs) have shown promise in mental health applications, their...
- StockMem: An Event-Reflection Memory Framework for Stock Forecasting : Abstract: Stock price prediction is challenging due to market volatility and its sensitivity to real-time events. While large language models (LLMs) offer new avenues for text-based forecasting, their...
- AuditCopilot: Leveraging LLMs for Fraud Detection in Double-Entry Bookkeeping : Abstract: Auditors rely on Journal Entry Tests (JETs) to detect anomalies in tax-related ledger records, but rule-based methods generate overwhelming false positives and struggle with subtle irregular...
- Self-Improving AI Agents through Self-Play : Abstract: We extend the moduli-theoretic framework of psychometric batteries to the domain of dynamical systems. While previous work established the AAI capability score as a static functional on the ...
- A Framework for Causal Concept-based Model Explanations : Abstract: This work presents a conceptual framework for causal concept-based post-hoc Explainable Artificial Intelligence (XAI), based on the requirements that explanations for non-interpretable model...
Research Sources: 390 | Generated: 12/3/2025
