The recent wave of research on machine learning, artificial intelligence, and their applications has been fuelled by hardware improvements, ML compilers, and deep learning frameworks that simplify the design, training, and inference of neural models. Over the past five years, EuroMLSys has been held in conjunction with EuroSys (https://www.euromlsys.eu). Its programs have covered topics such as neural model and inference performance optimisation and compilation for ML workloads. The past workshops were successful, with high-quality papers and sizeable audiences.
This year’s workshop will provide a platform for discussing emerging trends in building frameworks, programming models, optimisation algorithms, and software engineering to support AI/ML applications, as well as the use of ML to build such frameworks and optimisation tools. The recent emergence of LLMs and agentic systems is marked by substantial computational requirements, so optimisation in every part of the system will be important. EuroMLSys aims to bridge the gap between AI research and practice through a technical program of fresh ideas on software infrastructure, tools, design principles, and theory/algorithms, from a systems perspective. We will also explore potential applications that take advantage of ML.
News
- The keynote speaker is announced! Laurent Bindschaedler (Max Planck Institute for Software Systems) will give a talk titled “What Survives When Code Doesn't?”.
- The workshop program is up! It will start at 9:00 am.
Key dates
- Paper submission deadline: ~~February 15, 2026 (23:59 AoE)~~ February 24, 2026 (23:59 AoE)
- Acceptance notification: ~~March 17, 2026~~ March 22, 2026
- Final paper due: ~~April 2, 2026~~ April 10, 2026
- Workshop: April 27, 2026 (full-day workshop)
Past Editions
Call for Papers
A growing area of interest in machine intelligence is at the intersection of AI/ML and systems design. At the same time, applications of ML are growing in complexity and so is the volume of data they produce/consume. For computer systems to scale, new learning approaches and advanced optimisation techniques are needed. We also need to understand better the current AI/ML frameworks, in terms of their functionality, limitations, and target applications. This will clarify potential desired functions and future architectures. Novel machine learning methods to optimise and accelerate software and hardware systems must also be developed.
EuroMLSys is an interdisciplinary workshop that brings together researchers in computer architecture, systems and machine learning, along with practitioners who are active in these emerging areas.
Topics of interest include, but are not limited to, the following:
- Scheduling algorithms for data processing clusters
- Custom hardware for machine learning
- Hardware-efficient ML methods
- Accelerators/GPU optimisation
- LLM-based hardware design or system optimisation techniques
- Programming languages for machine learning
- Benchmarking systems (for machine learning algorithms)
- Synthetic input data generation for training
- Systems for training and serving machine learning models at scale
- Graph neural networks
- Neural network compression and pruning in systems
- Large scale distributed learning algorithms in practice
- Database systems for large scale learning
- Systems for model-free and model-based Reinforcement Learning
- Optimisation in end-to-end deep learning
- System optimisation using Bayesian Optimisation
- Use of probabilistic models in ML/AI applications
- Analysis of distributed ML algorithms
- Probabilistic modelling for distributed ML algorithms
- Synchronisation and state control of distributed ML algorithms
- ML Compiler Optimisation
- Optimisation in Large Language Models (LLMs)
- Agentic Systems
Accepted papers will be published in the ACM Digital Library (you can opt out of this).
Program
Program timezone is BST (UTC+1).
| 08:50 | Opening | |
| 09:00 | Session 1: LLM 1: Inference, Memory, Agents (11mins x 8) Chair (Eiko Yoneki - University of Cambridge) | |
| ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents Mofasshara Rafique (Ferring); Laurent Bindschaedler (Max Planck Institute for Software Systems) Stateful tool-using LLM agents treat the context window as working memory, yet today's agent harnesses manage residency and durability as best-effort, causing recurring failures: lost state after compaction, bypassed flushes on reset, and destructive writeback. We present ClawVM, a virtual memory layer that manages state as typed pages with minimum-fidelity invariants, multi-resolution representations under a token budget, and validated writeback at every lifecycle boundary. Because the harness already assembles prompts, mediates tools, and observes lifecycle events, it is the natural enforcement point; placing the contract there makes residency and durability deterministic and auditable. Across synthetic workloads, 12 real-session traces, and adversarial stress tests, ClawVM eliminates all policy-controllable faults whenever the minimum-fidelity set fits within the token budget, confirmed by an offline oracle, and adds a median of < 50 µs of policy-engine overhead per turn. | ||
| Sampling Where It Matters: Predicting LLM Serving Performance Emile Aydar, Christian Pinto, Srikumar Venugopal (IBM Research Europe); Dimitris Chatzopoulos (University College Dublin) Characterizing Large Language Model (LLM) serving performance is a combinatorial problem where a suboptimal choice wastes profiling budget: every change in model, hardware, or software version requires fresh profiling, yet exhaustive benchmarking is infeasible. Existing approaches (simulators and static performance estimators) lose fidelity on novel architectures or target only optima. We introduce **Predictive Kernel Herding (PKH)**, a sampler which reformulates Random Forest leaf co-occurrence as linear-time histogram matching, replacing O(N²) kernel comparisons. On four real-world LLM serving traces spanning 3,000+ configurations, PKH is the only sampler that delivers top-ranked accuracy on both throughput and latency predictions, dominating the cost–accuracy Pareto frontier. PKH predicts output throughput within 10% MAPE and mean Time to First Token (TTFT) within 20% MAPE, reaching practically useful accuracy with up to 1.6× lower profiling time than the next-best method at equivalent error. | ||
| PROTEUS: SLA-Aware Routing via Lagrangian RL for Multi-LLM Serving Systems Amit Singh Bhatti, Vishal Vaddina, Dagnachew Birru (Phi Labs, Quantiphi) Production LLM deployments serve diverse workloads where cost and quality requirements vary by customer tier, time of day, and query criticality. Model serving systems accept latency service-level objectives (SLOs) directly. LLM routers do not. They force operators to tune parameters offline and guess what accuracy might result. The relationship between parameters and outcomes is indirect, non-monotonic, and dataset-dependent. Operators need to specify accuracy targets, not infer them from opaque settings. We present **PROTEUS** (Polymorphic Router for Operational Target Enforcement with Unified SLA), a router that accepts accuracy targets τ as runtime input. PROTEUS uses Lagrangian dual control (see the generic sketch at the end of the program). A learned dual variable λ tracks constraint violations during training and conditions the policy network. This lets the router translate specified τ values into routing decisions that satisfy them. A single trained model serves the full accuracy spectrum without retraining. We evaluate on RouterBench (11 models, 405K queries) and SPROUT (14 models, 45K queries). PROTEUS achieves consistent floor compliance where accuracy meets or exceeds τ. The target-response correlation reaches 0.97 to 0.98. The closest baseline, OmniRouter, meets floors only 22% of the time despite also using Lagrangian optimization. PROTEUS operates across τ ∈ [0.85, 0.95] from a single model. On RouterBench it achieves 90.1% accuracy, within 1.3% of oracle. On SPROUT it achieves 94.0% accuracy, within 4.6% of oracle. Cost savings reach 89.8% versus the best fixed model. | ||
| Accuracy Is Speed: Towards Long-Context-Aware Routing for Distributed LLM Serving Takeshi Yoshimura (IBM Research - Tokyo); Valentijn Dymphnus van de Beek (Delft University of Technology); Tatsuhiro Chiba (IBM Research - Tokyo) Distributed LLM serving systems optimize per-request latency and throughput. However, under long-context workloads, inference accuracy becomes more variable. When incorrect responses trigger retries, accuracy directly translates into cumulative user-visible delay that is not captured by single-shot latency metrics. In this work, we argue that under long-context serving, **accuracy becomes speed** through retry dynamics. We introduce *Time-to-Correct-Answer (TTCA)*, a metric that measures the wall-clock time required to obtain the first correct response. Our measurement study shows that prompt characteristics such as length and language amplify accuracy variance, which inflates TTCA. We demonstrate *Lightweight Accuracy-Aware Routing (LAAR)*, a capability-based routing design that reduces TTCA. Our results suggest that in long-context distributed serving, accuracy should be treated as a first-class systems objective. | ||
| Dealing With The Elephant in the KV Cache: Video Frame Sampling for Multimodal LLM Inference Konstantinos Papaioannou (Universidad Politécnica de Madrid & IMDEA Software Institute); Thaleia Dimitra Doudali (IMDEA Software Institute) | ||
| Towards a Solution to the Management Scaling Paradox in Distributed LLM Inference Amir Noohi, Bita Asoodeh, Antonio Barbalace (The University of Edinburgh) In disaggregated LLM serving, prefix caching avoids redundant prefill by reusing KV cache across nodes, but existing systems manage this cache entirely in user space, atop the very kernel that already manages memory, page tables, and RDMA registration. Our instrumented profiling of the LMCache + Mooncake stack reveals a *management scaling paradox*: user-space overhead from coordination RPCs, redundant memory copies, and transfer fragmentation grows to 79% of achievable time-to-first-token (TTFT) as the cache warms, nearly doubling latency when caching should help most. We present RMC (Remote Memory for Cache), an in-kernel framework that extends OS-provided shared memory (already used locally via `/dev/shm` in systems like vLLM) to cluster-wide scope. RMC exploits two KV cache properties, fixed per-token size and content-determined identity, to provide a content-addressed Partitioned Global Address Space where any node computes the address of any cached chunk without coordination. Across four representative workloads, RMC achieves 1.1–1.3× TTFT reduction over LMCache + Mooncake and up to 2.1× over full recompute. | ||
| Asynchronous Verified Semantic Caching for Tiered LLM Architectures Asmit Kumar Singh, Haozhe Wang, Laxmi Naga Santosh Attaluri, Tak Chiam, Weihua Zhu (Apple) Large language models (LLMs) now sit in the critical path of search, assistance, and agentic workflows, making semantic caching essential for reducing inference cost and latency. Production deployments typically use a tiered static-dynamic design: a static cache of curated, offline vetted responses mined from logs, backed by a dynamic cache populated online. In practice, both tiers are commonly governed by a single embedding similarity threshold, which induces a hard tradeoff: conservative thresholds miss safe reuse opportunities, while aggressive thresholds risk serving semantically incorrect responses. We introduce **Krites**, an asynchronous, LLM-judged caching policy that expands static coverage without changing serving decisions. On the critical path, Krites behaves exactly like a standard static threshold policy. When the nearest static neighbor of the prompt falls just below the static threshold, Krites asynchronously invokes an LLM judge to verify whether the static response is acceptable for the new prompt. Approved matches are promoted into the dynamic cache, allowing future repeats and paraphrases to reuse curated static answers and expanding static reach over time. In trace-driven simulations on conversational and search workloads, Krites increases the fraction of requests served with curated static answers (direct static hits plus verified promotions) by up to **3.9×** for conversational traffic and search-style queries relative to tuned baselines, with unchanged critical path latency. | ||
| Pooling Engram Conditional Memory in Large Language Models using CXL Ruiyang Ma (Peking University); Teng Ma (Alibaba Cloud Computing); Zhiyuan Su (Shandong Yingxin Computer Technology Co., Ltd); Hantian Zha (Renmin University of China); Xinpeng Zhao, Xuchun Shang, Xingrui Yi, Zheng Liu (Alibaba Cloud Computing); Zhu Cao, An Wu, Zhichong Dou (Shandong Yingxin Computer Technology Co., Ltd); Ziqian Liu (The University of Hong Kong); Daikang Kuang, Guojie Luo (Peking University) Engram conditional memory has emerged as a promising component for LLMs by decoupling static knowledge lookup from dynamic computation. Since Engram exhibits sparse access patterns and supports prefetching, its massive embedding tables are well-suited for offloading to lower-tier memory. In this paper, we propose using Compute Express Link (CXL) memory pool for Engram storage. Compared to RDMA, CXL provides fine-grained and low-latency access required by minimal and discrete retrieval patterns of Engram. We integrate the CXL-based Engram pool into SGLang, achieving near-DRAM end-to-end performance. This provides a scalable and cost-efficient storage solution for future Engram-integrated LLMs without compromising inference performance. | ||
| 10:30 | Coffee Break / Poster Session (Browsing) | |
| 11:00 | Session 2: LLM2: Inference, Hardware (11mins x 8) Chair (Paul Patras - University of Edinburgh) | |
| SwiftNPU: Scalable Shape-Flexible Allocation for Inter-Core Connected NPUs Gangmin Lee (KAIST); Wonsup Yoon (The University of Texas at Austin); Sue Moon (KAIST) Modern NPUs adopt an inter-core connected (ICC) architecture: a 2D mesh of compute tiles with distributed SRAM scratchpads and a network-on-chip. In multi-tenant settings, this makes virtualization a *spatial allocation* problem, where the virtualization layer chooses not only how many tiles to allocate to each job, but also their shape and location. Recent topology-aware NPU virtualization improves performance by allocating shape-flexible submeshes, but relies on an NP-hard graph edit distance (GED) algorithm that does not scale to modern NPU mesh sizes. We present **SwiftNPU**, a host-side virtualization layer for ICC NPUs that enables concurrent execution on arbitrary submeshes and performs shape-flexible spatial allocation in polynomial time. SwiftNPU’s design targets utilization, workload makespan, and allocation latency with four polynomial-time spatial algorithms (LSSA, ASFF, ASBF, and NAS) that exhibit different trade-off spaces. We implement SwiftNPU on an off-the-shelf NPU, Tenstorrent's Blackhole. To support concurrent multi-tenant execution on disjoint submeshes in the NPU, we extend its runtime to enforce per-job tile sets. We evaluate SwiftNPU using operator-level GEMM workloads derived from BERT and ResNet. SwiftNPU improves utilization by up to 19.7 percentage points (1.37× improvement) and reduces workload makespan by up to 12.4% (1.14× speedup) over fixed-shape baselines, while keeping allocation overhead in the millisecond range. In a scaling microbenchmark, SwiftNPU’s NAS allocator achieves up to 890,000× lower allocation overhead than the GED-based approach. These results show that polynomial-time, shape-flexible mesh allocation is a practical path to scalable NPU virtualization on large meshes of modern ICC NPUs. | ||
| Cost-Efficient Training and Checkpointing for Large Models on Preemptible Cloud VMs Omkar Desai (Syracuse University); Shuyi Pei (Samsung Semiconductor, Inc.); Janki Bhimani (Florida International University); Bryan S. Kim (Syracuse University) Training large models on discounted spot VMs offers significant cost savings but remains challenging due to their low availability and unilateral preemptions. To address these challenges, we present a cost-effective training and checkpointing system for large models using spot VMs. First, we predict the preemption rate for a spot instance using historical preemption data. Second, we dynamically tune checkpointing interval through a mathematical model that uses the preemption predictions and balances the overheads of checkpointing and recovery. Finally, we guarantee consistent training throughput and minimized training cost through prediction-informed hybrid resource utilization: switching to on-demand instances when spot VM availability is low while also opportunistically scaling the number of spot VMs for training when availability is high. We improve training throughput by up to 60.27% with dynamic checkpoint interval tuning when compared to fixed-interval approaches. We also achieve up to 2.04× higher throughput at a 51.41% lower cost through effective use of spot and on-demand instances. | ||
| Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching Hiroki Matsutani, Naoki Matsuda, Naoto Sugiura (Keio University) Since local LLM inference on resource-constrained edge devices imposes a severe performance bottleneck, this paper proposes distributed prompt caching to enhance inference performance by cooperatively sharing intermediate processing states across multiple low-end edge devices. To fully utilize prompt similarity, our distributed caching mechanism also supports partial matching. As this approach introduces communication overhead associated with state sharing over a wireless network, we introduce a Bloom-filter-based data structure, referred to as a catalog, to determine whether a remote server possesses the desired internal states, thereby suppressing unnecessary communication. Experiments using the Gemma-3 270M model and the MMLU dataset on the Raspberry Pi Zero 2W platform demonstrate that the proposed approach reduces TTFT (Time to First Token) and TTLT (Time to Last Token) by 93.12% and 50.07% on average, respectively. | ||
| GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations Ehsan Yousefzadeh-Asl-Miandoab (IT University of Copenhagen); Reza Karimzadeh (University of Copenhagen); Danyal Yorulmaz (IT University of Copenhagen); Bulat Ibragimov (University of Copenhagen); Pinar Tözün (IT University of Copenhagen) Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization estimation -- a key proxy for contention -- enables interference-aware scheduling. Existing GPU memory estimators span three paradigms -- analytical models, CPU-side libraries, and ML-based estimators -- each with distinct limitations: dependence on detailed model specifications, intrusive integration, poor generalization, and varying latency overhead. GPU heterogeneity further complicates estimation, as identical tasks can exhibit different memory footprints across hardware generations. GPU utilization remains comparatively understudied, further complicated by non-additive utilization metrics and GPU heterogeneity. We conduct a systematic analysis of representative estimators from each paradigm -- Horus, PyTorch FakeTensor, and our lightweight ML-based estimator -- evaluating accuracy, generalizability, and overhead. We construct a synthetic dataset spanning MLPs, CNNs, and Transformers with controlled architectural variations, and train MLP- and Transformer-based estimators for memory prediction, and experiment with utilization estimation. Our evaluation reveals key tradeoffs and validates estimators against real-world unseen models. Significant challenges remain: analytical models lack generalization and cannot easily be extended to new GPU architectures or accurately reflect memory optimization savings; CPU-side libraries impose intrusive integration overhead; and both analytical and ML-based estimators rely on model specifications or computation graphs, limiting generalization across diverse architectures and hardware variants. We release all datasets, tools, and artifacts to support further research. | ||
| GPU Acceleration of Sparse Fully Homomorphic Encrypted DNNs Lara D’Agata (University of Glasgow); Carlos Agulló-Domingo, Óscar Vera-López (University of Murcia); Kaustubh Shivdikar, Ardhi W. B. Yudha, Ferhat Yaman (AMD); David Kaeli (Northeastern University); José L. Abellán (University of Murcia); Ian Colbert (AMD); José Cano (University of Glasgow) Fully homomorphic encryption (FHE) has recently attracted significant attention as both a cryptographic primitive and a systems challenge. Given the latest advances in accelerated computing, FHE presents a promising opportunity for progress, with applications ranging from machine learning to information security. We target the most computationally intensive operation in deep neural networks from a hardware perspective, matrix multiplication (matmul), and adapt it for execution on AMD GPUs. We propose a new optimized method that improves the runtime and complexity of ciphertext matmul by using FIDESlib, a recent open-source FHE library designed specifically for GPUs. By exploiting sparsity in both operands, our sparse matmul implementation outperforms its CPU counterpart by up to 3.0x and reduces the time complexity from cubic to semi-linear, demonstrating an improvement over existing FHE matmul implementations. | ||
| Bridging CPU and GPU Autoscaling for Cost-Efficient Inference Serving Mehran Salmani (TU Ilmenau); Kamran Razavi (turbalance); Peter Amthor (TU Ilmenau); Boris Koldehofe (Marburg University) Autoscaling mechanisms have been used by inference serving systems to maintain service quality and efficiently use cloud resources (CPUs/GPUs). Horizontal scaling-based mechanisms have a responsiveness issue due to cold-start delays. In-place CPU vertical scaling adopted by recent works avoids cold-start; however, it is constrained by hardware limits and model parallelizability, and is limited to CPUs, which are efficient for low workloads but cost-inefficient for large workloads. Thus, combining CPUs and GPUs under dynamic workloads is worthwhile. However, enabling responsive switching between them is challenging. We propose VCG, an inference serving system that utilizes different resources based on the workload, and enables a smooth transition from CPUs to GPUs and vertically scaling GPUs (adjusting the GPU percentage) by forking a new process with enough GPU compute resources using CUDA MPS, and delivering inference requests to the child process through inter-process communication pipes. When the capacity of service is (nearly) exhausted, VCG uses horizontal scaling to provide scalability headroom for future steps. We evaluate a prototype of VCG under two different workloads and show that it reduces SLO violations up to 8× compared to a baseline using horizontal scaling of GPU instances under a bursty workload. It can also reduce the cost by around 60% compared to two CPU-based autoscaling baselines under a dynamic (non-bursty) workload. | ||
| Where the Time Goes: Analysis of a Public LLM Serving System Büsra Karatay Demiray (HES-SO); Ehsan Yousefzadeh-Asl-Miandoab (IT University of Copenhagen); Benoît Garbinato (University of Lausanne); Pinar Tözün (IT University of Copenhagen); Pamela Delgado (HES-SO) In this study, we present a characterization of serving traces collected from PublicAI's serving of Apertus, an open source Large Language Model (LLM). The trace spans roughly five months (September 2025–January 2026) and contains 337K requests. We analyzed request sizes, token and timing behaviour, latency, model-size effects, and temporal patterns. Our findings reveal insights that do not align with common assumptions: (1) time-to-first-token is often driven by queuing rather than prefill compute, especially for small requests; (2) the 8B and 70B models show nearly the same user-perceived latency despite a 9× parameter gap; (3) a substantial fraction of requests are prefill/queuing-dominated rather than decode-dominated; and (4) observable input features are weak predictors of output, which makes size-aware scheduling difficult at arrival time. As a contribution to the research community, we will publish this anonymized trace along with its analysis. | ||
| Characterizing Energy and Performance for Distributed Training of Large Language Models Zhendong Zhang, Foteini Strati, Ana Klimovic (ETH Zurich) Distributed LLM training dominates compute, power and energy consumption in modern datacenters. Parallelization strategies are tuned to maximize training throughput under memory constraints, while GPU frequency can be adjusted to reduce power. However, these knobs are usually tuned in isolation, and their combined effect on energy efficiency is poorly understood. We evaluate how different parallelization strategies interact with GPU frequency scaling, measuring throughput, power draw, and total energy across different transformer models. We find that faster configurations generally reduce energy under fixed hardware settings, but sensitivity to reduced GPU frequency varies widely across parallelism strategies. As a result, achieving Pareto-optimal performance-energy tradeoffs requires joint optimization of parallelization strategies and GPU frequency. We provide guidelines for scheduling and modeling distributed LLM training to balance performance, power, and energy constraints. | ||
| 12:30 | Introduction to Poster Exhibition (Full day) | |
| 12:30 | Lunch Break / Poster Session (Browsing) | |
| 14:00 | Keynote: What Survives When Code Doesn't? Laurent Bindschaedler (MPI) | |
| 14:55 | Session 3: Security, Privacy (11mins x 3) Chair (Jingjie Li - University of Edinburgh) | |
| Systems-Level Attack Surface of Edge Agent Deployments on IoT Zhonghao Zhan, Krinos Li (Imperial College London); Yefan Zhang (ByteDance); Hamed Haddadi (Imperial College London) Edge deployment of LLM agents on IoT hardware introduces attack surfaces absent from cloud-hosted orchestration. We present an empirical security analysis of three architectures (cloud-hosted, edge-local swarm, and hybrid) using a multi-device home-automation testbed with local MQTT messaging and an Android smartphone as an edge inference node. We identify five systems-level attack surfaces, including two emergent failures observed during live testbed operation: coordination-state divergence and induced trust erosion. We frame core security properties as measurable systems metrics: data egress volume, failover window exposure, sovereignty boundary integrity, and provenance chain completeness. Our measurements show that edge-local deployments eliminate routine cloud data exposure but silently degrade sovereignty when fallback mechanisms trigger, with boundary crossings invisible at the application layer. Provenance chains remain complete under cooperative operation yet are trivially bypassed without cryptographic enforcement. Failover windows create transient blind spots exploitable for unauthorised actuation. These results demonstrate that deployment architecture, not just model or prompt design, is a primary determinant of security risk in agent-controlled IoT systems. | ||
| Peeling the Layers of Privacy-Utility Onion on Tabular Data Jiawei Wang (University of Southampton); Zilong Zhao (Betterdata AI); Leonardo Aniello, Han Wu (University of Southampton) Machine Learning (ML) systems developed on sensitive data should be evaluated for unintended information leakage. Membership inference attacks (MIAs) serve as a standard auditing tool for quantifying unintended memorization. LiRA has emerged as the prevailing MIA methodology by evaluating true positive rates under strict false-positive constraints, thereby emphasizing worst-case leakage. Under LiRA, the *privacy onion effect* reveals that removing the most vulnerable samples and retraining can shift vulnerability to a different subset of the data. However, previous analyses focus primarily on image benchmarks, omit utility evaluation and do not systematically examine layer-by-layer removal. This paper introduces a structured evaluation protocol to analyze the privacy–utility onion effect in tabular machine learning models through iterative peeling. We evaluate the protocol across diverse tabular datasets and models. Our findings show that progressive layer-by-layer removal can yield stronger reductions in worst-case leakage than one-shot removal; they also uncover a consistent privacy–utility tradeoff, informing privacy auditing and defenses for tabular ML systems. | ||
| Towards Practically-Secure Tools for AI Agents Justus Adam, Yuchen Lu, Deepti Raghavan, Malte Schwarzkopf, Nikos Vasilakis (Brown University) Agentic AI applications rely on “tools” to operate on their environment. Tools are external programs invoked in response to a model’s request. Today, agentic applications must blindly trust that third-party tool documentation accurately describes a tool’s behavior. This risks tools accidentally leaking, misusing, or destroying user data. We present a new approach to protecting against unwanted behavior of tool code. Our approach combines automated code analysis with dynamic, fine-grained sandboxes that apply runtime policy checks. Code analysis captures a complete picture of a tool’s “effects”, such as how it accesses the network and file system. From those effects, it produces synopses: coarse-grained descriptions of tool behavior. Fine-grained sandboxes provide runtime policy enforcement, and the code analysis verifies that the untrusted tool code uses fine-grained sandboxes correctly. An application-side policy enforcement layer then decides whether a tool call requested by the model should be allowed, denied or restricted, depending on the types of effects the tool performs and whether they are sandboxed. Preliminary experiments with a real tool server demonstrate that our approach offers improved policy enforcement outcomes and preserves utility for users. | ||
| 15:35 | Coffee Break / Poster Session | |
| 16:00 | Session 4: RL, Post training, Edge (11mins x 7) Chair (Luo Mai - University of Edinburgh) | |
| EARL: Efficient Agentic RL Post-Training for LLMs under Dynamic Context Lengths Zheyue Tan, Tuo Shi (Aalto University); Huining Yuan, Zelai Xu, Chao Yu (Tsinghua University); Boxun Li (Infinigence AI); Yu Wang (Tsinghua University); Bo Zhao (Aalto University) Agentic reinforcement learning (RL) is increasingly used to post-train large language models (LLMs) into multi-turn agents. However, scaling agentic RL is challenging because the *effective context length* grows during training: agents become more verbose and rollouts accumulate long histories. This *context-length explosion* creates two coupled systems bottlenecks: (i) attention compute and memory scale with length, causing truncation or out-of-memory (OOM) events that can destabilize learning; and (ii) experience tensors (e.g., tokens and log-probabilities) scale with length and make cross-device exchange a dominant cost. We present EARL, a length-elastic agentic RL system for LLM post-training under dynamic context lengths. EARL co-designs three mechanisms: (1) a *context length controller* that adapts the rollout cap to observed length dynamics to avoid premature truncation; (2) a *parallelism selector* that switches configurations as lengths shift to sustain throughput and prevent OOM; and (3) a *layout-aware data dispatcher* that decentralizes intermediate-batch exchange to eliminate centralized bottlenecks. Across strategy-game agent benchmarks on 128 NVIDIA H100 GPUs, EARL improves training stability and throughput under mixed and long-context regimes, delivering up to 11× faster experience dispatch and up to 1.31× higher rollout throughput, without relying on static hard limits or explicit length penalties. | ||
| Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs Jihao Xin, Tian Lyu, Qilong Pan (KAUST); Kesen Wang (HUMAIN); Marco Canini (KAUST) Post-training compression reduces LLM parameter counts but often produces irregular tensor dimensions that degrade GPU performance, a phenomenon we call *dimensional misalignment*. We present a full-stack analysis tracing root causes at three levels: framework, library, and hardware. The key insight is that model inference becomes slower because the resulting dimensions are unfriendly to the GPU execution stack. For example, Llama-3-8B compressed with activation-aware singular value decomposition (ASVD) has 15% fewer parameters yet runs no faster than the uncompressed baseline, because 95% of its dimensions are misaligned. We propose **GAC** (GPU-Aligned Compression), a new compression paradigm that wraps any dimension-reducing compressor and re-selects hardware-aligned dimensions via multi-choice knapsack optimization under the same parameter budget. We evaluate GAC on Llama-3-8B with ASVD and LLM-Pruner, achieving 100% alignment and recovering up to 1.5× speedup while preserving model quality. | ||
| LayoutBench: Performance Benchmarking of Cloud Storage Layouts for Multimedia Data Debopam Sanyal (Georgia Institute of Technology); Hongjie Chen (Dolby Laboratories); Alexey Tumanov (Georgia Institute of Technology); Joshua Kimball (Dolby Laboratories) Modern multimedia machine learning workloads increasingly store large-scale datasets in cloud object storage services such as AWS S3. How these samples are physically organized in storage (i.e., *storage layout*) directly affects how quickly and cheaply they can be retrieved. Yet the benchmarks used to guide storage decisions today focus on database engines and query processing, and none systematically evaluates how different storage layouts perform for multimedia data retrieval. We present **LayoutBench**, the first benchmark designed to fill this gap. It evaluates three representative layout strategies: storing each sample as an individual object (L1), sequentially packing samples into tar archives (L2), and organizing samples as columns in Parquet files (L3). We measure retrieval time, data transferred, and monetary cost using 11 queries of varying result-set sizes on ImageNet across six AWS EC2 instance configurations that span different network bandwidth and memory tiers. Our experiments reveal that L2 achieves lower latency than L1 and L3 through connection reuse, but loses this advantage as retrieval sizes become very large. L3 is the fastest for very large retrievals but transfers substantially more data across all query sizes due to row-group granularity, and requires significantly more memory. Across all layouts, data transfer cost dominates total expenditure, with L3 costing an order of magnitude more than L1 or L2. | ||
| Hardware-Aware Co-Design of Multi-Chip LLM Serving via Performance Modeling Suyeol Lee, Gyunghee Park (FuriosaAI) Efficient LLM inference on multi-chip accelerators requires joint optimization of computation, memory movement, and inter-chip communication. Compilation-based exploration of fine-grained strategies is expensive and often obscured by compiler artifacts. We present a hardware-aware performance simulator that reconstructs accelerator execution at command granularity, modeling tiled computation, explicit DRAM--SRAM transfers, collective communication, and overlap constraints. The proposed simulation framework is general; in this work, we instantiate and validate it on the RNGD accelerator system. The simulator closely matches real hardware traces. Using it, we demonstrate: (i) a fine-grained parallelism design achieving up to 77% throughput improvement over full tensor parallelism, and (ii) a tail-constrained chunked prefill strategy that improves the TTFT--TPOT trade-off. The quantitative results in this paper are derived from the proposed simulator which is validated against hardware-level traces. Our results show that hardware-aware simulation enables scalable multi-chip inference optimization design. | ||
| STEER: Software Toolkit for Edge Efficient Retraining Konstantina Orfanou (Foundation for Research and Technology-Hellas, Institute of Computer Science and University of Crete, Computer Science Department); Christos Kozanitis (Foundation for Research and Technology-Hellas, Institute of Computer Science) Edge AI enables low-latency inference on distributed devices, but on-device fine-tuning remains constrained by limited unified memory. Although modern edge accelerators such as NVIDIA Jetson make on-device training computationally feasible, optimizer state often dominates memory consumption, accounting for up to 42% in our PyTorch profiling. The optimizer state persists throughout training and often exceeds the memory footprint of activations and gradients. We present STEER, a PyTorch runtime that streams optimizer state from fast storage to reduce RAM usage without altering the training semantics. STEER retrieves layer-wise the optimizer state during backpropagation and co-designs the storage path to minimize overhead via direct transfers, sequential I/O, and lightweight array handling. On NVIDIA Jetson Orin Nano, STEER reduces peak training memory by up to 46% and enables training of models, such as Huge Vision Transformer, that fail to train on baseline PyTorch. | ||
| From Code to Execution: Multi-Agent GraphRAG for Automated Artifact Generation Amirhossein Layegh (KTH Royal Institute of Technology); Amir Payberah (KTH); Mihhail Matskin (KTH Royal Institute of Technology) Modern data and Machine Learning (ML) pipelines are developed as code repositories but must be manually transformed into deployable artifacts such as Docker images and Kubernetes workflows, a process prone to dependency and configuration errors. While Large Language Models (LLMs) can assist with code understanding, they often fail on multi-file repositories due to missing structural context and incomplete dependency resolution. We propose a multi-agent graph-based Retrieval-Augmented Generation (GraphRAG) framework that automates the generation of deployment artifacts from code repositories. The framework constructs an Abstract Syntax Tree (AST)-grounded knowledge graph, performs structure-aware retrieval, and coordinates specialized agents to synthesize containerization and workflow specifications. The system further improves reliability through iterative self-healing driven by execution logs. Evaluated on 17 GitHub repositories spanning ETL, ML training, and LLM-based evaluation pipelines, our approach achieves 65% end-to-end execution success on Kubernetes, outperforming a vector-only RAG baseline (24%) and a single-agent GraphRAG variant (47%). These results indicate that combining graph-centric retrieval with agent decomposition improves execution reliability. | ||
| SHARD: A Compatibility Framework for Deploying Transformer Models on Edge NPUs Adhitya Mohan, Richard Thompson (University of Colorado Boulder); Eric Keller (University of Colorado, Boulder); Mark Zhao (University of Colorado Boulder) Running inference on diverse transformer models, especially for emerging architectures such as Vision Language Models (VLMs), is often infeasible on many edge accelerators such as Neural Processing Units (NPUs). This limitation arises because transformer execution graphs frequently violate constraints imposed by vendor-provided, black-box NPU toolchains. These constraints include restricted operator sets, rigid tensor shape and layout requirements, and tight hardware limits such as limited on-chip SRAM. Consequently, many NPUs fail to execute modern transformer models end-to-end, or they fall back to CPU execution, resulting in degraded performance and efficiency. We present SHARD, a compatibility framework to maximize the deployability of diverse transformers on edge NPUs. SHARD transforms high-level models into portable input graphs (e.g., ONNX) compatible with diverse NPU hardware and toolchain constraints. SHARD introduces constraint-driven rewrites that decompose and explicitly shard execution graphs to fit operations within restricted on-chip SRAM limits, legalize unsupported operators by lowering them into hardware-supported primitives, and address impacts on model accuracy as a result of quantization. Using a Rockchip RK3588 NPU, we demonstrate how SHARD enables a vision encoder to natively execute on the NPU, achieving an 8.7× speedup over the vendor toolchain, which relies on CPU fallback. | ||
| 17:17 | Closing | |
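For readers less familiar with the Lagrangian dual control referenced in the PROTEUS abstract above, the sketch below illustrates the general shape of such an update loop for routing under an accuracy floor. It is a minimal, generic illustration under assumed names (dual_routing_step, tau, lam, eta) and toy numbers; it is not the authors' implementation.

```python
# Generic sketch of Lagrangian dual control for SLA-aware routing
# (illustrative only; all names and numbers are hypothetical).

def dual_routing_step(candidates, tau, lam, eta=0.05):
    """Pick a model by Lagrangian-penalised cost, then update the dual variable.

    candidates: list of (cost, expected_accuracy) pairs for the available models.
    tau: accuracy target supplied at runtime.
    lam: current Lagrange multiplier for the accuracy-floor constraint.
    eta: dual step size.
    """
    # Lagrangian objective: cost + lam * violation of (accuracy >= tau).
    def penalised(candidate):
        cost, acc = candidate
        return cost + lam * max(0.0, tau - acc)

    choice = min(candidates, key=penalised)
    _, achieved = choice
    # Dual ascent: increase lam while the accuracy floor is violated,
    # pushing subsequent decisions toward more accurate (costlier) models.
    lam = max(0.0, lam + eta * (tau - achieved))
    return choice, lam


# Toy usage: three models described by (cost, expected accuracy).
# A deliberately large step size is used so the toy shifts to a more
# accurate model within a few iterations.
models = [(1.0, 0.80), (1.5, 0.90), (4.0, 0.96)]
lam = 0.0
for _ in range(15):
    picked, lam = dual_routing_step(models, tau=0.92, lam=lam, eta=5.0)
    print(picked, round(lam, 3))
```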
| Posters Session | ||
| Opinion Depolarization in Social Networks using GNNs Konstantinos Mylonas, Thrasyvoulos Spyropoulos (Technical University of Crete) Nowadays, social media is a primary ground for political debate and exchange of opinions. A significant body of research suggests that these platforms are highly polarized, often exhibiting echo chamber structures where users connect mainly with like-minded individuals and limit their exposure to diverse content. Existing work based on popular opinion diffusion models shows that convincing a subset of “key” users to adopt moderate opinions can reduce the polarization of the entire network. Building on this insight, we propose an efficient algorithm to identify a set of users whose moderation minimizes polarization. Unlike prior approaches that require iteratively solving a large analytical model, our method leverages Graph Neural Networks (GNNs) to significantly reduce decision time while achieving close-to-oracle selection accuracy. | ||
| DisCEdge: Distributed Context Management for Large Language Models at the Edge Mohammadreza Malekabbasi, Minghe Wang, David Bermbach (TU Berlin) Deploying Large Language Model (LLM) services at the edge benefits latency-sensitive and privacy-aware applications. However, the stateless nature of LLMs makes managing user context (e.g., sessions, preferences) across geo-distributed edge nodes challenging. Existing solutions, such as client-side context storage, introduce network latency and bandwidth overhead, undermining edge deployment advantages. We propose DisCEdge, a distributed context management system that stores and replicates user context in tokenized form across edge nodes. By maintaining context as token sequences, our system avoids redundant computation and enables efficient data replication. We evaluate an open-source prototype in a realistic edge environment. DisCEdge improves median response times by up to 14.46% and lowers median inter-node synchronization overhead by up to 15% compared to a raw-text-based system. It also reduces client request sizes by a median of 90% compared to client-side context management, while guaranteeing data consistency. | ||
| GeoServe: Leveraging Disaggregated Data Processing for Scalable Geospatial Model Serving Gerard Finol (Universitat Rovira i Virgili); Christian Pinto (IBM Research Europe) Geospatial foundation models (GFMs) operate on large, multi-band raster products (e.g., GeoTIFF) that require expensive data access and preprocessing (reprojection, decoding, normalization, and tiling) before GPU inference. In our measurements, reading and preprocessing geospatial inputs can be orders of magnitude slower than tokenization or standard image preprocessing, and constitute 31–43% of end-to-end request time for a representative GFM. Existing inference frameworks such as vLLM execute this preprocessing inline with request handling, which under load serializes CPU and I/O work, increasing queueing delay, and leaving GPUs underutilized. We present GeoServe, a Ray-based serving system that decouples the geospatial data pipeline from GPU inference by disaggregating I/O- and CPU-heavy preprocessing to a scalable pool of CPU workers, while keeping GPU nodes dedicated to model forward passes. We show experimentally that GeoServe reduces the p90 request latency by up to 414.9× at high load and improves throughput by up to 4.74× compared to vanilla vLLM, while increasing the achieved model forward-pass rate from ~16 inferences/sec to ~72 inferences/sec via better batching opportunities. | ||
| Towards On-the-Fly Snapshot Memory Compression for Low-Latency Elastic Inference Serving Systems Radostin Stoyanov (University of Oxford); Viktória Spišaková (Masaryk University); Adrian Reber (Red Hat); Andrei Vagin (CRIU); Rodrigo Bruno (INESC-ID, Instituto Superior Técnico, University of Lisbon) In-memory model caching and startup latency are key bottlenecks in large-scale AI serving systems, especially for GPU-accelerated large language model (LLM) inference in elastic, serverless environments. While container checkpointing enables hot starts, it introduces new challenges in memory footprint, storage bandwidth, and restore latency. Existing offline snapshot compression methods reduce snapshot size but add extra I/O, storage duplication, and decompression overhead. In this paper, we present CRIU-LZ4, a restore-optimized method for on-the-fly compression integrated directly into the CPU–GPU checkpoint and restore pipelines. Built atop CRIUgpu, CRIU-LZ4 performs page-level compression during memory transfer, eliminating intermediate artifacts and minimizing the latency on the restore critical path. Our evaluation results show that CRIU-LZ4 reduces cold-start latency by 46–59% and achieves up to 6× smaller snapshots compared to uncompressed GPU-aware checkpointing, while eliminating the decompression bottleneck of offline compression, significantly reducing both end-to-end restore time and peak disk usage. | ||
| Probabilistic Sampling-Enhanced Temporal-Spatial GCN: A Scalable Framework for Transaction Anomaly Detection in Ethereum Networks Stefan Behfar, Richard Mortier, Jon Crowcroft (University of Cambridge) The rapid growth of the Ethereum network necessitates advanced anomaly detection techniques to enhance security, transparency, and resilience against evolving malicious activities. While there have been significant strides in anomaly detection, they often fall short in capturing the intricate spatial-temporal patterns inherent in blockchain transactional data. This study presents a scalable framework that integrates Graph Convolutional Networks (GCNs) with Temporal Random Walks (TRW) specifically designed to adapt to the complexities and temporal dynamics of the Ethereum transaction network. Unlike traditional methods that focus on detecting specific attack types, such as front-running or flash loan exploits, our approach targets time-sensitive anomalies more broadly—detecting irregularities such as rapid transaction bursts, anomalous token swaps, and sudden volume spikes. This broader focus reduces reliance on pre-defined attack categories, making the method more adaptable to emerging and evolving malicious strategies. To ground our contributions, we establish three theoretical results: (1) the effectiveness of TRW in enhancing GCN-based anomaly detection by capturing temporal dependencies, (2) the identification of weight cancellation conditions in the anomaly detection process, and (3) the scalability and efficiency improvements of GCNs achieved through probabilistic sampling. Empirical evaluations demonstrate that the TRW-GCN framework outperforms state-of-the-art Temporal Graph Attention Networks (TGAT) in detecting time-sensitive anomalies. Furthermore, as part of our ablation study, we evaluated various anomaly detection techniques on the TRW-GCN embeddings and found that our proposed scoring classifier consistently achieves higher accuracy and precision compared to baseline methods such as Isolation Forest, One-Class SVM, and DBSCAN, thereby validating the robustness and adaptability of our framework. | ||
| Robust Ultra Low-Bit Post‑Training Quantization via Stable Diagonal Curvature Estimate Jaemin Kim (Seoul National University); Sungkyun Kim, Junyeol Lee (Hanyang University); Jiwon Seo (Seoul National University) Large Language Models (LLMs) are widely used across many domains, but their scale makes deployment challenging. Post-Training Quantization (PTQ) reduces memory footprint without retraining by leveraging a small calibration set. Recent Hessian-based PTQ methods compensate quantization error via cross-channel dependencies, but such approaches degrade at low bit-widths due to noisy curvature estimates from limited calibration data. We propose DASH-Q, a robust PTQ framework using diagonal Hessian approximation and iterative weighted least squares. By discarding noise-prone dependencies, DASH-Q filters sampling noise while prioritizing the preservation of salient feature power. We outperform other PTQ baselines in ultra low-bit regime, improving zero-shot accuracy by 7.01% on average and up to 14.01% over the strongest baselines across five baseline LLM models, while showing robust and stable performance with very small calibration data. | ||
| A Case for a Simulation-Driven Exploration of Distributed GenAI Platforms Animesh Trivedi, Radu Stoica, Jeremy Cohn, Danny Harnik, Yue Zhu, Jonathan Terner (IBM Research); Guy Margalit (IBM Storage); Frank Schmuck, Vasily Tarasov, Swaminathan Sundararaman (IBM Research) The rapid adoption of Generative AI (GenAI) workloads has driven the emergence of inference serving platforms like llm-d and Nvidia Dynamo. However, exploring the design space for these platforms, especially for large-scale, multi-layer optimizations, remains prohibitively expensive and slow due to limited hardware access and high engineering overheads. Current evaluation methods often focus on isolated components, failing to capture the complex interplay between hardware components, scheduling policies, and dynamic GenAI workloads. We argue that the design space exploration of GenAI platforms can be accelerated by leveraging a simulation-based approach that offers a *fast*, *cheap*, and *scalable* methodology to rapidly prototype and validate new ideas. To this end, we present Opal, an open-source, discrete-event simulation framework. Unlike prior simulators, Opal models interactions across multiple layers of the inference stack from hardware to workloads, enabling a holistic analysis of system-level behaviors and trade-offs. Opal is designed to be simple, extensible, reproducible, and fast, allowing researchers to rapidly explore a wide range of deployment scenarios and optimization strategies. In this paper, we present our motivation and Opal's design, and seek feedback from the community on open challenges. | ||
| Reducing Language Model Inference Latency using CPU-Assisted Serving Theodoros Aslanidis (University College Dublin); Sokol Kosta (Aalborg University); Raffaele Montella (University of Naples Parthenope); Spyros Lalis (University of Thessaly); Dimitris Chatzopoulos (University College Dublin) The growing demand for language model (LM) inference is placing significant strain on datacenter resources, particularly GPUs, which are costly and often scarce. This leads service operators to face long request queues or to throttle users to cope with limited GPU availability. The conventional response is to scale out GPU-equipped servers, but this incurs substantial capital and operational expenses. In this work, we propose an alternative strategy that leverages idle CPU nodes, a resource commonly available in modern datacenter clusters. Our approach exploits GPU virtualization to forward GPU API calls from CPU-only nodes to remote GPUs, while performing CPU-intensive computations locally. For LMs where the primary bottleneck is CPU execution rather than GPU utilization, this mechanism allows idle CPUs to effectively augment serving capacity without requiring additional GPUs. Assuming high-speed interconnects typical of modern datacenters, the overhead of remote CPU-GPU communication is amortized, yielding improvements in job completion time and overall throughput. By converting idle CPUs into cost-free contributors to LM serving, our method reduces request queueing delays and provides a practical pathway to increase service efficiency without incurring additional GPU provisioning costs or sacrificing model accuracy, thereby saving on operational expenses. Extensive experimentation on a testbed with eight popular LMs and across five widely used datasets demonstrates that our ready-to-use open-source system can reduce LM inference-serving delays by up to 98%. | ||
| Towards Graph-Based Detection of Jailbreak and Prompt-Leakage Attacks in LLMs Javad Forough, Mohammad M Maheri, Hamed Haddadi (Imperial College London) Large Language Models (LLMs) are vulnerable to adversarial input prompts that compromise both safety and confidentiality, including policy-violation jailbreaks and prompt-leakage attacks that extract private system instructions. Although these attacks pursue different objectives, they operate entirely at the input level and often share structural manipulation patterns, motivating unified pre-inference defenses. We present GuardNet, a hierarchical graph-based framework that detects adversarial prompts before model execution by representing each input as a hybrid token graph combining sequential, syntactic, and attention-derived relations. Experiments on representative jailbreak and prompt-leakage benchmarks show that GuardNet consistently outperforms existing input-side defenses, demonstrating the effectiveness of structure-aware pre-inference protection for deployed LLM systems. | ||
| All is Not Lost: LLM Recovery without Checkpoints Nikolay Blagoev (Gensyn, University of Neuchâtel); Oğuzhan Ersoy (Gensyn); Lydia Yiyu Chen (University of Neuchâtel) Training LLMs on decentralized nodes or on-spot instances lowers the training cost and enables model democratization. The inevitable challenge here is the transient churn of nodes due to failures and the operator's scheduling policies, leading to losing parts of the model (some layers). The conventional approaches to recovering from failures are either to use checkpointing, where periodically a copy of the entire model is sent to additional storage, or redundant computation. These approaches yield significant communication and/or computation overhead even in non-failure cases and scale poorly in settings with large models. In this paper we propose CheckFree, an efficient recovery method where a failing stage is substituted by weighted averaging of the closest neighboring stages (see the generic sketch at the end of the posters list). In contrast to the state of the art, CheckFree requires no additional computation or storage. However, because of the nature of averaging neighbouring stages, it can only recover failures of intermediate stages. We further extend our method to CheckFree+ with out-of-order pipeline execution to tolerate crashes of the first and last stages. Thanks to out-of-order pipelining, the behaviour of the first and last stages is mimicked by their neighboring ones, which allows CheckFree+ to recover them by copying the neighboring stages. To recover the (de-)embedding layers, CheckFree+ copies those layers in the neighboring stages, which requires relatively small storage overhead. We extensively evaluate our method on LLaMa models with sizes from 124M to 1.5B and varying failure frequencies. In the case of low and medium failure rates (5-10%), CheckFree and CheckFree+ outperform both checkpointing and redundant computation in terms of convergence wall-clock time, achieving up to 12% improvement over redundant computation. Both of our proposals can be run via our code available at: https://github.com/gensyn-ai/CheckFree | ||
| Both Ends Count! Just How Good are LLM Agents at Text-to-"Big SQL"? Germán T. Eizaguirre (Universitat Rovira i Virgili); Lars Tissen (RWTH Aachen University); Marc Sánchez-Artigas (Universitat Rovira i Virgili) Text-to-SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly. In the real world, Text-to-SQL systems are often embedded within Big Data workflows, such as large-scale data processing or interactive data analytics. We refer to this as “Text-to-Big SQL”. However, existing text-to-SQL benchmarks remain narrowly scoped and overlook the cost and performance implications that arise at scale. For instance, translation errors that are minor on small datasets lead to substantial cost and latency overheads as data scales, a relevant issue completely ignored by text-to-SQL metrics. In this paper, we overcome this overlooked challenge by introducing novel and representative metrics for evaluating Text-to-Big SQL. Our study focuses on production-level LLM agents, a database-agnostic system adaptable to diverse user needs. Via an extensive evaluation of frontier models, we show that text-to-SQL metrics are insufficient for Big Data. In contrast, our proposed text-to-Big SQL metrics accurately reflect execution efficiency, cost, and the impact of data scale. These metrics enable new LLM-specific insights that are not revealed by traditional accuracy-based evaluations. For example, GPT-4o compensates for roughly 7% lower accuracy than the top-performing later-generation models with up to a 12.16× speedup, while GPT-5.2 is more than twice as cost-effective as Gemini 3 Pro at large input scales. | ||
Block-Aware Distributed Data Pipelines for Out-of-Core Tabular Machine Learning
Dimitris Metaxakis, Argiris Sofotasios (University of Patras, Athena Research Center); Chrysostomos Stylios (Athena Research Center, University of Ioannina); Panagiotis Hadjidoukas (University of Patras, Athena Research Center)
Modern Machine Learning (ML) pipelines are often bottlenecked by prolonged data stalls, particularly in out-of-core settings where data volumes exceed main memory capacity. In such cases, preprocessing and batch-wise data loading impose substantial I/O overhead, resulting in underutilization of accelerators and, consequently, limited scalability. In this work, we present a block-aware distributed data pipeline system for out-of-core tabular ML composed of two complementary components. The first extends the Mneme preprocessing framework to distributed environments through an MPI-based, multi-level parallel execution model. Python multithreading is also employed, as the preprocessing workload is dominated by disk I/O and native backend execution, both of which release the Global Interpreter Lock (GIL). The second is a cache-aware PyTorch DataLoader that reduces I/O operations by retrieving data batches from a dedicated application-level in-memory cache and applying preprocessing operators on the fly. Experimental results show that the proposed system enforces block-level locality, reducing memory I/O amplification from 44.9× the dataset size to 3.7×, and translates these reductions into up to 9.3× higher data-loading throughput and significant end-to-end training time savings on a large-scale tabular dataset stored in raw-text CSV format.
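A minimal sketch of the block-level caching idea described in the abstract above, assuming the dataset is a plain text file read in fixed-size row blocks; the class, parameters, and eviction policy are illustrative and are not part of the Mneme framework or the paper's DataLoader.

```python
from collections import OrderedDict
from itertools import islice

class BlockCache:
    """Tiny LRU cache over fixed-size row blocks of a text dataset
    (illustrative sketch of cache-aware batch loading, not the paper's code)."""

    def __init__(self, path: str, rows_per_block: int = 4096, capacity: int = 8):
        self.path, self.rows_per_block, self.capacity = path, rows_per_block, capacity
        self.cache: "OrderedDict[int, list]" = OrderedDict()

    def _load_block(self, block_id: int) -> list:
        start = block_id * self.rows_per_block
        with open(self.path) as f:              # one sequential read per block miss
            return list(islice(f, start, start + self.rows_per_block))

    def get_row(self, row_id: int) -> str:
        block_id = row_id // self.rows_per_block
        if block_id not in self.cache:          # miss: fetch the whole block once
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict the least recently used block
            self.cache[block_id] = self._load_block(block_id)
        self.cache.move_to_end(block_id)        # mark block as recently used
        return self.cache[block_id][row_id % self.rows_per_block]
```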
With a Hop, Skip, and a Prefill: How Benchmark Volatility Distorts the Accuracy of Long-Context Benchmarks and How To Combat It
Valentijn Dymphnus van de Beek (Delft University of Technology); Takeshi Yoshimura (IBM Research - Tokyo)
Contemporary large language models now support context windows of up to millions of tokens, and this capability enables higher accuracy, new tasks, and longer conversational history. Researchers rely on long-context inference benchmarks to evaluate specific model behaviours, but practitioners still find it difficult to translate benchmark results into AI system design decisions, such as model selection and configuration for target workloads. In this work, we analyse 16 long-context benchmarks to characterise their composition in terms of tasks, prompt token sizes, and variation between prompts. We find substantial differences in context-prompt length both across and within benchmarks: the coefficient of variation reaches 313% and the ratio between the 5th and 95th percentile prompt lengths reaches 65× within the same task. Our follow-up analysis shows that this volatility can distort benchmark accuracy, so results may reflect a model's ability to handle extreme prompt lengths rather than the underlying task itself. We further show that token-size-controlled variants of SCBench tasks reveal performance overestimates of up to 40% in the original benchmark, and that modest modifications using a knapsack-based document selection strategy can produce more representative and stable results.
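A hedged sketch of the document-selection idea mentioned at the end of the abstract above, using a simple greedy stand-in for a knapsack formulation over per-document token counts; the function name, data, and greedy rule are illustrative, not the paper's algorithm.

```python
def select_documents(docs: list, budget: int) -> list:
    """Greedy stand-in for knapsack-style selection: pick documents whose token
    counts fill a target prompt budget as tightly as possible without exceeding
    it (illustrative only)."""
    chosen, used = [], 0
    # Largest-first greedy keeps the total close to the budget.
    for text, tokens in sorted(docs, key=lambda d: d[1], reverse=True):
        if used + tokens <= budget:
            chosen.append(text)
            used += tokens
    return chosen

docs = [("doc_a", 1200), ("doc_b", 800), ("doc_c", 450), ("doc_d", 300)]
prompt_docs = select_documents(docs, budget=2000)   # -> doc_a and doc_b
```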
ClawMobile: Rethinking Smartphone-Native Agentic Systems
Hongchao Du, Shangyu Wu, Qiao Li (Mohamed bin Zayed University of Artificial Intelligence); Riwei Pan (Department of Computer Science, City University of Hong Kong); Jinheng Li (Independent Researcher); Youcheng Sun, Chun Jason Xue (Mohamed bin Zayed University of Artificial Intelligence)
Smartphones represent a uniquely challenging environment for agentic systems. Unlike cloud or desktop settings, mobile devices combine constrained execution contexts, fragmented control interfaces, and rapidly changing application states. As large language models (LLMs) evolve from conversational assistants to action-oriented agents, achieving reliable smartphone-native autonomy requires rethinking how reasoning and control are composed. We introduce ClawMobile as a concrete exploration of this design space. ClawMobile adopts a hierarchical architecture that separates high-level language reasoning from structured, deterministic control pathways, improving execution stability and reproducibility on real devices. Using ClawMobile as a case study, we distill the design principles for mobile LLM runtimes and identify key challenges in efficiency, adaptability, and stability. We argue that building robust smartphone-native agentic systems demands principled coordination between probabilistic planning and deterministic system interfaces. The implementation is open-sourced at https://github.com/ClawMobile/ClawMobile to facilitate future exploration.
DFS: Dynamic Flow Spraying with Bounded Reordering for AI Training Clusters
GuanJun Ma, Chenyue Zheng, Yuchao Zhang (Beijing University of Posts and Telecommunications)
Mixture-of-Experts (MoE) training in modern super-node AI clusters is increasingly limited by communication. At each training step, expert dispatch and combine traffic sits on the critical path, while the same network must also carry throughput-oriented transfers such as gradient synchronization, checkpointing, and storage traffic. The result is a basic trade-off: expert traffic benefits from aggressive multipath routing to reduce latency, whereas non-expert traffic needs limited packet reordering to preserve the efficiency and stability of commodity RDMA. Existing load-balancing schemes do not fit this setting well. Per-flow hashing can create short-lived hotspots that delay critical-path expert communication, and reactive transport mechanisms cannot respond quickly enough to bursty dynamics within a single RTT. Packet spraying improves path utilization, but under load it can overflow NIC out-of-order (OOO) buffers and disrupt transport. We present DFS (Dynamic Flow Spraying), a host-network co-design that uses the receiver's OOO tolerance as a control budget. Running on the sender SmartNIC, DFS combines INT-based path-state collection, inflight estimation, adaptive threshold control, and a dual-mode spraying policy to steer packets away from transient congestion while keeping packet reordering within acceptable bounds. This design retains the latency advantage of packet spraying for expert traffic without violating the ordering requirements of background flows that share the network, and it requires no changes to upper-layer MPI/NCCL applications. We implement DFS on a hybrid testbed built with physical NVIDIA BlueField-2 SmartNICs. Compared with ECMP, DFS reduces average and P99 flow completion times (FCTs) by 73% and 98.8%, respectively. Compared with PLB, the reductions are 57% and 97.8%. DFS remains within 10% of ideal packet spraying, maintains stable throughput, and prevents Priority Flow Control (PFC) storms without exceeding hardware memory limits. Overall, the results show that bounded reordering, when exposed and controlled at the SmartNIC, is an effective systems mechanism for improving critical-path communication in next-generation MoE training clusters.
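A toy sketch of the bounded-reordering idea described in the abstract above: traffic is diverted to a less-loaded path only while an out-of-order budget permits it. The path names, budget parameter, and decision rule are illustrative; this is not the SmartNIC implementation.

```python
def pick_path(paths: dict, current: str, ooo_budget: int, inflight_on_current: int) -> str:
    """Toy decision rule in the spirit of bounded-reordering spraying: spray to
    the least-loaded path only while the receiver's out-of-order budget permits
    it, otherwise stay on the current path (illustrative only)."""
    least_loaded = min(paths, key=paths.get)
    if least_loaded != current and inflight_on_current <= ooo_budget:
        return least_loaded    # spray mode: divert around transient congestion
    return current             # ordered mode: keep reordering within bounds

path_inflight = {"p0": 12, "p1": 3, "p2": 9}          # packets in flight per path
next_path = pick_path(path_inflight, current="p0", ooo_budget=16, inflight_on_current=12)
```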
Orbit: Efficient Agentic Inference using Priority Scheduling
Sami Abuzakuk, Anne-Marie Kermarrec, Palak, Rafael Pires, Rishi Sharma, Martijn de Vos (EPFL)
Large language models (LLMs) are increasingly deployed as autonomous agents that execute complex tasks through long sequences of reasoning steps and tool calls. Serving these agentic workloads at scale is a growing priority, yet existing LLM inference engines are mostly designed for single-pass, chat-style interactions. Agentic tasks, however, can fail, and under system congestion, policies such as first come first serve (FCFS) allocate GPU resources equally to all tasks regardless of their likelihood of success, letting failing tasks inflate queue waiting times for tasks that will complete correctly. We introduce Orbit, a trajectory-aware scheduler that addresses this by dynamically deprioritizing tasks predicted to fail, shifting resources toward likely-correct tasks. We present two scheduling algorithms. Orbit-step is a lightweight heuristic that assigns priority inversely proportional to the number of steps a task has taken, leveraging the empirical observation that longer-running tasks are more likely to fail. Orbit-judge uses an asynchronous LLM-as-a-judge that reads partial execution trajectories and predicts task correctness online, enabling finer-grained prioritization. We evaluate both variants on the GAIA benchmark using the Magentic-One agent framework and the GPT-OSS-120B model. Compared to vLLM's default FCFS policy, Orbit-step and Orbit-judge show 11.2% and 4.3% decreases in average end-to-end latency, respectively, for correct tasks. These results demonstrate that real-time trajectory observation is a practical and effective signal for resource allocation in agentic serving systems.
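A minimal sketch of the Orbit-step heuristic stated in the abstract above (priority inversely proportional to the number of steps already taken); the function name and queue structure are illustrative, not the paper's scheduler.

```python
def orbit_step_priority(steps_taken: int) -> float:
    """Orbit-step style heuristic: priority inversely proportional to the number
    of agentic steps already taken (sketch of the idea stated in the abstract)."""
    return 1.0 / (1 + steps_taken)

# Tasks with longer trajectories are scheduled behind shorter ones.
queue = sorted([("task_a", 2), ("task_b", 14), ("task_c", 0)],
               key=lambda task: orbit_step_priority(task[1]), reverse=True)
```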
Revisiting Disaggregated Large Language Model Serving for Performance and Energy Implications
Jiaxi Li (University of Illinois Urbana-Champaign); Yue Zhu (IBM Research); Bo Chen (University of Illinois Urbana-Champaign); Eun Kyung Lee (IBM Research); Klara Nahrstedt (University of Illinois Urbana-Champaign)
Different from traditional Large Language Model (LLM) serving that colocates the prefill and decode stages on the same GPU, disaggregated serving dedicates distinct GPUs to prefill and decode workloads. Once the prefill GPU completes its task, the KV cache must be transferred to the decode GPU. While existing works have proposed various KV cache transfer paths across different memory and storage tiers, there remains a lack of systematic benchmarking that compares their performance (e.g., latency and throughput) and energy efficiency. Meanwhile, although optimization techniques such as frequency scaling have been utilized for disaggregated serving, their performance and energy implications have not been rigorously characterized. In this paper, we address this gap through comprehensive benchmarking of disaggregated LLM serving. We evaluate multiple KV transfer paths under synthetic inference workloads, measuring latency, throughput, and energy consumption, and compare them against a colocated serving baseline. Through dynamic voltage and frequency scaling (DVFS), we further investigate the performance and energy consumption of disaggregated serving configurations under different GPU frequencies, and explore the benefits of independent frequency scaling of the prefill and decode GPUs. Our results show that performance gains from prefill–decode disaggregation are not guaranteed, but depend on multiple factors including workload characteristics, KV transfer medium, and LLM model size. We also observe diminishing performance returns from increasing KV transmission bandwidth beyond a certain threshold. Moreover, we demonstrate that stage-wise independent frequency scaling provides additional opportunities to improve energy efficiency while satisfying latency SLO constraints. Based on these insights, we propose a prototype framework for SLO-aware energy-efficient LLM serving and discuss its design challenges and evaluation methodology.
Harnessing Idle Compute at the Edge for Foundation Model Training
Leyang Xue (University of Edinburgh); Meghana Madhyastha (Johns Hopkins University); Myungjin Lee (Cisco Research); Amos Storkey (The University of Edinburgh); Randal Burns (Johns Hopkins University); Mahesh Marina (University of Edinburgh)
Foundation model training is increasingly centralized in large cloud data centers because it demands immense compute and memory resources. Training over decentralized edge devices could democratize this ecosystem by harnessing otherwise idle compute, but prior edge-training systems fall short: they scale poorly with model size and device count, exceed per-device memory budgets, incur prohibitive collective communication, and are fragile to heterogeneous and dynamic device availability. We present Cleave, a parameter-server-centric framework that makes tensor-parallel training practical at the edge. Cleave introduces selective hybrid tensor parallelism, which finely shards GEMM-dominated training operations into memory-feasible sub-tasks while avoiding peer-to-peer collectives that become bottlenecks on asymmetric edge links. A cost model guides device selection and shard placement to mitigate stragglers and rapidly adapt to churn. Across OPT and Llama2 models, Cleave matches GPU training efficiency while scaling to thousands of devices. It supports up to 8x more devices than prior edge-training approaches, reduces per-batch training time by up to 10x, and achieves 100x faster recovery from device failures.
Cost-Aware Model Orchestration for LLM-based Systems
Daria Smirnova, Hamid Nasiri, Marta Adamska, Zhengxin Yu, Peter Garraghan (Lancaster University)
As modern artificial intelligence (AI) systems become more advanced and capable, they can leverage a wide range of tools and models to perform complex tasks. The task of orchestrating these models is increasingly performed by Large Language Models (LLMs) that rely on qualitative descriptions of models for decision-making. However, the descriptions provided to existing LLM-based orchestrators frequently do not reflect true model capabilities and performance characteristics, leading to suboptimal model selection, reduced task accuracy, and increased cost. In this paper, we conduct an empirical analysis of LLM-based orchestration limitations and propose a cost-aware model selection method that accounts for performance–cost trade-offs by incorporating quantitative model performance characteristics within decision-making. Initial experimental results demonstrate that our proposed method increases accuracy by 0.90%–11.92% across various evaluated tasks, achieves up to a 54% energy efficiency improvement, and reduces orchestrator model selection latency from 4.51 s to 7.2 ms.
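A hedged sketch of what a cost-aware selection rule in the spirit of the abstract above could look like: each candidate model is scored by measured accuracy minus a weighted, normalised cost term. The fields, weighting, and model entries are hypothetical, not the paper's method.

```python
def select_model(candidates: list, cost_weight: float = 0.5) -> str:
    """Illustrative cost-aware selection: trade measured accuracy against a
    normalised per-call cost (fields and weighting are hypothetical)."""
    max_cost = max(c["cost_per_call"] for c in candidates)

    def score(candidate: dict) -> float:
        return candidate["accuracy"] - cost_weight * (candidate["cost_per_call"] / max_cost)

    return max(candidates, key=score)["name"]

models = [
    {"name": "small-llm", "accuracy": 0.81, "cost_per_call": 0.002},
    {"name": "large-llm", "accuracy": 0.88, "cost_per_call": 0.020},
]
print(select_model(models, cost_weight=0.5))   # picks the cheaper model here
```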
LLM-based AIOps via Log Prioritization in Air-Gapped Systems
SangHee Song, Hwajung Kim (Seoul National University of Science and Technology)
Large-scale computing systems generate massive volumes of operational logs across diverse components. Although LLMs offer new opportunities for automated AIOps analysis, directly applying them to raw log streams is impractical due to input length and resource constraints, especially in air-gapped and edge environments. In this paper, we propose an LLM-based AIOps pipeline that converts raw logs into structured events through rule-based transformation and temporal aggregation. The structured events are then prioritized to enable efficient LLM-driven diagnosis. We implement the pipeline in a real-world isolated edge environment and evaluate it using production-scale system logs. Experimental results show that our pipeline enables efficient LLM-guided analysis while preserving system-level diagnostic effectiveness. Our pipeline reduces log volume by 76% through the log transformation stage. After aggregation into structured events, the log prioritization stage further reduces the event set by 51% before LLM analysis, resulting in a 43% reduction in LLM token requirements compared to the non-prioritized case, significantly lowering resource consumption in constrained environments.
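A compact sketch of the transform-then-prioritise flow described in the abstract above: rule-based patterns map raw log lines to structured events, which are aggregated and ranked so that only the highest-priority events reach the LLM. The rules, severities, and thresholds below are invented for illustration.

```python
import re
from collections import Counter

RULES = [  # rule-based transformation: raw line -> (event type, severity)
    (re.compile(r"out of memory", re.I), ("oom", 3)),
    (re.compile(r"disk.*full", re.I), ("disk_full", 3)),
    (re.compile(r"timeout", re.I), ("timeout", 2)),
    (re.compile(r"retrying", re.I), ("retry", 1)),
]

def prioritise(raw_lines: list, top_k: int = 2) -> list:
    """Map raw lines to structured events, aggregate counts, and keep only the
    events with the highest severity-weighted counts (illustrative only)."""
    counts, severity = Counter(), {}
    for line in raw_lines:
        for pattern, (event, sev) in RULES:
            if pattern.search(line):
                counts[event] += 1
                severity[event] = sev
    ranked = sorted(counts, key=lambda e: severity[e] * counts[e], reverse=True)
    return [(event, counts[event]) for event in ranked[:top_k]]
```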
Balancing Compute in LLM Inference: Model Selection, Quantization, and Test-Time Scaling
Tomislav Šubić (University of Trieste); Stefano Cozzini (Institute for Research and Technologies, AREA Science Park)
Deploying large language models efficiently requires balancing accuracy, throughput, energy consumption, and memory usage. We present an empirical study of three complementary optimization dimensions for LLM inference: model size selection, post-training quantization, and test-time compute scaling. Using the Qwen3 model family (0.6B–32B parameters) on NVIDIA L40S GPUs, we evaluate six quantization methods across a combination of six benchmarks and measure energy consumption per output token. We then apply budget-forced reasoning to quantized and unquantized models on the GSM8K benchmark. Our results show that model size selection provides up to 8× variation in energy per token, while quantization method choice accounts for up to 1.7×. Combining quantization with test-time scaling, a 4B model with a 2000-token reasoning budget achieves 90% accuracy on GSM8K, outperforming a 14B model (82%) while using 71% less memory and 2.7× less energy. These results provide practical guidance for configuring LLM inference systems that balance model performance and resource efficiency.
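A minimal sketch of the adaptive-thinking idea mentioned in the abstract above: allocate more reasoning tokens to harder problems, capped at a fixed budget. The difficulty estimator, base budget, and cap are illustrative assumptions, not the paper's configuration.

```python
def reasoning_budget(difficulty: float, base: int = 250, cap: int = 2000) -> int:
    """Adaptive-thinking sketch: scale the reasoning-token budget with an
    estimated difficulty in [0, 1], up to a fixed cap (values illustrative)."""
    return min(cap, int(base + difficulty * (cap - base)))

for d in (0.1, 0.5, 0.9):
    print(d, reasoning_budget(d))   # easy prompts get short budgets, hard ones long
```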
Scalable Federated Learning for Scientific Foundation Models on Leadership-Class Systems
Olivera Kotevska, Trong Nguyen, Rafael Ferreira da Silva, Christian Engelmann (Oak Ridge National Laboratory); Prasanna Balaprakash (PrimaLabs)
Federated learning (FL) on leadership-class HPC systems remains largely unexplored, despite growing interest in deploying federated workflows on modern HPC systems. This paper provides the first system-level empirical characterization of federated fine-tuning of pretrained foundation models on an exascale supercomputer under a multi-node deployment. Using up to 96 concurrent FL clients deployed across Frontier nodes, we study the impact of client scale, model size, data heterogeneity, partial participation, and differential privacy on runtime, communication overhead, and convergence stability. Our results show that pretrained transformer models remain robust to heterogeneity, client dropout, and privacy noise, while system efficiency degrades rapidly with scale as synchronization and orchestration dominate runtime. We further demonstrate that system-aware execution strategies, including intra-node aggregation and early aggregation, significantly reduce wall-clock time without degrading model quality. These findings establish a practical performance baseline and inform the design of communication-efficient FL systems on leadership-class HPC platforms.
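A minimal sketch of the intra-node aggregation strategy mentioned in the abstract above: client updates are averaged within each node first, so only one aggregate per node crosses the interconnect. The data layout and equal weighting are simplifying assumptions, not the paper's framework.

```python
import numpy as np

def hierarchical_average(client_updates: dict) -> np.ndarray:
    """Average client updates within each node first, then average the
    per-node results globally (illustrative intra-node aggregation sketch)."""
    node_means = [np.mean(updates, axis=0) for updates in client_updates.values()]
    return np.mean(node_means, axis=0)

# Toy usage: node 0 hosts two FL clients, node 1 hosts one.
updates_by_node = {0: [np.ones(4), 3 * np.ones(4)], 1: [2 * np.ones(4)]}
global_update = hierarchical_average(updates_by_node)
```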
Before the First Token: Benchmarking Data Preprocessing in Vision-Language Models
Sepideh Zohdi, Lin Wang (Paderborn University)
As vision-language models (VLMs) become widely used for video understanding, the sheer volume of spatiotemporal data they process presents a critical computational challenge. Current efforts have been predominantly focused on accelerating token generation by large language models (LLMs), overlooking the preprocessing required for preparing the input data. In this paper, we systematically benchmark and analyze the performance of the end-to-end VLM pipeline. Through a stage-by-stage latency characterization across diverse real-world datasets, we reveal that data preprocessing, spanning both CPU-bound video decoding and GPU-bound vision encoding, is more often the critical performance bottleneck than the actual generation. Moreover, this data preprocessing overhead remains significant under varying input characteristics and hardware specifications. Our results underscore the urgent need for holistic, end-to-end performance optimizations for VLM pipelines.
OpenMCP: an open-source self-hosted benchmarking harness for MCP-enabled computer use agents
Agustin Leon, Anup Niroula, Fraida Fund (New York University)
The Model Context Protocol (MCP) standardizes how language models connect to external tools and applications, yet existing MCP benchmarks rely on externally hosted LLMs, constraining reproducibility and limiting observable metrics and privacy control. We present OpenMCP, an open-source, self-hosted benchmarking harness for MCP-enabled computer-use agents. Models are served locally via Ollama in Docker containers, giving researchers control over model versions, hardware, and serving configuration. This enables collection of metrics for task completion (success rate, key-step progress), efficiency (energy, GPU utilization, memory), and agent issues (hallucinations, tool-choice confidence, loops). OpenMCP follows a bring-your-own approach: researchers supply models, infrastructure, MCP servers, and tasks, while the harness handles scheduling, isolation, evaluation, and telemetry. We demonstrate OpenMCP on 19 models across 5 families and 6 infrastructure configurations, from datacenter GPUs to edge devices. Code and data are available at https://github.com/AguLeon/OpenMCP
The Cost of Expertise: Understanding MoE Decode Performance
Sami Abuzakuk (EPFL); Oana Balmau, Jiaxuan Chen (McGill University); Anne-Marie Kermarrec, Rafael Pires, Ramya Prabhu, Martijn de Vos (EPFL)
MoE architectures have emerged as a popular strategy for scaling LLMs, enabling substantial increases in parameter counts without proportional growth in per-token FLOPs. Existing analyses often model MoE inference as an attention backbone followed by a routing layer that dispatches tokens to multiple FFNs. MoE-based LLMs expand the design space across sparsity and scaling dimensions, such as top-k routing, expert intermediate size, and prompt-dependent expert activation, whose systems implications remain underexplored, especially for the decode phase, despite decode dominating interactive inference workloads. We systematically study how MoE design choices influence inference latency during the decode phase, which is the primary latency bottleneck in LLM workloads. We analyze latency behavior across varying batch sizes, expert intermediate sizes, and numbers of activated experts. Our results reveal that expert intermediate size, batching behavior, and expert activation patterns interact in non-trivial ways: latency scales near-linearly with expert width, batching gains plateau for wider experts due to early GPU saturation, and activating additional experts increases latency through memory pressure rather than compute overhead. These findings expose critical system-level trade-offs in MoE design that are invisible to FLOP-centric analyses.
AgenTEE: Confidential LLM Agent Execution on Edge Devices
Sina Abdollahi, Mohammad M Maheri, Javad Forough, Amir Al Sadi, Josh Millar (Imperial College London); David Kotz (Dartmouth College); Marios Kogias, Hamed Haddadi (Imperial College London)
Large Language Model (LLM) agents provide powerful automation capabilities, but they also create a substantially broader attack surface than traditional applications due to their tight integration with non-deterministic models and third-party services. While current deployments primarily rely on cloud-hosted services, emerging designs increasingly execute agents directly on edge devices to reduce latency and enhance user privacy. However, securely hosting such complex agent pipelines on edge devices remains challenging. These deployments must protect proprietary assets (e.g., system prompts and model weights) and sensitive runtime state on heterogeneous platforms that are vulnerable to software attacks and potentially controlled by malicious users. To address these challenges, we present AgenTEE, a system for deploying confidential agent pipelines on edge devices. AgenTEE places the agent runtime, inference engine, and third-party applications into independently attested confidential virtual machines (cVMs) and mediates their interaction through explicit, verifiable communication channels. Built on Arm Confidential Compute Architecture (CCA), a recent extension to Arm platforms, AgenTEE enforces strong system-level isolation of sensitive assets and runtime state. Our evaluation shows that such a multi-cVM system is practical, achieving near-native performance with less than 5.15% runtime overhead compared to commodity OS multi-process deployments.
SCALER: Sensitivity-Centric Adaptive Layer Execution & Runtime Mapping for Hybrid Analog-Digital Accelerator
Naman Kalra, Bonthu Purna Nand, Jaynarayan T Tudu (Indian Institute of Technology Tirupati)
Deploying DNN models on edge devices places stringent constraints on energy and resource efficiency while still requiring high model accuracy and low compute latency. Among many approaches, hardware-software co-design is well suited to balancing model accuracy, latency, and energy and area efficiency. However, co-designing a single compute accelerator alone is insufficient to balance these design parameters. In this work, we explore the comparatively new area of dual-engine accelerators. The basic concept of dual engines is well studied in many-core CPU architectures in the form of heterogeneous cores; in AI accelerators, however, the approach requires new innovation. We present SCALER, a dual on-chip accelerator that integrates a reconfigurable tile-centric digital engine with a bit-sliced in-SRAM vector compute engine. A bridge connects the two engines, enabling coordinate-preserving cross-domain execution. On this dual engine, accuracy-critical layers execute on the digital engine, while noise-tolerant layers leverage the analog in-SRAM engine to achieve substantial energy and latency savings. To exploit per-layer sparsity patterns, noise tolerance, and mixed compute utilization, we use a Hardware Utility-Based Supervision (HUBS) framework to identify an optimal configuration. This framework employs a lightweight supervised multi-label mapper to predict per-layer compression (quantization and pruning) and engine choice on SCALER hardware. Experimental results on ResNet-50, VGG-16, and AlexNet demonstrate significant improvements, with the best results achieved on AlexNet: up to 65.84% energy savings and 47.73% latency reduction, while keeping accuracy degradation below 1% compared to an NVIDIA A100 GPU baseline.
Dynamically Adaptable Ensemble Proxies for Training-Free Neural Architecture Search
Christian Heidorn, Tina Herderich, Jürgen Teich (Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU))
In neural architecture search (NAS), evaluating a candidate network's accuracy is considered the most resource- and time-consuming step, as this usually requires full training and validation, which may take hours or even days. Recently, training-free, also called zero-shot, NAS has attracted increasing attention. In zero-shot NAS, a so-called zero-shot proxy (ZSP) is used to estimate the network's accuracy without training, using, e.g., architectural or gradient-based features. However, the predictive performance and computational cost of proposed ZSPs can vary significantly across different datasets and search spaces. To address this issue, we introduce the notion of dynamically adaptable ensemble proxies (DAEP), a new class of zero-shot proxies, defined as a weighted sum of existing proxies, and customizable to a specific NAS design space by dynamic adaptation. Here, a weighting strategy is proposed that, in the first phase of NAS, adapts the weights of accuracy-estimating ZSPs dynamically, by amplifying or reducing each weight according to the correlation of its zero-shot proxy estimates with the accuracy obtained by training each sampled model. This process is repeated iteratively until no improvement in the mean accuracy estimation can be observed over a given number of evaluated models. For the proposed class of ensemble proxies, we demonstrate a) that DAEP can deliver highly correlated accuracy estimators and b) that it does so already after very few model evaluations. Moreover, we show c) that different NAS problems require different weight vectors to achieve a high correlation with network accuracy. Finally, it is illustrated for multiple NAS search spaces (i.e., NATS-Bench and NDS) d) that DAEPs can drastically reduce the total NAS exploration time while accounting for the quality of accuracy estimation.
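A toy sketch of an ensemble proxy in the spirit of the abstract above: individual zero-shot proxy scores are combined by a weighted sum, and each weight is amplified or damped according to how well that proxy correlates with the accuracies of the models trained so far. The update rule, learning rate, and data are illustrative, not the paper's exact weighting strategy.

```python
import numpy as np

def update_weights(weights: np.ndarray, proxy_scores: np.ndarray,
                   accuracies: np.ndarray, lr: float = 0.5) -> np.ndarray:
    """Amplify proxies whose scores correlate positively with observed
    accuracies of sampled models, and damp the rest (illustrative only)."""
    for i in range(len(weights)):
        corr = np.corrcoef(proxy_scores[i], accuracies)[0, 1]
        weights[i] *= (1.0 + lr * corr)
    return weights / weights.sum()

def ensemble_score(weights: np.ndarray, proxy_scores: np.ndarray) -> np.ndarray:
    """Ensemble-proxy estimate: weighted sum of individual proxy scores."""
    return weights @ proxy_scores

# proxy_scores[i][j]: score of proxy i on sampled architecture j.
proxy_scores = np.array([[0.2, 0.5, 0.9], [0.8, 0.4, 0.1]])
accuracies = np.array([0.61, 0.70, 0.83])
weights = update_weights(np.array([0.5, 0.5]), proxy_scores, accuracies)
estimates = ensemble_score(weights, proxy_scores)
```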
RIVA: Leveraging LLM Agents for Reliable Configuration Drift Detection
Sami Abuzakuk (EPFL); Lucas Crijns (armasuisse); Anne-Marie Kermarrec, Rafael Pires, Martijn de Vos (EPFL)
Infrastructure as code (IaC) tools automate cloud provisioning, but verifying that deployed systems remain consistent with the IaC specifications remains challenging. Such configuration drift occurs because of bugs in the IaC specification, manual changes, or system updates. Large language model-based agentic AI systems can automate the analysis of large volumes of telemetry data, making them suitable for the detection of configuration drift. However, existing agentic systems implicitly assume that the tools they invoke always return correct outputs, making them vulnerable to erroneous tool responses. Since agents cannot distinguish whether an anomalous tool output reflects a real infrastructure problem or a broken tool, such errors may cause missed drift or false alarms, reducing reliability precisely when it is most needed. We introduce RIVA (Robust Infrastructure by Verification Agents), a novel multi-agent system that performs robust IaC verification even when tools produce incorrect or misleading outputs. RIVA employs two specialized agents, a verifier agent and a tool generation agent, that collaborate through iterative cross-validation, multi-perspective verification, and tool call history tracking. Evaluation on the AIOpsLab benchmark demonstrates that RIVA, in the presence of erroneous tool responses, recovers task accuracy from 27.3% when using a baseline ReAct agent to 50.0% on average. RIVA also improves task accuracy from 28% to 43.8% without erroneous tool responses. Our results show that cross-validation of diverse tool calls enables more reliable autonomous infrastructure verification in production cloud environments.
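A toy sketch of the cross-validation idea highlighted in the abstract above: the same property of a deployed resource is queried through two independent tools, and only agreeing answers are treated as trustworthy. The function, tool callables, and return format are hypothetical, not RIVA's interfaces.

```python
def cross_validated_check(primary_tool, secondary_tool, resource: str) -> dict:
    """Query the same property of a cloud resource via two independent tools;
    disagreement is surfaced for further verification instead of being
    reported as drift (illustrative sketch only)."""
    a, b = primary_tool(resource), secondary_tool(resource)
    if a == b:
        return {"resource": resource, "observed": a, "trusted": True}
    return {"resource": resource, "observed": (a, b), "trusted": False}

# Toy tools standing in for real telemetry queries.
report = cross_validated_check(lambda r: "t3.large", lambda r: "t3.large", "vm-42")
```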
Submission
Papers must be submitted electronically as PDF files, formatted for 8.5x11-inch paper. Submissions may be up to 6 pages long, including figures and tables, in 10-point font and a two-column format. Bibliographic references are not included in the 6-page limit. Submitted papers must use the official SIGPLAN LaTeX / MS Word templates.
Submissions will be single-blind.
You may include an appendix (there is no page limit). However, it is optional, and we cannot guarantee that the reviewers will read it.
The appendix should be included in the same PDF as the main paper, starting on a new page immediately after the bibliography.
Submit your paper at: https://euromlsys26.hotcrp.com/paper/new
Keynote
14:00 Laurent Bindschaedler, Research Group Leader at the Max Planck Institute for Software Systems
What Survives When Code Doesn't?
Large language models have significantly reduced the cost of code generation. An increasing share of code is now produced with AI assistance, and developers increasingly treat implementations as disposable rather than precious. Code is moving away from its traditional role as the primary durable artifact in software development. Yet code was never purely about implementation. The maintained codebase has historically served as the concrete foundation for four essential guarantees: the system’s intended behavior (intent), what it carries forward across executions (state), how behavior is organized at runtime (composition), and what it may change in the external environment (effect). A skilled developer using AI still upholds these guarantees through review and expertise. But as human oversight diminishes, whether through autonomous agents, no-code tools, or contexts where no expert review layer exists, the guarantees become fragmented and difficult to enforce. Current agent frameworks seldom provide unified contracts for them. This talk argues that these four guarantees must be made explicit as enforceable contracts, forming the outline of a computational model for agentic software. The ML-systems community, with its roots in operating systems, databases, and distributed computing, is well positioned to lead this effort, and relevant components are already emerging across the field. Many of the pieces exist. The blueprint does not.
Bio: Laurent Bindschaedler is a Research Group Leader at the Max Planck Institute for Software Systems, where he leads the Data Systems Group. His research sits at the intersection of operating systems, databases, and machine learning, with recent work on abstractions for long-horizon LLM agents, transactional semantics for agent tool use, and benchmarks for agentic workflows. He holds a PhD from EPFL and was a postdoctoral fellow at MIT CSAIL. His work has been published at SOSP, ASPLOS, EuroSys, EMNLP, and NDSS.
Sponsors
Committees
Workshop and TPC Chairs
- Eiko Yoneki, University of Cambridge, https://www.cl.cam.ac.uk/~ey204/
- Paul Patras, University of Edinburgh, http://homepages.inf.ed.ac.uk/ppatras/
Technical Program Committee
- Aaron Zhao, Imperial College London
- Abhishek Dharmaratnakar, Google
- Ahmed Sayed, Queen Mary University of London
- Alec Diallo, University of Edinburgh
- Alexandros Koliousis, Northeastern University London and Institute for Experiential AI
- Amir Payberah, KTH
- Amitabha Roy, Google
- Andy Twigg, Google
- Bo Zhao, Aalto University
- Chi Zhang, Meta
- Christos Bouganis, Imperial College London
- Chunwei Xia, University of Leeds
- Daniel Goodman, Oracle
- Daniel Mendoza, Stanford University
- Dawei Li, Amazon
- Debanshu Das, Google
- Deepak George Thomas, Tulane University
- Dimitris Chatzopoulos, University College Dublin
- Fiodar Kazhamiaka, Microsoft
- Guilherme H. Apostolo, Vrije Universiteit Amsterdam
- Jiayi Nie, University of Cambridge
- Jiwon Seo, Seoul National University
- Joana Tirana, University College Dublin
- Jon Crowcroft, University of Cambridge
- Jose Cano Reyes, University of Glasgow
- Laurent Bindschaedler, MPI
- Luo Mai, University of Edinburgh
- Mengying Zhou, Shanghai University of Finance and Economics
- Nikolas Ioannou, Google
- Pedro Gimenes, Imperial College London
- Pedro Silvestre, Imperial College London
- Peter Pietzuch, Imperial College London
- Peter Triantafillou, University of Warwick
- Pinar Tözün, IT University of Copenhagen
- Pouya Hamadanian, MIT
- Sam Ainsworth, University of Edinburgh
- Sami Alabed, DeepMind
- Smit Hinsu, Google
- Srivaths Ranganathan, Google
- Swapnil Gandhi, Stanford University
- Taiyi Wang, University of Cambridge
- Thaleia Dimitra Doudali, IMDEA
- Tobias Grosser, University of Cambridge
- Valentin Radu, University of Sheffield
- Veljko Pejovic, University of Ljubljana
- Xupeng Miao, Peking University
- Youhe Jiang, University of Cambridge
- Zheng Wang, University of Leeds
- Zhihao Jia, CMU
- Zhiqiang Xie, Stanford University
Web Chair
- Alexis Duque, Net AI
Contact
For any questions related to EuroMLSys 2026, please contact the TPC Chairs Eiko Yoneki and Paul Patras.
Follow us on Twitter: @euromlsys