Procedural Mistake Detection via
Action Effect Modeling

Michigan State University
Accepted at ICLR 2026

Abstract

Mistake detection in procedural tasks is essential for building intelligent systems that support learning and task execution. Existing approaches primarily analyze how an action is performed, while overlooking what it produces, i.e., the Action Effect. Yet many errors manifest not in the execution itself but in the resulting outcome, such as an unintended object state or incorrect spatial arrangement.

To address this gap, we propose Action Effect Modeling (AEM), a unified framework that jointly captures action execution and its outcomes through a probabilistic formulation. AEM first identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs, aligning them in a shared latent space to form robust effect-aware representations. To detect mistakes, we further design a prompt-based detector that incorporates task-specific prompts and aligns each action segment with its intended execution semantics.
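As a rough illustration of the detection step, the sketch below scores an action segment by comparing its effect-aware feature with a task-specific prompt embedding; the feature dimension, the cosine-similarity scoring, and the threshold are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def mistake_score(segment_feat: torch.Tensor,
                  prompt_feat: torch.Tensor,
                  threshold: float = 0.5):
    """Score how far a segment deviates from its intended execution semantics.

    segment_feat: (D,) effect-aware feature of one action segment.
    prompt_feat:  (D,) embedding of the task-specific prompt describing the
                  intended (correct) execution of that action.
    Returns (score, is_mistake): score in [0, 1]; higher means more likely a mistake.
    """
    sim = F.cosine_similarity(segment_feat, prompt_feat, dim=-1)  # in [-1, 1]
    score = (1.0 - sim) / 2.0                                     # map to [0, 1]
    return score.item(), score.item() > threshold

# Toy usage with random features (D = 512 is an arbitrary choice here).
seg = torch.randn(512)
prm = torch.randn(512)
print(mistake_score(seg, prm))
```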

Our approach achieves state-of-the-art performance on the EgoPER and CaptainCook4D benchmarks under the challenging one-class classification (OCC) setting. These results demonstrate that modeling both execution and outcome yields more reliable mistake detection, and highlight the potential of effect-aware representations to benefit a broader range of downstream applications.

Motivation

Motivation figure
Existing mistake detection approaches primarily model the execution process and assess correctness by analyzing motion patterns and action sequences. A common limitation, however, is the implicit assumption that mistakes can be identified solely from the execution trajectory, without verifying whether the final outcome matches the intended goal. In real-world scenarios, execution may appear correct while subtle deviations still lead to flawed outcomes. This highlights a critical gap: mistake detection should consider not only how an action is performed, but also whether its resulting effect aligns with the desired outcome.

Framework Overview

Framework overview.
We adopt an action segmentation backbone (ActionFormer) to extract frame-level features and non-overlapping action segments. Frame features are aggregated along the temporal dimension according to segment boundaries to form segment representations. These representations are then fed into the Action Effect Modeling module to obtain enriched effect-aware features, which are subsequently passed to a prompt-based mistake detection module.
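A minimal sketch of the segment aggregation step described above, assuming mean pooling over frame features within each segment; the pooling choice and tensor shapes are assumptions for illustration, not necessarily the exact operation used in the paper.

```python
import torch

def aggregate_segments(frame_feats: torch.Tensor, boundaries):
    """Pool frame-level features into segment-level representations.

    frame_feats: (T, D) frame features from the segmentation backbone
                 (e.g., ActionFormer-style features).
    boundaries:  list of (start, end) frame indices of non-overlapping
                 action segments, with end exclusive.
    Returns:     (S, D) tensor of segment representations.
    """
    segs = [frame_feats[s:e].mean(dim=0) for (s, e) in boundaries]
    return torch.stack(segs, dim=0)

# Toy usage: 100 frames, 256-dim features, three segments.
feats = torch.randn(100, 256)
segments = aggregate_segments(feats, [(0, 30), (30, 70), (70, 100)])
print(segments.shape)  # torch.Size([3, 256])
```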

Action Effect Modeling

Effect modeling overview.
AEM integrates action effects by extracting multimodal features from selected effect frames and using them as external supervision. These features guide a learnable effect token, which distills outcome-related semantics into the action representation.
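The sketch below shows one plausible way a learnable effect token could distill outcome semantics from multimodal effect features via cross-attention and fuse them into the segment feature; the dimensions, single-layer design, and alignment loss are assumptions, not the authors' exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EffectToken(nn.Module):
    """Learnable token that attends to multimodal effect features
    (e.g., visual-grounding and scene-graph embeddings of the effect frame)
    and injects the distilled outcome semantics into the segment feature."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, segment_feat, effect_feats):
        # segment_feat: (B, D); effect_feats: (B, N, D) effect cues per segment.
        q = self.token.expand(segment_feat.size(0), -1, -1)       # (B, 1, D)
        distilled, _ = self.attn(q, effect_feats, effect_feats)   # (B, 1, D)
        distilled = distilled.squeeze(1)
        # Fuse the distilled effect semantics with the execution feature.
        fused = self.proj(torch.cat([segment_feat, distilled], dim=-1))
        # External supervision: pull the distilled token toward the observed
        # effect features (one simple choice of alignment loss).
        align_loss = 1.0 - F.cosine_similarity(
            distilled, effect_feats.mean(dim=1), dim=-1).mean()
        return fused, align_loss

# Toy usage.
m = EffectToken()
out, loss = m(torch.randn(2, 256), torch.randn(2, 5, 256))
print(out.shape, loss.item())
```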

Experimental Results

Experiment overview.
We evaluate our method on the egocentric video datasets EgoPER and CaptainCook4D. Experimental results on both datasets demonstrate that our approach achieves state-of-the-art performance.

Qualitative Results

Effect-frame Sampling

Frame sampling visualization.
Visualization of effect frame sampling results. The selected frames show the most informative moments that capture the action effects.
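As a rough sketch of how an effect frame might be selected by combining semantic relevance with visual quality, the snippet below scores candidate frames with a CLIP-style image-text similarity and a Laplacian-variance sharpness measure; the specific models, the weighting, and the use of OpenCV are assumptions for illustration, not the paper's exact sampling procedure.

```python
import cv2
import numpy as np
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def sharpness(frame_bgr: np.ndarray) -> float:
    """Visual quality proxy: variance of the Laplacian (higher = sharper)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

@torch.no_grad()
def select_effect_frame(frames_bgr, action_text: str, alpha: float = 0.5):
    """Pick the candidate frame that best shows the action's effect.

    frames_bgr:  list of candidate frames (H, W, 3) near the end of a segment.
    action_text: textual description of the intended effect, e.g.
                 "the onion is diced on the cutting board".
    alpha:       trade-off between semantic relevance and visual quality.
    """
    text = clip.tokenize([action_text]).to(device)
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    scores = []
    for f in frames_bgr:
        img = preprocess(Image.fromarray(cv2.cvtColor(f, cv2.COLOR_BGR2RGB)))
        img_feat = model.encode_image(img.unsqueeze(0).to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        relevance = (img_feat @ text_feat.T).item()  # semantic relevance
        # The 1e-3 scaling only brings sharpness to a comparable range here;
        # in practice both scores would need proper normalization.
        scores.append(alpha * relevance + (1 - alpha) * sharpness(f) * 1e-3)
    return int(np.argmax(scores))
```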

Scene Graph Analysis

Scene graph visualization.
Scene graphs generated by GPT-4o, showing the object relationships and spatial arrangements extracted from the effect frames.
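To give a concrete picture of the symbolic scene-graph cue, the sketch below parses a GPT-4o-style JSON response into (subject, relation, object) triples and simple text phrases that can then be embedded; the prompt schema and field names are assumptions, not the exact format used in our pipeline.

```python
import json

# Example of the kind of JSON one might ask GPT-4o to return for an effect
# frame (the schema here is an illustrative assumption).
response_text = """
{
  "objects": ["knife", "onion", "cutting board"],
  "relations": [
    {"subject": "onion", "relation": "diced on", "object": "cutting board"},
    {"subject": "knife", "relation": "next to", "object": "cutting board"}
  ]
}
"""

def parse_scene_graph(text: str):
    """Convert a scene-graph JSON string into triples and text phrases."""
    graph = json.loads(text)
    triples = [(r["subject"], r["relation"], r["object"])
               for r in graph.get("relations", [])]
    phrases = [f"{s} {rel} {o}" for s, rel, o in triples]
    return triples, phrases

triples, phrases = parse_scene_graph(response_text)
print(triples)   # [('onion', 'diced on', 'cutting board'), ('knife', 'next to', 'cutting board')]
print(phrases)   # ['onion diced on cutting board', 'knife next to cutting board']
```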

Mistake Probability

Mistake detection visualization.
Examples of mistakes occurring in different actions. The bar charts on the right show mistake probabilities predicted by the models without (blue) and with (orange) effect modeling. Red boxes in the images are used only to highlight mistake regions for clearer visualization.