Procedural Mistake Detection via Action Effect Modeling

Michigan State University

Abstract

Mistake detection in procedural tasks is essential for developing intelligent assistive agents that enhance learning and task execution. Existing methods predominantly focus on analyzing how an action is performed, overlooking what it produces, i.e., the Action Effect. However, execution mistakes often manifest not in the action itself but in its outcome, such as an unintended object state or incorrect spatial arrangement.

To this end, we introduce Action Effect Modeling (AEM), a unified framework that jointly captures action execution and its effects from a causal perspective. AEM first identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs, aligning them in a shared latent space to form robust effect-aware representations. To detect mistakes, we further design a prompt-based detector that incorporates task-specific prompts and aligns each action segment with its intended execution semantics.

Our approach outperforms prior work on the EgoPER and CaptainCook4D benchmarks under the challenging One-Class Classification (OCC) setting. These results highlight the importance of jointly modeling execution and effect for accurate mistake detection in real-world procedural tasks.

Motivation

Motivation figure
Existing mistake detection approaches primarily focus on modeling the execution process, assessing correctness by analyzing motion patterns and action sequences. A shared limitation of these methods is the assumption that mistakes can be identified solely from the execution process, without verifying whether the final outcome aligns with the intended result. In real-world scenarios, the execution may appear correct, yet minor deviations can still lead to significantly flawed outcomes. This underscores a key requirement: mistake detection should account not only for the execution process but also for the resulting action effect.

Framework

Framework overview.
We adopt an action segmentation backbone (ActionFormer) to extract non-overlapping action segments from the input video frames. The resulting segment features are then fed into the Action Effect Modeling (AEM) module. The effect-aware segment features are passed into the mistake detection module, which models the mistake probability for each segment using a prompt-based detector.
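To make the detection stage concrete, below is a minimal sketch of a prompt-based per-segment detector. It assumes segment features have already been produced by an ActionFormer-style segmenter and enriched by the AEM module; the module name, feature dimensions, and the simple elementwise fusion are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PromptMistakeDetector(nn.Module):
    """Sketch of a prompt-based per-segment mistake detector (illustrative, not the paper's code)."""

    def __init__(self, feat_dim: int = 1024, num_tasks: int = 5):
        super().__init__()
        # One learnable prompt per task, encoding the intended execution semantics.
        self.task_prompts = nn.Embedding(num_tasks, feat_dim)
        self.score_head = nn.Linear(feat_dim, 1)

    def forward(self, effect_feats: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # effect_feats: (S, D) effect-aware features, one row per action segment
        prompt = self.task_prompts(task_id)             # (D,) prompt for the current task
        # Align each segment with the task prompt; segments that deviate from the
        # intended semantics should receive higher mistake probabilities.
        fused = effect_feats * prompt.unsqueeze(0)      # (S, D) simple elementwise fusion
        return torch.sigmoid(self.score_head(fused)).squeeze(-1)  # (S,) mistake probabilities


# Usage with dummy inputs: 8 segments from one video of task 2
detector = PromptMistakeDetector()
seg_feats = torch.randn(8, 1024)
probs = detector(seg_feats, torch.tensor(2))
print(probs.shape)  # torch.Size([8])
```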

Effect Modeling

Effect modeling overview.
In AEM, we first perform effect frame sampling to identify the frame most indicative of the action outcome. Subsequently, we extract multimodal effect knowledge, including object states and spatial relationships, from both visual grounding and symbolic scene graphs. During training, these cues serve as supervision signals to guide effect-aware learning.
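As a rough illustration of effect frame sampling, the sketch below scores each candidate frame by combining semantic relevance to an outcome description (e.g., cosine similarity between CLIP-style frame and text embeddings) with a visual-quality score such as sharpness. The weighting scheme and the particular quality measure are assumptions for exposition, not the paper's exact criteria.

```python
import torch
import torch.nn.functional as F

def select_effect_frame(frame_embs: torch.Tensor,
                        effect_text_emb: torch.Tensor,
                        quality: torch.Tensor,
                        alpha: float = 0.7) -> int:
    """Pick the frame most indicative of the action's outcome (illustrative sketch).

    frame_embs:      (T, D) visual embeddings of candidate frames (e.g., CLIP image features)
    effect_text_emb: (D,)   embedding of an outcome description such as "the onion is diced"
    quality:         (T,)   visual-quality scores (e.g., sharpness), higher is better
    alpha:           assumed trade-off between semantic relevance and visual quality
    """
    # Semantic relevance: cosine similarity between each frame and the effect description.
    relevance = F.cosine_similarity(frame_embs, effect_text_emb.unsqueeze(0), dim=-1)  # (T,)
    # Normalize quality to [0, 1] so the two terms are on a comparable scale.
    q = (quality - quality.min()) / (quality.max() - quality.min() + 1e-6)
    score = alpha * relevance + (1.0 - alpha) * q
    return int(torch.argmax(score).item())


# Toy usage: 30 candidate frames near the end of a segment, 512-d embeddings
frames = torch.randn(30, 512)
text = torch.randn(512)
sharpness = torch.rand(30)
print(select_effect_frame(frames, text, sharpness))
```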

Experiments

Experiment overview.
We evaluate our method on two egocentric video datasets: EgoPER and CaptainCook4D. As shown in Table 1, our method significantly outperforms AMNAR in AUC across all five tasks on the EgoPER dataset, with an average improvement of 3.3%. Similarly, the results in Table 2 further demonstrate its effectiveness on the CaptainCook4D dataset, surpassing AMNAR by 9.7% in Precision and 1.5% in AUC.

Qualitative Results

Effect Frame Sampling

Frame sampling visualization.
Visualization of effect frame sampling results. The selected frames show the most informative moments that capture the action effects.

Scene Graph Analysis

Scene graph visualization.
Scene graph generated by GPT-4o, showing object relationships and spatial arrangements extracted from the effect frames.
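For readers who want to reproduce this step, the sketch below queries GPT-4o for a symbolic scene graph of a single effect frame using the standard OpenAI Python client. The prompt wording, the JSON schema, and the file path are illustrative guesses, not the exact prompt used in the paper.

```python
import base64
from openai import OpenAI  # standard OpenAI Python client

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def scene_graph_from_effect_frame(image_path: str) -> str:
    """Ask GPT-4o for object states and spatial relations in an effect frame (illustrative)."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("List the objects in this frame, their states "
                          "(e.g., chopped, open, empty), and their spatial relationships as JSON: "
                          "{\"objects\": [...], \"relations\": [[subject, relation, object], ...]}")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


# Hypothetical usage:
# print(scene_graph_from_effect_frame("effect_frames/dice_onion_0231.jpg"))
```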

Mistake Detection Results

Mistake detection visualization.
Visualizations of action segments (left) alongside the predicted correctness probabilities (right). Blue bars represent predictions from the model without action-effect modeling, while orange bars show predictions with it. The results highlight two key strengths of our approach. First, the model successfully detects mistakes that manifest in the final outcome by leveraging action-effect modeling. Second, it can also identify execution errors that occur during the action, even when the outcome appears visually correct. This demonstrates the complementary nature of effect-aware representation and temporal execution modeling in capturing a broader range of procedural mistakes.