Procedural Mistake Detection via Action Effect Modeling

Michigan State University

Abstract

Mistake detection in procedural tasks is essential for developing intelligent assistive agents that enhance learning and task execution. Existing methods predominantly focus on analyzing how an action is performed, overlooking what it produces, i.e., the Action Effect. However, execution mistakes often manifest not in the action itself but in its outcome, such as an unintended object state or incorrect spatial arrangement.

To bridge this gap, we introduce Action Effect Modeling, a novel framework that detects mistakes by evaluating deviations in action outcomes. Our method captures fine-grained object states and spatial relationships with an egocentric scene graph, enabling a more comprehensive understanding of procedural correctness. By explicitly modeling expected action effects, the framework detects subtle execution errors that traditional action-centric approaches miss.
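To make this concrete, below is a minimal Python sketch of such an egocentric scene-graph representation; the field names, object states, and relation triples are illustrative assumptions, not the exact format the pipeline produces.

from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    # Per-object fine-grained states, e.g., "cucumber" -> "sliced".
    object_states: dict = field(default_factory=dict)
    # Spatial relations as (subject, relation, object) triples.
    relations: list = field(default_factory=list)

# Expected effect of a correct "slice cucumber" step ...
expected = SceneGraph(
    object_states={"cucumber": "sliced", "knife": "in-hand"},
    relations=[("cucumber", "on", "cutting_board")],
)
# ... versus an observed outcome with a spatial-arrangement mistake.
observed = SceneGraph(
    object_states={"cucumber": "sliced", "knife": "in-hand"},
    relations=[("cucumber", "on", "table")],
)

# A mismatch between expected and observed graphs flags a potential mistake.
print(set(expected.relations) ^ set(observed.relations))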

We validate our framework on the challenging EgoPER dataset in a One-Class Classification (OCC) setting, demonstrating its effectiveness in identifying mistakes beyond conventional action-centric methods. Our findings highlight the significance of action effect reasoning in mistake detection and open new avenues for enhancing assistive intelligence in procedural activities.

Motivation

Motivation figure
Existing mistake detection approaches share a common limitation: they assume mistakes can be detected purely by scrutinizing the execution, and fail to consider whether the final outcome aligns with expectations. In many real-world cases, however, execution may appear correct while small deviations lead to noticeable differences in the result. An improper stirring position may seem fine in motion, but the spilled mixture on the table tells a different story. Similarly, a minor variation in slicing technique can produce irregular cucumber pieces. This highlights a critical gap: mistake detection should assess not only the execution process but also its effects.

Framework

Framework overview.
Overview of the framework and action effect modeling. Frame-wise features are aggregated into action-segment features for effect modeling. The scene graph, generated by a Vision-Language Model (VLM), is decomposed by our proposed method to extract semantics for learning action effects. Mistake detection is performed by a detector that learns correct action patterns during training.
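As a concrete illustration, the following is a minimal PyTorch sketch of one plausible realization of this pipeline; the module structure, feature dimensions, and the form of the decomposed scene-graph semantics are assumptions for exposition, not the exact implementation.

import torch
import torch.nn as nn

class ActionEffectDetector(nn.Module):
    def __init__(self, frame_dim=768, graph_dim=256, hidden_dim=512):
        super().__init__()
        # Aggregates frame-wise features into one action-segment feature.
        self.segment_pool = nn.AdaptiveAvgPool1d(1)
        # Encodes semantics decomposed from the VLM-generated scene graph
        # (e.g., object-state and spatial-relation embeddings).
        self.graph_encoder = nn.Sequential(
            nn.Linear(graph_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Effect head: predicts the expected effect representation; a large
        # deviation at test time signals a mistake.
        self.effect_head = nn.Sequential(
            nn.Linear(frame_dim + hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, frame_feats, graph_feats, target_effect):
        # frame_feats: (B, T, frame_dim) frame-wise features of one segment.
        # graph_feats: (B, graph_dim) decomposed scene-graph semantics.
        # target_effect: (B, hidden_dim) expected effect representation,
        # e.g., encoded from effect frames of correct training executions.
        seg = self.segment_pool(frame_feats.transpose(1, 2)).squeeze(-1)
        g = self.graph_encoder(graph_feats)
        pred = self.effect_head(torch.cat([seg, g], dim=-1))
        # Per-sample deviation from the expected effect, used as the
        # anomaly (mistake) score.
        return (pred - target_effect).pow(2).mean(dim=-1)

Consistent with the OCC setting, such a detector is trained only on correct executions; at test time, segments whose deviation exceeds a threshold calibrated on correct data would be flagged as mistakes.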

Experiments

Experiment overview.
Mistake detection results on the EgoPER dataset. Values are reported in percentage (%), with the highest results in bold and the second-highest underlined. AUC is the primary evaluation metric; Error Detection Accuracy (EDA) is reported only for reference, since it does not account for false negatives and therefore cannot assess a model comprehensively.
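For reference, the sketch below shows how the two metrics can be computed with scikit-learn; the EDA computation is a simplified, precision-like stand-in consistent with the caption's note (it ignores false negatives) and may differ from the official EgoPER definition.

import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(scores, labels, threshold=0.5):
    """scores: anomaly scores in [0, 1]; labels: 1 = mistake, 0 = correct."""
    auc = roc_auc_score(labels, scores)  # threshold-free primary metric
    preds = (scores >= threshold).astype(int)
    # Simplified EDA: fraction of predicted-error frames that are true errors.
    # It ignores missed mistakes (false negatives), which is why the caption
    # treats it as a reference metric only.
    eda = (labels[preds == 1] == 1).mean() if preds.any() else 0.0
    return auc, eda

scores = np.array([0.1, 0.9, 0.4, 0.8])
labels = np.array([0, 1, 0, 1])
print(evaluate(scores, labels))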

Qualitative Results

Qualitative results.
Examples of correctness probability predicted by our model without (in blue) and with (in orange) action effect modeling. Predictions are based on the action frames; the effect frames are shown only for reference.