Understanding Actions Across Perspectives
Specific Aims: Action recognition, a pivotal component of computer vision, finds diverse applications across fields such as smart security (Hu et al., 2007), human-robot interaction (Akkaladevi & Heindl, 2015), and virtual reality (Bates et al., 2017). The technology plays a crucial role in surveillance systems, enabling efficient monitoring through the prompt detection of unusual behaviors. Despite significant advances in human action recognition, state-of-the-art algorithms still face challenges such as misclassifications caused by background noise mistaken for signal and the scarcity of annotated data. Moreover, two foundational issues call for more than engineering solutions. The first is intra- and inter-class variation in action labels: the same action may be performed differently depending on motor capabilities, while different action categories can appear highly similar. For example, "running" and "walking" involve similar human motion patterns. The second concerns the action vocabulary. Actions can be categorized at different levels (movements, atomic actions, composite actions, events), forming an action hierarchy in which complex actions at higher levels decompose into combinations of actions at lower levels; defining and analyzing these different types of actions is crucial. Humans often solve these challenges effortlessly, which highlights the value of examining the discrepancies between machine performance and human capability in order to improve the design of machine action recognition systems.

Aim 1: Understand and Quantify Action Invariance in State-of-the-Art Action Recognition Algorithms. This project will examine the final-layer embeddings of various SOTA action recognition networks with different kinds of architectures, spanning supervised, unsupervised, and self-supervised learning. These networks will be analyzed on input videos depicting the same action executed from different angles and in diverse contexts, at varying levels of abstraction. We will leverage the META dataset (Bezdek et al., 2022), a large-scale, well-characterized collection of stimuli representative of such activities. The dataset consists of a structured and thoroughly instrumented set of extended event sequences performed in naturalistic settings, complete with hand-annotated timings of high-level actions; it also includes sequences of actors performing similar actions in a highly controlled manner, devoid of actual objects. A representational dissimilarity matrix (RDM) analysis will be applied to assess the action invariance of these algorithms, yielding a quantitative measure of the invariance properties of SOTA action recognition models. We will further explore how these invariance properties relate to the success of action classification.

Aim 2: Examine the Correspondence Between Artificial Neural Networks and Biological Neural Representations of Actions. Participants will watch movies depicting everyday activities while undergoing simultaneous neural recording in an MRI scanner. Activation patterns in the observed brain networks will be compared with those produced by state-of-the-art action recognition algorithms, in order to characterize where the two sets of representations converge and where they diverge.
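As a concrete sketch of the planned analyses, the snippet below shows one way the RDM-based invariance measure (Aim 1) and the model-to-brain comparison (Aim 2) could be computed. The function names, the 512-dimensional embedding size, and the synthetic data are illustrative assumptions rather than committed design choices; in practice the embeddings would be extracted from the candidate networks and the neural RDMs estimated from fMRI responses.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def compute_rdm(embeddings: np.ndarray) -> np.ndarray:
    """RDM over a (n_conditions, n_features) embedding matrix,
    using correlation distance (1 - Pearson r) between rows."""
    return squareform(pdist(embeddings, metric="correlation"))

def invariance_index(rdm: np.ndarray, action_labels: np.ndarray) -> float:
    """Mean between-action dissimilarity minus mean within-action
    dissimilarity. Higher values mean clips of the same action
    (across viewpoints and contexts) are embedded more alike."""
    same = action_labels[:, None] == action_labels[None, :]
    off_diag = ~np.eye(len(action_labels), dtype=bool)
    return rdm[~same].mean() - rdm[same & off_diag].mean()

def rsa_score(model_rdm: np.ndarray, neural_rdm: np.ndarray) -> float:
    """Aim 2 comparison: Spearman correlation between the upper
    triangles of a model RDM and a neural RDM of matched conditions."""
    iu = np.triu_indices_from(model_rdm, k=1)
    rho, _ = spearmanr(model_rdm[iu], neural_rdm[iu])
    return rho

# Synthetic stand-in: 6 actions x 4 viewpoints, 512-d embeddings.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(6), 4)
embeddings = rng.standard_normal((24, 512)) + 0.5 * labels[:, None]
rdm = compute_rdm(embeddings)
print(f"invariance index: {invariance_index(rdm, labels):.3f}")
```

Correlation distance is a common RDM metric in representational similarity analysis, and comparing RDM upper triangles with a rank correlation avoids assuming any linear mapping between model and brain dissimilarities.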
Significance and Innovation: One of the principal objectives of artificial intelligence research is to develop machines that can accurately comprehend human actions and intentions, thereby enhancing their ability to assist us. Consider a patient performing rehabilitation exercises at home, monitored by a robotic assistant that not only recognizes the patient's movements but also evaluates the accuracy of the exercises and helps prevent injuries. Such technology could significantly reduce the need for in-person therapy visits, lower medical expenses, and make remote rehabilitation practical. Action recognition algorithms are already pivotal to numerous applications: in sports and entertainment they enhance viewer engagement through detailed analytics and enriched interactive experiences, and in healthcare they support precise patient monitoring and physical therapy by verifying that movements are performed correctly. Cutting-edge methods in this field (Feichtenhofer et al., 2017; Wang et al., 2016) have substantially reduced the need for manual video analysis, providing insights into both current and predicted future activities within video sequences.

However, the field faces several distinct challenges. Some are primarily matters of scale and engineering, such as background segmentation and the lack of sufficiently labeled data for various actions, and may be resolved by engineering advances. Two other challenges have no straightforward engineering solution yet have been effectively addressed by biological systems: managing intra- and inter-class variation, and representing actions hierarchically. It is widely recognized that individuals behave differently when performing the same action. The action categorized as "running," for instance, may vary significantly: a person may run quickly, slowly, or intersperse running with jumping, so a single action category can encompass many styles of human movement. Videos of the same action may also be captured from multiple angles (frontal, lateral, or even aerial), introducing variation in appearance across views (Figure 1), and different individuals may assume different poses while performing identical actions. These factors produce substantial intra-class variation in appearance and pose, which often confounds existing action recognition algorithms. Such variation is even more pronounced in real-world action datasets (Karpathy et al., 2014), necessitating more sophisticated algorithms for practical deployment. This aspect of action recognition, known in the computer vision literature as intra- and inter-class variation, presents a critical challenge for the generalizability of action recognition algorithms. A primary objective of this research is to provide a detailed and precise quantification of how well current state-of-the-art action recognition technologies handle high intra- and inter-class variation. In addition, by exploring how biological systems overcome these issues, we aim to inform and advance the development of future action recognition technologies.
The invariance problem is not exclusive to machine models for action recognition. Convolutional Neural Networks (CNNs) have significantly advanced image recognition, yet they face a notable object invariance problem: when objects are presented in slightly different ways, or when seemingly insignificant features are introduced, CNNs often fail to maintain consistent classification accuracy. For instance, minor variations in object orientation, scale, or background can disproportionately impair a CNN's ability to recognize an object correctly (Kar et al., 2019).
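To make this fragility concrete, the sketch below probes how often a pretrained image classifier keeps its top-1 label under small rotations of the input. The choice of ResNet-18, the rotation range, and the example file path are illustrative assumptions; this is a minimal probe in the spirit of the robustness findings cited above, not the protocol of Kar et al. (2019).

```python
import torch
import torchvision.transforms.functional as TF
from torchvision import models
from torchvision.models import ResNet18_Weights

# A pretrained ImageNet classifier as a stand-in for "a CNN";
# any torchvision image model would serve equally well here.
weights = ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def predicted_class(img):
    """Top-1 ImageNet class index for a PIL image."""
    return model(preprocess(img).unsqueeze(0)).argmax(dim=1).item()

def rotation_consistency(img, angles=(-15, -10, -5, 5, 10, 15)):
    """Fraction of small rotations that leave the top-1 label
    unchanged: a crude probe of the invariance failures above."""
    base = predicted_class(img)
    kept = [predicted_class(TF.rotate(img, a)) == base for a in angles]
    return sum(kept) / len(kept)

# Usage (the image path is hypothetical):
# from PIL import Image
# img = Image.open("example.jpg").convert("RGB")
# print(f"label kept under small rotations: {rotation_consistency(img):.0%}")
```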
A key innovation of this research proposal is the attempt to address this computational vision challenge by examining human behavior and the solutions biological systems have evolved for similar problems. As a social species, humans rely on recognizing the actions of others in everyday life. We quickly and effortlessly extract action information from rich dynamic stimuli, despite variations in the visual appearance of action sequences due to transformations such as changes in size, position, actor, and viewpoint (e.g., determining whether a person is running or walking towards us, regardless of the direction they are coming from). This ability emerges early in development: looking-time studies with four-month-old infants (Woodward & Sommerville, 2000) have demonstrated that infants possess an innate understanding of actions, gazing longer at action sequences that end unexpectedly than at those that conclude as anticipated. Even when actions are minimally represented, as in point-light displays of joint positions, Johansson (1973) found that individuals recognize the actions quickly and without error. These behavioral findings indicate that challenges deemed significant for machine models are far less problematic for humans. This project therefore aims first to quantify the action invariance capabilities of machine models, and then to explore the human neural representations and strategies that could potentially address these challenges.

Existing literature suggests that mirror neurons offer one potential solution to the action invariance problem. Mirror neurons (Gallese & Goldman, 1998), first discovered in the premotor cortex of macaques, fire both when the animal performs an action and when it observes the same action performed by others. This finding has been extended to humans, suggesting a neurological basis for action understanding and imitation. Research indicates that mirror neurons may play a crucial role not just in recognizing but also in predicting and interpreting the actions of others, bridging the gap in action recognition across perspectives and contexts. This neural mechanism underscores the potential for mirror-neuron-like representations to support a more robust form of action invariance, informing the development of more accurate and adaptable action recognition technologies.