What is Next?
In the intricate dance of perception, the human mind is adept at navigating a flood of sensory stimuli, effortlessly filtering through the mundane to capture the essence of experience. At the heart of this cognitive ballet lies the intriguing concept of information optimization – a process where attention is delicately modulated to enhance processing efficiency.
Specific Aims: Action recognition, a pivotal component of computer vision, finds diverse applications across fields such as smart security (Hu et al., 2007), human-robot interaction (Akkaladevi & Heindl, 2015), and virtual reality (Bates et al., 2017). This technology plays a crucial role in surveillance systems, enabling efficient monitoring through the prompt detection of unusual behaviors. Despite significant advancements in human action recognition, state-of-the-art (SOTA) algorithms still face challenges, such as misclassifications caused by background noise mistaken for signal and the scarcity of annotated data. Moreover, two foundational issues call for non-technical solutions: intra- and inter-class variations in action labels, where the same action may be performed differently depending on motor capabilities, and similarities across different action categories; for example, “running” and “walking” involve similar human motion patterns. Another challenge concerns the action vocabulary. Actions can be categorized into different levels—movements, atomic actions, composite actions, events—creating an action hierarchy in which complex actions at higher levels can be decomposed into combinations of actions at lower levels. Defining and analyzing these different types of actions is crucial. Additionally, humans often solve these challenges effortlessly, highlighting the importance of examining the discrepancies between machine performance and human capabilities to improve the design of machine action recognition systems.

Aim 1: Understand and quantify action invariance in state-of-the-art action recognition algorithms. This project will examine the final-layer embeddings of various SOTA action recognition networks spanning supervised, unsupervised, and self-supervised architectures. These networks will be analyzed on their performance when the input videos depict the same action executed from different angles and in diverse contexts with varying levels of abstraction. We will leverage the META dataset (Bezdek et al., 2022), a large-scale, well-characterized collection of stimuli representative of such activities. This dataset consists of a structured and thoroughly instrumented set of extended event sequences performed in naturalistic settings, complete with hand-annotated timings of high-level actions. Additionally, it includes sequences of humans performing similar actions in a highly controlled manner, devoid of actual objects. Specifically, a representational dissimilarity matrix (RDM) analysis will be applied to assess the action invariance of these algorithms, providing a quantitative measure of the invariance properties of SOTA action recognition models. Furthermore, we will explore the relationship between these invariance abilities and action classification performance.

Aim 2: Examine the correspondence between artificial neural networks and biological neural representations of actions. Participants will watch movies depicting everyday activities while undergoing simultaneous neural recording in an MRI scanner. The activation patterns in brain networks will be compared with those produced by state-of-the-art action recognition algorithms, with the aim of characterizing both the convergence and divergence of these patterns.
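To make the Aim 1 analysis concrete, here is a minimal sketch of the RDM computation, assuming only that each network exposes a fixed-length final-layer embedding per video clip. The helper names (`compute_rdm`, `invariance_score`) are illustrative, not part of any existing pipeline, and the random embeddings stand in for real model activations.

```python
# A minimal RDM sketch: one embedding row per video clip.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def compute_rdm(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise dissimilarity (correlation distance) between clip embeddings."""
    return squareform(pdist(embeddings, metric="correlation"))

def invariance_score(rdm: np.ndarray, same_action: np.ndarray) -> float:
    """Mean dissimilarity of same-action pairs (different views/contexts)
    minus mean dissimilarity of different-action pairs; lower = more invariant."""
    off_diag = ~np.eye(rdm.shape[0], dtype=bool)
    return rdm[same_action & off_diag].mean() - rdm[~same_action & off_diag].mean()

# Toy example: 6 clips = 3 actions x 2 viewpoints, 512-d embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 512))
labels = np.array([0, 0, 1, 1, 2, 2])
same_action = labels[:, None] == labels[None, :]
print(invariance_score(compute_rdm(embeddings), same_action))
```

The same RDMs could then be correlated with RDMs built from the Aim 2 brain activation patterns, which is the standard way representational similarity analyses bridge models and neural data.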
Significance and Innovation: One of the principal objectives of artificial intelligence research is to develop machines that can accurately comprehend human actions and intentions, thereby enhancing their ability to assist us. Consider a scenario where a patient performs rehabilitation exercises at home, monitored by a robotic assistant. Such a robot would not only recognize the patient’s movements but also evaluate the accuracy of the exercises and help prevent potential injuries. This technology could significantly reduce the need for in-person therapy visits, decrease medical expenses, and make remote rehabilitation feasible. Action recognition algorithms are pivotal to numerous practical applications, particularly in sports and entertainment, where they enhance viewer engagement through detailed analytics and enriched interactive experiences. In healthcare, these algorithms support precise patient monitoring and physical therapy by ensuring movements are performed correctly. Cutting-edge methods in this field (Feichtenhofer et al., 2017; Wang et al., 2016) have substantially reduced the need for manual video analysis, providing insights into both present and predicted future activities within video sequences.

However, the field faces several distinct challenges. Some are primarily matters of scale and engineering, such as background segmentation and the lack of sufficient labeled data for various actions, and may yield to engineering advances. Yet two challenges in action recognition have no straightforward engineering solutions but have been effectively addressed by biological systems: managing intra- and inter-class variations, and the hierarchical representation of actions. It is widely recognized that individuals exhibit distinct behaviors when performing the same actions. For instance, the action categorized as “running” may vary significantly: a person may run quickly, slowly, or intersperse running with jumping, so a single action category can encompass various styles of human movement. Additionally, videos capturing the same action may be taken from multiple angles—frontal, lateral, or even aerial—introducing variations in appearance across views (Figure 1). Moreover, different individuals may assume different poses while performing identical actions. These factors contribute to substantial intra-class variations in appearance and pose, often confounding existing action recognition algorithms. Such variations are even more pronounced in real-world action datasets (Karpathy et al., 2014), necessitating more sophisticated algorithms suitable for practical deployment. This aspect of action recognition, known in the computer vision literature as intra- and inter-class variation, presents a critical challenge for the generalizability of action recognition algorithms. A primary objective of this research is to provide a detailed and precise quantification of how current state-of-the-art action recognition technologies handle high intra- and inter-class variation. In addition, by exploring how biological systems overcome these issues, we aim to inform the development of future action recognition technologies.
The invariance problem is not exclusive to machine models of action recognition. Convolutional neural networks (CNNs) have significantly advanced image recognition, yet they face notable challenges with object invariance. The problem arises when objects are presented in slightly different ways or when seemingly insignificant features are introduced: minor variations in object orientation, scale, or background can disproportionately affect a CNN’s ability to recognize the object correctly (Kar et al., 2019).
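As an illustration of how this fragility can be probed, the hedged sketch below measures how often small translations change a classifier’s top-1 prediction. Here `model` stands for any image classifier that maps a batch of images to class scores; it is an assumption for illustration, not a method from Kar et al. (2019).

```python
import numpy as np

def prediction_stability(model, image: np.ndarray, max_shift: int = 4) -> float:
    """Fraction of small (dx, dy) translations that leave the top-1 class unchanged.
    `model` is assumed to map a batch of images to class scores."""
    base = int(np.argmax(model(image[None])))
    agree, total = 0, 0
    for dx in range(-max_shift, max_shift + 1):
        for dy in range(-max_shift, max_shift + 1):
            # circularly shift the image along its height and width axes
            shifted = np.roll(image, shift=(dx, dy), axis=(0, 1))
            agree += int(np.argmax(model(shifted[None])) == base)
            total += 1
    return agree / total
```

A perfectly translation-invariant classifier would score 1.0; the striking empirical finding is how far below that real networks can fall for some images.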
A key innovation of this research proposal is to address this challenge for computational vision models by examining human behavior and the solutions biological systems have found to similar problems. As a social species, humans rely on recognizing the actions of others in their everyday lives. We quickly and effortlessly extract action information from rich dynamic stimuli, despite variations in the visual appearance of action sequences due to transformations such as changes in size, position, actor, and viewpoint (e.g., determining whether a person is running or walking towards us, regardless of the direction they are coming from). This ability emerges early in development. Eye-gaze studies with four-month-old infants (Woodward & Sommerville, 2000) have demonstrated that infants possess an early understanding of actions: they gaze longer at action sequences that end unexpectedly than at those that conclude as anticipated. Even when actions are minimally represented, as in joint-position point-light displays, Johansson (1973) found that individuals could recognize actions quickly and without error. These behavioral findings indicate that challenges deemed significant for machine models are far less problematic for humans. This project aims first to quantify the action invariance capabilities of machine models and then to explore human neural representations and strategies that could potentially address these challenges.

Existing literature suggests that mirror neurons could offer a solution to the action invariance problem. Mirror neurons (Gallese & Goldman, 1998), first discovered in the premotor cortex of macaques, fire both when the animal performs an action and when it observes the same action performed by others. This discovery has been extrapolated to humans, suggesting a neurological basis for action understanding and imitation. Research indicates that mirror neurons may play a crucial role not just in recognizing but also in predicting and interpreting the actions of others, bridging the gap in action recognition across different perspectives and contexts. This neural mechanism underscores the potential for mirror-neuron-like representations to support a more robust form of action invariance, informing the development of more accurate and adaptable action recognition technologies.
The central debate between Davachi’s and Zacks’s stimulus types is not merely about one being more naturalistic than the other. Consider, for instance, the inherent oddity of signing up for a psychological experiment: our studies primarily involve participants labeled ‘WEIRD’ (Western, educated, industrialized, rich, and democratic), mostly undergraduate students. Conversely, sequences of pictures with varying colored backgrounds might deviate from our everyday experiences, yet one might argue that, from a process perspective, both types of stimuli engage similar biological computations and could generalize to more naturalistic stimuli.
Event Models: Understanding the Temporal and Spatial Dynamics
The characteristics of short-term memory (STM) are often explored through simple methodologies, such as basic images or sequences of numbers. The insights from these investigations highlight a fundamental limitation of working memory: on average, an individual can maintain only about five to seven pieces of information, and without deliberate rehearsal this information begins to fade after approximately 12 seconds. This limitation stands in stark contrast to our experience of complex activities like reading comprehension or watching movies, during which information retention seems almost effortless. Generally, individuals understand and remember the narrative of a film or the content of a book without noticeable difficulty, even when unexpectedly asked to recall specific details.
Incremental Understandings of Working Memory
Deep learning architectures represent a sophisticated hierarchy of modules, each designed for incremental learning and transformation of input data. These structures excel in creating representations that are both highly selective and invariant, allowing for intricate functions that can distinguish subtle details while overlooking irrelevant variations. This capability is exemplified in tasks as nuanced as differentiating Samoyeds from white wolves, irrespective of background or environmental conditions.
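As a toy illustration of this modular hierarchy (a sketch, not any published architecture; the layer sizes are arbitrary), the stack below alternates feature-extracting convolutions with pooling steps that discard position information, so the final read-out can be selective about class while being invariant to where the object appears:

```python
import torch
import torch.nn as nn

# Each module transforms its input and feeds the next stage.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low level: edges, textures
    nn.ReLU(),
    nn.MaxPool2d(2),                              # small translation invariance
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid level: parts, motifs
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                      # discard remaining position info
    nn.Flatten(),
    nn.Linear(32, 2),                             # selective read-out: 2 classes
)

logits = model(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 2])
```

The interplay is the point: convolutions build selectivity, pooling buys invariance, and stacking many such stages is what lets a deep network separate a Samoyed from a white wolf regardless of the snow behind it.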
In everyday scenarios, we accumulate information across different timescales. This involves observing the milliseconds-scale facial expressions of our conversational partners, their body movements on a seconds scale, and tracking their final locations. Only a system capable of allowing past information to influence current processing across multiple timescales simultaneously could handle such diverse information streams.
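One minimal way to picture such a system, purely as an illustration, is a bank of leaky integrators whose decay rates span different timescales; the decay constants below are arbitrary placeholders standing in for milliseconds-, seconds-, and minutes-like scales:

```python
import numpy as np

def multi_timescale_trace(stream: np.ndarray, decays=(0.5, 0.95, 0.999)) -> np.ndarray:
    """Return one running trace per decay rate; slower decay = longer memory."""
    traces = np.zeros((len(decays), len(stream)))
    state = np.zeros(len(decays))
    for t, x in enumerate(stream):
        for i, d in enumerate(decays):
            # each integrator blends its past state with the current input
            state[i] = d * state[i] + (1 - d) * x
        traces[:, t] = state
    return traces

signal = np.random.default_rng(1).normal(size=1000)
print(multi_timescale_trace(signal).shape)  # (3, 1000)
```

Every trace sees the same input stream, but the slow integrators carry information forward long after the fast ones have forgotten it, which is the property the paragraph above demands.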
Reflecting on a book I am currently reading, the concept of Situation Models stands out as a critical cognitive framework. These models represent the events, actions, individuals, and the overall situation that a text evokes. They are informed by a blend of the reader’s prior knowledge, the information currently being processed, and episodic information no longer in working memory—such as details from previous chapters.
One way to consider the body is as a dynamic collection of sensors, perpetually gathering light, sound, smell, touch, heat, and more from the surrounding environment. Additionally, numerous sensors within the body capture data from its own activities and physiological processes. Understanding everyday actions and experiences necessitates integrating this information to make sense of it. Despite the variable influx of information to our senses, our perception of the world remains remarkably stable. From the constantly changing multimodal stream of data, the mind extracts stable entities, organizing and integrating sensations of light, sound, smell, and touch into distinct objects that are separate from other sensory inputs. Perception extends beyond mere separation; it involves the recognition of specific objects and organisms, each with unique shapes, sizes, and components. Though sensations are continuous and evolving, our perception of them is discrete and enduring. Activities, too, are perceived in a discretized manner. Although activity is inherently about change over time, this change is conceptualized not as a constant flux but as sequences of key moments.
A system with constrained storage capacity must process causal relations effectively to achieve successful comprehension. Comprehension can be conceptualized as a problem-solving process in which the reader must discern a series of causal links that bridge the gap between a text’s beginning and its conclusion (Trabasso & Sperry, 1985). It is plausible that only the causal antecedent of the subsequent event is retained in short-term memory. By leveraging these connected relationships, one can uncover a causal sequence that runs from the text’s inception to its final resolution. This framework is referred to as the “construction-integration model” (Kintsch & van Dijk, 1978).
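To make this chain-finding idea concrete, here is an illustrative sketch (with a made-up four-event story and hypothetical helper names) that treats states as nodes, causal links as directed edges, and comprehension as a search for one causal path from the text’s opening state to its final one:

```python
from collections import deque

def causal_chain(links, start, end):
    """Breadth-first search for one causal path from `start` to `end`."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path
        for nxt in links.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no causal chain connects the two states

# A made-up story: each state causally enables the next.
links = {
    "storm hits": ["power fails"],
    "power fails": ["alarm stays silent"],
    "alarm stays silent": ["hero oversleeps"],
    "hero oversleeps": ["hero misses the train"],
}
print(causal_chain(links, "storm hits", "hero misses the train"))
```

States that sit on the recovered path are, on this account, exactly the ones the problem-solving hypothesis predicts will be best remembered.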
The problem-solving hypothesis posits that individuals recall a text according to its causal structure. To recover this structure, a parsing algorithm dissects the text into discrete states, and a definition of causal relations determines the connections between those states. An alternative viewpoint, by contrast, holds that the meaning of a text is stored in memory as a network of propositions, termed a “textbase.” Strategies are then needed to pinpoint the most pivotal propositions. This selection process begins with the presupposition that the propositions residing in short-term memory are organized into a hierarchical network, with the most significant propositions acting as superordinate nodes; these superordinate propositions are then chosen for retention in short-term memory.
These two theoretical perspectives diverge on several fronts. First, they are predicated on different units of analysis: clause-length states versus propositions. The construction-integration model hypothesizes that two textual elements can be connected only if they coexist within limited-capacity short-term memory, whereas the problem-solving literature allows all conceivable connections to be established. The models also presuppose different mechanisms behind the recallability of a text element, from its position in the causal structure to the duration it persists in short-term memory. A potential way to reconcile the theories is to extend the causal analysis down to the level of individual propositions.
It is arguable that both processes occur concurrently. Beyond retaining causal connections between propositions, one might also strive to maintain a representation of the overarching goals at a situational level, thereby forging links that transcend the immediate context.
One study (Fletcher & Bloom, 1988) examined these competing hypotheses by crafting texts with various types of causal structure at both the situational and propositional levels. Students read each narrative at their own pace and then freely recalled its content. The different models made different predictions about which propositions would be remembered, and these predictions were compared with participants’ actual recall. The findings confirmed that a proposition’s status on the causal chain and the number of its causal connections are closely linked to its memorability, substantiating the problem-solving hypothesis. Moreover, the results extended to the level of individual propositions: readers predominantly retained the terminal propositions of the causal chain in short-term memory as they read. In essence, the objective of narrative comprehension is to uncover a sequence of causal links connecting the beginning of the text to its culmination. It is somewhat surprising that goal-oriented information is not consistently held in short-term memory; a plausible explanation is that maintaining active goal information would impose an excessive burden on short-term memory capacity, though goals are presumably reinstated whenever local coherence falters.
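Purely as an illustration of this “leading edge” retention pattern (the buffer capacity and the toy story are my assumptions, not taken from the study), the sketch below keeps, after each proposition is read, only the causal antecedents of the current proposition plus the proposition itself:

```python
def leading_edge(propositions, causes, capacity: int = 3):
    """Yield short-term memory contents after each proposition is read.
    `causes[p]` lists the propositions that are causal antecedents of p."""
    stm = []
    for p in propositions:
        # retain only antecedents of the current proposition, drop the rest
        stm = [q for q in stm if q in causes.get(p, [])]
        stm.append(p)
        stm = stm[-capacity:]  # capacity-limited buffer
        yield list(stm)

story = ["storm", "power out", "alarm silent", "oversleeps"]
causes = {"power out": ["storm"], "alarm silent": ["power out"],
          "oversleeps": ["alarm silent"]}
for snapshot in leading_edge(story, causes):
    print(snapshot)
```

The buffer always holds the most recent causally relevant material, which matches the finding that terminal chain propositions dominate what readers keep active.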
Fletcher, C. R., & Bloom, C. P. (1988). Causal reasoning in the comprehension of simple narrative texts. Journal of Memory and Language, 27(3), 235–244. https://doi.org/10.1016/0749-596X(88)90052-6

Kintsch, W., & van Dijk, T. A. (1978). Toward a model of text comprehension and production. Psychological Review, 85(5), 363–394.

Trabasso, T., & Sperry, L. L. (1985). Causal relatedness and importance of story events. Journal of Memory and Language, 24(5), 595–611. https://doi.org/10.1016/0749-596X(85)90048-8
Delving deeper into Kintsch’s Construction-Integration (C-I) model (Kintsch, 1988), we explore its multifaceted approach to text comprehension. The model raises a fundamental question: how do we form cognitive representations from textual descriptions? Kintsch proposed a bottom-up process; here we examine its connection to the representation of visual events and its implications for discourse comprehension.
How do we encode information? How are memories represented in the brain? Do we store each episode of information separately, or do we integrate new information into an existing abstract world model? Behavioral data support various theories. This paper introduces two computational models of memory formation and explores the similarities in their mechanisms for updating dynamic cognitive representations.