The Evolution, Mechanisms, and Impact of Deep Learning


Deep learning architectures represent a sophisticated hierarchy of modules, each designed for incremental learning and transformation of input data. These structures excel in creating representations that are both highly selective and invariant, allowing for intricate functions that can distinguish subtle details while overlooking irrelevant variations. This capability is exemplified in tasks as nuanced as differentiating Samoyeds from white wolves, irrespective of background or environmental conditions.

The essence of learning in these networks is the backpropagation procedure, which is nothing more than a practical application of the chain rule for derivatives. The gradient of the objective function with respect to each module's weights is computed by working backwards from the output layer toward the input: the gradient with respect to a module's output is propagated back to give the gradient with respect to its inputs and its weights, and the weights are then adjusted in the direction that reduces the objective.
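To make the chain-rule mechanics concrete, here is a minimal sketch of backpropagation for a tiny two-layer network. It is written in NumPy; the layer sizes, the ReLU non-linearity, and the squared-error objective are illustrative assumptions, not details taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 samples, 3 input features, 1 target value each.
X = rng.normal(size=(4, 3))
y = rng.normal(size=(4, 1))

# Weights of the two modules (layers).
W1 = rng.normal(scale=0.1, size=(3, 5))
W2 = rng.normal(scale=0.1, size=(5, 1))

learning_rate = 0.1
for step in range(100):
    # Forward pass: each module transforms the previous representation.
    h_pre = X @ W1             # pre-activation of the hidden layer
    h = np.maximum(0, h_pre)   # ReLU non-linearity
    y_hat = h @ W2             # network output

    # Objective: mean squared error.
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass: apply the chain rule module by module,
    # from the output back toward the input.
    grad_y_hat = 2 * (y_hat - y) / y.shape[0]   # dL/dy_hat
    grad_W2 = h.T @ grad_y_hat                  # dL/dW2
    grad_h = grad_y_hat @ W2.T                  # dL/dh
    grad_h_pre = grad_h * (h_pre > 0)           # backprop through ReLU
    grad_W1 = X.T @ grad_h_pre                  # dL/dW1

    # Adjust weights along the negative gradient.
    W1 -= learning_rate * grad_W1
    W2 -= learning_rate * grad_W2

print("final loss:", loss)
```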

Historically, interest in neural networks and backpropagation waned during the 1990s, largely because of skepticism that multistage feature extractors could be trained with little prior knowledge, and because of the fear that gradient descent would become ensnared in poor local minima. In practice, however, large networks almost always converge to solutions of very similar quality, so poor local minima turn out to be rarely a problem. The error landscape of such networks is instead dominated by saddle points; although these are combinatorially numerous, most of them have similar values of the objective function, so it matters little which one the algorithm settles near.

Convolutional Neural Networks (ConvNets) stand out as a paradigm-shifting model in this evolution. Designed to process data that come in the form of multiple arrays, such as the three color channels of an image, ConvNets rest on four fundamental ideas: local connections, shared weights, pooling, and the use of many layers. The architecture alternates convolutional layers, which detect local conjunctions of features, with pooling layers, which merge semantically similar features, making the representation robust to variations in position and scale. This organization not only mirrors the hierarchical processing of visual information in biological systems but has also shown remarkable alignment with neuronal activations recorded in primate studies.
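As a rough illustration of that alternation between convolution and pooling, the sketch below stacks two convolutional stages followed by a small classifier. PyTorch is my own choice of framework here, and the filter counts, the 32x32 input size, and the ten output classes are all assumptions made for the example rather than anything specified above.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    # Convolutional layer: local connections with shared weights
    # detect local patterns across the three color channels.
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    # Pooling layer: aggregates similar nearby features, giving some
    # invariance to small shifts in position.
    nn.MaxPool2d(kernel_size=2),
    # A second convolution/pooling stage, detecting patterns of patterns.
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),  # assumes 32x32 inputs and 10 classes
)

x = torch.randn(1, 3, 32, 32)   # one hypothetical RGB image
print(model(x).shape)           # torch.Size([1, 10])
```

Each pooling step halves the spatial resolution while the number of feature maps grows, which is the depth-for-resolution trade that gives the later layers their larger, more abstract receptive fields.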

The theory of deep learning points to two exponential advantages of these architectures: distributed representations allow generalization to combinations of feature values never seen during training, and composing layers of representation can increase expressive power exponentially with depth. This contrasts sharply with earlier paradigms of cognition and language modeling, which relied on discrete symbol processing and therefore could not abstract and generalize across semantically similar inputs.

In conclusion, deep learning architectures, exemplified by ConvNets, have transcended initial skepticism to redefine the boundaries of artificial intelligence and cognitive modeling. These networks embody a fusion of mathematical rigor, computational efficiency, and biological inspiration, charting a course for future explorations in AI that are as boundless as they are profound.