#7 - Zsolt Kira: Continual Learning, Multimodality & Achieving AGI
In this episode, Zsolt discusses the challenges and advancements in robotics and AI, focusing on memory systems, continual learning, and the importance of human-robot interaction. He emphasizes the need for robots to learn from ambiguous requests and adapt to new environments while maintaining personalization. The conversation also explores the role of reinforcement learning and the potential for collective intelligence among robots.
Embodied intelligence in home robotics. Imagine a home assistant robot capable of complex tasks like stocking groceries, clearing tables, and fetching objects on command. This has been a long-standing societal ambition. Achieving such embodied intelligence involves a classic “sense, plan, act” pipeline. Robots must first process sensor data - images, audio, and video - to build a model of their environment. Based on this understanding, they then determine the appropriate actions and execute them through low-level controls, like moving a gripper to pick up an object. While significant progress has been made in sensor processing and action planning, several key challenges remain. True household utility requires robots to possess a layer of common sense understanding and personalization. They need to learn individual preferences, such as where specific items are typically placed or which mug a user prefers for a particular drink, and even discern dirty items from clean ones. Handling ambiguous human requests is another hurdle. Furthermore, robots must develop memory, enabling them to learn from past interactions and experiences - for instance, remembering not to repeat an action that previously resulted in a negative user experience. This capacity for long-term learning and adaptation is crucial for robots intended to cohabitate with humans for extended periods.
Memory and continual learning challenges. A fundamental challenge in developing robots that learn over time is “catastrophic interference”. This phenomenon occurs when a model, in the process of learning a new task or information, inadvertently overwrites or degrades its knowledge of previously learned tasks. Traditionally, AI models are trained on a dataset and then deployed in a “frozen” state, meaning their knowledge base is fixed. However, for a home robot to be truly useful, it must continuously learn - perhaps how to handle a new object in the house or adapt to a new routine. If a robot tries to learn this new task, it might lose its proficiency in older ones. One straightforward approach to mitigate this, known as “rehearsal,” involves storing all past experiences and continuously relearning old tasks alongside new ones. While this can prevent forgetting, it presents its own set of problems. Humans generally don’t learn new board games by simultaneously rehearsing every game they have ever played. Moreover, retaining vast amounts of past data raises significant privacy concerns, especially for a robot operating within a private home. There are also substantial computational implications, as learning new information would necessitate processing all previously learned material, demanding considerable processing power.
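To make the cost concrete, here is a minimal Python sketch of naive rehearsal (all class and method names are illustrative, not from the episode): every new task is learned by retraining on the entire stored history, so storage and compute grow with the robot’s lifetime rather than with the size of the new task.

```python
import random

class TinyModel:
    """Stand-in for a real network; just counts gradient updates."""
    def __init__(self):
        self.steps = 0
    def update(self, x, y):
        self.steps += 1  # a real model would take a gradient step here

class RehearsalLearner:
    """Naive rehearsal: keep every past example and retrain on all of it."""
    def __init__(self, model):
        self.model = model
        self.memory = []  # grows without bound: storage, privacy, compute costs

    def learn_task(self, new_examples, epochs=3):
        self.memory.extend(new_examples)   # old data is never discarded
        for _ in range(epochs):
            random.shuffle(self.memory)    # every epoch revisits ALL past tasks
            for x, y in self.memory:
                self.model.update(x, y)

model = TinyModel()
learner = RehearsalLearner(model)
learner.learn_task([(i, i) for i in range(100)])  # task 1: trains on 100 examples
learner.learn_task([(i, i) for i in range(100)])  # task 2: now trains on 200
print(model.steps)  # update count scales with total history, not the new task
```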
Rehearsal-free continual learning. To address the limitations of rehearsal-based methods, research is shifting towards “rehearsal-free” or “replay-free” continual learning. One such approach, CODA-Prompt (Continual Decomposed Attention-based Prompting), capitalizes on powerful, large pre-trained foundation models. Instead of training models from scratch and battling catastrophic forgetting, the strategy involves taking existing, highly capable models and fine-tuning them by adjusting only a small number of their parameters. This is a form of parameter-efficient learning, popularized by techniques like LoRA. By applying regularization and other sophisticated techniques to these targeted parameter updates, researchers aim to achieve performance comparable to models trained on all data (old and new) simultaneously, but without the need to store or replay past data. This methodology is distinct from techniques like replay buffers used in reinforcement learning, as the core goal is to eliminate any form of data replay. This is particularly crucial for home robots, where long-term operation necessitates learning numerous new tasks without continuously storing potentially private past experiences, thereby also improving scalability and reducing computational load.
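As a rough illustration of the parameter-efficient idea behind techniques like LoRA (a generic low-rank-adapter sketch, not the CODA-Prompt method itself): the large pre-trained weight matrix stays frozen, and only a small low-rank update is trained, so fine-tuning cannot overwrite the frozen base knowledge.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 1024, 8                          # model width vs. low-rank adapter size
W = rng.standard_normal((d, d))         # pre-trained weight: FROZEN, never updated
A = np.zeros((d, r))                    # trainable factor, zero-initialized so the
B = rng.standard_normal((r, d)) * 0.01  # ... adapted model starts exactly at W

def forward(x):
    # Effective weight is W + A @ B, but only A and B receive gradient updates,
    # so the knowledge stored in W is preserved.
    return x @ (W + A @ B)

x = rng.standard_normal((1, d))
y = forward(x)
trainable = A.size + B.size             # d*r + r*d parameters instead of d*d
print(f"trainable params: {trainable} vs frozen: {W.size} "
      f"({100 * trainable / W.size:.2f}%)")
```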
Scalability in robotics. Scalability is a critical factor in the practical deployment of robots, encompassing both computational resources (memory and processing power) and performance consistency as tasks accumulate. While replay-free methods like CODA-Prompt offer improved scalability by avoiding the storage of past data, challenges persist. Even advanced continual learning methods show performance degradation as the number of sequentially learned tasks increases. Current research often evaluates these methods on a relatively small number of new tasks, perhaps up to 25 or 50. However, a robot operating in a home for years would need to learn thousands, if not tens of thousands, of new things. This presents an unsolved problem in continual learning: how to effectively manage and retain relevant information from vast, continuous streams of data over extended periods. Unlike the human brain, which has sophisticated, albeit not fully understood, mechanisms for memory retention (possibly linked to emotional salience), current robotic systems lack robust methods for determining what to remember and what to discard from months or years of experience. Even techniques for optimizing replay buffers by selecting the most relevant experiences have inherent limitations in scaling to such long-term, high-volume learning scenarios.
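One classic baseline for keeping a bounded buffer over an unbounded experience stream (my example here; the episode does not name a specific method) is reservoir sampling, which retains a uniformly random, fixed-size subset no matter how long the stream runs:

```python
import random

class ReservoirBuffer:
    """Fixed-size buffer over an unbounded stream: every item seen so far
    has an equal probability of being retained."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, experience):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(experience)
        else:
            j = self.rng.randrange(self.seen)  # classic reservoir sampling step
            if j < self.capacity:
                self.buffer[j] = experience

buf = ReservoirBuffer(capacity=100)
for t in range(100_000):                # years of experience, bounded memory
    buf.add(f"experience_{t}")
print(len(buf.buffer), buf.seen)        # 100 retained out of 100000 seen
```

This bounds memory, but uniform retention treats a once-in-a-lifetime failure the same as the thousandth mug-fetch, which is exactly the salience problem the episode identifies as unsolved.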
Biological versus artificial memory systems. Significant disparities exist between biological and artificial memory systems. The latter are relatively “flat,” storing information like vectors or knowledge in a uniform manner. In contrast, the human brain, as understood by neuroscience, features more complex structures, such as dual memory frameworks distinguishing between short-term and long-term memory. Evidence from brain damage or disease studies suggests these are distinct systems. Human memory is also deeply intertwined with saliency—information linked to strong emotional states or task relevance is often recalled more effectively. Current robotic systems largely lack such mechanisms. A fascinating aspect of human memory is its associative nature: a particular smell, for instance, can trigger a cascade of related memories. This “rolling associative process” doesn’t have a direct computational analog in current AI. Furthermore, concepts like neurogenesis, where the brain generates more neurons when exposed to diverse experiences, and the processes of synaptic pruning during development, highlight the dynamic and adaptive nature of biological learning. While some AI research explores growing or pruning parts of neural networks to accommodate new tasks, these approaches are complex and difficult to implement and study effectively over the long lifespan of a robot, especially when compared to the nuanced, lifelong learning capabilities of the human brain.
Multimodal language understanding. A key aspect of human-robot interaction is the robot’s ability to understand and act upon natural language instructions, which can often be ambiguous. Dr. Kira references recent work on a generalized embodied agent leveraging a multimodal Large Language Model (LLM) designed to complete tasks from such instructions. To ensure alignment between the user’s intent and the robot’s execution, the primary strategy is scale—training the model on millions of diverse experiences. This vast exposure helps the model generalize and develop an implicit understanding of common requests and their contexts, similar to how LLMs become proficient in language. However, scale alone isn’t always sufficient. When a robot misinterprets an instruction or performs a task poorly, the model needs to be updated to learn from that failure. Another crucial component, explored in work with Meta, is enabling robots to actively manage ambiguity by determining when to ask clarifying questions. If a robot receives a vague command like “bring me a mug,” it might proactively ask about the intended use or specific type of mug, much like humans do in conversation to resolve uncertainty. This combination of robust learning from large-scale data and the ability to engage in clarifying dialogue represents a two-pronged approach to tackling ambiguity in multimodal language understanding.
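A hedged sketch of the “when to ask” decision (the entropy-threshold rule and all names here are assumptions for illustration, not the Meta system): if the model’s belief over candidate referents is too spread out, ask a clarifying question; otherwise act on the most likely interpretation.

```python
import math

def entropy(probs):
    """Shannon entropy of the belief over candidate referents."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decide(instruction, candidates, scores, threshold=0.8):
    """candidates: objects that could satisfy the instruction;
    scores: model's probability that each one is the intended referent."""
    if entropy(scores) > threshold:     # belief too diffuse -> resolve ambiguity
        options = ", ".join(candidates)
        return f'ASK: "You said {instruction!r}; did you mean the {options}?"'
    best = candidates[scores.index(max(scores))]
    return f"ACT: fetch the {best}"

# Ambiguous: three mugs, near-uniform belief -> ask a clarifying question.
print(decide("bring me a mug", ["blue mug", "travel mug", "espresso cup"],
             [0.34, 0.33, 0.33]))
# Unambiguous: one clear referent -> act directly.
print(decide("bring me a mug", ["blue mug", "travel mug", "espresso cup"],
             [0.92, 0.05, 0.03]))
```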
Personalization in robot learning. For a robot to be truly integrated into a home, it must adapt to the unique communication styles, preferences, and routines of its specific users. A one-size-fits-all model, even a very large and capable one, is unlikely to suffice. Instead, robots will need to undergo a process of acclimation, learning the particular “ways of being” within each new environment, much like a new person would. Several techniques are being explored to achieve this personalization. Continual learning allows the robot’s underlying model to be updated based on new experiences within the home. Beyond direct model updates, “in-context learning” or “prompt learning” offers a way to adapt behavior without altering the model’s core weights. This involves retrieving relevant information from the robot’s memory—past interactions, user preferences stored as text or image data—and providing this information as part of the input or prompt to the main model. This is analogous to how systems like ChatGPT use conversation history to inform current responses. By conditioning the model’s outputs on this personalized, retrieved context, the robot can tailor its actions and responses to the specific user and household, making interactions more natural and effective.
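A minimal sketch of this retrieval-then-prompt loop, assuming a toy keyword-overlap retriever standing in for a real embedding index (the stored preferences and helper names are invented):

```python
import string

# Hypothetical memory of user preferences, stored as plain text.
MEMORY = [
    "User prefers the blue ceramic mug for coffee.",
    "Clean dishes go in the upper-left cabinet.",
    "User's mail arrives in the lobby mailroom around noon.",
]

def tokenize(text):
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def retrieve(request, memory, k=2):
    """Toy retrieval by keyword overlap; a real system would use embeddings."""
    words = tokenize(request)
    scored = sorted(memory, key=lambda m: -len(words & tokenize(m)))
    return scored[:k]

def build_prompt(request):
    context = "\n".join(retrieve(request, MEMORY))
    # The base model's weights are untouched; behavior is steered by context.
    return f"Known preferences:\n{context}\n\nInstruction: {request}"

print(build_prompt("bring me a mug of coffee"))
```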
Multi-embodiment action tokenization. The Generalized Embodied Agent (GEA), presented in a recently published paper, introduces “multi-embodiment action tokenization”. This work focuses on enabling a single AI model to control different types of robotic systems or “embodiments” performing diverse tasks, moving beyond training separate models for each specific application. A key innovation lies in how actions are “tokenized” or represented for the model, which differs significantly from how language is tokenized. For tasks involving discrete action spaces—where the robot chooses from a set of predefined skills like “pick,” “place,” or “navigate”—the most effective approach was found to be using semantically meaningful tokens. These are often existing tokens within the language model that correspond to these concepts (e.g., the word “pick”). This leverages the common-sense knowledge already embedded in the foundation model, leading to better performance. For tasks requiring low-level continuous control, such as specifying joint angles or torques, a different strategy proved superior: an explicit learning phase to create a discrete “codebook” of action tokens from the continuous space, using techniques like VQ-VAE. This effectively translates continuous commands into a manageable set of discrete outputs. The power of this approach was demonstrated by training a single model capable of navigating a robot in simulation, playing Atari games, and even operating a smartphone’s graphical user interface.
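A simplified sketch of the two tokenization regimes (the token ids and codebook here are invented; in GEA’s setting the codebook would be learned, e.g., with a VQ-VAE, rather than random):

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete skills: reuse tokens the language model already understands.
SKILL_TOKENS = {"pick": 3108, "place": 5341, "navigate": 9023}  # illustrative ids

def tokenize_skill(skill):
    return SKILL_TOKENS[skill]          # leverages the LM's existing semantics

# Continuous control: quantize joint commands against a discrete codebook.
CODEBOOK = rng.standard_normal((256, 7))  # 256 tokens over a 7-DoF joint space

def tokenize_continuous(action):
    dists = np.linalg.norm(CODEBOOK - action, axis=1)
    return int(np.argmin(dists))        # index of the nearest codebook entry

def detokenize_continuous(token):
    return CODEBOOK[token]              # decode the token back to a joint command

cmd = rng.standard_normal(7)            # some continuous 7-DoF command
tok = tokenize_continuous(cmd)
recon = detokenize_continuous(tok)
print(tokenize_skill("pick"), tok, np.linalg.norm(cmd - recon))
```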
Reinforcement learning + foundation models. The research on multi-embodiment action tokenization yielded some surprising insights, particularly regarding the synergy between reinforcement learning (RL) and large foundation models. One unexpected finding was the relatively low level of interference observed when training a single model on highly diverse tasks, such as robotics simulations and Atari games. This can be attributed to the robustness of the base foundation models (like LLaVA-OneVision), which have already assimilated a vast amount of general knowledge from web-scale data. This pre-existing knowledge appears to create a strong scaffold that facilitates learning across varied domains, even when using RL. Perhaps more striking was the dramatic performance boost achieved by applying RL fine-tuning on top of supervised learning (also known as imitation learning or learning from demonstrations). For instance, success rates in a simulated “Habitat Pick” task surged from 57% to 83% after RL fine-tuning. Imitation learning provides a good initial model by showing the robot how to perform a task correctly. However, it typically doesn’t expose the robot to failure scenarios or deviations from the demonstrated path. RL complements this by allowing the robot to explore, make mistakes, and learn how to recover from those failures. This combination creates a much more robust system, better equipped to handle the inevitable unexpected events and errors that occur in real-world operation.
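A toy illustration of the two-phase recipe on a three-action bandit (a REINFORCE-style update with a running baseline, standing in for whatever RL algorithm the paper actually uses; the environment is invented): imitation locks onto the demonstrated action, and RL fine-tuning discovers a more reliable alternative because exploration exposes it to failures.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3                           # toy action set standing in for robot skills
logits = np.zeros(n_actions)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Phase 1: imitation learning. Demonstrations only ever show action 0,
# so the policy never observes failures or how to recover from them.
for _ in range(100):
    probs = softmax(logits)
    logits += 0.1 * (np.eye(n_actions)[0] - probs)  # cross-entropy step toward demo

# Phase 2: RL fine-tuning. In this assumed environment the demonstrated
# action sometimes fails, while action 1 succeeds reliably.
def reward(a):
    return 1.0 if a == 1 else (1.0 if a == 0 and rng.random() < 0.6 else 0.0)

baseline = 0.0
for _ in range(5000):
    probs = softmax(logits)
    # Occasional random exploration lets the policy experience failures.
    a = rng.integers(n_actions) if rng.random() < 0.1 else rng.choice(n_actions, p=probs)
    r = reward(a)
    baseline += 0.01 * (r - baseline)   # running average reward as a baseline
    logits += 0.1 * (r - baseline) * (np.eye(n_actions)[a] - probs)

print(softmax(logits))  # probability mass shifts toward the more reliable action
```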
Long-tail problem in robotics. The real world is inherently chaotic and unpredictable, presenting robots with a “long-tail problem”. This refers to the vast number of rare, unusual, and often unforeseeable situations that a robot might encounter during its operational lifetime. While a robot can be trained on many common scenarios, as self-driving cars are, deployment in the real world inevitably exposes it to an almost infinite variety of edge cases. Take, for example, a self-driving car encountering a bicyclist with a stop sign printed on the back of their shirt — a bizarre but plausible event. The more a robot interacts with the world, the more likely it is to encounter these low-probability but high-impact events. It is crucial for robots to be able to react robustly and safely to such occurrences. Failure to do so could lead to task incompletion, damage to the robot, or, in the worst case, harm to humans. Addressing the long-tail problem necessitates extensive experience in diverse environments and the development of robust learning and recovery mechanisms. This underscores the importance of approaches like reinforcement learning, which allows robots to learn from a wider range of experiences, including failures, thereby building resilience against the unexpected.
Future of robot generalization. Looking ahead, achieving true robot generalization—the ability to adapt to new situations and learn new tasks efficiently—remains a paramount goal. Dr. Kira emphasizes the importance of “few-shot learning”, where a robot can acquire a new skill from just a few examples, without catastrophically forgetting what it already knows. This is critical for practical home robots, as users won’t want to provide millions of demonstrations for every new task. For the next decade, a key area of focus will be enabling robots to learn from natural human interaction. This goes beyond structured demonstrations to include learning from casual conversation and simple, informal showings of how to do something. A particularly exciting and less explored frontier is learning from linguistic descriptions alone. Imagine telling a robot, “Fetching the mail is like fetching my cup, but you need to go to the mailroom, which is located here,” and having the robot update its knowledge and capabilities accordingly. This involves understanding spatial, temporal, and semantic concepts conveyed through language and integrating them into its operational framework. Moving from learning through explicit, repeated demonstrations to learning from high-level, potentially ambiguous, linguistic interactions is a significant challenge but crucial for the future of adaptable and intelligent robots.
Integrating multimodal inputs. Humans effortlessly integrate a rich tapestry of sensory information—the words someone speaks, their tone of voice, facial expressions, and body language—to form a cohesive understanding of a situation. For robots to interact naturally and intelligently, they too must master the integration of multimodal inputs. Robots are often equipped with a variety of sensors, including cameras, depth sensors, microphones, and sometimes radar. Current AI models, particularly transformer architectures, are showing promise in processing these diverse data streams, though this field is less mature than pure language modeling. A significant challenge lies in the nature of available training data. Datasets often provide only a subset of modalities; for example, YouTube videos offer audio and video, while robot-specific data might include video and depth information. Learning a unified model from such a patchwork of multimodal datasets is complex. Furthermore, extracting subtle but crucial cues—like nuanced facial expressions or slight changes in vocal intonation—often requires highly specialized models. Currently, a single, massive multimodal model that can perform all these specialized sensory processing tasks as well as dedicated individual models remains elusive. The ultimate goal is to develop such a unified system capable of holistically perceiving and interpreting the complex, multimodal world around it.
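One simple way to picture training across a patchwork of modality subsets (a generic masked-pooling sketch, not a specific published architecture; the encoders here are random linear maps standing in for real ones): encode whichever modalities an example happens to have into a shared space, and pool only those.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

# One simple linear encoder per modality (stand-ins for real encoders).
ENCODERS = {
    "video": rng.standard_normal((512, DIM)),
    "audio": rng.standard_normal((128, DIM)),
    "depth": rng.standard_normal((256, DIM)),
}

def fuse(example):
    """Embed whichever modalities are present and pool them, so examples
    with different modality subsets still land in one shared space."""
    embeddings = [feat @ ENCODERS[name] for name, feat in example.items()]
    return np.mean(embeddings, axis=0)  # pooling over present modalities only

# A YouTube clip supplies video+audio; a robot log supplies video+depth.
youtube = {"video": rng.standard_normal(512), "audio": rng.standard_normal(128)}
robot_log = {"video": rng.standard_normal(512), "depth": rng.standard_normal(256)}
print(fuse(youtube).shape, fuse(robot_log).shape)  # same fused space either way
```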
Collective intelligence in robotics. Human success as a species is largely attributed to our ability to cooperate and share knowledge. Envision a future where robots can similarly benefit from a form of “collective intelligence.” This extends beyond human-robot communication to robot-to-robot interaction and learning. A unique advantage robots have is that their “brains” or models can be interconnected via the internet, allowing for direct knowledge sharing on a scale impossible for humans. This concept presents a dual opportunity. Firstly, data and experiences gathered by many individual robots operating in diverse environments (e.g., different homes) could be aggregated to train a more generalized and capable “global model.” This is akin to federated learning, where learnings from individual devices contribute to a central model, leveraging scale to enhance generalization. This improved global model could then be distributed to new robots. However, this raises a critical tension: the need to balance the benefits of a powerful, generalized global model with the equally important requirement for personalization. Each robot must still retain and adapt to the unique characteristics and preferences of its specific home and users. Exploring this interplay, potentially through approaches like federated continual learning, is crucial for developing robotic systems that are both broadly intelligent and individually attuned.
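A minimal federated-averaging sketch of this global/local split (toy linear models; the data and names are invented): each home trains a local copy on its private data, and only the resulting weights are averaged into the global model, so raw experiences never leave the home.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
global_weights = np.zeros(DIM)

def local_update(weights, home_data, lr=0.1, steps=20):
    """Runs entirely on the robot; raw home data never leaves the device."""
    w = weights.copy()
    for _ in range(steps):
        x, y = home_data[rng.integers(len(home_data))]
        w -= lr * (x @ w - y) * x       # SGD on a toy linear regression
    return w

# Each home's private data: the same underlying task plus local quirks.
true_w = rng.standard_normal(DIM)
homes = []
for _ in range(5):
    xs = rng.standard_normal((50, DIM))
    ys = xs @ (true_w + 0.1 * rng.standard_normal(DIM))  # per-home variation
    homes.append(list(zip(xs, ys)))

for round_ in range(30):                # federated averaging rounds
    local = [local_update(global_weights, data) for data in homes]
    global_weights = np.mean(local, axis=0)  # only weights are aggregated

print(np.linalg.norm(global_weights - true_w))  # approaches the shared task
```

The open question the episode raises maps directly onto this sketch: how much of each home’s locally adapted model should flow back into the average, and how much should stay behind as personalization.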
What am I most proud of? When reflecting on his career, Dr. Kira highlights a few aspects he is particularly proud of. Firstly, he values his tendency to think beyond current trends and consider future limitations. Even as a PhD student, he advocated for concepts like feature learning when feature engineering was the norm, and he was an early proponent of integrating learning with robotics before it became mainstream. This holistic perspective, focusing on broad scientific questions rather than being confined to specific tools or methods, has been a guiding principle. However, his greatest pride lies in his students, both current and graduated. He fosters an environment where students are encouraged to pursue their passions, which has led to a diverse range of research projects within his lab. While these projects may seem disparate, they collectively contribute to the overarching goals of improving intelligence, robustness, and generalization in robotics and multimodal models. Dr. Kira is especially proud of his students’ ability to carve out new sub-areas of research, some of which gain traction and influence in the wider academic community. Witnessing their accomplishments, both during their time in his lab and after they move on to their own careers, brings him immense satisfaction.
“What would I tell my kids?” If his children were to express an interest in pursuing robotics and machine learning, Dr. Kira’s advice would center on intrinsic motivation and embracing the journey. He emphasizes the importance of following one’s passion and finding enjoyment in the process of learning and discovery, rather than focusing solely on external goals, deliverables, or the current “hotness” of a particular field. He believes that life is about maximizing engagement with activities one is passionate about and minimizing those one is not. Finding that passion can be a difficult and evolving process, even for seasoned researchers. His key counsel would be to remain open to opportunities and avoid self-filtering. Young individuals, he notes, sometimes prematurely dismiss potential paths or fellowships, believing they are out of reach. Instead, he would encourage them to actively pursue any opportunity that could enable them to follow their passions. By continually seeking out and seizing such chances, they increase the likelihood of building a fulfilling career aligned with their deepest interests.
Zsolt Kira is an Associate Professor at the School of Interactive Computing, and serves as an Associate Director of ML@GT, the machine learning center created at Georgia Tech. He leads the Robotics Perception and Learning (RIPL) lab, focusing on the intersection of learning methods for sensor processing and robotics.