Active learning, intrinsic motivation and the self-organization of developmental structures
Together with my colleagues, we study computational mechanisms of active exploration for life-long learning, allowing an embodied learner to develop repertoires of novel sensorimotor and cognitive skills by autonomously structuring its own learning experiences from simple to progressively more complex.
This research is framed within, and helps bridge, three fields: statistical machine learning, developmental robotics, and the study of human development.
We elaborated computational mechanisms of curiosity-driven self-exploration, modeling central aspects of intrinsic motivation in humans, drawing on theories such as Flow theory, and further refined within the statistical active learning framework for the efficient learning of multiple sensorimotor tasks in high-dimensional, real-world robots.
Statistical active learning for robot skill learning
In statistical machine learning, we have elaborated active learning mechanisms that allow robots to efficiently learn novel sensorimotor skills in high-dimensional, non-linear, non-stationary and redundant spaces, within severe time constraints. They are based on three key principles that can be combined: empirical evaluation of learning progress, goal babbling, and strategic learning.
Exploration driven by empirical estimation of learning progress: A first key principle we are studying is active learning of sensorimotor models driven by the maximization of empirically evaluated learning progress. This drives the learner to explore zones of its sensorimotor space where its predictions, or its competences, improve maximally fast in practice (as opposed to in theory). As a side effect, the learner first explores easy activities, then, once they are learnt, automatically shifts to progressively more complex ones. In large and real-world spaces, where it is impossible to assume strong analytical properties of the relation between the learner and the environment, this was shown to be significantly more efficient than active learning approaches that maximize novelty, surprise or entropy (or their reduction when estimated in a model-based manner). Yet, efficiently estimating empirical learning progress is highly challenging, since it is a spatially and temporally non-stationary quantity. Over the years we have developed a series of algorithms addressing this challenge: IAC (Oudeyer et al., 2007), R-IAC (Baranes and Oudeyer, 2009), SAGG-RIAC (Baranes and Oudeyer, 2013), McSAGG-RIAC (Baranes and Oudeyer, 2011), SGIM-ACTS (Nguyen and Oudeyer, 2013), zeta-R-Max and zeta-EB (Lopes et al., 2012), and SSB (Lopes and Oudeyer, 2012). For example, (Baranes and Oudeyer, 2009) presents the R-IAC architecture and shows how it allows efficient learning of hand-eye coordination in robots. In (Lopes et al., 2012), an RL formulation of this approach is compared to PAC-MDP approaches (e.g. R-Max) and Bayesian RL approaches (e.g. exploration bonuses), providing natural extensions that make them more robust in complex non-stationary spaces (the zeta-R-Max and zeta-EB algorithms).
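The core of this principle can be sketched in a few lines of Python. This is an illustrative simplification, not the published IAC/R-IAC implementation (all names here are hypothetical): learning progress in each region is estimated as the decrease of the mean prediction error between the older and newer halves of a sliding window, and the region to explore is chosen mostly greedily on that estimate.

```python
import random

class Region:
    """A sub-region of the sensorimotor space with its own error history."""
    def __init__(self):
        self.errors = []  # prediction errors observed in this region, in order

    def learning_progress(self, window=10):
        """Empirical learning progress: decrease of mean prediction error
        between the older and the newer half of a sliding window."""
        if len(self.errors) < 2 * window:
            return 0.0
        older = self.errors[-2 * window:-window]
        newer = self.errors[-window:]
        return (sum(older) / window) - (sum(newer) / window)

def choose_region(regions, epsilon=0.2):
    """Mostly exploit the region with maximal learning progress,
    keeping some random exploration to track non-stationarity."""
    if random.random() < epsilon:
        return random.choice(regions)
    return max(regions, key=lambda r: r.learning_progress())
```

Note how a region where errors are already low and flat, and a region where errors stay high, both yield near-zero progress: exploration concentrates where performance is actually improving, which is what produces the easy-to-complex ordering described above.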
Autonomous and active goal babbling: A second key principle is goal babbling, also called goal exploration. A crucial need in robot learning is the acquisition of inverse models, where a robot has to efficiently learn a mapping between goal or task parameters and the parameters of motor controllers that reach these goals. In goal exploration (Oudeyer and Kaplan, 2007), the learner selects its own goals and self-explores and learns only the sub-parts of the sensorimotor space that are sufficient to reach these goals: this leverages the redundancy of these spaces by building dense tubes of learning data only where they are necessary for control. Goal selection can be made active by sampling goals for which the empirical estimation of competence progress is maximal. This allows the robot learner to avoid spending too much time on unreachable or trivial goals, and to progressively explore self-generated goals/tasks of increasing complexity. The SAGG-RIAC architecture (Baranes and Oudeyer, 2013; Baranes and Oudeyer, 2010) instantiates this approach and was shown to yield order-of-magnitude speed-ups for learning skills such as the omnidirectional locomotion of quadruped robots and the control of a fishing rod with a flexible wire.
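Active goal selection can be sketched analogously, now at the level of the goal space. This is a minimal illustrative sketch, not the SAGG-RIAC implementation (the class and function names are hypothetical): each region of the goal space tracks the competence reached for its past goals, and new goals are sampled in the region whose competence improves fastest.

```python
import random

def competence(goal, reached):
    """Competence for a 1-D goal: negative distance between goal and outcome."""
    return -abs(goal - reached)

class GoalRegion:
    """A region of the goal space tracking competence over time."""
    def __init__(self, low, high):
        self.low, self.high = low, high
        self.competences = []  # competence values for past goals in this region

    def sample_goal(self):
        return random.uniform(self.low, self.high)

    def competence_progress(self, window=5):
        """Increase of mean competence between the older and newer half
        of a sliding window."""
        if len(self.competences) < 2 * window:
            return 0.0
        older = self.competences[-2 * window:-window]
        newer = self.competences[-window:]
        return (sum(newer) - sum(older)) / window

def select_goal(regions):
    """Sample a goal in the region whose competence improves fastest:
    unreachable goals (flat low competence) and mastered goals (flat high
    competence) both show no progress and are avoided."""
    best = max(regions, key=lambda r: r.competence_progress())
    return best.sample_goal()
```

The key difference from prediction-error-based exploration is that progress is measured in task space (how well goals are reached), not in sensorimotor space, which is what lets redundancy be exploited.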
Strategic learning and strategic teaching for life-long learning. Strategic learning refers to mechanisms that allow a learner (or a teacher) to decide concurrently (or hierarchically) how to use its various kinds of learning resources: what to learn, how to learn it, when to learn it, and possibly from whom to learn it.
Indeed, for life-long learning of multiple tasks in real-world robots, time, physical and cognitive resources are limited: learning requires that multiple kinds of choices be made by the learner or by its teacher. For example, one has to choose how to allocate time to the practice and learning of each task, which data collection method to use (e.g. self-exploration versus imitation learning), and which statistical inference method to use (e.g. different kinds of representations and inference biases). These choices generate an ordered and structured learning trajectory, and this structure can have a major impact on both what is learned and how efficiently it is learned.
We have introduced a formal framework to study Strategic Learning, using the Strategic Student Problem model (Lopes and Oudeyer, 2012). This has been formally linked to techniques in the Bandit literature, and led to the Strategic Bandit algorithm, which actively chooses learning resources based on empirical evaluation of learning progress (Lopes and Oudeyer, 2012).
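The bandit view can be sketched as follows. This is an illustrative simplification under stated assumptions, not the published Strategic Bandit algorithm (the class name and parameters are hypothetical): each arm is a learning resource (a task, or a data-collection strategy), the per-pull reward is the empirically measured learning progress, and an exponential moving average lets old progress estimates fade, which handles non-stationarity.

```python
import math
import random

class StrategicBandit:
    """A non-stationary bandit over learning strategies, rewarded by
    empirically measured learning progress."""
    def __init__(self, n_strategies, alpha=0.3):
        self.progress = [0.0] * n_strategies  # moving-average progress per strategy
        self.alpha = alpha                    # forgetting rate for old estimates

    def choose(self, temperature=0.1):
        """Softmax choice over current progress estimates; the temperature
        keeps some exploration of currently unpromising strategies."""
        weights = [math.exp(p / temperature) for p in self.progress]
        r = random.random() * sum(weights)
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                return i
        return len(weights) - 1

    def update(self, strategy, measured_progress):
        """Exponential moving average of the measured learning progress."""
        self.progress[strategy] += self.alpha * (measured_progress - self.progress[strategy])
```

In use, the learner would call `choose()` to pick which task to practice or whether to self-explore versus imitate, measure how much its performance improved, and feed that back through `update()`.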
We have also investigated how Strategic Learning can address a very important question in multi-task life-long robot learning (Nguyen and Oudeyer, 2013): how a robot learner can concurrently decide which task to learn at a given moment, when to self-explore and when to imitate in order to improve on this task, and, in the latter case, how to imitate (emulation vs. mimicry) and whom to imitate (among several available teachers).
Intrinsic motivation and the self-organization of developmental trajectories in robots and humans
We have been modeling the mechanisms of spontaneous exploration in human infants known as intrinsic motivation, which generate curiosity-driven learning.
This allowed us to generate novel hypotheses on their form in humans, as well as on their impact on the self-organization of open-ended sensorimotor development, on concept formation (the self/objects/others distinction), and on social learning and language acquisition.
IAC and the Playground Experiment. In particular, the IAC architecture and its implications were studied in a series of experiments, called the Playground Experiments (Oudeyer and Kaplan, 2006; Oudeyer et al., 2007). Figure 1 illustrates the cognitive architecture employed by IAC. Prediction learning plays a central role in the IAC architecture; in particular, two modules in the model predict future states. First, the “Classic Machine learner” M learns a forward model: it receives as input the current sensory state, context, and action, and generates a prediction of the sensory consequences of the planned action. An error feedback signal, computed from the difference between predicted and observed consequences, is used to update the forward model. Second, the “Meta Machine learner” metaM receives the same input as M, but instead of predicting sensory consequences, it learns a meta-model that predicts how much the errors of the lower-level forward model will decrease in local regions of the sensorimotor space, i.e. it models learning progress locally. To deal with the difficulties of generalization in high-dimensional continuous spaces, an associated categorization mechanism progressively splits the sensorimotor space into sub-regions, for example by maximizing their differences in predictability (Baranes and Oudeyer, 2009), and focuses the refinement of this categorization on regions where learning progress is maximal. Then, in each observed context/state, an action selection system stochastically chooses which actions to try so as to maximize expected learning progress. Such a system allows the robot to automatically avoid actions whose outcomes are either trivial or too difficult to predict/learn at a given moment of development, first focusing on simple actions and progressively shifting to more complex ones.
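The interplay between M and metaM can be sketched minimally. This is an illustrative toy, not the actual IAC implementation (the tabular forward model and all names are hypothetical placeholders for the real learners and the region-splitting mechanism): M predicts consequences and reports its prediction error, while metaM tracks how that error decreases per context, i.e. local learning progress.

```python
class ForwardModel:
    """'Classic Machine learner' M: predicts the sensory consequence of
    (state, action); here a trivial tabular average stands in for any learner."""
    def __init__(self):
        self.memory = {}

    def predict(self, state, action):
        vals = self.memory.get((state, action))
        return sum(vals) / len(vals) if vals else 0.0

    def update(self, state, action, observed):
        """Return the prediction error, then learn from the observation."""
        error = abs(observed - self.predict(state, action))
        self.memory.setdefault((state, action), []).append(observed)
        return error

class MetaModel:
    """'Meta Machine learner' metaM: tracks how M's error evolves per
    (state, action) context, estimating local learning progress."""
    def __init__(self):
        self.error_history = {}

    def record(self, state, action, error):
        self.error_history.setdefault((state, action), []).append(error)

    def expected_progress(self, state, action, window=3):
        """Decrease of mean error between the older and newer half of a window."""
        errs = self.error_history.get((state, action), [])
        if len(errs) < 2 * window:
            return 0.0
        return sum(errs[-2 * window:-window]) / window - sum(errs[-window:]) / window
```

Action selection then amounts to querying `expected_progress` for candidate (state, action) pairs and preferring those with the highest values.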
Figure 1: The IAC algorithmic architecture for curiosity-driven learning and intrinsically motivated exploration
In order to evaluate the IAC architecture in a physical implementation, the Playground Experiments were developed (Oudeyer and Kaplan, 2006; Oudeyer et al., 2007). During the experiment, a quadruped robot is placed on an infant play mat and presented with a set of nearby objects, as well as an “adult” robot caretaker (see Figure 2). The robot is equipped with four kinds of motor primitives, each parameterized by several continuous numbers and combinable with the others, thus forming an infinite set of possible actions: (a) turning the head in various directions; (b) opening and closing the mouth while crouching with various strengths and timings; (c) rocking the leg with various angles and speeds; (d) vocalizing with various pitches and lengths. Similarly, several kinds of sensory primitives allow the robot to detect visual movement, salient visual properties, proprioceptive touch in the mouth, and the pitch and length of perceived sounds. For the robot, these motor and sensory primitives are initially black boxes: it has no knowledge of their semantics, effects or relations. The IAC architecture is then used to drive the robot’s exploration and learning purely by curiosity, i.e. by the search for learning progress. The nearby objects include an elephant (which can be bitten or “grasped” by the mouth), a hanging toy (which can be “bashed” or pushed with the leg) and an adult robot “caretaker” pre-programmed to imitate the learning robot when the latter looks at it while vocalizing.
Figure 2: The Playground Experiment: a quadruped robot explores and learns physical and social affordances through curiosity-driven learning.
Open-ended and embodied acquisition of skills. A key finding from the Playground Experiments is the self-organization of structured developmental trajectories, where the robot explores objects and actions in a progressively more complex, stage-like manner, while autonomously acquiring diverse affordances and skills that can be reused later on. Across a series of runs of such experiments, the following developmental sequence is typically observed:
In a first phase, the robot exhibits unorganized body babbling;
In a second phase, after learning a first rough model and meta-model, the robot stops combining motor primitives and explores them one by one, each in a random manner;
In a third phase, the robot begins to direct actions towards zones of its environment where the external observer knows there are objects (the robot is not provided with a representation of the concept of “object”), but in a non-affordant manner (e.g. it vocalizes at the non-responding elephant or bashes the adult robot, which is too far away to be touched);
In a fourth phase, the robot explores affordant experiments: it first focuses on grasping movements with the elephant, then shifts to bashing movements with the hanging toy, and finally shifts to exploring vocalizations towards the imitating adult robot.
In the end, the robot has learnt sensorimotor affordances with several objects, as well as social affordances with a peer, and masters multiple skills, yet none of these specific objectives were pre-programmed at the outset. They self-organize through the dynamic interaction between intrinsic motivation, statistical inference, the properties of the body, and the properties of the environment.
New hypotheses for infant development. Two aspects of this outcome can be noted. First, it shows how an IM system can drive a robot to autonomously learn a variety of affordances and skills for which no engineer provided specific reward functions beforehand. Second, the observed process spontaneously generates three properties of infant development that are so far mostly unexplained:
Staged development: Qualitatively different and more complex behaviours and capabilities appear over time, in a non-linear manner. Such unfolding is widely documented in developmental psychology, but little principled explanation currently exists. The Playground Experiment provides the intriguing hypothesis that IM-driven exploration, in dynamic interaction with the body and environment, could explain important aspects of how this unfolding arises spontaneously (for example, without an internal pre-programmed schedule that specifies to the organism what to do and when to do it);
The regularities/diversity duality in developmental structures: The typical developmental trajectory described above is only the most frequent emerging trajectory. No two trajectories are exactly the same (e.g. the order of action exploration in the last phase might change). And in some experiments, with the same robot, the same mechanism and the same environment, widely different trajectories can occur. The whole IM/body/environment system can be seen as a dynamical system with various attractors, and stochasticity can sometimes drive it into local minima far from the main attractor(s) (Thelen and Smith, 1993). Thus, this also suggests a novel principled IM-based mechanism to explain the regularities/diversity duality widely observed in infant development;
Discovery of communication: Through the same general mechanism, the robot both explores and learns how to manipulate objects and how to vocalize to trigger specific responses from a conspecific. While vocal babbling (Oller, 2000), and more generally language play and games, have been shown to be key in infant language development, an associated ad hoc motivation is typically assumed both in developmental psychology and in computational models. The Playground Experiment suggests that the exploration and learning of communicative behavior might be at least partially explained by general intrinsically motivated exploration of the body's affordances (Oudeyer and Kaplan, 2006). A more detailed study showed that curiosity-driven exploration of vocalizations can reproduce aspects of the developmental change in vocal babbling observed in human infants (Moulin-Frier and Oudeyer, 2012). Further analysis of the links between IM, sensorimotor, social and language development can be found in (Kaplan et al., 2008).
The origins of the self/object/other distinction. The categorization system embedded in such an IM architecture also generates a progressive internal development of cognitive categories, complementing the behavioral and skill development described above. As explained in (Kaplan and Oudeyer, 2007b; Oudeyer et al., 2007), such a mechanism can allow the learning agent to progressively form the fundamental categorical distinctions between “self”, “physical objects” and “others”, which are central in infant development.
The Playground Experiment. We have built an experimental setup, called the Playground Experiment, which allowed us to show how the curiosity algorithm we developed leads to the self-organization of developmental trajectories with sequences of behavioural stages of increasing complexity (Oudeyer et al., 2007; Oudeyer and Kaplan, 2006).
Learning omnidirectional quadruped locomotion. In this experiment, we showed how the successive architectures we developed allow a quadruped robot, initially equipped with parameterized motor primitives in the form of a 24-dimensional oscillator (sinusoids with various parameters in most of the joints), to learn to use these motor primitives to locomote precisely in all directions and in varied manners. In (Baranes and Oudeyer, 2013), we extensively study a physical simulation of this experimental setup with active learning algorithms.
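The shape of such a parameterized oscillator primitive can be sketched as follows. This is a minimal illustrative stand-in for the 24-dimensional controller, not its actual parameterization (the function name and the per-joint parameter layout are assumptions): each joint follows a sinusoid shaped by an amplitude, frequency, phase, and offset, and the learner's job is to map these continuous parameters to locomotion outcomes.

```python
import math

def motor_primitive(t, params):
    """Parameterized oscillator primitive: one joint angle per
    (amplitude, frequency, phase, offset) tuple, evaluated at time t.
    With 6 joints and 4 parameters each, this would span a 24-D space."""
    return [a * math.sin(2.0 * math.pi * f * t + phi) + off
            for (a, f, phi, off) in params]
```

Exploring such a space exhaustively is infeasible, which is why the active learning architectures above, sampling controller parameters where competence progress is highest, yield the large speed-ups reported.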
Moulin-Frier, C. and Oudeyer, P-Y. (2012). Curiosity-Driven Phonetic Learning. In Proceedings of the IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), San Diego, USA. (Best Paper Award, category "Models of Cognitive Development").