
Reinforcement Learning

Have you ever been stunned watching a robotic vacuum cleaner intelligently avoid collisions with furniture while cleaning the floor, or wondered how computers manage to play chess or Super Mario at beyond-human levels? Then this blog post is most likely for you: here we give a general overview of the technology behind those solutions – reinforcement learning.


Reinforcement learning (RL) is a branch of machine learning, distinct from both supervised and unsupervised learning, that aims to teach an intelligent agent the complex behavior it needs to interact accurately with a dynamic environment. RL is built around a few main concepts: agent, environment, state, action, reward, and policy.


Let’s discuss these concepts first:

  • Agent is the object that RL aims to teach to interact with a given environment, such as a car, a robotic vacuum cleaner, or a Super Mario character.

  • Environment is the world the agent interacts with, such as roads and streets for a car, floors and walls for a robotic vacuum cleaner, or the virtual world in a game.

  • State is the set of parameters describing the agent in its environment at a given timestamp, such as its coordinates and velocity.

  • Action is one of the set of manipulations the agent can perform on itself or on the environment, such as driving right, left, forward, or backward.

  • Reward is a value calculated after every interaction between the agent and the environment; it measures how successful that interaction was relative to the goal the agent is supposed to fulfill.

  • Policy is a mapping from states to actions: it tells the agent which action to take in each state.
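
To make these concepts concrete, here is a minimal sketch in Python. The GridWorld environment, its states, actions, and rewards are invented here purely for illustration – they do not come from any particular library, and real agents and environments are far more complex:

```python
# A toy environment: the agent moves along a 1-D line toward a goal.
# Everything here (class name, actions, reward values) is hypothetical.

class GridWorld:
    def __init__(self, size=5):
        self.size = size      # positions 0 .. size-1; the goal is at size-1
        self.state = 0        # the agent's current position (its state)

    def reset(self):
        """Put the agent back at the start and return the initial state."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action ("left" or "right") and return (state, reward, done)."""
        if action == "right":
            self.state = min(self.state + 1, self.size - 1)
        else:
            self.state = max(self.state - 1, 0)
        done = self.state == self.size - 1   # episode ends at the goal
        reward = 1.0 if done else 0.0        # the reward signals success
        return self.state, reward, done
```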

Below is a very high-level schematic diagram of an RL process with a given agent and environment:



The diagram can be read as follows:

An agent starts in an initial state at timestamp 1 (t1) and takes a certain action within the given environment. As a result of that action, the agent's state is updated for timestamp 2 (t2), and along with the new state a reward value is calculated. With all components in place, the purpose of RL is to learn a policy that maps the agent's states to actions in a way that maximizes the reward.
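
As a rough illustration of that loop, the sketch below runs tabular Q-learning – one of the simplest RL algorithms – on the hypothetical GridWorld from the earlier sketch. The hyperparameter values are arbitrary, chosen only so the example converges quickly:

```python
import random

# Tabular Q-learning over the hypothetical GridWorld defined above.
actions = ["left", "right"]
env = GridWorld()
q = {(s, a): 0.0 for s in range(env.size) for a in actions}  # value table

alpha, gamma, epsilon = 0.1, 0.9, 0.3  # learning rate, discount, exploration

for episode in range(200):
    state, done = env.reset(), False
    while not done:
        # Explore randomly with probability epsilon, otherwise exploit.
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q[(state, a)])
        next_state, reward, done = env.step(action)
        # Nudge the value estimate toward reward + discounted future value.
        best_next = max(q[(next_state, a)] for a in actions)
        q[(state, action)] += alpha * (reward + gamma * best_next
                                       - q[(state, action)])
        state = next_state

# The learned policy: in each state, pick the highest-valued action.
policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in range(env.size)}
print(policy)  # prefers "right" (toward the goal) in every non-terminal state
```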


The RL approach makes it possible to train intelligent agents for many applications without a dataset, because the agent learns the real world (or a specified environment) and its intrinsic properties by exploring it and optimizing its reward. RL is currently gaining a lot of momentum in industry, as the data collection and/or labeling required by many supervised machine learning methods can be too problematic, time-consuming, and costly.


Ildar Abdrashitov, Data Scientist, Missing Link Technologies (MLT) Ltd.

