Trading Strategies Using Deep Reinforcement Learning

The purpose of this post is to present some results obtained with a trading bot based on Reinforcement Learning that is capable of generating a trading strategy. It also shares a possible architecture for the agent, the features of the dataset that was used, and some details about the problems we faced.

First, we need to understand the problem, so let’s talk about Trading.


Trading consists of buying and selling assets in the financial markets in order to obtain a profit by buying at a low price and selling at a higher price. In the trading process, we also have the concept of Trading Strategy, which is nothing more than a fixed plan designed to achieve a profitable performance.

The term “trading” simply means “exchanging one item for another.” We usually understand this to be the exchanging of goods for money, or in other words, simply buying something.

When we talk about trading in the financial markets, it is the same principle. Think about someone who trades shares. What they are actually doing is buying shares (that is, a small part) of a company. If the value of those shares increases, then they make money by selling them again at a higher price. This is trading. You buy something for one price and sell it again for another. If the selling price is higher, you make a profit; if it is lower, you make a loss.


What is a trading strategy?

A trading strategy is a method of buying and selling in markets based on predefined rules used to make trading decisions. A trading strategy includes a well-considered investing and trading plan that specifies investing objectives, risk tolerance, time horizon, and tax implications. Ideas and best practices need to be researched, adopted, and then adhered to. Planning for trading includes developing methods for buying or selling stocks, bonds, ETFs, or other investments, and may extend to more complex trades such as options or futures. Placing trades means working with a broker or broker-dealer and identifying and managing trading costs, including spreads, commissions, and fees. Once executed, trading positions are monitored and managed, including adjusting or closing them as needed. Risk and return are measured, as well as the portfolio impact of trades. The longer-term tax results of trading are a major factor and may encompass capital gains or tax-loss harvesting strategies to offset gains with losses.

Now that we have the fundamentals of our problem, we need to understand the technique.

Deep Reinforcement Learning (DRL)

Reinforcement learning (RL) is about taking suitable actions to maximize reward in a particular situation. It is employed by various software and machines to find the best possible behavior or path to take in a specific situation. Reinforcement learning differs from supervised learning: in supervised learning, the training data comes with an answer key, so the model is trained on the correct answers, whereas in reinforcement learning there is no answer key, and the agent decides what to do to perform the given task. In the absence of a labeled training dataset, it is bound to learn from its own experience. RL algorithms are goal-oriented, that is, they seek to achieve a complex objective or to maximize the reward over a sequence of steps, such as obtaining the highest score in an Atari game.

The elements that make up this approach are states, actions, a reward function, and an environment with which the agent interacts.

RL elements

What is DRL and what is the difference to RL?

Deep Reinforcement Learning is essentially the combination of deep neural networks and reinforcement learning. In this case, we use a particular reinforcement learning algorithm called Q-Learning.

In Q-Learning, a lookup table (the Q-table) is typically used to store a value for each combination of state and action. This table tells us which action to take in a given state to obtain the highest reward. This quickly becomes a problem when the states are very complex and the table grows to an intractable size. In DRL, a neural network is used instead to generalize over states, representing them in a more compact form and consequently helping the model converge faster.
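To make the table-based idea concrete, here is a minimal sketch of the standard tabular Q-Learning update. The state names (`price_up`, `price_flat`) and the values of the learning rate `alpha` and discount factor `gamma` are assumptions chosen for illustration; they are not taken from our agent.

```python
from collections import defaultdict

ACTIONS = ("buy", "hold", "sell")

def make_q_table():
    """Q-table mapping each state to one value per action, defaulting to 0.0."""
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-Learning update and return the new Q(state, action)."""
    best_next = max(q[next_state].values())       # value of the best next action
    td_target = reward + gamma * best_next        # bootstrapped target
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]

def greedy_action(q, state):
    """Pick the action with the highest stored value for this state."""
    return max(q[state], key=q[state].get)

# Hypothetical states, for illustration only.
q = make_q_table()
q_update(q, state="price_up", action="buy", reward=1.0, next_state="price_flat")
print(greedy_action(q, "price_up"))  # buy
```

With one entry per state, the table grows with every distinct state the agent sees, which is the scaling problem that motivates replacing it with a neural network.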

There are some characteristics of the financial markets that can be handled with DRL, such as:

  • Markets require perfect handling of extensive continuous data
  • Agents’ actions may result in long-term consequences that other machine-learning mechanisms are unable to measure
  • Agents’ actions also have short-term effects on the current market conditions which make the environment highly unpredictable

Deep reinforcement learning looks promising for disrupting trading thanks to some of its main advantages:

  • It builds upon the existing algorithmic trading models
  • The self-learning process suits the ever-evolving market environment
  • It brings more power and efficiency in a high-density environment

Training Dataset

The data used for training the agent provides us with information on the market as well as news or articles that have to do with the assets.

Among the market data, we can find the opening price, closing price, the volume of the transaction, the name of the asset, etc.

Among the news data, we have the creation date of the news item, the headline, and the sentiment.

There are 4,072,956 samples and 16 features in the training market dataset ranging from 2007 to 2018.


We need to define the necessary elements for the agent: the state, the actions, and the reward function.

To define the state, we can combine the information that the dataset provides, namely the market information and the news information. We extract the features that best describe our problem (this is only one of the possible configurations), such as the opening price, closing price, etc. The full state is described below.

  • Market Information
    • Opening Price
    • Closing Price
    • Transaction Volume
    • Asset code
  • News Information
    • Header
    • Author
    • Audience to which it is addressed
    • Size
    • Sentiment of the news
    • The assets mentioned in the news

Please note that we are assuming that only news that mentions a specific asset will affect the behavior of that asset. This is not always true: news that mentions asset “x” may also affect asset “y.”
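The state described above could be assembled roughly as follows. This is only a sketch: the dictionary keys (`asset_code`, `open`, `sentiment`, `assets`, and so on), the sample values, and the choice to average the sentiment of the relevant news are assumptions for illustration, not the actual column names or preprocessing of the dataset.

```python
def build_state(market_row, news_items):
    """Combine one asset's market features with the news that mentions it."""
    asset = market_row["asset_code"]
    # Assumption from the text: only news that names this asset is relevant.
    relevant = [n for n in news_items if asset in n["assets"]]
    if relevant:
        mean_sentiment = sum(n["sentiment"] for n in relevant) / len(relevant)
    else:
        mean_sentiment = 0.0  # no news: treat sentiment as neutral
    return {
        "open": market_row["open"],
        "close": market_row["close"],
        "volume": market_row["volume"],
        "news_count": len(relevant),
        "mean_sentiment": mean_sentiment,
        "headlines": [n["headline"] for n in relevant],
    }

# Hypothetical records, for illustration only.
market_row = {"asset_code": "AAA", "open": 10.0, "close": 10.5, "volume": 1200}
news_items = [
    {"headline": "AAA beats forecasts", "sentiment": 0.8, "assets": {"AAA"}},
    {"headline": "BBB recalls product", "sentiment": -0.6, "assets": {"BBB"}},
]
state = build_state(market_row, news_items)
print(state["news_count"], state["mean_sentiment"])  # 1 0.8
```

The filter on `assets` is where the assumption lives: the news about “BBB” is dropped even though, in practice, it might still move “AAA.”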


The possible actions of our agent are easy to see: the agent has to decide between three options, “buy,” “hold,” and “sell.”


The reward is defined as the difference between the current value of the asset and the value of the asset in the previous step.

The pseudo-code is described below.

Reward function pseudo-code
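The reward definition above can be sketched in a few lines of Python. This is a hedged illustration of the stated definition only, not a reproduction of the original pseudo-code; the function name `step_reward`, the use of closing prices as the asset value, and the zero reward at the first step are assumptions.

```python
def step_reward(prices, t):
    """Reward at step t: current asset value minus the value one step earlier."""
    if t < 1:
        return 0.0  # assumption: no previous step to compare against
    return prices[t] - prices[t - 1]

# Hypothetical closing prices, for illustration only.
closing_prices = [100.0, 101.5, 101.0]
print([step_reward(closing_prices, t) for t in range(len(closing_prices))])
# [0.0, 1.5, -0.5]
```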

Description of the architecture of the solution

The architecture of the neural network is quite simple. The input is formed by combining the market data and the news data. The news data comes in text format, so it must first pass through an embedding before it can be fed to the model. The output of the model is a layer of three neurons, each corresponding to one of the available actions.
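The data flow of such a network can be illustrated with a small forward pass. This sketch uses plain NumPy rather than the TensorFlow model, and everything about it is an assumption made for illustration: the layer sizes, the single hidden layer, averaging the token embeddings, the random weights, and the token ids.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE, EMBED_DIM = 1000, 16      # hypothetical sizes
N_MARKET, HIDDEN, N_ACTIONS = 3, 32, 3

# Randomly initialised parameters; a real model would learn these.
embedding = rng.normal(0.0, 0.1, size=(VOCAB_SIZE, EMBED_DIM))
w_hidden = rng.normal(0.0, 0.1, size=(N_MARKET + EMBED_DIM, HIDDEN))
b_hidden = np.zeros(HIDDEN)
w_out = rng.normal(0.0, 0.1, size=(HIDDEN, N_ACTIONS))
b_out = np.zeros(N_ACTIONS)

def q_values(market_features, token_ids):
    """Forward pass: embed the news tokens, join with market data, score actions."""
    news_vector = embedding[token_ids].mean(axis=0)     # average token embeddings
    x = np.concatenate([market_features, news_vector])  # combined input
    hidden = np.maximum(0.0, x @ w_hidden + b_hidden)   # ReLU hidden layer
    return hidden @ w_out + b_out                       # one value per action

market = np.array([10.0, 10.5, 1200.0])  # open, close, volume (unscaled here)
tokens = np.array([12, 7, 431])          # hypothetical token ids for a headline
q = q_values(market, tokens)
print(q.shape, ("buy", "hold", "sell")[int(np.argmax(q))])
```

The three outputs play the role of the Q-table row for the current state: the agent picks the action whose output is largest.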


OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Pinball. It makes no assumptions about the structure of your agent and is compatible with any numerical computation library, such as TensorFlow.

The implementation was done in Python, using OpenAI Gym and TensorFlow as the core components.

OpenAI Gym is used to define the environment, which returns the reward and the next state to the agent after an action is executed.

TensorFlow is used to define the neural model and its training phase.
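The role of the environment can be sketched with a small class that mirrors the classic Gym convention of `reset()` returning an observation and `step(action)` returning `(observation, reward, done, info)`. It deliberately does not import Gym, so that it runs on its own. The class name `TradingEnv`, the single-asset setup, and the rule that the price change only counts while the asset is held are all assumptions for illustration.

```python
class TradingEnv:
    """Minimal single-asset environment following the classic Gym reset/step shape."""

    ACTIONS = ("buy", "hold", "sell")

    def __init__(self, prices):
        self.prices = list(prices)
        self.t = 0
        self.holding = False

    def reset(self):
        """Start a new episode and return the first observation."""
        self.t = 0
        self.holding = False
        return self._observation()

    def step(self, action):
        """Apply an action index and return (observation, reward, done, info)."""
        name = self.ACTIONS[action]
        if name == "buy":
            self.holding = True
        elif name == "sell":
            self.holding = False
        self.t += 1
        change = self.prices[self.t] - self.prices[self.t - 1]
        # Assumption: the price change only counts while the asset is held.
        reward = change if self.holding else 0.0
        done = self.t == len(self.prices) - 1
        return self._observation(), reward, done, {}

    def _observation(self):
        return (self.prices[self.t], self.holding)

env = TradingEnv([100.0, 102.0, 101.0])
obs = env.reset()
obs, reward, done, info = env.step(0)  # buy, then the price moves 100 -> 102
print(obs, reward, done)               # (102.0, True) 2.0 False
```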

Experiments and Results

For this experiment, we considered only data from 2010 onwards and selected only 10 of the available assets, because adding more assets increases the complexity of the model exponentially, making it computationally intractable.
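A selection like the one described could look as follows. This is a sketch under stated assumptions: the row layout, the sample values, and the rule for choosing which assets to keep (here, simply the first ones in alphabetical order) are hypothetical, since the post does not say how the 10 assets were picked.

```python
from datetime import date

def select_subset(rows, start=date(2010, 1, 1), max_assets=10):
    """Keep rows dated on or after `start` for the first `max_assets` asset codes."""
    recent = [r for r in rows if r["date"] >= start]
    # Assumption: pick assets alphabetically; any other selection rule would do.
    chosen = sorted({r["asset_code"] for r in recent})[:max_assets]
    keep = set(chosen)
    return [r for r in recent if r["asset_code"] in keep]

# Hypothetical rows, for illustration only.
rows = [
    {"date": date(2009, 6, 1), "asset_code": "AAA", "close": 9.0},
    {"date": date(2011, 3, 1), "asset_code": "AAA", "close": 11.0},
    {"date": date(2012, 3, 1), "asset_code": "BBB", "close": 20.0},
]
subset = select_subset(rows, max_assets=1)
print([(r["asset_code"], r["close"]) for r in subset])  # [('AAA', 11.0)]
```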

Behavior of the opening price for assets 1 and 2. The red line indicates the initial investment, and the green line indicates the average gain obtained over time.

Actions that the agent took over time for asset 1. Green indicates “sell,” red indicates “buy,” and gray indicates “hold.”

Conclusion and Future Work

Although the results obtained can be considered satisfactory, we are sure that they can be improved. We also need to verify that the model would perform well in real time in an actual trading environment.

To improve the model, we plan to:

  • Increase the number of assets that we handle in the model to more than 10.

  • Strengthen the model with more macroeconomic information such as marking fees, growth rates, market capitalization, profits, revenues, etc.

This UrIoTNews article is syndicated from DZone