# Portfolio Optimization Agents
This directory contains architectures and algorithms commonly used in portfolio optimization agents.
To instantiate the model, it's necessary to have an instance of `PortfolioOptimizationEnv`. In the example below, we use the `DRLAgent` class to instantiate a policy gradient (`"pg"`) model. With the `model_kwargs` dictionary, we can set the `PolicyGradient` class parameters and, with the `policy_kwargs` dictionary, we can change the parameters of the chosen architecture.
In the example below, the model is trained for 5 episodes (we define an episode as a complete pass through the period covered by the environment).

It's important that the architecture and the environment have the same `time_window` defined. By default, both of them use 50 timesteps as `time_window`. For more details about what a time window is, check this article.
## Policy Gradient Algorithm
The `PolicyGradient` class implements the policy gradient algorithm used in the Jiang et al. paper. This algorithm is inspired by DDPG (deep deterministic policy gradient), but there are a couple of differences:
- DDPG is an actor-critic algorithm, so it has an actor and a critic neural network. The algorithm below, however, doesn't have a critic neural network and uses the portfolio value as its value function: the policy is updated to maximize the portfolio value.
- DDPG usually adds noise to the action during training to create exploratory behavior. The PG algorithm, on the other hand, takes a full-exploit approach.
- DDPG randomly samples experiences from its replay buffer. The implemented policy gradient, however, samples a sequential batch of experiences in time, so that the variation of the portfolio value over the batch can be calculated and used as the value function (a simplified sketch of this sampling is shown below).
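To illustrate the last point, here is a toy, hypothetical replay buffer contrasting sequential sampling with DDPG-style random sampling; it is not the buffer class used by FinRL.

```python
import random


class SequentialReplayBuffer:
    """Toy buffer contrasting sequential and random sampling (hypothetical example)."""

    def __init__(self):
        self.experiences = []  # experiences appended in simulation order

    def add(self, experience):
        self.experiences.append(experience)

    def sample_sequential(self, batch_size):
        # Policy gradient style: take the latest batch_size consecutive
        # experiences, preserving time order so the portfolio value
        # variation across the batch can be computed.
        return self.experiences[-batch_size:]

    def sample_random(self, batch_size):
        # DDPG style: draw experiences uniformly at random,
        # which breaks the temporal ordering.
        return random.sample(self.experiences, batch_size)
```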
The algorithm was implemented as follows:
1. Initialize the policy network and the replay buffer;
2. For each episode, do the following:
    1. For each period of `batch_size` timesteps, do the following:
        1. For each timestep, define an action to be performed, simulate the timestep, and save the experience in the replay buffer.
        2. After `batch_size` timesteps are simulated, sample the replay buffer.
        3. Calculate the value function $V = \sum_{t} \ln \big( \mu_{t} (\mathbf{y}_{t} \cdot \mathbf{a}_{t}) \big)$, where $\mathbf{a}_{t}$ is the action performed at timestep $t$, $\mathbf{y}_{t}$ is the price variation vector at timestep $t$ and $\mu_{t}$ is the transaction remainder factor at timestep $t$. Check the Jiang et al. paper for more details.
        4. Perform gradient ascent on the policy network (a sketch of this update follows the list).
    2. If, at the end of the episode, there is a sequence of remaining experiences left in the replay buffer, perform the sampling, value function, and gradient ascent steps above with those remaining experiences.
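To make the gradient ascent step concrete, here is a minimal PyTorch sketch of how the value function above can be turned into a loss. It assumes the sampled batch provides `observations`, `price_variations` and `trf_mu` tensors; it mirrors the idea of the update rather than FinRL's exact implementation.

```python
import torch


def policy_gradient_step(policy, optimizer, batch):
    """One gradient ascent step on the batch value function (illustrative sketch)."""
    observations, price_variations, trf_mu = batch  # tensors from a sequential batch

    # a_t: portfolio weight vectors produced by the policy network.
    actions = policy(observations)  # shape: (batch_size, num_assets)

    # y_t . a_t: portfolio return at each timestep before transaction costs.
    portfolio_returns = (actions * price_variations).sum(dim=1)

    # V = sum_t ln(mu_t * (y_t . a_t)); maximizing V == minimizing -V.
    value = torch.log(trf_mu * portfolio_returns).sum()
    loss = -value

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # gradient ascent on V via descent on -V
    return value.item()
```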
## References

If you use one of these architectures or algorithms in your research, you can cite the following references.
### EIIE Architecture and Policy Gradient algorithm

A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem

### EI3 Architecture

A Multi-Scale Temporal Feature Aggregation Convolutional Neural Network for Portfolio Management