Path: blob/master/finrl/meta/env_portfolio_optimization/README.md
PortfolioOptimizationEnv (POE)
This environment simulates the effects of the market on a portfolio that is periodically rebalanced by a reinforcement learning agent. At every timestep $t$, the agent is responsible for determining a portfolio vector $W_{t}$ which contains the percentage of money invested in each stock. The environment then uses data provided by the user to simulate the new portfolio value at timestep $t+1$.
For more details on the formulation of this problem, check the following paper:
POE: A General Portfolio Optimization Environment for FinRL
Inputs
This environment simulates the interactions between an agent and the financial market based on data provided by a dataframe. The dataframe contains the time series of features defined by the user (such as closing, high, and low prices) and must have a time and a tic column listing datetimes and ticker symbols, respectively. An example of such a dataframe is shown below:
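The snippet below builds a small illustrative dataframe in the expected long format, with one row per (time, tic) pair. The tickers and values are made up, and naming the time column `date` is an assumption for illustration; the environment only requires that a time column and a tic column exist.

```python
import pandas as pd

# One row per (date, tic) pair; the feature columns (close, high, low)
# are user-defined. Tickers and prices are purely illustrative.
df = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"],
    "tic": ["AAA", "BBB", "AAA", "BBB"],
    "close": [10.0, 20.0, 10.2, 19.8],
    "high": [10.1, 20.3, 10.4, 20.0],
    "low": [9.9, 19.8, 10.0, 19.5],
})
```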
Actions
At each time step, the environment expects an action that is a one-dimensional Box of shape $(n + 1,)$, where $n$ is the number of stocks in the portfolio. This action is called the portfolio vector and contains, for the remaining cash and for each stock, the percentage of allocated money.
For example: given a portfolio of three stocks, a valid portfolio vector would be $[0.25, 0.40, 0.20, 0.15]$. In this example, 25% of the money is not invested (remaining cash), 40% is invested in stock 1, 20% in stock 2 and 15% in stock 3.
Note: it's important that the sum of the values in the portfolio vector is equal (or very close) to 1. If it's not, POE will apply a softmax normalization, as sketched below.
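A minimal sketch of this behavior, using the standard softmax formulation (POE's internal implementation may differ in details):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

valid = np.array([0.25, 0.40, 0.20, 0.15])  # sums to 1: used as-is
invalid = np.array([1.0, 2.0, 0.5, 0.5])    # sums to 4: normalized first

print(valid.sum())       # 1.0
print(softmax(invalid))  # a proper portfolio vector that sums to 1
```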
Observations
POE can return two types of observations during simulation: a Dict or a Box.
The Box is a three-dimensional array of shape $(f, n, t)$, where $f$ is the number of features, $n$ is the number of stocks in the portfolio and $t$ is the size of the time window. This observation basically contains only the current state of the agent.
The dict representation, on the other hand, is a dictionary containing the state and the last portfolio vector, like below:
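A minimal sketch of that layout, assuming the dictionary keys are named `state` and `last_action` (the exact key names may vary between versions):

```python
import numpy as np

f, n, t = 3, 10, 50  # features, stocks, time window (illustrative sizes)

# Key names are assumptions; check your version for the exact schema.
obs = {
    "state": np.zeros((f, n, t)),    # same array as the Box observation
    "last_action": np.zeros(n + 1),  # portfolio vector from the previous step
}
```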
Rewards
Given the simulation of timestep $t$, the reward is given by the following formula: $r_{t} = \ln \left( \frac{V_{t}}{V_{t-1}} \right)$, where $V_{t}$ is the value of the portfolio at time $t$. By using this formulation, the reward is negative whenever the portfolio value decreases due to a rebalancing and is positive otherwise.
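A quick worked example of this reward:

```python
import numpy as np

v_prev = 100_000.0  # portfolio value before timestep t
v_curr = 102_000.0  # portfolio value after the simulation of timestep t

print(np.log(v_curr / v_prev))  # ~0.0198: value grew, reward is positive

v_curr = 98_000.0
print(np.log(v_curr / v_prev))  # ~-0.0202: value shrank, reward is negative
```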
Example
A Jupyter notebook using this environment can be found here.
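For a quick start, here is a minimal, self-contained usage sketch. The import path, the constructor arguments (`initial_amount`, `time_window`) and the classic Gym-style `reset`/`step` returns are assumptions that may differ across FinRL versions:

```python
import numpy as np
import pandas as pd

# Import path and constructor arguments are assumptions; check your
# FinRL version for the exact signature.
from finrl.meta.env_portfolio_optimization.env_portfolio_optimization import (
    PortfolioOptimizationEnv,
)

# Tiny synthetic long-format dataframe (see the Inputs section).
close = np.random.uniform(10, 20, 120)
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=60).repeat(2),
    "tic": ["AAA", "BBB"] * 60,
    "close": close,
    "high": close * 1.01,
    "low": close * 0.99,
})

env = PortfolioOptimizationEnv(df, initial_amount=100_000, time_window=50)

obs = env.reset()
done = False
while not done:
    # A random valid portfolio vector: Dirichlet samples sum to 1.
    action = np.random.dirichlet(np.ones(3))  # cash + two stocks
    obs, reward, done, info = env.step(action)
```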