GitHub Repository: AI4Finance-Foundation/FinRL
Path: blob/master/examples/Stock_NeurIPS2018_2_Train.ipynb
Kernel: Python 3 (ipykernel)

Stock NeurIPS2018 Part 2. Train

This series is a reproduction of the process in the paper Practical Deep Reinforcement Learning Approach for Stock Trading.

This is the second part of the NeurIPS2018 series. It shows how to use FinRL to turn the data into a gym-style environment and train DRL agents on it.

Other demos can be found at the repo of FinRL-Tutorials.

Part 1. Install Packages

## install finrl library
!pip install git+https://github.com/AI4Finance-Foundation/FinRL.git
import pandas as pd

from stable_baselines3.common.logger import configure

from finrl.agents.stablebaselines3.models import DRLAgent
from finrl.config import INDICATORS, TRAINED_MODEL_DIR, RESULTS_DIR
from finrl.main import check_and_make_directories
from finrl.meta.env_stock_trading.env_stocktrading import StockTradingEnv

check_and_make_directories([TRAINED_MODEL_DIR])

Part 2. Build a Market Environment in OpenAI Gym Style

[Figure: the agent-environment interaction loop (rl_diagram_transparent_bg.png)]

The core elements in reinforcement learning are the agent and the environment. You can understand RL as the following process:

The agent acts in a world, which is the environment. It observes its current condition as a state and is allowed to take certain actions. After the agent executes an action, it arrives at a new state. At the same time, the environment gives feedback to the agent called the reward, a numerical signal that tells how good or bad the new state is. As the figure above shows, the agent and the environment keep repeating this interaction.

The goal of the agent is to collect as much cumulative reward as possible. Reinforcement learning is the method by which the agent learns to improve its behavior and achieve that goal.

To achieve this in Python, we follow the OpenAI Gym style and build the stock data into an environment.

The state, action, and reward are specified as follows:

  • State s: The state space represents the agent's perception of the market environment. Just as a human trader analyzes various kinds of information, our agent passively observes the price data and technical indicators computed from past data. It learns by interacting with the market environment (usually by replaying historical data).

  • Action a: The action space consists of the actions the agent is allowed to take at each state. For example, a ∈ {−1, 0, 1}, where −1, 0, 1 represent selling, holding, and buying. When an action operates on multiple shares, a ∈ {−k, ..., −1, 0, 1, ..., k}; e.g., "Buy 10 shares of AAPL" and "Sell 10 shares of AAPL" are represented as 10 and −10, respectively.

  • Reward function r(s, a, s′): The reward is an incentive for the agent to learn a better policy. For example, it can be the change of the portfolio value when taking action a at state s and arriving at the new state s′, i.e., r(s, a, s′) = v′ − v, where v′ and v represent the portfolio values at states s′ and s, respectively. A toy sketch of this loop follows this list.
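To make the state-action-reward loop concrete, here is a minimal, purely illustrative single-stock environment in the Gym style. It is not FinRL's StockTradingEnv (which is the class actually used below); the class name, the three-element state, and the trade-clipping rules are our own simplifying assumptions. The only point is how r(s, a, s′) = v′ − v is wired into step().

import numpy as np

class ToySingleStockEnv:
    """A hypothetical, minimal gym-style environment for a single stock.

    State  : [cash, shares held, current price]
    Action : integer in {-k, ..., -1, 0, 1, ..., k} = shares to sell/buy
    Reward : change in portfolio value, r(s, a, s') = v' - v
    """

    def __init__(self, prices, initial_cash=1_000_000, k=10):
        self.prices = np.asarray(prices, dtype=float)
        self.initial_cash = initial_cash
        self.k = k

    def reset(self):
        self.t = 0
        self.cash = self.initial_cash
        self.shares = 0
        return self._state()

    def _state(self):
        return np.array([self.cash, self.shares, self.prices[self.t]])

    def _portfolio_value(self):
        return self.cash + self.shares * self.prices[self.t]

    def step(self, action):
        price = self.prices[self.t]
        # Clip the action to the allowed range, to the shares we actually hold
        # (for sells), and to the cash we actually have (for buys).
        action = int(np.clip(action, -self.k, self.k))
        action = max(action, -self.shares)
        if action > 0:
            action = min(action, int(self.cash // price))

        v_old = self._portfolio_value()

        # Execute the trade at the current price.
        self.cash -= action * price
        self.shares += action

        # Advance one day; the new price defines the new state s'.
        self.t += 1
        v_new = self._portfolio_value()
        reward = v_new - v_old          # r(s, a, s') = v' - v
        done = self.t >= len(self.prices) - 1
        return self._state(), reward, done, {}

# Usage: a random-action rollout over made-up prices.
env = ToySingleStockEnv(prices=[100, 101, 99, 103, 102])
state, done = env.reset(), False
while not done:
    state, reward, done, _ = env.step(np.random.randint(-10, 11))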

Market environment: the 30 constituent stocks of the Dow Jones Industrial Average (DJIA) index, accessed at the starting date of the testing period.

Read data

We first read the .csv file of our training data into a dataframe.

train = pd.read_csv('train_data.csv')

# If you are not using the data generated in Part 1 of this tutorial, make sure
# it has the columns and index in a form that the environment can consume.
# In that case you can comment out and skip the following two lines.
train = train.set_index(train.columns[0])
train.index.names = ['']
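As an optional sanity check, you can inspect the dataframe before building the environment. The column names below reflect the Part 1 output (a date column, a tic ticker column, OHLCV prices, plus one column per entry in INDICATORS) and may differ if you brought your own data.

# Optional sanity check on the training dataframe.
print(train.head())
print(train.tic.nunique(), "tickers,", train.date.nunique(), "trading days")

missing = [col for col in ['date', 'tic', 'close'] + INDICATORS if col not in train.columns]
assert not missing, f"missing columns: {missing}"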

Construct the environment

Calculate and specify the parameters we need for constructing the environment.

stock_dimension = len(train.tic.unique())
state_space = 1 + 2 * stock_dimension + len(INDICATORS) * stock_dimension
print(f"Stock Dimension: {stock_dimension}, State Space: {state_space}")
Stock Dimension: 29, State Space: 291
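The state space of 291 follows directly from the formula above. To the best of our understanding of FinRL's StockTradingEnv, the state stacks the cash balance, one close price and one holding count per ticker, and the technical indicator values per ticker (the default INDICATORS list has 8 entries); a quick check of the arithmetic:

# Breakdown of the state vector: cash, plus per-ticker price, holdings, and indicators.
cash_dim      = 1
price_dim     = stock_dimension                     # one close price per ticker
holding_dim   = stock_dimension                     # shares held per ticker
indicator_dim = len(INDICATORS) * stock_dimension   # e.g. 8 * 29 = 232 with the defaults

assert cash_dim + price_dim + holding_dim + indicator_dim == state_space   # 1 + 29 + 29 + 232 = 291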
buy_cost_list = sell_cost_list = [0.001] * stock_dimension  # 0.1% transaction cost per trade
num_stock_shares = [0] * stock_dimension                    # start with no shares held

env_kwargs = {
    "hmax": 100,                          # maximum number of shares per trade
    "initial_amount": 1000000,            # starting cash
    "num_stock_shares": num_stock_shares,
    "buy_cost_pct": buy_cost_list,
    "sell_cost_pct": sell_cost_list,
    "state_space": state_space,
    "stock_dim": stock_dimension,
    "tech_indicator_list": INDICATORS,
    "action_space": stock_dimension,      # one action per stock
    "reward_scaling": 1e-4                # scale rewards to stabilize training
}

e_train_gym = StockTradingEnv(df=train, **env_kwargs)

Environment for training

env_train, _ = e_train_gym.get_sb_env()
print(type(env_train))
<class 'stable_baselines3.common.vec_env.dummy_vec_env.DummyVecEnv'>
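As the output shows, get_sb_env wraps the environment in Stable-Baselines3's DummyVecEnv, which is the vectorized interface the SB3 algorithms expect. An optional quick check (reset returns a batch of observations, one row per environment):

# Optional: inspect the vectorized interface before training.
obs = env_train.reset()
print(obs.shape)   # expected (1, 291): one environment, one 291-dimensional state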

Part 3: Train DRL Agents

  • Here, the DRL algorithms are from Stable Baselines 3, a library that implements popular DRL algorithms in PyTorch and succeeds its predecessor, Stable Baselines.

  • Users are also encouraged to try ElegantRL and Ray RLlib.

agent = DRLAgent(env=env_train)

# Set the corresponding values to 'True' for the algorithms that you want to use
if_using_a2c = True
if_using_ddpg = True
if_using_ppo = True
if_using_td3 = True
if_using_sac = True

Agent Training: 5 algorithms (A2C, DDPG, PPO, TD3, SAC)

Agent 1: A2C

agent = DRLAgent(env=env_train)
model_a2c = agent.get_model("a2c")

if if_using_a2c:
    # set up logger
    tmp_path = RESULTS_DIR + '/a2c'
    new_logger_a2c = configure(tmp_path, ["stdout", "csv", "tensorboard"])
    # Set new logger
    model_a2c.set_logger(new_logger_a2c)
{'n_steps': 5, 'ent_coef': 0.01, 'learning_rate': 0.0007}
Using cpu device
Logging to results/a2c
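The dictionary printed above shows the default A2C hyperparameters FinRL uses (n_steps=5, ent_coef=0.01, learning_rate=0.0007). If you want to experiment with different values, get_model also accepts a model_kwargs dictionary for A2C, in the same way the PPO, TD3, and SAC cells below use it. The numbers here are an illustrative assumption, not tuned recommendations; if you do override them, re-run the logger setup above before training.

# Illustrative only -- these A2C values are not tuned recommendations.
A2C_PARAMS = {"n_steps": 10, "ent_coef": 0.005, "learning_rate": 0.0005}
model_a2c = agent.get_model("a2c", model_kwargs=A2C_PARAMS)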
trained_a2c = agent.train_model(model=model_a2c, tb_log_name='a2c', total_timesteps=50000) if if_using_a2c else None
[A2C training log, truncated: Stable-Baselines3 prints a progress table every 100 iterations with time/ metrics (fps, iterations, time_elapsed, total_timesteps) and train/ metrics (entropy_loss, explained_variance, learning_rate, n_updates, policy_loss, reward, std, value_loss).]

day: 2892, episode: 10
begin_total_asset: 1000000.00
end_total_asset: 6470641.90
total_reward: 5470641.90
total_cost: 41812.90
total_trades: 48852
Sharpe: 0.814
=================================
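The episode summary printed during training reports the final account value and an annualized Sharpe ratio of the account's daily returns. As a rough reconstruction (our own sketch, not lifted from the environment source code), that Sharpe value is computed approximately like this:

import numpy as np
import pandas as pd

def annualized_sharpe(account_values, trading_days_per_year=252):
    """Annualized Sharpe ratio of daily returns (risk-free rate taken as 0)."""
    daily_returns = pd.Series(account_values).pct_change().dropna()
    return np.sqrt(trading_days_per_year) * daily_returns.mean() / daily_returns.std()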
trained_a2c.save(TRAINED_MODEL_DIR + "/agent_a2c") if if_using_a2c else None

Agent 2: DDPG

agent = DRLAgent(env=env_train)
model_ddpg = agent.get_model("ddpg")

if if_using_ddpg:
    # set up logger
    tmp_path = RESULTS_DIR + '/ddpg'
    new_logger_ddpg = configure(tmp_path, ["stdout", "csv", "tensorboard"])
    # Set new logger
    model_ddpg.set_logger(new_logger_ddpg)
trained_ddpg = agent.train_model(model=model_ddpg, tb_log_name='ddpg', total_timesteps=50000) if if_using_ddpg else None
trained_ddpg.save(TRAINED_MODEL_DIR + "/agent_ddpg") if if_using_ddpg else None

Agent 3: PPO

agent = DRLAgent(env=env_train)

PPO_PARAMS = {
    "n_steps": 2048,
    "ent_coef": 0.01,
    "learning_rate": 0.00025,
    "batch_size": 128,
}
model_ppo = agent.get_model("ppo", model_kwargs=PPO_PARAMS)

if if_using_ppo:
    # set up logger
    tmp_path = RESULTS_DIR + '/ppo'
    new_logger_ppo = configure(tmp_path, ["stdout", "csv", "tensorboard"])
    # Set new logger
    model_ppo.set_logger(new_logger_ppo)
trained_ppo = agent.train_model(model=model_ppo, tb_log_name='ppo', total_timesteps=200000) if if_using_ppo else None
trained_ppo.save(TRAINED_MODEL_DIR + "/agent_ppo") if if_using_ppo else None

Agent 4: TD3

agent = DRLAgent(env=env_train)

TD3_PARAMS = {"batch_size": 100, "buffer_size": 1000000, "learning_rate": 0.001}
model_td3 = agent.get_model("td3", model_kwargs=TD3_PARAMS)

if if_using_td3:
    # set up logger
    tmp_path = RESULTS_DIR + '/td3'
    new_logger_td3 = configure(tmp_path, ["stdout", "csv", "tensorboard"])
    # Set new logger
    model_td3.set_logger(new_logger_td3)
trained_td3 = agent.train_model(model=model_td3, tb_log_name='td3', total_timesteps=50000) if if_using_td3 else None
trained_td3.save(TRAINED_MODEL_DIR + "/agent_td3") if if_using_td3 else None

Agent 5: SAC

agent = DRLAgent(env=env_train)

SAC_PARAMS = {
    "batch_size": 128,
    "buffer_size": 100000,
    "learning_rate": 0.0001,
    "learning_starts": 100,
    "ent_coef": "auto_0.1",
}
model_sac = agent.get_model("sac", model_kwargs=SAC_PARAMS)

if if_using_sac:
    # set up logger
    tmp_path = RESULTS_DIR + '/sac'
    new_logger_sac = configure(tmp_path, ["stdout", "csv", "tensorboard"])
    # Set new logger
    model_sac.set_logger(new_logger_sac)
trained_sac = agent.train_model(model=model_sac, tb_log_name='sac', total_timesteps=70000) if if_using_sac else None
trained_sac.save(TRAINED_MODEL_DIR + "/agent_sac") if if_using_sac else None

Save the trained agent

Trained agents should already have been saved in the "trained_models" directory after you run the code blocks above.

For Colab users, the zip files should be at "./trained_models" or "/content/trained_models".

For users running in a local environment, the zip files should be at "./trained_models".
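In Part 3 of this series (backtesting), the saved agents can be reloaded with Stable-Baselines3's load method. A minimal sketch, assuming the default save paths used above:

from stable_baselines3 import A2C, PPO

# Reload saved agents for backtesting (done in Part 3 of this series).
trained_a2c = A2C.load(TRAINED_MODEL_DIR + "/agent_a2c")
trained_ppo = PPO.load(TRAINED_MODEL_DIR + "/agent_ppo")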