GitHub Repository: AI4Finance-Foundation/FinRL
Path: blob/master/examples/Stock_NeurIPS2018_2_Train.ipynb
Kernel: Python 3 (ipykernel)

Stock NeurIPS2018 Part 2. Train

This series is a reproduction of the process in the paper Practical Deep Reinforcement Learning Approach for Stock Trading.

This is the second part of the NeurIPS2018 series. It shows how to use FinRL to turn the data into a gym-style environment and train DRL agents on it.

Other demos can be found at the repo of FinRL-Tutorials.

Part 1. Install Packages

## install finrl library
!pip install git+https://github.com/AI4Finance-Foundation/FinRL.git
import pandas as pd

from stable_baselines3.common.logger import configure

from finrl.agents.stablebaselines3.models import DRLAgent
from finrl.config import INDICATORS, TRAINED_MODEL_DIR, RESULTS_DIR
from finrl.main import check_and_make_directories
from finrl.meta.env_stock_trading.env_stocktrading import StockTradingEnv

check_and_make_directories([TRAINED_MODEL_DIR])

Part 2. Build a Market Environment in OpenAI Gym Style

[Figure: the agent-environment interaction loop (rl_diagram_transparent_bg.png)]

The core elements in reinforcement learning are the agent and the environment. You can understand RL as the following process:

The agent acts in a world, which is the environment. It observes its current condition as a state and is allowed to take certain actions. After the agent executes an action, it arrives at a new state. At the same time, the environment gives feedback to the agent called the reward, a numerical signal that tells how good or bad the new state is. As the figure above shows, the agent and the environment keep repeating this interaction.

The goal of the agent is to collect as much cumulative reward as possible. Reinforcement learning is the method by which the agent learns to improve its behavior and achieve that goal.

To achieve this in Python, we follow the OpenAI Gym style and build the stock data into an environment.

The state, action, and reward are specified as follows:

  • State s: The state space represents the agent's perception of the market environment. Just as a human trader analyzes various kinds of information, our agent passively observes the price data and technical indicators computed from past data. It learns by interacting with the market environment (usually by replaying historical data).

  • Action a: The action space consists of the actions the agent is allowed to take at each state. For example, a ∈ {−1, 0, 1}, where −1, 0, 1 represent selling, holding, and buying. When an action operates on multiple shares, a ∈ {−k, ..., −1, 0, 1, ..., k}; e.g., "Buy 10 shares of AAPL" and "Sell 10 shares of AAPL" are represented as 10 and −10, respectively.

  • Reward function r(s, a, s′): The reward is an incentive for the agent to learn a better policy. For example, it can be the change of the portfolio value when taking action a at state s and arriving at the new state s′, i.e., r(s, a, s′) = v′ − v, where v′ and v represent the portfolio values at states s′ and s, respectively. A toy sketch of this loop follows this list.
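To make the state-action-reward loop concrete, here is a minimal, purely illustrative single-stock environment in the Gym style. It is not FinRL's StockTradingEnv (which is the class actually used below); the class name, the three-element state, and the trade-clipping rules are our own simplifying assumptions. The only point is how r(s, a, s′) = v′ − v is wired into step().

import numpy as np

class ToySingleStockEnv:
    """A hypothetical, minimal gym-style environment for a single stock.

    State  : [cash, shares held, current price]
    Action : integer in {-k, ..., -1, 0, 1, ..., k} = shares to sell/buy
    Reward : change in portfolio value, r(s, a, s') = v' - v
    """

    def __init__(self, prices, initial_cash=1_000_000, k=10):
        self.prices = np.asarray(prices, dtype=float)
        self.initial_cash = initial_cash
        self.k = k

    def reset(self):
        self.t = 0
        self.cash = self.initial_cash
        self.shares = 0
        return self._state()

    def _state(self):
        return np.array([self.cash, self.shares, self.prices[self.t]])

    def _portfolio_value(self):
        return self.cash + self.shares * self.prices[self.t]

    def step(self, action):
        price = self.prices[self.t]
        # Clip the action to the allowed range, to the shares we actually hold
        # (for sells), and to the cash we actually have (for buys).
        action = int(np.clip(action, -self.k, self.k))
        action = max(action, -self.shares)
        if action > 0:
            action = min(action, int(self.cash // price))

        v_old = self._portfolio_value()

        # Execute the trade at the current price.
        self.cash -= action * price
        self.shares += action

        # Advance one day; the new price defines the new state s'.
        self.t += 1
        v_new = self._portfolio_value()
        reward = v_new - v_old          # r(s, a, s') = v' - v
        done = self.t >= len(self.prices) - 1
        return self._state(), reward, done, {}

# Usage: a random-action rollout over made-up prices.
env = ToySingleStockEnv(prices=[100, 101, 99, 103, 102])
state, done = env.reset(), False
while not done:
    state, reward, done, _ = env.step(np.random.randint(-10, 11))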

Market environment: the 30 constituent stocks of the Dow Jones Industrial Average (DJIA) index, accessed at the starting date of the testing period.

Read data

We first read the .csv file of our training data into a dataframe.

train = pd.read_csv('train_data.csv')

# If you are not using the data generated in Part 1 of this tutorial, make sure
# it has the columns and index in a form that the environment can consume.
# In that case you can comment out and skip the following two lines.
train = train.set_index(train.columns[0])
train.index.names = ['']
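As an optional sanity check, you can inspect the dataframe before building the environment. The column names below reflect the Part 1 output (a date column, a tic ticker column, OHLCV prices, plus one column per entry in INDICATORS) and may differ if you brought your own data.

# Optional sanity check on the training dataframe.
print(train.head())
print(train.tic.nunique(), "tickers,", train.date.nunique(), "trading days")

missing = [col for col in ['date', 'tic', 'close'] + INDICATORS if col not in train.columns]
assert not missing, f"missing columns: {missing}"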

Construct the environment

Calculate and specify the parameters we need for constructing the environment.

stock_dimension = len(train.tic.unique())
state_space = 1 + 2 * stock_dimension + len(INDICATORS) * stock_dimension
print(f"Stock Dimension: {stock_dimension}, State Space: {state_space}")
Stock Dimension: 29, State Space: 291
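The state space of 291 follows directly from the formula above. To the best of our understanding of FinRL's StockTradingEnv, the state stacks the cash balance, one close price and one holding count per ticker, and the technical indicator values per ticker (the default INDICATORS list has 8 entries); a quick check of the arithmetic:

# Breakdown of the state vector: cash, plus per-ticker price, holdings, and indicators.
cash_dim      = 1
price_dim     = stock_dimension                     # one close price per ticker
holding_dim   = stock_dimension                     # shares held per ticker
indicator_dim = len(INDICATORS) * stock_dimension   # e.g. 8 * 29 = 232 with the defaults

assert cash_dim + price_dim + holding_dim + indicator_dim == state_space   # 1 + 29 + 29 + 232 = 291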
buy_cost_list = sell_cost_list = [0.001] * stock_dimension  # 0.1% transaction cost per trade
num_stock_shares = [0] * stock_dimension                    # start with no shares held

env_kwargs = {
    "hmax": 100,                          # maximum number of shares per trade
    "initial_amount": 1000000,            # starting cash
    "num_stock_shares": num_stock_shares,
    "buy_cost_pct": buy_cost_list,
    "sell_cost_pct": sell_cost_list,
    "state_space": state_space,
    "stock_dim": stock_dimension,
    "tech_indicator_list": INDICATORS,
    "action_space": stock_dimension,      # one action per stock
    "reward_scaling": 1e-4                # scale rewards to stabilize training
}

e_train_gym = StockTradingEnv(df=train, **env_kwargs)

Environment for training

env_train, _ = e_train_gym.get_sb_env()
print(type(env_train))
<class 'stable_baselines3.common.vec_env.dummy_vec_env.DummyVecEnv'>
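As the output shows, get_sb_env wraps the environment in Stable-Baselines3's DummyVecEnv, which is the vectorized interface the SB3 algorithms expect. An optional quick check (reset returns a batch of observations, one row per environment):

# Optional: inspect the vectorized interface before training.
obs = env_train.reset()
print(obs.shape)   # expected (1, 291): one environment, one 291-dimensional state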

Part 3: Train DRL Agents

  • Here, the DRL algorithms are from Stable Baselines 3, a library that implements popular DRL algorithms in PyTorch and succeeds its predecessor, Stable Baselines.

  • Users are also encouraged to try ElegantRL and Ray RLlib.

agent = DRLAgent(env=env_train)

# Set the corresponding values to 'True' for the algorithms that you want to use
if_using_a2c = True
if_using_ddpg = True
if_using_ppo = True
if_using_td3 = True
if_using_sac = True

Agent Training: 5 algorithms (A2C, DDPG, PPO, TD3, SAC)

Agent 1: A2C

agent = DRLAgent(env=env_train)
model_a2c = agent.get_model("a2c")

if if_using_a2c:
    # set up logger
    tmp_path = RESULTS_DIR + '/a2c'
    new_logger_a2c = configure(tmp_path, ["stdout", "csv", "tensorboard"])
    # Set new logger
    model_a2c.set_logger(new_logger_a2c)
{'n_steps': 5, 'ent_coef': 0.01, 'learning_rate': 0.0007}
Using cpu device
Logging to results/a2c
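The dictionary printed above shows the default A2C hyperparameters FinRL uses (n_steps=5, ent_coef=0.01, learning_rate=0.0007). If you want to experiment with different values, get_model also accepts a model_kwargs dictionary for A2C, in the same way the PPO, TD3, and SAC cells below use it. The numbers here are an illustrative assumption, not tuned recommendations; if you do override them, re-run the logger setup above before training.

# Illustrative only -- these A2C values are not tuned recommendations.
A2C_PARAMS = {"n_steps": 10, "ent_coef": 0.005, "learning_rate": 0.0005}
model_a2c = agent.get_model("a2c", model_kwargs=A2C_PARAMS)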
trained_a2c = agent.train_model(model=model_a2c, tb_log_name='a2c', total_timesteps=50000) if if_using_a2c else None
[A2C training log, truncated: Stable-Baselines3 prints a progress table every 100 iterations with time/ metrics (fps, iterations, time_elapsed, total_timesteps) and train/ metrics (entropy_loss, explained_variance, learning_rate, n_updates, policy_loss, reward, std, value_loss).]

day: 2892, episode: 10
begin_total_asset: 1000000.00
end_total_asset: 6470641.90
total_reward: 5470641.90
total_cost: 41812.90
total_trades: 48852
Sharpe: 0.814
=================================
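The episode summary printed during training reports the final account value and an annualized Sharpe ratio of the account's daily returns. As a rough reconstruction (our own sketch, not lifted from the environment source code), that Sharpe value is computed approximately like this:

import numpy as np
import pandas as pd

def annualized_sharpe(account_values, trading_days_per_year=252):
    """Annualized Sharpe ratio of daily returns (risk-free rate taken as 0)."""
    daily_returns = pd.Series(account_values).pct_change().dropna()
    return np.sqrt(trading_days_per_year) * daily_returns.mean() / daily_returns.std()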
trained_a2c.save(TRAINED_MODEL_DIR + "/agent_a2c") if if_using_a2c else None

Agent 2: DDPG

agent = DRLAgent(env=env_train)
model_ddpg = agent.get_model("ddpg")

if if_using_ddpg:
    # set up logger
    tmp_path = RESULTS_DIR + '/ddpg'
    new_logger_ddpg = configure(tmp_path, ["stdout", "csv", "tensorboard"])
    # Set new logger
    model_ddpg.set_logger(new_logger_ddpg)
trained_ddpg = agent.train_model(model=model_ddpg, tb_log_name='ddpg', total_timesteps=50000) if if_using_ddpg else None
trained_ddpg.save(TRAINED_MODEL_DIR + "/agent_ddpg") if if_using_ddpg else None

Agent 3: PPO

agent = DRLAgent(env=env_train)

PPO_PARAMS = {
    "n_steps": 2048,
    "ent_coef": 0.01,
    "learning_rate": 0.00025,
    "batch_size": 128,
}
model_ppo = agent.get_model("ppo", model_kwargs=PPO_PARAMS)

if if_using_ppo:
    # set up logger
    tmp_path = RESULTS_DIR + '/ppo'
    new_logger_ppo = configure(tmp_path, ["stdout", "csv", "tensorboard"])
    # Set new logger
    model_ppo.set_logger(new_logger_ppo)
trained_ppo = agent.train_model(model=model_ppo, tb_log_name='ppo', total_timesteps=200000) if if_using_ppo else None
trained_ppo.save(TRAINED_MODEL_DIR + "/agent_ppo") if if_using_ppo else None

Agent 4: TD3

agent = DRLAgent(env=env_train)

TD3_PARAMS = {"batch_size": 100, "buffer_size": 1000000, "learning_rate": 0.001}
model_td3 = agent.get_model("td3", model_kwargs=TD3_PARAMS)

if if_using_td3:
    # set up logger
    tmp_path = RESULTS_DIR + '/td3'
    new_logger_td3 = configure(tmp_path, ["stdout", "csv", "tensorboard"])
    # Set new logger
    model_td3.set_logger(new_logger_td3)
trained_td3 = agent.train_model(model=model_td3, tb_log_name='td3', total_timesteps=50000) if if_using_td3 else None
trained_td3.save(TRAINED_MODEL_DIR + "/agent_td3") if if_using_td3 else None

Agent 5: SAC

agent = DRLAgent(env=env_train)

SAC_PARAMS = {
    "batch_size": 128,
    "buffer_size": 100000,
    "learning_rate": 0.0001,
    "learning_starts": 100,
    "ent_coef": "auto_0.1",
}
model_sac = agent.get_model("sac", model_kwargs=SAC_PARAMS)

if if_using_sac:
    # set up logger
    tmp_path = RESULTS_DIR + '/sac'
    new_logger_sac = configure(tmp_path, ["stdout", "csv", "tensorboard"])
    # Set new logger
    model_sac.set_logger(new_logger_sac)
trained_sac = agent.train_model(model=model_sac, tb_log_name='sac', total_timesteps=70000) if if_using_sac else None
trained_sac.save(TRAINED_MODEL_DIR + "/agent_sac") if if_using_sac else None

Save the trained agent

Trained agents should already have been saved in the "trained_models" directory after you run the code blocks above.

For Colab users, the zip files should be at "./trained_models" or "/content/trained_models".

For users running in a local environment, the zip files should be at "./trained_models".
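In Part 3 of this series (backtesting), the saved agents can be reloaded with Stable-Baselines3's load method. A minimal sketch, assuming the default save paths used above:

from stable_baselines3 import A2C, PPO

# Reload saved agents for backtesting (done in Part 3 of this series).
trained_a2c = A2C.load(TRAINED_MODEL_DIR + "/agent_a2c")
trained_ppo = PPO.load(TRAINED_MODEL_DIR + "/agent_ppo")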