Optiver Realized Volatility Prediction
Data Description
This dataset contains stock market data relevant to the practical execution of trades in the financial markets. In particular, it includes order book snapshots and executed trades. With one-second resolution, it provides a uniquely fine-grained look at the micro-structure of modern financial markets.
This is a code competition where only the first few rows of the test set are available for download. The rows that are visible are intended to illustrate the hidden test set format and folder structure. The remainder will only be available to your notebook when it is submitted. The hidden test set contains data that can be used to construct features to predict roughly 150,000 target values. Loading the entire dataset will take slightly more than 3 GB of memory, by our estimation.
This is also a forecasting competition, where the final private leaderboard will be determined using data gathered after the training period closes, which means that the public and private leaderboards will have zero overlap. During the active training stage of the competition, a large fraction of the test data will be filler, intended only to ensure the hidden dataset has approximately the same size as the actual test data. The filler data will be removed entirely during the forecasting phase of the competition and replaced with real market data.
Files
book_[train/test].parquet A parquet file partitioned by stock_id. Provides order book data on the most competitive buy and sell orders entered into the market. The top two levels of the book are shared. The first level of the book will be more competitive in price terms, and it will therefore receive execution priority over the second level.
stock_id - ID code for the stock. Not all stock IDs exist in every time bucket. Parquet coerces this column to the categorical data type when loaded; you may wish to convert it to int8.
time_id - ID code for the time bucket. Time IDs are not necessarily sequential but are consistent across all stocks.
seconds_in_bucket - Number of seconds from the start of the bucket, always starting from 0.
bid_price[1/2] - Normalized prices of the most/second most competitive buy level.
ask_price[1/2] - Normalized prices of the most/second most competitive sell level.
bid_size[1/2] - The number of shares on the most/second most competitive buy level.
ask_size[1/2] - The number of shares on the most/second most competitive sell level.
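Because the book files are partitioned by stock_id, a single stock's order book can be loaded on its own. A minimal sketch, assuming the usual Hive-style partition layout in the working directory (the path is an assumption):

import pandas as pd

# Load one partition of the order book; the directory name
# "book_train.parquet/stock_id=0" is an assumption about the layout.
book = pd.read_parquet("book_train.parquet/stock_id=0")

# When loading the full dataset, stock_id arrives as a categorical;
# int8 is cheaper if you need it as a number.
if "stock_id" in book.columns:
    book["stock_id"] = book["stock_id"].astype("int8")

book.head()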
trade_[train/test].parquet A parquet file partitioned by stock_id. Contains data on trades that actually executed. Usually, in the market, there are more passive buy/sell intention updates (book updates) than actual trades, so one may expect this file to be more sparse than the order book.
stock_id - Same as above.
time_id - Same as above.
seconds_in_bucket - Same as above. Note that since trade and book data are taken from the same time window and trade data is more sparse in general, this field does not necessarily start from 0.
price - The average price of executed transactions happening in one second. Prices have been normalized and the average has been weighted by the number of shares traded in each transaction.
size - The sum number of shares traded.
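The trade files follow the same partitioning, so the sparsity claim above is easy to check for one stock; a short sketch (paths again assumed):

import pandas as pd

book = pd.read_parquet("book_train.parquet/stock_id=0")     # assumed path
trades = pd.read_parquet("trade_train.parquet/stock_id=0")  # assumed path

# Book updates outnumber executed trades, so the trade frame
# should be noticeably shorter.
print(len(book), "book rows vs", len(trades), "trade rows")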
order_count - The number of unique trade orders taking place.

train.csv The ground truth values for the training set.
stock_id - Same as above, but since this is a csv the column will load as an integer instead of categorical.
time_id - Same as above.
target - The realized volatility computed over the 10-minute window following the feature data under the same stock/time_id. There is no overlap between feature and target data. You can find more info in our tutorial notebook.

test.csv Provides the mapping between the other data files and the submission file.
As with other test files, most of the data is only available to your notebook upon submission, with just the first few rows available for download.
stock_id - Same as above.
time_id - Same as above.
row_id - Unique identifier for the submission row. There is one row for each existing time ID/stock ID pair. Each time window does not necessarily contain every individual stock.

sample_submission.csv - A sample submission file in the correct format.
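For reference, the target described above can be reproduced from the book data: compute the weighted average price (WAP) from the top-of-book columns, take log returns within a bucket, and sum their squares. A sketch of that computation, following the standard definition used in the competition's tutorial material (treat it as a sketch, not the exact competition code):

import numpy as np
import pandas as pd

def realized_volatility(book: pd.DataFrame) -> float:
    # WAP weights each side's best price by the opposite side's size.
    wap = (book["bid_price1"] * book["ask_size1"]
           + book["ask_price1"] * book["bid_size1"]) \
          / (book["bid_size1"] + book["ask_size1"])
    # Realized volatility: square root of the sum of squared log returns.
    log_returns = np.log(wap).diff().dropna()
    return float(np.sqrt((log_returns ** 2).sum()))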
In [201… # Suppressing warnings
import warnings
warnings.filterwarnings('ignore')
In [202… # Importing Pandas and NumPy
import pandas as pd, numpy as np
In [203… # Importing all datasets
Optiver_train = pd.read_csv("C:/Users/HP/Desktop/Upgrad Case Study/Optiver Realized Volatility Pre
Optiver_train.head()

Out[203…
[First rows of Optiver_train: stock_id, time_id, target]
In [204… # Importing all datasets
Optiver_test = pd.read_csv("C:/Users/HP/Desktop/Upgrad Case Study/Optiver Realized Volatility Pred
Optiver_test.head()

Out[204…
[First rows of Optiver_test: stock_id, time_id, row_id]
In [205… Optiver_train.dtypes
stock_id      int64
time_id       int64
target      float64
dtype: object

Optiver_test.dtypes

stock_id     int64
time_id      int64
row_id      object
dtype: object
Inspecting the Null Values
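The cells that produced the counts below are not shown; they are presumably plain isnull() sums, e.g.:

Optiver_train.isnull().sum()
Optiver_test.isnull().sum()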
stock_id    0
time_id     0
target      0
dtype: int64

stock_id    0
time_id     0
row_id      0
dtype: int64
Rescaling the Features
We will use MinMax scaling.
In [209… from sklearn.preprocessing import MinMaxScaler
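The cell that fits the scaler on the training data is not shown; a minimal sketch of the usual pattern (the exact column list is an assumption):

scaler = MinMaxScaler()
num_vars = ["stock_id", "time_id", "target"]  # assumed columns
Optiver_train[num_vars] = scaler.fit_transform(Optiver_train[num_vars])
Optiver_train.describe()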
              stock_id        time_id         target
count    428932.000000  428932.000000  428932.000000
mean          0.495539       0.489408       0.053765

Checking for Outliers
In [215… sns.boxplot(Optiver_train.target)

Out[215…
[Boxplot of the scaled target]
In [216… # Removing (statistical) outliers with the IQR rule
Q1 = Optiver_train.stock_id.quantile(0.238095)
Q3 = Optiver_train.stock_id.quantile(0.761905)
IQR = Q3 - Q1
Optiver_train = Optiver_train[(Optiver_train.stock_id >= Q1 - 1.5*IQR) & (Optiver_train.stock_id <= Q3 + 1.5*IQR)]

Q1 = Optiver_train.time_id.quantile(0.239576)
Q3 = Optiver_train.time_id.quantile(0.732220)
IQR = Q3 - Q1
Optiver_train = Optiver_train[(Optiver_train.time_id >= Q1 - 1.5*IQR) & (Optiver_train.time_id <= Q3 + 1.5*IQR)]

Q1 = Optiver_train.target.quantile(0.00889)
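The rest of this cell is cut off; it presumably mirrors the two filters above. A sketch of the missing lines, with the upper quantile value a pure assumption:

Q3 = Optiver_train.target.quantile(0.99)  # assumed upper quantile; the original is cut off
IQR = Q3 - Q1
Optiver_train = Optiver_train[(Optiver_train.target >= Q1 - 1.5*IQR) & (Optiver_train.target <= Q3 + 1.5*IQR)]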
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=[9, 5])
sns.distplot(num_Optiver_train.target, bins=40, color="orange")
plt.title("Distribution of statistics", fontsize=20, fontweight=10, verticalalignment='baseline')
plt.show()
In [221… # Putting the feature variables into X
X_train = Optiver_train.drop(['target'], axis=1)
X_train.head()
Out[221…
   stock_id   time_id
0       0.0  0.000000
1       0.0  0.933333
2       0.0  1.000000
In [222… # Putting the target variable into y
y_train = Optiver_train.target
y_train.head()
Out[222…
[First values of y_train]
In [223… # Let's see the correlation matrix
plt.style.use("ggplot")
plt.figure(figsize=(7, 4))  # size of the figure
sns.heatmap(X_train.corr(), annot=True, cmap="Greens")
plt.show()

Model Building

Let's start by splitting our data into a training set and a test set.

Running Your First Training Model
In [224… import xgboost as xg
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
In [225… xgb_r = xg.XGBRegressor(objective='reg:linear', n_estimators=10, seed=123)

In [226… # Fitting the model
xgb_r.fit(X_train, y_train)

[18:14:11] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/objective/regression_obj.cu:171: reg:linear is now deprecated in favor of reg:squarederror

Out[226… XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=10, n_jobs=4, num_parallel_tree=1,
             objective='reg:linear', random_state=123, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, seed=123, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

Index(['stock_id', 'time_id', 'row_id'], dtype='object')

In [249… y_test = 0
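train_test_split and MSE are imported above but never used in the cells shown; a minimal sketch of a holdout check, assuming an 80/20 split and the reg:squarederror objective the warning recommends:

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=123)
model = xg.XGBRegressor(objective='reg:squarederror', n_estimators=10, seed=123)
model.fit(X_tr, y_tr)
# Root mean squared error on the held-out fold.
print(np.sqrt(MSE(y_val, model.predict(X_val))))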
In [230… # Apply the scaler to the numeric columns of the test set
num_vars = ["stock_id", "time_id"]
# Note: this fits a fresh scaler on the test data; transforming with the
# scaler fitted on the training data would keep both sets on the same scale.
Optiver_test[num_vars] = scaler.fit_transform(Optiver_test[num_vars])
Optiver_test.head()

Out[230…
[First rows of the scaled Optiver_test: stock_id, time_id, row_id]
In [231… # Now let's use our model to make predictions
# Creating X_test_new by keeping only the training feature columns
X_test_new = Optiver_test[X_train.columns]
In [232… X_test_new.columns
Out[232… Index(['stock_id', 'time_id'], dtype='object')
In [233… # Predicting with the model
Y_pred = xgb_r.predict(X_test_new)
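The y_pred_1 frame used below comes from cells not shown; a minimal sketch, assuming the predictions are stored in a single column named target:

y_pred_1 = pd.DataFrame(Y_pred, columns=["target"])  # column name is an assumption
y_pred_1.head()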
In [235… # Re-importing the test data
Optiver_test2 = pd.read_csv("C:/Users/HP/Desktop/Upgrad Case Study/Optiver Realized Volatility Pre
Optiver_test2.head()
Trang 191 0.024067
2 0.024067
Index(['stock_id', 'time_id', 'row_id', 'ID'], dtype='object')
stock_id time_id row_id ID 0
In [238… # Storing the row index in an ID column
Optiver_test2['ID'] = Optiver_test2.index
In [239… Optiver_test2.columns
Out[239… Index(['stock_id', 'time_id', 'row_id', 'ID'], dtype='object')
In [240… # Resetting the index on both dataframes to append them side by side
y_pred_1.reset_index(drop=True, inplace=True)
Optiver_test2.reset_index(drop=True, inplace=True)
In [241… # Appending Optiver_test2 and y_pred_1 side by side
y_pred_final = pd.concat([Optiver_test2, y_pred_1], axis=1)
In [242… y_pred_final.head()
Out[242…
[First rows of y_pred_final: stock_id, time_id, row_id, ID and the predicted target]
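From y_pred_final, the actual submission is just the row_id and target columns, matching sample_submission.csv; a closing sketch (the target column name follows the assumption above):

submission = y_pred_final[["row_id", "target"]]
submission.to_csv("submission.csv", index=False)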