Token Volatility Forecasting
Welcome to the Token Volatility Prediction Tutorial. In this tutorial, we will attempt to predict the volatility of various tokens within the DeFi ecosystem using the full Giza stack, from loading and processing some of our datasets to executing an action. The approach we will take is not the most common one. Instead of using the time series of the asset's price, transforming it, and trying to predict the next n days, we will use all the information provided by our datasets to find patterns that relate the overall behavior of the DeFi ecosystem to the future volatility of a token. Finally, we will compare these predictions against a benchmark and execute the action.
If you didn't catch all the concepts in this introduction, don't worry. In this tutorial, every step that needs to be taken is explained along the way.
Before starting
If you haven't yet checked out the tutorial Build a Verifiable Neural Network with Giza Actions, we highly recommend doing so. This tutorial is more complex and does not contain all the code needed to execute it from start to finish.
To run the complete code, visit our Orion-Hub repo.
Install the Giza stack: Giza CLI, Giza Datasets, and Giza Actions.
Data loading and preprocessing
Our plan in this tutorial is to construct a large number of features that make financial sense and represent the behavior of the DeFi ecosystem. Specifically, we will build 452 variables. The @task that generates all this information is as follows (to understand how tasks work, see here):
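The full task lives in the Orion-Hub repo. As a rough sketch of its shape, assuming the `giza_actions` import path shown in the Giza docs (a no-op stand-in keeps the sketch runnable without the library), with the three dataset helpers reduced to toy placeholders:

```python
import pandas as pd

try:
    from giza_actions.task import task  # assumed Giza Actions API
except ImportError:
    # No-op stand-in so the sketch runs without giza_actions installed.
    def task(fn=None, **kwargs):
        return fn if fn is not None else (lambda f: f)

def daily_price_dataset_manipulation():
    # Placeholder: in the tutorial this loads and transforms the
    # Tokens Daily Information dataset.
    return pd.DataFrame({"date": pd.date_range("2024-01-01", periods=3),
                         "returns": [0.01, -0.02, 0.015]})

def apy_dataset_manipulation():
    # Placeholder for the Top Pools APY per Protocol dataset.
    return pd.DataFrame({"date": pd.date_range("2024-01-01", periods=3),
                         "pool_apy": [4.2, 4.1, 4.3]})

def tvl_dataset_manipulation():
    # Placeholder for the TVL for Each Token by Protocol dataset.
    return pd.DataFrame({"date": pd.date_range("2024-01-01", periods=3),
                         "tvl_usd": [1.0e6, 1.1e6, 1.05e6]})

@task
def build_feature_dataset():
    """Join the price, APY, and TVL features on the date index."""
    df = daily_price_dataset_manipulation()
    df = df.merge(apy_dataset_manipulation(), on="date")
    df = df.merge(tvl_dataset_manipulation(), on="date")
    return df
```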
This method calls the functions `daily_price_dataset_manipulation`, `apy_dataset_manipulation`, and `tvl_dataset_manipulation`. Each of them loads and processes one of the following datasets:
Tokens Daily Information: Tokens Daily Information Dataset
This dataset offers daily information on various tokens, including token prices, market capitalization, trading volume, and more, which will help us construct the target volatility and many of the features.
Top Pools APY per Protocol: Top Pools APY per Protocol Dataset
The Annual Percentage Yield (APY) data from the top pools across different protocols provides insights into the profitability of investments in specific token pairs.
TVL for Each Token by Protocol: TVL for Each Token by Protocol Dataset
The Total Value Locked (TVL) in protocols for each token offers a measure of the token's popularity and the trust investors place in it.
To understand all the transformations performed, it will be necessary to delve into the files utils.py and financial_features.py. In summary, the final data we will be working with is built as follows:
- Calculate the target: the token's volatility over the next n days.
- Select the 10 tokens with the highest lag-correlation to the target token. For each of these tokens, calculate multiple financial variables such as `momentum`, `mean_percentage_changes`, min/max prices, past volatility, and technical indicators like `RSI`, `MACD`, and `Bollinger Bands`.
- Repeat this process of generating synthetic features not only for the returns variable but also for the `volume_last_24_hours` and `market_cap` variables.
- Add the time series of `tvl_usd` and the APY of pools that work with the target token and have sufficient historical data.
- Add the time series of TVL and fees from the most influential protocols in the DeFi ecosystem.
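The target and a couple of the listed features can be sketched with pandas. The concrete choices here (a 7-day forward standard deviation of returns as "future volatility", a 14-day RSI window) are illustrative assumptions, not necessarily the exact windows used in the tutorial:

```python
import numpy as np
import pandas as pd

def future_volatility(returns: pd.Series, horizon: int = 7) -> pd.Series:
    """Target: std of the next `horizon` daily returns, aligned to today."""
    return returns.rolling(horizon).std().shift(-horizon)

def momentum(price: pd.Series, window: int = 10) -> pd.Series:
    """Simple momentum: percentage change over `window` days."""
    return price.pct_change(window)

def rsi(price: pd.Series, window: int = 14) -> pd.Series:
    """Relative Strength Index from average gains vs. average losses."""
    delta = price.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

# Toy price series to exercise the helpers.
rng = np.random.default_rng(0)
price = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.02, 60))))
returns = price.pct_change()
target = future_volatility(returns)
```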
For instance, the function `apy_dataset_manipulation` would look like:
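A hedged sketch of what such a manipulation could look like: keep pools that involve the target token and have sufficient history, then pivot each remaining pool's APY into its own column. The column names (`underlying_token`, `pool_id`, `apy`) and the 30-day minimum are assumptions for illustration:

```python
import pandas as pd

def apy_dataset_manipulation(apy_df: pd.DataFrame,
                             target_token: str,
                             min_history: int = 30) -> pd.DataFrame:
    """Filter to pools trading `target_token` with enough daily history,
    then pivot to one APY column per pool."""
    mask = apy_df["underlying_token"].str.contains(target_token, regex=False)
    df = apy_df[mask]
    # Discard pools with fewer than `min_history` daily observations.
    counts = df.groupby("pool_id")["date"].transform("count")
    df = df[counts >= min_history]
    wide = df.pivot_table(index="date", columns="pool_id", values="apy")
    return wide.add_prefix("apy_")
```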
Finally, before moving on to training, we will drastically reduce the dimensionality of our problem, keeping only 25 features. To do this, we first remove features that are highly correlated with one another, and then apply Recursive Feature Elimination (RFE) with LightGBM to retain the final 25 features:
We now have the dataset ready for training! It's important to remember that all this code is encapsulated in a @task, which will later serve to execute our action:
Train our model
In this case, we will use LightGBM. Although the dimensionality of our dataset has been drastically reduced, in these types of problems with very few observations, it is crucial to be careful not to overfit. To avoid this, we will use a special time series cross-validation technique beforehand to select the optimal number of rounds and then manually adjust some additional parameters.
We will create another @task for this purpose:
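A sketch of that task's core logic, using forward-chaining time-series cross-validation to pick the number of boosting rounds. Scikit-learn's `TimeSeriesSplit` and `GradientBoostingRegressor` stand in here for the tutorial's LightGBM setup, and the candidate round counts are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

def train_with_ts_cv(X, y, candidate_rounds=(50, 100, 200), n_splits=4):
    """Pick the boosting-round count with the lowest mean MSE across
    forward-chaining folds, then refit on all data."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    best_rounds, best_mse = None, np.inf
    for rounds in candidate_rounds:
        fold_mse = []
        for train_idx, val_idx in tscv.split(X):
            # Each fold trains only on data that precedes its validation set.
            model = GradientBoostingRegressor(n_estimators=rounds, random_state=0)
            model.fit(X[train_idx], y[train_idx])
            fold_mse.append(mean_squared_error(y[val_idx],
                                               model.predict(X[val_idx])))
        mean_mse = float(np.mean(fold_mse))
        if mean_mse < best_mse:
            best_rounds, best_mse = rounds, mean_mse
    final = GradientBoostingRegressor(n_estimators=best_rounds,
                                      random_state=0).fit(X, y)
    return final, best_rounds
```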
Naive benchmark vs model predictions
In this section, we evaluate the performance of our volatility forecasting model against a naive benchmark. The naive benchmark predicts the volatility based on the average volatility of the token in the same number of days as our test set, but immediately preceding it. This approach ensures our benchmark is grounded in the most recent market conditions. The @task "test_model" will reproduce these metrics to ensure the reproducibility of the experiment.
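The benchmark described above can be sketched as follows: predict, for every test day, a constant equal to the mean volatility of the window of the same length immediately preceding the test set, then score both approaches with the same metrics. The helper names are ours, not the tutorial's:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def naive_benchmark(volatility: np.ndarray, test_size: int) -> np.ndarray:
    """Constant prediction: mean volatility of the `test_size` days
    immediately before the test window."""
    baseline = volatility[-2 * test_size:-test_size].mean()
    return np.full(test_size, baseline)

def report(y_true, y_pred):
    """MSE / MAE / R^2 summary used for both the model and the benchmark."""
    return {"MSE": mean_squared_error(y_true, y_pred),
            "MAE": mean_absolute_error(y_true, y_pred),
            "R2": r2_score(y_true, y_pred)}
```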
This concise comparison demonstrates the effectiveness of our model against the naive approach. The table below succinctly presents the performance metrics for both:
| Metric | Model Performance | Naive Benchmark |
| --- | --- | --- |
| MSE | 7.147e-5 | 0.000106 |
| MAE | 0.00672 | 0.00805 |
| R^2 | 0.105 | -0.331 |
The model's lower MSE and MAE values compared to the naive benchmark signify a more accurate and precise prediction of token volatility. Moreover, the model's positive R^2 value, as opposed to the negative value for the benchmark, underscores the model's ability to capture the variability in the data effectively. This table highlights the added value our model brings to forecasting DeFi token volatility, leveraging a rich set of financial features and Giza's comprehensive datasets.
Visually, the result of the experiment would be:
Execute it!
We've already seen the model results and some preprocessing steps. However, let's see what the final execution method would look like:
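A sketch of the overall shape of that execution method: an `@action` that chains the tasks we have built. The `giza_actions` import paths and the `log_prints` argument are assumptions based on the Giza Actions docs, and the tasks are reduced to toy placeholders (no-op stand-ins keep the sketch runnable offline):

```python
try:
    from giza_actions.action import action  # assumed Giza Actions API
    from giza_actions.task import task
except ImportError:
    # No-op stand-ins so the sketch runs without giza_actions installed.
    def task(fn):
        return fn
    def action(fn=None, **kwargs):
        return fn if fn is not None else (lambda f: f)

@task
def preprocess():
    # Placeholder for the dataset-building task described earlier.
    return {"X": [[0.1], [0.2]], "y": [0.011, 0.012]}

@task
def train(data):
    # Placeholder for the model-training task.
    return {"model": "trained", "n_samples": len(data["y"])}

@action(log_prints=True)
def execution():
    """Chain preprocessing and training into one executable action."""
    data = preprocess()
    return train(data)
```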
If you have followed the "Build a Verifiable Neural Network with Giza Actions" tutorial that we recommended in the introduction, you will already be familiar with Giza CLI, ONNX, and how to transpile and deploy our model.
To proceed with these last steps, we only need to execute our `train_action.py` script. It will train the model based on the dataset and preprocessing steps we've discussed: