Are you interested in how to set up your own AI stock-prediction system from scratch? In this blog post, we will review how to build an AI system that utilizes news information to predict the daily closing value of a stock (ticker). In contrast to previous posts, the required data was not readily available; instead, we set up crawling processes to create a dataset that spans almost 2 years of news information and stock data. We therefore present a use case on how to develop an AI system holistically, i.e., from data gathering, storage and pre-processing to modelling and serving.
All of the aforementioned steps can be described as different stages of the end product. Let's begin with the data crawling.
Phase 1: Data crawling & storage
Data is the most important component of any AI system, so we collected temporal information regarding market trends and used it as features for our modelling. We therefore spent quite some time setting up the data crawling processes. Data was collected from free sources: search engines (e.g., Google), public forums such as Reddit (e.g., investment communities), publicly available RSS news feeds (e.g., investing.com, forbes.com), and other free APIs (e.g., mediastack). At some point we also monitored influential Twitter accounts such as elonmusk (e.g., the correlation between tweets and cryptocurrency value), pkedrosky, GoldmanSachs, etc. After Twitter's API was updated to v2, we did not incorporate the changes and dropped the Twitter source. For some sources, we would advise using the selenium-stealth package to prevent crawling detection and request blocking (see the sketch after the table below). Our sources are summarized in the following table.
Type | Period | # Sources |
Search Engines | Sept. 2022 - today | 3 |
RSS WebFeed | Sept. 2022 - today | 467 |
Open Source APIs | Sept. 2022 - today | 8 |
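For sources that block plain HTTP clients, crawling through a stealth-patched browser helps. Below is a minimal sketch using selenium-stealth; the target URL and CSS selector are placeholders, not the actual sources used in this post.

```python
# Minimal crawling sketch with selenium-stealth to reduce bot detection.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

# Patch common fingerprinting properties (navigator.webdriver, plugins, languages, ...)
stealth(
    driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.get("https://example.com/markets/news")  # placeholder URL
headlines = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "h3")]
driver.quit()
```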
Data is crawled on a daily basis (via a CRON schedule) before midnight to capture all related information for that particular day. Afterwards, the data for each individual source is stored in a separate directory to be processed in the next step. To date (May 2024), around 1.7M data entries have been crawled from all the aforementioned sources.
Storage is a rather straightforward process. After each source has been crawled, a CSV file is generated that contains all the important information (e.g., timestamp, source id, news text) and is then uploaded to a cloud storage infrastructure to avoid data loss.
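A hedged sketch of this storage step is shown below: one CSV per source per day, pushed to object storage. The field names and the S3 bucket are illustrative assumptions, not the exact schema used here; any cloud SDK works similarly.

```python
# Write one CSV per source and upload it to cloud storage.
# A daily CRON entry (e.g. "50 23 * * *") would trigger this after crawling.
import csv
from datetime import date
import boto3  # assuming an S3-compatible store

def store_source(source_id: str, entries: list[dict]) -> None:
    filename = f"{source_id}_{date.today().isoformat()}.csv"
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["timestamp", "source_id", "news_text"])
        writer.writeheader()
        writer.writerows(entries)
    # Upload to a per-source prefix so each source keeps its own directory
    boto3.client("s3").upload_file(filename, "my-crawl-bucket", f"{source_id}/{filename}")
```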
Phase 2: Data pre-processing
Now that we have enough unstructured data, we have to convert it into a format that machine learning models can read. The preprocessing phase is a multi-step process that aims to:
Reduce noise
Extract relevant information
Generate ML-readable format
The very first step is to remove noisy entries such as junk and non-English text. For detecting the language of the text, we employ two models, from which we take the majority vote (a minimal sketch follows the list below):
Pretrained Transformer from hugging-face: "papluca/xlm-roberta-base-language-detection"
Google's open source library: langdetect
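The sketch below shows one possible implementation of this filter. With only two detectors, the vote is interpreted here as "keep the text only if both agree on English"; that interpretation, and the crude character-level truncation, are assumptions.

```python
# English-language filter combining a hugging-face classifier and langdetect.
from transformers import pipeline
from langdetect import detect

lang_clf = pipeline("text-classification",
                    model="papluca/xlm-roberta-base-language-detection")

def is_english(text: str) -> bool:
    hf_label = lang_clf(text[:512])[0]["label"]  # model outputs ISO codes, e.g. "en"
    try:
        ld_label = detect(text)
    except Exception:                            # langdetect raises on empty/odd input
        return False
    return hf_label == "en" and ld_label == "en"
```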
After we make sure that the textual data is in English, we proceed by extracting entities from the text. Named-entity recognition (NER) is an ML task that focuses on extracting entities from textual data. The goal here is to find organizations that correspond to a particular stock ticker, in order to associate the sentiment of the news with that particular stock.
For the NER task, we employ two well-known pretrained transformer models from hugging-face and take the union of their results (a code sketch follows the examples below):
Hugging-face: flair/ner-english-ontonotes-large
Hugging-face: guishe/span-marker-generic-ner-v1-fewnerd-fine-super
Here are a couple of examples of sentences and the results we get from both models:
Text | Results 1 | Results 2 |
Vedanta Touts $6 Billion Investment Pipeline As Growth Driver | Vedanta:organization-company | Vedanta:ORG,6 Billion:CARDINAL |
Elon Musk Losing Understanding Of Consumer Base & Retail Investors Tesla Still In Need Of Market Correction: McWhorter Foundation's Highlights' | Elon Musk:person-other,Tesla:organization-company | Elon Musk:PERSON,Tesla:ORG,McWhorter:ORG |
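Putting the NER step into code, the sketch below queries one of the two taggers (flair) through its own API; the span-marker model would be queried analogously and its organization tags unioned with these results. This is a minimal illustration, not the exact aggregation logic of the post.

```python
# Organization extraction with the flair OntoNotes tagger.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")

def extract_orgs(text: str) -> set[str]:
    sentence = Sentence(text)
    tagger.predict(sentence)
    # Keep only organization entities; other tags (PERSON, CARDINAL, ...) are ignored
    return {span.text for span in sentence.get_spans("ner")
            if span.get_label("ner").value == "ORG"}

print(extract_orgs("Vedanta Touts $6 Billion Investment Pipeline As Growth Driver"))
# -> {'Vedanta'}
```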
For this particular case study we only care about organizations, so we keep the ORG and organization-* tags from these models and merge them. The figures below show the top 15 organizations identified by each model. There is a large intersection between the two lists (13 out of 15 organizations are common to both).
Furthermore, the textual information is also used for sentiment analysis, which determines whether the text is positive, negative or neutral. Once again, we employ state-of-the-art sentiment classification transformer models from hugging-face that are fine-tuned on financial data:
Hugging-face: KernAI/stock-news-distilbert
Hugging-face: ProsusAI/finbert
Hugging-face: yiyanghkust/finbert-tone
We average the predictions of these models to extract a class label (positive, negative or neutral) and a confidence score. After all these steps, we end up with temporal information regarding entities and their corresponding sentiment class and score.
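A possible implementation of this ensemble is sketched below: each classifier returns a score per class, the scores are averaged across models, and the highest-scoring class becomes the label. Treating "average their predictions" as a per-class score average is an interpretation on our part.

```python
# Sentiment ensemble over the three finance-tuned classifiers.
from collections import defaultdict
from transformers import pipeline

model_names = [
    "KernAI/stock-news-distilbert",
    "ProsusAI/finbert",
    "yiyanghkust/finbert-tone",
]
classifiers = [pipeline("text-classification", model=m, top_k=None) for m in model_names]

def ensemble_sentiment(text: str) -> tuple[str, float]:
    scores = defaultdict(float)
    for clf in classifiers:
        for pred in clf(text[:512])[0]:                 # one score per class
            scores[pred["label"].lower()] += pred["score"] / len(classifiers)
    label = max(scores, key=scores.get)
    return label, scores[label]                         # e.g. ("positive", 0.87)
```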
The final step before model selection and training is to merge this information with the financial data of the stock. This is done using the YahooFinancials Python library, after mapping each extracted organization to its ticker symbol (a sketch follows the table). The following table shows some examples of this mapping:
Organization (Entity Name) | Ticker Symbol (Stock) |
Accenture | ACN |
BlackRock, BlackRock Inc | BLK |
Alphabet | GOOG |
Lumen Technologies | LUMN |
NIFTY IT | ^CNXIT |
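Once an entity is mapped to a ticker, the daily price data can be pulled as sketched below. The dates here are illustrative.

```python
# Fetch daily OHLC data for a mapped ticker with YahooFinancials.
from yahoofinancials import YahooFinancials

ticker = "ACN"  # e.g. the ticker mapped to the "Accenture" entity
data = YahooFinancials(ticker).get_historical_price_data(
    start_date="2022-09-01", end_date="2024-05-06", time_interval="daily"
)
prices = data[ticker]["prices"]          # list of dicts with open/high/low/close/volume
closes = {p["formatted_date"]: p["close"] for p in prices}
```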
Phase 3: Modelling
The previous steps have led us to a temporal dataset that combines sentiment analysis of news with daily financial information about stock prices. Our goal is to create an AI system that receives the temporal information of the previous day and estimates the closing value of a particular stock for the upcoming day. We can also train the system to predict more than one day in advance, simply by shifting the training data one or more days ahead.
After some feature engineering using the pandas_ta library, we end up with a dataset of 646,873 entries and 50 features. We dropped stocks with fewer than 100 entries. For the experimentation we used a variety of regression models, but here we report only the best performing one, the ExtraTreesRegressor.
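The sketch below illustrates how such technical indicators can be appended with pandas_ta. The specific indicators chosen here (RSI, MACD, SMA, Bollinger Bands) are illustrative; the post's exact 50-feature set is not reproduced.

```python
# Technical-indicator feature engineering with pandas_ta.
import pandas as pd
import pandas_ta as ta

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    # df has daily open/high/low/close/volume columns plus sentiment class and score
    df.ta.rsi(length=14, append=True)      # relative strength index
    df.ta.macd(append=True)                # MACD line, signal and histogram
    df.ta.sma(length=20, append=True)      # 20-day simple moving average
    df.ta.bbands(length=20, append=True)   # Bollinger Bands
    return df.dropna()
```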
For the setup, we treated our data as a stream, which means that for each new day we retrained our system on the data of all previous days (up to the inference day). This way the model stays up to date with the most recent trends. For hyper-parameter selection, we employed Bayesian optimization through the auto-sklearn library on a small validation set. Below, we report performance in terms of mean absolute error in dollars ($) for a 2-month period (2024-03-01 till 2024-05-06), covering around 18K unseen observations from various stocks.
Let's first look at the performance of the best fitted model without the news sentiment, i.e., using only the financial data we get from the Yahoo Finance API.
The mean absolute error across all predictions averages $27.1. Now let's see what happens in the presence of the additional information (sentiment from news).
As we observe, the performance is better with sentiment information than without; the difference is about 19%. We could go further and analyze which stocks are most problematic for the model, or try to explain the spikes on particular days, but this is not the aim of this use case. Our purpose is to incorporate sentiment information into the ML system so as to enhance its overall predictive power.
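The walk-forward (stream) evaluation can be sketched as below: for every inference day, retrain on all prior days and predict that day's closing values. The column names and hyper-parameters are illustrative assumptions, not the ones found by the Bayesian search.

```python
# Walk-forward retraining and evaluation with ExtraTreesRegressor.
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

def walk_forward_mae(df: pd.DataFrame, feature_cols: list[str]) -> float:
    errors = []
    days = sorted(df["date"].unique())
    for day in days[1:]:                       # need at least one day of history
        train, test = df[df["date"] < day], df[df["date"] == day]
        model = ExtraTreesRegressor(n_estimators=300, n_jobs=-1, random_state=0)
        model.fit(train[feature_cols], train["next_close"])
        errors.append(np.abs(model.predict(test[feature_cols]) - test["next_close"]).mean())
    return float(np.mean(errors))              # mean absolute error in dollars
```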
Phase 4: Productionisation
Now that we have seen all the individual components, we can orchestrate them to create an online AI service for our users. The following figure shows the daily execution of the pipeline in order to have an up-to-date AI system for stock predictions.
Each day, data is extracted from the established sources and stored for later use. The preprocessing phase then reduces noise and prepares the data for training; the preprocessed data is also stored for future use. Once the new data has been processed, a series of models is trained with hyper-parameter optimization, and the best-performing model is selected based on validation-set results.
In the final stage, the model is moved into production. This can be done by leveraging existing infrastructures such as Azure or AWS, where the model can be deployed and made accessible via the provided APIs. If we want a local or private setup, the FastAPI framework is a very good option: it provides an easy-to-implement solution that can put the model into production in no time.
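A minimal FastAPI sketch for serving the latest trained model is shown below. The model path, feature schema and endpoint name are illustrative assumptions.

```python
# Serve the daily-retrained model behind a simple prediction endpoint.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("models/latest_extratrees.joblib")  # assumed location of the daily model

class Features(BaseModel):
    ticker: str
    values: list[float]            # the engineered features for the previous day

@app.post("/predict")
def predict(features: Features) -> dict:
    close_estimate = model.predict([features.values])[0]
    return {"ticker": features.ticker, "predicted_close": float(close_estimate)}

# Run with: uvicorn service:app --host 0.0.0.0 --port 8000
```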
Conclusion
In this use case, we have seen how to set up an AI service end-to-end starting with practically zero data. We selected daily stock-prediction granularity due to the lack of open-source real-time news information. In our case study, we update the model once per day, but we could easily switch to on-the-fly updates by using online/incremental learning models that can be updated per instance.
In our end-to-end AI system, the most important phase is the very first: getting the data. Data crawling is a simple process, but it must be done consistently. Sometimes the APIs are down, sometimes the search engines deny the requests, and sometimes the services are updated and you have to update the HTTP requests accordingly. A good way to monitor such deviations is to create an alert system; one simple option is to send an automated alert (email) at the end of each crawling phase whenever something terminated unexpectedly (a sketch follows below). Preprocessing is also a crucial component, as it mitigates noisy data and formats the data into an ML-readable form. Finally, experimentation with various models is quite straightforward, and putting the model into live action can be achieved either with already established infrastructure or with custom-made solutions.
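One way to implement such an alert is sketched below with the standard library; the SMTP host and addresses are placeholders.

```python
# End-of-crawl alert: email a summary if any source terminated unexpectedly.
import smtplib
from email.message import EmailMessage

def send_alert(failed_sources: list[str]) -> None:
    if not failed_sources:
        return
    msg = EmailMessage()
    msg["Subject"] = f"Crawling alert: {len(failed_sources)} source(s) failed"
    msg["From"] = "crawler@example.com"
    msg["To"] = "me@example.com"
    msg.set_content("Failed sources:\n" + "\n".join(failed_sources))
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.send_message(msg)
```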
** If you are interested in other ML use-cases, please contact me using the form (and also include a publicly available dataset for this case, I'm always curious to explore new problems).