xFinance: How to outperform BloombergGPT without breaking the bank

May 3, 2023

Despite being 4x smaller and budget-friendly, xFinance has outperformed BloombergGPT on financial NLP tasks by a significant margin.

We recently developed an analog of the BloombergGPT model using our very own xTuring library and data scraped from the internet. Our aim was to create a cost-effective solution for financial NLP tasks without sacrificing performance.

Introduction

At Stochastic, we're always looking for ways to push the boundaries of what's possible in the field of artificial intelligence. With the recent buzz around BloombergGPT, a 50-billion parameter large language model designed specifically for the finance industry, we set out to see if we could achieve similar results using an open-source model and a modest budget of $1000.

What we found was surprising: not only did we achieve better results on finance tasks than BloombergGPT, but we also discovered a formula for fine-tuning open-source models for specific domains at a fraction of the cost.

With xFinance, we've developed an approach for fine-tuning LLMs that allows for continual or incremental learning, without the models losing their prior training. This technique empowers us to construct automated pipelines for fine-tuning open-source models with domain-specific data as it becomes available. By leveraging this, we're able to ensure that the models remain up-to-date and perform exceptionally, delivering unparalleled results to our clients.

xFinance

We created xFinance, a 13-billion parameter model fine-tuned from an open-source model using LoRA. Our goal was to show that it is possible to achieve impressive results on financial NLP tasks without breaking the bank. We put xFinance to the test on popular open-source finance tasks, namely Financial Phrasebank (FPB), FiQA SA, and Headline, and the results speak for themselves: we achieved a better F1 score than BloombergGPT on these tasks, despite the significant difference in model size and cost.
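As a taste of what this looks like in practice, here is a minimal inference sketch using xTuring's BaseModel interface; the local weights path and the prompt are illustrative, not fixed parts of our release.

```python
from xturing.models import BaseModel

# Load the fine-tuned weights (path is a placeholder) and run a prompt.
model = BaseModel.load("./xfinance")
output = model.generate(
    texts=["What is the sentiment of this headline: 'Gold futures rally as the Fed pauses rate hikes'?"]
)
print(output)
```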

Dataset

In training our xFinance model, we utilized two distinct datasets: the text dataset and the instruction dataset. The text dataset consists of raw financial text data for unsupervised fine-tuning. The instruction dataset, on the other hand, was generated using our dataset generation flow in the xTuring library for the purpose of instruction fine-tuning. By combining these two datasets, we aimed to provide our model with a robust training corpus that captures both domain-specific and general-purpose text. In the following sections, we provide detailed information about the datasets and the methodology used to construct them.

Text dataset

The text dataset contains various types of financial documents written in English, such as news articles, technical reports, filings, press releases, and social media posts scraped from the internet in 2022 and 2023. We divided the data into three sets for continuous learning. The first set consists of data up to January 2023; the second combines a sample of the pre-February 2023 data with the February 2023 data at a ratio of 5:1; and the third combines a sample of the pre-March 2023 data with the March 2023 data at the same 5:1 ratio. The training corpus is designed to cover both domain-specific and general-purpose text, and we improved its quality by removing duplicates. We present a detailed breakdown of the entire training set in Table 1; the figures may differ from those in other papers due to the de-duplication process.
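The description above maps to a fairly simple sampling procedure. The sketch below shows one way to build a monthly split, reading the 5:1 ratio as old:new; the function and document lists are illustrative, not our production pipeline.

```python
import random

def build_monthly_split(old_docs, new_docs, ratio=5, seed=0):
    """Combine a random sample of older documents with the new month's
    documents at an old:new ratio of 5:1, after de-duplication."""
    rng = random.Random(seed)
    old_unique = list(dict.fromkeys(old_docs))   # drop exact duplicates
    new_unique = list(dict.fromkeys(new_docs))
    n_old = min(len(old_unique), ratio * len(new_unique))
    split = rng.sample(old_unique, n_old) + new_unique
    rng.shuffle(split)
    return split

# e.g. the February 2023 split:
# feb_split = build_monthly_split(docs_before_feb_2023, docs_feb_2023)
```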

| Datasets       | Docs    | Tokens |
|----------------|---------|--------|
| Until Jan 2023 | 216,626 | 236M   |
| Feb 2023       | 114,758 | 125M   |
| March 2023     | 128,972 | 132M   |

Table 1: The number of documents and number of tokens (in millions) in each split

Instruction dataset

To generate the instruction dataset for xFinance, we extracted finance data from the text dataset and used the dataset generation flow from our xTuring library to create question-answer pairs. The resulting instruction dataset provides a structured set of examples for fine-tuning the model on specific financial tasks: the generation process creates questions about finance topics and identifies the corresponding answers within the text data. We present a breakdown of the instruction dataset in Table 2, which reports the number of question-answer samples in each split. By generating the instruction dataset this way, we aimed to improve the model's ability to perform finance-specific tasks accurately and efficiently.
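For concreteness, here is a sketch of packaging generated question-answer pairs into the instruction format that xTuring's InstructionDataset expects (instruction / text / target fields). The example pair is purely illustrative, not drawn from our data.

```python
from xturing.datasets.instruction_dataset import InstructionDataset

# One illustrative generated QA pair; real data would have thousands.
qa_pairs = [
    {
        "instruction": "Answer the question using the financial context.",
        "text": "Q: What did the company report for Q4 revenue? "
                "Context: Acme Corp reported Q4 revenue of $2.1B, up 12% YoY.",
        "target": "Acme Corp reported Q4 revenue of $2.1B.",
    },
]

# InstructionDataset accepts a dict of parallel lists.
dataset = InstructionDataset({
    "instruction": [p["instruction"] for p in qa_pairs],
    "text": [p["text"] for p in qa_pairs],
    "target": [p["target"] for p in qa_pairs],
})
```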

| Datasets       | Samples |
|----------------|---------|
| Until Jan 2023 | 25,000  |
| Feb 2023       | 27,000  |
| March 2023     | 30,000  |

Table 2: The number of samples in each instruction dataset

Fine-tuning

In the fine-tuning process of xFinance, we utilized two different datasets: the text dataset and the instruction dataset. The text dataset was used for unsupervised fine-tuning, enabling the model to learn from a large amount of unlabeled text; the instruction dataset was used for instruction fine-tuning, training the model on specific tasks such as question answering, reasoning, summarization, and chain-of-thought. We fine-tuned LLaMA 13B, a large pre-trained language model trained on diverse text sources, performing both unsupervised and instruction fine-tuning with our xTuring library. The fine-tuning ran on 8 A100 80GB GPUs on GCP, which provided enough computational power to handle the data volume and model size. By fine-tuning on both the text and instruction datasets, we aimed to enhance the model's ability to perform well on finance-specific tasks and achieve better performance across a range of financial applications.

To ensure that our xFinance model can continuously learn from new data and adapt to the ever-changing financial world, we performed unsupervised fine-tuning and instruction fine-tuning on three datasets, each corresponding to a different time period. For unsupervised fine-tuning, we started from a pre-trained LLaMA 13B model and fine-tuned it on the data up to January 2023. For the February 2023 and March 2023 datasets, we loaded the checkpoint from the previous month's unsupervised fine-tuning and continued training. For instruction fine-tuning, we loaded the unsupervised checkpoint corresponding to each time period and fine-tuned it on the instruction dataset generated with our xTuring library. To prevent catastrophic forgetting, we combined old and current data at a suitable ratio and decreased the learning rate accordingly. The detailed hyper-parameters and fine-tuning times for both stages can be found in the tables below, followed by a sketch of one continual-learning round.

| Datasets       | Hyper-parameters                                                 | Time (hours) |
|----------------|------------------------------------------------------------------|--------------|
| Until Jan 2023 | {learning_rate: 5e-5, max_sequence_length: 1024, batch_size: 2}  | 8.5          |
| Feb 2023       | {learning_rate: 3e-5, max_sequence_length: 1024, batch_size: 2}  | 4            |
| March 2023     | {learning_rate: 3e-5, max_sequence_length: 1024, batch_size: 2}  | 4.5          |

Table 3: The summary of hyper-parameters and fine-tuning time for the unsupervised fine-tuning

| Datasets       | Hyper-parameters                                                 | Time (hours) |
|----------------|------------------------------------------------------------------|--------------|
| Until Jan 2023 | {learning_rate: 3e-5, max_sequence_length: 512, batch_size: 4}   | 2.5          |
| Feb 2023       | {learning_rate: 2e-5, max_sequence_length: 512, batch_size: 4}   | 2.5          |
| March 2023     | {learning_rate: 2e-5, max_sequence_length: 512, batch_size: 4}   | 2.7          |

Table 4: The summary of hyper-parameters and fine-tuning time for the instruction fine-tuning
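To make the loop concrete, here is a sketch of one continual-learning round with xTuring, using the February 2023 values from Table 3. The checkpoint and data paths are placeholders, and the exact config attribute names follow xTuring's public API at the time; treat this as a sketch rather than our exact training script.

```python
from xturing.datasets.text_dataset import TextDataset
from xturing.models import BaseModel

# Load the previous month's unsupervised checkpoint (path is a placeholder).
model = BaseModel.load("./checkpoints/until_jan_2023")

# Decrease the learning rate for the incremental round (Table 3: 5e-5 -> 3e-5)
# to help prevent catastrophic forgetting.
finetuning_config = model.finetuning_config()
finetuning_config.learning_rate = 3e-5
finetuning_config.batch_size = 2

# Fine-tune on the 5:1 old/new text mix and save the new checkpoint.
dataset = TextDataset("./data/feb_2023_split")
model.finetune(dataset=dataset)
model.save("./checkpoints/feb_2023")
```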

Evaluation

We evaluated our model on three datasets from the BloombergGPT paper plus one additional dataset, using the same 5-shot method as the paper. The four datasets were Financial Phrasebank (FPB), FiQA SA, Headline, and an additional part of the FiQA dataset (FiQA-S) that was not used in the BloombergGPT evaluation.

Our evaluation method followed the likelihood-based classification approach described in Brown et al. (2020), and we reported the best performance for each model and task across calibration and normalization methods.

We compared our model's performance with BloombergGPT's by calculating F1 scores on the full dataset. We also performed bootstrapping using the lm_eval_harness library to obtain standard deviations for the metrics and gain confidence in them.
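For readers unfamiliar with the setup, the sketch below shows the core of likelihood-based classification: score each candidate label as a continuation of the prompt, pick the most likely one, and compute F1 over the gold labels. The weights path and label set are placeholders, and we use plain Hugging Face Transformers here rather than our internal evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics import f1_score

tokenizer = AutoTokenizer.from_pretrained("./xfinance")  # placeholder path
model = AutoModelForCausalLM.from_pretrained("./xfinance").eval()

LABELS = ("negative", "neutral", "positive")

def classify(prompt: str) -> str:
    """Pick the label whose tokens are most likely given the prompt.
    Assumes the prompt tokenization is a prefix of the full sequence."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    scores = []
    for label in LABELS:
        ids = tokenizer(prompt + " " + label, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # logits at position i predict token i+1, so shift by one.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        label_ids = ids[0, prompt_len:]
        positions = torch.arange(prompt_len - 1, ids.shape[1] - 1)
        scores.append(log_probs[positions, label_ids].sum().item())
    return LABELS[scores.index(max(scores))]

# preds = [classify(p) for p in prompts]
# print(f1_score(golds, preds, average="weighted"))  # e.g. weighted F1
```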

This demonstrates that it is possible to create a powerful, cost-effective financial NLP model without access to the vast resources of a major company like Bloomberg.

| Task               | xFinance | BloombergGPT |
|--------------------|----------|--------------|
| FPB                | 0.7283   | 0.5107       |
| Headline           | 0.8543   | 0.822        |
| FiQA SA (headline) | 0.774    | 0.7507       |
| FiQA SA (sentence) | 0.8271   | -            |

Table 5: FP16 model scores on finance sentiment tasks

Below are the highlights of our evaluation:

  • On the Financial Phrasebank dataset, our model showed excellent performance in a 5-shot setup, predicting sentiment in financial news sentences with high accuracy.

  • In the FiQA SA dataset, our model accurately predicted aspect-specific sentiment in English financial news and microblog headlines, demonstrating its versatility in handling different types of financial texts.

  • Our model also performed well on the Headline dataset, correctly classifying news headlines in the gold commodity domain based on the information included.

  • As an added bonus, our model showed promising results on the additional FiQA-S dataset, which consisted of sentences from financial news articles rather than headlines. This demonstrates that our model is capable of handling more context-heavy tasks as well.

We were also able to reach nearly the same metrics with INT8:

| Task               | xFinance | BloombergGPT |
|--------------------|----------|--------------|
| FPB                | 0.7119   | 0.5107       |
| Headline           | 0.8511   | 0.822        |
| FiQA SA (headline) | 0.7785   | 0.7507       |

Table 6: INT8 model scores on finance sentiment tasks
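For readers who want to try the INT8 setup, xTuring ships INT8 variants of its model keys (the "_int8" suffix), which load weights in 8-bit precision and roughly halve GPU memory relative to FP16; whether a given key matches your base model size is worth checking in the library docs. A minimal sketch:

```python
from xturing.models import BaseModel

# "_int8" model keys load weights in 8-bit precision (via bitsandbytes),
# roughly halving GPU memory relative to FP16.
model = BaseModel.create("llama_lora_int8")
output = model.generate(texts=["Is the outlook for gold bullish or bearish?"])
print(output)
```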

Cost analysis

Despite a budget of less than $1000, we were able to fine-tune xFinance to results competitive with BloombergGPT, demonstrating the efficiency of our approach in building high-quality language models with limited resources. By using cost-effective hardware configurations and techniques that prevent catastrophic forgetting, we trained our model across multiple datasets at low cost. The cost of fine-tuning each checkpoint is detailed in the table below, showing that with careful planning and execution it is possible to produce high-quality language models even on a small budget.

| Model          | Unsupervised fine-tuning cost ($) | Instruction fine-tuning cost ($) | Total cost ($) |
|----------------|-----------------------------------|----------------------------------|----------------|
| Until Jan 2023 | 95                                | 260                              | 355            |
| Feb 2023       | 50                                | 265                              | 315            |
| March 2023     | 55                                | 272                              | 327            |

Table 7: The summary of fine-tuning costs for each checkpoint
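Summing the three checkpoints gives $355 + $315 + $327 = $997 in total, keeping the entire pipeline just under the $1000 budget.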

Conclusion

In conclusion, the xFinance model, built using the xTuring library and internet data, has proven to be a cost-effective, high-performing alternative for financial NLP tasks. We're excited about the potential applications of our approach and can't wait to share more in upcoming blog posts.

If you're interested in fine-tuning models for your domain or want to explore how our other cutting-edge solutions can help your business, please don't hesitate to get in touch. Drop us a line at hello@stochastic.ai, and we'll be happy to discuss how our AI expertise can benefit your organization.