GPT 3.5 vs Llama 2 fine-tuning: A Comprehensive Comparison

In this post, I document my experiments benchmarking the fine-tuning of GPT 3.5 against Llama 2 on an SQL task and a functional representation task. Overall:

  • GPT 3.5 performs marginally better than a LoRA fine-tuned CodeLlama 34B (the model I found to work best) on both datasets
  • GPT 3.5 costs 4-6x more to train (and even more to deploy)

Code and data for the SQL task are here. Code and data for the functional representation task are here.

Why do this? Fine-tuning GPT 3.5 is expensive. I wanted to run these experiments to see whether manually fine-tuned models can come even close to the performance of GPT 3.5 at a fraction of the cost. And funnily enough, they do!

Results

[Figure: accuracy of the fine-tuned CodeLlama 34B and GPT 3.5 models on the two tasks]

The performance of CodeLlama 34B and GPT 3.5 trained to convergence on the SQL task and the functional representation task. GPT 3.5 achieves slightly better accuracy on both tasks. For the SQL task I also used execution accuracy as a metric, which compares the results of running the generated and reference queries on dummy databases. (Exact-match accuracy is just a character-level comparison.)
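
To make the distinction concrete, here is a minimal sketch of exact-match accuracy as described above; the whitespace/case normalisation is my own assumption rather than the exact code used in the linked repos.

def exact_match(predicted_sql: str, reference_sql: str) -> bool:
    """Character-level comparison after trivial whitespace/case normalisation."""
    normalise = lambda q: " ".join(q.strip().rstrip(";").split()).lower()
    return normalise(predicted_sql) == normalise(reference_sql)

# Two semantically identical queries can still fail exact match...
print(exact_match(
    "SELECT name FROM head WHERE age > 56",
    "SELECT head.name FROM head WHERE head.age > 56",
))  # False -- which is why execution accuracy is also reported for SQL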

Training costs

Dataset                               Code Llama 34B        GPT 3.5
SQL Spider dataset                    $3.32 (7-hour job)    $11.99
Functional representation dataset     $4.27 (9-hour job)    $26.05

For reference, I used an A40 GPU, available for $0.475 per hour on the decentralised GPU marketplace vast.ai. (Code Llama 34B quantized to int8 takes up slightly more than 40GB of GPU memory, so 40GB A100s were not an option.)
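
For context, loading the base model in 8-bit looks roughly like this with transformers and bitsandbytes. This is a sketch rather than the exact loading code in the linked repos, and the checkpoint id is the public Code Llama 34B base model.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = "codellama/CodeLlama-34b-hf"  # public base checkpoint (assumed; see the linked repo for the exact one)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # int8 weights: fits a 48GB A40 but not a 40GB A100
    device_map="auto",  # place layers on the available GPU(s)
)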

Experiment setup

I use a subset of the Spider dataset and the Viggo functional representation dataset. These are great datasets for fine-tuning:

  • They teach the model the form of the desired outputs rather than new facts, and both SQL and the functional representation task call for very structured outputs. (Credit to Anyscale for this wisdom.)
  • The pre-trained models can't do either of the tasks very well out of the box.
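
For reference, both datasets (and the sql-create-context variant discussed below) are available on the Hugging Face hub. A loading sketch; the hub ids here are my assumption, so check the linked repos for the exact data preparation:

from datasets import load_dataset

# Hub ids are assumptions -- the linked repos contain the exact subsets used.
spider = load_dataset("spider", split="train")                             # text-to-SQL
create_context = load_dataset("b-mc2/sql-create-context", split="train")  # CREATE TABLE contexts
viggo = load_dataset("GEM/viggo", split="train")                          # functional representation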

For GPT 3.5 fine-tuning, OpenAI only allows the number of epochs to be configured, and they recommend letting them choose it. Therefore, to make this a fair comparison with OpenAI, I did minimal hyperparameter tuning with Llama: I allowed OpenAI to choose the number of epochs and trained Llama to convergence on the eval set.
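
For reference, kicking off the GPT 3.5 job looks roughly like this with the openai Python client (v1-style client assumed; the file name is a placeholder, and leaving hyperparameters unset lets OpenAI pick the epoch count as described above).

from openai import OpenAI

client = OpenAI()

# Upload the chat-formatted JSONL training data (placeholder file name)
training_file = client.files.create(
    file=open("spider_train.jsonl", "rb"),
    purpose="fine-tune",
)

# Omitting hyperparameters lets OpenAI choose the number of epochs;
# the epoch count is the only knob you can set explicitly.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id)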

Llama architecture

The two key decisions I made were to use Code Llama 34B and LoRA fine-tuning rather than full-parameter fine-tuning:

  • OpenAI is very likely doing some kind of adapter/non-full-parameter fine-tuning. (They can’t possibly be managing and cold-booting multiple instances of a 175B model. Please tell me if you know otherwise.)
  • Anyscale’s other blog post points to LoRA being pretty much on par with full-parameter tuning for tasks like SQL and functional representation.

I stuck largely with conventional wisdom on LoRA hyperparameters. The LoRA adapter config I used is:

from peft import LoraConfig

# LoRA adapter on the attention projection matrices of the base model
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

I experimented with targeting all linear layers with the adapter (as the QLoRA paper suggests) and found little performance improvement. Similarly, increasing r to 16 merely consumed more compute while providing little performance benefit.
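
Attaching the adapter to the 8-bit base model is then a couple of lines with peft. This is a sketch continuing from the config above and the 8-bit model loaded earlier; prepare_model_for_kbit_training is the usual preprocessing step for quantized training, though the exact training script lives in the linked repos.

from peft import get_peft_model, prepare_model_for_kbit_training

# `model` is the 8-bit Code Llama 34B from the earlier sketch, `config` the LoraConfig above
model = prepare_model_for_kbit_training(model)  # gradient checkpointing, norms in fp32, etc.
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the LoRA matrices on the q/k/v/o projections are trainable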

Datasets

An example SQL prompt is:

You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.
### Input:
Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?

### Context:
CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)

### Response:

Instead of using the full Spider dataset, which represents database schemas like this:

department : Department_ID [ INT ] primary_key Name [ TEXT ] Creation [ TEXT ] Ranking [ INT ] Budget_in_Billions [ INT ] Num_Employees [ INT ] head : head_ID [ INT ] primary_key name [ TEXT ] born_state [ TEXT ] age [ INT ] management : department_ID [ INT ] primary_key management.department_ID = department.Department_ID head_ID [ INT ] management.head_ID = head.head_ID temporary_acting [ TEXT ]

I chose to use the intersection of the sql-create-context dataset and the Spider dataset, so the context given to the model is an SQL CREATE statement like this:

CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)

(I did this purely because it saved me OpenAI tokens.)
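
Each sql-create-context record already carries the question, the CREATE TABLE context, and the answer, so building a training example is simple templating. A sketch, assuming the record fields are named question, context, and answer (the exact template is in the linked repo):

PROMPT_TEMPLATE = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.
### Input:
{question}

### Context:
{context}

### Response:
"""

def build_example(record: dict) -> str:
    # record is a sql-create-context row with "question", "context", and "answer" fields
    return PROMPT_TEMPLATE.format(**record) + record["answer"]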

An example functional representation prompt is:

Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']

### Target sentence:
I remember you saying you found Little Big Adventure to be average. Are you not usually that into single-player games on PlayStation?

### Meaning representation:

The output would then be:

verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation])
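
For evaluation, an output like this can be parsed back into the function name and its attribute/value pairs with a small regex. This is a sketch of my own rather than the Viggo benchmark's official scorer:

import re

def parse_meaning_representation(output: str):
    """Split e.g. 'verify_attribute(name[X], rating[average])' into its parts."""
    match = re.match(r"\s*(\w+)\((.*)\)\s*$", output)
    if not match:
        return None
    function_name, body = match.groups()
    attributes = dict(re.findall(r"(\w+)\[([^\]]*)\]", body))
    return function_name, attributes

print(parse_meaning_representation(
    "verify_attribute(name[Little Big Adventure], rating[average], "
    "has_multiplayer[no], platforms[PlayStation])"
))
# -> ('verify_attribute', {'name': 'Little Big Adventure', 'rating': 'average',
#     'has_multiplayer': 'no', 'platforms': 'PlayStation'})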

Evaluation

Both models converged very quickly:

[Figure: eval-set loss curves for the SQL task (left) and the functional representation task (right)]

Graphs showing the loss of the models on the eval sets over the course of training. The SQL task (left) starts to overfit fairly soon into the job.

For the SQL task, I also used the Spider eval repo to calculate the execution accuracy of the generated queries. The repo sets up dummy databases, and the ground-truth outputs are compared with the outputs of running the queries produced by GPT 3.5 and Llama 2.
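
The idea behind execution accuracy, in miniature: run the predicted and gold queries against the same dummy database and compare result sets instead of query strings. A sqlite3 sketch; the real Spider harness is considerably more involved (value matching, ordering rules, partial scores):

import sqlite3

def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries return the same rows on the same dummy database."""
    conn = sqlite3.connect(db_path)
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as a miss
    finally:
        conn.close()
    # Compare as multisets so row order does not matter
    return sorted(map(repr, predicted_rows)) == sorted(map(repr, gold_rows))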

Conclusion

Overall, this experience left me with the feeling that fine-tuning GPT 3.5 is useful for initial validation/MVP work, but beyond that a model like Llama 2 is your best bet:

Why fine-tune GPT 3.5?

  • You want to validate that fine-tuning is the right approach to solve a given task/dataset
  • You want a fully-managed experience

Why fine-tune an open-source model like Llama 2?

  • You want to save money!
  • You want to squeeze maximum performance from a dataset
  • You want full flexibility in train and deploy infrastructure
  • You want to hold on to some private data

Please email me (sam at ragntune.com) if you have any questions or comments. Also let me know if you have any FAAS (fine-tuning as a service) recommendations! I'm eager to try a managed experience for open-source models...