
A complete guide to fine-tuning Code Llama

In this guide I show you how to fine-tune Code Llama to become a beast of an SQL developer. For coding tasks, you can generally get much better performance out of Code Llama than Llama 2, especially when you specialise the model on a particular task:

  • I use the b-mc2/sql-create-context dataset, which is a collection of natural-language questions and their corresponding SQL queries
  • A LoRA approach: quantizing the base model to int8, freezing its weights and only training an adapter
  • Much of the code is borrowed from alpaca-lora, but I refactored it quite a bit for this

I used an A100 GPU machine with Python 3.10 and CUDA 11.8 to run this notebook. It took about an hour to run. (I also tested that this code works on Colab Pro.)

*This is the corresponding notebook.

1. Pip installs

!pip install git+ bitsandbytes  # we need latest transformers for this
!pip install git+
!pip install datasets==2.10.1
import locale # colab workaround
locale.getpreferredencoding = lambda: "UTF-8" # colab workaround
!pip install wandb
!pip install scipy

2. Loading libraries

from datetime import datetime
import os
import sys

import torch
from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_int8_training,
)
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq

(If you have import errors, try restarting your Jupyter kernel)

3. Load dataset

This pulls the dataset from the Hugging Face Hub and holds out 10% of it as an evaluation set to check how well the model is doing during training:

from datasets import load_dataset
dataset = load_dataset("b-mc2/sql-create-context", split="train")
# Split once: calling train_test_split twice reshuffles, so eval rows could leak into train
split = dataset.train_test_split(test_size=0.1)
train_dataset = split["train"]
eval_dataset = split["test"]
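
To make the hold-out concrete, here is a minimal pure-Python sketch of a 90/10 split. The helper name split_once and the seed are my own assumptions and not part of the datasets API. The point it illustrates is that the data is shuffled and cut a single time, so no example ends up in both sets.

```python
import random

def split_once(rows, test_size=0.1, seed=0):
    """Shuffle a copy of rows once and cut it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for dataset rows
train_rows, eval_rows = split_once(rows)
print(len(train_rows), len(eval_rows))
# Both sets come from the same shuffle, so they cannot overlap
print(set(train_rows) & set(eval_rows))
```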

If you want to load your own dataset do this:

train_dataset = load_dataset('json', data_files='train_set.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='validation_set.jsonl', split='train')

And if you want to view any samples in the dataset, just index into it, for example:

print(train_dataset[3])

4. Load model

I load Code Llama from Hugging Face in int8, the standard for LoRA:

base_model = "codellama/CodeLlama-7b-hf"
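model = AutoModelForCausalLM.from_pretrained(
    base_model, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto"  # device_map is an assumed setting
)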
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

torch_dtype=torch.float16 means computations are performed using a float16 representation, even though the values themselves are 8 bit ints.
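
As a rough picture of what "8 bit values, float16 computations" means, here is a toy absmax quantizer in plain Python. It is a deliberately simplified sketch: the function names are hypothetical, and the scheme bitsandbytes actually uses is more elaborate than one shared scale per list.

```python
def quantize_absmax(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate floats for use in higher-precision arithmetic."""
    return [q * scale for q in q_weights]

weights = [0.12, -0.5, 0.33, 0.01]
q_weights, scale = quantize_absmax(weights)  # stored as small integers
restored = dequantize(q_weights, scale)      # used for the actual maths
print(q_weights)
print(restored)
```

The stored integers are cheap to keep in memory, while the restored values are only approximately equal to the originals, which is the trade-off quantization makes.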

If you get the error "ValueError: Tokenizer class CodeLlamaTokenizer does not exist or is not currently imported.", make sure your transformers version is 4.33.0.dev0 and accelerate is >=0.20.3.

5. Check base model

Check whether the model can already do what we want it to do:

eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.
### Input:
Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?

### Context:
CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)

### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))

I get the output:

SELECT * FROM table_name_12 WHERE class > 91.5 AND city_of_license = 'hyannis, nebraska'

which is clearly wrong: the question asks only for class, and the filter should be on frequency_mhz rather than class. So, on with the fine-tuning!
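
Because generate returns the prompt tokens followed by the completion, the decoded string contains the whole prompt as well. A small helper like the one below can pull out just the answer. The name extract_response is hypothetical, and the sketch assumes the "### Response:" marker from the prompt above.

```python
def extract_response(decoded_text, marker="### Response:"):
    """Return the text after the last marker, stripped of whitespace."""
    _, found, tail = decoded_text.rpartition(marker)
    # If the marker is missing, fall back to the full text
    return tail.strip() if found else decoded_text.strip()

decoded = (
    "### Input:\nWhich Class ...?\n\n"
    "### Response:\nSELECT * FROM table_name_12"
)
print(extract_response(decoded))
```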

6. Tokenization

Set up some tokenization settings, like left padding, because it makes training use less memory:

tokenizer.add_eos_token = True
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"
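
To show what left padding with pad id 0 looks like, here is a pure-Python sketch that pads a small batch and builds the matching attention masks. The helper pad_batch is my own illustration and not the tokenizer's actual implementation.

```python
PAD_ID = 0  # same value as tokenizer.pad_token_id above

def pad_batch(sequences, side="left"):
    """Pad token-id lists to equal length and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [PAD_ID] * (width - len(seq))
        mask_pad = [0] * len(padding)  # 0 marks positions to ignore
        if side == "left":
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + [1] * len(seq))
        else:
            input_ids.append(seq + padding)
            attention_mask.append([1] * len(seq) + mask_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8, 9]], side="left")
print(ids)   # [[5, 6, 7], [0, 8, 9]]
print(mask)  # [[1, 1, 1], [0, 1, 1]]
```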

Set up the tokenize function to make labels and input_ids the same. This is basically what self-supervised fine-tuning is:

def tokenize(prompt):
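    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,  # assumed cap; adjust to your data
        padding=False,  # padding is done per batch by the data collator
        return_tensors=None,  # keep plain Python lists so .copy() works
    )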

    # "self-supervised learning" means the labels are also the inputs:
    result["labels"] = result["input_ids"].copy()

    return result
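
Copying input_ids into labels works because causal language models in transformers shift the labels by one position internally when computing the loss, so each position is scored on the next token. The sketch below spells out the resulting training pairs in plain Python. The function name is hypothetical and the token ids are made up.

```python
def next_token_pairs(input_ids):
    """Pair each prefix with the token the model should predict next."""
    labels = list(input_ids)  # labels start out as a copy of the inputs
    # Shift by one: the prefix ending at position i is scored against token i + 1
    return [(input_ids[: i + 1], labels[i + 1]) for i in range(len(labels) - 1)]

pairs = next_token_pairs([11, 22, 33, 44])
for context, target in pairs:
    print(context, "->", target)
```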

Then convert each data_point into a prompt (one I found online that works quite well):

def generate_and_tokenize_prompt(data_point):
    full_prompt =f"""You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.

### Input:
{data_point["question"]}

### Context:
{data_point["context"]}

### Response:
{data_point["answer"]}
"""
    return tokenize(full_prompt)

Reformat to prompt and tokenize each sample into our tokenized train and eval datasets:

tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

7. Setup Lora

Set up a standard LoRA config and attach it to the base model:

model.train() # put model back into training mode
model = prepare_model_for_int8_training(model)

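# The hyperparameter values below are assumed typical LoRA settings
config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)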
model = get_peft_model(model, config)
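
To get a feel for how small the adapter is, here is a back-of-the-envelope count of trainable LoRA parameters. The rank of 16 and the four attention projections as targets are assumptions on my part. The hidden size of 4096 and the 32 layers describe the 7B model.

```python
def lora_params(d_in, d_out, r):
    """A is r x d_in and B is d_out x r, so an adapter adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

hidden_size = 4096     # hidden size of the 7B model
num_layers = 32        # transformer layers in the 7B model
rank = 16              # assumed LoRA rank
modules_per_layer = 4  # assumed targets: q_proj, k_proj, v_proj, o_proj

trainable = lora_params(hidden_size, hidden_size, rank) * modules_per_layer * num_layers
print(f"{trainable:,}")  # 16,777,216
```

Under these assumptions the adapter holds roughly 17 million weights, a fraction of a percent of the 7 billion frozen ones.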

Optional stuff to set up Weights and Biases to view training graphs:

wandb_project = "sql-try2-coder"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project

if torch.cuda.device_count() > 1:
    # keeps Trainer from trying its own DataParallelism when more than 1 gpu is available
    model.is_parallelizable = True
    model.model_parallel = True

8. Training

All the variables are standard stuff that I wouldn't recommend messing with:

batch_size = 128
per_device_train_batch_size = 32
gradient_accumulation_steps = batch_size // per_device_train_batch_size
output_dir = "sql-code-llama"

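training_args = TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        output_dir=output_dir,
        # ... remaining hyperparameters (learning rate, step counts, etc.) go here
        evaluation_strategy="steps", # if val_set_size > 0 else "no",
        group_by_length=True, # group sequences of roughly the same length together to speed up training
        report_to="wandb", # if use_wandb else "none",
        run_name=f"codellama-{datetime.now().strftime('%Y-%m-%d-%H-%M')}", # if use_wandb else None,
    )

trainer = Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=training_args,
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
)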
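
To see why group_by_length and pad_to_multiple_of=8 matter, the sketch below counts pad tokens for batches of mixed lengths versus batches of similar lengths. It is a simplified illustration with made-up sequence lengths and hypothetical helper names. The Trainer's real sampler is randomized rather than a plain sort.

```python
import math

def padded_length(lengths, multiple=8):
    """Length every sequence in a batch is padded to."""
    return math.ceil(max(lengths) / multiple) * multiple

def total_padding(batches):
    """Count the pad tokens added across all batches."""
    return sum(padded_length(b) * len(b) - sum(b) for b in batches)

lengths = [12, 95, 14, 90, 10, 88]  # made-up token counts
mixed = [lengths[0:3], lengths[3:6]]
ordered = sorted(lengths)
grouped = [ordered[0:3], ordered[3:6]]
print(total_padding(mixed), total_padding(grouped))  # 267 27
```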

(If you run out of GPU memory, change per_device_train_batch_size. The gradient_accumulation_steps variable should ensure this doesn't affect batch dynamics during the training run.)
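
The arithmetic behind that claim, as a small sketch assuming a single GPU: the effective batch size is the per-device batch size times the number of accumulation steps, so shrinking one grows the other. The helper name is my own.

```python
def accumulation_steps(batch_size, per_device_train_batch_size):
    """Micro-batches to accumulate before each optimizer step."""
    if batch_size % per_device_train_batch_size != 0:
        raise ValueError("batch_size should be a multiple of the per-device size")
    return batch_size // per_device_train_batch_size

for per_device in (32, 16, 8):
    steps = accumulation_steps(128, per_device)
    # per-device size, accumulation steps, effective batch size
    print(per_device, steps, per_device * steps)
```

Each row keeps the effective batch size at 128, which is why lowering per_device_train_batch_size trades memory for time without changing the size of each optimizer update.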

Then we apply some PyTorch-related optimisations, which just make training faster but don't affect accuracy:

model.config.use_cache = False

old_state_dict = model.state_dict
model.state_dict = (lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())).__get__(
    model, type(model)
)
if torch.__version__ >= "2" and sys.platform != "win32":
    print("compiling the model")
    model = torch.compile(model)

trainer.train()

This ^ will run for about 1 hour on an A100.
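
The state_dict patch in the training code relies on a Python trick: calling __get__ on a function binds it to an instance, so it can replace a method on just that one object. Here is a toy version with a stand-in class. ToyModel and adapter_only are hypothetical names, and adapter_only only imitates what get_peft_model_state_dict does. As far as I can tell, the effect in the real code is that saved checkpoints contain only the adapter weights.

```python
class ToyModel:
    def state_dict(self):
        return {"base.weight": 1.0, "lora_A.weight": 0.1, "lora_B.weight": 0.2}

def adapter_only(state):
    """Stand-in for get_peft_model_state_dict: keep only LoRA entries."""
    return {k: v for k, v in state.items() if "lora_" in k}

toy = ToyModel()
old_state_dict = toy.state_dict  # bound method captured before patching
# function.__get__(instance, cls) returns a method bound to that instance
toy.state_dict = (lambda self, *_, **__: adapter_only(old_state_dict())).__get__(
    toy, type(toy)
)
print(toy.state_dict())
```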

Load the final checkpoint

Now for the moment of truth! Has our work paid off...?

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

base_model = "codellama/CodeLlama-7b-hf"
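model = AutoModelForCausalLM.from_pretrained(
    base_model, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto"  # device_map is an assumed setting
)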
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

To load a fine-tuned LoRA/QLoRA adapter, use PeftModel.from_pretrained. output_dir should be a directory containing an adapter_config.json and adapter_model.bin:

from peft import PeftModel
model = PeftModel.from_pretrained(model, output_dir)
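
If loading fails, a quick sanity check is to confirm the directory really holds the two files mentioned above. This is a minimal sketch with a hypothetical helper name. Note that newer peft versions may write adapter_model.safetensors instead of adapter_model.bin, so adjust the expected names if needed.

```python
from pathlib import Path
import tempfile

def missing_adapter_files(adapter_dir):
    """Return the expected adapter files that are absent from adapter_dir."""
    expected = ("adapter_config.json", "adapter_model.bin")
    return [name for name in expected if not (Path(adapter_dir) / name).is_file()]

# Demo on a throwaway directory that only has the config file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "adapter_config.json").write_text("{}")
    missing = missing_adapter_files(tmp)

print(missing)  # ['adapter_model.bin']
```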

Try the same prompt as before:

eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.
### Input:
Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?

### Context:
CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)

### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))

And the model outputs:

SELECT class FROM table_name_12 WHERE frequency_mhz > 91.5 AND city_of_license = "hyannis, nebraska"
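
One way to go beyond eyeballing the query is to run it against a throwaway database. The sketch below uses SQLite from the standard library with a few made-up rows. Both SQLite and the rows are my own choices for illustration, and I swapped the double quotes in the model's output for single quotes, which is the standard SQL way to write a string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)"
)
# Made-up rows, for illustration only
conn.executemany(
    "INSERT INTO table_name_12 VALUES (?, ?, ?)",
    [
        ("A", "98.3", "hyannis, nebraska"),
        ("B", "90.1", "hyannis, nebraska"),
        ("C", "99.9", "omaha, nebraska"),
    ],
)

base_sql = "SELECT * FROM table_name_12 WHERE class > 91.5 AND city_of_license = 'hyannis, nebraska'"
tuned_sql = "SELECT class FROM table_name_12 WHERE frequency_mhz > 91.5 AND city_of_license = 'hyannis, nebraska'"

base_rows = sorted(conn.execute(base_sql).fetchall())
tuned_rows = sorted(conn.execute(tuned_sql).fetchall())
print(base_rows)   # whole rows, filtered on the wrong column
print(tuned_rows)  # just the class that matches the question
conn.close()
```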

So it works! We've fine-tuned a model and it actually improves. If you have any questions, shoot me an email at sam[at]