Learning by Doing: Diving into InstructLab

Introduction

It’s been about a month since I started working with the InstructLab project and getting into Red Hat’s AI portfolio and strategy. In this blog post, I’ll share my early experiences with InstructLab and provide insights on how to easily get started with the tool at home.

My goal is to help you become more familiar with this open-source project and understand how simple it is to use. Keep in mind, I’m still learning about this technology, so there’s more for me to explore and understand about how this project works and how it will be used in the future.

What is InstructLab?

InstructLab is an open-source project created by IBM and Red Hat to enhance large language models (LLMs) for generative AI applications. The project’s goal is to make LLM tuning more cost-effective and accessible. Whether you are an expert data scientist or a sales rep, you should be able to use this tool without trouble.

Here are the resources I used to get started, both for learning about InstructLab as a project and for using it practically; I link to each of them in the sections below.

Key features of InstructLab to understand:

  1. Enhance LLMs with less data, resources, and manual effort: Uses less human-generated data and computing resources to contribute to and tune LLMs.
  2. Open-source community approach: Being open-source allows for continuous improvement of both the models and the project itself through community contributions.
  3. Versatile and model-agnostic: You can bring your own model (BYO-model) to customize and tune locally or for your business needs, providing flexibility.
  4. LAB Method: InstructLab leverages IBM Research’s Large-scale Alignment for chatBots (LAB) method. Essentially, you input a small amount of data manually, and the “synthetic data generators” use that input to create a large data set, which then tunes the model. This process significantly reduces the manual effort typically required, as models need large data inputs to make meaningful improvements in their behavior.

Getting started

Considerations

When using InstructLab, as of now, it’s best to use an M1/M2/M3 Mac or a computer with a dedicated GPU. If you don’t have access to either, consider trying to find a remote system with a GPU. Red Hat employees can utilize the InstructLab-specific demo environment available in the Red Hat Demo Platform for short reservations. That’s what I’m currently using. Note that RHDP currently also requires some prep work, so please reach out if you’re a Red Hatter looking to use this and need assistance.

You can also use InstructLab without a GPU, but be aware that some tasks, especially model training, will take significantly longer. Make the most of the resources you have! You can still get started with the basics and gain valuable experience regardless of your hardware setup. I did a portion of the tasks below on my Lenovo ThinkPad T14 Gen 4 and was still able to gain some experience.
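If you’re not sure what hardware you’re working with, it’s worth checking for an NVIDIA GPU before you start. As a quick sketch, either of these works on most Linux systems (nvidia-smi only responds once the NVIDIA drivers are installed):

$ lspci | grep -i nvidia
$ nvidia-smi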

Let’s begin

Follow along with the GitHub instructions to install and use InstructLab and the ilab CLI tool. These instructions are very clear and easy to follow, guiding you through all tasks from installing ilab to training a model. Below, I’ve outlined exactly what I did, but be sure to refer back to the GitHub instructions for the most up-to-date information and commands tailored to your setup.

Installation

Install dependencies

$ sudo dnf install gcc gcc-c++ make git python3.11 python3.11-devel
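These package names are for Fedora/RHEL-style systems (hence dnf); adjust them for your distribution’s package manager. A quick sanity check that the expected Python version is available:

$ python3.11 --version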

Set up the InstructLab directory

$ mkdir instructlab
$ cd instructlab

Set up a virtual environment and install with NVIDIA CUDA acceleration

$ python3 -m venv --upgrade-deps venv
$ source venv/bin/activate
$ pip cache remove llama_cpp_python
$ pip install 'instructlab[cuda]' \
   -C cmake.args="-DLLAMA_CUDA=on" \
   -C cmake.args="-DLLAMA_NATIVE=off"

Note: I definitely recommend using virtual environments consistently when working with Python, to isolate the installed components.
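For example, any time you come back to InstructLab in a fresh terminal session, re-enter the environment before running any ilab commands, and leave it when you’re done:

$ cd instructlab
$ source venv/bin/activate   # re-enter the environment
$ deactivate                 # leave it when finished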

Initializing ilab

$ ilab config init
Welcome to InstructLab CLI. This guide will help you to setup your environment.
Please provide the following values to initiate the environment [press Enter for defaults]:
Path to taxonomy repo [taxonomy]:
'taxonomy' seems to not exist or is empty. Should I clone https://github.com/instructlab/taxonomy.git for you? [y/N]: y
Cloning https://github.com/instructlab/taxonomy.git
Generating 'config.yaml' in the current directory...
Initialization completed successfully, you're ready to start using 'ilab'. Enjoy!

Let’s check out the config.yaml that we just created to see what our default setup looks like. Of course, you can make changes if needed.

config.yaml

chat:
  context: default
  greedy_mode: false
  logs_dir: data/chatlogs
  max_tokens: null
  model: models/merlinite-7b-lab-Q4_K_M.gguf
  session: null
  vi_mode: false
  visible_overflow: true
general:
  log_level: INFO
generate:
  chunk_word_count: 1000
  model: models/merlinite-7b-lab-Q4_K_M.gguf
  num_cpus: 10
  num_instructions: 100
  output_dir: generated
  prompt_file: prompt.txt
  seed_file: seed_tasks.json
  taxonomy_base: origin/main
  taxonomy_path: taxonomy
serve:
  gpu_layers: -1
  host_port: 127.0.0.1:8000
  max_ctx_size: 4096
  model_path: models/merlinite-7b-lab-Q4_K_M.gguf

Downloading the model

By default, ilab uses a Merlinite model (4-bit quantized version) from Hugging Face. However, you can and should download additional models to experiment with different ones! I downloaded a few, including the Granite-code models.

To change the default model, simply edit the config.yaml file in the InstructLab folder you accessed in the previous step. Alternatively, you can pass in a different model as a parameter. This flexibility allows you to tailor your model selection to best fit your specific needs and objectives.

$ ilab model download
Downloading model from instructlab/merlinite-7b-lab-GGUF@main to models...
...

The above command downloads the default model as defined in config.yaml.
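To pull a different model instead, ilab model download accepts repository and filename parameters. As a rough sketch, this is how I grabbed a quantized Granite model; the exact repository and filename here are from memory, so double-check them on Hugging Face before running it:

$ ilab model download --repository instructlab/granite-7b-lab-GGUF --filename granite-7b-lab-Q4_K_M.gguf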

Serving the model

$ ilab model serve

Serving deploys the model so you can interact with it. Running the command with no parameters serves the default Merlinite model, which is what I used.
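Behind the scenes, serving exposes an OpenAI-compatible API on the host_port defined in config.yaml (127.0.0.1:8000 by default). As a quick sanity check, you can probe it with curl from another terminal; the JSON below follows the standard OpenAI chat completions convention, so adjust the model value to whatever you’re serving:

$ curl -s http://127.0.0.1:8000/v1/models
$ curl -s http://127.0.0.1:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{"model": "models/merlinite-7b-lab-Q4_K_M.gguf", "messages": [{"role": "user", "content": "Say hello"}]}'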

Chat with the model

With the model served in one terminal window, open up a secondary terminal window for a chat.

$ cd instructlab
$ source venv/bin/activate
$ ilab model chat
Welcome to InstructLab Chat w/ MODELS/MERLINITE-7B-LAB-Q4_K_M.GGUF

See what it knows!

>>> Who is Chappell Roan?
Chappell Roan is a former American football player who spent the majority of his career as a tight end in the National Football League (NFL). He played college football for Texas A&M University before being selected by the New York Jets in the 1978 NFL Draft. During his time with the Jets, Roan established himself as a reliable and productive player, contributing to the team's offensive successes. After spending six seasons with the Jets, he joined the Detroit Lions for two more years before concluding his professional football career in 1986. Since then, Roan has maintained a relatively low profile in public life, focusing on personal and family matters. 

If you know who Chappell Roan is, this ain’t it. So I thought this would be a good piece of info to test with to see if I can help the model provide accurate information.

Contribute knowledge or skills

Before completing this step, I recommend reading the LAB research paper. It provides valuable insight into how LAB works and explains how the taxonomy setup of information feeds into the synthetic data generation and subsequent training process.

To start contributing knowledge, follow the “Getting Started with Knowledge Contributions” directions on GitHub. Copy the example files provided and substitute your own information. This is the simpler way to test out functionality, as adding skills can be a bit more complex initially. Note that you’ll need a GitHub account for this process.

Create a qna.yaml file and attribution.txt file locally. These files will be placed in the taxonomy directory structure.

Directory structure I used: instructlab/taxonomy/knowledge/music/pop/chappell_roan
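Creating that structure is just a couple of shell commands from the instructlab directory; the qna.yaml and attribution.txt files shown below then go inside the new folder:

$ mkdir -p taxonomy/knowledge/music/pop/chappell_roan
$ touch taxonomy/knowledge/music/pop/chappell_roan/{qna.yaml,attribution.txt}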

qna.yaml

task_description: 'Teach the model about Chappell Roan, the pop singer'
created_by: taylorjordanNC
domain: pop
seed_examples:
  - question: Who is Chappell Roan?
    answer: |
      Chappell Roan, whose birth name is "Kayleigh Rose Amstutz", is an American pop singer and songwriter
  - question: What is Chappell Roan's music and style influenced by?
    answer: |
      Chappell Roan is influenced heavily by drag queens, and her music and performing style have been described as "campy".
  - question: What was Chappell Roan's debut album called?
    answer: |
      Chappell Roan's debut album is entitled "The Rise and Fall of a Midwest Princess"
  - question: Where and when was Chappell Roan born?
    answer: |
      Chappell Roan was born on February 19th, 1998 in Willard, Missouri in the United States of America.
  - question: What are some famous songs that Chappell Roan sang?
    answer: |
      "Red Wine Supernova", "Good Luck, Babe!", "Pink Pony Club" and "HOT TO GO!" are four singles released by Chappell Roan that have been very successful.
  - question: How long has Chappell Roan been a professional singer?
    answer: Chappell Roan began singing professionally when she was 16 years old, in 2014.
document:
  repo: https://github.com/taylorjordanNC/chappell_roan_knowledge.git
  commit: 273e8be
  patterns:
    - chappell_roan_bio.md

At the bottom of the file, you can see that a GitHub repository is referenced. This repository contains a markdown file with additional content. Here is mine for reference: https://github.com/taylorjordanNC/chappell_roan_knowledge

attribution.txt

Title of work: Chappell Roan
Link to work: https://en.wikipedia.org/wiki/Chappell_Roan
License of the work: CC-BY-SA-4.0
Creator names: Wikipedia Authors

Contributing to an open-source model means the data you use must be openly licensed. I followed the provided examples by leveraging Wikipedia as my data source.

Validate your new taxonomy additions

Stop serving the model and quit the chat if you still have it running.

$ ilab taxonomy diff
knowledge/music/pop/chappell_roan/qna.yaml
Taxonomy in taxonomy is valid :)

Generate synthetic data

This task will take your manual data input and generate a large dataset for the subsequent model training step. The duration of this process depends on your infrastructure.

To expedite this, you can pass a parameter to the synthetic data generator to limit the amount of data that is generated, thereby reducing processing time. However, this can make the data less impactful to the model. I tried this parameter and ended up with a model that still wasn’t very sure who Chappell Roan was, so I’m recommending the command here without it (I’ll show the capped variant after the output below for reference):

$ ilab data generate
Generating synthetic data using 'models/merlinite-7b-lab-Q4_K_M.gguf' model, taxonomy:'taxonomy' against http://127.0.0.1:28141/v1 server
Cannot find prompt.txt. Using default prompt depending on model-family.
  0%|                                                                                                                       | 0/100 [00:00<?, ?it/s]Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-07-22 19:27:19,315 generate_data.py:505: generate_data Selected taxonomy path knowledge->music->pop->chappell_roan
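For reference, here is roughly what my capped run looked like. If I’m remembering the flag correctly, it mirrors the num_instructions key in config.yaml, but double-check ilab data generate --help on your version:

$ ilab data generate --num-instructions 10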

Check out my video showing ilab setup -> synthetic data generation

Several weeks ago, I recorded myself working with InstructLab up until the point of model training. Check it out if you are a visual learner or want to see the above steps in action.

Model training

Now that we have the larger synthetic data set about Chappell Roan, it’s time to train the model.

I am passing the --device=cuda parameter to enable GPU acceleration.

$ ilab model train --device=cuda
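Training takes a while even with GPU acceleration. If you want to confirm the GPU is actually being exercised, keep an eye on it from a second terminal:

$ watch -n 1 nvidia-smi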

Let’s see the output:

$ ls models
ggml-model-f16.gguf  ibm-granite  merlinite-7b-lab-Q4_K_M.gguf

ggml-model-f16.gguf is our newly trained model, produced by the training step.

Testing out the new model

Since I’m using a server with an NVIDIA GPU, I have to skip the ilab model test and ilab model convert steps that you’ll see in the GitHub instructions. These steps are only supported on Apple M-series chips.

Serve the newly trained model:

$ ilab model serve --model-path models/ggml-model-f16.gguf

Now, let’s open a chat window again to interact with it.

$ cd instructlab
$ source venv/bin/activate
$ ilab model chat
Welcome to InstructLab Chat w/ MODELS/GGML-MODEL-F16.GGUF

Now comes the big question…will it give me more accurate information? Let’s see!

>>> Who is Chappell Roan?
Chappell Roan, whose full name is Cydney Elizabeth Smith-Roan, is an American singer and songwriter from St. Joseph, Missouri. She gained popularity in 2017 when she released her debut single "Pink Pony Club" as part of Island Records' roster. However, after being dropped by the label, she embarked on a series of independent releases before signing with Atlantic Records in early 2023.  

Alright! So, a few things still need correcting: her full name is Kayleigh Rose Amstutz, she was born in Willard, Missouri, “Pink Pony Club” was released in 2020, and she was initially signed to Atlantic Records, who subsequently dropped her.

However, I am so stoked. We went from a complete LLM hallucination to a mostly accurate statement based on the data I added. With more data and further training, I’m confident it will become even more accurate. We started with almost the minimum dataset, so there’s great potential for improvement.

Conclusion

Alright, so I decided to train a model on some pop culture data that may not seem super useful to everyone.

I mean, I believe it’s useful! But I digress.

I faced challenges without a MacBook but pushed through to better understand how InstructLab works on Linux. This helped me grasp what the InstructLab team will need to address for Linux users or those with different GPU access. If we want this tool to be truly accessible, which we certainly do, we must ensure it’s easy to use on a variety of setups. This is important, whether for short-term trials or long-term use.

My hope is that we provide more ways for non-MacBook users, including internal Red Hat folks, to get hands-on experience without as much difficulty. Even using an internal demo environment presented some challenges. But it’s still super early! I expect that as the project and our upcoming product, RHEL AI, mature, we will see these improvements.
