Introduction
It’s been about a month since I started working with the InstructLab project and getting into Red Hat’s AI portfolio and strategy. In this blog post, I’ll share my early experiences with InstructLab and provide insights on how to easily get started with the tool at home.
My goal is to help you become more familiar with this open-source project and understand how simple it is to use. Keep in mind, I’m still learning about this technology, so there’s more for me to explore and understand about how this project works and how it will be used in the future.
What is InstructLab?
InstructLab is an open-source project created by IBM and Red Hat to enhance large language models (LLMs) for generative AI applications. The project’s goal is to make LLM tuning more cost-effective and accessible. Whether you are an expert data scientist or a sales rep, you should be able to use this tool without trouble.
Here are the resources I used to get started, both for learning about InstructLab as a project and for using it practically. I recommend reading through them in this order:
- What is InstructLab? (Red Hat website topic overview)
- LAB: Large-Scale Alignment for ChatBots (research paper detailing the underlying training methodology)
- InstructLab GitHub (documentation for how to install and use the ilab CLI tool)
- Getting started with InstructLab for generative AI model tuning (more detailed documentation and insights on installing and using the ilab CLI tool to train a model)
Key features of InstructLab to understand:
- Enhance LLMs with less data, resources, and manual effort: Uses less human-generated data and computing resources to contribute to and tune LLMs.
- Open-source community approach: Being open-source allows for continuous improvement of both the models and the project itself through community contributions.
- Versatile and model-agnostic: You can bring your own model (BYO-model) to customize and tune locally or for your business needs, providing flexibility.
- LAB Method: InstructLab leverages IBM Research’s Large-scale Alignment for chatBots (LAB) method. Essentially, you input a small amount of data manually, and the “synthetic data generators” use that input to create a large data set, which then tunes the model. This process significantly reduces the manual effort typically required, as models need large data inputs to make meaningful improvements in their behavior.
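In ilab terms, that seed-to-synthetic-to-tuned flow maps onto a handful of CLI commands, each of which we’ll walk through in detail later in this post:
$ ilab taxonomy diff      # validate the small set of seed examples you wrote
$ ilab data generate      # synthesize a much larger training set from those seeds
$ ilab model train        # tune the model on the synthetic data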
Getting started
Considerations
When using InstructLab, as of now, it’s best to use an M1/M2/M3 Mac or a computer with a dedicated GPU. If you don’t have access to either, consider trying to find a remote system with a GPU. Red Hat employees can utilize the InstructLab-specific demo environment available in the Red Hat Demo Platform for short reservations. That’s what I’m currently using. Note that RHDP currently also requires some prep work, so please reach out if you’re a Red Hatter looking to use this and need assistance.
You can also use InstructLab without a GPU, but be aware that some tasks, especially model training, will take significantly longer. Make the most of the resources you have! You can still get started with the basics and gain valuable experience regardless of your hardware setup. I completed a portion of the tasks below on my Lenovo ThinkPad T14 Gen 4 and was still able to gain some experience.
Let’s begin
Follow along with the GitHub instructions to install and use InstructLab and the ilab CLI tool. These instructions are very clear and easy to follow, guiding you through all tasks from installing ilab to training a model. Below, I’ve outlined exactly what I did, but be sure to refer back to the GitHub instructions for the most up-to-date information and commands tailored to your setup.
Installation
Install dependencies
$ sudo dnf install gcc gcc-c++ make git python3.11 python3.11-devel
Setup InstructLab directory
$ mkdir instructlab
$ cd instructlab
Set up a virtual environment and install with NVIDIA CUDA support
$ python3 -m venv --upgrade-deps venv
$ source venv/bin/activate
$ pip cache remove llama_cpp_python
$ pip install 'instructlab[cuda]' \
    -C cmake.args="-DLLAMA_CUDA=on" \
    -C cmake.args="-DLLAMA_NATIVE=off"
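If you’re on a machine without an NVIDIA GPU, like my ThinkPad, the GitHub instructions also cover installing without the CUDA build flags; a minimal sketch, using the same virtual environment:
$ pip install instructlab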
Note: I definitely recommend using virtual environments consistently when working with Python, to isolate the installed components.
Initializing ilab
$ ilab config init
Welcome to InstructLab CLI. This guide will help you to setup your environment.
Please provide the following values to initiate the environment [press Enter for defaults]:
Path to taxonomy repo [taxonomy]:
'taxonomy' seems to not exist or is empty. Should I clone https://github.com/instructlab/taxonomy.git for you? [y/N]: y
Cloning https://github.com/instructlab/taxonomy.git
Generating 'config.yaml' in the current directory...
Initialization completed successfully, you're ready to start using 'ilab'. Enjoy!
Let’s check out the config.yaml that we just created to see what our default setup looks like. Of course, you can make changes if needed.
config.yaml
chat:
  context: default
  greedy_mode: false
  logs_dir: data/chatlogs
  max_tokens: null
  model: models/merlinite-7b-lab-Q4_K_M.gguf
  session: null
  vi_mode: false
  visible_overflow: true
general:
  log_level: INFO
generate:
  chunk_word_count: 1000
  model: models/merlinite-7b-lab-Q4_K_M.gguf
  num_cpus: 10
  num_instructions: 100
  output_dir: generated
  prompt_file: prompt.txt
  seed_file: seed_tasks.json
  taxonomy_base: origin/main
  taxonomy_path: taxonomy
serve:
  gpu_layers: -1
  host_port: 127.0.0.1:8000
  max_ctx_size: 4096
  model_path: models/merlinite-7b-lab-Q4_K_M.gguf
Downloading the model
By default, ilab uses a Merlinite model (4-bit quantized version) from Hugging Face. However, you can and should download additional models to experiment with different ones! I downloaded a few, including the Granite-code models.
To change the default model, simply edit the config.yaml file in the InstructLab folder you accessed in the previous step. Alternatively, you can pass in a different model as a parameter. This flexibility allows you to tailor your model selection to best fit your specific needs and objectives.
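As an example, here’s roughly how I pulled one of the Granite GGUF builds from Hugging Face; the --repository and --filename flags reflect the CLI version I used, so check ilab model download --help for yours:
$ ilab model download --repository instructlab/granite-7b-lab-GGUF --filename granite-7b-lab-Q4_K_M.gguf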
$ ilab model download
Downloading model from instructlab/merlinite-7b-lab-GGUF@main to models...
...
The above command downloads the default model as defined in config.yaml.
Serving the model
$ ilab model serve
Serving deploys the model for use. Running the command as-is serves the default Merlinite model, which is what I used.
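Once it’s running, the model is exposed through an OpenAI-compatible API on the host_port from config.yaml (127.0.0.1:8000 here). Assuming the llama.cpp server’s standard endpoints, a quick sanity check from a second terminal:
$ curl -s http://127.0.0.1:8000/v1/models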
Chat with the model
With the model served in one terminal window, open up a secondary terminal window for a chat.
$ cd instructlab
$ source venv/bin/activate
$ ilab model chat
Welcome to InstructLab Chat w/ MODELS/MERLINITE-7B-LAB-Q4_K_M.GGUF
See what it knows!
>>> Who is Chappell Roan?
Chappell Roan is a former American football player who spent the majority of his career as a tight end in the National Football League (NFL). He played college football for Texas A&M University before being selected by the New York Jets in the 1978 NFL Draft. During his time with the Jets, Roan established himself as a reliable and productive player, contributing to the team's offensive successes. After spending six seasons with the Jets, he joined the Detroit Lions for two more years before concluding his professional football career in 1986. Since then, Roan has maintained a relatively low profile in public life, focusing on personal and family matters.
If you know who Chappell Roan is, this ain’t it. So I figured this would be a good piece of information to test with, to see whether I could help the model provide accurate answers.
Contribute knowledge or skills
Before completing this step, I recommend reading the LAB research paper. It provides valuable insight into how LAB works and explains how the taxonomy setup of information feeds into the synthetic data generation and subsequent training process.
To start contributing knowledge, follow the “Getting Started with Knowledge Contributions” directions on GitHub. Copy the example files provided and substitute your own information. This is the simpler way to test out functionality, as adding skills can be a bit more complex initially. Note that you’ll need a GitHub account for this process.
Create a qna.yaml file and attribution.txt file locally. These files will be placed in the taxonomy directory structure.
Directory structure I used: instructlab/taxonomy/knowledge/music/pop/chappell_roan
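To create that layout and drop the files in place, something like this works (paths reflect my example topic; substitute your own):
$ cd instructlab/taxonomy
$ mkdir -p knowledge/music/pop/chappell_roan
$ cp /path/to/qna.yaml /path/to/attribution.txt knowledge/music/pop/chappell_roan/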
qna.yaml
task_description: 'Teach the model about Chappell Roan, the pop singer'
created_by: taylorjordanNC
domain: pop
seed_examples:
  - question: Who is Chappell Roan?
    answer: |
      Chappell Roan, whose birth name is "Kayleigh Rose Amstutz", is an American pop singer and songwriter.
  - question: What is Chappell Roan's music and style influenced by?
    answer: |
      Chappell Roan is influenced heavily by drag queens, and her music and performing style have been described as "campy".
  - question: What was Chappell Roan's debut album called?
    answer: |
      Chappell Roan's debut album is entitled "The Rise and Fall of a Midwest Princess".
  - question: Where and when was Chappell Roan born?
    answer: |
      Chappell Roan was born on February 19th, 1998 in Willard, Missouri in the United States of America.
  - question: What are some famous songs that Chappell Roan sang?
    answer: |
      "Red Wine Supernova", "Good Luck, Babe!", "Pink Pony Club" and "HOT TO GO!" are four singles released by Chappell Roan that have been very successful.
  - question: How long has Chappell Roan been a professional singer?
    answer: Chappell Roan has been singing professionally since she was 16 years old, in 2014.
document:
  repo: https://github.com/taylorjordanNC/chappell_roan_knowledge.git
  commit: 273e8be
  patterns:
    - chappell_roan_bio.md
At the bottom of the file, you can see that a GitHub repository is referenced. This repository contains a markdown file with additional content. Here is mine for reference: https://github.com/taylorjordanNC/chappell_roan_knowledge
attribution.txt
Title of work: Chappell Roan
Link to work: https://en.wikipedia.org/wiki/Chappell_Roan
License of the work: CC-BY-SA-4.0
Creator names: Wikipedia Authors
Contributing to an open source model means open source data must be used. I followed the provided examples by leveraging Wikipedia as my data source.
Validate your new taxonomy additions
Stop serving the model and quit the chat if you still have it running.
$ ilab taxonomy diff
knowledge/music/pop/chappell_roan/qna.yaml
Taxonomy in taxonomy is valid :)
Generate synthetic data
This task will take your manual data input and generate a large dataset for the subsequent model training step. The duration of this process depends on your infrastructure.
To expedite this, you can pass a parameter to the synthetic data generator to limit the amount of data that is generated, thereby reducing processing time. However, this could mean the data has less impact on the model. I tried this parameter and ended up with a model that still wasn’t very sure who Chappell Roan was, so I’m recommending the command here without it (I’ve included the capped variant for reference after the output below):
$ ilab data generate
Generating synthetic data using 'models/merlinite-7b-lab-Q4_K_M.gguf' model, taxonomy:'taxonomy' against http://127.0.0.1:28141/v1 server
Cannot find prompt.txt. Using default prompt depending on model-family.
0%| | 0/100 [00:00<?, ?it/s]Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-07-22 19:27:19,315 generate_data.py:505: generate_data Selected taxonomy path knowledge->music->pop->chappell_roan
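For reference, the capped run I tried earlier looked roughly like this; the flag mirrors the num_instructions setting in config.yaml, but verify the exact name against ilab data generate --help for your version:
$ ilab data generate --num-instructions 25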
Check out my video showing ilab setup -> synthetic data generation
Several weeks ago, I recorded myself working with InstructLab up until the point of model training. Check it out if you are a visual learner or want to see the above steps in action.
Model training
Now that we have the larger synthetic data set about Chappell Roan, it’s time to train the model.
I am passing the --device=cuda parameter to enable GPU acceleration.
$ ilab model train --device=cuda
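If you don’t have a CUDA-capable GPU, the same step can run on CPU by omitting the flag (expect it to take far longer); Apple M-series users follow a separate MLX-based path in the GitHub instructions. A minimal sketch of the CPU variant, assuming the defaults from config.yaml:
$ ilab model train
I’ll continue below with my CUDA-trained run.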
Let’s see the output:
$ ls models
ggml-model-f16.gguf ibm-granite merlinite-7b-lab-Q4_K_M.gguf
ggml-model-f16.gguf is the newly trained model produced by the training step.
Testing out the new model
Since I’m using a server with an NVIDIA GPU, I have to skip the ilab model test and ilab model convert steps that you’ll see in the GitHub instructions; those steps are only supported on Apple M-series chips.
Serve the newly trained model:
$ ilab model serve --model-path models/ggml-model-f16.gguf
Now, let’s open a chat window again to interact with it.
$ cd instructlab
$ source venv/bin/activate
$ ilab model chat
Welcome to InstructLab Chat w/ MODELS/GGML-MODEL-F16.GGUF
Now comes the big question…will it give me more accurate information? Let’s see!
>>> Who is Chappell Roan?
Chappell Roan, whose full name is Cydney Elizabeth Smith-Roan, is an American singer and songwriter from St. Joseph, Missouri. She gained popularity in 2017 when she released her debut single "Pink Pony Club" as part of Island Records' roster. However, after being dropped by the label, she embarked on a series of independent releases before signing with Atlantic Records in early 2023.
Alright! So, a few things still need correcting: her full name is Kayleigh Rose Amstutz, she was born in Willard, Missouri, “Pink Pony Club” was released in 2020, and she was initially signed to Atlantic Records, who subsequently dropped her.
However, I am so stoked. We went from a complete LLM hallucination to a mostly accurate statement based on the data I added. With more data and further training, I’m confident it will become even more accurate. We started with almost the minimum dataset, so there’s great potential for improvement.
Conclusion
Alright, so I decided to train a model on some pop culture data that may not seem super useful to everyone.
I mean, I believe it’s useful! But I digress.
I faced challenges without a MacBook but pushed through, and came away with a better understanding of how InstructLab works on Linux. This helped me grasp what the InstructLab team will need to address for Linux users or those with different GPU access. If we want this tool to be truly accessible, which we certainly do, we must ensure it’s easy to use on a variety of setups. This is important, whether for short-term trials or long-term use.
My hope is that we provide more ways for non-MacBook users, including internal Red Hat folks, to get hands-on experience with less difficulty. Even using an internal demo environment presented some challenges. But it’s still super early! I expect that as the project and our upcoming product, RHEL AI, mature, we will see these improvements.