Curate fine-tuning data with Lilac
Lilac is an open-source product that helps you analyze, structure, and clean unstructured data with AI. You can use it to enrich datasets of LangChain runs to create better fine-tuning datasets.
In this walkthrough, we will use Lilac on a dataset of LangSmith runs to check for PII and remove approximate duplicates before fine-tuning. The basic workflow is as follows:
- Create a LangSmith dataset of runs data.
- Load LangSmith dataset into Lilac.
- Filter and curate dataset using signals and concepts.
- Export the dataset for fine-tuning.
We will explain each of these steps in more detail below, but first, install some prerequisite packages.
Setup
In addition to Lilac and LangSmith, this walkthrough requires a couple of additional packages.
# %pip install -U "lilac[pii]" langdetect openai langchain --quiet
import uuid
import os
# os.environ["LANGCHAIN_API_KEY"] = "<YOUR-API-KEY>"
unique_id = uuid.uuid4().hex[:8]
1. Create a LangSmith dataset
We've included an example dataset in this repository that you can use to complete this walkthrough.
This dataset was made by querying prompt and LLM runs from an example deployment of Chat LangChain.
For more information on how to query runs in LangSmith, check out the docs or explore some of the other recipes in this cookbook.
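As a rough illustration, a query like the one below could pull LLM runs from one of your own projects. This is a sketch: "my-project" is a placeholder project name, and in practice you would likely add filters for timeframe or feedback.
import langsmith

# Hypothetical sketch: list LLM runs from a tracing project.
# "my-project" is a placeholder; adjust run_type and filters to match your traces.
sketch_client = langsmith.Client()
for run in sketch_client.list_runs(project_name="my-project", run_type="llm"):
    print(run.inputs, run.outputs)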
from langsmith import Client
client = Client()
dataset_name = f"langsmith-prompt-runs-{unique_id}"
ds = client.create_dataset(dataset_name)
import json
from concurrent.futures import ThreadPoolExecutor
def create_example(line: str):
    d = json.loads(line)
    client.create_example(inputs=d["inputs"], outputs=d["outputs"], dataset_id=ds.id)

with open("rag.jsonl", "r", encoding="utf-8") as f:
    with ThreadPoolExecutor(max_workers=10) as executor:
        executor.map(create_example, f)
With that, the dataset is created. Note that Lilac works best on flat dataset structures, so nested attributes (such as the chat history) were flattened and stringified when the examples were written.
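For illustration, a flattening step like the one used to produce rag.jsonl might look roughly like the following. This is a sketch: the field names are examples, not the exact schema of the source runs.
# Hypothetical sketch of flattening a nested run payload before upload.
# The field names here are illustrative.
def flatten_run(run: dict) -> dict:
    inputs = run["inputs"]
    return {
        "question": inputs["question"],
        # Stringify nested structures so Lilac sees flat string fields.
        "chat_history": json.dumps(inputs.get("chat_history", [])),
        "context": json.dumps(inputs.get("context", [])),
        "output": run["outputs"]["output"],
    }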
2. Import into Lilac
Next, we can import the LangSmith dataset into Lilac. Select the dataset name you created above, and run the code below. Once you've run the code, you can view the results in Lilac's UI.
from IPython.display import display
import lilac as ll
ll.set_project_dir("./langsmith-finetune")
data_source = ll.sources.langsmith.LangSmithSource(
    dataset_name=dataset_name,
)
config = ll.DatasetConfig(
    namespace="local",
    name=dataset_name,
    source=data_source,
)
dataset = ll.create_dataset(config)
ll.start_server()
# await ll.stop_server()
Reading from source langsmith...: 100%|██████████| 400/400 [00:00<00:00, 54422.01it/s]
Dataset "langsmith-prompt-runs-d19f7f5b" written to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-d19f7f5b
INFO:     Started server process [18508]
INFO:     Waiting for application startup.
3. Enrich the dataset
Now that we have our dataset in Lilac, we can run Lilac's signals, concepts, and labels to help organize and filter it. Our goal is to select distinct examples that demonstrate good language model generations for a variety of input types. You can explore and annotate the dataset in the app by navigating to the URL printed out by the local server above. We'd encourage you to try out Lilac's off-the-shelf "concepts" or to train your own.
For the sake of this walkthrough, we will focus on using the Python API. You can follow along with the code below.
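As a taste of concepts from the Python API, something like the sketch below may work. This is a best guess, not a verified call: the signal name, namespace, and embedding are assumptions, so check Lilac's docs for the exact concept API before uncommenting.
# Hypothetical sketch of applying one of Lilac's built-in concepts.
# The class name and parameters are assumptions -- verify against the Lilac docs.
# dataset.compute_signal(
#     ll.ConceptSignal(
#         namespace="lilac",
#         concept_name="toxicity",
#         embedding="gte-small",
#     ),
#     "question",
# )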
Applying 'signals'
A signal in Lilac is any function that is applied over a field. We will use a couple of off-the-shelf "signals" to perform the following:
- PII detection: we don't want to leak private data
- Near duplicate detection: we want each training example to be informative
These are useful for filtering bad examples from our dataset before fine-tuning a model.
dataset.compute_signal(ll.PIISignal(), "question")
dataset.compute_signal(ll.PIISignal(), "output")
# Apply min-hash LSH (https://en.wikipedia.org/wiki/MinHash) to detect approximate n-gram duplicates
dataset.compute_signal(ll.NearDuplicateSignal(), "question")
dataset.compute_signal(ll.NearDuplicateSignal(), "output")
Computing pii on local/langsmith-prompt-runs-d19f7f5b:('question',): 100%|██████████| 400/400 [00:00<00:00, 909.97it/s]
Computing signal "pii" on local/langsmith-prompt-runs-d19f7f5b:('question',) took 0.441s. Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-d19f7f5b/question/pii
Computing pii on local/langsmith-prompt-runs-d19f7f5b:('output',): 100%|██████████| 400/400 [00:00<00:00, 421.69it/s]
Computing signal "pii" on local/langsmith-prompt-runs-d19f7f5b:('output',) took 0.950s. Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-d19f7f5b/output/pii
Computing near_dup on local/langsmith-prompt-runs-d19f7f5b:('question',): 100%|██████████| 400/400 [00:00<00:00, 6097.39it/s]
Computing signal "near_dup" on local/langsmith-prompt-runs-d19f7f5b:('question',) took 0.067s. Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-d19f7f5b/question/near_dup
Computing near_dup on local/langsmith-prompt-runs-d19f7f5b:('output',): 100%|██████████| 400/400 [00:00<00:00, 2896.51it/s]
Computing signal "near_dup" on local/langsmith-prompt-runs-d19f7f5b:('output',) took 0.139s. Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-d19f7f5b/output/near_dup
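To sanity-check the enriched fields from Python, you can page through a few rows. This is a small sketch; it passes dotted string paths the same way the pandas export in step 4 does.
# Peek at the computed signal values for a few rows.
for row in dataset.select_rows(["question", "question.pii", "question.near_dup"], limit=3):
    print(row)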
Adding labels
Labeling is best done in-app, but you can also programmatically label rows using the Python SDK. Below is an example that labels all rows whose question is not detected as English with a not_english label.
dataset.compute_signal(ll.LangDetectionSignal(), "question")
dataset.compute_signal(ll.LangDetectionSignal(), "output")
Computing lang_detection on local/langsmith-prompt-runs-d19f7f5b:('question',): 100%|██████████| 400/400 [00:00<00:00, 820.34it/s]
Computing signal "lang_detection" on local/langsmith-prompt-runs-d19f7f5b:('question',) took 0.494s. Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-d19f7f5b/question/lang_detection
Computing lang_detection on local/langsmith-prompt-runs-d19f7f5b:('output',): 100%|██████████| 400/400 [00:00<00:00, 502.81it/s]
Computing signal "lang_detection" on local/langsmith-prompt-runs-d19f7f5b:('output',) took 0.797s. Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-d19f7f5b/output/lang_detection
# You can check the current schema by running the following. Select the fields you want to export.
# dataset.manifest()
dataset.add_labels(
    "not_english", filters=[(("question", "lang_detection"), "not_equal", "en")]
)
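To confirm the label landed, you can count the labeled rows. This is a sketch that assumes an "exists" unary filter op is supported; verify against Lilac's filter docs.
# Hypothetical check: count rows carrying the new label.
# Assumes "exists" is a supported unary filter op in Lilac.
labeled = list(dataset.select_rows(["question"], filters=[("not_english", "exists")]))
print(f"{len(labeled)} rows labeled not_english")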
Lilac has many more powerful capabilities, like custom concepts and signals, that you can apply. Check out their docs for more info, and see our exploratory data analysis notebook for an introduction to using them with LangSmith datasets.
4. Prepare the enriched dataset
Now let's prepare the dataset for fine-tuning. We will fetch the deduplicated rows and filter out any rows that may contain PII.
df = dataset.to_pandas(
    [
        "question",
        "chat_history",
        "context",
        "output",
        "question.pii",
        "question.near_dup",
        "user_score",
        "not_english",
    ]
)
print(f"Original length: {len(df)}")
# Flatten the nested signal outputs into scalar columns
df["cluster_id"] = df["question.near_dup"].apply(lambda x: x["cluster_id"])
df["contains_pii"] = df["question.pii"].apply(
    lambda x: bool([v for l in x.values() for v in l])
)
df["not_english"] = df["not_english"].apply(
    lambda x: x is not None and x.get("label") == "true"
)
# Drop the original dotted columns
df.drop(columns=["question.near_dup", "question.pii"], inplace=True)
# Keep rows with no PII, a user_score other than 0.0, and a non-null output
df = df[(~df["contains_pii"]) & (df["user_score"] != "0.0") & (~df["output"].isna())]
# And drop the duplicate cluster IDs
df = df.drop_duplicates(subset="cluster_id", keep="first")
print(f"Filtered length: {len(df)}")
Original length: 400
Filtered length: 314
df.head(3)
|   | question | chat_history | context | output | user_score | not_english | cluster_id | contains_pii |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | "J'ai ajouté "insurer" au metadata de d... | [{"content": "I tried to make a chatbot to hel... | "<doc id="0">Skip to main content🦜... | "Je m'excuse pour la confusion. Il semble que ... | None | True | 0 | False |
| 1 | "Show me how to use a RecursiveUrlLoader?" | [{"content": "Show me how to use a RecursiveUr... | [{"metadata": {"source": "http://www.hernandez... | "I'm sorry, but the provided context does not ... | None | False | 1 | False |
| 2 | "Embody the persona of Melinda Reed, my expert... | [] | "<doc id="0">history, so anything important mu... | "```markdown\n# Project Skeleton\n\n## File St... | None | False | 2 | False |
5. Fine-tune
With the dataset filtered, we can now convert it to a format compatible with OpenAI's fine-tuning endpoint. You could apply similar logic to fine-tune a Llama, T5, or other model.
def create_messages(row):
    chat_history = json.loads(row.chat_history or "[]") or []
    roles = ("assistant", "user")
    messages = [
        {"role": "system", "content": "Helpfully answer the questions about LangChain."}
    ]
    for i, msg in enumerate(chat_history):
        messages.append({"role": roles[i % 2], "content": str(msg["content"])})
    messages.append({"role": "user", "content": row.question})
    messages.append({"role": "assistant", "content": row.output})
    return messages
messages = df.apply(create_messages, axis=1).tolist()
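It's worth eyeballing one converted example to confirm the structure matches OpenAI's chat format before uploading:
# Inspect the first converted example.
print(json.dumps(messages[0], indent=2))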
Now you can fine-tune the model! This will take a while (20+ minutes), so we'd encourage you to further explore your local Lilac dataset while you wait.
import json
from io import BytesIO
import time
import openai
# We will write the jsonl file in memory
my_file = BytesIO()
for m in messages:
    my_file.write((json.dumps({"messages": m}) + "\n").encode("utf-8"))
my_file.seek(0)
training_file = openai.File.create(file=my_file, purpose="fine-tune")
# OpenAI audits each training file for compliance reasons.
# This may take a few minutes
status = openai.File.retrieve(training_file.id).status
start_time = time.time()
while status != "processed":
    print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
    time.sleep(5)
    status = openai.File.retrieve(training_file.id).status
print(f"File {training_file.id} ready after {time.time() - start_time:.2f} seconds.")
job = openai.FineTuningJob.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
status = openai.FineTuningJob.retrieve(job.id).status
start_time = time.time()
while status != "succeeded":
    print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
    time.sleep(5)
    job = openai.FineTuningJob.retrieve(job.id)
    status = job.status
File file-HtPWASZzc0LLlvEY7ENy0q13 ready after 106.96 seconds.
Status=[running]... 1589.27s
Use fine-tuned model
With the model fine-tuning complete, you can load the fine-tuned model directly in LangChain!
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
model = ChatOpenAI(
    model=job.fine_tuned_model,
    temperature=1,
)
prompt = ChatPromptTemplate.from_messages(
    [("system", "Helpfully answer the questions about LangChain."), ("user", "{input}")]
)
chain = prompt | model
chain.invoke({"input": "What's LangChain Expression Language?"}).content
WARNING:langchain.llms.base:Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised ServiceUnavailableError: The server is overloaded or not ready yet..
'"The LangChain Expression Language is a domain-specific language (DSL) designed to work with LangChain. It allows users to create custom expressions and scripts for various purposes, such as data processing, text manipulation, and automation.\n\nWith the LangChain Expression Language, users can leverage a wide range of functions, operators, and variables to build complex and dynamic expressions. These expressions can be used within LangChain to transform data, generate text, make decisions, and perform other operations.\n\nOverall, the LangChain Expression Language provides a flexible and powerful tool for users to customize and extend the functionality of LangChain according to their specific needs."'
Conclusion
LangSmith makes it easy to collect the unstructured data seen by your production LLM application, and Lilac makes it easier to filter and analyze that data with sophisticated methods.
In this tutorial, you created a dataset of run traces, enriched it to flag PII and near-duplicates, filtered it down, and used the result to fine-tune a new model.