Text analysis has been around for a long time because the need for it never goes away. The field has come a long way, from simple descriptive statistics to text classification and advanced text generation. With Large Language Models (LLMs) added to our arsenal, these tasks have become even more accessible.
Scikit-LLM is a Python package for text analysis powered by LLMs. It stands out because it integrates with the standard Scikit-Learn pipeline.
So, what is this package about, and how does it work? Let’s get into it.
Scikit-LLM is a Python package that enhances text analysis tasks with LLMs. It was developed by BeastByte to bridge the standard Scikit-Learn library and the power of language models. Its API is designed to resemble Scikit-Learn’s, so it should feel familiar to use.
Installation
To use the package, we first need to install it with the following command.
pip install scikit-llm
As of the time this article was written, Scikit-LLM is only compatible with some of the OpenAI and GPT4ALL models, so we will work only with the OpenAI models here. However, you can use a GPT4ALL model by installing the extra component first.
pip install scikit-llm[gpt4all]
After installation, you must set up the OpenAI key to access the LLM models.
from skllm.config import SKLLMConfig
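# Replace the placeholders with your own OpenAI API key and organization ID
SKLLMConfig.set_openai_key("<YOUR_KEY>")
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")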
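Hard-coding credentials in a notebook makes them easy to leak. One alternative is to read them from environment variables. The sketch below assumes the variable names OPENAI_API_KEY and OPENAI_ORG_ID; both names and the helper read_openai_credentials are illustrative choices, not part of Scikit-LLM.

```python
import os


def read_openai_credentials(env=None):
    """Return (key, org) read from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    org = env.get("OPENAI_ORG_ID", "").strip()
    if not key:
        # Fail early instead of sending requests with an empty key
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key, org


# Usage with the configuration calls shown above (requires scikit-llm):
# key, org = read_openai_credentials()
# SKLLMConfig.set_openai_key(key)
# SKLLMConfig.set_openai_org(org)
```

The optional env argument only exists so the helper can be tried with a plain dictionary instead of the real environment.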
Trying out Scikit-LLM
With the environment set, let’s try out some of Scikit-LLM’s capabilities. One ability of LLMs is text classification without retraining, known as zero-shot classification. We will start by fitting a zero-shot classifier on the labeled sample data, just as we would with any Scikit-Learn estimator.
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset
#label: Positive, Neutral, Negative
X, y = get_classification_dataset()
#Initiate the model with GPT-3.5
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
clf.fit(X, y)
labels = clf.predict(X)
You only need to provide the text data in X and the labels in y. In this case, the labels are sentiments: positive, neutral, or negative.
As you can see, the process mirrors the fit and predict methods in Scikit-Learn. However, zero-shot classification does not require a training dataset, so we can provide just the candidate labels without any training data.
X, _ = get_classification_dataset()
clf = ZeroShotGPTClassifier()
clf.fit(None, ["positive", "negative", "neutral"])
labels = clf.predict(X)
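To see why candidate labels alone are enough, it helps to picture the kind of prompt a zero-shot classifier could send to the model. The sketch below is an assumption for illustration only: build_zero_shot_prompt and its wording are hypothetical and are not Scikit-LLM’s actual prompt template.

```python
def build_zero_shot_prompt(text, labels):
    """Build an illustrative zero-shot classification prompt (hypothetical)."""
    if not labels:
        raise ValueError("at least one candidate label is required")
    # The candidate labels are the only task-specific information the model gets
    label_list = ", ".join(f"'{label}'" for label in labels)
    return (
        "Classify the text into exactly one of these categories: "
        f"{label_list}.\n"
        f'Text: "{text}"\n'
        "Answer with the category name only."
    )


prompt = build_zero_shot_prompt(
    "The delivery was late and the box was damaged.",
    ["positive", "negative", "neutral"],
)
print(prompt)
```

Because the labels are written into the prompt itself, no labeled examples are needed before predicting.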
This also extends to multilabel classification, as the following code shows.
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset
X, _ = get_multilabel_classification_dataset()
candidate_labels = [
"Quality",
"Price",
"Delivery",
"Service",
"Product Variety",
"Customer Support",
"Packaging",
]
clf = MultiLabelZeroShotGPTClassifier(max_labels=4)
clf.fit(None, [candidate_labels])
labels = clf.predict(X)
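Multilabel predictions are easier to inspect or score once they are in a fixed-width form. The sketch below assumes predict returns one list of label strings per text, which is worth checking against your installed version. The helper to_indicator_rows is hypothetical, and the example predictions are made up for illustration.

```python
def to_indicator_rows(predictions, candidate_labels):
    """Convert per-text label lists into 0/1 rows, one column per candidate label."""
    column = {label: i for i, label in enumerate(candidate_labels)}
    rows = []
    for predicted in predictions:
        row = [0] * len(candidate_labels)
        for label in predicted:
            # Ignore anything that is not one of the candidate labels
            if label in column:
                row[column[label]] = 1
        rows.append(row)
    return rows


candidate_labels = ["Quality", "Price", "Delivery"]
example_predictions = [["Price", "Quality"], ["Delivery"], []]  # made-up values
rows = to_indicator_rows(example_predictions, candidate_labels)
print(rows)  # [[1, 1, 0], [0, 0, 1], [0, 0, 0]]
```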
What’s amazing about Scikit-LLM is that it lets us bring the power of LLMs into the typical Scikit-Learn pipeline.
Scikit-LLM in the ML Pipeline
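In the next example, I will show how we can use Scikit-LLM as a vectorizer and XGBoost as the classifier. We will also wrap both steps into a single pipeline.
First, we load the data, split it into training and test sets, and use a label encoder to transform the labels into numerical values.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from skllm.datasets import get_classification_dataset
X, y = get_classification_dataset()
# Hold out part of the data for testing (the split settings are illustrative choices)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)
Next, we define a pipeline that performs vectorization and model fitting. We can do that with the following code.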
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from skllm.preprocessing import GPTVectorizer
steps = [("GPT", GPTVectorizer()), ("Clf", XGBClassifier())]
clf = Pipeline(steps)
#Fitting the dataset
clf.fit(X_train, y_train_enc)
Lastly, we can perform prediction with the following code.
pred_enc = clf.predict(X_test)
preds = le.inverse_transform(pred_enc)
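The same pipeline shape can be tried offline before spending any API calls. The sketch below is a stand-in, not the method used in this article: it swaps GPTVectorizer for TfidfVectorizer and XGBClassifier for LogisticRegression, and it uses a tiny made-up dataset, purely to show how the label encoder, the pipeline, and the inverse transform fit together.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Tiny made-up dataset, for illustration only
X_train = [
    "I loved this movie, it was wonderful",
    "Absolutely fantastic and wonderful acting",
    "Terrible plot, I hated it",
    "Awful and boring, I hated every minute",
]
y_train = ["positive", "positive", "negative", "negative"]
X_test = ["wonderful movie", "boring and awful"]

# Encode the string labels as integers
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)

# Stand-in steps: TF-IDF features instead of GPT embeddings,
# logistic regression instead of XGBoost
steps = [("Vec", TfidfVectorizer()), ("Clf", LogisticRegression())]
clf = Pipeline(steps)
clf.fit(X_train, y_train_enc)

# Map the integer predictions back to the original string labels
pred_enc = clf.predict(X_test)
preds = le.inverse_transform(pred_enc)
print(list(preds))
```

Once this runs, swapping the two stand-in steps back to GPTVectorizer and XGBClassifier leaves the rest of the code unchanged, which is the main benefit of the pipeline interface.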
As we can see, Scikit-LLM and XGBoost work together under the Scikit-Learn pipeline. Combining them lets us feed LLM-based text features into any Scikit-Learn-compatible classifier.
There are still various tasks you can do with Scikit-LLM, including model fine-tuning, so I suggest checking the documentation to learn more. You can also use an open-source model from GPT4ALL if necessary.
Scikit-LLM is a Python package that empowers Scikit-Learn text analysis tasks with LLMs. In this article, we have discussed how to use Scikit-LLM for text classification and how to combine it with a machine learning pipeline.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and Data tips via social media and writing media.