This step-by-step tutorial provides a gentle introduction to customizing (fine-tuning) a pre-trained multilingual language model (RoBERTa) for text classification tasks. It demonstrates how to use the model's existing knowledge to classify text accurately, even with a small set of labeled examples. It takes as input text documents and their corresponding labels for training, validation and testing. It covers specialized models for English, German, and French, but by default employs XLM-RoBERTa, which can be used for over 100 languages.

Learning Objectives

This tutorial answers the following questions:

- What exactly is text classification?
- What is a pre-trained language model?
- What even is fine-tuning?

By the end of this tutorial, you will be able to:

- Work with large language models for text classification (RoBERTa)
- Customize (fine-tune) a large language model for a text classification task in any language (100+ languages supported)
- Employ low-resource learning (with only a few hundred examples) using the SAM optimizer

Target Audience

This tutorial is aimed at an intermediate level. You should have basic knowledge of large language models and of Python programming.

Duration

About half a work-day.

Use Cases

Training and then using a text classifier for special text data, for example to detect the sentiment of specific texts, the hatefulness of social media posts, or something completely different, like whether a text mentions fruits or not.

Environment Setup

Use this tutorial preferably in an environment with GPU access, such as Colab. Run the cells below:

!pip install --quiet transformers pandas openpyxl
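# Optional: if you run this in Colab (or another notebook environment with an
# NVIDIA GPU attached), you can check which GPU is available. This is a small,
# optional sketch and assumes the nvidia-smi tool is present on the runtime.
!nvidia-smi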
# download the SAM optimizer code
!wget --quiet https://raw.githubusercontent.com/davda54/sam/main/sam.py

# Import packages used in the tutorial
import pandas
from sam import SAM
import shutil
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import transformers

Introduction

What exactly is text classification?

In text classification we try to assign a property to a text. For example, suppose we are interested in classifying texts that are about fruits. We could easily find a dictionary with all fruits (e.g., 'Apple', 'Banana', 'Pear', etc.). Every time we recognize a word from this dictionary in a text, we know this text is about fruits, right? However, this is not true all the time. For example, "Apple designed the new pencil pro." is not about the fruit 'Apple', although our dictionary approach would classify it as such. So the context of the word might be helpful (more on this later). Furthermore, dictionaries only work for the language they are written in. Since we aim for multilingual approaches in this tutorial, we will replace dictionaries with AI methods later on.

Classification is obviously transferable to more than just fruits. People try to classify the sentiment of a text, the stance towards an entity expressed in a text, the topic of a text, the emotion expressed in a text, and many more.

Data Preparation

Here we focus on text classification in its purest form: the data points are single texts, each of which is assigned a single label (or class) that our classifier has to predict (as opposed to multi-label classification, in which a text can be assigned anything from none to all of the available labels; see the "Defining the Classification Task" section). We use the following format for this tutorial:

{
'Text': ["text1", "text2", "...", "textN"]
'Labels': ["label-of-text1", "label-of-text2", "...", "label-of-textN"]
}

Example:

{
'Text': ['Yesterday i ate an apple.', 'Yesterday I crashed my Apple.'],
'Labels': ['about_fruit', 'not_about_fruit']
}

As in most classification settings, we separate our data into three parts:

- train: contains the data on which the classifiers are trained ("training data")
- val: contains the data based on which we select a classifier among the available ones ("validation data")
- test: contains the data that gives us an estimate of the performance of the selected classifier ("test data")

These datasets should all be independent from each other, so that (1) we select the classifier that generalizes best to validation data it has not seen during training, and (2) we get a solid performance estimate of the classifier in the wild from the test data (which was used neither to train nor to validate/select the classifier).

We take an old Twitter Bag Brands Sentiment Dataset as data. Feel free to use your own data instead.

!wget -q https://zenodo.org/records/7679325/files/bag_brand_sentiment_dataset.xlsx
data = pandas.read_excel('bag_brand_sentiment_dataset.xlsx')
data.head() # have a look at the first 5 texts

   Unnamed: 0                                                  0             tweet_id                                               text             user        location sentiment
0           1     WhEn I’m AbLe to PuRchAse A ChaNel BaG & S...           2740216059  when i’m able to purchase a chanel bag amp sti...     blk_goddess8  Washington, DC  positive
1           2  (INFO) Taehyung is wearing CHANEL TOTE BAG ,GR...  1591185253737053952  info taehyung is wearing chanel tote bag graff...        Thv_style             NaN  positive
2           3  Influencers on insta are like "I finally got o...             15441716  influencers on insta are like i finally got of...         ac_palma        Portugal  negative
3           4  @Brieyonce This is like Chanel’s advent calend...  1316674861658329088  brieyonce this is like chanel’s advent calenda...       MarahhJayy             NaN   neutral
4           5  Tae with jw anderson jacket and chanel tote ba...  1539547580299742976  tae with jw anderson jacket and chanel tote ba...  Bobabobabooboo1             NaN  positive

The column named 0 contains the text, whereas the sentiment column contains the label. We now transform it into our simple structure detailed above:

data_all = {
"Text": data[0].tolist(),
"Labels": data["sentiment"].tolist()
}
num_data_all = len(data_all["Text"])
# It is always good to check whether your data is "balanced", i.e., whether it
# has a similar amount of data for each class:
from collections import Counter
print("Total:", Counter(data_all["Labels"]))
# And split the data into three parts:
split_target_size = int(num_data_all/3)
train = {
"Text": data_all["Text"][:(split_target_size)],
"Labels": data_all["Labels"][:(split_target_size)]
}
print("Train:", Counter(train["Labels"]))
val = {
"Text": data_all["Text"][(split_target_size):(2*split_target_size)],
"Labels": data_all["Labels"][(split_target_size):(2*split_target_size)]
}
print("Val: ", Counter(val["Labels"]))
test = {
"Text": data_all["Text"][(2*split_target_size):],
"Labels": data_all["Labels"][(2*split_target_size):]
}
print("Test: ", Counter(test["Labels"]))
print()
assert len(train['Text']) == len(train['Labels']), "Number of texts does not match number of labels for train data!"
assert len(val['Text']) == len(val['Labels']), "Number of texts does not match number of labels for val data!"
assert len(test['Text']) == len(test['Labels']), "Number of texts does not match number of labels for test data!"
print(f"The train data has {len(train['Text'])} texts and {len(train['Labels'])} labels,")
print(f"the validation data {len(val['Text'])} texts and {len(val['Labels'])} labels")
print(f"and the test data {len(test['Text'])} texts and {len(test['Labels'])} labels.")
print()
print("First five training texts:")
for i in range(5):
print(f"'{train['Text'][i]}', {train['Labels'][i]}") Total: Counter({'neutral': 1424, 'positive': 1083, 'negative': 374})
Train: Counter({'neutral': 457, 'positive': 384, 'negative': 119})
Val: Counter({'positive': 424, 'neutral': 379, 'negative': 157})
Test: Counter({'neutral': 588, 'positive': 275, 'negative': 98})
The train data has 960 texts and 960 labels,
the validation data 960 texts and 960 labels
and the test data 961 texts and 961 labels.
First five training texts:
'WhEn I’m AbLe to PuRchAse A ChaNel BaG & StiLl LivE comfortabLY🌻💪🏾', positive
'(INFO) Taehyung is wearing CHANEL TOTE BAG ,GRAFFITI CANVASSILVER HARDWARE, from Spring/Summer 2014 (Limited Edition)
-🐯🐆 https://t.co/iqiWeD4BnV https://t.co/dGrOTWjnwL', positive
'Influencers on insta are like "I finally got offered my dream luxury bag" and it's a black Chanel flap. 😆', negative
'@Brieyonce This is like Chanel’s advent calendar including stickers, a dust bag and a keychain.', neutral
'Tae with jw anderson jacket and chanel tote bag😉 after jen using all these brands while touring in europe', positive

However, to prevent our model from simply learning to pick the most common label (the most likely guess whenever it is unsure), we balance the training set by oversampling, so that it contains the same number of examples for each label.

import random
counts = Counter(train['Labels'])
max_count = max(counts.values())
for label, count in counts.items():
label_indices = [i for i, train_label in enumerate(train['Labels']) if train_label == label]
selected_indices = random.choices(label_indices, k=max_count - count)
train['Text'] += [train['Text'][i] for i in selected_indices]
train['Labels'] += [train['Labels'][i] for i in selected_indices]
# now shuffle the data again
indices = random.sample(range(len(train['Text'])), len(train['Text']))
train['Text'] = [train['Text'][i] for i in indices]
train['Labels'] = [train['Labels'][i] for i in indices]
print("Train:", Counter(train["Labels"])) Train: Counter({'neutral': 457, 'negative': 457, 'positive': 457}) Great, the data is ready! Selecting a Pre-Trained Language Model In which language is the text in the data written? Many language models are trained to understand a particular language. In case you need to process text of just one language, taking such a specialized model is often a good idea. However, also cross-language models exist, which were trained on many different language. We use a cross-language model here, but you can comment out one of the other lines to use a language-specific model instead: model_name = 'xlm-roberta-base' # for 100 languages
# model_name = "roberta-base" # for English
# model_name = "benjamin/roberta-base-wechsel-german" # for German
# model_name = "camembert-base" # for French What is this model_name ? It relates to our second question: What is a pre-trained language model ? A pre-trained language model has be trained to predict words in a text. Usually, single words are removed from a text to make kind of a cloze test. The model is trained to fill in the gap with the word that we removed. This process is done for millions of texts, making the model somewhat adapt at “speaking” the language(s). This is called pre-training, as it teaches the model a specific skill (e.g., language understanding) that is useful for many other skills (e.g., predicting the sentiment of a text). What even is fine-tuning ? It means to take a pre-trained model and training it now on the actual task we want it to solve. Since the model already has useful skills (e.g., language understanding), we need less training data to transform it into an expert for the task. To highlight that this is a (relatively) small adjustment, this step is called “fine-tuning”. Defining the Classification Task As said above, classification tasks can be separated into single-label and multi-label classification. This tutorial by default uses single-label classification, but it also contains the code for multi-label classification, in case that is what you need when you use your own data. Here is a more detailed distinction: Single-Label Classification Select one label from a list of possible labels. Example: Does this text have a positive or a negative sentiment? Multi-Label Classification Decide for each label from a list of possible labels whether it applies. Example: Is this text positive , in formal language and/or easy to read ? # Classify by 'single_label_classification' or 'multi_label_classification'?
problem_type = 'single_label_classification'
# Ensure the labels align with the selected classification type
if problem_type == 'single_label_classification':
assert isinstance(train['Labels'][0], str), (
"For 'single-label' tasks, labels should be strings (e.g., 'positive')."
)
elif problem_type == 'multi_label_classification':
assert isinstance(train['Labels'][0], list), (
"For 'multi-label' tasks, labels should be a list " \
+ "(e.g., ['positive', 'easy-to-read'])."
)
else:
raise ValueError(f"Invalid classification type: {problem_type}") It is always advised to check your data behaves like you would expect it to: def get_unique_labels(data):
"""
Returns the number of possible labels and the possible labels.
"""
def flatten_list(list_to_flatten):
return [x for xs in list_to_flatten for x in xs]
labels = data['Labels']
if problem_type == 'multi_label_classification':
labels = flatten_list(labels)
labels = set(labels)
return labels
labels_train = get_unique_labels(train)
labels_val = get_unique_labels(val)
labels_test = get_unique_labels(test)
assert labels_train == labels_val == labels_test, (
"Train, val and test must contain the same labels, but the where\n"
f"{labels_train},\n{labels_val} and\n{labels_test}, respectively"
)
print(f"Labels (same across train, val and test): {list(labels_train)}") Labels (same across train, val and test): ['negative', 'positive', 'neutral'] Setting up the Model The next step is to load the model and tokenizer. The tokenizer translates text into a model-specific vocabulary (usually numbers) that the model can process efficiently. # The AI library uses integer numbers instead of strings for labels
# We convert them into each other with these dictionaries
id2label = {i: k for i, k in enumerate(labels_train)} # maps integer ids to label strings, e.g. id2label[0]
label2id = {k: i for i, k in enumerate(labels_train)} # maps label strings to integer ids (the inverse of id2label)
model_config = {'pretrained_model_name_or_path': model_name,
'num_labels': len(labels_train),
'problem_type': problem_type,
'id2label': id2label,
'label2id': label2id}
model = AutoModelForSequenceClassification.from_pretrained(**model_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Now we specify training-specific parameters. Do not worry if you are unsure about what to change when you use your own data: the preset values should work just fine for most cases. In general, when training a model, you provide it with some example data points (in our case, the training data). From this data, the model learns helpful patterns that explain the correlation between input and output. Key parameters of the training are:

- batch_size: determines how many examples we show to the model at a time before deducing how to improve classification.
- learning_rate: controls how strongly the model commits to patterns it recognizes within each batch.
- num_epochs: specifies how many times the model sees all the training data (e.g., 3 times).
- warm_up_rate: indicates the portion of training during which the model makes smaller adjustments.

batch_size = 32
learning_rate = 1e-4
num_epochs = 3 # how often we go through the entire training dataset
warm_up_rate = 0.1 # fraction of our training steps for warm-up scheduling

Next we create dataloaders to prepare the data for the model training. This involves converting both the texts and the labels from strings to integers: tokenization (texts to sequences of numbers) and label2id mapping (labels to numbers). Both are then used in the dataloader.

# Convert text into the model vocabulary (tokenization)
train_text = tokenizer([m for m in train['Text']], truncation=True, padding='longest', return_tensors='pt')
val_text = tokenizer([m for m in val['Text']], truncation=True, padding='longest', return_tensors='pt')
test_text = tokenizer([m for m in test['Text']], truncation=True, padding='longest', return_tensors='pt')
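# Optional sanity check (a small sketch, not required for training): the tokenizer
# returns a tensor of token ids plus an attention mask that marks real tokens (1)
# versus padding (0), and the ids can be mapped back to (sub)words for inspection.
print(train_text['input_ids'].shape)       # (number of training texts, padded sequence length)
print(train_text['attention_mask'].shape)  # same shape as input_ids
print(tokenizer.decode(train_text['input_ids'][0]))  # first text, including special and padding tokens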
# Convert label strings into label ids (using the label2id mapping)
def map_labels(labels, mapping):
if isinstance(labels[0], list): # multi-label classification
label_matrix = []
for text_labels in labels: # text_labels: labels for one text
label_vector = torch.tensor([
# vector of 0s and 1s, the index encodes the label
1 if label in text_labels else 0 for label in mapping
])
label_matrix.append(label_vector)
return torch.stack(label_matrix, dim=0) # multi-label: one 0/1 vector per text
return torch.tensor([mapping[label] for label in labels])
train_y = map_labels(train['Labels'], label2id)
val_y = map_labels(val['Labels'], label2id)
test_y = map_labels(test['Labels'], label2id)
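# Quick check (optional sketch): for single-label classification, each integer label id
# should map back to the original label string via id2label.
print(label2id)
print(train['Labels'][0], "->", train_y[0].item(), "->", id2label[train_y[0].item()])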
# Get dataloaders for iteration over the data
def generate_dataloader(text, y, batch_size, workers=1):
"""
Returns a dataloader with input_ids and attention_mask to process the text.
"""
attention_mask = text['attention_mask']
input_ids = text['input_ids']
dataset = list(zip(input_ids, attention_mask, y))
dataloader = DataLoader(
dataset, shuffle=True, batch_size=batch_size, num_workers=workers)
return dataloader
train_dataloader = generate_dataloader(train_text, train_y, batch_size)
val_dataloader = generate_dataloader(val_text, val_y, batch_size)
test_dataloader = generate_dataloader(test_text, test_y, batch_size)

The learning of patterns and the adaptation of the model are carried out by the optimizer. In our case it is a special optimizer (SAM) that keeps the model from fitting the training data too tightly, which often results in poorer generalization to new data. You can find the technical details in the SAM optimizer repository.

optimizer = SAM(model.parameters(), torch.optim.Adam, lr=learning_rate, adaptive=True)

The scheduler adapts the learning rate of the optimizer (i.e., by how much the model is adjusted in each learning step) during training. There are several different scheduling strategies, but typically they decrease the learning rate after some initial ("warm-up") steps: the further training has progressed, the smaller the adjustments that are assumed necessary to really hit the optimum.

num_training_steps = (len(train['Labels']) // batch_size) * num_epochs
num_warmup_steps = int(num_training_steps * warm_up_rate)
scheduler = transformers.get_cosine_schedule_with_warmup(
optimizer = optimizer,
num_warmup_steps = num_warmup_steps,
num_training_steps = num_training_steps,
last_epoch = -1
)

Before we start the training, let us check our available resources and compare them with the batch size (how much of the training data we load at once; if we only have a small GPU, larger batches might not fit into its memory).

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
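# Optional: preview how the learning rate will evolve over training (a small sketch;
# the preview_* names are only illustrative and it uses a throwaway optimizer and
# scheduler so that the real ones are not affected).
preview_optimizer = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=learning_rate)
preview_scheduler = transformers.get_cosine_schedule_with_warmup(
    preview_optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)
preview_lrs = []
for _ in range(num_training_steps):
    preview_optimizer.step()
    preview_scheduler.step()
    preview_lrs.append(preview_scheduler.get_last_lr()[0])
print(f"learning rate starts at {preview_lrs[0]:.2e}, peaks at {max(preview_lrs):.2e} and ends at {preview_lrs[-1]:.2e}")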
# Memory and system check before training
import psutil
import gc
import os
# Set tokenizers parallelism to avoid fork warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Check system memory
memory = psutil.virtual_memory()
print(f"System Memory: {memory.total / 1024**3:.2f} GB total, {memory.available / 1024**3:.2f} GB available")
# Check GPU memory if available
if device != 'cpu':
gpu_memory = torch.cuda.get_device_properties(0).total_memory
gpu_memory_allocated = torch.cuda.memory_allocated(0)
gpu_memory_reserved = torch.cuda.memory_reserved(0)
print(f"GPU Memory: {gpu_memory / 1024**3:.2f} GB total")
print(f"GPU Memory Allocated: {gpu_memory_allocated / 1024**3:.2f} GB")
print(f"GPU Memory Reserved: {gpu_memory_reserved / 1024**3:.2f} GB")
# Clear any existing GPU cache
torch.cuda.empty_cache()
gc.collect()
print(f"Device: {device}")
else:
print("No GPU available - using CPU")
# Check model size
model_size = sum(p.numel() for p in model.parameters())
print(f"Model has {model_size:,} parameters")
# Recommend batch size based on available memory
if device != 'cpu':
available_gpu_memory = gpu_memory - gpu_memory_reserved
if available_gpu_memory < 4 * 1024**3: # Less than 4GB available
recommended_batch_size = 8
elif available_gpu_memory < 8 * 1024**3: # Less than 8GB available
recommended_batch_size = 16
else:
recommended_batch_size = 32
print(f"Recommended batch size: {recommended_batch_size}")
if batch_size > recommended_batch_size:
print(f"Warning: Current batch size ({batch_size}) may be too large. Consider reducing to {recommended_batch_size}")
# Prevent the tutorial from getting stuck when running on a CPU.
if device == 'cpu':
print("Training on CPU takes too long. For the sake of the tutorial, we " \
"restrict the data to 10 examples each. The classifier will not really " \
"work from that little data, but illustrates how it would work in " \
"general.")
train_text = tokenizer([m for m in train['Text'][0:10]], truncation=True, padding='longest', return_tensors='pt')
val_text = tokenizer([m for m in val['Text'][0:10]], truncation=True, padding='longest', return_tensors='pt')
test_text = tokenizer([m for m in test['Text'][0:10]], truncation=True, padding='longest', return_tensors='pt')
train_y = map_labels(train['Labels'][0:10], label2id)
val_y = map_labels(val['Labels'][0:10], label2id)
test_y = map_labels(test['Labels'][0:10], label2id)
train_dataloader = generate_dataloader(train_text, train_y, 10)
val_dataloader = generate_dataloader(val_text, val_y, 10)
test_dataloader = generate_dataloader(test_text, test_y, 10)
num_epochs = 1

System Memory: 7.76 GB total, 4.44 GB available
No GPU available - using CPU
Model has 278,045,955 parameters
Training on CPU takes too long. For the sake of the tutorial, we restrict the data to 10 examples each. The classifier will not really work from that little data, but illustrates how it would work in general.

Fine-Tuning the Pre-Trained Model

("Fine-tuning" means training an already pre-trained model on a specific task.)

Let's start the training process! During fine-tuning, we show the training data to the model and adjust its parameters to optimize performance for the task. Here is what happens:

- The model learns patterns in the data to perform the classification task.
- After each epoch (a complete pass through the training dataset), we evaluate the model on the validation set to monitor progress.
- The best-performing model is saved during the training process.

best_loss = float('inf')
best_epoch = 0
already_trained = 0
best_model_path = ''
should_delete = True
# Move model to device
model.to(device)
for epoch in range(num_epochs): # Repeat num_epochs times
model.train()
for batch_idx, batch in enumerate(train_dataloader): # Train the model on the batch
try:
input_ids, attention_mask, y = batch[0].to(device), batch[1].to(device), batch[2].to(device)
# First forward pass
output = model(input_ids, attention_mask, labels=y)
loss = output.loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# SAM optimizer first step
optimizer.first_step(zero_grad=True)
# Second forward pass (required by SAM)
output2 = model(input_ids, attention_mask, labels=y)
loss2 = output2.loss
loss2.backward()
# SAM optimizer second step
optimizer.second_step(zero_grad=True)
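# Note: the two forward/backward passes above are what make SAM different from a plain
# optimizer. The first step perturbs the weights towards a locally "sharp" direction;
# the second step moves back and updates the original weights using the gradient
# computed at that perturbed point. With a standard optimizer such as torch.optim.AdamW
# you would instead do a single pass, roughly: loss.backward(); optimizer.step();
# optimizer.zero_grad() (shown here only as a sketch, not as part of this tutorial's loop).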
# Update learning rate AFTER optimizer steps
scheduler.step()
print(f"Train: Epoch {epoch}, Train step {already_trained+batch_idx}, Loss {loss.item():.4f}, learning_rate {scheduler.get_last_lr()[0]:.2e}", flush=True)
# Clear cache periodically to prevent memory buildup
if batch_idx % 5 == 0 and torch.cuda.is_available():
torch.cuda.empty_cache()
except RuntimeError as e:
if "out of memory" in str(e).lower():
print(f"OOM Error at batch {batch_idx}. Trying to recover...")
torch.cuda.empty_cache()
gc.collect()
# Try with smaller effective batch size
if batch_size > 8:
batch_size = batch_size // 2
print(f"Reducing batch size to {batch_size}")
train_dataloader = generate_dataloader(train_text, train_y, batch_size)
val_dataloader = generate_dataloader(val_text, val_y, batch_size)
break
else:
raise e
else:
raise e
already_trained += batch_idx
# Validation phase
model.eval()
val_loss = []
with torch.no_grad():
for batch_idx, batch in enumerate(val_dataloader): # Validate the current state of the model on the validation data
input_ids, attention_mask, y = batch[0].to(device), batch[1].to(device), batch[2].to(device)
val_output = model(input_ids, attention_mask, labels=y)
val_loss.append(val_output.loss)
val_loss = torch.mean(torch.stack(val_loss))
print(f"Validation: Epoch {epoch}, Train step {already_trained}, Loss {val_loss.item():.4f}, old best/epoch {str(best_loss)[1:6]}/{best_epoch}", flush=True)
if val_loss < best_loss: # Save the model if the val_loss is the best loss we have seen so far
best_loss = val_loss.item()
best_epoch = epoch
if should_delete and best_model_path and os.path.exists(best_model_path):
shutil.rmtree(best_model_path)
best_model_path = f"./my_model_epoch_{best_epoch}_val_loss_{str(val_loss.item())[1:6]}"
model.save_pretrained(best_model_path, from_pt=True)
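# Optional addition (not part of the original loop): also save the tokenizer next to
# the model, so the saved folder can later be loaded on its own without remembering
# the original model_name.
tokenizer.save_pretrained(best_model_path)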
print(f"**** END EPOCH {epoch} ****")
# Clean up memory after each epoch
if torch.cuda.is_available():
torch.cuda.empty_cache()
gc.collect()
print(f"**** FINISHED TRAINING FOR N={num_epochs} ****")
print(f"BEST EPOCH: {best_epoch}")
print(f"BEST LOSS: {best_loss}") Train: Epoch 0, Train step 0, Loss 1.1397, learning_rate 8.33e-06 /srv/conda/envs/notebook/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:192: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn(

Validation: Epoch 0, Train step 0, Loss 1.0745, old best/epoch nf/0
**** END EPOCH 0 ****
**** FINISHED TRAINING FOR N=1 ****
BEST EPOCH: 0
BEST LOSS: 1.0745060443878174

The training is finished; now we can load the best model.

best_model = AutoModelForSequenceClassification.from_pretrained(best_model_path).to(device)

Evaluating the Model

With the loaded model we can now predict the labels of the test set's texts and compare those predicted labels with the ones already in the dataset to understand the model's performance.

test_y_logits = []
with torch.no_grad():
for batch_idx, batch in enumerate(test_dataloader):
input_ids, attention_mask = batch[0].to(device), batch[1].to(device)
batch_y_logits = best_model(input_ids, attention_mask).logits
test_y_logits.append(batch_y_logits)
test_y_logits = torch.cat(test_y_logits, dim=0)
print(test_y_logits)

tensor([[-0.1697, -0.0447, 0.1559],
[-0.1834, -0.0495, 0.1719],
[-0.1934, -0.0595, 0.1885],
[-0.1953, -0.0488, 0.1792],
[-0.1995, -0.0650, 0.1862],
[-0.1819, -0.0601, 0.1768],
[-0.2040, -0.0539, 0.1803],
[-0.2005, -0.0652, 0.1852],
[-0.1846, -0.0476, 0.1717],
[-0.1786, -0.0579, 0.1756]])

Now these "logits" have to be converted to (numeric) labels. Based on the selected problem_type, the following code selects the correct decision function. The decision function converts the scores the model predicts into the actual decision.

# In multi-label classification, every label for which the model predicts a
# score above this value is considered to be predicted
multi_label_decision_threshold = 0
decision_function = {
# select single label with highest score
'single_label_classification': lambda x: torch.argmax(x, dim=1),
# select all labels with score above the threshold
'multi_label_classification': lambda x: torch.where(
x > multi_label_decision_threshold, 1, 0)
}[problem_type]
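# Optional: if you also want scores rather than only the chosen labels, you can turn
# the raw logits into probabilities (a small sketch; test_y_probs is not used below).
test_y_probs = torch.softmax(test_y_logits, dim=1) # for single-label classification
# for multi-label classification you would use torch.sigmoid(test_y_logits) instead
print(test_y_probs[0]) # probabilities over the labels for the first test text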
test_y_pred = decision_function(test_y_logits).cpu()
print(test_y_pred)

tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

We can finally compare how the predicted labels (test_y_pred) align with the labels that came along with the dataset (test_y). The scikit-learn library provides a nice report function to compare them.

from sklearn.metrics import classification_report
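# A confusion matrix is another helpful view on the same predictions (optional sketch);
# rows are the true labels and columns the predicted labels, in the id order of label2id.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_y, test_y_pred, labels=list(id2label.keys())))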
print(classification_report(test_y, test_y_pred, target_names=label2id.keys(), zero_division=True))

              precision    recall  f1-score   support

    negative       1.00      0.00      0.00         1
    positive       1.00      0.00      0.00         1
     neutral       0.80      1.00      0.89         8

    accuracy                           0.80        10
   macro avg       0.93      0.33      0.30        10
weighted avg       0.84      0.80      0.71        10

Results

The classification report shows us four metrics: the precision, the recall, the f1-score, and the accuracy. Additionally, the report displays two different average aggregations: the macro avg and the weighted avg.

- The precision tells us "When we predict a label, is it the correct label?"
- The recall tells us "How many instances of a class do we find?"
- The f1-score is the harmonic mean of the precision and the recall.
- The accuracy tells us "How many of our predictions are correct?"
- The macro avg aggregates the f1-score per class; it tells us "How well do we classify if all classes occur equally often?"
- The weighted avg aggregates the f1-score weighted by class size; it tells us "How well do we classify the complete label set?"

Using the Classifier

You can now check the classifier on further examples:

def classify(model, texts):
tokenized_texts = tokenizer(texts, truncation=True, padding='longest', return_tensors='pt')
input_ids, attention_mask = tokenized_texts['input_ids'].to(device), tokenized_texts['attention_mask'].to(device)
logits = model(input_ids, attention_mask).logits
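# (tip: for pure inference you could wrap the model call above in "with torch.no_grad():"
# to avoid storing gradients and save memory)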
predictions = decision_function(logits).cpu().tolist()
predicted_labels = [id2label[prediction] for prediction in predictions]
return predicted_labels
example_texts = ["What a great bag!", "What a horrible thing"]
example_predicted_labels = classify(best_model, example_texts)
for i in range(len(example_texts)):
print(f"The classifier predicted the label '{example_predicted_labels[i]}' for the text '{example_texts[i]}'") The classifier predicted the label 'neutral' for the text 'What a great bag!'
The classifier predicted the label 'neutral' for the text 'What a horrible thing'

Further Reading

- Unsupervised Cross-lingual Representation Learning at Scale
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- CamemBERT: a Tasty French Language Model
- WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models
- Sharpness-Aware Minimization for Efficiently Improving Generalization

Contact Details

For questions or feedback, contact Stephan Linzbach via Stephan.Linzbach@gesis.org.