PropertyValue
?:about
?:abstract
  • Learn how to fine-tune transformer models like BERT for binary and multiclass document classification (xsd:string)
?:associatedTask
?:codeRepository
?:contributor
?:dateModified
  • 2025 (xsd:gyear)
?:datePublished
  • 2025 (xsd:gyear)
?:description
  • Learning Objectives

By the end of this tutorial, you will be able to fine-tune transformer models like BERT for binary and multiclass document classification. We show how to use simpletransformers to work with transformer models in Python. We then switch to HuggingFace to train the same model. Finally, we expand the binary classification to multiclass. As an example, we will fine-tune a specific transformer model (DistilBERT) for automatic sexism detection.

Target audience

This tutorial is aimed at social scientists with some knowledge of Python and supervised machine learning.

Setting up the computational environment

The following Python packages are required:

```
!pip install pandas numpy torch sklearn
!pip install simpletransformers
!pip install transformers[torch]
```

This package is optional:

```
!pip install accelerate -U
```

Duration

The entire tutorial takes around 30 minutes to read through. The running time depends on the hardware. This notebook can be used with or without GPU compute. On a laptop with an Nvidia T500 GPU (4 GB) and a proper CUDA setup, the running time is around 7 minutes. Without a GPU, it would take many hours.

Social Science Usecase(s)

This method has been used in Samory et al. (2021) for automatic sexism detection.

Import necessary Python libraries and modules

We will import the necessary Python libraries and modules.

```
import pandas as pd
import numpy as np

# For deep learning
import torch
```

We then check if we have a GPU available. This is important because some parts of the code have to be modified later on based on this.

```
gpu_avail = torch.cuda.is_available()
```

Get the data

We first download the datasets we need for finetuning our models. This is a supervised classification task, therefore we will need labeled data. We download the 'Call me sexist but' (CMSB) dataset, which you can find here: https://search.gesis.org/research_data/SDN-10.7802-2251 This dataset is from our paper on detecting sexism in a theory-driven manner (Samory et al., 2021). You can download the full data ('sexism_data.csv') and put it in a folder called 'sexism_data'. Here, we will use just a subset of the data ('sexism_sample.csv') to speed up the training. For a better classifier, please train on the full data.

```
## sexism_data = pd.read_csv('sexism_data/sexism_data.csv')
sexism_data = pd.read_csv('sexism_data/sexism_sample.csv')
sexism_data.head()
```

   id     dataset  text                                                toxicity  sexist  of_id
0  10330  other    MENTION3839 MENTION2475 would have to see what…     0.203880  False   -1
1  5703   callme   I hate it when women beat up men in movies!         0.661064  True    -1
2  12331  scales   Men are better at science and math than women are   0.531301  True    -1
3  12855  callme   “MENTION149: MENTION2551 ps I’d never listen t…     0.278660  False   -1
4  2240   other    MENTION279 Vivek W is the Carrot Top of academ…     0.084303  False   -1

```
sexism_data = sexism_data.dropna(subset = 'sexist')
```

Finetuning a classifier: the general procedure

With simpletransformers

We first use the simpletransformers package, which is more beginner-friendly. The basic steps for finetuning a classifier using simpletransformers are (a compact sketch of these steps follows the list):

  • Initialize a model based on a specific architecture (BERT, DistilBERT, etc.)
  • Train the model with train_model()
  • Evaluate the model with eval_model()
  • Make predictions on (unlabelled) data with predict()
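As a compact illustration of these four steps, here is a minimal sketch, not the exact code used in the rest of this tutorial. It assumes train_df and test_df are dataframes with 'text' and numeric 'labels' columns (they are created in the preprocessing step below), and the hard-coded use_cuda=False and the example sentence are placeholders.

```
from simpletransformers.classification import ClassificationModel

# 1. Initialize a model based on a specific architecture (here: DistilBERT)
model = ClassificationModel("distilbert", "distilbert-base-uncased", use_cuda=False)

# 2. Train on a dataframe with 'text' and numeric 'labels' columns
model.train_model(train_df)

# 3. Evaluate on a held-out dataframe with the same structure
result, model_outputs, wrong_predictions = model.eval_model(test_df)

# 4. Predict labels for new, unlabelled texts
predictions, raw_outputs = model.predict(["An example sentence to classify"])
```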
```
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import logging

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
```

We need to preprocess the data before we start the finetuning process. In this step, we split the dataset into train and test sets to have a fully held-out test set that can be used to evaluate our classifier. We could also create a validation set that is used during the fine-tuning process for hyperparameter tuning, but that is not mandatory.

```
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(sexism_data, stratify=sexism_data['sexist'], test_size=0.2)
```

We now convert the dataframes into a format that can be read by simpletransformers: a dataframe with the columns 'text' and 'labels'. The 'labels' column should be numerical, so we use scikit-learn's LabelEncoder to transform our boolean sexist labels into numerical ones.

```
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(train_df['sexist'])
train_df['labels'] = le.transform(train_df['sexist'])
test_df['labels'] = le.transform(test_df['sexist'])

# to see which number was mapped to which class:
list(le.inverse_transform([0,1]))
```

[np.False_, np.True_]

So, 0 is non-sexist and 1 is sexist. We now have the appropriate data structure.

The next step is setting the training parameters and loading the classification model, in this case DistilBERT (Sanh et al., 2019), a lightweight model that can be trained relatively quickly compared to other transformer variants like BERT and RoBERTa. There are many training parameters to choose from, such as the learning rate, whether we want to stop early or not, where we should save the model, and more; you can find all of them in the simpletransformers documentation. As a minimal setup, we will just set the number of epochs, i.e., the number of passes the model does over the full training set. For recent transformer models, the number of epochs is usually set to 2 or 3, after which overfitting may happen. use_cuda is a parameter that signals whether the GPU should be used or not; it is set based on our earlier check.

```
# Optional model configuration
model_args = ClassificationArgs(num_train_epochs=3, overwrite_output_dir=True)

# Create a ClassificationModel
model = ClassificationModel(
    "distilbert", "distilbert-base-uncased",
    args=model_args,
    use_cuda=gpu_avail,
)

# we set some additional parameters when using a GPU
if gpu_avail:
    model_args.use_multiprocessing=False
    model_args.use_multiprocessing_for_evaluation=False
```

We are now finally ready to begin training! This might take a while, especially when we're not using a GPU.

```
# Train the model
model.train_model(train_df)
```

Epoch: 0%| | 0/3 [00:00
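The tutorial then trains the same model with HuggingFace's transformers library. The evaluation and saving steps further below use a HuggingFace Trainer (trainer), a tokenized test set (tokenized_test_df), and a directory name (cached_model_directory_name). The following is a minimal sketch of how such a setup could look for the same DistilBERT model and the same train/test split; it assumes the datasets library (pip install datasets) is available, and the batch size, output directory, and directory name are assumptions, not the tutorial's original values.

```
from datasets import Dataset
from sklearn.metrics import classification_report  # used for evaluation below
from transformers import (AutoTokenizer, DistilBertForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Tokenize the tweet texts; padding/truncation keep all sequences the same length
    return tokenizer(batch["text"], padding="max_length", truncation=True)

# Convert the pandas dataframes into HuggingFace Datasets and tokenize them
tokenized_train_df = Dataset.from_pandas(train_df[["text", "labels"]]).map(tokenize, batched=True)
tokenized_test_df = Dataset.from_pandas(test_df[["text", "labels"]]).map(tokenize, batched=True)

# Load DistilBERT with a binary classification head
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Directory used later by trainer.save_model(); the name is a placeholder
cached_model_directory_name = "distilbert_sexism_classifier"

training_args = TrainingArguments(
    output_dir="output_hf",          # where checkpoints are written (assumed name)
    num_train_epochs=3,              # same number of epochs as before
    per_device_train_batch_size=8,   # assumed batch size
)

trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_train_df)
trainer.train()
```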
[60/60 02:58, Epoch 3/3]

Step    Training Loss

TrainOutput(global_step=60, training_loss=0.39763174057006834, metrics={'train_runtime': 181.7385, 'train_samples_per_second': 2.641, 'train_steps_per_second': 0.33, 'total_flos': 63584351354880.0, 'train_loss': 0.39763174057006834, 'epoch': 3.0})

Save fine-tuned model

The following cell will save the model and its configuration files to a directory in Colab. To preserve this model for future use, you should download the model to your computer.

```
trainer.save_model(cached_model_directory_name)
```

(Optional) If you have already fine-tuned and saved the model, you can reload it using the following line, so you don't have to run fine-tuning every time you want to evaluate.

```
# trainer = DistilBertForSequenceClassification.from_pretrained(cached_model_directory_name)
```

We can now evaluate the model by predicting the labels for the test set.

```
predicted_results = trainer.predict(tokenized_test_df)
predicted_labels = predicted_results.predictions.argmax(-1)  # Get the highest probability prediction
predicted_labels = predicted_labels.flatten().tolist()       # Flatten the predictions into a 1D list
predicted_labels[0:5]
```

[1, 1, 1, 1, 1]

```
print(classification_report(tokenized_test_df['labels'], predicted_labels))
```

              precision    recall  f1-score   support

           0       0.94      0.75      0.83        20
           1       0.79      0.95      0.86        20

    accuracy                           0.85        40
   macro avg       0.86      0.85      0.85        40
weighted avg       0.86      0.85      0.85        40

You can now use this classifier on other types of data to label it for potentially sexist content.
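For instance, one way to apply the saved model to new texts is through the transformers pipeline API. This is a minimal sketch: it assumes the model was saved to cached_model_directory_name as above, and the example sentences are made up.

```
from transformers import pipeline

# Load the fine-tuned model from the saved directory; the tokenizer is the original DistilBERT one
classifier = pipeline(
    "text-classification",
    model=cached_model_directory_name,
    tokenizer="distilbert-base-uncased",
)

new_texts = [
    "Women are too emotional to be good leaders.",  # made-up example
    "The weather in Cologne is lovely today.",      # made-up example
]
classifier(new_texts)
# Returns a list of dicts with a predicted label (LABEL_1 = sexist, LABEL_0 = non-sexist) and a score
```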
Optional: Multi-class classification

In the previous parts, we finetuned a binary classifier to differentiate sexist vs. non-sexist content. However, the CMSB dataset also has fine-grained labels for sexism based on content and phrasing. So we now train a multi-class classifier using simpletransformers, with a few tweaks to our earlier code.

But first, we have to aggregate the annotations from all crowdworkers to obtain the content and phrasing labels. For simplicity, we will use the majority label (breaking ties randomly).

```
sexism_data_annotations = pd.read_csv('sexism_data/sexism_annotations.csv', sep = ',')
sexism_data_annotations.head()
```

   phrasing  content  worker    id
0         3        2       0  1815
1         3        6       1  1815
2         3        6       2  1815
3         3        6       3  1815
4         3        6       4  1815

```
tweets = sexism_data_annotations['id'].unique()

from collections import Counter

content_labels = []
phrasing_labels = []
for tweet in tweets:
    data_subset = sexism_data_annotations[sexism_data_annotations['id'] == tweet]
    # get the majority label for content
    content_labels.append(Counter(data_subset['content'].values).most_common()[0][0])
    # get the majority label for phrasing
    phrasing_labels.append(Counter(data_subset['phrasing']).most_common()[0][0])

finegrained_sexism_data = pd.DataFrame([tweets, content_labels, phrasing_labels]).T
finegrained_sexism_data.columns = ['id', 'content_label', 'phrasing_label']
finegrained_sexism_data
```

         id  content_label  phrasing_label
0      1815              6               3
1      8199              2               3
2     11847              6               3
3      9218              6               3
4     13298              6               3
…         …              …               …
5645   2383              6               2
5646   5627              6               3
5647  11041              6               3
5648   3535              6               3
5649   9901              6               3

5650 rows × 3 columns

```
finegrained_sexism_data.groupby('content_label').size()
```

content_label
1     625
2     876
3     173
4      78
5     237
6    3661
dtype: int64

```
finegrained_sexism_data.groupby('phrasing_label').size()
```

phrasing_label
1     149
2     223
3    5278
dtype: int64

The six content and three phrasing categories are defined in Samory et al. (2021).

Let's join this data with the tweets data we loaded earlier.

```
finegrained_sexism_data = pd.merge(finegrained_sexism_data, sexism_data[['id', 'text', 'sexist']])
finegrained_sexism_data.groupby(['content_label']).size()
```

content_label
1    37
2    53
3     6
4     1
5     1
6    39
dtype: int64

Since our dataset is somewhat imbalanced, with low representation for some categories, we restrict it to only those classes that have at least 30 instances, i.e., 1, 2, and 6.

```
finegrained_sexism_data = finegrained_sexism_data[finegrained_sexism_data['content_label'].isin([1, 2, 6])]

# we also change the label range for simpletransformers, making the labels range from 0 to 2
label_map = {1 : 0, 2 : 1, 6 : 2}
finegrained_sexism_data['content_label'] = [label_map[i] for i in finegrained_sexism_data['content_label']]
finegrained_sexism_data.groupby(['content_label']).size()
```

content_label
0    37
1    53
2    39
dtype: int64

Let's train a classifier for identifying the sexist content or phrasing category.

```
category = 'content_label'

multi_train_df, multi_test_df = train_test_split(finegrained_sexism_data, stratify=finegrained_sexism_data[category], test_size=0.2)
```

You have to add the number of labels to the model initialization.

```
# Optional model configuration
model_args = ClassificationArgs(num_train_epochs=5, output_dir='output_st', overwrite_output_dir=True)

# Create a ClassificationModel
model = ClassificationModel(
    "distilbert", "distilbert-base-uncased",
    num_labels=len(finegrained_sexism_data[category].unique()),
    use_cuda=gpu_avail,
    args=model_args
)

# we set some additional parameters when using a GPU
if gpu_avail:
    model_args.use_multiprocessing=False
    model_args.use_multiprocessing_for_evaluation=False
```

```
# multi_train_df['content_label'] = [i-1 for i in multi_train_df['content_label']]
# multi_test_df['content_label'] = [i-1 for i in multi_test_df['content_label']]

multi_train_df = multi_train_df[['text', category]]
multi_test_df = multi_test_df[['text', category]]

# Train the model.
model.train_model(multi_train_df)
```

Epoch: 0%| | 0/5 [00:00
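After training, the multi-class model can be evaluated on the held-out split in the same way as the binary one. This is a minimal sketch, assuming the model, multi_test_df, and category variables defined above.

```
from sklearn.metrics import classification_report

# Predict the content category for the held-out texts
predictions, raw_outputs = model.predict(multi_test_df['text'].tolist())

# Compare predictions against the majority-vote labels
print(classification_report(multi_test_df[category], predictions))
```

(xsd:string)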
?:format
  • WRITTEN_DOCUMENT (de)
  • WRITTEN_DOCUMENT (en)
is ?:hasPart of
?:levelOfDifficulty
  • BEGINNER (xsd:string)
?:license
  • MIT (xsd:string)
?:name
  • Fine tuning BERT for text classification tasks (xsd:string)
?:portalUrl
?:programmingLanguage
  • Python (de)
  • Python (en)
?:sourceInfo
  • GESIS-Methods Hub (xsd:string)
?:timeRequired
  • About A Full Work Day (xsd:string)
rdf:type
?:version
  • a4f2f81dd7c47d6ba08bb0394ece4ed3ac2ebde9 (xsd:string)