|
?:description
|
-
Duct tape the ‘quanteda’ ecosystem (Benoit et al., 2018) doi:10.21105/joss.00774 to modern Transformer-based text classification models (Wolf et al., 2020) doi:10.18653/v1/2020.emnlp-demos.6, in order to facilitate supervised machine learning for textual data. This package mimics the behaviors of ‘quanteda.textmodels’ and provides a function to set up the ‘Python’ environment needed to use the pretrained models from ‘Hugging Face’ (https://huggingface.co/). More information: doi:10.5117/CCR2023.1.003.CHAN

Keywords
Deep Learning, Supervised machine learning, Text analysis

Use Cases
This package can be used in any typical supervised machine learning use case involving text data. In the software paper (Chan, 2023), several cases were presented, e.g. prediction of incivility based on tweets (Theocharis et al., 2020).

Input Data
grafzahl accepts text data as either a character vector or the corpus data structure of quanteda.

Sample Input and Output Data
A sample input is a corpus. This is an example dataset:

library(grafzahl)
library(quanteda)
unciviltweets

Corpus consisting of 19,982 documents and 1 docvar.
text1 :
"@ @ Karma gave you a second chance yesterday. Start doing m..."
text2 :
"@ With people like you, Steve King there's still hope for we..."
text3 :
"@ @ You bill is a joke and will sink the GOP. #WEDESERVEBETT..."
text4 :
"@ Dream on. The only thing trump understands is how to enric..."
text5 :
"@ @ Just like the Democrat taliban party was up front with t..."
text6 :
"@ you are going to have more of the same with HRC, and you a..."
[ reached max_ndoc ... 19,976 more documents ]

The output is an S3 object.

Hardware Requirements
grafzahl runs on any machine that can run R. A CUDA-capable GPU is optional.

Environment Setup
With R installed:

install.packages("grafzahl")

How to Use
Before training, set up the conda environment:

setup_grafzahl(cuda = TRUE) ## if you have GPU(s)

A typical way to train a model and make predictions:

input <- corpus(ecosent, text_field = "headline")
training_corpus <- corpus_subset(input, !gold)

Use the x (text data), y (label, in this case a docvar), and model_name (model name, from Hugging Face) parameters to control how the supervised machine learning model is trained.

model2 <- grafzahl(x = training_corpus,
                   y = "value",
                   model_name = "GroNLP/bert-base-dutch-cased")
test_corpus <- corpus_subset(input, gold)
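## Illustrative addition, not part of the original example: keep the gold labels of the
## test set so that the predictions made below can be compared against them, for instance
## with table(predicted, gold_labels), where `predicted` is a hypothetical name for the
## stored result of the predict() call. docvars() is the quanteda accessor for document
## variables; "value" is the same label docvar that was used for training above.
gold_labels <- docvars(test_corpus, "value")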
predict(model2, test_corpus)

Technical Details
See the publication (Chan, 2023) for the tested and selected models and parameters, the reasoning behind the model selection, and the datasets used for training.

References
Chan, C. H. (2023). grafzahl: fine-tuning Transformers for text data from within R. Computational Communication Research, 5(1), 76. https://doi.org/10.5117/CCR2023.1.003.CHAN

Contact Details
Maintainer: Chung-hong Chan (chainsawtiney@gmail.com)
Issue Tracker: https://github.com/gesistsa/grafzahl/issues
(xsd:string)
|