UDPipe Natural Language Processing

UDPipe - General

The data preparation part of any Natural Language Processing flow consists of a number of important steps: Tokenization (1), Parts of Speech tagging (2), Lemmatization (3) and Dependency Parsing (4). This package allows you to do out-of-the-box annotation of these 4 steps and also allows you to train your own annotator models directly from R.

It does this by providing an Rcpp wrapper around the UDPipe C++ library which is described at https://ufal.mff.cuni.cz/udpipe and is available at https://github.com/ufal/udpipe.

udpipe the R package

The udpipe R package was designed with the following things in mind when building the Rcpp wrapper around the UDPipe C++ library:

Give R users simple access in order to easily tokenize, tag, lemmatize or perform dependency parsing on text in any language
Provide easy access to pre-trained annotation models
Allow R users to easily construct your own annotation model based on data in CONLL-U format as provided in more than 100 treebanks available at https://universaldependencies.org
Don’t rely on Python or Java so that R users can easily install this package without configuration hassle
No external R package dependencies except the strict necessary (Rcpp and data.table, no tidyverse)

UDPipe the C++ library

UDPipe provides language-agnostic tokenization, tagging, lemmatization and dependency parsing of raw text, which is an essential part in natural language processing.
UDPipe allows to work with data in CONLL-U format as described at https://universaldependencies.org/format.html
The techniques used are explained in detail in the paper: “Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe”, available at https://ufal.mff.cuni.cz/~straka/papers/2017-conll_udpipe.pdf. In that paper, you’ll also find accuracies on different languages and process flow speed (measured in words per second).

udpipe models

Pre-trained models

Before you can start on performing the annotation, you need a model. Pre-trained models build on Universal Dependencies treebanks are made available for more than 65 languages based on 101 treebanks, namely:

afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, chinese-gsdsimp, classical_chinese-kyoto, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, estonian-ewt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, german-hdt, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-twittiro, italian-vit, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-alksnis, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, old_russian-torot, persian-seraji, polish-lfg, polish-pdb, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, scottish_gaelic-arcosg, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb, wolof-wtb.

For R users who want to use these open-sourced models provided by the UDPipe community and start on tagging, you can proceed as follows to download the model of the language of your choice.

library(udpipe)
dl <- udpipe_download_model(language = "dutch")
str(dl)

'data.frame':   1 obs. of  5 variables:
 $ language        : chr "dutch-alpino"
 $ file_model      : chr "/tmp/RtmpdWwWR5/Rbuildc4fbe5eb928d5/udpipe/vignettes/dutch-alpino-ud-2.5-191206.udpipe"
 $ url             : chr "https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/dutch-alpino-"| __truncated__
 $ download_failed : logi FALSE
 $ download_message: chr "OK"

Build your own annotator models

The udipe R package also allows you to easily train your own models, based on data in CONLL-U format, so that you can use these for your own commercial or non-commercial purposes. This is described in the other vignette of this package which you can view by the command vignette("udpipe-train", package = "udpipe") `

Annotate text

Currently the package allows you to do tokenisation, tagging, lemmatization and dependency parsing with one convenient function called udpipe_annotate. This goes as follows.

Load the model

First load the model which you have downloaded or which you have stored somewhere on disk.

## Either give a file in the current working directory
udmodel_dutch <- udpipe_load_model(file = "dutch-alpino-ud-2.5-191206.udpipe")
## Or give the full path to the file 
udmodel_dutch <- udpipe_load_model(file = dl$file_model)

Annotate your text

Tokenisation, tagging and parsing

Once you have this model, you can start on annotating. Provide a vector of text and use udpipe_annotate. The resulting tagged output is in CONLL-U format as described at https://universaldependencies.org/format.html. You can put this in a data.frame format with as.data.frame.

txt <- c("Ik ben de weg kwijt, kunt u me zeggen waar de Lange Wapper ligt? Jazeker meneer", 
         "Het gaat vooruit, het gaat verbazend goed vooruit")
x <- udpipe_annotate(udmodel_dutch, x = txt)
x <- as.data.frame(x)
str(x)

'data.frame':   27 obs. of  14 variables:
 $ doc_id       : chr  "doc1" "doc1" "doc1" "doc1" ...
 $ paragraph_id : int  1 1 1 1 1 1 1 1 1 1 ...
 $ sentence_id  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ sentence     : chr  "Ik ben de weg kwijt, kunt u me zeggen waar de Lange Wapper ligt?" "Ik ben de weg kwijt, kunt u me zeggen waar de Lange Wapper ligt?" "Ik ben de weg kwijt, kunt u me zeggen waar de Lange Wapper ligt?" "Ik ben de weg kwijt, kunt u me zeggen waar de Lange Wapper ligt?" ...
 $ token_id     : chr  "1" "2" "3" "4" ...
 $ token        : chr  "Ik" "ben" "de" "weg" ...
 $ lemma        : chr  "ik" "zijn" "de" "weg" ...
 $ upos         : chr  "PRON" "AUX" "DET" "NOUN" ...
 $ xpos         : chr  "VNW|pers|pron|nomin|vol|1|ev" "WW|pv|tgw|ev" "LID|bep|stan|rest" "N|soort|ev|basis|zijd|stan" ...
 $ feats        : chr  "Case=Nom|Person=1|PronType=Prs" "Number=Sing|Tense=Pres|VerbForm=Fin" "Definite=Def" "Gender=Com|Number=Sing" ...
 $ head_token_id: chr  "5" "5" "4" "5" ...
 $ dep_rel      : chr  "nsubj" "cop" "det" "obj" ...
 $ deps         : chr  NA NA NA NA ...
 $ misc         : chr  NA NA NA NA ...

table(x$upos)


  ADJ   ADV   AUX   DET  NOUN  PRON PUNCT  VERB 
    4     3     2     2     4     5     3     4

Only part of the annotation

Mark that by default udpipe_annotate does Tokenization, Parts of Speech Tagging, Lemmatization and Dependency parsing. If you want to gain some time because you require only a part of the annotation, you can specify to leave parts of the annotation out. This is done as follows.

## Tokenization + finds sentences, does not execute POS tagging, nor lemmatization or dependency parsing
x <- udpipe_annotate(udmodel_dutch, x = txt, tagger = "none", parser = "none")
x <- as.data.frame(x)
table(x$upos)
table(x$dep_rel)

## Tokenization + finds sentences, does POS tagging and lemmatization but does not execute dependency parsing
x <- udpipe_annotate(udmodel_dutch, x = txt, tagger = "default", parser = "none")
x <- as.data.frame(x)
table(x$upos)
table(x$dep_rel)

## Tokenization + finds sentences and executes dependency parsing but does not do POS tagging nor lemmatization
x <- udpipe_annotate(udmodel_dutch, x = txt, tagger = "none", parser = "default")
x <- as.data.frame(x)
table(x$upos)
table(x$dep_rel)

My text data is already tokenised

If your data is already tokenised according to your needs using other tools like the tidytext / tokenizers / text2vec R packages or any other external software or just by manual work. You can still use udpipe to do parts of speech annotation and dependency parsing and skip the tokenisation. This is done as follows.

## Either put every token on a new line and use tokenizer: vertical
input <- list(doc1 = c("Ik", "ben", "de", "weg", "kwijt", ",", "kunt", "u", "me", "zeggen", 
                       "waar", "de", "Lange Wapper", "ligt", "?", "Jazeker", "meneer"),
              doc2 = c("Het", "gaat", "vooruit", ",", "het", "gaat", "verbazend", "goed", "vooruit"))
txt <- sapply(input, FUN=function(x) paste(x, collapse = "\n"))
x <- udpipe_annotate(udmodel_dutch, x = txt, tokenizer = "vertical")
x <- as.data.frame(x)

## Or put every token of each document in 1 string separated by a space and use tokenizer: horizontal
##   Mark that if a token contains a space, you need to replace the space 
##   with the 'NO-BREAK SPACE' (U+00A0) character to make sure it is still considered as one token
txt <- sapply(input, FUN=function(x){
  x <- gsub(" ", intToUtf8(160), x) ## replace space with no-break-space
  paste(x, collapse = " ")
})
x <- udpipe_annotate(udmodel_dutch, x = as.character(txt), tokenizer = "horizontal")
x <- as.data.frame(x)

Remarks

Some remarks:

If your model is not trained to be able to do parsing/tagging, you can not request it to do parsing/tagging
Use argument doc_id to udpipe_annotate so that you can link your document to the tagged terms later on
Your text has to be in UTF-8 Encoding when you pass it to udpipe_annotate, if you don’t have that Encoding use standard R facilities like iconv to convert it to UTF-8. You get also results in UTF-8 encoding back.

dl <- udpipe_download_model(language = "sanskrit", udpipe_model_repo = "jwijffels/udpipe.models.ud.2.0")
udmodel_sanskrit <- udpipe_load_model(file = dl$file_model)
txt <- "ततः असौ प्राह क्षत्रियस्य तिस्रः भार्या धर्मम् भवन्ति तत् एषा कदाचिद् वैश्या सुता भविष्यति तत् अनुरागः ममास्याम् ततः रथकारः तस्य निश्चयम् विज्ञायावदत् वयस्य किम् अ धुना कर्तव्यम् कौलिकः आह किम् अहम् जानामि त्वयि मित्रे यत् अभिहितं मया ततः"
x <- udpipe_annotate(udmodel_sanskrit, x = txt)
Encoding(x$conllu)
x <- as.data.frame(x)

If you want to work with other tools which are capable of handling CONLL-U format, just export your annotation to a file as shown below

x <- udpipe_annotate(udmodel_sanskrit, x = txt)
cat(x$conllu, file = "myannotation.conllu")

UDPipe Natural Language Processing - Text Annotation

Jan Wijffels

2023-01-04