1. Introduction

This vignette walks you through importing a variety of text files into R using the readtext package. Currently, readtext supports plain text files (.txt), data in some form of JavaScript Object Notation (.json), comma- or tab-separated values (.csv, .tab, .tsv), XML documents (.xml), as well as PDF and Microsoft Word formatted files (.pdf, .doc, .docx).

readtext also handles multiple files and file types using, for instance, a “glob” expression, files from a URL, or an archive file (.zip, .tar, .tar.gz, .tar.bz). Usually, you do not have to specify the format of the files explicitly: readtext infers it from the file extension.

The readtext package comes with a data directory called extdata that contains examples of all file types listed above. This vignette uses that data directory throughout.

The extdata directory contains several subfolders that include different text files. In the following examples, we load one or more files stored in each of these folders. The paste0 command is used to concatenate the extdata folder from the readtext package with the subfolders. When reading in your own text files, you will need to specify your own data directory (see ?setwd).
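As a minimal sketch of the path construction described above, assuming the readtext package is installed, the data directory can be located with system.file() and extended with paste0(). The name DATA_DIR matches the variable used later in this vignette; txt_dir is only an illustrative name.

```r
library(readtext)

# Locate the extdata directory shipped with the installed readtext package
DATA_DIR <- system.file("extdata", package = "readtext")

# Concatenate the data directory with a subfolder name
txt_dir <- paste0(DATA_DIR, "/txt")

# Inspect what the subfolder contains
list.files(txt_dir)
```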

2. Reading one or more text files

2.1 Plain text files (.txt)

The folder “txt” contains a subfolder named UDHR with .txt files of the Universal Declaration of Human Rights in 13 languages.
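A short sketch of reading every file in that folder with a glob expression, assuming the folder sits at txt/UDHR inside extdata as described above (rt_udhr is an illustrative name):

```r
library(readtext)

DATA_DIR <- system.file("extdata", package = "readtext")

# The "*" glob matches every file in the UDHR folder
rt_udhr <- readtext(paste0(DATA_DIR, "/txt/UDHR/*"))

# A readtext object is a data.frame with a doc_id and a text column
rt_udhr
```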

We can specify document-level metadata (docvars) based on the file names or on a separate data.frame. Below we take the docvars from the filenames (docvarsfrom = "filenames") and set the names for each variable (docvarnames = c("unit", "context", "year", "language", "party")). The argument dvsep = "_" sets the separator (a regular expression character string) that delimits the docvar elements within the filenames.
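The call might look like the sketch below. The folder name txt/EU_manifestos and the ISO-8859-1 encoding are assumptions based on the example data bundled with readtext; neither is stated in the text above, so adjust them to your own files.

```r
library(readtext)

DATA_DIR <- system.file("extdata", package = "readtext")

# Assumed folder: txt/EU_manifestos, with filenames made of five "_"-separated parts
rt_eu <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
                  docvarsfrom = "filenames",
                  docvarnames = c("unit", "context", "year", "language", "party"),
                  dvsep = "_",
                  encoding = "ISO-8859-1")  # assumed encoding of these files

# Each filename part becomes one docvar column
names(rt_eu)
```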

readtext can also recurse through subdirectories. In our example, the folder txt/movie_reviews contains two subfolders (called neg and pos). We can load all texts included in both folders.
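A minimal sketch of loading both subfolders in one call, assuming the txt/movie_reviews layout described above (rt_reviews is an illustrative name):

```r
library(readtext)

DATA_DIR <- system.file("extdata", package = "readtext")

# The glob matches the neg and pos subfolders; readtext descends into each
rt_reviews <- readtext(paste0(DATA_DIR, "/txt/movie_reviews/*"))

# Check which documents were picked up
rt_reviews$doc_id
```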

2.2 Comma- or tab-separated values (.csv, .tab, .tsv)

readtext can read comma-separated values (.csv files) that contain textual data. We specify the texts variable in our .csv file as the text_field; this is the column that contains the actual text. The other columns of the original .csv file (Year, President, FirstName) are treated as document-level variables by default.
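The following sketch shows one way to do this. The file name csv/inaugCorpus.csv is an assumption based on the example data bundled with readtext; the column names come from the description above.

```r
library(readtext)

DATA_DIR <- system.file("extdata", package = "readtext")

# Assumed example file; "texts" is the column holding the document text
rt_csv <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"),
                   text_field = "texts")

# The remaining columns are kept as docvars
names(rt_csv)
```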

The same procedure applies to tab-separated values.
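For tab-separated data the call has the same shape. Both the file name tsv/dailsample.tsv and the text column name speech are assumptions about the bundled example data, so check them against your installed copy.

```r
library(readtext)

DATA_DIR <- system.file("extdata", package = "readtext")

# Assumed example file and text column name
rt_tsv <- readtext(paste0(DATA_DIR, "/tsv/dailsample.tsv"),
                   text_field = "speech")

rt_tsv
```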

2.5 Microsoft Word files (.doc, .docx)

Microsoft Word formatted files are converted using the antiword package for older .doc files and using XML for newer .docx files.
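A brief sketch for the .docx case, assuming the example files live in a subfolder named word (an assumption about the bundled data, not stated above):

```r
library(readtext)

DATA_DIR <- system.file("extdata", package = "readtext")

# Assumed folder name; the glob restricts the import to .docx files
rt_word <- readtext(paste0(DATA_DIR, "/word/*.docx"))

rt_word
```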

2.6 Text from URLs

You can also read in text directly from a URL.

2.7 Text from archive files (.zip, .tar, .tar.gz, .tar.bz)

Finally, it is possible to include text from archives.

3. Inter-operability with quanteda

readtext was originally developed in early versions of the quanteda package for the quantitative analysis of textual data. It was spawned from the textfile() function from that package, and now lives exclusively in readtext. Because quanteda’s corpus constructor recognizes the data.frame format returned by readtext(), it can construct a corpus directly from a readtext object, preserving all docvars and other metadata.

You can easily construct a corpus from a readtext object.
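As a sketch, assuming quanteda is installed and reusing the assumed example file csv/inaugCorpus.csv, the readtext object can be passed straight to corpus():

```r
library(readtext)
library(quanteda)

DATA_DIR <- system.file("extdata", package = "readtext")

# Assumed example file; any readtext object works the same way
rt_csv <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"),
                   text_field = "texts")

# corpus() takes the text column as documents and the rest as docvars
corp <- corpus(rt_csv)
summary(corp, 5)
```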

4. Solving common problems

4.1 Remove page numbers using regular expressions

When a document contains page numbers, they are imported as well. If you want to remove them, you can use a regular expression. We strongly recommend using the stringi package. For the most common regular expressions you can look at this cheatsheet.

You first need to check in the original file in which format the page numbers occur (e.g., “1”, “-1-”, “page 1” etc.). We can make use of the fact that page numbers are almost always preceded and followed by a linebreak (\n). After loading the text with readtext, you can replace the page numbers.

In the first example, the page numbers have the format “page X”.
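One possible approach is sketched below on a made-up sample string. The sample text and the exact pattern are illustrative assumptions; adapt the pattern to the way page numbers appear in your files.

```r
library(stringi)

# Made-up sample text with page numbers of the form "page X" on their own lines
text_a <- paste(c("The first paragraph ends here.",
                  "page 1",
                  "The second paragraph starts here.",
                  "page 2",
                  "The end."),
                collapse = "\n")

# (?m) lets ^ and $ match at line breaks, so only lines that consist
# solely of "page <digits>" are removed, together with their linebreak
clean_a <- stri_replace_all_regex(text_a,
                                  "(?m)^[ \\t]*page \\d+[ \\t]*$\\n?",
                                  "")
cat(clean_a)
```

Anchoring the pattern to whole lines avoids deleting the word "page" followed by a number when it occurs inside running text.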

In the second example we remove page numbers which have the format “- X -”.
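The same idea can be sketched for this format, again on a made-up sample string. The pattern below also tolerates the variant without spaces ("-X-"), which is an illustrative choice rather than a requirement.

```r
library(stringi)

# Made-up sample text with page numbers of the form "- X -" or "-X-"
text_b <- paste(c("The first paragraph ends here.",
                  "- 1 -",
                  "The second paragraph starts here.",
                  "-2-",
                  "The end."),
                collapse = "\n")

# Remove lines that consist solely of a number between two hyphens
clean_b <- stri_replace_all_regex(text_b,
                                  "(?m)^[ \\t]*- ?\\d+ ?-[ \\t]*$\\n?",
                                  "")
cat(clean_b)
```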

Such stringi functions can also be applied to readtext objects.
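Because a readtext object is a data.frame, the text column can be modified in place. The sketch below writes a small made-up file to a temporary folder so that it is self-contained; the folder and file names are hypothetical.

```r
library(readtext)
library(stringi)

# Write a made-up document containing a "page X" line to a temporary folder
tmp_dir <- file.path(tempdir(), "page_number_demo")
dir.create(tmp_dir, showWarnings = FALSE)
writeLines(c("First paragraph.", "page 1", "Second paragraph."),
           file.path(tmp_dir, "doc1.txt"))

rt <- readtext(file.path(tmp_dir, "*.txt"))

# Apply the replacement to the text column of the readtext object
rt$text <- stri_replace_all_regex(rt$text,
                                  "(?m)^[ \\t]*page \\d+[ \\t]*$\\n?",
                                  "")
cat(rt$text)
```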

4.2 Read files with different encodings

Sometimes files of the same type have different encodings. If the encoding of a file is included in the file name, we can extract this information and import the texts correctly.

Here, we will get the encoding from the filenames themselves.
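A minimal sketch of that extraction in base R, assuming filenames follow the document_language_encoding.txt pattern. The three filenames below are illustrative examples of that pattern, not a listing of the archive.

```r
# Illustrative filenames following the document_language_encoding.txt pattern
filenames <- c("UDHR_Arabic_ISO-8859-6.txt",
               "UDHR_Russian_KOI8-R.txt",
               "UDHR_Thai_UTF-8.txt")

# Drop the extension, then split each name on "_"
parts <- strsplit(sub("\\.txt$", "", filenames), "_")

# The third element of each split name is the encoding
fileencodings <- vapply(parts, function(p) p[3], character(1))
fileencodings
```

Splitting on "_" is safe here because the encoding names themselves use hyphens rather than underscores.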

If we read the text files without specifying the encoding, we get incorrectly decoded text. To avoid this, we specify the encoding using the character object fileencodings created above.

We can also add docvars based on the filenames.

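# Read every file in the archive, passing one encoding per file and
# splitting each filename on "_" (the default dvsep) into three docvars
txts <- readtext(paste0(DATA_DIR, "/data_files_encodedtexts.zip"), 
                 encoding = fileencodings,
                 docvarsfrom = "filenames", 
                 docvarnames = c("document", "language", "input_encoding"))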
print(txts, n = 50)
## readtext object consisting of 36 documents and 3 docvars.
## # data.frame [36 × 5]
##    doc_id              text             document   language input_encoding
##    <chr>               <chr>            <chr>      <chr>    <chr>         
##  1 IndianTreaty_Engli… "\"WHEREAS, t\"… IndianTre… English  UTF-16LE      
##  2 IndianTreaty_Engli… "\"ARTICLE 1.\"… IndianTre… English  UTF-8-BOM     
##  3 UDHR_Arabic_ISO-88… "\"الديباجة\nل\… UDHR       Arabic   ISO-8859-6    
##  4 UDHR_Arabic_UTF-8.… "\"الديباجة\nل\… UDHR       Arabic   UTF-8         
##  5 UDHR_Arabic_WINDOW… "\"الديباجة\nل\… UDHR       Arabic   WINDOWS-1256  
##  6 UDHR_Chinese_GB231… "\"世界人权宣言\n联合国\… UDHR       Chinese  GB2312        
##  7 UDHR_Chinese_GBK.t… "\"世界人权宣言\n联合国\… UDHR       Chinese  GBK           
##  8 UDHR_Chinese_UTF-8… "\"世界人权宣言\n联合国\… UDHR       Chinese  UTF-8         
##  9 UDHR_English_UTF-1… "\"Universal \"… UDHR       English  UTF-16BE      
## 10 UDHR_English_UTF-1… "\"Universal \"… UDHR       English  UTF-16LE      
## 11 UDHR_English_UTF-8… "\"Universal \"… UDHR       English  UTF-8         
## 12 UDHR_English_WINDO… "\"Universal \"… UDHR       English  WINDOWS-1252  
## 13 UDHR_French_ISO-88… "\"Déclaratio\"… UDHR       French   ISO-8859-1    
## 14 UDHR_French_UTF-8.… "\"Déclaratio\"… UDHR       French   UTF-8         
## 15 UDHR_French_WINDOW… "\"Déclaratio\"… UDHR       French   WINDOWS-1252  
## 16 UDHR_German_ISO-88… "\"Die Allgem\"… UDHR       German   ISO-8859-1    
## 17 UDHR_German_UTF-8.… "\"Die Allgem\"… UDHR       German   UTF-8         
## 18 UDHR_German_WINDOW… "\"Die Allgem\"… UDHR       German   WINDOWS-1252  
## 19 UDHR_Greek_CP1253.… "\"ΟΙΚΟΥΜΕΝΙΚ\"… UDHR       Greek    CP1253        
## 20 UDHR_Greek_ISO-885… "\"ΟΙΚΟΥΜΕΝΙΚ\"… UDHR       Greek    ISO-8859-7    
## 21 UDHR_Greek_UTF-8.t… "\"ΟΙΚΟΥΜΕΝΙΚ\"… UDHR       Greek    UTF-8         
## 22 UDHR_Hindi_UTF-8.t… "\"मानव अधिका\"… UDHR       Hindi    UTF-8         
## 23 UDHR_Icelandic_ISO… "\"Mannréttin\"… UDHR       Iceland… ISO-8859-1    
## 24 UDHR_Icelandic_UTF… "\"Mannréttin\"… UDHR       Iceland… UTF-8         
## 25 UDHR_Icelandic_WIN… "\"Mannréttin\"… UDHR       Iceland… WINDOWS-1252  
## 26 UDHR_Japanese_CP93… "\"『世界人権宣言』\n \… UDHR       Japanese CP932         
## 27 UDHR_Japanese_ISO-… "\"『世界人権宣言』\n \… UDHR       Japanese ISO-2022-JP   
## 28 UDHR_Japanese_UTF-… "\"『世界人権宣言』\n \… UDHR       Japanese UTF-8         
## 29 UDHR_Japanese_WIND… "\"『世界人権宣言』\n \… UDHR       Japanese WINDOWS-936   
## 30 UDHR_Korean_ISO-20… "\"세 계 인 권 선 \"… UDHR       Korean   ISO-2022-KR   
## 31 UDHR_Korean_UTF-8.… "\"세 계 인 권 선 \"… UDHR       Korean   UTF-8         
## 32 UDHR_Russian_ISO-8… "\"Всеобщая д\"… UDHR       Russian  ISO-8859-5    
## 33 UDHR_Russian_KOI8-… "\"Всеобщая д\"… UDHR       Russian  KOI8-R        
## 34 UDHR_Russian_UTF-8… "\"Всеобщая д\"… UDHR       Russian  UTF-8         
## 35 UDHR_Russian_WINDO… "\"Всеобщая д\"… UDHR       Russian  WINDOWS-1251  
## 36 UDHR_Thai_UTF-8.txt "\"ปฏิญญาสากล\"…  UDHR       Thai     UTF-8

From this readtext object we can easily create a quanteda corpus object.