# Load readtext package
library("readtext")
This vignette walks you through importing a variety of text files into R using the readtext package. Currently, readtext supports plain text files (.txt), data in JavaScript Object Notation (.json), comma- or tab-separated values (.csv, .tab, .tsv), XML documents (.xml), as well as PDF and Microsoft Word formatted files (.pdf, .doc, .docx).
readtext also handles multiple files and file types using, for instance, a “glob” expression, files from a URL, or an archive file (.zip, .tar, .tar.gz, .tar.bz). Usually you do not have to specify the format of the files explicitly: readtext infers it from the file extension.
The readtext package comes with a data directory called extdata that contains examples of all files listed above. In the vignette, we use this data directory.
# Get the data directory from readtext
DATA_DIR <- system.file("extdata/", package = "readtext")
The extdata directory contains several subfolders that include different text files. In the following examples, we load one or more files stored in each of these folders. The paste0 command is used to concatenate the extdata folder from the readtext package with the subfolders. When reading in custom text files, you will need to determine your own data directory (see ?setwd()).
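As a minimal sketch of pointing at your own data directory, the following base R code builds a glob pattern with file.path() and checks which files it matches. The folder layout (my_project/data/txt) and the file names are hypothetical and are created in a temporary directory purely for illustration.

```r
# Hypothetical project layout, created under tempdir() for illustration only
my_data_dir <- file.path(tempdir(), "my_project", "data")
dir.create(file.path(my_data_dir, "txt"), recursive = TRUE, showWarnings = FALSE)
writeLines("A first document.", file.path(my_data_dir, "txt", "doc1.txt"))
writeLines("A second document.", file.path(my_data_dir, "txt", "doc2.txt"))

# Build a glob pattern, analogous to the paste0() calls used in this vignette
pattern <- file.path(my_data_dir, "txt", "*.txt")

# Sys.glob() expands the pattern, so you can check what would be matched
matched <- sort(basename(Sys.glob(pattern)))
matched
```

The same pattern string can then be passed to readtext() as its first argument, without changing the working directory.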
The folder “txt” contains a subfolder named UDHR with .txt files of the Universal Declaration of Human Rights in 13 languages.
# Read in all files from a folder
readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
## readtext object consisting of 13 documents and 0 docvars.
## # A data frame: 13 × 2
##   doc_id            text
##   <chr>             <chr>
## 1 UDHR_chinese.txt  "\"世界人权宣言\n联合国\"..."
## 2 UDHR_czech.txt    "\"VŠEOBECNÁ \"..."
## 3 UDHR_danish.txt   "\"Den 10. de\"..."
## 4 UDHR_english.txt  "\"Universal \"..."
## 5 UDHR_french.txt   "\"Déclaratio\"..."
## 6 UDHR_georgian.txt "\"FLFVBFYBC \"..."
## # ℹ 7 more rows
We can specify document-level metadata (docvars) based on the file names or on a separate data.frame. Below we take the docvars from the filenames (docvarsfrom = "filenames") and set the names for each variable (docvarnames = c("unit", "context", "year", "language", "party")). The argument dvsep = "_" specifies the separator (a regular expression character string) used in the filenames to delimit the docvar elements.
# Manifestos with docvars from filenames
readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
         docvarsfrom = "filenames",
         docvarnames = c("unit", "context", "year", "language", "party"),
         dvsep = "_",
         encoding = "ISO-8859-1")
## readtext object consisting of 17 documents and 5 docvars.
## # A data frame: 17 × 7
##   doc_id                  text                unit  context  year language party
##   <chr>                   <chr>               <chr> <chr>   <int> <chr>    <chr>
## 1 EU_euro_2004_de_PSE.txt "\"PES · PSE \"..." EU    euro     2004 de       PSE
## 2 EU_euro_2004_de_V.txt   "\"Gemeinsame\"..." EU    euro     2004 de       V
## 3 EU_euro_2004_en_PSE.txt "\"PES · PSE \"..." EU    euro     2004 en       PSE
## 4 EU_euro_2004_en_V.txt   "\"Manifesto\n\"..… EU    euro     2004 en       V
## 5 EU_euro_2004_es_PSE.txt "\"PES · PSE \"..." EU    euro     2004 es       PSE
## 6 EU_euro_2004_es_V.txt   "\"Manifesto\n\"..… EU    euro     2004 es       V
## # ℹ 11 more rows
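To make the mapping from filenames to docvars concrete, here is a base R sketch of the idea: strip the extension, split on the separator, and name the pieces. It illustrates the concept only and is not readtext's internal implementation; the conversion of year to integer is done by hand in this sketch.

```r
# Two file names taken from the example output above
fnames <- c("EU_euro_2004_de_PSE.txt", "EU_euro_2004_en_V.txt")

# Drop the extension and split each name on the separator "_"
stems <- tools::file_path_sans_ext(fnames)
parts <- strsplit(stems, "_")

# Bind the pieces into one row per file and attach the variable names
docvars <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(docvars) <- c("unit", "context", "year", "language", "party")
docvars$year <- as.integer(docvars$year)

# Combine with the document identifiers
docvars <- data.frame(doc_id = fnames, docvars, stringsAsFactors = FALSE)
docvars
```

This only works cleanly when every file name has the same number of separator-delimited elements, which is why a consistent naming scheme matters.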
readtext can also recurse through subdirectories. In our example, the folder txt/movie_reviews contains two subfolders (called neg and pos). We can load all texts included in both folders.
# Recurse through subdirectories
readtext(paste0(DATA_DIR, "/txt/movie_reviews/*"))
## readtext object consisting of 10 documents and 0 docvars.
## # A data frame: 10 × 2
##   doc_id              text
##   <chr>               <chr>
## 1 neg_cv000_29416.txt "\"plot : two\"..."
## 2 neg_cv001_19502.txt "\"the happy \"..."
## 3 neg_cv002_17424.txt "\"it is movi\"..."
## 4 neg_cv003_12683.txt "\" \" quest f\"..."
## 5 neg_cv004_12641.txt "\"synopsis :\"..."
## 6 pos_cv000_29590.txt "\"films adap\"..."
## # ℹ 4 more rows
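To see what recursing through subfolders means in practice, the base R sketch below builds a small folder tree and lists its files recursively. The folder and file names are hypothetical stand-ins for the movie_reviews layout, and deriving a label from the parent folder is shown only as one possible follow-up step.

```r
# Hypothetical folder tree with "neg" and "pos" subfolders, created in tempdir()
review_dir <- file.path(tempdir(), "movie_reviews_demo")
dir.create(file.path(review_dir, "neg"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(review_dir, "pos"), recursive = TRUE, showWarnings = FALSE)
writeLines("a dull film", file.path(review_dir, "neg", "neg_cv000.txt"))
writeLines("a great film", file.path(review_dir, "pos", "pos_cv000.txt"))

# List all .txt files below review_dir, descending into subfolders
found <- sort(list.files(review_dir, pattern = "\\.txt$", recursive = TRUE))

# The parent folder of each file can serve as a label
labels <- basename(dirname(found))
data.frame(file = basename(found), folder = labels)
```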
readtext can read comma-separated values (.csv files) that contain textual data. We specify the texts column of our .csv file as the text_field; this is the column that contains the actual text. The other columns of the original .csv file (Year, President, FirstName) are by default treated as document-level variables.
# Read in comma-separated values
readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
## readtext object consisting of 5 documents and 3 docvars.
## # A data frame: 5 × 5
##   doc_id            text                 Year President  FirstName
##   <chr>             <chr>               <int> <chr>      <chr>
## 1 inaugCorpus.csv.1 "\"Fellow-Cit\"..."  1789 Washington George
## 2 inaugCorpus.csv.2 "\"Fellow cit\"..."  1793 Washington George
## 3 inaugCorpus.csv.3 "\"When it wa\"..."  1797 Adams      John
## 4 inaugCorpus.csv.4 "\"Friends an\"..."  1801 Jefferson  Thomas
## 5 inaugCorpus.csv.5 "\"Proceeding\"..."  1805 Jefferson  Thomas
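The role of text_field can be illustrated with a small base R sketch: one column becomes the text, every other column becomes a document-level variable, and each row gets an identifier built from the file name and row number (as in the output above). The file speeches.csv and its contents are made up for this illustration, and the code imitates the resulting layout rather than reproducing readtext's implementation.

```r
# Write a small, made-up CSV file to a temporary location
csv_path <- file.path(tempdir(), "speeches.csv")
toy <- data.frame(
  texts = c("Fellow citizens of the Senate.", "Friends and fellow citizens."),
  Year = c(1789L, 1801L),
  President = c("Washington", "Jefferson"),
  stringsAsFactors = FALSE
)
write.csv(toy, csv_path, row.names = FALSE)

# Read it back and separate the text column from the remaining columns
raw <- read.csv(csv_path, stringsAsFactors = FALSE)
text_field <- "texts"
result <- data.frame(
  doc_id = paste0(basename(csv_path), ".", seq_len(nrow(raw))),
  text = raw[[text_field]],
  raw[setdiff(names(raw), text_field)],
  stringsAsFactors = FALSE
)
result
```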
The same procedure applies to tab-separated values.
# Read in tab-separated values
readtext(paste0(DATA_DIR, "/tsv/dailsample.tsv"), text_field = "speech")
## readtext object consisting of 33 documents and 9 docvars.
## # A data frame: 33 × 11
##   doc_id         text  speechID memberID partyID constID title date  member_name
##   <chr>          <chr>    <int>    <int>   <int>   <int> <chr> <chr> <chr>
## 1 dailsample.ts… "\"M…        1      977      22     158 1. C… 1919… Count Geor…
## 2 dailsample.ts… "\"I…        2     1603      22     103 1. C… 1919… Mr. Pádrai…
## 3 dailsample.ts… "\"'…        3      116      22     178 1. C… 1919… Mr. Cathal…
## 4 dailsample.ts… "\"T…        4      116      22     178 2. C… 1919… Mr. Cathal…
## 5 dailsample.ts… "\"L…        5      116      22     178 3. A… 1919… Mr. Cathal…
## 6 dailsample.ts… "\"-…        6      116      22     178 3. A… 1919… Mr. Cathal…
## # ℹ 27 more rows
## # ℹ 2 more variables: party_name <chr>, const_name <chr>
You can also read .json data. Again you need to specify the text_field.
# Read in JSON data
readtext(paste0(DATA_DIR, "/json/inaugural_sample.json"), text_field = "texts")
## readtext object consisting of 3 documents and 3 docvars.
## # A data frame: 3 × 5
##   doc_id                  text                 Year President  FirstName
##   <chr>                   <chr>               <int> <chr>      <chr>
## 1 inaugural_sample.json.1 "\"Fellow-Cit\"..."  1789 Washington George
## 2 inaugural_sample.json.2 "\"Fellow cit\"..."  1793 Washington George
## 3 inaugural_sample.json.3 "\"When it wa\"..."  1797 Adams      John
readtext can also read in and convert .pdf files. In the example below we load all .pdf files stored in the UDHR folder, and specify that the docvars are taken from the filenames. We call the document-level variables document and language, and specify the delimiter (dvsep).
# Read in Universal Declaration of Human Rights pdf files
(rt_pdf <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
                    docvarsfrom = "filenames",
                    docvarnames = c("document", "language"),
                    dvsep = "_"))
## readtext object consisting of 11 documents and 2 docvars.
## # A data frame: 11 × 4
##   doc_id           text                          document language
##   <chr>            <chr>                         <chr>    <chr>
## 1 UDHR_chinese.pdf "\"世界人权宣言\n\n联合\"..." UDHR     chinese
## 2 UDHR_czech.pdf   "\"VŠEOBECNÁ \"..."           UDHR     czech
## 3 UDHR_danish.pdf  "\"Den 10. de\"..."           UDHR     danish
## 4 UDHR_english.pdf "\"Universal \"..."           UDHR     english
## 5 UDHR_french.pdf  "\"Déclaratio\"..."           UDHR     french
## 6 UDHR_greek.pdf   "\"ΟΙΚΟΥΜΕΝΙΚ\"..."           UDHR     greek
## # ℹ 5 more rows
Microsoft Word formatted files are converted through the package antiword for older .doc files, and using XML for newer .docx files.
# Read in Word data (.docx)
readtext(paste0(DATA_DIR, "/word/*.docx"))
## readtext object consisting of 2 documents and 0 docvars.
## # A data frame: 2 × 2
##   doc_id                      text
##   <chr>                       <chr>
## 1 UK_2015_EccentricParty.docx "\"The Eccent\"..."
## 2 UK_2015_LoonyParty.docx     "\"The Offici\"..."
readtext was originally developed in early versions of the quanteda package for the quantitative analysis of textual data. It was spawned from the textfile() function from that package, and now lives exclusively in readtext. Because quanteda’s corpus constructor recognizes the data.frame format returned by readtext(), it can construct a corpus directly from a readtext object, preserving all docvars and other meta-data.
You can easily construct a corpus from a readtext object.
if (require("quanteda")) {
    # read in comma-separated values with readtext
    rt_csv <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
    # create quanteda corpus
    corpus_csv <- corpus(rt_csv)
    summary(corpus_csv, 5)
}
## Loading required package: quanteda
## Package version: 3.3.1
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: 10 of 10 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:readtext':
##
## texts
## Corpus consisting of 5 documents, showing 5 documents:
##
## Text Types Tokens Sentences Year President FirstName
## inaugCorpus.csv.1 625 1539 23 1789 Washington George
## inaugCorpus.csv.2 96 147 4 1793 Washington George
## inaugCorpus.csv.3 826 2577 37 1797 Adams John
## inaugCorpus.csv.4 717 1923 41 1801 Jefferson Thomas
## inaugCorpus.csv.5 804 2380 45 1805 Jefferson Thomas
When a document contains page numbers, they are imported as well. If you want to remove them, you can use a regular expression. We strongly recommend using the stringi package for this. A regular-expression cheatsheet is a useful reference for the most common patterns.
You first need to check in the original file in which format the page numbers occur (e.g., “1”, “-1-”, “page 1” etc.). We can make use of the fact that page numbers are almost always preceded and followed by a line break (\n). After loading the text with readtext, you can replace the page numbers.
In the first example, the page numbers have the format “page X”.
# Make some text with page numbers
sample_text_a <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus,
page 1
with the newspaper from a boy named quick Seamus, in his mouth.
page 2
The quicker brown fox jumped over 2 lazy dogs."
sample_text_a
## [1] "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, \npage 1 \nwith the newspaper from a boy named quick Seamus, in his mouth.\npage 2\nThe quicker brown fox jumped over 2 lazy dogs."
# Load stringi, which provides the stri_* functions used below
library("stringi")
# Remove "page" and respective digit
sample_text_a2 <- unlist(stri_split_fixed(sample_text_a, '\n'), use.names = FALSE)
sample_text_a2 <- stri_replace_all_regex(sample_text_a2, "page \\d*", "")
sample_text_a2 <- stri_trim_both(sample_text_a2)
sample_text_a2 <- sample_text_a2[sample_text_a2 != '']
stri_paste(sample_text_a2, collapse = '\n')
## [1] "The quick brown fox named Seamus jumps over the lazy dog also named Seamus,\nwith the newspaper from a boy named quick Seamus, in his mouth.\nThe quicker brown fox jumped over 2 lazy dogs."
In the second example we remove page numbers which have the format “- X -”.
sample_text_b <- "The quick brown fox named Seamus
- 1 -
jumps over the lazy dog also named Seamus, with
- 2 -
the newspaper from a boy named quick Seamus, in his mouth.
- 33 -
The quicker brown fox jumped over 2 lazy dogs."
sample_text_b
## [1] "The quick brown fox named Seamus \n- 1 - \njumps over the lazy dog also named Seamus, with \n- 2 - \nthe newspaper from a boy named quick Seamus, in his mouth. \n- 33 - \nThe quicker brown fox jumped over 2 lazy dogs."
sample_text_b2 <- unlist(stri_split_fixed(sample_text_b, '\n'), use.names = FALSE)
sample_text_b2 <- stri_replace_all_regex(sample_text_b2, "[-] \\d* [-]", "")
sample_text_b2 <- stri_trim_both(sample_text_b2)
sample_text_b2 <- sample_text_b2[sample_text_b2 != '']
stri_paste(sample_text_b2, collapse = '\n')
## [1] "The quick brown fox named Seamus\njumps over the lazy dog also named Seamus, with\nthe newspaper from a boy named quick Seamus, in his mouth.\nThe quicker brown fox jumped over 2 lazy dogs."
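The patterns above would also match a string such as “page 2” when it occurs inside a sentence. As a hedged variant, the sketch below drops only those lines that consist entirely of a page marker, by anchoring the regular expression with ^ and $. The sample text and the variable names are made up for this illustration.

```r
library("stringi")

# Made-up text in which "page 2" also appears inside a sentence
sample_text_c <- "Results are discussed on page 2 of the report.\npage 1\nThe appendix follows.\n- 2 -\nEnd of text."

# Split into lines and flag lines that contain nothing but a page marker
text_lines <- stri_split_lines1(sample_text_c)
is_page_number <- stri_detect_regex(text_lines, "^\\s*(page \\d+|- \\d+ -)\\s*$")

# Keep the remaining lines and join them again
cleaned <- stri_paste(text_lines[!is_page_number], collapse = "\n")
cleaned
```

Anchoring the pattern keeps the in-sentence reference to “page 2” intact while still removing the stand-alone markers.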
Such stringi functions can also be applied to readtext objects.
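A readtext object is a data.frame with a text column, so the replacement can be applied to that column directly. The sketch below uses a plain data.frame with made-up contents as a stand-in for a readtext object, and an inline (?m) flag so that ^ matches at the start of every line.

```r
library("stringi")

# Stand-in for a readtext object: a data.frame with doc_id and text columns
rt <- data.frame(
  doc_id = c("a.txt", "b.txt"),
  text = c("First part\npage 1\nsecond part", "Only text"),
  stringsAsFactors = FALSE
)

# Remove whole lines of the form "page X", including their line break
rt$text <- stri_replace_all_regex(rt$text, "(?m)^[ \\t]*page \\d+[ \\t]*\\n?", "")
rt$text
```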
Sometimes files of the same type have different encodings. If the encoding of a file is included in the file name, we can extract this information and import the texts correctly.
# create a temporary directory to extract the .zip file
FILEDIR <- tempdir()
# unzip file
unzip(system.file("extdata", "data_files_encodedtexts.zip", package = "readtext"), exdir = FILEDIR)
Here, we will get the encoding from the filenames themselves.
# get encoding from filename
filenames <- list.files(FILEDIR, "^(Indian|UDHR_).*\\.txt$")
head(filenames)
## [1] "IndianTreaty_English_UTF-16LE.txt" "IndianTreaty_English_UTF-8-BOM.txt"
## [3] "UDHR_Arabic_ISO-8859-6.txt" "UDHR_Arabic_UTF-8.txt"
## [5] "UDHR_Arabic_WINDOWS-1256.txt" "UDHR_Chinese_GB2312.txt"
# Strip the extension (escape the dot so that it matches literally)
filenames <- gsub("\\.txt$", "", filenames)
parts <- strsplit(filenames, "_")
fileencodings <- sapply(parts, "[", 3)
head(fileencodings)
## [1] "UTF-16LE" "UTF-8-BOM" "ISO-8859-6" "UTF-8" "WINDOWS-1256"
## [6] "GB2312"
# Check whether certain file encodings are not supported
notAvailableIndex <- which(!(fileencodings %in% iconvlist()))
fileencodings[notAvailableIndex]
## [1] "UTF-8-BOM"
If we read the text files without specifying the encoding, we get erroneously formatted text. To avoid this, we set the encoding argument to the character vector fileencodings created above. We can also add docvars based on the filenames.
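As a small base R illustration of why the declared encoding matters, the sketch below takes the bytes of the word “Déclaration” as they would be stored in an ISO-8859-1 file, shows that they are not valid UTF-8, and then converts them with the correct source encoding. The byte vector is constructed by hand for this example.

```r
# Bytes of "Déclaration" in ISO-8859-1: the é is the single byte 0xe9
latin1_bytes <- as.raw(c(0x44, 0xe9, 0x63, 0x6c, 0x61, 0x72,
                         0x61, 0x74, 0x69, 0x6f, 0x6e))
raw_text <- rawToChar(latin1_bytes)

# Interpreted as UTF-8, these bytes are invalid
is_valid_utf8 <- validUTF8(raw_text)

# Declaring the true source encoding yields the intended string
decoded <- iconv(raw_text, from = "ISO-8859-1", to = "UTF-8")
decoded == "D\u00e9claration"
```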
txts <- readtext(paste0(DATA_DIR, "/data_files_encodedtexts.zip"),
                 encoding = fileencodings,
                 docvarsfrom = "filenames",
                 docvarnames = c("document", "language", "input_encoding"))
print(txts, n = 50)
## readtext object consisting of 36 documents and 3 docvars.
## # A data frame: 36 × 5
##    doc_id                             text      document language input_encoding
##    <chr>                              <chr>     <chr>    <chr>    <chr>
##  1 IndianTreaty_English_UTF-16LE.txt  "\"WHERE… IndianT… English  UTF-16LE
##  2 IndianTreaty_English_UTF-8-BOM.txt "\"ARTIC… IndianT… English  UTF-8-BOM
##  3 UDHR_Arabic_ISO-8859-6.txt         "\"الديب… UDHR     Arabic   ISO-8859-6
##  4 UDHR_Arabic_UTF-8.txt              "\"الديب… UDHR     Arabic   UTF-8
##  5 UDHR_Arabic_WINDOWS-1256.txt       "\"الديب… UDHR     Arabic   WINDOWS-1256
##  6 UDHR_Chinese_GB2312.txt            "\"世界…  UDHR     Chinese  GB2312
##  7 UDHR_Chinese_GBK.txt               "\"世界…  UDHR     Chinese  GBK
##  8 UDHR_Chinese_UTF-8.txt             "\"世界…  UDHR     Chinese  UTF-8
##  9 UDHR_English_UTF-16BE.txt          "\"Unive… UDHR     English  UTF-16BE
## 10 UDHR_English_UTF-16LE.txt          "\"Unive… UDHR     English  UTF-16LE
## 11 UDHR_English_UTF-8.txt             "\"Unive… UDHR     English  UTF-8
## 12 UDHR_English_WINDOWS-1252.txt      "\"Unive… UDHR     English  WINDOWS-1252
## 13 UDHR_French_ISO-8859-1.txt         "\"Décla… UDHR     French   ISO-8859-1
## 14 UDHR_French_UTF-8.txt              "\"Décla… UDHR     French   UTF-8
## 15 UDHR_French_WINDOWS-1252.txt       "\"Décla… UDHR     French   WINDOWS-1252
## 16 UDHR_German_ISO-8859-1.txt         "\"Die A… UDHR     German   ISO-8859-1
## 17 UDHR_German_UTF-8.txt              "\"Die A… UDHR     German   UTF-8
## 18 UDHR_German_WINDOWS-1252.txt       "\"Die A… UDHR     German   WINDOWS-1252
## 19 UDHR_Greek_CP1253.txt              "\"ΟΙΚΟΥ… UDHR     Greek    CP1253
## 20 UDHR_Greek_ISO-8859-7.txt          "\"ΟΙΚΟΥ… UDHR     Greek    ISO-8859-7
## 21 UDHR_Greek_UTF-8.txt               "\"ΟΙΚΟΥ… UDHR     Greek    UTF-8
## 22 UDHR_Hindi_UTF-8.txt               "\"मानव … UDHR     Hindi    UTF-8
## 23 UDHR_Icelandic_ISO-8859-1.txt      "\"Mannr… UDHR     Iceland… ISO-8859-1
## 24 UDHR_Icelandic_UTF-8.txt           "\"Mannr… UDHR     Iceland… UTF-8
## 25 UDHR_Icelandic_WINDOWS-1252.txt    "\"Mannr… UDHR     Iceland… WINDOWS-1252
## 26 UDHR_Japanese_CP932.txt            "\"『世…  UDHR     Japanese CP932
## 27 UDHR_Japanese_ISO-2022-JP.txt      "\"『世…  UDHR     Japanese ISO-2022-JP
## 28 UDHR_Japanese_UTF-8.txt            "\"『世…  UDHR     Japanese UTF-8
## 29 UDHR_Japanese_WINDOWS-936.txt      "\"『世…  UDHR     Japanese WINDOWS-936
## 30 UDHR_Korean_ISO-2022-KR.txt        "\"세 계… UDHR     Korean   ISO-2022-KR
## 31 UDHR_Korean_UTF-8.txt              "\"세 계… UDHR     Korean   UTF-8
## 32 UDHR_Russian_ISO-8859-5.txt        "\"Всеоб… UDHR     Russian  ISO-8859-5
## 33 UDHR_Russian_KOI8-R.txt            "\"Всеоб… UDHR     Russian  KOI8-R
## 34 UDHR_Russian_UTF-8.txt             "\"Всеоб… UDHR     Russian  UTF-8
## 35 UDHR_Russian_WINDOWS-1251.txt      "\"Всеоб… UDHR     Russian  WINDOWS-1251
## 36 UDHR_Thai_UTF-8.txt                "\"ปฏิญญา… UDHR     Thai     UTF-8
From this readtext object we can easily create a quanteda corpus object.
if (require("quanteda")) {
    corpus_txts <- corpus(txts)
    summary(corpus_txts, 5)
}
## Corpus consisting of 36 documents, showing 5 documents:
##
## Text Types Tokens Sentences document
## IndianTreaty_English_UTF-16LE.txt 618 2577 152 IndianTreaty
## IndianTreaty_English_UTF-8-BOM.txt 647 3085 150 IndianTreaty
## UDHR_Arabic_ISO-8859-6.txt 753 1555 86 UDHR
## UDHR_Arabic_UTF-8.txt 753 1555 86 UDHR
## UDHR_Arabic_WINDOWS-1256.txt 753 1555 86 UDHR
## language input_encoding
## English UTF-16LE
## English UTF-8-BOM
## Arabic ISO-8859-6
## Arabic UTF-8
## Arabic WINDOWS-1256