Read texts and (if any) associated document-level metadata from one or more source files. The texts come from the textual component of the files, and the document-level metadata ("docvars") come from either the file contents or the filenames.

readtext(file, ignore_missing_files = FALSE, text_field = NULL,
  docvarsfrom = c("metadata", "filenames", "filepaths"), dvsep = "_",
  docvarnames = NULL, encoding = NULL, source = NULL, cache = TRUE,
  verbosity = readtext_options("verbosity"), ...)

Arguments

file

the complete filename(s) to be read. This is designed to automagically handle a number of common scenarios, so the value can be a "glob"-type wildcard value. Currently available filetypes are:

Single file formats:

txt

plain text files. So-called structured text files describe both texts and metadata: for all structured text filetypes, the column, field, or node which contains the text must be specified with the text_field parameter, and all other fields are treated as docvars.

json

data in some form of JavaScript Object Notation, consisting of the texts and optionally additional docvars. The supported formats are:

  • a single JSON object per file

  • line-delimited JSON, with one object per line

  • line-delimited JSON, of the format produced from a Twitter stream. This type of file has special handling which simplifies the Twitter format into docvars. The correct format for each JSON file is automatically detected.

csv, tab, tsv

comma- or tab-separated values

html

HTML documents, including specialized formats from known sources, such as Nexis-formatted HTML. See the source parameter below.

xml

Basic flat XML documents are supported -- those of the kind supported by xmlToDataFrame. For xml files, an additional argument collapse may be passed through ... that names the character(s) to use in appending different text elements together.

pdf

pdf formatted files, converted through pdftools.

doc, docx

Microsoft Word formatted files.

Reading multiple files and file types: In addition, file need not be a path to a single local file; it can also take any of the following forms, each of which may combine any of the file types above:
a wildcard value

any valid pathname with a wildcard ("glob") expression that can be expanded by the operating system. This may consist of multiple file types.

a URL to a remote file

which is downloaded and then loaded

zip, tar, tar.gz, tar.bz

archive file, which is unzipped. The contained files must be either at the top level or in a single directory. Archives, remote URLs and glob patterns can resolve to any of the other filetypes, so you could have, for example, a remote URL to a zip file which contained Twitter JSON files.
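For illustration, a minimal sketch of some forms that file can take; the local path and URL below are hypothetical placeholders, not files shipped with the package:

# a glob pattern spanning several files (and possibly several file types)
rt <- readtext("~/data/reports/*.txt")
# a remote zip archive of Twitter-format JSON files, downloaded and unzipped automatically
rt <- readtext("https://example.com/tweets.zip", source = "twitter")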

ignore_missing_files

if FALSE, an error is thrown when the file argument does not resolve to an existing file. This can happen in a number of ways, including passing a path to a file that does not exist, to an empty archive file, or to a glob pattern that matches no files.

text_field

a variable (column) name or column number indicating where to find the texts that form the documents for the corpus. This must be specified for .csv, .json, and .xls/.xlsx files. For XML files, an XPath expression can be specified.
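For instance, a sketch assuming a hypothetical comments.csv whose texts sit in a column named "body":

# "body" becomes the text column; all other columns become docvars
rt <- readtext("~/data/comments.csv", text_field = "body")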

docvarsfrom

used to specify that docvars should be taken from the filenames, when the readtext inputs are filenames and the elements of the filenames are document variables, separated by a delimiter (dvsep). This allows easy assignment of docvars from filenames such as 1789-Washington.txt, 1793-Washington.txt, etc. by dvsep, or from metadata embedded in the text file header (headers). If docvarsfrom is set to "filepaths", the full path to the file is considered, not just the filename. See the sketch following docvarnames below.

dvsep

separator (a regular expression character string) used in filenames to delimit docvar elements if docvarsfrom="filenames" or docvarsfrom="filepaths" is used

docvarnames

character vector of variable names for docvars, if docvarsfrom is specified. If this argument is not used, default docvar names will be used (docvar1, docvar2, ...).
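As a sketch, filenames following the pattern mentioned above (e.g. 1789-Washington.txt), here in a hypothetical directory, could be split into named docvars like this:

# each filename is split on "-" into the docvars "year" and "president"
rt <- readtext("~/inaugural/*.txt",
               docvarsfrom = "filenames", dvsep = "-",
               docvarnames = c("year", "president"))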

encoding

character vector: either a single encoding applied to all files, or one encoding for each file
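A sketch of both usages, with hypothetical file paths:

# one encoding for every file matched by the pattern
rt <- readtext("~/texts/*.txt", encoding = "ISO-8859-1")
# one encoding per file
rt <- readtext(c("~/texts/a.txt", "~/texts/b.txt"),
               encoding = c("UTF-8", "windows-1252"))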

source

used to specify specific formats of some input file types, such as JSON or HTML. Currently supported types are "twitter" for JSON and "nexis" for HTML.

cache

if TRUE, save remote file to a temporary folder. Only used when file is a URL.

verbosity

level of messages to report while reading:

  • 0: output errors only

  • 1: output errors and warnings (default)

  • 2: output a brief summary message

  • 3: output detailed file-related messages

...

additional arguments passed through to the low-level file reading functions, such as file, fread, etc. Useful for specifying an input encoding option, which is specified in the same way as it would be given to iconv. See the Encoding section of file for details.
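For example, a sketch of passing the collapse argument mentioned under xml above, for a hypothetical flat XML file whose text lives in a node named chapter:

# collapse is forwarded through ... to the XML reader
rt <- readtext("~/data/books.xml", text_field = "chapter", collapse = "\n")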

Value

a data.frame consisting of the columns doc_id and text, which contain a document identifier and the texts respectively, plus any additional columns of document-level variables either found in the file containing the texts or created through the readtext call.
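A sketch of inspecting the returned object, assuming rt holds the result of a readtext() call:

nrow(rt)                   # number of documents
names(rt)                  # "doc_id", "text", plus any docvar columns
substr(rt$text[1], 1, 60)  # peek at the start of the first text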

Examples

## get the data directory
DATA_DIR <- system.file("extdata/", package = "readtext")

## read in some text data
# all UDHR files
(rt1 <- readtext(paste0(DATA_DIR, "/txt/UDHR/*")))
#> readtext object consisting of 13 documents and 0 docvars.
#> # data.frame [13 × 2]
#>   doc_id             text
#>   <chr>              <chr>
#> 1 UDHR_chinese.txt   "\"世界人权宣言\n联合国\"..."
#> 2 UDHR_czech.txt     "\"VŠEOBECNÁ \"..."
#> 3 UDHR_danish.txt    "\"Den 10. de\"..."
#> 4 UDHR_english.txt   "\"Universal \"..."
#> 5 UDHR_french.txt    "\"Déclaratio\"..."
#> 6 UDHR_georgian.txt  "\"FLFVBFYBC \"..."
#> # ... with 7 more rows
# manifestos with docvars from filenames
(rt2 <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
                 docvarsfrom = "filenames",
                 docvarnames = c("unit", "context", "year", "language", "party"),
                 encoding = "LATIN1"))
#> readtext object consisting of 17 documents and 5 docvars.
#> # data.frame [17 × 7]
#>   doc_id                 text                unit  context  year language party
#>   <chr>                  <chr>               <chr> <chr>   <int> <chr>    <chr>
#> 1 EU_euro_2004_de_PSE.t… "\"PES · PSE \"..." EU    euro     2004 de       PSE
#> 2 EU_euro_2004_de_V.txt  "\"Gemeinsame\"..." EU    euro     2004 de       V
#> 3 EU_euro_2004_en_PSE.t… "\"PES · PSE \"..." EU    euro     2004 en       PSE
#> 4 EU_euro_2004_en_V.txt  "\"Manifesto\n\"..… EU    euro     2004 en       V
#> 5 EU_euro_2004_es_PSE.t… "\"PES · PSE \"..." EU    euro     2004 es       PSE
#> 6 EU_euro_2004_es_V.txt  "\"Manifesto\n\"..… EU    euro     2004 es       V
#> # ... with 11 more rows
# recurse through subdirectories
(rt3 <- readtext(paste0(DATA_DIR, "/txt/movie_reviews/*"),
                 docvarsfrom = "filepaths", docvarnames = "sentiment"))
#> Warning: Fewer docnames supplied than existing docvars - last 3 docvars given generic names.
#> readtext object consisting of 10 documents and 4 docvars.
#> # data.frame [10 × 6]
#>   doc_id      text        sentiment                  docvar2    docvar3 docvar4
#>   <chr>       <chr>       <chr>                      <chr>      <chr>   <chr>
#> 1 neg_cv000_… "\"plot : … /Users/stefan/GitHub/read… reviews/n… cv000   29416.…
#> 2 neg_cv001_… "\"the hap… /Users/stefan/GitHub/read… reviews/n… cv001   19502.…
#> 3 neg_cv002_… "\"it is m… /Users/stefan/GitHub/read… reviews/n… cv002   17424.…
#> 4 neg_cv003_… "\" \" que… /Users/stefan/GitHub/read… reviews/n… cv003   12683.…
#> 5 neg_cv004_… "\"synopsi… /Users/stefan/GitHub/read… reviews/n… cv004   12641.…
#> 6 pos_cv000_… "\"films a… /Users/stefan/GitHub/read… reviews/p… cv000   29590.…
#> # ... with 4 more rows
## read in csv data
(rt4 <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv")))
#> readtext object consisting of 5 documents and 3 docvars.
#> # data.frame [5 × 5]
#>   doc_id            text                 Year President  FirstName
#>   <chr>             <chr>               <int> <chr>      <chr>
#> 1 inaugCorpus.csv.1 "\"Fellow-Cit\"..."  1789 Washington George
#> 2 inaugCorpus.csv.2 "\"Fellow cit\"..."  1793 Washington George
#> 3 inaugCorpus.csv.3 "\"When it wa\"..."  1797 Adams      John
#> 4 inaugCorpus.csv.4 "\"Friends an\"..."  1801 Jefferson  Thomas
#> 5 inaugCorpus.csv.5 "\"Proceeding\"..."  1805 Jefferson  Thomas
## read in tab-separated data
(rt5 <- readtext(paste0(DATA_DIR, "/tsv/dailsample.tsv"), text_field = "speech"))
#> readtext object consisting of 33 documents and 9 docvars.
#> # data.frame [33 × 11]
#>   doc_id text  speechID memberID partyID constID title date  member_name
#>   <chr>  <chr>    <int>    <int>   <int>   <int> <chr> <chr> <chr>
#> 1 dails… "\"M…        1      977      22     158 1. C… 1919… Count Geor…
#> 2 dails… "\"I…        2     1603      22     103 1. C… 1919… Mr. Pádrai…
#> 3 dails… "\"'…        3      116      22     178 1. C… 1919… Mr. Cathal…
#> 4 dails… "\"T…        4      116      22     178 2. C… 1919… Mr. Cathal…
#> 5 dails… "\"L…        5      116      22     178 3. A… 1919… Mr. Cathal…
#> 6 dails… "\"-…        6      116      22     178 3. A… 1919… Mr. Cathal…
#> # ... with 27 more rows, and 2 more variables: party_name <chr>,
#> #   const_name <chr>
## read in JSON data
(rt6 <- readtext(paste0(DATA_DIR, "/json/inaugural_sample.json"), text_field = "texts"))
#> readtext object consisting of 3 documents and 3 docvars.
#> # data.frame [3 × 5]
#>   doc_id                  text                 Year President  FirstName
#>   <chr>                   <chr>               <int> <chr>      <chr>
#> 1 inaugural_sample.json.1 "\"Fellow-Cit\"..."  1789 Washington George
#> 2 inaugural_sample.json.2 "\"Fellow cit\"..."  1793 Washington George
#> 3 inaugural_sample.json.3 "\"When it wa\"..."  1797 Adams      John
## read in pdf data
# UDHR
(rt7 <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
                 docvarsfrom = "filenames",
                 docvarnames = c("document", "language")))
#> readtext object consisting of 11 documents and 2 docvars.
#> # data.frame [11 × 4]
#>   doc_id           text                          document language
#>   <chr>            <chr>                         <chr>    <chr>
#> 1 UDHR_chinese.pdf "\"世界人权宣言\n联合国\"..." UDHR     chinese
#> 2 UDHR_czech.pdf   "\"VŠEOBECNÁ \"..."           UDHR     czech
#> 3 UDHR_danish.pdf  "\"Den 10. de\"..."           UDHR     danish
#> 4 UDHR_english.pdf "\"Universal \"..."           UDHR     english
#> 5 UDHR_french.pdf  "\"Déclaratio\"..."           UDHR     french
#> 6 UDHR_greek.pdf   "\"ΟΙΚΟΥΜΕΝΙΚ\"..."           UDHR     greek
#> # ... with 5 more rows
Encoding(rt7$text)
#> [1] "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "UTF-8"
#> [8] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"
## read in Word data (.doc)
(rt8 <- readtext(paste0(DATA_DIR, "/word/*.doc")))
#> readtext object consisting of 4 documents and 0 docvars.
#> # data.frame [4 × 2]
#>   doc_id                                 text
#>   <chr>                                  <chr>
#> 1 21Parti_Socialiste_SUMMARY_2004.doc    "\"[pic]\nRésu\"..."
#> 2 21vivant2004.doc                       "\"http://www\"..."
#> 3 21VLD2004.doc                          "\"http://www\"..."
#> 4 32_socialisti_democratici_italiani.doc "\"DIVENTARE \"..."
Encoding(rt8$text)
#> [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8"
## read in Word data (.docx)
(rt9 <- readtext(paste0(DATA_DIR, "/word/*.docx")))
#> readtext object consisting of 2 documents and 0 docvars.
#> # data.frame [2 × 2]
#>   doc_id                      text
#>   <chr>                       <chr>
#> 1 UK_2015_EccentricParty.docx "\"The Eccent\"..."
#> 2 UK_2015_LoonyParty.docx     "\"The Offici\"..."
Encoding(rt9$text)
#> [1] "UTF-8" "UTF-8"
## use elements of path and filename as docvars
(rt10 <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
                  docvarsfrom = "filepaths", dvsep = "[/_.]"))
#> readtext object consisting of 11 documents and 12 docvars.
#> # data.frame [11 × 14]
#>   doc_id text  docvar1 docvar2 docvar3 docvar4 docvar5 docvar6 docvar7 docvar8
#>   <chr>  <chr> <lgl>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>
#> 1 UDHR_… "\"世… NA      Users   stefan  GitHub  readte… inst    extdata pdf
#> 2 UDHR_… "\"V… NA      Users   stefan  GitHub  readte… inst    extdata pdf
#> 3 UDHR_… "\"D… NA      Users   stefan  GitHub  readte… inst    extdata pdf
#> 4 UDHR_… "\"U… NA      Users   stefan  GitHub  readte… inst    extdata pdf
#> 5 UDHR_… "\"D… NA      Users   stefan  GitHub  readte… inst    extdata pdf
#> 6 UDHR_… "\"Ο… NA      Users   stefan  GitHub  readte… inst    extdata pdf
#> # ... with 5 more rows, and 4 more variables: docvar9 <chr>, docvar10 <chr>,
#> #   docvar11 <chr>, docvar12 <chr>