Read texts and, if present, the associated document-level metadata from one or more source files. The texts come from the textual component of the files, and the document-level metadata ("docvars") come from either the file contents or the filenames.
readtext(
file,
ignore_missing_files = FALSE,
text_field = NULL,
docid_field = NULL,
docvarsfrom = c("metadata", "filenames", "filepaths"),
dvsep = "_",
docvarnames = NULL,
encoding = NULL,
source = NULL,
cache = TRUE,
verbosity = readtext_options("verbosity"),
...
)
file
the complete filename(s) to be read. This is designed to handle a number of common scenarios automatically, so the value can be a "glob"-type wildcard value. Currently available filetypes are:
Single file formats:
txt
plain text files
So-called structured text files, which describe both texts and metadata:
For all structured text filetypes, the column, field, or node that contains the text must be specified with the text_field parameter, and all other fields are treated as docvars.
json
data in some form of JavaScript Object Notation, consisting of the texts and optionally additional docvars. The supported formats are:
a single JSON object per file
line-delimited JSON, with one object per line
line-delimited JSON, of the format produced from a Twitter stream. This type of file has special handling which simplifies the Twitter format into docvars. The correct format for each JSON file is automatically detected.
csv,tab,tsv
comma- or tab-separated values
html
HTML documents, including specialized formats from known sources, such as Nexis-formatted HTML. See the source parameter below.
xml
XML documents of the kind that can be read by xml2::read_xml() and navigated through xml2::xml_find_all(). For xml files, an additional argument collapse may be passed through ... that names the character(s) to use in appending different text elements together.
pdf
PDF files, converted through pdftools.
odt
Open Document Text formatted files.
doc, docx
Microsoft Word formatted files.
rtf
Rich Text Files.
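Reading multiple files and file types:
wildcard ("glob") pattern
any valid pathname with a wildcard ("glob") expression that can be expanded by the operating system. This may consist of multiple file types.
remote URL
a URL to a remote file, which is downloaded then loaded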
zip,tar,tar.gz,tar.bz
archive file, which is unzipped. The contained files must be either at the top level or in a single directory. Archives, remote URLs and glob patterns can resolve to any of the other filetypes, so you could have, for example, a remote URL to a zip file which contained Twitter JSON files.
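As a rough illustration of what wildcard expansion means here, the sketch below uses base R's Sys.glob() on a throwaway directory. It is not a description of how readtext resolves patterns internally; the directory and file names are invented for the example.

```r
# Create a throwaway directory with three small files (all names are made up)
tmp <- file.path(tempdir(), "readtext_glob_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines("first text", file.path(tmp, "a.txt"))
writeLines("second text", file.path(tmp, "b.txt"))
writeLines("id,text", file.path(tmp, "c.csv"))

# Expand a wildcard pattern into concrete file paths
matched <- Sys.glob(file.path(tmp, "*.txt"))
print(basename(matched))

# A pattern that matches no files expands to an empty character vector
no_match <- Sys.glob(file.path(tmp, "*.pdf"))
print(length(no_match))
```

A pattern that matches nothing is one of the cases in which the file argument does not resolve to an existing file.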
ignore_missing_files
if FALSE, an error is thrown when the file argument does not resolve to an existing file. Note that this can happen in a number of ways, including passing a path to a file that does not exist, to an empty archive file, or to a glob pattern that matches no files.
text_field, docid_field
a variable (column) name or column number indicating where to find the texts that form the documents for the corpus and their identifiers. This must be specified for .csv, .json, and .xls/.xlsx files. For XML files, an XPath expression can be specified.
docvarsfrom
used to specify that docvars should be taken from the filenames, when the readtext inputs are filenames and the elements of the filenames are document variables separated by a delimiter (dvsep). This allows easy assignment of docvars from filenames such as 1789-Washington.txt, 1793-Washington, etc. by dvsep, or from metadata embedded in the text file header (headers). If docvarsfrom is set to "filepaths", the full path to the file is considered, not just the filename.
dvsep
separator (a regular expression character string) used in filenames to delimit docvar elements if docvarsfrom="filenames" or docvarsfrom="filepaths" is used
docvarnames
character vector of variable names for docvars, if docvarsfrom is specified. If this argument is not used, default docvar names will be used (docvar1, docvar2, ...).
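To make the interplay of docvarsfrom, dvsep, and docvarnames more concrete, here is a minimal base R sketch of splitting filenames into pieces and giving them default names. It only mimics the idea described above and is not readtext's own implementation; the filenames are hypothetical, and the separator "-" is chosen to match them rather than the default "_".

```r
# Hypothetical filenames, modelled on the pattern mentioned for docvarsfrom
files <- c("1789-Washington.txt", "1793-Washington.txt")
dvsep <- "-"  # treated as a regular expression by strsplit()

# Drop the directory and extension, then split each stem on the separator
stems <- tools::file_path_sans_ext(basename(files))
parts <- strsplit(stems, dvsep)

# One row per file, one column per filename element
docvars <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)

# Default-style names; a docvarnames vector would be used here instead
names(docvars) <- paste0("docvar", seq_len(ncol(docvars)))
print(docvars)
```

In this sketch every element stays a character string; no claim is made about any type conversion readtext may apply.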
encoding
a vector: either a single encoding applied to all files, or one encoding for each file
source
used to specify particular formats of some input file types, such as JSON or HTML. Currently supported types are "twitter" for JSON and "nexis" for HTML.
cache
if TRUE, save the remote file to a temporary folder. Only used when file is a URL.
verbosity
0: output errors only
1: output errors and warnings (default)
2: output a brief summary message
3: output detailed file-related messages
...
additional arguments passed through to the low-level file-reading functions, such as file(), fread(), etc. Useful for specifying an input encoding option, which is specified in the same way as it would be given to iconv(). See the Encoding section of file for details.
a data.frame consisting of the columns doc_id and text, which contain a document identifier and the texts respectively, with any additional columns consisting of document-level variables either found in the file containing the texts or created through the readtext call.
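The following sketch shows the text_field argument and the returned columns on a small structured file. The CSV, its column names (speaker, speech), and its contents are invented for illustration, and the readtext call is guarded so the snippet still runs where the package is not installed.

```r
# Write a tiny CSV with one text column and one metadata column (made-up data)
csv_path <- file.path(tempdir(), "speeches.csv")
speeches <- data.frame(
  speaker = c("A", "B"),
  speech = c("First short speech.", "Second short speech."),
  stringsAsFactors = FALSE
)
write.csv(speeches, csv_path, row.names = FALSE)

# Read it back, naming the column that holds the texts
if (requireNamespace("readtext", quietly = TRUE)) {
  rt <- readtext::readtext(csv_path, text_field = "speech")
  # Expect doc_id and text, plus the remaining column as a docvar
  print(names(rt))
}
```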
if (FALSE) {
## get the data directory
if (!interactive()) pkgload::load_all()
DATA_DIR <- system.file("extdata/", package = "readtext")
## read in some text data
# all UDHR files
(rt1 <- readtext(paste0(DATA_DIR, "/txt/UDHR/*")))
# manifestos with docvars from filenames
(rt2 <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
                 docvarsfrom = "filenames",
                 docvarnames = c("unit", "context", "year", "language", "party"),
                 encoding = "LATIN1"))
# recurse through subdirectories
(rt3 <- readtext(paste0(DATA_DIR, "/txt/movie_reviews/*"),
                 docvarsfrom = "filepaths", docvarnames = "sentiment"))
## read in csv data
(rt4 <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv")))
## read in tab-separated data
(rt5 <- readtext(paste0(DATA_DIR, "/tsv/dailsample.tsv"), text_field = "speech"))
## read in JSON data
(rt6 <- readtext(paste0(DATA_DIR, "/json/inaugural_sample.json"), text_field = "texts"))
## read in pdf data
# UDHR
(rt7 <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
                 docvarsfrom = "filenames",
                 docvarnames = c("document", "language")))
Encoding(rt7$text)
## read in Word data (.doc)
(rt8 <- readtext(paste0(DATA_DIR, "/word/*.doc")))
Encoding(rt8$text)
## read in Word data (.docx)
(rt9 <- readtext(paste0(DATA_DIR, "/word/*.docx")))
Encoding(rt9$text)
## use elements of path and filename as docvars
(rt10 <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
                  docvarsfrom = "filepaths", dvsep = "[/_.]"))
}