Detect the encoding of texts in a character vector, corpus, or readtext object
and report the most likely encoding for each document. This is useful for
identifying the encoding of input texts, so that a source encoding can be
(re)specified when reading a set of texts with readtext(), prior to
constructing a corpus.
encoding(x, verbose = TRUE, ...)
x: character vector, corpus, or readtext object whose texts' encodings will be detected.
verbose: if FALSE, do not print the diagnostic report.
...: additional arguments passed to stri_enc_detect.
Based on stri_enc_detect from the stringi package, which is in turn based on the ICU libraries. See the ICU User Guide, https://unicode-org.github.io/icu/userguide/.
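Because the detection is delegated to stri_enc_detect, it can help to look at what that function returns on its own. The following is a minimal sketch, not the internals of encoding(): it assumes the stringi package is installed, and the variable names are illustrative. The ten bytes are the start of the Windows-1251 text shown in the examples on this page; a sample this short may well be misdetected, so treat the ranking as a guess.

```r
# Sketch only: call stringi's detector directly on raw bytes.
# These bytes are the opening of the Windows-1251 example text on this page.
sample_bytes <- as.raw(c(0xd1, 0xf2, 0xe5, 0xed, 0xee,
                         0xe3, 0xf0, 0xe0, 0xec, 0xec))

if (requireNamespace("stringi", quietly = TRUE)) {
  # stri_enc_detect() returns a list with one data frame per input,
  # with columns Encoding, Language, and Confidence.
  guesses <- stringi::stri_enc_detect(sample_bytes)[[1]]
  print(head(guesses, 3))
}
```

Longer inputs generally give the detector more evidence to work with, which is one reason to run the check on whole documents rather than short excerpts.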
if (FALSE) encoding(data_char_encodedtexts)
# show detected value for each text, versus known encoding
data.frame(labelled = names(data_char_encodedtexts),
           detected = encoding(data_char_encodedtexts)$all)
#> Probable encoding: UTF-8
#> (but other encodings also detected)
#> Encoding proportions:
#> [****************--------........~~~~~~~aaaaaaabbbbbcccccddddd]
#>
#> Samples of the first text as:
#> [*] UTF-8 “8-bit” oœncodings are passé. €0. Hyphen-ate. Tilde ~ em das
#> [-] windows-1252 “8-bit†oœncodings are passé. €0. Hyphen-ate. Tilde
#> [.] ISO-8859-6 ق8-bitق oإncodings are passأ�. ق،0. Hyphen-ate. Tilde
#> [~] ISO-8859-2 â8-bitâ oĹncodings are passĂŠ. âŹ0. Hyphen-ate. Tilde
#> [a] ISO-8859-1 â8-bitâ oÅncodings are passé. â¬0. Hyphen-ate. Tilde
#> [b] ISO-8859-5 т8-bitт oХncodings are passУЉ. тЌ0. Hyphen-ate. Tilde
#> [c] windows-1251 “8-bit” oœncodings are passé. €0. Hyphen-ate. Tilde
#> [d] KOI8-R Б─°8-bitБ─² oе⌠ncodings are passц╘. Б┌╛0. Hyphen-ate. Tilde
#>        labelled     detected
#> 1         UTF-8        UTF-8
#> 2    ISO-8859-1   ISO-8859-1
#> 3  windows-1252 windows-1252
#> 4      macroman windows-1252
#> 5    ISO-8859-2   ISO-8859-2
#> 6    ISO-8859-6   ISO-8859-6
#> 7    ISO-8859-5   ISO-8859-5
#> 8  windows-1251 windows-1251
#> 9        KOI8-R       KOI8-R
#> 10        ASCII   ISO-8859-1
# Russian text, Windows-1251
myreadtext <- readtext("https://kenbenoit.net/files/01_er_5.txt")
encoding(myreadtext)
#> readtext object consisting of 1 document and 0 docvars.
#> # A data frame: 1 × 2
#>   doc_id      text
#>   <chr>       <chr>
#> 1 01_er_5.txt "\"\xd1\xf2\xe5\xed\xee\xe3\xf0\xe0\xec\xec\"..."
#> Probable encoding: windows-1251
#>
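Once a probable encoding has been reported, the next step is to convert or re-read the text with that encoding stated explicitly. The sketch below uses base R's iconv() on the first ten bytes shown in the output above. It assumes the iconv name "CP1251" is available on your platform as the name for windows-1251, and the variable names are illustrative.

```r
# The first ten bytes of the text shown in the output above (Windows-1251).
bytes_1251 <- as.raw(c(0xd1, 0xf2, 0xe5, 0xed, 0xee,
                       0xe3, 0xf0, 0xe0, 0xec, 0xec))
text_1251 <- rawToChar(bytes_1251)

# Convert to UTF-8 using the detected encoding.
# "CP1251" is the iconv name assumed here for windows-1251.
text_utf8 <- iconv(text_1251, from = "CP1251", to = "UTF-8")
print(text_utf8)

# With readtext, the equivalent step would be to re-read the file with the
# source encoding given explicitly (not run here; needs network access):
# myreadtext <- readtext("https://kenbenoit.net/files/01_er_5.txt",
#                        encoding = "windows-1251")
```

If the conversion returns NA or the text still looks garbled, the detected encoding was probably wrong, and the next candidate in the report is worth trying.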