Detect the encoding of texts in a character vector, corpus, or readtext object and report the most likely encoding for each document. This is useful for checking the encoding of input texts, so that the source encoding can be (re)specified when reading a set of texts with readtext, before constructing a corpus.

encoding(x, verbose = TRUE, ...)

Arguments

x

character vector, corpus, or readtext object whose texts' encodings will be detected.

verbose

if FALSE, do not print diagnostic report

...

additional arguments passed to stri_enc_detect

Details

Based on stri_enc_detect, which is in turn based on the ICU libraries. See the ICU User Guide, http://userguide.icu-project.org/conversion/detection.
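
To make the dependency on stri_enc_detect concrete, the following is a minimal sketch of calling the stringi detector directly. It is not the internal code of encoding(); the sample string and the variable names (txt, bytes, guesses) are assumptions made for illustration. The sketch converts a UTF-8 string to windows-1252 bytes to simulate input of unknown encoding, then asks ICU for its ranked guesses.

```r
library(stringi)

# Illustrative sample text (an assumption, not from the package data);
# \u escapes keep the source file itself plain ASCII
txt <- "Les encodages 8 bits sont pass\u00e9s. Prix : 10 \u20ac."

# Simulate input of unknown encoding: the same text as raw windows-1252 bytes
bytes <- stri_encode(txt, from = "UTF-8", to = "windows-1252", to_raw = TRUE)[[1]]

# stri_enc_detect() returns a list with one data frame per input;
# each data frame ranks candidate encodings by confidence
guesses <- stri_enc_detect(bytes)[[1]]
head(guesses)
```

Each row of the result gives a candidate encoding, a language guess, and a confidence score. On short inputs such as this one the top guess may well be wrong, which is why encoding() reports the alternatives alongside the most probable encoding.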

Examples

encoding(data_char_encodedtexts)
#> Probable encoding: UTF-8
#> (but other encodings also detected)
#> Encoding proportions:
#> [****************--------........~~~~~~~aaaaaaabbbbbcccccddddd]
#>
#> Samples of the first text as:
#> [*] UTF-8 “8-bit” oœncodings are passé. €0. Hyphen-ate. Tilde ~ em das
#> [-] windows-1252 “8-bit” oœncodings are passé. €0. Hyphen-ate. Tilde
#> [.] ISO-8859-6 ق€œ8-bitق€ oإ“ncodings are passأ�. ق‚،0. Hyphen-ate. Tilde
#> [~] ISO-8859-2 “8-bit” oœncodings are passé. €0. Hyphen-ate. Tilde
#> [a] ISO-8859-1 “8-bit” oœncodings are passé. €0. Hyphen-ate. Tilde
#> [b] ISO-8859-5 т€œ8-bitт€ oХ“ncodings are passУЉ. т‚Ќ0. Hyphen-ate. Tilde
#> [c] windows-1251 “8-bit” oœncodings are passé. €0. Hyphen-ate. Tilde
#> [d] KOI8-R Б─°8-bitБ─² oе⌠ncodings are passц╘. Б┌╛0. Hyphen-ate. Tilde
# show detected value for each text, versus known encoding
data.frame(labelled = names(data_char_encodedtexts),
           detected = encoding(data_char_encodedtexts)$all)
#> Probable encoding: UTF-8
#> (but other encodings also detected)
#> Encoding proportions:
#> [****************--------........~~~~~~~aaaaaaabbbbbcccccddddd]
#>
#> Samples of the first text as:
#> [*] UTF-8 “8-bit” oœncodings are passé. €0. Hyphen-ate. Tilde ~ em das
#> [-] windows-1252 “8-bit” oœncodings are passé. €0. Hyphen-ate. Tilde
#> [.] ISO-8859-6 ق€œ8-bitق€ oإ“ncodings are passأ�. ق‚،0. Hyphen-ate. Tilde
#> [~] ISO-8859-2 “8-bit” oœncodings are passé. €0. Hyphen-ate. Tilde
#> [a] ISO-8859-1 “8-bit” oœncodings are passé. €0. Hyphen-ate. Tilde
#> [b] ISO-8859-5 т€œ8-bitт€ oХ“ncodings are passУЉ. т‚Ќ0. Hyphen-ate. Tilde
#> [c] windows-1251 “8-bit” oœncodings are passé. €0. Hyphen-ate. Tilde
#> [d] KOI8-R Б─°8-bitБ─² oе⌠ncodings are passц╘. Б┌╛0. Hyphen-ate. Tilde
#>        labelled     detected
#> 1         UTF-8        UTF-8
#> 2    ISO-8859-1   ISO-8859-1
#> 3  windows-1252 windows-1252
#> 4      macroman windows-1252
#> 5    ISO-8859-2   ISO-8859-2
#> 6    ISO-8859-6   ISO-8859-6
#> 7    ISO-8859-5   ISO-8859-5
#> 8  windows-1251 windows-1251
#> 9        KOI8-R       KOI8-R
#> 10        ASCII   ISO-8859-1
# NOT RUN {
# Russian text, Windows-1251
myreadtext <- readtext("http://www.kenbenoit.net/files/01_er_5.txt")
encoding(myreadtext)
# }
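
Once a likely encoding has been identified, the texts can be read again with the source encoding declared. The sketch below is a hedged illustration rather than part of the package examples: it writes a short Russian string to a temporary file as windows-1251 (the string, the temporary file, and the names txt, tmp, and rt are all assumptions) and then passes that encoding to readtext() through its encoding argument.

```r
library(readtext)
library(stringi)

# Hypothetical sample: a short Russian greeting, written with \u escapes
txt <- "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440"

# Write the text to a temporary file as windows-1251 bytes
tmp <- tempfile(fileext = ".txt")
writeBin(stri_encode(txt, from = "UTF-8", to = "windows-1251", to_raw = TRUE)[[1]], tmp)

# Re-read the file, declaring the source encoding so the text is converted on input
rt <- readtext(tmp, encoding = "windows-1251")
rt$text
```

Declaring the encoding at read time means the conversion happens once, at input, so later steps such as corpus construction work from consistently encoded text.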