detect the encoding of texts — encoding • readtext

Detect the encoding of texts in a character vector, corpus, or readtext object and report the most likely encoding for each document. This is useful for identifying the encoding of input texts, so that a source encoding can be (re)specified when reading a set of texts with readtext(), prior to constructing a corpus.

encoding(x, verbose = TRUE, ...)

Arguments

x: character vector, corpus, or readtext object whose texts' encodings will be detected.
verbose: if FALSE, do not print the diagnostic report
...: additional arguments passed to stri_enc_detect

Details

Based on stri_enc_detect, which is in turn based on the ICU libraries. See the ICU User Guide, https://unicode-org.github.io/icu/userguide/.
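
To illustrate the underlying detector, here is a minimal sketch that calls stringi::stri_enc_detect() directly. The sample string and the variable names (txt, guesses, top) are assumptions made for illustration; this is not how encoding() itself is implemented, only a look at the kind of result it builds on.

```r
library(stringi)

# A short UTF-8 sample with a few non-ASCII characters (hypothetical input)
txt <- "\u201c8-bit\u201d encodings are pass\u00e9."

# stri_enc_detect() returns one element per input string; each element lists
# candidate encodings with a language guess and a confidence score,
# ordered by decreasing confidence
guesses <- stri_enc_detect(txt)
str(guesses[[1]])

# Keep only the top-ranked candidate for each text
top <- vapply(guesses, function(g) g$Encoding[1], character(1))
top
```

Detection is statistical, so short or mostly-ASCII texts can yield several plausible candidates; the confidence column is worth checking before relying on the top guess.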

Examples

if (FALSE) encoding(data_char_encodedtexts)
# show detected value for each text, versus known encoding
data.frame(labelled = names(data_char_encodedtexts), 
           detected = encoding(data_char_encodedtexts)$all)
#> Probable encoding: UTF-8
#>    (but other encodings also detected)
#>   Encoding proportions: 
#> [****************--------........~~~~~~~aaaaaaabbbbbcccccddddd]
#> 
#>   Samples of the first text as:
#>   [*] UTF-8          “8-bit” oœncodings are passé. €0. Hyphen-ate. Tilde ~ em das
#>   [-] windows-1252   â€œ8-bitâ€ oÅ“ncodings are passÃ©. â‚¬0. Hyphen-ate. Tilde 
#>   [.] ISO-8859-6     ق8-bitق oإncodings are passأ�. ق،0. Hyphen-ate. Tilde 
#>   [~] ISO-8859-2     â8-bitâ oĹncodings are passĂŠ. âŹ0. Hyphen-ate. Tilde 
#>   [a] ISO-8859-1     â8-bitâ oÅncodings are passÃ©. â¬0. Hyphen-ate. Tilde 
#>   [b] ISO-8859-5     т8-bitт oХncodings are passУЉ. тЌ0. Hyphen-ate. Tilde 
#>   [c] windows-1251   вЂњ8-bitвЂќ oЕ“ncodings are passГ©. в‚¬0. Hyphen-ate. Tilde 
#>   [d] KOI8-R         Б─°8-bitБ─² oе⌠ncodings are passц╘. Б┌╛0. Hyphen-ate. Tilde 
#>        labelled     detected
#> 1         UTF-8        UTF-8
#> 2    ISO-8859-1   ISO-8859-1
#> 3  windows-1252 windows-1252
#> 4      macroman windows-1252
#> 5    ISO-8859-2   ISO-8859-2
#> 6    ISO-8859-6   ISO-8859-6
#> 7    ISO-8859-5   ISO-8859-5
#> 8  windows-1251 windows-1251
#> 9        KOI8-R       KOI8-R
#> 10        ASCII   ISO-8859-1

# Russian text, Windows-1251
myreadtext <- readtext("https://kenbenoit.net/files/01_er_5.txt")
encoding(myreadtext)
#> readtext object consisting of 1 document and 0 docvars.
#> # A data frame: 1 × 2
#>   doc_id      text                                             
#>   <chr>       <chr>                                            
#> 1 01_er_5.txt "\"\xd1\xf2\xe5\xed\xee\xe3\xf0\xe0\xec\xec\"..."
#> Probable encoding: windows-1251
#>
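
Once a probable encoding has been identified, the text can be converted to UTF-8. The following is a rough sketch using base R's iconv(); it rebuilds the first ten bytes shown in the output above by hand (the names raw_bytes, x, and fixed are hypothetical) rather than downloading the file, and it assumes the platform's iconv recognises the name "windows-1251".

```r
# The first ten bytes of the text above, which are not valid UTF-8
raw_bytes <- as.raw(c(0xd1, 0xf2, 0xe5, 0xed, 0xee,
                      0xe3, 0xf0, 0xe0, 0xec, 0xec))
x <- rawToChar(raw_bytes)

# Before conversion the string does not validate as UTF-8
validUTF8(x)

# Convert from the detected source encoding to UTF-8
fixed <- iconv(x, from = "windows-1251", to = "UTF-8")

# fixed now holds the Cyrillic string "Стенограмм" encoded as UTF-8
validUTF8(fixed)
nchar(fixed)
```

When reading from files, the same effect can usually be achieved at input time by passing the detected value to the encoding argument of readtext(), for example readtext("https://kenbenoit.net/files/01_er_5.txt", encoding = "windows-1251").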