Hunspell is the spell checker library used by LibreOffice, OpenOffice, Mozilla Firefox, Google Chrome, Mac OS-X, InDesign, Opera, RStudio and many others. It provides a system for tokenizing, stemming and spell checking in almost any language or alphabet. The R package exposes both the high-level spell checker and the low-level stemmers and tokenizers, which analyze or extract individual words from various formats (text, html, xml, latex).

Hunspell uses a special dictionary format that defines which characters, words and conjugations are valid in a given language. The examples below use the (default) "en_US" dictionary. However, each function can be used with another language by passing a custom dictionary in the dict parameter. See the section on dictionaries below.

Spell Checking

Spell checking text consists of the following steps:

  1. Parse a document by extracting (tokenizing) words that we want to check
  2. Analyze each word by breaking it down into its root (stemming) and conjugation affix
  3. Look up in a dictionary whether the word+affix combination is valid for your language
  4. (optional) For incorrect words, suggest corrections by finding similar (correct) words in the dictionary

We can do each of these steps manually or have Hunspell do them automatically.
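
As a rough sketch of the manual route, the steps above map onto functions from this package: hunspell_parse for tokenizing, hunspell_check for the analysis and lookup, and hunspell_suggest for corrections. The sample sentence is the one used in the document example further down; the variable names are only illustrative.

```r
library(hunspell)

text <- "spell checkers are not neccessairy for langauge ninjas"

# Step 1: tokenize; hunspell_parse returns one character vector per input element
tokens <- hunspell_parse(text)[[1]]

# Steps 2 and 3: stemming and dictionary lookup both happen inside hunspell_check
ok <- hunspell_check(tokens)

# Step 4 (optional): suggest corrections for the words that failed the lookup
misspelled <- tokens[!ok]
suggestions <- hunspell_suggest(misspelled)
names(suggestions) <- misspelled
print(suggestions)
```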

Check Individual Words

The hunspell_check and hunspell_suggest functions test individual words for correctness and suggest correct words that look similar to a given incorrect word.

library(hunspell)

# Check individual words
words <- c("beer", "wiskey", "wine")
correct <- hunspell_check(words)
print(correct)
[1]  TRUE FALSE  TRUE
# Find suggestions for incorrect words
hunspell_suggest(words[!correct])
[[1]]
[1] "whiskey"  "whiskery"

Check Documents

In practice we often want to spell check an entire document at once by searching for incorrect words. This is done using the hunspell function:

bad <- hunspell("spell checkers are not neccessairy for langauge ninjas")
print(bad[[1]])
[1] "neccessairy" "langauge"   
hunspell_suggest(bad[[1]])
[[1]]
[1] "necessary"   "necessarily"

[[2]]
[1] "language" "Augean"   "Angela"  

Besides plain text, hunspell supports various document formats, such as html or latex:

download.file("https://arxiv.org/e-print/1406.4806v1", "1406.4806v1.tar.gz",  mode = "wb")
untar("1406.4806v1.tar.gz", "content.tex")
text <- readLines("content.tex", warn = FALSE)
bad_words <- hunspell(text, format = "latex")
sort(unique(unlist(bad_words)))
 [1] "CORBA"             "CTRL"              "DCOM"             
 [4] "DOM"               "DSL"               "ESC"              
 [7] "JRI"               "NaN"               "OAuth"            
[10] "OpenCPU"           "RInside"           "RPC"              
[13] "RProtoBuf"         "RStudio"           "Reproducibility"  
[16] "RinRuby"           "Rserve"            "SIGINT"           
[19] "STATA"             "STDOUT"            "Stateful"         
[22] "auth"              "cpu"               "cran"             
[25] "cron"              "css"               "csv"              
[28] "de"                "dec"               "decompositions"   
[31] "dir"               "eol"               "facto"            
[34] "grDevices"         "httpuv"            "ignorable"        
[37] "interoperability"  "interoperable"     "js"               
[40] "json"              "jsonlite"          "knitr"            
[43] "md"                "memcached"         "mydata"           
[46] "myfile"            "nondegenerateness" "ocpu"             
[49] "opencpu"           "pandoc"            "pb"               
[52] "php"               "png"               "preinstalled"     
[55] "prescripted"       "priori"            "protobuf"         
[58] "rApache"           "rda"               "rds"              
[61] "reproducibility"   "rlm"               "rmd"              
[64] "rnorm"             "rnw"               "rpy"              
[67] "saveRDS"           "scalability"       "scalable"         
[70] "schemas"           "se"                "sep"              
[73] "stateful"          "statefulness"      "stdout"           
[76] "suboptimal"        "svg"               "sweave"           
[79] "tex"               "texi"              "tmp"              
[82] "toJSON"            "urlencoded"        "www"              
[85] "xyz"              
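
Most entries in this list are technical terms rather than typos. One possible way to suppress them, assuming the ignore argument of the hunspell function, is to pass the known terms explicitly. The jargon vector and the sample sentence below are made up for illustration, and passing ignore this way replaces the package default rather than extending it.

```r
library(hunspell)

# Hypothetical list of project-specific terms that should not be flagged
jargon <- c("OpenCPU", "jsonlite", "knitr")

# Made-up sample sentence containing the jargon plus one genuine typo
sample_text <- "OpenCPU uses jsonlite and knitr for reproducable research"

# Words listed in 'ignore' are treated as correct for this call
bad_words <- hunspell(sample_text, ignore = jargon)
print(bad_words[[1]])
```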

Check PDF files

Use the text-extraction from the pdftools package to spell check text from PDF files!

text <- pdftools::pdf_text('https://www.gnu.org/licenses/quick-guide-gplv3.pdf')
bad_words <- hunspell(text)
sort(unique(unlist(bad_words)))
 [1] "AGPLed"        "Affero"        "DRM"           "GPLed"        
 [5] "GPLv"          "ISC"           "OpenSolaris"   "Tivoization"  
 [9] "cryptographic" "fsf"           "isn"           "opments"      
[13] "tivoization"   "tributing"     "ve"            "wasn"         
[17] "weren"         "wouldn"       

Check Manual Pages

The devtools package builds on hunspell and has a wrapper to spell-check manual pages from R packages. Results might contain a lot of false positives for technical jargon, but you might also catch a typo or two. Point it to the root of your source package:

devtools::spell_check("~/workspace/V8")
  WORD          FOUND IN
ECMA          V8.Rd:16, description:2,4
ECMAScript    description:2
emscripten    description:5
htmlwidgets   JS.Rd:16
JSON          V8.Rd:33,38,39,57,58,59,120
jsonlite      V8.Rd:42
Ooms          V8.Rd:41,120
Xie           JS.Rd:26
Yihui         JS.Rd:26

Morphological Analysis

In order to look up a word in a dictionary, hunspell needs to break it down into a stem (stemming) and a conjugation affix. The hunspell function does this automatically, but we can also do it manually.

Stemming Words

The hunspell_stem function looks up words from the dictionary which match the root of the given word. Note that the function returns a list because some words can have multiple matches.

# Stemming
words <- c("love", "loving", "lovingly", "loved", "lover", "lovely")
hunspell_stem(words)
[[1]]
[1] "love"

[[2]]
[1] "loving" "love"  

[[3]]
[1] "loving"

[[4]]
[1] "loved" "love" 

[[5]]
[1] "lover" "love" 

[[6]]
[1] "lovely" "love"  
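
Because the result is a list, code that expects exactly one stem per word has to pick one. A minimal sketch, assuming the last candidate is the most reduced form (this holds for the output above but is not something the package guarantees). The list is copied from the output above so the snippet runs on its own:

```r
# Stems as printed by hunspell_stem(words) in the example above
stem_list <- list("love", c("loving", "love"), "loving", c("loved", "love"),
                  c("lover", "love"), c("lovely", "love"))

# Keep the last candidate for each word, or NA when no stem was found
last_stem <- vapply(stem_list, function(s) {
  if (length(s) > 0) s[length(s)] else NA_character_
}, character(1))

print(last_stem)
# [1] "love"   "love"   "loving" "love"   "love"   "love"
```

Whether the first or the last candidate is preferable depends on the application, so treat the choice here as a placeholder.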

Analyzing Words

The hunspell_analyze function is similar, but it returns both the stem and the affix syntax of the word:

hunspell_analyze(words)
[[1]]
[1] " st:love"

[[2]]
[1] " st:loving"    " st:love fl:G"

[[3]]
[1] " st:loving fl:Y"

[[4]]
[1] " st:loved"     " st:love fl:D"

[[5]]
[1] " st:lover"     " st:love fl:R"

[[6]]
[1] " st:lovely"    " st:love fl:Y"
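
The analysis strings are plain text, so pulling out individual fields takes a little parsing. The helper below is a hypothetical sketch (get_field is not part of the package) that assumes each field has the form tag:value separated by whitespace, as in the output above, where st appears to hold the stem and fl the affix flag:

```r
# Analysis strings copied from the output above for the word "loving"
analysis <- c(" st:loving", " st:love fl:G")

# Hypothetical helper: return the value following a tag such as "st" or "fl",
# or NA when the tag does not occur in the string
get_field <- function(x, tag) {
  pattern <- paste0("(^|\\s)", tag, ":(\\S+)")
  matches <- regmatches(x, regexec(pattern, x))
  vapply(matches, function(g) {
    if (length(g) == 3) g[3] else NA_character_
  }, character(1))
}

get_field(analysis, "st")  # "loving" "love"
get_field(analysis, "fl")  # NA "G"
```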

Tokenizing

To support spell checking on documents, Hunspell includes parsers for various document formats, including text, html, xml, man and latex. The hunspell package also exposes these tokenizers directly so they can be used for applications other than spell checking.

text <- readLines("content.tex", warn = FALSE)
allwords <- hunspell_parse(text, format = "latex")

# Third line (title) only
print(allwords[[3]])
 [1] "The"        "OpenCPU"    "System"     "Towards"    "a"         
 [6] "Universal"  "Interface"  "for"        "Scientific" "Computing" 
[11] "through"    "Separation" "of"         "Concerns"  

Summarizing Text

In text analysis we often want to summarize text via its stems. For example, we can count words for display in a wordcloud:

allwords <- hunspell_parse(janeaustenr::prideprejudice)
stems <- unlist(hunspell_stem(unlist(allwords)))
words <- sort(table(stems), decreasing = TRUE)
print(head(words, 30))
stems
 the   to   of  and    a    i    h    I   in  was  she   it that  not  you 
4402 4305 3611 3585 3135 2930 2233 2070 1881 1846 1711 1640 1579 1429 1367 
  he   hi   be  had   Mr  for with  but have   on   at  him   my    s   by 
1336 1271 1241 1177 1129 1064 1053 1002  938  931  787  764  719  653  636 

Most of these are stop words. Let’s filter these out:

df <- as.data.frame(words)
df$stems <- as.character(df$stems)
stopwords <- hunspell_parse(readLines('https://jeroen.github.io/files/stopwords.txt'))
stops <- df$stems %in% unlist(stopwords)
wcdata <- head(df[!stops,], 150)
print(wcdata, max = 40)
           stems Freq
20            Mr 1129
31     Elizabeth  635
49         Darcy  418
64        sister  294
65          Jane  292
66          miss  290
69          Miss  281
73          lady  265
77            It  247
80            He  235
86          time  224
88            ha  221
97           aft  200
106        happy  183
108       Collin  180
109      Collins  180
111         dear  178
114          bee  175
117          day  174
118       friend  174
 [ reached getOption("max.print") -- omitted 130 rows ]
library(wordcloud2)
names(wcdata) <- c("word", "freq")
wordcloud2(wcdata)

Hunspell Dictionaries

Hunspell is based on MySpell and is backward-compatible with MySpell and aspell dictionaries. Chances are that dictionaries for your language are already available on your system!

A Hunspell dictionary consists of two files:

  • The [lang].aff file specifies the affix syntax for the language
  • The [lang].dic file contains a wordlist formatted using syntax from the aff file.

Typically both files are located in the same directory and share the same filename, for example en_GB.aff and en_GB.dic. The dictionary function will search for these files in the current directory and standard system paths where dictionaries are usually installed.

dictionary("en_GB")
<hunspell dictionary>
 affix: /private/var/folders/l8/bhmtp25n2lx0q0dgv1x4gf1w0000gn/T/RtmpwZDZdP/Rinstd57911b55641/hunspell/dict/en_GB.aff 
 dictionary: /private/var/folders/l8/bhmtp25n2lx0q0dgv1x4gf1w0000gn/T/RtmpwZDZdP/Rinstd57911b55641/hunspell/dict/en_GB.dic 
 encoding: UTF-8 
 wordchars: 0123456789’ 

If the files are not in one of the standard paths you can also specify the full path to either or both the dic and aff file:

dutch <- dictionary("~/workspace/Dictionaries/Dutch.dic")
print(dutch)
<hunspell dictionary>
 affix: /Users/jeroen/workspace/Dictionaries/Dutch.aff 
 dictionary: /Users/jeroen/workspace/Dictionaries/Dutch.dic 
 encoding: UTF-8 
 wordchars: '-./0123456789\ij’ 

Setting a Language

The hunspell R package includes dictionaries for en_US and en_GB. So if you don’t speak en_US you can always switch to British English:

hunspell("My favourite colour to visualise is grey")
[[1]]
[1] "favourite" "colour"    "visualise" "grey"     
hunspell("My favourite colour to visualise is grey", dict = 'en_GB')
[[1]]
character(0)

If you want to use another language you need to make sure that the dictionary is available on your system. The dictionary function is used to read in a dictionary:

dutch <- dictionary("~/workspace/Dictionaries/Dutch.dic")
hunspell("Hij heeft de klok wel horen luiden, maar weet niet waar de klepel hangt", dict = dutch)

Note that if the dict argument is a string, it will be passed on to the dictionary function.
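
To illustrate the two forms, the sketch below checks the same British spellings once with a dictionary object and once with a string. It relies on the en_GB dictionary that ships with the package; the variable names are only illustrative. Loading the object once and reusing it is a reasonable habit when many calls share a dictionary.

```r
library(hunspell)

# Load the bundled British English dictionary once and reuse the object
gb <- dictionary("en_GB")

words <- c("favourite", "colour", "visualise", "grey")

# Passing a dictionary object
by_object <- hunspell_check(words, dict = gb)

# Passing a string, which is handed to dictionary() internally
by_string <- hunspell_check(words, dict = "en_GB")

print(by_object)
print(identical(by_object, by_string))
```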

System Dictionaries

The best way to install dictionaries on Linux is via the system package manager. For example, to install the Austrian-German dictionary on Debian or Ubuntu you need either the hunspell-de-at or the myspell-de-at package:

sudo apt-get install hunspell-de-at

On Fedora and CentOS / RHEL all German dialects are included in the hunspell-de package:

sudo yum install hunspell-de

After installing this you should be able to load the dictionary:

dict <- dictionary('de_AT')

If that didn’t work, verify that the dictionary files were installed in one of the system directories (usually /usr/share/myspell or /usr/share/hunspell).
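
One way to run that check from within R, sketched here with base functions only. The two directories are the ones named above and may differ on your system; the file names de_AT.aff and de_AT.dic are an assumption based on the dictionary name used in the example.

```r
# Directories named in the text above; other systems may use different paths
dirs <- c("/usr/share/hunspell", "/usr/share/myspell")
existing <- dirs[dir.exists(dirs)]

# List the affix (.aff) and wordlist (.dic) files found in those directories
found <- list.files(existing, pattern = "\\.(aff|dic)$", full.names = TRUE)
print(found)

# Check for the Austrian-German pair (assumed file names)
has_de_at <- all(c("de_AT.aff", "de_AT.dic") %in% basename(found))
print(has_de_at)
```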

Custom Dictionaries

If your system does not provide standard dictionaries you need to download them yourself. There are a lot of places that provide quality dictionaries.

On OS-X it is recommended to put the files in ~/Library/Spelling/ or /Library/Spelling/. However, you can also put them in your project working directory or in any of the other standard locations. If you wish to store your dictionaries somewhere else, you can make hunspell find them by setting the DICPATH environment variable. The hunspell:::dicpath() function shows which locations your system searches:

Sys.setenv(DICPATH = "/my/custom/hunspell/dir")
hunspell:::dicpath()
 [1] "/my/custom/hunspell/dir"                                                                            
 [2] "/private/var/folders/l8/bhmtp25n2lx0q0dgv1x4gf1w0000gn/T/RtmpwZDZdP/Rinstd57911b55641/hunspell/dict"
 [3] "/Users/jeroen/Library/Spelling"                                                                     
 [4] "/usr/local/share/hunspell"                                                                          
 [5] "/usr/local/share/myspell"                                                                           
 [6] "/usr/local/share/myspell/dicts"                                                                     
 [7] "/usr/share/hunspell"                                                                                
 [8] "/usr/share/myspell"                                                                                 
 [9] "/usr/share/myspell/dicts"                                                                           
[10] "/Library/Spelling"                                                                                  
[11] "/dictionaries"