| wordStem {Rstem} | R Documentation |
This function computes the stem of each word in the given character vector. Stemming reduces a word to its base component, making it easier to compare related words such as win, winning, and winner. See http://snowball.tartarus.org/ for more information about the concept and algorithms for stemming.
wordStem(words, language = character())
words | a character vector of words whose stems are to be computed. |
language | the name of a recognized language for the package. This should be either a single string that is an element of the vector returned by getStemLanguages, or a character vector of length 3 giving the names of the routines for creating a Snowball SN_env environment, closing it, and performing the stem (in that order). See the example below. |
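The two accepted shapes of the language argument can be sketched as a small check. This is only an illustration: checkLanguageArg is a hypothetical helper, not part of Rstem, and treating a zero-length value as "leave the choice to wordStem" is an assumption based on the default shown in the usage line.

```r
# Hypothetical helper (not part of Rstem): classify a `language` value
# according to the shapes described in the argument documentation.
checkLanguageArg <- function(language, available) {
  stopifnot(is.character(language))
  n <- length(language)
  # Assumption: a zero-length value means "use whatever wordStem defaults to".
  if (n == 0L) return("default")
  if (n == 1L) {
    # A single string must be one of the recognized language names.
    if (!(language %in% available))
      stop("unrecognized language: ", language)
    return("builtin")
  }
  # Three strings name the create, close, and stem routines, in that order.
  if (n == 3L) return("routines")
  stop("language must have length 0, 1, or 3")
}

# Usage, only if Rstem happens to be installed.
if (requireNamespace("Rstem", quietly = TRUE)) {
  available <- Rstem::getStemLanguages()
  if (checkLanguageArg("english", available) == "builtin")
    print(Rstem::wordStem(c("win", "winning"), language = "english"))
}
```

Checking the name against getStemLanguages before calling wordStem gives an early, readable error for a misspelled language.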
This uses Dr. Martin Porter's stemming algorithm and the interface generated by Snowball http://snowball.tartarus.org/.
A character vector with the same number of elements as the input vector; each element is the stem of the corresponding input word.
Duncan Temple Lang <duncan@wald.ucdavis.edu>
See http://snowball.tartarus.org/
# Simple example
# Returns "win" "win" "winner"
wordStem(c("win", "winning", "winner"))
# Test against the supplied vocabulary.
testWords = readLines(system.file("words", "english", "voc.txt", package = "Rstem"))
validate = readLines(system.file("words", "english", "output.txt", package = "Rstem"))
## Not run:
# Read the test words directly from the snowball site over the Web
testWords = readLines(url("http://snowball.tartarus.org/english/voc.txt"))
## End(Not run)
testOut = wordStem(testWords)
all(validate == testOut)
# Specify the language from one of the built-in languages.
testOut = wordStem(testWords, "english")
all(validate == testOut)
# Illustrate the dynamic lookup of symbols, which makes it easy to
# add new languages or to supply custom routines for creating and
# closing the environment (for example, to manage pools, if
# efficiency were an issue).
testOut = wordStem(testWords, c("testDynCreate", "testDynClose", "testDynStem"))
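Because the result has one stem per input word, the stems can serve directly as grouping keys. The following is a minimal sketch of that use; groupByStem is a hypothetical helper, not part of Rstem, and the call to wordStem is guarded so the block still runs where the package is not installed.

```r
# Hypothetical helper (not part of Rstem): group words that share a stem.
# `stemmer` is any function mapping a character vector to a vector of
# stems of the same length, such as wordStem.
groupByStem <- function(words, stemmer) {
  stems <- stemmer(words)
  # The value section documents one stem per input word; verify that here.
  stopifnot(length(stems) == length(words))
  split(words, stems)
}

words <- c("win", "winning", "winner")

# Usage, only if Rstem happens to be installed.
if (requireNamespace("Rstem", quietly = TRUE)) {
  groups <- groupByStem(words, function(w) Rstem::wordStem(w, language = "english"))
  print(groups)
}
```

Passing the stemmer as a function keeps the helper independent of the package, so it can be exercised with a stand-in stemmer.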