Can someone please explain what this code does?
    import re

    def stemmer(word):
        [(stem, end)] = re.findall('^(.*ss|.*?)(s)?$', word)
        return stem
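It splits a word into two parts, stem and end, and returns only the stem. There are three cases:
1. The word ends with ss (or even more s): stem <- word and end <- ""
2. The word ends with a single s: stem <- word without "s" and end <- "s"
3. The word does not end with s: stem <- word and end <- ""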
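The three cases can be checked with a small sketch. The helper name split_word is hypothetical (it is not part of the question's code); it uses the same pattern as stemmer but returns both captured groups, so the end part is visible too.

```python
import re

def split_word(word):
    # Hypothetical helper for illustration: same regex as stemmer(),
    # but it returns both groups instead of only the stem.
    [(stem, end)] = re.findall('^(.*ss|.*?)(s)?$', word)
    return stem, end

# One example word per case: ends in "ss", ends in a single "s", no final "s".
for word in ["glass", "cats", "cat"]:
    print(word, "->", split_word(word))
# glass -> ('glass', '')
# cats -> ('cat', 's')
# cat -> ('cat', '')
```

Note that re.findall reports a group that did not take part in the match as an empty string, which is why end is "" rather than None in the first and third cases.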
This is done by a regular expression that has to match the full word (due to ^...$). The first group (i.e. stem) consists either of as much as possible ending in ss (.*ss) or, if that is not possible, of as little as possible (.*?). Then an optional final s, matched by (s)?, is taken to be the end part.
Note that in the first case (as much as possible ending in ss) there can never be an additional s for the end part: the greedy .*ss already extends to the end of the word, so nothing is left for (s)? to match.
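To see why the .*ss alternative matters, here is a rough comparison against a variant pattern with that alternative removed. The names FULL, NO_SS and groups are made up for this sketch, and NO_SS is only a comparison aid, not something from the original code.

```python
import re

FULL = '^(.*ss|.*?)(s)?$'   # pattern from the question
NO_SS = '^(.*?)(s)?$'       # hypothetical variant without the .*ss alternative

def groups(pattern, word):
    # Return the (stem, end) tuple of the single match.
    return re.findall(pattern, word)[0]

for word in ["boss", "sss"]:
    print(word, groups(FULL, word), groups(NO_SS, word))
# boss ('boss', '') ('bos', 's')
# sss ('sss', '') ('ss', 's')
```

With the full pattern the first alternative is tried first and swallows the whole word, so a word like "boss" keeps its double s. Without it, the lazy .*? stops one character early and the last s is stripped.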