Machine Learning

Text Mining #4 Stemming and Lemmatization

grace21110 2023. 9. 6. 08:00

Stemming and lemmatization both find the base form of words that have been changed grammatically or semantically.

Stemming applies general, simplified rules when reducing a word to its base form, so it tends to produce roots that are not correctly spelled words. Lemmatization, in contrast, finds the correctly spelled root by taking grammatical information such as part of speech, and the word's meaning, into account.

 

Let's compare stemming and lemmatization. Stemming finds the base form in a simpler way, by stripping the endings that verbs take in the progressive, third-person singular, and past tense, and that adjectives take in the comparative and superlative.
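To make the idea of stripping endings concrete, here is a minimal toy sketch in plain Python. It is not the Porter algorithm and not how NLTK works internally; the suffix list, the `toy_stem` name, and the minimum stem length are all assumptions made just for this illustration.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, only an illustration.
# The suffix list and the minimum stem length are assumptions for this sketch.
SUFFIXES = ["ing", "est", "ed", "er", "es", "s"]  # checked in this order


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no suffix matched, return the word unchanged


examples = ["working", "works", "worked", "faster", "fastest", "happier", "flies"]
for word in examples:
    print(f"{word:>8} -> {toy_stem(word)}")
```

Regular forms such as "working" or "fastest" come out fine, but "happier" becomes "happi" and "flies" becomes "fli". This is the same kind of misspelled root that a rule-based stemmer tends to produce.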

 

# Stemming Example code: 

 
import nltk
from nltk.stem import PorterStemmer

# Initialize the Porter Stemmer
stemmer = PorterStemmer()

# Example words for stemming
words = ["running", "flies", "happily", "generously"]

# Stem the words
stemmed_words = [stemmer.stem(word) for word in words]

# Print the stemmed words
print(stemmed_words)

Output: 

['run', 'fli', 'happili', 'gener']

Lemmatization needs the word's part of speech as input to extract the base form more accurately.

 

# Lemmatization Example code: 

import nltk
from nltk.stem import WordNetLemmatizer

# Download the WordNet data the lemmatizer relies on (only needed once)
nltk.download('wordnet')

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Example words for lemmatization
words = ["running", "flies", "happily", "generously"]

# Lemmatize the words
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]

# Print the lemmatized words
print(lemmatized_words)

Output: 

['run', 'fly', 'happily', 'generously']
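Note that 'happily' and 'generously' come back unchanged: they are adverbs, but every word was looked up as a verb (pos='v'), and the lemmatizer returns the input as-is when it finds no match. To show why the part of speech matters, here is a small sketch with a hand-written lookup table. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical stand-ins for a real lexicon such as WordNet, not part of NLTK.

```python
# Hypothetical lookup table standing in for a real lexicon such as WordNet.
# Keys are (surface form, part of speech): 'n' = noun, 'v' = verb, 'a' = adjective.
TOY_LEMMAS = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("saw", "n"): "saw",
    ("saw", "v"): "see",
    ("better", "a"): "good",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos); fall back to the word itself if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


pairs = [("leaves", "n"), ("leaves", "v"), ("saw", "v"), ("happily", "v")]
for word, pos in pairs:
    print(f"{word} ({pos}) -> {toy_lemmatize(word, pos)}")
```

The same spelling maps to different base forms depending on the part of speech ("leaves" as a noun gives "leaf", as a verb gives "leave"), so passing the wrong part of speech can leave a word unchanged or give the wrong base form.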