Text Mining #3 Removing Stop Words
grace21110
2023. 9. 5. 09:27
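Stop words are very common words that carry little meaning on their own, so they are usually filtered out of text data during the preprocessing phase. Common stop words include articles (e.g., "a," "an," "the"), prepositions (e.g., "in," "on," "at"), and conjunctions (e.g., "and," "but," "or"). Removing them shrinks the vocabulary and lets later steps focus on the words that carry the content.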
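Before turning to NLTK, the idea can be sketched in plain Python. This is only a minimal illustration: `STOP_WORDS` is a hypothetical, hand-written list built from the examples above, and `remove_stop_words` is a helper name made up for this sketch. Real stop word lists are much longer.

```python
# Hypothetical, hand-written stop word list for illustration only;
# real lists (such as the one shipped with NLTK) are much longer.
STOP_WORDS = {"a", "an", "the", "in", "on", "at", "and", "but", "or"}


def remove_stop_words(text):
    """Split on whitespace and drop any word found in STOP_WORDS."""
    return [word for word in text.split() if word.lower() not in STOP_WORDS]


result = remove_stop_words("The cat sat on the mat and slept")
print(result)  # ['cat', 'sat', 'mat', 'slept']
```

Lowercasing each word before the lookup means "The" and "the" are treated the same, while the original casing of the kept words is preserved.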
Example code:
1) Install nltk
pip install nltk
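2) Import NLTK and Download the Punkt Tokenizer Models and the Stop Words Corpus (if not already downloaded)
import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead of 'punkt'
nltk.download('stopwords')  # stop word lists used by stopwords.words('english')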
3) Code to Remove Stop Words
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Sample text
text = "This is an example sentence with some stop words."
# Tokenize the text into words
words = word_tokenize(text)
# Remove stop words
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
# Print the filtered words
print(filtered_words)
4) Output
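['example', 'sentence', 'stop', 'words', '.']
"This" is removed along with "is," "an," "with," and "some" because each word is lowercased before it is compared with the stop word list. The period remains because punctuation is not part of that list.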
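If punctuation tokens should be dropped as well, one option is to keep only alphabetic tokens. The sketch below is an assumption-laden illustration, not part of the NLTK API: `remove_stop_words_and_punctuation` is a hypothetical helper, the token list is written out by hand to match the sample sentence, and `sample_stop_words` is a small hand-picked subset that stands in for `stopwords.words('english')`.

```python
def remove_stop_words_and_punctuation(tokens, stop_words):
    """Keep tokens that are purely alphabetic and not in stop_words (case-insensitive)."""
    return [
        token for token in tokens
        if token.isalpha() and token.lower() not in stop_words
    ]


# Tokens written out by hand for the sample sentence used above.
tokens = ["This", "is", "an", "example", "sentence", "with", "some", "stop", "words", "."]
# Hypothetical subset of an English stop word list, just enough for this sentence.
sample_stop_words = {"this", "is", "an", "with", "some"}

cleaned = remove_stop_words_and_punctuation(tokens, sample_stop_words)
print(cleaned)  # ['example', 'sentence', 'stop', 'words']
```

Note that `str.isalpha()` also discards tokens containing digits or hyphens (for example "2023" or "well-known"), so this filter is only appropriate when such tokens are not needed.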