Text Mining #3 Removing Stop Words

grace21110 2023. 9. 5. 09:27

Stop words are very common words that carry little meaning on their own, so they are usually filtered out of text data during preprocessing. Typical examples are articles ("a," "an," "the"), prepositions ("in," "on," "at"), and conjunctions ("and," "but," "or").
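Before bringing in NLTK, the idea can be sketched in plain Python. The snippet below is only an illustration: `STOP_WORDS` is a tiny hand-picked set built from the examples above (real lists are much longer), and `remove_stop_words` is a hypothetical helper name, not part of any library.

```python
# Hypothetical, hand-picked stop word set for illustration only;
# real lists (such as the one shipped with NLTK) are much longer.
STOP_WORDS = {"a", "an", "the", "in", "on", "at", "and", "but", "or"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Split on whitespace and drop tokens found in stop_words (case-insensitive)."""
    return [token for token in text.split() if token.lower() not in stop_words]


print(remove_stop_words("The cat sat on a mat and slept"))
# ['cat', 'sat', 'mat', 'slept']
```

Splitting on whitespace is a simplification; a proper tokenizer also separates punctuation from words.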

Example code:

1) Install nltk

pip install nltk

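2) Import NLTK and Download the Punkt Tokenizer Models and the Stop Word Corpus (if not already downloaded)

import nltk

nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('punkt_tab')  # recent NLTK releases look for this resource instead of 'punkt'
nltk.download('stopwords')  # stop word lists read by stopwords.words()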

3) Code to Remove Stop Words

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Sample text
text = "This is an example sentence with some stop words."

# Tokenize the text into words
words = word_tokenize(text)

# Remove stop words (tokens are lowercased before the lookup)
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]

# Print the filtered words
print(filtered_words)

4) Output

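['example', 'sentence', 'stop', 'words', '.']

"This" is removed along with "is," "an," "with," and "some," because each token is lowercased before it is compared against the stop word list.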
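The filtered list still contains the "." token, because punctuation marks are not part of NLTK's English stop word list. If punctuation should go too, one possible approach is a second filter. This is a minimal sketch that uses only the standard library; `drop_punctuation` is a hypothetical helper name, and the token list is made up so the snippet runs on its own.

```python
import string


def drop_punctuation(tokens):
    """Drop tokens that consist only of ASCII punctuation characters."""
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]


# Made-up token list for illustration
print(drop_punctuation(["Hello", ",", "world", "!"]))
# ['Hello', 'world']
```

Tokens that mix letters and punctuation, such as "don't", are kept, since only tokens made up entirely of punctuation are dropped.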