Machine Learning

Text Mining #2 Text Normalization

grace21110 2023. 9. 5. 09:00

From the previous lesson, we have an idea of what Text Mining is. In this post, we are going to jump into Text Normalization. Text Normalization is generally divided into five steps: Cleansing, Tokenization, Filtering/Removing Stopwords/Correcting Spelling, Stemming, and Lemmatization.

 

Today, we are going to dive into Text Tokenization. There are two types of tokenization: sentence tokenization, which separates a text into sentences, and word tokenization, which separates sentences into word tokens.

 

# Sentence Tokenization

In general, sentence tokenization splits text on symbols that indicate the end of a sentence, such as a period or a newline character.

 

Example code: 

1) Install nltk

pip install nltk

 

2) Import NLTK and Download Punkt Tokenizer Models (if not already downloaded)

import nltk
nltk.download('punkt')

 

3) Sentence Tokenization Code

import nltk
from nltk.tokenize import sent_tokenize

# Sample text
text = "This is the first sentence. This is the second sentence. And this is the third sentence."

# Tokenize the text into sentences
sentences = sent_tokenize(text)

# Print the tokenized sentences
for sentence in sentences:
    print(sentence)

 

4) Output

This is the first sentence.
This is the second sentence.
And this is the third sentence.
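
As a side note, splitting on periods alone is not reliable, because periods also end abbreviations like "Dr." or "Mr.". The Punkt models are trained to recognize many common abbreviations. Below is a minimal sketch (the sample sentence is my own) comparing a naive split with sent_tokenize:

import nltk
from nltk.tokenize import sent_tokenize

# Text containing abbreviations that a naive period split would break
text = "Dr. Smith went to Washington. He met Mr. Brown there."

# A naive split on '.' wrongly cuts the text after "Dr" and "Mr"
print([s.strip() for s in text.split('.') if s.strip()])

# Punkt keeps the abbreviations inside their sentences
print(sent_tokenize(text))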

# Word Tokenization 

It is used to tokenize sentences into words. Basically, words are separated by delimiters such as spaces, commas, periods, and newlines. In particular, when the order of words is not important, as in a Bag of Words model, you can use word tokenization on its own without sentence tokenization (a small Bag of Words sketch follows the example below).

 

Example code:

1) Install nltk

pip install nltk

 

2) Import NLTK and Download Punkt Tokenizer Models (if not already downloaded)

import nltk
nltk.download('punkt')

 

3) Word Tokenization Code

import nltk
from nltk.tokenize import word_tokenize

# Sample text
text = "Word tokenization is an essential step in natural language processing."

# Tokenize the text into words
words = word_tokenize(text)

# Print the tokenized words
print(words)

 

4) Output

['Word', 'tokenization', 'is', 'an', 'essential', 'step', 'in', 'natural', 'language', 'processing', '.']
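
To illustrate the Bag of Words point above: once a text is word-tokenized, a bag-of-words representation keeps only the count of each token and discards word order. Here is a minimal sketch (the sample sentence is my own) using collections.Counter:

import nltk
from collections import Counter
from nltk.tokenize import word_tokenize

# Sample text
text = "The cat sat on the mat because the mat was warm."

# Lowercase first so "The" and "the" are counted as the same word
words = word_tokenize(text.lower())

# Count each token; position information is discarded
bag_of_words = Counter(words)
print(bag_of_words)

Note that the punctuation token '.' also ends up in the counts; in a real pipeline it would be removed during the Cleansing or Stopword-filtering step mentioned at the start of this post.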