Machine Learning

Text Mining #2 Text Normalization

grace21110 2023. 9. 5. 09:00

From the previous lesson, we have an idea of what Text Mining is. In this post, we are going to jump into Text Normalization. Text Normalization is generally divided into five steps: Cleansing, Tokenization, Filtering/Removing Stopwords/Correcting Spelling, Stemming, and Lemmatization.

 

Today, we are going to dive into Text Tokenization. There are two types of tokenization: sentence tokenization, which separates a text into sentences, and word tokenization, which separates sentences into word tokens.

 

# Sentence Tokenization

In general, sentence tokenization splits text on symbols that indicate the end of a sentence, such as a period or a newline character.

 

Example code: 

1) Install nltk

pip install nltk

 

2) Import NLTK and Download Punkt Tokenizer Models (if not already downloaded)

import nltk
nltk.download('punkt')

 

3) Sentence Tokenization Code

import nltk
from nltk.tokenize import sent_tokenize

# Sample text
text = "This is the first sentence. This is the second sentence. And this is the third sentence."

# Tokenize the text into sentences
sentences = sent_tokenize(text)

# Print the tokenized sentences
for sentence in sentences:
    print(sentence)

 

4) Output

This is the first sentence.
This is the second sentence.
And this is the third sentence.
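
As a side note, splitting on periods alone is not reliable, because periods also end abbreviations like "Dr." or "Mr.". The Punkt models are trained to recognize many common abbreviations. Below is a minimal sketch (the sample sentence is my own) comparing a naive split with sent_tokenize:

import nltk
from nltk.tokenize import sent_tokenize

# Text containing abbreviations that a naive period split would break
text = "Dr. Smith went to Washington. He met Mr. Brown there."

# A naive split on '.' wrongly cuts the text after "Dr" and "Mr"
print([s.strip() for s in text.split('.') if s.strip()])

# Punkt keeps the abbreviations inside their sentences
print(sent_tokenize(text))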

# Word Tokenization 

It is used to tokenize sentences into words. Basically, words are separated by delimiters such as spaces, commas, periods, and newlines. In particular, when the order of words is not important, as in a Bag of Words model, you can use word tokenization on its own without sentence tokenization (a small Bag of Words sketch follows the example below).

 

Example code:

1) Install nltk

pip install nltk

 

2) Import NLTK and Download Punkt Tokenizer Models (if not already downloaded)

import nltk
nltk.download('punkt')

 

3) Word Tokenization Code

import nltk
from nltk.tokenize import word_tokenize

# Sample text
text = "Word tokenization is an essential step in natural language processing."

# Tokenize the text into words
words = word_tokenize(text)

# Print the tokenized words
print(words)

 

4) Output

['Word', 'tokenization', 'is', 'an', 'essential', 'step', 'in', 'natural', 'language', 'processing', '.']
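
To illustrate the Bag of Words point above: once a text is word-tokenized, a bag-of-words representation keeps only the count of each token and discards word order. Here is a minimal sketch (the sample sentence is my own) using collections.Counter:

import nltk
from collections import Counter
from nltk.tokenize import word_tokenize

# Sample text
text = "The cat sat on the mat because the mat was warm."

# Lowercase first so "The" and "the" are counted as the same word
words = word_tokenize(text.lower())

# Count each token; position information is discarded
bag_of_words = Counter(words)
print(bag_of_words)

Note that the punctuation token '.' also ends up in the counts; in a real pipeline it would be removed during the Cleansing or Stopword-filtering step mentioned at the start of this post.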