Text Mining #2 Text Normalization
In the previous lesson, we got an idea of what Text Mining is. In this post, we move on to Text Normalization. Text Normalization is divided into five steps: Cleansing, Tokenization, Filtering / Removing Stopwords / Correcting Spelling, Stemming, and Lemmatization.
Today, we are going to dive into Tokenization. There are two types: sentence tokenization, which splits a text into sentences, and word tokenization, which splits sentences into word tokens.
# Sentence Tokenization
In general, sentence tokenization splits text at symbols that mark the end of a sentence, such as a period or a line break.
Example code:
1) Install nltk
2) Import NLTK and Download Punkt Tokenizer Models (if not already downloaded)
3) Sentence Tokenization Code
4) Output
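To make the splitting rule described above concrete, here is a minimal sketch that uses only Python's standard `re` module instead of NLTK. The function name `split_sentences` and the sample text are made up for illustration. It splits after a period, question mark, or exclamation mark that is followed by whitespace, and also at line breaks. This is a simplification: it would wrongly split after abbreviations such as "Dr.", which is the kind of case NLTK's pretrained Punkt models (used through `nltk.sent_tokenize`) are meant to handle better.

```python
import re

def split_sentences(text):
    """Naive sentence splitter (hypothetical helper, not NLTK's algorithm)."""
    # Split on whitespace that follows ., ! or ?, and on line breaks.
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    # Drop empty pieces and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

text = "Text mining is fun. Is tokenization the first step? Yes!\nLet us begin."
sentences = split_sentences(text)

for sentence in sentences:
    print(sentence)
```

Each printed line is one sentence token. With NLTK installed and the Punkt data downloaded, calling `nltk.sent_tokenize(text)` should play the same role as `split_sentences(text)` here.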
# Word Tokenization
Word tokenization splits sentences into words. Basically, words are separated at spaces, commas, periods, and line breaks. In particular, when the order of words is not important, as in a Bag of Words model, you can use word tokenization alone, without sentence tokenization.
Example code:
1) Install nltk
2) Import NLTK and Download Punkt Tokenizer Models (if not already downloaded)
3) Word Tokenization Code
4) Output
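The same idea can be sketched for words with the standard library alone. The helper name `tokenize_words` and the sample sentence are assumptions made for this example, and the regular expression is a simplification, not the rule set NLTK's `nltk.word_tokenize` applies. The sketch keeps runs of letters and digits as word tokens and emits each punctuation mark as its own token, then builds a small Bag of Words with `collections.Counter` to show why word order, and therefore sentence boundaries, does not matter for that representation.

```python
import re
from collections import Counter

def tokenize_words(text):
    """Naive word tokenizer (hypothetical helper, not NLTK's algorithm)."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

text = "Text mining turns raw text into tokens, and tokens into features."
tokens = tokenize_words(text)
print(tokens)

# Bag of Words: count lowercase word tokens, ignoring punctuation and order.
bag = Counter(t.lower() for t in tokens if t.isalnum())
print(bag.most_common(3))
```

Note that this simple pattern splits contractions such as "don't" into three tokens, so a real project would usually rely on a library tokenizer instead.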