To tokenize sentences and words with NLTK, the “nltk.word_tokenize()” function is used. NLTK tokenization parses a large amount of textual data into parts in order to analyze the character of the text. Tokenization with NLTK can be used for training machine learning models and for Natural Language Processing text cleaning. The tokenized words and sentences can be turned into a data frame and vectorized. Natural Language Tool Kit (NLTK) tokenization supports punctuation cleaning, text cleaning, and vectorization of parsed text data for better lemmatization and stemming, along with machine learning algorithm training.
The Natural Language Tool Kit (NLTK) Python library has a tokenization package called “tokenize”. The “tokenize” package of NLTK contains two main types of tokenization functions.
- “word_tokenize” tokenizes words.
- “sent_tokenize” tokenizes sentences.
How to Tokenize Words with Natural Language Tool Kit (NLTK)?
Tokenization of words with NLTK means parsing a text into words via the Natural Language Tool Kit. To tokenize words with NLTK, follow the steps below.
- Import “word_tokenize” from “nltk.tokenize”.
- Load the text into a variable.
- Use the “word_tokenize” function on the variable.
- Read the tokenization result.
Below, you can see a tokenization example with NLTK for a text.
from nltk.tokenize import word_tokenize
text = "Search engine optimization is the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid traffic rather than direct traffic or paid traffic."
print(word_tokenize(text))
>>>OUTPUT
['Search', 'engine', 'optimization', 'is', 'the', 'process', 'of', 'improving', 'the', 'quality', 'and', 'quantity', 'of', 'website', 'traffic', 'to', 'a', 'website', 'or', 'a', 'web', 'page', 'from', 'search', 'engines', '.', 'SEO', 'targets', 'unpaid', 'traffic', 'rather', 'than', 'direct', 'traffic', 'or', 'paid', 'traffic', '.']
The explanation of the tokenization example above can be seen below.
- The first line imports the “word_tokenize” function.
- The second line provides the text data for tokenization.
- The third line prints the output of “word_tokenize”.
What are the advantages of word tokenization with NLTK?
Word tokenization with NLTK combines the benefits of White Space Tokenization, Dictionary-Based Tokenization, Rule-Based Tokenization, Regular Expression Tokenization, Penn Treebank Tokenization, Spacy Tokenization, Moses Tokenization, and Subword Tokenization. Every type of word tokenization is part of the text normalization process, and normalizing the text with stemming and lemmatization improves the accuracy of language understanding algorithms. The benefits and advantages of word tokenization with NLTK are listed below.
- Removing the stop words easily from the corpora before the tokenization.
- Splitting words into sub-words to understand the text better.
- Resolving text ambiguity is faster and requires less coding with NLTK.
- Besides White Space Tokenization, Dictionary-Based and Rule-Based Tokenization can be implemented easily.
- Performing Byte Pair Encoding, WordPiece Encoding, Unigram Language Model, and SentencePiece Encoding is easier with NLTK.
- NLTK has TweetTokenizer for tokenizing tweets that include emojis and other Twitter conventions.
- NLTK has PunktSentenceTokenizer, which has a pre-trained model for tokenization in multiple European languages.
- NLTK has MWETokenizer (Multi-Word Expression Tokenizer) for tokenizing compound expressions such as “in spite of”.
- NLTK has RegexpTokenizer to tokenize sentences based on regular expressions.
How to Tokenize Sentences with Natural Language Tool Kit (NLTK)?
To tokenize the sentences with Natural Language Tool kit, the steps below should be followed.
- Import “sent_tokenize” from “nltk.tokenize”.
- Load the text for sentence tokenization into a variable.
- Use the “sent_tokenize” function on the variable.
- Print the output.
Below, you can see an example of NLTK Tokenization for sentences.
from nltk.tokenize import sent_tokenize
text = "God is Great! I won a lottery."
print(sent_tokenize(text))
Output: ['God is Great!', 'I won a lottery.']
In the code block above, the text is tokenized into sentences. Taking all of the sentences into a list with NLTK sentence tokenization makes it possible to see which sentence is connected to which one, the average word count per sentence, and the unique sentence count.
What are the advantages of sentence tokenization with NLTK?
The advantages of sentence tokenization with NLTK are listed below.
- NLTK provides a chance to perform text data mining at the sentence level.
- NLTK sentence tokenization allows comparing different text corpora at the sentence level.
- Sentence tokenization with NLTK shows how many sentences are used in different sources of text such as websites, books, and papers.
- Thanks to the NLTK “sent_tokenize” function, it is possible to see how the sentences are connected to each other, and with which bridge words.
- Via the NLTK sentence tokenizer, performing an overall sentiment analysis for the sentences is possible.
- Performing Semantic Role Labeling for the sentences to understand how they are connected to each other is another benefit of NLTK sentence tokenization.
How to perform Regex Tokenization with NLTK?
Regex Tokenization with NLTK performs tokenization based on regular expression rules. Regex tokenization via NLTK can be used for extracting certain phrase patterns from a corpus. To perform regex tokenization with NLTK, the “RegexpTokenizer” class should be used. An example of regex tokenization with NLTK is below.
from nltk.tokenize import RegexpTokenizer
regex_tokenizer = RegexpTokenizer(r"\?", gaps=True)
text = "How to perform Regex Tokenization with NLTK? To perform regex tokenization with NLTK, the regex pattern should be chosen."
regex_tokenization = regex_tokenizer.tokenize(text)
print(regex_tokenization)
OUTPUT >>>
['How to perform Regex Tokenization with NLTK', ' To perform regex tokenization with NLTK, the regex pattern should be chosen.']
The regex tokenization example with NLTK demonstrates how to take a question sentence and the sentence after it. By taking sentences that end with a question mark, along with the sentences that follow them, matching answers to questions, or extracting question patterns from a corpus, becomes possible.
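Building on that idea, the question chunks can be paired with the chunks that follow them. The sketch below is illustrative and assumes questions and answers strictly alternate in the corpus.

```python
from nltk.tokenize import RegexpTokenizer

# Split the corpus on "?" so every question chunk is followed by its answer chunk.
tokenizer = RegexpTokenizer(r"\?", gaps=True)
corpus = ("How to perform Regex Tokenization with NLTK? "
          "To perform regex tokenization with NLTK, the regex pattern should be chosen.")
chunks = [chunk.strip() for chunk in tokenizer.tokenize(corpus)]
# Pair chunk 0 with chunk 1, chunk 2 with chunk 3, and so on.
qa_pairs = list(zip(chunks[0::2], chunks[1::2]))
print(qa_pairs)
```
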
How to perform Rule-based Tokenization with NLTK?
Rule-based tokenization is tokenization based on certain rules that are generated from certain conditions. NLTK has three different rule-based tokenizers: TweetTokenizer for Twitter tweets, MWETokenizer for multi-word expressions, and TreebankWordTokenizer for English-language rules. Rule-based tokenization is helpful for performing tokenization under the best possible conditions for the nature of the textual data.
An example of rule-based tokenization with MWETokenizer for multi-word tokenization can be seen below.
from nltk.tokenize import MWETokenizer, word_tokenize
sentence = "I have sent Steven Nissen to the new research center for the nutritional value of the coffee. This sentence will be tokenized while Mr. Steven Nissen is on the journey. The #truth will be learnt. And, it's going to be well known thanks to this tokenization example."
tokenizer = MWETokenizer()
tokenizer.add_mwe(("Steven", "Nissen"))
result = tokenizer.tokenize(word_tokenize(sentence))
result
OUTPUT >>>
['I', 'have', 'sent', 'Steven_Nissen', 'to', 'the', 'new', 'research', 'center', 'for', 'the', 'nutritional', 'value', 'of', 'the', 'coffee', '.', 'This', 'sentence', 'will', 'be', 'tokenized', 'while', 'Mr.', 'Steven_Nissen', 'is', 'on', 'the', 'journey', '.', 'The', '#', 'truth', 'will', 'be', 'learnt', '.', 'And', ',', 'it', "'s", 'going', 'to', 'be', 'well', 'known', 'thanks', 'to', 'this', 'tokenization', 'example', '.']
Note that MWETokenizer joins the registered multi-word expression with an underscore by default, so “Steven Nissen” appears as the single token “Steven_Nissen”.
An example of Rule-based tokenization with TreebankWordTokenizer for English language text can be seen below.
from nltk.tokenize import TreebankWordTokenizer
sentence = "I have sent Steven Nissen to the new research center for the nutritional value of the coffee. This sentence will be tokenized while Mr. Steven Nissen is on the journey. The #truth will be learnt. And, it's going to be well known thanks to this tokenization example."
tokenizer = TreebankWordTokenizer()
result = tokenizer.tokenize(sentence)
result
OUTPUT >>>
['I', 'have', 'sent', 'Steven', 'Nissen', 'to', 'the', 'new', 'research', 'center', 'for', 'the', 'nutritional', 'value', 'of', 'the', 'coffee.', 'This', 'sentence', 'will', 'be', 'tokenized', 'while', 'Mr.', 'Steven', 'Nissen', 'is', 'on', 'the', 'journey.', 'The', '#', 'truth', 'will', 'be', 'learnt.', 'And', ',', 'it', "'s", 'going', 'to', 'be', 'well', 'known', 'thanks', 'to', 'this', 'tokenization', 'example', '.']
Note that TreebankWordTokenizer expects one sentence per call, so sentence-internal periods stay attached (“coffee.”, “journey.”), and only the final period of the string is split off.
An example of Rule-based tokenization with TweetTokenizer for Twitter Tweets’ tokenization can be seen below.
from nltk.tokenize import TweetTokenizer
sentence = "I have sent Steven Nissen to the new research center for the nutritional value of the coffee. This sentence will be tokenized while Mr. Steven Nissen is on the journey. The #truth will be learnt. And, it's going to be well known thanks to this tokenization example."
tokenizer = TweetTokenizer()
result = tokenizer.tokenize(sentence)
result
OUTPUT >>>
['I', 'have', 'sent', 'Steven', 'Nissen', 'to', 'the', 'new', 'research', 'center', 'for', 'the', 'nutritional', 'value', 'of', 'the', 'coffee', '.', 'This', 'sentence', 'will', 'be', 'tokenized', 'while', 'Mr', '.', 'Steven', 'Nissen', 'is', 'on', 'the', 'journey', '.', 'The', '#truth', 'will', 'be', 'learnt', '.', 'And', ',', "it's", 'going', 'to', 'be', 'well', 'known', 'thanks', 'to', 'this', 'tokenization', 'example', '.']
Note that TweetTokenizer keeps the hashtag “#truth” and the contraction “it's” as single tokens, which suits social media text.
The most standard rule-based type of word tokenization is white-space tokenization. White-space tokenization basically uses the spaces between words for tokenization. White-space tokenization can be performed with the “split(" ")” method as below.
sentence = "I have sent Steven Nissen to the new research center for the nutritional value of the coffee. This sentence will be tokenized while Mr. Steven Nissen is on the journey. The #truth will be learnt. And, it's going to be well known thanks to this tokenization example."
result = sentence.split(" ")
result
OUTPUT >>>
['I', 'have', 'sent', 'Steven', 'Nissen', 'to', 'the', 'new', 'research', 'center', 'for', 'the', 'nutritional', 'value', 'of', 'the', 'coffee.', 'This', 'sentence', 'will', 'be', 'tokenized', 'while', 'Mr.', 'Steven', 'Nissen', 'is', 'on', 'the', 'journey.', 'The', '#truth', 'will', 'be', 'learnt.', 'And,', "it's", 'going', 'to', 'be', 'well', 'known', 'thanks', 'to', 'this', 'tokenization', 'example.']
Among the NLTK tokenization methods, there are other methodologies such as PunktSentenceTokenizer for detecting sentence boundaries, and punctuation-based tokenization for properly tokenizing punctuation-related words and multi-word expressions.
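A minimal PunktSentenceTokenizer sketch is below; the instance here is untrained, so it falls back on Punkt’s default boundary heuristics, and the sample sentences are illustrative.

```python
from nltk.tokenize import PunktSentenceTokenizer

# An untrained Punkt tokenizer uses default sentence-boundary heuristics;
# training it on domain text improves abbreviation handling.
tokenizer = PunktSentenceTokenizer()
text = "Punkt detects sentence boundaries. It can be trained on raw text."
sentences = tokenizer.tokenize(text)
print(sentences)
```

For production use, the pre-trained English Punkt model used by “sent_tokenize” will handle abbreviations such as “Mr.” more reliably than an untrained instance.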
How to use Lemmatization with NLTK Tokenization?
To use lemmatization with NLTK tokenization, the “nltk.stem.wordnet.WordNetLemmatizer” class should be used. WordNetLemmatizer from NLTK lemmatizes the words within the text. Word lemmatization is the process of turning a word into its dictionary form. Unlike stemming, lemmatization resolves suffixes, prefixes, and morphological changes into an actual dictionary word. NLTK lemmatization is useful for seeing a word’s context and understanding which words are actually the same during word tokenization. Below, you will see an example code block for word tokenization and lemmatization with NLTK.
from nltk.stem.wordnet import WordNetLemmatizer
lemmatize = WordNetLemmatizer()
lemmatized_words = []
for w in tokens:  # "tokens" holds the word_tokenize output of the website content below
    rootWord = lemmatize.lemmatize(w)
    lemmatized_words.append(rootWord)
counts_lemmatized_words = Counter(lemmatized_words)
df_tokenized_lemmatized_words = pd.DataFrame.from_dict(counts_lemmatized_words, orient="index").reset_index()
df_tokenized_lemmatized_words.sort_values(by=0, ascending=False, inplace=True)
df_tokenized_lemmatized_words[:50]
The explanation of the NLTK tokenization and lemmatization example code block is below.
- The “nltk.stem.wordnet” module is called for importing WordNetLemmatizer.
- The lemmatizer is assigned to a variable, which is “lemmatize”.
- An empty list is created for the “lemmatized_words”.
- A for loop is created for lemmatizing every word within the NLTK-tokenized words.
- The lemmatized and tokenized words are appended to the “lemmatized_words” list.
- The Counter object is used for counting them.
- The data frame is created with the lemmatized and tokenized word counts, then sorted and called.
The NLTK tokenization and lemmatization statistics will be different from the NLTK tokenization and stemming statistics. These differences reflect their methodological differences in the statistical analysis of tokenized textual data with NLTK.
How to use Stemming with NLTK Tokenization?
To use stemming with NLTK tokenization, the “PorterStemmer” from “nltk.stem” should be imported. Stemming reduces words to their stem forms. Stemming can be useful for a better NLTK word tokenization analysis, since many words carry suffixes. Via NLTK stemming, words that come from the same root can be counted as the same. Seeing which words are used without their suffixes creates a more comprehensive look at the statistical counts of the concepts and phrases within a text. An example of stemming with NLTK tokenization is below.
from nltk.stem import PorterStemmer
ps = PorterStemmer()
stemmed_words = []
for w in tokens:
    rootWord = ps.stem(w)
    stemmed_words.append(rootWord)
OUTPUT >>>
['think', 'like', 'seo', ',', 'code', 'like', 'develop', 'python', 'seo', 'techseo', 'theoret', 'seo', 'on-pag', 'seo', 'pagespe', 'UX', 'market', 'think', 'like', 'seo', ',', 'code', 'like', 'develop', 'main', ..., 'in', 'bulk', 'with', 'python', '.', ...]
counts_stemmed_words = Counter(stemmed_words)
df_tokenized_stemmed_words = pd.DataFrame.from_dict(counts_stemmed_words, orient="index").reset_index()
df_tokenized_stemmed_words.sort_values(by=0, ascending=False, inplace=True)
df_tokenized_stemmed_words
 | index | 0
0 | think | 529
1 | like | 1059
2 | seo | 5389
3 | , | 22564
4 | code | 1128
… | … | …
10342 | pixel. | 1
10343 | success. | 1
10344 | pages. | 1
10345 | free… | 1
10346 | almost. | 1
10347 rows × 2 columns
How to Tokenize Content of a Website via NLTK?
To tokenize the content of a website with NLTK at the word and sentence level, the steps below should be followed.
- Crawling the website’s content.
- Extracting the website’s content from the crawl output.
- Using the “word_tokenize” of NLTK for word tokenization.
- Using “sent_tokenize” of NLTK for sentence tokenization.
Interpreting and comparing the output of the tokenization of a website provides benefits for the overall evaluation of the content of a website. Below, you will see an example of website content tokenization. To perform NLTK tokenization on a website’s content, the Python libraries below should be used.
- Advertools
- Pandas
- NLTK
- Collections
- String
Below, you will see the importing process of the necessary libraries and functions for NLTK tokenization from Python for SEO.
import advertools as adv
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from collections import Counter
from nltk.tokenize import RegexpTokenizer
import string
adv.crawl("https://www.holisticseo.digital", "output.jl", custom_settings={"LOG_FILE":"output.log", "DOWNLOAD_DELAY":0.5}, follow_links=True)
df = pd.read_json("output.jl", lines=True)
for i in df.columns:
    if i.__contains__("text"):
        print(i)
content_of_website = df["body_text"].str.split().explode().str.cat(sep=" ")
tokens = word_tokenize(content_of_website)
tokenized_counts = Counter(tokens)
df_tokenized = pd.DataFrame.from_dict(tokenized_counts, orient="index").reset_index()
df_tokenized.nunique()
df_tokenized
To crawl the website’s content to perform an NLTK word and sentence tokenization, the Advertools’ “crawl” function will be used to take all of the content of the website into a “jl” extension file. Below, you will see an example of crawling a website with Python.
adv.crawl("https://www.holisticseo.digital", "output.jl", custom_settings={"LOG_FILE":"output.log", "DOWNLOAD_DELAY":0.5}, follow_links=True)
df = pd.read_json("output.jl", lines=True)
In the first line, we start the crawling process of the website, while in the second line we read the “output.jl” file into a data frame.
In the third step, the website’s content column should be found within the data frame. To do that, a for loop filtering the data frame columns that contain “text” is necessary. To find it, we will use the “__contains__” method of Python.
for i in df.columns:
    if i.__contains__("text"):
        print(i)
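The same filter can also be written with Python’s “in” operator instead of “__contains__”. The sketch below uses a hypothetical list standing in for “df.columns”.

```python
# Hypothetical column names standing in for df.columns from the crawl output.
columns = ["url", "title", "body_text", "h1", "og:description"]
text_columns = [c for c in columns if "text" in c]
print(text_columns)  # ['body_text']
```
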
At the next step of NLTK Tokenization for website content, we will use the Pandas library’s “str.cat” method to unite all of the content pieces across different web pages.
content_of_website = df["body_text"].str.split().explode().str.cat(sep=" ")
Creating the “content_of_website” variable to hold the united content corpus of the website, with the “sep” parameter set to a space, reduces the computation needed. Instead of performing NLTK tokenization for every web page’s content separately and then uniting all of the tokenized outputs, uniting all of the content pieces first and then tokenizing the united corpus saves time and energy. In the next step, “nltk.word_tokenize” is performed and the output of the tokenization process is assigned to a variable.
tokens = word_tokenize(content_of_website)
To see the tokenized words and their counts, the “Counter” class from “collections” can be used as below.
tokenized_counts = Counter(tokens)
df_tokenized = pd.DataFrame.from_dict(tokenized_counts, orient="index").reset_index()
“tokenized_counts = Counter(tokens)” counts all of the tokenized objects. In the second line, the “from_dict” and “reset_index” methods of Pandas are used to create a data frame. Thanks to NLTK tokenization, the unique word count of a website can be found as below.
df_tokenized.nunique()
OUTPUT>>>
index 18056
0 471
dtype: int64
The “holisticseo.digital” website has 18056 unique words within its content. These words can be seen below.
df_tokenized.sort_values(by=0, ascending=False, inplace=True)
df_tokenized
The table of the word tokenization output is below.
 | index | 0
23 | the | 31354
26 | . | 23012
3 | , | 22564
36 | and | 12812
22 | of | 12349
… | … | …
14747 | NEL | 1
14748 | CSE | 1
14749 | recipe-related | 1
17753 | Plan | 1
18055 | almost. | 1
18056 rows × 2 columns
After the word tokenization with NLTK for website content, we see that the words from the header and footer appear more often, along with the stop words. The counted word tokenization can be visualized as below.
df_tokenized.sort_values(by=0, ascending=False, inplace=True)
df_tokenized[:30].plot(kind="bar", x="index", orientation="vertical", figsize=(15,10), xlabel="Tokens", ylabel="Count", colormap="viridis", table=False, grid=True, fontsize=15, rot=35, position=1, title="Token Counts from a Website Content with Punctuation", legend=True).legend(["Tokens"], loc="lower left", prop={"size":15})
To save the word tokenization output’s barplot as a PNG, you can use the code block below.
df_tokenized[:30].plot(kind="bar", x="index", orientation="vertical", figsize=(15,10), xlabel="Tokens", ylabel="Count", colormap="viridis", table=False, grid=True, fontsize=15, rot=35, position=1, title="Token Counts from a Website Content with Punctuation", legend=True).legend(["Tokens"], loc="lower left", prop={"size":15}).figure.savefig("word-tokenization-2.png")
The word “the” appears more than 30000 times, while some punctuation marks are also included within the word tokenization results. It appears that the words “content” and “Google” are the most frequent words besides the punctuation characters and the stop words. To gain more insight from word tokenization, a TF-IDF analysis with Python can help to understand a word’s weight within a corpus. To create better insight for SEO and content analysis via NLTK tokenization, the stop words should be removed.
How to Filter out the Stop Words for Tokenization with NLTK?
To remove the stop words from the NLTK Tokenization process’ output, a filter-out process should be performed with a repetitive loop with a list comprehension or a normal for loop. An example of NLTK Tokenization by removing the stop words can be seen below.
from nltk.corpus import stopwords
stop_words_english = set(stopwords.words("english"))
df_tokenized["filtered_tokens"] = pd.Series([w for w in df_tokenized["index"] if not w.lower() in stop_words_english])
To filter out the stop words during word tokenization, text cleaning methods should be used. To clean the stop words, the “stopwords.words("english")” method from NLTK can be used. In the code block above, the first line assigns the English stop words to the “stop_words_english” variable. In the second line, we create a new column within the “df_tokenized” data frame using a list comprehension wrapped in “pd.Series”. Basically, we check every tokenized word against the stop words list and keep only the words that are not stop words. The “filtered_tokens” column doesn’t include any of the stop words.
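An alternative that keeps the counts aligned with their tokens is a boolean mask via “Series.isin”. The sketch below uses a hypothetical miniature frame; the column names mirror the tutorial’s, and the stop word set and numbers are illustrative stand-ins.

```python
import pandas as pd

# Stand-in for set(stopwords.words("english")).
stop_words_english = {"the", "and", "of"}
df_tokenized = pd.DataFrame({"index": ["the", "SEO", "and", "content"],
                             0: [31354, 5172, 12812, 2477]})
# Keep only the rows whose token is not a stop word.
mask = ~df_tokenized["index"].str.lower().isin(stop_words_english)
df_filtered = df_tokenized[mask]
print(df_filtered)
```

Because the mask is computed row by row, each remaining token keeps its original count, whereas rebuilding the column from a shorter list can misalign tokens and counts.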
How to Count Tokenized Words with NLTK without Stop Words?
To count the tokenized words with NLTK without the stop words, a list comprehension that subtracts the stop words should be used over the tokenized output. Below, you will see a specific definition for the NLTK tokenization tutorial.
To count tokenized words with NLTK while subtracting the English stop words, the “Counter” object is used over the “tokens_without_stop_words” list created with the list comprehension “[word for word in tokens if not word in stopwords.words("english")]”.
Below, you can see a code block to count the tokenized words and their output.
tokenized_counts_without_stop_words = Counter(tokens_without_stop_words)
tokenized_counts_without_stop_words
OUTPUT >>>
Counter({'Think': 381, 'like': 989, 'SEO': 5172, ',': 22564, 'Code': 467, 'Developer': 405, 'Python': 1583, 'TechSEO': 733, 'Theoretical': 700, 'On-Page': 670, 'PageSpeed': 645, 'UX': 699, 'Marketing': 1086, 'Main': 231, 'Menu': 192, 'X-Default': 232, 'value': 370, 'hreflang': 219, 'attribute': 178, 'link': 509, 'tag': 314, '.': 23012, 'An': 198, 'specify': 42, 'alternate': 109, ..., 'plan': 6, 'reward': 3, 'high-quality': 34, 'due': 82, 'back': 49, ...})
The next step is creating a data frame from the Counter Object via the “from_dict” method of the “pd.DataFrame”.
df_tokenized_without_stopwords = pd.DataFrame.from_dict(tokenized_counts_without_stop_words, orient="index").reset_index()
df_tokenized_without_stopwords
Below, you can see the table output of the tokenization with NLTK without stop words in English.
 | index | 0
0 | Think | 381
1 | like | 989
2 | SEO | 5172
3 | , | 22564
4 | Code | 467
… | … | …
17910 | success. | 1
17911 | pages. | 1
17912 | infinitely | 1
17913 | free… | 1
17914 | almost. | 1
17915 rows × 2 columns
Even if the stop words are removed from the text, the punctuation marks still remain. To clean the textual data completely for a healthier word tokenization process with NLTK, the punctuation should also be removed. Below, you will see the sorted version of the word tokenization with NLTK output without stop words.
df_tokenized_without_stopwords.sort_values(by=0, ascending=False, inplace=True)
df_tokenized_without_stopwords
The table output of the word tokenization with NLTK without stop words, with sorted values, is below.
 | index | 0
21 | . | 23012
3 | , | 22564
298 | ” | 5639
296 | “ | 5623
2 | SEO | 5172
… | … | …
12618 | gzip | 1
6968 | exited | 1
6969 | seduce | 1
6970 | collaborating | 1
17914 | almost. | 1
17915 rows × 2 columns
How to visualize the Word Tokenization with NLTK without the Stop Words?
To visualize the NLTK Word Tokenization without the stop words, the “plot” method of Pandas Python Library should be used. Below, you can see an example visualization of the word tokenization with NLTK without stop words.
df_tokenized_without_stopwords[:30].plot(kind="bar", x="index", orientation="vertical", figsize=(15,10), xlabel="Tokens", ylabel="Count", colormap="viridis", table=False, grid=True, fontsize=15, rot=35, position=1, title="Token Counts from a Website Content with Punctuation", legend=True).legend(["Tokens"], loc="lower left", prop={"size":15})
The effect of the punctuation marks is more evident within the visualization of the word tokenization via NLTK without the stop words.
How to Calculate the Effect of the Stop Words for the Length of the Corpora?
The corpus length represents the total word count of the textual data. To calculate the effect of the stop words on the length of the corpus, the stop word count should be subtracted from the total word count. Below, you can see an example calculation of the stop words’ effect on the length of the corpus.
tokens_without_stop_words = [word for word in tokens if not word in stopwords.words("english")]
print(len(content_of_website))
len(content_of_website) - len(tokens_without_stop_words)
OUTPUT>>>
307199
268147
The total word count of the website is 307199, while the total word count without the stop words is 268147, and 18056 of these words are unique. The unique word count can demonstrate a website’s potential query count, since every different word is a representational score for relevance to a topic or concept. Every unique word and n-gram gives a better chance to be relevant to a concept or phrase in the search bar.
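The stop-word share implied by the reported counts can be checked with plain arithmetic:

```python
total_tokens = 307199               # token count reported for the whole corpus
tokens_without_stop_words = 268147  # token count reported after stop-word removal
stop_word_tokens = total_tokens - tokens_without_stop_words
share = stop_word_tokens / total_tokens
print(stop_word_tokens, round(share * 100, 1))  # 39052 stop-word tokens, about 12.7%
```
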
How to Remove the Punctuation from Word Tokenization with NLTK?
To remove the punctuation from word tokenization with NLTK, the “isalnum()” method should be used within a list comprehension. In this NLTK word tokenization tutorial, the “isalnum()” method is used to create the “content_of_website_removed_punct” variable as below.
content_of_website_removed_punct = [word for word in tokens if word.isalnum()]
content_of_website_removed_punct
OUTPUT >>>
['Think', 'like', 'SEO', 'Code', 'like', 'Developer', 'Python', 'SEO', 'TechSEO', 'Theoretical', 'SEO', 'SEO', 'PageSpeed', 'UX', 'Marketing', 'Think', 'like', 'SEO', 'Code', 'like', 'Developer', 'Main', 'Menu', 'is', 'a', ..., 'server', 'needs', 'to', 'return', '304', ...]
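Note that “isalnum()” keeps only tokens made purely of letters and digits, so hyphenated or clitic tokens are dropped entirely rather than cleaned. The token list below is an illustrative stand-in:

```python
# Illustrative tokens standing in for the word_tokenize output.
tokens_sample = ["On-Page", "SEO", "it", "'s", "2022", "."]
kept = [t for t in tokens_sample if t.isalnum()]
print(kept)  # ['SEO', 'it', '2022']
```

This is why “On-Page” disappears from the counts below while it was present in the stop-word-filtered output above; a regex-based filter would be needed to keep hyphenated terms.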
As you see, all of the punctuation is removed from the tokenized words with NLTK. The next step is using the Counter object to create a data frame so that the tokenized output with NLTK can be used for analysis and machine learning.
content_of_website_removed_punct_counts = Counter(content_of_website_removed_punct)
content_of_website_removed_punct_counts
OUTPUT >>>
Counter({'Think': 381, 'like': 989, 'SEO': 5172, 'Code': 467, 'Developer': 405, 'Python': 1583, 'TechSEO': 733, 'Theoretical': 700, 'PageSpeed': 645, 'UX': 699, 'Marketing': 1086, 'Main': 231, 'Menu': 192, 'is': 8965, 'a': 11483, 'value': 370, 'for': 8377, 'hreflang': 219, 'attribute': 178, 'of': 12349, 'the': 31354, 'link': 509, 'tag': 314, 'An': 198, 'can': 5065, ..., 'natural': 56, 'Due': 20, 'inaccuracies': 1, 'calculation': 63, 'always': 241, ...})
The Counter object has been created for the NLTK word tokenization output without the punctuation. Below, you will see an example of creating a data frame via the “from_dict” method for the result of the NLTK word tokenization without punctuation.
content_of_website_removed_punct_counts_df = pd.DataFrame.from_dict(content_of_website_removed_punct_counts, orient="index").reset_index()
content_of_website_removed_punct_counts_df.sort_values(by=0, ascending=False, inplace=True)
content_of_website_removed_punct_counts_df
Below, you can see the table output of the NLTK word tokenization after removing the punctuation.
 | index | 0
20 | the | 31354
32 | and | 12812
19 | of | 12349
14 | a | 11483
44 | to | 11063
… | … | …
11303 | ground | 1
6178 | Zurich | 1
6181 | Visa | 1
9184 | omitted | 1
14783 | infinitely | 1
14784 rows × 2 columns
How to visualize the NLTK Word Tokenization result without punctuation?
To visualize the NLTK Word Tokenization result within a data frame without punctuation, the “plot” method of the pandas should be used. An example visualization of the NLTK Word Tokenization without punctuation can be seen below.
content_of_website_removed_punct_counts_df[:30].plot(kind="bar", x="index", orientation="vertical", figsize=(15,10), xlabel="Tokens", ylabel="Count", colormap="viridis", table=False, grid=True, fontsize=15, rot=35, position=1, title="Token Counts from a Website Content without Punctuation", legend=True).legend(["Tokens"], loc="lower left", prop={"size":15})
How to Remove Stop Words and Punctuation from Text Data for a Better NLTK Word Tokenization?
To remove the stop words and punctuation, cleaning the text for a better NLTK word tokenization result, a list comprehension with “if” conditions should be used. Multiple conditional clauses in one list comprehension provide a faster text cleaning process, removing punctuation and stop words in a single pass. An example of cleaning punctuation and stop words can be seen below.
stop_words_english = set(stopwords.words("english"))
content_of_website_filtered_stopwords_and_punctuation = [w for w in tokens if not w in stop_words_english if w.isalnum()]
content_of_website_filtered_stopwords_and_punctuation_counts = Counter(content_of_website_filtered_stopwords_and_punctuation)
content_of_website_filtered_stopwords_and_punctuation_counts_df = pd.DataFrame.from_dict(content_of_website_filtered_stopwords_and_punctuation_counts, orient="index").reset_index()
content_of_website_filtered_stopwords_and_punctuation_counts_df.sort_values(by=0, ascending=False, inplace=True)
content_of_website_filtered_stopwords_and_punctuation_counts_df[:30].plot(kind="bar", x="index", orientation="vertical", figsize=(15,10), xlabel="Tokens", ylabel="Count", colormap="viridis", table=False, grid=True, fontsize=15, rot=35, position=1, title="Token Counts from a Website Content without Stop Words and Punctuation", legend=True).legend(["Tokens"], loc="lower left", prop={"size":15})
The explanation of the code block that removes the punctuation and the English stop words from the tokenized words is below.
- Remove the stop words and the punctuation via “stopwords.words("english")” and “isalnum()”.
- Count the remaining words with the “Counter” class from the “collections” Python built-in module.
- Create the dataframe with the “from_dict” method of “pd.DataFrame” with the “orient” parameter, and reset the index.
- Sort the values from high to low.
- Use the “plot” method of the Pandas Python Library for the visualization with the “kind”, “x”, “figsize”, “xlabel”, “ylabel”, “colormap”, “table”, “grid”, “fontsize”, “rot”, “position”, “title”, and “legend” parameters.
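The cleaning-and-counting steps above can be sketched in a self-contained way. The token list and the stop-word set below are illustrative stand-ins; in the real workflow the tokens come from “word_tokenize” and the stop words from “stopwords.words("english")”.

```python
from collections import Counter

# Illustrative stand-ins: in the real workflow, `tokens` comes from
# nltk.word_tokenize and `stop_words` from nltk.corpus.stopwords.words("english").
tokens = ["SEO", "targets", "unpaid", "traffic", ",", "rather",
          "than", "direct", "traffic", "or", "paid", "traffic", "."]
stop_words = {"rather", "than", "or"}

# Keep alphanumeric tokens that are not stop words; "," and "." are dropped.
filtered = [w for w in tokens if w.lower() not in stop_words and w.isalnum()]

counts = Counter(filtered)
print(counts.most_common(3))  # → [('traffic', 3), ('SEO', 1), ('targets', 1)]
```

The resulting `Counter` can then be turned into a dataframe and plotted exactly as in the code block above.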
The output of the removal of the punctuation and the stop words from the tokenized words with NLTK for visualization can be seen below.
The table version of the NLTK Tokenization for words without punctuations and stop words in English can be seen below.
|  | index | 0 |
| --- | --- | --- |
| 2 | SEO | 5172 |
| 63 |  | 2762 |
| 160 | The | 2716 |
| 280 | content | 2477 |
| 23 | page | 2230 |
| … | … | … |
| 9600 | estimations | 1 |
| 9595 | Politic | 1 |
| 9594 | 396 | 1 |
| 9593 | Politics | 1 |
| 14642 | infinitely | 1 |

14643 rows × 2 columns
How to Perform Sentence Tokenization with NLTK?
To perform sentence tokenization with NLTK, the “sent_tokenize” method of NLTK should be used. The steps below can be used for NLTK Sentence Tokenization.
- Extract the text and assign it to a variable.
- Import NLTK and the “sent_tokenize” method.
- Use “sent_tokenize” on the extracted text.
- Use the “Counter” from Collections to count the sentences.
- Create a dataframe from the values of the count process’ output.
- Call the dataframe.
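The steps above can be sketched as follows. To keep the snippet self-contained, a naive regex split stands in for “sent_tokenize”, which would otherwise require downloading the Punkt model; the text is an illustrative stand-in as well.

```python
import re
from collections import Counter

text = ("You can see the result below. SEO targets unpaid traffic. "
        "You can see the result below.")

# Stand-in for nltk.tokenize.sent_tokenize: split after sentence-ending punctuation.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# Count how often each (lower-cased) sentence occurs.
sentence_counts = Counter(s.lower() for s in sentences)
print(sentence_counts.most_common(1))  # → [('you can see the result below.', 2)]
```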
|  | index | 0 |
| --- | --- | --- |
| 104 | required fields are marked * name* email* webs… | 175 |
| 105 | python seo techseo theoretical seo on-page seo… | 175 |
| 103 | post navigation ← → your email address will no… | 156 |
| 3259 | you may see the result below. | 43 |
| 383 | you can see the result below. | 42 |
| 867 | 2 | 34 |
| 879 | 3 | 29 |
| 885 | 4 | 24 |
| 857 | 1 | 21 |
| 908 | 6 | 15 |
| 891 | 5 | 15 |
| 917 | 7 | 13 |
| 165 | reply your email address will not be published. | 13 |
| 924 | 8 | 12 |
| 14947 | below, you can see the result. | 11 |
| 3450 | you also may want to read our some of the rela… | 11 |
| 942 | 10 | 8 |
| 932 | 9 | 8 |
| 16 | in this article, we will focus on how to resiz… | 7 |
| 3146 | you can see an example below. | 7 |
| 15 | because of those motivations, image compressio… | 7 |
| 15040 | below, you will see an example. | 7 |
| 3204 | you may see the output below. | 6 |
| 3451 | return of investment definition and importance… | 6 |
| 3452 | what is conversion rate optimization? | 6 |
| 11046 | you may see an example below. | 6 |
| 62 | what is conversion funnel? | 6 |
| 3147 | you can see the output below. | 6 |
| 1 | an hreflang value can specify the alternate ve… | 5 |
| 38 | the click path plays a role above all in terms… | 5 |
| 24100 | read more » python seo techseo theoretical seo… | 5 |
| 34 | for get and head methods, the server will retu… | 5 |
| 35 | if the resource’s etag is not on the list, the… | 5 |
| 8357 | what is a news sitemap? | 5 |
| 32 | trust elements are called by definition all th… | 5 |
| 31 | trust elements are used in this context. | 5 |
| 28 | translating a pandas data frame with python ca… | 5 |
| 3 | website is a way of world wide web presence. | 5 |
| 5 | announced in october 2015 as an internal proje… | 5 |
| 7 | user-centric performance metrics are announced… | 5 |
How to Perform Sentence Tokenization with NLTK without the Stop Words?
To remove the stop words from the NLTK sentence tokenization output, the “join()” method should be used so that the filtered word tokens can be turned back into a text before it is tokenized into sentences. The steps that will be followed for sentence tokenization with NLTK without the stop words can be seen below.
- Remove the stop words from the tokenized text data.
- Join the tokens with a space via the “join()” method with a space argument.
- Use “sent_tokenize()” on the joined tokens without stop words.
An example of sentence tokenization with NLTK without the stop words can be found below.
# Count the lower-cased sentences produced by sent_tokenize.
sent_tokens_counts = Counter([sent.lower() for sent in sent_tokens])
# Turn the counts into a dataframe and show the 40 most frequent sentences.
sent_tokens_counts_df = pd.DataFrame.from_dict(sent_tokens_counts, orient="index").reset_index()
sent_tokens_counts_df.sort_values(by=0, ascending=False)[0:40]
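The three listed steps can be put together in a self-contained sketch. The token list and the stop-word set are illustrative stand-ins, and a simple regex split stands in for “sent_tokenize” so that the snippet runs without the Punkt model.

```python
import re

# Illustrative stand-ins for nltk.word_tokenize output and stopwords.words("english").
word_tokens = ["SEO", "is", "the", "process", ".", "It", "targets", "unpaid", "traffic", "."]
stop_words = {"is", "the", "it"}

# Step 1: remove the stop words (punctuation is kept so sentence boundaries survive).
kept = [w for w in word_tokens if w.lower() not in stop_words]

# Step 2: join the remaining tokens back into a single string.
joined = " ".join(kept)
print(joined)  # → SEO process . targets unpaid traffic .

# Step 3: tokenize the joined text into sentences (stand-in for sent_tokenize).
sentences = [s for s in re.split(r"(?<=\.)\s+", joined) if s]
print(sentences)  # → ['SEO process .', 'targets unpaid traffic .']
```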
The table output of the sentence tokenization without the stop words can be found below.
|  | index | 0 |
| --- | --- | --- |
| 104 | required fields are marked * name* email* webs… | 175 |
| 105 | python seo techseo theoretical seo on-page seo… | 175 |
| 103 | post navigation ← → your email address will no… | 156 |
| 3259 | you may see the result below. | 43 |
| 383 | you can see the result below. | 42 |
| 867 | 2 | 34 |
| 879 | 3 | 29 |
| 885 | 4 | 24 |
| 857 | 1 | 21 |
| 908 | 6 | 15 |
| 891 | 5 | 15 |
| 917 | 7 | 13 |
| 165 | reply your email address will not be published. | 13 |
| 924 | 8 | 12 |
| 14947 | below, you can see the result. | 11 |
| 3450 | you also may want to read our some of the rela… | 11 |
| 942 | 10 | 8 |
| 932 | 9 | 8 |
| 16 | in this article, we will focus on how to resiz… | 7 |
| 3146 | you can see an example below. | 7 |
| 15 | because of those motivations, image compressio… | 7 |
| 15040 | below, you will see an example. | 7 |
| 3204 | you may see the output below. | 6 |
| 3451 | return of investment definition and importance… | 6 |
| 3452 | what is conversion rate optimization? | 6 |
| 11046 | you may see an example below. | 6 |
| 62 | what is conversion funnel? | 6 |
| 3147 | you can see the output below. | 6 |
| 1 | an hreflang value can specify the alternate ve… | 5 |
| 38 | the click path plays a role above all in terms… | 5 |
| 24100 | read more » python seo techseo theoretical seo… | 5 |
| 34 | for get and head methods, the server will retu… | 5 |
| 35 | if the resource’s etag is not on the list, the… | 5 |
| 8357 | what is a news sitemap? | 5 |
| 32 | trust elements are called by definition all th… | 5 |
| 31 | trust elements are used in this context. | 5 |
| 28 | translating a pandas data frame with python ca… | 5 |
| 3 | website is a way of world wide web presence. | 5 |
| 5 | announced in october 2015 as an internal proje… | 5 |
| 7 | user-centric performance metrics are announced… | 5 |
How to Interpret the Tokenized Text with NLTK?
To interpret the tokenized text with NLTK for SEO, NLP, or text quality understanding, the metrics and dimensions below can be used.
- The unique word count within the text data.
- The unique word count within the headings of a website.
- The unique word count within the anchor texts.
- The sentence count per article of a website.
- The unique sentence count per article of a website.
- The unique word count per article of a website.
- The most used words within the headings.
- The most used words within the text.
- The percentage of the stop words to the unique words.
- Checking the impressions, clicks, and rankings for unique groups of words from different website sections such as the footer, header, main content area, sidebar, or headings.
In terms of Search Engine Optimization and understanding the text’s quality, the interpretation methods above can be used.
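Some of these metrics can be computed directly from the token list. A minimal sketch, with an illustrative token list and a stand-in stop-word set in place of “stopwords.words("english")”:

```python
# Illustrative stand-ins for a tokenized article and stopwords.words("english").
tokens = ["SEO", "targets", "unpaid", "traffic", "rather", "than",
          "direct", "traffic", "or", "paid", "traffic", "."]
stop_words = {"rather", "than", "or"}

# Unique word count (case-folded, punctuation excluded).
unique_words = {t.lower() for t in tokens if t.isalnum()}
print(len(unique_words))  # → 9

# Percentage of the stop words to the unique words.
stop_word_pct = 100 * len(unique_words & stop_words) / len(unique_words)
print(round(stop_word_pct, 1))  # → 33.3
```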
How can a Search Engine Use Tokenization?
A search engine can use tokenization to split text into “tokens” so that information retrieval can match queries to documents. Tokenization is used for text normalization. A search engine uses word tokenization and sentence tokenization to perform text normalization so that it can decrease the computation cost of its own algorithms. Pairing words from different contexts with different prefixes and suffixes, recognizing word pairs, vectorizing the N-grams within sentences, and supporting part-of-speech tagging with tokenized word data from a corpus are among the word tokenization use cases for a search engine. For tokenization purposes, a search engine can use NLTK and other NLP libraries such as Gensim, Keras, or TensorFlow. Natural Language Tool Kit (NLTK) can be used by Google and other search engines for the same purposes. Below, you will see two patents from Google, authored by Max Benjamin Braun and Ying Sheng, that include the usage of NLTK.
Below, you will see another example that shows how a search engine can use NLTK and tokenization from Google Search Engine.
Do search engines use NLTK for tokenization? Yes, search engines use tokenization. Search engines such as Microsoft Bing, Google, and DuckDuckGo can use word and sentence tokenization to create indexes of words and documents, and to understand the contextual connection between queries and documents. A word’s place and its surrounding words can help a search engine understand the relevance of words to each other and to a topic. Word tokenization and sentence tokenization provide a better basis for lemmatization, stemming, word grouping, and textual data aggregation for search engines. To learn more about how a search engine can use Natural Language Processing and its sub-practices such as tokenization, you can read the following articles.
- Named Entity Recognition
- Semantic Search
- Semantic SEO
- Semantic Role Labeling
- Lexical Semantics
- Sentiment Analysis
Last Thoughts on NLTK Tokenize and Holistic SEO
NLTK Word Tokenization is important for interpreting a website’s content or a book’s text. Word Tokenization is an important and basic step for Natural Language Processing. It can be used for analyzing the SEO performance of a website or cleaning a text for NLP algorithm training. Using lemmatization, stemming, stop word cleaning, punctuation cleaning, and visualizing the NLTK Tokenization outputs is beneficial for performing statistical analysis of a text. Filtering certain documents that mention a word, or filtering documents based on their content, content length, and unique word count, can be beneficial for a faster and scalable analysis.
The NLTK Tutorials and NLTK Tokenize Guideline will be updated over time.