Term-Frequency Inverse-Document-Frequency (TF-IDF): Encoding text for Natural Language Processing (NLP) projects - Part II

NLP Python


Linus Agbleze https://agbleze.github.io/Portfolio/
2022-09-27

Introduction to TF-IDF

TF-IDF is the acronym for Term Frequency Inverse Document Frequency and is one of the techniques used to represent text data. In the previous discussion of representing text using one-hot encoding, a major criticism was that one-hot encoding does not take into account the frequency of words. This is one weakness that TF-IDF addresses.

TF-IDF takes into consideration the frequency of a word or token as well as the number of documents in which it occurs. Usually, less important words such as helping verbs and prepositions have the highest frequency in documents. These words, regarded as stop words, have little predictive power.

Imagine a corpus like “This is my first lesson. This lesson is interesting. The lesson is on logistic regression.” If our task is to classify or identify the topic that the document or text describes, then words such as “This”, “is”, and “lesson”, which have a higher frequency, will provide less insight for classifying the text, while less frequent tokens such as “logistic” and “regression” will provide more predictive insight. This understanding is what underlies TF-IDF as a way of representing text.

Thus, the formula for TF-IDF, which is a combination of TF and IDF, is expressed as follows:

TF = Number of times a word occurs in a document. A token with a higher frequency receives a higher weight.

IDF = \(log\frac{N}{n_w}\)

IDF indicates the log of the ratio of the total number of documents (N) to the number of documents in which the word appears (\(n_w\)). The more documents a word appears in, the lower the weight it receives. Thus, IDF penalizes words that occur across many documents. TF-IDF is the product of TF and IDF: a token that appears frequently within a document receives a high TF weight, but if that token also occurs in many documents, it receives a lower IDF weight.
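To make the formula concrete, the short sketch below computes TF-IDF by hand for the mini corpus from the earlier example (the helper function names are only illustrative). Note that scikit-learn's TfidfVectorizer uses a smoothed and normalized variant of this formula, so its numbers will differ.

```python
import math

# Treat each sentence of the earlier example as a separate document
docs = [
    "this is my first lesson",
    "this lesson is interesting",
    "the lesson is on logistic regression",
]

def tf(word, doc):
    # Term frequency: raw count of the word in a single document
    return doc.split().count(word)

def idf(word, docs):
    # Inverse document frequency: log of (total docs / docs containing the word)
    n_docs_with_word = sum(1 for d in docs if word in d.split())
    return math.log(len(docs) / n_docs_with_word)

for word in ["lesson", "logistic"]:
    score = tf(word, docs[2]) * idf(word, docs)
    print(word, round(score, 3))

# "lesson" appears in every document, so its IDF (and hence TF-IDF) is 0,
# while "logistic" appears in only one document and receives a higher weight.
```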

Implementation of TF-IDF in Python

To understand TF-IDF, we will use the same sentences that were encoded with one-hot encoding in Part I. The Python library scikit-learn provides the TfidfVectorizer class for encoding text with TF-IDF.

First, the packages to use are imported as follows:
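Assuming scikit-learn for the vectorizer, pandas for the tabular display, and seaborn with matplotlib for the heatmap shown later, a minimal set of imports could look like this:

```python
# Packages assumed for this walkthrough
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```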

Next, let’s consider the following sentences for encoding:
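The corpus is the one encoded in Part I of this series; as a stand-in, a small list of four sentences along the same theme could be defined as below. These placeholder sentences are only illustrative and will not reproduce the exact scores printed further down.

```python
# Illustrative stand-in corpus (the original post reuses the sentences from Part I)
sentences = [
    "Computers can not understand text as humans do",
    "Text has to be encoded for computers to understand",
    "Encoding turns text into numbers",
    "This is how machines read text",
]
```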

Using the code below, TfidfVectorizer is initialized and used to transform the text.
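A sketch of that step, assuming the imports and the `sentences` list defined above (variable names such as `tfidf_df` are illustrative), could be:

```python
# Fit the TF-IDF vectorizer on the corpus and transform it into a matrix
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)

# Put the result into a DataFrame: one row per sentence, one column per token.
# get_feature_names() triggers the deprecation warning shown below;
# get_feature_names_out() is the replacement in newer scikit-learn versions.
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf_vectorizer.get_feature_names(),
    index=[f"sentence_{i}" for i in range(len(sentences))],
)
print(tfidf_df)
```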

/Users/lin/Documents/python_venvs/pytorch_deepL/lib/python3.9/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
                  as        be  computers  ...  this        to  understand
sentence_0  0.000000  0.000000   0.401043  ...   0.0  0.000000    0.401043
sentence_1  0.000000  0.000000   0.577350  ...   0.0  0.000000    0.577350
sentence_2  0.400218  0.400218   0.000000  ...   0.0  0.400218    0.000000
sentence_3  0.000000  0.000000   0.000000  ...   0.5  0.000000    0.000000

[4 rows x 15 columns]

The result of the TF-IDF encoding is visualized in a heatmap below.
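A heatmap of this kind could be drawn from the DataFrame above with seaborn, for example:

```python
# Visualize the TF-IDF weights: darker cells mark tokens that carry
# more weight in a given sentence
plt.figure(figsize=(10, 4))
sns.heatmap(tfidf_df, annot=True, cmap="Blues")
plt.title("TF-IDF weights per sentence")
plt.show()
```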

Summary

In this tutorial, TF-IDF was discussed as one of the techniques for encoding text data for machine learning tasks. An example was implemented in Python to demonstrate it.