# How Does TfidfVectorizer from sklearn Calculate tf-idf Values?

This article was published as part of the Data Science Blogathon.

### Summary

In this blog, we will try to crack tf-idf and see how TfidfVectorizer from sklearn calculates tf-idf values. I had a hard time matching the tf-idf values generated by TfidfVectorizer with the ones I calculated by hand. The reason is that there are many ways tf-idf values can be calculated, and we need to be aware of the one TfidfVectorizer actually uses. Knowing this will save you a lot of time and effort; I spent two days troubleshooting before I realized the problem.

We will write a simple Python program that uses TfidfVectorizer to calculate tf-idf values and then validate them manually. Before getting into the coding part, let's go through the terms that constitute tf-idf.

**What is term frequency (tf)?**

Term frequency (tf) is the number of times a term appears in a particular document, so it is document specific. Here are some of the ways tf can be calculated:

tf(t) = number of times the term 't' occurs in the document

**or**

tf(t) = (number of times the term 't' occurs in the document) / (number of terms in the document)

**or**

tf(t) = (number of times the term 't' occurs in the document) / (frequency of the most frequent term in the document)

sklearn uses the first one: the number of times the term 't' appears in the document.
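As a minimal sketch of this raw-count definition (using a simple lowercase whitespace split, not sklearn's actual tokenizer, and an example document chosen to match the worked tf values later in this post):

```python
from collections import Counter

def term_frequency(term, document):
    # Raw count of `term` in the document (the tf definition sklearn uses).
    tokens = document.lower().split()
    return Counter(tokens)[term]

print(term_frequency("cars", "petrol cars are cheaper than diesel cars"))  # 2
print(term_frequency("bus", "petrol cars are cheaper than diesel cars"))   # 0
```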

**Inverse document frequency (idf)**

idf is a measure of how common or rare a term is across the entire collection of documents. The point to note is that it is common to all documents. If a word is common and appears in many documents, its (normalized) idf value approaches 0; otherwise, it approaches 1 if the term is rare. A few of the ways the idf value of a term can be calculated are given below:

idf(t) = 1 + log_e [ n / df(t) ]

**or**

idf(t) = log_e [ n / df(t) ]

where

n = total number of documents available

t = the term for which the idf value is being calculated

df(t) = number of documents in which the term 't' appears

But according to sklearn's documentation, it uses the formulas below to calculate the idf of a term:

idf(t) = log_e [ (1 + n) / (1 + df(t)) ] + 1, when smooth_idf=True (the default)

and

idf(t) = log_e [ n / df(t) ] + 1, when smooth_idf=False
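These two formulas translate directly into Python (a small sketch using `math.log`, which is the natural log):

```python
import math

def idf_smooth(n, df):
    # smooth_idf=True (the sklearn default): log_e((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

def idf_no_smooth(n, df):
    # smooth_idf=False: log_e(n / df) + 1
    return math.log(n / df) + 1

# idf of a term appearing in 1 of 2 documents vs. in both documents
print(idf_smooth(2, 1))   # 1.4054651081...
print(idf_smooth(2, 2))   # 1.0
```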

**Term frequency-inverse document frequency (tf-idf)**

The **tf-idf** value of a term in a document is the product of its tf and idf. The higher the value, the more relevant the term is in that document.

**Python program to generate tf-idf values**

**Step 1: Import the library**

from sklearn.feature_extraction.text import TfidfVectorizer

**Step 2: Prepare the set of documents**

d1 = "Petrol cars are cheaper than diesel cars"

d2 = "Diesel is cheaper than petrol"

doc_corpus=[d1,d2]

print(doc_corpus)

**Step 3: Initialize TfidfVectorizer and print the feature names**

vec=TfidfVectorizer(stop_words="english")

matrix=vec.fit_transform(doc_corpus)

print("Feature Names\n",vec.get_feature_names_out())

**Step 4: Print the sparse matrix with the tf-idf values**

print("Sparse Matrix\n",matrix.shape,"\n",matrix.toarray())
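To peek inside the fitted vectorizer, the `vocabulary_` and `idf_` attributes show the term-to-column mapping and the learned idf values (shown here with a self-contained corpus; the first document's wording is an assumption consistent with the worked tf values in this post):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

doc_corpus = ["Petrol cars are cheaper than diesel cars",
              "Diesel is cheaper than petrol"]
vec = TfidfVectorizer(stop_words="english")
vec.fit_transform(doc_corpus)

print(vec.vocabulary_)  # term -> column index, e.g. {'cars': 0, ...}
print(vec.idf_)         # learned idf value for each column
```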

**Validating the tf-idf values in the sparse matrix**

Let's start with the term frequency (tf) of each term in the above documents:

| Term | tf in d1 | tf in d2 |
| --- | --- | --- |
| cars | 2 | 0 |
| cheaper | 1 | 1 |
| diesel | 1 | 1 |
| petrol | 1 | 1 |

**Calculating the idf values** for each term.

As mentioned earlier, the idf value of a term is common to all documents. Here we will look at the case when **smooth_idf=True** (the default behaviour), so the idf is:

idf(t) = log_e [ (1 + n) / (1 + df(t)) ] + 1

Here n = 2 (the number of documents).

idf("cars") = log_e (3/2) + 1 => 1.405465083

idf("cheaper") = log_e (3/3) + 1 => 1

idf("diesel") = log_e (3/3) + 1 => 1

idf("petrol") = log_e (3/3) + 1 => 1

From the above idf values, we can see that since "cheaper", "diesel", and "petrol" are common to both documents, they have a lower idf value.

**Calculating the tf-idf** of the terms in each of the documents d1 and d2.

**For d1**

tf-idf("cars") = tf("cars") x idf("cars") = 2 x 1.405465083 => 2.810930165

tf-idf("cheaper") = tf("cheaper") x idf("cheaper") = 1 x 1 => 1

tf-idf("diesel") = tf("diesel") x idf("diesel") = 1 x 1 => 1

tf-idf("petrol") = tf("petrol") x idf("petrol") = 1 x 1 => 1

**For d2**

tf-idf("cars") = tf("cars") x idf("cars") = 0 x 1.405465083 => 0

tf-idf("cheaper") = tf("cheaper") x idf("cheaper") = 1 x 1 => 1

tf-idf("diesel") = tf("diesel") x idf("diesel") = 1 x 1 => 1

tf-idf("petrol") = tf("petrol") x idf("petrol") = 1 x 1 => 1

So we have a sparse matrix of shape 2 x 4:

[

[2.810930165 1 1 1]

[0 1 1 1]

]
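You can ask TfidfVectorizer for exactly this unnormalized matrix by passing `norm=None` (a real parameter of the class), which is a handy way to check the intermediate values (the first document's wording is assumed from the worked tf values above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

doc_corpus = ["Petrol cars are cheaper than diesel cars",
              "Diesel is cheaper than petrol"]
# norm=None skips the final row normalization, exposing raw tf x idf values.
vec = TfidfVectorizer(stop_words="english", norm=None)
raw = vec.fit_transform(doc_corpus).toarray()
print(raw)  # rows should match [2.810930165, 1, 1, 1] and [0, 1, 1, 1]
```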

**Normalization of tf-idf values**

We have one last step. To prevent large documents from dominating smaller ones, we normalize each row of the sparse matrix using the Euclidean (L2) norm.

*First document d1*

2.810930165 / sqrt(2.810930165^2 + 1^2 + 1^2 + 1^2) => 0.851354321

1 / sqrt(2.810930165^2 + 1^2 + 1^2 + 1^2) => 0.302872811

1 / sqrt(2.810930165^2 + 1^2 + 1^2 + 1^2) => 0.302872811

1 / sqrt(2.810930165^2 + 1^2 + 1^2 + 1^2) => 0.302872811

*Second document d2*

0 / sqrt(0^2 + 1^2 + 1^2 + 1^2) => 0

1 / sqrt(0^2 + 1^2 + 1^2 + 1^2) => 0.577350269

1 / sqrt(0^2 + 1^2 + 1^2 + 1^2) => 0.577350269

1 / sqrt(0^2 + 1^2 + 1^2 + 1^2) => 0.577350269

This gives us the final sparse matrix:

[

[0.851354321 0.302872811 0.302872811 0.302872811]

[0 0.577350269 0.577350269 0.577350269]

]
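The normalization step itself is easy to check with NumPy (a sketch; the raw tf-idf values are the ones computed above):

```python
import numpy as np

raw = np.array([[2.810930165, 1.0, 1.0, 1.0],
                [0.0, 1.0, 1.0, 1.0]])
# Divide each row by its Euclidean (L2) norm.
norms = np.linalg.norm(raw, axis=1, keepdims=True)
normalized = raw / norms
print(normalized.round(9))
```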

The sparse matrix we just computed matches the one generated by sklearn's TfidfVectorizer.

**Inference from the tf-idf values**

Here is what the tf-idf values of the two documents d1 and d2 tell us:

(1) In the first document d1, "cars" is the most relevant term since it has the highest tf-idf value (0.851354321).

(2) In the second document d2, most terms have the same tf-idf value, so they are equally relevant.

The complete Python code for building the sparse matrix using TfidfVectorizer is given below for ready reference.

from sklearn.feature_extraction.text import TfidfVectorizer

doc1="petrol cars are cheaper than diesel cars"

doc2="diesel is cheaper than petrol"

doc_corpus=[doc1,doc2]

print(doc_corpus)

vec=TfidfVectorizer(stop_words="english")

matrix=vec.fit_transform(doc_corpus)

print("Feature Names\n",vec.get_feature_names_out())

print("Sparse Matrix\n",matrix.shape,"\n",matrix.toarray())

## Concluding remarks

In this blog, we learned about tf, idf, and tf-idf, and saw that **idf(term)** is common to the whole document corpus while **tf-idf(term)** is document specific. We also used Python to create a sparse tf-idf matrix with sklearn's TfidfVectorizer and validated its values by hand.

Finding the tf-idf values with **smooth_idf=False** is left as an exercise for the reader. I hope you found this blog useful. Please leave your comments or questions, if any.

**The media described in this article is not owned by Analytics Vidhya and is used at the author’s discretion.**