
RAKE algorithm in natural language processing


1. Rapid Automatic Keyword Extraction (RAKE) is a domain-independent keyword extraction algorithm in natural language processing.

2. It operates on individual documents, which makes it well suited to dynamic collections and new domains.

3. RAKE is built on three metrics: word degree deg(w), word frequency freq(w), and the degree-to-frequency ratio deg(w) / freq(w).

Introduction

In machine learning, thanks to the No Free Lunch (NFL) theorem, no single algorithm is best for every problem, so we always have multiple options to choose from. Is that a blessing? Unfortunately, not always: you cannot order the entire buffet menu. That is exactly what happened to me while I was working on an NLP project. Due to time constraints, I had to find a ready-made algorithm to extract text features from unstructured text data; there was no time to build one myself. So I did my research, and I was thoroughly confused by the number of options on my table. Then I found an algorithm developed by Stuart Rose et al.: Rapid Automatic Keyword Extraction (RAKE). My search stopped right there. My aim here is to give you a brief idea of the algorithm. People interested in unsupervised learning for NLP will find it very useful.

Feature extraction from text

When you analyze unstructured text, such as social media posts and e-commerce reviews, the challenge most of us face is how to filter it. I am not talking about data cleaning here. Thanks to libraries like TextBlob in Python, the cleaning part is easy to handle, because even domain-specific unstructured data has some structure. On e-commerce sites, for example, product feedback generally shares the same writing style: misspellings, mixed language usage, Unicode characters, and so on. The real problem arises when you are looking for specific features, such as the ones talked about most. Take a mobile phone: the most discussed features are "camera," "screen," "performance," "battery life," and so on. Say you rank them from most talked about to least talked about. Do you think the order of those features will always stay the same? Or, if you keep a predefined static feature list, do you think it will serve the purpose? The answer is no. To save us from this, there are some ready-made techniques and algorithms available.

Word frequency, word co-occurrence, term frequency-inverse document frequency (TF-IDF), linguistic approaches, and graph-based methods are all techniques for unsupervised feature extraction. Their applicability varies with your needs and requirements. But for me, at that time, the useful one turned out to be RAKE.

RAKE algorithm

Let me discuss the Rapid Automatic Keyword Extraction (RAKE) algorithm. First I will give you the intuition behind it, then the Python code perspective.

One important observation by the RAKE authors is that keywords often contain multiple words but rarely contain punctuation, stop words, or other words with little lexical meaning. The authors were mainly talking about word co-occurrence. If you analyze mobile phone feedback from an e-commerce website, you will see bigrams like "good camera" and "customer service"; these words tend to occur together in the comments for a particular product. That is co-occurrence. Now consider the bigrams "bad camera" and "worst camera". Words like "bad" and "worst" have a semantic affinity; they mean similar things. Such bigrams are more likely to appear in domains where a camera module is involved, for example mobile phones or DSLR cameras.

Given a piece of text, RAKE splits it into a word list and removes the stop words from that list. The resulting list is known as the content words. People familiar with natural language processing know the term stop words: words like "she," "no," "there," and "the" do not add much meaning to a sentence, and ignoring them keeps our corpus neat and clean.

Let’s take a live example sentence:

“Feature extraction is not that complex. There are many algorithms available that can help you with feature extraction. Rapid Automatic Keyword Extraction is one of those.”

Initial word list (convert the input text to lower case first; you can use TextBlob):

· corpus = [feature, extraction, is, not, that, complex, there, are, many, algorithms, available, that, can, help, you, with, feature, extraction, rapid, automatic, keyword, extraction, is, one, of, those]

Now let’s highlight the stop words (shown in brackets):

Feature extraction [is] [not] [that] complex. [There] [are] [many] algorithms available [that] [can] help [you] [with] feature extraction. Rapid Automatic Keyword Extraction [is] [one] [of] [those].

Note: I treated “many” as a stop word on purpose. You can ignore that when you run it yourself.

· stop_words = [is, not, that, there, are, many, that, can, you, with, is, one, of, those]

· delimiters = [.]

content_words = word list – stop words – delimiters

· content_words = [feature, extraction, complex, algorithms, available, help, feature, extraction, rapid, automatic, keyword, extraction]
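The splitting steps above can be sketched in a few lines of plain Python (this is a hand-rolled illustration, not the rake-nltk library). The stop word list mirrors the worked example, including “many”:

```python
import re

text = ("Feature extraction is not that complex. There are many algorithms "
        "available that can help you with feature extraction. "
        "Rapid Automatic Keyword Extraction is one of those.")

# Stop words from the worked example ("many" included on purpose)
stop_words = {"is", "not", "that", "there", "are", "many",
              "can", "you", "with", "one", "of", "those"}

# Split on delimiters (here just "."), then break each fragment into
# candidate keyword phrases at every stop word.
phrases = []
for fragment in re.split(r"[.]", text.lower()):
    current = []
    for word in fragment.split():
        if word in stop_words:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))

# The content words are simply the words of all candidate phrases
content_words = [w for p in phrases for w in p.split()]
print(phrases)
print(content_words)
```

Running this recovers exactly the candidate phrases and the twelve content words listed above.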

Now that we have the content words, we can read the candidate keyword phrases off the original text. Below, the candidate phrases are marked in quotes:

“Feature extraction” is not that “complex”. There are many “algorithms available” that can “help” you with “feature extraction”. “Rapid Automatic Keyword Extraction” is one of those.

Let’s create a word co-occurrence matrix, as shown below. Each row shows the number of times a given content word occurs together with every other content word within the candidate keyword phrases.

[Image: word co-occurrence matrix]

Next we give each word a score. The word score is the word’s degree (the sum of its row in the matrix, i.e. its total number of co-occurrences) divided by its frequency, where frequency means the number of times the word occurs in the initial word list. Check below.

[Image: word degrees and word scores]
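This word-scoring step is a short loop over the candidate phrases. A minimal sketch, reusing the hand-worked candidate list from above (with “many” treated as a stop word, so the numbers can differ from a library run with a default stop list):

```python
from collections import defaultdict

# Candidate keyword phrases from the worked example
phrases = ["feature extraction", "complex", "algorithms available",
           "help", "feature extraction", "rapid automatic keyword extraction"]

freq = defaultdict(int)    # times each content word occurs
degree = defaultdict(int)  # co-occurrence row sums (a word co-occurs with itself)
for phrase in phrases:
    words = phrase.split()
    for w in words:
        freq[w] += 1
        degree[w] += len(words)  # w co-occurs with every word in its phrase

# word score = degree / frequency
word_score = {w: degree[w] / freq[w] for w in freq}
print(word_score["extraction"])  # degree 8 over frequency 3, roughly 2.67
```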

Now go back to the candidate keyword phrases and compute the combined score of each phrase as the sum of its member word scores. It will look like the table below.

[Image: cumulative scores of the candidate keyword phrases]
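The phrase scoring can be sketched the same way: compute the word scores, then sum them per candidate phrase and sort. This again uses the hand-worked candidate list with “many” as a stop word, so the numbers differ from rake-nltk’s output shown later:

```python
from collections import defaultdict

phrases = ["feature extraction", "complex", "algorithms available",
           "help", "feature extraction", "rapid automatic keyword extraction"]

freq, degree = defaultdict(int), defaultdict(int)
for phrase in phrases:
    words = phrase.split()
    for w in words:
        freq[w] += 1
        degree[w] += len(words)
word_score = {w: degree[w] / freq[w] for w in freq}

# Phrase score = sum of member word scores; rank highest first
phrase_score = {p: sum(word_score[w] for w in p.split())
                for p in set(phrases)}
ranking = sorted(phrase_score.items(), key=lambda kv: kv[1], reverse=True)
for phrase, score in ranking:
    print(round(score, 2), phrase)
```

As expected, the longest phrase, “rapid automatic keyword extraction”, comes out on top.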

Finally, if two keywords or keyword phrases appear adjacent, in the same order, at least twice in the document, a new combined keyword phrase is created, regardless of how many stop words it contains. Its score is calculated just like that of any other keyword phrase: as the sum of its member word scores.


When I first used RAKE, it felt great. The logic behind the algorithm is simple, yet the results were amazing. Over the past couple of years, I have reached for RAKE whenever I dealt with feature extraction problems in NLP. One request: do browse the original research paper; I have given the link below. I have also added some code snippets below.





Coding with Python

First install the library:

pip install rake-nltk

Then extract the keywords:

from rake_nltk import Rake

r = Rake()  # uses NLTK's English stop word list and punctuation as delimiters by default
text = "Feature extraction is not that complex. There are many algorithms available that can help you with feature extraction. Rapid Automatic Key Word Extraction is one of those"
r.extract_keywords_from_text(text)

r.get_ranked_phrases() returns the ranked phrases:

['rapid automatic key word extraction',
 'many algorithms available',
 'feature extraction',
 'one',
 'help',
 'complex']

and r.get_ranked_phrases_with_scores() includes the scores:

[(23.5, 'rapid automatic key word extraction'),
 (9.0, 'many algorithms available'),
 (5.5, 'feature extraction'),
 (1.0, 'one'),
 (1.0, 'help'),
 (1.0, 'complex')]

# to control the max or min number of words in a phrase
r = Rake(min_length=2, max_length=4)

# to control whether or not to include repeated phrases in the text;
# in our example, (feature, extraction) occurs twice, and we can choose to select it only once

# To include all phrases, even the repeated ones:
r = Rake()  # equivalent to Rake(include_repeated_phrases=True)
# To include each phrase only once and ignore the repetitions:
r = Rake(include_repeated_phrases=False)

The media described in this article is not owned by Analytics Vidhya and is used at the author’s discretion.
