Autocorrect feature using NLP in Python


This article was published as part of the Data Science Blogathon.

Natural Language Processing (NLP) is the field of artificial intelligence that combines linguistics and computer science. I assume you already understand the basic concepts of NLP, so we will go straight ahead. Some applications of NLP are: automatic spelling correction, sentiment analysis, fake news detection, machine translation, question answering (Q&A), chatbots, and many more.

Introduction to AutoCorrect

Have you ever wondered how the autocorrect feature works on a smartphone keyboard? Today almost every smartphone brand, regardless of price, offers autocorrection in its keyboard. Smartphones are an endless topic, though, and they are not the focus of this blog.

As you can guess from the title, the main purpose of this article is to create an autocorrect feature. It is similar to, but not exactly the same as, the smartphone version we use today: it is an implementation of natural language processing on a smaller dataset, such as a book.

So let’s understand how autocorrect works. In this article, I will walk you through how to create autocorrect with Python.

Autocorrect using NLP with Python – How does it work?

In the context of machine learning, autocorrect is based entirely on Natural Language Processing (NLP). As the name suggests, it is programmed to correct spelling mistakes and other errors while you type. So let’s see how it works.

Before we move on to the code, let’s understand how autocorrect works. Say you type a word on the keyboard: if this word is in the vocabulary of the smartphone, it will assume that you typed the correct word. It doesn’t matter whether you are writing a name or any other word.

Understood this scenario? If the word exists in the smartphone’s history, it is treated as a correct word. But what if the word doesn’t exist? In that case, autocorrect is programmed to find the most similar words in the smartphone’s history and suggest them.

So let’s understand the algorithm.

There are 4 basic steps to building an autocorrect model that corrects misspellings:

1: – Identify misspelled words – Let’s consider an example: how do we know whether the word “dream” is spelled correctly or incorrectly? If the word is spelled correctly, it will be found in the dictionary; if it is not there, it is likely a misspelled word. Hence, when a word is not found in the dictionary, we flag it for correction.

2:- Find strings n edit distance away – An edit is an operation performed on a string to convert it into another string, and the edit distance n (1, 2, 3, and so on) counts the number of edits to be performed. Hence an edit distance of n tells us how many edit operations separate one string from another. Here are the different types of edits:-

    • Insert (will add a character)
    • Delete (will remove a character)
    • Switch (two adjacent characters will be swapped)
    • Replace (replace one character with another)

With these four edits, we can modify any string. So the combination of edits allows us to build a list of all possible strings that are n edits away from the typed word.

Important note: For autocorrect we usually take n between 1 and 3 edits.

3: – Filter candidates – Here we want to keep only the correctly spelled real words from the generated list of candidates. We compare the candidates with a known dictionary (as we did in the first step) and filter out the words that do not appear in it.

4:- Calculate the word probabilities – We can calculate the word probabilities and then pick the most likely word among the candidates. This requires the frequency of each known word and the total number of words in the corpus (also known as the dictionary).
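The four steps above can be sketched in a few lines of plain Python. This is only a minimal illustration that assumes an edit distance of n = 1 and a tiny made-up word list; the names `edits1`, `correct`, and `toy_counts` are hypothetical and are not part of the code built later in this article, which ranks candidates by similarity instead of generating edits.

```python
from collections import Counter
import string


def edits1(word):
    """Return the strings reachable from `word` with a single
    insert, delete, switch, or replace (assumes lowercase a-z).
    A no-op switch such as swapping "ss" can reproduce the word itself."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    switches = [left + right[1] + right[0] + right[2:]
                for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right
                for c in letters if c != right[0]]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + switches + replaces + inserts)


def correct(word, word_counts):
    """Return the most probable known word one edit away from `word`."""
    # Step 1: a word found in the dictionary is assumed to be correct.
    if word in word_counts:
        return word
    # Steps 2 and 3: generate candidates, keep only real dictionary words.
    candidates = {w for w in edits1(word) if w in word_counts}
    if not candidates:
        return word  # nothing better to suggest
    # Step 4: pick the candidate with the highest relative frequency.
    total = sum(word_counts.values())
    return max(candidates, key=lambda w: word_counts[w] / total)


# Hypothetical toy corpus, only for illustration.
toy_counts = Counter("the dream was a dream of the dear sea".split())
print(correct("dream", toy_counts))  # already in the dictionary
print(correct("drea", toy_counts))   # one insert away from "dream"
```

For larger n the same idea can be applied repeatedly (edits of edits), at the cost of a much larger candidate list.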

Build NLP AutoCorrect with Python

I hope you are now clear about what autocorrect is and how it works. Now let’s see how we can build an autocorrect feature using Python. Just as a smartphone uses its history to check whether typed words are correct, we need a collection of words to run our autocorrect functions against.

So I will use the text of a book, which you can easily download from here, to work through it in practice. Now let’s start with the task of building an autocorrect model using Python.

Note: You can use any type of text data.

Download Link

To run this task, we need some libraries. I will be using very common machine learning libraries, so you should already have all of them installed on your system except for one. You need to install a single library known as “textdistance”, which can be easily installed using the pip command.

pip install textdistance

Now let’s start with this by importing all the necessary packages and libraries and reading our text file:

Code:

import pandas as pd
import numpy as np
import textdistance
import re
from collections import Counter

words = []
with open('auto.txt', 'r') as f:
    file_name_data = f.read()
    file_name_data = file_name_data.lower()
    # \w+ matches runs of letters, digits, and underscores
    words = re.findall(r'\w+', file_name_data)
# This is our vocabulary
V = set(words)
# words[0:10] holds the first ten words in order of appearance
print(f"Top ten words in the text are:{words[0:10]}")
print(f"Total Unique words are {len(V)}.")
Output:
Top ten words in the text are:['moby', 'dick', 'by', 'herman', 'melville', '1851', 'etymology', 'supplied', 'by', 'a']
Total Unique words are 17140.
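To see what the tokenization step does on a small input, here is a minimal sketch on a made-up sentence (the `sample` string is hypothetical and is not taken from the book). It shows that the pattern `\w+` drops punctuation, keeps digits, and splits a contraction at the apostrophe.

```python
import re

# Hypothetical sample text, only for illustration.
sample = "The whale, the WHALE! It's 1851."

# Lowercase first, then pull out runs of word characters.
tokens = re.findall(r"\w+", sample.lower())
vocab = set(tokens)

print(tokens)      # ['the', 'whale', 'the', 'whale', 'it', 's', '1851']
print(len(vocab))  # 5 unique words
```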

In the above code, you can see that we have made a list of words. Now we will build the frequency count of these words, which can be done easily using the “Counter” class from Python’s collections module:

Code:

word_freq = {}
# Counter maps each word to the number of times it appears
word_freq = Counter(words)
print(word_freq.most_common()[0:10])
Output:
[('the', 14431), ('and', 6430), ('a', 4736), ('to', 4625), ('in', 4172), ('his', 2530), ('it', 2522), ('i', 2127)]

Relative frequency of words

Now we want the probability of occurrence of each word, which is simply its relative frequency: the count of the word divided by the total number of words:

Code:

probs = {}
Total = sum(word_freq.values())
for k in word_freq.keys():
    # relative frequency = count of the word / total number of words
    probs[k] = word_freq[k]/Total

Find similar words

So we will rank similar words using the “Jaccard distance” computed over the 2-grams (character pairs, qval=2) of the words. Next, we will return the five most similar words, ordered by similarity and then by probability:-

Code:

def my_autocorrect(input_word):
    input_word = input_word.lower()
    if input_word in V:
        return('Your word seems to be correct')
    else:
        # Similarity = 1 - Jaccard distance over character 2-grams (qval=2)
        sim = [1-(textdistance.Jaccard(qval=2).distance(v,input_word)) for v in word_freq.keys()]
        df = pd.DataFrame.from_dict(probs, orient="index").reset_index()
        df = df.rename(columns={'index':'Word', 0:'Prob'})
        df['Similarity'] = sim
        # Rank by similarity first, then by probability, and keep the top five
        output = df.sort_values(['Similarity', 'Prob'], ascending=False).head()
        return(output)
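To make the similarity score less of a black box, here is a minimal pure-Python sketch of Jaccard similarity over character 2-grams. It assumes that the score is the size of the intersection of the two bigram multisets divided by the size of their union, which is my understanding of what textdistance computes with qval=2 but is not confirmed by this article. The helper names `bigrams` and `jaccard_similarity` are hypothetical.

```python
from collections import Counter


def bigrams(word):
    """Count the overlapping character pairs in `word`."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))


def jaccard_similarity(a, b):
    """Shared bigrams divided by all bigrams (multiset intersection / union)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    intersection = sum((grams_a & grams_b).values())
    union = sum((grams_a | grams_b).values())
    # Assumption: two words with no bigrams at all count as identical.
    return intersection / union if union else 1.0


# "neverteless" has 10 bigrams, "nevertheless" has 11, and 9 are shared,
# so the union holds 12 bigrams.
print(jaccard_similarity("neverteless", "nevertheless"))  # 9 / 12 = 0.75
print(jaccard_similarity("neverteless", "never"))         # 4 / 10 = 0.4
```

Because the score only looks at shared character pairs, a short word such as “never” can still score fairly high against a longer misspelling, which is why the function also sorts by probability.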

Well, now, let’s find some similar words using the autocorrect function:

Code:

my_autocorrect('neverteless')
Output:
               Word      Prob  Similarity
2209   nevertheless  0.000229    0.750000
13300      boneless  0.000014    0.416667
12309      elevates  0.000005    0.416667
718           never  0.000942    0.400000
6815          level  0.000110    0.400000

This is how the autocorrect algorithm works here!!

We took our words from a book. In the same way, a smartphone already has some words in its vocabulary and records others as the user types on the keyboard.

Conclusion

You can use this feature for real-time execution. I hope you liked this article on how to create an autocorrect feature using NLP with Python.

The media described in this article is not owned by Analytics Vidhya and is used at the author’s discretion
