
Complete Guide to Dealing with Missing Values in Python


Hello, everyone! Today we are going to look at an interesting data preprocessing problem: how to deal with missing values (which is part of data cleaning). Before we get to the heart of the matter, let’s review some basic terms so we can see why missing values matter. The topics we will explore in this article are listed below.

Table of contents

  1. Introduction – data cleaning
  2. The importance of filling in missing values
  3. Problems due to missing values
  4. Missing data – types
  5. How do you overcome missing data in our dataset?

Introduction – data cleaning

Data science is not only about machine learning methods, deep learning engineering, or other complex techniques. A typical workflow consists of data collection, data preprocessing, modeling (machine learning, computer vision, deep learning, or any other complex approach), evaluation, and finally deploying the model (and I am surely forgetting something). Modeling techniques are the hot topic, but data preprocessing is where most of the work happens. Ask a data scientist about their workflow and they will describe a 60:40 split: roughly 60% of the effort goes into preprocessing the data and the rest into the techniques above.

In this post, we will look at data cleaning, which is one component of the data preprocessing stage. Data cleaning is the practice of correcting or removing inaccurate, corrupted, poorly formatted, duplicate, or incomplete data from a dataset.

The importance of filling in missing values

The concept of missing values is important to understand in order to manage data effectively. If a researcher, programmer, or academic does not handle missing values properly, they may reach wrong conclusions about the data, which has a significant impact on the modeling stage. Missing data is a serious problem in data analysis because it affects the results: it is hard to fully trust insights when you know that many records are incomplete, and it can reduce the statistical power of a study and lead to erroneous results through biased estimates.

Problems due to missing values

  1. Statistical power, the probability that a test will reject the null hypothesis when it is false, is reduced when data are missing.
  2. Missing data can bias the parameter estimates.
  3. It can make the sample less representative of the population.
  4. It can make the analysis of the study more difficult.

Missing Data – Types

Missing data can be categorized into the following types, depending on the pattern of absence in the dataset.

  1. Missing Completely at Random (MCAR)

    The probability that a value is missing is unrelated both to the value itself and to the set of observed responses.

  2. Missing at Random (MAR)

    The probability that a value is missing depends on the set of observed responses, but not on the missing values themselves.

  3. Missing Not at Random (MNAR)
    Missing data that falls outside the two categories above is MNAR, for example when the probability of a value being missing depends on the missing value itself. MNAR cases are the hardest to deal with; modeling the missing-data mechanism is the only way to get a fair approximation of the parameters. A small toy sketch contrasting the first two mechanisms is shown below.
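To make the first two mechanisms concrete, here is a small illustrative sketch (the column names, sizes, and probabilities are made up for this example, not taken from the article's dataset) that simulates MCAR and MAR missingness in a toy DataFrame:

#toy sketch: simulating MCAR vs MAR missingness (illustrative values only)
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({"age": rng.integers(20, 60, size=100),
                    "salary": rng.normal(50000, 10000, size=100)})

#MCAR: every salary has the same 10% chance of being missing,
#regardless of age or of the salary value itself
mcar = toy.copy()
mcar.loc[rng.random(100) < 0.10, "salary"] = np.nan

#MAR: the chance of a missing salary depends on an observed column (age),
#but not on the salary value itself
mar = toy.copy()
mar.loc[(mar["age"] > 45) & (rng.random(100) < 0.30), "salary"] = np.nan

print(mcar["salary"].isnull().sum(), mar["salary"].isnull().sum())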

Missing value classes

Columns with missing values fall into the following categories:

  1. Continuous variable or feature – numeric data; the values can be any number.
  2. Categorical variable or feature – can be of numeric or object (string) type, with a limited set of values. For example, a customer rating: Poor, Satisfactory, Good, Better, Best; or gender: male or female.

Every column with missing values in a dataset falls into one of these two categories, and the appropriate treatment differs between them; a short sketch for separating the two kinds of columns is shown below.
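As a quick, hedged sketch (assuming the SalaryGender data has already been loaded into a pandas DataFrame called dataset, as is done later in this article), the two kinds of columns can be listed like this:

#sketch: listing numeric and categorical columns that contain missing values
numeric_cols = dataset.select_dtypes(include="number").columns
categorical_cols = dataset.select_dtypes(exclude="number").columns

print(dataset[numeric_cols].isnull().sum())      #missing counts in continuous features
print(dataset[categorical_cols].isnull().sum())  #missing counts in categorical features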

Kinds of imputation

Imputation comes in a range of forms. It is one way to solve a dataset's missing-data problems before modeling, so the application can reach better accuracy.

  1. Univariate imputation (for example, mean imputation): missing values are computed using only the column in which they occur.
  2. Multivariate imputation: imputed values depend on other features, for example estimating missing values from the other variables with linear regression (a short sketch contrasting these first two kinds follows this list).
  3. Single imputation: each missing value is imputed exactly once, producing a single completed dataset.
  4. Multiple imputation: the same missing values are imputed several times, essentially repeating the imputation to obtain multiple completed datasets.
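As a minimal sketch of the difference between the first two kinds (using scikit-learn's SimpleImputer and IterativeImputer on a tiny made-up array rather than the article's dataset):

#sketch: univariate vs multivariate imputation on a tiny made-up array
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan], [7.0, 8.0]])

#univariate: each column is filled using only that column's own mean
X_uni = SimpleImputer(strategy="mean").fit_transform(X)

#multivariate: each column with gaps is modelled from the other columns
X_multi = IterativeImputer(random_state=0).fit_transform(X)

print(X_uni)
print(X_multi)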

How do you overcome missing data in our dataset?

There are many ways to overcome missing data, and we will look at them one by one. Before that, we start from scratch with the basics, such as importing libraries.

Dataset: https://github.com/JangirSumit/data_science/blob/master/18th%20May%20Assignments/case%20study%201/SalaryGender.csv, with the PhD column modified to be categorical.

At the start of the script, we need to import the libraries and load the dataset:

#importing the libraries

import pandas as pd

import numpy as np

dataset = pd.read_csv("SalaryGender.csv")


Then we preview the first few rows of the dataset:
dataset.head()

Check the dimensions of the dataset

dataset.shape


Check for missing values

print(dataset.isnull().sum())
  1. Just leave it as is! (Do not disturb)

Do nothing about the missing data and hand control over to the algorithm, letting it decide how to respond. Different algorithms react differently to missing data: some, for example, determine the best imputation values for missing data as part of minimizing the training loss. Take XGBoost, for example; a minimal sketch of this behaviour appears after the code below. Others, such as linear regression, will simply raise an error. In that case you have to deal with the missing data either during the preprocessing stage or when the model fails and you work out what went wrong. This section is essentially a trial-and-error technique: depending on the reaction, we decide how to move forward.

#original dataset with missing values
dataset["Age"][:10]
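Here is a minimal sketch (assuming the optional xgboost package is installed; the feature values and targets are made up) of how a model such as XGBoost can be trained directly on data containing NaNs, without a separate imputation step:

#sketch: some tree-based models accept NaN directly and learn a default split direction
import numpy as np
import xgboost as xgb

X = np.array([[25.0, 1.0], [np.nan, 0.0], [40.0, 1.0], [55.0, np.nan]])
y = np.array([30000.0, 25000.0, 60000.0, 80000.0])

model = xgb.XGBRegressor(n_estimators=10)
model.fit(X, y)            #no imputation needed before fitting
print(model.predict(X))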

      2. Drop it if it is not in use (mostly Rows)

Excluding observations with missing data is the next simplest approach. However, you run the risk of losing some critical data points as a result. You can do this with the pandas dropna() function, which removes rows (or, with axis=1, columns) that contain missing values. Rather than eliminating every row with a missing value, use your domain knowledge, or seek the help of a domain expert, to selectively remove only the rows or columns whose missing values are not relevant to the machine learning problem; a column-wise variant with a missing-value threshold is sketched after the code below.

Pros: after removing the missing data, the model is trained only on complete records and becomes more robust.

Cons: loss of data, which may itself be important. If a large share of the data is missing, the remaining sample may be too small for effective modeling.

#deleting rows with missing values
dataset.dropna(inplace=True)
print(dataset.isnull().sum())
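As an illustrative alternative (done on a copy and with an arbitrary 50% threshold, so the rest of the article can still work with the original dataset), columns rather than rows can be dropped when most of their values are missing:

#sketch: drop only columns where more than half of the values are missing
threshold = int(len(dataset) * 0.5)
reduced = dataset.copy().dropna(axis=1, thresh=threshold)
print(reduced.isnull().sum())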

     3. Imputation by Mean:

Using this approach, you may compute the mean of a column’s non-missing values, and then replace the missing values in each column separately and independently of the others. The most significant disadvantage is that it can only be used with numerical data. It’s a simple and fast method that works well with small numerical datasets. However, there are certain limitations, such as the fact that feature correlations are ignored. It only works for a single column at a time. Furthermore, if the outlier treatment is skipped, a skewed mean value will almost certainly be substituted, lowering the model’s overall quality.

Cons: works only with numerical datasets and ignores the covariance between the independent variables.

#Mean imputation for missing values
dataset["Age"] = dataset["Age"].replace(np.nan, dataset["Age"].mean())
print(dataset["Age"][:10])

4. Imputation by Median:

Another imputation method, which addresses the outlier problem of the previous approach, is to use the median. Because the median is taken after sorting, it is not affected by outliers, and the middle value of the column is used as the replacement.

Cons: works only with numerical datasets and ignores the covariance between the independent variables.

#Median imputation for missing values
dataset["Age"] = dataset["Age"].replace(np.nan, dataset["Age"].median())
print(dataset["Age"][:10])

5. Imputation by the most frequent value (mode):

This method can be applied to categorical variables with a limited set of values: the most frequent value is used as the replacement. It works for nominal class values such as true/false, or conditions such as normal/abnormal, and especially for ordinal categorical variables such as educational attainment (pre-primary, primary, secondary, high school, graduate, and so on). Unfortunately, since this method ignores relationships between features, there is a risk of introducing bias; if the category values are unbalanced, you are more likely to bias the data (the class imbalance problem).

Pros: Works with all data formats.

Cons: does not account for the covariance between independent features.

#Mode imputation for missing values
import statistics
dataset["Age"] = dataset["Age"].replace(np.nan, statistics.mode(dataset["Age"]))
print(dataset["Age"][:10])

     6. Imputation for Categorical values:

When categorical columns have missing values, the most prevalent category may be utilized to fill in the gaps. If there are many missing values, a new category can be created to replace them.

Pros: good for small datasets. Compensates for the loss by inserting a new category.

Cons: cannot be used for anything other than categorical data, and the additional encoded category may cause a drop in accuracy.

#missing values - categorical

dataset.isnull().sum()



#missing values - categorical - solution

dataset["PhD"] = dataset["PhD"].fillna('U')


#checking for missing values in the categorical column (PhD)

dataset.isnull().sum()

     7. Last observation carried forward (LOCF)

It is a common statistical approach for the analysis of longitudinal repeated-measures data when some follow-up observations are missing: each missing value is replaced by the last observed value before it, which is what a forward fill does.

#LOCF - last observation carried forward

dataset["Age"] = dataset["Age"].ffill()

dataset.isnull().sum()



8. Linear interpolation

It is a method of approximating a missing value by joining the surrounding points in order with a straight line. In short, it computes the unknown value from the values that come before (and after) it in sequence. Since linear interpolation is the default method, we would not strictly have to specify it here. It is most often used with time series datasets.

#interpolation - linear

dataset["Age"] = dataset["Age"].interpolate(method='linear', limit_direction='forward', axis=0)

dataset.isnull().sum()



9. Imputation by K-NN:

k-nearest neighbours (k-NN) is a basic classification approach in which class membership is the output: an element is classified according to how similar it is to points in the training set, and it is assigned to the class that is most common among its k nearest neighbours. If k = 1, the element is simply assigned to the class of its single nearest neighbour. For imputation, finding the k nearest neighbours of an observation with missing data, and then filling in the gaps based on the non-missing values in that neighbourhood, can give good estimates of the missing values.

#for knn imputation we need to normalize the data, and categorical columns need to be converted to numeric

cat_variables = dataset[['PhD']]

cat_dummies = pd.get_dummies(cat_variables, drop_first=True)

cat_dummies.head()




dataset = dataset.drop(['PhD'], axis=1)

dataset = pd.concat([dataset, cat_dummies], axis=1)

dataset.head()




#removing unwanted features

dataset = dataset.drop(['Gender'], axis=1)

dataset.head()




#scaling mandatory before knn

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

dataset = pd.DataFrame(scaler.fit_transform(dataset), columns = dataset.columns)

dataset.head()




#knn imputer

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3)

dataset = pd.DataFrame(imputer.fit_transform(dataset),columns = dataset.columns)




#checking for missing

dataset.isnull().sum()

10. Multivariate Imputation by Chained Equations (MICE):

MICE is a method for replacing missing values in a dataset through multiple imputation. It starts by making several copies of the dataset with missing values in one or more of the variables, then fills the missing values in each variable by modeling it from the other variables, cycling through these chained equations repeatedly.

#MICE
import numpy as np 
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
df = df.drop(['PassengerId','Name'],axis=1)
df = df[["Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Age"]]
df["Sex"] = [1 if x=="male" else 0 for x in df["Sex"]]
df.isnull().sum()
imputer=IterativeImputer(imputation_order="ascending",max_iter=10,random_state=42,n_nearest_features=5)
imputer
imputed_dataset = imputer.fit_transform(df)

Notes:

For our dataset, we may use any of the above ideas to handle missing values. There is no single preferred approach: the right treatment varies depending on the kind of missing values in each feature and on the application we are building. Therefore, we will have to rely on trial and error to determine the best option for our application.

Did you find this article helpful? Please leave your thoughts/opinions in the comments section below. Learning from your mistakes is my favorite quote; if you find something incorrect, just point it out. I am excited to learn from readers like you.

Briefly, I’m Premanand, a Junior Assistant Professor and Machine Learning Researcher. I love teaching and I love learning new things in data science. Send me any doubts or errors at [email protected], and find me on LinkedIn: https://www.linkedin.com/in/premsanand/

The media described in this article is not owned by Analytics Vidhya and is used at the author’s discretion.


