As we all know, feature selection is one of the most important steps in data preprocessing: we select the features that, based on our domain knowledge, are the best fit for the model development phase. Here, all the valid numerical columns will be taken into account.
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites'],
    outputCol='features'
)
Inference: While working with MLlib we need to know the format of data the library accepts, hence the VectorAssembler module, which clubs all the selected features together into a single column that is treated as the feature column (a vector of all the features). The same can be seen in the parameter section of the assembler object.
output = assembler.transform(data)
Inference: Transforming the data is necessary because it works like a commit statement, i.e., all the processed changes should be reflected in the actual dataset; hence we use the transform method.
final_data = output.select('features', 'churn')
final_data.show()
Inference: Looking at the above output makes our aim clear: the first column, features, holds all the selected columns as a single vector, followed by the label column, i.e., churn.
Test Train Split
If you have been following from the very beginning of the article, you might have a question: if we already have separate testing data, why are we splitting this dataset?
The answer is that this split serves as validation of the model, and we will not have to perform this routine again when dealing with the new data, as it is already in a separate CSV file.
train_churn,test_churn = final_data.randomSplit([0.7,0.3])
Inference: With the help of tuple unpacking, we stored 70% of the data in train_churn and 30% in test_churn using PySpark's randomSplit() method.
Reaching this phase of the article is proof that we have already cleaned our data completely and that it is ready to be fed to the classification model (more specifically, Logistic Regression).
Note that we will have to repeat this model building when dealing with new customers' data.
from pyspark.ml.classification import LogisticRegression

lr_churn = LogisticRegression(labelCol='churn')
fitted_churn_model = lr_churn.fit(train_churn)
training_sum = fitted_churn_model.summary
Code breakdown: A complete explanation of the steps required in the model-building phase using MLlib:
- Importing the LogisticRegression module from PySpark's pyspark.ml.classification package.
- Creating a LogisticRegression object and passing the label column (churn).
- Fitting the model, i.e., starting the training of the model on the training dataset.
- Getting the summary of the training via the summary attribute of the trained model.
Inference: The summary object of the MLlib model returns many insights about the trained logistic regression model. From the statistical information available, we can conclude that the model performed well, as the mean and standard deviation of churn (the actual values) and prediction (the predicted values) are very close.
In this stage of customer churn prediction, we analyze the model that was trained on 70% of the dataset; by evaluating it, we can decide whether to go with the model or whether some tweaks are required.
from pyspark.ml.evaluation import BinaryClassificationEvaluator

pred_and_labels = fitted_churn_model.evaluate(test_churn)
Inference: Notice that in the first step we imported the BinaryClassificationEvaluator, which is logical because our label column has binary values only.
Then the evaluate() method takes the testing data (30% of the total dataset) as a parameter and returns multiple fields from which we can evaluate the model manually.
Inference: In the above output one can see four columns returned by the evaluation method:
- Features: All the feature values clubbed together by VectorAssembler during the feature selection phase.
- Customer Churn: The actual values, i.e., the actual label column.
- Probability: This column holds the probability of each prediction made by the model.
- Predictions: The values (here 0 or 1) predicted by the model on the testing data.
Predicting the New Data
Finally we come to the last stage of the article: we have already built and evaluated our model, and now predictions will be made on completely new data, i.e., the new customers' dataset, to see how well the model performs.
Note that in this stage the steps will be the same but the dataset will be different according to the situation.
final_lr_model = lr_churn.fit(final_data)
Inference: Nothing extra to discuss here, as we have already gone through this step. The main thing to notice is that we are now training on the complete dataset (final_data), since the testing data is already in a separate CSV file and no splitting of the dataset is required.
new_customers = spark.read.csv('new_customers.csv', inferSchema=True, header=True)
new_customers.printSchema()
Inference: As the testing data is in a different file, it becomes necessary to read it the same way we read the customer_churn dataset earlier.
Then we inspected the schema of this new dataset and concluded that it has exactly the same structure.
test_new_customers = assembler.transform(new_customers)
Inference: The assembler object was already created while the main features were selected, so the same assembler object is used to transform this new testing data.
final_results = final_lr_model.transform(test_new_customers)
Inference: Just as we transformed the features using the assembler object, we now call transform() with the final model on top of the new customers' data, which appends the prediction column.