Let's look at the data using the show() function, which by default displays the first 20 rows of the DataFrame.
Next comes the head() function, which is similar to the head function in pandas. As the output below shows, head() called without arguments returns a single Row object, which holds one complete record (tuple).
Row(Email="[email protected]", Address="835 Frank TunnelWrightmouth, MI 82180-9605", Avatar="Violet", Avg Session Length=34.49726772511229, Time on App=12.65565114916675, Time on Website=39.57766801952616, Length of Membership=4.0826206329529615, Yearly Amount Spent=587.9510539684005)
For a more readable view of the record, we can iterate over the Row returned by head() with a for loop, which prints each field value in turn.
for item in data.head():
    print(item)
[email protected]
835 Frank TunnelWrightmouth, MI 82180-9605
Violet
34.49726772511229
12.65565114916675
39.57766801952616
4.0826206329529615
587.9510539684005
Importing Linear Regression Library
As mentioned earlier, we will predict each customer's yearly expenditure on products. The target is a continuous value, which makes this a regression problem, and a linear regression model is a natural choice for it.
For that reason, we will be importing the Linear Regression package from the ML library of PySpark.
from pyspark.ml.regression import LinearRegression
Data Preprocessing for Machine Learning
In this section, we perform the preprocessing steps needed to get the dataset ready for the ML pipeline, so that the model can be trained on it.
We import Vectors and VectorAssembler so that we can easily separate the feature columns from the label column: all the independent columns will be stacked together into a single features column, and the dependent column (the value we want to predict) will serve as the label column.
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
Let’s have a look at which columns are present in our dataset.
Inference: The output above lists all the column names as a Python list, but that alone does not tell us which columns to select, so we will use the describe method next.
DataFrame[summary: string, Email: string, Address: string, Avatar: string, Avg Session Length: string, Time on App: string, Time on Website: string, Length of Membership: string, Yearly Amount Spent: string]
Inference: Note that describe() reports every summary column as a string, so this output does not show the original data types. In the dataset itself, Email, Address, and Avatar are string columns, and they play no role in the model development phase, because machine learning relies on mathematical calculations that need numeric (integer or double) columns.
Based on the above discussion, the columns selected as features for the machine learning pipeline are as follows:
- Avg Session Length
- Time on App
- Time on Website
- Length of Membership
assembler = VectorAssembler(
    inputCols=["Avg Session Length", "Time on App",
               "Time on Website", "Length of Membership"],
    outputCol="features")
Inference: In the above code, we use VectorAssembler to stack all our feature columns together into a single vector column, which is named "features" through the outputCol parameter.
output = assembler.transform(data)
Here, the transform() function applies the assembler to the data: it returns a new DataFrame that contains all the original columns plus the assembled features column. The original DataFrame itself is not modified.
Now, with the select function, we pick only the features column from the transformed DataFrame and display it using the show() function.