As an ML engineer, it's common to spend hours building a great model with the desired metrics after multiple iterations of hyperparameter tuning, only to find you can't reproduce the same results with the same model because you failed to record one small hyperparameter.
What saves you from such situations is keeping track of the experiments you run while solving an ML problem.
- If you have ever worked on an ML project, you know that the hardest part is getting the model to perform well – which makes it necessary to run many experiments, varying the different parameters and tracking each of them.
- You don't want to waste time hunting for that good model you built in the past – reproducing past experiments becomes hassle-free.
- Just a small change in alpha and the model's accuracy touches the ceiling – capturing the small changes we make to our model, along with the associated metrics, saves a lot of time.
- All your experiments under one roof – experiment tracking helps compare all your different runs by putting all the information in one place.
Should we just keep track of machine learning model parameters?
No. When running any machine learning experiment, you should ideally track a number of things so that the experiment can be reproduced and the optimized model recovered:
- Code: the code used to perform the experiments.
- Data: copies of the data used in training and evaluation.
- Environment: environment configuration files such as "Dockerfile", "requirements.txt", etc.
- Parameters: the different hyperparameters used for the model.
- Metrics: training and validation metrics for all experiment runs.
Why not use an Excel sheet?
Spreadsheets are something we all love because they are so easy to use! However, recording all information about experiments in a spreadsheet is only possible when we perform a limited number of iterations.
Whether you are a beginner or an expert in data science, you know how messy building an ML model can get, with many things happening simultaneously: multiple versions of the data, hyperparameters of different models, many versions of notebooks, etc. This makes manual recording impractical.
Fortunately, there are many tools available to help you. Neptune is one such tool that can help us keep track of all our ML experiments within a project.
Let's see it in action!
Install Neptune in Python
To install Neptune, we can run the following command:
pip install neptune-client
To import the Neptune client, we can use the following line:
import neptune.new as neptune
Do you need credentials?
We need to pass our credentials to the neptune.init() method to enable logging of metadata to Neptune.
run = neptune.init(project="", api_token='')
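As a hedged sketch of this step (the project name below is hypothetical), a common pattern is to keep the API token out of the code entirely and read it from an environment variable:

```python
import os

def start_run():
    """Initialize a Neptune run. A sketch, assuming the project
    'my-workspace/my-project' exists and NEPTUNE_API_TOKEN is set."""
    import neptune.new as neptune  # lazy import keeps the sketch self-contained
    return neptune.init(
        project="my-workspace/my-project",  # hypothetical workspace/project
        api_token=os.getenv("NEPTUNE_API_TOKEN"),
    )
```

Reading the token from the environment avoids accidentally committing credentials to version control.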
We can create a new project by logging in to https://app.neptune.ai/ and then fetching the project name and API token.
Recording parameters in Neptune
We use the iris dataset here and fit a random forest classifier to it, then log the model's parameters and metrics with Neptune.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from joblib import dump

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.4, random_state=1234
)

params = {
    'n_estimators': 10,
    'max_depth': 3,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'max_features': 3,
}

clf = RandomForestClassifier(**params)
clf.fit(X_train, y_train)

y_train_pred = clf.predict_proba(X_train)
y_test_pred = clf.predict_proba(X_test)

train_f1 = f1_score(y_train, y_train_pred.argmax(axis=1), average="macro")
test_f1 = f1_score(y_test, y_test_pred.argmax(axis=1), average="macro")
To log the parameters of the model above, we can use the run object we created earlier:
run['parameters'] = params
Neptune also allows tracking the code and environment when creating the run object:
run = neptune.init(project="stateasy005/iris", api_token='', source_files=['*.py', 'requirements.txt'])
Can I record metrics as well?
The training and evaluation metrics can be logged using the run object we created:
run['train/f1'] = train_f1
run['test/f1'] = test_f1
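Besides single values, a run field in neptune.new can also accumulate a series via .log(), which suits per-epoch metrics. Since a live run object needs credentials, the sketch below mimics the call pattern with a stand-in class; with a real run you would call run["train/loss"].log(loss) directly (the field name is illustrative):

```python
class SeriesStandIn:
    """Mimics the .log() call pattern of a Neptune series field."""
    def __init__(self):
        self.values = []

    def log(self, value):
        # A real Neptune series field appends the value to a series
        # that is plotted on the Neptune UI.
        self.values.append(value)

# Hypothetical per-epoch training losses; with a live run, use run["train/loss"].
series = SeriesStandIn()
for loss in [0.9, 0.6, 0.4, 0.3]:
    series.log(loss)
```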
Shortcut to record everything at once?
We can create a summary of our classifier model, which by itself captures the different parameters of the model, diagnostic charts, and a test folder containing the actual predictions, prediction probabilities, and per-class scores such as precision, recall, support, etc.
This summary can be obtained using the following code:
import neptune.new.integrations.sklearn as npt_utils
run["cls_summary"] = npt_utils.create_classifier_summary(clf, X_train, X_test, y_train, y_test)
This creates the following folders on the Neptune user interface:
What is inside the folders?
The Diagnostic Charts Folder – Useful because one can evaluate an experiment against multiple metrics with just the single line of code in the classifier summary.
The “all_params” Folder – It includes the various hyperparameters of the model. These hyperparameters help one compare the model's performance across a set of values and guide further tuning. Tracking hyperparameters also makes it possible to get back to the exact same model (with the same hyperparameter values) whenever needed.
The trained model is also saved as a “.pkl” file, which can be fetched later for reuse. The “Test” Folder – It contains the predictions, prediction probabilities, and scores on the test data set.
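As a hedged illustration of reusing such a saved model, here is a purely local round-trip with joblib (the exact field path of the pickled model inside the Neptune summary may vary, so the download step is not shown):

```python
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Fit a small model, serialize it to a .pkl file, and restore it later.
data = load_iris()
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(data.data, data.target)

dump(clf, "model.pkl")        # save the fitted model
restored = load("model.pkl")  # load it back for predictions
```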
What about regression and clustering with Neptune?
We can get a similar summary if we have a regression model using the following lines:
import neptune.new.integrations.sklearn as npt_utils
run['rfr_summary'] = npt_utils.create_regressor_summary(rfr, X_train, X_test, y_train, y_test)
Similarly, for clustering, we can create a summary with the following lines of code:
import neptune.new.integrations.sklearn as npt_utils
run['kmeans_summary'] = npt_utils.create_kmeans_summary(km, X, n_clusters=5)
Here, km is the fitted k-means model.
How do I upload my data to Neptune?
We can also log CSV files to a run and preview them on the Neptune user interface.
Logging artifacts to Neptune
Any chart drawn using libraries like matplotlib, plotly, etc. can also be logged to Neptune.
import matplotlib.pyplot as plt

plt.plot(data)
run["dataset/distribution"].log(plt.gcf())
To download the logged files later programmatically, we can use the download method of the run object.
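For instance, a sketch assuming a re-initialized run, with illustrative field and destination names:

```python
def fetch_logged_file(run, field="dataset/distribution", destination="artifacts"):
    """Download a previously logged file from a Neptune run.

    A sketch: file fields in neptune.new expose a download() method;
    `field` and `destination` here are illustrative, not fixed names.
    """
    run[field].download(destination=destination)
```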
In this article, I have tried to cover why it is important to track experiments and how Neptune can help facilitate this, which in turn increases productivity when running various ML experiments for your projects. This article focused on ML experiment tracking, but with Neptune we can also version code, notebooks, data, and environments.
There are of course many similar libraries available for tracking runs, which I will try to cover in future articles.
About the author
Nibedita holds an MSc in Chemical Engineering from IIT Kharagpur and is currently working as a Senior Consultant at AbsolutData Analytics. In her current capacity, she builds AI/Machine Learning-based solutions for clients from a range of industries.
Picture 1: https://tinyurl.com/em429czk