What is this article about?
A wildfire (a forest, bush, or vegetation fire) can be described as any uncontrolled and unplanned burning of plants in a natural setting such as a forest or grassland. In this article, we will predict the confidence of a wildfire's occurrence based on several features.
Why do we need a wildfire prediction model?
Well, the first question that arises is: why do we even need machine learning to predict wildfires in a particular region? Experienced forestry departments have been dealing with these issues for a long time, so why is ML needed? The answer is simple: an experienced forester can weigh perhaps 3-4 parameters in their head, but ML can handle many parameters at once, such as latitude, longitude, satellite, version, and so on. To capture these multiple relationships among the parameters responsible for forest fires, we certainly need ML!
- Import the necessary libraries
- Exploratory data analysis
- Data cleaning
- Model building (RandomForestRegressor)
- Model tuning (RandomizedSearchCV)
- The bz2 module (big bonus)
import datetime as dt
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestRegressor
Read the wildfire dataset (.csv)
forest = pd.read_csv('fire_archive.csv')
Let’s take a look at our dataset (2.7+MB)
Here we can see that we have 36,011 rows and 15 columns. Obviously we have a lot of data cleaning to do on this dataset, but first
Let’s explore this dataset further
Index(['latitude', 'longitude', 'brightness', 'scan', 'track', 'acq_date', 'acq_time', 'satellite', 'instrument', 'confidence', 'version', 'bright_t31', 'frp', 'daynight', 'type'], dtype='object')
Checking for null values in the forest fire prediction dataset
latitude      0
longitude     0
brightness    0
scan          0
track         0
acq_date      0
acq_time      0
satellite     0
instrument    0
confidence    0
version       0
bright_t31    0
frp           0
daynight      0
type          0
dtype: int64
Fortunately, we don’t have any null values in this data set
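As a minimal sketch (on a toy DataFrame, not the actual wildfire dataset), the null check used above works like this:

```python
import pandas as pd
import numpy as np

# Toy data just to illustrate isnull().sum(); column names are made up
toy = pd.DataFrame({
    'brightness': [310.5, np.nan, 325.0],
    'confidence': [75, 80, np.nan],
})

# Counts the missing values per column
print(toy.isnull().sum())
# brightness    1
# confidence    1
# dtype: int64
```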
plt.figure(figsize=(10, 10))
sns.heatmap(forest.corr(), annot=True, cmap='viridis', linewidths=.5)
forest = forest.drop(['track'], axis = 1)
Here we are dropping the track column
Note: the dataset does not actually tell us whether a forest fire occurred or not; we are trying to predict the confidence of a forest fire's occurrence. The two may seem the same, but there is a subtle difference between them; try to spot it 🙂
Exploring the categorical data
print("The scan column")
print(forest['scan'].value_counts())
print()
print("The acq_time column")
print(forest['acq_time'].value_counts())
print()
print("The satellite column")
print(forest['satellite'].value_counts())
print()
print("The instrument column")
print(forest['instrument'].value_counts())
print()
print("The version column")
print(forest['version'].value_counts())
print()
print("The daynight column")
print(forest['daynight'].value_counts())
print()
The scan column
1.0    8284
1.1    6000
1.2    3021
1.3    2412
1.4    1848
1.5    1610
1.6    1451
1.7    1281
1.8    1041
1.9     847
2.0     707
2.2     691
2.1     649
2.3     608
2.5     468
2.4     433
2.8     422
3.0     402
2.7     366
2.9     361
2.6     347
3.1     259
3.2     244
3.6     219
3.4     203
3.3     203
3.8     189
3.9     156
4.7     149
4.3     137
3.5     134
3.7     134
4.1     120
4.6     118
4.5     116
4.2     108
4.0     103
4.4     100
4.8      70
Name: scan, dtype: int64

The acq_time column
506     851
454     631
122     612
423     574
448     563
       ...
1558      1
635       1
1153      1
302       1
1519      1
Name: acq_time, Length: 662, dtype: int64

The satellite column
Aqua     20541
Terra    15470
Name: satellite, dtype: int64

The instrument column
MODIS    36011
Name: instrument, dtype: int64

The version column
6.3    36011
Name: version, dtype: int64

The daynight column
D    28203
N     7808
Name: daynight, dtype: int64
From the above output, we can see that some columns have just one value repeated across every row, which means they are of no value to us, so we will drop them entirely.
Thus the satellite and daynight columns are the only truly categorical columns.
That said, the scan column can also be restructured into a categorical type through binning, which we will do shortly.
forest = forest.drop(['instrument', 'version'], axis = 1)
daynight_map = {"D": 1, "N": 0}
satellite_map = {"Terra": 1, "Aqua": 0}
forest['daynight'] = forest['daynight'].map(daynight_map)
forest['satellite'] = forest['satellite'].map(satellite_map)
Consider another column: type
0 35666 2 335 3 10 Name: type, dtype: int64
One-hot encode the type column and concatenate it with the forest DataFrame
types = pd.get_dummies(forest['type'])
forest = pd.concat([forest, types], axis=1)
forest = forest.drop(['type'], axis = 1)
forest.head()
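A quick sketch on synthetic data of what get_dummies does to a column like type:

```python
import pandas as pd

# Synthetic 'type' values just to illustrate one-hot encoding
df = pd.DataFrame({'type': [0, 2, 3, 0]})
dummies = pd.get_dummies(df['type'])

# One column per distinct value, with a 1 where that value occurred
print(dummies.columns.tolist())  # [0, 2, 3]
```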
Rename columns for better understanding
forest = forest.rename(columns={0: 'type_0', 2: 'type_2', 3: 'type_3'})
- Now, as mentioned earlier, we are going to convert the scan column to a categorical type, which we will do with the binning method.
- The range of this column is from 1 to 4.8
bins = [0, 1, 2, 3, 4, 5]
labels = [1, 2, 3, 4, 5]
forest['scan_binned'] = pd.cut(forest['scan'], bins=bins, labels=labels)
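On a handful of synthetic scan values, pd.cut assigns each value the label of the bin it falls into (intervals are right-inclusive by default):

```python
import pandas as pd

# Made-up scan values spanning the column's range
scan = pd.Series([0.5, 1.0, 1.5, 2.3, 4.8])
binned = pd.cut(scan, bins=[0, 1, 2, 3, 4, 5], labels=[1, 2, 3, 4, 5])

# (0,1] -> 1, (1,2] -> 2, and so on
print(list(binned))  # [1, 1, 2, 3, 5]
```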
Convert the acq_date column from string to the datetime data type.
forest['acq_date'] = pd.to_datetime(forest['acq_date'])
Now we will drop the scan column and handle the datetime data; we can extract useful features from this type of data just as we did with the categorical data.
forest = forest.drop(['scan'], axis = 1)
Create a new year column with the help of the acq_date column
forest['year'] = forest['acq_date'].dt.year
forest.head()
As we added the year column, we will similarly add the month and day columns
forest['month'] = forest['acq_date'].dt.month
forest['day'] = forest['acq_date'].dt.day
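A small sketch with synthetic dates of the .dt accessor used above:

```python
import pandas as pd

# Made-up dates just to illustrate extracting year/month/day
dates = pd.to_datetime(pd.Series(['2019-08-01', '2019-12-25']))

print(dates.dt.year.tolist())   # [2019, 2019]
print(dates.dt.month.tolist())  # [8, 12]
print(dates.dt.day.tolist())    # [1, 25]
```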
Check the shape of the dataset again
Now, as we can see, two more columns have been added: the split date columns
Separation of the target variable
y = forest['confidence']
fin = forest.drop(['confidence', 'acq_date', 'acq_time', 'bright_t31', 'type_0'], axis = 1)
Check the correlation again
plt.figure(figsize=(10, 10))
sns.heatmap(fin.corr(), annot=True, cmap='viridis', linewidths=.5)
Let’s look at the dataset now that it has been cleaned and sorted
Split clean data into a training and test dataset
Xtrain, Xtest, ytrain, ytest = train_test_split(fin.iloc[:, :500], y, test_size=0.2)
Building the model
Use RandomForestRegressor to build the model
random_model = RandomForestRegressor(n_estimators=300, random_state = 42, n_jobs = -1)
#Fit
random_model.fit(Xtrain, ytrain)
y_pred = random_model.predict(Xtest)

#Checking the accuracy
random_model_accuracy = round(random_model.score(Xtrain, ytrain)*100, 2)
print(round(random_model_accuracy, 2), '%')
random_model_accuracy1 = round(random_model.score(Xtest, ytest)*100,2) print(round(random_model_accuracy1, 2), '%')
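One note worth making: for RandomForestRegressor, .score() returns the R² coefficient of determination, not classification accuracy, so "accuracy" is used loosely here. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic regression data (not the wildfire dataset)
rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = X[:, 0] * 10 + rng.rand(200)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X, y)

# For a regressor, .score() is exactly the R^2 value
assert abs(model.score(X, y) - r2_score(y, model.predict(X))) < 1e-9
print('score() matches r2_score')
```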
Save the model with the pickle module, using a serialized format
import pickle saved_model = pickle.dump(random_model, open('ForestModelOld.pickle','wb'))
Tuning the model
- The accuracy is not that great, so let's tune the model's hyperparameters
Get all parameters of the model with get_params()
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 300,
 'n_jobs': -1,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}
""" n_estimators = number of trees in the forest max_features = max number of features considered for splitting a node max_depth = max number of levels in each decision tree min_samples_split = min number of data points placed in a node before the node is split min_samples_leaf = min number of data points allowed in a leaf node bootstrap = method for sampling data points (with or without replacement) """
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in the random forest
n_estimators = [int(x) for x in np.linspace(start=300, stop=500, num=20)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in a tree
max_depth = [int(x) for x in np.linspace(15, 35, num=7)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 3, 5]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Create the random grid
random_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf,
}
print(random_grid)
{'n_estimators': [300, 310, 321, 331, 342, 352, 363, 373, 384, 394, 405, 415, 426, 436, 447, 457, 468, 478, 489, 500],
 'max_features': ['auto', 'sqrt'],
 'max_depth': [15, 18, 21, 25, 28, 31, 35, None],
 'min_samples_split': [2, 3, 5],
 'min_samples_leaf': [1, 2, 4]}
- Random search of parameters, using 3-fold cross-validation, searching across 50 different combinations and using all available cores
- n_iter controls how many different combinations are tried, and cv is how many folds to use for cross-validation
rf_random = RandomizedSearchCV(estimator=random_model, param_distributions=random_grid,
                               n_iter=50, cv=3, verbose=2, random_state=42)
# Fit the random search model
rf_random.fit(Xtrain, ytrain)
Like this snippet shows, RandomizedSearchCV runs many fits: one per sampled parameter combination for each cross-validation fold.
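To make the fit count concrete, here is a minimal sketch on synthetic data (the tiny grid and small n_iter are chosen purely for speed): RandomizedSearchCV evaluates n_iter sampled combinations, each with cv folds, then refits the best one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (not the wildfire dataset)
rng = np.random.RandomState(42)
X, y = rng.rand(60, 3), rng.rand(60)

# A deliberately tiny grid so the search runs in seconds
small_grid = {'n_estimators': [5, 10, 20], 'max_depth': [2, 3, None]}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=small_grid,
    n_iter=4,   # 4 sampled combinations
    cv=3,       # 3 folds each: 12 fits, plus one final refit
    random_state=42,
)
search.fit(X, y)

print(len(search.cv_results_['params']))  # 4 candidates were evaluated
```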
Get the best parameters out of it with best_params_
{'n_estimators': 394,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 25}
Create a new model with the best parameters
random_new = RandomForestRegressor(n_estimators = 394, min_samples_split = 2, min_samples_leaf = 1, max_features="sqrt", max_depth = 25, bootstrap = True)
#Fit
random_new.fit(Xtrain, ytrain)
y_pred1 = random_new.predict(Xtest)
#Checking the accuracy
random_model_accuracy1 = round(random_new.score(Xtrain, ytrain)*100, 2)
print(round(random_model_accuracy1, 2), '%')
random_model_accuracy2 = round(random_new.score(Xtest, ytest)*100,2) print(round(random_model_accuracy2, 2), '%')
Save the tuned model with the pickle module, using a serialized format
saved_model = pickle.dump(random_new, open('ForestModel.pickle','wb'))
Load the tuned model back from the pickle file
reg_from_pickle = pickle.load(open('ForestModel.pickle', 'rb'))
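As a self-contained sketch of the pickle round-trip (using a toy model on synthetic data), the restored model makes identical predictions to the original:

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy model on synthetic data (illustration only)
rng = np.random.RandomState(0)
X, y = rng.rand(50, 2), rng.rand(50)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Serialize to bytes and back without touching disk
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The restored model predicts exactly the same values
assert np.allclose(model.predict(X), restored.predict(X))
```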
Here comes the cherry on the cake (the bonus part of this article). Let's understand what this bz2file module is all about. Let's get started!
What is bz2file?
bz2file is a Python package for compressing and decompressing files, so it can help reduce the size of a serialized file, which is very useful in the long run when we work with large datasets. (Note: the code below actually uses Python's built-in bz2 module, which requires no installation.)
How is bz2file useful here?
As we know, our dataset is 2.7+ MB and our random forest model is 700+ MB, so we need to compress the model so that it does not become a storage overhead.
How to install bz2file?
- Jupyter notebook: !pip install bz2file
- Anaconda/CMD prompt: pip install bz2file
Hence I installed bz2file, which is used to compress data. This is a life-saving package for those who have low disk space but want to store or use large datasets. The pickle file was around 700 MB in size, which compresses down to 93 MB or less with bz2.
import bz2

compressionLevel = 9
source_file = 'ForestModel.pickle'  # this file can be in a different format, like .csv or others...
destination_file = 'ForestModel.bz2'

with open(source_file, 'rb') as data:
    tarbz2contents = bz2.compress(data.read(), compressionLevel)

with open(destination_file, 'wb') as fh:
    fh.write(tarbz2contents)
This code will reduce the size of the saved pickle model.
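For completeness, here is a minimal sketch (with a toy object standing in for the trained model) of the reverse step: decompressing and unpickling the model to use it again.

```python
import bz2
import pickle

# A toy object standing in for the trained model (illustration only)
payload = {'weights': [1, 2, 3]}
compressed = bz2.compress(pickle.dumps(payload), 9)

# Reading it back: decompress first, then unpickle
restored = pickle.loads(bz2.decompress(compressed))
assert restored == payload
print(len(compressed), 'compressed bytes')
```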
Well, that’s a wrap on my part!
Thank you for reading my article 🙂
I hope you liked this step-by-step tutorial on forest fire prediction using machine learning. The last thing I want to mention is that I am well aware that the model's accuracy is not great, but the point of the article is the end-to-end process, so you can try different ML algorithms in search of better accuracy.
Here is the repo link for this article.
Here you can access my other articles published on Analytics Vidhya as part of the Blogathon (link)
If you have any questions, you can contact me on LinkedIn; please refer to this link.
Hello everyone, I am currently working at TCS. Previously, I worked as a Data Science Associate Analyst at Zorba Consulting India. Besides working full-time, I have a great interest in the same field, i.e. data science, along with other subsets of artificial intelligence such as computer vision, machine learning, and deep learning; feel free to collaborate with me on any project in the above-mentioned fields (LinkedIn).
- Picture 1 – https://www.theleader.info/wp-content/uploads/2017/08/forest-fire.jpg