Forest fire prediction using machine learning

What is this article about?

A forest, bush, or vegetation fire can be described as any uncontrolled and non-prescribed burning of plants in a natural setting such as a forest or grassland. In this article we are not determining whether a forest fire will occur or not; we are predicting the confidence of a forest fire based on some attributes.

picture 1

Why do we need a wildfire prediction model?

Well, the first question that arises is why we even need machine learning to predict forest fires in a particular region. The question is valid: there is an experienced forestry department that has been dealing with these issues for a long time, so why is ML needed? The answer is simple. An experienced forestry department can weigh 3-4 parameters in the human mind, but ML can handle many parameters at once, whether latitude, longitude, satellite, version, and so on. To deal with the relationships between the many parameters responsible for a forest fire, we do need ML!

Table of contents

  1. Importing the necessary libraries
  2. Exploratory data analysis
  3. Data cleaning
  4. Model development (RandomForestRegressor)
  5. Model tuning (RandomizedSearchCV)
  6. bz2file module (big bonus)

import libraries

import datetime as dt

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestRegressor

Reading the forest fire dataset (.csv)

forest = pd.read_csv('fire_archive.csv')

Let’s take a look at our dataset (2.7+ MB)

forest.head()

First rows of the forest fire dataset

Data exploration

forest.shape

(36011, 15)

Here we can see that we have 36011 rows and 15 columns. Our dataset clearly needs a lot of cleaning, but first let’s explore it further.

forest.columns

Index(['latitude', 'longitude', 'brightness', 'scan', 'track', 'acq_date',
       'acq_time', 'satellite', 'instrument', 'confidence', 'version',
       'bright_t31', 'frp', 'daynight', 'type'],
      dtype='object')

Checking for null values in the forest fire dataset

forest.isnull().sum()


latitude      0
longitude     0
brightness    0
scan          0
track         0
acq_date      0
acq_time      0
satellite     0
instrument    0
confidence    0
version       0
bright_t31    0
frp           0
daynight      0
type          0
dtype: int64

Fortunately, we don’t have any null values in this dataset.

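forest.describe()

Description of the data set

plt.figure(figsize=(10, 10))
# correlation heatmap (plotting call reconstructed; numeric columns only)
sns.heatmap(forest.corr(numeric_only=True), annot=True)

Correlation heatmap of the data set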

Data cleaning

forest = forest.drop(['track'], axis = 1)

Here we are dropping the track column

Note: By the way, from this dataset we are not finding out whether a forest fire occurred or not; we are trying to find the confidence of a forest fire occurring. They may seem the same, but there is a slight difference between them. Try to find it 🙂

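Finding categorical data

print("The scan column")
print(forest['scan'].value_counts())
print("The aqc_time column")
print(forest['acq_time'].value_counts())
print("The satellite column")
print(forest['satellite'].value_counts())
print("The instrument column")
print(forest['instrument'].value_counts())
print("The version column")
print(forest['version'].value_counts())
print("The daynight column")
print(forest['daynight'].value_counts())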


The scan column
1.0    8284
1.1    6000
1.2    3021
1.3    2412
1.4    1848
1.5    1610
1.6    1451
1.7    1281
1.8    1041
1.9     847
2.0     707
2.2     691
2.1     649
2.3     608
2.5     468
2.4     433
2.8     422
3.0     402
2.7     366
2.9     361
2.6     347
3.1     259
3.2     244
3.6     219
3.4     203
3.3     203
3.8     189
3.9     156
4.7     149
4.3     137
3.5     134
3.7     134
4.1     120
4.6     118
4.5     116
4.2     108
4.0     103
4.4     100
4.8      70
Name: scan, dtype: int64

The aqc_time column
506     851
454     631
122     612
423     574
448     563
1558      1
635       1
1153      1
302       1
1519      1
Name: acq_time, Length: 662, dtype: int64

The satellite column
Aqua     20541
Terra    15470
Name: satellite, dtype: int64

The instrument column
MODIS    36011
Name: instrument, dtype: int64

The version column
6.3    36011
Name: version, dtype: int64

The daynight column
D    28203
N     7808
Name: daynight, dtype: int64

From the above data, we can see that some columns (instrument and version) have just one value repeated throughout, which means they are of no value to us, so we will drop them altogether.
Thus the satellite and daynight columns are the only categorical ones.

Having said that, we can also restructure the scan column into a categorical column, which we will do shortly.

forest = forest.drop(['instrument', 'version'], axis = 1)
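
Instead of reading through every value_counts() output by eye, the single-valued columns can also be found programmatically. The snippet below is only a minimal sketch on a made-up toy frame (the `toy` data and the `constant_cols` name are assumptions for illustration, not part of the original notebook):

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "satellite": ["Aqua", "Terra", "Aqua"],
    "instrument": ["MODIS", "MODIS", "MODIS"],
    "version": [6.3, 6.3, 6.3],
})

# A column with a single distinct value carries no information for the model.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Drop every constant column in one go.
toy = toy.drop(columns=constant_cols)
print(constant_cols)
```

On the real data the same list comprehension over forest.columns should pick out instrument and version, matching what the counts above show.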


Looking at the head of the forest fire dataset

daynight_map = {"D": 1, "N": 0}
satellite_map = {"Terra": 1, "Aqua": 0}

forest['daynight'] = forest['daynight'].map(daynight_map)
forest['satellite'] = forest['satellite'].map(satellite_map)
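
One thing worth knowing about map(): any value that is missing from the dictionary silently becomes NaN. A small sanity check, sketched here on a made-up series (the "X" value is a deliberately invalid entry added for illustration), might look like this:

```python
import pandas as pd

# "X" is a made-up bad value that is not present in the mapping.
toy = pd.DataFrame({"daynight": ["D", "N", "D", "X"]})
daynight_map = {"D": 1, "N": 0}

# Values without a dictionary entry are turned into NaN by map().
encoded = toy["daynight"].map(daynight_map)

# Count how many rows could not be mapped.
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```

For our dataset the counts above show only D/N and Aqua/Terra, so no NaN values are expected after the mapping.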


Display the raw rows of a data set

Looking at another column: type

forest['type'].value_counts()


0    35666
2      335
3       10
Name: type, dtype: int64

Concatenating the forest and types data frames

types = pd.get_dummies(forest['type'])
forest = pd.concat([forest, types], axis=1)
forest = forest.drop(['type'], axis = 1)


Viewing the data after concatenating the forest and types data frames

Renaming the columns for better understanding

forest = forest.rename(columns={0: 'type_0', 2: 'type_2', 3: 'type_3'})
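
As a side note, the dummy columns can be given readable names directly through the prefix argument of get_dummies, which makes the separate rename step unnecessary. This is just an alternative sketch on toy data, not what the notebook above does:

```python
import pandas as pd

# Toy "type" column with the same three categories as the real data.
toy = pd.DataFrame({"type": [0, 2, 3, 0]})

# prefix="type" names the new columns type_0, type_2, type_3 right away.
dummies = pd.get_dummies(toy["type"], prefix="type")

# Replace the original column with its one-hot encoded version.
toy = pd.concat([toy.drop(columns="type"), dummies], axis=1)
print(list(toy.columns))
```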

Binning method

  • As mentioned earlier, we are going to convert the scan column to a categorical type, and we will do that using the binning method.
  • The values in this column range from 1 to 4.8.
bins = [0, 1, 2, 3, 4, 5]
labels = [1,2,3,4,5]
forest['scan_binned'] = pd.cut(forest['scan'], bins=bins, labels=labels)
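
To see how pd.cut assigns the labels, here is a tiny sketch with a few hand-picked scan values (the sample values are chosen for illustration). By default the intervals are closed on the right, so 1.0 falls into (0, 1] and 2.0 into (1, 2]:

```python
import pandas as pd

# A few sample scan values, including the edges 1.0 and 2.0.
scan = pd.Series([1.0, 1.1, 2.0, 2.5, 4.8])
bins = [0, 1, 2, 3, 4, 5]
labels = [1, 2, 3, 4, 5]

# Intervals are (0, 1], (1, 2], ... because right=True is the default.
binned = pd.cut(scan, bins=bins, labels=labels)
print(binned.tolist())  # [1, 2, 2, 3, 5]
```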


Binning method application

Converting the acq_date column from string to datetime type

forest['acq_date'] = pd.to_datetime(forest['acq_date'])

Now we will drop the scan column and handle the date type data. We can extract useful information from this type of data as well.

forest = forest.drop(['scan'], axis = 1)

Create new column year with the help of acq_date column

forest['year'] = forest['acq_date'].dt.year


Create a new column

Just as we added the year column, we will add the month and day columns

forest['month'] = forest['acq_date'].dt.month
forest['day'] = forest['acq_date'].dt.day

Checking the shape of the dataset again

forest.shape

(36011, 17)

Now, as we can see, two more columns have been added, which come from splitting up the date column

Separation of the target variable

y = forest['confidence']
fin = forest.drop(['confidence', 'acq_date', 'acq_time', 'bright_t31', 'type_0'], axis = 1)

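Checking the correlation again

plt.figure(figsize=(10, 10))
# correlation heatmap (plotting call reconstructed; numeric columns only)
sns.heatmap(fin.corr(numeric_only=True), annot=True)

Correlation check
Image by author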

Let’s look at the cleaned and sorted dataset now

fin.head()

View the cleaned and sorted data

Split clean data into a training and test dataset

Xtrain, Xtest, ytrain, ytest = train_test_split(fin.iloc[:, :500], y, test_size=0.2)
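
The split above has no random_state, so the rows that land in the test set (and therefore the scores further down) will change a little from run to run. If reproducible numbers matter, a seed can be passed to train_test_split. A minimal sketch on toy arrays (the data and the seed value 42 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and target: 10 rows, 2 columns.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Fixing random_state makes the shuffle, and so the split, repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Splitting again with the same seed yields the same test rows.
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(X_test, X_test2)
print(X_train.shape, X_test.shape, same_split)
```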

Building the model

Using RandomForestRegressor to build the model

random_model = RandomForestRegressor(n_estimators=300, random_state = 42, n_jobs = -1)
#Fit
random_model.fit(Xtrain, ytrain)

y_pred = random_model.predict(Xtest)

#Checking the accuracy on the training set (score() returns R^2 for a regressor)
random_model_accuracy = round(random_model.score(Xtrain, ytrain)*100,2)
print(round(random_model_accuracy, 2), '%')


95.32 %

Accuracy Check

random_model_accuracy1 = round(random_model.score(Xtest, ytest)*100,2)
print(round(random_model_accuracy1, 2), '%')


65.32 %
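
A quick note on what this "accuracy" is: for a regressor, score() returns the coefficient of determination (R²), not a classification accuracy. The sketch below, run on synthetic data (everything in it is made up for illustration), shows that score() matches sklearn's r2_score and also reports the mean absolute error, which is often easier to interpret:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: the target depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 100 * X[:, 0] + rng.normal(scale=5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

pred = model.predict(X_test)
score_r2 = model.score(X_test, y_test)   # what the article prints as "accuracy"
manual_r2 = r2_score(y_test, pred)       # the same quantity, computed explicitly
mae = mean_absolute_error(y_test, pred)  # average error in target units
print(round(score_r2, 4), round(manual_r2, 4), round(mae, 2))
```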

Saving the model with the pickle module in serialized format

import pickle
# pickle.dump() returns None, so we just write to the file and close it
with open('ForestModelOld.pickle', 'wb') as f:
    pickle.dump(random_model, f)

Model tuning

  • The accuracy is not that great, plus the model is overfitting

Getting all the parameters of the model

random_model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 300,
 'n_jobs': -1,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

Using RandomizedSearchCV!

# n_estimators = number of trees in the forest
# max_features = max number of features considered for splitting a node
# max_depth = max number of levels in each decision tree
# min_samples_split = min number of data points placed in a node before the node is split
# min_samples_leaf = min number of data points allowed in a leaf node
# bootstrap = method for sampling data points (with or without replacement)
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 300, stop = 500, num = 20)]
# Number of features to consider at every split
# (note: recent scikit-learn releases no longer accept 'auto'; for a regressor it meant using all features)
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(15, 35, num = 7)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 3, 5]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}
print(random_grid)

{'n_estimators': [300, 310, 321, 331, 342, 352, 363, 373, 384, 394, 405, 415, 426, 436, 447, 457, 468, 478, 489, 500], 'max_features': ['auto', 'sqrt'], 'max_depth': [15, 18, 21, 25, 28, 31, 35, None], 'min_samples_split': [2, 3, 5], 'min_samples_leaf': [1, 2, 4]}
  • Random search of parameters, using 3-fold cross-validation, searching across 50 different combinations and using all available cores
  • n_iter controls how many different combinations to try, and cv is the number of folds to use for cross-validation
rf_random = RandomizedSearchCV(estimator = random_model, param_distributions = random_grid, 
                                n_iter = 50, cv = 3, verbose=2, random_state=42)
# Fit the random search model
rf_random.fit(Xtrain, ytrain)


Outputs of our wildfire prediction model

Just like in this snippet, there will be many fits in this RandomizedSearchCV run
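
To get a feeling for how much work the search does: with n_iter = 50 and cv = 3 there are 50 × 3 = 150 fits, plus one final refit on the whole training set. The sketch below runs a much smaller search on synthetic data so it finishes in seconds (the data, the small grid, and the names `search` and `n_fits` are assumptions for illustration only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(size=(120, 4))
y = 10 * X[:, 0] + rng.normal(scale=0.5, size=120)

# A deliberately tiny search space (27 possible combinations).
param_distributions = {
    "n_estimators": [10, 20, 30],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,         # number of sampled parameter combinations
    cv=3,             # folds per combination
    random_state=0,
)
search.fit(X, y)

# Each sampled combination is fitted once per fold.
n_fits = len(search.cv_results_["params"]) * search.n_splits_
print(n_fits, search.best_params_)
```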

Getting the best parameters from it

rf_random.best_params_

{'n_estimators': 394,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 25}

Creating a new model with the tuned parameters

random_new = RandomForestRegressor(n_estimators = 394, min_samples_split = 2, min_samples_leaf = 1, max_features="sqrt",
                                      max_depth = 25, bootstrap = True)
#Fit
random_new.fit(Xtrain, ytrain)
y_pred1 = random_new.predict(Xtest)
#Checking the accuracy on the training set (score() returns R^2 for a regressor)
random_model_accuracy1 = round(random_new.score(Xtrain, ytrain)*100,2)
print(round(random_model_accuracy1, 2), '%')


95.31 %

Accuracy Check

random_model_accuracy2 = round(random_new.score(Xtest, ytest)*100,2)
print(round(random_model_accuracy2, 2), '%')


67.39 %

Saving the tuned model with the pickle module in serialized format

# pickle.dump() returns None, so we just write to the file and close it
with open('ForestModel.pickle', 'wb') as f:
    pickle.dump(random_new, f)

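Loading the tuned pickled model

# pickle.load() needs an open file object, not the return value of pickle.dump()
with open('ForestModel.pickle', 'rb') as f:
    reg_from_pickle = pickle.load(f)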
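
A simple way to convince yourself that the loaded model is the same as the one that was saved is to compare predictions before and after the round trip. This is a minimal sketch with a tiny toy model and a temporary file (the toy data and the file name `toy_model.pickle` are made up for illustration):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny toy model so the example runs instantly.
X = np.arange(20, dtype=float).reshape(10, 2)
y = X.sum(axis=1)
model = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)

# Save to a temporary location, then load it back.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pickle")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should predict exactly what the original predicts.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```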

bz2file

Here comes the cherry on the cake (the bonus of this article). Let’s understand what the bz2file module is all about. Let’s get started!

What is bz2file

bz2file is one of the modules in Python responsible for compressing and decompressing files. It can shrink a serialized or deserialized file to a smaller size, which is very helpful in the long run when we have large datasets.

How is bz2file useful here?

Our dataset is 2.7+ MB and our random forest model is 700+ MB, so we need to compress the model so that it does not become a burden on storage.

How to install bz2file?

  • Jupyter notebook: !pip install bz2file
  • Anaconda/CMD prompt: pip install bz2file

With that, bz2file is installed. It is a life-saving package for those who are low on disk space but want to store or use large datasets. The pickle file is over 700 MB in size, and when compressed with bz2 it comes down to 93 MB or less.

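import bz2

compressionLevel = 9
source_file = "ForestModel.pickle" # this file can be in a different format, like .csv or others...
destination_file = "ForestModel.bz2" # assumed output name; the text does not give one

# read the pickled model and compress its bytes
with open(source_file, 'rb') as data:
    tarbz2contents = bz2.compress(data.read(), compressionLevel)

# write the compressed bytes to the destination file
with open(destination_file, "wb") as fh:
    fh.write(tarbz2contents)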

This code will reduce the size of the pickle model that has been set.
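
Compressing is only half of the story, since the model has to be read back at some point. One possible way, sketched here with a small stand-in object instead of the real 700 MB model (the `payload` dictionary and the file name are made up for illustration), is to let bz2.open handle compression and decompression while pickle writes and reads through it:

```python
import bz2
import os
import pickle
import tempfile

# Stand-in for a fitted model: a repetitive object that compresses well.
payload = {"weights": [0.5] * 10000}

path = os.path.join(tempfile.mkdtemp(), "toy_model.pickle.bz2")

# Pickle straight into a bz2-compressed file.
with bz2.open(path, "wb", compresslevel=9) as f:
    pickle.dump(payload, f)

# Read it back: bz2.open decompresses transparently for pickle.load.
with bz2.open(path, "rb") as f:
    restored = pickle.load(f)

raw_size = len(pickle.dumps(payload))
compressed_size = os.path.getsize(path)
print(raw_size, compressed_size)
```

How much space is saved depends on the content; the 700 MB to 93 MB figure above is for the random forest in this article, not for this toy object.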

Well, that’s a wrap from my side!


Thank you for reading my article 🙂

I hope you liked this step-by-step tutorial on forest fire prediction using machine learning. One last thing I want to mention: I am well aware that the model accuracy is not that good, but the goal of the article is quite balanced, so you can try out different ML algorithms in search of better accuracy.

Here is the repo link for this article.

Here you can access my other articles published on Analytics Vidhya as part of the Blogathon (link)

If you have any queries, you can connect with me on LinkedIn; please refer to this link

About Me

Greetings everyone, I am currently working at TCS; previously I worked as an Associate Analyst in Data Science at Zorba Consulting India. Along with working full time, I have an immense interest in the same field, i.e. data science, along with other subsets of artificial intelligence such as computer vision, machine learning, and deep learning. Feel free to collaborate with me on any project in the above-mentioned domains (LinkedIn).

The media described in this article on forest fire forecasting are not owned by Analytics Vidhya and are used at the author’s discretion.
