Modin: Expedite Your Pandas Code with Single Change – News Couple

This article was published as a part of the Data Science Blogathon.

Introduction to Pandas

Pandas is a Python library that needs no introduction. It provides an easy way to preprocess and analyze data, but it slows down on large datasets. The primary reason for the slowdown is that pandas cannot run code in parallel; it uses only one CPU core. To work with large data, we have to shift to distributed computing platforms like Spark. But maintaining distributed computing platforms is itself demanding.

Moreover, distributed computing platforms have a steep learning curve, which makes them challenging for beginners. This is where Modin comes in. It makes your pandas code run in parallel by changing just one line of code.

Modin is an open-source library developed by UC Berkeley’s RISELab to speed up computation through distributed computing. Modin uses Ray or Dask in the backend to parallelize the code, and we don’t need any distributed computing knowledge to use it. The Modin DataFrame has an API similar to pandas, so all we have to do is continue using the pandas API as before. Modin provides a speedup of up to 4x on a 4-core laptop and can be used for dataset sizes ranging from 1 MB to 1 TB.

Installation

Modin can be installed with pip, and it uses Ray or Dask as its backend. Modin automatically detects which engine is installed on our computer. If neither Ray nor Dask is already installed, Modin can be installed together with its dependencies using the commands below.

pip install "modin[ray]" # Install Modin dependencies and Ray to run on Ray
pip install "modin[dask]" # Install Modin dependencies and Dask to run on Dask
pip install "modin[all]" # Install all of the above

Modin Architecture

The high-level architecture of Modin can be described as follows.

Currently, we can use Modin with the pandas API. The SQLite API is in experimental mode. The Modin developers plan to introduce separate APIs in the future, but nothing of that kind has been developed yet. The Query Compiler layer, which sits beneath the API layer, composes the query and performs some optimizations based on the format of the data.

Modin runs with Ray or Dask as its backend. Because it is an open-source library, we can also make Modin work with our own backend. Modin can run directly on Python without such a backend, but that defeats the purpose, as Modin cannot run code in parallel on its own.
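As a minimal sketch of choosing an engine explicitly instead of relying on auto-detection: Modin reads the MODIN_ENGINE environment variable, which needs to be set before modin.pandas is first imported. The helper name select_engine is hypothetical, and the sketch assumes the chosen engine is already installed.

```python
import os

def select_engine(name):
    """Record the desired Modin engine ("ray" or "dask") in the environment.

    Hypothetical helper: it only sets MODIN_ENGINE, which Modin consults
    when modin.pandas is first imported in this process.
    """
    allowed = {"ray", "dask"}
    if name not in allowed:
        raise ValueError(f"unsupported engine: {name!r}")
    os.environ["MODIN_ENGINE"] = name
    return name

select_engine("dask")

# A later `import modin.pandas as pd` in this process would then use Dask,
# assuming modin[dask] is installed.
```

Setting the variable in the shell before starting Python has the same effect and keeps the script itself unchanged.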

Modin Dataframe Architecture

The Modin DataFrame is partitioned along both rows and columns, and each partition is a separate pandas DataFrame.
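The idea of partitioning along both axes can be illustrated with a toy sketch in plain Python. This is not Modin's implementation: the function partition_grid and the block sizes are hypothetical, and nested lists stand in for the pandas DataFrames that Modin would hold in each partition.

```python
def partition_grid(rows, row_block, col_block):
    """Split a table (a list of equal-length rows) into a 2-D grid of blocks."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    grid = []
    for r in range(0, n_rows, row_block):
        grid_row = []
        for c in range(0, n_cols, col_block):
            # Each block keeps a slice of the rows and a slice of the columns.
            block = [row[c:c + col_block] for row in rows[r:r + row_block]]
            grid_row.append(block)
        grid.append(grid_row)
    return grid

# A 4x4 table split into four 2x2 blocks.
table = [[r * 10 + c for c in range(4)] for r in range(4)]
grid = partition_grid(table, row_block=2, col_block=2)
print(grid[0][0])  # top-left block
print(grid[1][1])  # bottom-right block
```

Because every block is independent, an operation that works element by element can be applied to all blocks at the same time.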


We can change the default partitioning in Modin by using the repartition() method.

Implementation

We can replace pandas with Modin by changing just one line of code:

import modin.pandas as pd

Modin vs Pandas Comparison

pd.read_csv()

import modin.pandas as pd
import time
start_time = time.time()
data_modin = pd.read_csv("../input/uwmgi-mask-dataset/train.csv")
end_time = time.time()
duration = end_time - start_time
print("Time taken to run the code " + str(duration))

Time taken to run the code 0.416454076769678

import pandas
import time
start_time = time.time()
data_pandas = pandas.read_csv("../input/uwmgi-mask-dataset/train.csv")
end_time = time.time()
duration = end_time - start_time
print("Time taken to run the code " + str(duration))

Time taken to run the code 0.6549224853515625
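The two measurements above repeat the same timing boilerplate. A small helper can factor it out, as sketched below. The names timed and compare_read_csv are hypothetical, and compare_read_csv assumes both pandas and Modin are installed. Note also that Modin may schedule work asynchronously on some engines, so a wall-clock measurement around a single call might not capture the whole computation.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

def compare_read_csv(path):
    """Time read_csv with pandas and with Modin on the same file.

    Imports are local so that timed() stays usable without either library.
    """
    import pandas
    import modin.pandas as mpd

    _, pandas_seconds = timed(pandas.read_csv, path)
    _, modin_seconds = timed(mpd.read_csv, path)
    return pandas_seconds, modin_seconds

# The helper works with any callable:
total, seconds = timed(sum, range(1000))
print(total, seconds)
```

Using time.perf_counter() rather than time.time() is a design choice here: it is intended for measuring short durations.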

df.fillna()

%%time
data_modin.fillna(0)

CPU times: user 7.68 ms, sys: 5.06 ms, total: 12.7 ms
Wall time: 10.9 ms

%%time
data_pandas.fillna(0)

CPU times: user 59.2 ms, sys: 5.38 ms, total: 64.6 ms
Wall time: 62.3 ms

Limitations

Pandas is a heavy library with a wide collection of APIs. Although Modin supports the popular pandas APIs, it does not support all of them.

Functions that are not implemented in Modin automatically default to pandas. So, for functions that are not implemented in Modin and for user-defined functions (apply functions in pandas), Modin converts the Modin DataFrame to a pandas DataFrame and then applies those functions. This conversion to a pandas DataFrame carries some performance penalty.
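The cost of defaulting to pandas can be pictured with a toy model in plain Python. This is only an illustration, not Modin's code: the ToyFrame class and its method names are hypothetical, and lists stand in for DataFrame partitions.

```python
class ToyFrame:
    """Toy stand-in for a partitioned frame: a list of row partitions."""

    def __init__(self, partitions):
        self.partitions = partitions
        self.fallbacks = 0  # how often the data was gathered into one piece

    def map_partitions(self, func):
        # "Implemented" path: apply func to each partition independently.
        return ToyFrame([func(p) for p in self.partitions])

    def default_to_single(self, func, chunk=2):
        # Fallback path: gather everything, run func once, then split again.
        self.fallbacks += 1
        whole = [x for p in self.partitions for x in p]
        result = func(whole)
        return ToyFrame([result[i:i + chunk] for i in range(0, len(result), chunk)])

frame = ToyFrame([[3, 1], [4, 2]])

# Partition-wise work never needs the whole dataset in one place.
doubled = frame.map_partitions(lambda part: [2 * x for x in part])

# A function that needs all the data at once forces the gather step.
ordered = frame.default_to_single(sorted)
print(doubled.partitions, ordered.partitions, frame.fallbacks)
```

The gather and re-split steps in default_to_single are the toy equivalent of the conversion penalty described above.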

Comparison to Other Libraries

Dask, Vaex, Ray, cuDF, and Koalas are some of the popular alternatives to Modin.

Libraries like Dask and Koalas try to resolve the performance issue for large datasets in their own ways, but they do not preserve pandas API behavior, and we have to make significant changes to our pandas code to run it on Dask or Koalas. Also, Dask and Koalas support only row partitioning, whereas Modin supports row, column, and cell partitioning of a DataFrame, which helps Modin support a wide variety of pandas APIs. Because of this control over partitioning, Modin supports pandas methods like transpose(), quantile(), and median(), which are difficult to apply to row-based partitions of data.
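Why a column-wise statistic is awkward on row partitions can be seen in a small plain-Python sketch. The data and function names are hypothetical, and lists stand in for DataFrame partitions; the point is only where each column's values live.

```python
from statistics import median

table = [
    [1, 10],
    [3, 30],
    [2, 20],
    [4, 40],
]

# Row partitions: each block holds some rows, so every column is split up.
row_parts = [table[:2], table[2:]]

# Column partitions: each block holds one complete column.
col_parts = [[row[c] for row in table] for c in range(2)]

def median_from_col_parts(parts):
    # Each column lives in one block, so blocks can be processed independently.
    return [median(col) for col in parts]

def median_from_row_parts(parts):
    # Column values must first be gathered from every row block.
    n_cols = len(parts[0][0])
    gathered = [[row[c] for part in parts for row in part] for c in range(n_cols)]
    return [median(col) for col in gathered]

print(median_from_col_parts(col_parts))
print(median_from_row_parts(row_parts))
```

Both functions return the same medians, but only the row-partitioned version needs data from every block before it can start.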


Similarly, libraries like Vaex are designed for data visualization on large datasets, not as a replacement for pandas. So it is better to try Vaex for data visualization, but for ease of adoption and performance Modin beats Vaex. We also can’t directly compare libraries like Ray and cuDF with Modin: Ray, for example, is a general-purpose distributed execution framework and does not provide a high-level pandas API. Instead, Ray or cuDF can be used as backend support for Modin for optimal performance with fewer code changes.

Also, Dask and Ray use lazy evaluation, which executes the code only when the user explicitly asks for the result. Although lazy evaluation can reduce running time, it sometimes increases memory usage. Unlike Dask and Ray, Modin does not use lazy evaluation by default, but it does support lazy evaluation with the OmniSci engine.
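The difference between lazy and eager evaluation can be sketched in a few lines of plain Python. The LazyPipeline class is a hypothetical toy, not how Dask or Modin are implemented; it only shows that a lazy pipeline records steps and does the work when compute() is called.

```python
class LazyPipeline:
    """Toy lazy pipeline: steps are recorded and run only on compute()."""

    def __init__(self, data):
        self._data = data
        self._steps = []

    def map(self, func):
        # Nothing is executed here; the step is only remembered.
        self._steps.append(func)
        return self

    def compute(self):
        # All recorded steps run now, in order.
        result = list(self._data)
        for func in self._steps:
            result = [func(x) for x in result]
        return result

calls = []

def double(x):
    calls.append(x)  # track when the function actually runs
    return 2 * x

lazy = LazyPipeline([1, 2, 3]).map(double)
calls_before = len(calls)                      # still zero: no work done yet
lazy_result = lazy.compute()                   # the work happens here
eager_result = [double(x) for x in [1, 2, 3]]  # eager: runs immediately
print(calls_before, lazy_result, eager_result)
```

Deferring the work gives a lazy system the chance to plan all steps together, at the price of holding the recorded steps and inputs until the result is requested.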

Conclusion on Pandas Code

Modin is still in its early stages.

Modin’s core aim is to let users use the same tools for small and large datasets without changing APIs. In the future, the developers plan to integrate Modin with PyArrow, SQLite, and various other libraries. The key takeaways of the article are:

  1. We understood why pandas is not ideal for large data and how Modin helps us deal with it.
  2. We understood the architecture of Modin.
  3. We learned how to use Modin in Python.
  4. We went through the limitations of Modin.
  5. Finally, we compared Modin with its alternatives and discussed its inherent strengths relative to them.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.


