Faster than Pandas

How to Speed Up Pandas with Modin

ref: Towards Data Science

The pandas library provides easy-to-use data structures, such as the DataFrame, along with tools for data analysis. One issue with pandas is that it wasn't designed for analyzing very large datasets, such as those of 100 GB or 1 TB. [1]

Fortunately, there is the Modin library. It can handle datasets that are too large or too slow to process with pandas.

Modin is a drop-in replacement for pandas. While pandas is single-threaded, Modin lets you instantly speed up your workflows by scaling pandas so it uses all of your cores. Modin works especially well on larger datasets, where pandas becomes painfully slow or runs out of memory.

To use Modin, you only need to change the import statement: replace "import pandas as pd" with "import modin.pandas as pd", and the rest of your pandas code stays the same. (Figure: modin-import)
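As a minimal sketch of the import swap, the snippet below runs an ordinary pandas-style groupby through Modin. The column names and values are made up for illustration, and the fallback to plain pandas is an assumption added here so the example still runs where Modin is not installed.

```python
# Only the import changes; the fallback keeps the sketch runnable without Modin.
try:
    import modin.pandas as pd  # drop-in replacement for "import pandas as pd"
except ImportError:
    import pandas as pd

# Hypothetical sample data; a real workload would be far larger.
df = pd.DataFrame({
    "city": ["a", "b", "a", "b"],
    "sales": [10, 20, 30, 40],
})

# The same pandas API call works with either backend.
totals = df.groupby("city")["sales"].sum()
print(totals)
```

On a dataset this small Modin is unlikely to show any speedup; the benefit is expected on larger data, where the work can be spread across cores.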

The charts below show the speedup you get by replacing pandas with Modin. (Figure: modin-speed-comparison)

Faster Than Pandas with Polars

ref: Python in Office

Libraries in comparison: polars, modin, datatable

Results (by dataset size):

  • 1-million-row
  • 10-million-row
  • 50-million-row
  • 100-million-row

polars performs better than the other libraries in most of our tests. Some of the highlights include:

  • ~17x faster than pandas when reading CSV files
  • ~10x faster than pandas when merging two dataframes
  • ~2-3x faster than pandas for our other tests

The results suggest that replacing pandas with polars will likely increase the speed of our Python program by at least 2-3 times.


Tags

  1. cat.tut
  2. topic.data

Footnotes

  1. Wes McKinney | Apache Arrow and the "10 Things I Hate About pandas"