Steven Levine
3 min read

TLDR - When comparing the Pandas API on Spark with plain Pandas, I found that as the data size grew, the performance difference grew as well, with Spark being the clear winner.

I found this post about the new Pandas API on Spark very intriguing, specifically the performance improvements and the fact that “Pandas users will be able to scale their workloads with one simple line change”. Additionally, “The pandas API on Spark often outperforms pandas even on a single machine thanks to the optimizations in the Spark engine.”

Since these are bold statements, and since I am super curious, I decided to write a few simple load tests to run on my laptop to see if I could reproduce the performance improvements locally.

Since most of my posts on data use the open NYC Taxi Dataset, I figured why not use it again? Next, I needed a simple problem for the performance comparison that would exercise the Dataframe properly. The problem I went with was finding the top 10 drop-off locations that yielded the highest total taxi fare for the year. This requires scanning the entire dataset (since it is not partitioned), a grouping, a summing, and finally a filtering, which seems like a fair problem. Finally, to ensure as much data as possible was processed, I went with CSV files instead of Parquet, since reading CSVs requires reading the entire file rather than just the specific columns used for the calculation.
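As a quick aside, this is why the CSV choice matters: with a columnar format like Parquet, only the two columns used in the calculation would need to be read. A minimal sketch, assuming Parquet copies of the same data existed at a hypothetical ./data/parquet/ path:

from pyspark.pandas import read_parquet

# Hypothetical Parquet copy of the taxi data; with a columnar format, only the
# columns needed for the calculation are read from disk.
pdf = read_parquet(
    "./data/parquet/yellow_tripdata_2019",
    columns=["DOLocationID", "fare_amount"],
)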

The code that I wrote for this post can be found in this Github Repo. To reiterate, the goal of the code was:

  • Highlight the fact that only an import needed to change in order to use the Pandas API on Spark instead of plain Pandas.
  • “Verify” there was a difference in performance

On a side note, since the frameworks do not handle reading in multiple files (wildcards) the same way, a few changes beyond the import were required to achieve the same results. I will highlight them below.

The Code

from datetime import datetime

from pyspark.pandas import read_csv

start = datetime.now()

pdf = read_csv("./data/yellow_tripdata_2019-*.csv")

print(pdf.count())

# Total fare per drop-off location, top 10
res = (
    pdf.groupby("DOLocationID")["fare_amount"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)

print(res)
print(f"Runtime: {datetime.now() - start}")

Download Source: pyspark_pandas_test

  • This code pulls in taxi trip data from 12 CSVs (one per month) and calculates which drop-off locations (DOLocationID) yielded the highest total fare_amount.

And here is the plain Pandas version:

from datetime import datetime
from glob import glob

from pandas import concat
from pandas import read_csv

start = datetime.now()

pdf = concat(map(read_csv, glob('./data/yellow_tripdata_2019-*.csv')))

# Everything below is identical to the pyspark.pandas version

Download Source: pandas_test.py

Here are the only two differences between the two tests:

  • The imports are from pandas vs from pyspark.pandas
  • Building a Dataframe containing data from all 12 files in plain Pandas requires glob() to expand the wildcard and concat() to combine the results

Results

Note: The benchmarks were conducted on the latest MacBook Pro (M1 Max, 10-core, 32 GB)

First Run

File type            CSV
Number of files      12
Avg size of file     383,161 MB
Total size of files  2.2 GB
Total Record Count   23,838,931
Pyspark Runtime      0:00:21.042511
Pandas Runtime       0:00:27.491613
Difference           6.449102 seconds
Difference %         23%

Second Run

File type            CSV
Number of files      12
Avg size of file     1,282,542 MB
Total size of files  7.3 GB
Total Record Count   84,152,418
Pyspark Runtime      0:00:51.244250
Pandas Runtime       0:01:40.545144
Difference           49.3008939 seconds
Difference %         49%

Disclaimer: This was run on a single machine and is not meant to be used as an authoritative performance comparison.

Conclusion

As the data size grew, the performance difference grew as well, with the Pandas API on Spark being the clear winner. Even though this test was run on a single machine with 10 cores, my assumption (based on the data in the Databricks post) is that if you were to run this on a larger cluster of machines, the performance difference would grow as well.
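For what it's worth, pointing the same code at a cluster is mostly a matter of configuring the SparkSession before any pyspark.pandas calls, since the Pandas API on Spark uses whatever session is active. A minimal sketch, where the master URL is purely a placeholder:

from pyspark.sql import SparkSession

# Placeholder cluster URL; purely illustrative.
spark = (
    SparkSession.builder
    .master("spark://my-cluster-host:7077")
    .appName("pandas-on-spark-taxi-test")
    .getOrCreate()
)

import pyspark.pandas as ps

# The query itself is unchanged; it now runs against the cluster-backed session.
pdf = ps.read_csv("./data/yellow_tripdata_2019-*.csv")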

With the performance of Pandas on Spark improving so much, it raises the question: when should you use plain Pandas vs. the Pandas API on Spark vs. Spark Dataframes?

My suggestion is as follows:

  • If you already know Pandas and are new to PySpark, use the Pandas API on Spark to get started and to leverage your existing code.
  • If you are new to both Pandas and PySpark, use the native PySpark (Dataframe) API, since it offers the most complete functionality (streaming, batch, and ML) that Spark has to offer. A rough sketch of that version of the query follows below.
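For comparison, here is a rough sketch of the same query written directly against the native PySpark DataFrame API (not part of the benchmark above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Same CSV files as before; header and schema inference keep the column names and types.
df = spark.read.csv("./data/yellow_tripdata_2019-*.csv", header=True, inferSchema=True)

# Total fare per drop-off location, top 10.
res = (
    df.groupBy("DOLocationID")
      .agg(F.sum("fare_amount").alias("total_fare"))
      .orderBy(F.desc("total_fare"))
      .limit(10)
)

res.show()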