TL;DR: When comparing the Pandas API on Spark against plain Pandas, I found that as the data size grew, the performance gap grew as well, with Spark being the clear winner.
I found this post about the new Pandas API on Spark very intriguing, specifically the performance improvements and the fact that “Pandas users will be able to scale their workloads with one simple line change”. Additionally, “The pandas API on Spark often outperforms pandas even on a single machine thanks to the optimizations in the Spark engine.”
Since these are bold statements, and since I am super curious, I decided to write a few simple load tests to run on my laptop to see whether I could reproduce the performance improvements locally.
Since most of my posts on data use the open NYC Taxi Dataset, I figured why not use it again? Next, I needed a simple problem for the performance comparison that would exercise the Dataframe properly. The problem I went with was finding the top 10 drop-off locations that yielded the highest total taxi fare for the year. This requires scanning the entire dataset (since it is not partitioned), a grouping, a summing, a sorting, and finally a filtering. That seemed like a fair problem. Finally, to ensure the test processed as much data as possible, I went with CSV files instead of Parquet, since reading CSVs requires reading the entire file rather than just the specific columns used in the calculation.
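As a sanity check of the shape of the computation, here is the same group-sum-sort-head pattern in plain Pandas on a tiny, made-up DataFrame (the column names match the real dataset; the values are hypothetical):

```python
import pandas as pd

# Hypothetical toy trips: DOLocationID and fare_amount are the real column
# names from the NYC Taxi dataset; the values here are made up.
trips = pd.DataFrame({
    "DOLocationID": [1, 2, 1, 3, 2, 1],
    "fare_amount": [10.0, 5.0, 7.5, 3.0, 12.0, 2.5],
})

# Group by drop-off location, total the fares, and keep the top locations.
top = (
    trips.groupby("DOLocationID")["fare_amount"]
    .sum()
    .sort_values(ascending=False)
    .head(2)
)
print(top)  # location 1 totals 20.0, location 2 totals 17.0
```

The real tests below do exactly this, just over tens of millions of rows and with `head(10)`.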
The code that I wrote for this post can be found in this Github Repo. To reiterate, the goal of the code was:
- Highlight the fact that only an `import` needed to change in order to use the Pandas API on Spark instead of plain Pandas.
- "Verify" there was a difference in performance.
On a side note, since the two frameworks do not handle reading in multiple files (wildcards) in exactly the same way, a few changes were required beyond the `import` in order to achieve the same results. I will highlight them below.
The Code
```python
from datetime import datetime

from pyspark.pandas import read_csv

start = datetime.now()

pdf = read_csv("./data/yellow_tripdata_2019-*.csv")
print(pdf.count())

res = (
    pdf.groupby("DOLocationID")["fare_amount"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
print(res)

print(f"Runtime: {datetime.now() - start}")
```
Download Source: pyspark_pandas_test
- This code pulls in taxi trip data from 12 CSVs (one per month) and calculates which drop-off location (`DOLocationID`) has the highest total `fare_amount`.
```python
from datetime import datetime
from glob import glob

from pandas import concat, read_csv

start = datetime.now()

df = concat(map(read_csv, glob("./data/yellow_tripdata_2019-*.csv")))

# Everything below is identical to the pyspark.pandas version
```
Download Source: pandas_test.py
Here are the only two differences between the two tests:
- The imports: `from pandas` vs `from pyspark.pandas`.
- Building a Dataframe containing data from all 12 files with plain Pandas requires `concat()` as well as a `glob()` over the file paths.
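To make the second difference concrete, here is a minimal, self-contained sketch of the glob-plus-concat pattern, using two small throwaway CSVs written to a temp directory (file names and data are hypothetical):

```python
import os
import tempfile
from glob import glob

import pandas as pd
from pandas import concat, read_csv

# Write two tiny CSVs that stand in for the monthly taxi files.
tmp = tempfile.mkdtemp()
for month in ("01", "02"):
    pd.DataFrame(
        {"DOLocationID": [1, 2], "fare_amount": [10.0, 5.0]}
    ).to_csv(os.path.join(tmp, f"yellow_tripdata_2019-{month}.csv"), index=False)

# Plain Pandas cannot expand the wildcard itself, so glob() finds the files
# and concat() stitches the per-file frames into one Dataframe.
df = concat(map(read_csv, glob(os.path.join(tmp, "yellow_tripdata_2019-*.csv"))))
print(len(df))  # 2 files x 2 rows = 4
```

With `pyspark.pandas.read_csv`, the wildcard path goes straight into the single `read_csv()` call instead.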
Results
Note: The benchmarks were conducted on a MacBook Pro (M1 Max, 10-core, 32 GB RAM).
First Run

| First Run | |
|---|---|
| File type | CSV |
| Number of files | 12 |
| Avg size of file | ~185 MB |
| Total size of files | 2.2 GB |
| Total Record Count | 23,838,931 |
| Pyspark Runtime | 0:00:21.042511 |
| Pandas Runtime | 0:00:27.491613 |
| Difference | 6.449102 s |
| Difference % | 23% |
Second Run

| Second Run | |
|---|---|
| File type | CSV |
| Number of files | 12 |
| Avg size of file | ~625 MB |
| Total size of files | 7.3 GB |
| Total Record Count | 84,152,418 |
| Pyspark Runtime | 0:00:51.244250 |
| Pandas Runtime | 0:01:40.545144 |
| Difference | 49.300894 s |
| Difference % | 49% |
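For reference, the "Difference" and "Difference %" rows appear to be the absolute runtime gap in seconds and that gap as a share of the slower (plain Pandas) runtime; reproduced here for the second run:

```python
from datetime import timedelta

# Second-run timings, copied from the table above.
pyspark_rt = timedelta(seconds=51.244250)
pandas_rt = timedelta(minutes=1, seconds=40.545144)

# Absolute gap in seconds, and the gap as a fraction of the Pandas runtime.
diff = (pandas_rt - pyspark_rt).total_seconds()
pct = diff / pandas_rt.total_seconds()
print(f"Difference: {diff:.6f} s ({pct:.0%})")
```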
Disclaimer: This was run on a single machine and is not meant to be used as an authoritative performance comparison.
Conclusion
As the data size grew, the performance difference grew as well, with `pyspark` being the clear winner. Even though this test was run on a single machine with 10 cores, my assumption (based on the data in this Databricks post) is that if you were to run it on a larger cluster of machines, the performance difference would grow as well.
With the performance of Pandas on Spark improving so much, it raises the question: when should you use plain Pandas vs Pandas on Spark vs Spark Dataframes?
My suggestion is as follows:
- If you already know Pandas and are new to PySpark, use the Pandas on Spark API, either to get started quickly or to leverage your existing Pandas code.
- If you are new to both Pandas and PySpark, use the native PySpark (Dataframe) API, since it offers the most complete functionality (streaming, batch, and ML) that Spark has to offer.