Example use case

A user can upload either csv or an xlsx log files through a front-end portal. The log files have an average size of around 50MB. The files are uploaded to a cloud storage instance and a Python back-end downloads and process them. The goals are to minimize cloud storage usage and to load the files as fast as possible.

Question: Which file format is better to use, csv or xlsx?

Note: TL;DR compressed csv files use 31% less space and load 97% faster than xlsx files.

Generate dummy data

Generate some dummy data that is roughly 50MB in size and store it in a Pandas DataFrame.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10000, 250))

We export the data as csv and xlsx files to local storage and compress them to further reduce file size. We use the "zip" format as it offers the best overall compression ratio vs. read/write speeds. We note that pd.to_excel() takes significantly longer than pd.to_csv().

df.to_csv("csv.csv")
! zip csv.zip csv.csv
df.to_excel("excel.xlsx")
! zip excel.zip excel.xlsx 

Note: A quick benchmark of different compression methods native to Pandas is available in my post Faster pandas - compression format comparison.

Tip: if Pandas has errors when reading/writing excel files try installing the openpyxl engine with "pip install openpyxl".

File size

Now, lets consider the files sizes:

! ls -lh *.{csv,xlsx,zip}
-rw-r--r-- 1 jverster jverster 47M Jul  7 16:07 csv.csv
-rw-r--r-- 1 jverster jverster 22M Jul  7 16:07 csv.zip
-rw-r--r-- 1 jverster jverster 32M Jul  7 16:07 excel.xlsx
-rw-r--r-- 1 jverster jverster 32M Jul  7 16:07 excel.zip

The csv file saves to 47MB and can be compressed at a ratio of 0.47 to 22MB which leads to a space saving of roughly 53%. On the other hand, the xlsx file saves to 32MB, but offers almost no compression gains. The compressed csv.zip file is the smallest option and 31% smaller than the xlsx file.

Read speed

Next, lets consider the read speed of csv.csv and csv.zip compared to the excel.xlsx file.

Pandas can read compressed files directly with pd.read_csv(), but not with pd.read_excel(). We read only the xlsx file convenience sake.

print("csv.csv read in:")
%timeit pd.read_csv("csv.csv")

print("csv.zip read in:")
%timeit pd.read_csv("csv.zip")

print("excel.xlsx read in:")
%timeit -r 1 -n 1 pd.read_excel("excel.xlsx")
csv.csv read in:
392 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
csv.zip read in:
571 ms ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
excel.xlsx read in:
18.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

We see a slight reduction in speed when reading the csv.csv compared to csv.zip. Surprisingly, the excel.xlsx file takes 18.1 seconds to read. This means the csv.zip file reads about 97% faster than the excel.xlsx file!

Reading directly from the cloud

Finally, Pandas read_csv() can also read files directly from a url which is useful for this use case.

I was interested to test this so I uploaded the csv files to Google Drive and read them directly from there.

csv_file_url = "https://drive.google.com/file/d/1KoserTnHwB14dXHjRMl3PhKnm3fulR3E/view?usp=sharing"
csv_zip_url = "https://drive.google.com/file/d/1qQjmg9aeVh28y5zSBxaTX-ZBi8JePoFn/view?usp=sharing"

print("csv.csv read from Google Drive in:")
%timeit pd.read_csv(csv_file_url)

print("csv.zip read from Google Drive in:")
%timeit pd.read_csv(csv_zip_url)
csv.csv read from Google Drive in:
894 ms ± 67.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
csv.zip read from Google Drive in:
927 ms ± 49.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The times are really close. The smaller zip file downloads faster, but then reads slightly slower. With a blazing fast internet connection the csv.zip files might be slightly faster.

Summary

So in summary, compressed csv.zip files will allow us to use 31% less space and is 97% faster when compared to excel.xlsx files

Very nice!