Faster pandas - csv vs. xlsx
Which file format is better to use when trying to minimize file size and improve loading times?
Example use case
A user can upload either csv or an xlsx log files through a front-end portal. The log files have an average size of around 50MB. The files are uploaded to a cloud storage instance and a Python back-end downloads and process them. The goals are to minimize cloud storage usage and to load the files as fast as possible.
Question: Which file format is better to use, csv or xlsx?
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10000, 250))
We export the data as csv and xlsx files to local storage and compress them to further reduce file size. We use the "zip" format as it offers the best overall compression ratio vs. read/write speeds. We note that pd.to_excel() takes significantly longer than pd.to_csv().
df.to_csv("csv.csv")
! zip csv.zip csv.csv
df.to_excel("excel.xlsx")
! zip excel.zip excel.xlsx
! ls -lh *.{csv,xlsx,zip}
The csv file saves to 47MB and can be compressed at a ratio of 0.47 to 22MB which leads to a space saving of roughly 53%. On the other hand, the xlsx file saves to 32MB, but offers almost no compression gains. The compressed csv.zip file is the smallest option and 31% smaller than the xlsx file.
print("csv.csv read in:")
%timeit pd.read_csv("csv.csv")
print("csv.zip read in:")
%timeit pd.read_csv("csv.zip")
print("excel.xlsx read in:")
%timeit -r 1 -n 1 pd.read_excel("excel.xlsx")
We see a slight reduction in speed when reading the csv.csv compared to csv.zip. Surprisingly, the excel.xlsx file takes 18.1 seconds to read. This means the csv.zip file reads about 97% faster than the excel.xlsx file!
csv_file_url = "https://drive.google.com/file/d/1KoserTnHwB14dXHjRMl3PhKnm3fulR3E/view?usp=sharing"
csv_zip_url = "https://drive.google.com/file/d/1qQjmg9aeVh28y5zSBxaTX-ZBi8JePoFn/view?usp=sharing"
print("csv.csv read from Google Drive in:")
%timeit pd.read_csv(csv_file_url)
print("csv.zip read from Google Drive in:")
%timeit pd.read_csv(csv_zip_url)
The times are really close. The smaller zip file downloads faster, but then reads slightly slower. With a blazing fast internet connection the csv.zip files might be slightly faster.