Faster pandas - compression format comparison
A quick experiment with Pandas' built-in compression formats.
Comparison of different compression formats
Pandas offers "on-the-fly compression of the output data" as part of to_csv(), and according to the latest documentation it infers the format from the following extensions: ".gz", ".bz2", ".zip" and ".xz". This quick experiment compares these formats and picks a winner based on compression and decompression times and on compression ratio.
Note: I did not consider different compression levels for each format, but you can set them by passing a dict as the compression argument instead of a string.
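As a rough illustration of setting a compression level, here is a minimal sketch using gzip. It assumes a reasonably recent pandas version, where extra keys in the compression dict are forwarded to the underlying writer. The small frame, the temporary directory and the file names are placeholders of mine, not part of the experiment.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small frame so the example runs quickly; the sizes are illustrative only.
df_small = pd.DataFrame(np.random.randn(1000, 20))

sizes = {}
with tempfile.TemporaryDirectory() as tmp:
    for level in (1, 9):
        path = os.path.join(tmp, f"level{level}.csv.gz")
        # "method" picks the format; "compresslevel" is passed on to the gzip writer.
        df_small.to_csv(path, compression={"method": "gzip", "compresslevel": level})
        sizes[level] = os.path.getsize(path)
    # Reading needs no extra arguments: the format is inferred from ".gz".
    restored = pd.read_csv(path, index_col=0)

for level, size in sizes.items():
    print(f"gzip level {level}: {size / 1024:.0f} KiB")
```

Lower levels usually trade file size for speed, so the level is worth benchmarking alongside the format itself.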
TL;DR: in this experiment, "zip" had an excellent compression ratio along with the fastest compression and loading times.
import time
import pandas as pd
import numpy as np
# 10,000 rows x 250 columns of random floats
df = pd.DataFrame(np.random.randn(10000, 250))
# "csv" is the uncompressed baseline; the rest are inferred from the extension
compression_formats = ["csv", "gz", "bz2", "zip", "xz"]
for fmt in compression_formats:
    t1 = time.perf_counter()
    df.to_csv(f"csv.{fmt}", compression="infer")
    print(f"Saved {fmt} in {round(time.perf_counter()-t1, 2)} seconds.")
- Compare compressed file sizes using the terminal.
! ls -lh *.{csv,gz,bz2,zip,xz}
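If you are not in a notebook, or your shell does not support brace expansion, the same comparison can be sketched in plain Python. The helper name file_sizes_mb is my own, and the file names assume the f"csv.{fmt}" pattern used when saving.

```python
import os


def file_sizes_mb(paths):
    """Return {path: size in MiB} for the paths that exist on disk."""
    return {p: os.path.getsize(p) / 2**20 for p in paths if os.path.exists(p)}


# File names follow the f"csv.{fmt}" pattern used when saving.
paths = [f"csv.{fmt}" for fmt in ["csv", "gz", "bz2", "zip", "xz"]]

# Print smallest file first; missing files are skipped.
for path, size in sorted(file_sizes_mb(paths).items(), key=lambda kv: kv[1]):
    print(f"{path}: {size:.1f} MiB")
```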
- Finally, decompress the data and record decompression times.
for fmt in compression_formats:
    t1 = time.perf_counter()
    pd.read_csv(f"csv.{fmt}")
    print(f"Read {fmt} in {round(time.perf_counter()-t1, 2)} seconds.")
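A single timing can be noisy (disk cache, background processes), so one possible refinement is to repeat each measurement and keep the fastest run. This is a sketch rather than what the experiment above does. The helper best_time is a hypothetical name, and the workload below is a stand-in so the snippet runs on its own.

```python
import time


def best_time(func, repeats=3):
    """Call func() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; swap in a read or write call to time a real format.
elapsed = best_time(lambda: sum(range(100_000)))
print(f"Best of 3: {elapsed:.4f} seconds")
```

To time a real format, pass something like lambda: pd.read_csv("csv.zip") as func.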