Comparison of different compression formats

Pandas offers "on-the-fly compression of the output data" as part of to_csv(), and according to the documentation it can infer the compression format from file extensions such as ".gz", ".bz2", ".zip" and ".xz". This post runs a quick experiment to compare these formats and pick a winner based on compression time, loading time and compressed file size.
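
As a small illustration of inferred versus explicit compression, the sketch below writes the same frame twice. The frame contents, the temporary directory and the ".dat" file name are made up for this example and are not part of the experiment.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical toy frame, only used to show the two call styles.
df_small = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
out_dir = tempfile.mkdtemp()

# Inferred: compression defaults to "infer", so the ".gz" suffix selects gzip.
inferred_path = os.path.join(out_dir, "inferred.csv.gz")
df_small.to_csv(inferred_path, index=False)

# Explicit: ".dat" says nothing about the format, so it has to be named.
explicit_path = os.path.join(out_dir, "explicit.dat")
df_small.to_csv(explicit_path, index=False, compression="gzip")

# The explicit file is an ordinary gzip stream that the standard library can open.
with gzip.open(explicit_path, "rt") as fh:
    header = fh.readline().strip()

# Reading it back with pandas needs the same hint, for the same reason.
restored = pd.read_csv(explicit_path, compression="gzip")
print(header)
```

Relying on the suffix keeps the calls short, which is why the tests below only vary the file extension.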

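Note: I did not compare different compression levels for each format, so every test uses the defaults. Levels can be set by passing a dict to the compression argument, for example compression={"method": "gzip", "compresslevel": 1}.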
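
A minimal sketch of setting a level, assuming a reasonably recent pandas version in which extra keys of the compression dict are forwarded to the underlying writer. The frame size, the seed and the file names are arbitrary choices for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small seeded frame so the sketch runs quickly (not the 50 MB test data).
rng = np.random.default_rng(0)
df_small = pd.DataFrame(rng.standard_normal((2000, 20)))
out_dir = tempfile.mkdtemp()

sizes = {}
for level in (1, 9):
    path = os.path.join(out_dir, f"level{level}.csv.gz")
    # "method" picks the format; "compresslevel" is passed on to the gzip writer.
    df_small.to_csv(path, compression={"method": "gzip", "compresslevel": level})
    sizes[level] = os.path.getsize(path)

# Level 1 favours speed, level 9 favours size.
print(sizes)
```

The other formats take their own options in the same dict, so the accepted keys depend on the chosen method.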

TL;DR: "zip" has the fastest compression and loading times of the compressed formats, and its file size (22 MB) is close to the best result (19 MB for "bz2").

Generate dummy data

Generate dummy data that is roughly 50 MB in size when written as an uncompressed CSV, and store it in a Pandas DataFrame.

import time

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10000, 250))
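
The "roughly 50 MB" figure can be sanity-checked without writing the full file. The sketch below serializes a 1% sample in memory and scales it up; it assumes the sample rows are representative, which is reasonable here because every cell is drawn from the same distribution.

```python
import numpy as np
import pandas as pd

# 100 rows is 1% of the 10,000 rows used in the experiment.
sample = pd.DataFrame(np.random.randn(100, 250))

# to_csv() without a path returns the CSV text as a string.
sample_bytes = len(sample.to_csv().encode("utf-8"))

# Scale the sample up by 100 to estimate the full file size in MB.
estimated_mb = sample_bytes * 100 / 1e6
print(f"Estimated CSV size: {estimated_mb:.0f} MB")
```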

Compression tests

  1. Compress the data and record the compression time for each format.
compression_formats = ["csv", "gz", "bz2", "zip", "xz"]

for fmt in compression_formats:
    t1 = time.perf_counter()
    df.to_csv(f"csv.{fmt}", compression="infer")
    print(f"Saved {fmt} in {round(time.perf_counter()-t1, 2)} seconds.")
Saved csv in 2.59 seconds.
Saved gz in 6.94 seconds.
Saved bz2 in 5.67 seconds.
Saved zip in 5.3 seconds.
Saved xz in 54.73 seconds.
  2. Compare compressed file sizes using the terminal.
! ls -lh *.{csv,gz,bz2,zip,xz}
-rw-r--r-- 1 jverster jverster 19M Jul  8 11:09 csv.bz2
-rw-r--r-- 1 jverster jverster 47M Jul  8 11:09 csv.csv
-rw-r--r-- 1 jverster jverster 22M Jul  8 11:09 csv.gz
-rw-r--r-- 1 jverster jverster 21M Jul  8 11:10 csv.xz
-rw-r--r-- 1 jverster jverster 22M Jul  8 11:09 csv.zip
  3. Finally, decompress the data and record decompression times.
for fmt in compression_formats:
    t1 = time.perf_counter()
    pd.read_csv(f"csv.{fmt}")
    print(f"Read {fmt} in {round(time.perf_counter()-t1, 2)} seconds.")
Read csv in 0.47 seconds.
Read gz in 0.73 seconds.
Read bz2 in 2.46 seconds.
Read zip in 0.68 seconds.
Read xz in 1.47 seconds.
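
The three steps above can also be folded into a single loop that collects everything in one table. This is only a sketch: the benchmark() helper, the "data.*" file names and the small test frame are hypothetical, and os.path.getsize stands in for the ls call so that it runs outside a notebook.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def benchmark(df, formats, out_dir):
    """Return one row per format: write time, file size and read time."""
    rows = []
    for fmt in formats:
        path = os.path.join(out_dir, f"data.{fmt}")

        # Step 1: write, letting pandas infer the format from the suffix.
        t0 = time.perf_counter()
        df.to_csv(path, compression="infer")
        write_s = time.perf_counter() - t0

        # Step 2: file size on disk, in MB.
        size_mb = os.path.getsize(path) / 1e6

        # Step 3: read the file back.
        t0 = time.perf_counter()
        pd.read_csv(path, index_col=0)
        read_s = time.perf_counter() - t0

        rows.append(
            {"format": fmt, "write_s": write_s, "size_mb": size_mb, "read_s": read_s}
        )
    return pd.DataFrame(rows).set_index("format")


# Small frame so the sketch finishes in well under a second.
small = pd.DataFrame(np.random.randn(500, 50))
results = benchmark(small, ["csv", "gz", "bz2", "zip", "xz"], tempfile.mkdtemp())
print(results.sort_values("size_mb"))
```

Timings from a frame this small are noisy, so the numbers quoted in this post still come from the full-size runs above.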

Results summary

Top three results for each category are as follows:

Compression time:

  1. zip (5.3 seconds)
  2. bz2 (5.67 seconds)
  3. gz (6.94 seconds)

Compression rate (file size, down from 47 MB uncompressed):

  1. bz2 (19 MB)
  2. xz (21 MB)
  3. zip and gz (22 MB)

Loading time:

  1. zip (0.68 seconds)
  2. gz (0.73 seconds)
  3. xz (1.47 seconds)
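
The file sizes can be expressed as compression ratios and space savings. The sketch below uses the rounded sizes from the ls listing, so the results are approximate.

```python
# File sizes in MB as reported by the ls listing in the compression tests.
sizes_mb = {"csv": 47, "bz2": 19, "xz": 21, "gz": 22, "zip": 22}

original = sizes_mb["csv"]
ratios = {}
savings = {}
for fmt, size in sizes_mb.items():
    if fmt == "csv":
        continue  # the uncompressed file is the baseline
    ratios[fmt] = original / size       # how many times smaller
    savings[fmt] = 1 - size / original  # fraction of space saved
    print(f"{fmt}: ratio {ratios[fmt]:.2f}x, space saving {savings[fmt]:.0%}")
```

With these inputs, bz2 comes out at about 2.47x, xz at about 2.24x, and gz and zip at about 2.14x.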

Conclusion

Best overall results:

  1. "zip" has the fastest compression (5.3 seconds) and loading (0.68 seconds) times, with a file size (22 MB) close to the best result.
  2. "gz" matches the file size of zip (22 MB) but is slower to compress (6.94 seconds).
  3. "bz2" offers the best compression rate (19 MB) but is the slowest to load (2.46 seconds).