Elegantly Reading Multiple CSVs to Pandas

Reading multiple CSVs into Pandas is fairly routine. However, there isn’t one clearly right way to perform this task. This often leads to a lot of interesting attempts with varying levels of exoticism.

In an effort to push my own agenda I’m documenting my process.

Dask

One of the cooler features of Dask, a Python library for parallel computing, is the ability to read in CSVs by matching a pattern.

import dask.dataframe as dd
df = dd.read_csv('data*.csv')

This small quirk ends up solving quite a few problems. Despite this, the raw power of Dask isn’t always required, so it’d be nice to have a Pandas equivalent.

Glob

The Python module glob provides Unix-style pathname pattern expansion. For example, glob.glob('*.gif') returns a list of all the .gif files in the current directory.

my_data/
    data1.csv
    data2.csv
    data3.csv
    data4.csv

import glob

files = glob.glob('data*.csv')
# e.g. ['data1.csv', 'data2.csv', 'data3.csv', 'data4.csv'] (glob does not guarantee order)
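Since glob makes no promise about the order of its results, it can be worth sorting them before reading anything in. Here is a minimal sketch of that idea; the temporary directory and the extra notes.txt file are assumptions added purely so the snippet runs on its own.

```python
import glob
import os
import tempfile

# Build a throwaway directory that mirrors the my_data/ layout
# (plus one non-matching file, added for illustration).
tmp_dir = tempfile.mkdtemp()
for name in ["data1.csv", "data2.csv", "data3.csv", "data4.csv", "notes.txt"]:
    with open(os.path.join(tmp_dir, name), "w") as fh:
        fh.write("a,b\n1,2\n")

# glob returns matches in arbitrary order, so sort for a repeatable result.
pattern = os.path.join(tmp_dir, "data*.csv")
files = sorted(glob.glob(pattern))

# Strip the directory part to compare against the listing.
names = [os.path.basename(f) for f in files]
print(names)
```

Only the four data*.csv files match; notes.txt is skipped because it does not fit the pattern.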

OS

Another option, handy when the filenames share no common pattern, is the os module: os.listdir() returns every entry in a directory.

my_data/
    data_blah.csv
    more_data.csv
    yet_another.csv
    lots_of_data.csv

import os

files = os.listdir()
# ['data_blah.csv', 'more_data.csv', 'yet_another.csv', 'lots_of_data.csv']
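One catch with os.listdir() is that it returns everything in the directory, not just CSVs, and in no particular order. A rough sketch of filtering by extension might look like the following; the temporary directory and the stray README.txt are assumptions made up so the example is self-contained.

```python
import os
import tempfile

# Recreate the my_data/ layout in a throwaway directory,
# plus one non-CSV file (illustrative only).
tmp_dir = tempfile.mkdtemp()
for name in ["data_blah.csv", "more_data.csv", "yet_another.csv",
             "lots_of_data.csv", "README.txt"]:
    with open(os.path.join(tmp_dir, name), "w") as fh:
        fh.write("a,b\n1,2\n")

# Keep only names ending in .csv, join them back onto the directory,
# and sort because os.listdir() order is arbitrary.
csv_files = sorted(
    os.path.join(tmp_dir, name)
    for name in os.listdir(tmp_dir)
    if name.endswith(".csv")
)

csv_names = [os.path.basename(path) for path in csv_files]
print(csv_names)
```

Joining each name back onto the directory matters because os.listdir() returns bare filenames, which only resolve correctly if the script happens to run from inside that directory.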

One-Liner

Now comes the fun part. More or less, this dance usually boils down to two functions: pd.read_csv() and pd.concat(). There are a variety of ways to call them; however, I feel this is a scenario in which a little cleverness is apt.

import glob
import pandas as pd

# glob.glob('data*.csv') - returns the matching paths as a List[str]
# pd.read_csv(f) - returns one pd.DataFrame per file
# [... for f in glob.glob(...)] - builds a List[pd.DataFrame]
# pd.concat(..., ignore_index=True) - returns one pd.DataFrame with a fresh 0..n-1 index
df = pd.concat([pd.read_csv(f) for f in glob.glob('data*.csv')], ignore_index=True)
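To see what ignore_index=True actually buys you, here is a small end-to-end sketch. The two sample files, their id and value columns, and the temporary directory are all made up for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two tiny CSVs into a throwaway directory (sample data is invented).
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"id": [1, 2], "value": [10, 20]}).to_csv(
    os.path.join(tmp_dir, "data1.csv"), index=False
)
pd.DataFrame({"id": [3, 4], "value": [30, 40]}).to_csv(
    os.path.join(tmp_dir, "data2.csv"), index=False
)

# Sorted so the rows always come out in filename order.
paths = sorted(glob.glob(os.path.join(tmp_dir, "data*.csv")))

# Each file is read with its own 0, 1 index; ignore_index=True
# renumbers the combined rows 0..3 instead of repeating 0, 1, 0, 1.
df = pd.concat([pd.read_csv(f) for f in paths], ignore_index=True)
print(df)
```

Without ignore_index=True, the combined frame would keep each file's original row labels, so label-based lookups such as df.loc[0] would return one row per file.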

The real beauty of this method is that it still lets you configure how you read in your .csv files. For instance, if the files were encoded as latin1 instead of UTF-8, you could pass encoding='latin1' straight through to pd.read_csv().

import glob
import pandas as pd

df = pd.concat([pd.read_csv(f, encoding='latin1') for f in glob.glob('data*.csv')], ignore_index=True)
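If you find yourself tweaking the pd.read_csv() arguments a lot, one possible variation is to wrap the one-liner in a tiny function that forwards keyword arguments. The name read_many is hypothetical (it is not part of Pandas), and the latin1 sample files are invented so the sketch runs by itself.

```python
import glob
import os
import tempfile

import pandas as pd


def read_many(pattern, **read_csv_kwargs):
    """Hypothetical helper: read every file matching pattern into one DataFrame."""
    paths = sorted(glob.glob(pattern))  # sorted so row order is repeatable
    # Any keyword arguments are handed straight to pd.read_csv().
    frames = [pd.read_csv(path, **read_csv_kwargs) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Write two latin1-encoded sample files (contents are illustrative).
tmp_dir = tempfile.mkdtemp()
samples = {"data1.csv": "name\ncafé\n", "data2.csv": "name\nnaïve\n"}
for filename, text in samples.items():
    with open(os.path.join(tmp_dir, filename), "wb") as fh:
        fh.write(text.encode("latin1"))

df = read_many(os.path.join(tmp_dir, "data*.csv"), encoding="latin1")
print(df)
```

One thing to keep in mind with either form: if the pattern matches no files, pd.concat() receives an empty list and raises a ValueError.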

Turning into the Oracle of One-Liners shouldn’t be anyone’s goal. Yet, reading in data is something that happens so frequently that it feels like an ideal use case. Find the files I want, read them in how I want, and…boom! One nice compact dataframe.