Reading multiple CSVs into Pandas is fairly routine. However, there isn’t one clearly right way to perform this task. This often leads to a lot of interesting attempts with varying levels of exoticism.
In an effort to push my own agenda I’m documenting my process.
One of the cooler features of Dask, a Python library for parallel computing, is the ability to read in CSVs by matching a pattern.
```python
import dask.dataframe as dd

df = dd.read_csv('data*.csv')
```
This small quirk ends up solving quite a few problems. Despite this, the raw power of Dask isn’t always required, so it’d be nice to have a Pandas equivalent.
The Python module `glob` provides Unix-style pathname pattern expansion. Therefore, `glob.glob('*.gif')` will give us all the `.gif` files in a directory as a list.
```
my_data/
    data1.csv
    data2.csv
    data3.csv
    data4.csv
```
```python
import glob

files = glob.glob('data*.csv')
# ['data1.csv', 'data2.csv', 'data3.csv', 'data4.csv']
```
Another way to attack this problem is with the `os` module.
```
my_data/
    data_blah.csv
    more_data.csv
    yet_another.csv
    lots_of_data.csv
```
```python
import os

files = os.listdir()
# ['data_blah.csv', 'more_data.csv', 'yet_another.csv', 'lots_of_data.csv']
```
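One caveat: unlike `glob`, `os.listdir` returns *every* entry in the directory, not just the CSVs, so the result usually needs filtering. A quick sketch using the standard-library `fnmatch` module (the `notes.txt` entry here is a made-up extra file, just to show the filtering at work):

```python
import fnmatch

# os.listdir returns every entry in the directory, not just CSVs,
# so filter the listing down to the names we actually want.
entries = ['data_blah.csv', 'more_data.csv', 'yet_another.csv', 'notes.txt']  # hypothetical listing
csv_files = fnmatch.filter(entries, '*.csv')  # glob-style matching applied to a plain list
# ['data_blah.csv', 'more_data.csv', 'yet_another.csv']
```

`fnmatch.filter` gives you the same `*`-style patterns as `glob`, but applied to a list you already have in hand.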
Now comes the fun part. More or less, this dance usually boils down to two functions: `pd.read_csv()` and `pd.concat()`. There are a variety of ways to call them; however, I feel this is a scenario in which a little cleverness is apt.
```python
import glob

import pandas as pd

# glob.glob('data*.csv')        - returns List[str]
# pd.read_csv(f)                - returns pd.DataFrame
# for f in glob.glob(...)       - builds a List[pd.DataFrame]
# pd.concat(...)                - returns one pd.DataFrame
df = pd.concat([pd.read_csv(f) for f in glob.glob('data*.csv')],
               ignore_index=True)
```
The real beauty of this method is that it still allows you to configure how you read in your `.csv` files. For instance, if our encoding were `latin1` instead of the default `utf-8`:
```python
import glob

import pandas as pd

df = pd.concat([pd.read_csv(f, encoding='latin1') for f in glob.glob('data*.csv')],
               ignore_index=True)
```
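One handy variation on the same one-liner (not in the original write-up) is recording which file each row came from, via `.assign()`. The sketch below writes two toy CSVs into a temporary directory purely so it runs self-contained; the file names and columns are made up for illustration:

```python
import glob
import os
import tempfile

import pandas as pd

# Build two toy CSVs in a temp directory so the sketch is self-contained.
tmp = tempfile.mkdtemp()
for name, contents in [('data1.csv', 'a,b\n1,2\n'), ('data2.csv', 'a,b\n3,4\n')]:
    with open(os.path.join(tmp, name), 'w') as fh:
        fh.write(contents)

# Same concat-over-comprehension pattern, plus .assign() to tag
# each row with the file it was read from.
df = pd.concat(
    [pd.read_csv(f).assign(source=os.path.basename(f))
     for f in sorted(glob.glob(os.path.join(tmp, 'data*.csv')))],
    ignore_index=True,
)
```

The `sorted()` call is there because `glob` makes no ordering guarantee; sorting keeps the concatenated rows in a predictable file order.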
Turning into the Oracle of One-Liners shouldn’t be anyone’s goal. Yet, reading in data is something that happens so frequently that it feels like an ideal use case. Find the files I want, read them in how I want, and…boom! One nice compact dataframe.