
Reading CSVs into Pandas is fairly routine. However, there isn’t one clearly right way to perform this task. This often leads to a lot of interesting attempts with varying levels of exoticism.

In an effort to push my own agenda, I’m documenting my process.

One of the cooler features of Dask, a Python library for parallel computing, is the ability to read in CSVs by matching a pattern.

```python
import dask.dataframe as dd

df = dd.read_csv('data*.csv')
```

This small quirk ends up solving quite a few problems. Despite this, the raw power of Dask isn’t always required, so it’d be nice to have a Pandas equivalent.

## Glob

The Python standard-library module `glob` provides Unix-style pathname pattern expansion. Calling `glob.glob('*.gif')`, for example, returns a list of all the `.gif` files in a directory.

```txt
my_data/
    data1.csv
    data2.csv
    data3.csv
    data4.csv
```

```python
import glob

files = glob.glob('data*.csv')
# ['data1.csv', 'data2.csv', 'data3.csv', 'data4.csv']
```

## OS

Another option is the `os` module, though `os.listdir()` returns every entry in the directory rather than matching a pattern.

```txt
my_data/
    data_blah.csv
    more_data.csv
    yet_another.csv
    lots_of_data.csv
```

```python
import os

files = os.listdir()
# ['data_blah.csv', 'more_data.csv', 'yet_another.csv', 'lots_of_data.csv']
```
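Because `os.listdir()` returns everything in the directory, not just CSVs, a pattern filter is usually needed to match glob-style behaviour. One way is the standard-library `fnmatch` module; here is a minimal sketch, using a made-up list of file names for illustration:

```python
import fnmatch

# Hypothetical directory contents, including non-CSV entries.
names = ['data_blah.csv', 'more_data.csv', 'notes.txt', 'script.py']

# fnmatch.filter() keeps only the names matching the Unix-style pattern.
csvs = fnmatch.filter(names, '*.csv')
# ['data_blah.csv', 'more_data.csv']
```

In practice you would pass `os.listdir()` in place of the hand-written list.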

## One-Liner

Now comes the fun part. This dance usually boils down to two functions: pd.read_csv() and pd.concat(). There are plenty of ways to call them, but this is a scenario where a little cleverness pays off.

```python
import glob

import pandas as pd

# glob.glob('data*.csv') returns a list of matching file names
df = pd.concat([pd.read_csv(f) for f in glob.glob('data*.csv')], ignore_index=True)
```

The real beauty of this method is that it still lets you configure how each `.csv` file is read. For instance, if our encoding was latin1 instead of UTF-8:

```python
import glob

import pandas as pd

df = pd.concat([pd.read_csv(f, encoding='latin1') for f in glob.glob('data*.csv')], ignore_index=True)
```
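One caveat worth knowing: `glob.glob()` returns paths in arbitrary, filesystem-dependent order, so wrapping the result in `sorted()` makes the row order of the concatenated frame deterministic. A small self-contained sketch (it writes two throwaway CSVs to a temporary directory purely for demonstration):

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write the files out of numeric order on purpose.
    for i in (2, 1):
        pd.DataFrame({'x': [i]}).to_csv(os.path.join(tmp, f'data{i}.csv'), index=False)

    # sorted() pins the concatenation order regardless of what glob returns.
    files = sorted(glob.glob(os.path.join(tmp, 'data*.csv')))
    df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# df['x'] is [1, 2] no matter which file the filesystem lists first
```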