Recently, I stumbled upon a new package that drastically improved my workflow: pandas_redshift. This package abstracts away a lot of the nitty-gritty details of interacting with your Redshift cluster, and every query comes back as a pandas DataFrame by default. That one-two punch streamlines a variety of daily tasks.
In this post, I'll cover getting set up and the basics of using pandas_redshift. First, create a project environment and install the package with pipenv:
# Will generate a Pipfile and virtualenv using python3
pipenv --three

# Will install + place package into Pipfile
pipenv install pandas-redshift
Up & Running
Now that we have a virtualenv and our package, we can connect to Redshift. First, though, we need to activate the virtualenv by running pipenv shell. This command drops us into the underlying virtualenv with the installed package available, the same as activating a typical virtualenv directly.
# Invoke shell, activate virtualenv and load package
pipenv shell
Next up, you'll want to enter your cluster configuration details.
import os

import pandas_redshift as pr

pr.connect_to_redshift(dbname=os.environ.get('db_name'),
                       host=os.environ.get('host'),
                       port=os.environ.get('port'),
                       user=os.environ.get('user'),
                       password=os.environ.get('password'))
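If you'd rather not export those variables by hand every session, pipenv automatically loads a .env file from the project root when you run pipenv shell or pipenv run. A minimal sketch is below; the values are placeholders, and the variable names simply match the os.environ.get calls above (5439 is Redshift's default port).

# .env - picked up automatically by pipenv shell / pipenv run
# Placeholder values; substitute your own cluster details.
db_name=analytics
host=examplecluster.abc123xyz789.us-east-1.redshift.amazonaws.com
port=5439
user=awsuser
password=my_secret_password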
Before going any further, run a simple query to make sure everything is copacetic.
data = pr.redshift_to_pandas('select * from my_data.feb18_expenses limit 10')
data.head()
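Because redshift_to_pandas hands back an ordinary pandas DataFrame, the rest of the pandas toolbox applies immediately. As a rough sketch, you could summarize spend by category; note that the category and amount columns are assumptions about the example expenses table, not something the library provides.

# Summarize the example expenses table by category.
# Column names (category, amount) are assumed for illustration.
expenses = pr.redshift_to_pandas(
    'select category, amount from my_data.feb18_expenses'
)
summary = expenses.groupby('category')['amount'].sum().sort_values(ascending=False)
print(summary)

When you're done, pandas_redshift also provides pr.close_up_shop() to close out the connection.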