Redshift To Pandas

2 mins read

Recently, I stumbled upon a package that drastically improved my workflow - pandas_redshift. This package abstracts away a lot of the nitty-gritty details of interacting with your Redshift cluster. Additionally, all queries are returned as Pandas dataframes by default. This one-two punch helps to streamline a variety of daily tasks.

In this post, I’ll cover getting set up and the basics of using pandas_redshift.

Installation

First things first, let’s follow some best practices and set up a virtualenv for our project with Pipenv.

# Will generate a Pipfile and virtualenv using python3
pipenv --three

# Will install + place package into Pipfile
pipenv install pandas-redshift

Up & Running

Now that we have a virtualenv and our package installed, we can connect to Redshift. First, though, we need to activate the virtualenv by running pipenv shell. This command drops us into a shell with the installed package available - the same effect as the typical source venv/bin/activate.

# Invoke shell, activate virtualenv and load package
pipenv shell

Next up, you’ll want to enter your cluster configuration details. The snippet below reads them from environment variables rather than hard-coding credentials.

import os
import pandas_redshift as pr

pr.connect_to_redshift(dbname=os.environ.get('db_name'),
                       host=os.environ.get('host'),
                       port=os.environ.get('port'),
                       user=os.environ.get('user'),
                       password=os.environ.get('password'))
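Since the connection details come from environment variables, a quick guard can surface a missing variable up front instead of a cryptic connection error later. Here's a minimal sketch - the helper name is made up, and the variable list simply mirrors the snippet above; none of this is part of the pandas_redshift API:

```python
import os

# The environment variables read by the connect_to_redshift call above.
REQUIRED_VARS = ['db_name', 'host', 'port', 'user', 'password']

def missing_redshift_vars(env=None):
    """Return the names of any required connection variables that are unset."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Example with an incomplete mapping: 'user' and 'password' are absent.
partial = {'db_name': 'dev', 'host': 'my-cluster.example.com', 'port': '5439'}
print(missing_redshift_vars(partial))
```

Calling this before pr.connect_to_redshift makes the failure mode obvious when a variable was never exported.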

Next, you should run a simple query to ensure that everything is copacetic.

data = pr.redshift_to_pandas('select * from my_data.feb18_expenses limit 10')
data.head()
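Because redshift_to_pandas returns an ordinary DataFrame, the full pandas toolkit applies to the result immediately. A small illustration using a locally built stand-in frame - the category and amount columns are invented for this example, not taken from the expenses table above:

```python
import pandas as pd

# Stand-in for the query result above; the column names are hypothetical.
data = pd.DataFrame({
    'category': ['travel', 'meals', 'travel', 'office'],
    'amount': [120.0, 45.5, 80.0, 15.0],
})

# Ordinary pandas operations work directly on the returned frame.
totals = data.groupby('category')['amount'].sum().sort_values(ascending=False)
print(totals)
```

No conversion step is needed between the query and this kind of analysis, which is the main draw of the package.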

Use Cases

  • Moving data from Redshift to S3 with advanced transformations
  • Leveraging new data in Redshift for models being served via a Lambda
  • Pulling data into web apps like Dash