Setting up your Python environment for Think Stats

Update: I’ve added all my current exercises to a GitHub repo. If you want to see how I’m working with the project structure I describe here, that’s a good place to start.

I’ve been reading Allen Downey’s Think Stats lately. If you’re not familiar with it, this is a book that purports to teach statistics to programmers. Beginning in the first chapter, readers are invited to write code that helps them explore the statistical concepts being discussed. I’m learning a fair amount, and it’s pretty well-written.

If you follow the link above, you can find the book in PDF and HTML formats.

One area where the book could improve is in following best practices for working with Python. I started working through the examples with the project structure Downey recommends, but that quickly became unwieldy. So I restructured my project and thought I’d share. I had the following goals:

achieve code/data separation
treat official code as libraries, not suggestions I’m free to change
use tools like virtualenv and pip (discussed later)
be able to put my own code under version control without having to add too many things to .gitignore

A few notes: First, this post isn’t intended to be a criticism of Downey’s work. In fact, if you’re not interested in Python per se and just want to learn statistics, you should probably ignore this post. Downey’s text is solid and should work on its own. Following these instructions might only be a distraction to you. Second, this tutorial assumes you’re familiar with at least the first chapter of the book. (Maybe you’ve gotten through it and started hesitating about the structure as I have.) I won’t be providing exercise solutions in this post. Third, I assume you’re running on Linux or OS X. Some of the details may be different for Windows users.

Basic directory structure

To start, I split up the project like this:


think_stats/
    data/
    official_code/
    exercises/

In this setup all of Downey’s libraries will be in the official_code directory and all of the reader’s code will be in exercises. Go ahead and make those directories now.
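If you’d rather script that step than type three mkdir commands, here’s a minimal sketch in Python 3. The make_layout name is my own invention for illustration; the directory names are the ones from the layout above.

```python
from pathlib import Path


def make_layout(root):
    """Create data/, official_code/ and exercises/ under root.

    Safe to call more than once: existing directories are left alone.
    Returns the sorted names of the subdirectories found under root.
    """
    root = Path(root)
    for name in ("data", "official_code", "exercises"):
        (root / name).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Calling make_layout("think_stats") from wherever you keep your projects should leave you with the structure shown above.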

Downloading required files

Next, we’re going to want to get all of the libraries for the book. cd to your official_code directory and do the following:
$ wget http://greenteapress.com/thinkstats/thinkstats.code.zip
$ unzip thinkstats.code.zip

Your official_code directory should now have all of the custom code used in the book.

Now let’s get the datasets we’ll be using. Downey’s got a special page he wants readers to go through to get the files. It’s possible to bypass this, but I think that would make you a jerk. So navigate to that page, download the three files (2002FemPreg.dat.gz, 1.0 MB; 2002FemResp.dat.gz, 3.2 MB; and 2002Male.dat.gz, 1.4 MB), and put each of them in your data directory.

Executing official code

We should be able to run Downey’s first script, survey.py, now. Let’s try it (working from our top-level think_stats directory):


$ python official_code/survey.py 
Traceback (most recent call last):
  File "official_code/survey.py", line 195, in <module>
    main(*sys.argv)
  File "official_code/survey.py", line 186, in main
    resp.ReadRecords(data_dir)
  File "official_code/survey.py", line 108, in ReadRecords
    self.ReadFile(data_dir, filename, self.GetFields(), Respondent, n)
  File "official_code/survey.py", line 45, in ReadFile
    fp = gzip.open(filename)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 34, in open
    return GzipFile(filename, mode, compresslevel)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 89, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
IOError: [Errno 2] No such file or directory: './2002FemResp.dat.gz'

Whoops! By default, the survey module looks for the data files in the current directory. Luckily, it’s already written such that it can take the location of the datasets as an argument. So let’s try it again this way:
$ python official_code/survey.py data/
Number of respondents 7643
Number of pregnancies 13593

Excellent.
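In case it’s not obvious why passing data/ works: the traceback shows survey.py calling main(*sys.argv), so anything after the script name on the command line becomes an extra argument to main. Here’s a stripped-down sketch of that pattern in Python 3. This is my guess at the shape of it, not Downey’s actual code; the "." default is an assumption based on the './2002FemResp.dat.gz' path in the error.

```python
import os


def main(name, data_dir="."):
    """Return the path the script would try to open.

    name     -- sys.argv[0], the script name (unused here)
    data_dir -- sys.argv[1] if given, otherwise the current directory
    """
    return os.path.join(data_dir, "2002FemResp.dat.gz")


# In a real script the last lines would be something like:
#     import sys
#     main(*sys.argv)
# so "python survey.py data/" ends up calling main("survey.py", "data/").
```

With no argument the path resolves relative to wherever you ran the script from, which is exactly what bit us in the first attempt.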

Executing exercise code

For Exercise 1-3, the reader is supposed to write a module called first.py. Assuming you’ve already written it, put it in your exercises directory. This won’t work immediately:


$ python exercises/first.py 
Traceback (most recent call last):
  File "exercises/first.py", line 1, in <module>
    import survey
ImportError: No module named survey

The problem here is that the code in first.py doesn’t know where to find the survey module. We can fix this with the PYTHONPATH environment variable:
$ PYTHONPATH=official_code/:$PYTHONPATH python exercises/first.py data/
Number of pregnancies: 13593
Number of live births: 9148
Number of first births: 4413
Number of other births: 4735
Avg length of first births: 38.6009517335
Avg length of other births: 38.5229144667
Difference: 0.0780372667775 weeks (13.1102608186 hours)
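If you want to convince yourself that PYTHONPATH is really what makes the import work, here’s a small experiment, sketched for Python 3.7 or later. It starts a child interpreter with a directory added to PYTHONPATH and reports whether an import succeeds. The can_import helper and the fake_survey module are made up for the demonstration.

```python
import os
import subprocess
import sys
import tempfile


def can_import(module_name, extra_dir=None):
    """Try 'import module_name' in a fresh interpreter.

    If extra_dir is given, it is prepended to PYTHONPATH for the child
    process only; the current process's environment is not modified.
    """
    env = dict(os.environ)
    if extra_dir is not None:
        existing = env.get("PYTHONPATH", "")
        env["PYTHONPATH"] = extra_dir + (os.pathsep + existing if existing else "")
    result = subprocess.run(
        [sys.executable, "-c", "import " + module_name],
        env=env,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Write a throwaway module into a directory Python doesn't know about.
        with open(os.path.join(tmp, "fake_survey.py"), "w") as f:
            f.write("ANSWER = 42\n")
        print(can_import("fake_survey", tmp))   # directory on PYTHONPATH
        print(can_import("fake_survey"))        # directory not on PYTHONPATH
```

Python adds every entry in PYTHONPATH to sys.path at startup, which is why prefixing the command with PYTHONPATH=official_code/ is enough for import survey to resolve.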

Note: the trick to making first.py work with data in another directory is to pass the location of data to your table’s ReadRecords method. Here are the first few lines of my main execution logic:


if __name__ == '__main__':
    data_dir = sys.argv[1]
    table = survey.Pregnancies()
    table.ReadRecords(data_dir)
    ...

The full source for my first.py is here.
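One thing the snippet above doesn’t handle is a missing or mistyped argument: sys.argv[1] raises an IndexError if you forget it. Here’s a hedged sketch of a small guard you could put in front of ReadRecords. The resolve_data_dir name and the "." fallback are my own choices, not something from the book.

```python
import os


def resolve_data_dir(argv, default="."):
    """Pick the data directory from an argv-style list.

    Falls back to default when no argument was given, and exits with a
    readable message if the chosen directory does not exist.
    """
    data_dir = argv[1] if len(argv) > 1 else default
    if not os.path.isdir(data_dir):
        raise SystemExit("data directory not found: %s" % data_dir)
    return data_dir
```

In first.py you would then write data_dir = resolve_data_dir(sys.argv) in place of data_dir = sys.argv[1].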

Using virtualenv to avoid setting `PYTHONPATH` all the time

Finally, if you’re working in Python, you should be using virtualenv and virtualenvwrapper. Both projects are popular and well-documented, so I’ll leave installing them to you. Please leave a comment if you would find an introduction to using these helpful.

Using this combination will be useful later in the book (e.g., when we’re asked to install matplotlib). So let’s go ahead and get an environment set up for this book. On my Mac, that looks like this:


$ mkvirtualenv think_stats
New python executable in think_stats/bin/python
Installing setuptools............done.
Installing pip...............done.
virtualenvwrapper.user_scripts creating /Users/jeremyboyd/.virtualenvs/think_stats/bin/predeactivate
virtualenvwrapper.user_scripts creating /Users/jeremyboyd/.virtualenvs/think_stats/bin/postdeactivate
virtualenvwrapper.user_scripts creating /Users/jeremyboyd/.virtualenvs/think_stats/bin/preactivate
virtualenvwrapper.user_scripts creating /Users/jeremyboyd/.virtualenvs/think_stats/bin/postactivate
virtualenvwrapper.user_scripts creating /Users/jeremyboyd/.virtualenvs/think_stats/bin/get_env_details

We’re going to use virtualenvwrapper’s postactivate and predeactivate hooks to modify our PYTHONPATH when we enter and exit the think_stats environment. These are the files in the environment’s bin directory listed in the output above. First, do yourself a favor and get out of the virtualenv you just set up by running the deactivate command. Then modify your virtualenv’s postactivate file to include the following two lines:
export OLD_PYTHONPATH=$PYTHONPATH
export PYTHONPATH=/path/to/think_stats/official_code:$PYTHONPATH

And its predeactivate file to include this line:
export PYTHONPATH=$OLD_PYTHONPATH

You probably don’t need to do the OLD_PYTHONPATH bits if you don’t already modify your PYTHONPATH variable. The most important thing here is to add your official_code directory. Here’s the result after I made those changes:
$ env | grep PYTHONPATH
$ workon think_stats
$ env | grep PYTHONPATH
PYTHONPATH=/Users/jeremyboyd/Dropbox/think_stats/official_code:
OLD_PYTHONPATH=
$ python exercises/first.py data/
Number of pregnancies: 13593
Number of live births: 9148
Number of first births: 4413
Number of other births: 4735
Avg length of first births: 38.6009517335
Avg length of other births: 38.5229144667
Difference: 0.0780372667775 weeks (13.1102608186 hours)
$ deactivate
$ env | grep PYTHONPATH
PYTHONPATH=
OLD_PYTHONPATH=

(Note that I didn’t originally have PYTHONPATH set to anything special, so I could probably have gone without tracking its original value.)
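Grepping the environment tells you the variable is set, but not that Python can actually find the module. Here’s a quick check you could run from inside and outside the virtualenv, sketched for Python 3. The where_is helper is a hypothetical name of mine; importlib.util.find_spec is the standard-library call doing the work.

```python
import importlib.util


def where_is(module_name):
    """Return the file a top-level module would be loaded from, or None.

    Only looks the module up; it does not import it or run its code.
    """
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    print(where_is("survey"))
```

After workon think_stats this should point at a file under your official_code directory, and after deactivate it should print None, assuming nothing else on your system provides a module called survey.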

Conclusion

I’ve used this setup to get through Chapter 2 of the book, and it has worked very well once I started writing my exercise code to accept the data directory location. Recapping my original goals, I like this setup for a few reasons:

Data and code are physically separated from each other.
I’m not even tempted to modify the official code. I can treat it as a set of libraries rather than a part of my applications.
I can use best-practice tools like virtualenv and pip to maintain separation from my normal development environment.
I can keep my own exercises under version control without tracking all of their dependencies.

There are undoubtedly other ways to handle some of these concerns, but this was a simple, quick approach to the problem I was trying to solve.