Why is Celery only giving tasks to one preforked worker?

I had a situation come up today at Experiment Engine that took a surprising amount of time to debug, so I’ll leave a note here in case it helps someone else — even if that “someone else” is me in the future.

We run our Celery workers with preforked child processes (using `-c $WORKER_CONCURRENCY` since we store our config in the environment). However, it looked like only one worker process (say, Process A) was executing the tasks:

-> send T1 to Process A 
# A executes T1
-> send T2 to Process A
# A's buffer can still hold tasks
-> send T3 to Process A
# A's buffer can still hold tasks
<- T1 complete
# A begins processing T2

# Processes B, C, D sit around looking for stuff to do

Since I could see that the parent process had 4 child processes configured, this was pretty confusing. But I could also see that the worker had reserved several more tasks to run; it just wasn’t sending them out to the other child processes. Instead, a queued task would just get run by Process A when it was finished with the current one. Frustrating.

It turns out that Celery’s prefetch behavior isn’t just on a parent process level. Prefetching actually occurs for the child processes as well. This is documented, but the example sketched out there isn’t the problem I was seeing, so I skipped over it. Our case looks more like what I showed above.

Anyway, the solution in the documentation is right: just enable the `-Ofair` flag when you start your Celery worker and your tasks will be distributed to child processes correctly.
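To make the difference concrete, here is a toy model of the two dispatch strategies. It is only a sketch, not Celery's actual scheduler: the `simulate` function, the fixed `buffer_size`, and the rule that the default-like mode always picks the first process with buffer space are simplifying assumptions made for illustration.

```python
import heapq


def simulate(durations, n_procs, fair, buffer_size=4):
    """Toy model of task dispatch to preforked children (not Celery's real code).

    Returns (assigned, makespan): the task indexes each process ran, and the
    time at which the last task finished. All tasks arrive at time 0.
    """
    assigned = [[] for _ in range(n_procs)]

    if fair:
        # Fair mode: hand each task to whichever process becomes idle first.
        free_at = [(0.0, pid) for pid in range(n_procs)]
        heapq.heapify(free_at)
        for task_id, duration in enumerate(durations):
            start, pid = heapq.heappop(free_at)
            assigned[pid].append(task_id)
            heapq.heappush(free_at, (start + duration, pid))
        return assigned, max(t for t, _ in free_at)

    # Default-like mode: write to the first process whose buffer has room,
    # even if that process is still busy with an earlier task.
    busy_until = [0.0] * n_procs
    for task_id, duration in enumerate(durations):
        pid = next(
            (p for p in range(n_procs) if len(assigned[p]) < buffer_size),
            None,
        )
        if pid is None:
            raise ValueError("toy model: every buffer is full")
        assigned[pid].append(task_id)
        busy_until[pid] += duration
    return assigned, max(busy_until)


durations = [5, 1, 1]  # T1 is slow; T2 and T3 are quick
print(simulate(durations, n_procs=4, fair=False))
print(simulate(durations, n_procs=4, fair=True))
```

In the first call all three tasks land on the first process, which mirrors what I was seeing. In the second, the quick tasks go to idle processes instead of waiting behind the slow one.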

Thanks to Michael Linder for helping me figure this out.

Installing PIL on 64-bit CentOS 5.8

I recently upgraded to a CentOS 5.8 VM built by another developer in our shop. Things were working smoothly until I came across an area of our codebase that relied on PIL. (Experienced Pythonistas may start groaning now.) The problem I ran into presented itself with this error:

IOError: decoder jpeg not available

From past experience, I knew that this was probably due to PIL’s being installed via pip without the right versions of libjpeg and libjpeg-devel installed. A quick uninstall/reinstall verified this was the case, since PIL’s installation helpfully tells you what formats are supported after installation.

The solution here is pretty straightforward if you Google around a bit: uninstall PIL again, do a sudo yum install libjpeg and sudo yum install libjpeg-devel, and then reinstall PIL. Right? But when I installed the dependencies, yum informed me that the VM was already up-to-date. (Note that this wasn’t a clean VM, and I have no idea whether they’re installed by default. In any case, if you’re running into this problem you should try the yum install steps just to get them out of the way.)

As it turns out, PIL’s installer only looks for libraries in /usr/lib/ (and a bunch of other places, but this is the relevant bit). But the libjpeg dependencies were actually installed to /usr/lib64/. So the trick is to augment PIL’s setup.py file. Here’s how I did it:

  1. pip uninstall PIL (if you haven’t already)
  2. pip install PIL --no-install (this will download the source but not install it)
  3. vi /path/to/virtualenv/build/PIL/setup.py (assuming you’re in a virtualenv)
  4. find the line that says ‘add_directory(library_dirs, "/usr/lib")’
  5. put ‘add_directory(library_dirs, "/usr/lib64")’ just above that line
  6. pip install PIL --no-download

Following these steps got me to a working state where PIL reported it had JPEG support.
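If you would rather not edit setup.py by hand, the manual edit in the steps above can be sketched in Python. This is a minimal sketch: `add_lib64_dir` is a hypothetical helper of mine, not part of PIL or pip, and it assumes the `add_directory(library_dirs, "/usr/lib")` line appears in setup.py exactly as quoted above.

```python
TARGET = 'add_directory(library_dirs, "/usr/lib")'
NEW_CALL = 'add_directory(library_dirs, "/usr/lib64")'


def add_lib64_dir(setup_source):
    """Return setup.py source with the /usr/lib64 line added above /usr/lib.

    Hypothetical helper: it works on the file's text and touches nothing on disk.
    """
    if NEW_CALL in setup_source:
        # Already patched; leave the source alone.
        return setup_source

    out = []
    patched = False
    for line in setup_source.splitlines(keepends=True):
        if not patched and TARGET in line:
            # Reuse the indentation of the line we are inserting above.
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + NEW_CALL + "\n")
            patched = True
        out.append(line)
    return "".join(out)


sample = '    add_directory(library_dirs, "/usr/lib")\n'
print(add_lib64_dir(sample))
```

You would read the real setup.py, pass its contents through the function, and write the result back before running step 6.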

Props to StackOverflow user ele, whose related question and answer set me on the right path.

Tips on Python and upgrading to Mountain Lion

This weekend I upgraded one of my machines to Mountain Lion. My system-level Python is pretty bare: I typically just keep pip, virtualenv, virtualenvwrapper, and a few associated libraries installed there. So, as I worked through a few errors I decided to follow this advice from the Hitchhiker’s Guide to Python and begin using Homebrew to manage my Python installation. What follows is a brief list of steps I used to troubleshoot the issues I ran into during the process:

First, we’ll want to install Python:

$ brew install python

Homebrew installs python, easy_install, and pip, so you’ll bootstrap quickly.

Next, add “export PATH=/usr/local/bin:/usr/local/share/python:$PATH” to your ~/.bash_profile. Homebrew describes /usr/local/share/python as an optional addition, but things like virtualenv, virtualenvwrapper, etc., get added there, so you’ll definitely want to include it.

Now, if you previously installed virtualenv, it’s probably pointing to /usr/bin/python. You can verify this by just taking a look at the hashbang in, for example, /usr/local/bin/virtualenv. Let’s get rid of those so they don’t shadow the versions we’re about to install via our brewed pip:

$ rm /usr/local/bin/virtualenv*
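If you want to check several scripts before removing anything, the hashbang inspection can be sketched in a few lines of Python. `shebang_interpreter` is a hypothetical helper name; it only reads the first line of the file and makes no attempt to resolve `env`-style hashbangs.

```python
def shebang_interpreter(path):
    """Return the text after '#!' on the first line of a script, or None.

    Hypothetical helper for illustration: it reads the file but never
    modifies it.
    """
    with open(path, "rb") as script:
        first_line = script.readline()
    if not first_line.startswith(b"#!"):
        return None
    return first_line[2:].decode("utf-8", "replace").strip()
```

Calling it on a legacy /usr/local/bin/virtualenv script should show whether that script is still tied to /usr/bin/python.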

Now we can install the virtualenv ecosystem. You can do this pretty quickly with just:

$ pip install virtualenvwrapper

This will install virtualenv, virtualenvwrapper, and virtualenv-clone.

Finally, add “source /usr/local/share/python/virtualenvwrapper_lazy.sh” to your ~/.bash_profile. You can use the non-lazy version if you want, but on my machine new shells spin up a lot faster with the lazy version.

This process solved an issue I was experiencing where virtualenvwrapper seemed to have run correctly on login, but I was unable to use mkvirtualenv or virtualenv itself. That problem presented itself as a pkg_resources.DistributionNotFound exception (traceback). It was apparent that virtualenv was relying on the system Python, but I couldn’t figure out why. The legacy /usr/local/bin/virtualenv was the culprit.

Thanks to Thomas Wouters (Yhg1s on #python) for his help in figuring this out.

Setting up your Python environment for Think Stats

Update: I’ve added all my current exercises to a GitHub repo. If you want to see how I’m working with the project structure I describe here, that’s a good place to start.

I’ve been reading Allen Downey’s Think Stats lately. If you’re not familiar with it, this is a book that purports to teach statistics to programmers. Beginning in the first chapter, readers are invited to write code that helps them explore the statistical concepts being discussed. I’m learning a fair amount, and it’s pretty well-written.

If you follow the link above, you can find the book in PDF and HTML formats.

One area of improvement for the book is in using best practices for working with Python. I initially started working through the examples with the project structure Downey recommends, but that quickly became unwieldy. So I restructured my project and thought I’d share. I had the following goals:

  • achieve code/data separation
  • treat official code as libraries, not suggestions I’m free to change
  • use tools like virtualenv and pip (discussed later)
  • be able to put my own code under version control without having to add too many things to .gitignore

A few notes: First, this post isn’t intended to be a criticism of Downey’s work. In fact, if you’re not interested in Python per se and just want to learn statistics, you should probably ignore this post. Downey’s text is solid and should work on its own. Following these instructions might only be a distraction to you. Second, this tutorial assumes you’re familiar with at least the first chapter of the book. (Maybe you’ve gotten through it and started hesitating about the structure as I have.) I won’t be providing exercise solutions in this post. Third, I assume you’re running on Linux or OS X. Some of the details may be different for Windows users.

Installing PostgreSQL and psycopg2 in a virtualenv on Lion

I had the pleasure today of installing PostgreSQL and psycopg2 in a virtualenv on Lion. Here’s what I did, just so I remember in the future.

Note: These instructions assume a clean virtualenv. So if you’ve already attempted to install psycopg2 without PostgreSQL installed, and it failed, you should probably blow away your virtualenv altogether before attempting the steps below. I’m not sure why a previous failure makes further attempts fail, but I believe it’s related to setup.cfg having already been partially written. At any rate, here are the steps I took:

  1. Install PostgreSQL via the provided binary. The installer will ask you to reboot. Once that’s done, run the installer again. This should actually install the postgres binaries — likely under /Library/PostgreSQL/<version>/.
  2. workon the relevant virtualenv.
  3. Attempt pip install psycopg2. This should fail but will create a psycopg2 directory under your virtualenv’s build directory. (Note: I’m not sure this step is required, but it’s the order in which I proceeded.)
  4. Edit <virtualenv>/build/psycopg2/setup.cfg with the following lines:
    • include_dirs=/Library/PostgreSQL/9.1/bin (around line 35)
    • library_dirs=/Library/PostgreSQL/9.1/lib (around line 46)
  5. Re-run pip install psycopg2.
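The setup.cfg edit in step 4 can also be sketched with Python's standard configparser module. Treat this as an illustration under assumptions: `set_build_dirs` is a hypothetical helper, the `build_ext` section name is my assumption about where these options live in psycopg2's setup.cfg, and configparser drops comments when it writes the file back out, so editing by hand is safer if you want to keep them.

```python
import configparser
import io


def set_build_dirs(cfg_text, include_dir, library_dir, section="build_ext"):
    """Return setup.cfg text with include_dirs and library_dirs set.

    Hypothetical helper; the default section name is an assumption.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, "include_dirs", include_dir)
    parser.set(section, "library_dirs", library_dir)

    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Paths copied from the steps above; adjust the version to match your install.
print(
    set_build_dirs(
        "[build_ext]\n",
        "/Library/PostgreSQL/9.1/bin",
        "/Library/PostgreSQL/9.1/lib",
    )
)
```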

New Year’s Python meme: 2012 edition

Partially as a year-in-review kind of action, and partially to reinvigorate my writing, I thought I’d participate in Tarek Ziade’s New Year’s Python meme:

1. What’s the coolest Python application, framework or library you have discovered in 2011?

Flask. Django was the only Python web framework I’d worked with until this last November, so it was nice to see things from Flask’s much more minimalistic perspective. I know Flask isn’t news to anyone in the Python world, but I suspect there are a lot of people who like or are simply comfortable with the kitchen-sink approach of Django and haven’t seen what mini frameworks can do. I was one of those until very recently.

For the record, I haven’t built much in Flask. But I did get involved in a rapid-prototyping exercise where Django’s complexity would only have gotten in the way. Flask’s simplicity let me get a simple REST-ish API up and running in a matter of hours from the point of introduction. I may continue using it beyond the prototype stage, or I may branch out even further. But I’m glad I dove in as much as I did.

2. What new programming technique did you learn in 2011?

Message queues. Specifically, I got to work with celery and django-celery to offload things like external API interactions within our Django apps. I’d done similar previous work when doing batch processing on a mainframe, so the idea of offloading computationally expensive work wasn’t new. But doing it within Python was.

I learned at the same time that MQs can’t be the solution to all of your problems. I’ve seen MQs back up by orders of magnitude due to major, production-crippling failures elsewhere. And in cases like that, you often don’t want a large backlog of work sitting in the queue to be processed later.

3. What’s the name of the open source project you contributed the most in 2011? What did you do?

Sadly, I only contributed once — to Django. It was at an early stab at sprinting (leading up to AWPUG’s foundation), and we collectively worked on this bug. The patch itself is pretty simple, but I got to see more of Django’s internals than I previously had, and it gave me a chance to meet folks I’ve come to really enjoy working with.

But I also worked extensively on the PyTexas conference, which — though not strictly an “open source project” — represented the bulk of my contribution to the community at large. I’m really looking forward to this next year and some of the things we might be able to do.

4. What was the Python blog or website you read the most in 2011?

Like many who’ve participated in this, Planet Python and the Python subreddit have been my go-to resources this year.

5. What are the three top things you want to learn in 2012?

  • MongoDB and pymongo: We use these at the day job, and right now I shy away from them just out of ignorance and fear. This is silly and must be fixed.
  • an async framework (likely Tornado): This kind of development is paradigmatically different from other things I’ve done, but a lot of people think it’s a good idea. That’s reason enough for me. I can also think of a few good use cases for it in work I’ll be doing in the near future.
  • Python packaging: I’ve run into a lot of cases this year where better knowledge in this area would have been useful. Every time I’ve needed to maintain something internal developed by someone who “gets” packaging, it’s taken me way more time than I feel is necessary. I’d love to know more about this area and contribute back to it if possible.

6. What are the top software, app or lib you wish someone would write in 2012?

  • I want to see microformats or some other kind of standardized interchange format for workout/fitness information. This is really specific to my work at MMF, but it would be enormously helpful. Every vendor in this space uses something different.
  • I want to see a platform for [redacted] in real-time. In my side project I’m working on this one right now. 😉
  • Is it okay to ask for docs? Because I’d love to see grok-able explanations for certain advanced topics (e.g., metaprogramming, packaging) become standard — such that I don’t have to Google “wtf is python metaprogramming” and dig through blog posts, because there’s just one (excellent) doc out there describing it.

 

Want to do your own list? Here’s how:

  1. Copy and paste the questions and answer to them in your blog
  2. Tweet it with the #2012pythonmeme hashtag

MapMyFitness sponsoring PyCon

I got word this week that my employer is officially sponsoring PyCon. I’m pretty stoked about this because I pushed pretty hard to make it happen. It doesn’t hurt that it’ll also be my first chance to attend PyCon. While I’m there, I’m hoping to learn a lot, meet a lot of people whose stuff I read daily on Planet Python and elsewhere, and recruit.

Speaking of which, if you’re a dev with experience in SOA or building SaaS platforms, MapMyFitness is hiring. Python experience is a plus but by no means required. We’re just looking for the right people for our team.

Unladen Swallow’s progress

One thing that caught my eye in the Euler test results is that Unladen Swallow comes in with a total time of 509.72 (seconds?) vs. CPython’s 569.37. That’s an improvement of about 10%. When you look at the wins, Unladen Swallow has 33 vs. CPython’s 51. That 5x improvement on CPython looks pretty far off.
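For anyone who wants to check the arithmetic, here is the calculation spelled out. The two totals are the figures quoted above; treating them as seconds is my assumption, though the percentages come out the same whatever the unit is.

```python
# Total times quoted above from the Euler test results (unit assumed to be seconds).
unladen_total = 509.72
cpython_total = 569.37

# Share of CPython's total time that Unladen Swallow saves.
time_saved_pct = (cpython_total - unladen_total) / cpython_total * 100

# The same comparison expressed as a speedup factor.
speedup = cpython_total / unladen_total

print(f"time saved: {time_saved_pct:.1f}%")
print(f"speedup: {speedup:.2f}x")
```

The speedup factor is the number to compare against the 5x target.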

Also, the Project Plan looks like it hasn’t been updated in nearly a year.

Lest this become another “Unladen Swallow is dead” post, I’ll also point out that Collin Winter has recently said they’re now targeting CPython 3.3. Not sure when that is, but 3.2a3 recently dropped, so it’s probably not too far off.

PyPy outperforming CPython and Psyco

David Ripton ran a bunch of implementations against his collection of Euler Challenge solutions:

And now PyPy is clearly the fastest Python implementation for this code, with both the most wins and the lowest overall time. Psyco is still pretty close. Both are a bit more than twice as fast as CPython.

I’d really like to see memory usage for these tests, too.