May 29, 2009

Where Next for PyCon?

PyCon 2010 will be in Atlanta, as I wrote a few weeks ago. No decision has been made about future venues, but given yesterday's release of the PyCon 2009 delegate questionnaire response data I thought it might be instructive to use Python to try and get a handle on public opinion about future locations. Question 13 asked "Where would you like to see PyCon 2010 or a future PyCon? Enter up to 3 cities or regions."

I copied the data into a spreadsheet for ease of manipulation, and did some munging to standardize presentation. I may have taken a few liberties here (such as replacing "California, Bay Area" with "San Francisco"), but I did so in an honest attempt to make sure that every important datum was counted. If the conference were to go to "San Francisco" the organizers would be very foolish to overlook other potential Bay Area locations, for example.

The Data
You can find the raw data I ended up with at http://holdenweb.com/files/wherenext.txt if you want to play about with them yourself. Here's a small section to give you the flavor:
New Jersey      Palm Springs    "Portland, OR"
New York "Portland, OR" "Portland, OR"
New York "Portland, OR" "Portland, OR"
New York "Portland, OR" San Diego
Northeast "Portland, OR" San Diego
Nowhere specific "Portland, OR" San Francisco
Phoenix Saint Paul San Francisco
"Portland, OR" San Francisco Seattle
It's simple tab-separated data, with gratuitous quote marks from Excel. The presentation makes it look like everyone made first second and third choices that were suspiciously close together in the alphabet, but that's only because I sorted each column independently to make the munging easier. I should probably have used the csv module to read the data, but foolishly chose to do it myself. Just the same, there isn't a lot of code.

The Program
#!/usr/bin/python
#
# Process PyCon feedback about future venues
#
# NOTE: Food for thought only ...
#
pd = {}
f = open("wherenext.txt")
for line in f:
places = [p.strip('"') for p in line.strip().split("\t")]
for rank, where in enumerate(places):
if not where in pd:
pd[where] = [0, 0, 0]
if where:
pd[where][rank] += 1

places = []
for where, [w1, w2, w3] in pd.items():
places.append((3*w1+2*w2+w3, where, [w1, w2, w3]))
for rank, where, scores in sorted(places):
print where, rank, scores
The first for loop simply iterates over the file. Each place's position in the line is the weighting that a voter gave it, so after splitting the line at each tab and removing any leading or trailing quotes I then use enumerate() to generate (rank, placename) pairs for all elements in the line. Some lines have only one or two entries because not everyone made three choices, but the formulation I used copes fine with that.

The placename of each pair is used to index a dict of [first, second, third] counts which counts the number of times a specific place was ranked in each position.

The next for loop is my favorite piece of this little program. It generates a new list of (weighted_score, placename, raw_scores) tuples ready to be sorted. The weightings I used (3 for a first choice, 2 for a second and 1 for a third) were completely arbitrary, so feel freee to change them in an attemto to fudge the results for your favored location.

Why did I like this particular loop so much? Since each item in the dict is a (placename, raw_scores) tuple I unpack the elements right in the for statement. It's such little elegances that endear Python to its fans: you can always do things in a straightforward way if you want, but as long as it doesn't interfere with readability I usually take advantage of such abbreviations.

I felt that a list comprehension would have been just a little bit too difficult to read, but you could easily replace the loop with one, and for further illegibility you could put it directly as the argument to sorted().

Finally I print out the results, with the lowest scoring first. Yes, I could have used a reverse() call to print out the favorite first, but then I'd have had to scroll the window back to know which city had "won". I could have formatted them better, too. Feel free, knock yourself out. I try to avoid over-specifying presentation when it's really the data I am interested in.

The Results
Given that the 2009 conference was held in Chicago I was unsurprised to see a strong vote for it, presumably by locals who might not be able to get to Atlanta, and certainly wouldn't find it as convenient. The only surprise was that it came in second rather than first! Portland, OR made a very strong showing, scoring only two points less, closely followed by Seattle with New York and Washington DC trailing rather further behind, But the clear winner was San Francisco, by coincidence the other location to make a strong bid for 2010 with Atlanta. Here are the results in their entirety.

Please remember that the data is from a self-selected group, and that this program is not binding on the PyCon organizers! Despite my best efforts I must have left a blank line in the data, so the empty string is recorded as having no scores! The if where condition should really have guarded the whole loop body, but it was an afterthought, and serves to remind us that sloppy design will lead to sub-optimal results.

0 [0, 0, 0]
Bay Area, CA 1 [0, 0, 1]
Cleveland 1 [0, 0, 1]
Houston, Texas 1 [0, 0, 1]
Knoxville, TN 1 [0, 0, 1]
Midwest 1 [0, 0, 1]
Oregon 1 [0, 0, 1]
Orlando, FL 1 [0, 0, 1]
Phoenix, Az 1 [0, 0, 1]
Pittsburgh, PA 1 [0, 0, 1]
Portland, ME 1 [0, 0, 1]
Twin Cities, MN 1 [0, 0, 1]
Virgin Islands 1 [0, 0, 1]
West coast 1 [0, 0, 1]
not next to an airport 1 [0, 0, 1]
vancouver 1 [0, 0, 1]
Boston Area 2 [0, 1, 0]
Cleveland, OH 2 [0, 1, 0]
East coast 2 [0, 1, 0]
Fort Collins, CO 2 [0, 1, 0]
Honalulu, HI 2 [0, 1, 0]
Kansas City, MO 2 [0, 1, 0]
Las Vegas, NV 2 [0, 1, 0]
Lawrence, Kansas 2 [0, 1, 0]
Los Angeles, CA 2 [0, 0, 2]
Manhattan 2 [0, 1, 0]
Minnesota 2 [0, 1, 0]
New York, NY 2 [0, 1, 0]
Palm Springs 2 [0, 1, 0]
Saint Paul 2 [0, 1, 0]
San Diego 2 [0, 0, 2]
Seatle 2 [0, 1, 0]
Toronto, ON 2 [0, 1, 0]
Vegas 2 [0, 1, 0]
huntsville, al 2 [0, 1, 0]
new orleans 2 [0, 1, 0]
Detroit, MI 3 [1, 0, 0]
East Coast 3 [1, 0, 0]
Europe 3 [1, 0, 0]
Houston, TX 3 [1, 0, 0]
Huntsville, AL 3 [1, 0, 0]
Kansas City 3 [1, 0, 0]
Madison, WI 3 [1, 0, 0]
Miami 3 [1, 0, 0]
Montreal 3 [1, 0, 0]
New Jersey 3 [1, 0, 0]
Nowhere specific 3 [1, 0, 0]
Phoenix 3 [1, 0, 0]
Raleigh/Durham, NC 3 [1, 0, 0]
Reno, NV 3 [1, 0, 0]
San Diego , CA 3 [1, 0, 0]
San Fransisco 3 [1, 0, 0]
San Jose 3 [1, 0, 0]
St. Louis, MO 3 [1, 0, 0]
Tucson, AZ 3 [1, 0, 0]
london 3 [1, 0, 0]
New Orleans 4 [0, 2, 0]
Colorado 5 [1, 1, 0]
Northeast 5 [1, 1, 0]
Canada 6 [2, 0, 0]
Las Vegas 6 [1, 0, 3]
Somewhere hot 6 [1, 1, 1]
Vancouver 6 [0, 3, 0]
Dallas, TX 7 [1, 2, 0]
Denver 7 [1, 2, 0]
Minneapolis 7 [2, 0, 1]
Toronto 7 [2, 0, 1]
California 9 [3, 0, 0]
Atlanta 10 [2, 2, 0]
Austin, TX 14 [1, 5, 1]
Boston 16 [2, 2, 6]
Washington, DC 16 [4, 2, 0]
New York 19 [3, 4, 2]
Seattle 27 [3, 7, 4]
Portland, OR 35 [5, 5, 10]
Chicago 37 [10, 2, 3]
San Francisco 45 [11, 5, 2]

2 comments:

Doug Napoleone said...

using the system we used in the past of weights: (1.0, 0.8, 0.6) that would mean :

SanFran: 16.2
Portland: 15.0
Chicago: 13.4
Seattle: 11.0

The problem with SanFran? Cost. The San Fran bid for 2010 was almost 3x the cost of Atlanta. With the recession and Calf. economic problems, I doubt those numbers still hold though.

René Dudfield said...

Looks like everyone wants to go to their own home town :)