January 14, 2015

Puget Sound Python User Group, Jan 14, 2015

The meeting started with one of the meeting's organizers, Tammy Lee, welcoming us to Dice Cabana. Alan Besner then took the MC slot to introduce tonight's speakers after a word from OfferUp's Arean, who explained that as the largest local mobile marketplace in the US they have collected some high volume data, and will be listening attentively to the talks. Unsurprisingly, OfferUp are hiring (their staff having grown about 400% in the last year).

Tammy announced that the group was interested in reaching out to members to help them with their professional development. The group has a new logo designed by New Relic's Jen Pierce. It tweets as @ps_python, and they have a LinkedIn group "to make things less socially awkward." Tammy will be happy to talk to any members who want to get involved: pugetsoundpython at gmail dot com.

The first speaker was Carlos Guestrin, founding CEO of Dato (formerly GraphLab), who is also Professor of Machine Learning at UW. His talk was about recent developments in machine learning. Companies have been collecting data for a long time, but early attempts at analytics were not really addressing the real questions.

In talking to over 300 businesses with interests in machine learning, Carlos has seen again and again a failure to eliminate what he called "duct tape and spit" work. Building tailored Hadoop systems is expensive and time-consuming, so we need a way to scale Python processes to the standard big-data ecosystem.

Dato's mission is to empower people to handle all sorts of data sources at scale, even from a laptop: users should never have out-of-memory issues. The GraphLab Create product allows building analytic tools that will move up to web scale with GraphLab Produce.

A first live demo was an image search, hosted on Amazon, matching with predictions made by a neural network running on Amazon infrastructure. An engineer who joined the team last October built this demo in a few days.

A second demo showed a recommender built from Amazon product data. It allowed two people to list their tastes and then found a way from a favored product of one user to a favored product of another.

Carlos is interested in allowing people to build their own applications, so he showed us an IPython Notebook that read in Amazon book review data. An interesting feature of GraphLab objects is that they have graphical representations in the Notebook, with the ability to drill down.

Carlos built and trained a book recommender with a single function call. This is all very well, but how do we scale such a service? The answer is to take the recommender model and deploy it as a service hosted on EC2 over a number of machines. The system is robust to the failure of single systems, and uses caching to take advantage of repeated queries to the RESTful API.
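
For flavour, here is roughly what that workflow looks like. This is my own sketch from memory of the GraphLab Create API rather than Carlos's notebook; the file and column names are invented, and exact argument names may vary by release.

    # A rough reconstruction (not Carlos's code) of building a recommender
    # with a single call in GraphLab Create.
    import graphlab

    # One row per (user, book, rating) triple -- hypothetical file name.
    reviews = graphlab.SFrame.read_csv('amazon_book_reviews.csv')

    # Build and train the recommender in one function call.
    model = graphlab.recommender.create(reviews,
                                        user_id='user_id',
                                        item_id='book_id',
                                        target='rating')

    # Top-5 recommendations for a couple of (hypothetical) users.
    print(model.recommend(users=['A123', 'B456'], k=5))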

The underlying support infrastructure is written in "high-performance C++". The GraphLab components are based on an open source core, with an SDK that allows you to extend your coding however you want. GraphLab Create is unique in its ability to allow a laptop to handle terabyte volumes. It includes predictive algorithms and scales well - Carlos showed some impressive statistics on desktop performance. TAPAD had a problem with 400M nodes and 600M edges. GraphLab handled it where Mahout and Spark had failed.

One of the key components is the SFrame, a scalable way to represent tabular data. These representations can be sharded, and can be run on a distributed cluster without change to the Python code. Another is the SGraph, which offers similarly scalable approaches to graph data.
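
As a minimal illustration of the SFrame idea (again my own sketch, assuming the GraphLab Create API of the time, with invented file and column names), the same column-and-groupby style of code runs whether the SFrame is backed by local disk or by a cluster:

    import graphlab

    # SFrames are disk-backed, so this can be far larger than RAM.
    ratings = graphlab.SFrame.read_csv('ratings.csv')   # hypothetical file
    ratings['rating'] = ratings['rating'].astype(float)

    # Aggregations look much like their pandas equivalents.
    per_book = ratings.groupby('book_id',
                               {'mean_rating': graphlab.aggregate.MEAN('rating')})
    print(per_book.head())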

The toolkits are built on an abstract pyramid whose base is the algorithms and SDK; upon this is layered a machine learning layer.

Deployment is very important. How can you connect predictive services to your existing web front-end? GraphLab allows you to serve predictions in a robust, scalable way.

Carlos closed by pointing out the Data Science and Dato Conference, whose fourth incarnation is in July in San Francisco.

Q: What's the pricing model?
A: You can use GraphLab Create for free; you can use Produce and other services either on an annual license or by the hour.

Q: How do you debug distributed machine learning algorithms?
A: It's hard. We've been working to make it easier, and Carlos pushes for what he calls "insane customer focus".

Q: Is SFrame an open source component?
A: Yes.

Q: What about training - how do you train your models?
A: We have made it somewhat easier by producing tools. The second layer, choosing models and parameters, is assisted by recent developments. The next release will support automatic feature search.

Q: Is the data store based on known technologies?
A: It isn't a data store, it's an ephemeral representation. Use storage appropriate to your tasks.

Q: Do you have a time series component?
A: Yes, we are working on it, but we want to talk about what's interesting to potential customers.

Q: Do you have use cases in personalized medicine?
A: We've seen some interesting applications; one example reduced the cost of registering new drugs.

Erin Shellman works at Nordstrom Data Lab, a small group of computer and data scientists. She built scrapers to extract data from sports retailers. Erin reminded us of the problems of actually getting your hands on the data you want. Volumes of data are effectively locked up in DOMs.

The motivation for the project was to determine the optimum point at which to reduce prices on specific products. It was therefore necessary to extract the competitive data, which Erin decided to do with Scrapy. Erin likes the technology because it's adaptable and, being based on Twisted, super-fast.

Using code examples (available from a GitHub repository) Erin showed us how to start a web scraping project. The first principal component is items.py, which tells Scrapy which data you want to scrape. Extracting this kind of data means you really should look at the DOM, which can yield valuable information (one customer was including SKU-level product availability information in hidden fields!)
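
An items.py along those lines might look like this (the field names are my own invention for illustration, not taken from Erin's repository):

    import scrapy

    class ProductItem(scrapy.Item):
        # One Field per piece of data we want to pull out of a product page.
        brand = scrapy.Field()
        name = scrapy.Field()
        price = scrapy.Field()
        sku = scrapy.Field()
        availability = scrapy.Field()   # e.g. SKU-level stock hidden in the DOM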

The second component is the web crawler. Erin decided to start at one competitor's "brands" page, then follow the brand's "shop all" link, which got her to a page full of products. The crawl setup subclasses the Scrapy CrawlSpider class. The page parser uses yield to generate successive links in the same page. Erin pointed out that Scrapy also has a useful interactive shell mode, which allows you to test the assumptions on which you build your crawler.
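
A skeleton of such a crawler, using current Scrapy import paths (the selectors and URLs below are placeholders, not Erin's code), looks something like this:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class RetailerSpider(CrawlSpider):
        name = 'retailer'
        start_urls = ['http://www.example-retailer.com/brands']   # the "brands" page

        # Follow each brand's "shop all" link through to the product listing.
        rules = (
            Rule(LinkExtractor(restrict_css='a.shop-all'), callback='parse_products'),
        )

        def parse_products(self, response):
            # Yield one record per product found on the listing page.
            for product in response.css('div.product'):
                yield {
                    'name': product.css('p.name::text').extract_first(),
                    'price': product.css('p.price::text').extract_first(),
                }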

Erin found that smaller brands didn't implement a "shop all" page, and for those cases she had to implement logic to scrape the data from the page where it would normally appear.

Erin's description of how to handle paginated multi-page listings showed that Scrapy automatically skips already-processed pages, meaning it is safe to issue redundant requests where the same page may appear multiple times in the results.

Erin underlined the necessity of cleaning the data as early in the processing chain as possible, and showed some code she had used to eliminate unnecessary data. Interestingly, Erin ended up collecting a lot more data than she had initially set out to find. Disposition of the scraped data can be handled by a further component of Scrapy, which was not covered.

Most of data science is data wrangling. Scraper code at https://github.com/erinshellman/backcountry-scraper

Erin also pointed out that the next PyLadies Seattle is on Thursday, January 29th.

Q: If you repeat scrapes, is there support for discovering DOM structure changes?
A: Good question, I don't know. I assume there must be, but didn't feel I needed it.

Q: If you ran this job over the course of 10 minutes (with 40,000+ GET requests) do you need to use stealth techniques?
A: We haven't so far. Retailers aren't currently setting their web sites up to protect against such techniques.

Q: What's the worst web site you've come across to scrape?
A: I don't want to say; but you can tell a lot about the quality of the dev teams by scraping their pages.

Q: Does Scrapy have support to honor robots.txt?
A: No, but the crawler can implement this function easily.

Q: What techniques did you use to match information?
A: Pandas was very helpful in analysis.

Q: What about web sites with a lot of client-side scripting?
A: We didn't run into that problem, so didn't have to address the issues.

Trey Causey from Facebook rose after the interval to talk on Pythonic Data Science: from __future__ import machine_learning. Trey's side job is as a statistical consultant for an unnamed NFL team.

The steps of data science are

1. Get data
2. Build Model
3. Predict

Python is quickly becoming the preferred language of the data science world. NumPy has given rise to Pandas, whose DataFrame structure is based on it. Pandas can interface with scikit-learn.

A DataFrame offers fast ETL operations, descriptive statistics, and the ability to read in a wide variety of data sources. Trey showed a DataFrame populated with NFL data in which each row represented a play in a specific game, which had been processed to remove irrelevant events.
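
In pandas terms that preprocessing is only a few lines; the file and column names below are invented for illustration, not Trey's actual schema:

    import pandas as pd

    plays = pd.read_csv('nfl_play_by_play.csv')   # hypothetical file

    # Drop rows that aren't real plays (timeouts, penalties, etc.).
    plays = plays[~plays['play_type'].isin(['timeout', 'penalty'])]

    # Quick descriptive statistics, one row of output per play type.
    print(plays.groupby('play_type')['yards_gained'].describe())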

Scikit-learn is a fast system for machine learning. New algorithms are being added with each release, and the consistent API they offer is a notable feature. It offers facilities for classification, regression, clustering, dimensionality reduction and preprocessing. This last is valuable, given how much data science work is just getting the data into the desired form.

Scikit-learn models implement the Estimator interface, with .fit(X, y=None), .transform() and .predict() methods; some also have .predict_proba() methods. The "algorithm cheat sheet" is an attempt to show how to use the package's features to obtain desired results.
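
The consistency is easiest to see in code. The sketch below (synthetic data, nothing NFL-specific) shows the same fit/transform/predict shape across a transformer and a classifier:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    X = np.random.rand(200, 4)
    y = (X[:, 0] + X[:, 1] > 1).astype(int)

    scaler = StandardScaler().fit(X)       # .fit(X, y=None) on a transformer
    X_scaled = scaler.transform(X)         # .transform()

    clf = LogisticRegression().fit(X_scaled, y)    # .fit(X, y) on a classifier
    print(clf.predict(X_scaled[:5]))               # .predict()
    print(clf.predict_proba(X_scaled[:5]))         # .predict_proba() where available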

Suppose you are interested in the probability that a given team wins a game given the appearance of a particular play in that game. This allows coaches to answer questions like "should we punt on this fourth down?". It is therefore a classification problem, under the assumption that plays are independent (which seems a ridiculous assumption, but in fact works pretty well).

The janitorial steps will involve looking at the data. Does it suffer from class imbalances (e.g. many more wins than losses)? Does it need centering and scaling? How do I split my data set into training and prediction sets?

A first study, a histogram of the number of plays vs. remaining time, showed a major spike just before halftime and smaller spikes before the end of each quarter due to the use of timeouts. The predictive technique used was "random forests." Testing the predictions was performed on both true and false negatives and positives. Calibration is important for predictive forecasts. Sometimes you will need to use domain knowledge to wring information out of the data.
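
A hedged sketch of those steps on synthetic data (not the play-by-play set) might read as follows; the point is the split / fit / error-type / calibration sequence rather than the numbers:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix
    from sklearn.calibration import calibration_curve

    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 6))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000) > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                        random_state=0)

    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]

    # True/false positives and negatives in a single table.
    print(confusion_matrix(y_test, model.predict(X_test)))

    # Calibration: do cases scored at 0.7 come up positive about 70% of the time?
    frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
    print(list(zip(mean_pred.round(2), frac_pos.round(2))))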

Trey showed a number of graphics, showing various statistics which my total lack of football knowledge didn't allow me to interpret sensibly. He pointed out some of his important omissions, being quite frank about the shortcomings of his methodology. Feature engineering depends on subtleties of the data, and no confidence intervals are given. Putting models into production is difficult, and data serialization has not yet been suitably standardized.

Data science may not be easy, but it can be very rewarding. Read more from @treycausey (trey dot causey at gmail dot com) or at thespread.us.

Q: Does what you are doing become less effective over time?
A: In the NFL case most teams don't trust statistics, so it is difficult to get coaches to implement stats-based decision making. In the larger context, if you get really good at predicting a particular phenomenon, you would expect performance to decline over time due to changing behavior.

Q: Are there particular teams who are making visibly good or bad decisions?
A: Yes. It's easy to see which coaches aren't optimizing at all, but even the best teams still have a way to go. The New York Times has a "fourthdownbot" that is fun.

Q: Are you trying to develop mental constructs of what is right and wrong in feature engineering?
A: That's a philosophical divide in the data science community; some prefer more parsimonious models, which should yield actionable features, while others prefer highly parameterized models with better predictive ability.

Q: Do you understand more about the features in your data as you work with it?
A: Yes, all the time, and the new features aren't always obvious at first. This is basically a knowledge representation problem. Sometimes there is just no signal ...

Q: How are you giving the coaches this data?
A: I can't tell you how. Cellphone communications are forbidden on the sidelines.

Q: Are you REALLY good at fantasy football?
A: Fantasy football's scoring system is highly variable; a lot of research shows that some scoring features can't be predicted. Usually FF's busy time is Trey's busy time too, so he tends to play like other people do and say "he's looking good this week."

Thanks to all speakers for their amazing talks, and thanks to the group for hosting me.

January 6, 2015

PyData London, January 2015

[EDIT: Updated with additional background and slide links]

Tonight I'm attending the PyData London Meetup group for the first time.

The meeting began with a short "What's New in the Python World," from the joint organizers, who assumed the audience was mainly here for the invited presentations. That's true -- like many others I find it interesting to learn what other Pythonistas are doing with their Py (so to speak), but it's also nice to hear a potted summary of "events of significance."

I know from my own experiences that they (said organizers) will be DELIGHTED if you would offer any feedback at all. They are doing so much good work for nothing (and it can begin to feel like a thankless task) that we all owe it to them to pass on any suggestion that might help the group be even more effective. Similarly, if you would like to let me know anything about this, or future posts of this nature, just add your comment below. If you feel like making a critical comment at the same time, I am old enough and ugly enough to survive.

Frank Kelly was introduced to talk about "Changepoint Detection with Bayesian Inference". After graduation Frank was sentenced to investigate rock strata produced as part of an oil exploration project. One initial problem was the transmission of information from the drill head (hundreds of feet down in the rock). They discovered that they could encode digital information by pulsing the highly viscous mud that lived in the well. As you can imagine, mud is not a very good transmission medium, and Frank's final-year project was to use Bayesian statistics to decode the transmissions.

Frank's discussion of frequentist vs Bayesian methods was interesting. It revealed that Bayesian methods are a fundamentally different technique. Bayes was a non-conformist who is buried about ten minutes away from the meeting space at Lyst Studios. I don't know whether this was intended as a warning. Essentially, Bayesian methods use fixed data sets and post-hoc processing to understand the experimental results. Data is assumed to be generated as model data plus noise, which for Frank's purposes could be characterized as Gaussian white noise. The signal is essentially the reading minus the noise, recovered by integrating with respect to the "nuisance parameters."

We saw a graphical demonstration of how this could be used to detect thresholds in noisy data, and how, as the data tended to a lower and lower threshold, the result of Frank's analytical function tended towards a smooth curve, with no detectable thresholds indicated by minima or maxima.
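
To make that concrete, here is a minimal sketch of single-changepoint detection on synthetic data. It is not Frank's code: the Bayesian treatment marginalises the segment means and the noise level, whereas this stand-in simply scores each candidate split by residual sum of squares under the same piecewise-constant, Gaussian-noise model:

    import numpy as np

    rng = np.random.default_rng(0)
    signal = np.concatenate([np.zeros(120), np.full(80, 1.5)])   # step at t = 120
    data = signal + rng.normal(scale=0.8, size=signal.size)      # Gaussian white noise

    def changepoint_scores(y):
        """Score every candidate split; higher (less negative) is more likely."""
        n = len(y)
        scores = np.full(n, -np.inf)
        for k in range(2, n - 2):              # keep both segments non-trivial
            left, right = y[:k], y[k:]
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            scores[k] = -rss
        return scores

    print("most likely changepoint:", changepoint_scores(data).argmax())  # ~120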

Frank then went on to discuss other applications of his technique related to Google search. There are pressures on Google to reveal the algorithms they use to produce their search results, but this is unlikely to happen as the algorithm represents Google's "secret sauce". Changes to Google's algorithms cause large fluctuations in results and sales for some companies.

Then we were treated to a view of an IPython Notebook. The code wasn't very readable (remember that Command+ sequence, Mac presenters), but it was interesting to see a demonstration that a Google change had made a detectable difference to the traffic on sites. By presenting this kind of supporting evidence companies can at least give Google some facts about how the changes have affected them.

Frank then went on to talk about applications to tropical storms, which generally have wind speeds over 17 m/s (35 m/s is classified as a hurricane). Data is available for the number of tropical storms per year since 1856, and Frank pointed out that there were spikes that correlated well with changes in the surface ocean temperature.

Frank explained that he is now mixing Bayesian with frequentist techniques, and that he has done some informal research into correlations between external events such as Christmas and the Scottish independence referendum, with no really conclusive results (except that Christmas had more effect on the stock market than the referendum did). Overall a very interesting talk. Frank even promised that the published slides (now linked from the talk title above as well as here) will include a link to a "Matlab to Python" conversion cheat sheet. Well done!

In the Q and A session the first questioner asked whether Frank had benchmarked his methods against other change-detection algorithms. Frank felt this might be difficult because different algorithms produce different types of output. Next, someone asked whether Frank had tried any custom Bayesian analysis tools, but Frank had simply put his own code together. Next, someone asked whether the technology to communicate from the drill head was any better nowadays, and Frank said that things are now more sophisticated, but "you can't just stick a cable down the hole." A commenter pointed out that mud pulsing was still common five years ago, and asked whether it was easy to apply a moving window to the data. I'm not sure I understood the answer, but Frank seemed to think it could be used to produce online analysis tools that would allow a sliding window to be applied to time series data. Next someone asked whether the technique could cope with different noise spectra/distributions. The answer was that the noise levels had to be modelled, and sometimes the noise was different after the change from before. The talk ended with a robust round of applause.

There was then a break for beer and pizza, so don't blame me for any degradation of quality in the rest of this post. There are no guarantees. I met a few people while we were milling around, including a data scientist who had heard that the PyData meetups were friendly and relevant, and the business development manager of a company whose business is training scientists.

The second talk of the evening was "Learning to Read URLs: Finding word boundaries in multi-word domain names with Python and sklearn" by Calvin Giles.

Calvin started out with a confession that he isn't a computer scientist but a data scientist, so he was describing algorithms that worked to solve his problems rather than theoretically optimal solutions.

The basic problem: given a domain name such as powerwasherchicago.com, resolve it into words. The point is, if this minimal amount of semantic information can be extracted, you can avoid simple string comparisons and determine the themes that might be present in a relatively large collection of domain names. Calvin warned us that the code is the result of a single day's hacking project, based on Adthena's data of ten million unique domain names and third party data such as the Gutenberg project and the Google ngram viewer data sets.

The process was to determine which of a number of possible sentences should be associated with the "words" found in a domain name. The first code snippet showed a way of extracting all the single words from the data (requiring at least two characters per word). A basic threshold detection was required, since low frequency words are not useful. The initial results were sufficiently interesting to move further research to the Google ngram dataset. This resulted in an English vocabulary of roughly 1,000,000 words. A random selection included "genéticos" and "lanzamantio", but Calvin was happy to allow anything that had a frequency of more than 1,000 in his corpus.

Calvin then presented a neat algorithm to find the words that were present in a string. The list of potential words in powerwasherchicago.com seemed to have about fifty elements in it. catholiccommentaryonsacredscripture.com had 101 words in it, including a two-letter word, making it a very interesting test case.

Calvin's algorithm to find the "most likely" set of words allows you to decide how likely the domain name is to occur given the potential words it could be made up from. Sadly with n substrings of the data you can generate 2^n sentences. Some convincing calculations were used to demonstrate that this wasn't a practical approach. The words should be non-overlapping and contiguous, however, which limits the possibilities quite radically (though it isn't easy to find all subsets of non-overlapping words); this is a major win in reducing the number of cases to be considered.

Given the solution with least overlap, Calvin then chose to omit all "sentences" that were less than 95% of the length of the least-overlap solution. The code is all available on the slides, and I am hoping to be able to publish the URL (which Calvin said at the start of his talk would be available).

The get_sentences() wrapper function was only seven lines of code (ignoring the called functionality), and quite comprehensible. The demonstration of its result was extremely convincing, reducing some very large potential sets to manageable proportions. The average domain turned out to have around 10,000 possible interpretations. The domains with more than that number of candidate interpretations turned out to be the most interesting.

In order to produce a probability ordering for the possibilities, the algorithm prefers fewer longer words to a larger number of shorter words. A four-line function gave that ordering - a nice demonstration of Python's high-level abilities. While this did not give perfect results, Calvin convinced me that his methods, while possibly over-trained on his sample domains (modulo some fuzzy word-inference techniques that help to identify one-character words in the domain), are useful and workable.
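
Here is my own rough reconstruction of the general idea, not Calvin's implementation: enumerate contiguous segmentations of the domain into known words, then rank them so that fewer, longer words come first (the toy vocabulary is invented):

    from functools import lru_cache

    VOCAB = {'power', 'washer', 'chicago', 'was', 'her', 'chic', 'ago', 'pow'}
    MAX_WORD = max(len(w) for w in VOCAB)

    @lru_cache(maxsize=None)
    def segmentations(s):
        """All ways of splitting s entirely into vocabulary words."""
        if not s:
            return [[]]
        results = []
        for i in range(1, min(len(s), MAX_WORD) + 1):
            head = s[:i]
            if head in VOCAB:
                for rest in segmentations(s[i:]):
                    results.append([head] + rest)
        return results

    def ranked_guesses(domain):
        # Prefer fewer words, then longer words earlier in the sentence.
        return sorted(segmentations(domain),
                      key=lambda ws: (len(ws), [-len(w) for w in ws]))

    print(ranked_guesses('powerwasherchicago')[0])   # ['power', 'washer', 'chicago']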

Calvin has now trained his system to understand which was the correct interpretation for the first hundred domain names. The visual representation of Calvin's latest results showed that even when the algorithm was "wrong" it was because the arbitrary ordering of equally-assessed domains had put the correct answer further down the list.

He expects that on a set of 500 domains his system will currently give useful answers for roughly two-thirds, with the remainder randomly wrong. A missing part of the model is that it assumes all sentences are equally likely; using Bayesian techniques one can determine the "likelihood" of the generated sentences being the intended interpretation. A somewhat pragmatic approach to this problem, inspired by Peter Norvig's spell-checker blog post, used as much existing code as possible. The code

    guess("powerwasherchicago")[0]

returned a result of

    'power washer chicago'

Calvin closed by suggesting that his initial training data set might have been too small, so he plans to use larger data sets and more sophisticated training methods with a rather more rigorous evaluation of the ad hoc decision making incorporated in his existing code base. His list of desired enhancements made sense, usually the sign of a talk that's related closely to the author's real-world experience. I was impressed that he is also considering how to make his work more usable. Another great talk!

Someone opened the Q and A by asking whether the web site content couldn't be analysed algorithmically to provide more input to the parse. Calvin's reply reflected real-world experience that suggested it would not be easy to do this. Given there are 40,000,000 domain names, Calvin felt such a parser would be unlikely to be helpful. The final question asked whether finite-state transducers, as implemented in OpenFST (sadly written in C++) would be useful. Calvin replied that it might well be, and asked to talk to the questioner with a view to updating the slides.

To close the meeting Product Madness announced that they were hiring, and I am happy to say that nobody wanted me to answer any questions, despite the organizers putting me on the spot!

Thanks to Lyst for hosting. This was a most stimulating meeting, and I'll be attending this group again whenever I can.