Monday 3 January 2011

Google Research: Lessons learned developing a practical large scale machine learning system (including comments)

Source:

http://googleresearch.blogspot.com/2010/04/lessons-learned-developing-practical.html

Lessons learned developing a practical large scale machine learning system

Tuesday, April 06, 2010 at 08:00:00 AM



When faced with a hard prediction problem, one possible approach is to attempt to perform statistical miracles on a small training set. If data is abundant then often a more fruitful approach is to design a highly scalable learning system and use several orders of magnitude more training data.

This general notion recurs in many other fields as well. For example, processing large quantities of data helps immensely for information retrieval and machine translation.

Several years ago we began developing a large scale machine learning system, and have been refining it over time. We gave it the codename “Seti” because it searches for signals in a large space. It scales to massive data sets and has become one of the most broadly used classification systems at Google.

After building a few initial prototypes, we quickly settled on a system with the following properties:

  • Binary classification (produces a probability estimate of the class label)
  • Parallelized
  • Scales to process hundreds of billions of instances and beyond
  • Scales to billions of features and beyond
  • Automatically identifies useful combinations of features
  • Accuracy is competitive with state-of-the-art classifiers
  • Reacts to new data within minutes
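As a purely illustrative sketch (not Seti's actual algorithm, which is not described here), here is a minimal online logistic-regression learner over hashed sparse features in Python. The bucket count, learning rate and feature strings are invented assumptions, and a real deployment would shard both the weight vector and the instance stream across many machines; the sketch only shows the kind of interface the properties above imply: probabilistic binary predictions, incremental updates as data arrives, and a feature space bounded only by the hash range.

```python
# Minimal sketch (not Seti's implementation): an online logistic-regression
# learner with hashed sparse features. All names and constants are invented
# for illustration.
import math
import random

NUM_BUCKETS = 2 ** 20      # hash space standing in for "billions of features"
LEARNING_RATE = 0.05

weights = [0.0] * NUM_BUCKETS

def hashed(features):
    """Map string features (e.g. 'country=US') to weight-vector buckets."""
    return [hash(f) % NUM_BUCKETS for f in features]

def predict(features):
    """Return P(label = 1) for one instance."""
    z = sum(weights[i] for i in hashed(features))
    return 1.0 / (1.0 + math.exp(-z))

def update(features, label):
    """One SGD step on the logistic loss; cheap enough to run as data arrives."""
    p = predict(features)
    g = p - label                      # gradient of the log loss w.r.t. the score
    for i in hashed(features):
        weights[i] -= LEARNING_RATE * g

# Toy usage: instances containing the feature 'signal' tend to be positive.
random.seed(0)
for _ in range(5000):
    positive = random.random() < 0.5
    feats = ['signal'] if positive else ['noise']
    feats.append('token=%d' % random.randint(0, 50))
    update(feats, 1 if positive else 0)

print(round(predict(['signal', 'token=3']), 2))  # close to 1.0
print(round(predict(['noise', 'token=3']), 2))   # close to 0.0
```
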
Seti’s accuracy appears to be pretty decent. For example, tests on standard smaller datasets indicate that it is comparable with modern classifiers.

Seti has the flexibility to be used on a broad range of training set sizes and feature sets. These sizes are substantially larger than those typically used in academia (e.g., the largest UCI dataset has 4 million instances). A sample of the data sets used with Seti gives the following statistics:


              Training set size    Unique features
Mean          100 Billion          1 Billion
Median        1 Billion            10 Million


A good machine learning system is all about accuracy, right?

In the process of designing Seti we made plenty of mistakes. However, we also got some key decisions right. Here are a few of the practical lessons that we learned. Some are obvious in hindsight, but we did not necessarily realize their importance at the time.

Lesson: Keep it simple (even at the expense of a little accuracy).

Having good accuracy across a variety of domains is very important, and we were tempted to focus exclusively on this aspect of the algorithm. However, in a practical system there are several other aspects of an algorithm that are equally critical:
  • Ease of use: Teams are more willing to experiment with a machine learning system that is simple to set up and use. Those teams are not necessarily die-hard machine learning experts, and so they do not want to waste much time figuring out how to get a system up and running.
  • System reliability: Teams are much more willing to deploy a reliable machine learning system in a live environment. They want a system that is dependable and unlikely to crash or need constant attention. Early versions of Seti had marginally better accuracy on large data sets, but were complex, stressed the network and GFS architecture considerably, and needed constant babysitting. The number of teams willing to deploy these versions was low.
Seti is typically used in places where a machine learning system will provide a significant improvement in accuracy over the existing system. The gains are usually large enough that most teams do not care about the small differences in accuracy between different flavors of algorithms. And, in practice, the small differences are often washed out by other effects such as better data filtering, adding another useful feature, parameter tuning, etc. Teams much prefer having a stable, scalable and easy-to-use classification system. We found that these other aspects can be the difference between a deployable system and one that gets abandoned.

It is perhaps less academically interesting to design an algorithm that is slightly worse in accuracy, but that has greater ease of use and system reliability. However, in our experience, it is very valuable in practice.


Lesson: Start with a few specific applications in mind.

It was tempting to build a learning system without focusing on any particular application. After all, our goal was to create a large scale system that would be useful on a wide variety of present and future classification tasks. Nevertheless, we decided to focus primarily on a small handful of initial applications. We believe this decision was useful in several ways:

  • We could examine what the small number of domains had in common. By building something that would work for a few domains, it was likely the resulting system would be useful for others.
  • More importantly, it helped us quickly decide what aspects were unnecessary. We noticed that it was surprisingly easy to over-generalize or over-engineer a machine learning system. The domains grounded our project in reality and drove our decision making. Without them, even deciding how broad to make the input file format would have been harder (e.g., is it important to permit binary/categorical/real-valued features? Multiple classes? Fractional labels? Weighted instances?). A hypothetical record illustrating these choices is sketched after this list.
  • Working with a few different teams as initial guinea pigs allowed us to learn about common teething problems, and helped us smooth the process of deployment for future teams.
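As a concrete illustration of the format questions in that parenthetical, a hypothetical training-instance record might look like the Python sketch below. The field names and encoding choices are invented for illustration and are not Seti's actual input format.

```python
# Hypothetical training-instance record; illustrative only, not Seti's format.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Instance:
    # Sparse feature map: real-valued features carry their value, binary
    # features use 1.0, and categorical features can be encoded as
    # 'name=value' keys with value 1.0.
    features: Dict[str, float] = field(default_factory=dict)
    # A single float covers binary labels (0 or 1) and fractional labels
    # (e.g. 0.3 meaning "positive 30% of the time"); true multi-class
    # problems would need a richer label type.
    label: float = 0.0
    # Instance weight, letting frequent or important examples count more.
    weight: float = 1.0

example = Instance(
    features={'query_length': 7.0, 'country=US': 1.0, 'is_mobile': 1.0},
    label=0.3,
    weight=2.0,
)
```

Deciding which of these fields to support, and which to drop, was exactly the kind of question the initial applications helped settle.
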
Lesson: Know when to say “no”.

We have a hammer, but we don't want to end up with bent screws. As machine learning practitioners, we were always tempted to recommend machine learning for a problem. We saw very early on that, despite its many significant benefits, machine learning typically adds complexity, opacity and unpredictability to a system. In reality, simpler techniques are sometimes good enough for the task at hand. And in the long run, the extra effort that would have been spent integrating, maintaining and diagnosing issues with a live machine learning system could be spent on other ways of improving the system instead.

Seti is often used in places where there is a good chance of significantly improving predictive accuracy over the incumbent system. And we usually advise teams against trying the system when we believe there is likely to be only a small improvement.


Large-scale machine learning is an important and exciting area of research. It can be applied to many real world problems. We hope that we have given a flavor of the challenges that we face, and some of the practical lessons that we have learned.

20 comments:


Mr.Wizard said...
Can you give some examples of some places where this is used?
methode said...
Google Translate, I guess :) It would be quite stupid not to use it on a service like Translate. Or I imagine it's used in the "Did You Mean..." service as well.
Glowing Face Man said...
How about some links where we can see this in action :)
threeiem said...
It would be great if you could do some learning with climate data. There is tons of it and it would serve a great purpose. Here is a link to lots of data from NOAA: http://nomads.noaa.gov/
3145 said...
I would say filter images by face only or something like that; if that works based on that system I'm sooooo impressed.
Dan said...
The suggestion that small sample sizes are inadequate for machine learning might be a bit misleading. Human and animal neuronal systems easily learn complex categorization tasks in a very small number of trials. Humans and animals cannot live long enough to be exposed to billions of learning trials. Typically, learning asymptotes in accuracy in classical reinforcement studies of category learning in pigeons in fewer than a thousand trials. See for example the famous Cerella (1980) study where pigeons learned to classify whether Charlie Brown was in complex cartoon pictures with many different Peanuts characters and scenes with 95% accuracy in 800 learning trials. Charlie Brown was actually in about 400 of these learning trials in this case. While it may be true that semi-supervised learning based upon small numbers of labeled trials such as one labeled event does not generally work very well, this Google researcher needs to be aware of what is possible in supervised learning based upon animal learning studies (unless he wants to reinvent the wheel) and then he needs to be aware of newer developments in supervised machine learning such as the Generalized Reduced Error Logistic Regression Machine (RELR). Learning can easily asymptote in accuracy in RELR in the same number of trials as is typically seen in classical reinforcement studies in animals. More importantly, these RELR models are simple, interpretable, and highly accurate models that do not exhibit the black-box character of complex machine learning paradigms.
Ronald said...
Learning is the self organization of data. You seem to build the usual recognition engine. Recognition is not learning. Learning includes the building of abstracts or generalizations by the machine, recognition does not.
Tadej said...
Any possibility of telling us about the mathematics of the concrete underlying method that exhibits those properties? Possibly a future paper?
Dan said...
I would disagree with Ronald's comments about recognition not requiring abstraction. I recognize a dog even when it has only three legs; so could any accurate machine learning algorithm. This category of dog can be learned through some form of supervised learning that tells me the probability that certain combinations of features predict a certain category. The fact that it is probabilistic allows for the abstraction and generalization. Ronald's definition of learning would seem to ignore the vast majority of what is considered learning - that is supervised learning. Clearly humans and animals can learn categories very quickly through supervised learning and this does not require a billion learning trials even when the number of potential features is very large such as in millions of potential features that would arise through all the interactions between features seen through large numbers of Peanuts cartoon strips.
Ronald said...
Sorry Dan, the system ate my long response, had to run away a few times. Anyway what it really boils down to. Recognition is not learning, but learning includes recognition. Try to teach your system math and have it use it independently. Now teach it one-two-many math(math not equal math, its culture dependent) as best as we Westerners can understand it, no change in anything. Or for starters, teach it "all", What kind of abstraction does it build and use on its own? Think about how a brain layer/region does not feed back data to the layer it received data from. Or how to build decomposition with stochastic behavior. From where I stand it will be hard pressed without its own data organization. But I agree Google is way behind.
Dan said...
Ronald, I agree with you that there are limits to what passive, supervised learning systems can do. For any more natural learning of higher cognitive concepts, I believe that a form of active learning would be required. Yet, we are at a point in this field where we need to have a reasonable model for the “engram” before we can build massively parallel and distributed systems that have higher cognitive capabilities anywhere close to humans or even simple animals, such as pigeons. I actually believe that Google’s basic proposal for a massively parallel machine learning system is probably on the cutting edge in all areas except that they lack a reasonable model for this “engram”. The brain’s engrams are distributed representations for the fundamental categories, words, objects etc. that form cognition, but the brain’s engrams do not require a billion learning trials to be formed. My suggestion is that rather than immediately dismiss small sample size learning as a “statistical miracle”, they may wish to view this as something that a natural system like the brain must do in its engram formation. Once they open their minds to this possibility and learn about an algorithm such as RELR that does not arbitrarily impose L1/L2 regularization to achieve this, they may also be surprised that this is not a “statistical miracle”.
JezC said...
I'm expecting that this is used for AdWords Broad Matching and possibly organic ranking; language processing to create conceptual relationships? In AdWords, there's a fairly obvious feedback loop for machine learning - more clicks in response to better selections of adverts - and this would need large samples because of irregular user behaviour in response to adverts.
Ronald said...
Dan, it all depends on what one wants to do. If one wants to analyze text, I would go with a self-organizing system, since it can learn the ambiguous structure of human language. For example: I use TTL (Time To Live) for analysis. In other words the system is non-numerical and doesn't parse text; it uses flow in time to associate differences in structure with meaning. Like: "I see" and "See I" have a different flow in time and a different meaning, and the system can easily organize that. Think about it as columns (pronunciation) on a pane over time; it looks like a wave in 3d (except it can/will twist and turn in any direction). Or why can a magpie (a bird with a really different brain structure) recognize "self" from a mirror but not from a picture? What data does the mirror present that a picture does not? I would say space-timing info. In other words, most if not all higher cognitive functions can be presented and tested as space-timing models, including math and learning what "see" means. Would I use it to analyze global warming data? I don't think so.
amanfromMars said...
"In reality, simpler techniques are sometimes good enough for the task at hand." Such is the Enigmatic Paradox, that in Reality which can be Virtually Controlled and Directed [Driven] with the Presentation/Placement of Intelligently Designed Information [Advanced and/or Artificial and/or Alienating Intelligence (and in CyberIntelAIgent NIRobotIQs, Transparently Shared NEUKlearer HyperRadioProActive Intellectual Property with Semantically Astute Analytical Algorithm Processing of Source Metadata/Core Ore Lode)] to Create another Beta Perception and Virtually Real Reality, which simply requires further Corroborating Information to Reinforce the Source Facts and Deliver an Energising Continuity, is the Simplest Possible Technique Always Far More than just Good Enough and Simply the Best for Every Task which is Attempted .... and No Task is then Impossible for what you are then Driving is the Great Game, and a Virtual Operating System Morphed into an ARG/AIMMORPG and Played out in Live Operational Virtual Environments/Earthed Nations. That of course, makes Semantically Astute Analytical Algorithm Processing of Source Metadata/Core Ore Lode Intellectual Property, Proprietary, and Absolutely Priceless and Worth a Fortune to Any who could Use it in IT 42 Deliver the Future Virtually with AIResearch and dDevelopments in the Creative CyberSpace Command and Control of Computers and Communications. ....... AI @ ITs Work in Progress with C42 Quantum Control Systems. "Ronald said... [6:52 PM] Sorry Dan, the system ate my long response, had to run away a few times." ....... I used to hate it when it did that, whereas now there are always cloud copies to effortlessly re-post, if the gremlins are phishing for phorms/swarming information. :-)
Alex said...
PROs: - they have found some good principles which are simple enough to let users interact in a meaningful way with the system - these principles are general enough not to cause awkward procedures when dealing with some subsets of data - they have defined what is better not to deal with in the implementation of the system. DOUBTs: - it seems like a "brute force attack" approach, leveraging Google's massive computational power when dealing with highly parallelized algorithms - there is no emphasis on how to deal with the sparse nature of categorization - knowledge and categories are units of information with specific boundaries which must be updated as new data comes in - how about clustering information in schemata with prototypes, with multiple hierarchies based on the "domain" or context at hand? Clustering is the only way to deal with sparsity. Schemata must be organized from general concepts to more specific ones. - The human brain analyses patterns with huge parallelism; it also correlates schemata together with a concurrent high degree of belief in a distributed way. However, when schemata are evaluated, merged or redefined, the brain retrieves and processes information in a more sequential fashion, relying heavily on hierarchies among schemata.
Kumar said...
+1 for Mr. Wizard. Please give some examples of the kinds of problems that this massively scalable machine learning system solves in much better ways than whatever other approaches in use. Without that, I'm not sure what I'd gain by reading this research post.
dinesh said...
@ dan @ ronald You may find this recent post on The Noisy Channel about Information Retrieval using a Bayesian Model of Learning and Generalization interesting: (http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/)
Ronald said...
The problem I have with Bayesian systems is that they try to avoid basic cell behavior instead of taking advantage of it. Simple example, cell behavior is: stochastic, subject to exhaustion, myelin sheath, BAC, to name a few. Now suppose we want to do decomposition in a "fixed" connected network, which basically requires specifics(n1) -> generalization(gn) -> specifics(gn...). If we use stochastic and exhaustive behavior we can try different specifics(gn). If we combine this with the myelin sheath and BAC we can introduce deterministic behavior. All of this is missing from your "normal" Bayesian system, which some people associate with intelligence. Yet the real system does just that.
Dmitry Chichkov said...
Any plans to release it to the public? It looks like Microsoft had released its own toolkit (SIGMA: Large-Scale and Parallel Machine-Learning Tool Kit).
Mitu said...
Continuous Signal and Linear System
