The Machine Is Learning
The past couple of weeks have been rather tumultus for me and several others. I wont go into details and “kiss and tell” but suffice to say it is has led me to the land of “Free Agency”.
As such over the past couple of weeks I have met several people and have been discussing items such as Big Data, Semantics, Hadoop, NoSQL, Data Science <<insert favorite bingo buzzword here>>. One issue that I see over and over concerns the usage of different distributed compute frameworks. Saying “Hadoop” is not a panacea for your Big Data problems. One must understand the problem your trying to solve. Many people fall prey to the “new shiny thing” paradigm. On the issue of BigData concerns if you want to scale horizontal go with a NoSQL solution – if it is necessary. The NoSQL database is implemented as a data grid for processing (mapReduce, queries, CRUD, etc.). Most big websites and players that have moved towards non-relational datastores include LinkedIn, Amazon, Digg and Twitter. One of the main reasons for using NoSQL solutions is relational databases place computation on reads, which is considered wrong for large-scale web applications such as Digg etc.
Aspects of this behavior are:
• The serial nature of applications1 often waiting for I/O from the data store which does no good to scalability and low response times.
• Huge amounts of data and a high growth factor lead Twitter towards facilitating Cassandra, which is designed to operate with large scale data.
• Furthermore, operational costs of running and maintaining systems like Twitter et al escalate. Web applications of this size therefore “need a system that can grow in a more automated fashion and be highly available.”.
Amazon’s James Hamilton agrees:
“Scale-first applications are those that absolutely must scale without bound and being able to do this without restriction is much more important than more features. These applications are exemplified by very high scale web sites such as Facebook, MySpace, Gmail, Yahoo, and Amazon.com. Some of these sites actually do make use of relational databases but many do not. The common theme across all of these services is that scale is more important than features and none of them could possibly run on a single RDBMS.”
Ok so lets assume you have your high availability horizontal scale framework (HAHSF) in place and you are harvesting, ingressing and egressing data. Now what? You must figure out the DataScience strategy and what to do with the data. Otherwise its like hoarding – your just harvesting and storing it, which by definition means that you need to do some type of statistics, machine learning or data mining. Here is where I believe most people get slightly sideways. Map Reduction is not Hadoop. Map Reducing of Machine Learning algorithms are at the forefront of what is occurring within scale applications.
Machine Learning Review
For a refresher I would suggest breaking out that Grey Book called Machine Learning by Professor Mitchell or Haykin’s, Neural Network tome. So lets start with something we all probably learned in one of your AI or machine learning classes. Remember the Back-Propagation Network? If not here is the diagram for a refresher:
- Essentially the system is a linear combination of weights with a trigonometric function that provides a threshold then the error is fed back the the weights are changed. There are three types of neurons in a neural network that is created BP ANN algorithm:
-
- Input neurons
For discrete input attributes, an input neuron typically represents a single state from the input attribute. This includes missing values, if the training data contains nulls for that attribute. A discrete input attribute that has more than two states generates one input neuron for each state, and one input neuron for a missing state, if there are any nulls in the training data. A continuous input attribute generates two input neurons: one neuron for a missing state, and one neuron for the value of the continuous attribute itself. Input neurons provide inputs to one or more hidden neurons.
- Hidden neurons
Hidden neurons receive inputs from input neurons and provide outputs to output neurons.
- Output neurons
Output neurons represent predictable attribute values for the data mining model. For discrete input attributes, an output neuron typically represents a single predicted state for a predictable attribute, including missing values. For example, a binary predictable attribute produces one output node that describes a missing or existing state, to indicate whether a value exists for that attribute. A Boolean column that is used as a predictable attribute generates three output neurons: one neuron for a true value, one neuron for a false value, and one neuron for a missing or existing state.
A neuron receives input from other neurons, or from other data, depending on which layer of the network it is in. An input neuron receives inputs from the original data. Hidden neurons and output neurons receive inputs from the output of other neurons in the neural network. Inputs establish relationships between neurons, and the relationships serve as a path of analysis for a specific set of cases.
Each input has a value assigned to it, called the weight, which describes the relevance or importance of that particular input to the hidden neuron or the output neuron. The greater the weight that is assigned to an input, the more relevant or important the value of that input. Weights can be negative, which implies that the input can inhibit, rather than activate, a specific neuron. The value of each input is multiplied by the weight to emphasize the importance of an input for a specific neuron. For negative weights, the effect of multiplying the value by the weight is to deemphasize the importance.
Each neuron has a simple non-linear function assigned to it, called the activation function, which describes the relevance or importance of a particular neuron to that layer of a neural network. Hidden neurons use a hyperbolic tangent function (tanh) for their activation function, whereas output neurons use a sigmoid function for activation. Both functions are nonlinear, continuous functions that allow the neural network to model nonlinear relationships between input and output neurons.
Map Reducing The Algorithm
Now here is where the world of real time systems and distributed systems collide. Suppose we have a epoch and we have an error propagated back and we need to update the weights? We need to map reduce this functionality to fully gain benefit from the map reduction – hadoop-it-izing magic. As a specific note the performance that results depends intimately on the design choices underlying the MapReduce implementation, and how well those choices support the data processing pattern of the respective Machine Learning algorithm.
However, not only does Hadoop not support static reference to shared content across map tasks, as far as I know the implementation also prevented us from adding this feature. In Hadoop, each map task runs in its own Java Virtual Machine (JVM), leaving no access to a shared memory space across multiple map tasks running on the same node. Hadoop does provide some minimal support for sharing parameters in that its streaming mode allows the distribution of a file to all compute nodes (once per machine) that can be referenced by a map task. In this way, data can be shared across map tasks without incurring redundant transfer costs. However, this work-around requires that the map task be written as a separate applications, furthermore, it still requires that each map task load a distinct copy of its parameters into the heap of its JVM. That said almost all ML learning algorithms and particularly ANN are commonly done with stochastic or direct gradient descent/ascent (depending on sign), which poses a challenge to parallelization.
The problem is that in every step of gradient ascent, the algorithm updates a common set of parameters (e.g. the update matrix for the weights in the BPANN case). When one gradient ascent step (involving one training sample) is updating W , it has to lock down this matrix, read it, compute the gradient, update W , and finally release the lock. This “lock-release” block creates a bottleneck for parallelization; thus, instead of stochastic gradient ascent many methods use a batch method which greatly slows the process down and takes away from the real time aspect. Work arounds that I have personally seen used and help create are memory mapped shared functions (think signal processing and CPU shared memory modules) which allows a pseudo persistence to occur and the weights to be updated at and paralleled at “epoch time”. As we reduce the latencies of the networked systems we are starting to get back to some of the very problems that plagued real time parallel signal processing systems – the disc i/o and disc read-write heads are starting to become the bottlneck. So you move to solid state devices or massive in memory systems. That does not solve the map reduction problem of machine learning algorithms. That said over the past years the Apache Foundation and more specifically the Mahout project has been busy adding distributed algorithms to its arsenal. Of particular interest is the Singular value decomposition which is the basis kernel for processes such as Latent Semantic Indexing and the like. Even given the breadth of additions fundamental computational blocks such as Kohonen networks and the algorithm mentioned here – Backpropagation are not in the arsenal. Some good stuff such as an experimental Hidden Markov Model are there.
That said I really do believe we will start to see more of a signal processing architecture on the web. Just maybe “cycle-stealing” via true grids will make comeback?
Go Big Or Go Home!