NeurIPS 2019

And they asked me how I did it, and I gave ’em the Scripture text,
“You keep your light so shining a little in front o’ the next!”
They copied all they could follow, but they couldn’t copy my mind,
And I left ’em sweating and stealing a year and a half behind.

~ “The Mary Gloster”, Rudyard Kipling, 1896

My Badge – I exist.

Well, your humble narrator finally made it to NeurIPS 2019. There were several starts and stops to my travel itinerary, but I persevered!

Bienvenue – Vancouver, British Columbia

First and foremost, while getting there required multiple hops (at least for me), Vancouver, BC is a beautiful city. The Vancouver conference center is spacious and an exemplary venue. Also, for those who have the time, Whistler / Blackcomb is one of the best mountains in North America for snow sports this time of year. While I didn't get to go, I am hopeful that I will win the registration lottery for 2020 and will plan accordingly.

Vancouver Conference Center – Oh Canada!

This year the conference was a veritable who's who of information-theoretic companies. Most of the top market-cap companies are now information-theoretic technology companies and as such have representation at the conference. To wit, IBM Research AI was a diamond sponsor:

While it is nearly impossible to quantify the breadth and depth of the subject matter presented at the conference, I have attempted to classify some overall themes:

  • Agent-Based Modelling and Behaviors
  • Imitation, Meta, Transfer, Policy Learning and Behavioral Cloning
  • Morphological Systems based on Evolutionary Biology
  • Optimization methods for non-convex models
  • Hybrid Bayesian and MCMC methods
  • Ordinary Differential Equation (ODE) direct Modelling and Systems
  • Neuroscience models that couple computational agents and hypotheses of consciousness

Side Note: I think it is amazing that 10 years ago you could not say "I'm using a Neural Network for …" without being laughed out of the room. Now there is an entire set of tracks dedicated to said technology and algorithms.

The one major difference at this conference, compared to what I have read and heard (albeit second hand, through reports or blogs), is the focus on "Where is your GitHub?" and the question of how fast we can get to production. There was a very focused and volitional undertone to the questions.

One aspect that has not changed, and appears to have been amplified, is the recruiter/job marketplace (ahem) situation at the conference. To say that it was transparent and out in the open would be an understatement.

New To NeurIPS:

For those who have never been to NeurIPS, I'll provide some recommendations:

  • Download the conference app and fill out your profile
  • Plan your agenda
  • Get to the poster sessions – early
  • Network as much as possible
  • Wear comfortable shoes – there is lots of walking, and it is in the same venue next year.
  • Get as close a hotel as possible, since P(Rain | Conference Timing) > 0.5

Trends and Categories:

Agent-Based Modelling and Behaviors

This area is finally coming to fruition in the production market at scale. We are seeing both ABB (agent-based behaviors, i.e. self-emergent / self-organizing behaviors) and ABM (agent-based modeling). There were many presentations on multi-agent behaviors in the context of both policy and environment responses using reinforcement learning and Q-learning.
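
Most of the multi-agent work in this track builds on the same primitive, so here is a minimal, illustrative tabular Q-learning loop. The corridor environment and all settings below are a toy of my own invention, not any particular paper's setup:

```python
import numpy as np

# Minimal tabular Q-learning on a toy corridor: 5 states in a row, actions
# move left/right, reward only at the right end. A purely random behavior
# policy is used for exploration; Q-learning is off-policy, so it still
# learns the greedy values.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.95
rng = np.random.default_rng(0)

for _ in range(2000):
    s = 0
    for _ in range(20):
        a = rng.integers(n_actions)                       # explore: random action
        s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Update Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if r > 0:
            break

print(Q.round(2))   # the "move right" column should dominate in every state
```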

Imitation, Meta, Transfer, Policy Learning and Behavioral Cloning

I grouped all of these together; while technically they differ in application and scope, they can be and are mixed together in applied systems. Take imitation learning (IL): instead of trying to learn from sparse rewards or manually specifying a reward function, an expert (typically a human) provides the agent with a set of demonstrations. The agent then tries to learn the optimal policy by following and imitating the expert's decisions. Historically this was called Expert Systems Engineering. Note the policy learning implicit in this area as well. Behavioral cloning, furthermore, is a method by which human subcognitive skills can be captured and reproduced in a computer program: as the human subject performs the skill, his or her actions are recorded along with the situation that gave rise to them. So, as one can see, all of these areas are closely related to a so-called expert reference. Algorithms for consensus among multiple agents will play a crucial role here.
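
To make the expert-reference idea concrete, here is a minimal sketch of behavioral cloning as plain supervised learning on expert state-action pairs. The "expert" rule and the data are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical expert demonstrations: states (features) and the actions
# the expert took in those states.
rng = np.random.default_rng(0)
expert_states = rng.normal(size=(1000, 4))               # 1000 states, 4 features each
expert_actions = (expert_states[:, 0] > 0).astype(int)   # stand-in "expert" rule

# Behavioral cloning reduces imitation to supervised learning:
# fit a policy that maps states to the expert's actions.
policy = LogisticRegression().fit(expert_states, expert_actions)

# The cloned policy is then queried in place of the expert.
new_state = rng.normal(size=(1, 4))
print("cloned action:", policy.predict(new_state)[0])
```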

Morphological Systems based on Evolutionary Biology

Morphology is the branch of biology dealing with the form and structure of organisms, their component parts, and their specific structural features. Turing wrote a paper on morphogenesis, and S. Kauffman wrote "The Origins of Order: Self-Organization and Selection in Evolution", just to name a few. We are headed into areas where physics, chemistry, and biology are being brought into play with computing, once again at scale. This multi-modality computing will also benefit from access to the developments in accessible quantum computing.

Optimization methods for non-convex models

Gradient descent in all of its flavors has been our friend for decades. Are the local minima our friend or foe? The algorithms are now starting to ask, "Where am I?"
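
A toy illustration of the question: run vanilla gradient descent on a small non-convex function from two different starting points and watch it land in two different minima. The function and settings are invented for the example:

```python
import numpy as np

# A simple non-convex objective with two minima: f(x) = x^4 - 3x^2 + x.
f = lambda x: x**4 - 3 * x**2 + x
grad = lambda x: 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=500):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Different starting points land in different minima -- the "Where am I?" question.
for x0 in (-2.0, 2.0):
    x_final = gradient_descent(x0)
    print(f"start {x0:+.1f} -> x = {x_final:+.3f}, f(x) = {f(x_final):+.3f}")
```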

Hybrid Bayesian and MCMC methods

In 2007 I founded a machine learning and NLP-as-a-service company called "BeliefNetworks". This self-referencing name should illustrate where I stand on inference methods. Due to access to cycles and throughput, we are finally starting to see these methods integrated system-wide.
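
For readers newer to the area, a minimal random-walk Metropolis-Hastings sampler captures the MCMC half of these hybrid methods. The target density below is a toy mixture chosen purely for illustration:

```python
import numpy as np

# Minimal Metropolis-Hastings sampler for a 1-D posterior known only up
# to a normalizing constant.
rng = np.random.default_rng(0)

def log_unnorm_posterior(theta):
    # Toy target: an unnormalized mixture of two Gaussians centered at +/-2.
    return np.logaddexp(-0.5 * (theta - 2.0) ** 2, -0.5 * (theta + 2.0) ** 2)

def metropolis_hastings(n_samples=20_000, step=1.0):
    theta = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.normal()           # symmetric random-walk proposal
        log_alpha = log_unnorm_posterior(proposal) - log_unnorm_posterior(theta)
        if np.log(rng.uniform()) < log_alpha:            # accept/reject step
            theta = proposal
        samples.append(theta)
    return np.array(samples)

samples = metropolis_hastings()
print("posterior mean ~", samples.mean(), "(symmetric target, so roughly 0)")
```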

Ordinary Differential Equation (ODE) direct Modelling and Systems

Having worked for years in numerical optimization, this is another area that is near and dear to me. I saw several papers mapping ODEs to geometric representations. Analog computing could very well be part of our return to the future. Navier-Stokes equation, anyone? I see the industry moving into flow models that truly model the foundational Cauchy momentum equations, depending on the application area. We are going to see both software and hardware development in this area.
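
As a baseline for what "direct ODE modelling" means before any learning enters the picture, here is a sketch of integrating a hand-specified dynamical system with a standard solver. The damped oscillator is chosen only as an example:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Directly modelling a system as an ODE: a damped oscillator
# dx/dt = v, dv/dt = -k*x - c*v, integrated with a standard solver.
k, c = 4.0, 0.5

def dynamics(t, state):
    x, v = state
    return [v, -k * x - c * v]

solution = solve_ivp(dynamics, t_span=(0.0, 10.0), y0=[1.0, 0.0],
                     t_eval=np.linspace(0.0, 10.0, 200))
print("x(t=10) =", solution.y[0, -1])
```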

Neuroscience models that couple computational agents and hypotheses of consciousness

Given all of the above, computer scientists are pulling in physicists, biologists, chemists, and – finally – neuroscientists. Possibly the "C" word is no longer anathema? I promise I will not insert a Terminator picture here. However, given the developments in cognition and in understanding quantum biology, we are now starting to be able to model, at least initially and in some cases, what we "think" we are thinking about. Yoshua Bengio gave a great talk on volitional, causal, and "conscious" tasks easily accomplished by humans. We also see this in the developments in the area of spiking algorithms.

Papers, Posters, Demos – Oh My!

As part of this blog, I wanted to review a few of my favorite presentations, posters, and papers. This is neither a ranked list nor a chronological review; it is simply a list of papers that resonated with me for various reasons. Along with the papers I will also be posting pictures of posters and some meetups that I attended.

Blind Super-Resolution Kernel Estimation using an Internal-GAN

This paper was interesting to me on several fronts. The basic premise for super-resolution kernels is this:

    \[I_{LR} = (I_{HR} \ast k_s)\downarrow_s\]

The paper introduced "KernelGAN" – an image-specific internal GAN, which estimates the SR kernel that best preserves the distribution of patches across scales of the LR image. I would consider this significant progress over previous methods, since it estimates an image-specific SR kernel from the LR image alone. This allows a one-shot mode of training based on the LR image; network training is done at test time, and there is no separate inference step since the training implicitly yields the resulting SR kernel. They give results and performance metrics in the paper based on the NTIRE 2018 dataset, although given that this is a first application of a deep linear network, I imagine that doesn't really do it justice. Very impressive, and I can see several applications of this method and algorithm.

Project website: http://www.wisdom.weizmann.ac.il/~vision/kernelgan
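
As a minimal sketch of the degradation model behind that formula (not KernelGAN itself): blur the HR image with a kernel, then subsample by the scale factor. The Gaussian kernel below is only a placeholder; the paper's point is that the true, image-specific kernel is unknown and must be estimated from the LR image alone.

```python
import numpy as np
from scipy.ndimage import convolve

def degrade(image_hr, kernel, scale):
    # I_LR = (I_HR * k_s) downsampled by s
    blurred = convolve(image_hr, kernel, mode="reflect")   # convolve with the SR kernel
    return blurred[::scale, ::scale]                       # subsample by the scale factor

def gaussian_kernel(size=7, sigma=1.5):
    # Placeholder kernel; in practice the kernel is unknown and image-specific.
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

image_hr = np.random.rand(64, 64)
image_lr = degrade(image_hr, gaussian_kernel(), scale=2)
print(image_lr.shape)   # (32, 32)
```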

q-means: A Quantum Algorithm for Unsupervised Machine Learning

The cogent aspect of this paper was the efficiency of storing the vectors: classical data expressed in the form of N-dimensional complex vectors can be mapped onto quantum states over log2(N) qubits when the data is stored in a quantum random access memory (qRAM). Specifically, distance estimation becomes very efficient when one has quantum access to the vectors and the centroids via qRAM, with the estimation running in time

    \[T = O(\log(d))\]

Further, the paper showed that you can also query the norms of the vectors within the state preparation.
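
For contrast, here is the classical k-means assignment step that q-means accelerates: each point-to-centroid distance costs O(d) classically, whereas the paper's routine has polylogarithmic dependence on the dimension given qRAM access. This is a classical sketch only, not an implementation of the quantum algorithm.

```python
import numpy as np

# Classical k-means assignment step: every point-to-centroid distance
# costs O(d), which is the part q-means speeds up with qRAM access.
def assign_clusters(points, centroids):
    # points: (n, d), centroids: (k, d)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
    return distances.argmin(axis=1)

points = np.random.rand(100, 8)
centroids = np.random.rand(3, 8)
print(assign_clusters(points, centroids)[:10])
```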

Making AI Forget You: Data Deletion in Machine Learning

One of the issues with GDPR legislation and the right to be forgotten arises when you must re-train on the entire data set. This paper addresses methodologies that enable partial re-training. The paper goes over past methods of cryptography and differential privacy, which do not delete data but attempt to make data private or non-identifiable. From the paper: "Algorithms that support efficient deletion do not have to be private, and algorithms that are private do not have to support efficient deletion. To see the difference between privacy and data deletion, note that every learning algorithm supports the naive data deletion operation of retraining from scratch. The algorithm is not required to satisfy any privacy guarantees. Even an operation that outputs the entire dataset in the clear could support data deletion, whereas such an operation is certainly not private." The paper goes on to define four areas of metric performance for DDIML: Linearity, Laziness, Modularity, and Quantization. They do state that they assumed user-based deletion requests correspond to only a single datapoint, and that this needs to be extended. However, for the unsupervised k-means methods they describe, they achieve deletion efficiency with a substantial algorithmic speedup.

paper here: https://arxiv.org/pdf/1907.05012.pdf
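
The paper's quantized and divide-and-conquer k-means variants are more involved; as a rough illustration of the linearity/laziness ideas only (my own toy construction, not the authors' algorithm), maintaining per-cluster sums and counts lets a single point be removed in O(d) without retraining from scratch:

```python
import numpy as np

class DeletableKMeansCentroids:
    """Rough illustration (not the paper's algorithm): keep per-cluster
    sums and counts so deleting one point is an O(d) update rather than
    a full retrain."""

    def __init__(self, points, assignments, k):
        self.k = k
        self.sums = np.zeros((k, points.shape[1]))
        self.counts = np.zeros(k)
        for x, c in zip(points, assignments):
            self.sums[c] += x
            self.counts[c] += 1

    def centroids(self):
        return self.sums / np.maximum(self.counts[:, None], 1)

    def delete(self, x, cluster):
        # Remove one user's point from its cluster's running statistics.
        self.sums[cluster] -= x
        self.counts[cluster] -= 1

points = np.random.rand(100, 4)
assignments = np.random.randint(0, 3, size=100)
model = DeletableKMeansCentroids(points, assignments, k=3)
model.delete(points[0], assignments[0])
print(model.centroids().shape)   # (3, 4)
```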

Causal Confusion in Imitation Learning

From Wikipedia: "Behavioral cloning is a method by which human sub-cognitive skills can be captured and reproduced in a computer program. As the human subject performs the skill, his or her actions are recorded along with the situation that gave rise to the action." The fundamental premise was comparing the expert against the learned policy and minimizing a graph-parameterized imitation objective:

    \[\mathbb{E}_G\!\left[\,\ell\left(f_\phi([X_i \odot G,\ G]),\ A_i\right)\right]\]

where G is drawn uniformly at random over all 2^n graphs, optimizing a mean squared error loss for continuous-action environments and a cross-entropy loss for discrete-action environments. Something very interesting happens during this process of imitation learning with experts: it exposes a counter-intuitive "causal misidentification" phenomenon, where access to more information can yield worse performance – more is not better! The paper discusses demonstrations in an autonomous-vehicle scenario, in phases, with targeted interventions to infer the correct graph. They did state that the solutions are not production-ready. I really appreciated the honesty.

paper: https://papers.nips.cc/paper/9343-causal-confusion-in-imitation-learning.pdf
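
A minimal PyTorch sketch of the objective above, with invented placeholder data: sample a random binary mask G per batch, feed the policy the masked observation concatenated with the mask, and take a cross-entropy loss against the expert action (the discrete-action case):

```python
import torch
import torch.nn as nn

# Sketch of the masked-imitation objective E_G[ l( f_phi([X ⊙ G, G]), A ) ]:
# a random binary mask G over the n observed variables is sampled per step,
# and the policy sees both the masked observation and the mask itself.
n, n_actions, batch = 8, 4, 32
policy = nn.Sequential(nn.Linear(2 * n, 64), nn.ReLU(), nn.Linear(64, n_actions))
loss_fn = nn.CrossEntropyLoss()                # discrete-action case
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)

X = torch.randn(batch, n)                      # expert observations (placeholder data)
A = torch.randint(0, n_actions, (batch,))      # expert actions (placeholder data)

G = torch.randint(0, 2, (batch, n)).float()    # G ~ Uniform over the 2^n binary graphs
inputs = torch.cat([X * G, G], dim=1)          # [X ⊙ G, G]
loss = loss_fn(policy(inputs), A)
loss.backward()
optim.step()
print("masked imitation loss:", loss.item())
```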

Learning to Control Self-Assembling Morphologies: A Study of Generalization via Modularity

The idea of modular and self-assembling agents goes back at least to von Neumann's Theory of Self-Reproducing Automata. In robotics, such systems have been termed "self-reconfiguring modular robots". E. Schrödinger posed a similar question in "What is Life?". This was one of my favorite demonstrations and presentations; I have been extremely "pro" agent-based self-organizing algorithms for quite some time. The paper and presentation target zero-shot generalization: the trained policies generalize to changes in the number of limbs of the entity as well as in the environment. They then pick the best model from training and evaluate it without any fine-tuning at test time.

paper: https://arxiv.org/pdf/1902.05546.pdf

Quantum Wasserstein GANs

The poster and paper presented what is purportedly the first design of quantum Wasserstein Generative Adversarial Networks (WGANs), which has been shown to improve the robustness and the scalability of the adversarial training of quantum generative models on noisy quantum hardware. Parameterized quantum circuits can be used as a parameterized representation of functions – so-called quantum neural networks – which can be applied to classical supervised learning models or used to construct generative models. The paper also showed how to turn quantum Wasserstein semimetrics into a concrete design of quantum WGANs that can be efficiently implemented on quantum machines. FWIW, in functional analysis pseudometrics often come from seminorms on vector spaces, so it is natural to call them "semimetrics". The paper used WGANs to generate a 3-qubit quantum circuit of 50 gates that approximated a 3-qubit simulation circuit requiring over 10k gates using off-the-shelf standard techniques. The QWGAN can thus be used to approximate complex quantum circuits with smaller circuits. A smaller circuit was also trained to approximate the Choi–Jamiolkowski isomorphism, or Choi state, which encodes the action of a quantum circuit.

Deep Signature Transforms

Signatures refer to a set of statistics computed from a stream of data. There is also the signature transform, sometimes called the transform kernel, which can be used to model a curve as a linear combination of basis functions: signatures provide a basis for functions on the space of curves, and these functions can then be used as operative building blocks. A stream over a space V can then be defined as:

    \[\mathcal{S}(V) = \{\,\mathbf{x} = (x_1, \ldots, x_n) : x_i \in V,\ n \in \mathbb{N}\,\}\]

This also has interesting ramifications as a feature mapping / feature engineering process, as well as for embedding signatures within algorithms – in this case, as a layer within a neural network. This is akin to some past fingerprinting techniques for media, and the paper does mention: "in order to preserve the stream-like nature is to sweep a one-dimensional convolution along the stream." The embedding techniques and the path-preserving nature made this an extremely enjoyable discussion.

code here: https://github.com/patrick-kidger/Deep-Signature-Transforms

paper here: https://arxiv.org/pdf/1905.08494.pdf
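
The authors' code linked above is the reference implementation of the differentiable signature layer; purely as a sketch of what the first two signature levels of a discrete stream look like, here is a left-point numerical approximation of the iterated integrals in numpy:

```python
import numpy as np

def signature_depth2(path):
    """Approximate the depth-1 and depth-2 signature terms of a stream.

    path: array of shape (length, d). Level 1 is the total increment;
    level 2 approximates the iterated integrals  ∫ (x^i - x^i_0) dx^j
    with a left-point rule.
    """
    increments = np.diff(path, axis=0)        # dx at each step
    level1 = path[-1] - path[0]
    centered = path[:-1] - path[0]            # x_t - x_0 at the left endpoints
    level2 = centered.T @ increments          # (d, d) matrix of iterated integrals
    return level1, level2

# A short 2-D stream; its signature is a fixed-size feature vector
# regardless of the stream's length.
t = np.linspace(0, 1, 50)
path = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)], axis=1)
lvl1, lvl2 = signature_depth2(path)
print(lvl1, lvl2.round(3), sep="\n")
```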

Metamers Of Neural Networks

This paper was near and dear to me due to some of my past lives working in the areas of psychological and perceptual media models. Metamers are a psychophysical color match between two patches of light that have different sets of wavelengths: two patches of color that look identical to us but are made up of different physical combinations of wavelengths. In this paper's case, the authors generate "model metamers" to test the similarity between human and artificial neural network representations. The group generated model metamers for natural stimuli by performing gradient descent on a noise signal, matching the responses of individual layers of image and audio networks to a natural image or speech signal. The resulting signals reflect the invariances instantiated in the network up to the matched layer. As with most things in machine learning, the team sought to determine whether the nature of the invariances would be similar to those of humans, in which case the model metamers should remain human-recognizable regardless of the stage from which they are generated. In this case, the humans diverged from the neural networks. We need more of this type of work on how perception affects machine learning outcomes – or possibly priors?

paper here: https://papers.nips.cc/paper/9198-metamers-of-neural-networks-reveal-divergence-from-human-perceptual-systems.pdf
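
Here is a minimal sketch of the metamer-generation recipe described above, with a stand-in untrained network instead of the paper's trained image and audio models: start from noise and run gradient descent until a chosen layer's activations match those of a reference input.

```python
import torch
import torch.nn as nn

# Sketch of the metamer-generation idea: optimize a noise input so that a
# chosen layer's activations match those of a reference input. The network
# here is an untrained stand-in, not the paper's trained models.
torch.manual_seed(0)
network_up_to_layer = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())

reference = torch.rand(1, 3, 32, 32)                     # stands in for a natural image
target_activations = network_up_to_layer(reference).detach()

metamer = torch.rand(1, 3, 32, 32, requires_grad=True)   # start from noise
optim = torch.optim.Adam([metamer], lr=0.05)

for _ in range(200):
    optim.zero_grad()
    loss = ((network_up_to_layer(metamer) - target_activations) ** 2).mean()
    loss.backward()
    optim.step()

# `metamer` now yields (approximately) the same activations at that layer as
# the reference, while potentially looking nothing like it to a human.
print("matching loss:", loss.item())
```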

Weight Agnostic Neural Networks

I particularly enjoyed this poster and the commentary "Animals have innate abilities…" I also believe most of the animal kingdom is sentient, as well as operating on literally different wavelengths (spectrum, etc.). The paper demonstrates a method that can find minimal neural network architectures that perform several reinforcement learning tasks without weight training – ergo the title "weight agnostic". In place of optimizing the weights of a fixed network, they sought to optimize instead for architectures that perform well over a wide range of weights. When I walked up to the poster I immediately thought of Algorithmic Information Theory (AIT) and how soft weights have been used for neural networks. AIT is based on Kolmogorov complexity, where the complexity of a computable object is the minimum length of the program that can compute it. The paper goes into detail concerning the minimum description length (MDL) of a program and the recent dusting off of these ideas as applied to larger deep learning nets. The poster did not fully reflect the transparency of the paper: the research is very focused on creating generalized network architectures, which IMHO is a step toward AGI, and the paper states that the WANNs do not approach the performance of engineered CNNs. I also appreciated the overall frankness of the paper. Quote from the paper: "This paper is strongly motivated towards these goals of blending innate behavior and learning, and we believe it is a step towards addressing the challenge posed by Zador. We hope this work will help bring neuroscience and machine learning communities closer together to tackle these challenges."

Interactive version of the paper here: https://weightagnostic.github.io/

Regular paper here: https://arxiv.org/pdf/1906.04358.pdf
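
To illustrate the evaluation idea (not the paper's architecture search), here is a toy sketch in which a fixed topology is scored by its average performance when every connection shares a single weight value drawn from a small range. The network and task below are invented stand-ins:

```python
import numpy as np

# Weight-agnostic evaluation sketch: score a fixed architecture by how well
# it performs when every connection shares one weight value from a range.
def tiny_network(x, w):
    # Fixed topology (2 -> 3 -> 1), every connection set to the shared weight w.
    h = np.tanh(w * (x @ np.ones((2, 3))))
    return np.tanh(w * (h @ np.ones((3, 1))))

def score_architecture(shared_weights=(-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)):
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 2))
    y = (x.sum(axis=1) > 0).astype(float)      # toy task: sign of the sum
    accuracies = []
    for w in shared_weights:
        pred = (tiny_network(x, w).ravel() > 0).astype(float)
        accuracies.append((pred == y).mean())
    # The architecture's score is its mean performance over the weight range.
    return np.mean(accuracies)

print("weight-agnostic score:", score_architecture())
```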

Inducing Brain Relevant Bias in Natural Language Processing Models

This poster was part of a general theme that I saw throughout the conference: utilizing medical imaging devices to create better canonical models for machine learning. The paper shows that the relationship between language and brain activity learned by BERT (Bidirectional Encoder Representations from Transformers) during fine-tuning transfers across multiple participants. The paper goes on to show that, for some participants, the fine-tuned representations learned from both magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI) are better for predicting fMRI than the representations learned from fMRI alone, indicating that the learned representations capture brain-activity-relevant information that is not simply an artifact of the modality. The model predicts the fMRI activity associated with reading arbitrary text passages well enough to distinguish which of two story segments is being read with 74% accuracy. That is impressive, and I believe we need more multi-modality papers and research of this nature.

Full site with paper data etc: http://www.cs.cmu.edu/~fmri/plosone/
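
As a rough sketch of the evaluation logic only (synthetic data, not the paper's pipeline), one can learn a linear map from text representations to brain responses and then run the two-alternative test: is the prediction closer to the matching segment's response than to a distractor's?

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-ins: "text features" play the role of BERT representations,
# and "brain" plays the role of fMRI responses.
rng = np.random.default_rng(0)
text_features = rng.normal(size=(200, 64))
true_map = rng.normal(size=(64, 500))
brain = text_features @ true_map + 5.0 * rng.normal(size=(200, 500))

train, test = slice(0, 150), slice(150, 200)
model = Ridge(alpha=10.0).fit(text_features[train], brain[train])
pred = model.predict(text_features[test])

# 2-way test: is each prediction closer to its matching response than to
# a mismatched (distractor) response?
true_resp = brain[test]
distractor = true_resp[np.roll(np.arange(50), 1)]
accuracy = (np.linalg.norm(pred - true_resp, axis=1)
            < np.linalg.norm(pred - distractor, axis=1)).mean()
print("2-way classification accuracy:", accuracy)
```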

A Robust Non-Clairvoyant Dynamic Mechanism for Contextual Auctions

This paper caught my eye as I spend a great deal of time researching agents in game-theoretic and mechanism-design situations. What really caught my eye was the terminology "non-clairvoyant". I suppose if there were a method that was truly clairvoyant we wouldn't be concerned with the robustness of said algorithms. Actually, it is a real definition – a dynamic mechanism is non-clairvoyant if the allocation and pricing rule at each period does not depend on the type distributions in future periods. In many types of auctions, especially ad networks, the seller must rely on approximate or asymmetric models of the buyer's preferences to effectively set auction parameters such as a reserve price. In mechanism design, you essentially have three vectors of input: [1] a collective decision problem, [2] a measure of quality to evaluate any candidate solution, and [3] a description of the resources – information – held by the participants. The paper presented a learned policy model and framework that could be applied in phases and possibly extrapolated to other types of applications. I personally think dynamic mechanism design has great applicability in the areas of distributed computing and distributed ledger platforms.
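
The dynamic, non-clairvoyant mechanism in the paper is considerably richer, but the basic object being tuned looks like this: a second-price auction with a reserve price, the parameter the seller must set from an approximate model of buyer preferences.

```python
import numpy as np

# Second-price auction with a reserve price: the winner pays the larger of
# the reserve and the second-highest bid, and there is no sale if the top
# bid is below the reserve.
def second_price_with_reserve(bids, reserve):
    bids = np.asarray(bids, dtype=float)
    winner = int(bids.argmax())
    if bids[winner] < reserve:
        return None, 0.0                      # no sale: top bid below reserve
    others = np.delete(bids, winner)
    price = max(reserve, others.max()) if others.size else reserve
    return winner, price

print(second_price_with_reserve([3.0, 5.0, 4.2], reserve=4.5))   # (1, 4.5)
print(second_price_with_reserve([3.0, 5.0, 4.2], reserve=2.0))   # (1, 4.2)
```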

I also attended the NASA Frontier Development Lab (FDL) event, sponsored by Google, Intel, and Nvidia. I was part of the NASA FDL AI Astronaut Health research project over the summer of 2019. The efforts, the technology, and most importantly the people are astounding. The event was standing room only, and several amazing conversations on the various NASA FDL projects were had at the event.

Machine Learning For Space

I do hope you will continue to visit my site. If you do, you will notice I have a type of "disease" called bibliomania. As such, I bought a book at the conference:

The future is distributed

So there you have it. While this probably was tl;dr, I hope you gave it a good scan while you were doing a pull request or two, and that it has at least provided some insight into the conference.

\forall papers: https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019

Until Then,

#IWishYouWater

tctjr
