
Why go off the convex path?


The notion of convexity underlies a lot of beautiful mathematics. When combined with computation, it gives rise to the area of convex optimization that has had a huge impact on understanding and improving the world we live in. However, convexity does not provide all the answers. Many procedures in statistics, machine learning and nature at large—Bayesian inference, deep learning, protein folding—successfully solve non-convex problems that are NP-hard, i.e., intractable on worst-case instances. Moreover, often nature or humans choose methods that are inefficient in the worst case to solve problems in P.

Can we develop a theory to resolve this mismatch between reality and the predictions of worst-case analysis? Such a theory could identify structure in natural inputs that helps sidestep worst-case complexity.

This blog is dedicated to the idea that optimization methods—whether created by humans or nature, whether convex or nonconvex—are exciting objects of study and often lead to useful algorithms and insights into nature. This study can be seen as an extension of classical mathematical fields such as dynamical systems and differential equations among others, but with the important addition of the notion of computational efficiency.

We will report on interesting research directions and open problems, and highlight progress that has been made. We will write articles ourselves as well as encourage others to contribute. In doing so, we hope to generate an active dialog between theorists, scientists and practitioners and to motivate a generation of young researchers to work on these important problems.


Semantic Word Embeddings


This post can be seen as an introduction to how nonconvex problems arise naturally in practice, and also the relative ease with which they are often solved.

I will talk about word embeddings, a geometric way to capture the “meaning” of a word via a low-dimensional vector. They are useful in many tasks in Information Retrieval (IR) and Natural Language Processing (NLP), such as answering search queries or translating from one language to another.

You may wonder: how can a 300-dimensional vector capture the many nuances of word meaning? And what the heck does it mean to “capture meaning?”

##Properties of Word Embeddings

A simple property of embeddings obtained by all the methods I'll describe is cosine similarity: the similarity between two words (as rated by humans on a $[-1,1]$ scale) correlates with the cosine of the angle between their vectors. To give an example, the cosine for milk and cow may be $0.6$, whereas for milk and stone it may be $0.2$, which is roughly the similarity human subjects assign to them.

A more interesting property of recent embeddings is that they can solve analogy relationships via linear algebra. For example, the word analogy question man : woman :: king : ?? can be solved by looking for the word $w$ such that $v_{king} - v_w$ is most similar to $v_{man} - v_{woman}$; in other words, the word $w$ that minimizes

$$\Vert v_{man} - v_{woman} - v_{king} + v_{w}\Vert_2^2.$$

This simple idea can solve $75\%$ of analogy questions on some standard testbed. Note that the method is completely unsupervised: it constructs the embeddings using a big (unannotated) text corpus, and receives no training specific to analogy solving. Here is a rendering of this linear algebraic relationship between masculine-feminine pairs.
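To make the recipe concrete, here is a minimal sketch of the analogy computation. The toy 3-dimensional vectors and the function name are purely illustrative (real embeddings are learned from a corpus and have a few hundred dimensions); the point is just the argmin over the squared distance above.

```python
import numpy as np

def solve_analogy(vecs, a, b, c):
    """Analogy a : b :: c : ?  -- return the word w (other than a, b, c)
    minimizing ||v_a - v_b - v_c + v_w||^2."""
    best, best_val = None, float("inf")
    for w, v in vecs.items():
        if w in (a, b, c):
            continue
        val = np.sum((vecs[a] - vecs[b] - vecs[c] + v) ** 2)
        if val < best_val:
            best, best_val = w, val
    return best

# made-up 3-dimensional vectors just to exercise the function
vecs = {"man":   np.array([1.0, 0.1, 0.0]),
        "woman": np.array([1.0, 0.9, 0.0]),
        "king":  np.array([0.2, 0.1, 1.0]),
        "queen": np.array([0.2, 0.9, 1.0])}
print(solve_analogy(vecs, "man", "woman", "king"))  # -> queen
```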

Good embeddings have other properties that will be covered in a future post. (Also, I can't resist mentioning that fMRI-imaging of the brain suggests that word embeddings are related to how the human brain encodes meaning; see the well-known paper of Mitchell et al.)

##Computing Word embeddings (via Firth’s Hypothesis)

In all methods, the word vector is a succinct representation of the distribution of other words around this word. That this suffices to capture meaning is asserted by Firth’s hypothesis from 1957, “You shall know a word by the company it keeps.” To give an example, if I ask you to think of a word that tends to co-occur with cow, drink, babies, calcium, you would immediately answer: milk.

Note that we don’t believe Firth’s hypothesis fully accounts for all aspects of semantics —understanding new metaphors or jokes, for example, seems to require other modes of experiencing the real world than simply reading text.

But Firth’s hypothesis does imply a very simple word embedding, albeit a very high-dimensional one.

Embedding 1: Suppose the dictionary has $N$ distinct words (in practice, $N =100,000$). Take a very large text corpus (e.g., Wikipedia) and let $Count_5(w_1, w_2)$ be the number of times $w_1$ and $w_2$ occur within a distance $5$ of each other in the corpus. Then the word embedding for a word $w$ is a vector of dimension $N$, with one coordinate for each dictionary word. The coordinate corresponding to word $w_2$ is $Count_5(w, w_2)$. (Variants of this method involve considering cooccurence of $w$ with various phrases or $n$-tuples.)
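A few lines of Python suffice to compute Embedding 1; the sketch below (with hypothetical function and variable names, and a window of $5$) counts co-occurrences in a tokenized corpus and returns, for each word, a sparse vector represented as a Counter.

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(tokens, window=5):
    """Embedding 1: for each word w, a (sparse) N-dimensional vector whose
    coordinate for w2 is Count_window(w, w2), the number of times w and w2
    occur within `window` positions of each other in the corpus."""
    vectors = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[w][tokens[j]] += 1
    return vectors

tokens = "the cow drinks milk and the baby drinks milk".split()
print(cooccurrence_vectors(tokens)["milk"])
```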

The obvious problem with Embedding 1 is that it uses extremely high-dimensional vectors. How can we compress them?

Embedding 2: Do dimension reduction by taking the rank-300 singular value decomposition (SVD) of the above vectors.

Recall that for an $N \times N$ matrix $M$ this means finding vectors $v_1, v_2, \ldots, v_N \in \mathbb{R}^{300}$ that minimize

$$\sum_{i,j} \left(M_{ij} - v_i \cdot v_j\right)^2. \qquad (1)$$

Using SVD to do dimension reduction seems an obvious idea these days but it actually is not. After all, it is unclear a priori why the above $N \times N$ matrix of cooccurence counts should be close to a rank-300 matrix. That this is the case was empirically discovered in the paper on Latent Semantic Indexing or LSI.

Empirically, the method can be improved by replacing the counts by their logarithm, as in Latent Semantic Analysis or LSA. Other authors claim square root is even better than logarithm. Another interesting empirical fact in the LSA paper is that dimension reduction via SVD not only compresses the embedding but improves its quality. (Improvement via compression is a familiar phenomenon in machine learning.) In fact the LSA authors show that these word vectors solve word similarity tasks about as well as the average American high schooler.
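For concreteness, here is what Embeddings 1 and 2 might look like end to end on a small count matrix: take logarithms (the LSA trick) and keep a rank-$k$ SVD. The toy Poisson counts, the symmetric $\sqrt{\sigma}$ scaling of the factors, and $k=10$ are illustrative choices, not prescriptions from the papers above.

```python
import numpy as np

def svd_embeddings(counts, k):
    """Rank-k SVD of the log co-occurrence matrix: one k-dimensional row per word."""
    M = np.log(1.0 + counts)                       # log(1 + x) avoids log(0)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * np.sqrt(s[:k])               # split the singular values symmetrically

rng = np.random.default_rng(0)
toy_counts = rng.poisson(2.0, size=(50, 50))       # stand-in for real co-occurrence counts
vecs = svd_embeddings(toy_counts, k=10)
print(vecs.shape)                                  # (50, 10)
```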

A research area called Vector Space Models (see the survey by Turney and Pantel) studies various modifications of the above idea. Embeddings are also known to improve if we reweight the various terms in expression (1): popular reweightings include TF-IDF, PMI, logarithm, etc.

Let me point out that reweighting the $(i, j)$ term in expression (1) leads to a weighted version of SVD, which is NP-hard. (I always emphasize to my students that a polynomial-time algorithm to compute rank-$k$ SVD is a miracle, since modifying the problem statement in small ways makes it NP-hard.) But in practice, weighted SVD can be solved on a laptop in less than a day —remember, $N$ is rather large, about $10^5$!— by simple gradient descent on the objective (1), possibly also using a regularizer. (Of course, a lot has been proven about such gradient descent methods in the context of convex optimization; the surprise is that they also work in such nonconvex settings.) Weighted SVD is a subcase of Matrix Factorization approaches in machine learning, which we will encounter again in upcoming posts.
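As a sanity check that plain gradient descent really does handle the (NP-hard in general) weighted objective on benign instances, here is a bare-bones sketch: minimize $\sum_{i,j} W_{ij}(M_{ij} - \langle v_i, v_j\rangle)^2$ plus a small regularizer on a tiny synthetic matrix. The step size, iteration count and regularization weight are arbitrary illustrative choices; nothing is tuned.

```python
import numpy as np

def weighted_svd_gd(M, W, k, steps=1500, lr=2e-3, reg=1e-4, seed=0):
    """Gradient descent on the nonconvex factor V for the weighted objective
    sum_{i,j} W[i,j] * (M[i,j] - <v_i, v_j>)^2 + reg * ||V||_F^2."""
    rng = np.random.default_rng(seed)
    V = 0.1 * rng.standard_normal((M.shape[0], k))
    for _ in range(steps):
        R = W * (V @ V.T - M)                  # weighted residual
        V -= lr * (2 * (R + R.T) @ V + 2 * reg * V)
    return V

rng = np.random.default_rng(1)
U = rng.standard_normal((30, 2))
M = U @ U.T + 0.01 * rng.standard_normal((30, 30))   # a noisy rank-2 matrix
W = rng.random((30, 30))                             # arbitrary nonnegative weights
V = weighted_svd_gd(M, W, k=2)
print(np.mean(W * (M - V @ V.T) ** 2))               # weighted reconstruction error
```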

But returning to word embeddings, the following question had not been raised or debated, to the best of my knowledge: What property of human language explains the fact that these very high-dimensional matrices derived (in a nonlinear way) from word cooccurences are close to low-rank matrices? (In a future blog post I will describe our new theoretical explanation.)

The third embedding method I wish to describe uses energy-based models, for instance the Word2Vec family of methods from 2013 by the Google team of Mikolov et al., which also created a buzz due to the above-mentioned linear algebraic method to solve word analogy tasks. The word2vec models are inspired by pre-existing neural net models for language (basically, the word embedding corresponds to the neural net's internal representation of the word; see this blog). Let me describe the simplest variant, which assumes that the word vectors are related to word probabilities as follows:

Embedding 3 (Word2Vec(CBOW)):

$$\Pr[w \mid w_1, w_2, \ldots, w_5] \;\propto\; \exp\left(v_w \cdot \left(\tfrac{1}{5} \textstyle\sum_{i=1}^{5} v_{w_i}\right)\right), \qquad (2)$$

where the left hand side gives the empirical probability that word $w$ occurs in the text conditional on the last five words being $w_1$ through $w_5$.

Assume we can estimate the left hand side using a large text corpus. Then expression (2) for the word vectors—together with a constraint capping the dimension of the vectors to, say, 300 — implicitly defines a nonconvex optimization problem which is solved in practice as follows. Let $S$ be the set of all the $6$-tuples of words that occur in the text. Let $N$ be a set of random $6$-tuples; this set is called the negative sample since presumably these tuples are gibberish. The method consists of finding word embeddings that give high probability to tuples in $S$ and low probability to tuples in $N$. Roughly speaking, it maximizes the difference between the following two quantities: (i) the sum of $\exp(v_w \cdot (\frac{1}{5} \sum_i v_{w_i}))$ (suitably scaled) over all $6$-tuples in $S$, and (ii) the corresponding sum over tuples in $N$. The word2vec team introduced some other tweaks that allowed them to solve this optimization for very large corpora containing 10 billion words.
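The following sketch shows the shape of such a training step, though with the standard log-sigmoid form of the negative-sampling objective rather than the raw exponential scores described above; all names, sizes and learning rates are hypothetical, and a real implementation adds the further tweaks mentioned in the papers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_negative_sampling_step(V, pos_tuples, neg_tuples, lr=0.05):
    """One pass over 6-tuples (5 context word ids, 1 target id): tuples from the
    corpus (label 1) are pushed up, random "negative" tuples (label 0) down."""
    for tuples, label in ((pos_tuples, 1.0), (neg_tuples, 0.0)):
        for context, target in tuples:
            h = V[list(context)].mean(axis=0)        # average of the context vectors
            g = sigmoid(V[target] @ h) - label       # gradient of the log-sigmoid loss
            vt = V[target].copy()
            V[target] -= lr * g * h
            V[list(context)] -= lr * g * vt / len(context)
    return V

rng = np.random.default_rng(0)
V = 0.1 * rng.standard_normal((1000, 50))            # 1000 words, 50-dimensional vectors
pos = [((1, 2, 3, 4, 5), 6)]                         # a 6-tuple seen in the corpus
neg = [((7, 8, 9, 10, 11), 12)]                      # a random 6-tuple
V = cbow_negative_sampling_step(V, pos, neg)
```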

The word2vec papers are a bit mysterious, and have motivated much followup work. A paper by Levy and Goldberg (See Omer Levy’s Blog) explains that the word2vec methods are actually modern versions of older vector space methods. After all, if you take logs of both sides of expression (2), you see that the logarithm of some cooccurence probability is being expressed in terms of inner products of some word vectors, which is very much in the spirit of the older work. (Levy and Goldberg have more to say about this, backed up by interesting experiments comparing the vectors obtained by various approaches.)

Another paper by Pennington et al. at Stanford suggests a model called GloVe that uses an explicit weighted-SVD strategy for finding word embeddings. They also give an intuitive explanation of why these embeddings solve word analogy tasks, though the explanation isn't quite rigorous.

In a future post I will talk more about our subsequent theoretical work that tries to unify these different approaches, and also explains some cool linear algebraic properties of word embeddings. I note that linear structure also arises in representations of images learnt via deep learning, and it is tantalizing to wonder if similar theory applies to that setting.

Tensor Methods in Machine Learning


Tensors are high dimensional generalizations of matrices. In recent years tensor decompositions were used to design learning algorithms for estimating parameters of latent variable models like Hidden Markov Models, Mixtures of Gaussians and Latent Dirichlet Allocation (many of these works were considered as examples of “spectral learning”, read on to find out why). In this post I will briefly describe why tensors are useful in these settings.

Using Singular Value Decomposition (SVD), we can write a matrix $M \in \mathbb{R}^{n\times m}$ as the sum of many rank one matrices:

$$M = \sum_{i=1}^{r} \vec{u}_i \vec{v}_i^{\top}.$$

When the rank $r$ is small, this gives a concise representation for the matrix $M$ (using $(m+n)r$ parameters instead of $mn$). Such decompositions are widely applied in machine learning.

Tensor decomposition is a generalization of low rank matrix decomposition. Although most tensor problems are NP-hard in the worst case, several natural subcases of tensor decomposition can be solved in polynomial time. Later we will see that these subcases are still very powerful in learning latent variable models.

##Matrix Decompositions

Before talking about tensors, let us first see an example of how matrix factorization can be used to learn latent variable models. In 1904, psychologist Charles Spearman tried to understand whether human intelligence is a composite of different types of measurable intelligence. Let's describe a highly simplified version of his method, where the hypothesis is that there are exactly two kinds of intelligence: quantitative and verbal. Spearman's method consisted of making his subjects take several different kinds of tests. Let's name these tests Classics, Math, Music, etc. The subjects' scores can be represented by a matrix $M$, which has one row per student, and one column per test.

[Figure: the score matrix $M$]

The simplified version of Spearman's hypothesis is that each student has different amounts of quantitative and verbal intelligence, say $x_{quant}$ and $x_{verb}$ respectively. Each test measures a different mix of intelligences, so say it gives a weighting $y_{quant}$ to quantitative and $y_{verb}$ to verbal. Intuitively, a student with higher strength on verbal intelligence should perform better on a test that has a high weight on verbal intelligence. Let's describe this relationship as a simple bilinear function:

$$\mathrm{score} = x_{quant}\, y_{quant} + x_{verb}\, y_{verb}.$$

Denoting by $\vec x_{verb}, \vec x_{quant}$ the vectors describing the strengths of the students, and letting $\vec y_{verb}, \vec y_{quant}$ be the vectors that describe the weighting of intelligences in the different tests, we can express matrix $M$ as the sum of two rank 1 matrices (in other words, $M$ has rank at most $2$):

Thus verifying that $M$ has rank $2$ (or that it is very close to a rank $2$ matrix) should let us conclude that there are indeed two kinds of intelligence.

$$M = \vec{x}_{quant}\, \vec{y}_{quant}^{\,\top} + \vec{x}_{verb}\, \vec{y}_{verb}^{\,\top}.$$

Note that this decomposition is not the Singular Value Decomposition (SVD). SVD requires strong orthogonality constraints (which translate to “different intelligences are completely uncorrelated”) that are not plausible in this setting.

##The Ambiguity

But ideally one would like to take the above idea further: we would like to assign a definitive quantitative/verbal intelligence score to each student. This seems simple at first sight: just read off the score from the decomposition. For instance, it shows Alice is strongest in quantitative intelligence.

However, this is incorrect, because the decomposition is not unique! The following is another valid decomposition

[Figure: an alternative rank-2 decomposition of $M$]

According to this decomposition, Bob is strongest in quantitative intelligence, not Alice. Both decompositions explain the data perfectly and we cannot decide a priori which is correct.

Sometimes we can hope to find the unique solution by imposing additional constraints on the decomposition, such as requiring all matrix entries to be nonnegative. However, even after imposing many natural constraints, in general the issue of multiple decompositions remains.

##Adding the 3rd Dimension

Since our current data has multiple explanatory decompositions, we need more data to learn exactly which explanation is the truth. Assume the strength of the intelligence changes with time: we get better at quantitative tasks at night. Now we can let the (poor) students take the tests twice: once during the day and once at night. The results we get can be represented by two matrices $M_{day}$ and $M_{night}$. But we can also think of this as a three dimensional array of numbers: a tensor $T$ in $\mathbb{R}^{\sharp students\times \sharp tests\times 2}$. Here the third axis stands for “day” or “night”. We say the two matrices $M_{day}$ and $M_{night}$ are slices of the tensor $T$.

[Figure: the tensor $T$, whose two slices are $M_{day}$ and $M_{night}$]

Let $z_{quant}$ and $z_{verb}$ be the relative strengths of the two kinds of intelligence at a particular time (day or night); then the new score can be computed by a trilinear function:

$$\mathrm{score} = x_{quant}\, y_{quant}\, z_{quant} + x_{verb}\, y_{verb}\, z_{verb}.$$

Keep in mind that this is the formula for one entry in the tensor: the score of one student, in one test and at a specific time. Who the student is specifies $x_{quant}$ and $x_{verb}$; what the test is specifies weights $y_{quant}$ and $y_{verb}$; when the test takes place specifies $z_{quant}$ and $z_{verb}$.

Similar to matrices, we can view this as a rank 2 decomposition of the tensor $T$. In particular, if we use $\vec x_{quant}, \vec x_{verb}$ to denote the strengths of students, $\vec y_{quant},\vec y_{verb}$ to denote the weights of the tests and $\vec z_{quant}, \vec z_{verb}$ to denote the variations of strengths in time, then we can write the decomposition as

$$T = \vec{x}_{quant} \otimes \vec{y}_{quant} \otimes \vec{z}_{quant} + \vec{x}_{verb} \otimes \vec{y}_{verb} \otimes \vec{z}_{verb}.$$

Now we can check that the second matrix decomposition we had is no longer valid: there are no values of $z_{quant}$ and $z_{verb}$ at night that could generate the matrix $M_{night}$. This is not a coincidence. Kruskal 1977 gave sufficient conditions for such decompositions to be unique. When applied to our case it is very simple:

Corollary The decomposition of tensor $T$ is unique (up to scaling and permutation) if none of the vector pairs $(\vec x_{quant}, \vec x_{verb})$, $(\vec y_{quant},\vec y_{verb})$, $(\vec z_{quant},\vec z_{verb})$ are collinear.

Note that of course the decomposition is not truly unique, for two reasons. First, the two rank-1 components play symmetric roles, and we need to decide which one corresponds to quantitative intelligence. Second, we can scale the three components $\vec x_{quant}$, $\vec y_{quant}$, $\vec z_{quant}$ simultaneously, as long as the product of the three scales is 1. Intuitively this is like using different units to measure the three components. Kruskal's result shows that these are the only degrees of freedom in the decomposition, and there cannot be a truly distinct decomposition as in the matrix case.

##Finding the Tensor

In the above example we get a low rank tensor $T$ by gathering more data. In many traditional applications the extra data may be unavailable or hard to get. Luckily, many exciting recent developments show that we can uncover these special tensor structures even if the original data is not in a tensor form!

The main idea is to use the method of moments (see a nice post by Moritz): estimate lower order correlations of the variables, and hope these lower order correlations have a simple tensor form.

Consider Hidden Markov Models as an example. Hidden Markov Models are widely used in analyzing sequential data like speech or text. Here, for concreteness, we consider a (simplified) model of natural language text (which is a basic version of the models underlying word embeddings).

In a Hidden Markov Model, we observe a sequence of words (a sentence) that is generated by a walk of a hidden Markov chain: each word has a hidden topic $h$ (a discrete random variable that specifies whether the current word is talking about “sports” or “politics”); the topic for the next word depends only on the topic of the current word. Each topic specifies a distribution over words. Instead of the topic itself, we observe a random word $x$ drawn from this topic distribution (for example, if the topic is “sports”, we will more likely see words like “score”). The dependencies are usually illustrated by the following diagram:

[Figure: graphical model of the Hidden Markov Model]

More concretely, to generate a sentence in a Hidden Markov Model, we start with some initial topic $h_1$. This topic evolves as a Markov chain to generate the topics for future words, $h_2, h_3, \ldots, h_t$. We observe words $x_1, \ldots, x_t$ from these topics. In particular, word $x_1$ is drawn according to topic $h_1$, word $x_2$ is drawn according to topic $h_2$, and so on.

Given many sentences that are generated exactly according to this model, how can we construct a tensor? A natural idea is to compute correlations: for every triple of words $(i,j,k)$, we count the number of times that these are the first three words of a sentence. Enumerating over $i,j,k$ gives us a three dimensional array (a tensor) $T$. We can further normalize it by the total number of sentences. After normalization the $(i,j,k)$-th entry of the tensor will be an estimate of the probability that the first three words are $(i,j,k)$. For simplicity assume we have enough samples and the estimate is accurate:

$$T_{i,j,k} = \Pr[x_1 = i,\; x_2 = j,\; x_3 = k].$$
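Estimating this tensor from data is just counting; a minimal sketch (with words represented as integer ids and hypothetical function names) might look as follows.

```python
import numpy as np

def first_three_word_tensor(sentences, vocab_size):
    """T[i,j,k] ~ Pr[the first three words of a sentence are (i, j, k)],
    estimated by counting over the corpus and normalizing."""
    T = np.zeros((vocab_size, vocab_size, vocab_size))
    for s in sentences:
        if len(s) >= 3:
            T[s[0], s[1], s[2]] += 1.0
    return T / max(1, len(sentences))

sentences = [[0, 1, 2, 3], [0, 1, 4], [2, 1, 0]]
T = first_three_word_tensor(sentences, vocab_size=5)
print(T[0, 1, 2], T[0, 1, 4])   # 1/3 each
```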

Why does this tensor have the nice low rank property? The key observation is that if we “fix” (condition on) the topic of the second word $h_2$, it cuts the graph into three parts: one part containing $h_1,x_1$, one part containing $x_2$ and one part containing $h_3,x_3$. These three parts are independent conditioned on $h_2$. In particular, the first three words $x_1,x_2,x_3$ are independent conditioned on the topic of the second word $h_2$. Using this observation we can compute each entry of the tensor as

$$T_{i,j,k} = \sum_{l} \Pr[h_2 = l]\;\Pr[x_1 = i \mid h_2 = l]\;\Pr[x_2 = j \mid h_2 = l]\;\Pr[x_3 = k \mid h_2 = l].$$

Now if we let $\vec x_l$ be a vector whose $i$-th entry is the probability that the first word is $i$, given that the topic of the second word is $l$, and let $\vec y_l$ and $\vec z_l$ be the analogous vectors for the second and third words, then we can write the entire tensor as

$$T = \sum_{l} \Pr[h_2 = l]\;\; \vec x_l \otimes \vec y_l \otimes \vec z_l.$$

This is exactly the low rank form we are looking for! Tensor decomposition allows us to uniquely identify these components, and further infer the other probabilities we are interested in. For more details see the paper by Anandkumar et al. 2012 (this paper uses the tensor notations, but the original idea appeared in the paper by Mossel and Roch 2006).

##Implementing Tensor Decomposition

Using the method of moments, we can discover nice tensor structures in many problems. The uniqueness of tensor decomposition makes these tensors very useful in learning the parameters of the models. But how do we compute the tensor decompositions?

In the worst case we have bad news: most tensor problems are NP-hard. However, in most natural cases, as long as the tensor does not have too many components, and the components are not adversarially chosen, tensor decomposition can be computed in polynomial time! Here we describe the algorithm of Robert Jennrich (it first appeared in a 1970 working paper by Harshman; the version we present here is a more general version by Leurgans, Ross and Abel 1993).

Jennrich's Algorithm
Input: tensor $T = \sum_{i=1}^r \lambda_i \vec x_i \otimes \vec y_i \otimes \vec z_i$.

  1. Pick two random vectors $\vec u, \vec v$.
  2. Compute $T_{\vec u} = \sum_{i=1}^n u_i T[:,:,i] = \sum_{i=1}^r \lambda_i (\vec u^\top \vec z_i) \vec x_i \vec y_i^\top$.
  3. Compute $T_{\vec v} = \sum_{i=1}^n v_i T[:,:,i] = \sum_{i=1}^r \lambda_i (\vec v^\top \vec z_i) \vec x_i \vec y_i^\top$.
  4. The $\vec x_i$'s are eigenvectors of $T_{\vec u} (T_{\vec v})^{+}$, and the $\vec y_i$'s are eigenvectors of $T_{\vec v}^{\top} (T_{\vec u}^{\top})^{+}$.

In the algorithm, “$^+$” denotes the pseudo-inverse of a matrix (think of it as the inverse if this is unfamiliar).

The algorithm looks at weighted slices of the tensor: a weighted slice is a matrix obtained by taking a weighted combination of the slices of the tensor along the $z$ direction (similarly, a weighted slice of a matrix $M$ in direction $\vec u$ is the vector $M\vec u$). Because of the low rank structure, all the slices must share matrix decompositions with the same components.

The main observation of the algorithm is that although a single matrix can have infinitely many low rank decompositions, two matrices sharing the same components essentially have a unique decomposition. In fact, it is highly unlikely for two arbitrary matrices to share decompositions with the same components. In the tensor case, because of the low rank structure we have

$$T_{\vec u} = X D_{\vec u} Y^{\top}, \qquad T_{\vec v} = X D_{\vec v} Y^{\top},$$

where $D_{\vec u}, D_{\vec v}$ are diagonal matrices and the columns of $X$ and $Y$ are the $\vec x_i$'s and $\vec y_i$'s respectively. This is called a simultaneous diagonalization of $T_{\vec u}$ and $T_{\vec v}$. With this structure it is easy to show that the $\vec x_i$'s are eigenvectors of $T_{\vec u} (T_{\vec v})^{+} = X D_{\vec u} D_{\vec v}^{-1} X^+$. So we can actually compute tensor decompositions using spectral decompositions for matrices.
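The whole algorithm is a few lines of numpy. The sketch below builds a random rank-3 tensor, recovers the $\vec x_i$'s and $\vec y_i$'s from eigendecompositions of two weighted slices, and checks that each true component is (nearly) parallel to a recovered one; it assumes distinct, non-collinear components and a noiseless tensor, and the eigenvalue thresholding is a simplistic illustrative choice.

```python
import numpy as np

def jennrich(T, r, seed=0):
    """Recover the x_i's and y_i's of T = sum_i lambda_i x_i ⊗ y_i ⊗ z_i
    (up to permutation and scaling) via two random weighted slices."""
    rng = np.random.default_rng(seed)
    u, v = rng.standard_normal(T.shape[2]), rng.standard_normal(T.shape[2])
    Tu = np.einsum('abk,k->ab', T, u)                 # weighted slice T_u
    Tv = np.einsum('abk,k->ab', T, v)                 # weighted slice T_v
    wx, X = np.linalg.eig(Tu @ np.linalg.pinv(Tv))    # x_i's: eigenvectors of T_u (T_v)^+
    wy, Y = np.linalg.eig(Tv.T @ np.linalg.pinv(Tu.T))
    X = X[:, np.argsort(-np.abs(wx))[:r]].real        # keep the r dominant eigenvectors
    Y = Y[:, np.argsort(-np.abs(wy))[:r]].real
    return X, Y

rng = np.random.default_rng(1)
xs, ys, zs = rng.standard_normal((3, 3, 8))           # 3 components in dimension 8
T = np.einsum('ia,ib,ic->abc', xs, ys, zs)
X, _ = jennrich(T, r=3)
cos = np.abs((xs / np.linalg.norm(xs, axis=1, keepdims=True)) @ (X / np.linalg.norm(X, axis=0)))
print(np.round(cos.max(axis=1), 3))                   # each entry close to 1
```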

Many of the earlier works (including Mossel and Roch 2006) that apply tensor decompositions to learning problems actually rediscovered this algorithm independently, and the word “tensor” never appeared in those papers. In fact, tensor decomposition techniques are traditionally called “spectral learning” since they are seen as derived from SVD. But now we have other methods to do tensor decompositions that have better theoretical guarantees and practical performance. See the survey by Kolda and Bader 2009 for more discussion.

For more examples of using tensor decompositions to learn latent variable models, see the paper by Anandkumar et al. 2012. This paper shows that several prior algorithms for learning models such as Hidden Markov Model, Latent Dirichlet Allocation, Mixture of Gaussians and Independent Component Analysis can be interpreted as doing tensor decompositions. The paper also gives a proof that tensor power method is efficient and robust to noise.

Recent research focuses on two problems: how to formulate other learning problems as tensor decompositions, and how to compute tensor decompositions under weaker assumptions. Using tensor decompositions, we can learn more models, including community models, probabilistic context-free grammars, mixtures of general Gaussians and two-layer neural networks. We can also efficiently compute tensor decompositions when the rank of the tensor is much larger than the dimension (see for example the papers by Bhaskara et al. 2014, Goyal et al. 2014, Ge and Ma 2015). There are many other interesting works and open problems, and the list here is by no means complete.

Nature, Dynamical Systems and Optimization


The language of dynamical systems is the preferred choice of scientists to model a wide variety of phenomena in nature. The reason is that, often, it is easy to locally observe or understand what happens to a system in one time-step. Could we then piece this local information together to make deductions about the global behavior of these dynamical systems? The hope is to understand some of nature’s algorithms and, in this quest, unveil new algorithmic techniques. In this first of a series of posts, we give a gentle introduction to dynamical systems and explain what it means to view them from the point of view of optimization.

Dynamical Systems and the Fate of Trajectories

Given a system whose state at time $t$ takes a value $x(t)$ from a domain $\Omega,$ a dynamical system over $\Omega$ is a function $f$ that describes how this state evolves: one can write the update as

$$\frac{dx(t)}{dt} = f(x(t)) \quad \mathrm{or} \quad x(t+1)=x(t) + f(x(t))$$

in continuous or discrete time respectively. In other words, $f$ describes what happens in one unit of time to each point in the domain $\Omega.$ Classically, to study a dynamical system is to study the eventual fate of its trajectories, i.e., the paths traced by successive states of the system starting from a given state. For this question to make sense, $f$ must not take any state out of the domain. However, a priori, there is nothing to say that $x(t)$ remains in $\Omega$ beyond $x(0).$ This is the problem of global existence of trajectories and it can sometimes be quite hard to establish. Assuming that the dynamical system at hand has a solution for all times and all starting points, and that $\Omega$ is compact, the trajectories either tend to fixed points or limit cycles, or end up in chaos.

[Figure: the fate of trajectories]

A fixed point of a dynamical system, as the name suggests, is a state $x \in \Omega$ which does not change under the application of $f$, i.e., $f(x)=0.$ A fixed point is said to be stable if trajectories starting at all nearby points eventually converge to it, and unstable otherwise. Stability is a property that one might expect to find in nature. Limit cycles are closed trajectories with a similar notion of stability/instability, while limits of trajectories which are neither fixed points nor limit cycles are (loosely) termed chaos.

What do Dynamical Systems Optimize?

For now, we will consider the class of dynamical systems which only have fixed points, possibly many. In this setting, one can define a function $F$ which maps an $x \in \Omega$ to its limit under the repeated application of $f.$ Note that to make this function well-defined we might have to look at the closure of $\Omega.$ This brings us to the following broad, admittedly not well-defined and widely open question that we would like to study:

Given a dynamical system $(\Omega,f)$, what is $F$?

When $f$ happens to be the negative gradient of a convex function $g$ over some convex domain $\Omega,$ the dynamical system $(\Omega,f)$ is nothing but an implementation of gradient descent to find the minimum of $g$, answering our question perfectly.

However, in many cases, $f$ may not be a gradient system and understanding what $f$ optimizes may be quite difficult. The fact that there may be multiple fixed points necessarily means that trajectories starting at different points may converge to different points in the domain, giving us a sense of non-convexity. In such cases, answering our question can be a daunting task and, currently, there is no general theory for it. We present two dynamical systems from nature – one easy and one not quite.

Evolution and the Largest Eigenvector

As a simple but important example, consider a population consisting of $n$ types which is subject to the forces of evolution and held at a constant size, say one unit of mass. Thus, if we let $x_i(t)$ denote the fraction of type $i$ at time $t$ in the population, the domain becomes the unit simplex $\Delta^n=\{x \in \mathbb{R}^n: x \geq 0 \;\mathrm{and}\; \sum_i x_i=1 \}.$ The update function is

$$f(x)= Qx - \Vert Qx \Vert_1 \cdot x$$

for a positive matrix $Q \in \mathbb{R}_{>0}^{n \times n}.$ The properties of the natural environment in which the population is evolving are captured by the matrix $Q$; see this textbook which is dedicated to the study of such dynamical systems. Mathematically, to start with, note that the coordinates of $f(x)$ sum to zero whenever $x \in \Delta^n,$ so, starting at any point in $\Delta^n,$ the trajectory remains in $\Delta^n.$ What are the fixed points of $f$? These are vectors $x \in \Delta^n$ such that $Qx=\Vert Qx \Vert_1 \cdot x,$ i.e., eigenvectors of $Q.$ Since $Q>0,$ the Perron-Frobenius theorem tells us that $Q$ has a unique eigenvector $v \in \Delta^n$ and, starting at any $x(0) \in \Delta^n,$ $x(t) \rightarrow v$ as $t \rightarrow \infty.$ Thus, in this case, simple linear algebra allows us to deduce that $f$ has exactly one fixed point and, thus, we can answer what $f$ is achieving globally: $f$ is nothing but nature's implementation of the Power Method to compute the maximum eigenvector of $Q$! Biologically, the corresponding eigenvalue can be shown to be the average fitness of the population, which is what nature is trying to maximize. It may be worthwhile to note that the maximum eigenvalue problem is non-convex as such.
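A quick simulation makes the claim tangible. The sketch below Euler-discretizes the dynamics for an arbitrary positive matrix (the matrix, step size and horizon are illustrative choices) and compares the limit with the Perron eigenvector computed by plain linear algebra.

```python
import numpy as np

def evolve(Q, x0, steps=2000, dt=0.1):
    """Euler discretization of dx/dt = Qx - ||Qx||_1 * x on the simplex."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + dt * (Q @ x - np.sum(Q @ x) * x)      # ||Qx||_1 = sum(Qx) since Qx > 0
    return x

rng = np.random.default_rng(0)
Q = rng.random((4, 4)) + 0.1                          # an arbitrary positive matrix
x = evolve(Q, np.ones(4) / 4)

vals, vecs = np.linalg.eig(Q)                         # compare with the Perron eigenvector
v = np.abs(vecs[:, np.argmax(vals.real)].real)
print(np.round(x, 4), np.round(v / v.sum(), 4))       # the two should (approximately) agree
```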

Solving Linear Programs by Molds?

Let us conclude with an interesting dynamical system inspired by the inner workings of a slime mold; see here for a discussion on how this class of dynamics was discovered. Suppose $A \in \mathbb{R}^{n \times m}$ is a matrix and $b \in \mathbb{R}^n$ is a vector. The domain is the positive orthant $\Omega = \{x \in \mathbb{R}^m: x>0 \}.$ For a point $x \in \mathbb{R}^m,$ let $X$ denote the diagonal matrix such that $X_{ii}=x_i.$ The evolution function is then

$$\frac{dx}{dt} = X \left( A^\top (AXA^\top)^{-1} b - \vec{1}\right),$$

where $\vec{1}$ is the vector of all ones. Now the problem of existence of a solution is neither trivial nor can it be ignored since, for the dynamical system to make sense, $x$ has to remain positive. Further, it can be argued in a formal sense that this dynamical system is not a gradient descent. What then can we say about its trajectories? As it turns out, it can be shown that starting at any $x>0,$ the dynamical system is a gradient descent on a natural Riemannian manifold and converges to a unique point among the solutions of the following linear program,

$$\min \; \sum_i x_i \quad \mathrm{s.t.} \quad Ax=b, \;\; x \geq 0,$$

which gives us a new algorithm for linear programming. We will explain how in a subsequent post.
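To see the claim in action, here is a rough Euler simulation of the dynamics on a tiny linear program. Everything here (the toy $A$ and $b$, the step size, the positivity floor used to guard against numerical undershoot) is an illustrative choice; the point is only that the iterate drifts toward the minimum-$\ell_1$ nonnegative solution of $Ax=b$.

```python
import numpy as np

def physarum_lp(A, b, x0, dt=0.01, steps=5000):
    """Euler discretization of dx/dt = X (A^T (A X A^T)^{-1} b - 1), x(0) > 0."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        X = np.diag(x)
        p = np.linalg.solve(A @ X @ A.T, b)            # (A X A^T)^{-1} b
        x = x + dt * x * (A.T @ p - 1.0)
        x = np.maximum(x, 1e-6)                        # keep x strictly positive
    return x

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])
x = physarum_lp(A, b, x0=np.ones(3))
print(np.round(x, 3))   # approaches [0, 1, 0]: min sum(x) s.t. Ax = b, x >= 0
```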

NIPS 2015 workshop on non-convex optimization


While convex analysis has received much attention by the machine learning community, theoretical analysis of non-convex optimization is still nascent. This blog as well as the recent NIPS 2015 workshop on non-convex optimization aim to accelerate research in this area. Along with Kamalika Chaudhuri, Percy Liang, Niranjan U Naresh, and Sewoong Oh, I was one of the organizers of this NIPS workshop.

The workshop started with my talk on recent progress and challenges. Many interesting machine learning tasks are non-convex, e.g. maximum likelihood estimation of latent variable models or training multi-layer neural networks. As the input dimension grows, the number of critical points can grow exponentially, making an analysis difficult. In contrast, strictly convex optimization has a unique critical point corresponding to the globally optimal solution.

Spectral and tensor methods

I gave an overview of instances where we can provide guarantees for nonconvex optimization. Examples include finding spectral decompositions of matrices and tensors, under a set of conditions. For more details on spectral methods, see Rong Ge’s post on this blog. Kevin Chen gave a compelling talk on the superiority of spectral methods in genomics for training hidden Markov models (HMM) compared to traditional approaches such as expectation maximization. In particular, spectral methods provided better biological interpretation, were more robust to class imbalance, and were orders of magnitude faster compared to EM.

Avoiding saddle points

Moving on to more general non-convex optimization, in my talk, I pointed out the difficulty in even converging to a local optimum due to the existence of saddle points. Saddle points are critical points which are not local minima, meaning there exist directions where the objective value decreases (for minimization problems). Saddle points can slow down gradient descent arbitrarily. Alternatively, if Newton’s method is run, it converges to an arbitrary critical point, and does not distinguish between a local minimum and a saddle point.

One solution to escape saddle points is to use the second order Hessian information to find the direction of escape when the gradient value is small: the Hessian eigenvectors with negative eigenvalues provide such directions of escape. See works here, here and here. A recent work surprisingly shows that it is possible to escape saddle points using only first order information based on noisy stochastic gradient descent (SGD). In many applications, this is far cheaper than (approximate) computation of the Hessian eigenvectors. However, one unresolved issue is handling degenerate saddle points, where there are only positive and zero eigenvalues in the Hessian matrix. For such points, even distinguishing saddle points from local optima is hard. It is also an open problem to establish the presence or absence of such degenerate saddle points for particular non-convex problems, e.g. in deep learning.
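A toy example (mine, not from the talks) illustrates the point on the classic strict saddle $f(x,y) = x^2 - y^2$, which has a saddle at the origin: plain gradient descent started on the saddle's stable manifold stalls there, while a little gradient noise lets the iterate slide off along the descending $y$ direction.

```python
import numpy as np

def grad(p):
    """Gradient of f(x, y) = x^2 - y^2 (saddle point at the origin)."""
    x, y = p
    return np.array([2 * x, -2 * y])

def descend(p, lr=0.1, steps=100, noise=0.0, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        p = p - lr * (grad(p) + noise * rng.standard_normal(2))
    return p

start = np.array([1e-8, 0.0])          # (almost) on the saddle's stable manifold
print(descend(start, noise=0.0))       # plain GD: converges to the saddle (0, 0)
print(descend(start, noise=0.1))       # noisy GD: |y| grows, the iterate escapes the saddle
```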

Optimization landscape for deep learning

Yann LeCun talked about how the success of deep learning has shown that non-convexity should not be treated as an obstacle. See slides here. An open problem is to understand the optimization landscape for deep learning. While classical works have shown the presence of bad local optima even in simple low-dimensional problems, a recent work argues that the picture is quite different in high dimensions, suggesting that the loss function for training deep neural networks can be approximated as random Gaussian polynomials under a set of (strong) assumptions. A beautiful math paper by Auffinger and Ben Arous characterizes the critical points of such random Gaussian polynomials, proving that the objective values of all local minima concentrate in a “narrow” band, and differ significantly from the values attained at saddle points. However, to add a word of caution, this does not imply that a random Gaussian polynomial is computationally easy to optimize. Current theory does not give a good characterization of the basins of attraction for the local optima, and requires exponential number of initializations to guarantee successful convergence. Moreover, degenerate saddle points can be present, and they are hard to escape from, as I discussed earlier.

Everything old is new again

Andrew Barron took us back in time, and described how he got interested in neural networks when he discovered that they were training 6-8 layer neural networks back in the 1960’s in his dad’s company. Andrew provided a glimpse of the classical results, which bound both the approximation and estimation errors for a shallow neural network (with one hidden layer). The approximation error is related to the Fourier spectrum of the target function: intuitively, a smoother signal has a lower amount of high frequency content, and has a better approximation error under the class of networks of a fixed size. The estimation bound gives the risk achieved under a finite amount of training data, but achieving this bound is computationally hard. Andrew then talked about his recent efforts to develop computationally efficient algorithms for training neural networks based on the use of generative models of the input. This builds on my recent work on training neural networks using tensor methods.

Do not accept the given non-convex problem as is

Sanjeev Arora argued that unlike traditional computer science theory which has focused on characterizing worst-case problem instances, machine learning offers a much more flexible framework. In fact, even the objective function can be changed! He focused on the problem of non-negative matrix factorization (NMF), where the original objective function of quadratic loss minimization is hard. On the other hand, under a so-called separability assumption, the objective can be changed as finding a simplex with non-negative vertices that contains the observations, which can be solved efficiently. A similar philosophy holds for spectral methods, where the original objective function is abandoned, and instead spectral decompositions are employed to solve the learning task at hand. Sanjeev also made the excellent point that the assumptions needed for success are often something we can control as data collectors. For instance, separability holds for the NMF problem when more features are collected, e.g. for learning latent variable PCFGs via NMF techniques.

In a similar spirit, Chris Re showed how we can strengthen the current theory to remove some of the pessimism behind the hardness of many problems that have good solutions in practice. He described the notion of combinatorial width or fractional hypertree width, a notion from logic, that can provide a better analysis of Gibbs sampling and belief propagation. Gregory Valiant showed that by assuming more structure on the class of probability distributions we can do much better compared to the unstructured setting. This includes topic models, hidden Markov models, and word embeddings.

Convex envelopes and Gaussian smoothing

Hossein Mobahi talked about smoothing approaches for convexification. The convex envelope of a function is the convex function that provides the tightest lower bound. Any global minimizer of the original objective function is also a global minimizer of the convex envelope. While we are mostly familiar with the characterization of convex envelopes through duality (i.e. dual of the dual is the convex envelope), its characterization through partial differential equation (PDE) has been mostly unknown to the machine learning community. Hossein introduced the PDE form for convex envelope, and gave a nice interpretation in terms of carrying out diffusion along non-convex regions of the objective function to obtain the convex envelope. However, the convergence time for this PDE can be in general exponential in the input dimension, and therefore is not tractable. Hossein showed that instead we can perform Gaussian smoothing efficiently for many functions such as polynomials. He gave a novel characterization of Gaussian smoothing as linearization of the convex envelope. He proved it by analyzing the PDE of the convex envelope, and then showing that its linearization results in the heat equation, which corresponds to Gaussian smoothing. Hossein also provided bounds for the continuation method, where the extent of smoothing is progressively decreased, and related it to the complexity of the objective function. For more details, refer to his paper.

Conclusion

NP-hardness should not deter us from analyzing non-convex optimization in the context of machine learning. We should be creative in making new assumptions, we should try to change the problem structure and the objective function, collect relevant data to make the learning problem easier, and so on.

As I conclude this blog post, I am reminded of my discussion with Leon Bottou. He said that there are three levels of thinking, with increasingly more sophistication. At the first level, we merely aim to prove statements that are already formulated. Unfortunately, almost all formal education focuses on this type of skill. At the second level, we try to draw implications, given a fixed set of assumptions. On the other hand, at the third level, we need to simultaneously come up with reasonable assumptions as well as their implications, and this level requires the highest level of creativity. In the area of machine learning, there is tremendous opportunity for such creativity, and I am hoping that the workshop managed to foster more of it.

Word Embeddings: Explaining their properties


This is a followup to an earlier post about word embeddings, which capture the meaning of a word using a low-dimensional vector, and are ubiquitous in natural language processing. I will talk about my joint work with Li, Liang, Ma, Risteski, which tries to mathematically explain their fascinating properties.

We focus on a few questions. (a) What properties of natural languages cause these low-dimensional embeddings to exist? (b) Why do low-dimensional embeddings work better at analogy solving than high dimensional embeddings?

In a future blog post I will address another question answered by our subsequent work: How should a word embedding be interpreted when a word has multiple meanings?

Why do low-dimensional embeddings capture huge statistical information?

Recall that all embedding methods try to leverage word co-occurence statistics. Latent Semantic Indexing does a low-rank approximation to word-word cooccurence probabilities. In the simplest version, if the dictionary has $N$ words (usually $N$ is about $10^5$) then find $v_1, v_2, \ldots, v_N \in \mathbb{R}^{300}$ that minimize the following expression, where $p(w, w')$ is the empirical probability that words $w, w'$ occur within $5$ words of each other in a text corpus like wikipedia. (Here “$5$” and “$300$” are somewhat arbitrary.)

$$\sum_{w, w'} \left(v_w \cdot v_{w'} - p(w, w')\right)^2 \qquad (1)$$

Of course, one can compute a rank-$300$ SVD for any matrix; the surprise here is that the rank $300$ matrix is actually a reasonable approximation to the $100,000$-dimensional matrix of cooccurences. (Since then, topic models have been developed which imply that the matrix is indeed low rank.) This success motivated many extensions of the above basic idea; see the survey on Vector space models. We're interested today in methods that perform nonlinear operations on word cooccurence probabilities. The simplest uses the old and popular PMI measure of Church and Hanks, where the probability $p(w,w')$ in expression (1) is replaced by the following (nonlinear) measure of correlation,

$$PMI(w, w') = \log \frac{p(w, w')}{p(w)\, p(w')}.$$

(And still the $10^5 \times 10^5$ matrix turns out to have a good low-rank approximation.)

Of course, researchers in applied machine learning take the existence of such low-dimensional approximations for granted, but there appears to be no theory to explain their existence. Theoretical explanations are also lacking for other recent methods such as Google’s word2vec.

Our paper gives an explanation using a new generative model for text, which also gives a clearer insight into the causative relationship between word meanings and the cooccurence probabilities. We think of corpus generation as a dynamic process, where the $t$-th word is produced at step $t$. The model says that the process is driven by the random walk of a discourse vector $c_t \in \Re^d$. It is a unit vector whose direction in space represents what is being talked about. Each word has a (time-invariant) latent vector $v_w \in \Re^d$ that captures its correlations with the discourse vector. We model this bias with a loglinear word production model:

$$\Pr[w \text{ emitted at time } t \mid c_t] \;\propto\; \exp(\langle c_t, v_w\rangle). \qquad (2)$$

The discourse vector does a slow geometric random walk over the unit sphere in $\Re^d$. Thus $c_{t+1}$ is obtained by a small random displacement from $c_t$. Since expression (2) places much higher probability on words that are clustered around $c_t$, and $c_t$ moves slowly, the model predicts that words occuring at successive time steps will also tend to have vectors that are close together. But this correlation weakens after say, $100$ steps. This model is basically the loglinear topic model of Mnih and Hinton, but with an added dynamic element in the form of a random walk. The model is also related to many existing notions like Kalman filters and linear chain CRFs. Also, as is usual in topic models, it ignores grammatical structure, and treats text in small windows as a bag of words.
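A small simulation of this generative model is easy to write down; the sketch below (with arbitrary dimensions, vocabulary size and step size) draws a slowly drifting discourse vector on the unit sphere and emits words with probability proportional to $\exp(\langle v_w, c_t\rangle)$.

```python
import numpy as np

def generate_corpus(word_vectors, T=1000, step=0.05, seed=0):
    """Sample T words from the random-walk model: c_t drifts slowly on the
    sphere and word w is emitted with probability ~ exp(<v_w, c_t>)."""
    rng = np.random.default_rng(seed)
    n, d = word_vectors.shape
    c = rng.standard_normal(d); c /= np.linalg.norm(c)
    words = []
    for _ in range(T):
        logits = word_vectors @ c
        p = np.exp(logits - logits.max())
        words.append(rng.choice(n, p=p / p.sum()))
        c += step * rng.standard_normal(d)            # small random displacement...
        c /= np.linalg.norm(c)                        # ...staying on the unit sphere
    return words

vecs = np.random.default_rng(1).standard_normal((500, 20)) / np.sqrt(20)
print(generate_corpus(vecs)[:10])
```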

Our main contribution is to use the model assumptions to derive closed form expressions for the word-word cooccurence probabilities in terms of the latent variables (i.e., the word vectors). This involves integrating out the random walk $c_t$. For this we need to make a theoretical assumption, which says intuitively that the bulk behavior of the set of all word vectors is similar to what it would be if they were randomly strewn around the conceptual space (this is counterintuitive to my linguistics colleagues because they are used to the existence of fine-grained structure in word meanings).

Isotropy assumption about word vectors: In the bulk, the word vectors behave like random vectors, for example, like $s \cdot u$ where $u$ is a standard Gaussian vector and $s$ is a scalar random variable. In particular, the partition function $Z_c = \sum_w \exp(v_w \cdot c)$ is approximately $Z \pm \epsilon$ for most unit vectors $c$.

We find that in practice the partition function is well-concentrated. After writing our paper we discovered that this phenomenon had been discovered already in empirical work on self-normalizing language models. (Basic message: Treat the partition function as constant; it doesn't hurt too much!)

The tight concentration of the partition function allows us to compute a multidimensional integral to obtain expressions for word probabilities:

$$\log p(w, w') = \frac{\Vert v_w + v_{w'}\Vert_2^2}{2d} - 2\log Z \pm \epsilon, \qquad \log p(w) = \frac{\Vert v_w \Vert_2^2}{2d} - \log Z \pm \epsilon,$$

so that $PMI(w, w') \approx \frac{\langle v_w, v_{w'}\rangle}{d}$.

Thus the model predicts that the PMI matrix introduced earlier is indeed low-dimensional. Furthermore, unlike previous models, low dimension plays a key role in the story: isotropy requires the dimension $d$ to be much smaller than $N$.

A theoretical “explanation” also follows for some other nonlinear models. For instance if we try to do a max-likelihood fit (MLE) to the above expressions, something interesting happens in the calculation: different word pairs need to be weighted differently. Suppose you see that $w, w'$ cooccur $X(w, w')$ times in the corpus. Then it turns out that your trust in this count as an estimate of the true value of $p(w, w')$ scales linearly with $X(w, w')$ itself. In other words the MLE fit to the model is

$$\min_{\{v_w\},\, C} \;\sum_{w, w'} X(w, w') \left(\log X(w, w') - \Vert v_w + v_{w'}\Vert_2^2 - C\right)^2.$$

This is very similar to the expression in the GloVe model, but provides some explanation for their mysterious bias and reweighting terms. Empirically we find that this model fits the data quite well: the weighted termwise error (without the square) is about $5$ percent. (The weighted termwise error for the PMI model is much worse, around $17\%$.)

A theoretical explanation can also be given for Google's word2vec model. Suppose we assume the random walk of the discourse vector is slow enough that $c_t$ is essentially unchanged while producing consecutive strings of $10$ words or more. Then the average of the word vectors for any consecutive $5$ words is a Max a Posteriori (MAP) estimate of the discourse vector $c_t$ that produced them. This leads to the word2vec(CBOW) model, which had hitherto seemed mysterious:

$$\Pr[w \mid w_1, \ldots, w_5] \;\propto\; \exp\left(v_w \cdot \left(\tfrac{1}{5}\textstyle\sum_{i=1}^{5} v_{w_i}\right)\right).$$

Why do low dimensional embeddings work better than high-dimensional ones?

A striking finding in empirical work on word embeddings is that there is a sweet spot for the dimensionality of word vectors: neither too small, nor too large. This graph below from the Latent Semantic Analysis paper (1997) shows the performance on word similarity tasks versus dimension, but a similar phenomenon also occurs for analogy solving.

[Figure: performance of word embeddings vs. dimension]

Such a performance curve with a bump at the “sweet spot” is very familiar in empirical work in machine learning and usually explained as follows: too few parameters make the model incapable of fitting the signal; too many parameters, and it starts overfitting (e.g., fitting to noise instead of the signal). Thus the dimension constraint acts as a regularizer for the optimization.

Surprisingly, I have not heard of a good theoretical explanation to back up this intuition. Here are some attempted explanations I heard from colleagues in connection with word embeddings.

Suggestion 1: The Johnson-Lindenstrauss Lemma implies some dimension reduction for every set of vectors. This explanation doesn't cut it because: (a) it only implies dimension $\frac{1}{\epsilon^2}\log N$, which is too high for even moderate $\epsilon$; (b) it predicts that the quality of the embedding goes up monotonically as we increase dimension, whereas in practice overfitting is observed.

Suggestion 2: Standard generalization theory (e.g., VC-dimension) predicts overfitting. I don't see why this applies either, since we are dealing with unsupervised learning (or transfer learning): the training objective doesn't have anything to do a priori with analogy solving. So there is no reason a model with fewer parameters will do better on analogy solving, just as there's no reason it does better for some other unrelated task like predicting the weather.

We give some idea why a low-dimensional model may solve analogies better. This is also related to the following phenomenon.

Why do Semantic Relations correspond to Directions?

Remember the striking discovery in the word2vec paper: word analogy tasks can be solved by simple linear algebra. For example, the word analogy question man : woman :: king : ?? can be solved by looking for the word $w$ such that $v_{king} - v_w$ is most similar to $v_{man} - v_{woman}$; in other words, the word $w$ that minimizes

$$\Vert v_{man} - v_{woman} - v_{king} + v_{w}\Vert_2^2. \qquad (3)$$

This strongly suggests that semantic relations —in the above example, the relation is masculine-feminine—correspond to directions in space. However, this interpretation is challenged by Levy and Goldberg who argue there is no linear algebra magic here, and the expression can be explained simply in terms of traditional connection between word similarity and vector inner product (cosine similarity). See also this related blog post.

We find on the other hand that the RELATIONS = DIRECTIONS phenomenon is demonstrable empirically, and is particularly clear for semantic analogies in the testbed. For each relation $R$ we can find a direction $\mu_R$ such that if a word pair $a, b$ satisfies $R$, then

$$v_a - v_b = \alpha_{a,b}\, \mu_R + \eta,$$

where $\alpha_{a, b}$ is a scalar that’s roughly about $0.6$ times the norm of $v_a - v_b$ and $\eta$ is a noise vector. Empirically, the residuals $\eta$ do look mathematically like random vectors according to various tests.

In particular, this phenomenon allows the analogy solver to be made completely linear algebraic if you have a few examples (say 20) of the same relation. You compute the top singular vector of the matrix of all $v_a -v_b$’s to recover $\mu_R$, and from then on can solve analogies $a: b:: c:??$ by looking for a word $d$ such that $v_c - v_d$ has the highest possible projection on $\mu_R$ (thus ignoring $v_a -v_b$ altogether). In fact, this gives a “cheating” method to solve the analogy test bed with somewhat higher success rates than state of the art. (Message to future designers of analogy testbeds: Don’t include too many examples of the same relationship, otherwise this cheating method can exploit it.) By the way, Lisa Lee did a senior thesis under my supervision that showed empirically that this phenomenon can be used to extend knowledge-bases of facts, e.g., predict new music composers in the corpus given a list of known composers.
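Here is a sketch of that "cheating" solver; the synthetic sanity check at the end (difference vectors sharing a common direction plus a little noise) is made up purely to exercise the code, and real embeddings and testbeds are of course needed to reproduce the success rates mentioned above.

```python
import numpy as np

def relation_direction(pairs, vecs):
    """Estimate mu_R as the top right singular vector of the matrix whose rows
    are v_a - v_b over example pairs (a, b) of the relation R."""
    D = np.array([vecs[a] - vecs[b] for a, b in pairs])
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    mu = Vt[0]
    if (D @ mu).mean() < 0:        # fix the sign so example pairs project positively
        mu = -mu
    return mu

def solve_with_direction(mu, c, vecs):
    """Answer a : b :: c : ? by picking d maximizing the projection of v_c - v_d on mu."""
    scores = {d: np.dot(vecs[c] - v, mu) for d, v in vecs.items() if d != c}
    return max(scores, key=scores.get)

# synthetic sanity check: difference vectors = 0.6 * mu_true + small noise
rng = np.random.default_rng(0)
mu_true = rng.standard_normal(20); mu_true /= np.linalg.norm(mu_true)
vecs, pairs = {}, []
for i in range(20):
    base = rng.standard_normal(20)
    vecs[f"a{i}"] = base + 0.6 * mu_true + 0.05 * rng.standard_normal(20)
    vecs[f"b{i}"] = base
    pairs.append((f"a{i}", f"b{i}"))
mu = relation_direction(pairs, vecs)
print(abs(np.dot(mu, mu_true)))    # close to 1: the relation direction is recovered
```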

Our theoretical results can be used to explain the emergence of the RELATIONS=DIRECTIONS phenomenon in the embeddings. Earlier attempts (e.g., in the GloVe paper) to explain the success of (3) for analogy solving had failed to account for the fact that all models are only approximate fits to the data. For example, the PMI model fits $v_w \cdot v_{w'}$ to $PMI(w, w')$, but the termwise error for our corpus is $17\%$, and expression (3) contains $6$ inner products! So even though expression (3) is presumably a linear algebraic proxy for some statistical property of the word distributions, the noise/error is large. By contrast, the difference in the value of (3) between the best and second-best solution is small, say $10$-$15\%$.

So the question is: why does error in the approximate fit not kill the analogy solving? Our account of the RELATIONS = DIRECTIONS phenomenon provides an answer: the low dimension of the vectors has a “purifying” effect that reduces the effect of this fitting error. (See Section 4 in the paper.) The key ingredient of this explanation is, again, the random-like behavior of word embeddings —quantified in terms of singular values— as well as the standard theory of linear regression. I'll describe the math in a future post.

Evolution, Dynamical Systems and Markov Chains


In this post we present a high level introduction to evolution and to how we can use mathematical tools such as dynamical systems and Markov chains to model it. Questions about evolution then translate to questions about dynamical systems and Markov chains – some are easy to answer while others point to gaping holes in current techniques in algorithms and optimization. In particular, in this post, we present a setting which captures the evolution of viruses and formulate the question How quickly could evolution happen? This question is not only relevant for the feasibility of drug-design strategies to counter viruses, it also leads to non-trivial questions in computer science.

Just 4 Billion Years…

Starting with the pioneering work of Darwin and Wallace, over the last two centuries there have been tremendous scientific and mathematical advances in our understanding of evolution and how it has shaped diverse and complex life – in a matter of just four billion years. However, unlike physics where the laws seem to be consistent across the universe, evolution is quite complex and its governing dynamics can depend on the context – if we look closely enough, the evolution of life forms such as viruses is quite different from that of humans. Thus, the theory of evolution is not a succinct one; there is vagueness for those who seek mathematical clarity and, for sure, you should not expect one post to explain all of its various aspects! Instead, we will introduce the basic apparatus of evolution, focus on a concrete setting which has been used to model the evolution of viruses, and ask questions concerning the efficiency of such an evolution.

Evolution in a Nutshell

Abstractly, we can view evolution as nothing but a mechanism (or a meta-algorithm) that takes a population (which is capable of reproducing) as an input and outputs the next generation. At any given time, the population is composed of individuals of different types. As this is all happening in an environment in which resources are limited, who is selected to be a part of the next generation and who is not is determined by the fitness of a type in the environment. The reproduction could be asexual (a simple act of cloning) or sexual (involving the combination of two or more individuals to produce offspring). Moreover, during reproduction there could be mutations that transform one type into another. Each of the reproduction, selection or mutation steps could be deterministic or stochastic, making evolution either a deterministic or a randomized function of the input population.

The size of the population, the number of types, the fitness of each type in the environment, the probabilities of mutation and the starting state are the parameters of the model. Typically, one fixes these parameters and studies how the population evolves over time – whether it reaches a limiting or a steady state and, if so, how this limiting state varies with the parameters of the model and how quickly the limiting state is reached.

After all, evolution without a notion of efficiency is an incomplete theory.

An important and different take on this question is Leslie Valiant’s work on using computational learning theory to understand evolution quantitatively. Finally, as you might have guessed by now, in such generality, evolution encompasses processes which have a priori nothing to do with biology; indeed, evolutionary models have been used to understand many social, economical and cultural phenomena, as described in this book by Nowak.

Populations: Infinite = Dynamical System, Finite = Markov Chain

Given an evolutionary model (which could include stochastic steps), as a first step to understanding it we typically assume that the population is infinite and hence all steps are effectively deterministic; we will see an example soon. This allows the evolution of the fraction of each type in the population to be modeled as a deterministic dynamical system from a probability simplex (denoted by $\Delta_m$) to itself; here $m$ is the number of types. However, real populations are finite and often lend themselves to substantial stochastic effects such as random genetic drift. In order to understand their limiting behavior as a function of the population size, we can neither assume that the population is infinite nor ignore the stochasticity in the steps of evolution. Hence, one appeals to Markov chains in order to study finite populations. To be concrete, we move on to describing a deterministic and a stochastic model for error-prone evolution of an asexual population.

A Deterministic, Infinite Population Model

Consider an infinite population composed of individuals each of whom could be one of $m$ types. An individual of type $i$ has a fitness which is specified by a positive integer $a_i,$ and we use an $m \times m$ diagonal matrix $A$ whose $(i,i)$th entry is $a_i$ to capture it. The reproduction is error-prone and this is captured by an $m\times m$ stochastic matrix $Q$ whose $(i,j)$th entry captures the probability that the $j$th type will mutate to the $i$th type during reproduction. In the reproduction stage each type $i$ in the current population produces $a_i$ copies of itself. During reproduction, mutations might occur and, in our deterministic model, we assume that one unit of type $j$ gives rise to a $Q_{i,j}$ fraction of type $i.$ Since the total mass could become more than one due to reproduction, in the selection stage we normalize the mass so that it is again of unit size.

Thus, the fitness of a type influences its representation in the selected population. Mathematically, we can then track the fraction of each type at step $t$ of the evolution by a vector ${x}^{(t)}\in \Delta_m$ whose evolution is then governed by the dynamical system $ {x}^{(t+1)} = \frac{QA {x}^{(t)}}{\Vert QA {x}^{(t)}\Vert_1}.$ (This is one of the dynamical systems we considered in a previous post.) Thus, the eventual fate of the evolutionary process is not a single type, rather an invariant distribution over types. We saw that when $QA>0$, there is a unique fixed point of this dynamical system: the (normalized) right eigenvector of $QA$ corresponding to its largest eigenvalue. Thus, no matter where one starts, this dynamical system converges to this fixed point. Biologically, the corresponding eigenvalue can be shown to be the average fitness of the population which is, in effect, what is being maximized.

How quickly? Well, elementary linear algebra tells us that the rate of convergence of this process is governed by the ratio of the second largest to the largest eigenvalue of $QA.$ Finally, we note that the dynamical system corresponding to a sexually reproducing population is not hard to describe and has been studied recently from an optimization point of view.
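To make this concrete, here is a minimal numerical sketch (not from the original posts) of the iteration $x^{(t+1)} = QAx^{(t)}/\Vert QAx^{(t)}\Vert_1$ for a small, randomly chosen mutation matrix $Q$ and fitness matrix $A$; it checks that the iterates indeed approach the normalized leading right eigenvector of $QA$.

```python
# A minimal sketch of the infinite-population dynamics; Q and A are made up for illustration.
import numpy as np

m = 4
rng = np.random.default_rng(0)

A = np.diag(rng.integers(1, 10, size=m).astype(float))  # diagonal fitness matrix
Q = rng.random((m, m))
Q /= Q.sum(axis=0, keepdims=True)                        # column-stochastic mutation matrix

x = np.full(m, 1.0 / m)                                  # start from the uniform distribution
for _ in range(200):
    y = Q @ A @ x
    x = y / y.sum()                                      # selection = renormalization to the simplex

# The fixed point should be the leading right eigenvector of QA, normalized to sum to 1.
vals, vecs = np.linalg.eig(Q @ A)
v = np.real(vecs[:, np.argmax(np.real(vals))])
v = np.abs(v) / np.abs(v).sum()
print(np.allclose(x, v, atol=1e-6))                      # True (up to numerical precision)
```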

A Stochastic, Finite Population Model

Consider now a stochastic, finite population version of the evolutionary dynamics described above. Here, the population is again assumed to be asexual but now it has a fixed finite size $N.$ After normalization, the composition of the population is again captured by a point in $\Delta_m$ say $ {X}^{(t)}$ at time $t.$ How does one generate ${X^{(t+1)}}$ in this model when the parameters are described by the matrices $Q$ and $A$ as in the infinite population setting? In the reproduction stage, one first replaces an individual of type $i$ in the current population by $a_i$ individuals of type $i$: the total number of individuals of type $i$ in the intermediate population is therefore $a_iN {X_i}^{(t)}$. In the mutation stage, each individual in this intermediate population mutates independently and stochastically according to the matrix $Q.$ Finally, in the selection stage, the population is culled back to size $N$ by sampling $N$ individuals from this intermediate population.

Each of these steps is depicted in Figure 2. Note that stochasticity necessarily means that, even if we initialize the system in the same way, different runs of the chain could produce very different outcomes. The vector $X^{(t+1)}$ then is the normalized frequency vector of the resulting population. The state space of the Markov chain described above has size $\binom{N+m-1}{m-1}.$ When $QA>0,$ this Markov chain is ergodic and, hence, has a unique steady state. However, unlike the deterministic case, this steady state is not a priori easy to compute. Certainly, it has no closed form expression except in the most trivial cases. How do we compute it?
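For concreteness, here is a minimal sketch (my own, with made-up conventions; the three stages follow the description above) of one step of the finite-population chain: reproduce according to fitness, mutate each individual independently according to $Q$, and cull the intermediate population back to size $N$ by sampling. Sampling with replacement is a simplifying choice here.

```python
# One step of the finite-population evolutionary Markov chain (a simplified sketch).
import numpy as np

def evolve_one_step(X, Q, a, N, rng):
    """X: frequency vector in the simplex; Q: column-stochastic mutation matrix;
    a: integer fitnesses (array of length m); N: population size."""
    m = len(a)
    counts = np.round(N * X).astype(int)                           # individuals of each type
    pool = np.repeat(np.arange(m), a * counts)                     # reproduction: a_i copies per type-i individual
    mutated = np.array([rng.choice(m, p=Q[:, j]) for j in pool])   # independent mutation of each individual
    survivors = rng.choice(mutated, size=N, replace=True)          # selection: cull back to size N
    return np.bincount(survivors, minlength=m) / N                 # the next state X^{(t+1)}
```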

The Mixing Time

The number of states grows roughly like $N^m$ (when $m$ is small compared to $N$) and, even for a small constant $m=40$ and a population of size $10,000$, the number of states is more than $2^{300}$ – more than the number of atoms in the universe! Thus, at best, we can hope to have an algorithm that samples from close to the steady state. In fact, noting that each step of the Markov chain can be implemented efficiently, evolution already provides an algorithm. Its efficiency, however, depends on the time it takes to reach close to steady state – its mixing time. However, in general, there is no way to proclaim that a Markov chain has reached close to its steady state other than providing a bound (along with a proof) on the mixing time. Proving bounds on mixing times of Markov chains is an important area in computer science which interfaces with a variety of other disciplines such as statistics, statistical physics and machine learning; see here. In evolution, however, the mixing time is important beyond computing statistics of samples from the steady state: it tells us how quickly a steady state could be reached. This has biological significance as we will momentarily see in applications of this model to viral evolution.

Viral Evolution and Drug Design: The importance of mixing time

The Markov chain described above has recently found use in modeling RNA viral populations which reproduce asexually and that show strong stochastic behavior (e.g., HIV-1, see here), which in turn has guided drug and vaccine design strategies.

For example, the effective population size of HIV-1 in an infected individual is approximately $10^3-10^6$, not big enough for us to use infinite population models.

Let us see, again at a high level, how. RNA viruses, due to their primitive copying mechanisms, often undergo mutations during reproduction. Mutations introduce genetic variation and the population at any time is composed of different types – some of them being highly effective (in capturing the host cell) and some not so much. A typical situation to keep in mind is when the number of effective types is a relatively small fraction of $m.$ For the sake of simplicity, let us assume that we are in the setting where each type mutates to every other type with probability $\tau$ during reproduction and remains unchanged with probability $1-\tau (m-1).$ Thus, as $\tau$ goes from $0$ to $1/m,$ intuitively, in the steady state, the composition of the viral population goes from being concentrated on the effective types to uniformly distributed over all types. The population as a whole is effective if most of its mass in the steady state is concentrated around the effective types, and we can declare it dead if its mass is spread over all types roughly uniformly.

Eigen, in a pioneering work, observed that in fact there is a critical mutation rate called the error threshold around which there is a phase transition– i.e., the virus population changes suddenly from being highly effective to dead.

(This observation was proven formally here.) This suggests a strategy to counter viruses: drive their mutation rate past their error threshold! Intriguingly, this strategy is already employed by the body, which can produce antibodies that increase the mutation rate. Artificially, this effect can also be accomplished by mutagenic drugs such as ribavirin; see here and here. In this setting, knowing the error threshold with high precision is critical: inducing the body with excess mutagenic drugs could have undesired ramifications that lead to complications such as cancer, whereas increasing the rate while keeping it below the threshold can increase the fitness of the virus by allowing it to adapt more effectively, making it more lethal. Computing the error threshold requires the knowledge of the steady state and, thus, is one place where a bound on the mixing time is required. Further, when modeling the effect of a mutagenic drug, the convergence rate determines the minimum required duration of treatment.

If the virus population does not reach its steady state in the lifetime of the infected patient, then what good is that?

To Conclude…

We hope that through this example we have convinced you that efficiency is an important consideration in evolution. Specifically, in the setting we presented, the knowledge of the mixing time of evolutionary Markov chains is a crucial question. Despite its importance, there has been a lack of rigorous mixing time bounds for the full range of parameters, even in the simplest of evolutionary models considered here. Prior work has either ignored mutation, assumed that the model is neutral (i.e., types have the same fitness), or moved to the diffusion limit which requires both mutation and selection pressure to be weak. These bounds apply in some special subcases of evolution and we would like to know mixing time bounds that work for all parameters. In a sequence of results available here, here and here, we have shown that a wide class of evolutionary Markov chains (which includes the one described in this post) can mix quickly for all parameter settings as long as the population is large enough! Further, trying to analyze them has led to new techniques to analyze mixing time of Markov chains and stochastic processes which might be important beyond evolution. We will explain some of these techniques in a subsequent post and continue our discussion, more generally, on evolution viewed from the lens of efficiency.

Stability as a foundation of machine learning


Central to machine learning is our ability to relate how a learning algorithm fares on a sample to its performance on unseen instances. This is called generalization.

In this post, I will describe a purely algorithmic approach to generalization. The property that makes this possible is stability. An algorithm is stable, intuitively speaking, if its output doesn’t change much if we perturb the input sample in a single point. We will see that this property by itself is necessary and sufficient for generalization.

Example: Stability of the Perceptron algorithm

Before we jump into the formal details, let’s consider a simple example of a stable algorithm: the Perceptron, aka stochastic gradient descent for learning linear separators! The algorithm aims to separate two classes of points (here circles and triangles) with a linear separator. The algorithm starts with an arbitrary hyperplane. It then repeatedly selects a single example from its input set and updates its hyperplane using the gradient of a certain loss function on the chosen example. How badly might the algorithm screw up if we move around a single example? Let’s find out.

The animation shows two runs of the Perceptron algorithm for learning a linear separator on two data sets that differ in the one point marked green in one data set and purple in the other. The perturbation is indicated by an arrow. The shaded green region shows the difference in the resulting two hyperplanes after some number of steps.

As we can see by clicking impatiently through the example, the algorithm seems pretty stable. Even if we substantially move the first example it encounters, the hyperplane computed by the algorithm changes only slightly. Neat. (You can check out the code here.)
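The linked code is the post’s own demo; the following is only a rough stand-in sketch of the same experiment (the data, loss, and step size are my own choices): run identical SGD/Perceptron updates on two samples that differ in a single point and measure how far apart the two hyperplanes end up.

```python
# A minimal stability experiment for the Perceptron (a sketch, not the post's code).
import numpy as np

def sgd_halfspace(X, y, lr=0.1, epochs=5, seed=0):
    """SGD on the perceptron loss; returns the final weight vector."""
    rng = np.random.default_rng(seed)            # same seed => same order of examples in both runs
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            if y[i] * (X[i] @ w) <= 0:           # misclassified: take a gradient step
                w += lr * y[i] * X[i]
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=200))

X2 = X.copy()
X2[0] += np.array([2.0, 2.0])                    # perturb a single example

w1, w2 = sgd_halfspace(X, y), sgd_halfspace(X2, y)
print(np.linalg.norm(w1 - w2), np.linalg.norm(w1))   # the change is typically small relative to ||w1||
```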

Empirical risk jargon

Let’s introduce some terminology to relate the behavior of an algorithm on a sample to its behavior on unseen instances. Imagine we have a sample $S=(z_1,\dots,z_n)$ drawn i.i.d. from some unknown distribution $D$. There’s a learning algorithm $A(S)$ that takes $S$ and produces some model (e.g., the hyperplane in the above picture). To quantify the quality of the model we crank out a loss function $\ell$ with the idea that $\ell(A(S), z)$ describes the loss of the model $A(S)$ on one instance $z$. The empirical risk or training error of the algorithm is defined as:

$R_S = \frac{1}{n} \sum_{i=1}^n \ell(A(S), z_i).$
This captures the average loss of the algorithm on the sample on which it was trained. To quantify out-of-sample performance, we define the risk of the algorithm as:

$R = \mathbb{E}_{z\sim D}\left[\ell(A(S), z)\right].$
The difference between risk and empirical risk $R - R_S$ is called generalization error. You will sometimes encounter that term as a synonym for risk, but I find that confusing. We already have a perfectly short and good name for the risk $R$. Always keep in mind the following tautology

$R = R_S + (R - R_S).$
Operationally, it states that if we manage to minimize empirical risk all that matters is generalization error.

A fundamental theorem of machine learning

I probably shouldn’t propose fundamental theorems for anything really. But if I had to, this would be the one I’d suggest for machine learning:

In expectation, generalization equals stability.

Somewhat more formally, we will encounter a natural measure of stability, denoted $\Delta$ such that the difference between risk and empirical risk in expectation equals $\Delta.$ Formally,

$\mathbb{E}[R - R_S] = \Delta$

Deferring the exact definition of $\Delta$ to the proof, let’s think about this for a second. What I find so remarkable about this theorem is that it turns a statistical problem into a purely algorithmic one: All we need for generalization is an algorithmic notion of robustness. Our algorithm’s output shouldn’t change much if we perturb one of the data points. It’s almost like a sanity check. Had you coded up an algorithm and this wasn’t the case, you’d probably go look for a bug.

Proof

Consider two data sets of size $n$ drawn independently of each other: $S = (z_1,\dots,z_n)$ and $S' = (z_1',\dots,z_n').$ The idea of taking such a ghost sample $S'$ is quite old and already arises in the context of symmetrization in empirical process theory. We’re going to couple these two samples in one point by defining $S^i = (z_1,\dots,z_{i-1},z_i',z_{i+1},\dots,z_n)$ for $i = 1,\dots, n.$ It’s certainly no coincidence that $S$ and $S^i$ differ in exactly one element. We’re going to use this in just a moment.

By definition, the expected empirical risk equals

$\mathbb{E}[R_S] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n \ell(A(S), z_i)\right].$
Contrasting this to how the algorithm fares on unseen examples, we can rewrite the expected risk using our ghost sample as:

$\mathbb{E}[R] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n \ell(A(S), z_i')\right].$
All expectations we encounter are over both $S$ and $S'$. By linearity of expectation, the difference between expected risk and expected empirical risk equals

$\mathbb{E}[R - R_S] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\ell(A(S), z_i') - \ell(A(S), z_i)\right].$
It is tempting now to relate the two terms inside the expectation to the stability of the algorithm. We’re going to do exactly that using mathematics’ most trusted proof strategy: pattern matching. Indeed, since $z_i$ and $z_i'$ are exchangeable, we have

$\mathbb{E}\left[\ell(A(S), z_i')\right] = \mathbb{E}\left[\ell(A(S^i), z_i)\right] = \mathbb{E}\left[\ell(A(S), z_i)\right] + \delta_i,$
where $\delta_i$ is defined to make the second equality true:

$\delta_i = \mathbb{E}\left[\ell(A(S^i), z_i)\right] - \mathbb{E}\left[\ell(A(S), z_i)\right].$
Summing up $\Delta = (1/n)\sum_i \delta_i$, we have

$\mathbb{E}[R - R_S] = \frac{1}{n}\sum_{i=1}^n \delta_i = \Delta.$
The only thing left to do is to interpret the right hand side in terms of stability. Convince yourself that $\delta_i$ measures how differently the algorithm behaves on two data sets $S$ and $S’$ that differ in only one element.

Uniform stability

It can be difficult to analyze the expectation in the definition of $\Delta$ precisely. Fortunately, it is often enough to resolve the expectation by upper bounding it with suprema:

$\Delta \le \sup_{S,\, S^i,\, z}\ \left|\ell(A(S), z) - \ell(A(S^i), z)\right|.$
The supremum runs over all valid data sets differing in only one element and all valid sample points $z$. This stronger notion of stability, called uniform stability, goes back to a seminal paper by Bousquet and Elisseeff.

I should say that you can find the above proof in the essssential stability paper by Shalev-Shwartz, Shamir, Srebro and Sridharan here.

Concentration from stability

The theorem we saw shows that expected empirical risk equals risk up to a correction that involves the stability of the algorithm. Can we also show that empirical risk is close to its expectation with high probability? Interestingly, we can by appealing to stability once again. I won’t spell out the details, but we can use the method of bounded differences to obtain strong concentration bounds. To apply the method we need a bounded difference condition which is just another word for stability. So, we’re really killing two birds with one stone by using stability not only to show that the first moment of the empirical risk is correct but also that it concentrates. The only wrinkle is that, as far as I know, the weak stability notion expressed by $\Delta$ is not enough to get concentration, but uniform stability (for sufficiently small difference) will do.

Applications of stability

There is much more that stability can do for us. We’ve only scratched the surface. Here are some of the many applications of stability.

  • Regularization implies stability. Specifically, the minimizer of the empirical risk subject to an $\ell_2$-penalty is uniformly stable.

  • Stochastic gradient descent is stable provided that we don’t take too many steps.

  • Differential privacy is nothing but a strong stability guarantee. Any result ever proved about differential privacy is fundamentally about stability.

  • Differential privacy in turn has applications to preventing overfitting in adaptive data analysis.

  • Stability also has many beautiful applications and connections in statistics. I strongly encourage you to read Bin Yu’s beautiful overview paper on the topic.

Looking ahead, I’ve got at least two more posts planned on this.

In my next post I will go into the stability of stochastic gradient descent in detail. We will see a simple argument to show that stochastic gradient descent is uniformly stable. I will then work towards applying these ideas to the area of deep learning. We will see that stability can help us explain why even huge models sometimes generalize well and how we can make them generalize even better.

In a second post I will reflect on stability as a paradigm for reliable machine learning. The focus will be on how ideas from stability can help avoid overfitting and false discovery.


Escaping from Saddle Points


Convex functions are simple — they usually have only one local minimum. Non-convex functions can be much more complicated. In this post we will discuss various types of critical points that you might encounter when you go off the convex path. In particular, we will see that in many cases simple heuristics based on gradient descent can lead you to a local minimum in polynomial time.

Various Types of Critical Points

Local Minimum, Local Maximum and Saddle Point

To minimize the function $f:\mathbb{R}^n\to \mathbb{R}$, the most popular approach is to follow the opposite direction of the gradient $\nabla f(x)$ (for simplicity, all functions we talk about are infinitely differentiable), that is,

$y = x - \eta \nabla f(x).$
Here $\eta$ is a small step size. This is the gradient descent algorithm.

Whenever the gradient $\nabla f(x)$ is nonzero, as long as we choose a small enough $\eta$, the algorithm is guaranteed to make local progress. When the gradient $\nabla f(x)$ is equal to $\vec{0}$, the point is called a critical point, and gradient descent algorithm will get stuck. For (strongly) convex functions, there is a unique critical point that is also the global minimum.

However, for non-convex functions, just having the gradient to be $\vec{0}$ is not good enough. A simple example is the function

$y = x_1^2 - x_2^2.$
At $x = (0,0)$, the gradient is $\vec{0}$, but it is clearly not a local minimum as $x = (0, \epsilon)$ has smaller function value. The point $(0,0)$ is called a saddle point of this function.

To distinguish these cases we need to consider the second order derivative $\nabla^2 f(x)$ — an $n\times n$ matrix (usually known as the Hessian) whose $i,j$-th entry is equal to $\frac{\partial^2}{\partial x_i \partial x_j} f(x)$. When the Hessian is positive definite (which means $u^\top\nabla^2 f(x) u > 0$ for any $u\ne 0$), by the second order Taylor expansion, for any direction $u$,

$f(x+\eta u) \approx f(x) + \frac{\eta^2}{2}\, u^\top \nabla^2 f(x)\, u > f(x),$

therefore $x$ must be a local minimum. Similarly, when the Hessian is negative definite, the point is a local maximum; when the Hessian has both positive and negative eigenvalues, the point is a saddle point.

It is believed that for many problems including learning deep nets, almost all local minima have function values very similar to the global optimum, and hence finding a local minimum is good enough. However, it is NP-hard to even find a local minimum (see the discussion in Anandkumar, Ge 2016). Many popular optimization techniques in practice are first order optimization algorithms: they only look at the gradient information, and never explicitly compute the Hessian. Such algorithms may get stuck at saddle points.

In the rest of the post, we will first see that getting stuck at saddle points is a very realistic possibility since most natural objective functions have exponentially many saddle points. We will then discuss how optimization algorithms can try to escape from saddle points.

Symmetry and Saddle Points

Many learning problems can be abstracted as searching for $k$ distinct components (sometimes called features, centers,…). For example, in the clustering problem, there are $n$ points, and we are searching for $k$ components that minimize the sum of distances of points to their nearest center. In a two-layer neural network, we try to find a network with $k$ distinct neurons at the middle layer. In my previous post I talked about tensor decomposition, which also looks for $k$ distinct rank-1 components.

A popular way to solve these problems is to design an objective function: let $x_1, x_2, \ldots, x_k \in \mathbb{R}^n$ denote the desired centers and let the objective function $f(x_1,…,x_k)$ measure the quality of the solution. The function is minimized when the vectors $x_1,x_2,…,x_k$ are the $k$ components that we are looking for.

A natural reason why any such problem is inherently non-convex is permutation symmetry. For instance, if we swap the order of the first and second components, the solutions are equivalent. Namely,

$f(x_1, x_2, \ldots, x_k) = f(x_2, x_1, \ldots, x_k).$
However, if we take the average of these two equivalent solutions, we will end up with the solution $\frac{x_1+x_2}{2}, \frac{x_1+x_2}{2}, x_3,…,x_k$, which need not be equivalent! If the original solution is optimal, this average is likely to be suboptimal. Therefore the objective function cannot be convex, because for convex functions the average of optimal solutions is still optimal.


There are exponentially many globally optimal solutions that are all permutations of the same solution. Saddle points arise naturally on the paths that connect these isolated local minima. The figure below shows the function $y = x_1^4-2x_1^2 + x_2^2$: between the two symmetric local minima $(-1,0)$ and $(1,0)$, the point $(0,0)$ is a saddle point.

(Figure: symmetry and saddle points.)

Escaping from Saddle Points

In order to optimize these non-convex functions with many saddle points, optimization algorithms need to make progress even at (or near) saddle points. The simplest way to do this is by using the second order Taylor expansion:

$f(y) \approx f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{1}{2}\, (y-x)^\top \nabla^2 f(x)\, (y-x).$
If the gradient $\nabla f(x)$ is $\vec{0}$, we can still hope to find a vector $u$ where $u^\top \nabla^2 f(x) u < 0$. This way if we let $y = x+\eta u$, the function value of $f(y)$ is likely to be smaller. Many optimization algorithms such as trust region algorithms and cubic regularization use this idea, and they can escape from saddle points in polynomial time for nice functions.
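As a small illustration (my own sketch, not code from the papers cited above), the negative-curvature direction used by such second order methods can be read off from the Hessian’s eigendecomposition:

```python
# Finding an escape direction u with u^T H u < 0 from the Hessian (a sketch).
import numpy as np

def escape_direction(hessian):
    """Return a unit vector of most negative curvature, or None if none exists."""
    vals, vecs = np.linalg.eigh(hessian)     # eigenvalues in ascending order
    if vals[0] < 0:
        return vecs[:, 0]                    # eigenvector of the most negative eigenvalue
    return None

# Example: the saddle of y = x1^2 - x2^2 at the origin has Hessian diag(2, -2).
H = np.diag([2.0, -2.0])
print(escape_direction(H))                   # +/- e2: moving along x2 decreases the function
```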

Strict Saddle Functions

As we discussed, in general it is NP-hard to find a local minimum and many algorithms may get stuck at a saddle point. How many steps do we need to escape from a saddle point? This is related to how well-behaved the saddle points are. Intuitively, a saddle point $x$ is well-behaved if there is a direction $u$ such that the second order term $u^\top \nabla^2 f(x) u$ is significantly smaller than 0 — geometrically this means there is a steep direction in which the function value decreases. To quantify this, my paper with Furong Huang, Chi Jin and Yang Yuan introduced the notion of strict saddle functions (also known as “ridable” functions in Sun et al. 2015).

A function $f(x)$ is strict saddle if all points $x$ satisfy at least one of the following
1. Gradient $\nabla f(x)$ is large.
2. Hessian $\nabla^2 f(x)$ has a negative eigenvalue that is bounded away from 0.
3. Point $x$ is near a local minimum.

Essentially, the local region of every point $x$ looks like one of the following pictures:

(Figure: the three cases of the strict saddle condition.)

For such functions, trust region algorithms and cubic regularization can find a local minimum efficiently.

Theorem(Informal) There are polynomial time algorithms that can find a local minimum of strict saddle functions.

What functions are strict saddle? Ge et al. 2015 showed a tensor decomposition problem is strict saddle. Sun et al. 2015 observed that problems like complete dictionary learning, phase retrieval are also strict saddle.

First Order Method to Escape from Saddle Points

Trust region algorithms are very powerful. However they need to compute the second order derivative of the objective function, which is often too expensive in practice. If the algorithm can only access the gradient of the function, is it still possible to escape from saddle points?

This might seem hard, as the gradient at a saddle point is $\vec{0}$ and does not give us any information. However, the key observation here is that saddle points are very unstable: if we put a ball on a saddle point, then slightly perturb it, the ball is likely to fall! Of course we need to make this intuition formal in higher dimensions, as naively, finding the direction in which to fall seems to require computing the eigenvector of the Hessian corresponding to its smallest eigenvalue.

To formalize this intuition we will use a noisy gradient descent

$y = x - \eta \nabla f(x) + \epsilon.$

Here $\epsilon$ is a noise vector that has mean $0$. This additional noise is going to deliver the initial nudge that makes the ball fall along the slope.

In fact, often it is much cheaper to compute a noisy gradient than the true gradient — this is the key idea in stochastic gradient descent, and a large body of work shows that the noise does not interfere with convergence for convex optimization. For non-convex optimization, intuitively people believed the inherent noise helps in convergence because it pushes the current point away from saddle points. It’s not a bug, it’s a feature!
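Here is a minimal sketch of this effect on a toy example of my own (the function $y = x_1^4 - 2x_1^2 + x_2^2$ from the earlier figure, with made-up step size and noise level): starting exactly at the saddle $(0,0)$, the mean-zero noise nudges the iterate off the saddle and plain gradient steps then carry it to one of the two local minima $(\pm 1, 0)$.

```python
# Noisy gradient descent escaping the saddle of y = x1^4 - 2 x1^2 + x2^2 (a sketch).
import numpy as np

def grad(x):
    return np.array([4 * x[0]**3 - 4 * x[0], 2 * x[1]])

rng = np.random.default_rng(0)
x = np.zeros(2)                                           # start exactly at the saddle point
for _ in range(500):
    x = x - 0.01 * grad(x) + 0.01 * rng.normal(size=2)    # gradient step plus mean-zero noise

print(x)                                                  # close to (1, 0) or (-1, 0)
```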

(Figure: escaping from saddle points.)

Previously, there was no good upper bound known on the number of iterations needed to escape saddle points and arrive at a local minimum. In Ge et al. 2015, we show

Theorem(Informal) Noisy gradient descent can find a local minimum of strict saddle functions in polynomial time.

The polynomial dependencies on the dimension $n$ and on the smallest eigenvalue of the Hessian are fairly high and not very practical. It is an open problem to find the optimal convergence rate for strict saddle problems.

A recent subsequent paper by Lee et al. showed that even without adding noise, gradient descent will not converge to any strict saddle point if the initial point is chosen randomly. However, their result relies on the Stable Manifold Theorem from dynamical systems theory, which inherently does not provide any upper bound on the number of steps.

Beyond Simple Saddle Points

We have seen algorithms that can handle (simple) saddle points. However, non-convex problems can have much more complicated landscapes that involve degenerate saddle points — points whose Hessian is positive semidefinite and has zero eigenvalues. Such a degenerate structure often indicates a complicated saddle point (such as a monkey saddle, Figure (a)) or a set of connected saddle points (Figures (b)(c)). In Anandkumar, Ge 2016 we gave an algorithm that can deal with some of these degenerate saddle points.

(Figure: higher order saddle points.)

The landscapes of non-convex functions can be very complicated, and there are still many open problems. What other functions are strict saddle? How do we make optimization algorithms that work even when there are degenerate saddle points or even spurious local minima? We hope more researchers will be interested in these problems!

Saddles Again


Thanks to Rong for the very nice blog post describing critical points of nonconvex functions and how to avoid them. I’d like to follow up on his post to highlight a fact that is not widely appreciated in nonlinear optimization. Though we often teach the contrary in our intro courses, it is in fact super hard to converge to a saddle point. (Just look at those pictures in Rong’s post! If you move ever so slightly you fall off the saddle). Even simple algorithms like gradient descent with constant step sizes can’t converge to saddle points unless you try really hard.

It’s hard to converge to a saddle.

To illustrate why gradient descent would not converge to a non-minimizing saddle point, consider the case of a non-convex quadratic, $f(x)=\frac{1}{2} \sum_{i=1}^d a_i x_i^2$. Assume that $a_i$ is positive for the first $k$ values and strictly negative for the last $d-k$ values. The unique stationary point of this problem is $x=0$. The Hessian at $0$ is simply the diagonal matrix with $H_{ii} = a_i$ for $i=1,\ldots,d$.

Now what happens when we run gradient descent on this function from some initial point $x^{(0)}?$ The gradient method has iterates of the form

$x^{(k+1)} = x^{(k)} - t\, \nabla f(x^{(k)}).$
For our function, this takes the form

$x^{(k+1)}_i = (1 - t\, a_i)\, x^{(k)}_i.$
If one unrolls this recursive formula down to zero, we see that the $i$th coordinate of the $k$th iterate is given by the formula

$x^{(k)}_i = (1 - t\, a_i)^k\, x^{(0)}_i.$
One can immediately see from this expression that if the step size $t$ is chosen such that $t |a_i| < 1 $ for all $i$, then when all of the $a_i$ are nonnegative, the algorithm converges to a point where the gradient is equal to zero from any starting point. But if there is a single negative $a_i$, the function diverges to negative infinity exponentially quickly from any randomly chosen starting point.

The random initialization is key here. If we initialized the problem such that $x^{(0)}_i=0$ whenever $a_i<0$, then the algorithm would actually converge. However, under the smallest perturbation away from this initial condition, gradient descent diverges to negative infinity.

Most of the examples showing that algorithms converge to stationary points are fragile in a similar way. You have to try very hard to make an algorithm converge to a saddle point. As an example of this phenomenon for a non-quadratic function, consider the following example from Nesterov’s revered Introductory Lectures on Convex Optimization. Let $f(x,y) = \frac12 x^2 +\frac14 y^4-\frac12 y^2$. The critical points of this function are $z^{(1)}= (0,0)$, $z^{(2)} = (0,-1)$ and $z^{(3)} = (0,1)$. The points $z^{(2)}$ and $z^{(3)}$ are local minima, and $z^{(1)}$ is a saddle point. Now observe that gradient descent initialized from any point of the form $(x,0)$ converges to the saddle point $z^{(1)}$. From any other initial point, gradient descent converges to a local minimum. If one chooses an initial point at random, then with probability one gradient descent does not converge to a saddle point.
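A quick numerical sketch of this example (my own, with an arbitrary step size) makes the dichotomy visible: starts of the form $(x,0)$ land on the saddle, while a random start lands on one of the two local minima.

```python
# Gradient descent on f(x, y) = x^2/2 + y^4/4 - y^2/2 (a sketch of the example above).
import numpy as np

def gd(z0, steps=2000, t=0.1):
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        z = z - t * np.array([z[0], z[1]**3 - z[1]])   # gradient of f
    return z

print(gd([0.3, 0.0]))                                   # ~(0, 0): the saddle point
rng = np.random.default_rng(0)
print(gd(rng.normal(size=2)))                           # ~(0, 1) or (0, -1): a local minimum
```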

The Stable Manifold Theorem and random initialization

In recent work with Jason Lee, Max Simchowitz, and Mike Jordan, we made this result precise using the Stable Manifold Theorem from dynamical systems. The Stable Manifold Theorem is concerned with fixed point iterations of the form $x^{(k+1)} = \Psi(x^{(k)})$. It implies that the set of points that locally converge to a fixed point $x^{\star}$ of such an iteration has measure zero whenever the Jacobian of $\Psi$ at $x^{\star}$ has an eigenvalue of magnitude larger than 1.

With a fairly straightforward argument, we were able to show that the gradient descent algorithm satisfies the assumptions of the Stable Manifold Theorem, and, moreover, that the set of points that converge to strict saddles always has measure zero. This formalizes the above argument. If you pick a point at random and run gradient descent, you will never converge to a saddle point. While this doesn’t give a precise rate on the number of iterations, we show that if all of the local minima satisfy the Kurdyka-Lojasiewicz inequality, then one can derive quantitative convergence rates.

In some sense, optimizers would not be particularly surprised by this theorem. We are sure that some version of our result is already known for gradient descent, but we couldn’t find it in the literature. If you can find an earlier reference proving this theorem we would be delighted if you’d let us know.

Adding noise

As Rong discussed, in his paper with Huang, Jin, and Yuan, adding gaussian noise to the gradient helps to avoid saddle points. In particular, they introduce the notion of strict saddle functions: those where every critical point is either close to a local minimum or has a Hessian with a negative eigenvalue bounded away from 0. As we saw above, if a saddle point has negative eigenvalues, the set of initial conditions that converge to that point has measure zero. But when we add noise to the gradient, there are no initial conditions that converge to saddles. The noise immediately pushes you off this low-dimensional manifold.

Interestingly, a similar result also follows from the Stable Manifold Theorem. Indeed, Robin Pemantle developed a more general result for stochastic processes. Pemantle uses the Stable Manifold Theorem to show that general vector flows perturbed by noise cannot converge to unstable fixed points. As a special case, he proves that stochastic gradient descent cannot converge to a saddle point provided the gradient noise is sufficiently diverse. In particular, this implies that additive gaussian noise is sufficient to prevent convergence to saddles.

Pemantle does not have to assume the strict saddle point condition to prove his theorem. However, additional work would be required to extract the sort of quantitative convergence bounds that Rong and his coauthors derive from Pemantle’s argument.

What makes nonconvex optimization difficult?

If saddle points are easy to avoid, then the question remains as to what exactly makes nonconvex optimization difficult? In my next post, I’ll explore why this question is so challenging, describing some apparently innocuous problems in optimization that are deviously difficult.

Markov Chains Through the Lens of Dynamical Systems: The Case of Evolution


In this post, we will see the main technical ideas in the analysis of the mixing time of evolutionary Markov chains introduced in a previous post. We start by introducing the notion of the expected motion of a stochastic process or a Markov chain. In the case of a finite population evolutionary Markov chain, the expected motion turns out to be a dynamical system which corresponds to the infinite population evolutionary dynamics with the same parameters. Surprisingly, we show that the limit sets of this dynamical system govern the mixing time of the Markov chain. In particular, if the underlying dynamical system has a unique stable fixed point (as in asexual evolution), then the mixing is fast and in the case of multiple stable fixed points (as in sexual evolution), the mixing is slow. Our viewpoint connects evolutionary Markov chains, nature’s algorithms, with stochastic descent methods, popular in machine learning and optimization, and the readers interested in the latter might benefit from our techniques.

A Quick Recap

Let us recall the parameters of the finite population evolutionary Markov chain (denoted by $\mathcal{M}$) we saw last time. At any time step, the state of the Markov chain consists of a population of size $N$ where each individual could be one of $m$ types. The mutation and the fitness matrices are denoted by $Q$ and $A$ respectively. $X^{(t)}$ captures, after normalization by $N,$ the composition of the population at time $t$. Thus, $X^{(t)}$ is a point in the $m$-dimensional probability simplex $\Delta_m$. Since we assumed that $QA>0$, the Markov chain has a stationary distribution $\pi$ over its state space, denoted by $\Omega \subseteq \Delta_m$; the state space has cardinality roughly $N^m$. Thus, $X^{(t)}$ evolves in $\Delta_m$ and, with time, its distribution converges to $\pi$. Our goal is to bound the time it takes for this distribution to stabilize, i.e., bound the mixing time of $\mathcal{M}$.

The Expected Motion

As a first step towards understanding the mixing time, let us compute the expectation of $X^{(t+1)}$ for a given $X^{(t)}$. This function tells us where we expect to be after one time step given the current state; in this paper we refer to this as the expected motion of this Markov chain (and define it formally for all Markov chains towards the end of this post). An easy calculation shows that, for $\mathcal{M}$,

$\mathbb{E}\left[X^{(t+1)} \,\middle|\, X^{(t)}\right] = \frac{QA\, X^{(t)}}{\left\Vert QA\, X^{(t)}\right\Vert_1} = f\!\left(X^{(t)}\right).$
This $f$ is the same function that was introduced in the previous post for the infinite population evolutionary dynamics with the same parameters! Thus, in each time step, the expected motion of the Markov chain is governed by $f$. Surprisingly, something stronger is true: we can prove (see Section 3.2 here) that, given some $X^{(t)},$ the point $X^{(t+1)}$ can be equivalently obtained by taking $N$ i.i.d. samples from $f(X^{(t)})$. In words, the dynamical system $f$ guides the Markov chain $\mathcal{M}$.
In fact, a moment’s thought tells us that this phenomenon transcends any specific model of evolution. We can fix any dynamical system $g$ over the simplex and define a Markov chain guided by it as follows: If $X^{(t)}$ is the population vector at time $t$, then define $X^{(t+1)}$ as the population vector obtained by taking $N$ i.i.d. (or even correlated) copies from $g(X^{(t)})$. By design, $g$ is the expected motion of this Markov chain.
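In code, the construction of a chain guided by a dynamical system $g$ is just multinomial sampling (a minimal sketch of my own; the particular $Q$ and $A$ below are made up):

```python
# A Markov chain guided by a dynamical system g on the simplex (a sketch).
import numpy as np

def guided_step(X, g, N, rng):
    """Next state = empirical distribution of N i.i.d. samples from g(X)."""
    return rng.multinomial(N, g(X)) / N          # the expected motion of this chain is exactly g

rng = np.random.default_rng(0)
m, N = 3, 1000
A = np.diag([1.0, 2.0, 3.0])                     # fitness matrix
Q = np.full((m, m), 0.1) + 0.7 * np.eye(m)       # column-stochastic mutation matrix
g = lambda x: (Q @ A @ x) / np.sum(Q @ A @ x)    # the evolutionary dynamics

X = np.full(m, 1.0 / m)
for _ in range(100):
    X = guided_step(X, g, N, rng)
print(X)                                          # fluctuates around the fixed point of g
```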

Evolution on Finite Populations = Noisy Evolution on Infinite Populations

The above observation allows us to view our evolutionary Markov chain as a noisy version of the deterministic, infinite population evolution. A bit more formally, there are implicitly defined random variables $\zeta_{s}^{(t+1)}$ for $1 \leq s \leq N$ and all $t$, such that

$X^{(t+1)} = f\!\left(X^{(t)}\right) + \frac{1}{N} \sum_{s=1}^{N} \zeta_s^{(t+1)}.$
Here, $\zeta_s^{(t+1)}$ for $1\leq s \leq N$ is a random vector that corresponds to the error or noise of sample $s$ at the $t$-th time step. Formally, because $f$ is the expected motion of the Markov chain, each $\zeta_s^{(t+1)}$ has expectation $0$ conditioned on $X^{(t)}$. Further, the fact that $f$ guides $\mathcal{M}$ implies that for each $t$, when conditioned on $X^{(t)}$, the vectors $\zeta_{s}^{(t+1)}$ are i.i.d. for $1 \leq s \leq N$. Without conditioning, we cannot say much about the $\zeta_{s}^{(t)}$s. However, since we know that the state space of $\mathcal{M}$ lies in the simplex, we can deduce that $\Vert\zeta_s^{(t)}\Vert \leq 2$. The facts that the expectation of the $\zeta_s^{(t)}$s are zero, they are independent and bounded imply that the variance of each coordinate of $\frac{1}{N} \sum_{s=1}^N \zeta_s^{(t+1)}$ (again conditioned on the past) is roughly $1/N$.

Connections to Stochastic Gradient Descent

Now we draw an analogy of the evolutionary Markov chain to an old idea in optimization, stochastic gradient descent or SGD. However, we will see crucial differences that require the development of new tools. Recall that in the SGD setting, one is given a function $F$ and the goal is to find a local minimum of $F.$ The gradient descent method moves from the current point $x^{(t)}$ to a new point $x^{(t+1)}=x^{(t)} - \eta \nabla F(x^{(t)})$ for some rate $\eta$ (which could depend on time $t$).
Since the gradient may not be easy to compute, SGD substitutes the gradient at the current point by an unbiased estimator of the gradient. Thus, the point at time $t$ becomes a random variable $X^{(t)}$. Since the estimate is unbiased, we may write it as

$\nabla F\!\left(X^{(t)}\right) + \zeta^{(t+1)},$
where the expectation of $\zeta^{(t+1)}$ conditioned on $X^{(t)}$ is zero. Thus, we can write one step of SGD as

$X^{(t+1)} = X^{(t)} - \eta\left(\nabla F\!\left(X^{(t)}\right) + \zeta^{(t+1)}\right).$
Comparing it to our evolutionary Markov chain, it can be shown that $f(x)=\frac{QA x}{\Vert QA x\Vert_1}$ is a gradient system (i.e., $f=\nabla G$ for some function $G$) and we may think of the corresponding $\mathcal M$ as SGD with step-size $\eta=1/N$.

There is a vast literature understanding when SGD converges to the global optimum (for convex $F$) or a local optimum (for reasonable non-convex $F$). Why can’t we use techniques developed for SGD to analyze our evolutionary Markov chain? To start with, when the step size does not go to zero with time, $X^{(t)}$ wanders around its domain $\Omega$ and will not converge to a point. In the case when the step size is fixed, typically, the time average of $X^{(t)}$ is used in the hope that it will converge to a local minimum of the function. The Ergodic Theorem of Markov chains tells us that the time average will converge to the expectation of a sample drawn from $\pi$, the steady state distribution. This quantity is the same as the zero of $\nabla F$ only when $\nabla F$ is a linear function (equivalently, $F$ is quadratic); certainly not the case in our setting. Further, the rate of convergence to this expectation is governed by the mixing time of the Markov chain. Thus, there is no getting around proving a bound on the mixing time. Moreover, for biological applications (as described in our previous post), we need to know more than the expectation: we need to obtain samples from the steady state distribution $\pi$. Finally, in several other evolutionary Markov chains of interest, the guiding dynamical system is not a gradient system. Hence, the desired results in the setting of evolution seem beyond the reach of current techniques.

The reason for taking this detour and making the connection to SGD is not only to show that completely different sounding problems and areas might be related, but also that the techniques we develop in analyzing evolutionary Markov chains find use in understanding SGD beyond the quadratic case.

The Landscape of the Expected Motion Governs the Mixing Time

Now we delve into our results and proof ideas. We derive all of the information we need to bound the mixing time of $\mathcal M$ from the limit sets of $f$, which guides it. Roughly, we show that when the limit set of $f$ consists of a unique stable fixed point (which is akin to convexity), as in asexual evolution, then the mixing is fast, and in the case of multiple stable fixed points (which is akin to non-convexity), as in sexual evolution, the mixing is slow.

We saw in our first post that the dynamical system $f(x)=\frac{QAx}{\Vert QAx\Vert_1}$ corresponding to the case of asexual evolution has exactly one fixed point in the simplex, say $ x^\star$, when $QA$ is positive. In fact, $x^\star$ is stable and, no matter where we initiate the dynamical system, it ends up close to $x^\star$ in a small number of iterations (which does not depend on $N$).

Back to mixing time: a generic technique to bound the mixing time of a Markov chain employs a coupling of two copies of the chain $X^{(t)}$ and $Y^{(t)}$.

A coupling of a Markov chain $\mathcal M$ is a function which takes as input $X^{(t)}$ and $Y^{(t)}$ and outputs $X^{(t+1)}$ and $Y^{(t+1)}$ such that each of $X^{(t+1)}$ and $Y^{(t+1)}$, when considered on their own, is a correct instantiation of one step of $\mathcal M$ from the states $X^{(t)}$ and $Y^{(t)}$ respectively. However, $X^{(t+1)}$ and $Y^{(t+1)}$ are allowed to be arbitrarily correlated.

For example, we could couple $X^{(t)}$ and $Y^{(t)}$ such that if $X^{(t)} = Y^{(t)}$ then $X^{(t+1)}=Y^{(t+1)}$. More generally, we can consider the distance between $X^{(t)}$ and $Y^{(t)}$, and consider a coupling that contracts the distance between them. If this distance is contractive by, say, a factor of $\rho<1$ at every time step, then the number of iterations required to reduce distance below $1/N$ is about $\log_{1/\rho} N$; this roughly upper bounds the mixing time.

The key observation that connects the dynamical system $f$ and our Markov chain is that using the function $f$ we can construct a coupling $\mathcal{C}$ under which, for all $x$, $y \in \Omega$, the expected distance after one coupled step is essentially controlled by the distance between the images under $f$:

$\mathbb{E}\left[\left\Vert X^{(t+1)} - Y^{(t+1)}\right\Vert_1 \,\middle|\, X^{(t)}=x,\ Y^{(t)}=y\right] \ \lesssim\ \left\Vert f(x) - f(y)\right\Vert_1.$
Thus, if $ \Vert f(x)-f(y)\Vert_1 < \rho \cdot \Vert x-y\Vert_1 <1$ for some $\rho<1$ and all $x,y \in \Omega$, we would be done. The bad news is that we can show that there are $x,y$ for which $\Vert f(x)-f(y)\Vert_1 > \Vert x-y \Vert_1$ implying that there is no contractive coupling for all $x$ and $y.$

What about when $x$ and $y$ are close to $x^\star$?

In this case, by a first order Taylor approximation of the dynamical system $f$, we can bound the contraction $(\rho)$ by the $1 \rightarrow 1$ norm of the Jacobian of $f$ at $x^\star$. However, this quantity is less than one only when $m=2$, see here. For larger $m$, we have to go back to our intuition from dynamical systems and, using the fact that all trajectories of $f$ converge to $x^\star$, argue that the appropriate norm of the Jacobian of $f^k$ (i.e., $f$ applied $k$ times) is contractive. While there are a few technical challenges, we can use $f^k$ to construct a contractive coupling. We then use concentration to handle the case when $x$,$y$ are not close to $x^\star$, see here for the details. As a consequence, we obtain a mixing time of $O(\log N)$ (suppressing other parameters). Thus, in the world of asexual evolution the steady state can be reached quickly!

Markov Chains Guided by Dynamical Systems - Beyond Uniqueness

Interestingly, this proof does not use any property of $f$ other than that it has a unique fixed point which is stable. However, in many cases, such as sexual evolution (see here for the model of sexual evolution, or here and here for an equivalent model of how children acquire grammar), the expected motion has multiple fixed points - some stable and some unstable. Such a dynamical system is inherently non-convex - trajectories starting at different points could converge to different points. Further, the presence of unstable fixed points can slow down trajectories and, hence, the mixing time. In this paper, we give a comprehensive treatment of how the landscape of the limit sets determines the mixing time of evolutionary Markov chains. In a nutshell, while the presence of unstable fixed points does not seem to affect the mixing time, the presence of two stable fixed points results in the mixing time being $\exp(N)$!

This result allows us to prove a phase transition in the mixing time for an evolutionary Markov chain with sex, where changing the mutation parameter changes the geometry of the limit sets of the expected motion from multiple stable fixed points to a unique stable fixed point.

Evolution on Structured Populations?

A challenging problem left open by our work is to try to estimate the mixing time of evolutionary dynamics on structured populations which arise in ecology. Roughly, this setting extends the evolutionary models discussed thus far by introducing an additional input parameter, a graph on $N$ vertices.

The graph provides structure to the population by locating each individual at a vertex, and, at time $t+1$, a vertex determines its type by sampling with replacement from among its neighbors in the graph at time $t$; see this paper for more details.

The model we discussed so far can be seen as a special case when the underlying graph is the complete graph on $N$ vertices. The difficulty is twofold: it is no longer sufficient to keep track of the number of individuals of each type, and the variance of the noise is no longer $1/N$ - it could be large if a vertex has small degree.

The Expected Motion Revisited

Now we formally define the expected motion of any Markov chain with respect to a function $\phi$ from its state space $\Omega$ to $\mathbb{R}^n$. If $X^{(t)}=x$ is the state of the Markov chain at time $t$ and $X^{(t+1)}$ its state at time $t+1,$ then the expected motion of $\phi$ for the chain at $x$ is

$\mathbb{E}\left[\phi\!\left(X^{(t+1)}\right) \,\middle|\, X^{(t)} = x\right],$
where the expectation is taken over one step of the chain. Often, and in the application we presented in this post, the state space $\Omega$ already has a geometric structure and is a subset of $\mathbb{R}^n$. In this case, there is a canonical expected motion which corresponds to $\phi$ being just the identity map.

What can the expected motion of a Markov chain tell us about the Markov chain itself?

Of course, without imposing additional structure on the Markov chain or $\phi$, the answer is unlikely to be very interesting. However, the results in this post suggest that thinking of a Markov chain in this way can be quite useful.

To Conclude …

In this post, hopefully, you got a flavor of how techniques from dynamical systems can be used to derive interesting properties of Markov chains and stochastic processes. We also saw that nature’s methods, in the context of evolution, seem quite close to the methods of choice of humans - is this a coincidence? In a future post, we will show another example of this phenomenon - the famous iteratively reweighted least squares (IRLS) algorithm in sparse recovery turns out to be identical to the dynamics of an organism found in nature - the slime mold.

A Framework for analysing Non-Convex Optimization


Previously, Rong’s post and Ben’s post showed that (noisy) gradient descent can converge to a local minimum of a non-convex function, and in (large) polynomial time (Ge et al.’15). This post describes a simple framework that can sometimes be used to design/analyse algorithms that can quickly reach an approximate global optimum of the nonconvex function. The framework —which was used to analyse alternating minimization algorithms for sparse coding in our COLT’15 paper with Ge and Moitra—generalizes many other sufficient conditions for convergence (usually gradient-based) that were formulated in recent papers.

Measuring progress: a simple Lyapunov function

Let $f$ be the function being optimized and suppose the algorithm produces a sequence of candidate solutions $z_1,\dots,z_k,\dots,$ via some update rule

$z_{k+1} = z_k - \eta\, g_k.$
This can be seen as a dynamical system (see Nisheeth’s and Ben’s posts related to dynamical systems). Our goal is to show that this sequence converges to (or gets close to) a target point $z^* $, which is a global optimum of $f$. Of course, the algorithm doesn’t know $z^*$.

To design a framework for proving convergence it helps to indulge in daydreaming/wishful thinking: what property would we like the updates to have, to simplify our job?

A natural idea is to define a Lyapunov function $V(z)$ and show that: (i) $V(z_k)$ decreases to $0$ (at a certain speed) as $k\rightarrow \infty$; (ii) when $V(z)$ is close to $0$, then $z$ is close to $z^* $. (Aside: One can imagine more complicated ways of proving convergence, e.g., show $V(z_k)$ ultimately goes to $0$ even though it doesn’t decrease in every step. Nesterov’s acceleration method uses such a progress measure.)

Consider possibly the most trivial Lyapunov function, the (squared) distance to the target point, $V(z) = |z-z^*|^2$. This is also used in the standard convergence proof for convex functions, since moving in the opposite direction to the gradient can be shown to reduce this measure $V()$.

Even when the function is nonconvex, there always exist update directions that reduce this $V()$ (though finding them may not be easy). Simple algebraic manipulation shows that when the learning rate $\eta$ is small enough, then for $V(z_{k+1}) \le V(z_k)$, it is necessary and sufficient to have $\langle g_k, z_k-z^* \rangle \ge 0$.

(Figure: the correlation condition.) As illustrated in the figure, $z^* - z_k$ is the ideal direction in which we desire to move, and $-g_k$ is the direction in which we actually move. To establish convergence, it suffices to verify that the direction of movement is positively correlated with the desired direction.

To get quantitative bounds on running time, we need to ensure that $V(z_k)$ not only decreases, but does so rapidly. The next condition formalizes this: intuitively speaking it says that $-g_k$ and $z^*-z_k$ make an angle strictly less than 90 degrees.

Correlation Condition: The direction $g_k$ is $(\alpha,\beta,\epsilon_k)$-correlated with $z^*$ if

$\langle g_k,\, z_k - z^* \rangle \ \ge\ \alpha\,\Vert z_k - z^*\Vert^2 + \beta\,\Vert g_k\Vert^2 - \epsilon_k.$
This may look familiar to experts in convex optimization: as a special case if we make the update direction $g_k$ stand for the (negative) gradient, then the condition yields familiar notions such as strong convexity and smoothness. But the condition allows $g_k$ to not be the gradient, and in addition, allows the error term $\epsilon_k$, which is necessary in some applications to accommodate non-convexity and/or statistical error.

If the algorithm can at each step find such update directions, then the familiar convergence proof of convex optimization can be modified to show rapid convergence here as well, except the convergence is approximate, to some point in the neighborhood of $z^*$.

Theorem: Suppose $g_k$ satisfies the Correlation Condition above for every $k$; then with learning rate $\eta \le 2\beta$, we have

$\Vert z_{k+1} - z^*\Vert^2 \ \le\ (1 - 2\alpha\eta)\,\Vert z_k - z^*\Vert^2 + 2\eta\,\epsilon_k,$

so $V(z_k)$ decreases geometrically until it reaches a neighborhood of $z^*$ of size governed by $\max_j \epsilon_j/\alpha$.
As mentioned, the “wishful thinking” approach has been used to identify other conditions under which specific nonconvex optimizations can be carried out to near-optimality: (JNS’13, Hardt’14, BWY’14, CLS’15, AGMM’15, SL’15, CC’15, ZWL’15). All of these can be seen as some weakening of convexity (with the exception of the analysis for matrix completion in Hardt’14 which views the updates as noisy power method).

Our condition appears to contain most if not all of these as special cases.

Often the update direction $g_k$ in these papers is related to the gradient. For example using the gradient instead of $g_k$ in our correlation condition turns it into the “regularity condition” proposed by CLS’15 for analyzing Wirtinger flow algorithm for phase retrieval. The gradient stability condition in BWY’14 is also a special case, where $g_k$ is required to be close enough to $\nabla h(z_k)$ for some convex $h$ such that $z^* $ is the optimum of $h$. Then since $\nabla h(z_k)$ has angle < 90 degrees with $z_k-z^*$ (which follows from convexity of $h$), it implies that $g_k$ also does.

The advantage of our framework is that it encourages one to think of algorithms where $g_k$ is not the gradient. Thus applying the framework doesn’t require understanding the behavior of the gradient on the entire landscape of the objective function; instead, one needs to understand the update direction (which is under the algorithm designer’s control) at the sequence of points actually encountered while running the algorithm.

This slight change of perspective may be powerful.
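To see the mechanics of the theorem in isolation, here is a toy numerical sketch of my own (the target $z^*$, the constants, and the construction of $g_k$ are all made up): the update direction is deliberately not the gradient of anything, only $(\tfrac12, \tfrac25, 0)$-correlated with $z^*$, and $V(z_k)$ still decreases geometrically.

```python
# A toy check of the correlation-condition framework (not from the paper).
import numpy as np

rng = np.random.default_rng(0)
z_star = np.array([1.0, -2.0, 0.5])     # the (unknown to the algorithm) target
z = np.zeros(3)
eta = 0.1                               # satisfies eta <= 2*beta with beta = 0.4

for k in range(200):
    d = z - z_star
    noise = rng.normal(size=3)
    noise -= (noise @ d) / (d @ d + 1e-12) * d                    # keep only the part orthogonal to d
    noise *= np.linalg.norm(d) / (np.linalg.norm(noise) + 1e-12)  # scale it to ||d||
    g = d + 0.5 * noise     # <g, d> = ||d||^2 and ||g||^2 = 1.25 ||d||^2: (1/2, 2/5, 0)-correlated
    z = z - eta * g

print(np.linalg.norm(z - z_star))       # essentially 0: V(z_k) shrinks geometrically
```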

Application to Sparse Coding

A particularly useful situation for applying the framework above is where the objective function has two sets of arguments and it is feasible to optimize one set after fixing the other –leading to the familiar alternating minimization heuristic. Such algorithms are a good example of how one may try to do local-improvement without explicitly following the (full) gradient. As mentioned, our framework was used to analyse such alternating minimization for sparse coding.

In sparse coding, we are given a set of examples $Y = [y_1,\dots, y_N]\in \mathbb{R}^{d\times N}$, and are asked to find an over-complete basis $A = [a_1,\dots,a_m]$ (where “overcomplete” refers to the setting $m > d$) so that each example $y_j$ can be expressed as a sparse linear combination of $a_i$’s. Therefore, the natural optimization problem with squared loss is

$\min_{A,\, X}\ \Vert Y - AX\Vert_F^2 \quad \text{subject to the columns of } X \text{ being sparse.}$
Here both the objective and the constraint set are not convex. One could consider using $\ell_1$ regularization as a surrogate for sparsity, but the trouble is that the regularization is neither smooth nor strongly convex, and the standard techniques for dealing with an $\ell_1$ penalty term in convex optimization cannot be easily applied due to non-convexity.

The standard alternating minimization algorithm (a close variant of the one proposed by Olshausen and Field 1997 as a neurally plausible explanation for V1, the human primary visual cortex) alternates between the two sets of variables: holding $A_k$ fixed, compute a sparse $X_{k+1}$ with $A_k X_{k+1} \approx Y$; then update $A_{k+1} = A_k - \eta\, G_k$, where $G_k$ is the partial gradient $\frac{\partial f}{\partial A}$ of the squared loss evaluated at $(A_k, X_{k+1})$.
Here the update for $X$ is the projection pursuit algorithm from sparse recovery (see Elad’10 for background), which gives an approximation of the best fit for $X$ given the current $A$.

Sometimes alternating minimization algorithms need careful initialization, but in practice here it suffices to initialize $A_0$ using a random sample of datapoints $y_i$’s.
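
For concreteness, here is a schematic numpy sketch of such an alternating loop on synthetic data. It is not the AGMM'15 algorithm verbatim: the decoding step is plain hard thresholding, the dictionary step is a vanilla gradient step, and $A_0$ is a random sample of data points as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, N, k = 20, 40, 2000, 3                       # dims, dict size, samples, sparsity

# synthetic ground truth: unit-norm dictionary and k-sparse coefficients
A_true = rng.normal(size=(d, m)); A_true /= np.linalg.norm(A_true, axis=0)
X_true = np.zeros((m, N))
for j in range(N):
    supp = rng.choice(m, size=k, replace=False)
    X_true[supp, j] = rng.normal(size=k)
Y = A_true @ X_true

def sparsify(Z, k):
    """Keep the k largest-magnitude entries of each column (decoding step)."""
    out = np.zeros_like(Z)
    idx = np.argsort(-np.abs(Z), axis=0)[:k]
    for j in range(Z.shape[1]):
        out[idx[:, j], j] = Z[idx[:, j], j]
    return out

# initialize the dictionary with a random sample of data points
A = Y[:, rng.choice(N, size=m, replace=False)].copy()
A /= np.linalg.norm(A, axis=0)
eta = 0.5

for it in range(50):
    X = sparsify(A.T @ Y, k)                       # decode given current A
    A -= eta * (A @ X - Y) @ X.T / N               # gradient step on ||Y - AX||_F^2
    A /= np.linalg.norm(A, axis=0)                 # keep columns unit norm
    if it % 10 == 0:
        print(it, np.linalg.norm(Y - A @ X) / np.linalg.norm(Y))
```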

However, it remains an open problem to analyse convergence using such random initialization; our analysis uses a special starting point $A_0$ found using spectral methods.

Applying our framework

At first glance, the mysterious aspect of our framework is how the algorithm can find an update direction correlated with $z_k -z^* $ without knowing $z^* $. In the context of sparse coding, this comes about as follows: if we assume a probabilistic generative model for the observed data (namely, that it was generated using some ground-truth sparse coding), then alternating minimization automatically comes up with such update directions!

Specifically, we will assume that the data points $y_i$ are generated using some ground-truth dictionary $A^* $ and some ground-truth $X^* $ whose columns are i.i.d. draws from a suitable distribution. (One needs to assume some conditions on $A^* , X^* $, which are not important in the sketch below.) Note that the entries within each column of $X^* $ are not assumed to be mutually independent; if they were, the problem would reduce to Independent Component Analysis.

In line with our framework, we consider the Lyapunov function $V(A) = |A-A^*|_F^2$. Here the Frobenius norm $|\cdot|_F$ is just the Euclidean norm of the vectorized matrix. Our framework then implies that to show quick convergence it suffices to verify the following correlation condition (for some $\alpha,\beta > 0$) for the update direction $G_k$:

$\langle G_k,\ A_k - A^* \rangle \ \ge\ \alpha\, |A_k - A^*|_F^2 + \beta\, |G_k|_F^2 - \epsilon_k.$

In AGMM’15 we showed that under certain assumptions on the true dictionary $A^* $ and the true coefficients $X^* $, the above inequality indeed holds with small $\epsilon_k$ and some constants $\alpha,\beta > 0$. The proof is a bit technical but reasonable: the partial gradient $\frac{\partial f}{\partial A}$ has a simple form, and therefore $G_k$ has a closed form in terms of $A_k$ and $Y$. It then boils down to plugging the form of $G_k$ into the inequality above and simplifying appropriately. (One also needs the fact that the starting point $A_0$ obtained using spectral methods is somewhat close to $A^* $.)

We hope others will use our framework to analyse other nonconvex problems!

(Aside: We hope that readers will leave comments if they know of other frameworks for proving convergence that are not subcases of the above framework.)

Linear algebraic structure of word meanings


Word embeddings capture the meaning of a word using a low-dimensional vector and are ubiquitous in natural language processing (NLP). (See my earlier post 1 and post2.) It has always been unclear how to interpret the embedding when the word in question is polysemous, that is, has multiple senses. For example, tie can mean an article of clothing, a drawn sports match, and a physical action.

Polysemy is an important issue in NLP and much work relies upon WordNet, a hand-constructed repository of word senses and their interrelationships. Unfortunately, good WordNets do not exist for most languages, and even the one in English is believed to be rather incomplete. Thus some effort has been spent on methods to find different senses of words.

In this post I will talk about my joint work with Li, Liang, Ma, Risteski which shows that word senses are in fact easily accessible in many current word embeddings. This goes against conventional wisdom in NLP, which holds that word embeddings cannot capture polysemy, since they use a single vector to represent a word regardless of whether it has one sense or a dozen. Our work shows that the major senses of a word lie in linear superposition within its embedding, and are extractable using sparse coding.

This post uses embeddings constructed using our method and the Wikipedia corpus, but similar techniques also apply (with some loss in precision) to other embeddings described in post 1, such as word2vec, GloVe, or even the decades-old PMI embedding.

A surprising experiment

Take the viewpoint (simplistic yet instructive) that a polysemous word like tie is a single lexical token that represents unrelated words tie1, tie2, … Here is a surprising experiment suggesting that the embedding for tie should be approximately a weighted sum of the (hypothetical) embeddings of tie1, tie2, …

Take two random words $w_1, w_2$. Combine them into an artificial polysemous word $w_{new}$ by replacing every occurrence of $w_1$ or $w_2$ in the corpus by $w_{new}.$ Next, compute an embedding for $w_{new}$ using the same embedding method while deleting embeddings for $w_1, w_2$ but preserving the embeddings for all other words. Compare the embedding $v_{w_{new}}$ to linear combinations of $v_{w_1}$ and $v_{w_2}$.

Repeating this experiment with a wide range of values for the ratio $r$ between the frequencies of $w_1$ and $w_2$, we find that $v_{w_{new}}$ lies close to the subspace spanned by $v_{w_1}$ and $v_{w_2}$: the cosine of its angle with the subspace is on average $0.97$ with standard deviation $0.02$. Thus $v_{w_{new}} \approx \alpha v_{w_1} + \beta v_{w_2}$. We find that $\alpha \approx 1$ whereas $\beta \approx 1- c\lg r$ for some constant $c\approx 0.5$. (Note this formula is meaningful when the frequency ratio $r$ is not too large, i.e. when $ r < 10^{1/c} \approx 100$.) Thanks to this logarithm, the infrequent sense is not swamped out in the embedding, even if it is 50 times less frequent than the dominant sense. This is an important reason behind the success of our method for extracting word senses.
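
The measurement step of this experiment is easy to reproduce once embeddings are in hand. Below is a small helper (my own sketch, with random vectors standing in for the actual trained embeddings) that computes the cosine of the angle between $v_{w_{new}}$ and the subspace spanned by $v_{w_1}, v_{w_2}$, along with the best-fit coefficients $\alpha, \beta$.

```python
import numpy as np

def angle_to_span(v_new, v1, v2):
    """Cosine of the angle between v_new and span{v1, v2}, plus least-squares (alpha, beta)."""
    B = np.stack([v1, v2], axis=1)                    # d x 2 basis matrix
    coeffs, *_ = np.linalg.lstsq(B, v_new, rcond=None)
    proj = B @ coeffs                                 # orthogonal projection of v_new onto the span
    cosine = v_new @ proj / (np.linalg.norm(v_new) * np.linalg.norm(proj))
    return cosine, coeffs

# toy usage: random vectors play the role of v_{w1}, v_{w2}, v_{w_new}
rng = np.random.default_rng(0)
v1, v2 = rng.normal(size=300), rng.normal(size=300)
v_new = 1.0 * v1 + 0.7 * v2 + 0.05 * rng.normal(size=300)
print(angle_to_span(v_new, v1, v2))
```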

This experiment (to which we were led by our theoretical investigations) is very surprising because the embedding is the solution to a complicated, nonconvex optimization, yet it behaves in such a strikingly linear way. You can read our paper for an intuitive explanation using our theoretical model from post2.

Extracting word senses from embeddings

The above experiment suggests that

$v_{tie} \approx \alpha_1\, v_{tie1} + \alpha_2\, v_{tie2} + \alpha_3\, v_{tie3} + \cdots \qquad (1)$

but this alone is insufficient to mathematically pin down the senses, since $v_{tie}$ can be expressed in infinitely many ways as such a combination. To pin down the senses we will interrelate the senses of different words —for example, relate the “article of clothing” sense tie1 with shoe, jacket etc.

The word senses tie1, tie2, … correspond to “different things being talked about”, in other words, different word distributions occurring around tie. Now recall that our earlier paper described in post2 gives an interpretation of “what’s being talked about”: it is called discourse and it is represented by a unit vector in the embedding space. In particular, the theoretical model of post2 imagines a text corpus as being generated by a random walk on discourse vectors. When the walk is at a discourse $c_t$ at time $t$, it outputs a few words using a loglinear distribution:

$\Pr[w \text{ is emitted at time } t \mid c_t] \ \propto\ \exp(\langle c_t, v_w \rangle). \qquad (2)$

One imagines there exists a “clothing” discourse that has high probability of outputting the tie1 sense, and also of outputting related words such as shoe, jacket, etc. Similarly there may be a “games/matches” discourse that has high probability of outputting tie2 as well as team, score etc.

By equation (2) the probability of being output by a discourse is determined by the inner product, so one expects that the vector for “clothing” discourse has high inner product with all of shoe, jacket, tie1 etc., and thus can stand as surrogate for $v_{tie1}$ in expression (1)! This motivates the following global optimization:

Given word vectors in $\Re^d$ (about $60,000$ of them in this case), a sparsity parameter $k$, and an upper bound $m$, find a set of unit vectors $A_1, A_2, \ldots, A_m$ such that

$v_w = \sum_{j=1}^{m} \alpha_{w,j}\, A_j + \eta_w, \qquad (3)$

where at most $k$ of the coefficients $\alpha_{w,1},\dots,\alpha_{w,m}$ are nonzero (a so-called hard sparsity constraint), and $\eta_w$ is a noise vector.

Here $A_1, \ldots A_m$ represent important discourses in the corpus, which we refer to as atoms of discourse.

Optimization (3) is a surrogate for the desired expansion of $v_{tie}$ in (1) because one can hope that the atoms of discourse will contain atoms corresponding to clothing, sports matches etc. that will have high inner product (close to $1$) with tie1,tie2 respectively. Furthermore, restricting $m$ to be much smaller than the number of words ensures that each atom needs to be used for multiple words, e.g., reuse the “clothing” atom for shoes, jacket etc. as well as for tie.

Both the $A_j$’s and the $\alpha_{w,j}$’s are unknowns in this optimization. This is nothing but sparse coding, useful in neuroscience, image processing, computer vision, etc. It is nonconvex and computationally NP-hard in the worst case, but can be solved quite efficiently in practice using the k-SVD algorithm described in Elad’s survey, lecture 4. We solved this problem with sparsity $k=5$ and $m \approx 2000$. (Experimental details are in the paper. Also, some theoretical analysis of such an algorithm is possible; see this earlier post.)
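
For readers who want to try this on their own embeddings, here is a hedged sketch using scikit-learn's dictionary learning as a stand-in for the k-SVD solver used in the paper; `word_vectors` (an array of shape n_words x d) and `vocab` are assumed to be supplied by you, and the hyperparameters simply mirror the numbers above.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def discourse_atoms(word_vectors, n_atoms=2000, sparsity=5):
    dl = MiniBatchDictionaryLearning(
        n_components=n_atoms,
        transform_algorithm="omp",             # hard sparsity when computing coefficients
        transform_n_nonzero_coefs=sparsity,
    )
    codes = dl.fit_transform(word_vectors)     # alpha_{w,j}: one sparse row per word
    atoms = dl.components_ / np.linalg.norm(dl.components_, axis=1, keepdims=True)
    return atoms, codes                        # rows of `atoms` are A_1, ..., A_m

def atom_neighbors(atom, word_vectors, vocab, topn=10):
    """Words with the largest inner product with an atom: a rough 'meaning' of the atom."""
    sims = word_vectors @ atom
    return [vocab[i] for i in np.argsort(-sims)[:topn]]
```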

Experimental Results

Each discourse atom defines via (2) a distribution on words, which due to the exponential appearing in (2) strongly favors words whose embeddings have a larger inner product with it. In practice, this distribution is quite concentrated on as few as 50-100 words, and the “meaning” of a discourse atom can be roughly determined by looking at a few nearby words. This is how we visualize atoms in the figures below. The first figure gives a few representative atoms of discourse.

A few of the 2000 atoms of discourse found

And here are the discourse atoms used to represent two polysemous words, tie and spring

Discourse atoms expressing the words tie and spring.

You can see that the discourse atoms do correspond to senses of these words.

Finally, we also have a technique that, given a target word, generates representative sentences according to its various senses as detected by the algorithm. Below are the sentences returned for ring. (N.B. The mathematical meaning was missing in WordNet but was picked up by our method.)

Representative sentences for different senses of the word ring.

A new testbed for testing comprehension of word senses

Many tests have been proposed to measure an algorithm’s grasp of word senses. They often involve hard-to-understand metrics such as distance in WordNet, or are tied to performance on specific applications like web search.

We propose a new simple test, inspired by word-intrusion tests for topic coherence due to Chang et al 2009, which has the advantage of being easy to understand and can also be administered to humans.

We created a testbed using 200 polysemous words and their 704 senses according to WordNet. Each “sense” is represented by a set of 8 related words; these were collected from WordNet and online dictionaries by college students who were told to identify the most relevant other words occurring in the online definitions of this word sense as well as in the accompanying illustrative sentences. These 8 words are considered the ground-truth representation of the word sense: e.g., for the “tool/weapon” sense of axe they were: handle, harvest, cutting, split, tool, wood, battle, chop.

Police line-up test for word senses: the algorithm is given a random one of these 200 polysemous words and a set of $m$ senses, which contains the true senses of the word as well as some distractors, which are randomly picked senses from other words in the testbed. The test taker has to identify the word’s true senses among these $m$ senses.

As usual, accuracy is measured using precision (what fraction of the algorithm/human’s guesses were correct) and recall (how many correct senses were among the guesses).

For $m=20$ and $k=4$, our algorithm succeeds with precision $63\%$ and recall $70\%$, and performance remains reasonable for $m=50$. We also administered the test to a group of grad students. Native English speakers had precision/recall scores in the $75$ to $90$ percent range. Non-native speakers had scores roughly similar to our algorithm.

Our algorithm works something like this: If $w$ is the target word, then take all discourse atoms computed for that word, and compute a certain similarity score between each atom and each of the $m$ senses, where the words in the senses are represented by their word vectors. (Details are in the paper.)

Takeaways

Word embeddings have been useful in a host of other settings, and now it appears that they also can easily yield different senses of a polysemous word. We have some subsequent applications of these ideas to other previously studied settings, including topic models, creating WordNets for other languages, and understanding the semantic content of fMRI brain measurements. I’ll describe some of them in future posts.

Gradient Descent Learns Linear Dynamical Systems


From text translation to video captioning, learning to map one sequence to another is an increasingly active research area in machine learning. Fueled by the success of recurrent neural networks in their many variants, the field has seen rapid advances over the last few years. Recurrent neural networks are typically trained using some form of stochastic gradient descent combined with backpropagation for computing derivatives. The fact that gradient descent finds a useful set of parameters is by no means obvious. The training objective is typically non-convex. The fact that the model is allowed to maintain state is an additional obstacle that makes training of recurrent neural networks challenging.

In this post, we take a step back to reflect on the mathematics of recurrent neural networks. Interpreting recurrent neural networks as dynamical systems, we will show that stochastic gradient descent successfully learns the parameters of an unknown linear dynamical system even though the training objective is non-convex. Along the way, we’ll discuss several useful concepts from control theory, a field that has studied linear dynamical systems for decades. Investigating stochastic gradient descent for learning linear dynamical systems not only bears out interesting connections between machine learning and control theory, it might also provide a useful stepping stone toward a deeper understanding of recurrent neural networks more broadly.

Linear dynamical systems

We focus on time-invariant single-input single-output systems. For an input sequence of real numbers $x_1,\dots, x_T\in \mathbb{R}$, the system maintains a sequence of hidden states $h_1,\dots, h_T\in \mathbb{R}^n$, and produces a sequence of outputs $y_1,\dots, y_T\in \mathbb{R}$ according to the following rules:

$h_{t+1} = A h_t + B x_t, \qquad y_t = C h_t + D x_t + \xi_t. \qquad (1)$

Here $A,B,C,D$ are linear transformations with compatible dimensions, and $\xi_t$ is Gaussian noise added to the output at each time. In the learning problem, often called system identification in control theory, we observe samples of input-output pairs $((x_1,\dots, x_T),(y_1,\dots y_T))$ and aim to recover the parameters of the underlying linear system.
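
As a warm-up, here is a minimal simulator for such a system (my own sketch, following the equations as written above): it takes parameters $(A,B,C,D)$ and an input sequence and produces noisy outputs.

```python
import numpy as np

def simulate_lds(A, B, C, D, x, noise_std=0.1, seed=0):
    """Simulate h_{t+1} = A h_t + B x_t, y_t = C h_t + D x_t + noise, starting from h = 0."""
    rng = np.random.default_rng(seed)
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        ys.append(float(C @ h + D * x_t + noise_std * rng.normal()))
        h = A @ h + B * x_t
    return np.array(ys)

# toy usage: a stable 3-state system driven by a random on/off input sequence
A = np.array([[0.6, 0.2, 0.0], [0.0, 0.5, 0.1], [0.0, 0.0, 0.4]])
B = np.array([1.0, 0.0, 0.0]); C = np.array([1.0, 0.5, 0.2]); D = 0.1
x = np.random.default_rng(1).integers(0, 2, size=200).astype(float)
y = simulate_lds(A, B, C, D, x)
```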

Although control theory provides a rich set of techniques for identifying and manipulating linear systems, maximum likelihood estimation with stochastic gradient descent remains a popular heuristic.

We denote by $\Theta = (A,B,C,D)$ the parameters of the true system. We parametrize our model with $\widehat{\Theta} = (\hat{A},\hat{B},\hat{C},\hat{D})$, and the trained model maintains hidden states $\hat{h}_t$ and outputs $\hat{y}_t$ exactly as in equation (1). For a given example $(x,y) = ((x_1,\dots,x_T), (y_1,\dots, y_T))$, the log-likelihood of the model $\widehat{\Theta}$ is, up to scaling and additive constants, the negative squared prediction error $-\sum_{t=1}^{T} (\hat{y}_t - y_t)^2$. The population risk is the expectation of this prediction error over the data distribution (and the noise),

$f(\widehat{\Theta}) = \mathbb{E}\Big[\frac{1}{T}\sum_{t=1}^{T} (\hat{y}_t - y_t)^2\Big].$

Stochastic gradients of the population risk can be computed in time $O(Tn)$ via back-propagation given random samples. We can therefore directly minimize the population risk using stochastic gradient descent. The question is just whether the algorithm actually converges. Even though the state transformations are linear, the objective function we defined is not convex. Luckily, we will see that the objective is still close enough to convex for stochastic gradient descent to make steady progress towards the global minimum.

Hair dryers and quasi-convex functions

Before we go into the math, let’s illustrate the algorithm with a pressing example that we all run into every morning: hair drying. Imagine you have a hair dryer with a low temperature setting and a high temperature setting. Neither setting is ideal. So every morning you switch between the settings frantically in an attempt to modulate to the ideal temperature. Measuring the resulting temperature (red line below) as a function of the input setting (green dots below), the picture you’ll see is something like this:

You can see that the output temperature is related to the inputs. If you set the temperature to high for long enough, you’ll eventually get a high output temperature. But the system has state: briefly lowering the temperature has little effect on the outputs. Intuition suggests that these kinds of effects should be captured by a system with two or three hidden states. So, let’s see how SGD would go about finding the parameters of the system. We’ll initialize a system with three hidden states such that before training its predictions are just the inputs of the system. We then run SGD with a fixed learning rate on the same sequence for 400 steps.

(Animation: the blue line shows the model’s predictions over the course of the 400 gradient updates.)

Evidently, gradient descent converges just fine on this example. Let’s look at the hair dryer objective function along the line segment between two random points in the domain.

The function is clearly not convex, but it doesn’t look too bad either. In particular, from the picture, it could be that the objective function is quasi-convex:

Definition: For $\tau > 0$, a function $f(\theta)$ is $\tau$-quasi-convex with respect to a global minimum $\theta ^ * $ if for every $\theta$,

$\langle \nabla f(\theta),\ \theta - \theta ^ * \rangle \ \ge\ \tau\,\big(f(\theta) - f(\theta ^ * )\big).$

Intuitively, quasi-convexity states that the descent direction $-\nabla f(\theta)$ is positively correlated with the ideal moving direction $\theta^* -\theta$. This implies that the potential function $\left|\theta-\theta ^ * \right|^2$ decreases in expectation at each step of stochastic gradient descent. This observation plugs nicely into the standard SGD analysis, leading to the following result:

Proposition: (informal) Suppose the population risk $f(\theta)$ is $\tau$-quasi-convex, then stochastic gradient descent (with fresh samples at each iteration and proper learning rate) converges to a point $\theta_K$ in $K$ iterations with error bounded by $ f(\theta_K) - f(\theta^*) \leq O(1/(\tau \sqrt{K}))$.

The key challenge for us is to understand under what conditions we can prove that the population risk objective is in fact quasi-convex. This requires some background.

Control theory, polynomial roots, and Pac-Man

A linear dynamical system $(A,B,C,D)$ is equivalent to the system $(TAT^{-1}, TB, CT^{-1}, D)$ for any invertible matrix $T$, in terms of the behavior of the outputs. A little thought therefore shows that in its unrestricted parameterization the objective function cannot have a unique optimum. A common way of removing this redundancy is to impose a canonical form. Almost all non-degenerate systems admit the controllable canonical form, in which the state-transition matrix $A$ is determined by a coefficient vector $a = (a_1,\dots,a_n)$ appearing in its last row.

We will also parametrize our training model using this form. One of its nice properties is that the coefficients of the characteristic polynomial of the state-transition matrix $A$ can be read off from the last row of $A$. That is, the characteristic polynomial is $p_a(z) = z^n + a_1 z^{n-1} + \dots + a_n$.

Even in controllable canonical form, it still seems rather difficult to learn arbitrary linear dynamical systems. A natural restriction would be stability, that is, to require that the eigenvalues of $A$ are all bounded by $1$. Equivalently, the roots of the characteristic polynomial should all be contained in the complex unit disc. Without stability, the state of the system could blow up exponentially, making robust learning difficult. But the set of all stable systems forms a non-convex domain. It seems daunting to guarantee that stochastic gradient descent would converge from an arbitrary starting point in this domain without ever leaving the domain.

We will therefore impose a stronger restriction on the roots of the characteristic polynomial. We call this the Pac-Man condition. You can think of it as a strengthening of stability.

Pac-Man condition: A linear dynamical system in controllable canonical form satisfies the Pac-Man condition if the coefficient vector $a$ defining the state transition matrix satisfies $|\mathrm{Im}\, q_a(z)| \le \mathrm{Re}\, q_a(z)$ for all complex numbers $z$ of modulus $|z| = 1$, where $q_a(z) = p_a(z)/z^n = 1+a_1z^{-1}+\dots + a_nz^{-n}$.

Above, we illustrate this condition for a degree-4 system by plotting the value of $q_a(z)$ in the complex plane for all complex numbers $z$ on the unit circle.

We note that the Pac-Man condition is satisfied by any vector $a$ with $|a|_1\le \sqrt{2}/2$. Moreover, if $a$ is a random Gaussian vector with expected $\ell_2$ norm bounded by $o(1/\sqrt{\log n})$, then it satisfies the Pac-Man condition with probability $1-o(1)$. Roughly speaking, the assumption requires that the roots of the characteristic polynomial $p_a(z)$ be relatively dispersed inside the unit circle.
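
A quick numerical check of the condition as stated above (my own sketch): evaluate $q_a(z)$ on a fine grid of the unit circle and test whether it stays inside the 90-degree wedge around the positive real axis.

```python
import numpy as np

def satisfies_pacman(a, n_grid=10_000):
    """Check |Im q_a(z)| <= Re q_a(z) for z on a fine grid of the unit circle."""
    z = np.exp(1j * np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False))
    q = 1.0 + sum(a_i * z ** (-(i + 1)) for i, a_i in enumerate(a))
    return bool(np.all(np.abs(q.imag) <= q.real))

a = np.array([0.3, -0.2, 0.1, 0.05])   # example: |a|_1 = 0.65 < sqrt(2)/2, so this passes
print(satisfies_pacman(a))
```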

The Pac-Man condition has three important implications:

  1. It implies via Rouché’s theorem that the spectral radius of $A$ is smaller than 1 and therefore ensures stability of the system.

  2. The vectors satisfying it form a convex set in $\mathbb{R}^n$.

  3. Finally, it ensures that the objective function is quasi-convex.

Main result

Relying on the Pac-Man condition, we can show:

Main theorem (Hardt, Ma, Recht, 2016): Under the Pac-Man condition, the projected gradient descent algorithm, given $N$ sample sequences of length $T$, returns parameters $\widehat{\Theta}$ whose population risk approaches the optimum at a rate governed by the product $NT$ (the precise bound, polynomial in the remaining problem parameters, is in the paper).

The theorem sorts out the right dependence on $N$ and $T$. Even if there is only one sequence, we can learn the system provided that the sequence is long enough. Similarly, even if sequences are really short, we can learn provided that there are enough sequences.

Quasi-convexity in the frequency domain

To establish quasi-convexity under the Pac-Man condition, we will first develop an explicit formula for the population risk in the frequency domain. In doing so, we assume that $x_1,\dots, x_T$ are pairwise independent with mean 0 and variance 1. We also consider the population risk as $T\rightarrow \infty$ for simplicity in this post.

A simple algebraic manipulation shows that, with infinite sequence length and up to an additive constant coming from the noise, the population risk simplifies to

$(\hat{D} - D)^2 + \sum_{k\ge 0}\big(\hat{C}\hat{A}^{k}B - CA^kB\big)^2.$

The first term, $(\hat D - D)^2$, is convex and appears nowhere else. We can safely ignore it and focus on the remaining sum, which we call the idealized risk.

To deal with the sequence $\hat{C}\hat{A}^kB$, we take its Fourier transform and obtain, for $\lambda \in [0, 2\pi]$,

$\widehat{G}_{\lambda} = \sum_{k\ge 0}\hat{C}\hat{A}^{k}B\, e^{-ik\lambda}.$

Similarly we take the Fourier transform of $CA^kB$, denoted by $G_{\lambda}$. Then by Parseval’s Theorem, we obtain the following alternative representation of the idealized risk:

$\frac{1}{2\pi}\int_{0}^{2\pi}\big|G_{\lambda}-\widehat{G}_{\lambda}\big|^2\, d\lambda.$
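
The frequency-domain objects are easy to compute numerically. The sketch below (my own, using the closed form $\sum_{k\ge 0}\hat C\hat A^kB e^{-ik\lambda} = \hat C(I - e^{-i\lambda}\hat A)^{-1}B$, valid for stable systems) maps out $G_\lambda$ and $\widehat G_\lambda$ on a grid and approximates the idealized risk by a discretized Parseval integral.

```python
import numpy as np

def transfer_function(A, B, C, lambdas):
    """G_lambda = C (I - e^{-i lambda} A)^{-1} B for each frequency lambda."""
    n = A.shape[0]
    return np.array([C @ np.linalg.solve(np.eye(n) - np.exp(-1j * lam) * A, B)
                     for lam in lambdas])

lambdas = np.linspace(0.0, 2.0 * np.pi, 512, endpoint=False)

# toy "true" and "estimated" stable systems
A_true = np.array([[0.5, 0.1], [0.0, 0.3]]); B = np.array([1.0, 0.0]); C_true = np.array([1.0, 0.5])
A_hat  = np.array([[0.4, 0.0], [0.0, 0.2]]);                           C_hat  = np.array([0.8, 0.6])

G     = transfer_function(A_true, B, C_true, lambdas)
G_hat = transfer_function(A_hat,  B, C_hat,  lambdas)
idealized_risk = np.mean(np.abs(G - G_hat) ** 2)   # approximates (1/2pi) * integral over [0, 2pi]
print(idealized_risk)
```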

Mapping out $G_\lambda$ and $\widehat G_\lambda$ for all $\lambda\in [0, 2\pi]$ gives the following picture:

(Animation) Left: target transfer function $G$. Right: the approximation $\widehat G$ over the course of training.

Given this pretty representation of the idealized risk objective, we can finally prove our main lemma.

Lemma: Suppose $\Theta$ satisfies the Pac-Man condition. Then, for every $0\le \lambda\le 2\pi$, the quantity $|G_{\lambda}-\widehat{G}_{\lambda}|^2$, as a function of $\hat{A},\hat{C}$, is quasi-convex in the Pac-Man region.

The lemma reduces to the following simple claim.

Claim: The function $h(\hat{u},\hat{v}) = |\hat{u}/\hat{v} - u/v|^2$ is quasi-convex in the region where $Re(\hat{v}/v) > 0$.

The proof simply involves computing the gradients and checking the conditions for quasi-convexity by elementary algebra. We omit a formal proof, but instead show a plot of the function $h(\hat{u}, \hat{v}) = (\hat{u}/\hat{v}- 1)^2$ over the reals:


To see how the lemma follows from the previous claim, we note that quasi-convexity is preserved under composition with any linear transformation: if $h(z)$ is quasi-convex, then $h(R x)$ is also quasi-convex for any linear map $R$. So, consider the linear map that sends the parameters $(\hat{C}, \hat{a})$ to the pair $(\hat{u}, \hat{v})$ appearing in the claim.

With this linear transformation, our simple claim about a bivariate function extends to show that $(G_{\lambda}-\widehat{G}_{\lambda})^2$ is quasi-convex when $Re(\hat{v}/v) \ge 0$. In particular, when $\hat{a}$ and $a$ both satisfy the Pac-Man condition, $\hat{v}$ and $v$ both reside in the 90-degree wedge, so the angle between them is smaller than 90 degrees. This implies that $Re(\hat{v}/v) > 0$.

Conclusion

We saw conditions under which stochastic gradient descent successfully learns a linear dynamical system. In our paper, we further show that allowing our learned system to have more parameters than the target system makes the problem dramatically easier. In particular, at the expense of slight over-parameterization we can weaken the Pac-Man condition to a mild separation condition on the roots of the characteristic polynomial. This is consistent with empirical observations both in machine learning and control theory that highlight the effectiveness of additional model parameters.

More broadly, we hope that our techniques will be a first stepping stone toward a better theoretical understanding of recurrent neural networks.

The search for biologically plausible neural computation: The conventional approach


Inventors of the original artificial neural networks (NNs) derived their inspiration from biology. However, as artificial NNs progressed, their design was less and less guided by neuroscience facts. Meanwhile, progress in neuroscience has altered our conceptual understanding of neurons. Consequently, we believe that many successful artificial NNs resemble natural NNs only superficially, violating fundamental constraints imposed by biological hardware.

The wide gap between the artificial and natural NN designs raises intriguing questions: What algorithms underlie natural NNs? Can insights from biology help build better artificial NNs?

This is the first of a series of posts aimed at explaining recent progress made by my collaborators and myself towards biologically plausible NNs. Such networks can serve both as models of natural NNs and as general purpose artificial NNs. We have found that respecting biological constraints actually helps development of artificial NNs by guiding design decisions.

In this post, I cover the background material, going back several decades. I sketch a biological neuron, introduce the primary biological constraints, and discuss the conventional approach to deriving artificial NNs. I will show that while the conventional approach generates a reasonable algorithmic model of a single biological neuron, the multi-neuron networks it produces violate biological constraints. In future posts we will see how to fix that.

A Sketch of a Biological Neuron

Here is the minimum biological background needed to understand the rest of the post.

A biological neuron receives signals from multiple neurons, computes their weighted sum and generates a signal transmitted to multiple neurons, Figure 1. Each neuron’s signaling activity is quantified by the firing rate, which is a nonnegative real number that varies over time. Each synapse scales the input from the corresponding upstream neuron onto the receiving neuron by its weight. The receiving neuron sums scaled inputs, i.e. computes the inner product of the upstream activity vector and the synaptic weight vector. The inner product passes through a nonlinearity called the activation function and the output is transmitted to downstream neurons.

Synaptic weights change over time, typically on a slower time scale than neuronal signals. How a weight depends on neuronal signals is specified by so-called learning rules. For example, in the commonly used Hebbian learning rules, a synaptic weight is proportional to the correlation between the activities of the two neurons the synapse connects, i.e. pre- and postsynaptic.

Figure 1: A biological neuron modelled by an online algorithm. Left: A biological neuron receives inputs from the upstream neurons (green) which are scaled by the weights of corresponding synapses (blue). The neuron (black) computes output, $y$, as a function of the weighted input sum. Right: Online algorithm outputs an activation function of the inner product of the synaptic weight vector and an upstream activity vector. Synaptic weights are modified by neuronal activities (dashed line) per learning rules.

Primary Biological Constraints

To determine which algorithmic models in this post are biologically plausible, we can focus on a few key biological constraints.

Biologically plausible algorithms must be formulated in the online (or streaming), rather than offline (or batch), setting. This means that input data are streamed to the algorithm sequentially, one sample at a time, and the corresponding output must be computed before the next input sample arrives. The output communicated to downstream neurons cannot be modified in the future. A neuron cannot store individual past inputs or outputs except in a highly compressed format limited to synaptic weights and a few state variables.

In biologically plausible NNs, learning rules must be local. This means that the synaptic weight update may depend on the activities of only the two neurons a synapse connects, as for example, in Hebbian learning. Activities of other neurons are not physically available to a synapse and therefore including them into learning rules would be biologically implausible. Modern artificial NNs, such as backpropagation-based deep learning networks, rely on nonlocal learning rules.

Our initial focus is on unsupervised learning. This is not a hard constraint, but rather a matter of priority. Whereas humans are clearly capable of supervised learning, most of our learning tasks lack big labeled datasets. On the mechanistic level, most neurons lack a clear supervision signal.

Single-neuron Online Principal Component Analysis (PCA)

In 1982, Oja proposed modeling a neuron by an online PCA algorithm. PCA is a workhorse of data analysis used for dimensionality reduction, denoising, and latent factor discovery. Therefore, Oja’s seminal paper established that biological processes in a neuron can be viewed as the steps of an online algorithm solving a useful computational objective.

Oja’s single-neuron online PCA algorithm works as follows. At each time step, $t$, it receives an input data sample, ${\bf x}_t$, and computes and outputs the corresponding top principal component value, $y_t$:

$y_t = {\bf w}_{t-1}^\top {\bf x}_t. \qquad (1.1)$

Here and below lowercase boldfaced letters designate vectors. Then the algorithm updates the (normalized) feature vector,

${\bf w}_t = {\bf w}_{t-1} + \eta\,\big(y_t\, {\bf x}_t - y_t^2\, {\bf w}_{t-1}\big). \qquad (1.2)$

The feature vector, ${\bf w}$, converges to the top eigenvector of the input covariance if the data are drawn i.i.d. from a stationary distribution.

The steps of the Oja algorithm (1.1-1.2) correspond to the operations of the biological neuron. If the input vector is represented by the activities of the upstream neurons, (1.1) represents the weighted summation of the inputs by the output neuron. If the activation function is linear, the output, $y_t$, is simply the weighted sum. The update (1.2) is a local Hebbian synaptic learning rule: the first term of the update is proportional to the correlation of the pre- and postsynaptic neurons’ activities, and the second term, also local, normalizes the synaptic weight vector.
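
Here is a minimal sketch of Oja's rule, as written in (1.1)-(1.2), running on synthetic data with one dominant direction; the learned weight vector should align with that direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 10, 20_000, 0.005

# synthetic stream: isotropic noise plus a strong component along `top`
top = rng.normal(size=d); top /= np.linalg.norm(top)
X = rng.normal(size=(T, d)) + 3.0 * np.outer(rng.normal(size=T), top)

w = rng.normal(size=d); w /= np.linalg.norm(w)
for x in X:
    y = w @ x                                 # (1.1): neuron output
    w = w + eta * (y * x - (y ** 2) * w)      # (1.2): Hebbian term + normalizing term
print(abs(w @ top))                           # approaches 1 as w aligns with the top direction
```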

A Normative Theory

Next, we would like to build on Oja’s insightful identification of biological processes with the steps of the online PCA algorithm by computing multiple principal components using multi-neuron NNs and including the activation nonlinearity.

Instead of trying to extend the Oja model heuristically, we take a more systematic, so-called normative approach. In this approach, a biological model is viewed as the solution of an optimization problem. Specifically, we postulate an objective function motivated by a computational principle, derive an online algorithm optimizing such objective, and map the steps of the algorithm onto biological processes.

Having such normative theory allows us to navigate through the space of possible algorithmic models in a more efficient and systematic way. Mathematical compactness of objective functions facilitates generating new models and weeding out inconsistent ones. This is similar to the Hamiltonian approach in physics which leverages natural symmetries and safeguards against the violation of the first law of thermodynamics (energy conservation).

Deriving a Single-neuron Online PCA using the Reconstruction Approach

To build a normative theory, we first need to derive Oja’s single-neuron online algorithm by solving an optimization problem. What objective function should we choose for online PCA? Historically, neural activity has often been viewed as representing each data sample, ${\bf x}_t$, by the feature vector, ${\bf w}$, scaled by the output, $y_t$, Figure 2. Such a reconstruction approach is naturally formalized as the minimization of the reconstruction (or coding) error:

$\min_{{\bf w},\, y_1,\dots,y_T}\ \sum_{t=1}^{T}\big| {\bf x}_t - {\bf w}\, y_t \big|^2. \qquad (1.3)$

In the offline setting, optimization problem (1.3) is solved by PCA: the optimum ${\bf w}$ is the eigenvector of input covariance corresponding to the top eigenvalue and the optimum output, $y$, is the first principal component.

Figure 2. PCA represents data samples (circles) by their projections (red) onto the top eigenvector, ${\bf w}$. These projections constitute the top principal component. Objective (1.3) minimizes the reconstruction error (blue).

In the online setting, (1.3) can be solved by alternating minimization, which has been a subject of recent analysis. After the arrival of each data point, ${\bf x}_t$, the algorithm computes the optimum output, $y_t$, while keeping the feature vector, ${\bf w}_{t-1}$, computed at the previous time step, fixed. By using calculus, one finds that the optimum output is given by (1.1). Then, the algorithm minimizes the total reconstruction error with respect to the feature vector while keeping all the outputs fixed. Again resorting to calculus, one finds (1.2).

Thus, the single-neuron online PCA algorithm may be derived using the reconstruction approach. To compute multiple principal components, we need to extend this success to multi-neuron networks.

The Reconstruction Approach Fails for Multi-neuron Networks

Though the reconstruction approach yields a multi-component online PCA algorithm, the corresponding NNs are not biologically plausible.

Extension of the reconstruction error objective from a single output component to multiple ones is straightforward: each scalar, $y_t$, is replaced by a vector, ${\bf y}_t$:

$\min_{{\bf W},\, {\bf y}_1,\dots,{\bf y}_T}\ \sum_{t=1}^{T}\big| {\bf x}_t - {\bf W}\, {\bf y}_t \big|^2. \qquad (1.4)$

Here the matrix ${\bf W}$ comprises column-vectors corresponding to different features. As in the single-neuron case, this objective can be optimized online by alternating minimization. After the arrival of a data sample, ${\bf x}_t$, the feature vectors are kept fixed while the objective (1.4) is minimized with respect to the principal components by iterating the following update until convergence:

${\bf y}_t \leftarrow {\bf y}_t + \eta\,\big({\bf W}_{t-1}^\top {\bf x}_t - {\bf W}_{t-1}^\top {\bf W}_{t-1}\, {\bf y}_t\big). \qquad (1.5)$

Minimizing the total objective with respect to the feature vectors for fixed principal components yields the following update:

${\bf W}_t \leftarrow {\bf W}_{t-1} + \eta\,\big({\bf x}_t - {\bf W}_{t-1}{\bf y}_t\big)\,{\bf y}_t^\top. \qquad (1.6)$

As before, in NN implementations of algorithm (1.5-1.6), feature vectors are represented by synaptic weights and principal components by the activities of output neurons. Then (1.5) can be implemented by a single-layer NN, Figure 3, in which activity dynamics converges faster than the time interval between the arrival of successive data samples.

However, implementing update (1.6) in the single-layer NN architecture, Figure 3, requires nonlocal learning rules making it biologically implausible. Indeed, the last term in (1.6) implies that updating the weight of a synapse requires the knowledge of output activities of all other neurons which are not available to the synapse. Moreover, the matrix of lateral connection weights, $- {\bf W} _{t-1}^\top {\bf W} _{t-1}$, in the last term of (1.5) is computed as a Grammian of feedforward weights, clearly a nonlocal operation. This problem is not limited to PCA and arises in networks of nonlinear neurons as well.

Rather than deriving learning rules from a principled objective, many authors have constructed biologically plausible single-layer networks using local learning rules: Hebbian for the feedforward connections and anti-Hebbian for the lateral ones (meaning there is a minus sign in front of the correlation-based synaptic update, as in the last term of (1.5)). However, in my view, abandoning the normative approach creates more problems than it solves.

Figure 3. The single-layer NN implementation of the multi-neuron online PCA algorithm derived using the reconstruction approach requires nonlocal learning rules.

I have outlined how the conventional reconstruction approach fails to generate biologically plausible multi-neuron networks for online PCA. In the next post, I will introduce an alternative approach that overcomes this limitation. Moreover, this approach suggests a novel view of neural computation leading to many interesting extensions.

(Acknowledgement: I am grateful to Sanjeev Arora for his support and encouragement as well as to Cengiz Pehlevan, Leo Shklovskii, Emily Singer, and Thomas Lin for their comments on the earlier versions.)


Back-propagation, an introduction


Given the sheer number of backpropagation tutorials on the internet, is there really a need for another? One of us (Sanjeev) recently taught backpropagation in an undergrad AI course and couldn’t find any account he was happy with. So here’s our exposition, together with some history and context, as well as a few advanced notions at the end. This article assumes the reader knows the definitions of gradients and neural networks.

What is backpropagation?

It is the basic algorithm in training neural nets, apparently independently rediscovered several times in the 1970-80’s (e.g., see Werbos’ Ph.D. thesis and book, and Rumelhart et al.). Some related ideas existed in control theory in the 1960s. (One reader points out another independent rediscovery, the Baur-Strassen lemma from 1983.)

Backpropagation gives a fast way to compute the sensitivity of the output of a neural network to all of its parameters while keeping the inputs of the network fixed: specifically it computes all partial derivatives ${\partial f}/{\partial w_i}$ where $f$ is the output and $w_i$ is the $i$th parameter. (Here parameters can be edge weights or biases associated with nodes or edges of the network, and the precise details of the node computations —e.g., the precise form of nonlinearity like Sigmoid or RELU— are unimportant.) Doing so gives the gradient $\nabla f$ of $f$ with respect to its network parameters, which allows a gradient descent step in the training: change all parameters simultaneously to move the vector of parameters a small amount in the direction $-\nabla f$.

Note that backpropagation computes the gradient exactly, but properly training neural nets needs many more tricks than just backpropagation. Understanding backpropagation is useful for appreciating some advanced tricks.

The importance of backpropagation derives from its efficiency. Assuming node operations take unit time, the running time is linear, specifically, $O(\text{Network Size}) = O(V + E)$, where $V$ is the number of nodes in the network and $E$ is the number of edges. The only technical ingredient is the chain rule from calculus, but applying it naively would result in quadratic running time, which would be hugely inefficient for networks with millions or even thousands of parameters.

Backpropagation can be efficiently implemented using highly parallel vector operations available in today’s GPUs (Graphical Processing Units), which play an important role in the recent neural nets revolution.

Side Note: Expert readers will recognize that in the standard accounts of neural net training, the actual quantity of interest is the gradient of the training loss, which happens to be a simple function of the network output. But the above phrasing is fully general since one can simply add a new output node to the network that computes the training loss from the old output. Then the quantity of interest is indeed the gradient of this new output with respect to network parameters.

Problem Setup

Backpropagation applies only to acyclic networks with directed edges. (Later we briefly sketch its use on networks with cycles.)

Without loss of generality, acyclic networks can be visualized as being structured in numbered layers, with nodes in the $t+1$th layer getting all their inputs from the outputs of nodes in layers $t$ and earlier. We use $f \in \mathbb{R}$ to denote the output of the network. In all our figures, the input of the network is at the bottom and the output on the top.

We start with a simple claim that reduces the problem of computing the gradient to the problem of computing partial derivatives with respect to the nodes:

Claim 1: To compute the desired gradient with respect to the parameters, it suffices to compute $\partial f/\partial u$ for every node $u$.

Let’s be clear about what $\partial f/\partial u$ means. Suppose we cut off all the incoming edges of the node $u$, and fix/clamp the current values of all network parameters. Now imagine changing $u$ from its current value. This change may affect values of nodes at higher levels that are connected to $u$, and the final output $f$ is one such node. Then $\partial f/\partial u$ denotes the rate at which $f$ will change as we vary $u$. (Aside: Readers familiar with the usual exposition of back-propagation should note that there $f$ is the training error and this $\partial f/\partial u$ turns out to be exactly the “error” propagated back to the node $u$.)

Claim 1 is a direct application of the chain rule; let’s illustrate it for a simple neural net (we address more general networks later). Suppose node $u$ is a weighted sum of the nodes $z_1,\dots, z_n$ (which will be passed through a non-linear activation $\sigma$ afterwards). That is, we have $u = w_1z_1+\dots+w_nz_n$. By the chain rule, we have

$\frac{\partial f}{\partial w_1} = \frac{\partial f}{\partial u}\cdot \frac{\partial u}{\partial w_1} = \frac{\partial f}{\partial u}\cdot z_1.$

Hence, we see that having computed $\partial f/\partial u$ we can compute $\partial f/\partial w_1$, and moreover this can be done locally by the endpoints of the edge where $w_1$ resides.

Multivariate Chain Rule

Towards computing the derivatives with respect to the nodes, we first recall the multivariate Chain rule, which handily describes the relationships between these partial derivatives (depending on the graph structure).

Suppose a variable $f$ is a function of variables $u_1,\dots, u_n$, which in turn depend on the variable $z$. Then the multivariate chain rule says that

$\frac{\partial f}{\partial z} = \sum_{j=1}^{n} \frac{\partial f}{\partial u_j}\cdot \frac{\partial u_j}{\partial z}.$

This is a direct generalization of eqn. (2) and a sub-case of eqn. (11) in this description of chain rule.

This formula is perfectly suitable for our cases. Below is the same example as we used before but with a different focus and numbering of the nodes.

We see that once we have computed the derivatives with respect to all the nodes above the node $z$, we can compute the derivative with respect to $z$ via a weighted sum, where the weights involve the local derivatives ${\partial u_j}/{\partial z}$, which are often easy to compute. This brings us to the question of how we measure running time. For book-keeping, we assume that

Basic assumption: If $u$ is a node at level $t+1$ and $z$ is any node at level $\leq t$ whose output is an input to $u$, then computing $\frac{\partial u}{\partial z}$ takes unit time on our computer.

Naive feedforward algorithm (not efficient!)

It is useful to first point out the naive quadratic time algorithm implied by the chain rule. Most authors skip this trivial version, which we think is analogous to teaching sorting using only quicksort, and skipping over the less efficient bubblesort.

The naive algorithm is to compute $\partial u_i/\partial u_j$ for every pair of nodes where $u_i$ is at a higher level than $u_j$. Of course, among these $V^2$ values (where $V$ is the number of nodes) are also the desired ${\partial f}/{\partial u_i}$ for all $i$ since $f$ is itself the value of the output node.

This computation can be done in feedforward fashion. If such values have been obtained for every $u_j$ on the levels up to and including level $t$, then one can express (by inspecting the multivariate chain rule) the value $\partial u_{\ell}/\partial u_j$ for a node $u_{\ell}$ at level $t+1$ as a weighted combination of the values $\partial u_{i}/\partial u_j$ for each $u_i$ that is a direct input to $u_{\ell}$. This description shows that the amount of computation for a fixed $j$ is proportional to the number of edges $E$. This amount of work happens for all $V$ values of $j$, letting us conclude that the total work in the algorithm is $O(VE)$.

Backpropagation (Linear Time)

The more efficient backpropagation, as the name suggests, computes the partial derivatives in the reverse direction. Messages are passed in one wave backwards from higher number layers to lower number layers. (Some presentations of the algorithm describe it as dynamic programming.)

Messaging protocol: The node $u$ receives a message along each outgoing edge from the node at the other end of that edge. It sums these messages to get a number $S$ (if $u$ is the output of the entire net, then define $S=1$) and then it sends the following message to any node $z$ adjacent to it at a lower level:

$S \cdot \frac{\partial u}{\partial z}.$

Clearly, the amount of work done by each node is proportional to its degree, and thus overall work is the sum of the node degrees. Summing all node degrees counts each edge twice, and thus the overall work is $O(\text{Network Size})$.

To prove correctness, we prove the following:

Main Claim: At each node $z$, the value $S$ is exactly ${\partial f}/{\partial z}$.

Base Case: At the output layer this is true, since ${\partial f}/{\partial f} =1$.

Inductive case: Suppose the claim is true for layers $t+1$ and higher, and suppose $z$ is at layer $t$, with outgoing edges going to some nodes $u_1, u_2, \ldots, u_m$ at levels $t+1$ or higher. By the inductive hypothesis, node $z$ indeed receives $ \frac{\partial f}{\partial u_j}\times \frac{\partial u_j}{\partial z}$ from each $u_j$. Thus by the chain rule,

$S = \sum_{j=1}^{m} \frac{\partial f}{\partial u_j}\cdot \frac{\partial u_j}{\partial z} = \frac{\partial f}{\partial z}.$

This completes the induction and proves the Main Claim.
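
The messaging protocol is short enough to spell out on a toy graph. The sketch below (my own example) computes $f = \sigma(w_1 x + w_2 x)$ and then propagates messages backwards exactly as described: each node's $S$ is the sum of incoming messages, and each message is $S$ times a local partial derivative.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, w1, w2):
    u = w1 * x + w2 * x        # weighted-sum node
    return u, sigmoid(u)       # output node f

def backward(x, w1, w2):
    u, f = forward(x, w1, w2)
    S_f = 1.0                  # base case: df/df = 1
    S_u = S_f * f * (1.0 - f)  # message from f to u: S * d(sigmoid(u))/du
    # messages from u to the leaves below it (one per incoming edge)
    df_dw1 = S_u * x           # du/dw1 = x
    df_dw2 = S_u * x           # du/dw2 = x
    df_dx = S_u * (w1 + w2)    # x feeds u through two edges, so its messages add up
    return df_dw1, df_dw2, df_dx

print(backward(x=0.5, w1=1.0, w2=-2.0))
```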

Auto-differentiation

Since the exposition above used almost no details about the network and the operations that the nodes perform, it extends to every computation that can be organized as an acyclic graph in which each node computes a differentiable function of its incoming neighbors. This observation underlies many auto-differentiation packages such as autograd or tensorflow: they allow computing the gradient of the output of such a computation with respect to the network parameters.

We first observe that Claim 1 continues to hold in this very general setting. This is without loss of generality because we can view the parameters associated to the edges as also sitting on the nodes (actually, leaf nodes). This can be done via a simple transformation to the network; for a single node it is shown in the picture below; and one would need to continue to do this transformation in the rest of the networks feeding into $u_1, u_2,..$ etc from below.

Then, we can use the messaging protocol to compute the derivatives with respect to the nodes, as long as the local partial derivative can be computed efficiently. We note that the algorithm can be implemented in a fairly modular manner: For every node $u$, it suffices to specify (a) how it depends on the incoming nodes, say, $z_1,\dots, z_n$ and (b) how to compute the partial derivative times $S$, that is, $S \cdot \frac{\partial u}{\partial z_j}$.

Extension to vector messages: In fact (b) can be done efficiently in more general settings where we allow the output of each node in the network to be a vector (or even matrix/tensor) instead of only a real number. Here we need to replace $\frac{\partial u}{\partial z_j}\cdot S$ by $\frac{\partial u}{\partial z_j}[S]$, which denotes the result of applying the operator $\frac{\partial u}{\partial z_j}$ to $S$. We note that to be consistent with the convention in the usual exposition of backpropagation, when $y\in \mathbb{R}^{p}$ is a function of $x\in \mathbb{R}^q$, we use $\frac{\partial y}{\partial x}$ to denote the $q\times p$ dimensional matrix with $\partial y_j/\partial x_i$ as the $(i,j)$-th entry. Readers might notice that this is the transpose of the usual Jacobian matrix defined in mathematics. Thus $\frac{\partial y}{\partial x}$ is an operator that maps $\mathbb{R}^p$ to $\mathbb{R}^q$, and one can verify that $S$ has the same dimension as $u$ and $\frac{\partial u}{\partial z_j}[S]$ has the same dimension as $z_j$.

For example, as illustrated below, suppose the node $U\in \mathbb{R}^{d_1\times d_3} $ is the product of two matrices $Z\in \mathbb{R}^{d_1\times d_2}$ and $W\in \mathbb{R}^{d_2\times d_3}$, that is, $U = ZW$. Then $\partial U/\partial Z$ is a linear operator that maps $\mathbb{R}^{d_1\times d_3}$ to $\mathbb{R}^{d_1\times d_2}$, which naively would require a matrix representation of dimension $d_1d_3\times d_1d_2$. However, the computation (b) can be done efficiently because

$\frac{\partial U}{\partial Z}[S] = S\, W^{\top},$

which is just a matrix product and never forms the large Jacobian explicitly.

Such vector operations can also be implemented efficiently using today’s GPUs.
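
The matrix-product rule above is easy to sanity-check numerically. In the sketch below (my own check), the message sent to $Z$ is computed as $SW^\top$ and compared against finite differences of the scalar function $\langle S, ZW\rangle$, whose gradient with respect to $Z$ is exactly the quantity backpropagation needs.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, d3 = 4, 5, 3
Z, W = rng.normal(size=(d1, d2)), rng.normal(size=(d2, d3))
S = rng.normal(size=(d1, d3))             # plays the role of the incoming message df/dU

message_to_Z = S @ W.T                    # the cheap vector operation: dU/dZ [S] = S W^T

# compare with finite differences of f(Z) = <S, Z W>
eps = 1e-6
numeric = np.zeros_like(Z)
for i in range(d1):
    for j in range(d2):
        Zp = Z.copy(); Zp[i, j] += eps
        numeric[i, j] = (np.sum(S * (Zp @ W)) - np.sum(S * (Z @ W))) / eps
print(np.max(np.abs(message_to_Z - numeric)))   # should be tiny
```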

Notable Extensions

1) Allowing weight tying. In many neural architectures, the designer wants to force many network units such as edges or nodes to share the same parameter. For example, in convolutional neural nets, the same filter has to be applied all over the image, which implies reusing the same parameter for a large set of edges between the two layers.

For simplicity, suppose two parameters $a$ and $b$ are supposed to share the same value. This is equivalent to adding a new node $u$ and connecting $u$ to both $a$ and $b$ with the operations $a = u$ and $b=u$. Thus, by the chain rule,

$\frac{\partial f}{\partial u} = \frac{\partial f}{\partial a}\cdot \frac{\partial a}{\partial u} + \frac{\partial f}{\partial b}\cdot \frac{\partial b}{\partial u} = \frac{\partial f}{\partial a} + \frac{\partial f}{\partial b}.$

Hence, equivalently, the gradient with respect to a shared parameter is the sum of the gradients with respect to its individual occurrences.

2) Backpropagation on networks with loops. The above exposition assumed the network is acyclic. Many cutting-edge applications such as machine translation and language understanding use networks with directed loops (e.g., recurrent neural networks). These architectures —all examples of the “differentiable computing” paradigm below—can get complicated and may involve operations on a separate memory as well as mechanisms to shift attention to different parts of data and memory.

Networks with loops are trained using gradient descent as well, using back-propagation through time, which consists of expanding the network through a finite number of time steps into an acyclic graph, with replicated copies of the same network. These replicas share the weights (weight tying!) so the gradient can be computed. In practice an issue may arise with exploding or vanishing gradients which impact convergence. Such issues can be carefully addressed in practice by clipping the gradient or re-parameterization techniques such as long short-term memory.

The fact that the gradient can be computed efficiently for such general networks with loops has motivated neural net models with memory or even data structures (see for example neural Turing machines and differentiable neural computer). Using gradient descent, one can optimize over a family of parameterized networks with loops to find the best one that solves a certain computational task (on the training examples). The limits of these ideas are still being explored.

3) Hessian-vector product in linear time. It is possible to generalize backprop to enable 2nd order optimization in “near-linear” time, not just gradient descent, as shown in recent independent manuscripts of Carmon et al. and Agarwal et al. (NB: Tengyu is a coauthor on this one.). One essential step is to compute the product of the Hessian matrix and a vector, for which Pearlmutter’93 gave an efficient algorithm. Here we show how to do this in $O(\mbox{Network size})$ using the ideas above. We need a slightly stronger version of the back-propagation result than the one in the previous subsection:

Claim (informal): Suppose an acyclic network with $V$ nodes and $E$ edges has output $f$ and leaves $z_1,\dots, z_m$. Then there exists a network of size $O(V+E)$ that has $z_1,\dots, z_m$ as input nodes and $\frac{\partial f}{\partial z_1},\dots, \frac{\partial f}{\partial z_m}$ as output nodes.

The proof of the Claim follows in straightforward fashion from implementing the message passing protocol as an acyclic circuit.

Next we show how to compute $\nabla^2 f(z)\cdot v$ where $v$ is a given fixed vector. Let $g(z)= \langle \nabla f(z),v\rangle$ be a function from $\mathbb{R}^d\rightarrow \mathbb{R}$. Then by the Claim above, $g(z)$ can be computed by a network of size $O(V+E)$. Applying the Claim again to $g(z)$, we obtain that $\nabla g(z)$ can also be computed by a network of size $O(V+E)$.

Note that by construction, $\nabla g(z) = \nabla^2 f(z)\cdot v$. Hence we have computed the Hessian-vector product in time proportional to the network size.
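
In an auto-differentiation package this trick is a one-liner. Here is a sketch using the autograd package mentioned earlier (my own example function; the package's `grad` is applied twice, once to $f$ and once to $z \mapsto \langle\nabla f(z), v\rangle$).

```python
import autograd.numpy as np
from autograd import grad

def f(z):                                  # any smooth scalar function of z
    return np.sum(np.tanh(z) ** 2) + 0.5 * np.sum(z ** 2)

def hessian_vector_product(f, z, v):
    g = grad(f)                                    # z -> grad f(z)
    return grad(lambda z_: np.dot(g(z_), v))(z)    # grad of <grad f(z), v> is H(z) v

z = np.array([0.3, -1.0, 2.0])
v = np.array([1.0, 0.0, -1.0])
print(hessian_vector_product(f, z, v))
```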

##That’s all!

Please write your comments on this exposition and whether it can be improved.

Generative Adversarial Networks (GANs), Some Open Questions


Since the ability to generate “realistic-looking” data may be a step towards understanding its structure and exploiting it, generative models are an important component of unsupervised learning, which has been a frequent theme on this blog. Today’s post is about Generative Adversarial Networks (GANs), introduced in 2014 by Goodfellow et al., which have quickly become a very popular way to train generative models for complicated real-life data. They involve a game-theoretic tussle between a generator player and a discriminator player, which is very attractive and may be useful in other settings.

This post describes GANs and raises some open questions about them. The next post will describe our recent paper addressing these questions.

A generative model $G$ can be seen as taking a random seed $h$ (say, a sample from a multivariate Normal distribution) and converting it into an output string $G(h)$ that “looks” like a real datapoint. Such models are popular in classical statistics but the simpler ones like Gaussian Mixtures or Dirichlet Processes seem insufficient for modeling complicated distributions on natural images or natural language. Generative models are also popular in statistical physics, e.g., Ising models and their cousins. These physics models migrated into machine learning and neuroscience in the 1980s and 1990s, which led to a new generative view of neural nets (e.g., Hinton’s Restricted Boltzmann Machines) which in turn led to multilayer generative models such as stacked denoising autoencoders and variational autoencoders. At their heart, these are nothing but multilayer neural nets that transform the random seed into an output that looks like a realistic image. The primary differences in the model concern details of training. Here is the obligatory set of generated images (source: OpenAI blog)

GANs: The basic framework

GANs also train a deep net $G$ to produce realistic images, but the new and beautiful twist lies in a novel training procedure.

To understand the new twist, let’s first discuss what it could mean for the output to “look” realistic. A classic evaluation for generative models is perplexity: a measure of the amount of probability the model gives to actual images. This requires that the generative model be accompanied by an algorithm that computes the probability density function for the generated distribution (i.e., given any image, it must output an estimate of the probability that the model outputs this image). I might do a future blog post discussing pros and cons of the perplexity measure, but today let’s instead dive straight into GANs, which sidestep the need for perplexity computations.

Idea 1: Since deep nets are good at recognizing images —e.g., distinguishing pictures of people from pictures of cats—why not let a deep net be the judge of the outputs of a generative model?

More concretely, let $P_{real}$ be the distribution over real images, and $P_{synth}$ the one output by the model (i.e., the distribution of $G(h)$ when $h$ is a random seed). We could try to train a discriminator deep net $D$ that maps images to numbers in $[0,1]$ and tries to discriminate between these distributions in the following sense: its expected output $E_{x}[D(x)]$ should be as high as possible when $x$ is drawn from $P_{real}$ and as low as possible when $x$ is drawn from $P_{synth}$. This training can be done with the usual backpropagation. If the two distributions are identical then of course no such deep net can exist, and so the training will end in failure. If on the other hand we are able to train a good discriminator deep net —one whose average output is noticeably different between real and synthetic samples— then this is proof positive that the two distributions are different. (There is an in-between case, whereby the distributions are different but the discriminator net doesn’t detect a difference. This is going to be important in the story in the next post.) A natural next question is whether the ability to train such a discriminator deep net can help us improve the generative model.

Idea 2: If a good discriminator net has been trained, use it to provide “gradient feedback” that improves the generative model.

Let $G$ denote the generator net, which means that samples in $P_{synth}$ are obtained by sampling a random Gaussian seed $h$ and computing $G(h)$. The natural goal for the generator is to make $E_{h}[D(G(h))]$ as high as possible, because that means it is doing better at fooling the discriminator $D$. So if we fix $D$, the natural way to improve $G$ is to pick a few random seeds $h$, and slightly adjust the trainable parameters of $G$ to increase this objective. Note that this gradient computation involves backpropagation through the composed net $D(G(\cdot))$.

Of course, if we let the generator improve itself, it also makes sense to then let the discriminator improve itself too, which leads to:

Idea 3: Turn the training of the generative model into a game of many moves or alternations.

Each move for the discriminator consists of taking a few samples from $P_{real}$ and $P_{synth}$ and improving its ability to discriminate between them. Each move for the generator consists of producing a few samples from $P_{synth}$ and updating its parameters so that $E_{h}[D(G(h))]$ goes up a bit.

Notice, the discriminator always uses the generator as a black box —i.e., never examines its internal parameters —whereas the generator needs the discriminator’s parameters to compute its gradient direction. Also, the generator does not ever use real images from $P_{real}$ for its computation. (Though of course it does rely on the real images indirectly since the discriminator is trained using them.)
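To make the alternation concrete, here is a minimal sketch of the two moves in PyTorch. It uses the plain difference objective (the $f(x)=x$ case discussed in the next section); the architectures, seed dimension, and optimizer settings are placeholder choices for illustration, not the recipe of any particular paper.

```python
import torch
import torch.nn as nn

# Placeholder generator and discriminator; any architectures with these
# input/output shapes would do.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784))
D = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)

def training_step(real_batch):
    batch_size = real_batch.shape[0]

    # Discriminator move: push D(x) up on real samples, down on synthetic ones.
    h = torch.randn(batch_size, 64)
    synth_batch = G(h).detach()            # D uses G as a black box
    d_objective = D(real_batch).mean() - D(synth_batch).mean()
    opt_D.zero_grad()
    (-d_objective).backward()              # gradient ascent on the objective
    opt_D.step()

    # Generator move: increase E_h[D(G(h))], backpropagating through D(G(.)).
    h = torch.randn(batch_size, 64)
    g_objective = D(G(h)).mean()
    opt_G.zero_grad()
    (-g_objective).backward()
    opt_G.step()
```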

GANs: More details

One can fill in the above framework in multiple ways. The most obvious is that the generator could try to maximize $E_{h}[f(D(G(h)))]$ where $f$ is some increasing function. (We call this the measuring function.) This has the effect of giving different importance to different samples. Goodfellow et al. originally used $f(x)=\log (x)$, which, since the derivative of $\log x$ is $1/x$, implicitly gives much more importance to synthetic data $G(h)$ where the discriminator outputs very low values $D(G(h))$. In other words, using $f(x) =\log x$ makes the training more sensitive to instances which the discriminator finds terrible than to instances which the discriminator finds so-so. By contrast, the above sketch implicitly used $f(x) =x$, which gives the same importance to all samples and appears in the recent Wasserstein GAN.

The discussion thus leads to the following mathematical formulation, where $D, G$ are deep nets with specified architecture and whose number of parameters is fixed in advance by the algorithm designer.
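In symbols, with the measuring function $f$ written explicitly, the min-max formulation is

\begin{equation}
\min_{G} \; \max_{D} \;\; E_{x\sim P_{real}}\big[f(D(x))\big] \;+\; E_{h}\big[f(1-D(G(h)))\big]. \qquad (1)
\end{equation}

With $f(x)=\log x$ this is the original objective of Goodfellow et al., and with $f(x)=x$ it is (up to an additive constant) the Wasserstein GAN objective.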

There is now a big industry of improving this basic framework using various architectures and training variations, e.g. (a random sample; possibly missing some important ones): DC-GAN, S-GAN, SR-GAN, INFO-GAN, etc.

Usually, the training is continued until the generator wins, meaning the discriminator’s expected output on samples from $P_{real}$ and $P_{synth}$ becomes the same. But a serious difficulty is that training in practice is oscillatory, and the above objective is observed to go up and down. This is unlike usual deep net training, where training (at least in cases where it works) steadily improves the objective.

GANs: Some open questions

(a) Does an equilibrium exist?

Since GAN training is a 2-person game, the oscillatory behavior mentioned above is not unexpected. Just as a necessary condition for gradient descent to come to a stop is that the current point is a stationary point (i.e., the gradient is zero), the corresponding situation in a 2-person game is an equilibrium: each player’s move happens to be its optimal response to the other’s move. In other words, switching the order of $\min$ and $\max$ in expression (1) doesn’t change the objective. The GAN formulation above needs a so-called pure equilibrium, which may not exist in general. A simple example is the classic rock/paper/scissors game. Regardless of whether one player plays rock, paper, or scissors as a move, the other can counter with a move that beats it. Thus no pure equilibrium exists.

(b) Does an equilibrium exist where the generator wins, i.e. discriminator ends up unable to distinguish the two distributions on finite samples?

(c) Suppose the generator wins. What does this say about whether or not $P_{real}$ is close to $P_{synth}$ ?

Question (c) has dogged GANs research from the start. Has the generative model actually learned something meaningful about real-life images, or is it somehow memorizing existing images and presenting trivial modifications? (Recall that $G$ is never exposed directly to real images, so any “memorizing” has to happen via the gradient propagated through the discriminator.)

If the generator’s win does indeed imply that $P_{real}$ and $P_{synth}$ are close, then we think of the GANs training as generalizing. (This is by analogy to the usual notion of generalization in supervised learning.)

In fact, the next post will show that this issue is indeed more subtle than hitherto recognized. But to complete the backstory I will summarize how this issue has been studied so far.

Past efforts at understanding generalization

The original paper of Goodfellow et al. introduced an analysis of generalization —adopted since by other researchers— that applies when deep nets are trained with “sufficiently high capacity, samples and training time” (to use their phrasing).

For the original objective function with $f(x) =\log x$, if the discriminator is allowed to be any function at all (i.e., not just one computable by a finite-capacity neural net), it can be checked that the optimal choice is $D(x) = P_{real}(x)/(P_{real}(x)+P_{synth}(x))$. Substituting this in the GANs objective, the maximum value achieved by the discriminator turns out to be, up to a linear transformation, the Jensen-Shannon (JS) divergence between $P_{real}$ and $P_{synth}$. Hence if a generator wins the game against this ideal discriminator on a very large number of samples, then $P_{real}$ and $P_{synth}$ are close in JS divergence, and thus the model has learnt the true distribution.
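For completeness, here is the standard calculation from Goodfellow et al.: plugging the optimal $D^*(x) = P_{real}(x)/(P_{real}(x)+P_{synth}(x))$ into the $f(x)=\log x$ objective gives

\begin{equation}
E_{x\sim P_{real}}[\log D^*(x)] \;+\; E_{x\sim P_{synth}}[\log(1-D^*(x))] \;=\; 2\,\mathrm{JS}(P_{real}\,\|\,P_{synth}) - \log 4,
\end{equation}

which the generator can drive down to its minimum value $-\log 4$ exactly when $P_{synth}=P_{real}$.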

A similar analysis for Wasserstein GANs shows that if the generator wins using the Wasserstein objective (i.e., $f(x) =x$) then the two distributions are close in Wasserstein or earth-mover distance.

But we will see in the next post that these analyses can be misleading because in practice, deep nets have (very) finite capacity and sample size. Thus even if training produces the optimal discriminator, the above analyses can be very far off.

Further resources

OpenAI has a brief survey of recent approaches to generative models. The inFERENCe blog has many articles on GANs.

Goodfellow’s survey is the most authoritative account of this burgeoning field, and gives tons of insight. The text around Figure 22 discusses oscillation and lack of equilibria. He also discusses how GANs trained on a broad spectrum of images seem to get confused and output images that are realistic at the micro level but nonsensical overall; e.g., an animal with a leg coming out of its head. Clearly this field, despite its promise, has many open questions!

Generalization and Equilibrium in Generative Adversarial Networks (GANs)


The previous post described Generative Adversarial Networks (GANs), a technique for training generative models for image distributions (and other complicated distributions) via a 2-party game between a generator deep net and a discriminator deep net. This post describes my new paper with Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. We address some fundamental issues about generalization in GANs that have been debated since the beginning; e.g., in what sense is the learnt distribution close to the target distribution, and what kind of equilibrium exists between generator and discriminator.

The usual analysis of GANs, sketched in the previous post, assumes “sufficiently large number of samples and sufficiently large discriminator nets” to conclude that at the end of training the learnt distribution should be close to the target distribution. Our new analysis, which accounts for the finite capacity of the discriminator net, calls this into question.

Readers looking for new implementation ideas can skip ahead to the section below on our MIX + GAN protocol. It takes other GANs code as a black box and (by adding extra capacity and corresponding training time) often improves the learnt distribution on qualitative and quantitative measures. Our testing suggests that it works well out of the box.

Notation Assume images are represented as vectors in $\Re^d$. Typically $d$ would be $1000$ or much higher. The capacity of the discriminator, namely, the number of trainable parameters, is denoted $n$. The distribution on all real-life images is denoted $P_{real}$. We assume that the number of distinct images in $P_{real}$ —regardless of how one defines “distinct”—is enormous compared to all these parameters.

Recall that the discriminator $D$ is trained to distinguish between samples from $P_{real}$ and samples from the generator’s distribution $P_{synth}$. This can be formalized using different measures (leading to different GANs objectives); for simplicity our exposition here uses the distinguishing probability that appears in the Wasserstein GAN objective:

\begin{equation}
\max_{D \text{ of capacity } n}\; \Big|\, E_{x\sim P_{real}}[D(x)] \;-\; E_{x\sim P_{synth}}[D(x)] \,\Big|. \qquad (1)
\end{equation}

(Readers with a background in theoretical CS and cryptography will be reminded of similar definitions in theory of pseudorandom generators.)

Finite Discriminators have limited power

The following simple fact shows why ignoring the discriminator’s capacity constraint can lead to grossly incorrect intuition. (The constant $C$ is fairly small and explained in the paper.)

Theorem 1 Suppose the discriminator has capacity $n$. Then expression (1) is less than $\epsilon$ when $P_{synth}$ is the following: uniform distribution on a random sample of $C n/\epsilon^2 \log n$ images from $P_{real}$.

Note that this theorem is not a formalization of the usual failure mode discussed in GANs literature, whereby the generator simply memorizes the training images. The theorem still applies if we allow the discriminator to use a large set of held out images from $P_{real}$, which are completely different than images in $P_{synth}$. Or, if the training set of images is much larger than $C n/\epsilon^2 \log n$ images. Furthermore, common measures of diversity/novelty used in research papers (e.g., pick a random image from $P_{synth}$ and check the “distance” to the nearest neighbor among the training set) are not guaranteed to get around the problem raised by Theorem 1.

Since $Cn/\epsilon^2\log n$ is rather small, this theorem says that a finite-capacity discriminator is unable to even enforce that $P_{synth}$ has large diversity, let alone enforce that $P_{synth}\approx P_{real}$. The theorem does not imply that existing GANs do not in practice generate diverse distributions; merely that the current analyses give no reason to believe that they do.

The proof of the Theorem is a standard sampling argument from learning theory: take an $\epsilon$-net in the continuum of all deep nets of capacity $n$ and a fixed architecture, and do a union bound. Please see the paper for details. (Aside: the “$\epsilon^2$” term in Theorem 1 arises from this argument, and is ubiquitous in ML theory.)

Motivated by this theorem, we argue in the paper that the correct way to think about generalization for GANs is not the usual distance functions between distributions such as Jensen-Shannon or Wasserstein, but a new distance we define called Neural net distance. The neural net distance measures the ability of finite-capacity deep nets to distinguish the distributions. It can be small even when the other distances are large (as illustrated in the above theorem).

Corollary: Larger training sets have limited utility

In fact, Theorem 1 has the following rephrasing. Suppose we have a very large training set of images. If the discriminator has capacity $n$, then it suffices to take a subsample of size $C n/\epsilon^2 \log n$ from this training set, and we are guaranteed that GANs training using this subsample is capable of achieving a training objective that is within $\epsilon$ of the best achieved with the full training set. (Any more samples can only improve the training objective by at most $\epsilon$.)

Existence of Equilibrium

Let’s recall the objective used in GANs (for simplicity, we again stick with the Wasserstein GAN):

\begin{equation}
\min_{G}\; \max_{D}\;\; E_{x\sim P_{real}}[D(x)] \;-\; E_{h}[D(G(h))],
\end{equation}

where $G$ is the generator net, and $P_{synth}$ is the distribution of $G(h)$ where $h$ is a random seed. Researchers have noted that this is implicitly a $2$-person game and it may not have an equilibrium; e.g., see the discussion around Figure 22 in Goodfellow’s survey. An equilibrium corresponds to a $G$ and a $D$ such that the pair is still a solution if we switch the order of min and max in the expression above. (That is, $G$ doesn’t have an incentive to switch in response to $D$, and vice versa.) Lack of equilibrium may be a cause of the oscillatory behavior observed in training.

But ideally we wish to show something stronger than mere existence of equilibrium: we wish to exhibit an equilibrium where the generator wins, with the objective above at zero or close to zero (in other words, discriminator is unable to distinguish between the two distributions).

We will prove existence of an $\epsilon$-approximate equilibrium, whereby switching the order of $G, D$ affects the expression by at most $\epsilon$. (That is, $G$ has only limited incentive to switch in response to $D$ and vice versa.) Naively one would imagine that proving such a result involves some insight into the distribution $P_{real}$, but surprisingly none is needed.

Theorem 2 If a generator net of capacity $T$ is able to generate a Gaussian distribution in $\Re^d$, then there exists an $\epsilon$-approximate equilibrium in the game where the generator has capacity $O(n T\log n/\epsilon^2 )$.

Proof sketch: A classical result in nonparametric statistics states that $P_{real}$ can be well-approximated by an infinite mixture of standard Gaussians. Now take a sample of size $O(n\log n/\epsilon^2)$ from this infinite mixture, and let $G$ be a uniform mixture on this finite sample of Gaussians. By an argument similar to Theorem 1, the distribution output by $G$ will be indistinguishable from $P_{real}$ by every deep net of capacity $n$. Finally, fold in this mixture of $O(n\log n/\epsilon^2)$ Gaussians into a single generator by using a small “selector” circuit that selects between these with the correct probability.
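To make the “mixture plus selector” idea concrete, here is a small NumPy sketch of a generator that is a uniform mixture of $N$ Gaussian components; the component means, scales, and $N$ itself are placeholder values (in the proof the components come from approximating $P_{real}$, and the selector is folded into a single net).

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder mixture components; in the proof sketch they would be drawn
# from an infinite Gaussian mixture that approximates P_real.
N, d = 1000, 64
means = rng.normal(size=(N, d))
scales = 0.1 * np.ones(N)

def generator(batch_size):
    """Uniform mixture of N Gaussians, sampled via an explicit 'selector'."""
    idx = rng.integers(0, N, size=batch_size)       # selector picks a component
    eps = rng.normal(size=(batch_size, d))          # standard Gaussian seed
    return means[idx] + scales[idx, None] * eps     # shift/scale into the component

samples = generator(16)   # a (16, 64) array of synthetic points
```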

This theorem only shows existence of a particular equilibrium. What a GAN may actually find in practice using backpropagation is not addressed.

Finally, if we are interested in objectives other than Wasserstein GAN, then a similar proof can show the existence of an $\epsilon$-approximate mixed equilibrium, namely, where the discriminator and generator are themselves small mixtures of deep nets.

Aside: The sampling idea in this proof goes back to Lipton and Young 1994. Similar ideas have also appeared in the study of pseudorandomness (see Trevisan et al. 2009) and model criticism (see Gretton et al. 2012).

MIX + GAN protocol

Our theory shows that using a mixture of (not too many) generators and discriminators guarantees existence of an approximate equilibrium. This suggests that GANs training may be better and more stable if we replace the single generator and discriminator with mixtures of generators and discriminators.

Of course, it is impractical to use very large mixtures, so we propose MIX + GAN: use a mixture of $k$ components, where $k$ is as large as allowed by the size of GPU memory. Namely, train a mixture of $k$ generators $\{G_{u_i}, i\in [k]\}$ and $k$ discriminators $\{D_{v_i}, i\in [k]\}$. All components of the mixture share the same network architecture but have their own trainable parameters. Maintaining a mixture means of course maintaining a weight $w_{u_i}$ for the generator $G_{u_i}$, which corresponds to the probability of selecting the output of $G_{u_i}$. These weights are also updated via backpropagation. This heuristic can be combined with existing methods like DC-GAN, W-GAN etc., giving us new training methods MIX + DC-GAN, MIX + W-GAN etc.

Some other experimental details: store the mixture probabilities in log space and update them with exponentiated gradient. Use an entropy regularizer to prevent collapse of the mixture to a single component. All of these are theoretically justified if $k$ were very large, and are only heuristic when $k$ is as small as $5$.
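A minimal PyTorch sketch of how such trainable mixture weights might be maintained; the component architectures, the entropy coefficient, and the way the mixture objective is written as a weighted sum are illustrative choices, not the exact recipe from the paper.

```python
import torch
import torch.nn as nn

k, seed_dim, img_dim = 5, 64, 784
# Placeholder component generators; in MIX+GAN they would share a DCGAN-style architecture.
generators = nn.ModuleList([nn.Sequential(nn.Linear(seed_dim, img_dim)) for _ in range(k)])
log_w = nn.Parameter(torch.zeros(k))   # mixture weights stored in log space

def mixture_generator_objective(D, batch_size):
    """Expected discriminator score under the mixture, plus an entropy bonus
    that discourages the mixture from collapsing onto a single component."""
    probs = torch.softmax(log_w, dim=0)
    h = torch.randn(batch_size, seed_dim)
    per_component = torch.stack([D(G(h)).mean() for G in generators])  # shape (k,)
    entropy = -(probs * torch.log(probs + 1e-12)).sum()
    return (probs * per_component).sum() + 0.01 * entropy
```

The generator side would ascend this objective, which sends gradients both to the component parameters and (through the softmax) to the mixture weights; the discriminator mixture can be handled symmetrically.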

We show that MIX+GAN can improve performance qualitatively (i.e., the images look better) and also quantitatively using popular measures such as Inception score.


Note that using a mixture increases the capacity of the model by a factor $k$, so it may not be entirely fair to compare the performance of MIX + X with X. On the other hand, in general it is not easy to get substantial performance benefit from increasing deep net capacity (in fact obvious ways of adding capacity that we tried actually reduced performance) whereas here the benefit happens out of the box.

Note that a mixture of generators or discriminators has been used in several recent works (cited in our paper), but we are not aware of any attempts to use a trainable mixture as above.

Take-Away Lessons

Complete understanding of GANs is challenging since we cannot even fully analyse simple backpropagation, let alone backpropagation combined with game-theoretic complications.

We therefore set aside issues of algorithmic convergence and focused on generalization and equilibrium, which concern the maximum value of the objective. Our analysis suggests the following:

(a) Current GANs training uses a finite-capacity deep net to distinguish between the synthetic and real distributions. This training criterion by itself seems insufficient to ensure even good diversity in the synthetic distribution, let alone that it is actually very close to $P_{real}$ (Theorem 1). A first step to fix this would be to focus on ways to ensure higher diversity, which is a necessary step towards ensuring $P_{synth} \approx P_{real}$.

(b) Our results seem to pose a conundrum about the GANs idea which I personally have not been able to resolve. Usually, we believe that adding capacity to the generator allows it to gain representational power to model more fine-grained facts about the world and thus produce more realistic and diverse distributions. The downside to adding capacity is overfitting, which can be mitigated using more training images. Thus one imagines that the ideal configuration is:

Number of training images > Generator capacity > Discriminator capacity.

Theorem 1 suggests that if the discriminator has capacity $n$, then it seems to derive very little benefit (at least in terms of the training objective) from a training set of more than $C (n\log n)/\epsilon^2$ images. Furthermore, there exist equilibria where the generator’s distribution is not too diverse.

So how can we change GANs training so that it ensures that $P_{synth}$ has high diversity? Some possibilities are: (a) Cap the generator capacity to be much below the discriminator capacity. This might work, but I don’t see a mathematical reason why; it certainly flies against the usual intuition that —so long as the training dataset is large enough—more capacity allows generators to produce more realistic images. (b) Hope that high diversity results from some as-yet unknown property of the backpropagation algorithm. (c) Change the GANs setup in some other way.

At the very least our paper suggests that an explanation for good performance in GANs must draw upon some delicate interplay of the power of generator vs discriminator and the backpropagation algorithm. This fact was overlooked in previous analyses which assumed discriminators of infinite capacity.

(I thank Moritz Hardt, Kunal Talwar, and Luca Trevisan for their comments and help with references.)

Unsupervised learning, one notion or many?


Unsupervised learning, as the name suggests, is the science of learning from unlabeled data. A look at the wikipedia page shows that this term has many interpretations:

(Task A) Learning a distribution from samples. (Examples: Gaussian mixtures, topic models, variational autoencoders, ...)

(Task B) Understanding latent structure in the data. This is not the same as (A); for example, principal component analysis, clustering, manifold learning, etc., identify latent structure but don’t learn a distribution per se.

(Task C) Feature learning. Learn a mapping from datapoint $\rightarrow$ feature vector such that classification tasks are easier to carry out on feature vectors rather than datapoints. For example, unsupervised feature learning could help lower the amount of labeled samples needed for learning a classifier, or be useful for domain adaptation.

Task B is often a subcase of Task C, as the intended users of “structure found in data” are humans (scientists) who pore over the representation of the data to gain some intuition about its properties, and these “properties” can often be phrased as a classification task.

This post explains the relationship between Tasks A and C, and why they get mixed up in students’ minds. We hope there is also some food for thought here for experts, namely, our discussion about the fragility of the usual “perplexity” definition of unsupervised learning. It explains why Task A doesn’t in practice lead to a good enough solution for Task C. For example, it has been believed for many years that for deep learning, unsupervised pretraining should help supervised training, but this has been hard to show in practice.

The common theme: high level representations.

If $x$ is a datapoint, each of these methods seeks to map it to a new “high level” representation $h$ that captures its “essence.” This is why it helps to have access to $h$ when performing machine learning tasks on $x$ (e.g. classification). The difficulty of course is that “high-level representation” is not uniquely defined. For example, $x$ may be an image, and $h$ may contain the information that it contains a person and a dog. But another $h$ may say that it shows a poodle and a person wearing pyjamas standing on the beach. This nonuniqueness seems inherent.

Unsupervised learning tries to learn high-level representations using unlabeled data. Each method makes an implicit assumption about how the hidden $h$ relates to the visible $x$. For example, in k-means clustering the hidden $h$ consists of labeling the datapoint with the index of the cluster it belongs to. Clearly, such a simple clustering-based representation has rather limited expressive power since it groups datapoints into disjoint classes: this limits its application to complicated settings. For example, if one clusters images according to the labels “human”, “animal”, “plant”, etc., then which cluster should contain an image showing a man and a dog standing in front of a tree?

The search for a descriptive language for talking about the possible relationships of representations and data leads us naturally to Bayesian models. (Note that these are viewed with some skepticism in machine learning theory – compared to assumptionless models like PAC learning, online learning, etc. – but we do not know of another suitable vocabulary in this setting.)

A Bayesian view

Bayesian approaches capture the relationship between the “high level” representation $h$ and the datapoint $x$ by postulating a joint distribution $p_{\theta}(x, h)$ of the data $x$ and representation $h$, such that $p_{\theta}(h)$ and the conditional $p_{\theta}(x \mid h)$ have a simple form as a function of the parameters $\theta$. These are also called latent variable probabilistic models, since $h$ is a latent (hidden) variable.
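A concrete example, just the simplest instance of such a model: in a Gaussian mixture model the latent $h$ is a cluster index and

\begin{equation}
h \sim \mathrm{Categorical}(\pi), \qquad x \mid h \sim \mathcal{N}(\mu_h, \Sigma_h), \qquad \theta = (\pi, \{\mu_h, \Sigma_h\}).
\end{equation}

Here $p_{\theta}(h)$ and $p_{\theta}(x \mid h)$ have simple closed forms, even though the marginal $p_{\theta}(x)$ is already a sum over components.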

The standard goal in distribution learning is to find the $\theta$ that “best explains” the data (what we called Task (A) above). This is formalized using maximum-likelihood estimation, going back to Fisher (~1910-1920): find the $\theta$ that maximizes the log probability of the training data. Mathematically, indexing the samples with $t$, we can write this as

\begin{equation}
\max_{\theta} \; \sum_{t} \log p_{\theta}(x_t) \qquad (1)
\end{equation}

where

\begin{equation}
p_{\theta}(x_t) = \sum_{h} p_{\theta}(x_t, h)
\end{equation}

(the sum becomes an integral when $h$ is continuous).

(Note that $\sum_{t} \log p_{\theta}(x_t)$ is also the empirical estimate of the cross-entropy $E_{x}[\log p_{\theta}(x)]$ of the distribution $p_{\theta}$, where $x$ is distributed according to $p^*$, the true distribution of the data. Thus the above method looks for the distribution with best cross-entropy on the empirical data, which is also log of the perplexity of $p_{\theta}$.)

In the limit of $t \to \infty$, this estimator is consistent (converges in probability to the ground-truth value) and efficient (has lowest asymptotic mean-square-error among all consistent estimators). See the Wikipedia page. (Aside: maximum likelihood estimation is often NP-hard, which is one of the reasons for the renaissance of the method-of-moments and tensor decomposition algorithms in learning latent variable models, which Rong wrote about some time ago.)

Toward task C: Representations arise from the posterior distribution

Simply learning the distribution $p_{\theta}(x, h)$ does not yield a representation per se. To get a representation of a datapoint $x$, we need access to the posterior $p_{\theta}(h \mid x)$: a sample from this posterior can then be used as a “representation” of $x$. (Aside: sometimes, in settings where $p_{\theta}(h \mid x)$ has a simple description, this description itself can be viewed as the representation of $x$.)

Thus solving Task C requires learning distribution parameters $\theta$ and figuring out how to efficiently sample from the posterior distribution.

Note that the sampling problems for the posterior can be #P-hard for very simple families. The reason is that by Bayes law, $p_{\theta}(h \mid x) = \frac{p_{\theta}(h) p_{\theta}(x \mid h)}{p_{\theta}(x)}$. Even if the numerator is easy to calculate, as is the case for simple families, the denominator $p_{\theta}(x)$ involves a big summation (or integral) and is often hard to calculate.

Note that the max-likelihood parameter estimation (Task A) and approximating the posterior distributions $p(h \mid x)$ (Task C) can have radically different complexities: sometimes A is easy but C is NP-hard (example: topic modeling with “nice” topic-word matrices, but short documents; see also Bresler 2015), or vice versa (example: topic modeling with long documents, but worst-case chosen topic matrices; Arora et al. 2011).

Of course, one may hope (as usual) that computational complexity is a worst-case notion and may not apply in practice. But there is a bigger issue with this setup, having to do with accuracy.

Why the above reasoning is fragile: Need for high accuracy

The above description assumes that the parametric model $p_{\theta}(x, h)$ for the data is exact, whereas one imagines it is only approximate (i.e., suffers from modeling error). Furthermore, computational difficulties may restrict us to approximately correct inference even if the model were exact. So in practice, we may only have an approximation $q(h\mid x)$ to the posterior distribution $p_{\theta}(h \mid x)$. (Below we describe a popular method to compute such approximations.)

How good of an approximation to the true posterior do we need?

Recall, we are trying to answer this question through the lens of Task C, solving some classification task. We take the following point of view:

For $t=1, 2,\ldots,$ nature picked some $(h_t, x_t)$ from the joint distribution and presented us with $x_t$. The true label $y_t$ of $x_t$ is $\mathcal{C}(h_t)$ where $\mathcal{C}$ is an unknown classifier. Our goal is to classify according to these labels.

To simplify notation, assume the output of $\mathcal{C}$ is binary. If we wish to use $q(h \mid x)$ as a surrogate for the true posterior $p_{\theta}(h \mid x)$, we need $\Pr_{x_t, h_t \sim q(\cdot \mid x_t)} [\mathcal{C}(h_t) \neq y_t]$ to be small as well.

How close must $q(h \mid x)$ and $p(h \mid x)$ be to let us conclude this? We will use KL divergence as “distance” between the distributions, for reasons that will become apparent in the following section. We claim the following:

CLAIM: The probability of obtaining different answers on classification tasks done using the ground truth $h$ versus the representations obtained using $q(h_t \mid x_t)$ is less than $\epsilon$ if $KL(q(h_t \mid x_t) \parallel p(h_t \mid x_t)) \leq 2\epsilon^2.$

Here’s a proof sketch. The natural distance between these two distributions $q(h \mid x)$ and $p(h \mid x)$ with respect to accuracy of classification tasks is total variation (TV) distance. Indeed, if the TV distance between $q(h\mid x)$ and $p(h \mid x)$ is bounded by $\epsilon$, this implies that for any event $\Omega$,

\begin{equation}
\Big| \Pr_{h \sim q(h\mid x)}[\Omega] \;-\; \Pr_{h \sim p(h\mid x)}[\Omega] \Big| \;\leq\; \epsilon.
\end{equation}

The CLAIM now follows by instantiating this with the event $\Omega = $ “Classifier $\mathcal{C}$ outputs a different answer from $y_t$ given representation $h_t$ for input $x_t$”, and relating TV distance to KL divergence using Pinsker’s inequality, which gives

\begin{equation}
\mathrm{TV}\big(q(h_t \mid x_t),\, p(h_t \mid x_t)\big) \;\leq\; \sqrt{\tfrac{1}{2}\, KL\big(q(h_t \mid x_t) \,\|\, p(h_t \mid x_t)\big)} \;\leq\; \sqrt{\tfrac{1}{2}\cdot 2\epsilon^2} \;=\; \epsilon,
\end{equation}

as we needed. This observation explains why solving Task A in practice does not automatically lead to very useful representations for classification tasks (Task C): the posterior distribution has to be learnt extremely accurately, which probably didn’t happen (either due to model mismatch or computational complexity).

As noted, distribution learning (Task A) via cross-entropy/maximum-likelihood fitting, and representation learning (Task C) via sampling the posterior are fairly distinct. Why do students often conflate the two? Because in practice the most frequent way to solve Task A does implicitly compute posteriors and thus also solves Task C.

The generic way to learn latent variable models involves variational methods, which can be viewed as a generalization of the famous EM algorithm (Dempster et al. 1977).

Variational methods maintain at all times a proposed distribution $q(h \mid x)$ (called the variational distribution). The methods rely on the observation that for every such $q(h \mid x)$ the following lower bound holds \begin{equation} \log p(x) \geq E_{q(h \mid x)} \log p(x,h) + H(q(h\mid x)) \qquad (2). \end{equation} where $H$ denotes Shannon entropy (or differential entropy, depending on whether $h$ is discrete or continuous). The RHS above is often called the ELBO (evidence lower bound). This inequality follows from a bit of algebra using non-negativity of KL divergence, applied to the distributions $q(h \mid x)$ and $p(h\mid x)$. More concretely, the chain of (in)equalities is

\begin{equation}
\log p(x) \;=\; E_{q(h \mid x)} \log p(x,h) + H(q(h\mid x)) + KL\big(q(h\mid x) \,\|\, p(h\mid x)\big) \;\geq\; E_{q(h \mid x)} \log p(x,h) + H(q(h\mid x)).
\end{equation}

Furthermore, equality is achieved if $q(h\mid x) = p(h\mid x)$. (This can be viewed as some kind of “duality” theorem for distributions, and dates all the way back to Gibbs.)

Algorithmically, observation (2) is used by forgoing the maximum-likelihood optimization (1) and instead solving

\begin{equation}
\max_{\theta,\; \{q(h_t \mid x_t)\}} \;\; \sum_{t} \Big[ E_{q(h_t \mid x_t)} \log p_{\theta}(x_t, h_t) \;+\; H\big(q(h_t \mid x_t)\big) \Big].
\end{equation}

Since the variables are naturally divided into two blocks: the model parameters $\theta$, and the variational distributions $q(h_t\mid x_t)$, a natural way to optimize the above is to alternate optimizing over each group, while keeping the other fixed. (This meta-algorithm is often called variational EM for obvious reasons.)
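As a toy illustration of this alternation (not from the post itself), here is variational EM for a mixture of two 1-D Gaussians, a case where the optimal variational distribution in each E-step is the exact posterior, so the procedure reduces to classic EM; the data, initialization, and fixed unit variances are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: a mixture of two 1-D Gaussians with unit variance.
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

mu = np.array([-1.0, 1.0])   # component means (part of theta)
pi = np.array([0.5, 0.5])    # mixing weights (part of theta)

for _ in range(50):
    # "E-step": the optimal q(h | x_t) here is the exact posterior over the cluster index.
    log_lik = -0.5 * (x[:, None] - mu[None, :]) ** 2 + np.log(pi)[None, :]
    q = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)

    # "M-step": maximize the objective over theta with q held fixed.
    nk = q.sum(axis=0)
    mu = (q * x[:, None]).sum(axis=0) / nk
    pi = nk / len(x)

print(mu, pi)   # the means should approach -2 and 3
```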

Of course, optimizing over all possible distributions $q$ is an ill-defined problem, so $q$ is constrained to lie in some parametric family (e.g., “standard Gaussian transformed by depth-$4$ neural nets of a certain size and architecture”) such that the above objective can be evaluated easily (typically it has a closed-form expression).

Clearly if the parametric family of distributions is expressive enough, and the (non-convex) optimization problem doesn’t get stuck in bad local minima, then variational EM algorithm will give us not only values of the parameters $\theta$ which are close to the ground-truth ones, but also variational distributions $q(h\mid x)$ which accurately track $p(h\mid x)$. But as we saw above, this accuracy would need to be very high to get meaningful representations.

Next Post

In the next post, we will describe our recent work further clarifying this issue of representation learning via a Bayesian viewpoint.

Do GANs actually do distribution learning?


This post is about our new paper, which presents empirical evidence that current GANs (Generative Adversarial Nets) are quite far from learning the target distribution. Previous posts introduced GANs and described the new theoretical analysis of GANs from our ICML17 paper. One notable implication of our theoretical analysis was that when the discriminator size is bounded, GANs training could appear to succeed (i.e., the training objective reaches its optimum value) even if the generated distribution is discrete and has very low support —in other words, the training objective is unable to prevent even extreme mode collapse.

That paper led us (especially Sanjeev) into spirited discussions with colleagues, who wondered if this is just a theoretical result about potential misbehavior rather than a prediction about real-life training. After all, we’ve all seen the great pictures that GANs produce in real life, right? (Note that the theoretical result only describes a possible near-equilibrium that can arise with a certain mix of hyperparameters, and conceivably real-life training avoids that by suitable hyperparameter tuning.)

Our new empirical paper Do GANs actually learn the distribution? An empirical study puts the issue to the test. We present empirical evidence that well-known GANs approaches do end up learning distributions of fairly low support, and thus presumably are not learning the target distribution.

Let’s start by imagining how large the support must be for the target distribution. For example, if the distribution is the set of all possible images of human faces (real or imagined), then these must involve all combinations of hair color/style, facial features, complexion, expression, pose, lighting, race, etc., and thus the possible set of images of faces that humans will consider to be distinct approaches infinity. (After all, there are billions of distinct people living on earth right now.) GANs are trying to learn this full distribution using a finite sample of images, say CelebA which has $200,000$ images of celebrity faces.

Thus a simple sanity check for whether a GAN has truly come close to learning this distribution is to estimate how many “distinct” images it can produce. At first glance, such an estimation seems very difficult. After all, automated/heuristic measures of image similarity can be easily fooled, and we humans surely don’t have enough time to go through millions or billions of images, right?

Luckily, a crude estimate is possible using the simple birthday paradox, a staple of undergrad discrete math.

Birthday paradox test for size of the support

Imagine for argument’s sake that the human race were limited to a genetic diversity of a million —nature’s laws only allow this many distinct humans. How would this hard limit manifest itself in our day to day life? The birthday paradox says that if we take a random sample of a thousand people —note that most of us get to know this many people easily in our lifetimes—we’d see many doppelgangers. Of course, in practice the only doppelgangers we encounter happen to be identical twins.

Formally, the birthday paradox says that if a discrete distribution has support size $N$, then a random sample of size about $\sqrt{N}$ would be quite likely to contain a duplicate. (The name comes from its implication that if you put $23 \approx \sqrt{365}$ random people in a room, the chance that two of them have the same birthday is about $1/2$.)
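A quick back-of-the-envelope version of this calculation: if the distribution is roughly uniform on $N$ outcomes, a sample of size $s$ avoids duplicates with probability

\begin{equation}
\prod_{i=1}^{s-1}\Big(1-\frac{i}{N}\Big) \;\approx\; e^{-s(s-1)/(2N)},
\end{equation}

which drops below $1/2$ once $s$ is around $1.2\sqrt{N}$ (for $N=365$ this gives $s\approx 23$).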

In the GAN setting, the distribution is continuous, not discrete. Thus our proposed birthday paradox test for GANs is as follows.

(a) Pick a sample of size $s$ from the generated distribution.
(b) Use an automated measure of image similarity to flag the $20$ (say) most similar pairs in the sample.
(c) Visually inspect the flagged pairs and check for images that a human would consider near-duplicates.
(d) Repeat.

If this test reveals that samples of size $s$ have duplicate images with good probability, then suspect that the distribution has support size about $s^2$.
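A rough sketch of steps (a) and (b) of the test; here `samples` is assumed to be a batch drawn from the generator, and plain pixel-space Euclidean distance stands in for whatever similarity measure suits the dataset (see the experimental details below).

```python
import numpy as np

def most_similar_pairs(samples, top_k=20):
    """Return indices of the top_k most similar pairs under Euclidean distance,
    to be inspected visually for near-duplicates."""
    s = samples.reshape(len(samples), -1).astype(np.float64)
    sq = (s ** 2).sum(axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (s @ s.T)   # pairwise squared distances
    iu = np.triu_indices(len(s), k=1)                     # each unordered pair once
    order = np.argsort(dist2[iu])[:top_k]
    return list(zip(iu[0][order], iu[1][order]))

# Hypothetical usage: flagged = most_similar_pairs(sample_from_gan(400)); inspect the
# flagged pairs; if near-duplicates show up, suspect support size around 400**2.
```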

Note that the test is not definitive, because the distribution could assign, say, a probability of $10\%$ to a single image, and be uniform on a huge number of other images. Then the test would be quite likely to find a duplicate even with $20$ samples, even though the true support size is huge. But such nonuniformity (a lot of probability being assigned to a few images) is the only failure mode of the birthday paradox test calculation, and such nonuniformity would itself be considered a failure mode of GANs training. The CIFAR-10 samples below show that such nonuniformity can be severe in practice: the generator produces one particular automobile image with high probability. On CIFAR-10, this failure mode is also observed in the frog and cat classes.

Experimental results.

Our test was done using two datasets, CelebA (faces) and CIFAR-10.

For faces, we found Euclidean distance in pixel space works well as a heuristic similarity measure, probably because the samples are centered and aligned. For CIFAR-10, we pre-train a discriminative Convolutional Neural Net for the full classification problem, and use the top layer representation as an embedding of the image. Heuristic similarity is then measured as the Euclidean distance in the embedding space. Possibly these similarity measures are crude, but note that improving them can only lower our estimate of the support size of the distribution, since a better similarity measure can only increase the number of duplicates found. Thus our estimates below should be considered as upper bounds on the support size of the distribution.

Results on CelebA dataset

We tested the following methods, doing the birthday paradox test with Euclidean distance in pixel space as the heuristic similarity measure.

We find that with probability $\geq 50\%$, a batch of about $400$ samples contains at least one pair of duplicates for both DCGAN and MIX+DCGAN. The figure below gives example duplicates and their nearest neighbors (that we could find) in the training set. These results suggest that the support size of the distribution is less than $400^2\approx 160000$, which is actually lower than the diversity of the training set; however, the distribution is not just memorizing the training set.

ALI (or BiGANs) appear to be somewhat more diverse, in that collisions appear with $50\%$ probability only with a batch size of $1000$, implying a support size of a million. This is $5$x the training set, but still much smaller than the diversity one would expect among human faces. (After all, doppelgangers don’t appear in samples of a few thousand people in real life.) For a fair comparison, we set the discriminator of ALI (or BiGANs) to be roughly the same size as that of the DCGAN model, since the results below suggest that the discriminator size has a strong effect on the diversity of the learnt distribution. Nevertheless, these tests do support the suggestion that the bidirectional structure prevents some of the mode collapses observed in usual GANs.

(Figure: similar face pairs flagged by the birthday paradox test, with nearest neighbors from the training set.)

Diversity vs Discriminator Size

The analysis of Arora et al. suggested that the support size could be as low as near-linear in the capacity of the discriminator; in other words, there is a near-equilibrium in which a distribution of such a small support could suffice to fool the best discriminator. So it is worth investigating whether training in real life allows generator nets to exploit this “loophole” in the training that we now know is in principle available to them.

We built DCGANs with increasingly larger discriminators while fixing the other hyper-parameters. The discriminator used here is a 5-layer Convolutional Neural Network such that the number of output channels of each layer is $1\times,2\times,4\times,8\times\textit{dim}$ where $dim$ is chosen to be $16,32,48,64,80,96,112,128$. Thus the discriminator size should be proportional to $dim^2$. The figure below suggests that in this simple setup the diversity of the learnt distribution does indeed grow near-linearly with the discriminator size. (Note the diversity is seen to plateau, possibly because one needs to change other parameters like depth to meaningfully add more capacity to the discriminator.)

(Figure: estimated diversity of the learnt distribution vs. discriminator size.)

Results for CIFAR-10

On CIFAR-10, as mentioned earlier, we use a heuristic image similarity computed with a convolutional neural net with 3 convolutional layers, 2 fully-connected layers, and a 10-class softmax output, pretrained with a multi-class classification objective. Specifically, the top-layer features are viewed as embeddings for the similarity test using Euclidean distance. We found that this heuristic similarity test quickly becomes useless if the samples display noise artifacts, and thus it was effective only on the very best GANs that generate the most real-looking images. For CIFAR-10 this led us to Stacked GAN, currently believed to be the best generative model on CIFAR-10 (Inception Score $8.59$). Since this model is trained by conditioning on the class label, we measure its diversity within each class separately.

The training set for each class has $10k$ images, but since the generator is allowed to learn from all classes, presumably it can mix and match (especially background, lighting, landscape etc.) between classes and learn a fairly rich set of images.

Now we list the batch sizes needed for duplicates to appear.

(Table: batch sizes needed for duplicates to appear, for each CIFAR-10 class.)

As before, we show duplicate samples as well as their nearest neighbors in the training set (identified by using the heuristic similarity measure to flag possibilities and confirming visually).

(Figure: near-duplicate CIFAR-10 samples and their nearest neighbors in the training set.)

We find that the closest image is quite different from the duplicate detected, which suggests the issue with GANs is indeed lack of diversity (low support size) instead of memorizing training set. (See the paper for more examples.)

Note that by and large the diversity of the learnt distribution is higher than that of the training set, but still not as high as one would expect in terms of all possible combinations.

Birthday paradox test for VAEs

(Figure: collision candidates found among VAE samples.)

Given these findings, it is natural to wonder about the diversity of distributions learned using earlier methods such as Variational Auto-Encoders (VAEs). Instead of using feedback from the discriminator, these methods train the generator net using feedback from an approximate perplexity calculation. Thus the analysis of Arora et al. does not apply as is to such methods and it is conceivable they exhibit higher diversity. However, we found the birthday paradox test difficult to run since samples from a VAE trained on CelebA were not realistic or sharp enough for a human to definitively conclude whether or not two images were almost the same. The figure above shows examples of collision candidates found in batches of 400 samples; clearly some indicative parts (hair, eyes, mouth, etc.) are quite blurry in VAE samples.

Conclusions

Our new birthday paradox test seems to suggest that some well-regarded GANs are currently learning distributions with rather low support (i.e., they suffer mode collapse). The possibility of such a scenario was anticipated in the theoretical analysis of Arora et al. reported in an earlier post.

This combination of theory and empirics raises the open problem of how to change GANs training to avoid such mode collapse. Possibly ALI/BiGANs point in the right direction, since they exhibit somewhat better diversity in our experiments. One should also try tuning hyperparameters and architectures in current methods, now that the birthday paradox test gives a concrete way to quantify mode collapse.

Finally, we should consider the possibility that the best use of GANs and related techniques could be feature learning or some other goal, as opposed to distribution learning. This needs further theoretical and empirical exploration.
