### Monday, March 27th, 2017

A basic – yet very successful – tool for modeling human language has been a new generation of distributed word representations: neural word embeddings. However, beyond word meanings, we need to understand the meanings of larger pieces of text and the relationships between pieces of text, such as questions and answers. Two requirements for that are good ways to understand the structure of human language utterances and good ways to compose their meanings; deep learning methods can help with both tasks. I will then look at methods for understanding the relationships between pieces of text, for tasks such as natural language inference, question answering, and machine translation. A key, still open question raised by recent deep learning work in NLP is to what extent we need explicit language and knowledge representations, versus leaving everything latent in distributed representations. Put most controversially, that is the question of whether a bidirectional LSTM with attention is the answer to all language processing needs.

I'll describe recent work on modeling complex relationships in a neural setting, based primarily on a combination of topic-model-style dictionary learning (for interpretability) and recurrent neural networks (to capture the flow of time). This all ties in to the question of how to learn common-sense knowledge; for this, I'll talk first about understanding how relationships between humans evolve, learned from text alone, and then about how this can be extended to multimodal (image and text) settings. Joint work with many people, especially Snigdha Chaturvedi, Mohit Iyyer, and Jordan Boyd-Graber.

Training neural networks is a difficult non-convex optimization problem with possibly numerous local optima and saddle points. However, empirical evidence suggests that simple gradient-based algorithms are nonetheless effective. In this work, we analyze the properties of stationary points for training one-hidden-layer neural networks with ReLU activation functions, and show that a stationary point implies a global optimum with high probability under some conditions on the neural weights. Moreover, we introduce semi-random units, whose activation pattern is determined by a random projection of the input, and show that networks with these units are guaranteed to converge to a global optimum with high probability.
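The semi-random units mentioned in the abstract can be sketched in a few lines. This is an illustrative reading, not the paper's code: the on/off gating of each hidden unit comes from a fixed random projection of the input, while the trainable weights only scale the output, so the layer is linear in its trainable parameters. All names and shapes below are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 5, 8                      # input dim, number of hidden units (toy sizes)
R = rng.standard_normal((k, d))  # fixed random projections (never trained)
W = rng.standard_normal((k, d))  # trainable weights

def semi_random_layer(x):
    """Forward pass: gate_i = 1[r_i . x > 0], output_i = gate_i * (w_i . x)."""
    gate = (R @ x > 0).astype(float)  # activation pattern from the random projection
    return gate * (W @ x)             # linear in W, unlike a standard ReLU unit

x = rng.standard_normal(d)
h = semi_random_layer(x)
print(h.shape)  # (8,)
```

Because the gating pattern does not depend on `W`, the output is linear in the trainable weights for a fixed input, which is one intuition for why optimization behaves better than with standard ReLU units.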

### Tuesday, March 28th, 2017

In this talk, I will discuss how to learn representations for perception and action without using any manual supervision. First, I am going to discuss how we can learn ConvNets for vision in a completely unsupervised manner using auxiliary tasks. Specifically, I am going to demonstrate how spatial context in images and viewpoint changes in videos can be used to train visual representations. Next, I am going to talk about how we can use a robot to physically explore the world and learn visual representations for classification/recognition tasks. Finally, I am going to talk about how we can perform end-to-end learning for actions using self-supervision.

### Wednesday, March 29th, 2017

In this talk I will focus on deep learning models that can find semantically meaningful representations of words, learn to read documents, and answer questions about their content. First, I will introduce the Gated-Attention (GA) Reader model, which integrates a multi-hop architecture with a novel attention mechanism based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader. This enables the reader to build query-specific representations of tokens in the document for accurate answer selection. Second, I will introduce a two-step learning system for question answering from unstructured text, consisting of a retrieval step and a reading comprehension step. Finally, I will discuss a fine-grained gating mechanism that dynamically combines word-level and character-level representations based on properties of the words. I will show that on several tasks, these models significantly improve upon many existing techniques.

Joint work with Bhuwan Dhingra, Zhilin Yang, Yusuke Watanabe, Hanxiao Liu, Ye Yuan, Junjie Hu, and William W. Cohen.
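The multiplicative gated-attention interaction described above can be sketched as follows. This is an illustrative reduction, not the authors' implementation: each document token attends over query token states, and the resulting per-token query summary gates the token representation by elementwise product. Sizes and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T, Q, h = 6, 3, 4                 # doc length, query length, hidden size (toy)
D = rng.standard_normal((T, h))   # intermediate token states from a document reader
Qs = rng.standard_normal((Q, h))  # query token states

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

scores = D @ Qs.T                 # (T, Q) token-query compatibilities
alpha = softmax(scores, axis=1)   # per-token attention weights over the query
q_tilde = alpha @ Qs              # (T, h) query summary for each document token
gated = D * q_tilde               # multiplicative gating: query-specific tokens

print(gated.shape)  # (6, 4)
```

The elementwise product (rather than addition or concatenation) is what makes the token representations query-specific at every hop, which is the intuition the abstract emphasizes.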

Over the past few decades, various approaches have been introduced for learning probabilistic models, depending on whether the examples are labeled or unlabeled, and whether they are complete or incomplete. In this talk, I will introduce an orthogonal class of machine learning problems, which has not been treated systematically before. In these problems, one has access to Boolean constraints that characterize examples which are known to be impossible (e.g., due to known domain physics). The task is then to learn a tractable probabilistic model over a structured space defined by the constraints. I will describe a new class of arithmetic circuits, the PSDD, for addressing this class of learning problems. The PSDD is based on advances from both machine learning and logical reasoning and can be learned under Boolean constraints. I will also provide a number of results on learning PSDDs. First, I will contrast PSDD learning with approaches that ignore known constraints, showing how it can learn more accurate models. Second, I will show that PSDDs can be utilized to learn, in a domain-independent manner, distributions over combinatorial objects, such as rankings, game traces, and routes on a map. Third, I will show how PSDDs can be learned from a new type of dataset, in which examples are specified using arbitrary Boolean expressions. A number of case studies will be illustrated throughout the talk, including the unsupervised learning of preference rankings and the supervised learning of classifiers for routes and game traces.
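The problem setup can be made concrete with a brute-force toy, which is emphatically not a PSDD: a Boolean constraint carves out the space of possible examples, and the learned distribution is supported only on that space, so impossible states get probability zero by construction. PSDDs make this tractable for large spaces; here we simply enumerate a tiny one. The constraint and data are invented for illustration.

```python
from itertools import product

n = 3
constraint = lambda x: sum(x) == 1   # e.g. "exactly one option is chosen"

# the structured space: all assignments satisfying the constraint
valid = [x for x in product([0, 1], repeat=n) if constraint(x)]

# toy dataset of observed (necessarily valid) examples
data = [(1, 0, 0), (0, 1, 0), (1, 0, 0), (0, 0, 1)]

# maximum-likelihood counts, restricted to the constrained space
counts = {x: 1e-9 for x in valid}    # tiny smoothing over valid states only
for x in data:
    counts[x] += 1
Z = sum(counts.values())
prob = {x: c / Z for x, c in counts.items()}

print(prob[(1, 0, 0)])               # ~0.5: two of the four examples
# impossible states such as (1, 1, 0) are simply not in the support
```

A model trained while ignoring the constraint would smear probability mass over impossible states such as `(1, 1, 0)`; restricting the support is what the abstract means by learning "under" Boolean constraints.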

Tensor methods have emerged as a powerful paradigm for consistent learning of many latent variable models such as topic models, independent component analysis and dictionary learning. Model parameters are estimated via CP decomposition of the observed higher order input moments. We extend tensor decomposition framework to models with invariances, such as convolutional dictionary models. Our tensor decomposition algorithm is based on the popular alternating least squares method, but with additional shift invariance constraints on the factors. We demonstrate that each ALS update can be computed efficiently using simple operations such as fast Fourier transforms and matrix multiplications. Our algorithm converges to models with better reconstruction error and is much faster, compared to the popular alternating minimization heuristic, where the filters and activation maps are alternately updated.
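The efficiency claim above rests on a standard identity: circular convolution in the signal domain is elementwise multiplication in the Fourier domain, which turns an O(n^2) operation into O(n log n). The sketch below (illustrative, not the paper's ALS algorithm) checks that identity for one filter and one activation map.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 16
f = rng.standard_normal(n)   # one convolutional dictionary filter
a = rng.standard_normal(n)   # its activation map

# naive circular convolution: x[t] = sum_s f[s] * a[(t - s) mod n], O(n^2)
naive = np.array([sum(f[s] * a[(t - s) % n] for s in range(n))
                  for t in range(n)])

# FFT route: multiply spectra elementwise, transform back, O(n log n)
fft_conv = np.fft.ifft(np.fft.fft(f) * np.fft.fft(a)).real

print(np.allclose(naive, fft_conv))  # True
```

In the shift-invariant ALS updates described in the abstract, every such convolution (and correlation) can be routed through FFTs, which is why each update reduces to fast Fourier transforms and matrix multiplications.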

We show that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations that depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, and our result thus shows that perturbed gradient descent can escape saddle points almost for free.
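The escape mechanism can be sketched on a toy objective. This is an illustration of the idea only, not the paper's algorithm, step sizes, or perturbation schedule: when the gradient is tiny (so the point may be a saddle), take a small random step; otherwise, descend as usual.

```python
import numpy as np

rng = np.random.default_rng(3)

# f(x, y) = x**4/4 - x**2/2 + y**2/2 has a strict saddle at the origin
# and minima at (+-1, 0). Plain gradient descent started at the saddle
# never moves; an occasional small random perturbation escapes it.

def grad(p):
    x, y = p
    return np.array([x**3 - x, y])

def perturbed_gd(p, eta=0.1, steps=500, g_thresh=1e-3, radius=1e-2):
    for _ in range(steps):
        g = grad(p)
        if np.linalg.norm(g) < g_thresh:
            # tiny gradient: possibly a saddle, so jump to a random
            # point on a small sphere around the current iterate
            u = rng.standard_normal(2)
            p = p + radius * u / np.linalg.norm(u)
        else:
            p = p - eta * g
    return p

p = perturbed_gd(np.zeros(2))
print(p)  # close to (+1, 0) or (-1, 0), i.e. a second-order stationary point
```

Near the saddle the negative-curvature direction amplifies any small displacement geometrically, which is why a perturbation of fixed radius suffices; the paper's contribution is showing the iteration count pays only log factors in the dimension.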

### Thursday, March 30th, 2017

TBD

This paper makes progress on several open theoretical issues related to Generative Adversarial Networks. A definition is provided for what it means for the training to generalize, and it is shown that generalization is not guaranteed for the popular distances between distributions such as Jensen-Shannon or Wasserstein. We introduce a new metric called neural net distance for which generalization does occur. We also show that an approximate pure equilibrium in the 2-player game exists for a natural training objective (Wasserstein). Showing such a result has been an open problem (for any training objective).

Finally, the above theoretical ideas lead us to propose a new training protocol, MIX+GAN, which can be combined with any existing method. We present experiments showing that it stabilizes and improves some existing methods.

Joint work with Rong Ge, Yingyu Liang, Tengyu Ma, Yi Zhang.

In this talk we discuss recent work on learning the single-layer noisy-or network, a textbook example of a Bayes net that is used, for example, in the classic QMR-DT software for diagnosing which disease(s) a patient may have from the symptoms he/she exhibits. These networks are highly non-linear; as a result, previous work on matrix/tensor decomposition cannot be applied directly. In this talk we show that matrix/tensor decomposition techniques can still be adapted to give strong theoretical guarantees even for these non-linear models.
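The noisy-or conditional distribution at the heart of QMR-DT-style models is easy to state: each present disease independently fails to trigger a symptom with some probability, and the symptom may also fire on its own through a small "leak". The parameters below are toy numbers for illustration, not actual medical values.

```python
import numpy as np

p = np.array([[0.8, 0.1],    # p[i, j] = P(disease j alone causes symptom i)
              [0.3, 0.6]])
leak = np.array([0.01, 0.05])  # P(symptom i fires with no disease present)

def symptom_probs(d):
    """P(symptom_i = 1 | diseases d), for a 0/1 disease vector d.

    Noisy-or: the symptom is absent only if the leak AND every present
    disease all independently fail to trigger it.
    """
    fail = (1 - p) ** d                    # per-disease failure factors
    return 1 - (1 - leak) * fail.prod(axis=1)

print(symptom_probs(np.array([1, 0])))     # disease 1 mostly explains symptom 1
```

The product over parents is exactly the non-linearity the abstract refers to: the observed moments are not linear in the parameters, so off-the-shelf matrix/tensor decompositions do not apply without the adaptations the talk describes.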

### Friday, March 31st, 2017

In this talk, I'll present the challenges in today's deep learning approaches to learning representations that are resilient against attacks. I will also explore the question of providing provable guarantees on the generalization of a learned model. As a concrete example, I will present our recent work on using recursion to enable provably perfect generalization in the domain of neural program architectures.

Languages synthesize, borrow, and coin new words. This observation is so uncontroversially robust that it is characterized by empirical laws (Zipf's and Heaps' laws) about the distributions of words and word frequencies, rather than by appeal to any particular linguistic theory. However, the first assumption made in most work on word representation learning and language modeling is that a language's vocabulary is fixed, with the (interesting!) long tail of forms replaced with an out-of-vocabulary token, <unk>. In this talk, I discuss the challenges of modeling the statistical facts of language more accurately, rather than the simplifying caricature of linguistic distributions that receives so much attention in the literature. I discuss existing models that relax the closed-vocabulary assumption, how these perform, and how they might still be improved.
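The closed-vocabulary assumption criticized above is easy to demonstrate on a toy corpus: cap the vocabulary at the K most frequent types and map everything else to `<unk>`. Even at this scale, a sizable fraction of tokens (including the freshly coined word) disappears into the tail. The corpus and cutoff are invented for illustration.

```python
from collections import Counter

corpus = ("the cat sat on the mat the dog sat on the log "
          "a newly coined zibblewort appeared").split()

K = 5                                  # keep only the K most frequent types
freq = Counter(corpus)
vocab = {w for w, _ in freq.most_common(K)}

# the standard preprocessing step: replace the long tail with <unk>
mapped = [w if w in vocab else "<unk>" for w in corpus]
oov_rate = mapped.count("<unk>") / len(mapped)

print(sorted(vocab))
print(round(oov_rate, 3))
```

Open-vocabulary models (e.g. character-level or subword models) avoid this destructive mapping by composing representations for unseen forms instead of collapsing them to a single token.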

A popular machine learning strategy is the transfer of a representation (i.e. a feature extraction function) learned on a source task to a target task. Examples include the re-use of neural network weights or word embeddings. Our work proposes sufficient conditions for the success of this approach. If the representation learned from the source task is fixed, we identify conditions on how the tasks relate to obtain an upper bound on target task risk via a VC dimension-based argument. We then consider using the representation from the source task to construct a prior, which is fine-tuned using target task data. We give a PAC-Bayes target task risk bound in this setting under suitable conditions. We show examples of our bounds using feedforward neural networks. Our results motivate a practical approach to weight transfer, which we validate with experiments.
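The first setting analyzed above (a fixed source representation with a head trained on the target task) can be sketched with entirely synthetic data. This is an illustration of the setup, not the paper's bounds: a toy linear "source representation" `Phi` is frozen, and only a logistic-regression head is fit on target data whose labels are, by construction, realizable over the source features.

```python
import numpy as np

rng = np.random.default_rng(4)
d, h, n = 10, 6, 200

Phi = rng.standard_normal((h, d))           # representation from the source task (frozen)
w_true = rng.standard_normal(h)
X = rng.standard_normal((n, d))
y = (X @ Phi.T @ w_true > 0).astype(float)  # target labels depend only on Phi(x)

def train_head(X, y, steps=2000, eta=0.1):
    """Logistic regression on the frozen features Phi(x); only w is trained."""
    F = X @ Phi.T
    w = np.zeros(h)
    for _ in range(steps):
        z = np.clip(F @ w, -30, 30)         # avoid overflow in exp
        p = 1 / (1 + np.exp(-z))
        w -= eta * F.T @ (p - y) / len(y)   # gradient of the logistic loss
    return w

w = train_head(X, y)
acc = ((X @ Phi.T @ w > 0) == y).mean()
print(acc)
```

When the target task is not realizable over the source features, accuracy degrades, which is the intuition behind requiring conditions on how the two tasks relate; the fine-tuning setting in the abstract would instead initialize (rather than freeze) the representation from the source task.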