IT博客汇 | arXiv Paper Daily: Tue, 23 May 2017

arXiv Paper Daily: Tue, 23 May 2017

我爱机器学习(52ml.net)发表于 2017-05-23 00:00:00

Neural and Evolutionary Computing

Block building programming for symbolic regression

Chen Chen, Changtong Luo, Zonglin Jiang
Subjects: Neural and Evolutionary Computing (cs.NE)

Symbolic regression that aims to detect underlying data-driven model, has
become increasingly important for industrial data analysis when the
experimental model structure is unknown or wrong, or the concerned system has
changed. For most of the existing algorithms for symbolic regression, such as
genetic programming, the convergence speed might be too slow for large scale
problems with a large number of variables. This situation may become even worse
with increasing problem size. The aforementioned difficulty make symbolic
regression limited in practical applications. Fortunately, in many engineering
problems, the independent variables in target models are separable or partially
separable. This feature inspires us to develop a new approach, block building
programming (BBP), in this paper. BBP divides the original target model into
several simple models, and then optimizes them sequentially. Under such
circumstance, BBP can make large reductions to the search space. The partition
of separability is based on a designed method, block and factor detection.
Using two different optimization engines, experiments on a set of symbolic
regression problems with separability are conducted. Numerical results show
that BBP has a good capability of `structure and coefficient optimization’ and
a better computational efficiency.

Improving classification accuracy of feedforward neural networks for spiking neuromorphic chips

Antonio Jimeno Yepes, Jianbin Tang, Benjamin Scott Mashford
Comments: IJCAI-2017. arXiv admin note: text overlap with arXiv:1605.07740
Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)

Deep Neural Networks (DNN) achieve human level performance in many image
analytics tasks but DNNs are mostly deployed to GPU platforms that consume a
considerable amount of power. New hardware platforms using lower precision
arithmetic achieve drastic reductions in power consumption. More recently,
brain-inspired spiking neuromorphic chips have achieved even lower power
consumption, on the order of milliwatts, while still offering real-time
processing.

However, for deploying DNNs to energy efficient neuromorphic chips the
incompatibility between continuous neurons and synaptic weights of traditional
DNNs, discrete spiking neurons and synapses of neuromorphic chips need to be
overcome. Previous work has achieved this by training a network to learn
continuous probabilities, before it is deployed to a neuromorphic architecture,
such as IBM TrueNorth Neurosynaptic System, by random sampling these
probabilities.

The main contribution of this paper is a new learning algorithm that learns a
TrueNorth configuration ready for deployment. We achieve this by training
directly a binary hardware crossbar that accommodates the TrueNorth axon
configuration constrains and we propose a different neuron model.

Results of our approach trained on electroencephalogram (EEG) data show a
significant improvement with previous work (76% vs 86% accuracy) while
maintaining state of the art performance on the MNIST handwritten data set.

Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon

Xin Dong, Shangyu Chen, Sinno Jialin Pan
Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

How to develop slim and accurate deep neural networks has become crucial for
real- world applications, especially for those employed in embedded systems.
Though previous work along this research line has shown some promising results,
most existing methods either fail to significantly compress a well-trained deep
network or require a heavy retraining process for the pruned deep network to
re-boost its prediction performance. In this paper, we propose a new layer-wise
pruning method for deep neural networks. In our proposed method, parameters of
each individual layer are pruned independently based on second order
derivatives of a layer-wise error function with respect to the corresponding
parameters. We prove that the final prediction performance drop after pruning
is bounded by a linear combination of the reconstructed errors caused at each
layer. Therefore, there is a guarantee that one only needs to perform a light
retraining process on the pruned network to resume its original prediction
performance. We conduct extensive experiments on benchmark datasets to
demonstrate the effectiveness of our pruning method compared with several
state-of-the-art baseline methods.

Parallel and in-process compilation of individuals for genetic programming on GPU

Hakan Ayral, Songül Albayrak
Comments: Submitted to PeerJ Computer Science
Subjects: Neural and Evolutionary Computing (cs.NE)

Three approaches to implement genetic programming on GPU hardware are
compilation, interpretation and direct generation of machine code. The compiled
approach is known to have a prohibitive overhead compared to other two. This
paper investigates methods to accelerate compilation of individuals for genetic
programming on GPU hardware. We apply in-process compilation to minimize the
compilation overhead at each generation; and we investigate ways to parallelize
in-process compilation. In-process compilation doesn’t lend itself to trivial
parallelization with threads; we propose a multiprocess parallelization using
memory sharing and operating systems interprocess communication primitives.
With parallelized compilation we achieve further reductions on compilation
overhead. Another contribution of this work is the code framework we built in
C# for the experiments. The framework makes it possible to build arbitrary
grammatical genetic programming experiments that run on GPU with minimal extra
coding effort, and is available as open source.

Diffusion-based neuromodulation can eliminate catastrophic forgetting in simple neural networks

Roby Velez, Jeff Clune
Subjects: Neural and Evolutionary Computing (cs.NE)

A long-term goal of AI is to produce agents that can learn a diversity of
skills throughout their lifetimes and continuously improve those skills via
experience. A longstanding obstacle towards that goal is catastrophic
forgetting, which is when learning new information erases previously learned
information. Catastrophic forgetting occurs in artificial neural networks
(ANNs), which have fueled most recent advances in AI. A recent paper proposed
that catastrophic forgetting in ANNs can be reduced by promoting modularity,
which can limit forgetting by isolating task information to specific clusters
of nodes and connections (functional modules). While the prior work did show
that modular ANNs suffered less from catastrophic forgetting, it was not able
to produce ANNs that possessed task-specific functional modules, thereby
leaving the main theory regarding modularity and forgetting untested. We
introduce diffusion-based neuromodulation, which simulates the release of
diffusing, neuromodulatory chemicals within an ANN that can modulate (i.e. up
or down regulate) learning in a spatial region. On the simple diagnostic
problem from the prior work, diffusion-based neuromodulation 1) induces
task-specific learning in groups of nodes and connections (task-specific
localized learning), which 2) produces functional modules for each subtask, and
3) yields higher performance by eliminating catastrophic forgetting. Overall,
our results suggest that diffusion-based neuromodulation promotes task-specific
localized learning and functional modularity, which can help solve the
challenging, but important problem of catastrophic forgetting.

TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, Hai Li
Comments: 9 pages
Subjects: Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Neural and Evolutionary Computing (cs.NE)

High network communication cost for synchronizing gradients and parameters is
the well-known bottleneck of distributed training. In this work, we propose
TernGrad that uses ternary gradients to accelerate distributed deep learning in
data parallelism. Our approach requires only three numerical levels {-1,0,1}
which can aggressively reduce the communication time. We mathematically prove
the convergence of TernGrad under the assumption of a bound on gradients.
Guided by the bound, we propose layer-wise ternarizing and gradient clipping to
improve its convergence. Our experiments show that applying TernGrad on AlexNet
does not incur any accuracy loss and can even improve accuracy. The accuracy
loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a
performance model is proposed to study the scalability of TernGrad. Experiments
show significant speed gains for various deep neural networks.

An Out-of-the-box Full-network Embedding for Convolutional Neural Networks

Dario Garcia-Gasulla, Armand Vilalta, Ferran Parés, Jonatan Moreno, Eduard Ayguadé, Jesus Labarta, Ulises Cortés, Toyotaro Suzumura
Subjects: Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Transfer learning for feature extraction can be used to exploit deep
representations in contexts where there is very few training data, where there
are limited computational resources, or when tuning the hyper-parameters needed
for training is not an option. While previous contributions to feature
extraction propose embeddings based on a single layer of the network, in this
paper we propose a full-network embedding which successfully integrates
convolutional and fully connected features, coming from all layers of a deep
convolutional neural network. To do so, the embedding normalizes features in
the context of the problem, and discretizes their values to reduce noise and
regularize the embedding space. Significantly, this also reduces the
computational cost of processing the resultant representations. The proposed
method is shown to outperform single layer embeddings on several image
classification tasks, while also being more robust to the choice of the
pre-trained model used for obtaining the initial features. The performance gap
in classification accuracy between thoroughly tuned solutions and the
full-network embedding is also reduced, which makes of the proposed approach a
competitive solution for a large set of applications.

Bayesian Belief Updating of Spatiotemporal Seizure Dynamics

Gerald K Cooray, Richard Rosch, Torsten Baldeweg, Louis Lemieux, Karl Friston, Biswa Sengupta
Subjects: Machine Learning (stat.ML); Neural and Evolutionary Computing (cs.NE); Methodology (stat.ME)

Epileptic seizure activity shows complicated dynamics in both space and time.
To understand the evolution and propagation of seizures spatially extended sets
of data need to be analysed. This requires efficient methods of inference. We
have previously described an efficient filtering scheme using variational
Laplace that can be used in the Dynamic Causal Modelling (DCM) framework to
estimate the temporal dynamics of seizures recorded using either invasive or
non-invasive electrical recordings (EEG/ECoG). In this note, we present a
method to analyse the spatiotemporal dynamics of seizures. Spatiotemporal
dynamics are modelled using a partial differential equation – in contrast to
the ordinary differential equation used in our previous work on temporal
estimation of seizure dynamics (Cooray et al 2015). We provide the requisite
theoretical background for the method and test the ensuing scheme on simulated
seizure activity data and empirical invasive ECoG data. The method provides a
framework to assimilate the spatial and temporal dynamics of seizure activity,
an aspect of great physiological and clinical importance.

Batch Reinforcement Learning on the Industrial Benchmark: First Experiences

Daniel Hein, Steffen Udluft, Michel Tokic, Alexander Hentschel, Thomas A. Runkler, Volkmar Sterzing
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Systems and Control (cs.SY)

The Particle Swarm Optimization Policy (PSO-P) has been recently introduced
and proven to produce remarkable results on interacting with academic
reinforcement learning benchmarks in an off-policy, batch-based setting. To
further investigate the properties and feasibility on real-world applications,
this paper investigates PSO-P on the so-called Industrial Benchmark (IB), a
novel reinforcement learning (RL) benchmark that aims at being realistic by
including a variety of aspects found in industrial applications, like
continuous state and action spaces, a high dimensional, partially observable
state space, delayed effects, and complex stochasticity. The experimental
results of PSO-P on IB are compared to results of closed-form control policies
derived from the model-based Recurrent Control Neural Network (RCNN) and the
model-free Neural Fitted Q-Iteration (NFQ). Experiments show that PSO-P is not
only of interest for academic benchmarks, but also for real-world industrial
applications, since it also yielded the best performing policy in our IB
setting. Compared to other well established RL techniques, PSO-P produced
outstanding results in performance and robustness, requiring only a relatively
low amount of effort in finding adequate parameters or making complex design
decisions.

How to Train Your DRAGAN

Naveen Kodali, Jacob Abernethy, James Hays, Zsolt Kira
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Generative Adversarial Networks have emerged as an effective technique for
estimating data distributions. The basic setup consists of two deep networks
playing against each other in a zero-sum game setting. However, it is not
understood if the networks reach an equilibrium eventually and what dynamics
makes this possible. The current GAN training procedure, which involves
simultaneous gradient descent, lacks a clear game-theoretic justification in
the literature. In this paper, we introduce regret minimization as a technique
to reach equilibrium in games and use this to motivate the use of simultaneous
GD in GANs. In addition, we present a hypothesis that mode collapse, which is a
common occurrence in GAN training, happens due to the existence of spurious
local equilibria in non-convex games. Motivated by these insights, we develop
an algorithm called DRAGAN that is fast, simple to implement and achieves
competitive performance in a stable fashion across different architectures,
datasets (MNIST, CIFAR-10, and CelebA), and divergence measures with almost no
hyperparameter tuning.

Espresso: Efficient Forward Propagation for BCNNs

Fabrizio Pedersoli, George Tzanetakis, Andrea Tagliasacchi
Comments: 10 pages, 4 figures
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

There are many applications scenarios for which the computational performance
and memory footprint of the prediction phase of Deep Neural Networks (DNNs)
needs to be optimized. Binary Neural Networks (BDNNs) have been shown to be an
effective way of achieving this objective. In this paper, we show how
Convolutional Neural Networks (CNNs) can be implemented using binary
representations. Espresso is a compact, yet powerful library written in C/CUDA
that features all the functionalities required for the forward propagation of
CNNs, in a binary file less than 400KB, without any external dependencies.
Although it is mainly designed to take advantage of massive GPU parallelism,
Espresso also provides an equivalent CPU implementation for CNNs. Espresso
provides special convolutional and dense layers for BCNNs, leveraging
bit-packing and bit-wise computations for efficient execution. These techniques
provide a speed-up of matrix-multiplication routines, and at the same time,
reduce memory usage when storing parameters and activations. We experimentally
show that Espresso is significantly faster than existing implementations of
optimized binary neural networks ((approx) 2 orders of magnitude). Espresso is
released under the Apache 2.0 license and is available at
this http URL

Local Information with Feedback Perturbation Suffices for Dictionary Learning in Neural Circuits

Tsung-Han Lin
Comments: 10 pages, 4 figures
Subjects: Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

While the sparse coding principle can successfully model information
processing in sensory neural systems, it remains unclear how learning can be
accomplished under neural architectural constraints. Feasible learning rules
must rely solely on synaptically local information in order to be implemented
on spatially distributed neurons. We describe a neural network with spiking
neurons that can address the aforementioned fundamental challenge and solve the
L1-minimizing dictionary learning problem, representing the first model able to
do so. Our major innovation is to introduce feedback synapses to create a
pathway to turn the seemingly non-local information into local ones. The
resulting network encodes the error signal needed for learning as the change of
network steady states caused by feedback, and operates akin to the classical
stochastic gradient descent method.

Computer Vision and Pattern Recognition

Facial Affect Estimation in the Wild Using Deep Residual and Convolutional Networks

Behzad Hasani, Mohammad H. Mahoor
Comments: To appear in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Automated affective computing in the wild is a challenging task in the field
of computer vision. This paper presents three neural network-based methods
proposed for the task of facial affect estimation submitted to the First
Affect-in-the-Wild challenge. These methods are based on Inception-ResNet
modules redesigned specifically for the task of facial affect estimation. These
methods are: Shallow Inception-ResNet, Deep Inception-ResNet, and
Inception-ResNet with LSTMs. These networks extract facial features in
different scales and simultaneously estimate both the valence and arousal in
each frame. Root Mean Square Error (RMSE) rates of 0.4 and 0.3 are achieved for
the valence and arousal respectively with corresponding Concordance Correlation
Coefficient (CCC) rates of 0.04 and 0.29 using Deep Inception-ResNet method.

Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Networks

Deep Neural Networks (DNNs) have shown to outperform traditional methods in
various visual recognition tasks including Facial Expression Recognition (FER).
In spite of efforts made to improve the accuracy of FER systems using DNN,
existing methods still are not generalizable enough in practical applications.
This paper proposes a 3D Convolutional Neural Network method for FER in videos.
This new network architecture consists of 3D Inception-ResNet layers followed
by an LSTM unit that together extracts the spatial relations within facial
images as well as the temporal relations between different frames in the video.
Facial landmark points are also used as inputs to our network which emphasize
on the importance of facial components rather than the facial regions that may
not contribute significantly to generating facial expressions. Our proposed
method is evaluated using four publicly available databases in
subject-independent and cross-database tasks and outperforms state-of-the-art
methods.

DepthCut: Improved Depth Edge Estimation Using Multiple Unreliable Channels

Paul Guerrero, Holger Winnemöller, Wilmot Li, Niloy J. Mitra
Comments: 12 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In the context of scene understanding, a variety of methods exists to
estimate different information channels from mono or stereo images, including
disparity, depth, and normals. Although several advances have been reported in
the recent years for these tasks, the estimated information is often imprecise
particularly near depth discontinuities or creases. Studies have however shown
that precisely such depth edges carry critical cues for the perception of
shape, and play important roles in tasks like depth-based segmentation or
foreground selection. Unfortunately, the currently extracted channels often
carry conflicting signals, making it difficult for subsequent applications to
effectively use them. In this paper, we focus on the problem of obtaining
high-precision depth edges (i.e., depth contours and creases) by jointly
analyzing such unreliable information channels. We propose DepthCut, a
data-driven fusion of the channels using a convolutional neural network trained
on a large dataset with known depth. The resulting depth edges can be used for
segmentation, decomposing a scene into depth layers with relatively flat depth,
or improving the accuracy of the depth estimate near depth edges by
constraining its gradients to agree with these edges. Quantitatively, we
compare against 15 variants of baselines and demonstrate that our depth edges
result in an improved segmentation performance and an improved depth estimate
near depth edges compared to data-agnostic channel fusion. Qualitatively, we
demonstrate that the depth edges result in superior segmentation and depth
orderings.

Regularizing deep networks using efficient layerwise adversarial training

Swami Sankaranarayanan, Arpit Jain, Rama Chellappa, Ser Nam Lim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG); Machine Learning (stat.ML)

Adversarial training has been shown to regularize deep neural networks in
addition to increasing their robustness to adversarial examples. However, its
impact on very deep state of the art networks has not been fully investigated.
In this paper, we present an efficient approach to perform adversarial training
by perturbing intermediate layer activations and study the use of such
perturbations as a regularizer during training. We use these perturbations to
train very deep models such as ResNets and show improvement in performance both
on adversarial and original test data. Our experiments highlight the benefits
of perturbing intermediate layer activations compared to perturbing only the
inputs. The results on CIFAR-10 and CIFAR-100 datasets show the merits of the
proposed adversarial training approach. Additional results on WideResNets show
that our approach provides significant improvement in classification accuracy
for a given base model, outperforming dropout and other base models of larger
size.

TricorNet: A Hybrid Temporal Convolutional and Recurrent Network for Video Action Segmentation

Li Ding, Chenliang Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Action segmentation as a milestone towards building automatic systems to
understand untrimmed videos has received considerable attention in the recent
years. It is typically being modeled as a sequence labeling problem but
contains intrinsic and sufficient differences than text parsing or speech
processing. In this paper, we introduce a novel hybrid temporal convolutional
and recurrent network (TricorNet), which has an encoder-decoder architecture:
the encoder consists of a hierarchy of temporal convolutional kernels that
capture the local motion changes of different actions; the decoder is a
hierarchy of recurrent neural networks that are able to learn and memorize
long-term action dependencies after the encoding stage. Our model is simple but
extremely effective in terms of video sequence labeling. The experimental
results on three public action segmentation datasets have shown that the
proposed model achieves superior performance over the state of the art.

Robust Localized Multi-view Subspace Clustering

Yanbo Fan, Jian Liang, Ran He, Bao-Gang Hu, Siwei Lyu
Comments: 7 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In multi-view clustering, different views may have different confidence
levels when learning a consensus representation. Existing methods usually
address this by assigning distinctive weights to different views. However, due
to noisy nature of real-world applications, the confidence levels of samples in
the same view may also vary. Thus considering a unified weight for a view may
lead to suboptimal solutions. In this paper, we propose a novel localized
multi-view subspace clustering model that considers the confidence levels of
both views and samples. By assigning weight to each sample under each view
properly, we can obtain a robust consensus representation via fusing the
noiseless structures among views and samples. We further develop a regularizer
on weight parameters based on the convex conjugacy theory, and samples weights
are determined in an adaptive manner. An efficient iterative algorithm is
developed with a convergence guarantee. Experimental results on four benchmarks
demonstrate the correctness and effectiveness of the proposed model.

Convolutional Networks with MuxOut Layers as Multi-rate Systems for Image Upscaling

Pablo Navarrete Michelini, Hanwen Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We interpret convolutional networks as adaptive filters and combine them with
so-called MuxOut layers to efficiently upscale low resolution images. We
formalize this interpretation by deriving a linear and space-variant structure
of a convolutional network when its activations are fixed. We introduce general
purpose algorithms to analyze a network and show its overall filter effect for
each given location. We use this analysis to evaluate two types of image
upscalers: deterministic upscalers that target the recovery of details from
original content; and second, a new generation of upscalers that can sample the
distribution of upscale aliases (images that share the same downscale version)
that look like real content.

Learning to Associate Words and Images Using a Large-scale Graph

Heqing Ya, Haonan Sun, Jeffrey Helt, Tai Sing Lee
Comments: 8 pages, 7 figures, 14th Conference on Computer and Robot Vision 2017
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We develop an approach for unsupervised learning of associations between
co-occurring perceptual events using a large graph. We applied this approach to
successfully solve the image captcha of China’s railroad system. The approach
is based on the principle of suspicious coincidence. In this particular
problem, a user is presented with a deformed picture of a Chinese phrase and
eight low-resolution images. They must quickly select the relevant images in
order to purchase their train tickets. This problem presents several
challenges: (1) the teaching labels for both the Chinese phrases and the images
were not available for supervised learning, (2) no pre-trained deep
convolutional neural networks are available for recognizing these Chinese
phrases or the presented images, and (3) each captcha must be solved within a
few seconds. We collected 2.6 million captchas, with 2.6 million deformed
Chinese phrases and over 21 million images. From these data, we constructed an
association graph, composed of over 6 million vertices, and linked these
vertices based on co-occurrence information and feature similarity between
pairs of images. We then trained a deep convolutional neural network to learn a
projection of the Chinese phrases onto a 230-dimensional latent space. Using
label propagation, we computed the likelihood of each of the eight images
conditioned on the latent space projection of the deformed phrase for each
captcha. The resulting system solved captchas with 77% accuracy in 2 seconds on
average. Our work, in answering this practical challenge, illustrates the power
of this class of unsupervised association learning techniques, which may be
related to the brain’s general strategy for associating language stimuli with
visual objects on the principle of suspicious coincidence.

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Joao Carreira, Andrew Zisserman
Comments: To appear at CVPR 2017
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

The paucity of videos in current action classification datasets (UCF-101 and
HMDB-51) has made it difficult to identify good video architectures, as most
methods obtain similar performance on existing small-scale benchmarks. This
paper re-evaluates state-of-the-art architectures in light of the new Kinetics
Human Action Video dataset. Kinetics has two orders of magnitude more data,
with 400 human action classes and over 400 clips per class, and is collected
from realistic, challenging YouTube videos. We provide an analysis on how
current architectures fare on the task of action classification on this dataset
and how much performance improves on the smaller benchmark datasets after
pre-training on Kinetics.

We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on
2D ConvNet inflation: filters and pooling kernels of very deep image
classification ConvNets are expanded into 3D, making it possible to learn
seamless spatio-temporal feature extractors from video while leveraging
successful ImageNet architecture designs and even their parameters. We show
that, after pre-training on Kinetics, I3D models considerably improve upon the
state-of-the-art in action classification, reaching 80.7% on HMDB-51 and 98.0%
on UCF-101.

Semantic Softmax Loss for Zero-Shot Learning

Zhong Ji, Yunxin Sun, Yulong Yu, Jichang Guo, Yanwei Pang
Subjects: Computer Vision and Pattern Recognition (cs.CV)

A typical pipeline for Zero-Shot Learning (ZSL) is to integrate the visual
features and the class semantic descriptors into a multimodal framework with a
linear or bilinear model. However, the visual features and the class semantic
descriptors locate in different structural spaces, a linear or bilinear model
can not capture the semantic interactions between different modalities well. In
this letter, we propose a nonlinear approach to impose ZSL as a multi-class
classification problem via a Semantic Softmax Loss by embedding the class
semantic descriptors into the softmax layer of multi-class classification
network. To narrow the structural differences between the visual features and
semantic descriptors, we further use an L2 normalization constraint to the
differences between the visual features and visual prototypes reconstructed
with the semantic descriptors. The results on three benchmark datasets, i.e.,
AwA, CUB and SUN demonstrate the proposed approach can boost the performances
steadily and achieve the state-of-the-art performance for both zero-shot
classification and zero-shot retrieval.

Dynamics Based 3D Skeletal Hand Tracking

Stan Melax, Leonid Keselman, Sterling Orsten
Comments: Published in Graphics Interface 2013
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Tracking the full skeletal pose of the hands and fingers is a challenging
problem that has a plethora of applications for user interaction. Existing
techniques either require wearable hardware, add restrictions to user pose, or
require significant computation resources. This research explores a new
approach to tracking hands, or any articulated model, by using an augmented
rigid body simulation. This allows us to phrase 3D object tracking as a linear
complementarity problem with a well-defined solution. Based on a depth sensor’s
samples, the system generates constraints that limit motion orthogonal to the
rigid body model’s surface. These constraints, along with prior motion,
collision/contact constraints, and joint mechanics, are resolved with a
projected Gauss-Seidel solver. Due to camera noise properties and attachment
errors, the numerous surface constraints are impulse capped to avoid
overpowering mechanical constraints. To improve tracking accuracy, multiple
simulations are spawned at each frame and fed a variety of heuristics,
constraints and poses. A 3D error metric selects the best-fit simulation,
helping the system handle challenging hand motions. Such an approach enables
real-time, robust, and accurate 3D skeletal tracking of a user’s hand on a
variety of depth cameras, while only utilizing a single x86 CPU core for
processing.

Computer vision-based food calorie estimation: dataset, method, and experiment

Yanchao Liang, Jianhua Li
Comments: 5 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Computer vision has been introduced to estimate calories from food images.
But current food image data sets don’t contain volume information and quality
information of foods, which leads to an incomplete calorie estimation. In this
paper, we present a novel food image data set with volume information and
quality information of foods, and a deep learning method for food detection, to
make a complete calorie estimation. Our data set includes 2978 images, and
every image contains corresponding each food’s annotation, volume and quality
information, as well as a certain calibration reference. To estimate calorie of
food in the proposed data set, a deep learning method using Faster R-CNN first
is put forward to detect the food. And the experiment results show our method
is effective to estimate calories and our data set contains adequate
information for calorie estimation. Our data set is the first released food
image data set which can be used to evaluate computer vision-based calorie
estimation methods.

View-Invariant Recognition of Action Style Self-Dissimilarity

Yuping Shen, Hassan Foroosh
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Self-similarity was recently introduced as a measure of inter-class
congruence for classification of actions. Herein, we investigate the dual
problem of intra-class dissimilarity for classification of action styles. We
introduce self-dissimilarity matrices that discriminate between same actions
performed by different subjects regardless of viewing direction and camera
parameters. We investigate two frameworks using these invariant style
dissimilarity measures based on Principal Component Analysis (PCA) and Fisher
Discriminant Analysis (FDA). Extensive experiments performed on IXMAS dataset
indicate remarkably good discriminant characteristics for the proposed
invariant measures for gender recognition from video data.

Learning Robust Object Recognition Using Composed Scenes from Generative Models

Hao Wang, Xingyu Lin, Yimeng Zhang, Tai Sing Lee
Comments: Accepted by 14th Conference on Computer and Robot Vision
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recurrent feedback connections in the mammalian visual system have been
hypothesized to play a role in synthesizing input in the theoretical framework
of analysis by synthesis. The comparison of internally synthesized
representation with that of the input provides a validation mechanism during
perceptual inference and learning. Inspired by these ideas, we proposed that
the synthesis machinery can compose new, unobserved images by imagination to
train the network itself so as to increase the robustness of the system in
novel scenarios. As a proof of concept, we investigated whether images composed
by imagination could help an object recognition system to deal with occlusion,
which is challenging for the current state-of-the-art deep convolutional neural
networks. We fine-tuned a network on images containing objects in various
occlusion scenarios, that are imagined or self-generated through a deep
generator network. Trained on imagined occluded scenarios under the object
persistence constraint, our network discovered more subtle and localized image
features that were neglected by the original network for object classification,
obtaining better separability of different object classes in the feature space.
This leads to significant improvement of object recognition under occlusion for
our network relative to the original network trained only on un-occluded
images. In addition to providing practical benefits in object recognition under
occlusion, this work demonstrates the use of self-generated composition of
visual scenes through the synthesis loop, combined with the object persistence
constraint, can provide opportunities for neural networks to discover new
relevant patterns in the data, and become more flexible in dealing with novel
situations.

Boosting the accuracy of multi-spectral image pan-sharpening by learning a deep residual network

Yancong Wei, Qiangqiang Yuan, Huanfeng Shen, Liangpei Zhang
Comments: 5 pages,5 figures, 1 table
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In the field of fusing multi-spectral and panchromatic images
(Pan-sharpening), the impressive effectiveness of deep neural networks has been
recently employed to overcome the drawbacks of traditional linear models and
boost the fusing accuracy. However, to the best of our knowledge, existing
research works are mainly based on simple and flat networks with relatively
shallow architecture, which severely limited their performances. In this paper,
the concept of residual learning has been introduced to form a very deep
convolutional neural network to make a full use of the high non-linearity of
deep learning models. By both quantitative and visual assessments on a large
number of high quality multi-spectral images from various sources, it has been
supported that our proposed model is superior to all mainstream algorithms
included in the comparison, and achieved the highest spatial-spectral unified
accuracy.

Building Emotional Machines: Recognizing Image Emotions through Deep Neural Networks

Hye-Rin Kim, Seon Joo Kim, In-Kwon Lee
Comments: 11 pages, 15 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

An image is a very effective tool for conveying emotions. Many researchers
have investigated in computing the image emotions by using various features
extracted from images. In this paper, we focus on two high level features, the
object and the background, and assume that the semantic information of images
is a good cue for emotion prediction. An object is one of the most important
elements that define an image, and we find out through experiments that there
is a high correlation between the object and the emotion in images. Even with
the same object, there may be slight difference in emotion due to different
backgrounds, and we use the semantic information of the background to improve
the prediction performance. By combining the different levels of features, we
build an emotion based feed forward deep neural network which produces the
emotion values of a given image. The output emotion values in our framework are
continuous values in the 2-dimensional space (Valence and Arousal), which are
more effective than using a few number of emotion categories in describing
emotions. Experiments confirm the effectiveness of our network in predicting
the emotion of images.

Classification and Retrieval of Digital Pathology Scans: A New Dataset

Morteza Babaie, Shivam Kalra, Aditya Sriram, Christopher Mitcheltree, Shujin Zhu, Amin Khatami, Shahryar Rahnamayan, H.R. Tizhoosh
Comments: Accepted for presentation at Workshop for Computer Vision for Microscopy Image Analysis (CVMI 2017) @ CVPR 2017, Honolulu, Hawaii
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we introduce a new dataset, extbf{Kimia Path24}, for image
classification and retrieval in digital pathology. We use the whole scan images
of 24 different tissue textures to generate 1,325 test patches of size
1000( imes)1000 (0.5mm( imes)0.5mm). Training data can be generated according
to preferences of algorithm designer and can range from approximately 27,000 to
over 50,000 patches if the preset parameters are adopted. We propose a compound
patch-and-scan accuracy measurement that makes achieving high accuracies quite
challenging. In addition, we set the benchmarking line by applying LBP,
dictionary approach and convolutional neural nets (CNNs) and report their
results. The highest accuracy was 41.80\% for CNN.

Image Segmentation by Iterative Inference from Conditional Score Estimation

Adriana Romero, Michal Drozdzal, Akram Erraqabi, Simon Jégou, Yoshua Bengio
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Inspired by the combination of feedforward and iterative computations in the
virtual cortex, and taking advantage of the ability of denoising autoencoders
to estimate the score of a joint distribution, we propose a novel approach to
iterative inference for capturing and exploiting the complex joint distribution
of output variables conditioned on some input variables. This approach is
applied to image pixel-wise segmentation, with the estimated conditional score
used to perform gradient ascent towards a mode of the estimated conditional
distribution. This extends previous work on score estimation by denoising
autoencoders to the case of a conditional distribution, with a novel use of a
corrupted feedforward predictor replacing Gaussian corruption. An advantage of
this approach over more classical ways to perform iterative inference for
structured outputs, like conditional random fields (CRFs), is that it is not
any more necessary to define an explicit energy function linking the output
variables. To keep computations tractable, such energy function
parametrizations are typically fairly constrained, involving only a few
neighbors of each of the output variables in each clique. We experimentally
find that the proposed iterative inference from conditional score estimation by
conditional denoising autoencoders performs better than comparable models based
on CRFs or those not using any explicit modeling of the conditional joint
distribution of outputs.

The Do's and Don'ts for CNN-based Face Verification

Ankan Bansal, Carlos Castillo, Rajeev Ranjan, Rama Chellappa
Comments: 10 pages including references
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Convolutional neural networks (CNN) have become the most sought after tools
for addressing object recognition problems. Specifically, they have produced
state-of-the art results for unconstrained face recognition and verification
tasks. While the research community appears to have developed a consensus on
the methods of acquiring annotated data, design and training of CNNs, many
questions still remain to be answered. In this paper, we explore the following
questions that are critical to face recognition research: (i) Can we train on
still images and expect the systems to work on videos? (ii) Are deeper datasets
better than wider datasets? (iii) Does adding label noise lead to improvement
in performance of deep networks? (iv) Is alignment needed for face recognition?
We address these questions by training CNNs using CASIA-WebFace, UMDFaces, and
a new video dataset and testing on YouTubeFaces, IJBA and a disjoint portion of
UMDFaces datasets. Our new data set, which will be made publicly available, has
22,075 videos and 3,735,476 human annotated frames extracted from them.

Generative Partition Networks for Multi-Person Pose Estimation

Xuecheng Nie, Jiashi Feng, Junliang Xing, Shuicheng Yan
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper proposes a new framework, named Generative Partition Network
(GPN), for addressing the challenging multi-person pose estimation problem.
Different from existing pure top-down and bottom-up solutions, the proposed GPN
models the multi-person partition detection as a generative process from joint
candidates and infers joint configurations for person instances from each
person partition locally, resulting in both low joint detection and joint
partition complexities. In particular, GPN designs a generative model based on
the Generalized Hough Transform framework to detect person partitions via votes
from joint candidates in the Hough space, parameterized by centroids of
persons. Such generative model produces joint candidates and their
corresponding person partitions by performing only one pass of joint detection.
In addition, GPN formulates the inference procedure for joint configurations of
human poses as a graph partition problem and optimizes it locally. Inspired by
recent success of deep learning techniques for human pose estimation, GPN
designs a multi-stage convolutional neural network with feature pyramid branch
to jointly learn joint confidence maps and Hough transformation maps. Extensive
experiments on two benchmarks demonstrate the efficiency and effectiveness of
the proposed GPN.

Structured Image Classification from Conditional Random Field with Deep Class Embedding

Eran Goldman, Jacob Goldberger
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper presents a novel deep learning architecture to classify structured
objects in datasets with a large number of visually similar categories. Our
model extends the CRF objective function to a nonlinear form, by factorizing
the pairwise potential matrix, to learn neighboring-class embedding. The
embedding and the classifier are jointly trained to optimize this highly
nonlinear CRF objective function. The non-linear model is trained on
object-level samples, which is much faster and more accurate than the standard
sequence-level training of the linear model. This model overcomes the
difficulties of existing CRF methods to learn the contextual relationships
thoroughly when there is a large number of classes and the data is sparse. The
performance of the proposed method is illustrated on a huge dataset that
contains images of retail-store product displays, taken in varying settings and
viewpoints, and shows significantly improved results compared to linear CRF
modeling and sequence-level training.

CrossNets : A New Approach to Complex Learning

Chirag Agarwal, Mehdi Sharifzhadeh, Joe Klobusicky, Dan Schonfeld
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

We propose a novel neural network structure called CrossNets, which considers
architectures on directed acyclic graphs. This structure builds on previous
generalizations of feed forward models, such as ResNets, by allowing for all
forward cross connections between layers (both adjacent and non-adjacent). The
addition of cross connections among the network increases information flow
across the whole network, leading to better training and testing performances.
The superior performance of the network is tested against four benchmark
datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN. We conclude with a proof of
convergence for Crossnets to a local minimum for error, where weights for
connections are chosen through backpropagation with momentum.

DeepMasterPrint: Generating Fingerprints for Presentation Attacks

Philip Bontrager, Julian Togelius, Nasir Memon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Learning (cs.LG)

We present two related methods for creating MasterPrints, synthetic
fingerprints that a fingerprint verification system identifies as many
different people. Both methods start with training a Generative Adversarial
Network (GAN) on a set of real fingerprint images. The generator network is
then used to search for images that can be recognized as multiple individuals.
The first method uses evolutionary optimization in the space of latent
variables, and the second uses gradient-based search. Our method is able to
design a MasterPrint that a commercial fingerprint system matches to 22% of all
users in a strict security setting, and 75% of all users at a looser security
setting.

Incorporating Depth into both CNN and CRF for Indoor Semantic Segmentation

Jindong Jiang, Zhijun Zhang, Yongqian Huang, Lunan Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)

To improve segmentation performance, a novel neural network architecture
(termed DFCN-DCRF) is proposed, which combines an RGB-D fully convolutional
neural network (DFCN) with a depth-sensitive fully-connected conditional random
field (DCRF). First, a DFCN architecture which fuses depth information into the
early layers and applies dilated convolution for later contextual reasoning is
designed. Then, a depth-sensitive fully-connected conditional random field
(DCRF) is proposed and combined with the previous DFCN to refine the
preliminary result. Comparative experiments show that the proposed DFCN-DCRF
has the best performance compared with most state-of-the-art methods.

Structural Compression of Convolutional Neural Networks Based on Greedy Filter Pruning

Reza Abbasi-Asl, Bin Yu
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Convolutional neural networks (CNNs) have state-of-the-art performance on
many problems in machine vision. However, networks with superior performance
often have millions of weights so that it is difficult or impossible to use
CNNs on computationally limited devices or to humanly interpret them. A myriad
of CNN compression approaches have been proposed and they involve pruning and
compressing the weights and filters. In this article, we introduce a greedy
structural compression scheme that prunes filters in a trained CNN. We define a
filter importance index equal to the classification accuracy reduction (CAR) of
the network after pruning that filter (similarly defined as RAR for
regression). We then iteratively prune filters based on the CAR index. This
algorithm achieves substantially higher classification accuracy in AlexNet
compared to other structural compression schemes that prune filters. Pruning
half of the filters in the first or second layer of AlexNet, our CAR algorithm
achieves 26% and 20% higher classification accuracies respectively, compared to
the best benchmark filter pruning scheme. Our CAR algorithm, combined with
further weight pruning and compressing, reduces the size of first or second
convolutional layer in AlexNet by a factor of 42, while achieving close to
original classification accuracy through retraining (or fine-tuning) network.
Finally, we demonstrate the interpretability of CAR-compressed CNNs by showing
that our algorithm prunes filters with visually redundant functionalities. In
fact, out of top 20 CAR-pruned filters in AlexNet, 17 of them in the first
layer and 14 of them in the second layer are color-selective filters as opposed
to shape-selective filters. To our knowledge, this is the first reported result
on the connection between compression and interpretability of CNNs.

Phase-Shifting Separable Haar Wavelets and Applications

Mais Alnasser, Hassan Foroosh
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper presents a new approach for tackling the shift-invariance problem
in the discrete Haar domain, without trading off any of its desirable
properties, such as compression, separability, orthogonality, and symmetry. The
paper presents several key theoretical contributions. First, we derive closed
form expressions for phase shifting in the Haar domain both in partially
decimated and fully decimated transforms. Second, it is shown that the wavelet
coefficients of the shifted signal can be computed solely by using the
coefficients of the original transformed signal. Third, we derive closed-form
expressions for non-integer shifts, which have not been previously reported in
the literature. Fourth, we establish the complexity of the proposed phase
shifting approach using the derived analytic expressions. As an application
example of these results, we apply the new formulae to image rotation and
interpolation, and evaluate its performance against standard methods.

Critical Contours: An Invariant Linking Image Flow with Salient Surface Organization

Benjamin S. Kunsberg, Steven W. Zucker
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We exploit a key result from visual psychophysics — that individuals
perceive shape qualitatively — to develop a geometrical/topological invariant
(the Morse-Smale complex) relating image structure with surface structure.
Differences across individuals are minimal near certain configurations such as
ridges and boundaries, and it is these configurations that are often
represented in line drawings. In particular, we introduce a method for
inferring qualitative 3D shape from shading patterns that link the
shape-from-shading inference with shape-from-contour. For a given shape,
certain shading patches become “line drawings” in a well-defined limit. Under
this limit, and invariantly, these shading patterns provide a topological
description of the surface. We further show that, under this model, the
contours partition the surface into meaningful parts using the Morse-Smale
complex. Critical contours are the (perceptually) stable parts of this complex
and are invariant over a wide class of rendering models. Intuitively, our main
result shows that critical contours partition smooth surfaces into bumps and
valleys, in effect providing a scaffold on the image from which a full surface
can be interpolated.

Forecasting Hand and Object Locations in Future Frames

Chenyou Fan, Jangwon Lee, Michael S. Ryoo
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper presents an approach to forecast future locations of human hands
and objects. Given an image frame, the goal is to predict presence and location
of hands and objects in the future frame (e.g., 5 seconds later), even when
they are not visible in the current frame. The key idea is that (1) an
intermediate representation of a convolutional object recognition model
abstracts scene information in its frame and that (2) we can predict (i.e.,
regress) such representations corresponding to the future frames based on that
of the current frame. We design a new two-stream convolutional neural network
(CNN) architecture for videos by extending the state-of-the-art convolutional
object detection network, and present a new fully convolutional regression
network for predicting future scene representations. Our experiments confirm
that combining the regressed future representation with our detection network
allows reliable estimation of future hands and objects in videos

Gaze Distribution Analysis and Saliency Prediction Across Age Groups

Onkar Krishna, Kiyoharu Aizawa, Andrea Helo, Rama Pia
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Knowledge of the human visual system helps to develop better computational
models of visual attention. State-of-the-art models have been developed to
mimic the visual attention system of young adults that, however, largely ignore
the variations that occur with age. In this paper, we investigated how visual
scene processing changes with age and we propose an age-adapted framework that
helps to develop a computational model that can predict saliency across
different age groups. Our analysis uncovers how the explorativeness of an
observer varies with age, how well saliency maps of an age group agree with
fixation points of observers from the same or different age groups, and how age
influences the center bias. We analyzed the eye movement behavior of 82
observers belonging to four age groups while they explored visual scenes.
Explorativeness was quantified in terms of the entropy of a saliency map, and
area under the curve (AUC) metrics was used to quantify the agreement analysis
and the center bias. These results were used to develop age adapted saliency
models. Our results suggest that the proposed age-adapted saliency model
outperforms existing saliency models in predicting the regions of interest
across age groups.

Non-Linear Phase-Shifting of Haar Wavelets for Run-Time All-Frequency Lighting

Mais Alnasser, Hassan Foroosh
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper focuses on real-time all-frequency image-based rendering using an
innovative solution for run-time computation of light transport. The approach
is based on new results derived for non-linear phase shifting in the Haar
wavelet domain. Although image-based methods for real-time rendering of dynamic
glossy objects have been proposed, they do not truly scale to all possible
frequencies and high sampling rates without trading storage, glossiness, or
computational time, while varying both lighting and viewpoint. This is due to
the fact that current approaches are limited to precomputed radiance transfer
(PRT), which is prohibitively expensive in terms of memory requirements and
real-time rendering when both varying light and viewpoint changes are required
together with high sampling rates for high frequency lighting of glossy
material. On the other hand, current methods cannot handle object rotation,
which is one of the paramount issues for all PRT methods using wavelets. This
latter problem arises because the precomputed data are defined in a global
coordinate system and encoded in the wavelet domain, while the object is
rotated in a local coordinate system. At the root of all the above problems is
the lack of efficient run-time solution to the nontrivial problem of rotating
wavelets (a non-linear phase-shift), which we solve in this paper.

Recurrent Scene Parsing with Perspective Understanding in the Loop

Shu Kong, Charless Fowlkes
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Objects may appear at arbitrary scales in perspective images of a scene,
posing a challenge for recognition systems that process an image at a fixed
resolution. We propose a depth-aware gating module that adaptively chooses the
pooling field size in a convolutional network architecture according to the
object scale (inversely proportional to the depth) so that small details can be
preserved for objects at distance and a larger receptive field can be used for
objects nearer to the camera. The depth gating signal is provided from stereo
disparity (when available) or estimated directly from a single image. We
integrate this depth-aware gating into a recurrent convolutional neural network
trained in an end-to-end fashion to perform semantic segmentation. Our
recurrent module iteratively refines the segmentation results, leveraging the
depth estimate and output prediction from the previous loop. Through extensive
experiments on three popular large-scale RGB-D datasets, we demonstrate our
approach achieves competitive semantic segmentation performance using more
compact model than existing methods. Interestingly, we find segmentation
performance improves when we estimate depth directly from the image rather than
using “ground-truth” and the model produces state-of-the-art results for
quantitative depth estimation from a single image.

Quadruplet Network with One-Shot Learning for Visual Tracking

Xingping Dong, Jianbing Shen, Fatih Porikli
Subjects: Computer Vision and Pattern Recognition (cs.CV)

As a discriminative method of one-shot learning, Siamese deep network allows
recognizing an object from a single exemplar with the same class label.
However, it does not take the advantage of the underlying structure and
relationship among a multitude of instances since it only relies on pairs of
instances for training. In this paper, we propose a quadruplet deep network to
examine the potential connections among the training instances, aiming to
achieve a more powerful representation. We design four shared networks that
receive multi-tuple of instances as inputs and are connected by a novel loss
function consisting of pair-loss and triplet-loss. According to the similarity
metric, we select the most similar and the most dissimilar instances as the
positive and negative inputs of triplet loss from each multi-tuple. We show
that this scheme improves the training performance and convergence speed.
Furthermore, we introduce a new weighted pair loss for an additional
acceleration of the convergence. We demonstrate promising results for
model-free tracking-by-detection of objects from a single initial exemplar in
the Visual Object Tracking benchmark.

PixColor: Pixel Recursive Colorization

Sergio Guadarrama, Ryan Dahl, David Bieber, Mohammad Norouzi, Jonathon Shlens, Kevin Murphy
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

We propose a novel approach to automatically produce multiple colorized
versions of a grayscale image. Our method results from the observation that the
task of automated colorization is relatively easy given a low-resolution
version of the color image. We first train a conditional PixelCNN to generate a
low resolution color for a given grayscale image. Then, given the generated
low-resolution color image and the original grayscale image as inputs, we train
a second CNN to generate a high-resolution colorization of an image. We
demonstrate that our approach produces more diverse and plausible colorizations
than existing methods, as judged by human raters in a “Visual Turing Test”.

Towards Real World Human Parsing: Multiple-Human Parsing in the Wild

Jianshu Li, Jian Zhao, Yunchao Wei, Congyan Lang, Yidong Li, Jiashi Feng
Comments: The first two authors are with equal contribution
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The recent progress of human parsing techniques has been largely driven by
the availability of rich data resources. In this work, we demonstrate some
critical discrepancies between the current benchmark datasets and the real
world human parsing scenarios. For instance, all the human parsing datasets
only contain one person per image, while usually multiple persons appear
simultaneously in a realistic scene. It is more practically demanded to
simultaneously parse multiple persons, which presents a greater challenge to
modern human parsing methods. Unfortunately, absence of relevant data resources
severely impedes the development of multiple-human parsing methods.

To facilitate future human parsing research, we introduce the Multiple-Human
Parsing (MHP) dataset, which contains multiple persons in a real world scene
per single image. The MHP dataset contains various numbers of persons (from 2
to 16) per image with 18 semantic classes for each parsing annotation. Persons
appearing in the MHP images present sufficient variations in pose, occlusion
and interaction. To tackle the multiple-human parsing problem, we also propose
a novel Multiple-Human Parser (MH-Parser), which considers both the global
context and local cues for each person in the parsing process. The model is
demonstrated to outperform the naive “detect-and-parse” approach by a large
margin, which will serve as a solid baseline and help drive the future research
in real world human parsing.

Multi-Stage Variational Auto-Encoders for Coarse-to-Fine Image Generation

Lei Cai, Hongyang Gao, Shuiwang Ji
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

Variational auto-encoder (VAE) is a powerful unsupervised learning framework
for image generation. One drawback of VAE is that it generates blurry images
due to its Gaussianity assumption and thus L2 loss. To allow the generation of
high quality images by VAE, we increase the capacity of decoder network by
employing residual blocks and skip connections, which also enable efficient
optimization. To overcome the limitation of L2 loss, we propose to generate
images in a multi-stage manner from coarse to fine. In the simplest case, the
proposed multi-stage VAE divides the decoder into two components in which the
second component generates refined images based on the course images generated
by the first component. Since the second component is independent of the VAE
model, it can employ other loss functions beyond the L2 loss and different
model architectures. The proposed framework can be easily generalized to
contain more than two components. Experiment results on the MNIST and CelebA
datasets demonstrate that the proposed multi-stage VAE can generate sharper
images as compared to those from the original VAE.

A Lightweight Approach for On-the-Fly Reflectance Estimation

Kihwan Kim, Jinwei Gu, Stephen Tyree, Pavlo Molchanov, Matthias Nießner, Jan Kautz
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Estimating surface reflectance (BRDF) is one key component for complete 3D
scene capture, with wide applications in virtual reality, augmented reality,
and human computer interaction. Prior work is either limited to controlled
environments (eg gonioreflectometers, light stages, or multi-camera domes), or
requires the joint optimization of shape, illumination, and reflectance, which
is often computationally too expensive (eg hours of running time) for
real-time applications. Moreover, most prior work requires HDR images as input
which further complicates the capture process. In this paper, we propose a
lightweight approach for surface reflectance estimation directly from (8)-bit
RGB images in real-time, which can be easily plugged into any 3D
scanning-and-fusion system with a commodity RGBD sensor. Our method is
learning-based, with an inference time of less than 90ms per scene and a model
size of less than 340K bytes. We propose two novel network architectures,
HemiCNN and Grouplet, to deal with the unstructured input data from multiple
viewpoints under unknown illumination. We further design a loss function to
resolve the color-constancy and scale ambiguity. In addition, we have created a
large synthetic dataset, SynBRDF, which comprises a total of (500)K RGBD images
rendered with a physically-based ray tracer under a variety of natural
illumination, covering (5000) materials and (5000) shapes. SynBRDF is the first
large-scale benchmark dataset for reflectance estimation. Experiments on both
synthetic data and real data show that the proposed method effectively recovers
surface reflectance, and outperforms prior work for reflectance estimation in
uncontrolled environments.

A New 3D Method to Segment the Lumbar Vertebral Bodies and to Determine Bone Mineral Density and Geometry

Andre Mastmeyer, Klaus Engelke, Sebastian Meller, Willi Kalender
Comments: 8 pages, 6 figures, 1 table, MICCAI 2005 workshop paper
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper we present a new 3D segmentation approach for the vertebrae of
the lower thoracic and the lumbar spine in spiral computed tomography datasets.
We implemented a multi-step procedure. Its main components are deformable
models, volume growing, and morphological operations. The performance analysis
that included an evaluation of accuracy using the European Spine Phantom, and
of intra-operator precision using clinical CT datasets from 10 patients
highlight the potential for clinical use. The intra-operator precision of the
segmentation procedure was better than 1% for Bone Mineral Density (BMD) and
better than 1.8% for volume. The long-term goal of this work is to enable
better fracture prediction and improved patient monitoring in the field of
osteoporosis. A true 3D segmentation also enables an accurate measurement of
geometrical parameters that can augment the classical measurement of BMD.

Sparse Coding on Stereo Video for Object Detection

Sheng Y. Lundquist, Melanie Mitchell, Garrett T. Kenyon
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep Convolutional Neural Networks (DCNN) require millions of labeled
training examples for image classification and object detection tasks, which
restrict these models to domains where such a dataset is available. We explore
the use of unsupervised sparse coding applied to stereo-video data to help
alleviate the need for large amounts of labeled data. In this paper, we show
that unsupervised sparse coding is able to learn disparity and motion sensitive
basis functions when exposed to unlabeled stereo-video data. Additionally, we
show that a DCNN that incorporates unsupervised learning exhibits better
performance than fully supervised networks. Furthermore, finding a sparse
representation in the first layer, which infers a sparse set of activations,
allows for consistent performance over varying initializations and ordering of
training examples when compared to representations computed from a single
convolution. Finally, we compare activations between the unsupervised
sparse-coding layer and the supervised layer when applied to stereo-video data,
and show that sparse coding exhibits an encoding that is depth selective,
whereas encodings from a single convolution do not. These result indicates
promise for using unsupervised sparse-coding approaches in real- world computer
vision tasks in domains with limited labeled training data.

A New 3D Segmentation Methodology for Lumbar Vertebral Bodies for the Measurement of BMD and Geometry

Andre Mastmeyer, Klaus Engelke, Willi Kalender
Comments: 4 pages,2 figures, MIUA05 conference
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper a new technique is presented that extracts the geometry of
lumbar vertebral bodies from spiral CT scans. Our new multi-step segmentation
approach yields highly accurate and precise measurement of the bone mineral
density (BMD) in different volumes of interest which are defined relative to a
local anatomical coordinate systems. The approach also enables the analysis of
the geometry of the relevant vertebrae. Intra- and inter operator precision for
segmentation, BMD measurement and position of the coordinate system are below
1.5% in patient data, accuracy errors are below 1.5% for BMD and below 4% for
volume in phantom data. The long-term goal of the approach is to improve
fracture prediction in osteoporosis.

Simultaneous Multiple Surface Segmentation Using Deep Learning

Abhay Shah, Michael Abramoff, Xiaodong Wu
Comments: 8 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The task of automatically segmenting 3-D surfaces representing boundaries of
objects is important for quantitative analysis of volumetric images, and plays
a vital role in biomedical image analysis. Recently, graph-based methods with a
global optimization property have been developed and optimized for various
medical imaging applications. Despite their widespread use, these require human
experts to design transformations, image features, surface smoothness priors,
and re-design for a different tissue, organ or imaging modality. Here, we
propose a Deep Learning based approach for segmentation of the surfaces in
volumetric medical images, by learning the essential features and
transformations from training data, without any human expert intervention. We
employ a regional approach to learn the local surface profiles. The proposed
approach was evaluated on simultaneous intraretinal layer segmentation of
optical coherence tomography (OCT) images of normal retinas and retinas
affected by age related macular degeneration (AMD). The proposed approach was
validated on 40 retina OCT volumes including 20 normal and 20 AMD subjects. The
experiments showed statistically significant improvement in accuracy for our
approach compared to state-of-the-art graph based optimal surface segmentation
with convex priors (G-OSC). A single Convolution Neural Network (CNN) was used
to learn the surfaces for both normal and diseased images. The mean unsigned
surface positioning errors obtained by G-OSC method 2.31 voxels (95% CI
2.02-2.60 voxels) was improved to (1.27) voxels (95% CI 1.14-1.40 voxels) using
our new approach. On average, our approach takes 94.34 s, requiring 95.35 MB
memory, which is much faster than the 2837.46 s and 6.87 GB memory required by
the G-OSC method on the same computer system.

Deep De-Aliasing for Fast Compressive Sensing MRI

Simiao Yu, Hao Dong, Guang Yang, Greg Slabaugh, Pier Luigi Dragotti, Xujiong Ye, Fangde Liu, Simon Arridge, Jennifer Keegan, David Firmin, Yike Guo
Comments: 15 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Fast Magnetic Resonance Imaging (MRI) is highly in demand for many clinical
applications in order to reduce the scanning cost and improve the patient
experience. This can also potentially increase the image quality by reducing
the motion artefacts and contrast washout. However, once an image field of view
and the desired resolution are chosen, the minimum scanning time is normally
determined by the requirement of acquiring sufficient raw data to meet the
Nyquist-Shannon sampling criteria. Compressive Sensing (CS) theory has been
perfectly matched to the MRI scanning sequence design with much less required
raw data for the image reconstruction. Inspired by recent advances in deep
learning for solving various inverse problems, we propose a conditional
Generative Adversarial Networks-based deep learning framework for de-aliasing
and reconstructing MRI images from highly undersampled data with great promise
to accelerate the data acquisition process. By coupling an innovative content
loss with the adversarial loss our de-aliasing results are more realistic.
Furthermore, we propose a refinement learning procedure for training the
generator network, which can stabilise the training with fast convergence and
less parameter tuning. We demonstrate that the proposed framework outperforms
state-of-the-art CS-MRI methods, in terms of reconstruction error and
perceptual image quality. In addition, our method can reconstruct each image in
0.22ms–0.37ms, which is promising for real-time applications.

Stabilizing GAN Training with Multiple Random Projections

Behnam Neyshabur, Srinadh Bhojanapalli, Ayan Chakrabarti
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Training generative adversarial networks is unstable in high-dimensions when
the true data distribution lies on a lower-dimensional manifold. The
discriminator is then easily able to separate nearly all generated samples
leaving the generator without meaningful gradients. We propose training a
single generator simultaneously against an array of discriminators, each of
which looks at a different random low-dimensional projection of the data. We
show that individual discriminators then provide stable gradients to the
generator, and that the generator learns to produce samples consistent with the
full data distribution to satisfy all discriminators. We demonstrate the
practical utility of this approach experimentally, and show that it is able to
produce image samples with higher quality than traditional training with a
single discriminator.

Improving classification accuracy of feedforward neural networks for spiking neuromorphic chips

Deep Neural Networks (DNN) achieve human level performance in many image
analytics tasks but DNNs are mostly deployed to GPU platforms that consume a
considerable amount of power. New hardware platforms using lower precision
arithmetic achieve drastic reductions in power consumption. More recently,
brain-inspired spiking neuromorphic chips have achieved even lower power
consumption, on the order of milliwatts, while still offering real-time
processing.

However, for deploying DNNs to energy efficient neuromorphic chips the
incompatibility between continuous neurons and synaptic weights of traditional
DNNs, discrete spiking neurons and synapses of neuromorphic chips need to be
overcome. Previous work has achieved this by training a network to learn
continuous probabilities, before it is deployed to a neuromorphic architecture,
such as IBM TrueNorth Neurosynaptic System, by random sampling these
probabilities.

The main contribution of this paper is a new learning algorithm that learns a
TrueNorth configuration ready for deployment. We achieve this by training
directly a binary hardware crossbar that accommodates the TrueNorth axon
configuration constrains and we propose a different neuron model.

Results of our approach trained on electroencephalogram (EEG) data show a
significant improvement with previous work (76% vs 86% accuracy) while
maintaining state of the art performance on the MNIST handwritten data set.

Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon

Xin Dong, Shangyu Chen, Sinno Jialin Pan
Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

How to develop slim and accurate deep neural networks has become crucial for
real- world applications, especially for those employed in embedded systems.
Though previous work along this research line has shown some promising results,
most existing methods either fail to significantly compress a well-trained deep
network or require a heavy retraining process for the pruned deep network to
re-boost its prediction performance. In this paper, we propose a new layer-wise
pruning method for deep neural networks. In our proposed method, parameters of
each individual layer are pruned independently based on second order
derivatives of a layer-wise error function with respect to the corresponding
parameters. We prove that the final prediction performance drop after pruning
is bounded by a linear combination of the reconstructed errors caused at each
layer. Therefore, there is a guarantee that one only needs to perform a light
retraining process on the pruned network to resume its original prediction
performance. We conduct extensive experiments on benchmark datasets to
demonstrate the effectiveness of our pruning method compared with several
state-of-the-art baseline methods.

Shake-Shake regularization

Xavier Gastaldi
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

The method introduced in this paper aims at helping deep learning
practitioners faced with an overfit problem. The idea is to replace, in a
multi-branch network, the standard summation of parallel branches with a
stochastic affine combination. Applied to 3-branch residual networks,
shake-shake regularization improves on the best single shot published results
on CIFAR-10 and CIFAR-100 by reaching test errors of 2.86% and 15.85%.
Experiments on architectures without skip connections or Batch Normalization
show encouraging results and open the door to a large set of applications. Code
is available at this https URL

Stabilizing Adversarial Nets With Prediction Methods

Abhay Yadav (1), Sohil Shah (1), Zheng Xu (1), David Jacobs (1), Tom Goldstein (1) ((1) University of Maryland, College Park, MD, USA)
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (cs.NA)

Adversarial neural networks solve many important problems in data science,
but are notoriously difficult to train. These difficulties come from the fact
that optimal weights for adversarial nets correspond to saddle points, and not
minimizers, of the loss function. The alternating stochastic gradient methods
typically used for such problems do not reliably converge to saddle points, and
when convergence does happen it is often highly sensitive to learning rates. We
propose a simple modification of stochastic gradient descent that stabilizes
adversarial networks. We show, both in theory and practice, that the proposed
method reliably converges to saddle points. This makes adversarial networks
less likely to “collapse”, and enables faster training with larger learning
rates.

Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods

Nicholas Carlini, David Wagner
Subjects: Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Neural networks are known to be vulnerable to adversarial examples: inputs
that are close to valid inputs but classified incorrectly. We investigate the
security of ten recent proposals that are designed to detect adversarial
examples. We show that all can be defeated, even when the adversary does not
know the exact parameters of the detector. We conclude that adversarial
examples are significantly harder to detect than previously appreciated, and we
propose several guidelines for evaluating future proposed defenses.

End-to-end Planning of Fixed Millimeter-Wave Networks

Tim Danford, Onur Filiz, Jing Huang, Brian Karrer, Manohar Paluri, Guan Pang, Vish Ponnampalam, Nicolas Stier-Moses, Birce Tezel
Subjects: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)

This article discusses a framework to support the design and end-to-end
planning of fixed millimeter-wave networks. Compared to traditional techniques,
the framework allows an organization to quickly plan a deployment in a
cost-effective way. We start by using LiDAR data—basically, a 3D point cloud
captured from a city—to estimate potential sites to deploy antennas and
whether there is line-of-sight between them. With that data on hand, we use
combinatorial optimization techniques to determine the optimal set of locations
and how they should communicate with each other, to satisfy engineering (e.g.,
latency, polarity), design (e.g., reliability) and financial (e.g., total cost
of operation) constraints. The primary goal is to connect as many people as
possible to the network. Our methodology can be used for strategic planning
when an organization is in the process of deciding whether to adopt a
millimeter-wave technology or choosing between locations, or for operational
planning when conducting a detailed design of the actual network to be deployed
in a selected location.

How to Train Your DRAGAN

Generative Adversarial Networks have emerged as an effective technique for
estimating data distributions. The basic setup consists of two deep networks
playing against each other in a zero-sum game setting. However, it is not
understood if the networks reach an equilibrium eventually and what dynamics
makes this possible. The current GAN training procedure, which involves
simultaneous gradient descent, lacks a clear game-theoretic justification in
the literature. In this paper, we introduce regret minimization as a technique
to reach equilibrium in games and use this to motivate the use of simultaneous
GD in GANs. In addition, we present a hypothesis that mode collapse, which is a
common occurrence in GAN training, happens due to the existence of spurious
local equilibria in non-convex games. Motivated by these insights, we develop
an algorithm called DRAGAN that is fast, simple to implement and achieves
competitive performance in a stable fashion across different architectures,
datasets (MNIST, CIFAR-10, and CelebA), and divergence measures with almost no
hyperparameter tuning.

Espresso: Efficient Forward Propagation for BCNNs

There are many applications scenarios for which the computational performance
and memory footprint of the prediction phase of Deep Neural Networks (DNNs)
needs to be optimized. Binary Neural Networks (BDNNs) have been shown to be an
effective way of achieving this objective. In this paper, we show how
Convolutional Neural Networks (CNNs) can be implemented using binary
representations. Espresso is a compact, yet powerful library written in C/CUDA
that features all the functionalities required for the forward propagation of
CNNs, in a binary file less than 400KB, without any external dependencies.
Although it is mainly designed to take advantage of massive GPU parallelism,
Espresso also provides an equivalent CPU implementation for CNNs. Espresso
provides special convolutional and dense layers for BCNNs, leveraging
bit-packing and bit-wise computations for efficient execution. These techniques
provide a speed-up of matrix-multiplication routines, and at the same time,
reduce memory usage when storing parameters and activations. We experimentally
show that Espresso is significantly faster than existing implementations of
optimized binary neural networks ((approx) 2 orders of magnitude). Espresso is
released under the Apache 2.0 license and is available at
this http URL

Evaluation of Direct Haptic 4D Volume Rendering of Partially Segmented Data for Liver Puncture Simulation

Andre Mastmeyer, Dirk Fortmeier, Heinz Handels
Comments: 15 pages, 16 figures, 1 tables, 11 equations, 39 references
Journal-ref: Nature – Scientific Reports, Nature Publishing Group (NPG),
7(671), 2017
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

This work presents an evaluation study using a force feedback evaluation
framework for a novel direct needle force volume rendering concept in the
context of liver puncture simulation. PTC/PTCD puncture interventions targeting
the bile ducts have been selected to illustrate this concept. The haptic
algorithms of the simulator system are based on (1) partially segmented patient
image data and (2) a non-linear spring model effective at organ borders. The
primary aim is to quantitatively evaluate force errors caused by our patient
modeling approach, in comparison to haptic force output obtained from using
gold-standard, completely manually-segmented data. The evaluation of the force
algorithms compared to a force output from fully manually segmented
gold-standard patient models, yields a low mean of 0.12 N root mean squared
force error and up to 1.6 N for systematic maximum absolute errors. Force
errors were evaluated on 31,222 preplanned test paths from 10 patients. Only
twelve percent of the emitted forces along these paths were affected by errors.
This is the first study evaluating haptic algorithms with deformable virtual
patients in silico. We prove haptic rendering plausibility on a very high
number of test paths. Important errors are below just noticeable differences
for the hand-arm system.

Artificial Intelligence

A unified approach to interpreting model predictions

Scott Lundberg, Su-In Lee
Subjects: Artificial Intelligence (cs.AI); Learning (cs.LG); Machine Learning (stat.ML)

Understanding why a model made a certain prediction is crucial in many
applications. However, with large modern datasets the best accuracy is often
achieved by complex models even experts struggle to interpret, such as ensemble
or deep learning models. This creates a tension between accuracy and
interpretability. In response, a variety of methods have recently been proposed
to help users interpret the predictions of complex models. Here, we present a
unified framework for interpreting predictions, namely SHAP (SHapley Additive
exPlanations, which assigns each feature an importance for a particular
prediction. The key novel components of the SHAP framework are the
identification of a class of additive feature importance measures and
theoretical results that there is a unique solution in this class with a set of
desired properties. This class unifies six existing methods, and several recent
methods in this class do not have these desired properties. This means that our
framework can inform the development of new methods for explaining prediction
models. We demonstrate that several new methods we presented in this paper
based on the SHAP framework show better computational performance and better
consistency with human intuition than existing methods.

AIXIjs: A Software Demo for General Reinforcement Learning

John Aslanides
Comments: Masters thesis. Australian National University, October 2016. 97 pp
Subjects: Artificial Intelligence (cs.AI)

Reinforcement learning is a general and powerful framework with which to
study and implement artificial intelligence. Recent advances in deep learning
have enabled RL algorithms to achieve impressive performance in restricted
domains such as playing Atari video games (Mnih et al., 2015) and, recently,
the board game Go (Silver et al., 2016). However, we are still far from
constructing a generally intelligent agent. Many of the obstacles and open
questions are conceptual: What does it mean to be intelligent? How does one
explore and learn optimally in general, unknown environments? What, in fact,
does it mean to be optimal in the general sense? The universal Bayesian agent
AIXI (Hutter, 2005) is a model of a maximally intelligent agent, and plays a
central role in the sub-field of general reinforcement learning (GRL).
Recently, AIXI has been shown to be flawed in important ways; it doesn’t
explore enough to be asymptotically optimal (Orseau, 2010), and it can perform
poorly with certain priors (Leike and Hutter, 2015). Several variants of AIXI
have been proposed to attempt to address these shortfalls: among them are
entropy-seeking agents (Orseau, 2011), knowledge-seeking agents (Orseau et al.,
2013), Bayes with bursts of exploration (Lattimore, 2013), MDL agents (Leike,
2016a), Thompson sampling (Leike et al., 2016), and optimism (Sunehag and
Hutter, 2015). We present AIXIjs, a JavaScript implementation of these GRL
agents. This implementation is accompanied by a framework for running
experiments against various environments, similar to OpenAI Gym (Brockman et
al., 2016), and a suite of interactive demos that explore different properties
of the agents, similar to REINFORCEjs (Karpathy, 2015). We use AIXIjs to
present numerous experiments illustrating fundamental properties of, and
differences between, these agents.

Shallow Updates for Deep Reinforcement Learning

Nir Levine, Tom Zahavy, Daniel J. Mankowitz, Aviv Tamar, Shie Mannor
Subjects: Artificial Intelligence (cs.AI); Learning (cs.LG); Machine Learning (stat.ML)

Deep reinforcement learning (DRL) methods such as the Deep Q-Network (DQN)
have achieved state-of-the-art results in a variety of challenging,
high-dimensional domains. This success is mainly attributed to the power of
deep neural networks to learn rich domain representations for approximating the
value function or policy. Batch reinforcement learning methods with linear
representations, on the other hand, are more stable and require less hyper
parameter tuning. Yet, substantial feature engineering is necessary to achieve
good results. In this work we propose a hybrid approach — the Least Squares
Deep Q-Network (LS-DQN), which combines rich feature representations learned by
a DRL algorithm with the stability of a linear least squares method. We do this
by periodically re-training the last hidden layer of a DRL network with a batch
least squares update. Key to our approach is a Bayesian regularization term for
the least squares update, which prevents over-fitting to the more recent data.
We tested LS-DQN on five Atari games and demonstrate significant improvement
over vanilla DQN and Double-DQN. We also investigated the reasons for the
superior performance of our method. Interestingly, we found that the
performance improvement can be attributed to the large batch size used by the
LS method when optimizing the last layer.

Experience enrichment based task independent reward model

Min Xu
Comments: 4 pages, 1 figure
Subjects: Artificial Intelligence (cs.AI)

For most reinforcement learning approaches, the learning is performed by
maximizing an accumulative reward that is expectedly and manually defined for
specific tasks. However, in real world, rewards are emergent phenomena from the
complex interactions between agents and environments. In this paper, we propose
an implicit generic reward model for reinforcement learning. Unlike those
rewards that are manually defined for specific tasks, such implicit reward is
task independent. It only comes from the deviation from the agents’ previous
experiences.

Sketched Answer Set Programming

Sergey Paramonov, Christian Bessiere, Anton Dries, Luc De Raedt
Comments: 15 pages, 11 figures
Subjects: Artificial Intelligence (cs.AI)

Answer Set Programming (ASP) is a powerful modeling formalism for
combinatorial problems. However, writing ASP models is not trivial. We propose
a novel method, called Sketched Answer Set Programming (SkASP), aiming at
supporting the user in resolving this issue. The user writes an ASP program
while marking uncertain parts open with question marks. In addition, the user
provides a number of positive and negative examples of the desired program
behaviour. The sketched model is rewritten into another ASP program, which is
solved by traditional methods. As a result, the user obtains a functional and
reusable ASP program modelling her problem. We evaluate our approach on 21 well
known puzzles and combinatorial problems inspired by Karp’s 21 NP-complete
problems and demonstrate a use-case for a database application based on ASP.

Generalizing the Role of Determinization in Probabilistic Planning

Luis Pineda, Shlomo Zilberstein
Subjects: Artificial Intelligence (cs.AI)

The stochastic shortest path problem (SSP) is a highly expressive model for
probabilistic planning. The computational hardness of SSPs has sparked interest
in determinization-based planners that can quickly solve large problems.
However, existing methods employ a simplistic approach to determinization. In
particular, they ignore the possibility of tailoring the determinization to the
specific characteristics of the target domain. In this work we examine this
question, by showing that learning a good determinization for a planning domain
can be done efficiently and can improve performance. Moreover, we show how to
directly incorporate probabilistic reasoning into the planning problem when a
good determinization is not sufficient by itself. Based on these insights, we
introduce a planner, FF-LAO*, that outperforms state-of-the-art probabilistic
planners on several well-known competition benchmarks.

Why You Should Charge Your Friends for Borrowing Your Stuff

Kijung Shin, Euiwoong Lee, Dhivya Eswaran, Ariel D. Procaccia
Comments: to be published in 26th International Joint Conference on Artificial Intelligence (IJCAI-17)
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)

We consider goods that can be shared with k-hop neighbors (i.e., the set of
nodes within k hops from an owner) on a social network. We examine incentives
to buy such a good by devising game-theoretic models where each node decides
whether to buy the good or free ride. First, we find that social inefficiency,
specifically excessive purchase of the good, occurs in Nash equilibria. Second,
the social inefficiency decreases as k increases and thus a good can be shared
with more nodes. Third, and most importantly, the social inefficiency can also
be significantly reduced by charging free riders an access cost and paying it
to owners, leading to the conclusion that organizations and system designers
should impose such a cost. These findings are supported by our theoretical
analysis in terms of the price of anarchy and the price of stability; and by
simulations based on synthetic and real social networks.

Combining tabu search and graph reduction to solve the maximum balanced biclique problem

Yi Zhou, Jin-Kao Hao
Subjects: Artificial Intelligence (cs.AI)

The Maximum Balanced Biclique Problem is a well-known graph model with
relevant applications in diverse domains. This paper introduces a novel
algorithm, which combines an effective constraint-based tabu search procedure
and two dedicated graph reduction techniques. We verify the effectiveness of
the algorithm on 30 classical random benchmark graphs and 25 very large
real-life sparse graphs from the popular Koblenz Network Collection (KONECT).
The results show that the algorithm improves the best-known results (new lower
bounds) for 10 classical benchmarks and obtains the optimal solutions for 14
KONECT instances.

RankPL: A Qualitative Probabilistic Programming Language

Tjitze Rienstra
Subjects: Artificial Intelligence (cs.AI); Programming Languages (cs.PL)

In this paper we introduce RankPL, a modeling language that can be thought of
as a qualitative variant of a probabilistic programming language with a
semantics based on Spohn’s ranking theory. Broadly speaking, RankPL can be used
to represent and reason about processes that exhibit uncertainty expressible by
distinguishing “normal” from” surprising” events. RankPL allows (iterated)
revision of rankings over alternative program states and supports various types
of reasoning, including abduction and causal inference. We present the
language, its denotational semantics, and a number of practical examples. We
also discuss an implementation of RankPL that is available for download.

How to Train Your DRAGAN

Generative Adversarial Networks have emerged as an effective technique for
estimating data distributions. The basic setup consists of two deep networks
playing against each other in a zero-sum game setting. However, it is not
understood if the networks reach an equilibrium eventually and what dynamics
makes this possible. The current GAN training procedure, which involves
simultaneous gradient descent, lacks a clear game-theoretic justification in
the literature. In this paper, we introduce regret minimization as a technique
to reach equilibrium in games and use this to motivate the use of simultaneous
GD in GANs. In addition, we present a hypothesis that mode collapse, which is a
common occurrence in GAN training, happens due to the existence of spurious
local equilibria in non-convex games. Motivated by these insights, we develop
an algorithm called DRAGAN that is fast, simple to implement and achieves
competitive performance in a stable fashion across different architectures,
datasets (MNIST, CIFAR-10, and CelebA), and divergence measures with almost no
hyperparameter tuning.

Model-Based Planning in Discrete Action Spaces

Mikael Henaff, William F. Whitney, Yann LeCun
Subjects: Artificial Intelligence (cs.AI)

Planning actions using learned and differentiable forward models of the world
is a general approach which has a number of desirable properties, including
improved sample complexity over model-free RL methods, reuse of learned models
across different tasks, and the ability to perform efficient gradient-based
optimization in continuous action spaces. However, this approach does not apply
straightforwardly when the action space is discrete, which may have limited its
adoption. In this work, we introduce two discrete planning tasks inspired by
existing question-answering datasets and show that it is in fact possible to
effectively perform planning via backprop in discrete action spaces using two
simple yet principled modifications. Our experiments show that this approach
can significantly outperform model-free RL based methods and supervised
imitation learners.

Ask the Right Questions: Active Question Reformulation with Reinforcement Learning

Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Andrea Gesmundo, Neil Houlsby, Wojciech Gajewski, Wei Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

We propose an active question answering agent that learns to reformulate
questions and combine evidence to improve question answering. The agent sits
between the user and a black box question-answering system and learns to
optimally probe the system with natural language reformulations of the initial
question and to aggregate the evidence to return the best possible answer. The
system is trained end-to-end to maximize answer quality using policy gradient.
We evaluate on SearchQA, a dataset of complex questions extracted from
Jeopardy!. Our agent improves F1 by 11% over a state-of-the-art base model that
uses the original question/answer pairs.

A unified view of entropy-regularized Markov decision processes

Gergely Neu, Anders Jonsson, Vicenç Gómez
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

We propose a general framework for entropy-regularized average-reward
reinforcement learning in Markov decision processes (MDPs). Our approach is
based on extending the linear-programming formulation of policy optimization in
MDPs to accommodate convex regularization functions. Our key result is showing
that using the conditional entropy of the joint state-action distributions as
regularization yields a dual optimization problem closely resembling the
Bellman optimality equations. This result enables us to formalize a number of
state-of-the-art entropy-regularized reinforcement learning algorithms as
approximate variants of Mirror Descent or Dual Averaging, and thus to argue
about the convergence properties of these methods. In particular, we show that
the exact version of the TRPO algorithm of Schulman et al. (2015) actually
converges to the optimal policy, while the entropy-regularized policy gradient
methods of Mnih et al. (2016) may fail to converge to a fixed point. Finally,
we illustrate empirically the effects of using various regularization
techniques on learning performance in a simple reinforcement learning setup.

Note on Evolution and Forecasting of Requirements: Communications Example

Mark Sh. Levin
Comments: 8 pages, 8 figures, 9 tables
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)

Combinatorial evolution and forecasting of system requirements is examined.
The morphological model is used for a hierarchical requirements system (i.e.,
system parts, design alternatives for the system parts, ordinal estimates for
the alternatives). A set of system changes involves changes of the system
structure, component alternatives and their estimates. The composition process
of the forecast is based on combinatorial synthesis (knapsack problem, multiple
choice problem, hierarchical morphological design). An illustrative numerical
example for four-phase evolution and forecasting of requirements to
communications is described.

Statistical inference using SGD

Tianyang Li, Liu Liu, Anastasios Kyrillidis, Constantine Caramanis
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)

We present a novel method for frequentist statistical inference in
(M)-estimation problems, based on stochastic gradient descent (SGD) with a
fixed step size: we demonstrate that the average of such SGD sequences can be
used for statistical inference, after proper scaling. An intuitive analysis
using the Ornstein-Uhlenbeck process suggests that such averages are
asymptotically normal. From a practical perspective, our SGD-based inference
procedure is a first order method, and is well-suited for large scale problems.
To show its merits, we apply it to both synthetic and real datasets, and
demonstrate that its accuracy is comparable to classical statistical methods,
while requiring potentially far less computation.

Learning to Mix n-Step Returns: Generalizing lambda-Returns for Deep Reinforcement Learning

Sahil Sharma, Srivatsan Ramesh, Girish Raguvir J, Balaraman Ravindran
Comments: 10 pages + 11 page appendix
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI)

Reinforcement Learning (RL) can model complex behavior policies for
goal-directed sequential decision making tasks. A hallmark of RL algorithms is
Temporal Difference (TD) learning: value function for the current state is
moved towards a bootstrapped target that is estimated using next state’s value
function. (lambda)-returns generalize beyond 1-step returns and strike a
balance between Monte Carlo and TD learning methods. While lambda-returns have
been extensively studied in RL, they haven’t been explored a lot in Deep RL.
This paper’s first contribution is an exhaustive benchmarking of
lambda-returns. Although mathematically tractable, the use of exponentially
decaying weighting of n-step returns based targets in lambda-returns is a
rather ad-hoc design choice. Our second major contribution is that we propose a
generalization of lambda-returns called Confidence-based Autodidactic Returns
(CAR), wherein the RL agent learns the weighting of the n-step returns in an
end-to-end manner. This allows the agent to learn to decide how much it wants
to weigh the n-step returns based targets. In contrast, lambda-returns restrict
RL agents to use an exponentially decaying weighting scheme. Autodidactic
returns can be used for improving any RL algorithm which uses TD learning. We
empirically demonstrate that using sophisticated weighted mixtures of
multi-step returns (like CAR and lambda-returns) considerably outperforms the
use of n-step returns. We perform our experiments on the Asynchronous Advantage
Actor Critic (A3C) algorithm in the Atari 2600 domain.

Mixed Membership Word Embeddings for Computational Social Science

James Foulds
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Learning (cs.LG)

Word embeddings improve the performance of NLP systems by revealing the
hidden structural relationships between words. These models have recently risen
in popularity due to the performance of scalable algorithms trained in the big
data setting. Despite their success, word embeddings have seen very little use
in computational social science NLP tasks, presumably due to their reliance on
big data, and to a lack of interpretability. I propose a probabilistic
model-based word embedding method which can recover interpretable embeddings,
without big data. The key insight is to leverage the notion of mixed membership
modeling, in which global representations are shared, but individual entities
(i.e. dictionary words) are free to use these representations to uniquely
differing degrees. Leveraging connections to topic models, I show how to train
these models in high dimensions using a combination of state-of-the-art
techniques for word embeddings and topic modeling. Experimental results show an
improvement in predictive performance of up to 63% in MRR over the skip-gram on
small datasets. The models are interpretable, as embeddings of topics are used
to encode embeddings for words (and hence, documents) in a model-based way. I
illustrate this with two computational social science case studies, on NIPS
articles and State of the Union addresses.

Ensemble Sampling

Xiuyuan Lu, Benjamin Van Roy
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Learning (cs.LG)

Thompson sampling has emerged as an effective heuristic for a broad range of
online decision problems. In its basic form, the algorithm requires computing
and sampling from a posterior distribution over models, which is tractable only
for simple special cases. This paper develops ensemble sampling, which aims to
approximate Thompson sampling while maintaining tractability even in the face
of complex models such as neural networks. Ensemble sampling dramatically
expands on the range of applications for which Thompson sampling is viable. We
establish a theoretical basis that supports the approach and present
computational results that offer further insight.

Fast Change Point Detection on Dynamic Social Networks

Yu Wang, Aniket Chakrabarti, David Sivakoff, Srinivasan Parthasarathy
Comments: IJCAI’17
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)

A number of real world problems in many domains (e.g. sociology, biology,
political science and communication networks) can be modeled as dynamic
networks with nodes representing entities of interest and edges representing
interactions among the entities at different points in time. A common
representation for such models is the snapshot model – where a network is
defined at logical time-stamps. An important problem under this model is change
point detection. In this work we devise an effective and efficient
three-step-approach for detecting change points in dynamic networks under the
snapshot model. Our algorithm achieves up to 9X speedup over the
state-of-the-art while improving quality on both synthetic and real world
networks.

Learning to Factor Policies and Action-Value Functions: Factored Action Space Representations for Deep Reinforcement learning

Sahil Sharma, Aravind Suresh, Rahul Ramesh, Balaraman Ravindran
Comments: 11 pages + 7 pages appendix
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI)

Deep Reinforcement Learning (DRL) methods have performed well in an
increasing numbering of high-dimensional visual decision making domains. Among
all such visual decision making problems, those with discrete action spaces
often tend to have underlying compositional structure in the said action space.
Such action spaces often contain actions such as go left, go up as well as go
diagonally up and left (which is a composition of the former two actions). The
representations of control policies in such domains have traditionally been
modeled without exploiting this inherent compositional structure in the action
spaces. We propose a new learning paradigm, Factored Action space
Representations (FAR) wherein we decompose a control policy learned using a
Deep Reinforcement Learning Algorithm into independent components, analogous to
decomposing a vector in terms of some orthogonal basis vectors. This
architectural modification of the control policy representation allows the
agent to learn about multiple actions simultaneously, while executing only one
of them. We demonstrate that FAR yields considerable improvements on top of two
DRL algorithms in Atari 2600: FARA3C outperforms A3C (Asynchronous Advantage
Actor Critic) in 9 out of 14 tasks and FARAQL outperforms AQL (Asynchronous
n-step Q-Learning) in 9 out of 13 tasks.

Search Engine Guided Non-Parametric Neural Machine Translation

Jiatao Gu, Yong Wang, Kyunghyun Cho, Victor O.K. Li
Comments: 8 pages, 4 figures, 2 tables
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Learning (cs.LG)

In this paper, we extend an attention-based neural machine translation (NMT)
model by allowing it to access an entire training set of parallel sentence
pairs even after training. The proposed approach consists of two stages. In the
first stage–retrieval stage–, an off-the-shelf, black-box search engine is
used to retrieve a small subset of sentence pairs from a training set given a
source sentence. These pairs are further filtered based on a fuzzy matching
score based on edit distance. In the second stage–translation stage–, a novel
translation model, called translation memory enhanced NMT (TM-NMT), seamlessly
uses both the source sentence and a set of retrieved sentence pairs to perform
the translation. Empirical evaluation on three language pairs (En-Fr, En-De,
and En-Es) shows that the proposed approach significantly outperforms the
baseline approach and the improvement is more significant when more relevant
sentence pairs were retrieved.

Batch Reinforcement Learning on the Industrial Benchmark: First Experiences

The Particle Swarm Optimization Policy (PSO-P) has been recently introduced
and proven to produce remarkable results on interacting with academic
reinforcement learning benchmarks in an off-policy, batch-based setting. To
further investigate the properties and feasibility on real-world applications,
this paper investigates PSO-P on the so-called Industrial Benchmark (IB), a
novel reinforcement learning (RL) benchmark that aims at being realistic by
including a variety of aspects found in industrial applications, like
continuous state and action spaces, a high dimensional, partially observable
state space, delayed effects, and complex stochasticity. The experimental
results of PSO-P on IB are compared to results of closed-form control policies
derived from the model-based Recurrent Control Neural Network (RCNN) and the
model-free Neural Fitted Q-Iteration (NFQ). Experiments show that PSO-P is not
only of interest for academic benchmarks, but also for real-world industrial
applications, since it also yielded the best performing policy in our IB
setting. Compared to other well established RL techniques, PSO-P produced
outstanding results in performance and robustness, requiring only a relatively
low amount of effort in finding adequate parameters or making complex design
decisions.

AIDE: An algorithm for measuring the accuracy of probabilistic inference algorithms

Marco F. Cusumano-Towner, Vikash K. Mansinghka
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Learning (cs.LG)

Approximate probabilistic inference algorithms are central to many fields.
Examples include sequential Monte Carlo inference in robotics, variational
inference in machine learning, and Markov chain Monte Carlo inference in
statistics. A key problem faced by practitioners is measuring the accuracy of
an approximate inference algorithm on a specific dataset. This paper introduces
the auxiliary inference divergence estimator (AIDE), an algorithm for measuring
the accuracy of approximate inference algorithms. AIDE is based on the
observation that inference algorithms can be treated as probabilistic models
and the random variables used within the inference algorithm can be viewed as
auxiliary variables. This view leads to a new estimator for the symmetric KL
divergence between the output distributions of two inference algorithms. The
paper illustrates application of AIDE to algorithms for inference in
regression, hidden Markov, and Dirichlet process mixture models. The
experiments show that AIDE captures the qualitative behavior of a broad class
of inference algorithms and can detect failure modes of inference algorithms
that are missed by standard heuristics.

Information Retrieval

Learning to Rank Using Localized Geometric Mean Metrics

Yuxin Su, Irwin King, Michael Lyu
Comments: To appear in SIGIR’17
Subjects: Information Retrieval (cs.IR)

Many learning-to-rank (LtR) algorithms focus on query-independent model, in
which query and document do not lie in the same feature space, and the rankers
rely on the feature ensemble about query-document pair instead of the
similarity between query instance and documents. However, existing algorithms
do not consider local structures in query-document feature space, and are
fragile to irrelevant noise features. In this paper, we propose a novel
Riemannian metric learning algorithm to capture the local structures and
develop a robust LtR algorithm. First, we design a concept called extit{ideal
candidate document} to introduce metric learning algorithm to query-independent
model. Previous metric learning algorithms aiming to find an optimal metric
space are only suitable for query-dependent model, in which query instance and
documents belong to the same feature space and the similarity is directly
computed from the metric space. Then we extend the new and extremely fast
global Geometric Mean Metric Learning (GMML) algorithm to develop a localized
GMML, namely L-GMML. Based on the combination of local learned metrics, we
employ the popular Normalized Discounted Cumulative Gain~(NDCG) scorer and
Weighted Approximate Rank Pairwise (WARP) loss to optimize the extit{ideal
candidate document} for each query candidate set. Finally, we can quickly
evaluate all candidates via the similarity between the extit{ideal candidate
document} and other candidates. By leveraging the ability of metric learning
algorithms to describe the complex structural information, our approach gives
us a principled and efficient way to perform LtR tasks. The experiments on
real-world datasets demonstrate that our proposed L-GMML algorithm outperforms
the state-of-the-art metric learning to rank methods and the stylish
query-independent LtR algorithms regarding accuracy and computational
efficiency.

Personalized Ranking for Context-Aware Venue Suggestion

Mohammad Aliannejadi, Ida Mele, Fabio Crestani
Comments: The 32nd ACM SIGAPP Symposium On Applied Computing (SAC), Marrakech, Morocco, April 4-6, 2017
Subjects: Information Retrieval (cs.IR)

Making personalized and context-aware suggestions of venues to the users is
very crucial in venue recommendation. These suggestions are often based on
matching the venues’ features with the users’ preferences, which can be
collected from previously visited locations. In this paper we present a novel
user-modeling approach which relies on a set of scoring functions for making
personalized suggestions of venues based on venues content and reviews as well
as users context. Our experiments, conducted on the dataset of the TREC
Contextual Suggestion Track, prove that our methodology outperforms
state-of-the-art approaches by a significant margin.

Computation and Language

Ask the Right Questions: Active Question Reformulation with Reinforcement Learning

Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Andrea Gesmundo, Neil Houlsby, Wojciech Gajewski, Wei Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

We propose an active question answering agent that learns to reformulate
questions and combine evidence to improve question answering. The agent sits
between the user and a black box question-answering system and learns to
optimally probe the system with natural language reformulations of the initial
question and to aggregate the evidence to return the best possible answer. The
system is trained end-to-end to maximize answer quality using policy gradient.
We evaluate on SearchQA, a dataset of complex questions extracted from
Jeopardy!. Our agent improves F1 by 11% over a state-of-the-art base model that
uses the original question/answer pairs.

W2VLDA: Almost Unsupervised System for Aspect Based Sentiment Analysis

Aitor García-Pablos, Montse Cuadros, German Rigau
Subjects: Computation and Language (cs.CL)

With the increase of online customer opinions in specialised websites and
social networks, the necessity of automatic systems to help to organise and
classify customer reviews by domain-specific aspect/categories and sentiment
polarity is more important than ever. Supervised approaches to Aspect Based
Sentiment Analysis obtain good results for the domain/language their are
trained on, but having manually labelled data for training supervised systems
for all domains and languages use to be very costly and time consuming. In this
work we describe W2VLDA, an unsupervised system based on topic modelling, that
combined with some other unsupervised methods and a minimal configuration,
performs aspect/category classifiation, aspectterms/opinion-words separation
and sentiment polarity classification for any given domain and language. We
also evaluate the performance of the aspect and sentiment classification in the
multilingual SemEval 2016 task 5 (ABSA) dataset. We show competitive results
for several languages (English, Spanish, French and Dutch) and domains (hotels,
restaurants, electronic-devices).

Learning Semantic Relatedness From Human Feedback Using Metric Learning

Thomas Niebler, Martin Becker, Christian Pölitz, Andreas Hotho
Subjects: Computation and Language (cs.CL); Learning (cs.LG)

Assessing the degree of semantic relatedness between words is an important
task with a variety of semantic applications, such as ontology learning for the
Semantic Web, semantic search or query expansion. To accomplish this in an
automated fashion, many relatedness measures have been proposed. However, most
of these metrics only encode information contained in the underlying corpus and
thus do not directly model human intuition. To solve this, we propose to
utilize a metric learning approach to improve existing semantic relatedness
measures by learning from additional information, such as explicit human
feedback. For this, we argue to use word embeddings instead of traditional
high-dimensional vector representations in order to leverage their semantic
density and to reduce computational cost. We rigorously test our approach on
several domains including tagging data as well as publicly available embeddings
based on Wikipedia texts and navigation. Human feedback about semantic
relatedness for learning and evaluation is extracted from publicly available
datasets such as MEN or WS-353. We find that our method can significantly
improve semantic relatedness measures by learning from additional information,
such as explicit human feedback. For tagging data, we are the first to generate
and study embeddings. Our results are of special interest for ontology and
recommendation engineers, but also for any other researchers and practitioners
of Semantic Web techniques.

Recurrent Additive Networks

Kenton Lee, Omer Levy, Luke Zettlemoyer
Subjects: Computation and Language (cs.CL)

We introduce recurrent additive networks (RANs), a new gated RNN which is
distinguished by the use of purely additive latent state updates. At every time
step, the new state is computed as a gated component-wise sum of the input and
the previous state, without any of the non-linearities commonly used in RNN
transition dynamics. We formally show that RAN states are weighted sums of the
input vectors, and that the gates only contribute to computing the weights of
these sums. Despite this relatively simple functional form, experiments
demonstrate that RANs outperform both LSTMs and GRUs on benchmark language
modeling problems. This result shows that many of the non-linear computations
in LSTMs and related networks are not essential, at least for the problems we
consider, and suggests that the gates are doing more of the computational work
than previously understood.

Spelling Correction as a Foreign Language

Yingbo Zhou, Utkarsh Porwal, Roberto Konow
Subjects: Computation and Language (cs.CL)

In this paper, we reformulated the spell correction problem as a machine
translation task under the encoder-decoder framework. This reformulation
enabled us to use a single model for solving the problem that is traditionally
formulated as learning a language model and an error model. This model employs
multi-layer recurrent neural networks as an encoder and a decoder. We
demonstrate the effectiveness of this model using an internal dataset, where
the training data is automatically obtained from user logs. The model offers
competitive performance as compared to the state of the art methods but does
not require any feature engineering nor hand tuning between models.

Mixed Membership Word Embeddings for Computational Social Science

James Foulds
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Learning (cs.LG)

Word embeddings improve the performance of NLP systems by revealing the
hidden structural relationships between words. These models have recently risen
in popularity due to the performance of scalable algorithms trained in the big
data setting. Despite their success, word embeddings have seen very little use
in computational social science NLP tasks, presumably due to their reliance on
big data, and to a lack of interpretability. I propose a probabilistic
model-based word embedding method which can recover interpretable embeddings,
without big data. The key insight is to leverage the notion of mixed membership
modeling, in which global representations are shared, but individual entities
(i.e. dictionary words) are free to use these representations to uniquely
differing degrees. Leveraging connections to topic models, I show how to train
these models in high dimensions using a combination of state-of-the-art
techniques for word embeddings and topic modeling. Experimental results show an
improvement in predictive performance of up to 63% in MRR over the skip-gram on
small datasets. The models are interpretable, as embeddings of topics are used
to encode embeddings for words (and hence, documents) in a model-based way. I
illustrate this with two computational social science case studies, on NIPS
articles and State of the Union addresses.

Formalized Lambek Calculus in Higher Order Logic (HOL4)

Chun Tian
Comments: 37 pages
Subjects: Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

In this project, a rather complete proof-theoretical formalization of Lambek
Calculus (non-associative with arbitrary extensions) has been ported from Coq
proof assistent to HOL4 theorem prover, with some improvements and new
theorems.

Three deduction systems (Syntactic Calculus, Natural Deduction and Sequent
Calculus) of Lambek Calculus are defined with many related theorems proved. The
equivalance between these systems are formally proved. Finally, a formalization
of Sequent Calculus proofs (where Coq has built-in supports) has been designed
and implemented in HOL4. Some basic results including the sub-formula
properties of the so-called “cut-free” proofs are formally proved.

This work can be considered as the preliminary work towards a language parser
based on category grammars which is not multimodal but still has ability to
support context-sensitive languages through customized extensions.

Search Engine Guided Non-Parametric Neural Machine Translation

Jiatao Gu, Yong Wang, Kyunghyun Cho, Victor O.K. Li
Comments: 8 pages, 4 figures, 2 tables
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Learning (cs.LG)

In this paper, we extend an attention-based neural machine translation (NMT)
model by allowing it to access an entire training set of parallel sentence
pairs even after training. The proposed approach consists of two stages. In the
first stage–retrieval stage–, an off-the-shelf, black-box search engine is
used to retrieve a small subset of sentence pairs from a training set given a
source sentence. These pairs are further filtered based on a fuzzy matching
score based on edit distance. In the second stage–translation stage–, a novel
translation model, called translation memory enhanced NMT (TM-NMT), seamlessly
uses both the source sentence and a set of retrieved sentence pairs to perform
the translation. Empirical evaluation on three language pairs (En-Fr, En-De,
and En-Es) shows that the proposed approach significantly outperforms the
baseline approach and the improvement is more significant when more relevant
sentence pairs were retrieved.

On-the-fly Operation Batching in Dynamic Computation Graphs

Graham Neubig, Yoav Goldberg, Chris Dyer
Subjects: Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)

Dynamic neural network toolkits such as PyTorch, DyNet, and Chainer offer
more flexibility for implementing models that cope with data of varying
dimensions and structure, relative to toolkits that operate on statically
declared computations (e.g., TensorFlow, CNTK, and Theano). However, existing
toolkits – both static and dynamic – require that the developer organize the
computations into the batches necessary for exploiting high-performance
algorithms and hardware. This batching task is generally difficult, but it
becomes a major hurdle as architectures become complex. In this paper, we
present an algorithm, and its implementation in the DyNet toolkit, for
automatically batching operations. Developers simply write minibatch
computations as aggregations of single instance computations, and the batching
algorithm seamlessly executes them, on the fly, using computationally efficient
batched operations. On a variety of tasks, we obtain throughput similar to that
obtained with manual batches, as well as comparable speedups over
single-instance learning on architectures that are impractical to batch
manually.

A Regularized Framework for Sparse and Structured Neural Attention

Vlad Niculae, Mathieu Blondel
Comments: 21 pages, including appendix
Subjects: Machine Learning (stat.ML); Computation and Language (cs.CL); Learning (cs.LG)

Modern neural networks are often augmented with an attention mechanism, which
tells the network where to focus within the input. We propose in this paper a
new framework for sparse and structured attention, building upon a max operator
regularized with a strongly convex function. We show that this operator is
differentiable and that its gradient defines a mapping from real values to
probabilities, suitable as an attention mechanism. Our framework includes
softmax and a slight generalization of the recently-proposed sparsemax as
special cases. However, we also show how our framework can incorporate modern
structured penalties, resulting in new attention mechanisms that focus on
entire segments or groups of an input, encouraging parsimony and
interpretability. We derive efficient algorithms to compute the forward and
backward passes of these attention mechanisms, enabling their use in a neural
network trained with backpropagation. To showcase their potential as a drop-in
replacement for existing attention mechanisms, we evaluate them on three
large-scale tasks: textual entailment, machine translation, and sentence
summarization. Our attention mechanisms improve interpretability without
sacrificing performance; notably, on textual entailment and summarization, we
outperform the existing attention mechanisms based on softmax and sparsemax.

Softmax Q-Distribution Estimation for Structured Prediction: A Theoretical Interpretation for RAML

Xuezhe Ma, Pengcheng Yin, Jingzhou Liu, Graham Neubig, Eduard Hovy
Comments: Under Review of NIPS 2017
Subjects: Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)

Reward augmented maximum likelihood (RAML) is a simple and effective learning
framework to directly optimize towards the reward function in structured
prediction tasks. RAML incorporates task-specific reward by performing
maximum-likelihood updates on candidate outputs sampled according to an
exponentiated payoff distribution, which gives higher probabilities to
candidates that are close to the reference output. While RAML is notable for
its simplicity, efficiency, and its impressive empirical successes, the
theoretical properties of RAML, especially the behavior of the exponentiated
payoff distribution, has not been examined thoroughly. In this work, we
introduce softmax Q-distribution estimation, a novel theoretical interpretation
of RAML, which reveals the relation between RAML and Bayesian decision theory.
The softmax Q-distribution can be regarded as a smooth approximation of Bayes
decision boundary, and the Bayes decision rule is achieved by decoding with
this Q-distribution. We further show that RAML is equivalent to approximately
estimating the softmax Q-distribution. Experiments on three structured
prediction tasks with rewards defined on sequential (named entity recognition),
tree-based (dependency parsing) and irregular (machine translation) structures
show notable improvements over maximum likelihood baselines.

Distributed, Parallel, and Cluster Computing

Symmetry Breaking in the Congest Model: Time- and Message-Efficient Algorithms for Ruling Sets

Shreyas Pai, Gopal Pandurangan, Sriram V. Pemmaraju, Talal Riaz, Peter Robinson
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

We study local symmetry breaking problems in the CONGEST model, focusing on
ruling set problems, which generalize the fundamental Maximal Independent Set
(MIS) problem. A (eta)-ruling set is an independent set such that every node
in the graph is at most (eta) hops from a node in the independent set. Our
work is motivated by the following central question: can we break the
(Theta(log n)) time complexity barrier and the (Theta(m)) message complexity
barrier in the CONGEST model for MIS or closely-related symmetry breaking
problems? We present the following results:

– Time Complexity: We show that we can break the (O(log n)) “barrier” for 2-
and 3-ruling sets. We compute 3-ruling sets in (Oleft(frac{log n}{log log
n}
ight)) rounds with high probability (whp). More generally we show that
2-ruling sets can be computed in (Oleft(log Delta cdot (log n)^{1/2 +
varepsilon} + frac{log n}{loglog n}
ight)) rounds for any (varepsilon >
0), which is (o(log n)) for a wide range of (Delta) values (e.g., (Delta =
2^{(log n)^{1/2-varepsilon}})). These are the first 2- and 3-ruling set
algorithms to improve over the (O(log n))-round complexity of Luby’s algorithm
in the CONGEST model.

– Message Complexity: We show an (Omega(n^2)) lower bound on the message
complexity of computing an MIS (i.e., 1-ruling set) which holds also for
randomized algorithms and present a contrast to this by showing a randomized
algorithm for 2-ruling sets that, whp, uses only (O(n log^2 n)) messages and
runs in (O(Delta log n)) rounds. This is the first message-efficient
algorithm known for ruling sets, which has message complexity nearly linear in
(n) (which is optimal up to a polylogarithmic factor).

Report of the HPC Correctness Summit, Jan 25–26, 2017, Washington, DC

Ganesh Gopalakrishnan, Paul D. Hovland, Costin Iancu, Sriram Krishnamoorthy, Ignacio Laguna, Richard A. Lethin, Koushik Sen, Stephen F. Siegel, Armando Solar-Lezama
Comments: 57 pages
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Maintaining leadership in HPC requires the ability to support simulations at
large scales and fidelity. In this study, we detail one of the most significant
productivity challenges in achieving this goal, namely the increasing
proclivity to bugs, especially in the face of growing hardware and software
heterogeneity and sheer system scale. We identify key areas where timely new
research must be proactively begun to address these challenges, and create new
correctness tools that must ideally play a significant role even while ramping
up toward exacale. We close with the proposal for a two-day workshop in which
the problems identified in this report can be more broadly discussed, and
specific plans to launch these new research thrusts identified.

Broadcasting in Noisy Radio Networks

Keren Censor-Hillel, Bernhard Haeupler, D. Ellis Hershkowitz, Goran Zuzic
Comments: Principles of Distributed Computing 2017
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)

The widely-studied radio network model [Chlamtac and Kutten, 1985] is a
graph-based description that captures the inherent impact of collisions in
wireless communication. In this model, the strong assumption is made that node
(v) receives a message from a neighbor if and only if exactly one of its
neighbors broadcasts.

We relax this assumption by introducing a new noisy radio network model in
which random faults occur at senders or receivers. Specifically, for a constant
noise parameter (p in [0,1)), either every sender has probability (p) of
transmitting noise or every receiver of a single transmission in its
neighborhood has probability (p) of receiving noise.

We first study single-message broadcast algorithms in noisy radio networks
and show that the Decay algorithm [Bar-Yehuda et al., 1992] remains robust in
the noisy model while the diameter-linear algorithm of Gasieniec et al., 2007
does not. We give a modified version of the algorithm of Gasieniec et al., 2007
that is robust to sender and receiver faults, and extend both this modified
algorithm and the Decay algorithm to robust multi-message broadcast algorithms.

We next investigate the extent to which (network) coding improves throughput
in noisy radio networks. We address the previously perplexing result of Alon et
al. 2014 that worst case coding throughput is no better than worst case routing
throughput up to constants: we show that the worst case throughput performance
of coding is, in fact, superior to that of routing — by a (Theta(log(n)))
gap — provided receiver faults are introduced. However, we show that any
coding or routing scheme for the noiseless setting can be transformed to be
robust to sender faults with only a constant throughput overhead. These
transformations imply that the results of Alon et al., 2014 carry over to noisy
radio networks with sender faults.

Space Complexity of Fault Tolerant Register Emulations

Gregory Chockler, Alexander Spiegelman
Comments: Conference version appears in Proceedings of PODC ’17
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Driven by the rising popularity of cloud storage, the costs associated with
implementing reliable storage services from a collection of fault-prone servers
have recently become an actively studied question. The well-known ABD result
shows that an f-tolerant register can be emulated using a collection of 2f + 1
fault-prone servers each storing a single read-modify-write object type, which
is known to be optimal. In this paper we generalize this bound: we investigate
the inherent space complexity of emulating reliable multi-writer registers as a
fucntion of the type of the base objects exposed by the underlying servers, the
number of writers to the emulated register, the number of available servers,
and the failure threshold. We establish a sharp separation between registers,
and both max-registers (the base object types assumed by ABD) and CAS in terms
of the resources (i.e., the number of base objects of the respective types)
required to support the emulation; we show that no such separation exists
between max-registers and CAS. Our main technical contribution is lower and
upper bounds on the resources required in case the underlying base objects are
fault-prone read/write registers. We show that the number of required registers
is directly proportional to the number of writers and inversely proportional to
the number of servers.

Espresso: Efficient Forward Propagation for BCNNs

There are many applications scenarios for which the computational performance
and memory footprint of the prediction phase of Deep Neural Networks (DNNs)
needs to be optimized. Binary Neural Networks (BDNNs) have been shown to be an
effective way of achieving this objective. In this paper, we show how
Convolutional Neural Networks (CNNs) can be implemented using binary
representations. Espresso is a compact, yet powerful library written in C/CUDA
that features all the functionalities required for the forward propagation of
CNNs, in a binary file less than 400KB, without any external dependencies.
Although it is mainly designed to take advantage of massive GPU parallelism,
Espresso also provides an equivalent CPU implementation for CNNs. Espresso
provides special convolutional and dense layers for BCNNs, leveraging
bit-packing and bit-wise computations for efficient execution. These techniques
provide a speed-up of matrix-multiplication routines, and at the same time,
reduce memory usage when storing parameters and activations. We experimentally
show that Espresso is significantly faster than existing implementations of
optimized binary neural networks ((approx) 2 orders of magnitude). Espresso is
released under the Apache 2.0 license and is available at
this http URL

TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

High network communication cost for synchronizing gradients and parameters is
the well-known bottleneck of distributed training. In this work, we propose
TernGrad that uses ternary gradients to accelerate distributed deep learning in
data parallelism. Our approach requires only three numerical levels {-1,0,1}
which can aggressively reduce the communication time. We mathematically prove
the convergence of TernGrad under the assumption of a bound on gradients.
Guided by the bound, we propose layer-wise ternarizing and gradient clipping to
improve its convergence. Our experiments show that applying TernGrad on AlexNet
does not incur any accuracy loss and can even improve accuracy. The accuracy
loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a
performance model is proposed to study the scalability of TernGrad. Experiments
show significant speed gains for various deep neural networks.

MITHRIL: Mining Sporadic Associations for Cache Prefetching

Juncheng Yang, Reza Karimi, Trausti Sæmundsson, Avani Wildani, Ymir Vigfusson
Subjects: Performance (cs.PF); Distributed, Parallel, and Cluster Computing (cs.DC); Operating Systems (cs.OS)

The growing pressure on cloud application scalability has accentuated storage
performance as a critical bottle- neck. Although cache replacement algorithms
have been extensively studied, cache prefetching – reducing latency by
retrieving items before they are actually requested remains an underexplored
area. Existing approaches to history-based prefetching, in particular, provide
too few benefits for real systems for the resources they cost. We propose
MITHRIL, a prefetching layer that efficiently exploits historical patterns in
cache request associations. MITHRIL is inspired by sporadic association rule
mining and only relies on the timestamps of requests. Through evaluation of 135
block-storage traces, we show that MITHRIL is effective, giving an average of a
55% hit ratio increase over LRU and PROBABILITY GRAPH, a 36% hit ratio gain
over AMP at reasonable cost. We further show that MITHRIL can supplement any
cache replacement algorithm and be readily integrated into existing systems.
Furthermore, we demonstrate the improvement comes from MITHRIL being able to
capture mid-frequency blocks.

Dynamic Analysis of the Arrow Distributed Directory Protocol in General Networks

Abdolhamid Ghodselahi, Fabian Kuhn
Comments: 31 pages, 3 figures
Subjects: Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC)

The Arrow protocol is a simple and elegant protocol to coordinate exclusive
access to a shared object in a network. The protocol solves the underlying
distributed queueing problem by using path reversal on a pre-computed spanning
tree (or any other tree topology simulated on top of the given network).

It is known that the Arrow protocol solves the problem with a competitive
ratio of O(log D) on trees of diameter D. This implies a distributed queueing
algorithm with competitive ratio O(s*log D) for general networks with a
spanning tree of diameter D and stretch s. In this work we show that when
running the Arrow protocol on top of the well-known probabilistic tree
embedding of Fakcharoenphol, Rao, and Talwar [STOC 03], we obtain a randomized
distributed queueing algorithm with a competitive ratio of O(log n) even on
general network topologies. The result holds even if the queueing requests
occur in an arbitrarily dynamic and concurrent fashion and even if
communication is asynchronous. From a technical point of view, the main of the
paper shows that the competitive ratio of the Arrow protocol is constant on a
special family of tree topologies, known as hierarchically well separated
trees.

BRPL: Backpressure RPL for High-throughput and Mobile IoTs

Yad Tahir, Shusen Yang, Julie McCann
Comments: 14 pages, to appear in IEEE Transactions on Mobile Computing, 2017
Subjects: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC)

RPL, an IPv6 routing protocol for Low power Lossy Networks (LLNs), is
considered to be the de facto routing standard for the Internet of Things
(IoT). However, more and more experimental results demonstrate that RPL
performs poorly when it comes to throughput and adaptability to network
dynamics. This significantly limits the application of RPL in many practical
IoT scenarios, such as an LLN with high-speed sensor data streams and mobile
sensing devices. To address this issue, we develop BRPL, an extension of RPL,
providing a practical approach that allows users to smoothly combine any RPL
Object Function (OF) with backpressure routing. BRPL uses two novel algorithms,
QuickTheta and QuickBeta, to support time-varying data traffic loads and node
mobility respectively. We implement BRPL on Contiki OS, an open-source
operating system for the Internet of Things. We conduct an extensive evaluation
using both real-world experiments based on the FIT IoT-LAB testbed and
large-scale simulations using Cooja over 18 virtual servers on the Cloud. The
evaluation results demonstrate that BRPL not only is fully backward compatible
with RPL (i.e. devices running RPL and BRPL can work together seamlessly), but
also significantly improves network throughput and adaptability to changes in
network topologies and data traffic loads. The observed packet loss reduction
in mobile networks is, at a minimum, 60% and up to 1000% can be seen in extreme
cases.

BAMHealthCloud: A Biometric Authentication and Data Management System for Healthcare Data in Cloud

Kashish A. Shakil, Farhana J. Zareen, Mansaf Alam, Suraiya Jabin
Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC)

Advancements in healthcare industry with new technology and population growth
has given rise to security threat to our most personal data. The healthcare
data management system consists of records in different formats such as text,
numeric, pictures and videos leading to data which is big and unstructured.
Also, hospitals have several branches at different locations throughout a
country and overseas. In view of these requirements a cloud based healthcare
management system can be an effective solution for efficient health care data
management. One of the major concerns of a cloud based healthcare system is the
security aspect. It includes theft to identity, tax fraudulence, insurance
frauds, medical frauds and defamation of high profile patients. Hence, a secure
data access and retrieval is needed in order to provide security of critical
medical records in health care management system. Biometric authentication
mechanism is suitable in this scenario since it overcomes the limitations of
token theft and forgetting passwords in conventional token id-password
mechanism used for providing security. It also has high accuracy rate for
secure data access and retrieval. In this paper we propose BAMHealthCloud which
is a cloud based system for management of healthcare data, it ensures security
of data through biometric authentication. It has been developed after
performing a detailed case study on healthcare sector in a developing country.
Training of the signature samples for authentication purpose has been performed
in parallel on hadoop MapReduce framework using Resilient Backpropagation
neural network. From rigorous experiments it can be concluded that it achieves
a speedup of 9x, Equal error rate (EER) of 0.12, sensitivity of 0.98 and
specificity of 0.95 as compared to other approaches existing in literature.

Learning

Dynamic Partition of Complex Networks

Lin F. Yang, Vladimir Braverman, Tuo Zhao, Mengdi Wang
Subjects: Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Finding the reduced-dimensional structure is critical to understanding
complex networks. Existing approaches such as spectral clustering are
applicable only when the full network is explicitly observed. In this paper, we
focus on the online factorization and partition of implicit large-scale
networks based on observations from an associated random walk. We propose an
efficient and scalable nonconvex stochastic gradient algorithm. It is able to
process dependent data dynamically generated by the underlying network and
learn a low-dimensional representation for each vertex. By applying a diffusion
approximation analysis, we show that the nonconvex stochastic algorithm
achieves nearly optimal sample complexity. Once given the learned
low-dimensional representations, we further apply clustering techniques to
recover the network partition. We show that, when the associated Markov process
is lumpable, one can recover the partition exactly with high probability. The
proposed approach is experimented on Manhattan taxi data.

TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

High network communication cost for synchronizing gradients and parameters is
the well-known bottleneck of distributed training. In this work, we propose
TernGrad that uses ternary gradients to accelerate distributed deep learning in
data parallelism. Our approach requires only three numerical levels {-1,0,1}
which can aggressively reduce the communication time. We mathematically prove
the convergence of TernGrad under the assumption of a bound on gradients.
Guided by the bound, we propose layer-wise ternarizing and gradient clipping to
improve its convergence. Our experiments show that applying TernGrad on AlexNet
does not incur any accuracy loss and can even improve accuracy. The accuracy
loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a
performance model is proposed to study the scalability of TernGrad. Experiments
show significant speed gains for various deep neural networks.

SmartPaste: Learning to Adapt Source Code

Miltiadis Allamanis, Marc Brockschmidt
Subjects: Learning (cs.LG); Software Engineering (cs.SE)

Deep Neural Networks have been shown to succeed at a range of natural
language tasks such as machine translation and text summarization. While tasks
on source code (ie, formal languages) have been considered recently, most work
in this area does not attempt to capitalize on the unique opportunities offered
by its known syntax and structure. In this work, we introduce SmartPaste, a
first task that requires to use such information. The task is a variant of the
program repair problem that requires to adapt a given (pasted) snippet of code
to surrounding, existing source code. As first solutions, we design a set of
deep neural models that learn to represent the context of each variable
location and variable usage in a data flow-sensitive way. Our evaluation
suggests that our models can learn to solve the SmartPaste task in many cases,
achieving 58.6% accuracy, while learning meaningful representation of variable
usages.

On-the-fly Operation Batching in Dynamic Computation Graphs

Graham Neubig, Yoav Goldberg, Chris Dyer
Subjects: Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)

Dynamic neural network toolkits such as PyTorch, DyNet, and Chainer offer
more flexibility for implementing models that cope with data of varying
dimensions and structure, relative to toolkits that operate on statically
declared computations (e.g., TensorFlow, CNTK, and Theano). However, existing
toolkits – both static and dynamic – require that the developer organize the
computations into the batches necessary for exploiting high-performance
algorithms and hardware. This batching task is generally difficult, but it
becomes a major hurdle as architectures become complex. In this paper, we
present an algorithm, and its implementation in the DyNet toolkit, for
automatically batching operations. Developers simply write minibatch
computations as aggregations of single instance computations, and the batching
algorithm seamlessly executes them, on the fly, using computationally efficient
batched operations. On a variety of tasks, we obtain throughput similar to that
obtained with manual batches, as well as comparable speedups over
single-instance learning on architectures that are impractical to batch
manually.

Nonparametric Online Regression while Learning the Metric

Ilja Kuzborskij, Nicolò Cesa-Bianchi
Subjects: Learning (cs.LG)

We study algorithms for online nonparametric regression that learn the
directions along which the regression function is smoother. Our algorithm
learns the Mahalanobis metric based on the gradient outer product matrix
(oldsymbol{G}) of the regression function (automatically adapting to the
effective rank of this matrix), while simultaneously bounding the regret —on
the same data sequence— in terms of the spectrum of (oldsymbol{G}). As a
preliminary step in our analysis, we generalize a nonparametric online learning
algorithm by Hazan and Megiddo by enabling it to compete against functions
whose Lipschitzness is measured with respect to an arbitrary Mahalanobis
metric.

Stabilizing GAN Training with Multiple Random Projections

Behnam Neyshabur, Srinadh Bhojanapalli, Ayan Chakrabarti
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Training generative adversarial networks is unstable in high-dimensions when
the true data distribution lies on a lower-dimensional manifold. The
discriminator is then easily able to separate nearly all generated samples
leaving the generator without meaningful gradients. We propose training a
single generator simultaneously against an array of discriminators, each of
which looks at a different random low-dimensional projection of the data. We
show that individual discriminators then provide stable gradients to the
generator, and that the generator learns to produce samples consistent with the
full data distribution to satisfy all discriminators. We demonstrate the
practical utility of this approach experimentally, and show that it is able to
produce image samples with higher quality than traditional training with a
single discriminator.

Sparse hierarchical interaction learning with epigraphical projection

Mingyuan Jiu, Nelly Pustelnik, Stefan Janaqi, Meriam Chebre, Philippe Ricoux
Subjects: Learning (cs.LG)

This work focus on regression optimization problem with hierarchical
interactions between variables, which is beyond the additive models in the
traditional linear regression. We investigate two different fashions in the
literature to deal with this problem: “hierNet” and structural-sparsity
regularization, and study their connections, then we propose a primal-dual
proximal algorithm based on epigraphical projection to optimize the learning
problem. The experimental setting first highlight the improvement of the
proposed procedure compare to state-of-the-art methods based on FISTA or ADMM
and second we provide comparisons between the different hierarchical
penalization. The experiments are conducted both on the synthetic and real
data.

Minimax Statistical Learning and Domain Adaptation with Wasserstein Distances

Jaeho Lee, Maxim Raginsky
Subjects: Learning (cs.LG)

As opposed to standard empirical risk minimization (ERM), distributionally
robust optimization aims to minimize the worst-case risk over a larger
ambiguity set containing the original empirical distribution of the training
data. In this work, we describe a minimax framework for statistical learning
with ambiguity sets given by balls in Wasserstein space. In particular, we
prove a generalization bound that involves the covering number properties of
the original ERM problem. As an illustrative example, we provide generalization
guarantees for domain adaptation problems where the Wasserstein distance
between the source and target domain distributions can be reliably estimated
from unlabeled samples.

Information-theoretic analysis of generalization capability of learning algorithms

Aolin Xu, Maxim Raginsky
Subjects: Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)

We derive upper bounds on the generalization error of a learning algorithm in
terms of the mutual information between its input and output. The upper bounds
provide theoretical guidelines for striking the right balance between data fit
and generalization by controlling the input-output mutual information of a
learning algorithm. The results can also be used to analyze the generalization
capability of learning algorithms under adaptive composition, and the
bias-accuracy tradeoffs in adaptive data analytics. Our work extends and leads
to nontrivial improvements on the recent results of Russo and Zou.

A unified view of entropy-regularized Markov decision processes

Gergely Neu, Anders Jonsson, Vicenç Gómez
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

We propose a general framework for entropy-regularized average-reward
reinforcement learning in Markov decision processes (MDPs). Our approach is
based on extending the linear-programming formulation of policy optimization in
MDPs to accommodate convex regularization functions. Our key result is showing
that using the conditional entropy of the joint state-action distributions as
regularization yields a dual optimization problem closely resembling the
Bellman optimality equations. This result enables us to formalize a number of
state-of-the-art entropy-regularized reinforcement learning algorithms as
approximate variants of Mirror Descent or Dual Averaging, and thus to argue
about the convergence properties of these methods. In particular, we show that
the exact version of the TRPO algorithm of Schulman et al. (2015) actually
converges to the optimal policy, while the entropy-regularized policy gradient
methods of Mnih et al. (2016) may fail to converge to a fixed point. Finally,
we illustrate empirically the effects of using various regularization
techniques on learning performance in a simple reinforcement learning setup.

Backprop without Learning Rates Through Coin Betting

Francesco Orabona, Tatiana Tommasi
Subjects: Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Deep learning methods achieve state-of-the-art performance in many
application scenarios. Yet, these methods require a significant amount of
hyperparameters tuning in order to achieve the best results. In particular,
tuning the learning rates in the stochastic optimization process is still one
of the main bottlenecks. In this paper, we propose a new stochastic gradient
descent procedure for deep networks that does not require any learning rate
setting. Contrary to previous methods, we do not adapt the learning rates nor
we make use of the assumed curvature of the objective function. Instead, we
reduce the optimization process to a game of betting on a coin and propose a
learning rate free optimal algorithm for this scenario. Theoretical convergence
is proven for convex and quasi-convex functions and empirical evidence shows
the advantage of our algorithm over popular stochastic gradient algorithms.

Follow the Signs for Robust Stochastic Optimization

Lukas Balles, Philipp Hennig
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Stochastic noise on gradients is now a common feature in machine learning. It
complicates the design of optimization algorithms, and its effect can be
unintuitive: We show that in some settings, particularly those of low
signal-to-noise ratio, it can be helpful to discard all but the signs of
stochastic gradient elements. In fact, we argue that three popular existing
methods already approximate this very paradigm. We devise novel stochastic
optimization algorithms that explicitly follow stochastic sign estimates while
appropriately accounting for their uncertainty. These methods favorably compare
to the state of the art on a number of benchmark problems.

An Out-of-the-box Full-network Embedding for Convolutional Neural Networks

Transfer learning for feature extraction can be used to exploit deep
representations in contexts where there is very few training data, where there
are limited computational resources, or when tuning the hyper-parameters needed
for training is not an option. While previous contributions to feature
extraction propose embeddings based on a single layer of the network, in this
paper we propose a full-network embedding which successfully integrates
convolutional and fully connected features, coming from all layers of a deep
convolutional neural network. To do so, the embedding normalizes features in
the context of the problem, and discretizes their values to reduce noise and
regularize the embedding space. Significantly, this also reduces the
computational cost of processing the resultant representations. The proposed
method is shown to outperform single layer embeddings on several image
classification tasks, while also being more robust to the choice of the
pre-trained model used for obtaining the initial features. The performance gap
in classification accuracy between thoroughly tuned solutions and the
full-network embedding is also reduced, which makes of the proposed approach a
competitive solution for a large set of applications.

Individualized Risk Prognosis for Critical Care Patients: A Multi-task Gaussian Process Model

Ahmed M. Alaa, Jinsung Yoon, Scott Hu, Mihaela van der Schaar
Subjects: Learning (cs.LG)

We report the development and validation of a data-driven real-time risk
score that provides timely assessments for the clinical acuity of ward patients
based on their temporal lab tests and vital signs, which allows for timely
intensive care unit (ICU) admissions. Unlike the existing risk scoring
technologies, the proposed score is individualized; it uses the electronic
health record (EHR) data to cluster the patients based on their static
covariates into subcohorts of similar patients, and then learns a separate
temporal, non-stationary multi-task Gaussian Process (GP) model that captures
the physiology of every subcohort. Experiments conducted on data from a
heterogeneous cohort of 6,094 patients admitted to the Ronald Reagan UCLA
medical center show that our risk score significantly outperforms the
state-of-the-art risk scoring technologies, such as the Rothman index and MEWS,
in terms of timeliness, true positive rate (TPR), and positive predictive value
(PPV). In particular, the proposed score increases the AUC with 20% and 38% as
compared to Rothman index and MEWS respectively, and can predict ICU admissions
8 hours before clinicians at a PPV of 35% and a TPR of 50%. Moreover, we show
that the proposed risk score allows for better decisions on when to discharge
clinically stable patients from the ward, thereby improving the efficiency of
hospital resource utilization.

CayleyNets: Graph Convolutional Neural Networks with Complex Rational Spectral Filters

Ron Levie, Federico Monti, Xavier Bresson, Michael M. Bronstein
Subjects: Learning (cs.LG)

The rise of graph-structured data such as social networks, regulatory
networks, citation graphs, and functional brain networks, in combination with
resounding success of deep learning in various applications, has brought the
interest in generalizing deep learning models to non-Euclidean domains. In this
paper, we introduce a new spectral domain convolutional architecture for deep
learning on graphs. The core ingredient of our model is a new class of
parametric rational complex functions (Cayley polynomials) allowing to
efficiently compute localized regular filters on graphs that specialize on
frequency bands of interest. Our model scales linearly with the size of the
input data for sparsely-connected graphs, can handle different constructions of
Laplacian operators, and typically requires less parameters than previous
models. Extensive experimental results show the superior performance of our
approach on various graph learning problems.

Streaming Binary Sketching based on Subspace Tracking and Diagonal Uniformization

Anne Morvan, Antoine Souloumiac, Cédric Gouy-Pailler, Jamal Atif
Subjects: Learning (cs.LG)

In this paper, we address the problem of learning compact
similarity-preserving embeddings for massive high-dimensional streams of data
in order to perform efficient similarity search. We present a new method for
computing binary compressed representations – extit{sketches}- of
high-dimensional real feature vectors. Given an expected code length (c) and
high-dimensional input data points, our algorithm provides a binary code of (c)
bits aiming at preserving the distance between the points from the original
high-dimensional space. Our offline version of the algorithm outperforms the
offline state-of-the-art methods regarding their computation time complexity
and have a similar quality of the sketches. It also provides convergence
guarantees. Moreover, our algorithm can be straightforwardly used in the
streaming context by not requiring neither the storage of the whole dataset nor
a chunk. We demonstrate the quality of our binary sketches through extensive
experiments on real data for the nearest neighbors search task in the offline
and online settings.

Classification Using Proximity Catch Digraphs (Technical Report)

Artür Manukyan, Elvan Ceyhan
Comments: 41 pages, 16 figures
Subjects: Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)

We employ random geometric digraphs to construct semi-parametric classifiers.
These data-random digraphs are from parametrized random digraph families called
proximity catch digraphs (PCDs). A related geometric digraph family, class
cover catch digraph (CCCD), has been used to solve the class cover problem by
using its approximate minimum dominating set. CCCDs showed relatively good
performance in the classification of imbalanced data sets, and although CCCDs
have a convenient construction in (mathbb{R}^d), finding minimum dominating
sets is NP-hard and its probabilistic behaviour is not mathematically tractable
except for (d=1). On the other hand, a particular family of PCDs, called
emph{proportional-edge} PCDs (PE-PCDs), has mathematical tractable minimum
dominating sets in (mathbb{R}^d); however their construction in higher
dimensions may be computationally demanding. More specifically, we show that
the classifiers based on PE-PCDs are prototype-based classifiers such that the
exact minimum number of prototypes (equivalent to minimum dominating sets) are
found in polynomial time on the number of observations. We construct two types
of classifiers based on PE-PCDs. One is a family of hybrid classifiers depend
on the location of the points of the training data set, and another type is a
family of classifiers solely based on class covers. We assess the
classification performance of our PE-PCD based classifiers by extensive Monte
Carlo simulations, and compare them with that of other commonly used
classifiers. We also show that, similar to CCCD classifiers, our classifiers
are relatively better in classification in the presence of class imbalance.

Infrastructure for Usable Machine Learning: The Stanford DAWN Project

Peter Bailis, Kunle Olukoton, Christopher Re, Matei Zaharia
Subjects: Learning (cs.LG); Databases (cs.DB); Machine Learning (stat.ML)

Despite incredible recent advances in machine learning, building machine
learning applications remains prohibitively time-consuming and expensive for
all but the best-trained, best-funded engineering organizations. This expense
comes not from a need for new and improved statistical models but instead from
a lack of systems and tools for supporting end-to-end machine learning
application development, from data preparation and labeling to
productionization and monitoring. In this document, we outline opportunities
for infrastructure supporting usable, end-to-end machine learning applications
in the context of the nascent DAWN (Data Analytics for What’s Next) project at
Stanford.

Shake-Shake regularization

Xavier Gastaldi
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

The method introduced in this paper aims at helping deep learning
practitioners faced with an overfit problem. The idea is to replace, in a
multi-branch network, the standard summation of parallel branches with a
stochastic affine combination. Applied to 3-branch residual networks,
shake-shake regularization improves on the best single shot published results
on CIFAR-10 and CIFAR-100 by reaching test errors of 2.86% and 15.85%.
Experiments on architectures without skip connections or Batch Normalization
show encouraging results and open the door to a large set of applications. Code
is available at this https URL

Statistical inference using SGD

We present a novel method for frequentist statistical inference in
(M)-estimation problems, based on stochastic gradient descent (SGD) with a
fixed step size: we demonstrate that the average of such SGD sequences can be
used for statistical inference, after proper scaling. An intuitive analysis
using the Ornstein-Uhlenbeck process suggests that such averages are
asymptotically normal. From a practical perspective, our SGD-based inference
procedure is a first order method, and is well-suited for large scale problems.
To show its merits, we apply it to both synthetic and real datasets, and
demonstrate that its accuracy is comparable to classical statistical methods,
while requiring potentially far less computation.

Nice latent variable models have log-rank

Madeleine Udell, Alex Townsend
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Matrices of low rank are pervasive in big data, appearing in recommender
systems, movie preferences, topic models, medical records, and genomics. While
there is a vast literature on how to exploit low rank structure in these
datasets, there is less attention on explaining why the low rank structure
appears in the first place. We explain the abundance of low rank matrices in
big data by proving that certain latent variable models associated to piecewise
analytic functions are of log-rank. A large matrix from such a latent variable
model can be approximated, up to a small error, by a low rank matrix.

Learning to Mix n-Step Returns: Generalizing lambda-Returns for Deep Reinforcement Learning

Sahil Sharma, Srivatsan Ramesh, Girish Raguvir J, Balaraman Ravindran
Comments: 10 pages + 11 page appendix
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI)

Reinforcement Learning (RL) can model complex behavior policies for
goal-directed sequential decision making tasks. A hallmark of RL algorithms is
Temporal Difference (TD) learning: value function for the current state is
moved towards a bootstrapped target that is estimated using next state’s value
function. (lambda)-returns generalize beyond 1-step returns and strike a
balance between Monte Carlo and TD learning methods. While lambda-returns have
been extensively studied in RL, they haven’t been explored a lot in Deep RL.
This paper’s first contribution is an exhaustive benchmarking of
lambda-returns. Although mathematically tractable, the use of exponentially
decaying weighting of n-step returns based targets in lambda-returns is a
rather ad-hoc design choice. Our second major contribution is that we propose a
generalization of lambda-returns called Confidence-based Autodidactic Returns
(CAR), wherein the RL agent learns the weighting of the n-step returns in an
end-to-end manner. This allows the agent to learn to decide how much it wants
to weigh the n-step returns based targets. In contrast, lambda-returns restrict
RL agents to use an exponentially decaying weighting scheme. Autodidactic
returns can be used for improving any RL algorithm which uses TD learning. We
empirically demonstrate that using sophisticated weighted mixtures of
multi-step returns (like CAR and lambda-returns) considerably outperforms the
use of n-step returns. We perform our experiments on the Asynchronous Advantage
Actor Critic (A3C) algorithm in the Atari 2600 domain.

Parallel Streaming Wasserstein Barycenters

Matthew Staib, Sebastian Claici, Justin Solomon, Stefanie Jegelka
Subjects: Learning (cs.LG); Optimization and Control (math.OC); Computation (stat.CO); Machine Learning (stat.ML)

Efficiently aggregating data from different sources is a challenging problem,
particularly when samples from each source are distributed differently. These
differences can be inherent to the inference task or present for other reasons:
sensors in a sensor network may be placed far apart, affecting their individual
measurements. Conversely, it is computationally advantageous to split Bayesian
inference tasks across subsets of data, but data need not be identically
distributed across subsets. One principled way to fuse probability
distributions is via the lens of optimal transport: the Wasserstein barycenter
is a single distribution that summarizes a collection of input measures while
respecting their geometry. However, computing the barycenter scales poorly and
requires discretization of all input distributions and the barycenter itself.
Improving on this situation, we present a scalable, communication-efficient,
parallel algorithm for computing the Wasserstein barycenter of arbitrary
distributions. Our algorithm can operate directly on continuous input
distributions and is optimized for streaming data. Our method is even robust to
nonstationary input distributions and produces a barycenter estimate that
tracks the input measures over time. The algorithm is semi-discrete, needing to
discretize only the barycenter estimate. To the best of our knowledge, we also
provide the first bounds on the quality of the approximate barycenter as the
discretization becomes finer. Finally, we demonstrate the practical
effectiveness of our method, both in tracking moving distributions on a sphere,
as well as in a large-scale Bayesian inference task.

Stabilizing Adversarial Nets With Prediction Methods

Adversarial neural networks solve many important problems in data science,
but are notoriously difficult to train. These difficulties come from the fact
that optimal weights for adversarial nets correspond to saddle points, and not
minimizers, of the loss function. The alternating stochastic gradient methods
typically used for such problems do not reliably converge to saddle points, and
when convergence does happen it is often highly sensitive to learning rates. We
propose a simple modification of stochastic gradient descent that stabilizes
adversarial networks. We show, both in theory and practice, that the proposed
method reliably converges to saddle points. This makes adversarial networks
less likely to “collapse”, and enables faster training with larger learning
rates.

Deep Sparse Coding Using Optimized Linear Expansion of Thresholds

Debabrata Mahapatra, Subhadip Mukherjee, Chandra Sekhar Seelamantula
Comments: Submission date: November 11, 2016. 19 pages; 9 figures
Subjects: Learning (cs.LG)

We address the problem of reconstructing sparse signals from noisy and
compressive measurements using a feed-forward deep neural network (DNN) with an
architecture motivated by the iterative shrinkage-thresholding algorithm
(ISTA). We maintain the weights and biases of the network links as prescribed
by ISTA and model the nonlinear activation function using a linear expansion of
thresholds (LET), which has been very successful in image denoising and
deconvolution. The optimal set of coefficients of the parametrized activation
is learned over a training dataset containing measurement-sparse signal pairs,
corresponding to a fixed sensing matrix. For training, we develop an efficient
second-order algorithm, which requires only matrix-vector product computations
in every training epoch (Hessian-free optimization) and offers superior
convergence performance than gradient-descent optimization. Subsequently, we
derive an improved network architecture inspired by FISTA, a faster version of
ISTA, to achieve similar signal estimation performance with about 50% of the
number of layers. The resulting architecture turns out to be a deep residual
network, which has recently been shown to exhibit superior performance in
several visual recognition tasks. Numerical experiments demonstrate that the
proposed DNN architectures lead to 3 to 4 dB improvement in the reconstruction
signal-to-noise ratio (SNR), compared with the state-of-the-art sparse coding
algorithms.

Learning to Factor Policies and Action-Value Functions: Factored Action Space Representations for Deep Reinforcement learning

Sahil Sharma, Aravind Suresh, Rahul Ramesh, Balaraman Ravindran
Comments: 11 pages + 7 pages appendix
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI)

Deep Reinforcement Learning (DRL) methods have performed well in an
increasing numbering of high-dimensional visual decision making domains. Among
all such visual decision making problems, those with discrete action spaces
often tend to have underlying compositional structure in the said action space.
Such action spaces often contain actions such as go left, go up as well as go
diagonally up and left (which is a composition of the former two actions). The
representations of control policies in such domains have traditionally been
modeled without exploiting this inherent compositional structure in the action
spaces. We propose a new learning paradigm, Factored Action space
Representations (FAR) wherein we decompose a control policy learned using a
Deep Reinforcement Learning Algorithm into independent components, analogous to
decomposing a vector in terms of some orthogonal basis vectors. This
architectural modification of the control policy representation allows the
agent to learn about multiple actions simultaneously, while executing only one
of them. We demonstrate that FAR yields considerable improvements on top of two
DRL algorithms in Atari 2600: FARA3C outperforms A3C (Asynchronous Advantage
Actor Critic) in 9 out of 14 tasks and FARAQL outperforms AQL (Asynchronous
n-step Q-Learning) in 9 out of 13 tasks.

Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods

Nicholas Carlini, David Wagner
Subjects: Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Neural networks are known to be vulnerable to adversarial examples: inputs
that are close to valid inputs but classified incorrectly. We investigate the
security of ten recent proposals that are designed to detect adversarial
examples. We show that all can be defeated, even when the adversary does not
know the exact parameters of the detector. We conclude that adversarial
examples are significantly harder to detect than previously appreciated, and we
propose several guidelines for evaluating future proposed defenses.

Batch Reinforcement Learning on the Industrial Benchmark: First Experiences

The Particle Swarm Optimization Policy (PSO-P) has been recently introduced
and proven to produce remarkable results on interacting with academic
reinforcement learning benchmarks in an off-policy, batch-based setting. To
further investigate the properties and feasibility on real-world applications,
this paper investigates PSO-P on the so-called Industrial Benchmark (IB), a
novel reinforcement learning (RL) benchmark that aims at being realistic by
including a variety of aspects found in industrial applications, like
continuous state and action spaces, a high dimensional, partially observable
state space, delayed effects, and complex stochasticity. The experimental
results of PSO-P on IB are compared to results of closed-form control policies
derived from the model-based Recurrent Control Neural Network (RCNN) and the
model-free Neural Fitted Q-Iteration (NFQ). Experiments show that PSO-P is not
only of interest for academic benchmarks, but also for real-world industrial
applications, since it also yielded the best performing policy in our IB
setting. Compared to other well established RL techniques, PSO-P produced
outstanding results in performance and robustness, requiring only a relatively
low amount of effort in finding adequate parameters or making complex design
decisions.

Learning Feature Nonlinearities with Non-Convex Regularized Binned Regression

Samet Oymak, Mehrdad Mahdavi, Jiasi Chen
Comments: 22 pages, 7 figures
Subjects: Learning (cs.LG); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)

For various applications, the relations between the dependent and independent
variables are highly nonlinear. Consequently, for large scale complex problems,
neural networks and regression trees are commonly preferred over linear models
such as Lasso. This work proposes learning the feature nonlinearities by
binning feature values and finding the best fit in each quantile using
non-convex regularized linear regression. The algorithm first captures the
dependence between neighboring quantiles by enforcing smoothness via
piecewise-constant/linear approximation and then selects a sparse subset of
good features. We prove that the proposed algorithm is statistically and
computationally efficient. In particular, it achieves linear rate of
convergence while requiring near-minimal number of samples. Evaluations on
synthetic and real datasets demonstrate that algorithm is competitive with
current state-of-the-art and accurately learns feature nonlinearities. Finally,
we explore an interesting connection between the binning stage of our algorithm
and sparse Johnson-Lindenstrauss matrices.

SVM via Saddle Point Optimization: New Bounds and Distributed Algorithms

Yifei Jin, Lingxiao Huang, Jian Li
Comments: 20 pages
Subjects: Learning (cs.LG); Numerical Analysis (cs.NA)

Support Vector Machine is one of the most classical approaches for
classification and regression. Despite being studied for decades, obtaining
practical algorithms for SVM is still an active research problem in machine
learning. In this paper, we propose a new perspective for SVM via saddle point
optimization. We provide an algorithm which achieves
((1-epsilon))-approximations with running time ( ilde{O}(nd+nsqrt{d /
epsilon})) for both separable (hard margin SVM) and non-separable cases
((
u)-SVM ), where (n) is the number of points and (d) is the dimensionality.
To the best of our knowledge, the current best algorithm for hard margin SVM
achieved by Gilbert algorithm~cite{gartner2009coresets} requires (O(nd /
epsilon )) time. Our algorithm improves the running time by a factor of
(sqrt{d}/sqrt{epsilon}). For (
u)-SVM, besides the well known quadratic
programming approach which requires (Omega(n^2 d))
time~cite{joachims1998making,platt199912}, no better algorithm is known. In
the paper, we provide the first nearly linear time algorithm for (
u)-SVM. We
also consider the distributed settings and provide distributed algorithms with
low communication cost via saddle point optimization. Our algorithms require
( ilde{O}(k(d +sqrt{d/epsilon}))) communication cost where (k) is the number
of clients, almost matching the theoretical lower bound.

Speedup from a different parametrization within the Neural Net algorithm

Michael F. Zimmer
Comments: 8 pages
Subjects: Learning (cs.LG)

A different parametrization of the hyperplanes is used in the neural net
algorithm. As demonstrated on several autoencoder examples it significantly
outperforms the usual parametrization, reaching lower training error values
with only a fraction of the number of epochs. It’s argued that it makes it
easier to understand and initialize the parameters.

GAR: An efficient and scalable Graph-based Activity Regularization for semi-supervised learning

Ozsel Kilinc, Ismail Uysal
Comments: Submitted to 31st Conference on Neural Information Processing Systems (NIPS 2017)
Subjects: Learning (cs.LG)

In this paper, we propose a novel graph-based approach for semi-supervised
learning problems, which considers an adaptive adjacency of the examples
throughout the unsupervised portion of the training. Adjacency of the examples
is inferred using the predictions of a neural network model which is first
initialized by a supervised pretraining. These predictions are then updated
according to a novel unsupervised objective which regularizes another
adjacency, now linking the output nodes. Regularizing the adjacency of the
output nodes, inferred from the predictions of the network, creates an easier
optimization problem and ultimately provides that the predictions of the
network turn into the optimal embedding. Ultimately, the proposed framework
provides an effective and scalable graph-based solution which is natural to the
operational mechanism of deep neural networks. Our results show
state-of-the-art performance within semi-supervised learning with the highest
accuracies reported to date in the literature for SVHN and NORB datasets.

Securing Deep Neural Nets against Adversarial Attacks with Moving Target Defense

Sailik Sengupta, Tathagata Chakraborti, Subbarao Kambhampati
Comments: 10 page, 4 figures
Subjects: Learning (cs.LG); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT)

Deep Neural Networks (DNNs) are presently the state-of-the-art for image
classification tasks. However, recent works have shown that these systems can
be easily fooled to misidentify images by modifying the image in a particular
way. Moreover, defense mechanisms proposed in the literature so far are mostly
attack-specific and prove to be ineffective against new attacks. Indeed, recent
work on universal perturbations can generate a single modification for all test
images that is able to make existing networks misclassify 90% of the time.
Presently, to our knowledge, no defense mechanisms are effective in preventing
this. As such, the design of a general defense strategy against a wide range of
attacks for Neural Networks becomes a challenging problem. In this paper, we
derive inspiration from recent advances in the field of cybersecurity and
multi-agent systems and propose to use the concept of Moving Target Defense
(MTD) for increasing the robustness of well-known deep networks trained on the
ImageNet dataset towards such adversarial attacks. In using this technique, we
formalize and exploit the notion of differential immunity of different networks
to specific attacks. To classify a single test image, we pick one of the
trained networks each time and then use its classification output. To ensure
maximum robustness, we generate an effective strategy by formulating this
interaction as a Repeated Bayesian Stackelberg Game with a Defender and the
Users. As a network switching strategy, we compute a Strong Stackelberg
Equilibrium that optimizes the accuracy of prediction while at the same time
reduces the misclassification rate on adversarial modification of test images.
We show that while our approach produces an accuracy of 92.79% for the
legitimate users, attackers can only misclassify images 58% (instead of 93.7%)
of the time even when they select the best attack available to them.

Two-temperature logistic regression based on the Tsallis divergence

Ehsan Amid, Manfred K. Warmuth
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

We develop a variant of multiclass logistic regression that achieves three
properties: i) We minimize a non-convex surrogate loss which makes the method
robust to outliers, ii) our method allows transitioning between non-convex and
convex losses by the choice of the parameters, iii) the surrogate loss is Bayes
consistent, even in the non-convex case. The algorithm has one weight vector
per class and the surrogate loss is a function of the linear activations (one
per class). The surrogate loss of an example with linear activation vector
(mathbf{a}) and class (c) has the form (-log_{t_1} exp_{t_2} (a_c –
G_{t_2}(mathbf{a}))) where the two temperatures (t_1) and (t_2) “temper” the
(log) and (exp), respectively, and (G_{t_2}) is a generalization of the
log-partition function. We motivate this loss using the Tsallis divergence. As
the temperature of the logarithm becomes smaller than the temperature of the
exponential, the surrogate loss becomes “more quasi-convex”. Various tunings of
the temperatures recover previous methods and tuning the degree of
non-convexity is crucial in the experiments. The choice (t_1<1) and (t_2>1)
performs best experimentally. We explain this by showing that (t_1 < 1) caps
the surrogate loss and (t_2 >1) makes the predictive distribution have a heavy
tail.

Machine learning modeling for time series problem: Predicting flight ticket prices

Jun Lu
Comments: 17 pages, 19 figures
Subjects: Learning (cs.LG)

Machine learning has been used in all kinds of fields. In this article, we
introduce how machine learning can be applied into time series problem.
Especially, we use the airline ticket prediction problem as our specific
problem. Airline companies use many different variables to determine the flight
ticket prices: indicator whether the travel is during the holidays, the number
of free seats in the plane etc. Some of the variables are observed, but some of
them are hidden. Based on the data over a 103 day period, we trained our
models, getting the best model – which is AdaBoost-Decision Tree
Classification. This algorithm has best performance over the observed 8 routes
which has 61.35(\%) better performance than the random purchase strategy, and
relatively small variance over these routes. And we also considered the
situation that we cannot get too much historical datas for some routes (for
example the route is new and does not have historical data) or we do not want
to train historical data to predict to buy or wait quickly, in which problem,
we used HMM Sequence Classification based AdaBoost-Decision Tree Classification
to perform our prediction on 12 new routes. Finally, we got 31.71(\%) better
performance than the random purchase strategy.

The High-Dimensional Geometry of Binary Neural Networks

Alexander G. Anderson, Cory P. Berg
Comments: 12 pages, 4 Figures
Subjects: Learning (cs.LG)

Recent research has shown that one can train a neural network with binary
weights and activations at train time by augmenting the weights with a
high-precision continuous latent variable that accumulates small changes from
stochastic gradient descent. However, there is a dearth of theoretical analysis
to explain why we can effectively capture the features in our data with binary
weights and activations. Our main result is that the neural networks with
binary weights and activations trained using the method of Courbariaux, Hubara
et al. (2016) work because of the high-dimensional geometry of binary vectors.
In particular, the ideal continuous vectors that extract out features in the
intermediate representations of these BNNs are well-approximated by binary
vectors in the sense that dot products are approximately preserved. Compared to
previous research that demonstrated the viability of such BNNs, our work
explains why these BNNs work in terms of the HD geometry. Our theory serves as
a foundation for understanding not only BNNs but a variety of methods that seek
to compress traditional neural networks. Furthermore, a better understanding of
multilayer binary neural networks serves as a starting point for generalizing
BNNs to other neural network architectures such as recurrent neural networks.

Nestrov's Acceleration For Second Order Method

Haishan Ye, Zhihua Zhang
Subjects: Learning (cs.LG)

Optimization plays a key role in machine learning. Recently, stochastic
second-order methods have attracted much attention due to their low
computational cost in each iteration. However, these algorithms might perform
poorly especially if it is hard to approximate the Hessian well and
efficiently. As far as we know, there is no effective way to handle this
problem. In this paper, we resort to Nestrov’s acceleration technique to
improve the convergence performance of a class of second-order methods called
approximate Newton. We give a theoretical analysis that Nestrov’s acceleration
technique can improve the convergence performance for approximate Newton just
like for first-order methods. We accordingly propose an accelerated regularized
sub-sampled Newton. Our accelerated algorithm performs much better than the
original regularized sub-sampled Newton in experiments, which validates our
theory empirically. Besides, the accelerated regularized sub-sampled Newton has
good performance comparable to or even better than state-of-art algorithms.

Local Information with Feedback Perturbation Suffices for Dictionary Learning in Neural Circuits

Tsung-Han Lin
Comments: 10 pages, 4 figures
Subjects: Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

While the sparse coding principle can successfully model information
processing in sensory neural systems, it remains unclear how learning can be
accomplished under neural architectural constraints. Feasible learning rules
must rely solely on synaptically local information in order to be implemented
on spatially distributed neurons. We describe a neural network with spiking
neurons that can address the aforementioned fundamental challenge and solve the
L1-minimizing dictionary learning problem, representing the first model able to
do so. Our major innovation is to introduce feedback synapses to create a
pathway to turn the seemingly non-local information into local ones. The
resulting network encodes the error signal needed for learning as the change of
network steady states caused by feedback, and operates akin to the classical
stochastic gradient descent method.

Softmax Q-Distribution Estimation for Structured Prediction: A Theoretical Interpretation for RAML

Xuezhe Ma, Pengcheng Yin, Jingzhou Liu, Graham Neubig, Eduard Hovy
Comments: Under Review of NIPS 2017
Subjects: Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)

Reward augmented maximum likelihood (RAML) is a simple and effective learning
framework to directly optimize towards the reward function in structured
prediction tasks. RAML incorporates task-specific reward by performing
maximum-likelihood updates on candidate outputs sampled according to an
exponentiated payoff distribution, which gives higher probabilities to
candidates that are close to the reference output. While RAML is notable for
its simplicity, efficiency, and its impressive empirical successes, the
theoretical properties of RAML, especially the behavior of the exponentiated
payoff distribution, has not been examined thoroughly. In this work, we
introduce softmax Q-distribution estimation, a novel theoretical interpretation
of RAML, which reveals the relation between RAML and Bayesian decision theory.
The softmax Q-distribution can be regarded as a smooth approximation of Bayes
decision boundary, and the Bayes decision rule is achieved by decoding with
this Q-distribution. We further show that RAML is equivalent to approximately
estimating the softmax Q-distribution. Experiments on three structured
prediction tasks with rewards defined on sequential (named entity recognition),
tree-based (dependency parsing) and irregular (machine translation) structures
show notable improvements over maximum likelihood baselines.

VAE with a VampPrior

Jakub M. Tomczak, Max Welling
Comments: 15 pages
Subjects: Learning (cs.LG)

Many different methods to train deep generative models have been proposed in
the past. In this paper, we propose to extend the variational auto-encoder
(VAE) framework with a new type of prior which we call “Variational Mixture of
Posteriors” prior, or VampPrior for short. The VampPrior consists of a mixture
distribution (e.g., a mixture of Gaussians) with components given by
variational posteriors conditioned on learnable pseudo-inputs. We further
extend this prior to a two layer hierarchical model and show that this
architecture where prior and posterior are coupled, learns significantly better
models. The model also avoids the usual local optima issues that plague VAEs
related to useless latent dimensions. We provide empirical studies on three
benchmark datasets, namely, MNIST, OMNIGLOT and Caltech 101 Silhouettes, and
show that applying the hierarchical VampPrior delivers state-of-the-art results
on all three datasets in the unsupervised permutation invariant setting.

Deep Learning as a Tool to Predict Flow Patterns in Two-Phase Flow

Mohammadmehdi Ezzatabadipour, Parth Singh, Melvin D. Robinson, Pablo Guillen-Rondon, Carlos Torres
Comments: Part of DM4OG 2017 proceedings (arXiv:1705.03451)
Subjects: Learning (cs.LG)

In order to better model complex real-world data such as multiphase flow, one
approach is to develop pattern recognition techniques and robust features that
capture the relevant information. In this paper, we use deep learning methods,
and in particular employ the multilayer perceptron, to build an algorithm that
can predict flow pattern in twophase flow from fluid properties and pipe
conditions. The preliminary results show excellent performance when compared
with classical methods of flow pattern prediction.

A unified approach to interpreting model predictions

Scott Lundberg, Su-In Lee
Subjects: Artificial Intelligence (cs.AI); Learning (cs.LG); Machine Learning (stat.ML)

Understanding why a model made a certain prediction is crucial in many
applications. However, with large modern datasets the best accuracy is often
achieved by complex models even experts struggle to interpret, such as ensemble
or deep learning models. This creates a tension between accuracy and
interpretability. In response, a variety of methods have recently been proposed
to help users interpret the predictions of complex models. Here, we present a
unified framework for interpreting predictions, namely SHAP (SHapley Additive
exPlanations, which assigns each feature an importance for a particular
prediction. The key novel components of the SHAP framework are the
identification of a class of additive feature importance measures and
theoretical results that there is a unique solution in this class with a set of
desired properties. This class unifies six existing methods, and several recent
methods in this class do not have these desired properties. This means that our
framework can inform the development of new methods for explaining prediction
models. We demonstrate that several new methods we presented in this paper
based on the SHAP framework show better computational performance and better
consistency with human intuition than existing methods.

Regularizing deep networks using efficient layerwise adversarial training

Swami Sankaranarayanan, Arpit Jain, Rama Chellappa, Ser Nam Lim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG); Machine Learning (stat.ML)

Adversarial training has been shown to regularize deep neural networks in
addition to increasing their robustness to adversarial examples. However, its
impact on very deep state of the art networks has not been fully investigated.
In this paper, we present an efficient approach to perform adversarial training
by perturbing intermediate layer activations and study the use of such
perturbations as a regularizer during training. We use these perturbations to
train very deep models such as ResNets and show improvement in performance both
on adversarial and original test data. Our experiments highlight the benefits
of perturbing intermediate layer activations compared to perturbing only the
inputs. The results on CIFAR-10 and CIFAR-100 datasets show the merits of the
proposed adversarial training approach. Additional results on WideResNets show
that our approach provides significant improvement in classification accuracy
for a given base model, outperforming dropout and other base models of larger
size.

Use Privacy in Data-Driven Systems: Theory and Experiments with Machine Learnt Programs

Anupam Datta, Matthew Fredrikson, Gihyuk Ko, Piotr Mardziel, Shayak Sen
Subjects: Cryptography and Security (cs.CR); Learning (cs.LG)

This paper presents an approach to formalizing and enforcing a class of use
privacy properties in data-driven systems. In contrast to prior work, we focus
on use restrictions on proxies (i.e. strong predictors) of protected
information types. Our definition relates proxy use to intermediate
computations that occur in a program, and identify two essential properties
that characterize this behavior: 1) its result is strongly associated with the
protected information type in question, and 2) it is likely to causally affect
the final output of the program. For a specific instantiation of this
definition, we present a program analysis technique that detects instances of
proxy use in a model, and provides a witness that identifies which parts of the
corresponding program exhibit the behavior. Recognizing that not all instances
of proxy use of a protected information type are inappropriate, we make use of
a normative judgment oracle that makes this inappropriateness determination for
a given witness. Our repair algorithm uses the witness of an inappropriate
proxy use to transform the model into one that provably does not exhibit proxy
use, while avoiding changes that unduly affect classification accuracy. Using a
corpus of social datasets, our evaluation shows that these algorithms are able
to detect proxy use instances that would be difficult to find using existing
techniques, and subsequently remove them while maintaining acceptable
classification performance. Authors

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Joao Carreira, Andrew Zisserman
Comments: To appear at CVPR 2017
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

The paucity of videos in current action classification datasets (UCF-101 and
HMDB-51) has made it difficult to identify good video architectures, as most
methods obtain similar performance on existing small-scale benchmarks. This
paper re-evaluates state-of-the-art architectures in light of the new Kinetics
Human Action Video dataset. Kinetics has two orders of magnitude more data,
with 400 human action classes and over 400 clips per class, and is collected
from realistic, challenging YouTube videos. We provide an analysis on how
current architectures fare on the task of action classification on this dataset
and how much performance improves on the smaller benchmark datasets after
pre-training on Kinetics.

We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on
2D ConvNet inflation: filters and pooling kernels of very deep image
classification ConvNets are expanded into 3D, making it possible to learn
seamless spatio-temporal feature extractors from video while leveraging
successful ImageNet architecture designs and even their parameters. We show
that, after pre-training on Kinetics, I3D models considerably improve upon the
state-of-the-art in action classification, reaching 80.7% on HMDB-51 and 98.0%
on UCF-101.

A Regularized Framework for Sparse and Structured Neural Attention

Vlad Niculae, Mathieu Blondel
Comments: 21 pages, including appendix
Subjects: Machine Learning (stat.ML); Computation and Language (cs.CL); Learning (cs.LG)

Modern neural networks are often augmented with an attention mechanism, which
tells the network where to focus within the input. We propose in this paper a
new framework for sparse and structured attention, building upon a max operator
regularized with a strongly convex function. We show that this operator is
differentiable and that its gradient defines a mapping from real values to
probabilities, suitable as an attention mechanism. Our framework includes
softmax and a slight generalization of the recently-proposed sparsemax as
special cases. However, we also show how our framework can incorporate modern
structured penalties, resulting in new attention mechanisms that focus on
entire segments or groups of an input, encouraging parsimony and
interpretability. We derive efficient algorithms to compute the forward and
backward passes of these attention mechanisms, enabling their use in a neural
network trained with backpropagation. To showcase their potential as a drop-in
replacement for existing attention mechanisms, we evaluate them on three
large-scale tasks: textual entailment, machine translation, and sentence
summarization. Our attention mechanisms improve interpretability without
sacrificing performance; notably, on textual entailment and summarization, we
outperform the existing attention mechanisms based on softmax and sparsemax.

A Linear-Time Kernel Goodness-of-Fit Test

Wittawat Jitkrittum, Wenkai Xu, Zoltan Szabo, Kenji Fukumizu, Arthur Gretton
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

We propose a novel adaptive test of goodness-of-fit, with computational cost
linear in the number of samples. We learn the test features that best indicate
the differences between observed samples and a reference model, by minimizing
the false negative rate. These features are constructed via Stein’s method,
meaning that it is not necessary to compute the normalising constant of the
model. We analyse the asymptotic Bahadur efficiency of the new test, and prove
that under a mean-shift alternative, our test always has greater relative
efficiency than a previous linear-time kernel test, regardless of the choice of
parameters for that test. In experiments, the performance of our method exceeds
that of the earlier linear-time test, and matches or exceeds the power of a
quadratic-time kernel test. In high dimensions and where model structure may be
exploited, our goodness of fit test performs far better than a quadratic-time
two-sample test based on the Maximum Mean Discrepancy, with samples drawn from
the model.

LOGAN: Evaluating Privacy Leakage of Generative Models Using Generative Adversarial Networks

Jamie Hayes, Luca Melis, George Danezis, Emiliano De Cristofaro
Subjects: Cryptography and Security (cs.CR); Learning (cs.LG)

Recent advances in machine learning are paving the way for the artificial
generation of high quality images and videos. In this paper, we investigate how
generating synthetic samples through generative models can lead to information
leakage, and, consequently, to privacy breaches affecting individuals’ privacy
that contribute their personal or sensitive data to train these models. In
order to quantitatively measure privacy leakage, we train a Generative
Adversarial Network (GAN), which combines a discriminative model and a
generative model, to detect overfitting by relying on the discriminator
capacity to learn statistical differences in distributions.

We present attacks based on both white-box and black-box access to the target
model, and show how to improve it through auxiliary knowledge of samples in the
dataset. We test our attacks on several state-of-the-art models such as Deep
Convolutional GAN (DCGAN), Boundary Equilibrium GAN (BEGAN), and the
combination of DCGAN with a Variational Autoencoder (DCGAN+VAE), using datasets
consisting of complex representations of faces (LFW) and objects (CIFAR-10).
Our white-box attacks are 100% successful at inferring which samples were used
to train the target model, while the best black-box attacks can infer training
set membership with over 60% accuracy.

Multi-output Polynomial Networks and Factorization Machines

Mathieu Blondel, Vlad Niculae, Takuma Otsuka, Naonori Ueda
Comments: 16 pages, including appendix
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

Factorization machines and polynomial networks are supervised polynomial
models based on an efficient low-rank decomposition. We extend these models to
the multi-output setting, i.e., for learning vector-valued functions, with
application to multi-class or multi-task problems. We cast this as the problem
of learning a 3-way tensor whose slices share a common decomposition and
propose a convex formulation of that problem. We then develop an efficient
conditional gradient algorithm and prove its global convergence, despite the
fact that it involves a non-convex hidden unit selection step. On
classification tasks, we show that our algorithm achieves excellent accuracy
with much sparser models than existing methods. On recommendation system tasks,
we show how to combine our algorithm with a reduction from ordinal regression
to multi-output classification and show that the resulting algorithm
outperforms existing baselines in terms of ranking accuracy.

Global Guarantees for Enforcing Deep Generative Priors by Empirical Risk

Paul Hand, Vladislav Voroninski
Subjects: Information Theory (cs.IT); Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)

We examine the theoretical properties of enforcing priors provided by
generative deep neural networks via empirical risk minimization. In particular
we consider two models, one in which the task is to invert a generative neural
network given access to its last layer and another which entails recovering a
latent code in the domain of a generative neural network from compressive
linear observations of its last layer. We establish that in both cases, in
suitable regimes of network layer sizes and a randomness assumption on the
network weights, that the non-convex objective function given by empirical risk
minimization does not have any spurious stationary points. That is, we
establish that with high probability, at any point away from small
neighborhoods around two scalar multiples of the desired solution, there is a
descent direction. These results constitute the first theoretical guarantees
which establish the favorable global geometry of these non-convex optimization
problems, and bridge the gap between the empirical success of deep learning and
a rigorous understanding of non-linear inverse problems.

Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon

Xin Dong, Shangyu Chen, Sinno Jialin Pan
Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

How to develop slim and accurate deep neural networks has become crucial for
real- world applications, especially for those employed in embedded systems.
Though previous work along this research line has shown some promising results,
most existing methods either fail to significantly compress a well-trained deep
network or require a heavy retraining process for the pruned deep network to
re-boost its prediction performance. In this paper, we propose a new layer-wise
pruning method for deep neural networks. In our proposed method, parameters of
each individual layer are pruned independently based on second order
derivatives of a layer-wise error function with respect to the corresponding
parameters. We prove that the final prediction performance drop after pruning
is bounded by a linear combination of the reconstructed errors caused at each
layer. Therefore, there is a guarantee that one only needs to perform a light
retraining process on the pruned network to resume its original prediction
performance. We conduct extensive experiments on benchmark datasets to
demonstrate the effectiveness of our pruning method compared with several
state-of-the-art baseline methods.

Batch Size Matters: A Diffusion Approximation Framework on Nonconvex Stochastic Gradient Descent

Chris Junchi Li, Lei Li, Junyang Qian, Jian-Guo Liu
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

In this paper, we study the stochastic gradient descent method in analyzing
nonconvex statistical optimization problems from a diffusion approximation
point of view. Using the theory of large deviation of random dynamical system,
we prove in the small stepsize regime and the presence of omnidirectional noise
the following: starting from a local minimizer (resp.~saddle point) the SGD
iteration escapes in a number of iteration that is exponentially
(resp.~linearly) dependent on the inverse stepsize. We take the deep neural
network as an example to study this phenomenon. Based on a new analysis of the
mixing rate of multidimensional Ornstein-Uhlenbeck processes, our theory
substantiate a very recent empirical results by citet{keskar2016large},
suggesting that large batch sizes in training deep learning for synchronous
optimization leads to poor generalization error.

Learning from Complementary Labels

Takashi Ishida, Gang Niu, Masashi Sugiyama
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

Collecting labeled data is costly and thus is a critical bottleneck in
real-world classification tasks. To mitigate the problem, we consider a
complementary label, which specifies a class that a pattern does not belong to.
Collecting complementary labels would be less laborious than ordinary labels
since users do not have to carefully choose the correct class from many
candidate classes. However, complementary labels are less informative than
ordinary labels and thus a suitable approach is needed to better learn from
complementary labels. In this paper, we show that an unbiased estimator of the
classification risk can be obtained only from complementary labels, if a loss
function satisfies a particular symmetric condition. We theoretically prove the
estimation error bounds for the proposed method, and experimentally demonstrate
the usefulness of the proposed algorithms.

Annealed Generative Adversarial Networks

Arash Mehrjou, Bernhard Schölkopf, Saeed Saremi
Comments: 9 pages, 6 figures
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

We introduce a novel framework for adversarial training where the target
distribution is annealed between the uniform distribution and the data
distribution. We posited a conjecture that learning under continuous annealing
in the nonparametric regime is stable irrespective of the divergence measures
in the objective function and proposed an algorithm, dubbed {ss}-GAN, in
corollary. In this framework, the fact that the initial support of the
generative network is the whole ambient space combined with annealing are key
to balancing the minimax game. In our experiments on synthetic data, MNIST, and
CelebA, {ss}-GAN with a fixed annealing schedule was stable and did not suffer
from mode collapse.

Shallow Updates for Deep Reinforcement Learning

Nir Levine, Tom Zahavy, Daniel J. Mankowitz, Aviv Tamar, Shie Mannor
Subjects: Artificial Intelligence (cs.AI); Learning (cs.LG); Machine Learning (stat.ML)

Deep reinforcement learning (DRL) methods such as the Deep Q-Network (DQN)
have achieved state-of-the-art results in a variety of challenging,
high-dimensional domains. This success is mainly attributed to the power of
deep neural networks to learn rich domain representations for approximating the
value function or policy. Batch reinforcement learning methods with linear
representations, on the other hand, are more stable and require less hyper
parameter tuning. Yet, substantial feature engineering is necessary to achieve
good results. In this work we propose a hybrid approach — the Least Squares
Deep Q-Network (LS-DQN), which combines rich feature representations learned by
a DRL algorithm with the stability of a linear least squares method. We do this
by periodically re-training the last hidden layer of a DRL network with a batch
least squares update. Key to our approach is a Bayesian regularization term for
the least squares update, which prevents over-fitting to the more recent data.
We tested LS-DQN on five Atari games and demonstrate significant improvement
over vanilla DQN and Double-DQN. We also investigated the reasons for the
superior performance of our method. Interestingly, we found that the
performance improvement can be attributed to the large batch size used by the
LS method when optimizing the last layer.

Learning Semantic Relatedness From Human Feedback Using Metric Learning

Thomas Niebler, Martin Becker, Christian Pölitz, Andreas Hotho
Subjects: Computation and Language (cs.CL); Learning (cs.LG)

Assessing the degree of semantic relatedness between words is an important
task with a variety of semantic applications, such as ontology learning for the
Semantic Web, semantic search or query expansion. To accomplish this in an
automated fashion, many relatedness measures have been proposed. However, most
of these metrics only encode information contained in the underlying corpus and
thus do not directly model human intuition. To solve this, we propose to
utilize a metric learning approach to improve existing semantic relatedness
measures by learning from additional information, such as explicit human
feedback. For this, we argue to use word embeddings instead of traditional
high-dimensional vector representations in order to leverage their semantic
density and to reduce computational cost. We rigorously test our approach on
several domains including tagging data as well as publicly available embeddings
based on Wikipedia texts and navigation. Human feedback about semantic
relatedness for learning and evaluation is extracted from publicly available
datasets such as MEN or WS-353. We find that our method can significantly
improve semantic relatedness measures by learning from additional information,
such as explicit human feedback. For tagging data, we are the first to generate
and study embeddings. Our results are of special interest for ontology and
recommendation engineers, but also for any other researchers and practitioners
of Semantic Web techniques.

CrossNets : A New Approach to Complex Learning

Chirag Agarwal, Mehdi Sharifzhadeh, Joe Klobusicky, Dan Schonfeld
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

We propose a novel neural network structure called CrossNets, which considers
architectures on directed acyclic graphs. This structure builds on previous
generalizations of feed forward models, such as ResNets, by allowing for all
forward cross connections between layers (both adjacent and non-adjacent). The
addition of cross connections among the network increases information flow
across the whole network, leading to better training and testing performances.
The superior performance of the network is tested against four benchmark
datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN. We conclude with a proof of
convergence for Crossnets to a local minimum for error, where weights for
connections are chosen through backpropagation with momentum.

DeepMasterPrint: Generating Fingerprints for Presentation Attacks

Philip Bontrager, Julian Togelius, Nasir Memon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Learning (cs.LG)

We present two related methods for creating MasterPrints, synthetic
fingerprints that a fingerprint verification system identifies as many
different people. Both methods start with training a Generative Adversarial
Network (GAN) on a set of real fingerprint images. The generator network is
then used to search for images that can be recognized as multiple individuals.
The first method uses evolutionary optimization in the space of latent
variables, and the second uses gradient-based search. Our method is able to
design a MasterPrint that a commercial fingerprint system matches to 22% of all
users in a strict security setting, and 75% of all users at a looser security
setting.

Balanced Policy Evaluation and Learning

Nathan Kallus
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Optimization and Control (math.OC)

We present a new approach to the problems of evaluating and learning
personalized decision policies from observational data of past contexts,
decisions, and outcomes. Only the outcome of the enacted decision is available
and the historical policy is unknown. These problems arise in personalized
medicine using electronic health records and in internet advertising. Existing
approaches use inverse propensity weighting (or, doubly robust versions) to
make historical outcome (or, residual) data look like it were generated by a
new policy being evaluated or learned. But this relies on a plug-in approach
that rejects data points with a decision that disagrees with the new policy,
leading to high variance estimates and ineffective learning. We propose a new,
balance-based approach that too makes the data look like the new policy but
does so directly by finding weights that optimize for balance between the
weighted data and the target policy in the given, finite sample, which is
equivalent to minimizing worst-case or posterior conditional mean square error.
Our policy learner proceeds as a two-level optimization problem over policies
and weights. We demonstrate that this approach markedly outperforms existing
ones both in evaluation and learning, which is unsurprising given the wider
support of balance-based weights. We establish extensive theoretical
consistency guarantees and regret bounds that support this empirical success.

Instrument-Armed Bandits

Nathan Kallus
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

We extend the classic multi-armed bandit (MAB) model to the setting of
noncompliance, where the arm pull is a mere instrument and the treatment
applied may differ from it, which gives rise to the instrument-armed bandit
(IAB) problem. The IAB setting is relevant whenever the experimental units are
human since free will, ethics, and the law may prohibit unrestricted or forced
application of treatment. In particular, the setting is relevant in bandit
models of dynamic clinical trials and other controlled trials on human
interventions. Nonetheless, the setting has not been fully investigate in the
bandit literature. We show that there are various and divergent notions of
regret in this setting, all of which coincide only in the classic MAB setting.
We characterize the behavior of these regrets and analyze standard MAB
algorithms. We argue for a particular kind of regret that captures the causal
effect of treatments but show that standard MAB algorithms cannot achieve
sublinear control on this regret. Instead, we develop new algorithms for the
IAB problem, prove new regret bounds for them, and compare them to standard MAB
algorithms in numerical examples.

Mixed Membership Word Embeddings for Computational Social Science

James Foulds
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Learning (cs.LG)

Word embeddings improve the performance of NLP systems by revealing the
hidden structural relationships between words. These models have recently risen
in popularity due to the performance of scalable algorithms trained in the big
data setting. Despite their success, word embeddings have seen very little use
in computational social science NLP tasks, presumably due to their reliance on
big data, and to a lack of interpretability. I propose a probabilistic
model-based word embedding method which can recover interpretable embeddings,
without big data. The key insight is to leverage the notion of mixed membership
modeling, in which global representations are shared, but individual entities
(i.e. dictionary words) are free to use these representations to uniquely
differing degrees. Leveraging connections to topic models, I show how to train
these models in high dimensions using a combination of state-of-the-art
techniques for word embeddings and topic modeling. Experimental results show an
improvement in predictive performance of up to 63% in MRR over the skip-gram on
small datasets. The models are interpretable, as embeddings of topics are used
to encode embeddings for words (and hence, documents) in a model-based way. I
illustrate this with two computational social science case studies, on NIPS
articles and State of the Union addresses.

Forward Thinking: Building Deep Random Forests

Kevin Miller, Chris Hettinger, Jeffrey Humpherys, Tyler Jarvis, David Kartchner
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

The success of deep neural networks has inspired many to wonder whether other
learners could benefit from deep, layered architectures. We present a general
framework called forward thinking for deep learning that generalizes the
architectural flexibility and sophistication of deep neural networks while also
allowing for (i) different types of learning functions in the network, other
than neurons, and (ii) the ability to adaptively deepen the network as needed
to improve results. This is done by training one layer at a time, and once a
layer is trained, the input data are mapped forward through the layer to create
a new learning problem. The process is then repeated, transforming the data
through multiple layers, one at a time, rendering a new dataset, which is
expected to be better behaved, and on which a final output layer can achieve
good performance. In the case where the neurons of deep neural nets are
replaced with decision trees, we call the result a Forward Thinking Deep Random
Forest (FTDRF). We demonstrate a proof of concept by applying FTDRF on the
MNIST dataset. We also provide a general mathematical formulation that allows
for other types of deep learning problems to be considered.

Stability of cross-validation and minmax-optimal number of folds

Ning Xu, Jian Hong, Timothy Fisher
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Statistics Theory (math.ST)

In this paper, we analyze the properties of cross-validation from the
perspective of the stability, that is, the difference between the training
error and the error of the selected model applied to any other finite sample.
In both the i.i.d. and non-i.i.d. cases, we derive the upper bounds of the
one-round and average test error, referred to as the one-round/convoluted
Rademacher-bounds, to quantify the stability of model evaluation for
cross-validation. We show that the convoluted Rademacher-bounds quantify the
stability of the out-of-sample performance of the model in terms of its
training error. We also show that the stability of cross-validation is closely
related to the sample sizes of the training and test sets, the `heaviness’ in
the tails of the loss distribution, and the Rademacher complexity of the model
class. Using the convoluted Rademacher-bounds, we also define the
minmax-optimal number of folds, at which the performance of the selected model
on new-coming samples is most stable, for cross-validation. The minmax-optimal
number of folds also reveals that, given sample size, stability maximization
(or upper bound minimization) may help to quantify optimality in
hyper-parameter tuning or other learning tasks with large variation.

Ensemble Sampling

Xiuyuan Lu, Benjamin Van Roy
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Learning (cs.LG)

Thompson sampling has emerged as an effective heuristic for a broad range of
online decision problems. In its basic form, the algorithm requires computing
and sampling from a posterior distribution over models, which is tractable only
for simple special cases. This paper develops ensemble sampling, which aims to
approximate Thompson sampling while maintaining tractability even in the face
of complex models such as neural networks. Ensemble sampling dramatically
expands on the range of applications for which Thompson sampling is viable. We
establish a theoretical basis that supports the approach and present
computational results that offer further insight.

Search Engine Guided Non-Parametric Neural Machine Translation

Jiatao Gu, Yong Wang, Kyunghyun Cho, Victor O.K. Li
Comments: 8 pages, 4 figures, 2 tables
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Learning (cs.LG)

In this paper, we extend an attention-based neural machine translation (NMT)
model by allowing it to access an entire training set of parallel sentence
pairs even after training. The proposed approach consists of two stages. In the
first stage–retrieval stage–, an off-the-shelf, black-box search engine is
used to retrieve a small subset of sentence pairs from a training set given a
source sentence. These pairs are further filtered based on a fuzzy matching
score based on edit distance. In the second stage–translation stage–, a novel
translation model, called translation memory enhanced NMT (TM-NMT), seamlessly
uses both the source sentence and a set of retrieved sentence pairs to perform
the translation. Empirical evaluation on three language pairs (En-Fr, En-De,
and En-Es) shows that the proposed approach significantly outperforms the
baseline approach and the improvement is more significant when more relevant
sentence pairs were retrieved.

Stochastic Recursive Gradient Algorithm for Nonconvex Optimization

Lam M. Nguyen, Jie Liu, Katya Scheinberg, Martin Takáč
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Optimization and Control (math.OC)

In this paper, we study and analyze the mini-batch version of StochAstic
Recursive grAdient algoritHm (SARAH), a method employing the stochastic
recursive gradient, for solving empirical loss minimization for the case of
nonconvex losses. We provide a sublinear convergence rate (to stationary
points) for general nonconvex functions and a linear convergence rate for
gradient dominated functions, both of which have some advantages compared to
other modern stochastic gradient algorithms for nonconvex losses.

AIDE: An algorithm for measuring the accuracy of probabilistic inference algorithms

Marco F. Cusumano-Towner, Vikash K. Mansinghka
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Learning (cs.LG)

Approximate probabilistic inference algorithms are central to many fields.
Examples include sequential Monte Carlo inference in robotics, variational
inference in machine learning, and Markov chain Monte Carlo inference in
statistics. A key problem faced by practitioners is measuring the accuracy of
an approximate inference algorithm on a specific dataset. This paper introduces
the auxiliary inference divergence estimator (AIDE), an algorithm for measuring
the accuracy of approximate inference algorithms. AIDE is based on the
observation that inference algorithms can be treated as probabilistic models
and the random variables used within the inference algorithm can be viewed as
auxiliary variables. This view leads to a new estimator for the symmetric KL
divergence between the output distributions of two inference algorithms. The
paper illustrates application of AIDE to algorithms for inference in
regression, hidden Markov, and Dirichlet process mixture models. The
experiments show that AIDE captures the qualitative behavior of a broad class
of inference algorithms and can detect failure modes of inference algorithms
that are missed by standard heuristics.

How to Train Your DRAGAN

Generative Adversarial Networks have emerged as an effective technique for
estimating data distributions. The basic setup consists of two deep networks
playing against each other in a zero-sum game setting. However, it is not
understood if the networks reach an equilibrium eventually and what dynamics
makes this possible. The current GAN training procedure, which involves
simultaneous gradient descent, lacks a clear game-theoretic justification in
the literature. In this paper, we introduce regret minimization as a technique
to reach equilibrium in games and use this to motivate the use of simultaneous
GD in GANs. In addition, we present a hypothesis that mode collapse, which is a
common occurrence in GAN training, happens due to the existence of spurious
local equilibria in non-convex games. Motivated by these insights, we develop
an algorithm called DRAGAN that is fast, simple to implement and achieves
competitive performance in a stable fashion across different architectures,
datasets (MNIST, CIFAR-10, and CelebA), and divergence measures with almost no
hyperparameter tuning.

PixColor: Pixel Recursive Colorization

Sergio Guadarrama, Ryan Dahl, David Bieber, Mohammad Norouzi, Jonathon Shlens, Kevin Murphy
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

We propose a novel approach to automatically produce multiple colorized
versions of a grayscale image. Our method results from the observation that the
task of automated colorization is relatively easy given a low-resolution
version of the color image. We first train a conditional PixelCNN to generate a
low resolution color for a given grayscale image. Then, given the generated
low-resolution color image and the original grayscale image as inputs, we train
a second CNN to generate a high-resolution colorization of an image. We
demonstrate that our approach produces more diverse and plausible colorizations
than existing methods, as judged by human raters in a “Visual Turing Test”.

Ensemble Adversarial Training: Attacks and Defenses

Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Dan Boneh, Patrick McDaniel
Comments: 14 pages, 3 figures
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Learning (cs.LG)

Machine learning models are vulnerable to adversarial examples, inputs
maliciously perturbed to mislead the model. These inputs transfer between
models, thus enabling black-box attacks against deployed models. Adversarial
training increases robustness to attacks by injecting adversarial examples into
training data.

Surprisingly, we find that although adversarially trained models exhibit
strong robustness to some white-box attacks (i.e., with knowledge of the model
parameters), they remain highly vulnerable to transferred adversarial examples
crafted on other models. We show that the reason for this vulnerability is the
model’s decision surface exhibiting sharp curvature in the vicinity of the data
points, thus hindering attacks based on first-order approximations of the
model’s loss, but permitting black-box attacks that use adversarial examples
transferred from another model.

We harness this observation in two ways: First, we propose a simple yet
powerful novel attack that first applies a small random perturbation to an
input, before finding the optimal perturbation under a first-order
approximation. Our attack outperforms prior “single-step” attacks on models
trained with or without adversarial training.

Second, we propose Ensemble Adversarial Training, an extension of adversarial
training that additionally augments training data with perturbed inputs
transferred from a number of fixed pre-trained models. On MNIST and ImageNet,
ensemble adversarial training vastly improves robustness to black-box attacks.

Multi-Stage Variational Auto-Encoders for Coarse-to-Fine Image Generation

Lei Cai, Hongyang Gao, Shuiwang Ji
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

Variational auto-encoder (VAE) is a powerful unsupervised learning framework
for image generation. One drawback of VAE is that it generates blurry images
due to its Gaussianity assumption and thus L2 loss. To allow the generation of
high quality images by VAE, we increase the capacity of decoder network by
employing residual blocks and skip connections, which also enable efficient
optimization. To overcome the limitation of L2 loss, we propose to generate
images in a multi-stage manner from coarse to fine. In the simplest case, the
proposed multi-stage VAE divides the decoder into two components in which the
second component generates refined images based on the course images generated
by the first component. Since the second component is independent of the VAE
model, it can employ other loss functions beyond the L2 loss and different
model architectures. The proposed framework can be easily generalized to
contain more than two components. Experiment results on the MNIST and CelebA
datasets demonstrate that the proposed multi-stage VAE can generate sharper
images as compared to those from the original VAE.

Espresso: Efficient Forward Propagation for BCNNs

There are many applications scenarios for which the computational performance
and memory footprint of the prediction phase of Deep Neural Networks (DNNs)
needs to be optimized. Binary Neural Networks (BDNNs) have been shown to be an
effective way of achieving this objective. In this paper, we show how
Convolutional Neural Networks (CNNs) can be implemented using binary
representations. Espresso is a compact, yet powerful library written in C/CUDA
that features all the functionalities required for the forward propagation of
CNNs, in a binary file less than 400KB, without any external dependencies.
Although it is mainly designed to take advantage of massive GPU parallelism,
Espresso also provides an equivalent CPU implementation for CNNs. Espresso
provides special convolutional and dense layers for BCNNs, leveraging
bit-packing and bit-wise computations for efficient execution. These techniques
provide a speed-up of matrix-multiplication routines, and at the same time,
reduce memory usage when storing parameters and activations. We experimentally
show that Espresso is significantly faster than existing implementations of
optimized binary neural networks ((approx) 2 orders of magnitude). Espresso is
released under the Apache 2.0 license and is available at
this http URL

Relaxed Wasserstein with Applications to GANs

Xin Guo, Johnny Hong, Tianyi Lin, Nan Yang
Comments: 9 pages, 24 figures, submitted to NIPS 2017
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

Generative Adversarial Networks (GANs) provide a versatile class of models
for generative modeling. To improve the performance of machine learning models,
there has recently been interest in designing objective functions based on
Wasserstein distance rather than Jensen-Shannon (JS) divergence. In this paper,
we propose a novel asymmetric statistical divergence called Relaxed Wasserstein
(RW) divergence as a generalization of Wasserstein-(L^2) distance of order 2.
We show that RW is dominated by Total Variation (TV) and Wasserstein-(L^2)
distance, and establish continuity, differentiability, and duality
representation of RW divergence. Finally, we provide a nonasymptotic moment
estimate and a concentration inequality for RW divergence. Our experiments show
that RWGANs with Kullback-Leibler (KL) divergence produce recognizable images
with a ReLU Multi-Layer Perceptron (MLP) generator in fewer iterations,
compared to Wasserstein-(L^1) GAN (WGAN).

Clustering under Local Stability: Bridging the Gap between Worst-Case and Beyond Worst-Case Analysis

Maria-Florina Balcan, Colin White
Subjects: Data Structures and Algorithms (cs.DS); Learning (cs.LG)

Recently, there has been substantial interest in clustering research that
takes a beyond worst-case approach to the analysis of algorithms. The typical
idea is to design a clustering algorithm that outputs a near-optimal solution,
provided the data satisfy a natural stability notion. For example, Bilu and
Linial (2010) and Awasthi et al. (2012) presented algorithms that output
near-optimal solutions, assuming the optimal solution is preserved under small
perturbations to the input distances. A drawback to this approach is that the
algorithms are often explicitly built according to the stability assumption and
give no guarantees in the worst case; indeed, several recent algorithms output
arbitrarily bad solutions even when just a small section of the data does not
satisfy the given stability notion.

In this work, we address this concern in two ways. First, we provide
algorithms that inherit the worst-case guarantees of clustering approximation
algorithms, while simultaneously guaranteeing near-optimal solutions when the
data is stable. Our algorithms are natural modifications to existing
state-of-the-art approximation algorithms. Second, we initiate the study of
local stability, which is a property of a single optimal cluster rather than an
entire optimal solution. We show our algorithms output all optimal clusters
which satisfy stability locally. Specifically, we achieve strong positive
results in our local framework under recent stability notions including metric
perturbation resilience (Angelidakis et al. 2017) and robust perturbation
resilience (Balcan and Liang 2012) for the (k)-median, (k)-means, and
symmetric/asymmetric (k)-center objectives.

Information Theory

Finite Blocklength Rates over a Fading Channel with CSIT and CSIR

Deekshith P K, Vinod Sharma
Comments: 6 pages, 2 figures
Subjects: Information Theory (cs.IT)

In this work, we obtain lower and upper bounds on the maximal transmission
rate at a given codeword length (n), average probability of error (epsilon)
and power constraint (ar{P}), over a block fading additive white Gaussian
noise (AWGN) channel with channel state information (CSI) at the transmitter
and the receiver. These bounds characterize deviation of the finite blocklength
coding rates from the channel capacity which is in turn achieved by the water
filling power allocation across time. The bounds obtained also characterize the
rate enhancement possible due to the CSI at the transmitter in the finite
blocklength regime. The results are further elucidated via numerical examples.

An improvement of the asymptotic Elias bound for non-binary codes

Krishna Kaipa
Subjects: Information Theory (cs.IT)

For non-binary codes the Elias bound is a good upper bound for the asymptotic
information rate at low relative minimum distance, where as the Plotkin bound
is better at high relative minimum distance. In this work, we obtain a hybrid
of these bounds which improves both. This in turn is based on the anticode
bound which is a hybrid of the Hamming and Singleton bounds and improves both
bounds.

The question of convexity of the asymptotic rate function is an important
open question. We conjecture a much weaker form of the convexity, and we show
that our bounds follow immediately if we assume the conjecture.

Compressed Sensing with Prior Information via Maximizing Correlation

Xu Zhang, Wei Cui, Yulong Liu
Comments: To appear in Proceedings of IEEE International Symposium on Information Theory 2017
Subjects: Information Theory (cs.IT)

Compressed sensing (CS) with prior information concerns the problem of
reconstructing a sparse signal with the aid of a similar signal which is known
beforehand. We consider a new approach to integrate the prior information into
CS via maximizing the correlation between the prior knowledge and the desired
signal. We then present a geometric analysis for the proposed method under
sub-Gaussian measurements. Our results reveal that if the prior information is
good enough, then the proposed approach can improve the performance of the
standard CS. Simulations are provided to verify our results.

A decoding algorithm for Twisted Gabidulin codes

Tovohery Randrianarisoa, Joachim Rosenthal
Comments: This paper was submitted to ISIT 2017
Subjects: Information Theory (cs.IT)

In this work, we modify the decoding algorithm for subspace codes by Koetter
and Kschischang to get a decoding algorithm for (generalized) twisted Gabidulin
codes. The decoding algorithm we present applies to cases where the code is
linear over the base field (mathbb{F}_q) but not linear over
(mathbb{F}_{q^n}).

Global Guarantees for Enforcing Deep Generative Priors by Empirical Risk

Paul Hand, Vladislav Voroninski
Subjects: Information Theory (cs.IT); Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)

We examine the theoretical properties of enforcing priors provided by
generative deep neural networks via empirical risk minimization. In particular
we consider two models, one in which the task is to invert a generative neural
network given access to its last layer and another which entails recovering a
latent code in the domain of a generative neural network from compressive
linear observations of its last layer. We establish that in both cases, in
suitable regimes of network layer sizes and a randomness assumption on the
network weights, that the non-convex objective function given by empirical risk
minimization does not have any spurious stationary points. That is, we
establish that with high probability, at any point away from small
neighborhoods around two scalar multiples of the desired solution, there is a
descent direction. These results constitute the first theoretical guarantees
which establish the favorable global geometry of these non-convex optimization
problems, and bridge the gap between the empirical success of deep learning and
a rigorous understanding of non-linear inverse problems.

An Overview of Massive MIMO Research at the University of Bristol

Paul Harris, Wael Boukley Hasan, Henry Brice, Benny Chitambira, Mark Beach, Evangelos Mellios, Andrew Nix, Simon Armour, Angela Doufexi
Comments: Presented at the IET Radio Propagation and Technologies for 5G Conference (2016). 5 pages
Subjects: Information Theory (cs.IT)

Massive MIMO has rapidly gained popularity as a technology crucial to the
capacity advances required for 5G wireless systems. Since its theoretical
conception six years ago, research activity has grown exponentially, and there
is now a developing industrial interest to commercialise the technology. For
this to happen effectively, we believe it is crucial that further pragmatic
research is conducted with a view to establish how reality differs from
theoretical ideals. This paper presents an overview of the massive MIMO
research activities occurring within the Communication Systems & Networks Group
at the University of Bristol centred around our 128-antenna real-time testbed,
which has been developed through the BIO programmable city initiative in
collaboration with NI and Lund University. Through recent preliminary trials,
we achieved a world first spectral efficiency of 79.4 bits/s/Hz, and
subsequently demonstrated that this could be increased to 145.6 bits/s/Hz. We
provide a summary of this work here along with some of our ongoing research
directions such as large-scale array wave-front analysis, optimised power
control and localisation techniques.

On the Phase Transition of Corrupted Sensing

Huan Zhang, Yulong Liu, Hong Lei
Comments: To appear in Proceedings of IEEE International Symposium on Information Theory 2017
Subjects: Information Theory (cs.IT)

In cite{FOY2014}, a sharp phase transition has been numerically observed
when a constrained convex procedure is used to solve the corrupted sensing
problem. In this paper, we present a theoretical analysis for this phenomenon.
Specifically, we establish the threshold below which this convex procedure
fails to recover signal and corruption with high probability. Together with the
work in cite{FOY2014}, we prove that a sharp phase transition occurs around
the sum of the squares of spherical Gaussian widths of two tangent cones.
Numerical experiments are provided to demonstrate the correctness and sharpness
of our results.

A Note on the Information-Theoretic-(in)Security of Fading Generated Secret Keys

Robert Malaney
Comments: 4 pages, 2 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Information Theory (cs.IT); Quantum Physics (quant-ph)

In this work we explore the security of secret keys generated via the
electromagnetic reciprocity of the wireless fading channel. Identifying a new
sophisticated colluding attack, we explore the information-theoretic-security
for such keys in the presence of an all-powerful adversary constrained only by
the laws of quantum mechanics. Specifically, we calculate the reduction in the
conditional mutual information between transmitter and receiver that can occur
when an adversary with unlimited computational and communication resources
places directional antenna interceptors at chosen locations. Such locations, in
principal, can be arbitrarily far from the intended receiver yet still
influence the secret key rate.

Corrupted Sensing with Sub-Gaussian Measurements

Jinchi Chen, Yulong Liu
Comments: To appear in Proceedings of IEEE International Symposium on Information Theory 2017
Subjects: Information Theory (cs.IT)

This paper studies the problem of accurately recovering a structured signal
from a small number of corrupted sub-Gaussian measurements. We consider three
different procedures to reconstruct signal and corruption when different kinds
of prior knowledge are available. In each case, we provide conditions for
stable signal recovery from structured corruption with added unstructured
noise. The key ingredient in our analysis is an extended matrix deviation
inequality for isotropic sub-Gaussian matrices.

On Capacity of Noncoherent MIMO with Asymmetric Link Strengths

J. Sebastian, A. Sengupta, S. N. Diggavi
Subjects: Information Theory (cs.IT)

We study the generalized degrees of freedom (gDoF) of the block-fading
noncoherent MIMO channel with asymmetric distributions of link strengths, and a
coherence time of (T) symbol durations. We first derive the optimal signaling
structure for communication over this channel, which is distinct from that for
the i.i.d MIMO setting. We prove that for (T=1), the gDoF is zero for MIMO
channels with arbitrary link strength distributions, extending the result for
MIMO with i.i.d links. We then show that selecting the statistically best
antenna is gDoF-optimal for both Multiple Input Single Output (MISO) and Single
Input Multiple Output (SIMO) channels. We also derive the gDoF for the
(2 imes2) MIMO channel with different exponents in the direct and cross links.
In this setting, we show that it is always necessary to use both antennas to
achieve the optimal gDoF, in contrast to the results for (2 imes2) MIMO with
identical link distributions.

Full-Duplex Bidirectional Secure Communications under Perfect and Distributionally Ambiguous Eavesdropper's CSI

Qiang Li, Ying Zhang, Jingran Lin, Sissi Xiaoxiao Wu
Comments: Accepted by IEEE Transactions Signal Procesing, 14 pages, 8 figures
Subjects: Information Theory (cs.IT)

Consider a full-duplex (FD) bidirectional secure communication system, where
two communication nodes, named Alice and Bob, simultaneously transmit and
receive confidential information from each other, and an eavesdropper, named
Eve, overhears the transmissions. Our goal is to maximize the sum secrecy rate
(SSR) of the bidirectional transmissions by optimizing the transmit covariance
matrices at Alice and Bob. To tackle this SSR maximization (SSRM) problem, we
develop an alternating difference-of-concave (ADC) programming approach to
alternately optimize the transmit covariance matrices at Alice and Bob. We show
that the ADC iteration has a semi-closed-form beamforming solution, and is
guaranteed to converge to a stationary solution of the SSRM problem. Besides
the SSRM design, this paper also deals with a robust SSRM transmit design under
a moment-based random channel state information (CSI) model, where only some
roughly estimated first and second-order statistics of Eve’s CSI are available,
but the exact distribution or other high-order statistics is not known. This
moment-based error model is new and different from the widely used
bounded-sphere error model and the Gaussian random error model. Under the
consider CSI error model, the robust SSRM is formulated as an outage
probability-constrained SSRM problem. By leveraging the Lagrangian duality
theory and DC programming, a tractable safe solution to the robust SSRM problem
is derived. The effectiveness and the robustness of the proposed designs are
demonstrated through simulations.

Polar Coding for Parallel Gaussian Channel

David Tse, Bin Li, Kai Chen
Subjects: Information Theory (cs.IT)

In this paper, we propose a Polar coding scheme for parallel Gaussian
channels. The encoder knows the sum rate of the parallel channels but does not
know the rate of any channel. By using the nesting property of Polar code, we
design a coding/decoding scheme to achieve the sum rates.

Coexistence of RF-powered IoT and a Primary Wireless Network with Secrecy Guard Zones

Mustafa A. Kishk, Harpreet S. Dhillon
Subjects: Information Theory (cs.IT)

This paper studies the secrecy performance of a wireless network (primary
network) overlaid with an ambient RF energy harvesting IoT network (secondary
network). The nodes in the secondary network are assumed to be solely powered
by ambient RF energy harvested from the transmissions of the primary network.
We assume that the secondary nodes can eavesdrop on the primary transmissions
due to which the primary network uses secrecy guard zones. The primary
transmitter goes silent if any secondary receiver is detected within its guard
zone. Using tools from stochastic geometry, we derive the probability of
successful connection of the primary network as well as the probability of
secure communication. Two conditions must be jointly satisfied in order to
ensure successful connection: (i) the SINR at the primary receiver is above a
predefined threshold, and (ii) the primary transmitter is not silent. In order
to ensure secure communication, the SINR value at each of the secondary nodes
should be less than a predefined threshold. Clearly, when more secondary nodes
are deployed, more primary transmitters will remain silent for a given guard
zone radius, thus impacting the amount of energy harvested by the secondary
network. Our results concretely show the existence of an optimal deployment
density for the secondary network that maximizes the density of nodes that are
able to harvest sufficient amount of energy. Furthermore, we show the
dependence of this optimal deployment density on the guard zone radius of the
primary network. In addition, we show that the optimal guard zone radius
selected by the primary network is a function of the deployment density of the
secondary network. This interesting coupling between the two networks is
studied using tools from game theory. Overall, this work is one of the few
concrete works that symbiotically merge tools from stochastic geometry and game
theory.

Large System Analysis of Power Normalization Techniques in Massive MIMO

Meysam Sadeghi, Luca Sanguinetti, Romain Couillet, Chau Yuen
Comments: 12 pages, 3 figures, Accepted for publication in the IEEE Transactions on Vehicular Technology
Subjects: Information Theory (cs.IT)

Linear precoding has been widely studied in the context of Massive
multiple-input-multiple-output (MIMO) together with two common power
normalization techniques, namely, matrix normalization (MN) and vector
normalization (VN). Despite this, their effect on the performance of Massive
MIMO systems has not been thoroughly studied yet. The aim of this paper is to
fulfill this gap by using large system analysis. Considering a system model
that accounts for channel estimation, pilot contamination, arbitrary pathloss,
and per-user channel correlation, we compute tight approximations for the
signal-to-interference-plus-noise ratio and the rate of each user equipment in
the system while employing maximum ratio transmission (MRT), zero forcing (ZF),
and regularized ZF precoding under both MN and VN techniques. Such
approximations are used to analytically reveal how the choice of power
normalization affects the performance of MRT and ZF under uncorrelated fading
channels. It turns out that ZF with VN resembles a sum rate maximizer while it
provides a notion of fairness under MN. Numerical results are used to validate
the accuracy of the asymptotic analysis and to show that in Massive MIMO,
non-coherent interference and noise, rather than pilot contamination, are often
the major limiting factors of the considered precoding schemes.

On deep holes of generalized projective Reed-Solomon codes

Xiaofan Xu, Shaofang Hong, Yongchao Xu
Comments: 16 pages
Subjects: Number Theory (math.NT); Information Theory (cs.IT)

Determining deep holes is an important topic in decoding Reed-Solomon codes.
Let (lge 1) be an integer and (a_1,ldots,a_l) be arbitrarily given (l)
distinct elements of the finite field ({f F}_q) of (q) elements with the odd
prime number (p) as its characteristic. Let (D={f
F}_qackslash{a_1,ldots,a_l}) and (k) be an integer such that (2le kle
q-l-1). In this paper, we study the deep holes of generalized projective
Reed-Solomon code ({
m GPRS}_q(D, k)) of length (q-l+1) and dimension (k) over
({f F}_q). For any (f(x)in {f F}_q[x]), we let
(f(D)=(f(y_1),ldots,f(y_{q-l}))) if (D={y_1, …, y_{q-l}}) and
(c_{k-1}(f(x))) be the coefficient of (x^{k-1}) of (f(x)). By using D”ur’s
theorem on the relation between the covering radius and minimum distance of
({
m GPRS}_q(D, k)), we show that if (u(x)in {f F}_q[x]) with (deg
(u(x))=k), then the received codeword ((u(D), c_{k-1}(u(x)))) is a deep hole of
({
m GPRS}_q(D, k)) if and only if the sum (sumlimits_{yin I}y) is nonzero
for any subset (Isubseteq D) with (#(I)=k). We show also that if (j) is an
integer with (1leq jleq l) and (u_j(x):= lambda_j(x-a_j)^{q-2}+
u_j
x^{k-1}+f_{leq k-2}^{(j)}(x)) with (lambda_jin {f F}_q^*), (
u_jin {f
F}_q) and (f_{leq{k-2}}^{(j)}(x)in{f F}_q[x]) being a polynomial of degree
at most (k-2), then ((u_j(D), c_{k-1}(u_j(x)))) is a deep hole of ({
m
GPRS}_q(D, k)) if and only if the sum
(inom{q-2}{k-1}(-a_j)^{q-1-k}prodlimits_{yin I}(a_j-y)+e) is nonzero for
any subset (Isubseteq D) with (#(I)=k), where (e) is the identity of the
group ({f F}_q^*). This implies that ((u_j(D), c_{k-1}(u_j(x)))) is a deep
hole of ({
m GPRS}_q(D, k)) if (p|k).

Information-theoretic analysis of generalization capability of learning algorithms

Aolin Xu, Maxim Raginsky
Subjects: Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)

We derive upper bounds on the generalization error of a learning algorithm in
terms of the mutual information between its input and output. The upper bounds
provide theoretical guidelines for striking the right balance between data fit
and generalization by controlling the input-output mutual information of a
learning algorithm. The results can also be used to analyze the generalization
capability of learning algorithms under adaptive composition, and the
bias-accuracy tradeoffs in adaptive data analytics. Our work extends and leads
to nontrivial improvements on the recent results of Russo and Zou.

Detection Estimation and Grid matching of Multiple Targets with Single Snapshot Measurements

Rakshith Jagannath
Subjects: Applications (stat.AP); Information Theory (cs.IT)

In this work, we explore the problems of detecting the number of narrow-band,
far-field targets and estimating their corresponding directions from single
snapshot measurements. The principles of sparse signal recovery (SSR) are used
for the single snapshot detection and estimation of multiple targets. In the
SSR framework, the DoA estimation problem is grid based and can be posed as the
lasso optimization problem. However, the SSR framework for DoA estimation gives
rise to the grid mismatch problem, when the unknown targets (sources) are not
matched with the estimation grid chosen for the construction of the array
steering matrix at the receiver. The block sparse recovery framework is known
to mitigate the grid mismatch problem by jointly estimating the targets and
their corresponding offsets from the estimation grid using the group lasso
estimator. The corresponding detection problem reduces to estimating the
optimal regularization parameter (( au)) of the lasso (in case of perfect
grid-matching) or group-lasso estimation problem for achieving the required
probability of correct detection ((P_c)). We propose asymptotic and finite
sample test statistics for detecting the number of sources with the required
(P_c) at moderate to high signal to noise ratios. Once the number of sources
are detected, or equivalently the optimal (hat{ au}) is estimated, the
corresponding estimation and grid matching of the DoAs can be performed by
solving the lasso or group-lasso problem at (hat{ au})

Spatially Controlled Relay Beamforming: (2)-Stage Optimal Policies

Dionysios S. Kalogerias, Athina P. Petropulu
Comments: 68 pages, 10 figures, this work constitutes an extended preprint/version of a two part paper (soon to be) submitted for publication to the IEEE Transactions on Signal Processing in Spring/Summer 2017
Subjects: Optimization and Control (math.OC); Information Theory (cs.IT); Statistics Theory (math.ST); Applications (stat.AP); Methodology (stat.ME)

The problem of enhancing Quality-of-Service (QoS) in power constrained,
mobile relay beamforming networks, by optimally and dynamically controlling the
motion of the relaying nodes, is considered, in a dynamic channel environment.
We assume a time slotted system, where the relays update their positions before
the beginning of each time slot. Modeling the wireless channel as a Gaussian
spatiotemporal stochastic field, we propose a novel (2)-stage stochastic
programming problem formulation for optimally specifying the positions of the
relays at each time slot, such that the expected QoS of the network is
maximized, based on causal Channel State Information (CSI) and under a total
relay transmit power budget. This results in a schema where, at each time slot,
the relays, apart from optimally beamforming to the destination, also
optimally, predictively decide their positions at the next time slot, based on
causally accumulated experience. Exploiting either the Method of Statistical
Differentials, or the multidimensional Gauss-Hermite Quadrature Rule, the
stochastic program considered is shown to be approximately equivalent to a set
of simple subproblems, which are solved in a distributed fashion, one at each
relay. Optimality and performance of the proposed spatially controlled system
are also effectively assessed, under a rigorous technical framework; strict
optimality is rigorously demonstrated via the development of a version of the
Fundamental Lemma of Stochastic Control, and, performance-wise, it is shown
that, quite interestingly, the optimal average network QoS exhibits an
increasing trend across time slots, despite our myopic problem formulation.
Numerical simulations are presented, experimentally corroborating the success
of the proposed approach and the validity of our theoretical predictions.

Performance Evaluation of Optimal Radio Access Technology Selection Algorithms for LTE-WiFi Network

Arghyadip Roy, Prasanna Chaporkar, Abhay Karandikar
Subjects: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT)

A Heterogeneous Network (HetNet) comprises of multiple Radio Access
Technologies (RATs) allowing a user to associate with a specific RAT and steer
to other RATs in a seamless manner. To cope up with the unprecedented growth of
data traffic, mobile data can be offloaded to Wireless Fidelity (WiFi) in a
Long Term Evolution (LTE) based HetNet. In this paper, an optimal RAT selection
problem is considered to maximize the total system throughput in an LTE-WiFi
system with offload capability. Another formulation is also developed where
maximizing the total system throughput is subject to a constraint on the voice
user blocking probability. It is proved that the optimal policies for the
association and offloading of voice/data users contain threshold structures.
Based on the threshold structures, we propose algorithms for the association
and offloading of users in LTE-WiFi HetNet. Simulation results are presented to
demonstrate the voice user blocking probability and the total system throughput
performance of the proposed algorithms in comparison to another benchmark
algorithm.

Learning Feature Nonlinearities with Non-Convex Regularized Binned Regression

Samet Oymak, Mehrdad Mahdavi, Jiasi Chen
Comments: 22 pages, 7 figures
Subjects: Learning (cs.LG); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)

For various applications, the relations between the dependent and independent
variables are highly nonlinear. Consequently, for large scale complex problems,
neural networks and regression trees are commonly preferred over linear models
such as Lasso. This work proposes learning the feature nonlinearities by
binning feature values and finding the best fit in each quantile using
non-convex regularized linear regression. The algorithm first captures the
dependence between neighboring quantiles by enforcing smoothness via
piecewise-constant/linear approximation and then selects a sparse subset of
good features. We prove that the proposed algorithm is statistically and
computationally efficient. In particular, it achieves linear rate of
convergence while requiring near-minimal number of samples. Evaluations on
synthetic and real datasets demonstrate that algorithm is competitive with
current state-of-the-art and accurately learns feature nonlinearities. Finally,
we explore an interesting connection between the binning stage of our algorithm
and sparse Johnson-Lindenstrauss matrices.