Arash Ardakani, Carlo Condo, Warren J. Gross
Comments: 13 pages, 3 figures
Subjects: Neural and Evolutionary Computing (cs.NE); Learning (cs.LG)
Recently deep neural networks have received considerable attention due to
their ability to extract and represent high-level abstractions in data sets.
Deep neural networks such as fully-connected and convolutional neural networks
have shown excellent performance on a wide range of recognition and
classification tasks. However, their hardware implementations currently suffer
from large silicon area and high power consumption due to their high degree
of complexity. The power/energy consumption of neural networks is dominated by
memory accesses, the majority of which occur in fully-connected networks. In
fact, they contain most of the deep neural network parameters. In this paper,
we propose sparsely-connected networks, by showing that the number of
connections in fully-connected networks can be reduced by up to 90% while
improving accuracy on three popular datasets (MNIST, CIFAR10
and SVHN). We then propose an efficient hardware architecture based on
linear-feedback shift registers to reduce the memory requirements of the
proposed sparsely-connected networks. The proposed architecture can save up to
90% of memory compared to the conventional implementations of fully-connected
neural networks. Moreover, implementation results show up to 84% reduction in
the energy consumption of a single neuron of the proposed sparsely-connected
networks compared to a single neuron of fully-connected neural networks.
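As a rough illustration of how a linear-feedback shift register can stand in for an explicitly stored connection mask (a minimal sketch, not the authors' hardware design: the tap positions, bit width, and the mapping from the bit stream to kept connections are assumptions), the LFSR's pseudorandom stream decides which weights of a fully-connected layer survive, so only the seed needs to be stored rather than the mask itself.

    import numpy as np

    def lfsr_bits(seed, n, taps=(0, 2, 3, 5)):
        """Pseudorandom bit stream from a 16-bit Fibonacci LFSR (taps at bits 16, 14, 13, 11)."""
        state = seed & 0xFFFF
        out = np.empty(n, dtype=np.uint8)
        for i in range(n):
            out[i] = state & 1                      # output the bit shifted out
            fb = 0
            for t in taps:                          # feedback bit = XOR of the tap bits
                fb ^= (state >> t) & 1
            state = (state >> 1) | (fb << 15)
        return out

    def sparse_mask(shape, seed, keep_prob=0.10, bits_per_weight=8):
        """Build a keep/drop mask from the LFSR stream alone (only the seed is stored).
        Grouping 8 bits per weight quantizes the keep probability to multiples of 1/256."""
        n = int(np.prod(shape))
        bits = lfsr_bits(seed, n * bits_per_weight).reshape(n, bits_per_weight)
        vals = bits @ (1 << np.arange(bits_per_weight))
        mask = (vals < keep_prob * (1 << bits_per_weight)).astype(np.float32)
        return mask.reshape(shape)

    # Example: drop roughly 90% of the weights of a small fully-connected layer.
    W = np.random.randn(256, 64).astype(np.float32)
    M = sparse_mask(W.shape, seed=0xACE1)
    W_sparse = W * M
    print("kept fraction:", M.mean())               # approximately 0.10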
Leon Sixt, Benjamin Wild, Tim Landgraf
Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
Deep Convolutional Neural Networks (DCNN) are showing remarkable
performance on many computer vision tasks. Due to their large parameter space,
they require many labeled samples when trained in a supervised setting. The
costs of annotating data manually can render the usage of DCNNs infeasible. We
present a novel framework called RenderGAN that can generate large amounts of
realistic, labeled images by combining a 3D model and the Generative
Adversarial Network framework. In our approach, image augmentations (e.g.
lighting, background, and detail) are learned from unlabeled data such that the
generated images are strikingly realistic while preserving the labels known
from the 3D model. We apply the RenderGAN framework to generate images of
barcode-like markers that are attached to honeybees. A DCNN trained on these
generated data alone performs better on a test set of real data than an
identical DCNN trained on the limited amount of real data available.
Tiffany Hwu, Jacob Isbell, Nicolas Oros, Jeffrey Krichmar
Comments: 6 pages, 8 figures
Subjects: Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
Neuromorphic computing is a promising solution for reducing the size, weight
and power of mobile embedded systems. In this paper, we introduce a realization
of such a system by creating the first closed-loop battery-powered
communication system between an IBM TrueNorth NS1e and an autonomous
Android-Based Robotics platform. Using this system, we constructed a dataset of
path-following behavior by manually driving the Android-based robot along steep
mountain trails and recording video frames from the camera mounted on the robot
along with the corresponding motor commands. We used this dataset to train a
deep convolutional neural network implemented on the TrueNorth NS1e. The NS1e,
which was mounted on the robot and powered by the robot’s battery, resulted in
a self-driving robot that could successfully traverse a steep mountain path in
real time. To our knowledge, this represents the first time the TrueNorth NS1e
neuromorphic chip has been embedded on a mobile platform under closed-loop
control.
Sihan Li, Jiantao Jiao, Yanjun Han, Tsachy Weissman
Subjects: Neural and Evolutionary Computing (cs.NE); Learning (cs.LG); Machine Learning (stat.ML)
We provide a theoretical explanation for the superb performance of ResNet via
the study of deep linear networks and some nonlinear variants. We show that
with or without nonlinearities, by adding shortcuts that have depth two, the
condition number of the Hessian of the loss function at the zero initial point
is depth-invariant, which makes training very deep models no more difficult
than shallow ones. Shortcuts of higher depth result in an extremely flat
(high-order) stationary point initially, from which it is hard for the
optimization algorithm to escape. The 1-shortcut, however, is essentially equivalent to no
shortcuts. Extensive experiments are provided accompanying our theoretical
results. We show that initializing the network to small weights with
2-shortcuts achieves significantly better results than random Gaussian (Xavier)
initialization, orthogonal initialization, and shortcuts of deeper depth, from
various perspectives ranging from final loss, learning dynamics and stability,
to the behavior of the Hessian along the learning process.
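A tiny numerical illustration related to the zero initialization discussed above (it does not compute the Hessian condition number; the block size and depth are illustrative): a deep linear network built from depth-2 shortcut blocks is exactly the identity map when all weights are zero, whereas the same network without shortcuts collapses to the zero map at that initialization.

    import numpy as np

    d, n_blocks = 8, 20
    x = np.random.randn(d)

    def block(x, W1, W2, shortcut=True):
        """One depth-2 block: two linear layers, optionally wrapped by an identity shortcut."""
        h = W2 @ (W1 @ x)
        return x + h if shortcut else h

    # Zero initialization, as studied in the abstract above.
    W1s = [np.zeros((d, d)) for _ in range(n_blocks)]
    W2s = [np.zeros((d, d)) for _ in range(n_blocks)]

    y_res, y_plain = x.copy(), x.copy()
    for W1, W2 in zip(W1s, W2s):
        y_res = block(y_res, W1, W2, shortcut=True)      # stays exactly x
        y_plain = block(y_plain, W1, W2, shortcut=False)  # collapses to 0

    print(np.allclose(y_res, x), np.allclose(y_plain, 0))   # True True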
Hyo-Eun Kim, Sangheum Hwang, Kyunghyun Cho
Subjects: Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
The latent representation learned by multi-layered neural networks via
hierarchical feature abstraction enables the recent success of deep learning. Under
the deep learning framework, generalization performance highly depends on the
learned latent representation which is obtained from an appropriate training
scenario with a task-specific objective on a designed network model. In this
work, we propose a novel latent space modeling method to learn better latent
representation. We designed a neural network model based on the assumption that
good base representation can be attained by maximizing the total correlation
between the input, latent, and output variables. From the base model, we
introduce a semantic noise modeling method which enables class-conditional
perturbation on latent space to enhance the representational power of learned
latent feature. During training, latent vector representation can be
stochastically perturbed by a modeled class-conditional additive noise while
maintaining its original semantic feature. It implicitly brings the effect of
semantic augmentation on the latent space. The proposed model can be easily
learned by back-propagation with common gradient-based optimization algorithms.
Experimental results show that the proposed method helps to achieve performance
benefits against various previous approaches. We also provide the empirical
analyses for the proposed class-conditional perturbation process including
t-SNE visualization.
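As a rough sketch of the perturbation step described above (the network architecture and how the noise statistics are modeled are not given in the abstract, so the Gaussian form and per-class parameters below are assumptions): during training, each latent vector is shifted by additive noise drawn from a distribution that depends on the class label.

    import numpy as np

    rng = np.random.default_rng(0)
    n_classes, latent_dim = 10, 64

    # Assumed class-conditional noise statistics (e.g., estimated during training).
    class_mu = rng.normal(scale=0.1, size=(n_classes, latent_dim))
    class_sigma = np.full((n_classes, latent_dim), 0.05)

    def perturb_latent(z, labels):
        """Stochastically perturb latent vectors with class-conditional additive Gaussian noise."""
        noise = class_mu[labels] + class_sigma[labels] * rng.standard_normal(z.shape)
        return z + noise

    z = rng.standard_normal((32, latent_dim))            # a batch of latent vectors
    labels = rng.integers(0, n_classes, size=32)
    z_aug = perturb_latent(z, labels)                     # semantically "augmented" latents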
Zachary C. Lipton, Jianfeng Gao, Lihong Li, Jianshu Chen, Li Deng
Subjects: Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
To use deep reinforcement learning in the wild, we might hope for an agent
that would never make catastrophic mistakes. At the very least, we could hope
that an agent would eventually learn to avoid old mistakes. Unfortunately, even
in simple environments, modern deep reinforcement learning techniques are
doomed by a Sisyphean curse. Owing to the use of function approximation, these
agents eventually forget experiences as they become exceedingly unlikely under
a new policy. Consequently, for as long as they continue to train,
state-aggregating agents may periodically relive catastrophic mistakes. We
demonstrate unacceptable performance of deep Q-networks on two toy problems. We
then introduce intrinsic fear, a method that mitigates these problems by
avoiding dangerous states.
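One plausible reading of the intrinsic-fear idea (the form of the fear model, where the penalty enters, and its scaling are assumptions here, not taken from the abstract): a separately trained model scores how close a state is to a catastrophic one, and that score, scaled by a coefficient, is subtracted from the Q-learning target so the agent is steered away from dangerous states.

    def fear_shaped_target(reward, next_q_max, fear_score, gamma=0.99, fear_coeff=1.0):
        """Q-learning target with an intrinsic-fear penalty subtracted.

        fear_score: output in [0, 1] of a separately trained classifier that
        predicts whether the next state is close to a catastrophic state.
        (The exact placement and scaling of the penalty are assumptions.)
        """
        return reward - fear_coeff * fear_score + gamma * next_q_max

    # Example: a transition near a danger zone gets a reduced bootstrap target.
    print(fear_shaped_target(reward=1.0, next_q_max=5.0, fear_score=0.8))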
Ankan Bansal, Anirudh Nanduri, Carlos Castillo, Rajeev Ranjan, Rama Chellappa
Comments: 9 pages, submitted to WACV 2017
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Recent progress in face detection (including keypoint detection) and
recognition is mainly being driven by (i) deeper convolutional neural network
architectures, and (ii) larger datasets. However, most of the largest datasets
are maintained by private companies and are not publicly available. The
academic computer vision community needs larger and more varied datasets to
make further progress.
In this paper we introduce a new face dataset, called UMDFaces, which has
367,920 face annotations of 8,501 subjects. We discuss how a large dataset can
be collected and annotated using human annotators and deep networks. We provide
human curated bounding boxes for faces. We also provide estimated pose (roll,
pitch and yaw), locations of twenty-one key-points and gender information
generated by a pre-trained neural network. Finally, we compare the quality of
the dataset with other publicly available face datasets at similar scales.
Saeed Reza Kheradpisheh, Mohammad Ganjtabesh, Simon J Thorpe, Timothée Masquelier
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Previous studies have shown that spike-timing-dependent plasticity (STDP) can
be used in spiking neural networks (SNN) to extract visual features of low or
intermediate complexity in an unsupervised manner. These studies, however, used
relatively shallow architectures, and only one layer was trainable. Another
line of research has demonstrated – using rate-based neural networks trained
with back-propagation – that having many layers increases the recognition
robustness, an approach known as deep learning. We thus designed a deep SNN,
comprising several convolutional (trainable with STDP) and pooling layers. We
used a temporal coding scheme where the most strongly activated neurons fire
first, and less activated neurons fire later or not at all. The network was
exposed to natural images. Thanks to STDP, neurons progressively learned
features corresponding to prototypical patterns that were both salient and
frequent. Only a few tens of examples per category were required and no label
was needed. After learning, the complexity of the extracted features increased
along the hierarchy, from edge detectors in the first layer to object
prototypes in the last layer. Coding was very sparse, with only a few thousand
spikes per image, and in some cases the object category could be reasonably
well inferred from the activity of a single higher-order neuron. More
generally, the activity of a few hundred such neurons contained robust
category information, as demonstrated using a classifier on Caltech 101,
ETH-80, and MNIST databases. We think that the combination of STDP with latency
coding is key to understanding the way that the primate visual system learns,
its remarkable processing speed and its low energy consumption. These
mechanisms are also interesting for artificial vision systems, particularly for
hardware solutions.
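A minimal sketch of the kind of simplified STDP rule such networks typically use (the exact rule, learning rates, and weight dependence are assumptions here): a synapse is potentiated when the presynaptic spike arrives before the postsynaptic one and depressed otherwise, with the weight softly bounded to [0, 1].

    import numpy as np

    def stdp_update(w, t_pre, t_post, a_plus=0.004, a_minus=0.003):
        """Simplified, timing-sign STDP: the order of pre/post spikes decides LTP vs LTD.
        The multiplicative term w*(1-w) keeps the weight in [0, 1]."""
        if t_post >= t_pre:              # pre fired before (or with) post -> potentiate
            dw = a_plus * w * (1.0 - w)
        else:                            # post fired first -> depress
            dw = -a_minus * w * (1.0 - w)
        return np.clip(w + dw, 0.0, 1.0)

    w = 0.5
    w = stdp_update(w, t_pre=3, t_post=7)   # causal pair: weight increases slightly
    print(w)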
Mariano Tepper, Guillermo Sapiro
Subjects: Computer Vision and Pattern Recognition (cs.CV)
In this work, we introduce the first algorithm to truly address the
nonnegative matrix underapproximation (NMU) problem, i.e., nonnegative matrix
factorization (NMF) with an additional underapproximation constraint. NMU
results are interesting as, compared to traditional NMF, they present
additional sparsity and part-based behavior, explaining unique data features.
To show an example of these features in practice, we first present an
application to the analysis of climate data. We then present an NMU-based
algorithm to robustly fit multiple parametric models to a dataset. The proposed
approach delivers state-of-the-art results for the estimation of multiple
fundamental matrices and homographies, outperforming other alternatives in the
literature and exemplifying the use of efficient NMU computations.
Vania V. Estrela, Luis A. Rivera, Paulo C. Beggio, Ricardo T. Lopes
Comments: 8 pages, 6 figures in Proceedings of the XVI Brazilian Symposium on Computer Graphics and Image Processing, 2003. SIBGRAPI 2003. IEEE. arXiv admin note: text overlap with arXiv:1403.7365, arXiv:1611.00960
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The computation of 2-D optical flow by means of regularized pel-recursive
algorithms raises a host of issues, which include the treatment of outliers,
motion discontinuities and occlusion among other problems. We propose a new
approach which allows us to deal with these issues within a common framework.
Our approach is based on the use of a technique called Generalized
Cross-Validation to estimate the best regularization scheme for a given pixel.
In our model, the regularization parameter is a matrix whose entries can
account for diverse sources of error. The estimation of the motion vectors
takes into consideration local properties of the image following a spatially
adaptive approach where each moving pixel is supposed to have its own
regularization matrix. Preliminary experiments indicate that this approach
provides robust estimates of the optical flow.
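For reference, a standard statement of the Generalized Cross-Validation criterion for a linear regularized estimator (the notation is ours; the matrix-valued regularization used in the paper generalizes this scalar form): for observations \(\mathbf{y}\) and influence matrix \(A(\lambda)\) mapping observations to fitted values,

\[
\mathrm{GCV}(\lambda) \;=\; \frac{\tfrac{1}{n}\,\|(I - A(\lambda))\,\mathbf{y}\|^{2}}{\left[\tfrac{1}{n}\,\mathrm{tr}\!\left(I - A(\lambda)\right)\right]^{2}},
\qquad
\hat{\lambda} \;=\; \arg\min_{\lambda}\, \mathrm{GCV}(\lambda),
\]

and the regularization is chosen per pixel by minimizing this criterion.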
Pedro H. P. Savarese
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)
We propose a technique to augment network layers by adding a linear gating
mechanism, which provides a way to learn identity mappings by optimizing only
one parameter. We also introduce a new metric that serves as the basis for the
technique. It captures the difficulty involved in learning identity mappings
for different types of network models, and provides a new theoretical intuition
for the increased depths of models such as Highway and Residual Networks. We
propose a new model, the Gated Residual Network, which is the result when
augmenting Residual Networks. Experimental results show that augmenting layers
yields increased performance, fewer issues with depth, and more layer
independence: fully removing them does not cripple the model. We evaluate our
method on MNIST using fully-connected networks and on CIFAR-10 using Wide
ResNets, achieving a relative error reduction of more than 8% in the latter
when compared to the original model.
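A sketch of the single-parameter gating idea (the exact gate placement and parameterization in the paper may differ; the convex-combination form below is one natural reading): a scalar k blends the layer's transformation with the identity, so the layer can learn an identity mapping by driving a single parameter toward zero.

    import numpy as np

    def gated_residual_layer(x, W, b, k):
        """y = k * f(x) + (1 - k) * x, with f a ReLU layer.
        Optimizing only the scalar k toward 0 recovers the identity mapping."""
        f = np.maximum(0.0, x @ W + b)      # the layer's usual transformation
        return k * f + (1.0 - k) * x

    x = np.random.randn(4, 16)
    W = np.random.randn(16, 16) * 0.1
    b = np.zeros(16)
    print(np.allclose(gated_residual_layer(x, W, b, k=0.0), x))   # identity when k = 0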
Alexey Kurakin, Ian Goodfellow, Samy Bengio
Comments: 15 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Learning (cs.LG); Machine Learning (stat.ML)
Adversarial examples are malicious inputs designed to fool machine learning
models. They often transfer from one model to another, allowing attackers to
mount black box attacks without knowledge of the target model’s parameters.
Adversarial training is the process of explicitly training a model on
adversarial examples, in order to make it more robust to attack or to reduce
its test error on clean inputs. So far, adversarial training has primarily been
applied to small problems. In this research, we apply adversarial training to
ImageNet. Our contributions include: (1) recommendations for how to successfully
scale adversarial training to large models and datasets, (2) the observation
that adversarial training confers robustness to single-step attack methods, (3)
the finding that multi-step attack methods are somewhat less transferable than
single-step attack methods, so single-step attacks are the best for mounting
black-box attacks, and (4) resolution of a “label leaking” effect that causes
adversarially trained models to perform better on adversarial examples than on
clean examples, because the adversarial example construction process uses the
true label and the model can learn to exploit regularities in the construction
process.
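For context, the single-step attack referred to above is typically the fast gradient sign method; a minimal, framework-agnostic sketch (the gradient of the loss with respect to the input is assumed to be supplied by whatever framework trains the model):

    import numpy as np

    def fgsm(x, grad_loss_wrt_x, eps=8.0 / 255.0):
        """Fast gradient sign method: one signed gradient step, clipped to the valid pixel range.
        grad_loss_wrt_x: gradient of the training loss w.r.t. the input, evaluated at (x, true label);
        using the true label here is what gives rise to the label-leaking effect mentioned above."""
        x_adv = x + eps * np.sign(grad_loss_wrt_x)
        return np.clip(x_adv, 0.0, 1.0)

    x = np.random.rand(32, 32, 3)      # a dummy image in [0, 1]
    g = np.random.randn(32, 32, 3)     # stand-in for the loss gradient
    x_adv = fgsm(x, g)
    # Adversarial training then mixes such perturbed examples into each training batch.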
Shusil Dangi, Nathan Cahill, Cristian A. Linte
Comments: Statistical Atlases and Computational Modelling of Heart workshop 2016
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Magnetic Resonance Imaging (MRI) has evolved as a clinical standard-of-care
imaging modality for cardiac morphology, function assessment, and guidance of
cardiac interventions. All these applications rely on accurate extraction of
the myocardial tissue and blood pool from the imaging data. Here we propose a
framework for left ventricle (LV) segmentation from cardiac cine-MRI. First, we
segment the LV blood pool using iterative graph cuts, and subsequently use this
information to segment the myocardium. We formulate the segmentation procedure
as an energy minimization problem in a graph subject to the shape prior
obtained by label propagation from an average atlas using affine registration.
The proposed framework has been validated on 30 patient cardiac cine-MRI
datasets available through the STACOM LV segmentation challenge and yielded
fast, robust, and accurate segmentation results.
Jonathan Vacher, Andrew Isaac Meso, Laurent U. Perrinet, Gabriel Peyré
Subjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
A common practice to account for psychophysical biases in vision is to frame
them as consequences of a dynamic process relying on optimal inference with
respect to a generative model. The present study details the complete
formulation of such a generative model intended to probe visual motion
perception. It is first derived in a set of axiomatic steps constrained by
biological plausibility. We then extend previous contributions by detailing
three equivalent formulations of the Gaussian dynamic texture model. First, the
composite dynamic textures are constructed by the random aggregation of warped
patterns, which can be viewed as 3D Gaussian fields. Second, these textures are
cast as solutions to a stochastic partial differential equation (sPDE). This
essential step enables real-time, on-the-fly texture synthesis using
time-discretized auto-regressive processes. It also allows for the derivation
of a local motion-energy model, which corresponds to the log-likelihood of the
probability density. The log-likelihoods are finally essential for the
construction of a Bayesian inference framework. We use the model to probe speed
perception in humans psychophysically using zoom-like changes in stimulus
spatial frequency content. The likelihood is contained within the generative
model and we chose a slow speed prior consistent with previous literature. We
then validated the fitting process of the model using synthesized data. The
human data replicates previous findings that relative perceived speed is
positively biased by spatial frequency increments. The effect cannot be fully
accounted for by previous models, but the current prior acting on the
spatio-temporal likelihoods has proved necessary in accounting for the
perceptual bias.
Leon Sixt, Benjamin Wild, Tim Landgraf
Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
Deep Convolutional Neural Networks (DCNN) are showing remarkable
performance on many computer vision tasks. Due to their large parameter space,
they require many labeled samples when trained in a supervised setting. The
costs of annotating data manually can render the usage of DCNNs infeasible. We
present a novel framework called RenderGAN that can generate large amounts of
realistic, labeled images by combining a 3D model and the Generative
Adversarial Network framework. In our approach, image augmentations (e.g.
lighting, background, and detail) are learned from unlabeled data such that the
generated images are strikingly realistic while preserving the labels known
from the 3D model. We apply the RenderGAN framework to generate images of
barcode-like markers that are attached to honeybees. A DCNN trained on these
generated data alone performs better on a test set of real data than an
identical DCNN trained on the limited amount of real data available.
Jie Sun, Erik Bollt
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)
Optical flow refers to the visual motion observed between two consecutive
images. Since the number of degrees of freedom is typically much larger than
the number of constraints imposed by the image observations, the straightforward formulation
of optical flow inference is an ill-posed problem. By setting some type of
additional “regularity” constraints, classical approaches formulate a
well-posed optical flow inference problem in the form of a parameterized set of
variational equations. In this work we build a mathematical connection, focused
on optical flow methods, between classical variational optical flow approaches
and Bayesian statistical inversion. A classical optical flow solution is in
fact identical to a maximum a posteriori estimator under the assumptions of a
linear model with additive independent Gaussian noise and a Gaussian prior
distribution. Unlike classical approaches, the statistical inversion approach
to optical flow estimation not only allows for “point” estimates, but also
provides a distribution of solutions which can be used for ensemble estimation
and in particular uncertainty quantification.
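The equivalence mentioned above can be written compactly (notation is ours): assuming a linear observation model \(\mathbf{b} = A\mathbf{v} + \boldsymbol{\eta}\) with \(\boldsymbol{\eta}\sim\mathcal{N}(0,\sigma^{2} I)\) and a Gaussian prior \(\mathbf{v}\sim\mathcal{N}(0,\Sigma)\) on the flow field, the MAP estimate coincides with a classical regularized (Tikhonov-type) variational solution,

\[
\hat{\mathbf{v}}_{\mathrm{MAP}}
\;=\; \arg\max_{\mathbf{v}}\, p(\mathbf{v}\mid\mathbf{b})
\;=\; \arg\min_{\mathbf{v}} \left\{ \frac{1}{\sigma^{2}}\,\|A\mathbf{v}-\mathbf{b}\|^{2} \;+\; \mathbf{v}^{\top}\Sigma^{-1}\mathbf{v} \right\},
\]

with the prior covariance \(\Sigma\) playing the role of the regularizer, while the full posterior supplies the uncertainty quantification.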
Krzysztof Chalupka, Frederick Eberhardt, Pietro Perona
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Learning (cs.LG)
We propose a method to classify the causal relationship between two discrete
variables given only the joint distribution of the variables, acknowledging
that the method is subject to an inherent baseline error. We assume that the
causal system is acyclic, but we do allow for hidden common causes. Our
algorithm presupposes that the probability distribution \(P(C)\) of a cause \(C\)
is independent of the probability distribution \(P(E \mid C)\) of the
cause-effect mechanism. While our classifier is trained with a Bayesian
assumption of flat hyperpriors, we do not make this assumption about our test
data. This work connects to recent developments on the identifiability of
causal models over continuous variables under the assumption of “independent
mechanisms”. Carefully-commented Python notebooks that reproduce all our
experiments are available online at
this http URL
Raman Arora, Amitabh Basu, Poorya Mianjy, Anirbit Mukherjee
Subjects: Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (stat.ML)
In this paper we investigate the family of functions representable by deep
neural networks (DNN) with rectified linear units (ReLU). We give the
first-ever polynomial time (in the size of data) algorithm to train a ReLU DNN
with one hidden layer to global optimality. This follows from our complete
characterization of the ReLU DNN function class whereby we show that a
\(\mathbb{R}^n \to \mathbb{R}\) function is representable by a ReLU DNN if and
only if it is a continuous piecewise linear function. The main tool used to
prove this characterization is an elegant result from tropical geometry.
Further, for the \(n=1\) case, we show that a single hidden layer suffices to
express all piecewise linear functions, and we give tight bounds for the size
of such a ReLU DNN. We follow up with gap results showing that there is a
smoothly parameterized family of \(\mathbb{R} \to \mathbb{R}\) “hard” functions
that lead to an exponential blow-up in size, if the number of layers is
decreased by a small amount. An example consequence of our gap theorem is that
for every natural number \(N\), there exists a function representable by a ReLU
DNN with depth \(N^2+1\) and total size \(N^3\), such that any ReLU DNN with depth
at most \(N+1\) will require at least \(\frac{1}{2}N^{N+1}-1\) total nodes.
Finally, we construct a family of \(\mathbb{R}^n \to \mathbb{R}\) functions for
\(n \geq 2\) (also smoothly parameterized), whose number of affine pieces scales
exponentially with the dimension \(n\) at any fixed size and depth. To the best
of our knowledge, such a construction with exponential dependence on \(n\) has
not been achieved by previous families of “hard” functions in the neural nets
literature.
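A concrete instance of the characterization for \(n=1\) (an illustrative example, not taken from the paper): the continuous piecewise linear function \(|x| = \max(0,x) + \max(0,-x)\) is exactly a one-hidden-layer ReLU network with two hidden units; functions with more pieces simply require more hidden units.

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def abs_via_relu(x):
        """|x| as a one-hidden-layer ReLU net: hidden weights (+1, -1), output weights (1, 1)."""
        h = relu(np.stack([x, -x], axis=-1))     # two hidden units
        return h @ np.array([1.0, 1.0])          # linear output layer

    x = np.linspace(-2, 2, 9)
    print(np.allclose(abs_via_relu(x), np.abs(x)))   # True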
Hanock Kwak, Byoung-Tak Zhang
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
GANs are generative models whose random samples realistically reflect
natural images. They can also generate samples with specific attributes by
concatenating a condition vector to the input, yet this direction has
not been well studied. We propose novel methods of conditioning generative
adversarial networks (GANs) that achieve state-of-the-art results on MNIST and
CIFAR-10. We mainly introduce two models: an information retrieving model that
extracts conditional information from the samples, and a spatial bilinear
pooling model that forms bilinear features derived from the spatial cross
product of an image and a condition vector. These methods significantly enhance
log-likelihood of test data under the conditional distributions compared to the
methods of concatenation.
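A sketch of what spatial bilinear pooling between a feature map and a condition vector could look like (the tensor shapes and where this feature feeds into the generator or discriminator are assumptions): at every spatial location, the outer product of the local feature vector and the condition vector is formed.

    import numpy as np

    def spatial_bilinear_pooling(feat, cond):
        """feat: (C, H, W) feature map; cond: (K,) condition vector (e.g., one-hot class).
        Returns a (C*K, H, W) map of per-location outer products feat[:, i, j] x cond."""
        C, H, W = feat.shape
        K = cond.shape[0]
        outer = np.einsum('chw,k->ckhw', feat, cond)      # spatial cross product
        return outer.reshape(C * K, H, W)

    feat = np.random.randn(32, 8, 8)
    cond = np.eye(10)[3]                                   # condition: class 3 as one-hot
    print(spatial_bilinear_pooling(feat, cond).shape)      # (320, 8, 8)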
Miltiadis Allamanis, Pankajan Chanthirasegaran, Pushmeet Kohli, Charles Sutton
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI)
The question of how procedural knowledge is represented and inferred is a
fundamental problem in machine learning and artificial intelligence. Recent
work on program induction has proposed neural architectures, based on
abstractions like stacks, Turing machines, and interpreters, that operate on
abstract computational machines or on execution traces. But the recursive
abstraction that is central to procedural knowledge is perhaps most naturally
represented by symbolic representations that have syntactic structure, such as
logical expressions and source code. Combining abstract, symbolic reasoning
with continuous neural reasoning is a grand challenge of representation
learning. As a step in this direction, we propose a new architecture, called
neural equivalence networks, for the problem of learning continuous semantic
representations of mathematical and logical expressions. These networks are
trained to represent semantic equivalence, even of expressions that are
syntactically very different. The challenge is that semantic representations
must be computed in a syntax-directed manner, because semantics is
compositional, but at the same time, small changes in syntax can lead to very
large changes in semantics, which can be difficult for continuous neural
architectures. We perform an exhaustive evaluation on the task of checking
equivalence on a highly diverse class of symbolic algebraic and boolean
expression types, showing that our model significantly outperforms existing
architectures.
Jesse M Lingeman, Hong Yu
Comments: 12 pages, 1 figure
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL); Learning (cs.LG); Social and Information Networks (cs.SI)
Finding related published articles is an important task in any science, but
with the explosion of new work in the biomedical domain it has become
especially challenging. Most existing methodologies use text similarity metrics
to identify whether two articles are related or not. However, biomedical
knowledge discovery is hypothesis-driven, and the most related articles may not
be the ones with the highest text similarity. In this study, we first develop an
innovative crowd-sourcing approach to build an expert-annotated
document-ranking corpus. Using this corpus as the gold standard, we then
evaluate the approaches of using text similarity to rank the relatedness of
articles. Finally, we develop and evaluate a new supervised model to
automatically rank related scientific articles. Our results show that authors’
rankings differ significantly from rankings by text-similarity-based models. By
training a learning-to-rank model on a subset of the annotated corpus, we found
the best supervised learning-to-rank model (SVM-Rank) significantly surpassed
state-of-the-art baseline systems.
Deepika, Ashutosh Dixit
Comments: 6 Pages, 5 figures, 1 Table, International Conference on Recent Trends in Computer and Information Technology Research September 2015
Subjects: Information Retrieval (cs.IR)
With the increase in the size of the web, information is also spreading at a
large scale. Search engines are the medium for accessing this information, and
the crawler is the search-engine module responsible for downloading web pages.
In order to download fresh information and keep the database rich, the crawler
should crawl the web in some order; this is called URL ordering. URL ordering
should be done in an efficient and effective manner so that the web is crawled
proficiently. In this paper, a survey of some existing methods of URL ordering
is presented, and the paper concludes with a comparison among them.
Avrim Blum, Nika Haghtalab
Subjects: Learning (cs.LG); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
Recently there has been significant activity in developing algorithms with
provable guarantees for topic modeling. In standard topic models, a topic (such
as sports, business, or politics) is viewed as a probability distribution
\(\vec{a}_i\) over words, and a document is generated by first selecting a mixture
\(\vec{w}\) over topics, and then generating words i.i.d. from the associated mixture
\(A\vec{w}\). Given a large collection of such documents, the goal is to recover
the topic vectors and then to correctly classify new documents according to
their topic mixture.
In this work we consider a broad generalization of this framework in which
words are no longer assumed to be drawn i.i.d. and instead a topic is a complex
distribution over sequences of paragraphs. Since one could not hope to even
represent such a distribution in general (even if paragraphs are given using
some natural feature representation), we aim instead to directly learn a
document classifier. That is, we aim to learn a predictor that given a new
document, accurately predicts its topic mixture, without learning the
distributions explicitly. We present several natural conditions under which one
can do this efficiently and discuss issues such as noise tolerance and sample
complexity in this model. More generally, our model can be viewed as a
generalization of the multi-view or co-training setting in machine learning.
Roee Aharoni, Yoav Goldberg
Comments: Under review as a conference paper at ICLR 2017
Subjects: Computation and Language (cs.CL)
We present a supervised sequence-to-sequence transduction model with a hard
attention mechanism which combines the more traditional statistical alignment
methods with the power of recurrent neural networks. We evaluate the model on
the task of morphological inflection generation and show that it provides state
of the art results in various setups compared to the previous neural and
non-neural approaches. Finally, we present an analysis of the learned
representations for both hard and soft attention models, shedding light on the
features such models extract in order to solve the task.
Kenton Lee, Tom Kwiatkowski, Ankur Parikh, Dipanjan Das
Subjects: Computation and Language (cs.CL)
The reading comprehension task, in which questions are asked about a given evidence
document, is a central problem in natural language understanding. Recent
formulations of this task have typically focused on answer selection from a set
of candidates pre-defined manually or through the use of an external NLP
pipeline. However, Rajpurkar et al. (2016) recently released the SQuAD dataset
in which the answers can be arbitrary strings from the supplied text. In this
paper, we focus on this answer extraction task, presenting a novel model
architecture that efficiently builds fixed length representations of all spans
in the evidence document with a recurrent network. We show that scoring
explicit span representations significantly improves performance over other
approaches that factor the prediction into separate predictions about words or
start and end markers. Our approach improves upon the best published results of
Wang & Jiang (2016) by 5% and decreases the error of Rajpurkar et al.’s
baseline by > 50%.
Tal Linzen, Emmanuel Dupoux, Yoav Goldberg
Comments: 15 pages; to appear in Transactions of the Association for Computational Linguistics
Subjects: Computation and Language (cs.CL)
The success of long short-term memory (LSTM) neural networks in language
processing is typically attributed to their ability to capture long-distance
statistical regularities. Linguistic regularities are often sensitive to
syntactic structure; can such dependencies be captured by LSTMs, which do not
have explicit structural representations? We begin addressing this question
using number agreement in English subject-verb dependencies. We probe the
architecture’s grammatical competence both using training objectives with an
explicit grammatical target (number prediction, grammaticality judgments) and
using language models. In the strongly supervised settings, the LSTM achieved
very high overall accuracy (less than 1% errors), but errors increased when
sequential and structural information conflicted. The frequency of such errors
rose sharply in the language-modeling setting. We conclude that LSTMs can
capture a non-trivial amount of grammatical structure given targeted
supervision, but stronger architectures may be required to further reduce
errors; furthermore, the language modeling signal is insufficient for capturing
syntax-sensitive dependencies, and should be supplemented with more direct
supervision if such dependencies need to be captured.
Mohit Iyyer, Wen-tau Yih, Ming-Wei Chang
Subjects: Computation and Language (cs.CL)
Recent work in semantic parsing for question answering has focused on long
and complicated questions, many of which would seem unnatural if asked in a
normal conversation between two humans. In an effort to explore a
conversational QA setting, we present a more realistic task: answering
sequences of simple but inter-related questions. We collect a dataset of 6,066
question sequences that inquire about semi-structured tables from Wikipedia,
with 17,553 question-answer pairs in total. Existing QA systems face two major
problems when evaluated on our dataset: (1) handling questions that contain
coreferences to previous questions or answers, and (2) matching words or
phrases in a question to corresponding entries in the associated table. We
conclude by proposing strategies to handle both of these issues.
Hakan Inan, Khashayar Khosravi, Richard Socher
Comments: 10 pages, 2 figures, 3 tables
Subjects: Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Recurrent neural networks have been very successful at predicting sequences
of words in tasks such as language modeling. However, all such models are based
on the conventional classification framework, where the model is trained against
one-hot targets, and each word is represented both as an input and as an output
in isolation. This causes inefficiencies in learning both in terms of utilizing
all of the information and in terms of the number of parameters needed to
train. We introduce a novel theoretical framework that facilitates better
learning in language modeling, and show that our framework leads to tying
together the input embedding and the output projection matrices, greatly
reducing the number of trainable variables. Our LSTM model lowers the state of
the art word-level perplexity on the Penn Treebank to 68.5.
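The tying described above amounts to reusing one matrix for both the input embedding and the output projection; a minimal numpy sketch (the vocabulary and embedding sizes are illustrative):

    import numpy as np

    vocab_size, embed_dim = 10000, 256
    E = np.random.randn(vocab_size, embed_dim) * 0.01    # single shared matrix

    def embed(token_ids):
        return E[token_ids]                                # input embedding lookup

    def output_logits(hidden):
        # The output projection reuses E (transposed) instead of a separate softmax
        # matrix, removing vocab_size * embed_dim trainable parameters.
        return hidden @ E.T

    h = np.random.randn(32, embed_dim)                     # e.g., LSTM hidden states
    print(output_logits(h).shape)                          # (32, 10000)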
Jesse M Lingeman, Hong Yu
Comments: 12 pages, 1 figure
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL); Learning (cs.LG); Social and Information Networks (cs.SI)
Finding related published articles is an important task in any science, but
with the explosion of new work in the biomedical domain it has become
especially challenging. Most existing methodologies use text similarity metrics
to identify whether two articles are related or not. However, biomedical
knowledge discovery is hypothesis-driven, and the most related articles may not
be the ones with the highest text similarity. In this study, we first develop an
innovative crowd-sourcing approach to build an expert-annotated
document-ranking corpus. Using this corpus as the gold standard, we then
evaluate the approaches of using text similarity to rank the relatedness of
articles. Finally, we develop and evaluate a new supervised model to
automatically rank related scientific articles. Our results show that authors’
rankings differ significantly from rankings by text-similarity-based models. By
training a learning-to-rank model on a subset of the annotated corpus, we found
the best supervised learning-to-rank model (SVM-Rank) significantly surpassed
state-of-the-art baseline systems.
Avrim Blum, Nika Haghtalab
Subjects: Learning (cs.LG); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
Recently there has been significant activity in developing algorithms with
provable guarantees for topic modeling. In standard topic models, a topic (such
as sports, business, or politics) is viewed as a probability distribution
\(\vec{a}_i\) over words, and a document is generated by first selecting a mixture
\(\vec{w}\) over topics, and then generating words i.i.d. from the associated mixture
\(A\vec{w}\). Given a large collection of such documents, the goal is to recover
the topic vectors and then to correctly classify new documents according to
their topic mixture.
In this work we consider a broad generalization of this framework in which
words are no longer assumed to be drawn i.i.d. and instead a topic is a complex
distribution over sequences of paragraphs. Since one could not hope to even
represent such a distribution in general (even if paragraphs are given using
some natural feature representation), we aim instead to directly learn a
document classifier. That is, we aim to learn a predictor that given a new
document, accurately predicts its topic mixture, without learning the
distributions explicitly. We present several natural conditions under which one
can do this efficiently and discuss issues such as noise tolerance and sample
complexity in this model. More generally, our model can be viewed as a
generalization of the multi-view or co-training setting in machine learning.
Gabriele D'Angelo, Stefano Ferretti, Vittorio Ghini
Comments: To appear in Simulation Modelling Practice and Theory, Elsevier
Subjects: Performance (cs.PF); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
In this paper, a methodology is presented and employed for simulating the
Internet of Things (IoT). The requirement for scalability, due to the possibly
huge number of sensors and devices involved, and the heterogeneous scenarios
that might occur, impose resorting to sophisticated modeling and simulation
techniques. In particular, multi-level simulation is regarded as a main
framework that allows simulating large-scale IoT environments while keeping
high levels of detail, when it is needed. We consider a use case based on the
deployment of smart services in decentralized territories. A two-level
simulator is employed, which is based on a coarse-grained, agent-based, adaptive
parallel and distributed simulation approach to model the general behavior of
simulated entities. However, when needed, a finer-grained simulator (based on
OMNeT++) is triggered on a restricted portion of the simulated area, which
allows all issues related to wireless communications to be considered. Based on
this use case, it is confirmed that ad-hoc wireless networking technologies
represent a principal tool for deploying smart services over decentralized
rural areas. Moreover, the performance evaluation confirms the viability of
utilizing multi-level simulation for simulating large scale IoT environments.
Jayanth Koushik, Hiroaki Hayashi
Comments: ICLR 2017 Submission
Subjects: Learning (cs.LG)
In this paper we propose a simple and efficient method for improving
stochastic gradient descent methods by using feedback from the objective
function. The method tracks the relative changes in the objective function with
a running average, and uses it to adaptively tune the learning rate in
stochastic gradient descent. We specifically apply this idea to modify Adam, a
popular algorithm for training deep neural networks. We conduct experiments to
compare the resulting algorithm, which we call Eve, with state of the art
methods used for training deep learning models. We train CNNs for image
classification, and RNNs for language modeling and question answering. Our
experiments show that Eve outperforms all other algorithms on these benchmark
tasks. We also analyze the behavior of the feedback mechanism during the
training process.
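A rough sketch of the feedback idea (the exact smoothing, clipping, and constants used in Eve are in the paper, not the abstract, so the update form below is an assumption): a running estimate of the relative change of the objective scales the learning rate down when the objective is changing rapidly.

    def feedback_lr(base_lr, f_prev, f_curr, d_prev, beta=0.999, clip=10.0):
        """Eve-like feedback: track a smoothed relative change of the objective and
        divide the learning rate by it (the constants here are illustrative)."""
        rel = abs(f_curr - f_prev) / max(min(f_curr, f_prev), 1e-12)
        rel = min(max(rel, 1.0 / clip), clip)          # keep the ratio in a sane range
        d = beta * d_prev + (1.0 - beta) * rel         # running average of relative change
        return base_lr / d, d

    # Usage inside a training loop (d starts at 1.0, f_prev at the initial loss):
    lr, d = 1e-3, 1.0
    f_prev = 2.30
    for f_curr in [2.10, 1.95, 1.90]:                  # example loss values
        step_lr, d = feedback_lr(lr, f_prev, f_curr, d)
        f_prev = f_curr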
Akosua Busia, Jasmine Collins, Navdeep Jaitly
Comments: 10 pages, 2 figures, submitted to RECOMB 2017
Subjects: Learning (cs.LG); Biomolecules (q-bio.BM)
Recently developed deep learning techniques have significantly improved the
accuracy of various speech and image recognition systems. In this paper we
adapt some of these techniques for protein secondary structure prediction. We
first train a series of deep neural networks to predict eight-class secondary
structure labels given a protein’s amino acid sequence information and find
that using recent methods for regularization, such as dropout and weight-norm
constraining, leads to measurable gains in accuracy. We then adapt recent
convolutional neural network architectures--Inception, ResNet, and DenseNet
with Batch Normalization–to the problem of protein structure prediction. These
convolutional architectures make heavy use of multi-scale filter layers that
simultaneously compute features on several scales, and use residual connections
to prevent underfitting. Using a carefully modified version of these
architectures, we achieve state-of-the-art performance of 70.0% per amino acid
accuracy on the public CB513 benchmark dataset. Finally, we explore additions
from sequence-to-sequence learning, altering the model to make its predictions
conditioned on both the protein’s amino acid sequence and its past secondary
structure labels. We introduce a new method of ensembling such a conditional
model with our convolutional model, an approach which reaches 70.6% Q8 accuracy
on CB513. We argue that these results can be further refined for larger boosts
in prediction accuracy through more sophisticated attempts to control
overfitting of conditional models. We aim to release the code for these
experiments as part of the TensorFlow repository.
Raman Arora, Amitabh Basu, Poorya Mianjy, Anirbit Mukherjee
Subjects: Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (stat.ML)
In this paper we investigate the family of functions representable by deep
neural networks (DNN) with rectified linear units (ReLU). We give the
first-ever polynomial time (in the size of data) algorithm to train a ReLU DNN
with one hidden layer to global optimality. This follows from our complete
characterization of the ReLU DNN function class whereby we show that a
\(\mathbb{R}^n \to \mathbb{R}\) function is representable by a ReLU DNN if and
only if it is a continuous piecewise linear function. The main tool used to
prove this characterization is an elegant result from tropical geometry.
Further, for the \(n=1\) case, we show that a single hidden layer suffices to
express all piecewise linear functions, and we give tight bounds for the size
of such a ReLU DNN. We follow up with gap results showing that there is a
smoothly parameterized family of \(\mathbb{R} \to \mathbb{R}\) “hard” functions
that lead to an exponential blow-up in size, if the number of layers is
decreased by a small amount. An example consequence of our gap theorem is that
for every natural number \(N\), there exists a function representable by a ReLU
DNN with depth \(N^2+1\) and total size \(N^3\), such that any ReLU DNN with depth
at most \(N+1\) will require at least \(\frac{1}{2}N^{N+1}-1\) total nodes.
Finally, we construct a family of \(\mathbb{R}^n \to \mathbb{R}\) functions for
\(n \geq 2\) (also smoothly parameterized), whose number of affine pieces scales
exponentially with the dimension \(n\) at any fixed size and depth. To the best
of our knowledge, such a construction with exponential dependence on \(n\) has
not been achieved by previous families of “hard” functions in the neural nets
literature.
Hakan Inan, Khashayar Khosravi, Richard Socher
Comments: 10 pages, 2 figures, 3 tables
Subjects: Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Recurrent neural networks have been very successful at predicting sequences
of words in tasks such as language modeling. However, all such models are based
on the conventional classification framework, where the model is trained against
one-hot targets, and each word is represented both as an input and as an output
in isolation. This causes inefficiencies in learning both in terms of utilizing
all of the information and in terms of the number of parameters needed to
train. We introduce a novel theoretical framework that facilitates better
learning in language modeling, and show that our framework leads to tying
together the input embedding and the output projection matrices, greatly
reducing the number of trainable variables. Our LSTM model lowers the state of
the art word-level perplexity on the Penn Treebank to 68.5.
Asier Mujika
Subjects: Learning (cs.LG)
In recent years, model-free methods that use deep learning have achieved
great success in many different reinforcement learning environments. Most
successful approaches focus on solving a single task, while multi-task
reinforcement learning remains an open problem. In this paper, we present a
model-based approach to deep reinforcement learning which we use to solve
different tasks simultaneously. We show that our approach not only does not
degrade when learning multiple tasks but actually benefits from it. For our
model, we also present a new kind of recurrent neural network, inspired by
residual networks, that decouples memory from computation, allowing it to model
complex environments that do not require large amounts of memory. The code will
be released before ICLR 2017.
Dorina Thanou, Xiaowen Dong, Daniel Kressner, Pascal Frossard
Subjects: Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Effective information analysis generally boils down to properly identifying
the structure or geometry of the data, which is often represented by a graph.
In some applications, this structure may be partly determined by design
constraints or pre-determined sensing arrangements, like in road transportation
networks for example. In general though, the data structure is not readily
available and is difficult to define. In particular, the global
smoothness assumptions that most existing works adopt are often too
general and unable to properly capture localized properties of the data. In this
paper, we go beyond this classical data model and rather propose to represent
information as a sparse combination of localized functions that live on a data
structure represented by a graph. Based on this model, we focus on the problem
of inferring the connectivity that best explains the data samples at different
vertices of a graph that is a priori unknown. We concentrate on the case where
the observed data is actually the sum of heat diffusion processes, which is a
quite common model for data on networks or other irregular structures. We cast
a new graph learning problem and solve it with an efficient nonconvex
optimization algorithm. Experiments on both synthetic and real world data
finally illustrate the benefits of the proposed graph learning framework and
confirm that the data structure can be efficiently learned from data
observations only. We believe that our algorithm will help solve key
questions in diverse application domains such as social and biological network
analysis where it is crucial to unveil proper geometry for data understanding
and inference.
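In equation form, the signal model underlying this graph learning problem can be summarized as follows (the notation is ours; \(L\) denotes the graph Laplacian to be learned): each observed signal is a sparse combination of heat kernels localized at a few source vertices,

\[
\mathbf{x} \;\approx\; \mathcal{D}(L)\,\mathbf{h}
\;=\; \big[\, e^{-\tau_1 L} \;\; e^{-\tau_2 L} \;\cdots\; e^{-\tau_S L} \,\big]\,\mathbf{h},
\qquad \mathbf{h} \ \text{sparse},
\]

and both the Laplacian \(L\) (i.e., the connectivity) and the sparse coefficients \(\mathbf{h}\) are estimated from the observed signals by nonconvex optimization.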
Hanock Kwak, Byoung-Tak Zhang
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
GANs are generative models whose random samples realistically reflect
natural images. They can also generate samples with specific attributes by
concatenating a condition vector to the input, yet this direction has
not been well studied. We propose novel methods of conditioning generative
adversarial networks (GANs) that achieve state-of-the-art results on MNIST and
CIFAR-10. We mainly introduce two models: an information retrieving model that
extracts conditional information from the samples, and a spatial bilinear
pooling model that forms bilinear features derived from the spatial cross
product of an image and a condition vector. These methods significantly enhance
log-likelihood of test data under the conditional distributions compared to the
methods of concatenation.
Elad Hoffer, Nir Ailon
Subjects: Learning (cs.LG)
Deep networks are successfully used as classification models yielding
state-of-the-art results when trained on a large number of labeled samples.
These models, however, are usually much less suited for semi-supervised
problems because of their tendency to overfit easily when trained on small
amounts of data. In this work we explore a new training objective that targets
a semi-supervised regime with only a small subset of labeled data.
This criterion is based on a deep metric embedding over distance relations
within the set of labeled samples, together with constraints over the
embeddings of the unlabeled set. The final learned representations are
discriminative in Euclidean space, and hence can be used with subsequent
nearest-neighbor classification using the labeled samples.
Miltiadis Allamanis, Pankajan Chanthirasegaran, Pushmeet Kohli, Charles Sutton
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI)
The question of how procedural knowledge is represented and inferred is a
fundamental problem in machine learning and artificial intelligence. Recent
work on program induction has proposed neural architectures, based on
abstractions like stacks, Turing machines, and interpreters, that operate on
abstract computational machines or on execution traces. But the recursive
abstraction that is central to procedural knowledge is perhaps most naturally
represented by symbolic representations that have syntactic structure, such as
logical expressions and source code. Combining abstract, symbolic reasoning
with continuous neural reasoning is a grand challenge of representation
learning. As a step in this direction, we propose a new architecture, called
neural equivalence networks, for the problem of learning continuous semantic
representations of mathematical and logical expressions. These networks are
trained to represent semantic equivalence, even of expressions that are
syntactically very different. The challenge is that semantic representations
must be computed in a syntax-directed manner, because semantics is
compositional, but at the same time, small changes in syntax can lead to very
large changes in semantics, which can be difficult for continuous neural
architectures. We perform an exhaustive evaluation on the task of checking
equivalence on a highly diverse class of symbolic algebraic and boolean
expression types, showing that our model significantly outperforms existing
architectures.
Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, Tie-Yan Liu
Subjects: Learning (cs.LG)
Decision tree (and its extensions such as Gradient Boosting Decision Trees
and Random Forest) is a widely used machine learning algorithm, due to its
practical effectiveness and model interpretability. With the emergence of big
data, there is an increasing need to parallelize the training process of
decision tree. However, most existing attempts along this line suffer from high
communication costs. In this paper, we propose a new algorithm, called
Parallel Voting Decision Tree (PV-Tree), to tackle this challenge. After
partitioning the training data onto a number of (e.g., \(M\)) machines, this
algorithm performs both local voting and global voting in each iteration. For
local voting, the top-\(k\) attributes are selected from each machine according
to its local data. Then, the globally top-\(2k\) attributes are determined by
majority voting among these local candidates. Finally, the full-grained
histograms of the globally top-\(2k\) attributes are collected from local
machines in order to identify the best (most informative) attribute and its
split point. PV-Tree can achieve a very low communication cost (independent of
the total number of attributes) and thus can scale out very well. Furthermore,
theoretical analysis shows that this algorithm can learn a near optimal
decision tree, since it can find the best attribute with a large probability.
Our experiments on real-world datasets show that PV-Tree significantly
outperforms the existing parallel decision tree algorithms in the trade-off
between accuracy and efficiency.
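A schematic of the two-stage voting for a single split decision (the data layout, scoring function, and histogram handling are simplified assumptions; the real algorithm works inside a full tree-growing loop):

    from collections import Counter

    def pv_tree_select_attributes(local_scores_per_machine, k):
        """local_scores_per_machine: list (one per machine) of {attribute: local score}.
        Stage 1: each machine votes for its local top-k attributes.
        Stage 2: the 2k attributes with the most votes are kept; only their
        full-grained histograms are then merged across machines to pick the final split."""
        votes = Counter()
        for scores in local_scores_per_machine:
            top_k = sorted(scores, key=scores.get, reverse=True)[:k]
            votes.update(top_k)
        global_candidates = [attr for attr, _ in votes.most_common(2 * k)]
        return global_candidates       # histograms are collected only for these

    # Example with M = 3 machines and k = 2:
    machines = [
        {'a': 0.9, 'b': 0.8, 'c': 0.1, 'd': 0.2},
        {'a': 0.7, 'c': 0.6, 'b': 0.5, 'd': 0.1},
        {'b': 0.9, 'a': 0.85, 'd': 0.3, 'c': 0.2},
    ]
    print(pv_tree_select_attributes(machines, k=2))   # ['a', 'b', 'c']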
Hyo-Eun Kim, Sangheum Hwang, Kyunghyun Cho
Subjects: Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
The latent representation learned by multi-layered neural networks via
hierarchical feature abstraction enables the recent success of deep learning. Under
the deep learning framework, generalization performance highly depends on the
learned latent representation which is obtained from an appropriate training
scenario with a task-specific objective on a designed network model. In this
work, we propose a novel latent space modeling method to learn better latent
representation. We designed a neural network model based on the assumption that
good base representation can be attained by maximizing the total correlation
between the input, latent, and output variables. From the base model, we
introduce a semantic noise modeling method which enables class-conditional
perturbation on latent space to enhance the representational power of learned
latent feature. During training, latent vector representation can be
stochastically perturbed by a modeled class-conditional additive noise while
maintaining its original semantic feature. It implicitly brings the effect of
semantic augmentation on the latent space. The proposed model can be easily
learned by back-propagation with common gradient-based optimization algorithms.
Experimental results show that the proposed method helps to achieve performance
benefits against various previous approaches. We also provide the empirical
analyses for the proposed class-conditional perturbation process including
t-SNE visualization.
Avrim Blum, Nika Haghtalab
Subjects: Learning (cs.LG); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
Recently there has been significant activity in developing algorithms with
provable guarantees for topic modeling. In standard topic models, a topic (such
as sports, business, or politics) is viewed as a probability distribution
\(\vec{a}_i\) over words, and a document is generated by first selecting a mixture
\(\vec{w}\) over topics, and then generating words i.i.d. from the associated mixture
\(A\vec{w}\). Given a large collection of such documents, the goal is to recover
the topic vectors and then to correctly classify new documents according to
their topic mixture.
In this work we consider a broad generalization of this framework in which
words are no longer assumed to be drawn i.i.d. and instead a topic is a complex
distribution over sequences of paragraphs. Since one could not hope to even
represent such a distribution in general (even if paragraphs are given using
some natural feature representation), we aim instead to directly learn a
document classifier. That is, we aim to learn a predictor that given a new
document, accurately predicts its topic mixture, without learning the
distributions explicitly. We present several natural conditions under which one
can do this efficiently and discuss issues such as noise tolerance and sample
complexity in this model. More generally, our model can be viewed as a
generalization of the multi-view or co-training setting in machine learning.
Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, Nando de Freitas
Comments: 20 pages. Prepared for ICLR 2017
Subjects: Learning (cs.LG)
This paper presents an actor-critic deep reinforcement learning agent with
experience replay that is stable, sample efficient, and performs remarkably
well on challenging environments, including the discrete 57-game Atari domain
and several continuous control problems. To achieve this, the paper introduces
several innovations, including truncated importance sampling with bias
correction, stochastic dueling network architectures, and a new trust region
policy optimization method.
Zachary C. Lipton, Jianfeng Gao, Lihong Li, Jianshu Chen, Li Deng
Subjects: Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
To use deep reinforcement learning in the wild, we might hope for an agent
that would never make catastrophic mistakes. At the very least, we could hope
that an agent would eventually learn to avoid old mistakes. Unfortunately, even
in simple environments, modern deep reinforcement learning techniques are
doomed by a Sisyphean curse. Owing to the use of function approximation, these
agents eventually forget experiences as they become exceedingly unlikely under
a new policy. Consequently, for as long as they continue to train,
state-aggregating agents may periodically relive catastrophic mistakes. We
demonstrate unacceptable performance of deep Q-networks on two toy problems. We
then introduce intrinsic fear, a method that mitigates these problems by
avoiding dangerous states.
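A hedged one-line sketch of how such a mitigation could enter the learning target: subtract a penalty proportional to an estimated danger probability from the usual Q-learning target (the danger estimator and coefficient below are our assumptions, not the paper's exact formulation).

    def fear_shaped_target(reward, q_next_max, danger_prob, gamma=0.99, fear_coef=1.0):
        # Standard Q-learning target, reduced in proportion to how likely the next
        # state is to lead to a catastrophe (danger_prob from a separate estimator).
        return reward + gamma * q_next_max - fear_coef * danger_prob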
Wei Xie, Yang Wang, Steven M. Boker, Donald E. Brown
Comments: 24 pages, 4 figures. Work done and circulated since 2015
Subjects: Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Safeguarding privacy in machine learning is highly desirable, especially in
collaborative studies across many organizations. Privacy-preserving distributed
machine learning (based on cryptography) is popular to solve the problem.
However, existing cryptographic protocols still incur excess computational
overhead. Here, we make the novel observation that this is partially due to the
naive adoption of mainstream numerical optimization (e.g., Newton's method) and
a failure to tailor it to secure computing. This work presents a contrasting perspective:
customizing numerical optimization specifically for secure settings. We propose
a seemingly less-favorable optimization method that can in fact significantly
accelerate privacy-preserving logistic regression. Leveraging this new method,
we propose two new secure protocols for conducting logistic regression in a
privacy-preserving and distributed manner. Extensive theoretical and empirical
evaluations demonstrate the competitive performance of our two secure protocols
without compromising accuracy or privacy: they achieve speedups of up to 2.3x
and 8.1x, respectively, over the state of the art, and become even faster as
data scales up. This drastic speedup is in addition to performance improvements
from existing (and future) state-of-the-art cryptography. Our work provides a
new path towards efficient and practical privacy-preserving logistic regression
for the large-scale studies that are common in modern science.
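One classic example of a "seemingly less-favorable" optimizer that suits secure computation is gradient ascent for logistic regression under Bohning's fixed curvature bound H = X^T X / 4: the matrix is iteration-independent, so it is factored once and every step has the same simple computation pattern. This is only an illustration of the general idea, not necessarily the protocol proposed in the paper.

    import numpy as np

    def fixed_hessian_logreg(X, y, iters=50):
        # Logistic regression via the bounded-curvature (MM) update: the fixed
        # matrix H upper-bounds the negative Hessian, so the same linear system
        # is solved at every iteration. Plaintext, illustrative sketch only.
        n, d = X.shape
        H = X.T @ X / 4.0 + 1e-6 * np.eye(d)
        beta = np.zeros(d)
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-X @ beta))
            grad = X.T @ (y - p)                 # gradient of the log-likelihood, y in {0, 1}
            beta += np.linalg.solve(H, grad)
        return beta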
Krzysztof Chalupka, Frederick Eberhardt, Pietro Perona
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Learning (cs.LG)
We propose a method to classify the causal relationship between two discrete
variables given only the joint distribution of the variables, acknowledging
that the method is subject to an inherent baseline error. We assume that the
causal system is acyclic, but we do allow for hidden common causes. Our
algorithm presupposes that the probability distribution (P(C)) of a cause (C)
is independent of the probability distribution (P(E mid C)) of the
cause-effect mechanism. While our classifier is trained with a Bayesian
assumption of flat hyperpriors, we do not make this assumption about our test
data. This work connects to recent developments on the identifiability of
causal models over continuous variables under the assumption of “independent
mechanisms”. Carefully-commented Python notebooks that reproduce all our
experiments are available online at
this http URL
Wentao Huang, Kechen Zhang
Subjects: Information Theory (cs.IT); Learning (cs.LG)
Information theory is a powerful tool for neuroscience and other disciplines.
Efficient calculation of Shannon’s mutual information (MI) is a key
computational step that often presents the biggest difficulty for practical
applications. In this paper, we propose effective approximation methods for
evaluating MI in the context of neural population coding, especially for
high-dimensional inputs. We prove several information-theoretic asymptotic
bounds and approximation formulas for large neural populations. We also
discuss how optimization of neural population coding based on these
approximation formulas leads to a convex problem over the population density
distribution in the neural population parameter space. Several useful techniques,
including variable transformation and dimensionality reduction, are proposed
for more efficient computation of the approximations. Our numerical simulation
results show that our asymptotic formulas are highly accurate for approximating
MI in neural populations. For some special cases, the approximations by our
formulas are exactly equal to the true MI. The approximation methods for MI may
have a wide range of applications in various disciplines.
Jesse M Lingeman, Hong Yu
Comments: 12 pages, 1 figure
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL); Learning (cs.LG); Social and Information Networks (cs.SI)
Finding related published articles is an important task in any science, but
with the explosion of new work in the biomedical domain it has become
especially challenging. Most existing methodologies use text similarity metrics
to identify whether two articles are related. However, biomedical
knowledge discovery is hypothesis-driven: the most related articles may not be
the ones with the highest text similarity. In this study, we first develop an
innovative crowd-sourcing approach to build an expert-annotated
document-ranking corpus. Using this corpus as the gold standard, we then
evaluate the approaches of using text similarity to rank the relatedness of
articles. Finally, we develop and evaluate a new supervised model to
automatically rank related scientific articles. Our results show that authors’
rankings differ significantly from those of text-similarity-based models. By
training a learning-to-rank model on a subset of the annotated corpus, we found
the best supervised learning-to-rank model (SVM-Rank) significantly surpassed
state-of-the-art baseline systems.
Alessandro Achille, Stefano Soatto
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Computation (stat.CO)
We introduce Information Dropout, a generalization of dropout that is
motivated by the Information Bottleneck principle and highlights the way in
which injecting noise in the activations can help in learning optimal
representations of the data. Information Dropout is rooted in
information-theoretic principles: it includes several existing dropout methods,
such as Gaussian Dropout and Variational Dropout, as special cases, and, unlike
classical dropout, it can learn and build representations that are invariant to
nuisances of the data, like occlusions and clutter. When the task is the
reconstruction of the input, we show that Information Dropout yields a variational
autoencoder as a special case, thus providing a link between representation
learning, information theory and variational inference. Our experiments
validate the theoretical intuitions behind our method, and we find that
information dropout achieves a comparable or better generalization performance
than binary dropout, especially on smaller models, since it can automatically
adapt the noise to the structure of the network, as well as to the test sample.
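A simplified NumPy sketch of the noise injection described above (the KL regularizer that keeps the learned noise in check is omitted, and the parameterization is our assumption):

    import numpy as np

    def information_dropout(activations, log_alpha, rng=np.random.default_rng()):
        # Multiplicative log-normal noise whose log standard deviation, log_alpha,
        # is produced by the network itself, so the amount of information passed
        # about the input is learned per unit rather than fixed in advance.
        eps = np.exp(rng.normal(0.0, np.exp(log_alpha)))
        return activations * eps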
Pedro H. P. Savarese
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)
We propose a technique to augment network layers by adding a linear gating
mechanism, which provides a way to learn identity mappings by optimizing only
one parameter. We also introduce a new metric that serves as the basis for the
technique: it captures the difficulty involved in learning identity mappings
for different types of network models, and provides new theoretical intuition
for the increased depths of models such as Highway and Residual Networks. We
propose a new model, the Gated Residual Network, which is the result when
augmenting Residual Networks. Experimental results show that augmenting layers
yields increased performance, fewer issues with depth, and more layer
independence: fully removing a layer does not cripple the model. We evaluate our
method on MNIST using fully-connected networks and on CIFAR-10 using Wide
ResNets, achieving a relative error reduction of more than 8% in the latter
when compared to the original model.
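A minimal sketch of the kind of one-parameter linear gate described above (the exact gate form in the paper may differ): with a single scalar k, driving k to zero collapses the block to the identity mapping.

    def gated_residual(x, layer_fn, k):
        # k = 0 gives the identity mapping; k = 1 gives the plain layer output.
        # Learning an identity mapping therefore requires optimizing one parameter.
        return k * layer_fn(x) + (1.0 - k) * x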
Seiya Tokui, Issei Sato
Subjects: Machine Learning (stat.ML); Learning (cs.LG)
Low-variance gradient estimation is crucial for learning directed graphical
models parameterized by neural networks, where the reparameterization trick is
widely used for those with continuous variables. While this technique gives
low-variance gradient estimates, it has not been directly applicable to
discrete variables, the sampling of which inherently requires discontinuous
operations. We argue that the discontinuity can be bypassed by marginalizing
out the variable of interest, which results in a new reparameterization trick
for discrete variables. This reparameterization greatly reduces the variance,
which is understood by regarding the method as an application of common random
numbers to the estimation. The resulting estimator is theoretically guaranteed
to have a variance not larger than that of the likelihood-ratio method with the
optimal input-dependent baseline. We give empirical results for variational
learning of sigmoid belief networks.
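A toy illustration of the marginalization idea for a single Bernoulli variable (not the paper's full estimator for networks of discrete units): marginalizing the variable makes the expectation a differentiable function of its probability, whereas the likelihood-ratio estimator must average over noisy samples.

    import numpy as np

    def marginalized_gradient(p, f):
        # E_{z~Bernoulli(p)}[f(z)] = p*f(1) + (1-p)*f(0), so the exact gradient
        # with respect to p is f(1) - f(0): no sampling, zero variance.
        return f(1) - f(0)

    def likelihood_ratio_gradient(p, f, num_samples=1000, rng=np.random.default_rng(0)):
        # REINFORCE / likelihood-ratio estimate of the same gradient, for comparison.
        z = rng.binomial(1, p, size=num_samples)
        score = np.where(z == 1, 1.0 / p, -1.0 / (1.0 - p))   # d/dp log P(z)
        return np.mean(np.array([f(zi) for zi in z]) * score)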
Alexey Kurakin, Ian Goodfellow, Samy Bengio
Comments: 15 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Learning (cs.LG); Machine Learning (stat.ML)
Adversarial examples are malicious inputs designed to fool machine learning
models. They often transfer from one model to another, allowing attackers to
mount black box attacks without knowledge of the target model’s parameters.
Adversarial training is the process of explicitly training a model on
adversarial examples, in order to make it more robust to attack or to reduce
its test error on clean inputs. So far, adversarial training has primarily been
applied to small problems. In this research, we apply adversarial training to
ImageNet. Our contributions include: (1) recommendations for how to successfully
scale adversarial training to large models and datasets, (2) the observation
that adversarial training confers robustness to single-step attack methods, (3)
the finding that multi-step attack methods are somewhat less transferable than
single-step attack methods, so single-step attacks are the best for mounting
black-box attacks, and (4) resolution of a “label leaking” effect that causes
adversarially trained models to perform better on adversarial examples than on
clean examples, because the adversarial example construction process uses the
true label and the model can learn to exploit regularities in the construction
process.
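For concreteness, the single-step (fast-gradient-sign) perturbation referred to above can be written in one line once the input gradient is available (the epsilon value and pixel range below are assumptions):

    import numpy as np

    def fgsm_perturb(x, grad_wrt_x, epsilon=8.0 / 255.0):
        # Move every pixel by epsilon in the direction that increases the loss,
        # then clip back to the valid image range.
        return np.clip(x + epsilon * np.sign(grad_wrt_x), 0.0, 1.0)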
Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, Jascha Sohl-Dickstein
Subjects: Machine Learning (stat.ML); Learning (cs.LG)
We study the behavior of untrained neural networks whose weights and biases
are randomly distributed using mean field theory. We show the existence of
depth scales that naturally limit the maximum depth of signal propagation
through these random networks. Our main practical result is to show that random
networks may be trained precisely when information can travel through them.
Thus, the depth scales that we identify provide bounds on how deep a network
may be trained for a specific choice of hyperparameters. As a corollary to
this, we argue that in networks at the edge of chaos, one of these depth scales
diverges. Thus, arbitrarily deep networks may be trained only sufficiently close
to criticality. We show that the presence of dropout destroys the
order-to-chaos critical point and therefore strongly limits the maximum
trainable depth for random networks. Finally, we develop a mean field theory
for backpropagation and we show that the ordered and chaotic phases correspond
to regions of vanishing and exploding gradient respectively.
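The depth scales come from iterating simple mean-field maps for the pre-activation statistics of a wide random network. Below is a Monte Carlo sketch of the variance map for a tanh network with assumed hyperparameters; the analogous map for correlations between two inputs yields the depth scale that diverges at the edge of chaos.

    import numpy as np

    def variance_map(q, sigma_w=1.5, sigma_b=0.05, num_samples=100000,
                     rng=np.random.default_rng(0)):
        # One step of q_{l+1} = sigma_w^2 * E[tanh(sqrt(q_l) z)^2] + sigma_b^2,
        # with z ~ N(0, 1), estimated by Monte Carlo.
        z = rng.standard_normal(num_samples)
        return sigma_w ** 2 * np.mean(np.tanh(np.sqrt(q) * z) ** 2) + sigma_b ** 2

    q = 1.0
    for _ in range(30):      # iterate the map towards its fixed point q*
        q = variance_map(q)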
Igor C. Oliveira, Rahul Santhanam
Subjects: Computational Complexity (cs.CC); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Learning (cs.LG)
We prove several results giving new and stronger connections between
learning, circuit lower bounds and pseudorandomness. Among other results, we
show a generic learning speedup lemma, equivalences between various learning
models in the exponential time and subexponential time regimes, a dichotomy
between learning and pseudorandomness, consequences of non-trivial learning for
circuit lower bounds, Karp-Lipton theorems for probabilistic exponential time,
and NC(^1)-hardness for the Minimum Circuit Size Problem.
Sihan Li, Jiantao Jiao, Yanjun Han, Tsachy Weissman
Subjects: Neural and Evolutionary Computing (cs.NE); Learning (cs.LG); Machine Learning (stat.ML)
We provide a theoretical explanation for the superb performance of ResNet via
the study of deep linear networks and some nonlinear variants. We show that
with or without nonlinearities, by adding shortcuts that have depth two, the
condition number of the Hessian of the loss function at the zero initial point
is depth-invariant, which makes training very deep models no more difficult
than shallow ones. Shortcuts of higher depth result in an extremely flat
(high-order) stationary point at initialization, from which the optimization
algorithm struggles to escape. The 1-shortcut, however, is essentially equivalent to no
shortcuts. Extensive experiments are provided accompanying our theoretical
results. We show that initializing the network to small weights with
2-shortcuts achieves significantly better results than random Gaussian (Xavier)
initialization, orthogonal initialization, and deeper shortcuts, from
perspectives ranging from the final loss, learning dynamics, and stability
to the behavior of the Hessian over the course of training.
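The depth-two shortcut analyzed above is simply x -> x + W2 W1 x; a sketch of the linear case, in our own notation:

    import numpy as np

    def two_shortcut_block(x, W1, W2):
        # Residual block with a shortcut of depth two. With W1, W2 initialized
        # near zero the block starts close to the identity map, which is where
        # the Hessian's condition number is shown to be depth-invariant.
        return x + W2 @ (W1 @ x)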
Laura Luzzi, Roope Vehkalahti, Cong Ling
Comments: 31 pages, 1 figure
Subjects: Information Theory (cs.IT); Number Theory (math.NT)
Despite several works on secrecy coding for fading and MIMO wiretap channels
from an error probability perspective, the construction of information
theoretically secure codes over such channels remains an open problem. In
this paper, we consider a fading wiretap channel model where the transmitter
has only partial statistical channel state information. We extend the flatness
factor criterion from the Gaussian wiretap channel to fading and MIMO wiretap
channels, and propose concrete lattice codes with a vanishing flatness factor
to achieve information theoretic security. These codes are built from algebraic
number fields with constant root discriminant for the single-antenna fading
wiretap channel, and from division algebras centered at such number fields for
the MIMO wiretap channel, respectively. The proposed lattice codes achieve
strong secrecy and semantic security over any ergodic stationary fading/MIMO
wiretap channel with sufficiently fast decay of time correlations, for all
secrecy rates (R < C_b-C_e -kappa), where (C_b) and (C_e) are Bob and Eve’s
channel capacities respectively, and (kappa) is an explicit constant gap.
Moreover, these codes are almost universal in the sense that a fixed code is
good for secrecy for a wide range of fading models.
Philip Schniter, Sundeep Rangan, Alyson Fletcher
Subjects: Information Theory (cs.IT)
The denoising-based approximate message passing (D-AMP) methodology, recently
proposed by Metzler, Maleki, and Baraniuk, allows one to plug in sophisticated
denoisers like BM3D into the AMP algorithm to achieve state-of-the-art
compressive image recovery. But AMP diverges with small deviations from the
i.i.d.-Gaussian assumption on the measurement matrix. Recently, the vector AMP
(VAMP) algorithm has been proposed to fix this problem. In this work, we show
that the benefits of VAMP extend to D-VAMP.
Christopher Mollén, Erik G. Larsson, Ulf Gustavsson, Thomas Eriksson, Robert W. Heath Jr
Subjects: Information Theory (cs.IT)
Co-existing wireless systems that share a common spectrum need to mitigate
out-of-band (OOB) radiation to avoid excessive interference. In legacy SISO
transmitters and small MIMO arrays, OOB radiation is well understood and is
commonly handled by digital compensation techniques. In large arrays, however,
new phenomena and hardware limitations have to be considered: First, signals
can be radiated directionally, which might focus the OOB radiation. Second,
low-complexity hardware with poor linearity has to be used for cost reasons,
which increases the relative amount of OOB radiation. Given that massive MIMO
and millimeter wave communication rely on base stations with a huge number of
antennas, the spatial behavior of OOB radiation from large arrays will have
significant implications for future hardware requirements. We show that, if the
OOB radiation is beamformed, its array gain is never larger than that of the
in-band signal. In many cases, the OOB radiation is even close to isotropic
also when the in-band signal is highly directive. With the same total radiated
power, the OOB radiation from large arrays is therefore never more severe than
from a legacy system with the same adjacent-channel-leakage ratio. The OOB
radiation is even less detrimental than from a legacy system since the high
array gain of the in-band signal allows large arrays to radiate less total
power than legacy systems. We also show how OOB radiation from large arrays
varies with location in static propagation environments and how these effects
vanish when averaged over the small-scale fading. A main conclusion is that
the linearity requirement for large arrays can be relaxed compared to legacy
systems. Specifically, less stringent linearity requirements on each
transmitter make it possible to build large arrays from low-complexity
hardware.
Salman Beigi, Amin Gohari
Subjects: Information Theory (cs.IT)
A measure of correlation is said to have the tensorization property if it is
unchanged when computed for i.i.d. copies. More precisely, a measure of
correlation between two random variables ((X, Y)), denoted by (rho(X, Y)), has
the tensorization property if (rho(X^n, Y^n) = rho(X, Y)), where ((X^n, Y^n))
denotes (n) i.i.d. copies of ((X, Y)). Two well-known examples of such measures are the
maximal correlation and the hypercontractivity ribbon (HC ribbon). We show that
the maximal correlation and the HC ribbon are special cases of the (Phi)-ribbon,
defined in this paper for any function (Phi) from a class of convex functions
(the (Phi)-ribbon reduces to the HC ribbon and the maximal correlation for special
choices of (Phi)). Any (Phi)-ribbon is shown to be a measure of correlation
with the tensorization property. We show that the (Phi)-ribbon also
characterizes the (Phi)-strong data processing inequality constant introduced
by Raginsky. We further study the (Phi)-ribbon for the choice of (Phi(t)=t^2)
and introduce an equivalent characterization of this ribbon.
Aly El Gamal
Comments: submitted to IEEE International Conference on Communications (ICC 2017)
Subjects: Information Theory (cs.IT)
In this work, we study scenarios of wireless networks where using simple
Time-Division-Multiple-Access (TDMA) is optimal if no channel state information
is available at the transmitters (no CSIT). We consider single-hop locally
connected interference networks where each transmitter is connected to the
receiver that has the same index as well as L succeeding receivers. The
considered rate criterion is the per user Degrees of Freedom as the number of
transmitter-receiver pairs goes to infinity. For the case when L=1, it was
shown in previous work that TDMA is optimal, even if each message can be
available at multiple transmitters and linear cooperation schemes are allowed.
Here, we extend this conclusion to the case where L=2, by proving optimality of
TDMA in this case as well. We then study the problem for general values of L
without allowing for cooperative transmission. We prove optimality of TDMA when
each transmitter is serving the receiver with the same index, and finally
conjecture – based on an example – that the same conclusion applies even if
each receiver can be served by any single transmitter connected to it.
Marco Mondelli, S. Hamed Hassani, Ivana Marić, Dennis Hui, Song-Nam Hong
Comments: 6 pages, 2 figures, submitted to WCNC’17 workshop on polar coding
Subjects: Information Theory (cs.IT)
We present a rate-compatible polar coding scheme that achieves the capacity
of any family of channels. Our solution generalizes the previous results [1],
[2] that provide capacity-achieving rate-compatible polar codes for a degraded
family of channels. The motivation for our extension comes from the fact that
in many practical scenarios, e.g., MIMO systems and non-Gaussian interference,
the channels cannot be ordered by degradation. The main technical contribution
of this paper consists in removing the degradation condition. To do so, we
exploit the ideas coming from the construction of universal polar codes.
Our scheme possesses the usual attractive features of polar codes: low
complexity code construction, encoding, and decoding; super-polynomial scaling
of the error probability with the block length; and absence of error floors. On
the negative side, the scaling of the gap to capacity with the block length is
slower than in standard polar codes, and we prove an upper bound on the scaling
exponent.
Fernando Gama, Santiago Segarra, Alejandro Ribeiro
Comments: Submitted to IEEE Transactions on Signal and Information Processing Over Networks
Subjects: Social and Information Networks (cs.SI); Information Theory (cs.IT); Physics and Society (physics.soc-ph)
A novel method to obtain hierarchical and overlapping clusters from network
data, i.e., a set of nodes endowed with pairwise dissimilarities, is presented.
The introduced method is hierarchical in the sense that it outputs a nested
collection of groupings of the node set depending on the resolution or degree
of similarity desired, and it is overlapping since it allows nodes to belong to
more than one group. Our construction is rooted in the facts that a
hierarchical (non-overlapping) clustering of a network can be equivalently
represented by a finite ultrametric space and that a convex combination of
ultrametrics results in a cut metric. By applying a hierarchical
(non-overlapping) clustering method to multiple dithered versions of a given
network and then convexly combining the resulting ultrametrics, we obtain a cut
metric associated to the network of interest. We then show how to extract a
hierarchical overlapping clustering structure from the aforementioned cut
metric. Furthermore, the so-called overlapping function is presented as a tool
for gaining insights about the data by identifying meaningful resolutions of
the obtained hierarchical structure. Additionally, we explore hierarchical
overlapping quasi-clustering methods that preserve the asymmetry of the data
contained in directed networks. Finally, the presented method is illustrated
via synthetic and real-world classification problems including handwritten
digit classification and authorship attribution of famous plays.
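As a rough sketch of the construction described above (the linkage choice, dither level, and function names are our assumptions), one can cluster several dithered copies of the dissimilarity matrix, read off the induced ultrametrics as cophenetic distances, and average them; extracting the overlapping clusters from the resulting cut metric is the step not shown here.

    import numpy as np
    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import linkage, cophenet

    def dithered_cut_metric(D, num_dithers=20, noise=0.05, rng=np.random.default_rng(0)):
        # D: symmetric matrix of pairwise dissimilarities between nodes.
        ultrametrics = []
        for _ in range(num_dithers):
            D_dith = D * (1.0 + noise * rng.standard_normal(D.shape))
            D_dith = np.maximum((D_dith + D_dith.T) / 2.0, 0.0)   # keep symmetric, nonnegative
            np.fill_diagonal(D_dith, 0.0)
            Z = linkage(squareform(D_dith, checks=False), method='single')
            ultrametrics.append(cophenet(Z))                      # condensed ultrametric
        return squareform(np.mean(ultrametrics, axis=0))          # convex combination: a cut metric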
Riccardo Aragona, Marco Calderini, Antonio Tortora, Maria Tota
Subjects: Group Theory (math.GR); Cryptography and Security (cs.CR); Information Theory (cs.IT)
We provide two sufficient conditions to guarantee that the round functions of
a translation based cipher generate a primitive group. Furthermore, if a round
of the cipher is strongly proper and consists of m-bit S-Boxes, with m = 3, 4
or 5, we prove that such a group is the alternating group. As an immediate
consequence, we deduce that the round functions of some lightweight translation
based ciphers, such as the PRESENT cipher, generate the alternating group.
Wenjing Liao, Mauro Maggioni
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Statistics Theory (math.ST)
We consider the problem of efficiently approximating and encoding
high-dimensional data sampled from a probability distribution (rho) in
(mathbb{R}^D) that is nearly supported on a (d)-dimensional set (mathcal{M})
– for example supported on a (d)-dimensional Riemannian manifold. Geometric
Multi-Resolution Analysis (GMRA) provides a robust and computationally
efficient procedure to construct low-dimensional geometric approximations of
(mathcal{M}) at varying resolutions. We introduce a thresholding algorithm on
the geometric wavelet coefficients, leading to what we call adaptive GMRA
approximations. We show that these data-driven, empirical approximations
perform well, when the threshold is chosen as a suitable universal function of
the number of samples (n), on a wide variety of measures (rho) that are
allowed to exhibit different regularity at different scales and locations,
thereby efficiently encoding data from more complex measures than those
supported on manifolds. These approximations yield a data-driven dictionary,
together with a fast transform mapping data to coefficients, and an inverse of
such a map. The algorithms for both the dictionary construction and the
transforms have complexity (C n log n) with the constant linear in (D) and
exponential in (d). Our work therefore establishes adaptive GMRA as a fast
dictionary learning algorithm with approximation guarantees. We include several
numerical experiments on both synthetic and real data, confirming our
theoretical results and demonstrating the effectiveness of adaptive GMRA.