
    arXiv Paper Daily: Mon, 21 Nov 2016

    Published by 我爱机器学习 (52ml.net) on 2016-11-21 00:00:00

    Neural and Evolutionary Computing

    NoiseOut: A Simple Way to Prune Neural Networks

    Mohammad Babaeizadeh, Paris Smaragdis, Roy H. Campbell
    Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)

    Neural networks are usually over-parameterized with significant redundancy in
    the number of required neurons which results in unnecessary computation and
    memory usage at inference time. One common approach to address this issue is to
    prune these big networks by removing extra neurons and parameters while
    maintaining the accuracy. In this paper, we propose NoiseOut, a fully automated
    pruning algorithm based on the correlation between activations of neurons in
    the hidden layers. We prove that adding additional output neurons with entirely
    random targets results in a higher correlation between neurons, which makes
    pruning by NoiseOut even more efficient. Finally, we test our method on various
    networks and datasets. These experiments exhibit high pruning rates while
    maintaining the accuracy of the original network.
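
    As a rough illustration of correlation-driven pruning of the kind NoiseOut performs, the sketch below merges the two most correlated hidden units of a single fully connected layer. The weight-folding rule and function names are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def prune_most_correlated(acts, W_in, b_in, W_out):
    """Merge the pair of hidden units whose activations are most correlated.

    acts  : (num_samples, num_hidden) hidden activations on a batch of data
    W_in  : (num_inputs, num_hidden) incoming weights, b_in: (num_hidden,) biases
    W_out : (num_hidden, num_outputs) outgoing weights
    Returns copies of the weight matrices with one hidden unit removed.
    """
    corr = np.corrcoef(acts, rowvar=False)   # pairwise activation correlations
    np.fill_diagonal(corr, -np.inf)          # ignore self-correlation
    i, j = np.unravel_index(np.argmax(corr), corr.shape)
    keep, drop = min(i, j), max(i, j)
    W_out = W_out.copy()
    W_out[keep] += W_out[drop]               # fold the dropped unit's outgoing weights
    mask = np.arange(W_in.shape[1]) != drop
    return W_in[:, mask], b_in[mask], W_out[mask]
```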

    Swarm Intelligence for Multiobjective Optimization of Extraction Process

    T. Ganesan, I. Elamvazuthi, P. Vasant
    Journal-ref: Handbook of Research on Modern Optimization Algorithms and
    Applications in Engineering and Economics, (2016), IGI Global, pp 516 – 544
    Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)

    Multi-objective (MO) optimization is an emerging field that is increasingly
    being applied in many industries globally. In this work, the MO optimization
    of the extraction process of bioactive compounds from the Gardenia jasminoides
    Ellis fruit was solved. Three swarm-based algorithms were applied in
    conjunction with the normal-boundary intersection (NBI) method to solve this
    MO problem. The gravitational search algorithm (GSA) and the particle swarm
    optimization (PSO) technique were implemented in this work. In addition, a
    novel Hopfield-enhanced particle swarm optimization was developed and applied
    to the extraction problem. By measuring the levels of dominance, the optimality
    of the approximate Pareto frontiers produced by all the algorithms was gauged
    and compared. In addition, by measuring the level of convergence of the
    frontiers, some understanding of the structure of the objective space, in terms
    of its relation to the level of frontier dominance, is uncovered. Detailed
    comparative studies were conducted on all the algorithms employed and developed
    in this work.

    Generative Deep Neural Networks for Dialogue: A Short Review

    Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, Joelle Pineau
    Comments: 6 pages, 1 figure, 3 tables; NIPS 2016 workshop on Learning Methods for Dialogue
    Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

    Researchers have recently started investigating deep neural networks for
    dialogue applications. In particular, generative sequence-to-sequence (Seq2Seq)
    models have shown promising results for unstructured tasks, such as word-level
    dialogue response generation. The hope is that such models will be able to
    leverage massive amounts of data to learn meaningful natural language
    representations and response generation strategies, while requiring a minimum
    amount of domain knowledge and hand-crafting. An important challenge is to
    develop models that can effectively incorporate dialogue context and generate
    meaningful and diverse responses. In support of this goal, we review recently
    proposed models based on generative encoder-decoder neural network
    architectures, and show that these models have better ability to incorporate
    long-term dialogue history, to model uncertainty and ambiguity in dialogue, and
    to generate responses with high-level compositional structure.

    Visualizing and Understanding Curriculum Learning for Long Short-Term Memory Networks

    Volkan Cirik, Eduard Hovy, Louis-Philippe Morency
    Subjects: Computation and Language (cs.CL); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

    Curriculum Learning emphasizes the order of training instances in a
    computational learning setup. The core hypothesis is that simpler instances
    should be learned early as building blocks to learn more complex ones. Despite
    its usefulness, it is still unknown how exactly the internal representations of
    models are affected by curriculum learning. In this paper, we study the effect
    of curriculum learning on Long Short-Term Memory (LSTM) networks, which have
    shown strong competency in many Natural Language Processing (NLP) problems. Our
    experiments on a sentiment analysis task and a synthetic task similar to
    sequence prediction tasks in NLP show that curriculum learning has a positive
    effect on the LSTM’s internal states by biasing the model towards building
    constructive representations, i.e., the internal representations at previous
    timesteps are used as building blocks for the final prediction. We also find
    that smaller models improve significantly when they are trained with curriculum
    learning. Lastly, we show that curriculum learning helps more when the amount
    of training data is limited.
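
    For readers unfamiliar with the setup, the sketch below shows one common way to implement a curriculum: sort examples by a difficulty heuristic and train on progressively larger, harder subsets. The difficulty function, the staged schedule, and the `model.train_step` interface are placeholders, not the authors' protocol.

```python
import random

def staged_curriculum_training(model, examples, difficulty_fn,
                               num_stages=4, epochs_per_stage=1):
    """Train on progressively larger subsets of examples sorted from easy to hard."""
    ordered = sorted(examples, key=difficulty_fn)        # e.g., sentence length
    for stage in range(1, num_stages + 1):
        cutoff = int(len(ordered) * stage / num_stages)  # easiest `cutoff` examples
        subset = list(ordered[:cutoff])
        for _ in range(epochs_per_stage):
            random.shuffle(subset)
            for x, y in subset:
                model.train_step(x, y)                   # assumed training interface
    return model
```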

    An Empirical Study of Continuous Connectivity Degree Sequence Equivalents

    Daniel Moyer, Boris A. Gutman, Joshua Faskowitz, Neda Jahanshad, Paul M. Thompson
    Comments: Presented at The MICCAI-BACON 16 Workshop (this https URL)
    Subjects: Neurons and Cognition (q-bio.NC); Neural and Evolutionary Computing (cs.NE)

    In the present work we demonstrate the use of a parcellation free
    connectivity model based on Poisson point processes. This model produces for
    each subject a continuous bivariate intensity function that represents for
    every possible pair of points the relative rate at which we observe tracts
    terminating at those points. We fit this model to explore degree sequence
    equivalents for spatial continuum graphs, and to investigate the local
    differences between estimated intensity functions for two different
    tractography methods. This is a companion paper to Moyer et al. (2016), where
    the model was originally defined.

    Compacting Neural Network Classifiers via Dropout Training

    Yotaro Kubo, George Tucker, Simon Wiesler
    Comments: Accepted to NIPS Workshop on Efficient Methods for Deep Neural Networks
    Subjects: Machine Learning (stat.ML); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

    We introduce dropout compaction, a novel method for training feed-forward
    neural networks which realizes the performance gains of training a large model
    with dropout regularization, yet extracts a compact neural network for run-time
    efficiency. In the proposed method, we introduce a sparsity-inducing prior on
    the per unit dropout retention probability so that the optimizer can
    effectively prune hidden units during training. By changing the prior
    hyperparameters, we can control the size of the resulting network. We performed
    a systematic comparison of dropout compaction and competing methods on several
    real-world speech recognition tasks and found that dropout compaction achieved
    comparable accuracy with fewer than 50% of the hidden units, translating to a
    2.5x speedup in run-time.
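
    A minimal PyTorch-style sketch of the idea, assuming a single hidden layer: each unit carries a learnable retention probability, and a sparsity penalty on those probabilities pushes some units toward removal. The L1-style penalty stands in for the paper's sparsity-inducing prior, and the stochastic dropout mask is omitted for simplicity.

```python
import torch
import torch.nn as nn

class CompactingLayer(nn.Module):
    """Hidden layer whose units carry a learnable retention probability."""
    def __init__(self, n_in, n_hidden):
        super().__init__()
        self.linear = nn.Linear(n_in, n_hidden)
        self.keep_logit = nn.Parameter(torch.zeros(n_hidden))

    def forward(self, x):
        keep = torch.sigmoid(self.keep_logit)          # per-unit retention probability
        return torch.relu(self.linear(x)) * keep

    def sparsity_penalty(self):
        return torch.sigmoid(self.keep_logit).sum()    # add to the task loss

    def prunable_units(self, threshold=0.05):
        return (torch.sigmoid(self.keep_logit) < threshold).nonzero().flatten()
```

    In training one would minimize the task loss plus a weighted `sparsity_penalty()`, then drop the units reported by `prunable_units()` once training converges.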

    Sparsey: Event Recognition via Deep Hierarchical Sparse Distributed Codes

    Gerard J. Rinkus
    Comments: This is a manuscript form of a paper published in Frontiers in Computational Neuroscience in 2014 (this http URL). 65 pages, 28 figures, 8 tables
    Journal-ref: Frontiers in Computational Neuroscience, Vol. 8, Article 160
    (2014)
    Subjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

    Visual cortex’s hierarchical, multi-level organization is captured in many
    biologically inspired computational vision models, the general idea being that
    progressively larger scale, more complex spatiotemporal features are
    represented in progressively higher areas. However, most earlier models use
    localist representations (codes) in each representational field, which we
    equate with the cortical macrocolumn (mac), at each level. In localism, each
    represented feature/event (item) is coded by a single unit. Our model, Sparsey,
    is also hierarchical but crucially, uses sparse distributed coding (SDC) in
    every mac in all levels. In SDC, each represented item is coded by a small
    subset of the mac’s units. SDCs of different items can overlap and the size of
    overlap between items can represent their similarity. The difference between
    localism and SDC is crucial because SDC allows the two essential operations of
    associative memory, storing a new item and retrieving the best-matching stored
    item, to be done in fixed time for the life of the model. Since the model’s
    core algorithm, which does both storage and retrieval (inference), makes a
    single pass over all macs on each time step, the overall model’s
    storage/retrieval operation is also fixed-time, a criterion we consider
    essential for scalability to huge datasets. A 2010 paper described a
    nonhierarchical version of this model in the context of purely spatial pattern
    processing. Here, we elaborate a fully hierarchical model (arbitrary numbers of
    levels and macs per level), describing novel model principles like progressive
    critical periods, dynamic modulation of principal cells’ activation functions
    based on a mac-level familiarity measure, representation of multiple
    simultaneously active hypotheses, a novel method of time warp invariant
    recognition, and we report results showing learning/recognition of
    spatiotemporal patterns.


    Computer Vision and Pattern Recognition

    Ear Recognition: More Than a Survey

    Žiga Emeršič, Vitomir Štruc, Peter Peer
    Comments: 17 pages, paper accepted to Neurocomputing
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Automatic identity recognition from ear images represents an active field of
    research within the biometric community. The ability to capture ear images from
    a distance and in a covert manner makes the technology an appealing choice for
    surveillance and security applications as well as other application domains.
    Significant contributions have been made in the field over recent years, but
    open research problems still remain and hinder a wider (commercial) deployment
    of the technology. This paper presents an overview of the field of automatic
    ear recognition (from 2D images) and focuses specifically on the most recent,
    descriptor-based methods proposed in this area. Open challenges are discussed
    and potential research directions are outlined with the goal of providing the
    reader with a point of reference for issues worth examining in the future. In
    addition to a comprehensive review on ear recognition technology, the paper
    also introduces a new, fully unconstrained dataset of ear images gathered from
    the web and a toolbox implementing several state-of-the-art techniques for ear
    recognition. The dataset and toolbox are meant to address some of the open
    issues in the field and are made publicly available to the research community.

    Expert Gate: Lifelong Learning with a Network of Experts

    Rahaf Aljundi, Punarjay Chakravarty, Tinne Tuytelaars
    Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

    In this paper we introduce a model of lifelong learning, based on a Network
    of Experts. New tasks / experts are learned and added to the model
    sequentially, building on what was learned before. To ensure scalability of
    this process, data from previous tasks cannot be stored and hence is not
    available when learning a new task. A critical issue in such a context, not
    addressed in the literature so far, relates to the decision of which expert to
    deploy at test time. We introduce a gating autoencoder that learns a
    representation for the task at hand, and is used at test time to automatically
    forward the test sample to the relevant expert. This has the added advantage of
    being memory efficient as only one expert network has to be loaded into memory
    at any given time. Further, the autoencoders inherently capture the relatedness
    of one task to another, based on which the most relevant prior model to be used
    for training a new expert with fine-tuning or learning-without-forgetting can
    be selected. We evaluate our system on image classification and video
    prediction problems.
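
    The gating idea can be sketched in a few lines: keep one small undercomplete autoencoder per task and, at test time, route the sample to the expert whose autoencoder reconstructs it with the lowest error. The layer sizes and the hard argmin routing below are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class TaskAutoencoder(nn.Module):
    """Small undercomplete autoencoder trained on one task's data."""
    def __init__(self, dim_in, dim_code=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_code), nn.ReLU())
        self.decoder = nn.Linear(dim_code, dim_in)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def select_expert(x, autoencoders):
    """Pick the task whose autoencoder reconstructs the test sample best.

    `autoencoders` maps task name -> trained TaskAutoencoder; only the chosen
    task's expert network then needs to be loaded into memory.
    """
    with torch.no_grad():
        errors = {task: torch.mean((ae(x) - x) ** 2).item()
                  for task, ae in autoencoders.items()}
    return min(errors, key=errors.get)
```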

    Exploring LOTS in Deep Neural Networks

    Andras Rozsa, Manuel Günther, Terrance E. Boult
    Comments: Under review as a conference paper at ICLR 2017, see open review page: this https URL
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Deep neural networks have recently demonstrated excellent performance on
    various tasks. Despite recent advances, our understanding of these learning
    models is still incomplete, as revealed by their unexpected vulnerability to
    imperceptibly small, non-random perturbations. The existence of these
    so-called adversarial examples presents a serious problem for the application of
    vulnerable machine learning models. In this paper, we introduce the layerwise
    origin-target synthesis (LOTS) that can serve multiple purposes. First, we can
    use it as a visualization technique that gives us insights into the function of
    any intermediate feature layer by showing the notion of a particular input in
    deep neural networks. Second, our approach can be applied to assess the
    invariance of the learned features captured at any layer with respect to the
    class of the particular input. Finally, it can also be utilized as a general
    way for producing a vast amount of diverse adversarial examples that can be
    used for training to further improve the robustness of machine learning models
    and their performance as well.
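
    One way to picture a layerwise origin-target perturbation is as an iterated gradient step on the input that pulls its layer-l representation toward a chosen target representation. The single-step sketch below is a simplified reading of that idea; the feature-extractor callable, the step size, and the gradient normalization are assumptions rather than the authors' exact formulation.

```python
import torch

def lots_step(features_at_layer, x_origin, target_feature, step_size=1.0):
    """One gradient step moving an input's layer-l features toward a target.

    features_at_layer : callable mapping an input tensor to its layer-l features.
    Iterating this step yields perturbed inputs whose internal representation
    approaches `target_feature`.
    """
    x = x_origin.clone().detach().requires_grad_(True)
    loss = 0.5 * torch.sum((features_at_layer(x) - target_feature) ** 2)
    loss.backward()
    with torch.no_grad():
        grad = x.grad
        x_new = x - step_size * grad / (grad.abs().max() + 1e-12)  # normalized step
    return x_new.detach()
```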

    End-to-End Subtitle Detection and Recognition for Videos in East Asian Languages via CNN Ensemble with Near-Human-Level Performance

    Yan Xu, Siyuan Shan, Ziming Qiu, Zhipeng Jia, Zhengyang Shen, Yipei Wang, Mengfei Shi, Eric I-Chao Chang
    Comments: 35 pages
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    In this paper, we propose an innovative end-to-end subtitle detection and
    recognition system for videos in East Asian languages. Our end-to-end system
    consists of multiple stages. Subtitles are firstly detected by a novel image
    operator based on the sequence information of consecutive video frames. Then,
    an ensemble of Convolutional Neural Networks (CNNs) trained on synthetic data
    is adopted for detecting and recognizing East Asian characters. Finally, a
    dynamic programming approach leveraging language models is applied to
    constitute results of the entire body of text lines. The proposed system
    achieves average end-to-end accuracies of 98.2% and 98.3% on 40 videos in
    Simplified Chinese and 40 videos in Traditional Chinese respectively,
    significantly outperforming other existing methods. The near-perfect
    accuracy of our system dramatically narrows the gap between human cognitive
    ability and state-of-the-art algorithms used for such a task.

    AFFACT – Alignment Free Facial Attribute Classification Technique

    Manuel Günther, Andras Rozsa, Terrance E. Boult
    Comments: This is a pre-print of the original paper submitted for review in FG 2017
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    In this paper, we investigate how the latest versions of deep convolutional
    neural networks perform on the facial attribute classification task. We test
    two loss functions to train the neural networks: the sigmoid cross-entropy loss
    usually used in multi-objective classification tasks, and the Euclidean loss
    normally applied to regression problems, and find that there is little
    difference between these two loss functions. Rather, more attention should be
    paid on pre-training the network with external data, the learning rate policy
    and the evaluation technique. Using an ensemble of three ResNets, we obtain the
    new state-of-the-art facial attribute classification error of 8.00% on the
    aligned images of the CelebA dataset. More significantly, we introduce a novel
    data augmentation technique that allows training the AFFACT network, which
    classifies facial attributes without requiring alignment and works solely on
    detected face bounding boxes. We show that this approach outperforms the CelebA
    baseline, which did not use any face alignment either, with a relative
    improvement of 34%.

    Fast low-level pattern matching algorithm

    Janja Paliska Soldo, Ana Sovic Krzic, and Damir Sersic
    Comments: 14 pages, 7 tables. This work has been fully supported by the Croatian Science Foundation under the project UIP-11-2013-7353 Algorithms for Genome Sequence Analysis
    Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Genomics (q-bio.GN)

    This paper focuses on pattern matching in the DNA sequence. It was inspired
    by a previously reported method that proposes encoding both pattern and
    sequence using prime numbers. Although fast, that method is limited to rather
    small pattern lengths due to computing precision problems. Our approach
    successfully deals with large patterns thanks to an implementation that uses
    modular arithmetic. To obtain results very quickly, the code was adapted
    for multithreaded and parallel implementations. The method is reduced to
    assembly-language-level instructions, so the final result shows significant
    time and memory savings compared to the reference algorithm.
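
    The abstract does not give the encoding itself, but the role of modular arithmetic can be illustrated with a standard Rabin-Karp style rolling hash over the DNA alphabet: hash values stay bounded by the modulus no matter how long the pattern is, which is exactly the precision problem plain integer encodings run into. This sketch is a generic stand-in, not the authors' prime-number encoding.

```python
def find_pattern(sequence, pattern, base=4, mod=(1 << 61) - 1):
    """Rabin-Karp style matching over a DNA alphabet using modular arithmetic."""
    enc = {"A": 0, "C": 1, "G": 2, "T": 3}
    m, n = len(pattern), len(sequence)
    if m > n:
        return []
    # Hash of the pattern and of the first window of the sequence.
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + enc[pattern[i]]) % mod
        w_hash = (w_hash * base + enc[sequence[i]]) % mod
    top = pow(base, m - 1, mod)      # weight of the character leaving the window
    matches = []
    for i in range(n - m + 1):
        if w_hash == p_hash and sequence[i:i + m] == pattern:  # verify on hash hit
            matches.append(i)
        if i + m < n:                # roll the window one position to the right
            w_hash = ((w_hash - enc[sequence[i]] * top) * base
                      + enc[sequence[i + m]]) % mod
    return matches
```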

    DeepVO: A Deep Learning approach for Monocular Visual Odometry

    Vikram Mohanty, Shubh Agrawal, Shaswat Datta, Arna Ghosh, Vishnu Dutt Sharma, Debashish Chakravarty
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Deep Learning based techniques have been successfully adopted to solve many
    standard computer vision problems, such as image classification, object
    detection and segmentation. Despite the widespread success of these
    approaches, they have not yet been largely exploited for solving the standard
    perception-related problems encountered in autonomous
    navigation such as Visual Odometry (VO), Structure from Motion (SfM) and
    Simultaneous Localization and Mapping (SLAM). This paper analyzes the problem
    of Monocular Visual Odometry using a Deep Learning-based framework, instead of
    the regular ‘feature detection and tracking’ pipeline approaches. Several
    experiments were performed to understand the influence of a known/unknown
    environment, a conventional trackable feature and pre-trained activations tuned
    for object classification on the network’s ability to accurately estimate the
    motion trajectory of the camera (or the vehicle). Based on these observations,
    we propose a Convolutional Neural Network architecture that is best suited for
    estimating the object’s pose under known environment conditions and that
    displays promising results when it comes to inferring the actual scale using
    just a single camera in real time.

    An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data

    Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jiaying Liu
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Human action recognition is an important task in computer vision. Extracting
    discriminative spatial and temporal features to model the spatial and temporal
    evolutions of different actions plays a key role in accomplishing this task. In
    this work, we propose an end-to-end spatial and temporal attention model for
    human action recognition from skeleton data. We build our model on top of the
    Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM), which
    learns to selectively focus on discriminative joints of skeleton within each
    frame of the inputs and pays different levels of attention to the outputs of
    different frames. Furthermore, to ensure effective training of the network, we
    propose a regularized cross-entropy loss to drive the model learning process
    and develop a joint training strategy accordingly. Experimental results
    demonstrate the effectiveness of the proposed model, both on the small SBU
    human action recognition dataset and on the currently largest NTU dataset.

    Cross Domain Knowledge Transfer for Person Re-identification

    Qiqi Xiao, Kelei Cao, Haonan Chen, Fangyue Peng, Chi Zhang
    Comments: 8 pages
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Person Re-Identification (re-id) is a challenging task in computer vision,
    especially when there are limited training data from multiple camera views. In
    this paper, we propose a deep learning based person re-identification method
    by transferring knowledge of mid-level attribute features and high-level
    classification features. Building on the idea that identity classification,
    attribute recognition and re-identification share the same mid-level semantic
    representations, they can be trained sequentially by fine-tuning one based on
    another. In our framework, we train identity classification and attribute
    recognition tasks with a deep Convolutional Neural Network (dCNN) to learn
    person information. The information can be transferred to the person re-id task
    and improves its accuracy by a large margin. Furthermore, a Long Short-Term
    Memory (LSTM) based Recurrent Neural Network (RNN) component is extended by a
    spatial gate. This component is used in the re-id model to pay attention to
    certain spatial parts in each recurrent unit. Experimental results show that
    our method achieves 78.3% rank-1 recognition accuracy on the CUHK03
    benchmark.

    Improving training of deep neural networks via Singular Value Bounding

    Kui Jia
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Deep learning methods have recently achieved great success on many computer
    vision problems, with image classification and object detection as prominent
    examples. In spite of these practical successes, optimization of deep networks
    remains an active topic in deep learning research. In this work, we focus on
    investigation of the network solution properties that can potentially lead to
    good performance. Our research is inspired by theoretical and empirical results
    that use orthogonal matrices to initialize networks, but we are interested in
    investigating how orthogonal weight matrices perform when network training
    converges. To this end, we propose to constrain the solutions of weight
    matrices in the orthogonal feasible set during the whole process of network
    training, and achieve this by a simple yet effective method called Singular
    Value Bounding (SVB). In SVB, all singular values of each weight matrix are
    simply bounded in a narrow band around the value of 1. Based on the same
    motivation, we also propose Bounded Batch Normalization (BBN), which improves
    Batch Normalization by removing its potential risk of ill-conditioned layer
    transform. We present both theoretical and empirical results to justify our
    proposed methods. Experiments on benchmark image classification datasets show
    the efficacy of our proposed SVB and BBN. In particular, we reach the level of
    state-of-the-art results on CIFAR10 and set the new record on CIFAR100, using
    off-the-shelf network architectures.
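
    The core SVB step can be written compactly: periodically project every weight matrix so that its singular values lie in a narrow band around 1. The band below, [1/(1+eps), 1+eps], and the projection frequency are illustrative choices, not necessarily the paper's exact settings.

```python
import numpy as np

def singular_value_bounding(W, eps=0.05):
    """Clip all singular values of a weight matrix into a narrow band around 1.

    Intended to be applied every few SGD steps so that weight matrices stay
    close to orthogonal throughout training.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s = np.clip(s, 1.0 / (1.0 + eps), 1.0 + eps)
    return (U * s) @ Vt
```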

    Online Visual Multi-Object Tracking via Labeled Random Finite Set Filtering

    Du Yong Kim, Ba-Ngu Vo, Ba-Tuong Vo
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    This paper proposes an online visual multi-object tracking algorithm using a
    top-down Bayesian formulation that seamlessly integrates state estimation,
    track management, clutter rejection, occlusion and mis-detection handling into
    a single recursion. This is achieved by modeling the multi-object state as
    labeled random finite set and using the Bayes recursion to propagate the
    multi-object filtering density forward in time. The proposed filter updates
    tracks with detections but switches to image data when mis-detection occurs,
    thereby exploiting the efficiency of detection data and the accuracy of image
    data. Furthermore the labeled random finite set framework enables the
    incorporation of prior knowledge that mis-detections of long tracks which occur
    in the middle of the scene are likely to be due to occlusions. Such prior
    knowledge can be exploited to improve occlusion handling, especially long
    occlusions that can lead to premature track termination in on-line multi-object
    tracking. Tracking performance is compared to that of state-of-the-art
    algorithms on well-known benchmark video datasets.

    Fuzzy Statistical Matrices for Cell Classification

    Guillaume Thibault, Izhak Shafran
    Comments: 21 pages, 7 figures, 5 tables
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    In this paper, we generalize image (texture) statistical descriptors and
    propose algorithms that improve their efficacy. Recently, a new method showed
    how the popular Co-Occurrence Matrix (COM) can be modified into a fuzzy version
    (FCOM) which is more effective and robust to noise. Here, we introduce new
    fuzzy versions of two additional higher order statistical matrices: the Run
    Length Matrix (RLM) and the Size Zone Matrix (SZM). We define the fuzzy zones
    and propose an efficient algorithm to compute the descriptors. We demonstrate
    the advantage of the proposed improvements over several state-of-the-art
    methods on three tasks from quantitative cell biology: analyzing and
    classifying Human Epithelial type 2 (HEp-2) cells using the Indirect
    Immunofluorescence (IIF) protocol.

    To Find the Symmetry Plane in Any Dimension, Reflect, Register, and Compute a -1 Eigenvector

    Marcelo Cicconet, David G. C. Hildebrand, Hunter Elliott
    Comments: 9 pages, 8 figures, submitted to CVPR 2017
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    In this paper, we demonstrate that the problem of fitting a plane of
    reflection symmetry to data in any dimension can be reduced to the problem of
    registering two datasets, and that the exactness of the solution depends on the
    accuracy of the registration. The pipeline for symmetry plane detection
    consists of (1) reflecting the data with respect to an arbitrary plane, (2)
    registering the original and reflected datasets, and (3) finding the
    eigenvector of eigenvalue -1 of a matrix given by the reflection and
    registration mappings. Results are shown for 2D and 3D datasets. We discuss in
    detail a particular biological application in which we study the 3D symmetry of
    manual myelinated neuron reconstructions throughout the body of a larval
    zebrafish that were extracted from serial-section electron micrographs. The
    data consists of curves that are represented as sequences of points in 3D, and
    there are two goals: first, find the plane of mirror symmetry given that the
    neuron reconstructions are nearly symmetric; second, find pairings of symmetric
    curves.
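
    A compact numerical sketch of the three-step pipeline, assuming a rigid registration routine `register(src, dst)` that returns a rotation R and translation t (e.g., ICP or Procrustes): the composition of the arbitrary reflection with the registration rotation is an improper orthogonal matrix, and its eigenvector of eigenvalue -1 gives the symmetry plane normal.

```python
import numpy as np

def reflect(points, normal, offset=0.0):
    """Reflect 3D points across the plane {x : normal . x = offset}."""
    n = normal / np.linalg.norm(normal)
    return points - 2.0 * np.outer(points @ n - offset, n)

def symmetry_plane_normal(points, register):
    """Reflect, register, and read the -1 eigenvector off the combined mapping."""
    arbitrary_normal = np.array([1.0, 0.0, 0.0])          # step (1): any plane works
    reflected = reflect(points, arbitrary_normal)
    R, t = register(reflected, points)                    # step (2): rigid registration
    F = np.eye(3) - 2.0 * np.outer(arbitrary_normal, arbitrary_normal)
    vals, vecs = np.linalg.eig(R @ F)                     # step (3): combined mapping
    idx = np.argmin(np.abs(vals + 1.0))                   # eigenvalue closest to -1
    return np.real(vecs[:, idx])                          # normal of the symmetry plane
```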

    Reweighted Low-Rank Tensor Completion and its Applications in Video Recovery

    Baburaj M., Sudhish N. George
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    This paper focuses on recovering multi-dimensional data, called tensors, from
    randomly corrupted and incomplete observations. Inspired by reweighted (l_1) norm
    minimization for sparsity enhancement, this paper proposes a reweighted
    singular value enhancement scheme to improve tensor low tubular rank in the
    tensor completion process. An efficient iterative decomposition scheme based on
    t-SVD is proposed which improves low-rank signal recovery significantly. The
    effectiveness of the proposed method is established by applying it to the video
    completion problem, and experimental results reveal that the algorithm
    outperforms its counterparts.
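
    The reweighting idea carries over directly from reweighted (l_1) minimization: shrink small singular values more aggressively than large ones. The sketch below shows the per-matrix operation only (the paper applies the idea to the tubular singular values obtained from a t-SVD), and the threshold and weighting are illustrative.

```python
import numpy as np

def reweighted_svt(X, tau=1.0, eps=1e-3):
    """Reweighted singular value thresholding for one matrix slice.

    Thresholds ~ tau / (sigma + eps) mean large singular values are barely
    shrunk while small ones are suppressed, sharpening the low-rank structure.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - tau / (s + eps), 0.0)
    return (U * s_shrunk) @ Vt
```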

    Reweighted Low-Rank Tensor Decomposition and its Applications in Video Denoising

    Baburaj M., Sudhish N. George
    Subjects: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (cs.NA)

    A tensor is decomposed into low-rank and sparse components by simultaneously
    minimizing tensor nuclear norm and the (l_1) norm in Tensor Principal Component
    Pursuit (TPCP). Inspired by reweighted (l_1) norm minimization for sparsity
    enhancement, this paper proposes a reweighted singular value enhancement scheme
    to improve tensor low tubular rank in TPCP. The sparse component of a tensor is
    also recovered by the reweighted (l_1) norm which enhances the accuracy of
    decomposition. An efficient iterative decomposition scheme based on t-SVD is
    proposed which improves low-rank signal recovery significantly. The
    effectiveness of the proposed method is established by applying it to the video
    denoising problem, and experimental results reveal that the algorithm
    outperforms its counterparts.

    SC-DCNN: Highly-Scalable Deep Convolutional Neural Network using Stochastic Computing

    Ao Ren, Ji Li, Zhe Li, Caiwen Ding, Xuehai Qian, Qinru Qiu, Bo Yuan, Yanzhi Wang
    Comments: This paper is accepted by 22nd ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2017
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    With the recent advance of the Internet of Things (IoT), it has become very
    attractive to implement deep convolutional neural networks (DCNNs) on
    embedded/portable systems. Presently, executing software-based DCNNs
    requires high-performance server clusters in practice, restricting their
    widespread deployment on mobile devices. To overcome this issue,
    considerable research efforts have been conducted in the context of developing
    highly-parallel and specific DCNN hardware, utilizing GPGPUs, FPGAs, and ASICs.
    Stochastic Computing (SC), which uses a bit-stream to represent a number within
    [-1, 1] by counting the number of ones in the bit-stream, has a high potential
    for implementing DCNNs with high scalability and ultra-low hardware footprint.
    Since multiplications and additions can be calculated using AND gates and
    multiplexers in SC, significant reductions in power/energy and hardware
    footprint can be achieved compared to the conventional binary arithmetic
    implementations. The tremendous savings in power (energy) and hardware
    resources bring about immense design space for enhancing scalability and
    robustness for hardware DCNNs. This paper presents the first comprehensive
    design and optimization framework of SC-based DCNNs (SC-DCNNs). We first
    present the optimal designs of function blocks that perform the basic
    operations, i.e., inner product, pooling, and activation function. Then we
    propose the optimal design of four types of combinations of basic function
    blocks, named feature extraction blocks, which are in charge of extracting
    features from input feature maps. Besides, weight storage methods are
    investigated to reduce the area and power/energy consumption for storing
    weights. Finally, the whole SC-DCNN implementation is optimized, with feature
    extraction blocks carefully selected, to minimize area and power/energy
    consumption while maintaining a high network accuracy level.
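
    The arithmetic behind SC is easy to simulate in software. The sketch below uses the simpler unipolar format (values in [0, 1]) to show why a bitwise AND multiplies two streams and a multiplexer performs scaled addition; the paper's bipolar [-1, 1] format works analogously with different gates.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_stream(x, length=4096):
    """Encode a value in [0, 1] as a random bit-stream with P(bit = 1) = x."""
    return (rng.random(length) < x).astype(np.uint8)

def from_stream(bits):
    """Decode a bit-stream by counting the fraction of ones."""
    return bits.mean()

a, b = to_stream(0.6), to_stream(0.5)

# Multiplication of two independent unipolar streams is a bitwise AND:
# P(a AND b = 1) = 0.6 * 0.5 = 0.3.
print(from_stream(a & b))

# Scaled addition uses a multiplexer with a 0.5-probability select stream:
# the output decodes to roughly (0.6 + 0.5) / 2 = 0.55.
sel = to_stream(0.5)
print(from_stream(np.where(sel == 1, a, b)))
```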

    Generalized BackPropagation, Étude De Cas: Orthogonality

    Mehrtash Harandi, Basura Fernando
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    This paper introduces an extension of the backpropagation algorithm that
    enables us to have layers with constrained weights in a deep network. In
    particular, we make use of the Riemannian geometry and optimization techniques
    on matrix manifolds to step outside of normal practice in training deep
    networks, equipping the network with structures such as orthogonality or
    positive definiteness. Based on our development, we make another contribution
    by introducing the Stiefel layer, a layer with orthogonal weights. Among
    various applications, Stiefel layers can be used to design orthogonal filter
    banks, perform dimensionality reduction and feature extraction. We demonstrate
    the benefits of having orthogonality in deep networks through a broad set of
    experiments, ranging from unsupervised feature learning to fine-grained image
    classification.
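
    A generic sketch of how orthogonality can be maintained during training, assuming the embedded (Euclidean) metric on the Stiefel manifold and a QR-based retraction; this is standard manifold optimization machinery, not necessarily the exact update used for the Stiefel layer.

```python
import numpy as np

def stiefel_update(W, grad, lr=0.01):
    """One Riemannian gradient step keeping W^T W = I (W is n x p, n >= p)."""
    WtG = W.T @ grad
    riem_grad = grad - W @ (WtG + WtG.T) / 2.0     # project onto the tangent space
    Q, R = np.linalg.qr(W - lr * riem_grad)        # retract back onto the manifold
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs                               # fix column signs for a unique QR
```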

    Squared Earth Mover's Distance-based Loss for Training Deep Neural Networks

    Le Hou, Chen-Ping Yu, Dimitris Samaras
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Deep neural networks (DNNs), especially convolutional neural networks (CNNs),
    have achieved state-of-the-art performances in many applications, and they have
    been shown to be especially powerful in multi-class classification tasks.
    Existing DNNs and CNNs are typically trained with a soft-max cross-entropy loss
    which only considers the ground-truth class by maximizing the predicted
    probability of the correct label. This cross-entropy loss ignores the intricate
    inter-class relationships that exist in the data. In this work, we propose to
    use the exact squared Earth Mover’s Distance (EMD) as a loss function for
    multi-class classification tasks. The squared EMD loss uses the predicted
    probabilities of all classes and penalizes mis-predictions accordingly. In
    experiments, we evaluate our squared EMD loss in ordered-classes datasets such
    as age estimation and image aesthetic judgment. We also generalize the squared
    EMD loss to classification datasets with orderless classes such as ImageNet.
    Our results show that the squared EMD loss allows networks to achieve
    lower errors than the standard cross-entropy loss, and results in
    state-of-the-art performances on two age estimation datasets and one image
    aesthetic judgment dataset.
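
    For ordered classes, the EMD between two distributions reduces to the distance between their cumulative distributions, so the squared EMD loss is simply the sum of squared CDF differences, as sketched below; the age-bucket example is made up for illustration.

```python
import numpy as np

def squared_emd_loss(pred_probs, target_probs):
    """Squared Earth Mover's Distance between distributions over ordered classes.

    Unlike cross-entropy, probability mass placed on classes near the true
    class is penalized less than mass placed far away from it.
    """
    cdf_pred = np.cumsum(pred_probs, axis=-1)
    cdf_true = np.cumsum(target_probs, axis=-1)
    return np.sum((cdf_pred - cdf_true) ** 2, axis=-1)

# True age bucket is 2 (of 5); predicting the neighboring bucket costs less
# than predicting the farthest one.
t = np.array([0, 0, 1, 0, 0], dtype=float)
print(squared_emd_loss(np.array([0, 0, 0, 1, 0.0]), t))   # 1.0
print(squared_emd_loss(np.array([0, 0, 0, 0, 1.0]), t))   # 2.0
```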

    Generative One-Class Models for Text-based Person Retrieval in Forensic Applications

    David Gerónimo, Hedvig Kjellström
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Automatic forensic image analysis assists criminal investigation experts in
    the search for suspicious persons, abnormal behavior detection and identity
    matching in images. In this paper we propose a person retrieval system that
    uses textual queries (e.g., “black trousers and green shirt”) as descriptions
    and a one-class generative color model with outlier filtering to represent the
    images both to train the models and to perform the search. The method is
    evaluated in terms of its efficiency in fulfilling the needs of a forensic
    retrieval system: limited annotation, robustness, extensibility, adaptability
    and computational cost. The proposed generative method is compared to a
    corresponding discriminative approach. Experiments are carried out using a
    range of queries in three different databases. The experiments show that the
    two evaluated algorithms provide average retrieval performance and are adaptable to
    new datasets. The proposed generative algorithm has some advantages over the
    discriminative one, specifically its capability to work with very few training
    samples and its much lower computational requirements when the number of
    training examples increases.

    Answering Image Riddles using Vision and Reasoning through Probabilistic Soft Logic

    Somak Aditya, Yezhou Yang, Chitta Baral, Yiannis Aloimonos
    Comments: 14 pages, 10 figures
    Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

    In this work, we explore a genre of puzzles (“image riddles”) which involves
    a set of images and a question. Answering these puzzles requires both
    capabilities involving visual detection (including object and activity
    recognition) and knowledge-based or commonsense reasoning. We compile a
    dataset of over 3k riddles where each riddle consists of 4 images and a
    groundtruth answer. The annotations are validated using crowd-sourced
    evaluation. We also define an automatic evaluation metric to track future
    progress. Our task bears similarity to commonly known IQ tasks, such as
    analogy solving and sequence filling, that are often used to test intelligence.

    We develop a Probabilistic Reasoning-based approach that utilizes
    probabilistic commonsense knowledge to answer these riddles with a reasonable
    accuracy. We demonstrate the results of our approach using both automatic and
    human evaluations. Our approach achieves some promising results for these
    riddles and provides a strong baseline for future attempts. We make the entire
    dataset and related materials publicly available to the community in
    ImageRiddle Website (this http URL).

    ModelHub: Towards Unified Data and Lifecycle Management for Deep Learning

    Hui Miao, Ang Li, Larry S. Davis, Amol Deshpande
    Subjects: Databases (cs.DB); Computer Vision and Pattern Recognition (cs.CV)

    Deep learning has improved state-of-the-art results in many important fields,
    and has been the subject of much research in recent years, leading to the
    development of several systems for facilitating deep learning. Current systems,
    however, mainly focus on model building and training phases, while the issues
    of data management, model sharing, and lifecycle management are largely
    ignored. The deep learning modeling lifecycle generates a rich set of data
    artifacts, such as learned parameters and training logs, and comprises
    several frequently conducted tasks, e.g., understanding model behaviors and
    trying out new models. Dealing with such artifacts and tasks is cumbersome and
    largely left to the users. This paper describes our vision and implementation
    of a data and lifecycle management system for deep learning. First, we
    generalize model exploration and model enumeration queries from commonly
    conducted tasks by deep learning modelers, and propose a high-level domain
    specific language (DSL), inspired by SQL, to raise the abstraction level and
    accelerate the modeling process. To manage the data artifacts, especially the
    large amount of checkpointed float parameters, we design a novel model
    versioning system (dlv), and a read-optimized parameter archival storage system
    (PAS) that minimizes storage footprint and accelerates query workloads without
    losing accuracy. PAS archives versioned models using deltas in a
    multi-resolution fashion by separately storing the less significant bits, and
    features a novel progressive query (inference) evaluation algorithm. Third, we
    show that archiving versioned models using deltas poses a new dataset
    versioning problem and we develop efficient algorithms for solving it. We
    conduct extensive experiments over several real datasets from computer vision
    domain to show the efficiency of the proposed techniques.

    NoiseOut: A Simple Way to Prune Neural Networks

    Mohammad Babaeizadeh, Paris Smaragdis, Roy H. Campbell
    Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)

    Neural networks are usually over-parameterized with significant redundancy in
    the number of required neurons which results in unnecessary computation and
    memory usage at inference time. One common approach to address this issue is to
    prune these big networks by removing extra neurons and parameters while
    maintaining the accuracy. In this paper, we propose NoiseOut, a fully automated
    pruning algorithm based on the correlation between activations of neurons in
    the hidden layers. We prove that adding additional output neurons with entirely
    random targets results in a higher correlation between neurons, which makes
    pruning by NoiseOut even more efficient. Finally, we test our method on various
    networks and datasets. These experiments exhibit high pruning rates while
    maintaining the accuracy of the original network.

    Minimal Problems for the Calibrated Trifocal Variety

    Joe Kileel
    Comments: 23 pages, 1 table
    Subjects: Algebraic Geometry (math.AG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)

    We determine the algebraic degree of minimal problems for the calibrated
    trifocal variety in computer vision. We rely on numerical algebraic geometry
    and the homotopy continuation software Bertini.

    Sparsey: Event Recognition via Deep Hierarchical Sparse Distributed Codes

    Gerard J. Rinkus
    Comments: This is a manuscript form of a paper published in Frontiers in Computational Neuroscience in 2014 (this http URL). 65 pages, 28 figures, 8 tables
    Journal-ref: Frontiers in Computational Neuroscience, Vol. 8, Article 160
    (2014)
    Subjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

    Visual cortex’s hierarchical, multi-level organization is captured in many
    biologically inspired computational vision models, the general idea being that
    progressively larger scale, more complex spatiotemporal features are
    represented in progressively higher areas. However, most earlier models use
    localist representations (codes) in each representational field, which we
    equate with the cortical macrocolumn (mac), at each level. In localism, each
    represented feature/event (item) is coded by a single unit. Our model, Sparsey,
    is also hierarchical but crucially, uses sparse distributed coding (SDC) in
    every mac in all levels. In SDC, each represented item is coded by a small
    subset of the mac’s units. SDCs of different items can overlap and the size of
    overlap between items can represent their similarity. The difference between
    localism and SDC is crucial because SDC allows the two essential operations of
    associative memory, storing a new item and retrieving the best-matching stored
    item, to be done in fixed time for the life of the model. Since the model’s
    core algorithm, which does both storage and retrieval (inference), makes a
    single pass over all macs on each time step, the overall model’s
    storage/retrieval operation is also fixed-time, a criterion we consider
    essential for scalability to huge datasets. A 2010 paper described a
    nonhierarchical version of this model in the context of purely spatial pattern
    processing. Here, we elaborate a fully hierarchical model (arbitrary numbers of
    levels and macs per level), describing novel model principles like progressive
    critical periods, dynamic modulation of principal cells’ activation functions
    based on a mac-level familiarity measure, representation of multiple
    simultaneously active hypotheses, a novel method of time warp invariant
    recognition, and we report results showing learning/recognition of
    spatiotemporal patterns.


    Artificial Intelligence

    Stratified Knowledge Bases as Interpretable Probabilistic Models (Extended Abstract)

    Ondrej Kuzelka, Jesse Davis, Steven Schockaert
    Comments: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems
    Subjects: Artificial Intelligence (cs.AI)

    In this paper, we advocate the use of stratified logical theories for
    representing probabilistic models. We argue that such encodings can be more
    interpretable than those obtained in existing frameworks such as Markov logic
    networks. Among other benefits, this allows domain experts to improve
    learned models by directly removing, adding, or modifying logical formulas.

    Team-maxmin equilibrium: efficiency bounds and algorithms

    Nicola Basilico, Andrea Celli, Giuseppe De Nittis, Nicola Gatti
    Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)

    The Team-maxmin equilibrium prescribes the optimal strategies for a team of
    rational players sharing the same goal and without the capability of
    correlating their strategies in strategic games against an adversary. This
    solution concept can capture situations in which an agent controls multiple
    resources, corresponding to the team members, that cannot communicate. It is
    known that such an equilibrium always exists and is unique (barring degeneracy),
    and these properties make it a credible solution concept to be used in
    real-world applications, especially in security scenarios. Nevertheless, to the
    best of our knowledge, the Team-maxmin equilibrium is almost completely
    unexplored in the literature. In this paper, we investigate bounds of
    (in)efficiency of the Team-maxmin equilibrium w.r.t. the Nash equilibria and
    w.r.t. the Maxmin equilibrium when the team members can play correlated
    strategies. Furthermore, we study a number of algorithms to find and/or
    approximate an equilibrium, discussing their theoretical guarantees and
    evaluating their performance by using a standard testbed of game instances.

    Navigational Rule Derivation: An algorithm to determine the effect of traffic signs on road networks

    Daniil Galaktionov, Miguel R. Luaces, Ángeles S. Places
    Comments: This research has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. In PACIS 2016 Online Proceedings
    Subjects: Artificial Intelligence (cs.AI)

    In this paper we present an algorithm to build a road network map enriched
    with traffic rules such as one-way streets and forbidden turns, based on the
    interpretation of already detected and classified traffic signs. Such an
    algorithm helps to automate the elaboration of maps for commercial navigation systems.
    Our solution is based on simulating navigation along the road network,
    determining at each point of interest the visibility of the signs and their
    effect on the roads. We test our approach in a small urban network and discuss
    various ways to generalize it to support more complex environments.

    NCBO Ontology Recommender 2.0: An Enhanced Approach for Biomedical Ontology Recommendation

    Marcos Martinez-Romero, Clement Jonquet, Martin J. O'Connor, John Graybeal, Alejandro Pazos, Mark A. Musen
    Comments: 29 pages, 8 figures, 11 tables
    Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

    Biomedical researchers use ontologies to annotate their data with ontology
    terms, enabling better data integration and interoperability. However, the
    number, variety and complexity of current biomedical ontologies make it
    cumbersome for researchers to determine which ones to reuse for their specific
    needs. To overcome this problem, in 2010 the National Center for Biomedical
    Ontology (NCBO) released the Ontology Recommender, which is a service that
    receives a biomedical text corpus or a list of keywords and suggests ontologies
    appropriate for referencing the indicated terms. We developed a new version of
    the NCBO Ontology Recommender. Called Ontology Recommender 2.0, it uses a new
    recommendation approach that evaluates the relevance of an ontology to
    biomedical text data according to four criteria: (1) the extent to which the
    ontology covers the input data; (2) the acceptance of the ontology in the
    biomedical community; (3) the level of detail of the ontology classes that
    cover the input data; and (4) the specialization of the ontology to the domain
    of the input data. Our evaluation shows that the enhanced recommender provides
    higher quality suggestions than the original approach, providing better
    coverage of the input data, more detailed information about their concepts,
    increased specialization for the domain of the input data, and greater
    acceptance and use in the community. In addition, it provides users with more
    explanatory information, along with suggestions of not only individual
    ontologies but also groups of ontologies. It also can be customized to fit the
    needs of different scenarios. Ontology Recommender 2.0 combines the strengths
    of its predecessor with a range of adjustments and new features that improve
    its reliability and usefulness. Ontology Recommender 2.0 recommends over 500
    biomedical ontologies from the NCBO BioPortal platform, where it is openly
    available.

    Analysis of a Design Pattern for Teaching with Features and Labels

    Christopher Meek, Patrice Simard, Xiaojin Zhu
    Comments: Also available at this https URL
    Subjects: Artificial Intelligence (cs.AI); Learning (cs.LG)

    We study the task of teaching a machine to classify objects using features
    and labels. We introduce the Error-Driven-Featuring design pattern for teaching
    using features and labels in which a teacher prefers to introduce features only
    if they are needed. We analyze the potential risks and benefits of this
    teaching pattern through the use of teaching protocols, illustrative examples,
    and by providing bounds on the effort required for an optimal machine teacher
    using a linear learning algorithm, the most commonly used type of learner in
    interactive machine learning systems. Our analysis provides a deeper
    understanding of potential trade-offs of using different learning algorithms
    and between the effort required for featuring (creating new features) and
    labeling (providing labels for objects).

    Structural Causal Models: Cycles, Marginalizations, Exogenous Reparametrizations and Reductions

    Stephan Bongers, Jonas Peters, Bernhard Schölkopf, Joris M. Mooij
    Comments: Will probably be submitted to The Annals of Statistics
    Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Learning (cs.LG)

    Structural causal models (SCMs), also known as non-parametric structural
    equation models (NP-SEMs), are widely used for causal modeling purposes. In
    this paper, we give a rigorous treatment of structural causal models, dealing
    with measure-theoretic complications that arise in the presence of cyclic
    relations. The central question studied in this paper is: given a (possibly
    cyclic) SCM defined on a large system (consisting of observable endogenous and
    latent exogenous variables), can we “project it down” to an SCM that describes
    a subsystem (consisting of a subset of the observed endogenous variables and
    possibly different latent exogenous variables) in order to obtain a more
    parsimonious but equivalent representation of the subsystem? We define a
    marginalization operation that effectively removes a subset of the endogenous
    variables from the model, and a class of mappings, exogenous
    reparameterizations, that can be used to reduce the space of exogenous
    variables. We show that both operations preserve the causal semantics of the
    model and that under mild conditions they can lead to a significant reduction
    of the model complexity, at least in terms of the number of variables in the
    model. We argue that for the task of estimating an SCM from data, the existence
    of “smooth” reductions would be desirable. We provide several conditions under
    which the existence of such reductions can be shown, but also provide a
    counterexample that shows that such reductions do not exist in general. The
    latter result implies that existing approaches to estimate linear or Markovian
    SCMs from data cannot be extended to general SCMs.

    Generative Deep Neural Networks for Dialogue: A Short Review

    Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, Joelle Pineau
    Comments: 6 pages, 1 figure, 3 tables; NIPS 2016 workshop on Learning Methods for Dialogue
    Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

    Researchers have recently started investigating deep neural networks for
    dialogue applications. In particular, generative sequence-to-sequence (Seq2Seq)
    models have shown promising results for unstructured tasks, such as word-level
    dialogue response generation. The hope is that such models will be able to
    leverage massive amounts of data to learn meaningful natural language
    representations and response generation strategies, while requiring a minimum
    amount of domain knowledge and hand-crafting. An important challenge is to
    develop models that can effectively incorporate dialogue context and generate
    meaningful and diverse responses. In support of this goal, we review recently
    proposed models based on generative encoder-decoder neural network
    architectures, and show that these models have better ability to incorporate
    long-term dialogue history, to model uncertainty and ambiguity in dialogue, and
    to generate responses with high-level compositional structure.

    Expert Gate: Lifelong Learning with a Network of Experts

    Rahaf Aljundi, Punarjay Chakravarty, Tinne Tuytelaars
    Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

    In this paper we introduce a model of lifelong learning, based on a Network
    of Experts. New tasks / experts are learned and added to the model
    sequentially, building on what was learned before. To ensure scalability of
    this process, data from previous tasks cannot be stored and hence is not
    available when learning a new task. A critical issue in such a context, not
    addressed in the literature so far, relates to the decision of which expert to
    deploy at test time. We introduce a gating autoencoder that learns a
    representation for the task at hand, and is used at test time to automatically
    forward the test sample to the relevant expert. This has the added advantage of
    being memory efficient as only one expert network has to be loaded into memory
    at any given time. Further, the autoencoders inherently capture the relatedness
    of one task to another, based on which the most relevant prior model to be used
    for training a new expert with fine-tuning or learning-without-forgetting can
    be selected. We evaluate our system on image classification and video
    prediction problems.

    Query Complexity of Tournament Solutions

    Palash Dey
    Comments: To appear in AAAI 2017
    Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)

    A directed graph where there is exactly one edge between every pair of
    vertices is called a {em tournament}. Finding the “best” set of vertices of a
    tournament is a well studied problem in social choice theory. A {em tournament
    solution} takes a tournament as input and outputs a subset of vertices of the
    input tournament. However, in many applications, for example, choosing the best
    set of drugs from a given set of drugs, the edges of the tournament are given
    only implicitly and knowing the orientation of an edge is costly. In such
    scenarios, we would like to know the best set of vertices (according to some
    tournament solution) by “querying” as few edges as possible. We, in this paper,
    precisely study this problem for commonly used tournament solutions: given an
    oracle access to the edges of a tournament T, find (f(T)) by querying as few
    edges as possible, for a tournament solution f. We first show that the set of
    Condorcet non-losers in a tournament can be found by querying (2n-lfloor log
    n
    floor -2) edges only and this is tight in the sense that every algorithm
    for finding the set of Condorcet non-losers needs to query at least (2n-lfloor
    log n
    floor -2) edges in the worst case, where (n) is the number of vertices
    in the input tournament. We then move on to study other popular tournament
    solutions and show that any algorithm for finding the Copeland set, the Slater
    set, the Markov set, the bipartisan set, the uncovered set, the Banks set, and
    the top cycle must query (Omega(n^2)) edges in the worst case. On the positive
    side, we are able to circumvent our strong query complexity lower bound results
    by proving that, if the size of the top cycle of the input tournament is at
    most (k), then we can find all the tournament solutions mentioned above by
    querying (O(nk + frac{nlog n}{log(1-frac{1}{k})})) edges only.
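
For readers unfamiliar with the query model, the hedged sketch below hides a
random tournament behind a query-counting edge oracle and computes the
Copeland set by the naive strategy of querying every edge, which is exactly
the Theta(n^2) regime the lower bounds above refer to. It illustrates the
oracle-access setting only, not the paper's algorithm.

    # Edge-oracle model: the tournament is hidden behind a query-counting
    # oracle, and a naive baseline computes the Copeland set (vertices of
    # maximum out-degree) with n*(n-1)/2 queries.
    import itertools
    import random

    class TournamentOracle:
        def __init__(self, n, seed=0):
            rnd = random.Random(seed)
            # orientation[(u, v)] is True iff the edge points from u to v
            self.orientation = {}
            for u, v in itertools.combinations(range(n), 2):
                self.orientation[(u, v)] = rnd.random() < 0.5
            self.queries = 0

        def beats(self, u, v):
            """Query the orientation of the edge {u, v}."""
            self.queries += 1
            if u < v:
                return self.orientation[(u, v)]
            return not self.orientation[(v, u)]

    def copeland_set(oracle, n):
        out_degree = [0] * n
        for u, v in itertools.combinations(range(n), 2):
            if oracle.beats(u, v):
                out_degree[u] += 1
            else:
                out_degree[v] += 1
        best = max(out_degree)
        return {v for v in range(n) if out_degree[v] == best}

    n = 8
    oracle = TournamentOracle(n)
    print(copeland_set(oracle, n), "using", oracle.queries, "queries")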

    Variable Computation in Recurrent Neural Networks

    Yacine Jernite, Edouard Grave, Armand Joulin, Tomas Mikolov
    Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Learning (cs.LG)

    Recurrent neural networks (RNNs) have been used extensively and with
    increasing success to model various types of sequential data. Much of this
    progress has been achieved through devising recurrent units and architectures
    with the flexibility to capture complex statistics in the data, such as long
    range dependency or localized attention phenomena. However, while many
    sequential data (such as video, speech or language) can have highly variable
    information flow, most recurrent models still consume input features at a
    constant rate and perform a constant number of computations per time step,
    which can be detrimental to both speed and model capacity. In this paper, we
    explore a modification to existing recurrent units which allows them to learn
    to vary the amount of computation they perform at each step, without prior
    knowledge of the sequence’s time structure. We show experimentally that not
    only is our model more computationally efficient, it also leads to better
    performance overall on our evaluation tasks.

    Learning Interpretability for Visualizations using Adapted Cox Models through a User Experiment

    Adrien Bibal, Benoit Frénay
    Comments: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems
    Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Learning (cs.LG)

    In order to be useful, visualizations need to be interpretable. This paper
    uses a user-based approach to combine and assess quality measures in order to
    better model user preferences. Results show that cluster separability measures
    are outperformed by a neighborhood conservation measure, even though the former
    are usually considered as intuitively representative of user motives. Moreover,
    combining measures, as opposed to using a single measure, further improves
    prediction performances.

    Faster variational inducing input Gaussian process classification

    Pavel Izmailov, Dmitry Kropotov
    Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

    Gaussian processes (GP) provide a prior over functions and allow finding
    complex regularities in data. Gaussian processes are successfully used for
    classification/regression problems and dimensionality reduction. In this work
    we consider the classification problem only. The complexity of standard methods
    for GP-classification scales cubically with the size of the training dataset.
    This complexity makes them inapplicable to big data problems. Therefore, a
    variety of methods were introduced to overcome this limitation. In the paper we
    focus on methods based on so called inducing inputs. This approach is based on
    variational inference and proposes a particular lower bound for marginal
    likelihood (evidence). This bound is then maximized w.r.t. parameters of kernel
    function of the Gaussian process, thus fitting the model to data. The
    computational complexity of this method is (O(nm^2)), where (m) is the number
    of inducing inputs used by the model and is assumed to be substantially smaller
than the size of the dataset (n). Recently, a new evidence lower bound for
the GP-classification problem was introduced. It allows using stochastic
optimization, which makes it suitable for big data problems. However, the new
lower bound depends on (O(m^2)) variational parameters, which makes
optimization challenging when m is large. In this work we develop a new
approach for training inducing input GP models for classification problems.
We use a quadratic approximation of several terms in the aforementioned
evidence lower bound, obtaining analytical expressions for the optimal values
of most of the parameters, thus substantially reducing the dimension of the
optimization space. In our experiments we achieve results as good as or
better than those of the existing method. Moreover, our method doesn’t
require the user to manually set the learning rate, making it more practical
than the existing method.
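
The (O(nm^2)) cost mentioned above comes from replacing the full n x n kernel
matrix with quantities built from an n x m cross-covariance and an m x m
inducing-point covariance. The numpy sketch below illustrates this
Nystrom-style structure; the RBF kernel, the sizes and the probe-based error
check are assumptions for illustration and do not reproduce the paper's
classification bound.

    # Why inducing-input approximations cost O(n m^2): the full kernel is
    # approximated by K_nm K_mm^{-1} K_nm^T, built from an n x m and an
    # m x m matrix, without ever forming the n x n kernel.
    import numpy as np

    def rbf(a, b, lengthscale=1.0):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    rng = np.random.default_rng(0)
    n, m, dim = 2000, 50, 3                     # m inducing inputs, m << n
    X = rng.normal(size=(n, dim))               # training inputs
    Z = X[rng.choice(n, m, replace=False)]      # inducing inputs from the data

    K_mm = rbf(Z, Z) + 1e-6 * np.eye(m)         # m x m
    K_nm = rbf(X, Z)                            # n x m

    L = np.linalg.cholesky(K_mm)                # O(m^3)
    A = np.linalg.solve(L, K_nm.T)              # m x n, costs O(n m^2)

    # Compare the approximation to the exact kernel on a small probe block.
    idx = rng.choice(n, 5, replace=False)
    K_exact = rbf(X[idx], X[idx])
    K_approx = A[:, idx].T @ A[:, idx]
    print("max abs error on probe block:", np.abs(K_exact - K_approx).max())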

    Swarm Intelligence for Multiobjective Optimization of Extraction Process

    T. Ganesan, I. Elamvazuthi, P.Vasant
    Journal-ref: Handbook of Research on Modern Optimization Algorithms and
    Applications in Engineering and Economics, (2016), IGI Global, pp 516 – 544
    Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)

    Multi objective (MO) optimization is an emerging field which is increasingly
    being implemented in many industries globally. In this work, the MO
    optimization of the extraction process of bioactive compounds from the Gardenia
    Jasminoides Ellis fruit was solved. Three swarm-based algorithms have been
    applied in conjunction with normal-boundary intersection (NBI) method to solve
    this MO problem. The gravitational search algorithm (GSA) and the particle
    swarm optimization (PSO) technique were implemented in this work. In addition,
    a novel Hopfield-enhanced particle swarm optimization was developed and applied
to the extraction problem. By measuring the levels of dominance, the
optimality of the approximate Pareto frontiers produced by all the algorithms
was gauged and compared. In addition, by measuring the levels of convergence
of the frontier, some understanding is gained of the structure of the
objective space in terms of its relation to the level of frontier dominance.
Detailed comparative studies were conducted on all the algorithms employed
and developed in this work.

    Monte Carlo Connection Prover

    Michael Färber, Cezary Kaliszyk, Josef Urban
    Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Learning (cs.LG)

    Monte Carlo Tree Search (MCTS) is a technique to guide search in a large
    decision space by taking random samples and evaluating their outcome. In this
    work, we study MCTS methods in the context of the connection calculus and
    implement them on top of the leanCoP prover. This includes proposing useful
    proof-state evaluation heuristics that are learned from previous proofs, and
    proposing and automatically improving suitable MCTS strategies in this context.
    The system is trained and evaluated on a large suite of related problems coming
from the Mizar proof assistant, showing that it is capable of finding new and
    different proofs. To our knowledge, this is the first time MCTS has been
    applied to theorem proving.

    Answering Image Riddles using Vision and Reasoning through Probabilistic Soft Logic

    Somak Aditya, Yezhou Yang, Chitta Baral, Yiannis Aloimonos
    Comments: 14 pages, 10 figures
    Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

    In this work, we explore a genre of puzzles (“image riddles”) which involves
a set of images and a question. Answering these puzzles requires both visual
detection capabilities (including object and activity recognition) and
knowledge-based or commonsense reasoning. We compile a
    dataset of over 3k riddles where each riddle consists of 4 images and a
    groundtruth answer. The annotations are validated using crowd-sourced
    evaluation. We also define an automatic evaluation metric to track future
progress. Our task bears similarity to commonly known IQ tasks, such as
analogy solving and sequence filling, that are often used to test
intelligence.

    We develop a Probabilistic Reasoning-based approach that utilizes
    probabilistic commonsense knowledge to answer these riddles with a reasonable
    accuracy. We demonstrate the results of our approach using both automatic and
    human evaluations. Our approach achieves some promising results for these
    riddles and provides a strong baseline for future attempts. We make the entire
    dataset and related materials publicly available to the community in
    ImageRiddle Website (this http URL).


    Information Retrieval

    Approximate Near Neighbors for General Symmetric Norms

    Alexandr Andoni, Aleksandar Nikolov, Ilya Razenshteyn, Erik Waingarten
    Comments: 24 pages, no figures
    Subjects: Data Structures and Algorithms (cs.DS); Computational Geometry (cs.CG); Information Retrieval (cs.IR); Learning (cs.LG); Metric Geometry (math.MG)

    We show that every symmetric normed space admits an efficient nearest
    neighbor search data structure with doubly-logarithmic approximation.
Specifically, for every (n), (d = 2^{o(\frac{\log n}{\log\log n})}), and
every (d)-dimensional symmetric norm (\|\cdot\|), there exists a data
structure for (\mathrm{poly}(\log\log n))-approximate nearest neighbor search
over (\|\cdot\|) for (n)-point datasets achieving (n^{o(1)}) query time and
(n^{1+o(1)}) space. The main technical ingredient of the algorithm is a
low-distortion embedding of a symmetric norm into a low-dimensional iterated
product of top-(k) norms.

    We also show that our techniques cannot be extended to general norms.
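
The top-(k) norms that serve as the embedding target above are easy to state
concretely: the top-(k) norm of a vector is the sum of its k largest absolute
coordinates, so k = 1 recovers the l-infinity norm and k = d recovers the l_1
norm. The small sketch below only illustrates this norm family, not the
nearest-neighbor data structure itself.

    # Top-k norm: sum of the k largest absolute coordinates of a vector.
    import numpy as np

    def top_k_norm(x, k):
        return np.sort(np.abs(x))[::-1][:k].sum()

    x = np.array([3.0, -1.0, 4.0, 1.0, -5.0])
    for k in (1, 2, len(x)):
        print(f"top-{k} norm = {top_k_norm(x, k)}")
    # top-1 norm = 5.0 (max abs), top-5 norm = 14.0 (l_1 norm)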

    NCBO Ontology Recommender 2.0: An Enhanced Approach for Biomedical Ontology Recommendation

    Marcos Martinez-Romero, Clement Jonquet, Martin J. O'Connor, John Graybeal, Alejandro Pazos, Mark A. Musen
    Comments: 29 pages, 8 figures, 11 tables
    Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

    Biomedical researchers use ontologies to annotate their data with ontology
    terms, enabling better data integration and interoperability. However, the
    number, variety and complexity of current biomedical ontologies make it
    cumbersome for researchers to determine which ones to reuse for their specific
    needs. To overcome this problem, in 2010 the National Center for Biomedical
    Ontology (NCBO) released the Ontology Recommender, which is a service that
    receives a biomedical text corpus or a list of keywords and suggests ontologies
    appropriate for referencing the indicated terms. We developed a new version of
    the NCBO Ontology Recommender. Called Ontology Recommender 2.0, it uses a new
    recommendation approach that evaluates the relevance of an ontology to
    biomedical text data according to four criteria: (1) the extent to which the
    ontology covers the input data; (2) the acceptance of the ontology in the
    biomedical community; (3) the level of detail of the ontology classes that
    cover the input data; and (4) the specialization of the ontology to the domain
    of the input data. Our evaluation shows that the enhanced recommender provides
    higher quality suggestions than the original approach, providing better
    coverage of the input data, more detailed information about their concepts,
    increased specialization for the domain of the input data, and greater
    acceptance and use in the community. In addition, it provides users with more
    explanatory information, along with suggestions of not only individual
    ontologies but also groups of ontologies. It also can be customized to fit the
    needs of different scenarios. Ontology Recommender 2.0 combines the strengths
    of its predecessor with a range of adjustments and new features that improve
    its reliability and usefulness. Ontology Recommender 2.0 recommends over 500
    biomedical ontologies from the NCBO BioPortal platform, where it is openly
    available.
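
A recommender of this kind ultimately has to aggregate the four criteria into
a single relevance score per ontology. The sketch below shows one plausible
weighted aggregation; the weights and the toy scores are assumptions for
illustration and are not the actual scoring formula used by Ontology
Recommender 2.0.

    # Hedged sketch: combine coverage, acceptance, detail and specialization
    # scores (each in [0, 1]) into one relevance score and rank candidates.
    def relevance(scores, weights=None):
        weights = weights or {"coverage": 0.55, "acceptance": 0.15,
                              "detail": 0.15, "specialization": 0.15}
        return sum(weights[c] * scores[c] for c in weights)

    candidates = {
        "ONTOLOGY_A": {"coverage": 0.9, "acceptance": 0.4,
                       "detail": 0.7, "specialization": 0.6},
        "ONTOLOGY_B": {"coverage": 0.6, "acceptance": 0.9,
                       "detail": 0.5, "specialization": 0.3},
    }
    ranked = sorted(candidates, key=lambda o: relevance(candidates[o]),
                    reverse=True)
    print(ranked)  # best-scoring ontology first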


    Computation and Language

    Generative Deep Neural Networks for Dialogue: A Short Review

    Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, Joelle Pineau
    Comments: 6 pages, 1 figure, 3 tables; NIPS 2016 workshop on Learning Methods for Dialogue
    Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

    Researchers have recently started investigating deep neural networks for
    dialogue applications. In particular, generative sequence-to-sequence (Seq2Seq)
    models have shown promising results for unstructured tasks, such as word-level
    dialogue response generation. The hope is that such models will be able to
    leverage massive amounts of data to learn meaningful natural language
    representations and response generation strategies, while requiring a minimum
    amount of domain knowledge and hand-crafting. An important challenge is to
    develop models that can effectively incorporate dialogue context and generate
    meaningful and diverse responses. In support of this goal, we review recently
    proposed models based on generative encoder-decoder neural network
    architectures, and show that these models have better ability to incorporate
    long-term dialogue history, to model uncertainty and ambiguity in dialogue, and
    to generate responses with high-level compositional structure.

    Visualizing and Understanding Curriculum Learning for Long Short-Term Memory Networks

    Volkan Cirik, Eduard Hovy, Louis-Philippe Morency
    Subjects: Computation and Language (cs.CL); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

    Curriculum Learning emphasizes the order of training instances in a
    computational learning setup. The core hypothesis is that simpler instances
    should be learned early as building blocks to learn more complex ones. Despite
its usefulness, it is still unknown how exactly the internal representations
of models are affected by curriculum learning. In this paper, we study the effect
    of curriculum learning on Long Short-Term Memory (LSTM) networks, which have
    shown strong competency in many Natural Language Processing (NLP) problems. Our
experiments on a sentiment analysis task and on a synthetic task similar to
sequence prediction tasks in NLP show that curriculum learning has a positive
effect on the LSTM’s internal states by biasing the model towards building
constructive representations, i.e., the internal representations at previous
timesteps are used as building blocks for the final prediction. We also find
that smaller models improve significantly when they are trained with
curriculum learning.
    Lastly, we show that curriculum learning helps more when the amount of training
    data is limited.
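
In its simplest form, a curriculum is just an ordering of the training set
from easy to hard before (or while) training. The sketch below orders toy
sentiment examples by sentence length; the difficulty measure and the data
are assumptions for illustration, not the criterion studied in the paper.

    # Minimal curriculum: sort training instances from "simple" to "complex"
    # and feed them to the learner in that order.
    train_set = [
        ("a masterpiece of subtle, slow-burning storytelling", 1),
        ("great movie", 1),
        ("boring", 0),
        ("the plot never recovers from its aimless first hour", 0),
    ]

    def difficulty(example):
        text, _label = example
        return len(text.split())        # shorter sentences are "easier"

    curriculum = sorted(train_set, key=difficulty)
    for text, label in curriculum:
        print(f"{difficulty((text, label)):2d} words | label={label} | {text}")
    # A trainer would now consume the examples in this order instead of a
    # fully shuffled order.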

    Word and Document Embeddings based on Neural Network Approaches

    Siwei Lai
    Comments: PhD thesis, in Chinese, Institute of Automation, Chinese Academy of Sciences, 2016
    Subjects: Computation and Language (cs.CL)

    Data representation is a fundamental task in machine learning. The
    representation of data affects the performance of the whole machine learning
system. For a long time, data representation was done by feature
engineering, and researchers aimed at designing better features for specific
tasks. Recently, the rapid development of deep learning and representation
    learning has brought new inspiration to various domains.

    In natural language processing, the most widely used feature representation
is the Bag-of-Words model. This model suffers from data sparsity and cannot
preserve word order information. Other features, such as part-of-speech tags
or more complex syntactic features, are in most cases suited only to specific
tasks.
    This thesis focuses on word representation and document representation. We
    compare the existing systems and present our new model.

    First, for generating word embeddings, we make comprehensive comparisons
    among existing word embedding models. In terms of theory, we figure out the
    relationship between the two most important models, i.e., Skip-gram and GloVe.
    In our experiments, we analyze three key points in generating word embeddings,
    including the model construction, the training corpus and parameter design. We
    evaluate word embeddings with three types of tasks, and we argue that they
    cover the existing use of word embeddings. Through theory and practical
    experiments, we present some guidelines for how to generate a good word
    embedding.

Second, for Chinese character and word representation, we introduce the
joint training of Chinese characters and words. …

    Third, for document representation, we analyze the existing document
    representation models, including recursive NNs, recurrent NNs and convolutional
    NNs. We point out the drawbacks of these models and present our new model, the
    recurrent convolutional neural networks. …

    Variable Computation in Recurrent Neural Networks

    Yacine Jernite, Edouard Grave, Armand Joulin, Tomas Mikolov
    Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Learning (cs.LG)

    Recurrent neural networks (RNNs) have been used extensively and with
    increasing success to model various types of sequential data. Much of this
    progress has been achieved through devising recurrent units and architectures
    with the flexibility to capture complex statistics in the data, such as long
    range dependency or localized attention phenomena. However, while many
    sequential data (such as video, speech or language) can have highly variable
    information flow, most recurrent models still consume input features at a
    constant rate and perform a constant number of computations per time step,
    which can be detrimental to both speed and model capacity. In this paper, we
    explore a modification to existing recurrent units which allows them to learn
    to vary the amount of computation they perform at each step, without prior
    knowledge of the sequence’s time structure. We show experimentally that not
    only is our model more computationally efficient, it also leads to better
    performance overall on our evaluation tasks.


    Distributed, Parallel, and Cluster Computing

    Parallelizing Word2Vec in Multi-Core and Many-Core Architectures

    Shihao Ji, Nadathur Satish, Sheng Li, Pradeep Dubey
    Comments: NIPS Workshop on Efficient Methods for Deep Neural Networks (2016)
    Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)

    Word2vec is a widely used algorithm for extracting low-dimensional vector
    representations of words. State-of-the-art algorithms including those by
    Mikolov et al. have been parallelized for multi-core CPU architectures, but are
    based on vector-vector operations with “Hogwild” updates that are
    memory-bandwidth intensive and do not efficiently use computational resources.
    In this paper, we propose “HogBatch” by improving reuse of various data
    structures in the algorithm through the use of minibatching and negative sample
    sharing, hence allowing us to express the problem using matrix multiply
    operations. We also explore different techniques to distribute word2vec
    computation across nodes in a compute cluster, and demonstrate good strong
    scalability up to 32 nodes. The new algorithm is particularly suitable for
    modern multi-core/many-core architectures, especially Intel’s latest Knights
Landing processors, and allows us to scale the computation nearly linearly
across cores and nodes and process hundreds of millions of words per second,
which, to the best of our knowledge, makes it the fastest word2vec
implementation.
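
The core of the minibatching-plus-negative-sample-sharing idea is that many
scattered vector-vector dot products collapse into a single matrix-matrix
multiply. The numpy sketch below shows that reformulation for one minibatch;
the dimensions and random embeddings are assumptions for illustration only.

    # With negative-sample sharing, scoring a minibatch of center words
    # against a shared set of negatives becomes one GEMM instead of many
    # scattered dot products.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab, dim = 10000, 128
    W_in = rng.normal(scale=0.1, size=(vocab, dim))    # input (center) vectors
    W_out = rng.normal(scale=0.1, size=(vocab, dim))   # output (context) vectors

    batch_centers = rng.integers(0, vocab, size=64)      # minibatch of centers
    shared_negatives = rng.integers(0, vocab, size=16)   # negatives shared by the batch

    scores = W_in[batch_centers] @ W_out[shared_negatives].T   # shape (64, 16)
    print(scores.shape)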

    Polynomial self-stabilizing algorithm and proof for a 2/3-approximation of a maximum matching

    Johanne Cohen, Khaled Maâmra, George Manoussakis, Laurence Pilard
    Comments: 16 pages, 6 figures
    Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

We present the first polynomial self-stabilizing algorithm for finding a
2/3-approximation of a maximum matching in a general graph. The previous best
known algorithm was presented by Manne et al. [ManneMPT11] and has a
sub-exponential time complexity under the distributed adversarial daemon
[Coor]. Our new algorithm is an adaptation of the Manne et al. algorithm and
works under the same daemon, but with a time complexity of (O(n^3)) moves.
Moreover, our algorithm only needs one more boolean variable than the
previous one; thus, as in the Manne et al. algorithm, it only requires a
constant amount of memory space (three identifiers and two booleans per
node).

    GaDei: On Scale-up Training As A Service For Deep Learning

    Wei Zhang, Minwei Feng, Yunhui Zheng, Yufei Ren, Yandong Wang, Ji Liu, Peng Liu, Bing Xiang, Li Zhang, Bowen Zhou
    Subjects: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Learning (cs.LG)

    Deep learning (DL) training-as-a-service (TaaS) is an important emerging
    industrial workload. The unique challenge of TaaS is that it must satisfy a
wide range of customers who have neither the experience nor the resources to tune DL
    hyper-parameters, and meticulous tuning for each user’s dataset is
    prohibitively expensive. Therefore, TaaS hyper-parameters must be fixed with
    values that are applicable to all users. IBM Watson Natural Language Classifier
    (NLC) service, the most popular IBM cognitive service used by thousands of
    enterprise-level clients around the globe, is a typical TaaS service. By
    evaluating the NLC workloads, we show that only the conservative
    hyper-parameter setup (e.g., small mini-batch size and small learning rate) can
    guarantee acceptable model accuracy for a wide range of customers. We further
    justify theoretically why such a setup guarantees better model convergence in
    general. Unfortunately, the small mini-batch size causes a high volume of
    communication traffic in a parameter-server based system. We characterize the
    high communication bandwidth requirement of TaaS using representative
    industrial deep learning workloads and demonstrate that none of the
    state-of-the-art scale-up or scale-out solutions can satisfy such a
    requirement. We then present GaDei, an optimized shared-memory based scale-up
    parameter server design. We prove that the designed protocol is deadlock-free
    and it processes each gradient exactly once. Our implementation is evaluated on
    both commercial benchmarks and public benchmarks to demonstrate that it
    significantly outperforms the state-of-the-art parameter-server based
implementation while maintaining the required accuracy, and that it reaches
nearly the best possible runtime performance, constrained only by hardware
limitations. Furthermore, to the best of our knowledge, GaDei is the
    only scale-up DL system that provides fault-tolerance.

    Fast low-level pattern matching algorithm

    Janja Paliska Soldo, Ana Sovic Krzic, and Damir Sersic
    Comments: 14 pages, 7 tables This work has been fully supported by Croatian Science Foundation under the project UIP-11-2013-7353 Algorithms for Genome Sequence Analysis
    Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Genomics (q-bio.GN)

This paper focuses on pattern matching in DNA sequences. It was inspired by
a previously reported method that encodes both the pattern and the sequence
using prime numbers. Although fast, that method is limited to rather small
pattern lengths due to computing precision problems. Our approach
successfully deals with large patterns thanks to an implementation based on
modular arithmetic. To obtain results very fast, the code was adapted for
multithreaded and parallel implementations. The method is reduced to
assembly-level instructions, and the final result shows significant time and
memory savings compared to the reference algorithm.
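
In the same spirit as the abstract (numeric encoding of DNA plus modular
arithmetic to avoid precision problems with long patterns), the sketch below
implements a generic Rabin-Karp-style rolling-hash matcher over the DNA
alphabet. It illustrates the modular-arithmetic idea only and is not the
prime-number encoding of the referenced method.

    # Rolling-hash DNA pattern matching with modular arithmetic: long
    # patterns never overflow because values are kept modulo a large prime.
    MOD = (1 << 61) - 1
    BASE = 4
    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def find_all(sequence, pattern):
        m = len(pattern)
        if m == 0 or m > len(sequence):
            return []
        p_val = 0
        s_val = 0
        for i in range(m):
            p_val = (p_val * BASE + CODE[pattern[i]]) % MOD
            s_val = (s_val * BASE + CODE[sequence[i]]) % MOD
        high = pow(BASE, m - 1, MOD)       # weight of the outgoing character
        hits = []
        for i in range(len(sequence) - m + 1):
            # verify on hash match to rule out collisions
            if s_val == p_val and sequence[i:i + m] == pattern:
                hits.append(i)
            if i + m < len(sequence):
                s_val = ((s_val - CODE[sequence[i]] * high) * BASE
                         + CODE[sequence[i + m]]) % MOD
        return hits

    print(find_all("ACGTACGTTACGT", "ACGT"))  # [0, 4, 9]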


    Learning

    Faster variational inducing input Gaussian process classification

    Pavel Izmailov, Dmitry Kropotov
    Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

    Gaussian processes (GP) provide a prior over functions and allow finding
    complex regularities in data. Gaussian processes are successfully used for
    classification/regression problems and dimensionality reduction. In this work
    we consider the classification problem only. The complexity of standard methods
    for GP-classification scales cubically with the size of the training dataset.
    This complexity makes them inapplicable to big data problems. Therefore, a
    variety of methods were introduced to overcome this limitation. In the paper we
    focus on methods based on so called inducing inputs. This approach is based on
    variational inference and proposes a particular lower bound for marginal
    likelihood (evidence). This bound is then maximized w.r.t. parameters of kernel
    function of the Gaussian process, thus fitting the model to data. The
    computational complexity of this method is (O(nm^2)), where (m) is the number
    of inducing inputs used by the model and is assumed to be substantially smaller
than the size of the dataset (n). Recently, a new evidence lower bound for
the GP-classification problem was introduced. It allows using stochastic
optimization, which makes it suitable for big data problems. However, the new
lower bound depends on (O(m^2)) variational parameters, which makes
optimization challenging when m is large. In this work we develop a new
approach for training inducing input GP models for classification problems.
We use a quadratic approximation of several terms in the aforementioned
evidence lower bound, obtaining analytical expressions for the optimal values
of most of the parameters, thus substantially reducing the dimension of the
optimization space. In our experiments we achieve results as good as or
better than those of the existing method. Moreover, our method doesn’t
require the user to manually set the learning rate, making it more practical
than the existing method.

    Robust and Scalable Column/Row Sampling from Corrupted Big Data

    Mostafa Rahmani, George Atia
    Subjects: Learning (cs.LG); Numerical Analysis (cs.NA); Applications (stat.AP); Machine Learning (stat.ML)

Conventional sampling techniques fall short of drawing descriptive sketches
of the data when the data is grossly corrupted, as such corruptions break the
low-rank structure required for them to perform satisfactorily. In this
paper, we present new sampling algorithms which can locate the informative
columns in the presence of severe data corruption. In addition, we develop new scalable
    randomized designs of the proposed algorithms. The proposed approach is
    simultaneously robust to sparse corruption and outliers and substantially
    outperforms the state-of-the-art robust sampling algorithms as demonstrated by
    experiments conducted using both real and synthetic data.

    A Characterization of Prediction Errors

    Christopher Meek
    Subjects: Learning (cs.LG)

    Understanding prediction errors and determining how to fix them is critical
    to building effective predictive systems. In this paper, we delineate four
    types of prediction errors and demonstrate that these four types characterize
    all prediction errors. In addition, we describe potential remedies and tools
    that can be used to reduce the uncertainty when trying to determine the source
of a prediction error and when trying to take action to remove a prediction
error.

    Associative Memories to Accelerate Approximate Nearest Neighbor Search

    Vincent Gripon, Matthias Löwe, Franck Vermet
    Comments: 25 pages, 12 figures
    Subjects: Learning (cs.LG); Probability (math.PR)

    Nearest neighbor search is a very active field in machine learning for it
    appears in many application cases, including classification and object
retrieval. In its canonical version, the complexity of the search is linear
in both the dimension and the cardinality of the collection of vectors the
search is performed in. Recently, many works have focused on reducing the
dimension of the vectors using quantization techniques or hashing, while
providing an approximate result. In this paper we focus instead on tackling
the cardinality of the collection of vectors. Namely, we introduce a technique that partitions
    the collection of vectors and stores each part in its own associative memory.
When a query vector is given to the system, the associative memories are
polled to identify which one contains the closest match. Then an exhaustive
search is conducted only on the part of the vectors stored in the selected
associative memory. We study the effectiveness of the system when the
messages to store are generated from i.i.d. uniform (\pm 1) random variables
or from 0-1 sparse i.i.d. random variables. We also conduct experiments on
both synthetic and real data and show that it is possible to achieve
interesting trade-offs between complexity and accuracy.
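
The two-stage search described above (poll a cheap per-part model, then
search exhaustively inside the selected part) can be mimicked with ordinary
clustering. In the hedged sketch below, k-means centroids play the role of
the associative memories; this is only a stand-in for the memory model
studied in the paper.

    # Partition the collection, pick the most promising part for a query,
    # then run an exhaustive search inside that part only.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    data = rng.normal(size=(5000, 32))

    parts = 16
    km = KMeans(n_clusters=parts, n_init=4, random_state=0).fit(data)

    def search(query):
        part = int(km.predict(query[None, :])[0])       # poll: pick one part
        members = np.where(km.labels_ == part)[0]       # vectors stored there
        dists = np.linalg.norm(data[members] - query, axis=1)
        return members[int(np.argmin(dists))]           # exhaustive search inside the part

    query = data[123] + 0.01 * rng.normal(size=32)
    print(search(query))  # likely 123, found while scanning ~1/16 of the data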

    Approximate Near Neighbors for General Symmetric Norms

    Alexandr Andoni, Aleksandar Nikolov, Ilya Razenshteyn, Erik Waingarten
    Comments: 24 pages, no figures
    Subjects: Data Structures and Algorithms (cs.DS); Computational Geometry (cs.CG); Information Retrieval (cs.IR); Learning (cs.LG); Metric Geometry (math.MG)

    We show that every symmetric normed space admits an efficient nearest
    neighbor search data structure with doubly-logarithmic approximation.
Specifically, for every (n), (d = 2^{o(\frac{\log n}{\log\log n})}), and
every (d)-dimensional symmetric norm (\|\cdot\|), there exists a data
structure for (\mathrm{poly}(\log\log n))-approximate nearest neighbor search
over (\|\cdot\|) for (n)-point datasets achieving (n^{o(1)}) query time and
(n^{1+o(1)}) space. The main technical ingredient of the algorithm is a
low-distortion embedding of a symmetric norm into a low-dimensional iterated
product of top-(k) norms.

    We also show that our techniques cannot be extended to general norms.

    Structural Causal Models: Cycles, Marginalizations, Exogenous Reparametrizations and Reductions

    Stephan Bongers, Jonas Peters, Bernhard Schölkopf, Joris M. Mooij
    Comments: Will probably be submitted to The Annals of Statistics
    Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Learning (cs.LG)

    Structural causal models (SCMs), also known as non-parametric structural
    equation models (NP-SEMs), are widely used for causal modeling purposes. In
    this paper, we give a rigorous treatment of structural causal models, dealing
    with measure-theoretic complications that arise in the presence of cyclic
    relations. The central question studied in this paper is: given a (possibly
    cyclic) SCM defined on a large system (consisting of observable endogenous and
    latent exogenous variables), can we “project it down” to an SCM that describes
    a subsystem (consisting of a subset of the observed endogenous variables and
    possibly different latent exogenous variables) in order to obtain a more
    parsimonious but equivalent representation of the subsystem? We define a
    marginalization operation that effectively removes a subset of the endogenous
    variables from the model, and a class of mappings, exogenous
    reparameterizations, that can be used to reduce the space of exogenous
    variables. We show that both operations preserve the causal semantics of the
    model and that under mild conditions they can lead to a significant reduction
    of the model complexity, at least in terms of the number of variables in the
    model. We argue that for the task of estimating an SCM from data, the existence
    of “smooth” reductions would be desirable. We provide several conditions under
    which the existence of such reductions can be shown, but also provide a
    counterexample that shows that such reductions do not exist in general. The
    latter result implies that existing approaches to estimate linear or Markovian
    SCMs from data cannot be extended to general SCMs.

    GaDei: On Scale-up Training As A Service For Deep Learning

    Wei Zhang, Minwei Feng, Yunhui Zheng, Yufei Ren, Yandong Wang, Ji Liu, Peng Liu, Bing Xiang, Li Zhang, Bowen Zhou
    Subjects: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Learning (cs.LG)

    Deep learning (DL) training-as-a-service (TaaS) is an important emerging
    industrial workload. The unique challenge of TaaS is that it must satisfy a
wide range of customers who have neither the experience nor the resources to tune DL
    hyper-parameters, and meticulous tuning for each user’s dataset is
    prohibitively expensive. Therefore, TaaS hyper-parameters must be fixed with
    values that are applicable to all users. IBM Watson Natural Language Classifier
    (NLC) service, the most popular IBM cognitive service used by thousands of
    enterprise-level clients around the globe, is a typical TaaS service. By
    evaluating the NLC workloads, we show that only the conservative
    hyper-parameter setup (e.g., small mini-batch size and small learning rate) can
    guarantee acceptable model accuracy for a wide range of customers. We further
    justify theoretically why such a setup guarantees better model convergence in
    general. Unfortunately, the small mini-batch size causes a high volume of
    communication traffic in a parameter-server based system. We characterize the
    high communication bandwidth requirement of TaaS using representative
    industrial deep learning workloads and demonstrate that none of the
    state-of-the-art scale-up or scale-out solutions can satisfy such a
    requirement. We then present GaDei, an optimized shared-memory based scale-up
    parameter server design. We prove that the designed protocol is deadlock-free
    and it processes each gradient exactly once. Our implementation is evaluated on
    both commercial benchmarks and public benchmarks to demonstrate that it
    significantly outperforms the state-of-the-art parameter-server based
implementation while maintaining the required accuracy, and that it reaches
nearly the best possible runtime performance, constrained only by hardware
limitations. Furthermore, to the best of our knowledge, GaDei is the
    only scale-up DL system that provides fault-tolerance.

    Visualizing and Understanding Curriculum Learning for Long Short-Term Memory Networks

    Volkan Cirik, Eduard Hovy, Louis-Philippe Morency
    Subjects: Computation and Language (cs.CL); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

    Curriculum Learning emphasizes the order of training instances in a
    computational learning setup. The core hypothesis is that simpler instances
    should be learned early as building blocks to learn more complex ones. Despite
its usefulness, it is still unknown how exactly the internal representations
of models are affected by curriculum learning. In this paper, we study the effect
    of curriculum learning on Long Short-Term Memory (LSTM) networks, which have
    shown strong competency in many Natural Language Processing (NLP) problems. Our
experiments on a sentiment analysis task and on a synthetic task similar to
sequence prediction tasks in NLP show that curriculum learning has a positive
effect on the LSTM’s internal states by biasing the model towards building
constructive representations, i.e., the internal representations at previous
timesteps are used as building blocks for the final prediction. We also find
that smaller models improve significantly when they are trained with
curriculum learning.
    Lastly, we show that curriculum learning helps more when the amount of training
    data is limited.

    Variable Computation in Recurrent Neural Networks

    Yacine Jernite, Edouard Grave, Armand Joulin, Tomas Mikolov
    Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Learning (cs.LG)

    Recurrent neural networks (RNNs) have been used extensively and with
    increasing success to model various types of sequential data. Much of this
    progress has been achieved through devising recurrent units and architectures
    with the flexibility to capture complex statistics in the data, such as long
    range dependency or localized attention phenomena. However, while many
    sequential data (such as video, speech or language) can have highly variable
    information flow, most recurrent models still consume input features at a
    constant rate and perform a constant number of computations per time step,
    which can be detrimental to both speed and model capacity. In this paper, we
    explore a modification to existing recurrent units which allows them to learn
    to vary the amount of computation they perform at each step, without prior
    knowledge of the sequence’s time structure. We show experimentally that not
    only is our model more computationally efficient, it also leads to better
    performance overall on our evaluation tasks.

    Learning Interpretability for Visualizations using Adapted Cox Models through a User Experiment

    Adrien Bibal, Benoit Frénay
    Comments: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems
    Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Learning (cs.LG)

    In order to be useful, visualizations need to be interpretable. This paper
    uses a user-based approach to combine and assess quality measures in order to
    better model user preferences. Results show that cluster separability measures
    are outperformed by a neighborhood conservation measure, even though the former
    are usually considered as intuitively representative of user motives. Moreover,
    combining measures, as opposed to using a single measure, further improves
    prediction performances.

    Compacting Neural Network Classifiers via Dropout Training

    Yotaro Kubo, George Tucker, Simon Wiesler
    Comments: Accepted to NIPS Workshop on Efficient Methods for Deep Neural Networks
    Subjects: Machine Learning (stat.ML); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

    We introduce dropout compaction, a novel method for training feed-forward
    neural networks which realizes the performance gains of training a large model
    with dropout regularization, yet extracts a compact neural network for run-time
    efficiency. In the proposed method, we introduce a sparsity-inducing prior on
    the per unit dropout retention probability so that the optimizer can
    effectively prune hidden units during training. By changing the prior
    hyperparameters, we can control the size of the resulting network. We performed
    a systematic comparison of dropout compaction and competing methods on several
    real-world speech recognition tasks and found that dropout compaction achieved
    comparable accuracy with fewer than 50% of the hidden units, translating to a
    2.5x speedup in run-time.
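
The pruning decision itself is simple once training has driven many per-unit
retention probabilities toward zero: units whose retention probability ends
up below a threshold are dropped from the compact network. The sketch below
shows that final step on made-up retention logits; the values, the L1-style
penalty and the 0.5 threshold are assumptions for illustration, not the
paper's prior or training procedure.

    # Hedged sketch of the final pruning step in dropout compaction.
    import numpy as np

    rng = np.random.default_rng(0)
    # Simulated post-training logits: most units were driven "off" by the prior.
    retention_logits = np.concatenate([rng.normal(-4.0, 0.5, size=400),
                                       rng.normal(3.0, 0.5, size=112)])
    retention = 1.0 / (1.0 + np.exp(-retention_logits))   # per-unit retention probability

    sparsity_penalty = retention.sum()    # the kind of term added to the training loss
    keep = retention > 0.5                # units that survive into the compact network
    print(f"penalty={sparsity_penalty:.1f}, kept {keep.sum()} of {keep.size} hidden units")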

    A Generalized Stochastic Variational Bayesian Hyperparameter Learning Framework for Sparse Spectrum Gaussian Process Regression

    Quang Minh Hoang, Trong Nghia Hoang, Kian Hsiang Low
    Comments: 31st AAAI Conference on Artificial Intelligence (AAAI 2017), Extended version with proofs, 11 pages
    Subjects: Machine Learning (stat.ML); Learning (cs.LG)

    While much research effort has been dedicated to scaling up sparse Gaussian
    process (GP) models based on inducing variables for big data, little attention
    is afforded to the other less explored class of low-rank GP approximations that
    exploit the sparse spectral representation of a GP kernel. This paper presents
    such an effort to advance the state of the art of sparse spectrum GP models to
    achieve competitive predictive performance for massive datasets. Our
    generalized framework of stochastic variational Bayesian sparse spectrum GP
    (sVBSSGP) models addresses their shortcomings by adopting a Bayesian treatment
    of the spectral frequencies to avoid overfitting, modeling these frequencies
    jointly in its variational distribution to enable their interaction a
    posteriori, and exploiting local data for boosting the predictive performance.
However, such structural improvements result in a variational lower bound
that is intractable to optimize. To resolve this, we exploit a variational
    parameterization trick to make it amenable to stochastic optimization.
    Interestingly, the resulting stochastic gradient has a linearly decomposable
    structure that can be exploited to refine our stochastic optimization method to
    incur constant time per iteration while preserving its property of being an
    unbiased estimator of the exact gradient of the variational lower bound.
    Empirical evaluation on real-world datasets shows that sVBSSGP outperforms
    state-of-the-art stochastic implementations of sparse GP models.

    Monte Carlo Connection Prover

    Michael Färber, Cezary Kaliszyk, Josef Urban
    Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Learning (cs.LG)

    Monte Carlo Tree Search (MCTS) is a technique to guide search in a large
    decision space by taking random samples and evaluating their outcome. In this
    work, we study MCTS methods in the context of the connection calculus and
    implement them on top of the leanCoP prover. This includes proposing useful
    proof-state evaluation heuristics that are learned from previous proofs, and
    proposing and automatically improving suitable MCTS strategies in this context.
    The system is trained and evaluated on a large suite of related problems coming
from the Mizar proof assistant, showing that it is capable of finding new and
    different proofs. To our knowledge, this is the first time MCTS has been
    applied to theorem proving.

    Analysis of a Design Pattern for Teaching with Features and Labels

    Christopher Meek, Patrice Simard, Xiaojin Zhu
    Comments: Also available at this https URL
    Subjects: Artificial Intelligence (cs.AI); Learning (cs.LG)

    We study the task of teaching a machine to classify objects using features
    and labels. We introduce the Error-Driven-Featuring design pattern for teaching
    using features and labels in which a teacher prefers to introduce features only
    if they are needed. We analyze the potential risks and benefits of this
    teaching pattern through the use of teaching protocols, illustrative examples,
    and by providing bounds on the effort required for an optimal machine teacher
using a linear learning algorithm, the most commonly used type of learner in
    interactive machine learning systems. Our analysis provides a deeper
    understanding of potential trade-offs of using different learning algorithms
    and between the effort required for featuring (creating new features) and
    labeling (providing labels for objects).

    Increasing the Interpretability of Recurrent Neural Networks Using Hidden Markov Models

    Viktoriya Krakovna, Finale Doshi-Velez
    Comments: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems. arXiv admin note: substantial text overlap with arXiv:1606.05320
    Subjects: Machine Learning (stat.ML); Learning (cs.LG)

    As deep neural networks continue to revolutionize various application
    domains, there is increasing interest in making these powerful models more
    understandable and interpretable, and narrowing down the causes of good and bad
    predictions. We focus on recurrent neural networks, state of the art models in
    speech recognition and translation. Our approach to increasing interpretability
    is by combining a long short-term memory (LSTM) model with a hidden Markov
    model (HMM), a simpler and more transparent model. We add the HMM state
    probabilities to the output layer of the LSTM, and then train the HMM and LSTM
    either sequentially or jointly. The LSTM can make use of the information from
    the HMM, and fill in the gaps when the HMM is not performing well. A small
    hybrid model usually performs better than a standalone LSTM of the same size,
    especially on smaller data sets. We test the algorithms on text data and
    medical time series data, and find that the LSTM and HMM learn complementary
    information about the features in the text.

    "Influence Sketching": Finding Influential Samples In Large-Scale Regressions

    Mike Wojnowicz, Ben Cruz, Xuan Zhao, Brian Wallace, Matt Wolff, Jay Luan, Caleb Crable
    Subjects: Machine Learning (stat.ML); Learning (cs.LG)

    There is an especially strong need in modern large-scale data analysis to
    prioritize samples for manual inspection. For example, the inspection could
    target important mislabeled samples or key vulnerabilities exploitable by an
    adversarial attack. In order to solve the “needle in the haystack” problem of
    which samples to inspect, we develop a new scalable version of Cook’s distance,
    a classical statistical technique for identifying samples which unusually
    strongly impact the fit of a regression model (and its downstream predictions).
    In order to scale this technique up to very large and high-dimensional
    datasets, we introduce a new algorithm which we call “influence sketching.”
    Influence sketching embeds random projections within the influence computation;
    in particular, the influence score is calculated using the randomly projected
    pseudo-dataset from the post-convergence General Linear Model (GLM). We
    validate that influence sketching can reliably and successfully discover
    influential samples by applying the technique to a malware detection dataset of
    over 2 million executable files, each represented with almost 100,000 features.
    For example, we find that randomly deleting approximately 10% of training
    samples reduces predictive accuracy only slightly from 99.47% to 99.45%,
    whereas deleting the same number of samples with high influence sketch scores
    reduces predictive accuracy all the way down to 90.24%. Moreover, we find that
    influential samples are especially likely to be mislabeled. In the case study,
    we manually inspect the most influential samples, and find that influence
    sketching pointed us to new, previously unidentified pieces of malware.
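
A rough picture of the mechanics: Cook's distance needs leverages and
residuals of a regression fit, which is expensive when the design matrix is
very wide, so the influence scores are instead computed from a randomly
projected copy of the features. The sketch below does this for plain least
squares on synthetic data; the projection width, the planted outlier and the
least-squares setting (rather than the paper's post-convergence GLM
pseudo-data) are assumptions for illustration.

    # Sketched Cook's distance: leverages and residuals are computed from a
    # random projection of the (very wide) feature matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, k = 1000, 5000, 64                  # many features, small sketch width
    X = rng.normal(size=(n, p))
    beta = np.zeros(p); beta[:10] = 1.0
    y = X @ beta + rng.normal(size=n)
    y[7] += 25.0                              # plant one grossly mislabeled sample

    R = rng.normal(size=(p, k)) / np.sqrt(k)  # random projection
    Xs = X @ R                                # sketched design matrix, n x k

    G = np.linalg.inv(Xs.T @ Xs)
    h = np.einsum("ij,jk,ik->i", Xs, G, Xs)            # leverages (hat-matrix diagonal)
    resid = y - Xs @ (G @ (Xs.T @ y))                  # residuals of the sketched fit
    s2 = (resid**2).sum() / (n - k)
    cooks = (resid**2 / (k * s2)) * h / (1.0 - h) ** 2

    print("most influential sample:", int(np.argmax(cooks)))  # expected to flag index 7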


    Information Theory

    Optimal Key Consensus in Presence of Noise

    Zhengzhong Jin, Yunlei Zhao
    Subjects: Information Theory (cs.IT); Cryptography and Security (cs.CR)

    In this work, we introduce and formalize a new primitive, referred to as key
    consensus (KC), and its asymmetric variant AKC, which are for reaching
    consensus from close values. Some inherent constraints on the relationship
    among bandwidth, correctness and consensus range, for any KC and AKC schemes,
    are then revealed, which are particularly instrumental in choosing and
    evaluating parameters towards different optimization or balance goals. KC and
    AKC are fundamental to lattice based cryptography, in the sense that a list of
    cryptographic primitives based on LWE or Ring-LWE (including key exchange,
    public key encryption, oblivious transfer, and more) can be modularly
constructed from them. As a conceptual contribution, this greatly simplifies
the design and analysis of these cryptosystems in the future.

    Highly practical KC and AKC schemes are then designed and analyzed, within a
    generalized framework and with tight constraints on parameters that are almost
    the same as the inherent ones discovered. The structure generalization and
    constraint tightness allow us to choose and evaluate parameters towards optimal
    balance among security, computational cost, bandwidth, consensus range, and
    error rate. When applied to LWE or RLWE based cryptosystems, generally
    speaking, by carefully choosing parameters they can result in (to our
    knowledge) state-of-the-art practical schemes of key exchange, CPA-secure
    public key encryption, and oblivious transfer.

    Multi-User Millimeter Wave MIMO with Full-Dimensional Lens Antenna Array

    Yong Zeng, Lu Yang, Rui Zhang
    Comments: 14 pages, 9 figures, submitted for possible journal publication
    Subjects: Information Theory (cs.IT)

    Millimeter wave (mmWave) communication by utilizing lens antenna arrays is a
    promising technique for realizing cost-effective 5G wireless systems with large
    MIMO (multiple-input multiple-output) but only limited radio frequency (RF)
    chains. This paper studies an uplink multi-user mmWave single-sided lens MIMO
    system, where only the base station (BS) is equipped with a full-dimensional
    (FD) lens antenna array with both elevation and azimuth angle resolution
    capabilities, and each mobile station (MS) employs the conventional uniform
    planar array (UPA) without the lens. By exploiting the angle-dependent energy
    focusing property of the lens antenna array at the BS as well as the multi-path
    sparsity of mmWave channels, we propose a low-complexity path division multiple
    access (PDMA) scheme, which enables virtually interference-free multi-user
communications when the angles of arrival (AoAs) of all MS multi-path signals
    are sufficiently separable at the BS. To this end, a new technique called path
    delay compensation is proposed at the BS to effectively transform the
    multi-user frequency-selective MIMO channels to parallel frequency-flat
    small-size MIMO channels for different MSs, for each of which the
low-complexity single-carrier (SC) transmission is applied. For general
    scenarios with insufficient AoA separations, analog beamforming at the MSs and
    digital combining at the BS are jointly designed to maximize the achievable
    sum-rate of the MSs based on their effective MIMO channels resulting from path
    delay compensation. In addition, we propose a new and efficient channel
    estimation scheme tailored for PDMA, which requires negligible training
    overhead in practical mmWave systems and yet leads to comparable performance as
    that based on perfect channel state information (CSI).

    Wireless-powered relaying with finite block-length codes

    Mahdi Haghifam, Behrooz Makki, Masoumeh Nasiri-Kenari, Tommy Svensson, Michele Zorzi
    Subjects: Information Theory (cs.IT)

    This paper studies the outage probability and the throughput of
    amplify-and-forward relay networks with wireless information and energy
    transfer. We use some recent results on finite block-length codes to analyze
    the system performance in the cases with short codewords. Specifically, the
    time switching relaying and the power splitting relaying protocols are
    considered for energy and information transfer. We derive tight approximations
    for the outage probability/throughput. Then, we analyze the outage probability
    in asymptotically high signal-to-noise ratios. Finally, we use numerical
    results to confirm the accuracy of our analysis and to evaluate the system
    performance in different scenarios. Our results indicate that, in
    delay-constrained scenarios, the codeword length affects the outage
    probability/throughput of the joint energy and information transfer systems
    considerably.

    Robust Secrecy Rate Beamforming for Multi-cell Networks with TS-based Information and Energy Harvesting

    Hoang Duong Tuan, Ali Arshad Nasir, Trung Q. Duong
    Subjects: Information Theory (cs.IT)

    Wireless energy harvesting networks are more vulnerable to eavesdropping due
    to higher received power levels. In this paper, we consider a dense multicell
    network in the presence of multi-antenna eavesdroppers and consider
    multi-objective beamforming where the energy beamformers are also used to jam
    the eavesdropper. Thus, we study transmit time-switching (TS) mode to realize
    wireless information and power transfer, where energy and information signal
    are transmitted separately in time by the BS. We study the joint design of
    transmit energy and information beamformers at the BSs and the transmit TS
    ratio. In the presence of imperfect channel estimation and multi-antenna
    eavesdroppers, our objective is to maximize the worst-case user secure rate
under BS transmit power and UE minimum harvested energy constraints. We
solve this highly non-convex problem by proposing a new robust path-following
algorithm, which involves one simple convex quadratic program (QP) in each
iteration. Numerical results confirm that the performance of the proposed
algorithm is close to that of the perfect channel knowledge case. Moreover,
the proposed algorithm not only outperforms the existing algorithm that
models a power-splitting (PS) based receiver, but the proposed transmit-TS
based model is also considerably simpler to implement than the PS-based
model.

    Compressed Sensing from Phaseless Gaussian Measurements via Linear Programming in the Natural Parameter Space

    Paul Hand, Vladislav Voroninski
    Subjects: Information Theory (cs.IT); Optimization and Control (math.OC); Probability (math.PR)

    We consider faithfully combining phase retrieval with classical compressed
    sensing. Inspired by the recent novel formulation for phase retrieval called
    PhaseMax, we present and analyze SparsePhaseMax, a linear program for phaseless
    compressed sensing in the natural parameter space. We establish that when
    provided with an initialization that correlates with a k-sparse n-vector,
    SparsePhaseMax recovers this vector up to global sign with high probability
    from O(k log (n/k)) magnitude measurements against k i.i.d. Gaussian random
    vectors. Our proof of this fact exploits a curious newfound connection between
    phaseless and 1-bit compressed sensing. This is the first result to establish
    bootstrapped compressed sensing from phaseless Gaussian measurements under
    optimal sample complexity.

    Hardware-Based Linear Program Decoding with the Alternating Direction Method of Multipliers

    Mitchell Wasson, Mario Milicevic, Stark C. Draper, Glenn Gulak
    Subjects: Information Theory (cs.IT); Optimization and Control (math.OC)

    We present a hardware-based implementation of Linear Program (LP) decoding
    for binary linear codes. LP decoding frames error-correction as an optimization
    problem. In contrast, variants of Belief Propagation (BP) decoding frame
    error-correction as a problem of graphical inference. LP decoding has several
    advantages over BP-based methods, including convergence guarantees and better
    error-rate performance in high-reliability channels. The latter makes LP
    decoding attractive for optical transport and storage applications. However, LP
    decoding, when implemented with general solvers, does not scale to large
    blocklengths and is not suitable for a parallelized implementation in hardware.
    It has been recently shown that the Alternating Direction Method of Multipliers
    (ADMM) can be applied to decompose the LP decoding problem. The result is a
    message-passing algorithm with a structure very similar to BP. We present new
    intuition for this decoding algorithm as well as for its major computational
    primitive: projection onto the parity polytope. Furthermore, we present results
    for a fixed-point Verilog implementation of ADMM-LP decoding. This
    implementation targets a Field-Programmable Gate Array (FPGA) platform to
    evaluate error-rate performance and estimate resource usage. We show that Frame
    Error Rate (FER) performance well within 0.5dB of double-precision
    implementations is possible with 10-bit messages. Finally, we outline a number
    of research opportunities that should be explored en-route to the realization
    of an Application Specific Integrated Circuit (ASIC) implementation capable of
    gigabit per second throughput.

    A Machine Learning Approach to Model the Received Signal in Molecular Communications

    H. Birkan Yilmaz, Changmin Lee, Yae Jee Cho, Chan-Byoung Chae
    Subjects: Emerging Technologies (cs.ET); Information Theory (cs.IT)

A molecular communication channel is determined by the received signal.
Received signal models form the basis for studies on modulation, receiver
design, capacity, and coding. Therefore, it is crucial to model analytically
the number of molecules received up to time (t). Modeling the diffusion-based
molecular communication channel
    with the first-hitting process is an open issue for a spherical transmitter. In
    this paper, we utilize the artificial neural networks technique to model the
    received signal for a spherical transmitter and a perfectly absorbing receiver
    (i.e., first hitting process). The proposed technique may be utilized in other
    studies that assume a spherical transmitter instead of a point transmitter.
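
The learning task described above boils down to regressing the fraction of
absorbed molecules onto time. The sketch below fits a small scikit-learn MLP
to such a curve; because the spherical-transmitter case has no closed form,
the standard point-transmitter first-hitting expression is used here only as
a stand-in data generator, and all parameter values are assumptions for
illustration.

    # Fit a small neural network to (time -> fraction of absorbed molecules)
    # samples, as a stand-in for modeling the received signal.
    import numpy as np
    from scipy.special import erfc
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    D, r_r, d = 79.4, 5.0, 10.0              # diffusion coeff, receiver radius, distance (assumed units)
    t = rng.uniform(0.01, 2.0, size=2000)    # observation times
    frac = (r_r / d) * erfc((d - r_r) / (2.0 * np.sqrt(D * t)))
    frac_noisy = frac + rng.normal(scale=0.005, size=t.shape)   # simulation-noise stand-in

    model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000, random_state=0)
    model.fit(t.reshape(-1, 1), frac_noisy)

    t_test = np.array([[0.1], [0.5], [1.0]])
    print(model.predict(t_test))   # learned received-signal curve at a few times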



