IT博客汇 | arXiv Paper Daily: Tue, 15 Nov 2016

arXiv Paper Daily: Tue, 15 Nov 2016

我爱机器学习(52ml.net)发表于 2016-11-15 00:00:00

Neural and Evolutionary Computing

Deep Learning with Sets and Point Clouds

Siamak Ravanbakhsh, Jeff Schneider, Barnabas Poczos
Comments: under review at ICLR
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

We study a simple notion of structural invariance that readily suggests a
parameter-sharing scheme in deep neural networks. In particular, we define
structure as a collection of relations, and derive graph convolution and
recurrent neural networks as special cases. We study composition of basic
structures in defining models that are invariant to more complex “product”
structures such as graph of graphs, sets of images or sequence of sets. For
demonstration, our experimental results are focused on the setting where the
discrete structure of interest is a set. We present results on several novel
and non-trivial problems on sets, including semi-supervised learning using
clustering information, point-cloud classification and set outlier detection.

Generate Models and Model Criticism via Optimized Maximum Mean Discrepancy

Dougal J. Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alex Smola, Arthur Gretton
Comments: Submitted to ICLR 2017; public comments at this http URL
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Methodology (stat.ME)

We propose a method to optimize the representation and distinguishability of
samples from two probability distributions, by maximizing the estimated power
of a statistical test based on the maximum mean discrepancy (MMD). This
optimized MMD is applied to the setting of unsupervised learning by generative
adversarial networks (GAN), in which a model attempts to generate realistic
samples, and a discriminator attempts to tell these apart from data samples. In
this context, the MMD may be used in two roles: first, as a discriminator,
either directly on the samples, or on features of the samples. Second, the MMD
can be used to evaluate the performance of a generative model, by testing the
model’s samples against a reference data set. In the latter role, the optimized
MMD is particularly helpful, as it gives an interpretable indication of how the
model and data distributions differ, even in cases where individual model
samples are not easily distinguished either by eye or by classifier.

Advancing Memristive Analog Neuromorphic Networks: Increasing Complexity, and Coping with Imperfect Hardware Components

F. Merrikh Bayat, M. Prezioso, B. Chakrabarti, I. Kataeva, D. B. Strukov
Comments: 4 pages, 13 figures
Subjects: Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)

We experimentally demonstrate classification of 4×4 binary images into 4
classes, using a 3-layer mixed-signal neuromorphic network (“MLP perceptron”),
based on two passive 20×20 memristive crossbar arrays, board-integrated with
discrete CMOS components. The network features 10 hidden-layer and 4
output-layer analog CMOS neurons and 428 metal-oxide memristors, i.e. is almost
an order of magnitude more complex than any previously reported functional
memristor circuit. Moreover, the inference operation of this classifier is
performed entirely in the integrated hardware. To deal with larger crossbar
arrays, we have developed a semi-automatic approach to their forming and
testing, and compared several memristor training schemes for coping with
imperfect behavior of these devices, as well as with variability of analog CMOS
neurons. The effectiveness of the proposed schemes for defect and variation
tolerance was verified experimentally using the implemented network and,
additionally, by modeling the operation of a larger network, with 300
hidden-layer neurons, on the MNIST benchmark. Finally, we propose a simple
modification of the implemented memristor-based vector-by-matrix multiplier to
allow its operation in a wider temperature range.

Attending to Characters in Neural Sequence Labeling Models

Marek Rei, Gamal K.O. Crichton, Sampo Pyysalo
Comments: Proceedings of COLING 2016
Subjects: Computation and Language (cs.CL); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Sequence labeling architectures use word embeddings for capturing similarity,
but suffer when handling previously unseen or rare words. We investigate
character-level extensions to such models and propose a novel architecture for
combining alternative word representations. By using an attention mechanism,
the model is able to dynamically decide how much information to use from a
word- or character-level component. We evaluated different architectures on a
range of sequence labeling datasets, and character-level extensions were found
to improve performance on every benchmark. In addition, the proposed
attention-based architecture delivered the best results even with a smaller
number of trainable parameters.

Identity Matters in Deep Learning

Moritz Hardt, Tengyu Ma
Subjects: Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

An emerging design principle in deep learning is that each layer of a deep
artificial neural network should be able to easily express the identity
transformation. This idea not only motivated various normalization techniques,
such as emph{batch normalization}, but was also key to the immense success of
emph{residual networks}.

In this work, we put the principle of emph{identity parameterization} on a
more solid theoretical footing alongside further empirical progress. We first
give a strikingly simple proof that arbitrarily deep linear residual networks
have no spurious local optima. The same result for linear feed-forward networks
in their standard parameterization is substantially more delicate. Second, we
show that residual networks with ReLu activations have universal finite-sample
expressivity in the sense that the network can represent any function of its
sample provided that the model has more parameters than the sample size.

Directly inspired by our theory, we experiment with a radically simple
residual architecture consisting of only residual convolutional layers and ReLu
activations, but no batch normalization, dropout, or max pool. Our model
improves significantly on previous all-convolutional networks on the CIFAR10,
CIFAR100, and ImageNet classification benchmarks.

Computer Vision and Pattern Recognition

3-D Convolutional Neural Networks for Glioblastoma Segmentation

Darvin Yi, Mu Zhou, Zhao Chen, Olivier Gevaert
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Convolutional Neural Networks (CNN) have emerged as powerful tools for
learning discriminative image features. In this paper, we propose a framework
of 3-D fully CNN models for Glioblastoma segmentation from multi-modality MRI
data. By generalizing CNN models to true 3-D convolutions in learning 3-D tumor
MRI data, the proposed approach utilizes a unique network architecture to
decouple image pixels. Specifically, we design a convolutional layer with
pre-defined Difference- of-Gaussian (DoG) filters to perform true 3-D
convolution incorporating local neighborhood information at each pixel. We then
use three trained convolutional layers that act to decouple voxels from the
initial 3-D convolution. The proposed framework allows identification of
high-level tumor structures on MRI. We evaluate segmentation performance on the
BRATS segmentation dataset with 274 tumor samples. Extensive experimental
results demonstrate encouraging performance of the proposed approach comparing
to the state-of-the-art methods. Our data-driven approach achieves a median
Dice score accuracy of 89% in whole tumor glioblastoma segmentation, revealing
a generalized low-bias possibility to learn from medium-size MRI datasets.

Fast Task-Specific Target Detection via Graph Based Constraints Representation and Checking

Went Luan, Yezhou Yang, Cornelia Fermuller, John S. Baras
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this work, we present a fast target detection framework for real-world
robotics applications. Considering that an intelligent agent attends to a
task-specific object target during execution, our goal is to detect the object
efficiently. We propose the concept of early recognition, which influences the
candidate proposal process to achieve fast and reliable detection performance.
To check the target constraints efficiently, we put forward a novel policy to
generate a sub-optimal checking order, and prove that it has bounded time cost
compared to the optimal checking sequence, which is not achievable in
polynomial time. Experiments on two different scenarios: 1) rigid object and 2)
non-rigid body part detection validate our pipeline. To show that our method is
widely applicable, we further present a human-robot interaction system based on
our non-rigid body part detection.

Can fully convolutional networks perform well for general image restoration problems?

Subhajit Chaudhury, Hiya Roy
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We present a fully convolutional network(FCN) based approach for color image
restoration. Fully convolutional networks have recently shown remarkable
performance for high level vision problems like semantic segmentation. In this
paper, we investigate if fully convolutional networks can show promising
performance for low level problems like image restoration as well. We propose a
FCN model, that learns a direct end-to-end mapping between the corrupted images
as input and the desired clean images as output. Experimental results show that
our FCN model outperforms traditional sparse coding based methods and
demonstrates competitive performance compared to the state-of-the-art methods
for image denoising. We further show that our proposed model can solve the
difficult problem of blind image inpainting and can produce reconstructed
images of impressive visual quality.

Automatic discovery of discriminative parts as a quadratic assignment problem

Ronan Sicre, Julien Rabin, Yannis Avrithis, Teddy Furon, Frederic Jurie
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Part-based image classification consists in representing categories by small
sets of discriminative parts upon which a representation of the images is
built. This paper addresses the question of how to automatically learn such
parts from a set of labeled training images. The training of parts is cast as a
quadratic assignment problem in which optimal correspondences between image
regions and parts are automatically learned. The paper analyses different
assignment strategies and thoroughly evaluates them on two public datasets:
Willow actions and MIT 67 scenes. State-of-the art results are obtained on
these datasets.

Joint Graph Decomposition and Node Labeling by Local Search

Evgeny Levinkov, Siyu Tang, Eldar Insafutdinov, Bjoern Andres
Subjects: Computer Vision and Pattern Recognition (cs.CV); Discrete Mathematics (cs.DM)

We state a combinatorial optimization problem whose feasible solutions define
both a decomposition and a node labeling of a given graph. This problem offers
a common mathematical abstraction of seemingly unrelated computer vision tasks,
including instance-separating semantic segmentation, articulated human body
pose estimation and multiple object tracking. Conceptually, the problem we
propose generalizes the unconstrained integer quadratic program and the minimum
cost lifted multicut problem, both of which are NP-hard. In order to find
feasible solutions efficiently, we define a local search algorithm that
converges monotonously to a local optimum, offering a feasible solution at any
time. To demonstrate the effectiveness of this algorithm in solving computer
vision tasks, we report running times and competitive solutions for two
above-mentioned applications.

Selfie Detection by Synergy-Constraint Based Convolutional Neural Network

Yashas Annadani, Vijayakrishna Naganoor, Akshay Kumar Jagadish, Krishnan Chemmangat
Comments: 8 Pages, Accepted for Publication at IEEE SITIS 2016
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Categorisation of huge amount of data on the multimedia platform is a crucial
task. In this work, we propose a novel approach to address the subtle problem
of selfie detection for image database segregation on the web, given rapid rise
in number of selfies clicked. A Convolutional Neural Network (CNN) is modeled
to learn a synergy feature in the common subspace of head and shoulder
orientation, derived from Local Binary Pattern (LBP) and Histogram of Oriented
Gradients (HOG) features respectively. This synergy was captured by projecting
the aforementioned features using Canonical Correlation Analysis (CCA). We show
that the resulting network’s convolutional activations in the neighbourhood of
spatial keypoints captured by SIFT are discriminative for selfie-detection. In
general, proposed approach aids in capturing intricacies present in the image
data and has the potential for usage in other subtle image analysis scenarios
apart from just selfie detection. We investigate and analyse the performance of
popular CNN architectures (GoogleNet, AlexNet), used for other image
classification tasks, when subjected to the task of detecting the selfies on
the multimedia platform. The results of the proposed approach are compared with
these popular architectures on a dataset of ninety thousand images comprising
of roughly equal number of selfies and non-selfies. Experimental results on
this dataset shows the effectiveness of the proposed approach.

Herding Generalizes Diverse M -Best Solutions

Ece Ozkan, Gemma Roig, Orcun Goksel, Xavier Boix
Comments: 10 pages, 2 algorithms, 2 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We show that the algorithm to extract diverse M -solutions from a Conditional
Random Field (called divMbest [1]) takes exactly the form of a Herding
procedure [2], i.e. a deterministic dynamical system that produces a sequence
of hypotheses that respect a set of observed moment constraints. This
generalization enables us to invoke properties of Herding that show that
divMbest enforces implausible constraints which may yield wrong assumptions for
some problem settings. Our experiments in semantic segmentation demonstrate
that seeing divMbest as an instance of Herding leads to better alternatives for
the implausible constraints of divMbest.

A DNN Framework For Text Image Rectification From Planar Transformations

Chengzhe Yan, Jie Hu, Changshui Zhang
Comments: 9 pages, 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, a novel neural network architecture is proposed attempting to
rectify text images with mild assumptions. A new dataset of text images is
collected to verify our model and open to public. We explored the capability of
deep neural network in learning geometric transformation and found the model
could segment the text image without explicit supervised segmentation
information. Experiments show the architecture proposed can restore planar
transformations with wonderful robustness and effectiveness.

Baseline CNN structure analysis for facial expression recognition

Minchul Shin, Munsang Kim, Dong-Soo Kwon
Comments: 6 pages, RO-MAN2016 Conference
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We present a baseline convolutional neural network (CNN) structure and image
preprocessing methodology to improve facial expression recognition algorithm
using CNN. To analyze the most efficient network structure, we investigated
four network structures that are known to show good performance in facial
expression recognition. Moreover, we also investigated the effect of input
image preprocessing methods. Five types of data input (raw, histogram
equalization, isotropic smoothing, diffusion-based normalization, difference of
Gaussian) were tested, and the accuracy was compared. We trained 20 different
CNN models (4 networks x 5 data input types) and verified the performance of
each network with test images from five different databases. The experiment
result showed that a three-layer structure consisting of a simple convolutional
and a max pooling layer with histogram equalization image input was the most
efficient. We describe the detailed training procedure and analyze the result
of the test accuracy based on considerable observation.

Multi-Shot Mining Semantic Part Concepts in CNNs

Quanshi Zhang, Ruiming Cao, Ying Nian Wu, Song-Chun Zhu
Comments: accepted for presentation at the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper proposes a new learning strategy that incrementally embeds new
object-part concepts into a pre-trained convolutional neural network (CNN), in
order to 1) explore explicit semantics for the CNN units and 2) gradually
transfer the pre-trained CNN into a “white-box” model for hierarchical object
understanding. Given part annotations on a very small number (e.g. 3–12) of
objects, our method mines certain units from the pre-trained CNN and associate
them with different part concepts. We use a four-layer And-Or graph to organize
the CNN units, which clarifies their internal semantic hierarchy. Our method is
guided by a small number of part annotations, and it achieves superior
part-localization performance (about 28%–107% improvement in part center
prediction).

Convolutional Regression for Visual Tracking

Kai Chen, Wenbing Tao
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recently, discriminatively learned correlation filters (DCF) has drawn much
attention in visual object tracking community. The success of DCF is
potentially attributed to the fact that a large amount of samples are utilized
to train the ridge regression model and predict the location of object. To
solve the regression problem in an efficient way, these samples are all
generated by circularly shifting from a search patch. However, these synthetic
samples also induce some negative effects which decrease the robustness of DCF
based trackers.

In this paper, we propose a Convolutional Regression framework for visual
tracking (CRT). Instead of learning the linear regression model in a closed
form, we try to solve the regression problem by optimizing a one-channel-output
convolution layer with Gradient Descent (GD). In particular, the receptive
field size of the convolution layer is set to the size of object. Contrary to
DCF, it is possible to incorporate all “real” samples clipped from the whole
image. A critical issue of the GD approach is that most of the convolutional
samples are negative and the contribution of positive samples will be
submerged. To address this problem, we propose a novel Automatic Hard Negative
Mining method to eliminate easy negatives and enhance positives. Extensive
experiments are conducted on a widely-used benchmark with 100 sequences. The
results show that the proposed algorithm achieves outstanding performance and
outperforms almost all the existing DCF based algorithms.

Semi-Dense 3D Semantic Mapping from Monocular SLAM

Xuanpeng Li, Rachid Belaroussi
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The bundle of geometry and appearance in computer vision has proven to be a
promising solution for robots across a wide variety of applications. Stereo
cameras and RGB-D sensors are widely used to realise fast 3D reconstruction and
trajectory tracking in a dense way. However, they lack flexibility of seamless
switch between different scaled environments, i.e., indoor and outdoor scenes.
In addition, semantic information are still hard to acquire in a 3D mapping. We
address this challenge by combining the state-of-art deep learning method and
semi-dense Simultaneous Localisation and Mapping (SLAM) based on video stream
from a monocular camera. In our approach, 2D semantic information are
transferred to 3D mapping via correspondence between connective Keyframes with
spatial consistency. There is no need to obtain a semantic segmentation for
each frame in a sequence, so that it could achieve a reasonable computation
time. We evaluate our method on indoor/outdoor datasets and lead to an
improvement in the 2D semantic labelling over baseline single frame
predictions.

Hand Gesture Recognition for Contactless Device Control in Operating Rooms

Ebrahim Nasr-Esfahani, Nader Karimi, S.M. Reza Soroushmehr, M. Hossein Jafari, M. Amin Khorsandi, Shadrokh Samavi, Kayvan Najarian
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Hand gesture is one of the most important means of touchless communication
between human and machines. There is a great interest for commanding electronic
equipment in surgery rooms by hand gesture for reducing the time of surgery and
the potential for infection. There are challenges in implementation of a hand
gesture recognition system. It has to fulfill requirements such as high
accuracy and fast response. In this paper we introduce a system of hand gesture
recognition based on a deep learning approach. Deep learning is known as an
accurate detection model, but its high complexity prevents it from being
fabricated as an embedded system. To cope with this problem, we applied some
changes in the structure of our work to achieve low complexity. As a result,
the proposed method could be implemented on a naive embedded system. Our
experiments show that the proposed system results in higher accuracy while
having less complexity in comparison with the existing comparable methods.

Automated Inference on Criminality using Face Images

Xiaolin Wu, Xi Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We study, for the first time, automated inference on criminality based solely
on still face images. Via supervised machine learning, we build four
classifiers (logistic regression, KNN, SVM, CNN) using facial images of 1856
real persons controlled for race, gender, age and facial expressions, nearly
half of whom were convicted criminals, for discriminating between criminals and
non-criminals. All four classifiers perform consistently well and produce
evidence for the validity of automated face-induced inference on criminality,
despite the historical controversy surrounding the topic. Also, we find some
discriminating structural features for predicting criminality, such as lip
curvature, eye inner corner distance, and the so-called nose-mouth angle. Above
all, the most important discovery of this research is that criminal and
non-criminal face images populate two quite distinctive manifolds. The
variation among criminal faces is significantly greater than that of the
non-criminal faces. The two manifolds consisting of criminal and non-criminal
faces appear to be concentric, with the non-criminal manifold lying in the
kernel with a smaller span, exhibiting a law of normality for faces of
non-criminals. In other words, the faces of general law-biding public have a
greater degree of resemblance compared with the faces of criminals, or
criminals have a higher degree of dissimilarity in facial appearance than
normal people.

Multi-class Generative Adversarial Networks with the L2 Loss Function

Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Generative adversarial networks (GANs) have achieved huge success in
unsupervised learning. Most of GANs treat the discriminator as a classifier
with the binary sigmoid cross entropy loss function. However, we find that the
sigmoid cross entropy loss function will sometimes lead to the saturation
problem in GANs learning. In this work, we propose to adopt the L2 loss
function for the discriminator. The properties of the L2 loss function can
improve the stabilization of GANs learning. With the usage of the L2 loss
function, we propose the multi-class generative adversarial networks for the
purpose of image generation with multiple classes. We evaluate the multi-class
GANs on a handwritten Chinese characters dataset with 3740 classes. The
experiments demonstrate that the multi-class GANs can generate elegant images
on datasets with a large number of classes. Comparison experiments between the
L2 loss function and the sigmoid cross entropy loss function are also conducted
and the results demonstrate the stabilization of the L2 loss function.

Leveraging Video Descriptions to Learn Video Question Answering

Kuo-Hao Zeng, Tseng-Hung Chen, Ching-Yao Chuang, Yuan-Hong Liao, Juan Carlos Niebles, Min Sun
Comments: 7 pages, 5 figures. Accepted to AAAI 2017
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

We propose a scalable approach to learn video-based question answering (QA):
answer a “free-form natural language question” about a video content. Our
approach automatically harvests a large number of videos and descriptions
freely available online. Then, a large number of candidate QA pairs are
automatically generated from descriptions rather than manually annotated. Next,
we use these candidate QA pairs to train a number of video-based QA methods
extended fromMN (Sukhbaatar et al. 2015), VQA (Antol et al. 2015), SA (Yao et
al. 2015), SS (Venugopalan et al. 2015). In order to handle non-perfect
candidate QA pairs, we propose a self-paced learning procedure to iteratively
identify them and mitigate their effects in training. Finally, we evaluate
performance on manually generated video-based QA pairs. The results show that
our self-paced learning procedure is effective, and the extended SS model
outperforms various baselines.

Optimized clothes segmentation to boost gender classification in unconstrained scenarios

D. Freire-Obregón, M. Castrillón-Santana, J. Lorenzo-Navarro
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Several applications require demographic information of ordinary people in
unconstrained scenarios. This is not a trivial task due to significant human
appearance variations. In this work, we introduce trixels for clustering image
regions, enumerating their advantages compared to superpixels. The classical
GrabCut algorithm is later modified to segment trixels instead of pixels in an
unsupervised context. Combining with face detection lead us to a clothes
segmentation approach close to real time. The study uses the challenging Pascal
VOC dataset for segmentation evaluation experiments. A final experiment
analyzes the fusion of clothes features with state-of-the-art gender
classifiers in ClothesDB, revealing a significant performance improvement in
gender classification.

Autonomous Learning Framework Based on Online Hybrid Classifier for Multi-view Object Detection in Video

Dapeng Luo, Zhipeng Zeng, Longsheng Wei, Yongwen Liu, Chen Luo, Jun Chen, Nong Sang
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, our proposed framework takes a remarkably different direction
to resolve multi-view detection problem in a bottom-up fashion. Firstly, a
state of the art scene-specific objector is obtained from a fully autonomous
learning process triggered by marking several bounding boxes around the object
in the first video frame via a mouse, where the human labeled training data or
a generic detector are not needed. Secondly, this learning process is
conveniently replicated many times in different surveillance scenes and result
in a particular detector under various camera viewpoints. Thus, the proposed
framework can be employed in multi-view object detection application with
unsupervised manner. Obviously, the initial scene-specific detector, initialed
by several bounding boxes, exhibits poor detection performance and difficult to
improve by traditional online learning algorithm. Consequently, in our
self-learning process, we propose to partition detection responses space by
online hybrid classifiers and assign each partition individual classifier that
achieves the high classification accuracy progressively. A novel online gradual
learning algorithm is proposed to train the hybrid classifiers automatically
and focus online learning on the hard samples, the most informative samples
lying around the decision boundary. The output is a hybrid classifier based
scene-specific detector which achieves decent performance under different
viewing angles. Experimental results on several video datasets show that our
approach achieves comparable performance to robust supervised methods and
outperforms the state of the art online learning methods in varying imaging
conditions.

When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing Features

Kuan-Ting Chen, Jiebo Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV)

With the prevalence of e-commence websites and the ease of online shopping,
consumers are embracing huge amounts of various options in products.
Undeniably, shopping is one of the most essential activities in our society and
studying consumer’s shopping behavior is important for the industry as well as
sociology and psychology. Not surprisingly, one of the most popular e-commerce
categories is clothing business. There arises the needs for analysis of popular
and attractive clothing features which could further boost many emerging
applications, such as clothing recommendation and advertising. In this work, we
design a novel system that consists of three major components: 1) exploring and
organizing a large-scale clothing dataset from a online shopping website, 2)
pruning and extracting images of best-selling products in clothing item data
and user transaction history, and 3) utilizing a machine learning based
approach to discovering clothing attributes as the representative and
discriminative characteristics of popular clothing style elements. Through the
experiments over a large-scale online clothing dataset, we demonstrate the
effectiveness of our proposed system, and obtain useful insights on clothing
consumption trends and profitable clothing features.

Effective sparse representation of X-Ray medical images

Laura Rebollo-Neira
Comments: Routines for implementing the approach are available on this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Effective sparse representation of X-Ray medical images within the context of
data reduction is considered. The proposed framework is shown to render an
enormous reduction in the cardinality of the data set required to represent
this class of images at very good quality. The particularity of the approach is
that it can be implemented at very competitive processing time and low memory
requirements

Zero-resource Machine Translation by Multimodal Encoder-decoder Network with Multimedia Pivot

Hideki Nakayama, Noriki Nishida
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

We propose an approach to build a neural machine translation system with no
supervised resources (i.e., no parallel corpora) using multimodal embedded
representation over texts and images. Based on the assumption that text
documents are often likely to be described with other multimedia information
(e.g., images) somewhat related to the content, we try to indirectly estimate
the relevance between two languages. Using multimedia as the “pivot”, we
project all modalities into one common hidden space where samples belonging to
similar semantic concepts should come close to each other, whatever the
observed space of each sample is. This modality-agnostic representation is the
key to bridging the gap between different modalities. Putting a decoder on top
of it, our network can flexibly draw the outputs from any input modality.
Notably, in the testing phase, we need only source language texts as the input
for translation. In experiments, we tested our method on two benchmarks to show
that it can achieve reasonable translation performance. We compared and
investigated several possible implementations and found that an end-to-end
model that simultaneously optimized both rank loss in multimodal encoders and
cross-entropy loss in decoders performed the best.

(CAD)(^2)RL: Real Single-Image Flight without a Single Real Image

Fereshteh Sadeghi, Sergey Levine
Comments: 11 pages
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

We propose (CAD)(^2)RL, a flight controller for Collision Avoidance via Deep
Reinforcement Learning that can be used to perform collision-free flight in the
real world although it is trained entirely in a 3D CAD model simulator. Our
method uses only single RGB images from a monocular camera mounted on the robot
as the input and is specialized for indoor hallway following and obstacle
avoidance. In contrast to most indoor navigation techniques that aim to
directly reconstruct the 3D geometry of the environment, our approach directly
predicts the probability of collision given the current monocular image and a
candidate action. To obtain accurate predictions, we develop a deep
reinforcement learning algorithm for learning indoor navigation, which uses the
actual performance of the current policy to construct accurate supervision. The
collision prediction model is represented by a deep convolutional neural
network that directly processes raw image inputs. Our collision avoidance
system is entirely trained in simulation and thus addresses the high sample
complexity of deep reinforcement learning and avoids the dangers of
trial-and-error learning in the real world. By highly randomizing the rendering
settings for our simulated training set, we show that we can train a collision
predictor that generalizes to new environments with substantially different
appearance from the training scenarios. Finally, we evaluate our method in the
real world by controlling a real quadrotor flying through real hallways. We
demonstrate that our method can perform real-world collision avoidance and
hallway following after being trained exclusively on synthetic images, without
ever having seen a single real image at the training time. For supplementary
video see: this https URL

Artificial Intelligence

Feature Engineering and Ensemble Modeling for Paper Acceptance Rank Prediction

Yujie Qian, Yinpeng Dong, Ye Ma, Hailong Jin, Juanzi Li
Comments: 2nd place winner report of KDD Cup 2016. More details can be found at this https URL
Subjects: Artificial Intelligence (cs.AI)

Measuring research impact and ranking academic achievement are important and
challenging problems. Having an objective picture of research institution is
particularly valuable for students, parents and funding agencies, and also
attracts attention from government and industry. KDD Cup 2016 proposes the
paper acceptance rank prediction task, in which the participants are asked to
rank the importance of institutions based on predicting how many of their
papers will be accepted at the 8 top conferences in computer science. In our
work, we adopt a three-step feature engineering method, including basic
features definition, finding similar conferences to enhance the feature set,
and dimension reduction using PCA. We propose three ranking models and the
ensemble methods for combining such models. Our experiment verifies the
effectiveness of our approach. In KDD Cup 2016, we achieved the overall rank of
the 2nd place.

Learning for Expertise Matching with Declination Prediction

Yujie Qian, Jie Tang
Subjects: Artificial Intelligence (cs.AI)

We study the problem of finding appropriate experts who are able to complete
timely reviews and would not say “no” to the invitation. The problem is a
central issue in many question-and-answer systems, but has received little
research attention. Different from most existing studies that focus on
expertise matching, we want to further predict the expert’s response: given a
question, how can we find the expert who is able to provide a quality review
and will agree to do it. We formalize the problem as a ranking problem. We
first present an embedding-based question-to-expert distance metric for
expertise matching and propose a ranking factor graph (RankFG) model to predict
expert response. For online evaluation, we developed a Chrome Extension for
reviewer recommendation and deployed it in the Google Chrome Web Store, and
then collected the reviewers’ feedback. We also used the review bidding of a CS
conference for evaluation. In the experiments, the proposed method demonstrates
its superiority (+6.6-21.2% by MAP) over several state-of-the-art algorithms.

Combing Context and Commonsense Knowledge Through Neural Networks for Solving Winograd Schema Problems

Quan Liu, Hui Jiang, Zhen-Hua Ling, Xiaodan Zhu, Si Wei, Yu Hu
Comments: 7 pages
Subjects: Artificial Intelligence (cs.AI)

This paper proposes a general framework to combine context and commonsense
knowledge for solving the Winograd Schema (WS) and Pronoun Disambiguation
Problems (PDP). In the proposed framework, commonsense knowledge bases (e.g.
cause-effect word pairs) are quantized as knowledge constraints. The
constraints guide us to learn knowledge enhanced embeddings (KEE) from large
text corpus. Based on the pre-trained KEE models, this paper proposes two
methods to solve the WS and PDP problems. The first method is an unsupervised
method, which represents all the pronouns and candidate mentions in continuous
vector spaces based on their contexts and calculates the semantic similarities
between all the possible word pairs. The pronoun disambiguation procedure could
then be implemented by comparing the semantic similarities between the pronoun
(to be resolved) and all the candidate mentions. The second method is a
supervised method, which extracts features for all the pronouns and candidate
mentions and solves the WS problems by training a typical mention pair
classification model. Similar to the first method, the features used in the
second method are also extracted based on the KEE models. Experiments conducted
on the available PDP and WS test sets show that, these two methods both achieve
consistent improvements over the baseline systems. The best performance reaches
62\% in accuracy on the PDP test set of the first Winograd Schema Challenge.

Entropic Causal Inference

Murat Kocaoglu, Alexandros G. Dimakis, Sriram Vishwanath, Babak Hassibi
Comments: To appear in AAAI 2017
Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)

We consider the problem of identifying the causal direction between two
discrete random variables using observational data. Unlike previous work, we
keep the most general functional model but make an assumption on the unobserved
exogenous variable: Inspired by Occam’s razor, we assume that the exogenous
variable is simple in the true causal direction. We quantify simplicity using
R’enyi entropy. Our main result is that, under natural assumptions, if the
exogenous variable has low (H_0) entropy (cardinality) in the true direction,
it must have high (H_0) entropy in the wrong direction. We establish several
algorithmic hardness results about estimating the minimum entropy exogenous
variable. We show that the problem of finding the exogenous variable with
minimum entropy is equivalent to the problem of finding minimum joint entropy
given (n) marginal distributions, also known as minimum entropy coupling
problem. We propose an efficient greedy algorithm for the minimum entropy
coupling problem, that for (n=2) provably finds a local optimum. This gives a
greedy algorithm for finding the exogenous variable with minimum (H_1) (Shannon
Entropy). Our greedy entropy-based causal inference algorithm has similar
performance to the state of the art additive noise models in real datasets. One
advantage of our approach is that we make no use of the values of random
variables but only their distributions. Our method can therefore be used for
causal inference for both ordinal and also categorical data, unlike additive
noise models.

A Review on Algorithms for Constraint-based Causal Discovery

Kui Yu, Jiuyong Li, Lin Liu
Subjects: Artificial Intelligence (cs.AI)

Causal discovery studies the problem of mining causal relationships between
variables from data, which is of primary interest in science. During the past
decades, significant amount of progresses have been made toward this
fundamental data mining paradigm. Recent years, as the availability of abundant
large-sized and complex observational data, the constrain-based approaches have
gradually attracted a lot of interest and have been widely applied to many
diverse real-world problems due to the fast running speed and easy generalizing
to the problem of causal insufficiency. In this paper, we aim to review the
constraint-based causal discovery algorithms. Firstly, we discuss the learning
paradigm of the constraint-based approaches. Secondly and primarily, the
state-of-the-art constraint-based casual inference algorithms are surveyed with
the detailed analysis. Thirdly, several related open-source software packages
and benchmark data repositories are briefly summarized. As a conclusion, some
open problems in constraint-based causal discovery are outlined for future
research.

Multi-lingual Knowledge Graph Embeddings for Cross-lingual Knowledge Alignment

Muhao Chen, Yingtao Tian, Mohan Yang, Carlo Zaniolo
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Many recent works have demonstrated the benefits of knowledge graph
embeddings in completing monolingual knowledge graphs. Inasmuch as related
knowledge bases are built in several different languages, achieving
cross-lingual knowledge alignment will help people in constructing a coherent
knowledge base, and assist machines in dealing with different expressions of
entity relationships across diverse human languages. Unfortunately, achieving
this highly desirable crosslingual alignment by human labor is very costly and
errorprone. Thus, we propose MTransE, a translation-based model for
multilingual knowledge graph embeddings, to provide a simple and automated
solution. By encoding entities and relations of each language in a separated
embedding space, MTransE provides transitions for each embedding vector to its
cross-lingual counterparts in other spaces, while preserving the
functionalities of monolingual embeddings. We deploy three different techniques
to represent cross-lingual transitions, namely axis calibration, translation
vectors, and linear transformations, and derive five variants for MTransE using
different loss functions. Our models can be trained on partially aligned
graphs, where just a small portion of triples are aligned with their
cross-lingual counterparts. The experiments on cross-lingual entity matching
and triple-wise alignment verification show promising results, with some
variants consistently outperforming others on different tasks. We also explore
how MTransE preserves the key properties of its monolingual counterpart TransE.

Reinforcement Learning of Contextual MDPs using Spectral Methods

Kamyar Azizzadenesheli, Alessandro Lazaric, Animashree Anandkumar
Subjects: Artificial Intelligence (cs.AI); Learning (cs.LG); Machine Learning (stat.ML)

We propose a new reinforcement learning (RL) algorithm for contextual Markov
decision processes (CMDP) using spectral methods. CMDPs are structured MDPs
where the dynamics and rewards depend on a smaller number of hidden states or
contexts. If the mapping between the hidden and observed states is known a
priori, then standard RL algorithms such as UCRL are guaranteed to attain low
regret. Is it possible to achieve regret of the same order even when the
mapping is unknown? We provide an affirmative answer in this paper. We exploit
spectral methods to learn the mapping from hidden to observed states with
guaranteed confidence bounds, and incorporate it into the UCRL-based framework
to obtain order-optimal regret.

Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, Jeffrey Dean
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

We propose a simple, elegant solution to use a single Neural Machine
Translation (NMT) model to translate between multiple languages. Our solution
requires no change in the model architecture from our base system but instead
introduces an artificial token at the beginning of the input sentence to
specify the required target language. The rest of the model, which includes
encoder, decoder and attention, remains unchanged and is shared across all
languages. Using a shared wordpiece vocabulary, our approach enables
Multilingual NMT using a single model without any increase in parameters, which
is significantly simpler than previous proposals for Multilingual NMT. Our
method often improves the translation quality of all involved language pairs,
even while keeping the total number of model parameters constant. On the WMT’14
benchmarks, a single multilingual model achieves comparable performance for
English(
ightarrow)French and surpasses state-of-the-art results for
English(
ightarrow)German. Similarly, a single multilingual model surpasses
state-of-the-art results for French(
ightarrow)English and
German(
ightarrow)English on WMT’14 and WMT’15 benchmarks respectively. On
production corpora, multilingual models of up to twelve language pairs allow
for better translation of many individual pairs. In addition to improving the
translation quality of language pairs that the model was trained with, our
models can also learn to perform implicit bridging between language pairs never
seen explicitly during training, showing that transfer learning and zero-shot
translation is possible for neural translation. Finally, we show analyses that
hints at a universal interlingua representation in our models and show some
interesting examples when mixing languages.

Learning the best algorithm for max-cut, clustering, and other partitioning problems

Maria-Florina Balcan, Vaishnavh Nagarajan, Ellen Vitercik, Colin White
Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Learning (cs.LG)

Many data analysis problems are NP-hard, a reality that has motivated
researchers to develop a wealth of approximation algorithms and heuristics over
the past few decades. Max-cut, clustering, and many other widely applicable
partitioning problems fall into this camp. We focus on two common classes of
algorithms that are used for clustering and partitioning problems, including
SDP rounding algorithms and agglomerative clustering algorithms with dynamic
programming. The best algorithm to use typically depends on the specific
application or distribution over instances. A worst-case analysis is often used
to compare algorithms, but worst-case instances may not be present in the
particular problem domain, and thus may be misleading when determining which
algorithm to apply. Therefore, it is necessary to develop optimization methods
which return the algorithms and parameters best suited for typical inputs from
the application at hand. We address this problem for integer quadratic
programming and clustering within a learning-theoretic framework where the
learning algorithm is given samples from an application-specific distribution
over problem instances and learns an algorithm which performs well over the
distribution. We provide sample complexity guarantees and computationally
efficient algorithms which optimize an algorithm family’s parameters to best
suit typical inputs from a particular application. We analyze these algorithms
using the learning-theoretic notion of pseudo-dimension. Along with upper
bounds, we provide the first pseudo-dimension lower bounds in this line of
work, which require an involved analysis of each algorithm family’s performance
on carefully constructed problem instances. Our bounds are tight and therefore
nail down the intrinsic complexity of the algorithm classes we analyze, which
are significantly more complex than classes commonly used in learning theory.

Zero-resource Machine Translation by Multimodal Encoder-decoder Network with Multimedia Pivot

Hideki Nakayama, Noriki Nishida
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

We propose an approach to build a neural machine translation system with no
supervised resources (i.e., no parallel corpora) using multimodal embedded
representation over texts and images. Based on the assumption that text
documents are often likely to be described with other multimedia information
(e.g., images) somewhat related to the content, we try to indirectly estimate
the relevance between two languages. Using multimedia as the “pivot”, we
project all modalities into one common hidden space where samples belonging to
similar semantic concepts should come close to each other, whatever the
observed space of each sample is. This modality-agnostic representation is the
key to bridging the gap between different modalities. Putting a decoder on top
of it, our network can flexibly draw the outputs from any input modality.
Notably, in the testing phase, we need only source language texts as the input
for translation. In experiments, we tested our method on two benchmarks to show
that it can achieve reasonable translation performance. We compared and
investigated several possible implementations and found that an end-to-end
model that simultaneously optimized both rank loss in multimodal encoders and
cross-entropy loss in decoders performed the best.

Generate Models and Model Criticism via Optimized Maximum Mean Discrepancy

We propose a method to optimize the representation and distinguishability of
samples from two probability distributions, by maximizing the estimated power
of a statistical test based on the maximum mean discrepancy (MMD). This
optimized MMD is applied to the setting of unsupervised learning by generative
adversarial networks (GAN), in which a model attempts to generate realistic
samples, and a discriminator attempts to tell these apart from data samples. In
this context, the MMD may be used in two roles: first, as a discriminator,
either directly on the samples, or on features of the samples. Second, the MMD
can be used to evaluate the performance of a generative model, by testing the
model’s samples against a reference data set. In the latter role, the optimized
MMD is particularly helpful, as it gives an interpretable indication of how the
model and data distributions differ, even in cases where individual model
samples are not easily distinguished either by eye or by classifier.

Learning to Gather Information via Imitation

Sanjiban Choudhury, Ashish Kapoor, Gireeja Ranade, Debadeepta Dey
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

The budgeted information gathering problem – where a robot with a fixed fuel
budget is required to maximize the amount of information gathered from the
world – appears in practice across a wide range of applications in autonomous
exploration and inspection with mobile robots. Although there is an extensive
amount of prior work investigating effective approximations of the problem,
these methods do not address the fact that their performance is heavily
dependent on distribution of objects in the world. In this paper, we attempt to
address this issue by proposing a novel data-driven imitation learning
framework.

We present an efficient algorithm, EXPLORE, that trains a policy on the
target distribution to imitate a clairvoyant oracle – an oracle that has full
information about the world and computes non-myopic solutions to maximize
information gathered. We validate the approach on a spectrum of results on a
number of 2D and 3D exploration problems that demonstrates the ability of
EXPLORE to adapt to different object distributions. Additionally, our analysis
provides theoretical insight into the behavior of EXPLORE. Our approach paves
the way forward for efficiently applying data-driven methods to the domain of
information gathering.

Recognizing and Eliciting Weakly Single Crossing Profiles on Trees

Palash Dey
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)

The domain of single crossing preference profiles is a widely studied domain
in social choice theory. It has been generalized to the domain of single
crossing preference profiles with respect to trees which inherits many
desirable properties from the single crossing domain, for example, transitivity
of majority relation, existence of polynomial time algorithms for finding
winners of Kemeny voting rule, etc. In this paper, we consider a further
generalization of the domain of single crossing profiles on trees to the domain
consisting of all preference profiles which can be extended to single crossing
preference profiles with respect to some tree by adding more preferences to it.
We call this domain the weakly single crossing domain on trees. We present a
polynomial time algorithm for recognizing weakly single crossing profiles on
trees. We then move on to develop a polynomial time algorithm with low query
complexity for eliciting weakly single crossing profiles on trees even when we
do not know any tree with respect to which the closure of the input profile is
single crossing and the preferences can be queried only sequentially; moreover,
the sequential order is also unknown. We complement the performance of our
preference elicitation algorithm by proving that our algorithm makes an optimal
number of queries up to constant factors when the number of preferences is
large compared to the number of candidates, even if the input profile is known
to be single crossing with respect to some given tree and the preferences can
be accessed randomly.

Leveraging Video Descriptions to Learn Video Question Answering

We propose a scalable approach to learn video-based question answering (QA):
answer a “free-form natural language question” about a video content. Our
approach automatically harvests a large number of videos and descriptions
freely available online. Then, a large number of candidate QA pairs are
automatically generated from descriptions rather than manually annotated. Next,
we use these candidate QA pairs to train a number of video-based QA methods
extended fromMN (Sukhbaatar et al. 2015), VQA (Antol et al. 2015), SA (Yao et
al. 2015), SS (Venugopalan et al. 2015). In order to handle non-perfect
candidate QA pairs, we propose a self-paced learning procedure to iteratively
identify them and mitigate their effects in training. Finally, we evaluate
performance on manually generated video-based QA pairs. The results show that
our self-paced learning procedure is effective, and the extended SS model
outperforms various baselines.

Information Retrieval

1.5 billion words Arabic Corpus

Ibrahim Abu El-khair
Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

This study is an attempt to build a contemporary linguistic corpus for Arabic
language. The corpus produced, is a text corpus includes more than five million
newspaper articles. It contains over a billion and a half words in total, out
of which, there is about three million unique words. The data were collected
from newspaper articles in ten major news sources from eight Arabic countries,
over a period of fourteen years. The corpus was encoded with two types of
encoding, namely: UTF-8, and Windows CP-1256. Also it was marked with two
mark-up languages, namely: SGML, and XML.

Computation and Language

Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

We propose a simple, elegant solution to use a single Neural Machine
Translation (NMT) model to translate between multiple languages. Our solution
requires no change in the model architecture from our base system but instead
introduces an artificial token at the beginning of the input sentence to
specify the required target language. The rest of the model, which includes
encoder, decoder and attention, remains unchanged and is shared across all
languages. Using a shared wordpiece vocabulary, our approach enables
Multilingual NMT using a single model without any increase in parameters, which
is significantly simpler than previous proposals for Multilingual NMT. Our
method often improves the translation quality of all involved language pairs,
even while keeping the total number of model parameters constant. On the WMT’14
benchmarks, a single multilingual model achieves comparable performance for
English(
ightarrow)French and surpasses state-of-the-art results for
English(
ightarrow)German. Similarly, a single multilingual model surpasses
state-of-the-art results for French(
ightarrow)English and
German(
ightarrow)English on WMT’14 and WMT’15 benchmarks respectively. On
production corpora, multilingual models of up to twelve language pairs allow
for better translation of many individual pairs. In addition to improving the
translation quality of language pairs that the model was trained with, our
models can also learn to perform implicit bridging between language pairs never
seen explicitly during training, showing that transfer learning and zero-shot
translation is possible for neural translation. Finally, we show analyses that
hints at a universal interlingua representation in our models and show some
interesting examples when mixing languages.

Zero-resource Machine Translation by Multimodal Encoder-decoder Network with Multimedia Pivot

Hideki Nakayama, Noriki Nishida
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

We propose an approach to build a neural machine translation system with no
supervised resources (i.e., no parallel corpora) using multimodal embedded
representation over texts and images. Based on the assumption that text
documents are often likely to be described with other multimedia information
(e.g., images) somewhat related to the content, we try to indirectly estimate
the relevance between two languages. Using multimedia as the “pivot”, we
project all modalities into one common hidden space where samples belonging to
similar semantic concepts should come close to each other, whatever the
observed space of each sample is. This modality-agnostic representation is the
key to bridging the gap between different modalities. Putting a decoder on top
of it, our network can flexibly draw the outputs from any input modality.
Notably, in the testing phase, we need only source language texts as the input
for translation. In experiments, we tested our method on two benchmarks to show
that it can achieve reasonable translation performance. We compared and
investigated several possible implementations and found that an end-to-end
model that simultaneously optimized both rank loss in multimodal encoders and
cross-entropy loss in decoders performed the best.

Multi-view Recurrent Neural Acoustic Word Embeddings

Wanjia He, Weiran Wang, Karen Livescu
Comments: In submission to ICLR 2017
Subjects: Computation and Language (cs.CL)

Recent work has begun exploring neural acoustic word
embeddings–fixed-dimensional vector representations of arbitrary-length speech
segments corresponding to words. Such embeddings are applicable to speech
retrieval and recognition tasks, where reasoning about whole words may make it
possible to avoid ambiguous sub-word representations. The main idea is to map
acoustic sequences to fixed-dimensional vectors such that examples of the same
word are mapped to similar vectors, while different-word examples are mapped to
very different vectors. In this work we take a multi-view approach to learning
acoustic word embeddings, in which we jointly learn to embed acoustic sequences
and their corresponding character sequences. We use deep bidirectional LSTM
embedding models and multi-view contrastive losses. We study the effect of
different loss variants, including fixed-margin and cost-sensitive losses. Our
acoustic word embeddings improve over previous approaches for the task of word
discrimination. We also present results on other tasks that are enabled by the
multi-view approach, including cross-view word discrimination and word
similarity.

Ranking medical jargon in electronic health record notes by adapted distant supervision

Jinying Chen, Abhyuday N. Jagannatha, Samah J. Jarad, Hong Yu
Subjects: Computation and Language (cs.CL)

Objective: Allowing patients to access their own electronic health record
(EHR) notes through online patient portals has the potential to improve
patient-centered care. However, medical jargon, which abounds in EHR notes, has
been shown to be a barrier for patient EHR comprehension. Existing knowledge
bases that link medical jargon to lay terms or definitions play an important
role in alleviating this problem but have low coverage of medical jargon in
EHRs. We developed a data-driven approach that mines EHRs to identify and rank
medical jargon based on its importance to patients, to support the building of
EHR-centric lay language resources.

Methods: We developed an innovative adapted distant supervision (ADS) model
based on support vector machines to rank medical jargon from EHRs. For distant
supervision, we utilized the open-access, collaborative consumer health
vocabulary, a large, publicly available resource that links lay terms to
medical jargon. We explored both knowledge-based features from the Unified
Medical Language System and distributed word representations learned from
unlabeled large corpora. We evaluated the ADS model using physician-identified
important medical terms.

Results: Our ADS model significantly surpassed two state-of-the-art automatic
term recognition methods, TF*IDF and C-Value, yielding 0.810 ROC-AUC versus
0.710 and 0.667, respectively. Our model identified 10K important medical
jargon terms after ranking over 100K candidate terms mined from over 7,500 EHR
narratives.

Conclusion: Our work is an important step towards enriching lexical resources
that link medical jargon to lay terms/definitions to support patient EHR
comprehension. The identified medical jargon terms and their rankings are
available upon request.

Attending to Characters in Neural Sequence Labeling Models

Marek Rei, Gamal K.O. Crichton, Sampo Pyysalo
Comments: Proceedings of COLING 2016
Subjects: Computation and Language (cs.CL); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Sequence labeling architectures use word embeddings for capturing similarity,
but suffer when handling previously unseen or rare words. We investigate
character-level extensions to such models and propose a novel architecture for
combining alternative word representations. By using an attention mechanism,
the model is able to dynamically decide how much information to use from a
word- or character-level component. We evaluated different architectures on a
range of sequence labeling datasets, and character-level extensions were found
to improve performance on every benchmark. In addition, the proposed
attention-based architecture delivered the best results even with a smaller
number of trainable parameters.

Character-level Convolutional Network for Text Classification Applied to Chinese Corpus

Weijie Huang
Subjects: Computation and Language (cs.CL)

This article provides an interesting exploration of character-level
convolutional neural network solving Chinese corpus text classification
problem. We constructed a large-scale Chinese language dataset, and the result
shows that character-level convolutional neural network works better on Chinese
corpus than its corresponding pinyin format dataset. This is the first time
that character-level convolutional neural network applied to text
classification problem.

`Who would have thought of that!': A Hierarchical Topic Model for Extraction of Sarcasm-prevalent Topics and Sarcasm Detection

Aditya Joshi, Prayas Jain, Pushpak Bhattacharyya, Mark Carman
Comments: This paper will be presented at ExPROM workshop at COLING 2016
Subjects: Computation and Language (cs.CL)

Topic Models have been reported to be beneficial for aspect-based sentiment
analysis. This paper reports a simple topic model for sarcasm detection, a
first, to the best of our knowledge. Designed on the basis of the intuition
that sarcastic tweets are likely to have a mixture of words of both sentiments
as against tweets with literal sentiment (either positive or negative), our
hierarchical topic model discovers sarcasm-prevalent topics and topic-level
sentiment. Using a dataset of tweets labeled using hashtags, the model
estimates topic-level, and sentiment-level distributions. Our evaluation shows
that topics such as `work’, `gun laws’, `weather’ are sarcasm-prevalent topics.
Our model is also able to discover the mixture of sentiment-bearing words that
exist in a text of a given sentiment-related label. Finally, we apply our model
to predict sarcasm in tweets. We outperform two prior work based on statistical
classifiers with specific features, by around 25\%.

Classify or Select: Neural Architectures for Extractive Document Summarization

Ramesh Nallapati, Bowen Zhou, Mingbo Ma
Subjects: Computation and Language (cs.CL)

We present two novel and contrasting Recurrent Neural Network (RNN) based
architectures for extractive summarization of documents. The Classifier based
architecture sequentially accepts or rejects each sentence in the original
document order for its membership in the final summary. The Selector
architecture, on the other hand, is free to pick one sentence at a time in any
arbitrary order to piece together the summary.

Our models under both architectures jointly capture the notions of salience
and redundancy of sentences. In addition, these models have the advantage of
being very interpretable, since they allow visualization of their predictions
broken up by abstract features such as information content, salience and
redundancy.

We show that our models reach or outperform state-of-the-art supervised
models on two different corpora. We also recommend the conditions under which
one architecture is superior to the other based on experimental evidence.

F-Score Driven Max Margin Neural Network for Named Entity Recognition in Chinese Social Media

Hangfeng He, Xu Sun
Subjects: Computation and Language (cs.CL)

We focus on named entity recognition (NER) for Chinese social media. With
massive unlabeled text and quite limited labelled corpus, we propose a
semi-supervised learning model based on B-LSTM neural network. To take
advantage of traditional methods in NER such as CRF, we combine transition
probability with deep learning in our model. To bridge the gap between label
accuracy and F-score of NER, we construct a model which can be directly trained
on F-score. When considering the instability of F-score driven method and
meaningful information provided by label accuracy, we propose an integrated
method to train on both F-score and label accuracy. Our integrated model yields
7.44\% improvement over previous state-of-the-art result.

A New Recurrent Neural CRF for Learning Non-linear Edge Features

Shuming Ma, Xu Sun
Subjects: Computation and Language (cs.CL)

Conditional Random Field (CRF) and recurrent neural models have achieved
success in structured prediction. More recently, there is a marriage of CRF and
recurrent neural models, so that we can gain from both non-linear dense
features and globally normalized CRF objective. These recurrent neural CRF
models mainly focus on encode node features in CRF undirected graphs. However,
edge features prove important to CRF in structured prediction. In this work, we
introduce a new recurrent neural CRF model, which learns non-linear edge
features, and thus makes non-linear features encoded completely. We compare our
model with different neural models in well-known structured prediction tasks.
Experiments show that our model outperforms state-of-the-art methods in NP
chunking, shallow parsing, Chinese word segmentation and POS tagging.

SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents

Ramesh Nallapati, Feifei Zhai, Bowen Zhou
Comments: Published at AAAI 2017, The Thirty-First AAAI Conference on Artificial Intelligence (AAAI-2017)
Subjects: Computation and Language (cs.CL)

We present SummaRuNNer, a Recurrent Neural Network (RNN) based sequence model
for extractive summarization of documents and show that it achieves performance
better than or comparable to state-of-the-art. Our model has the additional
advantage of being very interpretable, since it allows visualization of its
predictions broken up by abstract features such as information content,
salience and novelty. Another novel contribution of our work is abstractive
training of our extractive model that can train on human generated reference
summaries alone, eliminating the need for sentence-level extractive labels.

Joint Representation Learning of Text and Knowledge for Knowledge Graph Completion

Xu Han, Zhiyuan Liu, Maosong Sun
Subjects: Computation and Language (cs.CL)

Joint representation learning of text and knowledge within a unified semantic
space enables us to perform knowledge graph completion more accurately. In this
work, we propose a novel framework to embed words, entities and relations into
the same continuous vector space. In this model, both entity and relation
embeddings are learned by taking knowledge graph and plain text into
consideration. In experiments, we evaluate the joint learning model on three
tasks including entity prediction, relation prediction and relation
classification from text. The experiment results show that our model can
significantly and consistently improve the performance on the three tasks as
compared with other baselines.

Cross-lingual Dataless Classification for Languages with Small Wikipedia Presence

Yangqiu Song, Stephen Mayhew, Dan Roth
Subjects: Computation and Language (cs.CL)

This paper presents an approach to classify documents in any language into an
English topical label space, without any text categorization training data. The
approach, Cross-Lingual Dataless Document Classification (CLDDC) relies on
mapping the English labels or short category description into a Wikipedia-based
semantic representation, and on the use of the target language Wikipedia.
Consequently, performance could suffer when Wikipedia in the target language is
small. In this paper, we focus on languages with small Wikipedias,
(Small-Wikipedia languages, SWLs). We use a word-level dictionary to convert
documents in a SWL to a large-Wikipedia language (LWLs), and then perform CLDDC
based on the LWL’s Wikipedia. This approach can be applied to thousands of
languages, which can be contrasted with machine translation, which is a
supervision heavy approach and can be done for about 100 languages. We also
develop a ranking algorithm that makes use of language similarity metrics to
automatically select a good LWL, and show that this significantly improves
classification of SWLs’ documents, performing comparably to the best bridge
possible.

Semi-automatic Simultaneous Interpreting Quality Evaluation

Xiaojun Zhang
Journal-ref: International Journal on Natural Language Computing (IJNLC) Vol.
5, No.5, October 2016
Subjects: Computation and Language (cs.CL)

Increasing interpreting needs a more objective and automatic measurement. We
hold a basic idea that ‘translating means translating meaning’ in that we can
assessment interpretation quality by comparing the meaning of the interpreting
output with the source input. That is, a translation unit of a ‘chunk’ named
Frame which comes from frame semantics and its components named Frame Elements
(FEs) which comes from Frame Net are proposed to explore their matching rate
between target and source texts. A case study in this paper verifies the
usability of semi-automatic graded semantic-scoring measurement for human
simultaneous interpreting and shows how to use frame and FE matches to score.
Experiments results show that the semantic-scoring metrics have a significantly
correlation coefficient with human judgment.

1.5 billion words Arabic Corpus

Ibrahim Abu El-khair
Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

This study is an attempt to build a contemporary linguistic corpus for Arabic
language. The corpus produced, is a text corpus includes more than five million
newspaper articles. It contains over a billion and a half words in total, out
of which, there is about three million unique words. The data were collected
from newspaper articles in ten major news sources from eight Arabic countries,
over a period of fourteen years. The corpus was encoded with two types of
encoding, namely: UTF-8, and Windows CP-1256. Also it was marked with two
mark-up languages, namely: SGML, and XML.

Multi-Language Identification Using Convolutional Recurrent Neural Network

Vrishabh Ajay Lakhani, Rohan Mahadev
Comments: 10 pages, 6 figures
Subjects: Computation and Language (cs.CL)

Language Identification, being an important aspect of Automatic Speaker
Recognition has had many changes and new approaches to ameliorate performance
over the last decade. We compare the performance of using audio spectrum in the
log scale and using Polyphonic sound sequences from raw audio samples to train
the neural network and to classify speech as either English or Spanish. To
achieve this, we use the novel approach of using a Convolutional Recurrent
Neural Network using Long Short Term Memory (LSTM) or a Gated Recurrent Unit
(GRU) for forward propagation of the neural network. Our hypothesis is that the
performance of using polyphonic sound sequence as features and both LSTM and
GRU as the gating mechanisms for the neural network outperform the traditional
MFCC features using a unidirectional Deep Neural Network.

Linguistically Regularized LSTMs for Sentiment Classification

Qiao Qian, Minlie Huang, Xiaoyan Zhu
Subjects: Computation and Language (cs.CL)

Sentiment understanding has been a long-term goal of AI in the past decades.
This paper deals with sentence-level sentiment classification. Though a variety
of neural network models have been proposed very recently, however, previous
models either depend on expensive phrase-level annotation, whose performance
drops substantially when trained with only sentence-level annotation; or do not
fully employ linguistic resources (e.g., sentiment lexicons, negation words,
intensity words), thus not being able to produce linguistically coherent
representations. In this paper, we propose simple models trained with
sentence-level annotation, but also attempt to generating linguistically
coherent representations by employing regularizers that model the linguistic
role of sentiment lexicons, negation words, and intensity words. Results show
that our models are effective to capture the sentiment shifting effect of
sentiment, negation, and intensity words, while still obtain competitive
results without sacrificing the models’ simplicity.

Training IBM Watson using Automatically Generated Question-Answer Pairs

Jangho Lee, Gyuwan Kim, Jaeyoon Yoo, Changwoo Jung, Minseok Kim, Sungroh Yoon
Subjects: Computation and Language (cs.CL)

IBM Watson is a cognitive computing system capable of question answering in
natural languages. It is believed that IBM Watson can understand large corpora
and answer relevant questions more effectively than any other
question-answering system currently available. To unleash the full power of
Watson, however, we need to train its instance with a large number of
well-prepared question-answer pairs. Obviously, manually generating such pairs
in a large quantity is prohibitively time consuming and significantly limits
the efficiency of Watson’s training. Recently, a large-scale dataset of over 30
million question-answer pairs was reported. Under the assumption that using
such an automatically generated dataset could relieve the burden of manual
question-answer generation, we tried to use this dataset to train an instance
of Watson and checked the training efficiency and accuracy. According to our
experiments, using this auto-generated dataset was effective for training
Watson, complementing manually crafted question-answer pairs. To the best of
the authors’ knowledge, this work is the first attempt to use a large-scale
dataset of automatically generated question-answer pairs for training IBM
Watson. We anticipate that the insights and lessons obtained from our
experiments will be useful for researchers who want to expedite Watson training
leveraged by automatically generated question-answer pairs.

Multi-lingual Knowledge Graph Embeddings for Cross-lingual Knowledge Alignment

Muhao Chen, Yingtao Tian, Mohan Yang, Carlo Zaniolo
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Many recent works have demonstrated the benefits of knowledge graph
embeddings in completing monolingual knowledge graphs. Inasmuch as related
knowledge bases are built in several different languages, achieving
cross-lingual knowledge alignment will help people in constructing a coherent
knowledge base, and assist machines in dealing with different expressions of
entity relationships across diverse human languages. Unfortunately, achieving
this highly desirable crosslingual alignment by human labor is very costly and
errorprone. Thus, we propose MTransE, a translation-based model for
multilingual knowledge graph embeddings, to provide a simple and automated
solution. By encoding entities and relations of each language in a separated
embedding space, MTransE provides transitions for each embedding vector to its
cross-lingual counterparts in other spaces, while preserving the
functionalities of monolingual embeddings. We deploy three different techniques
to represent cross-lingual transitions, namely axis calibration, translation
vectors, and linear transformations, and derive five variants for MTransE using
different loss functions. Our models can be trained on partially aligned
graphs, where just a small portion of triples are aligned with their
cross-lingual counterparts. The experiments on cross-lingual entity matching
and triple-wise alignment verification show promising results, with some
variants consistently outperforming others on different tasks. We also explore
how MTransE preserves the key properties of its monolingual counterpart TransE.

Distributed, Parallel, and Cluster Computing

Byzantine Processors and Cuckoo Birds: Confining Maliciousness to the Outset

Danny Dolev, Eli Gafni
Comments: arXiv admin note: text overlap with arXiv:1607.01210
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Are there Byzantine Animals? A Fooling Behavior is exhibited by the Cuckoo
bird. It sneakily replaces some of the eggs of other species with its own. Lest
the Cuckoo extinct itself by destroying its host, it self-limits its power: It
does not replace too large a fraction of the eggs. Here, we show that any
Byzantine Behavior that does not destroy the system it attacks, i.e. allows the
system to solve an easy task like epsilon-agreement, then its maliciousness can
be confined to be the exact replica of the Cuckoo bird behavior: Undetectably
replace an input of a processor and let the processor behave correctly
thereafter with respect to the new input. In doing so we reduce the study of
Byzantine behavior to fail-stop (benign) behavior with the Cuckoo caveat of a
fraction of the inputs replaced. We establish a complete correspondence between
the Byzantine and the Benign, modulo different thresholds, and replaced inputs.
This work is yet another step in a line of work unifying seemingly distinct
distributed system models, dispelling the Myth that Distributed Computing is a
plethora of distinct isolated models, each requiring its specialized tools and
ideas in order to determine solvability of tasks. Thus, hereafter, Byzantine
Computability questions can be reduced to questions in the benign failure
setting. We also show that the known results about correlated faults in the
asynchronous benign setting can be imported verbatim to the asynchronous
Byzantine setting. Finally, as in the benign case in which we have the property
that a processor can output once its faulty behavior stops for long enough, we
show this can be done in a similar manner in the Byzantine case. This
necessitated the generalization of Reliable Broadcast to what we term
Recoverable Reliable Broadcast.

Efficient Communications in Training Large Scale Neural Networks

Linnan Wang, Wei Wu, George Bosilca, Richard Vuduc, Zenglin Xu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

We consider the problem of how to reduce the cost of communication that is
required for the parallel training of a neural network. The state-of-the-art
method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD),
requires many collective communication operations, like broadcasts of
parameters or reductions for sub-gradient aggregations, which for large
messages quickly dominates overall execution time and limits parallel
scalability. To address this problem, we develop a new technique for collective
operations, referred to as Linear Pipelining (LP). It is tuned to the message
sizes that arise in BSP-SGD, and works effectively on multi-GPU systems.
Theoretically, the cost of LP is invariant to (P), where (P) is the number of
GPUs, while the cost of more conventional Minimum Spanning Tree (MST) scales
like (O(log P)). LP also demonstrate up to 2x faster bandwidth than
Bidirectional Exchange (BE) techniques that are widely adopted by current MPI
implementations. We apply these collectives to BSP-SGD, showing that the
proposed implementations reduce communication bottlenecks in practice while
preserving the attractive convergence properties of BSP-SGD.

A parallel workload has extreme varibility

R. Henwood, N. W. Watkins, S. C. Chapman, R. McLay
Comments: 7 pages, 2 figures, 1 code listing
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Analysis, Statistics and Probability (physics.data-an); Computation (stat.CO)

In both high-performance computing (HPC) environments and the public cloud,
the duration of time to retrieve or save your results is simultaneously
unpredictable and important to your over all resource budget. It is generally
accepted (“Google: Taming the Long Latency Tail – When More Machines Equals
Worse Results”, Todd Hoff, highscalability.com 2012), but without a robust
explanation, that identical parallel tasks do take different durations to
complete — a phenomena known as variability. This paper advances understanding
of this topic. We carefully choose a model from which system-level complexity
emerges that can be studied directly. We find that a generalized extreme value
(GEV) model for variability naturally emerges. Using the public cloud, we find
real-world observations have excellent agreement with our model. Since the GEV
distribution is a limit distribution this suggests a universal property of
parallel systems gated by the slowest communication element of some sort.
Hence, this model is applicable to a variety of processing and IO tasks in
parallel environments. These findings have important implications, ranging from
characterizing ideal performance for parallel codes to detecting degraded
behaviour at extreme scales.

Timestamps for Partial Replication

Zhuolun Xiang, Nitin H. Vaidya
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Maintaining causal consistency in distributed shared memory systems using
vector timestamps has received a lot of attention from both theoretical and
practical prospective. However, most of the previous literature focuses on full
replication where each data is stored in all replicas, which may not be
scalable due to the increasing amount of data. In this report, we investigate
how to achieve causal consistency in partial replicated systems, where each
replica may store different set of data. We propose an algorithm that tracks
causal dependencies via vector timestamp in client-server model for partial
replication. The cost of our algorithm in terms of timestamps size varies as a
function of the manner in which the replicas share data, and the set of
replicas accessed by each client. We also establish a connection between our
algorithm with the previous work on full replication.

Maintaining Acyclicity of Concurrent Graphs

Sathya Peri, Muktikanta Sa, Nandini Singhal
Comments: 23 pages 7 figures
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

In this paper, we consider the problem of preserving acyclicity in a directed
graph (for shared memory architecture) that is concurrently being updated by
threads adding/deleting vertices and edges. To the best of our knowledge, no
previous paper has presented a concurrent graph data structure. We implement
the concurrent directed graph data-structure as a concurrent adjacency list
representation. We extend the lazy list implementation of concurrent linked
lists for maintaining concurrent adjacency lists. There exists a number of
graph applications which require the acyclic invariant in a directed graph. One
such example is Serialization Graph Testing Algorithm used in databases and
transactional memory. We present two concurrent algorithms for maintaining
acyclicity in a concurrent graph: (i) Based on obstruction-free snapshots (ii)
Using wait-free reachability. We compare the performance of these algorithms
against the coarse-grained locking strategy, commonly used technique for
allowing concurrent updates. We present the speedup obtained by these
algorithms over sequential execution. As a future direction, we plan to extend
this data structure for other progress conditions.

On Location Hiding in Distributed Systems

Karol Gotfryd, Marek Klonowski, Dominik Pająk
Comments: Submitted to 10th International Conference on Algorithms and Complexity CIAC 2017
Subjects: Multiagent Systems (cs.MA); Distributed, Parallel, and Cluster Computing (cs.DC)

We consider the following problem – a group of mobile agents perform some
task on a terrain modeled as a graph. In a given moment of time an adversary
gets an access to the graph and positions of the agents. Shortly before
adversary’s observation the mobile agents have a chance to relocate themselves
in order to hide their initial configuration. We assume that the initial
configuration may possibly reveal to the adversary some information about the
task they performed. Clearly agents have to change their location in possibly
short time using minimal energy. In our paper we introduce a definition of a
emph{well hiding} algorithm in which the starting and final configurations of
the agents have small mutual information. Then we discuss the influence of
various features of the model on the running time of the optimal well-hiding
algorithm. We show that if the topology of the graph is known to the agents,
then the number of steps proportional to the diameter of the graph is
sufficient and necessary. In the unknown topology scenario we only consider a
single agent case. We first show that the task is impossible in the
deterministic case if the agent has no memory. Then we present a polynomial
randomized algorithm. Finally in the model with memory we show that the number
of steps proportional to the number of edges of the graph is sufficient and
necessary. In some sense we investigate how complex is the problem of “losing”
information about location (both physical and logical) for different settings.

Learning

How to scale distributed deep learning?

Peter H. Jin, Qiaochu Yuan, Forrest Iandola, Kurt Keutzer
Comments: Extended version of paper accepted at ML Sys 2016 (at NIPS 2016)
Subjects: Learning (cs.LG)

Training time on large datasets for deep neural networks is the principal
workflow bottleneck in a number of important applications of deep learning,
such as object classification and detection in automatic driver assistance
systems (ADAS). To minimize training time, the training of a deep neural
network must be scaled beyond a single machine to as many machines as possible
by distributing the optimization method used for training. While a number of
approaches have been proposed for distributed stochastic gradient descent
(SGD), at the current time synchronous approaches to distributed SGD appear to
be showing the greatest performance at large scale. Synchronous scaling of SGD
suffers from the need to synchronize all processors on each gradient step and
is not resilient in the face of failing or lagging processors. In asynchronous
approaches using parameter servers, training is slowed by contention to the
parameter server. In this paper we compare the convergence of synchronous and
asynchronous SGD for training a modern ResNet network architecture on the
ImageNet classification problem. We also propose an asynchronous method,
gossiping SGD, that aims to retain the positive features of both systems by
replacing the all-reduce collective operation of synchronous training with a
gossip aggregation algorithm. We find, perhaps counterintuitively, that
asynchronous SGD, including both elastic averaging and gossiping, converges
faster at fewer nodes (up to about 32 nodes), whereas synchronous SGD scales
better to more nodes (up to about 100 nodes).

Earliness-Aware Deep Convolutional Networks for Early Time Series Classification

Wenlin Wang, Changyou Chen, Wenqi Wang, Piyush Rai, Lawrence Carin
Subjects: Learning (cs.LG)

We present Earliness-Aware Deep Convolutional Networks (EA-ConvNets), an
end-to-end deep learning framework, for early classification of time series
data. Unlike most existing methods for early classification of time series
data, that are designed to solve this problem under the assumption of the
availability of a good set of pre-defined (often hand-crafted) features, our
framework can jointly perform feature learning (by learning a deep hierarchy of
emph{shapelets} capturing the salient characteristics in each time series),
along with a dynamic truncation model to help our deep feature learning
architecture focus on the early parts of each time series. Consequently, our
framework is able to make highly reliable early predictions, outperforming
various state-of-the-art methods for early time series classification, while
also being competitive when compared to the state-of-the-art time series
classification algorithms that work with emph{fully observed} time series
data. To the best of our knowledge, the proposed framework is the first to
perform data-driven (deep) feature learning in the context of early
classification of time series data. We perform a comprehensive set of
experiments, on several benchmark data sets, which demonstrate that our method
yields significantly better predictions than various state-of-the-art methods
designed for early time series classification. In addition to obtaining high
accuracies, our experiments also show that the learned deep shapelets based
features are also highly interpretable and can help gain better understanding
of the underlying characteristics of time series data.

Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes

Mengye Ren, Renjie Liao, Raquel Urtasun, Fabian H. Sinz, Richard S. Zemel
Subjects: Learning (cs.LG)

Normalization techniques have only recently begun to be exploited in
supervised learning tasks. Batch normalization exploits mini-batch statistics
to normalize the activations. This was shown to speed up training and result in
better models. However its success has been very limited when dealing with
recurrent neural networks. On the other hand, layer normalization normalizes
the activations across all activities within a layer. This was shown to work
well in the recurrent setting. In this paper we propose a unified view of
normalization techniques, as forms of divisive normalization, which includes
layer and batch normalization as special cases. Our second contribution is the
finding that a small modification to these normalization schemes, in
conjunction with a sparse regularizer on the activations, leads to significant
benefits over standard normalization techniques. We demonstrate the
effectiveness of our unified divisive normalization framework in the context of
convolutional neural nets and recurrent neural networks, showing improvements
over baselines in image classification, language modeling as well as
super-resolution.

On the Quantitative Analysis of Decoder-Based Generative Models

Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, Roger Grosse
Subjects: Learning (cs.LG)

The past several years have seen remarkable progress in generative models
which produce convincing samples of images and other modalities. A shared
component of many powerful generative models is a decoder network, a parametric
deep neural net that defines a generative distribution. Examples include
variational autoencoders, generative adversarial networks, and generative
moment matching networks. Unfortunately, it can be difficult to quantify the
performance of these models because of the intractability of log-likelihood
estimation, and inspecting samples can be misleading. We propose to use
Annealed Importance Sampling for evaluating log-likelihoods for decoder-based
models and validate its accuracy using bidirectional Monte Carlo. Using this
technique, we analyze the performance of decoder-based models, the
effectiveness of existing log-likelihood estimators, the degree of overfitting,
and the degree to which these models miss important modes of the data
distribution.

Identity Matters in Deep Learning

Moritz Hardt, Tengyu Ma
Subjects: Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

An emerging design principle in deep learning is that each layer of a deep
artificial neural network should be able to easily express the identity
transformation. This idea not only motivated various normalization techniques,
such as emph{batch normalization}, but was also key to the immense success of
emph{residual networks}.

In this work, we put the principle of emph{identity parameterization} on a
more solid theoretical footing alongside further empirical progress. We first
give a strikingly simple proof that arbitrarily deep linear residual networks
have no spurious local optima. The same result for linear feed-forward networks
in their standard parameterization is substantially more delicate. Second, we
show that residual networks with ReLu activations have universal finite-sample
expressivity in the sense that the network can represent any function of its
sample provided that the model has more parameters than the sample size.

Directly inspired by our theory, we experiment with a radically simple
residual architecture consisting of only residual convolutional layers and ReLu
activations, but no batch normalization, dropout, or max pool. Our model
improves significantly on previous all-convolutional networks on the CIFAR10,
CIFAR100, and ImageNet classification benchmarks.

Learning Sparse, Distributed Representations using the Hebbian Principle

Aseem Wadhwa, Upamanyu Madhow
Subjects: Learning (cs.LG)

The “fire together, wire together” Hebbian model is a central principle for
learning in neuroscience, but surprisingly, it has found limited applicability
in modern machine learning. In this paper, we take a first step towards
bridging this gap, by developing flavors of competitive Hebbian learning which
produce sparse, distributed neural codes using online adaptation with minimal
tuning. We propose an unsupervised algorithm, termed Adaptive Hebbian Learning
(AHL). We illustrate the distributed nature of the learned representations via
output entropy computations for synthetic data, and demonstrate superior
performance, compared to standard alternatives such as autoencoders, in
training a deep convolutional net on standard image datasets.

(CAD)(^2)RL: Real Single-Image Flight without a Single Real Image

Fereshteh Sadeghi, Sergey Levine
Comments: 11 pages
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

We propose (CAD)(^2)RL, a flight controller for Collision Avoidance via Deep
Reinforcement Learning that can be used to perform collision-free flight in the
real world although it is trained entirely in a 3D CAD model simulator. Our
method uses only single RGB images from a monocular camera mounted on the robot
as the input and is specialized for indoor hallway following and obstacle
avoidance. In contrast to most indoor navigation techniques that aim to
directly reconstruct the 3D geometry of the environment, our approach directly
predicts the probability of collision given the current monocular image and a
candidate action. To obtain accurate predictions, we develop a deep
reinforcement learning algorithm for learning indoor navigation, which uses the
actual performance of the current policy to construct accurate supervision. The
collision prediction model is represented by a deep convolutional neural
network that directly processes raw image inputs. Our collision avoidance
system is entirely trained in simulation and thus addresses the high sample
complexity of deep reinforcement learning and avoids the dangers of
trial-and-error learning in the real world. By highly randomizing the rendering
settings for our simulated training set, we show that we can train a collision
predictor that generalizes to new environments with substantially different
appearance from the training scenarios. Finally, we evaluate our method in the
real world by controlling a real quadrotor flying through real hallways. We
demonstrate that our method can perform real-world collision avoidance and
hallway following after being trained exclusively on synthetic images, without
ever having seen a single real image at the training time. For supplementary
video see: this https URL

Realistic risk-mitigating recommendations via inverse classification

Michael T. Lash, W. Nick Street
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Inverse classification, the process of making meaningful perturbations to a
test point such that it is more likely to have a desired classification, has
previously been addressed using data from a single static point in time. Such
an approach yields inflated probability estimates, stemming from an implicitly
made assumption that recommendations are implemented instantaneously. We
propose using longitudinal data to alleviate such issues in two ways. First, we
use past outcome probabilities as features in the present. Use of such past
probabilities ties historical behavior to the present, allowing for more
information to be taken into account when making initial probability estimates
and subsequently performing inverse classification. Secondly, following inverse
classification application, optimized instances’ unchangeable features
(e.g.,~age) are updated using values from the next longitudinal time period.
Optimized test instance probabilities are then reassessed. Updating the
unchangeable features in this manner reflects the notion that improvements in
outcome likelihood, which result from following the inverse classification
recommendations, do not materialize instantaneously. As our experiments
demonstrate, more realistic estimates of probability can be obtained by
factoring in such considerations.

Batched Gaussian Process Bandit Optimization via Determinantal Point Processes

Tarun Kathuria, Amit Deshpande, Pushmeet Kohli
Comments: To appear at NIPS 2016
Subjects: Learning (cs.LG)

Gaussian Process bandit optimization has emerged as a powerful tool for
optimizing noisy black box functions. One example in machine learning is
hyper-parameter optimization where each evaluation of the target function
requires training a model which may involve days or even weeks of computation.
Most methods for this so-called “Bayesian optimization” only allow sequential
exploration of the parameter space. However, it is often desirable to propose
batches or sets of parameter values to explore simultaneously, especially when
there are large parallel processing facilities at our disposal. Batch methods
require modeling the interaction between the different evaluations in the
batch, which can be expensive in complex scenarios. In this paper, we propose a
new approach for parallelizing Bayesian optimization by modeling the diversity
of a batch via Determinantal point processes (DPPs) whose kernels are learned
automatically. This allows us to generalize a previous result as well as prove
better regret bounds based on DPP sampling. Our experiments on a variety of
synthetic and real-world robotics and hyper-parameter optimization tasks
indicate that our DPP-based methods, especially those based on DPP sampling,
outperform state-of-the-art methods.

Prognostics of Surgical Site Infections using Dynamic Health Data

Chuyang Ke, Yan Jin, Heather Evans, Bill Lober, Xiaoning Qian, Ji Liu, Shuai Huang
Comments: 23 pages, 8 figures
Subjects: Learning (cs.LG)

Surgical Site Infection (SSI) is a national priority in healthcare research.
Much research attention has been attracted to develop better SSI risk
prediction models. However, most of the existing SSI risk prediction models are
built on static risk factors such as comorbidities and operative factors. In
this paper, we investigate the use of the dynamic wound data for SSI risk
prediction. There have been emerging mobile health (mHealth) tools that can
closely monitor the patients and generate continuous measurements of many
wound-related variables and other evolving clinical variables. Since existing
prediction models of SSI have quite limited capacity to utilize the evolving
clinical data, we develop the corresponding solution to equip these mHealth
tools with decision-making capabilities for SSI prediction with a seamless
assembly of several machine learning models to tackle the analytic challenges
arising from the spatial-temporal data. The basic idea is to exploit the
low-rank property of the spatial-temporal data via the bilinear formulation,
and further enhance it with automatic missing data imputation by the matrix
completion technique. We derive efficient optimization algorithms to implement
these models and demonstrate the superior performances of our new predictive
model on a real-world dataset of SSI, compared to a range of state-of-the-art
methods.

Dual Teaching: A Practical Semi-supervised Wrapper Method

Fuqaing Liu, Chenwei Deng, Fukun Bi, Yiding Yang
Comments: 7 pages, 4 figures, 1 table
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Semi-supervised wrapper methods are concerned with building effective
supervised classifiers from partially labeled data. Though previous works have
succeeded in some fields, it is still difficult to apply semi-supervised
wrapper methods to practice because the assumptions those methods rely on tend
to be unrealistic in practice. For practical use, this paper proposes a novel
semi-supervised wrapper method, Dual Teaching, whose assumptions are easy to
set up. Dual Teaching adopts two external classifiers to estimate the false
positives and false negatives of the base learner. Only if the recall of every
external classifier is greater than zero and the sum of the precision is
greater than one, Dual Teaching will train a base learner from partially
labeled data as effectively as the fully-labeled-data-trained classifier. The
effectiveness of Dual Teaching is proved in both theory and practice.

Anomaly Detection in Bitcoin Network Using Unsupervised Learning Methods

Thai Pham, Steven Lee
Subjects: Learning (cs.LG); Cryptography and Security (cs.CR)

The problem of anomaly detection has been studied for a long time. In short,
anomalies are abnormal or unlikely things. In financial networks, thieves and
illegal activities are often anomalous in nature. Members of a network want to
detect anomalies as soon as possible to prevent them from harming the network’s
community and integrity. Many Machine Learning techniques have been proposed to
deal with this problem; some results appear to be quite promising but there is
no obvious superior method. In this paper, we consider anomaly detection
particular to the Bitcoin transaction network. Our goal is to detect which
users and transactions are the most suspicious; in this case, anomalous
behavior is a proxy for suspicious behavior. To this end, we use three
unsupervised learning methods including k-means clustering, Mahalanobis
distance, and Unsupervised Support Vector Machine (SVM) on two graphs generated
by the Bitcoin transaction network: one graph has users as nodes, and the other
has transactions as nodes.

Personalized Donor-Recipient Matching for Organ Transplantation

Jinsung Yoon, Ahmed M. Alaa, Martin Cadeiras, Mihaela van der Schaar
Subjects: Learning (cs.LG)

Organ transplants can improve the life expectancy and quality of life for the
recipient but carries the risk of serious post-operative complications, such as
septic shock and organ rejection. The probability of a successful transplant
depends in a very subtle fashion on compatibility between the donor and the
recipient but current medical practice is short of domain knowledge regarding
the complex nature of recipient-donor compatibility. Hence a data-driven
approach for learning compatibility has the potential for significant
improvements in match quality. This paper proposes a novel system
(ConfidentMatch) that is trained using data from electronic health records.
ConfidentMatch predicts the success of an organ transplant (in terms of the 3
year survival rates) on the basis of clinical and demographic traits of the
donor and recipient. ConfidentMatch captures the heterogeneity of the donor and
recipient traits by optimally dividing the feature space into clusters and
constructing different optimal predictive models to each cluster. The system
controls the complexity of the learned predictive model in a way that allows
for assuring more granular and confident predictions for a larger number of
potential recipient-donor pairs, thereby ensuring that predictions are
“personalized” and tailored to individual characteristics to the finest
possible granularity. Experiments conducted on the UNOS heart transplant
dataset show the superiority of the prognostic value of ConfidentMatch to other
competing benchmarks; ConfidentMatch can provide predictions of success with
95% confidence for 5,489 patients of a total population of 9,620 patients,
which corresponds to 410 more patients than the most competitive benchmark
algorithm (DeepBoost).

Low Latency Anomaly Detection and Bayesian Network Prediction of Anomaly Likelihood

Derek Farren, Thai Pham, Marco Alban-Hidalgo
Subjects: Learning (cs.LG)

We develop a supervised machine learning model that detects anomalies in
systems in real time. Our model processes unbounded streams of data into time
series which then form the basis of a low-latency anomaly detection model.
Moreover, we extend our preliminary goal of just anomaly detection to
simultaneous anomaly prediction. We approach this very challenging problem by
developing a Bayesian Network framework that captures the information about the
parameters of the lagged regressors calibrated in the first part of our
approach and use this structure to learn local conditional probability
distributions.

Unsupervised Learning For Effective User Engagement on Social Media

Thai Pham, Camelia Simoiu
Subjects: Learning (cs.LG)

In this paper, we investigate the effectiveness of unsupervised feature
learning techniques in predicting user engagement on social media.
Specifically, we compare two methods to predict the number of feedbacks (i.e.,
comments) that a blog post is likely to receive. We compare Principal Component
Analysis (PCA) and sparse Autoencoder to a baseline method where the data are
only centered and scaled, on each of two models: Linear Regression and
Regression Tree. We find that unsupervised learning techniques significantly
improve the prediction accuracy on both models. For the Linear Regression
model, sparse Autoencoder achieves the best result, with an improvement in the
root mean squared error (RMSE) on the test set of 42% over the baseline method.
For the Regression Tree model, PCA achieves the best result, with an
improvement in RMSE of 15% over the baseline.

Learning the best algorithm for max-cut, clustering, and other partitioning problems

Maria-Florina Balcan, Vaishnavh Nagarajan, Ellen Vitercik, Colin White
Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Learning (cs.LG)

Many data analysis problems are NP-hard, a reality that has motivated
researchers to develop a wealth of approximation algorithms and heuristics over
the past few decades. Max-cut, clustering, and many other widely applicable
partitioning problems fall into this camp. We focus on two common classes of
algorithms that are used for clustering and partitioning problems, including
SDP rounding algorithms and agglomerative clustering algorithms with dynamic
programming. The best algorithm to use typically depends on the specific
application or distribution over instances. A worst-case analysis is often used
to compare algorithms, but worst-case instances may not be present in the
particular problem domain, and thus may be misleading when determining which
algorithm to apply. Therefore, it is necessary to develop optimization methods
which return the algorithms and parameters best suited for typical inputs from
the application at hand. We address this problem for integer quadratic
programming and clustering within a learning-theoretic framework where the
learning algorithm is given samples from an application-specific distribution
over problem instances and learns an algorithm which performs well over the
distribution. We provide sample complexity guarantees and computationally
efficient algorithms which optimize an algorithm family’s parameters to best
suit typical inputs from a particular application. We analyze these algorithms
using the learning-theoretic notion of pseudo-dimension. Along with upper
bounds, we provide the first pseudo-dimension lower bounds in this line of
work, which require an involved analysis of each algorithm family’s performance
on carefully constructed problem instances. Our bounds are tight and therefore
nail down the intrinsic complexity of the algorithm classes we analyze, which
are significantly more complex than classes commonly used in learning theory.

Benchmarking Quantum Hardware for Training of Fully Visible Boltzmann Machines

Dmytro Korenkevych, Yanbo Xue, Zhengbing Bian, Fabian Chudak, William G. Macready, Jason Rolfe, Evgeny Andriyash
Comments: 22 pages, 13 figures, D-Wave quantum system for sampling Boltzmann machines
Subjects: Quantum Physics (quant-ph); Learning (cs.LG); Machine Learning (stat.ML)

Quantum annealing (QA) is a hardware-based heuristic optimization and
sampling method applicable to discrete undirected graphical models. While
similar to simulated annealing, QA relies on quantum, rather than thermal,
effects to explore complex search spaces. For many classes of problems, QA is
known to offer computational advantages over simulated annealing. Here we
report on the ability of recent QA hardware to accelerate training of fully
visible Boltzmann machines. We characterize the sampling distribution of QA
hardware, and show that in many cases, the quantum distributions differ
significantly from classical Boltzmann distributions. In spite of this
difference, training (which seeks to match data and model statistics) using
standard classical gradient updates is still effective. We investigate the use
of QA for seeding Markov chains as an alternative to contrastive divergence
(CD) and persistent contrastive divergence (PCD). Using (k=50) Gibbs steps, we
show that for problems with high-energy barriers between modes, QA-based seeds
can improve upon chains with CD and PCD initializations. For these hard
problems, QA gradient estimates are more accurate, and allow for faster
learning. Furthermore, and interestingly, even the case of raw QA samples (that
is, (k=0)) achieved similar improvements. We argue that this relates to the
fact that we are training a quantum rather than classical Boltzmann
distribution in this case. The learned parameters give rise to hardware QA
distributions closely approximating classical Boltzmann distributions that are
hard to train with CD/PCD.

Deep Learning with Sets and Point Clouds

Siamak Ravanbakhsh, Jeff Schneider, Barnabas Poczos
Comments: under review at ICLR
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

We study a simple notion of structural invariance that readily suggests a
parameter-sharing scheme in deep neural networks. In particular, we define
structure as a collection of relations, and derive graph convolution and
recurrent neural networks as special cases. We study composition of basic
structures in defining models that are invariant to more complex “product”
structures such as graph of graphs, sets of images or sequence of sets. For
demonstration, our experimental results are focused on the setting where the
discrete structure of interest is a set. We present results on several novel
and non-trivial problems on sets, including semi-supervised learning using
clustering information, point-cloud classification and set outlier detection.

Generate Models and Model Criticism via Optimized Maximum Mean Discrepancy

We propose a method to optimize the representation and distinguishability of
samples from two probability distributions, by maximizing the estimated power
of a statistical test based on the maximum mean discrepancy (MMD). This
optimized MMD is applied to the setting of unsupervised learning by generative
adversarial networks (GAN), in which a model attempts to generate realistic
samples, and a discriminator attempts to tell these apart from data samples. In
this context, the MMD may be used in two roles: first, as a discriminator,
either directly on the samples, or on features of the samples. Second, the MMD
can be used to evaluate the performance of a generative model, by testing the
model’s samples against a reference data set. In the latter role, the optimized
MMD is particularly helpful, as it gives an interpretable indication of how the
model and data distributions differ, even in cases where individual model
samples are not easily distinguished either by eye or by classifier.

Attending to Characters in Neural Sequence Labeling Models

Marek Rei, Gamal K.O. Crichton, Sampo Pyysalo
Comments: Proceedings of COLING 2016
Subjects: Computation and Language (cs.CL); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Sequence labeling architectures use word embeddings for capturing similarity,
but suffer when handling previously unseen or rare words. We investigate
character-level extensions to such models and propose a novel architecture for
combining alternative word representations. By using an attention mechanism,
the model is able to dynamically decide how much information to use from a
word- or character-level component. We evaluated different architectures on a
range of sequence labeling datasets, and character-level extensions were found
to improve performance on every benchmark. In addition, the proposed
attention-based architecture delivered the best results even with a smaller
number of trainable parameters.

Riemannian Tensor Completion with Side Information

Tengfei Zhou, Hui Qian, Zebang Shen, Congfu Xu
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Numerical Analysis (cs.NA)

Riemannian optimization methods have shown to be both fast and accurate in
recovering a large-scale tensor from its incomplete observation. However, in
almost all recent Riemannian tensor completion methods, only low rank
constraint is considered. Another important fact, side information or features,
remains far from exploiting within the Riemannian optimization framework. In
this paper, we explicitly incorporate the side information into a Riemannian
minimization model. Specifically, a feature-embedded objective is designed to
substantially reduce the sample complexity. For such a Riemannian optimization,
a particular metric can be constructed based on the curvature of the objective,
which leads to a novel Riemannian conjugate gradient descent solver. Numerical
experiments suggest that our solver is more efficient than the state-of-the-art
when a highly accurate solution is required.

Reinforcement Learning of Contextual MDPs using Spectral Methods

Kamyar Azizzadenesheli, Alessandro Lazaric, Animashree Anandkumar
Subjects: Artificial Intelligence (cs.AI); Learning (cs.LG); Machine Learning (stat.ML)

We propose a new reinforcement learning (RL) algorithm for contextual Markov
decision processes (CMDP) using spectral methods. CMDPs are structured MDPs
where the dynamics and rewards depend on a smaller number of hidden states or
contexts. If the mapping between the hidden and observed states is known a
priori, then standard RL algorithms such as UCRL are guaranteed to attain low
regret. Is it possible to achieve regret of the same order even when the
mapping is unknown? We provide an affirmative answer in this paper. We exploit
spectral methods to learn the mapping from hidden to observed states with
guaranteed confidence bounds, and incorporate it into the UCRL-based framework
to obtain order-optimal regret.

Annealing Gaussian into ReLU: a New Sampling Strategy for Leaky-ReLU RBM

Chun-Liang Li, Siamak Ravanbakhsh, Barnabas Poczos
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

Restricted Boltzmann Machine (RBM) is a bipartite graphical model that is
used as the building block in energy-based deep generative models. Due to
numerical stability and quantifiability of the likelihood, RBM is commonly used
with Bernoulli units. Here, we consider an alternative member of exponential
family RBM with leaky rectified linear units — called leaky RBM. We first
study the joint and marginal distributions of leaky RBM under different
leakiness, which provides us important insights by connecting the leaky RBM
model and truncated Gaussian distributions. The connection leads us to a simple
yet efficient method for sampling from this model, where the basic idea is to
anneal the leakiness rather than the energy; — i.e., start from a fully
Gaussian/Linear unit and gradually decrease the leakiness over iterations. This
serves as an alternative to the annealing of the temperature parameter and
enables numerical estimation of the likelihood that are more efficient and more
accurate than the commonly used annealed importance sampling (AIS). We further
demonstrate that the proposed sampling algorithm enjoys faster mixing property
than contrastive divergence algorithm, which benefits the training without any
additional computational cost.

Information Theory

Detection Performance with Many Antennas Available for Bandwidth-Efficient Uplink Transmission in MU-MIMO Systems

Paulo Torres, Antonio Gusmao
Comments: 8 pages, 7 figures. arXiv admin note: text overlap with arXiv:1502.05597
Subjects: Information Theory (cs.IT)

This paper is concerned with SC/FDE for bandwidth-efficient uplink block
transmission, with QAM schemes, in a MU MIMO system. The number of BS receiver
antennas is assumed to be large, but not necessarily much larger than the
overall number of transmitter antennas jointly using the same time/frequency
resource at MT.

In this context, we consider several detection techniques and evaluate, in
detail, the corresponding detection performances (discussed with the help of
selected performance bounds), for a range of values regarding the number of
available BS receiver antennas. From our performance results, we conclude that
simple linear detection techniques, designed to avoid the need of complex
matrix inversions, can lead to unacceptably high error floor levels. However,
by combining the use of such simple linear detectors with an appropriate
interference cancellation procedure – within an iterative DF technique -, a
close approximation to the SIMO MFB performance can be achieved after a few
iterations, even for 64-QAM schemes, when the number of BS antennas is, at
least, five times higher than the number of antennas which are jointly used at
the user terminals.

Matrix Characterization for GFDM Systems: Low-Complexity MMSE Receivers and Optimal Prototype Filters

Po-Chih Chen, Borching Su, Yenming Huang
Comments: 16 pages, 10 figures
Subjects: Information Theory (cs.IT)

Generalized frequency division multiplexing (GFDM) is considered to be a
promising waveform candidate for 5G new radio. It features good properties such
as low out-of-band (OOB) radiation. One major drawback of GFDM known in the
literature is that a zero-forcing receiver suffers from the noise enhancement
effect since the GFDM (transmitter) matrix in general has a greater-than-unity
condition number. In this paper, we propose a new matrix-based characterization
of GFDM matrices, as opposed to traditional vector-based characterization with
prototype filters. The characterization is helpful in deriving properties of
GFDM matrices, including conditions on GFDM matrices being nonsingular and
unitary, respectively. Further, the condition on the existence of a form of
low-complexity MMSE receivers is also derived. It is found that such a receiver
under frequency-selective channels exists if and only if the GFDM transmitter
matrix is chosen to be unitary. For non-unitary transmitter matrices, a
low-complexity suboptimal MMSE receiver is proposed with a performance very
close to that of an MMSE receiver. Besides, optimal prototype filters in terms
of minimizing receiver mean square error (MSE) are derived and found to
correspond to the use of unitary GFDM matrices under many scenarios. These
optimal filters can be applied in GFDM systems without causing the problem of
noise enhancement, thereby having the same MSE performance as OFDM.
Furthermore, we find that GFDM matrices with a size of power of two do exist in
the class of unitary GFDM matrices. Finally, while the OOB radiation
performance of systems using a unitary GFDM matrix is not optimal in general,
it is shown that the OOB radiation can be satisfactorily low if the parameters
are carefully chosen.

Leech Constellations of Construction-A Lattices

Nicola di Pietro, Joseph J. Boutros
Comments: 30 pages, 5 figures
Subjects: Information Theory (cs.IT)

The problem of communicating over the AWGN channel with lattice codes is
addressed in this paper. Theoretically, Voronoi constellations have proven to
yield very powerful lattice codes when the fine lattice is AWGN-good and the
coarse lattice has an optimal shaping gain. However, achieving Shannon capacity
with these premises and practically implementable encoding algorithms is in
general not an easy task. In this work, constructions called Leech
constellations are proposed. They involve Construction-A lattices as fine
lattices and a direct sum of Leech lattices as shaping lattice. The combination
of these two ingredients allows to design finite lattice constellations of
dual-diagonal LDA lattices whose numerically measured waterfall region is
situated at less than 0:8 dB from Shannon capacity. A detailed explanation on
how to perform encoding and demapping is given. A very appreciable feature of
Leech constellations of dual-diagonal LDA lattices is that encoding, deocoding,
and demapping have all linear complexity in the block length.

Interference Cancellation at Receivers in Cache-Enabled Wireless Networks

Chenchen Yang, Yao Yao, Bin Xia, Kaibin Huang, Weiliang Xie, Yong Zhao
Subjects: Information Theory (cs.IT)

In this paper, we propose to exploit the limited cache packets as side
information to cancel incoming interference at the receiver side. We consider a
stochastic network where the random locations of base stations and users are
modeled using Poisson point processes. Caching schemes to reap both the local
caching gain and the interference cancellation gain for the users are developed
based on two factors: the density of different user subsets and the packets
cached in the corresponding subsets. The packet loss rate (PLR) is analyzed,
which depends on both the cached packets and the channel state information
(CSI) available at the receiver. Theoretical results reveal the tradeoff
between caching resource and wireless resource. The performance for different
caching schemes are analyzed and the minimum achievable PLR for the distributed
caching is derived.

Counting generalized Reed-Solomon codes

Peter Beelen, David Glynn, Tom Høholdt, Krishna Kaipa
Subjects: Information Theory (cs.IT); Discrete Mathematics (cs.DM)

In this article we count the number of generalized Reed-Solomon (GRS) codes
of dimension k and length n, including the codes coming from a non-degenerate
conic plus nucleus. We compare our results with known formulae for the number
of 3-dimensional MDS codes of length n=6,7,8,9.

Towards Information Privacy for the Internet of Things

Meng Sun, Wee Peng Tay, Xin He
Comments: 13 pages, 4 figures, submitted to IEEE transaction on signal processing
Subjects: Information Theory (cs.IT)

In an Internet of Things network, multiple sensors send information to a
fusion center for it to infer a public hypothesis of interest. However, the
same sensor information may be used by the fusion center to make inferences of
a private nature that the sensors wish to protect. To model this, we adopt a
decentralized hypothesis testing framework with binary public and private
hypotheses. Each sensor makes a private observation and utilizes a local sensor
decision rule or privacy mapping to summarize that observation before sending
it to the fusion center. Without assuming knowledge of the joint distribution
of the sensor observations and hypotheses, we adopt a nonparametric learning
approach to design local privacy mappings. We introduce the concept of an
empirical normalized risk, which provides a theoretical guarantee for the
network to achieve information privacy for the private hypothesis with high
probability when the number of training samples is large. We develop iterative
optimization algorithms to determine an appropriate privacy threshold and the
best sensor privacy mappings, and show that they converge. Numerical results on
both synthetic and real data sets suggest that our proposed approach yields low
error rates for inferring the public hypothesis, but high error rates for
detecting the private hypothesis.

An algebraic framework for end-to-end physical-layer network coding

Elisa Gorla, Alberto Ravagnani
Subjects: Information Theory (cs.IT); Commutative Algebra (math.AC)

We propose an algebraic setup for end-to-end physical-layer network coding
based on submodule transmission. We introduce a distance function between
modules, describe how it relates to information loss and errors, and show how
to compute it. Then we propose a definition of submodule error-correcting code,
and investigate bounds and constructions for such codes.

BDMA for Millimeter-Wave/Terahertz Massive MIMO Transmission with Per-Beam Synchronization

Li You, Xiqi Gao, Geoffrey Ye Li, Xiang-Gen Xia, Ni Ma
Comments: 29 pages, 4 figures
Subjects: Information Theory (cs.IT)

We propose beam division multiple access (BDMA) with per-beam synchronization
(PBS) in time and frequency for wideband massive multiple-input multiple-output
(MIMO) transmission over millimeter-wave (mmW)/Terahertz (THz) bands. We first
introduce a physically motivated beam domain channel model for massive MIMO and
demonstrate that the envelopes of the beam domain channel elements tend to be
independent of time and frequency when both the numbers of antennas at base
station and user terminals (UTs) tend to infinity. Motivated by the derived
beam domain channel properties, we then propose PBS for mmW/THz massive MIMO.
We show that both the effective delay and Doppler frequency spreads of wideband
massive MIMO channels with PBS are reduced by a factor of the number of UT
antennas compared with the conventional synchronization approaches.
Subsequently, we apply PBS to BDMA, investigate beam scheduling to maximize the
achievable ergodic rates for both uplink and downlink BDMA, and develop a
greedy beam scheduling algorithm. Simulation results verify the effectiveness
of BDMA with PBS for mmW/THz wideband massive MIMO systems in typical mobility
scenarios.

Optimal Placement Delivery Arrays

Minquan Cheng, Jing Jiang, Qifa Yan, Xiaohu Tang, Haitao Cao
Comments: 11pages, coded caching, placement delivery array
Subjects: Information Theory (cs.IT)

In wireless networks, coded caching scheme is an effective technique to
reduce network congestion during peak traffic times. A ((K,F,Z,S)) placement
delivery array (((K,F,Z,S))PDA in short) can be used to design a coded caching
scheme with the delivery rate (S/F) during the peak traffic times. Clearly in
order to minimize delivery rate, we only need to consider a ((K,F,Z,S))PDA with
minimum (S). For the fixed parameters (K), (F) and (Z), a ((K,F,Z,S))PDA with
the minimum (S) is called optimal. So it is meaningful to study optimal PDAs.

There are some known PDAs constructed by combinatorial design theory,
hypergraphs and so on. However there is only one class of optimal PDAs (IEEE
Trans. Inf. Theory 60:2856-2867, 2014) constructed by Ali and Niesen. We will
focus on constructing optimal PDAs. In this paper, two classes of lower bounds
on the value of (S) in a ((K,F,Z,S))PDA are derived first. Then two classes of
recursive constructions are proposed. Consequently (i) optimal PDAs with (Z=1)
and (F-1) for any positive integers (K) and (F) are obtained; (ii) several
infinite classes of optimal PDAs for some (K), (F) and (Z) are constructed.
Finally more existence of optimal PDAs with (Z=F-2) are given.

Millimeter Wave Beam Alignment: Large Deviations Analysis and Design Insights

Chunshan Liu, Min Li, Stephen V. Hanly, Iain B. Collings, Philip Whiting
Comments: 27 pages, 7 figures, submitted
Subjects: Information Theory (cs.IT)

In millimeter wave cellular communication, fast and reliable beam alignment
via beam training is crucial to harvest sufficient beamforming gain for the
subsequent data transmission. In this paper, we establish fundamental limits in
beam-alignment performance under both the exhaustive search and the
hierarchical search that adopts multi-resolution beamforming codebooks,
accounting for time-domain training overhead. Specifically, we derive lower and
upper bounds on the probability of misalignment for an arbitrary level in the
hierarchical search, based on a single-path channel model. Using the method of
large deviations, we characterize the decay rate functions of both bounds and
show that the bounds coincide as the training sequence length goes large. With
these results, we characterize the asymptotic misalignment probability of both
the hierarchical and exhaustive search, and show that the latter asymptotically
outperforms the former, subject to the same training overhead and codebook
resolution. We show via numerical results that this relative performance
behavior holds in the non-asymptotic regime. Moreover, the exhaustive search is
shown to achieve significantly higher worst-case spectrum efficiency than the
hierarchical search, when the pre-beamforming signal-to-noise ratio (SNR) is
relatively low. This study hence implies that the exhaustive search is more
effective for users situated not so close to base stations, as they tend to
have low SNR.

Fast Polarization and Finite-Length Scaling for Non-Stationary Channels

Hessam Mahdavifar
Subjects: Information Theory (cs.IT)

We consider the problem of polar coding for transmission over a
non-stationary sequence of independent binary-input memoryless symmetric (BMS)
channels (left{W_i
ight}_{i=1}^{infty}), where the (i)-th encoded bit is
transmitted over (W_i). We show, for the first time, a polar coding scheme that
achieves the effective average symmetric capacity ()
overline{I}(left{W_i
ight}_{i=1}^{infty}) := lim_{N
ightarrow infty}
frac{1}{N}sum_{i=1}^N I(W_i), () assuming that the limit exists. The polar
coding scheme is constructed using Ar{i}kan’s channel polarization
transformation in combination with certain permutations at each polarization
level and certain skipped operations. This guarantees a fast polarization
process that results in polar coding schemes with block lengths upper bounded
by a polynomial of (1/epsilon), where (epsilon) is the gap to the average
capacity. More specifically, given an arbitrary sequence of BMS channels
(left{W_i
ight}_{i=1}^{N}) and (P_e), where (0 < P_e <1), we construct a
polar code of length (N) and rate (R) guaranteeing a block error probability of
at most (P_e) for transmission over (left{W_i
ight}_{i=1}^{N}) such that ()
N leq frac{kappa}{(overline{I}_N – R)^{mu}} () where (mu) is a constant,
(kappa) is a constant depending on (P_e) and (mu), and (overline{I}_N) is
the average of the symmetric capacities (I(W_i)), for (i=1,2,,dots,N). We
further show a numerical upper bound on (mu) that is: (mu leq 10.78). The
encoding and decoding complexities of the constructed polar code preserves (O(N
log N)) complexity of Ar{i}kan’s polar codes.

Self-Calibration via Linear Least Squares

Shuyang Ling, Thomas Strohmer
Subjects: Information Theory (cs.IT)

Whenever we use devices to take measurements, calibration is indispensable.
While the purpose of calibration is to reduce bias and uncertainty in the
measurements, it can be quite difficult, expensive and sometimes even
impossible to implement. We study a challenging problem called
self-calibration, i.e., the task of designing an algorithm for devices so that
the algorithm is able to perform calibration automatically. More precisely, we
consider the setup (oldsymbol{y} = mathcal{A}(oldsymbol{d}) oldsymbol{x}
+ oldsymbol{epsilon}) where only partial information about the sensing
matrix (mathcal{A}(oldsymbol{d})) is known and where
(mathcal{A}(oldsymbol{d})) linearly depends on (oldsymbol{d}). The goal is
to estimate the calibration parameter (oldsymbol{d}) (resolve the uncertainty
in the sensing process) and the signal/object of interests (oldsymbol{x})
simultaneously. For three different models of practical relevance we show how
such a bilinear inverse problem, including blind deconvolution as an important
example, can be solved via a simple linear least squares approach. As a
consequence, the proposed algorithms are numerically extremely efficient, thus
allowing for real-time deployment. Explicit theoretical guarantees and
stability theory are derived and the number of sampling complexity is nearly
optimal (up to a poly-log factor). Applications in imaging sciences and signal
processing are discussed and numerical simulations are presented to demonstrate
the effectiveness and efficiency of our approach.

Energy-Based Adaptive Multiple Access in LoRaWAN IoT Systems with Energy Harvesting

Nicolo Michelusi, Marco Levorato
Comments: Submitted to IEEE ICASSP 2017
Subjects: Information Theory (cs.IT)

This paper considers a network of energy harvesting nodes which perform data
communication to a sink node over a multiple access channel. In order to reduce
the complexity of network control resulting from adaptation to the energy
storage level at each node, an optimization framework is proposed where energy
storage dynamics are replaced by dynamic average power constraints induced by
the time correlated energy supply, thus enabling lightweight and flexible
network control. The structure of the throughput-optimal genie-aided
transmission policy is derived under constraints on the average energy
consumption. The policy takes the form of the packet transmission probability
defined on the energy harvesting state and number of active wireless nodes.
Inspired by the genie-aided policy, a Bayesian estimation approach is proposed
to address the case where the base station estimates the number of active
wireless nodes based on the observed network transmission pattern, and
broadcasts low-overhead control information to the whole network.

Resource Allocation in Wireless Powered Relay Networks: A Bargaining Game Approach

Zijie Zheng, Lingyang Song, Dusit Niyato, Zhu Han
Comments: 14 pages, 7 figures, journal paper
Subjects: Information Theory (cs.IT); Computer Science and Game Theory (cs.GT)

Simultaneously information and power transfer in mobile relay networks have
recently emerged, where the relay can harvest the radio frequency (RF) energy
and then use this energy for data forwarding and system operation. Most of the
previous works do not consider that the relay may have its own objectives, such
as using the harvested energy for its own transmission instead of maximizing
transmission of the network. Therefore, in this paper, we propose a Nash
bargaining approach to balance the information transmission efficiency of
source-destination pairs and the harvested energy of the relay in a wireless
powered relay network with multiple source-destination pairs and one relay. We
analyze and prove that the Nash bargaining problem has several desirable
properties such as the discreteness and quasi-concavity, when it is decomposed
into three sub-problems: the energy transmission power optimization, the power
control for data transmission and the time division between energy transmission
and data transmission. Based on the theoretical analysis, we propose an
alternating power control and time division algorithm to find a suboptimal
solution. Simulation results clearly show and demonstrate the properties of the
problem and the convergence of our algorithm.

WINDOW: Wideband Demodulator for Optical Waveforms

Omri Lev, Tal Wiener, Deborah Cohen, Yonina C. Eldar
Subjects: Information Theory (cs.IT)

Optical communication systems, which operate at very high rates, are often
limited by the sampling rate bottleneck. The optical wideband regime may exceed
analog to digital converters (ADCs) front-end bandwidth. Multi-channel sampling
approaches, such as multicoset or interleaved ADCs, have been proposed to
sample the wideband signal using several channels. Each channel samples below
the Nyquist rate such that the overall sampling rate is preserved. However,
this scheme suffers from two practical limitations that make its implementation
difficult. First, the inherent anti-aliasing filter of the samplers distorts
the wideband signal. Second, it requires accurate time shifts on the order of
the signal’s Nyquist rate, which are challenging to maintain. In this work, we
propose an alternative multi-channel sampling scheme, the wideband demodulator
for optical waveforms (WINDOW), based on analog RF demodulation, where each
channel aliases the spectrum using a periodic mixing function before
integration and sampling. We show that intentionally using the inherent ADC
filter to perform integration increases the signal to noise ratio (SNR). We
demonstrate both theoretically and through numerical experiments that our
system outperforms multicoset in terms of signal recovery and symbol estimation
in the presence of both thermal and quantization noise but is slightly less
robust to timing jitter.

The Entropy Region is not Closed Under Duality

Tarik Kaced
Comments: submitted
Subjects: Information Theory (cs.IT)

We import a duality notion coming from polymatroids to define duality for
information inequalities. We show that the entropy region for (nge 5) is not
closed under duality. Our result answers an open question of Mat`uv{s}
(1992).

An Elementary Proof of Convex Phase Retrieval in the Natural Parameter Space via the Linear Program PhaseMax

Paul Hand, Vladislav Voroninski
Subjects: Information Theory (cs.IT); Optimization and Control (math.OC)

The phase retrieval problem has garnered significant attention since the
development of the PhaseLift algorithm, which is a convex program that operates
in a lifted space of matrices. Because of the substantial computational cost
due to lifting, many approaches to phase retrieval have been developed,
including non-convex optimization algorithms which operate in the natural
parameter space, such as Wirtinger Flow. Very recently, a convex formulation
called PhaseMax has been discovered, and it has been proven to achieve phase
retrieval via linear programming in the natural parameter space under optimal
sample complexity. The current proofs of PhaseMax rely on statistical learning
theory or geometric probability theory. Here, we present a short and elementary
proof that PhaseMax exactly recovers real-valued vectors from random
measurements under optimal sample complexity. Our proof only relies on standard
probabilistic concentration and covering arguments, yielding a simpler and more
direct proof than those that require statistical learning theory, geometric
probability or the highly technical arguments for Wirtinger Flow-like
approaches.

Some conjectures on codes

Clelia De Felice
Subjects: Formal Languages and Automata Theory (cs.FL); Information Theory (cs.IT)

Variable-length codes are the bases of the free submonoids of a free monoid.
There are some important longstanding open questions about the structure of
finite maximal codes. In this paper we discuss this conjectures and their
relations with factorizations of cyclic groups.

Good Integers and Applications in Coding Theory

Somphong Jitman
Comments: 21 pages
Subjects: Rings and Algebras (math.RA); Information Theory (cs.IT); Number Theory (math.NT)

A class of good integers has been introduced by P. Moree in (1997) together
with the characterization of good odd integers. Such integers have shown to
have nice number theoretical properties and wide applications. In this paper, a
complete characterization of all good integers is given.

Two subclasses of good integers are introduced, namely, oddly-good and
evenly-good integers. The characterization and properties of good integers in
the two subclasses are determined.

As applications, good integers and oddly-good integers are applied in the
study of the hulls of abelian codes. The average dimension of the hulls of
abelian codes is given together with some upper and lower bounds.

Nuclei and automorphism groups of generalized twisted Gabidulin codes

Rocco Trombetti, Yue Zhou
Comments: 20 pages
Subjects: Combinatorics (math.CO); Information Theory (cs.IT)

Generalized twisted Gabidulin codes are one of the few known families of
maximum rank matrix codes over finite fields. As a subset of m by n matrices,
when m=n, the automorphism group of any generalized twisted Gabidulin code has
been completely determined by the authors recently. In this paper, we consider
the same problem for m<n. Under certain conditions on their parameters, we
determine their middle nuclei and right nuclei, which are important invariants
with respect to the equivalence for rank metric codes. Furthermore, we also use
them to derive necessary conditions on the automorphisms of generalized twisted
Gabidulin codes.

Bounds and Constructions for (overline{3})-Strongly Separable Codes with Length (3)

Xuli Zhang, Jing Jiang, Minquan Cheng
Comments: 17 pages, separable codes, strongly separable codes
Subjects: Discrete Mathematics (cs.DM); Information Theory (cs.IT)

As separable code (SC, IEEE Trans Inf Theory 57:4843-4851, 2011) and
frameproof code (FPC, IEEE Trans Inf Theory 44:1897-1905, 1998) do in
multimedia fingerprinting, strongly separable code (SSC, Des. Codes and
Cryptogr.79:303-318, 2016) can be also used to construct anti-collusion codes.
Furthermore, SSC is better than FPC and SC in the applications for multimedia
fingerprinting since SSC has lower tracing complexity than that of SC (the same
complexity as FPC) and weaker structure than that of FPC. In this paper, we
first derive several upper bounds on the number of codewords of
(overline{t})-SSC. Then we focus on (overline{3})-SSC with codeword length
(3), and obtain the following two main results: (1) An equivalence between an
SSC and an SC. %Consequently a more tighter upper bound ((3q^2/4)) and lower
bound ((q^{3/2})) on the number of codewords are obtained; (2) An improved
lower bound (Omega (q^{5/3}+q^{4/3}-q)) on the size of a (q)-ary SSC when
(q=q_1^6) for any prime power (q_1equiv 1 pmod 6), better than the before
known bound (lfloorsqrt{q}
floor^{3}), which is obtained by means of
difference matrix and the known result on the subset of (mathbb{F}^{n}_q)
containing no three points on a line.

A linear-time benchmarking tool for generalized surface codes

Nicolas Delfosse, Pavithran Iyer, David Poulin
Comments: Software available online: this http URL
Subjects: Quantum Physics (quant-ph); Information Theory (cs.IT)

Quantum information processors need to be protected against errors and
faults. One of the most widely considered fault-tolerant architecture is based
on surface codes. While the general principles of these codes are well
understood and basic code properties such as minimum distance and rate are easy
to characterize, a code’s average performance depends on the detailed geometric
layout of the qubits. To date, optimizing a surface code architecture and
comparing different geometric layouts relies on costly numerical simulations.
Here, we propose a benchmarking algorithm for simulating the performance of
surface codes, and generalizations thereof, that runs in linear time. We
implemented this algorithm in a software that generates performance reports and
allows to quickly compare different architectures.

Sub-channel and Power Allocation for Non-orthogonal Multiple Access Relay Networks with Amplify-and-Forward Protocol

Shuhang Zhang, Boya Di, Lingyang Song, Yonghui Li
Comments: 30 pages, 6 figures, submit to Tran. Wireless Commun
Subjects: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT)

In this paper, we study the resource allocation problem for a single-cell
non-orthogonal multiple access (NOMA) relay network where an OFDM
amplify-and-forward (AF) relay allocates the spectrum and power resources to
the source-destination (SD) pairs. We aim to optimize the resource allocation
to maximize the average sum-rate. The optimal approach requires an exhaustive
search, leading to an NP-hard problem. To solve this problem, we propose two
efficient many-to-many two-sided SD pair-subchannel matching algorithms in
which the SD pairs and sub-channels are considered as two sets of players
chasing their own interests. The proposed algorithms can provide a sub-optimal
solution to this resource allocation problem in affordable time. Both the
static matching algorithm and dynamic matching algorithm converge to a
pair-wise stable matching after a limited number of iterations. Simulation
results show that the capacity of both proposed algorithms in the NOMA scheme
significantly outperforms the conventional orthogonal multiple access scheme.
The proposed matching algorithms in NOMA scheme also achieve a better
user-fairness performance than the conventional orthogonal multiple access.

Vulnerability analysis of coherent quantum cryptography protocols towards to an active beam-splitting attack

D.A. Kronberg, E.O. Kiktenko, A.K. Fedorov, Y.V. Kurochkin
Comments: 5 pages, 2 figures; comments are welcome
Subjects: Quantum Physics (quant-ph); Information Theory (cs.IT)

A novel attack on the coherent quantum key distribution (coherent one-way,
COW) protocol is considered. The key idea of the proposed attack is performing
of individual measurements on a fraction of intercepted states and transferring
or blocking of the remaining part depending on a measurement result.

Entropic Causal Inference

We consider the problem of identifying the causal direction between two
discrete random variables using observational data. Unlike previous work, we
keep the most general functional model but make an assumption on the unobserved
exogenous variable: Inspired by Occam’s razor, we assume that the exogenous
variable is simple in the true causal direction. We quantify simplicity using
R’enyi entropy. Our main result is that, under natural assumptions, if the
exogenous variable has low (H_0) entropy (cardinality) in the true direction,
it must have high (H_0) entropy in the wrong direction. We establish several
algorithmic hardness results about estimating the minimum entropy exogenous
variable. We show that the problem of finding the exogenous variable with
minimum entropy is equivalent to the problem of finding minimum joint entropy
given (n) marginal distributions, also known as minimum entropy coupling
problem. We propose an efficient greedy algorithm for the minimum entropy
coupling problem, that for (n=2) provably finds a local optimum. This gives a
greedy algorithm for finding the exogenous variable with minimum (H_1) (Shannon
Entropy). Our greedy entropy-based causal inference algorithm has similar
performance to the state of the art additive noise models in real datasets. One
advantage of our approach is that we make no use of the values of random
variables but only their distributions. Our method can therefore be used for
causal inference for both ordinal and also categorical data, unlike additive
noise models.