IT博客汇 | arXiv Paper Daily: Wed, 2 Nov 2016

arXiv Paper Daily: Wed, 2 Nov 2016

我爱机器学习(52ml.net)发表于 2016-11-02 00:00:00

Neural and Evolutionary Computing

Surrogate-Assisted Partial Order-based Evolutionary Optimisation

Vanessa Volz, Günter Rudolph, Boris Naujoks
Subjects: Neural and Evolutionary Computing (cs.NE)

In this paper, we propose a novel approach (SAPEO) to support the survival
selection process in multi-objective evolutionary algorithms with surrogate
models – it dynamically chooses individuals to evaluate exactly based on the
model uncertainty and the distinctness of the population. We introduce variants
that differ in terms of the risk they allow when doing survival selection.
Here, the anytime performance of different SAPEO variants is evaluated in
conjunction with an SMS-EMOA using the BBOB bi-objective benchmark. We compare
the obtained results with the performance of the regular SMS-EMOA, as well as
another surrogate-assisted approach. The results open up general questions
about the applicability and required conditions for surrogate-assisted
multi-objective evolutionary algorithms to be tackled in the future.

Full-Capacity Unitary Recurrent Neural Networks

Scott Wisdom, Thomas Powers, John R. Hershey, Jonathan Le Roux, Les Atlas
Comments: 9 pages, to appear in NIPS
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Recurrent neural networks are powerful models for processing sequential data,
but they are generally plagued by vanishing and exploding gradient problems.
Unitary recurrent neural networks (uRNNs), which use unitary recurrence
matrices, have recently been proposed as a means to avoid these issues.
However, in previous experiments, the recurrence matrices were restricted to be
a product of parameterized unitary matrices, and an open question remains: when
does such a parameterization fail to represent all unitary matrices, and how
does this restricted representational capacity limit what can be learned? To
address this question, we propose full-capacity uRNNs that optimize their
recurrence matrix over all unitary matrices, leading to significantly improved
performance over uRNNs that use a restricted-capacity recurrence matrix. Our
contribution consists of two main components. First, we provide a theoretical
argument to determine if a unitary parameterization has restricted capacity.
Using this argument, we show that a recently proposed unitary parameterization
has restricted capacity for hidden state dimension greater than 7. Second, we
show how a complete, full-capacity unitary recurrence matrix can be optimized
over the differentiable manifold of unitary matrices. The resulting
multiplicative gradient step is very simple and does not require gradient
clipping or learning rate adaptation. We confirm the utility of our claims by
empirically evaluating our new full-capacity uRNNs on both synthetic and
natural data, achieving superior performance compared to both LSTMs and the
original restricted-capacity uRNNs.

Computer Vision and Pattern Recognition

Structured illumination microscopy with unknown patterns and a statistical prior

Li-Hao Yeh, Lei Tian, Laura Waller
Subjects: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)

Structured illumination microscopy (SIM) improves resolution by
down-modulating high-frequency information of an object to fit within the
passband of the optical system. Generally, the reconstruction process requires
prior knowledge of the illumination patterns, which implies a well-calibrated
and aberration-free system. Here, we propose a new algorithmic self-calibration
strategy for SIM that does not need to know the exact patterns a priori, but
only their covariance. The algorithm, termed PE-SIMS, includes a
Pattern-Estimation (PE) step and a SIM reconstruction procedure using a
Statistical prior (SIMS). Additionally, we perform a pixel reassignment process
(SIMS-PR) to enhance the reconstruction quality. We achieve 2x better
resolution than a conventional widefield microscope, while remaining
insensitive to aberration-induced pattern distortion and robust against
parameter tuning compared to other blind SIM algorithms.

Dictionary Integration using 3D Morphable Face Models for Pose-invariant Collaborative-representation-based Classification

Xiaoning Song, Zhen-Hua Feng, Guosheng Hu, Josef Kittler, William Christmas, Xiao-Jun Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The paper presents a dictionary integration algorithm using 3D morphable face
models (3DMM) for pose-invariant collaborative-representation-based face
classification. To this end, we first fit a 3DMM to the 2D face images of a
dictionary to reconstruct the 3D shape and texture of each image. The 3D faces
are used to render a number of virtual 2D face images with arbitrary pose
variations to augment the training data, by merging the original and rendered
virtual samples {color{black}to create} an extended dictionary. Second, to
reduce the information redundancy of the extended dictionary and improve the
sparsity of reconstruction coefficient vectors using
collaborative-representation-based classification (CRC), we exploit an on-line
elimination scheme to optimise the extended dictionary by identifying the most
representative training samples for a given query. The final goal is to perform
pose-invariant face classification using the proposed dictionary integration
method and the on-line pruning strategy under the CRC framework. Experimental
results obtained for a set of well-known face datasets demonstrate the merits
of the proposed method, especially its robustness to pose variations.

Sliding Dictionary Based Sparse Representation For Action Recognition

Yashas Annadani, D L Rakshith, Soma Biswas
Comments: 7 Pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The task of action recognition has been in the forefront of research, given
its applications in gaming, surveillance and health care. In this work, we
propose a simple, yet very effective approach which works seamlessly for both
offline and online action recognition using the skeletal joints. We construct a
sliding dictionary which has the training data along with their time stamps.
This is used to compute the sparse coefficients of the input action sequence
which is divided into overlapping windows and each window gives a probability
score for each action class. In addition, we compute another simple feature,
which calibrates each of the action sequences to the training sequences, and
models the deviation of the action from the each of the training data. Finally,
a score level fusion of the two heterogeneous but complementary features for
each window is obtained and the scores for the available windows are
successively combined to give the confidence scores of each action class. This
way of combining the scores makes the approach suitable for scenarios where
only part of the sequence is available. Extensive experimental evaluation on
three publicly available datasets shows the effectiveness of the proposed
approach for both offline and online action recognition tasks.

Best-Buddies Tracking

Shaul Oron, Denis Suhanov, Shai Avidan
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Best-Buddies Tracking (BBT) applies the Best-Buddies Similarity measure (BBS)
to the problem of model-free online tracking. BBS was introduced as a
similarity measure between two point sets and was shown to be very effective
for template matching. Originally, BBS was designed to work with point sets of
equal size, and we propose a modification that lets it handle point sets of
different size. The modified BBS is better suited to handle scale changes in
the template size, as well as support a variable number of template images. We
embed the modified BBS in a particle filter framework and obtain good results
on a number of standard benchmarks.

Deep fusion of visual signatures for client-server facial analysis

Binod Bhattarai, Gaurav Sharma, Frederic Jurie
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Facial analysis is a key technology for enabling human-machine interaction.
In this context, we present a client-server framework, where a client transmits
the signature of a face to be analyzed to the server, and, in return, the
server sends back various information describing the face e.g. is the person
male or female, is she/he bald, does he have a mustache, etc. We assume that a
client can compute one (or a combination) of visual features; from very simple
and efficient features, like Local Binary Patterns, to more complex and
computationally heavy, like Fisher Vectors and CNN based, depending on the
computing resources available. The challenge addressed in this paper is to
design a common universal representation such that a single merged signature is
transmitted to the server, whatever be the type and number of features computed
by the client, ensuring nonetheless an optimal performance. Our solution is
based on learning of a common optimal subspace for aligning the different face
features and merging them into a universal signature. We have validated the
proposed method on the challenging CelebA dataset, on which our method
outperforms existing state-of-the-art methods when rich representation is
available at test time, while giving competitive performance when only simple
signatures (like LBP) are available at test time due to resource constraints on
the client.

Embedding Deep Metric for Person Re-identication A Study Against Large Variations

Hailin Shi, Yang Yang, Xiangyu Zhu, Shengcai Liao, Zhen Lei, Weishi Zheng, Stan Z. Li
Comments: Published in ECCV2016. arXiv admin note: substantial text overlap with arXiv:1511.07545
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

Person re-identification is challenging due to the large variations of pose,
illumination, occlusion and camera view. Owing to these variations, the
pedestrian data is distributed as highly-curved manifolds in the feature space,
despite the current convolutional neural networks (CNN)’s capability of feature
extraction. However, the distribution is unknown, so it is difficult to use the
geodesic distance when comparing two samples. In practice, the current deep
embedding methods use the Euclidean distance for the training and test. On the
other hand, the manifold learning methods suggest to use the Euclidean distance
in the local range, combining with the graphical relationship between samples,
for approximating the geodesic distance. From this point of view, selecting
suitable positive i.e. intra-class) training samples within a local range is
critical for training the CNN embedding, especially when the data has large
intra-class variations. In this paper, we propose a novel moderate positive
sample mining method to train robust CNN for person re-identification, dealing
with the problem of large variation. In addition, we improve the learning by a
metric weight constraint, so that the learned metric has a better
generalization ability. Experiments show that these two strategies are
effective in learning robust deep metrics for person re-identification, and
accordingly our deep model significantly outperforms the state-of-the-art
methods on several benchmarks of person re-identification. Therefore, the study
presented in this paper may be useful in inspiring new designs of deep models
for person re-identification.

A Benchmark Dataset and Saliency-guided Stacked Autoencoder for Video-based Salient Object Detection

Jia Li, Changqun Xia, Xiaowu Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Image-based salient object detection (SOD) has been extensively studied in
the past decades. However, video-based SOD is much less explored since there
lacks large-scale video datasets within which salient objects are unambiguously
defined and annotated. Toward this end, this paper proposes a video-based SOD
dataset that consists of 200 videos (64 minutes). In constructing the dataset,
we manually annotate all objects and regions over 7,650 uniformly sampled
keyframes and collect the eye-tracking data of 23 subjects that free-view all
videos. From the user data, we find salient objects in video can be defined as
objects that consistently pop-out throughout the video, and objects with such
attributes can be unambiguously annotated by combining manually annotated
object/region masks with eye-tracking data of multiple subjects. To the best of
our knowledge, it is currently the largest dataset for video-based salient
object detection.

Based on the dataset, this paper proposes an unsupervised approach for
video-based SOD by using a saliency-guided stacked autoencoder. In the proposed
approach, spatiotemporal saliency cues are first extracted at pixel, superpixel
and object levels. With these saliency cues, a stacked autoencoder is
unsupervisedly trained which can automatically infer a saliency score for each
pixel by progressively encoding the high-dimensional saliency cues gathered
from the pixel and its spatiotemporal neighbors. Experimental results show that
the proposed approach outperforms 19 image-based and 5 video-based models on
the proposed dataset. Moreover, the comprehensive benchmarking results show
that the proposed dataset is very challenging and have the potential to greatly
boost the development of video-based SOD.

Learning recurrent representations for hierarchical behavior modeling

Eyrun Eyjolfsdottir, Kristin Branson, Yisong Yue, Pietro Perona
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

We propose a framework for detecting action patterns from motion sequences
and modeling the sensory-motor relationship of animals, using a generative
recurrent neural network. The network has a discriminative part (classifying
actions) and a generative part (predicting motion), whose recurrent cells are
laterally connected, allowing higher levels of the network to represent high
level phenomena. We test our framework on two types of data, fruit fly behavior
and online handwriting. Our results show that 1) taking advantage of unlabeled
sequences, by predicting future motion, significantly improves action detection
performance when training labels are scarce, 2) the network learns to represent
high level phenomena such as writer identity and fly gender, without
supervision, and 3) simulated motion trajectories, generated by treating motion
prediction as input to the network, look realistic and may be used to
qualitatively evaluate whether the model has learnt generative control rules.

Exploiting Spatio-Temporal Structure with Recurrent Winner-Take-All Networks

Eder Santana, Matthew Emigh, Pablo Zerges, Jose C Principe
Comments: under review
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

We propose a convolutional recurrent neural network, with Winner-Take-All
dropout for high dimensional unsupervised feature learning in multi-dimensional
time series. We apply the proposedmethod for object recognition with temporal
context in videos and obtain better results than comparable methods in the
literature, including the Deep Predictive Coding Networks previously proposed
by Chalasani and Principe.Our contributions can be summarized as a scalable
reinterpretation of the Deep Predictive Coding Networks trained end-to-end with
backpropagation through time, an extension of the previously proposed
Winner-Take-All Autoencoders to sequences in time, and a new technique for
initializing and regularizing convolutional-recurrent neural networks.

Artificial Intelligence

Using Artificial Intelligence to Identify State Secrets

Renato Rocha Souza, Flavio Codeco Coelho, Rohan Shah, Matthew Connelly
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Whether officials can be trusted to protect national security information has
become a matter of great public controversy, reigniting a long-standing debate
about the scope and nature of official secrecy. The declassification of
millions of electronic records has made it possible to analyze these issues
with greater rigor and precision. Using machine-learning methods, we examined
nearly a million State Department cables from the 1970s to identify features of
records that are more likely to be classified, such as international
negotiations, military operations, and high-level communications. Even with
incomplete data, algorithms can use such features to identify 90% of classified
cables with <11% false positives. But our results also show that there are
longstanding problems in the identification of sensitive information. Error
analysis reveals many examples of both overclassification and
underclassification. This indicates both the need for research on inter-coder
reliability among officials as to what constitutes classified material and the
opportunity to develop recommender systems to better manage both classification
and declassification.

Detecting Affordances by Visuomotor Simulation

Wolfram Schenck, Hendrik Hasenbein, Ralf Möller
Comments: 26 pages, 8 figures
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)

The term “affordance” denotes the behavioral meaning of objects. We propose a
cognitive architecture for the detection of affordances in the visual modality.
This model is based on the internal simulation of movement sequences. For each
movement step, the resulting sensory state is predicted by a forward model,
which in turn triggers the generation of a new (simulated) motor command by an
inverse model. Thus, a series of mental images in the sensory and in the motor
domain is evoked. Starting from a real sensory state, a large number of such
sequences is simulated in parallel. Final affordance detection is based on the
generated motor commands. We apply this model to a real-world mobile robot
which is faced with obstacle arrangements some of which are passable (corridor)
and some of which are not (dead ends). The robot’s task is to detect the right
affordance (“pass-through-able” or “non-pass-through-able”). The required
internal models are acquired in a hierarchical training process. Afterwards,
the robotic agent is able to distinguish reliably between corridors and dead
ends. This real-world result enhances the validity of the proposed mental
simulation approach. In addition, we compare several key factors in the
simulation process regarding performance and efficiency.

Local Subspace-Based Outlier Detection using Global Neighbourhoods

Bas van Stein, Matthijs van Leeuwen, Thomas Bäck
Comments: Short version accepted at IEEE BigData 2016
Subjects: Artificial Intelligence (cs.AI)

Outlier detection in high-dimensional data is a challenging yet important
task, as it has applications in, e.g., fraud detection and quality control.
State-of-the-art density-based algorithms perform well because they 1) take the
local neighbourhoods of data points into account and 2) consider feature
subspaces. In highly complex and high-dimensional data, however, existing
methods are likely to overlook important outliers because they do not
explicitly take into account that the data is often a mixture distribution of
multiple components.

We therefore introduce GLOSS, an algorithm that performs local subspace
outlier detection using global neighbourhoods. Experiments on synthetic data
demonstrate that GLOSS more accurately detects local outliers in mixed data
than its competitors. Moreover, experiments on real-world data show that our
approach identifies relevant outliers overlooked by existing methods,
confirming that one should keep an eye on the global perspective even when
doing local outlier detection.

Learning recurrent representations for hierarchical behavior modeling

Eyrun Eyjolfsdottir, Kristin Branson, Yisong Yue, Pietro Perona
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

We propose a framework for detecting action patterns from motion sequences
and modeling the sensory-motor relationship of animals, using a generative
recurrent neural network. The network has a discriminative part (classifying
actions) and a generative part (predicting motion), whose recurrent cells are
laterally connected, allowing higher levels of the network to represent high
level phenomena. We test our framework on two types of data, fruit fly behavior
and online handwriting. Our results show that 1) taking advantage of unlabeled
sequences, by predicting future motion, significantly improves action detection
performance when training labels are scarce, 2) the network learns to represent
high level phenomena such as writer identity and fly gender, without
supervision, and 3) simulated motion trajectories, generated by treating motion
prediction as input to the network, look realistic and may be used to
qualitatively evaluate whether the model has learnt generative control rules.

Towards Blended Reactive Planning and Acting using Behavior Trees

Michele Colledanchise, Diogo Almeida, Petter Ögren
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

In this paper, we study the problem of using a planning algorithm to
automatically create and update a Behavior Tree (BT), controlling a robot in a
dynamic environment. Exploiting the characteristic of BTs, in terms of
modularity and reactivity, the robot continually acts and plans to achieve a
given goal using a set of abstract actions and conditions. The construction of
the BT is based on an extension of the Hybrid Backward-Forward algorithm (HBF)
that allows us to refine the acting process by mapping the descriptive models
onto operational models of actions, thus integrating the ability of planning in
infinite state space of HBF with the continuous modular reactive action
execution of BTs. We believe that this might be a first step to address the
recently raised open challenge in automated planning: the need of a
hierarchical structure and a continuous online planning and acting framework.
We prove the convergence of the proposed approach as well as the absence of
deadlocks and livelocks, and we illustrate our approach in two different
robotics scenarios.

Towards Lifelong Self-Supervision: A Deep Learning Direction for Robotics

Jay M. Wong
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Despite outstanding success in vision amongst other domains, many of the
recent deep learning approaches have evident drawbacks for robots. This
manuscript surveys recent work in the literature that pertain to applying deep
learning systems to the robotics domain, either as means of estimation or as a
tool to resolve motor commands directly from raw percepts. These recent
advances are only a piece to the puzzle. We suggest that deep learning as a
tool alone is insufficient in building a unified framework to acquire general
intelligence. For this reason, we complement our survey with insights from
cognitive development and refer to ideas from classical control theory,
producing an integrated direction for a lifelong learning architecture.

Robust Spectral Inference for Joint Stochastic Matrix Factorization

Moontae Lee, David Bindel, David Mimno
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI)

Spectral inference provides fast algorithms and provable optimality for
latent topic analysis. But for real data these algorithms require additional
ad-hoc heuristics, and even then often produce unusable results. We explain
this poor performance by casting the problem of topic inference in the
framework of Joint Stochastic Matrix Factorization (JSMF) and showing that
previous methods violate the theoretical conditions necessary for a good
solution to exist. We then propose a novel rectification method that learns
high quality topics and their interactions even on small, noisy data. This
method achieves results comparable to probabilistic techniques in several
domains while maintaining scalability and provable optimality.

Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, Ni Lao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Learning (cs.LG)

Extending the success of deep neural networks to natural language
understanding and symbolic reasoning requires complex operations and external
memory. Recent neural program induction approaches have attempted to address
this problem, but are typically limited to differentiable memory, and
consequently cannot scale beyond small synthetic tasks. In this work, we
propose the Manager-Programmer-Computer framework, which integrates neural
networks with non-differentiable memory to support abstract, scalable and
precise operations through a friendly neural computer interface. Specifically,
we introduce a Neural Symbolic Machine, which contains a sequence-to-sequence
neural “programmer”, and a non-differentiable “computer” that is a Lisp
interpreter with code assist. To successfully apply REINFORCE for training, we
augment it with approximate gold programs found by an iterative maximum
likelihood training process. NSM is able to learn a semantic parser from weak
supervision over a large knowledge base. It achieves new state-of-the-art
performance on WebQuestionsSP, a challenging semantic parsing dataset, with
weak supervision. Compared to previous approaches, NSM is end-to-end, therefore
does not rely on feature engineering or domain specific knowledge.

Information Retrieval

Product-based Neural Networks for User Response Prediction

Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, Jun Wang
Comments: 6 pages, 5 figures, ICDM2016
Subjects: Learning (cs.LG); Information Retrieval (cs.IR)

Predicting user responses, such as clicks and conversions, is of great
importance and has found its usage in many Web applications including
recommender systems, web search and online advertising. The data in those
applications is mostly categorical and contains multiple fields; a typical
representation is to transform it into a high-dimensional sparse binary feature
representation via one-hot encoding. Facing with the extreme sparsity,
traditional models may limit their capacity of mining shallow patterns from the
data, i.e. low-order feature combinations. Deep models like deep neural
networks, on the other hand, cannot be directly applied for the
high-dimensional input because of the huge feature space. In this paper, we
propose a Product-based Neural Networks (PNN) with an embedding layer to learn
a distributed representation of the categorical data, a product layer to
capture interactive patterns between inter-field categories, and further fully
connected layers to explore high-order feature interactions. Our experimental
results on two large-scale real-world ad click datasets demonstrate that PNNs
consistently outperform the state-of-the-art models on various metrics.

MusicMood: Predicting the mood of music from song lyrics using machine learning

Sebastian Raschka
Comments: 9 pages, 5 figures
Subjects: Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Sentiment prediction of contemporary music can have a wide-range of
applications in modern society, for instance, selecting music for public
institutions such as hospitals or restaurants to potentially improve the
emotional well-being of personnel, patients, and customers, respectively. In
this project, music recommendation system built upon on a naive Bayes
classifier, trained to predict the sentiment of songs based on song lyrics
alone. The experimental results show that music corresponding to a happy mood
can be detected with high precision based on text features obtained from song
lyrics.

CBAS: context based arabic stemmer

Mahmoud El-Defrawy, Yasser El-Sonbaty, Nahla A. Belal
Journal-ref: International Journal on Natural Language Computing (IJNLC) Vol.
4, No.3, June 2015
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Arabic morphology encapsulates many valuable features such as word root.
Arabic roots are being utilized for many tasks; the process of extracting a
word root is referred to as stemming. Stemming is an essential part of most
Natural Language Processing tasks, especially for derivative languages such as
Arabic. However, stemming is faced with the problem of ambiguity, where two or
more roots could be extracted from the same word. On the other hand,
distributional semantics is a powerful co-occurrence model. It captures the
meaning of a word based on its context. In this paper, a distributional
semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is
constructed to investigate its effectiveness on the stemming analysis task. It
showed an accuracy of 81.5%, with a at least 9.4% improvement over other
stemmers.

Computation and Language

Faster decoding for subword level Phrase-based SMT between related languages

Anoop Kunchukuttan, Pushpak Bhattacharyya
Comments: Accepted at VarDial3 (Third Workshop on NLP for Similar Languages, Varieties and Dialects) collocated with COLING 2016; 7 pages
Subjects: Computation and Language (cs.CL)

A common and effective way to train translation systems between related
languages is to consider sub-word level basic units. However, this increases
the length of the sentences resulting in increased decoding time. The increase
in length is also impacted by the specific choice of data format for
representing the sentences as subwords. In a phrase-based SMT framework, we
investigate different choices of decoder parameters as well as data format and
their impact on decoding time and translation accuracy. We suggest best options
for these settings that significantly improve decoding time with little impact
on the translation accuracy.

Recurrent Neural Network Language Model Adaptation Derived Document Vector

Wei Li, Brian Kan, Wing Mak
Subjects: Computation and Language (cs.CL)

In many natural language processing (NLP) tasks, a document is commonly
modeled as a bag of words using the term frequency-inverse document frequency
(TF-IDF) vector. One major shortcoming of the frequency-based TF-IDF feature
vector is that it ignores word orders that carry syntactic and semantic
relationships among the words in a document, and they can be important in some
NLP tasks such as genre classification. This paper proposes a novel distributed
vector representation of a document: a simple recurrent-neural-network language
model (RNN-LM) or a long short-term memory RNN language model (LSTM-LM) is
first created from all documents in a task; some of the LM parameters are then
adapted by each document, and the adapted parameters are vectorized to
represent the document. The new document vectors are labeled as DV-RNN and
DV-LSTM respectively. We believe that our new document vectors can capture some
high-level sequential information in the documents, which other current
document representations fail to capture. The new document vectors were
evaluated in the genre classification of documents in three corpora: the Brown
Corpus, the BNC Baby Corpus and an artificially created Penn Treebank dataset.
Their classification performances are compared with the performance of TF-IDF
vector and the state-of-the-art distributed memory model of paragraph vector
(PV-DM). The results show that DV-LSTM significantly outperforms TF-IDF and
PV-DM in most cases, and combinations of the proposed document vectors with
TF-IDF or PV-DM may further improve performance.

Dual Learning for Machine Translation

Yingce Xia, Di He, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, Wei-Ying Ma
Journal-ref: NIPS 2016
Subjects: Computation and Language (cs.CL)

While neural machine translation (NMT) is making good progress in the past
two years, tens of millions of bilingual sentence pairs are needed for its
training. However, human labeling is very costly. To tackle this training data
bottleneck, we develop a dual-learning mechanism, which can enable an NMT
system to automatically learn from unlabeled data through a dual-learning game.
This mechanism is inspired by the following observation: any machine
translation task has a dual task, e.g., English-to-French translation (primal)
versus French-to-English translation (dual); the primal and dual tasks can form
a closed loop, and generate informative feedback signals to train the
translation models, even if without the involvement of a human labeler. In the
dual-learning mechanism, we use one agent to represent the model for the primal
task and the other agent to represent the model for the dual task, then ask
them to teach each other through a reinforcement learning process. Based on the
feedback signals generated during this process (e.g., the language-model
likelihood of the output of a model, and the reconstruction error of the
original sentence after the primal and dual translations), we can iteratively
update the two models until convergence (e.g., using the policy gradient
methods). We call the corresponding approach to neural machine translation
emph{dual-NMT}. Experiments show that dual-NMT works very well on
English(leftrightarrow)French translation; especially, by learning from
monolingual data (with 10% bilingual data for warm start), it achieves a
comparable accuracy to NMT trained from the full bilingual data for the
French-to-English translation task.

Improving Twitter Sentiment Classification via Multi-Level Sentiment-Enriched Word Embeddings

Shufeng Xiong
Subjects: Computation and Language (cs.CL)

Most of existing work learn sentiment-specific word representation for
improving Twitter sentiment classification, which encoded both n-gram and
distant supervised tweet sentiment information in learning process. They assume
all words within a tweet have the same sentiment polarity as the whole tweet,
which ignores the word its own sentiment polarity. To address this problem, we
propose to learn sentiment-specific word embedding by exploiting both lexicon
resource and distant supervised information. We develop a multi-level
sentiment-enriched word embedding learning method, which uses parallel
asymmetric neural network to model n-gram, word level sentiment and tweet level
sentiment in learning process. Experiments on standard benchmarks show our
approach outperforms state-of-the-art methods.

RNN Approaches to Text Normalization: A Challenge

Richard Sproat, Navdeep Jaitly
Comments: 17 pages, 13 tables, 3 figures
Subjects: Computation and Language (cs.CL)

This paper presents a challenge to the community: given a large corpus of
written text aligned to its normalized spoken form, train an RNN to learn the
correct normalization function. We present a data set of general text where the
normalizations were generated using an existing text normalization component of
a text-to-speech system. This data set will be released open-source in the near
future.

We also present our own experiments with this data set with a variety of
different RNN architectures. While some of the architectures do in fact produce
very good results when measured in terms of overall accuracy, the errors that
are produced are problematic, since they would convey completely the wrong
message if such a system were deployed in a speech application. On the other
hand, we show that a simple FST-based filter can mitigate those errors, and
achieve a level of accuracy not achievable by the RNN alone.

Though our conclusions are largely negative on this point, we are actually
not arguing that the text normalization problem is intractable using an pure
RNN approach, merely that it is not going to be something that can be solved
merely by having huge amounts of annotated text data and feeding that to a
general RNN model. And when we open-source our data, we will be providing a
novel data set for sequence-to-sequence modeling in the hopes that the the
community can find better solutions.

CBAS: context based arabic stemmer

Arabic morphology encapsulates many valuable features such as word root.
Arabic roots are being utilized for many tasks; the process of extracting a
word root is referred to as stemming. Stemming is an essential part of most
Natural Language Processing tasks, especially for derivative languages such as
Arabic. However, stemming is faced with the problem of ambiguity, where two or
more roots could be extracted from the same word. On the other hand,
distributional semantics is a powerful co-occurrence model. It captures the
meaning of a word based on its context. In this paper, a distributional
semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is
constructed to investigate its effectiveness on the stemming analysis task. It
showed an accuracy of 81.5%, with a at least 9.4% improvement over other
stemmers.

Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, Ni Lao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Learning (cs.LG)

Extending the success of deep neural networks to natural language
understanding and symbolic reasoning requires complex operations and external
memory. Recent neural program induction approaches have attempted to address
this problem, but are typically limited to differentiable memory, and
consequently cannot scale beyond small synthetic tasks. In this work, we
propose the Manager-Programmer-Computer framework, which integrates neural
networks with non-differentiable memory to support abstract, scalable and
precise operations through a friendly neural computer interface. Specifically,
we introduce a Neural Symbolic Machine, which contains a sequence-to-sequence
neural “programmer”, and a non-differentiable “computer” that is a Lisp
interpreter with code assist. To successfully apply REINFORCE for training, we
augment it with approximate gold programs found by an iterative maximum
likelihood training process. NSM is able to learn a semantic parser from weak
supervision over a large knowledge base. It achieves new state-of-the-art
performance on WebQuestionsSP, a challenging semantic parsing dataset, with
weak supervision. Compared to previous approaches, NSM is end-to-end, therefore
does not rely on feature engineering or domain specific knowledge.

MusicMood: Predicting the mood of music from song lyrics using machine learning

Sebastian Raschka
Comments: 9 pages, 5 figures
Subjects: Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Sentiment prediction of contemporary music can have a wide-range of
applications in modern society, for instance, selecting music for public
institutions such as hospitals or restaurants to potentially improve the
emotional well-being of personnel, patients, and customers, respectively. In
this project, music recommendation system built upon on a naive Bayes
classifier, trained to predict the sentiment of songs based on song lyrics
alone. The experimental results show that music corresponding to a happy mood
can be detected with high precision based on text features obtained from song
lyrics.

Distributed, Parallel, and Cluster Computing

Self-Awareness of Cloud Applications

Alexandru Iosup, Xiaoyun Zhu, Arif Merchant, Eva Kalyvianaki, Martina Maggio, Simon Spinner, Tarek Abdelzaher, Ole Mengshoel, Sara Bouchenak
Comments: Overall: 40 pages, 1 figure, 135 references. Note: This is an extended survey. A much shorter, revised version of this material will be available in print, as part of a Springer book on “Self-Aware Computing”. The book is due to appear in 2017
Subjects: Software Engineering (cs.SE); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI); Systems and Control (cs.SY)

Cloud applications today deliver an increasingly larger portion of the
Information and Communication Technology (ICT) services. To address the scale,
growth, and reliability of cloud applications, self-aware management and
scheduling are becoming commonplace. How are they used in practice? In this
chapter, we propose a conceptual framework for analyzing state-of-the-art
self-awareness approaches used in the context of cloud applications. We map
important applications corresponding to popular and emerging application
domains to this conceptual framework, and compare the practical
characteristics, benefits, and drawbacks of self-awareness approaches. Last, we
propose a roadmap for addressing open challenges in self-aware cloud and
datacenter applications.

Learning

Semi-Supervised Radio Signal Identification

Timothy J. O'Shea, Nathan West, Matthew Vondal, T. Charles Clancy
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Radio recognition in complex multi-user environments is an important tool for
optimizing spectrum utilization, identifying and minimizing interference, and
enforcing spectrum policy. Radio data is readily available and easy to obtain,
but labeled data is often scarce making supervised learning strategies
difficult and time consuming to curate. We demonstrate that semi-supervised
learning techniques can be used to scale learning beyond supervised datasets,
allowing for both discerning and recalling radio signals of interest by using
sparse signal representations based on both unsupervised and supervised methods
for nonlinear feature learning.

Recurrent Neural Radio Anomaly Detection

Timothy J O'Shea, T. Charles Clancy, Robert W. McGwier
Subjects: Learning (cs.LG)

We introduce a powerful recurrent neural network based method for novelty
detection to the application of detecting radio anomalies. This approach holds
promise in significantly increasing the ability of naive anomaly detection to
detect small anomalies in highly complex complexity multi-user radio bands. We
demonstrate the efficacy of this approach on a number of common real over the
air radio communications bands of interest and quantify detection performance
in terms of probability of detection an false alarm rates across a range of
interference to band power ratios and compare to baseline methods.

Stationary time-vertex signal processing

Andreas Loukas, Nathanaël Perraudin
Subjects: Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)

The goal of this paper is to improve learning for multivariate processes
whose structure is dependent on some known graph topology. Typically, the graph
information is incorporated to the learning process via a smoothness assumption
postulating that the values supported on well connected vertices exhibit small
variations. We argue that smoothness is not enough. To capture the behavior of
complex interconnected systems, such as transportation and biological networks,
it is important to train expressive models, being able to reproduce a wide
range of graph and temporal behaviors. Motivated by this need, this paper puts
forth a novel definition of time-vertex wide-sense stationarity, or joint
stationarity for short. We believe that the proposed definition is natural, at
it elegantly relates to existing definitions of stationarity in the time and
vertex domains. We use joint stationarity to regularize learning and to reduce
computational complexity in both estimation and recovery tasks. In particular,
we show that for any jointly stationary process: (a) one can learn the
covariance structure from O(1) samples, and (b) can solve MMSE recovery
problems, such as interpolation, denoising, forecasting, in complexity that is
linear to the edges and timesteps. Experiments with three datasets suggest that
joint stationarity can yield significant accuracy improvements in the
reconstruction effort.

Improving a Credit Scoring Model by Incorporating Bank Statement Derived Features

Rory P. Bunker, Wenjun Zhang, M. Asif Naeem
Subjects: Learning (cs.LG)

In this paper, we investigate the extent to which features derived from bank
statements provided by loan applicants, and which are not declared on an
application form, can enhance a credit scoring model for a New Zealand lending
company. Exploring the potential of such information to improve credit scoring
models in this manner has not been studied previously. We construct a baseline
model based solely on the existing scoring features obtained from the loan
application form, and a second baseline model based solely on the new bank
statement-derived features. A combined feature model is then created by
augmenting the application form features with the new bank statement derived
features. Our experimental results using ROC analysis show that a combined
feature model performs better than both of the two baseline models, and show
that a number of the bank statement-derived features have value in improving
the credit scoring model. The target data set used for modelling was highly
imbalanced, and Naive Bayes was found to be the best performing model, and
outperformed a number of other classifiers commonly used in credit scoring,
suggesting its potential for future use on highly imbalanced data sets.

Robust Spectral Inference for Joint Stochastic Matrix Factorization

Moontae Lee, David Bindel, David Mimno
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI)

Spectral inference provides fast algorithms and provable optimality for
latent topic analysis. But for real data these algorithms require additional
ad-hoc heuristics, and even then often produce unusable results. We explain
this poor performance by casting the problem of topic inference in the
framework of Joint Stochastic Matrix Factorization (JSMF) and showing that
previous methods violate the theoretical conditions necessary for a good
solution to exist. We then propose a novel rectification method that learns
high quality topics and their interactions even on small, noisy data. This
method achieves results comparable to probabilistic techniques in several
domains while maintaining scalability and provable optimality.

Product-based Neural Networks for User Response Prediction

Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, Jun Wang
Comments: 6 pages, 5 figures, ICDM2016
Subjects: Learning (cs.LG); Information Retrieval (cs.IR)

Predicting user responses, such as clicks and conversions, is of great
importance and has found its usage in many Web applications including
recommender systems, web search and online advertising. The data in those
applications is mostly categorical and contains multiple fields; a typical
representation is to transform it into a high-dimensional sparse binary feature
representation via one-hot encoding. Facing with the extreme sparsity,
traditional models may limit their capacity of mining shallow patterns from the
data, i.e. low-order feature combinations. Deep models like deep neural
networks, on the other hand, cannot be directly applied for the
high-dimensional input because of the huge feature space. In this paper, we
propose a Product-based Neural Networks (PNN) with an embedding layer to learn
a distributed representation of the categorical data, a product layer to
capture interactive patterns between inter-field categories, and further fully
connected layers to explore high-order feature interactions. Our experimental
results on two large-scale real-world ad click datasets demonstrate that PNNs
consistently outperform the state-of-the-art models on various metrics.

MusicMood: Predicting the mood of music from song lyrics using machine learning

Sebastian Raschka
Comments: 9 pages, 5 figures
Subjects: Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Sentiment prediction of contemporary music can have a wide-range of
applications in modern society, for instance, selecting music for public
institutions such as hospitals or restaurants to potentially improve the
emotional well-being of personnel, patients, and customers, respectively. In
this project, music recommendation system built upon on a naive Bayes
classifier, trained to predict the sentiment of songs based on song lyrics
alone. The experimental results show that music corresponding to a happy mood
can be detected with high precision based on text features obtained from song
lyrics.

Bayesian Adaptive Data Analysis Guarantees from Subgaussianity

Sam Elder
Subjects: Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)

The new field of adaptive data analysis seeks to provide algorithms and
provable guarantees for models of machine learning that allow researchers to
reuse their data, which normally falls outside of the usual statistical
paradigm of static data analysis. In 2014, Dwork, Feldman, Hardt, Pitassi,
Reingold and Roth introduced one potential model and proposed several solutions
based on differential privacy. In previous work in 2016, we described a problem
with this model and instead proposed a Bayesian variant, but also found that
the analogous Bayesian methods cannot achieve the same statistical guarantees
as in the static case.

In this paper, we prove the first positive results for the Bayesian model,
showing that with a Dirichlet prior, the posterior mean algorithm indeed
matches the statistical guarantees of the static case. We conjecture that this
is true for any conjugate prior from the exponential family, but can only prove
this in special cases.

The main ingredient, Theorem 4, shows that the ( ext{Beta}(alpha,eta))
distribution is subgaussian with variance proxy (O(1/(alpha+eta+1))), a
concentration result also of independent interest. Unlike most moment-based
concentration techniques, which bound the centered moments, our proof utilizes
a simple condition on the raw moments of a positive random variable.

Kernel Bandwidth Selection for SVDD: Peak Criterion Approach for Large Data

Sergiy Peredriy, Deovrat Kakde, Arin Chaudhuri
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Support Vector Data Description (SVDD) provides a useful approach to
construct a description of multivariate data for single-class classification
and outlier detection with various practical applications. Gaussian kernel used
in SVDD formulation allows flexible data description defined by observations
designated as support vectors. The data boundary of such description is
non-spherical and conforms to the geometric features of the data. By varying
the Gaussian kernel bandwidth parameter, the SVDD-generated boundary can be
made either smoother (more spherical) or tighter/jagged. The former case may
lead to under-fitting, whereas the latter may result in overfitting. Peak
criterion has been proposed to select an optimal value of the kernel bandwidth
to strike the balance between the data boundary smoothness and its ability to
capture the general geometric shape of the data. Peak criterion involves
training SVDD at various values of the kernel bandwidth parameter. When
training datasets are large, the time required to obtain the optimal value of
the Gaussian kernel bandwidth parameter according to Peak method can become
prohibitively large. This paper proposes an extension of Peak method for the
case of large data. The proposed method gives good results when applied to
several datasets. Two existing alternative methods of computing the Gaussian
kernel bandwidth parameter (Coefficient of Variation and Distance to the
Farthest Neighbor) were modified to allow comparison with the proposed method
on convergence. Empirical comparison demonstrates the advantage of the proposed
method.

Exploiting Spatio-Temporal Structure with Recurrent Winner-Take-All Networks

Eder Santana, Matthew Emigh, Pablo Zerges, Jose C Principe
Comments: under review
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

We propose a convolutional recurrent neural network, with Winner-Take-All
dropout for high dimensional unsupervised feature learning in multi-dimensional
time series. We apply the proposedmethod for object recognition with temporal
context in videos and obtain better results than comparable methods in the
literature, including the Deep Predictive Coding Networks previously proposed
by Chalasani and Principe.Our contributions can be summarized as a scalable
reinterpretation of the Deep Predictive Coding Networks trained end-to-end with
backpropagation through time, an extension of the previously proposed
Winner-Take-All Autoencoders to sequences in time, and a new technique for
initializing and regularizing convolutional-recurrent neural networks.

Computationally Efficient Influence Maximization in Stochastic and Adversarial Models: Algorithms and Analysis

Justin Khim, Varun Jog, Po-Ling Loh
Comments: 56 pages, 2 figures, 1 table
Subjects: Social and Information Networks (cs.SI); Learning (cs.LG); Machine Learning (stat.ML)

We consider the problem of influence maximization in fixed networks, for both
stochastic and adversarial contagion models. The common goal is to select a
subset of nodes of a specified size to infect so that the number of infected
nodes at the conclusion of the epidemic is as large as possible. In the
stochastic setting, the epidemic spreads according to a general triggering
model, which includes the popular linear threshold and independent cascade
models. We establish upper and lower bounds for the influence of an initial
subset of nodes in the network, where the influence is defined as the expected
number of infected nodes. Although the problem of exact influence computation
is NP-hard in general, our bounds may be evaluated efficiently, leading to
scalable algorithms for influence maximization with rigorous theoretical
guarantees. In the adversarial spreading setting, an adversary is allowed to
specify the edges through which contagion may spread, and the player chooses
sets of nodes to infect in successive rounds. Both the adversary and player may
behave stochastically, but we limit the adversary to strategies that are
oblivious of the player’s actions. We establish upper and lower bounds on the
minimax pseudo-regret in both undirected and directed networks.

Surpassing Gradient Descent Provably: A Cyclic Incremental Method with Linear Convergence Rate

Aryan Mokhtari, Mert Gürbüzbalaban, Alejandro Ribeiro
Subjects: Optimization and Control (math.OC); Learning (cs.LG)

Recently, there has been growing interest in developing optimization methods
for solving large-scale machine learning problems. Most of these problems boil
down to the problem of minimizing an average of a finite set of smooth and
strongly convex functions where the number of functions (n) is large. Gradient
descent method (GD) is successful in minimizing convex problems at a fast
linear rate; however, it is not applicable to the considered large-scale
optimization setting because of the high computational complexity. Incremental
methods resolve this drawback of gradient methods by replacing the required
gradient for the descent direction with an incremental gradient approximation.
They operate by evaluating one gradient per iteration and executing the average
of the (n) available gradients as a gradient approximate. Although, incremental
methods reduce the computational cost of GD, their convergence rates do not
justify their advantage relative to GD in terms of the total number of gradient
evaluations until convergence. In this paper, we introduce a Double Incremental
Aggregated Gradient method (DIAG) that computes the gradient of only one
function at each iteration, which is chosen based on a cyclic scheme, and uses
the aggregated average gradient of all the functions to approximate the full
gradient. The iterates of the proposed DIAG method uses averages of both
iterates and gradients in oppose to classic incremental methods that utilize
gradient averages but do not utilize iterate averages. We prove that not only
the proposed DIAG method converges linearly to the optimal solution, but also
its linear convergence factor justifies the advantage of incremental methods on
GD. In particular, we prove that the worst case performance of DIAG is better
than the worst case performance of GD.

Stochastic Variational Deep Kernel Learning

Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, Eric P. Xing
Comments: 13 pages, 6 tables, 3 figures. Appearing in NIPS 2016
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Methodology (stat.ME)

Deep kernel learning combines the non-parametric flexibility of kernel
methods with the inductive biases of deep learning architectures. We propose a
novel deep kernel learning model and stochastic variational inference procedure
which generalizes deep kernel learning approaches to enable classification,
multi-task learning, additive covariance structures, and stochastic gradient
training. Specifically, we apply additive base kernels to subsets of output
features from deep neural architectures, and jointly learn the parameters of
the base kernels and deep network through a Gaussian process marginal
likelihood objective. Within this framework, we derive an efficient form of
stochastic variational inference which leverages local kernel interpolation,
inducing points, and structure exploiting algebra. We show improved performance
over stand alone deep networks, SVMs, and state of the art scalable Gaussian
processes on several classification benchmarks, including an airline delay
dataset containing 6 million training points, CIFAR, and ImageNet.

The (χ)-Divergence for Approximate Inference

Adji B. Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, David M. Blei
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)

Variational inference enables Bayesian analysis for complex probabilistic
models with massive data sets. It works by positing a family of distributions
and finding the member in the family that is closest to the posterior. While
successful, variational methods can run into pathologies; for example, they
typically underestimate posterior uncertainty. We propose CHI-VI, a
complementary algorithm to traditional variational inference with KL((q) ||
(p)) and an alternative algorithm to EP. CHI-VI is a black box algorithm that
minimizes the (chi)-divergence from the posterior to the family of
approximating distributions. In EP, only local minimization of the KL((p) ||
(q)) objective is possible. In contrast, CHI-VI optimizes a well-defined global
objective. It directly minimizes an upper bound to the model evidence that
equivalently minimizes the (chi)-divergence. In experiments, we illustrate the
utility of the upper bound for sandwich estimating the model evidence. We also
compare several probabilistic models and a Cox process for basketball data. We
find CHI-VI often yields better classification error rates and better posterior
uncertainty.

Enhanced Factored Three-Way Restricted Boltzmann Machines for Speech Detection

Pengfei Sun, Jun Qin
Comments: 8 pages
Journal-ref: IEEE Signal Processing Letter 2016
Subjects: Sound (cs.SD); Learning (cs.LG); Machine Learning (stat.ML)

In this letter, we propose enhanced factored three way restricted Boltzmann
machines (EFTW-RBMs) for speech detection. The proposed model incorporates
conditional feature learning by multiplying the dynamical state of the third
unit, which allows a modulation over the visible-hidden node pairs. Instead of
stacking previous frames of speech as the third unit in a recursive manner, the
correlation related weighting coefficients are assigned to the contextual
neighboring frames. Specifically, a threshold function is designed to capture
the long-term features and blend the globally stored speech structure. A
factored low rank approximation is introduced to reduce the parameters of the
three-dimensional interaction tensor, on which non-negative constraint is
imposed to address the sparsity characteristic. The validations through the
area-under-ROC-curve (AUC) and signal distortion ratio (SDR) show that our
approach outperforms several existing 1D and 2D (i.e., time and time-frequency
domain) speech detection algorithms in various noisy environments.

Application Specific Instrumentation (ASIN): A Bio-inspired Paradigm to Instrumentation using recognition before detection

Amit Kumar Mishra
Subjects: Other Computer Science (cs.OH); Learning (cs.LG)

In this paper we present a new scheme for instrumentation, which has been
inspired by the way small mammals sense their environment. We call this scheme
Application Specific Instrumentation (ASIN). A conventional instrumentation
system focuses on gathering as much information about the scene as possible.
This, usually, is a generic system whose data can be used by another system to
take a specific action. ASIN fuses these two steps into one. The major merit of
the proposed scheme is that it uses low resolution sensors and much less
computational overhead to give good performance for a highly specialised
application

Embedding Deep Metric for Person Re-identication A Study Against Large Variations

Person re-identification is challenging due to the large variations of pose,
illumination, occlusion and camera view. Owing to these variations, the
pedestrian data is distributed as highly-curved manifolds in the feature space,
despite the current convolutional neural networks (CNN)’s capability of feature
extraction. However, the distribution is unknown, so it is difficult to use the
geodesic distance when comparing two samples. In practice, the current deep
embedding methods use the Euclidean distance for the training and test. On the
other hand, the manifold learning methods suggest to use the Euclidean distance
in the local range, combining with the graphical relationship between samples,
for approximating the geodesic distance. From this point of view, selecting
suitable positive i.e. intra-class) training samples within a local range is
critical for training the CNN embedding, especially when the data has large
intra-class variations. In this paper, we propose a novel moderate positive
sample mining method to train robust CNN for person re-identification, dealing
with the problem of large variation. In addition, we improve the learning by a
metric weight constraint, so that the learned metric has a better
generalization ability. Experiments show that these two strategies are
effective in learning robust deep metrics for person re-identification, and
accordingly our deep model significantly outperforms the state-of-the-art
methods on several benchmarks of person re-identification. Therefore, the study
presented in this paper may be useful in inspiring new designs of deep models
for person re-identification.

Full-Capacity Unitary Recurrent Neural Networks

Recurrent neural networks are powerful models for processing sequential data,
but they are generally plagued by vanishing and exploding gradient problems.
Unitary recurrent neural networks (uRNNs), which use unitary recurrence
matrices, have recently been proposed as a means to avoid these issues.
However, in previous experiments, the recurrence matrices were restricted to be
a product of parameterized unitary matrices, and an open question remains: when
does such a parameterization fail to represent all unitary matrices, and how
does this restricted representational capacity limit what can be learned? To
address this question, we propose full-capacity uRNNs that optimize their
recurrence matrix over all unitary matrices, leading to significantly improved
performance over uRNNs that use a restricted-capacity recurrence matrix. Our
contribution consists of two main components. First, we provide a theoretical
argument to determine if a unitary parameterization has restricted capacity.
Using this argument, we show that a recently proposed unitary parameterization
has restricted capacity for hidden state dimension greater than 7. Second, we
show how a complete, full-capacity unitary recurrence matrix can be optimized
over the differentiable manifold of unitary matrices. The resulting
multiplicative gradient step is very simple and does not require gradient
clipping or learning rate adaptation. We confirm the utility of our claims by
empirically evaluating our new full-capacity uRNNs on both synthetic and
natural data, achieving superior performance compared to both LSTMs and the
original restricted-capacity uRNNs.

Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, Ni Lao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Learning (cs.LG)

Extending the success of deep neural networks to natural language
understanding and symbolic reasoning requires complex operations and external
memory. Recent neural program induction approaches have attempted to address
this problem, but are typically limited to differentiable memory, and
consequently cannot scale beyond small synthetic tasks. In this work, we
propose the Manager-Programmer-Computer framework, which integrates neural
networks with non-differentiable memory to support abstract, scalable and
precise operations through a friendly neural computer interface. Specifically,
we introduce a Neural Symbolic Machine, which contains a sequence-to-sequence
neural “programmer”, and a non-differentiable “computer” that is a Lisp
interpreter with code assist. To successfully apply REINFORCE for training, we
augment it with approximate gold programs found by an iterative maximum
likelihood training process. NSM is able to learn a semantic parser from weak
supervision over a large knowledge base. It achieves new state-of-the-art
performance on WebQuestionsSP, a challenging semantic parsing dataset, with
weak supervision. Compared to previous approaches, NSM is end-to-end, therefore
does not rely on feature engineering or domain specific knowledge.

Information Theory

Energy Efficiency Optimization with Simultaneous Wireless Information and Power Transfer in MIMO Broadcast Channels

Jie Tang, Daniel K. C. So, Arman Shojaeifard, Kai-Kit Wong
Subjects: Information Theory (cs.IT)

Simultaneous wireless information and power transfer (SWIPT) is anticipated
to have great applications in fifth-generation (5G) and beyond communication
systems. In this paper, we address the energy efficiency (EE) optimization
problem for SWIPT multiple-input multiple-output broadcast channel (MIMO-BC)
with time-switching (TS) receiver design. Our aim is to maximize the EE of the
system whilst satisfying certain constraints in terms of maximum transmit power
and minimum harvested energy per user. The coupling of the optimization
variables, namely, transmit covariance matrices and TS ratios, leads to a EE
problem which is non-convex, and hence very difficult to solve directly. Hence,
we transform the original maximization problem with multiple constraints into a
min-max problem with a single constraint and multiple auxiliary variables. We
propose a dual inner/outer layer resource allocation framework to tackle the
problem. For the inner-layer, we invoke an extended SWIPT-based BC-multiple
access channel (MAC) duality approach and provide two iterative resource
allocation schemes under fixed auxiliary variables for solving the dual MAC
problem. A sub-gradient searching scheme is then proposed for the outer-layer
in order to obtain the optimal auxiliary variables. Numerical results confirm
the effectiveness of the proposed algorithms and illustrate that significant
performance gain in terms of EE can be achieved by adopting the proposed
extended BC-MAC duality-based algorithm.

Detection of Single vs Multiple Antenna Transmission Systems Using Pilot Data

Thakshila Wimalajeewa, Pramod K Varshney, Wei Su
Subjects: Information Theory (cs.IT)

In this paper, we consider the problem of classifying the transmission system
when it is not known a priori whether the transmission is via a single antenna
or multiple antennas. The receiver is assumed to be employed with a known
number of antennas. In a data frame transmitted by most multiple input multiple
output (MIMO) systems, some pilot or training data is inserted for symbol
timing synchronization and estimation of the channel. Our goal is to perform
MIMO transmit antenna classification using this pilot data. More specifically,
the problem of determining the transmission system is cast as a multiple
hypothesis testing problem where the number of hypotheses is equal to the
maximum number of transmit antennas. Under the assumption of receiver having
the exact knowledge of pilot data used for timing synchronization and channel
estimation, we consider maximum likelihood (ML) and correlation test statistics
to classify the MIMO transmit system. When only probabilistic knowledge of
pilot data is available at the receiver, a hybrid-maximum likelihood (HML)
based test statistic is constructed using the expectation-maximization (EM)
algorithm. The performance of the proposed algorithms is illustrated via
simulations and comparative merits of different techniques in terms of the
computational complexity and performance are discussed.

Generalized Entropy Concentration for Counts

Kostas N. Oikonomou
Subjects: Information Theory (cs.IT)

We consider the phenomenon of entropy concentration under linear constraints
in a discrete setting, using the “balls and bins” paradigm, but without the
assumption that the number of balls allocated to the bins is known. Therefore
instead of frequency vectors and ordinary entropy, we have count vectors with
unknown sum, and a certain generalized entropy. We show that if the constraints
bound the allowable sums, this suffices for concentration to occur even in this
setting. The concentration can be either in terms of deviation from the maximum
generalized entropy value, or in terms of the norm of the difference from the
maximum generalized entropy vector. Without any asymptotic considerations, we
quantify the concentration in terms of various parameters, notably a tolerance
on the constraints which ensures that they are always satisfied by an integral
vector. Generalized entropy maximization is not only compatible with ordinary
MaxEnt, but can also be considered an extension of it, as it allows us to
address problems that cannot be formulated as MaxEnt problems.

Joint Antenna Selection and Spatial Switching for Energy Efficient MIMO SWIPT System

Jie Tang, Daniel K. C. So, Arman Shojaeifard, Kai-Kit Wong, Jinming Wen
Subjects: Information Theory (cs.IT)

In this paper, we investigate joint antenna selection and spatial switching
(SS) for quality-of-service (QoS)-constrained energy efficiency (EE)
optimization in a multiple-input multiple-output (MIMO) simultaneous wireless
information and power transfer (SWIPT) system. A practical linear power model
taking into account the entire transmit-receive chain is accordingly utilized.
The corresponding fractional-combinatorial and non-convex EE problem, involving
joint optimization of eigen-channel assignment, power allocation, and active
receive antenna set selection, subject to satisfying minimum sum-rate and power
transfer constraints, is extremely difficult to solve directly. In order to
tackle this, we separate the eigen-channel assignment and power allocation
procedure with the antenna selection functionality. In particular, we first
tackle the EE maximization problem under fixed receive antenna set using
Dinkelbach-based convex programming, iterative joint eigen-channel assignment
and power allocation, and low-complexity multi-objective optimization
(MOO)-based approach. On the other hand, the number of active receive antennas
induces a trade-off in the achievable sum-rate and power transfer versus the
transmit-independent power consumption. We provide a fundamental study of the
achievable EE with antenna selection and accordingly develop dynamic optimal
exhaustive search and Frobenius-norm-based schemes. Simulation results confirm
the theoretical findings and demonstrate that the proposed resource allocation
algorithms can efficiently approach the optimal EE.

When is Noisy State Information at the Encoder as Useless as No Information or as Good as Noise-Free State?

Rui Xu, Jun Chen, Tsachy Weissman, Jian-Kang Zhang
Comments: This paper was presented in part at the 2016 IEEE International Symposium on Information Theory. 16 pages, 8 figures
Subjects: Information Theory (cs.IT)

For any binary-input channel with perfect state information at the decoder,
if the mutual information between the noisy state observation at the encoder
and the true channel state is below a positive threshold determined solely by
the state distribution, then the capacity is the same as that with no encoder
side information. A complementary phenomenon is revealed for the generalized
probing capacity. Extensions beyond binary-input channels are developed.

An Experimental Comparison of Coded Modulation Strategies for 100 Gbit/s Transceivers

Eric Sillekens, Alex Alvarado, Chigo Okonkwo, Benn C. Thomsen
Subjects: Information Theory (cs.IT)

Coded modulation is a key technique to increase the spectral efficiency of
coherent optical communication systems. Two popular strategies for coded
modulation are turbo trellis-coded modulation (TTCM) and bit-interleaved coded
modulation (BICM) based on low-density parity-check (LDPC) codes. Although BICM
LDPC is suboptimal, its simplicity makes it very popular in practice. In this
work, we compare the performance of TTCM and BICM LDPC using
information-theoretic measures. Our information-theoretic results show that for
the same overhead and modulation format only a very small penalty (less than
0.1 dB) is to be expected when an ideal BICM LDPC scheme is used. However, the
results obtained for the coded modulation schemes implemented in this paper
show that the TTCM outperforms BICM LDPC by a larger margin. For a 1000 km
transmission at 100 Gbit/s, the observed gain was 0.4 dB.

Nonlinear Frequency-Division Multiplexing in the Focusing Regime

Xianhe Yangzhang, Mansoor I. Yousefi, Alex Alvarado, Domanic Lavery, Polina Bayvel
Comments: Invited paper to be presented at The Optical Fiber Communications Conference and Exposition (OFC), March 2017
Subjects: Information Theory (cs.IT)

Achievable rates of the nonlinear frequency-division multiplexing (NFDM) and
wavelength-division multiplexing (WDM) subject to the same power and bandwidth
constraints are computed as a function of transmit power in the standard
single-mode fiber. NFDM achieves higher rates than WDM.

Multiuser Media-based Modulation for Massive MIMO Systems

Bharath Shamasundar, A. Chockalingam
Subjects: Information Theory (cs.IT)

In this paper, we consider {em media-based modulation (MBM)}, an attractive
modulation scheme which is getting increased research attention recently, for
the uplink of a massive MIMO system. Each user is equipped with one transmit
antenna with multiple radio frequency (RF) mirrors (parasitic elements) placed
near it. The base station (BS) is equipped with tens to hundreds of receive
antennas. MBM with (m_{rf}) RF mirrors and (n_r) receive antennas over a
multipath channel has been shown to asymptotically (as (m_{rf}
ightarrow
infty)) achieve the capacity of (n_r) parallel AWGN channels. This suggests
that MBM can be attractive for use in massive MIMO systems which typically
employ a large number of receive antennas at the BS. In this paper, we
investigate the potential performance advantage of multiuser MBM (MU-MBM) in a
massive MIMO setting. Our results show that multiuser MBM (MU-MBM) can
significantly outperform other modulation schemes. For example, a bit error
performance achieved using 500 receive antennas at the BS in a massive MIMO
system using conventional modulation can be achieved using just 128 antennas
using MU-MBM. Even multiuser spatial modulation, and generalized spatial
modulation in the same massive MIMO settings require more than 200 antennas to
achieve the same bit error performance. Also, recognizing that the MU-MBM
signal vectors are inherently sparse, we propose an efficient MU-MBM signal
detection scheme that uses compressive sensing based reconstruction algorithms
like orthogonal matching pursuit (OMP), compressive sampling matching pursuit
(CoSaMP), and subspace pursuit (SP).

Bounds on Codes with Locality and Availability

S.B.Balaji, P.Vijay Kumar
Comments: To be submitted to ISIT 2017
Subjects: Information Theory (cs.IT)

In this paper we investigate bounds on rate and minimum distance of codes
with (t) availability. We present bounds on minimum distance of code with (t)
availability that are tighter than existing bounds. For bounds on rate of a
code with (t) availability, we restrict ourself to a sub class of codes with
(t) availability and derive a tighter rate bound. For (t=3,4), we also present
a high-rate construction.

Interference-Constrained Pricing for D2D Networks

Yuan Liu, Rui Wang, Zhu Han
Comments: To appear in IEEE Transactions on Wireless Communications
Subjects: Information Theory (cs.IT)

The concept of device-to-device (D2D) communications underlaying cellular
networks opens up potential benefits for improving system performance but also
brings new challenges such as interference management. In this paper, we
propose a pricing framework for interference management from the D2D users to
the cellular system, where the base station (BS) protects itself (or its
serving cellular users) by pricing the crosstier interference caused from the
D2D users. A Stackelberg game is formulated to model the interactions between
the BS and D2D users. Specifically, the BS sets prices to a maximize its
revenue (or any desired utility) subject to an interference temperature
constraint. For given prices, the D2D users competitively adapt their power
allocation strategies for individual utility maximization. We first analyze the
competition among the D2D users by noncooperative game theory and an iterative
based distributed power allocation algorithm is proposed. Then, depending on
how much network information the BS knows, we develop two optimal algorithms,
one for uniform pricing with limited network information and the other for
differentiated pricing with global network information. The uniform pricing
algorithm can be implemented by a fully distributed manner and requires minimum
information exchange between the BS and D2D users, and the differentiated
pricing algorithm is partially distributed and requires no iteration between
the BS and D2D users. Then a suboptimal differentiated pricing scheme is
proposed to reduce complexity and it can be implemented in a fully distributed
fashion. Extensive simulations are conducted to verify the proposed framework
and algorithms.

Rethinking Sketching as Sampling: A Graph Signal Processing Approach

Fernando Gama, Antonio G. Marques, Gonzalo Mateos, Alejandro Ribeiro
Comments: Submitted to IEEE Trans. Signal Processing
Subjects: Information Theory (cs.IT)

Sampling of bandlimited graph signals has well-documented merits for
dimensionality reduction, affordable storage, and online processing of
streaming network data. Most existing sampling methods are designed to minimize
the error incurred when reconstructing the original signal from its samples.
Oftentimes these parsimonious signals serve as inputs to
computationally-intensive linear operator (e.g., graph filters and transforms).
Hence, interest shifts from reconstructing the signal itself towards instead
approximating the output of the prescribed linear operator efficiently. In this
context, we propose a novel sampling scheme that leverages the bandlimitedness
of the input as well as the transformation whose output we wish to approximate.
We formulate problems to jointly optimize sample selection and a sketch of the
target linear transformation, so when the latter is affordably applied to the
sampled input signal the result is close to the desired output. These designs
are carried out off line, and several heuristic (sub)optimal solvers are
proposed to accommodate high-dimensional problems, especially when
computational resources are at a premium. Similar sketching as sampling ideas
are also shown effective in the context of linear inverse problems. The
developed sampling plus reduced-complexity processing pipeline is particularly
useful for streaming data, where the linear transform has to be applied fast
and repeatedly to successive inputs or response signals. Numerical tests show
the effectiveness of the proposed algorithms in classifying handwritten digits
from as few as 20 out of 784 pixels in the input images, as well as in
accurately estimating the frequency components of bandlimited graph signals
sampled at few nodes.

Optimal Signaling for Secure Communications over Gaussian MIMO Wiretap Channels

Sergey Loyka, Charalambos D. Charalambous
Comments: accepted by IEEE Trans. Info. Theory
Subjects: Information Theory (cs.IT)

Optimal signalling over the Gaussian MIMO wire-tap channel is studied under
the total transmit power constraint. A closed-form solution for an optimal
transmit covariance matrix is obtained when the channel is strictly degraded.
In combination with the rank-1 solution, this provides the complete
characterization of the optimal covariance for the case of two transmit
antennas. The cases of weak eavesdropper and high SNR are considered. It is
shown that the optimal covariance does not converge to a scaled identity in the
high-SNR regime. Necessary optimality conditions and a tight upper bound on the
rank of an optimal covariance matrix are established for the general case,
along with a lower bound to the secrecy capacity, which is tight in a number of
scenarios.

Symmetry, Outer Bounds, and Code Constructions: A Computer-Aided Investigation on the Fundamental Limits of Caching

Chao Tian
Comments: 34 pages, 7 figures; submitted to IEEE Trans. Information Theory
Subjects: Information Theory (cs.IT)

We considered the optimal memory-transmission-rate tradeoff of caching
systems. Different from the conventional analytical approach usually seen in
the information theory literature, we rely on a computer-aided approach in this
investigation. The linear programming (LP) outer bound of the entropy space
serves as the starting point of this approach, however our effort goes
significantly beyond using it to prove information inequalities. We first
identify and formalize the symmetry structure in the problem, which enables us
to show the existence of optimal symmetric solutions. A symmetry-reduced linear
program is then used to identify the boundary of the memory-transmission-rate
tradeoff for several simple cases, for which we obtain a set of tight outer
bounds. General hypotheses on the optimal tradeoff region are formed from these
computed data, which are then analytically proved. This leads to a complete
characterization of the optimal tradeoff for systems with only two users, and
certain partial characterization for systems with only two files. Next, we show
that by carefully analyzing the joint entropy structure of the outer bounds for
certain cases, a novel code construction can be reverse-engineered, which
eventually leads to a general class of codes. Finally, we show that strong
outer bounds can be computed through strategically relaxing the LP. This allows
us firstly to deduce generic characteristic of the converse proof, and secondly
to compute outer bounds for larger problem cases, despite the seemingly
impossible computation scale.