IT博客汇 | arXiv Paper Daily: Fri, 24 Feb 2017

arXiv Paper Daily: Fri, 24 Feb 2017

我爱机器学习(52ml.net)发表于 2017-02-24 00:00:00

Neural and Evolutionary Computing

Bidirectional Backpropagation: Towards Biologically Plausible Error Signal Transmission in Neural Networks

Hongyin Luo, Jie Fu, James Glass
Subjects: Neural and Evolutionary Computing (cs.NE); Learning (cs.LG)

The back-propagation (BP) algorithm has been considered the de facto method
for training deep neural networks. It back-propagates errors from the output
layer to the hidden layers in an exact manner using feedforward weights. In
this work, we propose a more biologically plausible paradigm of neural
architecture according to biological findings. Specifically, we propose two
bidirectional learning algorithms with two sets of trainable weights.
Preliminary results show that our models perform best on the MNIST and the
CIFAR10 datasets among the asymmetric error signal passing methods, and their
performance is more close to that of BP.

Computer Vision and Pattern Recognition

k-Means Clustering and Ensemble of Regressions: An Algorithm for the ISIC 2017 Skin Lesion Segmentation Challenge

David Alvarez, Monica Iglesias
Comments: Abstract submitted to arXiv as prerequisite to participate in the ISIC 2017 Skin Lesion Segmentation Challenge
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This abstract briefly describes a segmentation algorithm developed for the
ISIC 2017 Skin Lesion Detection Competition hosted at [ref]. The objective of
the competition is to perform a segmentation (in the form of a binary mask
image) of skin lesions in dermoscopic images as close as possible to a
segmentation performed by trained clinicians, which is taken as ground truth.
This project only takes part in the segmentation phase of the challenge. The
other phases of the competition (feature extraction and lesion identification)
are not considered.

The proposed algorithm consists of 4 steps: (1) lesion image preprocessing,
(2) image segmentation using k-means clustering of pixel colors, (3)
calculation of a set of features describing the properties of each segmented
region, and (4) calculation of a final score for each region, representing the
likelihood of corresponding to a suitable lesion segmentation. The scores in
step (4) are obtained by averaging the results of 2 different regression models
using the scores of each region as input. Before using the algorithm these
regression models must be trained using the training set of images and ground
truth masks provided by the Competition. Steps 2 to 4 are repeated with an
increasing number of clusters (and therefore the image is segmented into more
regions) until there is no further improvement of the calculated scores.

ViP-CNN: A Visual Phrase Reasoning Convolutional Neural Network for Visual Relationship Detection

Yikang Li, Wanli Ouyang, Xiaogang Wang
Comments: 10 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

As the intermediate level task connecting image captioning and object
detection, visual relationship detection started to catch researchers’
attention because of its descriptive power and clear structure. It localizes
the objects and captures their interactions with a subject-predicate-object
triplet, e.g. person-ride-horse. In this paper, the visual relationship is
considered as a phrase with three components. So we formulate the visual
relationship detection as three inter-connected recognition problems and
propose a Visual Phrase reasoning Convolutional Neural Network (ViP-CNN) to
address them simultaneously. In ViP-CNN, we present a Visual Phrase Reasoning
Structure (VPRS) to set up the connection among the relationship components and
help the model consider the three problems jointly. Corresponding non-maximum
suppression method and model training strategy are also proposed. Experimental
results show that our ViP-CNN outperforms the state-of-art method both in speed
and accuracy. We further pretrain our model on our cleansed Visual Genome
Relationship dataset, which is found to perform better than the pretraining on
the ImageNet for this task.

Analyzing Learned Convnet Features with Dirichlet Process Gaussian Mixture Models

David Malmgren-Hansen, Allan Aasbjerg Nielsen, Rasmus Engholm
Comments: Presented at NIPS 2016 Workshop: Practical Bayesian Nonparametrics
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Convolutional Neural Networks (Convnets) have achieved good results in a
range of computer vision tasks the recent years. Though given a lot of
attention, visualizing the learned representations to interpret Convnets, still
remains a challenging task. The high dimensionality of internal representations
and the high abstractions of deep layers are the main challenges when
visualizing Convnet functionality. We present in this paper a technique based
on clustering internal Convnet representations with a Dirichlet Process
Gaussian Mixture Model, for visualization of learned representations in
Convnets. Our method copes with the high dimensionality of a Convnet by
clustering representations across all nodes of each layer. We will discuss how
this application is useful when considering transfer learning, i.e.
transferring a model trained on one dataset to solve a task on a different one.

Robust and fully automated segmentation of mandible from CT scans

Neslisah Torosdagli, Denise K. Liberton, Payal Verma, Murat Sincan Janice Lee, Sumanta Pattanaik, Ulas Bagci
Comments: 4 pages, 5 figures, IEEE International Symposium on Biomedical Imaging (ISBI) 2017
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Mandible bone segmentation from computed tomography (CT) scans is challenging
due to mandible’s structural irregularities, complex shape patterns, and lack
of contrast in joints. Furthermore, connections of teeth to mandible and
mandible to remaining parts of the skull make it extremely difficult to
identify mandible boundary automatically. This study addresses these challenges
by proposing a novel framework where we define the segmentation as two
complementary tasks: recognition and delineation. For recognition, we use
random forest regression to localize mandible in 3D. For delineation, we
propose to use 3D gradient-based fuzzy connectedness (FC) image segmentation
algorithm, operating on the recognized mandible sub-volume. Despite heavy CT
artifacts and dental fillings, consisting half of the CT image data in our
experiments, we have achieved highly accurate detection and delineation
results. Specifically, detection accuracy more than 96% (measured by union of
intersection (UoI)), the delineation accuracy of 91% (measured by dice
similarity coefficient), and less than 1 mm in shape mismatch (Hausdorff
Distance) were found.

Learning Chained Deep Features and Classifiers for Cascade in Object Detection

Wanli Ouyang, Ku Wang, Xin Zhu, Xiaogang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cascade is a widely used approach that rejects obvious negative samples at
early stages for learning better classifier and faster inference. This paper
presents chained cascade network (CC-Net). In this CC-Net, the cascaded
classifier at a stage is aided by the classification scores in previous stages.
Feature chaining is further proposed so that the feature learning for the
current cascade stage uses the features in previous stages as the prior
information. The chained ConvNet features and classifiers of multiple stages
are jointly learned in an end-to-end network. In this way, features and
classifiers at latter stages handle more difficult samples with the help of
features and classifiers in previous stages. It yields consistent boost in
detection performance on benchmarks like PASCAL VOC 2007 and ImageNet. Combined
with better region proposal, CC-Net leads to state-of-the-art result of 81.1%
mAP on PASCAL VOC 2007.

Increasing Deep Learning Melanoma Classification by Classical And Expert Knowledge Based Image Transforms

Cristina Nader Vasconcelos, Bárbara Nader Vasconcelos
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Skin cancer is a major public health problem, as is the most common type of
cancer and represents more than half of cancer diagnoses worldwide. Early
detection influences the outcome of the disease and motivates our work. We
obtain the state of the art results for the ISBI 2016 Melanoma Classification
Challenge (named Skin Lesion Analysis towards Melanoma Detection) facing the
peculiarities of dealing with such a small, unbalanced, biological database.
For that, we explore committees of Convolutional Neural Networks trained over
the ISBI challenge training dataset artificially augmented by both classical
image processing transforms and image warping guided by specialist knowledge
about the lesion axis and improve the final classifier invariance to common
melanoma variations.

CT Image Denoising with Perceptive Deep Neural Networks

Qingsong Yang, Pingkun Yan, Mannudeep K. Kalra, Ge Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Increasing use of CT in modern medical practice has raised concerns over
associated radiation dose. Reduction of radiation dose associated with CT can
increase noise and artifacts, which can adversely affect diagnostic confidence.
Denoising of low-dose CT images on the other hand can help improve diagnostic
confidence, which however is a challenging problem due to its ill-posed nature,
since one noisy image patch may correspond to many different output patches. In
the past decade, machine learning based approaches have made quite impressive
progress in this direction. However, most of those methods, including the
recently popularized deep learning techniques, aim for minimizing
mean-squared-error (MSE) between a denoised CT image and the ground truth,
which results in losing important structural details due to over-smoothing,
although the PSNR based performance measure looks great. In this work, we
introduce a new perceptual similarity measure as the objective function for a
deep convolutional neural network to facilitate CT image denoising. Instead of
directly computing MSE for pixel-to-pixel intensity loss, we compare the
perceptual features of a denoised output against those of the ground truth in a
feature space. Therefore, our proposed method is capable of not only reducing
the image noise levels, but also keeping the critical structural information at
the same time. Promising results have been obtained in our experiments with a
large number of CT images.

Synthesising Dynamic Textures using Convolutional Neural Networks

Christina M. Funke, Leon A. Gatys, Alexander S. Ecker, Matthias Bethge
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Here we present a parametric model for dynamic textures. The model is based
on spatiotemporal summary statistics computed from the feature representations
of a Convolutional Neural Network (CNN) trained on object recognition. We
demonstrate how the model can be used to synthesise new samples of dynamic
textures and to predict motion in simple movies.

Artificial Intelligence

A Probabilistic Framework for Location Inference from Social Media

Yujie Qian, Jie Tang, Zhilin Yang, Binxuan Huang, Wei Wei, Kathleen M. Carley
Subjects: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

We study the extent to which we can infer users’ geographical locations from
social media. Location inference from social media can benefit many
applications, such as disaster management, targeted advertising, and news
content tailoring. In recent years, a number of algorithms have been proposed
for identifying user locations on social media platforms such as Twitter and
Facebook from message contents, friend networks, and interactions between
users. In this paper, we propose a novel probabilistic model based on factor
graphs for location inference that offers several unique advantages for this
task. First, the model generalizes previous methods by incorporating content,
network, and deep features learned from social context. The model is also
flexible enough to support both supervised learning and semi-supervised
learning. Second, we explore several learning algorithms for the proposed
model, and present a Two-chain Metropolis-Hastings (MH+) algorithm, which
improves the inference accuracy. Third, we validate the proposed model on three
different genres of data – Twitter, Weibo, and Facebook – and demonstrate that
the proposed model can substantially improve the inference accuracy (+3.3-18.5%
by F1-score) over that of several state-of-the-art methods.

Ontologies in System Engineering: a Field Report

Marco Menapace, Armando Tacchella
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

In recent years ontologies enjoyed a growing popularity outside specialized
AI communities. System engineering is no exception to this trend, with
ontologies being proposed as a basis for several tasks in complex industrial
implements, including system design, monitoring and diagnosis. In this paper,
we consider four different contributions to system engineering wherein
ontologies are instrumental to provide enhancements over traditional ad-hoc
techniques. For each application, we briefly report the methodologies, the
tools and the results obtained with the goal to provide an assessment of merits
and limits of ontologies in such domains.

A DIKW Paradigm to Cognitive Engineering

Amit Kumar Mishra
Subjects: Artificial Intelligence (cs.AI)

Though the word cognitive has a wide range of meanings we define cognitive
engineering as learning from brain to bolster engineering solutions. However,
giving an achievable framework to the process towards this has been a difficult
task. In this work we take the classic data information knowledge wisdom (DIKW)
framework to set some achievable goals and sub-goals towards cognitive
engineering. A layered framework like DIKW aligns nicely with the layered
structure of pre-frontal cortex. And breaking the task into sub-tasks based on
the layers also makes it easier to start developmental endeavours towards
achieving the final goal of a brain-inspired system.

Theoretical and Experimental Analysis of the Canadian Traveler Problem

Doron Zarchy
Subjects: Artificial Intelligence (cs.AI)

Devising an optimal strategy for navigation in a partially observable
environment is one of the key objectives in AI. One of the problem in this
context is the Canadian Traveler Problem (CTP). CTP is a navigation problem
where an agent is tasked to travel from source to target in a partially
observable weighted graph, whose edge might be blocked with a certain
probability and observing such blockage occurs only when reaching upon one of
the edges end points. The goal is to find a strategy that minimizes the
expected travel cost. The problem is known to be P(#) hard. In this work we
study the CTP theoretically and empirically. First, we study the Dep-CTP, a CTP
variant we introduce which assumes dependencies between the edges status. We
show that Dep-CTP is intractable, and further we analyze two of its subclasses
on disjoint paths graph. Second, we develop a general algorithm Gen-PAO that
optimally solve the CTP. Gen-PAO is capable of solving two other types of CTP
called Sensing-CTP and Expensive-Edges CTP. Since the CTP is intractable,
Gen-PAO use some pruning methods to reduce the space search for the optimal
solution. We also define some variants of Gen-PAO, compare their performance
and show some benefits of Gen-PAO over existing work.

A Realistic Dataset for the Smart Home Device Scheduling Problem for DCOPs

William Kluegel, Muhammad Aamir Iqbal, Ferdinando Fioretto, William Yeoh, Enrico Pontelli
Comments: 15 pages, OPTMAS17
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

The field of Distributed Constraint Optimization has gained momentum in
recent years thanks to its ability to address various applications related to
multi-agent cooperation. While techniques to solve Distributed Constraint
Optimization Problems (DCOPs) are abundant and have matured substantially since
the field inception, the number of DCOP realistic applications and benchmark
used to asses the performance of DCOP algorithms is lagging behind. To contrast
this background we (i) introduce the Smart Home Device Scheduling (SHDS)
problem, which describe the problem of coordinating smart devices schedules
across multiple homes as a multi-agent system, (ii) detail the physical models
adopted to simulate smart sensors, smart actuators, and homes environments, and
(iii) introduce a DCOP realistic benchmark for SHDS problems.

Proactive Resource Management in LTE-U Systems: A Deep Learning Perspective

Ursula Challita, Li Dong, Walid Saad
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)

LTE in unlicensed spectrum (LTE-U) is a promising approach to overcome the
wireless spectrum scarcity. However, to reap the benefits of LTE-U, a fair
coexistence mechanism with other incumbent WiFi deployments is required. In
this paper, a novel deep learning approach is proposed for modeling the
resource allocation problem of LTE-U small base stations (SBSs). The proposed
approach enables multiple SBSs to proactively perform dynamic channel
selection, carrier aggregation, and fractional spectrum access while
guaranteeing fairness with existing WiFi networks and other LTE-U operators.
Adopting a proactive coexistence mechanism enables future delay-intolerant
LTE-U data demands to be served within a given prediction window ahead of their
actual arrival time thus avoiding the underutilization of the unlicensed
spectrum during off-peak hours while maximizing the total served LTE-U traffic
load. To this end, a noncooperative game model is formulated in which SBSs are
modeled as Homo Egualis agents that aim at predicting a sequence of future
actions and thus achieving long-term equal weighted fairness with WLAN and
other LTE-U operators over a given time horizon. The proposed deep learning
algorithm is then shown to reach a mixed-strategy Nash equilibrium (NE), when
it converges. Simulation results using real data traces show that the proposed
scheme can yield up to 28% and 11% gains over a conventional reactive approach
and a proportional fair coexistence mechanism, respectively. The results also
show that the proposed framework prevents WiFi performance degradation for a
densely deployed LTE-U network.

Transferring Face Verification Nets To Pain and Expression Regression

Feng Wang, Xiang Xiang, Chang Liu, Trac D. Tran, Austin Reiter, Gregory D. Hager, Harry Quon, Jian Cheng, Alan L. Yuille
Comments: 5 pages, 3 figure; submitted to IEEE ICIP 2017 which does not perform blind reviews
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Learning (cs.LG); Multimedia (cs.MM)

Limited annotated data is available for the research of estimating facial
expression intensities, which makes the training of deep networks for automated
expression assessment very challenging. Fortunately, fine-tuning from a
data-extensive pre-trained domain such as face verification can alleviate the
problem. In this paper, we propose a transferred network that fine-tunes a
state-of-the-art face verification network using expression-intensity labeled
data with a regression layer. In this way, the expression regression task can
benefit from the rich feature representations trained on a huge amount of data
for face verification. The proposed transferred deep regressor is applied in
estimating the intensity of facial action units (2017 EmotionNet Challenge) and
in particular pain intensity estimation (UNBS-McMaster Shoulder-Pain dataset).
It wins the second place in the challenge and achieves the state-of-the-art
performance on Shoulder-Pain dataset. Particularly for Shoulder-Pain with the
imbalance issue of different pain levels, a new weighted evaluation metric is
proposed.

Information Retrieval

Time-Series Adaptive Estimation of Vaccination Uptake Using Web Search Queries

Niels Dalum Hansen, Kåre Mølbak, Ingemar J. Cox, Christina Lioma
Subjects: Information Retrieval (cs.IR); Quantitative Methods (q-bio.QM); Applications (stat.AP)

Estimating vaccination uptake is an integral part of ensuring public health.
It was recently shown that vaccination uptake can be estimated automatically
from web data, instead of slowly collected clinical records or population
surveys. All prior work in this area assumes that features of vaccination
uptake collected from the web are temporally regular. We present the first ever
method to remove this assumption from vaccination uptake estimation: our method
dynamically adapts to temporal fluctuations in time series web data used to
estimate vaccination uptake. We show our method to outperform the state of the
art compared to competitive baselines that use not only web data but also
curated clinical data. This performance improvement is more pronounced for
vaccines whose uptake has been irregular due to negative media attention (HPV-1
and HPV-2), problems in vaccine supply (DiTeKiPol), and targeted at children of
12 years old (whose vaccination is more irregular compared to younger
children).

Stability of Topic Modeling via Matrix Factorization

Mark Belford, Brian Mac Namee, Derek Greene
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Learning (cs.LG); Machine Learning (stat.ML)

Topic models can provide us with an insight into the underlying latent
structure of a large corpus of documents. A range of methods have been proposed
in the literature, including probabilistic topic models and techniques based on
matrix factorization. However, in both cases, standard implementations rely on
stochastic elements in their initialization phase, which can potentially lead
to different results being generated on the same corpus when using the same
parameter values. This corresponds to the concept of “instability” which has
previously been studied in the context of (k)-means clustering. In many
applications of topic modeling, this problem of instability is not considered
and topic models are treated as being definitive, even though the results may
change considerably if the initialization process is altered. In this paper we
demonstrate the inherent instability of popular topic modeling approaches,
using a number of new measures to assess stability. To address this issue in
the context of matrix factorization for topic modeling, we propose the use of
ensemble learning strategies. Based on experiments performed on annotated text
corpora, we show that a K-Fold ensemble strategy, combining both ensembles and
structured initialization, can significantly reduce instability, while
simultaneously yielding more accurate topic models.

A Neural Attention Model for Categorizing Patient Safety Events

Arman Cohan, Allan Fong, Nazli Goharian, Raj Ratwani
Comments: ECIR 2017
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Medical errors are leading causes of death in the US and as such, prevention
of these errors is paramount to promoting health care. Patient Safety Event
reports are narratives describing potential adverse events to the patients and
are important in identifying and preventing medical errors. We present a neural
network architecture for identifying the type of safety events which is the
first step in understanding these narratives. Our proposed model is based on a
soft neural attention model to improve the effectiveness of encoding long
sequences. Empirical results on two large-scale real-world datasets of patient
safety reports demonstrate the effectiveness of our method with significant
improvements over existing methods.

Scalable Inference for Nested Chinese Restaurant Process Topic Models

Jianfei Chen, Jun Zhu, Jie Lu, Shixia Liu
Subjects: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR); Learning (cs.LG)

Nested Chinese Restaurant Process (nCRP) topic models are powerful
nonparametric Bayesian methods to extract a topic hierarchy from a given text
corpus, where the hierarchical structure is automatically determined by the
data. Hierarchical Latent Dirichlet Allocation (hLDA) is a popular instance of
nCRP topic models. However, hLDA has only been evaluated at small scale,
because the existing collapsed Gibbs sampling and instantiated weight
variational inference algorithms either are not scalable or sacrifice inference
quality with mean-field assumptions. Moreover, an efficient distributed
implementation of the data structures, such as dynamically growing count
matrices and trees, is challenging.

In this paper, we propose a novel partially collapsed Gibbs sampling (PCGS)
algorithm, which combines the advantages of collapsed and instantiated weight
algorithms to achieve good scalability as well as high model quality. An
initialization strategy is presented to further improve the model quality.
Finally, we propose an efficient distributed implementation of PCGS through
vectorization, pre-processing, and a careful design of the concurrent data
structures and communication strategy.

Empirical studies show that our algorithm is 111 times more efficient than
the previous open-source implementation for hLDA, with comparable or even
better model quality. Our distributed implementation can extract 1,722 topics
from a 131-million-document corpus with 28 billion tokens, which is 4-5 orders
of magnitude larger than the previous largest corpus, with 50 machines in 7
hours.

Computation and Language

Inherent Biases of Recurrent Neural Networks for Phonological Assimilation and Dissimilation

Amanda Doucette
Subjects: Computation and Language (cs.CL)

A recurrent neural network model of phonological pattern learning is
proposed. The model is a relatively simple neural network with one recurrent
layer, and displays biases in learning that mimic observed biases in human
learning. Single-feature patterns are learned faster than two-feature patterns,
and vowel or consonant-only patterns are learned faster than patterns involving
vowels and consonants, mimicking the results of laboratory learning
experiments. In non-recurrent models, capturing these biases requires the use
of alpha features or some other representation of repeated features, but with a
recurrent neural network, these elaborations are not necessary.

Are Emojis Predictable?

Francesco Barbieri, Miguel Ballesteros, Horacio Saggion
Comments: To appear at EACL 2017
Subjects: Computation and Language (cs.CL)

Emojis are ideograms which are naturally combined with plain text to visually
complement or condense the meaning of a message. Despite being widely used in
social media, their underlying semantics have received little attention from a
Natural Language Processing standpoint. In this paper, we investigate the
relation between words and emojis, studying the novel task of predicting which
emojis are evoked by text-based tweet messages. We train several models based
on Long Short-Term Memory networks (LSTMs) in this task. Our experimental
results show that our neural model outperforms two baselines as well as humans
solving the same task, suggesting that computational models are able to better
capture the underlying semantics of emojis.

Utilizing Lexical Similarity for pivot translation involving resource-poor, related languages

Anoop Kunchukuttan, Maulik Shah, Pradyot Prakash, Pushpak Bhattacharyya
Comments: Submitted to ACL 2017, 10 pages, 9 tables
Subjects: Computation and Language (cs.CL)

We investigate the use of pivot languages for phrase-based statistical
machine translation (PB-SMT) between related languages with limited parallel
corpora. We show that subword-level pivot translation via a related pivot
language is: (i) highly competitive with the best direct translation model and
(ii) better than a pivot model which uses an unrelated pivot language, but has
at its disposal large parallel corpora to build the source-pivot (S-P) and
pivot-target (P-T) translation models. In contrast, pivot models trained at
word and morpheme level are far inferior to their direct counterparts. We also
show that using multiple related pivot languages can outperform a direct
translation model. Thus, the use of subwords as translation units coupled with
the use of multiple related pivot languages can compensate for the lack of a
direct parallel corpus. Subword units make pivot models competitive by (i)
utilizing lexical similarity to improve the underlying S-P and P-T translation
models, and (ii) reducing loss of translation candidates during pivoting.

LTSG: Latent Topical Skip-Gram for Mutually Learning Topic Model and Vector Representations

Jarvan Law, Hankz Hankui Zhuo, Junhua He, Erhu Rong (Dept. of Computer Science, Sun Yat-Sen University, GuangZhou, China.)
Subjects: Computation and Language (cs.CL)

Topic models have been widely used in discovering latent topics which are
shared across documents in text mining. Vector representations, word embeddings
and topic embeddings, map words and topics into a low-dimensional and dense
real-value vector space, which have obtained high performance in NLP tasks.
However, most of the existing models assume the result trained by one of them
are perfect correct and used as prior knowledge for improving the other model.
Some other models use the information trained from external large corpus to
help improving smaller corpus. In this paper, we aim to build such an algorithm
framework that makes topic models and vector representations mutually improve
each other within the same corpus. An EM-style algorithm framework is employed
to iteratively optimize both topic model and vector representations.
Experimental results show that our model outperforms state-of-art methods on
various NLP tasks.

A Neural Attention Model for Categorizing Patient Safety Events

Arman Cohan, Allan Fong, Nazli Goharian, Raj Ratwani
Comments: ECIR 2017
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Medical errors are leading causes of death in the US and as such, prevention
of these errors is paramount to promoting health care. Patient Safety Event
reports are narratives describing potential adverse events to the patients and
are important in identifying and preventing medical errors. We present a neural
network architecture for identifying the type of safety events which is the
first step in understanding these narratives. Our proposed model is based on a
soft neural attention model to improve the effectiveness of encoding long
sequences. Empirical results on two large-scale real-world datasets of patient
safety reports demonstrate the effectiveness of our method with significant
improvements over existing methods.

Pronunciation recognition of English phonemes / extipa{@}/, /æ/, / extipa{A}:/ and / extipa{2}/ using Formants and Mel Frequency Cepstral Coefficients

Keith Y. Patarroyo, Vladimir Vargas-Calderón
Comments: 11 pages, pre-print version
Subjects: Computation and Language (cs.CL); Sound (cs.SD)

The Vocal Joystick Vowel Corpus, by Washington University, was used to study
monophthongs pronounced by native English speakers. The objective of this study
was to quantitatively measure the extent at which speech recognition methods
can distinguish between similar sounding vowels. In particular, the phonemes
/ extipa{@}/, /{ae}/, / extipa{A}:/ and / extipa{2}/ were analysed. 748
sound files from the corpus were used and subjected to Linear Predictive Coding
(LPC) to compute their formants, and to Mel Frequency Cepstral Coefficients
(MFCC) algorithm, to compute the cepstral coefficients. A Decision Tree
Classifier was used to build a predictive model that learnt the patterns of the
two first formants measured in the data set, as well as the patterns of the 13
cepstral coefficients. An accuracy of 70\% was achieved using formants for the
mentioned phonemes. For the MFCC analysis an accuracy of 52 \% was achieved and
an accuracy of 71\% when / extipa{@}/ was ignored. The results obtained show
that the studied algorithms are far from mimicking the ability of
distinguishing subtle differences in sounds like human hearing does.

Feature Generation for Robust Semantic Role Labeling

Travis Wolfe, Mark Dredze, Benjamin Van Durme
Subjects: Computation and Language (cs.CL)

Hand-engineered feature sets are a well understood method for creating robust
NLP models, but they require a lot of expertise and effort to create. In this
work we describe how to automatically generate rich feature sets from simple
units called featlets, requiring less engineering. Using information gain to
guide the generation process, we train models which rival the state of the art
on two standard Semantic Role Labeling datasets with almost no task or
linguistic insight.

Unsupervised Learning of Morphological Forests

Jiaming Luo, Karthik Narasimhan, Regina Barzilay
Comments: 12 pages, 5 figures, accepted by TACL 2017
Subjects: Computation and Language (cs.CL)

This paper focuses on unsupervised modeling of morphological families,
collectively comprising a forest over the language vocabulary. This formulation
enables us to capture edgewise properties reflecting single-step morphological
derivations, along with global distributional properties of the entire forest.
These global properties constrain the size of the affix set and encourage
formation of tight morphological families. The resulting objective is solved
using Integer Linear Programming (ILP) paired with contrastive estimation. We
train the model by alternating between optimizing the local log-linear model
and the global ILP objective. We evaluate our system on three tasks: root
detection, clustering of morphological families and segmentation. Our
experiments demonstrate that our model yields consistent gains in all three
tasks compared with the best published results.

Stability of Topic Modeling via Matrix Factorization

Mark Belford, Brian Mac Namee, Derek Greene
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Learning (cs.LG); Machine Learning (stat.ML)

Topic models can provide us with an insight into the underlying latent
structure of a large corpus of documents. A range of methods have been proposed
in the literature, including probabilistic topic models and techniques based on
matrix factorization. However, in both cases, standard implementations rely on
stochastic elements in their initialization phase, which can potentially lead
to different results being generated on the same corpus when using the same
parameter values. This corresponds to the concept of “instability” which has
previously been studied in the context of (k)-means clustering. In many
applications of topic modeling, this problem of instability is not considered
and topic models are treated as being definitive, even though the results may
change considerably if the initialization process is altered. In this paper we
demonstrate the inherent instability of popular topic modeling approaches,
using a number of new measures to assess stability. To address this issue in
the context of matrix factorization for topic modeling, we propose the use of
ensemble learning strategies. Based on experiments performed on annotated text
corpora, we show that a K-Fold ensemble strategy, combining both ensembles and
structured initialization, can significantly reduce instability, while
simultaneously yielding more accurate topic models.

Distributed, Parallel, and Cluster Computing

Synchronizability of Communicating Finite State Machines is not Decidable

Alain Finkel (LSV), Etienne Lozes (LSV)
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Formal Languages and Automata Theory (cs.FL)

A system of communicating finite state machines is synchronizable if its send
trace semantics, i.e. the set of sequences of sendings it can perform, is the
same when its communications are FIFO asynchronous and when they are just
rendezvous synchronizations. This property was claimed to be decidable in
several conference and journal papers. In this paper, we show that
synchronizability is actually undecidable. We show that synchronizability is
decidable if the topology of communications is an oriented ring. We also show
that, in this case, synchronizability implies the absence of unspecified
receptions and orphan messages, and the channel-recognizability of the
reachability set.

First Experiences Optimizing Smith-Waterman on Intel's Knights Landing Processor

Enzo Rucci, Carlos Garcia, Guillermo Botella, Armando De Giusti, Marcelo Naiouf, Manuel Prieto-Matias
Comments: Submitted to Euro-Par 2017 Conference
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)

The well-known Smith-Waterman (SW) algorithm is the most commonly used method
for local sequence alignments. However, SW is very computationally demanding
for large protein databases. There exist several implementations that take
advantage of computing parallelization on many-cores, FPGAs or GPUs, in order
to increase the alignment throughtput. In this paper, we have explored SW
acceleration on Intel KNL processor. The novelty of this architecture requires
the revision of previous programming and optimization techniques on many-core
architectures. To the best of authors knowledge, this is the first KNL
architecture assessment for SW algorithm. Our evaluation, using the renowned
Environmental NR database as benchmark, has shown that multi-threading and SIMD
exploitation reports competitive performance (351 GCUPS) in comparison with
other implementations.

DyAdHyTM: A Low Overhead Dynamically Adaptive Hybrid Transactional Memory on Big Data Graphs

Mohammad Qayum, Abdul-Hameed Badawy, Jeanine Cook
Comments: 13 pages
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Big data is a buzzword used to describe massive volumes of data that provides
opportunities of exploring new insights through data analytics. However, big
data is mostly structured but can be semi-structured or unstructured. It is
normally so large that it is not only difficult but also slow to process using
traditional computing systems. One of the solutions is to format the data as
graph data structures and process them on shared memory architecture to use
fast and novel policies such as transactional memory. In most graph
applications in big data type problems such as bioinformatics, social networks,
and cyber security, graphs are sparse in nature. Due to this sparsity, we have
the opportunity to use Transactional Memory (TM) as the synchronization policy
for critical sections to speedup applications. At low conflict probability TM
performs better than most synchronization policies due to its inherent
non-blocking characteristics. TM can be implemented in Software, Hardware or a
combination of both. However, hardware TM implementations are fast but limited
by scarce hardware resources while software implementations have high overheads
which can degrade performance. In this paper, we develop a low overhead, yet
simple, dynamically adaptive (i.e. at runtime) hybrid (i.e. combines hardware
and software) TM (DyAdHyTM) scheme that combines the best features of both
Hardware TM (HTM) and Software TM (STM) while adapting to application
requirements. It performs better than coarse grain lock by up to 8.12x, a low
overhead STM by up to 2.68x, a couple of implementations of HTMs (by up to
2.59x), and other HyTMs (by up to 1.55x) for SSCA2 graph benchmark running on a
multicore machine with a large shared memory.

ERA: A Framework for Economic Resource Allocation for the Cloud

Moshe Babaioff, Yishay Mansour, Noam Nisan, Gali Noti, Carlo Curino, Nar Ganapathy, Ishai Menache, Omer Reingold, Moshe Tennenholtz, Erez Timnat
Subjects: Computer Science and Game Theory (cs.GT); Distributed, Parallel, and Cluster Computing (cs.DC)

Cloud computing has reached significant maturity from a systems perspective,
but currently deployed solutions rely on rather basic economics mechanisms that
yield suboptimal allocation of the costly hardware resources. In this paper we
present Economic Resource Allocation (ERA), a complete framework for scheduling
and pricing cloud resources, aimed at increasing the efficiency of cloud
resources usage by allocating resources according to economic principles. The
ERA architecture carefully abstracts the underlying cloud infrastructure,
enabling the development of scheduling and pricing algorithms independently of
the concrete lower-level cloud infrastructure and independently of its
concerns. Specifically, ERA is designed as a flexible layer that can sit on top
of any cloud system and interfaces with both the cloud resource manager and
with the users who reserve resources to run their jobs. The jobs are scheduled
based on prices that are dynamically calculated according to the predicted
demand. Additionally, ERA provides a key internal API to pluggable algorithmic
modules that include scheduling, pricing and demand prediction. We provide a
proof-of-concept software and demonstrate the effectiveness of the architecture
by testing ERA over both public and private cloud systems — Azure Batch of
Microsoft and Hadoop/YARN. A broader intent of our work is to foster
collaborations between economics and system communities. To that end, we have
developed a simulation platform via which economics and system experts can test
their algorithmic implementations.

How to Optimally Allocate Resources for Coded Distributed Computing?

Qian Yu, Songze Li, Mohammad Ali Maddah-Ali, A. Salman Avestimehr
Subjects: Information Theory (cs.IT); Distributed, Parallel, and Cluster Computing (cs.DC)

Today’s data centers have an abundance of computing resources, hosting server
clusters consisting of as many as tens or hundreds of thousands of machines. To
execute a complex computing task over a data center, it is natural to
distribute computations across many nodes to take advantage of parallel
processing. However, as we allocate more and more computing resources to a
computation task and further distribute the computations, large amounts of
(partially) computed data must be moved between consecutive stages of
computation tasks among the nodes, hence the communication load can become the
bottleneck. In this paper, we study the optimal allocation of computing
resources in distributed computing, in order to minimize the total execution
time in distributed computing accounting for both the duration of computation
and communication phases. In particular, we consider a general MapReduce-type
distributed computing framework, in which the computation is decomposed into
three stages: emph{Map}, emph{Shuffle}, and emph{Reduce}. We focus on a
recently proposed emph{Coded Distributed Computing} approach for MapReduce and
study the optimal allocation of computing resources in this framework. For all
values of problem parameters, we characterize the optimal number of servers
that should be used for distributed processing, provide the optimal placements
of the Map and Reduce tasks, and propose an optimal coded data shuffling
scheme, in order to minimize the total execution time. To prove the optimality
of the proposed scheme, we first derive a matching information-theoretic
converse on the execution time, then we prove that among all possible resource
allocation schemes that achieve the minimum execution time, our proposed scheme
uses the exactly minimum possible number of servers.

Scalable Inference for Nested Chinese Restaurant Process Topic Models

Jianfei Chen, Jun Zhu, Jie Lu, Shixia Liu
Subjects: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR); Learning (cs.LG)

Nested Chinese Restaurant Process (nCRP) topic models are powerful
nonparametric Bayesian methods to extract a topic hierarchy from a given text
corpus, where the hierarchical structure is automatically determined by the
data. Hierarchical Latent Dirichlet Allocation (hLDA) is a popular instance of
nCRP topic models. However, hLDA has only been evaluated at small scale,
because the existing collapsed Gibbs sampling and instantiated weight
variational inference algorithms either are not scalable or sacrifice inference
quality with mean-field assumptions. Moreover, an efficient distributed
implementation of the data structures, such as dynamically growing count
matrices and trees, is challenging.

In this paper, we propose a novel partially collapsed Gibbs sampling (PCGS)
algorithm, which combines the advantages of collapsed and instantiated weight
algorithms to achieve good scalability as well as high model quality. An
initialization strategy is presented to further improve the model quality.
Finally, we propose an efficient distributed implementation of PCGS through
vectorization, pre-processing, and a careful design of the concurrent data
structures and communication strategy.

Empirical studies show that our algorithm is 111 times more efficient than
the previous open-source implementation for hLDA, with comparable or even
better model quality. Our distributed implementation can extract 1,722 topics
from a 131-million-document corpus with 28 billion tokens, which is 4-5 orders
of magnitude larger than the previous largest corpus, with 50 machines in 7
hours.

Large-Scale Stochastic Learning using GPUs

Thomas Parnell, Celestine Dünner, Kubilay Atasu, Manolis Sifalakis, Haris Pozidis
Comments: Accepted for publication in ParLearning 2017: The 6th International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, Orlando, Florida, May 2017
Subjects: Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

In this work we propose an accelerated stochastic learning system for very
large-scale applications. Acceleration is achieved by mapping the training
algorithm onto massively parallel processors: we demonstrate a parallel,
asynchronous GPU implementation of the widely used stochastic coordinate
descent/ascent algorithm that can provide up to 35x speed-up over a sequential
CPU implementation. In order to train on very large datasets that do not fit
inside the memory of a single GPU, we then consider techniques for distributed
stochastic learning. We propose a novel method for optimally aggregating model
updates from worker nodes when the training data is distributed either by
example or by feature. Using this technique, we demonstrate that one can scale
out stochastic learning across up to 8 worker nodes without any significant
loss of training time. Finally, we combine GPU acceleration with the optimized
distributed method to train on a dataset consisting of 200 million training
examples and 75 million features. We show by scaling out across 4 GPUs, one can
attain a high degree of training accuracy in around 4 seconds: a 20x speed-up
in training time compared to a multi-threaded, distributed implementation
across 4 CPUs.

A Realistic Dataset for the Smart Home Device Scheduling Problem for DCOPs

The field of Distributed Constraint Optimization has gained momentum in
recent years thanks to its ability to address various applications related to
multi-agent cooperation. While techniques to solve Distributed Constraint
Optimization Problems (DCOPs) are abundant and have matured substantially since
the field inception, the number of DCOP realistic applications and benchmark
used to asses the performance of DCOP algorithms is lagging behind. To contrast
this background we (i) introduce the Smart Home Device Scheduling (SHDS)
problem, which describe the problem of coordinating smart devices schedules
across multiple homes as a multi-agent system, (ii) detail the physical models
adopted to simulate smart sensors, smart actuators, and homes environments, and
(iii) introduce a DCOP realistic benchmark for SHDS problems.

Learning

On the ability of neural nets to express distributions

Holden Lee, Rong Ge, Andrej Risteski, Tengyu Ma, Sanjeev Arora
Comments: Under review for COLT 2017
Subjects: Learning (cs.LG)

Deep neural nets have caused a revolution in many classification tasks. A
related ongoing revolution—also theoretically not understood—concerns their
ability to serve as generative models for complicated types of data such as
images and texts. These models are trained using ideas like variational
autoencoders and Generative Adversarial Networks.

We take a first cut at explaining the expressivity of multilayer nets by
giving a sufficient criterion for a function to be approximable by a neural
network with (n) hidden layers. A key ingredient is Barron’s Theorem
cite{Barron1993}, which gives a Fourier criterion for approximability of a
function by a neural network with 1 hidden layer. We show that a composition of
(n) functions which satisfy certain Fourier conditions (“Barron functions”) can
be approximated by a (n+1)-layer neural network.

For probability distributions, this translates into a criterion for a
probability distribution to be approximable in Wasserstein distance—a natural
metric on probability distributions—by a neural network applied to a fixed
base distribution (e.g., multivariate gaussian).

Building up recent lower bound work, we also give an example function that
shows that composition of Barron functions is more expressive than Barron
functions alone.

Learning Hawkes Processes from Short Doubly-Censored Event Sequences

Hongteng Xu, Dixin Luo, Hongyuan Zha
Subjects: Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)

Many real-world applications require robust algorithms to learn point process
models based on a type of incomplete data — the so-called short
doubly-censored (SDC) event sequences. In this paper, we study this critical
problem of quantitative asynchronous event sequence analysis under the
framework of Hawkes processes by leveraging the general idea of data synthesis.
In particular, given SDC event sequences observed in a variety of time
intervals, we propose a sampling-stitching data synthesis method — sampling
predecessor and successor for each SDC event sequence from potential candidates
and stitching them together to synthesize long training sequences. The
rationality and the feasibility of our method are discussed in terms of
arguments based on likelihood. Experiments on both synthetic and real-world
data demonstrate that the proposed data synthesis method improves learning
results indeed for both time-invariant and time-varying Hawkes processes.

Large-Scale Stochastic Learning using GPUs

In this work we propose an accelerated stochastic learning system for very
large-scale applications. Acceleration is achieved by mapping the training
algorithm onto massively parallel processors: we demonstrate a parallel,
asynchronous GPU implementation of the widely used stochastic coordinate
descent/ascent algorithm that can provide up to 35x speed-up over a sequential
CPU implementation. In order to train on very large datasets that do not fit
inside the memory of a single GPU, we then consider techniques for distributed
stochastic learning. We propose a novel method for optimally aggregating model
updates from worker nodes when the training data is distributed either by
example or by feature. Using this technique, we demonstrate that one can scale
out stochastic learning across up to 8 worker nodes without any significant
loss of training time. Finally, we combine GPU acceleration with the optimized
distributed method to train on a dataset consisting of 200 million training
examples and 75 million features. We show by scaling out across 4 GPUs, one can
attain a high degree of training accuracy in around 4 seconds: a 20x speed-up
in training time compared to a multi-threaded, distributed implementation
across 4 CPUs.

Heavy-Tailed Analogues of the Covariance Matrix for ICA

Joseph Anderson, Navin Goyal, Anupama Nandi, Luis Rademacher
Comments: 16 Pages, 9 Figures, AAAI 2017
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Independent Component Analysis (ICA) is the problem of learning a square
matrix (A), given samples of (X=AS), where (S) is a random vector with
independent coordinates. Most existing algorithms are provably efficient only
when each (S_i) has finite and moderately valued fourth moment. However, there
are practical applications where this assumption need not be true, such as
speech and finance. Algorithms have been proposed for heavy-tailed ICA, but
they are not practical, using random walks and the full power of the ellipsoid
algorithm multiple times. The main contributions of this paper are:

(1) A practical algorithm for heavy-tailed ICA that we call HTICA. We provide
theoretical guarantees and show that it outperforms other algorithms in some
heavy-tailed regimes, both on real and synthetic data. Like the current
state-of-the-art, the new algorithm is based on the centroid body (a first
moment analogue of the covariance matrix). Unlike the state-of-the-art, our
algorithm is practically efficient. To achieve this, we use explicit analytic
representations of the centroid body, which bypasses the use of the ellipsoid
method and random walks.

(2) We study how heavy tails affect different ICA algorithms, including
HTICA. Somewhat surprisingly, we show that some algorithms that use the
covariance matrix or higher moments can successfully solve a range of ICA
instances with infinite second moment. We study this theoretically and
experimentally, with both synthetic and real-world heavy-tailed data.

A Converse to Banach's Fixed Point Theorem and its CLS Completeness

Constantinos Daskalakis, Christos Tzamos, Manolis Zampetakis
Subjects: Computational Complexity (cs.CC); Learning (cs.LG); General Topology (math.GN); Machine Learning (stat.ML)

Banach’s fixed point theorem for contraction maps has been widely used to
analyze the convergence of iterative methods in non-convex problems. It is a
common experience, however, that iterative maps fail to be globally contracting
under the natural metric in their domain, making the applicability of Banach’s
theorem limited. We explore how generally we can apply Banach’s fixed point
theorem to establish the convergence of iterative methods when pairing it with
carefully designed metrics.

Our first result is a strong converse of Banach’s theorem, showing that it is
a universal analysis tool for establishing uniqueness of fixed points and for
bounding the convergence rate of iterative maps to a unique fixed point. In
other words, we show that, whenever an iterative map globally converges to a
unique fixed point, there exists a metric under which the iterative map is
contracting and which can be used to bound the number of iterations until
convergence. We illustrate our approach in the widely used power method,
providing a new way of bounding its convergence rate through contraction
arguments.

We next consider the computational complexity of Banach’s fixed point
theorem. Making the proof of our converse theorem constructive, we show that
computing a fixed point whose existence is guaranteed by Banach’s fixed point
theorem is CLS-complete. We thus provide the first natural complete problem for
the class CLS, which was defined in [Daskalakis-Papadimitriou 2011] to capture
the complexity of problems such as P-matrix LCP, computing KKT-points, and
finding mixed Nash equilibria in congestion and network coordination games.

Learning to Draw Dynamic Agent Goals with Generative Adversarial Networks

Shariq Iqbal, John Pearson
Comments: Submitted to ICML
Subjects: Neurons and Cognition (q-bio.NC); Learning (cs.LG); Machine Learning (stat.ML)

We address the problem of designing artificial agents capable of reproducing
human behavior in a competitive game involving dynamic control. Given data
consisting of multiple realizations of inputs generated by pairs of interacting
players, we model each agent’s actions as governed by a time-varying latent
goal state coupled to a control model. These goals, in turn, are described as
stochastic processes evolving according to player-specific value functions
depending on the current state of the game. We model these value functions
using generative adversarial networks (GANs) and show that our GAN-based
approach succeeds in producing sample gameplay that captures the rich dynamics
of human agents. The latent goal dynamics inferred and generated by our model
has applications to fields like neuroscience and animal behavior, where the
underlying value functions themselves are of theoretical interest.

Causal Discovery Using Proxy Variables

Mateo Rojas-Carulla, Marco Baroni, David Lopez-Paz
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

Discovering causal relations is fundamental to reasoning and intelligence. In
particular, observational causal discovery algorithms estimate the cause-effect
relation between two random entities (X) and (Y), given (n) samples from
(P(X,Y)).

In this paper, we develop a framework to estimate the cause-effect relation
between two static entities (x) and (y): for instance, an art masterpiece (x)
and its fraudulent copy (y). To this end, we introduce the notion of proxy
variables, which allow the construction of a pair of random entities ((A,B))
from the pair of static entities ((x,y)). Then, estimating the cause-effect
relation between (A) and (B) using an observational causal discovery algorithm
leads to an estimation of the cause-effect relation between (x) and (y). For
example, our framework detects the causal relation between unprocessed
photographs and their modifications, and orders in time a set of shuffled
frames from a video.

As our main case study, we introduce a human-elicited dataset of 10,000 pairs
of casually-linked pairs of words from natural language. Our methods discover
75% of these causal relations. Finally, we discuss the role of proxy variables
in machine learning, as a general tool to incorporate static knowledge into
prediction tasks.

Online Multiclass Boosting

Young Hun Jung, Ambuj Tewari
Comments: 17 pages, 2 figures
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

Recent work has extended the theoretical analysis of boosting algorithms to
multiclass problems and online settings. However, the multiclass extension is
in the batch setting and the online extensions only consider binary
classification. To the best of our knowledge, there exists no framework to
analyze online boosting algorithms for multiclass classification. We fill this
gap in the literature by defining, and justifying, a weak learning condition
for online multiclass boosting. We also provide an algorithm called online
multiclass boost-by-majority to optimally combine weak learners in our setting.

Rotting Bandits

Nir Levine, Koby Crammer, Shie Mannor
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

The Multi-Armed Bandits (MAB) framework highlights the tension between
acquiring new knowledge (Exploration) and leveraging available knowledge
(Exploitation). In the classical MAB problem, a decision maker must choose an
arm at each time step, upon which she receives a reward. The decision maker’s
objective is to maximize her cumulative expected reward over the time horizon.
The MAB problem has been studied extensively, specifically under the assumption
of the arms’ rewards distributions being stationary, or quasi-stationary, over
time. We consider a variant of the MAB framework, which we termed
extit{Rotting Bandits}, where each arm’s expected reward decays as a function
of the number of times it has been pulled. We are motivated by many real-world
scenarios such as online advertising, content recommendation, crowdsourcing,
and more. We present algorithms, accompanied by simulations, and derive
theoretical guarantees.

A minimax and asymptotically optimal algorithm for stochastic bandits

Pierre Ménard (IMT), Aurélien Garivier (IMT)
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Statistics Theory (math.ST)

We propose the kl-UCB ++ algorithm for regret minimization in stochastic
bandit models with exponential families of distributions. We prove that it is
simultaneously asymptotically optimal (in the sense of Lai and Robbins’ lower
bound) and minimax optimal. This is the first algorithm proved to enjoy these
two properties at the same time. This work thus merges two different lines of
research, with simple proofs involving no complexity overhead.

Stability of Topic Modeling via Matrix Factorization

Mark Belford, Brian Mac Namee, Derek Greene
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Learning (cs.LG); Machine Learning (stat.ML)

Topic models can provide us with an insight into the underlying latent
structure of a large corpus of documents. A range of methods have been proposed
in the literature, including probabilistic topic models and techniques based on
matrix factorization. However, in both cases, standard implementations rely on
stochastic elements in their initialization phase, which can potentially lead
to different results being generated on the same corpus when using the same
parameter values. This corresponds to the concept of “instability” which has
previously been studied in the context of (k)-means clustering. In many
applications of topic modeling, this problem of instability is not considered
and topic models are treated as being definitive, even though the results may
change considerably if the initialization process is altered. In this paper we
demonstrate the inherent instability of popular topic modeling approaches,
using a number of new measures to assess stability. To address this issue in
the context of matrix factorization for topic modeling, we propose the use of
ensemble learning strategies. Based on experiments performed on annotated text
corpora, we show that a K-Fold ensemble strategy, combining both ensembles and
structured initialization, can significantly reduce instability, while
simultaneously yielding more accurate topic models.

Automatic Representation for Lifetime Value Recommender Systems

Assaf Hallak, Yishay Mansour, Elad Yom-Tov
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

Many modern commercial sites employ recommender systems to propose relevant
content to users. While most systems are focused on maximizing the immediate
gain (clicks, purchases or ratings), a better notion of success would be the
lifetime value (LTV) of the user-system interaction. The LTV approach considers
the future implications of the item recommendation, and seeks to maximize the
cumulative gain over time. The Reinforcement Learning (RL) framework is the
standard formulation for optimizing cumulative successes over time. However, RL
is rarely used in practice due to its associated representation, optimization
and validation techniques which can be complex. In this paper we propose a new
architecture for combining RL with recommendation systems which obviates the
need for hand-tuned features, thus automating the state-space representation
construction process. We analyze the practical difficulties in this formulation
and test our solutions on batch off-line real-world recommendation data.

Consistent On-Line Off-Policy Evaluation

Assaf Hallak, Shie Mannor
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

The problem of on-line off-policy evaluation (OPE) has been actively studied
in the last decade due to its importance both as a stand-alone problem and as a
module in a policy improvement scheme. However, most Temporal Difference (TD)
based solutions ignore the discrepancy between the stationary distribution of
the behavior and target policies and its effect on the convergence limit when
function approximation is applied. In this paper we propose the Consistent
Off-Policy Temporal Difference (COP-TD((lambda), (eta))) algorithm that
addresses this issue and reduces this bias at some computational expense. We
show that COP-TD((lambda), (eta)) can be designed to converge to the same
value that would have been obtained by using on-policy TD((lambda)) with the
target policy. Subsequently, the proposed scheme leads to a related and
promising heuristic we call log-COP-TD((lambda), (eta)). Both algorithms
have favorable empirical results to the current state of the art on-line OPE
algorithms. Finally, our formulation sheds some new light on the recently
proposed Emphatic TD learning.

Discriminating Traces with Time

Saeid Tizpaz-Niari, Pavol Cerny, Bor-Yuh Evan Chang, Sriram Sankaranarayanan, Ashutosh Trivedi
Comments: Published in TACAS 2017
Subjects: Programming Languages (cs.PL); Cryptography and Security (cs.CR); Formal Languages and Automata Theory (cs.FL); Learning (cs.LG); Software Engineering (cs.SE)

What properties about the internals of a program explain the possible
differences in its overall running time for different inputs? In this paper, we
propose a formal framework for considering this question we dub trace-set
discrimination. We show that even though the algorithmic problem of computing
maximum likelihood discriminants is NP-hard, approaches based on integer linear
programming (ILP) and decision tree learning can be useful in zeroing-in on the
program internals. On a set of Java benchmarks, we find that
compactly-represented decision trees scalably discriminate with high
accuracy—more scalably than maximum likelihood discriminants and with
comparable accuracy. We demonstrate on three larger case studies how
decision-tree discriminants produced by our tool are useful for debugging
timing side-channel vulnerabilities (i.e., where a malicious observer infers
secrets simply from passively watching execution times) and availability
vulnerabilities.

Bidirectional Backpropagation: Towards Biologically Plausible Error Signal Transmission in Neural Networks

Hongyin Luo, Jie Fu, James Glass
Subjects: Neural and Evolutionary Computing (cs.NE); Learning (cs.LG)

The back-propagation (BP) algorithm has been considered the de facto method
for training deep neural networks. It back-propagates errors from the output
layer to the hidden layers in an exact manner using feedforward weights. In
this work, we propose a more biologically plausible paradigm of neural
architecture according to biological findings. Specifically, we propose two
bidirectional learning algorithms with two sets of trainable weights.
Preliminary results show that our models perform best on the MNIST and the
CIFAR10 datasets among the asymmetric error signal passing methods, and their
performance is more close to that of BP.

Scalable Inference for Nested Chinese Restaurant Process Topic Models

Jianfei Chen, Jun Zhu, Jie Lu, Shixia Liu
Subjects: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR); Learning (cs.LG)

Nested Chinese Restaurant Process (nCRP) topic models are powerful
nonparametric Bayesian methods to extract a topic hierarchy from a given text
corpus, where the hierarchical structure is automatically determined by the
data. Hierarchical Latent Dirichlet Allocation (hLDA) is a popular instance of
nCRP topic models. However, hLDA has only been evaluated at small scale,
because the existing collapsed Gibbs sampling and instantiated weight
variational inference algorithms either are not scalable or sacrifice inference
quality with mean-field assumptions. Moreover, an efficient distributed
implementation of the data structures, such as dynamically growing count
matrices and trees, is challenging.

In this paper, we propose a novel partially collapsed Gibbs sampling (PCGS)
algorithm, which combines the advantages of collapsed and instantiated weight
algorithms to achieve good scalability as well as high model quality. An
initialization strategy is presented to further improve the model quality.
Finally, we propose an efficient distributed implementation of PCGS through
vectorization, pre-processing, and a careful design of the concurrent data
structures and communication strategy.

Empirical studies show that our algorithm is 111 times more efficient than
the previous open-source implementation for hLDA, with comparable or even
better model quality. Our distributed implementation can extract 1,722 topics
from a 131-million-document corpus with 28 billion tokens, which is 4-5 orders
of magnitude larger than the previous largest corpus, with 50 machines in 7
hours.

Learning Model Predictive Control for Iterative Tasks: A Computationally Efficient Approach for Linear System

Ugo Rosolia, Francesco Borrelli
Comments: arXiv admin note: substantial text overlap with arXiv:1609.01387
Subjects: Optimization and Control (math.OC); Learning (cs.LG)

A Learning Model Predictive Controller (LMPC) for linear system in presented.
The proposed controller is an extension of the LMPC [1] and it aims to decrease
the computational burden. The control scheme is reference-free and is able to
improve its performance by learning from previous iterations. A convex safe set
and a terminal cost function are used in order to guarantee recursive
feasibility and non- increasing performance at each iteration. The paper
presents the control design approach, and shows how to recursively construct
the convex terminal set and the terminal cost from state and input trajectories
of previous iterations. Simulation results show the effectiveness of the
proposed control logic.

One Size Fits Many: Column Bundle for Multi-X Learning

Trang Pham, Truyen Tran, Svetha Venkatesh
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

Much recent machine learning research has been directed towards leveraging
shared statistics among labels, instances and data views, commonly referred to
as multi-label, multi-instance and multi-view learning. The underlying premises
are that there exist correlations among input parts and among output targets,
and the predictive performance would increase when the correlations are
incorporated. In this paper, we propose Column Bundle (CLB), a novel deep
neural network for capturing the shared statistics in data. CLB is generic that
the same architecture can be applied for various types of shared statistics by
changing only input and output handling. CLB is capable of scaling to thousands
of input parts and output labels by avoiding explicit modeling of pairwise
relations. We evaluate CLB on different types of data: (a) multi-label, (b)
multi-view, (c) multi-view/multi-label and (d) multi-instance. CLB demonstrates
a comparable and competitive performance in all datasets against
state-of-the-art methods designed specifically for each type.

On Polynomial Time Methods for Exact Low Rank Tensor Completion

Dong Xia, Ming Yuan
Comments: 56 pages, 4 figures
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

In this paper, we investigate the sample size requirement for exact recovery
of a high order tensor of low rank from a subset of its entries. We show that a
gradient descent algorithm with initial value obtained from a spectral method
can, in particular, reconstruct a ({d imes d imes d}) tensor of multilinear
ranks ((r,r,r)) with high probability from as few as
(O(r^{7/2}d^{3/2}log^{7/2}d+r^7dlog^6d)) entries. In the case when the ranks
(r=O(1)), our sample size requirement matches those for nuclear norm
minimization (Yuan and Zhang, 2016a), or alternating least squares assuming
orthogonal decomposability (Jain and Oh, 2014). Unlike these earlier
approaches, however, our method is efficient to compute, easy to implement, and
does not impose extra structures on the tensor. Numerical results are presented
to further demonstrate the merits of the proposed approach.

Approximations of the Restless Bandit Problem

Steffen Grunewalder, Azadeh Khaleghi
Subjects: Statistics Theory (math.ST); Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)

The multi-armed restless bandit problem is studied in the case where the
pay-offs are not necessarily independent over time nor across the arms. Even
though this version of the problem provides a more realistic model for most
real-world applications, it cannot be optimally solved in practice since it is
known to be PSPACE-hard. The objective of this paper is to characterize special
sub-classes of the problem where good approximate solutions can be found using
tractable approaches. Specifically, it is shown that in the case where the
joint distribution over the arms is (varphi)-mixing, and under some conditions
on the (varphi)-mixing coefficients, a modified version of UCB can prove
optimal. On the other hand, it is shown that when the pay-off distributions are
strongly dependent, simple switching strategies may be devised which leverage
the strong inter-dependencies. To this end, an example is provided using
Gaussian Processes. The techniques developed in this paper apply, more
generally, to the problem of online sampling under dependence.

Transferring Face Verification Nets To Pain and Expression Regression

Limited annotated data is available for the research of estimating facial
expression intensities, which makes the training of deep networks for automated
expression assessment very challenging. Fortunately, fine-tuning from a
data-extensive pre-trained domain such as face verification can alleviate the
problem. In this paper, we propose a transferred network that fine-tunes a
state-of-the-art face verification network using expression-intensity labeled
data with a regression layer. In this way, the expression regression task can
benefit from the rich feature representations trained on a huge amount of data
for face verification. The proposed transferred deep regressor is applied in
estimating the intensity of facial action units (2017 EmotionNet Challenge) and
in particular pain intensity estimation (UNBS-McMaster Shoulder-Pain dataset).
It wins the second place in the challenge and achieves the state-of-the-art
performance on Shoulder-Pain dataset. Particularly for Shoulder-Pain with the
imbalance issue of different pain levels, a new weighted evaluation metric is
proposed.

Predicting non-linear dynamics: a stable local learning scheme for recurrent spiking neural networks

Aditya Gilra, Wulfram Gerstner
Subjects: Neurons and Cognition (q-bio.NC); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Systems and Control (cs.SY)

Brains need to predict how our muscles and body react to motor commands. How
networks of spiking neurons can learn to reproduce these non-linear dynamics,
using local, online and stable learning rules, is an important, open question.
Here, we present a supervised learning scheme for the feedforward and recurrent
connections in a network of heterogeneous spiking neurons. The error in the
output is fed back through fixed random connections with a negative gain,
causing the network to follow the desired dynamics, while an online and local
rule changes the weights; hence we call the scheme FOLLOW (Feedback-based
Online Local Learning Of Weights) The rule is local in the sense that weight
changes depend on the presynaptic activity and the error signal projected onto
the post-synaptic neuron. We provide examples of learning linear, non-linear
and chaotic dynamics, as well as the dynamics of a two-link arm. Using the
Lyapunov method, and under reasonable assumptions and approximations, we show
that FOLLOW learning is uniformly stable, with the error going to zero
asymptotically.

Information Theory

Two-Moment Inequalities for Rényi Entropy and Mutual Information

Galen Reeves
Subjects: Information Theory (cs.IT)

This paper explores some applications of a two-moment inequality for the
integral of the (r)-th power of a function, where (0 < r< 1). The first
contribution is an upper bound on the R'{e}nyi entropy of a random vector in
terms of the two different moments. When one of the moments is the zeroth
moment, these bounds recover previous results based on maximum entropy
distributions under a single moment constraint. More generally, evaluation of
the bound with two carefully chosen nonzero moments can lead to significant
improvements with a modest increase in complexity. The second contribution is a
method for upper bounding mutual information in terms of certain integrals with
respect to the variance of the conditional density. The bounds have a number of
useful properties arising from the connection with variance decompositions.

How to Optimally Allocate Resources for Coded Distributed Computing?

Qian Yu, Songze Li, Mohammad Ali Maddah-Ali, A. Salman Avestimehr
Subjects: Information Theory (cs.IT); Distributed, Parallel, and Cluster Computing (cs.DC)

Today’s data centers have an abundance of computing resources, hosting server
clusters consisting of as many as tens or hundreds of thousands of machines. To
execute a complex computing task over a data center, it is natural to
distribute computations across many nodes to take advantage of parallel
processing. However, as we allocate more and more computing resources to a
computation task and further distribute the computations, large amounts of
(partially) computed data must be moved between consecutive stages of
computation tasks among the nodes, hence the communication load can become the
bottleneck. In this paper, we study the optimal allocation of computing
resources in distributed computing, in order to minimize the total execution
time in distributed computing accounting for both the duration of computation
and communication phases. In particular, we consider a general MapReduce-type
distributed computing framework, in which the computation is decomposed into
three stages: emph{Map}, emph{Shuffle}, and emph{Reduce}. We focus on a
recently proposed emph{Coded Distributed Computing} approach for MapReduce and
study the optimal allocation of computing resources in this framework. For all
values of problem parameters, we characterize the optimal number of servers
that should be used for distributed processing, provide the optimal placements
of the Map and Reduce tasks, and propose an optimal coded data shuffling
scheme, in order to minimize the total execution time. To prove the optimality
of the proposed scheme, we first derive a matching information-theoretic
converse on the execution time, then we prove that among all possible resource
allocation schemes that achieve the minimum execution time, our proposed scheme
uses the exactly minimum possible number of servers.

A Novel Index Coding Scheme and its Application to Coded Caching

Kai Wan, Daniela Tuninetti, Pablo Piantanida
Comments: ITA 2017
Subjects: Information Theory (cs.IT)

This paper proposes a novel achievable scheme for the index problem and
applies it to the caching problem. Index coding and caching are noiseless
broadcast channel problems where receivers have message side information.In the
index coding problem the side information sets are fixed, while in the caching
problem the side information sets correspond the cache contents, which are
under the control of the system designer. The proposed index coding scheme,
based on distributed source coding and non-unique decoding,is shown to strictly
enlarge the rate region achievable by composite coding.The novel index coding
scheme applied to the caching problem is then shown to match an outer bound
(previously proposed by the authors and also based on known results for the
index coding problem) under the assumption of uncoded cache
placement/prefetching.

Massive MIMO Pilot Decontamination and Channel Interpolation via Wideband Sparse Channel Estimation

Saeid Haghighatshoar, Giuseppe Caire
Comments: A short version of this paper was presented in 50th Annual Asilomar Conference on Signals, Systems, and Computers (Asilomar 2016)
Subjects: Information Theory (cs.IT)

We consider a massive MIMO system, based on Time Division Duplexing (TDD) and
channel reciprocity, where the base stations learn the channel vectors of their
users via the pilots transmitted by the users in the uplink time-frequency
slots. It has been observed that, in the limit of very large number of
antennas, the only factor limiting the achievable throughput of such a system
is pilot contamination, which arises since the same set of orthogonal pilots is
eventually reused in multiple cells. In the regime of moderately large number
of antennas, another source of degradation is channel interpolation since, due
to limited pilot dimension, the pilot signal of each user probes only a limited
number OFDM subcarriers and the channel must be interpolated in the remaining
subcarriers over which no pilot symbol is transmitted.

In this paper, we design a low-complexity algorithm that uses the received
wideband pilot snapshots of a user inside an observation window containing
several coherence blocks to obtain an estimate of the angle-delay Power Spread
Function (PSF) of its contaminated channel vectors. We propose supervised and
unsupervised clustering algorithms to decompose the estimated PSF into two
parts: 1) the signal part corresponding to the desired user and 2) the
interference part corresponding to the copilot users. We use this decomposition
to obtain an estimate of the covariance matrix of the user channel vector,
which we exploit to decontaminate the received pilot signal in the next
coherence blocks. We also propose an effective low-complexity decontamination
scheme which also serves as an efficient approximation of the Minimum Mean
Squared Error (MMSE) smoothing filter, i.e., the optimal channel interpolator
in the MMSE sense. We use numerical simulations to assess the performance of
our proposed PSF estimation and pilot-decontamination/channel interpolation
algorithm.

Massive MIMO 5G Cellular Networks: mm-wave vs. μ-wave Frequencies

Stefano Buzzi, Carmen D'Andrea
Comments: Invited paper; to appear on the ZTE Communications special issue on 5G New Radio
Subjects: Information Theory (cs.IT)

Enhanced mobile broadband (eMBB) is one of the key use-cases for the
development of the new standard 5G New Radio for the next generation of mobile
wireless networks. Large-scale antenna arrays, a.k.a. Massive MIMO, the usage
of carrier frequencies in the range 10-100 GHz, the so-called millimeter wave
(mm-wave) band, and the network densification with the introduction of
small-sized cells are the three technologies that will permit implementing eMBB
services and realizing the Gbit/s mobile wireless experience. This paper is
focused on the massive MIMO technology; initially conceived for conventional
cellular frequencies in the sub-6 GHz range (mu-wave), the massive MIMO
concept has been then progressively extended to the case in which mm-wave
frequencies are used. However, due to different propagation mechanisms in urban
scenarios, the resulting MIMO channel models at mu-wave and mm-wave are
radically different. Six key basic differences are pinpointed in this paper,
along with the implications that they have on the architecture and algorithms
of the communication transceivers and on the attainable performance in terms of
reliability and multiplexing capabilities.

Space-Time Channel Modulation

Ertugrul Basar, Ibrahim Altunbas
Comments: Accepted for publication in IEEE Transactions on Vehicular Technology
Subjects: Information Theory (cs.IT)

In this paper, we introduce the concept of space-time channel modulation
(STCM), which extends the classical space-time block codes into a new third
dimension: channel states (transmission media) dimension. Three novel STCM
schemes, which provide interesting trade-offs among decoding complexity, error
performance and data rate, are proposed. It is shown via computer simulations
that the proposed STCM schemes achieve considerably better error performance
than the existing media-based modulation (MBM) and classical systems.

Design of a Cognitive VLC Network with Illumination and Handover Requirements

Marwan Hammouda, Jürgen Peissig, Anna Maria Vegni
Subjects: Information Theory (cs.IT)

In this paper, we consider a cognitive indoor visible light communications
(VLC) system, comprised of multiple access points serving primary and secondary
users through the orthogonal frequency division multiple access method. A
cognitive lighting cell is divided into two non-overlapping regions that
distinguish the primary and secondary users based on the region they are
located in. Under the assumption of equal-power allocation among subcarriers,
each region is defined in terms of its physical area and the number of
allocated subcarriers within that region. In this paper, we provide the
lighting cell design with cognitive constraints that guarantee fulfilling
certain illumination, user mobility, and handover requirements in each cell. We
further argue that, under some conditions, a careful assignment of the
subcarriers in each region can mitigate the co-channel interference in the
overlapping areas of adjacent cells. Numerical results depict the influence of
different system parameters, such as user density, on defining both regions.
Finally, a realistic example is implemented to assess the performance of the
proposed scheme via Monte Carlo simulations.

Multiuser Millimeter Wave Beamforming Strategies with Quantized and Statistical CSIT

Mingbo Dai, Bruno Clerckx
Comments: submitted to TWC
Subjects: Information Theory (cs.IT)

To alleviate the high cost of hardware in mmWave systems, hybrid
analog/digital precoding is typically employed. In the conventional two-stage
feedback scheme, the analog beamformer is determined by beam search and
feedback to maximize the desired signal power of each user. The digital
precoder is designed based on quantization and feedback of effective channel to
mitigate multiuser interference. Alternatively, we propose a one-stage feedback
scheme which effectively reduces the complexity of the signalling and feedback
procedure. Specifically, the second-order channel statistics are leveraged to
design digital precoder for interference mitigation while all feedback overhead
is reserved for precise analog beamforming. Under a fixed total feedback
constraint, we investigate the conditions under which the one-stage feedback
scheme outperforms the conventional two-stage counterpart. Moreover, a rate
splitting (RS) transmission strategy is introduced to further tackle the
multiuser interference and enhance the rate performance. Consider (1) RS
precoded by the one-stage feedback scheme and (2) conventional transmission
strategy precoded by the two-stage scheme with the same first-stage feedback as
(1) and also certain amount of extra second-stage feedback. We show that (1)
can achieve a sum rate comparable to that of (2). Hence, RS enables remarkable
saving in the second-stage training and feedback overhead.

Proactive Resource Management in LTE-U Systems: A Deep Learning Perspective

Ursula Challita, Li Dong, Walid Saad
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)

LTE in unlicensed spectrum (LTE-U) is a promising approach to overcome the
wireless spectrum scarcity. However, to reap the benefits of LTE-U, a fair
coexistence mechanism with other incumbent WiFi deployments is required. In
this paper, a novel deep learning approach is proposed for modeling the
resource allocation problem of LTE-U small base stations (SBSs). The proposed
approach enables multiple SBSs to proactively perform dynamic channel
selection, carrier aggregation, and fractional spectrum access while
guaranteeing fairness with existing WiFi networks and other LTE-U operators.
Adopting a proactive coexistence mechanism enables future delay-intolerant
LTE-U data demands to be served within a given prediction window ahead of their
actual arrival time thus avoiding the underutilization of the unlicensed
spectrum during off-peak hours while maximizing the total served LTE-U traffic
load. To this end, a noncooperative game model is formulated in which SBSs are
modeled as Homo Egualis agents that aim at predicting a sequence of future
actions and thus achieving long-term equal weighted fairness with WLAN and
other LTE-U operators over a given time horizon. The proposed deep learning
algorithm is then shown to reach a mixed-strategy Nash equilibrium (NE), when
it converges. Simulation results using real data traces show that the proposed
scheme can yield up to 28% and 11% gains over a conventional reactive approach
and a proportional fair coexistence mechanism, respectively. The results also
show that the proposed framework prevents WiFi performance degradation for a
densely deployed LTE-U network.

Maximally Correlated Principal Component Analysis

Soheil Feizi, David Tse
Comments: 35 pages, 5 figures
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Learning (cs.LG)

In the era of big data, reducing data dimensionality is critical in many
areas of science. Widely used Principal Component Analysis (PCA) addresses this
problem by computing a low dimensional data embedding that maximally explain
variance of the data. However, PCA has two major weaknesses. Firstly, it only
considers linear correlations among variables (features), and secondly it is
not suitable for categorical data. We resolve these issues by proposing
Maximally Correlated Principal Component Analysis (MCPCA). MCPCA computes
transformations of variables whose covariance matrix has the largest Ky Fan
norm. Variable transformations are unknown, can be nonlinear and are computed
in an optimization. MCPCA can also be viewed as a multivariate extension of
Maximal Correlation. For jointly Gaussian variables we show that the covariance
matrix corresponding to the identity (or the negative of the identity)
transformations majorizes covariance matrices of non-identity functions. Using
this result we characterize global MCPCA optimizers for nonlinear functions of
jointly Gaussian variables for every rank constraint. For categorical variables
we characterize global MCPCA optimizers for the rank one constraint based on
the leading eigenvector of a matrix computed using pairwise joint
distributions. For a general rank constraint we propose a block coordinate
descend algorithm and show its convergence to stationary points of the MCPCA
optimization. We compare MCPCA with PCA and other state-of-the-art
dimensionality reduction methods including Isomap, LLE, multilayer autoencoders
(neural networks), kernel PCA, probabilistic PCA and diffusion maps on several
synthetic and real datasets. We show that MCPCA consistently provides improved
performance compared to other methods.