IT博客汇 | arXiv Paper Daily: Mon, 15 May 2017

arXiv Paper Daily: Mon, 15 May 2017

我爱机器学习(52ml.net)发表于 2017-05-15 00:00:00

Neural and Evolutionary Computing

A natural approach to studying schema processing

Jack McKay Fletcher, Thomas Wennekers
Subjects: Neural and Evolutionary Computing (cs.NE)

The Building Block Hypothesis (BBH) states that adaptive systems combine good
partial solutions (so-called building blocks) to find increasingly better
solutions. It is thought that Genetic Algorithms (GAs) implement the BBH.
However, for GAs building blocks are semi-theoretical objects in that they are
thought only to be implicitly exploited via the selection and crossover
operations of a GA. In the current work, we discover a mathematical method to
identify the complete set of schemata present in a given population of a GA; as
such a natural way to study schema processing (and thus the BBH) is revealed.
We demonstrate how this approach can be used both theoretically and
experimentally. Theoretically, we show that the search space for good schemata
is a complete lattice and that each generation samples a complete sub-lattice
of this search space. In addition, we show that combining schemata can only
explore a subset of the search space. Experimentally, we compare how well
different crossover methods combine building blocks. We find that for most
crossover methods approximately 25-35% of building blocks in a generation
result from the combination of the previous generation’s building blocks. We
also find that an increase in the combination of building blocks does not lead
to an increase in the efficiency of a GA. To complement this article, we
introduce an open source Python package called schematax, which allows one to
calculate the schemata present in a population using the methods described in
this article.

An overview and comparative analysis of Recurrent Neural Networks for Short Term Load Forecasting

Filippo Maria Bianchi, Enrico Maiorino, Michael C. Kampffmeyer, Antonello Rizzi, Robert Jenssen
Subjects: Neural and Evolutionary Computing (cs.NE)

The key component in forecasting demand and consumption of resources in a
supply network is an accurate prediction of real-valued time series. Indeed,
both service interruptions and resource waste can be reduced with the
implementation of an effective forecasting system. Significant research has
thus been devoted to the design and development of methodologies for short term
load forecasting over the past decades. A class of mathematical models, called
Recurrent Neural Networks, are nowadays gaining renewed interest among
researchers and they are replacing many practical implementation of the
forecasting systems, previously based on static methods. Despite the undeniable
expressive power of these architectures, their recurrent nature complicates
their understanding and poses challenges in the training procedures. Recently,
new important families of recurrent architectures have emerged and their
applicability in the context of load forecasting has not been investigated
completely yet. In this paper we perform a comparative study on the problem of
Short-Term Load Forecast, by using different classes of state-of-the-art
Recurrent Neural Networks. We test the reviewed models first on controlled
synthetic tasks and then on different real datasets, covering important
practical cases of study. We provide a general overview of the most important
architectures and we define guidelines for configuring the recurrent networks
to predict real-valued time series.

Computer Vision and Pattern Recognition

Single Image Action Recognition by Predicting Space-Time Saliency

Marjaneh Safaei, Hassan Foroosh
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We propose a novel approach based on deep Convolutional Neural Networks (CNN)
to recognize human actions in still images by predicting the future motion, and
detecting the shape and location of the salient parts of the image. We make the
following major contributions to this important area of research: (i) We use
the predicted future motion in the static image (Walker et al., 2015) as a
means of compensating for the missing temporal information, while using the
saliency map to represent the the spatial information in the form of location
and shape of what is predicted as significant. (ii) We cast action
classification in static images as a domain adaptation problem by transfer
learning. We first map the input static image to a new domain that we refer to
as the Predicted Optical Flow-Saliency Map domain (POF-SM), and then fine-tune
the layers of a deep CNN model trained on classifying the ImageNet dataset to
perform action classification in the POF-SM domain. (iii) We tested our method
on the popular Willow dataset. But unlike existing methods, we also tested on a
more realistic and challenging dataset of over 2M still images that we
collected and labeled by taking random frames from the UCF-101 video dataset.
We call our dataset the UCF Still Image dataset or UCFSI-101 in short. Our
results outperform the state of the art.

Towards a Principled Integration of Multi-Camera Re-Identification and Tracking through Optimal Bayes Filters

Lucas Beyer, Stefan Breuers, Vitaly Kurin, Bastian Leibe
Comments: First two authors have equal contribution. This is initial work into a new direction, not a benchmark-beating method
Subjects: Computer Vision and Pattern Recognition (cs.CV)

With the rise of end-to-end learning through deep learning, person detectors
and re-identification (ReID) models have recently become very strong.
Multi-camera multi-target (MCMT) tracking has not fully gone through this
transformation yet. We intend to take another step in this direction by
presenting a theoretically principled way of integrating ReID with tracking
formulated as an optimal Bayes filter. This conveniently side-steps the need
for data-association and opens up a direct path from full images to the core of
the tracker. While the results are still sub-par, we believe that this new,
tight integration opens many interesting research opportunities and leads the
way towards full end-to-end tracking from raw pixels.

Self-Commmittee Approach for Image Restoration Problems using Convolutional Neural Network

Byeongyong Ahn, Nam Ik Cho
Comments: 4 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

There have been many discriminative learning methods using convolutional
neural networks (CNN) for several image restoration problems, which learn the
mapping function from a degraded input to the clean output. In this letter, we
propose a self-committee method that can find enhanced restoration results from
the multiple trial of a trained CNN with different but related inputs.
Specifically, it is noted that the CNN sometimes finds different mapping
functions when the input is transformed by a reversible transform and thus
produces different but related outputs with the original. Hence averaging the
outputs for several different transformed inputs can enhance the results as
evidenced by the network committee methods. Unlike the conventional committee
approaches that require several networks, the proposed method needs only a
single network. Experimental results show that adding an additional transform
as a committee always brings additional gain on image denoising and single
image supre-resolution problems.

Detection of irregular QRS complexes using Hermite Transform and Support Vector Machine

Zoja Vulaj, Milos Brajovic, Andjela Draganic, Irena Orovic
Comments: submitted to 59th International Symposium ELMAR-2017, Zadar, Croatia
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Computer based recognition and detection of abnormalities in ECG signals is
proposed. For this purpose, the Support Vector Machines (SVM) are combined with
the advantages of Hermite transform representation. SVM represent a special
type of classification techniques commonly used in medical applications.
Automatic classification of ECG could make the work of cardiologic departments
faster and more efficient. It would also reduce the number of false diagnosis
and, as a result, save lives. The working principle of the SVM is based on
translating the data into a high dimensional feature space and separating it
using a linear classificator. In order to provide an optimal representation for
SVM application, the Hermite transform domain is used. This domain is proved to
be suitable because of the similarity of the QRS complex with Hermite basis
functions. The maximal signal information is obtained using a small set of
features that are used for detection of irregular QRS complexes. The aim of the
paper is to show that these features can be employed for automatic ECG signal
analysis.

Spatial-Temporal Recurrent Neural Network for Emotion Recognition

Tong Zhang (1 and 2), Wenming Zheng (2), Zhen Cui (2), Yuan Zong (2), Yang Li (1 and 2) ((1) the Department of Information Science and Engineering, Southeast University, Nanjing, China (2) the Key Laboratory of Child Development and Learning Science of Ministry of Education, Research Center for Learning Science, Southeast University, Nanjing, China)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Emotion analysis is a crucial problem to endow artifact machines with real
intelligence in many large potential applications. As external appearances of
human emotions, electroencephalogram (EEG) signals and video face signals are
widely used to track and analyze human’s affective information. According to
their common characteristics of spatial-temporal volumes, in this paper we
propose a novel deep learning framework named spatial-temporal recurrent neural
network (STRNN) to unify the learning of two different signal sources into a
spatial-temporal dependency model. In STRNN, to capture those spatially
cooccurrent variations of human emotions, a multi-directional recurrent neural
network (RNN) layer is employed to capture longrange contextual cues by
traversing the spatial region of each time slice from multiple angles. Then a
bi-directional temporal RNN layer is further used to learn discriminative
temporal dependencies from the sequences concatenating spatial features of each
time slice produced from the spatial RNN layer. To further select those salient
regions of emotion representation, we impose sparse projection onto those
hidden states of spatial and temporal domains, which actually also increases
the model discriminant ability because of this global consideration.
Consequently, such a two-layer RNN model builds spatial dependencies as well as
temporal dependencies of the input signals. Experimental results on the public
emotion datasets of EEG and facial expression demonstrate the proposed STRNN
method is more competitive over those state-of-the-art methods.

External Prior Guided Internal Prior Learning for Real Noisy Image Denoising

Jun Xu, Lei Zhang, David Zhang
Comments: 13 pages, 11figures, submitted to TIP
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Most of existing image denoising methods learn image priors from either
external data or the noisy image itself to remove noise. However, priors
learned from external data may not be adaptive to the image to be denoised,
while priors learned from the given noisy image may not be accurate due to the
interference of corrupted noise. Meanwhile, the noise in real-world noisy
images is very complex, which is hard to be described by simple distributions
such as Gaussian distribution, making real noisy image denoising a very
challenging problem. We propose to exploit the information in both external
data and the given noisy image, and develop an external prior guided internal
prior learning method for real noisy image denoising. We first learn external
priors from an independent set of clean natural images. With the aid of learned
external priors, we then learn internal priors from the given noisy image to
refine the prior model. The external and internal priors are formulated as a
set of orthogonal dictionaries to efficiently reconstruct the desired image.
Extensive experiments are performed on several real noisy image datasets. The
proposed method demonstrates highly competitive denoising performance,
outperforming state-of-the-art denoising methods including those designed for
real noisy images.

TraX: The visual Tracking eXchange Protocol and Library

Luka Čehovin
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper we address the problem of developing on-line visual tracking
algorithms. We present a specialized communication protocol that serves as a
bridge between a tracker implementation and utilizing application. It decouples
development of algorithms and application, encouraging re-usability. The
primary use case is algorithm evaluation where the protocol facilitates more
complex evaluation scenarios that are used nowadays thus pushing forward the
field of visual tracking. We present a reference implementation of the protocol
that makes it easy to use in several popular programming languages and discuss
where the protocol is already used and some usage scenarios that we envision
for the future.

Learning to Refine Object Contours with a Top-Down Fully Convolutional Encoder-Decoder Network

Yahui Liu, Jian Yao, Li Li, Xiaohu Lu, Jing Han
Comments: 12 pages, 13 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We develop a novel deep contour detection algorithm with a top-down fully
convolutional encoder-decoder network. Our proposed method, named TD-CEDN,
solves two important issues in this low-level vision problem: (1) learning
multi-scale and multi-level features; and (2) applying an effective top-down
refined approach in the networks. TD-CEDN performs the pixel-wise prediction by
means of leveraging features at all layers of the net. Unlike skip connections
and previous encoder-decoder methods, we first learn a coarse feature map after
the encoder stage in a feedforward pass, and then refine this feature map in a
top-down strategy during the decoder stage utilizing features at successively
lower layers. Therefore, the deconvolutional process is conducted stepwise,
which is guided by Deeply-Supervision Net providing the integrated direct
supervision. The above proposed technologies lead to a more precise and clearer
prediction. Our proposed algorithm achieved the state-of-the-art on the BSDS500
dataset (ODS F-score of 0.788), the PASCAL VOC2012 dataset (ODS F-score of
0.588), and and the NYU Depth dataset (ODS F-score of 0.735).

Using Satellite Imagery for Good: Detecting Communities in Desert and Mapping Vaccination Activities

Anza Shakeel, Mohsen Ali
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep convolutional neural networks (CNNs) have outperformed existing object
recognition and detection algorithms. On the other hand satellite imagery
captures scenes that are diverse. This paper describes a deep learning approach
that analyzes a geo referenced satellite image and efficiently detects built
structures in it. A Fully Convolution Network (FCN) is trained on low
resolution Google earth satellite imagery in order to achieve end result. The
detected built communities are then correlated with the vaccination activity
that has furnished some useful statistics.

Adaptive Feature Representation for Visual Tracking

Yuqi Han, Chenwei Deng, Zengshuo Zhang, Jiatong Li, Baojun Zhao
Comments: 4 pages, ICIP 2017
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Robust feature representation plays significant role in visual tracking.
However, it remains a challenging issue, since many factors may affect the
experimental performance. The existing method which combine different features
by setting them equally with the fixed weight could hardly solve the issues,
due to the different statistical properties of different features across
various of scenarios and attributes. In this paper, by exploiting the internal
relationship among these features, we develop a robust method to construct a
more stable feature representation. More specifically, we utilize a co-training
paradigm to formulate the intrinsic complementary information of multi-feature
template into the efficient correlation filter framework. We test our approach
on challenging se- quences with illumination variation, scale variation,
deformation etc. Experimental results demonstrate that the proposed method
outperforms state-of-the-art methods favorably.

View-Invariant Template Matching Using Homography Constraints

Sina Lotfian, Hassan Foroosh
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Change in viewpoint is one of the major factors for variation in object
appearance across different images. Thus, view-invariant object recognition is
a challenging and important image understanding task. In this paper, we propose
a method that can match objects in images taken under different viewpoints.
Unlike most methods in the literature, no restriction on camera orientations or
internal camera parameters are imposed and no prior knowledge of 3D structure
of the object is required. We prove that when two cameras take pictures of the
same object from two different viewing angels, the relationship between every
quadruple of points reduces to the special case of homography with two equal
eigenvalues. Based on this property, we formulate the problem as an error
function that indicates how likely two sets of 2D points are projections of the
same set of 3D points under two different cameras. Comprehensive set of
experiments were conducted to prove the robustness of the method to noise, and
evaluate its performance on real-world applications, such as face and object
recognition.

Convolutional Sparse Representations with Gradient Penalties

Brendt Wohlberg
Subjects: Computer Vision and Pattern Recognition (cs.CV)

While convolutional sparse representations enjoy a number of useful
properties, they have received limited attention for image reconstruction
problems. The present paper compares the performance of block-based and
convolutional sparse representations in the removal of Gaussian white noise.
While the usual formulation of the convolutional sparse coding problem is
slightly inferior to the block-based representations in this problem, the
performance of the convolutional form can be boosted beyond that of the
block-based form by the inclusion of suitable penalties on the gradients of the
coefficient maps.

Negative Results in Computer Vision: A Perspective

Ali Borji
Subjects: Computer Vision and Pattern Recognition (cs.CV)

A negative result is when the outcome of an experiment or a model is not what
is expected or when a hypothesis does not hold. Despite being often overlooked
in the scientific community, negative results are results and they carry value.
While this topic has been extensively discussed in other fields such as social
sciences and biosciences, less attention has been paid to it in the computer
vision community. The unique characteristics of computer vision, in particular
its experimental aspect, calls for a special treatment of this matter. In this
paper, I will address questions such as what makes negative results important,
how they should be disseminated, and how they should be incentivized. Further,
I will discuss issues such as computer and human vision interaction,
experimental design and statistical hypothesis testing, performance evaluation
and model comparison, as well as computer vision research culture.

Cross-Dataset Recognition: A Survey

Jing Zhang, Wanqing Li, Philip Ogunbona
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper summarise and analyse the cross-dataset recognition techniques
with the emphasize on what kinds of methods can be used when the available
source and target data are presented in different forms for boosting the target
task. This paper for the first time summarises several transferring criteria in
details from the concept level, which are the key bases to guide what kind of
knowledge to transfer between datasets. In addition, a taxonomy of
cross-dataset scenarios and problems is proposed according the properties of
data that define how different datasets are diverged, thereby review the recent
advances on each specific problem under different scenarios. Moreover, some
real world applications and corresponding commonly used benchmarks of
cross-dataset recognition are reviewed. Lastly, several future directions are
identified.

Object-Level Context Modeling For Scene Classification with Context-CNN

Syed Ashar Javed, Anil Kumar Nelakanti
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Convolutional Neural Networks (CNNs) have been used extensively for computer
vision tasks and produce rich feature representation for objects or parts of an
image. But reasoning about scenes requires integration between the low-level
feature representations and the high-level semantic information. We propose a
deep network architecture which models the semantic context of scenes by
capturing object-level information. We use Long Short Term Memory(LSTM) units
in conjunction with object proposals to incorporate object-object relationship
and object-scene relationship in an end-to-end trainable manner. We evaluate
our model on the LSUN dataset and achieve results comparable to the
state-of-art. We further show visualization of the learned features and analyze
the model with experiments to verify our model’s ability to model context.

Reconfiguring the Imaging Pipeline for Computer Vision

Mark Buckler, Suren Jayasuriya, Adrian Sampson
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Advancements in deep learning have ignited an explosion of research on
efficient hardware for embedded computer vision. Hardware vision acceleration,
however, does not address the cost of capturing and processing the image data
that feeds these algorithms. We examine the role of the image signal processing
(ISP) pipeline in computer vision to identify opportunities to reduce
computation and save energy. The key insight is that imaging pipelines should
be designed to be configurable: to switch between a traditional photography
mode and a low-power vision mode that produces lower-quality image data
suitable only for computer vision. We use eight computer vision algorithms and
a reversible pipeline simulation tool to study the imaging system’s impact on
vision performance. For both CNN-based and classical vision algorithms, we
observe that only two ISP stages, demosaicing and gamma compression, are
critical for task performance. We propose a new image sensor design that can
compensate for skipping these stages. The sensor design features an adjustable
resolution and tunable analog-to-digital converters (ADCs). Our proposed
imaging system’s vision mode disables the ISP entirely and configures the
sensor to produce subsampled, lower-precision image data. This vision mode can
save ~75% of the average energy of a baseline photography mode while having
only a small impact on vision task accuracy.

An Optimal Dimensionality Multi-shell Sampling Scheme with Accurate and Efficient Transforms for Diffusion MRI

Alice P. Bates, Zubair Khalid, Jason D. McEwen, Rodney A. Kennedy
Comments: 4 pages, 4 figures presented at ISBI 2017
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper proposes a multi-shell sampling scheme and corresponding
transforms for the accurate reconstruction of the diffusion signal in diffusion
MRI by expansion in the spherical polar Fourier (SPF) basis. The sampling
scheme uses an optimal number of samples, equal to the degrees of freedom of
the band-limited diffusion signal in the SPF domain, and allows for
computationally efficient reconstruction. We use synthetic data sets to
demonstrate that the proposed scheme allows for greater reconstruction accuracy
of the diffusion signal than the multi-shell sampling schemes obtained using
the generalised electrostatic energy minimisation (gEEM) method used in the
Human Connectome Project. We also demonstrate that the proposed sampling scheme
allows for increased angular discrimination and improved rotational invariance
of reconstruction accuracy than the gEEM schemes.

Imagination improves Multimodal Translation

Desmond Elliott, Ákos Kádár
Comments: Under review
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Multimodal machine translation is the task of translating sentences in a
visual context. We decompose this problem into two sub-tasks: learning to
translate and learning visually grounded representations. In a multitask
learning framework, translations are learned in an attention-based
encoder-decoder, and grounded representations are learned through image
representation prediction. Our approach improves translation performance
compared to the state of the art on the Multi30K dataset. Furthermore, it is
equally effective if we train the image prediction task on the external MS COCO
dataset, and we find improvements if we train the translation model on the
external News Commentary parallel text.

Artificial Intelligence

A Formal Characterization of the Local Search Topology of the Gap Heuristic

Richard Anthony Valenzano, Danniel Sihui Yang
Comments: Technical report providing proofs of statements appearing in a “An Analysis and Enhancement of the Gap Heuristic for the Pancake Puzzle” by Richard Anthony Valenzano and Danniel Yang. This paper appeared at the 2017 Symposium on Combinatorial Search
Subjects: Artificial Intelligence (cs.AI)

The pancake puzzle is a classic optimization problem that has become a
standard benchmark for heuristic search algorithms. In this paper, we provide
full proofs regarding the local search topology of the gap heuristic for the
pancake puzzle. First, we show that in any non-goal state in which there is no
move that will decrease the number of gaps, there is a move that will keep the
number of gaps constant. We then classify any state in which the number of gaps
cannot be decreased in a single action into two groups: those requiring 2
actions to decrease the number of gaps, and those which require 3 actions to
decrease the number of gaps.

Clingcon: The Next Generation

Mutsunori Banbara, Benjamin Kaufmann, Max Ostrowski, Torsten Schaub
Comments: Under consideration in Theory and Practice of Logic Programming (TPLP)
Subjects: Artificial Intelligence (cs.AI)

We present the third generation of the constraint answer set system clingcon,
combining Answer Set Programming (ASP) with finite domain constraint processing
(CP). While its predecessors rely on a black-box approach to hybrid solving by
integrating the CP solver gecode, the new clingcon system pursues a lazy
approach using dedicated constraint propagators to extend propagation in the
underlying ASP solver clasp. No extension is needed for parsing and grounding
clingcon’s hybrid modeling language since both can be accommodated by the new
generic theory handling capabilities of the ASP grounder gringo. As a whole,
clingcon 3 is thus an extension of the ASP system clingo 5, which itself relies
on the grounder gringo and the solver clasp. The new approach of clingcon
offers a seamless integration of CP propagation into ASP solving that benefits
from the whole spectrum of clasp’s reasoning modes, including for instance
multi-shot solving and advanced optimization techniques. This is accomplished
by a lazy approach that unfolds the representation of constraints and adds it
to that of the logic program only when needed. Although the unfolding is
usually dictated by the constraint propagators during solving, it can already
be partially (or even totally) done during preprocessing. Moreover, clingcon’s
constraint preprocessing and propagation incorporate several well established
CP techniques that greatly improve its performance. We demonstrate this via an
extensive empirical evaluation contrasting, first, the various techniques in
the context of CSP solving and, second, the new clingcon system with other
hybrid ASP systems. Under consideration in Theory and Practice of Logic
Programming (TPLP)

A Survey of Question Answering for Math and Science Problem

Arindam Bhattacharya
Subjects: Artificial Intelligence (cs.AI)

Turing test was long considered the measure for artificial intelligence. But
with the advances in AI, it has proved to be insufficient measure. We can now
aim to mea- sure machine intelligence like we measure human intelligence. One
of the widely accepted measure of intelligence is standardized math and science
test. In this paper, we explore the progress we have made towards the goal of
making a machine smart enough to pass the standardized test. We see the
challenges and opportunities posed by the domain, and note that we are quite
some ways from actually making a system as smart as a even a middle school
scholar.

A rational analysis of curiosity

Rachit Dubey, Thomas L. Griffiths
Comments: Conference paper in CogSci 2017
Subjects: Artificial Intelligence (cs.AI)

We present a rational analysis of curiosity, proposing that people’s
curiosity is driven by seeking stimuli that maximize their ability to make
appropriate responses in the future. This perspective offers a way to unify
previous theories of curiosity into a single framework. Experimental results
confirm our model’s predictions, showing how the relationship between curiosity
and confidence can change significantly depending on the nature of the
environment.

R2-D2: ColoR-inspired Convolutional NeuRal Network (CNN)-based AndroiD Malware Detections

TonTon Hsien-De Huang, Chia-Mu Yu, Hung-Yu Kao
Comments: 2017/05/12 Draft Version
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Machine Learning (ML) has found it particularly useful in malware detection.
However, as the malware evolves very fast, the stability of the feature
extracted from malware serves as a critical issue in malware detection. The
recent success of deep learning in image recognition, natural language
processing, and machine translation indicates a potential solution for
stabilizing the malware detection effectiveness. In this research, we haven’t
extract selected any features (e.g., the control-flow of op-code, classes,
methods of functions and the timing they are invoked etc.) from Android apps.
We develop our own method for translating Android apps into rgb color code and
transform them to a fixed-sized encoded image. After that, the encoded image is
fed to convolutional neural network (CNN) for automatic feature extraction and
learning, reducing the expert’s intervention. Deep learning usually involves a
large number of parameters that cannot be learned from only a small dataset. In
this way, we currently have collected 1500k Android apps samples, have run our
system over these 800k malware samples (benign and malicious samples are
roughly equal-sized), and also through our back-end (60 million monthly active
users and 10k new malware samples per day), we can effectively detect the
malware. We believe that our methodology and the corresponding use of deep
learning malware classification can overcome the weakness, and computational
cost of the common static/dynamic analysis process or machine learning-based of
Android malware detection approach.

CLTune: A Generic Auto-Tuner for OpenCL Kernels

Cedric Nugteren, Valeriu Codreanu
Comments: 8 pages, published in MCSoC ’15, IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2015
Subjects: Performance (cs.PF); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluates and
tunes kernel performance of a generic, user-defined search space of possible
parameter-value combinations. Example parameters include the OpenCL workgroup
size, vector data-types, tile sizes, and loop unrolling factors. CLTune can be
used in the following scenarios: 1) when there are too many tunable parameters
to explore manually, 2) when performance portability across OpenCL devices is
desired, or 3) when the optimal parameters change based on input argument
values (e.g. matrix dimensions). The auto-tuner is generic, easy to use,
open-source, and supports multiple search strategies including simulated
annealing and particle swarm optimisation. CLTune is evaluated on two GPU
case-studies inspired by the recent successes in deep learning: 2D convolution
and matrix-multiplication (GEMM). For 2D convolution, we demonstrate the need
for auto-tuning by optimizing for different filter sizes, achieving performance
on-par or better than the state-of-the-art. For matrix-multiplication, we use
CLTune to explore a parameter space of more than two-hundred thousand
configurations, we show the need for device-specific tuning, and outperform the
clBLAS library on NVIDIA, AMD and Intel GPUs.

Computation and Language

Arc-swift: A Novel Transition System for Dependency Parsing

Peng Qi, Christopher D. Manning
Comments: Accepted at ACL 2017
Subjects: Computation and Language (cs.CL)

Transition-based dependency parsers often need sequences of local shift and
reduce operations to produce certain attachments. Correct individual decisions
hence require global information about the sentence context and mistakes cause
error propagation. This paper proposes a novel transition system, arc-swift,
that enables direct attachments between tokens farther apart with a single
transition. This allows the parser to leverage lexical information more
directly in transition decisions. Hence, arc-swift can achieve significantly
better performance with a very small beam size. Our parsers reduce error by
3.7–7.6% relative to those using existing transition systems on the Penn
Treebank dependency parsing task and English Universal Dependencies.

Evaluating vector-space models of analogy

Dawn Chen, Joshua C. Peterson, Thomas L. Griffiths
Comments: 6 pages, 4 figures, To appear in the Proceedings of the 39th Annual Conference of the Cognitive Science Society
Subjects: Computation and Language (cs.CL)

Vector-space representations provide geometric tools for reasoning about the
similarity of a set of objects and their relationships. Recent machine learning
methods for deriving vector-space embeddings of words (e.g., word2vec) have
achieved considerable success in natural language processing. These vector
spaces have also been shown to exhibit a surprising capacity to capture verbal
analogies, with similar results for natural images, giving new life to a
classic model of analogies as parallelograms that was first proposed by
cognitive scientists. We evaluate the parallelogram model of analogy as applied
to modern word embeddings, providing a detailed analysis of the extent to which
this approach captures human relational similarity judgments in a large
benchmark dataset. We find that that some semantic relationships are better
captured than others. We then provide evidence for deeper limitations of the
parallelogram model based on the intrinsic geometric constraints of vector
spaces, paralleling classic results for first-order similarity.

Reducing Bias in Production Speech Models

Eric Battenberg, Rewon Child, Adam Coates, Christopher Fougner, Yashesh Gaur, Jiaji Huang, Heewoo Jun, Ajay Kannan, Markus Kliegl, Atul Kumar, Hairong Liu, Vinay Rao, Sanjeev Satheesh, David Seetapun, Anuroop Sriram, Zhenyao Zhu
Subjects: Computation and Language (cs.CL)

Replacing hand-engineered pipelines with end-to-end deep learning systems has
enabled strong results in applications like speech and object recognition.
However, the causality and latency constraints of production systems put
end-to-end speech models back into the underfitting regime and expose biases in
the model that we show cannot be overcome by “scaling up”, i.e., training
bigger models on more data. In this work we systematically identify and address
sources of bias, reducing error rates by up to 20% while remaining practical
for deployment. We achieve this by utilizing improved neural architectures for
streaming inference, solving optimization issues, and employing strategies that
increase audio and label modelling versatility.

Imagination improves Multimodal Translation

Desmond Elliott, Ákos Kádár
Comments: Under review
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Multimodal machine translation is the task of translating sentences in a
visual context. We decompose this problem into two sub-tasks: learning to
translate and learning visually grounded representations. In a multitask
learning framework, translations are learned in an attention-based
encoder-decoder, and grounded representations are learned through image
representation prediction. Our approach improves translation performance
compared to the state of the art on the Multi30K dataset. Furthermore, it is
equally effective if we train the image prediction task on the external MS COCO
dataset, and we find improvements if we train the translation model on the
external News Commentary parallel text.

Distributed, Parallel, and Cluster Computing

Distributed-Memory Breadth-First Search on Massive Graphs

Aydin Buluc, Scott Beamer, Kamesh Madduri, Krste Asanovic, David Patterson
Comments: arXiv admin note: text overlap with arXiv:1104.4518
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

This chapter studies the problem of traversing large graphs using the
breadth-first search order on distributed-memory supercomputers. We consider
both the traditional level-synchronous top-down algorithm as well as the
recently discovered direction optimizing algorithm. We analyze the performance
and scalability trade-offs in using different local data structures such as CSR
and DCSC, enabling in-node multithreading, and graph decompositions such as 1D
and 2D decomposition.

GRID Storage Optimization in Transparent and User-Friendly Way for LHCb Datasets

Mikhail Hushchyn, Andrey Ustyuzhanin, Philippe Charpentier, Christophe Haen
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

The LHCb collaboration is one of the four major experiments at the Large
Hadron Collider at CERN. Many petabytes of data are produced by the detectors
and Monte-Carlo simulations. The LHCb Grid interware LHCbDIRAC is used to make
data available to all collaboration members around the world. The data is
replicated to the Grid sites in different locations. However the Grid disk
storage is limited and does not allow keeping replicas of each file at all
sites. Thus it is essential to optimize number of replicas to achieve a better
Grid performance.

In this study, we present a new approach of data replication and distribution
strategy based on data popularity prediction. The popularity is performed based
on the data access history and metadata, and uses machine learning techniques
and time series analysis methods.

Distributed Protocols at the Rescue for Trustworthy Online Voting

Robert Riemann (DICE), Stéphane Grumbach (DICE)
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)

While online services emerge in all areas of life, the voting procedure in
many democracies remains paper-based as the security of current online voting
technology is highly disputed. We address the issue of trustworthy online
voting protocols and recall therefore their security concepts with its trust
assumptions. Inspired by the Bitcoin protocol, the prospects of distributed
online voting protocols are analysed. No trusted authority is assumed to ensure
ballot secrecy. Further, the integrity of the voting is enforced by all voters
themselves and without a weakest link, the protocol becomes more robust. We
introduce a taxonomy of notions of distribution in online voting protocols that
we apply on selected online voting protocols. Accordingly, blockchain-based
protocols seem to be promising for online voting due to their similarity with
paper-based protocols.

CLTune: A Generic Auto-Tuner for OpenCL Kernels

This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluates and
tunes kernel performance of a generic, user-defined search space of possible
parameter-value combinations. Example parameters include the OpenCL workgroup
size, vector data-types, tile sizes, and loop unrolling factors. CLTune can be
used in the following scenarios: 1) when there are too many tunable parameters
to explore manually, 2) when performance portability across OpenCL devices is
desired, or 3) when the optimal parameters change based on input argument
values (e.g. matrix dimensions). The auto-tuner is generic, easy to use,
open-source, and supports multiple search strategies including simulated
annealing and particle swarm optimisation. CLTune is evaluated on two GPU
case-studies inspired by the recent successes in deep learning: 2D convolution
and matrix-multiplication (GEMM). For 2D convolution, we demonstrate the need
for auto-tuning by optimizing for different filter sizes, achieving performance
on-par or better than the state-of-the-art. For matrix-multiplication, we use
CLTune to explore a parameter space of more than two-hundred thousand
configurations, we show the need for device-specific tuning, and outperform the
clBLAS library on NVIDIA, AMD and Intel GPUs.

Learning

Forecasting using incomplete models

Vadim Kosoy
Subjects: Learning (cs.LG)

We consider the task of forecasting an infinite sequence of future
observation based on some number of past observations, where the probability
measure generating the observations is “suspected” to satisfy one or more of a
set of incomplete models, i.e. convex sets in the space of probability
measures. This setting is in some sense intermediate between the realizable
setting where the probability measure comes from some known set of probability
measures (which can be addressed using e.g. Bayesian inference) and the
unrealizable setting where the probability measure is completely arbitrary. We
demonstrate a method of forecasting which guarantees that, whenever the true
probability measure satisfies an incomplete model in a given countable set, the
forecast converges to the same incomplete model in the (appropriately
normalized) Kantorovich-Rubinstein metric. This is analogous to merging of
opinions for Bayesian inference, except that convergence in the
Kantorovich-Rubinstein metric is weaker than convergence in total variation.

Molecular Generation with Recurrent Neural Networks

Esben Jannik Bjerrum
Subjects: Learning (cs.LG); Biomolecules (q-bio.BM)

The potential number of drug like small molecules is estimated to be between
10^23 and 10^60 while current databases of known compounds are orders of
magnitude smaller with approximately 10^8 compounds. This discrepancy has led
to an interest in generating virtual libraries using hand crafted chemical
rules and fragment based methods to cover a larger area of chemical space and
generate chemical libraries for use in in silico drug discovery endeavors. Here
it is explored to what extent a recurrent neural network with long short term
memory cells can figure out sensible chemical rules and generate synthesizable
molecules by being trained on existing compounds encoded as SMILES. The
networks can to a high extent generate novel, but chemically sensible
molecules. The properties of the molecules are tuned by training on two
different datasets consisting of fragment like molecules and drug like
molecules. The produced molecules and the training databases have very similar
distributions of molar weight, predicted logP, number of hydrogen bond
acceptors and donors, number of rotatable bonds and topological polar surface
area when compared to their respective training sets. The compounds are for the
most cases synthesizable as assessed with SA score and Wiley ChemPlanner.

Learning ReLUs via Gradient Descent

Mahdi Soltanolkotabi
Comments: arXiv admin note: text overlap with arXiv:1702.06175
Subjects: Learning (cs.LG); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)

In this paper we study the problem of learning Rectified Linear Units (ReLUs)
which are functions of the form (max(0,<w,x>)) with (w) denoting the weight
vector. We study this problem in the high-dimensional regime where the number
of observations are fewer than the dimension of the weight vector. We assume
that the weight vector belongs to some closed set (convex or nonconvex) which
captures known side-information about its structure. We focus on the realizable
model where the inputs are chosen i.i.d.~from a Gaussian distribution and the
labels are generated according to a planted weight vector. We show that
projected gradient descent, when initialization at 0, converges at a linear
rate to the planted model with a number of samples that is optimal up to
numerical constants. Our results on the dynamics of convergence of these very
shallow neural nets may provide some insights towards understanding the
dynamics of deeper architectures.

Predicting Blood Pressure with Deep Bidirectional LSTM Network

Peng Su, Xiaorong Ding, Yuanting Zhang, Ye Li, Ni Zhao
Subjects: Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)

Blood pressure (BP) has been a difficult vascular risk factor to measure
continuously and precisely with a small cuffless electronic gadget. In the
meantime, it is the key biomarker for control of cardiovascular diseases (CVD),
the leading cause of death worldwide. In this work, we addressed the current
limitation of BP prediction models by formulating BP extraction as a temporal
sequence prediction problem in which both the input and target are sequential
data. By incorporating both a bidirectional layer structure and a deep
architecture in a standard long short term-memory (LSTM), we established a deep
bidirectional LSTM (DB-LSTM) network that can adaptively discover the latent
structures of different timescales in BP sequences and automatically learn such
multiscale dependencies. We evaluated our proposed model on a one-day and
four-day continuous BP dataset, and the results show that DB-LSTM network can
effectively learn different timescale dependencies in the BP sequences and
advances the state-of-the-art by achieving superior accuracy performance than
other leading methods on both datasets. To the best of our knowledge, this is
the first study to validate the ability of recurrent neural networks to learn
the different timescale dependencies of long-term continuous BP sequence.

Iteratively-Reweighted Least-Squares Fitting of Support Vector Machines: A Majorization–Minimization Algorithm Approach

Hien D. Nguyen, Geoffrey J. McLachlan
Subjects: Computation (stat.CO); Learning (cs.LG); Machine Learning (stat.ML)

Support vector machines (SVMs) are an important tool in modern data analysis.
Traditionally, support vector machines have been fitted via quadratic
programming, either using purpose-built or off-the-shelf algorithms. We present
an alternative approach to SVM fitting via the majorization–minimization (MM)
paradigm. Algorithms that are derived via MM algorithm constructions can be
shown to monotonically decrease their objectives at each iteration, as well as
be globally convergent to stationary points. We demonstrate the construction of
iteratively-reweighted least-squares (IRLS) algorithms, via the MM paradigm,
for SVM risk minimization problems involving the hinge, least-square,
squared-hinge, and logistic losses, and 1-norm, 2-norm, and elastic net
penalizations. Successful implementations of our algorithms are presented via
some numerical examples.

The Network Nullspace Property for Compressed Sensing of Big Data over Networks

Alexander Jung
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

We adapt the nullspace property of compressed sensing for sparse vectors to
semi-supervised learning of labels for network-structured datasets. In
particular, we derive a sufficient condition, which we term the network
nullspace property, for convex optimization methods to accurately learn labels
which form smooth graph signals. The network nullspace property involves both
the network topology and the sampling strategy and can be used to guide the
design of efficient sampling strategies, i.e., the selection of those data
points whose labels provide the most information for the learning task.

Information Theory

Rate-Memory Trade-off for the Two-User Broadcast Caching Network with Correlated Sources

Parisa Hassanzadeh, Antonia Tulino, Jaime Llorca, Elza Erkip
Comments: 5 pages, IEEE International Symposium on Information Theory (ISIT), 2017
Subjects: Information Theory (cs.IT)

This paper studies the fundamental limits of caching in a network with two
receivers and two files generated by a two-component discrete memoryless source
with arbitrary joint distribution. Each receiver is equipped with a cache of
equal capacity, and the requested files are delivered over a shared error-free
broadcast link. First, a lower bound on the optimal peak rate-memory trade-off
is provided. Then, in order to leverage the correlation among the library files
to alleviate the load over the shared link, a two-step correlation-aware
cache-aided coded multicast (CACM) scheme is proposed. The first step uses
Gray-Wyner source coding to represent the library via one common and two
private descriptions, such that a second correlation-unaware multiple-request
CACM step can exploit the additional coded multicast opportunities that arise.
It is shown that the rate achieved by the proposed two-step scheme matches the
lower bound for a significant memory regime and it is within half of the
conditional entropy for all other memory values.

Construction of Sidon spaces with applications to coding

Ron M. Roth, Netanel Raviv, Itzhak Tamo
Comments: Parts of this paper will be presented at the International Symposium on Information Theory (ISIT), Aachen, Germany, June 2017
Subjects: Information Theory (cs.IT)

A subspace of a finite extension field is called a Sidon space if the product
of any two of its elements is unique up to a scalar multiplier from the base
field. Sidon spaces were recently introduced by Bachoc et al. as a means to
characterize multiplicative properties of subspaces, and yet no explicit
constructions were given. In this paper, several constructions of Sidon spaces
are provided. In particular, in some of the constructions the relation between
(k), the dimension of the Sidon space, and (n), the dimension of the ambient
extension field, is optimal.

These constructions are shown to provide cyclic subspace codes, which are
useful tools in network coding schemes. To the best of the authors’ knowledge,
this constitutes the first set of constructions of non-trivial cyclic subspace
codes in which the relation between (k) and (n) is polynomial, and in
particular, linear. As a result, a conjecture by Trautmann et al. regarding the
existence of non-trivial cyclic subspace codes is resolved for most parameters,
and multi-orbit cyclic subspace codes are attained, whose cardinality is within
a constant factor (close to (1/2)) from the sphere-packing bound for subspace
codes.

On Multiuser Gain and the Constant-Gap Sum Capacity of the Gaussian Interfering Multiple Access Channel

Rick Fritschek, Gerhard Wunder
Comments: 43 pages, submitted to IEEE Transactions on Information Theory
Subjects: Information Theory (cs.IT)

Recent investigations have shown sum capacity results within a constant
bit-gap for several channel models, e.g. the two-user Gaussian interference
channel (G-IC), k-user G-IC or the Gaussian X-channel. This has motivated
investigations of interference-limited multi-user channels, for example, the
Gaussian interfering multiple access channel (G-IMAC). Networks with
interference usually require the use of interference alignment (IA) as a
technique to achieve the upper bounds of a network. A promising approach in
view of constant-gap capacity results is a special form of IA called
signal-scale alignment, which works for time-invariant, frequency-flat,
single-antenna networks. However, until now, results were limited to the
many-to-one interference channel and the Gaussian X-channel. To make progress
on this front, we investigate signal-scale IA schemes for the G-IMAC and aim to
show a constant-gap capacity result for the G-IMAC. We derive a constant-gap
sum capacity approximation for the lower triangular deterministic (LTD)-IMAC
and see that the LTD model can overcome difficulties of the linear
deterministic model. We show that the schemes can be translated to the Gaussian
IMAC and that they achieve capacity within a constant gap. We show that
multi-user gain is possible in the whole regime and provide a new look at
cellular interference channels.

Multi-Channel Random Access with Replications

Olga Galinina, Andrey Turlikov, Sergey Andreev, Yevgeni Koucheryavy
Comments: 5 pages, 2 figures, accepted by ISIT 2017
Subjects: Information Theory (cs.IT)

This paper considers a class of multi-channel random access algorithms, where
contending devices may send multiple copies (replicas) of their messages to the
central base station. We first develop a hypothetical algorithm that delivers a
lower estimate for the access delay performance within this class. Further, we
propose a feasible access control algorithm achieving low access delay by
sending multiple message replicas, which approaches the performance of the
hypothetical algorithm. The resulting performance is readily approximated by a
simple lower bound, which is derived for a large number of channels.

Radio Resource Allocation for Multicarrier-Low Density Spreading Multiple Access

Mohammed Al-Imari, Muhammad Ali Imran, Pei Xiao
Journal-ref: IEEE Transactions on Vehicular Technology, vol. 66, no. 3, pp.
2382-2393, March 2017
Subjects: Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)

Multicarrier-low density spreading multiple access (MC-LDSMA) is a promising
multiple access technique that enables near optimum multiuser detection. In
MC-LDSMA, each user’s symbol spread on a small set of subcarriers, and each
subcarrier is shared by multiple users. The unique structure of MC-LDSMA makes
the radio resource allocation more challenging comparing to some well-known
multiple access techniques. In this paper, we study the radio resource
allocation for single-cell MC-LDSMA system. Firstly, we consider the
single-user case, and derive the optimal power allocation and subcarriers
partitioning schemes. Then, by capitalizing on the optimal power allocation of
the Gaussian multiple access channel, we provide an optimal solution for
MC-LDSMA that maximizes the users’ weighted sum-rate under relaxed constraints.
Due to the prohibitive complexity of the optimal solution, suboptimal
algorithms are proposed based on the guidelines inferred by the optimal
solution. The performance of the proposed algorithms and the effect of
subcarrier loading and spreading are evaluated through Monte Carlo simulations.
Numerical results show that the proposed algorithms significantly outperform
conventional static resource allocation, and MC-LDSMA can improve the system
performance in terms of spectral efficiency and fairness in comparison with
OFDMA.

Beamforming Codebook Compensation for Beam Squint with Channel Capacity Constraint

Mingming Cai, J. Nicholas Laneman, Bertrand Hochwald
Comments: 5 pages, to be published in Proc. IEEE ISIT 2017, Aachen, Germany
Subjects: Information Theory (cs.IT)

Analog beamforming with phased arrays is a promising technique for 5G
wireless communication in millimeter wave bands. A beam focuses on a small
range of angles of arrival or departure and corresponds to a set of fixed phase
shifts across frequency due to practical hardware constraints. In switched
beamforming, a discrete codebook consisting of multiple beams is used to cover
a larger angle range. However, for sufficiently large bandwidth, the gain
provided by the phased array is frequency dependent even if the radiation
pattern of the antenna elements is frequency independent, an effect called beam
squint. This paper shows that the beam squint reduces channel capacity of a
uniform linear array (ULA). The beamforming codebook is designed to compensate
for the beam squint by imposing a channel capacity constraint. For example, our
codebook design algorithm can improve the channel capacity by 17.8% for a ULA
with 64 antennas operating at bandwidth of 2.5 GHz and carrier frequency of 73
GHz. Analysis and numerical examples suggest that a denser codebook is required
to compensate for the beam squint compared to the case without beam squint.
Furthermore, the effect of beam squint is shown to increase as bandwidth
increases, and the beam squint limits the bandwidth given the number of
antennas in the array.

The Distributed MIMO Scenario: Can Ideal ADCs Be Replaced by Low-resolution ADCs?

Jide Yuan, Shi Jin, Chao-Kai Wen, Kai-Kit Wong
Subjects: Information Theory (cs.IT)

This letter considers the architecture of distributed antenna system, which
is made up of a massive number of single-antenna remote radio heads (RRHs),
some with full-resolution but others with low-resolution analog-to-digital
converter (ADC) receivers. This architecture is greatly motivated by its high
energy efficiency and low-cost implementation. We derive the worst-case uplink
spectral efficiency (SE) of the system assuming a frequency-flat channel and
maximum-ratio combining (MRC), and reveal that the SE increases as the number
of quantization bits for the low-resolution ADCs increases, and the SE
converges as the number of RRHs with low-resolution ADCs grows. Our results
furthermore demonstrate that a great improvement can be obtained by adding a
majority of RRHs with low-resolution ADC receivers, if sufficient quantization
precision and an acceptable proportion of high-to-low resolution RRHs are used.

Performance Analysis of Decoupled Cell Association in Multi-Tier Hybrid Networks using Real Blockage Environments

Osama Waqar Bhatti, Haris Suhail, Uzair Akbar, Syed Ali Hassan, Haris Pervaiz, Leila Musavian, Qiang Ni
Comments: 6 pages, 7 figures. Submitted to International Wireless Communications and Mobile Computing (IWCMC) Conference
Subjects: Information Theory (cs.IT)

Millimeter wave (mmWave) links have the potential to offer high data rates
and capacity needed in fifth generation (5G) networks, however they have very
high penetration and path loss. A solution to this problem is to bring the base
station closer to the end-user through heterogeneous networks (HetNets).
HetNets could be designed to allow users to connect to different base stations
(BSs) in the uplink and downlink. This phenomenon is known as downlink-uplink
decoupling (DUDe). This paper explores the effect of DUDe in a three tier
HetNet deployed in two different real-world environments. Our simulation
results show that DUDe can provide improvements with regard to increasing the
system coverage and data rates while the extent of improvement depends on the
different environments that the system is deployed in.

Learning ReLUs via Gradient Descent

In this paper we study the problem of learning Rectified Linear Units (ReLUs)
which are functions of the form (max(0,<w,x>)) with (w) denoting the weight
vector. We study this problem in the high-dimensional regime where the number
of observations are fewer than the dimension of the weight vector. We assume
that the weight vector belongs to some closed set (convex or nonconvex) which
captures known side-information about its structure. We focus on the realizable
model where the inputs are chosen i.i.d.~from a Gaussian distribution and the
labels are generated according to a planted weight vector. We show that
projected gradient descent, when initialization at 0, converges at a linear
rate to the planted model with a number of samples that is optimal up to
numerical constants. Our results on the dynamics of convergence of these very
shallow neural nets may provide some insights towards understanding the
dynamics of deeper architectures.