Johannes Schemmel, Laura Kriener, Paul Müller, Karlheinz Meier
This paper presents an extension of the BrainScaleS accelerated analog
neuromorphic hardware model. The scalable neuromorphic architecture is extended
by the support for multi-compartment models and non-linear dendrites. These
features are part of a 65 nm prototype ASIC. It can
emulate different spike types observed in cortical pyramidal neurons: NMDA
plateau potentials, calcium and sodium spikes. By replicating some of the
structures of these cells, they can be configured to perform coincidence
detection within a single neuron. Built-in plasticity mechanisms can modify not
only the synaptic weights, but also the dendritic synaptic composition to
efficiently train large multi-compartment neurons. Transistor-level simulations
demonstrate the functionality of the analog implementation and illustrate
analogies to biological measurements.
Alexander Hagg, Maximilian Mensing, Alexander Asteroth
Neuroevolution methods evolve the weights of a neural network, and in some
cases the topology, but little work has been done to analyze the effect of
evolving the activation functions of individual nodes on network size, which is
important when training networks with a small number of samples. In this work
we extend the neuroevolution algorithm NEAT to evolve the activation function
of neurons in addition to the topology and weights of the network. The size and
performance of networks produced using NEAT with uniform activation in all
nodes, or homogeneous networks, are compared to networks which contain a mixture
of activation functions, or heterogeneous networks. For a number of regression
and classification benchmarks it is shown that (1) qualitatively different
activation functions lead to different results in homogeneous networks, (2) the
heterogeneous version of NEAT is able to select well performing activation
functions, (3) producing heterogeneous networks that are significantly smaller
than homogeneous networks.
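As a rough illustration of evolving per-node activation functions, the sketch below adds an activation-mutation operator to a genome. The activation set, the dictionary representation of a genome, and the names `ACTIVATIONS` and `mutate_activation` are assumptions made for this example; the abstract does not specify them.

```python
import math
import random

# Hypothetical activation set; the abstract does not list the functions used.
ACTIVATIONS = {
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh": math.tanh,
    "relu": lambda x: max(0.0, x),
    "identity": lambda x: x,
}


def mutate_activation(node_activations, rate, rng):
    """Return a copy of the genome in which each node's activation name
    is resampled from ACTIVATIONS with probability `rate`.

    node_activations maps a node id to an activation name (assumed layout).
    """
    mutated = dict(node_activations)
    for node_id in mutated:
        if rng.random() < rate:
            mutated[node_id] = rng.choice(sorted(ACTIVATIONS))
    return mutated


rng = random.Random(0)
genome = {0: "sigmoid", 1: "tanh", 2: "relu"}
child = mutate_activation(genome, rate=0.5, rng=rng)
```

With `rate=0.0` every node keeps its activation, which corresponds to the homogeneous setting if all nodes start with the same function.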
William La Cava, Jason H. Moore
Recently we proposed a general, ensemble-based feature engineering wrapper
(FEW) that was paired with a number of machine learning methods to solve
regression problems. Here, we adapt FEW for supervised classification and
perform a thorough analysis of fitness and survival methods within this
framework. Our tests demonstrate that two fitness metrics, one introduced as an
adaptation of the silhouette score, outperform the more commonly used Fisher
criterion. We analyze survival methods and demonstrate that ε-lexicase
survival works best across our test problems, followed by random survival which
outperforms both tournament and deterministic crowding. We conduct
hyper-parameter optimization for several classification methods using a large
set of problems to benchmark the ability of FEW to improve data
representations. The results show that FEW can improve the best classifier
performance on several problems. We show that FEW generates readable and
meaningful features for a biomedical problem with different ML pairings.
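The idea behind ε-lexicase can be sketched as a filter over randomly ordered cases. The version below is one common variant, in which ε for each case is the median absolute deviation of the population's errors on that case. That choice, the error-matrix layout, and the function name `epsilon_lexicase_select` are assumptions for illustration, not details taken from the abstract.

```python
import random
import statistics


def epsilon_lexicase_select(errors, rng):
    """Pick the index of one individual.

    errors[i][j] is the error of individual i on case j (assumed layout).
    """
    population = range(len(errors))
    candidates = list(population)
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)  # each selection event uses a fresh case order
    for j in cases:
        column = [errors[i][j] for i in population]
        med = statistics.median(column)
        # epsilon: median absolute deviation of all errors on this case
        eps = statistics.median([abs(e - med) for e in column])
        best = min(errors[i][j] for i in candidates)
        # keep candidates within epsilon of the best remaining error
        candidates = [i for i in candidates if errors[i][j] <= best + eps]
        if len(candidates) == 1:
            break
    return rng.choice(candidates)


errors = [[0.0, 0.0], [10.0, 10.0], [10.0, 10.0]]
winner = epsilon_lexicase_select(errors, random.Random(1))
```

Calling the function repeatedly until enough individuals have been chosen would give a survival step in this style.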
Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, Wojciech Zaremba
Imitation learning has been commonly applied to solve different tasks in
isolation. This usually requires either careful feature engineering, or a
significant number of samples. This is far from what we desire: ideally, robots
should be able to learn from very few demonstrations of any given task, and
instantly generalize to new situations of the same task, without requiring
task-specific engineering. In this paper, we propose a meta-learning framework
for achieving such capability, which we call one-shot imitation learning.
Specifically, we consider the setting where there is a very large set of
tasks, and each task has many instantiations. For example, a task could be to
stack all blocks on a table into a single tower, another task could be to place
all blocks on a table into two-block towers, etc. In each case, different
instances of the task would consist of different sets of blocks with different
initial states. At training time, our algorithm is presented with pairs of
demonstrations for a subset of all tasks. A neural net is trained that takes as
input one demonstration and the current state (which initially is the initial
state of the other demonstration of the pair), and outputs an action with the
goal that the resulting sequence of states and actions matches as closely as
possible with the second demonstration. At test time, a demonstration of a
single instance of a new task is presented, and the neural net is expected to
perform well on new instances of this new task. The use of soft attention
allows the model to generalize to conditions and tasks unseen in the training
data. We anticipate that by training this model on a much greater variety of
tasks and settings, we will obtain a general system that can turn any
demonstrations into robust policies that can accomplish an overwhelming variety
of tasks.
Videos available at https URL
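Soft attention, mentioned above, can be illustrated as a softmax-weighted average over a set of items. The sketch assumes dot-product scoring between a query and keys; the scoring function and the name `soft_attention` are assumptions for this example and are not taken from the abstract.

```python
import math


def soft_attention(query, keys, values):
    """Return (weights, context) for dot-product soft attention.

    weights is a softmax over query-key dot products; context is the
    weighted average of the value vectors.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


weights, context = soft_attention([1.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], [[1.0], [3.0]])
```

Because the weights are normalized for any number of items, the same module can be applied to demonstrations or scenes of varying size.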
Chris Donahue, Zachary C. Lipton, Julian McAuley
Dance Dance Revolution (DDR) is a popular rhythm-based video game. Players
perform steps on a dance platform in synchronization with music as directed by
on-screen step charts. While many step charts are available in standardized
packs, users may grow tired of existing charts, or wish to dance to a song for
which no chart exists. We introduce the task of learning to choreograph. Given
a raw audio track, the goal is to produce a new step chart. This task
decomposes naturally into two subtasks: deciding when to place steps and
deciding which steps to select. For the step placement task, we combine
recurrent and convolutional neural networks to ingest spectrograms of low-level
audio features to predict steps, conditioned on chart difficulty. For step
selection, we present a conditional LSTM generative model that substantially
outperforms n-gram and fixed-window approaches.
Shichao Yang, Yu Song, Michael Kaess, Sebastian Scherer
Existing simultaneous localization and mapping (SLAM) algorithms are not
robust in challenging low-texture environments because there are only a few
salient features. The resulting sparse or semi-dense map also conveys little
information for motion planning. Though some works utilize plane or scene layout
for dense map regularization, they require decent state estimation from other
sources. In this paper, we propose real-time monocular plane SLAM to
demonstrate that scene understanding could improve both state estimation and
dense mapping especially in low-texture environments. The plane measurements
come from a pop-up 3D plane model applied to each single image. We also combine
planes with point based SLAM to improve robustness. On a public TUM dataset,
our algorithm generates a dense semantic 3D model with pixel depth error of 6.2
cm while existing SLAM algorithms fail. On a 60 m long dataset with loops, our
method creates a much better 3D model with state estimation error of 0.67%.
Adrian Bulat, Georgios Tzimiropoulos
This paper investigates how far a very deep neural network is from attaining
close to saturating performance on existing 2D and 3D face alignment datasets.
To this end, we make the following five contributions: (a) we construct, for
the first time, a very strong baseline by combining a state-of-the-art
architecture for landmark localization with a state-of-the-art residual block,
train it on a very large yet synthetically expanded 2D facial landmark dataset
and finally evaluate it on all other 2D facial landmark datasets. (b) We create
a network, guided by 2D landmarks, which converts 2D landmark annotations to 3D
and unifies all existing datasets, leading to the creation of LS3D-W, the
largest and most challenging 3D facial landmark dataset to date (~230,000
images). (c) Following that, we train a neural network for 3D face alignment
and evaluate it on the newly introduced LS3D-W. (d) We further look into the
effect of all “traditional” factors affecting face alignment performance like
large pose, initialization and resolution, and introduce a “new” one, namely
the size of the network. (e) We show that both 2D and 3D face alignment
networks achieve performance of remarkable accuracy which is probably close to
saturating the datasets used. Demo code and pre-trained models can be
downloaded from this http URL
Syed Zain Masood, Guang Shu, Afshin Dehghan, Enrique G. Ortiz
This work details Sighthound's fully automated license plate detection and
recognition system. The core technology of the system is built using a sequence
of deep Convolutional Neural Networks (CNNs) interlaced with accurate and
efficient algorithms. The CNNs are trained and fine-tuned so that they are
robust under different conditions (e.g. variations in pose, lighting,
occlusion, etc.) and can work across a variety of license plate templates (e.g.
sizes, backgrounds, fonts, etc.). For quantitative analysis, we show that our
system outperforms the leading license plate detection and recognition
technology, i.e. ALPR, on several benchmarks. Our system is available to
developers through the Sighthound Cloud API at
https URL
Daniel Peralta, Isaac Triguero, Salvador García, Yvan Saeys, Jose M. Benitez, Francisco Herrera
The growth of fingerprint databases creates a need for strategies to reduce
the identification time. Fingerprint classification reduces the search
penetration rate by grouping the fingerprints into several classes. Typically,
features describing the visual patterns of a fingerprint are extracted and fed
to a classifier. The extraction can be time-consuming and error-prone,
especially for fingerprints whose visual classification is dubious, and often
includes a criterion to reject ambiguous fingerprints. In this paper, we
propose to improve on this manually designed process by using deep neural
networks, which extract implicit features directly from the images and perform
the classification within a single learning process. An extensive experimental
study shows that convolutional neural networks outperform all other tested
approaches by achieving a very high accuracy with no rejection. Moreover,
multiple copies of the same fingerprint are consistently classified. The
runtime of convolutional networks is also lower than that of combining feature
extraction procedures with classification algorithms.
Hao Wang, Xiaodan Liang, Hao Zhang, Dit-Yan Yeung, Eric P. Xing
Many problems in image processing and computer vision (e.g. colorization,
style transfer) can be posed as ‘manipulating’ an input image into a
corresponding output image given a user-specified guiding signal. A holy-grail
solution towards generic image manipulation should be able to efficiently alter
an input image with any personalized signals (even signals unseen during
training), such as diverse paintings and arbitrary descriptive attributes.
However, existing methods are either inefficient at simultaneously processing
multiple signals (let alone generalize to unseen signals), or unable to handle
signals from other modalities. In this paper, we make the first attempt to
address the zero-shot image manipulation task. We cast this problem as
manipulating an input image according to a parametric model whose key
parameters can be conditionally generated from any guiding signal (even unseen
ones). To this end, we propose the Zero-shot Manipulation Net (ZM-Net), a
fully-differentiable architecture that jointly optimizes an
image-transformation network (TNet) and a parameter network (PNet). The PNet
learns to generate key transformation parameters for the TNet given any guiding
signal while the TNet performs fast zero-shot image manipulation according to
both signal-dependent parameters from the PNet and signal-invariant parameters
from the TNet itself. Extensive experiments show that our ZM-Net can perform
high-quality image manipulation conditioned on different forms of guiding
signals (e.g. style images and attributes) in real-time (tens of milliseconds
per image) even for unseen signals. Moreover, a large-scale style dataset with
over 20,000 style images is also constructed to promote further research.
Yutian Lin, Liang Zheng, Zhedong Zheng, Yu Wu, Yi Yang
Person re-identification (re-ID) and attribute recognition share a common
goal: describing pedestrians. They differ in granularity.
Attribute recognition focuses on local aspects of a person while
person re-ID usually extracts global representations. Considering their
similarity and difference, this paper proposes a very simple convolutional
neural network (CNN) that learns a re-ID embedding and predicts the pedestrian
attributes simultaneously. This multi-task method integrates an ID
classification loss and a number of attribute classification losses, and
back-propagates the weighted sum of the individual losses.
Albeit simple, we demonstrate on two pedestrian benchmarks that by learning a
more discriminative representation, our method significantly improves the re-ID
baseline and is scalable on large galleries. We report competitive re-ID
performance compared with the state-of-the-art methods on the two datasets.
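The weighted sum of an ID classification loss and several attribute classification losses can be sketched as follows. The specific weighting (a single coefficient `lam` on the ID loss and an average over attribute losses) is an assumption for illustration; the abstract only says that a weighted sum is back-propagated.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed stably."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[label]


def multitask_loss(id_logits, id_label, attr_logits, attr_labels, lam):
    """Combine one ID loss with the mean of the attribute losses.

    lam in [0, 1] is a hypothetical trade-off weight.
    """
    id_loss = cross_entropy(id_logits, id_label)
    attr_losses = [cross_entropy(l, y) for l, y in zip(attr_logits, attr_labels)]
    attr_loss = sum(attr_losses) / len(attr_losses)
    return lam * id_loss + (1.0 - lam) * attr_loss


loss = multitask_loss([2.0, 0.5, -1.0], 0, [[0.3, -0.3], [1.2, 0.1]], [0, 1], lam=0.9)
```

In a deep learning framework, the same scalar would be the quantity passed to the backward pass, so both heads shape the shared embedding.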
Huikai Wu, Shuai Zheng, Junge Zhang, Kaiqi Huang
Recent advances in generative adversarial networks (GANs) have shown
promising potential in conditional image generation. However, how to generate
high-resolution images remains an open problem. In this paper, we aim at
generating high-resolution well-blended images given composited copy-and-paste
ones, i.e. realistic high-resolution image blending. To achieve this goal, we
propose Gaussian-Poisson GAN (GP-GAN), a framework that combines the strengths
of classical gradient-based approaches and GANs. To the best of our knowledge,
this is the first work that explores the capability of GANs in the
high-resolution image blending task. In particular, we propose the Gaussian-Poisson Equation to
formulate the high-resolution image blending problem, which is a joint
optimisation constrained by the gradient and colour information. Gradient
filters can obtain gradient information. For generating the colour information,
we propose Blending GAN to learn the mapping between the composited image and
the well-blended one. Compared to the alternative methods, our approach can
deliver high-resolution, realistic images with fewer bleedings and unpleasant
artefacts. Experiments confirm that our approach achieves state-of-the-art
performance on the Transient Attributes dataset. A user study on Amazon Mechanical
Turk finds that the majority of workers are in favour of the proposed approach.
Bumsub Ham, Minsu Cho, Cordelia Schmid, Jean Ponce
Finding image correspondences remains a challenging problem in the presence
of intra-class variations and large changes in scene layout. Semantic flow
methods are designed to handle images depicting different instances of the same
object or scene category. We introduce a novel approach to semantic flow,
dubbed proposal flow, that establishes reliable correspondences using object
proposals. Unlike prevailing semantic flow approaches that operate on pixels or
regularly sampled local regions, proposal flow benefits from the
characteristics of modern object proposals, which exhibit high repeatability at
multiple scales, and can take advantage of both local and geometric consistency
constraints among proposals. We also show that the corresponding sparse
proposal flow can effectively be transformed into a conventional dense flow
field. We introduce two new challenging datasets that can be used to evaluate
both general semantic flow techniques and region-based approaches such as
proposal flow. We use these benchmarks to compare different matching
algorithms, object proposals, and region features within proposal flow, to the
state of the art in semantic flow. This comparison, along with experiments on
standard datasets, demonstrates that proposal flow significantly outperforms
existing semantic flow methods in various settings.
Youngsung Kim, ByungIn Yoo, Youngjun Kwak, Changkyu Choi, Junmo Kim
As the expressive depth of an emotional face differs with individuals,
expressions, or situations, recognizing an expression using a single facial
image at a moment is difficult. One of the approaches to alleviate this
difficulty is using a video-based method that utilizes multiple frames to
extract temporal information between facial expression images. In this paper,
we attempt to utilize a generative image that is estimated based on a given
single image. Then, we propose to utilize a contrastive representation that
explains an expression difference for discriminative purposes. The contrastive
representation is calculated at the embedding layer of a deep network by
comparing a single given image with a reference sample generated by a deep
encoder-decoder network. Consequently, we deploy deep neural networks that
embed a combination of a generative model, a contrastive model, and a
discriminative model. In our proposed networks, we attempt to disentangle a
facial expressive factor in two steps including learning of a reference
generator network and learning of a contrastive encoder network. We conducted
extensive experiments on three publicly available face expression databases
(CK+, MMI, and Oulu-CASIA) that have been widely adopted in the recent
literature. The proposed method outperforms the known state-of-the-art methods
in terms of the recognition accuracy.
Mandar Kulkarni, Kalpesh Patil, Shirish Karande
Current approaches for Knowledge Distillation (KD) either directly use
training data or sample from the training data distribution. In this paper, we
demonstrate the effectiveness of ‘mismatched’ unlabeled stimulus to perform KD for
image classification networks. For illustration, we consider scenarios where
there is a complete absence of training data, or mismatched stimulus has to be
used for augmenting a small amount of training data. We demonstrate that
stimulus complexity is a key factor for distillation’s good performance. Our
examples include use of various datasets for stimulating MNIST and CIFAR
Krzysztof J. Geras, Stacey Wolfson, S. Gene Kim, Linda Moy, Kyunghyun Cho
Recent advances in deep learning for object recognition in natural images have
prompted a surge of interest in applying a similar set of techniques to medical
images. Most of the initial attempts largely focused on replacing the input to
such a deep convolutional neural network from a natural image to a medical
image. This, however, does not take into consideration the fundamental
differences between these two types of data. More specifically, detection or
recognition of an anomaly in medical images depends significantly on fine
details, unlike object recognition in natural images where coarser, more global
structures matter more. This difference makes it inadequate to use the existing
deep convolutional neural network architectures, which were developed for
natural images, because they rely on heavily downsampling an image to a much
lower resolution to reduce the memory requirements. This hides details
necessary to make accurate predictions for medical images. Furthermore, a
single exam in medical imaging often comes with a set of different views which
must be seamlessly fused in order to reach a correct conclusion. In our work,
we propose to use a multi-view deep convolutional neural network that handles a
set of more than one high-resolution medical image. We evaluate this network on
large-scale mammography-based breast cancer screening (BI-RADS prediction)
using 103 thousand images. We focus on investigating the impact of training set
sizes and image sizes on the prediction accuracy. Our results highlight that
performance clearly increases with the size of training set, and that the best
performance can only be achieved using the images in the original resolution.
This suggests the future direction of medical imaging research using deep
neural networks is to utilize as much data as possible with the least amount of
potentially harmful preprocessing.
Mohammad Sadegh Aliakbarian, Fatemehsadat Saleh, Mathieu Salzmann, Basura Fernando, Lars Petersson, Lars Andersson
In contrast to the widely studied problem of recognizing an action given a
complete sequence, action anticipation aims to identify the action from only
partially available videos. It is therefore key to the success of
computer vision applications that must react as early as possible, such as
autonomous navigation. In this paper, we propose a new action anticipation
method that achieves high prediction accuracy even in the presence of a very
small percentage of a video sequence. To this end, we develop a multi-stage
LSTM architecture that leverages context- and action-aware features, and
introduce a novel loss function that encourages the model to predict the
correct class as early as possible. Our experiments on standard benchmark
datasets evidence the benefits of our approach; we outperform the
state-of-the-art action anticipation methods for early prediction by a relative
increase in accuracy of 22.0% on JHMDB-21, 14.0% on UT-Interaction and 49.9% on
Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, Eric P. Xing
A natural image usually conveys rich semantic content and can be viewed from
different angles. Existing image description methods are largely restricted by
small sets of biased visual paragraph annotations, and fail to cover rich
underlying semantics. In this paper, we investigate a semi-supervised paragraph
generative framework that is able to synthesize diverse and semantically
coherent paragraph descriptions by reasoning over local semantic regions and
exploiting linguistic knowledge. The proposed Recurrent Topic-Transition
Generative Adversarial Network (RTT-GAN) builds an adversarial framework
between a structured paragraph generator and multi-level paragraph
discriminators. The paragraph generator generates sentences recurrently by
incorporating region-based visual and language attention mechanisms at each
step. The quality of generated paragraph sentences is assessed by multi-level
adversarial discriminators from two aspects, namely, plausibility at sentence
level and topic-transition coherence at paragraph level. The joint adversarial
training of RTT-GAN drives the model to generate realistic paragraphs with
smooth logical transition between sentence topics. Extensive quantitative
experiments on image and video paragraph datasets demonstrate the effectiveness
of our RTT-GAN in both supervised and semi-supervised settings. Qualitative
results on telling diverse stories for an image also verify the
interpretability of RTT-GAN.
Behzad Hasani, Mohammad H. Mahoor
Automated Facial Expression Recognition (FER) has been a challenging task for
decades. Many of the existing works use hand-crafted features such as LBP, HOG,
LPQ, and Histogram of Optical Flow (HOF) combined with classifiers such as
Support Vector Machines for expression recognition. These methods often require
rigorous hyperparameter tuning to achieve good results. Recently, Deep Neural
Networks (DNNs) have been shown to outperform traditional methods in visual object
recognition. In this paper, we propose a two-part network consisting of a
DNN-based architecture followed by a Conditional Random Field (CRF) module for
facial expression recognition in videos. The first part captures the spatial
relation within facial images using convolutional layers followed by three
Inception-ResNet modules and two fully-connected layers. To capture the
temporal relation between the image frames, we use linear chain CRF in the
second part of our network. We evaluate our proposed network on three publicly
available databases, viz. CK+, MMI, and FERA. Experiments are performed in
subject-independent and cross-database manners. Our experimental results show
that cascading the deep network architecture with the CRF module considerably
increases the recognition of facial expressions in videos and in particular it
outperforms the state-of-the-art methods in the cross-database experiments and
yields comparable results in the subject-independent experiments.
Yan Wang, Lingxi Xie, Chenxi Liu, Ya Zhang, Wenjun Zhang, Alan Yuille
In this paper, we reveal the importance and benefits of introducing
second-order operations into deep neural networks. We propose a novel approach
named Second-Order Response Transform (SORT), which appends element-wise
product transform to the linear sum of a two-branch network module. A direct
advantage of SORT is to facilitate cross-branch response propagation, so that
each branch can update its weights based on the current status of the other
branch. Moreover, SORT augments the family of transform operations and
increases the nonlinearity of the network, making it possible to learn flexible
functions to fit the complicated distribution of feature space. SORT can be
applied to a wide range of network architectures, including a branched variant
of a chain-styled network and a residual network, with very lightweight
modifications. We observe consistent accuracy gain on both small (CIFAR10,
CIFAR100 and SVHN) and big (ILSVRC2012) datasets. In addition, SORT is very
efficient, as the extra computation overhead is less than 5%.
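A simplified sketch of the merge step described above adds an element-wise product term to the sum of two branch outputs. This is an assumption-laden illustration: it uses the raw product, any further transform applied to the product term is omitted, and the function names are hypothetical.

```python
def second_order_merge(branch_a, branch_b):
    """Merge two branch responses as a + b + a * b, element by element."""
    return [a + b + a * b for a, b in zip(branch_a, branch_b)]


def grad_wrt_first_branch(branch_b):
    """Derivative of the merged output with respect to each element of
    branch_a. It equals 1 + b, so it depends on the other branch."""
    return [1.0 + b for b in branch_b]


merged = second_order_merge([1.0, 2.0], [3.0, -1.0])
grads = grad_wrt_first_branch([3.0, -1.0])
```

With a plain sum the derivative with respect to either branch is constant, whereas here it depends on the other branch's response, which is the sense in which the product term couples the two branches.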
Miriam W. Huijser, Jan C. van Gemert
This paper is on active learning where the goal is to reduce the data
annotation burden by interacting with a (human) oracle during training.
Standard active learning methods ask the oracle to annotate data samples.
Instead, we take a profoundly different approach: we ask for annotations of the
decision boundary. We achieve this using a deep generative model to create
novel instances along a 1D line. A point on the decision boundary is revealed
where the instances change class. Experimentally we show on three data sets
that our method can be plugged into other active learning schemes, that human
oracles can effectively annotate points on the decision boundary, that our
method is robust to annotation noise, and that decision boundary annotations
improve over annotating data samples.
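To make the notion of a boundary point on a 1D line concrete, the sketch below locates where a labeling function changes class along a line parameterized by `t` in [0, 1]. Bisection and the callable `oracle` are assumptions for illustration only; in the setting described above a human inspects generated instances along the line rather than answering point queries.

```python
def find_boundary(oracle, lo=0.0, hi=1.0, tol=1e-3):
    """Return a position t where oracle(t) changes value.

    Assumes the label flips exactly once between lo and hi.
    """
    label_lo = oracle(lo)
    if oracle(hi) == label_lo:
        raise ValueError("endpoints have the same label; no boundary on this line")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if oracle(mid) == label_lo:
            lo = mid  # boundary lies in the upper half
        else:
            hi = mid  # boundary lies in the lower half
    return (lo + hi) / 2.0


# Hypothetical oracle: the class changes at t = 0.3 along the line.
t_star = find_boundary(lambda t: t > 0.3)
```

The returned position could then be mapped back through the generative model to a point in data space and used to constrain the classifier.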
Hang Zhang, Kristin Dana
Recent work in style transfer learns a feed-forward generative network to
approximate the prior optimization-based approaches, resulting in real-time
performance. However, these methods require training separate networks for
different target styles which greatly limits the scalability. We introduce a
Multi-style Generative Network (MSG-Net) with a novel Inspiration Layer, which
retains the functionality of optimization-based approaches and has the fast
speed of feed-forward networks. The proposed Inspiration Layer explicitly
matches the feature statistics with the target styles at run time, which
dramatically improves the versatility of existing generative networks, so that
multiple styles can be realized within one network. The proposed MSG-Net
matches image styles at multiple scales and puts the computational burden into
the training. The learned generator is a compact feed-forward network that runs
in real-time after training. Compared to previous work, the proposed network
can achieve fast style transfer with at least comparable quality using a single
network. The experimental results have covered (but are not limited to)
simultaneous training of twenty different styles in a single network. The
complete software system and pre-trained models will be publicly available upon
publication.
Despite the success of deep learning on representing images for particular
object retrieval, recent studies show that the learned representations still
lie on manifolds in a high dimensional space. Therefore, nearest neighbor
search cannot be expected to be optimal for this task. Even if a nearest
neighbor graph is computed offline, exploring the manifolds online remains
expensive. This work introduces an explicit embedding reducing manifold search
to Euclidean search followed by dot product similarity search. We show this is
equivalent to linear graph filtering of a sparse signal in the frequency
domain, and we introduce a scalable offline computation of an approximate
Fourier basis of the graph. We improve the state of the art on standard particular
object retrieval datasets including a challenging one containing small objects.
At a scale of 10^5 images, the offline cost is only a few hours, while query
time is comparable to standard similarity search.
Weiyao Lin, Yang Shen, Junchi Yan, Mingliang Xu, Jianxin Wu, Jingdong Wang, Ke Lu
This paper addresses the problem of handling spatial misalignments due to
camera-view changes or human-pose variations in person re-identification. We
first introduce a boosting-based approach to learn a correspondence structure
which indicates the patch-wise matching probabilities between images from a
target camera pair. The learned correspondence structure can not only capture
the spatial correspondence pattern between cameras but also handle the
viewpoint or human-pose variation in individual images. We further introduce a
global constraint-based matching process. It integrates a global matching
constraint over the learned correspondence structure to exclude cross-view
misalignments during the image patch matching process, hence achieving a more
reliable matching score between images. Finally, we also extend our approach by
introducing a multi-structure scheme, which learns a set of local
correspondence structures to capture the spatial correspondence sub-patterns
between a camera pair, so as to handle the spatial misalignments between
individual images in a more precise way. Experimental results on various
datasets demonstrate the effectiveness of our approach.
Marco Fiorucci, Alessandro Torcinovich, Manuel Curado, Francisco Escolano, Marcello Pelillo
In this paper we analyze the practical implications of Szemerédi's
regularity lemma in the preservation of metric information contained in large
graphs. To this end, we present a heuristic algorithm to find regular
partitions. Our experiments show that this method is quite robust to the
natural sparsification of proximity graphs. In addition, this robustness can be
enforced by graph densification.
Xin Huang, Yuxin Peng
DNN-based cross-modal retrieval has become a research hotspot, by which users
can search results across various modalities like image and text. However,
existing methods mainly focus on the pairwise correlation and reconstruction
error of labeled data. They ignore the semantically similar and dissimilar
constraints between different modalities, and cannot take advantage of
unlabeled data. This paper proposes Cross-modal Deep Metric Learning with
Multi-task Regularization (CDMLMR), which integrates quadruplet ranking loss
and semi-supervised contrastive loss for modeling cross-modal semantic
similarity in a unified multi-task learning architecture. The quadruplet
ranking loss can model the semantically similar and dissimilar constraints to
preserve cross-modal relative similarity ranking information. The
semi-supervised contrastive loss is able to maximize the semantic similarity on
both labeled and unlabeled data. Compared to the existing methods, CDMLMR
exploits not only the similarity ranking information but also unlabeled
cross-modal data, and thus boosts cross-modal retrieval accuracy.
Xu Tian, Jun Zhang, Zejun Ma, Yi He, Juan Wei, Peihao Wu, Wenchang Situ, Shuai Li, Yang Zhang
Recurrent neural networks (RNNs), especially long short-term memory (LSTM)
RNNs, are effective networks for sequential tasks like speech recognition. Deeper
LSTM models perform well on large vocabulary continuous speech recognition,
because of their impressive learning ability. However, it is more difficult to
train a deeper network. We introduce a training framework with layer-wise
training and exponential moving average methods for deeper LSTM models. It is a
competitive framework that LSTM models of more than 7 layers are successfully
trained on Shenma voice search data in Mandarin and they outperform the deep
LSTM models trained by conventional approach. Moreover, in order for online
streaming speech recognition applications, the shallow model with low real time
factor is distilled from the very deep model. The recognition accuracy have
little loss in the distillation process. Therefore, the model trained with the
proposed training framework reduces relative 14\% character error rate,
compared to original model which has the similar real-time capability.
Furthermore, a novel transfer learning strategy with segmental Minimum
Bayes-Risk is also introduced in the framework. The strategy makes it possible
for training on only a small part of the dataset to outperform training on the
full dataset from the beginning.
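As a rough illustration of the exponential moving average idea mentioned above, the sketch below keeps a smoothed "shadow" copy of a parameter vector alongside the raw values produced by training. The abstract does not say exactly which quantities are averaged or which decay is used, so averaging the model weights, the `decay` values, and the function name `ema_update` are all assumptions.

```python
def ema_update(shadow, params, decay=0.99):
    """Return the updated moving average of the parameters.

    shadow: current averaged parameters (list of floats)
    params: latest raw parameters from the optimizer (list of floats)
    decay:  weight kept on the old average (assumed value)
    """
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]


# Toy usage: three hypothetical optimizer steps on a two-weight model.
raw_steps = [[1.0, -1.0], [0.0, 0.0], [2.0, 2.0]]
shadow = list(raw_steps[0])          # start the average at the first value
for params in raw_steps[1:]:
    shadow = ema_update(shadow, params, decay=0.5)
```

The averaged copy changes more smoothly than the raw parameters, which is the usual motivation for evaluating or initializing from it.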
Haichuan Yang, Shupeng Gui, Chuyang Ke, Daniel Stefankovic, Ryohei Fujimaki, Ji Liu
The cardinality constraint is an intrinsic way to restrict the solution
structure in many domains, for example, sparse learning, feature selection, and
compressed sensing. To solve a cardinality constrained problem, the key
challenge is to solve the projection onto the cardinality constraint set, which
is NP-hard in general when there exist multiple overlapped cardinality
constraints. In this paper, we consider the scenario where overlapped
cardinality constraints satisfy a Three-view Cardinality Structure (TVCS),
which reflects the natural restriction in many applications, such as
identification of gene regulatory networks and the task-worker assignment
problem. We cast the projection onto the TVCS set as a linear program, and
prove that its solution can be obtained by finding an integer solution to this
linear program. We further prove that such an integer solution can be found
with complexity proportional to the problem scale. We finally use synthetic
experiments and two interesting applications in bioinformatics and
crowdsourcing to validate the proposed TVCS model and method.
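For orientation, the simplest instance of the projection discussed above is the case of a single cardinality constraint, where the Euclidean projection onto the set of vectors with at most k nonzero entries keeps the k entries of largest magnitude and zeroes the rest. The sketch below covers only that easy case; it is not the TVCS algorithm, which handles multiple overlapped constraints through a linear program. The function name is hypothetical.

```python
def project_cardinality(x, k):
    """Euclidean projection of x onto {z : z has at most k nonzeros}.

    Keeps the k largest-magnitude entries of x and sets the others to zero.
    Ties are broken in favor of the earlier index.
    """
    if k <= 0:
        return [0.0] * len(x)
    by_magnitude = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(by_magnitude[:k])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


# Toy usage: keep the two largest-magnitude entries.
projected = project_cardinality([0.5, -3.0, 2.0, 0.1], k=2)
# projected == [0.0, -3.0, 2.0, 0.0]
```

With several overlapping constraints this greedy selection is no longer guaranteed to be feasible for all of them at once, which is where the difficulty described in the abstract arises.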
Multivariate time series forecasting is an important machine learning problem
across many domains, including predictions of solar plant energy output,
electricity consumption, and traffic jam situations. Temporal data arising in
these real-world applications often involve a mixture of long-term and
short-term patterns, for which traditional approaches such as autoregressive
models and Gaussian Processes may fail. In this paper, we propose a novel deep
learning framework, namely Long- and Short-term Time-series network (LSTNet),
to address this open challenge. LSTNet uses a Convolutional Neural Network
(CNN) to extract short-term local dependency patterns among variables, and a
Recurrent Neural Network (RNN) to discover long-term patterns and trends. In
our evaluation on real-world data with complex mixtures of repetitive patterns,
LSTNet achieved significant performance improvements over several
state-of-the-art baseline methods.
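The division of labor described above, a convolution over short time windows followed by a recurrence over the resulting features, can be shown with a toy forward pass. This is only a schematic under stated assumptions: the weights are hand-picked rather than learned, the recurrence is a plain tanh (Elman-style) cell chosen for brevity, and the function names and shapes are hypothetical. It does not reproduce the actual LSTNet architecture.

```python
import math


def conv_over_time(series, kernels):
    """Slide each kernel over the time axis of a multivariate series.

    series:  list of T time steps, each a list of n variable values
    kernels: list of filters, each `width` rows of n weights
    Returns T - width + 1 feature vectors (one ReLU output per filter).
    """
    width = len(kernels[0])
    features = []
    for t in range(len(series) - width + 1):
        window = series[t:t + width]
        row = []
        for kernel in kernels:
            total = sum(
                kernel[i][j] * window[i][j]
                for i in range(width)
                for j in range(len(window[i]))
            )
            row.append(max(0.0, total))  # ReLU
        features.append(row)
    return features


def simple_rnn(features, w_in, w_rec):
    """Run a tanh recurrence over the feature sequence; return the last state."""
    hidden = [0.0] * len(w_in)
    for x in features:
        hidden = [
            math.tanh(
                sum(w_in[h][f] * x[f] for f in range(len(x)))
                + sum(w_rec[h][g] * hidden[g] for g in range(len(hidden)))
            )
            for h in range(len(w_in))
        ]
    return hidden


# Toy usage: 3 time steps, 2 variables, one filter of width 2, one hidden unit.
series = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
kernels = [[[1.0, 0.0], [0.0, 1.0]]]
local_features = conv_over_time(series, kernels)   # [[5.0], [9.0]]
state = simple_rnn(local_features, w_in=[[0.1]], w_rec=[[0.5]])
```

The convolution only ever sees `width` consecutive steps, so it captures local patterns across variables, while the recurrent state carries information across the whole sequence.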
