Eric Hunsberger, Chris Eliasmith
Comments: 10 pages, 3 figures, 4 tables; the “methods” section of this article draws heavily on arXiv:1510.08829
Subjects: Neural and Evolutionary Computing (cs.NE); Learning (cs.LG)
We describe a method to train spiking deep networks that can be run using
leaky integrate-and-fire (LIF) neurons, achieving state-of-the-art results for
spiking LIF networks on five datasets, including the large ImageNet ILSVRC-2012
benchmark. Our method for transforming deep artificial neural networks into
spiking networks is scalable and works with a wide range of neural
nonlinearities. We achieve these results by softening the neural response
function, such that its derivative remains bounded, and by training the network
with noise to provide robustness against the variability introduced by spikes.
Our analysis shows that implementations of these networks on neuromorphic
hardware will be many times more power-efficient than the equivalent
non-spiking networks on traditional hardware.
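The "softening" of the response function can be illustrated with a minimal sketch. The abstract does not give the formula, so the code below assumes the textbook LIF steady-state rate equation and replaces its hard threshold with a softplus; the function names and the parameter values (`tau_ref`, `tau_rc`, `v_th`, `gamma`) are illustrative assumptions, not values taken from the paper.

```python
import math

def lif_rate(j, tau_ref=0.002, tau_rc=0.02, v_th=1.0):
    """Steady-state firing rate (Hz) of a standard LIF neuron for input current j.

    The rate is zero below threshold and its slope is unbounded just above it.
    """
    if j <= v_th:
        return 0.0
    return 1.0 / (tau_ref + tau_rc * math.log1p(v_th / (j - v_th)))

def softplus(x, gamma):
    """Smooth stand-in for max(x, 0); smaller gamma means less smoothing."""
    z = x / gamma
    if z > 30.0:  # softplus(x) is numerically equal to x here
        return x
    return gamma * math.log1p(math.exp(z))

def soft_lif_rate(j, gamma=0.05, tau_ref=0.002, tau_rc=0.02, v_th=1.0):
    """LIF rate with the hard threshold max(j - v_th, 0) replaced by a softplus.

    Assumed form of the softening: the curve is smooth around threshold,
    so its derivative stays bounded.
    """
    p = softplus(j - v_th, gamma)
    if p < 1e-300:  # guard against underflow far below threshold
        return 0.0
    return 1.0 / (tau_ref + tau_rc * math.log1p(v_th / p))
```

Well above threshold the two curves agree closely; near threshold the soft version rises gradually instead of jumping away from zero.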
Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu
Subjects: Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Deep reinforcement learning agents have achieved state-of-the-art results by
directly maximising cumulative reward. However, environments contain a much
wider variety of possible training signals. In this paper, we introduce an
agent that also maximises many other pseudo-reward functions simultaneously by
reinforcement learning. All of these tasks share a common representation that,
like unsupervised learning, continues to develop in the absence of extrinsic
rewards. We also introduce a novel mechanism for focusing this representation
upon extrinsic rewards, so that learning can rapidly adapt to the most relevant
aspects of the actual task. Our agent significantly outperforms the previous
state-of-the-art on Atari, averaging 880% expert human performance, and a
challenging suite of first-person, three-dimensional Labyrinth tasks
leading to a mean speedup in learning of 10× and averaging 87% expert
human performance on Labyrinth.
Mennatullah Siam, Sepehr Valipour, Martin Jagersand, Nilanjan Ray
Comments: arXiv admin note: substantial text overlap with arXiv:1606.00487
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Semantic segmentation has recently witnessed major progress, where fully
convolutional neural networks have been shown to perform well. However, most of the
previous work focused on improving single image segmentation. To our knowledge,
no prior work has made use of temporal video information in a recurrent
network. In this paper, we propose and implement a novel method for online
semantic segmentation of video sequences that utilizes temporal data. The
network combines a fully convolutional network and a gated recurrent unit that
works on a sliding window over consecutive frames. The convolutional gated
recurrent unit is used to preserve spatial information and reduce the
parameters learned. Our method has the advantage that it can work in an online
fashion instead of operating over the whole input batch of video frames. This
architecture is tested for both binary and semantic video segmentation tasks.
Experiments are conducted on the recent benchmarks SegTrack V2, Davis,
CityScapes, and Synthia. The method shows a 5% improvement on SegTrack and a 3%
improvement on Davis in F-measure over a baseline plain fully convolutional
network. It also shows a 5.7% improvement on Synthia in mean IoU, and a
3.5% improvement on CityScapes in mean category IoU over the baseline network.
The performance of the RFCN network depends on its baseline fully convolutional
network. Thus the RFCN architecture can be seen as a method to improve its baseline
segmentation network by exploiting spatiotemporal information in videos.
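The online, sliding-window use of a gated recurrent unit described above can be sketched in a few lines. This is a heavily simplified illustration, not the paper's convolutional GRU: each pixel carries a single scalar feature (standing in for the FCN output), the gate weights are shared across pixels as a 1x1 convolution would be, and the names `gru_step`, `segment_online` and the parameter dictionary are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    """One GRU update for a single scalar feature x and hidden state h."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])  # reset gate
    h_tilde = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])
    return (1.0 - z) * h + z * h_tilde

def segment_online(frame_features, params, window=3):
    """For each incoming frame, run the GRU over the last `window` frames only.

    frame_features: list of frames, each a flat list of per-pixel features
    (assumed to come from a fully convolutional network).
    Returns one hidden-state map per frame.
    """
    outputs = []
    for t in range(len(frame_features)):
        start = max(0, t - window + 1)
        h = [0.0] * len(frame_features[t])  # state is reset for every window
        for frame in frame_features[start:t + 1]:
            h = [gru_step(x, hi, params) for x, hi in zip(frame, h)]
        outputs.append(h)
    return outputs
```

Because each output depends only on the most recent `window` frames, a prediction is available as soon as a frame arrives, without waiting for the whole video.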
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He
Comments: Tech report
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We present a simple, highly modularized network architecture for image
classification. Our network is constructed by repeating a building block that
aggregates a set of transformations with the same topology. Our simple design
results in a homogeneous, multi-branch architecture that has only a few
hyper-parameters to set. This strategy exposes a new dimension, which we call
“cardinality” (the size of the set of transformations), as an essential factor
in addition to the dimensions of depth and width. On the ImageNet-1K dataset,
we empirically show that even under the restricted condition of maintaining
complexity, increasing cardinality is able to improve classification accuracy.
Moreover, increasing cardinality is more effective than going deeper or wider
when we increase the capacity. Our models, codenamed ResNeXt, are the
foundations of our entry to the ILSVRC 2016 classification task in which we
secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the
COCO detection set, also showing better results than its ResNet counterpart.
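The idea of aggregating a set of same-topology transformations can be sketched as follows. This is a toy illustration in which each branch is a small two-layer bottleneck on a plain vector and the residual shortcut is an assumption carried over from the ResNet counterpart; `aggregated_block` and the branch layout are hypothetical names, not the authors' code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, a) for a in vector]

def aggregated_block(x, branches):
    """Return x + sum_i T_i(x), where every T_i has the same topology.

    branches: list of (w_in, w_out) weight pairs; the number of branches
    plays the role of the "cardinality" of the block.
    """
    out = list(x)  # assumed identity shortcut
    for w_in, w_out in branches:
        y = matvec(w_out, relu(matvec(w_in, x)))  # one bottleneck branch
        out = [o + yi for o, yi in zip(out, y)]
    return out
```

With a fixed total number of weights, cardinality can be raised by making each branch narrower and adding more branches, which is the trade-off the abstract refers to as "maintaining complexity".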
Alejandro Newell, Jia Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce associative embedding, a novel method for supervising
convolutional neural networks for the task of detection and grouping. A number
of computer vision problems can be framed in this manner including multi-person
pose estimation, instance segmentation, and multi-object tracking. Usually the
grouping of detections is achieved with multi-stage pipelines; instead, we
propose an approach that teaches a network to simultaneously output detections
and group assignments. This technique can be easily integrated into any
state-of-the-art network architecture that produces pixel-wise predictions. We
show how to apply this method to both multi-person pose estimation and instance
segmentation. We present results for both tasks, and report state-of-the-art
performance for multi-person pose.
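One way to read "detections and group assignments" is that each detection comes with an embedding value, and detections with similar values belong together. The sketch below assumes scalar embeddings ("tags") and a simple greedy threshold rule; the function name, the threshold, and the input format are illustrative assumptions rather than the procedure used in the paper.

```python
def group_by_tag(detections, threshold=0.5):
    """Greedily group detections whose tags are close to a group's mean tag.

    detections: list of (label, tag) pairs, where tag is a scalar embedding
    assumed to be predicted by the network alongside the detection.
    Returns a list of groups, each a list of labels.
    """
    groups = []  # each group: {"tags": [...], "members": [...]}
    for label, tag in detections:
        best, best_dist = None, threshold
        for group in groups:
            mean_tag = sum(group["tags"]) / len(group["tags"])
            dist = abs(tag - mean_tag)
            if dist < best_dist:
                best, best_dist = group, dist
        if best is None:  # no existing group is close enough
            best = {"tags": [], "members": []}
            groups.append(best)
        best["tags"].append(tag)
        best["members"].append(label)
    return [group["members"] for group in groups]
```

For example, two people whose joints receive tags near 0.1 and near 2.0 respectively are separated into two groups without any separate grouping stage.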
Mariusz Bojarski, Anna Choromanska, Krzysztof Choromanski, Bernhard Firner, Larry Jackel, Urs Muller, Karol Zieba
Subjects: Computer Vision and Pattern Recognition (cs.CV)
This paper proposes a new method, which we call VisualBackProp, for
visualizing which sets of pixels of the input image contribute most to the
predictions made by the convolutional neural network (CNN). The method heavily
hinges on exploring the intuition that the feature maps contain less and less
information irrelevant to the prediction decision when moving deeper into the
network. The technique we propose was developed as a debugging tool for
CNN-based systems for steering self-driving cars and is therefore required to
run in real-time, i.e. it was designed to require less computation than a
forward propagation. This makes the presented visualization method a valuable
debugging tool which can be easily used during both training and inference. We
furthermore justify our approach with theoretical arguments and theoretically
confirm that the proposed method identifies sets of input pixels, rather than
individual pixels, that collaboratively contribute to the prediction. Our
theoretical findings stand in agreement with experimental results. The
empirical evaluation shows the plausibility of the proposed approach on road
data.
Zhen-Hua Feng, Josef Kittler, William Christmas, Patrik Huber, Xiao-Jun Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We present a new Cascaded Shape Regression (CSR) architecture, namely Dynamic
Attention-Controlled CSR (DAC-CSR), for robust facial landmark detection on
unconstrained faces. Our DAC-CSR divides facial landmark detection into three
cascaded sub-tasks: face bounding box refinement, general CSR and
attention-controlled CSR. The first two stages refine initial face bounding
boxes and output intermediate facial landmarks. Then, an online dynamic model
selection method is used to choose appropriate domain-specific CSRs for further
landmark refinement. The key innovation of our DAC-CSR is the fault-tolerant
mechanism, using fuzzy set sample weighting for attention-controlled
domain-specific model training. Moreover, we advocate data augmentation with a
simple but effective 2D profile face generator, and context-aware feature
extraction for better facial feature representation. Experimental results
obtained on challenging datasets demonstrate the merits of our DAC-CSR over the
state-of-the-art.
Yongxi Lu, Abhishek Kumar, Shuangfei Zhai, Yu Cheng, Tara Javidi, Rogerio Feris
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)
Multi-task learning aims to improve generalization performance of multiple
prediction tasks by appropriately sharing relevant information across them. In
the context of deep neural networks, this idea is often realized by
hand-designed network architectures with layers that are shared across tasks
and branches that encode task-specific features. However, the space of possible
multi-task deep architectures is combinatorially large and often the final
architecture is arrived at by manual exploration of this space subject to
designer’s bias, which can be both error-prone and tedious. In this work, we
propose a principled approach for designing compact multi-task deep learning
architectures. Our approach starts with a thin network and dynamically widens
it in a greedy manner during training using a novel criterion that promotes
grouping of similar tasks together. Our extensive evaluation on person
attributes classification tasks involving facial and clothing attributes
suggests that the models produced by the proposed method are fast, compact and
can closely match or exceed the state-of-the-art accuracy from strong baselines
by much more expensive models.
Anthony D. Rhodes, Max H. Quinn, Melanie Mitchell
Comments: arXiv admin note: text overlap with arXiv:1607.00548
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)
A major goal of computer vision is to enable computers to interpret visual
situations—abstract concepts (e.g., “a person walking a dog,” “a crowd
waiting for a bus,” “a picnic”) whose image instantiations are linked more by
their common spatial and semantic structure than by low-level visual
similarity. In this paper, we propose a novel method for prior learning and
active object localization for this kind of knowledge-driven search in static
images. In our system, prior situation knowledge is captured by a set of
flexible, kernel-based density estimations—a situation model—that represent
the expected spatial structure of the given situation. These estimations are
efficiently updated by information gained as the system searches for relevant
objects, allowing the system to use context as it is discovered to narrow the
search.
More specifically, at any given time in a run on a test image, our system
uses image features plus contextual information it has discovered to identify a
small subset of training images—an importance cluster—that is deemed most
similar to the given test image, given the context. This subset is used to
generate an updated situation model in an on-line fashion, using an efficient
multipole expansion technique.
As a proof of concept, we apply our algorithm to a highly varied and
challenging dataset consisting of instances of a “dog-walking” situation. Our
results support the hypothesis that dynamically-rendered, context-based
probability models can support efficient object localization in visual
situations. Moreover, our approach is general enough to be applied to diverse
machine learning paradigms requiring interpretable, probabilistic
representations generated from partially observed data.
Jeremiah Johnson
Comments: 10 pages, 4 figures, 2 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Applications (stat.AP); Machine Learning (stat.ML)
The artistic style of a painting is a subtle aesthetic judgment used by art
historians for grouping and classifying artwork. The recently introduced
‘neural-style’ algorithm substantially succeeds in merging the perceived
artistic style of one image or set of images with the perceived content of
another. In light of this and other recent developments in image analysis via
convolutional neural networks, we investigate the effectiveness of a
‘neural-style’ representation for classifying the artistic style of paintings.
Gedas Bertasius, Stella X. Yu, Hyun Soo Park, Jianbo Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Skill assessment is a fundamental problem in sports like basketball.
Nowadays, basketball skill assessment is handled by basketball experts who
evaluate a player’s skill from unscripted third-person basketball game videos.
However, due to a large distance between a camera and the players, a
third-person video captures a low-resolution view of the players, which makes
it difficult 1) to identify specific players in the video and 2) to recognize
what they are doing.
To address these issues, we use first-person cameras, which 1) provide a
high-resolution view of a player’s actions, and 2) also eliminate the need to
track each player. Despite this, learning a basketball skill assessment model
from the first-person data is still challenging, because 1) a player’s actions
of interest occur rarely, and 2) the data labeling requires using basketball
experts, which is costly.
To counter these problems, we introduce the concept of basketball elements,
which 1) addresses the issue of limited player activity data, and 2) eliminates
the reliance on basketball experts. Basketball elements define simple basketball
concepts, making labeling easy even for non-experts. Basketball elements are
also prevalent in the first-person data, which allows us to learn them and use
them for a player’s basketball activity recognition and basketball skill
assessment.
Thus, our contributions include (1) a new task of assessing a player’s
basketball skill from an unscripted first-person basketball game video, (2) a
new 10.3 hour long first-person basketball video dataset capturing 48 players
and (3) a data-driven model that assesses a player’s basketball skill without
relying on basketball expert labelers.
Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The goal of this work is to recognise phrases and sentences being spoken by a
talking face, with or without the audio. Unlike previous works that have
focussed on recognising a limited number of words or phrases, we tackle lip
reading as an open-world problem – unconstrained natural language sentences,
and in-the-wild videos.
Our key contributions are: (1) a ‘Watch, Listen, Attend and Spell’ (WLAS)
network that learns to transcribe videos of mouth motion to characters; (2) a
curriculum learning strategy to accelerate training and to reduce overfitting;
(3) a ‘Lip Reading Sentences’ (LRS) dataset for visual speech recognition,
consisting of over 100,000 natural sentences from British television.
The WLAS model trained on the LRS dataset surpasses the performance of all
previous work on standard lip reading benchmark datasets, often by a
significant margin. This lip reading performance beats a professional lip
reader on videos from BBC television, and we also demonstrate that visual
information helps to improve speech recognition performance even when the audio
is available.
Hisham Cholakkal, Jubin Johnson, Deepu Rajan
Comments: 14 pages, 12 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Top-down saliency models produce a probability map that peaks at target
locations specified by a task/goal such as object detection. They are usually
trained in a fully supervised setting involving pixel-level annotations of
objects. We propose a weakly supervised top-down saliency framework using only
binary labels that indicate the presence/absence of an object in an image.
First, the probabilistic contribution of each image region to the confidence of
a CNN-based image classifier is computed through a backtracking strategy to
produce top-down saliency. From a set of saliency maps of an image produced by
fast bottom-up saliency approaches, we select the best saliency map suitable
for the top-down task. The selected bottom-up saliency map is combined with the
top-down saliency map. Features having high combined saliency are used to train
a linear SVM classifier to estimate feature saliency. This is integrated with
combined saliency and further refined through a multi-scale
superpixel-averaging of saliency map. We evaluate the performance of the
proposed weakly supervised top-down saliency against fully supervised
approaches and achieve state-of-the-art performance. Experiments are carried
out on seven challenging datasets and quantitative results are compared with 36
closely related approaches across 4 different applications.
Gedas Bertasius, Stella X. Yu, Jianbo Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Many first-person vision tasks such as activity recognition or video
summarization require knowing which objects the camera wearer is interacting
with (i.e. action-objects). The standard way to obtain this information is via
manual annotation, which is costly and time consuming. Also, whereas for
third-person tasks such as object detection the annotator can be anybody, the
action-object detection task requires the camera wearer to annotate the data
because a third person may not know what the camera wearer was thinking. Such a
constraint makes it even more difficult to obtain first-person annotations.
To address this problem, we propose a Visual-Spatial Network (VSN) that
detects action-objects without using any first-person labels. We do so (1) by
exploiting the visual-spatial co-occurrence in the first-person data and (2) by
employing an alternating cross-pathway supervision between the visual and
spatial pathways of our VSN. During training, we use a selected action-object
prior location to initialize the pseudo action-object ground truth, which is
then used to optimize both pathways in an alternating fashion. The predictions
from the spatial pathway are used to update the pseudo ground truth for the
visual pathway and vice versa, which allows both pathways to improve each
other. We show our method’s success on two different action-object datasets,
where our method achieves similar or better results than the supervised
methods. We also show that our method can be successfully used as pretraining
for a supervised action-object detection task.
Wenhu Chen, Aurelien Lucchi, Thomas Hofmann
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We propose a novel way of using out-of-domain textual data to enhance the
performance of existing image captioning systems. We evaluate this learning
approach on a newly designed model that uses – and improves upon – building
blocks from state-of-the-art methods. This model starts from detecting visual
concepts present in an image which are then fed to a reviewer-decoder
architecture with an attention mechanism. Unlike previous approaches that
encode visual concepts using word embeddings, we instead suggest using regional
image features which capture more intrinsic information. The main benefit of
this architecture is that it synthesizes meaningful thought vectors that
capture salient image properties and then applies a soft attentive decoder to
decode the thought vectors and generate image captions. We evaluate our model
on both Microsoft COCO and Flickr30K datasets and demonstrate that this model
combined with our bootstrap learning method can largely improve performance and
help the model to generate more accurate and diverse captions.
L. Robert Hocking, Russell MacKenzie, Carola-Bibiane Schoenlieb
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The conversion of traditional film into stereo 3D has become an important
problem in the past decade. One of the main bottlenecks is a disocclusion step,
which in commercial 3D conversion is usually done by teams of artists armed
with a toolbox of inpainting algorithms. A current difficulty in this is that
most available algorithms are either too slow for interactive use, or provide
no intuitive means for users to tweak the output.
In this paper we present a new fast inpainting algorithm based on
transporting along automatically detected splines, which the user may edit. Our
algorithm is implemented on the GPU and fills the inpainting domain in
successive shells that adapt their shape on the fly. In order to allocate GPU
resources as efficiently as possible, we propose a parallel algorithm to track
the inpainting interface as it evolves, ensuring that no resources are wasted
on pixels that are not currently being worked on. Theoretical analysis of the
time and processor complexity of our algorithm without and with tracking (as
well as numerous numerical experiments) demonstrates the merits of the latter.
Our transport mechanism is similar to the one used in coherence transport,
but improves upon it by correcting a “kinking” phenomenon whereby extrapolated
isophotes may bend at the boundary of the inpainting domain. Theoretical
results explaining this phenomenon and its resolution are presented.
Although our method ignores texture, in many cases this is not a problem due
to the thin inpainting domains in 3D conversion. Experimental results show that
our method can achieve a visual quality that is competitive with the
state-of-the-art while maintaining interactive speeds and providing the user
with an intuitive interface to tweak the results.
Tu Bui, Leonardo Ribeiro, Moacir Ponti, John Collomosse
Comments: submitted to CVPR2017 on 15Nov16
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We propose and evaluate several triplet CNN architectures for measuring the
similarity between sketches and photographs, within the context of the sketch
based image retrieval (SBIR) task. In contrast to recent fine-grained SBIR
work, we study the ability of our networks to generalise across diverse object
categories from limited training data, and explore in detail strategies for
weight sharing, pre-processing, data augmentation and dimensionality reduction.
We exceed the performance of pre-existing techniques on both the Flickr15k
category level SBIR benchmark by 18%, and the TU-Berlin SBIR benchmark by
$\sim 10\,\mathcal{T}_b$, when trained on the 250 category TU-Berlin
classification dataset augmented with 25k corresponding photographs harvested
from the Internet.
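A triplet network is typically trained with a triplet loss that pulls a sketch embedding toward a matching photo and pushes it away from a non-matching one. The sketch below shows the common hinge form with Euclidean distance; the margin value and the use of unsquared distance are assumptions, since the abstract does not specify the exact loss.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss.

    anchor:   embedding of a sketch
    positive: embedding of a photo from the same category
    negative: embedding of a photo from a different category
    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)
```

For instance, with the anchor at the origin, a positive at distance 1 and a negative at distance 1.5, the loss is 0.5; moving the negative to distance 5 drives it to zero.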
Shu Zhang, Ran He, Tieniu Tan
Comments: 10 pages, submitted to CVPR 17
Subjects: Computer Vision and Pattern Recognition (cs.CV)
MeshFace photos have been widely used in many Chinese business organizations
to protect ID face photos from being misused. The occlusions incurred by random
meshes severely degenerate the performance of face verification systems, which
raises the MeshFace verification problem between MeshFace and daily photos.
Previous methods cast this problem as a typical low-level vision problem, i.e.
blind inpainting. They recover perceptually pleasing clear ID photos from
MeshFaces by enforcing pixel level similarity between the recovered ID images
and the ground-truth clear ID images and then perform face verification on
them. Essentially, face verification is conducted on a compact feature space
rather than the image pixel space. Therefore, this paper argues that pixel
level similarity and feature level similarity jointly offer the key to improve
the verification performance. Based on this insight, we offer a novel feature
oriented blind face inpainting framework. Specifically, we implement this by
establishing a novel DeMeshNet, which consists of three parts. The first part
addresses blind inpainting of the MeshFaces by implicitly exploiting extra
supervision from the occlusion position to enforce pixel level similarity. The
second part explicitly enforces a feature level similarity in the compact
feature space, which can explore informative supervision from the feature space
to produce better inpainting results for verification. The last part copes with
face alignment within the net via a customized spatial transformer module when
extracting deep facial features. All three parts are implemented within an
end-to-end network that facilitates efficient optimization. Extensive
experiments on two MeshFace datasets demonstrate the effectiveness of the
proposed DeMeshNet as well as the insight of this paper.
Colin Lea, Michael D. Flynn, Rene Vidal, Austin Reiter, Gregory D. Hager
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The ability to identify and temporally segment fine-grained human actions
throughout a video is crucial for robotics, surveillance, education, and
beyond. Typical approaches decouple this problem by first extracting local
spatiotemporal features from video frames and then feeding them into a temporal
classifier that captures high-level temporal patterns. We introduce a new class
of temporal models, which we call Temporal Convolutional Networks (TCNs), that
use a hierarchy of temporal convolutions to perform fine-grained action
segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling
to efficiently capture long-range temporal patterns whereas our Dilated TCN
uses dilated convolutions. We show that TCNs are capable of capturing action
compositions, segment durations, and long-range dependencies, and are over an
order of magnitude faster to train than competing LSTM-based Recurrent Neural Networks.
We apply these models to three challenging fine-grained datasets and show large
improvements over the state of the art.
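The reason dilated convolutions capture long-range patterns can be seen from a small sketch. The code below assumes a causal, left-zero-padded 1D convolution and a stack with doubling dilation rates; these choices and the function names are illustrative, and the layer details of the Dilated TCN itself are not given in the abstract.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Causal dilated 1D convolution with zero padding on the left.

    out[t] = sum_k kernel[k] * signal[t - k * dilation]
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # positions before the start count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input time steps that can influence one output of the stack."""
    return 1 + (kernel_size - 1) * sum(dilations)
```

With kernel size 2 and dilations 1, 2, 4, 8, four layers already cover 16 time steps, so the temporal context grows exponentially with depth while the number of weights grows only linearly.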
Jose Caballero, Christian Ledig, Andrew Aitken, Alejandro Acosta, Johannes Totz, Zehan Wang, Wenzhe Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Convolutional neural networks have enabled accurate image super-resolution in
real-time. However, recent attempts to benefit from temporal correlations in
video super-resolution have been limited to naive or inefficient architectures.
In this paper, we introduce spatio-temporal sub-pixel convolution networks that
effectively exploit temporal redundancies and improve reconstruction accuracy
while maintaining real-time speed. Specifically, we discuss the use of early
fusion, slow fusion and 3D convolutions for the joint processing of multiple
consecutive video frames. We also propose a novel joint motion compensation and
video super-resolution algorithm that is orders of magnitude more efficient
than competing methods, relying on a fast multi-resolution spatial transformer
module that is end-to-end trainable. These contributions provide both higher
accuracy and temporally more consistent videos, which we confirm qualitatively
and quantitatively. Relative to single-frame models, spatio-temporal networks
can either reduce the computational cost by 30% whilst maintaining the same
quality or provide a 0.2dB gain for a similar computational cost. Results on
publicly available datasets demonstrate that the proposed algorithms surpass
current state-of-the-art performance in both accuracy and efficiency.
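Two of the building blocks named above can be sketched without any deep learning library. The code assumes the usual periodic-shuffling definition of sub-pixel convolution (r*r low-resolution channels rearranged into one high-resolution channel) and treats early fusion as concatenating the channels of consecutive frames before the first layer; the function names and the nested-list data layout are hypothetical.

```python
def pixel_shuffle(channels, r):
    """Rearrange r*r low-resolution channels into one image upscaled by r.

    channels: list of r*r feature maps, each a list of rows (H x W).
    Returns a single (H*r) x (W*r) map.
    """
    assert len(channels) == r * r
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0] * (w * r) for _ in range(h * r)]
    for y in range(h * r):
        for x in range(w * r):
            c = (y % r) * r + (x % r)  # which channel supplies this sub-pixel
            out[y][x] = channels[c][y // r][x // r]
    return out

def early_fusion(frames):
    """Stack the channels of consecutive frames into one multi-channel input.

    frames: list of frames, each a list of channel maps.
    """
    return [channel for frame in frames for channel in frame]
```

Because the convolutions run at low resolution and the rearrangement itself is only an index shuffle, the upscaling step adds very little computation.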
Mengyue Geng, Yaowei Wang, Tao Xiang, Yonghong Tian
Comments: 10 pages, 1 figure
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Person re-identification (Re-ID) poses a unique challenge to deep learning:
how to learn a deep model with millions of parameters on a small training set
of few or no labels. In this paper, a number of deep transfer learning models
are proposed to address the data sparsity problem. First, a deep network
architecture is designed which differs from existing deep Re-ID models in that
(a) it is more suitable for transferring representations learned from large
image classification datasets, and (b) classification loss and verification
loss are combined, each of which adopts a different dropout strategy. Second, a
two-stepped fine-tuning strategy is developed to transfer knowledge from
auxiliary datasets. Third, given an unlabelled Re-ID dataset, a novel
unsupervised deep transfer learning model is developed based on co-training.
The proposed models outperform the state-of-the-art deep Re-ID models by large
margins: we achieve Rank-1 accuracy of 85.4%, 83.7% and 56.3% on CUHK03,
Market1501, and VIPeR respectively, whilst on VIPeR, our unsupervised model
(45.1%) beats most supervised models.
Florian Bernard, Frank R. Schmidt, Johan Thunberg, Daniel Cremers
Comments: 10 pages, 7 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
We propose a combinatorial solution for the problem of non-rigidly matching a
3D shape to 3D image data. To this end, we model the shape as a triangular mesh
and allow each triangle of this mesh to be rigidly transformed to achieve a
suitable matching to the image. By penalising the distance and the relative
rotation between neighbouring triangles our matching compromises between the
image and the shape information. In this paper, we resolve two major
challenges: Firstly, we address the resulting large and NP-hard combinatorial
problem with a suitable graph-theoretic approach. Secondly, we propose an
efficient discretisation of the unbounded 6-dimensional Lie group SE(3). To our
knowledge this is the first combinatorial formulation for non-rigid 3D
shape-to-image matching. In contrast to existing local (gradient descent)
optimisation methods, we obtain solutions that do not require a good
initialisation and that are within a bound of the optimal solution. We evaluate
the proposed combinatorial method on the two problems of non-rigid 3D
shape-to-shape and non-rigid 3D shape-to-image registration and demonstrate
that it provides promising results.
Yemin Shi, Yonghong Tian, Yaowei Wang, Tiejun Huang
Comments: 10 pages, 9 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Despite considerable research effort in recent years, how to efficiently
learn long-term dependencies from sequences remains a challenging
task. As one of the key models for sequence learning, the recurrent neural network
(RNN) and its variants such as long short term memory (LSTM) and the gated
recurrent unit (GRU) are still not powerful enough in practice. One possible
reason is that they have only feedforward connections, unlike
biological neural networks, which are typically composed of both feedforward and
feedback connections. To address the problem, this paper proposes a
biologically-inspired RNN structure, called shuttleNet, by introducing loop
connections in the network and utilizing parameter sharing to prevent
overfitting. Unlike the traditional RNNs, the cells of shuttleNet are loop
connected to mimic the brain’s feedforward and feedback connections. The
structure is then stretched in the depth dimension to generate a deeper model
with multiple information flow paths, while the parameters are shared so as to
prevent shuttleNet from overfitting. An attention mechanism is then
applied to select the best information path. Extensive experiments are
conducted on two datasets for action recognition: UCF101 and HMDB51. We find
that our model can outperform LSTMs and GRUs remarkably. Even when only replacing
the LSTMs with our shuttleNet in a CNN-RNN network, we still achieve
state-of-the-art performance on both datasets.
Yemin Shi, Yonghong Tian, Yaowei Wang, Tiejun Huang
Comments: 8 pages, 5 figures, JNA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
By extracting spatial and temporal characteristics in one network, the
two-stream ConvNets can achieve the state-of-the-art performance in action
recognition. However, such a framework typically suffers from the separately
processing of spatial and temporal information between the two standalone
streams and is hard to capture long-term temporal dependence of an action. More
importantly, it is incapable of finding the salient portions of an action, say,
the frames that are the most discriminative to identify the action. To address
these problems, a joint network based attention
(JNA) is proposed in this study. We find that the fully-connected fusion,
branch selection and spatial attention mechanism are totally infeasible for
action recognition. Thus in our joint network, the spatial and temporal
branches share some information during the training stage. We also introduce an
attention mechanism on the temporal domain to capture the long-term dependence
while also finding the salient portions. Extensive experiments are conducted on
two benchmark datasets, UCF101 and HMDB51. Experimental results show that our
method improves action recognition performance significantly and
achieves state-of-the-art results on both datasets.
Katharina Schwarz, Patrick Wieschollek, Hendrik P.A. Lensch
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The wide distribution of digital devices as well as cheap storage allow us to
take series of photos making sure not to miss any specific beautiful moment.
Thereby, the huge and constantly growing image assembly makes it quite
time-consuming to manually pick the best shots afterwards. Even more
challenging, finding the most aesthetically pleasing images that might also be
worth sharing is a largely subjective task in which general rules rarely apply.
Nowadays, online platforms allow users to “like” or favor certain content with
a single click. As we aim to predict the aesthetic quality of images, we now
make use of such multi-user agreements. More precisely, we assemble a large
data set of 380K images with associated meta information and derive a score to
rate how visually pleasing a given photo is. [...] predict the aesthetic quality of
any arbitrary image or video, we transfer the [...] Our proposed model of aesthetics
is validated in a user study. We demonstrate our results on applications for
resorting photo collections, capturing the best shot on mobile devices and
aesthetic key-frame extraction from videos.
Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, Luc Van Gool
Subjects: Computer Vision and Pattern Recognition (cs.CV)
This paper tackles the task of semi-supervised video object segmentation,
i.e., the separation of an object from the background in a video, given the
mask of the first frame. We present One-Shot Video Object Segmentation (OSVOS),
based on a fully-convolutional neural network architecture that is able to
successively transfer generic semantic information, learned on ImageNet, to the
task of foreground segmentation, and finally to learning the appearance of a
single annotated object of the test sequence (hence one-shot). Although all
frames are processed independently, the results are temporally coherent and
stable. We perform experiments on three annotated video segmentation databases,
which show that OSVOS is fast and improves the state of the art by a
significant margin (79.8% vs 68.0%).
Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, Hanning Zhou
Comments: 8 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Clustering is among the most fundamental tasks in computer vision and machine
learning. In this paper, we propose Variational Deep Embedding (VaDE), a novel
unsupervised generative clustering approach within the framework of Variational
Auto-Encoder (VAE). Specifically, VaDE models the data generative procedure
with a Gaussian Mixture Model (GMM) and a deep neural network (DNN): 1) the GMM
picks a cluster; 2) from which a latent embedding is generated; 3) then the DNN
decodes the latent embedding into observables. Inference in VaDE is done in a
variational way: a different DNN is used to encode observables to latent
embeddings, so that the evidence lower bound (ELBO) can be optimized using
Stochastic Gradient Variational Bayes (SGVB) estimator and the
reparameterization trick. Quantitative comparisons with strong baselines are
included in this paper, and experimental results show that VaDE significantly
outperforms the state-of-the-art clustering methods on 4 benchmarks from
various modalities. Moreover, by VaDE’s generative nature, we show its
capability of generating highly realistic samples for any specified cluster,
without using supervised information during training. Lastly, VaDE is a
flexible and extensible framework for unsupervised generative clustering; more
general mixture models than GMM can be easily plugged in.
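The three-step generative procedure described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the mixture parameters, the function names, and `toy_decoder` (a fixed linear map followed by a sigmoid, standing in for the DNN decoder) are all assumptions made for the example.

```python
import math
import random


def sample_vade(pi, mus, sigmas, decoder, rng):
    """Draw one sample following the three-step generative story."""
    # 1) The GMM picks a cluster c ~ Categorical(pi).
    c = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    # 2) A latent embedding z ~ N(mu_c, diag(sigma_c^2)) is generated.
    z = [rng.gauss(m, s) for m, s in zip(mus[c], sigmas[c])]
    # 3) The decoder maps the latent embedding to observables.
    x = decoder(z)
    return c, z, x


def toy_decoder(z):
    """Hypothetical stand-in for the DNN decoder: linear map plus sigmoid."""
    weights = [[1.0, -0.5], [0.3, 0.8], [-1.2, 0.4]]  # 3 outputs, 2 latent dims
    return [1.0 / (1.0 + math.exp(-sum(w * zi for w, zi in zip(row, z))))
            for row in weights]


rng = random.Random(0)
pi = [0.5, 0.3, 0.2]                           # assumed mixing weights
mus = [[-2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]    # assumed cluster means
sigmas = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # assumed standard deviations
c, z, x = sample_vade(pi, mus, sigmas, toy_decoder, rng)
```

Fixing `c` instead of sampling it would correspond to generating samples for a specified cluster, as the abstract mentions.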
Yu-An Chung, Hsuan-Tien Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
While deep neural networks have succeeded in several visual applications,
such as object recognition, detection, and localization, by reaching very high
classification accuracies, it is important to note that many real-world
applications demand varying costs for different types of misclassification
errors, thus requiring cost-sensitive classification algorithms. Current models
of deep neural networks for cost-sensitive classification are restricted to
some specific network structures and limited depth. In this paper, we propose a
novel framework that can be applied to deep neural networks with any structure
to facilitate their learning of meaningful representations for cost-sensitive
classification problems. Furthermore, the framework allows end-to-end training
of deeper networks directly. The framework is designed by augmenting auxiliary
neurons to the output of each hidden layer for layer-wise cost estimation, and
including the total estimation loss within the optimization objective.
Experimental results on public benchmark visual data sets with two cost
information settings demonstrate that the proposed framework outperforms
state-of-the-art cost-sensitive deep learning models.
Tien-Ju Yang, Yu-Hsin Chen, Vivienne Sze
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Deep convolutional neural networks (CNNs) are indispensable to
state-of-the-art computer vision algorithms. However, they are still rarely
deployed on battery-powered mobile devices, such as smartphones and wearable
gadgets, where vision algorithms can enable many revolutionary real-world
applications. The key limiting factor is the high energy consumption of CNN
processing due to its high computational complexity. While there are many
previous efforts that try to reduce the CNN model size or amount of
computation, we find that they do not necessarily result in lower energy
consumption, and therefore do not serve as a good metric for energy cost
estimation.
To close the gap between CNN design and energy consumption optimization, we
propose an energy-aware pruning algorithm for CNNs that directly uses energy
consumption estimation of a CNN to guide the pruning process. The energy
estimation methodology uses parameters extrapolated from actual hardware
measurements that target realistic battery-powered system setups. The proposed
layer-by-layer pruning algorithm also prunes more aggressively than previously
proposed pruning methods by minimizing the error in output feature maps instead
of filter weights. For each layer, the weights are first pruned and then
locally fine-tuned with a closed-form least-square solution to quickly restore
the accuracy. After all layers are pruned, the entire network is further
globally fine-tuned using back-propagation. With the proposed pruning method,
the energy consumption of AlexNet and GoogLeNet is reduced by 3.7x and 1.6x,
respectively, with less than 1% top-5 accuracy loss. Finally, we show that
pruning the AlexNet with a reduced number of target classes can greatly
decrease the number of weights but the energy reduction is limited.
Paritosh Parmar, Brendan Tran Morris
Subjects: Computer Vision and Pattern Recognition (cs.CV)
While action recognition has been addressed extensively in the field of
computer vision, action quality assessment has not been given much attention.
Estimating action quality is crucial in areas such as sports and health care,
while being useful in other areas like video retrieval. Unlike action
recognition, which has millions of examples to learn from, the action quality
datasets that are currently available are small — typically comprised of only
a few hundred samples. We develop quality assessment frameworks which use SVR,
LSTM and LSTM-SVR on top of spatiotemporal features learned using 3D
convolutional neural networks (C3D). We demonstrate an efficient training
mechanism for action quality LSTM suitable for limited data scenarios. The
proposed systems show significant improvement over existing quality assessment
approaches on the task of predicting scores of Olympic events both with
short-time length actions (10m platform diving) and long-time length actions
(figure skating short program). While SVR based frameworks yield better
results, LSTM based frameworks are more intuitive and natural for describing
the action, and can be used for improvement feedback.
Mohit Iyyer, Varun Manjunatha, Anupam Guha, Yogarshi Vyas, Jordan Boyd-Graber, Hal Daumé III, Larry Davis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Visual narrative is often a combination of explicit information and judicious
omissions, relying on the viewer to supply missing details. In comics, most
movements in time and space are hidden in the “gutters” between panels. To
follow the story, readers logically connect panels together by inferring unseen
actions through a process called “closure”. While computers can now describe
the content of natural images, in this paper we examine whether they can
understand the closure-driven narratives conveyed by stylized artwork and
dialogue in comic book panels. We collect a dataset, COMICS, that consists of
over 1.2 million panels (120 GB) paired with automatic textbox transcriptions.
An in-depth analysis of COMICS demonstrates that neither text nor image alone
can tell a comic book story, so a computer must understand both modalities to
keep up with the plot. We introduce three cloze-style tasks that ask models to
predict narrative and character-centric aspects of a panel given n preceding
panels as context. Various deep neural architectures underperform human
baselines on these tasks, suggesting that COMICS contains fundamental
challenges for both vision and language.
Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Teddy Furon, Ondrej Chum
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Query expansion is a popular method to improve the quality of image retrieval
with both conventional and CNN representations. It has been so far limited to
global image similarity. This work focuses on diffusion, a mechanism that
captures the image manifold in the feature space. The diffusion is carried out
on descriptors of overlapping image regions rather than on a global image
descriptor as in previous approaches. An efficient off-line stage allows
optional reduction in the number of stored regions. In the on-line stage, the
proposed handling of unseen queries in the indexing stage removes additional
computation to adjust the precomputed data. A novel way to perform diffusion
through a sparse linear system solver yields practical query times well below
one second. Experimentally, we observe a significant boost in performance of
image retrieval with compact CNN descriptors on standard benchmarks, especially
when the query object covers only a small part of the image. Small objects have
been a common failure case of CNN-based retrieval.
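To make the "diffusion through a linear system solver" idea concrete, here is a small sketch under a common formulation of diffusion ranking, which the abstract itself does not spell out: scores f solve (I - alpha S) f = y, where S is a symmetrically normalized affinity matrix and y marks the query. The dense toy graph, the value of alpha, and all function names are assumptions for illustration; a practical system would use a sparse matrix.

```python
import math


def normalize_affinity(W):
    """Symmetric normalization: S = D^{-1/2} W D^{-1/2}."""
    d = [sum(row) for row in W]
    n = len(W)
    return [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
            for i in range(n)]


def matvec(S, alpha, v):
    """Apply A = I - alpha * S to the vector v."""
    n = len(v)
    return [v[i] - alpha * sum(S[i][j] * v[j] for j in range(n))
            for i in range(n)]


def conjugate_gradient(S, alpha, y, tol=1e-10, max_iter=100):
    """Solve (I - alpha S) f = y; the system is symmetric positive
    definite for 0 < alpha < 1, so conjugate gradient applies."""
    f = [0.0] * len(y)
    r = list(y)                      # residual y - A f for f = 0
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs < tol:
            break
        Ap = matvec(S, alpha, p)
        step = rs / sum(pk * apk for pk, apk in zip(p, Ap))
        f = [fk + step * pk for fk, pk in zip(f, p)]
        r = [rk - step * apk for rk, apk in zip(r, Ap)]
        rs_new = sum(rk * rk for rk in r)
        p = [rk + (rs_new / rs) * pk for rk, pk in zip(r, p)]
        rs = rs_new
    return f


# Toy affinity graph: a chain of four regions, 0 - 1 - 2 - 3.
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
S = normalize_affinity(W)
alpha = 0.5
y = [1.0, 0.0, 0.0, 0.0]             # region 0 plays the role of the query
scores = conjugate_gradient(S, alpha, y)
```

In this toy chain the scores decay with graph distance from the query, which is the ranking signal diffusion is meant to provide.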
Shu Kong, Charless Fowlkes
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Pooling second-order local feature statistics to form a high-dimensional
bilinear feature has been shown to achieve state-of-the-art performance on a
variety of fine-grained classification tasks. To address the computational
demands of high feature dimensionality, we propose to represent the covariance
features as a matrix and apply a low-rank bilinear classifier. The resulting
classifier can be evaluated without explicitly computing the bilinear feature
map which allows for a large reduction in the compute time as well as
decreasing the effective number of parameters to be learned.
To further compress the model, we propose classifier co-decomposition that
factorizes the collection of bilinear classifiers into a common factor and
compact per-class terms. The co-decomposition idea can be deployed through two
convolutional layers and trained in an end-to-end architecture. We suggest a
simple yet effective initialization that avoids explicitly first training and
factorizing the larger bilinear classifiers. Through extensive experiments, we
show that our model achieves state-of-the-art performance on several public
datasets for fine-grained classification trained with only category labels.
Importantly, our final model is an order of magnitude smaller than the recently
proposed compact bilinear model, and three orders smaller than the standard
bilinear CNN model.
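The claim that a low-rank classifier can be evaluated without forming the bilinear feature follows from a trace identity. The sketch below checks it numerically under the simplifying assumption, not stated in the abstract, that the classifier is a single symmetric factor W = U U^T, so that <W, X^T X> = ||X U||_F^2. The sizes and helper names are hypothetical.

```python
import random


def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frob_inner(A, B):
    """Frobenius inner product <A, B> = sum_ij A_ij * B_ij."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


rng = random.Random(0)
n, d, r = 6, 5, 2   # toy sizes: spatial locations, channels, classifier rank
X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]  # local features
U = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # low-rank factor

# Explicit route: build the d x d bilinear feature and the d x d classifier.
bilinear_feature = matmul(transpose(X), X)        # X^T X
W = matmul(U, transpose(U))                       # W = U U^T
score_explicit = frob_inner(W, bilinear_feature)  # <W, X^T X>

# Factored route: only an n x r matrix is ever formed.
XU = matmul(X, U)
score_factored = frob_inner(XU, XU)               # ||X U||_F^2
```

The two scores agree up to floating-point error, while the factored route costs O(n d r) instead of O(n d^2) and stores d r parameters per class instead of d^2.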
Li Zhang, Tao Xiang, Shaogang Gong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Zero-shot learning (ZSL) models rely on learning a joint embedding space
where both textual/semantic description of object classes and visual
representation of object images can be projected to for nearest neighbour
search. Despite the success of deep neural networks that learn an end-to-end
model between text and images in other vision problems such as image
captioning, very few deep ZSL models exist, and they show little advantage over
ZSL models that utilise deep feature representations but do not learn an
end-to-end embedding. In this paper we argue that the key to making deep ZSL
models succeed is to choose the right embedding space. Instead of embedding
into a semantic space or an intermediate space, we propose to use the visual
space as the embedding space. This is because, in this space, the
subsequent nearest neighbour search would suffer much less from the hubness
problem and thus become more effective. This model design also provides a
natural mechanism for multiple semantic modalities (e.g., attributes and
sentence descriptions) to be fused and optimised jointly in an end-to-end
manner. Extensive experiments on four benchmarks show that our model
significantly outperforms the existing models.
Elad Richardson, Matan Sela, Roy Or-El, Ron Kimmel
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Reconstructing the detailed geometric structure of a face from a given image
is a key to many computer vision and graphics applications, such as motion
capture and reenactment. The reconstruction task is challenging as human faces
vary extensively when considering expressions, poses, textures, and intrinsic
geometry. While many approaches tackle this complexity by using additional data
to reconstruct the face of a single subject, extracting facial surface from a
single image remains a difficult problem. As a result, single-image based
methods can usually provide only a rough estimate of the facial geometry. In
contrast, we propose to leverage the power of convolutional neural networks to
produce a highly detailed face reconstruction from a single image. For this
purpose, we introduce an end-to-end CNN framework which derives the shape in a
coarse-to-fine fashion. The proposed architecture is composed of two main
blocks, a network that recovers the coarse facial geometry (CoarseNet),
followed by a CNN that refines the facial features of that geometry (FineNet).
The proposed networks are connected by a novel layer which renders a depth
image given a mesh in 3D. Unlike object recognition and detection problems,
there are no suitable datasets for training CNNs to perform face geometry
reconstruction. Therefore, our training regime begins with a supervised phase,
based on synthetic images, followed by an unsupervised phase that uses only
unconstrained facial images. The accuracy and robustness of the proposed model
are demonstrated by both qualitative and quantitative evaluation tests.
Zhiwei Jin, Juan Cao, Jiebo Luo, Yongdong Zhang
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Numerous fake images spread on social media today and can severely jeopardize
the credibility of online content to the public. In this paper, we employ deep
networks to learn distinct fake image related features. In contrast to
authentic images, fake images tend to be eye-catching and visually striking.
Compared with traditional visual recognition tasks, it is extremely challenging
to understand these psychologically triggered visual patterns in fake images.
Traditional general image classification datasets, such as the ImageNet set, are
designed for feature learning at the object level but are not suitable for
learning the hyper-features that would be required by image credibility
analysis. In order to overcome the scarcity of training samples of fake images,
we first construct a large-scale auxiliary dataset indirectly related to this
task. This auxiliary dataset contains 0.6 million weakly-labeled fake and real
images collected automatically from social media. Through an AdaBoost-like
transfer learning algorithm, we train a CNN model with a few instances in the
target training set and 0.6 million images in the collected auxiliary set. This
learning algorithm is able to leverage knowledge from the auxiliary set and
gradually transfer it to the target task. Experiments on a real-world testing
set show that our proposed domain transferred CNN model outperforms several
competing baselines. It obtains superior results over transfer learning
methods based on the general ImageNet set. Moreover, case studies show that our
proposed method reveals some interesting patterns for distinguishing fake and
authentic images.
Siddharth Agrawal, Ambedkar Dukkipati
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)
Variational autoencoders (VAEs), which are built upon deep neural networks,
have emerged as popular generative models in computer vision. Most of the work
towards improving variational autoencoders has focused mainly on making the
approximations to the posterior flexible and accurate, leading to tremendous
progress. However, there have been limited efforts to replace pixel-wise
reconstruction, which has known shortcomings. In this work, we use real-valued
non-volume preserving transformations (real NVP) to exactly compute the
conditional likelihood of the data given the latent distribution. We show that
a simple VAE with this form of reconstruction is competitive with complicated
VAE structures, on image modeling tasks. As part of our model, we develop
powerful conditional coupling layers that enable real NVP to learn with fewer
intermediate layers.
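For readers unfamiliar with real NVP, the following sketch shows why such transformations permit exact likelihood computation: an affine coupling layer is trivially invertible and has a triangular Jacobian. It is a generic, unconditional coupling layer, not the conditional coupling layers this abstract proposes; `scale_and_shift` is a hypothetical stand-in for the learned networks s(.) and t(.), and the input is assumed to have even length.

```python
import math


def scale_and_shift(x1):
    """Hypothetical stand-in for the learned networks s(.) and t(.)."""
    s = [math.tanh(0.5 * v) for v in x1]
    t = [0.1 * v + 1.0 for v in x1]
    return s, t


def coupling_forward(x):
    """y1 = x1, y2 = x2 * exp(s(x1)) + t(x1); returns (y, log|det J|)."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s, t = scale_and_shift(x1)
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    log_det = sum(s)   # triangular Jacobian: log-determinant is the sum of s
    return x1 + y2, log_det


def coupling_inverse(y):
    """Exact inverse: x2 = (y2 - t(y1)) * exp(-s(y1))."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s, t = scale_and_shift(y1)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return y1 + x2


def log_likelihood(x):
    """Exact log p(x) with a standard normal base density (an assumption)."""
    y, log_det = coupling_forward(x)
    log_base = sum(-0.5 * v * v - 0.5 * math.log(2 * math.pi) for v in y)
    return log_base + log_det


x = [0.3, -1.2, 0.7, 2.0]
y, log_det = coupling_forward(x)
x_back = coupling_inverse(y)
```

Because the first half passes through unchanged, s and t can be arbitrarily complex without affecting invertibility or the cost of the log-determinant.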
Shuangfei Zhai, Hui Wu, Abhishek Kumar, Yu Cheng, Yongxi Lu, Zhongfei Zhang, Rogerio Feris
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Feature pooling layers (e.g., max pooling) in convolutional neural networks
(CNNs) serve the dual purpose of providing increasingly abstract
representations as well as yielding computational savings in subsequent
convolutional layers. We view the pooling operation in CNNs as a two-step
procedure: first, a pooling window (e.g., 2 × 2) slides over the feature
map with stride one which leaves the spatial resolution intact, and second,
downsampling is performed by selecting one pixel from each non-overlapping
pooling window in an often uniform and deterministic (e.g., top-left) manner.
Our starting point in this work is the observation that this regularly spaced
downsampling arising from non-overlapping windows, although intuitive from a
signal processing perspective (which has the goal of signal reconstruction), is
not necessarily optimal for learning (where the goal is to generalize).
We study this aspect and propose a novel pooling strategy with stochastic
spatial sampling (S3Pool), where the regular downsampling is replaced by a more
general stochastic version. We observe that this general stochasticity acts as
a strong regularizer, and can also be seen as doing implicit data augmentation
by introducing distortions in the feature maps. We further introduce a
mechanism to control the amount of distortion to suit different datasets and
architectures. To demonstrate the effectiveness of the proposed approach, we
perform extensive experiments on several popular image classification
benchmarks, observing excellent improvements over baseline models. Experimental
code is available at this https URL
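The two-step view of pooling in this abstract can be sketched as below. The first function is the stride-one max pooling step; the second replaces the deterministic "keep the top-left pixel" rule with a random choice of one row and one column per block. The exact sampling scheme here (independent row and column choices, windows clipped at the border) is an assumption for illustration and may differ from the authors' released code.

```python
import random


def max_pool_stride1(fmap, k=2):
    """k x k max pooling with stride one; windows are clipped at the
    border so the output keeps the input resolution."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[a][b]
                 for a in range(i, min(i + k, h))
                 for b in range(j, min(j + k, w)))
             for j in range(w)] for i in range(h)]


def stochastic_downsample(fmap, stride=2, rng=random):
    """Keep one randomly chosen row per block of `stride` rows, and
    likewise for columns, instead of a fixed position in each block."""
    h, w = len(fmap), len(fmap[0])
    rows = [rng.randrange(i, min(i + stride, h)) for i in range(0, h, stride)]
    cols = [rng.randrange(j, min(j + stride, w)) for j in range(0, w, stride)]
    return [[fmap[r][c] for c in cols] for r in rows]


def s3pool(fmap, k=2, stride=2, rng=random):
    """Stride-one max pooling followed by stochastic spatial sampling."""
    return stochastic_downsample(max_pool_stride1(fmap, k), stride, rng)


rng = random.Random(0)
fmap = [[float(4 * i + j) for j in range(4)] for i in range(4)]
dense = max_pool_stride1(fmap)
pooled = s3pool(fmap, rng=rng)
```

At test time one would typically replace the random choice with a deterministic rule; that detail is left out of this sketch.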
Baoxu Shi, Tim Weninger
Comments: 14 pages, Accepted to AAAI 2017
Subjects: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
With the large volume of new information created every day, determining the
validity of information in a knowledge graph and filling in its missing parts
are crucial tasks for many researchers and practitioners. To address this
challenge, a number of knowledge graph completion methods have been developed
using low-dimensional graph embeddings. Although researchers continue to
improve these models using an increasingly complex feature space, we show that
simple changes in the architecture of the underlying model can outperform
state-of-the-art models without the need for complex feature engineering. In
this work, we present a shared variable neural network model called ProjE that
fills in missing information in a knowledge graph by learning joint embeddings
of the knowledge graph’s entities and edges, and through subtle, but important,
changes to the standard loss function. In doing so, ProjE has a parameter size
that is smaller than 11 out of 15 existing methods while performing 37%
better than the current-best method on standard datasets. We also show, via a
new fact checking task, that ProjE is capable of accurately determining the
veracity of many declarative statements.
Prof. Roger K. Moore
Comments: To appear in A. McElhone & W. Mansell (Eds.), Living Control Systems IV: Perceptual Control Theory and the Future of the Life and Social Sciences, Benchmark Publications Inc
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Recent years have witnessed increasing interest in the potential benefits of
‘intelligent’ autonomous machines such as robots. Honda’s Asimo humanoid robot,
iRobot’s Roomba robot vacuum cleaner and Google’s driverless cars have fired
the imagination of the general public, and social media buzz with speculation
about a utopian world of helpful robot assistants or the coming robot
apocalypse! However, there is a long way to go before autonomous systems reach
the level of capabilities required for even the simplest of tasks involving
human-robot interaction – especially if it involves communicative behaviour
such as speech and language. Of course the field of Artificial Intelligence
(AI) has made great strides in these areas, and has moved on from abstract
high-level rule-based paradigms to embodied architectures whose operations are
grounded in real physical environments. What is still missing, however, is an
overarching theory of intelligent communicative behaviour that informs
system-level design decisions in order to provide a more coherent approach to
system integration. This chapter introduces the beginnings of such a framework
inspired by the principles of Perceptual Control Theory (PCT). In particular,
it is observed that PCT has hitherto tended to view perceptual processes as a
relatively straightforward series of transformations from sensation to
perception, and has overlooked the potential of powerful generative model-based
solutions that have emerged in practical fields such as visual or auditory
scene analysis. Starting from first principles, a sequence of arguments is
presented which not only shows how these ideas might be integrated into PCT,
but which also extends PCT towards a remarkably symmetric architecture for a
needs-driven communicative agent. It is concluded that, if behaviour is the
control of perception, then perception is the simulation of behaviour.
Carmine Dodaro, Philip Gasteiger, Nicola Leone, Benjamin Musitsch, Francesco Ricca, Konstantin Schekotihin
Comments: Paper presented at the 1st Workshop on Trends and Applications of Answer Set Programming (TAASP 2016), Klagenfurt, Austria, 26 September 2016, 15 pages, LaTeX, 5 figures
Subjects: Artificial Intelligence (cs.AI)
The CDCL algorithm is the leading solution adopted by state-of-the-art
solvers for SAT, SMT, ASP, and others. Experiments show that the performance of
CDCL solvers can be significantly boosted by embedding domain-specific
heuristics, especially on large real-world problems. However, a proper
integration of such criteria in off-the-shelf CDCL implementations is not
obvious. In this paper, we distill the key ingredients that drive the search of
CDCL solvers, and propose a general framework for designing and implementing
new heuristics. We implemented our strategy in an ASP solver, and we
experimented on two industrial domains. On hard problem instances,
state-of-the-art implementations fail to find any solution in acceptable time,
whereas our implementation is very successful and finds all solutions.
Luiz H. Nunes, Julio C. Estrella, Alexandre C. B. Delbem, Charith Perera, Stephan Reiff-Marganiec
Comments: Proceedings of the 9th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2016) Shanghai, China, December, 2016
Journal-ref: Proceedings of the 9th IEEE/ACM International Conference on
Utility and Cloud Computing (UCC 2016) Shanghai, China, December, 2016
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Over the last few years, the number of smart objects connected to the
Internet has grown exponentially in comparison to the number of services and
applications. The integration between Cloud Computing and Internet of Things,
named as Cloud of Things, plays a key role in managing the connected things,
their data and services. One of the main challenges in Cloud of Things is the
resource discovery of the smart objects and their reuse in different contexts.
Most existing work uses some kind of multi-criteria decision analysis
algorithm to perform the resource discovery, but does not evaluate the impact
that the user constraints have on the final solution. In this paper, we analyse
the behaviour of the SAW, TOPSIS and VIKOR multi-objective decision analysis
algorithms and the impact of user constraints on them. We evaluated the quality
of the proposed solutions using the Pareto-optimality concept.
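As a rough illustration of the simplest of the three algorithms named above, the sketch below ranks candidates with Simple Additive Weighting (SAW) and then checks the winner against the Pareto front. The candidate smart objects, their criteria, and the weights are invented for the example; the normalization (value/max for benefit criteria, min/value for cost criteria) is one common SAW convention and assumes strictly positive values.

```python
def saw_scores(candidates, weights, benefit):
    """Simple Additive Weighting: normalize each criterion, then take
    the weighted sum of the normalized values."""
    columns = list(zip(*candidates))
    scores = []
    for row in candidates:
        total = 0.0
        for value, col, w, is_benefit in zip(row, columns, weights, benefit):
            norm = value / max(col) if is_benefit else min(col) / value
            total += w * norm
        scores.append(total)
    return scores


def dominates(a, b, benefit):
    """True if a is at least as good as b on every criterion and
    strictly better on at least one."""
    better_or_equal = all((x >= y) if up else (x <= y)
                          for x, y, up in zip(a, b, benefit))
    strictly_better = any((x > y) if up else (x < y)
                          for x, y, up in zip(a, b, benefit))
    return better_or_equal and strictly_better


def pareto_front(candidates, benefit):
    """Indices of candidates that no other candidate dominates."""
    return [i for i, a in enumerate(candidates)
            if not any(dominates(b, a, benefit) for b in candidates)]


# Hypothetical smart objects: (accuracy, latency in ms, energy in mW).
objects = [(0.90, 120.0, 30.0), (0.80, 60.0, 20.0), (0.70, 200.0, 40.0)]
benefit = (True, False, False)   # higher accuracy, lower latency and energy
weights = (0.5, 0.3, 0.2)        # assumed user preferences

scores = saw_scores(objects, weights, benefit)
best = max(range(len(objects)), key=scores.__getitem__)
front = pareto_front(objects, benefit)
```

A user constraint (for instance, a latency ceiling) could be modelled by filtering `objects` before scoring; comparing the winner with and without the filter against `front` mirrors the kind of evaluation the abstract describes.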
Mahtab J. Fard, Sattar Ameri, Ratna B. Chinnam, Abhilash K. Pandya, Michael D. Klein, R. Darin Ellis
Journal-ref: Lecture Notes in Engineering and Computer Science: Proceedings of
The World Congress on Engineering and Computer Science 2016, 19-21 October,
2016, San Francisco, USA
Subjects: Artificial Intelligence (cs.AI); Learning (cs.LG); Machine Learning (stat.ML)
Evaluating surgeon skill has predominantly been a subjective task.
Development of objective methods for surgical skill assessment is of increased
interest. Recently, with technological advances such as robotic-assisted
minimally invasive surgery (RMIS), new opportunities for objective and
automated assessment frameworks have arisen. In this paper, we applied machine
learning methods to automatically evaluate performance of the surgeon in RMIS.
Six important movement features were used in the evaluation including
completion time, path length, depth perception, speed, smoothness and
curvature. Different classification methods were applied to discriminate between
expert and novice surgeons. We test our method on real surgical data for a suturing task and
compare the classification result with the ground truth data (obtained by
manual labeling). The experimental results show that the proposed framework can
classify surgical skill level with relatively high accuracy of 85.7%. This
study demonstrates the ability of machine learning methods to automatically
classify expert and novice surgeons using movement features for different RMIS
tasks. Due to the simplicity and generalizability of the introduced
classification method, it is easy to implement in existing trainers.
Zheng Sun, Jiaqi Liu, Zewang Zhang, Jingwen Chen, Zhao Huo, Ching Hua Lee, Xiao Zhang
Comments: 5 pages, 5 figures
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
Creating any aesthetically pleasing piece of art, like music, has been a
long-time dream for artificial intelligence research. Building on the recent success of
long short-term memory (LSTM) on sequence learning, we put forward a novel
system to reflect the thinking pattern of a musician. For data representation,
we propose a note-level encoding method, which enables our model to simulate
how a human composes and polishes musical phrases. To avoid violating music
theory, we devise a novel method, the grammar argumented (GA) method. It can teach
the machine basic composing principles. In this method, we propose three rules as
argumented grammars and three metrics for evaluating machine-made music.
Results show that, compared to a basic LSTM, the grammar argumented model’s
compositions have higher contents of diatonic scale notes, short pitch
intervals, and chords.
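Two of the three evaluation quantities named above can be illustrated with a short sketch. The abstract does not define the metrics precisely, so everything here is an assumption: notes are MIDI pitch numbers, the key is C major, and an interval counts as "short" when it spans at most four semitones. The chord metric is omitted because it would need harmonic information the abstract does not describe.

```python
C_MAJOR_PITCH_CLASSES = {0, 2, 4, 5, 7, 9, 11}   # assumed key: C major


def diatonic_ratio(pitches, scale=C_MAJOR_PITCH_CLASSES):
    """Fraction of notes whose pitch class lies in the given scale."""
    return sum(p % 12 in scale for p in pitches) / len(pitches)


def short_interval_ratio(pitches, max_semitones=4):
    """Fraction of consecutive-note intervals no larger than the
    (assumed) threshold of `max_semitones`."""
    intervals = [abs(b - a) for a, b in zip(pitches, pitches[1:])]
    return sum(i <= max_semitones for i in intervals) / len(intervals)


# Hypothetical melody as MIDI pitches; 66 (F sharp) is outside C major.
melody = [60, 62, 64, 65, 67, 66, 72, 71]
in_scale = diatonic_ratio(melody)
stepwise = short_interval_ratio(melody)
```

For this melody, seven of the eight notes are diatonic and six of the seven intervals are short; comparing such ratios between two models is the kind of evaluation the abstract reports.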
Jeremiah Johnson
Comments: 10 pages, 4 figures, 2 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Applications (stat.AP); Machine Learning (stat.ML)
The artistic style of a painting is a subtle aesthetic judgment used by art
historians for grouping and classifying artwork. The recently introduced
‘neural-style’ algorithm substantially succeeds in merging the perceived
artistic style of one image or set of images with the perceived content of
another. In light of this and other recent developments in image analysis via
convolutional neural networks, we investigate the effectiveness of a
‘neural-style’ representation for classifying the artistic style of paintings.
Paolo Detti, Garazi Zabalo Manrique de Lara
Subjects: Discrete Mathematics (cs.DM); Artificial Intelligence (cs.AI)
In this work, a study on Variable Neighborhood Search algorithms for
multi-depot dial-a-ride problems is presented. In dial-a-ride problems patients
need to be transported from pre-specified pickup locations to pre-specified
delivery locations, under different considerations. The addressed problem
presents several constraints and features, such as heterogeneous vehicles,
distributed in different depots, and heterogeneous patients. The aim is to
minimize the total routing cost, while respecting time-window, ride-time,
capacity and route duration constraints. The objective of the study is to
determine the best algorithm configuration in terms of initial solution,
neighborhood and local search procedures. To this end, two different procedures
for the computation of an initial solution, six different types of neighborhoods
and five local search procedures, where only intra-route changes are made, have
been considered and compared.
We have also evaluated an “adjusting procedure” that aims to produce feasible
solutions from infeasible solutions with small constraint violations. The
different VNS algorithms have been tested on instances from literature as well
as on random instances arising from a real-world healthcare application.
Drahomira Herrmannova, Petr Knoth
Comments: WSDM Cup 2016 – Entity Ranking Challenge. The 9th ACM International Conference on Web Search and Data Mining, San Francisco, CA, USA. February 22-25, 2016
Subjects: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
With the growing amount of published research, automatic evaluation of
scholarly publications is becoming an important task. In this paper we address
this problem and present a simple and transparent approach for evaluating the
importance of scholarly publications. Our method has been ranked among the top
performers in the WSDM Cup 2016 Challenge. The first part of this paper
describes our method. In the second part we present potential improvements to
the method and analyse the evaluation setup which was provided during the
challenge. Finally, we discuss future challenges in automatic evaluation of
papers, including the use of full-text-based evaluation methods.
Jeremiah Johnson
Comments: 10 pages, 4 figures, 2 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Applications (stat.AP); Machine Learning (stat.ML)
The artistic style of a painting is a subtle aesthetic judgment used by art
historians for grouping and classifying artwork. The recently introduced
'neural-style' algorithm substantially succeeds in merging the perceived
artistic style of one image or set of images with the perceived content of
another. In light of this and other recent developments in image analysis via
convolutional neural networks, we investigate the effectiveness of a
'neural-style' representation for classifying the artistic style of paintings.
Xinchi Chen, Xipeng Qiu, Xuanjing Huang
Subjects: Computation and Language (cs.CL)
Long-term context is crucial to the joint Chinese word segmentation and POS
tagging (S&T) task. However, most machine-learning-based methods extract
features from a window of characters. Due to the limited window size,
these methods cannot exploit long-distance information. In this work, we
propose a long-dependency-aware deep architecture for the joint S&T task.
Specifically, to simulate the feature templates of traditional discrete feature
based models, we use different filters to model the complex compositional
features with convolutional and pooling layers, and then utilize long-distance
dependency information with a recurrent layer. Experimental results on five
different datasets show the effectiveness of our proposed model.
Javier de la Rosa, Juan-Luis Suárez
Comments: 66 pages, 11 figures
Journal-ref: Lemir: Revista de Literatura Española Medieval y del
Renacimiento, 20 (2016)
Subjects: Computation and Language (cs.CL)
Summit work of the Spanish Golden Age and forefather of the so-called
picaresque novel, The Life of Lazarillo de Tormes and of His Fortunes and
Adversities still remains an anonymous text. Although distinguished scholars
have tried to attribute it to different authors based on a variety of criteria,
a consensus has yet to be reached. The list of candidates is long and not all
of them enjoy the same support within the scholarly community. Analyzing their
works from a data-driven perspective and applying machine learning techniques
for style and text fingerprinting, we shed light on the authorship of the
Lazarillo. As in a state-of-the-art survey, we discuss the methods used and how
they perform in our specific case. According to our methodology, the most
likely author seems to be Juan Arce de Otálora, closely followed by Alfonso
de Valdés. The method states that no certain attribution can be made with
the given corpus.
Kimmo Kettunen, Tuula Pääkkönen
Comments: 24 pages, 6 tables, 6 figures
Subjects: Computation and Language (cs.CL)
The National Library of Finland has digitized the historical newspapers
published in Finland between 1771 and 1910. This collection contains
approximately 1.95 million pages in Finnish and Swedish. The Finnish part of the
collection consists of about 2.40 billion words. The National Library’s Digital
Collections are offered via the digi.kansalliskirjasto.fi web service, also
known as Digi. Part of the newspaper material (from 1771 to 1874) is also
freely downloadable from The Language Bank of Finland provided by the
FINCLARIN consortium. The collection can also be accessed through the Korp
environment that has been developed by Språkbanken at the University of
Gothenburg and extended by the FINCLARIN team at the University of Helsinki to
provide concordances of text resources. A Cranfield style information retrieval
test collection has also been produced out of a small part of the Digi
newspaper material at the University of Tampere.
Quality of OCRed collections is an important topic in digital humanities, as
it affects general usability and searchability of collections. There is no
single available method to assess quality of large collections, but different
methods can be used to approximate quality. This paper discusses different
corpus analysis style methods to approximate overall lexical quality of the
Finnish part of the Digi collection. Methods include usage of parallel samples
and word error rates, usage of morphological analyzers, frequency analysis of
words and comparisons to comparable edited lexical data. Our aim in the quality
analysis is twofold: firstly to analyze the present state of the lexical data
and secondly, to establish a set of assessment methods that build up a compact
procedure for quality assessment after e.g. new OCRing or post correction of
the material. In the discussion part of the paper we shall synthesize results
of our different analyses.
Shayne Longpre, Sabeek Pradhan, Caiming Xiong, Richard Socher
Subjects: Computation and Language (cs.CL)
LSTMs have become a basic building block for many deep NLP models. In recent
years, many improvements and variations have been proposed for deep sequence
models in general, and LSTMs in particular. We propose and analyze a series of
architectural modifications for LSTM networks resulting in improved performance
for text classification datasets. We observe compounding improvements on
traditional LSTMs using Monte Carlo test-time model averaging, deep vector
averaging (DVA), and residual connections, along with four other suggested
modifications. Our analysis provides a simple, reliable, and high quality
baseline model.
Prof. Roger K. Moore
Comments: To appear in A. McElhone & W. Mansell (Eds.), Living Control Systems IV: Perceptual Control Theory and the Future of the Life and Social Sciences, Benchmark Publications Inc
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Recent years have witnessed increasing interest in the potential benefits of
'intelligent' autonomous machines such as robots. Honda’s Asimo humanoid robot,
iRobot’s Roomba robot vacuum cleaner and Google’s driverless cars have fired
the imagination of the general public, and social media buzz with speculation
about a utopian world of helpful robot assistants or the coming robot
apocalypse! However, there is a long way to go before autonomous systems reach
the level of capabilities required for even the simplest of tasks involving
human-robot interaction – especially if it involves communicative behaviour
such as speech and language. Of course the field of Artificial Intelligence
(AI) has made great strides in these areas, and has moved on from abstract
high-level rule-based paradigms to embodied architectures whose operations are
grounded in real physical environments. What is still missing, however, is an
overarching theory of intelligent communicative behaviour that informs
system-level design decisions in order to provide a more coherent approach to
system integration. This chapter introduces the beginnings of such a framework
inspired by the principles of Perceptual Control Theory (PCT). In particular,
it is observed that PCT has hitherto tended to view perceptual processes as a
relatively straightforward series of transformations from sensation to
perception, and has overlooked the potential of powerful generative model-based
solutions that have emerged in practical fields such as visual or auditory
scene analysis. Starting from first principles, a sequence of arguments is
presented which not only shows how these ideas might be integrated into PCT,
but which also extends PCT towards a remarkably symmetric architecture for a
needs-driven communicative agent. It is concluded that, if behaviour is the
control of perception, then perception is the simulation of behaviour.
Mohit Iyyer, Varun Manjunatha, Anupam Guha, Yogarshi Vyas, Jordan Boyd-Graber, Hal Daumé III, Larry Davis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Visual narrative is often a combination of explicit information and judicious
omissions, relying on the viewer to supply missing details. In comics, most
movements in time and space are hidden in the “gutters” between panels. To
follow the story, readers logically connect panels together by inferring unseen
actions through a process called “closure”. While computers can now describe
the content of natural images, in this paper we examine whether they can
understand the closure-driven narratives conveyed by stylized artwork and
dialogue in comic book panels. We collect a dataset, COMICS, that consists of
over 1.2 million panels (120 GB) paired with automatic textbox transcriptions.
An in-depth analysis of COMICS demonstrates that neither text nor image alone
can tell a comic book story, so a computer must understand both modalities to
keep up with the plot. We introduce three cloze-style tasks that ask models to
predict narrative and character-centric aspects of a panel given n preceding
panels as context. Various deep neural architectures underperform human
baselines on these tasks, suggesting that COMICS contains fundamental
challenges for both vision and language.
Zulqarnain Mehdi, Hani Ragab-Hassen
Comments: The Sixth International Conference on Computer Science, Engineering & Applications (ICCSEA 2016)
Journal-ref: ICCSEA 2016
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Several solutions exist for file storage, sharing, and synchronization. Many
of them involve a central server, or a collection of servers, that either store
the files, or act as a gateway for them to be shared. Some systems take a
decentralized approach, wherein interconnected users form a peer-to-peer (P2P)
network, and partake in the sharing process: they share the files they possess
with others, and can obtain the files owned by other peers. In this paper, we
survey various technologies, both cloud-based and P2P-based, that users use to
synchronize their files across the network, and discuss their strengths and
weaknesses.
Danny Dolev, Meir Spielrien
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
The Reliable Broadcast concept allows an honest party to send a message to
all other parties and to make sure that all honest parties receive this
message. In addition, it allows an honest party that received a message to know
that all other honest parties would also receive the same message. This
technique is important to ensure distributed consistency when facing failures.
In the current paper, we study the ability to use Reliable Broadcast to consistently
transmit a sequence of input values in an asynchronous environment with a
designated sender. The task can be easily achieved using counters, but cannot
be achieved with a bounded memory facing failures. We weaken the problem and
ask whether the receivers can at least share a common suffix. We prove that in
a standard (lossless) asynchronous system no bounded memory protocol can
guarantee a common suffix at all receivers for every input sequence if a single
party might crash.
We further study the problem facing transient faults. When the problem is
limited to transmitting a stream of a single value being sent
repeatedly, we show a bounded memory self-stabilizing protocol that can ensure a
common suffix even in the presence of transient faults and an arbitrary number
of crash faults. We further prove that this last problem is not solvable in the
presence of a single Byzantine fault. Thus, this problem separates
Byzantine behavior from crash faults in an asynchronous environment.
Luiz H. Nunes, Julio C. Estrella, Alexandre C. B. Delbem, Charith Perera, Stephan Reiff-Marganiec
Comments: Proceedings of the 9th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2016) Shanghai, China, December, 2016
Journal-ref: Proceedings of the 9th IEEE/ACM International Conference on
Utility and Cloud Computing (UCC 2016) Shanghai, China, December, 2016
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Over the last few years, the number of smart objects connected to the
Internet has grown exponentially in comparison to the number of services and
applications. The integration between Cloud Computing and Internet of Things,
named as Cloud of Things, plays a key role in managing the connected things,
their data and services. One of the main challenges in Cloud of Things is the
resource discovery of the smart objects and their reuse in different contexts.
Most of the existing work uses some kind of multi-criteria decision analysis
algorithm to perform the resource discovery, but does not evaluate the impact
that the user constraints have on the final solution. In this paper, we analyse
the behaviour of the SAW, TOPSIS and VIKOR multi-objective decision analysis
algorithms and the impact of user constraints on them. We evaluated the quality
of the proposed solutions using the Pareto-optimality concept.
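As a rough illustration of the simplest of the three methods named above, the sketch below implements textbook Simple Additive Weighting (SAW): each criterion is normalized (benefit criteria by the column maximum, cost criteria by the column minimum) and the weighted sum gives the score. The candidate objects, criteria and weights are hypothetical and are not taken from the paper.

```python
def saw_scores(alternatives, weights, is_benefit):
    """Score alternatives with Simple Additive Weighting.

    alternatives: list of tuples, one positive value per criterion.
    weights:      one weight per criterion (assumed to sum to 1).
    is_benefit:   True if larger is better for that criterion, else False.
    """
    n_crit = len(weights)
    columns = [[alt[j] for alt in alternatives] for j in range(n_crit)]
    scores = []
    for alt in alternatives:
        total = 0.0
        for j in range(n_crit):
            if is_benefit[j]:
                norm = alt[j] / max(columns[j])   # best value maps to 1
            else:
                norm = min(columns[j]) / alt[j]   # lowest cost maps to 1
            total += weights[j] * norm
        scores.append(total)
    return scores


# Hypothetical smart objects described by (accuracy, latency in ms).
candidates = [(0.9, 100.0), (0.6, 50.0), (0.8, 200.0)]
scores = saw_scores(candidates, weights=(0.5, 0.5), is_benefit=(True, False))
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

A user constraint (for example, a maximum latency) could be modelled by filtering `candidates` before scoring, which is one way to observe how constraints change the ranking.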
Zheng Sun, Jiaqi Liu, Zewang Zhang, Jingwen Chen, Zhao Huo, Ching Hua Lee, Xiao Zhang
Comments: 5 pages, 5 figures
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
Creating any aesthetically pleasing piece of art, like music, has been a
long-time dream for artificial intelligence research. Based on the recent success of
long short-term memory (LSTM) on sequence learning, we put forward a novel
system to reflect the thinking pattern of a musician. For data representation,
we propose a note-level encoding method, which enables our model to simulate
how humans compose and polish music phrases. To avoid violating music
theory, we introduce a novel method, the grammar argumented (GA) method, which
teaches the machine basic composing principles. In this method, we propose three rules as
argumented grammars and three metrics for evaluation of machine-made music.
Results show that, compared to a basic LSTM, the grammar argumented model’s
compositions have higher contents of diatonic scale notes, short pitch
intervals, and chords.
Hantian Zhang, Kaan Kara, Jerry Li, Dan Alistarh, Ji Liu, Ce Zhang
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
We present ZipML, the first framework for training dense generalized linear
models using end-to-end low-precision representation. In ZipML, all movements
of data, including those for input samples, model, and gradients, are
represented using as little as two bits per component. Within our framework, we
have successfully compressed, separately, the input data by 16x, gradient by
16x, and model by 16x while still getting the same training result. Even for
the most challenging datasets, we find that robust convergence can be ensured
using only an end-to-end 8-bit representation or a 6-bit representation if only
samples are quantized.
Our work builds on previous research on using low-precision representations
for gradient and model in the context of stochastic gradient descent. Our main
technical contribution is a new set of techniques which allow the training
samples to be processed with low precision, without affecting the convergence
of the algorithm. In turn, this leads to a system where all data items move in
a quantized, low precision format. In particular, we first establish that
randomized rounding, while sufficient when quantizing the model and the
gradients, is biased when quantizing samples, and thus leads to a different
training result. We propose two new data representations which converge to the
same solution as in the original data representation both in theory and
empirically and require as little as 2 bits per component. As a result, if the
original data is stored as 32-bit floats, we decrease the bandwidth footprint
for each training iteration by up to 16x. Our results hold for models such as
linear regression and least squares SVM.
ZipML raises interesting theoretical questions related to the robustness of
SGD to approximate data, model, and gradient representations. We conclude this
working paper by a description of ongoing work extending these preliminary
results.
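The randomized rounding mentioned in this abstract can be illustrated with a minimal sketch. It shows only the generic primitive (round up or down with probability proportional to the distance, so the quantized value is unbiased in expectation), assuming inputs already scaled to [0, 1]; it is not the ZipML data representation itself, and the function name and the choice of four levels (two bits) are illustrative.

```python
import random


def stochastic_round(x, levels, rng=random):
    """Quantize x in [0, 1] to one of `levels` evenly spaced values.

    The value is rounded up with probability equal to its fractional
    position between the two neighbouring levels, so E[result] == x.
    """
    scale = levels - 1
    scaled = x * scale
    low = int(scaled)            # lower neighbouring level (x is non-negative)
    frac = scaled - low          # distance to the lower level, in [0, 1)
    step_up = 1 if rng.random() < frac else 0
    return (low + step_up) / scale


rng = random.Random(0)
x = 0.3
samples = [stochastic_round(x, levels=4, rng=rng) for _ in range(100_000)]
print(sum(samples) / len(samples))   # close to 0.3 on average
```

Each individual sample is one of only four values, yet the average recovers `x`. The abstract's point is that this per-value unbiasedness does not carry over to the training result when the samples themselves are quantized.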
Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu
Subjects: Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Deep reinforcement learning agents have achieved state-of-the-art results by
directly maximising cumulative reward. However, environments contain a much
wider variety of possible training signals. In this paper, we introduce an
agent that also maximises many other pseudo-reward functions simultaneously by
reinforcement learning. All of these tasks share a common representation that,
like unsupervised learning, continues to develop in the absence of extrinsic
rewards. We also introduce a novel mechanism for focusing this representation
upon extrinsic rewards, so that learning can rapidly adapt to the most relevant
aspects of the actual task. Our agent significantly outperforms the previous
state-of-the-art on Atari, averaging 880% expert human performance, and a
challenging suite of first-person, three-dimensional Labyrinth tasks
leading to a mean speedup in learning of \(10\times\) and averaging 87% expert
human performance on Labyrinth.
Maria Francesca, Arthur Hughes, David Gregg
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
Previous research has shown that computation of convolution in the frequency
domain provides a significant speedup versus traditional convolution network
implementations. However, this performance increase comes at the expense of
repeatedly computing the transform and its inverse in order to apply other
network operations such as activation, pooling, and dropout. We show,
mathematically, how convolution and activation can both be implemented in the
frequency domain using either the Fourier or Laplace transformation. The main
contributions are a description of spectral activation under the Fourier
transform and a further description of an efficient algorithm for computing
both convolution and activation under the Laplace transform. By computing both
the convolution and activation functions in the frequency domain, we can reduce
the number of transforms required, as well as reducing overall complexity. Our
description of a spectral activation function, together with existing spectral
analogs of other network functions may then be used to compose a fully spectral
implementation of a convolution network.
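The starting point of this abstract, that convolution can be computed in the frequency domain, rests on the convolution theorem: the transform of a circular convolution is the element-wise product of the transforms. The sketch below checks this with a naive discrete Fourier transform in plain Python. It covers only the convolution step; the spectral activation described in the paper is not attempted here, and all function names are illustrative.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def circular_conv_direct(a, b):
    """Circular convolution computed from its definition."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


def circular_conv_spectral(a, b):
    """Circular convolution via element-wise product in the frequency domain."""
    product = [u * v for u, v in zip(dft(a), dft(b))]
    return [z.real for z in dft(product, inverse=True)]


a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 0.0, -1.0, 0.0]
print(circular_conv_direct(a, b))
print(circular_conv_spectral(a, b))   # same values up to rounding error
```

In practice a fast Fourier transform replaces the quadratic `dft` above; the cost the abstract refers to is the repeated forward and inverse transform around every non-convolutional layer.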
Abhay Gupta
Comments: 8 pages, 1 figure
Subjects: Learning (cs.LG)
An important way to make large training sets is to gather noisy labels from
crowds of non-experts. We propose a method to aggregate noisy labels collected
from a crowd of workers or annotators. Eliciting labels is important in tasks
such as judging web search quality and rating products. Our method assumes that
labels are generated by a probability distribution over items and labels. We
formulate the method by drawing parallels between Gaussian Mixture Models
(GMMs) and Restricted Boltzmann Machines (RBMs) and show that the problem of
vote aggregation can be viewed as one of clustering. We use K-RBMs to perform
clustering. We finally show some empirical evaluations over real datasets.
Carter Lassetter, Eduardo Cotilla-Sanchez, Jinsub Kim
Comments: 8 pages, 4 figures
Subjects: Learning (cs.LG); Systems and Control (cs.SY)
This paper introduces a robust learning scheme that can dynamically predict
the stability of the reconnection of sub-networks to a main grid. As the future
electrical power systems tend towards smarter and greener technology, the
deployment of self sufficient networks, or microgrids, becomes more likely.
Microgrids may operate on their own or synchronized with the main grid, thus
control methods need to take into account islanding and reconnecting said
networks. The ability to optimally and safely reconnect a portion of the grid
is not well understood and, as of now, limited to raw synchronization between
interconnection points. A support vector machine (SVM) leveraging real-time
data from phasor measurement units (PMUs) is proposed to predict in real time
whether the reconnection of a sub-network to the main grid would lead to
stability or instability. A dynamics simulator fed with pre-acquired system
parameters is used to create training data for the SVM in various operating
states. The classifier was tested on a variety of cases and operating points to
ensure diversity. Accuracies of approximately 90% were observed throughout most
conditions when making dynamic predictions of a given network.
Jan Yperman, Thijs Becker
Subjects: Learning (cs.LG)
We describe a method for searching for the optimal hyper-parameters in reservoir
computing, which consists of a Gaussian process with Bayesian optimization. It
provides an alternative to other frequently used optimization methods such as
grid, random, or manual search. In addition to a set of optimal
hyper-parameters, the method also provides a probability distribution of the
cost function as a function of the hyper-parameters. We apply this method to
two types of reservoirs: nonlinear delay nodes and echo state networks. It
shows excellent performance on all considered benchmarks, either matching or
significantly surpassing expert human optimization. We find that some values
for hyper-parameters that have become standard in the research community are
in fact suboptimal for most of the problems we considered. In general, the
algorithm achieves optimal results in fewer iterations when compared to other
optimization methods, and scales well with increasing dimensionality of the
hyper-parameter space. Due to its automated nature, this method significantly
reduces the need for expert knowledge when optimizing the hyper-parameters in
reservoir computing. Existing software libraries for Bayesian optimization make
the implementation of the algorithm straightforward.
Hilmi E. Egilmez, Eduardo Pavez, Antonio Ortega
Comments: This paper has been submitted to IEEE Trans. on Selected Topics in Signal Processing
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
Graphs are fundamental mathematical structures used in various fields to
represent data, signals and processes. In this paper, we propose a novel
framework for learning/estimating graphs from data. The proposed framework
includes (i) formulation of various graph learning problems, (ii) their
probabilistic interpretations and (iii) efficient algorithms to solve them. We
specifically focus on graph learning problems where the goal is to estimate a
graph Laplacian matrix from some observed data under given structural
constraints (e.g., graph connectivity and sparsity). Our experimental results
demonstrate that the proposed algorithms outperform the current
state-of-the-art methods in terms of graph learning performance.
Alireza Aghasi, Nam Nguyen, Justin Romberg
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
Model reduction is a highly desirable process for deep neural networks. While
large networks are theoretically capable of learning arbitrarily complex
models, overfitting and model redundancy negatively affect the prediction
accuracy and model variance. Net-Trim is a layer-wise convex framework to prune
(sparsify) deep neural networks. The method is applicable to neural networks
operating with the rectified linear unit (ReLU) as the nonlinear activation.
The basic idea is to retrain the network layer by layer keeping the layer
inputs and outputs close to the originally trained model, while seeking a
sparse transform matrix. We present both the parallel and cascade versions of
the algorithm. While the former enjoys computational distributability, the
latter is capable of achieving simpler models. In both cases, we mathematically
show a consistency between the retrained model and the initial trained network.
We also derive the general sufficient conditions for the recovery of a sparse
transform matrix. In the case of standard Gaussian training samples of
dimension \(N\) being fed to a layer, and \(s\) being the maximum number of nonzero
terms across all columns of the transform matrix, we show that
\(\mathcal{O}(s \log N)\) samples are enough to accurately learn the layer model.
Ahmed M. Alaa, Jinsung Yoon, Scott Hu, Mihaela van der Schaar
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
Critically ill patients in regular wards are vulnerable to unanticipated
clinical deterioration which requires timely transfer to the intensive care
unit (ICU). To allow for risk scoring and patient monitoring in such a setting,
we develop a novel Semi-Markov Switching Linear Gaussian Model (SSLGM) for the
inpatients’ physiology. The model captures the patients’ latent clinical
states and their corresponding observable lab tests and vital signs. We present
an efficient unsupervised learning algorithm that capitalizes on the
informatively censored data in the electronic health records (EHR) to learn the
parameters of the SSLGM; the learned model is then used to assess the new
inpatients’ risk for clinical deterioration in an online fashion, allowing for
timely ICU admission. Experiments conducted on a heterogeneous cohort of
6,094 patients admitted to a large academic medical center show that the
proposed model significantly outperforms the currently deployed risk scores
such as Rothman index, MEWS, SOFA and APACHE.
Shuangfei Zhai, Hui Wu, Abhishek Kumar, Yu Cheng, Yongxi Lu, Zhongfei Zhang, Rogerio Feris
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Feature pooling layers (e.g., max pooling) in convolutional neural networks
(CNNs) serve the dual purpose of providing increasingly abstract
representations as well as yielding computational savings in subsequent
convolutional layers. We view the pooling operation in CNNs as a two-step
procedure: first, a pooling window (e.g., \(2 \times 2\)) slides over the feature
map with stride one which leaves the spatial resolution intact, and second,
downsampling is performed by selecting one pixel from each non-overlapping
pooling window in an often uniform and deterministic (e.g., top-left) manner.
Our starting point in this work is the observation that this regularly spaced
downsampling arising from non-overlapping windows, although intuitive from a
signal processing perspective (which has the goal of signal reconstruction), is
not necessarily optimal for learning (where the goal is to generalize).
We study this aspect and propose a novel pooling strategy with stochastic
spatial sampling (S3Pool), where the regular downsampling is replaced by a more
general stochastic version. We observe that this general stochasticity acts as
a strong regularizer, and can also be seen as doing implicit data augmentation
by introducing distortions in the feature maps. We further introduce a
mechanism to control the amount of distortion to suit different datasets and
architectures. To demonstrate the effectiveness of the proposed approach, we
perform extensive experiments on several popular image classification
benchmarks, observing excellent improvements over baseline models. Experimental
code is available at this https URL
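The two-step view of pooling described in this abstract can be sketched in plain Python. Step one is a stride-one max pooling that keeps the spatial size (windows are simply clipped at the border, which is an assumption of this sketch); step two picks one row and one column per group of `k`, either deterministically (top-left) or uniformly at random. The random variant is only a simplified stand-in for the stochastic spatial sampling idea, not the authors' implementation.

```python
import random


def max_pool_stride1(fmap, k=2):
    """Max over a k x k window anchored at each pixel; output has the input's size."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[r][c]
                 for r in range(i, min(i + k, h))
                 for c in range(j, min(j + k, w)))
             for j in range(w)]
            for i in range(h)]


def downsample(fmap, k=2, rng=None):
    """Keep one row and one column from every group of k.

    rng=None picks the first index of each group (top-left, deterministic);
    otherwise an index is drawn uniformly at random within each group.
    """
    h, w = len(fmap), len(fmap[0])

    def pick(start, limit):
        end = min(start + k, limit)
        return start if rng is None else rng.randrange(start, end)

    rows = [pick(s, h) for s in range(0, h, k)]
    cols = [pick(s, w) for s in range(0, w, k)]
    return [[fmap[r][c] for c in cols] for r in rows]


feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
dense = max_pool_stride1(feature_map)
print(downsample(dense))                          # ordinary 2x2 max pooling
print(downsample(dense, rng=random.Random(0)))    # randomly sampled variant
```

With the deterministic choice the two steps reproduce ordinary non-overlapping max pooling, which is the decomposition the abstract starts from.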
Cheng Tang, Claire Monteleoni
Comments: arXiv admin note: substantial text overlap with arXiv:1610.04900
Subjects: Learning (cs.LG)
We analyze online \cite{BottouBengio} and mini-batch \cite{Sculley} \(k\)-means
variants. Both scale up the widely used \(k\)-means algorithm via stochastic
approximation, and have become popular for large-scale clustering and
unsupervised feature learning. We show, for the first time, that starting with
any initial solution, they converge to a “local optimum” at rate
\(O(\frac{1}{t})\) (in terms of the \(k\)-means objective) under general
conditions. In addition, we show if the dataset is clusterable, when
initialized with a simple and scalable seeding algorithm, mini-batch \(k\)-means
converges to an optimal \(k\)-means solution at rate \(O(\frac{1}{t})\) with high
probability. The \(k\)-means objective is non-convex and non-differentiable: we
exploit ideas from recent work on stochastic gradient descent for non-convex
problems \cite{ge:sgd_tensor, balsubramani13} by providing a novel
characterization of the trajectory of the \(k\)-means algorithm on its solution
space, and circumvent the non-differentiability problem via geometric insights
about the \(k\)-means update.
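For readers unfamiliar with the mini-batch variant analyzed here, the following is a minimal one-dimensional sketch of the commonly described update: sample a batch, assign each point to its nearest center, then move that center toward the point with a per-center step size of one over the number of points it has absorbed so far. The step-size schedule, the toy data and the function name are assumptions of this sketch, and nothing here reproduces the paper's convergence analysis.

```python
import random


def minibatch_kmeans_1d(data, centers, batch_size, steps, rng):
    """Run mini-batch k-means on scalar data and return the final centers."""
    centers = list(centers)
    counts = [0] * len(centers)
    for _ in range(steps):
        batch = [rng.choice(data) for _ in range(batch_size)]
        # Assign every batch point using the centers as they are at the
        # start of the step, before any of them is moved.
        assigned = [min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
                    for x in batch]
        for x, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]              # per-center step size
            centers[j] += eta * (x - centers[j])
    return centers


rng = random.Random(0)
# Toy data: two well-separated clusters around 0 and 10.
data = ([rng.gauss(0.0, 0.5) for _ in range(200)]
        + [rng.gauss(10.0, 0.5) for _ in range(200)])
print(minibatch_kmeans_1d(data, centers=[1.0, 9.0], batch_size=10, steps=100, rng=rng))
```

With this step size each center is simply the running mean of the points assigned to it, which is why the update can be read as a stochastic approximation of the usual \(k\)-means centroid step.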
Vikash Kumar, Abhishek Gupta, Emanuel Todorov, Sergey Levine
Comments: Initial draft for a journal submission
Subjects: Learning (cs.LG); Robotics (cs.RO); Systems and Control (cs.SY)
We explore learning-based approaches for feedback control of a dexterous
five-finger hand performing non-prehensile manipulation. First, we learn local
controllers that are able to perform the task starting at a predefined initial
state. These controllers are constructed using trajectory optimization with
respect to locally-linear time-varying models learned directly from sensor
data. In some cases, we initialize the optimizer with human demonstrations
collected via teleoperation in a virtual environment. We demonstrate that such
controllers can perform the task robustly, both in simulation and on the
physical platform, for a limited range of initial conditions around the trained
starting state. We then consider two interpolation methods for generalizing to
a wider range of initial conditions: deep learning, and nearest neighbors. We
find that nearest neighbors achieve higher performance. Nevertheless, the
neural network has its advantages: it uses only tactile and proprioceptive
feedback but no visual feedback about the object (i.e. it performs the task
blind) and learns a time-invariant policy. In contrast, the nearest neighbors
method switches between time-varying local controllers based on the proximity
of initial object states sensed via motion capture. While both generalization
methods leave room for improvement, our work shows that (i) local
trajectory-based controllers for complex non-prehensile manipulation tasks can
be constructed from surprisingly small amounts of training data, and (ii)
collections of such controllers can be interpolated to form more global
controllers. Results are summarized in the supplementary video:
this https URL
Yongxi Lu, Abhishek Kumar, Shuangfei Zhai, Yu Cheng, Tara Javidi, Rogerio Feris
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)
Multi-task learning aims to improve generalization performance of multiple
prediction tasks by appropriately sharing relevant information across them. In
the context of deep neural networks, this idea is often realized by
hand-designed network architectures with layers that are shared across tasks
and branches that encode task-specific features. However, the space of possible
multi-task deep architectures is combinatorially large and often the final
architecture is arrived at by manual exploration of this space subject to
designer’s bias, which can be both error-prone and tedious. In this work, we
propose a principled approach for designing compact multi-task deep learning
architectures. Our approach starts with a thin network and dynamically widens
it in a greedy manner during training using a novel criterion that promotes
grouping of similar tasks together. Our extensive evaluation on person
attributes classification tasks involving facial and clothing attributes
suggests that the models produced by the proposed method are fast, compact and
can closely match or exceed the state-of-the-art accuracy from strong baselines
by much more expensive models.
Cheng Li, Jiaqi Ma, Xiaoxiao Guo, Qiaozhu Mei
Subjects: Social and Information Networks (cs.SI); Learning (cs.LG)
Information cascades, effectively facilitated by most social network
platforms, are recognized as a major factor in almost every social success and
disaster in these networks. Can cascades be predicted? While many believe that
they are inherently unpredictable, recent work has shown that some key
properties of information cascades, such as size, growth, and shape, can be
predicted by a machine learning algorithm that combines many features. These
predictors all depend on a bag of hand-crafted features to represent the
cascade network and the global network structure. Such features, always
carefully and sometimes mysteriously designed, are not easy to extend or to
generalize to a different platform or domain.
Inspired by the recent successes of deep learning in multiple data mining
tasks, we investigate whether an end-to-end deep learning approach could
effectively predict the future size of cascades. Such a method automatically
learns the representation of individual cascade graphs in the context of the
global network structure, without hand-crafted features and heuristics. We find
that node embeddings fall short of predictive power, and it is critical to
learn the representation of a cascade graph as a whole. We present algorithms
that learn the representation of cascade graphs in an end-to-end manner, which
significantly improve the performance of cascade prediction over strong
baselines that include feature based methods, node embedding methods, and graph
kernel methods. Our results also provide interesting implications for cascade
prediction in general.
Anthony D. Rhodes, Max H. Quinn, Melanie Mitchell
Comments: arXiv admin note: text overlap with arXiv:1607.00548
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)
A major goal of computer vision is to enable computers to interpret visual
situations—abstract concepts (e.g., “a person walking a dog,” “a crowd
waiting for a bus,” “a picnic”) whose image instantiations are linked more by
their common spatial and semantic structure than by low-level visual
similarity. In this paper, we propose a novel method for prior learning and
active object localization for this kind of knowledge-driven search in static
images. In our system, prior situation knowledge is captured by a set of
flexible, kernel-based density estimations—a situation model—that represent
the expected spatial structure of the given situation. These estimations are
efficiently updated by information gained as the system searches for relevant
objects, allowing the system to use context as it is discovered to narrow the
search.
More specifically, at any given time in a run on a test image, our system
uses image features plus contextual information it has discovered to identify a
small subset of training images—an importance cluster—that is deemed most
similar to the given test image, given the context. This subset is used to
generate an updated situation model in an on-line fashion, using an efficient
multipole expansion technique.
As a proof of concept, we apply our algorithm to a highly varied and
challenging dataset consisting of instances of a “dog-walking” situation. Our
results support the hypothesis that dynamically-rendered, context-based
probability models can support efficient object localization in visual
situations. Moreover, our approach is general enough to be applied to diverse
machine learning paradigms requiring interpretable, probabilistic
representations generated from partially observed data.
Siddharth Agrawal, Ambedkar Dukkipati
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)
Variational autoencoders (VAEs), which are built upon deep neural networks,
have emerged as popular generative models in computer vision. Most of the work
towards improving variational autoencoders has focused mainly on making the
approximations to the posterior flexible and accurate, leading to tremendous
progress. However, there have been limited efforts to replace pixel-wise
reconstruction, which has known shortcomings. In this work, we use real-valued
non-volume preserving transformations (real NVP) to exactly compute the
conditional likelihood of the data given the latent distribution. We show that
a simple VAE with this form of reconstruction is competitive with complicated
VAE structures on image modeling tasks. As part of our model, we develop
powerful conditional coupling layers that enable real NVP to learn with fewer
intermediate layers.
Eric Hunsberger, Chris Eliasmith
Comments: 10 pages, 3 figures, 4 tables; the “methods” section of this article draws heavily on arXiv:1510.08829
Subjects: Neural and Evolutionary Computing (cs.NE); Learning (cs.LG)
We describe a method to train spiking deep networks that can be run using
leaky integrate-and-fire (LIF) neurons, achieving state-of-the-art results for
spiking LIF networks on five datasets, including the large ImageNet ILSVRC-2012
benchmark. Our method for transforming deep artificial neural networks into
spiking networks is scalable and works with a wide range of neural
nonlinearities. We achieve these results by softening the neural response
function, such that its derivative remains bounded, and by training the network
with noise to provide robustness against the variability introduced by spikes.
Our analysis shows that implementations of these networks on neuromorphic
hardware will be many times more power-efficient than the equivalent
non-spiking networks on traditional hardware.
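The idea of softening the LIF response so that its derivative stays bounded can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the standard steady-state LIF rate formula and replaces the hard threshold with a softplus of width `gamma`. The function names and all parameter values (`tau_ref`, `tau_rc`, `v_th`, `gamma`) are illustrative assumptions and should be checked against the paper.

```python
import math

def lif_rate(j, tau_ref=0.002, tau_rc=0.02, v_th=1.0):
    """Steady-state LIF firing rate for input current j (hard threshold)."""
    x = j - v_th
    if x <= 0.0:
        return 0.0  # below threshold the neuron never fires
    return 1.0 / (tau_ref + tau_rc * math.log1p(v_th / x))

def softplus(x, gamma):
    """Smooth approximation of max(x, 0); gamma sets the amount of smoothing."""
    z = x / gamma
    if z > 30.0:
        return x  # exp(z) dominates, so softplus(x) is numerically x
    return gamma * math.log1p(math.exp(z))

def soft_lif_rate(j, gamma=0.02, tau_ref=0.002, tau_rc=0.02, v_th=1.0):
    """LIF rate with max(j - v_th, 0) replaced by a softplus (assumed form)."""
    x = softplus(j - v_th, gamma)
    if x <= 0.0:
        return 0.0  # only reached when exp() underflows far below threshold
    return 1.0 / (tau_ref + tau_rc * math.log1p(v_th / x))

if __name__ == "__main__":
    for j in (0.5, 0.99, 1.0, 1.01, 2.0):
        print(f"j={j:5.2f}  hard={lif_rate(j):8.3f}  soft={soft_lif_rate(j):8.3f}")
```

Under these assumptions the soft version agrees with the hard LIF rate well above threshold, while near threshold it rises smoothly instead of with an unbounded slope, which is what makes gradient-based training practical.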
Mahtab J. Fard, Sattar Ameri, Ratna B. Chinnam, Abhilash K. Pandya, Michael D. Klein, R. Darin Ellis
Journal-ref: Lecture Notes in Engineering and Computer Science: Proceedings of
The World Congress on Engineering and Computer Science 2016, 19-21 October,
2016, San Francisco, USA
Subjects: Artificial Intelligence (cs.AI); Learning (cs.LG); Machine Learning (stat.ML)
Evaluating surgeon skill has predominantly been a subjective task.
Development of objective methods for surgical skill assessment is of increasing
interest. Recently, with technological advances such as robotic-assisted
minimally invasive surgery (RMIS), new opportunities for objective and
automated assessment frameworks have arisen. In this paper, we applied machine
learning methods to automatically evaluate performance of the surgeon in RMIS.
Six important movement features were used in the evaluation including
completion time, path length, depth perception, speed, smoothness and
curvature. Different classification methods were applied to discriminate between
expert and novice surgeons. We test our method on real surgical data for a
suturing task and
compare the classification result with the ground truth data (obtained by
manual labeling). The experimental results show that the proposed framework can
classify surgical skill level with relatively high accuracy of 85.7%. This
study demonstrates the ability of machine learning methods to automatically
classify expert and novice surgeons using movement features for different RMIS
tasks. Due to the simplicity and generalizability of the introduced
classification method, it is easy to implement in existing trainers.
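Three of the movement features named above (completion time, path length, and speed) can be computed from a sampled tool trajectory as in the sketch below. The abstract does not give the exact feature definitions, so the formulas, function names, and sample data here are assumptions chosen for illustration.

```python
import math

def completion_time(timestamps):
    """Elapsed time between the first and last sample."""
    return timestamps[-1] - timestamps[0]

def path_length(points):
    """Sum of Euclidean distances between consecutive 3D tool positions."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def mean_speed(points, timestamps):
    """Average speed, taken here as path length divided by completion time."""
    return path_length(points) / completion_time(timestamps)

if __name__ == "__main__":
    # Hypothetical trajectory: positions in arbitrary units, time in seconds.
    points = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
    timestamps = [0.0, 1.0, 2.0]
    features = [
        completion_time(timestamps),
        path_length(points),
        mean_speed(points, timestamps),
    ]
    print(features)
```

A feature vector of this kind, extended with the remaining features, is what a classifier would receive as input for each trial.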
Ning Ge, Marc Panten, Xavier Crégut
Journal-ref: International Conference on Embedded Real Time Software and
Systems (ERTS 2014)
Subjects: Software Engineering (cs.SE); Learning (cs.LG)
Automated fault localization is an important issue in model validation and
verification. It helps the end users in analyzing the origin of failure. In
this work, we present early experiments with probabilistic analysis approaches
in fault localization. Inspired by the Kullback-Leibler Divergence from
Bayesian probabilistic theory, we propose a suspiciousness factor to compute
the fault contribution for the transitions in the reachability graph of model
checking, which is then used to rank the potentially faulty transitions. To
automatically locate design faults in the simulation model of detailed design,
we propose to use a Hidden Markov Model (HMM), a statistical model that
provides statistically identical information to the component’s real behavior. The
core of this method is a fault localization algorithm that outputs a ranked set
of suspicious components and a backward algorithm that computes the
matching degree between the HMM and the simulation model to evaluate the
confidence degree of the localization conclusion.
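For readers unfamiliar with the quantity that inspires the suspiciousness factor, the Kullback-Leibler divergence between two discrete distributions can be computed as below. This sketch shows only the standard divergence; the paper's actual suspiciousness factor is not specified in the abstract and is not reproduced here. The function name and example distributions are illustrative.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions given as lists."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

if __name__ == "__main__":
    observed = [0.5, 0.5]   # hypothetical distribution over two outcomes
    expected = [0.9, 0.1]   # hypothetical reference distribution
    print(kl_divergence(observed, expected))
    print(kl_divergence(expected, observed))  # the divergence is not symmetric
```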
Zhiguo Ding, Zhongyuan Zhao, Mugen Peng, H. Vincent Poor
Subjects: Information Theory (cs.IT)
This paper considers the application of non-orthogonal multiple access (NOMA)
to a multi-user network with mixed multicasting and unicasting traffic. The
proposed design of beamforming and power allocation ensures that the unicasting
performance is improved while maintaining the reception reliability of
multicasting. Both analytical and simulation results are provided to
demonstrate that the use of the NOMA assisted multicast-unicast scheme yields a
significant improvement in spectral efficiency compared to orthogonal multiple
access (OMA) schemes which realize multicasting and unicasting services
separately. Since unicasting messages are broadcast to all the users, how the
use of NOMA can prevent those multicasting receivers from intercepting the
unicasting messages is also investigated, where it is shown that the secrecy
unicasting rate achieved by NOMA is always larger than or equal to that of OMA.
This security gain is mainly due to the fact that the multicasting messages can
be used as jamming signals to prevent potential eavesdropping when the
multicasting and unicasting messages are superimposed together following the
NOMA principle.
Hong Xing, Xin Kang, Kai-Kit Wong, Arumugam Nallanathan
Comments: 30 pages, 7 figures, submitted for possible journal publication
Subjects: Information Theory (cs.IT)
With the recent advances in radio frequency (RF) energy harvesting (EH)
technologies, wireless-powered cooperative cognitive radio networks (CCRNs) have
drawn an upsurge of interest for improving spectrum utilization, with an
incentive to motivate joint information and energy cooperation between the
primary and secondary systems. Dedicated energy beamforming (EB) aims to
remedy the low efficiency of wireless power transfer (WPT), which
nevertheless gives rise to out-of-band EH phases and thus low cooperation
efficiency.
To address this issue, in this paper, we consider a novel RF EH CCRN aided by
full-duplex (FD)-enabled energy access points (EAPs) that can cooperate to
wirelessly charge the secondary transmitter (ST) while concurrently receiving
the primary transmitter (PT)’s signal in the first transmission phase, and to
perform decode-and-forward (DF) relaying in the second transmission phase. We
investigate a weighted sum-rate maximization problem subject to the
transmitting power constraints as well as a total cost constraint using
successive convex approximation (SCA) techniques. A zero-forcing (ZF) based
suboptimal scheme that is locally optimal at the EAPs is also derived. Various
tradeoffs between the weighted sum-rate and other system parameters are
provided in numerical results to corroborate the effectiveness of the proposed
solutions against the benchmark schemes.
Victor Quintero, Samir M. Perlaza, Iñaki Esnaola, Jean-Marie Gorce
Comments: This work was submitted to the IEEE Transactions on Information Theory on November 10, 2016. Part of this work was presented at the IEEE International Workshop on Information Theory (ITW), Cambridge, United Kingdom, September, 2016 (arXiv:1603.07554), and IEEE International Workshop on Information Theory (ITW), Jeju Island, Korea, October, 2015 (arXiv:1502.04649). Parts of this work appear in INRIA Research Reports 0456 (arXiv:1608.08920) and 8861 (arXiv:1608.08907)
Subjects: Information Theory (cs.IT)
In this paper, the capacity region of the linear deterministic interference
channel with noisy channel-output feedback (LD-IC-NOF) is fully characterized.
A capacity-achieving scheme is obtained using a random coding argument and
three well-known techniques: rate splitting, superposition coding and backward
decoding. The converse region is obtained using some of the existing outer
bounds as well as a set of new outer bounds that are obtained by using
genie-aided models of the original LD-IC-NOF. Using the insights gained from
the analysis of the LD-IC-NOF, an achievability region and a converse region
for the two-user Gaussian interference channel with noisy channel-output
feedback (G-IC-NOF) are presented. Finally, the achievability region and the
converse region approximate the capacity region of the G-IC-NOF to within 4.4
bits.
Arun Venkitaraman, Saikat Chatterjee, Peter Händel
Comments: Submitted to IEEE JSTSP
Subjects: Information Theory (cs.IT); Social and Information Networks (cs.SI)
We propose Hilbert transform (HT) and analytic signal (AS) construction for
signals over graphs. This is motivated by the popularity of HT, AS, and
modulation analysis in conventional signal processing, and the observation that
complementary insight is often obtained by viewing conventional signals in the
graph setting. Our definitions of HT and AS use a conjugate-symmetry-like
property exhibited by the graph Fourier transform (GFT). We show that a real
graph signal (GS) can be represented using a smaller number of GFT coefficients
than the signal length. We show that the graph HT (GHT) and graph AS (GAS)
operations are linear and shift-invariant over graphs. Using the GAS, we define
the amplitude, phase, and frequency modulations for a graph signal (GS).
Further, we use convex optimization to develop an alternative definition of
envelope for a GS. We illustrate the proposed concepts by showing applications
to synthesized and real-world signals. For example, we show that the GHT is
suitable for anomaly detection/analysis over networks and that GAS reveals
complementary information in speech signals.
Vaneet Aggarwal, Mark R. Bell, Anis Elgabli, Xiaodong Wang, Shan Zhong
Comments: arXiv admin note: text overlap with arXiv:1502.04391 by other authors
Subjects: Information Theory (cs.IT)
In this paper, we consider the energy-bandwidth allocation for a network of
multiple users, where the transmitters, each powered by both an energy harvester
and the conventional grid, access the network orthogonally on the assigned
frequency band. We assume that the energy harvesting state and channel gain of
each transmitter can be predicted for $K$ time slots a priori. The different
transmitters can cooperate by donating energy to each other. The tradeoff among
the weighted sum throughput, the use of grid energy, and the amount of energy
cooperation is studied through an optimization objective which is a linear
combination of these quantities. This leads to an optimization problem with
$O(N^2 K)$ constraints, where $N$ is the total number of transmitter-receiver
pairs, and the optimization is over seven sets of variables that denote energy
and bandwidth allocation, grid energy utilization, and energy cooperation. To
solve the problem efficiently, an iterative algorithm is proposed using the
Proximal Jacobian ADMM. The optimization sub-problems corresponding to Proximal
Jacobian ADMM steps are solved in closed form. We show that this algorithm
converges to the optimal solution with an overall complexity of $O(N^2 K^2)$.
Numerical results show that the proposed algorithms can make efficient use of
the harvested energy, grid energy, energy cooperation, and the available
bandwidth.
Juan M. Romero-Jerez, F. Javier Lopez-Martinez, José F. Paris, Andrea J. Goldsmith
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Information Theory (cs.IT)
We introduce the Fluctuating Two-Ray (FTR) fading model, a new statistical
channel model that consists of two fluctuating specular components with random
phases plus a diffuse component. The FTR model arises as the natural
generalization of the two-wave with diffuse power (TWDP) fading model; this
generalization allows its two specular components to exhibit a random amplitude
fluctuation. Unlike the TWDP model, all the chief probability functions of the
FTR fading model (PDF, CDF and MGF) are expressed in closed form, having a
functional form similar to other state-of-the-art fading models. We also
provide approximate closed-form expressions for the PDF and CDF in terms of a
finite number of elementary functions, which allow for a simple evaluation of
these statistics to an arbitrary level of precision. We show that the FTR
fading model provides a much better fit than Rician fading for recent
small-scale fading measurements in 28 GHz outdoor millimeter-wave channels.
Finally, the performance of wireless communication systems over FTR fading is
evaluated in terms of the bit error rate and the outage capacity, and the
interplay between the FTR fading model parameters and the system performance is
discussed. Monte Carlo simulations have been carried out in order to validate
the obtained theoretical expressions.
Chong Huang, Lalitha Sankar
Subjects: Computer Science and Game Theory (cs.GT); Information Theory (cs.IT); Social and Information Networks (cs.SI)
The emerging marketplace for online free services in which service providers
earn revenue from using consumer data in direct and indirect ways has led to
significant privacy concerns. This raises the following
question: can the marketplace sustain multiple service providers (SPs) that
offer privacy-differentiated free services? This paper studies this problem of
market segmentation for the free online services market by augmenting the
classical Hotelling model for market segmentation analysis to include the fact
that for the free services market, a consumer values service not in monetized
terms but by its quality of service (QoS) and that the differentiator of
services is not product price but the privacy risk advertised by an SP. Building
upon the Hotelling model, this paper presents a parametrized model for SP
profit and consumer valuation of service for both the two- and multi-SP
problems to show that: (i) when consumers place a high value on privacy, it
leads to a lower use of private data by SPs (i.e., their advertised privacy
risk reduces), and thus, SPs compete on the QoS; (ii) SPs that are capable of
differentiating on services that do not directly use consumer data (untargeted
services) gain larger market share; and (iii) a higher valuation of privacy by
consumers forces SPs with smaller untargeted revenue to offer lower privacy
risk to attract more consumers. The work also illustrates the market
segmentation problem for more than two SPs and highlights the instability of
such markets.
Anantha K. Karthik, Rick S. Blum
Comments: 30 pages, 4 figures, Journal paper
Subjects: Applications (stat.AP); Information Theory (cs.IT)
This paper addresses the problem of robust clock phase offset estimation for
the IEEE 1588 precision time protocol (PTP) in the presence of delay attacks.
Delay attacks are one of the most effective cyber attacks in PTP, as they
cannot be mitigated using typical security measures. In this paper, we consider
the case where the slave node can exchange synchronization messages with
multiple master nodes synchronized to the same clock. We first provide lower
bounds on the best achievable performance for any phase offset estimation
scheme in the presence of delay attacks. We then present a novel phase offset
estimation scheme that employs the Expectation-Maximization algorithm for
detecting which of the master-slave communication links have been subject to
delay attacks. After discarding information from the links identified as
attacked, which we show to be optimal, the optimal vector location parameter
estimator is employed to estimate the phase offset of the slave node.
Simulation results are presented to show that the proposed phase offset
estimation scheme exhibits performance close to the lower bounds in a wide
variety of scenarios.
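To see why delay attacks bias the phase offset, consider the standard two-way timestamp exchange used in IEEE 1588, sketched below. The code covers only the textbook single-master offset computation under a symmetric-delay assumption; it does not implement the paper's Expectation-Maximization scheme or its multi-master estimator. The helper `simulate_exchange` and all numeric values are hypothetical.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Textbook PTP estimate from one two-way exchange.

    t1: master send time (master clock), t2: slave receive time (slave clock),
    t3: slave send time (slave clock),   t4: master receive time (master clock).
    Assumes the forward and reverse path delays are equal.
    """
    forward = t2 - t1    # path delay + offset
    backward = t4 - t3   # path delay - offset
    offset = (forward - backward) / 2.0
    delay = (forward + backward) / 2.0
    return offset, delay

def simulate_exchange(true_offset, path_delay, attack_delay=0.0, turnaround=1.0):
    """Hypothetical helper: generate the four timestamps of one exchange.

    attack_delay is extra delay injected on the master-to-slave path only.
    """
    t1 = 0.0
    t2 = t1 + path_delay + attack_delay + true_offset
    t3 = t2 + turnaround
    t4 = t3 - true_offset + path_delay
    return t1, t2, t3, t4

if __name__ == "__main__":
    clean = ptp_offset_and_delay(*simulate_exchange(5.0, 10.0))
    attacked = ptp_offset_and_delay(*simulate_exchange(5.0, 10.0, attack_delay=4.0))
    print("clean estimate:   ", clean)
    print("attacked estimate:", attacked)
```

In this model an asymmetric injected delay shifts the offset estimate by half its size and leaves no trace within a single link, which is why comparing links to several masters, as the abstract describes, is needed to identify the attacked ones.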
Hugo Gabriel Eyherabide
Comments: 14 Pages, 9 Figures, 1 Table
Subjects: Neurons and Cognition (q-bio.NC); Information Theory (cs.IT); Quantitative Methods (q-bio.QM); Applications (stat.AP)
Identifying informative aspects of brain activity has traditionally been
thought to provide insight into how brains may perform optimal computations.
However, here we show that this need not be the case when studying spike-time
precision or response discrimination, among other activity aspects beyond noise
correlations. Our results show that decoders designed with noisy data may
perform optimally on quality data, thereby potentially yielding experimental
and computational savings.