
    arXiv Paper Daily: Mon, 21 Nov 2016

    Published by 我爱机器学习 (52ml.net) on 2016-11-21 00:00:00

    Neural and Evolutionary Computing

    NoiseOut: A Simple Way to Prune Neural Networks

    Mohammad Babaeizadeh, Paris Smaragdis, Roy H. Campbell
    Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)

    Neural networks are usually over-parameterized with significant redundancy in
    the number of required neurons which results in unnecessary computation and
    memory usage at inference time. One common approach to address this issue is to
    prune these big networks by removing extra neurons and parameters while
    maintaining the accuracy. In this paper, we propose NoiseOut, a fully automated
    pruning algorithm based on the correlation between activations of neurons in
    the hidden layers. We prove that adding additional output neurons with entirely
    random targets results in a higher correlation between neurons, which makes
    pruning by NoiseOut even more efficient. Finally, we test our method on various
    networks and datasets. These experiments exhibit high pruning rates while
    maintaining the accuracy of the original network.
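
    As a rough illustration of correlation-driven pruning of the kind NoiseOut performs, the sketch below merges the two most correlated hidden units of a single fully connected layer. The weight-folding rule and function names are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def prune_most_correlated(acts, W_in, b_in, W_out):
    """Merge the pair of hidden units whose activations are most correlated.

    acts  : (num_samples, num_hidden) hidden activations on a batch of data
    W_in  : (num_inputs, num_hidden) incoming weights, b_in: (num_hidden,) biases
    W_out : (num_hidden, num_outputs) outgoing weights
    Returns copies of the weight matrices with one hidden unit removed.
    """
    corr = np.corrcoef(acts, rowvar=False)   # pairwise activation correlations
    np.fill_diagonal(corr, -np.inf)          # ignore self-correlation
    i, j = np.unravel_index(np.argmax(corr), corr.shape)
    keep, drop = min(i, j), max(i, j)
    W_out = W_out.copy()
    W_out[keep] += W_out[drop]               # fold the dropped unit's outgoing weights
    mask = np.arange(W_in.shape[1]) != drop
    return W_in[:, mask], b_in[mask], W_out[mask]
```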

    Swarm Intelligence for Multiobjective Optimization of Extraction Process

    T. Ganesan, I. Elamvazuthi, P. Vasant
    Journal-ref: Handbook of Research on Modern Optimization Algorithms and
    Applications in Engineering and Economics, (2016), IGI Global, pp 516 – 544
    Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)

    Multi-objective (MO) optimization is an emerging field that is increasingly
    being applied in many industries globally. In this work, the MO optimization
    of the extraction process of bioactive compounds from the Gardenia jasminoides
    Ellis fruit was solved. Three swarm-based algorithms were applied in
    conjunction with the normal-boundary intersection (NBI) method to solve this
    MO problem. The gravitational search algorithm (GSA) and the particle swarm
    optimization (PSO) technique were implemented in this work. In addition, a
    novel Hopfield-enhanced particle swarm optimization was developed and applied
    to the extraction problem. By measuring the levels of dominance, the optimality
    of the approximate Pareto frontiers produced by all the algorithms was gauged
    and compared. In addition, by measuring the level of convergence of the
    frontiers, some understanding of the structure of the objective space, in terms
    of its relation to the level of frontier dominance, is uncovered. Detailed
    comparative studies were conducted on all the algorithms employed and developed
    in this work.

    Generative Deep Neural Networks for Dialogue: A Short Review

    Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, Joelle Pineau
    Comments: 6 pages, 1 figure, 3 tables; NIPS 2016 workshop on Learning Methods for Dialogue
    Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

    Researchers have recently started investigating deep neural networks for
    dialogue applications. In particular, generative sequence-to-sequence (Seq2Seq)
    models have shown promising results for unstructured tasks, such as word-level
    dialogue response generation. The hope is that such models will be able to
    leverage massive amounts of data to learn meaningful natural language
    representations and response generation strategies, while requiring a minimum
    amount of domain knowledge and hand-crafting. An important challenge is to
    develop models that can effectively incorporate dialogue context and generate
    meaningful and diverse responses. In support of this goal, we review recently
    proposed models based on generative encoder-decoder neural network
    architectures, and show that these models have better ability to incorporate
    long-term dialogue history, to model uncertainty and ambiguity in dialogue, and
    to generate responses with high-level compositional structure.

    Visualizing and Understanding Curriculum Learning for Long Short-Term Memory Networks

    Volkan Cirik, Eduard Hovy, Louis-Philippe Morency
    Subjects: Computation and Language (cs.CL); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

    Curriculum Learning emphasizes the order of training instances in a
    computational learning setup. The core hypothesis is that simpler instances
    should be learned early as building blocks to learn more complex ones. Despite
    its usefulness, it is still unknown how exactly the internal representations of
    models are affected by curriculum learning. In this paper, we study the effect
    of curriculum learning on Long Short-Term Memory (LSTM) networks, which have
    shown strong competency in many Natural Language Processing (NLP) problems. Our
    experiments on a sentiment analysis task and a synthetic task similar to
    sequence prediction tasks in NLP show that curriculum learning has a positive
    effect on the LSTM’s internal states by biasing the model towards building
    constructive representations, i.e., the internal representations at previous
    timesteps are used as building blocks for the final prediction. We also find
    that smaller models improve significantly when they are trained with curriculum
    learning. Lastly, we show that curriculum learning helps more when the amount
    of training data is limited.
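
    For readers unfamiliar with the setup, the sketch below shows one common way to implement a curriculum: sort examples by a difficulty heuristic and train on progressively larger, harder subsets. The difficulty function, the staged schedule, and the `model.train_step` interface are placeholders, not the authors' protocol.

```python
import random

def staged_curriculum_training(model, examples, difficulty_fn,
                               num_stages=4, epochs_per_stage=1):
    """Train on progressively larger subsets of examples sorted from easy to hard."""
    ordered = sorted(examples, key=difficulty_fn)        # e.g., sentence length
    for stage in range(1, num_stages + 1):
        cutoff = int(len(ordered) * stage / num_stages)  # easiest `cutoff` examples
        subset = list(ordered[:cutoff])
        for _ in range(epochs_per_stage):
            random.shuffle(subset)
            for x, y in subset:
                model.train_step(x, y)                   # assumed training interface
    return model
```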

    An Empirical Study of Continuous Connectivity Degree Sequence Equivalents

    Daniel Moyer, Boris A. Gutman, Joshua Faskowitz, Neda Jahanshad, Paul M. Thompson
    Comments: Presented at The MICCAI-BACON 16 Workshop (this https URL)
    Subjects: Neurons and Cognition (q-bio.NC); Neural and Evolutionary Computing (cs.NE)

    In the present work we demonstrate the use of a parcellation free
    connectivity model based on Poisson point processes. This model produces for
    each subject a continuous bivariate intensity function that represents for
    every possible pair of points the relative rate at which we observe tracts
    terminating at those points. We fit this model to explore degree sequence
    equivalents for spatial continuum graphs, and to investigate the local
    differences between estimated intensity functions for two different
    tractography methods. This is a companion paper to Moyer et al. (2016), where
    the model was originally defined.

    Compacting Neural Network Classifiers via Dropout Training

    Yotaro Kubo, George Tucker, Simon Wiesler
    Comments: Accepted to NIPS Workshop on Efficient Methods for Deep Neural Networks
    Subjects: Machine Learning (stat.ML); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

    We introduce dropout compaction, a novel method for training feed-forward
    neural networks which realizes the performance gains of training a large model
    with dropout regularization, yet extracts a compact neural network for run-time
    efficiency. In the proposed method, we introduce a sparsity-inducing prior on
    the per unit dropout retention probability so that the optimizer can
    effectively prune hidden units during training. By changing the prior
    hyperparameters, we can control the size of the resulting network. We performed
    a systematic comparison of dropout compaction and competing methods on several
    real-world speech recognition tasks and found that dropout compaction achieved
    comparable accuracy with fewer than 50% of the hidden units, translating to a
    2.5x speedup in run-time.
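
    A minimal PyTorch-style sketch of the idea, assuming a single hidden layer: each unit carries a learnable retention probability, and a sparsity penalty on those probabilities pushes some units toward removal. The L1-style penalty stands in for the paper's sparsity-inducing prior, and the stochastic dropout mask is omitted for simplicity.

```python
import torch
import torch.nn as nn

class CompactingLayer(nn.Module):
    """Hidden layer whose units carry a learnable retention probability."""
    def __init__(self, n_in, n_hidden):
        super().__init__()
        self.linear = nn.Linear(n_in, n_hidden)
        self.keep_logit = nn.Parameter(torch.zeros(n_hidden))

    def forward(self, x):
        keep = torch.sigmoid(self.keep_logit)          # per-unit retention probability
        return torch.relu(self.linear(x)) * keep

    def sparsity_penalty(self):
        return torch.sigmoid(self.keep_logit).sum()    # add to the task loss

    def prunable_units(self, threshold=0.05):
        return (torch.sigmoid(self.keep_logit) < threshold).nonzero().flatten()
```

    In training one would minimize the task loss plus a weighted `sparsity_penalty()`, then drop the units reported by `prunable_units()` once training converges.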

    Sparsey: Event Recognition via Deep Hierarchical Sparse Distributed Codes

    Gerard J. Rinkus
    Comments: This is a manuscript form of a paper published in Frontiers in Computational Neuroscience in 2014 (this http URL). 65 pages, 28 figures, 8 tables
    Journal-ref: Frontiers in Computational Neuroscience, Vol. 8, Article 160
    (2014)
    Subjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

    Visual cortex’s hierarchical, multi-level organization is captured in many
    biologically inspired computational vision models, the general idea being that
    progressively larger scale, more complex spatiotemporal features are
    represented in progressively higher areas. However, most earlier models use
    localist representations (codes) in each representational field, which we
    equate with the cortical macrocolumn (mac), at each level. In localism, each
    represented feature/event (item) is coded by a single unit. Our model, Sparsey,
    is also hierarchical but crucially, uses sparse distributed coding (SDC) in
    every mac in all levels. In SDC, each represented item is coded by a small
    subset of the mac’s units. SDCs of different items can overlap and the size of
    overlap between items can represent their similarity. The difference between
    localism and SDC is crucial because SDC allows the two essential operations of
    associative memory, storing a new item and retrieving the best-matching stored
    item, to be done in fixed time for the life of the model. Since the model’s
    core algorithm, which does both storage and retrieval (inference), makes a
    single pass over all macs on each time step, the overall model’s
    storage/retrieval operation is also fixed-time, a criterion we consider
    essential for scalability to huge datasets. A 2010 paper described a
    nonhierarchical version of this model in the context of purely spatial pattern
    processing. Here, we elaborate a fully hierarchical model (arbitrary numbers of
    levels and macs per level), describing novel model principles like progressive
    critical periods, dynamic modulation of principal cells’ activation functions
    based on a mac-level familiarity measure, representation of multiple
    simultaneously active hypotheses, a novel method of time warp invariant
    recognition, and we report results showing learning/recognition of
    spatiotemporal patterns.


    Computer Vision and Pattern Recognition

    Ear Recognition: More Than a Survey

    Žiga Emeršič, Vitomir Štruc, Peter Peer
    Comments: 17 pages, paper accepted to Neurocomputing
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Automatic identity recognition from ear images represents an active field of
    research within the biometric community. The ability to capture ear images from
    a distance and in a covert manner makes the technology an appealing choice for
    surveillance and security applications as well as other application domains.
    Significant contributions have been made in the field over recent years, but
    open research problems still remain and hinder a wider (commercial) deployment
    of the technology. This paper presents an overview of the field of automatic
    ear recognition (from 2D images) and focuses specifically on the most recent,
    descriptor-based methods proposed in this area. Open challenges are discussed
    and potential research directions are outlined with the goal of providing the
    reader with a point of reference for issues worth examining in the future. In
    addition to a comprehensive review on ear recognition technology, the paper
    also introduces a new, fully unconstrained dataset of ear images gathered from
    the web and a toolbox implementing several state-of-the-art techniques for ear
    recognition. The dataset and toolbox are meant to address some of the open
    issues in the field and are made publicly available to the research community.

    Expert Gate: Lifelong Learning with a Network of Experts

    Rahaf Aljundi, Punarjay Chakravarty, Tinne Tuytelaars
    Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

    In this paper we introduce a model of lifelong learning, based on a Network
    of Experts. New tasks / experts are learned and added to the model
    sequentially, building on what was learned before. To ensure scalability of
    this process, data from previous tasks cannot be stored and hence is not
    available when learning a new task. A critical issue in such a context, not
    addressed in the literature so far, relates to the decision of which expert to
    deploy at test time. We introduce a gating autoencoder that learns a
    representation for the task at hand, and is used at test time to automatically
    forward the test sample to the relevant expert. This has the added advantage of
    being memory efficient as only one expert network has to be loaded into memory
    at any given time. Further, the autoencoders inherently capture the relatedness
    of one task to another, based on which the most relevant prior model to be used
    for training a new expert with fine-tuning or learning-without-forgetting can
    be selected. We evaluate our system on image classification and video
    prediction problems.
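
    The gating idea can be sketched in a few lines: keep one small undercomplete autoencoder per task and, at test time, route the sample to the expert whose autoencoder reconstructs it with the lowest error. The layer sizes and the hard argmin routing below are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class TaskAutoencoder(nn.Module):
    """Small undercomplete autoencoder trained on one task's data."""
    def __init__(self, dim_in, dim_code=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_code), nn.ReLU())
        self.decoder = nn.Linear(dim_code, dim_in)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def select_expert(x, autoencoders):
    """Pick the task whose autoencoder reconstructs the test sample best.

    `autoencoders` maps task name -> trained TaskAutoencoder; only the chosen
    task's expert network then needs to be loaded into memory.
    """
    with torch.no_grad():
        errors = {task: torch.mean((ae(x) - x) ** 2).item()
                  for task, ae in autoencoders.items()}
    return min(errors, key=errors.get)
```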

    Exploring LOTS in Deep Neural Networks

    Andras Rozsa, Manuel Günther, Terrance E. Boult
    Comments: Under review as a conference paper at ICLR 2017, see open review page: this https URL
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Deep neural networks have recently demonstrated excellent performance on
    various tasks. Despite recent advances, our understanding of these learning
    models is still incomplete, as revealed by their unexpected vulnerability to
    imperceptibly small, non-random perturbations. The existence of these
    so-called adversarial examples presents a serious problem for the application of
    vulnerable machine learning models. In this paper, we introduce the layerwise
    origin-target synthesis (LOTS) that can serve multiple purposes. First, we can
    use it as a visualization technique that gives us insights into the function of
    any intermediate feature layer by showing the notion of a particular input in
    deep neural networks. Second, our approach can be applied to assess the
    invariance of the learned features captured at any layer with respect to the
    class of the particular input. Finally, it can also be utilized as a general
    way for producing a vast amount of diverse adversarial examples that can be
    used for training to further improve the robustness of machine learning models
    and their performance as well.
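
    One way to picture a layerwise origin-target perturbation is as an iterated gradient step on the input that pulls its layer-l representation toward a chosen target representation. The single-step sketch below is a simplified reading of that idea; the feature-extractor callable, the step size, and the gradient normalization are assumptions rather than the authors' exact formulation.

```python
import torch

def lots_step(features_at_layer, x_origin, target_feature, step_size=1.0):
    """One gradient step moving an input's layer-l features toward a target.

    features_at_layer : callable mapping an input tensor to its layer-l features.
    Iterating this step yields perturbed inputs whose internal representation
    approaches `target_feature`.
    """
    x = x_origin.clone().detach().requires_grad_(True)
    loss = 0.5 * torch.sum((features_at_layer(x) - target_feature) ** 2)
    loss.backward()
    with torch.no_grad():
        grad = x.grad
        x_new = x - step_size * grad / (grad.abs().max() + 1e-12)  # normalized step
    return x_new.detach()
```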

    End-to-End Subtitle Detection and Recognition for Videos in East Asian Languages via CNN Ensemble with Near-Human-Level Performance

    Yan Xu, Siyuan Shan, Ziming Qiu, Zhipeng Jia, Zhengyang Shen, Yipei Wang, Mengfei Shi, Eric I-Chao Chang
    Comments: 35 pages
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    In this paper, we propose an innovative end-to-end subtitle detection and
    recognition system for videos in East Asian languages. Our end-to-end system
    consists of multiple stages. Subtitles are firstly detected by a novel image
    operator based on the sequence information of consecutive video frames. Then,
    an ensemble of Convolutional Neural Networks (CNNs) trained on synthetic data
    is adopted for detecting and recognizing East Asian characters. Finally, a
    dynamic programming approach leveraging language models is applied to
    constitute results of the entire body of text lines. The proposed system
    achieves average end-to-end accuracies of 98.2% and 98.3% on 40 videos in
    Simplified Chinese and 40 videos in Traditional Chinese respectively,
    significantly outperforming other existing methods. The near-perfect
    accuracy of our system dramatically narrows the gap between human cognitive
    ability and state-of-the-art algorithms used for such a task.

    AFFACT – Alignment Free Facial Attribute Classification Technique

    Manuel Günther, Andras Rozsa, Terrance E. Boult
    Comments: This is a pre-print of the original paper submitted for review in FG 2017
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    In this paper, we investigate how the latest versions of deep convolutional
    neural networks perform on the facial attribute classification task. We test
    two loss functions to train the neural networks: the sigmoid cross-entropy loss
    usually used in multi-objective classification tasks, and the Euclidean loss
    normally applied to regression problems, and find that there is little
    difference between these two loss functions. Rather, more attention should be
    paid on pre-training the network with external data, the learning rate policy
    and the evaluation technique. Using an ensemble of three ResNets, we obtain the
    new state-of-the-art facial attribute classification error of 8.00% on the
    aligned images of the CelebA dataset. More significantly, we introduce a novel
    data augmentation technique that allows training the AFFACT network, which
    classifies facial attributes without requiring alignment and works solely on
    detected face bounding boxes. We show that this approach outperforms the CelebA
    baseline, which did not use any face alignment either, with a relative
    improvement of 34%.

    Fast low-level pattern matching algorithm

    Janja Paliska Soldo, Ana Sovic Krzic, and Damir Sersic
    Comments: 14 pages, 7 tables. This work has been fully supported by the Croatian Science Foundation under the project UIP-11-2013-7353 Algorithms for Genome Sequence Analysis
    Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Genomics (q-bio.GN)

    This paper focuses on pattern matching in the DNA sequence. It was inspired
    by a previously reported method that proposes encoding both pattern and
    sequence using prime numbers. Although fast, that method is limited to rather
    small pattern lengths due to computing precision problems. Our approach
    successfully deals with large patterns thanks to an implementation that uses
    modular arithmetic. To obtain results very quickly, the code was adapted
    for multithreaded and parallel implementations. The method is reduced to
    assembly-language-level instructions, so the final result shows significant
    time and memory savings compared to the reference algorithm.
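
    The abstract does not give the encoding itself, but the role of modular arithmetic can be illustrated with a standard Rabin-Karp style rolling hash over the DNA alphabet: hash values stay bounded by the modulus no matter how long the pattern is, which is exactly the precision problem plain integer encodings run into. This sketch is a generic stand-in, not the authors' prime-number encoding.

```python
def find_pattern(sequence, pattern, base=4, mod=(1 << 61) - 1):
    """Rabin-Karp style matching over a DNA alphabet using modular arithmetic."""
    enc = {"A": 0, "C": 1, "G": 2, "T": 3}
    m, n = len(pattern), len(sequence)
    if m > n:
        return []
    # Hash of the pattern and of the first window of the sequence.
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + enc[pattern[i]]) % mod
        w_hash = (w_hash * base + enc[sequence[i]]) % mod
    top = pow(base, m - 1, mod)      # weight of the character leaving the window
    matches = []
    for i in range(n - m + 1):
        if w_hash == p_hash and sequence[i:i + m] == pattern:  # verify on hash hit
            matches.append(i)
        if i + m < n:                # roll the window one position to the right
            w_hash = ((w_hash - enc[sequence[i]] * top) * base
                      + enc[sequence[i + m]]) % mod
    return matches
```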

    DeepVO: A Deep Learning approach for Monocular Visual Odometry

    Vikram Mohanty, Shubh Agrawal, Shaswat Datta, Arna Ghosh, Vishnu Dutt Sharma, Debashish Chakravarty
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Deep Learning based techniques have been successfully adopted to solve many
    standard computer vision problems, such as image classification, object
    detection and segmentation. Despite the widespread success of these
    approaches, they have not yet been largely exploited for solving the standard
    perception-related problems encountered in autonomous
    navigation such as Visual Odometry (VO), Structure from Motion (SfM) and
    Simultaneous Localization and Mapping (SLAM). This paper analyzes the problem
    of Monocular Visual Odometry using a Deep Learning-based framework, instead of
    the regular ‘feature detection and tracking’ pipeline approaches. Several
    experiments were performed to understand the influence of a known/unknown
    environment, a conventional trackable feature and pre-trained activations tuned
    for object classification on the network’s ability to accurately estimate the
    motion trajectory of the camera (or the vehicle). Based on these observations,
    we propose a Convolutional Neural Network architecture that is best suited for
    estimating the object’s pose under known environment conditions and that
    displays promising results when it comes to inferring the actual scale using
    just a single camera in real time.

    An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data

    Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jiaying Liu
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Human action recognition is an important task in computer vision. Extracting
    discriminative spatial and temporal features to model the spatial and temporal
    evolutions of different actions plays a key role in accomplishing this task. In
    this work, we propose an end-to-end spatial and temporal attention model for
    human action recognition from skeleton data. We build our model on top of the
    Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM), which
    learns to selectively focus on discriminative joints of skeleton within each
    frame of the inputs and pays different levels of attention to the outputs of
    different frames. Furthermore, to ensure effective training of the network, we
    propose a regularized cross-entropy loss to drive the model learning process
    and develop a joint training strategy accordingly. Experimental results
    demonstrate the effectiveness of the proposed model, both on the small SBU
    human action recognition dataset and on the currently largest NTU dataset.

    Cross Domain Knowledge Transfer for Person Re-identification

    Qiqi Xiao, Kelei Cao, Haonan Chen, Fangyue Peng, Chi Zhang
    Comments: 8 pages
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Person Re-Identification (re-id) is a challenging task in computer vision,
    especially when there are limited training data from multiple camera views. In
    this paper, we propose a deep learning based person re-identification method
    by transferring knowledge of mid-level attribute features and high-level
    classification features. Building on the idea that identity classification,
    attribute recognition and re-identification share the same mid-level semantic
    representations, they can be trained sequentially by fine-tuning one based on
    another. In our framework, we train identity classification and attribute
    recognition tasks with a deep Convolutional Neural Network (dCNN) to learn
    person information. The information can be transferred to the person re-id task
    and improves its accuracy by a large margin. Furthermore, a Long Short-Term
    Memory (LSTM) based Recurrent Neural Network (RNN) component is extended by a
    spatial gate. This component is used in the re-id model to pay attention to
    certain spatial parts in each recurrent unit. Experimental results show that
    our method achieves 78.3% rank-1 recognition accuracy on the CUHK03
    benchmark.

    Improving training of deep neural networks via Singular Value Bounding

    Kui Jia
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Deep learning methods have recently achieved great success on many computer
    vision problems, with image classification and object detection as prominent
    examples. In spite of these practical successes, optimization of deep networks
    remains an active topic in deep learning research. In this work, we focus on
    investigation of the network solution properties that can potentially lead to
    good performance. Our research is inspired by theoretical and empirical results
    that use orthogonal matrices to initialize networks, but we are interested in
    investigating how orthogonal weight matrices perform when network training
    converges. To this end, we propose to constrain the solutions of weight
    matrices in the orthogonal feasible set during the whole process of network
    training, and achieve this by a simple yet effective method called Singular
    Value Bounding (SVB). In SVB, all singular values of each weight matrix are
    simply bounded in a narrow band around the value of 1. Based on the same
    motivation, we also propose Bounded Batch Normalization (BBN), which improves
    Batch Normalization by removing its potential risk of ill-conditioned layer
    transform. We present both theoretical and empirical results to justify our
    proposed methods. Experiments on benchmark image classification datasets show
    the efficacy of our proposed SVB and BBN. In particular, we reach the level of
    state-of-the-art results on CIFAR10 and set the new record on CIFAR100, using
    off-the-shelf network architectures.
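
    The core SVB step can be written compactly: periodically project every weight matrix so that its singular values lie in a narrow band around 1. The band below, [1/(1+eps), 1+eps], and the projection frequency are illustrative choices, not necessarily the paper's exact settings.

```python
import numpy as np

def singular_value_bounding(W, eps=0.05):
    """Clip all singular values of a weight matrix into a narrow band around 1.

    Intended to be applied every few SGD steps so that weight matrices stay
    close to orthogonal throughout training.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s = np.clip(s, 1.0 / (1.0 + eps), 1.0 + eps)
    return (U * s) @ Vt
```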

    Online Visual Multi-Object Tracking via Labeled Random Finite Set Filtering

    Du Yong Kim, Ba-Ngu Vo, Ba-Tuong Vo
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    This paper proposes an online visual multi-object tracking algorithm using a
    top-down Bayesian formulation that seamlessly integrates state estimation,
    track management, clutter rejection, occlusion and mis-detection handling into
    a single recursion. This is achieved by modeling the multi-object state as
    labeled random finite set and using the Bayes recursion to propagate the
    multi-object filtering density forward in time. The proposed filter updates
    tracks with detections but switches to image data when mis-detection occurs,
    thereby exploiting the efficiency of detection data and the accuracy of image
    data. Furthermore the labeled random finite set framework enables the
    incorporation of prior knowledge that mis-detections of long tracks which occur
    in the middle of the scene are likely to be due to occlusions. Such prior
    knowledge can be exploited to improve occlusion handling, especially long
    occlusions that can lead to premature track termination in on-line multi-object
    tracking. Tracking performance is compared to that of state-of-the-art
    algorithms on well-known benchmark video datasets.

    Fuzzy Statistical Matrices for Cell Classification

    Guillaume Thibault, Izhak Shafran
    Comments: 21 pages, 7 figures, 5 tables
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    In this paper, we generalize image (texture) statistical descriptors and
    propose algorithms that improve their efficacy. Recently, a new method showed
    how the popular Co-Occurrence Matrix (COM) can be modified into a fuzzy version
    (FCOM) which is more effective and robust to noise. Here, we introduce new
    fuzzy versions of two additional higher order statistical matrices: the Run
    Length Matrix (RLM) and the Size Zone Matrix (SZM). We define the fuzzy zones
    and propose an efficient algorithm to compute the descriptors. We demonstrate
    the advantage of the proposed improvements over several state-of-the-art
    methods on three tasks from quantitative cell biology: analyzing and
    classifying Human Epithelial type 2 (HEp-2) cells using the Indirect
    Immunofluorescence (IIF) protocol.

    To Find the Symmetry Plane in Any Dimension, Reflect, Register, and Compute a -1 Eigenvector

    Marcelo Cicconet, David G. C. Hildebrand, Hunter Elliott
    Comments: 9 pages, 8 figures, submitted to CVPR 2017
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    In this paper, we demonstrate that the problem of fitting a plane of
    reflection symmetry to data in any dimension can be reduced to the problem of
    registering two datasets, and that the exactness of the solution depends on the
    accuracy of the registration. The pipeline for symmetry plane detection
    consists of (1) reflecting the data with respect to an arbitrary plane, (2)
    registering the original and reflected datasets, and (3) finding the
    eigenvector of eigenvalue -1 of a matrix given by the reflection and
    registration mappings. Results are shown for 2D and 3D datasets. We discuss in
    detail a particular biological application in which we study the 3D symmetry of
    manual myelinated neuron reconstructions throughout the body of a larval
    zebrafish that were extracted from serial-section electron micrographs. The
    data consists of curves that are represented as sequences of points in 3D, and
    there are two goals: first, find the plane of mirror symmetry given that the
    neuron reconstructions are nearly symmetric; second, find pairings of symmetric
    curves.
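
    A compact numerical sketch of the three-step pipeline, assuming a rigid registration routine `register(src, dst)` that returns a rotation R and translation t (e.g., ICP or Procrustes): the composition of the arbitrary reflection with the registration rotation is an improper orthogonal matrix, and its eigenvector of eigenvalue -1 gives the symmetry plane normal.

```python
import numpy as np

def reflect(points, normal, offset=0.0):
    """Reflect 3D points across the plane {x : normal . x = offset}."""
    n = normal / np.linalg.norm(normal)
    return points - 2.0 * np.outer(points @ n - offset, n)

def symmetry_plane_normal(points, register):
    """Reflect, register, and read the -1 eigenvector off the combined mapping."""
    arbitrary_normal = np.array([1.0, 0.0, 0.0])          # step (1): any plane works
    reflected = reflect(points, arbitrary_normal)
    R, t = register(reflected, points)                    # step (2): rigid registration
    F = np.eye(3) - 2.0 * np.outer(arbitrary_normal, arbitrary_normal)
    vals, vecs = np.linalg.eig(R @ F)                     # step (3): combined mapping
    idx = np.argmin(np.abs(vals + 1.0))                   # eigenvalue closest to -1
    return np.real(vecs[:, idx])                          # normal of the symmetry plane
```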

    Reweighted Low-Rank Tensor Completion and its Applications in Video Recovery

    Baburaj M., Sudhish N. George
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    This paper focuses on recovering multi-dimensional data, called tensors, from
    randomly corrupted and incomplete observations. Inspired by reweighted (l_1) norm
    minimization for sparsity enhancement, this paper proposes a reweighted
    singular value enhancement scheme to improve tensor low tubular rank in the
    tensor completion process. An efficient iterative decomposition scheme based on
    t-SVD is proposed which improves low-rank signal recovery significantly. The
    effectiveness of the proposed method is established by applying it to the video
    completion problem, and experimental results reveal that the algorithm
    outperforms its counterparts.
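
    The reweighting idea carries over directly from reweighted (l_1) minimization: shrink small singular values more aggressively than large ones. The sketch below shows the per-matrix operation only (the paper applies the idea to the tubular singular values obtained from a t-SVD), and the threshold and weighting are illustrative.

```python
import numpy as np

def reweighted_svt(X, tau=1.0, eps=1e-3):
    """Reweighted singular value thresholding for one matrix slice.

    Thresholds ~ tau / (sigma + eps) mean large singular values are barely
    shrunk while small ones are suppressed, sharpening the low-rank structure.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - tau / (s + eps), 0.0)
    return (U * s_shrunk) @ Vt
```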

    Reweighted Low-Rank Tensor Decomposition and its Applications in Video Denoising

    Baburaj M., Sudhish N. George
    Subjects: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (cs.NA)

    A tensor is decomposed into low-rank and sparse components by simultaneously
    minimizing tensor nuclear norm and the (l_1) norm in Tensor Principal Component
    Pursuit (TPCP). Inspired by reweighted (l_1) norm minimization for sparsity
    enhancement, this paper proposes a reweighted singular value enhancement scheme
    to improve tensor low tubular rank in TPCP. The sparse component of a tensor is
    also recovered by the reweighted (l_1) norm which enhances the accuracy of
    decomposition. An efficient iterative decomposition scheme based on t-SVD is
    proposed which improves low-rank signal recovery significantly. The
    effectiveness of the proposed method is established by applying it to the video
    denoising problem, and experimental results reveal that the algorithm
    outperforms its counterparts.

    SC-DCNN: Highly-Scalable Deep Convolutional Neural Network using Stochastic Computing

    Ao Ren, Ji Li, Zhe Li, Caiwen Ding, Xuehai Qian, Qinru Qiu, Bo Yuan, Yanzhi Wang
    Comments: This paper is accepted by 22nd ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2017
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    With the recent advance of the Internet of Things (IoT), it has become very
    attractive to implement deep convolutional neural networks (DCNNs) on
    embedded/portable systems. Presently, executing software-based DCNNs
    requires high-performance server clusters in practice, restricting their
    widespread deployment on mobile devices. To overcome this issue,
    considerable research efforts have been conducted in the context of developing
    highly-parallel and specific DCNN hardware, utilizing GPGPUs, FPGAs, and ASICs.
    Stochastic Computing (SC), which uses a bit-stream to represent a number within
    [-1, 1] by counting the number of ones in the bit-stream, has a high potential
    for implementing DCNNs with high scalability and ultra-low hardware footprint.
    Since multiplications and additions can be calculated using AND gates and
    multiplexers in SC, significant reductions in power/energy and hardware
    footprint can be achieved compared to the conventional binary arithmetic
    implementations. The tremendous savings in power (energy) and hardware
    resources bring about immense design space for enhancing scalability and
    robustness for hardware DCNNs. This paper presents the first comprehensive
    design and optimization framework of SC-based DCNNs (SC-DCNNs). We first
    present the optimal designs of function blocks that perform the basic
    operations, i.e., inner product, pooling, and activation function. Then we
    propose the optimal design of four types of combinations of basic function
    blocks, named feature extraction blocks, which are in charge of extracting
    features from input feature maps. Besides, weight storage methods are
    investigated to reduce the area and power/energy consumption for storing
    weights. Finally, the whole SC-DCNN implementation is optimized, with feature
    extraction blocks carefully selected, to minimize area and power/energy
    consumption while maintaining a high network accuracy level.
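
    The arithmetic behind SC is easy to simulate in software. The sketch below uses the simpler unipolar format (values in [0, 1]) to show why a bitwise AND multiplies two streams and a multiplexer performs scaled addition; the paper's bipolar [-1, 1] format works analogously with different gates.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_stream(x, length=4096):
    """Encode a value in [0, 1] as a random bit-stream with P(bit = 1) = x."""
    return (rng.random(length) < x).astype(np.uint8)

def from_stream(bits):
    """Decode a bit-stream by counting the fraction of ones."""
    return bits.mean()

a, b = to_stream(0.6), to_stream(0.5)

# Multiplication of two independent unipolar streams is a bitwise AND:
# P(a AND b = 1) = 0.6 * 0.5 = 0.3.
print(from_stream(a & b))

# Scaled addition uses a multiplexer with a 0.5-probability select stream:
# the output decodes to roughly (0.6 + 0.5) / 2 = 0.55.
sel = to_stream(0.5)
print(from_stream(np.where(sel == 1, a, b)))
```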

    Generalized BackPropagation, Étude De Cas: Orthogonality

    Mehrtash Harandi, Basura Fernando
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    This paper introduces an extension of the backpropagation algorithm that
    enables us to have layers with constrained weights in a deep network. In
    particular, we make use of the Riemannian geometry and optimization techniques
    on matrix manifolds to step outside of normal practice in training deep
    networks, equipping the network with structures such as orthogonality or
    positive definiteness. Based on our development, we make another contribution
    by introducing the Stiefel layer, a layer with orthogonal weights. Among
    various applications, Stiefel layers can be used to design orthogonal filter
    banks, perform dimensionality reduction and feature extraction. We demonstrate
    the benefits of having orthogonality in deep networks through a broad set of
    experiments, ranging from unsupervised feature learning to fine-grained image
    classification.
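
    A generic sketch of how orthogonality can be maintained during training, assuming the embedded (Euclidean) metric on the Stiefel manifold and a QR-based retraction; this is standard manifold optimization machinery, not necessarily the exact update used for the Stiefel layer.

```python
import numpy as np

def stiefel_update(W, grad, lr=0.01):
    """One Riemannian gradient step keeping W^T W = I (W is n x p, n >= p)."""
    WtG = W.T @ grad
    riem_grad = grad - W @ (WtG + WtG.T) / 2.0     # project onto the tangent space
    Q, R = np.linalg.qr(W - lr * riem_grad)        # retract back onto the manifold
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs                               # fix column signs for a unique QR
```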

    Squared Earth Mover's Distance-based Loss for Training Deep Neural Networks

    Le Hou, Chen-Ping Yu, Dimitris Samaras
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Deep neural networks (DNNs), especially convolutional neural networks (CNNs),
    have achieved state-of-the-art performances in many applications, and they have
    been shown to be especially powerful in multi-class classification tasks.
    Existing DNNs and CNNs are typically trained with a soft-max cross-entropy loss
    which only considers the ground-truth class by maximizing the predicted
    probability of the correct label. This cross-entropy loss ignores the intricate
    inter-class relationships that exist in the data. In this work, we propose to
    use the exact squared Earth Mover’s Distance (EMD) as a loss function for
    multi-class classification tasks. The squared EMD loss uses the predicted
    probabilities of all classes and penalizes mis-predictions accordingly. In
    experiments, we evaluate our squared EMD loss in ordered-classes datasets such
    as age estimation and image aesthetic judgment. We also generalize the squared
    EMD loss to classification datasets with orderless classes such as ImageNet.
    Our results show that the squared EMD loss allows networks to achieve
    lower errors than the standard cross-entropy loss, and results in
    state-of-the-art performances on two age estimation datasets and one image
    aesthetic judgment dataset.
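
    For ordered classes, the EMD between two distributions reduces to the distance between their cumulative distributions, so the squared EMD loss is simply the sum of squared CDF differences, as sketched below; the age-bucket example is made up for illustration.

```python
import numpy as np

def squared_emd_loss(pred_probs, target_probs):
    """Squared Earth Mover's Distance between distributions over ordered classes.

    Unlike cross-entropy, probability mass placed on classes near the true
    class is penalized less than mass placed far away from it.
    """
    cdf_pred = np.cumsum(pred_probs, axis=-1)
    cdf_true = np.cumsum(target_probs, axis=-1)
    return np.sum((cdf_pred - cdf_true) ** 2, axis=-1)

# True age bucket is 2 (of 5); predicting the neighboring bucket costs less
# than predicting the farthest one.
t = np.array([0, 0, 1, 0, 0], dtype=float)
print(squared_emd_loss(np.array([0, 0, 0, 1, 0.0]), t))   # 1.0
print(squared_emd_loss(np.array([0, 0, 0, 0, 1.0]), t))   # 2.0
```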

    Generative One-Class Models for Text-based Person Retrieval in Forensic Applications

    David Gerónimo, Hedvig Kjellström
    Subjects: Computer Vision and Pattern Recognition (cs.CV)

    Automatic forensic image analysis assists criminal investigation experts in
    the search for suspicious persons, abnormal behavior detection and identity
    matching in images. In this paper we propose a person retrieval system that
    uses textual queries (e.g., “black trousers and green shirt”) as descriptions
    and a one-class generative color model with outlier filtering to represent the
    images both to train the models and to perform the search. The method is
    evaluated in terms of its efficiency in fulfilling the needs of a forensic
    retrieval system: limited annotation, robustness, extensibility, adaptability
    and computational cost. The proposed generative method is compared to a
    corresponding discriminative approach. Experiments are carried out using a
    range of queries in three different databases. The experiments show that the
    two evaluated algorithms provide average retrieval performance and are adaptable to
    new datasets. The proposed generative algorithm has some advantages over the
    discriminative one, specifically its capability to work with very few training
    samples and its much lower computational requirements when the number of
    training examples increases.

    Answering Image Riddles using Vision and Reasoning through Probabilistic Soft Logic

    Somak Aditya, Yezhou Yang, Chitta Baral, Yiannis Aloimonos
    Comments: 14 pages, 10 figures
    Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

    In this work, we explore a genre of puzzles (“image riddles”) which involves
    a set of images and a question. Answering these puzzles requires both
    capabilities involving visual detection (including object and activity
    recognition) and knowledge-based or commonsense reasoning. We compile a
    dataset of over 3k riddles where each riddle consists of 4 images and a
    groundtruth answer. The annotations are validated using crowd-sourced
    evaluation. We also define an automatic evaluation metric to track future
    progress. Our task bears similarity to commonly known IQ tasks, such as
    analogy solving and sequence filling, that are often used to test intelligence.

    We develop a Probabilistic Reasoning-based approach that utilizes
    probabilistic commonsense knowledge to answer these riddles with a reasonable
    accuracy. We demonstrate the results of our approach using both automatic and
    human evaluations. Our approach achieves some promising results for these
    riddles and provides a strong baseline for future attempts. We make the entire
    dataset and related materials publicly available to the community in
    ImageRiddle Website (this http URL).

    ModelHub: Towards Unified Data and Lifecycle Management for Deep Learning

    Hui Miao, Ang Li, Larry S. Davis, Amol Deshpande
    Subjects: Databases (cs.DB); Computer Vision and Pattern Recognition (cs.CV)

    Deep learning has improved state-of-the-art results in many important fields,
    and has been the subject of much research in recent years, leading to the
    development of several systems for facilitating deep learning. Current systems,
    however, mainly focus on model building and training phases, while the issues
    of data management, model sharing, and lifecycle management are largely
    ignored. The deep learning modeling lifecycle generates a rich set of data
    artifacts, such as learned parameters and training logs, and comprises
    several frequently conducted tasks, e.g., understanding model behaviors and
    trying out new models. Dealing with such artifacts and tasks is cumbersome and
    largely left to the users. This paper describes our vision and implementation
    of a data and lifecycle management system for deep learning. First, we
    generalize model exploration and model enumeration queries from commonly
    conducted tasks by deep learning modelers, and propose a high-level domain
    specific language (DSL), inspired by SQL, to raise the abstraction level and
    accelerate the modeling process. To manage the data artifacts, especially the
    large amount of checkpointed float parameters, we design a novel model
    versioning system (dlv), and a read-optimized parameter archival storage system
    (PAS) that minimizes storage footprint and accelerates query workloads without
    losing accuracy. PAS archives versioned models using deltas in a
    multi-resolution fashion by separately storing the less significant bits, and
    features a novel progressive query (inference) evaluation algorithm. Third, we
    show that archiving versioned models using deltas poses a new dataset
    versioning problem and we develop efficient algorithms for solving it. We
    conduct extensive experiments over several real datasets from computer vision
    domain to show the efficiency of the proposed techniques.

    NoiseOut: A Simple Way to Prune Neural Networks

    Mohammad Babaeizadeh, Paris Smaragdis, Roy H. Campbell
    Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)

    Neural networks are usually over-parameterized with significant redundancy in
    the number of required neurons which results in unnecessary computation and
    memory usage at inference time. One common approach to address this issue is to
    prune these big networks by removing extra neurons and parameters while
    maintaining the accuracy. In this paper, we propose NoiseOut, a fully automated
    pruning algorithm based on the correlation between activations of neurons in
    the hidden layers. We prove that adding additional output neurons with entirely
    random targets results in a higher correlation between neurons, which makes
    pruning by NoiseOut even more efficient. Finally, we test our method on various
    networks and datasets. These experiments exhibit high pruning rates while
    maintaining the accuracy of the original network.

    Minimal Problems for the Calibrated Trifocal Variety

    Joe Kileel
    Comments: 23 pages, 1 table
    Subjects: Algebraic Geometry (math.AG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)

    We determine the algebraic degree of minimal problems for the calibrated
    trifocal variety in computer vision. We rely on numerical algebraic geometry
    and the homotopy continuation software Bertini.

    Sparsey: Event Recognition via Deep Hierarchical Sparse Distributed Codes

    Gerard J. Rinkus
    Comments: This is a manuscript form of a paper published in Frontiers in Computational Neuroscience in 2014 (this http URL). 65 pages, 28 figures, 8 tables
    Journal-ref: Frontiers in Computational Neuroscience, Vol. 8, Article 160
    (2014)
    Subjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

    Visual cortex’s hierarchical, multi-level organization is captured in many
    biologically inspired computational vision models, the general idea being that
    progressively larger scale, more complex spatiotemporal features are
    represented in progressively higher areas. However, most earlier models use
    localist representations (codes) in each representational field, which we
    equate with the cortical macrocolumn (mac), at each level. In localism, each
    represented feature/event (item) is coded by a single unit. Our model, Sparsey,
    is also hierarchical but crucially, uses sparse distributed coding (SDC) in
    every mac in all levels. In SDC, each represented item is coded by a small
    subset of the mac’s units. SDCs of different items can overlap and the size of
    overlap between items can represent their similarity. The difference between
    localism and SDC is crucial because SDC allows the two essential operations of
    associative memory, storing a new item and retrieving the best-matching stored
    item, to be done in fixed time for the life of the model. Since the model’s
    core algorithm, which does both storage and retrieval (inference), makes a
    single pass over all macs on each time step, the overall model’s
    storage/retrieval operation is also fixed-time, a criterion we consider
    essential for scalability to huge datasets. A 2010 paper described a
    nonhierarchical version of this model in the context of purely spatial pattern
    processing. Here, we elaborate a fully hierarchical model (arbitrary numbers of
    levels and macs per level), describing novel model principles like progressive
    critical periods, dynamic modulation of principal cells’ activation functions
    based on a mac-level familiarity measure, representation of multiple
    simultaneously active hypotheses, a novel method of time warp invariant
    recognition, and we report results showing learning/recognition of
    spatiotemporal patterns.


    Artificial Intelligence

    Stratified Knowledge Bases as Interpretable Probabilistic Models (Extended Abstract)

    Ondrej Kuzelka, Jesse Davis, Steven Schockaert
    Comments: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems
    Subjects: Artificial Intelligence (cs.AI)

    In this paper, we advocate the use of stratified logical theories for
    representing probabilistic models. We argue that such encodings can be more
    interpretable than those obtained in existing frameworks such as Markov logic
    networks. Among other benefits, this allows domain experts to improve
    learned models by directly removing, adding, or modifying logical formulas.

    Team-maxmin equilibrium: efficiency bounds and algorithms

    Nicola Basilico, Andrea Celli, Giuseppe De Nittis, Nicola Gatti
    Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)

    The Team-maxmin equilibrium prescribes the optimal strategies for a team of
    rational players sharing the same goal and without the capability of
    correlating their strategies in strategic games against an adversary. This
    solution concept can capture situations in which an agent controls multiple
    resources, corresponding to the team members, that cannot communicate. It is
    known that such an equilibrium always exists and is unique (barring degeneracy),
    and these properties make it a credible solution concept to be used in
    real-world applications, especially in security scenarios. Nevertheless, to the
    best of our knowledge, the Team-maxmin equilibrium is almost completely
    unexplored in the literature. In this paper, we investigate bounds of
    (in)efficiency of the Team-maxmin equilibrium w.r.t. the Nash equilibria and
    w.r.t. the Maxmin equilibrium when the team members can play correlated
    strategies. Furthermore, we study a number of algorithms to find and/or
    approximate an equilibrium, discussing their theoretical guarantees and
    evaluating their performance by using a standard testbed of game instances.

    Navigational Rule Derivation: An algorithm to determine the effect of traffic signs on road networks

    Daniil Galaktionov, Miguel R. Luaces, Ángeles S. Places
    Comments: This research has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. In PACIS 2016 Online Proceedings
    Subjects: Artificial Intelligence (cs.AI)

    In this paper we present an algorithm to build a road network map enriched
    with traffic rules such as one-way streets and forbidden turns, based on the
    interpretation of already detected and classified traffic signs. Such an
    algorithm helps to automate the elaboration of maps for commercial navigation systems.
    Our solution is based on simulating navigation along the road network,
    determining at each point of interest the visibility of the signs and their
    effect on the roads. We test our approach in a small urban network and discuss
    various ways to generalize it to support more complex environments.

    NCBO Ontology Recommender 2.0: An Enhanced Approach for Biomedical Ontology Recommendation

    Marcos Martinez-Romero, Clement Jonquet, Martin J. O'Connor, John Graybeal, Alejandro Pazos, Mark A. Musen
    Comments: 29 pages, 8 figures, 11 tables
    Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

    Biomedical researchers use ontologies to annotate their data with ontology
    terms, enabling better data integration and interoperability. However, the
    number, variety and complexity of current biomedical ontologies make it
    cumbersome for researchers to determine which ones to reuse for their specific
    needs. To overcome this problem, in 2010 the National Center for Biomedical
    Ontology (NCBO) released the Ontology Recommender, which is a service that
    receives a biomedical text corpus or a list of keywords and suggests ontologies
    appropriate for referencing the indicated terms. We developed a new version of
    the NCBO Ontology Recommender. Called Ontology Recommender 2.0, it uses a new
    recommendation approach that evaluates the relevance of an ontology to
    biomedical text data according to four criteria: (1) the extent to which the
    ontology covers the input data; (2) the acceptance of the ontology in the
    biomedical community; (3) the level of detail of the ontology classes that
    cover the input data; and (4) the specialization of the ontology to the domain
    of the input data. Our evaluation shows that the enhanced recommender provides
    higher quality suggestions than the original approach, providing better
    coverage of the input data, more detailed information about their concepts,
    increased specialization for the domain of the input data, and greater
    acceptance and use in the community. In addition, it provides users with more
    explanatory information, along with suggestions of not only individual
    ontologies but also groups of ontologies. It also can be customized to fit the
    needs of different scenarios. Ontology Recommender 2.0 combines the strengths
    of its predecessor with a range of adjustments and new features that improve
    its reliability and usefulness. Ontology Recommender 2.0 recommends over 500
    biomedical ontologies from the NCBO BioPortal platform, where it is openly
    available.

    Analysis of a Design Pattern for Teaching with Features and Labels

    Christopher Meek, Patrice Simard, Xiaojin Zhu
    Comments: Also available at this https URL
    Subjects: Artificial Intelligence (cs.AI); Learning (cs.LG)

    We study the task of teaching a machine to classify objects using features
    and labels. We introduce the Error-Driven-Featuring design pattern for teaching
    using features and labels in which a teacher prefers to introduce features only
    if they are needed. We analyze the potential risks and benefits of this
    teaching pattern through the use of teaching protocols, illustrative examples,
    and by providing bounds on the effort required for an optimal machine teacher
    using a linear learning algorithm, the most commonly used type of learner in
    interactive machine learning systems. Our analysis provides a deeper
    understanding of potential trade-offs of using different learning algorithms
    and between the effort required for featuring (creating new features) and
    labeling (providing labels for objects).

    Structural Causal Models: Cycles, Marginalizations, Exogenous Reparametrizations and Reductions

    Stephan Bongers, Jonas Peters, Bernhard Schölkopf, Joris M. Mooij
    Comments: Will probably be submitted to The Annals of Statistics
    Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Learning (cs.LG)

    Structural causal models (SCMs), also known as non-parametric structural
    equation models (NP-SEMs), are widely used for causal modeling purposes. In
    this paper, we give a rigorous treatment of structural causal models, dealing
    with measure-theoretic complications that arise in the presence of cyclic
    relations. The central question studied in this paper is: given a (possibly
    cyclic) SCM defined on a large system (consisting of observable endogenous and
    latent exogenous variables), can we “project it down” to an SCM that describes
    a subsystem (consisting of a subset of the observed endogenous variables and
    possibly different latent exogenous variables) in order to obtain a more
    parsimonious but equivalent representation of the subsystem? We define a
    marginalization operation that effectively removes a subset of the endogenous
    variables from the model, and a class of mappings, exogenous
    reparameterizations, that can be used to reduce the space of exogenous
    variables. We show that both operations preserve the causal semantics of the
    model and that under mild conditions they can lead to a significant reduction
    of the model complexity, at least in terms of the number of variables in the
    model. We argue that for the task of estimating an SCM from data, the existence
    of “smooth” reductions would be desirable. We provide several conditions under
    which the existence of such reductions can be shown, but also provide a
    counterexample that shows that such reductions do not exist in general. The
    latter result implies that existing approaches to estimate linear or Markovian
    SCMs from data cannot be extended to general SCMs.

    Generative Deep Neural Networks for Dialogue: A Short Review

    Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, Joelle Pineau
    Comments: 6 pages, 1 figure, 3 tables; NIPS 2016 workshop on Learning Methods for Dialogue
    Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

    Researchers have recently started investigating deep neural networks for
    dialogue applications. In particular, generative sequence-to-sequence (Seq2Seq)
    models have shown promising results for unstructured tasks, such as word-level
    dialogue response generation. The hope is that such models will be able to
    leverage massive amounts of data to learn meaningful natural language
    representations and response generation strategies, while requiring a minimum
    amount of domain knowledge and hand-crafting. An important challenge is to
    develop models that can effectively incorporate dialogue context and generate
    meaningful and diverse responses. In support of this goal, we review recently
    proposed models based on generative encoder-decoder neural network
    architectures, and show that these models have better ability to incorporate
    long-term dialogue history, to model uncertainty and ambiguity in dialogue, and
    to generate responses with high-level compositional structure.

    Expert Gate: Lifelong Learning with a Network of Experts

    Rahaf Aljundi, Punarjay Chakravarty, Tinne Tuytelaars
    Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

    In this paper we introduce a model of lifelong learning, based on a Network
    of Experts. New tasks / experts are learned and added to the model
    sequentially, building on what was learned before. To ensure scalability of
    this process, data from previous tasks cannot be stored and hence is not
    available when learning a new task. A critical issue in such a context, not
    addressed in the literature so far, relates to the decision of which expert to
    deploy at test time. We introduce a gating autoencoder that learns a
    representation for the task at hand, and is used at test time to automatically
    forward the test sample to the relevant expert. This has the added advantage of
    being memory efficient as only one expert network has to be loaded into memory
    at any given time. Further, the autoencoders inherently capture the relatedness
    of one task to another, based on which the most relevant prior model to be used
    for training a new expert with fine-tuning or learning-without-forgetting can
    be selected. We evaluate our system on image classification and video
    prediction problems.

    Query Complexity of Tournament Solutions

    Palash Dey
    Comments: To appear in AAAI 2017
    Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)

    A directed graph where there is exactly one edge between every pair of
    vertices is called a {em tournament}. Finding the “best” set of vertices of a
    tournament is a well studied problem in social choice theory. A {em tournament
    solution} takes a tournament as input and outputs a subset of vertices of the
    input tournament. However, in many applications, for example, choosing the best
    set of drugs from a given set of drugs, the edges of the tournament are given
    only implicitly and knowing the orientation of an edge is costly. In such
    scenarios, we would like to know the best set of vertices (according to some
    tournament solution) by “querying” as few edges as possible. We, in this paper,
    precisely study this problem for commonly used tournament solutions: given an
    oracle access to the edges of a tournament T, find (f(T)) by querying as few
    edges as possible, for a tournament solution f. We first show that the set of
    Condorcet non-losers in a tournament can be found by querying (2n-lfloor log
    n
    floor -2) edges only and this is tight in the sense that every algorithm
    for finding the set of Condorcet non-losers needs to query at least (2n-lfloor
    log n
    floor -2) edges in the worst case, where (n) is the number of vertices
    in the input tournament. We then move on to study other popular tournament
    solutions and show that any algorithm for finding the Copeland set, the Slater
    set, the Markov set, the bipartisan set, the uncovered set, the Banks set, and
    the top cycle must query (Omega(n^2)) edges in the worst case. On the positive
    side, we are able to circumvent our strong query complexity lower bound results
    by proving that, if the size of the top cycle of the input tournament is at
    most (k), then we can find all the tournament solutions mentioned above by
    querying (O(nk + frac{nlog n}{log(1-frac{1}{k})})) edges only.
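
For readers unfamiliar with the query model, the hedged sketch below hides a
random tournament behind a query-counting edge oracle and computes the
Copeland set by the naive strategy of querying every edge, which is exactly
the Theta(n^2) regime the lower bounds above refer to. It illustrates the
oracle-access setting only, not the paper's algorithm.

    # Edge-oracle model: the tournament is hidden behind a query-counting
    # oracle, and a naive baseline computes the Copeland set (vertices of
    # maximum out-degree) with n*(n-1)/2 queries.
    import itertools
    import random

    class TournamentOracle:
        def __init__(self, n, seed=0):
            rnd = random.Random(seed)
            # orientation[(u, v)] is True iff the edge points from u to v
            self.orientation = {}
            for u, v in itertools.combinations(range(n), 2):
                self.orientation[(u, v)] = rnd.random() < 0.5
            self.queries = 0

        def beats(self, u, v):
            """Query the orientation of the edge {u, v}."""
            self.queries += 1
            if u < v:
                return self.orientation[(u, v)]
            return not self.orientation[(v, u)]

    def copeland_set(oracle, n):
        out_degree = [0] * n
        for u, v in itertools.combinations(range(n), 2):
            if oracle.beats(u, v):
                out_degree[u] += 1
            else:
                out_degree[v] += 1
        best = max(out_degree)
        return {v for v in range(n) if out_degree[v] == best}

    n = 8
    oracle = TournamentOracle(n)
    print(copeland_set(oracle, n), "using", oracle.queries, "queries")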

    Variable Computation in Recurrent Neural Networks

    Yacine Jernite, Edouard Grave, Armand Joulin, Tomas Mikolov
    Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Learning (cs.LG)

    Recurrent neural networks (RNNs) have been used extensively and with
    increasing success to model various types of sequential data. Much of this
    progress has been achieved through devising recurrent units and architectures
    with the flexibility to capture complex statistics in the data, such as long
    range dependency or localized attention phenomena. However, while many
    sequential data (such as video, speech or language) can have highly variable
    information flow, most recurrent models still consume input features at a
    constant rate and perform a constant number of computations per time step,
    which can be detrimental to both speed and model capacity. In this paper, we
    explore a modification to existing recurrent units which allows them to learn
    to vary the amount of computation they perform at each step, without prior
    knowledge of the sequence’s time structure. We show experimentally that not
    only is our model more computationally efficient, it also leads to better
    performance overall on our evaluation tasks.

    Learning Interpretability for Visualizations using Adapted Cox Models through a User Experiment

    Adrien Bibal, Benoit Frénay
    Comments: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems
    Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Learning (cs.LG)

    In order to be useful, visualizations need to be interpretable. This paper
    uses a user-based approach to combine and assess quality measures in order to
    better model user preferences. Results show that cluster separability measures
    are outperformed by a neighborhood conservation measure, even though the former
    are usually considered as intuitively representative of user motives. Moreover,
    combining measures, as opposed to using a single measure, further improves
    prediction performances.

    Faster variational inducing input Gaussian process classification

    Pavel Izmailov, Dmitry Kropotov
    Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

    Gaussian processes (GP) provide a prior over functions and allow finding
    complex regularities in data. Gaussian processes are successfully used for
    classification/regression problems and dimensionality reduction. In this work
    we consider the classification problem only. The complexity of standard methods
    for GP-classification scales cubically with the size of the training dataset.
    This complexity makes them inapplicable to big data problems. Therefore, a
    variety of methods were introduced to overcome this limitation. In the paper we
    focus on methods based on so called inducing inputs. This approach is based on
    variational inference and proposes a particular lower bound for marginal
    likelihood (evidence). This bound is then maximized w.r.t. parameters of kernel
    function of the Gaussian process, thus fitting the model to data. The
    computational complexity of this method is (O(nm^2)), where (m) is the number
    of inducing inputs used by the model and is assumed to be substantially smaller
than the size of the dataset (n). Recently, a new evidence lower bound for
the GP-classification problem was introduced. It allows using stochastic
optimization, which makes it suitable for big data problems. However, the new
lower bound depends on (O(m^2)) variational parameters, which makes
optimization challenging when m is large. In this work we develop a new
approach for training inducing input GP models for classification problems.
We use a quadratic approximation of several terms in the aforementioned
evidence lower bound, obtaining analytical expressions for the optimal values
of most of the parameters, thus substantially reducing the dimension of the
optimization space. In our experiments we achieve results as good as or
better than those of the existing method. Moreover, our method doesn’t
require the user to manually set the learning rate, making it more practical
than the existing method.
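
The (O(nm^2)) cost mentioned above comes from replacing the full n x n kernel
matrix with quantities built from an n x m cross-covariance and an m x m
inducing-point covariance. The numpy sketch below illustrates this
Nystrom-style structure; the RBF kernel, the sizes and the probe-based error
check are assumptions for illustration and do not reproduce the paper's
classification bound.

    # Why inducing-input approximations cost O(n m^2): the full kernel is
    # approximated by K_nm K_mm^{-1} K_nm^T, built from an n x m and an
    # m x m matrix, without ever forming the n x n kernel.
    import numpy as np

    def rbf(a, b, lengthscale=1.0):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    rng = np.random.default_rng(0)
    n, m, dim = 2000, 50, 3                     # m inducing inputs, m << n
    X = rng.normal(size=(n, dim))               # training inputs
    Z = X[rng.choice(n, m, replace=False)]      # inducing inputs from the data

    K_mm = rbf(Z, Z) + 1e-6 * np.eye(m)         # m x m
    K_nm = rbf(X, Z)                            # n x m

    L = np.linalg.cholesky(K_mm)                # O(m^3)
    A = np.linalg.solve(L, K_nm.T)              # m x n, costs O(n m^2)

    # Compare the approximation to the exact kernel on a small probe block.
    idx = rng.choice(n, 5, replace=False)
    K_exact = rbf(X[idx], X[idx])
    K_approx = A[:, idx].T @ A[:, idx]
    print("max abs error on probe block:", np.abs(K_exact - K_approx).max())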

    Swarm Intelligence for Multiobjective Optimization of Extraction Process

    T. Ganesan, I. Elamvazuthi, P.Vasant
    Journal-ref: Handbook of Research on Modern Optimization Algorithms and
    Applications in Engineering and Economics, (2016), IGI Global, pp 516 – 544
    Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)

    Multi objective (MO) optimization is an emerging field which is increasingly
    being implemented in many industries globally. In this work, the MO
    optimization of the extraction process of bioactive compounds from the Gardenia
    Jasminoides Ellis fruit was solved. Three swarm-based algorithms have been
    applied in conjunction with normal-boundary intersection (NBI) method to solve
    this MO problem. The gravitational search algorithm (GSA) and the particle
    swarm optimization (PSO) technique were implemented in this work. In addition,
    a novel Hopfield-enhanced particle swarm optimization was developed and applied
to the extraction problem. By measuring the levels of dominance, the
optimality of the approximate Pareto frontiers produced by all the algorithms
was gauged and compared. In addition, by measuring the levels of convergence
of the frontier, some understanding is gained of the structure of the
objective space in terms of its relation to the level of frontier dominance.
Detailed comparative studies were conducted on all the algorithms employed
and developed in this work.

    Monte Carlo Connection Prover

    Michael Färber, Cezary Kaliszyk, Josef Urban
    Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Learning (cs.LG)

    Monte Carlo Tree Search (MCTS) is a technique to guide search in a large
    decision space by taking random samples and evaluating their outcome. In this
    work, we study MCTS methods in the context of the connection calculus and
    implement them on top of the leanCoP prover. This includes proposing useful
    proof-state evaluation heuristics that are learned from previous proofs, and
    proposing and automatically improving suitable MCTS strategies in this context.
    The system is trained and evaluated on a large suite of related problems coming
from the Mizar proof assistant, showing that it is capable of finding new and
    different proofs. To our knowledge, this is the first time MCTS has been
    applied to theorem proving.

    Answering Image Riddles using Vision and Reasoning through Probabilistic Soft Logic

    Somak Aditya, Yezhou Yang, Chitta Baral, Yiannis Aloimonos
    Comments: 14 pages, 10 figures
    Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

    In this work, we explore a genre of puzzles (“image riddles”) which involves
a set of images and a question. Answering these puzzles requires both visual
detection capabilities (including object and activity recognition) and
knowledge-based or commonsense reasoning. We compile a
    dataset of over 3k riddles where each riddle consists of 4 images and a
    groundtruth answer. The annotations are validated using crowd-sourced
    evaluation. We also define an automatic evaluation metric to track future
progress. Our task bears similarity to commonly known IQ tasks, such as
analogy solving and sequence filling, that are often used to test
intelligence.

    We develop a Probabilistic Reasoning-based approach that utilizes
    probabilistic commonsense knowledge to answer these riddles with a reasonable
    accuracy. We demonstrate the results of our approach using both automatic and
    human evaluations. Our approach achieves some promising results for these
    riddles and provides a strong baseline for future attempts. We make the entire
    dataset and related materials publicly available to the community in
    ImageRiddle Website (this http URL).


    Information Retrieval

    Approximate Near Neighbors for General Symmetric Norms

    Alexandr Andoni, Aleksandar Nikolov, Ilya Razenshteyn, Erik Waingarten
    Comments: 24 pages, no figures
    Subjects: Data Structures and Algorithms (cs.DS); Computational Geometry (cs.CG); Information Retrieval (cs.IR); Learning (cs.LG); Metric Geometry (math.MG)

    We show that every symmetric normed space admits an efficient nearest
    neighbor search data structure with doubly-logarithmic approximation.
Specifically, for every (n), (d = 2^{o(\frac{\log n}{\log\log n})}), and
every (d)-dimensional symmetric norm (\|\cdot\|), there exists a data
structure for (\mathrm{poly}(\log\log n))-approximate nearest neighbor search
over (\|\cdot\|) for (n)-point datasets achieving (n^{o(1)}) query time and
(n^{1+o(1)}) space. The main technical ingredient of the algorithm is a
low-distortion embedding of a symmetric norm into a low-dimensional iterated
product of top-(k) norms.

    We also show that our techniques cannot be extended to general norms.
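
The top-(k) norms that serve as the embedding target above are easy to state
concretely: the top-(k) norm of a vector is the sum of its k largest absolute
coordinates, so k = 1 recovers the l-infinity norm and k = d recovers the l_1
norm. The small sketch below only illustrates this norm family, not the
nearest-neighbor data structure itself.

    # Top-k norm: sum of the k largest absolute coordinates of a vector.
    import numpy as np

    def top_k_norm(x, k):
        return np.sort(np.abs(x))[::-1][:k].sum()

    x = np.array([3.0, -1.0, 4.0, 1.0, -5.0])
    for k in (1, 2, len(x)):
        print(f"top-{k} norm = {top_k_norm(x, k)}")
    # top-1 norm = 5.0 (max abs), top-5 norm = 14.0 (l_1 norm)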

    NCBO Ontology Recommender 2.0: An Enhanced Approach for Biomedical Ontology Recommendation

    Marcos Martinez-Romero, Clement Jonquet, Martin J. O'Connor, John Graybeal, Alejandro Pazos, Mark A. Musen
    Comments: 29 pages, 8 figures, 11 tables
    Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

    Biomedical researchers use ontologies to annotate their data with ontology
    terms, enabling better data integration and interoperability. However, the
    number, variety and complexity of current biomedical ontologies make it
    cumbersome for researchers to determine which ones to reuse for their specific
    needs. To overcome this problem, in 2010 the National Center for Biomedical
    Ontology (NCBO) released the Ontology Recommender, which is a service that
    receives a biomedical text corpus or a list of keywords and suggests ontologies
    appropriate for referencing the indicated terms. We developed a new version of
    the NCBO Ontology Recommender. Called Ontology Recommender 2.0, it uses a new
    recommendation approach that evaluates the relevance of an ontology to
    biomedical text data according to four criteria: (1) the extent to which the
    ontology covers the input data; (2) the acceptance of the ontology in the
    biomedical community; (3) the level of detail of the ontology classes that
    cover the input data; and (4) the specialization of the ontology to the domain
    of the input data. Our evaluation shows that the enhanced recommender provides
    higher quality suggestions than the original approach, providing better
    coverage of the input data, more detailed information about their concepts,
    increased specialization for the domain of the input data, and greater
    acceptance and use in the community. In addition, it provides users with more
    explanatory information, along with suggestions of not only individual
    ontologies but also groups of ontologies. It also can be customized to fit the
    needs of different scenarios. Ontology Recommender 2.0 combines the strengths
    of its predecessor with a range of adjustments and new features that improve
    its reliability and usefulness. Ontology Recommender 2.0 recommends over 500
    biomedical ontologies from the NCBO BioPortal platform, where it is openly
    available.
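
A recommender of this kind ultimately has to aggregate the four criteria into
a single relevance score per ontology. The sketch below shows one plausible
weighted aggregation; the weights and the toy scores are assumptions for
illustration and are not the actual scoring formula used by Ontology
Recommender 2.0.

    # Hedged sketch: combine coverage, acceptance, detail and specialization
    # scores (each in [0, 1]) into one relevance score and rank candidates.
    def relevance(scores, weights=None):
        weights = weights or {"coverage": 0.55, "acceptance": 0.15,
                              "detail": 0.15, "specialization": 0.15}
        return sum(weights[c] * scores[c] for c in weights)

    candidates = {
        "ONTOLOGY_A": {"coverage": 0.9, "acceptance": 0.4,
                       "detail": 0.7, "specialization": 0.6},
        "ONTOLOGY_B": {"coverage": 0.6, "acceptance": 0.9,
                       "detail": 0.5, "specialization": 0.3},
    }
    ranked = sorted(candidates, key=lambda o: relevance(candidates[o]),
                    reverse=True)
    print(ranked)  # best-scoring ontology first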


    Computation and Language

    Generative Deep Neural Networks for Dialogue: A Short Review

    Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, Joelle Pineau
    Comments: 6 pages, 1 figure, 3 tables; NIPS 2016 workshop on Learning Methods for Dialogue
    Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

    Researchers have recently started investigating deep neural networks for
    dialogue applications. In particular, generative sequence-to-sequence (Seq2Seq)
    models have shown promising results for unstructured tasks, such as word-level
    dialogue response generation. The hope is that such models will be able to
    leverage massive amounts of data to learn meaningful natural language
    representations and response generation strategies, while requiring a minimum
    amount of domain knowledge and hand-crafting. An important challenge is to
    develop models that can effectively incorporate dialogue context and generate
    meaningful and diverse responses. In support of this goal, we review recently
    proposed models based on generative encoder-decoder neural network
    architectures, and show that these models have better ability to incorporate
    long-term dialogue history, to model uncertainty and ambiguity in dialogue, and
    to generate responses with high-level compositional structure.

    Visualizing and Understanding Curriculum Learning for Long Short-Term Memory Networks

    Volkan Cirik, Eduard Hovy, Louis-Philippe Morency
    Subjects: Computation and Language (cs.CL); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

    Curriculum Learning emphasizes the order of training instances in a
    computational learning setup. The core hypothesis is that simpler instances
    should be learned early as building blocks to learn more complex ones. Despite
its usefulness, it is still unknown how exactly the internal representations
of models are affected by curriculum learning. In this paper, we study the effect
    of curriculum learning on Long Short-Term Memory (LSTM) networks, which have
    shown strong competency in many Natural Language Processing (NLP) problems. Our
experiments on a sentiment analysis task and on a synthetic task similar to
sequence prediction tasks in NLP show that curriculum learning has a positive
effect on the LSTM’s internal states by biasing the model towards building
constructive representations, i.e., the internal representations at previous
timesteps are used as building blocks for the final prediction. We also find
that smaller models improve significantly when they are trained with
curriculum learning.
    Lastly, we show that curriculum learning helps more when the amount of training
    data is limited.
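
In its simplest form, a curriculum is just an ordering of the training set
from easy to hard before (or while) training. The sketch below orders toy
sentiment examples by sentence length; the difficulty measure and the data
are assumptions for illustration, not the criterion studied in the paper.

    # Minimal curriculum: sort training instances from "simple" to "complex"
    # and feed them to the learner in that order.
    train_set = [
        ("a masterpiece of subtle, slow-burning storytelling", 1),
        ("great movie", 1),
        ("boring", 0),
        ("the plot never recovers from its aimless first hour", 0),
    ]

    def difficulty(example):
        text, _label = example
        return len(text.split())        # shorter sentences are "easier"

    curriculum = sorted(train_set, key=difficulty)
    for text, label in curriculum:
        print(f"{difficulty((text, label)):2d} words | label={label} | {text}")
    # A trainer would now consume the examples in this order instead of a
    # fully shuffled order.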

    Word and Document Embeddings based on Neural Network Approaches

    Siwei Lai
    Comments: PhD thesis, in Chinese, Institute of Automation, Chinese Academy of Sciences, 2016
    Subjects: Computation and Language (cs.CL)

    Data representation is a fundamental task in machine learning. The
    representation of data affects the performance of the whole machine learning
system. For a long time, data representation was done by feature
engineering, and researchers aimed at designing better features for specific
tasks. Recently, the rapid development of deep learning and representation
    learning has brought new inspiration to various domains.

    In natural language processing, the most widely used feature representation
is the Bag-of-Words model. This model suffers from data sparsity and cannot
preserve word order information. Other features, such as part-of-speech tags
or more complex syntactic features, are in most cases suited only to specific
tasks.
    This thesis focuses on word representation and document representation. We
    compare the existing systems and present our new model.

    First, for generating word embeddings, we make comprehensive comparisons
    among existing word embedding models. In terms of theory, we figure out the
    relationship between the two most important models, i.e., Skip-gram and GloVe.
    In our experiments, we analyze three key points in generating word embeddings,
    including the model construction, the training corpus and parameter design. We
    evaluate word embeddings with three types of tasks, and we argue that they
    cover the existing use of word embeddings. Through theory and practical
    experiments, we present some guidelines for how to generate a good word
    embedding.

Second, for Chinese character and word representation, we introduce the
joint training of Chinese characters and words. …

    Third, for document representation, we analyze the existing document
    representation models, including recursive NNs, recurrent NNs and convolutional
    NNs. We point out the drawbacks of these models and present our new model, the
    recurrent convolutional neural networks. …

    Variable Computation in Recurrent Neural Networks

    Yacine Jernite, Edouard Grave, Armand Joulin, Tomas Mikolov
    Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Learning (cs.LG)

    Recurrent neural networks (RNNs) have been used extensively and with
    increasing success to model various types of sequential data. Much of this
    progress has been achieved through devising recurrent units and architectures
    with the flexibility to capture complex statistics in the data, such as long
    range dependency or localized attention phenomena. However, while many
    sequential data (such as video, speech or language) can have highly variable
    information flow, most recurrent models still consume input features at a
    constant rate and perform a constant number of computations per time step,
    which can be detrimental to both speed and model capacity. In this paper, we
    explore a modification to existing recurrent units which allows them to learn
    to vary the amount of computation they perform at each step, without prior
    knowledge of the sequence’s time structure. We show experimentally that not
    only is our model more computationally efficient, it also leads to better
    performance overall on our evaluation tasks.


    Distributed, Parallel, and Cluster Computing

    Parallelizing Word2Vec in Multi-Core and Many-Core Architectures

    Shihao Ji, Nadathur Satish, Sheng Li, Pradeep Dubey
    Comments: NIPS Workshop on Efficient Methods for Deep Neural Networks (2016)
    Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)

    Word2vec is a widely used algorithm for extracting low-dimensional vector
    representations of words. State-of-the-art algorithms including those by
    Mikolov et al. have been parallelized for multi-core CPU architectures, but are
    based on vector-vector operations with “Hogwild” updates that are
    memory-bandwidth intensive and do not efficiently use computational resources.
    In this paper, we propose “HogBatch” by improving reuse of various data
    structures in the algorithm through the use of minibatching and negative sample
    sharing, hence allowing us to express the problem using matrix multiply
    operations. We also explore different techniques to distribute word2vec
    computation across nodes in a compute cluster, and demonstrate good strong
    scalability up to 32 nodes. The new algorithm is particularly suitable for
    modern multi-core/many-core architectures, especially Intel’s latest Knights
Landing processors, and allows us to scale the computation nearly linearly
across cores and nodes and process hundreds of millions of words per second,
which, to the best of our knowledge, makes it the fastest word2vec
implementation.
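
The core of the minibatching-plus-negative-sample-sharing idea is that many
scattered vector-vector dot products collapse into a single matrix-matrix
multiply. The numpy sketch below shows that reformulation for one minibatch;
the dimensions and random embeddings are assumptions for illustration only.

    # With negative-sample sharing, scoring a minibatch of center words
    # against a shared set of negatives becomes one GEMM instead of many
    # scattered dot products.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab, dim = 10000, 128
    W_in = rng.normal(scale=0.1, size=(vocab, dim))    # input (center) vectors
    W_out = rng.normal(scale=0.1, size=(vocab, dim))   # output (context) vectors

    batch_centers = rng.integers(0, vocab, size=64)      # minibatch of centers
    shared_negatives = rng.integers(0, vocab, size=16)   # negatives shared by the batch

    scores = W_in[batch_centers] @ W_out[shared_negatives].T   # shape (64, 16)
    print(scores.shape)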

    Polynomial self-stabilizing algorithm and proof for a 2/3-approximation of a maximum matching

    Johanne Cohen, Khaled Maâmra, George Manoussakis, Laurence Pilard
    Comments: 16 pages, 6 figures
    Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

We present the first polynomial self-stabilizing algorithm for finding a
2/3-approximation of a maximum matching in a general graph. The previous best
known algorithm was presented by Manne et al. [ManneMPT11] and has a
sub-exponential time complexity under the distributed adversarial daemon
[Coor]. Our new algorithm is an adaptation of the Manne et al. algorithm and
works under the same daemon, but with a time complexity of (O(n^3)) moves.
Moreover, our algorithm only needs one more boolean variable than the
previous one; thus, as in the Manne et al. algorithm, it only requires a
constant amount of memory space (three identifiers and two booleans per
node).

    GaDei: On Scale-up Training As A Service For Deep Learning

    Wei Zhang, Minwei Feng, Yunhui Zheng, Yufei Ren, Yandong Wang, Ji Liu, Peng Liu, Bing Xiang, Li Zhang, Bowen Zhou
    Subjects: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Learning (cs.LG)

    Deep learning (DL) training-as-a-service (TaaS) is an important emerging
    industrial workload. The unique challenge of TaaS is that it must satisfy a
wide range of customers who have neither the experience nor the resources to tune DL
    hyper-parameters, and meticulous tuning for each user’s dataset is
    prohibitively expensive. Therefore, TaaS hyper-parameters must be fixed with
    values that are applicable to all users. IBM Watson Natural Language Classifier
    (NLC) service, the most popular IBM cognitive service used by thousands of
    enterprise-level clients around the globe, is a typical TaaS service. By
    evaluating the NLC workloads, we show that only the conservative
    hyper-parameter setup (e.g., small mini-batch size and small learning rate) can
    guarantee acceptable model accuracy for a wide range of customers. We further
    justify theoretically why such a setup guarantees better model convergence in
    general. Unfortunately, the small mini-batch size causes a high volume of
    communication traffic in a parameter-server based system. We characterize the
    high communication bandwidth requirement of TaaS using representative
    industrial deep learning workloads and demonstrate that none of the
    state-of-the-art scale-up or scale-out solutions can satisfy such a
    requirement. We then present GaDei, an optimized shared-memory based scale-up
    parameter server design. We prove that the designed protocol is deadlock-free
    and it processes each gradient exactly once. Our implementation is evaluated on
    both commercial benchmarks and public benchmarks to demonstrate that it
    significantly outperforms the state-of-the-art parameter-server based
implementation while maintaining the required accuracy, and that it reaches
nearly the best possible runtime performance, constrained only by hardware
limitations. Furthermore, to the best of our knowledge, GaDei is the
    only scale-up DL system that provides fault-tolerance.

    Fast low-level pattern matching algorithm

    Janja Paliska Soldo, Ana Sovic Krzic, and Damir Sersic
    Comments: 14 pages, 7 tables This work has been fully supported by Croatian Science Foundation under the project UIP-11-2013-7353 Algorithms for Genome Sequence Analysis
    Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Genomics (q-bio.GN)

This paper focuses on pattern matching in DNA sequences. It was inspired by
a previously reported method that encodes both the pattern and the sequence
using prime numbers. Although fast, that method is limited to rather small
pattern lengths due to computing precision problems. Our approach
successfully deals with large patterns thanks to an implementation based on
modular arithmetic. To obtain results very fast, the code was adapted for
multithreaded and parallel implementations. The method is reduced to
assembly-level instructions, and the final result shows significant time and
memory savings compared to the reference algorithm.
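
In the same spirit as the abstract (numeric encoding of DNA plus modular
arithmetic to avoid precision problems with long patterns), the sketch below
implements a generic Rabin-Karp-style rolling-hash matcher over the DNA
alphabet. It illustrates the modular-arithmetic idea only and is not the
prime-number encoding of the referenced method.

    # Rolling-hash DNA pattern matching with modular arithmetic: long
    # patterns never overflow because values are kept modulo a large prime.
    MOD = (1 << 61) - 1
    BASE = 4
    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def find_all(sequence, pattern):
        m = len(pattern)
        if m == 0 or m > len(sequence):
            return []
        p_val = 0
        s_val = 0
        for i in range(m):
            p_val = (p_val * BASE + CODE[pattern[i]]) % MOD
            s_val = (s_val * BASE + CODE[sequence[i]]) % MOD
        high = pow(BASE, m - 1, MOD)       # weight of the outgoing character
        hits = []
        for i in range(len(sequence) - m + 1):
            # verify on hash match to rule out collisions
            if s_val == p_val and sequence[i:i + m] == pattern:
                hits.append(i)
            if i + m < len(sequence):
                s_val = ((s_val - CODE[sequence[i]] * high) * BASE
                         + CODE[sequence[i + m]]) % MOD
        return hits

    print(find_all("ACGTACGTTACGT", "ACGT"))  # [0, 4, 9]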


    Learning

    Faster variational inducing input Gaussian process classification

    Pavel Izmailov, Dmitry Kropotov
    Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

    Gaussian processes (GP) provide a prior over functions and allow finding
    complex regularities in data. Gaussian processes are successfully used for
    classification/regression problems and dimensionality reduction. In this work
    we consider the classification problem only. The complexity of standard methods
    for GP-classification scales cubically with the size of the training dataset.
    This complexity makes them inapplicable to big data problems. Therefore, a
    variety of methods were introduced to overcome this limitation. In the paper we
    focus on methods based on so called inducing inputs. This approach is based on
    variational inference and proposes a particular lower bound for marginal
    likelihood (evidence). This bound is then maximized w.r.t. parameters of kernel
    function of the Gaussian process, thus fitting the model to data. The
    computational complexity of this method is (O(nm^2)), where (m) is the number
    of inducing inputs used by the model and is assumed to be substantially smaller
than the size of the dataset (n). Recently, a new evidence lower bound for
the GP-classification problem was introduced. It allows using stochastic
optimization, which makes it suitable for big data problems. However, the new
lower bound depends on (O(m^2)) variational parameters, which makes
optimization challenging when m is large. In this work we develop a new
approach for training inducing input GP models for classification problems.
We use a quadratic approximation of several terms in the aforementioned
evidence lower bound, obtaining analytical expressions for the optimal values
of most of the parameters, thus substantially reducing the dimension of the
optimization space. In our experiments we achieve results as good as or
better than those of the existing method. Moreover, our method doesn’t
require the user to manually set the learning rate, making it more practical
than the existing method.

    Robust and Scalable Column/Row Sampling from Corrupted Big Data

    Mostafa Rahmani, George Atia
    Subjects: Learning (cs.LG); Numerical Analysis (cs.NA); Applications (stat.AP); Machine Learning (stat.ML)

Conventional sampling techniques fall short of drawing descriptive sketches
of the data when the data is grossly corrupted, as such corruptions break the
low-rank structure required for them to perform satisfactorily. In this
paper, we present new sampling algorithms which can locate the informative
columns in the presence of severe data corruption. In addition, we develop new scalable
    randomized designs of the proposed algorithms. The proposed approach is
    simultaneously robust to sparse corruption and outliers and substantially
    outperforms the state-of-the-art robust sampling algorithms as demonstrated by
    experiments conducted using both real and synthetic data.

    A Characterization of Prediction Errors

    Christopher Meek
    Subjects: Learning (cs.LG)

    Understanding prediction errors and determining how to fix them is critical
    to building effective predictive systems. In this paper, we delineate four
    types of prediction errors and demonstrate that these four types characterize
    all prediction errors. In addition, we describe potential remedies and tools
    that can be used to reduce the uncertainty when trying to determine the source
of a prediction error and when trying to take action to remove a prediction
error.

    Associative Memories to Accelerate Approximate Nearest Neighbor Search

    Vincent Gripon, Matthias Löwe, Franck Vermet
    Comments: 25 pages, 12 figures
    Subjects: Learning (cs.LG); Probability (math.PR)

    Nearest neighbor search is a very active field in machine learning for it
    appears in many application cases, including classification and object
retrieval. In its canonical version, the complexity of the search is linear
in both the dimension and the cardinality of the collection of vectors the
search is performed in. Recently, many works have focused on reducing the
dimension of the vectors using quantization techniques or hashing, while
providing an approximate result. In this paper we focus instead on tackling
the cardinality of the collection of vectors. Namely, we introduce a technique that partitions
    the collection of vectors and stores each part in its own associative memory.
When a query vector is given to the system, the associative memories are
polled to identify which one contains the closest match. Then an exhaustive
search is conducted only on the part of the vectors stored in the selected
associative memory. We study the effectiveness of the system when the
messages to store are generated from i.i.d. uniform (\pm 1) random variables
or from 0-1 sparse i.i.d. random variables. We also conduct experiments on
both synthetic and real data and show that it is possible to achieve
interesting trade-offs between complexity and accuracy.
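
The two-stage search described above (poll a cheap per-part model, then
search exhaustively inside the selected part) can be mimicked with ordinary
clustering. In the hedged sketch below, k-means centroids play the role of
the associative memories; this is only a stand-in for the memory model
studied in the paper.

    # Partition the collection, pick the most promising part for a query,
    # then run an exhaustive search inside that part only.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    data = rng.normal(size=(5000, 32))

    parts = 16
    km = KMeans(n_clusters=parts, n_init=4, random_state=0).fit(data)

    def search(query):
        part = int(km.predict(query[None, :])[0])       # poll: pick one part
        members = np.where(km.labels_ == part)[0]       # vectors stored there
        dists = np.linalg.norm(data[members] - query, axis=1)
        return members[int(np.argmin(dists))]           # exhaustive search inside the part

    query = data[123] + 0.01 * rng.normal(size=32)
    print(search(query))  # likely 123, found while scanning ~1/16 of the data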

    Approximate Near Neighbors for General Symmetric Norms

    Alexandr Andoni, Aleksandar Nikolov, Ilya Razenshteyn, Erik Waingarten
    Comments: 24 pages, no figures
    Subjects: Data Structures and Algorithms (cs.DS); Computational Geometry (cs.CG); Information Retrieval (cs.IR); Learning (cs.LG); Metric Geometry (math.MG)

    We show that every symmetric normed space admits an efficient nearest
    neighbor search data structure with doubly-logarithmic approximation.
Specifically, for every (n), (d = 2^{o(\frac{\log n}{\log\log n})}), and
every (d)-dimensional symmetric norm (\|\cdot\|), there exists a data
structure for (\mathrm{poly}(\log\log n))-approximate nearest neighbor search
over (\|\cdot\|) for (n)-point datasets achieving (n^{o(1)}) query time and
(n^{1+o(1)}) space. The main technical ingredient of the algorithm is a
low-distortion embedding of a symmetric norm into a low-dimensional iterated
product of top-(k) norms.

    We also show that our techniques cannot be extended to general norms.

    Structural Causal Models: Cycles, Marginalizations, Exogenous Reparametrizations and Reductions

    Stephan Bongers, Jonas Peters, Bernhard Schölkopf, Joris M. Mooij
    Comments: Will probably be submitted to The Annals of Statistics
    Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Learning (cs.LG)

    Structural causal models (SCMs), also known as non-parametric structural
    equation models (NP-SEMs), are widely used for causal modeling purposes. In
    this paper, we give a rigorous treatment of structural causal models, dealing
    with measure-theoretic complications that arise in the presence of cyclic
    relations. The central question studied in this paper is: given a (possibly
    cyclic) SCM defined on a large system (consisting of observable endogenous and
    latent exogenous variables), can we “project it down” to an SCM that describes
    a subsystem (consisting of a subset of the observed endogenous variables and
    possibly different latent exogenous variables) in order to obtain a more
    parsimonious but equivalent representation of the subsystem? We define a
    marginalization operation that effectively removes a subset of the endogenous
    variables from the model, and a class of mappings, exogenous
    reparameterizations, that can be used to reduce the space of exogenous
    variables. We show that both operations preserve the causal semantics of the
    model and that under mild conditions they can lead to a significant reduction
    of the model complexity, at least in terms of the number of variables in the
    model. We argue that for the task of estimating an SCM from data, the existence
    of “smooth” reductions would be desirable. We provide several conditions under
    which the existence of such reductions can be shown, but also provide a
    counterexample that shows that such reductions do not exist in general. The
    latter result implies that existing approaches to estimate linear or Markovian
    SCMs from data cannot be extended to general SCMs.

    GaDei: On Scale-up Training As A Service For Deep Learning

    Wei Zhang, Minwei Feng, Yunhui Zheng, Yufei Ren, Yandong Wang, Ji Liu, Peng Liu, Bing Xiang, Li Zhang, Bowen Zhou
    Subjects: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Learning (cs.LG)

    Deep learning (DL) training-as-a-service (TaaS) is an important emerging
    industrial workload. The unique challenge of TaaS is that it must satisfy a
wide range of customers who have neither the experience nor the resources to tune DL
    hyper-parameters, and meticulous tuning for each user’s dataset is
    prohibitively expensive. Therefore, TaaS hyper-parameters must be fixed with
    values that are applicable to all users. IBM Watson Natural Language Classifier
    (NLC) service, the most popular IBM cognitive service used by thousands of
    enterprise-level clients around the globe, is a typical TaaS service. By
    evaluating the NLC workloads, we show that only the conservative
    hyper-parameter setup (e.g., small mini-batch size and small learning rate) can
    guarantee acceptable model accuracy for a wide range of customers. We further
    justify theoretically why such a setup guarantees better model convergence in
    general. Unfortunately, the small mini-batch size causes a high volume of
    communication traffic in a parameter-server based system. We characterize the
    high communication bandwidth requirement of TaaS using representative
    industrial deep learning workloads and demonstrate that none of the
    state-of-the-art scale-up or scale-out solutions can satisfy such a
    requirement. We then present GaDei, an optimized shared-memory based scale-up
    parameter server design. We prove that the designed protocol is deadlock-free
    and it processes each gradient exactly once. Our implementation is evaluated on
    both commercial benchmarks and public benchmarks to demonstrate that it
    significantly outperforms the state-of-the-art parameter-server based
implementation while maintaining the required accuracy, and that it reaches
nearly the best possible runtime performance, constrained only by hardware
limitations. Furthermore, to the best of our knowledge, GaDei is the
    only scale-up DL system that provides fault-tolerance.

    Visualizing and Understanding Curriculum Learning for Long Short-Term Memory Networks

    Volkan Cirik, Eduard Hovy, Louis-Philippe Morency
    Subjects: Computation and Language (cs.CL); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

    Curriculum Learning emphasizes the order of training instances in a
    computational learning setup. The core hypothesis is that simpler instances
    should be learned early as building blocks to learn more complex ones. Despite
its usefulness, it is still unknown how exactly the internal representations
of models are affected by curriculum learning. In this paper, we study the effect
    of curriculum learning on Long Short-Term Memory (LSTM) networks, which have
    shown strong competency in many Natural Language Processing (NLP) problems. Our
experiments on a sentiment analysis task and on a synthetic task similar to
sequence prediction tasks in NLP show that curriculum learning has a positive
effect on the LSTM’s internal states by biasing the model towards building
constructive representations, i.e., the internal representations at previous
timesteps are used as building blocks for the final prediction. We also find
that smaller models improve significantly when they are trained with
curriculum learning.
    Lastly, we show that curriculum learning helps more when the amount of training
    data is limited.

    Variable Computation in Recurrent Neural Networks

    Yacine Jernite, Edouard Grave, Armand Joulin, Tomas Mikolov
    Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Learning (cs.LG)

    Recurrent neural networks (RNNs) have been used extensively and with
    increasing success to model various types of sequential data. Much of this
    progress has been achieved through devising recurrent units and architectures
    with the flexibility to capture complex statistics in the data, such as long
    range dependency or localized attention phenomena. However, while many
    sequential data (such as video, speech or language) can have highly variable
    information flow, most recurrent models still consume input features at a
    constant rate and perform a constant number of computations per time step,
    which can be detrimental to both speed and model capacity. In this paper, we
    explore a modification to existing recurrent units which allows them to learn
    to vary the amount of computation they perform at each step, without prior
    knowledge of the sequence’s time structure. We show experimentally that not
    only is our model more computationally efficient, it also leads to better
    performance overall on our evaluation tasks.

    Learning Interpretability for Visualizations using Adapted Cox Models through a User Experiment

    Adrien Bibal, Benoit Frénay
    Comments: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems
    Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Learning (cs.LG)

    In order to be useful, visualizations need to be interpretable. This paper
    uses a user-based approach to combine and assess quality measures in order to
    better model user preferences. Results show that cluster separability measures
    are outperformed by a neighborhood conservation measure, even though the former
    are usually considered as intuitively representative of user motives. Moreover,
    combining measures, as opposed to using a single measure, further improves
    prediction performances.

    Compacting Neural Network Classifiers via Dropout Training

    Yotaro Kubo, George Tucker, Simon Wiesler
    Comments: Accepted to NIPS Workshop on Efficient Methods for Deep Neural Networks
    Subjects: Machine Learning (stat.ML); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

    We introduce dropout compaction, a novel method for training feed-forward
    neural networks which realizes the performance gains of training a large model
    with dropout regularization, yet extracts a compact neural network for run-time
    efficiency. In the proposed method, we introduce a sparsity-inducing prior on
    the per unit dropout retention probability so that the optimizer can
    effectively prune hidden units during training. By changing the prior
    hyperparameters, we can control the size of the resulting network. We performed
    a systematic comparison of dropout compaction and competing methods on several
    real-world speech recognition tasks and found that dropout compaction achieved
    comparable accuracy with fewer than 50% of the hidden units, translating to a
    2.5x speedup in run-time.
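
The pruning decision itself is simple once training has driven many per-unit
retention probabilities toward zero: units whose retention probability ends
up below a threshold are dropped from the compact network. The sketch below
shows that final step on made-up retention logits; the values, the L1-style
penalty and the 0.5 threshold are assumptions for illustration, not the
paper's prior or training procedure.

    # Hedged sketch of the final pruning step in dropout compaction.
    import numpy as np

    rng = np.random.default_rng(0)
    # Simulated post-training logits: most units were driven "off" by the prior.
    retention_logits = np.concatenate([rng.normal(-4.0, 0.5, size=400),
                                       rng.normal(3.0, 0.5, size=112)])
    retention = 1.0 / (1.0 + np.exp(-retention_logits))   # per-unit retention probability

    sparsity_penalty = retention.sum()    # the kind of term added to the training loss
    keep = retention > 0.5                # units that survive into the compact network
    print(f"penalty={sparsity_penalty:.1f}, kept {keep.sum()} of {keep.size} hidden units")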

    A Generalized Stochastic Variational Bayesian Hyperparameter Learning Framework for Sparse Spectrum Gaussian Process Regression

    Quang Minh Hoang, Trong Nghia Hoang, Kian Hsiang Low
    Comments: 31st AAAI Conference on Artificial Intelligence (AAAI 2017), Extended version with proofs, 11 pages
    Subjects: Machine Learning (stat.ML); Learning (cs.LG)

    While much research effort has been dedicated to scaling up sparse Gaussian
    process (GP) models based on inducing variables for big data, little attention
    is afforded to the other less explored class of low-rank GP approximations that
    exploit the sparse spectral representation of a GP kernel. This paper presents
    such an effort to advance the state of the art of sparse spectrum GP models to
    achieve competitive predictive performance for massive datasets. Our
    generalized framework of stochastic variational Bayesian sparse spectrum GP
    (sVBSSGP) models addresses their shortcomings by adopting a Bayesian treatment
    of the spectral frequencies to avoid overfitting, modeling these frequencies
    jointly in its variational distribution to enable their interaction a
    posteriori, and exploiting local data for boosting the predictive performance.
However, such structural improvements result in a variational lower bound
that is intractable to optimize. To resolve this, we exploit a variational
    parameterization trick to make it amenable to stochastic optimization.
    Interestingly, the resulting stochastic gradient has a linearly decomposable
    structure that can be exploited to refine our stochastic optimization method to
    incur constant time per iteration while preserving its property of being an
    unbiased estimator of the exact gradient of the variational lower bound.
    Empirical evaluation on real-world datasets shows that sVBSSGP outperforms
    state-of-the-art stochastic implementations of sparse GP models.

    Monte Carlo Connection Prover

    Michael Färber, Cezary Kaliszyk, Josef Urban
    Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Learning (cs.LG)

    Monte Carlo Tree Search (MCTS) is a technique to guide search in a large
    decision space by taking random samples and evaluating their outcome. In this
    work, we study MCTS methods in the context of the connection calculus and
    implement them on top of the leanCoP prover. This includes proposing useful
    proof-state evaluation heuristics that are learned from previous proofs, and
    proposing and automatically improving suitable MCTS strategies in this context.
    The system is trained and evaluated on a large suite of related problems coming
from the Mizar proof assistant, showing that it is capable of finding new and
    different proofs. To our knowledge, this is the first time MCTS has been
    applied to theorem proving.

    Analysis of a Design Pattern for Teaching with Features and Labels

    Christopher Meek, Patrice Simard, Xiaojin Zhu
    Comments: Also available at this https URL
    Subjects: Artificial Intelligence (cs.AI); Learning (cs.LG)

    We study the task of teaching a machine to classify objects using features
    and labels. We introduce the Error-Driven-Featuring design pattern for teaching
    using features and labels in which a teacher prefers to introduce features only
    if they are needed. We analyze the potential risks and benefits of this
    teaching pattern through the use of teaching protocols, illustrative examples,
    and by providing bounds on the effort required for an optimal machine teacher
using a linear learning algorithm, the most commonly used type of learner in
    interactive machine learning systems. Our analysis provides a deeper
    understanding of potential trade-offs of using different learning algorithms
    and between the effort required for featuring (creating new features) and
    labeling (providing labels for objects).

    Increasing the Interpretability of Recurrent Neural Networks Using Hidden Markov Models

    Viktoriya Krakovna, Finale Doshi-Velez
    Comments: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems. arXiv admin note: substantial text overlap with arXiv:1606.05320
    Subjects: Machine Learning (stat.ML); Learning (cs.LG)

    As deep neural networks continue to revolutionize various application
    domains, there is increasing interest in making these powerful models more
    understandable and interpretable, and narrowing down the causes of good and bad
    predictions. We focus on recurrent neural networks, state of the art models in
    speech recognition and translation. Our approach to increasing interpretability
    is by combining a long short-term memory (LSTM) model with a hidden Markov
    model (HMM), a simpler and more transparent model. We add the HMM state
    probabilities to the output layer of the LSTM, and then train the HMM and LSTM
    either sequentially or jointly. The LSTM can make use of the information from
    the HMM, and fill in the gaps when the HMM is not performing well. A small
    hybrid model usually performs better than a standalone LSTM of the same size,
    especially on smaller data sets. We test the algorithms on text data and
    medical time series data, and find that the LSTM and HMM learn complementary
    information about the features in the text.

    "Influence Sketching": Finding Influential Samples In Large-Scale Regressions

    Mike Wojnowicz, Ben Cruz, Xuan Zhao, Brian Wallace, Matt Wolff, Jay Luan, Caleb Crable
    Subjects: Machine Learning (stat.ML); Learning (cs.LG)

    There is an especially strong need in modern large-scale data analysis to
    prioritize samples for manual inspection. For example, the inspection could
    target important mislabeled samples or key vulnerabilities exploitable by an
    adversarial attack. In order to solve the “needle in the haystack” problem of
    which samples to inspect, we develop a new scalable version of Cook’s distance,
    a classical statistical technique for identifying samples which unusually
    strongly impact the fit of a regression model (and its downstream predictions).
    In order to scale this technique up to very large and high-dimensional
    datasets, we introduce a new algorithm which we call “influence sketching.”
    Influence sketching embeds random projections within the influence computation;
    in particular, the influence score is calculated using the randomly projected
    pseudo-dataset from the post-convergence General Linear Model (GLM). We
    validate that influence sketching can reliably and successfully discover
    influential samples by applying the technique to a malware detection dataset of
    over 2 million executable files, each represented with almost 100,000 features.
    For example, we find that randomly deleting approximately 10% of training
    samples reduces predictive accuracy only slightly from 99.47% to 99.45%,
    whereas deleting the same number of samples with high influence sketch scores
    reduces predictive accuracy all the way down to 90.24%. Moreover, we find that
    influential samples are especially likely to be mislabeled. In the case study,
    we manually inspect the most influential samples, and find that influence
    sketching pointed us to new, previously unidentified pieces of malware.
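
A rough picture of the mechanics: Cook's distance needs leverages and
residuals of a regression fit, which is expensive when the design matrix is
very wide, so the influence scores are instead computed from a randomly
projected copy of the features. The sketch below does this for plain least
squares on synthetic data; the projection width, the planted outlier and the
least-squares setting (rather than the paper's post-convergence GLM
pseudo-data) are assumptions for illustration.

    # Sketched Cook's distance: leverages and residuals are computed from a
    # random projection of the (very wide) feature matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, k = 1000, 5000, 64                  # many features, small sketch width
    X = rng.normal(size=(n, p))
    beta = np.zeros(p); beta[:10] = 1.0
    y = X @ beta + rng.normal(size=n)
    y[7] += 25.0                              # plant one grossly mislabeled sample

    R = rng.normal(size=(p, k)) / np.sqrt(k)  # random projection
    Xs = X @ R                                # sketched design matrix, n x k

    G = np.linalg.inv(Xs.T @ Xs)
    h = np.einsum("ij,jk,ik->i", Xs, G, Xs)            # leverages (hat-matrix diagonal)
    resid = y - Xs @ (G @ (Xs.T @ y))                  # residuals of the sketched fit
    s2 = (resid**2).sum() / (n - k)
    cooks = (resid**2 / (k * s2)) * h / (1.0 - h) ** 2

    print("most influential sample:", int(np.argmax(cooks)))  # expected to flag index 7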


    Information Theory

    Optimal Key Consensus in Presence of Noise

    Zhengzhong Jin, Yunlei Zhao
    Subjects: Information Theory (cs.IT); Cryptography and Security (cs.CR)

    In this work, we introduce and formalize a new primitive, referred to as key
    consensus (KC), and its asymmetric variant AKC, which are for reaching
    consensus from close values. Some inherent constraints on the relationship
    among bandwidth, correctness and consensus range, for any KC and AKC schemes,
    are then revealed, which are particularly instrumental in choosing and
    evaluating parameters towards different optimization or balance goals. KC and
    AKC are fundamental to lattice based cryptography, in the sense that a list of
    cryptographic primitives based on LWE or Ring-LWE (including key exchange,
    public key encryption, oblivious transfer, and more) can be modularly
constructed from them. As a conceptual contribution, this greatly simplifies
the design and analysis of these cryptosystems in the future.

    Highly practical KC and AKC schemes are then designed and analyzed, within a
    generalized framework and with tight constraints on parameters that are almost
    the same as the inherent ones discovered. The structure generalization and
    constraint tightness allow us to choose and evaluate parameters towards optimal
    balance among security, computational cost, bandwidth, consensus range, and
    error rate. When applied to LWE or RLWE based cryptosystems, generally
    speaking, by carefully choosing parameters they can result in (to our
    knowledge) state-of-the-art practical schemes of key exchange, CPA-secure
    public key encryption, and oblivious transfer.

    Multi-User Millimeter Wave MIMO with Full-Dimensional Lens Antenna Array

    Yong Zeng, Lu Yang, Rui Zhang
    Comments: 14 pages, 9 figures, submitted for possible journal publication
    Subjects: Information Theory (cs.IT)

    Millimeter wave (mmWave) communication by utilizing lens antenna arrays is a
    promising technique for realizing cost-effective 5G wireless systems with large
    MIMO (multiple-input multiple-output) but only limited radio frequency (RF)
    chains. This paper studies an uplink multi-user mmWave single-sided lens MIMO
    system, where only the base station (BS) is equipped with a full-dimensional
    (FD) lens antenna array with both elevation and azimuth angle resolution
    capabilities, and each mobile station (MS) employs the conventional uniform
    planar array (UPA) without the lens. By exploiting the angle-dependent energy
    focusing property of the lens antenna array at the BS as well as the multi-path
    sparsity of mmWave channels, we propose a low-complexity path division multiple
    access (PDMA) scheme, which enables virtually interference-free multi-user
communications when the angles of arrival (AoAs) of all MS multi-path signals
    are sufficiently separable at the BS. To this end, a new technique called path
    delay compensation is proposed at the BS to effectively transform the
    multi-user frequency-selective MIMO channels to parallel frequency-flat
    small-size MIMO channels for different MSs, for each of which the
low-complexity single-carrier (SC) transmission is applied. For general
    scenarios with insufficient AoA separations, analog beamforming at the MSs and
    digital combining at the BS are jointly designed to maximize the achievable
    sum-rate of the MSs based on their effective MIMO channels resulting from path
    delay compensation. In addition, we propose a new and efficient channel
    estimation scheme tailored for PDMA, which requires negligible training
    overhead in practical mmWave systems and yet leads to comparable performance as
    that based on perfect channel state information (CSI).

    Wireless-powered relaying with finite block-length codes

    Mahdi Haghifam, Behrooz Makki, Masoumeh Nasiri-Kenari, Tommy Svensson, Michele Zorzi
    Subjects: Information Theory (cs.IT)

    This paper studies the outage probability and the throughput of
    amplify-and-forward relay networks with wireless information and energy
    transfer. We use some recent results on finite block-length codes to analyze
    the system performance in the cases with short codewords. Specifically, the
    time switching relaying and the power splitting relaying protocols are
    considered for energy and information transfer. We derive tight approximations
    for the outage probability/throughput. Then, we analyze the outage probability
    in asymptotically high signal-to-noise ratios. Finally, we use numerical
    results to confirm the accuracy of our analysis and to evaluate the system
    performance in different scenarios. Our results indicate that, in
    delay-constrained scenarios, the codeword length affects the outage
    probability/throughput of the joint energy and information transfer systems
    considerably.

    Robust Secrecy Rate Beamforming for Multi-cell Networks with TS-based Information and Energy Harvesting

    Hoang Duong Tuan, Ali Arshad Nasir, Trung Q. Duong
    Subjects: Information Theory (cs.IT)

    Wireless energy harvesting networks are more vulnerable to eavesdropping due
    to higher received power levels. In this paper, we consider a dense multicell
    network in the presence of multi-antenna eavesdroppers and consider
    multi-objective beamforming where the energy beamformers are also used to jam
    the eavesdropper. Thus, we study transmit time-switching (TS) mode to realize
    wireless information and power transfer, where energy and information signal
    are transmitted separately in time by the BS. We study the joint design of
    transmit energy and information beamformers at the BSs and the transmit TS
    ratio. In the presence of imperfect channel estimation and multi-antenna
    eavesdroppers, our objective is to maximize the worst-case user secure rate
under BS transmit power and UE minimum harvested energy constraints. We
solve this highly non-convex problem by proposing a new robust path-following
algorithm, which involves one simple convex quadratic program (QP) in each
iteration. Numerical results confirm that the performance of the proposed
algorithm is close to that of the perfect channel knowledge case. Moreover,
the proposed algorithm not only outperforms the existing algorithm that
models a power-splitting (PS) based receiver, but the proposed transmit-TS
based model is also considerably simpler to implement than the PS-based
model.

    Compressed Sensing from Phaseless Gaussian Measurements via Linear Programming in the Natural Parameter Space

    Paul Hand, Vladislav Voroninski
    Subjects: Information Theory (cs.IT); Optimization and Control (math.OC); Probability (math.PR)

    We consider faithfully combining phase retrieval with classical compressed
    sensing. Inspired by the recent novel formulation for phase retrieval called
    PhaseMax, we present and analyze SparsePhaseMax, a linear program for phaseless
    compressed sensing in the natural parameter space. We establish that when
    provided with an initialization that correlates with a k-sparse n-vector,
    SparsePhaseMax recovers this vector up to global sign with high probability
    from O(k log (n/k)) magnitude measurements against k i.i.d. Gaussian random
    vectors. Our proof of this fact exploits a curious newfound connection between
    phaseless and 1-bit compressed sensing. This is the first result to establish
    bootstrapped compressed sensing from phaseless Gaussian measurements under
    optimal sample complexity.

    Hardware-Based Linear Program Decoding with the Alternating Direction Method of Multipliers

    Mitchell Wasson, Mario Milicevic, Stark C. Draper, Glenn Gulak
    Subjects: Information Theory (cs.IT); Optimization and Control (math.OC)

    We present a hardware-based implementation of Linear Program (LP) decoding
    for binary linear codes. LP decoding frames error-correction as an optimization
    problem. In contrast, variants of Belief Propagation (BP) decoding frame
    error-correction as a problem of graphical inference. LP decoding has several
    advantages over BP-based methods, including convergence guarantees and better
    error-rate performance in high-reliability channels. The latter makes LP
    decoding attractive for optical transport and storage applications. However, LP
    decoding, when implemented with general solvers, does not scale to large
    blocklengths and is not suitable for a parallelized implementation in hardware.
    It has been recently shown that the Alternating Direction Method of Multipliers
    (ADMM) can be applied to decompose the LP decoding problem. The result is a
    message-passing algorithm with a structure very similar to BP. We present new
    intuition for this decoding algorithm as well as for its major computational
    primitive: projection onto the parity polytope. Furthermore, we present results
    for a fixed-point Verilog implementation of ADMM-LP decoding. This
    implementation targets a Field-Programmable Gate Array (FPGA) platform to
    evaluate error-rate performance and estimate resource usage. We show that Frame
    Error Rate (FER) performance well within 0.5dB of double-precision
    implementations is possible with 10-bit messages. Finally, we outline a number
    of research opportunities that should be explored en-route to the realization
    of an Application Specific Integrated Circuit (ASIC) implementation capable of
    gigabit per second throughput.

    A Machine Learning Approach to Model the Received Signal in Molecular Communications

    H. Birkan Yilmaz, Changmin Lee, Yae Jee Cho, Chan-Byoung Chae
    Subjects: Emerging Technologies (cs.ET); Information Theory (cs.IT)

A molecular communication channel is determined by the received signal.
Received signal models form the basis for studies on modulation, receiver
design, capacity, and coding. Therefore, it is crucial to model analytically
the number of molecules received up to time (t). Modeling the diffusion-based
molecular communication channel
    with the first-hitting process is an open issue for a spherical transmitter. In
    this paper, we utilize the artificial neural networks technique to model the
    received signal for a spherical transmitter and a perfectly absorbing receiver
    (i.e., first hitting process). The proposed technique may be utilized in other
    studies that assume a spherical transmitter instead of a point transmitter.
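
The learning task described above boils down to regressing the fraction of
absorbed molecules onto time. The sketch below fits a small scikit-learn MLP
to such a curve; because the spherical-transmitter case has no closed form,
the standard point-transmitter first-hitting expression is used here only as
a stand-in data generator, and all parameter values are assumptions for
illustration.

    # Fit a small neural network to (time -> fraction of absorbed molecules)
    # samples, as a stand-in for modeling the received signal.
    import numpy as np
    from scipy.special import erfc
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    D, r_r, d = 79.4, 5.0, 10.0              # diffusion coeff, receiver radius, distance (assumed units)
    t = rng.uniform(0.01, 2.0, size=2000)    # observation times
    frac = (r_r / d) * erfc((d - r_r) / (2.0 * np.sqrt(D * t)))
    frac_noisy = frac + rng.normal(scale=0.005, size=t.shape)   # simulation-noise stand-in

    model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000, random_state=0)
    model.fit(t.reshape(-1, 1), frac_noisy)

    t_test = np.array([[0.1], [0.5], [1.0]])
    print(model.predict(t_test))   # learned received-signal curve at a few times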



