In statistical machine learning, we consider the problem of inferring the probability structure behind data from a relatively small number of examples. The key questions are how to model the probability structure that generates the data and how to efficiently estimate the unknown parameters.

We consider the properties of various learning machines, the dynamics of optimization, and the statistical properties of the estimated parameters from the standpoint of mathematical science.

Universality of Multi-Layer Perceptron

It has been proved from various points of view that neural networks are universal function approximators, able to approximate arbitrary nonlinear functions to any desired accuracy. We clarify the class of functions that can be represented by a neural network based on an integral representation (the ridgelet transform), and investigate the relationship between the integral representation and the accuracy of approximation by finite sums.
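Schematically (with notation simplified from the references below), the integral representation writes a target function as a continuous superposition of ridge functions, and a finite three-layered network arises as its discretization:

```latex
% Integral representation: the activation \eta is superposed over all
% hidden-layer parameters (a, b), weighted by a transform T of f.
f(x) = \int_{\mathbb{R}^d \times \mathbb{R}} T(a, b)\, \eta(a \cdot x - b)\, \mathrm{d}a\, \mathrm{d}b
% A finite network truncates the integral to n hidden units,
f_n(x) = \sum_{j=1}^{n} c_j\, \eta(a_j \cdot x - b_j),
% and approximation bounds control the error f - f_n in terms of T.
```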

Universality of Neural Network

[ slide ]

Murata, N.: “An integral representation of functions using three-layered networks and their approximation bounds”, Neural Networks, Volume 9, Issue 6, August 1996, Pages 947-956. https://doi.org/10.1016/0893-6080(96)00000-7

Sonoda, S. and Murata, N.: “Neural network with unbounded activation functions is universal approximator”, Applied and Computational Harmonic Analysis, Volume 43, Issue 2, September 2017, Pages 233–268. https://doi.org/10.1016/j.acha.2015.12.005

Statistical Analysis of On-line Learning

Learning is a flexible and effective means of extracting the stochastic structure of the environment. In practice, two different types of learning are used, namely batch learning and on-line learning. The batch learning procedure uses all the training examples repeatedly, so its performance is comparable to that of classical statistical estimation. On-line learning is more dynamic, updating the current estimate as each new datum is observed one by one. On-line learning is slower in general, but it works well in changing environments. We give a unified framework of statistical analysis for batch and on-line learning. Topics include the asymptotic learning curve, generalization error and training error, over-fitting and over-training, the efficiency of learning, and an adaptive method for determining the learning rate.
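The contrast between the two procedures can be sketched on the simplest possible model, estimating a mean under squared loss (an illustrative toy, not the setting of the papers below; the data and learning-rate schedule are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=5000)  # stationary environment

# Batch learning: use all training examples at once,
# which here coincides with the classical statistical estimate.
batch_mean = data.mean()

# On-line learning: update the current estimate one datum at a time.
# With the decaying rate eta_t = 1/t this reproduces the running mean;
# a small constant rate would instead track a changing environment.
theta = 0.0
for t, x in enumerate(data, start=1):
    eta = 1.0 / t                # learning rate schedule
    theta += eta * (x - theta)   # gradient step on the squared loss

print(batch_mean, theta)
```

With the 1/t schedule the two estimates agree up to floating-point error; the statistical analysis in the papers quantifies how such schedules trade off efficiency against adaptivity.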

Statistical Analysis of On-line Learning

[ slide ]

Murata, N., and Amari, S.: “Statistical analysis of learning dynamics”, Signal Processing, Volume 74, Issue 1, January 1999, Pages 3–28. https://doi.org/10.1016/S0165-1684(98)00206-0

Murata, N., Kawanabe, M., Ziehe, A., Müller, K.-R., and Amari, S.: “On-line learning in changing environments with applications in supervised and unsupervised learning”, Neural Networks, Volume 15, Issue 4–6, June-July 2002, Pages 743–760. https://doi.org/10.1016/S0893-6080(02)00060-6

Change-Point Detection in a Sequence of Bags-of-Data

Change-point detection is an important engineering problem, and various methods have been proposed. Most existing methods assume that the data point observed at each time step is a single multi-dimensional vector; to make change-point detection applicable to a wider class of problems, in which each time step yields a bag of data, we propose a non-parametric and computationally efficient method. First, the underlying distribution behind each bag-of-data is estimated and embedded in a metric space with the earth mover’s distance. Then, using a distance-based information estimator, we evaluate how the sequence of bags-of-data varies in the metric space and derive a change-point score. A procedure is also incorporated to adaptively determine the timing of alarms, by calculating confidence intervals for the change-point scores at each time step by means of the Bayesian bootstrap. This makes it possible to avoid false alarms in noisy situations and to detect changes of various magnitudes.
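A minimal 1-D sketch of the first two steps, with the earth mover’s distance computed from empirical quantile functions. The distance-based information estimator and the Bayesian-bootstrap confidence intervals of the actual method are omitted; the window size `w`, quantile grid, and data are illustrative choices, not the paper’s:

```python
import numpy as np

def emd_1d(x, y, grid=200):
    """Earth mover's distance between two 1-D empirical distributions,
    computed as the mean absolute difference of their quantile functions."""
    q = np.linspace(0.0, 1.0, grid)
    return np.mean(np.abs(np.quantile(x, q) - np.quantile(y, q)))

def change_scores(bags, w=3):
    """Score each time step t by the distance between the distributions
    pooled from the w bags before t and the w bags from t onward."""
    scores = []
    for t in range(w, len(bags) - w + 1):
        past = np.concatenate(bags[t - w:t])
        future = np.concatenate(bags[t:t + w])
        scores.append(emd_1d(past, future))
    return np.array(scores)

rng = np.random.default_rng(1)
# a sequence of bags: 20 bags from N(0, 1), then 20 bags from N(3, 1)
bags = [rng.normal(0.0, 1.0, 50) for _ in range(20)] \
     + [rng.normal(3.0, 1.0, 50) for _ in range(20)]

w = 3
scores = change_scores(bags, w=w)
t_hat = int(np.argmax(scores)) + w   # shift back to a bag index
```

The score peaks near the true change at bag 20; the adaptive alarm procedure in the paper replaces this simple argmax with bootstrap confidence intervals.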

Change-Point Detection in a Sequence of Bags-of-Data

[ slide ]

Koshijima, K., Hino, H. and Murata, N.: “Change-Point Detection in a Sequence of Bags-of-Data”, IEEE Transactions on Knowledge and Data Engineering, Volume 27, Number 10, October 2015, Pages 2632-2644. https://doi.org/10.1109/TKDE.2015.2426693

Related Works

Hino, H. and Murata, N.: “Information estimators for weighted observations”, Neural Networks, Volume 46, October 2013, Pages 260–275. https://doi.org/10.1016/j.neunet.2013.06.005

Transition Matrix Estimation for Brand Share Analysis

In a product market or stock market, different products or stocks compete for the same consumers or purchasers. We propose a method for estimating the time-varying transition matrix of product shares from a multivariate time series of those shares. The method is based on the assumption that each observed time series of shares is a stationary distribution of an underlying Markov process characterized by transition probability matrices. We estimate a transition probability matrix for every observation under natural assumptions, and demonstrate on a real-world dataset of automobile shares that the proposed method can find intrinsic transitions of shares.
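One way to see the core fitting step is to recover a row-stochastic matrix linking two consecutive share vectors by projected gradient descent. This is a minimal sketch under made-up data and parameters, not the paper's estimator, which imposes further assumptions across the whole time series:

```python
import numpy as np

def project_row_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - cssv / idx > 0.0)[0][-1]
    theta = cssv[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def estimate_transition(s, s_next, lr=0.5, iters=500):
    """Find a row-stochastic P with s @ P close to s_next by
    projected gradient descent on 0.5 * ||s P - s_next||^2."""
    k = s.size
    P = np.full((k, k), 1.0 / k)           # start from the uniform matrix
    for _ in range(iters):
        r = s @ P - s_next                 # share-balance residual
        P = P - lr * np.outer(s, r)        # gradient step
        P = np.apply_along_axis(project_row_to_simplex, 1, P)
    return P

s      = np.array([0.50, 0.30, 0.20])      # shares at time t (hypothetical)
s_next = np.array([0.40, 0.35, 0.25])      # shares at time t + 1
P_hat = estimate_transition(s, s_next)
```

A single pair of share vectors leaves the matrix underdetermined; this is exactly why natural assumptions tying the matrices across observations are needed in practice.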

Transition Matrix Estimation for Brand Share Analysis

[ slide ]

Chiba, T., Hino, H., Akaho, S., and Murata, N.: “Time-Varying Transition Probability Matrix Estimation and Its Application to Brand Share Analysis”, PLOS ONE, Volume 12, Issue 1, January 2017, e0169981. https://doi.org/10.1371/journal.pone.0169981


to be continued