Use of Simulation in Statistics

 

Raju Rimal

20 September, 2019

Traffic Simulation

Simulation in Various Fields

farming simulation

traffic simulation

flood simulation

swimming simulation

simulation of posterior distribution

Simulation is the creation of a model that can be manipulated logically to decide how the physical world works.

Dr. Richard Gran

Source: https://www.youtube.com/watch?v=OCMafswcNkY

Simulation in Statistics

Background and Introduction

Data Analysis Process

A model is a conceptual representation of a relationship, a system, or an aspect of the real world.

 

Example: When you are buying a second-hand car, the price depends on the distance the car has travelled; a car that has travelled farther typically costs less.

\begin{aligned} \text{price} &= \beta_0 + \beta_1\text{mileage} + \varepsilon\\ \varepsilon &\sim \textsf{N}(0, \sigma^2) \end{aligned}

Reverse Engineering

  • We fix the parameters
  • Identify a distribution
  • Sample the data

Samples from most distributions can be obtained by simulating data from a uniform distribution and transforming the values.

\begin{aligned} \text{price} &= \beta_0 + \beta_1\text{mileage} + \varepsilon\\ \varepsilon &\sim \textsf{N}(0, \sigma^2) \end{aligned}
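The three steps above can be sketched in R for the car-price model. This is a minimal illustration, not a calibrated example: the parameter values, the sample size, and the uniform distribution chosen for mileage are all assumptions, including the negative slope that makes higher mileage lower the price.

```r
# Reverse engineering: fix parameters, pick distributions, sample the data.
# All numeric values below are illustrative assumptions.
set.seed(2019)                 # make the simulated data reproducible
n     <- 200                   # number of simulated cars
beta0 <- 20000                 # assumed price at zero mileage
beta1 <- -0.05                 # assumed change in price per unit of mileage
sigma <- 1500                  # assumed standard deviation of the error

mileage <- runif(n, min = 0, max = 200000)  # assumed distribution of the predictor
eps     <- rnorm(n, mean = 0, sd = sigma)   # error term, N(0, sigma^2)
price   <- beta0 + beta1 * mileage + eps    # response generated from the model

fit <- lm(price ~ mileage)     # fit the same model to the simulated data
coef(fit)                      # estimates should lie close to beta0 and beta1
```

Because the true parameters are known, the fitted coefficients can be compared directly with `beta0` and `beta1`, which is what makes simulated data useful for checking a method.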

Why Simulate

  • Answer what-if questions
  • Collecting data to answer these what-ifs is expensive
  • Find answers that are difficult, and sometimes impossible, to compute analytically
  • Mimic a real system (a scenario, a process, etc.)
  • Generate data from a model
  • Random variables are the basic components

Deterministic and Stochastic Simulation

  • A stochastic simulation contains probabilistic components, so its output varies from run to run
  • A deterministic simulation gives a fixed output for a given input and cannot be generalized

Random Variates

  • Random variables are the building blocks of any complex system
  • Random variables following any distribution can be obtained from sampling and manipulating uniform random variates U(0,1).
  • Generating samples from U(0,1) needs random numbers.
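As one illustration of turning U(0,1) variates into another distribution, the sketch below uses the inverse transform method to draw exponential variates. The rate and the sample size are arbitrary choices made for the example.

```r
# Inverse transform: if U ~ U(0,1) and F is a continuous CDF, then F^{-1}(U) has CDF F.
# For the exponential distribution, F(x) = 1 - exp(-rate * x),
# so F^{-1}(u) = -log(1 - u) / rate.
set.seed(1)
rate <- 2                      # assumed rate parameter
u    <- runif(10000)           # uniform random variates on (0, 1)
x    <- -log(1 - u) / rate     # transformed values follow Exp(rate)

mean(x)                        # should be close to the theoretical mean 1 / rate
```

In practice `rexp()` does this job directly; the point here is only to show the mechanism.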

Random Numbers

  • Building blocks of any stochastic simulation
  • Are random numbers generated by computers really random?

Pseudo-Random Numbers:

  • Deterministic but unpredictable unless the generating mechanism is known
  • Usually, a seed is used to reproduce the same random number sequence.
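A short R sketch of how a seed makes a pseudo-random sequence reproducible. The seed value 42 is arbitrary.

```r
# Resetting the seed replays the same pseudo-random sequence.
set.seed(42)                   # initialise the generator
first  <- runif(5)

set.seed(42)                   # reset to the same seed
second <- runif(5)

third  <- runif(5)             # continue without resetting: new values

identical(first, second)       # TRUE: same seed, same sequence
identical(first, third)        # FALSE: the generator has moved on
```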

Random Variables from Different Distributions

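runif(1000, min = 0, max = 1)     # uniform on (0, 1)
rchisq(1000, df = 2)              # chi-squared with 2 degrees of freedom
rnorm(1000, mean = 0, sd = 1)     # standard normal
rgamma(1000, shape = 2)           # gamma with shape 2 (default rate 1)
rcauchy(1000, scale = 1.5)        # Cauchy with location 0 and scale 1.5
rbeta(1000, 1.3, 2.4)             # beta with shape1 = 1.3 and shape2 = 2.4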

Most statistical software can draw random samples from many different distributions.

Uses of Simulation

Building Statistical Methods

Monte Carlo Methods

Monte Carlo methods use random sampling to solve problems that may be deterministic in principle.

 

Used in optimization, numerical integration, drawing samples from a probability distribution, etc.
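Numerical integration is a simple case where random sampling answers a deterministic question. The sketch below estimates the integral of exp(-x^2) over [0, 1] by averaging the integrand at uniform random points. The integrand and the sample size are assumptions chosen for illustration.

```r
# Monte Carlo integration: the integral of g over [0, 1] equals E[g(U)], U ~ U(0,1).
set.seed(123)
g <- function(x) exp(-x^2)     # illustrative integrand
n <- 100000                    # number of random points
u <- runif(n)

mc_estimate <- mean(g(u))            # sample mean approximates the integral
mc_se       <- sd(g(u)) / sqrt(n)    # standard error of the estimate

exact <- integrate(g, 0, 1)$value    # deterministic quadrature, for comparison
c(monte_carlo = mc_estimate, std_error = mc_se, quadrature = exact)
```

The standard error shrinks at the rate 1 / sqrt(n), so each extra digit of accuracy costs roughly a hundred times more samples.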

Methods Based on Monte Carlo

  • Inverse Transform Method
  • Acceptance-Rejection Method
  • Markov Chain Monte Carlo
    • Metropolis-Hastings Algorithm
    • Gibbs Sampling
  • Bayesian Analysis
  • Resampling Techniques
    • Jackknifing
    • Bootstrapping
    • Permutation test
 
source: http://tiny.cc/eistcz
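To make one of the resampling techniques above concrete, here is a minimal bootstrap sketch in base R. It estimates the standard error of a sample median. The data are simulated from an exponential distribution, and the number of bootstrap replicates is an arbitrary choice.

```r
# Bootstrap: resample the observed data with replacement and recompute the statistic.
set.seed(10)
x <- rexp(50, rate = 1)        # illustrative sample standing in for observed data
B <- 2000                      # number of bootstrap replicates (assumed)

boot_medians <- replicate(B, median(sample(x, replace = TRUE)))

boot_se <- sd(boot_medians)                       # bootstrap standard error of the median
boot_ci <- quantile(boot_medians, c(0.025, 0.975)) # simple percentile interval
boot_se
boot_ci
```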

Methods that Use Simulation

  • Inverse Transform Method
  • Acceptance-Rejection Method
  • Markov Chain Monte Carlo
    • Metropolis-Hastings Algorithm
    • Gibbs Sampling
  • Bayesian Analysis
  • Resampling Techniques
    • Jackknifing
    • Bootstrapping
    • Permutation test
    • Random Cross-validation
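A permutation test from the list above can be sketched in a few lines of base R. The two groups are simulated, and their means, sizes, and the number of permutations are assumptions made only for the example.

```r
# Permutation test for a difference in means between two groups.
set.seed(7)
a <- rnorm(30, mean = 0)       # illustrative group A
b <- rnorm(30, mean = 1)       # illustrative group B

observed <- mean(b) - mean(a)  # observed difference in means
pooled   <- c(a, b)

# Under the null hypothesis the group labels are exchangeable,
# so reshuffle them and recompute the difference each time.
perm_diff <- replicate(5000, {
  idx <- sample(length(pooled), length(a))   # random relabelling of group A
  mean(pooled[-idx]) - mean(pooled[idx])
})

# Two-sided p-value: share of permuted differences at least as extreme as observed.
p_value <- mean(abs(perm_diff) >= abs(observed))
p_value
```

A common refinement adds one to both the numerator and the denominator of the p-value so that it can never be exactly zero.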
 
 
 

Uses of Simulation

Generating Data for Research and Education

Generating Data

  • Study the effect of problems while deploying a method or a technique
  • Assess the accuracy and problems of new methods against difficult data structures
  • Answer what-if questions
  • R-packages:
    • simrel
    • simulator
    • simTools
    • simglm, ...
  • Python-packages:
    • simpy
    • pysimrel
    • numpy
  • Other software:
    • Stata
    • SAS

Use of Generated Data

Methods for Generating Data


Experimental Design

Proper use of experimental design makes the simulation more effective. Consider a model,

\mathbf{y} = \boldsymbol{\mu}_y + \boldsymbol{\beta}^t(\mathbf{x} - \boldsymbol{\mu}_x) + \boldsymbol{\varepsilon}
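One way to read this is that the simulation settings themselves form a designed experiment. The sketch below crosses two factors (sample size and error standard deviation) in a full factorial design, simulates from the model above at each design point, and records how well the coefficients are recovered. The factor levels, the parameter values, the standard normal predictors, and the helper `simulate_once()` are all hypothetical choices for illustration, not part of any package.

```r
# A small full factorial simulation design over sample size and noise level.
set.seed(2019)
mu_y <- 1                      # assumed mean of the response
mu_x <- c(0, 0)                # assumed means of the two predictors
beta <- c(1, -0.5)             # assumed regression coefficients

# Hypothetical helper: simulate one data set and return the squared estimation error.
simulate_once <- function(n, sigma) {
  x <- matrix(rnorm(n * length(beta)), nrow = n)   # standard normal predictors
  x <- sweep(x, 2, mu_x, "+")                      # shift columns to have mean mu_x
  y <- mu_y + drop(sweep(x, 2, mu_x) %*% beta) + rnorm(n, sd = sigma)
  b_hat <- coef(lm(y ~ x))[-1]                     # drop the intercept
  sum((b_hat - beta)^2)
}

# Design points: every combination of the factor levels.
design <- expand.grid(n = c(20, 100), sigma = c(0.5, 2))

# Average the error over 200 replicates at each design point.
design$mse <- mapply(
  function(n, sigma) mean(replicate(200, simulate_once(n, sigma))),
  design$n, design$sigma
)
design
```

Laying the settings out as a design table makes it straightforward to see which factor drives the error and to add further factors later.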

Simulation in Research Studies

 
  • Numerous simulation studies can be found, both past and present
  • With increasing computing power, the trend is growing

Modern Application in Machine Learning

  • Combine Simulation with Machine Learning
  • Generate artificial training samples
  • Add domain-specific knowledge to machine learning through simulation

Following are some of these studies:

Some Extra Use Cases

Simple Simulation Examples

Limitations of Simulation

Cautious Simulation

  • Fake data give fake results
  • Simulated data is not a replacement for real data
  • Trying to get something from nothing
  • Making the analysis and results reproducible and open-source
  • Unclear experimental design and poor reporting of results

Simulation illuminates important points and builds up a picture of the landscape, but cannot illuminate the entire landscape.

- Patrick Landscape

 
 

References

  1. Ripley, B. D. (2009). Stochastic simulation (Vol. 316). John Wiley & Sons.
  2. Jones, O., Maillardet, R., & Robinson, A. (2014). Introduction to scientific programming and simulation using R. Chapman and Hall/CRC.
  3. Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in medicine, 38(11), 2074-2102.
  4. Ross, S. M. (2014). Introduction to probability models. Academic press.
  5. Ripley, B. D. (1988). Uses and abuses of statistical simulation. Mathematical Programming, 42(1-3), 53-68.
  6. Knežo, D., & Vagaská, A. (2019). Monte Carlo Method Application and Generation of Random Numbers by Usage of Numerical Methods. In Models and Theories in Social Systems (pp. 197-207). Springer, Cham.
  7. Birta, L. G., & Arbez, G. (2007). Modelling and Simulation: Exploring Dynamic System Behaviour. Ottawa: School of information technology and engineering.
  8. Sigal, M. J., & Chalmers, R. P. (2016). Play it again: Teaching statistics with Monte Carlo simulation. Journal of Statistics Education, 24(3), 136-156.
  9. Sæbø, S., Almøy, T., & Helland, I. S. (2015). simrel—A versatile tool for linear model data simulation based on the concept of a relevant subspace and relevant predictors. Chemometrics and Intelligent Laboratory Systems, 146, 128-135.