A versatile simulation tool with an example of its application

class: center, middle, inverse, title-slide

.title[
# A versatile simulation tool with an example of its application
]
.author[
### Raju Rimal
]
.date[
### April 10, 2023
]

---

background-image: url(_images/logo-nb.png)
background-size: 30%
background-position: center
background-repeat: no-repeat

---
class: center, middle, inverse

# `simrel` 
## Simulation based on relevant and irrelevant space

???

- This is about a simulation tool
- Helpful for comparing and assessing the properties of new methods, models or algorithms

---

.left-column[
# Concept Behind

]

.right-column[
### The Model:

`$$\begin{bmatrix}
  \mathbf{y} \\ \mathbf{x}
\end{bmatrix} 
\sim \mathsf{N}\left(
  \begin{bmatrix}
    \boldsymbol{\mu}_y \\
    \boldsymbol{\mu}_x
  \end{bmatrix},
  \begin{bmatrix}
    \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx}\\ 
    \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx}
  \end{bmatrix}
\right)$$`

### Linear Regression:
`$$\underset{(m\times 1)}{\mathbf{y}} = \underset{(m\times 1)}{\boldsymbol{\mu}_y} +
\underset{(m\times p)}{\boldsymbol{\beta}^t}\underset{(p\times 1)}{\mathbf{x}} + 
\underset{(m\times 1)}{\varepsilon},\; \varepsilon \sim \mathsf{N}(\mathbf{0}, \boldsymbol{\Sigma}_{y|x})$$`

]

???

- Population Model
- Regression Model, talk about coef
- The regression coefficients are tied to the covariance structure of the predictors and responses
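
For reference, the link between coefficients and covariances can be written out. Under the joint normal model, and assuming `\(\mathbf{x}\)` is centred so that the intercept is `\(\boldsymbol{\mu}_y\)` (an assumption made here to match the regression equation), the standard conditional-normal results give:

`$$\boldsymbol{\beta} = \boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}, \qquad
\boldsymbol{\Sigma}_{y|x} = \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}$$`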

---

.left-column[
# Concept Behind

]

.right-column[
### The Model:

]

???

- The regression model defines a linear relationship between two blocks of data
- The predictors explain the variation in the response
- If X is rich in information about Y, prediction is easy
- A good method can squeeze this relationship to find optimal coefficients

---

.left-column[
# Concept Behind

]

.right-column[
### The Model:

]

???

- Concept of relevant space
- A certain subset of latent components spans this space
- Let us call the latent components of the predictors predictor components and the latent components of the response response components

**In the ideal case:**

- The irrelevant space in X does not contain information about Y
- The uninformative space in Y does not contain information that X (or the relevant space of X) can explain
- For a certain model, we can assume the covariance structure shown on the right

---

.left-column[
# Concept Behind

]

.right-column[
### The Model:

]

???

We can reorder the variables so that the structure can be explained properly:
- Here we see three response components, which in turn yield the response variables
- Relevant predictors ... are for Y1, ... are for Y2 and ... are for Y3
- Together they span the relevant space

---
class: top

.left-column[
# R-package
## Training Samples, Predictors and Responses

]
.right-column[

```r
sobj <- simrel(
* n       = 100,
* p       = 10,
* m       = 3,
  q       = c(3, 3, 4),
  relpos  = list(c(1, 5), c(3, 6), c(2, 4, 7)),
  ypos    = list(1, 2, 3),
  R2      = c(0.3, 0.4, 0.3),
  gamma   = 0.8,
  eta     = 0,
  ntest   = 200,
  type    = "multivariate"
)
```

]

???

R API for simrel (an example):

- A few parameters tune the model and simulate data with a wide range of properties
- This example simulates 3 responses and 10 predictors with 100 training samples and 200 test samples
- It can also simulate more than three responses by combining these three response components with uninformative normal variates

---
class: top

.left-column[
# R-package
## Relevant and Irrelevant Components

]
.right-column[

```r
sobj <- simrel(
    n       = 100,
    p       = 10,
    m       = 3,
*   q       = c(3, 3, 4),
*   relpos  = list(c(1, 5), c(3, 6), c(2, 4, 7)),
*   ypos    = list(1, 2, 3),
*   R2      = c(0.3, 0.4, 0.3),
    gamma   = 0.8,
    eta     = 0,
    ntest   = 200,
    type    = "multivariate"
)
```

]

???

- The data are sampled from a population whose properties are controlled by the simulation parameters
- Here, 3 predictors are relevant for W1, 3 for W2 and 4 for W3
- Predictor components 1 and 5, which span the same space as the three relevant predictors, are relevant for response component W1
- In the simulated data these components explain 30% of the variation in that response component
- The other response components are interpreted in the same way
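
How `relpos` and `R2` interact can be sketched in base R for W1 alone. This is only an illustration, not the package's internal code: the exponential form of the eigenvalues and the equal split of `R2` over the relevant components are assumptions of the sketch, and `sigma_w1` and `r2_check` are hypothetical names.

```r
# Eigenvalues of the predictor components, assuming exponential decay
# exp(-gamma * (j - 1)); this form is an assumption of the sketch.
p      <- 10
gamma  <- 0.8
lambda <- exp(-gamma * (seq_len(p) - 1))

relpos_w1 <- c(1, 5)   # predictor components relevant for W1
R2_w1     <- 0.3       # target share of explained variance (W1 has unit variance)

# Split R2 equally over the relevant components (hypothetical choice;
# any covariances with sum(sigma^2 / lambda) equal to R2 would do).
sigma_w1 <- sqrt(R2_w1 * lambda[relpos_w1] / length(relpos_w1))

# Population R2 implied by these covariances
r2_check <- sum(sigma_w1^2 / lambda[relpos_w1])
```

All other predictor components have zero covariance with W1, which is what makes them irrelevant for that response component.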

---
class: top

.left-column[
# R-package
## Relevant and Irrelevant Components

]
.right-column[

]

???
- These components are rotated to obtain the data
- The data will reflect the properties specified by the simulation parameters
- For example, the predictors at positions 1, 5 and 10 span the same space as components 1 and 5, which are relevant for response Y1
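
The rotation step can be sketched in base R as follows. This is a simplified illustration: it applies one random orthogonal rotation to all components, whereas the package appears to rotate the relevant and irrelevant blocks separately. The eigenvalues used are assumed for the example.

```r
set.seed(1)
p      <- 10
lambda <- exp(-0.8 * (seq_len(p) - 1))   # assumed eigenvalues of the components

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix
R <- qr.Q(qr(matrix(rnorm(p * p), p, p)))

# Covariance of the predictors after rotating the components: R Lambda R^t
Sigma_xx <- R %*% diag(lambda) %*% t(R)

# Simulate components Z with covariance Lambda, then rotate them into X
n <- 100
Z <- matrix(rnorm(n * p), n, p) %*% diag(sqrt(lambda))
X <- Z %*% t(R)

# The rotation leaves the eigenvalues unchanged
eig <- eigen(Sigma_xx, symmetric = TRUE)$values
```

Because the rotation is orthogonal, the predictors keep the eigenvalue structure of the components even though the individual variables become correlated.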

---
class: top

.left-column[
# R-package
## Reparameterization

]
.right-column[

```r
sobj <- simrel(
    n       = 100,
    p       = 10,
    m       = 3,
    q       = c(3, 3, 4), 
    relpos  = list(c(1, 5), c(3, 6), c(2, 4, 7)), 
    ypos    = list(1, 2, 3), 
    R2      = c(0.3, 0.4, 0.3),
*   gamma   = 0.8,
*   eta     = 0,
    ntest   = 200,
    type    = "multivariate"
)
```

.spread[
- `gamma` controls the multicollinearity by decaying the eigenvalues of predictors

- `eta` controls the correlation between the responses in the same way, by decaying the eigenvalues of the responses
]

]

???

- We have parameterised the eigenvalues so that a single parameter tunes their exponential decay
- `gamma` and `eta` control the decay of the eigenvalues of the predictors and the responses, respectively
- The higher the value, the faster the decay
- In the predictors this controls multicollinearity; in the responses it controls the correlation between them
- A model whose relevant components sit at positions with small eigenvalues is likely to be difficult, and vice versa
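
The decay can be sketched in a few lines of base R. The form `exp(-rate * (j - 1))` is assumed here from the description of exponential decay; `eigen_decay` is a hypothetical helper, not a function from the package.

```r
# Assumed parameterisation: eigenvalue j equals exp(-rate * (j - 1))
eigen_decay <- function(k, rate) exp(-rate * (seq_len(k) - 1))

lambda <- eigen_decay(10, rate = 0.8)  # predictors, gamma = 0.8
kappa  <- eigen_decay(3,  rate = 0)    # responses,  eta   = 0

round(lambda, 3)
kappa
```

With `eta = 0` all response eigenvalues equal 1, so there is no decay and no induced correlation between the responses, while `gamma = 0.8` makes the predictor eigenvalues fall off quickly.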

---
class: center, middle, inverse

# Shiny Application for Simrel
<h2><a href="http://localhost:5555" target="_blank">Demonstration</a></h2>

---
class: center, middle, inverse

# Application: Comparison of multivariate estimators
## Setting up experimental design and comparison

---

.left-column[
# Application
## Experimental Design

```
# A tibble: 16 × 4
   gamma relpos Method Design
   <dbl> <chr>  <chr>   <int>
 1   0   1:5    PCR         1
 2   0   1:5    PLS1        2
 3   0   1:5    PLS2        3
 4   0   1:5    Xenv        4
 5   0   4:7    PCR         5
 6   0   4:7    PLS1        6
 7   0   4:7    PLS2        7
 8   0   4:7    Xenv        8
 9   1.2 1:5    PCR         9
10   1.2 1:5    PLS1       10
11   1.2 1:5    PLS2       11
12   1.2 1:5    Xenv       12
13   1.2 4:7    PCR        13
14   1.2 4:7    PLS1       14
15   1.2 4:7    PLS2       15
16   1.2 4:7    Xenv       16
```

]
.right-column[

.plot[

]

]

???

As an illustration, I will walk through one example
- The design has two simulation parameters, each with two levels
- gamma 0: low multicollinearity; gamma 1.2: high multicollinearity
- relpos 1:5: relevant components with large eigenvalues; relpos 4:7: relevant components with small eigenvalues
- Four estimators are used (PCR, PLS1, PLS2 and Xenv) (see references for details)
- This forms 16 designs
- Each design is replicated 15 times, giving 240 models
- The prediction error is computed and averaged over all replicates. The plot on the right shows how each method behaves for data of a given nature.

- Go through some of the designs (maybe 1, 4, 9, 11, 12)
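
A design table like the one on this slide can be rebuilt in base R. This is a minimal sketch, not the code that produced the slide; the replicate count of 15 comes from the notes above, and `runs` is a hypothetical name.

```r
# Full factorial design; expand.grid() varies its first argument fastest
design <- expand.grid(
  Method = c("PCR", "PLS1", "PLS2", "Xenv"),
  relpos = c("1:5", "4:7"),
  gamma  = c(0, 1.2),
  stringsAsFactors = FALSE
)
design <- design[, c("gamma", "relpos", "Method")]
design$Design <- seq_len(nrow(design))

# 15 replicates of every design; merge() without shared columns
# returns the Cartesian product
runs <- merge(design, data.frame(Replication = 1:15))
```

Each row of `runs` then corresponds to one fitted model: 16 designs times 15 replicates.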

---
class: top

.left-40-column[

# Application
## Meta Modelling

]
.right-60-column[

`$$\begin{aligned}
\mathbf{y}_{ijkl} = \mu &+ (\mathtt{ncomp_i} + \mathtt{Method_j} + \mathtt{relpos_k} + \mathtt{gamma_l})^2 \\ &+ \mathtt{(Method:relpos:gamma)_{jkl}} + \varepsilon_{ijkl}
\end{aligned}$$`

]

???

A statistical model can be formulated to analyse these prediction errors

- Explain the plots in detail (if time permits)
- Especially the gamma:relpos interaction
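
The meta-model on this slide translates directly into an R formula. The sketch below only expands the formula into its terms; `pred_error` is a hypothetical name for the prediction-error response, and no data are fitted.

```r
# Meta-model formula; `pred_error` is a hypothetical name for the response
meta_formula <- pred_error ~ (ncomp + Method + relpos + gamma)^2 +
  Method:relpos:gamma

# Terms implied by the formula: 4 main effects, 6 two-way interactions
# and the single three-way interaction
term_labels <- attr(terms(meta_formula), "term.labels")
term_labels
```

A formula of this shape could then be passed to `lm()` or `aov()` together with a data frame of prediction errors.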

---
class: middle, center

# Acknowledgement

.flex-box[

.solve[

### Solve Sæbø
.inst[
(NMBU)
]

]
.other[

]

.trygve[

### Trygve Almøy
.inst[
(BioStatistics, NMBU)
]

]

]

---
class: top

# References
.references[

Almøy, T. (1996). "A simulation study on comparison of prediction
methods when only a few components are relevant". In: _Computational
Statistics & Data Analysis_ 21.1, pp. 87-107. DOI:
[10.1016/0167-9473(95)00006-2](https://doi.org/10.1016%2F0167-9473%2895%2900006-2).

Cook, D., Z. Su, Y. Yang, et al. (2015). "envlp: A MATLAB Toolbox for
Computing Envelope Estimators in Multivariate Analysis". In: _Journal
of Statistical Software_ 62.1, pp. 1-20.

Cook, R. D., B. Li, and F. Chiaromonte (2010). "Envelope models for
parsimonious and efficient multivariate linear regression". In:
_Statistica Sinica_, pp. 927-960.

Cook, R. D. and X. Zhang (2015). "Simultaneous envelopes for
multivariate linear regression". In: _Technometrics_ 57.1, pp. 11-25.

Helland, I. S. (2000). "Model Reduction for Prediction in Regression
Models". In: _Scandinavian Journal of Statistics_ 27.1, pp. 1-20. ISSN:
1467-9469. DOI:
[10.1111/1467-9469.00174](https://doi.org/10.1111%2F1467-9469.00174).

]

---

# References
.references[

Helland, I. S. and T. Almøy (1994). "Comparison of prediction methods
when only a few components are relevant". In: _Journal of the American
Statistical Association_ 89.426, pp. 583-591.

Helland, I. S., S. Sæbø, T. Almøy, et al. (2017). "Model and estimators
for partial least squares". unpublished.

Helland, I. S., S. Sæbø, T. Almøy, et al. "Model and estimators for
partial least squares regression". In: _Journal of Chemometrics_, p.
e3044.

Rimal, R., T. Almøy, and S. Sæbø (2018). "A tool for simulating
multi-response linear model data". In: _Chemometrics and Intelligent
Laboratory Systems_ 176, pp. 1-10.

Sæbø, S., T. Almøy, and I. S. Helland (2015). "simrel - A versatile
tool for linear model data simulation based on the concept of a
relevant subspace and relevant predictors". In: _Chemometrics and
Intelligent Laboratory Systems_.

]

---

# Installation
.flex-box[
.installation-details[

## R-Package

```r
if (!require(devtools)) install.packages("devtools")
devtools::install_github("simulatr/simrel")
```

## Shiny Application

```r
if (!require(simrel)) install.packages("simrel")
shiny::runGitHub("simulatr/AppSimulatr")
```

]

]

---
background-image: url(_images/ThankYou.png)
background-size: cover
background-position: center
background-repeat: no-repeat