class: center, middle, inverse, title-slide

.title[
# A versatile simulation tool with an example of its application
]
.author[
### Raju Rimal
]
.date[
### April 10, 2023
]

---
background-image: url(_images/logo-nb.png)
background-size: 30%
background-position: center
background-repeat: no-repeat

---
class: center, middle, inverse

# `simrel`
## Simulation based on relevant and irrelevant space

???

- This is about a simulation tool
- Helpful in comparing and assessing properties of new methods, models or algorithms

---

.left-column[
# Concept Behind
]

.right-column[
### The Model:

`$$\begin{bmatrix} \mathbf{y} \\ \mathbf{x} \end{bmatrix} \sim \mathsf{N}\left( \begin{bmatrix} \boldsymbol{\mu}_y \\ \boldsymbol{\mu}_x \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx}\\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx} \end{bmatrix} \right)$$`

### Linear Regression:

`$$\underset{(m\times 1)}{\mathbf{y}} = \underset{(m\times 1)}{\boldsymbol{\mu}_y} + \underset{(m\times p)}{\boldsymbol{\beta}^t}\underset{(p\times 1)}{\mathbf{x}} + \underset{(m\times 1)}{\varepsilon},\; \varepsilon \sim \mathsf{N}(\mathbf{0}, \boldsymbol{\Sigma}_{y|x})$$`
]

???
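For the speaker's reference: under the joint normal model on this slide, the regression coefficients and the error covariance follow from the covariance blocks by the standard conditional-normal result. It is stated here as a reminder and is not spelled out on the slides.

`$$\boldsymbol{\beta} = \boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}, \qquad \boldsymbol{\Sigma}_{y|x} = \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}$$`

Principal components of x that have zero covariance with y therefore contribute nothing to the coefficients, which is the idea behind the relevant space.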
- Population model
- Regression model; talk about the coefficients
- The regression coefficients are tied to the covariance structure of the predictors and responses

---

.left-column[
# Concept Behind
]

.right-column[
### The Model:

`$$\begin{bmatrix} \mathbf{y} \\ \mathbf{x} \end{bmatrix} \sim \mathsf{N}\left( \begin{bmatrix} \boldsymbol{\mu}_y \\ \boldsymbol{\mu}_x \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx}\\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx} \end{bmatrix} \right)$$`

### Linear Regression:

`$$\underset{(m\times 1)}{\mathbf{y}} = \underset{(m\times 1)}{\boldsymbol{\mu}_y} + \underset{(m\times p)}{\boldsymbol{\beta}^t}\underset{(p\times 1)}{\mathbf{x}} + \underset{(m\times 1)}{\varepsilon},\; \varepsilon \sim \mathsf{N}(\mathbf{0}, \boldsymbol{\Sigma}_{y|x})$$`

<img src="index_files/figure-html/relspace1-1.svg" width="75%" style="display: block; margin: auto;" />
]

???

- The regression model defines a linear relationship between two blocks of data
- The predictors explain the variation in the response
- If X is rich in information about Y, prediction is easy
- A good method can squeeze this relationship to find optimal coefficients

---

.left-column[
# Concept Behind

<img src="index_files/figure-html/cov_plot_relpos-1.svg" width="100%" style="display: block; margin: auto;" />
]

.right-column[
### The Model:

`$$\begin{bmatrix} \mathbf{y} \\ \mathbf{x} \end{bmatrix} \sim \mathsf{N}\left( \begin{bmatrix} \boldsymbol{\mu}_y \\ \boldsymbol{\mu}_x \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx}\\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx} \end{bmatrix} \right)$$`

### Linear Regression:

`$$\underset{(m\times 1)}{\mathbf{y}} = \underset{(m\times 1)}{\boldsymbol{\mu}_y} + \underset{(m\times p)}{\boldsymbol{\beta}^t}\underset{(p\times 1)}{\mathbf{x}} + \underset{(m\times 1)}{\varepsilon},\; \varepsilon \sim \mathsf{N}(\mathbf{0}, \boldsymbol{\Sigma}_{y|x})$$`

<img
src="index_files/figure-html/relspace-1.svg" width="75%" style="display: block; margin: auto;" />
]

???

- Concept of relevant space
- A certain subset of latent components spans this space
- Let us call the latent components of the predictors predictor components and the latent components of the response response components

**In the ideal case:**

- The irrelevant space in X does not contain information about Y
- The uninformative space in Y does not contain information that X (or the relevant space of X) can explain
- For a certain model, we can assume a covariance structure like the one shown on the slide

---

.left-column[
# Concept Behind

<img src="index_files/figure-html/cov_plot_relpos_1-1.svg" width="100%" style="display: block; margin: auto;" />
]

.right-column[
### The Model:

`$$\begin{bmatrix} \mathbf{y} \\ \mathbf{x} \end{bmatrix} \sim \mathsf{N}\left( \begin{bmatrix} \boldsymbol{\mu}_y \\ \boldsymbol{\mu}_x \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx}\\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx} \end{bmatrix} \right)$$`

### Linear Regression:

`$$\underset{(m\times 1)}{\mathbf{y}} = \underset{(m\times 1)}{\boldsymbol{\mu}_y} + \underset{(m\times p)}{\boldsymbol{\beta}^t}\underset{(p\times 1)}{\mathbf{x}} + \underset{(m\times 1)}{\varepsilon},\; \varepsilon \sim \mathsf{N}(\mathbf{0}, \boldsymbol{\Sigma}_{y|x})$$`

<img src="index_files/figure-html/relspace-1.svg" width="75%" style="display: block; margin: auto;" />
]

???

We can reorder the variables so that the structure can be properly explained

- Here we see that there are three components in the response, and they also give rise to the three response variables.
- Relevant predictors ... are for Y1, ... are for Y2 and ...
are for Y3
- They together span the relevant space

---
class: top

.left-column[
# R-package
## Training Samples, Predictors and Responses
]

.right-column[
```r
sobj <- simrel(
*   n = 100,
*   p = 10,
*   m = 3,
    q = c(3, 3, 4),
    relpos = list(c(1, 5), c(3, 6), c(2, 4, 7)),
    ypos = list(1, 2, 3),
    R2 = c(0.3, 0.4, 0.3),
    gamma = 0.8,
    eta = 0,
    ntest = 200,
    type = "multivariate"
)
```

<img src="index_files/figure-html/unnamed-chunk-2-1.svg" width="100%" style="display: block; margin: auto;" />
]

???

R API for simrel (an example):

- Use a few parameters to tune the model and simulate data with a wide range of properties
- This example simulates 3 responses and 10 predictors with 100 training samples and 200 test samples.
- Can also simulate more than three responses from these three response components by combining them with uninformative normal variates

---
class: top

.left-column[
# R-package
## Relevant and Irrelevant Components

<img src="index_files/figure-html/unnamed-chunk-3-1.svg" width="100%" style="display: block; margin: auto;" />
]

.right-column[
```r
sobj <- simrel(
    n = 100,
    p = 10,
    m = 3,
*   q = c(3, 3, 4),
*   relpos = list(c(1, 5), c(3, 6), c(2, 4, 7)),
*   ypos = list(1, 2, 3),
*   R2 = c(0.3, 0.4, 0.3),
    gamma = 0.8,
    eta = 0,
    ntest = 200,
    type = "multivariate"
)
```

<img src="index_files/figure-html/unnamed-chunk-5-1.svg" width="100%" style="display: block; margin: auto;" />
]

???
- The data are sampled from a population with properties controlled by the simulation parameters
- Here, 3 predictors are relevant for W1, 3 predictors are relevant for W2 and 4 predictors are relevant for W3
- Predictor components 1 and 5, which span the same space as the three relevant predictors, are relevant for response component W1.
- In the simulated data, these components explain 30% of the variation in that response component
- For the other responses, a similar understanding follows

---
class: top

.left-column[
# R-package
## Relevant and Irrelevant Components

<img src="index_files/figure-html/unnamed-chunk-6-1.svg" width="100%" style="display: block; margin: auto;" />
]

.right-column[
```r
sobj <- simrel(
    n = 100,
    p = 10,
    m = 3,
*   q = c(3, 3, 4),
*   relpos = list(c(1, 5), c(3, 6), c(2, 4, 7)),
*   ypos = list(1, 2, 3),
*   R2 = c(0.3, 0.4, 0.3),
    gamma = 0.8,
    eta = 0,
    ntest = 200,
    type = "multivariate"
)
```

<img src="index_files/figure-html/unnamed-chunk-8-1.svg" width="100%" style="display: block; margin: auto;" />
]

???

- These components are rotated to obtain the data
- The data reflect the properties specified by the simulation parameters
- For example, the predictors at positions 1, 5 and 10 span the same space as components 1 and 5 and are relevant for response Y1.

---
class: top

.left-column[
# R-package
## Reparameterization

<img src="_images/gamma-animation.gif" width="100%" style="display: block; margin: auto;" />
]

.right-column[
```r
sobj <- simrel(
    n = 100,
    p = 10,
    m = 3,
    q = c(3, 3, 4),
    relpos = list(c(1, 5), c(3, 6), c(2, 4, 7)),
    ypos = list(1, 2, 3),
    R2 = c(0.3, 0.4, 0.3),
*   gamma = 0.8,
*   eta = 0,
    ntest = 200,
    type = "multivariate"
)
```

.spread[
- `gamma` controls the multicollinearity by decaying the eigenvalues of the predictors
- `eta` controls the correlation between the responses in the same way as `gamma`
]
]

???
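A minimal sketch of the eigenvalue decay that `gamma` and `eta` tune on this slide. It assumes the exponential form `exp(-gamma * (i - 1))` for the i-th eigenvalue; `decay_eigenvalues()` is a hypothetical helper written for illustration and is not part of the `simrel` package.

```r
# Sketch only: assumes the i-th eigenvalue is exp(-decay * (i - 1)).
# `decay_eigenvalues` is a hypothetical helper, not a simrel function.
decay_eigenvalues <- function(n_comp, decay) {
  stopifnot(n_comp >= 1, decay >= 0)
  exp(-decay * (seq_len(n_comp) - 1))
}

lambda_flat  <- decay_eigenvalues(10, decay = 0)    # gamma = 0: all eigenvalues equal 1
lambda_steep <- decay_eigenvalues(10, decay = 0.8)  # gamma = 0.8, the value used on the slide

# Share of the total predictor variance carried by the first two components
top_two_share <- sum(lambda_steep[1:2]) / sum(lambda_steep)
top_two_share
```

With `decay = 0` every component has the same variance, so there is no multicollinearity; a larger value concentrates the variance in the first few components.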
- We have parameterised the eigenvalues so that a single parameter can tune their exponential decay
- `gamma` and `eta` control the decay of the eigenvalues of the predictors and the responses, respectively.
- The higher the value, the faster the decay
- In the predictors, this controls the multicollinearity; in the responses, it controls the correlation between the responses
- A model whose relevant components sit at positions with small eigenvalues is likely to be difficult, and vice versa.

---
class: center, middle, inverse

# Shiny Application for Simrel

<h2><a href="http://localhost:5555" target="_blank">Demonstration</a></h2>

---
class: center, middle, inverse

# Application: Comparison of multivariate estimators
## Setting up experimental design and comparison

---

.left-column[
# Application
## Experimental Design

```
# A tibble: 16 × 4
   gamma relpos Method Design
   <dbl> <chr>  <chr>   <int>
 1   0   1:5    PCR         1
 2   0   1:5    PLS1        2
 3   0   1:5    PLS2        3
 4   0   1:5    Xenv        4
 5   0   4:7    PCR         5
 6   0   4:7    PLS1        6
 7   0   4:7    PLS2        7
 8   0   4:7    Xenv        8
 9   1.2 1:5    PCR         9
10   1.2 1:5    PLS1       10
11   1.2 1:5    PLS2       11
12   1.2 1:5    Xenv       12
13   1.2 4:7    PCR        13
14   1.2 4:7    PLS1       14
15   1.2 4:7    PLS2       15
16   1.2 4:7    Xenv       16
```
]

.right-column[
.plot[
]
]

???

I will walk through one example:

- The design has two simulation parameters, each with two levels
- gamma 0: low multicollinearity; gamma 1.2: high multicollinearity
- relpos 1:5: relevant components with large eigenvalues; relpos 4:7: relevant components with small eigenvalues
- Four estimators are used for estimation (PCR, PLS1, PLS2 and Xenv) (see references for details)
- This forms 16 designs
- Each design is replicated 15 times, so there are 240 models
- The prediction error is computed and averaged over all replicates.

The plot on the right shows how each method behaves for data of a given nature.

- Go through a few of the designs (maybe 1, 4, 9, 11, 12)

---
class: top

.left-40-column[
# Application
## Meta Modelling

<img src="index_files/figure-html/unnamed-chunk-14-1.svg" width="95%" style="display: block; margin: auto;" />
]

.right-60-column[
`$$\begin{aligned} \mathbf{y}_{ijkl} = \mu &+ (\mathtt{ncomp_i} + \mathtt{Method_j} + \mathtt{relpos_k} + \mathtt{gamma_l})^2 \\ &+ \mathtt{(Method:relpos:gamma)_{jkl}} + \varepsilon_{ijkl} \end{aligned}$$`

<img src="index_files/figure-html/unnamed-chunk-15-1.svg" width="95%" style="display: block; margin: auto;" />
]

???
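The formula on this slide maps directly onto R's model syntax. The sketch below assumes a single prediction-error column fitted with `lm()`; the data frame `pred_error`, its column names and its random values are synthetic stand-ins for illustration, not the results of the experiment.

```r
# Sketch only: `pred_error` is synthetic; names and values are assumptions.
set.seed(2023)
pred_error <- expand.grid(
  ncomp       = factor(1:3),
  Method      = c("PCR", "PLS1", "PLS2", "Xenv"),
  relpos      = c("1:5", "4:7"),
  gamma       = factor(c(0, 1.2)),
  Replication = 1:15
)
pred_error$error <- rnorm(nrow(pred_error), mean = 1, sd = 0.1)

# (a + b + c + d)^2 expands to all main effects and two-factor interactions;
# the single three-factor interaction from the slide is added explicitly.
meta_model <- lm(
  error ~ (ncomp + Method + relpos + gamma)^2 + Method:relpos:gamma,
  data = pred_error
)
model_terms <- attr(terms(meta_model), "term.labels")
anova(meta_model)
```

Because the errors here are pure noise, the ANOVA table only illustrates the structure of the model; the effects themselves carry no meaning.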
A statistical model can be formulated to analyse these prediction errors

- Explain the plots in detail (if you have time)
- Especially the gamma:relpos interaction

---
class: middle, center

# Acknowledgement

.flex-box[
.solve[
<img src="_images/solve.jpg" width="70%" style="display: block; margin: auto;" />
### Solve Sæbø
.inst[
(NMBU)
]
]
.other[
]
.trygve[
<img src="_images/trygve.jpg" width="70%" style="display: block; margin: auto;" />
### Trygve Almøy
.inst[
(BioStatistics, NMBU)
]
]
]

---
class: top

# References

.references[
Almøy, T. (1996). "A simulation study on comparison of prediction methods when only a few components are relevant". In: _Computational Statistics & Data Analysis_ 21.1, pp. 87-107. DOI: [10.1016/0167-9473(95)00006-2](https://doi.org/10.1016%2F0167-9473%2895%2900006-2). URL: [https://doi.org/10.1016/0167-9473(95)00006-2](https://doi.org/10.1016/0167-9473(95)00006-2).

Cook, D., Z. Su, Y. Yang, et al. (2015). "envlp: A MATLAB Toolbox for Computing Envelope Estimators in Multivariate Analysis". In: _Journal of Statistical Software_ 62.1, pp. 1-20.

Cook, R. D., B. Li, and F. Chiaromonte (2010). "Envelope models for parsimonious and efficient multivariate linear regression". In: _Statistica Sinica_, pp. 927-960.

Cook, R. D. and X. Zhang (2015). "Simultaneous envelopes for multivariate linear regression". In: _Technometrics_ 57.1, pp. 11-25.

Helland, I. S. (2000). "Model Reduction for Prediction in Regression Models". In: _Scandinavian Journal of Statistics_ 27.1, pp. 1-20. ISSN: 1467-9469. DOI: [10.1111/1467-9469.00174](https://doi.org/10.1111%2F1467-9469.00174). URL: [http://dx.doi.org/10.1111/1467-9469.00174](http://dx.doi.org/10.1111/1467-9469.00174).
]

---

# References

.references[
Helland, I. S. and T. Almøy (1994). "Comparison of prediction methods when only a few components are relevant". In: _Journal of the American Statistical Association_ 89.426, pp. 583-591.

Helland, I. S., S. Sæbø, T. Almøy, et al. (2017).
"Model and estimators for partial least squares". unpublished. Helland, I. S., S. Sæbø, T. Almøy, et al. "Model and estimators for partial least squares regression". In: _Journal of Chemometrics_, p. e3044. Rimal, R., T. Almøy, and S. Sæbø (2018). "A tool for simulating multi-response linear model data". In: _Chemometrics and Intelligent Laboratory Systems_ 176, pp. 1-10. Sæbø, S., T. Almøy, and I. S. Helland (2015). "simrel - A versatile tool for linear model data simulation based on the concept of a relevant subspace and relevant predictors". In: _Chemometrics and Intelligent Laboratory Systems_. ] --- # Installation .flex-box[ .installation-details[ ## R-Package ```r if (!require(devtools)) install.packages("devtools") devtools::install_github("simulatr/simrel") ``` ## Shiny Application ```r if (!require(simrel)) install.packages("simrel") shiny::runGitHub("simulatr/AppSimulatr") ``` ] <img src="_images/simrel-hex.svg" width="75%" id='simrel-hex' style="display: block; margin: auto;" /> ] --- background-image: url(_images/ThankYou.png) background-size: cover background-position: center background-repeat: no-repeat