Supervisors:

# Introduction

## My PhD Plan

### Why I am doing this

Important for:

• Research
• Education
• Method Evaluation

## What I learn

• Advanced Multivariate methods and their properties
• Programming concepts for developing statistical packages and applications for various statistical methods
• Extending and improving existing methods in statistics
• And, of course, properly documenting what I have done

# A comparative study of different estimation methods using simulated data

## Overview

Four estimation methods were considered

#### Ordinary Least Squares (OLS)

• Although unbiased, it suffers heavily from multicollinearity
• Widely used, so it serves as a reference for comparison
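
The multicollinearity problem can be made concrete with a standard result. This is a sketch under the usual assumptions of a fixed, centred design matrix $$\mathbf{X}$$ of full column rank and uncorrelated errors with common variance $$\sigma^2$$; the symbols $$d_j$$ and $$\mathbf{v}_j$$ (eigenvalues and eigenvectors of $$\mathbf{X}^t\mathbf{X}$$) are notation introduced here, not taken from the slides.

$$
\operatorname{E}\left(\hat{\boldsymbol{\beta}}_{OLS}\right) = \boldsymbol{\beta}, \qquad
\operatorname{Var}\left(\hat{\boldsymbol{\beta}}_{OLS}\right) = \sigma^2\left(\mathbf{X}^t\mathbf{X}\right)^{-1} = \sigma^2\sum_{j=1}^{p}\frac{1}{d_j}\mathbf{v}_j\mathbf{v}_j^t
$$

When predictors are highly correlated, some $$d_j$$ are close to zero, so the corresponding terms inflate the variance even though the estimator stays unbiased.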

#### Partial Least Squares (PLS)

• Well established and widely used method
• Based on latent structures, so it avoids the multicollinearity problem

#### Envelope

• Relatively new method (Cook, Helland, & Su, 2013), also based on reduction of the regression model
• Based on maximum likelihood, but works better than OLS when $$p$$ approaches $$n$$

#### Bayes PLS

• Bayesian estimation of the regression coefficients
• Showed promising performance in previous studies (Helland, Sæbø, & Tjelmeland, 2012)

## Simulation Design

Population Parameters were set as follows:

• Number of sample observations: 50
• Number of predictor variables: 15 and 40
• Coefficient of determination $$(R^2)$$: 0.5 and 0.9
• Level of multicollinearity: 0.5 and 0.9
• Position of relevant components: 1 and 2; 1 and 3; 2 and 3; 1, 2 and 3

Combining the parameter levels above gives 32 designs. Each design was simulated 5 times, giving 160 datasets in total, where the 5 replicates of a design share the same population properties.
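
The design counts can be checked with a short enumeration. This is only an illustrative sketch: the dictionary keys (`n`, `p`, `R2`, `multicollinearity`, `relpos`) and the `design_id`/`replicate` labels are hypothetical names chosen here, not identifiers from the study's code.

```python
from itertools import product

# Factor levels as listed in the simulation design (key names are hypothetical)
factors = {
    "n": [50],
    "p": [15, 40],
    "R2": [0.5, 0.9],
    "multicollinearity": [0.5, 0.9],
    "relpos": [(1, 2), (1, 3), (2, 3), (1, 2, 3)],
}
n_replicates = 5

# One design per combination of factor levels
designs = [dict(zip(factors, levels)) for levels in product(*factors.values())]

# Each design is replicated n_replicates times
datasets = [
    {"design_id": d, "replicate": r, **design}
    for d, design in enumerate(designs, start=1)
    for r in range(1, n_replicates + 1)
]

print(len(designs), len(datasets))  # 32 160
```

The count follows from 1 × 2 × 2 × 2 × 4 = 32 combinations, each replicated 5 times.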

## A Systematic Comparison

• Bayes PLS outperformed the other methods
• Envelope performed better than OLS
• OLS prediction: very poor on noisy data

## A Systematic Comparison

• Bayes PLS reached its minimum error with very few components, and the error stayed low as components were added
• PLS showed moderate performance, but was better than envelope in many situations
• OLS prediction is poor, especially with a large number of predictors
• The envelope method reached its minimum error, after which the error increased with additional components

# simrel-m: A versatile tool for simulating multi-response linear model data

## simrel-m

It is an extension of the simrel R package (Sæbø, Almøy, & Helland, 2015) for simulating multi-response data

• Based on the idea of reduction of the random regression model
• It separates $$X$$ into subspaces that are relevant and irrelevant for predicting each response
• It reparameterizes the population model, $\mathbf{Y} = \boldsymbol{\mu}_{Y} + \mathbf{B}^t\left(\mathbf{X} - \boldsymbol{\mu}_X\right) + \boldsymbol{\epsilon} \text{, where }\boldsymbol{\epsilon} \sim N(0, \boldsymbol{\Sigma}_{Y|X})$
• It can simulate data of a diverse nature with very few parameters
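
For the population model above, the regression coefficients and the error covariance follow from the joint covariance of $$\mathbf{X}$$ and $$\mathbf{Y}$$ by standard multivariate normal theory. The block symbols $$\boldsymbol{\Sigma}_{XX}$$, $$\boldsymbol{\Sigma}_{XY}$$ and $$\boldsymbol{\Sigma}_{YY}$$ are notation assumed here for the covariance of $$\mathbf{X}$$, the covariance between $$\mathbf{X}$$ and $$\mathbf{Y}$$, and the covariance of $$\mathbf{Y}$$.

$$
\mathbf{B} = \boldsymbol{\Sigma}_{XX}^{-1}\boldsymbol{\Sigma}_{XY}, \qquad
\boldsymbol{\Sigma}_{Y|X} = \boldsymbol{\Sigma}_{YY} - \boldsymbol{\Sigma}_{XY}^t\boldsymbol{\Sigma}_{XX}^{-1}\boldsymbol{\Sigma}_{XY}
$$

Specifying the joint covariance matrix therefore fixes the true coefficients and the noise level of the simulated data.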

## How it works

• Collect input parameters from the user
• Make a covariance matrix satisfying those input parameters
• Compute true population properties such as regression coefficients
• Sample calibration and validation sets
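
The four steps can be sketched for a single response in plain Python. This is a simplified illustration, not the simrel-m implementation. It assumes that predictor components are uncorrelated with exponentially decaying variances, that the relevant components contribute equally to $$R^2$$, and that the response has unit variance. The rotation from components to predictors is left out. The function `simulate` and all of its argument names are hypothetical.

```python
import math
import random


def simulate(n_train, n_test, p, r2, gamma, relpos, seed=0):
    """Sketch of the four steps for one response (all names are hypothetical)."""
    rng = random.Random(seed)

    # Step 2: covariance structure. Component variances decay as exp(-gamma * j)
    # (assumed form); only components listed in relpos covary with the response.
    lam = [math.exp(-gamma * j) for j in range(p)]
    sigma_zy = [0.0] * p
    for pos in relpos:  # positions are 1-based
        sigma_zy[pos - 1] = math.sqrt(r2 * lam[pos - 1] / len(relpos))

    # Step 3: true population properties in the component basis
    beta = [s / l for s, l in zip(sigma_zy, lam)]
    r2_check = sum(s * s / l for s, l in zip(sigma_zy, lam))
    noise_sd = math.sqrt(1.0 - r2)  # response variance fixed at 1

    # Step 4: draw observations from the population model
    def draw(n_obs):
        Z, y = [], []
        for _ in range(n_obs):
            z = [rng.gauss(0.0, math.sqrt(l)) for l in lam]
            Z.append(z)
            y.append(sum(b * v for b, v in zip(beta, z)) + rng.gauss(0.0, noise_sd))
        return Z, y

    return {
        "beta": beta,
        "r2_check": r2_check,
        "train": draw(n_train),
        "test": draw(n_test),
    }


# Step 1: input parameters supplied by the user
sim = simulate(n_train=50, n_test=20, p=15, r2=0.9, gamma=0.5, relpos=(1, 3), seed=1)
print(round(sim["r2_check"], 6))
```

Here `r2_check` recomputes the population $$R^2$$ from the constructed covariances, a quick way to confirm that the covariance structure satisfies the requested input.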

# Demonstration

## References

Cook, R. D., Helland, I. S., & Su, Z. (2013). Envelopes and partial least squares regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(5), 851–877.

Helland, I. S., Sæbø, S., & Tjelmeland, H. (2012). Near optimal prediction from relevant components. Scandinavian Journal of Statistics, 39(4), 695–713.

Sæbø, S., Almøy, T., & Helland, I. S. (2015). Simrel—A versatile tool for linear model data simulation based on the concept of a relevant subspace and relevant predictors. Chemometrics and Intelligent Laboratory Systems, 146, 128–135.