Britta Anker Bak

(Department of Mathematics, Aarhus University)


Thiele Seminar

Thursday, 20 November, 2014, at 13:15-14:00, in Koll. D (1531-211)

Abstract:

Scientific and technological developments have given rise to datasets with a very high number of variables p, while the number of observations n cannot be expected to increase at the same rate. Such ‘large p, small n’ situations appear in a wide variety of fields, including microarray analysis, chemometrics and medical imaging.

In applications, the goal is often to assign a new observation correctly to one of two groups based on a rule built from a training dataset. When p/n tends to infinity, it is well known that Fisher's Discrimination Rule is asymptotically as bad as a random guess. On the other hand, the independence rule, also known as naive Bayes, has been proven to lead to constructive results under certain parameter restrictions. For more general parameter spaces, it has been shown that some kind of thresholding is essential to avoid noise accumulation.

We assume that log(p)/n tends to zero and work on two approximately sparse parameter spaces that are suitable for situations with microarray data. We consider the behaviour of a thresholded version of the independence rule in classification between two normally distributed groups. By considering upper bounds for various tails of the normal and related distributions, we obtain perfect separation of relevant and irrelevant variables asymptotically, apart from variables with scaled group difference tending to zero at a certain rate. This leads to an upper bound on the classification error.
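As an illustration only, a thresholded independence rule of the general kind described above might be sketched as follows. This is not the speaker's construction: the function names, the pooled variance estimate and the free `threshold` parameter are all assumptions made for the sketch, and the abstract does not state how the threshold is chosen.

```python
import math


def fit_thresholded_independence_rule(group0, group1, threshold):
    """Select variables whose scaled group difference exceeds `threshold`.

    group0, group1: lists of observations (each a list of p floats).
    threshold: hypothetical cut-off; its choice is an assumption here.
    Returns a list of (index, mean0, mean1, pooled_variance) tuples.
    """
    n0, n1 = len(group0), len(group1)
    p = len(group0[0])
    selected = []
    for j in range(p):
        col0 = [row[j] for row in group0]
        col1 = [row[j] for row in group1]
        mean0 = sum(col0) / n0
        mean1 = sum(col1) / n1
        # Pooled within-group variance of variable j.
        ss = sum((v - mean0) ** 2 for v in col0) + sum((v - mean1) ** 2 for v in col1)
        pooled_var = ss / (n0 + n1 - 2)
        if pooled_var == 0:
            continue  # constant variable: the scaled difference is undefined
        scaled_diff = (mean1 - mean0) / math.sqrt(pooled_var)
        # Thresholding step: discard variables that look irrelevant.
        if abs(scaled_diff) > threshold:
            selected.append((j, mean0, mean1, pooled_var))
    return selected


def classify(x, selected):
    """Independence (diagonal) rule restricted to the selected variables.

    Returns 1 if the score is positive, otherwise 0 (including the case
    where no variable was selected and the score is zero).
    """
    score = sum(
        (mean1 - mean0) / var * (x[j] - (mean0 + mean1) / 2)
        for j, mean0, mean1, var in selected
    )
    return 1 if score > 0 else 0
```

The rule treats the variables as independent, so only per-variable means and variances are estimated and no p-by-p covariance matrix is needed. The thresholding step is what limits the accumulation of noise from the many irrelevant variables.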

Organised by: The T.N. Thiele Centre

Contact person: Søren Asmussen