For a long time, costly high-dimensional data sets with few samples have fueled research into statistical techniques for data with many more features than samples. However, the emergence of vast online repositories of publicly available data is now increasing sample sizes at an unprecedented speed. Take as an example the Gene Expression Omnibus (GEO), an international public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomic data sets. Since its introduction in 2000, the database has grown to 55,429 data sets containing more than 1,350,000 samples. This calls for the development of robust and flexible tools to integrate the results of multiple high-dimensional experiments across laboratory batch effects and differing technologies. Using publicly available data sets from GEO, we will illustrate how we have developed methods for integrating data across laboratories and technologies to validate and study new subclassifications of lymph node cancer. The talk will cover the whole process, from preclinical models to the individual assignment and classification of tumors for inclusion in early clinical phase I–II trials.
*The research is funded by the National Experimental Therapy Partnership (NEXT), which is financed by a grant from Innovation Fund Denmark.