We assessed how technical and biological sources of variability in transcriptomic data affect the predictive performance of molecular signatures learned from these data. Our approach compares two validation methods: (1) ordinary randomized cross-validation (RCV), which holds out random subsets of samples for testing, and (2) inter-study validation (ISV), which holds out an entire study for testing. RCV assumes that the training and test data are identically distributed; ISV breaks this assumption, which yields systematic differences in its performance estimates relative to RCV. The difference between the RCV and ISV estimates therefore quantifies the influence of inter-study variability on phenotype classification performance. The classification algorithms we applied in this analysis were either multiclass SVM (libSVM package for MATLAB) or Identification of Structured Signatures and Classifiers. The analysis code is provided below and can be modified to incorporate other classification approaches.
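The RCV/ISV comparison described above can be illustrated with a minimal sketch. This is not the deposited analysis code: it is a standard-library Python toy that swaps in a 1-nearest-neighbor classifier for the SVM and uses synthetic two-phenotype data with an artificial per-study offset. All names and parameters (make_samples, batch_sd, rcv_folds, isv_folds, and so on) are hypothetical and chosen only for illustration.

```python
import random

def make_samples(n_studies=8, per_class=15, n_genes=5, batch_sd=3.0, seed=0):
    """Synthetic (expression, phenotype, study) triples; each study adds its own offset."""
    rng = random.Random(seed)
    samples = []
    for study in range(n_studies):
        # Assumption: inter-study variability is modeled as an additive per-gene shift.
        offset = [rng.gauss(0.0, batch_sd) for _ in range(n_genes)]
        for phenotype in (0, 1):
            for _ in range(per_class):
                x = [2.0 * phenotype + offset[g] + rng.gauss(0.0, 1.0)
                     for g in range(n_genes)]
                samples.append((x, phenotype, study))
    return samples

def predict_1nn(train, x):
    """Return the phenotype of the training sample closest to x (squared Euclidean)."""
    nearest = min(train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))
    return nearest[1]

def accuracy_over_folds(samples, folds):
    """Train on everything outside each fold, test on the fold, pool the results."""
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train = [s for i, s in enumerate(samples) if i not in held_out]
        correct += sum(predict_1nn(train, samples[i][0]) == samples[i][1]
                       for i in test_idx)
    return correct / len(samples)

def rcv_folds(samples, k, seed=0):
    """Randomized cross-validation: shuffle sample indices and deal them into k folds."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def isv_folds(samples):
    """Inter-study validation: one fold per study, holding out all of its samples."""
    by_study = {}
    for i, (_, _, study) in enumerate(samples):
        by_study.setdefault(study, []).append(i)
    return list(by_study.values())

samples = make_samples()
n_studies = len({s[2] for s in samples})
rcv_acc = accuracy_over_folds(samples, rcv_folds(samples, k=n_studies))
isv_acc = accuracy_over_folds(samples, isv_folds(samples))
print(f"RCV accuracy: {rcv_acc:.3f}")
print(f"ISV accuracy: {isv_acc:.3f}")
print(f"RCV - ISV gap: {rcv_acc - isv_acc:.3f}")
```

In this toy setting the only source of the RCV/ISV gap is the injected study offset, so lowering batch_sd toward zero should shrink the gap. Real expression data will behave differently, and the deposited code should be used for any actual analysis.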
As part of this analysis, we gathered 1,470 microarray samples covering 6 lung phenotypes from 26 independent experimental studies in the Gene Expression Omnibus and ArrayExpress. We applied a custom consensus GCRMA preprocessing pipeline to the raw .CEL files of each of these samples. We provide the preprocessed dataset below.
Data and Software File(s):
CCVAcodeREADME.txt
The consensus preprocessed microarray dataset of 26 publicly available lung gene expression studies
CCVAcode.zip: Code and files needed to run comparative cross-validation analysis
ResultsAnalysisCode.zip: Code and files needed to generate the output graphs from the CCVA results