Introduction

In this workshop, we will study the GSE53987 dataset, which covers bipolar disorder (BD), major depressive disorder (MDD), and schizophrenia. You can download it here.

In total there are 205 rows: samples from 19 individuals diagnosed with BD, 19 with MDD, 19 with schizophrenia, and 19 controls, each profiled in up to 3 post-mortem brain tissues. There are 13,768 genes (numeric features), 10 metadata features, and 1 ID (the GEO sample accession).

Load data

library(ggplot2)  # used for the plots throughout this workshop
data <- read.csv("~/Downloads/GSE53987_combined.csv", row.names = 1)

Explore dataset

First, let’s check the dimensions of the dataset.

dim(data)
## [1]   205 13778

Task 1: Feature Exploration

Check all the features. Which features are numeric, and which are categorical? Understanding the nature of your data is an essential first step before proceeding with any analysis.

  • What types of distributions exist within the features? Is Gender a balanced feature (roughly equal representation of men and women)? Are numeric values normally distributed? Explore the numeric distributions by plotting histograms of Age, Age split by Gender, and one of your favorite genes in the dataset.

  • Some features display factor-dependent values. That is, whether a subject is male or female might affect the expression pattern of a given gene. Explore factor and feature relationships by creating boxplots to observe how Age depends on Tissue, Gender, and Disease.state.

Gender

# Explore the distribution of genders
table(data[,'Gender'])
## 
##   F   M 
##  93 112
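
Expressed as proportions, the split above (93 F vs. 112 M, roughly 45%/55%) is close to balanced:

prop.table(table(data[,'Gender']))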

Age

# Explore the distribution of ages
hist(data[, 'Age'],  prob = TRUE)
lines(density(data[, 'Age']))

# fancier ggplot version
ggplot2::ggplot(data = data, aes(x=Age)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5) +
  geom_density()

Age faceted by gender

# Explore the distribution of age in each gender
ggplot2::ggplot(data = data, aes(x=Age)) +
  geom_histogram(aes(y = after_stat(density))) +
  geom_density() +
  facet_wrap(~Gender)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Distribution of ZWINT gene

ggplot2::ggplot(data = data, aes(x=ZWINT)) +
  geom_histogram(aes(y = after_stat(density))) +
  geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Relationship between age, disease state, and gender

ggplot2::ggplot(data = data, aes(x=Disease.state, y=Age, fill=Gender)) +
  geom_boxplot() 

ggplot2::ggplot(data = data, aes(x=Gender, y=Age, fill=Disease.state)) +
  geom_boxplot()

It is always good practice to order the conditions so that the control group or baseline condition comes first.

data$Disease.state <- factor(data$Disease.state, levels = c("control", "Bipolar disorder", "Major depression disorder", "schizophrenia"))
ggplot2::ggplot(data = data, aes(x=Disease.state, y=Age, fill=Gender)) +
  geom_boxplot() 

ggplot2::ggplot(data = data, aes(x=Gender, y=Age, fill=Disease.state)) +
  geom_boxplot()

Relationship between age, disease state, and tissue type

ggplot2::ggplot(data = data, aes(x=Disease.state, y=Age, fill=Tissue)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot2::ggplot(data = data, aes(x=Tissue, y=Age, fill=Disease.state)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot2::ggplot(data = data, aes(x=Disease.state, y=Age, fill=Tissue)) +
  geom_boxplot() +
  facet_wrap(~Gender) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) #make labels more readable

ggplot2::ggplot(data = data, aes(x=Tissue, y=Age, fill=Disease.state)) +
  geom_boxplot() +
  facet_wrap(~Gender) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Task 2: Principal Component Analysis

Principal Component Analysis (PCA) is a commonly used technique for creating linearly uncorrelated features from a set of possibly correlated features. The procedure is constructed so that the first principal component (PC1) explains the largest possible amount of variability, the second explains the largest amount of the remaining variability, and so on. PCA is therefore a dimension-reduction technique: the first few principal components often explain upwards of 90% of the variability in a dataset. Note that if we plan to predict anything using the principal components, such as Tissue or Disease.state, those features must not be included in the input matrix. Before performing PCA, create a new data frame containing only explanatory variables (i.e. the features we want to use to predict class membership).

  • Explore how much variation is explained by the principal components. How much variation is explained by the first two principal components? How many principal components might be required to explain 75%, 85%, 90%, 95%, and 99% of the variation within our dataset?

  • Visually explore this separation by plotting the first two principal components and coloring the samples according to Tissue and Disease.state. What effect does plotting the third principal component have on sample separation?

  • Subset the dataset into three disjoint datasets by Tissue. Run PCA on each of the three datasets, plot the first two principal components, and color the points according to Disease.state. Does there appear to be a meaningful difference in the separation between disease classes across the three datasets?

Clean up data for PCA

# Create a dataframe containing only the expression data
# (columns 1-10 hold the metadata; genes start at column 11)
expression = data[, 11:ncol(data)]

# Run PCA (prcomp computes all principal components)
pca = prcomp(x = expression)

Scale your data first, then rerun PCA

PCA is sensitive to the underlying variances of your variables: it preferentially chooses axes along directions of high variance. That may be what you want, but many people first scale their data (in this case, Z-score each gene) and run PCA on the result, so that every feature contributes on the same scale and PCA identifies sets of correlated features rather than simply the highest-variance ones.

expression_scaled = scale(expression, center = TRUE, scale = TRUE)

pca_scaled = prcomp(x = expression_scaled)
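
Equivalently, prcomp can perform the centering and scaling itself; this one-liner should give the same components (up to sign) as the two-step version above:

# Let prcomp centre and scale internally
pca_scaled_alt = prcomp(expression, center = TRUE, scale. = TRUE)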

Plot the first two PCs

pca_df <- data.frame(pca$x)
pca_df$tissue <- data$Tissue
ggplot(data = pca_df, aes(x=PC1, y=PC2, col=tissue)) +
  geom_point()

Plot the first two PCs after scaling is performed.

What differences do you notice?

pca_df_scaled <- data.frame(pca_scaled$x)
pca_df_scaled$tissue <- data$Tissue
ggplot(data = pca_df_scaled, aes(x=PC1, y=PC2, col=tissue)) +
  geom_point() 

PCA explained variation

# %Variance = pca$sdev^2/sum(pca$sdev^2)
summary(pca)
## Importance of components:
##                            PC1     PC2      PC3     PC4     PC5     PC6    PC7
## Standard deviation     31.1430 17.7250 10.96164 8.95247 7.13340 6.24519 5.5880
## Proportion of Variance  0.4691  0.1520  0.05812 0.03876 0.02461 0.01886 0.0151
## Cumulative Proportion   0.4691  0.6211  0.67917 0.71794 0.74255 0.76141 0.7765
##                            PC8     PC9    PC10    PC11   PC12    PC13    PC14
## Standard deviation     5.06453 4.66497 4.21519 3.99315 3.8863 3.68249 3.57089
## Proportion of Variance 0.01241 0.01053 0.00859 0.00771 0.0073 0.00656 0.00617
## Cumulative Proportion  0.78892 0.79944 0.80804 0.81575 0.8231 0.82961 0.83578
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     3.30411 3.04443 2.90404 2.83318 2.76900 2.64638 2.58053
## Proportion of Variance 0.00528 0.00448 0.00408 0.00388 0.00371 0.00339 0.00322
## Cumulative Proportion  0.84106 0.84554 0.84962 0.85351 0.85721 0.86060 0.86382
##                           PC22    PC23   PC24    PC25   PC26    PC27    PC28
## Standard deviation     2.51111 2.42890 2.3605 2.34306 2.3201 2.25634 2.19149
## Proportion of Variance 0.00305 0.00285 0.0027 0.00266 0.0026 0.00246 0.00232
## Cumulative Proportion  0.86687 0.86973 0.8724 0.87508 0.8777 0.88014 0.88246
##                          PC29    PC30    PC31    PC32    PC33    PC34   PC35
## Standard deviation     2.1346 2.06796 2.05302 2.00469 1.95893 1.95034 1.9307
## Proportion of Variance 0.0022 0.00207 0.00204 0.00194 0.00186 0.00184 0.0018
## Cumulative Proportion  0.8847 0.88674 0.88878 0.89072 0.89257 0.89441 0.8962
##                           PC36   PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     1.90950 1.8752 1.85689 1.82871 1.80152 1.78258 1.75016
## Proportion of Variance 0.00176 0.0017 0.00167 0.00162 0.00157 0.00154 0.00148
## Cumulative Proportion  0.89798 0.8997 0.90135 0.90297 0.90454 0.90607 0.90756
##                           PC43    PC44    PC45    PC46    PC47    PC48    PC49
## Standard deviation     1.71103 1.69760 1.67131 1.65049 1.63246 1.62538 1.60549
## Proportion of Variance 0.00142 0.00139 0.00135 0.00132 0.00129 0.00128 0.00125
## Cumulative Proportion  0.90897 0.91036 0.91172 0.91303 0.91432 0.91560 0.91685
##                           PC50    PC51    PC52    PC53    PC54    PC55   PC56
## Standard deviation     1.59602 1.56806 1.56287 1.55042 1.53572 1.52097 1.5074
## Proportion of Variance 0.00123 0.00119 0.00118 0.00116 0.00114 0.00112 0.0011
## Cumulative Proportion  0.91808 0.91927 0.92045 0.92161 0.92275 0.92387 0.9250
##                           PC57    PC58    PC59    PC60    PC61    PC62    PC63
## Standard deviation     1.48275 1.46927 1.45695 1.44933 1.43392 1.42896 1.42516
## Proportion of Variance 0.00106 0.00104 0.00103 0.00102 0.00099 0.00099 0.00098
## Cumulative Proportion  0.92603 0.92708 0.92810 0.92912 0.93012 0.93110 0.93209
##                           PC64    PC65    PC66    PC67    PC68   PC69    PC70
## Standard deviation     1.40491 1.40246 1.38423 1.37821 1.37119 1.3633 1.34394
## Proportion of Variance 0.00095 0.00095 0.00093 0.00092 0.00091 0.0009 0.00087
## Cumulative Proportion  0.93304 0.93399 0.93492 0.93584 0.93675 0.9376 0.93852
##                           PC71    PC72    PC73    PC74    PC75    PC76    PC77
## Standard deviation     1.33711 1.32677 1.32392 1.32180 1.31456 1.30424 1.29733
## Proportion of Variance 0.00086 0.00085 0.00085 0.00085 0.00084 0.00082 0.00081
## Cumulative Proportion  0.93938 0.94023 0.94108 0.94193 0.94276 0.94359 0.94440
##                          PC78    PC79    PC80    PC81    PC82    PC83    PC84
## Standard deviation     1.2883 1.27183 1.26875 1.26295 1.25464 1.24912 1.24553
## Proportion of Variance 0.0008 0.00078 0.00078 0.00077 0.00076 0.00075 0.00075
## Cumulative Proportion  0.9452 0.94599 0.94676 0.94754 0.94830 0.94905 0.94980
##                           PC85    PC86    PC87    PC88    PC89    PC90    PC91
## Standard deviation     1.24130 1.23045 1.22805 1.21890 1.21284 1.20951 1.19444
## Proportion of Variance 0.00075 0.00073 0.00073 0.00072 0.00071 0.00071 0.00069
## Cumulative Proportion  0.95055 0.95128 0.95201 0.95273 0.95344 0.95415 0.95484
##                           PC92    PC93    PC94    PC95    PC96    PC97    PC98
## Standard deviation     1.18888 1.17899 1.17275 1.16642 1.15817 1.15221 1.15149
## Proportion of Variance 0.00068 0.00067 0.00067 0.00066 0.00065 0.00064 0.00064
## Cumulative Proportion  0.95552 0.95619 0.95686 0.95752 0.95816 0.95881 0.95945
##                           PC99   PC100   PC101  PC102   PC103   PC104   PC105
## Standard deviation     1.14547 1.14403 1.12855 1.1162 1.10521 1.10251 1.09988
## Proportion of Variance 0.00063 0.00063 0.00062 0.0006 0.00059 0.00059 0.00059
## Cumulative Proportion  0.96008 0.96072 0.96133 0.9619 0.96252 0.96311 0.96370
##                          PC106   PC107   PC108   PC109   PC110   PC111   PC112
## Standard deviation     1.09378 1.08483 1.07870 1.06856 1.06570 1.05990 1.04624
## Proportion of Variance 0.00058 0.00057 0.00056 0.00055 0.00055 0.00054 0.00053
## Cumulative Proportion  0.96428 0.96485 0.96541 0.96596 0.96651 0.96705 0.96758
##                          PC113   PC114   PC115   PC116   PC117  PC118  PC119
## Standard deviation     1.04352 1.04025 1.03484 1.02957 1.02817 1.0218 1.0198
## Proportion of Variance 0.00053 0.00052 0.00052 0.00051 0.00051 0.0005 0.0005
## Cumulative Proportion  0.96811 0.96863 0.96915 0.96966 0.97017 0.9707 0.9712
##                          PC120   PC121   PC122   PC123   PC124   PC125   PC126
## Standard deviation     1.00795 1.00706 1.00461 0.99874 0.99496 0.99125 0.98688
## Proportion of Variance 0.00049 0.00049 0.00049 0.00048 0.00048 0.00048 0.00047
## Cumulative Proportion  0.97167 0.97216 0.97265 0.97313 0.97361 0.97409 0.97456
##                          PC127   PC128   PC129   PC130   PC131   PC132   PC133
## Standard deviation     0.98424 0.98154 0.97062 0.96450 0.96253 0.95858 0.95085
## Proportion of Variance 0.00047 0.00047 0.00046 0.00045 0.00045 0.00044 0.00044
## Cumulative Proportion  0.97503 0.97549 0.97595 0.97640 0.97685 0.97729 0.97773
##                          PC134   PC135   PC136   PC137   PC138   PC139   PC140
## Standard deviation     0.94990 0.94183 0.93814 0.93163 0.92687 0.92619 0.92064
## Proportion of Variance 0.00044 0.00043 0.00043 0.00042 0.00042 0.00041 0.00041
## Cumulative Proportion  0.97817 0.97860 0.97902 0.97944 0.97986 0.98027 0.98068
##                          PC141  PC142  PC143   PC144   PC145   PC146   PC147
## Standard deviation     0.91544 0.9123 0.9107 0.90293 0.89885 0.89010 0.88961
## Proportion of Variance 0.00041 0.0004 0.0004 0.00039 0.00039 0.00038 0.00038
## Cumulative Proportion  0.98109 0.9815 0.9819 0.98228 0.98268 0.98306 0.98344
##                          PC148   PC149   PC150   PC151   PC152   PC153   PC154
## Standard deviation     0.88693 0.88030 0.87788 0.87084 0.86798 0.86714 0.86333
## Proportion of Variance 0.00038 0.00037 0.00037 0.00037 0.00036 0.00036 0.00036
## Cumulative Proportion  0.98382 0.98420 0.98457 0.98494 0.98530 0.98566 0.98602
##                          PC155   PC156   PC157   PC158   PC159   PC160   PC161
## Standard deviation     0.85981 0.85831 0.85707 0.85198 0.84743 0.84461 0.84008
## Proportion of Variance 0.00036 0.00036 0.00036 0.00035 0.00035 0.00035 0.00034
## Cumulative Proportion  0.98638 0.98674 0.98709 0.98744 0.98779 0.98814 0.98848
##                          PC162   PC163   PC164   PC165   PC166   PC167   PC168
## Standard deviation     0.83298 0.83015 0.82620 0.82212 0.82121 0.81516 0.81185
## Proportion of Variance 0.00034 0.00033 0.00033 0.00033 0.00033 0.00032 0.00032
## Cumulative Proportion  0.98881 0.98915 0.98948 0.98980 0.99013 0.99045 0.99077
##                          PC169   PC170   PC171   PC172  PC173  PC174   PC175
## Standard deviation     0.80762 0.80449 0.79959 0.79649 0.7912 0.7840 0.77793
## Proportion of Variance 0.00032 0.00031 0.00031 0.00031 0.0003 0.0003 0.00029
## Cumulative Proportion  0.99109 0.99140 0.99171 0.99202 0.9923 0.9926 0.99291
##                          PC176   PC177   PC178   PC179   PC180   PC181   PC182
## Standard deviation     0.77641 0.76757 0.76417 0.75853 0.75554 0.75230 0.74754
## Proportion of Variance 0.00029 0.00028 0.00028 0.00028 0.00028 0.00027 0.00027
## Cumulative Proportion  0.99320 0.99348 0.99377 0.99405 0.99432 0.99460 0.99487
##                          PC183   PC184   PC185   PC186   PC187   PC188   PC189
## Standard deviation     0.74495 0.74070 0.73289 0.73061 0.72514 0.72142 0.71708
## Proportion of Variance 0.00027 0.00027 0.00026 0.00026 0.00025 0.00025 0.00025
## Cumulative Proportion  0.99513 0.99540 0.99566 0.99592 0.99617 0.99642 0.99667
##                          PC190   PC191   PC192   PC193   PC194   PC195   PC196
## Standard deviation     0.71032 0.70465 0.70260 0.69582 0.68845 0.68546 0.68428
## Proportion of Variance 0.00024 0.00024 0.00024 0.00023 0.00023 0.00023 0.00023
## Cumulative Proportion  0.99692 0.99716 0.99739 0.99763 0.99786 0.99809 0.99831
##                          PC197   PC198   PC199   PC200   PC201   PC202  PC203
## Standard deviation     0.68017 0.67401 0.66806 0.66574 0.65843 0.65698 0.6439
## Proportion of Variance 0.00022 0.00022 0.00022 0.00021 0.00021 0.00021 0.0002
## Cumulative Proportion  0.99854 0.99876 0.99897 0.99919 0.99940 0.99960 0.9998
##                         PC204     PC205
## Standard deviation     0.6354 3.885e-14
## Proportion of Variance 0.0002 0.000e+00
## Cumulative Proportion  1.0000 1.000e+00
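
Rather than reading the long table by eye, the cumulative proportion of variance can be computed directly; the snippet below (using the pca object from above) returns the first principal component at which each threshold from the task is reached.

cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Index of the first PC at which each cumulative-variance threshold is reached
sapply(c(0.75, 0.85, 0.90, 0.95, 0.99), function(t) which(cum_var >= t)[1])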

By tissue: hippocampus

hippocampus = data[data$Tissue == "hippocampus",]
hippocampus_expression = hippocampus[,11:ncol(hippocampus)]
hippocampus_pca = prcomp(hippocampus_expression)

# Again, you can scale this data
hippocampus_expression_scaled <- scale(hippocampus_expression, center = TRUE, scale = TRUE)
hippocampus_pca_scaled = prcomp(hippocampus_expression_scaled)

Plot the first two PCs: hippocampus

hippo_df <- data.frame(hippocampus_pca$x)
hippo_df$disease <- hippocampus$Disease.state
ggplot(data = hippo_df, aes(x=PC1, y=PC2, col=disease)) +
  geom_point() 

hippo_df_scaled <- data.frame(hippocampus_pca_scaled$x)
hippo_df_scaled$disease <- hippocampus$Disease.state
ggplot(data = hippo_df_scaled, aes(x=PC1, y=PC2, col=disease)) +
  geom_point() 
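
Task 2 asks for all three tissue subsets; the hippocampus steps above generalise directly. A sketch of the loop is below, assuming (as before) that the metadata sit in columns 1-10 and that no gene has zero variance within a tissue (scale() would otherwise produce NaNs).

# Run a scaled PCA per tissue and plot PC1 vs PC2 coloured by disease state
for (t in unique(data$Tissue)) {
  tissue_data <- data[data$Tissue == t, ]
  tissue_pca <- prcomp(scale(tissue_data[, 11:ncol(tissue_data)]))
  tissue_df <- data.frame(tissue_pca$x, disease = tissue_data$Disease.state)
  print(
    ggplot(tissue_df, aes(x = PC1, y = PC2, col = disease)) +
      geom_point() +
      ggtitle(t)
  )
}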

Task 3: Feature Selection

Feature selection by variance

gene_var <- apply(expression, 2, var)
gene_var.df <- data.frame(gene_var, row.names = colnames(expression))
quantile(gene_var.df$gene_var, seq(0,1,0.01))
##           0%           1%           2%           3%           4%           5% 
##  0.003443167  0.008198729  0.009996457  0.011536157  0.012737273  0.014073209 
##           6%           7%           8%           9%          10%          11% 
##  0.015310911  0.016310119  0.017252163  0.018285436  0.019408559  0.020242797 
##          12%          13%          14%          15%          16%          17% 
##  0.021190282  0.022189200  0.023287995  0.024284778  0.025193448  0.026089868 
##          18%          19%          20%          21%          22%          23% 
##  0.027014907  0.027755842  0.028600421  0.029427090  0.030305498  0.031327696 
##          24%          25%          26%          27%          28%          29% 
##  0.032319858  0.033179578  0.034150538  0.035252867  0.036307698  0.037331555 
##          30%          31%          32%          33%          34%          35% 
##  0.038269428  0.039352798  0.040538026  0.041528429  0.042600559  0.043664474 
##          36%          37%          38%          39%          40%          41% 
##  0.044861846  0.046224239  0.047359208  0.048824358  0.050252854  0.051482144 
##          42%          43%          44%          45%          46%          47% 
##  0.052943493  0.054264364  0.055926396  0.057417725  0.058990966  0.060396053 
##          48%          49%          50%          51%          52%          53% 
##  0.061926761  0.063630234  0.065512239  0.067218686  0.069118934  0.071253727 
##          54%          55%          56%          57%          58%          59% 
##  0.073321413  0.075444391  0.077624273  0.079672002  0.082176119  0.084713629 
##          60%          61%          62%          63%          64%          65% 
##  0.086731843  0.089482671  0.092319934  0.095209055  0.097970651  0.100753764 
##          66%          67%          68%          69%          70%          71% 
##  0.104380537  0.107830547  0.111225981  0.115201508  0.118736925  0.123569616 
##          72%          73%          74%          75%          76%          77% 
##  0.127937053  0.132374989  0.137732878  0.143699950  0.149183492  0.154869431 
##          78%          79%          80%          81%          82%          83% 
##  0.161159396  0.168623480  0.176822785  0.185243761  0.193750439  0.203091844 
##          84%          85%          86%          87%          88%          89% 
##  0.214160103  0.225337213  0.236300704  0.250325720  0.267717735  0.287610360 
##          90%          91%          92%          93%          94%          95% 
##  0.308177495  0.335497075  0.362922326  0.400941579  0.447427531  0.501203656 
##          96%          97%          98%          99%         100% 
##  0.587346937  0.706769047  0.926959700  1.504119077 11.146195059
# ~2/3 of features have a variance less than 0.1 (this threshold is arbitrary)
expression_highvar <- expression[,which(gene_var.df$gene_var > 0.1)]
dim(expression_highvar) # 205 x 4853
## [1]  205 4853
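
Because the 0.1 cutoff is arbitrary, a rank-based alternative is to keep a fixed fraction of the most variable genes, independent of the absolute variance scale:

# Keep the most variable third of genes (roughly matching the 0.1 cutoff above)
expression_topvar <- expression[, gene_var > quantile(gene_var, 2/3)]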

Univariate Feature Selection

Here, we perform univariate feature selection by choosing the features with the lowest p-values from a one-way ANOVA (here, modelling each gene's expression by tissue type).

expression_tissue <- expression 
expression_tissue$tissue <- factor(data$Tissue)

# Build one formula per gene: <gene> ~ tissue
allModelsList <- lapply(paste(colnames(expression), "~ tissue"), as.formula)
# Fit a linear model per gene, then extract the tissue-effect p-value via ANOVA
allModelsResults <- lapply(allModelsList, function(x) lm(x, data = expression_tissue))
allModelsANOVA <- lapply(allModelsResults, anova)
allModelsP <- lapply(allModelsANOVA, function(x) x[["Pr(>F)"]][1])
names(allModelsP) <- colnames(expression)
allModelsPSort <- sort(unlist(allModelsP))

# Top 10 features
head(allModelsPSort, n = 10)
##         ADORA2A SPECC1L.ADORA2A            NRN1         ANKRD11            RGS9 
##   3.771409e-176   3.771409e-176   4.688685e-172   1.832503e-171   3.914261e-170 
##           NPTXR            ISL1           EFR3A           NUAK1           KLF13 
##   8.023423e-168   5.016608e-167   1.317526e-163   3.526544e-161   1.426829e-159
# Plot top feature
ggplot(expression_tissue, aes(x = ADORA2A, y = tissue)) + geom_boxplot() + coord_flip()

Task 4: K-means Clustering (Unsupervised Learning)

Unsupervised learning can be thought of as applying an algorithm to a dataset in order to discover latent structure that exists between samples. We’ve already been exposed to some of these algorithms via PCA. However, one of the most common techniques in machine learning, and especially bioinformatics, is clustering. In this section, the goal is to cluster the data using the k-means algorithm.

We start by clustering the data into 3 clusters. Hint: use the kmeans function in R.

Compute k-means with k = 3.

set.seed(123)
km <- kmeans(expression, centers = 3)

Examine the sizes of the resultant clusters.

table(km$cluster)
## 
##  1  2  3 
## 68 69 68
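
The choice of k = 3 is motivated by the three tissues, but it can be sanity-checked with an elbow plot of the total within-cluster sum of squares. The sketch below uses the unscaled expression matrix from above; nstart > 1 makes kmeans less sensitive to its random initialisation.

set.seed(123)
# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(expression, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")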

Visualize the clustering results in the PCA plot.

Do the k-means clusters correspond to the tissues from the earlier PCA plot?

pca_df$cluster <- factor(km$cluster)
ggplot(data = pca_df, aes(x=PC1, y=PC2, col=cluster)) +
  geom_point()  +
  ggtitle("K-means clustering")

Task 5: Logistic Regression (Supervised Learning)

Supervised learning is a technique for teaching an algorithm to distinguish between previously labelled groups, such as Tissue, Gender, or Disease.state. All supervised methods require labelled data from which to learn how to differentiate between classes, so it is necessary to split the data into training and test sets. The training set is used to fit the model, while the test set is used to evaluate performance. Cross-validation, a method of partitioning the data into disjoint subsets and repeatedly re-training and re-testing on different combinations of partitions, is often used to evaluate models. In this section, we will build classifiers using logistic regression to predict different classes from our data. You should evaluate your models' performances using confusion matrices and accuracy scores.

Many functions below are from the caret package. See https://topepo.github.io/caret/ for helpful instructions.

Split the data into training and test sets (75% training, 25% test), making sure to preserve the distribution of Disease.state across the splits.

expression_disease <- expression
expression_disease$Disease <- data$Disease.state

library(caret)  # createDataPartition, preProcess, train

set.seed(123)
trainIndex <- createDataPartition(expression_disease$Disease, p = .75, 
                                  list = FALSE, 
                                  times = 1)

disease_train <- expression_disease[trainIndex,]
disease_test <- expression_disease[-trainIndex,]

dim(disease_train)
## [1]   155 13769
dim(disease_test)
## [1]    50 13769
prop.table(table(disease_train$Disease))
## 
##                   control          Bipolar disorder Major depression disorder 
##                 0.2709677                 0.2516129                 0.2451613 
##             schizophrenia 
##                 0.2322581
prop.table(table(disease_test$Disease))
## 
##                   control          Bipolar disorder Major depression disorder 
##                      0.26                      0.26                      0.24 
##             schizophrenia 
##                      0.24

Scale the training data

Features should be scaled in supervised learning. Importantly, you scale the training data and then apply the same scaling parameters (mean, standard deviation) to the test data. Allowing the test data to influence training (e.g. by letting it affect the scaling parameters) is known as data leakage and will artificially inflate your results.

preProcValues <- preProcess(disease_train, method = c("center", "scale"))

disease_train_scaled <- predict(preProcValues, disease_train)
disease_test_scaled <- predict(preProcValues, disease_test)

Perform Feature Selection for Disease using Univariate Methods

# Build one formula per gene (<gene> ~ Disease) and fit it on the scaled
# training data only, so the test set plays no part in feature selection
diseaseModels <- lapply(paste(colnames(expression), "~ Disease"), as.formula)
diseaseModelsResults <- lapply(diseaseModels, function(x) lm(x, data = disease_train_scaled))
diseaseModelsANOVA <- lapply(diseaseModelsResults, anova)
diseaseModelsP <- lapply(diseaseModelsANOVA, function(x) x[["Pr(>F)"]][1])
names(diseaseModelsP) <- colnames(expression)
diseaseModelsPSort <- sort(unlist(diseaseModelsP))

# This is the Bonferroni correction for 13768 tests at p < 0.05 significance
length(which(diseaseModelsPSort < 0.05/13768)) # 32 genes pass significance
## [1] 32
# Subset to most discriminative features
genes_discriminative_disease <- c(head(names(diseaseModelsPSort), n = 32), "Disease")
train_subset <- disease_train_scaled[, genes_discriminative_disease]
test_subset <- disease_test_scaled[, genes_discriminative_disease]
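
The manual threshold above is equivalent to R's built-in Bonferroni correction, which should select the same 32 genes:

sum(p.adjust(diseaseModelsPSort, method = "bonferroni") < 0.05)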

Train the multinomial logistic model using the “multinom” method.

NB: This requires that the "nnet" package be installed.

disease_log_mod = train(
  Disease ~ .,
  data = train_subset,
  method = "multinom",
  trControl = trainControl(method = "cv", number = 5),
  trace = FALSE
  # tuneGrid = multinom_grid  # optionally supply a tuning grid here
)

Predict the testing classes.

test_classes <- predict(disease_log_mod, newdata = test_subset)

test_table <- table(test_subset$Disease, test_classes)

# If the test accuracy is similar to the cross-validated training accuracy
# (computed below), the model is stable, albeit poor
test_table
##                            test_classes
##                             control Bipolar disorder Major depression disorder
##   control                         7                2                         1
##   Bipolar disorder                1                8                         2
##   Major depression disorder       5                4                         2
##   schizophrenia                   1                3                         0
##                            test_classes
##                             schizophrenia
##   control                               3
##   Bipolar disorder                      2
##   Major depression disorder             1
##   schizophrenia                         8
# Overall test accuracy
sum(diag(test_table))/sum(test_table)
## [1] 0.5
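
caret's confusionMatrix reports the same table along with overall accuracy and per-class statistics, and the cross-validated accuracy stored in the fitted model provides the training-side comparison referenced above:

# Confusion matrix with accuracy, kappa, and per-class metrics
confusionMatrix(data = test_classes, reference = test_subset$Disease)

# Cross-validated (training) accuracy for each candidate tuning value
disease_log_mod$results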