In this workshop, we will study the GSE53987 dataset on bipolar disorder (BD), major depressive disorder (MDD) and schizophrenia. You can download it here.
In total there are 205 rows (samples) drawn from 19 individuals diagnosed with BD, 19 with MDD, 19 with schizophrenia and 19 controls. Each individual contributed post-mortem brain samples from up to 3 tissues, and each row is one tissue sample from one individual. There are a total of 13,768 genes (numeric features), 10 meta features and 1 ID (GEO sample accession).
data <- read.csv("~/Downloads/GSE53987_combined.csv", row.names = 1)
First, let’s check the dimensions of the dataset.
dim(data)
## [1] 205 13778
Check all the features. Which features are numeric, which are categorical? Understanding the nature of your data is a very important and necessary first step before proceeding with any analysis.
What type of distributions exist within the features? Is Gender a balanced feature (roughly equal representation between both men and women)? Are numerical values normally distributed? Explore numerical distributions by plotting histograms for Age, an Age + Gender histogram, and one of your favorite genes found in the dataset.
Some features display factor-dependent values. That is, whether a subject is male or female might affect the expression pattern of a given gene. Explore factor and feature relationships by creating boxplots to observe how Age varies with Tissue, Gender and Disease.state.
# Explore the distribution of genders
table(data[,'Gender'])
##
## F M
## 93 112
# Explore the distribution of ages
hist(data[, 'Age'], prob = TRUE)
lines(density(data[, 'Age']))
# fancy ggplot; load ggplot2 once so aes() and the geoms are found without the ggplot2:: prefix
library(ggplot2)
ggplot(data = data, aes(x = Age)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5) +
  geom_density()
# Explore the distribution of age in each gender
ggplot2::ggplot(data = data, aes(x=Age)) +
geom_histogram(aes(y = after_stat(density))) +
geom_density() +
facet_wrap(~Gender)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot2::ggplot(data = data, aes(x=ZWINT)) +
geom_histogram(aes(y = after_stat(density))) +
geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot2::ggplot(data = data, aes(x=Disease.state, y=Age, fill=Gender)) +
geom_boxplot()
ggplot2::ggplot(data = data, aes(x=Gender, y=Age, fill=Disease.state)) +
geom_boxplot()
data$Disease.state <- factor(data$Disease.state, levels = c("control", "Bipolar disorder", "Major depression disorder", "schizophrenia"))
ggplot2::ggplot(data = data, aes(x=Disease.state, y=Age, fill=Gender)) +
geom_boxplot()
ggplot2::ggplot(data = data, aes(x=Gender, y=Age, fill=Disease.state)) +
geom_boxplot()
ggplot2::ggplot(data = data, aes(x=Disease.state, y=Age, fill=Tissue)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot2::ggplot(data = data, aes(x=Tissue, y=Age, fill=Disease.state)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot2::ggplot(data = data, aes(x=Disease.state, y=Age, fill=Tissue)) +
geom_boxplot() +
facet_wrap(~Gender) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) #make labels more readable
ggplot2::ggplot(data = data, aes(x=Tissue, y=Age, fill=Disease.state)) +
geom_boxplot() +
facet_wrap(~Gender) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Principal Component Analysis (PCA) is a commonly used technique for constructing linearly uncorrelated features from a set of possibly correlated features. The procedure is arranged so that the first feature produced by PCA, the first principal component (PC1), explains the largest possible amount of variability. In this way, PCA is a dimension-reduction technique, as the first few principal components often explain upwards of 90% of the variability found within a dataset. It is important to note that if we plan to predict anything using the principal components, such as Tissue or Disease.state, those features should not be included in the input matrix. Before performing PCA, create a new data frame containing only explanatory values (i.e. the features we want to use to predict class membership).
Explore how much variation is explained by the principal components. How much variation is explained by the first two principal components? How many principal components might be required to explain 75%, 85%, 90%, 95%, and 99% of the variation within our dataset?
Visually explore this separation by plotting the first two principal components and coloring samples according to Tissue and Disease.state. What effect does adding the third principal component have on sample separation?
Subset the dataset into three disjoint datasets by Tissue. Run PCA on each of the three datasets, plot the first two principal components, and color the points according to Disease.state. Does the separation between disease classes differ meaningfully across the three tissues?
# Create a dataframe containing only the expression data
expression = data[, 11:ncol(data)]
# Run PCA on the expression matrix (prcomp computes all components)
pca = prcomp(x = expression)
PCA is sensitive to the underlying variance of your variables: axes with higher variance dominate the leading components. That may be what you want, but it is common to scale the data first (here, Z-score each gene) and run PCA on the scaled matrix, so that the components reflect sets of correlated features rather than a few high-variance genes.
expression_scaled = scale(expression, center = TRUE, scale = TRUE)
pca_scaled = prcomp(x = expression_scaled)
pca_df <- data.frame(pca$x)
pca_df$tissue <- data$Tissue
ggplot(data = pca_df, aes(x=PC1, y=PC2, col=tissue)) +
geom_point()
What differences do you notice?
pca_df_scaled <- data.frame(pca_scaled$x)
pca_df_scaled$tissue <- data$Tissue
ggplot(data = pca_df_scaled, aes(x=PC1, y=PC2, col=tissue)) +
geom_point()
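To see why scaling changes the PCA, consider a toy illustration (our own synthetic data, not the workshop dataset) in which one feature's variance dwarfs the other's:

```r
set.seed(1)
toy <- data.frame(a = rnorm(200, sd = 100),  # high-variance feature
                  b = rnorm(200, sd = 1))    # low-variance feature

# Unscaled: PC1 is essentially feature 'a' alone
abs(prcomp(toy)$rotation[, "PC1"])

# Scaled: both features contribute equally (|loading| = 1/sqrt(2) each)
abs(prcomp(toy, scale. = TRUE)$rotation[, "PC1"])
```

On the expression matrix the same effect means a handful of high-variance genes can dominate the unscaled components.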
# %Variance = pca$sdev^2/sum(pca$sdev^2)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 31.1430 17.7250 10.96164 8.95247 7.13340 6.24519 5.5880
## Proportion of Variance 0.4691 0.1520 0.05812 0.03876 0.02461 0.01886 0.0151
## Cumulative Proportion 0.4691 0.6211 0.67917 0.71794 0.74255 0.76141 0.7765
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 5.06453 4.66497 4.21519 3.99315 3.8863 3.68249 3.57089
## Proportion of Variance 0.01241 0.01053 0.00859 0.00771 0.0073 0.00656 0.00617
## Cumulative Proportion 0.78892 0.79944 0.80804 0.81575 0.8231 0.82961 0.83578
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 3.30411 3.04443 2.90404 2.83318 2.76900 2.64638 2.58053
## Proportion of Variance 0.00528 0.00448 0.00408 0.00388 0.00371 0.00339 0.00322
## Cumulative Proportion 0.84106 0.84554 0.84962 0.85351 0.85721 0.86060 0.86382
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 2.51111 2.42890 2.3605 2.34306 2.3201 2.25634 2.19149
## Proportion of Variance 0.00305 0.00285 0.0027 0.00266 0.0026 0.00246 0.00232
## Cumulative Proportion 0.86687 0.86973 0.8724 0.87508 0.8777 0.88014 0.88246
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 2.1346 2.06796 2.05302 2.00469 1.95893 1.95034 1.9307
## Proportion of Variance 0.0022 0.00207 0.00204 0.00194 0.00186 0.00184 0.0018
## Cumulative Proportion 0.8847 0.88674 0.88878 0.89072 0.89257 0.89441 0.8962
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 1.90950 1.8752 1.85689 1.82871 1.80152 1.78258 1.75016
## Proportion of Variance 0.00176 0.0017 0.00167 0.00162 0.00157 0.00154 0.00148
## Cumulative Proportion 0.89798 0.8997 0.90135 0.90297 0.90454 0.90607 0.90756
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 1.71103 1.69760 1.67131 1.65049 1.63246 1.62538 1.60549
## Proportion of Variance 0.00142 0.00139 0.00135 0.00132 0.00129 0.00128 0.00125
## Cumulative Proportion 0.90897 0.91036 0.91172 0.91303 0.91432 0.91560 0.91685
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 1.59602 1.56806 1.56287 1.55042 1.53572 1.52097 1.5074
## Proportion of Variance 0.00123 0.00119 0.00118 0.00116 0.00114 0.00112 0.0011
## Cumulative Proportion 0.91808 0.91927 0.92045 0.92161 0.92275 0.92387 0.9250
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 1.48275 1.46927 1.45695 1.44933 1.43392 1.42896 1.42516
## Proportion of Variance 0.00106 0.00104 0.00103 0.00102 0.00099 0.00099 0.00098
## Cumulative Proportion 0.92603 0.92708 0.92810 0.92912 0.93012 0.93110 0.93209
## PC64 PC65 PC66 PC67 PC68 PC69 PC70
## Standard deviation 1.40491 1.40246 1.38423 1.37821 1.37119 1.3633 1.34394
## Proportion of Variance 0.00095 0.00095 0.00093 0.00092 0.00091 0.0009 0.00087
## Cumulative Proportion 0.93304 0.93399 0.93492 0.93584 0.93675 0.9376 0.93852
## PC71 PC72 PC73 PC74 PC75 PC76 PC77
## Standard deviation 1.33711 1.32677 1.32392 1.32180 1.31456 1.30424 1.29733
## Proportion of Variance 0.00086 0.00085 0.00085 0.00085 0.00084 0.00082 0.00081
## Cumulative Proportion 0.93938 0.94023 0.94108 0.94193 0.94276 0.94359 0.94440
## PC78 PC79 PC80 PC81 PC82 PC83 PC84
## Standard deviation 1.2883 1.27183 1.26875 1.26295 1.25464 1.24912 1.24553
## Proportion of Variance 0.0008 0.00078 0.00078 0.00077 0.00076 0.00075 0.00075
## Cumulative Proportion 0.9452 0.94599 0.94676 0.94754 0.94830 0.94905 0.94980
## PC85 PC86 PC87 PC88 PC89 PC90 PC91
## Standard deviation 1.24130 1.23045 1.22805 1.21890 1.21284 1.20951 1.19444
## Proportion of Variance 0.00075 0.00073 0.00073 0.00072 0.00071 0.00071 0.00069
## Cumulative Proportion 0.95055 0.95128 0.95201 0.95273 0.95344 0.95415 0.95484
## PC92 PC93 PC94 PC95 PC96 PC97 PC98
## Standard deviation 1.18888 1.17899 1.17275 1.16642 1.15817 1.15221 1.15149
## Proportion of Variance 0.00068 0.00067 0.00067 0.00066 0.00065 0.00064 0.00064
## Cumulative Proportion 0.95552 0.95619 0.95686 0.95752 0.95816 0.95881 0.95945
## PC99 PC100 PC101 PC102 PC103 PC104 PC105
## Standard deviation 1.14547 1.14403 1.12855 1.1162 1.10521 1.10251 1.09988
## Proportion of Variance 0.00063 0.00063 0.00062 0.0006 0.00059 0.00059 0.00059
## Cumulative Proportion 0.96008 0.96072 0.96133 0.9619 0.96252 0.96311 0.96370
## PC106 PC107 PC108 PC109 PC110 PC111 PC112
## Standard deviation 1.09378 1.08483 1.07870 1.06856 1.06570 1.05990 1.04624
## Proportion of Variance 0.00058 0.00057 0.00056 0.00055 0.00055 0.00054 0.00053
## Cumulative Proportion 0.96428 0.96485 0.96541 0.96596 0.96651 0.96705 0.96758
## PC113 PC114 PC115 PC116 PC117 PC118 PC119
## Standard deviation 1.04352 1.04025 1.03484 1.02957 1.02817 1.0218 1.0198
## Proportion of Variance 0.00053 0.00052 0.00052 0.00051 0.00051 0.0005 0.0005
## Cumulative Proportion 0.96811 0.96863 0.96915 0.96966 0.97017 0.9707 0.9712
## PC120 PC121 PC122 PC123 PC124 PC125 PC126
## Standard deviation 1.00795 1.00706 1.00461 0.99874 0.99496 0.99125 0.98688
## Proportion of Variance 0.00049 0.00049 0.00049 0.00048 0.00048 0.00048 0.00047
## Cumulative Proportion 0.97167 0.97216 0.97265 0.97313 0.97361 0.97409 0.97456
## PC127 PC128 PC129 PC130 PC131 PC132 PC133
## Standard deviation 0.98424 0.98154 0.97062 0.96450 0.96253 0.95858 0.95085
## Proportion of Variance 0.00047 0.00047 0.00046 0.00045 0.00045 0.00044 0.00044
## Cumulative Proportion 0.97503 0.97549 0.97595 0.97640 0.97685 0.97729 0.97773
## PC134 PC135 PC136 PC137 PC138 PC139 PC140
## Standard deviation 0.94990 0.94183 0.93814 0.93163 0.92687 0.92619 0.92064
## Proportion of Variance 0.00044 0.00043 0.00043 0.00042 0.00042 0.00041 0.00041
## Cumulative Proportion 0.97817 0.97860 0.97902 0.97944 0.97986 0.98027 0.98068
## PC141 PC142 PC143 PC144 PC145 PC146 PC147
## Standard deviation 0.91544 0.9123 0.9107 0.90293 0.89885 0.89010 0.88961
## Proportion of Variance 0.00041 0.0004 0.0004 0.00039 0.00039 0.00038 0.00038
## Cumulative Proportion 0.98109 0.9815 0.9819 0.98228 0.98268 0.98306 0.98344
## PC148 PC149 PC150 PC151 PC152 PC153 PC154
## Standard deviation 0.88693 0.88030 0.87788 0.87084 0.86798 0.86714 0.86333
## Proportion of Variance 0.00038 0.00037 0.00037 0.00037 0.00036 0.00036 0.00036
## Cumulative Proportion 0.98382 0.98420 0.98457 0.98494 0.98530 0.98566 0.98602
## PC155 PC156 PC157 PC158 PC159 PC160 PC161
## Standard deviation 0.85981 0.85831 0.85707 0.85198 0.84743 0.84461 0.84008
## Proportion of Variance 0.00036 0.00036 0.00036 0.00035 0.00035 0.00035 0.00034
## Cumulative Proportion 0.98638 0.98674 0.98709 0.98744 0.98779 0.98814 0.98848
## PC162 PC163 PC164 PC165 PC166 PC167 PC168
## Standard deviation 0.83298 0.83015 0.82620 0.82212 0.82121 0.81516 0.81185
## Proportion of Variance 0.00034 0.00033 0.00033 0.00033 0.00033 0.00032 0.00032
## Cumulative Proportion 0.98881 0.98915 0.98948 0.98980 0.99013 0.99045 0.99077
## PC169 PC170 PC171 PC172 PC173 PC174 PC175
## Standard deviation 0.80762 0.80449 0.79959 0.79649 0.7912 0.7840 0.77793
## Proportion of Variance 0.00032 0.00031 0.00031 0.00031 0.0003 0.0003 0.00029
## Cumulative Proportion 0.99109 0.99140 0.99171 0.99202 0.9923 0.9926 0.99291
## PC176 PC177 PC178 PC179 PC180 PC181 PC182
## Standard deviation 0.77641 0.76757 0.76417 0.75853 0.75554 0.75230 0.74754
## Proportion of Variance 0.00029 0.00028 0.00028 0.00028 0.00028 0.00027 0.00027
## Cumulative Proportion 0.99320 0.99348 0.99377 0.99405 0.99432 0.99460 0.99487
## PC183 PC184 PC185 PC186 PC187 PC188 PC189
## Standard deviation 0.74495 0.74070 0.73289 0.73061 0.72514 0.72142 0.71708
## Proportion of Variance 0.00027 0.00027 0.00026 0.00026 0.00025 0.00025 0.00025
## Cumulative Proportion 0.99513 0.99540 0.99566 0.99592 0.99617 0.99642 0.99667
## PC190 PC191 PC192 PC193 PC194 PC195 PC196
## Standard deviation 0.71032 0.70465 0.70260 0.69582 0.68845 0.68546 0.68428
## Proportion of Variance 0.00024 0.00024 0.00024 0.00023 0.00023 0.00023 0.00023
## Cumulative Proportion 0.99692 0.99716 0.99739 0.99763 0.99786 0.99809 0.99831
## PC197 PC198 PC199 PC200 PC201 PC202 PC203
## Standard deviation 0.68017 0.67401 0.66806 0.66574 0.65843 0.65698 0.6439
## Proportion of Variance 0.00022 0.00022 0.00022 0.00021 0.00021 0.00021 0.0002
## Cumulative Proportion 0.99854 0.99876 0.99897 0.99919 0.99940 0.99960 0.9998
## PC204 PC205
## Standard deviation 0.6354 3.885e-14
## Proportion of Variance 0.0002 0.000e+00
## Cumulative Proportion 1.0000 1.000e+00
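Rather than scanning the table above for the 75%, 85%, 90%, 95% and 99% thresholds, the counts can be computed from `pca$sdev` directly. A minimal sketch on simulated data (`pcs_needed` is our own helper, not part of base R):

```r
set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100)  # toy data: 100 samples, 20 features
pca_toy <- prcomp(X)

# Smallest number of components whose cumulative variance reaches the threshold
pcs_needed <- function(pca, threshold) {
  cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  min(which(cumvar >= threshold))
}

sapply(c(0.75, 0.85, 0.90, 0.95, 0.99), function(t) pcs_needed(pca_toy, t))
```

Applying `pcs_needed` to the workshop's `pca` object reproduces the counts implied by the cumulative proportions printed above.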
hippocampus = data[data$Tissue == "hippocampus",]
hippocampus_expression = hippocampus[,11:ncol(hippocampus)]
hippocampus_pca = prcomp(hippocampus_expression)
# Again, you can scale this data
hippocampus_expression_scaled <- scale(hippocampus_expression, center = TRUE, scale = TRUE)
hippocampus_pca_scaled = prcomp(hippocampus_expression_scaled)
hippo_df <- data.frame(hippocampus_pca$x)
hippo_df$disease <- hippocampus$Disease.state
ggplot(data = hippo_df, aes(x=PC1, y=PC2, col=disease)) +
geom_point()
hippo_df_scaled <- data.frame(hippocampus_pca_scaled$x)
hippo_df_scaled$disease <- hippocampus$Disease.state
ggplot(data = hippo_df_scaled, aes(x=PC1, y=PC2, col=disease)) +
geom_point()
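The hippocampus steps above can be repeated for the other two tissues; one compact pattern is to split the rows by the factor and map `prcomp` over the pieces. Sketched on the built-in `iris` data (our example, not the workshop data):

```r
# Split rows by a grouping factor, then run PCA within each subset
pca_by_group <- lapply(split(iris[, 1:4], iris$Species), prcomp)

# Variance explained by PC1 within each group
sapply(pca_by_group, function(p)
  summary(p)$importance["Proportion of Variance", "PC1"])
```

For the workshop data the analogous call is `lapply(split(expression, data$Tissue), prcomp)`.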
gene_var <- apply(expression, 2, var)
gene_var.df <- data.frame(gene_var, row.names = colnames(expression))
quantile(gene_var.df$gene_var, seq(0,1,0.01))
## 0% 1% 2% 3% 4% 5%
## 0.003443167 0.008198729 0.009996457 0.011536157 0.012737273 0.014073209
## 6% 7% 8% 9% 10% 11%
## 0.015310911 0.016310119 0.017252163 0.018285436 0.019408559 0.020242797
## 12% 13% 14% 15% 16% 17%
## 0.021190282 0.022189200 0.023287995 0.024284778 0.025193448 0.026089868
## 18% 19% 20% 21% 22% 23%
## 0.027014907 0.027755842 0.028600421 0.029427090 0.030305498 0.031327696
## 24% 25% 26% 27% 28% 29%
## 0.032319858 0.033179578 0.034150538 0.035252867 0.036307698 0.037331555
## 30% 31% 32% 33% 34% 35%
## 0.038269428 0.039352798 0.040538026 0.041528429 0.042600559 0.043664474
## 36% 37% 38% 39% 40% 41%
## 0.044861846 0.046224239 0.047359208 0.048824358 0.050252854 0.051482144
## 42% 43% 44% 45% 46% 47%
## 0.052943493 0.054264364 0.055926396 0.057417725 0.058990966 0.060396053
## 48% 49% 50% 51% 52% 53%
## 0.061926761 0.063630234 0.065512239 0.067218686 0.069118934 0.071253727
## 54% 55% 56% 57% 58% 59%
## 0.073321413 0.075444391 0.077624273 0.079672002 0.082176119 0.084713629
## 60% 61% 62% 63% 64% 65%
## 0.086731843 0.089482671 0.092319934 0.095209055 0.097970651 0.100753764
## 66% 67% 68% 69% 70% 71%
## 0.104380537 0.107830547 0.111225981 0.115201508 0.118736925 0.123569616
## 72% 73% 74% 75% 76% 77%
## 0.127937053 0.132374989 0.137732878 0.143699950 0.149183492 0.154869431
## 78% 79% 80% 81% 82% 83%
## 0.161159396 0.168623480 0.176822785 0.185243761 0.193750439 0.203091844
## 84% 85% 86% 87% 88% 89%
## 0.214160103 0.225337213 0.236300704 0.250325720 0.267717735 0.287610360
## 90% 91% 92% 93% 94% 95%
## 0.308177495 0.335497075 0.362922326 0.400941579 0.447427531 0.501203656
## 96% 97% 98% 99% 100%
## 0.587346937 0.706769047 0.926959700 1.504119077 11.146195059
# ~2/3 of features have a variance less than 0.1 (this threshold is arbitrary)
expression_highvar <- expression[,which(gene_var.df$gene_var > 0.1)]
dim(expression_highvar) # 205 x 4853
## [1] 205 4853
Here, we perform univariate feature selection by choosing features with lowest p-values in an ANOVA (here, across tissue type).
expression_tissue <- expression
expression_tissue$tissue <- factor(data$Tissue)
allModelsList <- lapply(paste(colnames(expression), "~ tissue"), as.formula)
allModelsResults <- lapply(allModelsList, function(x) lm(x, data = expression_tissue))
allModelsANOVA <- lapply(allModelsResults, anova)
allModelsP <- lapply(allModelsANOVA, function(x) x[["Pr(>F)"]][1])
names(allModelsP) <- colnames(expression)
allModelsPSort <- sort(unlist(allModelsP))
# Top 10 features
head(allModelsPSort, n = 10)
## ADORA2A SPECC1L.ADORA2A NRN1 ANKRD11 RGS9
## 3.771409e-176 3.771409e-176 4.688685e-172 1.832503e-171 3.914261e-170
## NPTXR ISL1 EFR3A NUAK1 KLF13
## 8.023423e-168 5.016608e-167 1.317526e-163 3.526544e-161 1.426829e-159
# Plot top feature
ggplot(expression_tissue, aes(x = ADORA2A, y = tissue)) + geom_boxplot() + coord_flip()
Unsupervised learning can be thought of as applying an algorithm to a dataset in order to discover latent structure that exists between samples. We’ve already been exposed to some of these algorithms via PCA. However, one of the most common techniques in machine learning, and especially bioinformatics, is clustering. In this section, the goal is to cluster the data using the k-means algorithm.
We start by clustering the data into 3 clusters. Hint: use the kmeans function in R.
set.seed(123)
km <- kmeans(expression, centers = 3)
table(km$cluster)
##
## 1 2 3
## 68 69 68
Do the k-means clusters correspond to the tissues from the earlier PCA plot?
pca_df$cluster <- factor(km$cluster)
ggplot(data = pca_df, aes(x=PC1, y=PC2, col=cluster)) +
geom_point() +
ggtitle("K-means clustering")
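The PCA overlay gives a visual answer; a cross-tabulation of cluster assignments against a known factor gives a numeric one. The same idea on the built-in `iris` data (our example):

```r
set.seed(123)
km_iris <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)

# Rows: cluster assignments; columns: true species labels
table(cluster = km_iris$cluster, species = iris$Species)
```

For the workshop data, `table(km$cluster, data$Tissue)` shows how well the three clusters track the three tissues.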
Supervised learning is a technique to teach an algorithm to distinguish between previously labelled groups, such as Tissue, Gender, or Disease.state. All supervised methods need labelled data from which to learn how to differentiate between classes, so it is necessary to separate the data into train and test sets. The training set is used to fit the model, while the test set is used to evaluate performance. Cross-validation, a method of partitioning the data into disjoint subsets and repeatedly re-training and re-testing with different partition combinations, is often used to evaluate models. In this section, we will build classifiers using logistic regression to predict different classes from our data. You should evaluate your models' performances using confusion matrices and accuracy scores.
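Before reaching for caret, the mechanics of k-fold cross-validation can be sketched in a few lines of base R (synthetic data; `make_folds` is our own helper):

```r
set.seed(123)
n <- 100
x <- rnorm(n); y <- 2 * x + rnorm(n)

# Partition sample indices into k roughly equal, disjoint folds
make_folds <- function(n, k) split(sample(n), rep(1:k, length.out = n))
folds <- make_folds(n, 5)

# Each fold serves once as the held-out test set
cv_rmse <- sapply(folds, function(test_idx) {
  fit <- lm(y ~ x, subset = setdiff(seq_len(n), test_idx))
  sqrt(mean((y[test_idx] - predict(fit, data.frame(x = x[test_idx])))^2))
})
mean(cv_rmse)  # average held-out error across the 5 folds
```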
Many functions below are from the caret package. See https://topepo.github.io/caret/ for helpful instructions.
library(caret)  # createDataPartition, preProcess, trainControl, train
expression_disease <- expression
expression_disease$Disease <- data$Disease.state
set.seed(123)
trainIndex <- createDataPartition(expression_disease$Disease, p = .75,
list = FALSE,
times = 1)
disease_train <- expression_disease[trainIndex,]
disease_test <- expression_disease[-trainIndex,]
dim(disease_train)
## [1] 155 13769
dim(disease_test)
## [1] 50 13769
prop.table(table(disease_train$Disease))
##
## control Bipolar disorder Major depression disorder
## 0.2709677 0.2516129 0.2451613
## schizophrenia
## 0.2322581
prop.table(table(disease_test$Disease))
##
## control Bipolar disorder Major depression disorder
## 0.26 0.26 0.24
## schizophrenia
## 0.24
Features should be scaled in supervised learning. Importantly, you estimate the scaling parameters (mean, standard deviation) on the training data only and apply those same parameters to the test data. Allowing the test data to influence training (e.g. by letting it influence the scaling parameters) is called data leakage and will artificially inflate your results.
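What `preProcess` does can be sketched by hand: estimate the mean and standard deviation on the training rows only, then apply those same parameters to the test rows (synthetic data, our own illustration):

```r
set.seed(1)
train_x <- matrix(rnorm(80 * 5, mean = 10), nrow = 80)
test_x  <- matrix(rnorm(20 * 5, mean = 10), nrow = 20)

mu   <- colMeans(train_x)        # estimated from the training set only
sdev <- apply(train_x, 2, sd)

train_scaled <- sweep(sweep(train_x, 2, mu, "-"), 2, sdev, "/")
test_scaled  <- sweep(sweep(test_x,  2, mu, "-"), 2, sdev, "/")
# test_scaled columns need not have mean 0 / sd 1 -- that is expected,
# because the test set must not influence the scaling parameters
```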
preProcValues <- preProcess(disease_train, method = c("center", "scale"))
disease_train_scaled <- predict(preProcValues, disease_train)
disease_test_scaled <- predict(preProcValues, disease_test)
diseaseModels <- lapply(paste(colnames(expression), "~ Disease"), as.formula)
diseaseModelsResults <- lapply(diseaseModels, function(x) lm(x, data = disease_train_scaled))
diseaseModelsANOVA <- lapply(diseaseModelsResults, anova)
diseaseModelsP <- lapply(diseaseModelsANOVA, function(x) x[["Pr(>F)"]][1])
names(diseaseModelsP) <- colnames(expression)
diseaseModelsPSort <- sort(unlist(diseaseModelsP))
# This is the Bonferroni correction for 13768 tests at p < 0.05 significance
length(which(diseaseModelsPSort < 0.05/13768)) # 32 genes pass significance
## [1] 32
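Bonferroni is the most conservative option; base R's `p.adjust` also offers the less stringent Benjamini-Hochberg (FDR) correction. A sketch on simulated p-values (ours, not the workshop's):

```r
set.seed(1)
pvals <- c(runif(50, 0, 1e-6), runif(950))  # 50 strong signals among 1000 tests

n_bonf <- sum(p.adjust(pvals, method = "bonferroni") < 0.05)
n_bh   <- sum(p.adjust(pvals, method = "BH") < 0.05)
c(bonferroni = n_bonf, BH = n_bh)  # BH retains at least as many tests
```

For the workshop's tests, `sum(p.adjust(diseaseModelsPSort, method = "BH") < 0.05)` gives the FDR-controlled count.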
# Subset to most discriminative features
genes_discriminative_disease <- c(head(names(diseaseModelsPSort), n = 32), "Disease")
train_subset <- disease_train_scaled[, genes_discriminative_disease]
test_subset <- disease_test_scaled[, genes_discriminative_disease]
NB: This requires that the "nnet" package be installed.
disease_log_mod = train(
Disease ~ .,
data = train_subset,
method = "multinom",
trControl = trainControl(method = "cv", number = 5),
# tuneGrid = multinom_grid
trace = FALSE
)
test_classes <- predict(disease_log_mod, newdata = test_subset)
test_table <- table(test_subset$Disease, test_classes)
# Evaluate on the held-out test set; compare with the cross-validation accuracy in disease_log_mod$results to judge whether this (admittedly weak) model is stable
test_table
## test_classes
## control Bipolar disorder Major depression disorder
## control 7 2 1
## Bipolar disorder 1 8 2
## Major depression disorder 5 4 2
## schizophrenia 1 3 0
## test_classes
## schizophrenia
## control 3
## Bipolar disorder 2
## Major depression disorder 1
## schizophrenia 8
sum(diag(test_table))/sum(test_table)
## [1] 0.5
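Overall accuracy hides per-class behaviour; class-wise recall can be read from the rows of the confusion matrix. Re-entering the matrix printed above:

```r
classes <- c("control", "Bipolar disorder", "Major depression disorder", "schizophrenia")
cm <- matrix(c(7, 2, 1, 3,
               1, 8, 2, 2,
               5, 4, 2, 1,
               1, 3, 0, 8),
             nrow = 4, byrow = TRUE,
             dimnames = list(truth = classes, predicted = classes))

accuracy <- sum(diag(cm)) / sum(cm)  # 0.5, matching the result above
recall   <- diag(cm) / rowSums(cm)   # per-class sensitivity
round(recall, 2)                     # e.g. MDD is recovered poorly (2/12)
```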