curatedTBData is an R package that provides standardized, curated tuberculosis (TB) transcriptomic studies. The initial release of the package contains 49 studies. The package lets users access TB transcriptomic data efficiently and compare TB gene signatures across multiple datasets.
curatedTBData 2.2.0
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("curatedTBData")
library(curatedTBData)
library(dplyr)
library(SummarizedExperiment)
library(sva)
The DataSummary table summarizes the available studies and
their metadata contained in the curatedTBData package.
The table helps users query available datasets quickly.
# Remove Country/Region, Age, DiagnosisMethod, Notes, Tissue, HIVStatus for concise display
data("DataSummary", package = "curatedTBData")
DataSummary |>
dplyr::select(-c(`Country/Region`, Age, DiagnosisMethod, Notes,
Tissue, HIVStatus)) |>
DT::datatable()
Users can call the curatedTBData() function to access data.
The function takes three arguments.
The first argument, study_name, is a character vector naming the studies to retrieve.
All available names are listed in DataSummary$Study.
The second argument, dry.run, lets users check which resources are available before downloading them.
When dry.run is TRUE, the function only lists the names of the resources.
The third argument, curated.only, controls how much is downloaded.
If curated.only is TRUE, the function downloads only the curated gene expression profile and the clinical annotation of the requested studies.
If curated.only is FALSE, the function downloads all available resources for the requested studies.
curatedTBData("GSE19439", dry.run = TRUE, curated.only = FALSE)
## dry.run = TRUE, listing dataset(s) to be downloaded
## Set dry.run = FALSE to download dataset(s).
## Will download the following resources for GSE19439 from the ExperimentHub:
## GSE19439_assay_raw
## GSE19439_assay_curated
## GSE19439_column_data
## GSE19439_row_data
## GSE19439_meta_data
To download the complete data for a microarray study (e.g. GSE19439),
we set dry.run = FALSE and curated.only = FALSE.
The output for a microarray study contains two experiments.
The first experiment is assay_curated, a matrix that holds the
normalized, curated version of the gene expression profile.
The second experiment is object_raw, a SummarizedExperiment object that contains
the raw gene expression profile and information about probe features.
GSE19439 <- curatedTBData("GSE19439", dry.run = FALSE, curated.only = FALSE)
## Downloading: GSE19439
## Finished!
GSE19439
## $GSE19439
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] assay_curated: matrix with 25417 rows and 42 columns
## [2] object_raw: SummarizedExperiment with 48803 rows and 42 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
When accessing data for RNA sequencing studies, additional assays with the prefix
assay_reprocess are included (assay_reprocess_hg19 and assay_reprocess_hg38 in the output below).
Each is a matrix that holds a version of the gene expression profile
reprocessed from the raw .fastq files using Rsubread.
GSE79362 <- curatedTBData("GSE79362", dry.run = FALSE, curated.only = FALSE)
## Downloading: GSE79362
## Finished!
GSE79362
## $GSE79362
## A MultiAssayExperiment object of 4 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 4:
## [1] assay_curated: matrix with 13419 rows and 355 columns
## [2] assay_reprocess_hg19: matrix with 25369 rows and 355 columns
## [3] assay_reprocess_hg38: matrix with 60642 rows and 355 columns
## [4] object_raw: SummarizedExperiment with 134866 rows and 355 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
A list of MultiAssayExperiment objects is returned when dry.run is FALSE.
To save running time, the following example sets curated.only = TRUE
and selects five studies that belong to GSE19491
in the Gene Expression Omnibus.
myGEO <- c("GSE19435", "GSE19439", "GSE19442", "GSE19444", "GSE22098")
object_list <- curatedTBData(myGEO, dry.run = FALSE, curated.only = TRUE)
## curated.only = TRUE. Download curated version.
## Set curated.only = FALSE if want to download both raw and curated data.
## Downloading: GSE19435
## Downloading: GSE19439
## Downloading: GSE19442
## Downloading: GSE19444
## Downloading: GSE22098
## Finished!
object_list[1:2]
## $GSE19435
## A MultiAssayExperiment object of 1 listed
## experiment with a user-defined name and respective class.
## Containing an ExperimentList class object of length 1:
## [1] assay_curated: matrix with 25417 rows and 33 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
##
## $GSE19439
## A MultiAssayExperiment object of 1 listed
## experiment with a user-defined name and respective class.
## Containing an ExperimentList class object of length 1:
## [1] assay_curated: matrix with 25417 rows and 42 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
A full version example for RNA sequencing data: GSE79362.
GSE79362 <- curatedTBData("GSE79362", dry.run = FALSE, curated.only = FALSE)
## Downloading: GSE79362
## Finished!
GSE79362
## $GSE79362
## A MultiAssayExperiment object of 4 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 4:
## [1] assay_curated: matrix with 13419 rows and 355 columns
## [2] assay_reprocess_hg19: matrix with 25369 rows and 355 columns
## [3] assay_reprocess_hg38: matrix with 60642 rows and 355 columns
## [4] object_raw: SummarizedExperiment with 134866 rows and 355 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
The major advantage of using MultiAssayExperiment
is that metadata and assays stay coordinated when subsetting.
A MultiAssayExperiment object has built-in support for subsetting samples
based on a condition on the column data.
The following code shows how to select only samples with active TB (PTB).
GSE19439 <- object_list$GSE19439
GSE19439[, GSE19439$TBStatus == "PTB", "assay_curated"] # 13 samples
## A MultiAssayExperiment object of 1 listed
## experiment with a user-defined name and respective class.
## Containing an ExperimentList class object of length 1:
## [1] assay_curated: matrix with 25417 rows and 13 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
The following example shows how to subset patients with active TB or LTBI.
GSE19439[, GSE19439$TBStatus %in% c("PTB", "LTBI"), "assay_curated"] # 30 samples
## A MultiAssayExperiment object of 1 listed
## experiment with a user-defined name and respective class.
## Containing an ExperimentList class object of length 1:
## [1] assay_curated: matrix with 25417 rows and 30 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
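The bracket operator of a MultiAssayExperiment takes up to three indices, x[i, j, k], which select features (rows), samples (columns), and experiments respectively. The sketch below builds a small toy object to show how the position of an index changes the result. The gene names, sample names, and values are made up for illustration and are not part of curatedTBData. Note in particular that a single character string passed on its own is interpreted as a row selector, not as an experiment name.

```r
library(MultiAssayExperiment)

# Toy expression matrix: 3 genes x 4 samples (values are arbitrary)
toy_mat <- matrix(seq_len(12), nrow = 3,
                  dimnames = list(c("geneA", "geneB", "geneC"),
                                  paste0("S", 1:4)))

# Toy clinical annotation; row names must match the assay column names
toy_col <- S4Vectors::DataFrame(
    TBStatus = c("PTB", "LTBI", "PTB", "Control"),
    row.names = paste0("S", 1:4))

toy_mae <- MultiAssayExperiment(
    experiments = list(assay_curated = toy_mat),
    colData = toy_col)

# Second index (j) selects samples; third index (k) selects the experiment
ptb_only <- toy_mae[, toy_mae$TBStatus == "PTB", "assay_curated"]
dim(ptb_only[["assay_curated"]])

# A lone string is treated as a row (feature) selector
by_row <- toy_mae["geneA"]
dim(by_row[["assay_curated"]])
```

Using the double bracket, as in toy_mae[["assay_curated"]], extracts the experiment itself rather than returning a MultiAssayExperiment.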
The combine_objects() function provides an easy way to
combine different studies based on common gene symbols.
The function returns a SummarizedExperiment object that contains
the merged assay and the associated clinical annotation. Note that GSE74092 is usually
excluded from merging because it used quantitative PCR,
which does not have enough coverage to capture all genes.
Two arguments of combine_objects() are described here.
The first is object_list, a list of
MultiAssayExperiment objects obtained from curatedTBData().
names(object_list) must not be NULL and must be unique for each object in the list,
so that the original study can be tracked after merging.
The second is experiment_name, a string or a vector of strings giving the name of the assay to take from each object.
The calls below also set update_genes = TRUE, which updates the gene symbols, as the message printed by the function indicates.
GSE19491 <- combine_objects(object_list, experiment_name = "assay_curated",
update_genes = TRUE)
## "update_genes" is TRUE, updating gene symbols
GSE19491
## class: SummarizedExperiment
## dim: 25065 454
## metadata(0):
## assays(1): assay1
## rownames(25065): A1BG A1CF ... ZZZ3 dJ341D10.1
## rowData names(0):
## colnames(454): GSM484368 GSM484369 ... GSM550399 GSM550400
## colData names(32): Age Gender ... StaphStatus StrepStatus
After the objects are merged, the study each sample came from is recorded
in the Study column of the column data.
unique(GSE19491$Study)
## [1] "GSE19435" "GSE19439" "GSE19442" "GSE19444" "GSE22098"
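To make the idea of merging on common gene symbols concrete, here is a minimal base R sketch on two toy matrices. It is not the implementation of combine_objects(); the study names, sample names, and values are invented, and only the general pattern (intersect the row names, bind the columns, record the study of origin) is illustrated.

```r
# Two toy expression matrices with partially overlapping gene symbols
study1 <- matrix(1:6, nrow = 3,
                 dimnames = list(c("A1BG", "A1CF", "ZZZ3"), c("s1", "s2")))
study2 <- matrix(7:12, nrow = 3,
                 dimnames = list(c("A1CF", "ZZEF1", "ZZZ3"), c("s3", "s4")))
toy_list <- list(StudyA = study1, StudyB = study2)

# Names must be present and unique so each sample can be traced to its study
stopifnot(!is.null(names(toy_list)), !anyDuplicated(names(toy_list)))

# Keep only the gene symbols shared by every study
common_genes <- Reduce(intersect, lapply(toy_list, rownames))

# Column-bind the studies, restricted to the shared genes
merged <- do.call(cbind, lapply(toy_list, function(m) {
    m[common_genes, , drop = FALSE]
}))

# Record which study each merged column came from
study_tag <- rep(names(toy_list), times = vapply(toy_list, ncol, integer(1)))

merged
table(study_tag)
```

Genes present in only one study (A1BG and ZZEF1 here) are dropped, which is why a study with low gene coverage can shrink the merged assay considerably.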
It is also possible to combine a list of objects using a
different experiment name for each object. In this case, experiment_name
is a vector of strings, one for each object in the input list.
exp <- combine_objects(c(GSE79362[1], object_list[1]),
experiment_name = c("assay_reprocess_hg19",
"assay_curated"),
update_genes = TRUE)
## Found more than one "experiment_name".
## "update_genes" is TRUE, updating gene symbols
exp
## class: SummarizedExperiment
## dim: 18925 388
## metadata(0):
## assays(1): assay1
## rownames(18925): A1BG A1CF ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(388): GSM2092905 GSM2092906 ... GSM484399 GSM484400
## colData names(31): Age Gender ... DiabetesStatus Treatment
When datasets are merged, batch effects between studies are very likely,
and removing them is typically recommended.
We use the ComBat() function from sva to remove
potential batch effects between studies.
In the following example, each study is treated as one batch. The batch-corrected
assay is stored in the same SummarizedExperiment object.
batch1 <- colData(GSE19491)$Study
combat_edata1 <- sva::ComBat(dat = assay(GSE19491), batch = batch1)
assays(GSE19491)[["Batch_corrected_assay"]] <- combat_edata1
GSE19491
## class: SummarizedExperiment
## dim: 25065 454
## metadata(0):
## assays(2): assay1 Batch_corrected_assay
## rownames(25065): A1BG A1CF ... ZZZ3 dJ341D10.1
## rowData names(0):
## colnames(454): GSM484368 GSM484369 ... GSM550399 GSM550400
## colData names(32): Age Gender ... StaphStatus StrepStatus
The subset_curatedTBData() function subsets a MultiAssayExperiment or
SummarizedExperiment object so that the output contains
exactly the conditions given by annotationCondition.
Applied over a list with lapply(), it lets users quickly extract the desired samples
from curatedTBData without inspecting each object individually.
The function takes four arguments.
theObject is a MultiAssayExperiment or SummarizedExperiment object.
annotationColName is a character string giving the
column name in the metadata.
annotationCondition is a character string or
character vector giving the conditions that users intend to select.
The calls below also pass assayName, which names the experiment to use ("assay_curated" here).
In the following example, we call subset_curatedTBData() to
subset samples with active TB (PTB) and latent TB infection (LTBI) for
binary classification.
multi_set_PTB_LTBI <- lapply(object_list, function(x)
subset_curatedTBData(x, annotationColName = "TBStatus",
annotationCondition = c("LTBI", "PTB"),
assayName = "assay_curated"))
# Remove NULL from the list
multi_set_PTB_LTBI <- multi_set_PTB_LTBI[!sapply(multi_set_PTB_LTBI, is.null)]
multi_set_PTB_LTBI[1:3]
## $GSE19439
## class: SummarizedExperiment
## dim: 25417 30
## metadata(0):
## assays(1): assay1
## rownames(25417): 7A5 A1BG ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(30): GSM484448 GSM484449 ... GSM484488 GSM484489
## colData names(24): Age Gender ... DiabetesStatus QFT_GIT
##
## $GSE19442
## class: SummarizedExperiment
## dim: 25417 51
## metadata(0):
## assays(1): assay1
## rownames(25417): 7A5 A1BG ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(51): GSM484500 GSM484501 ... GSM484549 GSM484550
## colData names(23): Age Gender ... isolate_sensitivity DiabetesStatus
##
## $GSE19444
## class: SummarizedExperiment
## dim: 25417 42
## metadata(0):
## assays(1): assay1
## rownames(25417): 7A5 A1BG ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(42): GSM484595 GSM484596 ... GSM484645 GSM484646
## colData names(24): Age Gender ... DiabetesStatus QFT_GIT
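In the output above, GSE19435 no longer appears in the list, which is consistent with subset_curatedTBData() returning NULL for a study that does not contain every requested condition. The sketch below mimics that behaviour on plain data frames. The helper subset_exact() and the toy studies are hypothetical and only illustrate the idea; they are not the package's implementation.

```r
# Hypothetical helper, not part of curatedTBData: keep rows whose `col`
# value is in `conditions`, but only if every requested condition occurs
subset_exact <- function(annotation, col, conditions) {
    if (!all(conditions %in% annotation[[col]])) {
        return(NULL)
    }
    annotation[annotation[[col]] %in% conditions, , drop = FALSE]
}

# Toy annotation tables: StudyA has no LTBI samples, StudyB has both
toy_studies <- list(
    StudyA = data.frame(TBStatus = c("PTB", "Control", "PTB")),
    StudyB = data.frame(TBStatus = c("PTB", "LTBI", "Control"))
)

res <- lapply(toy_studies, subset_exact,
              col = "TBStatus", conditions = c("LTBI", "PTB"))

# Drop studies for which nothing was returned
res <- res[!vapply(res, is.null, logical(1))]
names(res)
```

Using vapply() with logical(1) rather than sapply() guarantees a logical vector even when the input list is empty.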
The HIV status (HIVStatus) and diabetes status (DiabetesStatus) of each
subject are also recorded for each study in curatedTBData.
In the following example, we select HIV-negative subjects from the input.
Users can also find the HIV status information for each study
in the HIVStatus column of DataSummary.
When the length of annotationCondition equals 1, we can subset using
either the MultiAssayExperiment built-in procedure or subset_curatedTBData().
multi_set_HIV <- lapply(object_list, function(x)
subset_curatedTBData(x, annotationColName = "HIVStatus",
annotationCondition = "Negative",
assayName = "assay_curated"))
# Remove NULL from the list
multi_set_HIV <- multi_set_HIV[!vapply(multi_set_HIV, is.null, TRUE)]
multi_set_HIV[1:3]
## $GSE19435
## class: SummarizedExperiment
## dim: 25417 33
## metadata(0):
## assays(1): assay1
## rownames(25417): 7A5 A1BG ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(33): GSM484368 GSM484369 ... GSM484399 GSM484400
## colData names(24): Age Gender ... DiabetesStatus Treatment
##
## $GSE19439
## class: SummarizedExperiment
## dim: 25417 42
## metadata(0):
## assays(1): assay1
## rownames(25417): 7A5 A1BG ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(42): GSM484448 GSM484449 ... GSM484488 GSM484489
## colData names(24): Age Gender ... DiabetesStatus QFT_GIT
##
## $GSE19442
## class: SummarizedExperiment
## dim: 25417 51
## metadata(0):
## assays(1): assay1
## rownames(25417): 7A5 A1BG ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(51): GSM484500 GSM484501 ... GSM484549 GSM484550
## colData names(23): Age Gender ... isolate_sensitivity DiabetesStatus
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] sva_3.54.0 BiocParallel_1.40.0
## [3] genefilter_1.88.0 mgcv_1.9-1
## [5] nlme_3.1-166 SummarizedExperiment_1.36.0
## [7] Biobase_2.66.0 GenomicRanges_1.58.0
## [9] GenomeInfoDb_1.42.0 IRanges_2.40.0
## [11] S4Vectors_0.44.0 BiocGenerics_0.52.0
## [13] MatrixGenerics_1.18.0 matrixStats_1.4.1
## [15] dplyr_1.1.4 curatedTBData_2.2.0
## [17] BiocStyle_2.34.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 blob_1.2.4
## [3] filelock_1.0.3 Biostrings_2.74.0
## [5] fastmap_1.2.0 BiocFileCache_2.14.0
## [7] XML_3.99-0.17 digest_0.6.37
## [9] mime_0.12 lifecycle_1.0.4
## [11] statmod_1.5.0 survival_3.7-0
## [13] KEGGREST_1.46.0 RSQLite_2.3.7
## [15] magrittr_2.0.3 compiler_4.4.1
## [17] rlang_1.1.4 sass_0.4.9
## [19] tools_4.4.1 utf8_1.2.4
## [21] yaml_2.3.10 data.table_1.16.2
## [23] knitr_1.48 htmlwidgets_1.6.4
## [25] S4Arrays_1.6.0 bit_4.5.0
## [27] curl_5.2.3 splitstackshape_1.4.8
## [29] DelayedArray_0.32.0 abind_1.4-8
## [31] purrr_1.0.2 withr_3.0.2
## [33] grid_4.4.1 fansi_1.0.6
## [35] ExperimentHub_2.14.0 xtable_1.8-4
## [37] edgeR_4.4.0 MultiAssayExperiment_1.32.0
## [39] cli_3.6.3 rmarkdown_2.28
## [41] crayon_1.5.3 generics_0.1.3
## [43] httr_1.4.7 BiocBaseUtils_1.8.0
## [45] DBI_1.2.3 cachem_1.1.0
## [47] zlibbioc_1.52.0 splines_4.4.1
## [49] parallel_4.4.1 AnnotationDbi_1.68.0
## [51] BiocManager_1.30.25 XVector_0.46.0
## [53] vctrs_0.6.5 Matrix_1.7-1
## [55] jsonlite_1.8.9 bookdown_0.41
## [57] bit64_4.5.2 crosstalk_1.2.1
## [59] locfit_1.5-9.10 limma_3.62.0
## [61] jquerylib_0.1.4 annotate_1.84.0
## [63] glue_1.8.0 codetools_0.2-20
## [65] DT_0.33 BiocVersion_3.20.0
## [67] UCSC.utils_1.2.0 tibble_3.2.1
## [69] pillar_1.9.0 rappdirs_0.3.3
## [71] htmltools_0.5.8.1 GenomeInfoDbData_1.2.13
## [73] R6_2.5.1 dbplyr_2.5.0
## [75] evaluate_1.0.1 lattice_0.22-6
## [77] AnnotationHub_3.14.0 png_0.1-8
## [79] memoise_2.0.1 bslib_0.8.0
## [81] SparseArray_1.6.0 HGNChelper_0.8.14
## [83] xfun_0.48 pkgconfig_2.0.3