Abstract
Describes the background of the package, important functions defined in the package and some of the applications and usages.
This vignette explains how to work with sets using this package. The package provides a class to store the information efficiently and functions to work with it.
To create a TidySet
object, to store associations
between elements and sets image we have several genes associated with a
characteristic.
library("BaseSet")
gene_lists <- list(
geneset1 = c("A", "B"),
geneset2 = c("B", "C", "D")
)
tidy_set <- tidySet(gene_lists)
tidy_set
#> elements sets fuzzy
#> 1 A geneset1 1
#> 2 B geneset1 1
#> 3 B geneset2 1
#> 4 C geneset2 1
#> 5 D geneset2 1
This is then stored internally in three slots
relations()
, elements()
, and
sets()
slots.
If you have more information for each element or set it can be added:
gene_data <- data.frame(
stat1 = c( 1, 2, 3, 4 ),
info1 = c("a", "b", "c", "d")
)
tidy_set <- add_column(tidy_set, "elements", gene_data)
set_data <- data.frame(
Group = c( 100 , 200 ),
Column = c("abc", "def")
)
tidy_set <- add_column(tidy_set, "sets", set_data)
tidy_set
#> elements sets fuzzy Group Column stat1 info1
#> 1 A geneset1 1 100 abc 1 a
#> 2 B geneset1 1 100 abc 2 b
#> 3 B geneset2 1 200 def 2 b
#> 4 C geneset2 1 200 def 3 c
#> 5 D geneset2 1 200 def 4 d
This data is stored in one of the three slots, which can be directly accessed using their getter methods:
relations(tidy_set)
#> elements sets fuzzy
#> 1 A geneset1 1
#> 2 B geneset1 1
#> 3 B geneset2 1
#> 4 C geneset2 1
#> 5 D geneset2 1
elements(tidy_set)
#> elements stat1 info1
#> 1 A 1 a
#> 2 B 2 b
#> 3 C 3 c
#> 4 D 4 d
sets(tidy_set)
#> sets Group Column
#> 1 geneset1 100 abc
#> 2 geneset2 200 def
You can add as much information as you want, with the only
restriction for a “fuzzy” column for the relations()
. See
the Fuzzy sets vignette:
vignette("Fuzzy sets", "BaseSet")
.
You can also use the standard R approach with [
:
gene_data <- data.frame(
stat2 = c( 4, 4, 3, 5 ),
info2 = c("a", "b", "c", "d")
)
tidy_set$info1 <- NULL
tidy_set[, "elements", c("stat2", "info2")] <- gene_data
tidy_set[, "sets", "Group"] <- c("low", "high")
tidy_set
#> elements sets fuzzy Group Column stat1 stat2 info2
#> 1 A geneset1 1 low abc 1 4 a
#> 2 B geneset1 1 low abc 2 4 b
#> 3 B geneset2 1 high def 2 4 b
#> 4 C geneset2 1 high def 3 3 c
#> 5 D geneset2 1 high def 4 5 d
Observe that one can add, replace or delete
As you can see it is possible to create a TidySet from a list. More commonly you can create it from a data.frame:
relations <- data.frame(elements = c("a", "b", "c", "d", "e", "f"),
sets = c("A", "A", "A", "A", "A", "B"),
fuzzy = c(1, 1, 1, 1, 1, 1))
TS <- tidySet(relations)
TS
#> elements sets fuzzy
#> 1 a A 1
#> 2 b A 1
#> 3 c A 1
#> 4 d A 1
#> 5 e A 1
#> 6 f B 1
It is also possible from a matrix:
m <- matrix(c(0, 0, 1, 1, 1, 1, 0, 1, 0), ncol = 3, nrow = 3,
dimnames = list(letters[1:3], LETTERS[1:3]))
m
#> A B C
#> a 0 1 0
#> b 0 1 1
#> c 1 1 0
tidy_set <- tidySet(m)
tidy_set
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
Or they can be created from a GeneSet and GeneSetCollection objects.
Additionally it has several function to read files related to sets like
the OBO files (getOBO
) and GAF (getGAF
)
It is possible to extract the gene sets as a list
, for
use with functions such as lapply
.
as.list(tidy_set)
#> $A
#> c
#> 1
#>
#> $B
#> a b c
#> 1 1 1
#>
#> $C
#> b
#> 1
Or if you need to apply some network methods and you need a matrix,
you can create it with incidence
:
incidence(tidy_set)
#> A B C
#> c 1 1 0
#> a 0 1 0
#> b 0 1 1
To work with sets several methods are provided. In general you can
provide a new name for the resulting set of the operation, but if you
don’t one will be automatically provided using naming()
.
All methods work with fuzzy and non-fuzzy sets
You can make a union of two sets present on the same object.
BaseSet::union(tidy_set, sets = c("C", "B"), name = "D")
#> elements sets fuzzy
#> 1 a D 1
#> 2 b D 1
#> 3 c D 1
intersection(tidy_set, sets = c("A", "B"), name = "D", keep = TRUE)
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
#> 6 c D 1
The keep argument used here is if you want to keep all the other previous sets:
intersection(tidy_set, sets = c("A", "B"), name = "D", keep = FALSE)
#> elements sets fuzzy
#> 1 c D 1
We can look for the complement of one or several sets:
complement_set(tidy_set, sets = c("A", "B"))
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
#> 6 c ∁A∪B 0
#> 7 a ∁A∪B 0
#> 8 b ∁A∪B 0
Observe that we haven’t provided a name for the resulting set but we can provide one if we prefer to
complement_set(tidy_set, sets = c("A", "B"), name = "F")
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
#> 6 c F 0
#> 7 a F 0
#> 8 b F 0
This is the equivalent of setdiff
, but clearer:
out <- subtract(tidy_set, set_in = "A", not_in = "B", name = "A-B")
out
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
name_sets(out)
#> [1] "A" "B" "C" "A-B"
subtract(tidy_set, set_in = "B", not_in = "A", keep = FALSE)
#> elements sets fuzzy
#> 1 a B∖A 1
#> 2 b B∖A 1
See that in the first case there isn’t any element present in B not in set A, but the new set is stored. In the second use case we focus just on the elements that are present on B but not in A.
The number of unique elements and sets can be obtained using the
nElements()
and nSets()
methods.
nElements(tidy_set)
#> [1] 3
nSets(tidy_set)
#> [1] 3
nRelations(tidy_set)
#> [1] 5
If you wish to know all in a single call you can use
dim(tidy_set)
: 3, 5, 3. This summary doesn’t provide the
number of relations of each set. You can quickly obtain that with
lengths(tidy_set)
: 1, 3, 1
The size of each set can be obtained using the
set_size()
method.
set_size(tidy_set)
#> sets size probability
#> 1 A 1 1
#> 2 B 3 1
#> 3 C 1 1
Conversely, the number of sets associated with each gene is returned
by the element_size()
function.
element_size(tidy_set)
#> elements size probability
#> 1 c 2 1
#> 2 a 1 1
#> 3 b 2 1
The identifiers of elements and sets can be inspected and renamed
using name_elements
and
name_elements(tidy_set)
#> [1] "c" "a" "b"
name_elements(tidy_set) <- paste0("Gene", seq_len(nElements(tidy_set)))
name_elements(tidy_set)
#> [1] "Gene1" "Gene2" "Gene3"
name_sets(tidy_set)
#> [1] "A" "B" "C"
name_sets(tidy_set) <- paste0("Geneset", seq_len(nSets(tidy_set)))
name_sets(tidy_set)
#> [1] "Geneset1" "Geneset2" "Geneset3"
dplyr
verbsYou can also use mutate()
, filter()
,
select()
, group_by()
and other
dplyr
verbs with TidySets. You usually need to activate
which three slots you want to affect with activate()
:
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:BaseSet':
#>
#> union
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
m_TS <- tidy_set %>%
activate("relations") %>%
mutate(Important = runif(nRelations(tidy_set)))
m_TS
#> elements sets fuzzy Important
#> 1 Gene1 Geneset1 1 0.2078229
#> 2 Gene2 Geneset2 1 0.0849534
#> 3 Gene3 Geneset2 1 0.2430043
#> 4 Gene1 Geneset2 1 0.5751461
#> 5 Gene3 Geneset3 1 0.7439878
You can use activate to select what are the verbs modifying:
set_modified <- m_TS %>%
activate("elements") %>%
mutate(Pathway = if_else(elements %in% c("Gene1", "Gene2"),
"pathway1",
"pathway2"))
set_modified
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.2078229 pathway1
#> 2 Gene2 Geneset2 1 0.0849534 pathway1
#> 3 Gene3 Geneset2 1 0.2430043 pathway2
#> 4 Gene1 Geneset2 1 0.5751461 pathway1
#> 5 Gene3 Geneset3 1 0.7439878 pathway2
set_modified %>%
deactivate() %>% # To apply a filter independently of where it is
filter(Pathway == "pathway1")
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.2078229 pathway1
#> 2 Gene2 Geneset2 1 0.0849534 pathway1
#> 3 Gene1 Geneset2 1 0.5751461 pathway1
If you think you need group_by
usually this could mean
that you need a new set. You can create a new one with
group
.
# A new group of those elements in pathway1 and with Important == 1
set_modified %>%
deactivate() %>%
group(name = "new", Pathway == "pathway1")
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.2078229 pathway1
#> 2 Gene2 Geneset2 1 0.0849534 pathway1
#> 3 Gene3 Geneset2 1 0.2430043 pathway2
#> 4 Gene1 Geneset2 1 0.5751461 pathway1
#> 5 Gene3 Geneset3 1 0.7439878 pathway2
#> 6 Gene1 new 1 NA pathway1
#> 7 Gene2 new 1 NA pathway1
set_modified %>%
group("pathway1", elements %in% c("Gene1", "Gene2"))
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.2078229 pathway1
#> 2 Gene2 Geneset2 1 0.0849534 pathway1
#> 3 Gene3 Geneset2 1 0.2430043 pathway2
#> 4 Gene1 Geneset2 1 0.5751461 pathway1
#> 5 Gene3 Geneset3 1 0.7439878 pathway2
#> 6 Gene1 pathway1 1 NA pathway1
#> 7 Gene2 pathway1 1 NA pathway1
You can use group_by()
but it won’t return a
TidySet
.
set_modified %>%
deactivate() %>%
group_by(Pathway, sets) %>%
count()
#> # A tibble: 4 × 3
#> # Groups: Pathway, sets [4]
#> Pathway sets n
#> <chr> <chr> <int>
#> 1 pathway1 Geneset1 1
#> 2 pathway1 Geneset2 2
#> 3 pathway2 Geneset2 1
#> 4 pathway2 Geneset3 1
After grouping or mutating sometimes we might be interested in moving a column describing something to other places. We can do by this with:
elements(set_modified)
#> elements Pathway
#> 1 Gene1 pathway1
#> 2 Gene2 pathway1
#> 3 Gene3 pathway2
out <- move_to(set_modified, "elements", "relations", "Pathway")
relations(out)
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.2078229 pathway1
#> 2 Gene2 Geneset2 1 0.0849534 pathway1
#> 3 Gene3 Geneset2 1 0.2430043 pathway2
#> 4 Gene1 Geneset2 1 0.5751461 pathway1
#> 5 Gene3 Geneset3 1 0.7439878 pathway2
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Europe/Madrid
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.2 BaseSet_0.9.0
#>
#> loaded via a namespace (and not attached):
#> [1] vctrs_0.6.3 cli_3.6.1 knitr_1.43 rlang_1.1.1
#> [5] xfun_0.40 generics_0.1.3 jsonlite_1.8.7 glue_1.6.2
#> [9] htmltools_0.5.6 sass_0.4.7 fansi_1.0.4 rmarkdown_2.24
#> [13] evaluate_0.21 jquerylib_0.1.4 tibble_3.2.1 fastmap_1.1.1
#> [17] yaml_2.3.7 lifecycle_1.0.3 compiler_4.3.1 pkgconfig_2.0.3
#> [21] rstudioapi_0.15.0 digest_0.6.33 R6_2.5.1 tidyselect_1.2.0
#> [25] utf8_1.2.3 pillar_1.9.0 magrittr_2.0.3 bslib_0.5.1
#> [29] tools_4.3.1 cachem_1.0.8