Performance

library(DAISIEprep)

Performance analysis of DAISIEprep::extract_island_species()

In this article we test and analyse the performance, in terms of time consumption/complexity of the main function in the DAISIEprep R package: extract_island_species(). This function takes in a phylogenetic tree and species endemicity data in the form of a phylo4d object (an S4 class from the R package phylobase).

This is not a thorough examination of all possible uses of the extract_island_species() function, but rather gives an indication of the usability of the function for data sets of different sizes, as well as detecting any features of the data that may slow the process of extracting and formatting the island community data.

All of the work for the performance analysis is carried out by the benchmark() function from the DAISIEprep package. But before calling this function we explain how the benchmarking is set up.

The first argument of benchmark() is phylod. If this is NULL then the function will simulate the phylogenies and the endemicity data given the: tree_size_range, num_points, prob_on_island, and prob_endemic arguments. These specify the lower and upper range of the tree size to simulate (the sequence breaks can be in linear or log space depending on the argument log_scale), the number of breaks in the sequence between lower and upper tree sizes, the probability that a species will be on the island, and if a species is on the island whether it is endemic (1 - prob_endemic is the probability a species on the island is non-endemic). If a phylo4d object is supplied to the phylod argument in benchmark() then this tree will be used to perform the benchmarking.

In the case of simulating data the parameter space of the performance analysis is then the combination of these variables (using the expand.grid() function). The phylogeny is simulated to a fixed size using the rphylo() function from the ape package.

As we are stochastically simulating the endemicity statuses of the species in the tree it may be that there is not an outgroup species not on the island in order to correctly extract the colonisation times from the stem age of the species or clade on the island. Therefore, we add an outgroup which we ensure is not present on the island.

When we want to test the performance of the extract_island_species() function using the asr algorithm we need to first run an ancestral state reconstruction. This can easily be achieved if the phylod data with the phylogeny and endemicity statuses at each tip of the tree are ready (see tutorial vignette for more information on the asr algorithm).

To quantify the time elapsed while the function runs there are several methods in both base R and in various R packages (e.g. microbenchmark or rbenchmark). However here we use system.time() from base R (see https://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/) for reasoning).

Side note: by default DAISIEprep::benchmark() conditions on each simulated data set having a non-empty island, and thus the function is not tested for the trivial case that no species need to be extracted.

Due to the computational time not being deterministic and to ensure the results are not spurious we replicate timing three times and the mean “real time elapsed” is calculated, as well as replicating the simulation (given by argument replicates) to account for stochasticity in the simulation of the data.

The range of tree sizes we use encompasses those of common empirical phylogenies (10 tips to 10,000 tips). We then generate a random sample of tip states for each species, with the possible states being: endemic to the island, non-endemic and not present on the island. We tested a low probability of each species in the tree being on the island (\(P(0.2)\)) and a high probability of species being on the island (\(P(0.5)\)). For each of these scenarios we tested a low (\(P_E(0.2)\), \(P_{NE}(0.8)\)) and high probability (\(P_E(0.8)\), \(P_{NE}(0.2)\)) of island species being endemic to test whether this affects performance. For each scenario we ran ten replicates. For isolated islands where cladogenesis is high, island species will likely not be spread across uniformly across the phylogeny as we assumed in our simulations, but instead will be clustered. To check for scenarios where the number of species per colonisation events is high (i.e.  radiations on the island) we used the mammals of Madagascar data set. For this we used the (Upham et al. 2019) complete and DNA-only phylogenies, and the island endemicity data of Madagascar (Michielsen et al. 2023).

The results presented in this vignette are not computed each time the vignette is rendered due to the large computation time required. Instead, the analyses are run on a cluster computer and saved in the package. The analysis script run to produce the results can be found in the DAISIEprepExtra package here. The performance analysis uses the benchmark() function in the DAISIEprep package.

The output produces results for the DNA-only phylogeny and the complete phylogeny. The raw data of parameter estimates for the different parameter settings is tidied into a tibble containing the data we need for both the DNA and complete phylogeny.

# Internal function - Not to be called for regular 'DAISIEprep' operation,
# but merely to load in data
performance_data <- DAISIEprep:::read_performance()

We can plot the time consumption of each simulated and empirical (DNA-only and complete phylogeny) data set and group the results by the probability of species being on the island (prob_on_island). Now the results can be plotted with plot_performance implemented in DAISIEprep. This function follows the tidyverse convention of giving variable names as variables (as opposed to strings) and uses tidy evaluation to group by the variable given, either prob_on_island or prob_endemic. First for the min algorithm:

plot_performance(
  performance_data = performance_data$performance_data_min,
  group_by = prob_on_island
)
#> Warning: `position_dodge()` requires non-overlapping x intervals.
#> `position_dodge()` requires non-overlapping x intervals.

And secondly for the asr algorithm:

plot_performance(
  performance_data = performance_data$performance_data_asr,
  group_by = prob_on_island
)
#> Warning: `position_dodge()` requires non-overlapping x intervals.
#> `position_dodge()` requires non-overlapping x intervals.

Alternatively, the data can be grouped by the probability that species on the island are endemic (prob_endemic) for the min (first) and asr (second) algorithms:

plot_performance(
  performance_data = performance_data$performance_data_min,
  group_by = prob_endemic
)
#> Warning: `position_dodge()` requires non-overlapping x intervals.
#> `position_dodge()` requires non-overlapping x intervals.

plot_performance(
  performance_data = performance_data$performance_data_min,
  group_by = prob_endemic
)
#> Warning: `position_dodge()` requires non-overlapping x intervals.
#> `position_dodge()` requires non-overlapping x intervals.

We find that even for large trees (10,000 tips) the scale of extraction less than seconds, whereas running the DAISIE inference model is on the scale of minutes to days, and thus pre-processing does not present a bottleneck to the pipeline. The empirical trees we ran (the Madagascar mammal example) were quicker to process than trees simulated with uniformly random island presence (figure 1). This suggests that when the ratio of species to colonisation events on the island is higher it should be faster to extract. However, even for large trees with many colonisations extraction should pose a computational problem. The speed of extraction facilitates extraction of island data across many phylogenies, for example extracting island community data across the posterior distribution of inferred phylogenetic trees to account for uncertainty in branching times and tree topology.

Given the approximately linear relationship between the size of the phylogeny and time required for extraction in the above log-log plots we fit a power law to the mean run time (mean across the replicates).

grouped_performance_data <- dplyr::group_by(
  performance_data$performance_data_min,
  tree_size,
  "prob_on_island"
)

mean_performance_data <- dplyr::summarise(
  grouped_performance_data,
  mean = mean(median_time),
  .groups = "drop"
)

fit_min <- lm(log(mean_performance_data$mean) ~ log(mean_performance_data$tree_size))
fit_min$coefficients
#>                          (Intercept) log(mean_performance_data$tree_size) 
#>                           -28.297771                             1.399895

grouped_performance_data <- dplyr::group_by(
  performance_data$performance_data_asr,
  tree_size,
  "prob_on_island"
)

mean_performance_data <- dplyr::summarise(
  grouped_performance_data,
  mean = mean(median_time),
  .groups = "drop"
)

fit_asr <- lm(log(mean_performance_data$mean) ~ log(mean_performance_data$tree_size))
fit_asr$coefficients
#>                          (Intercept) log(mean_performance_data$tree_size) 
#>                             -28.0177                               1.3008

The growth in time for both the min and asr algorithms follows a power function (\(y=ax^k\)) with exponents \(k =\) 1.3998949 and \(k =\) 1.3008005, respectively, when fitted on mean run time against tree size. Therefore the time complexity scales relatively well with tree size, and thus unless needing to be applied to extremely large phylogenies (>10,000 tips) the DAISIEprep extraction functionality should be applicable.

References

Michielsen, N. M., S. M. Goodman, V. Soarimalala, A. A. E. van der Geer, L. M. Dávalos, G. I. Saville, N. Upham, and L. Valente. 2023. The macroevolutionary impact of recent and imminent mammal extinctions on madagascar. Nature Communications 14. Springer Science; Business Media LLC.
Upham, N. S., J. A. Esselstyn, and W. Jetz. 2019. Inferring the mammal tree: Species-level sets of phylogenies for questions in ecology, evolution, and conservation. PLOS Biology 17:e3000494. Public Library of Science (PLoS).