Extract summary statistics of R package structure and functionality.
Not all statistics of course, but a good go at balancing insightful
statistics while ensuring computational feasibility.
pkgstats is a static code analysis tool, so is
generally very fast (a few seconds at most for very large packages).
Installation is described in a
separate vignette.
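Statistics are derived from these primary sources:
1. Numbers of lines of code, documentation, and white space in each directory and language.
2. Summaries of the package DESCRIPTION file and related package meta-statistics.
3. Summaries of all objects defined in the directories containing source code (./R, ./src, and ./inst/include).
4. A function call network derived from function definitions obtained from the code-tagging library ctags, and references (“calls”) to those obtained from another tagging library, gtags. This network roughly connects every object making a call (as from) with every object being called (to).
5. An additional network of calls made to functions from other R packages.
The primary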
function, pkgstats(), returns a list of these various
components, including full data.frame objects for the final
three components described above. The statistical properties of this
list can be aggregated by the pkgstats_summary()
function, which returns a data.frame with a single row
of summary statistics. This function is demonstrated below, including
full details of all statistics extracted.
The following code demonstrates the output of the main function,
pkgstats, using an internally bundled .tar.gz
“tarball” of this package. The system.time call
demonstrates that the static code analyses of pkgstats are
generally very fast.
library (pkgstats)
tarball <- system.file ("extdata", "pkgstats_9.9.tar.gz", package = "pkgstats")
system.time (
    p <- pkgstats (tarball)
)
##    user  system elapsed
##   1.701   0.124   1.802
names (p)
## [1] "loc"            "vignettes"      "data_stats"     "desc"
## [5] "translations"   "objects"        "network"        "external_calls"
The result is a list of various data extracted from the code. All
except for objects, network, and external_calls represent
summary data:
p [!names (p) %in% c ("objects", "network", "external_calls")]
## $loc
## # A tibble: 4 × 12
##   language dir        nfiles nlines ncode  ndoc nempty nspaces nchars nexpr ntabs
##   <chr>    <chr>       <int>  <int> <int> <int>  <int>   <int>  <int> <dbl> <int>
## 1 C++      src             3    365   277    21     67     933   7002     1     0
## 2 R        R              19   3741  2698   536    507   27575  94022     1     0
## 3 R        tests           2    146   121     1     24     395   2423     1     0
## 4 R        tests/tes…      5    202   145     9     48     375   3738     1     0
## # ℹ 1 more variable: indentation <int>
##
## $vignettes
## vignettes demos
## 0 0
##
## $data_stats
## n total_size median_size
## 0 0 0
##
## $desc
## package version date license
## 1 pkgstats 9.9 2022-05-12 19:41:22 GPL-3
## urls
## 1 https://docs.ropensci.org/pkgstats/,\nhttps://github.com/ropensci-review-tools/pkgstats
## bugs aut ctb fnd rev ths
## 1 https://github.com/ropensci-review-tools/pkgstats/issues 1 0 0 0 0
## trl depends imports
## 1 0 NA brio, checkmate, dplyr, fs, igraph, methods, readr, sys, withr
## suggests
## 1 hms, knitr, pbapply, pkgbuild, Rcpp, rmarkdown, roxygen2, testthat, visNetwork
## enhances linking_to
## 1 NA cpp11
##
## $translations
## [1] NA
The various components of these results are described in further detail in the main package vignette.
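The loc table shown above lends itself to simple derived statistics. The following is a minimal sketch in base R which re-types a subset of the printed values by hand, so that it runs without pkgstats installed. The object names loc and by_lang are illustrative only, and the truncated directory name is kept exactly as printed.

```r
# Values copied from the $loc output shown above (subset of columns)
loc <- data.frame (
    language = c ("C++", "R", "R", "R"),
    dir = c ("src", "R", "tests", "tests/tes…"),
    nlines = c (365L, 3741L, 146L, 202L),
    ncode = c (277L, 2698L, 121L, 145L),
    ndoc = c (21L, 536L, 1L, 9L),
    nempty = c (67L, 507L, 24L, 48L)
)

# In this output, code, documentation, and empty lines add up to all lines
stopifnot (all (loc$ncode + loc$ndoc + loc$nempty == loc$nlines))

# Proportion of documentation lines in each directory
loc$prop_doc <- round (loc$ndoc / loc$nlines, 3)

# Totals aggregated per language
by_lang <- aggregate (cbind (nlines, ncode, ndoc) ~ language, data = loc, FUN = sum)
print (by_lang)
```

With a real result, the same operations could presumably be applied directly to p$loc.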
pkgstats_summary() function
A summary of the pkgstats data can be obtained by
submitting the object returned from pkgstats() to the pkgstats_summary()
function:
s <- pkgstats_summary (p)
This function reduces the result of the pkgstats()
function to a single line with 95 entries, represented as a
data.frame with one row and that number of columns. This
format is intended to enable summary statistics from multiple packages
to be aggregated by simply binding rows together. While 95 statistics
might seem like a lot, the pkgstats_summary()
function aims to return as many usable raw statistics as possible in
order to flexibly allow higher-level statistics to be derived through
combination and aggregation. These 95 statistics can be roughly grouped
into the following categories (not shown in the order in which they
actually appear), with variable names in parentheses after each
description. Some statistics are summarised as comma-delimited character
strings, such as translations into human languages, or other packages
listed under “depends”, “imports”, or “suggests”. This enables
subsequent analyses of their contents, for example of actual translated
languages, or both aggregate numbers and individual details of all
package dependencies, as demonstrated immediately below.
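The row-binding idea described above can be sketched with mock data. The three one-row data frames below merely stand in for pkgstats_summary() results: the package names and values are hypothetical, and only two of the statistics are included.

```r
# Mock one-row summaries standing in for pkgstats_summary() output.
# Package names and values are hypothetical.
s1 <- data.frame (package = "pkgA", loc_R = 1200L, n_fns_r = 45L)
s2 <- data.frame (package = "pkgB", loc_R = 310L, n_fns_r = 12L)
s3 <- data.frame (package = "pkgC", loc_R = 5400L, n_fns_r = 210L)

# Bind the single rows into one table with one row per package
all_stats <- do.call (rbind, list (s1, s2, s3))

# Higher-level statistics can then be derived by combining columns
all_stats$loc_per_fn <- all_stats$loc_R / all_stats$n_fns_r
print (all_stats)
```

Because every summary has the same columns, rbind needs no further alignment of the inputs.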
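Package Summaries
- Package name (package)
- Package version (version)
- Package date, taken from the modification time of the DESCRIPTION file where not explicitly stated (date)
- License (license)
- Languages used, as a comma-separated character string (languages), and excluding R itself.
- Translations into human languages, where present (translations).
Information from DESCRIPTION file
- Package URL(s) (url)
- URL for bug reports (bugs)
- Numbers of contributors with the roles of author (desc_n_aut), contributor (desc_n_ctb), funder (desc_n_fnd), reviewer (desc_n_rev), thesis advisor (ths), and translator (trl, relating to translation between computer languages, not spoken languages).
- Comma-separated character entries for all depends, imports, suggests, and linking_to packages.
Numbers of entries in each of the last two kinds of items can be
obtained by a simple strsplit call, like this: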
deps <- strsplit (s$suggests, ", ") [[1]]
length (deps)
## [1] 9
print (deps)
## [1] "hms"        "knitr"      "pbapply"    "pkgbuild"   "Rcpp"
## [6] "rmarkdown"  "roxygen2"   "testthat"   "visNetwork"
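The same counting can be applied to all dependency fields at once. The sketch below uses values copied from the desc output shown earlier, and assumes that an absent field appears as NA, as depends does there. The list fields and the helper count_entries() are illustrative names, not part of pkgstats.

```r
# Values copied from the "desc" output shown earlier
fields <- list (
    depends = NA_character_,
    imports = "brio, checkmate, dplyr, fs, igraph, methods, readr, sys, withr",
    suggests = "hms, knitr, pbapply, pkgbuild, Rcpp, rmarkdown, roxygen2, testthat, visNetwork",
    linking_to = "cpp11"
)

# Count comma-separated entries, treating NA or empty strings as zero
# (hypothetical helper, not a pkgstats function)
count_entries <- function (x) {
    if (is.na (x) || !nzchar (x)) {
        return (0L)
    }
    length (strsplit (x, ",\\s*") [[1]])
}

n_deps <- vapply (fields, count_entries, integer (1))
print (n_deps)
```

The explicit NA check matters because strsplit applied to NA returns a single NA entry, which would otherwise be counted as one dependency.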
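Numbers of files and associated data
- Number of vignettes (num_vignettes)
- Number of demos (num_demos)
- Number of data files (num_data_files)
- Total size of all package data (data_size_total)
- Median size of package data files (data_size_median)
- Numbers of files in the main sub-directories (files_R, files_src, files_inst, files_vignettes, files_tests), where numbers are recursively counted in all sub-directories, and where inst only counts files in the inst/include sub-directory.
Statistics on lines of code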
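- Total lines of code in each sub-directory (loc_R, loc_src, loc_inst, loc_vignettes, loc_tests).
- Total numbers of blank lines in each sub-directory (blank_lines_R, blank_lines_src, blank_lines_inst, blank_lines_vignettes, blank_lines_tests).
- Total numbers of comment lines in each sub-directory (comment_lines_R, comment_lines_src, comment_lines_inst, comment_lines_vignettes, comment_lines_tests).
- Measures of relative white space in each sub-directory (rel_space_R, rel_space_src, rel_space_inst, rel_space_vignettes, rel_space_tests), as well as an overall measure for the R/, src/, and inst/ directories (rel_space).
- Number of spaces used to indent code (indentation), with values of -1 indicating indentation with tab characters.
- Median number of nested expressions per line of code (nexpr).
Statistics on individual objects (including functions)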
These statistics all refer to “functions”, but actually represent more general “objects,” such as global variables or class definitions (generally from languages other than R), as detailed below.
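- Number of functions in R (n_fns_r)
- Numbers of exported and non-exported R functions (n_fns_r_exported, n_fns_r_not_exported)
- Number of functions (or objects) in other computer languages (n_fns_src), including functions in both src and inst/include directories.
- Numbers of functions (or objects) per individual file in R and in all other (src) directories (n_fns_per_file_r, n_fns_per_file_src).
- Mean and median numbers of parameters per exported R function (npars_exported_mn, npars_exported_md).
- Mean and median lines of code per function in R and other languages, distinguishing exported from non-exported R functions (loc_per_fn_r_mn, loc_per_fn_r_md, loc_per_fn_r_exp_mn, loc_per_fn_r_exp_md, loc_per_fn_r_not_exp_mn, loc_per_fn_r_not_exp_md, loc_per_fn_src_mn, loc_per_fn_src_md).
- Equivalent mean and median numbers of documentation lines per function, and of documentation characters per parameter (doclines_per_fn_exp_mn, doclines_per_fn_exp_md, doclines_per_fn_not_exp_mn, doclines_per_fn_not_exp_md, docchars_per_par_exp_mn, docchars_per_par_exp_md).
Network Statistics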
The full structure of the network table is described
below, with summary statistics including:
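- Numbers of edges, including distinction between languages (n_edges, n_edges_r, n_edges_src).
- Number of distinct clusters in the package network (n_clusters).
- Mean and median centrality of all network edges, calculated from both directed and undirected representations of the network (centrality_dir_mn, centrality_dir_md, centrality_undir_mn, centrality_undir_md).
- Equivalent centrality values excluding edges with centrality of zero (centrality_dir_mn_no0, centrality_dir_md_no0, centrality_undir_mn_no0, centrality_undir_md_no0).
- Numbers of terminal edges (num_terminal_edges_dir, num_terminal_edges_undir).
- Summary statistics on node degree (node_degree_mn, node_degree_md, node_degree_max)
External Call Statistics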
The final column in the result of the
pkgstats_summary() function summarises the
external_calls object detailing all calls made to external
packages (including to base and recommended packages). This summary is
also represented as a single character string. For each package, it lists the total
number of function calls and the total number of unique functions called.
Data for each package are separated by a comma, while data within each
package are separated by a colon.
s$external_calls
## [1] "base:447:78,brio:7:1,dplyr:7:4,fs:4:2,graphics:10:2,hms:1:1,igraph:3:3,pbapply:1:1,pkgstats:99:60,readr:8:5,stats:16:2,sys:13:1,tools:2:2,utils:10:7,visNetwork:3:2,withr:5:1"
This structure allows numbers of calls to all packages to be readily extracted with code like the following:
calls <- do.call (
    rbind,
    strsplit (strsplit (s$external_calls, ",") [[1]], ":")
)
calls <- data.frame (
    package = calls [, 1],
    n_total = as.integer (calls [, 2]),
    n_unique = as.integer (calls [, 3])
)
print (calls)
##       package n_total n_unique
## 1        base     447       78
## 2        brio       7        1
## 3       dplyr       7        4
## 4          fs       4        2
## 5    graphics      10        2
## 6         hms       1        1
## 7      igraph       3        3
## 8     pbapply       1        1
## 9    pkgstats      99       60
## 10      readr       8        5
## 11      stats      16        2
## 12        sys      13        1
## 13      tools       2        2
## 14      utils      10        7
## 15 visNetwork       3        2
## 16      withr       5        1
The two numeric columns respectively show the total number of calls made to each package, and the total number of unique functions used within those packages. These results provide detailed information on numbers of calls made to, and functions used from, other R packages, including base and recommended packages.
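As one possible follow-up, the parsed table can be ranked and converted to proportions. This sketch starts from the character string printed above, copied by hand so that it runs without pkgstats. The column prop_total and the variable names are illustrative additions.

```r
# String copied from the s$external_calls output shown above
external_calls <- paste0 (
    "base:447:78,brio:7:1,dplyr:7:4,fs:4:2,graphics:10:2,hms:1:1,",
    "igraph:3:3,pbapply:1:1,pkgstats:99:60,readr:8:5,stats:16:2,",
    "sys:13:1,tools:2:2,utils:10:7,visNetwork:3:2,withr:5:1"
)

# Split into packages (comma), then into fields within each package (colon)
parts <- do.call (rbind, strsplit (strsplit (external_calls, ",") [[1]], ":"))
calls <- data.frame (
    package = parts [, 1],
    n_total = as.integer (parts [, 2]),
    n_unique = as.integer (parts [, 3])
)

# Proportion of all recorded calls directed to each package
calls$prop_total <- calls$n_total / sum (calls$n_total)

# Order from the most-called to the least-called package
calls <- calls [order (calls$n_total, decreasing = TRUE), ]
head (calls, 3)
```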
Finally, the summary statistics conclude with two further statistics,
afferent_pkg and efferent_pkg. These are
package-internal measures of afferent
and efferent couplings between the files of a package. The
afferent couplings (ca) are numbers of
incoming calls to each file of a package from functions defined
elsewhere in the package, while the efferent couplings
(ce) are numbers of outgoing calls from each file
of a package to functions defined elsewhere in the package. These can be
used to derive a measure of “internal package instability” as the ratio
of efferent to total coupling (ce / (ce + ca)).
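The instability ratio can be written as a small helper. The function name instability() and the coupling counts below are hypothetical; they are not values or functions taken from pkgstats, and the sketch makes no assumption about how afferent_pkg and efferent_pkg are stored in the summary.

```r
# Ratio of efferent to total coupling, ce / (ce + ca).
# Returns NA where there is no coupling at all, to avoid dividing by zero.
# (Hypothetical helper, not a pkgstats function.)
instability <- function (ca, ce) {
    total <- ca + ce
    ifelse (total == 0, NA_real_, ce / total)
}

# Hypothetical counts: 12 incoming calls and 30 outgoing calls
instability (ca = 12, ce = 30)

# The helper is vectorised, so several hypothetical files can be compared
instability (ca = c (12, 0, 5), ce = c (30, 0, 0))
```

Values near 1 indicate mostly outgoing calls, and values near 0 indicate mostly incoming calls.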
There are many other “raw” statistics returned by the main
pkgstats() function which are not represented in
pkgstats_summary(). The main
package vignette provides further detail on the full results.
The following sub-sections provide further detail on the
objects, network, and
external_calls items, which could be used to extract
additional statistics beyond those described here.
Please note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
All contributions to this project are gratefully acknowledged using
the allcontributors
package following the allcontributors specification.
Contributions of any kind are welcome!
mpadge, jhollist, jeroen, Bisaloo, thomaszwagerman
helske, rpodcast, assignUser, GFabien, pawelru, stitam, willgearty
krlmlr, noamross, maelle, mdsumner, kellijohnson-NOAA, ScottClaessens, schneiderpy