Customized Distributions

Custom distributions can be specified in defData and defDataAdd by setting the argument dist to “custom”. When defining a custom distribution, you provide the name of the user-defined function as a string in the formula argument. The arguments of the custom function are listed in the variance argument, separated by commas and formatted as “arg_1 = val_form_1, arg_2 = val_form_2, \(\dots\), arg_K = val_form_K”.

Here, the arg_k’s represent the names of the arguments passed to the customized function, where \(k\) ranges from \(1\) to \(K\). You can use values or formulas for each val_form_k. If formulas are used, ensure that the variables they reference have been previously generated. Double-dot notation is available when specifying val_form_k. One important requirement of the custom function is that the parameter list used to define the function must include an argument “n = n”; however, \(n\) should not be included in the definition passed to defData or defDataAdd.
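To make the calling convention concrete, here is a minimal sketch in base R that does not use simstudy. The function rshiftexp and its arguments rate and shift are hypothetical names chosen for illustration. The direct call mimics what the data-generation step presumably does: \(n\) is supplied automatically, and an argument defined by a formula arrives as a vector with one value per observation, so the function should be vectorized over its arguments.

```r
# Hypothetical custom function: 'n' must appear in the parameter list,
# followed by the arguments that would be named in the variance string.
rshiftexp <- function(n, rate, shift) {
  shift + rexp(n, rate = rate)  # vectorized over 'rate' and 'shift'
}

set.seed(1)
shift <- c(0, 1, 2, 3, 4)  # stands in for a formula evaluated once per row
x <- rshiftexp(n = length(shift), rate = 2, shift = shift)
```

In a data definition, a function like this would be referenced with formula = "rshiftexp" and dist = "custom", and only rate and shift would be listed in the variance string.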

Example 1

Here is an example where we would like to generate data from a zero-inflated beta distribution. In this case, there is a user-defined function zeroBeta that takes shape parameters \(a\) and \(b\), as well as \(p_0\), the proportion of the sample that is zero. Note that the function also takes an argument \(n\) that will not be specified in the data definition; \(n\) will represent the number of observations being generated:

zeroBeta <- function(n, a, b, p0) {
  betas <- rbeta(n, a, b)
  is.zero <- rbinom(n, 1, p0)
  betas*!(is.zero)
}

The data definition specifies a new variable \(zb\) that sets \(a\) and \(b\) to 0.75, and \(p_0 = 0.02\):

def <- defData(
  varname = "zb", 
  formula = "zeroBeta", 
  variance = "a = 0.75, b = 0.75, p0 = 0.02", 
  dist = "custom"
)

The data are generated:

set.seed(1234)
dd <- genData(100000, def)
dd
## Key: <id>
##             id         zb
##          <int>      <num>
##      1:      1 0.93922887
##      2:      2 0.35609519
##      3:      3 0.08087245
##      4:      4 0.99796758
##      5:      5 0.28481522
##     ---                  
##  99996:  99996 0.81740836
##  99997:  99997 0.98586333
##  99998:  99998 0.68770216
##  99999:  99999 0.45096868
## 100000: 100000 0.74101272

A plot of the data reveals the disproportionate number of zeros.
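As a rough way to inspect the zero inflation, the following sketch calls zeroBeta directly in base R rather than through genData, so the draws will not necessarily match the data set shown above. It computes the observed share of exact zeros and draws a histogram with hist; the bin width and labels are illustrative choices, not the settings behind the original figure.

```r
# Same function as defined earlier, repeated so the block runs on its own
zeroBeta <- function(n, a, b, p0) {
  betas <- rbeta(n, a, b)
  is.zero <- rbinom(n, 1, p0)
  betas * !(is.zero)
}

set.seed(1234)
zb <- zeroBeta(n = 100000, a = 0.75, b = 0.75, p0 = 0.02)

# Observed share of exact zeros; expected to be close to p0 = 0.02
p_zero <- mean(zb == 0)

# Histogram with 50 equal-width bins; exact zeros fall in the first bin
h <- hist(zb, breaks = seq(0, 1, length.out = 51),
          main = "Zero-inflated beta", xlab = "zb")
```

Because the Beta(0.75, 0.75) density is symmetric around 0.5, the first bin standing taller than the last bin is the visual signature of the added zeros.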

Example 2

In this second example, we are generating data from truncated Gaussian distributions with means ranging from \(-1\) to \(1\). The limits of the truncation vary across three different groups. rnormt is a customized (user-defined) function that generates the truncated Gaussian data. In addition to \(n\), the function requires four arguments (the left truncation value, the right truncation value, the distribution mean, and the standard deviation).

rnormt <- function(n, min, max, mu, s) {
  
  F.a <- pnorm(min, mean = mu, sd = s)
  F.b <- pnorm(max, mean = mu, sd = s)
  
  u <- runif(n, min = F.a, max = F.b)
  qnorm(u, mean = mu, sd = s)
  
}
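rnormt relies on inverse-transform sampling: uniform draws between \(F(min)\) and \(F(max)\), where \(F\) is the normal CDF, are mapped back through the normal quantile function, so every value lands inside the truncation limits. The sketch below, in base R without simstudy, checks this with limits that vary by observation. The variable names limit and tn, the sample size, and the choice of mu = 0 are illustrative assumptions.

```r
# Same function as above, repeated so the block runs on its own
rnormt <- function(n, min, max, mu, s) {
  F.a <- pnorm(min, mean = mu, sd = s)
  F.b <- pnorm(max, mean = mu, sd = s)
  u <- runif(n, min = F.a, max = F.b)
  qnorm(u, mean = mu, sd = s)
}

set.seed(2024)
n <- 10000

# Group membership (1, 2, or 3) doubles as the truncation limit
limit <- sample(1:3, n, replace = TRUE, prob = c(1/4, 1/2, 1/4))
tn <- rnormt(n, min = -limit, max = limit, mu = 0, s = 1.5)

# Smallest and largest draw within each limit group
ranges <- tapply(tn, limit, range)
```

Since pnorm, runif, and qnorm are all vectorized, the same function works whether the limits are scalars or vectors with one value per observation.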

In this example, truncation limits vary based on group membership. Initially, three groups are created, followed by the generation of truncated values. For Group 1, truncation occurs within the range of \(-1\) to \(1\); for Group 2, it is \(-2\) to \(2\); and for Group 3, it is \(-3\) to \(3\). We’ll generate three data sets, each with a distinct mean denoted by M, using the double-dot notation to implement these different means.

def <-
  defData(
    varname = "limit", 
    formula = "1/4;1/2;1/4",
    dist = "categorical"
  ) |>
  defData(
    varname = "tn", 
    formula = "rnormt", 
    variance = "min = -limit, max = limit, mu = ..M, s = 1.5",
    dist = "custom"
  )

The data generation requires three calls to genData. The output is a list of three data sets:

mus <- c(-1, 0, 1)
dd <- lapply(mus, function(M) genData(100000, def))

Here are the first six observations from each of the three data sets:

## [[1]]
## Key: <id>
##       id limit         tn
##    <int> <int>      <num>
## 1:     1     2  0.6949619
## 2:     2     2 -0.3641963
## 3:     3     2 -0.4721632
## 4:     4     3 -2.6083796
## 5:     5     2 -0.6800441
## 6:     6     3 -0.5813880
## 
## [[2]]
## Key: <id>
##       id limit         tn
##    <int> <int>      <num>
## 1:     1     1  0.4853614
## 2:     2     2 -0.5690811
## 3:     3     2  0.5282246
## 4:     4     2  0.1107778
## 5:     5     2 -0.3504309
## 6:     6     2  1.9439890
## 
## [[3]]
## Key: <id>
##       id limit         tn
##    <int> <int>      <num>
## 1:     1     2  1.3560628
## 2:     2     2  1.4543616
## 3:     3     3  1.4491010
## 4:     4     2  0.7328855
## 5:     5     2 -0.1254556
## 6:     6     2 -0.7455908

A plot highlights the group differences.