Getting Started with bitfield

library(bitfield)
library(dplyr, warn.conflicts = FALSE)

This guide walks through the core workflow of the bitfield package: planning an encoding, choosing protocols, finding optimal bit allocations, and sharing protocols with the community.

The example data

The package ships with bf_tbl, a small dataset with typical quality issues (missing values, invalid coordinates, mixed formats):

bf_tbl
#> # A tibble: 9 × 5
#>       x     y commodity yield year 
#>   <dbl> <dbl> <fct>     <dbl> <chr>
#> 1  25.3  59.5 soybean   11.2  2021 
#> 2  27.9  58.1 maize     12.0  <NA> 
#> 3  27.8  57.8 soybean   13.2  2021r
#> 4  27    59.2 <NA>       4.43 2021 
#> 5 259   Inf   honey     13.0  2021 
#> 6  27.3  59.1 maize      8.55 2021 
#> 7  26.1  58.4 soybean   11.3  2021 
#> 8  26.5 NaN   maize     10.6  2021 
#> 9   0     0   soybean    9.01 2021

Encoding types at a glance

Every bit flag uses one of a few encoding types. The table below summarises when to use which:

Encoding        Protocol          Key parameters    Use when
--------------  ----------------  ----------------  ----------------------------------------------
Boolean         na, inf, matches  set               Yes/no flags (1 bit each)
Enumeration     category          na.val            Discrete classes (auto-sized)
Integer         integer           range, fields     Bounded values with uniform precision
Floating-point  numeric           fields (exp/sig)  Open-ended values spanning orders of magnitude

Integer encoding maps a bounded range linearly onto bit states. Good for percentages, indices, and quantities with well-defined limits.
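As a minimal sketch (the my_data and cover_pct objects are hypothetical placeholders, and argument names beyond those in the table above are assumptions; see ?bf_map), a bounded percentage could be mapped like this:

# Hypothetical: map a percentage (0-100) onto integer bit states
reg <- bf_map(protocol = "integer", data = my_data, x = cover_pct,
              registry = reg, range = c(0, 100))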

Floating-point encoding splits bits into exponent and significand fields. Precision is finest near zero and coarsens with magnitude. Good for standard deviations, rates, and other variables that can span several orders of magnitude.
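The coarsening can be seen from the spacing between representable values: with a fixed number of significand bits \(s\), the step at exponent \(e\) is on the order of \(2^{e-s}\), doubling each time the exponent increases. A generic illustration in plain R (a toy layout, not necessarily bitfield's exact bit assignment or bias):

# Toy example: step size between representable values for 2 significand bits
sig_bits <- 2
exponents <- 0:5
data.frame(exponent = exponents, step = 2^(exponents - sig_bits))
# steps: 0.25, 0.5, 1, 2, 4, 8 -- resolution coarsens as magnitude grows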

When in doubt, bf_analyze() helps decide.

Using bf_analyze() to find optimal bit allocations

bf_analyze() works with any data type. For booleans, integers, and categories it reports the required bit count directly. For floating-point data it evaluates all possible exponent/significand configurations and reports those on the Pareto front for each total bit count.
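For the simpler encoding types the call is the same, just with a non-numeric vector as the first argument; for instance (output omitted here, and assuming a factor can be passed directly, analogous to the numeric case below):

# Category example: reports the bits needed to enumerate the observed levels
bf_analyze(bf_tbl$commodity)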

set.seed(42)
x <- runif(1000, 0.1, 10)
bf_analyze(x, range = c(0, 15), max_bits = 8, decimals = 1)
#> Float Analysis
#> ==============
#> 
#>   Observations      1000
#>   NA values         0
#>   Range             [0.102365, 9.98506]
#>   Levels            -
#>   Sign required     no
#>   Bits required     select from the table below
#>   Suggested na.val  automatic 
#>   Target range      [0.102365, 15]
#>   Decimals          1
#> 
#> Exp  Sig  Total  Underflow  Overflow  Changed     Min Res     Max Res        RMSE     Max Err
#> ---  ---  -----  ---------  --------  -------  ----------  ----------  ----------  ----------
#>   2    1      3      4.2%    19.8%    95.8%      0.2500      2.0000      2.3762      6.0000
#>   2    2      4      4.2%    19.8%    91.9%      0.1250      1.0000      1.4195      4.0000
#>   3    1      4      0.7%     0.0%    94.0%      0.0625      4.0000      0.9339      2.0000
#>   2    3      5      4.2%    19.8%    84.5%      0.0625      0.5000      0.9514      3.0000
#>   3    2      5      0.7%     0.0%    89.6%      0.0312      2.0000      0.6466      2.0000
#>   4    1      5      0.0%     0.0%    93.3%      0.0312      4.0000      0.9338      2.0000
#>   2    4      6      4.2%    19.8%    70.6%      0.0312      0.2500      0.7265      2.5000
#>   3    3      6      0.7%     0.0%    80.7%      0.0156      1.0000      0.3072      1.0000
#>   4    2      6      0.0%     0.0%    88.9%      0.0156      2.0000      0.6465      2.0000
#>   2    5      7      4.2%    19.8%    56.1%      0.0156      0.1250      0.5992      2.2000
#>   3    4      7      0.7%     0.0%    65.3%      0.0078      0.5000      0.1626      0.5000
#>   4    3      7      0.0%     0.0%    80.0%      0.0078      1.0000      0.3071      1.0000
#>   2    6      8      4.2%    19.8%    40.7%      0.0078      0.0625      0.5578      2.1000
#>   3    5      8      0.7%     0.0%    48.7%      0.0039      0.2500      0.0895      0.3000
#>   4    4      8      0.0%     0.0%    64.6%      0.0039      0.5000      0.1624      0.5000
#> 
#> Usage:
#>   bf_map(protocol = "numeric", ...,
#>          fields = list(exponent = <exp>, significand = <sig>))

The Pareto table shows multiple rows per bit count when trade-offs exist. Underflow and Overflow report the share of values that fall outside the representable range, Changed the share altered by rounding onto the encoding grid, Min Res and Max Res the finest and coarsest step size across the range, and RMSE and Max Err the resulting round-trip errors.

For example, at 7 total bits, exp=4/sig=3 covers the full range (no underflow) while exp=3/sig=4 offers finer precision but cannot represent values below \(2^{-3} = 0.125\). Neither dominates the other, so both appear.

Use the result to pick a configuration, then pass it to bf_map():

# After deciding on exp=4, sig=3 based on bf_analyze() output:
reg <- bf_map(protocol = "numeric", data = my_data, x = sd_values,
              registry = reg, fields = list(exponent = 4, significand = 3))

A complete example

Here is a full workflow using bf_tbl:

# 1. Create a registry
reg <- bf_registry(
  name = "yield_QA",
  description = "Quality assessment for yield data",
  template = bf_tbl)

# 2. Add boolean flags
reg <- bf_map(protocol = "na", data = bf_tbl, x = commodity, registry = reg)
reg <- bf_map(protocol = "inf", data = bf_tbl, x = x, registry = reg)
reg <- bf_map(protocol = "inf", data = bf_tbl, x = y, registry = reg)

# 3. Add a category
reg <- bf_map(protocol = "category", data = bf_tbl, x = commodity,
              registry = reg, na.val = 0L)

# 4. Add a numeric value with half-precision float encoding
reg <- bf_map(protocol = "numeric", data = bf_tbl, x = yield,
              registry = reg, format = "half")

reg
#>   type  data.frame
#>   width 21
#>   flags 5  -|-|-|--|----------------
#> 
#>   pos encoding  name     col
#>   1   0.0.1/0   na       commodity
#>   2   0.0.1/0   inf      x
#>   3   0.0.1/0   inf      y
#>   4   0.0.2/0   category commodity
#>   6   1.5.10/15 numeric  yield

# 5. Encode and decode
field <- bf_encode(registry = reg)
decoded <- bf_decode(field, registry = reg, verbose = FALSE)

head(decoded, 3)
#> $na_commodity
#> [1] 0 0 0 1 0 0 0 0 0
#> 
#> $inf_x
#> [1] 0 0 0 0 0 0 0 0 0
#> 
#> $inf_y
#> [1] 0 0 0 0 1 0 0 0 0
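Because each decoded flag indexes the original records, the flags can be used directly for filtering. For example, the record with an infinite y coordinate:

# Select the record flagged by inf_y (row 5 of bf_tbl, where y is Inf)
bf_tbl[decoded$inf_y == 1, ]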

Verify round-trip integrity

Always check that the encode/decode cycle preserves essential information:

# input NAs
sum(is.na(bf_tbl$commodity))
#> [1] 1

# NAs after roundtrip
sum(decoded$na_commodity == 1)
#> [1] 1
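The same idea extends to the encoded numbers themselves. A hedged check, assuming the decoded numeric column follows the same "<protocol>_<column>" naming pattern as the flags above (i.e. numeric_yield):

# Assumed column name "numeric_yield"; half precision should reproduce
# yields of this magnitude to well below 0.01
max(abs(decoded$numeric_yield - bf_tbl$yield))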

Design guidelines

Match precision to needs. Do not allocate 16 bits where 5 suffice. Use bf_analyze() to quantify the trade-off.

Use atomic protocols. Each bf_map() call should test one concept. This maximises reusability across projects.

Handle NA explicitly. Use the na.val parameter to reserve a sentinel value for missing data. Choose a value outside your data range, typically 0 (when 0 does not itself occur as a meaningful value) or the maximum integer state.

Test with edge cases. Include NA, Inf, NaN, zeros, and boundary values in your test data:

problematic <- bf_tbl[c(4, 5, 8, 9), ]
print(problematic)
#> # A tibble: 4 × 5
#>       x     y commodity yield year 
#>   <dbl> <dbl> <fct>     <dbl> <chr>
#> 1  27    59.2 <NA>       4.43 2021 
#> 2 259   Inf   honey     13.0  2021 
#> 3  26.5 NaN   maize     10.6  2021 
#> 4   0     0   soybean    9.01 2021

Choose units wisely. Livestock density in animals/km\(^2\) (range 0–290) needs 9 bits. The same quantity in animals/ha (range 0–3.1) needs only 5 bits. Unit choice directly affects bit budget.
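The bit counts follow directly from the number of distinct states to be represented. As a quick check (the resolutions of 1 animal/km\(^2\) and 0.1 animals/ha are assumptions for illustration):

# states = (max - min) / resolution + 1; bits = ceiling(log2(states))
ceiling(log2((290 - 0) / 1   + 1))  # 9 bits for 0-290 in steps of 1
ceiling(log2((3.1 - 0) / 0.1 + 1))  # 5 bits for 0-3.1 in steps of 0.1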

Creating custom protocols

When built-in protocols do not cover your use case, create a custom one with bf_protocol(). For example, some datasets embed status flags directly in value strings – a year column might contain "2021r" where r marks the value as revised. The built-in grepl protocol can detect such flags as a boolean, but a custom protocol can distinguish which flag is present:

valueFlag <- bf_protocol(
  name = "valueFlag",
  description = paste("Extracts trailing status flags from {x}.",
                      "0 = none, 1 = r(evised), 2 = p(rovisional),",
                      "3 = e(stimated)"),
  test = "function(x) { suffix <- sub('.*([a-z])$', '\\\\1', x); match(suffix, c('r','p','e'), nomatch = 0L) }",
  example = list(x = c("2020", "2021r", "2019p", "2018e", NA)),
  type = "int",
  bits = 2
)
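A sketch of how the custom protocol could then plug into the workflow above (whether bf_map() accepts a protocol object rather than a built-in name is an assumption here; see ?bf_map):

# Hypothetical: apply the custom protocol to the year column
reg <- bf_map(protocol = valueFlag, data = bf_tbl, x = year, registry = reg)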

Versioning and extension

When improving existing protocols, use versioning to maintain reproducibility:

valueFlagV2 <- bf_protocol(
  name = "valueFlag",
  description = paste("Extracts trailing status flags from {x}.",
                      "0 = none, 1 = r(evised), 2 = p(rovisional),",
                      "3 = e(stimated). Now also handles uppercase flags."),
  test = "function(x) { suffix <- sub('.*([a-zA-Z])$', '\\\\1', tolower(x)); match(suffix, c('r','p','e'), nomatch = 0L) }",
  example = list(x = c("2020", "2021r", "2019P", "2018e", NA)),
  type = "int",
  bits = 2,
  version = "1.1.0",
  extends = "valueFlag_1.0.0",
  note = "Now handles uppercase flags via case-insensitive matching"
)

Sharing protocols via community standards

The bitfloat/standards repository enables sharing encoding protocols. Access it through bf_standards():

# List available protocols
bf_standards(action = "list")

# Pull a community protocol
soil_protocol <- bf_standards(
  protocol = "soil_moisture",
  remote = "environmental/soil",
  action = "pull"
)

# Push your own protocol
bf_standards(
  protocol = dataAgeProtocol,
  remote = "environmental/temporal",
  action = "push",
  version = "1.0.0",
  change = "Initial release: data age encoding for environmental monitoring"
)

This requires a GitHub Personal Access Token. See ?bf_standards for setup instructions.

Quick reference

Do:

- run bf_analyze() before committing to a bit allocation
- keep each bf_map() call atomic: one concept per flag
- reserve an explicit na.val sentinel for missing data
- test with NA, Inf, NaN, zeros, and boundary values
- verify round-trip integrity after encoding

Avoid:

- allocating more bits than the required precision justifies
- units that inflate the value range unnecessarily
- changing a shared protocol without bumping its version

For more details, see the function documentation (?bf_map, ?bf_analyze, ?bf_protocol) and the package website at https://bitfloat.github.io/bitfield/.