This guide walks through the core workflow of the
bitfield package: planning an encoding, choosing protocols,
finding optimal bit allocations, and sharing protocols with the
community.
The package ships with bf_tbl, a small dataset with
typical quality issues (missing values, invalid coordinates, mixed
formats):
bf_tbl
#> # A tibble: 9 × 5
#> x y commodity yield year
#> <dbl> <dbl> <fct> <dbl> <chr>
#> 1 25.3 59.5 soybean 11.2 2021
#> 2 27.9 58.1 maize 12.0 <NA>
#> 3 27.8 57.8 soybean 13.2 2021r
#> 4 27 59.2 <NA> 4.43 2021
#> 5 259 Inf honey 13.0 2021
#> 6 27.3 59.1 maize 8.55 2021
#> 7 26.1 58.4 soybean 11.3 2021
#> 8 26.5 NaN maize 10.6 2021
#> 9 0 0 soybean 9.01 2021

Every bit flag uses one of a few encoding types. The table below summarises when to use which:
| Encoding | Protocol | Key parameters | Use when |
|---|---|---|---|
| Boolean | na, inf, matches | set | Yes/no flags (1 bit each) |
| Enumeration | category | na.val | Discrete classes (auto-sized) |
| Integer | integer | range, fields | Bounded values with uniform precision |
| Floating-point | numeric | fields (exp/sig) | Open-ended values spanning orders of magnitude |

Integer encoding maps a bounded range linearly onto bit states. Good for percentages, indices, and quantities with well-defined limits.
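As a rough illustration of the linear mapping, the base-R sketch below quantises a percentage onto 7 bits and decodes it again. The helpers `encode_linear()` and `decode_linear()` are hypothetical and not part of bitfield; the package's own rounding and clamping rules may differ.

```r
# Hypothetical helpers (not bitfield functions): map [range[1], range[2]]
# linearly onto the integer states 0 .. 2^bits - 1 and back.
encode_linear <- function(x, range, bits) {
  states <- 2^bits - 1
  clamped <- pmin(pmax(x, range[1]), range[2])  # keep values inside the range
  as.integer(round((clamped - range[1]) / (range[2] - range[1]) * states))
}

decode_linear <- function(code, range, bits) {
  range[1] + code / (2^bits - 1) * (range[2] - range[1])
}

pct <- c(0, 12.5, 50, 99.9, 100)
codes <- encode_linear(pct, range = c(0, 100), bits = 7)
decoded <- decode_linear(codes, range = c(0, 100), bits = 7)

# Largest round-trip error; bounded by half a step (100 / 127 / 2)
print(max(abs(decoded - pct)))
```

With 7 bits the step is 100/127, so the round-trip error stays within half a step (about 0.39 percentage points) everywhere in the range, which is what uniform precision means here.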
Floating-point encoding splits bits into exponent and significand fields. Precision is finest near zero and coarsens with magnitude. Good for standard deviations, rates, and other variables that can span several orders of magnitude.
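How precision coarsens with magnitude can be sketched in base R. Assuming a binary floating-point layout and ignoring subnormal values, the spacing between neighbouring representable values near `x` is `2^(floor(log2(x)) - sig)`, where `sig` is the number of significand bits. `float_step()` is a hypothetical helper, not a bitfield function.

```r
# Hypothetical helper: spacing between representable values near x
# for a binary float with `sig_bits` significand bits (subnormals ignored).
float_step <- function(x, sig_bits) {
  2^(floor(log2(x)) - sig_bits)
}

values <- c(0.102, 1, 9.985)
steps <- float_step(values, sig_bits = 3)
print(data.frame(value = values, step = steps))
```

With three significand bits the step grows from about 0.008 near 0.1 to 1 near 10, a factor of 128 across two orders of magnitude.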
When in doubt, bf_analyze() helps decide.
bf_analyze() works with any data type. For booleans,
integers, and categories it reports the required bit count directly. For
floating-point data it evaluates all possible exponent/significand
configurations and reports those on the Pareto front for each total bit
count.
set.seed(42)
x <- runif(1000, 0.1, 10)
bf_analyze(x, range = c(0, 15), max_bits = 8, decimals = 1)
#> Float Analysis
#> ==============
#>
#> Observations 1000
#> NA values 0
#> Range [0.102365, 9.98506]
#> Levels -
#> Sign required no
#> Bits required select from the table below
#> Suggested na.val automatic
#> Target range [0.102365, 15]
#> Decimals 1
#>
#> Exp Sig Total Underflow Overflow Changed Min Res Max Res RMSE Max Err
#> --- --- ----- --------- -------- ------- ---------- ---------- ---------- ----------
#> 2 1 3 4.2% 19.8% 95.8% 0.2500 2.0000 2.3762 6.0000
#> 2 2 4 4.2% 19.8% 91.9% 0.1250 1.0000 1.4195 4.0000
#> 3 1 4 0.7% 0.0% 94.0% 0.0625 4.0000 0.9339 2.0000
#> 2 3 5 4.2% 19.8% 84.5% 0.0625 0.5000 0.9514 3.0000
#> 3 2 5 0.7% 0.0% 89.6% 0.0312 2.0000 0.6466 2.0000
#> 4 1 5 0.0% 0.0% 93.3% 0.0312 4.0000 0.9338 2.0000
#> 2 4 6 4.2% 19.8% 70.6% 0.0312 0.2500 0.7265 2.5000
#> 3 3 6 0.7% 0.0% 80.7% 0.0156 1.0000 0.3072 1.0000
#> 4 2 6 0.0% 0.0% 88.9% 0.0156 2.0000 0.6465 2.0000
#> 2 5 7 4.2% 19.8% 56.1% 0.0156 0.1250 0.5992 2.2000
#> 3 4 7 0.7% 0.0% 65.3% 0.0078 0.5000 0.1626 0.5000
#> 4 3 7 0.0% 0.0% 80.0% 0.0078 1.0000 0.3071 1.0000
#> 2 6 8 4.2% 19.8% 40.7% 0.0078 0.0625 0.5578 2.1000
#> 3 5 8 0.7% 0.0% 48.7% 0.0039 0.2500 0.0895 0.3000
#> 4 4 8 0.0% 0.0% 64.6% 0.0039 0.5000 0.1624 0.5000
#>
#> Usage:
#> bf_map(protocol = "numeric", ...,
#> fields = list(exponent = <exp>, significand = <sig>))

The Pareto table shows multiple rows per bit count when trade-offs exist. The columns to compare are Underflow, Overflow, Changed, Min Res, Max Res, RMSE, and Max Err.
For example, at 7 total bits, exp=4/sig=3 covers the
full range (no underflow) while exp=3/sig=4 offers finer
precision but cannot represent values below \(2^{-3} = 0.125\). Neither dominates the
other, so both appear.
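The idea of dominance can be made concrete with a small base-R sketch. It takes the three 7-bit rows from the table above and checks whether any row is at least as good as another on every criterion and strictly better on one. Which criteria `bf_analyze()` actually compares is an assumption here (underflow, overflow, RMSE, and maximum resolution, all lower-is-better), and `dominates()` is a hypothetical helper.

```r
# Figures copied from the 7-bit rows of the bf_analyze() output above.
rows <- data.frame(
  exp       = c(2, 3, 4),
  sig       = c(5, 4, 3),
  underflow = c(4.2, 0.7, 0.0),
  overflow  = c(19.8, 0.0, 0.0),
  rmse      = c(0.5992, 0.1626, 0.3071),
  max_res   = c(0.1250, 0.5000, 1.0000)
)

# Hypothetical helper: a dominates b if it is no worse everywhere
# and strictly better somewhere (all criteria are lower-is-better).
dominates <- function(a, b) all(a <= b) && any(a < b)

# Assumed criteria; the package may use a different set.
metrics <- as.matrix(rows[, c("underflow", "overflow", "rmse", "max_res")])

is_dominated <- vapply(seq_len(nrow(metrics)), function(i) {
  any(vapply(seq_len(nrow(metrics)), function(j) {
    j != i && dominates(metrics[j, ], metrics[i, ])
  }, logical(1)))
}, logical(1))

# Rows that survive the filter form the Pareto front
print(rows[!is_dominated, c("exp", "sig")])
```

Under these assumed criteria none of the three rows is dominated, which is consistent with all of them appearing in the table.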
Use the result to pick a configuration, then pass it to
bf_map() through the fields argument, as the usage hint at the end of the output shows.
Here is a full workflow using bf_tbl:
# 1. Create a registry
reg <- bf_registry(
name = "yield_QA",
description = "Quality assessment for yield data",
template = bf_tbl)
# 2. Add boolean flags
reg <- bf_map(protocol = "na", data = bf_tbl, x = commodity, registry = reg)
reg <- bf_map(protocol = "inf", data = bf_tbl, x = x, registry = reg)
reg <- bf_map(protocol = "inf", data = bf_tbl, x = y, registry = reg)
# 3. Add a category
reg <- bf_map(protocol = "category", data = bf_tbl, x = commodity,
registry = reg, na.val = 0L)
# 4. Add a numeric value with half-precision float encoding
reg <- bf_map(protocol = "numeric", data = bf_tbl, x = yield,
registry = reg, format = "half")
reg
#> type data.frame
#> width 21
#> flags 5 -|-|-|--|----------------
#>
#> pos encoding name col
#> 1 0.0.1/0 na commodity
#> 2 0.0.1/0 inf x
#> 3 0.0.1/0 inf y
#> 4 0.0.2/0 category commodity
#> 6 1.5.10/15 numeric yield

Match precision to needs. Do not allocate 16 bits
where 5 suffice. Use bf_analyze() to quantify the
trade-off.
Use atomic protocols. Each bf_map()
call should test one concept. This maximises reusability across
projects.
Handle NA explicitly. Use the na.val
parameter to reserve a sentinel value for missing data. Choose a value
outside your data range, for example 0 when zero carries no meaning in
the data, or the maximum integer state.
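A minimal base-R sketch of the sentinel idea for a categorical column follows. It assumes level codes start at 1 so that 0 is free to mark missing values; this illustrates the principle only and is not a description of how bitfield stores categories internally.

```r
commodity <- factor(c("soybean", "maize", "soybean", NA, "honey"))

# Level codes are 1-based (honey = 1, maize = 2, soybean = 3), NA stays NA
codes <- as.integer(commodity)

# Reserve 0 as the sentinel for missing values
codes[is.na(codes)] <- 0L

# One extra state is needed for the sentinel
bits_needed <- ceiling(log2(nlevels(commodity) + 1))

print(codes)
print(bits_needed)
```

Three levels plus one sentinel give four states, which fit in two bits; that appears to agree with the two-bit width of the category flag in the registry printed above.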
Test with edge cases. Include NA,
Inf, NaN, zeros, and boundary values in your
test data:
problematic <- bf_tbl[c(4, 5, 8, 9), ]
print(problematic)
#> # A tibble: 4 × 5
#> x y commodity yield year
#> <dbl> <dbl> <fct> <dbl> <chr>
#> 1 27 59.2 <NA> 4.43 2021
#> 2 259 Inf honey 13.0 2021
#> 3 26.5 NaN maize 10.6 2021
#> 4 0 0 soybean 9.01 2021

Choose units wisely. Livestock density in animals/km\(^2\) (range 0–290) needs 9 bits. The same quantity in animals/ha (range 0–3.1) needs only 5 bits. Unit choice directly affects bit budget.
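The bit counts in the livestock example can be reproduced with a short base-R calculation. The sketch assumes values are stored at a fixed step (whole animals per km\(^2\), one decimal place per ha); the step sizes are an assumption, and `bits_for_range()` is a hypothetical helper rather than a bitfield function.

```r
# Hypothetical helper: bits needed to enumerate every value
# from `lower` to `upper` in increments of `step`.
bits_for_range <- function(lower, upper, step) {
  n_states <- round((upper - lower) / step) + 1
  ceiling(log2(n_states))
}

bits_km2 <- bits_for_range(0, 290, step = 1)    # animals per km^2, whole numbers
bits_ha  <- bits_for_range(0, 3.1, step = 0.1)  # animals per ha, one decimal

print(c(km2 = bits_km2, ha = bits_ha))
```

Under these assumptions the first range has 291 states and the second 32, which gives the 9 and 5 bits quoted above.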
When built-in protocols do not cover your use case, create a custom
one with bf_protocol(). For example, some datasets embed
status flags directly in value strings – a year column might contain
"2021r" where r marks the value as revised.
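To make the embedded flag concrete, the base-R sketch below contrasts a yes/no check for a trailing letter with a lookup that identifies which letter it is. The vector `year` is illustrative data, and the flag letters r, p, and e are taken from the protocol description further down.

```r
year <- c("2020", "2021r", "2019p", "2018e", NA)

# Yes/no: does the value end in a lowercase letter? (NA yields FALSE)
has_flag <- grepl("[a-z]$", year)

# Which flag: isolate the trailing letter, then look it up.
# Values without a trailing letter are returned unchanged by sub()
# and therefore fall through to nomatch = 0.
suffix <- sub(".*([a-z])$", "\\1", year)
flag_code <- match(suffix, c("r", "p", "e"), nomatch = 0L)

print(data.frame(year, has_flag, flag_code))
```

The boolean check needs one bit, while the lookup distinguishes four states (none, r, p, e) and so needs two.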
The built-in grepl protocol can detect such flags as a
boolean, but a custom protocol can distinguish which flag is
present:
valueFlag <- bf_protocol(
name = "valueFlag",
description = paste("Extracts trailing status flags from {x}.",
"0 = none, 1 = r(evised), 2 = p(rovisional),",
"3 = e(stimated)"),
test = "function(x) { suffix <- sub('.*([a-z])$', '\\\\1', x); match(suffix, c('r','p','e'), nomatch = 0L) }",
example = list(x = c("2020", "2021r", "2019p", "2018e", NA)),
type = "int",
bits = 2
)

When improving existing protocols, use versioning to maintain reproducibility:
valueFlagV2 <- bf_protocol(
name = "valueFlag",
description = paste("Extracts trailing status flags from {x}.",
"0 = none, 1 = r(evised), 2 = p(rovisional),",
"3 = e(stimated). Now also handles uppercase flags."),
test = "function(x) { suffix <- sub('.*([a-zA-Z])$', '\\\\1', tolower(x)); match(suffix, c('r','p','e'), nomatch = 0L) }",
example = list(x = c("2020", "2021r", "2019P", "2018e", NA)),
type = "int",
bits = 2,
version = "1.1.0",
extends = "valueFlag_1.0.0",
note = "Now handles uppercase flags via case-insensitive matching"
)

Do:
- Use bf_analyze() for floating-point configurations
- Use na.val for missing data

Avoid:
For more details, see the function documentation
(?bf_map, ?bf_analyze,
?bf_protocol) and the package website at https://bitfloat.github.io/bitfield/.