---
title: "How rfair works: methodology and architecture"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How rfair works: methodology and architecture}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(rfair)
```

This vignette describes what rfair measures and how, in enough detail to
interpret and reproduce its scores. For a quick tour see
`vignette("rfair")`; for the reuse/sensitivity extensions see
`vignette("beyond-fuji")`.

## 1. Background: FAIR, the FAIRsFAIR metrics, and F-UJI

The **FAIR principles** (Wilkinson et al. 2016) state that research data should
be **F**indable, **A**ccessible, **I**nteroperable, and **R**eusable. They are
aspirational; to assess a real data object you need *measurable* indicators.

The **FAIRsFAIR** project turned the principles into a concrete, testable metric
set, and the **F-UJI** tool (Devaraju & Huber, PANGAEA) implemented an automated
assessment service for them. F-UJI is a Python web service: you send it a
persistent identifier (PID) and it returns per-metric scores.

`rfair` is a **native R reimplementation** of the F-UJI metrics (version 0.8).
It performs the whole assessment in R, with no external server, so assessments
are scriptable, reproducible, and embeddable in R pipelines. The original
`rfair` package (v1) was only an HTTP client for an F-UJI server; this version
(v2) is the engine itself.

## 2. The assessment pipeline

A single call to `assess_fair()` runs this pipeline:

```
identifier
   │  id_parse()            scheme detection + normalization + resolver URL
   ▼
resolution                  content-negotiated GET, follow redirects -> landing page
   │  resolve_landing_page()
   ▼
harvesting                  a sequence of collectors, in priority order:
   │   collect_html_meta()      embedded JSON-LD (schema.org), Dublin Core,
   │                            OpenGraph, Highwire meta tags
   │   collect_signposting()    HTTP Link header + <link rel> typed links
   │   collect_datacite()       DataCite JSON via content negotiation
   │   collect_xml()            DataCite XML, Dublin Core, MODS, EML, ISO19139
   │   collect_rdf()            JSON-LD (native) and Turtle/RDF-XML (via rdflib)
   │   collect_github()         GitHub repository + codemeta.json + CITATION.cff
   │   harvest_data()           HEAD on data links for MIME type and size
   ▼
mapping + merging           each source is mapped to one reference schema and
   │  merge_metadata()         merged (first-non-empty for scalars; union for
   │                           lists; longer-but-similar replacement)
   ▼
evaluation                  one evaluator per metric inspects the merged metadata
   │  run_evaluators()         and the resolved identifier, scoring each test
   ▼
scoring                     per-test scores -> per-metric -> F/A/I/R -> overall
   │  get_assessment_summary()
   ▼
fair_assessment             tidy S3 object (print / summary / as.data.frame /
                            as_fuji_json / as_rdf)
```

### Identifier handling

`id_parse()` recognizes DOIs, Handles, ARKs, URNs, UUIDs, `identifiers.org`
PIDs, w3id, and plain URLs, normalizes them, and constructs a resolver URL.
Persistence is inferred from the scheme.

```{r}
parsed <- id_parse("https://doi.org/10.5281/zenodo.8347772")
parsed[c("preferred_schema", "is_persistent", "identifier_url")]
```
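
To make the idea of scheme detection concrete, here is a minimal sketch that classifies an identifier string with regular expressions, trying the most specific pattern first. It is not the code behind `id_parse()`: the function names `detect_scheme_sketch()` and `scheme_is_persistent_sketch()`, the simplified patterns, and the set of schemes treated as persistent are all assumptions made for illustration.

```r
# Hypothetical helper, not rfair's id_parse(): classify an identifier string
# by matching it against a few simplified patterns, most specific first.
detect_scheme_sketch <- function(x) {
  x <- trimws(x)
  patterns <- c(
    doi  = "^(https?://(dx\\.)?doi\\.org/|doi:)?10\\.[0-9]{4,9}/\\S+$",
    uuid = "^(urn:uuid:)?[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    ark  = "^(https?://[^/]+/)?ark:/?[0-9]{5,}/\\S+$",
    urn  = "^urn:[a-z0-9][a-z0-9-]{0,31}:\\S+$",
    url  = "^https?://\\S+$"
  )
  for (scheme in names(patterns)) {
    if (grepl(patterns[[scheme]], x, ignore.case = TRUE, perl = TRUE)) {
      return(scheme)
    }
  }
  NA_character_
}

# Assumption for illustration: only these schemes count as persistent.
scheme_is_persistent_sketch <- function(scheme) {
  scheme %in% c("doi", "ark", "urn")
}

detect_scheme_sketch("https://doi.org/10.5281/zenodo.8347772")
scheme_is_persistent_sketch(detect_scheme_sketch("https://example.org/data/1"))
```

Ordering matters in this sketch: a DOI given as an `https://doi.org/` URL would also match the plain-URL pattern, so the generic pattern is checked last.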

### Harvesting and content negotiation

Different repositories expose metadata in different ways. rfair asks for several
representations of the same object via HTTP **content negotiation** (the `Accept`
header) and scrapes the landing page, then **merges** everything into a single
reference schema (~30 elements: `creator`, `title`, `publisher`,
`publication_date`, `license`, `access_level`, `object_content_identifier`,
`related_resources`, ...). When two sources disagree, scalars keep the first
non-empty value (replaced only by a longer, sufficiently-similar string), and
list-valued elements are unioned.
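
The merge rule can be sketched in a few lines of base R. This is an illustration of the rule as described, not the body of `merge_metadata()`: the helper names, the use of normalised Levenshtein similarity via `utils::adist()`, and the `min_similarity` threshold of 0.6 are assumptions.

```r
# Hypothetical sketch of the merge rule; not rfair's merge_metadata().
# Similarity measure and threshold are assumptions made for illustration.
string_similarity <- function(a, b) {
  # Normalised Levenshtein similarity in [0, 1] using base R's adist().
  1 - as.numeric(utils::adist(a, b)) / max(nchar(a), nchar(b), 1)
}

is_empty <- function(x) {
  is.null(x) || length(x) == 0 || all(is.na(x)) || all(!nzchar(as.character(x)))
}

merge_scalar <- function(current, candidate, min_similarity = 0.6) {
  if (is_empty(candidate)) return(current)
  if (is_empty(current)) return(candidate)   # first non-empty value wins
  longer  <- nchar(candidate) > nchar(current)
  similar <- string_similarity(tolower(current), tolower(candidate)) >= min_similarity
  # Replace only when the later value is longer AND sufficiently similar.
  if (longer && similar) candidate else current
}

merge_list <- function(current, candidate) {
  union(current, candidate)                  # list-valued elements are unioned
}

merge_scalar("Ocean temperature data", "Ocean temperature data, 2010-2020")
merge_scalar("Ocean temperature data", "A completely unrelated and much longer title")
merge_list(c("climate", "ocean"), c("ocean", "temperature"))
```

In this sketch the first call accepts the longer title because it extends the existing one, while the second keeps the original because the candidate is too dissimilar.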

### The metric model

Metrics are data-driven: their definitions, tests, scores, and maturity levels
come from the bundled FAIRsFAIR YAML, not from hard-coded R logic.

```{r}
rfair_metric_versions()      # bundled metric versions
# v0.8 has 17 metrics across F/A/I/R (one row each):
nrow(as.data.frame(assess_fair("https://doi.org/10.5281/zenodo.8347772", resolve = FALSE)))
```

Each metric has one or more **tests**. A test contributes a *score* and a
*maturity* level (a CMMI level 0–3: incomplete, initial, moderate, advanced)
when it passes. Metrics use one of two scoring mechanisms:

* **cumulative** — passed tests' scores add up;
* **alternative** — tests are alternative routes to the same points (the earned
  score is capped at the metric total).
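
The difference between the two mechanisms is easiest to see on a toy metric. The sketch below is not the criterium engine: the function name, the layout of the `tests` data frame, and the rule that a metric's maturity is the highest level among its passed tests are assumptions for illustration.

```r
# Hypothetical sketch of the two scoring mechanisms; not rfair's criterium engine.
# Assumption: `tests` is a data frame with columns `score`, `maturity`, `passed`.
score_metric_sketch <- function(tests, total,
                                mechanism = c("cumulative", "alternative")) {
  mechanism <- match.arg(mechanism)
  passed_sum <- sum(tests$score[tests$passed])
  earned <- switch(
    mechanism,
    cumulative  = passed_sum,             # passed tests' scores add up
    alternative = min(passed_sum, total)  # alternative routes, capped at total
  )
  # Assumption: the metric's maturity is the highest level among passed tests.
  maturity <- if (any(tests$passed)) max(tests$maturity[tests$passed]) else 0
  list(earned = earned, total = total, maturity = maturity)
}

tests <- data.frame(score = c(1, 1), maturity = c(1, 3), passed = c(TRUE, TRUE))
score_metric_sketch(tests, total = 2, mechanism = "cumulative")$earned   # 2
score_metric_sketch(tests, total = 1, mechanism = "alternative")$earned  # 1
```

With both tests passed, the cumulative metric earns both points, whereas the alternative metric earns its single point only once.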

The criterium engine (`criterium_engine.R`) builds each metric's result from the
YAML and lets evaluators mark tests passed; `as_fuji_json()` then emits a payload
matching the upstream F-UJI `FAIRResults` schema.

## 3. What each FAIR category measures (v0.8)

| | metric | what rfair checks |
|---|---|---|
| **F** | F1-01MD | identifier follows a unique scheme (URI/URN/UUID/HASH/PID) |
| | F1-02MD | identifier is persistent and registered (resolves) |
| | F2-01M | core descriptive metadata present (creator, title, id, date, publisher, type, summary, keywords) |
| | F3-01M | metadata links to the downloadable data content |
| | F4-01M | metadata offered in a search-engine-ingestible way (embedded JSON-LD / meta tags) |
| **A** | A1-01M | access level / rights are stated in metadata |
| | A1-02MD | metadata and data are retrievable via their identifiers |
| | A1.1-01MD | identifiers use a standardized communication protocol (http/https/ftp) |
| | A1.2-01MD | the protocol supports authentication where needed |
| **I** | I1-01M | metadata uses a formal, machine-readable representation (JSON-LD/RDF/XML) |
| | I2-01M | metadata uses terms from registered semantic vocabularies |
| | I3-01M | qualified references to related entities (with relation types) |
| **R** | R1-01M | metadata describes the data content (type, format/size) |
| | R1.1-01M | a machine-readable license is present and SPDX/CC-recognized |
| | R1.2-01M | provenance information (creators, dates, contributors) |
| | R1.3-01M | a community-/discipline-endorsed metadata standard is used |
| | R1.3-02D | data is in a recommended (scientific/open/long-term) file format |

A category's score is its earned points divided by its total points, summed
across the category's metrics; the overall FAIR score is the same ratio taken
across all 17 metrics, and the overall maturity is the (clamped) mean of the
per-category maturities.
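
As a worked example of this aggregation, the sketch below computes category and overall percentages from a small table of per-metric results. The input numbers are made up, and two details are assumptions rather than documented behaviour: that the overall score is the ratio of summed earned to summed total points, and that the mean maturity is rounded to a whole level before being clamped to 0 to 3.

```r
# Illustrative per-metric results; the numbers are made up for the example.
metric_scores <- data.frame(
  category = c("F", "F", "A", "I", "R"),
  earned   = c(1, 2, 1, 0, 3),
  total    = c(1, 2, 2, 1, 4)
)

# Category score: earned points over total points, summed across its metrics.
earned_by_cat <- tapply(metric_scores$earned, metric_scores$category, sum)
total_by_cat  <- tapply(metric_scores$total, metric_scores$category, sum)
category_pct  <- 100 * earned_by_cat / total_by_cat

# Overall score: the same ratio over every metric at once (assumption).
overall_pct <- 100 * sum(metric_scores$earned) / sum(metric_scores$total)

# Overall maturity: mean of per-category maturities, clamped to levels 0..3.
# Assumptions: rounding to a whole level; the per-category values are made up.
category_maturity <- c(F = 3, A = 2, I = 0, R = 2)
overall_maturity  <- min(max(round(mean(category_maturity)), 0), 3)

category_pct
overall_pct
overall_maturity
```

Summing points before dividing means that metrics with more points weigh more in the overall score than they would in a plain average of category percentages.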

```{r}
# the canonical principle definitions these metrics map to
fair_principles("I")[, c("id", "definition")]
```

## 4. Software FAIR (FRSM)

For software objects, rfair also bundles the FRSM (FAIR for Research Software)
metric set; select it with `metric_version = "0.7_software"`. The GitHub
harvester inspects the repository file tree for signals (a license file, tests,
CI workflows, dependency manifests, a registry DOI, a release version,
contributors) and the 17 FRSM evaluators score from them. FRSM scoring is
heuristic and not yet validated against an upstream software-FAIR reference.
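
The file-tree signals can be pictured as pattern matches over repository paths. The following sketch is not the GitHub harvester: the function name and the patterns are assumptions, and it covers only the signals that can be read from paths alone (license file, tests, CI workflows, dependency manifests).

```r
# Hypothetical sketch; not rfair's collect_github(). Patterns are assumptions.
# `paths` is a character vector of repository-relative file paths.
detect_repo_signals_sketch <- function(paths) {
  has <- function(pattern) {
    any(grepl(pattern, paths, ignore.case = TRUE, perl = TRUE))
  }
  c(
    license  = has("^(licen[sc]e|copying)(\\.[a-z]+)?$"),
    tests    = has("^tests?/"),
    ci       = has("^\\.github/workflows/.+\\.ya?ml$"),
    manifest = has("^(description|requirements\\.txt|pyproject\\.toml|package\\.json|cargo\\.toml)$")
  )
}

paths <- c("LICENSE", "DESCRIPTION", "R/assess.R",
           "tests/testthat/test-assess.R", ".github/workflows/check.yaml")
detect_repo_signals_sketch(paths)
```

Signals such as a registry DOI, a release version, or contributors are not visible in the file tree and would need other repository metadata.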

## 5. Fidelity to F-UJI

Because rfair reimplements an existing scoring engine, it includes a
non-CRAN conformance harness. `tests/conformance/run.R` runs identifiers through
both rfair and a locally run, version-matched F-UJI server and compares
per-metric earned scores. A manual run on 2026-06-16 against F-UJI 4.0.0
(metrics v0.8) measured **94.1% on a Zenodo DOI (16/17 metrics exact)** and
**85.3%** across PANGAEA and Dryad; the consistent divergence was the data
file-format metric (F-UJI uses Tika content detection where rfair uses an HTTP
HEAD). This reference-server comparison is not reproduced by CI yet. A separate
harness (`tests/conformance/parity.R`) compares the R engine with the browser
TypeScript engine on registry-derivable metrics after the `webapp` branch is
checked out alongside the package.
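
The per-metric comparison amounts to counting exact matches between two named score vectors. The sketch below is not `tests/conformance/run.R`: the function name, the placeholder metric names, and the scores are made up, and it assumes the reported percentage is the share of metrics with identical earned scores. Under that assumption, 16 exact matches out of 17 gives the 94.1% quoted above.

```r
# Hypothetical sketch of a per-metric agreement summary; not the real harness.
# Both inputs are numeric vectors of earned scores named by metric identifier.
compare_scores_sketch <- function(rfair_earned, fuji_earned) {
  common <- intersect(names(rfair_earned), names(fuji_earned))
  exact  <- rfair_earned[common] == fuji_earned[common]
  list(
    n_exact   = sum(exact),
    n_metrics = length(common),
    pct_exact = round(100 * mean(exact), 1),
    diverging = common[!exact]
  )
}

# Made-up scores: 17 metrics, with one metric scored differently.
metric_ids   <- paste0("metric_", 1:17)
rfair_earned <- setNames(rep(1, 17), metric_ids)
fuji_earned  <- rfair_earned
fuji_earned["metric_17"] <- 0

result <- compare_scores_sketch(rfair_earned, fuji_earned)
result$pct_exact
result$diverging
```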

## 6. Beyond F-UJI

rfair adds checks that automated FAIR tools usually miss, motivated by peer
review of a COVID-19 FAIR study: license *reusability* (not just presence) with
the (Re)usable Data Project taxonomy, controlled-access/sensitive-data flagging,
identifier hygiene, and the **FAIR-TLC** (Traceable, Licensed, Connected)
extension. See `vignette("beyond-fuji")`.

## 7. Limitations

* The browser app is registry-only: browser CORS restrictions prevent it from
  harvesting landing pages, so some metrics score lower than in the R engine.
* I2-01M (semantic vocabularies) scores 0 for objects whose metadata uses only
  default namespaces (dc/schema.org/DataCite) — this matches F-UJI.
* RDF Turtle/RDF-XML harvesting and `as_rdf()` Turtle output need the optional
  `rdflib` package (system `librdf`); without it those paths are skipped.
* Live scores depend on the object's current metadata and on third-party
  services (DataCite, Crossref, GitHub) being reachable.
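
To check in advance whether the optional RDF paths are available in a session, a guard like the following can be used. It relies only on base R's `requireNamespace()`; the messages are illustrative and are not output produced by rfair.

```r
# Check for the optional rdflib package without attaching it.
has_rdflib <- requireNamespace("rdflib", quietly = TRUE)

if (has_rdflib) {
  message("rdflib is installed: Turtle/RDF-XML paths should be available.")
} else {
  message("rdflib is not installed: Turtle/RDF-XML paths will be skipped.")
}
```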

## References

* Wilkinson et al. (2016). The FAIR Guiding Principles. *Sci Data*. <https://doi.org/10.1038/sdata.2016.18>
* Devaraju & Huber. F-UJI. <https://github.com/pangaea-data-publisher/fuji>
* FAIRsFAIR metrics. <https://doi.org/10.5281/zenodo.15045911>
* Carbon et al. (2019). (Re)usable data licensing. *PLOS ONE*. <https://doi.org/10.1371/journal.pone.0213090>