samesies compares lists of texts, factors, or numerical values to measure their similarity. The motivating use case is evaluating the similarity of large language model (LLM) responses across models, providers, or prompts, a strategy often referred to as LLM-as-a-judge.
You can install samesies from CRAN with:
install.packages("samesies")
samesies provides three main functions for measuring similarity:
same_text()
Compare similarity between multiple lists of character strings.
library(samesies)
r1 <- list("R is a statistical computing software",
           "R enables grammar of graphics using ggplot2",
           "R supports advanced statistical models")
r2 <- list("R is a full-stack programming language",
           "R enables advanced data visualizations",
           "R supports machine learning algorithms")
tex <- same_text(r1, r2)
Methods available via stringdist (e.g., method = "osa"):
Transformational Algorithms
Structural Comparison
Linguistic Matching
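To make the idea of an edit-distance similarity concrete, here is a minimal base R sketch. It is not the samesies or stringdist implementation: `edit_similarity()` is a hypothetical helper, and it uses `utils::adist()` (generalized Levenshtein distance) rather than the "osa" method, rescaling the distance by the longer string's length so that 1 means identical.

```r
# Hypothetical helper (an assumption for illustration, not part of samesies):
# turn a Levenshtein distance into a similarity between 0 and 1.
edit_similarity <- function(a, b) {
  a <- unlist(a)
  b <- unlist(b)
  # Element-wise edit distance between the i-th strings of a and b
  d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b)
  longest <- pmax(nchar(a), nchar(b))
  # Two empty strings are treated as identical
  sim <- ifelse(longest == 0, 1, 1 - d / longest)
  unname(sim)
}

edit_similarity(list("kitten", "ggplot2"), list("sitting", "ggplot2"))
```

The first pair needs three edits over seven characters, so its similarity is 1 - 3/7; the second pair is identical and scores 1.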
same_factor()
Compare similarity between multiple lists of categorical data.
cats1 <- list("R", "R", "Python")
cats2 <- list("R", "Python", "R")
fct <- same_factor(cats1, cats2,
                   levels = c("R", "Python"))
Methods available (e.g., method = "exact"):
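As a rough illustration of what an exact-match comparison of categorical values might look like, the base R sketch below scores each position 1 when the two categories agree and 0 otherwise. `exact_match()` is a hypothetical helper, and the scoring rule is an assumption rather than the documented behavior of `same_factor()`.

```r
# Hypothetical helper (an assumption, not the samesies implementation):
# element-wise exact agreement between two lists of categories.
exact_match <- function(a, b, levels) {
  # Values outside `levels` become NA, so they never count as a match
  fa <- factor(unlist(a), levels = levels)
  fb <- factor(unlist(b), levels = levels)
  as.numeric(as.character(fa) == as.character(fb))
}

scores <- exact_match(list("R", "R", "Python"),
                      list("R", "Python", "R"),
                      levels = c("R", "Python"))
mean(scores)  # share of positions that agree
```

Only the first position agrees in this example, so the per-position scores are 1, 0, 0.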
same_number()
Compare similarity between multiple lists of numeric values.
n1 <- list(1, 2, 3)
n2 <- list(1, 2.1, 3.2)
num <- same_number(n1, n2)
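Methods available (e.g., method = "exact"):
Some methods accept additional arguments. With method = "normalized" you can set max_diff (it is computed automatically by default), and with method = "fuzzy" you can set epsilon and epsilon_pct:
num <- same_number(n1, n2,
                   method = "normalized",
                   max_diff = 2.2)
num <- same_number(n1, n2,
                   method = "fuzzy",
                   epsilon = 0.05,
                   epsilon_pct = 0.02)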
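The base R sketch below shows one plausible reading of these two options. Everything in it is an assumption: the helper names are hypothetical, the normalized score is taken to be one minus the absolute difference divided by max_diff, the default max_diff is taken to be the largest observed difference, and the fuzzy tolerance is taken to be the larger of epsilon and epsilon_pct times the larger magnitude. The formulas samesies actually uses may differ.

```r
# Hypothetical helpers (assumptions for illustration, not the samesies code).

# Normalized: scale the absolute difference by max_diff, clipped at 0.
normalized_similarity <- function(a, b, max_diff = NULL) {
  a <- unlist(a)
  b <- unlist(b)
  diffs <- abs(a - b)
  # Assumed default: the largest observed difference
  if (is.null(max_diff)) max_diff <- max(diffs)
  if (max_diff == 0) return(rep(1, length(diffs)))
  pmax(0, 1 - diffs / max_diff)
}

# Fuzzy: score 1 when values fall within an absolute or relative tolerance.
fuzzy_similarity <- function(a, b, epsilon = 0.05, epsilon_pct = 0.02) {
  a <- unlist(a)
  b <- unlist(b)
  tolerance <- pmax(epsilon, epsilon_pct * pmax(abs(a), abs(b)))
  as.numeric(abs(a - b) <= tolerance)
}

n1 <- list(1, 2, 3)
n2 <- list(1, 2.1, 3.2)
normalized_similarity(n1, n2, max_diff = 2.2)
fuzzy_similarity(n1, n2, epsilon = 0.05, epsilon_pct = 0.02)
```

Under these assumptions, a relative tolerance lets large values differ by more in absolute terms than small ones, while the absolute tolerance keeps values near zero from being held to an impossibly tight bound.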
When you input more than two lists, samesies computes pairwise comparisons across lists.
Nested lists are supported as long as they share the same names and lengths.
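A short base R sketch of how pairwise comparison across several named lists can be enumerated. The list names, the `model_a_model_b` label format, and the exact-match score are all hypothetical choices for illustration; they are not taken from the samesies source.

```r
# Three hypothetical response lists of equal length
responses <- list(
  model_a = list("R is a language", "R makes plots"),
  model_b = list("R is a language", "R draws plots"),
  model_c = list("R is software", "R makes plots")
)

# Every unordered pair of list names: (a, b), (a, c), (b, c)
pairs <- utils::combn(names(responses), 2, simplify = FALSE)
pair_labels <- vapply(pairs, paste, character(1), collapse = "_")

# Share of positions where the two lists hold identical strings
exact_share <- vapply(pairs, function(p) {
  mean(unlist(responses[[p[1]]]) == unlist(responses[[p[2]]]))
}, numeric(1))
names(exact_share) <- pair_labels
exact_share
```

With k input lists this yields k * (k - 1) / 2 comparison pairs, so three lists produce three pairs.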
All three functions return similar objects that support the following methods:
print(x)
summary(x)
average_similarity(x, method = NULL)
pair_averages(x, method = NULL)
The package uses S3 objects, allowing access to the underlying data:
$scores: A list of similarity scores for each method and comparison pair
$summary: A list of statistical summaries for each method and comparison pair
$methods: The similarity methods used in the analysis
$list_names: Names of the input lists
$raw_values: The original input values
$digits: Number of decimal places for rounding results in output
The Spiderman image in the hex logo is fan art created by the Reddit user WistlerR15.