This document walks through an example session that uses the supervised classification methods of the package RecordLinkage to deduplicate a single data set. Linking two data sets differs only in the step of generating record pairs. See also the vignette on Fellegi-Sunter deduplication for general information on using the package.
In this session, a training set with 500 matches and 500 non-matches
is generated from the included data set RLdata10000 and used to
calibrate the classifiers. Record pairs from the set RLdata500 are
subsequently used to evaluate them.
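library(RecordLinkage)
data(RLdata500)
data(RLdata10000)
# Training pairs: a sample of 500 matches and 500 non-matches.
train_pairs <- compare.dedup(RLdata10000, identity = identity.RLdata10000,
                             n_match = 500, n_non_match = 500)
# Evaluation pairs: every record pair in RLdata500.
eval_pairs <- compare.dedup(RLdata500, identity = identity.RLdata500)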
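The sizes of the two pair sets follow from simple counting, which the short base R sketch below works through. It assumes that no blocking is applied, so every unordered pair of records is compared, and that RLdata10000 holds 10000 records. The helper names dedup_pairs and linkage_pairs are illustrative and not part of the package.

```r
# Unordered record pairs when deduplicating n records: n * (n - 1) / 2.
dedup_pairs <- function(n) choose(n, 2)

# Record pairs when linking two data sets of sizes n1 and n2: n1 * n2.
linkage_pairs <- function(n1, n2) n1 * n2

pairs_500 <- dedup_pairs(500)      # evaluation set built from RLdata500
pairs_10000 <- dedup_pairs(10000)  # full comparison of RLdata10000

# Training set size requested through n_match and n_non_match.
train_size <- 500 + 500

cat(pairs_500, "pairs for 500 records\n")
cat(format(pairs_10000, big.mark = ","), "pairs for 10000 records\n")
cat(train_size, "sampled training pairs\n")
```

A full comparison of the larger data set would yield close to 50 million pairs, which is a plausible reason for drawing a small sample of matches and non-matches for training instead.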
trainSupv calibrates supervised classifiers, which are selected
through the argument method. In the following, a single decision
tree (rpart), a bootstrap aggregation of decision trees (bagging)
and a support vector machine (svm) are calibrated.
model_rpart <- trainSupv(train_pairs, method = "rpart")
model_bagging <- trainSupv(train_pairs, method = "bagging")
model_svm <- trainSupv(train_pairs, method = "svm")
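Because only the method argument changes between the three calls, the calibration step can also be written as a loop. The following is a minimal sketch under that assumption, not the vignette's own code. The variable names methods and models are illustrative, and the requireNamespace guard is added here so that the block is skipped when the package is not installed.

```r
# Methods named in the text; each is passed to trainSupv() unchanged.
methods <- c("rpart", "bagging", "svm")

# Guard so the sketch is skipped cleanly when the package is missing.
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  library(RecordLinkage)
  data(RLdata10000)

  # Same training pairs as in the session above.
  train_pairs <- compare.dedup(RLdata10000, identity = identity.RLdata10000,
                               n_match = 500, n_non_match = 500)

  # One calibrated model per method, stored in a list named by method.
  models <- lapply(setNames(methods, methods), function(m) {
    trainSupv(train_pairs, method = m)
  })
  print(names(models))
}
```

Keeping the models in a named list makes it straightforward to apply the same classification step to each of them later, for example with lapply.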
classifySupv performs classification for all supervised
classifiers. It takes as arguments the structure returned by
trainSupv, which contains the classification model, and the
set of record pairs to classify.
result_rpart <- classifySupv(model_rpart, eval_pairs)
result_bagging <- classifySupv(model_bagging, eval_pairs)
result_svm <- classifySupv(model_svm, eval_pairs)
summary(result_rpart)
summary(result_bagging)
summary(result_svm)
##
## Deduplication Data Set
##
## 500 records
## 124750 record pairs
##
## 50 matches
## 124700 non-matches
## 0 pairs with unknown status
##
##
## 2709 links detected
## 0 possible links detected
## 122041 non-links detected
##
## alpha error: 0.000000
## beta error: 0.021323
## accuracy: 0.978685
##
##
## Classification table:
##
##            classification
## true status      N      P      L
##       FALSE 122041      0   2659
##       TRUE       0      0     50
##
## Deduplication Data Set
##
## 500 records
## 124750 record pairs
##
## 50 matches
## 124700 non-matches
## 0 pairs with unknown status
##
##
## 210 links detected
## 0 possible links detected
## 124540 non-links detected
##
## alpha error: 0.000000
## beta error: 0.001283
## accuracy: 0.998717
##
##
## Classification table:
##
##            classification
## true status      N      P      L
##       FALSE 124540      0    160
##       TRUE       0      0     50
##
## Deduplication Data Set
##
## 500 records
## 124750 record pairs
##
## 50 matches
## 124700 non-matches
## 0 pairs with unknown status
##
##
## 657 links detected
## 0 possible links detected
## 124093 non-links detected
##
## alpha error: 0.000000
## beta error: 0.004868
## accuracy: 0.995134
##
##
## Classification table:
##
##            classification
## true status      N      P      L
##       FALSE 124093      0    607
##       TRUE       0      0     50
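The error measures printed above can be recomputed from the classification tables, which makes their definitions explicit. The base R sketch below assumes that alpha error is the share of true matches classified as non-links, beta error is the share of true non-matches classified as links, and accuracy is the share of correctly classified pairs. These definitions reproduce the printed figures. The counts are copied from the three tables in the order they appear, and precision is added here as an extra measure that the summaries do not report.

```r
# Counts copied from the three classification tables above, in order.
# TN: non-matches classified N, FP: non-matches classified L,
# FN: matches classified N,     TP: matches classified L.
tables <- data.frame(
  run = c("first", "second", "third"),
  TN  = c(122041, 124540, 124093),
  FP  = c(2659, 160, 607),
  FN  = c(0, 0, 0),
  TP  = c(50, 50, 50)
)

# Every row should add up to the number of record pairs.
total <- with(tables, TN + FP + FN + TP)

measures <- data.frame(
  run       = tables$run,
  alpha     = with(tables, FN / (FN + TP)),   # missed matches
  beta      = with(tables, FP / (FP + TN)),   # false links
  accuracy  = with(tables, (TP + TN) / total),
  precision = with(tables, TP / (TP + FP))    # not part of the summary output
)

print(cbind(run = measures$run, round(measures[-1], 6)))
```

Since only 50 of the 124750 pairs are true matches, accuracy stays high even when most detected links are false, so precision is a useful complement when comparing the three results.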