This document walks through an example session that uses supervised classification in the package RecordLinkage to deduplicate a single data set. Linking two data sets differs only in the step of generating record pairs. See also the vignette on Fellegi-Sunter deduplication for some general information on using the package.

Generating comparison patterns

In this session, a training set with 500 matches and 500 non-matches is generated from the included data set RLdata10000. Record pairs from the data set RLdata500 are then used to evaluate the calibrated classifiers.

library(RecordLinkage)
data(RLdata500)
data(RLdata10000)
train_pairs <- compare.dedup(RLdata10000, identity = identity.RLdata10000,
                             n_match = 500, n_non_match = 500)
eval_pairs <- compare.dedup(RLdata500, identity = identity.RLdata500)
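The arguments n_match and n_non_match matter because the number of candidate pairs grows quadratically with the number of records. The following base R sketch illustrates the arithmetic; the helper n_pairs is hypothetical and not part of RecordLinkage, and the figure for RLdata10000 assumes that the data set holds 10,000 records, as its name suggests.

```r
# Number of unordered record pairs among n records: n * (n - 1) / 2.
# n_pairs is an illustrative helper, not a RecordLinkage function.
n_pairs <- function(n) n * (n - 1) / 2

n_pairs(500)    # 124750, the pair count reported for RLdata500 in the results
n_pairs(10000)  # 49995000, assuming RLdata10000 holds 10,000 records
```

Comparing every pair is feasible for the small evaluation set, whereas a sample of labelled pairs keeps the training set at a manageable size.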

Training

trainSupv calibrates supervised classifiers, which are selected through the argument method. In the following, a single decision tree (rpart), a bootstrap aggregation of decision trees (bagging), and a support vector machine (svm) are calibrated.

model_rpart <- trainSupv(train_pairs, method = "rpart")
model_bagging <- trainSupv(train_pairs, method = "bagging")
model_svm <- trainSupv(train_pairs, method = "svm")
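To give a feel for what a decision tree can learn from comparison patterns, the following sketch fits rpart directly on a small made-up table. It is not meant to reproduce what trainSupv does internally: the column names, the data and the labelling rule are all hypothetical and chosen so that the classes are perfectly separable.

```r
library(rpart)

# Hypothetical toy comparison patterns: 1 = fields agree, 0 = fields disagree.
# Column names and values are illustrative, not taken from RLdata10000.
n <- 100
toy <- data.frame(
  fname = rep(c(1, 0), each = n / 2),
  lname = rep(c(1, 0), each = n / 2),
  by    = rep(c(0, 1), times = n / 2)
)

# Assumed labelling rule: a pair is a match when both name fields agree.
toy$is_match <- factor(toy$fname == 1 & toy$lname == 1,
                       levels = c(FALSE, TRUE))

# Fit a classification tree and predict the status of the same pairs.
fit  <- rpart(is_match ~ ., data = toy, method = "class")
pred <- predict(fit, newdata = toy, type = "class")

# Cross-tabulate true status against predicted status.
table(truth = toy$is_match, predicted = pred)
```

Real comparison patterns are rarely this clean, which is why the classifiers below are evaluated on a separate set of record pairs.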

Classification

classifySupv handles classification for all supervised classifiers. It takes two arguments: the structure returned by trainSupv, which contains the classification model, and the set of record pairs to classify.

result_rpart <- classifySupv(model_rpart, eval_pairs)
result_bagging <- classifySupv(model_bagging, eval_pairs)
result_svm <- classifySupv(model_svm, eval_pairs)

Results

Rpart

## 
## Deduplication Data Set
## 
## 500 records 
## 124750 record pairs 
## 
## 50 matches
## 124700 non-matches
## 0 pairs with unknown status
## 
## 
## 2709 links detected 
## 0 possible links detected 
## 122041 non-links detected 
## 
## alpha error: 0.000000
## beta error: 0.021323
## accuracy: 0.978685
## 
## 
## Classification table:
## 
##            classification
## true status      N      P      L
##       FALSE 122041      0   2659
##       TRUE       0      0     50
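The error measures printed above can be recomputed from the classification table. The sketch below uses base R only; the formulas are inferred from the printed numbers (alpha as the share of true matches classified as non-links, beta as the share of true non-matches classified as links), so treat them as an assumption rather than as the package's definition. No pair is classified as a possible link here, so how that column enters the measures is left open.

```r
# Classification table for rpart, copied from the summary above.
# Rows: true status; columns: N (non-link), P (possible link), L (link).
tab <- matrix(c(122041, 0, 2659,
                     0, 0,   50),
              nrow = 2, byrow = TRUE,
              dimnames = list(true_status    = c("FALSE", "TRUE"),
                              classification = c("N", "P", "L")))

# Assumed definitions, consistent with the printed values.
alpha    <- tab["TRUE", "N"]  / sum(tab["TRUE", ])   # missed matches
beta     <- tab["FALSE", "L"] / sum(tab["FALSE", ])  # false links
accuracy <- (tab["FALSE", "N"] + tab["TRUE", "L"]) / sum(tab)

round(c(alpha = alpha, beta = beta, accuracy = accuracy), 6)
```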

Bagging

## 
## Deduplication Data Set
## 
## 500 records 
## 124750 record pairs 
## 
## 50 matches
## 124700 non-matches
## 0 pairs with unknown status
## 
## 
## 210 links detected 
## 0 possible links detected 
## 124540 non-links detected 
## 
## alpha error: 0.000000
## beta error: 0.001283
## accuracy: 0.998717
## 
## 
## Classification table:
## 
##            classification
## true status      N      P      L
##       FALSE 124540      0    160
##       TRUE       0      0     50

SVM

## 
## Deduplication Data Set
## 
## 500 records 
## 124750 record pairs 
## 
## 50 matches
## 124700 non-matches
## 0 pairs with unknown status
## 
## 
## 657 links detected 
## 0 possible links detected 
## 124093 non-links detected 
## 
## alpha error: 0.000000
## beta error: 0.004868
## accuracy: 0.995134
## 
## 
## Classification table:
## 
##            classification
## true status      N      P      L
##       FALSE 124093      0    607
##       TRUE       0      0     50
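All three classifiers find the 50 true matches and reach an accuracy above 0.97, because the 124700 non-matches dominate that measure. A quantity that separates them more clearly is the share of detected links that are true matches, often called precision. It is not part of the summaries above; the following base R sketch derives it from the printed counts.

```r
# Links detected and true matches among them, copied from the three summaries.
links_detected <- c(rpart = 2709, bagging = 210, svm = 657)
true_links     <- c(rpart = 50,   bagging = 50,  svm = 50)

# Precision: share of detected links that are true matches.
precision <- true_links / links_detected
round(precision, 3)
```

On this evaluation set, roughly one in four links proposed by bagging is a true match, compared with fewer than one in ten for the support vector machine and fewer than one in fifty for the single decision tree.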