A Literature Review of Database Record Linkage Methods: Applications and Perspectives for Public Health Data
Journal de la société française de statistique, Tome 159 (2018) no. 3, pp. 79-123.

Record linkage is a major issue in public health, particularly now that administrative databases and cohorts are multiplying (Loth, 2015). The procedure consists of matching information about the same individual held in different databases when no unique identifier can be used. In France, for medical and administrative data, the Numéro d’Identification au Répertoire (NIR) is one identifier that could serve as a linkage key. However, despite the law of 26 January 2016 on the modernization of the French health system, the NIR will remain difficult to access because it is a direct identifier shared by many databases. We present the linkage methods available to researchers, focusing on the generative model of Fellegi and Sunter, which is an unsupervised approach, and on a few methods drawn from statistical learning. Finally, we briefly present different approaches for carrying out a statistical analysis on linked data and for propagating the uncertainty of the linkage into that analysis.

Record linkage has become a powerful tool for public health with the rise of medical and administrative databases and cohorts (Loth, 2015). This process matches information about an individual obtained from different databases that do not necessarily share a common identifier. Furthermore, even when such a common identifier exists, obtaining the necessary approval to use it can take a long time. In France, the NIR is the identifier most likely to serve as an identifier at the national level. However, using the NIR still requires authorization from the CNIL, even after the change in the law on the modernization of the French healthcare system. This paper presents a broad set of methods for performing record linkage, in particular the method proposed by Fellegi and Sunter and its extensions. The aim is to give researchers some guidelines and to introduce approaches for incorporating the uncertainty associated with the linkage into their analyses.
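As a rough illustration of the Fellegi and Sunter approach named in the abstract, the following Python sketch scores a pair of records by summing per-field log-likelihood ratios and then applies a two-threshold decision rule. The m and u probabilities, the field names, the thresholds and the sample records are hypothetical values chosen for the example, not figures from the paper; in practice the probabilities are usually estimated from the data, for instance with the EM algorithm.

```python
import math

# Hypothetical m/u probabilities, chosen for illustration only:
#   m = P(field agrees | the two records belong to the same individual)
#   u = P(field agrees | the two records belong to different individuals)
M_U = {
    "surname":    (0.95, 0.01),
    "birth_date": (0.98, 0.003),
    "sex":        (0.99, 0.5),
}

def compare(rec_a, rec_b):
    """Build the binary agreement pattern for a pair of records."""
    return {field: rec_a[field] == rec_b[field] for field in M_U}

def match_weight(agreement):
    """Sum the log2 likelihood ratios over all compared fields."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_U[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

def classify(weight, lower=0.0, upper=10.0):
    """Two-threshold rule: link, non-link, or possible link (manual review)."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"

# Hypothetical records without any shared unique identifier.
rec_a = {"surname": "martin", "birth_date": "1980-02-14", "sex": "F"}
rec_b = {"surname": "martin", "birth_date": "1980-02-14", "sex": "F"}
rec_c = {"surname": "durand", "birth_date": "1975-07-01", "sex": "F"}

for other in (rec_b, rec_c):
    w = match_weight(compare(rec_a, other))
    print(other["surname"], round(w, 2), classify(w))
```

Agreement on a rarely shared field such as the birth date contributes a large positive weight, whereas agreement on sex contributes little because unrelated records agree on it about half the time. Pairs whose weight falls between the two thresholds are the ones whose linkage uncertainty would need to be carried into any later analysis.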

Keywords: record linkage, indirect matching, medical and administrative healthcare databases, naive Bayes network, mixed model
@article{JSFS_2018__159_3_79_0,
     author = {Bounebache, Said Karim and Quantin, Catherine and Benzenine, \'Eric and Obozinski, Guillaume and Rey, Gr\'egoire},
     title = {Revue {Bibliographique} des {M\'ethodes} de  {Couplage} des {Bases} de {Donn\'ees~:} {Applications} et {Perspectives} dans le  {Cas} des {Donn\'ees} de {Sant\'e} {Publique}},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {79--123},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {159},
     number = {3},
     year = {2018},
     mrnumber = {3901137},
     zbl = {1411.62313},
     language = {fr},
     url = {http://www.numdam.org/item/JSFS_2018__159_3_79_0/}
}
Bounebache, Said Karim; Quantin, Catherine; Benzenine, Éric; Obozinski, Guillaume; Rey, Grégoire. Revue Bibliographique des Méthodes de Couplage des Bases de Données : Applications et Perspectives dans le Cas des Données de Santé Publique. Journal de la société française de statistique, Tome 159 (2018) no. 3, pp. 79-123. http://www.numdam.org/item/JSFS_2018__159_3_79_0/

[1] Ananthakrishna, R.; Chaudhuri, S.; Ganti, V. Eliminating fuzzy duplicates in data warehouses, VLDB ’02 Proceedings of the 28th international conference on Very Large Data Bases (2002)

[2] Agrawal, R.; Srikant, R. Searching with Numbers, Proceedings of the 11th International Conference on World Wide Web (WWW ’02), ACM (2002), pp. 420-431

[3] Box, G.E.P.; Cox, D.R. An Analysis of Transformations, Journal of the Royal Statistical Society. Series B (Methodological), Volume 26 (1964) no. 2, pp. 211-252 | MR

[4] Baxter, R.; Christen, P.; Churches, T. A Comparison of Fast Blocking Methods for Record Linkage, ACM SIGKDD ’03 Workshop on Data Cleaning, Record Linkage and Object Consolidation, 2003

[5] Bilenko, M.; Cohen, W.; Fienberg, S.; Mooney, R.; Ravikumar, P. Adaptive Name-Matching in Information Integration, IEEE Intelligent Systems, Volume 18 (2003), pp. 16-23

[6] Belin, T. A Proposed Improvement in Computer Matching, Statistics of Income and Related Administrative Record Research (1990), pp. 167-172

[7] Bohensky, M.A.; Jolley, D.; Sundararajan, V.; Evans, S.; Pilcher, D.V.; Scott, I.; Brand, C.A. Data Linkage: A powerful research tool with potential problems, BMC Health Services Research, Volume 10 (2010) | DOI

[8] Belin, T.R.; Rubin, D.B. A Method for Calibrating False-Match Rates in Record Linkage, Journal of the American Statistical Association, Volume 90 (1995) no. 430, pp. 694-707 | Zbl

[9] Borg, A.; Sariyar, M. RecordLinkage: Record Linkage in R (2016) (R package version 0.4-10, https://CRAN.R-project.org/package=RecordLinkage)

[10] Chaudhuri, S.; Ganti, V.; Motwani, R. Robust Identification of Fuzzy Duplicates, Proceedings of the 21st International Conference on Data Engineering, IEEE Computer Society, Washington, DC, USA (2005), pp. 865-876

[11] Copas, J.B.; Hilton, F.J. Record Linkage: Statistical Models for Matching Computer Records, Journal of the Royal Statistical Society. Series A (Statistics in Society), Volume 153 (1990) no. 3, pp. 287-320

[12] Chambers, R. Regression analysis of probability-linked data, Official Statistics Research Series, Volume 4 (2009)

[13] Christen, P. A Comparison of Personal Name Matching: Techniques and Practical Issues, Sixth IEEE International Conference on Data Mining - Workshops (ICDMW’06) (2006), pp. 290-294

[14] Christen, P. A Two-step Classification Approach to Unsupervised Record Linkage, Proceedings of the Sixth Australasian Conference on Data Mining and Analytics - Volume 70 (AusDM ’07), Australian Computer Society, Inc. (2007), pp. 111-119

[15] Christen, P. Automatic Training Example Selection for Scalable Unsupervised Record Linkage, Proceedings of the 12th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD’08), Springer-Verlag (2008), pp. 511-518

[16] Christen, P. Febrl: An Open Source Data Cleaning, Deduplication and Record Linkage System with a Graphical User Interface, Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM (2008), pp. 1065-1068

[17] Christen, P. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Springer-Verlag Berlin and Heidelberg GmbH & Co. K, 2012

[18] Cochinwala, M.; Kurien, V.; Lalk, G.; Shasha, D. Efficient Data Reconciliation, Information Sciences, Volume 137 (2001) no. 1, pp. 1-15 | Zbl

[19] Cornuéjols, A.; Miclet, L. Apprentissage Artificiel : Concept et Algorithmes, Algorithme, Eyrolles, 2010

[20] Cohen, W.W. Data Integration Using Similarity Joins and a Word-based Information Representation Language, ACM Transactions on Information Systems, Volume 18 (2000) no. 3

[21] Cohen, William W. Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity (1998), pp. 201-212

[22] Cheeseman, P.; Stutz, J. Bayesian Classification (AutoClass): Theory and Results, Advances in Knowledge Discovery and Data Mining (1997)

[23] Dua, S.; Chowriappa, P. Data Mining for Bioinformatics, CRC Press, 2012 | Zbl

[24] Domingo-Ferrer, J.; Torra, V. Validating distance-based record linkage with probabilistic record linkage, Topics in Artificial Intelligence, Springer, 2002, pp. 207-215 | Zbl

[25] Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B, Volume 39 (1977) no. 1, pp. 1-38 | MR | Zbl

[26] Duflo, M. Algorithmes stochastiques, Springer, 1997 | MR | Zbl

[27] Elmagarmid, A.K.; Ipeirotis, P.G.; Verykios, V.S. Duplicate record detection: A survey, IEEE Transactions on Knowledge and Data Engineering, Volume 19 (2007) no. 1

[28] Elfeky, M.G.; Verykios, V.S.; Elmagarmid, A.K. TAILOR: a record linkage toolbox, Proceedings 18th International Conference on Data Engineering (2002), pp. 17-28

[29] Fortini, M.; Liseo, B.; Nuccitelli, A.; Scanu, M. On Bayesian Record Linkage, Research in Official Statistics, Volume 4 (2001) no. 1

[30] Foulley, J-L. Algorithme "EM" : Théorie et Application au Modèle Mixte, Journal de la Société Française de Statistique, Volume 143 (2002) no. 3-4, pp. 57-109

[31] Ford, J.B.; Roberts, C.L.; Taylor, L.K. Characteristics of unmatched maternal and baby records in linked birth records and hospital discharge data, Paediatric and Perinatal Epidemiology, Volume 20 (2006) no. 4, pp. 329-337

[32] Fellegi, I.P.; Sunter, A.B. A Theory for Record Linkage, Journal of the American Statistical Association, Volume 64 (1969) no. 328, pp. 1183-1210

[33] Fournel, I.; Schwarzinger, M.; Binquet, C.; Benzenine, E.; Hill, C.; Quantin, C. Contribution of record linkage to vital status determination in cancer patients, Studies in Health Technology and Informatics, Volume 150 (2009), pp. 91-95

[34] Goldstein, H.; Carpenter, J.; Kenward, M.G.; Levin, K.A. Multilevel models with multivariate mixed response types, Statistical Modelling, Volume 9 (2009) no. 3, pp. 173-197 | MR | Zbl

[35] Goldstein, H.; Harron, K.; Cortina-Borja, M. A scaling approach to record linkage, Statistics in Medicine, Volume 36 (2017) no. 16, pp. 2514-2521 | MR

[36] Goldstein, H.; Harron, K.; Wade, A. The analysis of record-linked data using multiple imputation with data value priors, Statistics in Medicine, Volume 31 (2012) no. 28, pp. 3481-3493 | MR

[38] Guha, S.; Koudas, N.; Marathe, A.; Srivastava, D. Merging the Results of Approximate Match Operations, Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB Endowment (2004), pp. 636-647

[39] Haberman, S.J. Analysis of qualitative data. vol. 2, new developments, Academic Press Inc., 1979

[40] Harron, K.L.; Doidge, J.C.; Knight, H.E.; Gilbert, R.E.; Goldstein, H.; Cromwell, D.A.; van der Meulen, J.H. A guide to evaluating linkage quality for the analysis of linked data, International Journal of Epidemiology, Volume 46 (2017) no. 5, pp. 1699-1710

[41] Hernández, M.A.; Stolfo, S.J. Real-World Data is Dirty: Data Cleansing and The Merge/Purge Problem, Data Mining and Knowledge Discovery, Volume 2 (1998) no. 1, pp. 9-37

[42] Herzog, Thomas N.; Scheuren, Fritz J.; Winkler, William E. Data Quality and Record Linkage Techniques, Springer Publishing Company, Incorporated, 2007 | Zbl

[43] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome The Elements of Statistical Learning, Springer, 2001 | Zbl

[44] Harron, K.; Wade, A.; Gilbert, R.; Muller-Pebody, B.; Goldstein, H. Evaluating bias due to data linkage error in electronic healthcare records, BMC Medical Research Methodology, Volume 14 (2014) | DOI

[45] Hof, M.H.P.; Zwinderman, A.H. Methods for analyzing data from probabilistic linkage strategies based on partially identifying variables, Statistics in Medicine, Volume 31 (2012) no. 30, pp. 4231-4242 | MR

[46] Hof, M.H.P.; Zwinderman, A.H. A mixture model for the analysis of data derived from record linkage, Statistics in Medicine, Volume 34 (2015) no. 1, pp. 74-92 | MR

[47] Izenman, A.J. Modern Multivariate Statistical Techniques: Regression, Classification and Manifold Learning, Springer Texts in Statistics, Springer, 2008 | MR | Zbl

[48] Jaro, M. A. Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida, Journal of the American Statistical Association, Volume 84 (1989) no. 406

[49] Jaro, M.A. Probabilistic linkage of large public health data files, Statistics in Medicine, Volume 14 (1995) no. 5, pp. 491-498

[50] Jain, S.; Neal, R.M. A Split-Merge Markov Chain Monte Carlo Procedure for the Dirichlet Process Mixture Model, Journal of Computational and Graphical Statistics, Volume 13 (2004) no. 1, pp. 158-182 | MR

[51] Jaro, Matthew A. UNIMATCH: a record linkage system: users manual, United States Bureau of the Census, Washington, 1978, 275 pages

[52] Jurczyk, P. FRIL: Fine-grained Record Integration and Linkage Tool Tutorial (2009) (http://fril.sourceforge.net/FRIL-Tutorial-3.2.pdf)

[53] Kim, G.; Chambers, R. Regression analysis under incomplete linkage, Computational Statistics & Data Analysis, Volume 56 (2012) no. 9, pp. 2756-2770 | MR | Zbl

[54] Kim, G.; Chambers, R. Regression Analysis under Probabilistic Multi-Linkage, Statistica Neerlandica, Volume 66 (2012) no. 1, pp. 64-79 | DOI | MR

[55] Kelley, P. Robustness of the Census Bureau’s record linkage system, Proceedings of the Section on Survey Research Methods, American Statistical Association (1986), pp. 620-624

[56] Larsen, M.D. Advances in Record Linkage Theory: Hierarchical Bayesian Record Linkage Theory, ASA Section on Survey Research Methods, 2005

[57] Lash, Timothy L.; Fox, Matthew P.; Fink, Aliza K. Applying Quantitative Bias Analysis to Epidemiologic Data, Springer Publishing Company, Incorporated, 2009 | Zbl

[58] Lahiri, P.; Larsen, M.D. Regression Analysis with Linked Data, Journal of the American Statistical Association, Volume 100 (2005) no. 469, pp. 222-230 | MR | Zbl

[59] Loth, A. Données de santé : Anonymat et risque de ré-identification (2015) no. 64 (Dossiers Solidarité Santé)

[60] Larsen, M.D.; Rubin, D.B. Iterative Automated Record Linkage Using Mixture Models, Journal of the American Statistical Association, Volume 96 (2001) no. 453, pp. 32-41 | MR

[61] Legleye, S.; Richard, J-B.; Rey, G.; Beck, F.; Grieve, M. Testing the Acceptability of Asking Respondents for Identifying Information in a Cross-Sectional Survey of the General Population, Population, English edition, Volume 72 (2017) no. 4, pp. 697-713 (Accessed 2018-04-03)

[62] Lim, E.P.; Srivastava, J.; Prabhakar, S.; Richardson, J. Entity Identification in Database Integration, Informatics and computer science, Volume 89 (1996) no. 1

[63] Lamarche-Vadel, A.; Jougla, E.; Rey, G. Base AMPHI : Base de données pour l’Analyse de la Mortalité Post-Hospitalisation en France en 2008-2010, Université Paris-Sud / Inserm-CépiDC (2013) (Ph. D. Thesis)

[64] McGlincy, M.H. A Bayesian Record Linkage Methodology for Multiple Imputation of Missing Links, 2004

[65] Monge, A.E.; Elkan, C.P. The Field Matching Problem: Algorithms and Applications, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, AAAI Press (1996), pp. 267-270

[66] Monge, A.E.; Elkan, C.P. An efficient domain-independent algorithm for detecting approximately duplicate database records, 1997

[67] McLachlan, G.; Krishnan, T. The EM Algorithm and Extensions, Wiley-Blackwell, 2008 | MR

[68] Meng, X.L.; Rubin, D.B. Maximum Likelihood Estimation via the ECM Algorithm: A General Framework, Biometrika, Volume 80 (1993) no. 2, pp. 267-278 | MR | Zbl

[69] Meng, X.L.; Van Dyk, D. The EM Algorithm-an Old Folk-song Sung to a Fast New Tune, Journal of the Royal Statistical Society: Series B (Statistical Methodology), Volume 59 (1997) no. 3, pp. 511-567 | Zbl

[70] Newcombe, H.B.; Kennedy, J.M. Automatic Linkage of Vital Records, Science, Volume 130 (1959) no. 3381, pp. 954-959

[71] Neter, J.; Maynes, E.S.; Ramanathan, R. The Effect of Mismatching on the Measurement of Response Error, Journal of the American Statistical Association, Volume 60 (1965) no. 312, pp. 1005-1027

[72] Nigam, K.; McCallum, A.K.; Thrun, S.; Mitchell, T. Text Classification from Labeled and Unlabeled Documents using EM, Machine Learning, Volume 39 (2000) no. 2, pp. 103-134 | Zbl

[73] Neyman, J.; Pearson, E.S. On the Problem of the Most Efficient Tests of Statistical Hypotheses, Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, Volume 231 (1933), pp. 289-337 | JFM | Zbl

[74] Philips, L. Hanging on the Metaphone, Computer Language, Volume 7 (1990) no. 12, pp. 39-44

[75] Quantin, Catherine; Gouyon, Béatrice; Avillach, Paul; Ferdynus, Cyril; Sagot, Paul; Gouyon, Jean-Bernard Using Discharge Abstracts to Evaluate a Regional Perinatal Network : Assessment of the Linkage Procedure of Anonymous Data, International Journal of Telemedicine and Applications, Volume 2009 (2009)

[76] Rubin, D.B.; Belin, T.R. Recent Developments in Calibrating Error Rates for Computer Matching, Conference Paper, 1991 Annual Research Conference: Proceedings, Bureau of the Census, 1991

[77] Rogot, E.; Sorlie, P.; Johnson, N.J. Probabilistic methods in matching census samples to the National Death Index, Journal of Chronic Diseases, Volume 39 (1986) no. 9, pp. 719-734

[78] Sariyar, M.; Borg, A.; Pommerening, K. Controlling false match rates in record linkage using extreme value theory, Journal of Biomedical Informatics, Volume 44 (2011) no. 4, pp. 648-654

[79] Schürle, Josef A method for consideration of conditional dependencies in the Fellegi and Sunter model of record linkage, Statistical Papers, Volume 46 (2005) no. 3, pp. 433-449 | DOI | Zbl

[80] Schafer, J.L. Analysis of Incomplete Multivariate Data, Chapman and Hall/CRC, 1997 | Zbl

[81] Sadinle, Mauricio; Fienberg, Stephen E. A Generalized Fellegi-Sunter framework for multiple record linkage with application to homicide record systems, Journal of the American Statistical Association, Volume 108 (2013), pp. 385-397 | Zbl

[82] Steorts, R.C.; Hall, R.; Fienberg, S.E. A Bayesian Approach to Graphical Record Linkage and De-duplication, arXiv:1312.4645 [stat] (2013)

[83] Steorts, R.C.; Hall, R.; Fienberg, S.E. SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication, arXiv:1403.0211 [stat] (2014)

[84] Shawe-Taylor, J.; Cristianini, N. Kernel Methods for Pattern Analysis, Cambridge University Press, New York, NY, USA, 2004

[85] Steorts, R.C. Entity Resolution with Empirically Motivated Priors, Bayesian Analysis, Volume 10 (2015) no. 4, pp. 849-875 | Zbl

[86] Steorts, R.C.; Ventura, S.L.; Sadinle, M.; Fienberg, S.E. A Comparison of Blocking Methods for Record Linkage, Privacy in Statistical Databases, Springer, Cham (2014), pp. 253-268

[87] Scheuren, F.; Winkler, W.E. Regression Analysis of Data Files that are Computer Matched, Survey Methodology, 1993

[88] Taft, R.L. Name Search Techniques, Technical Report Special no.1 New York State Identification and Intelligence System, 1970

[89] Torra, V.; Domingo-Ferrer, J. Record linkage methods for multidatabase data mining, Information Fusion in Data Mining (Torra, Prof Vicenç, ed.) (Studies in Fuzziness and Soft Computing), Springer Berlin Heidelberg, 2003 no. 123, pp. 101-132 | Zbl

[90] Thibaudeau, Y. The Discrimination Power of Dependency Structures in Record Linkage, Survey Methodology, Volume 19 (1993) no. 1

[91] Tancredi, A.; Liseo, B. A hierarchical Bayesian approach to record linkage and population size problems, The Annals of Applied Statistics, Volume 5 (2011) no. 2, pp. 1553-1585 | Zbl

[92] Tancredi, A.; Liseo, B. Some advances on Bayesian record linkage and inference for linked data, Proceedings of the ESSnet Data Integration Workshop (2011) (http://www.ine.es/e/essnetdi_ws2011/ppts/Liseo_Tancredi.pdf)

[93] Tromp, M.; Méray, N.; Ravelli, A.C.J.; Reitsma, J.B.; Bonsel, G.J. Ignoring Dependency between Linking Variables and Its Impact on the Outcome of Probabilistic Record Linkage Studies, Journal of the American Medical Informatics Association: JAMIA, Volume 15 (2008) no. 5, pp. 654-660

[94] Verykios, V.S.; Elmagarmid, A.K.; Houstis, E. Automating the Approximate Record-matching process, Information Sciences, Volume 126 (2000) no. 1, pp. 83-98 | Zbl

[95] Winkler, W.E. Machine Learning, Information Retrieval and Record Linkage (2000) (Technical report)

[96] Winkler, W.E. Methods for Record Linkage and Bayesian Networks (2002) (Technical report)

[97] Winkler, W.E. Automatically Estimating Record Linkage False Match Rates (2007) (Technical report)

[98] Winkler, W.E. Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage, Bureau of the Census Statistical Research Report Series, 1988

[99] Winkler, W.E. String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage (1990)

[100] Winkler, W.E. Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage (1993) (Technical report)

[101] Winkler, W.E. The State of Record Linkage and Current Research Problems (1999) (Technical report)

[102] Wu, C.F. Jeff On the Convergence Properties of the EM Algorithm, The Annals of Statistics, Volume 11 (1983) no. 1, pp. 95-103 | Zbl

[103] Wright, G. Probabilistic Record Linkage in SAS, Proceedings of Western Users of SAS Software (2011)