Special issue: quantitative, event-based, incompletely observed longitudinal data
Mixed Hidden Markov Model for Heterogeneous Longitudinal Data with Missingness and Errors in the Outcome Variable
Journal de la société française de statistique, Volume 155 (2014) no. 1, pp. 73-98.

Analysing longitudinal declarative data raises many difficulties, such as handling errors and missing values in the outcome variable. Moreover, cohorts followed over the long term, as commonly encountered in life-course epidemiology, may exhibit time heterogeneity, especially in the way subjects respond to the investigator. We propose a Mixed Hidden Markov Model (MHMM) that accounts for several sources of randomness in the responses and also allows a past health outcome to act on present responses through a memory state. The model thus takes into account errors and missing responses, time heterogeneity, and retrospective questions. To estimate the parameters of the MHMM, we propose a Stochastic Expectation Maximization (SEM) algorithm, which is less time-consuming than the usual EM algorithm with integration over the random effects.

We carry out a simulation study to assess the performance of this algorithm in the context of cancer epidemiology, using the British NCDS 1958 cohort. The simulations show that the effect of covariates on the transition probabilities is estimated with moderate bias. Finally, we present a brief real-data application on the effect of early social class on cancer through smoking behaviour. It appears that, in the female sample we used, early social class does not mainly act on smoking behaviour. However, more information is needed to compensate for missing data and declarative errors and thereby improve the statistical analysis.
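
To make the ingredients named in the abstract more concrete, the following is a minimal Python sketch of one stochastic EM iteration for a plain two-state hidden Markov chain with reporting errors and missing responses. It is an illustration under strong simplifying assumptions, not the authors' model: it has no random effects, no covariates, and no memory state. All numerical values, the names `INIT`, `TRANS`, `EMIS`, and the helper functions are hypothetical and chosen only for the example.

```python
import random

# All numbers below are illustrative assumptions, not estimates from the paper.
INIT = [0.9, 0.1]                      # P(hidden state at the first wave)
TRANS = [[0.95, 0.05], [0.10, 0.90]]   # TRANS[i][j] = P(next = j | current = i)
# EMIS[s][y] = P(response y | hidden state s); None encodes a missing response.
EMIS = [{0: 0.85, 1: 0.05, None: 0.10},
        {0: 0.20, 1: 0.60, None: 0.20}]


def forward(obs, init, trans, emis):
    """Unnormalised forward vectors alpha_t(s) = P(y_1..y_t, S_t = s).

    No rescaling is done, so this is only suitable for short sequences.
    """
    alpha = [[init[s] * emis[s][obs[0]] for s in range(2)]]
    for y in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * trans[i][j] for i in range(2)) * emis[j][y]
                      for j in range(2)])
    return alpha


def likelihood(obs, init=INIT, trans=TRANS, emis=EMIS):
    """P(observed response sequence), summed over all hidden paths."""
    return sum(forward(obs, init, trans, emis)[-1])


def sample_path(obs, rng, init=INIT, trans=TRANS, emis=EMIS):
    """S-step: draw one hidden path from P(path | obs) by forward filtering
    followed by backward sampling."""
    alpha = forward(obs, init, trans, emis)
    path = [0] * len(obs)
    path[-1] = rng.choices([0, 1], weights=alpha[-1])[0]
    for t in range(len(obs) - 2, -1, -1):
        weights = [alpha[t][i] * trans[i][path[t + 1]] for i in range(2)]
        path[t] = rng.choices([0, 1], weights=weights)[0]
    return path


def m_step_transitions(paths):
    """M-step for the transition matrix: count transitions in the imputed paths."""
    counts = [[1e-9, 1e-9], [1e-9, 1e-9]]  # tiny constant avoids division by zero
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return [[c / sum(row) for c in row] for row in counts]


# One SEM iteration on three made-up response sequences.
rng = random.Random(0)
subjects = [[0, None, 1, 1], [0, 0, None, 0], [None, 1, 1, None]]
paths = [sample_path(obs, rng) for obs in subjects]
new_trans = m_step_transitions(paths)
```

The point of the sketch is that the S-step replaces the exact expectation of EM with a single simulated draw of the hidden path, after which the M-step reduces to counting. In a mixed model the random effects would also have to be simulated at this step, which this example does not attempt.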

Keywords: Longitudinal data, Mixed hidden Markov model, Random effects, Stochastic EM algorithm
@article{JSFS_2014__155_1_73_0,
     author = {Dedieu, Dominique and Delpierre, Cyrille and Gadat, S\'ebastien and Lang, Thierry},
     title = {Mixed {Hidden} {Markov} {Model} for {Heterogeneous} {Longitudinal} {Data} with {Missingness} and {Errors} in the {Outcome} {Variable}},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {73--98},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {155},
     number = {1},
     year = {2014},
     zbl = {1316.62125},
     language = {en},
     url = {http://www.numdam.org/item/JSFS_2014__155_1_73_0/}
}
TY  - JOUR
AU  - Dedieu, Dominique
AU  - Delpierre, Cyrille
AU  - Gadat, Sébastien
AU  - Lang, Thierry
TI  - Mixed Hidden Markov Model for Heterogeneous Longitudinal Data with Missingness and Errors in the Outcome Variable
JO  - Journal de la société française de statistique
PY  - 2014
DA  - 2014///
SP  - 73
EP  - 98
VL  - 155
IS  - 1
PB  - Société française de statistique
UR  - http://www.numdam.org/item/JSFS_2014__155_1_73_0/
UR  - https://zbmath.org/?q=an%3A1316.62125
LA  - en
ID  - JSFS_2014__155_1_73_0
ER  - 
Dedieu, Dominique; Delpierre, Cyrille; Gadat, Sébastien; Lang, Thierry. Mixed Hidden Markov Model for Heterogeneous Longitudinal Data with Missingness and Errors in the Outcome Variable. Journal de la société française de statistique, Volume 155 (2014) no. 1, pp. 73-98. http://www.numdam.org/item/JSFS_2014__155_1_73_0/