Cluster or co-cluster the nodes of oriented graphs?
[Classification non supervisée de graphes orientés : faut-il distinguer les nœuds origines des noeuds terminaux ?]
Journal de la société française de statistique, Tome 162 (2021) no. 1, pp. 46-69.

Lors de la classification non supervisée des nœuds d’un graphe, une partition unique des nœuds est généralement construite, que le graphe soit orienté ou non. Bien que ce choix soit pertinent pour les graphes non orientés, il devrait être discuté pour les graphes orientés car il implique qu’aucune différence n’est faite entre les clusters de nœuds source et cible. Nous examinons cette question dans le contexte des modèles de clustering probabilistes à variables latentes et comparons l’utilisation du modèle de blocs stochastiques (SBM) et du modèle de blocs latents (LBM). Nous analysons et discutons cette comparaison à travers des jeux de données simulées et réelles.

When clustering the nodes of a graph, a unique partition of the nodes is usually built, either the graph is undirected or directed. While this choice is pertinent for undirected graphs, it should be discussed for directed graphs because it implies that no difference is made between the clusters of source and target nodes. We examine this question in the context of probabilistic models with latent variables and compare the use of the Stochastic Block Model (SBM) and of the Latent Block Model (LBM). We analyze and discuss this comparison through simulated and real data sets and suggest some recommendation.

Classification : 62-09, 62J05, 62P10
Keywords: Clustering for directed graphs, Genes networks, Penalized log-likehood, Co-clustering, SBM, LBM
Mot clés : Classification non supervisée de graphes orientés, Réseaux de gènes, Vraisemblance pénalisée, Co-clustering, SBM, LBM
Keribin, Christine 1

1 Université Paris-Saclay, CNRS, Inria, Laboratoire de mathématiques d’Orsay, 91405, Orsay, France.
@article{JSFS_2021__162_1_46_0,
     author = {Keribin, Christine},
     title = {Cluster or co-cluster the nodes of oriented graphs?},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {46--69},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {162},
     number = {1},
     year = {2021},
     mrnumber = {4286849},
     zbl = {1469.62287},
     language = {en},
     url = {http://www.numdam.org/item/JSFS_2021__162_1_46_0/}
}
TY  - JOUR
AU  - Keribin, Christine
TI  - Cluster or co-cluster the nodes of oriented graphs?
JO  - Journal de la société française de statistique
PY  - 2021
SP  - 46
EP  - 69
VL  - 162
IS  - 1
PB  - Société française de statistique
UR  - http://www.numdam.org/item/JSFS_2021__162_1_46_0/
LA  - en
ID  - JSFS_2021__162_1_46_0
ER  - 
%0 Journal Article
%A Keribin, Christine
%T Cluster or co-cluster the nodes of oriented graphs?
%J Journal de la société française de statistique
%D 2021
%P 46-69
%V 162
%N 1
%I Société française de statistique
%U http://www.numdam.org/item/JSFS_2021__162_1_46_0/
%G en
%F JSFS_2021__162_1_46_0
Keribin, Christine. Cluster or co-cluster the nodes of oriented graphs?. Journal de la société française de statistique, Tome 162 (2021) no. 1, pp. 46-69. http://www.numdam.org/item/JSFS_2021__162_1_46_0/

[1] Abbe, Emmanuel Community detection and stochastic block models: recent developments, The Journal of Machine Learning Research, Volume 18 (2017) no. 1, pp. 6446-6531 | MR

[2] Airoldi, Edoardo; Blei, David; Xing, Eric; Fienberg, Stephen A latent mixed membership model for relational data, Proceedings of the 3rd international workshop on Link discovery (2005), pp. 82-89 | DOI

[3] Bickel, Peter; Choi, David; Chang, Xiangyu; Zhang, Hai Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels, The Annals of Statistics, Volume 41 (2013) no. 4, pp. 1922-1943 | MR | Zbl

[4] Biernacki, Christophe; Celeux, Gilles; Govaert, Gérard Assessing a mixture model for clustering with the integrated completed likelihood, Pattern Analysis and Machine Intelligence, IEEE Transactions on, Volume 22 (2000) no. 7, pp. 719-725 | DOI

[5] Brault, Vincent; Keribin, Christine; Mariadassou, Mahendra Consistency and asymptotic normality of Latent Block Model estimators, Electronic journal of statistics, Volume 14 (2020) no. 1, pp. 1234-1268 | MR | Zbl

[6] Bennett, James; Lanning, Stan The netflix prize, Proceedings of KDD cup and workshop, Volume 2007 (2007), 35 pages

[7] Brault, Vincent; Lomet, Aurore Revue des méthodes pour la classification jointe des lignes et des colonnes d’un tableau, Journal de la Société Française de Statistique, Volume 156 (2015) no. 3, pp. 27-51 | Numdam | MR | Zbl

[8] Brault, Vincent; Mariadassou, Mahendra Co-clustering through Latent Bloc Model: a Review, Journal de la Société Française de Statistique, Volume 156 (2015) no. 3, pp. 120-139 | Numdam | MR | Zbl

[9] Bollobás, Béla Modern graph theory, 184, Springer Science & Business Media, 2013 | MR

[10] Bollobás, Béla Random graphs, Modern graph theory, Springer, 1998, pp. 215-252 | DOI | MR

[11] Castrillo, Gabriel; Turck, Franziska; Leveugle, Magalie; Lecharny, Alain; Carbonero, Pilar; Coupland, George; Paz-Ares, Javier; Oñate-Sánchez, Luis Speeding cis-trans regulation discovery by phylogenomic analyses coupled with screenings of an arrayed library of Arabidopsis transcription factors, PloS one, Volume 6 (2011) no. 6, p. e21524 | DOI

[12] Celeux, Gilles; Vasseur, Yann Classification non supervisée de graphes orientés: faut-il distinguer les nœuds origines des nœuds terminaux?, Actes des Neuvièmes Journées Francophones sur les Réseaux Bayésiens et les Modèles Graphiques Probabilistes 2018 (JFRB, ed.) (2018), pp. 68-75

[13] Dhillon, Inderjit S Co-clustering documents and words using bipartite spectral graph partitioning, Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (2001), pp. 269-274 | DOI

[14] Dempster, Arthur P; Laird, Nan M; Rubin, Donald B Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society. Series B (methodological) (1977), pp. 1-38 | DOI | MR | Zbl

[15] Daudin, J-J; Picard, Franck; Robin, Stéphane A mixture model for random graphs, Statistics and computing, Volume 18 (2008) no. 2, pp. 173-183 | DOI | MR

[16] Etienne, Côme; Latifa, Oukhellou Model-based count series clustering for bike sharing system usage mining: a case study with the Vélib’system of Paris, ACM Transactions on Intelligent Systems and Technology (TIST), Volume 5 (2014) no. 3, pp. 1-21 | DOI

[17] Erdős, Paul; Rényi, Alfréd On the evolution of random graphs, Publ. Math. Inst. Hung. Acad. Sci, Volume 5 (1960) no. 1, pp. 17-60 | MR | Zbl

[18] Euler, Leonhard Solutio problematis ad geometriam situs pertinentis, Commentarii academiae scientiarum Petropolitanae (1741), pp. 128-140

[19] Frank, Ove; Harary, Frank Cluster inference by using transitivity indices in empirical graphs, Journal of the American Statistical Association, Volume 77 (1982) no. 380, pp. 835-840 | DOI | MR | Zbl

[20] Frisch, Gabriel; Léger, Jean-Benoist; Grandvalet, Yves Learning from missing data with the Latent Block Model, arXiv preprint arXiv:2010.12222 (2020) | MR | Zbl

[21] Flake, Gary William; Lawrence, Steve; Giles, C Lee; Coetzee, Frans M Self-organization and identification of web communities, Computer, Volume 35 (2002) no. 3, pp. 66-70 | DOI

[22] Fortunato, Santo Community detection in graphs, Physics reports, Volume 486 (2010) no. 3-5, pp. 75-174 | DOI | MR

[23] Girvan, Michelle; Newman, Mark EJ Community structure in social and biological networks, Proceedings of the national academy of sciences, Volume 99 (2002) no. 12, pp. 7821-7826 | DOI | MR | Zbl

[24] Govaert, G.; Nadif, M. Clustering of contingency table and mixture model, European Journal of Operational Research, Volume 183 (2007), pp. 1055-1066 | DOI | MR | Zbl

[25] Govaert, Gérard; Nadif, Mohamed Co-Clustering, ISTE Ltd and John Wiley & Sons, Inc, 2013 | DOI

[26] Goldenberg, Anna; Zheng, Alice X; Fienberg, Stephen E; Airoldi, Edoardo M A survey of statistical network models, Found. Trends Mach. Learn., Volume 2 (2010) no. 2, pp. 129-233 | DOI | Zbl

[27] Holme, Petter; Liljeros, Fredrik; Edling, Christofer R; Kim, Beom Jun Network bipartivity, Physical Review E, Volume 68 (2003) no. 5, p. 056107 | DOI

[28] Holland, Paul W; Laskey, Kathryn Blackmond; Leinhardt, Samuel Stochastic block models: First steps, Social networks, Volume 5 (1983) no. 2, pp. 109-137 | DOI | MR

[29] Keribin, Christine; Brault, Vincent; Celeux, Gilles; Govaert, Gérard Estimation and selection for the latent block model on categorical data, Statistics and Computing, Volume 25 (2015) no. 6, pp. 1201-1216 | DOI | MR | Zbl

[30] Kolaczyk, Eric D Statistical Analysis of Network Data Methods and Models, Springer, 2009 | DOI | MR

[31] Leger, Jean-Benoist blockmodels: Latent and Stochastic Block Model Estimation by a ’V-EM’ Algorithm (2015) https://CRAN.R-project.org/package=blockmodels (R package version 1.1.1)

[32] Leger, Jean-Benoist Blockmodels: A R-package for estimating in Latent Block Model and Stochastic Block Model, with various probability functions, with or without covariates, arXiv preprint arXiv:1602.07587 (2016)

[33] Leger, Jean-Benoist; Vacher, Corinne; Daudin, Jean-Jacques Detection of structurally homogeneous subsets in graphs, Statistics and Computing, Volume 24 (2014) no. 5, pp. 675-692 | DOI | MR | Zbl

[34] Madeira, Sara C; Oliveira, Arlindo L Biclustering algorithms for biological data analysis: a survey, Computational Biology and Bioinformatics, IEEE/ACM Transactions on, Volume 1 (2004) no. 1, pp. 24-45 | DOI

[35] Matias, Catherine; Robin, Stéphane Modeling heterogeneity in random graphs through latent space models: a selective review, ESAIM: Proceedings and Surveys, Volume 47 (2014), pp. 55-74 | DOI | MR | Zbl

[36] Mariadassou, M.; Robin, S.; Vacher, C. Uncovering latent structure in valued graphs: a variational approach, The Annals of Applied Statistics, Volume 4 (2010) no. 2, pp. 715-742 | DOI | MR | Zbl

[37] Newman, Mark EJ Modularity and community structure in networks, Proceedings of the national academy of sciences, Volume 103 (2006) no. 23, pp. 8577-8582 | DOI

[38] Passino, Francesco Sanna; Heard, Nicholas A Bayesian estimation of the latent dimension and communities in stochastic blockmodels, Statistics and Computing (2020), pp. 1-17 | MR | Zbl

[39] Priebe, Carey E; Park, Youngser; Vogelstein, Joshua T; Conroy, John M; Lyzinski, Vince; Tang, Minh; Athreya, Avanti; Cape, Joshua; Bridgeford, Eric On a two-truths phenomenon in spectral graph clustering, Proceedings of the National Academy of Sciences, Volume 116 (2019) no. 13, pp. 5995-6000 | DOI | MR | Zbl

[40] Rohe, Karl; Chatterjee, Sourav; Yu, Bin Spectral clustering and the high-dimensional stochastic blockmodel, The Annals of Statistics, Volume 39 (2011) no. 4, pp. 1878-1915 | MR | Zbl

[41] Reddy, P Krishna; Kitsuregawa, Masaru; Sreekanth, P; Rao, S Srinivasa A graph based approach to extract a neighborhood customer community for collaborative filtering, International Workshop on Databases in Networked Information Systems, Springer (2002), pp. 188-200 | DOI | Zbl

[42] Robert, Valerie bikm1: Co-Clustering Adjusted Rand Index and Bikm1 Procedure for Contingency and Binary Data-Sets (2020) https://CRAN.R-project.org/package=bikm1 (R package version 1.0.0)

[43] Robert, Valérie; Vasseur, Yann; Brault, Vincent Comparing high-dimensional partitions with the Co-clustering Adjusted Rand Index, Journal of Classification (2020), pp. 1-29 | MR | Zbl

[44] Schwarz, Gideon Estimating the dimension of a model, The annals of statistics, Volume 6 (1978) no. 2, pp. 461-464 | MR | Zbl

[45] Shan, H.; Banerjee, A. Bayesian co-clustering, Eighth IEEE International Conference on Data Mining, 2008. ICDM’08 (2008), pp. 530-539 | DOI

[46] Singh Bhatia, Parmeet; Iovleff, Serge; Govaert, Gérard blockcluster: An R Package for Model-Based Co-Clustering, Journal of Statistical Software, Volume 76 (2017) no. 9, pp. 1-24 | DOI

[47] Swarbreck, David; Wilks, Christopher; Lamesch, Philippe; Berardini, Tanya Z; Garcia-Hernandez, Margarita; Foerster, Hartmut; Li, Donghui; Meyer, Tom; Muller, Robert; Ploetz, Larry The Arabidopsis Information Resource (TAIR): gene structure and function annotation, Nucleic acids research, Volume 36 (2007) no. suppl_1, p. D1009-D1014 | DOI

[48] Thébault, Elisa; Fontaine, Colin Stability of ecological communities and the architecture of mutualistic and trophic networks, Science, Volume 329 (2010) no. 5993, pp. 853-856 | DOI

[49] Vasseur, Yann Inférence de réseaux de régulation orientés pour les facteurs de transcription d’Arabidopsis thaliana et création de groupes de co-régulation, Ph. D. Thesis , Paris Saclay (2017)

[50] Von Luxburg, Ulrike A tutorial on spectral clustering, Statistics and computing, Volume 17 (2007) no. 4, pp. 395-416 | DOI | MR

[51] Vacher, Corinne; Piou, Dominique; Desprez-Loustau, Marie-Laure Architecture of an antagonistic tree/fungus network: the asymmetric influence of past evolutionary history, PloS one, Volume 3 (2008) no. 3, p. e1740 | DOI

[52] Wang, YX Rachel; Bickel, Peter J Likelihood-based model selection for stochastic block models, The Annals of Statistics, Volume 45 (2017) no. 2, pp. 500-528 | MR | Zbl

[53] Wyse, J.; Latouche, P.; Friel, N. Inferring structure in bipartite networks using the latent block model and exact ICL ., Network Science, Volume 5 (2018), pp. 45-69 | DOI

[54] Wolpert, Lewis Positional information and pattern formation, Current topics in developmental biology, Volume 6 (1971), pp. 183-224 | DOI