A statistical significance-based approach for clustering grouped data via generalized linear model with discrete random effects
Identifying distinct subgroups within correlated data is essential for tailoring policies to specific needs, ensuring optimal outcomes for each group. In the context of model-based clustering, we introduce an innovative approach for clustering grouped data using linear mixed models with discrete random effects and exponential family responses (e.g. Poisson or Bernoulli). Our method uncovers the latent clustering structure, net of fixed effects, by assuming that random effects follow a discrete distribution with an a priori unknown number of support points. We refine this process within a modified Expectation–Maximization algorithm, collapsing support points of the discrete distribution with overlapping estimated confidence intervals or regions, derived from the asymptotic properties of maximum likelihood estimators. This approach offers a transparent interpretation of the latent structure, distinct from existing tools for discrete random effects, which often rely on discretionary tuning parameters or predetermined cluster counts. Through simulation studies, we compare our approach with traditional parametric methods and state-of-the-art techniques, demonstrating its effectiveness. We apply our model on real-world data from the Programme for International Student Assessment, aiming to classify countries based on their impact on low-achieving student rates in schools. Our methodology provides valuable insights for effective policy formulation.