Towards Diverse Device Heterogeneous Federated Learning via Task Arithmetic Knowledge Integration

Mahdi Morafah1*, Vyacheslav Kungurtsev2, Hojin Chang1, Chen Chen3, Bill Lin1
1University of California San Diego (UCSD), 2Czech Technical University in Prague
3University of Central Florida (UCF)

*Correspondence: mmorafah@ucsd.edu

NeurIPS 2024

Abstract

Federated Learning (FL) has emerged as a promising paradigm for collaborative machine learning while preserving user data privacy. Despite its potential, standard FL algorithms lack support for diverse heterogeneous device prototypes, which vary significantly in model and dataset sizes---from small IoT devices to large workstations. This limitation is only partially addressed by existing knowledge distillation (KD) techniques, which often fail to transfer knowledge effectively across a broad spectrum of device prototypes with varied capabilities. This failure primarily stems from two issues: the dilution of informative logits from more capable devices by those from less capable ones, and the use of a single set of integrated logits as the distillation target across all devices, which neglects each device's individual learning capacity and unique contributions. To address these challenges, we introduce TAKFL, a novel KD-based framework that treats the knowledge transfer from each device prototype's ensemble as a separate task, independently distilling each to preserve its unique contributions and avoid dilution. TAKFL also incorporates a KD-based self-regularization technique to mitigate the issues related to the noisy and unsupervised ensemble distillation process. To integrate the separately distilled knowledge, we introduce an adaptive task arithmetic knowledge integration process, allowing each student model to customize the knowledge integration for optimal performance. Additionally, we present theoretical results demonstrating the effectiveness of task arithmetic in transferring knowledge across heterogeneous device prototypes with varying capacities. Comprehensive evaluations of our method across both computer vision (CV) and natural language processing (NLP) tasks demonstrate that TAKFL achieves state-of-the-art results across a variety of datasets and settings, significantly outperforming existing KD-based methods.

Device Heterogeneous FL

Overview Image

In practice, there is a wide range of device prototypes, from small IoT devices and medium-sized smartphones to large-scale workstations. Each of these devices has a unique neural network architecture designed to fit its specific hardware, software configuration, and machine learning task. The capabilities of these devices vary significantly: smaller devices (e.g., IoT) train smaller models on smaller datasets, while larger devices (e.g., workstations) train larger models on larger datasets. These diverse device prototypes with heterogeneous model architectures participate in FL to enhance their global model performance through mutual knowledge sharing. The figure depicts a scenario with three device prototypes: IoT devices, smartphones, and workstations.

Limitations of KD-based Device Heterogeneous FL

Overview Image

Knowledge distillation (KD) in device-heterogeneous FL facilitates knowledge transfer by using locally updated client models from different device prototypes—collectively termed ensembles—as teachers to distill their knowledge into each device prototype's server-side student model, using an unlabeled public dataset. While existing works primarily focus on same-size devices with similar capabilities, they overlook the significant variation among device prototypes, ranging from small IoT devices to large workstations. As shown in this figure, the logits of different-sized ensembles are averaged and used as the distillation target to transfer knowledge to each server-side student model. Unfortunately, existing methods struggle to establish effective knowledge transfer in these challenging, real-world device-heterogeneous settings, primarily for two reasons:
1. Existing methods often disregard the individual strengths and information quality of each device prototype's ensemble and integrate all ensembles' logits into a single distillation target (see the sketch after this list). This approach dilutes the richer, more informative logits from larger, more capable devices with the less informative logits from smaller, less capable ones.
2. These methods then employ this single integrated distillation target to transfer knowledge to all different-sized student models. This one-size-fits-all approach fails to provide customized knowledge integration based on the unique learning capacity of each student and the specific helpfulness of each device prototype's ensemble.
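To make the failure mode concrete, here is a minimal sketch of this vanilla averaged-logits distillation in PyTorch. The function name, the flat `teachers` list (all prototypes' client models pooled together), and the temperature `T` are illustrative assumptions, not the exact implementation of any particular baseline.

```python
import torch
import torch.nn.functional as F

def vanilla_ensemble_distill(student, teachers, public_loader, optimizer, T=3.0):
    """One epoch of vanilla ensemble distillation: average the logits of all
    teacher models into a single target and distill it into the student on
    unlabeled public data (KL divergence with temperature T)."""
    student.train()
    for x in public_loader:  # batches of unlabeled public inputs
        with torch.no_grad():
            # Single integrated target: a plain mean over every teacher,
            # regardless of its prototype's capability -- this averaging is
            # where the dilution described above occurs.
            target_logits = torch.stack([t(x) for t in teachers]).mean(dim=0)
        loss = F.kl_div(
            F.log_softmax(student(x) / T, dim=1),
            F.softmax(target_logits / T, dim=1),
            reduction="batchmean",
        ) * (T * T)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```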

TAKFL Overview

Overview Image

This figure provides an overview of the TAKFL approach, which transfers the knowledge of different-sized device prototypes' ensembles into each server-side student model. The scenario consists of three device prototypes: Small (S), Medium (M), and Large (L). TAKFL treats knowledge transfer from each prototype's ensemble as a separate task and distills them independently. This ensures that the unique knowledge and contributions of each prototype's ensemble are effectively distilled, avoiding dilution, information loss, and interference from other prototypes' ensembles. Additionally, a KD-based self-regularization technique is introduced to guide the student through the noisy and unsupervised ensemble distillation process. Finally, the separately distilled knowledge is strategically integrated using an adaptive task arithmetic operation, allowing for customized knowledge integration based on each student prototype's specific needs. A simplified code sketch of this procedure is given below.
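The following is a minimal sketch of one TAKFL-style server round for a single student prototype, reusing `vanilla_ensemble_distill` from the earlier sketch as the per-prototype distillation step. The function name, optimizer settings, and fixed merging coefficients `lambdas` are illustrative assumptions; the paper's actual method additionally applies the KD-based self-regularization and selects the coefficients adaptively per student, both omitted here for brevity.

```python
import copy
import torch

def takfl_server_round(student, prototype_ensembles, public_loader, lambdas,
                       lr=1e-3, T=3.0):
    """Sketch of one server round for one student prototype:
    (1) distill each device prototype's ensemble into a separate copy of the
        student, (2) form one task vector per prototype, and (3) integrate the
        task vectors into the student via weighted task arithmetic."""
    theta_init = {k: v.detach().clone() for k, v in student.state_dict().items()}

    task_vectors = []
    for teachers in prototype_ensembles:  # one ensemble per prototype (S, M, L, ...)
        tmp = copy.deepcopy(student)
        opt = torch.optim.SGD(tmp.parameters(), lr=lr)
        # Independent distillation: only this prototype's teachers contribute,
        # so their knowledge is not diluted by other prototypes' logits.
        vanilla_ensemble_distill(tmp, teachers, public_loader, opt, T)
        task_vectors.append(
            {k: tmp.state_dict()[k] - theta_init[k] for k in theta_init}
        )

    # Task arithmetic integration: theta = theta_init + sum_k lambda_k * tau_k.
    merged = {}
    for k in theta_init:
        if theta_init[k].is_floating_point():
            merged[k] = theta_init[k] + sum(
                lam * tau[k] for lam, tau in zip(lambdas, task_vectors)
            )
        else:
            merged[k] = theta_init[k]  # leave integer buffers (e.g., counters) as-is
    student.load_state_dict(merged)
    return student
```

The key difference from the vanilla pipeline is that integration happens in weight space, over per-prototype task vectors with per-student coefficients, rather than in logit space over all teachers at once.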

Performance Results

Theoretical Results

We present the first theoretical framework, based on the concept of capacity dimension, for understanding the effectiveness of KD in device-heterogeneous FL. Our results demonstrate that vanilla KD makes inefficient and inaccurate use of model capacity, leading to information loss and suboptimal knowledge transfer (Remark 1 in the paper). Moreover, they show that when device prototypes have different capacities, TAKFL transfers the most informative knowledge to each prototype's student model according to that model's intrinsic capacity, leading to effective knowledge transfer.

Acknowledgement

This research was partially supported by a grant from Cisco Systems, Inc. We also gratefully acknowledge the use of the computational infrastructure provided by the OP VVV funded project CZ.02.1.01/0.0/0.0/16_019/0000765, "Research Center for Informatics," which enabled us to conduct the experiments presented in this work.

BibTeX

@misc{morafah2024diversedeviceheterogeneousfederated,
      title={Towards Diverse Device Heterogeneous Federated Learning via Task Arithmetic Knowledge Integration}, 
      author={Mahdi Morafah and Vyacheslav Kungurtsev and Hojin Chang and Chen Chen and Bill Lin},
      year={2024},
      eprint={2409.18461},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2409.18461}, 
}