Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization

2021-09-25

Basic Information

Cite by @article{wang2020tackling, title={Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization}, author={Wang, Jianyu and Liu, Qinghua and Liang, Hao and Joshi, Gauri and Poor, H Vincent}, journal={Advances in Neural Information Processing Systems}, volume={33}, year={2020} }

Authors Jianyu Wang, Qinghua Liu , Hao Liang, Gauri Joshi, H. Vincent Poor

Organization Carnegie Mellon University, Princeton University

Source Advances in Neural Information Processing Systems 2020

Source Code https://github.com/JYWa/FedNova

Content

Motivation

There is heterogeneity in the Number of Local Updates in Federated Learning, which causes objective inconsistency. However, none of prior works provides insight into the convergence of algorithms when the number of local update heterogeneity happens.

As we know, in federated learning, a total of \(m\) clients aim to jointly solve the following optimization problem: \(\min_x [F(x) = \sum_{i=1}^m p_i F_i(x)] \tag{1}\label1\) where \(p_i = \frac{n_i}{n}\), and \(F_i(x) = \frac{1}{n_i}\sum_{xi\in D_i}f_i(x;\xi)\) is the local objective function at thr \(i\)-th client. And \(\xi\) is a data sample from local dataset \(D_i\).

In practice, there is heterogeneity in the number of local updates due to distribution skew. Author sets \(\tau_i\) as the number of local SGD iteration in \(i\)-th client, such that \(\tau_i = [\frac{En_i}{B}]\). Therefore, take FedAvg for an example, the shared global model is updated as following: \(x^{(t+1, 0)} - x^{(t, 0)} = \sum_{i=1}^{m}p_i\Delta_i^{(t)} = -\sum_{i=1}^{m}p_i\eta\sum_{k=0}^{\tau_i-1}g_i(x_i^{(t,k)}) \tag{2}\label2\)

Where \(x_i^{(t,k)}\) denotes client \(i\)’s model after the \(k\)-th local update in the \(t\)-th communication round and \(\Delta_i^{(t)} = x_i^{(t,\tau_i)} - x_i^{(t,0)}\). \(\eta\) is the client learning rate and \(g_i(\cdot)\) represents the stochastic gradient over a mini-batch of \(B\) samples.

Suppose that the local objective functions are \(F_i(x) = \sum_{i=1}^m\frac1{2m}||x-e_i||^2, which\quad is \quad minimized \quad by \quad x^* = \frac1m \sum_{i=1}^me_i \tag3\label3\) where \(e_i\) is an arbitrary vector. It can be shown that FedAvg will converge to \(\widetilde{x}^* = \lim_{T\rightarrow\infty} x^(T,0) = \sum_{i=1}^m\frac{\tau_i}{\tau}e_i\), which minimizes the surrogate objective: \(\widetilde{F}(x) = \sum_{i=1}^m\frac{\tau_i}{\tau}F_i(x)\). While FedAvg aims at optimizing \(F(x)\), it actually converges to the optimum of a surrogate objective \(\widetilde{F}(x)\).

As figure shown, FedProx and VRLSGD also cause bias due to objective Inconsistency.

Evaluation

Metric: Training loss, Test Accuracy

Datasets

Logistic Regression on a Synthetic Federated Dataset and DNN trained on a Non-IID partitioned CIFAR-10 dataset.

Result

Method

The authors proposed an analysis framework to understand bias due to objective Inconsistency and proposed Federated Normalized Averaging Algorithm(FedNova) to solve this problem.

Analysis Framework

Rewrite \eqref{2} to following update form. \(x^{(t+1, 0)} - x^{(t, 0)} = -\tau_{eff} \sum_{i=1}^{m} w_i \cdot \eta d_i^{(t)}, \quad which \quad optimizes \quad \widetilde{F}(x) = \sum_{i=1}^{m}w_iF_i(x) \tag4\label4\)

the three key elements \(\tau_{eff}, w_i, d_i^{(t)}\) of this update rule take different forms for different algorithms.

1) Normalized gradient \(d_i^{(t)}\): \(d_i^{(t)} = \frac{ G_i^{(t)}a_i}{||a_i||}\), where \(G_i^{(t)} = [g_i(x_i^{(t,0)}), g_i(x_i^{(t,1)}), \dots, g_i(x_i^{(t,\tau_i)})]\) stacks all stochastic gradients in the \(t\)-th round. \(a_i\in \mathbf{R}^{\tau_i}\) (weight of each gradient). In FedAvg, \(a_i = [1,1,\dots, 1], ||a_i|| = \tau_i\) 2) Aggregation weights \(w_i\) 3) Effective number os step \(\tau_{eff}\): The average number of local SGD step per communication round is \(\hat{\tau} = \sum_i^m\frac{\tau_i}{m}\). However, the server can control the effect the aggregated updates by setting the parameter \(\hat{\tau}\).

For previous algorithms: \(x^{(t+1, 0)} - x^{(t, 0)} = \sum_{i=1}^{m}p_i\Delta_i^{(t)} = -\sum_{i=1}^{m}p_i||a_i||\cdot\frac{G_i^{(t)}a_i}{||a_i||} \\ = - \underbrace{(\sum_{i=1}^{m}p_i||a_i||)}_{\tau_{eff}} \sum_{i=1}^{m}\eta \underbrace{(\frac{p_i||a_i||}{\sum_{i=1}^{m}p_i||a_i||})}_{w_i} \underbrace{(\frac{ G_i^{(t)}a_i}{||a_i||})}_{d_i} \tag5\label5\)

For FedAvg, above figure shows the difference. The left figure shows the original aggregation function in FedAvg.

Convergence Analysis

Assumption 1:(L-Smoothness) \(||\nabla F_i(x) - \nabla F_i(y)|| \le L||x-y||\)

Assumption 2:(Unbiased Gradient and Bounded Variance) \(\mathbb{E}_\xi[||g_i(x|\xi)\nabla F_i(x)||^2] \le \sigma^2 L||x-y||\)

Assumption 3:(Bounded Dissimilarity) \(\exists \beta^2\ge 0, \kappa^2\ge 0 s.t. \sum_{i=1}^{m}w_i||\nabla F_i(x)||^2 \le \beta^2||\sum_{i=1}^{m}w_i\nabla F_i(x)||^2 + \kappa^2\)

Base above assumptions, author gets following theorem:

Theorem 1(Convergence to \(\widetilde{F}(x)\) Stationary Point). Under assumptions 1 to 3, any fedrated optimizaition algorithm that follows the update rule \eqref{4}, will converge to a stationary point. if learning rate \(\eta = \sqrt{\frac{m}{\hat{\tau}T}}\), then the optimization error will be bounded as follows: \(\min_{t\in[T]} \mathbb{E}|| \nabla \widetilde{F}(x^{(t,0)})||^2 \le O(\frac{\hat{\tau}\eta}{m\tau_{eff}}) + O(\frac{A\eta\sigma^2}{m}) + O(B\eta^2\sigma^2) + O(C\eta^2\kappa^2) \tag6\)

Where A, B, C are defined as follows: \(A = m\tau_{eff}\sum_{i=1}^m \frac{w_i^2||a_i||_2^2}{||a_i||_1^2}, B = \sum_{i=1}^m w_i(||a_i||_2^2 - a_{i,-1}^2), C = \max_i \{||a_i||_1^2 - ||a_i||_1a_{i,-1}\} \tag7\) \(a_{i,-1}\) is the last element in the vector \(a_i\).

FedNova: Federated Normalized Averaging Algorithm

\[x^{(t+1, 0)} - x^{(t, 0)} = -\tau^{(t)}_{eff} \sum_{i=1}^{m} p_i \cdot \eta d_i^{(t)} \tag8\label8\]

where \(\tau^{(t)}_{eff} = \sum_{i=1}^{m} p_i\tau_i^{(t)}, d_i^{(t)} = \frac{ G_i^{(t)}a_i}{||a_i||}\).

Comments

Pros

Solve the heterogeneity in the Number of Local Updates in Federated Learning
Proposed a general framework which can guarantee the convergence to the stationary point.

Cons

(need further experiment)

Further work

Extending the theoretical framework to adaptive optimization methods or gossip-based training method.

Comments

(need further experiment)

Hong Huang

Hong Huang

Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization

目录

Basic Information

Content

Motivation

Evaluation

Datasets

Result

Method

Analysis Framework

Convergence Analysis

FedNova: Federated Normalized Averaging Algorithm

Comments

Pros

Cons

Further work

Comments