Paper Title
First-Order Algorithms Without Lipschitz Gradient: A Sequential Local Optimization Approach
Paper Authors
Paper Abstract
First-order algorithms have been popular for solving convex and non-convex optimization problems. A key assumption for the majority of these algorithms is that the gradient of the objective function is globally Lipschitz continuous, but many contemporary problems, such as tensor decomposition, fail to satisfy this assumption. This paper develops a sequential local optimization (SLO) framework of first-order algorithms that can effectively optimize problems without Lipschitz gradients. Operating under the assumption that the gradients are {\it locally} Lipschitz continuous over any compact set, the proposed framework carefully restricts the distance between two successive iterates. We show that the proposed framework can easily adapt to existing first-order methods such as gradient descent (GD), normalized gradient descent (NGD), accelerated gradient descent (AGD), as well as GD with Armijo line search. Remarkably, the latter algorithm is completely parameter-free and does not even require knowledge of the local Lipschitz constants. We show that for the proposed algorithms to achieve a gradient error bound of $\|\nabla f(x)\|^2\le \epsilon$, at most $\mathcal{O}(\frac{1}{\epsilon}\times \mathcal{L}(Y))$ total accesses to the gradient oracle are required, where $\mathcal{L}(Y)$ characterizes how the local Lipschitz constants grow with the size of a given set $Y$. Moreover, we show that a variant of AGD improves the dependency on both $\epsilon$ and the growth function $\mathcal{L}(Y)$. The proposed algorithms complement the existing Bregman Proximal Gradient (BPG) algorithm, because they do not require global information about the problem structure to construct and solve Bregman proximal mappings.
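To make the parameter-free flavor concrete, the sketch below shows plain gradient descent with Armijo backtracking line search in Python: the step size is chosen by a sufficient-decrease test rather than from any (local) Lipschitz constant. This is only an illustrative sketch under stated assumptions, not the paper's SLO variant; the function name `gd_armijo`, the default parameters, and the quartic test objective (whose gradient is not globally Lipschitz) are all hypothetical choices made for the example.

```python
import numpy as np

def gd_armijo(f, grad_f, x0, tol=1e-6, max_iters=10_000,
              s=1.0, beta=0.5, sigma=1e-4):
    """Gradient descent with Armijo backtracking line search.

    The trial step is shrunk until the sufficient-decrease condition
        f(x - t*g) <= f(x) - sigma * t * ||g||^2
    holds, so no Lipschitz constant is ever supplied by the user.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.dot(g, g) <= tol:      # stop once ||grad f(x)||^2 <= tol
            break
        t = s                        # initial trial step size
        while f(x - t * g) > f(x) - sigma * t * np.dot(g, g):
            t *= beta                # backtrack until sufficient decrease
        x = x - t * g
    return x

# Hypothetical example: a quartic objective; its gradient x**3 is only
# locally Lipschitz, not globally Lipschitz.
f = lambda x: 0.25 * np.sum(x ** 4)
grad_f = lambda x: x ** 3
x_opt = gd_armijo(f, grad_f, x0=np.array([2.0, -3.0]))
```

The backtracking loop plays the role of locality here: each accepted step automatically adapts to how steep the objective is near the current iterate, which is the same intuition behind restricting the distance between successive iterates in the SLO framework.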