DCA for online prediction with expert advice

Hoai An Le Thi ; Vinh Thanh Ho ; et al.

In: Neural Computing and Applications, Jg. 33 (2021-02-26), S. 9521-9544

Online unknown

Zugriff:

Volltext (PDF)

We investigate DC (Difference of Convex functions) programming and DCA (DC Algorithm) for a class of online learning techniques, namely prediction with expert advice, where the learner’s prediction is made based on the weighted average of experts’ predictions. The problem of predicting the experts’ weights is formulated as a DC program for which an online version of DCA is investigated. The two so-called approximate/complete variants of online DCA based schemes are designed, and their regrets are proved to be logarithmic/sublinear. The four proposed algorithms for online prediction with expert advice are furthermore applied to online binary classification. Experimental results tested on various benchmark datasets showed their performance and their superiority over three standard online prediction with expert advice algorithms—the well-known weighted majority algorithm and two online convex optimization algorithms.

DCA for online prediction with expert advice

We investigate DC (Difference of Convex functions) programming and DCA (DC Algorithm) for a class of online learning techniques, namely prediction with expert advice, where the learner's prediction is made based on the weighted average of experts' predictions. The problem of predicting the experts' weights is formulated as a DC program for which an online version of DCA is investigated. The two so-called approximate/complete variants of online DCA based schemes are designed, and their regrets are proved to be logarithmic/sublinear. The four proposed algorithms for online prediction with expert advice are furthermore applied to online binary classification. Experimental results tested on various benchmark datasets showed their performance and their superiority over three standard online prediction with expert advice algorithms—the well-known weighted majority algorithm and two online convex optimization algorithms.

Keywords: Online learning; Prediction with expert advice; Online DC programming; Online DCA

Introduction

Prediction with expert advice is the main topic in the theory of machine learning, more specifically in online learning (see, e.g., [[6], [8]]). It establishes the foundation of prediction of individual sequences. The prediction with expert advice was first introduced in the 1980s as a model of online learning by DeSantis, Markowsky, and Wegman [[17]]; Littlestone and Warmuth [[48]] and can be described as follows (see, e.g., [[6], [8], [48]]). Suppose the learner has access to predictions of the pool including d experts. On each online round, the learner receives an incoming example and must make a prediction on this example based on the advice made by experts. After making a decision, the correct prediction is provided and the learner will suffer some loss. The learning rule here is how the learner uses the expert advice to make a prediction as accurate as possible.

Denote by ${\tilde{p}}_{i, t}$ the prediction at the time t of the ith expert. Fix a loss function $ℓ$ . The following protocol is often used.

Graph

Denote by $L_{i, T}$ and $L_{T}$ , respectively, the cumulative loss of the ith expert and the learner after T prediction steps, i.e.,

$\begin{matrix} L_{i, T} : = \sum_{t = 1}^{T} ℓ_{t} ({\tilde{p}}_{i, t}, y_{t}), i = 1, ..., d ; L_{T} : = \sum_{t = 1}^{T} ℓ_{t} ({\bar{p}}_{t}, y_{t}) . \end{matrix}$

Graph

The goal of the learner is to minimize the regret

1 $\begin{matrix} R_{T} : = \sum_{t = 1}^{T} ℓ_{t} ({\bar{p}}_{t}, y_{t}) - min_{i = 1, ..., d} \sum_{t = 1}^{T} ℓ_{t} ({\tilde{p}}_{i, t}, y_{t}) = L_{T} - min_{i = 1, ..., d} L_{i, T} . \end{matrix}$

Graph

The prediction with expert advice has been successfully applied in many different contexts where a group of experts is available, and the learner must predict on each online round by combining the experts' predictions. In practice, the experts can be specialists, or good choices, prediction strategies, efficient algorithms. An interesting application of the prediction with expert advice is online auctions for digital goods [[28]]. The fixed sales price of goods plays a role of an expert. The auctioneer must determine which sales price for every bidder in order to maximize the total revenue obtained from the auction. Other real-world applications in various areas are such as adaptive caching [[26]], power management [[36]], rent-and-buy problem [[25]], financial stability and monitoring [[1]], image retrieval [[67]], temperature times series forecasting [[33]], electricity consumption forecasting [[18]], contamination control in food supply chain [[15]], classification [[28]], neural architecture search [[49]], to cite a few.

In the literature, there are many prediction strategies with expert advice. For instance, the consistent algorithm (see, e.g., [[59]]), the Halving algorithm (see, e.g., [[2], [4], [48]]), the min/max prediction strategy and its version for static experts [[6]]. Many works considered a natural prediction strategy which computes a weighted average of experts' predictions. For example, when the prediction space $P$ is convex, the leaner will predict an answer according to (see [[8]])

$\begin{matrix} {\bar{p}}_{t} = \frac{\sum_{i = 1}^{d} w^{t} [i] . {\tilde{p}}_{i, t}}{\sum_{j = 1}^{d} w^{t} [j]}, \end{matrix}$

Graph

where each weight $w^{t} [i]$ ( $i = 1, ..., d$ ) is assigned to the ith expert at the step t.

Let us denote by ${\tilde{p}}_{t} = {({\tilde{p}}_{i, t})}_{i = 1, ..., d} \in P^{d}$ the vector of experts' predictions. We consider in this paper the following weighted average prediction strategy (see, e.g., [[8], [59]]).

2 $\begin{matrix} {\bar{p}}_{t} = p_{t} (w^{t}) \in P, p_{t} (w) = 1_{⟨ w, {\tilde{p}}_{t} ⟩ \geq ρ} (w) . \end{matrix}$

Graph

Here, $1_{X}$ is the indicator function on $X$ (say $1_{X} (x) = 1$ if $x \in X$ , 0 otherwise), $ρ$ is a positive index, and the experts' weight belongs to the probability simplex $S$ in $R^{d}$ defined by

3 $\begin{matrix} S = \{w \in R^{d} : \sum_{i = 1}^{d} w [i] = 1 and w [i] \geq 0, i = 1, ..., d\} . \end{matrix}$

Graph

In practical problems with binary answers (e.g., $y_{t} \in {0, 1}$ ), the $0 - 1$ loss function $ℓ_{t}$ is typically used (see, e.g., [[8], [59], [68]]), it is defined as

4 $\begin{matrix} ℓ_{t} (p_{t} (w), y_{t}) : = 1_{{p_{t} (w) \neq y_{t}}} (w) = \{\begin{matrix} 1 & if p_{t} (w) \neq y_{t}, \\ 0 & otherwise . \end{matrix}) \end{matrix}$

Graph

The function $ℓ_{t}$ returns 0 if the prediction is correct and 1 otherwise.

Related works To develop efficient prediction techniques with expert advice according to the weighted average strategy, the crucial issue is how the learner computes the weight vector of the experts at each step. Many weighted average prediction techniques have been extensively developed, on both theoretical and algorithmic aspects (see, e.g., [[5]–[9], [17], [19], [28], [48], [58], [63], [68]]). They can be divided into three main categories corresponding to three ways for computing the weights.

The first category belongs to the well-known weighted majority algorithm and its variants (see, e.g., [[48], [68]]). Computing the weights of the weighted majority algorithm is very simple: all weights of experts are initially set to 1, and at each step, if any expert makes a mistake, then his weight will be reduced by a fixed factor. In [[48]], Littlestone and Warmuth showed that the mistake bound for the weighted majority algorithm is closely related to the mistake bounds of the best expert in hindsight.

In the second category, the experts' weights at each learning step t are computed based on the cumulative regret (1). For instance, when the prediction/outcome spaces $P$ and $Y$ are convex, the weights are defined by a derivative of a (polynomial/exponential) potential function at the cumulative regret. In an alternative way, if the loss function $ℓ_{t}$ is differentiable, then the weights can be determined via its partial derivative $\nabla_{\bar{p}} ℓ_{t} (\bar{p}, y)$ (see, e.g., [[5]–[8], [34]] for more details). Two well-known algorithms in this category are polynomially/exponentially weighted average forecasters (see, e.g., [[8]]).

However, in general, both ways just mentioned above do not exploit the structures of the loss function to compute the weights (more precisely, some methods in the second category consider the loss only for the case where it is differentiable). Compensating for this, the third way consists in directly minimizing, at each step t, the cumulative loss function after $t - 1$ steps [[59]]. Specifically, when the loss functions and the prediction outcome spaces are convex, the experts' weights are computed by solving one convex optimization problem and online convex optimization methods are employed (see, e.g., [[59], [69]]). For example, the online gradient descent algorithm (with lazy/greedy projections) [[69]], the exponentiated gradient algorithm [[3], [34]], the p-norm [[24], [27]] and their variants, etc.

A major common difficulty of the methods in the third category is that in many practical problems the loss function is nonconvex (for example, the $0 - 1$ loss function (4)), and/or the prediction/outcome spaces are nonconvex (see, e.g., [[10]]). In these cases, the optimization problem to be solved at each step is nonconvex. To deal with the non-convexity, classical methods often use convexification techniques to get convex problems and then apply an online convex optimization algorithm. However, such approaches have several disadvantages (see, e.g., [[8], [58]]). Hence, solving nonconvex optimization problems to compute the experts' weights stills a challenge in the area of prediction with expert advice. It is needed to design efficient algorithms in online nonconvex optimization framework to meet this challenge. In our knowledge, such algorithms do not yet exist in the literature.

Our contributions This paper aims to tackle the challenge of non-convexity in prediction with expert advice by exploiting the power of DC (difference of convex functions) programming and DCA (DC algorithm), the backbone of nonconvex programming and global optimization. These tools were introduced by Pham Dinh Tao in a preliminary form in 1985 and have been extensively developed by Le Thi Hoai An and Pham Dinh Tao since 1993 (e.g., [[37], [41], [44], [46], [54]–[56]] and references therein) to become now classic and increasingly popular. The basic DCA aims to solve a standard DC program which consists of minimizing a DC function under a convex set. It is based on the nice and simple concept of approximating a nonconvex (DC) program by a sequence of convex ones. Numerous DCA-based algorithms have been developed for successfully solving large-scale nonsmooth/nonconvex programs appearing in many application areas, especially in machine learning, communication system, biology, finance, supply chain and production management, etc (see [[46]] and references therein). In this paper, as we are faced with an online nonconvex optimization problem, we do not use the basic DCA but rather develop a new DCA-based approach which can be seen as an online version of DCA. This constitutes the main contribution and novelty of our work. We consider the $0 - 1$ loss function that is typically suitable for many practical learning problems. By definition, the $0 - 1$ loss function itself is not a DC function. We first approximate the $0 - 1$ loss function by a DC function, and then investigate online DCA for minimizing the DC loss function. Each iteration of online DCA consists in approximating the current DC loss function by its convex majorization and then solving the resulting convex subproblem. How to approximate the $0 - 1$ loss function by a DC function so that the resulting DCA is efficient is an important issue in the design of algorithm. This constitutes the second contribution and novelty of the paper. Moreover, the solution method for convex subproblems at each iteration is another major issue of DCA. Four versions of online DCA are developed which differ from one of the others by the way for solving the convex subproblems (to get exact or approximate solutions via projected/exponentiated subgradient methods). We prove that these algorithms have the vanishing per-step regret with respect to the system of experts.

The fourth significant contribution of the paper concerns the deployment of the proposed online DCA algorithms for online binary linear classification where each expert is an online classification algorithm. From the regret bound of these algorithms, we derive their bounds on the number of prediction mistakes. We provide several numerical experiments of the proposed algorithms on many benchmark classification datasets and compare them with the well-known weighted majority algorithm and two online convex optimization algorithms.

The rest of the paper is organized as follows. To facilitate the reader's understanding, we give in Sect. 2 a brief introduction to online DC programming and online DCA. In Sect. 3 we show how to investigate online DCA for prediction with expert advice. The application of these proposed algorithms on online binary linear classification is presented in Sect. 4. Finally, Sect. 5 concludes the paper.

Outline of DC programming, DCA and online DCA

DC programming and DCA address DC programs of the form

$\begin{matrix} inf {f (w) : = g (w) - h (w) : w \in R^{d}}, (P_{dc}) \end{matrix}$

Graph

where $g, h \in Γ_{0} (R^{d})$ , the set of all lower semicontinuous proper convex functions on $R^{d}$ . Such a function f is called DC function, and $g - h$ , DC decomposition of f while g and h are DC components of f.

The constrained DC program whose feasible set C is convex always can be transformed into the unconstrained DC program by adding the indicator function of C, denoted by $χ_{C}$ which is defined by $χ_{C} (w) = 0$ if $w \in C$ , and $+ \infty$ otherwise to the first DC component.

$\begin{matrix} inf {f (w) : = g (w) - h (w) : w \in C} \\ = inf {χ_{C} (w) + g (w) - h (w) : w \in R^{d}} . \end{matrix}$

Graph

The main idea of DCA is quite simple. It consists in approximating a DC program by a sequence of convex programs: each iteration l of DCA approximates the concave part $- h$ by its affine majorization (that corresponds to taking $z^{l} \in \partial h (w^{l})$ ) and minimizes the resulting convex function.

The generic DCA scheme can be described as follows.

Graph

DCA has been successfully solved large-scale nonsmooth/nonconvex programs in numerous application areas and has been proved to be a fast and scalable approach which is, thanks to the effect of DC decompositions, more efficient than related methods (see, e.g., [[37]–[46], [51], [53]–[56]] and references therein). For a comprehensible survey on thirty years of development of DCA, the reader is referred to the recent paper [[46]].

The above generic DCA scheme can be adapted for solving online DC problems where the set of predictions is convex and the loss suffered at each step is a DC function as follows. At each learning step, we have to minimize a DC loss function $F_{t}$ under the set of predictions $S$ . We are therefore faced with a (standard) DC program. As we are in the "online" context where data is available in sequential order, solving this DC program completely may not be imperative. Instead, we perform only one iteration of DCA.

Let us denote by T the number of online learning steps. The function $F_{t}$ can be defined either from the cumulative $\sum_{t = 1}^{T} f_{t}$ or the current loss $f_{t}$ . In this paper we define $F_{t}$ simply as the current loss function $f_{t}$ , say $F_{t} = f_{t} : = g_{t} - h_{t}$ with $g_{t}$ and $h_{t}$ being convex functions. The online DCA scheme can be described as follows.

Graph

Now, we are going to show how to develop this online DCA scheme for prediction with expert advice.

Online DCA for prediction with expert advice

In this section, we investigate online DCA for prediction with expert advice based on the weighted average strategy (2) and the $0 - 1$ loss function (4). More precisely, the experts' weights ${w^{t}}$ will be computed by applying the iteration t of online DCA via the loss function (4), and the prediction ${\bar{p}}_{t}$ is defined by (2). As the function (4) is not DC, for applying DCA we have to approximate it by a DC function. We will explain in the next subsection the way to construct such a function. Assuming that $f_{t}$ is a DC function which approximates the loss function (4). The online DCA scheme for prediction with expert advice, named ODCA, is summarized in the following algorithm.

Generic scheme of online DCA for prediction with expert advice

First of all, it is worth mentioning that, in practice, if the prediction using the weight $w^{t - 1}$ is correct (i.e., $ℓ_{t - 1} (p_{t - 1} (w^{t - 1}), y_{t - 1}) = 0$ ), then it is not necessary to update the weight $w^{t}$ (see [[59]]).

Graph

Now we will show how to approximate the $0 - 1$ loss function $ℓ_{t}$ by a DC function $f_{t}$ .

DC approximation loss functions

It is obviously that if the prediction at step t is correct, say $ℓ_{t} ({\bar{p}}_{t}, y_{t}) = 0$ , then we take $f_{t} = 0$ which is a DC function. We can verify easily that the two cases where $ℓ_{t} ({\bar{p}}_{t}, y_{t}) = ℓ_{t} (p_{t} (w^{t}), y_{t}) = 0$ are

$y_{t} = 1$ and $⟨ w^{t}, {\tilde{p}}_{t} ⟩ \geq ρ$ ( $ℓ_{t} (p_{t} (w^{t}), y_{t}) = 1 - p_{t} (w^{t}) = 1_{⟨ w, {\tilde{p}}_{t} ⟩ < ρ} (w^{t}) = 0$ )

$y_{t} = 0$ and $⟨ w^{t}, {\tilde{p}}_{t} ⟩ < ρ$ ( $ℓ_{t} (p_{t} (w^{t}), y_{t}) = p_{t} (w^{t}) = 1_{⟨ w, {\tilde{p}}_{t} ⟩ \geq ρ} (w^{t}) = 0$ ).

Thus, we have to approximate the loss function $ℓ_{t}$ by a DC function $f_{t}$ only in the following three cases $i = 1, 2, 3$ where $ℓ_{t} ({\bar{p}}_{t}, y_{t}) > 0$ :

6 $\begin{matrix} i = \{\begin{matrix} 1 & if y_{t} = 1, ⟨ w^{t}, {\tilde{p}}_{t} ⟩ < ρ, \\ 2 & if y_{t} = 0, ⟨ w^{t}, {\tilde{p}}_{t} ⟩ > ρ, \\ 3 & if y_{t} = 0, ⟨ w^{t}, {\tilde{p}}_{t} ⟩ = ρ . \end{matrix}) \end{matrix}$

Graph

To ensure the boundness on prediction mistakes, $f_{t}$ should be a surrogate function of $ℓ_{t}$ (see, e.g., [[59]]):

7 $\begin{matrix} f_{t} (w^{t}) \geq ℓ_{t} (p_{t} (w^{t}), y_{t}) . \end{matrix}$

Graph

We propose three DC approximation functions, denoted $f_{t}^{(i)}$ ( $i = 1, 2, 3$ ), taking the form of piecewise-linear functions like ramp loss [[11], [31]] and corresponding to the above three cases, as follows (see Fig. 1):

8 $\begin{matrix} f_{t}^{(i)} (w) : = max \{0, min \{1, \frac{ρ - ⟨ w, {\tilde{p}}_{t} ⟩}{ρ - τ_{t}^{(i)}} + δ^{(i)}\}\}, \end{matrix}$

Graph

where

9 $\begin{matrix} τ_{t}^{(i)} : = \{\begin{matrix} max {⟨ w^{t}, {\tilde{p}}_{t} ⟩, τ_{1}} & if i = 1, \\ min {⟨ w^{t}, {\tilde{p}}_{t} ⟩, 2 ρ - τ_{2}} & if i = 2, \\ 2 ρ - τ_{3} & if i = 3, \end{matrix}) \end{matrix}$

Graph

$τ_{i}$ , $i = 1, 2, 3$ , are parameters in $] 0, ρ [$ ; and

10 $\begin{matrix} δ^{(i)} : = \{\begin{matrix} 0 & if i = 1 or i = 2, \\ 1 & if i = 3 . \end{matrix}) \end{matrix}$

Graph

Graph: Fig. 1 Three DC approximation functions ft corresponding to the three cases in (6)

Now, we explain how to obtain the parameters $τ_{t}^{(i)}$ and $δ^{(i)}$ as (9) and (10), respectively. Note that the smaller the parameter $δ^{(i)}$ is, the more the part $'' ℓ_{t} (w) = 1^{''}$ of the $0 - 1$ loss function is approximated. On the other hand, the more the $τ_{t}^{(i)}$ is close to $ρ$ , the better the DC loss function $f_{t}^{(i)}$ would approximate the loss function $ℓ_{t}$ . Thus, our aim is to choose the parameters $τ_{t}^{(i)} > 0$ and $δ^{(i)} \geq 0$ as small as possible such that the functions $f_{t}^{(i)}$ ( $i = 1, 2, 3$ ) satisfy the condition (7).

For Case $i = 1$ , we must choose $τ_{t}^{(1)} < ρ$ (see Fig. 1a). Since $⟨ w^{t}, {\tilde{p}}_{t} ⟩ < ρ$ , the condition (7) is equivalent to

$\begin{matrix} max \{0, min \{1, \frac{ρ - ⟨ w^{t}, {\tilde{p}}_{t} ⟩}{ρ - τ_{t}^{(1)}} + δ^{(1)}\}\} \\ \geq 1 ⟺ \frac{ρ - ⟨ w^{t}, {\tilde{p}}_{t} ⟩}{ρ - τ_{t}^{(1)}} + δ^{(1)} \geq 1 . \end{matrix}$

Graph

By taking $δ^{(1)} = 0$ , we can choose any $τ_{t}^{(1)} \geq ⟨ w^{t}, {\tilde{p}}_{t} ⟩$ to guarantee the condition (7). On the other hand, in our experiments, we propose controlling $τ_{t}^{(1)}$ over all steps by adding a fixed tuning parameter, denoted $τ_{1}$ , in $] 0, ρ [$ . Then, we can take $τ_{t}^{(1)} = max {⟨ w^{t}, {\tilde{p}}_{t} ⟩, τ_{1}}$ .

For Case $i = 2$ , we need $τ_{t}^{(2)} > ρ$ . When taking $δ^{(2)} = 0$ , the condition (7) becomes $τ_{t}^{(2)} \leq ⟨ w^{t}, {\tilde{p}}_{t} ⟩$ . Similarly to Case 1, we use the tuning parameter $τ_{2} \in] 0, ρ [$ and take $τ_{t}^{(2)} = min {⟨ w^{t}, {\tilde{p}}_{t} ⟩, 2 ρ - τ_{2}}$ .

For Case $i = 3$ , since $⟨ w^{t}, {\tilde{p}}_{t} ⟩ = ρ$ , the condition (7) is equivalent to $δ^{(3)} \geq 1$ . In this case, we take $δ^{(3)} = 1$ and $τ_{t}^{(3)} = 2 ρ - τ_{3} > ρ$ where $τ_{3} \in] 0, ρ [$ .

According to Proposition 3 in "Appendix 1", a DC decomposition of $f_{t}^{(i)}$ is given by

11 $\begin{matrix} f_{t}^{(i)} = g_{t}^{(i)} - h_{t}^{(i)}, \end{matrix}$

Graph

where

$\begin{matrix} g_{t}^{(i)} (w) & = max \{0, \frac{ρ - ⟨ w, {\tilde{p}}_{t} ⟩}{ρ - τ_{t}^{(i)}} + δ^{(i)}\}, \\ h_{t}^{(i)} (w) & = max \{0, \frac{τ_{t}^{(i)} - ⟨ w, {\tilde{p}}_{t} ⟩}{ρ - τ_{t}^{(i)}} + δ^{(i)}\} . \end{matrix}$

Graph

Online DCA for predicting the weight wt

We now show how to investigate DCA in the step 3.1.2 of the algorithm ODCA. It consists of computing a subgradient $z^{t - 1} \in h_{t - 1}^{(i)} (w^{t - 1})$ and then computing the weight $w^{t}$ by solving the convex program

12 $\begin{matrix} min {g_{t - 1}^{(i)} (w) - ⟨ z^{t - 1}, w ⟩ : w \in S} \end{matrix}$

Graph

for $i = 1, 2, 3$ according to the three cases considered above.

To solve (12), we propose to use projected/exponentiated subgradient methods (see, e.g., [[5], [34], [60]]).

In particular, for applying the projected subgradient method with step size $η_{t - 1}$ (see, e.g., [[60]]) at step t, we need to compute, at iteration $k \in {0, 1, ...}$ , a subgradient $r^{t - 1, k} \in \partial g_{t - 1}^{(i)} (w^{t - 1, k})$ and then determine $w^{t - 1, k + 1} \in S$ as:

13 $\begin{matrix} w^{t - 1, k + 1} = {Proj}_{S} (w^{t - 1, k} - η_{t - 1} (r^{t - 1, k} - z^{t - 1})) \end{matrix}$

Graph

with $w^{t - 1, 0} : = w^{t - 1}$ .

Clearly, the functions $g_{t - 1}^{(i)}$ and $h_{t - 1}^{(i)}$ are the maximum of two affine functions. Thanks to the rule of computing the subdifferential of a function

$\begin{matrix} h (w) = max_{i = 1, ..., m} h_{i} (w), \end{matrix}$

Graph

where $h_{i}$ , $i = 1, ..., m$ , are convex functions [[61]], we have

14 $\begin{matrix} \partial h (w) = co \cup \{\partial h_{i} (w) : i \in \underset{j \in {1, ..., m}}{argmax} h_{j} (w)\} . \end{matrix}$

Graph

Here, $co (X)$ denotes the convex hull of a set of points $X$ .

Applying (14) to the functions $h_{t - 1}^{(i)}$ and $g_{t - 1}^{(i)}$ with $m = 2$ and $h_{1}, h_{2}$ being affine functions, the computations of $\partial h_{t - 1}^{(i)}$ and $\partial g_{t - 1}^{(i)}$ for $i = 1, 2, 3$ are given by

15 $\begin{matrix} \partial h_{t - 1}^{(i)} (w) & = \{\begin{matrix} \{\frac{- {\tilde{p}}_{t - 1}}{ρ - τ_{t - 1}^{(i)}}\} & if \frac{τ_{t - 1}^{(i)} - ⟨ w, {\tilde{p}}_{t - 1} ⟩}{ρ - τ_{t - 1}^{(i)}} + δ^{(i)} > 0, \\ [\frac{- {\tilde{p}}_{t - 1}}{ρ - τ_{t - 1}^{(i)}}, 0] & if \frac{τ_{t - 1}^{(i)} - ⟨ w, {\tilde{p}}_{t - 1} ⟩}{ρ - τ_{t - 1}^{(i)}} + δ^{(i)} = 0, \\ {0} & otherwise, \end{matrix}) \end{matrix}$

Graph

16 $\begin{matrix} \partial g_{t - 1}^{(i)} (w) & = \{\begin{matrix} \{\frac{- {\tilde{p}}_{t - 1}}{ρ - τ_{t - 1}^{(i)}}\} & if \frac{ρ - ⟨ w, {\tilde{p}}_{t - 1} ⟩}{ρ - τ_{t - 1}^{(i)}} + δ^{(i)} > 0, \\ [\frac{- {\tilde{p}}_{t - 1}}{ρ - τ_{t - 1}^{(i)}}, 0] & if \frac{ρ - ⟨ w, {\tilde{p}}_{t - 1} ⟩}{ρ - τ_{t - 1}^{(i)}} + δ^{(i)} = 0, \\ {0} & otherwise . \end{matrix}) \end{matrix}$

Graph

Here [a, b] denotes the line segment between a and b.

In particular, for $i = 1$ we can take

$\begin{matrix} z^{t - 1} = & \{\begin{matrix} \frac{- {\tilde{p}}_{t - 1}}{ρ - τ_{1}} & if ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩ < τ_{1}, \\ 0 & if τ_{1} \leq ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩ < ρ, \end{matrix}) \\ r^{t - 1, k} = & \{\begin{matrix} \frac{- {\tilde{p}}_{t - 1}}{ρ - max {⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩, τ_{1}}} & if ⟨ w^{t - 1, k}, {\tilde{p}}_{t - 1} ⟩ < ρ, \\ 0 & if ⟨ w^{t - 1, k}, {\tilde{p}}_{t - 1} ⟩ \geq ρ, \end{matrix}) \end{matrix}$

Graph

for $i = 2$ ,

$\begin{matrix} z^{t - 1} = & \{\begin{matrix} \frac{{\tilde{p}}_{t - 1}}{ρ - τ_{2}} & if ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩ > 2 ρ - τ_{2}, \\ 0 & if ρ < ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩ \leq 2 ρ - τ_{2}, \end{matrix}) \\ r^{t - 1, k} = & \{\begin{matrix} \frac{{\tilde{p}}_{t - 1}}{ρ - max {2 ρ - ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩, τ_{2}}} & if ⟨ w^{t - 1, k}, {\tilde{p}}_{t - 1} ⟩ \geq ρ, \\ 0 & if ⟨ w^{t - 1, k}, {\tilde{p}}_{t - 1} ⟩ < ρ, \end{matrix}) \end{matrix}$

Graph

and for $i = 3$ ,

$\begin{matrix} z^{t - 1} = 0, r^{t - 1, k} = \{\begin{matrix} \frac{{\tilde{p}}_{t - 1}}{ρ - τ_{3}} & if ⟨ w^{t - 1, k}, {\tilde{p}}_{t - 1} ⟩ > τ_{3}, \\ 0 & if ⟨ w^{t - 1, k}, {\tilde{p}}_{t - 1} ⟩ \leq τ_{3} . \end{matrix}) \end{matrix}$

Graph

When the convex subproblems in online DCA are completely solved by the projected subgradient method, the corresponding DCA is called complete version of online DCA, and is denoted by ODCA-SG. It is summarized in Algorithm 1.

Graph

The main algorithms

Observe that solving completely the subproblem in step 3.1.2.2 by the projected gradient method may be computationally expensive. Thus, we propose the so-called approximate version of ODCA-SG, named ODCA-SGk, in which the convex subproblem (12) is approximately solved by one iteration of the projected subgradient method. More precisely, in ODCA-SG, at step t, we take $w^{t - 1, 0} = w^{t - 1}$ and

$\begin{matrix} r^{t - 1, 0} = \{\begin{matrix} \frac{- {\tilde{p}}_{t - 1}}{ρ - max {⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩, τ_{1}}} & if ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩ < ρ, \\ \frac{{\tilde{p}}_{t - 1}}{ρ - max {2 ρ - ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩, τ_{2}}} & if ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩ > ρ, \\ \frac{{\tilde{p}}_{t - 1}}{ρ - τ_{3}} & otherwise, \end{matrix}) \end{matrix}$

Graph

and therefore

$\begin{matrix} s^{t - 1, 0} = \{\begin{matrix} \frac{- {\tilde{p}}_{t - 1}}{ρ - ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩} & if τ_{1} \leq ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩ < ρ \\ or ρ < ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩ \leq 2 ρ - τ_{2}, \\ \frac{{\tilde{p}}_{t - 1}}{ρ - τ_{3}} & if ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩ = ρ, \\ 0 & otherwise. \end{matrix}) \end{matrix}$

Graph

Note also that if $⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩ < τ_{1}$ or $⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩ > 2 ρ - τ_{2}$ , then $w^{t} = w^{t - 1, 1} = w^{t - 1, 0} = w^{t - 1}$ , because that $s^{t - 1, 0} = 0$ . Finally, the approximate ODCA-SGk is given in Algorithm 2. Naturally, this version is simpler and faster than ODCA-SG, but the question is how about the regret of this algorithm.

Graph

As the set of weights $S$ defined in (3) is a probability simplex on $R^{d}$ , we can also use exponentiated subgradient method (see, e.g., [[34]]) for solving (5). Similarly to projected subgradient method, we design the complete version and the approximate version of ODCA using the exponentiated subgradient method, named ODCA-ESG and ODCA-ESGk, respectively. ODCA-ESG is described exactly as ODCA-SG except for the way to computing $w^{t - 1, k + 1}$ in step 3.1.2.2.

Graph

As for ODCA-ESGk, it is nothing else ODCA-ESG in which the step 3.1.2.2 is reduced to one iteration.

Graph

Regret bound of the proposed algorithms

Now, we focus on analyzing the regret bound of four online DCA algorithms: ODCA-SGk, ODCA-ESGk, ODCA-SG, and ODCA-ESG.

For $t \in {1, ..., T}$ , we define the regret of an algorithm A until step T by

17 $\begin{matrix} {Regret}_{A}^{T} : = \sum_{t = 1}^{T} f_{t} (w^{t}) - min_{w \in S} \sum_{t = 1}^{T} f_{t} (w), \end{matrix}$

Graph

where the sequence ${w^{1}, w^{2}, ..., w^{T}}$ is generated by the algorithm A.

We aim to prove that four proposed algorithms have a vanishing per-step regret (or a sublinear regret). That is, ${Regret}_{A}^{T}$ grows sublinearly with the number of steps T, i.e., ${lim}_{T \to + \infty} {Regret}_{A}^{T} / T = 0$ . Specially, we can achieve a logarithmic regret bound $O (log (T))$ for ODCA-SGk.

Let us denote by $M$ the set defined by $M : = {t : f_{t} (w^{t}) > 0} .$ We first show in Lemma 1 that the DC function $f_{t}$ satisfies Assumption 1. The proof of this lemma is provided in "Appendix 2".

Assumption 1

There exist $u^{*} \in S$ , positive parameters $α$ , $γ$ , and nonnegative parameter $β$ such that for $t = 1, ..., T$ ,

$u^{*} \in argmin {f_{t} (w) : w \in S}$ ,

$\frac{α}{2} {‖ u^{*} - w^{t} ‖}_{2}^{2} \leq g_{t} (w^{t}) - g_{t} (u^{*}) - ⟨ z^{t}, w^{t} - u^{*} ⟩$ ,

$h_{t} (u^{*}) - h_{t} (w^{t}) - ⟨ z^{t}, u^{*} - w^{t} ⟩ \leq \frac{β}{2} {‖ u^{*} - w^{t} ‖}_{2}^{2}$ ,

$g_{t} (w^{t}) - g_{t} (u^{*}) \leq ⟨ r^{t}, w^{t} - u^{*} ⟩ - \frac{γ}{2} {‖ u^{*} - w^{t} ‖}_{2}^{2}$ , with $r^{t} \in \partial g_{t} (w^{t})$ .

Lemma 1

For the DC functions (8), if there is a vector $u^{*} \in S$ such that for all $t = 1, ..., T$ ,

18 $\begin{matrix} \begin{matrix} ⟨ u^{*}, {\tilde{p}}_{t} ⟩ > ρ & if y_{t} = 1 and ⟨ w^{t}, {\tilde{p}}_{t} ⟩ < ρ, \\ ⟨ u^{*}, {\tilde{p}}_{t} ⟩ < ρ & if y_{t} = 0 and ⟨ w^{t}, {\tilde{p}}_{t} ⟩ > ρ, \\ ⟨ u^{*}, {\tilde{p}}_{t} ⟩ < τ_{3} & if y_{t} = 0 and ⟨ w^{t}, {\tilde{p}}_{t} ⟩ = ρ, \end{matrix} \end{matrix}$

Graph

then there exist $α$ , $γ$ , $u^{*}$ such thatAssumptions1(i), (ii), (iv)are verified; andAssumption1(iii)is satisfied for all $β \geq 0$ .

Next, Theorems 1 and 2 indicate the regret bounds of the four proposed algorithms ODCA-SGk, ODCA-ESGk, ODCA-SG and ODCA-ESG. The proof of Theorem 1 is given in "Appendix 3".

Let us suppose that $K_{t}$ is the number of iterations of subgradient methods at step t, $K = {max}_{t \in {1, ..., T}} K_{t}$ , and L is a positive number satisfying

19 $\begin{matrix} max_{t \in {1, ..., T}} max \{‖ s^{t} ‖_{2}, max_{k \in {1, ..., K_{t}}} {‖ s^{t, k} ‖}_{2}\} \leq L, \\ max_{t \in {1, ..., T}} max \{‖ s^{t} ‖_{\infty}, max_{k \in {1, ..., K_{t}}} {‖ s^{t, k} ‖}_{\infty}\} \leq L, \\ and max_{t \in {1, ..., T}} ‖ w^{t} - u^{*} ‖_{2} \leq min_{t \in {1, ..., T}} η_{t} max {K - 1, 1} L . \end{matrix}$

Graph

Theorem 1

Assume that ODCA-SG and ODCA-SGk generate the sequence ${w^{t}}_{t = 1, ..., T}$ . Then, we have

$\begin{matrix} {Regret}_{O D C A - S G}^{T} \leq & 3 L^{2} \sqrt{T} \sqrt{3 K^{2} - 4 K + 2}, \\ {Regret}_{ODCA - SG k}^{T} \leq & \frac{L^{2} (1 + log (T))}{γ}, \end{matrix}$

Graph

where $γ$ is the positive parameter satisfying

20 $\begin{matrix} γ \leq min_{t \in M} \{\frac{2}{‖ u^{*} - w^{t} ‖_{2}^{2}}, min_{i \in {1, 2, 3}}, κ, (\frac{⟨ u^{*}, {\tilde{p}}_{t} ⟩ - ρ}{ρ - τ_{t}^{(i)}} - δ^{(i)})\}, \end{matrix}$

Graph

and $κ$ is the real function defined by $κ (x) = x$ if $x > 0$ , $+ \infty$ otherwise.

Thanks to the proof of Theorem 1 and Lemma 2 in [[5]], we derive the regret bounds of ODCA-ESG and ODCA-ESGk stated in Theorem 2.

Theorem 2

Assume that ODCA-ESG and ODCA-ESGk generate the sequence ${w^{t}}_{t = 1, ..., T}$ where $w^{1} [i] = 1 / d$ , $i = 1, ..., d$ . Then, we have

$\begin{matrix} {Regret}_{ODCA - ESG}^{T} \leq 2 K L \sqrt{2 log (d) T}, {Regret}_{ODCA - ESG k}^{T} \leq 2 L \sqrt{2 log (d) T} . \end{matrix}$

Graph

Application to online binary linear classification

In this section, we apply the four proposed online DCA algorithms on online binary linear classification (see, e.g., [[8], [59]]). Online binary classification is online learning with the yes/no answers and predictions, in which the prediction set is as same as corrected answer set ${0, 1}$ , and the loss $ℓ ({\bar{p}}_{t}, y_{t})$ is the $0 - 1$ loss function. More precisely, on each round, the learner receives an instance with n features, denoted $x_{t} \in R^{n}$ , and tries to predict ${\bar{p}}_{t} \in {0, 1}$ , and then, the learner will be provided the correct answer $y_{t} \in {0, 1}$ and has to pay the loss $ℓ ({\bar{p}}_{t}, y_{t})$ . The loss $ℓ ({\bar{p}}_{t}, y_{t}) = 0$ when ${\bar{p}}_{t} = y_{t}$ (the prediction is correct) and $ℓ ({\bar{p}}_{t}, y_{t}) = 1$ when ${\bar{p}}_{t} \neq y_{t}$ (the prediction is wrong). In online binary linear classification by prediction with expert advice, at step t, for $i = 1, ..., d$ , the prediction label of ith expert is defined by

21 $\begin{matrix} {\tilde{p}}_{i, t} : = 1_{⟨ u_{i}, x ⟩ \geq 0} (x_{t}), \end{matrix}$

Graph

where $u_{i} \in R^{n}$ ( $i = 1, ..., d$ ) is the given ith linear classifier corresponding to the ith expert

The following propositions provide a mistake bound for our algorithms, i.e., the bound of the number of steps at which ${\bar{p}}_{t} \neq y_{t}$ .

The mistake bound for the proposed ODCA algorithms

Proposition 1

(i) For $w \in S$ , the number of prediction mistakes made byODCA-SGis upper bounded by ${(\bar{c} + \sqrt{{\bar{c}}^{2} + 4 \bar{a}})}^{2} / 4$ where $\bar{a} : = \sum_{t \in M} f_{t} (w)$ and $\bar{c} : = 3 L^{2} \sqrt{3 K^{2} - 4 K + 2}$ .

(ii) For $w \in S$ , the number of prediction mistakes made byODCA-SGkhas an upper bound that is the root, denoted ${\bar{x}}_{1}$ , of equation

$\begin{matrix} x - \bar{a} - \bar{b} (1 + log (x)) = 0, \end{matrix}$

Graph

where $\bar{a} : = \sum_{t \in M} f_{t} (w)$ , $\bar{b} : = L^{2} / γ_{SG}$ , $0 < γ_{SG} \leq min \{γ, L^{2}\}$ , $γ$ is defined by (20). In addition, ${\bar{x}}_{1} \geq \bar{b}$ .

This proposition is proven in detail in "Appendix 4".

Similarly, we present a mistake bound for the ODCA-ESG and ODCA-ESGk in Proposition 2.

Proposition 2

For any $w \in S$ , the number of prediction mistakes made byODCA-ESG (resp. ODCA-ESGk) is upper bounded by ${(\bar{c} + \sqrt{{\bar{c}}^{2} + 4 \bar{a}})}^{2} / 4$ where $\bar{c} : = 2 K L \sqrt{2 log (d)}$ (resp. $\bar{c} : = 2 L \sqrt{2 log (d)}$ ) and $\bar{a} : = \sum_{t \in M} f_{t} (w)$ .

An illustrative example

This section presents an example of our algorithm ODCA-SGk and the well-known weighted majority (WM) algorithm for an online binary classification task over 5 online rounds with the advice of three experts. The 1st, 2nd and 3rd experts are linear classifiers $u_{1} = (0.5, - 1.5)$ , $u_{2} = (2, 4.5)$ , and $u_{3} = (- 0.5, 0)$ , respectively.

When an instance arrives, the experts give prediction labels in ${0, 1}$ using (21). Then, the algorithm computes the experts' weight and then makes the prediction using (2). If ODCA-SGk makes the wrong prediction on the previous round, then it observes the DC loss function and applies online DCA for minimizing this function to get the weight. Otherwise, ODCA-SGk does not update the weight. WM modifies the weight on every round: each weight of the experts that made mistakes is multiplied by $μ \in [0, 1 [$ . Finally, when the true label is revealed, the prediction losses of the algorithms and three experts are computed by (4).

The parameters of the task and the algorithms are given as follows: $ρ = 0.5$ , $μ = 0.5$ , $τ_{1} = τ_{2} = τ_{3} = 0.05$ , $η_{t} = \frac{0.05}{\sqrt{5}}$ ( $t \in {1, ..., 5}$ ), $f_{0} = 0$ . The initial experts' weights of ODCA-SGk (resp. WM) are set to be equal: $w^{0} = (1 / 3, 1 / 3, 1 / 3) \in S$ (resp. $w^{0} = (1, 1, 1)$ ).

Now, ODCA-SGk and WM are implemented step-by-step in Table 1. In this example, the cumulative losses of ODCA-SGk, WM and the 1st, 2nd and 3rd experts after 5 rounds are 1, 2, and 2, 2, 2, respectively. It means that ODCA-SGk predicts better than both the WM and the best expert.

Table 1 An illustrative example of ODCA-SGk and WM

ODCA-SGk	WM
Round 1
1. receive instance $x_{1} = (4, 3)$
2. prediction labels of the 1st, 2nd, 3rd experts are, respectively, 0, 1, 0
3.1. set $w^{1} : = w^{0} = (1 / 3, 1 / 3, 1 / 3)$	3.1. set $w^{1} : = w^{0} = (1, 1, 1)$
3.2. total weight for label 0 is 2/3	3.2. total weight for label 0 is 2
total weight for label 1 is 1/3	total weight for label 1 is 1
prediction label ${\bar{p}}_{1}$ is 0	prediction label is 0
4. receive the true label $y_{1} = 0$
5. prediction loss $ℓ_{1} ({\bar{p}}_{1}, y_{1})$ is 0 (as ${\bar{p}}_{1} = y_{1}$ )	5. prediction loss is 0
prediction losses of the 1st, 2nd, 3rd experts are, respectively, 0, 1, 0
Round 2
1. receive instance $x_{2} = (7, 5)$
2. prediction labels of the 1st, 2nd, 3rd experts are, respectively, 0, 1, 0
3.1. do not update the weight as the	3.1. 2nd expert makes mistake,
prediction on Round 1 is correct:	$w^{2} [2] : = w^{1} [2] . μ = 0.5$
set $w^{2} : = w^{1} = (1 / 3, 1 / 3, 1 / 3)$	$w^{2} [i] : = w^{1} [i] = 1$ $(i = 1, 3)$
3.2. total weight for label 0 is 2/3	3.2. total weight for label 0 is 2
total weight for label 1 is 1/3	total weight for label 1 is 0.5
prediction label ${\bar{p}}_{2}$ is 0	prediction label is 0
4. receive the true label $y_{2} = 1$
observe DC loss function $f_{2}$ as Case 1 in (8):
$f_{2} = max \{0, min \{1, 3 - 6 ⟨ w, {\tilde{p}}_{2} ⟩\}\} = g_{2} - h_{2}$ ,
${\tilde{p}}_{2} = (0, 1, 0)$ , $g_{2} (w) = max \{0, 3 - 6 ⟨ w, {\tilde{p}}_{2} ⟩\}$ ,
$h_{2} (w) = max \{0, 2 - 6 ⟨ w, {\tilde{p}}_{2} ⟩\}$
5. prediction loss $ℓ_{2} ({\bar{p}}_{2}, y_{2})$ is 1 (as ${\bar{p}}_{2} \neq y_{2}$ )	5. prediction loss is 1
prediction losses of the 1st, 2nd, 3rd experts are, respectively, 1, 0, 1
Round 3
1. receive instance $x_{3} = (1, - 3)$
2. prediction labels of the 1st, 2nd, 3rd experts are, respectively, 1, 0, 0
3.1. apply online DCA for minimizing the	3.1. update the weight
DC function $f_{2}$ to get the weight:	$w^{3} [i] : = w^{2} [i] . μ = 0.5 (i = 1, 3)$
$w^{3} = {Proj}_{S} (w^{2} - η_{2} (r^{2} - z^{2}))$ ,	$w^{3} [2] : = w^{2} [2] = 0.5$
where $r^{2} \in \partial g_{2} (w^{2})$ , $z^{2} \in \partial h_{2} (w^{2})$
take $r^{2} = - 6 {\tilde{p}}_{2}$ , $z^{2} = 0$ according to (15), (16)
apply projection algorithm in "Appendix 5":
$w^{3} = (\frac{\sqrt{5} - 0.3}{3 \sqrt{5}}, \frac{\sqrt{5} + 0.6}{3 \sqrt{5}}, \frac{\sqrt{5} - 0.3}{3 \sqrt{5}})$
3.2. total weight for label 0 is $\frac{2 \sqrt{5} + 0.3}{3 \sqrt{5}}$	3.2. total weight for label 0 is 1
total weight for label 1 is $\frac{\sqrt{5} - 0.3}{3 \sqrt{5}}$	total weight for label 1 is 0.5
prediction label is 0	prediction label is 0
4. receive the true label 0
5. prediction loss is 1	5. prediction loss is 1
prediction losses of the 1st, 2nd, 3rd experts are, respectively, 1, 0, 0
Round 4
1. receive instance $x_{4} = (8, - 1)$
2. prediction labels of the 1st, 2nd, 3rd experts are, respectively, 1, 1, 0
3.1. do not update the weight: $w^{4} = w^{3}$	3.1. compute $w^{4} = (0.25, 0.5, 0.5)$
3.2. total weight for label 0 is $\frac{\sqrt{5} - 0.3}{3 \sqrt{5}}$	3.2. total weight for label 0 is 0.5
total weight for label 1 is $\frac{2 \sqrt{5} + 0.3}{3 \sqrt{5}}$	total weight for label 1 is 0.75
prediction label is 1	prediction label is 1
4. receive the true label 1
5. prediction loss is 0	5. prediction loss is 0
prediction losses of the 1st, 2nd, 3rd experts are, respectively, 0, 0, 1
Round 5
1. receive instance $x_{5} = (7, 6)$
2. prediction labels of the 1st, 2nd, 3rd experts are, respectively, 0, 1, 0
3.1. do not update the weight: $w^{5} = w^{4}$	3.1. compute $w^{5} = (0.25, 0.5, 0.25)$
3.2. total weight for label 0 is $\frac{2 \sqrt{5} - 0.6}{3 \sqrt{5}}$	3.2. total weight for label 0 is 0.5
total weight for label 1 is $\frac{\sqrt{5} + 0.6}{3 \sqrt{5}}$	total weight for label 1 is 0.5
prediction label is 0	prediction label is 1
4. receive the true label 0
5. prediction loss is 0	5. prediction loss is 1
prediction losses of the 1st, 2nd, 3rd experts are, respectively, 0, 1, 0

Numerical experiments

In the numerical experiments, we conduct online binary classification tasks with expert advice. In order to construct the group of experts, we used five well-known online classification algorithms ( $d = 5$ ) including perceptron [[50], [57], [62]], relaxed online maximum margin algorithm [[47]], approximate maximal margin classification algorithm [[23]], passive-aggressive learning algorithm [[14]], classic online gradient descent algorithm [[69]]. These five experts are described in detail in "Appendix 6" (see [[32]] for a library of scalable and efficient online classification algorithms).

Our experiment is composed of two parts. In the first experiment we study the efficiency of two versions of online DCA algorithms: the complete one ODCA-SG (resp. ODCA-ESG) and the approximate one ODCA-SGk (resp. ODCA-ESGk). In the second experiment, we compare the notable algorithms (ODCA-SGk and ODCA-ESGk) with the WM algorithm and the two online convex algorithms: online gradient descent with greedy projection (OGD) [[69]] and normalized exponentiated gradient (NEG) [[3], [34]].

The benchmark datasets used in our experiments cover many areas (e.g., social sciences, biology, physics, life sciences), which is shown in Table 2. The type of these datasets is classification with two labels. More details of these datasets are given on the LIBSVM website,[1] UCI Machine Learning, Repository[2] and the references therein.

Table 2 Datasets used in our experiments

Dataset	Name	$#$ Instances	$#$ Features (n)
D1	a8a	32561	123
D2	cod-rna	271617	8
D3	colon-cancer	62	2000
D4	covtype	581012	54
D5	diabetes	768	8
D6	german.number	1000	24
D7	ionosphere	351	34
D8	madelon	1549	500
D9	mushrooms	8124	112
D10	spambase	4601	57
D11	svmguide1	7089	4

Set up experiments All algorithms were implemented in Visual C++ version 11.0 and run on a PC Intel(R) Core(TM) i5-3470 CPU 3.20GHz of 8GB RAM. All experts are the first-order learning algorithms for large-scale online classification tasks in [[32]]. The open-source MATLAB package for the expert algorithms is available in [[32]].

In our experiment, each dataset is randomly divided into two sets as follows. A so-called training set including $20 %$ of the whole data is used by the system of experts to learn linear classifiers $u_{i}$ ( $i = 1, ..., d$ ), while a so-called test set consisting of the remaining dataset is adopted by all algorithms to make the predictions. The initial weight vector $w^{0} \in S$ is set to ${(1 / d, ..., 1 / d)}^{⊤}$ . The index $ρ$ is set to 0.5. We take $f_{0} = 0$ . The explicit projection algorithm ${Proj}_{S}$ on the probability simplex set $S$ is described in [[65]] (see "Appendix 5").

We are interested in the following criteria to evaluate the effectiveness of the proposed algorithms: the percentage of regret (denoted by regret in %) and CPU time (in seconds). The regret is computed as

$\begin{matrix} regret = \frac{R_{T}}{T} . 100 = \frac{L_{T}}{T} . 100 - min_{i = 1, ..., d} \frac{L_{i, T}}{T} . 100, \end{matrix}$

Graph

where $R_{T}$ is defined as (1) and T is the number of steps (corresponding to the number of instances in the test set). The smaller the regret is, the better the algorithm would be. From the definition of regret, the first term is nothing else than the accuracy of the algorithm, while the second term is the accuracy of the best expert in hindsight, which does not depend on the algorithm's predictions. In addition, regret shows us how much the algorithm predicts better or worse than the best expert. Thus, we use the measure regret rather than the accuracy. Other measures like precision and recall are not considered since the ratio between the number of instances of two classes on the considered datasets is small and varies from 1 : 1 to 1 : 2 (resp. 1 : 3) on the datasets D2-D11 (resp. D1).

For a fair comparison, we follow a so-called validation procedure on the test set so as to choose the best parameters for different algorithms (see [[32]]). In particular, we first perform each algorithm by running over one random permutation of the dataset with different parameter values and then take the value corresponding to the smallest mistake. The ranges of parameters for the expert algorithms and existing algorithms are completely described in [[32]] while the approximation parameters and the step size in our algorithms are chosen as follows. All three parameters $τ_{1}$ , $τ_{2}$ , $τ_{3}$ are set to the same positive tuning parameter $τ$ ( $τ < ρ$ ). The step size for ODCA-SG, ODCA-SGk (resp. ODCA-ESG, ODCA-ESGk) is $η_{t} = C / \sqrt{T}$ for all t (resp. $η = C \sqrt{log (d)} / \sqrt{T}$ ). The parameters $τ$ and C are searched from the range of ${0.00, 0.02, ..., 0.48}$ and ${2^{- 4}, 2^{- 3}, ..., 2^{4}}$ , respectively. After the validation procedure, each algorithm is conducted over N runs of different random permutations for each test set with the best parameters chosen. The default tolerance $ε$ is set to $10^{- 4}$ .

Statistical test In two following experiments, we use the statistical tests in order to verify the significance of the numerical results. The authors in [[16], [22]] recommended the use of nonparametric tests for statistical comparisons of classifiers over multiple datasets: the Wilcoxon signed ranks test [[66]] for two classifiers, and the Friedman's test [[20]] followed by a post hoc test for more classifiers. In the first experiment, we apply the Wilcoxon signed ranks test for examining whether the approximate version of online DCA is significantly better the complete one in terms of regret and CPU time. In the second experiment, the statistical significance is analyzed by the Friedman's test. The regret (resp. CPU time) of the algorithms is ranked for each dataset (see Table 4 (resp. Table 6) in the next subsections); here, average ranks are assigned in case of ties. The Friedman's test verifies the null hypothesis that the ranking performances of all algorithms are statistically equivalent. When the null hypothesis is rejected (i.e., at least one pair of algorithms has different performance), a variety of post hoc procedures can be applied to determine which algorithms differ from each other. Pereira et al. [[52]] indicated that the Fisher's least significant differences (LSD) test [[13]] is the most powerful procedure. The reader is referred to an overview of the Friedman's test and a post hoc analysis in [[52]]. To implement the Wilcoxon signed ranks test, the Friedman's test and the Fisher's LSD test, we use the MATLAB functions signrank, friedman and multcompare, respectively.

Experiment 1: comparison between two versions of online DCA-based algorithms

In this subsection, we compare between the complete algorithm ODCA-SG (resp. ODCA-ESG) and the approximate one ODCA-SGk (resp. ODCA-ESGk). We run four algorithms on 11 datasets over 5 runs ( $N = 5$ ). The average regret and CPU time over these 5 runs of four algorithms and the statistical results of the Wilcoxon signed ranks tests are reported in Table 3. Here, we try to test the null hypothesis that the performances of two versions of online DCA are equivalent, against the alternative that the approximate version outperforms the complete one.

Table 3 Average regret and average CPU time in seconds obtained by the four proposed ODCA algorithms over 5 runs and the statistical results of the Wilcoxon signed rank test for each pair. The null hypothesis assumes that the performances of two versions are equivalent. Bold values indicate the best results in each pair

Dataset		ODCA-SGk	ODCA-SG	ODCA-ESGk	ODCA-ESG
D1	regret	0.005	0.810	0.023	0.146
D1	CPU	0.018	0.019	0.017	0.019
D2	regret	− 0.084	1.469	− 0.084	− 0.084
D2	CPU	0.038	0.047	0.035	0.042
D3	regret	0.000	0.000	0.000	0.000
D3	CPU	0.000	0.001	0.000	0.001
D4	regret	2.243	4.510	2.249	11.40
D4	CPU	0.209	0.288	0.188	0.273
D5	regret	− 1.140	− 1.140	− 1.140	− 1.140
D5	CPU	0.000	0.001	0.000	0.001
D6	regret	1.125	1.125	1.300	1.500
D6	CPU	0.000	0.001	0.000	0.001
D7	regret	1.281	2.918	1.281	3.701
D7	CPU	0.000	0.000	0.000	0.000
D8	regret	− 0.016	− 0.016	0.645	1.678
D8	CPU	0.003	0.004	0.003	0.003
D9	regret	− 0.015	0.080	0.064	0.092
D9	CPU	0.004	0.004	0.004	0.004
D10	regret	− 0.217	− 0.217	− 0.217	− 0.217
D10	CPU	0.001	0.002	0.001	0.002
D11	regret	− 0.003	− 0.003	0.518	14.91
D11	CPU	0.001	0.001	0.001	0.002
p-value	regret	0.031		0.007
	CPU	0.003		0.005
Conclusion:	The difference between two versions of online DCA is significant at the 5% level in terms of both regret and CPU time

Comments on numerical results

In terms of both regret and CPU times, the approximate algorithms ODCA-SGk and ODCA-ESGk are significantly more efficient than the complete versions ODCA-SG and ODCA-ESG, respectively. Indeed, ODCA-SGk and ODCA-ESGk run faster than ODCA-SG and ODCA-ESG in most datasets—the ratio of gain is, respectively, up to 1.67 and 1.75 times in the datasets D1, D2, D4, D10 and D11. The regret of the approximate version is better than the complete version in 5/11 datasets (resp. 7/11 datasets)—the difference between ODCA-SGk (resp. ODCA-ESGk) and ODCA-SG (resp. ODCA-ESG) varies from 0.095 to $2.267 %$ (resp. from 0.028 to $14.392 %$ ). In the other datasets, the two versions obtain the same values of regret. Concerning the statistical results, the p-values of each pair in regret (resp. CPU time) are all less than 0.05 (resp. 0.01). Thus, we reject all null hypotheses and conclude that the difference between two versions of online DCA is statistically significant at $5 %$ level on both quality of prediction and rapidity.

Experiment 2: comparison with WM and online convex algorithms

In this experiment, we compare the approximate algorithms ODCA-SGk and ODCA-ESGk with the WM algorithm and the two online convex algorithms, OGD and NEG. The average results, their standard deviation on 20 runs of all algorithms and the statistical results of the Friedman's tests are reported in Tables 4 and 6, respectively. From the results in Table 4, the Friedman's test rejects the null hypothesis that the performances of the five algorithms in $regret$ are equivalent. Therefore, we analyze the pairwise comparison of the algorithms with the Fisher's LSD test. The statistical pairwise comparison results are given in Table 5 and Fig. 2. In addition, Fig. 3 shows the number of mistakes of all five algorithms along online process in the validation procedure on several datasets.

Table 4 Average regret and its standard deviation (std) obtained by ODCA-SGk, ODCA-ESGk, OGD, NEG and WM over 20 runs and their statistical results of the Friedman's test. The rank of the algorithms for each dataset is given in parentheses. The null hypothesis assumes that all the performances of the algorithms are equivalent

Dataset		ODCA-SGk	ODCA-ESGk	OGD	NEG	WM
D1	regret	0.010 (1)	0.013 (2)	0.865 (5)	0.847 (4)	0.083 (3)
D1	std	0.014	0.016	0.095	0.085	0.019
D2	regret	− 0.084 (1.5)	− 0.084 (1.5)	0.711 (5)	0.079 (4)	− 0.079 (3)
D2	std	0.000	0.000	0.033	0.014	0.003
D3	regret	0.000 (2)	0.000 (2)	3.500 (5)	2.200 (4)	0.000 (2)
D3	std	0.000	0.000	3.514	3.516	0.000
D4	regret	0.000 (1)	0.562 (2)	5.960 (4)	2.652 (3)	8.977 (5)
D4	std	0.001	2.446	0.054	0.028	0.024
D5	regret	− 1.140 (2)	− 1.140 (2)	− 1.140 (2)	− 1.116 (4)	0.098 (5)
D5	std	0.000	0.000	0.000	0.232	0.567
D6	regret	0.869 (3)	0.268 (1)	1.063 (5)	1.000 (4)	0.744 (2)
D6	std	0.756	0.782	0.862	0.746	0.341
D7	regret	1.263 (2)	1.281 (3)	1.708 (5)	1.548 (4)	0.765 (1)
D7	std	0.931	0.715	1.117	0.909	0.304
D8	regret	0.153 (1)	0.452 (2)	0.823 (4)	0.791 (3)	1.105 (5)
D8	std	0.364	0.660	0.839	0.903	0.556
D9	regret	− 0.012 (1)	0.064 (5)	0.030 (3)	0.036 (4)	0.014 (2)
D9	std	0.012	0.059	0.036	0.042	0.022
D10	regret	− 0.217 (1.5)	− 0.217 (1.5)	7.311 (5)	2.017 (4)	0.073 (3)
D10	std	0.000	0.000	0.540	0.320	0.038
D11	regret	0.007 (1.5)	0.007 (1.5)	5.820 (5)	2.527 (4)	0.040 (3)
D11	std	0.014	0.014	0.359	0.203	0.017
Mean rank		1.590	2.136	4.363	3.818	3.090
p-value $= 0.000066$
Conclusion: The differences between all algorithms are significant at the $1 %$ level (see Table 5 and Fig. 2 for the pairwise comparison results)

Graph: Fig. 2 Comparison intervals of the five algorithms in terms of regret. The performances of two algorithms are significantly different if their comparison intervals are disjoint. Otherwise, there is no significant difference between them

Table 5 Statistical pairwise comparison results of the five algorithms in terms of regret

Algorithm A	Algorithm B	True difference of mean ranks			p-value	sign
Algorithm A	Algorithm B	Lower CI	Estimate	Upper CI	p-value	sign
ODCA-SGk	ODCA-ESGk	− 1.833	− 0.545	0.742	0.406	−
	OGD	− 4.060	− 2.772	− 1.484	0.000	$+$
	NEG	− 3.515	− 2.227	− 0.939	0.000	$+$
	WM	− 2.787	− 1.500	− 0.212	0.022	$+$
ODCA-ESGk	OGD	− 3.515	− 2.227	− 0.939	0.000	$+$
	NEG	− 2.969	− 1.681	− 0.393	0.010	$+$
	WM	− 2.242	− 0.954	0.333	0.146	−
OGD	NEG	− 0.742	0.545	1.833	0.406	−
	WM	− 0.015	1.272	2.560	0.052	−
NEG	WM	− 0.560	0.727	2.015	0.268	−

CI: $95 %$ confidence interval. $+$ : Algorithm A is significantly better than Algorithm B (since p value $\leq 0.05$ ). −: There is no significant difference between two algorithms A and B

Table 6 Average CPU time in seconds obtained by ODCA-SGk, ODCA-ESGk, OGD, NEG and WM over 20 runs and their statistical results on the Friedman's test. For each dataset, the rank of the algorithms is given in parentheses. The null hypothesis assumes that all the performances of the five algorithms are equivalent

Dataset	ODCA-SGk	ODCA-ESGk	OGD	NEG	WM
D1	0.01865 (4)	0.01855 (3)	0.01885 (5)	0.01835 (2)	0.01805 (1)
D2	0.03900 (3)	0.03585 (1)	0.03985 (4)	0.03745 (2)	0.04055 (5)
D3	0.00060 (4.5)	0.00030 (1)	0.00060 (4.5)	0.00045 (2)	0.00050 (3)
D4	0.19410 (3)	0.17575 (1)	0.20010 (5)	0.19030 (2)	0.19950 (4)
D5	0.00005 (2)	0.00015 (5)	0.00005 (2)	0.00005 (2)	0.00010 (4)
D6	0.00020 (3.5)	0.00010 (2)	0.00020 (3.5)	0.00030 (5)	0.00005 (1)
D7	0.00020 (5)	0.00000 (1.5)	0.00005 (3)	0.00010 (4)	0.00000 (1.5)
D8	0.00320 (2.5)	0.00320 (2.5)	0.00325 (4)	0.00335 (5)	0.00315 (1)
D9	0.00380 (1)	0.00390 (5)	0.00385 (3)	0.00385 (3)	0.00385 (3)
D10	0.00145 (3)	0.00130 (1)	0.00170 (5)	0.00150 (4)	0.00135 (2)
D11	0.00080 (4.5)	0.00065 (1.5)	0.00070 (3)	0.00080 (4.5)	0.00065 (1.5)
Mean rank	3.272	2.227	3.818	3.227	2.454
p value $= 0.093$
Conclusion: There are no significant differences between all algorithms at the $5 %$ level

Graph: Fig. 3 The number of mistakes of all five algorithms with respect to the best value of parameters in the validation procedure on five notable datasets

Comments on numerical results

$∙$ In terms of regret, ODCA-SGk and ODCA-ESGk are the most efficient. In particular, ODCA-SGk is the best with the smallest mean rank. ODCA-SGk is the first best on 9/11 datasets and the second best on 1/11 datasets—the difference with the worst varies from 0.003% to 8.977%. From Fig. 2 and Table 5, we see that ODCA-SGk is significantly better than the existing algorithms OGD, NEG (p-value $= 0$ ) and WM (p-value $< 0.05$ ). With the second smallest mean rank, ODCA-ESGk is the second best and outperforms the existing algorithms on 9/11 datasets (6 for the first best and 3 for the second best)—the difference varies from $0.005 %$ to $8.415 %$ . ODCA-ESGk is significantly different from OGD (p-value $= 0$ ) and NEG (p-value $= 0.01$ ); but there is no significant difference between ODCA-ESGk and WM. Notice that the regret of ODCA-ESGk and ODCA-SGk are fairly comparable on 8/11 datasets with a little difference in the interval $[0 %, 0.076 %]$ ; however, this difference is no statistically significant since p-value $> 0.05$ . In addition, the regret obtained by our algorithms is actually small and stable (with small standard deviations) on most of the datasets. In fact, there are several datasets (e.g., D2, D5, D9, D10) on which ODCA-SGk and ODCA-ESGk gave the negative value of regret, especially, for the large-scale datasets D2 (271617 instances) and D4 (581012 instances) (see Table 4). That is to say, our algorithms can make predictions even better than the best experts. Moreover, Fig. 3 shows that the number of mistakes of our proposed algorithms is less than that of other algorithms during the online process, particularly in large-scale datasets. It is worth noting that there are no significant differences between the three existing algorithms.

$∙$ Concerning CPU time: all five algorithms run very fast (less than 0.2 seconds on all datasets). For the large datasets D2 and D4, the rapidity of the algorithms can be classified as follows: ODCA-ESGk and NEG are the fastest algorithms, ODCA-SGk comes next and finally, OGD and WM. From the statistical results in Table 6, we can conclude that there are no significant differences between all five algorithms. In other words, all algorithms run comparably.

Conclusion

We have investigated an online DCA-based approach for prediction with expert advice. The main idea is to approximate the $0 - 1$ loss function in the prediction with expert advice scheme by DC functions to develop online DCA for computing the weight of experts and then determine the prediction via the weighted strategy. We have proposed four online DCA-based algorithms: the two complete (resp. approximate) versions of ODCA: ODCA-SG, ODCA-ESG (resp. ODCA-SGk, ODCA-ESGk) where the convex subproblem in DCA's iteration is solved completely by (resp. is approximately solved by one iteration of) projected/exponentiated subgradient methods. We have proved that ODCA-SGk archives the logarithmic regret, while the others have the sublinear regret. It turns out from numerical results on various benchmark classification datasets that the approximate algorithm ODCA-SGk (resp. ODCA-ESGk) is significantly more efficient than the complete algorithm ODCA-SG (resp. ODCA-ESG) on both rapidity and quality of prediction. On the one hand, in terms of quality, our algorithms ODCA-SGk and ODCA-ESGk outperform several existing approaches, among them, ODCA-SGk is the best.

The proposed algorithms can be applied to other classes of online learning problems. Furthermore, besides prediction with expert advice, online DCA can also be investigated in other approaches for online learning. Works in these directions are in progress.

Acknowledgements

This research is funded by Foundation for Science and Technology Development of Ton Duc Thang University (FOSTECT), website: http://fostect.tdtu.edu.vn, under Grant FOSTECT.2017.BR.10.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Appendix 1: DC decomposition of ft(i)

We present the following proposition to get a DC decomposition of $f_{t}^{(i)}$ .

Proposition 3

Let $a > 0$ , b, candxbe three constants and a given vector, respectively. The function

22 $\begin{matrix} f (w) = max {0, min {a, b ⟨ w, x ⟩ + c}}, \end{matrix}$

Graph

is a DC function with DC components

$\begin{matrix} g (w) = max {0, b ⟨ w, x ⟩ + c} and h (w) = max {0, b ⟨ w, x ⟩ + c - a} . \end{matrix}$

Graph

Proof

Knowing from [[54]] that if $f = g - h$ is a DC function, then the function

23 $\begin{matrix} max {0, g - h} = max {g, h} - h \end{matrix}$

Graph

is DC too. We see that the function

$\begin{matrix} min {a, b ⟨ w, x ⟩ + c} = (b ⟨ w, x ⟩ + c) - max {0, b ⟨ w, x ⟩ + c - a} \end{matrix}$

Graph

is DC. Therefore, applying (23) for $g = b ⟨ w, x ⟩ + c$ and $h = max {0, b ⟨ w, x ⟩ + c - a}$ , we get immediately a DC decomposition of f, that is,

$\begin{matrix} f (w) = max {b ⟨ w, x ⟩ + c, max {0, b ⟨ w, x ⟩ + c - a}} \\ - max {0, b ⟨ w, x ⟩ + c - a} . \end{matrix}$

Graph

However, since $a > 0$ , we have

$\begin{matrix} max {b ⟨ w, x ⟩ + c, max {0, b ⟨ w, x ⟩ + c - a}} = max {0, b ⟨ w, x ⟩ + c} . \end{matrix}$

Graph

Hence, we complete the proof. $□$

Appendix 2: Proof of Lemma 1

Proof

First, we observe that when ${\tilde{p}}_{t - 1} = 0$ or $τ_{t - 1}^{(1)} > ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩$ (resp. $τ_{t - 1}^{(2)} < ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩$ ), our algorithms do not update the weight, and thus, the result in Lemma 1 is straightforward. Hence, we need to consider only the cases where, at each step $t \in M$ ,

24 $\begin{matrix} τ_{t - 1}^{(1)} = ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩ (resp . τ_{t - 1}^{(2)} = ⟨ w^{t - 1}, {\tilde{p}}_{t - 1} ⟩) and {\tilde{p}}_{t - 1} \neq 0 . \end{matrix}$

Graph

Since $u^{*}$ satisfies (18) and ${\tilde{p}}_{t} \neq 0$ , Assumption 1 (i) is satisfied for DC functions (8). We see that $‖ u^{*} - w^{t} ‖_{2} \neq 0$ for all t because if there is some step t such that $u^{*} = w^{t}$ , then $⟨ u^{*}, {\tilde{p}}_{t} ⟩ = ⟨ w^{t}, {\tilde{p}}_{t} ⟩$ , which contradicts (18).

Next, we verify Assumptions 1 (ii)–(iv) for the DC functions $f_{t}^{(1)}$ .

$∙$ Let us define the function ${\bar{g}}_{t}^{(1)} : = g_{t}^{(1)} - ⟨ z^{t}, \cdot ⟩ .$ From (24), we derive

$\begin{matrix} {\bar{g}}_{t}^{(1)} (w^{t}) - {\bar{g}}_{t}^{(1)} (u^{*}) = \frac{ρ - ⟨ w^{t}, {\tilde{p}}_{t} ⟩}{ρ - τ_{t}^{(1)}} = 1 > 0 . \end{matrix}$

Graph

Thus, there exists a positive number $α \leq min_{t \in M} \frac{2}{‖ u^{*} - w^{t} ‖_{2}^{2}}$ such that Assumption 1 (ii) is satisfied.

$∙$ For any $β \geq 0$ , we have

$\begin{matrix} h_{t}^{(1)} (u^{*}) - h_{t}^{(1)} (w^{t}) - ⟨ z^{t}, u^{*} - w^{t} ⟩ = 0 \leq \frac{β}{2} {‖ u^{*} - w^{t} ‖}_{2}^{2} . \end{matrix}$

Graph

Thus, Assumption 1 (iii) is satisfied.

$∙$ Assumption 1 (iv) is also verified with $γ \leq min_{t \in M} \frac{2 (⟨ u^{*}, {\tilde{p}}_{t} ⟩ - ρ)}{(ρ - τ_{t}^{(1)}) {‖ u^{*} - w^{t} ‖}_{2}^{2}}$ since

$\begin{matrix} g_{t}^{(1)} (u^{*}) - g_{t}^{(1)} (w^{t}) - ⟨ r^{t}, u^{*} - w^{t} ⟩ = \frac{⟨ u^{*}, {\tilde{p}}_{t} ⟩ - ρ}{ρ - τ_{t}^{(1)}} \\ \geq \frac{γ}{2} {‖ u^{*} - w^{t} ‖}_{2}^{2} . \end{matrix}$

Graph

Similarly, as for the DC functions $f_{t}^{(i)}$ ( $i = 2, 3$ ), Assumptions 1 (i)–(iv) are satisfied if

$\begin{matrix} α \leq & min_{t \in M} \frac{2}{‖ u^{*} - w^{t} ‖_{2}^{2}}, β \geq 0, \\ γ \leq & min_{t \in M} \frac{2}{‖ u^{*} - w^{t} ‖_{2}^{2}} (\frac{⟨ u^{*}, {\tilde{p}}_{t} ⟩ - ρ}{ρ - τ_{t}^{(i)}} - δ^{(i)}) . \end{matrix}$

Graph

The proof of Lemma 1 is established. $□$

Appendix 3: Proof of Theorem 1

Proof

First of all, we analyze the regret bound of ODCA-SG.

From the definition (17), we have

25 $\begin{matrix} {Regret}_{ODCA-SG}^{T} = \sum_{t = 1}^{T} f_{t} (w^{t}) - min_{w \in S} \sum_{t = 1}^{T} f_{t} (w) \\ \leq \sum_{t = 1}^{T} [f_{t} (w^{t}) - min_{w \in S} f_{t} (w)] . \end{matrix}$

Graph

It derives from Assumption 1 (i) that

26 $\begin{matrix} f_{t} (w^{t}) - min_{w \in S} f_{t} (w) = f_{t} (w^{t}) - f_{t} (u^{*}) \\ = [{\bar{g}}_{t} (w^{t}) - {\bar{g}}_{t} (u^{*})] + [h_{t} (u^{*}) - h_{t} (w^{t}) - ⟨ z^{t}, u^{*} - w^{t} ⟩], \end{matrix}$

Graph

where the convex function ${\bar{g}}_{t} : = g_{t} - ⟨ z^{t}, \cdot ⟩$ for $t = 1, ..., T$ .

From (25), (26) and Assumptions 1 (ii)–(iii) with the choice $β = α$ , we obtain

27 $\begin{matrix} {Regret}_{ODCA-SG}^{T} \leq 2 \sum_{t = 1}^{T} [{\bar{g}}_{t} (w^{t}) - {\bar{g}}_{t} (u^{*})] \\ = 2 \sum_{t = 1}^{T} [{\bar{g}}_{t} (w^{t, 0}) - {\bar{g}}_{t} (u^{*})] \end{matrix}$

Graph

28 $\begin{matrix} \leq 2 \sum_{t = 1}^{T} \{\sum_{k = 0}^{K_{t} - 2} [{\bar{g}}_{t} (w^{t, k}) - {\bar{g}}_{t} (w^{t, k + 1})] + [{\bar{g}}_{t} (w^{t, K_{t} - 1}) - {\bar{g}}_{t} (u^{*})]\} \\ \leq 2 \sum_{t = 1}^{T} \{\sum_{k = 0}^{K_{t} - 2} ⟨ s^{t, k}, w^{t, k} - w^{t, k + 1} ⟩ + ⟨ s^{t, K_{t} - 1}, w^{t, K_{t} - 1} - u^{*} ⟩\} . \end{matrix}$

Graph

The last inequality holds as $s^{t, k} \in \partial {\bar{g}}_{t} (w^{t, k})$ , $k = 0, 1, ..., K_{t} - 1$ .

Similarly to Theorem 3.1 in [[30]], we can derive from (13) an upper bound of $⟨ s^{t, K_{t} - 1}, w^{t, K_{t} - 1} - u^{*} ⟩$ as follows:

29 $\begin{matrix} ⟨ s^{t, K_{t} - 1}, w^{t, K_{t} - 1} - u^{*} ⟩ \\ \leq \frac{‖ w^{t, K_{t} - 1} - u^{*} ‖_{2}^{2} - {‖ w^{t, K_{t}} - u^{*} ‖}_{2}^{2}}{2 η_{t}} + \frac{η_{t}}{2} {‖ s^{t, K_{t} - 1} ‖}_{2}^{2} . \end{matrix}$

Graph

Combining (19) and the fact that

$\begin{matrix} ‖ w^{t, K_{t} - 1} - u^{*} ‖_{2} \leq ‖ w^{t, K_{t} - 1} - w^{t, 0} ‖_{2} + {‖ w^{t, 0} - u^{*} ‖}_{2} \\ \leq \sum_{k = 0}^{K_{t} - 2} ‖ w^{t, k + 1} - w^{t, k} ‖_{2} + {‖ w^{t} - u^{*} ‖}_{2} \\ \leq \sum_{k = 0}^{K_{t} - 2} η_{t} ‖ s^{t, k} ‖_{2} + {‖ w^{t} - u^{*} ‖}_{2} \\ \leq η_{t} (K - 1) L + {‖ w^{t} - u^{*} ‖}_{2}, \end{matrix}$

Graph

we yield

30 $\begin{matrix} ‖ w^{t, K_{t} - 1} - u^{*} ‖_{2}^{2} \leq {‖ w^{t} - u^{*} ‖}_{2}^{2} + 3 η_{t}^{2} {(K - 1)}^{2} L^{2} . \end{matrix}$

Graph

It implies

31 $\begin{matrix} ⟨ s^{t, K_{t} - 1}, w^{t, K_{t} - 1} - u^{*} ⟩ \\ \leq \frac{‖ w^{t} - u^{*} ‖_{2}^{2} - {‖ w^{t + 1} - u^{*} ‖}_{2}^{2}}{2 η_{t}} + \frac{η_{t}}{2} (3 K^{2} - 6 K + 4) L^{2} . \end{matrix}$

Graph

Similarly, we get

32 $\begin{matrix} \sum_{k = 0}^{K_{t} - 2} ⟨ s^{t, k}, w^{t, k} - w^{t, k + 1} ⟩ \\ \leq \sum_{k = 0}^{K_{t} - 2} [\frac{‖ w^{t, k} - w^{t, k + 1} ‖_{2}^{2}}{2 η_{t}} + \frac{η_{t}}{2} {‖ s^{t, k} ‖}_{2}^{2}] \\ \leq \sum_{k = 0}^{K_{t} - 2} [η_{t}, {‖ s^{t, k} ‖}_{2}^{2}] \leq η_{t} (K - 1) L^{2} . \end{matrix}$

Graph

We deduce from (28), (31), (32) that

$\begin{matrix} {Regret}_{ODCA-SG}^{T} \\ \leq 2 \sum_{t = 1}^{T} [\frac{‖ w^{t} - u^{*} ‖_{2}^{2} - {‖ w^{t + 1} - u^{*} ‖}_{2}^{2}}{2 η_{t}} + \frac{η_{t}}{2} (3 K^{2} - 4 K + 2) L^{2}] \\ \leq 2 \sum_{t = 1}^{T} [‖ w^{t} - u^{*} ‖_{2}^{2} (\frac{1}{2 η_{t}} - \frac{1}{2 η_{t - 1}}) + \frac{η_{t}}{2} (3 K^{2} - 4 K + 2) L^{2}], \end{matrix}$

Graph

where, by convention, $\frac{1}{η_{0}} : = 0$ .

Let us define $η_{t} : = \frac{1}{\sqrt{t} \sqrt{3 K^{2} - 4 K + 2}}$ for all $t = 1, ..., T$ . We have

$\begin{matrix} {Regret}_{ODCA-SG}^{T} & \leq 2 \sqrt{3 K^{2} - 4 K + 2} [\frac{L^{2} \sqrt{T}}{2} + \frac{L^{2}}{2} (2, \sqrt{T})] \\ \leq 3 L^{2} \sqrt{T} \sqrt{3 K^{2} - 4 K + 2} . \end{matrix}$

Graph

As for ODCA-SGk, since Assumption 1 (iv) is also satisfied, we can derive from (27) that

$\begin{matrix} {Regret}_{ODCA-SG k}^{T} \leq 2 \sum_{t = 1}^{T} [⟨ s^{t}, w^{t} - u^{*} ⟩ - \frac{γ}{2} {‖ u^{*} - w^{t} ‖}^{2}] \\ \leq 2 \sum_{t = 1}^{T} [‖ w^{t} - u^{*} ‖_{2}^{2} (\frac{1}{2 η_{t}} - \frac{1}{2 η_{t - 1}} - \frac{γ}{2}) + \frac{η_{t}}{2} L^{2}] . \end{matrix}$

Graph

Defining $η_{t} : = \frac{1}{γ t}$ for all $t = 1, ..., T$ , we obtain

$\begin{matrix} {Regret}_{ODCA-SG k}^{T} \leq \frac{L^{2} (1 + log (T))}{γ} . \end{matrix}$

Graph

The proof of Theorem 1 is established. $□$

Appendix 4: Proof of Proposition 1

Proof

(i) From the condition (7) and Theorem 1, we have, for any $w \in S$ ,

$| M | \leq \sum_{t \in M} f_{t} (w^{t}) \leq \sum_{t \in M} f_{t} (w) + 3 L^{2} \sqrt{| M |} \sqrt{3 K^{2} - 4 K + 2} .$

Graph

Here, $| M |$ is the number of the steps in $M$ .

It implies that $| M | \leq \bar{a} + \bar{c} \sqrt{| M |}$ . Thus, the proof of (i) is complete.

(ii) From the definition of $γ_{SG}$ , we derive that for any $w \in S$ ,

33 $\begin{matrix} | M | \leq \sum_{t \in M} f_{t} (w^{t}) \leq \sum_{t \in M} f_{t} (w) + \frac{L^{2} (1 + log (| M |))}{γ_{SG}} . \end{matrix}$

Graph

It implies

$\begin{matrix} | M | \leq \bar{a} + \bar{b} (1 + log (| M |)) . \end{matrix}$

Graph

Considering the strictly convex function $r : (0, + \infty) \to R$ ,

$\begin{matrix} r (x) = x - \bar{a} - \bar{b} (1 + log (x)) . \end{matrix}$

Graph

Since ${lim}_{x \to 0^{+}} r (x) = {lim}_{x \to + \infty} r (x) = + \infty$ and $r (\bar{b}) \leq 0$ , equation $r (x) = 0$ has two roots ${\bar{x}}_{1}$ , ${\bar{x}}_{2}$ such that $0 < {\bar{x}}_{2} \leq \bar{b} \leq {\bar{x}}_{1}$ . The proof of (ii) is established. $□$

Appendix 5: Euclidean projection onto the probability simplex

Graph

Appendix 6: Description of the experts

We give here a description of the five experts used in the numerical experiments. They are well-known online classification algorithms: perceptron [[50], [57], [62]], relaxed online maximum margin algorithm [[47]], approximate maximal margin classification algorithm [[23]], passive-aggressive learning algorithm [[14]], classic online gradient descent algorithm [[69]].

Note that, in this paper, we consider the outcome space $Y = {0, 1}$ and the prediction label ${\tilde{p}}_{i, t} = 1_{⟨ u_{i}, x ⟩ \geq 0} (x_{t}) \in {0, 1}$ , where $u_{i}$ is the linear classifier of ith expert. Therefore, in the description below, the ${0, 1}$ label is used instead to ${- 1, 1}$ which are often used in linear classification algorithms.

First, the perceptron algorithm is known as the earliest, simplest approach for online binary linear classification [[57]].

Graph

Second, the relaxed online maximum margin algorithm [[47]] is an incremental algorithm for classification using a linear threshold function. It can be seen as a relaxed version of the algorithm that searches for the separating hyperplane which maximizes the minimum distance from previous instances classified correctly.

Graph

Third, the approximate maximal margin classification algorithm [[23]] consists in approximating the maximal margin hyperplane with respect to $ℓ_{p}$ -norm ( $p \geq 2$ ) for a set of linearly separable data. The proposed algorithm in [[23]] is called Approximate Large Margin Algorithm.

Graph

Fourth, the passive-aggressive learning algorithm [[14]] computes the classifier based on analytical solution to simple constrained optimization problem which minimizes the distance from the current classifier $u^{t}$ to the half-space of vectors which are of the zero hinge-loss on the current sample.

Graph

Finally, the classic online gradient descent algorithm [[69]] uses the gradient descent method for minimizing the hinge-loss function.

Graph

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References 1 Alexander L, Das SR, Ives Z, Jagadish H, Monteleoni C. Research challenges in financial data modeling and analysis. Big Data. 2017; 5; 3: 177-188. 10.1089/big.2016.0074 2 Angluin D. Queries and concept learning. Mach Learn. 1988; 2; 4: 319-342. 3363446 3 Azoury K, Warmuth MK. Relative loss bounds for on-line density estimation with the exponential family of distributions. Mach Learn. 2001; 43; 3: 211-246. 0988.68173. 10.1023/A:1010896012157 4 Barzdin JM, Freivald RV. On the prediction of general recursive functions. Sov Math Doklady. 1972; 13: 1224-1228. 0267.02029 5 Cesa-Bianchi N. Analysis of two gradient-based algorithms for on-line regression. J Comput Syst Sci. 1999; 59; 3: 392-411. 1726769. 0961.68148. 10.1006/jcss.1999.1635 6 Cesa-Bianchi N, Freund Y, Haussler D, Helmbold DP, Schapire RE, Warmuth MK. How to use expert advice. J ACM. 1997; 44; 3: 427-485. 1470152. 0890.68066. 10.1145/258128.258179 7 Cesa-Bianchi N, Lugosi G. Potential-based algorithms in on-line prediction and game theory. Mach Learn. 2003; 51; 3: 239-261. 1026.68152. 10.1023/A:1022901500417 8 Cesa-Bianchi N, Lugosi G. Prediction, learning, and games. 2006: New York; Cambridge University Press. 1114.91001. 10.1017/CBO9780511546921 9 Cesa-Bianchi N, Mansour Y, Stoltz G. Improved second-order bounds for prediction with expert advice. Mach Learn. 2007; 66; 2: 321-352. 1137.68525. 10.1007/s10994-006-5001-7 Chung TH (1994) Approximate methods for sequential decision making using expert advice. In: Proceedings of the seventh annual conference on computational learning theory, COLT '94, pp 183–189. ACM, New York, NY, USA Collobert R, Sinz F, Weston J, Bottou L. Large scale transductive SVMs. J Mach Learn Res. 2006; 7: 1687-1712. 2274421. 1222.68173 Collobert R, Sinz F, Weston J, Bottou L (2006) Trading convexity for scalability. In: Proceedings of the 23rd international conference on machine learning, ICML '06, pp 201–208. New York, NY, USA Conover WJ. Pratical nonparametric statistics. 19993: Hoboken; Wiley Crammer K, Dekel O, Keshet J, Shalev-Shwartz S, Singer Y. Online passive-aggressive algorithms. J Mach Learn Res. 2006; 7: 551-585. 2274378. 1222.68177 Dadkhahi H, Shanmugam K, Rios J, Das P, Hoffman SC, Loeffler TD, Sankaranarayanan S (2020) Combinatorial black-box optimization with expert advice. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1918–1927. Association for Computing Machinery, New York, NY, USA Demšar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006; 7: 1-30. 2274360. 1222.68184 DeSantis A, Markowsky G, Wegman MN (1988) Learning probabilistic prediction functions. In: Proceedings of the first annual workshop on computational learning theory, COLT'88, pp. 312–328. Morgan Kaufmann Publishers Inc, San Francisco, CA, USA Devaine M, Gaillard P, Goude Y, Stoltz G. Forecasting electricity consumption by aggregating specialized experts. Mach Learn. 2013; 90; 2: 231-260. 3015743. 1260.62090. 10.1007/s10994-012-5314-7 Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997; 55; 1: 119-139. 1473055. 0880.68103. 10.1006/jcss.1997.1504 Friedman M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc. 1937; 32; 200: 675-701. 63.1098.02. 10.1080/01621459.1937.10503522 Friedman M. A comparison of alternative tests of significance for the problem of rankings. Ann Math Stat. 1940; 11; 1: 86-92. 2085. 66.1305.08. 10.1214/aoms/1177731944 García S, Herrera F. An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons. J Mach Learn Res. 2009; 9: 2677-2694. 1225.68178 Gentile C. A new approximate maximal margin classification algorithm. J Mach Learn Res. 2002; 2: 213-242. 1904759. 1037.68124 Gentile C. The robustness of the -norm algorithms. Mach Learn. 2003; 53; 3: 265-299. 1089.68093. 10.1023/A:1026319107706 Gollapudi S, Panigrahi D (2019) Online algorithms for rent-or-buy with expert advice. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th International Conference on Machine Learning, vol. 97, pp 2319–2327. PMLR, Long Beach, California, USA Gramacy RB, Warmuth MKK, Brandt SA, Ari IBecker S, Thrun S, Obermayer K. Adaptive caching by refetching. Advances in neural information processing systems. 2003: Cambridge; MIT Press: 1489-1496 Grove AJ, Littlestone N, Schuurmans D. General convergence results for linear discriminant updates. Mach Learn. 2001; 43; 3: 173-210. 0988.68147. 10.1023/A:1010844028087 Hao S, Hu P, Zhao P, Hoi SCH, Miao C. Online active learning with expert advice. ACM Trans Knowl Discov Data. 2018; 12; 5: 1-22. 10.1145/3201604 Haussler D, Kivinen J, Warmuth MKVitányi P. Tight worst-case loss bounds for predicting with expert advice. Computational learning theory, lecture notes in computer Science. 1995: Berlin; Springer: 69-83 Hazan E. Introduction to online convex optimization. Found Trends Optim. 2016; 2; 3–4: 157-325. 10.1561/2400000013 Ho VT, Le Thi HA, Bui DC (2016) Online DC optimization for online binary linear classification. In: Nguyen TN, Trawiński B, Fujita H, Hong TP (eds) Intelligent information and database systems: 8th Asian conference, ACIIDS 2016, proceedings, Part II. Springer, Berlin, pp 661–670 Hoi SCH, Wang J, Zhao P. LIBOL: a library for online learning algorithms. J Mach Learn Res. 2014; 15; 1: 495-499. 1317.68164 Jamil W, Bouchachia ALotfi A, Bouchachia H, Gegov A, Langensiepen C, McGinnity M. Model selection in online learning for times series forecasting. Advances in computational intelligence systems. 2019: Cham; Springer: 83-95. 10.1007/978-3-319-97982-3_7 Kivinen J, Warmuth MK. Exponentiated gradient versus gradient descent for linear predictors. Inf Comput. 1997; 132; 1: 1-63. 1429254. 0872.68158. 10.1006/inco.1996.2612 Kivinen J, Warmuth MK. Relative loss bounds for multidimensional regression problems. Mach Learn. 2001; 45; 3: 301-329. 0998.68097. 10.1023/A:1017938623079 Kveton B, Yu JY, Theocharous G, Mannor S (2008) Online learning with expert advice and finite-horizon constraints. In: Proceedings of the twenty-third AAAI conference on artificial intelligence, AAAI 2008, pp 331–336. AAAI Press Le Thi HA (1994) Analyse numérique des algorithmes de l'optimisation d. C. Approches locale et globale. Codes et simulations numériques en grande dimension. Applications. Ph.D. thesis, University of Rouen, France Le Thi HA. DC programming and DCA for supply chain and production management: state-of-the-art models and methods. Int J Prod Res. 2020; 58; 20: 6078-6114. 10.1080/00207543.2019.1657245 Le Thi HA, Ho VT, Pham Dinh T. A unified DC programming framework and efficient DCA based approaches for large scale batch reinforcement learning. J Glob Optim. 2019; 73; 2: 279-310. 3908360. 1434.90159. 10.1007/s10898-018-0698-y Le Thi HA, Le HM, Phan DN, Tran B. Stochastic DCA for minimizing a large sum of DC functions with application to multi-class logistic regression. Neural Netw. 2020; 132: 220-231. 10.1016/j.neunet.2020.08.024 Le Thi HA, Moeini M, Pham Dinh T. Portfolio selection under downside risk measures and cardinality constraints based on DC programming and DCA. Comput Manag Sci. 2009; 6; 4: 459-475. 2534832. 1188.90185. 10.1007/s10287-009-0098-3 Le Thi HA, Pham Dinh TMigdalas A, Pardalos PM, Värbrand P. DC programming approach to the multidimensional scaling problem. From local to global optimization. 2001: Boston; Springer: 231-276. 1007.90057 Le Thi HA, Pham Dinh T. Large-scale molecular optimization from distance matrices by a D.C. optimization approach. SIAM J Optim. 2003; 14; 1: 77-114. 2005937. 1075.90071. 10.1137/S1052623498342794 Le Thi HA, Pham Dinh T. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Ann Oper Res. 2005; 133; 1–4: 23-48. 2119311. 1116.90122 Le Thi HA, Pham Dinh T. DC programming in communication systems: challenging problems and methods. Vietnam J Comput Sci. 2014; 1; 1: 15-28. 10.1007/s40595-013-0010-5 Le Thi HA, Pham Dinh T. DC programming and DCA: thirty years of developments. Math Program Spec Issue DC Program Theory Algorithms Appl. 2018; 169; 1: 5-68. 3785670. 1387.90197 Li Y, Long P. The relaxed online maximum margin algorithm. Mach Learn. 2002; 46; 1–3: 361-387. 0998.68110. 10.1023/A:1012435301888 Littlestone N, Warmuth MK. The weighted majority algorithm. Inf Comput. 1994; 108; 2: 212-261. 1265851. 0804.68121. 10.1006/inco.1994.1009 Nayman N, Noy A, Ridnik T, Friedman I, Jin R, Zelnik-Manor L (2019) XNAS: neural architecture search with expert advice. In: Advances in neural information processing systems 32: annual conference on neural information processing systems 2019, NeurIPS 2019, pp 1975–1985 Novikoff AB (1963) On convergence proofs for perceptrons. In: Proceedings of the symposium on the mathematical theory of automata 12:615–622 Ong CS, Le Thi HA. Learning sparse classifiers with difference of convex functions algorithms. Optim Methods Softw. 2013; 28; 4: 830-854. 3175446. 1282.90181. 10.1080/10556788.2011.652630 Pereira DG, Afonso A, Medeiros FM. Overview of Friedman's test and post-hoc analysis. Commun Stat Simul Comput. 2014; 44; 10: 2636-2653. 3366193. 10.1080/03610918.2014.931971 Pham Dinh T, Le HM, Le Thi HA, Lauer F. A difference of convex functions algorithm for switched linear regression. IEEE Trans Autom Control. 2014; 59; 8: 2277-2282. 3245274. 1360.62386. 10.1109/TAC.2014.2301575 Pham Dinh T, Le Thi HA. Convex analysis approach to D.C. programming: theory, algorithm and applications. Acta Math Vietnam. 1997; 22; 1: 289-355. 1479751. 0895.90152 Pham Dinh T, Le Thi HA. DC optimization algorithms for solving the trust region subproblem. SIAM J Optim. 1998; 8; 2: 476-505. 1618531. 0913.65054. 10.1137/S1052623494274313 Pham Dinh T, Le Thi HANguyen NT, Le Thi HA. Recent advances in DC programming and DCA. Transactions on computational intelligence XIII. 2014: Berlin; Springer: 1-37. 10.1007/978-3-642-54455-2_1 Rosenblatt F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev. 1958; 65; 6: 386-408. 10.1037/h0042519 Shalev-Shwartz S (2007) Online learning: theory, algorithms, and applications. Ph.D. thesis, The Hebrew University of Jerusalem Shalev-Shwartz S. Online learning and online convex optimization. Found Trends Mach Learn. 2012; 4; 2: 107-194. 1253.68190. 10.1561/2200000018 Shor NZ (1985) Minimization methods for non-differentiable functions, 1 edn. Springer Series in Computational Mathematics 3. Springer, Berlin Valadier M. Sous-différentiels d'une borne supérieure et d'une somme continue de fonctions convexes. CR Acad. Sci. Paris Sér. AB. 1969; 268: A39-A42. 0164.43302 Van Der Malsburg CPalm G, Aertsen A. Frank rosenblatt: principles of neurodynamics: perceptrons and the theory of brain mechanisms. Brain theory. 1986: Berlin; Springer: 245-248. 10.1007/978-3-642-70911-1_20 Vovk V (1990) Aggregating strategies. In: Proceedings of the third annual workshop on computational learning theory. Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, pp 371–386 Vovk V. A game of prediction with expert advice. J Comput Syst Sci. 1998; 56; 2: 153-173. 1629690. 0945.68528. 10.1006/jcss.1997.1556 Wang W, Carreira-Perpiñán MÁ (2013) Projection onto the probability simplex: an efficient algorithm with a simple proof, and an application. arxiv: 1309.1541 Wilcoxon F. Individual comparisons by ranking methods. Biometr Bull. 1945; 1; 6: 80-83. 10.2307/3001968 Wu P, Hoi SCH, Zhao P, Miao C, Liu Z. Online multi-modal distance metric learning with application to image retrieval. IEEE Trans Knowl Data Eng. 2016; 28; 2: 454-467. 10.1109/TKDE.2015.2477296 Yaroshinsky R, El-Yaniv R, Seiden SS. How to better use expert advice. Mach Learn. 2004; 55; 3: 271-309. 1093.68094. 10.1023/B:MACH.0000027784.72823.e4 Zinkevich M (2003) Online convex programming and generalized infinitesimal gradient ascent. In: Fawcett T, Mishra N (eds) Proceedings of the 20th international conference on machine learning (ICML-03), pp 928–936 Footnotes https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. http://www.ics.uci.edu/~mlearn/MLRepository.html.

By Hoai An Le Thi and Vinh Thanh Ho

Reported by Author; Author

Titel:	DCA for online prediction with expert advice
Autor/in / Beteiligte Person:	Hoai An Le Thi ; Vinh Thanh Ho ; Laboratoire de Génie Informatique, de Production et de Maintenance (LGIPM) ; Université de Lorraine (UL)
Link:	Volltext (PDF) View record in OpenAIRE (Volltext) https://doi.org/10.1007/s00521-021-05709-0
Zeitschrift:	Neural Computing and Applications, Jg. 33 (2021-02-26), S. 9521-9544
Veröffentlichung:	Springer Science and Business Media LLC, 2021
Medientyp:	unknown
ISSN:	1433-3058 (print) ; 0941-0643 (print)
DOI:	10.1007/s00521-021-05709-0
Schlagwort:	0209 industrial biotechnology Weighted Majority Algorithm Logarithm business.industry Computer science Expert advice 02 engineering and technology Machine learning computer.software_genre Class (biology) 020901 industrial engineering & automation Binary classification Artificial Intelligence Convex optimization 0202 electrical engineering, electronic engineering, information engineering Benchmark (computing) [INFO]Computer Science [cs] 020201 artificial intelligence & image processing Artificial intelligence business Convex function computer ComputingMilieux_MISCELLANEOUS Software
Sonstiges:	Nachgewiesen in: OpenAIRE Rights: OPEN

Klicken Sie ein Format an und speichern Sie dann die Daten oder geben Sie eine Empfänger-Adresse ein und lassen Sie sich per Email zusenden.

BibTeX Citavi, JabRef, u.a.
(Literaturverwaltung)

PDF kein Volltext!
(Merkzettel, Notizen)

RIS Endnote, Citavi u.a.
(Literaturverwaltung)

MODS
(XML zur Weiterverarbeitung)

oder

Wählen Sie das für Sie passende Zitationsformat und kopieren Sie es dann in die Zwischenablage, lassen es sich per Mail zusenden oder speichern es als PDF-Datei.

Gewünschter Zitations-Stil:

oder

Bitte prüfen Sie, ob die Zitation formal korrekt ist, bevor Sie sie in einer Arbeit verwenden. Benutzen Sie gegebenenfalls den "Exportieren"-Dialog, wenn Sie ein Literaturverwaltungsprogramm verwenden und die Zitat-Angaben selbst formatieren wollen.