# 3321 from binary to multiclass and multilabel

- Daniel Lowd, Amirmohammad Rooshenas. License: BSD 2-Clause.
- License: LGPL 3. Programming Language: C#, F#. Operating System: Linux, Windows. Operations involving incompatible dimensions of DV and DM will now throw exceptions to warn the user.
- Francisco Zamora-Martinez. License: GPL v3. Tags: Machine Learning, MapReduce, MongoDB. Updated to work with luamongo v0.
- BayesOpt, a Bayesian Optimization toolbox 0. Ruben Martinez-Cantin. License: AGPL v3. Operating System: Linux, Windows, MacOS.
- License: FreeBSD. Tags: Metric Learning, Person Re-identification.

- Updated home repository link to follow the april-org GitHub organization.
- Serialization and deserialization have been updated with a more robust and reusable API, implemented in `util`.
- Added batch normalization ANN component.
- Added methods `prod`, `cumsum`, and `cumprod` to matrix classes.

- Added `operator[]` on the right side of matrix operations.
- Added `bind` function to freeze any positional argument of any Lua function.
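The `bind` facility above is Lua code in the toolkit itself; the following is a Python sketch of the same idea, freezing arbitrary (not just leading) positional arguments. The `bind` and `affine` names here are illustrative, not the toolkit's API:

```python
def bind(fn, frozen):
    """Freeze arbitrary positional arguments of fn.

    `frozen` maps 0-based argument positions to fixed values; the
    returned callable takes the remaining arguments in order.
    A Python sketch of the behaviour described for the Lua `bind`.
    """
    def bound(*args):
        it = iter(args)
        total = len(frozen) + len(args)
        # rebuild the full argument list, filling frozen slots first
        full = [frozen[i] if i in frozen else next(it) for i in range(total)]
        return fn(*full)
    return bound

def affine(a, x, b):
    return a * x + b

line = bind(affine, {0: 2, 2: 1})   # freeze a=2 and b=1
print(line(5))                       # affine(2, 5, 1) = 11
```

Unlike `functools.partial`, this sketch can also freeze a middle argument, e.g. `bind(affine, {1: 10})(3, 4)` calls `affine(3, 10, 4)`.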

- Added method `data` to numeric matrix classes.
- Fixed bugs when reading badly formed CSV files.
- Fixed bugs in statistical distributions.
- Fixed a bug affecting ImageRGB operations such as resize.
- Solved problems when chaining methods in Lua where some objects ended up being garbage collected.

- Improved support of strings in the auto-completion of the rlcompleter package.
- Solved a bug in `SparseMatrix`.
- All functions have been overloaded with an in-place version and a version that receives a destination matrix.
- Added iterators to language models.
- Added support for IPyLua.
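The dual in-place/destination-matrix overloading described above can be sketched in Python with a numpy-style `dest` parameter; the function and names are illustrative, not the toolkit's actual signatures:

```python
def scale(m, k, dest=None):
    """Multiply matrix m (a list of lists) by scalar k.

    Mirrors the overloading pattern in the changelog: with dest=None a
    new matrix is allocated and returned; passing dest (which may be m
    itself, giving an in-place update) writes the result there instead.
    """
    if dest is None:
        dest = [[0.0] * len(row) for row in m]
    for i, row in enumerate(m):
        for j, v in enumerate(row):
            dest[i][j] = v * k
    return dest

a = [[1.0, 2.0], [3.0, 4.0]]
b = scale(a, 2.0)           # allocates a fresh result matrix
scale(a, 10.0, dest=a)      # updates a in place, no allocation
```

The destination-matrix form lets callers reuse buffers in tight loops, which is the usual motivation for exposing both versions.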

- Optimized matrix access for the confusion matrix.
- Minor changes in class.
- Added Git commit hash and compilation time.
- Classifier and filter classes satisfy base unit tests.
- Added `uses` decorator to prevent non-essential arguments from being passed.
- Fixed nasty bug where imputation, binarisation, and standardisation would not actually be applied to test instances.

- Fixed bug where single quotes in attribute values could mess up args creation.
- ArffToPickle now recognises the class index option and arguments.
- Fixed nasty bug where filters were not being saved and were instead rebuilt from scratch from test data.

To keep our theory general, we do not specify the weak learners (line 4). The number of candidate labels is fixed at k, which is known to the learner. Without loss of generality, we may write the labels as integers in [k]. We allow multiple correct answers, and the label Y_t is a subset of [k].

The labels in Y_t are called relevant, and those in Y_t^c irrelevant. In our boosting framework, we assume that the learner consists of a booster and a fixed number N of weak learners.

This resembles a manager-worker framework: the booster distributes tasks by specifying losses, and each weak learner makes a prediction to minimize its loss.

The booster makes the final decision by aggregating the weak predictions. Once the true label is revealed, the booster shares this information so that the weak learners can update their parameters for the next example.

**Algorithm 1 (Online Boosting Schema).**
1: Receive example x_t
...
4: Record expert predictions s_t^j
...
Make a final decision ŷ_t
8: Get the true label Y_t
9: Weak learners update the internal parameters

**Online Weak Learners and Cost Vectors.** We keep the form of the weak predictions h_t general in that we only assume each is a distribution over [k]. This can in fact represent various types of predictions. Due to this general format, our boosting algorithm can even combine weak predictions of different formats.
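One round of this schema can be sketched as follows. The class and method names are illustrative assumptions, not taken from the paper; the trivial weak learner exists only to make the sketch runnable:

```python
import numpy as np

class UniformWL:
    """A trivial weak learner: always predicts the uniform distribution
    over the k labels and learns nothing (purely for illustration)."""
    def __init__(self, k):
        self.k = k
    def predict(self, x):
        return np.ones(self.k) / self.k
    def update(self, x, Y):
        pass

def online_boost_step(x, Y, weak_learners, alphas):
    """One round of the online boosting schema (a sketch).

    Gathers weak predictions h_t^i (distributions over [k]), aggregates
    them with weights alphas into a score vector, makes a final decision,
    then lets every weak learner update on the revealed label set Y.
    """
    preds = [wl.predict(x) for wl in weak_learners]   # gather h_t^i
    s = sum(a * h for a, h in zip(alphas, preds))     # aggregated scores
    y_hat = int(np.argmax(s))                         # final decision
    for wl in weak_learners:                          # learners update
        wl.update(x, Y)
    return s, y_hat

learners = [UniformWL(4) for _ in range(3)]
s, y_hat = online_boost_step(x=None, Y={1, 3},
                             weak_learners=learners,
                             alphas=[1.0, 1.0, 1.0])
```

Because each prediction is just a distribution over [k], any learner exposing this interface can be plugged in, which is the flexibility the text emphasizes.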

This implies that a researcher with a strong family of binary learners can simply boost them without transforming them into multi-class learners through well-known techniques such as one-vs-all or one-vs-one.

We extend the cost matrix framework, first proposed by Mukherjee and Schapire [] and then adopted to online settings by Jung et al. []. The cost vector is unknown to WL^i until it produces h_t^i, which is usual in online settings; otherwise, WL^i could trivially minimize the cost. We deal with this matter in two different ways: one way is to define the edge of a weak learner over a baseline.

**Cost Vectors.** The optimal design of a cost vector depends on the choice of loss.

We will use L_Y(s) to denote the loss without specifying it, where s is the predicted score vector. Then we introduce the potential function, a well-known concept in game theory that was first introduced to boosting by Schapire []. Even a learner that is worse than random guessing can contribute positively if we allow negative weights.
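The recursion behind the potentials can be sketched as follows; the notation ($u_\gamma$ for a biased-uniform distribution encoding the weak learners' edge, $e_l$ for a standard basis vector) is our reconstruction from the surrounding discussion, not a quotation of the paper's equations:

```latex
% Potentials built on the loss L_{Y_t}: start from the loss itself and
% repeatedly take expectations over single-label increments drawn from
% u_\gamma, a uniform distribution tilted by the edge \gamma.
\phi^t_0(s) \coloneqq L_{Y_t}(s), \qquad
\phi^t_{i+1}(s) \coloneqq \mathbb{E}_{l \sim u_\gamma}\!\left[\phi^t_i(s + e_l)\right].
```

With a recursion of this form, an induction on $i$ transfers properties of $L$ such as properness or convexity directly to every $\phi^t_i$, which is the inheritance argument used below.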

**Optimal Algorithm.** We now rigorously define the edge of a weak learner. Recall that weak learners suffer losses determined by cost vectors. It can easily be shown by induction that many attributes of L are inherited by the potentials.

Being proper or convex is a good example. Essentially, we want cost vectors in C_0^eor: since the booster wants weak learners to put higher scores on the relevant labels, the costs at the relevant labels should be less than those at the irrelevant ones. Note that this quantity is unavailable until we observe all the instances, which is fine because we only need it in proving the loss bound.

**Algorithm Details.** The algorithm is named OnlineBMR (online boost-by-majority for multi-label ranking), as its potential ideas stem from the classical boost-by-majority algorithm of Schapire []. These details are summarized in Algorithm 2. The properness inherited by the potentials ensures that the relevant labels have smaller costs than the irrelevant ones.

To satisfy the boundedness condition of C_0^eor, we normalize (2), where the value of a is determined by the size of Y_t. Now we state our online weak learning condition; this extends the condition made by Jung et al. [].

The probabilistic statement is needed because many online learners produce randomized predictions. The excess loss can be interpreted as a warm-up period. The following theorem holds either if the weak learners are single-label learners or if the loss L is convex. Note that the rank loss is not convex. In case the weak learners are in fact single-label learners, we can simply use the rank loss to compute potentials, but in the more general case we use the following hinge loss to compute potentials: it is convex and always greater than the rank loss, and thus Theorem 2 can be used to bound the rank loss.
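The domination of the rank loss by the hinge surrogate can be checked numerically. The pairwise definitions below follow standard MLR conventions (ties counted as 1/2, no normalization) and are a sketch rather than the paper's exact scaling:

```python
import numpy as np

def rank_loss(Y, s):
    """Unnormalized rank loss: number of misordered (relevant, irrelevant)
    pairs, counting ties as 1/2. Y is the set of relevant labels."""
    rel = [l for l in range(len(s)) if l in Y]
    irr = [r for r in range(len(s)) if r not in Y]
    return sum((s[l] < s[r]) + 0.5 * (s[l] == s[r])
               for l in rel for r in irr)

def hinge_loss(Y, s):
    """Pairwise hinge surrogate. Convex, and >= rank_loss term by term:
    max(0, 1 + s_r - s_l) >= 1 whenever s_l <= s_r."""
    rel = [l for l in range(len(s)) if l in Y]
    irr = [r for r in range(len(s)) if r not in Y]
    return sum(max(0.0, 1.0 + s[r] - s[l]) for l in rel for r in irr)

s = np.array([0.9, 0.2, 0.4, 0.4])
Y = {0, 2}
print(rank_loss(Y, s), hinge_loss(Y, s))   # rank 0.5, hinge about 2.6
```

Since the hinge surrogate upper-bounds the rank loss pair by pair, any bound proved for the (convex) hinge loss immediately bounds the rank loss, which is exactly how Theorem 2 is applied.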

In Appendix A, we bound the two terms on the RHS of (4) when the potentials are built upon the rank and hinge losses; here we record the results. These imply that we can plug in w_i[t] d_t^i in place of c_t^i. With single-label learners, plugging this into (5) and summing over t gives the stated bound. Combining these results with Theorem 2, we get the following corollary. When we divide both sides by T, we find that the average loss is asymptotically bounded by the first term; the second term determines the sample complexity.

In both cases, the first term decreases exponentially as N grows, which means the algorithm does not require too many learners to achieve a desired loss bound.

Now we evaluate the efficiency of OnlineBMR by fixing a loss. Unfortunately, there is no canonical loss in MLR, and computing the potentials in (4) becomes a major bottleneck (cf. the runtime results). Finally, it is possible that learners have different edges, and assuming a constant edge can lead to inefficiency. In fact, the following theorem constructs a circumstance that matches these bounds up to logarithmic factors. Throughout the proof, we consider k as a fixed constant. The choice of loss is broadly discussed by Jung et al. [].

In this regard, we will use the following logistic loss: We introduce a sketch here and postpone the complete discussion to Appendix B.

**Algorithm Details.** The algorithm is inspired by Adaboost.OLM of Jung et al. [], and we will call it Ada.OLMR. Since it internally aims to minimize the logistic loss, we set the cost vector to be the gradient of the surrogate. The above inequality guarantees that OnlineWLC is met. Then a similar argument to Schapire and Freund [, Section ] applies. Finally, adopting the arguments in the proof of Jung et al. [] proves the first part of the theorem. OnlineWLC can then be shown to be met in a similar fashion.

To apply the result of Zinkevich [, Theorem 1], f_t^i needs to be convex and F should be compact. Now we present the loss bound of Ada.OLMR. Finally, it remains to address how to choose i_t. In contrast to OnlineBMR, we cannot show that the last expert is reliably sophisticated.

Instead, what can be shown is that at least one of the experts is good enough. Thus we use the classical Hedge algorithm (cf. []): at each iteration, i_t is randomly drawn with probability proportional to the expert weights. Theorem 5 bounds the rank loss of Ada.OLMR as stated below. From (8), it can be observed that the relevant labels have negative costs and the irrelevant ones have positive costs.
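A minimal sketch of the Hedge step assumed here; the learning rate `eta` and the function names are illustrative:

```python
import math
import random

def hedge_update(v, losses, eta=0.5):
    """Multiplicative-weights (Hedge) update: each expert's weight is
    scaled by exp(-eta * loss), so low-loss experts gain relative mass.
    eta is an assumed learning rate, not a value from the paper."""
    return [vi * math.exp(-eta * li) for vi, li in zip(v, losses)]

def draw_expert(v, rng):
    """Sample an expert index i_t with probability proportional to v."""
    total = sum(v)
    u, acc = rng.random() * total, 0.0
    for i, vi in enumerate(v):
        acc += vi
        if u <= acc:
            return i
    return len(v) - 1

v = [1.0, 1.0, 1.0]
v = hedge_update(v, [0.0, 1.0, 2.0])   # expert 0 suffered no loss
rng = random.Random(0)
i_t = draw_expert(v, rng)
```

The standard Hedge analysis then bounds the randomly drawn expert's loss against the best expert in hindsight, which is all the argument needs since only one expert must be good.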

Furthermore, the entries of c_t^i sum to exactly 0. This observation suggests a new definition of the weight w_i[t]. As the booster chooses an expert through the Hedge algorithm, a standard analysis (cf. []) applies.
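Taking the cost vector to be the gradient of a pairwise logistic surrogate (our reading of the surrogate's form, not a quotation of the paper's equation), the observations above, negative costs on relevant labels, positive on irrelevant ones, and entries summing to zero, can be verified directly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_vector(Y, s):
    """Gradient of the pairwise logistic surrogate
    L(s) = sum_{l in Y, r not in Y} log(1 + exp(s_r - s_l)).
    Each pair contributes +sigmoid(s_r - s_l) to the irrelevant label r
    and the opposite amount to the relevant label l, so relevant entries
    are negative, irrelevant entries positive, and the total is zero."""
    k = len(s)
    rel = [l for l in range(k) if l in Y]
    irr = [r for r in range(k) if r not in Y]
    c = np.zeros(k)
    for l in rel:
        for r in irr:
            g = sigmoid(s[r] - s[l])   # d/ds_r of log(1 + exp(s_r - s_l))
            c[r] += g
            c[l] -= g
    return c

c = logistic_cost_vector({0, 2}, np.array([1.0, 0.0, 0.5, -0.5]))
```

The zero-sum structure is what makes the weight definition w_i[t] natural: the positive and negative mass of the cost vector balance exactly.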

We start the proof by defining the rank loss suffered by expert i as below. Here we record a univariate inequality. This does not directly correspond to the weight used in (3), but plays a similar role. Then we define the empirical edge. Despite this sub-optimality, Ada.OLMR shows comparable results on real data sets due to its adaptive nature. The authors in fact used five data sets, but the image data set is no longer available from the source.

This completes our proof. In practice, Ada.OLMR is far more favorable. With a large number of labels, the runtime of OnlineBMR grows rapidly; it was even impossible to run the mediamill data within a week, which is why we produced the reduced version.

The main bottleneck is the computation of the potentials, as they do not have a closed form. The data set m-reduced is a reduced version of mediamill, obtained by random sampling without replacement.
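Since the potentials lack a closed form, one way to approximate them is Monte Carlo simulation of the random label increments. The sketch below assumes a simple biased-uniform increment distribution and illustrative function names; it is one possible estimator, not the paper's implementation:

```python
import numpy as np

def rank_loss(Y, s):
    """Unnormalized rank loss of score vector s against relevant set Y."""
    rel = [l for l in range(len(s)) if l in Y]
    irr = [r for r in range(len(s)) if r not in Y]
    return sum(float(s[l] < s[r]) + 0.5 * float(s[l] == s[r])
               for l in rel for r in irr)

def estimate_potential(loss, Y, s, n, gamma, rng, n_samples=2000):
    """Monte Carlo estimate of an n-step potential phi_n(s): the expected
    loss after n single-label increments drawn i.i.d. from a biased-uniform
    distribution (uniform, plus an edge gamma on one relevant label)."""
    k = len(s)
    p = np.full(k, (1.0 - gamma) / k)
    p[min(Y)] += gamma                  # tilt toward a relevant label
    total = 0.0
    for _ in range(n_samples):
        # count how many of the n increments landed on each label
        steps = np.bincount(rng.choice(k, size=n, p=p), minlength=k)
        total += loss(Y, s + steps)
    return total / n_samples

rng = np.random.default_rng(0)
s0 = np.array([0.0, 1.0, 0.0])
phi0 = estimate_potential(rank_loss, {1}, s0, n=0, gamma=0.2, rng=rng,
                          n_samples=50)   # n=0 recovers the loss exactly
phi2 = estimate_potential(rank_loss, {1}, s0, n=2, gamma=0.2, rng=rng)
```

The sampling cost grows with both n and the number of labels k, which illustrates why potential computation dominates OnlineBMR's runtime on data sets like mediamill.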

We kept the original split for training and test sets to provide more relevant comparisons. (Table: summary of data sets, listing train/test sizes, feature dimension, number of labels k, and the min/mean/max number of relevant labels for emotions, scene, yeast, mediamill, and m-reduced.) The algorithms are quite flexible in their choice of weak learners, in that various types of learners can be combined to produce a strong learner. OnlineBMR is built upon the assumption that all weak learners are strictly better than random guessing, and its loss bound is shown to be tight under certain conditions.

Ada.OLMR adaptively chooses the weights over the learners so that learners with arbitrary (even negative) edges can be boosted. Despite its suboptimal loss bound, it produces comparable results with OnlineBMR and runs much faster. Online MLR boosting provides several opportunities for further research. A major issue in MLR problems is that there does not exist a canonical loss. Fortunately, Theorem 2 holds for any proper loss, but Ada.OLMR only has a rank loss bound. An adaptive algorithm that can handle more general losses would be desirable. The existence of an optimal adaptive algorithm is another interesting open question. Every algorithm used trees whose parameters were randomly chosen. Instead of using all covariates, the booster fed each tree 20 randomly chosen covariates to make weak predictions less correlated.

All computations were carried out on a Nehalem architecture core. Each algorithm was trained at least ten times with different random seeds, and the results were aggregated by taking the mean. Predictions were evaluated by rank loss.

Since VFDT outputs a conditional distribution, which is not of a single-label format, we used the hinge loss to compute potentials. We tried four different values, and the best result is recorded as best BMR. The results are summarized in Table 3 (average loss and runtime in seconds on emotions, scene, yeast, mediamill, and m-reduced).