1 Introduction
The bipartite ranking problem aims at learning a ranking function that orders positive instances ahead of negative ones. For example, in information retrieval, bipartite ranking can be used to order the preferred documents in front of the less-preferred ones within a list of search-engine results; in bioinformatics, bipartite ranking can be used to identify genes related to a certain disease by ranking the relevant genes higher than irrelevant ones. Bipartite ranking algorithms take some positive and negative instances as the input data, and produce a ranking function that maps an instance to a real-valued score. Given a pair of a positive instance and a negative one, we say that the pair is misordered if the ranking function gives a higher score to the negative instance. The performance of the ranking function is measured by the probability of misordering an unseen pair of randomly chosen positive and negative instances, which is equal to one minus the Area Under the ROC Curve (AUC)
[16], a popular criterion for evaluating the sensitivity and the specificity of binary classifiers in many real-world tasks
[9] and large-scale data mining competitions [8, 38]. Given the many potential applications in information retrieval, bioinformatics, and recommendation systems, bipartite ranking has received much research attention in the past two decades [17, 10, 22, 27, 24, 14]. Many existing bipartite ranking algorithms explicitly or implicitly reduce the problem to binary classification to inherit the benefits from the well-developed methods in binary classification [20, 17, 5, 24, 14]. The majority of those reduction-based algorithms can be categorized into two approaches: the pairwise approach and the pointwise one. The pairwise approach transforms the input data of positive and negative instances to pairs of instances, and learns a binary classifier for predicting whether the first instance in a pair should be scored higher than the second one. Note that for an input data set that contains $N_+$ positive instances and $N_-$ negative ones, the pairwise approach trains a binary classifier by optimizing an objective function that consists of $N_+ N_-$ terms, one for each pair of instances. The pairwise approach comes with strong theoretical guarantees. For example, [3] shows that a low-regret ranking function can indeed be formed by a low-regret binary classifier. The strong theoretical guarantees lead to promising experimental results in many state-of-the-art bipartite ranking algorithms, such as RankSVM [20], RankBoost [17] and RankNet [6]. Nevertheless, the number of pairs in the input data can easily be of size $O(n^2)$, where $n$ is the size of the input data, if the data is not extremely unbalanced. The quadratic number of pairs with respect to $n$ makes the pairwise approach computationally infeasible for large-scale data sets in general, except in a few special algorithms like RankBoost [17] or the efficient linear RankSVM [22].
RankBoost enjoys an efficient implementation by reducing the quadratic number of pairwise terms in the objective function to a linear number of equivalent terms; the efficient linear RankSVM transforms the pairwise optimization formulation to an equivalent formulation that can be solved in sub-quadratic time complexity [24].
On the other hand, the pointwise approach directly runs binary classification on the positive and negative instance points of the input data, and takes the scoring function behind the resulting binary classifier as the ranking function. In some special cases [17, 28], such as AdaBoost [18] and its pairwise sibling RankBoost [17], the pointwise approach is shown to be equivalent to the corresponding pairwise one [30, 14]. In other cases, the pointwise approach often operates with an approximate objective function that involves only $O(n)$ terms [24]. For example, [24]
shows that minimizing the exponential or the logistic loss function on the instance points decreases an upper bound on the number of misordered pairs within the input data. Because of the approximate nature of the pointwise approach, its ranking performance can sometimes be inferior to that of the pairwise approach.
From the discussion above, we see that the pairwise approach leads to more satisfactory performance while the pointwise approach comes with efficiency, and there is a trade-off between the two. In this paper, we are interested in designing bipartite ranking algorithms that enjoy both satisfactory performance and efficiency for large-scale bipartite ranking. The concrete problem setup will be provided in Section 2. We focus on using the linear Support Vector Machine (SVM) [37] given its recent advances for efficient large-scale learning [40]. We first show that the loss function behind the usual pointwise SVM [37] minimizes an upper bound on the loss function behind RankSVM, which suggests that the pointwise SVM could be an approximate bipartite ranking algorithm that enjoys efficiency. Then, we design a better ranking algorithm with two major contributions.
As illustrated in Section 3, firstly, we study an active sampling scheme to select important pairs for the pairwise approach and name the scheme Active Sampling for RankSVM (ASRankSVM). The scheme makes the pairwise SVM computationally feasible by focusing only on a few valuable pairs out of the quadratic number of pairs, and allows us to overcome the challenge of having a quadratic number of pairs. The active sampling scheme is inspired by active learning, another popular machine learning setup that aims to save the efforts of labeling
[32]. More specifically, we discuss the similarities and differences between active sampling (selecting a few valuable pairs within a pool of potential pairs) and pool-based active learning (labeling a few valuable instances within a pool of unlabeled instances), and propose some active sampling strategies based on the similarity. Secondly, we propose a general framework that unifies the pointwise SVM and the pairwise SVM (RankSVM) as special cases. The framework, called combined ranking and classification (CRC), is simply based on the idea of treating each instance point as a pseudo-pair. The CRC framework coupled with active sampling (ASCRC) improves the performance of the pointwise SVM by considering not only points but also pairs in its objective function. Performing active sampling within the CRC framework leads to a promising algorithm for large-scale linear bipartite ranking. In Section 4, we conduct experiments on 14 real-world large-scale data sets and compare the proposed algorithms (ASRankSVM and ASCRC) with several state-of-the-art bipartite ranking algorithms, including the pointwise linear SVM [15], the efficient linear RankSVM [22], and the Combined Ranking and Regression (CRR) algorithm [31], which is closely related to the CRC framework. In addition, we demonstrate the robustness and the efficiency of the active sampling strategies and discuss some advantages and disadvantages of different strategies. The results show that ASRankSVM is able to efficiently sample only a small fraction of the millions of possible pairs yet achieve better performance than other state-of-the-art algorithms that use all the pairs, while ASCRC, which considers the pseudo-pairs, can sometimes be helpful. Those results validate that the proposed algorithms can indeed enjoy both satisfactory performance and efficiency for large-scale bipartite ranking.
A preliminary version of this paper appeared in the 5th Asian Conference on Machine Learning [34]. The paper is then enriched with additional discussions and experimental results.
2 Setup and Related Works
In a bipartite ranking problem, we are given a training set $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where each $(\mathbf{x}_i, y_i)$ is a training instance with the feature vector $\mathbf{x}_i$ in a $d$-dimensional space and the binary label $y_i \in \{+1, -1\}$. Such a training set is of the same format as the training set in usual binary classification problems. We assume that the instances are drawn i.i.d. from an unknown distribution $\mathcal{D}$. Bipartite ranking algorithms take $D$ as the input and learn a ranking function $r$ that maps a feature vector $\mathbf{x}$ to a real-valued score $r(\mathbf{x})$.
For any pair of two instances, we call the pair misordered by $r$ iff the pair contains a positive instance $\mathbf{x}$ and a negative instance $\mathbf{x}'$ while $r(\mathbf{x}) \le r(\mathbf{x}')$. For a distribution $\mathcal{D}$ that generates instances $(\mathbf{x}, y)$, we can define its pair distribution $\mathcal{D}_p$ that generates $((\mathbf{x}, y), (\mathbf{x}', y'))$ to be the conditional probability of sampling two instances $(\mathbf{x}, y)$ and $(\mathbf{x}', y')$ from $\mathcal{D}$, conditioned on $y \neq y'$. Then, let the expected bipartite ranking loss for any ranking function $r$ be the expected number of misordered pairs over $\mathcal{D}_p$:
$$L(r) = \mathbb{E}_{((\mathbf{x}, y), (\mathbf{x}', y')) \sim \mathcal{D}_p} \Big[\!\!\Big[\, (y - y')\big(r(\mathbf{x}) - r(\mathbf{x}')\big) \le 0 \,\Big]\!\!\Big],$$
where $[\![\cdot]\!]$ is an indicator function that returns $1$ iff the condition is true, and returns $0$ otherwise. The goal of bipartite ranking is to use the training set $D$ to learn a ranking function $r$ that minimizes the expected bipartite ranking loss $L(r)$.
Because $\mathcal{D}$ is unknown, $L(r)$ cannot be computed directly. Thus, bipartite ranking algorithms usually resort to the empirical bipartite ranking loss $\hat{L}(r)$, which takes the expectation over the pairs in $D$ instead of over the pair distribution $\mathcal{D}_p$, with the hope that $\hat{L}(r)$ would be sufficiently close to $L(r)$ when the model complexity of the candidate ranking functions is properly controlled. Denote $P$ as the set of the positive instances in $D$, and $N$ as the set of negative instances in $D$. The formal definition of $\hat{L}(r)$ is
$$\hat{L}(r) = \frac{1}{|P|\,|N|} \sum_{\mathbf{x}^+ \in P} \sum_{\mathbf{x}^- \in N} \big[\!\big[\, r(\mathbf{x}^+) \le r(\mathbf{x}^-) \,\big]\!\big].$$
The bipartite ranking loss is closely related to the area under the ROC curve (AUC), which is commonly used to evaluate the sensitivity and the specificity of binary classifiers [8, 5, 9, 38]. More specifically, AUC calculates the expected fraction of correctly-ordered pairs. Hence, $\mathrm{AUC} = 1 - L(r)$ for the expected loss (or $1 - \hat{L}(r)$ for the empirical one), and a higher AUC indicates better ranking performance.
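As a concrete illustration, the empirical bipartite ranking loss and its AUC counterpart can be computed directly from scores and labels. The following is a minimal sketch; the function name and the tie-handling convention (counting ties as misordered, matching the definition above) are our own choices:

```python
import numpy as np

def bipartite_ranking_loss(scores, labels):
    """Empirical bipartite ranking loss: fraction of misordered
    positive/negative pairs, where a tie r(x+) == r(x-) counts
    as misordered (the convention r(x+) <= r(x-) above)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == +1]
    neg = scores[labels == -1]
    # Compare every positive score against every negative score.
    misordered = (pos[:, None] <= neg[None, :]).sum()
    return misordered / (len(pos) * len(neg))

scores = np.array([0.9, 0.8, 0.3, 0.1])
labels = np.array([+1, -1, +1, -1])
loss = bipartite_ranking_loss(scores, labels)  # 1 misordered pair out of 4
auc = 1.0 - loss                               # AUC = 1 - empirical loss
```

The nested comparison makes the quadratic cost of the pairwise criterion explicit: all $|P|\,|N|$ pairs are examined.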
Bipartite ranking is a special case of the general ranking problem in which the labels can be any real values, not necessarily binary. There are many recent studies on improving the accuracy [20, 7, 13] and efficiency [17, 2] of general ranking problems. In this paper, instead of considering the general ranking problem, we focus on using the specialty of bipartite ranking, namely its connection to binary classification, to improve the accuracy and the efficiency.
Motivated by the recent advances of linear models for efficient large-scale learning [40], we consider linear models for efficient large-scale bipartite ranking. That is, the ranking functions would be of the form $r(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$, which is linear in the components of the feature vector $\mathbf{x}$. In particular, we study the linear Support Vector Machine (SVM) [37] for bipartite ranking. There are two possible approaches for adopting the linear SVM on the bipartite ranking problem, the pairwise SVM approach and the pointwise SVM approach.
The pairwise approach corresponds to the famous RankSVM algorithm [20], which is originally designed for ranking with ordinal-scaled scores, but can be easily extended to general ranking with real-valued labels or restricted to bipartite ranking with binary labels. For each positive instance $\mathbf{x}^+$ and negative instance $\mathbf{x}^-$, the pairwise approach transforms the two instances to two symmetric pairs of instances, $(\mathbf{x}^+, \mathbf{x}^-)$ and $(\mathbf{x}^-, \mathbf{x}^+)$, the former for indicating that $\mathbf{x}^+$ should be scored higher than $\mathbf{x}^-$ and the latter for indicating that $\mathbf{x}^-$ should be scored lower than $\mathbf{x}^+$. In the linear case, the pairs transformed from $D$ are then fed to a linear SVM for learning a ranking function of the form $r(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$. Then, for the pair $(\mathbf{x}^+, \mathbf{x}^-)$, we see that $r(\mathbf{x}^+) > r(\mathbf{x}^-)$ iff $\mathbf{w}^\top(\mathbf{x}^+ - \mathbf{x}^-) > 0$. Defining the transformed feature vector $\mathbf{z} = \mathbf{x}^+ - \mathbf{x}^-$ with transformed label $t = +1$ (and symmetrically $\mathbf{z} = \mathbf{x}^- - \mathbf{x}^+$ with $t = -1$), we can equivalently view the pairwise linear SVM as simply running a linear SVM on the pairwise training set $D_p = \{(\mathbf{z}, t)\}$. The pairwise linear SVM minimizes the hinge loss as a surrogate to the 0/1 loss on $D_p$ [35], and the 0/1 loss on $D_p$ is equivalent to $\hat{L}(r)$, the empirical bipartite ranking loss of interest. That is, if the linear SVM learns an accurate binary classifier using $D_p$, the resulting ranker would also be accurate in terms of the bipartite ranking loss.
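The pair transformation described above can be sketched in a few lines; `make_pairwise_set` is an illustrative name, and the quadratic blow-up in the number of pairs is visible in the nested loops:

```python
import numpy as np

def make_pairwise_set(X, y):
    """Transform (X, y) into the pairwise training set D_p:
    for each positive x+ and negative x-, emit the two symmetric
    pairs (x+ - x-, +1) and (x- - x+, -1)."""
    P = X[y == +1]
    N = X[y == -1]
    Z, t = [], []
    for xp in P:
        for xn in N:
            Z.append(xp - xn); t.append(+1)
            Z.append(xn - xp); t.append(-1)
    return np.array(Z), np.array(t)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([+1, -1, +1])
Z, t = make_pairwise_set(X, y)  # 2 positives x 1 negative -> 4 symmetric pairs
```

A pair with label $t = +1$ is misordered by $\mathbf{w}$ exactly when $\mathbf{w}^\top \mathbf{z} \le 0$, so an ordinary binary SVM on $(Z, t)$ optimizes the surrogate ranking loss.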
Denoting the hinge function by $\mathrm{hinge}(s) = \max(1 - s,\, 0)$, RankSVM solves the following optimization problem:
$$\min_{\mathbf{w}} \;\; \frac{\lambda}{2}\|\mathbf{w}\|^2 + \sum_{(i, j):\, y_i \neq y_j} c_{ij}\, \mathrm{hinge}\big(t_{ij}\, \mathbf{w}^\top(\mathbf{x}_i - \mathbf{x}_j)\big), \qquad (1)$$
where $t_{ij} = +1$ if $y_i > y_j$ and $t_{ij} = -1$ otherwise.
where $c_{ij}$ denotes the weight of the pair $(\mathbf{x}_i, \mathbf{x}_j)$. Because of the symmetry of $(\mathbf{x}_i, \mathbf{x}_j)$ and $(\mathbf{x}_j, \mathbf{x}_i)$, we naturally assume that $c_{ij} = c_{ji}$. In the original RankSVM formulation, $c_{ij}$ is set to a constant for all the pairs. Here we list a more flexible formulation (1) to facilitate some discussions later. RankSVM has reached promising bipartite ranking performance in the literature [5]. Because of the symmetry of positive and negative pairs, we can equivalently solve (1) on those positive pairs with $t_{ij} = +1$. The number of such positive pairs is $N_+ N_-$ if there are $N_+$ positive instances and $N_-$ negative ones. The huge number of pairs makes it difficult to solve (1) with a naïve quadratic programming algorithm.
In contrast with the naïve RankSVM, the efficient linear RankSVM [22] changes (1) to a more sophisticated and equivalent formulation with an exponential number of constraints, each corresponding to a particular linear combination of the pairs. Then, it reaches an $O(n \log n)$ time complexity by using a cutting-plane solver to identify the most-violated constraints iteratively, where the constant hidden in the big-$O$ notation depends on the regularization parameter as well as the desired precision of optimization. The sub-quadratic time complexity of the efficient RankSVM can still make it much slower than the pointwise approach (to be discussed below), and hence it may not always be fast enough for large-scale bipartite ranking.
The pointwise SVM approach, on the other hand, directly runs an SVM on the original training set $D$ instead of $D_p$. That is, in the linear case, the pointwise approach solves the following optimization problem:
$$\min_{\mathbf{w}} \;\; \frac{\lambda}{2}\|\mathbf{w}\|^2 + C_+ \sum_{\mathbf{x}^+ \in P} \mathrm{hinge}\big(\mathbf{w}^\top \mathbf{x}^+\big) + C_- \sum_{\mathbf{x}^- \in N} \mathrm{hinge}\big(-\mathbf{w}^\top \mathbf{x}^-\big). \qquad (2)$$
Such an approach comes with some theoretical justification [24]. In particular, the 0/1 loss on $D$ has been proved to be an upper bound of the empirical bipartite ranking loss. In fact, the bound can be tightened by adjusting $C_+$ and $C_-$ to balance the distribution of the positive and negative instances in $D$. Empirically, [5] shows that the pointwise approach (2) is generally inferior to the pairwise approach (1) in performance. The inferior performance can be attributed to the fact that the pointwise approach only operates with an approximation (upper bound) of the bipartite ranking loss of interest.
Next, inspired by the theoretical result of upper-bounding the bipartite ranking loss with a balanced 0/1 loss, we study the connection between (1) and (2) by balancing the hinge loss in (2). In particular, as shown in Theorem 1, a balanced form of (2) can be viewed as minimizing an upper bound of the objective function within (1). In other words, the weighted pointwise SVM can be viewed as a reasonable baseline algorithm for the large-scale bipartite ranking problem.
Theorem 1. For any $\mathbf{w}$, the objective function of (1) with constant pair weights $c_{ij} = 1$ is upper-bounded by the objective function of a balanced form of (2). That is,
$$\frac{\lambda}{2}\|\mathbf{w}\|^2 + \sum_{\mathbf{x}^+ \in P} \sum_{\mathbf{x}^- \in N} \mathrm{hinge}\big(\mathbf{w}^\top(\mathbf{x}^+ - \mathbf{x}^-)\big) \;\le\; \frac{\lambda}{2}\|\mathbf{w}\|^2 + |N| \sum_{\mathbf{x}^+ \in P} \mathrm{hinge}\big(\mathbf{w}^\top \mathbf{x}^+\big) + |P| \sum_{\mathbf{x}^- \in N} \mathrm{hinge}\big(-\mathbf{w}^\top \mathbf{x}^-\big),$$
which is (2) with $C_+ = |N|$ and $C_- = |P|$.

Proof.
The objective function of (1) can be bounded term by term. For each pair,
$$\mathrm{hinge}\big(\mathbf{w}^\top(\mathbf{x}^+ - \mathbf{x}^-)\big) = \max\big(0,\, (1 - \mathbf{w}^\top \mathbf{x}^+) + \mathbf{w}^\top \mathbf{x}^-\big) \le \mathrm{hinge}\big(\mathbf{w}^\top \mathbf{x}^+\big) + \mathrm{hinge}\big(-\mathbf{w}^\top \mathbf{x}^-\big).$$
The theorem can be easily proved by substituting $\mathbf{w}^\top \mathbf{x}$ with a new variable $s$, using $\max(0, u + v) \le \max(0, u) + \max(0, v)$ and $\max(0, v) \le \max(0, 1 + v)$, and summing the per-pair inequality over all $|P|\,|N|$ pairs. ∎
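The claim that a balanced pointwise hinge loss upper-bounds the pairwise hinge loss can also be verified numerically; the sketch below uses random data and unit pair weights (the setup and names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
hinge = lambda s: np.maximum(0.0, 1.0 - s)

d = 5
P = rng.normal(size=(8, d))  # positive instances
N = rng.normal(size=(6, d))  # negative instances

violations = 0
for _ in range(100):
    w = rng.normal(size=d)
    # Pairwise hinge loss of (1) with unit pair weights.
    pairwise = sum(hinge(w @ (xp - xn)) for xp in P for xn in N)
    # Balanced pointwise hinge loss: weight positives by |N|, negatives by |P|.
    pointwise = len(N) * hinge(P @ w).sum() + len(P) * hinge(-(N @ w)).sum()
    if pairwise > pointwise + 1e-9:
        violations += 1
# violations stays 0: the balanced pointwise loss upper-bounds the pairwise loss
```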
3 Proposed Algorithm
As discussed in the previous section, the pairwise approach (1) is infeasible on large-scale data sets due to the huge number of pairs. Then, either some random subsampling of the pairs is needed [31], or the less-accurate pointwise approach (2) is taken as the approximate alternative [24]. Nevertheless, the better ranking performance of the pairwise approach over the pointwise one suggests that some of the key pairs shall carry more valuable information than the instance points. Next, we design an algorithm that samples a few key pairs actively during learning. We first show that some proposed active sampling schemes can help identify those key pairs better than random subsampling. Then, we discuss how we can unify the pointwise and pairwise ranking approaches under the same framework.
3.1 Pool-based Active Learning
The pairwise SVM approach (1) is challenging to solve because of the huge number of pairs involved in $D_p$. To make the computation feasible, we can only afford to work on a small subset of $D_p$ during training. Existing algorithms conquer the computational difficulty of the huge number of pairs in different ways. The Combined Ranking and Regression approach [31]
performs stochastic gradient descent on its objective function, which essentially selects from the huge number of pairs in a random manner; the efficient RankSVM
[22] identifies the most-violated constraints during optimization, which corresponds to selecting the most valuable pairs from an optimization perspective. We take an alternative route and hope to select the most valuable pairs from a learning perspective. That is, our task is to iteratively select a small number of valuable pairs for training while reaching similar performance to the pairwise approach that trains with all the pairs. One machine learning setup that works for a similar task is active learning [32], which iteratively selects a small number of valuable instances for labeling (and training) while reaching similar performance to the approach that trains with all the instances fully labeled. [1] avoids the quadratic number of pairs in the general ranking problem from an active learning perspective, and proves that selecting a sub-quadratic number of pairs is sufficient to obtain a ranking function that is close to the optimal ranking function trained by using all the pairs. The algorithm is theoretical in nature, while many other promising active learning tools [25, 29, 32] have not been explored for selecting valuable pairs in large-scale bipartite ranking.
Next, we start exploring those tools by providing a brief review of active learning. We focus on the setup of pool-based active learning [32] because of its strong connection to our needs. In a pool-based active learning problem, the training instances are separated into two parts, the labeled pool and the unlabeled pool. As the name suggests, the labeled pool consists of labeled instances that contain both the feature vector $\mathbf{x}$ and its corresponding label $y$, and the unlabeled pool contains unlabeled instances only. Pool-based active learning assumes that a (huge) pool of unlabeled instances is relatively easy to gather, while labeling those instances can be expensive. Therefore, we hope to achieve promising learning performance with as few labeled instances as possible.
A pool-based active learning algorithm is generally iterative. In each iteration, there are two steps: the training step and the querying step. In the training step, the algorithm trains a decision function from the labeled pool; in the querying step, the algorithm selects one (or a few) unlabeled instances, queries an oracle for their labels, and moves those instances from the unlabeled pool to the labeled one. The pool-based active learning framework repeats the training and querying steps iteratively until a given budget on the number of queries is met, with the hope that the decision functions returned throughout the learning steps are as accurate as possible for prediction.
Because labeling is expensive, active learning algorithms aim to select the most valuable instance(s) from the unlabeled pool in each querying step. Various selection criteria have been proposed to quantify the value of an unlabeled instance [32], such as uncertainty sampling [25], expected error reduction [29], and expected model change [33].
Moreover, there are several works that solve bipartite ranking under the active learning scenario [39, 11, 12]. For example, [11] selects the points that reduce the ranking loss functions most from the unlabeled pool, while [12] selects the points that maximize the AUC in expectation. Nevertheless, these active learning algorithms require either sorting or enumerating over the huge unlabeled pool in each querying step. The sorting or enumerating process can be time-consuming, but this cost has not been considered seriously because labeling is assumed to be even more expensive. We will discuss later why algorithms that require sorting or enumerating may not fit our goal.
3.2 Active Sampling
Following the philosophy of active learning, we propose the active sampling scheme for choosing a small set of key pairs from the huge pairwise training set $D_p$. We call the scheme active sampling in order to highlight some differences from active learning. One particular difference is that RankSVM (1) only requires optimizing with the positive pairs. Then, the label of a pair is a constant $+1$ and thus easy to get during active sampling, while the label in active learning remains unknown before the possibly expensive querying step. Thus, while active sampling and active learning both focus on using as little labeled data as possible, the costly part of the active sampling scheme is training rather than querying.
For active sampling, we denote by $B$ the budget on the number of pairs that can be used in training, which plays a similar role to the budget on querying in active learning. In brief, active sampling chooses informative pairs iteratively for solving the optimization problem (1). We separate the pairwise training set $D_p$ into two parts, the chosen pool and the unchosen pool. The chosen pool is the subset of pairs to be used for training, and the unchosen pool contains the unused pairs. The chosen pool is similar to the labeled pool in pool-based active learning; the unchosen pool acts like the unlabeled pool. The fact that it is almost costless to “label” the pairs in the unchosen pool allows us to design simpler sampling strategies than those commonly used for active learning, because no effort is needed to estimate the unknown labels.
The proposed scheme of active sampling is illustrated in Algorithm 1. The algorithm takes an initial chosen pool and an initial unchosen pool, where we simply mimic the usual setup in pool-based active learning by letting the initial chosen pool be a randomly chosen subset of $D_p$ and the initial unchosen pool be the set of remaining pairs in $D_p$. In each iteration of the algorithm, we use a strategy Sample to actively choose pairs to be moved from the unchosen pool to the chosen one. The strategy Sample takes the current ranking function into account. After sampling, a linear SVM is called to learn from the chosen pool along with its pair weights. We feed the current $\mathbf{w}$ to the linear-SVM solver to allow a warm start in optimization. The warm-start step enhances both the efficiency and the performance. The iterative procedure continues until the budget $B$ of chosen pairs is fully consumed.
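The sampling loop of Algorithm 1 can be sketched as follows. The real algorithm calls a warm-started linear-SVM solver; here a few epochs of subgradient descent on the regularized hinge loss stand in for that step, and all names (`active_sampling`, `train_warm_start`, the passive random strategy) are ours:

```python
import numpy as np

def hinge_grad(w, z):
    # Subgradient of hinge(w.z) = max(0, 1 - w.z) with respect to w.
    return -z if w @ z < 1.0 else np.zeros_like(z)

def train_warm_start(w, chosen, lam=0.01, epochs=20, lr=0.05):
    """Stand-in for the warm-started linear-SVM step: subgradient descent
    on lam/2 ||w||^2 plus the weighted pair hinge losses, starting from w."""
    w = w.copy()
    for _ in range(epochs):
        g = lam * w + sum(c * hinge_grad(w, z) for z, c in chosen)
        w = w - lr * g
    return w

def random_sample(w, pairs, unchosen, k, rng):
    # Passive baseline strategy; active strategies would instead favor
    # pairs with low closeness or correctness values under the current w.
    idx = rng.choice(len(unchosen), size=min(k, len(unchosen)), replace=False)
    return [unchosen[i] for i in idx]

def active_sampling(pairs, budget, batch, sample, d, seed=0):
    """Skeleton of Algorithm 1: repeatedly move `batch` pairs from the
    unchosen pool to the chosen pool, then retrain with a warm start,
    until the budget of chosen pairs is consumed."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    unchosen = list(range(len(pairs)))
    chosen = []
    while len(chosen) < budget and unchosen:
        for i in sample(w, pairs, unchosen, min(batch, budget - len(chosen)), rng):
            unchosen.remove(i)
            chosen.append((pairs[i], 1.0))  # (pair difference vector, weight)
        w = train_warm_start(w, chosen)
    return w, chosen

# Toy pairs: each entry is x+ - x- for a separable ranking problem.
pairs = [np.array([1.0, 0.2]), np.array([0.8, -0.1]),
         np.array([1.2, 0.0]), np.array([0.9, 0.3])]
w, chosen = active_sampling(pairs, budget=3, batch=2, sample=random_sample, d=2)
```

On this toy pool every pair difference has a positive first coordinate, so the learned $\mathbf{w}$ scores the positive side of each pair higher.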
Another main difference between the active sampling scheme and typical pool-based active learning is that we sample a batch of pairs before each training step, while pool-based active learning often executes the training step right after querying the label of one instance. The difference is due to the fact that the pairwise labels can be obtained very easily, and thus sampling and labeling are relatively cheaper than querying in pool-based active learning. Furthermore, updating the weights right after adding one pair may not lead to much improvement and can be too time-consuming for the large-scale bipartite ranking problem that we want to solve.
3.3 Sampling Strategies
Next, we discuss some possible sampling strategies that can be used in Algorithm 1. One naïve strategy is to passively choose a random sample within the unchosen pool. For active sampling strategies, we define two measures that estimate the (learning) value of an unchosen pair. The two measures correspond to well-known criteria in pool-based active learning. Let $(\mathbf{x}_i, \mathbf{x}_j)$ be an unchosen pair with $y_i = +1$ and $y_j = -1$; the two measures with respect to the current ranking function $\mathbf{w}$ are
$$\mathrm{close}(\mathbf{w}, i, j) = \big|\mathbf{w}^\top(\mathbf{x}_i - \mathbf{x}_j)\big|, \qquad (3)$$
$$\mathrm{correct}(\mathbf{w}, i, j) = \mathbf{w}^\top(\mathbf{x}_i - \mathbf{x}_j). \qquad (4)$$
The measure $\mathrm{close}$ corresponds to one of the most popular criteria in pool-based active learning, called uncertainty sampling [25]. It captures the uncertainty of the ranking function on the unchosen pair. Intuitively, a low value of $\mathrm{close}$ means that the ranking function finds it hard to distinguish the two instances in the pair, which implies that the ranking function is less confident on the pair. Therefore, sampling the unchosen pairs that come with the lowest $\mathrm{close}$ values may improve the ranking performance by resolving the uncertainty.
On the other hand, the measure $\mathrm{correct}$ is related to another common criterion in pool-based active learning, called expected error reduction [29]. It captures the performance of the ranking function on the unchosen pair. Note that this exact correctness measure is only available within our active sampling scheme because we know the pair label to always be $+1$ without loss of generality, while usual active learning algorithms do not know the exact measure before querying and hence have to estimate it [11, 12]. A low value of $\mathrm{correct}$ indicates that the ranking function does not perform well on the pair. Then, sampling the unchosen pairs that come with the lowest $\mathrm{correct}$ values may improve the ranking performance by correcting the possible mistakes. Moreover, sampling the pair with the lowest $\mathrm{correct}$ value shall change $\mathbf{w}$ the most in general, which echoes another criterion in pool-based active learning called expected model change [33].
Similar to other active learning algorithms [11, 12], computing the pairs that come with the lowest $\mathrm{close}$ or $\mathrm{correct}$ values can be time-consuming, as it requires at least evaluating the score $\mathbf{w}^\top \mathbf{x}$ for each instance $\mathbf{x}$, and then computing the measures on the $O(n^2)$ pairs along with some selection or sorting steps that may be of super-linear time complexity [22]. Thus, such a hard version of active sampling is not computationally feasible for large-scale bipartite ranking. Next, we discuss a soft version of active sampling that randomly chooses pairs that come with lower $\mathrm{close}$ or $\mathrm{correct}$ values by rejection sampling.
The soft version of active sampling can be described as follows: we consider a rejection sampling step that accepts a candidate pair with some probability, based on a method random() that generates random numbers between $0$ and $1$. A pair that comes with a lower $\mathrm{close}$ or $\mathrm{correct}$ value would enjoy a higher probability of being accepted.
Next, we define the probability value functions that correspond to the hard versions of $\mathrm{close}$ and $\mathrm{correct}$. Both value functions are in the shape of the sigmoid function $\sigma(s) = 1/(1 + e^{-s})$, which is widely used to represent probabilities in logistic regression and neural networks [4]. For soft sampling with $\mathrm{close}$, we define the probability value as $p_{\mathrm{close}}(i, j) = 2\,\sigma(-\mathrm{close}(\mathbf{w}, i, j))$; for soft sampling with $\mathrm{correct}$, we define $p_{\mathrm{correct}}(i, j) = \sigma(-\mathrm{correct}(\mathbf{w}, i, j))$. We take different forms of the soft versions because $\mathrm{close}$ is of range $[0, \infty)$ while $\mathrm{correct}$ is of range $(-\infty, \infty)$. Note that the sampling strategies above, albeit focusing on the most valuable pairs, are inherently biased. The chosen pool may not be representative enough of the whole training set because of the biased sampling strategies. There is a simple way that allows us to correct the sampling bias for learning a ranking function that performs well on the original bipartite ranking loss of interest. We take the idea of [21] to weight the sampled pair by the inverse of its probability of being sampled. That is, we multiply the weight of a chosen pair by $1/p$ when it gets accepted by the rejection sampling with probability $p$.
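A sketch of the soft (rejection) sampling step with sigmoid-shaped acceptance probabilities and the inverse-probability bias correction; the exact probability shapes and all names are our illustrative choices:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def p_close(w, xi, xj):
    # close = |w.(xi - xj)| has range [0, inf); the factor 2 makes the
    # most uncertain pairs (close = 0) accepted with probability 1.
    return 2.0 * sigmoid(-abs(w @ (xi - xj)))

def p_correct(w, xi, xj):
    # correct = w.(xi - xj) has range (-inf, inf); badly misordered
    # pairs (very negative values) are accepted with probability near 1.
    return sigmoid(-(w @ (xi - xj)))

def soft_sample(w, pool, k, prob, rng):
    """Rejection sampling: pick random candidates from the pool and accept
    pair (xi, xj) with probability prob(w, xi, xj); weight each accepted
    pair by 1/p to correct the sampling bias."""
    accepted = []
    while len(accepted) < k:
        xi, xj = pool[rng.integers(len(pool))]
        p = prob(w, xi, xj)
        if rng.random() < p:
            accepted.append(((xi, xj), 1.0 / p))
    return accepted

rng = np.random.default_rng(0)
w = np.array([1.0, -0.5])
pool = [(rng.normal(size=2), rng.normal(size=2)) for _ in range(50)]
sampled = soft_sample(w, pool, k=5, prob=p_correct, rng=rng)
```

Since each acceptance probability is at most $1$, the corrective weights $1/p$ are always at least $1$, with harder-to-sample (more representative) pairs receiving larger weights.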
3.4 Combined Ranking and Classification
Inspired by Theorem 1, the points can also carry some information for ranking. Next, we study how we can take those points into account during active sampling. We start by taking a closer look at the similarities and differences between the pointwise SVM (2) and the pairwise SVM (1). The pairwise SVM considers the weighted hinge loss on the pairs in $D_p$, while the pointwise SVM considers the weighted hinge loss on the points in $D$. Consider one positive point $\mathbf{x}^+$. Its hinge loss is $\mathrm{hinge}(\mathbf{w}^\top \mathbf{x}^+)$, which is the same as $\mathrm{hinge}(\mathbf{w}^\top(\mathbf{x}^+ - \mathbf{0}))$. In other words, the positive point can also be viewed as a pseudo-pair that consists of $\mathbf{x}^+$ and the zero vector $\mathbf{0}$. Similarly, a negative point $\mathbf{x}^-$ can be viewed as a pseudo-pair that consists of $\mathbf{0}$ and $\mathbf{x}^-$. Let the set of all pseudo-pairs within $D$ be $\tilde{D}_p$. Then, the pointwise SVM (2) is just a variant of the pairwise one (1) using the pseudo-pairs and some particular weights. Thus, we can easily unify the pointwise and the pairwise SVMs together by minimizing some weighted hinge loss on the joint set of pairs and pseudo-pairs. By introducing a parameter $\theta \in [0, 1]$ to control the relative importance between the real pairs and the pseudo-pairs, we propose the following novel formulation:
$$\min_{\mathbf{w}} \;\; \frac{\lambda}{2}\|\mathbf{w}\|^2 + \theta \sum_{(i, j) \in D_p^+} c_{ij}\, \mathrm{hinge}\big(\mathbf{w}^\top(\mathbf{x}_i - \mathbf{x}_j)\big) + (1 - \theta) \sum_{(i, j) \in \tilde{D}_p^+} c_{ij}\, \mathrm{hinge}\big(\mathbf{w}^\top(\mathbf{x}_i - \mathbf{x}_j)\big), \qquad (5)$$
where $D_p^+$ and $\tilde{D}_p^+$ denote the set of positive pairs and positive pseudo-pairs, respectively. The new formulation (5) combines the pointwise SVM and the pairwise SVM in its objective function, and hence is named the Combined Ranking and Classification (CRC) framework. When $\theta = 1$, CRC takes the pairwise SVM (1) as a special case; when $\theta = 0$, CRC takes the pointwise SVM (2) as a special case with appropriate weights $C_+$ and $C_-$. The CRC framework (5) remains as challenging to solve as the pairwise SVM approach (1) because of the huge number of pairs. However, the general framework can be easily extended to the active sampling scheme, and hence be solved efficiently. We only need to change the training set from $D_p$ to the joint set $D_p \cup \tilde{D}_p$, and multiply the probability value in the soft-version sampling by $\theta$ or $(1 - \theta)$ for actual pairs or pseudo-pairs, respectively.
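The unification can be checked concretely: with the hinge written out, the combined objective reduces to a pairwise-only or a pointwise-only objective at the two extremes of the trade-off parameter. The sketch below uses unit pair weights and our own function name, with the trade-off parameter written as `theta`:

```python
import numpy as np

hinge = lambda s: np.maximum(0.0, 1.0 - s)

def crc_objective(w, P, N, theta, lam=0.01):
    """CRC objective with unit weights: lam/2 ||w||^2 plus theta times the
    real-pair hinge terms plus (1 - theta) times the pseudo-pair terms."""
    pair_term = sum(hinge(w @ (xp - xn)) for xp in P for xn in N)
    # Pseudo-pairs (x+, 0) and (0, x-) give exactly the pointwise hinge terms.
    pseudo_term = hinge(P @ w).sum() + hinge(-(N @ w)).sum()
    return 0.5 * lam * (w @ w) + theta * pair_term + (1.0 - theta) * pseudo_term

rng = np.random.default_rng(1)
P = rng.normal(size=(4, 3))   # positive instances
N = rng.normal(size=(5, 3))   # negative instances
w = rng.normal(size=3)

pairwise_only = crc_objective(w, P, N, theta=1.0)   # pairwise special case
pointwise_only = crc_objective(w, P, N, theta=0.0)  # pointwise special case
```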
The CRC framework is closely related to the algorithm of Combined Ranking and Regression (CRR) [31] for general ranking. The CRR algorithm similarly considers a combined objective function of pointwise terms and pairwise terms for improving the ranking performance. The main difference between CRR and CRC is that CRR takes the squared loss on the points, while CRC takes the nature of bipartite ranking into account and considers the hinge loss on the points. On the other hand, the idea of combining the pairwise and pointwise approaches has also been used in another machine learning setup, the multi-label classification problem [36]. The algorithm of Calibrated Ranking by Pairwise Comparison [19] assumes a calibration label between relevant and irrelevant labels, and hence unifies the pairwise and pointwise label learning for multi-label classification. The calibration label plays a similar role to the zero vector in the pseudo-pairs for combining the pairwise and pointwise approaches.
To the best of our knowledge, while the CRR approach has reached promising performance in practice [31], the CRC formulation has not been seriously studied. The hinge loss used in CRC allows unifying the pointwise SVM and the pairwise SVM under the same framework, and the unification is essential for applying one active sampling strategy to both the real pairs and the pseudo-pairs.
In summary, we propose the active sampling scheme for RankSVM (ASRankSVM) and for the more general CRC framework (ASCRC), and derive two sampling strategies that correspond to popular strategies in pool-based active learning. The soft version of the sampling strategies helps reduce the computational cost, and allows correcting the sampling bias by adjusting the weights with the inverse probability of being sampled.
3.5 CRC with Threshold
In Theorem 1, we connect the pointwise SVM without a threshold term (2) to the pairwise SVM (1). The standard SVM for binary classification, however, often comes with a threshold term $b$ to allow the classification hyperplane to be away from the origin. That is, the standard SVM solves
$$\min_{\mathbf{w}, b} \;\; \frac{\lambda}{2}\|\mathbf{w}\|^2 + C_+ \sum_{\mathbf{x}^+ \in P} \mathrm{hinge}\big(\mathbf{w}^\top \mathbf{x}^+ + b\big) + C_- \sum_{\mathbf{x}^- \in N} \mathrm{hinge}\big(-(\mathbf{w}^\top \mathbf{x}^- + b)\big). \qquad (6)$$
Note that for any given $b$,
$$\big(\mathbf{w}^\top \mathbf{x}^+ + b\big) - \big(\mathbf{w}^\top \mathbf{x}^- + b\big) = \mathbf{w}^\top(\mathbf{x}^+ - \mathbf{x}^-).$$
If we revisit the proof of Theorem 1 with the equation above, we get a similar theorem that connects the standard SVM to the pairwise SVM.
Theorem 2. For any $(\mathbf{w}, b)$, the objective function of (1) with constant pair weights is upper-bounded by the objective function of the balanced form of (6) with $C_+ = |N|$ and $C_- = |P|$.
Given the connection between (6) and (1) in Theorem 2, one may wonder whether the trick of pseudo-pairs works for connecting the two formulations. Denote the extended feature vector by $\tilde{\mathbf{x}} = (\mathbf{x}, 1)$ and the extended weight vector by $\tilde{\mathbf{w}} = (\mathbf{w}, b)$. Consider one positive point $\mathbf{x}^+$. Its hinge loss within (6) is $\mathrm{hinge}(\mathbf{w}^\top \mathbf{x}^+ + b)$, which is the same as $\mathrm{hinge}(\tilde{\mathbf{w}}^\top(\tilde{\mathbf{x}}^+ - \tilde{\mathbf{0}}))$, where $\tilde{\mathbf{0}}$ is a zero vector of length $d + 1$. Thus, the positive point can also be viewed as an extended pseudo-pair that consists of $\tilde{\mathbf{x}}^+$ and $\tilde{\mathbf{0}}$, ranked by the extended vector $\tilde{\mathbf{w}}$. Similarly, a negative point $\mathbf{x}^-$ can be viewed as an extended pseudo-pair that consists of $\tilde{\mathbf{0}}$ and $\tilde{\mathbf{x}}^-$.
Note that if we consider all the extended vectors $\tilde{\mathbf{x}} = (\mathbf{x}, 1)$, ranking a pair of extended vectors by $\tilde{\mathbf{w}} = (\mathbf{w}, b)$ means calculating
$$\tilde{\mathbf{w}}^\top(\tilde{\mathbf{x}}^+ - \tilde{\mathbf{x}}^-) = \mathbf{w}^\top(\mathbf{x}^+ - \mathbf{x}^-) + b - b = \mathbf{w}^\top(\mathbf{x}^+ - \mathbf{x}^-).$$
That is, the hinge loss on the extended pairs is exactly the same as the hinge loss on the original pairs.
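This cancellation of the threshold on real pairs, and its retention on pseudo-pairs, is easy to check numerically (the tilde-style variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
w = rng.normal(size=d)
b = rng.normal()
xp = rng.normal(size=d)   # a positive instance
xn = rng.normal(size=d)   # a negative instance

# Extended vectors: append a constant-1 feature; the extended weight carries b.
w_ext = np.append(w, b)
xp_ext = np.append(xp, 1.0)
xn_ext = np.append(xn, 1.0)

# On a real pair the threshold cancels ...
real_pair_equal = np.isclose(w_ext @ (xp_ext - xn_ext), w @ (xp - xn))
# ... while on an extended pseudo-pair (x_ext, 0) it is retained.
zero_ext = np.zeros(d + 1)
pseudo_pair_equal = np.isclose(w_ext @ (xp_ext - zero_ext), w @ xp + b)
```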
Based on the discussions above, if we define extended pairs and extended pseudo-pairs based on the extended vectors $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{0}}$, we can combine the pairwise SVM and the standard SVM with the threshold term to design a variant of the CRC formulation. In the variant, we take one common trick (as taken by LIBLINEAR [15]) that includes $b$ in the regularization term to allow a simpler design of optimization routines. That is, we solve
$$\min_{\tilde{\mathbf{w}}} \;\; \frac{\lambda}{2}\|\tilde{\mathbf{w}}\|^2 + \theta \sum_{(i, j) \in \tilde{D}_{\mathrm{pair}}^+} c_{ij}\, \mathrm{hinge}\big(\tilde{\mathbf{w}}^\top(\tilde{\mathbf{x}}_i - \tilde{\mathbf{x}}_j)\big) + (1 - \theta) \sum_{(i, j) \in \tilde{D}_{\mathrm{pseudo}}^+} c_{ij}\, \mathrm{hinge}\big(\tilde{\mathbf{w}}^\top(\tilde{\mathbf{x}}_i - \tilde{\mathbf{x}}_j)\big) \qquad (7)$$
in our study, where the two sums run over the extended positive pairs and the extended positive pseudo-pairs, respectively. We call this formulation (7) CRC-threshold, which is simply equivalent to the original CRC formulation (5) applied to the extended vectors. The equivalence allows us to easily test whether the flexibility of $b$ (through using the extended vectors $\tilde{\mathbf{x}}$) can improve the original CRC formulation.
4 Experiments
In this section, we study the performance and efficiency of our proposed AS-CRC algorithm on real-world large-scale data sets. We compare AS-CRC with random-CRC, which does random sampling under the CRC framework. In addition, we compare AS-CRC with three other state-of-the-art algorithms for large-scale bipartite ranking: the pointwise weighted linear SVM (2) (WSVM), an efficient implementation [22] of the pairwise linear RankSVM (1) (ERankSVM), and the combined ranking and regression (CRR) [31] algorithm for general ranking.
We use 14 data sets from the LIBSVM Tools (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/) and the UCI Repository [23] in the experiments. Table I shows the statistics of the data sets, which contain more than tens of thousands of instances and more than tens of millions of pairs. The data sets are definitely too large for a naïve implementation of RankSVM (1). The data sets marked with * are originally multi-class data sets, and we take the sub-problem of ranking the first class ahead of the other classes as a bipartite ranking task. For data sets that come with a moderate-sized test set, we report the test AUC. Otherwise we perform cross-validation and report the cross-validation AUC.
Table I: Statistics of the data sets
Data  Points  Pairs  Dimension  AUC
MQ08  15211  26589360  10042  CV 
letter*  20000  30314958  16  CV 
yahoo2  34815  43609852  1966  test 
yahoo2_v2  34815  125014968  1966  test 
protein*  17766  156876928  357  test 
news20  19996  199920006  1355191  CV 
rcv1  20242  204595482  47236  CV 
a9a  32561  387659040  123  test 
bank  45211  422294916  51  CV 
ijcnn1  49990  438099722  22  CV 
MQ07  69623  508061760  10036  CV 
shuttle*  43500  640684672  9  test 
mnist*  60000  640596142  780  test 
connect*  67557  2053229464  126  CV 
acoustic*  78823  2211845364  50  test 
realsim  72309  2226957796  20958  CV 
yahoo1  473134  8300699226  20643  test 
yahoo1_v2  473134  38617091106  20643  test 
covtype  581012  168683648022  54  CV 
url  2396130  2541177395650  3231961  CV 
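For concreteness, the AUC criterion reported throughout can be computed directly from its pairwise definition: one minus the fraction of misordered positive-negative pairs. The helper below is an illustrative sketch (not part of the compared algorithms); its naive O(n_pos × n_neg) cost is exactly why the pair counts in Table I make full pairwise training infeasible.

```python
import numpy as np

def pairwise_auc(scores_pos, scores_neg):
    """AUC as 1 minus the fraction of misordered positive-negative pairs.

    Ties count as half a misordering, a common convention. This naive
    all-pairs computation is only feasible for toy-sized inputs.
    """
    sp = np.asarray(scores_pos, dtype=float)[:, None]
    sn = np.asarray(scores_neg, dtype=float)[None, :]
    # broadcast to an (n_pos, n_neg) grid of pairwise comparisons
    return (sp > sn).mean() + 0.5 * (sp == sn).mean()
```

For example, `pairwise_auc([2, 3], [0, 1])` gives a perfect 1.0, while a single tied pair gives 0.5.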
4.1 Experiment Settings
Given a budget on the number of pairs to be used in each algorithm and a global regularization parameter, we set the instance weights for each algorithm to fairly maintain the numerical scale between the regularization term and the loss terms. The global regularization parameter is fixed in all the experiments. In particular, the setting below ensures that the total weight, summed over all the pairs (or pseudo-pairs), is the same for all the algorithms.

CRR: We use the package sofia-ml [31] with the sgd-svm learner type, the combined-ranking loop type, and the default number of iterations that SGD takes to solve the problem. We set its regularization parameter to match the global setting.

AS-CRC (AS-RankSVM): We initialize the weights in the first iteration, and assign the weight of each sampled pair in each iteration, where the base weight takes one of two constants for real and pseudo-pairs, scaled by a normalization constant that prevents the weights from being too large. We solve the linear SVM within AS-CRC by the LIBLINEAR [15] package with its extension on instance weights.

random-CRC: random-CRC simply corresponds to AS-CRC with a constant acceptance probability for all the pairs. That is, random-CRC samples uniformly within the unlabeled pool.
To evaluate the average performance of the AS-CRC and random-CRC algorithms, we average their results over different initial pools.
4.2 Performance Comparison and Robustness
Next, we examine the necessity of three key designs within the active sampling framework: the soft version versus the hard version, sampling bias correction within the soft version of active sampling, and the choice of soft-version value functions. We first set AS-CRC and random-CRC to sample real pairs only, which makes AS-CRC equivalent to AS-RankSVM. We use a relatively small budget out of the millions of pairs.
Table II: Results summary under the 95% t-test (win/loss/tie)
               Close   Correct
Soft vs. Hard  10/3/1  13/0/1
4.2.1 Soft Version versus Hard Version
We will discuss the time difference between the soft and hard versions of sampling in Table V of Section 4.3. The soft versions are both coupled with bias correction. Intuitively, the soft version is much faster than the hard version. Here we first examine the performance difference between the two versions. In Table II, we compare the soft and hard versions of closeness and correctness sampling under a t-test at the 95% confidence level. For closeness sampling, the soft version performs better than the hard version on 9 data sets and ties on 3; for correctness sampling, the soft version performs better than the hard version on 12 data sets and ties on 1. The results justify that the soft version is a better choice than the hard version in terms of AUC performance.
Fig. 3 further shows how the AUC changes as the chosen pool grows for different versions of sampling, along with the baseline ERankSVM algorithm. We see that hard correctness sampling always leads to unsatisfactory performance. One possible reason is that hard correctness sampling can easily suffer from sampling the noisy pairs, which come with larger hinge loss. On the other hand, hard closeness sampling is competitive with the soft versions (albeit slower), but appears to saturate to a less satisfactory model in Fig. 3(b). The saturation corresponds to a known problem of uncertainty sampling in active learning caused by the restricted view of the non-perfect model used for sampling [26]. The soft version, on the other hand, has some probability of escaping from the restricted view, and hence enjoys better performance.
Table III: AUC difference between doing and not doing bias correction for soft-version sampling
         letter  protein  news20  rcv1  a9a  bank  ijcnn1
Close    1.0540e-03  1.0400e-03  3.2210e-03  5.4200e-04  3.3000e-05  2.2160e-03  2.9800e-04
Correct  2.4430e-03  9.3760e-03  2.9550e-03  9.6000e-05  2.1680e-03  4.5100e-04  7.3300e-04
shuttle  mnist  connect  acoustic  realsim  yahoo1  covtype  url  
Close    9.3600e-04  2.7600e-04  7.1700e-04  2.6960e-03  8.2700e-04  1.0650e-03  1.7220e-03  3.8000e-05
Correct  3.0700e-04  2.4700e-04  3.2030e-03  4.9000e-05  1.8710e-03  5.4300e-04  2.8380e-03  2.7900e-04
4.2.2 Bias Correction for Soft Version Sampling
Next, we show the AUC difference between doing bias correction (see Section 3.3) and not doing so for soft-version sampling in Table III. A positive difference indicates that doing bias correction leads to better performance. First of all, we see that the difference from the bias correction is relatively small. For soft-close sampling, performing bias correction is slightly worse on 12 data sets; for soft-correct sampling, performing bias correction is slightly better on 9 data sets. Note that correctness sampling is inherently more biased towards the noisy pairs, as discussed for hard-version sampling. Thus, performing bias correction can be necessary and helpful in ensuring stability, as justified by the better performance on those 9 data sets.
4.2.3 Value Functions for Soft Version Sampling
We show how the AUC changes as the chosen pool grows throughout the active sampling steps of AS-RankSVM in Figs. 2-5. For WSVM, ERankSVM and CRR, we plot a horizontal line at the AUC achieved when using the whole training set. We also list the final AUC with the standard deviation of all the algorithms in Table IV.
Table IV: Final AUC (with standard deviation); the last three columns are AS-RankSVM variants
Data  WSVM  ERankSVM  CRR  Random  Soft-Close  Soft-Correct
MQ08  .8174 (*)  .864  .8596  .8587 .0041  .8621 .0029  .8599 .0032 
letter  .9808  .9877  .9874  .9883 .0003  .9883 .0002  .9874 .0123 
yahoo2  .9544  .9873  .7885(*)  .9525 .0015  .9561 .0007  .9481 .0047 
yahoo2_v2  .9014  .9513  .7134(*)  .8963 .0015 (*)  .8999 .0008  .8974 .0011 
protein  .8329  .8302  .8306  .8229 .0031  .8240 .0016  .8233 .0028 
news20  .9379 (*)  .9753 (*)  .9743 (*)  .9828 .0008 (*)  .9836 .0006 (*)  .9903 .0003 
rcv1  .9876 (*)  .9916 (*)  .9755 (*)  .9920 .0004 (*)  .9923 .0003 (*)  .9944 .0002 
a9a  .9008  .9047  .8999 (*)  .9003 .0006  .9012 .0004  .9007 .0006 
bank  .8932 (*)  .9023 (*)  .8972 (*)  .9051 .0010 (*)  .9057 .0011 (*)  .9083 .0007 
ijcnn1  .9335 (*)  .9343 (*)  .9336 (*)  .9342 .0004 (*)  .9345 .0006  .9348 .0003 
MQ07  .8238 (*)  .8951 (*)  .8894(*)  .8886 .0028 (*)  .8922 .0024 (*)  .897 .0017 
shuttle  .9873 (*)  .9876 (*)  .9888 (*)  .9894 .0001 (*)  .9896 .0001 (*)  .9907 .0000 
mnist  .9985  .9983  .9973 (*)  .9967 .0004 (*)  .9979 .0001  .9976 .0002 
connect  .8603  .8613  .8532 (*)  .8594 .0008 (*)  .8604 .0007  .8603 .0009 
acoustic  .8881 (*)  .8911 (*)  .8931 (*)  .8952 .0005 (*)  .8952 .0005 (*)  .8988 .0004 
realsim  .9861 (*)  .9908  .9907  .9908 .0003  .9915 .0002  .9934 .0064 
yahoo1  .955 (*)  .9609  .941 (*)  .9524 .001 (*)  .9554 .0007 (*)  .9571 .0006 
yahoo1_v2  .806  .8234  .7653(*)  .8014 .0010 (*)  .8038 .0008 (*)  .8049 .0011 
covtype  .8047 (*)  .8228 (*)  .8189 (*)  .8238 .0008 (*)  .8239 .0007 (*)  .8249 .0006 
url  .9963 (*)  .9967 (*)  .9956 (*)  .9940 .0003 (*)  .9961 .0001 (*)  .9984 .0015 
Total (Soft-Correct win/loss/tie)  12/5/3  9/9/2  16/1/3  14/1/5  10/4/6  – 
(*) marks the cases where AS-RankSVM with soft-correct sampling wins significantly under the right-tail t-test at the 95% significance level.
From Figs. 2-5 and Table IV, we see that soft-correct sampling is generally the best. We also conduct the right-tail t-test for soft-correct against the others, and list the results in Table IV to show whether the improvement of soft-correct sampling is significant.
First, we compare soft-correct with random sampling and discover that soft-correct performs better on 10 data sets and ties on 4, which shows that active sampling is working reasonably well. When comparing soft-close with soft-correct in Table IV, we find that soft-correct outperforms soft-close on 7 data sets and ties on 5. Moreover, Figs. 2-5 show that the strong performance of soft-correct comes from the early steps of active sampling. Finally, when comparing soft-correct with the other algorithms, we discover that soft-correct performs the best on 8 data sets: it outperforms ERankSVM on 8 data sets, WSVM on 9 data sets, and CRR on 11 data sets. The results demonstrate that even when using a pretty small sampling budget of pairs, AS-RankSVM with soft-correct sampling can achieve significant improvement over those state-of-the-art ranking algorithms that use the whole training data set. Also, the tiny standard deviations shown in Table IV and the significant results from the t-test suggest the robustness of AS-RankSVM with soft-correct in general.
Nevertheless, we observe a potential problem of soft-correct sampling from Figs. 2-5. On the data sets letter and mnist, the performance of soft-correct increases faster than soft-close in the beginning, but starts dropping in the middle. The possible reason, similar to hard-version sampling, is the existence of noisy pairs that had better not be put into the chosen pool. When sampling more pairs, the probability that some noisy pairs (which come with larger hinge loss) are sampled by soft-correct sampling is higher, which can in turn lead to degraded performance. The results suggest a possible future work of combining the benefits of soft-close and soft-correct sampling to be more noise-tolerant.
Table V: CPU time of all the algorithms
Data  Random  Soft-Close  Soft-Correct  Hard-Close  Hard-Correct  WSVM  ERankSVM  CRR
MQ08  0.662  0.72  1.3192  14.6544  8.6444  0.092  1.506  0.256 
letter  0.1616  0.306  3.9204  38.5342  8.5106  0.058  0.426  0.174 
yahoo2  19.372  19.83  39.1  180.097  91.761  11.39  28.05  7.68 
yahoo2_v2  22.732  22.828  35.059  95.662  99.154  11.42  34.05  7.83 
protein  3.136  2.943  4.29  12.128  8.315  0.85  3.05  0.99 
news20  11.7188  11.3012  13.2788  36.8466  25.6112  4.02  10.454  3.938 
rcv1  1.1744  1.2636  4.3578  22.8056  7.0516  0.366  2.186  0.82 
a9a  0.374  0.504  1.065  30.384  19.537  0.28  1.96  0.31 
bank  0.3914  0.4602  0.9288  20.8064  13.8512  0.142  3.368  0.3 
ijcnn1  0.1914  0.3016  0.8062  21.4004  15.864  0.138  1.768  0.324 
MQ07  1.9086  1.9132  2.2412  44.915  37.3924  0.512  11.618  1.42 
shuttle  0.146  0.288  3.02  26.307  17.577  0.18  0.62  0.36 
mnist  1.61  2.851  56.604  205.135  50.174  4.92  14.13  2.75 
connect4  0.5468  0.6458  1.0986  23.4094  24.2718  0.5  9.156  0.732 
acoustic  0.488  0.624  1.03  33.39  41.167  1.82  6.66  1.93 
realsim  0.805  0.87  2.3296  63.7404  27.8084  0.63  4.79  1.77 
yahoo1  42.306  42.284  62.08  803.822  517.441  73.29  500.01  53.26 
yahoo1_v2  38.886  37.779  41.868  412.988  848.064  50.07  585  33.16 
covtype  0.294  0.3602  0.5478  160.108  1147.99  1.282  29.206  3.122 
url  6.2052  6.2022  32.6044  2440.91  5707.88  23.322  536.146  56.474 
4.3 Efficiency Comparison
First, we study the efficiency of soft active sampling by checking the average number of rejected samples before passing the probability threshold during rejection sampling. The number is plotted against the size of the chosen pool in Fig. 4. The soft-close strategy usually needs only a small number of rejected samples, while the soft-correct strategy generally needs an increasing number of rejected samples. The reason is that when the ranking performance becomes better throughout the iterations, the probability threshold behind soft-correct can be pretty small. The results suggest that the soft-close strategy is generally efficient, while the soft-correct strategy may become less efficient as the chosen pool grows.
Next, we list the CPU time consumed by all the algorithms under the given pair budget in Table V, where the data sets are ordered by ascending size. We can see that WSVM and CRR run fast but give inferior performance; ERankSVM performs better but its training time grows quickly as the data size increases. The result is consistent with the discussion in Section 1 that conducting bipartite ranking efficiently and accurately at the same time is challenging.
For AS-RankSVM, random sampling runs the fastest, then soft-close, and soft-correct is the slowest. The results reflect the average number of rejected samples discussed above. In addition, not surprisingly, the soft-version samplings are usually much faster than the corresponding hard versions, which validates that the time-consuming enumerating or sorting steps do not fit our goal in terms of efficiency.
More importantly, when comparing soft-correct with ERankSVM, soft-correct runs faster on 7 data sets, which suggests that AS-RankSVM is as efficient as the state-of-the-art ERankSVM on large-scale data sets in general. Moreover, the CPU time of soft-correct grows much more slowly than that of ERankSVM as the data size increases, because the time complexity of AS-RankSVM mainly depends on the budget and the step size, not on the size of the data.
4.4 The Usefulness of Larger Budget
From the previous experiments, we have shown that AS-RankSVM with a small budget of pairs can perform better than the other competitors on large-scale data sets. Now, we check the performance of AS-RankSVM with different budget sizes. In Figure 6, we show the AUC curves with much larger budgets on two data sets. We find that the performance of AS-RankSVM can be improved or maintained as the budget size increases. For example, on the data set protein, we can match the performance of WSVM with around 40,000 pairs and surpass it slightly with around 80,000 pairs. Nevertheless, on most data sets, we find that the slope of the AUC curves becomes flat around 10,000 pairs, and eventually converges as the budget increases. That is, increasing the budget in AS-RankSVM leads to consistent but marginal improvements.
Note that the potential problem of sampling noisy pairs within soft-correct sampling can be more serious when the budget size increases. Fig. 9 illustrates the problem on the realsim data set with increasing budgets, where the performance of soft-correct degrades and many more pairs are rejected in the later sampling iterations. On the other hand, soft-close maintains its robustness and efficiency as the budget increases, and improves the performance consistently throughout the iterations. Thus, if a larger budget is used, soft-close can be a better choice than soft-correct.
Table VI: Best settings for each sampling strategy
Data  Soft-Close  Soft-Correct
letter  uniform,uniformth  0.9,0.9th 
protein  1  0.4th 
news20  uniform,1,uniformth  uniform,1,uniformth 
rcv1  uniform,1,uniformth  uniform,1,uniformth 
a9a  uniform,uniformth  0.6th 
bank  1  uniform,1,uniformth 
ijcnn1  1,uniformth  0.8,0.9,1,uniformth,0.9th 
shuttle  0.1th,0.2th  uniformth 
mnist  0.1  0.6 
connect  uniform,1,uniformth  uniform,1,uniformth 
acoustic  1  uniform,1,uniformth 
realsim  uniform,1,uniformth  uniform,uniformth 
yahoo1  uniform  uniformth, 0.9th 
covtype  uniform,1  uniform,0.9,1,uniformth,0.9th 
url  uniform,1,uniformth  0.8 
4.5 The Usefulness of the CRC Framework
Next, we study the necessity of the CRC framework by comparing the performance of soft-closeness and soft-correctness sampling under different choices of the combination parameter. We report the best settings under a 95% significance level, where "uniform" means balancing the influence of actual pairs and pseudo-pairs. Moreover, we check whether CRC-threshold can be useful. Table VI shows the best setting and formulation for each sampling strategy. The entries with "th" indicate CRC-threshold. The bold entries indicate that the setting outperforms ERankSVM significantly. The table suggests three observations. First, the choice of sampling strategy does not affect the optimal setting much, and most data sets have similar optimal settings for both soft-closeness and soft-correctness sampling. Second, we find that adding a threshold term to CRC can sometimes reach better performance. Last, we see that using real pairs only performs well on most data sets, while a smaller combination parameter can sometimes reach better performance. The results justify that the real pairs are more important than the pseudo-pairs, while the latter can sometimes be helpful.
In summary, a special case of the proposed AS-CRC algorithm that only samples actual pairs (AS-RankSVM) works reasonably well under a small budget when coupled with soft-correct sampling. The setting significantly outperforms WSVM, ERankSVM, CRR and soft-close on most of the data sets, and the execution time shows that the efficiency of soft-correct sampling is comparable with that of ERankSVM. While sampling real pairs only leads to promising performance on most of the data sets, further tuning with a smaller combination parameter or adding a threshold term helps on some data sets.
5 Conclusion
We propose the algorithm of Active Sampling (AS) under the Combined Ranking and Classification (CRC) framework based on the linear SVM. There are two major components of the proposed algorithm. The AS scheme selects valuable pairs for training and resolves the computational burden in large-scale bipartite ranking. The CRC framework unifies the concepts of pointwise ranking and pairwise ranking, and can perform better than pure pointwise ranking or pairwise ranking. The unified view of pairs and points (pseudo-pairs) in CRC allows using one AS scheme to select from both types of pairs.
Experiments on 14 real-world large-scale data sets demonstrate the promising performance and efficiency of the AS-RankSVM and AS-CRC algorithms. The algorithms usually outperform state-of-the-art bipartite ranking algorithms, including the pointwise SVM, the pairwise SVM, and the combined ranking and regression approach. The results not only justify the validity of AS-CRC, but also show that sampling valuable pairs or pseudo-pairs can be helpful for large-scale bipartite ranking.
References
 [1] N. Ailon. An active learning algorithm for ranking from pairwise preferences with an almost optimal query complexity. JMLR, 13:137–164, 2012.
 [2] N. Ailon and M. Mohri. An efficient reduction of ranking to classification. arXiv preprint arXiv:0710.2889, 2007.
 [3] M.F. Balcan, N. Bansal, A. Beygelzimer, D. Coppersmith, J. Langford, and G. B. Sorkin. Robust reductions from ranking to classification. Machine Learning, 72(1-2):139–153, 2008.
 [4] E. B. Baum and F. Wilczek. Supervised learning of probability distributions by neural networks. In NIPS, pages 52–61, 1988.
 [5] U. Brefeld and T. Scheffer. AUC maximizing support vector learning. In ICML Workshop on ROC Analysis in Machine Learning, 2005.
 [6] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML, pages 89–96, 2005.
 [7] C. J. Burges, Q. V. Le, and R. Ragno. Learning to rank with nonsmooth cost functions. NIPS, 19:193–200, 2007.
 [8] R. Caruana, T. Joachims, and L. Backstrom. KDDCup 2004: results and analysis. ACM SIGKDD Explorations Newsletter, 6(2):95–108, 2004.
 [9] S. Clemençon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of Ustatistics. The Annals of Statistics, 36(2):844–874, 2008.
 [10] C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. NIPS, 16(16):313–320, 2004.
 [11] P. Donmez and J. G. Carbonell. Optimizing estimated loss reduction for active sampling in rank learning. In ICML, pages 248–255, 2008.
 [12] P. Donmez and J. G. Carbonell. Active sampling for rank learning via optimizing the area under the ROC curve. Advances in Information Retrieval, pages 78–89, 2009.
 [13] J. Duchi, L. Mackey, and M. I. Jordan. On the consistency of ranking algorithms. In ICML, pages 327–334, 2010.
 [14] Ş. Ertekin and C. Rudin. On equivalence relationships between classification and ranking algorithms. JMLR, 12:2905–2929, 2011.
 [15] R.E. Fan, K.W. Chang, C.J. Hsieh, X.R. Wang, and C.J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
 [16] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.
 [17] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. JMLR, 4:933–969, 2003.
 [18] Y. Freund and R. E. Schapire. A decisiontheoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
 [19] J. Fürnkranz, E. Hüllermeier, E. L. Mencía, and K. Brinker. Multilabel classification via calibrated label ranking. Machine Learning, 73(2):133–153, 2008.
 [20] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. NIPS, pages 115–132, 1999.
 [21] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
 [22] T. Joachims. Training linear SVMs in linear time. In KDD, pages 217–226, 2006.
 [23] K. Bache and M. Lichman. UCI machine learning repository, 2013.
 [24] W. Kotłowski, K. Dembczynski, and E. Hüllermeier. Bipartite ranking through minimization of univariate loss. In ICML, pages 1113–1120, 2011.
 [25] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In SIGIR, pages 3–12, 1994.
 [26] C.L. Li, C.S. Ferng, and H.T. Lin. Active learning with hinted support vector machine. In ACML, 2012.
 [27] T.Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.
 [28] S. Rajaram, C. K. Dagli, N. Petrovic, and T. S. Huang. Diverse active ranking for multimedia search. In CVPR, pages 1–8, 2007.
 [29] N. Roy and A. McCallum. Toward optimal active learning through monte carlo estimation of error reduction. In ICML, pages 441–448, 2001.
 [30] C. Rudin and R. E. Schapire. Marginbased ranking and an equivalence between AdaBoost and RankBoost. JMLR, 10:2193–2232, 2009.
 [31] D. Sculley. Combined regression and ranking. In KDD, pages 979–988, 2010.
 [32] B. Settles. Active learning literature survey. U. of Wisconsin, Madison, 2010.
 [33] B. Settles, M. Craven, and S. Ray. Multipleinstance active learning. In NIPS, 2008.
 [34] W.Y. Shen and H.T. Lin. Active sampling of pairs and points for large-scale linear bipartite ranking. In ACML, pages 388–403, 2013.
 [35] H. Steck. Hinge rank loss and the area under the ROC curve. ECML, pages 347–358, 2007.
 [36] G. Tsoumakas and I. Katakis. Multilabel classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1–13, 2007.

 [37] V. Vapnik. The nature of statistical learning theory. Springer, 1999.
 [38] K.W. Wu, C.S. Ferng, C.H. Ho, A.C. Liang, C.H. Huang, W.Y. Shen, J.Y. Jiang, M.H. Yang, T.W. Lin, C.P. Lee, et al. A two-stage ensemble of diverse models for advertisement ranking in KDD Cup 2012. In ACM SIGKDD KDD-Cup Workshop, 2012.
 [39] H. Yu. SVM selective sampling for ranking with application to data retrieval. In KDD, pages 354–363, 2005.
 [40] G.X. Yuan, C.H. Ho, and C.J. Lin. Recent advances of largescale linear classification. Proceedings of the IEEE, 100(9):2584–2603, 2012.