Text Classification and Co-training from Positive and Unlabeled Examples

François Denis (fdenis@cmi.univ-mrs.fr), Anne Laurent (alaurent@cmi.univ-mrs.fr)
Équipe BDAA, LIF (UMR 6166), Centre de Mathématiques et d'Informatique (CMI), Université de Provence, 39, rue F. Joliot Curie, 13453 Marseille CEDEX 13, FRANCE

Rémi Gilleron (gilleron@univ-lille3.fr), Marc Tommasi (tommasi@univ-lille3.fr)
Équipe Grappa (EA 3588) and projet MOSTRARE (UR INRIA Futurs), Université de Lille 3, domaine universitaire du "Pont de bois", BP 149, 59653 Villeneuve d'Ascq CEDEX, FRANCE

Abstract

In the general framework of semi-supervised learning from labeled and unlabeled data, we consider the specific problem of learning from a pool of positive data, without any negative data but with the help of unlabeled data. We study a naive Bayes algorithm, PNB, that learns from positive and unlabeled examples. Then, we consider the case where the number of positive examples is quite small, assuming that the co-training setting is relevant, i.e. assuming that the datasets have a natural separation of their features into two sets. We design a co-training algorithm, PNCT, that learns from positive and unlabeled examples. We give experimental results for the two algorithms PNB and PNCT. They show that text classification with naive Bayes is feasible with positive and unlabeled examples, and that co-training algorithms can significantly improve learning accuracy when the available set of positive data is small.

1. Introduction

It is often tedious and expensive to hand-label large amounts of training data. Semi-supervised learning algorithms, which learn from a small set of labeled data with the help of unlabeled data, have therefore been defined recently. Such approaches include using Expectation Maximization to estimate maximum a posteriori parameters (Nigam et al., 2000; Amini & Gallinari, 2003), using transductive inference for support vector machines (Joachims, 1999), using the unlabeled data to define a metric or a kernel function (Hofmann, 1999), and using a partition of the set of features into two disjoint sets (Blum & Mitchell, 1998; Nigam & Ghani, 2000; Muslea et al., 2002).

Here, we consider the problem of learning from positive data with the help of unlabeled data. In many text learning tasks, such as document retrieval and classification, one goal is the efficient classification and retrieval of documents matching the interests of a given user. Positive information is readily available and unlabeled data can easily be collected. One example is learning to classify web pages as "interesting" for a specific user. The documents pointed to by the user's bookmarks define a set of positive examples, because they correspond to web pages that the user finds interesting, while negative examples are not available at all. Nonetheless, unlabeled examples are easily available on the World Wide Web.

Theoretical results show that, in order to learn from positive and unlabeled data, it is sometimes sufficient to treat unlabeled data as negative (Denis, 1998; Liu et al., 2002). The starting point of Liu et al. (2002) is a constrained approximation approach: the idea is to select a function that correctly classifies all positive documents and minimizes the number of unlabeled documents classified as positive. Following this idea, they define a new learning algorithm, S-EM, built on the naive Bayes classifier in conjunction with the EM (Expectation Maximization) algorithm.
Another approach, in the statistical query learning model, is to estimate statistical queries over positive and unlabeled examples (Denis et al., 2003). We follow this idea and define a naive Bayes classifier PNB. PNB takes as input positive and unlabeled data, together with an estimate, possibly rough, of the positive class probability. In practical situations, the positive class probability can be empirically estimated or provided by using some domain knowledge. We compare the performance of S-EM, NB and PNB on three public domain document datasets: the WebKB Course dataset, the Reuters collection and the 20-newsgroups dataset.

Next, we consider situations where only a small set of positive data is available together with unlabeled data. In these situations, building accurate classifiers may fail because of the poverty of the input data. However, learning is still possible when the existence of two different views over the data is assumed, as in the co-training framework introduced by Blum and Mitchell (1998). For instance, consider the retrieval of bibliographic references. Positive examples are stored in the user's database. A first view consists of the bibliographic fields: title, author, abstract, editor. A second view is the full content of the paper. Unlabeled examples are easily available in the bibliographic databases accessible via the World Wide Web. Co-training algorithms incrementally build base classifiers over each of the two feature sets. Co-training methods have been used previously to train classifiers in applications like text classification (Blum & Mitchell, 1998), word-sense disambiguation (Yarowsky, 1995) and named-entity classification (Collins & Singer, 1999). Co-training is a special case of multi-view learning, for which semi-supervised learning algorithms have been defined (Muslea et al., 2002).

We define a co-training algorithm PNCT for which the seed information is a small pool of positive documents. At first, PNCT incrementally builds naive Bayes classifiers from positive and unlabeled documents over each of the two views by using PNB. Along the co-training steps, self-labeled positive examples and self-labeled negative examples are added to the training sets. We propose a base algorithm PNNB, a variant of PNB able to use these self-labeled examples. Experiments on the WebKB Course dataset are performed; they show that co-training algorithms lead to a significant improvement of classifiers, even when the initial seed is only composed of positive documents.

In Section 2, we present PNB, the naive Bayes algorithm from positive and unlabeled data. In Section 3, we define our co-training algorithm PNCT. Experimental results are given in Section 4.

2. Naive Bayes from Positive and Unlabeled Documents

2.1. PNB algorithm

The naive Bayes algorithm from positive and unlabeled examples (PNB) was introduced in (Denis et al., 2002). We briefly present its main ideas in this section. We only consider binary text classification problems: the set of classes is {0, 1}, where 1 corresponds to the positive class. We consider bag-of-words representations for documents. Let D be a set of documents and let w be a word. We denote by N(D) the total number of word occurrences in D and by N(w, D) the number of occurrences of w in all the documents of D.

PNB assumes an underlying generative model. In this model, first a class c is selected according to the class prior probabilities P(c). Second, a document length l is chosen according to the length prior probability P(l). Then, each word w in the document is generated by drawing from a multinomial distribution over words specific to the class, Pr(w|c).
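For illustration only, the following sketch samples documents from such a generative model; the toy vocabulary and the parameter values are invented for the example and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy parameters, not taken from the paper.
vocabulary = ["course", "homework", "lecture", "football", "news"]
class_prior = {1: 0.22, 0: 0.78}                  # P(c)
length_prior = [0.0, 0.1, 0.2, 0.3, 0.2, 0.2]      # P(l) for l = 0, ..., 5
word_probs = {                                     # Pr(w | c): one multinomial per class
    1: [0.35, 0.25, 0.25, 0.05, 0.10],
    0: [0.10, 0.05, 0.10, 0.40, 0.35],
}

def generate_document():
    """Draw a class, then a length, then that many words i.i.d. from Pr(w | c)."""
    c = int(rng.choice([1, 0], p=[class_prior[1], class_prior[0]]))
    l = rng.choice(len(length_prior), p=length_prior)
    words = rng.choice(vocabulary, size=l, p=word_probs[c])
    return c, " ".join(words)

print(generate_document())
```

PNB never fits this generator explicitly; the model only motivates the estimates presented next.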
The algorithm PNB takes as input an estimate ^P(1) of the positive class probability P(1), and a set PD of positive documents together with a set UD of unlabeled documents. The positive naive Bayes classifier PNB classifies a document d consisting of n words (w_1, ..., w_n), with possibly multiple occurrences of a word w, as a member of the class:

\mathrm{PNB}(d) = \operatorname*{argmax}_{c \in \{0,1\}} \hat{P}(c) \prod_{i=1}^{n} \widehat{\Pr}(w_i \mid c)    (1)

We must now explain how the class probability estimates ^P(c) and the word probability estimates ^Pr(w_i|c) are calculated.

Class probability estimates. An estimate ^P(1) of the positive class probability P(1) is given as input to the learner. The estimate ^P(0) of the negative class probability is set to 1 - ^P(1).

Positive word probability estimates. We are given as input a set PD of positive documents. Using Laplace smoothing, the positive word probability estimates are calculated with the following equation:

\widehat{\Pr}(w_i \mid 1) = \frac{1 + N(w_i, PD)}{\mathrm{Card}(V) + N(PD)}    (2)

where V is the vocabulary and Card(V) its cardinality.

Negative word probability estimates. When a set ND of negative documents is available, the negative word probability estimates are calculated using the following equation:

\widehat{\Pr}(w_i \mid 0) = \frac{1 + N(w_i, ND)}{\mathrm{Card}(V) + N(ND)}    (3)

But in our framework, negative word probabilities must be estimated without negative examples. Nonetheless, for word probabilities we have:

\Pr(w_i) = \Pr(w_i \mid 0) \Pr(0) + \Pr(w_i \mid 1) \Pr(1)    (4)

where Pr(w_i) is the probability that the generator creates w_i and Pr(1) is the probability that the generator creates a word in a positive document. Equation 4 can be rewritten as:

\Pr(w_i \mid 0) = \frac{\Pr(w_i) - \Pr(w_i \mid 1) \Pr(1)}{1 - \Pr(1)}    (5)

This equation is used to estimate negative word probabilities. Assuming that the set of unlabeled documents is generated according to the underlying generative model, the probability Pr(w_i) is estimated on the set of unlabeled documents by N(w_i, UD)/N(UD). The estimates for negative word probabilities can then be written:

\widehat{\Pr}(w_i \mid 0) = \frac{N(w_i, UD) - \widehat{\Pr}(w_i \mid 1) \times \widehat{\Pr}(1) \times N(UD)}{(1 - \widehat{\Pr}(1)) \times N(UD)}    (6)

In this equation, the positive word probabilities ^Pr(w_i|1) are calculated according to Equation 2 from the input set PD of positive documents, and ^Pr(1) is an estimate of the probability that the generator creates a word in a positive document. As it is assumed that document lengths are independent of the class, ^Pr(1) can either be set to ^P(1) or computed directly from the inputs of PNB (see (Denis et al., 2002) for the calculation and the smoothing of the negative word probability estimates).
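A minimal sketch of these estimates is given below. It is an illustrative reconstruction of Equations 1, 2 and 6, not the authors' code: the whitespace tokenizer, the choice ^Pr(1) = ^P(1), and the clipping of Equation 6 to a small positive floor (standing in for the smoothing described in Denis et al. (2002)) are assumptions made for the sketch.

```python
import math
from collections import Counter

def word_counts(docs):
    """N(w, D) for every word w, with documents given as whitespace-tokenized strings."""
    counts = Counter()
    for d in docs:
        counts.update(d.split())
    return counts

def pnb_estimates(PD, UD, p_pos, floor=1e-7):
    """Class priors and per-class word probabilities from positive and unlabeled documents.

    p_pos is the input estimate ^P(1); ^Pr(1) is set to ^P(1), and Equation 6 values are
    clipped to `floor` -- both are simplifying assumptions of this sketch.
    """
    pos_counts, unl_counts = word_counts(PD), word_counts(UD)
    vocab = set(pos_counts) | set(unl_counts)
    n_pd, n_ud = sum(pos_counts.values()), sum(unl_counts.values())

    # Equation 2: Laplace-smoothed positive word probabilities.
    pr_pos = {w: (1 + pos_counts[w]) / (len(vocab) + n_pd) for w in vocab}

    # Equation 6: negative word probabilities from unlabeled and positive statistics.
    pr_neg = {w: max(floor, (unl_counts[w] - pr_pos[w] * p_pos * n_ud)
                     / ((1 - p_pos) * n_ud)) for w in vocab}

    return {1: p_pos, 0: 1 - p_pos}, {1: pr_pos, 0: pr_neg}

def pnb_classify(doc, priors, word_probs):
    """Equation 1: argmax over c of ^P(c) * prod_i ^Pr(w_i | c), computed in log space."""
    scores = {}
    for c in (1, 0):
        score = math.log(priors[c])
        for w in doc.split():
            if w in word_probs[c]:        # out-of-vocabulary words are ignored
                score += math.log(word_probs[c][w])
        scores[c] = score
    return max(scores, key=scores.get)
```

Working in log space leaves the argmax of Equation 1 unchanged while avoiding underflow on long documents.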
3. Co-training from Positive and Unlabeled Examples

3.1. Co-training from positive and negative examples

The co-training setting was introduced in (Blum & Mitchell, 1998) in the general framework of learning from labeled and unlabeled data. It applies when a dataset has a natural division of its features. Blum and Mitchell show that, under the assumptions that each set of features is sufficient for classification and that the two feature sets of each instance are conditionally independent given the class, PAC-like guarantees on learning from labeled and unlabeled data hold. They also present a co-training algorithm (see Table 3) that incrementally builds naive Bayes classifiers over each of the two views. We denote by D a set of documents described by two views, and by D1 (resp. D2) the projection of D on the first (resp. second) view. When the documents are labeled, the projections are considered together with their labels.

The co-training algorithm first creates a pool of u unlabeled documents. It then iterates the following procedure. First, the algorithm trains two classifiers NB1 and NB2 based on each of the two views. Second, the classifiers are applied to the unlabeled examples. The examples on which the classifiers make the most confident predictions are removed from the set of unlabeled data and added, together with their labels, to the set of labeled data. At the end, a final hypothesis Combine(NB1, NB2) is created by a voting scheme that combines the predictions of the classifiers learned in each view.

Following the co-training scheme, we define in Section 3.3 a co-training algorithm from positive and unlabeled examples. A first idea is to replace NB by PNB; along the co-training rounds, only positive and unlabeled examples would then be used. But PNB outputs a classifier which can label examples as negative, and these self-labeled negative examples should be used along the co-training rounds. With this aim, we first define a variant of PNB which is able to use self-labeled negative examples.

3.2. PNNB algorithm

PNNB takes as input an estimate ^P(1) of the positive class probability and three training sets: a set PD of positive documents, a set ND of negative documents and a set UD of unlabeled documents. The situation differs from classical naive Bayes from labeled examples (whose input is a set D = PD ∪ ND of labeled examples) in two ways:

* the ratio Card(PD)/(Card(PD) + Card(ND)) is not an estimate of P(1);
* we are not confident in the labels of the negative documents.

As for PNB, the key point is estimating the negative word probabilities in Equation 1. The negative word probabilities can be estimated either from the set of negative examples or from the sets of positive and unlabeled examples. We mix these two estimates. Let us denote by ^Pr(w_i|0, ND) the estimate obtained from the set of negative examples using Equation 3, and by ^Pr(w_i|0, PD, UD) the estimate obtained from ^P(1) together with the sets PD and UD according to Section 2. We define the estimates for the negative word probabilities by combining these two estimates with the following equation:

\widehat{\Pr}(w_i \mid 0) = (1 - \alpha) \, \widehat{\Pr}(w_i \mid 0, PD, UD) + \alpha \, \widehat{\Pr}(w_i \mid 0, ND)    (7)

We set the parameter α to:

\alpha = \frac{1}{2} \times \frac{\mathrm{Card}(ND)}{\mathrm{Card}(PD)} \times \frac{\hat{P}(1)}{1 - \hat{P}(1)}    (8)

When there is no negative document, α is set to 0 and the negative word probabilities are estimated from ^P(1) and the two sets PD and UD according to Equations 7 and 6. When the sets PD and ND are such that the ratio of positive documents in the union PD ∪ ND is equal to the estimate ^P(1) of the positive class probability P(1), α has value 1/2, that is, we are equally confident in both estimates.

The naive Bayes algorithm PNNB thus takes as input an estimate ^P(1) of the positive class probability, a set PD of positive documents, a set UD of unlabeled documents and a set ND of negative documents. Class probabilities and positive word probabilities are calculated as for PNB, and negative word probabilities are estimated according to Equations 7 and 8.
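The mixed estimate of Equations 7 and 8 can be sketched as follows, reusing the hypothetical word_counts helper and the clipping convention from the PNB sketch above; it is an illustration, not the authors' implementation.

```python
def pnnb_negative_estimates(PD, ND, UD, p_pos, floor=1e-7):
    """Mixed negative word probabilities (Equations 7 and 8), as an illustrative sketch."""
    pos_counts, neg_counts, unl_counts = word_counts(PD), word_counts(ND), word_counts(UD)
    vocab = set(pos_counts) | set(neg_counts) | set(unl_counts)
    n_pd, n_nd, n_ud = (sum(c.values()) for c in (pos_counts, neg_counts, unl_counts))

    # Equation 8: alpha is 0 when ND is empty and 1/2 when the PD/ND ratio matches ^P(1).
    alpha = 0.0 if not ND else 0.5 * (len(ND) / len(PD)) * (p_pos / (1 - p_pos))

    pr_neg = {}
    for w in vocab:
        pr_w_pos = (1 + pos_counts[w]) / (len(vocab) + n_pd)                   # Equation 2
        from_pd_ud = max(floor, (unl_counts[w] - pr_w_pos * p_pos * n_ud)
                         / ((1 - p_pos) * n_ud))                               # Equation 6
        from_nd = (1 + neg_counts[w]) / (len(vocab) + n_nd)                    # Equation 3
        pr_neg[w] = (1 - alpha) * from_pd_ud + alpha * from_nd                 # Equation 7
    return pr_neg
```

When ND is empty the mixture reduces to the PNB estimate, matching the description above.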
3.3. Co-training from only Positive and Unlabeled Examples

We extend the co-training setting to the case where only positive documents and unlabeled documents are given to the learner. The co-training algorithm PNCT is given in Table 4. It incrementally builds classifiers over each of the two views with the PNNB algorithm. The co-training process repeats for k iterations. At each co-training step, it picks Card(PD)/^P(1) documents from the set of unlabeled documents to form the unlabeled dataset given as input to the PNNB classifiers; indeed, large unlabeled datasets can degrade the performance of PNB classifiers (Denis et al., 2002). The outcome of the co-training process is a final hypothesis whose prediction is obtained by multiplying the predictions of the classifiers learned in each view.
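A schematic version of this loop is sketched below; it reuses pnb_estimates and pnnb_negative_estimates from the earlier sketches, draws a learning pool of roughly Card(PD)/^P(1) documents per round, self-labels directly from the remaining unlabeled list, and treats the log-odds of the positive class as the confidence score. These simplifications, and the helper names, are assumptions of the sketch rather than the authors' implementation (see Table 4 for their pseudocode).

```python
import math
import random

def pnnb_train(PD, ND, UD, p_pos):
    """Fit a PNNB model: priors and Equation 2 positives as in PNB, Equations 7-8 negatives."""
    priors, probs = pnb_estimates(PD, UD, p_pos)
    probs[0] = pnnb_negative_estimates(PD, ND, UD, p_pos)
    return priors, probs

def pnnb_log_odds(doc, priors, word_probs):
    """Confidence score: log-odds of the positive class (higher means more positive)."""
    score = math.log(priors[1] / priors[0])
    for w in doc.split():
        if w in word_probs[1] and w in word_probs[0]:
            score += math.log(word_probs[1][w] / word_probs[0][w])
    return score

def pnct(PD, UD, p_pos, p=1, n=3, k=10):
    """Schematic PNCT loop; each document is a (view1_text, view2_text) pair."""
    PD, ND, UD = list(PD), [], list(UD)
    models = [None, None]
    for _ in range(k):
        pool = random.sample(UD, min(len(UD), int(len(PD) / p_pos)))  # about Card(PD)/^P(1) docs
        for v in (0, 1):
            models[v] = pnnb_train([d[v] for d in PD], [d[v] for d in ND],
                                   [d[v] for d in pool], p_pos)
            ranked = sorted(UD, key=lambda d: pnnb_log_odds(d[v], *models[v]))
            new_pos, new_neg = ranked[-p:], ranked[:n]   # most confidently positive / negative
            for d in new_pos + new_neg:
                UD.remove(d)
            PD += new_pos
            ND += new_neg
    return models  # the combined prediction multiplies (adds in log space) the two view scores
```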
4. Empirical Evaluation

Datasets. The WebKB Course dataset is a collection of 1051 web pages collected from computer science departments at four universities. The web pages are divided into several categories; we use the student, project, course and faculty categories. No stop-list is used, HTML tags are removed and no stemming is performed. The Reuters collection is the most commonly used collection for text classification. We use a formatted version of Reuters version 2 (also called Reuters-21450) prepared by Y. Yang and colleagues. Documents are labeled as belonging to at least one of the 135 possible categories. Here we consider two binary classification problems, defined by the categories acq and grain. The 20-newsgroups dataset contains 20 different UseNet discussion groups. We remove the UseNet headers; no stop-list is used and no stemming is performed.

4.1. Experiments with PNB

Preliminary experimental results were given in Denis et al. (2002). They show that PNB is robust with respect to the input value of ^P(1), and they compare learning accuracy when varying the number of unlabeled examples. Here, we apply PNB to real-world datasets. We compare PNB and NB, considering the experimental results for NB as lower and upper bounds for PNB. We also compare PNB and S-EM, defined in Liu et al. (2002).

Comparison between PNB and NB. Experiments were conducted to compare PNB and NB while varying the number of labeled documents. Results are given in Table 1. For a given row in Table 1, we repeat the following procedure 200 times: we select at random a set of p labeled documents for NB_p, a set of N labeled documents for NB_N, and a set of p positive and N unlabeled documents for PNB_{p,N}. The Reuters dataset comes with a test set (3662 items) and a training set (9610 items), and we keep this separation in our experiments. In the case of the WebKB dataset, we use as test set the data remaining after drawing the training sets (we therefore obtain 200 different test sets). The estimated error is averaged over the 200 runs. The standard deviation is estimated as the standard deviation of the accuracy estimates over the holdout runs. The F score is defined as F = 2pr/(p + r), where p is the precision and r the recall.

Naive Bayes from positive and unlabeled examples with p positive examples outperforms standard naive Bayes with p labeled examples. Obviously, if the unlabeled documents are given with their correct labels, standard naive Bayes outperforms naive Bayes from positive and unlabeled examples. It should also be noted that we obtain good results when the weight of the positive class is quite small (category grain).

Comparison between PNB and S-EM. In Table 2, we report the error rates and F-measures of PNB and S-EM. Both learning algorithms take two sets as input: a set P of positive documents and a set M built with negative documents and positive ones. As indicated by Liu et al. (2002), the objective is to recover the positive documents put in the mixed set M; thus M can be seen as a test set. The S-EM algorithm takes as input a set P of positive documents and a set M of unlabeled documents, and does not need any estimate of P(1). The PNB algorithm takes M as its set of unlabeled documents and P as its set of positive documents; we set ^P(1) = 0.5, considering that we have no knowledge about it. The results in Table 2 give the average error rates and F-measures over 100 draws of M and P. Our algorithm PNB outperforms S-EM in eight of the nine sets of experiments. The results of PNB also have lower variance.

Table 1. A comparison between NB and PNB on three real-world datasets. Error rates are given with standard deviations in parentheses; F is the F score.

Reuters, category acq; ^P(1) is set to 0.172
  p    N     NB_p: Error, F    PNB_{p,N}: Error, F    NB_N: Error, F
  40   232   12(3.3),  66      11(1.8),  67           7.2(1.4), 81
  120  698   8.9(2.3), 76      6.9(0.8), 82           4.8(0.5), 88
  200  1164  7.4(1.5), 80      5.5(0.5), 86           4.2(0.3), 90
  280  1630  6.6(1.1), 83      4.8(0.4), 88           4.0(0.2), 91
  360  2096  6.0(0.8), 85      4.5(0.3), 89           3.9(0.3), 91
  440  2562  5.7(0.6), 86      4.4(0.3), 90           3.9(0.2), 91
  520  3028  5.3(0.5), 87      4.2(0.3), 90           3.9(0.2), 91

Reuters, category grain; ^P(1) is set to 0.045
  p    N     NB_p: Error, F    PNB_{p,N}: Error, F    NB_N: Error, F
  40   888   5.2(1.2), 36      3.4(0.6), 61           3.0(0.4), 70
  60   1333  4.9(1.1), 40      3.3(0.4), 66           3.0(0.4), 71
  80   1777  4.6(0.7), 45      3.3(0.4), 68           3.1(0.4), 72
  100  2222  4.3(0.7), 48      3.4(0.4), 69           3.2(0.4), 72
  120  2666  4.2(0.6), 51      3.5(0.4), 70           3.3(0.3), 72
  140  3111  4.0(0.6), 52      3.6(0.4), 70           3.4(0.3), 71
  160  3555  3.9(0.6), 54      3.6(0.3), 70           3.5(0.3), 71

WebKB; ^P(1) is set to 0.22
  p    N     NB_p: Error, F    PNB_{p,N}: Error, F    NB_N: Error, F
  10   45    18(7.9),  51      13(4.8),  64           8.3(3.2), 80
  20   90    13(5.6),  64      10(3.6),  74           6.3(2.0), 86
  30   136   10(4.4),  74      8.9(3.0), 78           5.6(1.7), 87
  40   181   8.8(3.5), 79      7.8(2.6), 81           5.0(1.5), 89
  50   227   7.8(2.6), 81      7.4(2.2), 82           4.9(1.5), 89
  60   272   7.1(2.3), 84      6.9(1.8), 84           4.7(1.4), 89
  70   318   7.1(2.8), 83      6.6(1.7), 85           4.5(1.3), 90

Discussion. On real-world datasets, our algorithm PNB builds accurate classifiers. S-EM and PNB give similar results, but PNB needs a rough estimate of P(1). It is worth noting that Liu et al. (2002) use spy documents in M to optimize the performance of S-EM. The usefulness of the spy documents (ten percent of the positive documents, placed in M) is twofold: they are used to avoid a strong bias toward positive documents in the EM initialization, and they are used to estimate errors and thus to select a good classifier in the sequence produced by EM.

4.2. Co-training Experimental Results

We run the PNCT co-training algorithm on the WebKB Course dataset. The binary classification problem is to identify web pages that are course home pages. Each example consists of the words that occur on the web page (full-text view), as well as the words occurring in the anchor text of hyperlinks pointing to that page (hyperlink view). The class course is designated as the positive class in our setting, and 22% of the web pages are positive. Given a fixed seed size Card(PD), for each experiment we first pick at random a test set of 263 documents. From the 819 remaining documents, we draw a set of labeled documents containing Card(PD) positive documents; these positive documents define the seed PD. The remaining documents are left unlabeled and define the set UD. The estimate ^P(1) is set to 0.22. The parameter k is set to the maximal number of co-training steps allowed by the number of available unlabeled documents. The parameters p and n are set to 1 and 3, respectively.

In a first set of experiments, we study the evolution of the error rates along the co-training steps. The seed size Card(PD) is set to 20. Error rates of the output classifiers are averaged over 100 experiments. Figure 1 gives a plot of error versus number of iterations for the PNCT co-training algorithm.
Along the first co-training steps, the error rates first increase. This may be due to the fact that a sufficient number of self-labeled documents is needed for the statistics to become sufficiently accurate. But after some co-training steps, the error rates of the full-text classifier and of the combined classifier decrease continuously. Finally, the output full-text and combined classifiers outperform the classifiers built over the seed dataset. The hyperlink classifier is helped less by co-training, but hyperlink documents contain fewer words. For individual experiments with the seed size set to 20, we obtain similar plots. When the seed size is lower, for instance a seed of 10 positive documents, the error rates of the initial classifiers are quite poor for some rare experiments, and co-training does not improve their accuracy.

Table 2. Experimental results on the 20-Newsgroups and WebKB categories. The columns PNB and S-EM give the error rate and F-measure evaluated on M, averaged over 100 draws (standard deviation in parentheses).

  Positive   Negative   |P|   |M|    pos. in M   PNB: Error, F          S-EM: Error, F
  atheism    rel        200   1400   400         21.0(1.4), 58.2(3.5)   27.1(2.8), 61.2(7.4)
  graphic    mac        200   1400   400         10.7(3.0), 77.9(8.2)   13.6(4.4), 71.9(17.1)
  guns       pol        200   1400   400         14.2(1.4), 71.8(4.0)   16.2(3.3), 73.7(5.9)
  med        elec       200   1400   400         6.9(1.5),  86.5(3.4)   8.6(4.8),  81.7(15.8)
  oswin      winx       200   1400   400         22.4(3.9), 35.6(18.4)  20.9(5.6), 43.0(27.5)
  rel        pol        200   1400   400         13.4(1.0), 73.9(2.4)   17.0(2.4), 71.6(6.5)
  student    course     328   1586   656         4.1(0.8),  94.8(1.0)   5.4(1.2),  93.2(1.7)
  project    course     100   1132   202         3.6(0.7),  90.1(1.9)   6.0(2.8),  79.9(13.4)
  faculty    course     224   1378   450         4.9(2.2),  92.6(3.2)   7.6(4.3),  86.6(12.8)

We also reproduce the experiments of Blum and Mitchell (1998) with our implementation of their algorithm (here called CT) of co-training from positive and negative data. As in the PNCT case, along the first co-training steps the error rates first increase and then decrease continuously. We observe that the phenomenon is even accentuated in the CT case. Moreover, it seems to us that the CT algorithm is not robust to the choice of the initial seed. For instance, given 3 positive documents and 9 negative documents, CT ultimately outputs a classifier whose error rate is greater than 12% in 20 percent of our trials. With 10 positive documents as input, PNCT ultimately outputs a classifier whose error rate is greater than 12% in only 4 percent of our trials.

Figure 1. Error versus number of co-training steps for the co-training algorithm PNCT. The seed size is set to 20 and error rates are averaged over 100 experiments.

In a second set of experiments, we study the error rates of the two co-training algorithms for different seed sizes. For a given seed size, we run 100 experiments. Table 5 gives the error rates of the output classifiers. For the co-training algorithm CT defined in (Blum & Mitchell, 1998), we choose an initial seed whose cardinality is close to the number of positive documents in the seed of PNCT in the corresponding row (e.g. for the first row of the two tables, 3 positive plus 9 negative documents in the CT seed and 10 positive documents in the PNCT seed). These experimental results on the WebKB dataset are promising.

Figure 2. Error versus number of co-training steps for the co-training algorithm CT. The seed size is set to 11+33 documents and error rates are averaged over 100 experiments.
Given a seed of only 10 positive documents and 40 unlabeled documents, the classifiers ultimately produced by PNCT outperform naive Bayes classifiers trained over 90 labeled documents (see Table 1). We should also note that our algorithm seems more robust: for a seed of 20 positive documents, the PNCT classifiers always outperform the classifiers trained over the seed, while for a seed of 12 labeled documents, the CT classifiers may be quite poor for some draws of the seed.

Table 3. The co-training algorithm CT (Blum & Mitchell, 1998).

  Co-training algorithm CT
  parameters: u, p, n, k
  input: a set D of labeled documents; a set UD of unlabeled documents
  Create a pool UDp by choosing u documents at random from UD
  Loop for k iterations:
    for each i in {1, 2}:
      Use Di to train a naive Bayes classifier NBi
      Remove from UDp the p examples that NBi most confidently labels as positive and add them to D
      Remove from UDp the n examples that NBi most confidently labels as negative and add them to D
    Randomly choose 2n + 2p examples from UD to replenish UDp
  output: Combine(NB1, NB2)

Table 4. The co-training algorithm from positive and unlabeled documents, where self-labeled positive and negative documents are added along the co-training steps.

  Co-training algorithm PNCT
  parameters: p, n, k
  input: a set PD of positive documents; a set UD of unlabeled documents; an estimate ^P(1)
  Set UDp to UD; set ND to ∅
  Loop for k iterations:
    Create a pool UDlearn by choosing Card(PD)/^P(1) documents at random from UD
    for each i in {1, 2}:
      Train PNNBi with input PDi, UDlearn_i, NDi and ^P(1)
      Remove from UDp the p examples that PNNBi most confidently labels as positive and add them to PD
      Remove from UDp the n examples that PNNBi most confidently labels as negative and add them to ND
  output: Combine(PNNB1, PNNB2)

Table 5. Co-training with CT (upper part) and PNCT (lower part). The column Start gives the error rate and F-measure of PNB with Card(PD) positive examples, and the column Stop gives the error rate and F-measure of the combined classifiers after k co-training steps.

  CT
  seed (|POS| + |NEG|)   steps   Start: Error, F         Stop: Error, F
  3 + 9                  74      12.4(4.0), 66.8(14.7)   11.4(10.6), 73.6(25.7)
  4 + 12                 73      10.4(3.9), 72.9(13.3)   8.7(7.4),   80.3(17.1)
  6 + 18                 72      8.4(3.4),  78.1(11.3)   7.87(7),    82.6(16.7)
  8 + 24                 71      7.7(2.5),  80.7(8.4)    7.3(4.5),   83.7(11.4)
  11 + 33                69      6.6(2.3),  84.1(6.6)    6.0(1.7),   87.0(3.6)

  PNCT
  seed (|POS| + |UNL|)   steps   Start: Error, F         Stop: Error, F
  10 + 40                70      12.8(4.5), 58.4(20.1)   6.3(2.3),   84.9(6.6)
  20 + 80                65      9.6(3.6),  72.0(14.2)   5.1(1.4),   88.2(3.3)
  30 + 120               56      8.2(2.9),  77.8(10.1)   5.0(1.2),   88.5(3.0)
  40 + 160               46      7.1(2.5),  81.3(8.2)    5.1(1.4),   88.4(3.3)
  50 + 200               36      6.5(2.4),  82.9(7.2)    5.0(1.2),   88.4(3.1)

5. Conclusion

We study an adaptation of naive Bayes that allows classifiers to be built from positive and unlabeled data (PNB). The main idea is to approximate the word probabilities given the negative class using positive data, unlabeled data and an estimate of the weight of the positive class. In the presence of a small set of examples from the target class, we reuse the co-training scheme introduced by Blum and Mitchell (1998) with PNB as a base classifier. We apply these algorithms to a binary text classification problem. Experiments show that, starting from a small number of documents of the target class, an estimate of the probability of this class and unlabeled documents, our co-training methods build competitive classifiers. The output of the co-training algorithm also seems to be more robust to the choice of the initial seed. Nonetheless, there are still many open questions. How can the positive class probability be estimated from the data?
Does a hypothesis testing algorithm apply in our setting?

Acknowledgements

This research was partially supported by: "CPER 2000-2006, Contrat de Plan État-Région Nord/Pas-de-Calais: axe TACT, projet TIC"; fonds européens FEDER "TIC - Fouille Intelligente de données - Traitement Intelligent des Connaissances" OBJ 2 - phasing out - 2001/3 - 4.1 - n° 3. We would like to thank the anonymous reviewers for their comments and suggestions.

References

Amini, M.-R., & Gallinari, P. (2003). Semi-supervised learning with explicit misclassification modeling. Proceedings of the 18th International Joint Conference on Artificial Intelligence. To appear.

Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. Proceedings of the 11th Annual Conference on Computational Learning Theory (pp. 92–100). ACM Press, New York, NY.

Collins, M., & Singer, Y. (1999). Unsupervised models for named entity classification. Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (pp. 100–110).

Denis, F. (1998). PAC learning from positive statistical queries. Proceedings of the 9th International Conference on Algorithmic Learning Theory (ALT-98) (pp. 112–126). Berlin.

Denis, F., Gilleron, R., & Letouzey, F. (2003). Learning from positive and unlabeled examples. Theoretical Computer Science, to appear.

Denis, F., Gilleron, R., & Tommasi, M. (2002). Text classification from positive and unlabeled examples. IPMU'02, 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems.

Hofmann, T. (1999). Text categorization with labeled and unlabeled data: A generative model approach. Working Notes of the NIPS 99 Workshop on Using Unlabeled Data for Supervised Learning.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. Proceedings of ICML-99, 16th International Conference on Machine Learning (pp. 200–209).

Liu, B., Lee, W., Yu, P., & Li, X. (2002). Partially supervised classification of text documents. Proceedings of the 19th International Conference on Machine Learning (pp. 387–394).

Muslea, I., Minton, S., & Knoblock, C. (2002). Active + semi-supervised learning = robust multi-view learning. Proceedings of ICML-2002 (pp. 435–442).

Nigam, K., & Ghani, R. (2000). Analyzing the effectiveness and applicability of co-training. Proceedings of CIKM-00, Ninth International Conference on Information and Knowledge Management (pp. 86–93).

Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. M. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.

Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (pp. 189–196).