Focused crawling in depression portal search: A feasibility study
Thanh Tin Tang
David Hawking
Department of Computer Science, ANU
Canberra, Australia
Canberra, Australia
Nick Craswell
Ramesh S. Sankaranarayana
Microsoft Research
Department of Computer Science, ANU
Canberra, Australia
quality when judged against the best available scientific
search services in the area of depressive illness
evidence [8, 10]. It is thus important that consumers can
has documented the significant human cost required to
locate depression information which is both relevant
setup and maintain closed-crawl parameters. It also
and of high quality.
showed that domain coverage is much less than that of
Recently, in [15], we compared examples of two
whole-of-web search engines. Here we report on the
types of search tool which can be used for locating
feasibility of techniques for achieving greater coverage
depression information: whole-of-Web search engines
at lower cost. We found that acceptably effective crawl
such as Google, and domain-specific (portal) search
parameters could be automatically derived from a
services which include only selected sites. We found
DMOZ depression category list, with dramatic saving
that coverage of depression information was much
in effort. We also found evidence that focused crawling
greater in Google than in portals devoted to depression
could be effective in this domain: relevant documents
from diverse sources are extensively interlinked; many
BluePages Search (BPS)1 is a depression-specific
outgoing links from a constrained crawl based on
search service offered as part of the BluePages depres-
DMOZ lead to additional relevant content; and we
sion information site. Its index was built by manu-
were able to achieve reasonable precision (88%) and
ally identifying and crawling areas on 207 Web servers
recall (68%) using a J48-derived predictive classifier
containing depression information. It took about two
operating only on URL words, anchor text and text
weeks of intensive human effort to identify these areas
content adjacent to referring links. Future directions
(seed URLs) and define their extent by means of include
include implementing and evaluating a focused
and exclude patterns. Similar effort would be required
crawler. Furthermore, the quality of information in
at regular intervals to maintain coverage and accuracy.
returned pages (measured in accordance with the
Despite this human effort, only about 17% of relevant
evidence based medicine) is vital when searchers are
pages returned by Google were contained in the BPS
consumers. Accordingly, automatic estimation of web
site quality and its possible incorporation in a focused
One might conclude from this that the best way to
crawler is the subject of a separate concurrent study.
provide depression-portal search would be to add the word 'depression' to all queries and forward them to
focused crawler, hypertext classification,
a general search engine such as Google. However, in
mental health, depression, domain-specific search.
other experiments in [15] relating to quality of information in search results, we showed that substantial
amounts of the additional relevant information returned
Depression is a major public health problem, being a
by Google was of low quality and not in accord with
leading cause of disease burden [13] and the leading
best available scientific evidence. The operators of the
risk factor for suicide. Recent research has demon-
BluePages portal (ANU's Centre for Mental Health Re-
strated that high quality web-based depression infor-
search) were keen to know if it would be feasible to
mation can improve public knowledge about depres-
provide a portal search service featuring:
sion and is associated with a reduction in depressivesymptoms [6]. Thus, the Web is a potentially valuable
1. increased coverage of high-quality depression in-
resource for people with depression. However, a great
deal of depression information on the Web is of poor
2. reduced coverage of dubious, misleading or un-
Proceedings of the 9th Australasian Document Computing
helpful information, and
Symposium, Melbourne, Australia, December 13, 2004.
Copyright for this article remains with the authors.
3. significantly reduced human cost to maintain the
Our work also used link information. We tried to
predict the relevance of uncrawled URLs using three features: anchor text, text around the link and URL
We have attempted to answer the questions in two
parts. Here we attempt to determine whether it is feasible to reduce human effort by using a directory of
depression sites maintained by others as a seed list and using focused crawling techniques to avoid the need
This section describes the resources used in our exper-
to define include and exclude rules. We also investi-
iments: the BluePages search service; the data from
gate whether the content of a constrained crawl links
our previous domain-specific search experiments; the
to significant amounts of additional depression content
DMOZ depression directory listing and the WEKA ma-
and whether it is possible to tell which links lead to
chine learning toolkit.
depression content.
A separate project is under way to determine
whether it is feasible to evaluate the quality of
BluePages Search (BPS) is a search service offered as
depression sites using automatic means.
part of the existing BluePages depression information
reported elsewhere. If the outcomes of both projects
site. Crawling, indexing and search were performed by
are favourable, the end-result may be a focused crawler
CSIRO's Panoptic search engine2.
capable of preferentially crawling relevant content
The list of web sites that made up the BPS was man-
high quality sites.
ually identified from the Yahoo! Directory and fromquerying general search engines using the query term
Focused crawling - related work
'depression'. Each URL from this list was then examined to find out if it was relevant to depression before it
Focused crawlers, first described by de Bra et al. [2], for
was selected. The fencing of web site boundaries was a
crawling a topic-focused set of Web pages, have been
much bigger issue. A lot of human effort was needed to
frequently studied [3, 1, 5, 9, 12].
examine all the links in each web site to decide which
A focused crawler seeks, acquires, indexes, and
links should be included and excluded. Areas of 207
maintains pages on a specific set of topics that represent
web sites were selected. These areas sometimes in-
a relatively small portion of the Web. Focused crawlers
cluded a whole web server, sometimes a subtree of a
require much smaller investment in hardware and
web server and sometimes only some individual pages.
network resources but may achieve high coverage at a
Newspaper articles (which tend to be archived after a
short time), potentially distressing, offensive or destruc-
A focused crawler starts with a seed list which con-
tive materials and dead links were excluded during the
tains URLs that are relevant to the topic of interest,
construction of the BPS index.
it crawls these URLs and then follows the links from
A simple example of seeds and boundaries is:
these pages to identify the most promising links based on both the content of the source pages and the link
• seed = www.counselingdepression.com/, and
structure of the web [3]. Several studies have used sim-
• include patterns = www.counselingdepression.
ple string matching of these features to decide if the
next link is worth following [1, 5, 9]. Others used reinforcement learning to build domain-specific search
In this case, every link within this web site is included.
engines from similar features. For example, McCallum
In complicated cases, however, some areas should be
et al. [11] used Naive Bayes classifiers to classify hy-
included while others are excluded. For instance, ex-
perlinks based on both the full text of the sources and
amining www.drada.org would result in the following
anchor text on the links pointing to the targets.
seed and boundaries:
A focused crawler should be able to decide if a page
is worth visiting before actually visiting it. This raises
• seed = www.drada.org/
the general problem of hypertext classification.
• include patterns =
In traditional text classification, the classifier looks
only at the text in each document when deciding what
• exclude patterns =
class should be assigned.
Hypertext classification is different because it tries
to classify documents without the need for the content
of the document itself. Instead, it uses link information.
Chakrabarti et al. [3] used the hypertext graph including in-neighbours (documents citing the target document)
The above boundaries mean that everything within the
and out-neighbours (documents that the target document
web site should be crawled except for pages about bipo-
cites) as input to some classifiers.
lar depression and book reviews.
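The seed-and-boundary scheme described above amounts to prefix matching on URLs. The following is a minimal sketch in Python, assuming that patterns are plain URL prefixes written without a scheme and that exclude patterns take precedence over include patterns. The function names and the `www.example.org` patterns are hypothetical; the include pattern for the simple case is assumed to be the site root, and the actual patterns used for www.drada.org are not reproduced here.

```python
def normalise(url):
    """Drop the scheme so URLs compare against scheme-less patterns."""
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            return url[len(scheme):]
    return url


def in_scope(url, include_patterns, exclude_patterns=()):
    """Return True if url matches an include prefix and no exclude prefix."""
    u = normalise(url)
    # Assumption: an exclude match always wins over an include match.
    if any(u.startswith(p) for p in exclude_patterns):
        return False
    return any(u.startswith(p) for p in include_patterns)


# Simple case from the text: every link within the site is included.
simple_include = ["www.counselingdepression.com/"]

# Hypothetical complicated case (illustrative patterns only).
site_include = ["www.example.org/"]
site_exclude = ["www.example.org/bipolar/", "www.example.org/books/"]
```

With these definitions, a page under `www.example.org/books/` is rejected while any other page on that server is accepted, which mirrors the "everything except" style of boundary described in the text.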
Data from our previous work
allows us to leverage off the categorisation work being done by volunteer editors.
In our previous work, we conducted a standard information
DMOZ seed generation
'depression' queries against six engines of different types:
two health portals, two depression-specific
We started from the 'depression' directory on the
search engines, one general search engine and one
general search engine where the word 'depression' was
added to each query if not already present (GoogleD).
Depression/. This directory is intended to contain
We then pooled the results for each query and employed
links to relevant sites and subsites about depression.
research assistants to judge them. We obtained 2778
The directory, however, also had a small list of 12
judged URLs and 1575 relevant URLs from all the
within-site links to other directories, which may or
engines. We used these URLs as a base in the present
may not be relevant to depression.
work to estimate relevance.
only needed to do some minor boundary selection
We found that, over 101 queries, GoogleD returned
for these links to include relevant directories.
more relevant results than those of the domain-specific
example, the following directories were included
621 relevant URLs were returned by BPS
because they are related to depression and they are
while 683 relevant results were retrieved by GoogleD.
links from the depression directory:
As GoogleD was the best performer in obtaining the
most relevant results, we also used it as a base engine
to compare with other collections in the present work.
Medications/Antidepressants/. These links were
DMOZ3 is the Open Directory Project which is "the
selected simply because their URLs contain the term
largest, most comprehensive human-edited directory of
'depression' (such as childhood_depression) or
the Web. It is constructed and maintained by a vast,
'antidepressants'. The seed URLs, as a result, included
global community of volunteer editors"4.
the above links and all the links to depression-related
We started with the Depression directory5 which
sites and subsites from this directory.
Include patterns corresponding to the seed URLs
relevant to depressive disorder.
were generated automatically. In general, the include
pattern was the same as the URL, except that default page suffixes such as index.htm were removed. Thus,
Weka6 was developed at the University of Waikato in
if the URL referenced the default page of a server or
New Zealand [16]. It is a data mining package which
web directory, the whole server or whole directory was
contains machine learning algorithms. Weka provides
included. If the link was to an individual page, only that
tools for data pre-processing, classification, regression,
page was included.
clustering, association rules, and visualization. Weka
The manual effort required to identify the seed
was used in our experiments for the prediction of URL
URLs and define their extent was approximately one
relevance using hypertext features. It was used because
it provided many classifiers, was easy to use and served our purposes well in predicting URL relevance.
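The automatic include-pattern rule described in the DMOZ seed generation section (keep the seed URL, but drop a default-page suffix so that the whole server or directory is covered) can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the text names only index.htm as a default page, so the second entry in `DEFAULT_PAGES` and the function name `include_pattern` are assumptions.

```python
# Assumed list of default page names; the text gives index.htm as an example.
DEFAULT_PAGES = ("index.htm", "index.html")


def include_pattern(seed_url):
    """Derive an include pattern (URL prefix) from a seed URL."""
    for name in DEFAULT_PAGES:
        if seed_url.endswith("/" + name):
            # Default page of a server or directory: include everything below it.
            return seed_url[: -len(name)]
    # Individual page (or already a directory URL): use the URL itself.
    return seed_url
```

A seed such as `www.example.org/depression/index.htm` yields the directory prefix `www.example.org/depression/`, whereas a link to an individual page is returned unchanged.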
Comparison of the DMOZ collection
and the BPS collection
Experiment 1 - Usefulness of a DMOZ
This experiment aimed to find out if a constrained crawl
category as a seed list
from the low-cost DMOZ seed list can lead to domain
A focused crawler needs a good seed list of relevant
coverage comparable to that of the manually configured
URLs as a starting point for the crawl. These URLs
should span a variety of web site types so that
After identifying the DMOZ seed list and include
the crawler can explore the Web in many different
patterns as described above, we used the Panoptic
directions. Instead of using a manually created list, we
crawler to build our DMOZ collection. We then ran the
attempted to derive a seed list from a publicly available
101 queries from our previous study and obtained 779
directory - DMOZ. Because depression sites on the
results for DMOZ.
web are widely scattered, the diversity of content in
We attempted to judge the relevance of these results
DMOZ is expected to improve coverage. Using DMOZ
using the 1575 known relevant URLs (see Section 3.2) and to compare the DMOZ results with those of the
BPS collection.
Table 1 shows that 186 out of 227 judged URLs (a
pleasing 81%) from the DMOZ collection were rele-
vant. However, the percentage of judged results (30%)
Table 1: Comparison of relevant URLs in DMOZ and BPS results of running 101 queries.
was too low to allow us to validly conclude that DMOZ was a good collection.
Since we no longer had access to the services of the
judges from the original study we attempted to confirm
One link away URLs
that a reasonable proportion of the unjudged documents were relevant to the general topic of depression by sampling URLs and judging them ourselves.
We randomly selected 2 lists of 50 non-overlapping
Figure 1: Illustration of one link away collection from
URLs among the unjudged results and made relevance
the DMOZ crawl.
judgments on these. In the first list, we obtained 35 relevant results and in the second list, 34 URLs were
• the BPS index,
relevant. Because there was close agreement between
• the BPS outgoing link set containing all URLs
the proportion relevant in each list we were confident
linked to by BPS URLs, and
that we could extrapolate the results to give a reasonable estimate of the total number of relevant pages returned.
• 2 sets of judged-relevant URLs: BPS relevant and
Extrapolation suggests 381 relevant URLs for the
all relevant.
unjudged DMOZ set.
Hence, in total we might be
Our previous work concluded that BPS didn't re-
able to obtain 567 (186 + 381) relevant URLs from
trieve as many relevant documents as GoogleD because
the DMOZ set. This number was not as high as that
of its small coverage of sites. We wanted to find out if
of BPS, but it was relatively high (72% relevant URLs
focused crawling techniques have the potential to raise
in DMOZ set compared to 91% of these in BPS).
BPS performance by crawling one step away from BPS.
Therefore, we could conclude that the DMOZ list is an
Among 954 relevant pages retrieved by all engines ex-
acceptably good, low-maintenance starting point for a
cept for BPS, BPS failed to index 775 pages. The ex-
focused crawl.
tended crawl yielded 196 of these 775 pages or 25.3%.
In other words, an unrestricted crawler starting from
Experiments 2A-2C - Additional link-
the original BPS crawl would be able to reach an addi-
accessible relevant information
tional 25.3% of the known relevant pages, in only a sin-
Although some focused crawlers can look a few links
gle step from the existing pages. In fact, the true num-
ahead to predict relevant links at some distance from the
ber of additional relevant pages is likely to be higher
currently crawled URLs [7], the immediate outgoing
because of the large number of unjudged pages.
links are of most immediate interest.
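As a back-of-envelope check, the extrapolation for the unjudged DMOZ results and the one-link-away yield of the extended BPS crawl can be reproduced from the figures reported in the text. The variable names below are illustrative, and pooling the two samples of 50 into a single 69% rate is an assumption about how the extrapolation was carried out.

```python
# Figures taken from the text.
total_results = 779        # DMOZ results over the 101 queries
judged = 227               # results that already had a relevance judgment
judged_relevant = 186

unjudged = total_results - judged                    # 552 unjudged results
sample_relevant = 35 + 34                            # relevant URLs in the two samples
sample_size = 50 + 50                                # two lists of 50 URLs

# Assumption: the pooled sample rate is applied to all unjudged results.
est_unjudged_relevant = round(unjudged * sample_relevant / sample_size)
est_total_relevant = judged_relevant + est_unjudged_relevant
est_fraction = est_total_relevant / total_results

# One-link-away yield of the extended BPS crawl.
missed_by_bps = 775
reached_one_link_away = 196
one_link_yield = reached_one_link_away / missed_by_bps

print(est_unjudged_relevant, est_total_relevant)
print(f"{est_fraction:.1%} {one_link_yield:.1%}")
```

Under these assumptions the estimate comes to 381 unjudged relevant URLs and 567 in total, matching the reported figures, and the one-link-away yield comes to about 25.3%.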
It is unclear whether the additional relevant content
We performed three experiments to gauge how
in the extended BPS crawl would enable more relevant
much additional relevant information is accessible one
documents to be retrieved than in the case of GoogleD.
link away from the existing crawled content.
Retrieval performance depends upon the effectiveness
additional relevant content is linked to from pages in
of the ranking algorithm as well as on coverage.
the original crawl, the prospects of successful focused
Experiment 2B: Comparison of out-
crawling are very low. Figure 1 shows an illustration of
going links between BPS and DMOZ
the one-link-away set of URLs from the DMOZ crawl.
The first experiment (2A) involved testing if outgo-
This experiment compared the out-going link sets of
ing links from the BPS collection were relevant while
BPS and DMOZ to find out if the DMOZ seed list could
the second (2B) compared the outgoing link sets of BPS
be used instead of the BPS seed list to guide a focused
and DMOZ to see if DMOZ was really a good place to
crawler to relevant areas of the web. The following data
lead a focused crawler to additional relevant content.
The last experiment (2C) attempted to find out if URLs
relevant to a particular topic linked to each other.
• 2 sets of out-going links from the BPS and DMOZ collections, and
Experiment 2A: Outgoing links from
• 2 sets of all judged URLs and judged-relevant
the BPS collection
The data used for this experiment included:
Collection of URLs for training and
Table 2: Comparison of relevant out-going link URLs
for BPS and DMOZ.
For both BPS and DMOZ crawls, we collected all
immediate outgoing URLs satisfying the following two conditions: (1) the URL was known to be relevant or known to be irrelevant,
and (2) the URLs pointing to it
were also relevant. We collected 295 relevant and 251 irrelevant URLs for our classification experiment.
From our previous work, we obtained 2778 judged
URLs which were used here as a base to compare rele-
vance. Table 2 shows that even though the outgoing link
Several papers in the field used the content of crawled
collection of DMOZ was more than double the size of
URLs, anchor text, URL structure and other link graph
that of BPS, more outgoing BPS pages were judged.
information to predict the relevance of the next unvis-
Among the judged pages, BPS and DMOZ had 196
ited URLs [1, 5, 9]. Instead of looking at the con-
and 158 relevant pages respectively in their outgoing
tent of the whole document pointing to the target URL,
link sets. Although DMOZ had fewer known relevant
Chakrabarti [4] used 50 characters before and after a
pages than BPS, the proportion of relevant pages versus
link and suggested that this method was more effective.
judged pages was quite similar for both engines (78%
Our work was somewhat related to all of the above. We
for DMOZ and 79% for BPS). This result together with
used the following features to predict the relevance of
the size of each outgoing link collection implied that (1)
the target URL.
The DMOZ outgoing link set contained quite a large number of relevant URLs which could potentially be
• anchor text on the source pages: all the text ap-
accessed by a focused crawler, and (2) The DMOZ seed
pearing on the links to the target page from the
list could lead to much better coverage than the BPS
• text around the link: 50 characters before and 50
characters after the link to the target page from the
Experiment 2C: Linking patterns be-
source pages7, and
tween relevant pages
• URL words: words appearing in the URL of the
We performed a very similar experiment to the experi-
target page.
ment described in Section 5.1, with the purpose of find-ing out if relevant URLs on the same topic are linked to
We accumulated all words for each of these features to
each other. Instead of using the whole BPS collection
form 3 vocabularies where all stop words were elimi-
of 12,177 documents as the seed list, we only chose the
nated. URL words separated by a comma, a full stop,
621 known relevant URLs. The following data were
a special character and a slash were parsed and treated
as individual words. URL extensions such as .html, .asp, .htm and .php were also eliminated. The end result
• the BPS known relevant URLs,
showed 1,774 distinct words in the anchor text vocab-
• the BPS outgoing link set from the above, con-
ulary, 874 distinct words in the URL vocabulary, and
taining all URLs linked to by BPS known relevant
1103 distinct words in the content vocabulary.
For purposes of illustration , Table 3 shows the fea-
• judged-relevant URLs from our previous work.
tures extracted from each of six links to the same URL.
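The URL-word and link-context features described above can be sketched as follows. This is a minimal illustration rather than the extraction code used in the experiments: the stop list is a tiny assumed stand-in (including `www` so that www.ndmda.org reduces to the two URL words shown in Table 3), the function names are hypothetical, and the context function assumes plain text from which markup has already been removed.

```python
import re

# Assumed for illustration: a tiny stop list and the extensions named in the text.
STOP_WORDS = {"www", "http", "https", "we", "the", "of", "and", "a", "to", "for"}
EXTENSIONS = (".html", ".htm", ".asp", ".php")


def tokens(text):
    """Lower-case alphabetic tokens with stop words and numbers removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def url_words(url):
    """Words in the target URL, split on punctuation, with the extension dropped."""
    lowered = url.lower()
    for ext in EXTENSIONS:
        if lowered.endswith(ext):
            lowered = lowered[: -len(ext)]
            break
    return tokens(lowered)


def link_context(page_text, link_start, link_end, width=50):
    """Words within `width` characters before and after the link span."""
    before = page_text[max(0, link_start - width):link_start]
    after = page_text[link_end:link_end + width]
    return tokens(before + " " + after)
```

Note that a fixed character window can cut a word in half at its edges; the sketch keeps such fragments, which is one reason only a few words may survive for a given link.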
Assume that we would like to predict www.ndmda.
The outgoing link collection of the BPS known rel-
org for its relevance to depression and that we have
evant URLs contained 5623 URLs. Of these, 158 were
six already-crawled pages pointing to it from our
known relevant. This was a very high number com-
crawled collection. From each of the pages, features
pared to the 196 known relevant URLs obtained from
are extracted in the form of anchor text words and the
the much bigger set of all outgoing link URLs (contain-
words within a range of a maximum of 50 characters
ing above 40,000 URLs) in the previous experiment. It
before and after the link pointing to www.ndmda.
is likely from this experiment that relevant pages tend
There is no content around the link from
to link to each other. This is good evidence supporting
the feasibility of the focused crawling approach.
the target URL because that URL contains only stopwords and/or numbers which have been stripped off.
Experiment 3 - Hypertext classification
The URL words for the target URL after being parsed
After downloading the content of the seed URLs and
extracting links from them, a focused crawler needs to
7We first extracted the 50-character string and then eliminated
decide what links to follow and in what order based
markup and stopwords, sometimes leaving only a few words.
on the information it has available. We used hypertext classification for this purpose.
Table 3: Features for www.ndmda.org after removing stop words and numbers.
Target URL: www.ndmda.org
URL words: ndmda, org
source URL
anchor text
content around the link
depression, bipolar, support, alliance, american, psychiatric
depression, bipolar, support,
support, group, affliated,
highly, recommend
national, depressive, manic,
depressive, association
national, depressive, manic,
pat resources.html
depressive, association
depression, bipolar, support, alliance, american, psychiatric
www.emufarm.org/ cmbell/
national, depressive, manic,
depressive, association
Table 4: Algorithms used from Weka.
ZeroR: Zero rule. Predicts the majority class. Used as a baseline.
Naive Bayes: Statistical method. Assumes independence of attributes. Uses conditional probability and Bayes rule.
Complement Naive Bayes: Class for building and using a Complement class Naive Bayes classifier.
J48: C4.5 algorithm. A decision tree learner with pruning.
Bagging: Class for bagging a classifier to reduce variance.
AdaBoostM1: Class for boosting a nominal class classifier using the Adaboost M1 method.
By this means we obtained a list of URLs, each
associated with the tf .idf s for all terms in the 3 vocab-
We compared a range of classification algorithms pro-
ularies. A learning algorithm was then run in Weka to
vided by Weka [16]. (See Table 4.)
learn and predict if these URLs were relevant or irrele-
When training and testing the collection, we used
vant. We also used boosting and bagging algorithms to
a stratified cross-validation method, i.e. using 10-fold
boost the performance of different classifiers.
cross validation, where one tenth of the collection was held out for testing and the rest was used for training, and
the operation was repeated 10 times. The results were then averaged and a confusion matrix was drawn to find
accuracy, precision and recall.
We used three measures to analyse how a classifier performed in categorizing all the URLs. We denoted true positive and true negative for the relevant and irrelevant
Input data
URLs that were correctly predicted by the classifier respectively. Similarly, false positive and false negative
We treated the three vocabularies containing all features
were used for irrelevant and relevant URLs that were
independently from each other.
incorrectly predicted respectively. The three measures
We calculated the term frequency and inverse document frequency (tf.idf) for
are listed below.
each feature attached to each of the URLs specified in Section 6.1 using the following formula [14].
• Accuracy: shows how accurately URLs are classi-
fied into correct categories.
$$\mathit{tf.idf} = \mathit{tf}(t, d) \cdot \log\bigl(n / \mathit{df}(t)\bigr)$$
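$$\text{accuracy} = \frac{\text{true positive} + \text{true negative}}{\text{true positive} + \text{true negative} + \text{false positive} + \text{false negative}}$$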
where t is a term, d is a document, tf (t , d ) is the fre-
• Precision: shows the proportion of correctly rele-
quency of t in d, n is total number of documents and
vant URLs out of all the URLs that were predicted
df (t ) is the number of documents containing t.
as relevant.
$$\text{precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}$$
• Recall: shows the proportion of relevant URLs
did reducing the feature set using a feature selection
that were correctly predicted out of all the relevant
URLs in the collection.
$$\text{recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}$$
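The tf.idf weighting defined in the Input data section can be illustrated with a short sketch. It assumes that each "document" is the list of words accumulated for one target URL within a single vocabulary, and it uses the natural logarithm, since the text does not state the base. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(n / df)."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in documents:
        tf = Counter(doc)            # raw term frequency within the document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights


# Toy anchor-text "documents", one per target URL (illustrative data only).
docs = [
    ["depression", "bipolar", "support"],
    ["depression", "depression", "association"],
    ["weather", "forecast"],
]
w = tf_idf(docs)
```

A term that occurs in every document receives a weight of zero under this formula, so very common words contribute nothing to the classifier input even if they survive stop-word removal.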
Conclusions and future work
Although accuracy is an important measure, a focused
Weeks of human effort were required to set up the cur-
crawler would be more interested in following the links
rent BPS depression portal search service and consider-
from the predicted relevant set to crawl other potentially
able ongoing effort is needed to maintain its coverage
relevant pages. Thus, precision and recall are better
and accuracy. Our investigations of the viability of a
focused crawling alternative have resulted in three key findings.
Results and discussion
First, web pages on the topic of depression are
The results of some representative classifiers are shown
strongly interlinked despite the heterogeneity of
in Table 5. ZeroR represented a realistic performance
the sources.
This confirms previous findings in the
"floor" as it classified all URLs into the largest cate-
literature for other topic domains and provides a good
gory i.e relevant. As expected, it was the least accurate.
foundation for focused crawling in the depression
Naive Bayes and J48 performed best. Naive Bayes was
domain. The one-link away extensions to the closed
slightly better than J48 on recall but the latter was much
BPS and DMOZ crawls contained many relevant pages.
better in obtaining higher accuracy and precision. Out
Second, although somewhat inferior to the expen-
of 228 URLs that J48 predicted as relevant, 201 were
sively constructed BPS alternative, the DMOZ depres-
correct (88.15%). However, out of the 264 URLs pre-
sion category features a diversity of sources and seems
dicted as relevant by Naive Bayes, only 206 (78.03%)
to provide a seed list of adequate quality for a focused
were correct. Overall, the J48 algorithm was the best
crawl in the depression domain. This is very good news
performer among all the classifiers used.
for the maintainability of the portal search because of
We found that bagging did not improve the classifi-
the very considerable labour savings. Other DMOZ
cation result while boosting showed some improvement
categories may provide good starting points for other
for recall (from 64.74% to 68.13%) when the J48 algo-
domain-specific search services.
rithm was used.
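The reported J48 figures can be cross-checked against the definitions of accuracy, precision and recall. The sketch below assumes that the 201-out-of-228 result refers to the same run as the collection of 295 relevant and 251 irrelevant URLs; the remaining confusion-matrix cells are derived from that assumption rather than taken from the text.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


relevant_total = 295        # relevant URLs in the classification set
irrelevant_total = 251      # irrelevant URLs in the classification set
predicted_relevant = 228    # URLs that J48 predicted as relevant
tp = 201                    # of which this many were correct

# Derived cells (assumption: all figures come from the same run).
fp = predicted_relevant - tp
fn = relevant_total - tp
tn = irrelevant_total - fp

acc, prec, rec = metrics(tp, fp, fn, tn)
print(f"accuracy={acc:.1%} precision={prec:.1%} recall={rec:.1%}")
```

Under this assumption the derived values are roughly 88% precision, 68% recall and 78% accuracy, in line with the figures reported for J48 with boosting and for the predictive classifier.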
Third, predictive classification of outgoing links
We also performed other experiments where only
into relevant and irrelevant categories using source-
one set of features or any combination of two sets of
page features such as anchor text, content around the
features were used. In all cases, we observed that the
link and URL words of the target pages, achieved
accuracy, precision and recall were all worse than when
very promising results.
With the J48 decision-tree
all three sets of features were combined.
algorithm, as implemented by Weka, we obtained high
Our best results, as detailed in Table 5, showed that
accuracy, high precision and relatively high recall.
a focused crawler starting from a set of relevant URLs,
Given the promise of the approach, there is obvious
and using J48 in predicting future URLs, could obtain a
follow-up work to be done on designing and building
precision of 88% and a recall of 68% using the features
a domain-specific search portal using focused crawling
mentioned in Section 6.2.
techniques. In particular, it may be beneficial to rank
We wished to compare these performance levels
the URLs classified as relevant in the order of degree
with the state of the art, but were unable to find in the
of relevance so that a focused crawler can decide on
literature any applicable results relating to the topic
visiting priorities. Also, appropriate data structures are
of depression. We therefore decided to compare our
needed to hold accumulated information for unvisited
predictive classifier with a more conventional content
URLs (i.e. anchor text and nearby content for each
classifier for the same topic.
referring link.) This information needs to be updated
We built a 'content classifier' for 'depression', using
as additional links to the same target are encountered.
only the content of the target documents instead of the
Another important question will be how to persuade
features being used in our experiment. The best ac-
Weka to output a classifier that can be easily plugged-
curacies obtained from the two classification systems
in into the focused crawler's architecture. Since the best
were very similar, 78% for the content classifier and
performing classifier in these trials was a decision tree,
77.8% for the predictive version. Content classification
this may be easier than otherwise.
showed slightly worse precision but better recall.
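The future-work discussion calls for a data structure that holds accumulated evidence for each unvisited URL and lets the crawler visit the most promising URL first. One possible shape for such a frontier is sketched below. It is an illustration only: the class name `Frontier`, the `toy_score` function (a stand-in for a real classifier score) and the lazy-deletion strategy are all assumptions, not part of the work reported here.

```python
import heapq


class Frontier:
    """Priority queue of unvisited URLs with accumulated link evidence."""

    def __init__(self, score_fn):
        self.score_fn = score_fn   # maps accumulated evidence to a relevance score
        self.evidence = {}         # url -> {"anchor": [...], "context": [...]}
        self.best = {}             # url -> most recent score
        self.heap = []             # (-score, url) entries; may contain stale ones

    def add_link(self, url, anchor_words, context_words):
        """Record one more referring link and re-score the target URL."""
        ev = self.evidence.setdefault(url, {"anchor": [], "context": []})
        ev["anchor"].extend(anchor_words)
        ev["context"].extend(context_words)
        score = self.score_fn(ev)
        self.best[url] = score
        heapq.heappush(self.heap, (-score, url))

    def pop(self):
        """Return (url, evidence) for the highest-scoring URL, or None if empty."""
        while self.heap:
            neg_score, url = heapq.heappop(self.heap)
            # Skip entries superseded by a later re-score or already popped.
            if url in self.best and self.best[url] == -neg_score:
                del self.best[url]
                return url, self.evidence.pop(url)
        return None


def toy_score(ev):
    """Hypothetical stand-in for a classifier: count depression-like words."""
    words = ev["anchor"] + ev["context"]
    return sum(w.startswith("depress") for w in words)
```

Re-scoring pushes a new heap entry instead of updating the old one, so stale entries are simply skipped on removal. A real crawler would also need a visited set so that a URL is not queued again after it has been fetched.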
Once a focused crawler is constructed, it will be
We concluded from this comparison that hypertext
necessary to determine how to use it operationally. We
classification is quite effective in predicting the rele-
envisage operating without any include or exclude rules
vance of uncrawled URLs. This is quite pleasing as
but will need to decide on appropriate stopping condi-
a lot of unnecessary crawling can be avoided.
tions. If none of the outgoing links are classified as
Finally we explored two variant methods for fea-
likely to lead to relevant content, should the crawl stop,
ture selection. We found that generating features using
or should some unpromising links be followed? And
stemmed words caused a reduction in performance, as
with what restrictions?
Table 5: Classification Results.
Accuracy (%)
Complement Naive Bayes
Because of the requirements of the depression por-
[5] J. Cho, H. Garcia-Molina and L. Page. Efficient crawl-
tal operators site quality must be taken into account in
ing through url ordering. In
Proceeding of the Seventh
building the portal search service. Ideally, the focused
World Wide Web Conference, 1998.
crawler should take site quality into account when de-
[6] H. Christensen, K. M. Griffiths and A. F. Jorm. Deliver-
ciding whether to follow an outgoing link, but this may
ing Interventions for Depression by Using the Internet:
or may not be feasible. Another more expensive alter-
Randomised Controlled Trial.
British Medical Journal,
native would be to crawl using relevance as the sole
Volume 328, Number 7434, pages 265–0, 2004.
criterion and to filter the results based on quality.
[7] M. Diligenti, F. M. Coetzee, S. Lawrence, C. L. Giles
Site quality estimation is the subject of a separate
and M. Gori. Focused crawling using context graphs. In
study, yet to be completed. In the meantime, it seems
Proceeding of the 26th VLDB Conference, Cairo, Egypt,
fairly clear from our experiments that it will be possible
to increase coverage of the depression domain for dra-
[8] Berland G, Elliott M, Morales L, Algazy J, Kravitz
matically lower cost by starting from a DMOZ category
R, Broder M, Kanouse D, Munoz J, Puyol J, Lara M,
list and using a focused crawler.
Watkins K, Yang H and McGlynn E. Health Information
Verifying whether techniques found useful in this
on the Internet: Accessibility, Quality, and Readability
project also extend to other domains is an obvious fu-
in English and Spanish.
The Journal of the American Medical Association, Volume 285, Number 20, pages
ture step. Other health-related areas are the most likely
2612–2621, 2001.
candidates because of the focus on quality of information in those areas.
[9] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pellegb,
M. Shtalhaima and S. Ura. The shark-search algorithm.
an application: tailored web site mapping. In
ing of the Seventh World Wide Web Conference, 1998.
Kathy Griffiths and Helen Christensen provided
[10] Griffiths K and Christensen H. The quality and acces-
expert input about the depression domain and about
sibility of australian depression sites on the world wide
BluePages, and John Lloyd and Eric McCreath gave
The Medical Journal of Australia, Volume 176,
advice on machine learning techniques.
pages S97–S104, 2002.
[11] A. McCallum, K. Nigam, J. Rennie and K. Seymore.
Building domain-specific search engines with machine learning techniques.
Proceedings of AAAI Spring
[1] C. C. Aggarwal, F. Al-Garawi and P. S. Yu.
Symposium on Intelligents Engine in Cyberspace, 1999.
the design of a learning crawler for topical resource
[12] F. Menczer, G. Pant and P. Srinivasan. Evaluating topic-
ACM Trans. Inf. Syst., Volume 19, Number 3,
driven web crawlers. In
Proceeding of the 24th Annual
pages 286–309, 2001.
Intl. ACM SIGIR Conf. On Research and Development
[2] P. De Bra, G. Houben, Y. Kornatzky and R. Post.
in Information Retrieval, 2001.
Information retrieval in distributed hypertexts. In
[13] C. J. L. Murray and A. D. Lopez (editors).
ceedings of the 4th RIAO Conference, pages 481–491,
Global Burden of Disease and Injury Series. Harvard
New York, 1994.
University Press, Cambridge MA, 1996.
[3] S. Chakrabarti, M. Berg and B. Dom. Focused crawling:
[14] G. Salton and C. Buckley. Term weighting approaches
A new approach to topic-specific web resource discov-
in automatic text retrieval. Technical report, 1987.
ery. In
Proceeding of the 8th International World Wide
[15] T.T. Tang, N. Craswell, D. Hawking, K. M. Griffiths
Web Conference (WWW8), 1999.
and H. Christensen. Quality and relevance of domain-
[4] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan,
specific search: A case study in mental health.
D. Gibson and J. Kleinberg. Automatic resource com-
appear in the Journal of Information Retrieval - Special
pilation by analyzing hyperlink structure and associated
text. In
Proceedings of the seventh international con-
[16] I. H. Witten and E. Frank.
Data Mining: Practical ma-
ference on World Wide Web 7, pages 65–74. Elsevier
chine learning tools with Java implementations. Morgan
Science Publishers B. V., 1998.
Kaufmann, San Francisco, 1999.
Source: http://david-hawking.net/pubs/tang_adcs04.pdf