
Focused crawling in depression portal search: A feasibility study
Thanh Tin Tang (Department of Computer Science, ANU, Canberra, Australia), David Hawking (Canberra, Australia), Nick Craswell (Microsoft Research) and Ramesh S. Sankaranarayana (Department of Computer Science, ANU, Canberra, Australia)

Proceedings of the 9th Australasian Document Computing Symposium, Melbourne, Australia, December 13, 2004. Copyright for this article remains with the authors.

Previous work on search services in the area of depressive illness has documented the significant human cost required to set up and maintain closed-crawl parameters. It also showed that domain coverage is much less than that of whole-of-web search engines. Here we report on the feasibility of techniques for achieving greater coverage at lower cost. We found that acceptably effective crawl parameters could be automatically derived from a DMOZ depression category list, with dramatic saving in effort. We also found evidence that focused crawling could be effective in this domain: relevant documents from diverse sources are extensively interlinked; many outgoing links from a constrained crawl based on DMOZ lead to additional relevant content; and we were able to achieve reasonable precision (88%) and recall (68%) using a J48-derived predictive classifier operating only on URL words, anchor text and text content adjacent to referring links. Future directions include implementing and evaluating a focused crawler. Furthermore, the quality of information in returned pages (measured in accordance with evidence-based medicine) is vital when searchers are consumers. Accordingly, automatic estimation of web site quality and its possible incorporation in a focused crawler is the subject of a separate concurrent study.

Keywords: focused crawler, hypertext classification, mental health, depression, domain-specific search.

Depression is a major public health problem, being a leading cause of disease burden [13] and the leading risk factor for suicide. Recent research has demonstrated that high quality web-based depression information can improve public knowledge about depression and is associated with a reduction in depressive symptoms [6]. Thus, the Web is a potentially valuable resource for people with depression. However, a great deal of depression information on the Web is of poor quality when judged against the best available scientific evidence [8, 10]. It is thus important that consumers can locate depression information which is both relevant and of high quality.

Recently, in [15], we compared examples of two types of search tool which can be used for locating depression information: whole-of-Web search engines such as Google, and domain-specific (portal) search services which include only selected sites. We found that coverage of depression information was much greater in Google than in portals devoted to depression.

BluePages Search (BPS) is a depression-specific search service offered as part of the BluePages depression information site. Its index was built by manually identifying and crawling areas on 207 Web servers containing depression information. It took about two weeks of intensive human effort to identify these areas (seed URLs) and define their extent by means of include and exclude patterns. Similar effort would be required at regular intervals to maintain coverage and accuracy. Despite this human effort, only about 17% of relevant pages returned by Google were contained in the BPS index.

One might conclude from this that the best way to provide depression-portal search would be to add the word 'depression' to all queries and forward them to a general search engine such as Google. However, in other experiments in [15] relating to quality of information in search results, we showed that substantial amounts of the additional relevant information returned by Google were of low quality and not in accord with the best available scientific evidence. The operators of the BluePages portal (ANU's Centre for Mental Health Research) were keen to know if it would be feasible to provide a portal search service featuring:

1. increased coverage of high-quality depression information,

2. reduced coverage of dubious, misleading or unhelpful information, and
3. significantly reduced human cost to maintain the service.

We have attempted to answer the questions in two parts. Here we attempt to determine whether it is feasible to reduce human effort by using a directory of depression sites maintained by others as a seed list and using focused crawling techniques to avoid the need to define include and exclude rules. We also investigate whether the content of a constrained crawl links to significant amounts of additional depression content and whether it is possible to tell which links lead to depression content.

A separate project is under way to determine whether it is feasible to evaluate the quality of depression sites using automatic means. It will be reported elsewhere. If the outcomes of both projects are favourable, the end result may be a focused crawler capable of preferentially crawling relevant content from high quality sites.

Focused crawling - related work

Focused crawlers, first described by de Bra et al. [2], for crawling a topic-focused set of Web pages, have been frequently studied [3, 1, 5, 9, 12].

A focused crawler seeks, acquires, indexes, and maintains pages on a specific set of topics that represent a relatively small portion of the Web. Focused crawlers require much smaller investment in hardware and network resources but may still achieve high coverage.

A focused crawler starts with a seed list which contains URLs that are relevant to the topic of interest. It crawls these URLs and then follows the links from these pages to identify the most promising links based on both the content of the source pages and the link structure of the web [3]. Several studies have used simple string matching of these features to decide if the next link is worth following [1, 5, 9]. Others used reinforcement learning to build domain-specific search engines from similar features. For example, McCallum et al. [11] used Naive Bayes classifiers to classify hyperlinks based on both the full text of the sources and anchor text on the links pointing to the targets.

A focused crawler should be able to decide if a page is worth visiting before actually visiting it. This raises the general problem of hypertext classification. In traditional text classification, the classifier looks only at the text in each document when deciding what class should be assigned. Hypertext classification is different because it tries to classify documents without the need for the content of the document itself. Instead, it uses link information. Chakrabarti et al. [3] used the hypertext graph including in-neighbours (documents citing the target document) and out-neighbours (documents that the target document cites) as input to some classifiers.

Our work also used link information. We tried to predict the relevance of uncrawled URLs using three features: anchor text, text around the link and URL words.

This section describes the resources used in our experiments: the BluePages search service; the data from our previous domain-specific search experiments; the DMOZ depression directory listing and the WEKA machine learning toolkit.

BluePages Search (BPS) is a search service offered as part of the existing BluePages depression information site. Crawling, indexing and search were performed by CSIRO's Panoptic search engine.

The list of web sites that made up the BPS was manually identified from the Yahoo! Directory and from querying general search engines using the query term 'depression'. Each URL from this list was then examined to find out if it was relevant to depression before it was selected. The fencing of web site boundaries was a much bigger issue. A lot of human effort was needed to examine all the links in each web site to decide which links should be included and excluded. Areas of 207 web sites were selected. These areas sometimes included a whole web server, sometimes a subtree of a web server and sometimes only some individual pages. Newspaper articles (which tend to be archived after a short time), potentially distressing, offensive or destructive materials and dead links were excluded during the construction of the BPS index.

A simple example of seeds and boundaries is:

• seed = , and
• include patterns = www.counselingdepression.

In this case, every link within this web site is included. In complicated cases, however, some areas should be included while others are excluded. For instance, examining a more complicated site would result in the following seed and boundaries:

• seed =
• include patterns =
• exclude patterns =

The above boundaries mean that everything within the web site should be crawled except for pages about bipolar depression and book reviews.
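The boundary rule described above (crawl what matches an include pattern unless it also matches an exclude pattern) can be sketched as follows. This is only an illustration: the paper does not specify the pattern syntax used by the BPS crawl, so simple prefix matching on host plus path is assumed here, and the example.org URLs and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def normalise(url):
    """Reduce a URL or pattern to lower-cased host plus path, ignoring the scheme."""
    parts = urlsplit(url if "//" in url else "//" + url)
    return parts.netloc.lower() + parts.path


def in_scope(url, include_patterns, exclude_patterns=()):
    """Return True if url matches some include pattern and no exclude pattern.

    Assumption: patterns are plain prefixes, not regular expressions.
    """
    target = normalise(url)
    included = any(target.startswith(normalise(p)) for p in include_patterns)
    excluded = any(target.startswith(normalise(p)) for p in exclude_patterns)
    return included and not excluded


# Hypothetical site: crawl everything except the bipolar and book-review areas.
includes = ["www.example.org"]
excludes = ["www.example.org/bipolar/", "www.example.org/books/"]
print(in_scope("http://www.example.org/depression/faq.html", includes, excludes))
print(in_scope("http://www.example.org/bipolar/intro.html", includes, excludes))
```

With prefix patterns, a whole server, a subtree or a single page can all be expressed the same way, which mirrors the three kinds of area the BPS editors selected.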
Data from our previous work

In our previous work, we conducted a standard information retrieval experiment, running 'depression' queries against six engines of different types: two health portals, two depression-specific search engines, one general search engine and one general search engine where the word 'depression' was added to each query if not already present (GoogleD). We then pooled the results for each query and employed research assistants to judge them. We obtained 2778 judged URLs and 1575 relevant URLs from all the engines. We used these URLs as a base in the present work to estimate relevance.

We found that, over 101 queries, GoogleD returned more relevant results than the domain-specific engines: 621 relevant URLs were returned by BPS while 683 relevant results were retrieved by GoogleD. As GoogleD was the best performer in obtaining the most relevant results, we also used it as a base engine to compare with other collections in the present work.

DMOZ is the Open Directory Project which is "the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a vast, global community of volunteer editors". We started with the Depression directory, which lists sites relevant to depressive disorder.

Weka was developed at the University of Waikato in New Zealand [16]. It is a data mining package which contains machine learning algorithms. Weka provides tools for data pre-processing, classification, regression, clustering, association rules, and visualization. Weka was used in our experiments for the prediction of URL relevance using hypertext features. It was used because it provided many classifiers, was easy to use and served our purposes well in predicting URL relevance.

Experiment 1 - Usefulness of a DMOZ category as a seed list

A focused crawler needs a good seed list of relevant URLs as a starting point for the crawl. These URLs should span a variety of web site types so that the crawler can explore the Web in many different directions. Instead of using a manually created list, we attempted to derive a seed list from a publicly available directory - DMOZ. Because depression sites on the web are widely scattered, the diversity of content in DMOZ is expected to improve coverage. Using DMOZ allows us to leverage the categorisation work being done by volunteer editors.

DMOZ seed generation

We started from the 'depression' directory on the DMOZ site, whose path ends in Depression/. This directory is intended to contain links to relevant sites and subsites about depression. The directory, however, also had a small list of 12 within-site links to other directories, which may or may not be relevant to depression. We only needed to do some minor boundary selection for these links to include relevant directories. For example, directories such as Medications/Antidepressants/ were included because they are related to depression and are linked from the depression directory. These links were selected simply because their URLs contain the term 'depression' (such as childhood_depression) or 'antidepressants'. The seed URLs, as a result, included the above links and all the links to depression-related sites and subsites from this directory.

Include patterns corresponding to the seed URLs were generated automatically. In general, the include pattern was the same as the URL, except that default page suffixes such as index.htm were removed. Thus, if the URL referenced the default page of a server or web directory, the whole server or whole directory was included. If the link was to an individual page, only that page was included.

The manual effort required to identify the seed URLs and define their extent was approximately one

Comparison of the DMOZ collection and the BPS collection

This experiment aimed to find out if a constrained crawl from the low-cost DMOZ seed list can lead to domain coverage comparable to that of the manually configured BPS collection.

After identifying the DMOZ seed list and include patterns as described above, we used the Panoptic crawler to build our DMOZ collection. We then ran the 101 queries from our previous study and obtained 779 results for DMOZ.

We attempted to judge the relevance of these results using the 1575 known relevant URLs (see Section 3.2, Data from our previous work) and to compare the DMOZ results with those of the BPS collection.
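The automatic include-pattern rule described under DMOZ seed generation (keep the seed URL, but drop a default page suffix such as index.htm) might look like the sketch below. Only index.htm is named in the text; the other default page names, the function name and the example.org URLs are assumptions made for illustration.

```python
# Default page names; only "index.htm" is named in the text, the rest are assumed.
DEFAULT_PAGES = ("index.htm", "index.html", "default.htm", "default.html")


def include_pattern(seed_url):
    """Derive an include pattern from a seed URL.

    A URL pointing at a default page yields the enclosing server or directory,
    so the whole area is included; any other URL is kept unchanged, so only
    that individual page is included.
    """
    for name in DEFAULT_PAGES:
        if seed_url.endswith("/" + name):
            return seed_url[: -len(name)]
    return seed_url


print(include_pattern("http://www.example.org/depression/index.htm"))
print(include_pattern("http://www.example.org/articles/mood.html"))
```

Under this rule no exclude patterns are needed, which is where the saving in manual effort over the BPS configuration comes from.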
Table 1: Comparison of relevant URLs in DMOZ and BPS results of running 101 queries.

Table 1 shows that 186 out of 227 judged URLs (a pleasing 81%) from the DMOZ collection were relevant. However, the percentage of judged results (30%) was too low to allow us to validly conclude that DMOZ was a good collection.

Since we no longer had access to the services of the judges from the original study, we attempted to confirm that a reasonable proportion of the unjudged documents were relevant to the general topic of depression by sampling URLs and judging them ourselves.

We randomly selected 2 non-overlapping lists of 50 URLs among the unjudged results and made relevance judgments on these. In the first list, we obtained 35 relevant results and in the second list, 34 URLs were relevant. Because there was close agreement between the proportion relevant in each list, we were confident that we could extrapolate the results to give a reasonable estimate of the total number of relevant pages returned.

Extrapolation suggests 381 relevant URLs for the unjudged DMOZ set (69% of the 552 unjudged results). Hence, in total we might be able to obtain 567 (186 + 381) relevant URLs from the DMOZ set. This number was not as high as that of BPS, but it was relatively high (72% relevant URLs in the DMOZ set compared to 91% in BPS). Therefore, we could conclude that the DMOZ list is an acceptably good, low-maintenance starting point for a focused crawl.

Experiments 2A-2C - Additional link-accessible relevant information

Although some focused crawlers can look a few links ahead to predict relevant links at some distance from the currently crawled URLs [7], the immediate outgoing links are of most immediate interest.

We performed three experiments to gauge how much additional relevant information is accessible one link away from the existing crawled content. If little additional relevant content is linked to from pages in the original crawl, the prospects of successful focused crawling are very low. Figure 1 shows an illustration of the one-link-away set of URLs from the DMOZ crawl.

Figure 1: Illustration of one link away collection from the DMOZ crawl.

The first experiment (2A) involved testing if outgoing links from the BPS collection were relevant, while the second (2B) compared the outgoing link sets of BPS and DMOZ to see if DMOZ was really a good place to lead a focused crawler to additional relevant content. The last experiment (2C) attempted to find out if URLs relevant to a particular topic linked to each other.

Experiment 2A: Outgoing links from the BPS collection

The data used for this experiment included:

• the BPS index,
• the BPS outgoing link set containing all URLs linked to by BPS URLs, and
• 2 sets of judged-relevant URLs: BPS relevant and all relevant.

Our previous work concluded that BPS didn't retrieve as many relevant documents as GoogleD because of its small coverage of sites. We wanted to find out if focused crawling techniques have the potential to raise BPS performance by crawling one step away from BPS.

Among 954 relevant pages retrieved by all engines except for BPS, BPS failed to index 775 pages. The extended crawl yielded 196 of these 775 pages or 25.3%. In other words, an unrestricted crawler starting from the original BPS crawl would be able to reach an additional 25.3% of the known relevant pages, in only a single step from the existing pages. In fact, the true number of additional relevant pages is likely to be higher because of the large number of unjudged pages.

It is unclear whether the additional relevant content in the extended BPS crawl would enable more relevant documents to be retrieved than in the case of GoogleD. Retrieval performance depends upon the effectiveness of the ranking algorithm as well as on coverage.

Experiment 2B: Comparison of outgoing links between BPS and DMOZ

This experiment compared the outgoing link sets of BPS and DMOZ to find out if the DMOZ seed list could be used instead of the BPS seed list to guide a focused crawler to relevant areas of the web. The following data were used:

• 2 sets of outgoing links from the BPS and DMOZ collections, and
• 2 sets of all judged URLs and judged-relevant URLs.

Table 2: Comparison of relevant outgoing link URLs for BPS and DMOZ.

From our previous work, we obtained 2778 judged URLs which were used here as a base to compare relevance. Table 2 shows that even though the outgoing link collection of DMOZ was more than double the size of that of BPS, more outgoing BPS pages were judged. Among the judged pages, BPS and DMOZ had 196 and 158 relevant pages respectively in their outgoing link sets. Although DMOZ had fewer known relevant pages than BPS, the proportion of relevant pages among judged pages was quite similar for both engines (78% for DMOZ and 79% for BPS). This result, together with the size of each outgoing link collection, implied that (1) the DMOZ outgoing link set contained quite a large number of relevant URLs which could potentially be accessed by a focused crawler, and (2) the DMOZ seed list could lead to much better coverage than the BPS seed list.

Experiment 2C: Linking patterns between relevant pages

We performed a very similar experiment to the one described in Section 5.1 (Experiment 2A), with the purpose of finding out if relevant URLs on the same topic are linked to each other. Instead of using the whole BPS collection of 12,177 documents as the seed list, we only chose the 621 known relevant URLs. The following data were used:

• the BPS known relevant URLs,
• the BPS outgoing link set from the above, containing all URLs linked to by BPS known relevant URLs, and
• judged-relevant URLs from our previous work.

The outgoing link collection of the BPS known relevant URLs contained 5623 URLs. Of these, 158 were known relevant. This was a very high number compared to the 196 known relevant URLs obtained from the much bigger set of all outgoing link URLs (containing above 40,000 URLs) in the previous experiment. This experiment suggests that relevant pages tend to link to each other. This is good evidence supporting the feasibility of the focused crawling approach.

Experiment 3 - Hypertext classification

After downloading the content of the seed URLs and extracting links from them, a focused crawler needs to decide what links to follow and in what order based on the information it has available. We used hypertext classification for this purpose.

Collection of URLs for training and testing

For both BPS and DMOZ crawls, we collected all immediate outgoing URLs satisfying the following two conditions: (1) they were known relevant or known irrelevant URLs, and (2) the URLs pointing to each of these URLs were also relevant. We collected 295 relevant and 251 irrelevant URLs for our classification experiment.

Several papers in the field used the content of crawled URLs, anchor text, URL structure and other link graph information to predict the relevance of the next unvisited URLs [1, 5, 9]. Instead of looking at the content of the whole document pointing to the target URL, Chakrabarti [4] used 50 characters before and after a link and suggested that this method was more effective. Our work was somewhat related to all of the above. We used the following features to predict the relevance of the target URL:

• anchor text on the source pages: all the text appearing on the links to the target page from the source pages,
• text around the link: 50 characters before and 50 characters after the link to the target page from the source pages (we first extracted the 50-character string and then eliminated markup and stopwords, sometimes leaving only a few words), and
• URL words: words appearing in the URL of the target page.

We accumulated all words for each of these features to form 3 vocabularies from which all stop words were eliminated. URL words separated by a comma, a full stop, a special character or a slash were parsed and treated as individual words. URL extensions such as .html, .asp, .htm and .php were also eliminated. The end result showed 1,774 distinct words in the anchor text vocabulary, 874 distinct words in the URL vocabulary, and 1,103 distinct words in the content vocabulary.

For purposes of illustration, Table 3 shows the features extracted from each of six links to the same URL. Assume that we would like to predict www.ndmda.org for its relevance to depression and that we have six already-crawled pages pointing to it from our crawled collection. From each of the pages, features are extracted in the form of anchor text words and the words within a range of a maximum of 50 characters before and after the link pointing to www.ndmda.org. There is no content around the link from one of the source pages to the target URL because the text around that link contains only stopwords and/or numbers, which have been stripped off. The URL words for the target URL after being parsed are ndmda and org.
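The URL-word and link-context features described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' extraction code: the stop-word list is a small invented stand-in (including "www" and "http", which the paper does not list explicitly), the page text is assumed to have had its markup removed already, and the function names are hypothetical.

```python
import re

# Illustrative stop-word list (an assumption; the paper's list is not given).
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "we", "www", "http", "https"}
EXTENSIONS = (".html", ".htm", ".asp", ".php")


def words(text):
    """Lower-case, split on anything that is not a letter, drop stop words.

    Splitting on non-letters also discards numbers, as in Table 3.
    """
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t and t not in STOP_WORDS]


def url_words(url):
    """Words of a target URL, with a trailing file extension removed first."""
    url = url.lower()
    for ext in EXTENSIONS:
        if url.endswith(ext):
            url = url[: -len(ext)]
            break
    return words(url)


def link_context(page_text, link_start, link_end, width=50):
    """Words found within `width` characters before and after a link.

    `page_text` is assumed to be markup-free; `link_start` and `link_end`
    are the character offsets of the anchor text within it.
    """
    before = page_text[max(0, link_start - width):link_start]
    after = page_text[link_end:link_end + width]
    return words(before) + words(after)


print(url_words("www.ndmda.org"))
```

For the target in Table 3 this yields the two URL words ndmda and org; the anchor text itself would be passed through the same `words` helper to obtain the third feature.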
Table 3: Features for www.ndmda.org after removing stop words and numbers.

Target URL: www.ndmda.org
URL words: ndmda, org
Columns: source URL, anchor text, content around the link
Recoverable anchor and context words include: depression, bipolar, support, alliance, american, psychiatric, group, affliated, highly, recommend, national, depressive, manic, association.

Table 4: Algorithms used from Weka.

ZeroR: Zero rule. Predicts the majority class. Used as a baseline.
Naive Bayes: Statistical method. Assumes independence of attributes. Uses conditional probability and Bayes rule.
Complement Naive Bayes: Class for building and using a Complement class Naive Bayes classifier.
J48: C4.5 algorithm. A decision tree learner with pruning.
Bagging: Class for bagging a classifier to reduce variance.
AdaBoostM1: Class for boosting a nominal class classifier using the Adaboost M1 method.

We compared a range of classification algorithms provided by Weka [16]. (See Table 4.) When training and testing on the collection, we used stratified 10-fold cross-validation: in each fold, one tenth of the collection was held out for testing and the rest was used for training, and the operation was repeated 10 times. The results were then averaged and a confusion matrix was drawn up to find accuracy, precision and recall.

Input data

We treated the three vocabularies containing all features independently from each other. We calculated term frequency and inverse document frequency (tf.idf) for each feature attached to each of the URLs specified in Section 6.1 (Collection of URLs for training and testing) using the following formula [14]:

$\mathrm{tf.idf}(t, d) = \mathrm{tf}(t, d) \cdot \log\left(n / \mathrm{df}(t)\right)$

where t is a term, d is a document, tf(t, d) is the frequency of t in d, n is the total number of documents and df(t) is the number of documents containing t.

By this means we obtained a list of URLs, each associated with the tf.idf values for all terms in the 3 vocabularies. A learning algorithm was then run in Weka to learn and predict if these URLs were relevant or irrelevant. We also used boosting and bagging algorithms to boost the performance of different classifiers.

We used three measures to analyse how a classifier performed in categorizing all the URLs. We use true positive and true negative to denote the relevant and irrelevant URLs, respectively, that were correctly predicted by the classifier. Similarly, false positive and false negative denote irrelevant and relevant URLs, respectively, that were incorrectly predicted. The three measures are listed below.

• Accuracy: shows how accurately URLs are classified into correct categories.

$\mathrm{accuracy} = \dfrac{\text{true positive} + \text{true negative}}{\text{true positive} + \text{true negative} + \text{false positive} + \text{false negative}}$

• Precision: shows the proportion of correctly relevant URLs out of all the URLs that were predicted as relevant.
$\mathrm{precision} = \dfrac{\text{true positive}}{\text{true positive} + \text{false positive}}$

• Recall: shows the proportion of relevant URLs that were correctly predicted out of all the relevant URLs in the collection.

$\mathrm{recall} = \dfrac{\text{true positive}}{\text{true positive} + \text{false negative}}$

Although accuracy is an important measure, a focused crawler would be more interested in following the links from the predicted relevant set to crawl other potentially relevant pages. Thus, precision and recall are the better indicators here.

Results and discussion

The results of some representative classifiers are shown in Table 5. ZeroR represented a realistic performance "floor" as it classified all URLs into the largest category, i.e. relevant. As expected, it was the least accurate. Naive Bayes and J48 performed best. Naive Bayes was slightly better than J48 on recall but the latter was much better in obtaining higher accuracy and precision. Out of 228 URLs that J48 predicted as relevant, 201 were correct (88.15%). However, out of the 264 URLs predicted as relevant by Naive Bayes, only 206 (78.03%) were correct. Overall, the J48 algorithm was the best performer among all the classifiers used.

Table 5: Classification Results.

We found that bagging did not improve the classification result while boosting showed some improvement for recall (from 64.74% to 68.13%) when the J48 algorithm was used.

We also performed other experiments where only one set of features or any combination of two sets of features was used. In all cases, we observed that the accuracy, precision and recall were all worse than when all three sets of features were combined.

Our best results, as detailed in Table 5, showed that a focused crawler starting from a set of relevant URLs, and using J48 in predicting future URLs, could obtain a precision of 88% and a recall of 68% using the features mentioned in Section 6.2.

We wished to compare these performance levels with the state of the art, but were unable to find in the literature any applicable results relating to the topic of depression. We therefore decided to compare our predictive classifier with a more conventional content classifier for the same topic.

We built a 'content classifier' for 'depression', using only the content of the target documents instead of the features being used in our experiment. The best accuracies obtained from the two classification systems were very similar, 78% for the content classifier and 77.8% for the predictive version. Content classification showed slightly worse precision but better recall.

We concluded from this comparison that hypertext classification is quite effective in predicting the relevance of uncrawled URLs. This is quite pleasing as a lot of unnecessary crawling can be avoided.

Finally, we explored two variant methods for feature selection. We found that generating features using stemmed words caused a reduction in performance, as did reducing the feature set using a feature selection method.

Conclusions and future work

Weeks of human effort were required to set up the current BPS depression portal search service and considerable ongoing effort is needed to maintain its coverage and accuracy. Our investigations of the viability of a focused crawling alternative have resulted in three key findings.

First, web pages on the topic of depression are strongly interlinked despite the heterogeneity of the sources. This confirms previous findings in the literature for other topic domains and provides a good foundation for focused crawling in the depression domain. The one-link-away extensions to the closed BPS and DMOZ crawls contained many relevant pages.

Second, although somewhat inferior to the expensively constructed BPS alternative, the DMOZ depression category features a diversity of sources and seems to provide a seed list of adequate quality for a focused crawl in the depression domain. This is very good news for the maintainability of the portal search because of the very considerable labour savings. Other DMOZ categories may provide good starting points for other domain-specific search services.

Third, predictive classification of outgoing links into relevant and irrelevant categories using source-page features such as anchor text, content around the link and URL words of the target pages achieved very promising results. With the J48 decision-tree algorithm, as implemented by Weka, we obtained high accuracy, high precision and relatively high recall.

Given the promise of the approach, there is obvious follow-up work to be done on designing and building a domain-specific search portal using focused crawling techniques. In particular, it may be beneficial to rank the URLs classified as relevant in order of degree of relevance so that a focused crawler can decide on visiting priorities. Also, appropriate data structures are needed to hold accumulated information for unvisited URLs (i.e. anchor text and nearby content for each referring link). This information needs to be updated as additional links to the same target are encountered.

Another important question will be how to persuade Weka to output a classifier that can be easily plugged into the focused crawler's architecture. Since the best performing classifier in these trials was a decision tree, this may be easier than otherwise.

Once a focused crawler is constructed, it will be necessary to determine how to use it operationally. We envisage operating without any include or exclude rules but will need to decide on appropriate stopping conditions. If none of the outgoing links are classified as likely to lead to relevant content, should the crawl stop, or should some unpromising links be followed? And with what restrictions?
Because of the requirements of the depression portal operators, site quality must be taken into account in building the portal search service. Ideally, the focused crawler should take site quality into account when deciding whether to follow an outgoing link, but this may or may not be feasible. Another more expensive alternative would be to crawl using relevance as the sole criterion and to filter the results based on quality.

Site quality estimation is the subject of a separate study, yet to be completed. In the meantime, it seems fairly clear from our experiments that it will be possible to increase coverage of the depression domain for dramatically lower cost by starting from a DMOZ category list and using a focused crawler.

Verifying whether techniques found useful in this project also extend to other domains is an obvious future step. Other health-related areas are the most likely candidates because of the focus on quality of information in those areas.

Kathy Griffiths and Helen Christensen provided expert input about the depression domain and about BluePages, and John Lloyd and Eric McCreath gave advice on machine learning techniques.

[1] C. C. Aggarwal, F. Al-Garawi and P. S. Yu. On the design of a learning crawler for topical resource discovery. ACM Trans. Inf. Syst., Volume 19, Number 3, pages 286–309, 2001.

[2] P. De Bra, G. Houben, Y. Kornatzky and R. Post. Information retrieval in distributed hypertexts. In Proceedings of the 4th RIAO Conference, pages 481–491, New York, 1994.

[3] S. Chakrabarti, M. Berg and B. Dom. Focused crawling: A new approach to topic-specific web resource discovery. In Proceeding of the 8th International World Wide Web Conference (WWW8), 1999.

[4] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson and J. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the seventh international conference on World Wide Web 7, pages 65–74. Elsevier Science Publishers B. V., 1998.

[5] J. Cho, H. Garcia-Molina and L. Page. Efficient crawling through url ordering. In Proceeding of the Seventh World Wide Web Conference, 1998.

[6] H. Christensen, K. M. Griffiths and A. F. Jorm. Delivering Interventions for Depression by Using the Internet: Randomised Controlled Trial. British Medical Journal, Volume 328, Number 7434, pages 265–0, 2004.

[7] M. Diligenti, F. M. Coetzee, S. Lawrence, C. L. Giles and M. Gori. Focused crawling using context graphs. In Proceeding of the 26th VLDB Conference, Cairo, Egypt.

[8] Berland G, Elliott M, Morales L, Algazy J, Kravitz R, Broder M, Kanouse D, Munoz J, Puyol J, Lara M, Watkins K, Yang H and McGlynn E. Health Information on the Internet: Accessibility, Quality, and Readability in English and Spanish. The Journal of the American Medical Association, Volume 285, Number 20, pages 2612–2621, 2001.

[9] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim and S. Ur. The shark-search algorithm. An application: tailored web site mapping. In Proceeding of the Seventh World Wide Web Conference, 1998.

[10] Griffiths K and Christensen H. The quality and accessibility of australian depression sites on the world wide web. The Medical Journal of Australia, Volume 176, pages S97–S104, 2002.

[11] A. McCallum, K. Nigam, J. Rennie and K. Seymore. Building domain-specific search engines with machine learning technique. In Proceedings of AAAI Spring Symposium on Intelligents Engine in Cyberspace, 1999.

[12] F. Menczer, G. Pant and P. Srinivasan. Evaluating topic-driven web crawlers. In Proceeding of the 24th Annual Intl. ACM SIGIR Conf. On Research and Development in Information Retrieval, 2001.

[13] C. J. L. Murray and A. D. Lopez (editors). Global Burden of Disease and Injury Series. Harvard University Press, Cambridge MA, 1996.

[14] G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Technical report, 1987.

[15] T.T. Tang, N. Craswell, D. Hawking, K. M. Griffiths and H. Christensen. Quality and relevance of domain-specific search: A case study in mental health. To appear in the Journal of Information Retrieval - Special

[16] I. H. Witten and E. Frank. Data Mining: Practical machine learning tools with Java implementations. Morgan Kaufmann, San Francisco, 1999.



Published online August 25, 2004 Nucleic Acids Research, 2004, Vol. 32, No. 15 RNA expression microarrays (REMs), ahigh-throughput method to measure differencesin gene expression in diverse biological samples Charles E. Rogler*, Tatyana Tchaikovskaya, Raquel Norel, Aldo Massimi1,Christopher Plescia2, Eugeny Rubashevsky, Paul Siebert3 and Leslie E. Rogler Department of Medicine and Marion Bessin Liver Research Center, 1Department of Molecular Genetics, Albert EinsteinCollege of Medicine, Bronx, NY, USA, 2Department of Neurosciences, Mt Sinai College of Medicine, New York, NY,USA and 3BD Biosciences-Clontech, Palo Alto, CA, USA