Spam Detection Using Web Page Content:
A New Battleground
Marco Túlio Ribeiro, Pedro H. Calais Guerra, Leonardo Vilela, Adriano Veloso, Dorgival Guedes*, Wagner Meira Jr.
Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil

Marcelo H.P.C. Chaves, Klaus Steding-Jessen, Cristine Hoepers
Brazilian Network Information Center (, São Paulo, Brazil

* Dorgival Guedes is also with the International Computer Science Institute (ICSI), Berkeley.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CEAS 2011 - Eighth Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, September 1-2, 2011, Perth, Western Australia, Australia. Copyright 2011 ACM 978-1-4503-0788-8/11/09 ...$10.00.

ABSTRACT

Traditional content-based e-mail spam filtering takes into account the content of e-mail messages and applies machine learning techniques to infer patterns that discriminate spams from hams. In particular, the use of content-based spam filtering unleashed an unending arms race between spammers and filter developers, given the spammers' ability to continuously change spam message content in ways that might circumvent the current filters. In this paper, we propose to expand the horizons of content-based filters by taking into consideration the content of the Web pages linked by e-mail messages. We describe a methodology for extracting pages linked by URLs in spam messages and we characterize the relationship between those pages and the messages. We then use a machine learning technique (a lazy associative classifier) to extract classification rules from the web pages that are relevant to spam detection. We demonstrate that the use of information from linked pages can nicely complement current spam classification techniques, as portrayed by SpamAssassin. Our study shows that the pages linked by spams are a very promising battleground.

INTRODUCTION

Spam fighting is an "arms race" characterized by an increase in the sophistication adopted by both spam filters and spammers [12, 14]. The co-evolution of spammers and anti-spammers is a remarkable aspect of the anti-spam battle and has motivated a variety of works that devise adversarial strategies to treat spam as a moving target [6, 4].

On the spammers' side, the standard counter-attack strategy against content-based filters is to obfuscate message content in order to deceive filters. In the early years of the spam arms race, obfuscation techniques were as simple as misspelling Viagra as V1agra, but they have evolved to complex HTML-based obfuscations and the use of images to evade text-based filters. However, spammers face a trade-off: their final goal is to motivate recipients to click on their links, and too much obfuscation can lead to lower click rates and reduce spammers' gains [3]. No obfuscation at all, on the other hand, will cause the spam to be easily blocked, and few mailboxes will be reached. Therefore, in addition to keeping spam detection rates high, content-based filters had the positive effect (for the anti-spam side) of making each spam message less attractive and less monetizable, even though spammers have tackled that problem by sending larger volumes of spam.

In this paper, we argue that a fundamental component of spam content has been neglected by content-based spam filters: the content of the web pages linked by spam messages. We believe web pages can be a useful addition to current spam filtering frameworks for the following reasons:

1. Web page content is an almost unexplored battleground in the spam arms race, in part due to the belief that processing those pages would be too expensive in terms of computing resources. Therefore, current spammers may not have major concerns about their web pages being identified as spam, and so do not implement mechanisms to obfuscate web pages, which would represent an extra cost and might cause their pages to become harder to read. Increasing the cost of the spam activity is one efficient strategy to discourage spammers [16]. In addition, although spammers send billions of messages daily, the range of products advertised on web sites is not very diverse; a recent report concluded that 70% of spam advertises pharmaceutical products [9]. Recent work has also shown that only a few banks are responsible for processing the transactions of spam product purchases [18], which is an additional motivation for seeking evidence for spam detection that is closer to the spammer's business and cannot be changed easily.

2. Recently, Thomas et al. presented Monarch [27], a real-time system to detect spam content in web pages published in social network sites and in e-mail messages. Their results show that with current technology it is feasible to collect and process pages as they show up in web site posts and e-mail messages.

3. In the underground spam market, there is evidence that the spam sender is not always the same agent as the web page owner, making the adaptation of web page content a separate task, harder to coordinate with the spam messages [1]. This is the case when spammers work as business brokers that connect sellers (the web page owners) and buyers (e-mail users targeted with spams) [16].

4. It has been reported that 92% of spams in 2010 contained one or more URLs [17]; previous work reported the presence of URLs in up to 95% of the spam campaigns examined [15, 24]. Those numbers indicate that using web page content to improve spam detection is an applicable strategy, as spammers need to use URLs to earn money through the advertised web sites.

In this work, we show that web pages can provide precious information about the nature of the e-mail messages that link to them. Using information from classical spam/ham repositories, we built a new dataset that also provides information about the web pages mentioned in the messages of the set. Using that dataset, we show that by analyzing the web pages pointed to by e-mail messages we can complement SpamAssassin regular expressions and blacklist tests, providing an improvement of up to 10% in spam classification without increasing the false positive rate. Our contributions are:

• we make available this new dataset, which associates web page content and e-mail message content;

• we propose and evaluate a methodology for e-mail spam detection that considers the content of the web pages pointed to by messages;

• we show the effectiveness of this methodology for the e-mail spam detection task.

Our results show that considering web pages for spam detection purposes is a promising strategy that creates a new battleground for spam, one that has not yet been exploited by current spam filters.

The remainder of this work is organized as follows: Section 2 presents related work, while Section 3 goes into the details of our web page-based spam detection scheme. In Section 4 we describe the dataset we have used in this work and discuss the experimental evaluation results, which are followed by conclusions and a discussion of future work.

RELATED WORK

Very few works mention the use of web pages for e-mail spam detection purposes. To our knowledge, the first work to suggest the use of web pages for e-mail spam detection is a proposal of a framework which combines different spam filtering techniques, including a web page-based detection scheme, but the authors did not go into details about the strategy [23].

As previously mentioned, the Monarch system [27] is certainly the one most closely related to our proposal. Their work shows the feasibility of collecting and processing web pages in real time for spam analysis with a cost-effective infrastructure, while identifying a large number of attributes that may be used for that goal. Although their goal is mainly to classify the web content itself, the authors also show how information from a Twitter post containing a URL can be used to improve the classification of the web page pointed to by it. However, due to the nature of their data, they could not explore the relation between e-mail messages and the web pages they link to, since they did not have the original e-mail messages, but only the list of URLs in them. Our datasets make it possible for us to do exactly that.

Obviously, web pages are the basic unit of analysis for the web spam detection task, which aims to detect artificially-created pages injected into the web in order to influence the results of search engines [22]. In that direction, Webb [31] created a dataset containing web pages crawled from spams from the Spam Archive dataset [13], comprising messages from 2002 to 2006. However, his dataset did not relate web pages with spam messages.

State-of-the-art approaches for e-mail spam detection consider message content features and network features [5]. In terms of content-based filtering, a wide range of machine learning techniques such as Bayesian filters, SVM classifiers and decision trees have been applied to e-mail message content with reasonable success [5, 7, 2]. While web page content has not been explored as a spam detection strategy, URLs embedded in e-mail messages are used for phishing detection purposes by considering IP address, WHOIS and domain properties and the geolocalization of URLs [21, 11]. The web pages linked by e-mail messages have also been used by Spamscatter [1] as a means of identifying spam campaigns by the web pages they link to. However, their technique was applied only to spam messages already identified as such, and required creating and comparing snapshots of the rendered web pages.

PROPOSED METHODOLOGY

For each e-mail message processed from the message stream, we download the web pages linked by the URLs contained in the message. Then, content analysis techniques are applied to the web page content, basically the same approach adopted by filters to analyze the content of a spam message. We assign a spamicity score to the set of web pages linked by a given message and then combine this score with the result of classical spam filtering techniques; the message receives a final score that takes into account both the message content and the web page content. Figure 1 summarizes the workflow of our filtering technique, and we discuss the major steps next.

Figure 1: Steps of the Web page-based spam filtering approach. URLs are extracted from messages, the corresponding web pages are crawled, and a spamicity score is assigned to the set of web pages relative to an e-mail message. Then, the web page score is combined with conventional spamicity scores to generate a final spamicity assessment.

Web Page Crawling

We begin by extracting the URLs from the body of the messages(1) and use simple regular expressions to remove non-HTML URLs, i.e., URLs that link to images or executable files. After that, we download and store the web pages(2). In the case of spam messages containing multiple URLs, all the web pages are downloaded and stored. Many of the URLs considered lead to redirections before reaching their final page; in that case, we follow all redirects and store the content of the final URL.

(1) Using the Perl modules URI::Find and HTML::LinkExtor.
(2) Using the file transfer library libcurl [19].
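The URL extraction and filtering step can be sketched in Python (the paper performs it with the Perl modules URI::Find and HTML::LinkExtor plus libcurl; this sketch is purely illustrative, and the extension blacklist is an assumption):

```python
import re
from urllib.parse import urlparse

# Extensions treated as non-HTML content (images, executables).
# The exact list used by the authors is not given; this one is an assumption.
NON_HTML_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".exe", ".zip")

URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

def extract_candidate_urls(message_body):
    """Extract URLs from a message body, dropping links to non-HTML files."""
    candidates = []
    for url in URL_PATTERN.findall(message_body):
        path = urlparse(url).path.lower()
        if not path.endswith(NON_HTML_EXTENSIONS):
            candidates.append(url)
    return candidates
```

Downloading would then follow every redirect and keep only the final page; libcurl (and, for instance, Python's urllib) follows HTTP redirects by default.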
After extracting the messages' URLs and downloading the web pages linked by them, we use lynx [20], a text-based browser, in order to format the web page's text as users would perceive it (without images). Lynx generates a dump of the web page, already in text mode, removing the non-textual parts such as HTML tags and JavaScript code. Textual terms in that dump are then extracted, and the final result is a set of words that are used to determine the spamicity of the web page.

Web Page Spamicity Score Computation

Given that we have at our disposal the textual terms extracted from the set of web pages linked in each spam message, a range of classifiers could be built based on that information. We chose to use LAC, a demand-driven associative classifier [28, 30]. Associative classifiers are systems which integrate association mining with classification: they mine association rules that correlate features with the classes of interest (e.g., spam or ham) and build a classifier which uses the relevant association patterns discovered to predict the class of an object [26]. LAC is a demand-driven classifier because it projects/filters the training data according to the features in the test instance, and extracts rules from this projected data. This ensures that only rules that carry information about the test instance are extracted from the training data, drastically bounding the number of possible rules.

We have chosen to apply a demand-driven lazy classification algorithm for several reasons: (i) it has good performance for real-time use; (ii) it generates a readable and interpretable model in the form of association rules (which can be easily transformed into a set of regular expressions and incorporated into SpamAssassin); and (iii) it is well calibrated, which means that it provides accurate estimates of class membership probabilities. Therefore, it is possible to know which predictions are more or less accurate and use that information when scoring different pages. For more details on class membership likelihood estimates, please refer to [29].

The demand-driven associative classifier algorithm generates rules of the form χ → c, where χ is a set of words and c is a class (spam or ham). One typical rule extracted from e-mail messages would be {buy, viagra} → spam. Each of those rules has a support (how often the rule appears in the training data) and a confidence, which is given by the number of pages in the training data that are classified correctly by the rule divided by the number of pages that contain the set of terms χ. The final result of each page's classification is a score between 0 and 1 that indicates both the predicted class (spam or ham) and the certainty of the prediction. This score is reliable (as it has been noted that the algorithm is well calibrated), and will be a factor in the final page score.

One of the challenges of spam detection is the asymmetry between the cost of classifying a spam message incorrectly and the cost of classifying a ham message incorrectly. A false negative might cause slight irritation, as the user sees an undesirable message. A false positive, on the other hand, means that a legitimate message may never reach the user's inbox [10]. Therefore, instead of giving the same importance to false positives and false negatives, we build a cost-sensitive classifier [8]. In that scenario, the cost of a class measures the cost of incorrectly classifying a message of that class. As it weighs all the rules obtained for a certain message, the algorithm computes a weighted sum of the rules, taking into consideration the confidence of each rule and the cost of each class, in order to give higher importance to rules that point to the class with the higher cost. That implies that as the cost of the ham class grows, the algorithm needs more certainty in order to classify a page as spam.
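The rule-weighting step just described can be illustrated with a small sketch: rules vote with their confidence, scaled by a per-class cost, and the winning class's share of the vote is the certainty. This is a simplified illustration of a cost-weighted rule vote, not LAC's actual implementation, and all names below are ours:

```python
from collections import namedtuple

# An association rule: a set of terms implying a class, with a confidence.
Rule = namedtuple("Rule", ["terms", "label", "confidence"])

def classify_page(page_terms, rules, costs=None):
    """Return (predicted_label, certainty) from a cost-weighted rule vote."""
    costs = costs or {"spam": 1.0, "ham": 1.0}
    votes = {"spam": 0.0, "ham": 0.0}
    terms = set(page_terms)
    for rule in rules:
        if set(rule.terms) <= terms:            # rule applies to this page
            votes[rule.label] += rule.confidence * costs[rule.label]
    total = votes["spam"] + votes["ham"]
    if total == 0:
        return "ham", 0.0                       # no applicable rule: abstain
    label = max(votes, key=votes.get)
    return label, votes[label] / total
```

Raising the ham cost inflates ham votes, so more spam evidence is required before a page is labeled spam, matching the behavior described above.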
Formally, we state that a classifier is well calibrated when the estimated probability p̂(c|x) is close to p(c|x), the true, empirical probability of x being a member of c given that the probability estimated by the classifier is p̂(c|x).

Using the Web Page Score

There are several possible ways to use the resulting score of the page classification to classify an e-mail message as spam or ham. Our approach combines the page score Sp with other spamicity scores obtained from the message by applying a traditional classifier. The exact formula for Sp will depend on the characteristics of that classifier. In SpamAssassin, for example, a message is usually considered spam if it reaches 5 or more score points, assigned by a Bayes filter, regular expressions and blacklists [25].

One way of incorporating our technique into SpamAssassin is to simply assign Sp spamicity score points to a message based on the web content of its linked pages, considering whether our classifier says it is spam (Is = 1) or ham (Is = -1) and weighting the classifier certainty c in that prediction by a page weight Wp:

    Sp = Is · c · Wp    (1)

Note that if the classifier judges the web page to be ham, Sp will be negative and will contribute to reducing the overall spamicity of the message. In this way, web pages that are more "spammy" will result in higher scores for their messages; web pages that look more like "ham" will result in lower (negative) scores. This is the strategy we use in this paper. The weight Wp will influence how much impact the web page classification has on the final score; we evaluate that for SpamAssassin in Section 4.

Table 1: SpamAssassin rules extracted for the spam message from Figure 2. SpamAssassin regular expressions and queries to blacklists are not sufficient to classify the message as spam.
  HTML included in message
  BRBL_LASTEXT: listed in the DNS blacklist BRBL
  URIBL_BLACK: contains a URL listed in the URIBL blacklist

Another alternative would be to completely eliminate the use of blacklists and substitute our technique for them, when appropriate. This could be an interesting approach when network resources are scarce, since both blacklists and our technique demand that a request be made to another server.

Messages that do not have URLs cannot be filtered by our technique, for obvious reasons. Those messages are filtered by conventional spam filtering methods, and their spamicity assessments are not affected by our strategy.
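The scoring scheme just described (classifier verdict Is in {+1, -1}, certainty c, and page weight Wp) amounts to a one-line computation; the sketch below also adds the result to a conventional SpamAssassin score, with function names of our own choosing:

```python
def page_spamicity_score(is_spam, certainty, page_weight):
    """S_p: +certainty*weight when the page looks like spam, negative otherwise."""
    indicator = 1.0 if is_spam else -1.0
    return indicator * certainty * page_weight

def final_message_score(spamassassin_score, page_score):
    # The combined assessment is simply the sum of the two scores.
    return spamassassin_score + page_score
```

With the values from the paper's illustrative example (c = 0.9, Wp = 4.0, SpamAssassin score 3.4), this yields 3.6 + 3.4 = 7.0, above SpamAssassin's usual threshold of 5.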
Illustrative Example

In this section, we present a step-by-step example of the application of our Web page content-based technique. We picked a spam message identified in October of 2010 from the Spam Archive dataset.

Figure 2 shows the body of the message; it can be noted that the message is very concise, exhibiting very small textual content. SpamAssassin (considering queries to blacklists) yields the rules shown in Table 1.

Figure 2: Spam message extracted from Spam Archive. Small textual content poses challenges for the analysis of spamicity based on message content.

The resulting score considering just the spam content is only 0.001. Taking blacklist information into account, the resulting score is 3.4, still not enough to classify the message as spam. An excerpt from the web page pointed to by the URL is shown in Figure 3.

Figure 3: Web page linked by a URL present in the message from Figure 2. As current filters do not consider web page content, spammers do not need to obfuscate and sacrifice readability.

It can be noted that, in this case, the content of the message and the content of the page are totally different: one seems to be selling watches and bags, the other is selling medicine. The content of the page is then extracted into a set of words (with lynx) and delivered as input to the already-trained associative classifier (it was trained with other pages from Spam Archive and with ham pages from SpamAssassin's dataset). The associative classifier finds a list of rules, some of which are listed in Table 2.

Table 2: Rules extracted for the web page (Figure 3) linked by the message from Figure 2, unveiling high-spamicity terms.

After weighting all the rules, the associative classifier yields a result: the page is a spam page, with 90% certainty (i.e., c = 0.9). If we set Wp = 4.0, then Sp = 3.6. This web page, therefore, would have a 3.6 score. Adding this score to the score obtained by SpamAssassin (Table 1), the message would have a 7.0 score, more than enough to be classified as spam.
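The page-to-word-set step, which the methodology performs with a lynx dump, can be approximated with Python's standard html.parser module; a rough sketch, not the authors' pipeline:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style, roughly like a lynx dump."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def page_terms(html):
    """Return the set of lowercased words visible on the page."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    return {w.lower().strip(".,;:!?") for w in text.split() if w.strip(".,;:!?")}
```

The resulting word set is what the associative classifier consumes.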
EXPERIMENTAL EVALUATION

In order to evaluate the applicability of building anti-spam filters using the content of web pages, we built a dataset with the necessary information. In the period between July and December of 2010, we collected spam messages from the Spam Archive dataset [13]. Spam Archive is updated daily, and each day we collected the web pages linked by the new messages included that day using wget, to guarantee we crawled the URLs as soon as they appeared. That was essential, since it is well known that spam web page lifetimes are usually short and few spam-related web pages are online two weeks after activation [1]. We only included in our dataset the messages for which we could download at least one web page(3).

For each of the 157,114 web pages obtained, we stored two files: one containing the HTML content of the web page and the other containing the HTTP session information. We also associated each downloaded web page with the corresponding message. We decided to evaluate only the unique pages, so that spam campaigns that pointed to the same pages would not introduce bias into our results. Whenever many different spam messages pointed to the same set of web pages, one of them was randomly chosen for the evaluation, so that only one instance of each page remained. Note that the fact that spammers advertise the same web page multiple times inside a spam campaign (with different message content, though) would actually benefit our approach, but we decided to neutralize this effect to generate results closer to a lower bound.

We used the same methodology to process ham messages from the SpamAssassin Ham Dataset, resulting in 11,134 unique ham pages, linked by 4,927 ham messages. The characteristics of the resulting dataset are summarized in Table 3. It is interesting to notice that the average number of pages per message was not very different between hams and spams. Figure 4 shows the distribution of the number of pages downloaded for each message in the spam dataset.

Table 3: Dataset Description
  Pages downloaded per spam msg. (avg.)
  Unique spam web pages
  Chosen spam messages
  Unique spam web pages per msg. (avg.)
  Unique ham messages
  Unique ham web pages
  Unique ham web pages per msg. (avg.)

Figure 4: Distribution of the number of pages downloaded per message.

We evaluated our technique using all the resulting unique pages (ham and spam) and the sampled e-mail messages that pointed to them. In all cases, to generate results, we used a standard 5-fold cross-validation strategy. Our technique was used with the SpamAssassin filter, with rules and blacklists activated. We combined SpamAssassin's score with the associative classifier's score, computed as previously described (Eq. 1), by adding the two values.

In the next sections we show the relationship between the associative classifier's certainty and the score given by SpamAssassin, and the impact of varying the parameters Wp (the web page weight, i.e., the importance of web page content for the classification of the e-mail message) and cost (the importance of an error in classifying spams and hams).

(3) To obtain a copy of the dataset, please contact the authors.

Certainty of the Classifier's Prediction vs. SpamAssassin Score

The relationship between the message score given by SpamAssassin and the certainty that the page is spam given by the associative classifier is shown in Figure 5. In this experiment, the cost of each class was the same and Wp was set to 4.

In Figure 5, lines show the threshold values that would separate the hams from the spams, as given by SpamAssassin scores (vertical line) and certainty in our technique (horizontal line), representing all four possible situations (spam or ham according to SA, crossed with spam or ham according to the SA + Web Page approach). Figure 5a represents the spams from Spam Archive, and Figure 5b represents the hams from the SpamAssassin Ham Archive. It is worth noticing that there is a large quantity of spams in the bottom right corner of Figure 5a: spams that are weakly identified by SpamAssassin rules but are linked to pages that are identified as spams by our technique. Also, it can be noted that most hams in the bottom right corner of Figure 5b have a low score given by SpamAssassin and a low certainty given by our technique, meaning that even if they were misclassified by our technique, they would probably not be marked as spam when both scores were combined.
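The deduplication rule described above (keep one randomly chosen message per unique set of linked pages) can be sketched as follows; the message representation and field names are hypothetical:

```python
import random

def dedupe_by_page_set(messages, seed=42):
    """Keep one random message per distinct set of linked web pages."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    groups = {}
    for message in messages:
        key = frozenset(message["page_urls"])
        groups.setdefault(key, []).append(message)
    return [rng.choice(group) for group in groups.values()]
```

Grouping by frozenset makes the key order-insensitive, so two messages linking the same pages in a different order still collapse into one group.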
Figure 5: Comparison between SpamAssassin score and web page classification certainty. In the bottom right quadrant of (a) one can see spam messages that are not caught by the filter but that point to pages caught by the web page classifier. In the bottom right quadrant of (b), on the other hand, one can see that the rate of false positives would not increase considerably using our technique, since most messages in that area have a low score given by SpamAssassin and/or a low certainty given by the web page content-based classifier.

Effect of Page Weight Variation

The relationship between the rates of false positives and false negatives and different weight values is shown in Figure 6. We also show in the same figure the rates of false positives and false negatives obtained using SpamAssassin without our technique, for comparison purposes.

It can be seen that up to a value of 4, the rate of false positives of our technique is equal to or lower than the rate of false positives of SpamAssassin, even though the rate of false negatives of our technique is remarkably lower. This is the reason we chose the value 4 for the weight in the other experiments: it is the value that yields the lowest rate of false negatives while maintaining an acceptable rate of false positives.

Figure 6: Rates of false positives and false negatives for SpamAssassin with and without web page classification rules, for different values of Wp (page weight). Considering web page content for spam detection reduces the rate of false negatives without incurring a higher rate of false positives up to a weight of 4.

Effect of Cost Variation

The relationship between the rates of false positives and false negatives and different cost values is shown in Figure 7. The values on the x axis represent the ratio between the cost of classifying a ham as spam and the cost of classifying a spam as ham. Therefore, if the value on the x axis is 1.5, it is 50% more costly to classify a ham as spam (i.e., a false positive) than to classify a spam as ham (i.e., a false negative). We have not considered ratios lower than 1.0, since that would mean considering that false negatives had a higher impact than false positives. It follows that an increase in the relative cost of ham misclassification yields a smaller rate of false positives and a potentially higher rate of false negatives. With the cost of misclassifying ham set 70% higher than that of misclassifying spam, our technique classifies spam with the same accuracy as SpamAssassin, even though we lower the number of false positives to zero. It is worth noticing that this compromise is adjustable in our technique, through the ratio of the cost parameters. It is up to the user to set that ratio according to his situation.

Figure 7: Rates of false positives and false negatives for SpamAssassin with and without web page classification rules, varying the ratio between the cost of misclassifying a ham message and the cost of misclassifying a spam message. The rates of false positives and false negatives can be adjusted according to the user's needs concerning the cost of each class.

Robustness of Web Page Features

In this section, we assess the difficulty of detecting spams through web pages in comparison with the standard detection approach based on message content. Using the demand-driven associative classifier presented earlier, we compute the top 10 most frequent rules the algorithm detected for web page content (relative to the spam class) and compare their frequency (support) with their frequency in message content. Results are shown in Table 4. Note that, in web pages, the most frequent terms are present in a very high fraction of spams; over one third of the web pages contain "popular" spam terms such as viagra, cialis, levitra and active. On the other hand, the prevalence of those terms in message content is significantly lower: the most popular term in web pages, viagra, is observed in no more than 14% of messages. For the remaining popular words in web pages, frequencies in messages are lower than or equal to 5%, a consequence of spammers' obfuscations in message content and of the fact that spam web pages exhibit the full content of the products being advertised, while messages tend to present a summary of the advertised content.

Table 4: Frequency of the top 10 terms in web pages in comparison with message content. Frequent terms in web pages are rare in messages, due to spammers' obfuscation efforts.

Those numbers indicate that, as spammers currently have no concern about obfuscating web page content, a small number of simple rules has a strong impact on detecting spams, and will force spammers to react accordingly, increasing the cost of sending spam.

Operational Issues

Three final aspects deserve mention: performance, the effectiveness of the technique for time-changing URLs, and the adaptation of spammers to our technique. We discuss these issues in this section.

Performance: adding web page analysis to the spam detection pipeline immediately brings to mind the matter of performance. Will such a system be able to keep up with the flow of incoming messages in a busy e-mail server as it crawls the web and processes pages, in addition to the existing spam filter load? Certainly this solution will require extra processing power, since it adds a new analysis to the anti-spam bag of tricks. There are two major costs to be considered in this scenario: crawling and filtering. Although we have not measured the performance of our crawler, results from the Monarch system show that crawling pages even for a heavy mail server can be done with a median time per page of 5.54 seconds and a throughput of 638,000 URLs per day on a four-core 2.8 GHz Xeon processor with 8 GB of memory [27]. That cost includes the DNS queries, HTTP transfers and page processing. That seems an acceptable delay for an incoming message, in our view.

Compared to Monarch, our feature extraction process is simpler, both because we use fewer attributes and because we focus only on the text of HTML documents. An open question in this comparison might be the cost of the message filtering itself, since we use a very different approach to classification. We believe that our approach, by using fewer attributes, reduces processing time, but we have no means for a direct comparison. However, in our experiments with a single dual-core Xeon processor with 4 GB of memory, our classifier showed a throughput of 111 pages per second on average. Since message classification may be done independently for each message, the task can be easily distributed, becoming highly scalable.

Time-changing URLs: one problem of the solution we propose, first mentioned by the authors of Monarch, is that spammers might change the content of a web page over time, using dynamic pages. When a message is delivered, the server hosting the spam web pages is configured to return a simple ham page in response to a request. Knowing that a filter using a methodology like ours would try to verify the pages right after delivery, the server would only start returning the intended spam content some time after the SMTP delivery is completed. That, again, would require more processing by the spammer, and coordination between the spam distributing server and the spam web hosting server, which are not always under the same control [1]. If such a practice becomes usual, a possible solution might be to add such tests to the user interface: as soon as a user opened a message with links, they would be collected and analyzed in the background, and a message might be shown to the user.

Spammer adaptation: it is our observation that web pages linked in e-mail messages are currently not explored by spam filters. We also observed that the "spammy" content in such web pages is clear and unobfuscated. However, it is known that spammers adapt to new techniques used by filters [14]. It could be argued that the same obfuscations used in e-mail messages could also be used in these web pages. However, that is not always possible. As with time-changing URLs, this would require coordination between the spam distributing server and the spam web hosting server, which are not always under the same control [1]. Not only that, but the spammer would have to sacrifice legibility and the appearance of credibility in order to obfuscate the web page, as is the case with e-mail spam, which results in less profitability from each message. The current state of e-mail spam demonstrates this clearly, as many spam messages being sent are hard to understand and do not appear credible.

Setting limits on the number of URLs extracted from any message, or on the concurrent queries to a single server, may prevent overload attacks over any specific site.

Finally, one countermeasure that spammers could take would be to prevent a filter from identifying the URLs in a message. Although possible, that would imply that the spammer would have to rely on the user to copy, and edit, text from a message in order to turn it into a valid URL. That would again add a level of complexity that would reduce the effect of such a spam campaign.

There are other issues to be addressed as future work, as we discuss in the next section.

CONCLUSIONS AND FUTURE WORK

Web pages linked by spam messages may be a reliable source of evidence of the spamicity of e-mail messages. In this work, we propose and evaluate a novel spam detection approach which assigns a spamicity score to the web pages linked in e-mail messages. Our approach is suitable to work as a complement to traditional spamicity assessments, such as message content and blacklists.

In terms of machine learning challenges, we think the most promising direction is to combine knowledge from message and web page content in order to maximize spam detection accuracy while keeping a satisfactory filter performance: in cases where spam message content is enough for a classifier
judge its spamicity, we do not need to pay the cost of crawl- Our motivation for such proposal is the observation that, ing and analyzing web page content. Devising strategies that so far, web pages linked in e-mail messages are not explored examine web page content only when spam message content by current spam filters, and, despite that, they offer – cur- is not enough seems to be an interesting approach to lead to rently – clear and unobfuscated content for spamicity assess- robust and efficient spam detection.
ments. Since spam filters currently do not examine web pagecontent, spammers usually do not obfuscate their advertised sites. Even if spammers begin to obfuscate their web pages,the effort required would serve as an additional disincentive This work was partially supported by, CNPq, CAPES, for spam activity.
FAPEMIG, FINEP, by UOL ( through its We evaluate the use of a lazy machine learning algorithm [28] Research Scholarship Program, Proc. Number 20110215235100, to classify web pages, and propose a simple strategy for ag- by the Brazilian National Institute of Science and Technol- gregating the classification of the pages to the traditional ogy for the Web, InWeb. CNPq grants no. 573871/2008-6 spam message classification, by using SpamAssassin [25].
and 141322/2009-8, and by Movimento Brasil Competitivo, We show that, by using our technique, it is possible to im- through its Brazil ICSI Visitor Program.
prove spam filtering accuracy without adding a significantnumber of false positives. Furthermore, the use of a cost- sensitive classifier allows the adjustment of the false positive [1] D. S. Anderson, C. Fleizach, S. Savage, and G. M.
cost, allowing users to have better control on the trade-off Voelker. Spamscatter: Characterizing Internet Scam between false positives and false negatives.
Hosting Infrastructure. In Proceedings of the 16th We believe that this work explores a new frontier for spam IEEE Security Symposium, pages 135–148, 2007.
filtering strategies, introducing a new aspect that has not [2] I. Androutsopoulos, J. Koutsias, K. Chandrinos, yet been explored in the literature. In other words, the web G. Paliouras, and C. D. Spyropoulos. An evaluation of pages that are linked by spam messages are a new battle- naive bayesian anti-spam filtering. CoRR, ground, one that spammers are not currently worried about, and one that may be explored through many different ap- [3] B. Biggio, G. Fumera, and F. Roli. Evade hard proaches and algorithms.
multiple classifier systems. In O. Okun and As future work, we intend to address issues that may sur- G. Valentini, editors, Supervised and Unsupervised face once web page analysis becomes integrated into the Ensemble Methods and Their Applications, volume spam-fighting infra-structure. They relate to the privacy of 245, pages 15–38. Springer Berlin / Heidelberg, 2008.
users, the abuse of the infra-structure and the obfuscation [4] D. Chinavle, P. Kolari, T. Oates, and T. Finin.
of urls, and we discuss them next.
Ensembles in adversarial classification for spam. In It is possible that, by crawling a URL in a spam message CIKM '09: Proceeding of the 18th ACM conference on sent to a user, we may be providing feedback to the spam- Information and knowledge management, pages mer. That would be the case if the spammer embedded the 2015–2018, New York, NY, USA, 2009. ACM.
message receiver identifier in the URL using some kind ofencoding. By collecting that page our system would be con- [5] G. V. Cormack. Email spam filtering: A systematic firming to the spammer that such user is active. There is review. Found. Trends Inf. Retr., 1:335–455, April no information on how often such encoded URLs are used, but they certainly demand more processing power from the [6] N. Dalvi, P. Domingos, Mausam, S. Sanghai, and spammer, what, in turn, may reduce profits.
D. Verma. Adversarial classification. In KDD '04: By implying that all URLs in received messages will result Proceedings of the tenth ACM SIGKDD international in a crawler execution, this may provide a way for spammers conference on Knowledge discovery and data mining, and other individuals to start attacks by abusing such fea- pages 99–108, New York, NY, USA, 2004. ACM.
tures. A heavy volume of messages crafted by an attacker [7] H. Drucker, D. Wu, and V. N. Vapnik. Support vector with multiple URLs might overload the crawler, or be used machines for spam categorization. IEEE to turn the crawler into an attacker, requesting a large vol- TRANSACTIONS ON NEURAL NETWORKS, ume of pages from another site. The later may be avoided just by using DNS and HTTP caches, reducing the load [8] C. Elkan. The foundations of cost-sensitive learning.
In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pages 973–978.
[9] eSoft. Pharma-fraud continues to dominate spam.
[10] T. Fawcett. "In vivo" spam filtering: a challenge problem for KDD. SIGKDD Explor. Newsl., 5:140–148, December 2003.
[11] I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 649–656, New York, NY, USA, 2007. ACM.
[12] J. Goodman, G. V. Cormack, and D. Heckerman. Spam and the ongoing battle for the inbox. Commun. ACM, 50(2):24–33, 2007.
[13] B. Guenter. Spam Archive, 2011.
[14] P. H. C. Guerra, D. Guedes, W. Meira Jr., C. Hoepers, M. H. P. C. Chaves, and K. Steding-Jessen. Exploring the spam arms race to characterize spam evolution. In Proceedings of the 7th Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS), Redmond, WA, 2010.
[15] P. H. C. Guerra, D. Pires, D. Guedes, W. Meira Jr., C. Hoepers, and K. Steding-Jessen. A campaign-based characterization of spamming strategies. In Proceedings of the 5th Conference on Email and Anti-Spam (CEAS), Mountain View, CA.
[16] M. Illger, J. Straub, W. Gansterer, and C. Proschinger. The economy of spam. Technical report, Faculty of Computer Science, University of Vienna, 2006.
[17] S. M. Labs. MessageLabs Intelligence: 2010 annual report.
[18] K. Levchenko, A. Pitsillidis, N. Chachra, B. Enright, M. Felegyhazi, C. Grier, T. Halvorson, C. Kanich, C. Kreibich, H. Liu, D. McCoy, N. Weaver, V. Paxson, G. M. Voelker, and S. Savage. Click trajectories: End-to-end analysis of the spam value chain. In Proceedings of the IEEE Symposium on Security and Privacy, 2011.
[19] Libcurl, 2011.
[20] Lynx, 2011.
[21] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker. Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 1245–1254, New York, NY, USA, 2009. ACM.
[22] A. Ntoulas and M. Manasse. Detecting spam web pages through content analysis. In Proceedings of the World Wide Web Conference, pages 83–92. ACM Press, 2006.
[23] C. Pu, S. Webb, O. Kolesnikov, W. Lee, and R. Lipton. Towards the integration of diverse spam filtering techniques. In Proceedings of the IEEE International Conference on Granular Computing (GrC06), Atlanta, GA, pages 17–20, 2006.
[24] C. Pu and S. Webb. Observed trends in spam construction techniques: a case study of spam evolution. In Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS), 2006.
[25] SpamAssassin, 2011.
[26] Y. Sun, A. K. C. Wong, and Y. Wang. An overview of associative classifiers, 2006.
[27] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song. Monarch: Providing real-time URL spam filtering as a service. In Proceedings of the IEEE Symposium on Security and Privacy, Los Alamitos, CA, USA, 2011. IEEE Computer Society.
[28] A. Veloso, W. Meira Jr., and M. J. Zaki. Lazy associative classification. In ICDM, pages 645–654. IEEE Computer Society, 2006.
[29] A. Veloso, W. Meira Jr., and M. J. Zaki. Calibrated lazy associative classification. In S. de Amo, editor, Proceedings of the Brazilian Symposium on Databases (SBBD), pages 135–149. SBC, 2008.
[30] A. Veloso and W. Meira Jr. Lazy associative classification for content-based spam detection. In Proceedings of the Fourth Latin American Web Congress, pages 154–161, Washington, DC, USA, 2006. IEEE Computer Society.
[31] S. Webb. Introducing the Webb Spam Corpus: Using email spam to identify web spam automatically. In Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS), Mountain View, CA, 2006.

