Spam Detection Using Web Page Content:
a New Battleground
Marco Túlio Ribeiro, Pedro H. Calais Guerra, Leonardo Vilela,
Adriano Veloso, Dorgival Guedes∗, Wagner Meira Jr.
Universidade Federal de Minas Gerais (UFMG)
Belo Horizonte, Brazil
Marcelo H.P.C. Chaves, Klaus Steding-Jessen, Cristine Hoepers
Brazilian Network Information Center (NIC.br)
São Paulo, Brazil
∗Dorgival Guedes is also with the International Computer Science Institute (ICSI), Berkeley.

ABSTRACT

Traditional content-based e-mail spam filtering takes into account the content of e-mail messages and applies machine learning techniques to infer patterns that discriminate spams from hams. In particular, the use of content-based spam filtering unleashed an unending arms race between spammers and filter developers, given the spammers' ability to continuously change spam message content in ways that might circumvent the current filters. In this paper, we propose to expand the horizons of content-based filters by taking into consideration the content of the Web pages linked by e-mail messages.
We describe a methodology for extracting the pages linked by URLs in spam messages and we characterize the relationship between those pages and the messages. We then use a machine learning technique (a lazy associative classifier) to extract classification rules from the web pages that are relevant to spam detection. We demonstrate that the use of information from linked pages can nicely complement current spam classification techniques, as portrayed by SpamAssassin. Our study shows that the pages linked by spams are a very promising battleground.

1. INTRODUCTION

Spam fighting is an "arms race" characterized by an increase in the sophistication adopted by both spam filters and spammers [12, 14]. The co-evolution of spammers and anti-spammers is a remarkable aspect of the anti-spam battle and has motivated a variety of works that devise adversarial strategies to treat spam as a moving target [6, 4].
On the spammers' side, the standard counter-attack strategy to face content-based filters is to obfuscate message content in order to deceive the filters. In the early years of the spam arms race, obfuscation techniques were as simple as misspelling Viagra as V1agra, but they have evolved to complex HTML-based obfuscations and the use of images to prevent the action of text-based filters. However, spammers face a trade-off: their final goal is to motivate a recipient to click on their links; too much obfuscation can lead to lower click rates and reduce spammers' gains [3]. No obfuscation at all, on the other hand, will cause the spam to be easily blocked and few mailboxes will be reached. Therefore, in addition to keeping spam detection rates high, content-based filters had the positive effect (for the anti-spam side) of making each spam message less attractive and less monetizable – even though spammers have tackled that problem by sending larger volumes of spam.
In this paper, we argue that a fundamental component of spam content has been neglected by content-based spam filters: the content of the web pages linked by spam messages. We believe web pages can be a useful component added to current spam filtering frameworks for the following reasons:

1. Web page content is an almost unexplored battleground in the spam arms race, in part because of the belief that processing those pages would be too expensive in terms of computing resources. Therefore, current spammers may not have major concerns regarding their web pages getting identified as spam, and so do not implement mechanisms to obfuscate web pages, which would represent an extra cost and might cause their pages to become harder to read. Increasing the cost of the spam activity is one efficient strategy to discourage spammers [16]. In addition, although spammers send billions of messages daily, the range of products advertised in web sites is not very diverse; a recent report concluded that 70% of spam advertises pharmaceutical products [9]. A recent work has shown that a few banks are responsible for processing the transactions of spam product purchases [18], which is an additional motivation for seeking evidence for spam detection that is closer to the spammer's business and cannot be changed easily.

2. Recently, Thomas et al. presented Monarch [27], a real-time system to detect spam content in web pages published in social network sites and in e-mail messages. Their results show that with the current technology it is feasible to collect and process pages as they show up in web site posts and e-mail messages.

3. In the underground spam market, there is evidence that the spam sender is not always the same agent as the web page owner, making the adaptation of web page content a separate task, harder to coordinate with the spam messages [1]. This is the case when spammers work as business brokers that connect sellers (the web page owners) and buyers (the e-mail users targeted with spams) [16].

4. It has been reported that 92% of spams in 2010 contained one or more URLs [17]; previous work reported the presence of URLs in up to 95% of the spam campaigns examined [15, 24]. Those numbers indicate that using web page content to improve spam detection is an applicable strategy, as spammers need to use URLs to earn money through the advertised web sites.

In this work, we show that web pages can provide precious information about the nature of the e-mail messages that link to them. Using information from classical spam/ham repositories, we built a new dataset that also provides information about the web pages mentioned in the messages of the set. Using that dataset, we show that by analyzing the web pages pointed to by e-mail messages we can complement SpamAssassin regular expressions and blacklist tests, providing an improvement of up to 10% in spam classification without increasing the false positive rate. Our contributions are the following:

• we make available this new dataset, which associates web page content and e-mail message content;
• we propose and evaluate a methodology for e-mail spam detection that considers the content of the web pages pointed to by messages;
• we show the effectiveness of this methodology for the e-mail spam detection task.

Our results show that considering web pages for spam detection purposes is a promising strategy that creates a new battleground for spam, one that has not yet been exploited by current spam filters.
The remainder of this work is organized as follows: Section 2 presents related work, while Section 3 goes into the details of our web page-based spam detection scheme. In Section 4 we describe the dataset we have used in this work and we discuss the experimental evaluation results, which are followed by the conclusions and the discussion of future work.

2. RELATED WORK

Very few works mention the use of web pages for e-mail spam detection purposes. To our knowledge, the first work to suggest the use of web pages for e-mail spam detection is a proposal of a framework which combines different spam filtering techniques, including a web page-based detection scheme, but the authors did not go into details about the strategy [23].
As previously mentioned, the Monarch system [27] is certainly the one most closely related to our proposal. Their work shows the feasibility of collecting and processing web pages in real time for spam analysis with a cost-effective infrastructure, while identifying a large number of attributes that may be used for that goal. Although their goal is mainly to classify the web content itself, the authors also show how information from a Twitter post containing a URL can be used to improve the classification of the web page pointed to by it. However, due to the nature of their data, they could not explore the relation between e-mail messages and the web pages they link to, since they did not have the original e-mail messages, but only the list of URLs in them. Our datasets make it possible for us to do exactly that.
Obviously, web pages are the basic unit of analysis for the web spam detection task, which aims to detect artificially created pages inserted into the web in order to influence the results of search engines [22]. In that direction, Webb [31] created a dataset containing web pages crawled from spams of the Spam Archive dataset [13], comprising messages from 2002 to 2006. However, his dataset did not relate the web pages to the spam messages.
State-of-the-art approaches for e-mail spam detection consider message content features and network features [5]. In terms of content-based filtering, a wide range of machine learning techniques such as Bayesian filters, SVM classifiers and decision trees have been applied over e-mail message content with reasonable success [5, 7, 2]. While web page content has not been experimented with as a spam detection strategy, URLs embedded in e-mail messages are used for phishing detection purposes by considering IP address, WHOIS and domain properties and the geolocalization of URLs [21, 11]. The web pages linked by e-mail messages have also been used by Spamscatter [1] as a means of identifying spam campaigns by the web pages they linked to. However, their technique was applied only to spam messages already identified as such and required creating and comparing snapshots of the rendered web pages.

3. PROPOSED METHODOLOGY

For each e-mail message processed from the message stream, we download the web pages linked by the URLs contained in the message. Then, content analysis techniques are applied to the web page content – basically, the same approach adopted by filters to analyze the content of a spam message. We assign a spamicity score to the set of web pages linked by a given message and then combine this score with the result of classical spam filtering techniques, so that the message receives a final score that takes into account both the message content and the web page content. Figure 1 summarizes the work-flow of our filtering technique, and we discuss its major steps next.

Figure 1: Steps of the web page-based spam filtering approach. URLs are extracted from the messages, the linked web pages are crawled, and a spamicity score is assigned to the set of web pages relative to an e-mail message. Then, the web page score is combined with conventional spamicity scores to generate a final spamicity assessment.

3.1 Web Page Crawling

We begin by extracting the URLs from the body of the messages¹ and use simple regular expressions to remove non-HTML URLs, i.e., URLs that link to images or executable files. After that, we download and store the web pages². In the case of spam messages containing multiple URLs, all the web pages are downloaded and stored. Many of the URLs considered lead to redirections before reaching their final page; in that case, we follow all redirects and store the content of the final URL.
After extracting the messages' URLs and downloading the web pages linked by them, we use lynx [20], a text-based browser, to format the web page's text as users would perceive it (without images). Lynx generates a dump of the web page, already in text mode, removing the non-textual parts of the page such as HTML tags and JavaScript code. The textual terms in that dump are then extracted, and the final result is a set of words that are used to determine the spamicity of the web page.

¹Using the Perl modules URI::Find and HTML::LinkExtor.
²Using the file transfer library libcurl [19].
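To make the crawling step concrete, the sketch below shows one way the pipeline just described could be implemented. It is only an illustration under our own assumptions: the helper names are ours, and we use Python with the requests library instead of the Perl modules and libcurl actually used by the authors. The lynx invocation mirrors the text-dump step described above.

import re
import subprocess
import requests  # assumption: any HTTP client that follows redirects would do

URL_RE = re.compile(r'https?://[^\s<>"\']+', re.IGNORECASE)
NON_HTML = re.compile(r'\.(jpe?g|png|gif|exe|zip)(\?.*)?$', re.IGNORECASE)

def extract_urls(message_body):
    """Pull URLs out of a message body and drop links to images or executables."""
    return [u for u in URL_RE.findall(message_body) if not NON_HTML.search(u)]

def fetch_final_page(url):
    """Download a page, following all redirects, and return the final HTML."""
    resp = requests.get(url, timeout=10, allow_redirects=True)
    return resp.text

def page_terms(html):
    """Render the page roughly as a user would see it (via lynx) and tokenize it."""
    dump = subprocess.run(["lynx", "-stdin", "-dump", "-nolist"],
                          input=html, capture_output=True, text=True).stdout
    return set(re.findall(r"[a-z]+", dump.lower()))

In a production setting, DNS and HTTP caches and per-server request limits (discussed later in the paper) would sit in front of fetch_final_page.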
3.2 Web Page Spamicity Score Computation

Given that we have at our disposal the textual terms extracted from the set of web pages linked by each spam message, a range of classifiers could be built based on that information. We chose to use LAC, a demand-driven associative classifier [28, 30]. Associative classifiers are systems that integrate association mining with classification: they mine association rules that correlate features with the classes of interest (e.g., spam or ham), and build a classifier that uses the relevant association patterns discovered to predict the class of an object [26]. LAC is a demand-driven classifier because it projects/filters the training data according to the features in the test instance, and extracts rules from this projected data. This ensures that only rules that carry information about the test instance are extracted from the training data, drastically bounding the number of possible rules.
We have chosen to apply a demand-driven lazy classification algorithm for several reasons: (i) it has good performance for real-time use; (ii) it generates a readable and interpretable model in the form of association rules (which can be easily transformed into a set of regular expressions and incorporated into SpamAssassin); and (iii) it is well calibrated, which means that it provides accurate estimates of class membership probabilities. Formally, we say that a classifier is well calibrated when the estimated probability p̂(c|x) is close to p(c|p̂(c|x)), which is the true, empirical probability of x being a member of class c given that the probability estimated by the classifier is p̂(c|x). Therefore, it is possible to know which predictions are more or less accurate and to use that information when scoring different pages. For more details on class membership likelihood estimates, please refer to [29].
The demand-driven associative classifier algorithm generates rules of the form χ → c, where χ is a set of words and c is a class (spam or ham). A typical rule extracted from e-mail messages would be buy, viagra → spam. Each of those rules has a support (how often the rule appears in the training data) and a confidence, which is given by the number of pages in the training data that are classified correctly by the rule divided by the number of pages that contain the set of terms χ. The final result of each page's classification is a score between 0 and 1 that indicates both the predicted class (spam or ham) and the certainty of the prediction. This score is reliable (as it has been noted that the algorithm is well calibrated), and it will be a factor in the final page score.
One of the challenges of spam detection is the asymmetry between the cost associated with classifying a spam message incorrectly and the cost associated with classifying a ham message incorrectly. A false negative might cause slight irritation, as the user sees an undesirable message. A false positive, on the other hand, means that a legitimate message may never reach the user's inbox [10]. Therefore, instead of giving the same importance to false positives and false negatives, we build a cost-sensitive classifier [8]. In that scenario, the cost of a class measures the cost of incorrectly classifying a message of that class. As it weighs all the rules obtained for a certain message, the algorithm computes a weighted sum of the rules, taking into consideration the confidence of each rule and the cost of each class, in order to give higher importance to rules that point to the class with the higher cost. That implies that, as the cost of the ham class grows, the algorithm needs more certainty in order to classify a page as spam.
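As a rough illustration of the rule extraction and cost-weighted voting just described, the sketch below gives a much simplified lazy associative classifier. It is our own rendering, not the actual LAC implementation [28, 30]; the minimum-support threshold, the rule-size limit and the score normalization are assumptions.

from collections import Counter
from itertools import combinations

def lazy_classify(test_terms, training, costs=None, min_support=5, max_rule_size=2):
    """Score one page with a demand-driven (lazy) associative classifier, simplified.

    training: list of (set_of_terms, label) pairs, label being "spam" or "ham".
    Returns (predicted_label, certainty), with certainty in [0, 1].
    """
    costs = costs or {"spam": 1.0, "ham": 1.0}

    # Demand-driven step: project the training data onto the features of the
    # test instance, so only rules about terms in this page are considered.
    projected = [(terms & test_terms, label) for terms, label in training]

    # Candidate antecedents (chi) are small term sets occurring in the projection.
    candidates = set()
    for terms, _ in projected:
        for size in range(1, max_rule_size + 1):
            candidates.update(combinations(sorted(terms), size))

    votes = {"spam": 0.0, "ham": 0.0}
    for chi in candidates:
        chi = set(chi)
        matching = [label for terms, label in projected if chi <= terms]
        if len(matching) < min_support:
            continue
        label, hits = Counter(matching).most_common(1)[0]
        confidence = hits / len(matching)          # rule: chi -> label
        # Cost-sensitive vote: rules pointing to the costlier class weigh more.
        votes[label] += confidence * costs[label]

    total = votes["spam"] + votes["ham"]
    if total == 0:
        return "ham", 0.0                          # no applicable rule
    predicted = max(votes, key=votes.get)
    return predicted, votes[predicted] / total

With equal costs this reduces to plain confidence voting; raising the ham cost means more spam-pointing evidence is needed before a page is labeled spam, which is the behavior the cost parameter controls in Section 4.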
There are several possible ways to use the resulting score of the page classification to classify an e-mail message as spam or ham. Our approach combines the page score Sp with the other spamicity scores obtained from the message by applying a traditional classifier. The exact formula for Sp will depend on the characteristics of that classifier.
In SpamAssassin, for example, a message is usually considered spam if it reaches 5 or more score points, assigned by a Bayes filter, regular expressions and blacklists [25]. One way of incorporating our technique into SpamAssassin is to simply assign Sp spamicity score points to a message based on the web content of its linked pages, considering whether our classifier says the pages are spam (Is = 1) or ham (Is = −1) and weighting the classifier's certainty c in that prediction by a page weight Wp:

Sp = Is · c · Wp    (1)

Note that if the classifier judges the web page to be ham, Sp will be negative and it will contribute to reduce the overall spamicity of the message. In this way, web pages that are more "spammy" will result in higher scores for their messages; web pages that look more like "ham" will result in lower (negative) scores. This is the strategy we use in this paper. The weight Wp will influence how much impact the web page classification has on the final score; we evaluate that for SpamAssassin in Section 4.
Another alternative would be to completely eliminate the use of blacklists and substitute our technique for them, when appropriate. This could be an interesting approach when network resources are scarce, since both blacklists and our technique demand that a request be made to another server.
Messages that do not have URLs cannot be filtered by our technique, for obvious reasons. Those messages are filtered by conventional spam filtering methods, and their spamicity assessments are not affected by our strategy.
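The combination itself is then a single addition. The snippet below is our own illustration (the aggregation over multiple linked pages is an assumption, since the paper only states that the set of pages receives one score); it applies Eq. 1 and SpamAssassin's usual 5-point threshold mentioned above.

SA_THRESHOLD = 5.0   # SpamAssassin's usual spam threshold

def page_score(is_spam, certainty, w_p):
    """Eq. 1: Sp = Is * c * Wp, with Is = +1 for spam and -1 for ham."""
    i_s = 1.0 if is_spam else -1.0
    return i_s * certainty * w_p

def classify_message(sa_score, pages, w_p=4.0):
    """pages: list of (is_spam, certainty) results from the page classifier.
    Messages without URLs (empty list) keep their SpamAssassin verdict."""
    if not pages:
        return sa_score >= SA_THRESHOLD
    # Assumption: when a message links several pages, keep the strongest signal.
    s_p = max((page_score(s, c, w_p) for s, c in pages), key=abs)
    return sa_score + s_p >= SA_THRESHOLD

# Worked numbers from the illustrative example below: SA score 3.4, page judged
# spam with certainty 0.9, Wp = 4.0  ->  3.4 + 0.9 * 4.0 = 7.0 >= 5, i.e., spam.
assert classify_message(3.4, [(True, 0.9)], w_p=4.0)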
Illustrative Example

In this section, we present a step-by-step example of the application of our web page-content based technique. We picked a spam message identified in October of 2010, from the Spam Archive dataset. Figure 2 shows the body of the message; it can be noted that the message is very concise, exhibiting very little textual content. SpamAssassin (considering queries to blacklists) yields the rules shown in Table 1.

Figure 2: Spam message extracted from Spam Archive. Small textual content poses challenges for the analysis of spamicity based on message content.

Table 1: SpamAssassin rules extracted for the spam message of Figure 2 (HTML included in message; listed in the BRBL DNS blacklist, BRBL_LASTEXT; contains an URL listed in the URIBL blacklist, URIBL_BLACK). SpamAssassin regular expressions and queries to blacklists are not sufficient to classify the message.

The resulting score considering just the spam content is only 0.001. Taking the blacklist information into account, the resulting score is 3.4 – still not enough to classify the message as spam. An excerpt from the web page pointed to by the URL is shown in Figure 3.

Figure 3: Web page linked by the URL present in the message from Figure 2. As current filters do not consider web page content, spammers do not need to obfuscate it and sacrifice readability.

It can be noted that, in this case, the content of the message and the content of the page are totally different – one seems to be selling watches and bags, the other is selling medicine. The content of the page is then extracted into a set of words (with lynx) and delivered as input to the already-trained associative classifier (it was trained with other pages from Spam Archive and with ham pages from SpamAssassin's dataset). The associative classifier finds a list of rules, some of which are listed in Table 2.

Table 2: Rules extracted for the web page of Figure 3, linked by the message from Figure 2, unveiling high-spamicity terms.

After weighting all the rules, the associative classifier yields a result: the page is a spam page, with 90% certainty (i.e., c = 0.9). If we set Wp = 4.0, then Sp = 3.6. This web page, therefore, contributes a 3.6 score. Adding this score to the score obtained by SpamAssassin (Table 1), the message reaches a 7.0 score, more than enough to be classified as spam.
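Tying the previous sketches together, the end-to-end path from a raw message to a verdict looks roughly like this. Again, this is our own glue code under the same assumptions, reusing extract_urls, fetch_final_page, page_terms, lazy_classify and classify_message from the earlier sketches.

def score_message(message_body, sa_score, training_pages, w_p=4.0, ham_cost=1.0):
    """Crawl the pages linked by a message, score each one, and combine the
    result with the message's SpamAssassin score. Returns True for spam."""
    page_results = []
    for url in extract_urls(message_body):
        try:
            terms = page_terms(fetch_final_page(url))
        except Exception:
            continue                     # dead link: fall back to SA alone
        label, certainty = lazy_classify(
            terms, training_pages, costs={"spam": 1.0, "ham": ham_cost})
        page_results.append((label == "spam", certainty))
    return classify_message(sa_score, page_results, w_p=w_p)

For the message of Figure 2, this is exactly the path that turns a 3.4-point SpamAssassin score into a 7.0-point combined score.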
4. EXPERIMENTAL EVALUATION

In order to evaluate the applicability of building anti-spam filters using the content of web pages, we built a dataset with the necessary information. In the period between July and December of 2010, we collected spam messages from the Spam Archive dataset [13]. Spam Archive is updated daily, and each day we collected the web pages linked by the new messages included that day using wget, to guarantee that we crawled the URLs as soon as they appeared. That was essential, since it is well known that the lifetime of spam web pages is usually short and few spam-related web pages are still online two weeks after activation [1]. We only included in our dataset the messages for which we could download at least one web page³.
For each of the 157,114 web pages obtained, we stored two files: one containing the HTML content of the web page and the other containing the HTTP session information. We also associated each downloaded web page with the corresponding message. We decided to evaluate only the unique pages, so that spam campaigns that pointed to the same pages would not introduce bias into our results. Whenever many different spam messages pointed to the same set of web pages, one of them was randomly chosen for the evaluation, so that only one instance of each page remained. Note that the fact that spammers advertise the same web page multiple times inside a spam campaign (with different message content, though) would actually benefit our approach, but we decided to neutralize this effect to generate results closer to a lower bound.
We used the same methodology to process ham messages from the SpamAssassin Ham Dataset, resulting in 11,134 unique ham pages, linked by 4,927 ham messages. The characteristics of the resulting dataset are summarized in Table 3. It is interesting to notice that the average number of pages per message was not very different between hams and spams. Figure 4 shows the distribution of the number of pages downloaded for each message in the spam dataset.

³To obtain a copy of the dataset, please contact the authors.

Table 3: Dataset description. Rows: pages downloaded per spam msg. (avg.); unique spam web pages; chosen spam messages; unique spam web pages per msg. (avg.); unique ham messages (4,927); unique ham web pages (11,134); unique ham web pages per msg. (avg.).

Figure 4: Distribution of the number of pages downloaded per message.

We evaluated our technique using all the resulting unique pages (ham and spam) and the sampled e-mail messages that pointed to them. In all cases, to generate results, we used a standard 5-fold cross-validation strategy. Our technique was used with the SpamAssassin filter, with rules and blacklists activated. We combined SpamAssassin's score with the associative classifier's score, computed as previously described (Eq. 1), by adding the two values.
In the next sections we show the relationship between the associative classifier's certainty and the score given by SpamAssassin, and the impact of varying the parameters Wp (web page weight, i.e., the importance of web page content for the classification of the e-mail message) and cost (the importance of an error in classifying spams and hams).

Certainty of the classifier's prediction vs. SpamAssassin score

The relationship between the message score given by SpamAssassin and the certainty that the page is spam given by the associative classifier is shown in Figure 5. In this experiment, the cost of each class was the same and Wp was set to 4.
In Figure 5, lines show the threshold values that would separate hams from spams, as given by the SpamAssassin score (vertical line) and the certainty in our technique (horizontal line), representing all four possible situations (spam or ham according to SA × spam or ham according to the SA + Web Page approach). Figure 5(a) represents the spams from Spam Archive, and Figure 5(b) represents the hams from the SpamAssassin Ham Archive. It is worth noticing that there is a large quantity of spams in the bottom right corner of Figure 5(a) – spams that are weakly identified by SpamAssassin rules, but that are linked to pages identified as spams by our technique. Also, it can be noted that most hams in the bottom right corner of Figure 5(b) have a low score given by SpamAssassin and/or a low certainty given by our technique, meaning that even if they were misclassified by our technique, they would probably not be marked as spam when both scores were combined.
Figure 5: Comparison between the SpamAssassin score and the web page classification certainty, for (a) spam messages and (b) ham messages. It is possible to note in the bottom right quadrant of (a) the spam messages that are not caught by the filter, but that point to pages caught by the web page classifier. In the bottom right quadrant of (b), on the other hand, it is possible to see that the rate of false positives would not increase considerably using our technique, since most messages in that area have a low score given by SpamAssassin and/or a low certainty given by the web page-content based classifier.

Impact of the weight parameter in spam detection

The relationship between the rate of false positives and false negatives and different weight values is shown in Figure 6. We also show in the same figure the rate of false positives and false negatives obtained using SpamAssassin without our technique, for comparison purposes.

Figure 6: Effect of page weight variation: rate of false positives and false negatives for SpamAssassin with and without web page classification rules, for different values of Wp (page weight); the curves show FP and FN rates for SpamAssassin alone and for SA + Web Page. Considering web page content for spam detection reduces the rate of false negatives without incurring a higher rate of false positives up to Wp = 4.

It can be seen that up to a value of 4, the rate of false positives of our technique is equal to or lower than the rate of false positives of SpamAssassin, even though the rate of false negatives of our technique is remarkably lower. This is the reason we chose the value 4 for the weight in the other experiments, as it is the value that yields the lowest rate of false negatives while maintaining an acceptable rate of false positives.

Impact of the cost parameter in spam detection

The relationship between the rate of false positives and false negatives and different cost values is shown in Figure 7. The values on the x axis represent the ratio between the cost of classifying a ham as spam and the cost of classifying a spam as ham. Therefore, if the value on the x axis is 1.5, it is 50% more costly to classify a ham as spam (i.e., a false positive) than to classify a spam as ham (i.e., a false negative). We have not considered ratios lower than 1.0, since that would mean considering that false negatives have a higher impact than false positives.

Figure 7: Effect of cost variation: rate of false positives and false negatives for SpamAssassin with and without web page classification rules, varying the ratio between the cost of misclassifying a ham message and the cost of misclassifying a spam message (x axis: misclassified ham cost / misclassified spam cost; the curves show FP and FN rates for SpamAssassin alone and for SA + Web Page). The rates of false positives and false negatives can be adjusted according to the user's needs concerning the cost of each class.

As the figure shows, raising the relative cost of ham misclassification yields a smaller rate of false positives and a potentially higher rate of false negatives. With the cost of misclassifying ham set to 70% higher than that of misclassifying spam, our technique classifies spam with the same accuracy as SpamAssassin, even though we lower the number of false positives to zero. It is worth noticing that this compromise is adjustable in our technique, through the ratio of the cost parameters. It is up to the user to set that ratio according to his situation.
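Both parameter studies can be reproduced in spirit with a small cross-validation harness like the sketch below. This is our own code, not the authors' evaluation scripts; it assumes the lazy_classify, page_score and SA_THRESHOLD sketches given in Section 3 and a dataset of (SpamAssassin score, page-term set, label) triples.

import random

def fp_fn_rates(dataset, w_p=4.0, ham_cost=1.0, folds=5, seed=0):
    """Estimate false-positive / false-negative rates with k-fold cross-validation."""
    data = dataset[:]
    random.Random(seed).shuffle(data)
    fp = fn = hams = spams = 0
    for k in range(folds):
        test = data[k::folds]
        train = [(terms, label) for i, (_, terms, label) in enumerate(data)
                 if i % folds != k]
        for sa_score, terms, label in test:
            pred, certainty = lazy_classify(
                terms, train, costs={"spam": 1.0, "ham": ham_cost})
            combined = sa_score + page_score(pred == "spam", certainty, w_p)
            is_spam = combined >= SA_THRESHOLD
            if label == "ham":
                hams += 1
                fp += is_spam
            else:
                spams += 1
                fn += not is_spam
    return fp / hams, fn / spams

# Sweeping w_p (Figure 6) or ham_cost (Figure 7) is then a simple loop, e.g.:
# for w in (1, 2, 3, 4, 5, 6): print(w, fp_fn_rates(data, w_p=w))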
Robustness of Web Page features

In this section, we assess the difficulty of detecting spams through web pages in comparison with the standard detection approach based on message content. Using the demand-driven associative classifier we presented in Section 3.2, we compute the top 10 most frequent rules the algorithm detected for web page content (relative to the spam class), and compare their frequency (support) in message content. The results are shown in Table 4. Note that, in web pages, the most frequent terms are present in a very high fraction of the spams: over one third of the web pages contain "popular" spam terms such as viagra, cialis, levitra and active. On the other hand, the prevalence of those terms in message content is significantly lower: the most popular term in web pages, viagra, is observed in no more than 14% of the messages. For the remaining popular words in web pages, the frequency in messages is lower than or equal to 5%, a consequence of spammers' obfuscations in message content and of the fact that spam web pages exhibit the full content of the products being advertised, while messages tend to present only a summary of the advertised content.

Table 4: Frequency of the top 10 terms in web pages in comparison with message content. Frequent terms in web pages are rare in messages, due to spammers' obfuscation efforts.

Those numbers indicate that, as spammers currently do not have any concern about obfuscating web page content, a small number of simple rules has a strong impact on detecting spams, and will oblige spammers to react accordingly – increasing the cost of sending spam.
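The comparison behind Table 4 amounts to measuring, for each term, the fraction of spam pages and of spam messages that contain it. A minimal sketch (our own, operating on one set of terms per document) follows:

from collections import Counter

def term_support(docs):
    """docs: list of term sets (one per page or per message).
    Returns {term: fraction of documents containing the term}."""
    counts = Counter(t for d in docs for t in d)
    return {t: c / len(docs) for t, c in counts.items()}

def page_vs_message_support(spam_pages, spam_messages, top_n=10):
    """Spirit of Table 4: most frequent terms in linked pages, and how often
    the same terms show up in the messages themselves."""
    page_sup = term_support(spam_pages)
    msg_sup = term_support(spam_messages)
    top_terms = sorted(page_sup, key=page_sup.get, reverse=True)[:top_n]
    return [(t, page_sup[t], msg_sup.get(t, 0.0)) for t in top_terms]

In the paper's data this gap is large: terms such as viagra appear in over a third of the linked pages but in at most 14% of the messages.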
Operational issues

Three final aspects deserve mention: performance, the effectiveness of the technique for time-changing URLs, and the adaptation of spammers to our technique. We discuss these issues in this section.
Performance: adding web page analysis to the spam detection pipeline immediately brings to mind the matter of performance. Will such a system be able to keep up with the flow of incoming messages in a busy e-mail server as it crawls the web and processes pages, in addition to the existing spam filter load? Certainly this solution will require extra processing power, since it adds a new analysis to the anti-spam bag of tricks. There are two major costs to be considered in this scenario: crawling and filtering. Although we have not measured the performance of our crawler, results from the Monarch system show that crawling pages even for a heavy mail server can be done with a median time per page of 5.54 seconds and a throughput of 638,000 URLs per day on a four-core 2.8 GHz Xeon processor with 8 GB of memory [27]. That cost includes the DNS queries, HTTP transfers and page processing. Our feature extraction task is simpler, however, since we only extract the final text from each page. That seems an acceptable delay for an incoming message, in our view.
Compared to Monarch, our feature extraction process is simpler, both because we use fewer attributes and because we focus only on the text of HTML documents. One question in this comparison might be the cost of the message filtering itself, since we use a very different approach to classification. We believe that our approach, by using fewer attributes, reduces processing time, but we have no means for a direct comparison. However, in our experiments with a single dual-core Xeon processor with 4 GB of memory, our classifier showed a throughput of 111 pages per second on average. Since message classification may be done independently for each message, the task can be easily distributed, becoming highly scalable.
Time-changing URLs: one problem of the solution we proposed, first mentioned by the authors of Monarch, is that spammers might change the content of a web page over time, using dynamic pages. When a message is delivered, the server hosting the spam web pages is configured to return a simple ham page in response to a request. Knowing that a server using a methodology like ours would try to verify the pages right after delivery, the server would only start returning the intended spam content some time after the SMTP delivery is completed. That, again, would require more processing by the spammer, and coordination between the spam-distributing server and the spam web hosting server, which are not always under the same control [1]. If such a practice becomes usual, a possible solution might be to add such tests to the user interface: as soon as a user opened a message with links, they would be collected and analyzed in the background and a message might be shown to the user.
Spammer adaptation: it is our observation that web pages linked in e-mail messages are currently not explored by spam filters. We also observed that the "spammy" content in such web pages is clear and unobfuscated. However, it is known that spammers adapt to new techniques used by filters [14]. It could be argued that the same obfuscations used in e-mail messages could also be used in these web pages. However, that is not always possible. As with the time-changing URLs, this would require coordination between the spam-distributing server and the spam web hosting server, which are not always under the same control [1]. Not only that, but the spammer would have to sacrifice legibility and the appearance of credibility in order to obfuscate the web page, as is the case with e-mail spam, which results in less profitability from each message. The current state of e-mail spam demonstrates this clearly, as many spam messages being sent are hard to understand and do not appear credible.
There are other issues to be addressed as future work, as we discuss in the next section.

CONCLUSIONS AND FUTURE WORK

Web pages linked by spam messages may be a reliable source of evidence of the spamicity of e-mail messages. In this work, we propose and evaluate a novel spam detection approach which assigns a spamicity score to the web pages linked in e-mail messages. Our approach is suitable to work as a complement to traditional spamicity assessments – such as message content and blacklists.
Our motivation for such a proposal is the observation that, so far, web pages linked in e-mail messages are not explored by current spam filters and, despite that, they offer – currently – clear and unobfuscated content for spamicity assessments. Since spam filters currently do not examine web page content, spammers usually do not obfuscate their advertised sites. Even if spammers begin to obfuscate their web pages, the effort required would serve as an additional disincentive for the spam activity.
We evaluate the use of a lazy machine learning algorithm [28] to classify web pages, and propose a simple strategy for aggregating the classification of the pages with the traditional spam message classification, using SpamAssassin [25]. We show that, by using our technique, it is possible to improve spam filtering accuracy without adding a significant number of false positives. Furthermore, the use of a cost-sensitive classifier allows the adjustment of the false positive cost, allowing users to have better control over the trade-off between false positives and false negatives.
We believe that this work explores a new frontier for spam filtering strategies, introducing a new aspect that has not yet been explored in the literature. In other words, the web pages that are linked by spam messages are a new battleground, one that spammers are not currently worried about, and one that may be explored through many different approaches and algorithms.
As future work, we intend to address issues that may surface once web page analysis becomes integrated into the spam-fighting infra-structure. They relate to the privacy of users, the abuse of the infra-structure and the obfuscation of URLs, and we discuss them next.
It is possible that, by crawling a URL in a spam message sent to a user, we may be providing feedback to the spammer. That would be the case if the spammer embedded the message receiver's identifier in the URL using some kind of encoding. By collecting that page, our system would be confirming to the spammer that such a user is active. There is no information on how often such encoded URLs are used, but they certainly demand more processing power from the spammer, which, in turn, may reduce profits.
Since all URLs in received messages would result in a crawler execution, this may provide a way for spammers and other individuals to start attacks by abusing such a feature. A heavy volume of messages crafted by an attacker with multiple URLs might overload the crawler, or be used to turn the crawler into an attacker, requesting a large volume of pages from another site. The latter may be avoided just by using DNS and HTTP caches, reducing the load over any specific site. Setting limits on the number of URLs extracted from any message or on the concurrent queries to a single server may prevent other overload attacks.
Finally, one countermeasure that spammers could take would be to prevent a filter from identifying the URLs in a message. Although possible, that would imply that the spammer would have to rely on the user to copy – and edit – text from a message in order to turn it into a valid URL. That would again add a level of complexity that would reduce the effect of such a spam campaign.
In terms of machine learning challenges, we think the most promising direction is to combine knowledge from the message and the web page content in order to maximize spam detection accuracy while keeping a satisfactory filter performance: in cases where the spam message content is enough for a classifier to judge its spamicity, we do not need to pay the cost of crawling and analyzing the web page content. Devising strategies that examine web page content only when the spam message content is not enough seems to be an interesting approach, leading to robust and efficient spam detection.

ACKNOWLEDGMENTS

This work was partially supported by NIC.br, CNPq, CAPES, FAPEMIG, FINEP, by UOL (www.uol.com.br) through its Research Scholarship Program (Proc. Number 20110215235100), by the Brazilian National Institute of Science and Technology for the Web (InWeb, CNPq grants no. 573871/2008-6 and 141322/2009-8), and by Movimento Brasil Competitivo, through its Brazil ICSI Visitor Program.

REFERENCES

[1] D. S. Anderson, C. Fleizach, S. Savage, and G. M. Voelker. Spamscatter: Characterizing Internet scam hosting infrastructure. In Proceedings of the 16th USENIX Security Symposium, pages 135–148, 2007.
[2] I. Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras, and C. D. Spyropoulos. An evaluation of naive Bayesian anti-spam filtering. CoRR, 2000.
[3] B. Biggio, G. Fumera, and F. Roli. Evade hard multiple classifier systems. In O. Okun and G. Valentini, editors, Supervised and Unsupervised Ensemble Methods and Their Applications, volume 245, pages 15–38. Springer Berlin / Heidelberg, 2008.
[4] D. Chinavle, P. Kolari, T. Oates, and T. Finin. Ensembles in adversarial classification for spam. In CIKM '09: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 2015–2018, New York, NY, USA, 2009. ACM.
[5] G. V. Cormack. Email spam filtering: A systematic review. Foundations and Trends in Information Retrieval, 1:335–455, April 2008.
[6] N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma. Adversarial classification. In KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99–108, New York, NY, USA, 2004. ACM.
[7] H. Drucker, D. Wu, and V. N. Vapnik. Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 1999.
[8] C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pages 973–978, 2001.
[9] eSoft. Pharma-fraud continues to dominate spam.
[10] T. Fawcett. "In vivo" spam filtering: A challenge problem for KDD. SIGKDD Explorations Newsletter, 5:140–148, December 2003.
[11] I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 649–656, New York, NY, USA, 2007. ACM.
[12] J. Goodman, G. V. Cormack, and D. Heckerman. Spam and the ongoing battle for the inbox. Communications of the ACM, 50(2):24–33, 2007.
[13] B. Guenter. Spam Archive, 2011.
[14] P. H. C. Guerra, D. Guedes, W. Meira Jr., C. Hoepers, M. H. P. C. Chaves, and K. Steding-Jessen. Exploring the spam arms race to characterize spam evolution. In Proceedings of the 7th Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS), Redmond, WA, 2010.
[15] P. H. C. Guerra, D. Pires, D. Guedes, W. Meira Jr., C. Hoepers, and K. Steding-Jessen. A campaign-based characterization of spamming strategies. In Proceedings of the 5th Conference on Email and Anti-Spam (CEAS), Mountain View, CA, 2008.
[16] M. Illger, J. Straub, W. Gansterer, and C. Proschinger. The economy of spam. Technical report, Faculty of Computer Science, University of Vienna, 2006.
[17] Symantec MessageLabs. MessageLabs Intelligence: 2010 annual security report.
[18] K. Levchenko, A. Pitsillidis, N. Chachra, B. Enright, M. Felegyhazi, C. Grier, T. Halvorson, C. Kanich, C. Kreibich, H. Liu, D. McCoy, N. Weaver, V. Paxson, G. M. Voelker, and S. Savage. Click trajectories: End-to-end analysis of the spam value chain. In Proceedings of the IEEE Symposium on Security and Privacy, 2011.
[19] Libcurl, 2011. http://curl.haxx.se/libcurl/.
[20] Lynx, 2011. http://lynx.browser.org/.
[21] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker. Beyond blacklists: Learning to detect malicious web sites from suspicious URLs. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 1245–1254, New York, NY, USA, 2009. ACM.
[22] A. Ntoulas and M. Manasse. Detecting spam web pages through content analysis. In Proceedings of the World Wide Web Conference, pages 83–92. ACM Press, 2006.
[23] C. Pu, S. Webb, O. Kolesnikov, W. Lee, and R. Lipton. Towards the integration of diverse spam filtering techniques. In Proceedings of the IEEE International Conference on Granular Computing (GrC 2006), Atlanta, GA, pages 17–20, 2006.
[24] C. Pu and S. Webb. Observed trends in spam construction techniques: A case study of spam evolution. In Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS), 2006.
[25] SpamAssassin, 2011.
[26] Y. Sun, A. K. C. Wong, and I. Y. Wang. An overview of associative classifiers, 2006.
[27] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song. Monarch: Providing real-time URL spam filtering as a service. In Proceedings of the IEEE Symposium on Security and Privacy, Los Alamitos, CA, USA, 2011. IEEE Computer Society.
[28] A. Veloso, W. Meira Jr., and M. J. Zaki. Lazy associative classification. In ICDM, pages 645–654. IEEE Computer Society, 2006.
[29] A. Veloso, W. Meira Jr., and M. J. Zaki. Calibrated lazy associative classification. In S. de Amo, editor, Proceedings of the Brazilian Symposium on Databases (SBBD), pages 135–149. SBC, 2008.
[30] A. Veloso and W. Meira Jr. Lazy associative classification for content-based spam detection. In Proceedings of the Fourth Latin American Web Congress, pages 154–161, Washington, DC, USA, 2006. IEEE Computer Society.
[31] S. Webb. Introducing the Webb Spam Corpus: Using email spam to identify web spam automatically. In Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS), Mountain View, CA, 2006.