Hot zone identification: analyzing effects of data sampling on spam clustering
Hot Zone Identification: Analyzing Effects of Data
Sampling On Spam ClusteringRasib Khan
University of Alabama, Birmingham
Mainul Mizan
University of Alabama, Birmingham
Ragib Hasan
University of Alabama, Birmingham
Alan Sprague
University of Alabama, Birmingham
Follow this and additional works at:
Part of the nd the
Recommended CitationKhan, Rasib; Mizan, Mainul; Hasan, Ragib; and Sprague, Alan (2014) "Hot Zone Identification: Analyzing Effects of Data SamplingOn Spam Clustering,"
Journal of Digital Forensics, Security and Law: Vol. 9: No. 1, Article 5.
Available at:
This Article is brought to you for free and open access by ERAU ScholarlyCommons. It has been accepted for inclusion in Journal of DigitalForensics, Security and Law by an authorized administrator of ERAUScholarly Commons. For more information, please contact
Journal of Digital Forensics, Security and Law, Vol. 9(1)
HOT ZONE IDENTIFICATION:
ANALYZING EFFECTS OF
DATA SAMPLING ON SPAM CLUSTERING
Department of Computer and Information Sciences
University of Alabama at Birmingham
115A Campbell Hall, 1300 University Boulevard
Birmingham, Alabama 35294-1170
ABSTRACT
Email is the most common and comparatively the most efficient means of exchanging information in today's world. However, given the widespread use of emails in all sectors, they have been the target of spammers since the beginning. Filtering spam emails has now led to critical actions such as forensic activities based on mining spam email. The data mine for spam emails at the University of Alabama at Birmingham is considered to be one of the most prominent resources for mining and identifying spam sources. It is a widely researched repository used by researchers from different global organizations. The usual process of mining the spam data involves going through every email in the data mine and clustering them based on their different attributes. However, given the size of the data mine, it takes an exceptionally long time to execute the clustering mechanism each time. In this paper, we have illustrated sampling as an efficient tool for data reduction, while preserving the information within the clusters, which would thus allow the spam forensic experts to quickly and effectively identify the ‘hot zone' from the spam campaigns. We have provided detailed comparative analysis of the quality of the clusters after sampling, the overall distribution of clusters on the spam data, and timing measurements for our sampling approach. Additionally, we present different strategies which allowed us to optimize the sampling process using data-preprocessing and using the database engine's computational resources, and thus improving the performance of the clustering process.
Keywords: Clustering, Data mining, Monte-Carlo Sampler, Sampling, Spam, Step Sequence Sampler,
Stepping Random Sampler, Hot Zone
1. INTRODUCTION
Spam emails are mostly generated by malware bots on different computers across the Internet. However,
Advancement of the IT infrastructure significantly
malwares installed by the same spammer exhibit a
affects the way people communicate. Social
specific pattern in the spam emails (Nhung and
interaction and information exchange are highly
Phuong 2007; Ying et al., 2010). The content of the
dependent on emails and other such forms of media.
spam is usually generated using a common template.
At the same time, such medium of communication
Therefore, the identification of the pattern in these
has been the target of misuse since the beginning.
spam emails is significantly important to IT forensic
Thus, the negative motives from spammers have
experts. The identified pattern can then help identify
been a serious issue, which have led to phishing,
a specific spammer and follow through with proper
viruses, malware bots, and other such attacks.
investigations (Dagon et al., 2007; Ono et al., 2007). Mining spam emails helps discover and correlate
Journal of Digital Forensics, Security and Law, Vol. 9(1)
useful patterns. Most of the mining techniques are
measurements for the different operations in the
text-based, given that such spam emails are mostly
algorithm. The paper also includes a different
text-oriented. Once the emails are scrutinized for
approach to optimize the sampling process, utilizing
such patterns, different clustering techniques and
the efficiency of the database engine, which allowed
algorithms can be applied over the email data to
us to enhance the resulting performance of the
group the spams based on some similarity criteria.
The speed of producing faster clusters from large
Contributions: The contributions in this paper are
datasets depends on efficient algorithms. However,
in case of very large datasets, it might be required to reduce the size of the data prior to the clustering
We evaluate the sampling methods on actual
spam emails from the UAB Spam Data Mine. The validation and effectiveness of
In this paper, we focus on the evaluation of
sampling is based on the following: (a)
clustering performed on sampled spam emails. The
quality of the clusters produced, (b) the data
data used is from the Spam Data Mine at the
cover/distribution of spam emails within the
University of Alabama at Birmingham (UAB)
data mine, and (c) the timing performance
(UAB-CIS, 2013). The UAB Spam Data Mine is a
for the clustering operation. All the
large and widely researched repository for spam
sampling models have been validated for
emails, and is used as a helpful resource by
varying sampling rates against the clusters
researchers from different global organizations.
created using the complete data set. Our
Given the huge number of spam emails collected
results show that we are successfully able to
every day, the clustering of the spams take a long
highlight the ‘hot zone' from the spam
time. However, in this work, instead of focusing on
emails with a significant improvement in
algorithms to optimize the clustering process, we
timing performance.
considered sampling the dataset prior to fetching it
We present techniques and strategies for the
to clustering algorithms. Once we are able to prove
most efficient way to implement the
sampling as an efficient and applicable solution for
sampling process and retrieve the huge
data reduction, we believe appropriate clustering
number of spam emails from the data mine,
algorithms can be applied accordingly. We have
which are then used to execute the clustering
adopted the previous work done by Chun Wei et al.,
algorithm. The experimental measurements
to create the clusters based on patterns in the subject
using our optimization strategies illustrate
header of the spam emails (Wei et al., 2009).
that there are further improvements in
In this work, we have utilized four simple methods
performance, compared to naïve SQL query
of sampling that we have applied on the spam data
based retrieval of sampled spam records
from the data mine. As a result, we aim in making
from the UAB Spam Data Mine.
the process of clustering more efficient and less time
The rest of the paper is organized as follows. The
consuming. Furthermore, we provide the results to
motivation for the work is presented in Section 2.
illustrate that the sampled data from the UAB Spam
Section 3 describes the organization of the UAB
Data Mine preserves the information contained for forming clusters and highlight the ‘hot zone'. In this
Spam Data Mine, including the clustering algorithm
context, we refer to ‘hot zone' as the most prominent
from the work of Wei et al. (2009). The different sampling models are described in Section 4. The
clusters with respect to spamming activities. We
results and corresponding analysis are presented in
have presented the results in order to support our
Section 5. Section 6 includes the optimization
claim of using sampled spam data to allow
strategies to improve the efficiency of the sampling
investigators a faster and better opportunity to identify the ‘hot zone' in spam clusters. We
process. Finally the related works and conclusion are presented in Section 7 and Section 8 respectively.
illustrated the resulting clusters from the sampled data, and performed extensive comparative analysis
2. RESEARCH MOTIVATION
with the clusters formed using the whole data set. Our evaluation includes an analysis of the data
The increasing number of Internet users has attracted
distribution on the spam data, and also the time
criminals to the field of online crimes. eCrimes have
Journal of Digital Forensics, Security and Law, Vol. 9(1)
been significantly on the rise since the last few years.
indicate that the network for criminal activities have
This section illustrates the issue of eCrimes on the
outgrown the authorities dealing with eCrimes.
Internet, and the research motivation behind the work on investigating spam clusters, and the
2.2 Spam Investigation
importance of identifying the hot zone.
Spam emails are perceived as being analogous to junk mails. These emails are generally advertising
2.1 eCrimes on the Internet
emails, or with other forms of undesired content.
Information security and economics have become
However, spam emails are not as innocent as junk
interdependent in recent times. Corporations employ
mails. They are sent to a large number of recipients,
information security specialists, as well as
and usually have hidden motives along with the
economists and lawyers to deal with the rising
content of the email. They are considered as the
concern of eCrimes. The network of criminal
primary channel for attackers to deploy Trojans,
activities has become more organized with
worms, viruses, spyware, and botnets on other
structured online black markets, where the criminals
machines across the Internet.
trade insider information. Data and information,
The email body of spams has hidden scripts, cookies,
such as credit card and PIN codes, are sold to online
and other attached content to attract the recipient of
anonymous brokers in these underground eCrime
the email. Once the user opens the email, the scripts
markets. According to Moore et al. (2009), credit
may use the current information from the browser to
card information are sold at advertised prices of
expose the identity of the user to the attacker. This is
$0.40 to $20.00 per card, and bank account
the easiest and a very well-known approach, but still
credentials at $10 to $100 per bank account. Social
the most common scenario where users are victims
security numbers and other personal details are sold
of identity thefts on the Internet. This information
for $1 to $15 per person, while online auction
can be used to remotely access the user's machine
credentials fetches around $1 to $8 per identity.
and install unwanted malwares as botnets. The
Subsequently, the brokers sell the information to
malware can then operate from the infected machine
specific expert hackers, who perform the final act of
using the identity of the user, and send further spam
money laundering.
emails or perform other unwanted tasks.
The information collected in these online criminal
When an attacker sends a spam, he generally uses a
template to generate the content of the email. The
Usually, Internet users are driven to false websites
format of the content is thus prevalent in all the spam
with the help of advertising emails. These bulk
emails those are being sent. However, the spammers
emails are generally classified as spams, which are
replace some words or phrases to introduce variation
sent by spammers, using malicious software running
and hence bypass the spam filters. Thus, it becomes
on infected machines. The infected computers are
a non-trivial task for such filtering services to detect
used by the spammers to record keystrokes and send
all the spam. Data mining from spam emails is useful
further spam emails.
to detect and investigate these patterns. The spam
The monetizing channel for spam emails includes
emails are scrutinized and parsed into different text-
multiple organizations. It is illustrated by Levchenko
based segments. Each email comprises of certain
et al. (2011), the spam value chain has multiple links
attributes, such as the sender email, subject header,
between the money handling authorities and the
and the mail body. These individual attributes can be
investigated to match other spam emails, and thus
approximate consensus, 5% of online devices on the
grouping similar spam emails. Once a pattern is
Internet are susceptible to being infected with
observed, they can be clustered and classified as a
malware. At least 10 million personal computers
specific spam campaign (Caruana and Li 2008;
have been assumed to be infected with malware in
Kyriakopoulou and Kalamboukis 2008; Sasaki and
2008, the number for which should have had
Shinnou 2005; UAB-CIS 2013; Wei et al., 2009;
increased significantly over the last few years
Ying et al., 2010). The individual clusters obtained
(Moore et al., 2009). Thus, these figures easily
from grouping spam emails allow the eCrime investigators to identify a particular spammer. The
Journal of Digital Forensics, Security and Law, Vol. 9(1)
clustered spams are examined to classify the
law enforcement authorities to study the criminals. It
spammer and obtain further track-down information.
is more important to identify the largest clusters
eCrime investigators use these collected data to hunt
rather than obtaining an extensive number of clusters
down online criminals and take appropriate actions
for the huge amount of spam from the data mine. It
against the involved personnel.
might not be the same scenario when it comes to user privacy protection and spam filters on web browsers
The Spam Data Mine at UAB collects approximately
and email clients, where more fine-grained spam
1 million spam emails each day (UAB-CIS, 2013).
filtering is required to protect the users on the
The spam emails can then be used to find the patterns
Internet. Therefore, when it comes to criminal
and perform clustering on the collected data. The
investigations and law enforcement, the prominent
identified clusters are assumed to be individual spam
clusters are the ones of interest, while the smaller
campaigns by an attacker. The extracted patterns
ones can be classified as outliers.
from the spam emails are dependent on the template used by the spammer to generate the spam.
3. CLUSTERING SPAM DATA
However, it should also be noted that an attacker generally uses a given spam template for a few days,
For our work in this paper, we have adopted an
after which he changes the format of the emails. This
existing clustering algorithm proposed by Wei
constant change in the format of the spams makes it
(2010) and Wei et al. (2009). The algorithm has been
difficult to identify a particular attacker. As a result,
executed using data from the UAB Spam Data Mine
spam emails collected over a small duration of time
(UAB-CIS, 2013). In this section, we discuss the
exhibits the specific pattern, after which the
background and the description of the data mine,
extracted cluster information does not apply any
including the clustering technique proposed by Chun
Wei et al. (2009, 2010) on the spam data.
From the above scenario, we have observed the
3.1 Background
following requirements for investigating eCrimes
The initial research issue for knowledge extraction
using spam clusters. First, it is important that the
or data mining is classifying data and creating
identification of the spam campaigns should be done
representations of the feature space. Clustering is
as early as possible. The multitude of financial loss
most commonly used for feature compression and
resulting from eCrimes requires the investigation to
proceed quickly. The sooner a particular spam
Kalamboukis, 2008). Specific features are compared
campaign is taken down, the lesser is the financial
and clustered into groups which represent a
loss. A quick action against a spam campaign would
commonality among all of its data items. The task of
also mean that lesser people will fall as victims to
measuring the similarity of data items can be
the campaign on the Internet. However, given the
performed in different ways. The most common
huge amount of data, it requires a lot of time to
methods for measuring similarity/dissimilarity are
execute the clustering operation. Thus, the inherent
Jaccard and Levenshtein coefficients (Jaccard 1901;
requirement to act quickly against such eCrimes is
Levenshtein 1966). The distances can then be used
not fulfilled with the current approaches for
in other clustering algorithms to create and evaluate
clustering spam emails. Moreover, the quickly
clusters (Caruana and Li 2008; Kanungo et al., 2002;
changing pattern of templates by the spammers
Hartigan and Wong 1979; Wei 2010; Ying et al.,
makes it more difficult to extract the information
2010). The clustering algorithms thus use the
from the spams and act on it accordingly.
similarity or dissimilarity of individual data items
Second, the ‘hot zone' of the spam campaigns are
based on the feature space, and group them into a
the ones about which conclusive remarks can be
common cluster based on preset threshold
made about an attacker. Here, we refer ‘hot zone' as
the group of largest clusters and the most prominent
3.2 The Spam Data Mine
spam campaigns on the Internet. The largest spam clusters imply a large number of similar spam
We utilized the UAB Spam Data Mine (UAB-CIS,
emails. As a result, the larger clusters incorporate
2013) for the purpose of our research evaluation.
more information for the eCrime investigators and
The UAB Spam Data Mine is a research project
Journal of Digital Forensics, Security and Law, Vol. 9(1)
under The Center for Information Assurance and
received from numerous sources and honey-pots,
Joint Forensics Research (CIS-JFR)1. The Center
and collects approximately 1 million spam emails
generates information about currently on-going
campaigns by spammers. It archives spam emails
The collection of spam emails from the sources is
Subsequently, the spam data mine stores the data
collected in a batch-wise operation. General users on
regarding spam emails parsed into different
the Internet, upon receiving a (suspected) spam
attributes. The current database design holds the
email, marks the email as spam, and forwards it to
following attributes for each spam email:
the honey-pot email address for archiving.
message_id,
sender_name,
Additionally, numerous other honey-pots are placed
sender_username,
sender_domain,
sender_ip,
at different points in the network which dedicatedly
receiving_date, time_stamp, word_count.
receive and archive spam emails. The archived spam
3.3 Algorithm for Clustering
emails are collected batch-wise at specific time intervals during the day. Thus, due to the manner
The method employed by Wei et al. (2009) for
these spam emails are stored and collected in the
clustering the spam data is specific to the data from
data mine, the records do not display a shuffled
the UAB Spam Data Mine (UAB-CIS, 2013). In this
organization in their sequence.
section, we present the clustering algorithm designed and implemented by Wei et al. (2009) and
1 The Center (CIS-JFR),
Journal of Digital Forensics, Security and Law, Vol. 9(1)
also included as a part of the work in Wei (2010).
hash of the sender's email. Similar items were
For our purpose, we chose the rather ‘fast-n-dirty'
clustered into a common group. From within the
version of the clustering algorithm by Wei, which is
clusters, some of them are set aside using a bounded
shown in Algorithm 1. The clustering algorithm
threshold, which was set at a minimum of (
mean +
matched spam emails on exact similarity of sender
email addresses. They are matched using the MD5
Figure 1 Sampling Methods: Step Sequence Sampler (SSS), Stepping Random Sampler (SRS),
and Monte Carlo Sampler (MCS)
Next, the process was repeated for the
word_count
initializes using a 48-bit long random seed.
of the email body for all the small clusters, and
Subsequently, it is modified using a linear
further clusters were created. As a result, some of the
congruential formula to generate a stream of pseudo-
clusters had both the
sender_name and the
random numbers (Knuth, 2006). Alternatively,
word_count in the feature space, while some only
Mersenne Twister is another method for polynomial
had the
word_count criteria. Finally, a Levenstein
calculations over two-element fields to generate
index is calculated to create a common pattern for
uniform pseudo-random numbers (Matsumoto and
the
subject header for each of the clusters. The
Nishimura 1998). However, our random generator
output patterns of subject headers for the spam
uses the linear congruential formula due to the
emails are produced in the form
‘ similar
simplicity of the model, and serves the purpose of
word'. Here, the blank spaces are the words which
could be substituted for other words. The blank spaces together with the words ‘
The simple random sampler takes in a range of
similar' and ‘
word'
values within a begin/end index for
message_ids.
define the basic template of the subject headers for
Subsequently, it generates the random indexes
each of the clusters of similar spam emails.
within the given range, according to the desired sampling rate. However, the generated random
4. SPAM DATA SAMPLING
indexes may or may not be evenly distributed across
Sampling is a well-known technique for data
the range of values for the
message_ids.
reduction, given that it preserves the information
4.2 Step Sequence Sampler
from the original data set. In this section, we present our approaches to create the sampled data. We have
The step sequence sampler is another method of
presented four different schemes for creating the
sampling which we utilized for our spam data. As
sampled data, which have been discussed in the
shown in Figure 1a, given the sampling rate
r, we
following sections. For each of the models, we
initially calculated the step frequency
f. The range of
invoke the sampling method with the begin index,
values for the
message_ids is then divided into
f-
end index, and sampling rate parameters.
segments, and the boundary index values are returned as the sampled indexes. As a result, the
4.1 Simple Random Sampler
obtained sampled data is evenly distributed, and
The simple random sampler is implemented using
sequentially selected from the data set.
the Java Random class2. The Java Random class
2 Java Random class,
Journal of Digital Forensics, Security and Law, Vol. 9(1)
4.3 Stepping Random Sampler
to probabilistically generate some random indexes for choosing the sampled
message_ids, as illustrated
The stepping random sampler is an extension of the
in Figure 1c, and presented in Algorithm 2.
step sequence sampler, as shown in Figure 1b. As before, we calculated the step frequency
f for the
In the Monte Carlo sampler, for each index
i, where
given range of
message_ids based on the sampling
i is between begin and end, we ‘roll' between 0 -100.
rate. After that, we utilized the Java Random class to
If the random ‘roll' is less than or equal to the
randomly select an index from within each block.
sampling rate r, we select the specific index
i. Thus,
Thus, the sampled index values for the
message_ids
the sampled indexes are sequentially selected or
are evenly distributed with the frequency
f, and
discarded from within the range of begin and end
randomized within each blocked segment, thus
indexes for
message_ids. However, the number of
ensuring unbiased results.
index values that we receive from the Monte Carlo sampler is not exact, but probabilistically close to
4.4 Monte Carlo Sampler
match the sampling rate
r. The success or fail events
Monte Carlo methods refer to computational
in Monte Carlo models are usually executed for a
algorithms which are based on repeated random
large number of events. Therefore, according to the
sampling to obtain a desired goal. It is a process of
model, the larger the range of
message_ids, the
calculating heuristic probability for a given scenario
closer we get to the desired value for the number of
which is defined by the specific validation of a
sampled items (Hammersley et al., 1965).
success or fail event (Hammersley et al., 1965). In
4.5 Comparison of Sampling Methods
our case, we designed a simple Monte Carlo sampler
Table 1 Comparison of properties for the Random Sampler (RS), Step Sequence Sampler (SSS), Stepping Random
Sampler (SRS), and the Monte Carlo Sampler (MCS)
Number of samples
The properties of the different sampling methods are
Repetition in sampling means the possibility
summarized in Table 1. In this context, we define the
of an index being chosen more than once.
following properties for the different sampling
Data cover represents the feature of the
chosen sampled indexes being evenly distributed over the range of values from the
Randomness in the sampling process
original data set.
implies the probability of a particular index
Number of samples refers to the number of
being chosen in the sample.
indexes chosen, given the total number of
Sequential sampling refers to the criteria of
indexes n, and the sampling rate r.
the chosen indexes being in order once the sampling process has completed.
Journal of Digital Forensics, Security and Law, Vol. 9(1)
As shown in Table 1, the simple random sampler
of the data. The rightmost bar on Figure 2 shows the
provides good randomness, as it depends on a simple
distribution of the clusters which were created from
linear congruential formula to generate the pseudo-
complete data set for the given range of days. It can
random number stream. However, it is not
be seen that the ten largest clusters actually represent
sequential, as the chosen index samples are
almost 25% of the whole data set, with three largest
generated at random, and does not preserve order.
clusters representing approximately 9%, 8%, and 3%
Additionally, the simple random sampler does not
guarantee uniqueness, as the same number can be
Next, we executed the clustering algorithm on
generated more than once. Therefore, the already
sampled data with each of our samplers. The
mentioned properties can be utilized to state that the
sampling was performed at varying rates of 1%, 2%,
simple random sampler does not provide a
3%, 5%, and 8% respectively. For each of the cases,
guaranteed data cover either. The step sequence
we analyzed the clusters created with the sampled
sampler does not provide any randomness and is
data. To visualize the clustering quality with better
purely sequential. However, we are able to ensure no
understanding, we normalized each of the sampled
repetition and full data cover. Using the stepping
clusters using the size of the sample to calculate the
random sampler allows mediocre randomness, but
clustering factor for each. Using a normalized view
contains sequence, ensures uniqueness, and also
for the sampled clusters thus makes it easier to
provides a full data cover. Finally, the Monte Carlo
evaluate the quality of the clustering with respect to
method provides good randomness and ensures
the clusters formed using the full data set. The
sequentiality with no repetition. However, it has a
clustering factor for each of the sampling methods at
probabilistic sample size of approximately (
n*r),
varying sampling rates is illustrated in Figure 2.
where n is the data size and r is the sampling rate. The probability of the sample size will get closer to
From the results, it can be seen that random
(
n*r) with a greater range of values for the indexes.
sampling, step sequence, and stepping random create the clusters with a similar clustering factor as that of
5. RESULTS AND ANALYSIS
the full data set. Thus, the more similar the clustering
In this section, we present the results obtained from
factors and distributions are, the better they can be
claimed to have performed. It should also be noted
previously. The sampled data were mined and used
that all the three sampling methods perform in a
to create clusters, based on the algorithm of Wei et
stable manner with their varying sampling rates.
al. (2010) (Ying et al., 2010). We also provide an
Additionally, we verified that each of the ten largest
analysis of the results and comparison of each of the
clusters from the sampled data actually coincides
sampling methods against clustering performed on
with at least eight of the largest clusters from the full
the full data set. The results presented have been
dataset. However, they might sometimes be slightly
generated using two days' spam data. As mentioned
out of order in the sampled cluster sizes. Moreover,
earlier, the data mine collects a huge number of spam
the top three to five clusters as shown in Figure 2 is
emails, and there were a total of approximately 1.8
always the same clusters in all the cases, which
million spam emails in these two days.
verifies that the sampling effectively allows us to identify the ‘hot zone' of spam campaigns. Table 2
5.1 Clustering Quality
describes the patterns of subject headers for each of
Initially, we performed the clustering on the whole
the top ten clusters created in order of their sizes. It
spam data for a range of two days. With the clusters
can be seen that most of the clusters created from the
formed, we selected the ten largest clusters and
2% step sequence sampling are exactly in the same
analyzed their statistics. We recorded the number of
order if compared to the clusters created using the
data points, pattern of the subject within the cluster,
full data set. However, there are minor interchanges
and the percentage of data that each of the clusters
in the position of the clusters in their ordering.
has with respect to the data size. We refer to
Nonetheless, they are not the top clusters, and are
clustering factor as the value between 0 and 1, which
usually of similar sizes and hence tend to swap
represents the size of the cluster in terms of the size
places with minor changes in the order.
Journal of Digital Forensics, Security and Law, Vol. 9(1)
Table 2 Subject Header Patterns of Ten Largest Clusters Compared using Full Dataset Vs. 2% Sampled Data
Clustering on full data set
Clustering using 2% Step Sequence
Canadian Pharmacy: BUY NOW VIAGRA & CIALIS !
Canadian Pharmacy: BUY NOW VIAGRA & CIALIS !
Corporate eFax message - pages
Corporate eFax message - pages
United Parcel Service notification
United Parcel Service notification
Purchase your Levitra from one of our drugstores today.
Levitra/Viagr/Cialis from $1.25
However, with the Monte Carlo sampler, it can be
Figure 3 illustrates the graph to help visualize the
seen that the sampled data had some skewness
distribution for the complete dataset. The x-axis
towards the clustering data points. This can be
corresponds to the total number of
message_ids for
claimed as both positive and negative. Given that the
the given date. The y-axis specifies the number of
results tend to have a greater clustering factor for the
spam emails in the cluster to which the
larger clusters and represent almost 45% of the
corresponding
message_id belongs to. The colored
sampled data, it can be argued that Monte Carlo
lines are formed by very closely placed data points,
sampling makes it easier to focus on the largest
and each of the colors represents a different cluster.
clusters. However, they tend to distort the actual
We also present the data cover graphs generated
distribution of clusters and misrepresent the
from the clusters created using the four different
clustering factor for each of the clusters compared to
sampling methods, shown in Figures 4, 5, 6, and 7
the full data. An interesting convergence towards the
respectively. The sampled graphs have been
desired clustering factor distribution can be seen as
produced only for a sampling rate of 2%, which is
the sampling rate is increased.
sufficient to prove the effectiveness of sampling. It
Therefore, from the clusters created and the
can be seen that each of the sampling methods have
clustering factors, we are able to infer the effect of
been equally capable to successfully identify the
the different sampling methods. It can be seen that
same top clusters which have been created by the
random, step sequence, and stepping random
complete data set. Additionally, it can be seen that
sampling tends to preserve the distribution of the
most items which belong to the same cluster reside
original data set of spams. Therefore, we can say that
closely in the data set. This observation is useful in
the sampling models for the above three are
asserting the fact that sampling the data which
representative sampling. On the other hand, Monte
preserves the sequentiality is also able to preserve
Carlo seems to perform well in highlighting larger
the representation of the dataset.
clusters and removing noise from smaller clusters.
An interesting observation is the comparison of
Hence, we call it
noise suppressive sampling. Given
tailing or sparse data from Figure 3 compared to any
the context and the requirement, each of the
of the other Figures 4, 5, 6, and 7. All the sampling
sampling methods can be utilized accordingly.
methods have nicely cleaned the scattered data
5.2 Data
Cover
We utilized the clusters created from our
However, the sampled data for step sequence
experiments to analyze the distribution of the data in
sampler and Monte Carlo sampler (Figure 5 and 7)
the spam data mine. We are interested to visualize
still shows some minor traces of the existence of the
how the spam emails have been archived in the data
scattered data in comparison to the original data. In
mine, with respect to the cluster each spam email
all the cases, the leveling clusters at the bottom are
belongs to. In this context, data cover refers to the
cluttered together. However, these are the smaller
distribution of the spam emails in the data set.
Journal of Digital Forensics, Security and Law, Vol. 9(1)
clusters and do not play any interesting role in the
Additionally, such a pattern of data arrival
identification of the ‘hot zone'.
strengthens ours claim of sampling being sufficient and effective to preserve the characteristics of the
Thus, Figures 3, 4, 5, 6, and 7 illustrates the way the
dataset and the largest clusters from the spam emails
data set is organized. This can lead us to generalize
in the data mine.
a pattern of arrivals of spam emails into the archive.
Figure 2 Clustering Factor for Ten Largest
Figure 3 Spam Distribution based on
Clusters for Complete Dataset
Figure 4 Spam Distribution based on Clusters for Figure 5 Spam Distribution based on Clusters for
Simple Random 2% Sampling
Step Sequence 2% Sampling
Figure 6 Spam Distribution based on Clusters for Figure 7 Spam Distribution based on Clusters for
Stepping Random 2% Sampling
Monte Carlo 2% Sampling
Journal of Digital Forensics, Security and Law, Vol. 9(1)
Figure 8 Timing Performance for Application Level
Figure 9 Timing Performance for Database Filtering
using Naive SQL Query
5.3 Timing Performance
processing time required for each of the cases of reduced data size using varying sampling rates. Once
Here, we present the timing performance
the data have been loaded and sampled, the
enhancement from mining and clustering the
clustering algorithm (Wei 2010; Ying et al., 2010)
sampled data compared to using the whole dataset.
creates the clusters based on the given data. It can be
The database was deployed on a x86 64-bit machine,
distinctively seen that the time required for the
using Intel 2.4 Ghz processor, with 6 processing
whole data set is very high, compared to the sampled
cores and 12 GB RAM. Additionally, we executed
data clustering. Additionally, the algorithm adapted
the Java program to perform the clustering on the
from Chun Wei et. al.'s work is the simple and faster
same machine. Hence, all timing measurements have
version, which still is significantly high compared to
been recorded based on the corresponding execution
the measurements obtained for the sampled data.
times. Figure 8 illustrates the timing measurements
The increase in time required with increasing
from the different sampling rates, including the
sampling rate is not exactly linear, but not quadratic
timing for the complete data set.
either. Thus, the reduction in the amount of time to
The mean time required for loading the data from the
perform a whole data set clustering can be reduced
database is 4261 milliseconds, and is depicted by the
by a factor greater than linear if a sampled data set is
lower block in the timing bars in Figure 8. The
loading time of the data is almost constant for all
6. SAMPLING OPTIMIZATION
cases. This is because the query executed on the database from the application requests for the
For further research, we explored some strategies to
complete dataset for the specified day(s). Once the
optimize the process of sampling. In our opinion, the
data is received, the application then performs an
timing performance of sampling can be improved if
application level filtering of the data, by either
we are able to perform the operation on the database
selecting or discarding the item, based on the
engine. The following sections illustrate our process
sampled indexes generated separately. Thus, given
of investigation and the methods we adopted to
that the machine executing the program had
fulfill the requirements.
sufficient main memory, the task of on-memory
6.1 Data Preprocessing
filtering of the data was performed within a very short time.
Given the huge number of spam emails gathered every day, reading the data items from the database
The interesting measurement to be noticed is the
required a significant amount of time. In the
upper segment in Figure 8, which corresponds to the
clustering implementation by Chun Wei et. al. (Wei
Journal of Digital Forensics, Security and Law, Vol. 9(1)
et al., 2009), they performed a read operation on the
queries take an exceptionally long time to load the
whole data for a specific date. As a result, this
sampled data. Thus, as we failed to improve the
incurred to a huge number of read operations on the
performance using the naïve SQL query, we
database server.
investigated further options to optimize the sampling process.
We performed some initial data preprocessing to reduce the number of read operations while
6.3 Cross-Product with Temporary Table
retrieving the data items from the database. We
Next, we considered executing the query in a
created a new table, namely
daily_index, with fields
different fashion. In this approach, similar to the
receiving_date and
message_id. The table was
previous, we performed the sampling selection using
populated using the minimum values for the
the
daily_index table. However, the next operation
message_id for each date from the spam table. With
included creating a temporary table with only the
the
daily_index table created, we can now easily
selected
message_ids. A query was then executed on
retrieve the range of values for
message_id for the
the database to return the cross-product of the
given dates for which we will perform the clustering.
temporary table and the spam table. The execution
For each sampling method, we initially provide the
of cross-product operation is optimized by the
message_id range, get the sampled indexes, and
database itself, and therefore, the database is able to
subsequently, retrieve only the required data items
return the resulting records in split seconds. The
from the database based on the desired sampling rate
timing measurements from using a temporary table
r. As a result of this operation, we are able to save
and cross-product operation are shown in Figure 10.
(
n-(n*r/100)) read operations from the database; where
n is the total number of records for the given
It can be seen that the total time required for the
sampled data is much lesser than the time required for the complete data set. As it was seen previously
6.2 Naïve SQL Query
in Figure 9, the load times for the sampled records
The initial time measurements were taken based on
were significantly high compared to the full data
an application level filtering for the sampling
retrieval. However, in this case, it can be seen from
process. On the contrary, with the data pre-
Figure 10 that the load times for sampled
processing and the
daily_index table created, we
message_ids are around a few hundred milliseconds,
initially generated indexes for the sampled
which are much lesser compared to the full data. The
message_ids. Subsequently, we queried the database
maximum load time was required when we reached
with a long matching clause of the sampled
a sampling rate of 8%, which was still equal to the
message_ids to retrieve the required rows. However,
load time for the whole data set. If we compare our
in this form of queries, we failed to improve the
results from the initial timing measurements
timing requirement. The size of the query was itself
presented in Figure 8, it can be seen that the times
very large, and the database took a very long time to
for sampling rates 1%, 2%, 3%, and 5% are all much
select and load the sampled records. The
lesser in our optimized sampling operation. In the
measurements from the naïve SQL query are
case of 8%, it is still lesser, but maybe comparable
illustrated in Figure 9. It can be seen clearly that even
to the previously recorded measurements.
though the processing time is reduced, the sampling
Journal of Digital Forensics, Security and Law, Vol. 9(1)
Figure 10 Timing Performance for Database Filtering using Temporary Table
Therefore, with the given results, we can argue that
The performance of the clustering process and the
the proposed approach is significantly better than the
quality of the resultant clusters depends on the
original application layer filtering. We have
corresponding clustering algorithms. In this paper,
successfully illustrated that the processing time for
we have successfully illustrated that we are able to
the sampled clustering using a temporary table is
identify the prominent spam clusters from the
much better for reasonable sampling rates.
sampled data, with radical improvements in timing
Additionally, sampled clustering using this strategy
performance for clustering algorithms. There are
reduces a lot of task load on the machine which
multiple clustering algorithms which explore the
executes the clustering algorithm. Even though we
text-based patterns in spam emails (Kyriakopoulou
had both the program and the database on the same
and Kalamboukis 2008; Ramachandran et al., 2007;
machine, it can be surely assumed that the database
Sasaki and Shinnou 2005; Wei 2010; Wei et al.,
server is usually a separate machine with more
2009), including clustering algorithms specifically
processing power. Therefore, the described method
applicable for large datasets (Ganti et al., 1999).
of optimizing the process of sampling takes
Halkidi et al., proposed further techniques, which
advantage of the processing power of the database
can be used to validate the clustering quality (2001).
engine, and keeps the machine running the clustering
Therefore, given that we have proved sampling to be
algorithm much lighter in its operation.
an effective data reduction process, our following research will focus on optimizing the clustering
7. RELATED WORKS
Researchers have been working on interaction with
We have explored different strategies and related
large databases for a long time. Data mining and
works on clustering mechanisms. The oldest
knowledge extraction technologies have been a
centroid based clustering method is the k-means
rather new addition to the list of research works on
algorithm (Hartigan and Wong, 1979). Later, many
large data sets. The clustering algorithm used here
optimized and efficient versions of the k-means
has been the ‘fast-n-dirty' version of Wei's work
algorithm have been proposed (Kanungo et al.,
(Wei 2010; Wei et al., 2009). The focus of this paper
2002). One of the earliest works on modern
was to illustrate the efficiency which can be reached
clustering techniques was proposed by Koontz et al.
prior to the process of clustering, leading to a faster identification of the ‘hot zone'. Therefore, the
(1975). They proposed a branch and bound clustering algorithm based on global combinatorial
algorithm for clustering is separate from the
optimization. DBSCAN is a well-known density-
sampling process. As a result, any underlying
based clustering algorithm. Arlia et al., proposed a
algorithm for the sampling models will provide more
method of parallelizing DBSCAN, which is suitable
efficient results with respect to time and space.
for high-dimensional data, and thus can be useful in
Journal of Digital Forensics, Security and Law, Vol. 9(1)
implementing a suitable clustering algorithm for the
However, given the size of the dataset of the UAB
huge number of spam emails (Arlia and Coppola,
Spam Data Mine (UAB-CIS, 2013), we suggest that
2001). ST-DBSCAN is a different variation of
the purpose of identifying the ‘hot zone' by eCrime
DBSCAN, proposed by Birant et al. (2007), which
investigators and law enforcement authorities is
performs the clustering based on identifying core
better served by avoiding such fine-grained spam
objects, noise objects, and adjacent clusters. Ying et
detection algorithms.
al., has already presented in (Ying et al., 2010) a
8. CONCLUSION
variation of DBSCAN to successfully identify spam clusters. The proposed research aims for faster
Spam campaigns and emails create a lot of hassle in
clustering results from spam emails. Henceforth, it
today's world. A lot of people fall victims to such
can be suitably stated that, given the organization of
scams every day. Most spams are sent using
the spam data mine, we will be able to preserve the
malware bots, which are installed on affected PCs
results from these clustering algorithms, when
and spread around like a virus. The UAB Spam Data
compared to clustering based on sampled data.
Mine collects such spam emails, and provides reports on ongoing spam campaigns. Clustering the
There has been significant research on sampling
spam data to categorize and identify the spammer
methodologies so far. The random sampling with
has been implemented using the full dataset. In this
reservoir, proposed by Vitter (Vitter 1985), uses a
paper, we presented different models for sampling
non-replacing one pass sampler, requires constant
the spam data, to be used as a tool for data reduction.
space, and runs in O(n(1 + log(N/n))) time. These
Subsequently, the sampled data were utilized to
sampling models aim to introduce randomness in the
create the clusters.
sampled items. However, we are interested in identifying the most prominent clusters. The purpose
Our obtained results substantially prove that
is fulfilled using the proposed models and are shown
sampling the data and creating the clusters allow the
to be effective in determining the ‘hot zone'
investigators to interpret the same conclusions, as
appropriately. Nagwani et al. (2010) proposed a
opposed to using the whole data set. As a result, we
weighted matching technique of attributes to
claim that it is much faster and efficient to perform
measure attribute similarity of email content. The
the clusters after sampling the data, and thus identify
weights of the attributes are custom assigned and are
the ‘hot zone' within a significantly shorter period of
then used to create the spam clusters. An algorithm
time. We have provided extensive experimental
for text clustering based on vector space is presented
results using actual spam data and investigated the
by Sasaki et al., in (Sasaki and Shinnou, 2005). The
distribution of spam in the data mine, which
proposed algorithm creates disjoint clusters with the
reinforced our claims of sampling being more
underlying spherical k-means algorithm to obtain
effective given its purpose. Furthermore, we also
centroid vectors of the spam clusters.
presented an optimization strategy which utilizes the computational power of database engines to perform
There are other works related to email filtering
the sampling operation more efficiently, and thus
which can be related to analyzing the content of
promises faster results in terms of the time required.
spam emails. An interesting approach for filtering spam emails based on behavioral blacklisting has
been proposed by Ramachandran et al. (2007). The
This research was supported by a Google Faculty
proposed method overcomes the problem of varying
Research Award, the Office of Naval Research
sender IP addresses by classifying sending patterns
Grant #N000141210217, the Department of
and behaviors of spammers, and subsequently
Homeland Security Grant #FA8750-12-2- 0254, and
enforcing blacklisting decisions. Thomas et al.,
by the National Science Foundation under Grant
presents an interesting approach for spam detection,
#0937060 to the Computing Research Association
which includes real-time web crawling of URLs,
for the CIFellows Project. We would like to thank
based on blacklists and whitelists (Thomas et al.,
Jason Britt and Gary Warner for providing the
2011). All the approaches for clustering spam emails
support for the UAB Spam Data Mine.
are suitable and will have varying results. These algorithms are typically applicable for spam filters,
usually on web browsers and email clients.
Journal of Digital Forensics, Security and Law, Vol. 9(1)
REFERENCES
Algorithm.
IEEE Transactions on Computers,
24(9), 908-915.
1. Arlia, D. & Coppola, M. (2001). Experiments in
13. Kyriakopoulou, A. & Kalamboukis, T. (2008).
parallel clustering with dbscan. Euro-Par 2001
Combining clustering with classification for
Parallel Processing. Lecture Notes in Computer
spam detection in social bookmarking systems.
Science, 2150. Springer Berlin Heidelberg, 326-
Proceedings of European Conference on
Machine Learning and Principles and Practice of
2. Birant, D. & Kut, A. (2007). ST-DBSCAN: An
Knowledge Discovery in Databases Discovery
algorithm for clustering spatial-temporal data.
Challenge, (ECML/PKDD RSDC 2008), 47-54.
Data & Knowledge Engineering,
60(1), 208-
14. Levchenko, K., Pitsillidis, A., Chachra, N.,
Enright, B., Halvorson, T., Kanich, C…Savage,
3. Caruana, G. & Li, M. (2008). A survey of
S. (2011). Click trajectories: End-to-end
emerging approaches to spam filtering.
ACM
analysis of the spam value chain. Proceedings of
Computing Surveys, 44(2), 9:1-9:27.
The IEEE Symposium on Security & Privacy,
4. Dagon, D., Gu, G., Lee, C., & Lee, W. (2007).
A taxonomy of botnet structures. Proceedings of
15. Levenshtein, V. I. (1966). Binary codes capable
the 23rd Annual Computer Security Applications
of correcting deletions, insertions and reversals.
Conference. ACSAC 2007, 325-339.
Soviet Physics Doklady.
10(8 Feb), 707-710.
5. Ganti, V., Ramakrishnan, R., Gehrke, J., &
16. Matsumoto, M. & Nishimura, T. (1998).
Powell, A. (1999). Clustering large datasets in
623-dimensionally
arbitrary metric spaces. Proceedings of the 15th
equidistributed uniform pseudo-random number
International Conference on Data Engineering
generator.
ACM Transactions on Modeling and
IEEE Computer
Computer Simulation (TOMACS) - Special issue
Washington, DC, USA.
on uniform random number generation,
8(1
6. Halkidi, M., Batistakis, Y., & Vazirgiannis, M.
(2001). On clustering validation techniques.
17. Moore, T., Clayton, R., & Anderson, R. (2009).
Journal of Intelligent Information Systems,
17,
The economics of online crime.
The Journal of
December, 2-3, 107-145.
Economic Perspectives,
23(3), 3-20.
7. Hammersley, J. M., Handscomb, D. C., &
18. Nagwani, N. K. & Bhansali, A. (2010). An
Weiss, G. (1965). Monte Carlo methods.
Email Clustering Model Using Weighted
Physics Today,
18, 55.
8. Hartigan, J. A. & Wong, M. A. (1979).
International Journal of Research and Reviews
Algorithm as 136: A k-means clustering
in Computer Science (IJRRCS),
1, 2.
algorithm.
Journal of the Royal Statistical
19. Nhung, N. P. & Phuong, T. M. (2007). An
Society. Series C (Applied Statistics)
28(1), 100-
efficient method for filtering image-based spam
e-mail. Proceedings of The 12th international
9. Jaccard, P. (1901). Distribution de la flore alpine
conference on Computer analysis of images and
dans le bassin des Dranses et dans quelques
patterns, (CAIP'07). Springer-Verlag, Berlin,
régions voisines.
Bulletin de la Société Vaudoise
Heidelberg, 945-953.
des Sciences Naturelles, 37, 241-272.
20. Ono, K., Kawaishi, I., & Kamon, T. (2007).
10. Kanungo, T., Mount, D., Netanyahu, N., Piatko,
Trend of Botnet Activities. Proceedings of the
C., Silverman, R., & Wu, A. (2002). An efficient
41st Annual IEEE International Carnahan
k-means clustering algorithm: analysis and
Conference on Security Technology, (ICCST)
implementation.
IEEE Transactions on Pattern
Analysis and Machine Intelligence,
24(7), 881-
21. Ramachandran, A., Feamster, N., & Vempala, S.
11. Knuth, D. E. (2006). The art of computer
blacklisting. Proceedings of the 14th ACM
programming. 4, fascicle 4, 1. print. Generating
Conference on Computer and Communications
all trees. Addison-Wesley.
Security, (CCS) 2007. ACM, New York, NY,
12. Koontz, W. L. G., Narendra, P. M., & Fukunaga,
K. (1975). A Branch and Bound Clustering
Journal of Digital Forensics, Security and Law, Vol. 9(1)
22. Sasaki, M. & Shinnou, H. (2005). Spam
detection using text clustering. Proceedings of the International Conference on Cyberworlds,
4(4), 319.
23. Thomas, K., Grier, C., Ma , J., Paxson , V., &
Song, D. (2011). Design and evaluation of a real-time url spam filtering service. Proceedings of the 2011 IEEE Symposium on Security and Privacy, (S&P 2011), IEEE, 447-462.
University of Alabama at Birmingham, UAB Spam
25. Vitter, J. S. (1985). Random sampling with a
reservoir.
ACM Transactions on Mathematical Software (TOMS),
11(1 Mar), 37-57.
26. Wei, C. (2010). Clustering Spam Domains and
Hosts: Anti-Spam Forensics with Data Mining. Doctoral thesis, University of Alabama at Birmingham.
27. Wei, C., Sprague, A., & Warner, G. (2009).
Clustering malware-generated spam emails with a novel fuzzy string matching algorithm. Proceedings of the 2009 ACM symposium on Applied Computing, (SAC 2009), ACM, New York, NY, USA, 889-890.
28. Ying, W., Kai, Y., & Zhong, Jian Z. (2010).
Using DBSCAN clustering algorithm in spam identifying. Proceedings of the 2nd International Conference on Education Technology and Computer. (ICETC) 2010,
1, 398-402.
Source: http://commons.erau.edu/cgi/viewcontent.cgi?article=1164&context=jdfsl
Jurnal Natur Indonesia 15(1), Februari 2013: 57–62 Synthesis and antimalarial activity 57 Synthesis and Antimalarial Activity of 2-Phenyl-1,10-Phenanthroline Derivative Compounds Ruslin Hadanu1*), Mustofa2), and Nazudin1) 1)Department of Chemistry, Faculty of Teachership and Educational Science, Pattimura University, Poka, Ambon 97233
A High Throughput Approach for Metabolite Note: 369 Profiling and Characterization Using theLXQ Linear Ion Trap Mass SpectrometerMin He, Alicia Du, Gargi Choudhary, Karen Salomon and Diane Cho; Thermo Electron Corporation, San Jose, CA, USA Key Words Within the drug discovery environment, high sample • LXQ™ throughput that provides comprehensive drug metabolite