The purpose of this project is to give you experience with clustering and with the tool Hierarchical Clustering Explorer.
The dataset you are to use is a real-life dataset generated from VoIP data. The data was provided by Cisco Systems and represents logged VoIP CDR (call detail record) traffic at their Richardson, Texas facility from Mon Sep 22 12:17:32 2003 to Mon Nov 17 11:29:11 2003. Over 1.5 million call trials were logged. The logs contain 66 attributes, including source, destination, starting time, duration, routing/switching, device, etc. Preprocessing of the data has already been performed as follows:
All attributes other than source, destination, starting time, and duration were removed from the logs.
Connected calls were counted and unconnected calls were discarded. The total number of connected calls was 272,646.
In addition, the data has been classified into 25 classes based on whether a phone number was internal, local, national, international, or unknown; therefore we have up to 25 directed link classes (source class/destination class combinations) in the network, e.g. internal to internal, internal to local, etc. The data is aggregated into 15-minute time intervals. The total number of time points is 5422 and the total number of attributes is 26.
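The aggregation step above can be sketched in a few lines. The sample records, field names, and bucketing helper below are hypothetical stand-ins (the real preprocessing was done for you); the sketch only illustrates how connected calls are counted per class pair within each 15-minute interval.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample of connected-call records: (start time, source class, dest class).
calls = [
    ("2003-09-22 12:18:05", "internal", "internal"),
    ("2003-09-22 12:25:40", "internal", "local"),
    ("2003-09-22 12:33:10", "internal", "internal"),
]

def bucket_15min(ts: str) -> str:
    """Floor a timestamp to the start of its 15-minute interval."""
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return t.replace(minute=t.minute - t.minute % 15, second=0).isoformat()

# counts[interval][(src, dst)] = number of connected calls in that interval
counts = defaultdict(lambda: defaultdict(int))
for ts, src, dst in calls:
    counts[bucket_15min(ts)][(src, dst)] += 1
```

Each resulting interval row (one count per directed class pair, plus the time point) corresponds to one row of the 5422 x 26 matrix you will cluster.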
Requirements: CSE 5331: You are to use HCE to compare the performance of four different hierarchical clustering techniques. This means that you will be required to run four different HCE experiments. In addition, k-means clustering runs are required.
CSE 7331: You are to use HCE to compare the performance of six different hierarchical clustering techniques. This means that you will be required to run nine different HCE experiments (one per linkage/distance combination). In addition, k-means clustering runs are required.
(30 pts) Provide as part of your project the output of each run and its dendrogram.
5331 students: Cluster the VoIP data using the Single Link and Complete Link methods, each with both Euclidean distance and Manhattan distance. This will require four different HCE runs (one per linkage/distance combination).
7331 students: Cluster the VoIP data using the Single Link, Complete Link, and Average Link methods, each with Euclidean distance, Manhattan distance, and Pearson's correlation coefficient. This will require nine different HCE runs (one per linkage/distance combination).
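HCE performs these runs interactively, but the same linkage/distance combinations can be sketched programmatically, which may help you sanity-check what each run computes. The sketch below uses SciPy (an assumption; SciPy is not part of the assignment) on a tiny toy matrix standing in for the 5422 x 25 call counts. In SciPy's terms, Manhattan distance is `cityblock`, and 7331 students' Pearson's CC corresponds to `metric="correlation"`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Toy stand-in for the call-count matrix: two well-separated groups of rows.
X = np.vstack([rng.normal(0, 0.1, (5, 3)), rng.normal(5, 0.1, (5, 3))])

results = {}
for metric in ("euclidean", "cityblock"):          # cityblock = Manhattan
    for method in ("single", "complete", "average"):
        Z = linkage(pdist(X, metric=metric), method=method)   # dendrogram data
        results[(metric, method)] = fcluster(Z, t=2, criterion="maxclust")
```

`Z` encodes the dendrogram that HCE draws for each run; `fcluster` cuts it into a fixed number of flat clusters for comparison.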
(10 pts) Compare the performance of these four (nine) runs. You must identify at least three different metrics to use for these comparisons. The choice is up to you, but you should check out the Evaluation tool provided in HCE. Write a 1-2 page comparison of the results.
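One metric you might consider (among the at-least-three you must choose) is the silhouette coefficient, which scores how well each point fits its assigned cluster versus the nearest other cluster. This sketch assumes scikit-learn and SciPy are available and uses toy two-group data; it is one illustrative option, not a required approach.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
# Toy data with two clear groups, standing in for the call-count matrix.
X = np.vstack([rng.normal(0, 0.2, (20, 4)), rng.normal(4, 0.2, (20, 4))])

scores = {}
for method in ("single", "complete"):
    labels = fcluster(linkage(X, method=method), t=2, criterion="maxclust")
    scores[method] = silhouette_score(X, labels)   # in [-1, 1]; higher is better
```

Comparing such scores across your runs gives one quantitative axis for the 1-2 page writeup.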
(10 pts) Rerun all of your experiments by filtering out much of the uninteresting data where call counts are close to zero. Again submit with your project the output and dendrogram of these runs. It is up to you to determine how to do this filtering. HCE provides several ways to do this. Be sure to read the HCE User Manual.
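How you filter is up to you (HCE's own filtering tools are the intended route), but one possible approach is a row-total threshold: drop time intervals whose total call count is near zero. The threshold and toy data below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.poisson(0.05, (100, 25)).astype(float)   # mostly near-zero toy counts
X[:10] += rng.poisson(20, (10, 25))              # a few busy intervals

threshold = 5                                    # arbitrary cut-off (an assumption)
busy = X.sum(axis=1) >= threshold                # keep rows with enough total calls
X_filtered = X[busy]
```

Whatever rule you pick, state it explicitly in your writeup so the filtered and unfiltered runs can be compared fairly.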
(10 pts) Again, briefly compare the performance of your four (nine) filtered-data experiments using the same metrics as before. Then compare the results of these experiments to those from your earlier, unfiltered runs. What is different? Why? Submit a 2-3 page analysis.
(10 pts) Your last set of experiments should be performed on the input data. Here, run the k-means clustering algorithm with cluster counts of 2, 3, and 4, so you will end up with three different experiments. Submit for grading the output of the experiments.
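The three k-means runs can be sketched as below, here with scikit-learn (an assumption; any k-means implementation works) on toy data standing in for the call-count matrix. Note that unlike the hierarchical runs, k-means requires the cluster count up front and depends on its random initialization.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Toy stand-in for the call-count matrix: two tight groups of rows.
X = np.vstack([rng.normal(0, 0.2, (20, 4)), rng.normal(4, 0.2, (20, 4))])

results = {}
for k in (2, 3, 4):
    # n_init restarts and a fixed random_state make the runs reproducible.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = km.labels_
```

Keeping the labels from each k lets you apply the same comparison metrics you used for the hierarchical runs.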
(20 pts) Write up a comparison of the k-means runs and the best of your hierarchical clustering runs. You need to identify the experiment from before that you think is the best. Justify this choice. Then compare this experiment to the three you have just completed with k-means. You must choose two metrics for this comparison, describe the metric, and then make the comparison. Please summarize your analysis with some conclusions. Writeup should be 2-3 pages in length.
(10 pts) What have you learned from this project? Please briefly identify at least three things you have learned about data mining from this project.
NOTE: You may choose to run a similar set of experiments on data of your choice. HOWEVER, this must be approved by Dr. Dunham in advance.