The goal of this project is to investigate the linguistic features of junk emails and, possibly, to design a junk email filter based on linguistic information rather than on a "bag-of-words" approach.
- The corpus of junk emails can be downloaded from here. This corpus consists of 1563 messages received by us in the last few years, but they are not necessarily unique messages.
- Since we are interested in the linguistic features of junk emails, we thought it better to eliminate duplicates. The corpus without duplicates can be downloaded from here. Duplicates were eliminated automatically, and the procedure treated as duplicates not only messages that match exactly but also messages that differ only in small formatting details. More details about the method will be available in our forthcoming paper at LREC 2002: "A corpus-based investigation of junk emails".
- A frequency list generated from the corpus without duplicates can be downloaded, as well as a lemmatised list.
- C. Orasan and R. Krishnamurthy (2002) "A corpus-based investigation of junk emails", in Proceedings of the Language Resources and Evaluation Conference (LREC-2002), Las Palmas, Spain (pdf)
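
The duplicate elimination and the frequency list mentioned above can be illustrated with a minimal sketch. It is not the method used to build the corpus, which is described in the paper. It assumes that "small formatting differences" means differences in whitespace and letter case, and the function names `normalise`, `remove_duplicates` and `frequency_list` are hypothetical.

```python
import hashlib
import re
from collections import Counter


def normalise(message: str) -> str:
    """Collapse runs of whitespace and lowercase the text.

    Assumption: formatting differences are limited to whitespace and case.
    """
    return re.sub(r"\s+", " ", message).strip().lower()


def remove_duplicates(messages):
    """Keep the first message seen for each distinct normalised form."""
    seen = set()
    unique = []
    for message in messages:
        # Hash the normalised text so the set stores short keys, not full messages.
        key = hashlib.sha1(normalise(message).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(message)
    return unique


def frequency_list(messages):
    """Count lowercase word tokens (letters and apostrophes) across all messages."""
    counts = Counter()
    for message in messages:
        counts.update(re.findall(r"[a-z']+", message.lower()))
    return counts


if __name__ == "__main__":
    sample = [
        "Buy now!\n  Limited offer",
        "buy now! limited   offer",
        "Free money",
    ]
    unique = remove_duplicates(sample)
    for word, count in frequency_list(unique).most_common():
        print(word, count)
```

A lemmatised list would additionally map each token to its lemma before counting, which requires a lemmatiser and is left out of this sketch.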