DSPAM Frequently Asked Questions
FAQ
General Information
Compiling DSPAM
Using DSPAM
Q. What is DSPAM and why should I use it?
A. DSPAM is an intelligent, adaptive spam filter capable of learning what spam is and isn't based on each user's individual email behavior. It is designed for both system-wide filtering and third-party integration. You should use DSPAM if you are looking for a scalable, fast, and accurate spam filter that is capable of adaptive learning. Although it is a spam filter by design, DSPAM has shown great proficiency in classifying any kind of document into one of two concepts.

Q. Why should I use an open-source solution?
A. If you're a large business or ISP, you may be asking yourself why you would want to use an open-source filtering solution over a commercial solution. Please have a look at this article on why open-source statistical filters give you a competitive advantage over not-so-statistical commercial filters.

Q. What MTAs will DSPAM work with?
A. DSPAM can integrate with any MTA (Mail Transfer Agent) capable of calling dspam either as a local delivery agent or via LMTP. It can also be integrated via procmail scripts and .forwards. For all other MTAs, DSPAM can be integrated in front of the MTA as an "appliance" or by using a POP3/IMAP proxy. I've heard many success stories so far with: Sendmail, Postfix, Exim, Courier, Communigate Pro, and QMail.

Q. How long does it take to start filtering spam?
A. If you set DSPAM up with global/merged group support, your users can experience instant filtering out of the box. From a completely empty corpus, however, we found that DSPAM starts filtering its very first spam after 10-20 spams have been reported into the system. DSPAM generally climbs to around 95-98% accuracy within the first few days to a week or so, depending on how much mail you receive. It catches a majority of spam with only around 100-200 spam reported into the system, learning from the other few thousand it should catch by itself in the meantime.
If you require faster precision results, you might want to start your users off with a seeded dictionary (see the dspam_merge tool), or use a global or merged dictionary for your users to start with. Many advanced features of DSPAM, such as Noise Reduction and inoculation, do not kick in until at least 2,500 innocent messages have been learned. You can use the included dspam_stats tool to get a good idea of how effective your current dictionary is.

Q. How is DSPAM with false positives?
A. In real-world scenarios, false positives have ranged anywhere from 0% (none) to 0.10%, depending on both the implementation and the user's mail behavior. Users with relatively predictable mail behavior (such as geeks, dweebs, and freaks) have generally received very few false positives (less than 1 in 10,000 messages). Most false positives are likely to occur during initial training. A feature referred to as the 'training buffer' can be enabled to further water down filtering during training to help prevent false positives (see the README for more information). Alternatively, pretraining a few hundred nonspam can also help overcome the risk of false positives during initial training. Recent versions of DSPAM are equipped with an automatic whitelisting function, which automatically whitelists senders from whom the user has received a lot of legitimate mail. This too helps prevent false positives during training. Everyone's email is different, however. Your mileage may vary.

Q. What is with the serial number at the bottom of my emails?
A. Since some emails may have to be re-learned as spam or possibly a false positive, the original training data is stored temporarily for relearning. This is usually necessary, as most mail clients rewrite and even completely mangle the message when you forward it. Storing this information server-side ensures that all the data is retrained correctly. Each email processed includes a serial number to identify the signature.
This is frequently referred to as the DSPAM signature and looks like this: !DSPAM:3eb2c721141672274659! DSPAM has a user-level option to embed the signature in the headers instead of the message body; however, this will require the user to forward all spam as attachments (a feature not all clients have). Users may also opt to eliminate the DSPAM signature if they're willing to retrain using the Web UI. Finally, if you're using strictly IMAP or webmail, then you can eliminate the signature entirely and configure DSPAM to retrain using the original message.

Q. How should I train DSPAM?
A. Just allow email to come in, and forward the messages that are spam. If you have both an innocent and a spam corpus, you can use the dspam_corpus tool to feed them into the system. It is NOT a good idea to feed DSPAM a bunch of spam without feeding it a bunch of nonspam, as this could potentially skew the dictionary and lead to false positives immediately (NOT because DSPAM requires a balanced corpus, but as the result of the scoring of tokens that appear only in one corpus). Special safeguards have been put into place to prevent this under normal spammy email load, but force-feeding DSPAM spam is not recommended. The best advice for training a dictionary is to just act on the email you receive after DSPAM is set up. If you have a large user base, you may wish to create a global or merged set of data to provide users with out-of-the-box filtering. See the README for more information about global and merged groups.

Q. How is DSPAM different from SpamAssassin?
A. While both share the common goal of eradicating spam, the two solutions bear very different philosophies.

Cocktail Approach vs. Centralized Adaptive Learning
SpamAssassin is designed with the arsenal (a.k.a. cocktail or toolbox) philosophy and aggregates the results from a myriad of different spam detection tests with the hope that at least some of the components should detect an inbound spam.
These different tests range from heuristic "rules" which identify specific characteristics in spam to blacklists, and finally to limited Bayesian learning. DSPAM's philosophy is based on the belief that machine learning (basic artificial intelligence) can, in and of itself, solve the spam problem without the need for human-maintained rules, inaccurate blacklists, or any hodge-podge of solutions for that matter. DSPAM's one central spam detection function incorporates advanced, concept-based statistical analysis. This has resulted in levels of accuracy up to ten times that of a human, with very few false positives. DSPAM breaks down each email into its colloquial components, analyzes the historical data for each component, and determines the most interesting characteristics to judge an email by. While DSPAM supports many pre-filters, post-filters, and additional layers of analysis, its central function lies solely in adaptive learning and language analysis. This alone has yielded levels of accuracy peaking at 99.991%. We feel that the justification for our philosophy is in the credits. While the SpamAssassin project requires over 100 individuals to maintain, DSPAM manages to deliver significantly higher levels of accuracy with only one primary maintainer and a small pool of patch contributors.

Maintenance Burden
DSPAM's philosophy includes removing unnecessary human maintenance by means of its learning abilities. Users simply need to forward spam they receive into the system and DSPAM will automatically learn. There are no rules to update, no thresholds to set, and very little systems administration after DSPAM's initial integration. Through the use of various forms of community groups, even the burden of training can be significantly reduced. Forwarding spam also gives your users a sense of participation in your anti-spam efforts, reducing the number of phone calls, email, and complaints you may receive.
SpamAssassin, on the other hand, pushes maintenance onto a central systems administrator and prevents the end-user from participating in any capacity that significantly affects their filter training. This leaves many end-users with a feeling of helplessness against false positives and poor filtering results should their mail deviate from what is considered "normal". A systems administrator is required in order to update rulesets, tweak performance, and so on. The idea of removing end-user maintenance may be desirable for some very large implementations, and so DSPAM can also be configured to support this by allowing the systems administrator to train a global database of contextual data. DSPAM, however, doesn't require full-time maintenance.

Behavioral Philosophy
One significant difference between the two tools is their philosophy about behavioral learning. SpamAssassin's primary detection facility has been designed to use a static set of rules to service all users of the system. DSPAM's philosophy is that this presents a significant hindrance to accuracy, because one user's spam is another user's mail. DSPAM is adaptive to the behavior of each individual user on the system so that it can custom-tailor its spam detection to that of each individual. To provide "out-of-the-box" functionality, DSPAM also supports the use of merged groups, which are global databases merged (at run-time) with the user's own training data. This allows new users to receive instant, high-quality spam filtering without losing the ability to train DSPAM for their personal email behavior. Newer versions of SpamAssassin support a form of Bayesian learning, but this doesn't appear to operate on a per-user basis (at least not without extensive configuration). The heuristic training guides also appear to hurt the level of accuracy the Bayesian component could deliver if SpamAssassin were made a pure statistical filter.
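As a rough illustration of the merged-group idea described above, the following Python sketch combines a shared global token table with a user's own counts at lookup time. Every name in it (TokenStats, merged_lookup, spam_ratio, the sample token and counts) is hypothetical and is not DSPAM's actual data structure or API, and the simple additive merge is an assumption made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenStats:
    """Hit counts for one token (hypothetical structure, not DSPAM's)."""
    spam_hits: int = 0
    innocent_hits: int = 0


def merged_lookup(token, user_db, global_db):
    """Return counts for `token`, adding the user's counts to the global ones.

    Assumption: merging is a simple sum done at lookup time, so the global
    table is never modified by an individual user's training.
    """
    u = user_db.get(token, TokenStats())
    g = global_db.get(token, TokenStats())
    return TokenStats(u.spam_hits + g.spam_hits,
                      u.innocent_hits + g.innocent_hits)


def spam_ratio(stats):
    """Fraction of a token's hits that came from spam, or None if unseen."""
    total = stats.spam_hits + stats.innocent_hits
    return stats.spam_hits / total if total else None


# Shared data that gives a brand-new user filtering out of the box.
global_db = {"mortgage": TokenStats(spam_hits=90, innocent_hits=10)}

# A new user has no data yet, so lookups fall through to the global counts.
new_user_db = {}

# A user in the lending business has trained "mortgage" as mostly innocent.
lender_db = {"mortgage": TokenStats(spam_hits=0, innocent_hits=300)}

new_user_ratio = spam_ratio(merged_lookup("mortgage", new_user_db, global_db))
lender_ratio = spam_ratio(merged_lookup("mortgage", lender_db, global_db))
print(new_user_ratio, lender_ratio)
```

In this toy data the same token looks strongly spammy for the new user and mostly innocent for the lender, which is the "one user's spam is another user's mail" point in miniature. A real filter would feed such per-token statistics into its statistical combination step rather than use a raw ratio.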
Technical Philosophy
DSPAM's technical philosophy is that of high accuracy and high performance in an enterprise environment. For this reason, C was chosen as the language for DSPAM. DSPAM has been implemented in systems exceeding 350,000 users and experiences execution times as low as 0.01s (real time) when tuned properly. The average system of around 100,000 users experiences around 0.06s-0.20s processing time. The philosophy behind SpamAssassin requires a focus around development effort. SpamAssassin is written in Perl, which is generally a much easier language in which to code the regular expressions used by SpamAssassin's heuristic rules engine. As a result, even novice developers can quickly code new rules for SpamAssassin. It's unfortunately very slow, however, compared to DSPAM, and as a result, even small implementations have been known to use up all resources on the machine. Since DSPAM doesn't have any heuristic rules, it doesn't require the use of regular expressions (which are always touted as fast in Perl). DSPAM's tokenizing algorithm is, as a result, much faster than SpamAssassin's analysis engine because it does not use regular expressions. Because Perl is an interpreted language, and because of the extensive (and unnecessary, in my opinion) pattern matching it performs, SpamAssassin ends up running much slower and using many more resources than DSPAM, which uses a language compiled and assembled directly into machine code. While we believe the philosophies we've chosen for DSPAM are better suited for the job (as evidenced by DSPAM's long-term accuracy), there are plenty of other ideas out there that produce acceptable results.

Q. What do I do with spam I receive?
A. DSPAM is designed to 'learn' based on the spam (and nonspam) you receive, so whenever you receive a spam, you can forward it to a special email address configured by the administrator and DSPAM will automatically analyze it and learn. Alternatively, users can train through DSPAM's web UI.
This is an excellent way to ensure that DSPAM won't be obsolete a year from now, but will continue to learn the new tricks of spammers.

Q. What is a quarantine box?
A. Each user has a quarantine box which holds messages DSPAM thought were spam. Rather than simply delete these messages, the quarantine box gives the user the ability to identify the occasional false positive and re-learn it as an innocent email. This is a very important step in the learning process that many other tools don't provide. It is understood that no spam filter will be 100% accurate, and therefore it is important to be able to learn from its mistakes. DSPAM's quarantine makes it much easier to manage spam than some other solutions because it sorts the quarantine based on confidence. Therefore, any likely false positives are going to rise to the top of the quarantine for easy review. Users who would prefer to tag spam may specify this in their preferences. Systems administrators looking to integrate DSPAM with fancy IMAP folders can also find support for this.

Q. Does DSPAM support whitelists?
A. DSPAM doesn't have a whitelist manager; rather, whitelisting is an automatic function of DSPAM's Bayesian filtering mechanism. As you receive more emails from your colleagues, their From addresses and other identifying information (such as signatures) are automatically learned by DSPAM to create an internal whitelist. On top of this, DSPAM supports an automatic whitelisting feature which identifies the individuals you converse with the most who have never sent spam.

Q. Does DSPAM filter viruses?
A. DSPAM ignores attachments, javascript, and the like. Therefore, it does no virus filtering of its own; however, the recent SoBig.F virus was caught by several implementations based on the message content, and many similar viruses are caught quite easily based on their message content alone. As of version 3.6, DSPAM can integrate seamlessly with Clam Antivirus for virus filtering.

Q. How is DSPAM different from every other statistical filter?
A. A very valid question...since Bayesian is Bayesian and Chi-Square is Chi-Square, why are there so many darn filters out there, and what makes ours different? DSPAM has a two-fold development focus:
These two primary focuses make DSPAM one of the most scalable AND accurate filters available today.

Q. DSPAM just isn't going to meet my needs, can you recommend some other filters?
A. Absolutely! The following filters are also very good statistical filters written by some very bright individuals I've gotten to know a bit:
A. Yes, and it runs on my Powerbook too. See the mailing list archives.

Q. Does it work with Windows?
A. v3.2 included a Windows build supplement, which included the necessary Visual C++ project files and portage to compile the agent and tools under Windows. Nobody wanted to maintain it, however, so it is no longer included with the distribution. It's probably best to build it under Cygwin using the general distribution (which builds fine).

Q. Are you ripping off some of Bill Yerazunis' Research? (This question comes in reference to my recent additions of Markovian weighting to DSPAM)
A. No, in fact, Bill has given me his full blessing and assistance in implementing Markovian discrimination in DSPAM. I wouldn't have even attempted it without such blessing. Bill and I have collaborated on other research in the past (such as message inoculation), and he co-authored the chapter on Markovian discrimination in my book, "Ending Spam". I, for one, love both knowledge and computer science, and I also thought it would be good to implement these topics myself since they're covered in my book. I've learned a lot about how his algorithms work, and am quite impressed. I still think you should run CRM114 if you are looking for a Markovian classifier, in part because it contains plenty of Bill's optimizations and because it's an awesome tool. But I'm hoping too that Bill's research can benefit people who have a need for it with the tools and interfaces provided by DSPAM, if CRM114 isn't a good fit. I think Bill would agree, which is probably why his filter provides many of the algorithms used in DSPAM (and other filters).

Bill Yerazunis himself adds: "No, it's fine. Algorithms are algorithms, and if I didn't want other people to use them, then I wouldn't have published them, or GPLed them. Jonathan has already 'made his bones' in spam fighting; he can certainly use the theories, the algorithms, and the code. And he's even putting author credit on it. What more could you want?"
Q. My compiler complains about [some library]
A. Some libraries may not be installed in a standard location where the compiler looks (by default). So you will need to do one of:
A. If your platform supports the POSIX interface, you should be able to compile DSPAM with little or no tweaking. If you are interested in contracting a port, please email me.

Q. configure complains about my libdb version
A. There are several different versions of libdb on many systems, and each has a separate db.h header file. Run configure with the option --with-db4-headers=DIR pointing to the directory where configure can find your libdb4 header files. You may also need --with-db4-libraries. If you still get this error, manually check that your version of the includes matches your version of the library.

Q. Building on OSX with MySQL, I get "ld: common symbols not allowed with MH_DYLIB output format with the -multi_module option"
A. This is due to a restriction in OSX allowing only one definition of each symbol within a shared library. The following workaround should fix your problem (adjust the paths to your MySQL client library accordingly)...

    # cd /usr/local/mysql/lib
    # mv libmysqlclient.a libmysqlclient.a.original
    # mkdir /tmp/mysql
    # cd /tmp/mysql
    # ar x /usr/local/mysql/lib/libmysqlclient.a.original
    # ld -r -d my_error.o
    # mv a.out my_error.o
    # ld -r -d charset.o
    # mv a.out charset.o
    # libtool -o /usr/local/mysql/lib/libmysqlclient.a *.o

Q. My dictionaries are getting big (+10MB each)
A. This is a common occurrence while building your dictionaries over the first 15-60 days. About 70% of the tokens in each dictionary are not useful, and will be removed from each user's dictionary at 15, 30, and 60 days depending on just how useless they are. If your driver supports it, be sure you have dspam_purge or the purge.sql scripts configured to run nightly, and if you can't wait that long, here are some things you can do:
A. Depending on what type of storage driver is used and your system's configuration, this may vary greatly. Processing time typically hovers between 0.01s and 0.07s for most messages (peaking at about 0.20s). Messages with large attachments (6-10MB) may take a little longer due to I/O delay (if they are binary attachments, they are ignored). On some slower systems, or using slower storage drivers, processing may take a few seconds. If you need the absolute fastest operation, consider the hash driver or MySQL.

Q. How can I set up SpamAssassin-like "out-of-the-box filtering" for my users?
A. DSPAM supports global classification groups and global merged groups for this purpose. Global classification groups allow DSPAM to provide a filtering "parent" relationship with all new users on the system, until they have built their own useful dictionaries. Merged groups are similar, but also allow the user to train against the global database; their training data is "merged" with the global database in real-time, allowing them to customize DSPAM to their own behavior. In both cases, users who do not wish to ever forward in spam will have the benefit of being protected by the global dictionary. See the README for more information about both types of groups.

Q. I have a huge user base. What are some ways to tune DSPAM?
A. First, very cool on the huge userbase thing. Be sure to shoot me an email and let me know about it. Large userbases sometimes require a few changes in your approach to DSPAM. There are a lot of different ways to tune DSPAM to function well on very large installations, and I'll outline a few here. Feel free to contact the dspam-users list for more ideas.
Q. All my messages are getting delivered to root! or mail! or I'm getting funny messages about users not matching! *gasp* *panic*
A. This is probably because you skimmed over the 'TRUSTED USERS' section of the README file. Make sure you have added your MTA user, your MTA shell user, and your Apache user (for the CGI) to the trusted.users file. Until you add them, DSPAM will not trust them to set the user using --user, and will force the user to match their uid.

Q. What happens if another message comes into my quarantine before I hit 'DELETE ALL' ?
A. The Quarantine CGI has a protection to prevent messages from being accidentally deleted which may have come in while you were viewing your quarantine. The filesize of the mailbox is noted when the user goes to view their quarantine, and 'DELETE ALL' will fail if the mailbox has since changed in size.

Q. What is TWEAK -1?
A. In the CGI, a button labeled "Tweak -1" exists. If you are anal about keeping accurate web stats as I am, you want to make sure that messages you forward in that are NOT spam don't get counted against the web stats. For example, I forward in virus-ridden emails and the occasional completely blank message - neither of which DSPAM is expected to catch. Clicking "Tweak -1" for each of these emails I send in corrects the web stats so as not to count them against DSPAM's accuracy. That's all it is!

Q. I've fed DSPAM thousands of spam, and am only getting marginal accuracy. What's up?
A. Your problem might be that you've fed DSPAM thousands of spam, but have not fed it enough nonspam for it to learn adequately. It's typically a bad practice to feed a statistical filter a grossly unbalanced corpus of mail, and if you're using a version of DSPAM that has a "training buffer" enabled by default, feeding a ton of spam can also cause it to start watering down its results until you feed it more ham.
This watering down gets stronger the higher your spam ratio is, in an attempt to prevent false positives - so the more spam you feed it, the worse your accuracy will get. There are a few things you can do to remedy this:
It's important not to specify corpus training on missed spam, because DSPAM only learns corpus messages, and doesn't relearn them. So you'll end up with 1 spam tick mark and 1 innocent tick mark, instead of the correct result: 1 spam tick mark and 0 innocent tick marks.

Q. Is libdspam Thread-Safe?
A. Versions 3.2 and higher are thread-safe; however, this is largely storage-driver dependent. At the moment, mysql_drv, pgsql_drv, and hash_drv are thread-safe. The BDB and SQLite drivers don't permit concurrent reads/writes, and so they will likely never be thread-safe. Oracle may be a good candidate for multithreading in future versions, but is more complex than the other two SQL-based drivers. To use DSPAM in a multithreaded environment, you'll need to create a separate DSPAM context for each thread and use dspam_attach(). See man libdspam for more information. Alternatively, if you're not concerned about concurrent processing, you should be able to use libdspam with your multithreaded application and with any old storage driver by simply using a mutex to control access to libdspam functions.

Q. If I develop problems with accuracy, how can I decrease false positives?
A. False positives can creep up for a number of reasons. If you receive or have trained on a very large amount of spam, you may experience false positives while training dspam. If you neglect your quarantine and fail to retrain a few false positives, this can snowball into more. Here are some things you can do to fix situations where accuracy degrades:
Hosting provided by Global Net Access
All Website Content © Sensory Networks, Inc. All Rights Reserved.
Reproduction prohibited without permission