|| seamonkeyrodeo ||
| k a r a o k e | m i n d | c o n t r o l |
Tuesday, June 29, 2004
Lunchtime Musings: Ed Felten on Bayesian Filtering
Okay, it's a little sad that I'm sitting here typing between bites of my sandwich (at least it's not a cheese sandwich), but I came across Ed Felten's Victims of Spam Filtering post this morning and wanted to note a couple of things about it. Well, I suppose that I actually want to note one thing: that I entirely disagree with his logic.
While it's best for you to go and read his entire post, I'll copy the first two paragraphs here, since they're the ones that set off alarm bells in my head:
### FELTEN QUOTE BEGINS
Anyway, this reminded me of an interesting problem with Bayesian spam filters: they're trained by the bad guys.
[Background: A Bayesian spam filter uses human advice to learn how to recognize spam. A human classifies messages into spam and non-spam. The Bayesian filter assigns a score to each word, depending on how often that word appears in spam vs. non-spam messages. Newly arrived messages are then classified based on the scores of the words they contain. Words used mostly in spam, such as "Viagra", get negative scores, so messages containing them tend to get classified as spam. Which is good, unless your name is Jose Viagra.]
### FELTEN QUOTE ENDS
Now let's compare that to a snippet from Paul Graham's A Plan for Spam, the document that introduced the word "bayesian" to so many of us:
### GRAHAM QUOTE BEGINS
Because it is measuring probabilities, the Bayesian approach considers all the evidence in the email, both good and bad. Words that occur disproportionately rarely in spam (like "though" or "tonight" or "apparently") contribute as much to decreasing the probability as bad words like "unsubscribe" and "opt-in" do to increasing it. So an otherwise innocent email that happens to include the word "sex" is not going to get tagged as spam.
### GRAHAM QUOTE ENDS
It appears that Felten, like many other people (me, for example), has made the mistake of viewing a Bayesian filter as something like a keyword filter on steroids. A story:
At one point I was using one of the popular open source Bayesian filters. I had it set up so that it wasn't just marking "spam" and "ham," but rather was categorizing all of my mail for me: tech/programming mail into one bucket, personal mail another, mailing lists a third, and so on. This worked well, and led to the brilliant idea of training my system to recognize a special "password" that people could include in their emails if they wanted to end up in my "priority" bucket.
This didn't work well.
Why not? Well, a couple of reasons. The first reason is that I have a bunch of immature programmers as friends, so within fifteen minutes of sending out a note asking for email containing the word "avocado" so that I could train my system, there were no fewer than three scripts written that did nothing but send me email after email, all containing nothing but the word "avocado" repeated over and over. Ha, ha, you bastards.
The second reason that this didn't work is that the whole point of a Bayesian-style filter is that it's looking at texts as a whole, not just individual words. Bayesian filters aren't "trained by the bad guys," they're trained by the bad guys, your mother-in-law, your co-workers, your friends...they're trained by everyone who sends you email. My avocado never did work very well, because a single word ("avocado") was rarely, if ever, enough to change the overall character of a message. If the rest of the message content (good and bad) looked like my other "personal" messages, it would end up in the personal bucket; if it looked like my "social software" messages it ended up in that bucket.
The kind of attack that Ed Felten is imagining would be crippling if Bayesian filtering worked on a sort of "adaptive keyword" basis, picking out the messages with spam words and looking for new spam words to filter...but that's just not the case. Let's take Felten's example of a spammer trying to poison the word "fahrenheit" prior to the release of the Michael Moore film:
You send me 50, 500, or 5000 spam messages containing "fahrenheit," and that word has never before appeared in a message that I received. All of them get marked as spam due to the other spammy message content, which increases the spam potential of "fahrenheit." Then a friend sends me a note with some thoughts on the movie Fahrenheit 9/11 -- will that message go into the spam folder? It could, but that's not really likely. Because it's been in n spam messages and no good ones, "fahrenheit" will have a really high spam potential, but there's other content in the message: if it's an email from a friend about a movie that they just saw, the other message content (the friend's email address, my name, the words used in normal conversation, etc.) probably all has very low spam potential. The odds are that the "ham" potential of the 500 other words that my friend wrote will dramatically outweigh the "spam" potential of a single word and the message will make it into my inbox, which in turn reduces the spam potential of "fahrenheit."
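To put rough numbers on that, here's a minimal sketch of the combining step, using the product formula from Graham's essay. The per-word probabilities are invented for illustration, and unlike Graham's filter (which looks only at the most "interesting" words in a message) this version combines every word it is given, so treat it as a toy rather than as how any particular filter works.

```python
import math

def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1).

    Computes prod(p) / (prod(p) + prod(1 - p)) in log space, so that a
    message with hundreds of tokens does not underflow to zero.
    """
    log_spam = sum(math.log(p) for p in token_probs)
    log_ham = sum(math.log(1.0 - p) for p in token_probs)
    x = log_spam - log_ham
    # Numerically stable logistic function of the log-odds.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# Hypothetical scores: one thoroughly poisoned word...
poisoned = [0.99]
# ...and twenty ordinary words that lean only mildly toward "ham".
ordinary = [0.30] * 20

alone = combined_spam_probability(poisoned)
in_context = combined_spam_probability(poisoned + ordinary)

print(f"poisoned word alone: {alone:.4f}")
print(f"same word plus 20 ordinary words: {in_context:.6f}")
```

Even with only twenty mildly hammy words (never mind 500), the poisoned word gets swamped in this toy model.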
The whole idea of Bayesian filtering is to get away from this "one bad word poisons the message" sort of thinking. So forget about this one and go worry about google bombing or something.
Monday, June 28, 2004
The Technology of Tracking
No, not the Verichip and its "the end times are here" fan club.
Just plain old tracking of who's doing what on the internet. The Christian Science Monitor recently published a non-technical article on the difficulties of accurately tracking how many people visit their Web site. "People" is the operative word here -- the beauty of the Web from a tracking perspective is that you've got a very precise record of how the machines involved are interacting; knowing something about the people attached to those machines is something else entirely.
With didtheyreadit's recent, brief moment of email tracking infamy, a million and one discussions of how one might track RSS feed usage (including FeedBurner's excellent update to their tracking reports), and -- of course -- MarketingSherpa's belated realization that email open and clickthrough reporting may not be all that they're cracked up to be, a couple of things seem to be happening.
- Companies are starting to pay attention to online operations again, and asking the right sort of questions: who is coming to my site/getting my emails/reading my RSS feed? What are those people doing when they access the content I'm putting out there?
- Companies are realizing that this tracking is a lot harder than it seems. While DoubleClick, 24/7, and a host of smaller companies offer tools (some better, some worse) to track and analyze Web traffic and email activity, relatively few organizations have the money to spend on those sorts of tools. Even fewer have any idea of what to do with the data once they have it.
We've mostly moved past using the httpd access_log for purposes that nature never intended, but even when tracking tools are using more user-focused metrics, we often don't know what those metrics are, nor what assumptions they're making. Because there are machines involved in every step of online business, we often opt for the comforting illusion that we therefore have volumes of bulletproof data about users and their actions, when that's just not the case.
Users are (for the moment) not hardwired into their computers, and it's the computers that we have data on, not the users. We can extrapolate from machine to user pretty well, but it's essential that we understand the assumptions that we're making and the attendant limitations.
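A quick illustration of the gap, assuming an access log in Common Log Format. The sample lines and addresses are made up, and the regular expression is a rough sketch rather than a complete parser.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

def summarize(lines):
    """Count requests and distinct client hosts; skip lines that don't parse."""
    hosts = set()
    hits = 0
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        hits += 1
        hosts.add(match.group(1))
    return {"hits": hits, "distinct_hosts": len(hosts)}

# Invented sample data for illustration only.
sample = [
    '10.0.0.5 - - [28/Jun/2004:09:00:01 -0400] "GET / HTTP/1.1" 200 5120',
    '10.0.0.5 - - [28/Jun/2004:09:00:09 -0400] "GET /about HTTP/1.1" 200 2048',
    '10.0.0.9 - - [28/Jun/2004:09:05:42 -0400] "GET / HTTP/1.1" 200 5120',
]

summary = summarize(sample)
print(summary)
```

The log supports "three requests from two addresses" and nothing more. Whether 10.0.0.5 is one person, a proxy with a whole office behind it, or the same reader who shows up later from home as 10.0.0.9 is an assumption layered on top of the data.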
Tuesday, June 22, 2004
Anti-Spam Technical Alliance Recommends Not Doing Stupid Things
It never ceases to amaze me that it is necessary to make public statements like "don't do stupid things," and "don't be an asshole," but time and time again such statements prove to be absolutely necessary.
The Anti-Spam Technical Alliance (ASTA), whose big-ticket participants include Yahoo, Microsoft, Earthlink, and AOL, today published a report containing best practices and technical recommendations (article has links to actual documents) for ISPs, Email Service Providers, and high volume email senders.
First and foremost I have to say that I absolutely agree with their recommendations, but then I'm neither an idiot nor an asshole (I hope). What sort of thing appears on their list of recommendations for high volume email senders?
- Do not harvest e-mail addresses through SMTP or other means (defined as collecting e-mail addresses, usually by automated means) without the owners’ affirmative consent.
- Do not employ any technique to hide or obscure any information that identifies the true origin or the transmission path of bulk e-mail.
It's absolutely incredible to me that in the year 2004, as we are buried beneath ever-growing piles of spam, it is necessary to tell ostensibly legitimate companies that harvesting email is a bad idea from both ethical and business perspectives, or that trying to hide the fact that you're sending email is unacceptable behavior.
I suppose that this is really more of a warning shot: whatever else it may accomplish, it lays the groundwork for the Gang of Four to implement the technical solutions that they see fit while chanting "you can't say we didn't warn you" over and over again.
Honestly, while this will necessarily cause a bunch of problems -- some of them probably big and affecting people who are doing everything right -- it's an action that is overdue. I have to support this, for the same reason that I was overjoyed to see MS' "caller ID for email" merge with SPF -- once the big ISPs agree on the standards that they're going to use, you've got a known quantity. Whether or not you agree with those de facto standards, everyone is clear on what they are, not just muddling through with best guesses and sympathetic magic.
Wednesday, June 16, 2004
FeedBurner Stats Updated: EXCELLENT work!
I use FeedBurner to handle the syndication of this little experiment; that they also report usage statistics was an "oh, that's nice" feature to me, so the fact that their reporting was a tad on the byzantine side wasn't a big deal. Today I went to their site for the first time in a while, though, and saw their updated reporting.
Their reporting is now fucking excellent, and there's just no other way to say it. Nice work, FeedBurner folk!
Tuesday, June 15, 2004
The Batter Coating Rule
It falls a little outside my normal range of topics, but I just received an email about this and must pass the word along...
Apparently the USDA now classifies frozen french fries as "fresh vegetables." I can't help wondering whether this is some sort of warped "ketchup is a vegetable redux" Reagan tribute.
More important, however, is the fact that this update is apparently known as the "Batter Coating Rule." Had anyone asked, I would have predicted that the United States' Batter Coating Rule would be something more along the lines of "things that are coated in batter are gooooooooood." Oh, well. This one's okay, too.
Monday, June 14, 2004
Roll your own Real Simple Shopping feed
About a week ago I noticed Real Simple Shopping, a service that takes the spam risk out of subscribing to product offer email lists by subscribing to those lists for you and passing the content along as a customized RSS feed.
A couple of days later I noticed that dodgeit.com -- a service that offers free, public "maildrops" -- offers the ability to read @dodgeit.com mailboxes via RSS feeds.
Because it was a slow Sunday yesterday, I found myself sitting around and thinking "you know, dodgeit.com would allow me to build a better customized RSS feed right now." I just made up a @dodgeit.com email address, added the RSS feed for the address to FeedDemon, and started subscribing.
Now I've got the offer/event emails from Powell's Books (the best bookstore in the world, bar none), REI (excellent outdoor equipment), and the Self Starter Foundation (good independent records) coming to me in a nice, neat feed...and if Huy Fong Foods offered a mailing list, you know I'd be signed up for that right quick.
Funny thing, really: I genuinely want to hear from all of those places about offers that they might have for me, but I would never have actually signed up for their lists via email. I just have too much unread email for me to voluntarily add to the pile. I'm not sure that I'd even have signed up for RSS feeds from each individual source -- but with the ability to create one completely customized commercial feed of my own? Hell, yes, I'm there!
Thursday, June 10, 2004
Commercial RSS F@#$ing Everywhere
Holy jeez. Whether there's actually any real interest or not on the subscriber side I don't know yet, but it seems that you can't throw a rock these days without hitting somebody who's offering purely commercial RSS feeds.
Now to find out whether any of these guys are actually making money...
Tuesday, June 08, 2004
It's a start...
Real Simple Shopping offers individually customized RSS feeds of advertiser information. Some parts of it still seem a bit rough: you can only subscribe based upon advertiser (even though you can search ads by category), and it's unclear whether they're actually doing any targeting based on the demo information that you provide during registration. Solid start, though, and it's apparently only a few months old, so refinements seem likely.
Update: okay, you just have to scroll a lot to get to the offer by category signup, but it's there...
Monday, June 07, 2004
Too Much Information
For a techie, I'm a bit on the luddite end of the scale: always carry a cellphone, but rarely an email-capable portable device. Check email on an ongoing basis, but rarely accepting IM. And calls/voicemail are virtually never forwarded from one number to another.
Doing a rough evaluation of the (work addresses only) email that I get, I find an average of about 125-150 messages per day sent by actual people. Conservatively (assuming 125 messages and a ten hour work day) that's one email every five minutes. The only saving grace is that a decent percentage of those messages are just "cc" from my perspective -- no immediate response required.
My inbox is a little absurd. God help Lawrence Lessig and his inbox.
Six Degrees of Competition
So anyway, I'm taking half a day to catch up on some of the miscellaneous bits and pieces that intrigue me but aren't necessarily all that important -- a process which naturally starts with rummaging around on the Web to see what new things I should be adding to that list.
Since Jeff Reifman's Seattle Weekly piece has made a second appearance on Slashdot and seems to continue to attract attention, I'll note a couple of related items...
Item 0: What is Mr. Reifman doing that he has to reboot XP every day?
I think that I'm pretty much OS agnostic (I split my workdays between a machine running Windows XP and one running Gentoo linux), but I'm as inclined as any tech snob to engage in a little casual MS bashing -- it's easy, and it's fun! Nevertheless, it's been at least a couple of years since I've had to reboot my Windows workstation daily. In fact, the last time I rebooted either of my current machines was when we were renovating the office and had to cut power for a day.
Item 1: Microsoft's biggest competition, in some sense, is itself.
Maybe not directly, but I think that it's true that MS is in a pretty well unique position, where the company needs to think very carefully about what effect one division's releases may indirectly have upon its other divisions. Does this mean that MS will no longer grow at unprecedented, absurd rates? Yes, that seems likely. Does this mean that MS is effectively dead, and that the coming years will be nothing more than a gradual process of small, agile companies picking the flesh from MS's dead and bloated carcass? No, that doesn't seem real likely. Cash reserves, good business people, reputation, and (yes, it's true) good developers are powerful tools, and MS has all of these things.
The really interesting part of this is that MS is dealing with problems that are unique to MS, and I think that only the psychic or painfully brilliant will be able to predict what might happen. Alas, I don't fall into either of these categories, but this does remind me of a story that I've wanted to note down for a while...
Way back in the year 1999 DCB (Dot Com Boom), the company that I worked for was about to be acquired by one of the giants of the era -- a New Economy juggernaut that had business units that touched pretty much everyone who had ever seen a computer. As the deal rolled along, the members of our technology department (of which I was a part) were presented with JuggerNaut's non-compete agreement and invited to a group meeting to discuss this agreement.
"It says here," began one of our developers, "that if I sign this, I can't work for any company that competes with you for a period of two years after leaving JuggerNaut. Don't you compete with pretty much everybody?"
"Well, yes," said the JuggerNaut representative, "but we don't really enforce this non-compete. We just like to have it signed...just in case, you know."
"Just in case what?" asked the developer, "I'm a web programmer -- that's what I do. 'Just in case' I want to work anywhere in the two years after I leave JuggerNaut? If you don't plan on enforcing it, wouldn't it be simpler for everybody if I just didn't sign it?"
The discussion went downhill from there, and even though the deal eventually died, several significant members of the technology department went elsewhere rather than work for JuggerNaut.
People who could have made significant contributions to JuggerNaut were leery of going there, because JuggerNaut was leery of people learning "too much" about the business. There is a real concern there on the part of JuggerNaut: when you're competing with everyone, how can you ever feel secure about what you're doing and who you have doing it? How do you deal with it when six degrees of competition include pretty much every other company out there, plus the guy sitting three cubicles away from you?
Wish I knew. I'd probably be a lot richer right now.
Sunday, June 06, 2004
weekend followup: standards good. stupidity bad.
I suppose that it's just bitterness after yesterday's shiny new firewall installation/de-installation (mentioned in the previous post), but I feel the need to mention this...
The company that makes the shiny new firewalls that we (hope to one day) use at our colocation facility also made the firewall that we recently installed at our main office. The office firewall is a perfectly good appliance in most respects, but it has one limitation that just boggles the mind.
A few days after installing the office firewall, I started hearing curious intermittent complaints about some Web sites behaving oddly, certain (nonessential third party) applications not working, and the like. After a fair number of hours of review, we found that SSL was the common thread in all cases.
It turns out that the firewall that we installed in the office takes a very strict view of RFC 2246 (The TLS Protocol Version 1.0); if the communication doesn't follow the RFC, it is dropped by the firewall.
That seemed great, at first.
"Excellent default setting!" we said, "it would have been nice to know about it before we installed the device, but nevertheless cool! But since we live in the real world, though, where we have to communicate with people who are using software that may not be strictly RFC compliant, how do we turn this feature off?"
Turns out you do that by moving out into slightly experimental territory...there is no "stock" way to turn this feature off. RFC compliance seemed like a good idea to the designers and engineers, so RFC compliance was dictated. It apparently never occurred to anyone involved that the world might not always comply with the RFC.
If the issue with the shiny new firewalls is anything similar to this, I may have to kick somebody's ass.
I Love Weekends
Yes, I know it's a controversial statement to make, but I love weekends.
Weekends are the time when, after a long, hard week of work, you can go in to your colocation facility with your system engineer, spend two hours pulling out your creaky old firewalls and racking the shiny new firewalls that you finally received from your security management and monitoring company, test your shiny new firewalls, spend an hour on the phone with your security mgt. company discussing the fact that the pre-configuration they did on the shiny new firewalls was definitely supposed to allow DNS queries to resolve, spend four more hours working with said company to troubleshoot shiny new firewalls' dislike of DNS queries, decide to leave the (freezing cold) colocation facility and have dinner, walk (uphill, in the rain) to the nearest restaurant, make mean jokes about the security mgt. company's mothers, get an "it's all good now" phone call from the security mgt. company, feel bad about the mother jokes, walk back to the colo (downhill, rain stopped), discover that all is not, in fact, good and DNS still does not work correctly, spend an hour re-racking a creaky old firewall and testing it, call your car service to get home, discover that the car service is booked up and can't send a car until 1AM (almost a two hour wait), call your CFO (who lives nearby and said to call if he could help) for a ride, get his voicemail, picture CFO pointing and laughing at your name on the caller ID as he sits in a leather armchair smoking a cigar and sipping brandy, curse and shake your fist at the heavens, explain to your system engineer that you haven't really lost your mind, find the number for another car service that says they'll have a cab there in ten minutes, go down to the parking lot and wait for forty minutes until the cab actually shows up, discover that the driver is a little vague on the location of this mysterious city of "Brooklyn" that you speak of, get an interesting driving tour of some of the more obscure parts of Weehawken and Hoboken, help the cab driver locate the Holland Tunnel, help the cab driver locate the Manhattan Bridge, help the cab driver locate your apartment, decide against asking whether the cab driver can find his way back to the Manhattan Bridge, and collapse on the floor of your apartment.
Mmmmmmm...weekends...
Friday, June 04, 2004
MarketingSherpa discusses nine things that aren't specific to Gmail
MarketingSherpa.com yesterday published an article entitled Special Report on Google Gmail: Six Concerns & Three Solutions for Emailers. It's an interesting piece, and does bring up some issues that anyone doing email marketing should consider; it doesn't, however, present much of anything that's actually specific to Gmail.
Let's start with the potential problems that they note...
Their first concern is the "related links" to Google News that appear below the AdSense ads:
More hotlinks equals more distractions from your message
Yes, more links means more distractions, but Gmail's interface is actually much cleaner and less distracting than many other Web based email readers. When viewing an email sent by a major computer manufacturer, my Gmail account shows three "sponsored links" and four "related pages" (and many of the messages I tested generated fewer links). Hotmail/MSN splashes up two big graphical banner ads when you read any message, and Yahoo tips the scales at five graphical ads (one banner, four small logo/text ads).
Possibly more significant, though, is the fact that when you set aside third-party ads, Gmail only has one link on its message page that isn't dedicated to Gmail functionality or information: a single link to the Google home page. Hotmail/MSN has about half a dozen such non-required links, encouraging you to sign up for or use other Microsoft services from MSN Shopping to their free newsletters, and Yahoo more than thirty (though in their defense, most of their links appear in the page footer and might not always be obvious to the user).
The second issue is also "related links," but more hypothetical at the moment:
Danger: $30 press releases can show up in new "related Pages" section
The article notes, however, that "in our beta of Gmail, AdSense and 'related pages' links didn't appear follow a regular pattern. Some emails showed up without ads or links, including third-party ad emails that should have been ad magnets, such as diet aids and financial investments." So while press releases could, in theory, be used as a sort of Gmail-specific contextually targeted advertising, the fact that the selection algorithm for these links is unknown, unpredictable, and subject to change at any time without notice makes it seem unlikely that this could be reliable enough (or profitable enough) to become a widespread practice.
Now we move on to the more technical issues raised in the article...
#1. Gmail blocks all HTML on download.
For now, say good-bye to the little 1x1-pixel image that tracks whether recipients opened your email [...].
Well, you really should have started saying goodbye to that little image some time ago. I hate to be the bearer of bad news, but the "open rate" reports generated from those images were never much more than a back of the envelope estimate of the number of opens, and Gmail is just following the larger trend in email readers in disabling image loading by default. Open rate will hang around for a while, but unless a new technological approach to open tracking comes along, its days are numbered.
As long as we're on the subject, I might as well mention that I've never really understood the value of open rate tracking, anyway. The most cynical part of me remembers that open rate became a popular metric at about the same time that response rates for many email lists were tanking, and views it as a sad effort to substitute a warm and fuzzy metric-lite for the sometimes ugly but clearer clickthrough and acquisition metrics.
#2. Gmail may someday block click tracking.
Gmail may do a lot of things, and this doesn't even make the list of possibilities that worry me. They would simply be eliminating one technological approach to click tracking. It's inconvenient to have to change how you track, but redirect tracking is generally more reliable than referrer logs anyway, so it's worth doing.
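For anyone who hasn't looked under the hood, redirect tracking boils down to something like the sketch below. The tracker hostname, the parameter names (c, r, u), and the in-memory click log are all hypothetical stand-ins; a real system would sit behind a Web server, answer with an HTTP redirect, and refuse to redirect to destinations it didn't issue.

```python
from urllib.parse import urlencode, urlparse, parse_qs

TRACKER = "https://track.example.com/r"  # hypothetical redirect endpoint

def tracked_link(destination, campaign, recipient_id):
    """Wrap a destination URL so that the click passes through the tracker."""
    query = urlencode({"c": campaign, "r": recipient_id, "u": destination})
    return f"{TRACKER}?{query}"

def handle_click(url, click_log):
    """Record who clicked which campaign, then return the URL to redirect to."""
    params = parse_qs(urlparse(url).query)
    click_log.append((params["c"][0], params["r"][0]))
    # A real server would send a 302 response pointing at this URL.
    return params["u"][0]

clicks = []
link = tracked_link("https://www.example.com/offer?id=7", "june-sale", "12345")
destination = handle_click(link, clicks)

print(link)
print(destination, clicks)
```

The click gets recorded on your own server at the moment it happens, which is why it doesn't depend on the recipient's browser or mail reader passing along a referrer.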
#3. Gmail messes up HTML email forwarding.
I'm a bit biased on this one, not being a big fan of HTML email in any case, but it still seems more significant than the other items. Of course you should always have a compelling plain text version of any email you send, and the creative is only a part of what makes a successful email campaign (list quality, targeting, offer quality, creative), but it's unfortunate to have an appealing HTML creative that your recipients can't share with others. As with the other concerns, there are ways to work around this, but if Gmail becomes popular it will force advertisers to re-evaluate their approach to email creative -- maybe drastically. (I've also seen some really remarkable response rates with plain text only creatives recently, but that's a discussion in and of itself.)
#4. Gmail "disappears" much bulk email in the spam folder.
Like problem #1, this is just a continuation of what's happening everywhere else; Gmail may accelerate the process, but it's an issue that marketers have to deal with sooner rather than later, with or without Gmail. If you're not (at minimum) checking your outgoing messages against common filtering tools like SpamAssassin, or better yet analyzing the final allocation of your messages using your own test accounts at the big ISPs -- or one of the commercial providers that offers deliverability analysis -- you just don't really know what's happening to the messages that you send.
Because this post has gotten horrendously long, and because I actually more or less agree with MarketingSherpa's three recommendations, I'll deal with them all at once...
#1. Create a good-looking text version of your email.
#2. Experiment with tweaks to both your HTML and text mailings [...]
#3. Chart subscriptions by domain
Again, with or without Gmail in the picture, you should be doing all three of these things. A surprising number of people (like me, for example) still check email using programs or devices that read text, not HTML. Without a good plain text creative you're automatically dumping a section of your audience. When you're developing your creative, don't assume that everyone is using Outlook: Hotmail/MSN, Yahoo, Outlook, AOL, Eudora, and Lotus Notes may all display your message differently (to say nothing of the dozens of other email readers and sites). Check to see how your message appears in a variety of readers, and adjust it accordingly. And finally, if you don't know what the domain breakdown of your housefile looks like, you should. That, too, is a topic in and of itself, though -- if there's interest I'll post on that another day.
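As a starting point on that last item, a domain breakdown can be as simple as the sketch below. The addresses are invented, and the parsing is deliberately naive: it splits on the last "@" and skips anything that doesn't look like an address.

```python
from collections import Counter

def domain_breakdown(addresses):
    """Count subscribers per domain, ignoring case and malformed entries."""
    domains = Counter()
    for address in addresses:
        local, sep, domain = address.strip().rpartition("@")
        if not sep or not local or not domain:
            continue  # no "@", or nothing on one side of it
        domains[domain.lower()] += 1
    return domains

# Invented housefile for illustration only.
housefile = [
    "a@aol.com",
    "b@AOL.com",
    "c@yahoo.com",
    "not-an-address",
    "d@hotmail.com ",
]

breakdown = domain_breakdown(housefile)
total = sum(breakdown.values())
for domain, count in breakdown.most_common():
    print(f"{domain:15} {count:3}  {100 * count / total:.0f}%")
```

Run against a real list, the top few rows tell you which mail readers and which ISPs' filters deserve most of your testing time.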