Saturday, November 24, 2007

Online Behavioral Profiling

In preparation for an appearance on KFNX (Phoenix) radio's Tech Talk show with Tom D'Auria, I did my homework and researched the topic. Here are my notes. I am sure there is more here than we will be able to cover in the time available on the program.

1 - What is behavioral profiling?

Behavioral profiling is the practice of drawing conclusions about or categorizing someone based on a limited set of behaviors. Behavioral targeting is the related practice of categorizing people or segmenting a market to provide them with advertising and sales messages designed to appeal to or work with the group they belong to.

For example, if a policeman, or anyone for that matter, sees a car swerving down the road, they may think that the driver is drunk. We speak of someone who is shifty-eyed as being not trustworthy. In the old cowboy movies the bad guys wore black hats. These are all examples of behavioral profiling.

The bride and baby magazines somehow know when weddings and birthdays are expected in everybody's family, and they send targeted promotional materials to the engaged women and the expectant mothers. These are examples of behavioral targeting.

2 - What's wrong with that?

The Federal Trade Commission held a Town Hall meeting on November 1 and 2 in Washington, DC, on behavioral profiling and the Internet. There were a lot of interesting presentations, and I encourage anyone who is concerned about this issue to go to the FTC web site and watch the recordings of the meeting.

One of the big problems is that people do not understand what behavioral profiling and behavioral targeting are. Therefore people may consent to having companies, governments and other organizations collect information about them and track their activities on the Internet, but it is not informed consent.

It turns out that nobody reads privacy policies and End-User License Agreements, except lawyers when they are being paid to read them. Everybody else just clicks through them on the web to get to the content they want.

The problem boils down to this. Organizations online and in the real world are covertly spying on you.

3 - Why would organizations be spying on you? What do they hope to learn?

They are doing it, they say, in order to do a better job serving you. Other motivations include making money, avoiding losses, beating the competition, and making more money.

Governments are interested in identifying criminal behavior and preventing terrorist plots.

The good news in this is that there is not a lot to learn from spying on most of us. The state of the art is such that a) an enormous amount of data is being generated constantly and b) the data has serious limitations. The volume of data involved limits what can be done with it. You cannot drink from a fire hose. And as all of us who have designed and implemented systems know, garbage in, garbage out.

4 - What kind of data are we talking about? What is being collected?

This gets into the nuts and bolts. Some of the data is very good. Amazon.com has a good handle on you and what you do on their web site. You have a username and password that you use to sign in to make a purchase. And they use "first-party cookies" to capture information about what you are searching for and how you navigate around their website. This lets Amazon, and sites like MySpace, present pretty well targeted recommendations and information to you.

But when you go from site to site, there is currently no reliable mechanism for tracking you. What happens is that you collect third-party cookies from the various advertising networks that provide ads to most major web sites. DoubleClick and other such ad networks pay content publishers to carry the ads. The third-party cookies tell DoubleClick where else on its ad network your browser has visited, and that helps it know what ad to serve you. Unlike Amazon, ad networks do not know who you are, where you live, what you buy and other valuable information. If you use a different computer, DoubleClick doesn't know it's you and not somebody else. When you erase your cookies, you become once again a blank slate to DoubleClick.
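
To make the mechanics concrete, here is a toy simulation of how a third-party cookie lets an ad network link visits across sites, and why erasing the cookie breaks the link. It is only a sketch: the AdNetwork and Browser classes and the site names are invented for illustration and do not describe any real ad network's implementation.

```python
import itertools


class AdNetwork:
    """Toy ad network: remembers which sites each cookie ID has visited."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.history = {}  # cookie ID -> list of sites where an ad was served

    def serve_ad(self, cookie_id, site):
        # No cookie yet: this browser looks brand new, so issue a fresh ID.
        if cookie_id is None:
            cookie_id = next(self._ids)
        self.history.setdefault(cookie_id, []).append(site)
        return cookie_id


class Browser:
    """Toy browser holding at most one cookie for the ad network."""

    def __init__(self):
        self.cookie = None

    def visit(self, site, network):
        # The page embeds an ad, so the browser sends its cookie to the network.
        self.cookie = network.serve_ad(self.cookie, site)

    def clear_cookies(self):
        self.cookie = None


network = AdNetwork()
browser = Browser()
browser.visit("news.example", network)
browser.visit("sports.example", network)
first_id = browser.cookie

# Erasing cookies makes the same browser look like a stranger.
browser.clear_cookies()
browser.visit("recipes.example", network)
second_id = browser.cookie
```

After this runs, the network holds two unrelated histories, one per cookie ID, and has no way to tell that they belong to the same person. Nothing in either history contains a name or an address.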

Google is in the process of buying DoubleClick. Google has been coining money by putting targeted ads all over the internet. So far, Google has targeted ads based on the content of the pages showing the ads, not based on the behavior of web surfers. Given the size and strength of Google, privacy experts see the acquisition as a worrisome development, and the FTC is reviewing the proposed transaction.

5 - What about Internet Service Providers?

Internet Service Providers are in a position to collect data about everything that each of their subscribers does online, and marry that with names, addresses, credit cards, etc. Your ISP knows all and sees all. You may not want them sharing that information with advertisers and the law.

Yahoo! has been strongly criticized because it complied with a lawful request by the Chinese government for emails written by a dissident in China on his Yahoo! email account. The government jailed the dissident based on emails that Yahoo! turned over.

Interestingly enough, while AOL no longer regards itself as an ISP, for years it was in the unique position of knowing almost everything about almost everyone online. They used that information to sell ads and target ads to their subscribers. Arguably, that model failed to sustain AOL.

But now there are new companies springing up like Adzilla that are setting up alliances with ISPs to get access to all the information about each of their subscribers. Adzilla got $10m in venture capital this past August, and their web site says they currently have alliances with 8 ISPs.

How this kind of activity does not run afoul of laws against wiretapping is still an open question. Their position is that it's not wiretapping if no human beings are involved, that is, if there are only machines listening and serving ads based on pre-programmed heuristics. By that argument, it's no different from a spam filter or anti-virus program that scans everything coming and going.

6 - Who are the bad guys in the behavioral profiling space and what are they doing?

It is not easy to say exactly who they are. There are many layers involved. Content and advertising on a given website may come from many different places, and the advertising especially is likely to come from somewhere else. There are a lot of intermediaries that buy, sell, aggregate, serve and track online ads. They may be doing behavioral profiling, even if the site displaying the ads does not.

That makes it very difficult to identify who's responsible when something bad happens. But we can say what they are doing or not doing as the case may be. We can profile the bad guys.

There are more than a few bad apples among advertisers. Scammers and people selling get-rich-quick schemes, quick weight-loss programs, instant credit, and so forth have found the online world to be a fertile place to practice their trade. If something sounds too good to be true, it probably is. That goes double online.

Online fraud can happen when a product or service you buy does not do what it was advertised to do.

Hackers can embed malicious code in advertisements and on sites that ads might take you to. They could steal your usernames and passwords, credit card numbers and bank information if you are not careful. They could erase your hard drive.

And these bad guys can be almost anywhere in the world, beyond the reach of authorities in this country.

Intermediaries go bad when they pay lip service to privacy and security but then fail to live up to their own policies and market expectations. There is an interesting case from several years ago in which a firm called Gator had a form-filling browser plug-in that people downloaded under false pretenses. The application was sending transaction data back to Gator for profiling purposes. Spyware protection programs were programmed to delete the Gator app. Gator sued them.

In the end, Gator reformed its ways and survived. Gator changed its name to Claria and now it is one of the more respected names in online advertising.

Trouble happens when intermediaries do not do a good job knowing their customers and vetting the ads they run. That is how the scammers and the hackers get access to legitimate web sites.

Web sites are also known as content publishers. They have more at stake and they can get away with less than the intermediaries. They are more likely to be blamed if something bad happens, whether it is their fault or not. But shame on them if they do not take reasonable precautions to prevent bad things from happening.

For example, banner ads containing malicious code that infects users' machines if they are not properly patched have appeared on MySpace.com, MLB.com and other mainstream web sites. It wasn't their malicious code, but they should have made sure that the ads they displayed were properly screened.

Many mainstream websites are profiling their users' behavior and selling that information to advertisers. The least they should do is let their users opt out of such profiling.

Advertising on social networking sites is a new frontier for behavioral profiling. MySpace recently opened its doors to targeted ads: MySpace will keep its data under wraps but sell access to various demographic and behavioral populations. So, MySpace or their agents will say to advertisers, if you advertise on MySpace, we can target your ads to girls taking driver's ed classes or boys with severe acne. They claim that they will be careful not to let objectionable ads reach our kids. But how do they know what is objectionable?

7 - What is the upside or what are the benefits to consumers of behavioral profiling?

Marketers claim that behavioral profiling allows them to present fewer, more relevant ads to consumers. But that is not sound economic reasoning. If the marginal return from advertising expenditures rises, advertising expenditures will rise. Additionally, if advertising becomes more effective, more companies will engage in it, meaning that consumers will see more advertising, not less. Online advertising is growing at the rate of 20% per year. That means a lot more ads, not fewer.

The principal benefit to consumers is that advertising allows them access to almost all the content on the web for free. It is estimated that advertisers will spend $40 billion online this year and that number is growing at a rate of 20% per year.

Slate.com, which was launched by Microsoft, tried operating as a subscription news and opinion service. It failed to attract enough subscribers to make a profit. It switched to a free, ad-supported site and it is now making money.

Rupert Murdoch, who is acquiring the Wall Street Journal, has said that he intends to change its online content from a subscription model to a free, ad-supported model.

One commentator has said that the success of the internet in the market boils down to people's perceptions that, "It's all about me, and it's all free."

8 - How can we protect our privacy and still enjoy free stuff?

We can't all do it. Some of us can. It is like television: if everybody records shows and skips the commercials, the TV networks will die. So for now, some of us, like you and me, can block ads in our browsers and erase our cookies after every session. That will keep us safe, allow us to travel incognito, and still enjoy the convenience and content of the internet.

9 - What does the future hold?

In the future, there will be more information collected about people. Organizations will know much more about you and me. And there will be additional avenues for these organizations to reach out and touch you.

RFID tags will be embedded in everything we own. And tag readers will be everywhere we go, so not only will they know where we are at all times, they will know what we are wearing and everything that is in our handbags.

When you walk into a store, they will address you by name, and they will know your size and your likes and dislikes. As you drive down the street, a billboard may show you a message specifically for you. Your cell phone might ring to tell you that you missed your morning coffee and there are 4 Starbucks shops in the next block.

10 - Where can people find out more about this subject?

  1. FTC Town Hall Meeting: eHavioral Advertising
  2. Center for Digital Democracy
  3. Electronic Privacy Information Center

Monday, November 19, 2007

On Turning Document into Database

Imagine you keep your contacts - names, addresses, phone numbers, spouses, children, etc. - in a Microsoft Word document. Periodically you update it, print it and put it in a binder which you keep by the phone. You started doing this 20 years ago, and you have a lot of names and numbers. The information was entered over time without a lot of regard for standards and consistency. How it printed and how it looked when printed were the only considerations. Entries in the document look like this:
Shakespeare, William
Stratford on Avon
or Globe Theatre
London, UK
phone: 707-727-9999)
Tel. 800-555-1212
Fax: (866) 555-4321
Email: bard@globe.com
(Bill & Stacey. Stacey's cell


Now imagine that you want to migrate this information to Microsoft Outlook or another database application. This was the situation presented to us recently by a client. Unfortunately, it turned out that this was not a trivial piece of work; the Word document contained over 2,000 contacts.

The document had information in record-layout form, without consistent fields or delimiters. Before we could import the data into Outlook, we needed to identify and label fields in each and every record.

One of the File/Import options available in Outlook is a vCard format file. This allows one to import data in record-by-record form, where the records comply with the vCard specs. Initially, we went down this road only to find that the fields available in the vCard spec are too limited -- no spouses, for example; and Outlook only imports one(!?) record per vCard file. There are third-party apps that let you import into Outlook more than one record per vCard file, but the field limits were a deal-breaker.

So, as the job evolved, we ended up identifying fields in each and every record and converting the whole thing into tabular form, saving it as a Comma Separated Values (CSV) text file, and finally importing that into Outlook. Piece of cake? I wish!

The keys to success here are XML, Regular Expressions and XSL. You know how to use XML/XSL and Regular Expressions, don't you?

First, identify all the fields you are going to use. Hint: Use fields like the ones used in Outlook. Then establish your XML tags for each field: <lname>, <fname>, <mname>, <adr1>, ...

In Microsoft Word, clean up the file as much as possible; then, save it as a text file. We need it as a text file in order to use a Regular Expressions tool to perform complex search-and-replace functions, i.e., identify fields and insert our XML tags. Regexxer, a free Linux tool, is ideal for this part of the job. But, depending on how inconsistent the document is, this part of the job takes HOURS! We drastically underestimated the time involved.
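
We did the tagging interactively in Regexxer, but the same kind of search-and-replace rules can be sketched in a short script. The version below is only an illustration: the patterns, the tag names other than lname and fname, and the untagged fallback are my assumptions, and a real document will need many more rules plus manual review.

```python
import re
from xml.sax.saxutils import escape

# (pattern, tag) pairs tried in order. Tags other than lname/fname are
# illustrative; use whatever field names match your Outlook import.
RULES = [
    (re.compile(r"^(?:phone|tel)\b[.:]?\s*(.+)$", re.IGNORECASE), "phone"),
    (re.compile(r"^fax\b[.:]?\s*(.+)$", re.IGNORECASE), "fax"),
    (re.compile(r"^e-?mail\b[.:]?\s*(\S+@\S+)$", re.IGNORECASE), "email"),
]
NAME = re.compile(r"^([^,]+),\s*(.+)$")  # "Last, First" on a record's first line


def tag_line(line, first=False):
    """Wrap one line of a record in an XML tag chosen by pattern."""
    text = line.strip()
    if first:
        m = NAME.match(text)
        if m:
            return "<lname>%s</lname><fname>%s</fname>" % (
                escape(m.group(1).strip()), escape(m.group(2).strip()))
    for pattern, tag in RULES:
        m = pattern.match(text)
        if m:
            return "<%s>%s</%s>" % (tag, escape(m.group(1).strip()), tag)
    # Anything unrecognized is kept, marked for a human to sort out later.
    return "<untagged>%s</untagged>" % escape(text)


def tag_record(lines):
    """Tag every line of one contact and wrap the result in <contact>."""
    tagged = [tag_line(line, first=(i == 0)) for i, line in enumerate(lines)]
    return "<contact>\n" + "\n".join("  " + t for t in tagged) + "\n</contact>"


record = ["Shakespeare, William", "Stratford on Avon",
          "Tel. 800-555-1212", "Fax: (866) 555-4321", "Email: bard@globe.com"]
tagged_record = tag_record(record)
```

Two details matter in practice. The escape call keeps characters such as & from producing invalid XML, and the untagged fallback means no line is silently dropped, so the leftovers show how much hand work remains.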

When the long job of tagging the data document is completed, the text file should be made into an XML file and associated with an XSL file which you will create. This XSL file will transform the XML file into a CSV table in a browser. The final step is to copy and save the table and import it into Outlook.
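
We used an XSL stylesheet for this step. For readers who would rather script it, here is a rough equivalent using Python's standard library instead of XSL. The contacts and contact element names and the column list are assumptions made for the example, not the layout we actually used.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Illustrative column order; match it to the fields you tagged.
FIELDS = ["lname", "fname", "phone", "fax", "email"]


def contacts_to_csv(xml_text, fields=FIELDS):
    """Flatten <contact> elements into CSV text, one row per contact."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)  # header row so the import can map columns
    for contact in root.iter("contact"):
        # A field that is missing from a record becomes an empty cell.
        writer.writerow([(contact.findtext(f) or "").strip() for f in fields])
    return out.getvalue()


sample = """<contacts>
  <contact>
    <lname>Shakespeare</lname><fname>William</fname>
    <phone>800-555-1212</phone><email>bard@globe.com</email>
  </contact>
</contacts>"""

csv_text = contacts_to_csv(sample)
```

Note that findtext returns only the first matching child, so a contact with two phone numbers would lose the second one here. Tagging them as separate fields (home, work, cell) avoids that.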

Take it from me, this process works. Unfortunately, the time it takes makes it cost-prohibitive for all but the most cost-insensitive clients.

Wednesday, November 07, 2007

Roll Your Own Ubuntu Desktop

If you want/need to roll your own Ubuntu installation, here's what to do.
  1. Using the text-install CD, install Ubuntu to the command prompt.
  2. Apt-get install the packages you need/want to manually configure.
  3. Configure those packages and get them working.
  4. Apt-get install any other packages you want (including dselect).
  5. Get your own or use my list of Ubuntu (Gutsy Gibbon) install packages.
  6. dselect the list.

Here's the background on this...


Following up on my post, Linux Distro Hop, I heard from a reader telling me where to find the text-install version of the new Ubuntu 7.10 (Gutsy Gibbon). Either I missed the link to it the first time around, or they added the link to it sometime after the time I was first looking for it (a day or two after the formal release). In any case...

Let's recap. I have this dual-processor, dual-video card machine that was originally built as a high-end gaming machine. The motherboard was one of the first dual-processor boards. It has a low-end ATI GPU on the motherboard, and there is an Nvidia GeForce3 video card. My plan was to load Linux/Ubuntu on it and repurpose this machine (as a VMware server).

Unfortunately, the Ubuntu LiveCD would not run on the machine. The video card situation confounded Ubuntu. After trying with varying success to load different Linux distros, I settled on Debian because it is the basis of Ubuntu. That's when I posted Linux Distro Hop.

Because Debian worked, I was confident that Ubuntu should work. So, armed with the text-install for Ubuntu, I decided to try again.

Unfortunately, the text-install does not do much in terms of de-obfuscating the Ubuntu installation process. Virtually everything is done automatically, without user input. Like the LiveCD, the text install got hung up and failed. No error messages and no indication of what the problem was. Based on my experience with the various distros, however, I knew that the video cards, drivers and the Xserver configuration were causing the install to fail.

It was with some exasperation that I surveyed my options. On the text-install CD, I noticed that it offered "Install to a command line."
  • Let's see if that works...  It does! That's progress.

Feeling my way forward, I apt-get installed xserver, and edited xorg.conf to use the Nvidia GeForce3 video card.
  • Let's see what happens when I startx...  It works!

I apt-get installed gnome, and I was well on the way to rolling my own Ubuntu!! But then I thought,
  • How am I going to know which packages I need to apt-get install before I can call this an Ubuntu desktop? And I don't want to overwrite or otherwise break what I've done already.

Fear not! It turns out that it is a simple matter to obtain a listing of all the packages installed on a given Debian/Ubuntu system. So, if you are like me and have an Ubuntu system that is pretty much the way you want it in terms of apps, codecs and proprietary drivers, you can list that machine's packages to a file, move the file to your roll-your-own machine and dselect all those packages in a single command. And, as if that were not cool enough, packages already downloaded and installed are NOT COPIED, so you won't undo what you've already done. Whew!
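
If you want to preview what the package list would add before handing it to dselect, a small script can compare two listings. This is only a sketch: it assumes both files are in the two-column format printed by dpkg --get-selections (package name, then a state such as install or deinstall), and the package names in the sample data are just examples.

```python
def parse_selections(text):
    """Return the set of package names marked 'install' in a selections listing."""
    packages = set()
    for line in text.splitlines():
        parts = line.split()
        # Expect exactly "name state"; skip blank or malformed lines.
        if len(parts) == 2 and parts[1] == "install":
            packages.add(parts[0])
    return packages


def missing_packages(reference_text, target_text):
    """Packages installed on the reference machine but not on the target."""
    return sorted(parse_selections(reference_text) - parse_selections(target_text))


# Sample listings; in practice, read these from the two saved files.
reference = (
    "gnome-core\tinstall\n"
    "keepassx\tinstall\n"
    "xserver-xorg\tinstall\n"
    "oldpkg\tdeinstall\n"
)
target = (
    "gnome-core\tinstall\n"
    "xserver-xorg\tinstall\n"
)

to_add = missing_packages(reference, target)
```

Here to_add contains only keepassx: the packages already on the target are left alone, and the one marked deinstall on the reference machine is ignored.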

To do this, I followed arsgeek's guidance (don't be distracted by the comments). In case you don't have a machine to base your work on, here is the text file I generated. It lists all the packages installed on a Gutsy Gibbon desktop machine, including sound and video players and codecs to let me access most/all(?) multimedia around the net. It also includes KeePassX, which I recommend...

You can use this file (save and rename it) as described by arsgeek to roll your own Ubuntu desktop. I did and it worked like a charm! Roll on!