IBM Scientists Rely on the Principle of Uncertainty to Develop Web-Privacy Answers

ALMADEN, CA -- If you're an online merchant, ask a visitor to your site a simple question, and you are likely to get a misleading answer, if you get anything at all. That's the dilemma prompting a new privacy-enhancing data mining technique being developed by Dr. Rakesh Agrawal and Dr. Ramakrishnan Srikant, researchers at IBM's Almaden Research Center.

The research project -- one of several underway at IBM's Privacy Research Institute -- scientifically addresses a Catch-22: Web users enter false personal data on sites to protect their privacy, while e-businesses rely on that data to develop data models and deliver customized services. "Our research institutionalizes the notion of fibbing on the Internet, and does so to preserve the overall reality behind the data," says Dr. Agrawal.

Called Privacy-Preserving Data Mining, the research relies on the notion that one's personal data can be protected by being scrambled or randomized prior to being communicated. By applying this technique, a retailer could generate highly accurate data models without ever seeing personal information.

"The beauty of this research is that retailers and other Web businesses are able to extract the valuable demographic information they need without necessarily knowing the underlying personal consumer data," said Harriet P. Pearson, IBM's Chief Privacy Officer. "I believe we'll see technological approaches such as this playing a larger role in managing the privacy issues of today and the future."

According to Dr. Agrawal, the Privacy-Preserving Data Mining research has a wide range of potential applications, from medical research, where disease prediction models could be built from randomized individual medical histories, to e-commerce, where accurate promotions could be based on randomized demographics of individual users.

How It Works: A Web user decides to enter a piece of personal data -- e.g., age, salary, weight.
Upon entry, that number, say age 30, is immediately scrambled or 'randomized' by IBM software: the software takes the original number that was input and adds (or subtracts) a random value. This randomization step is performed independently for every user who opts to enter their age. So, a 30-year-old's age may be randomized to 42, while a 34-year-old's entry may be randomized to 28. The randomization differs for every single user. What does not change is the allowed range of the randomization.

The range is directly linked to the desired level of privacy. Larger randomization increases the uncertainty and therefore the personal privacy of the users. At the same time, larger randomization can reduce the accuracy of the results ultimately produced by a data mining algorithm that uses the randomized data as input. According to Dr. Agrawal, it is clearly a trade-off. Experiments indicate only a 5-10 percent loss in accuracy even for 100 percent randomization after the data mining algorithm has applied corrections to the randomized distributions.

An Example: Take the randomization of an IT manager's salary, which, for purposes of this example, may range between $50,000 and $150,000 per year. Let's say that the web merchant (or web site owner) decides that the software's randomization parameter will be set to add a random value somewhere between -$30,000 and +$30,000.

Jane, who comes to the site and decides to enter her salary in exchange for personalized recommendations, has a salary of $100,000. When she enters $100,000, the IBM software happens to pick a random value of -$15,000, so Jane's salary is recorded as $85,000. To protect her privacy, no record is kept of her true salary. Then Bob comes to the site and enters his true salary of $90,000. The software happens to pick +$25,000 for Bob, and his salary is recorded as $115,000. Again, no record is kept of Bob's true salary.
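The randomization step in the example above can be sketched in a few lines of Python. This is a minimal illustration, not IBM's software: the function name `randomize` is hypothetical, and drawing the offset uniformly from the allowed range is an assumption, since the description above says only that a random value within the range is added or subtracted.

```python
import random


def randomize(value, spread, rng):
    """Return value plus a random offset from [-spread, +spread].

    Assumption: the offset is drawn uniformly from the allowed range.
    """
    return value + rng.uniform(-spread, spread)


rng = random.Random(0)   # seeded only so that the sketch is repeatable
SPREAD = 30_000          # randomization parameter from the example

true_salaries = {"Jane": 100_000, "Bob": 90_000}

# Only the randomized values are kept; the true values are discarded.
recorded = {name: randomize(salary, SPREAD, rng)
            for name, salary in true_salaries.items()}

for name, value in recorded.items():
    print(f"{name}: recorded as ${value:,.0f}")
```

Each visitor gets an independent draw, so two visitors with the same true salary will usually be recorded with different values, while every recorded value stays within $30,000 of the true one.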
To see the effect of the randomization, compare the true salary distribution of everyone who entered a salary on the site, Jane and Bob included, with the randomized distribution.
         True Distribution                Randomized Distribution

$50,000- 60,000 :  1 visitor        $50,000-60,000  :  3 visitors
 60,000- 70,000 :  4 visitors        60,000-70,000  :  7 visitors
 70,000- 80,000 : 20 visitors        70,000-80,000  : 12 visitors
 90,000-100,000 : 50 visitors        90,000-100,000 : 33 visitors
100,000-110,000 : 10 visitors       100,000-110,000 : 55 visitors
110,000-120,000 : 45 visitors       110,000-120,000 : 23 visitors
120,000-130,000 : 15 visitors       120,000-130,000 : 10 visitors
130,000-140,000 :  3 visitors       130,000-140,000 :  2 visitors
140,000-150,000 :  2 visitors       140,000-150,000 :  5 visitors
Note in the randomized list that 55 people are in the 100-110 thousand range, whereas in truth there were only 10. If this randomized data were used directly, the results would be very poor.

Once the randomized data is in for a large number of users, the privacy-preserving data mining software takes the randomized distribution and reconstructs what the true distribution might have looked like. The software cannot determine what Jane's or Bob's salaries were. It has access only to the randomized values and the parameters of randomization (i.e., that the random values added or subtracted came from the range -$30,000 to +$30,000), and nothing else. Based only on this information, the software reconstructs a close approximation of the true distribution. This reconstructed distribution is then used in building an accurate data mining model. Jane gets personalized recommendations by having the data mining model shipped to her client and applied locally.

IBM Privacy Research Institute: Launched in early 2002, the IBM Privacy Institute is the industry's first formal technology research effort focused exclusively on developing privacy-enabling and data protection technologies for businesses. Under the direction of Dr. Michael Waidner, the Institute conducts privacy-enabling technology research in IBM's eight research laboratories around the world.
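For readers who want a concrete picture of the reconstruction step described earlier, the Python sketch below estimates the share of visitors in each $10,000 salary band using only the randomized values and the randomization range. The release does not describe the algorithm IBM uses; this sketch substitutes an iterative Bayesian update, one standard approach to this kind of problem. The function names, the uniform noise model, and the synthetic salaries are all assumptions made for illustration.

```python
import random


def overlap(lo, hi, a, b):
    """Length of the intersection of the intervals [lo, hi] and [a, b]."""
    return max(0.0, min(hi, b) - max(lo, a))


def reconstruct(observed, edges, spread, iterations=50):
    """Estimate the share of true values in each bin from randomized values.

    Assumption: each observed value is a true value plus noise drawn
    uniformly from [-spread, +spread]. Individual values are never recovered.
    """
    n_bins = len(edges) - 1
    # likelihood[i][k] is proportional to P(observed[i] | true value in bin k)
    likelihood = [
        [overlap(edges[k], edges[k + 1], w - spread, w + spread)
         / (edges[k + 1] - edges[k]) for k in range(n_bins)]
        for w in observed
    ]
    estimate = [1.0 / n_bins] * n_bins  # start from a flat guess
    for _ in range(iterations):
        updated = [0.0] * n_bins
        used = 0
        for row in likelihood:
            weights = [l * p for l, p in zip(row, estimate)]
            total = sum(weights)
            if total == 0.0:
                continue  # value is incompatible with every bin; skip it
            used += 1
            for k in range(n_bins):
                updated[k] += weights[k] / total
        if used == 0:
            break
        estimate = [u / used for u in updated]
    return estimate


rng = random.Random(42)
SPREAD = 30_000
edges = list(range(50_000, 160_000, 10_000))  # $50,000 to $150,000 bands

# Synthetic "true" salaries, invented for this sketch only.
true_values = [min(max(rng.gauss(100_000, 15_000), 50_000), 149_999)
               for _ in range(2_000)]
observed = [x + rng.uniform(-SPREAD, SPREAD) for x in true_values]

estimate = reconstruct(observed, edges, SPREAD)
for k, share in enumerate(estimate):
    print(f"{edges[k]:>7,}-{edges[k + 1]:>7,}: {share:6.1%}")
```

The function sees only `observed`, `edges`, and `spread`, which mirrors the constraint above that the software has access to the randomized values and the randomization parameters and nothing else.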