Sunday, March 28, 2010

Web Data Mining

Web data mining is the application of a variety of techniques to discover what kind of information Internet users are looking for. It may be used to collect several specific kinds of data (such as text-only data or multimedia data) and has a wide variety of potential applications. Data mining has been used almost since the Internet first started, and more people use these techniques every year. Companies that wish to attract more potential customers can use information gained from data mining to better target their advertising campaigns. University students who are researching how people use the Internet can use data mining to produce accurate statistics. Even if your sole purpose is to drive traffic and you are not doing any advertising at all, data collected from a data mining operation can give you the information you need to choose effective keywords.
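To make the keyword idea concrete, here is a minimal sketch of how a list of search queries could be turned into candidate keywords by counting word frequency. The sample queries, the stop-word list, and the function name `top_keywords` are all made up for illustration; real mined data would be far larger and messier.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def top_keywords(queries, n=3):
    """Return the n most frequent non-stop-words across all queries."""
    counts = Counter()
    for query in queries:
        # Lowercase the query and split it into alphanumeric words.
        for word in re.findall(r"[a-z0-9]+", query.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)


# Made-up sample data standing in for mined search queries.
SAMPLE_QUERIES = [
    "cheap running shoes",
    "best running shoes for women",
    "running shorts",
]

if __name__ == "__main__":
    for word, count in top_keywords(SAMPLE_QUERIES, n=2):
        print(word, count)
```

Simple word counts like this are only a starting point, but they show how raw search data can suggest which terms are worth targeting.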

Data mining may also be referred to as "scraping". So how is data mining/scraping accomplished? Software is the most popular method, and a number of companies offer data mining services. Mozenda and Kapow Technologies are two popular companies devoted to data mining. The data can be invaluable to commercial advertisers because it gives them the information they need to put their products in front of as many web users as possible. After all, how can you target ads toward people when you have no idea what they are searching for? How can you start a viral campaign via social media if you do not know how people use the web? Knowing how people use the web is vital to the success of just about any website, commercial or non-commercial.
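For readers curious what scraping software does under the hood, the sketch below pulls the links out of a small HTML page using only Python's standard library. It is not how Mozenda or Kapow work internally; the class name `LinkExtractor`, the helper `extract_links`, and the sample page are assumptions made for illustration, and a real scraper would also need to download pages and respect each site's terms of use.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return its links as (href, text) pairs."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up page standing in for a downloaded web page.
SAMPLE_PAGE = """
<html><body>
  <h1>Product list</h1>
  <a href="/widgets">Widgets</a>
  <a href="/gadgets">Cheap <b>gadgets</b></a>
</body></html>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_PAGE):
        print(href, text)
```

Commercial tools wrap this kind of extraction in a point-and-click interface, which is a large part of their appeal to non-programmers.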

This is especially true for the little people - small businesses and new websites that are trying to make a name for themselves. Data mining can help large companies as well, but the small company stands to reap much greater benefits. Large companies have already established themselves and will already have a customer base they can count on without the need for advertising. This is almost certainly not true for the small business owner/website operator.

So what about the ethical side of all this? It is largely a matter of personal opinion, but virtually everyone agrees that web data mining can be used both ethically and unethically. Many people agree that as long as data that is considered personal (sexual preference, religion, political affiliation, etc.) is not harvested, there are no ethical concerns.

So from that point of view, collecting data on what people in general search for (like popular search trends) and tend to look at is certainly ethical. Unethical use, from this point of view, would mean that a company harvests data to build complex profiles based on personal information in order to promote its products. However, the fact that this point of view is popular does not mean everyone sees the issue the same way. Some people believe all data mining is unethical and infringes on privacy. Others believe that data mining is totally acceptable and that people who have nothing sinister to hide have no reason to be concerned. Civil rights organizations are deeply concerned about government use/abuse of data mining technology to spy on its citizens. This is an interesting issue and there is no telling what the future may hold for data mining. At this point in time, most forms of Internet data mining are legal and are used by a large number of individuals and corporations all around the world.


1 comment:

  1. Hi Jon,

    Mozenda looks like an amazingly easy-to-use tool. I wonder however whether there are any comparable open source software solutions around?

    As I am a PhD student, I am looking to use web scraping software for non-commercial research. I found and used a Java-based tool called webharvest; however, you have to mess around with XPath manually. A tool like Mozenda looks perfect, but there doesn't seem to be a version available for non-commercial researchers, and the trial seems very limited.