Sunday, August 9, 2009

Web Data Miner

Data miner

A data miner is software that reads material on public websites. Data miners can detect and read HTML, ASP, PHP, or anything else that can be displayed in a web browser, and turn it into a data output that businesses and others can use in a simple, easy-to-understand format such as an Excel spreadsheet.
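As a rough illustration of that "web page in, spreadsheet out" idea, here is a minimal Python sketch that uses only the standard library. It pulls the links out of a small, made-up HTML snippet and writes them as CSV, which Excel can open. The names `LinkCollector`, `links_to_csv`, and `SAMPLE_HTML` are hypothetical and chosen for this example; a real data miner would also fetch pages over the network and handle far messier markup.

```python
import csv
import io
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (link text, href) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished (text, href) pairs
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append(("".join(self._text).strip(), self._href))
            self._href = None


def links_to_csv(html):
    """Return the links found in `html` as CSV text with a header row."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["text", "href"])
    writer.writerows(parser.rows)
    return out.getvalue()


# Made-up sample page, standing in for a downloaded web page.
SAMPLE_HTML = """
<html><body>
<h1>Clinics</h1>
<a href="/north">North Clinic</a>
<a href="/south">South Clinic</a>
</body></html>
"""

if __name__ == "__main__":
    print(links_to_csv(SAMPLE_HTML))
```

Saving the returned text to a `.csv` file gives something a spreadsheet program can open directly.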
Health care researchers can use data miners to find out rates of illness, the scope of a public health crisis, and more. Law enforcement officials can use data miners to root out criminal activity and to solve crimes. Government agencies can use data miner software to check for policy compliance.
By employing data miner applications, businesses take a proactive approach to information gathering. Building an ongoing email list, informing marketing and advertising, and researching competitors are three major reasons to use data miners. The ways to combine web data are limitless. Some of the most successful software companies simply aggregate web data and make it more usable to humans. Think of Google: its humble beginnings were simply gathering and displaying web data in an easy-to-use fashion. Luckily, there are now many custom web crawlers on the market that can help us mine that web data. I have started to compile an open directory of web data mining programs at


  1. I have a tool that crawls a site and lets me specify a content start tag and a footer begin tag so that it only pulls the content and not the presentation/design. It also uses HTML Tidy, etc. to clean it up. My problem is that it exports to an XML file with a very specific layout for a proprietary product. My question is: are there any other cheap or free products out there that will do this with a more universal output?

  2. I am building a directory of all programs so that you can pick the best fit for your needs; however, the only program that I have used to do this is Mozenda. Its prices are by far the lowest that I have seen, but it is not free. If you are looking for free, Perl would be the best scripting language for parsing the HTML, and you can do it with grep. Give me a week and I will have a good, open directory to help.
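For anyone who wants to try the free route described in the first comment, the start-tag/footer-tag approach can be sketched in a few lines. This is only a minimal Python illustration, not the commenter's tool or Mozenda: the function names (`extract_content`, `page_to_json`), the marker strings, and the sample page are all assumptions made up for this example, and JSON is used here simply as one widely readable output format.

```python
import json


def extract_content(html, start_marker, footer_marker):
    """Return the text between start_marker and footer_marker.

    Returns None if either marker is missing.
    """
    start = html.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = html.find(footer_marker, start)
    if end == -1:
        return None
    return html[start:end].strip()


def page_to_json(url, html, start_marker, footer_marker):
    """Package the extracted content as a JSON string."""
    content = extract_content(html, start_marker, footer_marker)
    return json.dumps({"url": url, "content": content}, indent=2)


# Hypothetical page; the comment markers are assumptions for this example.
PAGE = """<html><body>
<div id="nav">Home | About</div>
<!-- content-start -->
<p>Data miners read public web pages.</p>
<!-- footer-start -->
<div id="footer">Copyright</div>
</body></html>"""

if __name__ == "__main__":
    print(page_to_json("http://example.com/page", PAGE,
                       "<!-- content-start -->", "<!-- footer-start -->"))
```

Plain string matching like this only works when the site uses the same markers on every page. It does not clean up the markup in between, so a tidy-up step would still be needed afterwards.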