Mar 13, 2007

Another data mining approach [2nd Edition]

I will edit this a bit and split it into a few steps. Blogger offers no suitable way to post code, so I will probably provide the code as a download and write only the explanations, with short code passages attached.

Let's start with step 1 of our data miner. The first step is to set up the things we will need. As I said, we will need a WebBrowser control which will load the pages, and we will need the mshtml and ShDocVw libraries, which contain handy functions and structures. Add references to those in your project and place a WebBrowser control onto the form.

We are now ready to start. We will first need a suitable way to get the HTML document from a string (the URL of the document), and sometimes we will need to tell our application:

"Hey, click on this link, wait for the document to fully load and give it to me, so I can process it!" This is a bit trickier than the first requirement. Getting the document from a string is easier - we simply call the WebBrowser.Navigate() routine, attach some delegates, and do some routine work to check whether the whole document is ready or whether OnDocumentComplete fired for just a frame of it.

I organized those two simple but important routines into a class called DOMBase. The class constructor receives two parameters: the first is the WebBrowser control which will be used as the navigator, and the second is a Label which will report the current status (in case you want to hide the WebBrowser control and perform silent navigation). Here is the signature of the DOMBase ctor:

public DOMBase(AxWebBrowser wb, Label Status)
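A minimal sketch of what the constructor body looks like, assuming private fields named h_wb and h_status (only h_wb is named later in this post; h_status is my placeholder name):

```
private AxSHDocVw.AxWebBrowser h_wb;  // the hosted WebBrowser control
private Label h_status;               // reports the current status (name assumed)

public DOMBase(AxWebBrowser wb, Label Status)
{
    // Keep references so the other methods of the class can use them.
    this.h_wb = wb;
    this.h_status = Status;
}
```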

In the constructor I am simply assigning the WebBrowser control and the Label to private fields I have declared previously, so I can access them from the class. More important is the GetDocument method, which sets up the needed delegates and navigates to the requested document. After the document is fully loaded, the routine returns the HTMLDocument instance, which is the document loaded inside the WebBrowser control. The method is public so it can be accessed from everywhere, and it receives only a string containing the URL of the document to get. Here is the signature of the GetDocument routine:

public HTMLDocument GetDocument(string s_Url)
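The full body is not shown yet, but based on the description above it amounts to something like the sketch below: navigate, pump Windows messages until the DocumentComplete event signals readiness, then return the loaded document. The h_done flag is an assumed helper field of mine, not part of the original code:

```
private bool h_done;  // set by the DocumentComplete handler (assumed helper)

public HTMLDocument GetDocument(string s_Url)
{
    h_done = false;
    this.h_wb.DocumentComplete +=
        new AxSHDocVw.DWebBrowserEvents2_DocumentCompleteEventHandler(h_wb_DocumentComplete);

    object o = null;
    this.h_wb.Navigate(s_Url, ref o, ref o, ref o, ref o);

    // Busy-wait, but keep processing Windows messages so the
    // browser can actually finish loading the page.
    while (!h_done)
        System.Windows.Forms.Application.DoEvents();

    // The document hosted by the control is an mshtml.HTMLDocument.
    return (HTMLDocument)this.h_wb.Document;
}
```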

There are a few interesting things inside the GetDocument body. The navigation itself I won't discuss, since it is trivial. The next thing is to assign the DocumentComplete delegate, so that a routine defined by us receives the event. I have forgotten to mention that my global WebBrowser control is named h_wb. So here is how I wire the OnDocumentComplete event to my routine (this is inside the GetDocument routine):

this.h_wb.DocumentComplete += new AxSHDocVw.DWebBrowserEvents2_DocumentCompleteEventHandler(h_wb_DocumentComplete);
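The matching handler (my sketch, consistent with the delegate above) only has to check that the event was raised for the top-level document and not for one of its frames, then flip the flag:

```
private void h_wb_DocumentComplete(object sender,
    AxSHDocVw.DWebBrowserEvents2_DocumentCompleteEvent e)
{
    // DocumentComplete fires once per frame; only the event whose
    // pDisp is the WebBrowser control itself marks the top-level document.
    if (e.pDisp == this.h_wb.GetOcx())
        h_done = true;
}
```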

(The line is longer than the page width, but you will have it soon in a separate file so you can see it better.)

[to be continued].

Another data mining approach

A few weeks ago I urgently needed a tool to gather data for me from some sites. It wasn't easy to find a good tool for crawling sites and extracting the data from them. Since I love experiments, I decided to develop my own spider. I needed it to be a human-like program which can fetch and analyze a document from a given string (URL). After the document is loaded, the application should find a link, click on it, wait for the next page to load, and so on. I used the IE automation approach, as it seemed easier at the time - but it was a lot of pain until all the functionality was actually implemented. I will post some code and explanations later; for now you can search http://www.codeproject.com and http://msdn2.microsoft.com/en-us/default.aspx as they seem to be good starting points. When I get home I will paste some code along with explanations.