I will edit this a bit and split it into a few steps. Here on Blogger there is no suitable way to add code, so I will probably provide the code as a download and write only the explanations, with short code passages attached.
Let's start with step 1 of our data miner. The first step is to set up the things we will need. As I said, we will need a WebBrowser control, which will load the pages we need. Second, we will need the mshtml and SHDocVw libraries. They contain handy functions and structures we will need to use. Add references to those in your project and place a WebBrowser control onto the form. We are now ready to start.
We will first need a suitable way to get the HTML document from a string (the URL of the document), and sometimes we will need to tell the browser: "Hey, click on this link, wait for the document to fully load and give it to me, so I can process it!". This is a bit trickier than the first requirement. Getting the document from a string is easier: we simply call the WebBrowser.Navigate() routine, attach some delegates and do some routine work to check whether the document is really ready or the DocumentComplete event has just fired for a frame again.
I organized those two simple but important routines into a class called DOMBase. The class constructor receives two parameters: the first is the WebBrowser control which will be used as the navigator, and the second is a label which will report the current status (in case you want to hide the WebBrowser control and perform silent navigation). Here is the signature of the DOMBase constructor:
public DOMBase(AxWebBrowser wb, Label Status)
In the constructor I simply assign the WebBrowser control and the Label to private variables I have declared previously, so I can access them from anywhere in the class. More important is the GetDocument method, which sets up the delegates needed and navigates to the requested document. After the document is fully loaded, the routine returns the HTMLDocument instance, which is the document loaded inside the WebBrowser control. The method is public, so it can be called from anywhere, and it receives only a string containing the URL of the document to get. Here is the signature of the GetDocument routine:
public HTMLDocument GetDocument(string s_Url)
There are a few interesting things inside the GetDocument body. I won't discuss the navigation itself, since it is trivial. The next thing is to assign the DocumentComplete delegate, so that a routine defined by us receives the event. I forgot to mention that my class-level WebBrowser control is named h_wb. So here is how I assign the DocumentComplete event to my routine (this is inside the GetDocument routine):
this.h_wb.DocumentComplete += new AxSHDocVw.DWebBrowserEvents2_DocumentCompleteEventHandler(h_wb_DocumentComplete);
(The line is wider than the page, but you will soon have it in a separate file where it is easier to read.)
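To make the pieces described so far easier to picture, here is a rough sketch of how a class like DOMBase might be put together. It is not the downloadable code, only an illustration under several assumptions: the project is a Windows Forms application that references the AxSHDocVw/SHDocVw interop assemblies and Microsoft.mshtml; the field h_status and the flag documentReady are hypothetical names; and the DoEvents wait loop and the GetOcx() comparison are just one possible way to wait for the top-level document while ignoring the DocumentComplete events raised by frames.

```csharp
// Sketch only. Assumes references to the AxSHDocVw/SHDocVw interop assemblies
// (generated when the Microsoft Web Browser ActiveX control is added to a
// Windows Forms project) and to Microsoft.mshtml.
using System.Threading;
using System.Windows.Forms;
using AxSHDocVw;
using mshtml;

public class DOMBase
{
    private readonly AxWebBrowser h_wb;  // browser control used as the navigator
    private readonly Label h_status;     // hypothetical name: label that reports status
    private bool documentReady;          // hypothetical flag set by the event handler

    public DOMBase(AxWebBrowser wb, Label Status)
    {
        // Keep both controls so the other methods can reach them.
        h_wb = wb;
        h_status = Status;
    }

    public HTMLDocument GetDocument(string s_Url)
    {
        documentReady = false;

        // Route DocumentComplete to our own routine for the duration of this call.
        DWebBrowserEvents2_DocumentCompleteEventHandler handler =
            new DWebBrowserEvents2_DocumentCompleteEventHandler(h_wb_DocumentComplete);
        h_wb.DocumentComplete += handler;
        try
        {
            h_status.Text = "Loading " + s_Url;
            h_wb.Navigate(s_Url);

            // Assumption: wait on the UI thread while still pumping messages,
            // otherwise the control could never raise DocumentComplete.
            while (!documentReady)
            {
                Application.DoEvents();
                Thread.Sleep(10);
            }
        }
        finally
        {
            // Detach so repeated calls do not stack up handlers.
            h_wb.DocumentComplete -= handler;
        }

        h_status.Text = "Done";
        return (HTMLDocument)h_wb.Document;
    }

    private void h_wb_DocumentComplete(object sender,
        DWebBrowserEvents2_DocumentCompleteEvent e)
    {
        // The event fires once per frame. Assumption: the top-level document
        // is the one whose pDisp is the browser control's own COM object.
        if (e.pDisp == h_wb.GetOcx())
        {
            documentReady = true;
        }
    }
}
```

With something like this in place, a form could call `new DOMBase(axWebBrowser1, lblStatus).GetDocument("http://example.com/")` (control names and URL are placeholders) and receive the document only after the top-level page has finished loading. A real implementation would probably also want a timeout, so that a page which never completes does not leave the loop spinning forever.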
[to be continued].