All the stuff that bothers a software developer in his everyday tasks.
Showing posts with label Data Mining. Show all posts
Jan 28, 2008
Job Done!
I think I finally made it. The osCommerce 2 vShop import was finished this afternoon. Just waiting for the QA test now.
Labels:
.NET,
.NET C# technologies,
C#,
Data Mining,
Services,
Software,
Software Development,
Windows Forms
Jan 22, 2008
.NET C# to parse MySQL queries? Impossible? Pavel can make it ;) (with some ifs and buts, of course ;)
The most interesting task of my career came at the end of last week. As a Vizibility employee I was asked whether I could port a live osCommerce site to our vShop solution (http://www.vizibility.co.uk/tmenu/vshop.asp). The interesting thing is that I had no access to osCommerce, nor had I ever worked with it. What's even more challenging, the osCommerce database was given to me as SQL queries, something like:

drop table if exists table;
create table table (
-- Some things here ...
);
insert into table (

So, as for any normal software developer, there were 3 big challenges:
1. How to parse this SQL thing (it is MySQL)?
2. How to figure out …
3. How to port all those things as a script so it can be applied to our live servers by the support team?
It's really challenging, don't you think? At first I thought it would be a good step to install a fresh copy of osCommerce, import the products and try to export them as CSV or something. But then I realized it would take too much time, so I decided to parse the SQL queries on my own and store them in predefined objects like tables, rows and so on. Then I could easily transfer them as text files in the format we needed.
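As a rough illustration of the approach (not the actual Vizibility code), a dump like that can be split into statements and the INSERT rows collected into simple objects. Everything below — `SqlTable`, `DumpParser`, the regex — is a hypothetical sketch that only handles the trivial cases:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical, minimal sketch: split a MySQL dump into statements and
// collect the values of each "insert into <table> ... values (...)" line.
public class SqlTable
{
    public string Name;
    public List<string> Rows = new List<string>();
}

public static class DumpParser
{
    // Naive statement splitter: assumes ';' never appears inside string
    // literals (a real dump needs a proper tokenizer).
    public static Dictionary<string, SqlTable> Parse(string dump)
    {
        var tables = new Dictionary<string, SqlTable>();
        foreach (string raw in dump.Split(';'))
        {
            string stmt = raw.Trim();
            var m = Regex.Match(stmt,
                @"^insert\s+into\s+(\w+)\s*(?:\([^)]*\)\s*)?values\s*\((.*)\)$",
                RegexOptions.IgnoreCase | RegexOptions.Singleline);
            if (!m.Success) continue;
            string name = m.Groups[1].Value;
            if (!tables.ContainsKey(name))
                tables[name] = new SqlTable { Name = name };
            tables[name].Rows.Add(m.Groups[2].Value);  // raw value list
        }
        return tables;
    }
}
```

Once the rows are in objects like these, writing them back out as text files in any format is the easy part.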
And I almost made it: the parsing methods are done (I was able to parse this thing in a day, though some bugs still appear).
It seems the more I work on our own shopping solution, the more easily I can understand other people's shopping solutions.
Great task in a great company!
Labels:
.NET,
.NET C# technologies,
C#,
Data Mining,
Services,
Software,
Software Development,
Windows Forms
Mar 13, 2007
Another data mining approach [2nd Edition]
I will edit this a bit and split it into a few steps. Here at Blogger there is no suitable way to add code, so I will probably publish the code as a download and write only the explanations, with short code passages attached.

Let's start with step 1 of our data miner. The first step is to set up the things we will need. As I said, we will need a WebBrowser control which will load the pages we need. Second, we will need the mshtml and SHDocVw libraries; they contain handy functions and structures we will need to use. Add references to those in your project and place a WebBrowser control onto the form. We are now ready to start.

We will first need a suitable way to get the HTML document from a string (the URL of the document), and sometimes we will need to tell the browser: "Hey, click on this link, wait for the document to fully load and give it to me, so I can process it!". This is a bit trickier than the first requirement. Getting the document from a string is easier: we simply call the WebBrowser.Navigate() routine, attach some delegates and do some routine work to check whether the document is ready or OnDocumentComplete just fired for a frame again.
I organized those two simple but important routines into a class called DOMBase. The class constructor receives two parameters: the first is the WebBrowser control which will be used as the navigator, and the second is a Label which will report the current status (in case you want to hide the WebBrowser control and perform silent navigation). Here is the signature of the DOMBase ctor:
public DOMBase(AxWebBrowser wb, Label Status)
In the constructor of the class I simply assign the WebBrowser control and the Label to private fields I have declared previously, so I can access them from within the class. More important is the GetDocument method, which actually sets up the delegates needed and navigates to the requested document. After the document is fully loaded, the routine returns the HTMLDocument instance, which is the document loaded inside the WebBrowser control. The method is public, so it can be accessed from everywhere, and receives only a string containing the URL of the document to get. Here is the signature of the GetDocument routine:
public HTMLDocument GetDocument(string s_Url)
There are a few interesting things inside the GetDocument body. The navigation itself I won't discuss, since it is trivial. The next thing is to assign the DocumentComplete delegate, so that a routine defined by us receives the event. I forgot to mention that my global WebBrowser control is named h_wb. So here is how I assign the OnDocumentComplete event to my routine (this is inside the GetDocument routine):
this.h_wb.DocumentComplete += new AxSHDocVw.DWebBrowserEvents2_DocumentCompleteEventHandler(h_wb_DocumentComplete);
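Pieced together, the class looks roughly like the sketch below. This is a reconstruction from the description above, not the final code: the busy-wait on a flag is one simple way to block until DocumentComplete fires, the field names other than h_wb are assumptions, and it omits error handling.

```csharp
using System.Windows.Forms;
using mshtml;       // COM reference: Microsoft HTML Object Library
using AxSHDocVw;    // COM reference: ActiveX WebBrowser control

// Sketch of the DOMBase class described above; waiting logic and
// field names other than h_wb are assumptions, not the original code.
public class DOMBase
{
    private AxWebBrowser h_wb;    // the navigator control
    private Label h_status;       // reports the current status
    private bool h_loaded;        // set by the DocumentComplete handler

    public DOMBase(AxWebBrowser wb, Label Status)
    {
        this.h_wb = wb;
        this.h_status = Status;
    }

    public HTMLDocument GetDocument(string s_Url)
    {
        this.h_loaded = false;
        this.h_wb.DocumentComplete +=
            new DWebBrowserEvents2_DocumentCompleteEventHandler(h_wb_DocumentComplete);
        object o = null;
        this.h_wb.Navigate(s_Url, ref o, ref o, ref o, ref o);
        this.h_status.Text = "Loading " + s_Url + " ...";
        // Pump messages until the top-level document (not a frame) is done.
        while (!this.h_loaded)
            Application.DoEvents();
        this.h_status.Text = "Done.";
        return (HTMLDocument)this.h_wb.Document;
    }

    private void h_wb_DocumentComplete(object sender,
        DWebBrowserEvents2_DocumentCompleteEvent e)
    {
        // DocumentComplete fires once per frame as well; only the event
        // for the top-level browser object means the page is fully ready.
        if (e.pDisp == this.h_wb.GetOcx())
            this.h_loaded = true;
    }
}
```

The pDisp check is the "routine work" mentioned earlier: without it, a page with frames would make GetDocument return before the outer document finished loading.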
(It's longer than the page width, but you will have it soon in a separate file so you can see it better.)
[to be continued].
Labels:
.NET C# technologies,
Data Mining,
IE Automation
Another data mining approach
A few weeks ago I urgently needed a tool to gather data for me from some sites. It wasn't easy to find a good tool for crawling sites and getting the data out of them. Since I love experiments, I decided to develop my own spider. I needed it to be a human-like program which can get and analyze a document from a given string. After the document is loaded once, the application should find a link, click on it, then wait for the next page to be loaded, and so on. I used the IE automation approach, as it seemed easier at the time. But it was a lot of pain until all the functionality was actually implemented. I will post some code and explanations later; for now you can search http://www.codeproject.com and http://msdn2.microsoft.com/en-us/default.aspx
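The "find a link and click on it" part can be done through the same mshtml document once it is loaded. A minimal sketch (LinkClicker is a hypothetical helper name; it assumes doc is the HTMLDocument of an already loaded page, and that the caller then waits for DocumentComplete as usual):

```csharp
using mshtml;   // COM reference: Microsoft HTML Object Library

// Hypothetical helper: find the first anchor whose text contains the
// given string and fire its click, the way a human user would.
public static class LinkClicker
{
    public static bool ClickLink(HTMLDocument doc, string text)
    {
        foreach (IHTMLElement el in doc.getElementsByTagName("a"))
        {
            if (el.innerText != null && el.innerText.Contains(text))
            {
                el.click();   // navigation starts; wait for DocumentComplete
                return true;
            }
        }
        return false;
    }
}
```

Clicking the element, instead of extracting its href and navigating directly, keeps things like JavaScript onclick handlers and session state working, which is the whole point of the "human-like" approach.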
Labels:
.NET C# technologies,
Data Mining,
IE Automation