Search Engine Evaluation
Data and Text Mining - CS 583
Web search engines provide an interface to search for information on the World Wide Web. The three most popular search engines are Google, Yahoo and MSN Live Search. The aim of the project is to find out which of the three search engines gives the best results.
Goal of the project: Given a search query, predict which of the three search engines would give the most satisfactory result.
Design
The project was conducted in two phases:
Phase 1 - Collecting training data: In this phase the users performed searches using the Google, Yahoo and MSN Live search engines at random. The users did not know which search engine they were using at any given time, which helped in getting an unbiased opinion of how satisfactory the search results were. For each search performed, the user filled out the following:
The Query type (Informational/Navigational)
Are you satisfied with the results? (Satisfied/Partially Satisfied/Unsatisfied)
Page number(s) of the results which you think are best.
Phase 2 - Analysis: In this phase a data mining algorithm was constructed. This algorithm used the training data collected in Phase 1 to learn which search engine gave better results for a particular type of query.
The algorithm uses Google Directory to label each query with its appropriate category. The user can choose how many levels of the category hierarchy are used to categorize the data. Weights are assigned to the level of satisfaction, the page number of the result and the number of words in the query. These weights are used to determine which search engine would give a better result for a particular category.
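The weighting idea can be illustrated with a minimal sketch. The report does not state the weights it used, so the values below, the 1/page discount, and the record field names are all assumptions made for illustration; the word-count weight is left out to keep the example short, and the sample records are invented.

```python
# Hypothetical weights: the report does not give the actual values.
SATISFACTION_WEIGHT = {
    "Satisfied": 1.0,
    "Partially Satisfied": 0.5,
    "Unsatisfied": 0.0,
}

def page_weight(page):
    """Discount results found on later pages (assumed form: 1 / page)."""
    return 1.0 / page

def record_score(satisfaction, best_page):
    """Score of one query record, between 0 and 1."""
    return SATISFACTION_WEIGHT[satisfaction] * page_weight(best_page)

def category_scores(records):
    """Average the record scores per (category, engine), as a percentage."""
    totals = {}
    for r in records:
        key = (r["category"], r["engine"])
        score_sum, count = totals.get(key, (0.0, 0))
        score = record_score(r["satisfaction"], r["best_page"])
        totals[key] = (score_sum + score, count + 1)
    return {key: 100.0 * s / n for key, (s, n) in totals.items()}

# Invented records, for illustration only.
records = [
    {"category": "Sports", "engine": "Google",
     "satisfaction": "Satisfied", "best_page": 1},
    {"category": "Sports", "engine": "Google",
     "satisfaction": "Partially Satisfied", "best_page": 1},
    {"category": "Sports", "engine": "Yahoo",
     "satisfaction": "Satisfied", "best_page": 2},
]
scores = category_scores(records)
```

With these assumed weights, a fully satisfying result found on the second page counts the same as a partially satisfying one on the first page; a different discount would change that trade-off.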
Implementation
The input file contains information about user queries and how satisfied the users were with the results from each of the three search engines. Each query record that is parsed has to be categorized, and Google Directory Search is used for this purpose. The results page of Google Directory Search shows the category in the hierarchy to which each result belongs. The category of a page follows the syntax: "/Top/([^\\?]+)\\?". The entire category tree is stored. At the time of analysis, the first one, two or three levels are picked depending on user requirements. Once all the query records are categorized, the category information in every query is consolidated according to the number of levels chosen and sorted in lexicographic order.
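The category extraction and level selection described above might look roughly like the following. The regular expression is the one quoted in the text, written as a Python raw string (the doubled backslashes in the quoted form are string-literal escaping). The function names and the sample page fragment are hypothetical and are not taken from the project.

```python
import re

# Pattern from the text: capture everything between "/Top/" and the next "?".
CATEGORY_RE = re.compile(r"/Top/([^?]+)\?")

def extract_categories(html):
    """Return each category path found in a results page as a list of levels."""
    return [match.strip("/").split("/") for match in CATEGORY_RE.findall(html)]

def consolidate(paths, depth):
    """Keep the first `depth` levels, drop duplicates, sort lexicographically."""
    return sorted({"/".join(path[:depth]) for path in paths})

# Hypothetical page fragment, for illustration only.
sample_html = (
    '<a href="/Top/Computers/Software/Databases/Data_Mining?tc=1">a</a>'
    '<a href="/Top/Science/Math/Statistics?tc=1">b</a>'
    '<a href="/Top/Computers/Software/Freeware?tc=1">c</a>'
)
paths = extract_categories(sample_html)
print(consolidate(paths, 2))  # ['Computers/Software', 'Science/Math']
```

Storing the full path and truncating only at analysis time, as the text describes, lets the same collected data be re-analysed at a different depth without fetching the categories again.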
Analysis of the word count and query type is also performed on the input data. At the end of this exercise, a list is generated showing the performance of the search engines by the number of words in the query and by the type of the query (Informational or Navigational). It was found that these factors have little influence on the final result.
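A breakdown of this kind can be sketched as a simple group-by. The record layout and the helper name below are assumptions, and the three records are invented purely to make the example runnable.

```python
from collections import defaultdict

def satisfaction_rate(records, key):
    """Percentage of 'Satisfied' records for each (engine, key(record)) group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [satisfied, total]
    for r in records:
        group = (r["engine"], key(r))
        if r["satisfaction"] == "Satisfied":
            counts[group][0] += 1
        counts[group][1] += 1
    return {g: 100.0 * sat / total for g, (sat, total) in counts.items()}

# Invented records, for illustration only.
records = [
    {"engine": "Google", "query": "web data mining",
     "query_type": "Informational", "satisfaction": "Satisfied"},
    {"engine": "Google", "query": "chicago weather",
     "query_type": "Informational", "satisfaction": "Unsatisfied"},
    {"engine": "Yahoo", "query": "uic homepage",
     "query_type": "Navigational", "satisfaction": "Satisfied"},
]

# Group by number of words in the query, then by query type.
by_word_count = satisfaction_rate(records, lambda r: len(r["query"].split()))
by_query_type = satisfaction_rate(records, lambda r: r["query_type"])
```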
To find out whether each of the search engines will give satisfactory results for a new query, the program fetches the categories for this query and looks up the score learnt for each of these categories. The score for the query is the average of the scores of all its categories. If this score is more than 60%, the search engine may give Satisfactory results; if it is more than 50%, it may give Partially Satisfactory results; otherwise, the results may be Unsatisfactory.
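The prediction step for a single search engine can be sketched as follows. The 60% and 50% thresholds come from the text; the function name, the handling of categories with no learnt score, and the example scores are assumptions.

```python
def predict(categories, learned_scores):
    """Average the learnt percentage scores of a query's categories
    and map the average to a satisfaction label."""
    known = [learned_scores[c] for c in categories if c in learned_scores]
    if not known:
        # Assumption: with no learnt category there is nothing to predict.
        return None, "Unknown"
    score = sum(known) / len(known)
    if score > 60:
        label = "Satisfactory"
    elif score > 50:
        label = "Partially Satisfactory"
    else:
        label = "Unsatisfactory"
    return score, label

# Invented learnt scores for one search engine, for illustration only.
learned = {"Computers/Software": 70.0, "Science/Math": 45.0}
score, label = predict(["Computers/Software", "Science/Math"], learned)
print(score, label)  # 57.5 Partially Satisfactory
```

Running the same function once per search engine, each with its own table of learnt scores, gives the three predictions that are compared for a new query.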
Click Here to view a sample result for the query 'Web Data Mining'.
Results
All the queries were divided into 16 categories based on Google Directory. The percentages of satisfied results for each category are shown below:
Google is clearly in the forefront, capturing at least 70% of the hearts of people, and is the clear best for Sports and Games. MSN Live is a good choice for Shopping (they have some interesting review consolidation techniques). Yahoo! is lagging behind and needs to catch up one field at a time. Because Google is already high up in the satisfaction charts, it would have to do a lot to increase its numbers even by a small percentage. Yahoo! and MSN Live, by contrast, have a good chance of increasing their numbers with little effort.
Click Here to view the detailed report.