Building A Search Engine

by discovery Tue Mar 29, 2011 12:22 am

Building A Search Engine

Building A Search Engine

http://www.sumopaint.com/image/?id=1298269

(Page 1 of 2 )

Have
you ever wanted a search engine on your site but didn't know how to do
it? If so, read this article and see for yourself how Daniel Solin
creates a search engine using PHP.A
little while ago, I was working on a intranet-site for a mid-size
company. As the site grew in both size and popularity, the assigner
requested me to extend the site with a search-feature. And, since one of
the general criterias on the intranet was that all logic code should be
written in-house, using one of the open source engines was not an option.After
about a day's work, the engine was quite complete, and the result
actually turned out better than expected. By using a few simple
techniques, and with the help of PHP and MySQL, these kind of projects
are really not a problem anymore. In this article, I will present to you
a cut-down version of the search engine that was developed. I hope this
will get you started, and give you a few good ideas how to develop a
search-engine of your own. With the example used in this article in
hand, you could easily develop an engine that suits your particular
needs, with the exact features that you would like to have.Database Design And LogicThe database
for the search engine consist of three table: page, word and
occurrence. page holds all web pages that has been indexed, and word
holds all words that has been found on the indexed pages in page. The
rows in table occurrence consists of references to rows in page and
word. Each row representing one occurrence of one particular word on one
particular page. The SQL for creating these tables are shown below.[size=118]CREATE TABLE page (
page_id int(10) unsigned NOT NULL auto_increment,
page_url varchar(200) NOT NULL default '',
PRIMARY KEY (page_id)
) TYPE=MyISAM;[/size]CREATE TABLE word (
word_id int(10) unsigned NOT NULL auto_increment,
word_word varchar(50) NOT NULL default '',
PRIMARY KEY (word_id)
) TYPE=MyISAM;CREATE TABLE occurrence (
occurrence_id int(10) unsigned NOT NULL auto_increment,
word_id int(10) unsigned NOT NULL default '0',
page_id int(10) unsigned NOT NULL default '0',
PRIMARY KEY (occurrence_id)
) TYPE=MyISAM;As you see, while page and word hold actual data,
occurrence act only as a reference table. By joining occurrence with
page and word, one can find out on which page(s) a certain word is
mentioned, as well as how many times it is mentioned. We'll get back to
that a little later, however - first we need to populate the database so
that we have some content to work with.Populating The DatabaseOkay,
the database is created and we're ready to feed it with some content.
For this, we'll create a PHP-script that takes a user-specified URL,
reads the document representing the URL, and creates records in the database based on the words it extracts from the document. Take a look at the listing below./*
* populate.php
*
* Script for populating the search-database with words,
* pages and word-occurences.
*/
/* Connect to the database: */
mysql_pconnect("localhost","root","secret")
or die("ERROR: Could not connect to database!");
mysql_select_db("test");
/* Define the URL that sould be processed: */
$url = $_GET['url'];
if( !$url )
{
die( "You need to define a URL to process." );
}
else if( substr($url,0,7) != "http://" )
{
$url = "http://$url";
}
/* Does this URL already have a record in the page-table? */
$result = mysql_query("SELECT page_id FROM page WHERE page_url = "$url"");
$row = mysql_fetch_array($result);
if( $row['page_id'] )
{
/* If yes, use the old page_id: */
$page_id = $row['page_id'];
}
else
{
/* If not, create one: */
mysql_query("INSERT INTO page (page_url) VALUES ("$url")");
$page_id = mysql_insert_id();
}
/* Start parsing through the text, and build an index in the database: */
$fd = fopen($url,"r");
while( $buf = fgets($fd,1024) )
{
/* Remove whitespace from beginning and end of string: */
$buf = trim($buf);
/* Try to remove all HTML-tags: */
$buf = strip_tags($buf);
$buf = ereg_replace('/&\w;/', '', $buf);
/* Extract all words matching the regexp from the current line: */
preg_match_all("/(\b[\w+]+\b)/",$buf,$words);
/* Loop through all words/occurrences and insert them into the database: */
for( $i = 0; $words[$i]; $i++ )
{
for( $j = 0; $words[$i][$j]; $j++ )
{
/* Does the current word already have a record in the word-table? */
$cur_word = strtolower($words[$i][$j]);
$result = mysql_query("SELECT word_id FROM word WHERE word_word = '$cur_word'");
$row = mysql_fetch_array($result);
if( $row['word_id'] )
{
/* If yes, use the old word_id: */
$word_id = $row['word_id'];
}
else
{
/* If not, create one: */
mysql_query("INSERT INTO word (word_word) VALUES ("$cur_word")");
$word_id = mysql_insert_id();
}
/* And finally, register the occurrence of the word: */
mysql_query("INSERT INTO occurrence (word_id,page_id) VALUES ($word_id,$page_id)");
print "Indexing: $cur_word
";
}
}
}
fclose($fd);
?>Basically,
this script connects to the database, registers the URL (the page) in
the database (if it's not already there), starts to retrieve data, uses
the preg_match_all()-function to extract the words from the page, and
then creates a record in the occurrence-table and/or the word-table for
the currently processed word. So, for example, if the script finds the
word 'linux' on http://www.onlamp.com, it will execute the following INSERT-statements:INSERT INTO page (page_url) VALUES ("http://www.onlamp.com");
INSERT INTO word (word_word) VALUES ("linux");
INSERT INTO occurrence (word_id,page_id) VALUES ($word_id,$page_id);However, this is only true if http://www.onlamp.com
has not been indexed yet, and that this occurence of 'linux' is the
first one. If 'linux' occurs once more further down on the page, the two
first statements will not get executed, and the 'old' page_id and
word_id will be used again.Let's
now index a few pages. The seven sites that makes up the O'Reilly
Network is probably a good idea. So, call populate.php with your browser
using the site URLs as the only argument, one at a time:http://localhost/populate.php?url=http://www.macdevcenter.com
http://localhost/populate.php?url=http://www.onjava.com
http://localhost/populate.php?url=http://www.onlamp.com
http://localhost/populate.php?url=http://www.openp2p.com
http://localhost/populate.php?url=http://www.osdir.co
http://localhost/populate.php?url=http://www.perl.com
http://localhost/populate.php?url=http://www.xml.comA quick investigation of the tables now should result in something like this:mysql> SELECT * FROM page;

page_id	page_url
1 2 3 4 5 6 7	http://www.macdevcenter.com http://www.onjava.com http://www.onlamp.com http://www.openp2p.com http://www.osdir.com http://www.perl.com http://www.xml.com

7 rows in set (0.00 sec)mysql> SELECT COUNT(*) FROM word;

COUNT(*)

2423

1 row in set (0.00 sec)mysql> SELECT COUNT(*) FROM occurrence;

COUNT(*)

20628

1 row in set (0.00 sec)There
you go - the database now have indexed seven pages, resulting in 2423
unique words and 20628 occurrences. We're ready for the next-level - the
search-interface.The Search InterfaceOf
course, users of the search-engine will not be able to work with the
MySQL database directly. Therefore, we'll create another PHP-script that
makes it possible to query the database through a HTML-form. This will
work just as any other search-engine: the user enters a word in a
text-box, hits Enter, and the interface presents a result-page with
links to the pages which contains the word that was searched for. In
this example, the order in which the pages are presented is settled by
the number of times the keyword appears in each document. The
search.php-script are listed below/*
* search.php
*
* Script for searching a datbase populated with keywords by the
* populate.php-script.
*/
print "\n";
if( $_POST['keyword'] )
{
/* Connect to the database: */
mysql_pconnect("localhost","root","secret")
or die("ERROR: Could not connect to database!");
mysql_select_db("test");
/* Get timestamp before executing the query: */
$start_time = getmicrotime();
/* Execute the query that performs the actual search in the DB: */
$result = mysql_query(" SELECT
p.page_url AS url,
COUNT(*) AS occurrences
FROM
page p,
word w,
occurrence o
WHERE
p.page_id = o.page_id AND
w.word_id = o.word_id AND
w.word_word = "".$_POST['keyword'].""
GROUP BY
p.page_id
ORDER BY
occurrences DESC
LIMIT ".$_POST['results'] );
/* Get timestamp when the query is finished: */
$end_time = getmicrotime();
/* Present the search-results: */
print "Search results for '".$_POST['keyword']."':

\n";
for( $i = 1; $row = mysql_fetch_array($result); $i++ )
{
print "$i. ".$row['url']."\n";
print "(occurrences: ".$row['occurrences'].")

\n";
}
/* Present how long it took the execute the query: */
print "query executed in ".(substr($end_time-$start_time,0,5))." seconds.";
}
else
{
/* If no keyword is defined, present the search-page instead: */
print "Keyword: \n";
print "Results: value='5'>5 1015 20\n";
print "\n";
}
print "\n";
/* Simple function for retrieving the currenct timestamp in microseconds: */
function getmicrotime()
{
list($usec, $sec) = explode(" ",microtime());
return ((float)$usec + (float)$sec);
}
?>There
are two possible ways this script can get executed: with or without the
keyword-argument. If keyword is defined, the script tries to find that
particular word in the database, and presents the result on-screen.
Additionally, the script also shows how long it took for the database to
come up with the result. If keyword is not defined, the search-page is
presented instead. Enter the URL to this script in your browser, and you
should see something like Figure 1.
Building A Search Engine Bildaseareng_1

Let's search on the keyword 'linux' then. You should now see something like Figure 2.
Building A Search Engine Bildaseareng_2

As
expected, onlamp.com is presented first on the result-page. As stated,
this is simply because the keyword 'linux' appears more frequently on
this site than on the others. A search for 'java' would probably get
onjava.com on the top, and 'xml' would most likely generate most hits on
xml.com. Also note that we selected to limit the number of results to
five on the search-page, which makes only the five most interesting
pages to show up in the result.

Speeding Up The DatabaseAs
you also see on the bottom of the result-page, the query took 0.393
seconds to execute. While this may not seem like an incredible long
time, it would make the searches unreasonable slow as the database
grows. Fortunately, there's a very simple trick you can do to speed this
up. Create an index on the word_word-column:CREATE INDEX word_word_ix ON word (word_word);This
will create an index on the column word_word in the word-table. Since
this column is used every time a search is made with our search-engine,
this will probably make our engine much faster. Let's try it out -
search for the keyword 'linux' again, to see if we gained any
performance. See Figure 3.
Building A Search Engine Bildaseareng_3

Nice!
0.028 seconds - a speed-increase of 0.365 seconds, or 1400%. If this
engine would handle an average of 1000 queries per hour, this would mean
a save in query execution-time of about 144 minutes per day.

by discovery Tue Mar 29, 2011 12:29 am

Building A Search Engine - Summary

(Page 2 of 2 )

As
shown in this article, search engines can be pretty simply built, but
still quite useful. Without much hazzle, you could develop this concept
further to handle multiple keywords, boolean operators, and other
features you find in many commercial search facilities. It
could also be interesting to populate the database further with a few
hunder megs of data. Would the speed still be at en reasonable level?
Well, probably. One thing we could be absolutetly sure of, however, is
that for an intranet of a mid-sized company with just a few dussin
searches per hour, this solution always offer stunning performance.Whether you're planning to develop a big-scale commrcial search engine, or are just playing around, http://www.robotstxt.org/wc/robots.html
offers lots of helpful and interesting reading on this topic. I wish
you good luck, and is looking forward to get a visit from your spider
soon! :)occurrence, not occurence!

DISCLAIMER:
The content provided in this article is not warranted or guaranteed by
Developer Shed, Inc. The content provided is intended for
entertainment and/or educational purposes in order to introduce to the
reader key ideas, concepts, and/or product reviews. As such it is
incumbent upon the reader to employ real-world tactics for security and
implementation of best practices. We are not liable for any negative
consequences that may result from implementing any information covered
in our articles or tutorials. If this is a hardware review, it is not
recommended to open and/or modify your hardware.

http://www.devarticles.com/c/a/HTML/Building-A-Search-Engine/1/

Building A Search Engine

Building A Search Engine

Re: Building A Search Engine