Skip to page content or skip to Accesskey List.
Search evolt.org
evolt.org login: or register

Work

Main Page Content

Boolean Fulltext Searching with PHP and MySQL

Rated 4.35 (Ratings: 19) (Add your rating)

Log in to add a comment
(39 comments so far)

Want more?

 
Picture of altherda

David Altherr

Member info | Full bio

User since: August 08, 2001

Last login: August 08, 2001

Articles written: 1

The Problem

So you've finished two-thirds of an application when your project manager comes to you with a great idea (soon to be requirement), "I was thinking, it would be nice if we could add Boolean support to our search functionality... what do you think?" Now, if the content data is stored in a MySQL database and accessed from a PHP framework, here are a few canned responses to choose from:

  • Would you like some cheese with that whine?
  • Well, it is 'possible', but we may have to wait for the release of MySQL 4.0.
  • Not a problem, I'll have it implemented by tomorrow.

Unfortunately, most of us have settled on like variations of the second response; this is functionality that MySQL has promised with its next full version release. The fact remains, the more web savvy users have come to expect a few options when searching:

  • Search results sorted by relevance
  • Boolean statement support

Support for such options are doubly important when we are dealing with large datasets like knowledge bases and news sites. Fortunately, these are two areas in which MySQL already excels, just not both at the same time. However, there is at least one PHP hack for those of us that can't wait for the 4.0 release.

A Solution

Interactive Example

Before we get into the explanation of how to use it and how it works, why don't we get straight to the point and see what this thing is capable of. Here is a script that lets you test drive this functionality on a small sample database: form.mysql.boolean.php

The Code

The functions: funcs.mysql.boolean.php

<?php

/* * * * funcs.mysql.boolean.php * * * * * * * * * * * * * * * * * * * * *
 *
 *	The following file contains functions for transforming search
 *	strings into boolean SQL.  To download the sample script and
 *	dataset that use these functions, reference:
 *	http://davidaltherr.net/web/php_functions/
 *		boolean/example.mysql.boolean.txt
 *
 * 	Copyright 2001 David Altherr
 *		altherda@email.uc.edu
 *		www.davidaltherr.net
 *
 *	All material granted free for use under MIT general public license
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */



/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 *	:: get_fulltext_key($table) ::
 *	retrieves the fulltext key from a table as a comma delimited
 *	list of values. requires:
 *		a. $mysqldb (selected database)
 *		 OR
 *		b. $table argument in the form 'db.table'
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
function get_fulltext_key($table,$db_connect){
	global $mysqldb;
	mysql_select_db($mysqldb,$db_connect);

	/* grab all keys of db.table */
	$indices=mysql_query("SHOW INDEX FROM $table",$db_connect)
		 or die(mysql_error());
	$indices_rows=mysql_num_rows($indices);

	/* grab only fulltext keys */
	for($nth=0;$nth<$indices_rows;$nth++){
		$nth_index=mysql_result($indices,$nth,'Comment');
		if($nth_index=='FULLTEXT'){
			$match_a[].=mysql_result($indices,$nth,'Column_name');
		}
	}

	/* delimit with commas */
	$match=implode(',',$match_a);

	return $match;
}



/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 *	:: boolean_mark_atoms($string) ::
 * 	used to identify all word atoms; works using simple
 *	string replacement process:
 *    		1. strip whitespace
 *    		2. apply an arbitrary function to subject words
 *    		3. represent remaining characters as boolean operators:
 *       		a. ' '[space] -> AND
 *       		b. ','[comma] -> OR
 *       		c. '-'[minus] -> NOT
 *    		4. replace arbitrary function with actual sql syntax
 *    		5. return sql string
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
function boolean_mark_atoms($string){
	$result=trim($string);
	$result=preg_replace("/([[:space:]]{2,})/",' ',$result);

	/* convert normal boolean operators to shortened syntax */
	$result=eregi_replace(' not ',' -',$result);
	$result=eregi_replace(' and ',' ',$result);
	$result=eregi_replace(' or ',',',$result);

	/* strip excessive whitespace */
	$result=str_replace('( ','(',$result);
	$result=str_replace(' )',')',$result);
	$result=str_replace(', ',',',$result);
	$result=str_replace(' ,',',',$result);
	$result=str_replace('- ','-',$result);

	/* apply arbitrary function to all 'word' atoms */
	$result=preg_replace(
		"/([A-Za-z0-9]{1,}[A-Za-z0-9\.\_-]{0,})/",
		"foo[('$0')]bar",
		$result);

	/* strip empty or erroneous atoms */
	$result=str_replace("foo[('')]bar",'',$result);
	$result=str_replace("foo[('-')]bar",'-',$result);

	/* add needed space */
	$result=str_replace(')foo[(',') foo[(',$result);
	$result=str_replace(')]bar(',')]bar (',$result);

	/* dispatch ' ' to ' AND ' */
	$result=str_replace(' ',' AND ',$result);

	/* dispatch ',' to ' OR ' */
	$result=str_replace(',',' OR ',$result);

	/* dispatch '-' to ' NOT ' */
	$result=str_replace(' -',' NOT ',$result);

	return $result;
}


/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 *	:: boolean_sql_where($string,$match) ::
 * 	function used to transform identified atoms into mysql
 *	parseable boolean fulltext sql string; allows for
 *	nesting by letting the mysql boolean parser evaluate
 *	grouped statements
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
function boolean_sql_where($string,$match){
	$result = boolean_mark_atoms($string);

	/* dispatch 'foo[(#)]bar to actual sql involving (#) */
	$result=preg_replace(
		"/foo\[\(\'([^\)]{4,})\'\)\]bar/",
		" match ($match) against ('$1')>0 ",
		$result);
	$result=preg_replace(
		"/foo\[\(\'([^\)]{1,3})\'\)\]bar/e",
		" '('.boolean_sql_where_short(\"$1\",\"$match\").')' ",
		$result);

	return $result;
}


/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 *	:: boolean_sql_where_short($string,$match) ::
 *	parses short words <4 chars into proper SQL: special adaptive
 *	case to force return of records without using fulltext index
 *	keep in mind that allowing this functionality may have serious
 *	performance issues, especially with large datasets
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
function boolean_sql_where_short($string,$match){
	$match_a = explode(',',$match);
	for($ith=0;$ith<count($match_a);$ith++){
		$like_a[$ith] = " $match_a[$ith] LIKE '%$string%' ";
	}
	$like = implode(" OR ",$like_a);

	return $like;
}



/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 *	:: boolean_sql_select($string,$match) ::
 *	function used to transform a boolean search string into a
 *	mysql parseable fulltext sql string used to determine the
 *	relevance of each record;
 *	1. put all subject words into array
 *	2. enumerate array elements into scoring sql syntax
 *	3. return sql string
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
function  boolean_sql_select($string,$match){
	/* build sql for determining score for each record */
	preg_match_all(
		"([A-Za-z0-9]{1,}[A-Za-z0-9\-\.\_]{0,})",
		$string,
		$result);
	$result = $result[0];
	for($cth=0;$cth<count($result);$cth++){
		if(strlen($result[$cth])>=4){
			$stringsum_long .=
				" $result[$cth] ";
		}else{
			$stringsum_a[] =
				' '.boolean_sql_select_short($result[$cth],$match).' ';
		}
	}
	if(strlen($stringsum_long)>0){
			$stringsum_a[] = " match ($match) against ('$stringsum_long') ";
	}
	$stringsum .= implode("+",$stringsum_a);
	return $stringsum;
}


/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 *	:: boolean_sql_select_short($string,$match) ::
 *	parses short words <4 chars into proper SQL: special adaptive
 *	case to force 'scoring' of records without using fulltext index
 *	keep in mind that allowing this functionality may have serious
 *	performance issues, especially with large datasets
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
function boolean_sql_select_short($string,$match){
	$match_a = explode(',',$match);
	$score_unit_weight = .2;
	for($ith=0;$ith<count($match_a);$ith++){
		$score_a[$ith] =
			" $score_unit_weight*(
			LENGTH($match_a[$ith]) -
			LENGTH(REPLACE(LOWER($match_a[$ith]),LOWER('$string'),'')))
			/LENGTH('$string') ";
	}
	$score = implode(" + ",$score_a);

	return $score;
}


/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 *	:: boolean_inclusive_atoms($string) ::
 *	returns only inclusive atoms within boolean statement
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
function boolean_inclusive_atoms($string){

	$result=trim($string);
	$result=preg_replace("/([[:space:]]{2,})/",' ',$result);

	/* convert normal boolean operators to shortened syntax */
	$result=eregi_replace(' not ',' -',$result);
	$result=eregi_replace(' and ',' ',$result);
	$result=eregi_replace(' or ',',',$result);

	/* drop unnecessary spaces */
	$result=str_replace(' ,',',',$result);
	$result=str_replace(', ',',',$result);
	$result=str_replace('- ','-',$result);

	/* strip exlusive atoms */
	$result=preg_replace(
		"(\-\([A-Za-z0-9]{1,}[A-Za-z0-9\-\.\_\,]{0,}\))",
		'',
		$result);
	$result=preg_replace(
		"(\-[A-Za-z0-9]{1,}[A-Za-z0-9\-\.\_]{0,})",
		'',
		$result);
	$result=str_replace('(',' ',$result);
	$result=str_replace(')',' ',$result);
	$result=str_replace(',',' ',$result);

	return $result;
}


/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 *	:: boolean_parsed_as($string) ::
 *	returns the equivalent boolean statement in user readable form
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
function boolean_parsed_as($string){
	$result = boolean_mark_atoms($string);

	/* dispatch 'foo[(%)]bar' to empty string */
	$result=str_replace("foo[('","",$result);
	$result=str_replace("')]bar","",$result);

	return $result;
}



?>

The code of a simple example implementation complete with sample data: example.mysql.boolean.php
<?php

/* * * * example.mysql.boolean.php * * * * * * * * * * * * * * * * * * * * *
 *
 *	The following file contains sample data to demonstrate the
 *	capability of the functions contained in funcs.mysql.boolean.php:
 *	http://davidaltherr.net/web/php_functions/
 *		boolean/funcs.mysql.boolean.txt
 *	To see the example, load the data into MySQL then run this script
 *
 * 	Copyright 2001 David Altherr
 *		altherda@email.uc.edu
 *		www.davidaltherr.net
 *
 *	All material granted free for use under MIT general public license
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */



/* * * * Example Implementation * * * * * * * * * * * * * * * * * * * * *

   // MySQL Database //

	// architecture //
	CREATE DATABASE news;
	USE news;

	CREATE TABLE quotes (
		id int(9) NOT NULL auto_increment,
		author varchar(255),
		content text,
		PRIMARY KEY  (id),
		UNIQUE KEY id (id),
		FULLTEXT KEY author (author,content)
	) TYPE=MyISAM;

	// data //
	INSERT INTO quotes VALUES(
		10000,'George Stephanopolous',
		'The President has kept all the promises he intended to keep.');
	INSERT INTO quotes VALUES(
		10001,'Dan Quayle',
		'It is wonderful to be here in the great state of Chicago.');
	INSERT INTO quotes VALUES(
		10002,'Marion Barry',
		'Outside of the killings, Washington has one of the lowest crime rates in the country.');
	INSERT INTO quotes VALUES(
		10003,'David Dinkins',
		'I haven\'t committed a crime. What I did was fail to comply with the law.');
	INSERT INTO quotes VALUES(
		10004,'Dan Quayle',
		'It isn\'t pollution that\'s harming the environment. It\'s the impurities in our air and water that are doing it.');
	INSERT INTO quotes VALUES(
		10005,'George Dubya',
		'One word sums up probably the responsibility of any Governor, and that one word is \' to be prepared \'.');
	INSERT INTO quotes VALUES(
		10006,'George Dubya',
		'The most important job is not to be Governor, or First Lady in my case.');


 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */


    // PHP Script //

	// form input //
	$table_name = 'news.quotes';
	$search_string = 'george (governor,president) -responsibility';

	// database connection //
	$db_host = 'localhost';
	$db_user = 'username';
	$db_pwd = 'password';
	$db_connect = mysql_connect($db_host,$db_user,$db_pwd) or die(mysql_error());

	// sql construction //
	require_once('funcs.mysql.boolean.php');
	$fulltext_key = get_fulltext_key($table_name,$db_connect);
	$sql =  "SELECT id, author, content, \n"
		.boolean_sql_select(
			boolean_inclusive_atoms($search_string),
			$fulltext_key)." as relevance \n"
		."FROM $table_name \n"
		."WHERE \n"
		.boolean_sql_where($search_string,$fulltext_key)." \n"
		."HAVING relevance>0 \n"
		."ORDER BY relevance DESC \n";

	// data query //
	$result = mysql_query($sql,$db_connect) or die(mysql_error());
	$result_rows = mysql_num_rows($result);

	// get results //
	$output = "
		<table border=1>
		 <thead>
		  <tr>
		   <th>id</th>
		   <th>author</th>
		   <th>content</th>
		   <th>relevance</th>
		  </tr>
		 </thead>
		 <tbody>";
	for($ith=0;$ith<$result_rows;$ith++){
		$ir=mysql_fetch_row($result);
		$output .= "
			<tr>
			 <td>$ir[0] </td>
			 <td> $ir[1] </td>
			 <td> $ir[2] </td>
			 <td> $ir[3]</td>
			</tr>\n";
	}
	$output .= "
		 </tbody>
		</table>\n";

	// get user readable statement //
	$parsed_as = boolean_parsed_as($search_string);

	// display process //
	echo 	"<h5>Input Statement</h5>\n"
		."<p>$search_string</p>\n"
		."<h5>Parsed As</h5>\n"
		."<p>$parsed_as</p>\n"
		."<h5>SQL Generated</h5>\n"
		."<p>".nl2br($sql)."</p>\n"
		."<h5>Query Results</h5>\n"
		."<p>$output</p>\n";

?>

Script Implementation

The provided functions are mainly used to generate SQL elements of the SELECT and WHERE clauses as follows. First, we must retrieve the FULLTEXT KEY column names from our table, $table_name, via our database connection, $db_connect:

&nbsp;&nbsp; $fulltext_key = get_fulltext_key($table_name,$db_connect);<br/>

And then we build the SQL statement using $fulltext_key and the user input, $search_string:

&nbsp;&nbsp; $sql = "SELECT id, author, content, \n"<br/>
&nbsp;&nbsp;&nbsp;&nbsp; .boolean_sql_select(<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; boolean_inclusive_atoms($search_string),<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $fulltext_key)." \n"<br/>
&nbsp;&nbsp;&nbsp;&nbsp; ."as relevance \n"<br/>
&nbsp;&nbsp;&nbsp;&nbsp; ."FROM $table_name \n"<br/>
&nbsp;&nbsp;&nbsp;&nbsp; ."WHERE \n"<br/>
&nbsp;&nbsp;&nbsp;&nbsp; .boolean_sql_where($search_string,$fulltext_key)." \n"<br/>
&nbsp;&nbsp;&nbsp;&nbsp; ."HAVING relevance>0 \n"<br/>
&nbsp;&nbsp;&nbsp;&nbsp; ."ORDER BY relevance DESC \n";<br/>

Supported Input

Fundamental Operators
The code supports five basic operators:
  • AND
  • OR
  • NOT
  • (
  • )
Shorthand Operators
The code also supports a shorthand syntax similar to ebay's boolean syntax; the two syntaxes can be intermixed at the users discretion:
  • ' ' [space] represents AND
  • ',' [comma] represents OR
  • '-' [hyphen] represents NOT
Nesting, Grouping, Logical Precedence
The code is also capable of nested Boolean statements to any plausible depth (that which MySQL can handle) via the use of parentheses. These characters can also be used to force the order of evaluation of any given statement. Remember that the AND operator has precedence over the OR operator.
Allowed Characters
As it exists, the code allows alphanumeric characters as well as the special characters '.', '_', and '-', providing they are not used as the first character in a subject word. Regardless of the characters allowed in the code, there are certain limitations imposed by MySQL when it runs the fulltext searches.
Minimum Word Length
The actual minimum word length is defined when MySQL is compiled in the myisam/ftdefs.h file: &nbsp;&nbsp; #define MIN_WORD_LEN 4 If you need more functionality and don't want to go through the trouble of recompiling MySQL, no worry: the code is currently written to be adaptive to words of three or less characters. For a given word, the SQL element in the SELECT clause will switch to a simple scoring algorithm that does a case insensitive count of the subject word in all of the FULLTEXT KEY columns without the use of the fulltext index; the SQL element in the WHERE clause will switch to a LIKE string comparison of the subject word with all of the FULLTEXT KEY columns. Note that enabling this additional functionality may create some performance issues with larger datasets as it runs significantly slower than the fulltext system used for words of 4+ characters, perhaps because it does not use a compiled algorithm or the fulltext index.
Stop Words
Restrictions on stop words (common words) when using the fulltext search are still dictated by MySQL once compiled. Remember that stop words are not excluded from the search, they simply generate a zero value for their individual relevance score. However, if the adaptive functions are enabled then the generated SQL will return records with stop words of three or less characters. Again, keep in mind that this functionality does not use the fullext index and is typically quite slow.
Fundamental Constructs
Some basic input:
  • atom1 AND atom2
  • atom1 OR atom2
  • atom1 NOT atom2
Advanced Constructs
Some more advanced input (based on the shorthand syntax):
  • atom1 (atom2,atom3) -atom4
  • (atom1,atom2) -(atom1 atom2) [construct for atom1 XOR atom2]
  • atom1 (atom2,(atom3 -atom4)) -atom5

The Concept

The WHERE Clause
Assuming we have a table with columns title,content indexed on the FULLTEXT KEY, the typical syntax for implementing a fulltext search on said table might be: &nbsp;&nbsp; MATCH (title,content) AGAINST ('george') If the fulltext index contains the word george, the above will return a floating point number, typically between 0.0 (exclusive) and 5.0 for any given record. If the fulltext index does not contain the word george, the above will return 0.0 for any given record. Thus, we can make the above statement evaluate to TRUE or FALSE with the following syntax: &nbsp;&nbsp; MATCH (title,content) AGAINST ('george') > 0 In order to facilitate boolean capability for an expression like &nbsp;&nbsp; george AND dubya the corresponding SQL code in the WHERE clause will be
  &nbsp;&nbsp; MATCH (title,content) AGAINST ('george') > 0<br/>
  &nbsp;&nbsp; AND<br/>
  &nbsp;&nbsp; MATCH (title,content) AGAINST ('dubya') > 0<br/><br/>
  
The SELECT Clause
The SQL in the SELECT clause can be used to calculate the total relevance as declared by the relevance alias for the above example like so:
  &nbsp;&nbsp; MATCH (title,content) AGAINST ('george')<br/>
  &nbsp;&nbsp; +<br/>
  &nbsp;&nbsp; MATCH (title,content) AGAINST ('dubya')<br/>
  &nbsp;&nbsp; as relevance<br/><br/>
  
but the code utilizes a simplification which produces a numerical equivalent:
  &nbsp;&nbsp; MATCH (title,content) AGAINST ('george dubya')<br/>
  &nbsp;&nbsp; as relevance<br/><br/>
  
The HAVING Clause
We can also add a condition to the HAVING clause which will ensure that the query only returns records with a non-zero relevance: &nbsp;&nbsp; relevance > 0 While this may seem redundant considering the SQL in the WHERE clause which already deals which the inclusion and exclusion of records, this allows the developer to impose a minimum relevance restriction on the result set. &nbsp;&nbsp; relevance > 0.2
The ORDER BY Clause
This clause sorts the results by our calculated relevance starting with the records with the highest relevance: &nbsp;&nbsp; relevance DESC

Other Issues

Performance
I've implemented these functions in a knowledge base with 16,000+ records, and they have been running wonderfully since June 2001. The largest table on this system is about 7,000+ records containing 77.1 MB of data with an 8.5 MB fulltext index. Query times for a typical two to four atom statement on this table average about 0.25 seconds providing they use the fulltext index. However, if the SQL is forced to adapt to a three or less character word, query times jump to over 5.0 seconds in some cases. Query times on smaller tables with 1 MB of data and a 300 KB fulltext index are almost negligible, averaging 0.005 seconds with the fulltext index and 0.02 seconds without.
Security
The string replacements in the functions are only utilized to properly manage whitespace in the input string, they do not check for malicious user input. Such checks should be made prior to passing the user input to these functions.

Keep in mind that some of the parameters in these functions are arbitrary or may be specific to the context in which they are used. However, I do encourage you to email me with any general comments or improvements. For more information on this topic: MySQL Full-text Search.

-- David Altherr

Excellent article!

Submitted by mwarden on September 11, 2001 - 13:43.

A tough, but common, problem solved beautifully and explained fully. Great job!

login or register to post comments

Additional Info on MySQL FULLTEXT

Submitted by psydeshow on September 24, 2001 - 18:16.

Never knew MySQL could do this, sweet. Their summary of fulltext index features is at: http://www.mysql.com/doc/F/u/Fulltext_Search.html.

login or register to post comments

Problem with the example code

Submitted by jpmcc on September 28, 2001 - 02:08.

When I run your example code, I get:
Warning: No ending delimiter found in funcs_mysql_boolean.php on line 231
Warning: No ending delimiter found in funcs_mysql_boolean.php on line 235
Warning: No ending delimiter found in funcs_mysql_boolean.php on line 166
Warning: Bad arguments to implode() in funcs_mysql_boolean.php on line 180
This is PHP Version 4.0.3pl1- do I need something special in my php configuration?

Thanks - John

login or register to post comments

Reason for problem

Submitted by texasbhoy on October 9, 2001 - 09:54.

jpmcc, I had the same problem. Fulltext is available with MySQL 3.23.23, if you are using any version earlier than this, you will receive the same errors.

login or register to post comments

Partial solution

Submitted by jpmcc on October 9, 2001 - 15:47.

MySQL version is 3.23.36 here, which makes me suspect a PHP coding problem.
I had an email from David, which I hope he won't mind me quoting here:
John,
    I think this is actually a small error on my part. My best guess at the
fix is to make sure that there are delimiting forward slashes at the very
beginning and end of each preg matching statement. Generally, change

    preg_replace("...","...",$...);

to
    preg_replace("/.../","...",$...);

Or more specifically, change lines 163-166

    preg_match_all(
        "([A-Za-z0-9]{1,}[A-Za-z0-9\-\.\_]{0,})",
        $string,
        $result);

to

        preg_match_all(
            "/([A-Za-z0-9]{1,}[A-Za-z0-9\-\.\_]{0,})/",
            $string,
            $result);

likewise with the other two occurences. As it currently exists, the code
runs fine on versions 4.0.4pl, 4.0.6; perhaps the parser is not as strict in
these versions?? Let me know if this works for you; if it does, I will
change the article code. Thanks for pointing this out.

Thanks,
David Altherr
This fixes the PHP errors, but the code still doesn't work as advertised in my setup. Any further suggestions welcome!
Thanks for your interest - John

login or register to post comments

Partial solution

Submitted by jpmcc on October 9, 2001 - 15:50.

MySQL version is 3.23.36 here, which makes me suspect a PHP coding problem.
I had an email from David, which I hope he won't mind me quoting here:
John,
    I think this is actually a small error on my part. My best guess at the
fix is to make sure that there are delimiting forward slashes at the very
beginning and end of each preg matching statement. Generally, change

    preg_replace("...","...",$...);

to
    preg_replace("/.../","...",$...);

Or more specifically, change lines 163-166

    preg_match_all(
        "([A-Za-z0-9]{1,}[A-Za-z0-9\-\.\_]{0,})",
        $string,
        $result);

to

        preg_match_all(
            "/([A-Za-z0-9]{1,}[A-Za-z0-9\-\.\_]{0,})/",
            $string,
            $result);

likewise with the other two occurences. As it currently exists, the code
runs fine on versions 4.0.4pl, 4.0.6; perhaps the parser is not as strict in
these versions?? Let me know if this works for you; if it does, I will
change the article code. Thanks for pointing this out.

Thanks,
David Altherr
This fixes the PHP errors, but the code still doesn't work as advertised in my setup. Any further suggestions welcome!
Thanks for your interest - John

login or register to post comments

bad arguments

Submitted by isamujon on November 5, 2001 - 04:00.

I am receiving the error

Bad arguments to implode() in funcs.mysql.boolean.txt on line 45

Not real sure what is the problem. I am relatively new to php and mysql, so I wouldn't have the slightest clue what could be going wrong. any suggestions?

Thanks!

login or register to post comments

implode error

Submitted by altherda on November 15, 2001 - 22:27.

For those receiving the following error

   Bad arguments to implode() in funcs.mysql.boolean.txt on line 45

Check that the table you are querying has a fulltext index. You can check the indices on a table by typing the following at the mysql monitor command line

   mysql> SHOW INDEX FROM $tablename

where $tablename is the name of the table you are querying. For information on creating a Fulltext index reference: http://www.mysql.com/doc/F/u/Fulltext_Search.html

login or register to post comments

RE: implode error

Submitted by jpmcc on November 16, 2001 - 13:58.

Don't think that's it:
Key_Name    Seq_in_index Column_Name Collation Sub_part Comment
Search_Index      1      Page_Heading   A               FULLTEXT 
Search_Index      2      Content        A        1      FULLTEXT
John

login or register to post comments

How did you get FULLTEXT to appear in the comment?

Submitted by CampusParty on November 17, 2001 - 00:37.

How do you get the comment to read 'FULLTEXT' mine reads 'NULL'

+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+---------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Comment |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+---------+
| quotes |          0 | PRIMARY  |            1 | id          | A         |        NULL |     NULL | NULL   | NULL    |
| quotes |          0 | id       |            1 | id          | A         |        NULL |     NULL | NULL   | NULL    |
| quotes |          1 | author   |            1 | author      | A         |        NULL |     NULL | NULL   | NULL    |
| quotes |          1 | author   |            2 | content     | A         |        NULL |        1 | NULL   | NULL    |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+---------+
4 rows in set (0.00 sec)

login or register to post comments

RE: How do you get the comment to read 'FULLTEXT

Submitted by jpmcc on November 17, 2001 - 08:05.

I used PHPMyAdmin ...

John

login or register to post comments

Hmm

Submitted by CampusParty on November 17, 2001 - 21:39.

Well, To all you MySQL wizards...

I did the following
DROP INDEX author ON quotes;

then:
ALTER TABLE quotes ADD FULLTEXT author (author, comment);

Then:
SELECT * FROM quotes WHERE MATCH (author, content) AGAINST ('george');

and got:

+-------+-----------------------+--------------------------------------------------------------------------------------------------------+
| id    | author                | content                                                                                                |
+-------+-----------------------+--------------------------------------------------------------------------------------------------------+
| 10000 | George Stephanopolous | The President has kept all the promises he intended to keep.                                           |
| 10006 | George Dubya          | The most important job is not to be Governor, or First Lady in my case.                                |
| 10005 | George Dubya          | One word sums up probably the responsibility of any Governor, and that one word is ' to be prepared '. |
+-------+-----------------------+--------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)

Which I expected... but still the Index looks like

mysql> SHOW INDEX FROM quotes;
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+---------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Comment |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+---------+
| quotes |          0 | PRIMARY  |            1 | id          | A         |        NULL |     NULL | NULL   | NULL    |
| quotes |          0 | id       |            1 | id          | A         |        NULL |     NULL | NULL   | NULL    |
| quotes |          1 | author   |            1 | author      | A         |        NULL |     NULL | NULL   | NULL    |
| quotes |          1 | author   |            2 | content     | A         |        NULL |        1 | NULL   | NULL    |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+---------+
4 rows in set (0.00 sec)

With NULL in the "Comment" column
which, obviously, fails in the script.

Without using PHPMyAdmin ... Is there something I missed? If I can just get the "Comment" column to contain "FULLTEXT" instead of "NULL", I think I can get this thing to work.

login or register to post comments

fulltext keys

Submitted by altherda on November 20, 2001 - 05:15.

On my install, I did not use PHPMyAdmin to create the indices. In the provided example, the fulltext index is created when the table is created, although I can't imagine that there would be a problem in creating the index after the table is created. The installs that I have run this code on are all MySQL 3.23.36 or greater.

For those that cannot get rid of the 'implode error' because the 'fulltext' string is not showing up in the comment column when the indices are list or for whatever reason, there is a hardcoded alternative that circumvents the get_fulltext_key() function entirely. If you know the fulltext key, you can define it explicitly when constructing the SQL. To implement this in the example, one would replace the following code

  $fulltext_key = get_fulltext_key($table_name,$db_connect);

with

  $fulltext_key = 'author,content';

login or register to post comments

That fixed it!

Submitted by CampusParty on November 20, 2001 - 12:23.

The version of MySQL I am using (3.23.24-beta) must not provide the FULLTEXT comment... time to upgrade. Using the fix provided, the sample now works. And, I must say, it works beautifully. Now I've got to get to the real database. :)

login or register to post comments

not getting good matches despite relevance

Submitted by webnick on November 24, 2001 - 02:38.

i don't really get this. sometimes my fulltext searches produce results and sometimes not, without any apparent pattern. single word searches not returning anything, only some word combos working despite that they should return matches. here's an example sql query:

SELECT id, product, description, match (product,description) 
against (' sandra diamond ') AS relevance 
FROM Hhood.inventory WHERE 
match (product,description) against ('sandra')>0 AND 
match (product,description) against ('diamond')>0 
HAVING relevance>0 ORDER BY relevance DESC 
so, the above returns nothing even though a product is named ' sandra diamond ' and another ' sandra diamond ring '. this:
 
SELECT id, product, description, 
match (product,description) against (' Ring ') AS relevance 
FROM Hhood.inventory 
ORDER BY relevance DESC 
returns all items, all with zero 'relevance' even though only a few have 'ring' in either the 'product' or description' colums. following is my table data. any advise on why this table seems finaky would be appreciated.

many thanks,
nick

 
# 
# Table structure for table `inventory` 
# 
CREATE TABLE inventory ( 
product varchar(255) NOT NULL default '', 
quantity int(6) NOT NULL default '1', 
id int(6) NOT NULL auto_increment, 
description text, 
price float(10,2) NOT NULL default '0.00', 
category varchar(255) NOT NULL default '1', 
bin_data longblob, 
filename varchar(50) default NULL, 
filesize varchar(50) default NULL, 
filetype varchar(50) default NULL, 
PRIMARY KEY (id), 
UNIQUE KEY id (id), 
FULLTEXT KEY product (product,description) 
) TYPE=MyISAM; 

# 
# Indexes: 
# 
Keyname     /Unique    /Fulltext   /Field 
PRIMARY     /Yes       /No         /id 
id          /Yes       /No         /id 
product -1- /No        /Yes        /product 
product -2- /No        /Yes        /description 1 

login or register to post comments

Not getting good matches

Submitted by jpmcc on November 25, 2001 - 05:29.

As I understand it, the MySQL fulltext search is designed to find a few records from many - "find needle in haystack". If there are no needles in your haystack, it won't return any. Conversely, if your haystack is full of needles, MySQL doesn't know which needle you want, so again (and confusingly IMHO) it doesn't return any.

If you'd like to see this in action, look at this site. Using the search box on the left hand side of the screen, search for 'Arscott'. The serach engine should return the two pages where the name 'Arscott' is found. Now search for the word 'Balerno'. This is found all over the site, but MySQL returns 'not found'...

I can see why the designers of MySQL did it like this, but it is veryconfusing...

John

login or register to post comments

the nature of MySQL and fulltext tuning

Submitted by altherda on November 26, 2001 - 23:04.

   If you are getting seemingly random results or none at all, I hope the following conceptual clarifications help; alot of this restates parts of the above article and MySQL docs (which I trust that you have already read) for clarity. First, the following code in an sql statement

  match (column1,column2) against ('foobar')

generates a decimal number (e.g. 0.5934824...) indicative of the relevance of any given row; it does not filter records or sort by relevance, it only generates a number, which can only be zero or positive reals. Now, if the above code is placed in the SELECT clause of an SQL statement like so

 
 SELECT <br>
&nbsp;&nbsp; column1, <br>
&nbsp;&nbsp; match (column1,column2) against ('foobar') AS relevance <br>
&nbsp; FROM table1;


then MySQL will return a recordset containing all rows and showing column1 and the generated relevance number as a column under the alias relevance. Note that this is how it should work; while totally non-relevant records will be returned, a zero value will be returned for each of these records in the relevance column. In order to filter the record set, one can impose a condition on the relevance of a given row by placing arguments in the WHERE or HAVING clause like so

 
 SELECT <br>
&nbsp;&nbsp; column1, <br>
&nbsp;&nbsp; match (column1,column2) against ('foobar') AS relevance <br>
&nbsp; HAVING relevance>0 <br>
&nbsp; FROM table1;


or equivalently (aliases defined in the SELECT clause can not be referenced in the WHERE clause)

 
 SELECT <br>
&nbsp;&nbsp; column1, <br>
&nbsp;&nbsp; match (column1,column2) against ('foobar') AS relevance <br>
&nbsp; WHERE match (column1,column2) against ('foobar')>0 <br>
&nbsp; FROM table1;


The above two statements will return only rows with nonzero relevance as would usually be expected when searching. Before you try to implement the provided functions, I would suggest verifying that you can get these simple statements to work with your database setup.

   As for the inherent restrictions in the MySQL fulltext system, under a normal install MySQL does not index words of less than four characters, and will therefore not return the row if it is matched against a subject word of less than four characters. Also, MySQL assigns a weight to each subject word proportional to the word's 'uniqeness' within the index. That is, if a word has a high frequency within a table, then it will have a low relevance. Conversely, if a word has a low frequency then it will have a higher weight. The extreme of this weighting system is when MySQL finds the word in more than 50% of the rows, in which case MySQL will assign a weight value of zero due to its high frequency regarding it as a noise word. The system is indeed optimized for larger tables as jpmcc points out; you may need to expand your dataset a bit if you are running test cases on a small database. Note that you can fine-tune these parameters if you have the time to recompile MySQL; reference 6.8.2 Fine-tuning MySQL Full-text Search. However, I will say that in most cases the default values of these variables will suffice for most databases. With regard to the example presented by jpmcc, I would argue that this is desired functionality in most cases: if I search for 'microsoft' on 'www.microsoft.com', the system will still return records but the results are basically useless because I would be searching for a noise word specific to that database.

   I do agree that although this functionality may be desirable in most cases it can be very confusing at times, especially when we start adding boolean statements in the sql. For instance, if we search for 'Arscott AND Balerno' on the above example site the relevance for 'Balerno' will show up as zero in all rows because of its high frequency in the table; remember that the relevance for 'Arscott' will show up as either a zero or a positive number. In the WHERE clause the boolean statement will reduce to either '1 AND 0' or '0 AND 0', which in both cases simplifies logically to '0': this means that no records will be returned. Again, this is due to the wieghting system in which MySQL assigns a zero weight to the 'Balerno' word due to its high frequency. In this case, the inherent functionality is probably not desired nor intuitive.

-David

login or register to post comments

Okay, almost there

Submitted by isamujon on November 29, 2001 - 00:23.

I tried your fix of using $fulltext_key = 'author,content' , but am now getting the error (instead of the line 45 implode error)

Can't find FULLTEXT index matching the column list.

I have no idea what this might mean. Could somebody shed some light as to what may be the problem?

login or register to post comments

Submitted by isamujon on November 29, 2001 - 18:02.

Hey guys. I got it working! It works like a champ. One other question though. After I do the search and display those items, how do I select some of the non-searched fields for display? I thought I could just throw it into the $sql = SELECT command, but it didn't like that very much. In other words, I am searching across a couple of fields, but need to display all of them when i want to output the results. I appreciate the help.

login or register to post comments

Submitted by isamujon on November 29, 2001 - 18:05.

well, nevermind. I'm an idiot, I just forgot to put a comma before the /n command in the $sql = SELECT line. I was right afterall. Sorry for the inconvenience.

login or register to post comments

Submitted by isamujon on November 29, 2001 - 18:09.

I fixed it, but now I've come across a different dilemma. Before when I entered a blank query, I recieved no errors, and the entire table was displayed as a search result. Now I am recieving this error

Warning: Bad arguments to implode() in /home/www/funcs.mysql.boolean.php on line 179 You have an error in your SQL syntax near 'as relevance FROM table.stuff WHERE HAVING relevance>0 ORDER BY releva' at line 2

Other search queries seem to be unaffected.

login or register to post comments

Submitted by isamujon on December 4, 2001 - 17:21.

Me again. How would I implement some error trapping scheme, where if the user inputs a - at the beginning, or a $, %, ^, or invalid inputs, so that I don't get a mysql query error on the database? Or at least have it display Invalid input rather than

You have an error in your SQL syntax near 'as relevance FROM db.table WHERE $ HAVING relevance>0 ORDER BY id ' at line 2

??

login or register to post comments

how to handle literal strings

Submitted by jsignor on December 11, 2001 - 21:06.

I really like the article. However, why did you elect to not support quoted strings (either literally or with another symbol)? if I need to search for phrase "hot dog" and not the phrase "dog hot", there does not appear to be a way to currently handle this since (hot and dog) and not(dog and hot) yields nothing.

login or register to post comments

Submitted by Dries on December 15, 2001 - 16:48.

I don't want to spoil the fun but I generally prefer to use ANSI SQL: it keeps your code portable but then again no database implements ANSI SQL correctly. They all have their own little quirks. Fact remains that things might get unweildy when you use a lot of database specific queries. -- Dries - drop.org - drupal.org

login or register to post comments

What am I missing here?

Submitted by rocketchef on January 23, 2002 - 10:41.

My testing returns this: search: dubya NOT responsibility SQL contains: LIKE '%$1%' I replaced $1 and $0 with $string, but this places the whole search within the word fields. What have I missed?

login or register to post comments

REF:what am I missing here

Submitted by cjoseph on January 27, 2002 - 21:18.

Am not sure that you have missed anything. I had the same problem. It seemed to stem from the syntax used in the preg_replace() statements that include references to the text matched by the pattern in the 'replacement pattern' parameter. $0,$1...$n. The documentation (http://www.php.net/manual/en/function.preg-replace.php) has this to say: Replacement may contain references of the form \\n or (since PHP 4.0.4) $n, with the latter form being the preferred one. Every such reference will be replaced by the text captured by the n'th parenthesized pattern. n can be from 0 to 99, and \\0 or $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing subpattern. Am relatively new to PHP so I am not sure whether I am running a version that is NOT happy with the form $n but whatever the reason it is happy with the form \\n. Change your preg_replace statements to reflect this and you should be golden!

login or register to post comments

What about queted strings?

Submitted by miraclemaker on February 22, 2002 - 10:47.

As jsignor said, is there any plan to extend this search function to include quoted strings so a seach for "fast cars" will find where the two words appear together, rather than just records which have the two words somewhere in the text. Thanks.

login or register to post comments

small fix to the sql statements

Submitted by bellipses on July 17, 2002 - 09:09.

a small fix that i spent quite a few hours trying to figure out... i'm running mysql 3.23.49-nt on a win2k platform... the script:

 
 SELECT <br>
&nbsp;&nbsp; column1, <br>
&nbsp;&nbsp; match (column1,column2) against ('foobar') AS relevance <br>
&nbsp; HAVING relevance>0 <br>
&nbsp; FROM table1;


should actually be inputted as follows:

 
 SELECT <br>
&nbsp;&nbsp; column1, <br>
&nbsp;&nbsp; match (column1,column2) against ('foobar') AS relevance <br>
&nbsp; FROM table1<br>
&nbsp; HAVING relevance>0;


i found this to work, however, i could not get the former script to execute as it gave me an SQL Syntax error...

hope this helps someone else...
b...

login or register to post comments

general sql structure

Submitted by altherda on July 22, 2002 - 19:38.

b,
   Indeed, the above example would always lead to an SQL error; the general structure for the provided code should be of the general form:

&nbsp;&nbsp; SELECT ... FROM ... WHERE ... HAVING ... <br>

and not as my typo would suggest in one of the above aftertoughts. Of course, SQL statements which can utilize the above code are not limited to this form alone, but I believe the allowed sequence of the above clauses is strictly limited. Regardless, thank you for pointing out said blunder.

login or register to post comments

Substring Search

Submitted by rudiw on August 12, 2002 - 13:46.

Hi, The FULLTEXT search doesn't work unless you specify complete words... e.g. "Accommodation" will return rows only if the full word is found. When searching for "Accom" the query returns no results unlike the WHERE...LIKE '%$Keyword%' method. How can I solve this problem? MATCH...AGAINST('%$Keyword%') doesn't seem to work and I don't want to use LIKE in my queries... Hope anybody out there can help. Please email me: rudiw@webonline.co.za Thanx a stack ;)

login or register to post comments

No Results

Submitted by tedhattemer on August 18, 2002 - 13:05.

I changed $0 and $1 to \\0 and \\1 in order to capture what was being processed between ()... however... I don't get back any rows when I search on things I know should work. I am quite new to the concepts here, but am following them closely. Suggestions?
SELECT id, author, content, 
0.2*(
LENGTH(author) -
LENGTH(REPLACE(LOWER(author),LOWER('Dan'),'')))
/LENGTH('Dan') + 0.2*(
LENGTH(content) -
LENGTH(REPLACE(LOWER(content),LOWER('Dan'),'')))
/LENGTH('Dan') + match (author,content) against (' George ') as relevance 
FROM test.quotes 
WHERE 
match (author,content) against ('George')>0 AND ( author LIKE '%Dan%' OR content LIKE '%Dan%' ) 
HAVING relevance>0 
ORDER BY relevance DESC

login or register to post comments

Further question

Submitted by tedhattemer on August 18, 2002 - 15:46.

It seems that AND is not returning results as I would imagine it would. Is there something I am missing? Could it be that my data set just doesn't have enough records at this point?

login or register to post comments

boolean operators in SQL

Submitted by altherda on August 18, 2002 - 19:07.

tedhattemer,
    Recall that boolean search operators effectively operate on individual records and not entire datasets. For example, george AND bush will return a dataset in which each individual record contains both george AND bush. Likewise, george OR bush will return a dataset in which each individual record contains at least either george OR bush. If one uses the given example database and runs the Dan AND George query, an empty set will be returned because no single record contains both Dan AND George. However, one can make the dataset operator analogy as follows
  • AND -> INTERSECT
  • OR &nbsp;-> UNION
  • NOT -> SUBTRACT
Thus, george AND bush with respect to individual records is analagous to george INTERSECT bush with repect to datasets, i.e. if one were to run george and bush in two seperate queries and then INTERSECT the result datasets, this would be equivalent to running the george AND bush query.

login or register to post comments

3.22

Submitted by kelkafa on August 21, 2002 - 03:08.

ok guys so what do the old search scripts use to search the mysql databases? you said that fulltextindex function is new as of mysql 3.23.23 so how does a script which runs under mysql 3.22 work? thanx...

login or register to post comments

Partial Searches

Submitted by tedhattemer on September 3, 2002 - 09:59.

Is there an easy way to match partial words using this script? I want to match 'math' to 'mathematics' -? Presently I need to search for the the full word 'mathematics' inorder to get back results. %$results%

login or register to post comments

Increasing "short word" length

Submitted by tedhattemer on September 7, 2002 - 16:01.

To get partial words to match up to an actual entry within the database I increased the "short word" length to 10 characters. So now any word that is less than 10 characters will get %$results% in the sql. function boolean_sql_where($string,$match){ $result = boolean_mark_atoms($string); /* dispatch 'foo[(#)]bar to actual sql involving (#) */ $result=preg_replace( "/foo\[\(\'([^\)]{10,})\'\)\]bar/", " match ($match) against ('$1')>0 ", $result); $result=preg_replace( "/foo\[\(\'([^\)]{1,11})\'\)\]bar/e", " '('.boolean_sql_where_short(\"$1\",\"$match\").')' ", $result); return $result; } /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * :: boolean_sql_where_short($string,$match) :: * parses short words

login or register to post comments

sorry, here is the code

Submitted by tedhattemer on September 7, 2002 - 16:05.

<?php
function boolean_sql_where($string,$match){
    
$result = boolean_mark_atoms($string);

    
/* dispatch 'foo[(#)]bar to actual sql involving (#) */
    
$result=preg_replace(
        
"/foo\[\(\'([^\)]{10,})\'\)\]bar/",
        
" match ($match) against ('$1')>0 ",
        
$result);
    
$result=preg_replace(
        
"/foo\[\(\'([^\)]{1,11})\'\)\]bar/e",
        
" '('.boolean_sql_where_short(\"$1\",\"$match\").')' ",
        
$result);

    return
$result;
}


/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
*    :: boolean_sql_where_short($string,$match) ::
*    parses short words <4 chars into proper SQL: special adaptive
*    case to force return of records without using fulltext index
*    keep in mind that allowing this functionality may have serious
*    performance issues, especially with large datasets
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
function boolean_sql_where_short($string,$match){
    
$match_a = explode(',',$match);
    for(
$dataith=0;$dataith
?>

login or register to post comments

Searching for phrases like C/C++

Submitted by martinhej on September 22, 2005 - 09:05.

Has anybody run into a problem with searching phrases containing character + ? If yes and has a fine working solution I would greatly appreciate him/her to share the knowledge. Have a nice day

login or register to post comments

How can search the small string like C++

Submitted by jaikumar on December 9, 2008 - 23:44.

hi, first of all, thanks for the script so well. Here, i need to know the enhance the script functionality to search some keyword like C++. I amd developing a job portal and as u know there would be must a feature to search the short keyword with specila keywords i.e. ++ etc. Could you help me to find those in this script to work more precisely. Have a nice day

login or register to post comments

The access keys for this page are: ALT (Control on a Mac) plus:

evolt.orgEvolt.org is an all-volunteer resource for web developers made up of a discussion list, a browser archive, and member-submitted articles. This article is the property of its author, please do not redistribute or use elsewhere without checking with the author.