EllisLab text mark
Advanced Search
1 of 2
1
   
Suggestion: Search class?
Posted: 13 April 2011 06:49 PM   [ Ignore ]
Joined: 2010-04-04
69 posts

I have a real problem getting a good search engine into my sites. There seem to be no good scripts for implementing one in php and sql and only a few alternatives. For example you can use the LIKE statement in sql to use wildcard matches on fields, or you can use fulltext matching in myisam table type in mysql. Or you can use a crawler (don’t like this idea as it doesn’t see my db or what to search), or if you have a dedicated server you can use something good like Sphinx search, but how many people are on dedicated? I’m open to suggestions on better approaches for shared servers.

So this brings me to my main point: Wouldn’t a search class in CI be a really great idea? I’m ok at coding but I like the idea of handing it over to a specialist who’s a mathematical genius and can just code up something great we can all use.

 
Posted: 13 April 2011 09:58 PM   [ Ignore ]   [ # 1 ]   [ Rating: 0 ]
Joined: 2009-11-06
46 posts

It would be great, but it would take some doing. A LOT of doing.

 
Posted: 13 April 2011 10:48 PM   [ Ignore ]   [ # 2 ]   [ Rating: 0 ]
Avatar
Joined: 2009-11-10
69 posts

Fulltext searching is the best option, it is extremely fast and has extra features (e.g. score based on keyword relevancy). I would think though that creating a generic library would be quite difficult due to different database table structures. Here’s a mini-tutorial on how to implement fulltext searching:

Example table:

CREATE TABLE IF NOT EXISTS `blog_posts` (
  `
post_idint(9NOT NULL auto_increment,
  `
titlevarchar(128) default NULL,
  `
contentstext NOT NULL,
  `
statusenum('Draft','Publish') default NULL,
  `
creationdatetime NOT NULL,
  
PRIMARY KEY  (`post_id`)
ENGINE=MyISAM  DEFAULT CHARSET=utf8 AUTO_INCREMENT=1

First, determine which fields you are going to search. Note that this only works on varchar and text field types (someone please correct me if I’m wrong).
For this table, we will use title and contents.

ALTER TABLE `blog_postsADD FULLTEXT `search_index` (`title`, `contents`); 

Your table is now all set up for fulltext searching! Now you just need to run your SQL query like so:

$results $this->db
    
->select('blog_posts.*')
    ->
select("(MATCH(`title`, `contents`) AGAINST ('" $keywords "')) AS score")
    ->
where("(MATCH(`title`, `contents`) AGAINST ('" $keywords "'))")
    ->
order_by('score''desc')
    ->
get('blog_posts')
    ->
result_array(); 

As an added bonus, these results are ordered by relevancy, the most relevant result will be at the top.

 Signature 

Website coming soon!

 
Posted: 14 April 2011 09:07 AM   [ Ignore ]   [ # 3 ]   [ Rating: 0 ]
Joined: 2010-04-04
69 posts

Thank you for taking the time to write this approach out.

The problem I’ve found with fulltext is that whilst it’s a lot better than using LIKE() function, it won’t match certain searches. So if you have a search like “form” it will only find “form” and not “forms” or “form’s” for example. It has a lot of drawbacks. Also will not match searches less than 3 chars and has blocked words.

I find it really strange that no one has gone to the trouble of explaining stemming properly on the web to fix this. I followed tutorials from various sites but it’s like once guy wrote some incorrect code and everyone else just copied his code for their tutorials.

 
Posted: 14 April 2011 02:18 PM   [ Ignore ]   [ # 4 ]   [ Rating: 0 ]
Joined: 2009-11-06
46 posts

If it’s just a single word search then you can use the inflector helper in CI to get what you want like this:

$this->load->helper('inflector');
//say you get the search term from a form
$term $this->input->post('search_term');

//you may want to inspect the term for special characters and take them out here

//get the singular form
$singular singular($term);
//get the plural form
$plural plural($singular);
$saxon_genitive $singular "\\'s " $plural "\\'";

$keywords $singular ' ' $plural ' ' $saxon_genitive;
//then you can run the query from above in boolean mode
$results $this->db
    
->select('blog_posts.*')
    ->
select("(MATCH(`title`, `contents`) AGAINST ('" $keywords "' IN BOOLEAN MODE)) AS score")
    ->
where("(MATCH(`title`, `contents`) AGAINST ('" $keywords "' IN BOOLEAN MODE))")
    ->
order_by('score''desc')
    ->
get('blog_posts')
    ->
result_array(); 

Of course, this is a start. You can enhance it and tweak it to fit your desire. You can even allow the user to enter many words and then you would do the above for all of them. Of course you would have to have a list of stop words etc. etc. This kind of functionality however cannot be programmed generally like you suggested because of so many other things that would take me hours to list here.

 
Posted: 14 April 2011 05:35 PM   [ Ignore ]   [ # 5 ]   [ Rating: 0 ]
Avatar
Joined: 2009-11-10
69 posts
Ninjabear - 14 April 2011 01:07 PM

So if you have a search like “form” it will only find “form” and not “forms” or “form’s” for example. It has a lot of drawbacks.

Boolean fulltext searches. Basically allows you to append “*” to your words. e.g.

$this->db
      
->where("MATCH(field1, field2) AGAINST ('word1* word2*' IN BOOLEAN MODE)")
      ->
get('table')
      ->
result_array(); 
Ninjabear - 14 April 2011 01:07 PM

Also will not match searches less than 3 chars and has blocked words.

Solution 1: Change your MySQL conf to allow words with 3 characters.
Solution 2: If that is not an option, use a LIKE statement fallback if the word is 3 characters or less. i.e. First do a fulltext search, and if no results, do a LIKE search. Or optionally, do a fulltext search if the keywords is greater than 3 chars, otherwise do LIKE.

 Signature 

Website coming soon!

 
Posted: 14 April 2011 06:32 PM   [ Ignore ]   [ # 6 ]   [ Rating: 0 ]
Avatar
Joined: 2009-11-10
69 posts

This has actually piqued my interest in building a generic search class, however difficult it may be. I love challenges raspberry

 Signature 

Website coming soon!

 
Posted: 15 April 2011 08:52 AM   [ Ignore ]   [ # 7 ]   [ Rating: 0 ]
Joined: 2010-04-04
69 posts

Thanks for all suggestions here. I’ll definitely come back and look through this stuff when I decide to improve the search feature I have at the moment.

 
Posted: 15 April 2011 02:19 PM   [ Ignore ]   [ # 8 ]   [ Rating: 0 ]
Joined: 2009-11-06
46 posts
wh1tel1te - 14 April 2011 09:35 PM
Ninjabear - 14 April 2011 01:07 PM

So if you have a search like “form” it will only find “form” and not “forms” or “form’s” for example. It has a lot of drawbacks.

Boolean fulltext searches. Basically allows you to append “*” to your words. e.g.

That is a good idea but you will get results that may be semantically irrelevant. With that approach, if you searched for “form” you’d get back documents that contain “formulas” or “formula” or “formulate” and basically every word that starts with “form,” which will be irrelevant to the search.

 
Posted: 15 April 2011 06:17 PM   [ Ignore ]   [ # 9 ]   [ Rating: 0 ]
Joined: 2010-04-04
69 posts
CI Coder - 15 April 2011 06:19 PM

...
That is a good idea but you will get results that may be semantically irrelevant. With that approach, if you searched for “form” you’d get back documents that contain “formulas” or “formula” or “formulate” and basically every word that starts with “form,” which will be irrelevant to the search.

So stemming is a good cure for this? If so it would be great if someone could do quick tutorial on this which explained how to do it.

 
Posted: 15 April 2011 08:15 PM   [ Ignore ]   [ # 10 ]   [ Rating: 0 ]
Avatar
Joined: 2009-11-10
69 posts
CI Coder - 15 April 2011 06:19 PM

That is a good idea but you will get results that may be semantically irrelevant. With that approach, if you searched for “form” you’d get back documents that contain “formulas” or “formula” or “formulate” and basically every word that starts with “form,” which will be irrelevant to the search.

Personally I would say that is acceptable if you still wanted to return results such as “forms”, because that is the only way this will work. If you order by the “score” that MySQL returns (relevancy) descending, results such as “formulas” will display at the bottom because it isn’t as relevant as results with an exact match of “form”.

I guess it’s your choice really: non-boolean searching which will return exact matches only, or boolean searching which will return partial matches with a chance that it will return slightly irrelevant results.

 Signature 

Website coming soon!

 
Posted: 15 April 2011 11:51 PM   [ Ignore ]   [ # 11 ]   [ Rating: 0 ]
Joined: 2009-11-06
46 posts
wh1tel1te - 16 April 2011 12:15 AM

Personally I would say that is acceptable if you still wanted to return results such as “forms”, because that is the only way this will work. If you order by the “score” that MySQL returns (relevancy) descending, results such as “formulas” will display at the bottom because it isn’t as relevant as results with an exact match of “form”.

I guess it’s your choice really: non-boolean searching which will return exact matches only, or boolean searching which will return partial matches with a chance that it will return slightly irrelevant results.

Here is a little experiment:

Create the following table with the following data:

DROP TABLE IF EXISTS `test_searches`;
CREATE TABLE `test_searches` (
  `
idint(11NOT NULL AUTO_INCREMENT,
  `
bodyvarchar(250) DEFAULT NULL,
  
PRIMARY KEY (`id`),
  
FULLTEXT KEY `fulltext_search_index` (`body`)
ENGINE=MyISAM AUTO_INCREMENT=DEFAULT CHARSET=latin1;

-- ----------------------------
-- 
Records of test_searches
-- ----------------------------
INSERT INTO `test_searchesVALUES ('1''This is about your tax form');
INSERT INTO `test_searchesVALUES ('2''I am formulating a new theory about searching and about different forms it can take');
INSERT INTO `test_searchesVALUES ('3''what is the formula for the area of a circle?');
INSERT INTO `test_searchesVALUES ('4''New forms of life');
INSERT INTO `test_searchesVALUES ('5''some text that does not contain the word');
INSERT INTO `test_searchesVALUES ('6''irrelevant text');
INSERT INTO `test_searchesVALUES ('7''etc');
INSERT INTO `test_searchesVALUES ('8''form, form form form form'); 

Now I run the following query:

SELECT *, MATCH (bodyAGAINST'form*' IN BOOLEAN MODE ) AS score FROM test_searches

Here are the results:

+----+-------------------------------------------------------------------------------------+-------+
id body                                                                                score |
+----+-------------------------------------------------------------------------------------+-------+
|  
This is about your tax form                                                         |     |
|  
I am formulating a new theory about searching and about different forms it can take |     |
|  
what is the formula for the area of a circle?                                       |     |
|  
| New forms of life                                                                   |     |
|  
some text that does not contain the word                                            |     |
|  
irrelevant text                                                                     |     |
|  
etc                                                                                 |     |
|  
formform form form form                                                           |     |
+----+-------------------------------------------------------------------------------------+-------+ 

How do you differentiate those records?

I think that in this particular case, if you only need a one-term search, my initial example with the inflector helper is actually best, because it gives you only those docs that contain the term or its plural or its Saxon genitive. Here is an example:

SELECT *, MATCH (bodyAGAINST'form forms form\'s forms\'' IN BOOLEAN MODE ) AS score FROM test_searches;
+----+-------------------------------------------------------------------------------------+-------+
id body                                                                                score |
+----+-------------------------------------------------------------------------------------+-------+
|  
This is about your tax form                                                         |     |
|  
I am formulating a new theory about searching and about different forms it can take |     |
|  
what is the formula for the area of a circle?                                       |     |
|  
| New forms of life                                                                   |     |
|  
some text that does not contain the word                                            |     |
|  
irrelevant text                                                                     |     |
|  
etc                                                                                 |     |
|  
formform form form form                                                           |     |
+----+-------------------------------------------------------------------------------------+-------+ 

In this case “what is the formula for the area of a circle?” has a score of 0, which is to be expected because someone who’s looking for “form” is certainly not looking for “formula,” or they would type in “formula.”

However, if you were to program a generic search library you would have to use all of these techniques, and many more, and combine them into something that would give relevant results in most situations. But then what if the tables are not MyISAM tables? For instance I use only InnoDB tables in my projects to ensure data integrity and guard against orphan records and other things like that which MyISAM doesn’t support. What if the database is not MySQL but Oracle, Postage or even MongoDB (like I used in my last two projects)? Just thinking about the task only “what ifs” pop up in my mind.

I do however enjoy the discussion. That’s why I pulled my glass of Pinot Noir close and I spent about an hour and a half on a Friday evening doing this experiment. smile Another reason is that this idea of “relevancy,” if solved in an unbiased way, has potential. So far Google’s Page Rank is the best solution to relevancy, but only inasmuch as democracy is the best solution to human society.

 
Posted: 16 April 2011 12:00 AM   [ Ignore ]   [ # 12 ]   [ Rating: 0 ]
Avatar
Joined: 2009-11-10
69 posts

You have to do your WHERE statement using boolean, not your SELECT. Non-boolean select will return a decimal number with which you can work with.

You do raise an interesting point about the “formula” result returning a 0 score - this may be useful for determining relevancy smile. I wish I could do some experiments as well, but no time at the moment! Thanks for your experiments though, very interesting results! smile

As for tables that are not MyISAM formatted - you could probably create a dedicated search table that is MyISAM formatted, and populate a keywords field with the other table’s contents.

 Signature 

Website coming soon!

 
Posted: 16 April 2011 01:30 AM   [ Ignore ]   [ # 13 ]   [ Rating: 0 ]
Joined: 2009-11-06
46 posts
wh1tel1te - 16 April 2011 04:00 AM

You have to do your WHERE statement using boolean, not your SELECT. Non-boolean select will return a decimal number with which you can work with.

You do raise an interesting point about the “formula” result returning a 0 score - this may be useful for determining relevancy smile. I wish I could do some experiments as well, but no time at the moment! Thanks for your experiments though, very interesting results! smile

As for tables that are not MyISAM formatted - you could probably create a dedicated search table that is MyISAM formatted, and populate a keywords field with the other table’s contents.

smile OK, but please next time post some code so I don’t feel alone. Add 2 more rows to the table so I can ask you a question after that.

INSERT INTO `test_searchesVALUES ('9''My tax forms are late');
INSERT INTO `test_searchesVALUES ('10''My tax form is late'); 

Run the query:

SELECT *, MATCH (bodyAGAINST'form*' ) AS score FROM test_searches
WHERE MATCH 
(bodyAGAINST ('form*' IN BOOLEAN MODE )
ORDER BY score DESC;
+----+-------------------------------------------------------------------------------------+-------------------+
id body                                                                                score             |
+----+-------------------------------------------------------------------------------------+-------------------+
|  
This is about your tax form                                                         0.837664723396301 |
|  
formform form form form                                                           0.837664723396301 |
10 My tax form is late                                                                 0.828248143196106 |
|  
I am formulating a new theory about searching and about different forms it can take |                 |
|  
what is the formula for the area of a circle?                                       |                 |
|  
| New forms of life                                                                   |                 |
|  
My tax forms are late                                                               |                 |
+----+-------------------------------------------------------------------------------------+-------------------+
7 rows in set 

Question: why is “My tax form is late” relevant ( score > 0 ) and “My tax forms are late” is not ( score = 0 )? Shouldn’t they be at least close? Or, put another way, why is “My tax forms are late” just as relevant as “what is the formula for the area of a circle?” when you are searching for “form?”

And now my way:

SELECT *, MATCH (bodyAGAINST'form forms form\'s forms\'' ) AS score FROM test_searches
WHERE MATCH 
(bodyAGAINST ('form forms form\'s forms\'' IN BOOLEAN MODE )
ORDER BY score DESC;
+----+-------------------------------------------------------------------------------------+------------------+
id body                                                                                score            |
+----+-------------------------------------------------------------------------------------+------------------+
|  
This is about your tax form                                                         |  1.6753294467926 |
|  
formform form form form                                                           |  1.6753294467926 |
|  
| New forms of life                                                                   1.65649628639221 |
|  
My tax forms are late                                                               1.65649628639221 |
10 My tax form is late                                                                 1.65649628639221 |
|  
I am formulating a new theory about searching and about different forms it can take 1.62007236480713 |
+----+-------------------------------------------------------------------------------------+------------------+
6 rows in set 

You see that if you do it my way you get possible relevant results being signaled as such by their score, and “My tax form is late” has the same relevance as “My tax forms are late.” And the “formula” doesn’t even appear in the recordset. Anyway, let’s keep going with this. I got the weekend free anyway smile

 
Posted: 16 April 2011 01:00 PM   [ Ignore ]   [ # 14 ]   [ Rating: 0 ]
Joined: 2010-04-04
69 posts
CI Coder - 16 April 2011 05:30 AM

...
Question: why is “My tax form is late” relevant ( score > 0 ) and “My tax forms are late” is not ( score = 0 )? Shouldn’t they be at least close?

Or, put another way, why is “My tax forms are late” just as relevant as “what is the formula for the area of a circle?” when you are searching for “form?”

I read the whole boolean fulltext search article and could see no clue as to why “My tax form is late” is not close to “My tax forms are late” given that they both contain the word “form” which is main requirement of the * operator.

 
Posted: 16 April 2011 04:15 PM   [ Ignore ]   [ # 15 ]   [ Rating: 0 ]
Joined: 2009-11-06
46 posts
Ninjabear - 16 April 2011 05:00 PM
CI Coder - 16 April 2011 05:30 AM

...
Question: why is “My tax form is late” relevant ( score > 0 ) and “My tax forms are late” is not ( score = 0 )? Shouldn’t they be at least close?

Or, put another way, why is “My tax forms are late” just as relevant as “what is the formula for the area of a circle?” when you are searching for “form?”

I read the whole boolean fulltext search article and could see no clue as to why “My tax form is late” is not close to “My tax forms are late” given that they both contain the word “form” which is main requirement of the * operator.

It’s because when in natural language mode “*”, and all the other signs “+”, “-”, etc. are not logical operators like they are in boolean mode. The rule is that if it’s not alphabetic or numeric, with the exception of apostrophe (’) and underscore (_), then it is a word separator. Basically if you have “form*of*life” it’s the same as “form of life” as far as the parser is concerned.

The query that @wh1tel1te was suggesting uses the “*” character in the SELECT statement (that is in natural language mode) where it is a word separator, therefore no role in the outcome (you might as well not use it) and conditions the presence of those records in the query output based on the usage of the “*” character as a wild card character. In other words the MATCH in the SELECT part of the query is basically looking for documents that contain only the word “form” and nothing else, and the MATCH in the WHERE clause is looking for documents that contain all the words that start with “form.” That is why the document that contains “formula” has a 0 score, because it doesn’t contain the word “form” but it is present in the recordset because it contains the a word that starts with “form.” That’s why the next two queries have the same output:

mysqlselect *, match (bodyagainst ('form*' ) as score from test_searches order by score desc;
+----+-------------------------------------------------------------------------------------+-------------------+
id body                                                                                score             |
+----+-------------------------------------------------------------------------------------+-------------------+
|  
This is about your tax form                                                         0.332646816968918 |
|  
formform form form form                                                           0.332646816968918 |
11 This contains form*                                                                 | 0.332646816968918 |
10 My tax form is late                                                                 0.328907370567322 |
12 This document is form*alized                                                        0.325251072645187 |
|  
I am formulating a new theory about searching and about different forms it can take |                 |
|  
what is the formula for the area of a circle?                                       |                 |
|  
| New forms of life                                                                   |                 |
|  
some text that does not contain the word                                            |                 |
|  
irrelevant text                                                                     |                 |
|  
etc                                                                                 |                 |
|  
My tax forms are late                                                               |                 |
+----+-------------------------------------------------------------------------------------+-------------------+
12 rows in set 
mysqlselect *, match (bodyagainst ('form' ) as score from test_searches order by score desc;
+----+-------------------------------------------------------------------------------------+-------------------+
id body                                                                                score             |
+----+-------------------------------------------------------------------------------------+-------------------+
|  
This is about your tax form                                                         0.332646816968918 |
|  
formform form form form                                                           0.332646816968918 |
11 This contains form*                                                                 | 0.332646816968918 |
10 My tax form is late                                                                 0.328907370567322 |
12 This document is form*alized                                                        0.325251072645187 |
|  
I am formulating a new theory about searching and about different forms it can take |                 |
|  
what is the formula for the area of a circle?                                       |                 |
|  
| New forms of life                                                                   |                 |
|  
some text that does not contain the word                                            |                 |
|  
irrelevant text                                                                     |                 |
|  
etc                                                                                 |                 |
|  
My tax forms are late                                                               |                 |
+----+-------------------------------------------------------------------------------------+-------------------+
12 rows in set 
 
1 of 2
1