Hobione's Weblog

Living & Breathing in Web 2.0 Era

Regain search engine on Glassfish app server

“Regain is a Java search engine based on Jakarta Lucene. It provides indexing and searching files for plenty of formats (currently HTML, XML, Excel, Powerpoint, Word, PDF and RTF).  A TagLibrary eases integrating search results in your JSP based web page.”  You may download the production version from http://regain.sourceforge.net/ or the beta version from http://www.assembla.com/spaces/regain2/documents

Our NAS division intranet web site needed a search feature similar to Google search.  I came across the Regain search engine after doing some research.  It is pretty simple to install on a server (a Solaris box).  Regain does not search a database directly, but if a link leads to database-backed content, Regain will crawl that link and store that page's metadata in the index. Hibernate Search or Lucene can do full-scale database search; I have not used them yet, but that’s what I hear.

Here are a few steps to install Regain on Glassfish (first unzip regain_v1.5.0-preview-80717-1556_server.zip):

  1. Create a directory under ../SUNWappserver/domains/domain1
    1. Name it regain
    2. Copy the crawler directory from c:\regain\runtime (after unzipping) and paste it into domain1/regain/
  2. Under the crawler dir, you will find CrawlerConfiguration.xml
    1. Modify the xml file with your domain name etc.
    2. Create a directory and call it searchindex under crawler
  3. Now the fun part: building the index
    1. Change directory (cd) to the domain1/regain/crawler dir and run the following command
    2. java -jar regain-crawler.jar or, java -jar regain-crawler.jar --help (more options)
  4. Copy the conf dir, its subdirectories and the xml file (all three --> conf\regain\SearchConfiguration.xml) from the downloaded zip file (found under regain\runtime\search) and paste them directly under domain1/applications
  5. Modify the SearchConfiguration.xml file, mainly lines 74, 80 and 83
  6. Deploy the regain.war file via app server’s beautiful admin gui.
  7. Modify the web.xml (domain1\applications\j2ee-modules\regain\WEB-INF). You have to tell the regain webapp where to look for the search configuration file.
<!-- The location of the configuration file -->
<context-param>
  <param-name>searchConfigFile</param-name>
  <param-value>../conf/regain/SearchConfiguration.xml</param-value>
</context-param>

The Regain webapp’s web.xml points to SearchConfiguration.xml, and SearchConfiguration.xml knows where to find the searchindex dir that serves the query results (a chain of three steps).
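The last link in that chain (SearchConfiguration.xml pointing at the searchindex dir) is easy to get wrong, so a quick check can help. The following is only a sketch, not something shipped with Regain: check_index_dirs is a hypothetical helper, and it assumes each index location in SearchConfiguration.xml is written on a single line as <dir>...</dir>. It lists every configured directory and reports the ones that do not exist.

```shell
#!/bin/sh
# check_index_dirs: hypothetical helper, not part of Regain.
#   $1 = path to SearchConfiguration.xml
# Assumption: every index location is on one line as <dir>...</dir>.
check_index_dirs() {
    config=$1
    if [ ! -f "$config" ]; then
        echo "missing config: $config"
        return 1
    fi

    # Print the text between <dir> and </dir> for each matching line,
    # then test each path. The explicit subshell keeps "exit" local,
    # and its exit status becomes the function's return value.
    sed -n 's:.*<dir>\(.*\)</dir>.*:\1:p' "$config" | (
        status=0
        while IFS= read -r dir; do
            if [ -d "$dir" ]; then
                echo "ok: $dir"
            else
                echo "missing index dir: $dir"
                status=1
            fi
        done
        exit "$status"
    )
}
```

For example, check_index_dirs /path/to/conf/regain/SearchConfiguration.xml (the path is a placeholder). The function returns non-zero if the file or any configured directory is missing.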

8. Open a browser and go to http://yourdomainname/contextName, e.g. http://axous2.abc.aaa.info/search
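The directory work in steps 1 to 3 can be sketched as a small shell function. This is an illustration under assumptions, not an official installer: prepare_regain_crawler is a made-up name, and both arguments (the unzipped runtime directory and the Glassfish domain directory) depend on where you unpacked the zip file and installed the app server. It only prints the crawler command instead of running it, so CrawlerConfiguration.xml can still be edited by hand first.

```shell
#!/bin/sh
# prepare_regain_crawler: hypothetical helper, not part of Regain.
#   $1 = unzipped regain "runtime" directory (contains crawler/)
#   $2 = Glassfish domain directory (for example .../domains/domain1)
prepare_regain_crawler() {
    runtime_dir=$1
    domain_dir=$2

    # Step 1: create domain1/regain and copy the crawler directory into it.
    mkdir -p "$domain_dir/regain" || return 1
    cp -R "$runtime_dir/crawler" "$domain_dir/regain/" || return 1

    # Step 2: create the searchindex directory under crawler.
    mkdir -p "$domain_dir/regain/crawler/searchindex" || return 1

    # Step 3: print the indexing command rather than running it, so the
    # configuration file can be adjusted before the first crawl.
    echo "cd $domain_dir/regain/crawler && java -jar regain-crawler.jar"
}
```

A call might look like prepare_regain_crawler /tmp/regain/runtime /opt/glassfish/domains/domain1, where both paths are placeholders for your own layout.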

Happy searching!

Home page: http://regain.sourceforge.net/
Open Source Full Text Search Engines Written In Java: http://www.manageability.org/blog/stuff/full-text-lucene-jxta-search-engine-java-xml


August 1, 2008 - Posted by | GlassFish, Search Engine

24 Comments »

  1. Starting at step 4, things are not too clear… in fact I can’t find the log file and I’m getting an error:
    Error message: Writing results failed

    Would you have the complete paths for 4, and an example for 5? Finally, by pointing domains/domain1/applications/j2ee-modules/regain/WEB-INF/web.xml to ../conf/regain/SearchConfiguration.xml, is that to say that it will look in domains/domain1/applications/j2ee-modules/regain/conf/regain for the SearchConfiguration.xml file?

    Full paths would help in figuring this out if possible.

    I’m on a solaris box, so my full path starts with /opt/glassfish/domains …

    Comment by Francois Dion | August 15, 2008 | Reply

  2. From the downloaded zip-file in the directory regain\runtime\search, you will find a conf dir. Copy that conf dir and paste it into /opt/glassfish/domains/domain1/applications
    Inside your conf dir, you should have regain dir and inside the regain dir, you should have SearchConfiguration.xml

    Here is what the snippet looks like from line 71 onward:

        <!-- The search index 'main' -->
        <index name="main" default="true">
          <!-- The directory where the index is located -->
          <dir>/app01/SUNWappserver/domains/domain1/regain/crawler/searchindex</dir>
        </index>
        
        <!-- The search index 'example' -->
        <index name="example">
          <!-- The directory where the index is located -->
          <dir>/app01/SUNWappserver/domains/domain1/regain/crawler/searchindex</dir>
          
          <rewriteRules>
            <rule prefix="/web/docs/" replacement="http://ato.abc.aaa.info"/>
          </rewriteRules>
        </index>
    

    You have to explicitly define the searchindex directory location.

    Hope this helps. We are loving it here at OKC. Regain ROCKS! We were going to pay for a Google Mini, and Regain saved our soul (SOS).

    Comment by hobione | August 15, 2008 | Reply

  3. That worked out good. All I had to do is move the conf folder 2 directories up.

    Now I’m having to redo the index though. I used the file:// index and that works ok, but I end up with links to documents like this:

    http://my.website.com/search/file/%24/%24export/home/user/folder/my+documents/document.pdf?index=main

    instead of http://my.website.com/user/folder/my+documents/document.pdf?index=main

    I switched to indexing with http instead of file, but then I have a problem: it doesn’t follow the links within the jsp pages.

    Did you have any problems like that?

    Comment by Francois Dion | August 18, 2008 | Reply

  4. No, I do not. It crawls both .jsp and .html pages.
    Review your configuration file and make sure nothing in it says not to crawl .jsp
    Thanks
    Hobi

    Comment by hobione | August 19, 2008 | Reply

  5. would you mind posting your crawlerconfiguration.xml too? I’ve asked about my problem on the regain forum and no answer in two weeks…

    Thanks for any help.

    Comment by Francois Dion | September 3, 2008 | Reply

  6. CrawlerConfiguration.xml

    <!-- The whitelist containing prefixes an URL must have to be processed -->
    <whitelist>
      <prefix name="file">file://</prefix>
      <prefix>http://atowusxx.amc.faa.gov</prefix>
    </whitelist>
    
    
    <!-- The blacklist containing prefixes an URL must NOT have to be processed -->
    <blacklist>
      <!--
      <prefix>http://www.mydomain.de/some/dynamic/content/</prefix>
      <regex>/backup/[^/]*$</regex>
      -->
      <prefix>http://atowusxx.abc.aaa.info/cmc/</prefix>   
      <prefix>http://atowusxx.abc.aaa.info/webapps/</prefix>
      <prefix>http://atowusxx.abc.aaa.info/uploads/</prefix>
      
    </blacklist>
    
    
    <!-- The preferences for the search index -->
    <searchIndex>
      <!-- The directory where the index should be located -->
      <dir>searchindex</dir>
    
      <!-- Specifies, whether the index should be built -->
      <buildIndex>true</buildIndex>
    

    Hope this helps
    Thanks
    Hobi

    Comment by hobione | September 3, 2008 | Reply

  7. what config do you have in the crawlerconfiguration.xml file?

    Comment by Francois Dion | September 3, 2008 | Reply

  8. argh… in the startlist start section…. the brackets were processed as html…

    Comment by Francois Dion | September 3, 2008 | Reply

  9. So, is that truly all you have in your crawlerconfiguration, no startlist and start parse=true index=true with an http url?

    If I use a file:// url it searches and indexes, but the links in the search results are wrong. If I use http:// it doesn’t index at all; that’s why I was curious to see what your whole startlist section of crawlerconfiguration.xml looked like.

    Comment by Francois Dion | September 9, 2008 | Reply

  10. 
    <!-- The list of URLs where the spidering will start. -->
    <startlist>
      <!-- Directory parsing -->
      <!--
      <start parse="true" index="false">file://c:/Eigene Dateien</start>
      -->
    
      <!-- HTML parsing -->
       <start parse="true" index="true">http://atowusxx.abc.aaa.info</start>
       
    </startlist>
    
    

    Comment by hobione | September 9, 2008 | Reply

  11. Hi

    I’m developing a search engine using regain. I got stuck at your 6th point, i.e.
    Deploy the regain.war file via app server’s beautiful admin gui.

    Where do I get this .war file from,
    and how do I run it?
    If possible, please mail me the screenshots. This is urgent.

    thanks for understanding.

    Comment by jazz | March 4, 2009 | Reply

  12. When you download the regain zip file, it comes with a regain.war file. It should be under regain1.5\regain\runtime\search\webapps\regain.war
    You can use Glassfish admin tool to deploy the war file just like any other webapp.

    Hope this helps.
    Hobi

    Comment by HobiOne | March 4, 2009 | Reply

  13. Thanks for your reply..

    i have deployed the regain.war file and i changed the

    searchConfigFile
    ../conf/regain/SearchConfiguration.xml

    in web.xml(domain1\applications\j2ee-modules\regain\WEB-INF)

    i am getting this error:
    Error message: Writing results failed

    i can see the whole crawling jobs when i run the “java -jar regain-crawler.jar”

    still it is not able to get the results

    Here are the links of my aplication:
    I am indexing my local file folder.

    crawling config: http://img186.imageshack.us/img186/9914/crawlingconfig.jpg
    configuring searchindex: http://img365.imageshack.us/img365/8254/config.jpg
    Deploying: http://img119.imageshack.us/img119/5633/deploy.jpg

    i dont know how to post it correctly..so i uploaded..

    can you sortout my problem..

    Comment by jazz | March 9, 2009 | Reply

  14. Thanks for your reply..
    When i deployed the war file i am getting the

    “Error message: Writing results failed”

    even i provided the correct indexing paths.

    Comment by Jazz | March 10, 2009 | Reply

  15. Please check directory permission and the blacklist.

    Comment by HobiOne | March 10, 2009 | Reply

  16. Dear HobiOne,
    I am thankful to you for prompt replies.

    i have checked the directory permission and the blacklist.they are perfect i think.

    can you check the comment# 13 (my screen shots).Let me know any thing goes wrong.I got crawling info also when i run java -jar regain-crawler.jar

    If possible can you give your mailid..i will send my entire code.
    or else send me any sample application to my mailid(abbhooshan@gmail.com).

    Thanks

    Comment by jazz | March 12, 2009 | Reply

  17. Here is my CrawlerConfiguration.xml
    http://pastebin.com/ff94fbd0

    and SearchConfiguration.xml
    http://pastebin.com/f2245783f

    Hope these help.
    Hobi

    Comment by HobiOne | March 12, 2009 | Reply

  18. Configuration for the regain crawler:
    http://pastebin.com/f6cb28c4c

    SearchConfiguration.xml
    http://pastebin.com/f13ef38c0

    web.xml
    http://pastebin.com/f71b50ad6

    can you please check my code and let me know anything went wrong?

    btw..
    thanks for sending your code

    Comment by jazz | March 12, 2009 | Reply

  19. Hi Hobi

    at last i got one wonderful blog..can you post how to connect to the database using regain??

    i want to index the mysql database.

    cheers

    Comment by Rock | March 12, 2009 | Reply

  20. CrawlerConfiguration.xml:
    http://pastebin.com/f6cb28c4c

    SearchConfiguration.xml
    http://pastebin.com/f13ef38c0

    web.xml:
    http://pastebin.com/f71b50ad6

    can you please check my code and correct me..

    btw..thnx for posting ur code

    bye

    Comment by jazz | March 12, 2009 | Reply

  21. I am indexing the local drive.
    SearchConfiguration.xml
    http://pastebin.com/m2feabe9c

    CrawlerConfiguration.xml
    http://pastebin.com/d109f4f0e

    web.xml
    http://pastebin.com/d64d5895
    still i am getting the same error..

    Comment by jazz | March 12, 2009 | Reply

  22. Hi
    thanks for your help..at last i can index my sites.

    How to start indexing after an interval in server version?
    meaning auto indexing after 1day,after 1 week like…
    i have seen the desktop version is having that.

    Where to configure for server edition?

    thanks

    Comment by Jazz | April 27, 2009 | Reply

  23. Can’t find the regain2 bits -the link in your article is 404

    Comment by Jim | June 17, 2009 | Reply

  24. Hi, can you tell me how to configure the maximum number of results displayed in the search?

    It gives 10 results by default. Can I customize that?

    Comment by jazz | July 23, 2009 | Reply

