Common ThinkingSphinx Configuration Problems

I have recently added full-text search to two Rails projects using Sphinx and the ThinkingSphinx gem. While I have been extremely impressed with both Sphinx and ThinkingSphinx, I did stumble a few times along the way trying to get everything set up and working consistently. On both projects I had set up delta indexing so that my very large search indexes would need to be rebuilt only once per day. On one of the projects I also added Monit to keep searchd running, checking the process once every few minutes.

Updated: 9/23/2010

Installing Sphinx and the initial setup of ThinkingSphinx were straightforward and relatively simple. However, I spent about two weeks debugging what turned out to be a collection of small problems that, together, made me think I had gone terribly wrong in choosing Sphinx for my full-text search needs.

searchd Binary Path

Problem
You try to run some of the ThinkingSphinx rake tasks, but they fail because ThinkingSphinx can't find the required Sphinx binaries.

Solution
ThinkingSphinx needs to be able to start and stop Sphinx when deploying. SSH into your server as the user who runs your Rails application (deploy in my case) and type which searchd at the command prompt. You should see something similar to /usr/bin/searchd, although the path will vary depending on how you installed Sphinx.

In the production section of your config/sphinx.yml file, set the bin_path configuration option to the directory that contains searchd. If which searchd returns /usr/local/bin/sphinx/bin/searchd, your sphinx.yml should contain the following:

production:
  bin_path: "/usr/local/bin/sphinx/bin"
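
If you would rather not strip the binary name by hand, the lookup can be scripted. This is only a minimal sketch for a POSIX shell: sphinx_bin_dir is a made-up helper name, not part of Sphinx or ThinkingSphinx, and it should be run as the user who runs your Rails application so that the same PATH applies.

```shell
#!/bin/sh
# Hypothetical helper: print the directory that holds a binary,
# given the full path to that binary.
sphinx_bin_dir() {
  dirname "$1"
}

# Look up searchd on the current user's PATH.
searchd_path="$(command -v searchd || true)"

if [ -n "$searchd_path" ]; then
  # Print a line that can be pasted into config/sphinx.yml.
  echo "bin_path: \"$(sphinx_bin_dir "$searchd_path")\""
else
  echo "searchd is not on this user's PATH" >&2
fi
```

If the script reports that searchd is not on the PATH, the rake tasks are likely to fail for the same reason until bin_path is set.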

searchd pid File

Problem
You've set up Monit to monitor searchd, but Monit is unable to monitor or restart it. In my case, the pid file was not where I expected it to be, so Monit could not find the searchd process.

Solution
For Monit to monitor the searchd process, you must specify the location of the searchd pid file in your Monit configuration. When you use ThinkingSphinx to build your Sphinx configuration file, the location of the pid file is written to the resulting production.sphinx.conf. I decided to set the location of the searchd pid file explicitly in sphinx.yml so that others wouldn't have to dig through the auto-generated configuration file to find it:

production:
  bin_path: "/usr/local/bin/sphinx/bin"
  pid_file: "/home/deploy/apps/my_rails_app/shared/log/searchd.pid"

My /etc/monit.d/sphinx configuration file (see footnote 1):

  check process searchd with pidfile /home/deploy/apps/my_rails_app/shared/log/searchd.pid
  start program = "/usr/local/bin/start_sphinx" as uid deploy 
  stop program = "/usr/local/bin/stop_sphinx" as uid deploy
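
Before trusting Monit with a pid file path, it can help to confirm by hand that the file exists and names a live process. The check below is a rough sketch of that idea, not what Monit does internally. pid_file_is_live is a hypothetical helper name, and it should be run as the user who owns searchd, since kill -0 fails for processes you are not allowed to signal.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the pid file exists
# and the pid stored in it belongs to a running process.
pid_file_is_live() {
  [ -f "$1" ] || return 1
  pid="$(cat "$1")"
  # kill -0 sends no signal; it only tests whether the process can be signalled.
  [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}
```

For example, pid_file_is_live /home/deploy/apps/my_rails_app/shared/log/searchd.pid && echo "searchd is running" should print the message only while searchd is up and the path matches the pid_file setting.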

The /usr/local/bin/start_sphinx file used to start searchd:

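  #!/bin/bash
  # Append /usr/local/bin to the PATH for the commands run from this script.
  export PATH="$PATH:/usr/local/bin"
 
  # Build the index first, then start searchd and write its output to log/sphinx.log.
  cd /home/deploy/apps/my_rails_app/current && RAILS_ENV=production /usr/bin/rake thinking_sphinx:index
  cd /home/deploy/apps/my_rails_app/current && RAILS_ENV=production /usr/bin/rake thinking_sphinx:start > log/sphinx.log 2>&1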

The /usr/local/bin/stop_sphinx file used to stop searchd:

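  #!/bin/bash
  # Append /usr/local/bin to the PATH for the commands run from this script.
  export PATH="$PATH:/usr/local/bin"
 
  # Stop searchd and write the task's output to log/sphinx.log.
  cd /home/deploy/apps/my_rails_app/current && RAILS_ENV=production /usr/bin/rake thinking_sphinx:stop > log/sphinx.log 2>&1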

Index File Location

Problem
You've deployed your application and built the index. A cron job has been set up to rebuild the index nightly. You notice that after each deploy, your indexed results seem to be missing.

Solution
The index files that Sphinx builds and uses should be kept in a shared directory that persists across deploys. With the typical Capistrano setup, a good place would be /home/deploy/apps/my_rails_app/shared. By default ThinkingSphinx stores these files in RAILS_ROOT/db/sphinx/ENVIRONMENT, which is fine in development but not in production, because each deploy creates a fresh release directory and leaves the old index behind.

First, create a directory for the index files in your shared folder on production:

$> mkdir -p /home/deploy/apps/my_rails_app/shared/db/sphinx/production

Then, in your Capistrano deployment recipe for production, symlink the shared db path into the current release:

  run "ln -nsf #{shared_path}/db/sphinx/production #{release_path}/db/sphinx/production"

Finally, tell ThinkingSphinx to use the shared path for Sphinx's index files:

production:
  bin_path: "/usr/local/bin/sphinx/bin"
  pid_file: "/home/deploy/apps/my_rails_app/shared/log/searchd.pid"
  searchd_file_path: "/home/deploy/apps/my_rails_app/shared/db/sphinx/production"

Permissions

Problem
The files being created by Sphinx are owned by root and cannot be modified by the user running the ThinkingSphinx rake tasks, usually deploy. This often comes up when delta indexing is being used and the delta indexes are being modified or merged back into the full index.

Solution
You should start and stop searchd using the ThinkingSphinx rake tasks. This ensures that searchd is started by a user who can later modify the index files if needed. If you are using Monit, make sure you set up your Monit configuration to start or restart the searchd process as the same user who runs the ThinkingSphinx rake tasks.

This is accomplished in my Monit configuration file by using as uid deploy:

  start program = "/usr/local/bin/start_sphinx" as uid deploy 
  stop program = "/usr/local/bin/stop_sphinx" as uid deploy

Monit Restarts searchd Before Rebuilding is Complete

Problem
When rebuilding your index, Monit restarts searchd before the index is rebuilt.

Solution
While there are ways to pause Monit for certain services, I found the easiest way to solve this problem was to lengthen the interval at which Monit checks my searchd process, so that a rebuild has time to finish between checks. Depending on the traffic of your site and the required uptime of the search index, this solution may not be for you. For me the magic interval was three minutes.
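
If you would rather pause Monit than tune its timing, Monit's own unmonitor and monitor commands can wrap the rebuild. The function below is only a sketch of that alternative and not something I run in production. with_monit_paused is a hypothetical helper name, the MONIT variable exists purely so the function can be dry-run with a stand-in command, and the service name must match the one on the check process line of your Monit configuration.

```shell
#!/bin/sh
# Hypothetical helper: pause Monit's checks for one service, run a command,
# then resume the checks whether or not the command succeeded.
# Usage: with_monit_paused SERVICE COMMAND [ARGS...]
with_monit_paused() {
  service="$1"
  shift
  # MONIT is an assumption for dry runs; it defaults to the real monit binary.
  "${MONIT:-monit}" unmonitor "$service" || return 1
  rc=0
  "$@" || rc=$?
  "${MONIT:-monit}" monitor "$service"
  # Report the wrapped command's exit status to the caller.
  return "$rc"
}
```

From the application directory this could be called as RAILS_ENV=production with_monit_paused searchd rake thinking_sphinx:rebuild, by a user who is allowed to control Monit.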

Missing Configuration File

Problem
When you deploy and build your index and configuration file, Sphinx appears to be working. The next time you deploy, your log file fills up with errors about a missing configuration file.

Solution
This one is an easy fix. In the ThinkingSphinx documentation, the deployment strategy is simple.

Essentially, the following steps need to be performed for a deployment:
  • stop Sphinx searchd (ensure it’s running)
  • generate Sphinx configuration
  • start Sphinx searchd
  • ensure index is regularly rebuilt

Make sure that your deploy process calls the thinking_sphinx:configure rake task. This regenerates the Sphinx configuration file each time you deploy.
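
The first three steps above can be sketched as a small shell function that a deploy hook might call. Treat it as an illustration under a few assumptions: deploy_sphinx is a made-up name, the RAKE variable is only there to allow a dry run with a stand-in command, and a failing stop task is treated as non-fatal on the guess that searchd may simply not be running yet.

```shell
#!/bin/sh
# Hypothetical helper: stop searchd, regenerate the Sphinx configuration,
# and start searchd again for the application in the given directory.
# The body runs in a subshell so the caller's working directory is untouched.
deploy_sphinx() (
  cd "$1" || exit 1
  # RAKE is an assumption for dry runs; it defaults to the real rake.
  rake_cmd="${RAKE:-rake}"
  # Assumption: a failing stop (for example, searchd not running) is not fatal.
  RAILS_ENV=production "$rake_cmd" thinking_sphinx:stop || true
  # Without a fresh configuration file, starting searchd is pointless.
  RAILS_ENV=production "$rake_cmd" thinking_sphinx:configure || exit 1
  RAILS_ENV=production "$rake_cmd" thinking_sphinx:start
)
```

With my directory layout it would be called as deploy_sphinx /home/deploy/apps/my_rails_app/current once the new release is in place.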

Rebuilding My Index is Too Slow!

Problem
Your index has many thousands of records. Running rake thinking_sphinx:rebuild works great, but it’s very slow.

Solution
I recently found out about the thinking_sphinx:reindex rake task. On my Sphinx installation with ~115,000 indexed records, reindex is significantly faster than rebuild, so much so that I can run it hourly to keep my delta indexes from becoming too large.
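
An hourly run could be driven by cron through a small wrapper like the one below. This is a sketch rather than my exact setup: reindex_sphinx, the crontab entry in the comment, and the RAKE override (there only for dry runs) are all assumptions.

```shell
#!/bin/sh
# Hypothetical cron wrapper around the reindex task.
# Example crontab entry for the deploy user (script path is an assumption):
#   0 * * * * /usr/local/bin/reindex_sphinx
# Usage: reindex_sphinx APP_DIR LOG_FILE
reindex_sphinx() (
  cd "$1" || exit 1
  log_file="$2"
  # Append instead of overwriting so earlier runs stay in the log.
  echo "reindex started at $(date)" >> "$log_file"
  RAILS_ENV=production "${RAKE:-rake}" thinking_sphinx:reindex >> "$log_file" 2>&1
)
```

Running it as the same user who starts searchd keeps the index files writable, for the reasons covered under Permissions.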

1 Hat tip to Chris Irish for the Monit configuration and start/stop scripts.
