I recently added full-text search to two Rails projects using Sphinx and the ThinkingSphinx gem. While I have been extremely impressed with both Sphinx and ThinkingSphinx, I stumbled a few times along the way trying to get everything set up and working consistently. On both projects I had set up delta indexing so that my very large search indexes would only need to be rebuilt once per day. On one of the projects I also added Monit to keep searchd running, checking the process once every few minutes.
Updated: 9/23/2010
Installing Sphinx and the initial setup of ThinkingSphinx were straightforward and relatively simple. However, I spent about two weeks debugging what turned out to be a collection of small problems that, together, made me think I had gone terribly wrong in choosing Sphinx for my full-text search needs.
searchd Binary Path
Problem
You try and run some of the ThinkingSphinx rake tasks, but they fail because ThinkingSphinx can’t find the required Sphinx binaries.
Solution
ThinkingSphinx needs to be able to start and stop Sphinx when deploying. If you ssh into your server as the user who runs your Rails application (deploy in my case) and type which searchd at the command prompt, you should see something similar to /usr/bin/searchd, although the exact path will vary depending on how you installed Sphinx.
In your production version of the config/sphinx.yml file, set the bin_path configuration option to the directory that contains searchd. For example, if which searchd returns /usr/local/bin/sphinx/bin/searchd, your sphinx.yml should contain the following:
production:
  bin_path: "/usr/local/bin/sphinx/bin"
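Before running the rake tasks, it can help to confirm that the directory named in bin_path really contains the Sphinx binaries. The Ruby sketch below is only an illustration: missing_sphinx_binaries is a hypothetical helper, not part of ThinkingSphinx, and it assumes the binaries you care about are searchd and indexer.

```ruby
# Hypothetical helper (not part of ThinkingSphinx): returns the names of the
# Sphinx binaries that are missing or not executable in the bin_path directory.
# The default list of names is an assumption; adjust it to your install.
def missing_sphinx_binaries(bin_path, names = %w[searchd indexer])
  names.reject do |name|
    candidate = File.join(bin_path, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

bin_path = "/usr/local/bin/sphinx/bin" # the value from sphinx.yml above
missing = missing_sphinx_binaries(bin_path)
if missing.empty?
  puts "All Sphinx binaries found in #{bin_path}"
else
  puts "Missing from #{bin_path}: #{missing.join(', ')}"
end
```

Run it as the same user that runs your rake tasks, since that is the user whose view of the filesystem matters.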
searchd pid File
Problem
You’ve set up Monit to monitor searchd, but Monit is unable to monitor or restart it. In my case, the pid file was not where I expected it to be, so Monit could not find the searchd process.
Solution
In order to have Monit monitor the searchd process, it’s necessary to specify the location of the searchd pid file in your Monit configuration. When you use ThinkingSphinx to build your Sphinx configuration file, the location of your pid file is specified in the resulting production.sphinx.conf. I decided to set the location of the searchd pid file explicitly in sphinx.yml so that others wouldn’t have to go digging through the auto-generated configuration file to find it.
production:
  bin_path: "/usr/local/bin/sphinx/bin"
  pid_file: "/home/deploy/apps/my_rails_app/shared/log/searchd.pid"
My /etc/monit.d/sphinx configuration file [1]:
check process searchd with pidfile /home/deploy/apps/my_rails_app/current/log/searchd.production.pid
  start program = "/usr/local/bin/start_sphinx" as uid deploy
  stop program = "/usr/local/bin/stop_sphinx" as uid deploy
The /usr/local/bin/start_sphinx file used to start searchd:
#!/bin/bash
export PATH="$PATH:/usr/local/bin"
cd /home/deploy/apps/my_rails_app/current && RAILS_ENV=production /usr/bin/rake thinking_sphinx:index
cd /home/deploy/apps/my_rails_app/current && RAILS_ENV=production /usr/bin/rake thinking_sphinx:start > log/sphinx.log 2>&1
The /usr/local/bin/stop_sphinx file used to stop searchd:
#!/bin/bash
export PATH="$PATH:/usr/local/bin"
cd /home/deploy/apps/my_rails_app/current && RAILS_ENV=production /usr/bin/rake thinking_sphinx:stop > log/sphinx.log 2>&1
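Monit can only track searchd if the pidfile path in its configuration is the file searchd actually writes, so it is worth checking that path by hand. The Ruby sketch below does roughly what a pidfile-based monitor does: read the pid and ask the operating system whether that process exists. searchd_running? is a hypothetical helper written for this illustration, not something Monit or ThinkingSphinx provides.

```ruby
# Hypothetical helper: returns true only if pid_file exists and names a
# process that is currently alive.
def searchd_running?(pid_file)
  pid = Integer(File.read(pid_file).strip)
  return false unless pid > 0

  Process.kill(0, pid) # signal 0 only tests for existence, nothing is sent
  true
rescue Errno::ENOENT, Errno::ESRCH, ArgumentError
  false # no pid file, no such process, or the file did not contain a number
rescue Errno::EPERM
  true # the process exists but belongs to another user
end

pid_file = "/home/deploy/apps/my_rails_app/shared/log/searchd.pid"
puts "searchd running? #{searchd_running?(pid_file)}"
```

If this prints false while searchd is clearly running, the path you are checking is probably not the one in the generated production.sphinx.conf.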
Index File Location
Problem
You’ve deployed your application and built the index. A cron job has been set up to rebuild the index nightly. You notice that each time you deploy your application, your indexed results seem to be missing.
Solution
The index files that Sphinx builds and uses should be kept in a shared directory that persists across deploys. With the typical Capistrano setup, a good place would be /home/deploy/apps/my_rails_app/shared. By default ThinkingSphinx stores these files in RAILS_ROOT/db/sphinx/ENVIRONMENT, which is fine in development but not in production, where each deploy creates a fresh release directory.
First, create the index directory in your shared folder on production:
$> mkdir -p /home/deploy/apps/my_rails_app/shared/db/sphinx/production
Then, in your Capistrano deployment recipe for production, symlink the shared db path into the current release:
run "ln -nsf #{shared_path}/db/sphinx/production #{release_path}/db/sphinx/production"
Finally, tell ThinkingSphinx to use the shared path for Sphinx’s index files:
production:
  bin_path: "/usr/local/bin/sphinx/bin"
  pid_file: "/home/deploy/apps/my_rails_app/shared/log/searchd.pid"
  searchd_file_path: "/home/deploy/apps/my_rails_app/shared/db/sphinx/production"
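To try the symlink step outside of a deploy, the same idea can be sketched in plain Ruby. link_shared_index is a hypothetical helper for illustration only, and the directory layout it assumes (db/sphinx/ENVIRONMENT under both the shared and release paths) is taken from the paths used above.

```ruby
require "fileutils"

# Hypothetical helper: points <release>/db/sphinx/<env> at the shared index
# directory, replacing any symlink left over from an earlier run.
# Raises Errno::EEXIST if a real directory already sits at the link path.
def link_shared_index(shared_path, release_path, env = "production")
  target = File.join(shared_path, "db", "sphinx", env)
  link   = File.join(release_path, "db", "sphinx", env)

  FileUtils.mkdir_p(target)                # make sure the shared directory exists
  FileUtils.mkdir_p(File.dirname(link))    # make sure <release>/db/sphinx exists
  File.unlink(link) if File.symlink?(link) # drop a stale link before relinking
  File.symlink(target, link)
  link
end
```

Calling link_shared_index("/home/deploy/apps/my_rails_app/shared", "/home/deploy/apps/my_rails_app/current") twice in a row should leave a single symlink in place, which is the property you want from a step that runs on every deploy.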
Permissions
Problem
The files being created by Sphinx are owned by root and cannot be modified by the user running the ThinkingSphinx rake tasks, usually deploy. This often comes up when delta indexing is being used and the delta indexes are being modified or merged back into the full index.
Solution
You should start and stop searchd using the ThinkingSphinx rake tasks. This ensures that searchd is started by a user who can later modify the index files if needed. If you are using Monit, make sure you set up your Monit configuration to start or restart the searchd process as the same user who runs the ThinkingSphinx rake tasks.
This is accomplished in my Monit configuration file by using as uid deploy:
start program = "/usr/local/bin/start_sphinx" as uid deploy
stop program = "/usr/local/bin/stop_sphinx" as uid deploy
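If you suspect this problem, a quick way to confirm it is to list the index files that belong to someone other than the deploy user. The Ruby sketch below does that. files_not_owned_by is a hypothetical helper, and it looks at ownership only, so it will not catch problems caused by group or mode bits.

```ruby
# Hypothetical helper: returns the files under index_dir that are NOT owned
# by the given numeric uid, sorted by path.
def files_not_owned_by(index_dir, uid)
  Dir.glob(File.join(index_dir, "**", "*"))
     .select { |path| File.file?(path) }
     .reject { |path| File.stat(path).uid == uid }
     .sort
end
```

To check the shared index directory from the previous section, you could pass it together with Etc.getpwnam("deploy").uid (from Ruby's etc library) and expect an empty list back.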
Monit Restarts searchd Before Rebuilding is Complete
Problem
When rebuilding your index, Monit restarts searchd before the index is rebuilt.
Solution
While there are ways to pause Monit for certain services, I found the easiest way to solve this problem was to reduce how often Monit checks my searchd process, so that a rebuild has time to finish between checks. Given the traffic of your site and required uptime of the search index, this solution may not be for you. For me the magic interval was one check every three minutes.
Missing Configuration File
Problem
When you first deploy and build your index and configuration file, Sphinx appears to be working. The next time you deploy, your log file fills up with errors about a missing configuration file.
Solution
This one is an easy fix. In the ThinkingSphinx documentation, the deployment strategy is simple.
Essentially, the following steps need to be performed for a deployment:
- stop Sphinx searchd (if it’s running)
- generate Sphinx configuration
- start Sphinx searchd
- ensure index is regularly rebuilt
Make sure that your deploy process calls the thinking_sphinx:configure rake task. This regenerates the Sphinx configuration file each time you deploy, so the new release is never left without one.
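The first three steps above can be sketched as an ordered list of shell commands. The Ruby below only builds the command strings. sphinx_deploy_commands is a hypothetical helper, and the app path and rake location are the ones used in the start/stop scripts earlier, so adjust them to your own setup. The fourth step, regular rebuilding, is left to cron.

```ruby
# Rake tasks for the Sphinx part of a deploy, in the order they should run.
SPHINX_DEPLOY_TASKS = %w[
  thinking_sphinx:stop
  thinking_sphinx:configure
  thinking_sphinx:start
].freeze

# Hypothetical helper: turns the task list into shell commands for one app.
def sphinx_deploy_commands(app_path, env = "production", rake = "/usr/bin/rake")
  SPHINX_DEPLOY_TASKS.map do |task|
    "cd #{app_path} && RAILS_ENV=#{env} #{rake} #{task}"
  end
end

sphinx_deploy_commands("/home/deploy/apps/my_rails_app/current").each do |command|
  puts command
end
```

Keeping the order in one constant makes it harder to forget the configure step, which is the one whose absence causes the missing configuration file errors.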
Rebuilding My Index is Too Slow!
Problem
Your index has many thousands of records. Running rake thinking_sphinx:rebuild works great, but it’s very slow.
Solution
I recently found out about the thinking_sphinx:reindex rake task. On my Sphinx installation with ~115,000 indexed records, reindex is significantly faster than rebuild, so much so that it can be run on an hourly basis to keep my delta indexes from becoming too large.
[1] Hat tip to Chris Irish for the Monit configuration and start/stop scripts.
Sphinx and ThinkingSphinx are great, we’ve had great success with them, but like you – I’ve definitely run across many of these issues before and a list like this could have saved me a good deal of time! I’ll definitely pass this around our Dev department and make sure to pull it up next time I’m setting up Sphinx.