Roll Your Own Search with Rails and Postgres: Search Index

Published 04 Jul 2021 · 8 min read

Learn how to build search service using Rails and Postgres Full Text Search for a Gatsby blog.

This is the second in a multi-part series of posts detailing how I built the search feature for this blog. In this post, I will explain how I built the search index.

In case you missed it, Part 1: Search Introduction covers the existing options for adding search to a Gatsby site, and why I decided not to use any of them, and instead build a custom search service using PostgreSQL Full Text Search and Rails.

Document Model

The first step in building the search service is to design a database table containing the content to be searched. Since the posts on this blog are written in markdown, this will be all the text in the markdown files. The table also needs to contain enough information for the search results page on the client (this blog) to render the results and make each result clickable to link to the original content.

The search service is built as a Rails project. To get started, a new Rails project can be scaffolded with rails new myapp -d=postgresql. Or in my case, I already have a Rails project that provides some custom analytics for the blog, so I'm adding the search service to it.

Here is the migration to add the documents table. The text type is used for columns whose contents could get very large such as the body, which will contain the markdown content of each blog post.

class CreateDocuments < ActiveRecord::Migration[6.0]
  def change
    create_table :documents do |t|
      t.string :title, null: false
      t.text :description
      t.string :category
      t.date :published_at
      t.string :slug
      t.text :body, null: false
      t.text :excerpt

      t.timestamps
    end

    add_index :documents, :title, unique: true
  end
end

Running this migration generates a table that looks like this:

                                           Table "public.documents"
    Column    |              Type              | Collation | Nullable |                Default
--------------+--------------------------------+-----------+----------+---------------------------------------
 id           | bigint                         |           | not null | nextval('documents_id_seq'::regclass)
 title        | character varying              |           | not null |
 description  | text                           |           |          |
 category     | character varying              |           |          |
 published_at | date                           |           |          |
 slug         | character varying              |           |          |
 body         | text                           |           | not null |
 created_at   | timestamp(6) without time zone |           | not null |
 updated_at   | timestamp(6) without time zone |           | not null |
 excerpt      | text                           |           |          |
Indexes:
    "documents_pkey" PRIMARY KEY, btree (id)
    "index_documents_on_title" UNIQUE, btree (title)

Document Table Example Row

Next step in building the search index is to populate the table with content from this blog.

The following illustrates how each field in the documents table should be mapped to content from the blog:

For example, to populate a row in the documents table for the VS Code Alternative to Postman blog post illustrated above, the sql would be as follows (the excerpt and body columns are shortened for legibility):

INSERT INTO documents(
  title, description, category, published_at, slug, body, created_at, updated_at, excerpt
)
VALUES(
  'A VS Code Alternative to Postman',
  'Looking for a Postman alternative? This VS Code REST Client extension could be the answer.',
  'Web Development',
  '2021-06-13',
  '/blog/postman-alternative-vscode/',
  'If youve been doing web development for any length of time, youve probably built or worked on an HTTP REST style API and needed a REST Client to test it.'
  now(),
  now(),
  'If youve been doing web development for any length of time, youve probably built or worked on an HTTP REST style API and needed a REST…'
);

Generate INSERT Statements from Gatsby

Of course it wouldn't be practical or scale to manually write all the INSERT statements. These need to be automatically generated. The best place to do this is in the blog build step that processes all the markdown posts.

Recall this is a Gatsby blog. The blog pages are created at build time. The build step runs a file named gatsby-node.js in the root of the blog project. This file exports a function called createPages, which executes a graphql query to get all the markdown content. The graphql query returns a list of nodes that contain the markdown content such as title, category, publish date etc. These are used to create each individual blog page using the Gatsby provided createPage function.

So this is the perfect place to generate some sql INSERT statements, and write them out to a search.sql file. Here is the existing createPages function.

// gatsby-node.js

exports.createPages = ({ graphql, actions }) => {
  const { createPage } = actions

  return new Promise(resolve => {
    graphql(`
      {
        allMarkdownRemark(sort: { fields: [frontmatter___date], order: DESC }) {
          edges {
            node {
              frontmatter {
                title
                category
                date(formatString: "YYYY-MM-DD")
                description
              }
              fields {
                slug
              }
              excerpt,
              html,
              rawMarkdownBody
            }
          }
        }
      }
    `).then(result => {
      result.data.allMarkdownRemark.edges.forEach(({ node }) => {
        // build individual blog pages
        createPage({
          path: node.fields.slug,
          component: path.resolve("./src/templates/post.js"),
          context: {
            slug: node.fields.slug,
            content: node.html,
            title: node.frontmatter.title,
            description: node.frontmatter.description,
          },
        })
      })
      resolve()
    })
  })
}

And here is the modified createPages function that also generates SQL INSERT statements to populate the documents table. The existing createPages function is already fairly lengthy, so the new logic is placed in a new search helper module. The generated insert statement is appended to a search.sql file in the project root. This file is git ignored because it is a generated file.

// gatsby-node.js

const searchHelper = require('./lib/search-helper');

exports.createPages = ({ graphql, actions }) => {
  const { createPage } = actions

  return new Promise(resolve => {
    graphql(`
      {
        allMarkdownRemark(sort: { fields: [frontmatter___date], order: DESC }) {
          edges {
            node {
              snip...
            }
          }
        }
      }
    `).then(result => {
      // NEW CODE HERE: wipe out old search file if it exists
      if (fs.existsSync('search.sql')) {
        fs.unlinkSync('search.sql');
      }

      result.data.allMarkdownRemark.edges.forEach(({ node }) => {
        createPage({
          // snip...
        })

        // NEW CODE HERE: generate search insert statements for search service
        insertStatement = searchHelper.generateInsert(node)
        fs.appendFileSync('search.sql', insertStatement + '\n', 'utf8');
      })
      resolve()
    })
  })
}

The new searchHelper module exposes a generateInsert function that takes in a node result from the graphql query and uses the fields in node to generate an INSERT statement. Newlines and some markdown format characters are removed from the content since they're not necessary for being searched.

As for knowing how to parse a node object into fields such as title, description, this comes from gatsby-transformer-remark. A Gatsby plugin I use on this blog to parse Markdown files.

// lib/search-helper.js
function generateInsert(node) {
  return `INSERT INTO documents(title, description, category, published_at, slug, body, created_at, updated_at, excerpt)
    VALUES(
      '${replaceMany(node.frontmatter.title)}',
      '${replaceMany(node.frontmatter.description)}',
      '${node.frontmatter.category}',
      '${node.frontmatter.date}',
      '${node.fields.slug}',
      '${replaceMany(node.rawMarkdownBody)}',
      now(),
      now(),
      '${replaceMany(node.excerpt)}'
    );
  `
}

function replaceMany(str) {
  return str
    .replace(/(?:\r\n|\r|\n)/g, ' ')
    .replace(/'/g, '')
    .replace(/"/g, '')
    .replace(/#/g, '')
    .replace(/`/g, '');
}

module.exports = {
  generateInsert,
}

Ingest Documents

The final step in building the search index is to apply all the SQL INSERT statements generated in search.sql to the documents table in the Rails project. Here is the command to do it against the local PostgreSQL database. My database and user are named hello:

psql -h 127.0.0.1 -d hello -U hello -f /path/to/search.sql

To verify it worked, connect to to the database and run a query. body column is shortened for legibility:

hello=> select title, category, slug, left(body, 20) as body from documents;
                               title                                |     category     |                          slug                           |         body
--------------------------------------------------------------------+------------------+---------------------------------------------------------+----------------------
 Roll Your Own Search Service with Rails and Postgres: Search Index | web development  | /blog/roll-your-own-search-service-for-gatsby-part2/    |  This is the second
 Roll Your Own Search Service with Rails and Postgres: Introduction | web development  | /blog/roll-your-own-search-service-for-gatsby-part1/    |  This is the first i
 Add a Language to gatsby-remark-vscode                             | web development  | /blog/add-language-gatsby-remark-vscode/                |  This blog is built
 A VS Code Alternative to Postman                                   | Web Development  | /blog/postman-alternative-vscode/                       |  If youve been doing
 Rails CORS Middleware For Multiple Resources                       | rails            | /blog/rails-cors-middleware-multiple/                   |  A short post for to
 TDD by Example: Fixing a Bug                                       | javascript       | /blog/tdd-by-example-bugfix/                            |  This post will demo
 Saving on monthly expenses - A Cautionary Tale                     | personal finance | /blog/save-monthly-expense-caution/                     |  Today I want to sha
 Rails Blocked Host Solved by Docker Cleanup                        | rails            | /blog/rails-blocked-host-docker-clean/                  |  Today I want to sha
 Hack Your RRSP Tax Refund                                          | personal finance | /blog/hack-your-rrsp-refund/                            |  Ah spring is in the
 Using Herokus pg:pull with Docker                                  | PostgreSQL       | /blog/heroku_pg_pull_and_docker/                        |  **TLDR:** Use PG_HO
 Build and Publish a Presentation with RevealJS and Github          | web development  | /blog/build-and-publish-presentation-with-html-and-css/ |  If youve ever given
 Solving a Python Interview Question in Ruby                        | ruby             | /blog/python-interview-question-in-ruby/                |  A few months ago, I
 Automate Tabs & Commands in iTerm2                                 | terminal         | /blog/iterm-automation/                                 |  Do you find yoursel
 Debug Github Actions                                               | rails            | /blog/debug-github-action/                              |  A few weeks ago I w
 TDD by Example                                                     | javascript       | /blog/tdd-by-example/                                   |  If youve been codin
 Promotional interest rates and the fine print                      | personal finance | /blog/emergency-fund-fine-print/                        |  Taking a break from
 Improve Productivity with VS Code Snippets                         | vscode           | /blog/improve-productivity-with-vs-code-snippets/       |  Have you ever found
 Construct a PATCH request for a JSON API                           | rails            | /blog/patch-request-with-json-api/                      |  Im currently buildi
 Use UUID for primary key with Rails and Postgres                   | rails            | /blog/rails-uuid-primary-key-postgres/                  |  When working with R
 Start a Rails 6 Project with RSpec                                 | rails            | /blog/start-rails-6-project-with-rspec/                 |  Currently at work I
 Find Jira Tickets Faster                                           | productivity     | /blog/find-jira-tickets-faster/                         |  A short blog post f
 View Localhost on Your Phone                                       | web development  | /blog/phone-localhost/                                  |  A common occurrence
 Dockerize a Rails Application for Development                      | rails            | /blog/dockerize-rails-app-for-dev-debug-and-testing/    |  At work, I was rece
 VS Code Shell Integration                                          | vscode           | /blog/vscode-shell-integration/                         |  Today Im going to s
 Is a Career in Software Development Right for You?                 | career           | /blog/is-career-in-software-development-right-for-you/  |  A career in softwar
 What Makes a Good Podcast                                          | podcasts         | /blog/what-makes-a-good-podcast/                        |  I absolutely love p
 Off with the Digital Distractions!                                 | productivity     | /blog/off-with-the-digital-distractions/                |  As a programmer, it
 Access Chrome Bookmarks with Keyboard                              | productivity     | /blog/how-to-access-chrome-bookmarks-via-keyboard/      |  If you frequently u
 Crossword Solver with CentOS                                       | just for fun     | /blog/crossword-solver-with-centos/                     |  If you enjoy crossw
 Navigate Back & Forth in VS Code                                   | vscode           | /blog/how-to-navigate-back-and-forth-in-vscode/         |  If you come from a
 How to Learn New Things                                            | career           | /blog/how-to-learn-new-things/                          |  As a software devel
 How I Setup my Terminal                                            | terminal         | /blog/how-i-setup-my-terminal/                          |  Have you ever seen
(32 rows)

Search Index Maintenance

One more thing to consider is how to maintain the search index going forward, as more documents are added to the blog. Recall the title field in the documents table is unique, so attempting to insert the same documents would cause an error. The solution is to use the UPSERT feature in PostgreSQL, which is a special syntax to specify what should happen in the case of a conflicting INSERT.

In my case, I'm keeping it simple and doing nothing, which will basically only insert new records when the search.sql file is loaded. To specify this behavior, the generated INSERT statements need to be modified to specify the ON CONFLICT instructions. Recall these statements are generated in the search helper module in the Gatsby blog project.

// lib/search-helper.js
function generateInsert(node) {
  return `INSERT INTO documents(title, description, category, published_at, slug, body, created_at, updated_at, excerpt)
    VALUES(
      '${replaceMany(node.frontmatter.title)}',
      '${replaceMany(node.frontmatter.description)}',
      '${node.frontmatter.category}',
      '${node.frontmatter.date}',
      '${node.fields.slug}',
      '${replaceMany(node.rawMarkdownBody)}',
      now(),
      now(),
      '${replaceMany(node.excerpt)}'
    )
    ON CONFLICT (title)
    DO NOTHING;
  `
}

function replaceMany(str) {
  // snip...
}

module.exports = {
  generateInsert,
}

This way, every time I add a new article to my blog, I can safely reload the generated search.sql into the documents table.

What's Next?

So now the documents table is fully populated and ready to be used for search. Next up, see Part 3: Search Engine for how to use the built-in PostgreSQL search functions to search the documents.