Anatomy of a search engine

By Emma Chittenden

Published on Nov 29, 2023   —   4 min read

Summary

Learn the language of a search engine so you can tell algorithm from index, and metadata from synonyms.

A little bit goes a long way, although the irony is that this is a long(ish) read.  I thought I’d give you the lowdown on the nuts and bolts of the back end of a search engine.

These are the basics.  There are lots of other super cool features different search engines will offer.

Next week I’ll give you the user bit of search, aka the front end.

The search engine

The engine itself is the bit that powers what people can search for via a search box on the front end of a website or product.

The Index

The index is set up with the master rules that detail what parts of your website or product the search engine should crawl.  Largely, those rules set the parameters of what it should, and more importantly shouldn’t, crawl.  Once it has been populated, the index is the raw data of your site that you can then apply the algorithm against.
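To make the "should and shouldn’t" idea concrete, here is a minimal sketch in Python.  The path prefixes and the function name should_crawl are made up for illustration; real search appliances express these rules in their own configuration format.

```python
# Hypothetical crawl rules: these prefixes are illustrative only.
INCLUDE_PREFIXES = ("/products/", "/help/", "/blog/")
EXCLUDE_PREFIXES = ("/admin/", "/checkout/", "/blog/drafts/")


def should_crawl(path: str) -> bool:
    """Return True if the path is allowed by the rules above.

    Exclusions are checked first, so they win over inclusions.
    """
    if path.startswith(EXCLUDE_PREFIXES):
        return False
    return path.startswith(INCLUDE_PREFIXES)
```

Checking the exclusions first means a draft blog post stays out of the index even though the rest of the blog is allowed in.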

Crawling or indexing

In most search appliances, crawling and indexing are treated as the same thing.  (Strictly speaking, crawling fetches your content and indexing stores it in a searchable form, but they usually run as one job.)  When you kick off crawling you’re asking the search engine to gather all the information on your website or product that meets the rules set for the index.  The first time you run it, it’ll pull all the information on your website or product.  The next time it runs it will be looking for any changes, meaning anything you’ve added, changed or removed.

Once crawled, you will have an index of raw information in the search appliance. When you have that, you can apply the algorithm against it. What you see in the index is not what your end user will see.
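Here is a rough sketch of that "pull everything first, then only look for changes" behaviour.  It assumes pages arrive as a simple path-to-content dictionary and uses a content hash to spot changes; that is one common approach, not necessarily what your search appliance does.  The names fingerprint and crawl are invented for this example.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Hash page content so a change can be spotted cheaply."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def crawl(pages: dict[str, str], index: dict[str, dict]) -> dict[str, list[str]]:
    """Update `index` in place from `pages` (path -> content).

    Returns the paths that were added, updated or removed on this run.
    """
    changes = {"added": [], "updated": [], "removed": []}
    for path, content in pages.items():
        digest = fingerprint(content)
        if path not in index:
            changes["added"].append(path)
        elif index[path]["digest"] != digest:
            changes["updated"].append(path)
        else:
            continue  # unchanged since the last crawl, nothing to do
        index[path] = {"digest": digest, "content": content}
    # Anything in the index that no longer exists on the site is removed.
    for path in list(index):
        if path not in pages:
            del index[path]
            changes["removed"].append(path)
    return changes
```

On the first run the index is empty, so every page counts as added.  On later runs only the differences are touched.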

When to crawl

Running an index is quite power-hungry and it normally makes engineers twitchy if you leave it running all the time when it’s not required.  There are three key ways you’d run an index.

On Demand

If you have a small website or don’t update it very often, only do a crawl when you make a change.

Scheduled

When you have a site or product where ad-hoc changes are made during the day, schedule the crawl to happen in the early hours of the morning, Tuesday to Saturday. If any major changes are made in between, run an extra crawl of either the whole site or just the part that was updated.
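In practice you would normally set this up in your search appliance or a job scheduler, but the rule itself is simple enough to sketch.  The 2am to 4am window below is an assumption for illustration, as is the function name in_crawl_window.

```python
from datetime import datetime

# Python numbers weekdays from Monday = 0, so Tuesday..Saturday is 1..5.
CRAWL_DAYS = {1, 2, 3, 4, 5}
WINDOW_START_HOUR = 2  # assumed "early hours" window, adjust to suit
WINDOW_END_HOUR = 4


def in_crawl_window(now: datetime) -> bool:
    """Return True if a scheduled crawl is allowed to start at `now`."""
    return (
        now.weekday() in CRAWL_DAYS
        and WINDOW_START_HOUR <= now.hour < WINDOW_END_HOUR
    )
```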

Constant

You only want to leave your search index to constantly anticipate changes if you’ve built something around search itself.  Sites like Amazon and LinkedIn, and apps (products) like TikTok and Instagram, will constantly anticipate change.

The algorithm

The thing that gets talked about most. If you’re using an off-the-shelf search appliance like Algolia, the algorithm is a set of predefined, configurable rules you can set up.  The algorithm is applied against the index.  In its most basic sense, an algorithm will control the order of the results shown when someone carries out a search.

If you’re designing and building your own search engine from the ground up using something like Solr (an open-source search platform built on Apache Lucene), you’ll design the algorithm completely yourself.
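As a toy illustration of "rules that control the order of results", here is a sketch that sorts documents by an ordered list of tie-breakers.  The three rules (title matches, then body matches, then popularity) and the field names title, body and popularity are assumptions made up for this example, not how any particular product ranks.

```python
def rank(results: list[dict], query: str) -> list[dict]:
    """Order results by title hits, then body hits, then popularity."""
    words = query.lower().split()

    def sort_key(doc: dict) -> tuple:
        title = doc["title"].lower()
        body = doc["body"].lower()
        # Simple substring checks; a real engine would match whole tokens.
        title_hits = sum(word in title for word in words)
        body_hits = sum(word in body for word in words)
        # Negated so that bigger numbers sort first.
        return (-title_hits, -body_hits, -doc["popularity"])

    return sorted(results, key=sort_key)
```

Changing the order of the values in the tuple changes which rule wins, which is essentially what you are doing when you reorder ranking rules in a search appliance.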

Relevancy

This is the gold standard of search results.  When someone searches for something, you want to return the most relevant result based on what they’ve asked for.  This might sound obvious, but we’ve all searched for something and just got a list of things that don’t resemble what we’ve asked for. Or worse, we’ve used a Meta product.

Typo tolerance

Most off-the-shelf search appliances will have a feature that allows you to manage typo tolerance.  This means the search engine will try to work out what someone was searching for if they do things like transpose letters or have difficulty with complicated words.
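One common way to measure "how many typos away" two words are is an edit distance that counts insertions, deletions, substitutions and swapped neighbouring letters.  The sketch below uses that technique (known as optimal string alignment distance); it is an illustration of the idea, not a claim about how any specific search appliance implements typo tolerance.  The suggest helper and its max_typos default are invented for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Count the single-character edits needed to turn `a` into `b`.

    Insertions, deletions, substitutions and swaps of two adjacent
    letters each cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a letter
                dist[i][j - 1] + 1,         # insert a letter
                dist[i - 1][j - 1] + cost,  # substitute (or keep) a letter
            )
            # Two neighbouring letters typed in the wrong order.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[rows - 1][cols - 1]


def suggest(query: str, vocabulary: list[str], max_typos: int = 2) -> list[str]:
    """Return vocabulary words within `max_typos` edits, closest first."""
    scored = sorted((edit_distance(query, word), word) for word in vocabulary)
    return [word for distance, word in scored if distance <= max_typos]
```

With this, a search for "jaens" is one swap away from "jeans", so it can still find the right products.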

Stop words

A lot of searches are written in natural language, i.e. you ask a question.  However, when you ask a question the search query contains filler words like “how”, “do” and “the”.  If the search engine were to try to match those queries verbatim, the searches would take forever and the results would be filled with noise.  So search engines strip out these filler words, which are called stop words, and look only for the key words.  As a result, the search engine runs faster and is more likely to return relevant results.
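Stripping stop words is simple enough to show in a few lines.  The stop word list below is a tiny hand-picked sample for illustration; real search engines ship with much longer lists, usually per language.

```python
import re

# A small illustrative sample, not a complete stop word list.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "do", "does", "how", "what",
    "where", "can", "i", "my", "to", "of", "for", "in", "on",
}


def key_words(query: str) -> list[str]:
    """Lower-case the query, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    return [word for word in words if word not in STOP_WORDS]
```

So "How do I return a dress?" boils down to just "return" and "dress", which is what the search engine actually goes looking for.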

Metadata

If your site or product has metadata built into it, and your search engine is set up to see it, it will get indexed at the same time as all your content.

Metadata are little bits of structured information that are used to give more information about your content.  For example, if you own or look after an ecommerce site that sells clothes, the metadata could include information about gender, clothing categories, sizes, and colours.

Metadata is controlled by your ecommerce platform, content management system (CMS) or product.  In enterprise situations, the metadata can be managed by other platforms that can feed directly into the search appliance.

If you want to let your visitors filter search results, you must have metadata.
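Here is a small sketch of why filters depend on metadata.  The product records and the field names category, colour and sizes are invented for illustration; the point is that a filter can only work on structured fields that exist on every result.

```python
# Hypothetical product records with metadata fields.
PRODUCTS = [
    {"name": "Linen shirt", "category": "shirts", "colour": "white", "sizes": ["S", "M"]},
    {"name": "Denim shirt", "category": "shirts", "colour": "blue", "sizes": ["M", "L"]},
    {"name": "Blue jeans", "category": "jeans", "colour": "blue", "sizes": ["S", "L"]},
]


def filter_results(results: list[dict], **filters: str) -> list[dict]:
    """Keep only the results whose metadata matches every filter."""

    def matches(doc: dict, field: str, wanted: str) -> bool:
        value = doc.get(field)
        if isinstance(value, list):  # multi-value field such as sizes
            return wanted in value
        return value == wanted

    return [
        doc for doc in results
        if all(matches(doc, field, wanted) for field, wanted in filters.items())
    ]
```

For example, filter_results(PRODUCTS, colour="blue", sizes="L") keeps the denim shirt and the jeans.  Without the colour and sizes fields there would be nothing to filter on.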

Synonyms

I love synonyms. They’re a great little tool you use to help your visitors get better results.  Unlike metadata, synonyms are controlled by your search appliance.

There are three ways you can use synonyms.

Cheat’s metadata

If you can’t implement metadata, you can use synonyms to help people get to content more easily.  It’s dirty and hard to maintain so I would only use it in extreme cases.  You also can’t set it up as a filter.

Translation

If your organisation has its own language, whether that’s jargon, legalese or clinical terms, you should be using synonyms to help people get to the information quickly.  What that means is providing a plain English (or your language equivalent) version.  The NHS have implemented a policy of using terms like pee and poo, because let’s be honest, nobody can spell diarrhoea (even me, I had to google that spelling).

Misspellings

If you have organisational or industry-specific words that are prone to being misspelled, but whose misspellings wouldn’t be picked up by typo tolerance, use synonyms.
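Under the hood, a synonym list is often little more than a lookup that widens the search before it runs.  The sketch below assumes one-way synonyms and a made-up list of terms; your search appliance will have its own way of storing and applying them.

```python
# Hypothetical synonym list: plain words on the left, the terms your
# content actually uses on the right.
SYNONYMS = {
    "poo": ["stool", "faeces"],
    "pee": ["urine"],
    "diarrhea": ["diarrhoea"],  # alternative spelling treated as a synonym
}


def expand_query(words: list[str]) -> list[str]:
    """Add the synonyms for each word, keeping order and skipping repeats."""
    expanded: list[str] = []
    for word in words:
        for term in [word, *SYNONYMS.get(word, [])]:
            if term not in expanded:
                expanded.append(term)
    return expanded
```

Someone searching for "poo" then also matches pages that only mention "stool" or "faeces", without you having to rewrite the content.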

Noise

When you carry out a search on a website and it brings you back a billion and one results that don’t resemble what you’ve searched for, that’s noise.  If you’ve got noise you probably don’t have relevancy.

NOTE: on some websites thousands of results give visitors confidence; on others they cause uncertainty.  If your visitors don’t trust what they’re seeing they won’t click on a result, even if the top one is the right one. Understanding noise will help you understand when you’re getting this right or wrong.
