May. 10, 2019

Preparing Your Data for Indexing

Let’s Get Started - Data

The first step is to think about your data and what information you want to make searchable.

For a retail outlet, it’s products. For a music store, it’s records and artists. For a real estate company, it’s houses and locations. The list goes on.

The next question is what information you need to build a search experience. This is a different question: you don’t need everything from your data source, only what is necessary to build search. We’ll cover that as we go along.

Where is the data located?

Our clients often wonder how Algolia accesses their data. The quick answer is that the data lives on Algolia’s servers. Algolia is akin to a cloud service, where you store your data and access it through Algolia.

This means that Algolia does not change or touch the data in your own systems, nor does it reach in and fetch data from your servers. It works only with the copy of your data that is stored on its own servers. Therefore, you must take the first steps to fetch, format, and send your data to Algolia.

For the fetch, you want to think about where your data is coming from. Let’s say it’s a classic database with tables, records, and fields. Or it’s a set of interrelated databases that constitute a full back-end system. Or a collection of XML files. Or an unmanageable set of Excel spreadsheets. Or a website with embedded content that needs to be crawled. The format of your original data doesn’t matter. What you need is a mechanism that fetches and transforms your data into a format that Algolia understands.

Note that this mechanism (usually a script) is up to you to create. It runs exclusively on your computers. You never have direct access to Algolia’s servers; you access them only through the API methods that Algolia provides to send and search your data.

Fetching and Reworking your Data for Algolia

Fetching your data

To get started with Algolia, you need an extract of your data, transformed into JSON. This extract needs to be selective. If you’re working with a product line, Algolia doesn’t need everything you have on your products. Algolia only needs what serves the purposes of search, which is to find products, rank them, and display valuable information about them.

Therefore, a key concern during this first stage is how to improve the search experience.

Extraction involves three sub-tasks: extracting the right data, reworking it a bit, and adding any information that can improve the chances of finding the most relevant results. The reworking part might involve things like turning a number into a string, or adding several phone number formats to improve the chances of the record being found.
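The reworking step can be sketched in a few lines of Python. This is only an illustration: the helper name phone_variants, the attribute names phone_formats and zip_code, and the sample store record are all made up for the example, and the helper assumes a 10-digit number.

```python
import re


def phone_variants(raw):
    """Return several spellings of a 10-digit phone number (hypothetical helper)."""
    digits = re.sub(r"\D", "", raw)  # keep the digits only
    if len(digits) != 10:
        return [raw]  # unknown shape: leave the value untouched
    area, prefix, line = digits[:3], digits[3:6], digits[6:]
    return [
        digits,
        f"{area}-{prefix}-{line}",
        f"({area}) {prefix}-{line}",
        f"{area} {prefix} {line}",
    ]


# Made-up sample record, reworked before it is sent for indexing.
record = {"name": "Example Store", "phone": "(415) 555-0134", "zip_code": 94107}
record["phone_formats"] = phone_variants(record["phone"])  # several formats to match on
record["zip_code"] = str(record["zip_code"])  # turn a number into a string
```

After this runs, the record carries four spellings of the same phone number, so a user typing any of them has a chance of finding it.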

Formatting your data

Once you’ve extracted your data, the next step is to format and structure it.

Algolia doesn’t rely on relational database concepts. It has a different approach; it uses a schema-less structure, in JSON.

Why do we use JSON?

Because it’s schema-less, which gives you flexibility at the record level
Because it contains key/value relationships, which are intuitive

What data to send

Let’s think about what Algolia needs to search your data. Does it need everything from your data source? The answer is no. It requires at least two kinds of data: information for people to search on and information to display.

An example will help. Imagine we want to create a search experience around movies, which means we may want to search (and display) movie titles, synopses, actors, and directors. We also want to display (but not search) photos, as well as year and country of release. However, we don’t care about production company or filming location.

This results in the following content:

For search: title, synopsis, actors, director.
For display: all of the above + year, photos, country of release.
Not needed: production company, filming location.

With that in mind, let’s look at your data. Imagine it’s in a database, and that the information you need is stored in different tables. You need to combine those tables and export the necessary data into a properly formatted JSON file:

EXTRACT attributes(title, synopsis, actors, director, year, photo URLs, country)
REFORMAT INTO JSON (title, synopsis, actors, director, year, photo URLs, country)

The JSON result should be:

[
  {
    "title": "Some Like It Hot",
    "synopsis": "When two male musicians witness a mob hit, they flee the state in an all-female band disguised as women, but further complications set in.",
    "actors": ["Marilyn Monroe", "Tony Curtis", "Jack Lemmon"],
    "director": "Billy Wilder",
    "year": 1959,
    "photo_urls": [
      "https://www.yourphotos.com/some-like-it-hot/photo1.jpg",
      "https://www.yourphotos.com/some-like-it-hot/photo2.jpg"
    ],
    "country": "United States"
  },
  {
    "title": "Mission: Impossible - Fallout",
    "synopsis": "Ethan Hunt and his IMF team, along with some familiar allies, race against time after a mission gone wrong.",
    "actors": ["Tom Cruise", "Henry Cavill", "Ving Rhames"],
    "director": "Christopher McQuarrie",
    "year": 2018,
    "photo_urls": [
      "https://www.yourphotos.com/mission-impossible-fallout/photo1.jpg",
      "https://www.yourphotos.com/mission-impossible-fallout/photo2.jpg"
    ],
    "country": "United States"
  },
  ...
]
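The extract-and-reformat step described above can be sketched with Python’s standard library. This is a minimal sketch under stated assumptions: the in-memory SQLite database stands in for your real data source, and the table and column names (movies, actors, photos, movie_id) are hypothetical. Only the exported attribute names come from the example.

```python
import json
import sqlite3

# In-memory stand-in for a real source database; the schema is hypothetical.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, synopsis TEXT,
                         director TEXT, year INTEGER, country TEXT,
                         production_company TEXT);
    CREATE TABLE actors (movie_id INTEGER, name TEXT);
    CREATE TABLE photos (movie_id INTEGER, url TEXT);
""")
db.execute(
    "INSERT INTO movies VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, "Some Like It Hot",
     "When two male musicians witness a mob hit, they flee the state in an "
     "all-female band disguised as women, but further complications set in.",
     "Billy Wilder", 1959, "United States", "placeholder, never exported"),
)
db.executemany("INSERT INTO actors VALUES (?, ?)",
               [(1, "Marilyn Monroe"), (1, "Tony Curtis"), (1, "Jack Lemmon")])
db.executemany("INSERT INTO photos VALUES (?, ?)", [
    (1, "https://www.yourphotos.com/some-like-it-hot/photo1.jpg"),
    (1, "https://www.yourphotos.com/some-like-it-hot/photo2.jpg"),
])


def column(query, movie_id):
    """Return the first column of every row that belongs to movie_id."""
    return [row[0] for row in db.execute(query, (movie_id,))]


# EXTRACT only the attributes needed for search and display,
# combining the three tables into one record per movie.
records = []
rows = db.execute(
    "SELECT id, title, synopsis, director, year, country FROM movies ORDER BY id"
).fetchall()
for movie_id, title, synopsis, director, year, country in rows:
    records.append({
        "title": title,
        "synopsis": synopsis,
        "actors": column("SELECT name FROM actors WHERE movie_id = ? ORDER BY rowid", movie_id),
        "director": director,
        "year": year,
        "photo_urls": column("SELECT url FROM photos WHERE movie_id = ? ORDER BY rowid", movie_id),
        "country": country,
    })

# REFORMAT INTO JSON: this string is what you would save or send.
movies_json = json.dumps(records, indent=2)
```

Note that production_company exists in the source table but never reaches the export, which is the selectivity the text asks for.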

When you push this data to Algolia, it becomes an index on an Algolia cluster. An index is what Algolia searches and where results come from. We’ll see later how to push data and how to keep it up to date. First, let’s go further into how to format and structure your Algolia index.

Formatting and structuring your data are two of the most important things when it comes to crafting great search and relevance. You need to refine your extract by reworking its content, adding new attributes, creating filters, restructuring record relationships, and more. So far we’ve only presented a straightforward example to illustrate the general idea. It’s actually a bit more complex - not by much, but we haven’t covered all the basics.

Creating Searchable Data

Organizing your data into categories

Above we’ve used movies, an example with universal appeal. Let’s use another example that gives us some additional ideas. Food recipes:

For search: title, description, when to serve.
For display: all of the above + ingredients, photo, popularity.
Not needed: nutritional value of the ingredients.

Putting all that together, we come up with the following JSON dataset:

[
  {
    "title": "Blueberry Pie",
    "description": "Blueberry pie is a tasty pie with a blueberry filling. Blueberry pie is considered one of the easiest pies to make because it does not require pitting or peeling of fruit.",
    "ingredients": ["blueberries", "sugar", "flour", "milk", "baking soda"],
    "when_to_serve": ["dessert", "picnic"],
    "photo": "https://www.yourphotos.com/blueberry-pie.jpg",
    "popularity": 1200
  },
  {
    "title": "Banana Pancakes",
    "description": "Banana pancakes is a tasty pancake dish prepared using bananas and pancake batter as primary ingredients. The bananas can be mashed or sliced, after which they are added to the batter or served on top.",
    "ingredients": ["banana", "flour", "milk", "baking soda"],
    "when_to_serve": ["breakfast", "lunch", "dinner", "dessert", "midnight snack"],
    "photo": "https://www.yourphotos.com/banana-pancakes.jpg",
    "popularity": 800
  }
]

Now, let’s introduce the concept of Searchable Attributes.

In our example, we have two records with six attributes each. Algolia searches attributes and returns records.

How you define and organize your attributes is important, as well as how you distribute your data into those attributes. For example, if you want to search a recipe for “blueberry pie”, Algolia only needs the title attribute. If you want a “tasty pie”, it needs the description attribute. Searching “tasty” and “pie” together will exclude the banana pancakes recipe. However, if you look only for “tasty”, Algolia will match on both records.
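To make the attribute-level matching concrete, here is a deliberately naive sketch in Python. It is not how the Algolia engine works internally: it only checks that every query word appears as a whole word in the chosen attributes. The function names words and naive_search are made up for the illustration, and the descriptions are shortened versions of the sample records.

```python
import re

recipes = [
    {"title": "Blueberry Pie",
     "description": "Blueberry pie is a tasty pie with a blueberry filling."},
    {"title": "Banana Pancakes",
     "description": "Banana pancakes is a tasty pancake dish prepared using "
                    "bananas and pancake batter as primary ingredients."},
]


def words(text):
    """Lower-case a text and split it into a set of whole words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def naive_search(query, records, attributes):
    """Return the titles of records whose chosen attributes contain every query word."""
    wanted = words(query)
    hits = []
    for record in records:
        found = set()
        for attribute in attributes:
            found |= words(str(record.get(attribute, "")))
        if wanted <= found:  # all query words are present
            hits.append(record["title"])
    return hits


title_hits = naive_search("blueberry pie", recipes, ["title"])
tasty_pie_hits = naive_search("tasty pie", recipes, ["description"])
tasty_hits = naive_search("tasty", recipes, ["description"])
```

Here title_hits and tasty_pie_hits both contain only Blueberry Pie, while tasty_hits contains both recipes, which mirrors the behavior described above.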

While all the attributes from our example can be displayed on the front end, only three are necessary for search. More specifically, we don’t want to search the URL of a photo because a URL does not contain any relevant, search-related information. We don’t want to search in popularity either. Again, it’s not search-related, though it will come in handy when we want to rank our results. We’ll cover this when discussing Custom Ranking.

We therefore need to make sure we exclude photo and popularity from the search. How do we do this? We set up some attributes as searchable attributes (ignore the syntax):

Set SearchableAttributes as:
- "title, description, when_to_serve"

Algolia now only searches into those three attributes and ignores the others.

Let’s go further and say that the recipe’s title and description are more important for finding information than when to serve it. We can adjust our setting as follows:

Set SearchableAttributes as:
  - "title, description"
  - "when_to_serve"

Now there are two list items: the first contains title, description, and the second, when_to_serve. This creates a priority of attributes where title and description are searched before when_to_serve. This priority tells the engine to give more weight to title and description than when_to_serve. We’ll go into far more detail about searchable attributes later, but you should be familiar with this setting right from the start.
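The two-level setting above can be pictured as an ordered list, sketched here in Python. Treat the exact shape as an assumption modeled on the pseudo-syntax in this section: each list entry is one priority level, and attributes that share a comma-separated entry share that level. The helper names priority_levels and is_searchable are hypothetical.

```python
# Earlier entries have higher priority; attributes listed together in one
# comma-separated entry share the same priority (assumed format).
searchable_attributes = ["title,description", "when_to_serve"]


def priority_levels(setting):
    """Map each attribute name to its priority level (0 = highest)."""
    levels = {}
    for level, entry in enumerate(setting):
        for name in entry.split(","):
            levels[name.strip()] = level
    return levels


levels = priority_levels(searchable_attributes)


def is_searchable(attribute):
    """An attribute is searched only if it appears somewhere in the setting."""
    return attribute in levels
```

With this setting, title and description both sit at level 0, when_to_serve sits at level 1, and photo and popularity are not searchable at all.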

Structuring Your Data

Simplifying your record structure

Your data is now formatted properly with searchable attributes. The next step is to look more closely at the structure of your index’s attributes and records.

First, your record structure. You should always structure your data to best accomplish your goals. The general rule for making data searchable is to simplify it. This rule applies whether you’re doing e-commerce, media, SaaS or any other domain. One consequence is that repeating information, which is a big no-no in relational databases, is OK when using Algolia.

For example, let’s search for Charlie Chaplin in a movie database. At the record level, we could decide to put Charlie Chaplin and all of his films and photos into one record. Instead, let’s create one record for each film and photo. Now, when you query “chaplin”, you’ll get multiple results - all of his movies and photos will show up as individual records.

Here’s how that would look (ignore the objectID for now):

[
  {
    "objectID": 123,
    "actor": "Charlie Chaplin",
    "film": "Modern Times"
  },
  {
    "objectID": 124,
    "actor": "Charlie Chaplin",
    "film": "City Lights"
  },
  {
    "objectID": 126,
    "actor": "Charlie Chaplin",
    "photo": "Photo from Modern Times",
    "url_of_photo": "https://www.yourphotos.com/modern-times.jpg"
  },
  {
    "objectID": 127,
    "actor": "Charlie Chaplin",
    "photo": "Photo from City Lights",
    "url_of_photo": "https://www.yourphotos.com/city-lights.jpg"
  }
]

As you can see, instead of combining Charlie Chaplin and all of his films and photos in one record, we’ve created one record for each film and photo.
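If your source data starts out as one nested record per actor, the flattening can be sketched as follows. The nested input shape and the function name flatten_actor are assumptions made for the example, and the sequential objectID values are arbitrary.

```python
from itertools import count

# Hypothetical nested source record: one actor with all films and photos.
nested = {
    "actor": "Charlie Chaplin",
    "films": ["Modern Times", "City Lights"],
    "photos": [
        {"photo": "Photo from Modern Times",
         "url_of_photo": "https://www.yourphotos.com/modern-times.jpg"},
        {"photo": "Photo from City Lights",
         "url_of_photo": "https://www.yourphotos.com/city-lights.jpg"},
    ],
}


def flatten_actor(actor_record, first_id=123):
    """Create one flat record per film and per photo, repeating the actor name."""
    ids = count(first_id)  # arbitrary sequential identifiers
    flat = []
    for film in actor_record["films"]:
        flat.append({"objectID": next(ids),
                     "actor": actor_record["actor"],
                     "film": film})
    for photo in actor_record["photos"]:
        flat.append({"objectID": next(ids),
                     "actor": actor_record["actor"],
                     **photo})
    return flat


flat_records = flatten_actor(nested)
```

The actor name is repeated in all four flat records. That redundancy is intentional: it is what lets a query for “chaplin” return each film and photo as its own result.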

Doing this not only showcases the variety of your data, it also adds a whole lot more results for the user to make sense of, so it requires some refinement. We’ll do this below when we show how to create a relevance/ranking for your index that will help make sense of this multitude. For now, just make note of the problem.

You might also wonder about index size. Flattening data will add more records to your index. If you have 10,000 actors with an average of 100 films and photos per actor, your index will have about 1M records (10,000 × 100). This may sound like a lot of data, but it’s not. Algolia has no prescribed limit to the number of records, only record disk size.

So let’s continue improving upon this data (and therefore improving the search).

Filtering

Let’s say I’m only interested in seeing Chaplin’s films (not his photos). I can add a new field: media.

[
  {
    "objectID": 123,
    "actor": "Charlie Chaplin",
    "film": "Modern Times",
    "media": ["film", "dvd", "video"]
  },
  {
    "objectID": 124,
    "actor": "Charlie Chaplin",
    "film": "City Lights",
    "media": ["film", "dvd", "video"]
  },
  {
    "objectID": 126,
    "actor": "Charlie Chaplin",
    "photo": "Photo from Modern Times",
    "url_of_photo": "https://www.yourphotos.com/modern-times.jpg"
  },
  {
    "objectID": 127,
    "actor": "Charlie Chaplin",
    "photo": "Photo from City Lights",
    "url_of_photo": "https://www.yourphotos.com/city-lights.jpg"
  }
]

And then, when your user searches “chaplin film”, you’ll add a filter on media.

Solved. The query “chaplin film” will find the first 2 records only. I’ve just used filtering.
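As a rough picture of what the filter does, here is a small Python sketch. It only illustrates the idea of keeping records whose media attribute contains a given value; the function name filter_by_media is hypothetical, and this is not the syntax you would use with Algolia’s API.

```python
records = [
    {"objectID": 123, "actor": "Charlie Chaplin", "film": "Modern Times",
     "media": ["film", "dvd", "video"]},
    {"objectID": 124, "actor": "Charlie Chaplin", "film": "City Lights",
     "media": ["film", "dvd", "video"]},
    {"objectID": 126, "actor": "Charlie Chaplin", "photo": "Photo from Modern Times",
     "url_of_photo": "https://www.yourphotos.com/modern-times.jpg"},
    {"objectID": 127, "actor": "Charlie Chaplin", "photo": "Photo from City Lights",
     "url_of_photo": "https://www.yourphotos.com/city-lights.jpg"},
]


def filter_by_media(candidates, value):
    """Keep only records whose media list contains the value.

    Records without a media attribute are simply left out.
    """
    return [r for r in candidates if value in r.get("media", [])]


films_only = filter_by_media(records, "film")
```

Only records 123 and 124 survive the filter. The two photo records have no media attribute, so they drop out without any special handling.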

Note further that media does not have to be included in every record. This is the benefit of making Algolia schema-less: not all records need to have the same structure. This will be important in considering how to reduce the size of your index.

Custom Ranking

Finally, to make the results more meaningful, I can control the order of their appearance by using a custom ranking attribute. We did this above in the recipes with the popularity attribute. Here we’ll do it with rating, to ensure that the best-rated films appear at the top of the list. In this example, City Lights shows up higher than Modern Times.

[
  {
    "objectID": 123,
    "actor": "Charlie Chaplin",
    "film": "Modern Times",
    "media": ["film", "dvd", "video"],
    "rating": 9.24
  },
  {
    "objectID": 124,
    "actor": "Charlie Chaplin",
    "film": "City Lights",
    "media": ["film", "dvd", "video"],
    "rating": 9.78
  }
]
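The effect of a custom ranking attribute can be sketched as a tie-breaker. In the Python below, text_score is a made-up stand-in for however well a record matches the query textually; it is not an Algolia attribute. When two records match equally well, the one with the higher rating comes first.

```python
def rank(records):
    """Order by textual score first, then break ties with rating (both descending)."""
    return sorted(records, key=lambda r: (-r["text_score"], -r["rating"]))


# Both films match the query "chaplin" equally well (same hypothetical text_score).
matches = [
    {"objectID": 123, "film": "Modern Times", "text_score": 1, "rating": 9.24},
    {"objectID": 124, "film": "City Lights", "text_score": 1, "rating": 9.78},
]

ranked = rank(matches)
ranked_films = [r["film"] for r in ranked]
```

Here ranked_films is ["City Lights", "Modern Times"], because the rating decides between two otherwise equal matches.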

Handling record hierarchy

Simplifying your records doesn’t mean you have to lose hierarchy or relationships between records. For example, if you want your users to search products and to see them organized by vendor, then you’ll need to store this product-vendor relationship in your index. This can be done because Algolia does not impose a data schema. You can organize your data in any way you want. You can keep it simple without losing complexity.

Let’s look at an example that simplifies hierarchical records. A book with multiple chapters can be either one record or many records.

A single record represents a book reasonably well. But it’s not useful for searching individual chapters. If I search “dragon”, I’ll find one record. It would be better to see at least two records - one for each chapter with a title containing “dragon”.

Here’s the single record:

{
  "objectID": 123,
  "book": "Harry Potter Book 5",
  "description": "A book about a boy and dragons that vanish.",
  "popularity": 1000,
  "chapters": [
    {
      "chapter": "Dragon Fly",
      "description": "In this chapter, the dragon is introduced.",
      "popularity": 900
    },
    {
      "chapter": "Dragon Vanishes",
      "description": "The dragon disappears.",
      "popularity": 800
    },
    ...
  ]
}

Now take a look at that same information in three records, less hierarchical, flatter:

[
  {
    "objectID": 123,
    "book": "Harry Potter Book 5",
    "description": "A book about a boy and dragons that vanish.",
    "popularity": 1000
  },
  {
    "objectID": 124,
    "from_book": "Harry Potter Book 5",
    "chapter": "Dragon Fly",
    "description": "In this chapter, the dragon is introduced.",
    "popularity": 900
  },
  {
    "objectID": 125,
    "from_book": "Harry Potter Book 5",
    "chapter": "Dragon Vanishes",
    "description": "The dragon disappears.",
    "popularity": 800
  }
]

You can also use the popularity attribute for custom ranking, so that the most popular chapters appear at the top.
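Going from the single hierarchical record to the flatter set of records can be sketched like this in Python. The function name flatten_book is hypothetical, and deriving each chapter’s objectID from the book’s objectID is an assumption made to reproduce the sample values; a real index would need identifiers that cannot collide.

```python
book = {
    "objectID": 123,
    "book": "Harry Potter Book 5",
    "description": "A book about a boy and dragons that vanish.",
    "popularity": 1000,
    "chapters": [
        {"chapter": "Dragon Fly",
         "description": "In this chapter, the dragon is introduced.",
         "popularity": 900},
        {"chapter": "Dragon Vanishes",
         "description": "The dragon disappears.",
         "popularity": 800},
    ],
}


def flatten_book(source):
    """Return one record for the book plus one record per chapter."""
    parent = {key: value for key, value in source.items() if key != "chapters"}
    flat = [parent]
    for offset, chapter in enumerate(source["chapters"], start=1):
        flat.append({
            "objectID": source["objectID"] + offset,  # illustrative IDs only
            "from_book": source["book"],  # keeps the link back to the book
            **chapter,
        })
    return flat


flat = flatten_book(book)

# Chapters whose title contains "dragon", most popular first.
dragon_chapters = sorted(
    (r for r in flat if "dragon" in r.get("chapter", "").lower()),
    key=lambda r: r["popularity"],
    reverse=True,
)
```

The from_book attribute preserves the book-chapter relationship even though the records are now flat, and popularity orders the two chapter records.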

These are just some of the strategies to improve relevance. The takeaway is that Algolia requires a different kind of thinking, along with a carefully planned data structure - and a lot more records - to achieve its purposes. But once this is understood, effective search is achieved.

Check out these strategies in more detail:

In depth Simplifying

The Index

As already mentioned, you send your data to an Algolia Index, and every search is performed using this Index. Once all of the work in formatting and structuring your data is finished, you have an Index. Two things remain, configuration and searching; these will be treated at length in their respective sections in our docs.

But before moving on to any new subjects, let’s just speak about indexing.

Putting your Data into One or More Indices

A bit more about structuring your data. An important principle that comes from relational databases is the distribution of data into different tables. This is done to break information into small meaningful units, to avoid redundancy, and to deal with multiplicity.

With Algolia, however, these principles are not so important for search. As already seen above when discussing record structure, flattened data is best for searching. This applies as well at the index level. It might seem reasonable to create multiple indices where each index represents a different subject/table. For example, separating films from actors, and creating an index for each. However, this might not serve the purposes of search. What if you want your users to search for both movies and actors at the same time and you want them to appear in the same results? In that case, you’ll need to use one index.

The main reason for this is relevance. Let’s see what that means.

Algolia searches only one index at a time. It does not perform cross-index searches. A search on two indices will produce two sets of results, each with their own internal relevance. Algolia will not merge these results for you, so you’ll need to do that yourself. Doing your own merge will break the relevance of the search. That’s because merging Algolia’s results after a search requires an understanding and re-implementation of Algolia’s own relevance and ranking algorithm. You would invariably be undoing Algolia’s relevance.
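The single-index approach can be pictured with a small Python sketch. Everything here is an assumption for illustration: the type attribute that tells films from actors, the shared name attribute, and the naive whole-word search function, which is not Algolia’s engine.

```python
import re

# One index holding two kinds of records, told apart by a hypothetical "type".
index = [
    {"type": "film", "name": "Modern Times", "actor": "Charlie Chaplin"},
    {"type": "film", "name": "City Lights", "actor": "Charlie Chaplin"},
    {"type": "actor", "name": "Charlie Chaplin"},
]


def tokens(text):
    """Lower-case a text and split it into a set of whole words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(query, records, attributes=("name", "actor")):
    """Return every record whose attributes contain all of the query words."""
    wanted = tokens(query)
    results = []
    for record in records:
        text = " ".join(str(record.get(a, "")) for a in attributes)
        if wanted <= tokens(text):
            results.append(record)
    return results


results = search("chaplin", index)
types_found = sorted({r["type"] for r in results})
```

A single query returns films and the actor in one result list, so they can be ranked against each other. With two separate indices you would get two lists and have to merge them yourself.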

This is not to say that there are no reasons for having multiple indices.

When to use Multiple Indices

Look at it as a front-end UI question. If you want your UI to display films and actors separately, it’s better to use separate indices.

Another case is when you want to present popular queries with an autocomplete pattern like Query Suggestions. For that, you’ll need two indices: one for the common queries and the second for your main content.

You shouldn’t hesitate to use several indices when necessary. The guiding principle is as follows: A search is performed on only one index. This means that every index must contain all of the records needed to make a search exhaustive. If you want two records to appear in the same results, or have them weighed against each other in the same relevancy computation, then they need to be in the same index. Otherwise, it’s probably more useful to put them in separate indices.

Best Formatting Practices For Better Results

Here is a summary of the above practices to guide you as you construct your index.

Searchable Attributes revisited

We already saw that formatting an index requires being selective about attributes. We came up with two categories: some for search, others for display. Then we added two additional categories: filters and business metrics for custom ranking. We accomplished this selectivity with Searchable Attributes.

Custom Ranking, more details

High on the list of relevance tuning is Custom Ranking. Above we included the popularity attribute for food recipes. We wanted people to see recipes with the highest popularity first. We did this by using custom ranking, which works as follows: if two similar records are found, the one with the higher popularity is ranked higher in the results.

We introduce Custom Ranking now to encourage you to immediately think about including metrics such as popularity, most_viewed, most_bought, or highest rated in your index. It will be easier to understand later when we discuss Algolia’s tie-breaking ranking strategy.

Take a look at this use case:

Rank per Custom Attributes

Filters and Facets

Though treated in full elsewhere (see filters), you’ll almost always want to include attributes for filtering and front-end faceting. We already saw that with the attributes media and when_to_serve. In the movies example, we used media to filter records, to show only films. As for faceting, if we want to allow users to click on categories to fine-tune their search (for example, TV series, Film, DVDs, and Video on demand), we can use the contents of ‘media’ to build these facets. If someone is looking for information about “cold war politics”, they can choose to search all media (films, TV, etc.) or they can limit their results to only “newspapers”. If they want to limit their search to political films, you can add a “subject_matter” filter/facet.
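To see how facet values could be derived from an attribute such as media, here is a minimal Python sketch. It simply counts how many records carry each value; the function name facet_counts is hypothetical, and this is a simplified picture rather than Algolia’s implementation.

```python
from collections import Counter

records = [
    {"objectID": 123, "film": "Modern Times", "media": ["film", "dvd", "video"]},
    {"objectID": 124, "film": "City Lights", "media": ["film", "dvd", "video"]},
    {"objectID": 126, "photo": "Photo from Modern Times"},
    {"objectID": 127, "photo": "Photo from City Lights"},
]


def facet_counts(candidates, attribute):
    """Count how many records carry each value of the given attribute."""
    counts = Counter()
    for record in candidates:
        values = record.get(attribute, [])
        if not isinstance(values, list):
            values = [values]  # treat a single value like a one-item list
        counts.update(values)
    return dict(counts)


media_facets = facet_counts(records, "media")
```

Here media_facets is {"film": 2, "dvd": 2, "video": 2}. A front end could show these values as clickable categories, and clicking one would apply the matching filter.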

Controlling Relevance by Reformatting Your Content

While Algolia provides a vast collection of settings to help with relevance, many of these settings work in combination with how you format your content.

Examples of this are whether you use one attribute or many for a single piece of information (phone number formats), or whether to include long or short descriptions or both, or whether to repeat (or not) the same word in the title and description, and many more.

Take a look at how formatting can help your relevance:
