Indexing content in complex Umbraco data types

The Umbraco CMS is a very flexible and open platform for building an expressive and intuitive interface for content editors, but that flexibility sometimes comes at a price when indexing the resulting content for Lucene/Examine-based searching: it can be a bit like looking for a needle in a haystack. In this article, Rob takes a look at ways to clean that content up while indexing to surface keywords and presentation without any messy artifacts.

By Robert Foster
25 September 2018
3 minutes read

Plugins like Stacked Content, along with Umbraco's built-in Grid Editor and Nested Content data types, use JSON as a storage format, and that doesn't lend itself to indexing and searching without some help. So we're going to look at some code that extracts the relevant information by hooking into Examine's indexing events.

In the first example, we're going to focus on the Nested Content and InnerContent (aka Our.Umbraco.InnerContent) based data types.

If you haven't come across these data types before, Nested Content is built into the Umbraco Core software, and InnerContent is an API supporting the derived data types Stacked Content and Content List. These two can be installed using NuGet or via the Umbraco Package Manager. Essentially, these data types are based on the concept of "unattached" Content Nodes, and can be rendered to lists of IPublishedContent from a single property. They are unattached because they aren't actually part of the Content Tree and hence don't have a parent node at all. Instead, they are stored in the published content cache, serialised as JSON objects.

So our first example effectively attempts to map the raw JSON value of these unattached content nodes and extract out the fields that are of interest:

The method recursively walks through the JSON token structure: if the token is a JArray, it calls itself on each element; if it's a JObject, it loops through each property and extracts the value of any property whose name matches one of the targetedFields passed in, combining the values into a single string for return.
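The recursive walk described here can be sketched in a few lines. This is a language-neutral illustration in Python, not the article's C# method: the name extract_fields, the case-insensitive key matching, and the sample JSON are all assumptions made for the example, with Python lists and dicts standing in for JArray and JObject.

```python
import json


def extract_fields(token, targeted_fields):
    """Collect the string values of any targeted keys into one string."""
    # Assumption: keys are matched case-insensitively ("Heading" == "heading").
    wanted = {name.lower() for name in targeted_fields}
    parts = []

    def walk(node):
        if isinstance(node, list):        # the JArray case: visit each element
            for item in node:
                walk(item)
        elif isinstance(node, dict):      # the JObject case: visit each property
            for key, value in node.items():
                if key.lower() in wanted and isinstance(value, str):
                    if value.strip():
                        parts.append(value.strip())
                else:
                    walk(value)           # keep descending into nested values
        # numbers, booleans, nulls and untargeted strings are ignored

    walk(token)
    return " ".join(parts)


# Hand-written sample resembling a Nested Content value (hypothetical data).
raw = """[
  {"name": "Intro", "ncContentTypeAlias": "textBlock",
   "heading": "Welcome", "bodyText": "Hello world", "sortOrder": 1},
  {"name": "Outro", "ncContentTypeAlias": "textBlock",
   "heading": "Goodbye", "tagline": "See you soon"}
]"""

print(extract_fields(json.loads(raw), ["Heading", "BodyText", "Tagline"]))
# prints: Welcome Hello world Goodbye See you soon
```

Untargeted keys such as name and sortOrder never reach the output, which is what keeps the indexed text free of structural noise.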

For Nested Content, InnerContent and its derivatives, our targetedFields parameter might contain the following:

Heading
BodyText
Tagline

Pretty much the properties you'd expect to be useful for indexing in any Content item.

Now let's look at how we can do the same for the Grid Editor. Because the Grid Editor also uses JSON to store its property value, we can use the same method that we used for Nested Content and InnerContent data. However, the Grid Editor has only a few properties that are desirable for indexing, depending on how you have set up your Grid. Out of the box, the JSON keys you may want to target using the targetedFields parameter might be:

value (for RichText, Heading 1, etc.)
caption (for an Image property)

The easiest way to work out what you want to index and what you want to discard is to inspect the JSON value itself in the Umbraco.config cache file.
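When the JSON is large, a quick inventory of its keys can help with that inspection. The sketch below is an illustration only: count_keys is a hypothetical helper, and the Grid value is a simplified, hand-written sample rather than output copied from a real site.

```python
import json
from collections import Counter


def count_keys(token, counts=None):
    """Count how often each property name occurs anywhere in a JSON value."""
    counts = Counter() if counts is None else counts
    if isinstance(token, list):
        for item in token:
            count_keys(item, counts)
    elif isinstance(token, dict):
        for key, value in token.items():
            counts[key] += 1
            count_keys(value, counts)
    return counts


# Simplified, hypothetical Grid Editor value with one rich text control
# and one image control.
grid_json = """{
  "name": "1 column layout",
  "sections": [{
    "grid": 12,
    "rows": [{
      "name": "Article",
      "areas": [{
        "grid": 12,
        "controls": [
          {"value": "<p>Hello grid</p>", "editor": {"alias": "rte"}},
          {"value": {"image": "/media/1001/photo.jpg", "caption": "A photo"},
           "editor": {"alias": "media"}}
        ]
      }]
    }]
  }]
}"""

counts = count_keys(json.loads(grid_json))
for key, n in sorted(counts.items()):
    print(key, n)
```

In this sample, value and caption are the only keys holding editor-authored text; keys such as grid, editor and alias are layout metadata and are the ones to discard.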

Now to glue it all together. Knowing how to extract the content for indexing from a JSON object, we next need to pull the JSON string out of the properties we're indexing in the first place. The following method does that by using the PropertyEditorResolver to retrieve the appropriate editor, which in turn gives us the content we need:

Note the default case (lines 99-101) for the switch statement is to simply assume we can index the property value without any special processing.

Line 104 takes all of the extracted content and puts it in the index with the given key - we're effectively combining the values from a whole lot of properties into one index field, which makes querying a lot simpler.
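The shape of that aggregation step, including the default case, might look like the following. This is a Python sketch under stated assumptions, not the article's C# method: the aggregate_fields signature, the JSON_EDITORS table, and the editor aliases in it are illustrative, and the PropertyEditorResolver lookup is replaced by passing the editor alias alongside each raw value.

```python
import json

# Assumption: editor aliases mapped to the JSON keys worth indexing.
JSON_EDITORS = {
    "Umbraco.NestedContent": ["Heading", "BodyText", "Tagline"],
    "Our.Umbraco.StackedContent": ["Heading", "BodyText", "Tagline"],
    "Umbraco.Grid": ["value", "caption"],
}


def extract_fields(token, wanted):
    """Compact form of the recursive walk; wanted is a set of lowercase keys."""
    if isinstance(token, list):
        return " ".join(filter(None, (extract_fields(t, wanted) for t in token)))
    if isinstance(token, dict):
        parts = []
        for key, value in token.items():
            if key.lower() in wanted and isinstance(value, str):
                parts.append(value.strip())
            else:
                parts.append(extract_fields(value, wanted))
        return " ".join(filter(None, parts))
    return ""


def aggregate_fields(fields, key, properties):
    """Combine (editor_alias, raw_value) pairs into a single index field."""
    parts = []
    for editor_alias, raw in properties:
        targeted = JSON_EDITORS.get(editor_alias)
        if targeted is None:
            text = raw                    # default case: index the value as-is
        else:
            try:
                token = json.loads(raw)
            except ValueError:            # empty or malformed JSON: skip it
                token = None
            text = extract_fields(token, {t.lower() for t in targeted})
        if text and text.strip():
            parts.append(text.strip())
    fields[key] = " ".join(parts)         # one combined field under the given key


fields = {}
aggregate_fields(fields, "_content", [
    ("Umbraco.Textbox", "Plain intro"),
    ("Umbraco.NestedContent", '[{"heading": "Welcome", "sortOrder": 1}]'),
    ("Umbraco.Grid",
     '{"sections": [{"rows": [{"areas": [{"controls": [{"value": "Grid text"}]}]}]}]}'),
])
print(fields["_content"])
# prints: Plain intro Welcome Grid text
```

Keeping the alias-to-keys mapping in one table means supporting another JSON-backed editor is a one-line change rather than a new branch.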

All we need to do now is hook into the Indexing event and call the AggregateFields method to populate our fields:

In this example, we're creating two new fields - _title and _content - and splitting up the properties we want to index amongst them.

Hook it up to the GatheringNodeData event, and we're in business. We're also only interested in the External index, not the Internal ones, so we filter out those indexes.
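The filtering and field split can be pictured as below. This Python sketch only mirrors the control flow; it does not use Examine's API. The handler signature, the index name "ExternalIndexer", the property aliases, and the simplified aggregate_fields (which skips the JSON handling) are assumptions for illustration.

```python
# Assumption: the name of the external index provider on this site.
EXTERNAL_INDEX = "ExternalIndexer"


def aggregate_fields(fields, key, aliases):
    """Simplified stand-in: join the non-empty values of the given aliases."""
    parts = [fields[a].strip() for a in aliases if fields.get(a, "").strip()]
    fields[key] = " ".join(parts)


def on_gathering_node_data(index_name, fields):
    """Stand-in for a GatheringNodeData handler; fields maps name -> value."""
    if index_name != EXTERNAL_INDEX:
        return                            # leave the internal indexes untouched
    # Hypothetical split of property aliases between the two new fields.
    aggregate_fields(fields, "_title", ["nodeName", "heading"])
    aggregate_fields(fields, "_content", ["bodyText", "tagline"])


external = {"nodeName": "Home", "heading": "Welcome", "bodyText": "Hello world"}
on_gathering_node_data("ExternalIndexer", external)
print(external["_title"])    # prints: Home Welcome
print(external["_content"])  # prints: Hello world

internal = {"nodeName": "Home"}
on_gathering_node_data("InternalIndexer", internal)
print("_title" in internal)  # prints: False
```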

TL;DR

By using a complex property's raw JSON value, we can target specific fields/keys/data within that value and aggregate them into a single index field to simplify search querying. We've covered Nested Content, Grid Editor, and InnerContent-derived data types, but the same technique can be used for other complex data types as well.
