The Better Way of Filtering and Extracting Elements from Huge JSON Arrays

We were working on a project related to on-the-edge prerendering of dynamic HTML components using Cloudflare workers, when we observed the problem of parsing and filtering huge amount of JSON data (especially because the project was developed using Typescript). At first glance, everything was clear and simple - Cloudlfare's HTMLRewriter API is amazing - easy to work with and shows unbelievable speed results compared to the other available APIs. But then the data and it's structure came into the game.

The data is structured as a JSON that contains a big array of events. Each event has unique 'id' property related to an HTML component and provides the specific information for this component.

The data looks like:

{
  "data": [
    {
      "id": "element-one",
      "title": "Event One",
      "links": [
        {
          "name": "link-one",
          "href": "https://example.com/link-one",
          "title": "Link One",
          "description": "This is the first link."
        },
        {
          "name": "link-two",
          "href": "https://example.com/link-two",
          "title": "Link Two",
          "description": "This is the second link."
        }
      ]
    },
    {
      "id": "element-two",
      "title": "Event Two",
      "links": [
        {
          "name": "link-three",
          "href": "https://example.com/link-three",
          "title": "Link Three",
          "description": "This is the third link."
        },
        {
          "name": "link-four",
          "href": "https://example.com/link-four",
          "title": "Link Four",
          "description": "This is the fourth link."
        }
      ]
    }
  ]
}

And the corresponding HTML component that uses it looks like this:

<template-element src="/assets/data/events-data.json" data-id="element-one">
    <h2>{{{ title }}}</h2>
      {{#each links}}
        <a href="{{ this.href }}" target="_blank" name="{{ this.name }}" class="menu-link" title="{{ this.description }}">
          {{ this.title }}
        </a>
      {{/each}}
</template-element>

Let's take a look at the top-level logic and functionality. The overall concept is:

The Cloudflare Worker will get the static HTML
The HTML will contain a 'template-element' component with specific 'src' (the JSON file location) and 'data-id' (the id of the element from the JSON array that contains the information for this component)
The worker fetches the JSON and searches the array for an element with an id equal to the 'data-id' attribute value of the HTML component
The worker uses the HTMLRewriter API to inject the dynamic information into the HTML component (using HandleBarsJS)

In this article, we will focus on the third step only as it's the most interesting one.

And if we pretend that we have this template HTML component - <template-element src="/assets/data/events-data.json" data-id="element-one">, we should take the element from the array that has id = 'element-one'.

P.S. The JSON is automatically generated from a third-party service used by the client, so we can't easily manipulate it.

Looks simple, right? Let's dive deeper.

The JSON might contain a very large amount of elements with a lot of properties, so the process of parsing and looping through them to get the right one could be very-slow and resource-consuming. We can even hit some Cloudflare worker limits. Let's take a look at this code:

const extractComponentInformation = (json: string, targetId: string): Event => {
    const data: EventsData = JSON.parse(json)

    // TODO! -> Validate the information (check if the event is duplicated or missing)

    const event: Event | undefined = data.find(event => {
        return typeof event.id !== 'undefined' && event.id === targetId
    });

    if (!event) throw new Error(`The component with id = ${targetId} is missing in the JSON payload ${json}!`);

    return event;
};

The JSON parsing for such a big string is a slow operation that will consume a lot of memory.
Looping through thousands of elements might slow down the process as well.
When we add the validation (another looping OR we can implement it in the main loop), it becomes even slower.

The very first thing that came to my mind was - well, if the element is at the beginning, we can just break the loop and return the needed information. But we must take a look over all of the elements to ensure that we have only one element with this ID (and log a well-described exception if there are duplicated events for further investigation). So at the end of the day we must look through all elements.

And in fact, this operation will get slower and slower based on the count of the elements.

The solution

What does the JSON format mean? It's just a raw/unparsed string that contains information, right?

So we can do all fancy string-related operations (some of them are surprisingly well-optimized in Javascript) and benefit from the text-nature of the information. We can search, replace characters, trim, slice, etc. We can potentially pre-process the string and make the JSON parsing process faster.

The same function as in the previous code example but 11x faster:

export const extractComponentInformation = (json: string, targetId: string) => {
  const idPosition = getEventIdPosition({ id, json });
  const startPosition = getStartPosition({ idPosition, json });
  const endPosition = getEndPosition({ idPosition, json });

  const eventJson = json.slice(startPosition, endPosition);
  const data = JSON.parse(eventJson);

  return data;
};

As you can see, we're preprocessing the raw JSON string before the real parsing. Basically, we slice the string, so ONLY the needed part is parsed later instead of the whole big chunk of information. The algorithm is based on character positions and patterns searching. First, we search for the object location in the string.

export const getEventIdPosition = ({ id, json }) => {
  const idSelector = `"id":"${id}"`;
  const idPosition = json.indexOf(idSelector);

  if (json.indexOf(idSelector, idPosition + idSelector.length) !== -1) {
    throw new DuplicatedEventError({ id, json });
  }

  if (idPosition === -1) {
    throw new EventNotFoundError({ id, json });
  }

  return idPosition;
};

We simply search for '"id":"{here is the requested id}"' and once we know its position, we can start scrolling to find the beginning and the end of the object. Again - we will use ONLY string operations. Let's look at the process of extracting the 'startPosition' value.

export const getStartPosition = ({ idPosition, json }) => {
  let pointer = idPosition;
  let pointerBreak = idPosition;

  do {
    pointer = json.lastIndexOf('}', pointer);
    pointerBreak = json.lastIndexOf('{', pointerBreak);
  } while (pointer > pointerBreak);

  return pointerBreak;
};

The logic is very straightforward - we start scrolling backward from the Id to the top/beginning of the string while comparing the positions of the upcoming '{' and '}' brackets. At the moment, when the next '{' bracket is after the '}' bracket - this is the place where the object starts. It might look a little bit confusing, so let's take a look at the following illustration for more context.

The function for getting the end position of the object is very similar.

export const getEndPosition = ({ idPosition, json }: GetPositionParams) => {
  let pointer = idPosition;
  let pointerBreak = idPosition;

  do {
    pointer = json.indexOf('{', pointer) + 1;
    pointerBreak = json.indexOf('}', pointerBreak) + 1;
  } while (pointer < pointerBreak);

  return pointerBreak;
};

But this time, we scroll forward (using 'indexOf' instead of 'lastIndexOf'), and we are interested in the moment when the closing '}' bracket is before the opening '{' one.

Dev Note: Both 'indexOf' and 'lastIndexOf' functions are well-optimized. They can't be used with regex values and that makes them so fast.

Once we have both star and end positions, we should slice the string and parse the needed information only.

I know it's just one example that works with a specifically predefined structure, and it's not available for more general cases. But with a few tweaks in the code, you can adapt it for your case as well (and yes - it is worth it because of the 11x times faster results).

This algorithm for preprocessing the JSON is an alternative approach, and there isn't anything similar over the Internet. Feel free to improve it!

Happy Coding!

News

Tsvetoslav Nikolov | 19 Apr 2023

The data looks like:

And the corresponding HTML component that uses it looks like this:

Let's take a look at the top-level logic and functionality. The overall concept is:

Looks simple, right? Let's dive deeper.

The solution

The same function as in the previous code example but 11x faster:

Bulgaria

United Kingdom

office@mtr-design.com

The data looks like:

And the corresponding HTML component that uses it looks like this:

Let's take a look at the top-level logic and functionality. The overall concept is:

Looks simple, right? Let's dive deeper.

The solution

The same function as in the previous code example but 11x faster:

MTR Design Limited

Bulgaria

United Kingdom

office@mtr-design.com