Unique Array in MongoDB Document

Chunting Wu
5 min readNov 22, 2021

--

In order to explain the solution in detail, I have to introduce the user scenario first. We need a collection to record the group information including name, members, etc. However, we cannot allow a group with the same members in another group. For example, if A and B are in group 1, then they cannot be in group 2. Thus, we need maintain a members array which has an unique constraint.

There is an assumption in our requirement.

The amount of group members will be always 2. In other words, a group only has members ["A", "B"] but not ["A", "B", "C"].

Here comes an example collection:

[{
"_id": 1,
"name": "group1",
"members": ["A", "B"]
}, {
"_id": 2,
"name": "group2",
"members": ["A", "C"]
}]

We want to query more faster, so we add an index on members, however, MongoDB Multikey Index is very different from what our thought.

  • Can we add an unique index on members?
  • If the index type is regular, the answer is no.
  • Can db.group.find({members: ["B", "A"]}) find anything?
  • No, the array should be exactly matched.

The root cause is creating an index on array, the index will become multikey index automatically. For the document with id 1, there will be two indexes Indicating to this document, namely "A" and "B". In the same way, there will be two indexes, "A" and "C", on document with id 2. That is to say, unique constraint doesn’t work while they all have index "A".

Although, there are two indexes "A" and "B" on document 1, it just makes MongoDB retrieve data faster. Instead of the full-table scan, MongoDB can reduce the search scope, especially, the linear search with O(n) is very slow on the array match. However, it has to exactly match the array including the order.

How to solve these problems? There is a trick can handle, that is, use text index. The figure comes from MongoDB Compass as follows.

Although the number of uses is 0, this is actually caused by MongoDB’s implementation. The search for the text index is to compare the text first, and then use _id to locate the real content.

Nevertheless, this approach comes other problems when implementing applications:

  1. To find the content, we must know the order of members; otherwise, data cannot be retrieved.
  2. Similar to the problem one, even though this approach can make sure ["A", "B"] will not be duplicated, it cannot reject the insertion with ["B", "A"].
  3. The search pattern, regarding whether a single target exists in members, is inefficient. We have to use $in without the bound from an index to reduce the search scope.

There is a comprise solution to the problem 1 and 2. Before inserting or fetching data in the database, we sort the members in the application. However, this approach ties the data model and application logic together, which is hard to maintain and use. Hence, it is highly not recommended. As for the problem 3, there is no alternative solution.

Solution

The correct (*see the last section) approach is still using the multikey and text index. But we don’t store scalar string in members; instead, we store objects. Thus, the original example will be modified slightly:

[{
"_id": 1,
"name": "group1",
"members": [{"name": "A"}, {"name": "B"}]
}, {
"_id": 2,
"name": "group2",
"members": [{"name": "A"}, {"name": "C"}]
}]

In addition, the index on members should be on members.name with the text type and unique constraint.

From the above example we can see the original members is changed from ["A", "B"] to [{"name": "A"}, {"name": "B"}]. So what are the benefits of doing this? Yes, in fact, the three problems mentioned above have all been solved.

  1. We don’t care the order of members when searching a group. We can just use db.group.find({$and: [{"members.name": "A"}, {"members.name": "B"}]}).
  2. We don’t care whether "A" is the former or the letter, it can be rejected correctly if there is indeed a group with "A" and "B".
  3. To find a group belongs to a specific member, we can use db.group.find({"members.name": "A"}). This can find out all groups "A" belongs.

It must be emphasized again that this solution is only applicable when all groups are composed of two members. Take the above example, if you want to create a group 3 with members "A", "B", "C", it will be denied.

Why we have a such requirement? We want to simplify the know-how of database operations on the client side. In theory, the client does not need to have any knowledge to manipulate the data of the database, and the database can always respond under inappropriate circumstances. Therefore, we don’t have to embed the database logic into every client. Nonetheless, the approach has its disadvantages. Firstly, the design has been fixed to only support groups of 2 members. Secondly, there is only one text index in a collection, i.e., we sacrifice the possibility of other text indexes. Finally, text index takes lots of spaces, it is considered a high price index.

What should be the correct way? There are many possibilities, but the most feasible one should be findAndModify with upsert. By doing this, we can avoid creating a group with duplicated
members under the race condition. If there is a need to add more members, we can leverage $addToSet to operate on members to keep the uniqueness in members. From this we can know that to maintain the correctness of the data, there are many implementation requirements on the client side, so in the end we decided to write the integrity to the database and let the database take care of part of the correctness of the data.

Updated

We have found the multikey index in text type does not work in some conditions. Therefore, we are using $addToSet to be the temporary solution and still trying to find out a better approach.

--

--

Chunting Wu
Chunting Wu

Written by Chunting Wu

Architect at SHOPLINE. Experienced in system design, backend development, and data engineering.

No responses yet