How is MarkLogic different from MongoDB?

The question above is one that I have to answer at every conference and Meetup that I attend, and it comes up in different variations: how is MarkLogic better, what does MarkLogic do differently, and so on. I thought I'd collect all the answers in this one blog post. Welcome to the official MarkLogic vs MongoDB showdown.

When doing such a comparison one needs to look at a few perspectives that include a technical one as well as a business one. In this article we are going to have a look at some of the technical differences between the two databases - sometimes even technical differences can trigger business-level decisions.

ACID transactions

Have you ever heard someone say "if you're using NoSQL you can't be using ACID"? Now you can tell them that you know one database that is of the "NoSQL kind" and has full support for ACID.

Let's first discuss what ACID means. ACID is an acronym (Atomicity, Consistency, Isolation, Durability) that defines a set of properties that guarantee reliability in the world of database transactions. Atomicity means that if you have multiple statements in a transaction, all parts of the transaction need to be successful - if any one of them fails, the whole transaction fails and none of its changes are applied. Consistency means that there is a guarantee that the database will go from one valid state into another when a transaction commits. Isolation means that if you execute your transactions concurrently, those transactions are unaware of each other and the outcome is the same as if they had been executed serially. Finally, durability means that once a transaction commits (i.e. saves data to the database, does an update or deletes something) those changes are going to persist, even in the case of a system failure.
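
To make atomicity a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module (so neither MarkLogic nor MongoDB). The accounts table, the names in it and the transfer helper are all made up for illustration. The second statement in the transaction violates a constraint, so the first statement is rolled back as well.

```python
import sqlite3

# Hypothetical schema: two accounts whose balance may never go negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one transaction; return True if it committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


failed = transfer(conn, "alice", "bob", 500)   # the debit would go negative
after_failure = balances(conn)                 # the credit to bob is undone too
succeeded = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
```

After the failed transfer both balances are exactly what they were before, which is the "all or nothing" behaviour described above.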

There are two ways to implement ACID capabilities - either via locking or via multiversioning. MarkLogic implements multiversion concurrency control (MVCC), which means that you can read a document without acquiring locks: if a document is being written to, you can still read it, and reading a document does not block writing.
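
The general idea behind multiversioning can be sketched in a few lines of Python. This is a toy model and not how MarkLogic is implemented internally; the class name, the integer clock and the document URI are assumptions made for the example. Every write appends a new version instead of overwriting the old one, and a reader only sees versions that existed when it took its snapshot.

```python
class ToyVersionedStore:
    """Toy MVCC store: writes append versions, reads pick one by timestamp."""

    def __init__(self):
        self._versions = {}  # uri -> list of (commit_timestamp, value)
        self._clock = 0      # logical clock, incremented on every write

    def write(self, uri, value):
        """Append a new version of the document and return its timestamp."""
        self._clock += 1
        self._versions.setdefault(uri, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        """Return the timestamp a reader should use for a consistent view."""
        return self._clock

    def read(self, uri, as_of):
        """Return the newest version committed at or before `as_of`."""
        visible = [value for ts, value in self._versions.get(uri, []) if ts <= as_of]
        return visible[-1] if visible else None


store = ToyVersionedStore()
store.write("/recipes/curry.json", "version 1")
reader_snapshot = store.snapshot()                # a reader starts here
store.write("/recipes/curry.json", "version 2")   # a writer commits meanwhile

old_view = store.read("/recipes/curry.json", reader_snapshot)
new_view = store.read("/recipes/curry.json", store.snapshot())
```

The reader that started before the second write keeps seeing "version 1" and never had to wait for the writer, while a reader that starts afterwards sees "version 2".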

MarkLogic has full ACID support, whereas MongoDB offers what is referred to as 'eventual consistency'. What does that mean in practice? Let's say you have a cluster of MongoDB servers. Clients (applications or actual people using a client) can access any of these servers to retrieve data. Imagine that you do an update to one of your documents. Because there are no ACID guarantees, MongoDB's default settings do not guarantee that an update is durable before acknowledging that it was made, nor that an update will be replicated to a majority of servers before a read can take place.

Even with the strictest consistency settings – where all reads occur from a single primary server using majority read concern and all writes use majority write concern – there is the potential to end up with inconsistent or stale data. As one example, let’s say that Update A is recorded to your document on the primary server, but the primary server fails before replicating that update to a majority of nodes. From the client’s perspective, the write has failed, but in actuality, Update A may have made it to the secondaries and could be available. The client has no way of knowing if their write actually succeeded or not.
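
The scenario above can be sketched with a toy replica set in Python. This is a deliberately simplified model and not MongoDB's replication protocol; the class, the `reachable_secondaries` parameter and the node numbering are assumptions for illustration only. A write counts as acknowledged only when enough nodes have it, yet an unacknowledged write may still be sitting in a node's log.

```python
def majority(replica_count):
    """Smallest number of nodes that forms a majority of the replica set."""
    return replica_count // 2 + 1


class ToyReplicaSet:
    """Toy replica set: node 0 is the primary, the rest are secondaries."""

    def __init__(self, size):
        self.logs = [[] for _ in range(size)]

    def write(self, entry, reachable_secondaries, w):
        """Apply the entry on the primary and on every reachable secondary.

        Returns True only if at least `w` nodes hold the entry, which is
        what the client would see as an acknowledged write.
        """
        self.logs[0].append(entry)
        for node in reachable_secondaries:
            self.logs[node].append(entry)
        acks = 1 + len(reachable_secondaries)
        return acks >= w


cluster = ToyReplicaSet(size=3)

# No secondary is reachable, so the majority write is not acknowledged...
acknowledged = cluster.write("Update A", reachable_secondaries=[], w=majority(3))
# ...but the entry still exists on one node and may surface later.
exists_somewhere = any("Update A" in log for log in cluster.logs)
```

From the client's point of view the write was not acknowledged, but the data is not guaranteed to be absent either, which is exactly the ambiguity described above.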

Sharding & Clusters

In this section we will discuss how the two databases scale out. Ideally an enterprise-grade NoSQL database should be able to scale by just adding new servers to the cluster, and it should keep performing at that scale, whatever the scale is. MarkLogic does exactly this. You can even have an environment that has a mix of cloud and physical server deployments, and still have high availability.

MongoDB requires you to provision all the hardware for a highly available cluster. When setting up such a cluster you also need to decide on a shard key. (A database shard is a horizontal partition of data.) Essentially you specify a shard key, and according to this key your data is distributed across different servers so that you can also optimise your read and write loads. The problem? Once you have a shard key set up, you can't change it. The only way to change it is to follow a five-step operation where the first step is to dump all your data from MongoDB into an external format. Functionality also changes when MongoDB is sharded. Sharding breaks several critical features in MongoDB, including point-in-time recovery for a production system, in-document isolation, and several performance enhancement options (like certain secondary indexes and operations). Users are instructed to anticipate this loss of functionality by ensuring their code and practices never use those features.
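
To see why the choice of shard key matters so much, here is a minimal Python sketch of hash-based routing. It illustrates the general technique only and is not MongoDB's actual chunk and balancer logic; the order documents, the field names and the helper functions are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(doc, shard_key, num_shards):
    """Pick a shard by hashing the value of the document's shard key field."""
    digest = hashlib.sha256(str(doc[shard_key]).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def place(docs, shard_key, num_shards):
    """Group document ids by the shard they would be routed to."""
    placement = defaultdict(list)
    for doc in docs:
        placement[shard_for(doc, shard_key, num_shards)].append(doc["_id"])
    return dict(placement)


# Hypothetical documents, made up for the example.
orders = [
    {"_id": 1, "customer": "ann", "region": "eu"},
    {"_id": 2, "customer": "bob", "region": "us"},
    {"_id": 3, "customer": "ann", "region": "us"},
    {"_id": 4, "customer": "cat", "region": "eu"},
]

by_customer = place(orders, "customer", num_shards=3)
by_region = place(orders, "region", num_shards=3)
```

Every document's home is derived from the key, so orders 1 and 3 always sit together when sharding by customer, but not necessarily when sharding by region. Switching keys therefore means recomputing the placement of every document, which is why it can't be done as a quick in-place change.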

MarkLogic customers have auto-scaling tools that help keep performance levels stable regardless of whether there are dozens of 3-node AWS clusters or a single fifty-node physical cluster. Because MarkLogic can be an application server, database, and search engine all at once, topologies are smaller, simpler, and easier to administer.

Indexes

When you load a document into MarkLogic the system will automatically index the word tokens of that document (as well as the structure of the document). This index - often referred to as the Universal Index - gives you search out of the box. Of course you can add additional indexes to your database. Term list indexes help you answer 'yes or no' type questions - do any of my documents contain the term 'xyz'? Yes they do, here's a list based on relevancy. You can also enable term list indexes that will help you to do wildcard searches, along with a lot of other options. However, a term list index cannot answer inequality type questions, i.e. 'show me all the documents where the price is < £25'. For these you need to enable range indexes. Range indexes can be added against XML elements, attributes or JSON properties. Furthermore, in MarkLogic you can define geospatial indexes as well as triple indexes (yes, MarkLogic is also a triplestore - this may be another key feature for those of you who use semantics or have standard documents as well as RDF triples).
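
The difference between a term list and a range index can be sketched with two small Python data structures. This is a conceptual toy and not MarkLogic's on-disk format; the documents, the whitespace tokenizer and the function names are assumptions for the example. The term index maps each word to the documents that contain it, while the range index keeps values sorted so that an inequality becomes a slice.

```python
import bisect
from collections import defaultdict

# Hypothetical documents keyed by URI.
docs = {
    "/products/a.json": {"text": "red wine glass", "price": 12},
    "/products/b.json": {"text": "crystal wine decanter", "price": 40},
    "/products/c.json": {"text": "cork screw", "price": 25},
}


def build_term_index(docs):
    """Map every word token to the set of URIs that contain it."""
    index = defaultdict(set)
    for uri, doc in docs.items():
        for token in doc["text"].lower().split():
            index[token].add(uri)
    return index


def build_range_index(docs, field):
    """Keep (value, uri) pairs sorted by value."""
    return sorted((doc[field], uri) for uri, doc in docs.items())


def less_than(range_index, limit):
    """Return the URIs whose indexed value is strictly below `limit`."""
    cut = bisect.bisect_left(range_index, (limit,))
    return [uri for _, uri in range_index[:cut]]


term_index = build_term_index(docs)
price_index = build_range_index(docs, "price")

contains_wine = term_index["wine"]         # a 'yes or no' style lookup
cheaper_than_25 = less_than(price_index, 25)  # an inequality lookup
```

The term index can tell you instantly which documents mention 'wine', but it knows nothing about ordering, so the price question has to go to the sorted structure.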

In contrast, MongoDB has a single attribute indexing capability and your queries can only use two indexes at a time. More complex queries require what is referred to as a compound index. Simply put, a compound index is a structure that holds references to multiple fields, and it should be used for frequent queries. However, a major drawback is that queries using compound indexes have to respect the order of the keys, which restricts how you can sort. Imagine you have to sort on a compound index with 3 different keys. If you want to sort on two of those keys and reverse sort on another, you’d need to build another index. If you want to sort those keys in a different order, you’d need a third compound index. Interestingly enough, the advice from tech support is to build compound indexes rather than query on two indexes.
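
The sort restriction can be expressed as a small rule in Python. This is a simplified model of the behaviour described above (it ignores special cases such as equality filters on leading keys), and the field names are hypothetical. An index can serve a sort if the sort uses a leading run of the index keys, in the same order, with directions that either all match or are all reversed.

```python
def index_supports_sort(index_spec, sort_spec):
    """Return True if a compound index can serve the requested sort.

    Both arguments are lists of (field, direction) pairs, where direction
    is 1 for ascending and -1 for descending.
    """
    if not sort_spec or len(sort_spec) > len(index_spec):
        return False
    prefix = index_spec[: len(sort_spec)]
    same_fields = all(i[0] == s[0] for i, s in zip(prefix, sort_spec))
    forward = all(i[1] == s[1] for i, s in zip(prefix, sort_spec))
    backward = all(i[1] == -s[1] for i, s in zip(prefix, sort_spec))
    return same_fields and (forward or backward)


# Hypothetical compound index on three keys, all ascending.
index = [("category", 1), ("price", 1), ("rating", 1)]

as_indexed = index_supports_sort(index, [("category", 1), ("price", 1), ("rating", 1)])
all_reversed = index_supports_sort(index, [("category", -1), ("price", -1), ("rating", -1)])
mixed = index_supports_sort(index, [("category", 1), ("price", 1), ("rating", -1)])
reordered = index_supports_sort(index, [("price", 1), ("category", 1)])
```

The first two sorts can walk the index forwards or backwards. The mixed-direction sort and the reordered sort cannot, so each of them would need its own compound index.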

Collections

In the world of NoSQL - especially the document-based NoSQL solutions - there needs to be a way to store and organise documents. In the relational world, this is an easy and rather straightforward concept: you store your data in tables. Documents in a NoSQL database are added to collections, which act as a category label (or you can also think of it as a logical way of grouping your documents together).

MongoDB only allows you to store a document in a single collection. To me, this is a limitation; what if I'd like to have my document in two or even more collections? Think about this for a moment. Let's say you have a recipe that describes a lovely vegetarian dish. You could put that into a collection called 'recipes'. Wouldn't it be nice to also have this document be part of some other collections like 'vegetarian' and, say, 'asian' without actually having to update the document content? Collections in MarkLogic allow you to "slice and dice" your data as you see fit. Now I can query my data and retrieve all recipes, or only those that are in the vegetarian collection, and if I'm organising a dinner for my friends and they all love Asian food I can query for that collection.
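
Treating collections as labels rather than containers can be sketched like this in Python. It is a toy model, not the MarkLogic API; the class, the method names and the recipe URIs are made up for the example. The point is that collection membership lives next to the document, not inside it.

```python
from collections import defaultdict


class ToyDocumentStore:
    """Toy store where a document can carry any number of collection labels."""

    def __init__(self):
        self.documents = {}                  # uri -> document content
        self.collections = defaultdict(set)  # collection name -> set of uris

    def insert(self, uri, content, collections=()):
        """Store the document and label it with the given collections."""
        self.documents[uri] = content
        for name in collections:
            self.collections[name].add(uri)

    def add_to_collection(self, uri, name):
        """Add a label without touching the document content."""
        self.collections[name].add(uri)

    def uris_in(self, *names):
        """Return the URIs that belong to every one of the named collections."""
        if not names:
            return sorted(self.documents)
        sets = [self.collections.get(name, set()) for name in names]
        return sorted(set.intersection(*sets))


store = ToyDocumentStore()
store.insert(
    "/recipes/pad-thai.json",
    {"title": "Vegetarian pad thai"},
    collections=["recipes", "vegetarian", "asian"],
)
store.insert("/recipes/beef-stew.json", {"title": "Beef stew"}, collections=["recipes"])
store.add_to_collection("/recipes/beef-stew.json", "winter")

all_recipes = store.uris_in("recipes")
veggie_asian = store.uris_in("vegetarian", "asian")
```

Adding the stew to a new 'winter' collection did not change the stew document at all, and the same pad thai shows up whether you ask for recipes, vegetarian dishes or Asian food.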

Search

Search in MarkLogic has always been part of the core of the system - it's not something that MarkLogic have built around an existing solution.

When you execute a query in MarkLogic you'll see your documents returned to you in what is called 'relevancy ranking' - simply put that means that you'll see the most relevant document displayed to you first. There's a complex algorithm that determines how relevant a document is for your query based on the entire document set.
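
To give a feel for what "based on the entire document set" means, here is a small TF-IDF style ranking in Python. This is a textbook technique used as a stand-in and not MarkLogic's actual scoring algorithm; the documents and the exact formula are assumptions for illustration. A term counts for more when it is frequent in a document and rare across the whole set.

```python
import math
from collections import Counter


def rank(docs, query):
    """Return the URIs of matching documents, most relevant first."""
    tokenized = {uri: text.lower().split() for uri, text in docs.items()}
    total = len(tokenized)
    scores = {}
    for uri, tokens in tokenized.items():
        if not tokens:
            continue
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term at all.
            df = sum(1 for other in tokenized.values() if term in other)
            if df == 0:
                continue
            idf = math.log(1 + total / df)       # rarer terms weigh more
            tf = counts[term] / len(tokens)      # share of this document
            score += tf * idf
        if score > 0:
            scores[uri] = score
    # Highest score first; ties are broken by URI to keep the order stable.
    return sorted(scores, key=lambda uri: (-scores[uri], uri))


# Hypothetical documents keyed by URI.
recipes = {
    "/recipes/a.txt": "tofu curry with tofu",
    "/recipes/b.txt": "beef curry",
    "/recipes/c.txt": "green salad",
}

results = rank(recipes, "tofu curry")
```

The first recipe wins because it matches both terms and 'tofu' is rare in the set, the second still matches on 'curry', and the salad does not appear at all.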

There are of course multiple ways to affect the relevancy of your documents, giving you a fine level of control over what the final search result set looks like.

MarkLogic's search features give you, the developer, a great set of functionality that will help you to build search applications. Would you like to work with geospatial data? You can! What about snippets? Facets? Highlighting of search results? Type ahead features? Multi-language support? Stemming? Yes yes yes and yes.

MongoDB's query interface works like a standard database WHERE clause. This also means that you will need to deploy a third-party solution to handle search. That means a more complex architecture and more skills to staff for, and after a change you need to wait until the change gets indexed. The documents that you'll see returned are not ordered by relevancy; they are returned in document order, and you can sort them based on a property in your documents (i.e. find all the documents where the name is 'John' and sort them by 'lastname'). MongoDB also supports text search via text indexes and $text, but it has limited capabilities and questionable scalability.

Security

It has recently been announced that MarkLogic 8 has received Common Criteria certification. Common Criteria is the most widely recognised security certification for IT products in the world. There are 25 countries, including the United States, Canada, India, Japan, Australia, Malaysia, and many countries in the EU, that all mutually recognise the certification. Common Criteria is also a certification that is specifically requested by customers. MarkLogic is one of only six DBMS vendors to have this certification, and it is the only NoSQL company in this elite group. Data breaches are common in the news, and concerns over security and privacy of data are at an all-time high. Customers are looking for technologies that can help them protect their sensitive data. As the Enterprise NoSQL Database, MarkLogic has had security built in from day one, and there is continuous investment in its market-leading security features, standards support, and certifications so that customers know that there is no better place for their mission-critical data.

At the beginning of this post I mentioned that a good technical understanding of a system can lead to business-level decisions. I believe that if you've read all the previous points, it's clear why a business would require a database that has ACID support and allows you to search through your database looking for the right documents while maintaining a strong security model.

If you're interested in MarkLogic, please download the latest version of the product, which is available for multiple operating systems and comes with a free developer license.