Natural Language Processing with NoSQL and JavaScript

Natural Language Processing (NLP) is a part of Artificial Intelligence. More precisely, NLP means applying Machine Learning models to text and language. There are a number of tasks that we can achieve using NLP - think about Google Translate or Apple's Siri. These are all NLP algorithms in action.

In this article we'll see how to apply sentiment analysis to film (movie) reviews. Sentiment analysis means that, based on a previously unseen review, an algorithm can decide whether that review is positive (i.e. the reviewer liked the film) or negative (i.e. the reviewer did not like the film).

We'll create a Node.js application that accepts a review (a sentence passed in as an argument) and determines whether that review is positive or negative.

Supervised machine learning

We are going to use supervised machine learning - which means that we'll provide our algorithm with some training data (training examples). The training data consists of sentences and an output. The sentences are often referred to as the input (an input vector, or the independent variable). The output is whether a given sentence is positive or negative (this is the dependent variable).

Supervised machine learning is so called because we provide training data to the algorithm, and the algorithm learns from that dataset to make predictions.

Unsupervised machine learning, on the other hand, uses unlabeled data and algorithms such as clustering to make sense of the data. In an unsupervised setting, algorithms group data together and make predictions based on those groupings.

In order to teach our algorithm sentiments, we need to go through a number of steps - six steps, to be more precise. But before investigating what these steps are, let's take a look at the data itself.

The data

We are going to start off by examining two documents, one that contains a positive review and one that contains a negative review:

{ "review": "I loved the film", "positive": 1 }
{ "review": "I completely hated the film", "positive": 0 }

There's nothing surprising here. This data could be coming from a website, from a forum - it really doesn't matter. To keep things simple we are going to stick to reviews modeled in this way.

A review consists of two properties: a review - which is the actual review written by someone, and a positive property, which is set to 1 if the review is positive and 0 if it is not.

NoSQL and Machine Learning

Machine learning requires data - large amounts of data. (In the evolution of machine learning, the amount of available data has played an important part, and so, of course, has the available processing power.)

Some machine learning tasks, such as sentiment analysis, work with unstructured text - we could run sentiment analysis on much larger pieces of text, articles, social media feeds and so on, in order to get an idea about a certain topic or sentiment.

Unstructured text means that we can't use schemas to load our data (well, we could, but it would take forever to come up with the right schema, and if a requirement changes - or a new social media feed arrives - we'd need to recreate the schema, which would make the project super-complex).

NoSQL's schema-agnostic approach to data modeling plays a key part in enabling machine learning.

Why MarkLogic?

The MarkLogic NoSQL database has some really great features, amongst which there are a few that we'll use in teaching our NLP algorithm. One of the most exciting things about this NoSQL database is that it is capable of doing word tokenization and stemming out of the box. This will help us out greatly later on.

The process

Loading the data

We can load the data to a MarkLogic database and add it to a collection. This can be easily achieved with this code executed against MarkLogic's Query Console:

declareUpdate();

const reviews = [
  { review: 'I loved the film', positive: 1 },
  { review: 'I completely hated the film', positive: 0 },
];

// Insert each review as a JSON document and add it to the 'reviews' collection
reviews.forEach((review, index) => {
  xdmp.documentInsert(`/reviews/review${index}`, review, {
    collections: 'reviews',
  });
});
('Inserted reviews'); // the value of the last expression is what Query Console displays
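
If we want to double-check that the documents landed in the collection, a quick read-only query in Query Console should do it:

// Count the documents in the 'reviews' collection - this should return 2
fn.count(fn.collection('reviews'));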

Step by step

These are the steps that we need to go through in order to teach our algorithm:

  1. Tokenization
  2. Stemming
  3. Merge
  4. Feature Scaling
  5. Train
  6. Classify

Simply put, what we want to achieve is to have a list of unique words for our dataset, so if we have the following two reviews:

"I loved the film"
"I completely hated the film"

We'd like to see the following result:

['i', 'loved', 'the', 'film', 'completely', 'hated'];

Notice that we have applied a lowercase transformation to the terms as well. Furthermore, we could apply stemming to these terms:

['i', 'love', 'the', 'film', 'completely', 'hate'];

And maybe remove some common words (also known as 'stop-words'):

['love', 'film', 'completely', 'hate'];

This is the final data structure that we are after for now. Let's go through the appropriate steps and see how we can achieve this.
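
To make the target concrete, here's a rough plain-JavaScript sketch of the whole pipeline (the tiny stem lookup table is only a stand-in for real stemming and covers just the words in our two reviews):

const stopwords = ['a', 'be', 'the', 'that', 'this', 'i', 'do', 'it'];
const stems = { loved: 'love', hated: 'hate' };

// Lowercase, split into words, apply the stem lookup and drop stop-words
const terms = (sentence) =>
  sentence
    .toLowerCase()
    .split(' ')
    .map((word) => stems[word] || word)
    .filter((word) => !stopwords.includes(word));

const unique = [
  ...new Set([
    ...terms('I loved the film'),
    ...terms('I completely hated the film'),
  ]),
];
console.log(unique); // [ 'love', 'film', 'completely', 'hate' ]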

Tokenization

Tokenization is the exercise where we take a sentence and create tokens from it, that is, we extract words (and other parts) from a sentence. We are especially interested in word tokens.

In JavaScript we can write something like this to achieve tokenization:

'I loved the film'.split(' '); // ["I", "loved", "the", "film"]

Luckily MarkLogic can do tokenization for us - it can take a sentence and return what it considers a 'word', a 'space' or 'punctuation'. We can use this built-in functionality to collect 'word' tokens. We'll take a look at this in a moment.
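
Here's a minimal sketch that can be run in Query Console to see those token types in action:

for (const token of cts.tokenize('I loved the film')) {
  // Log each token alongside its type ('word', 'space' or 'punctuation')
  console.log(sc.name(sc.type(token)).toString(), token.toString());
}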

Stemming

Stemming is an interesting concept - we take a term and stemming produces its root form. For example 'mice' stems back to the root form 'mouse', 'tables' stems back to 'table' and 'running' stems back to 'run'. You get the idea.

Without our NoSQL database, MarkLogic, we'd have to look for a stemming library to do this work for us. Luckily, MarkLogic has stemming built in, and we can use its stemming function:

cts.stem('mice'); // mouse
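
Depending on the setup, cts.stem can return a sequence of candidate stems, so we'll usually take the first one - the same pattern appears in the code later on:

// Take the first stem from the sequence cts.stem returns
const firstStem = (word) => Array.from(cts.stem(word, 'en'))[0];
firstStem('loved'); // love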

Merge

Now, we need to take all the terms from the reviews and merge them together, remembering that we should remove duplicate terms.

Merging and duplicate removal can be achieved really easily using ES2015 Sets and the ... spread operator:

const a = [0, 1, 2];
const b = [0, 1, 3];
[...new Set([...a, ...b])]; // [0, 1, 2, 3]
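
Applied to the stemmed, stop-word-filtered terms of our two reviews, the same trick gives us the unique list from earlier:

const first = ['love', 'film']; // from "I loved the film"
const second = ['completely', 'hate', 'film']; // from "I completely hated the film"
[...new Set([...first, ...second])]; // [ 'love', 'film', 'completely', 'hate' ]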

Let's pause for a moment here and take a look at the code that we should execute against our database to get the desired result. The code below achieves everything that we have discussed so far - it iterates through all the documents in the reviews collection, tokenizes and stems the terms, removes stop-words and returns an array of unique terms.

// Collect the 'word' tokens from every review in the 'reviews' collection
const tokens = [];
for (const document of fn.collection('reviews')) {
  for (const token of cts.tokenize(document.toObject().review.toLowerCase())) {
    if (
      fn.deepEqual(
        sc.name(sc.type(token)),
        fn.QName('http://marklogic.com/cts', 'word')
      )
    ) {
      tokens.push(token);
    }
  }
}
// Stem each token (taking the first stem), de-duplicate and remove stop-words
const stems = tokens.map((token) => Array.from(cts.stem(token, 'en'))[0]);
let unique = [...new Set(stems)];
const stopwords = [
  'a',
  'be',
  'the',
  'that',
  'this',
  'i',
  'do',
  'it',
  's',
  've',
  're',
];
unique = unique.filter((term) => !stopwords.includes(term));
unique; // the last expression is returned as the query result

The above yields exactly what we were after:

[ love, film, completely, hate ]

Feature Scaling

Now comes one of the most important parts of the entire process: featurizing, or feature scaling. The idea behind feature scaling is that we take our input values and turn them into fixed-length vectors of values between 0 and 1. In other words, we try to normalize our data.

Feature scaling is especially important for regressions.

For our case we'll take a simple approach: we'll take our unique list of word tokens and iterate through it; for each term, we'll either return a 0 (the term doesn't appear in the review) or a 1 (the term appears).

We need to do this step for all the documents in our database, and later on we'll need to convert new incoming sentences (new reviews, on which we'd like to run sentiment analysis) to 0s and 1s.

[ love, film, completely, hate ] // unique word list
"I loved the film" // gets converted to [ love, film ]
"I completely hated the film" // gets converted to [ completely, hate, film ]

The result should be an array with 4 elements (because the unique word list contains 4 terms). This is how we need to think about the process:

  • The first term in the unique array is 'love'. Does the sentence in the first document contain that term? Yes - add 1 to the array.
  • The second term in the unique array is 'film'. Does the sentence in the first document contain that term? Yes - add 1 to the array.
  • The third term in the unique array is 'completely'. Does the sentence in the first document contain that term? No - add 0 to the array.

Going through this process for both of the documents will yield these two arrays:

[ love, film, completely, hate ] // unique word list
"I loved the film" -> [ love, film ] -> [1, 1, 0, 0]
"I completely hated the film" -> [ completely, hate, film ] -> [0, 1, 1, 1]

We'll take these arrays and update our documents with these values. The code is a bit complicated but nevertheless achieves what we are after:

declareUpdate();

// 'stopwords' and 'unique' are the values computed in the previous step;
// when running this in Query Console, define them in the same script.
for (const document of fn.collection('reviews')) {
  let t = [];
  for (const token of cts.tokenize(document.toObject().review.toLowerCase())) {
    if (
      fn.deepEqual(
        sc.name(sc.type(token)),
        fn.QName('http://marklogic.com/cts', 'word')
      )
    ) {
      const stem = Array.from(cts.stem(token.toString(), 'en'))[0];
      t.push(stem);
    }
  }
  t = t.filter((term) => !stopwords.includes(term));
  // 1 if the review contains the unique term, 0 otherwise
  const features = unique.map((term) => (t.includes(term) ? 1 : 0));
  const documentToInsert = document.toObject();
  documentToInsert.features = features;
  xdmp.documentInsert(fn.baseUri(document), documentToInsert, {
    collections: 'reviews',
  });
}

The result is a document structure that looks like this:

{
  "review": "I loved the film",
  "positive": 1,
  "features": [1, 1, 0, 0]
}

Our data is now feature scaled. We can proceed to the next step.

Train

We have our data in the right shape; it's time to train our algorithm. To do this we'll be using Brain.js - a JavaScript neural network library. It's installable via npm, and it is very straightforward to use with Node.js.

During the training step we need to process our data so that it has the following format, which Brain.js accepts:

[
  { input: [1, 1, 0, 0], output: { liked: 1 } },
  { input: [0, 1, 1, 1], output: { disliked: 1 } },
]

It's very straightforward: if the input looks like [1, 1, 0, 0], the output is that the review is positive, so the reviewer liked the film. If it's the other input, the output is that the reviewer did not like the film.

We will use these two values to train our network:

const brain = require('brain.js');

const net = new brain.NeuralNetwork();
const trainingData = [
  { input: [1, 1, 0, 0], output: { liked: 1 } },
  { input: [0, 1, 1, 1], output: { disliked: 1 } },
];

net.train(trainingData);
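
If the defaults aren't good enough, net.train also accepts an options object to control the training run (the values below are purely illustrative):

net.train(trainingData, {
  iterations: 20000, // maximum number of training iterations
  errorThresh: 0.005, // stop once the training error drops below this value
});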

Classify

Time to test our algorithm. In this step we will create a new sentence and see if our algorithm can determine if it's a positive or a negative one.

What's important to remember is that we need to do the same processing for our new, incoming review that we did with the previous reviews. We need to tokenize it, stem it, merge the unique terms and feature scale it.

So if the incoming sentence is "There's a lot to hate about this film", the result should be [ there, lot, to, hate, about, film ] (don't forget we also remove stop-words).

And now we need to feature scale it - go through our unique term list of [ love, film, completely, hate ] and add 0s and 1s:

[ there, lot, to, hate, about, film ] -> [0, 1, 0, 1]

We are now ready to see if our algorithm can figure out if this review has a positive or a negative sentiment:

const result = net.run([0, 1, 0, 1]);
console.log(result); // { liked: 0.19893291592597961, disliked: 0.8026838898658752 }

As you can see, the value for disliked is a lot higher, therefore we can state that this new reviewer did not like the film. The returned values are confidence scores between 0 and 1 that Brain.js calculated for us based on the feature-scaled data we trained it on.
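
A simple way to turn these scores into a verdict is to pick the larger one:

const verdict = result.liked > result.disliked ? 'liked' : 'disliked';
console.log(verdict); // disliked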

MarkLogic and JavaScript

In earlier examples we saw that we can execute JavaScript code in MarkLogic's Query Console interface. This is a great feature as it allows us to process our data at the database level. But when it comes to our Node.js application, we also have to do stemming and tokenization on every new review that we want to classify. This would be tricky, as we'd have to come up with code of our own to achieve it.

Luckily, MarkLogic allows us to invoke JavaScript code stored in the database. This means that we can call the same JavaScript code that we have used so far from within our Node.js application and simply process the response that comes back from the database.

Consider this piece of code, where process.sjs is a Server-Side JavaScript file stored in our database that does the processing of our input.

// The 'marklogic' Node.js client provides the db and qb objects used below;
// the connection details are placeholders - adjust them to your environment.
const marklogic = require('marklogic');
const db = marklogic.createDatabaseClient({
  host: 'localhost',
  port: 8000,
  user: 'admin',
  password: 'admin',
});
const qb = marklogic.queryBuilder;

const input = process.argv[2];

let processedInput;

// Invoke the stored server-side module, passing the new review in as an external variable
db.invoke({
  path: '/ext/process.sjs',
  variables: { input },
})
  .result()
  .then((response) => {
    processedInput = response[0].value;
    console.log('processedInput', processedInput);
    return db.documents
      .query(qb.where(qb.collection('reviews')).slice(0, 100))
      .result();
  });

The above code sets processedInput to [ 'there', 'lot', 'to', 'hate', 'about', 'film' ] - exactly what we achieved manually earlier.
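
The contents of process.sjs aren't shown here, but based on the steps above it could look roughly like this (a sketch - the stop-word list and the exact behavior are assumptions):

// /ext/process.sjs - a rough sketch of the stored server-side module
var input; // external variable bound by db.invoke({ variables: { input } })

const stopwords = ['a', 'be', 'the', 'that', 'this', 'i', 'do', 'it', 's', 've', 're'];
const stems = [];
for (const token of cts.tokenize(String(input).toLowerCase())) {
  if (
    fn.deepEqual(
      sc.name(sc.type(token)),
      fn.QName('http://marklogic.com/cts', 'word')
    )
  ) {
    stems.push(Array.from(cts.stem(token.toString(), 'en'))[0]);
  }
}
// The value of the last expression is what the Node.js client receives
stems.filter((term) => !stopwords.includes(term));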

We can also use MarkLogic and an sjs file to return all unique terms from our documents by invoking, from Node.js, the module that we saw earlier. Furthermore, we can query for all documents in our 'reviews' collection, iterate through them, pass the response back to the Node.js application and use that to train our algorithm.

Closing thoughts

In this article we have looked at two documents - and we have used some manual steps to calculate the normalized value of the input sentence. In a real application we'd have to use many more documents - the bigger the dataset, the better the algorithm can learn and the better predictions it can make. We'd also add some processing so that we can capture input from users.

Here's a video of a better trained algorithm in action - also capturing user input.