Natural Language Processing with NoSQL and JavaScript
Older Article
This article was published 8 years ago. Some information may be outdated or no longer applicable.
Natural Language Processing (NLP) sits inside Artificial Intelligence. More precisely, it’s about applying Machine Learning models to text and language. There are loads of tasks you can pull off with NLP. Think about Google Translate, or Siri. Those are NLP algorithms doing their thing.
In this article, we’ll apply sentiment analysis to film reviews. Sentiment analysis means an algorithm can look at an unknown review and decide whether it’s positive (the reviewer liked the film) or negative (the reviewer didn’t).
We’ll build a Node.js application that accepts a review (a sentence passed in as an argument) and determines whether it’s positive or negative.
Supervised machine learning
We’re going to use supervised machine learning. That means we’ll feed our algorithm some training data (training examples). The training data consists of sentences, often called inputs (input vectors, or independent variables), each paired with an output. The output tells us whether that sentence is positive or negative (the dependent variable).
Supervised machine learning gets its name because we’re providing training data to the algorithm. The algorithm learns from the dataset and makes predictions.
Unsupervised machine learning, on the other hand, uses unlabelled data and algorithms like clustering to make sense of things. Algorithms group data together and make predictions based on those groupings.
To teach our algorithm sentiments, we need to move through six steps. But before looking at those steps, let’s examine the data itself.
The data
We’ll start by examining two documents. One contains a positive review, the other a negative one:
{ "review": "I loved the film", "positive": 1 }
{ "review": "I completely hated the film", "positive": 0 }
Nothing surprising here. This data could come from a website, a forum, anywhere really. To keep things simple, we’ll stick to reviews modelled this way.
A review has two properties: review (the actual text someone wrote) and positive (set to 1 if the review is positive, 0 if it isn’t).
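As a quick sketch (the helper name is our own, not part of any library), checking that an object matches this shape looks like:

```javascript
// A review is valid if it has a string `review` and a 0/1 `positive` flag.
const isValidReview = (doc) =>
  typeof doc.review === 'string' && (doc.positive === 0 || doc.positive === 1);

console.log(isValidReview({ review: 'I loved the film', positive: 1 })); // true
console.log(isValidReview({ review: 42 })); // false
```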
NoSQL and Machine Learning
Machine learning requires data. Lots of it. (In the evolution of machine learning, available processing power mattered just as much as data volume.)
Some machine learning algorithms like sentiment analysis work with unstructured text. You could run sentiment analysis on much larger pieces of text, articles, social media feeds, and so on, to get a read on a particular topic or sentiment.
Unstructured text means you can’t use schemas to load your data (well, you could, but it’d take forever to design the right schema, and when a requirement changes or a new social media feed arrives, you’d need to rebuild the schema from scratch).
NoSQL’s schema-agnostic approach to data modelling plays a key part in enabling machine learning.
Why MarkLogic?
The MarkLogic NoSQL database has some excellent features, and a few of them are directly useful for teaching our NLP algorithm. One of the most interesting things about MarkLogic: it can do word tokenisation and stemming out of the box. That’ll save us a lot of work later.
The process
Loading the data
We can load the data into a MarkLogic database and add it to a collection. This is easily done with code executed against MarkLogic’s Query Console:
declareUpdate();
const reviews = [
  { review: 'I loved the film', positive: 1 },
  { review: 'I completely hated the film', positive: 0 },
];
reviews.forEach((review, index) => {
  xdmp.documentInsert(`/reviews/review${index}`, review, {
    collections: 'reviews',
  });
});
'Inserted reviews'; // Query Console displays the last expression
Step by step
These are the steps we need to follow to teach our algorithm:
- Tokenisation
- Stemming
- Merge
- Feature Scaling
- Train
- Classify
What we want is a list of unique words for our dataset. If we have these two reviews:
"I loved the film"
"I completely hated the film"
We’d like to end up with:
['i', 'loved', 'the', 'film', 'completely', 'hated'];
Notice we’ve applied a lowercase function on the terms. We can also apply stemming:
['i', 'love', 'the', 'film', 'completely', 'hate'];
And strip out some common words (also known as “stop words”):
['love', 'film', 'completely', 'hate'];
That’s the data structure we’re after. Let’s look at each step and see how to get there.
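Before diving into the MarkLogic specifics, here’s a rough sketch of that whole pipeline in plain JavaScript. The naiveStem function is a toy suffix-stripper of our own invention (a stand-in for a real stemmer like MarkLogic’s), and the stop-word list is abbreviated:

```javascript
// Sketch of the lowercase -> tokenise -> stem -> de-duplicate -> stop-word pipeline.
const stopwords = ['a', 'be', 'the', 'that', 'this', 'i', 'do', 'it'];

// Toy stemmer: only handles "-ed" endings ("loved" -> "love", "hated" -> "hate").
const naiveStem = (word) => word.replace(/ed$/, 'e');

const uniqueTerms = (reviews) => {
  // Split each review on non-word characters to get lowercase word tokens.
  const tokens = reviews.flatMap((review) =>
    review.toLowerCase().split(/\W+/).filter(Boolean)
  );
  const stems = tokens.map(naiveStem);
  // A Set removes duplicates; then we drop the stop words.
  return [...new Set(stems)].filter((term) => !stopwords.includes(term));
};

console.log(uniqueTerms(['I loved the film', 'I completely hated the film']));
// [ 'love', 'film', 'completely', 'hate' ]
```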
Tokenisation
Tokenisation is the exercise of taking a sentence and creating tokens from it. We extract words (and other parts) from a sentence. We’re especially interested in word tokens.
In JavaScript, tokenisation can look like this:
'I loved the film'.split(' '); // ["I", "loved", "the", "film"]
MarkLogic can handle tokenisation for us. It takes a sentence and returns what it considers a “word”, “space”, or “punctuation”. We can use this built-in functionality to collect “word” tokens. We’ll see this in a moment.
Stemming
Stemming is an interesting concept. You take a term and stemming produces its root form. “Mice” stems back to “mouse”. “Tables” stems back to “table”. “Running” stems back to “run”. You get the idea.
Without MarkLogic, we’d have to find a stemming library to do this work. MarkLogic has stemming built in and we can call its stemming function:
cts.stem('mice'); // mouse
Merge
Now we need to take all the terms from our reviews, merge them together, and remove duplicates.
Merging and duplicate removal can be done cleanly using ES2015 Sets and the ... spread operator:
const a = [0, 1, 2];
const b = [0, 1, 3];
[...new Set([...a, ...b])]; // [0, 1, 2, 3]
Let’s pause here and look at the code we’d execute against our database to get the desired result. The code below does everything we’ve discussed so far. It iterates through all documents in the reviews collection, tokenises and stems the terms, applies stop words, and returns an array of unique terms.
const tokens = [];
for (const document of fn.collection('reviews')) {
  for (const token of cts.tokenize(document.toObject().review.toLowerCase())) {
    if (
      fn.deepEqual(
        sc.name(sc.type(token)),
        fn.QName('http://marklogic.com/cts', 'word')
      )
    ) {
      tokens.push(token);
    }
  }
}
const stems = tokens.map((token) => Array.from(cts.stem(token, 'en'))[0]);
const stopwords = ['a', 'be', 'the', 'that', 'this', 'i', 'do', 'it', 's', 've', 're'];
const unique = [...new Set(stems)].filter((term) => !stopwords.includes(term));
unique; // Query Console displays the last expression
This yields exactly what we were after:
[ love, film, completely, hate ]
Feature Scaling
Now comes one of the most important parts of the entire process: featurising (often called feature scaling). The idea is to turn each input into a fixed-length vector of values between 0 and 1. In other words, we normalise our data.
Feature scaling is especially important for regressions.
For our case, we’ll take a simple approach. We’ll grab our unique list of word tokens and iterate through it. For each input value, we’ll return either a 0 (token doesn’t exist) or a 1 (token exists).
We need to do this for all the documents in our database. Later, we’ll also need to convert new incoming sentences (new reviews we want to analyse) into 0s and 1s.
[ love, film, completely, hate ] // unique word list
"I loved the film" // gets converted to [ love, film ]
"I completely hated the film" // gets converted to [ completely, hate, film ]
The result should be an array with 4 elements (because our unique word list has 4 terms). Here’s how to think about it:
- The first term in the unique array is “love”. Does the first document contain that term? Yes, so add 1 to the array.
- The second term is “film”. Does the first document contain that term? Yes, add 1.
- The third term is “completely”. Does it appear in the sentence? No, add 0.
- The fourth term, “hate”, doesn’t appear either, so add another 0.
Running through this process for both documents produces these two arrays:
[ love, film, completely, hate ] // unique word list
"I loved the film" -> [ love, film ] -> [1, 1, 0, 0]
"I completely hated the film" -> [ completely, hate, film ] -> [0, 1, 1, 1]
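As a plain-JavaScript sketch, turning a processed token list into such a feature vector is a one-liner (featurise is our own helper name):

```javascript
// Map the unique word list onto a review's tokens:
// 1 if the term occurs in the review, 0 otherwise.
const featurise = (unique, tokens) =>
  unique.map((term) => (tokens.includes(term) ? 1 : 0));

const unique = ['love', 'film', 'completely', 'hate'];
console.log(featurise(unique, ['love', 'film'])); // [1, 1, 0, 0]
console.log(featurise(unique, ['completely', 'hate', 'film'])); // [0, 1, 1, 1]
```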
We’ll take these arrays and update our documents with these values. The code is a bit involved but gets the job done:
// `unique` and `stopwords` are the arrays built in the previous script; in
// Query Console each run has its own scope, so redefine them (or combine the scripts).
declareUpdate();
for (const document of fn.collection('reviews')) {
  let t = [];
  for (const token of cts.tokenize(document.toObject().review.toLowerCase())) {
    if (
      fn.deepEqual(
        sc.name(sc.type(token)),
        fn.QName('http://marklogic.com/cts', 'word')
      )
    ) {
      t.push(Array.from(cts.stem(token.toString(), 'en'))[0]);
    }
  }
  t = t.filter((term) => !stopwords.includes(term));
  const features = unique.map((term) => (t.includes(term) ? 1 : 0));
  const documentToInsert = document.toObject();
  documentToInsert.features = features;
  xdmp.documentInsert(fn.baseUri(document), documentToInsert, {
    collections: 'reviews',
  });
}
The result is a document structure that looks like this:
{
"review": "I loved the film",
"positive": 1,
"features": [1, 1, 0, 0]
}
Feature scaling is done. On to the next step.
Train
Our data is in the right shape. Time to train the algorithm. We’ll use Brain.js, a JavaScript Neural Network library. It’s installable via npm and simple to use with Node.js.
During training, we need to reshape our data into the format Brain.js expects:
[
  { input: [1, 1, 0, 0], output: { liked: 1 } },
  { input: [0, 1, 1, 1], output: { disliked: 1 } },
]
If the input looks like [1, 1, 0, 0], the output says the review is positive (the reviewer liked the film). The other input produces the opposite output.
We’ll use these two values to train our network:
const brain = require('brain.js');

const net = new brain.NeuralNetwork();
const trainingData = [
  { input: [1, 1, 0, 0], output: { liked: 1 } },
  { input: [0, 1, 1, 1], output: { disliked: 1 } },
];
net.train(trainingData);
Classify
Time to test our algorithm. We’ll create a new sentence and see if our algorithm can tell whether it’s positive or negative.
Remember: we need to process the new incoming review the same way we processed the previous ones. Tokenise it, stem it, merge the unique terms, and feature scale it.
If the incoming sentence is “There’s a lot to hate about this film”, the result should be [ there, lot, to, hate, about, film ] (after removing stop words; “to” isn’t in our stop-word list, so it survives).
Now we feature scale it. Walk through our unique term list of [ love, film, completely, hate ] and assign 0s and 1s:
[ there, lot, to, hate, about, film ] -> [0, 1, 0, 1]
Let’s see if the algorithm can figure out the sentiment:
const result = net.run([0, 1, 0, 1]);
console.log(result); // { liked: 0.19893291592597961, disliked: 0.8026838898658752 }
The value for “disliked” is much higher. So we can say this reviewer didn’t like the film. The returned values are probabilities that Brain.js calculated based on the values in our feature-scaled model.
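To turn that result object into a label in our Node.js application, a small helper (our own, not part of Brain.js) can pick whichever output has the highest activation:

```javascript
// Given Brain.js output like { liked: 0.199, disliked: 0.803 },
// return the name of the output with the highest value.
const classify = (result) =>
  Object.entries(result).reduce((best, current) =>
    current[1] > best[1] ? current : best
  )[0];

console.log(classify({ liked: 0.199, disliked: 0.803 })); // "disliked"
```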
MarkLogic and JavaScript
In earlier examples, we ran JavaScript code in MarkLogic’s Query Console interface. That’s a great feature because it lets you process data at the database level. But when it comes to our Node.js application, we still need stemming and tokenisation on all new reviews we want to classify. Writing that from scratch would be tricky.
MarkLogic lets us invoke JavaScript code stored in the database. We can call the JavaScript code we’ve already written from within our Node.js application and simply process the response that comes back.
Consider this piece of code, where process.sjs is a Server-Side JavaScript file stored in our database that handles the processing of our input:
// `db` is a MarkLogic DatabaseClient and `qb` its query builder,
// both created via the `marklogic` npm package.
const input = process.argv[2];

let processedInput;
db.invoke({
  path: '/ext/process.sjs',
  variables: { input },
})
  .result()
  .then((response) => {
    processedInput = response[0].value;
    console.log('processedInput', processedInput);
    return db.documents
      .query(qb.where(qb.collection('reviews')).slice(0, 100))
      .result();
  });
The code above returns processedInput as [ 'there', 'lot', 'to', 'hate', 'about', 'film' ], exactly what we achieved manually earlier.
We can also use MarkLogic and an sjs file to return all unique terms from our documents by invoking the module from Node.js. We can query for all documents in our “reviews” collection, iterate through them, and pass the response back to the Node.js application to train our algorithm.
Closing thoughts
In this article, we looked at two documents and used some manual methods to calculate the normalised value of the input sentence. In a real application, you’d use many more documents. The bigger the dataset, the better the algorithm learns and the better its predictions become. You’d also add processing to capture input from users.
Here’s a video of a better-trained algorithm in action, also capturing user input.