MarkLogic Node.js API -- Working with Binary Documents
Older Article
This article was published 10 years ago. Some information may be outdated or no longer applicable.
This article first appeared on the MarkLogic Developer Blog
We’re going to explore what Node.js developers can do when it comes to managing and working with binary documents in a MarkLogic database.
When I first started at MarkLogic, I was struck by how easily the database ingests structured, semi-structured and unstructured documents. If you’ve worked with databases before, you know the hassle of persisting binary data. In MarkLogic, it’s simple. Any type of binary can go in: PDF, Word documents, PowerPoint, MP3, MP4, you name it.
There’s a lot to say about how MarkLogic stores binary documents. There’s support for small binaries (those below the database’s large size threshold, 1 MB by default), large binaries, and ‘external’ binaries. To learn more, check out this page.
Binary nodes and properties
There’s something particular about binary documents: they aren’t searchable, because MarkLogic stores them as binary nodes internally.
So what if you’ve got a large collection of videos or music and you want to tag them? Maybe the location where a video was recorded, or the artist of your favourite track. Information you’ll want to search on later.
MarkLogic has a solution: properties metadata. This metadata is an XML document that shares the same URI with a document in the database. The properties metadata holds element-value pairs to store information.
Binaries (and their properties documents) are governed by role-based security. You need to authenticate as a user with a role that has read permission on the binary document to view it and search across its properties.
Check out MarkLogic University’s series of short video tutorials to learn more about security in MarkLogic.
Going back to the favourite-track example: you could have a song in your database identified by a URI like /song/j-balvin-safari.mp3. That document could carry these properties:
<artist>J Balvin</artist>
<title>Safari</title>
You can add properties metadata when you insert a document, or add it later with an update. If a document has properties metadata, you can see it in Query Console.
Take a look at these short video tutorials on how to insert documents and how to update documents using the MarkLogic Node.js Client API.
Extracting Metadata from binaries
Some binary documents carry metadata by nature. Think about a Microsoft Word document: it stores the author, word/character count, last saved time and more. You can extract that metadata in MarkLogic and store it as properties.
There are several ways to achieve this in MarkLogic. Have a look at the Search Developer’s Guide chapter on binary documents to learn more.
Inserting documents
If you want to follow along with examples, clone this GitHub repository: https://github.com/tpiros/marklogic-nodejs-binaries.
To insert a binary document with metadata using the Node.js Client API:
db.documents
  .write({
    uri: uri,
    contentType: 'audio/mpeg',
    properties: {
      artist: 'J Balvin',
      title: 'Safari',
      album: 'Energia',
    },
    content: readStream,
  })
  .result((response) => console.log(response))
  .catch((error) => console.log(error));
Notice the properties property in the document descriptor. That’s what assigns properties metadata to the binary document.
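The snippet above assumes that db, uri and readStream already exist. As a minimal sketch of how the descriptor itself might be assembled, the helper below derives the contentType from the file extension and attaches the properties. buildDescriptor and the MIME_TYPES table are hypothetical names for illustration, not part of the MarkLogic API:

```javascript
const path = require('path');

// Hypothetical lookup table: extend it with the formats you store.
const MIME_TYPES = {
  '.mp3': 'audio/mpeg',
  '.mp4': 'video/mp4',
  '.jpg': 'image/jpeg',
  '.png': 'image/png',
  '.pdf': 'application/pdf',
};

// Builds a document descriptor shaped like the one passed to
// db.documents.write() above. `content` is whatever you want to store,
// for example a stream created with fs.createReadStream().
function buildDescriptor(uri, properties, content) {
  const ext = path.extname(uri).toLowerCase();
  return {
    uri,
    // Fall back to a generic binary type for unknown extensions.
    contentType: MIME_TYPES[ext] || 'application/octet-stream',
    properties,
    content,
  };
}

const descriptor = buildDescriptor(
  '/song/j-balvin-safari.mp3',
  { artist: 'J Balvin', title: 'Safari', album: 'Energia' },
  Buffer.from('placeholder bytes') // stand-in for a real read stream
);
console.log(descriptor.contentType); // audio/mpeg
```

The resulting object could then be handed to db.documents.write() in place of the literal descriptor shown earlier.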
If you’ve cloned the GitHub repository, run npm run setup to insert some binaries into your database. Make sure you’ve set up the project dependencies as outlined in the readme file.
Now that we’ve got some binary documents in the database, let’s talk about how to display them. There are a few options depending on the binary’s size and whether you want to read the full document or just part of it.
Displaying Images
To display an image, we can use the MarkLogic Node.js Client API’s stream result handling pattern. (There’s also a promise result handling pattern available.)
Using streams
It’s good practice to work with streams when reading binary documents. You ask for chunks of data (smaller pieces that the database sends to your application). When working with streams in JavaScript, we can use event listeners via the on() method, listening for events like data, error and end. The data event fires each time a chunk arrives. This example assumes we’ve already got an image in the database:
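http
  .createServer((req, res) => {
    const uri = req.url;
    const data = [];
    db.documents
      .read(uri)
      .stream('chunked')
      // Each 'data' event delivers one chunk (a Buffer) of the binary.
      .on('data', (chunk) => {
        data.push(chunk);
      })
      .on('error', (error) => {
        console.log(error);
        // End the response so the browser isn't left waiting.
        res.statusCode = 500;
        res.end();
      })
      .on('end', () => {
        // Buffer.concat() allocates the final buffer itself, so there is
        // no need to pre-allocate one with the deprecated new Buffer().
        res.end(Buffer.concat(data));
      });
  })
  .listen(3000);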
To see this example in action, run
npm run image.
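The data/end pattern isn't specific to MarkLogic; it works with any readable stream. Here is a minimal sketch using only Node's built-in stream module, where Readable.from() stands in for the stream returned by db.documents.read(uri).stream('chunked'). The collectStream helper is a hypothetical name for illustration:

```javascript
const { Readable } = require('stream');

// Hypothetical helper: gathers every chunk of a readable stream and
// resolves with a single Buffer once the stream ends.
function collectStream(stream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    stream
      .on('data', (chunk) => chunks.push(chunk))
      .on('error', reject)
      .on('end', () => resolve(Buffer.concat(chunks)));
  });
}

// Stand-in for the chunked stream from the database: two small Buffers.
const fakeStream = Readable.from([Buffer.from('JPEG'), Buffer.from('data')]);

const collected = collectStream(fakeStream);
collected.then((buffer) => console.log(buffer.length)); // 8
```

Buffering like this keeps the whole document in memory, which is fine for small images. For larger binaries, piping the stream straight into the response avoids that cost.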
Displaying Videos using Range requests
When it comes to displaying videos via the Node.js Client API, we need to talk about partial HTTP GET requests (range requests) and the Range and Content-Range headers.
What’s the difference between streaming a binary in chunks (like we did for images) and serving it through range requests?
The difference matters. Range requests retrieve part of a binary document. You specify a start and end byte, giving you retryable, random access to portions of the binary.
Accessing part of a binary
Why does this matter? Think about a video in the database that we want to show a user. We shouldn’t download the entire video upfront. Instead, we want to grab the first chunk of bytes, enough to start playback. As they watch, we download subsequent parts (buffering the video). And if the user wants to skip ahead? A range request for the bytes at the new position handles that cleanly.
In practice, the MarkLogic Node.js Client API lets you pass a range to grab parts of a document:
db.documents.read({
  uris: '/binary/song.m4a',
  range: [0, 511999],
});
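This returns only the requested slice of the document rather than the whole binary. Note that the range arithmetic in the next example treats the second value as the position just after the last byte, so [0, 511999] covers bytes 0 through 511998.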
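To walk through a binary piece by piece, the range values can be computed rather than hard-coded. Here is a minimal sketch, assuming the second range value is the position just after the last byte (as the range arithmetic later in this article implies). chunkRange is a hypothetical helper, not part of the Client API:

```javascript
// Returns the [start, end) pair for the chunk at `index` (zero-based),
// or null when the chunk would start past the end of the document.
function chunkRange(index, chunkSize, totalLength) {
  const start = index * chunkSize;
  if (start >= totalLength) return null;
  // The final chunk may be shorter than chunkSize.
  const end = Math.min(start + chunkSize, totalLength);
  return [start, end];
}

console.log(chunkRange(0, 512000, 1200000)); // [ 0, 512000 ]
console.log(chunkRange(2, 512000, 1200000)); // [ 1024000, 1200000 ]
console.log(chunkRange(3, 512000, 1200000)); // null
```

Each returned pair could be passed as the range value of db.documents.read().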
Handling ranges from the browser
Here’s the tricky part. How do we dynamically populate the range array from the previous example? We need to check for the Range header that the browser sends with its request, extract the start and end bytes, and pass them into the range array.
We also need to return an HTTP 206 (Partial Content) status code, along with a Content-Range header that describes the bytes we’re sending back.
Here’s how it looks in practice:
db.documents
  .probe(uri)
  .result()
  .then((response) => {
    let { contentLength, contentType } = response;
    contentLength = Number(contentLength);
    const rangeRequest = req.headers.range;
    if (rangeRequest) {
      const [partialStart, partialEnd] = rangeRequest
        .replace(/bytes=/, '')
        .split('-');
      const start = Number(partialStart);
      // The end byte in a Range header is inclusive; when it is omitted,
      // default to the last byte of the document.
      const end = partialEnd
        ? Math.min(Number(partialEnd), contentLength - 1)
        : contentLength - 1;
      const chunksize = end - start + 1;
      // The range passed to read() ends one position after the last byte.
      const streamEnd = end + 1;
      const header = {
        'Content-Disposition': 'filename=' + uri,
        'Content-Range': 'bytes ' + start + '-' + end + '/' + contentLength,
        'Accept-Ranges': 'bytes',
        'Content-Length': chunksize,
        'Content-Type': contentType,
      };
      res.writeHead(206, header);
      const stream = db.documents
        .read({ uris: uri, range: [start, streamEnd] })
        .stream('chunked');
      stream.pipe(res);
      stream.on('end', () => res.end());
    } else {
      // No Range header: send the whole document.
      res.setHeader('Content-Type', contentType);
      res.setHeader('Content-Length', contentLength);
      const stream = db.documents.read({ uris: uri }).stream('chunked');
      stream.pipe(res);
    }
  })
  .catch((error) => console.log(error));
To see this code in action, run
npm run range.
In the code above, we first call db.documents.probe() to get the document’s content type and length for later reuse. Then we check for the Range header on the incoming request, extract the start and end bytes, and build the headers we’ll return along with the 206 status code.
Once that’s done, we create a stream by calling db.documents.read() with the stream result handling pattern.
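Browsers can send a few variations of the Range header, including an open-ended form (bytes=500-) and a suffix form (bytes=-200, meaning the last 200 bytes). Here is a minimal sketch of a more defensive parser for a single range. parseRange is a hypothetical helper, and it returns inclusive byte positions:

```javascript
// Parses a single-range header such as "bytes=0-100", "bytes=500-" or
// "bytes=-200". Returns inclusive start and end byte positions, or null
// when the header is missing, malformed or cannot be satisfied.
function parseRange(header, contentLength) {
  const match = /^bytes=(\d*)-(\d*)$/.exec(header || '');
  if (!match || (match[1] === '' && match[2] === '')) return null;

  let start;
  let end;
  if (match[1] === '') {
    // Suffix form: the number counts bytes from the end of the document.
    const suffixLength = Number(match[2]);
    if (suffixLength === 0) return null;
    start = Math.max(contentLength - suffixLength, 0);
    end = contentLength - 1;
  } else {
    start = Number(match[1]);
    end = match[2] === '' ? contentLength - 1 : Number(match[2]);
    // Clamp an end that runs past the last byte.
    end = Math.min(end, contentLength - 1);
  }
  if (start > end) return null;
  return { start, end };
}

console.log(parseRange('bytes=0-100', 1000)); // { start: 0, end: 100 }
console.log(parseRange('bytes=900-', 1000)); // { start: 900, end: 999 }
console.log(parseRange('bytes=-200', 1000)); // { start: 800, end: 999 }
console.log(parseRange('bytes=1000-', 1000)); // null
```

A null result could be answered with a 416 (Range Not Satisfiable) status, or by falling back to sending the full document.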
When returning the 206 status code, some arithmetic is needed to get the right data length into the appropriate headers. For example, if we request bytes 0-100 from a file with a total content length of 1000, we’d set these headers:
Content-Range: bytes 0-100/1000
Content-Length: 101
Remember: when specifying the Content-Range, you specify the first and last byte inclusive.
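The arithmetic is easy to get wrong by one byte, so it can help to keep it in one place. A small sketch, with rangeHeaders as a hypothetical helper name:

```javascript
// Builds the response headers for a 206 reply. `start` and `end` are
// inclusive byte positions; `total` is the full document length.
function rangeHeaders(start, end, total) {
  return {
    'Content-Range': `bytes ${start}-${end}/${total}`,
    'Accept-Ranges': 'bytes',
    // Both ends are inclusive, hence the + 1.
    'Content-Length': end - start + 1,
  };
}

const headers = rangeHeaders(0, 100, 1000);
console.log(headers['Content-Range']); // bytes 0-100/1000
console.log(headers['Content-Length']); // 101
```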
Notice the else statement that covers scenarios where no range headers are sent.
Working with metadata
At the start of this article, I mentioned that we can store metadata against binary documents via properties.
Using the MarkLogic Node.js Client API, it’s possible to manage (insert, update and delete) the properties metadata. If you’ve been following along with the GitHub scripts, the setup script has already inserted properties metadata for some of the binary documents.
To extract and display the metadata, we just tell the API to retrieve that information:
db.documents
  .read({
    uris: uri,
    categories: ['properties'],
  })
  .result()
  .then((data) => res.end(JSON.stringify(data[0].properties)))
  .catch((error) => console.log(error));
To see this in action, run
npm run metadata.
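If the read comes back with an empty array, data[0] is undefined and the snippet above throws. A small guard can turn that case into a 404 instead. This is only a sketch: propertiesResponse is a hypothetical helper, and it assumes the result is an array of document descriptors as in the example:

```javascript
// Turns the result of a properties read into an HTTP status and body.
function propertiesResponse(documents) {
  const doc = Array.isArray(documents) ? documents[0] : undefined;
  if (!doc || !doc.properties) {
    return {
      status: 404,
      body: JSON.stringify({ error: 'No properties found' }),
    };
  }
  return { status: 200, body: JSON.stringify(doc.properties) };
}

// Sample input shaped like the descriptor used in the example above.
const found = propertiesResponse([
  { uri: '/song/j-balvin-safari.mp3', properties: { artist: 'J Balvin' } },
]);
console.log(found.status, found.body); // 200 {"artist":"J Balvin"}
console.log(propertiesResponse([]).status); // 404
```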
Example application
All the previous examples used separate scripts. If you’re curious, run npm run app to launch a sample application that uses React and the techniques discussed above to display video information.
The application’s source code is available on GitHub.
Conclusion
When building applications, you’ll inevitably encounter binary documents in various formats: JPEGs, PDFs and so on. The MarkLogic Node.js Client API lets you manage and display binary documents easily, and it also lets you assign and manage metadata against them.