JSON-Base64
Standards-based record file format.  Streaming-capable, public domain.

Specification

What follows is the technical specification of the JSON-Base64 file format. This page is intended for software developers building a library to support JSON-Base64 in their applications. The Overview page has a more general-purpose overview of the JB64 file format.

As you might guess by the name, JSON-Base64 requires a JSON library and a slightly modified Base64 encoder/decoder. An MD5 implementation can be used to check input data for errors. Nearly all programming languages either support these components natively via built-in function calls or have readily available drop-in libraries for the required components.

There are two official reference implementations of JSON-Base64 in PHP and C# that include both a library and a test suite for the library. The official reference implementations can be found on the Implementations page and are open source software under an MIT or LGPL license of your choice. The reference implementations are meant primarily to offer a few ideas. However, each programming language is different and users of that language have certain ways they prefer to do things. Therefore, it is important to cater to users of a JB64 library by balancing flexibility with ease-of-use and maintainability.

Encoding JSON-Base64

The informal specification for JB64 encoding is as follows for each record:

Each record contains the exact same number of columns as all of the other records. When a record contains an incorrect number of columns, it is considered to be an error.

The very first record must contain column information instead of data. The first record is an array containing one two-entry array per column, so the number of inner arrays matches the number of columns. The first entry is a string containing the name of the column and is intended to be used as a column header. The second entry is a string containing the data type of the column. The following types are allowed:

Columns in data records must be prepared according to the column definition. Entries in the array for the record are always specified in the same order as the column definitions in the header record. The most unusual preparation is for binary columns: the value must be Base64 encoded and the output then modified by replacing '+' with '-', replacing '/' with '_', and removing trailing '=' bytes.
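For illustration, a minimal PHP sketch of this binary column preparation might look like the following. The function name jb64_encode_binary() is hypothetical and not taken from the reference implementations.

    <?php
        // Hypothetical helper:  applies the modified Base64 encoding used for
        // 'binary' (and 'custom:') column values.
        function jb64_encode_binary($data)
        {
            $result = base64_encode($data);

            // Replace '+' with '-' and '/' with '_', then strip trailing '=' padding.
            return rtrim(strtr($result, "+/", "-_"), "=");
        }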

Null values may be stored if the encoding application supports them. For maximum portability, null values should be avoided and the empty value for the column used instead. See the decoding section below for the empty value mapping.

An encoder should attempt to store record data in the column format declared by the header. Logical conversions of data types may take place during this process to normalize the output. See the official reference implementations for various data normalization practices.
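The following sketch pulls the encoding steps together. It assumes each output line is the modified Base64 encoding of the JSON-encoded record array, optionally followed by an MD5 hash of that Base64 data and a '\r\n' terminator; the '.' separator between the Base64 data and the hash is an assumption made for this sketch, and the authoritative line layout is given by the EBNF notation below. jb64_encode_record() is a hypothetical name and jb64_encode_binary() is the hypothetical helper from the previous example.

    <?php
        // Illustrative sketch of encoding one record (header or data) to a
        // single JB64 line, under the assumptions described above.
        function jb64_encode_record(array $record, $addhash = true)
        {
            $data = jb64_encode_binary(json_encode($record));

            // The optional MD5 hash lets a decoder detect damaged lines.
            // (The '.' separator is an assumption for this sketch.)
            if ($addhash)  $data .= "." . md5($data);

            return $data . "\r\n";
        }

        // Example:  a header record followed by one data record.
        echo jb64_encode_record(array(array("id", "integer"), array("name", "string")));
        echo jb64_encode_record(array(1, "JSON-Base64"));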

If an error occurs while encoding, the record or record number that caused the problem must be logged, the record ignored, and encoding must continue with the next record. This behavior may be overridden by the user of the application.

The unofficial MIME Content-type for JSON-Base64 is: application/jb64
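A web application serving generated JB64 output directly might announce the unofficial MIME type as in this PHP sketch (the filename and '.jb64' extension are made up for the example):

    <?php
        // Suggest the unofficial JB64 MIME type when serving generated output.
        header("Content-Type: application/jb64");
        header("Content-Disposition: attachment; filename=\"export.jb64\"");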

Decoding JSON-Base64

The informal specification for JB64 decoding is as follows for each record:

Each record contains the exact same number of columns as all of the other records. When a record contains an incorrect number of columns, it is considered to be an error.

A decoder may set a reasonable hard limit on allowed line length if system memory is limited or the application otherwise constrains input size. If an input line exceeds the decoder's limit, it is considered to be an error.

If the end of input occurs before a '\n' character, the remaining input is ignored. It is important, therefore, that trailing newlines are not trimmed from the input by another application (e.g. a text editor) or the last record of input will be lost. An application may log unprocessed data as a possible issue for the user to review.

The very first record contains column information instead of data. See the section on encoding above for the format. The allowed types may be converted to other types as follows:

All types can ultimately be converted to binary data. A decoder may also offer additional reasonable conversion from any type to any other type for improved import flexibility. If a type is unable to be handled or converted for some reason, it is considered to be an error.

Binary and custom types must be processed by replacing '_' with '/' and '-' with '+' before Base64 decoding the content.
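A matching PHP sketch of the reverse transformation might look like this. The function name jb64_decode_binary() is hypothetical; PHP's base64_decode() in its default mode accepts input with the '=' padding removed.

    <?php
        // Hypothetical helper:  reverses the modified Base64 used for 'binary'
        // and 'custom:' column values.
        function jb64_decode_binary($data)
        {
            // Restore '+' and '/' before decoding; the missing '=' padding is tolerated.
            return base64_decode(strtr($data, "-_", "+/"));
        }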

If a value is null, the application may choose to honor the null value if it is supported or convert it to the column type's equivalent empty value. The empty values are defined as follows:

A decoder should first attempt to convert record data to the column format declared by the header and then convert the data to a desired target type from there. Logical conversions of data types may take place during this process to normalize the input. See the official reference implementations for various data normalization and conversion/mapping practices.
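As one possible approach, not taken from the reference implementations, a decoder might normalize each decoded value to its declared column type before any further conversion. Only 'binary' and 'custom:' appear verbatim in this specification; the other type names in the sketch are assumptions.

    <?php
        // Illustrative normalization of a single decoded value to its declared
        // column type.  Type names other than 'binary' and 'custom:' are
        // assumptions for this sketch.
        function jb64_normalize_value($value, $type)
        {
            if ($value === null)  return $value;

            if ($type === "binary" || strncmp($type, "custom:", 7) == 0)  return jb64_decode_binary($value);
            if ($type === "integer")  return (int)$value;
            if ($type === "number")  return (double)$value;
            if ($type === "string")  return (string)$value;

            // Per the specification, a type that cannot be handled or converted
            // is an error.
            throw new Exception("Unable to convert value to type '" . $type . "'.");
        }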

If an error occurs while decoding or processing the header record, the file must be declared corrupt and the application reading the file must not proceed any further. This behavior may be overridden if the user specifies an alternate header record in the application. If an alternate header record is specified by the user, the file's header record is ignored.

If an error occurs while decoding, the record or record number that caused the problem must be logged, the record itself ignored, and decoding must continue with the next record. This behavior may be overridden by the user of the application.
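Putting the decoding rules together, a line-oriented decoder might be sketched as follows. It assumes the same line layout as the encoding sketch above (modified Base64 of a JSON array, an optional '.'-separated MD5 hash of the Base64 data, and a '\n' terminator), and jb64_decode_binary() is the hypothetical helper from the previous examples.

    <?php
        // Illustrative decoder:  splits input into lines, ignores data after the
        // final '\n', and logs and skips damaged records.
        function jb64_decode_data($input, &$errors)
        {
            $errors = array();
            $records = array();

            // Anything after the last '\n' is unprocessed trailing data.
            $pos = strrpos($input, "\n");
            if ($pos === false)  return $records;
            $lines = explode("\n", substr($input, 0, $pos));

            foreach ($lines as $num => $line)
            {
                $line = rtrim($line, "\r");

                // Optional MD5 hash check (the '.' separator is an assumption).
                $pos2 = strpos($line, ".");
                if ($pos2 !== false)
                {
                    $hash = substr($line, $pos2 + 1);
                    $line = substr($line, 0, $pos2);

                    if (md5($line) !== $hash)
                    {
                        $errors[] = "Record " . ($num + 1) . ":  Hash mismatch.";
                        continue;
                    }
                }

                $record = json_decode(jb64_decode_binary($line), true);
                if (!is_array($record))
                {
                    $errors[] = "Record " . ($num + 1) . ":  Invalid JSON.";
                    continue;
                }

                $records[] = $record;
            }

            return $records;
        }

A full implementation would also verify each record's column count against the header record and treat a damaged header record as fatal, per the rules above.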

EBNF Notation

The following is the modified EBNF notation for the file format encoding:

Where the functions are defined as follows:

Not exactly the most complicated EBNF out there, but it might help answer some questions.

Rationale

Every file format and protocol needs a rationale that explains the format's/protocol's eccentricities. What follows are some random thoughts that may be helpful with development of your library.

JSON (RFC4627 and ECMA-404), Base64 (RFC4648), and MD5 (RFC1321) are all well-formed, widely understood, standard, popular, and unencumbered. For those reasons, they were chosen. JSON was also chosen because it has exactly one allowed character encoding for strings: Unicode.

Base64 is modified slightly due to network transport issues. Sending '+' and '/' in certain cases will result in the receiving end mangling the data before processing it. In addition, JSON escapes '/', so the same transformation also slightly reduces the size for the 'binary' and 'custom:' types. The Base64 in use in JSON-Base64 is known as the "URL and Filename Safe" variant from RFC4648.

Null values lead to problems. A good implementation of JSON-Base64 will let the user disallow nulls for a column, which the library should handle as per the decoding specification.

The MD5 hash is optional and is intended to be used to detect errors in a record before decoding it. This is just a sanity check and won't stop malicious intent.

The newline at the end of the Base64 string can be either '\r\n' (carriage return + newline, or CRLF) or '\n' (newline, or LF). The actual end-of-line character is '\n'. What comes before it doesn't really matter, so '\r\n' is simply a courtesy for those using Windows text editors. The official reference implementations use '\r\n'. It is therefore recommended to use '\r\n' to avoid issues with improperly written decoders.
