JSON-Base64
Standards-based record file format.  Streaming-capable, public domain.

Specification

What follows is the technical specification of the JSON-Base64 file format. This page is intended for software developers building a library to support JSON-Base64 in their applications. The Overview page has a more general-purpose overview of the JB64 file format.

As you might guess by the name, JSON-Base64 requires a JSON library and a slightly modified Base64 encoder/decoder. An MD5 implementation can be used to check input data for errors. Nearly all programming languages either support these components natively via built-in function calls or have readily available drop-in libraries for the required components.

There are two official reference implementations of JSON-Base64 in PHP and C# that include both a library and a test suite for the library. The official reference implementations can be found on the Implementations page and are open source software under an MIT or LGPL license of your choice. The reference implementations are meant primarily to offer a few ideas. However, each programming language is different and users of that language have certain ways they prefer to do things. Therefore, it is important to cater to users of a JB64 library by balancing flexibility with ease-of-use and maintainability.

Encoding JSON-Base64

The informal specification for JB64 encoding is as follows for each record:

Each record contains the exact same number of columns as all of the other records. When a record contains an incorrect number of columns, it is considered to be an error.

The very first record must contain column information instead of data. The first record is an array containing one two-entry array per column, so the number of inner arrays matches the number of columns. The first entry is a string containing the name of the column and is intended to be used as a column header. The second entry is a string containing the data type of the column. The following types are allowed:

Columns in data records must be prepared according to the column definition. Entries in the array for the record are always specified in the same order as the column definitions in the header record. The most unusual preparation is for binary columns: the value must be Base64 encoded and the output then modified by replacing '+' with '-', replacing '/' with '_', and removing trailing '=' bytes.
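For illustration, a minimal PHP sketch of this binary column preparation might look like the following. The function name jb64_encode_binary() is hypothetical and not taken from the reference implementations.

    <?php
        // Hypothetical helper:  applies the modified Base64 encoding used for
        // 'binary' (and 'custom:') column values.
        function jb64_encode_binary($data)
        {
            $result = base64_encode($data);

            // Replace '+' with '-' and '/' with '_', then strip trailing '=' padding.
            return rtrim(strtr($result, "+/", "-_"), "=");
        }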

Null values may be stored if the encoding application supports them. For maximum portability, null values should be avoided and the empty value for the column used instead. See the decoding section below for the empty value mapping.

An encoder should attempt to store record data in the column format declared by the header. Logical conversions of data types may take place during this process to normalize the output. See the official reference implementations for various data normalization practices.
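The following sketch pulls the encoding steps together. It assumes each output line is the modified Base64 encoding of the JSON-encoded record array, optionally followed by an MD5 hash of that Base64 data and a '\r\n' terminator; the '.' separator between the Base64 data and the hash is an assumption made for this sketch, and the authoritative line layout is given by the EBNF notation below. jb64_encode_record() is a hypothetical name and jb64_encode_binary() is the hypothetical helper from the previous example.

    <?php
        // Illustrative sketch of encoding one record (header or data) to a
        // single JB64 line, under the assumptions described above.
        function jb64_encode_record(array $record, $addhash = true)
        {
            $data = jb64_encode_binary(json_encode($record));

            // The optional MD5 hash lets a decoder detect damaged lines.
            // (The '.' separator is an assumption for this sketch.)
            if ($addhash)  $data .= "." . md5($data);

            return $data . "\r\n";
        }

        // Example:  a header record followed by one data record.
        echo jb64_encode_record(array(array("id", "integer"), array("name", "string")));
        echo jb64_encode_record(array(1, "JSON-Base64"));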

If an error occurs while encoding, the record or record number that caused the problem must be logged, the record ignored, and encoding must continue with the next record. This behavior may be overridden by the user of the application.

The unofficial MIME Content-type for JSON-Base64 is: application/jb64
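A web application serving generated JB64 output directly might announce the unofficial MIME type as in this PHP sketch (the filename and '.jb64' extension are made up for the example):

    <?php
        // Suggest the unofficial JB64 MIME type when serving generated output.
        header("Content-Type: application/jb64");
        header("Content-Disposition: attachment; filename=\"export.jb64\"");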

Decoding JSON-Base64

The informal specification for JB64 decoding is as follows for each record:

Each record contains the exact same number of columns as all of the other records. When a record contains an incorrect number of columns, it is considered to be an error.

A decoder may set a reasonable hard limit on allowed line length if system memory is limited or the application otherwise constrains input size. If an input line exceeds the decoder's limit, it is considered to be an error.

If the end of input occurs before a '\n' character, the remaining input is ignored. It is important, therefore, that trailing newlines are not trimmed from the input by another application (e.g. a text editor) or the last record of input will be lost. An application may log unprocessed data as a possible issue for the user to review.

The very first record contains column information instead of data. See the section on encoding above for the format. The allowed types may be converted to other types as follows:

All types can ultimately be converted to binary data. A decoder may also offer additional reasonable conversion from any type to any other type for improved import flexibility. If a type is unable to be handled or converted for some reason, it is considered to be an error.

Binary and custom types must be processed by replacing '_' with '/' and '-' with '+' before Base64 decoding the content.
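A matching PHP sketch of the reverse transformation might look like this. The function name jb64_decode_binary() is hypothetical; PHP's base64_decode() in its default mode accepts input with the '=' padding removed.

    <?php
        // Hypothetical helper:  reverses the modified Base64 used for 'binary'
        // and 'custom:' column values.
        function jb64_decode_binary($data)
        {
            // Restore '+' and '/' before decoding; the missing '=' padding is tolerated.
            return base64_decode(strtr($data, "-_", "+/"));
        }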

If a value is null, the application may choose to honor the null value if it is supported or convert it to the column type's equivalent empty value. The empty values are defined as follows:

A decoder should first attempt to convert record data to the column format declared by the header and then convert the data to a desired target type from there. Logical conversions of data types may take place during this process to normalize the input. See the official reference implementations for various data normalization and conversion/mapping practices.
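As one possible approach, not taken from the reference implementations, a decoder might normalize each decoded value to its declared column type before any further conversion. Only 'binary' and 'custom:' appear verbatim in this specification; the other type names in the sketch are assumptions.

    <?php
        // Illustrative normalization of a single decoded value to its declared
        // column type.  Type names other than 'binary' and 'custom:' are
        // assumptions for this sketch.
        function jb64_normalize_value($value, $type)
        {
            if ($value === null)  return $value;

            if ($type === "binary" || strncmp($type, "custom:", 7) == 0)  return jb64_decode_binary($value);
            if ($type === "integer")  return (int)$value;
            if ($type === "number")  return (double)$value;
            if ($type === "string")  return (string)$value;

            // Per the specification, a type that cannot be handled or converted
            // is an error.
            throw new Exception("Unable to convert value to type '" . $type . "'.");
        }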

If an error occurs while decoding or processing the header record, the file must be declared corrupt and the application reading the file must not proceed any further. This behavior may be overridden if the user specifies an alternate header record in the application. If an alternate header record is specified by the user, the file's header record is ignored.

If an error occurs while decoding, the record or record number that caused the problem must be logged, the record itself ignored, and decoding must continue with the next record. This behavior may be overridden by the user of the application.
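Putting the decoding rules together, a line-oriented decoder might be sketched as follows. It assumes the same line layout as the encoding sketch above (modified Base64 of a JSON array, an optional '.'-separated MD5 hash of the Base64 data, and a '\n' terminator), and jb64_decode_binary() is the hypothetical helper from the previous examples.

    <?php
        // Illustrative decoder:  splits input into lines, ignores data after the
        // final '\n', and logs and skips damaged records.
        function jb64_decode_data($input, &$errors)
        {
            $errors = array();
            $records = array();

            // Anything after the last '\n' is unprocessed trailing data.
            $pos = strrpos($input, "\n");
            if ($pos === false)  return $records;
            $lines = explode("\n", substr($input, 0, $pos));

            foreach ($lines as $num => $line)
            {
                $line = rtrim($line, "\r");

                // Optional MD5 hash check (the '.' separator is an assumption).
                $pos2 = strpos($line, ".");
                if ($pos2 !== false)
                {
                    $hash = substr($line, $pos2 + 1);
                    $line = substr($line, 0, $pos2);

                    if (md5($line) !== $hash)
                    {
                        $errors[] = "Record " . ($num + 1) . ":  Hash mismatch.";
                        continue;
                    }
                }

                $record = json_decode(jb64_decode_binary($line), true);
                if (!is_array($record))
                {
                    $errors[] = "Record " . ($num + 1) . ":  Invalid JSON.";
                    continue;
                }

                $records[] = $record;
            }

            return $records;
        }

A full implementation would also verify each record's column count against the header record and treat a damaged header record as fatal, per the rules above.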

EBNF Notation

The following is the modified EBNF notation for the file format encoding:

Where the functions are defined as follows:

Not exactly the most complicated EBNF out there, but it might help answer some questions.

Rationale

Every file format and protocol needs a rationale that explains the format's/protocol's eccentricities. What follows are some random thoughts that may be helpful with development of your library.

JSON (RFC4627 and ECMA-404), Base64 (RFC4648), and MD5 (RFC1321) are all well-formed, widely understood, standard, popular, and unencumbered. For those reasons, they were chosen. JSON was also chosen because it has exactly one allowed character encoding for strings: Unicode.

Base64 is modified slightly due to network transport issues. Sending '+' and '/' in certain cases will result in the receiving end mangling the data before processing it. In addition, JSON escapes '/', so the same transformation also slightly reduces the size for the 'binary' and 'custom:' types. The Base64 in use in JSON-Base64 is known as the "URL and Filename Safe" variant from RFC4648.

Null values lead to problems. A good implementation of JSON-Base64 will let the user disallow nulls for a column, which the library should handle as per the decoding specification.

The MD5 hash is optional and is intended to be used to detect errors in a record before decoding it. This is just a sanity check and won't stop malicious intent.

The newline at the end of the Base64 string can be either '\r\n' (carriage return + newline, or CRLF) or '\n' (newline, or LF). The actual end-of-line character is '\n'. What comes before it doesn't really matter, so '\r\n' is simply a courtesy for those using Windows text editors. The official reference implementations use '\r\n'. It is therefore recommended to use '\r\n' to avoid issues with improperly written decoders.
