Bithenge
Bithenge consists of a library and tools that make working with structured binary data easier. The project was started by Sean Bartell as part of Google Summer of Code 2012 to address #317. The code is at lp:~wtachi/helenos/bithenge and periodic updates are posted to HelenOS-devel.
Bithenge is intended to be a general‐purpose library with a variety of uses, but only one tool has been implemented so far. It is a program that reads a script file describing a binary format, applies it to binary data, and prints the decoded result.
Quick start
The Bithenge source code is in uspace/app/bithenge
and uspace/lib/bithenge
in the bithenge branch. It is automatically built in HelenOS to create the
bithenge
application. For Linux, it can be built by running
make -f Makefile.linux
, first in uspace/lib/bithenge
and then in
uspace/app/bithenge
.
The program can be run with bithenge <script file> <binary file>
to decode a
file. There are some example files in
uspace/dist/src/bithenge;
see the syntax documentation.
Overview
Trees
Bithenge represents all data in the form of a data structure called a “tree,” similar to the data structure represented by JSON. A tree consists of a boolean node, integer node, string node, or blob node, or an internal node with children. A boolean node holds a boolean value, an integer node holds a signed integer, and a string holds a Unicode string.
A blob node represents an arbitrary sequence of raw bytes or bits. Blob nodes are polymorphic, allowing any source of raw binary data to be used. Bithenge includes blob node implementations for in‐memory buffers, files, and block devices. An implementation has also been written that reads another task’s virtual memory, but it hasn’t been committed because it’s unclear whether it will be useful in its current form.
An internal node has an arbitrary number of children, each with a unique key. The key can be any node other than an internal node. Arrays can be represented by internal nodes with integer keys starting at 0. The tree node can provide children in an arbitrary order; the order will be used when displaying the tree, but should have no semantic significance. Internal nodes are polymorphic and can delay creation of child nodes until necessary, so keeping the whole tree in memory can be avoided.
Internal nodes are currently responsible for freeing their own children. In the future, it should be possible for there to be multiple references to the same node, but it isn’t clear whether this should be implemented with symbolic links, an acyclic graph with reference counting, or a full graph.
Note that all interpreted data is represented in Bithenge with nodes. Therefore, the word “blob” usually refers to a blob node, and so on.
A decoded tree for a FAT filesystem might look something like this:
○───bits─▶16 │ ├───fat──▶○ │ ├───0───▶0xfff0 │ ├───1───▶0xffff │ └───2───▶0x0000 │ └───root──▶○ ├───0───▶○ │ ├───name───▶README.TXT │ └───size───▶0x1351 │ └───1───▶○ ├───name───▶KERNEL.ELF └───size───▶0x38e9a2
Transforms
A transform is a function from an input tree and zero or more argument trees to
an output tree. One example is uint32le
, which takes a 4‐byte blob node as
the input tree and provides an integer node as the output tree. Another example
is uint_le(len)
, which takes a bit blob node as the input tree and an integer
node as an argument tree, and produces an integer node as the output tree.
Several transforms, including uint32le
and uint_le(len)
, are built in to
Bithenge. More complicated transforms can be created by a format provider, as
explained below. For instance, the
fat.bh
file contains a fat_filesystem
transform that takes a byte node as the input
tree and produces a complex output tree with various decoded information about
the filesystem.
Data providers
Generally speaking, a data provider is anything that provides data from outside Bithenge in the form of a tree. The Bithenge library includes functions for various data providers, and can automatically create a tree from a prefixed string:
Prefix | Tree Type | Example | Description |
---|---|---|---|
file: | Byte blob node | file:/textdemo | Read the contents of a file. This is the default if no prefix is used. |
block: | Byte blob node | block:bd/initrd | Read the contents of a block device. (HelenOS only.) |
hex: | Byte blob node | hex:01000000 | Use a string of hexadecimal characters to create a blob node. |
These prefixes can also be used with the bithenge
program.
Format providers
A format provider creates transforms based on external data, or vice versa. The only implemented format provider is the script parser, which creates a transform from a file written in the Bithenge scripting language. A future idea is the DWARF format provider, which would read DWARF information from an executable and create a transform that could be applied to a task memory dump.
Library
The Bithenge library includes APIs for working with trees and creating and applying transforms. It also contains the built‐in transforms, data providers, and format providers. The library overview describes the API and conventions used; detailed documentation is included in Doxygen comments in the source.
Clients
A client is any program that provides a user interface to the Bithenge library.
The only implemented client is the simple bithenge
program described above.
In the future, there could be an interactive browser for decoded trees, or a
program to generate C code based on a script file.
New design
I am currently working on a new constraint‐based design, where the format specification consists of statements that must be true about the raw and interpreted data, and the program figures out how to solve these constraints. This design should make a number of things possible, particularly editing and bidirectional encoding/decoding. As of December 2012, I have a proof‐of‐concept that generates Python code to encode or decode a trivial structure, and I’m working on improving it.
More information
- Bithenge/Syntax: the syntax of Bithenge scripts.
- Bithenge/Library: the library API, conventions, and internals.
- Bithenge/Design: design decisions, limitations, ideas, and inspirations.