As part of [wiki:GSOC Google Summer of Code 2012], Bithenge is being created to
address #317. This page describes the project’s design and implementation.
The code is at
[https://code.launchpad.net/~wtachi/helenos/bithenge lp:~wtachi/helenos/bithenge]
and periodic updates are posted to
[http://lists.modry.cz/cgi-bin/listinfo/helenos-devel HelenOS-devel].

== Overview ==

Exploring and working with structured binary data is necessary in many
different situations in a project like HelenOS. For instance, when implementing
a file format or filesystem, it is first necessary to explore preexisting files
and disks and learn the low‐level details of the format. Debugging compiled
programs, working with core dumps, and exploring network protocols also require
some way of interpreting binary data.

The most basic tool for exploring binary data is the hex editor. Using a hex
editor is inefficient and unpleasant because it requires manual calculation of
lengths and offsets while constantly referring back to the data format.
General‐purpose scripting languages can be used instead, so a structure can be
defined once and decoded as often as necessary. However, even with useful tools
like Python’s struct module, the programmer must specify how to read the input
data, calculate lengths and offsets, and provide useful output, so there’s much
more work involved than simply specifying the format of the data. This extra
code will probably be rewritten every time a new script is made, due to
slightly differing requirements.
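To illustrate the overhead described above, here is a minimal sketch of decoding a hypothetical two‐field record with Python’s struct module; the record layout (a 4‐byte little‐endian id followed by an 11‐byte space‐padded ASCII name) is invented for the example:

```python
import struct

# Hypothetical on-disk record: a 4-byte little-endian id followed by an
# 11-byte space-padded ASCII name. Every script must repeat this kind of
# offset arithmetic and output formatting by hand.
data = bytes.fromhex("06000000") + b"README TXT "

offset = 0
(record_id,) = struct.unpack_from("<I", data, offset)
offset += 4                       # advance past the id field manually
name = data[offset:offset + 11].decode("ascii").rstrip(" ")
offset += 11

print({"id": record_id, "name": name})
```

Even this tiny format needs hand‐maintained offsets; any change to the layout means revisiting every line.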

The Bithenge project involves creating a powerful library and tools that will
make working with structured binary data faster and easier. It will consist of:

 * A core library that manages structured data and provides basic building
   blocks for binary data interpretation.
 * Data providers to access various sources of raw binary data.
 * Format providers, which can load and save complex format specifications. In
   particular, there will be a domain‐specific language for format
   specifications.
 * Clients, programs which use the library to work with binary data. For
   instance, there will be an interactive browser.

The initial goals for the project are an interactive browser for filesystem
structures and a debugger backend that can interpret core dumps and task
memory.

== Trees ==

Bithenge represents all data in the form of a data structure called a “tree,”
similar to the data structure used by JSON. A tree consists of a boolean node,
integer node, string node, or blob node, or an internal node with children. A
boolean node holds a boolean value, an integer node holds a signed integer, and
a string node holds a Unicode string.

A blob node represents an arbitrary sequence of raw bytes. Blob nodes are
polymorphic, allowing any source of raw binary data to be used. Bithenge
includes blob node implementations for in‐memory buffers, files, and block
devices. An implementation has been written that reads another task’s virtual
memory, but it hasn’t been committed because it’s unclear whether it will be
useful.

An internal node has an arbitrary number of children, each associated with a
unique key. The key can be any node other than an internal node. An array can
be represented by an internal node with integer keys starting at 0. An internal
node can provide its children in an arbitrary order; the order will be used
when displaying the tree, but has no semantic significance. Internal nodes
are polymorphic and can delay creation of child nodes until necessary, so
keeping the whole tree in memory can be avoided.
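The lazy‐child idea can be sketched in Python (all names here are illustrative, not Bithenge’s actual C API): an internal node maps unique keys to children and only materializes a child when it is first requested.

```python
# Minimal sketch of the node polymorphism described above (names are
# illustrative, not Bithenge's actual C API). An internal node maps unique
# keys to children and may create them lazily instead of holding the whole
# tree in memory at once.
class BlobNode:
    def __init__(self, data: bytes):
        self.data = data

class LazyInternalNode:
    def __init__(self, keys, make_child):
        self._keys = list(keys)        # arbitrary, semantically meaningless order
        self._make_child = make_child  # called only when a child is first needed
        self._cache = {}

    def keys(self):
        return list(self._keys)

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._make_child(key)
        return self._cache[key]

# A directory whose entries are decoded only on demand:
root = LazyInternalNode([0, 1], lambda i: BlobNode(b"entry %d" % i))
print(root.get(0).data)   # only entry 0 is materialized
```

A node fetched twice comes from the cache, so repeated traversal does not repeat the decoding work.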

Internal nodes are currently responsible for freeing their own children. In the
future, it should be possible for there to be multiple references to the same
node, but it isn’t clear whether this should be implemented with symbolic
links, an acyclic graph with reference counting, or a full graph.

Note that all interpreted data is represented in Bithenge with nodes.
Therefore, the word “blob” usually refers to a blob node, and so on.
| 92 | |
| 93 | {{{ |
| 94 | |
| 95 | ○───bits─▶16 |
| 96 | │ |
| 97 | ├───fat──▶○ |
| 98 | │ ├───0───▶0xfff0 |
| 99 | │ ├───1───▶0xffff |
| 100 | │ └───2───▶0x0000 |
| 101 | │ |
| 102 | └───root──▶○ |
| 103 | ├───0───▶○ |
| 104 | │ ├───name───▶README.TXT |
| 105 | │ └───size───▶0x1351 |
| 106 | │ |
| 107 | └───1───▶○ |
| 108 | ├───name───▶KERNEL.ELF |
| 109 | └───size───▶0x38e9a2 |
| 110 | }}} |

== Programs ==

The only program currently available is a simple test that prints some trees as
JSON and Python literals.

== Transforms ==

A transform is a function from a tree to a tree. One example is `uint32le`,
which takes a 4‐byte blob node as the input tree and provides an integer node
as the output tree. Another example would be `FAT16_filesystem`, a transform
that takes a blob node as the input tree and provides a complex output tree
with various decoded information about the filesystem. Some transforms, like
`uint32le`, are built into Bithenge; more complicated transforms can be loaded
from a script file.

Transforms are represented in Bithenge with a polymorphic object. The primary
method is `apply`, which applies a transform to an input tree and creates an
output tree. When a transform takes a blob node as input, it is sometimes
necessary to determine the prefix of a given blob that can be used as input to
the transform; the method `prefix_length` can be used for this.

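The interface can be modeled in Python as follows (a sketch only; the real interface is a polymorphic C object, and the method names are taken from the description above):

```python
import struct

# Illustrative model of the transform interface described above. `apply`
# maps an input tree to an output tree; `prefix_length` reports how many
# bytes of a given blob the transform would consume.
class Uint32Le:
    def apply(self, blob: bytes) -> int:
        (value,) = struct.unpack("<I", blob[:4])
        return value

    def prefix_length(self, blob: bytes) -> int:
        return 4  # always consumes exactly four bytes

u32 = Uint32Le()
print(u32.apply(bytes.fromhex("01010000")))  # → 257
```

For fixed‐size transforms like this one `prefix_length` is a constant, but for something like a zero‐terminated string it would have to scan the blob.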
== Built‐in transforms ==

These transforms are implemented in C and included with Bithenge. Note that
fully specific names are preferred; scripts can define shorter aliases if
necessary.

||= name =||= input =||= output =||= description =||= example =||
||uint32le ||4‐byte blob node ||integer node ||decodes a 4‐byte little‐endian unsigned integer || `x"01010000"` becomes `257` ||
||zero_terminated ||blob node ||blob node ||takes bytes up until the first `00` || `x"7f0400"` becomes `x"7f04"` ||
||ascii ||blob node ||string node ||decodes bytes as ASCII characters || `x"6869"` becomes `"hi"` ||
||padded_with_spaces_at_end ||string node ||string node ||removes spaces from the end of a string || `"README "` becomes `"README"` ||

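Rough Python equivalents of the table’s transforms, useful for checking the examples (these are stand‐ins, not the actual C implementations):

```python
import struct

# Rough Python equivalents of the built-in transforms in the table above.
def uint32le(blob: bytes) -> int:
    (value,) = struct.unpack("<I", blob)
    return value

def zero_terminated(blob: bytes) -> bytes:
    return blob.split(b"\x00", 1)[0]   # bytes before the first 00

def ascii_(blob: bytes) -> str:
    return blob.decode("ascii")

def padded_with_spaces_at_end(s: str) -> str:
    return s.rstrip(" ")

print(uint32le(bytes.fromhex("01010000")))        # → 257
print(zero_terminated(bytes.fromhex("7f0400")))   # → b'\x7f\x04'
print(ascii_(bytes.fromhex("6869")))              # → 'hi'
print(padded_with_spaces_at_end("README "))       # → 'README'
```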
== Basic syntax ==

Script files are used to define new transforms.

Transforms (including built‐in transforms) can be referenced by name:
`uint32le`.

Transforms can be given a new name: `transform u32 = uint32le;` defines a
shorter alias for `uint32le`.

Transforms can be composed to create a new transform that applies them in
order. The transform `padded_with_spaces_at_end . ascii . zero_terminated`
first takes the bytes before the terminating `00`, then decodes them as ASCII
and removes the spaces at the end. Note that the order of composition is
consistent with function composition and nested application in mathematics,
and also consistent with the general idea that data moves from right to left
as it is decoded.
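The right‐to‐left order can be sketched in Python (the helper names mirror the built‐in transforms; `compose` is an illustration, not part of Bithenge):

```python
# Sketch of right-to-left composition as described above: the transform
# written rightmost is applied first, matching mathematical function
# composition.
def compose(*transforms):
    def composed(value):
        for t in reversed(transforms):  # rightmost transform runs first
            value = t(value)
        return value
    return composed

# Stand-ins for the built-in transforms:
def zero_terminated(blob): return blob.split(b"\x00", 1)[0]
def ascii_(blob): return blob.decode("ascii")
def padded_with_spaces_at_end(s): return s.rstrip(" ")

# padded_with_spaces_at_end . ascii . zero_terminated
decode_name = compose(padded_with_spaces_at_end, ascii_, zero_terminated)
print(decode_name(b"README \x00junk"))  # → 'README'
```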

== Structs ==

Structs are used when a blob contains multiple data fields in sequence. A
struct transform applies each subtransform to sequential parts of the blob and
combines the results to create an internal node. The result of each
subtransform is either assigned a key or has its keys and values merged into
the final internal node. Each subtransform must support `prefix_length`, so the
lengths and positions of the data fields can be determined.

=== Example ===

{{{
transform point = struct {
    .x = uint32le;
    .y = uint32le;
};

transform labeled_point = struct {
    .id = uint32le;
    .label = ascii . zero_terminated;
    point;
};
}}}

If `labeled_point` is applied to `x"06000000 4100 03000000 08000000"`, the
result is `{"id": 6, "label": "A", "x": 3, "y": 8}`.
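The same decoding written by hand with Python’s struct module confirms the example (an independent sketch, not output generated by Bithenge):

```python
import struct

# Hand-decoding of x"06000000 4100 03000000 08000000" as a labeled_point.
data = bytes.fromhex("06000000" "4100" "03000000" "08000000")

(label_id,) = struct.unpack_from("<I", data, 0)   # .id = uint32le
label = data[4:data.index(0, 4)].decode("ascii")  # .label = ascii . zero_terminated
x, y = struct.unpack_from("<II", data, 6)         # merged point struct
print({"id": label_id, "label": label, "x": x, "y": y})
```

Note that the `zero_terminated` field consumes its terminating `00` byte, which is why the merged `point` fields start at offset 6.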

== Future features ==

 * Parameters for transforms
   * Keyword parameters only?
 * Expressions depending on previously decoded values
 * Enumerations
 * Variables
 * Transforming internal nodes
 * Assertions
   * Transforms that return their input
   * Different levels (expected, required, mandatory)
 * Error handling
 * Hidden fields
 * Iteration/recursion/repetition
 * Seeking and detecting position
   * Checking alignment
 * Reference to structures at other offsets
   * How to know what blob to go within?
   * How to know current offset within that blob?
     * Could be relative to multiple things at once...
   * Blob node can be an inherited parameter
     * This is also useful for endianness
   * Offset could be an automatically incremented parameter
 * Ad hoc tweaks at runtime

=== Constraint‐based version ===

This and most other projects use an imperative design, where the format
specification is always used in a fixed order, one step at a time. The
imperative design causes problems when the user wants to modify a field,
because arbitrary changes to other fields may be necessary that cannot be
determined from the format specification.

It may be possible to solve this with a constraint-based design, where the
format specification consists of statements that must be true about the raw and
interpreted data, and the program figures out how to solve these constraints.
Unfortunately, this approach seems too open-ended and unpredictable to fit
within GSoC.