As part of [wiki:GSOC Google Summer of Code 2012], Bithenge is being created to
address #317. This page describes the project’s design and implementation.
The code is at
[https://code.launchpad.net/~wtachi/helenos/bithenge lp:~wtachi/helenos/bithenge]
and periodic updates are posted to
[http://lists.modry.cz/cgi-bin/listinfo/helenos-devel HelenOS-devel].

== Overview ==

Exploring and working with structured binary data is necessary in many
different situations in a project like HelenOS. For instance, when implementing
a file format or filesystem, it is first necessary to explore preexisting files
and disks and learn the low‐level details of the format. Debugging compiled
programs, working with core dumps, and exploring network protocols also require
some way of interpreting binary data.

The most basic tool for exploring binary data is the hex editor. Using a hex
editor is inefficient and unpleasant because it requires manual calculation of
lengths and offsets while constantly referring back to the data format.
General‐purpose scripting languages can be used instead, so a structure can be
defined once and decoded as often as necessary. However, even with useful tools
like Python’s struct module, the programmer must specify how to read the input
data, calculate lengths and offsets, and provide useful output, so there’s much
more work involved than simply specifying the format of the data. This extra
code will probably be rewritten every time a new script is made, due to
slightly differing requirements.
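To illustrate the overhead described above, here is a minimal sketch of decoding a hypothetical two‐field record with Python’s struct module; the record layout (a 4‐byte little‐endian id followed by an 11‐byte space‐padded ASCII name) is invented for the example:

```python
import struct

# Hypothetical on-disk record: a 4-byte little-endian id followed by an
# 11-byte space-padded ASCII name. Every script must repeat this kind of
# offset arithmetic and output formatting by hand.
data = bytes.fromhex("06000000") + b"README TXT "

offset = 0
(record_id,) = struct.unpack_from("<I", data, offset)
offset += 4                       # advance past the id field manually
name = data[offset:offset + 11].decode("ascii").rstrip(" ")
offset += 11

print({"id": record_id, "name": name})
```

Even this tiny format needs hand‐maintained offsets; any change to the layout means revisiting every line.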

The Bithenge project involves creating a powerful library and tools that will
make working with structured binary data faster and easier. It will consist of:

 * A core library that manages structured data and provides basic building
   blocks for binary data interpretation.
 * Data providers to access various sources of raw binary data.
 * Format providers, which can load and save complex format specifications. In
   particular, there will be a domain‐specific language for format
   specifications.
 * Clients, programs which use the library to work with binary data. For
   instance, there will be an interactive browser.

The initial goals for the project are an interactive browser for filesystem
structures and a debugger backend that can interpret core dumps and task
memory.

== Trees ==

Bithenge represents all data in the form of a data structure called a “tree,”
similar to the data structure used by JSON. A tree consists of a boolean node,
integer node, string node, or blob node, or an internal node with children. A
boolean node holds a boolean value, an integer node holds a signed integer, and
a string node holds a Unicode string.

A blob node represents an arbitrary sequence of raw bytes. Blob nodes are
polymorphic, allowing any source of raw binary data to be used. Bithenge
includes blob node implementations for in‐memory buffers, files, and block
devices. An implementation has been written that reads another task’s virtual
memory, but it hasn’t been committed because it’s unclear whether it will be
useful.

An internal node has an arbitrary number of children, each associated with a
unique key. The key can be any node other than an internal node. An array can
be represented by an internal node with integer keys starting at 0. An internal
node can provide its children in an arbitrary order; the order will be used
when displaying the tree, but has no semantic significance. Internal nodes
are polymorphic and can delay creation of child nodes until necessary, so
keeping the whole tree in memory can be avoided.
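The lazy‐child idea can be sketched in Python (all names here are illustrative, not Bithenge’s actual C API): an internal node maps unique keys to children and only materializes a child when it is first requested.

```python
# Minimal sketch of the node polymorphism described above (names are
# illustrative, not Bithenge's actual C API). An internal node maps unique
# keys to children and may create them lazily instead of holding the whole
# tree in memory at once.
class BlobNode:
    def __init__(self, data: bytes):
        self.data = data

class LazyInternalNode:
    def __init__(self, keys, make_child):
        self._keys = list(keys)        # arbitrary, semantically meaningless order
        self._make_child = make_child  # called only when a child is first needed
        self._cache = {}

    def keys(self):
        return list(self._keys)

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._make_child(key)
        return self._cache[key]

# A directory whose entries are decoded only on demand:
root = LazyInternalNode([0, 1], lambda i: BlobNode(b"entry %d" % i))
print(root.get(0).data)   # only entry 0 is materialized
```

A node fetched twice comes from the cache, so repeated traversal does not repeat the decoding work.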

Internal nodes are currently responsible for freeing their own children. In the
future, it should be possible for there to be multiple references to the same
node, but it isn’t clear whether this should be implemented with symbolic
links, an acyclic graph with reference counting, or a full graph.

Note that all interpreted data is represented in Bithenge with nodes.
Therefore, the word “blob” usually refers to a blob node, and so on.
| 92 | |
| 93 | {{{ |
| 94 | |
| 95 | ○───bits─▶16 |
| 96 | │ |
| 97 | ├───fat──▶○ |
| 98 | │ ├───0───▶0xfff0 |
| 99 | │ ├───1───▶0xffff |
| 100 | │ └───2───▶0x0000 |
| 101 | │ |
| 102 | └───root──▶○ |
| 103 | ├───0───▶○ |
| 104 | │ ├───name───▶README.TXT |
| 105 | │ └───size───▶0x1351 |
| 106 | │ |
| 107 | └───1───▶○ |
| 108 | ├───name───▶KERNEL.ELF |
| 109 | └───size───▶0x38e9a2 |
| 110 | }}} |

== Programs ==

The only program currently available is a simple test that prints some trees as
JSON and Python literals.

== Transforms ==

A transform is a function from a tree to a tree. One example is `uint32le`,
which takes a 4‐byte blob node as the input tree and provides an integer node
as the output tree. Another example would be `FAT16_filesystem`, a transform
that takes a blob node as the input tree and provides a complex output tree
with various decoded information about the filesystem. Some transforms, like
`uint32le`, are built into Bithenge; more complicated transforms can be loaded
from a script file.

Transforms are represented in Bithenge with a polymorphic object. The primary
method is `apply`, which applies a transform to an input tree and creates an
output tree. When a transform takes a blob node as input, it is sometimes
necessary to determine the prefix of a given blob that can be used as input to
the transform; the method `prefix_length` can be used for this.

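The interface can be modeled in Python as follows (a sketch only; the real interface is a polymorphic C object, and the method names are taken from the description above):

```python
import struct

# Illustrative model of the transform interface described above. `apply`
# maps an input tree to an output tree; `prefix_length` reports how many
# bytes of a given blob the transform would consume.
class Uint32Le:
    def apply(self, blob: bytes) -> int:
        (value,) = struct.unpack("<I", blob[:4])
        return value

    def prefix_length(self, blob: bytes) -> int:
        return 4  # always consumes exactly four bytes

u32 = Uint32Le()
print(u32.apply(bytes.fromhex("01010000")))  # → 257
```

For fixed‐size transforms like this one `prefix_length` is a constant, but for something like a zero‐terminated string it would have to scan the blob.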
== Built‐in transforms ==

These transforms are implemented in C and included with Bithenge. Note that
fully specific names are preferred; scripts can define shorter aliases if
necessary.

||= name =||= input =||= output =||= description =||= example =||
||uint32le ||4‐byte blob node ||integer node ||decodes a 4‐byte little‐endian unsigned integer || `x"01010000"` becomes `257` ||
||zero_terminated ||blob node ||blob node ||takes bytes up until the first `00` || `x"7f0400"` becomes `x"7f04"` ||
||ascii ||blob node ||string node ||decodes bytes as ASCII characters || `x"6869"` becomes `"hi"` ||
||padded_with_spaces_at_end ||string node ||string node ||removes spaces from the end of a string || `"README "` becomes `"README"` ||

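Rough Python equivalents of the table’s transforms, useful for checking the examples (these are stand‐ins, not the actual C implementations):

```python
import struct

# Rough Python equivalents of the built-in transforms in the table above.
def uint32le(blob: bytes) -> int:
    (value,) = struct.unpack("<I", blob)
    return value

def zero_terminated(blob: bytes) -> bytes:
    return blob.split(b"\x00", 1)[0]   # bytes before the first 00

def ascii_(blob: bytes) -> str:
    return blob.decode("ascii")

def padded_with_spaces_at_end(s: str) -> str:
    return s.rstrip(" ")

print(uint32le(bytes.fromhex("01010000")))        # → 257
print(zero_terminated(bytes.fromhex("7f0400")))   # → b'\x7f\x04'
print(ascii_(bytes.fromhex("6869")))              # → 'hi'
print(padded_with_spaces_at_end("README "))       # → 'README'
```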
== Basic syntax ==

Script files are used to define new transforms.

Transforms (including built‐in transforms) can be referenced by name:
`uint32le`.

Transforms can be given a new name: `transform u32 = uint32le;` defines a
shorter alias for `uint32le`.

Transforms can be composed to create a new transform that applies them in
order. The transform `padded_with_spaces_at_end . ascii . zero_terminated`
first takes the bytes before the terminating `00`, then decodes them as ASCII
and removes the spaces at the end. Note that the order of composition is
consistent with function composition and nested application in mathematics,
and also consistent with the general idea that data moves from right to left
as it is decoded.
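The right‐to‐left order can be sketched in Python (the helper names mirror the built‐in transforms; `compose` is an illustration, not part of Bithenge):

```python
# Sketch of right-to-left composition as described above: the transform
# written rightmost is applied first, matching mathematical function
# composition.
def compose(*transforms):
    def composed(value):
        for t in reversed(transforms):  # rightmost transform runs first
            value = t(value)
        return value
    return composed

# Stand-ins for the built-in transforms:
def zero_terminated(blob): return blob.split(b"\x00", 1)[0]
def ascii_(blob): return blob.decode("ascii")
def padded_with_spaces_at_end(s): return s.rstrip(" ")

# padded_with_spaces_at_end . ascii . zero_terminated
decode_name = compose(padded_with_spaces_at_end, ascii_, zero_terminated)
print(decode_name(b"README \x00junk"))  # → 'README'
```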

== Structs ==

Structs are used when a blob contains multiple data fields in sequence. A
struct transform applies each subtransform to sequential parts of the blob and
combines the results to create an internal node. The result of each
subtransform is either assigned a key or has its keys and values merged into
the final internal node. Each subtransform must support `prefix_length`, so the
lengths and positions of the data fields can be determined.

=== Example ===

{{{
transform point = struct {
    .x = uint32le;
    .y = uint32le;
};

transform labeled_point = struct {
    .id = uint32le;
    .label = ascii . zero_terminated;
    point;
};
}}}

If `labeled_point` is applied to `x"06000000 4100 03000000 08000000"`, the
result is `{"id": 6, "label": "A", "x": 3, "y": 8}`.
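The same decoding written by hand with Python’s struct module confirms the example (an independent sketch, not output generated by Bithenge):

```python
import struct

# Hand-decoding of x"06000000 4100 03000000 08000000" as a labeled_point.
data = bytes.fromhex("06000000" "4100" "03000000" "08000000")

(label_id,) = struct.unpack_from("<I", data, 0)   # .id = uint32le
label = data[4:data.index(0, 4)].decode("ascii")  # .label = ascii . zero_terminated
x, y = struct.unpack_from("<II", data, 6)         # merged point struct
print({"id": label_id, "label": label, "x": x, "y": y})
```

Note that the `zero_terminated` field consumes its terminating `00` byte, which is why the merged `point` fields start at offset 6.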

== Future features ==

 * Parameters for transforms
   * Keyword parameters only?
 * Expressions depending on previously decoded values
 * Enumerations
 * Variables
 * Transforming internal nodes
 * Assertions
   * Transforms that return their input
   * Different levels (expected, required, mandatory)
 * Error handling
 * Hidden fields
 * Iteration/recursion/repetition
 * Seeking and detecting position
   * Checking alignment
 * Reference to structures at other offsets
   * How to know what blob to go within?
   * How to know current offset within that blob?
     * Could be relative to multiple things at once...
   * Blob node can be an inherited parameter
     * This is also useful for endianness
   * Offset could be an automatically incremented parameter
 * Ad hoc tweaks at runtime

=== Constraint‐based version ===

This and most other projects use an imperative design, where the format
specification is always used in a fixed order, one step at a time. The
imperative design causes problems when the user wants to modify a field,
because arbitrary changes to other fields may be necessary that cannot be
determined from the format specification.

It may be possible to solve this with a constraint-based design, where the
format specification consists of statements that must be true about the raw and
interpreted data, and the program figures out how to solve these constraints.
Unfortunately, this approach seems too open-ended and unpredictable to fit
within GSoC.