= Bithenge =

[[PageOutline(2-3)]]

Various aspects of the [[Bithenge]] design are discussed here.

== Motivation ==

Exploring and working with structured binary data is necessary in many
different situations in a project like HelenOS. For instance, when implementing
a file format or filesystem, it is first necessary to explore preexisting files
and disks and learn the low‐level details of the format. Debugging compiled
programs, working with core dumps, and exploring network protocols also require
some way of interpreting binary data.

The most basic tool for exploring binary data is the hex editor. Using a hex
editor is inefficient and unpleasant because it requires manual calculation of
lengths and offsets while constantly referring back to the data format.
General‐purpose scripting languages can be used instead, so a structure can be
defined once and decoded as often as necessary. However, even with useful tools
like Python’s struct module, the programmer must specify how to read the input
data, calculate lengths and offsets, and provide useful output, so there’s much
more work involved than simply specifying the format of the data. This extra
code tends to be rewritten for every new script, because each one has slightly
different requirements.
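
As a rough illustration of that bookkeeping, the sketch below decodes a made-up
record layout (a 4-byte little-endian length followed by that many bytes of
name data) with Python’s struct module; the layout and the helper function are
invented for this example and are not taken from any real format.

{{{
#!python
import struct

def read_record(data, offset):
    # The offset and length arithmetic has to be written by hand...
    (length,) = struct.unpack_from("<I", data, offset)
    name = data[offset + 4 : offset + 4 + length]
    # ...and so does any useful presentation of the result.
    print("record at offset %d: %r" % (offset, name))
    return offset + 4 + length

raw = b"\x05\x00\x00\x00hello\x03\x00\x00\x00fat"
offset = 0
while offset < len(raw):
    offset = read_record(raw, offset)
}}}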

== Requirements ==

 * Work in HelenOS: this means the code must be in C and/or an easily ported
   language like Lua.
 * View on different layers. For instance, when viewing a FAT directory entry,
   it should be possible to switch between viewing the formatted date and time,
   the integers, and the original bytes (see the sketch after this list).
 * Check whether data is valid; handle broken data reasonably well.
 * Parse pieces of the data lazily; don’t try to read everything at once.
 * Work in both directions (parsing and building) without requiring too much
   extra effort when specifying the format.
 * Support full modifications. Ideally, allow creation of a whole filesystem
   from scratch.
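
To make the “layers” requirement concrete, the sketch below walks a FAT
timestamp through all three views in plain Python (not Bithenge): the raw
bytes, the two 16-bit integers, and the formatted date and time. The bit layout
used is the standard FAT packed encoding (two-second resolution for seconds,
years counted from 1980); the sample bytes are invented.

{{{
#!python
import struct

raw = b"\x2c\x7d\x44\x41"                          # layer 1: raw bytes
time_word, date_word = struct.unpack("<HH", raw)   # layer 2: integers

seconds = (time_word & 0x1f) * 2
minutes = (time_word >> 5) & 0x3f
hours   = time_word >> 11
day     = date_word & 0x1f
month   = (date_word >> 5) & 0x0f
year    = 1980 + (date_word >> 9)

# layer 3: the formatted date and time
print("%04d-%02d-%02d %02d:%02d:%02d"
      % (year, month, day, hours, minutes, seconds))
}}}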

== Rationales ==

'''Purity''': Pure functions are a natural basis for converting data, and the
overall purpose of a script does not involve side effects. With a possible
constraint‐based version (see below) in mind, it seemed best to avoid side
effects entirely.

'''Composition''': Bithenge’s emphasis on transform composition is intended to
help view data on different “layers.” It should be relatively simple to view
the result partway through a composition rather than at the end. Note that the
order of composition is consistent with function composition and nested
application in mathematics, and also consistent with the general idea that data
moves from right to left as it is decoded.
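
As a plain-Python analogy (the transform names here are invented, and this is
not Bithenge syntax), a composition written `f <- g` corresponds to the nested
call `f(g(x))`, so the rightmost transform touches the raw data first:

{{{
#!python
def drop_trailing_zeros(raw):
    return raw.rstrip(b"\x00")

def decode_ascii(raw):
    return raw.decode("ascii")

def compose(f, g):
    # compose(f, g) behaves like "f <- g": apply g first, then f.
    return lambda x: f(g(x))

decode_name = compose(decode_ascii, drop_trailing_zeros)
print(decode_name(b"README\x00\x00"))   # prints README
}}}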

== Issues ==

 * Poor error handling in transform application: where and why did the error
   occur? `scope_error` should help, but it isn’t widely used and it can’t give
   the full context.
 * Mixing transforms and expressions, and using `in`, are confusing. It may be
   best to combine them and eliminate the mandatory parentheses.
 * There may be too much emphasis on single‐input, single‐output transforms;
   it’s too early to tell.
 * The internal handling of parameters and scopes seems awkward.
 * The language isn’t very powerful yet.

== Future language ideas ==

These are listed in approximate order of priority.

 Infinite loop detection:: When decoding transforms like
   `struct { if (.non_existent) { } }`, an infinite loop occurs. This can also
   happen if the field exists, but only in an outer `struct`. An error should
   be printed instead; Bithenge should not try to look inside the `if` when
   searching for `.non_existent`.
 Code exporter:: A tool could generate C code to read and write data given a
   specification. A separate file or annotations could be used to specify which
   types should be used and which things should be read lazily or strictly.
 Complex expressions:: More operators, and expressions that call transforms.
 Better error reporting:: `errno.h` is intended for system call errors; when
   other errors occur, a more helpful error message should be printed using
   `scope_error`, like "field .foo not found" or "cannot apply nonzero_boolean
   to blobs".
 Assertions:: These could be implemented as transforms that don't actually
   change the input. There could be multiple levels, ranging from “warning” to
   “fatal error”.
 Enumerations:: An easier way to handle many constant values, like
   `enum { 0: "none", 1: "file", 2: "directory", 3: "symlink" }`.
 Recursive transforms:: Although simple cases are handled by `do...while` or
   self‐recursion, some cases need transforms that recursively refer to each
   other.
 Merge blobs and internal nodes:: Currently, `struct`, `repeat`, and so on only
   work with blobs, which must be either byte sequences or bit sequences.
   Numbered internal nodes (such as those made by `repeat`) should be supported
   as well.
 Transforming internal nodes:: After binary data is decoded into a tree, it
   should be possible to apply additional transforms to interpret the data
   further. For instance, after the FAT and directory entries of a FAT
   filesystem have been decoded, a further transform could determine the data
   for each file.
 More information in repeat subtransforms:: Repeat subtransforms should have
   access to the current index and to previously decoded items.
 Hidden fields:: Some fields, such as length fields, are no longer interesting
   after the data is decoded, so they should be hidden by default.
 Search:: Decoding may require searching for a fixed sequence of bytes in the
   data.
 Automatic parameters:: It could be useful to automatically pass some
   parameters rather than computing and passing them explicitly. For instance,
   a version number that affects the format of many different parts of the file
   could be passed automatically, without having to write it out every time. A
   more advanced automatic parameter could keep track of the current offset
   being decoded within a blob. There would need to be some sort of grouping to
   determine which transforms receive the automatic parameters.
 Smarter length calculation:: Bithenge should automatically detect the length
   of certain composed transforms, such as `repeat(8) {bit} <- bits_le`. This
   would also be addressed by the constraint‐based version.
 Diff tool:: Could show differences in the interpreted data.

=== Constraint‐based version ===

This project, like most others in this area, uses an imperative design, where
the format specification is always applied in a fixed order, one step at a
time. The imperative design causes problems when the user wants to modify a
field, because arbitrary changes to other fields may be necessary that cannot
be determined from the format specification.

It may be possible to solve this with a constraint‐based design, where the
format specification consists of statements that must be true about the raw and
interpreted data, and the program figures out how to solve these constraints.
Unfortunately, this approach seems too open‐ended and unpredictable to fit
within GSoC.
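
As a toy illustration of the difference (plain Python, invented for this page,
not a proposed Bithenge design): an imperative specification reads a length and
then that many bytes, while a constraint‐based one states that the length field
and the payload length must agree, and the tool solves for whichever side the
user did not edit.

{{{
#!python
def decode(raw):
    length = raw[0]
    payload = raw[1:1 + length]
    assert len(payload) == length      # the constraint, merely checked here
    return {"length": length, "payload": payload}

def encode(node):
    # The same constraint, solved for "length", so an edit to the payload
    # does not require the user to update the length field by hand.
    payload = node["payload"]
    return bytes([len(payload)]) + payload

node = decode(b"\x05hello")
node["payload"] = b"a longer payload"
print(encode(node))                    # the length byte is recomputed
}}}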

== Existing Tools ==

I researched existing tools related to my project, so they can be used for
inspiration.

=== Construct ===

[http://construct.wikispaces.com/ Construct] is a Python library for creating
declarative structure definitions. Each instance of the `Construct` class has a
name, and knows how to read from a stream, write to a stream, and determine its
length. Some predefined `Construct` subclasses use an arbitrary Python function
evaluated at runtime, or behave differently depending on whether
sub‐`Construct`s throw exceptions. `Const` uses a sub‐`Construct` and makes
sure the value is correct. There are also lazy `Construct`s.
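
For a flavour of the declarative style, here is a small sketch using the API of
recent Construct releases (the syntax current when this page was written
differed); the header layout itself is invented for the example.

{{{
#!python
from construct import Struct, Const, Int32ul, Bytes, this

header = Struct(
    "magic" / Const(b"DATA"),
    "length" / Int32ul,
    "payload" / Bytes(this.length),
)

parsed = header.parse(b"DATA\x03\x00\x00\x00abc")
print(parsed.length, parsed.payload)

# Building works from the same definition, but "length" still has to be kept
# consistent with "payload" by hand, which is the limitation noted below.
rebuilt = header.build(dict(length=3, payload=b"xyz"))
}}}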

Unfortunately, if you change the size of a structure, you still have to change
everything else manually.

=== !BinData ===

[http://bindata.rubyforge.org/ BinData] makes good use of Ruby syntax; it
mostly has the same features as Construct.

=== Imperative DSLs ===

DSLs in this category are used in an obvious, deterministic manner, and complex
edits (such as changing the length of a structure) are difficult or impossible.
They are simple imperative languages in which fields, structures,
bitstructures, and arrays can be defined. The length, decoded value, and
presence of fields can be determined by expressions using any previously
decoded field, and structures can use `if`/`while`/`continue`/`break` and
similar statements. Structures can inherit from other structures, meaning that
the parent’s fields are present at the beginning of the child. Statements can
move to a different offset in the input data. There may be a real programming
language that can be used along with the DSL.

 [http://dwarfstd.org/ DWARF]::
   Uses a simple stack‐based VM to calculate variable locations.
 [http://ijdc.net/index.php/ijdc/article/view/207 “Grammar‐based specification and parsing of binary file formats”]::
   Actually uses an attribute grammar, but it isn’t terribly different from an
   imperative language.
 [http://pyffi.sourceforge.net/ PyFFI]::
   Lets you create or modify files instead of just reading them. Fields can
   refer to blocks of data elsewhere in the file. Uses an XML format.
 [http://aluigi.altervista.org/quickbms.htm QuickBMS]::
   A popular tool for extracting files from video game archives. Its main
   strength is the broad range of compression formats supported. It can put
   modified files back in the archive in trivial situations.
 [http://www.synalysis.net/ Synalize It!]::
   Not completely imperative; if you declare optional structs where part of the
   data is constant, the correct struct will be displayed. Has a Graphviz
   export of the file structure. Uses an XML format.
 Other free::
   [http://www-old.bro-ids.org/wiki/index.php/BinPAC_Userguide BinPAC],
   [https://metacpan.org/module/Data::ParseBinary Data::ParseBinary],
   [http://datascript.berlios.de/DataScriptLanguageOverview.html DataScript],
   [http://www.dataworkshop.de/ DataWorkshop],
   [http://wsgd.free.fr/ Wireshark Generic Dissector],
   [http://metafuzz.rubyforge.org/binstruct/ Metafuzz BinStruct], and
   [http://www.padsproj.org/ PADS].
 Other proprietary::
   [http://www.sweetscape.com/010editor/#templates 010 Editor],
   [http://www.nyangau.org/be/be.htm Andys Binary Folding Editor],
   [https://www.technologismiki.com/prod.php?id=31 Hackman Suite],
   [http://www.hhdsoftware.com/doc/hex-editor/language-reference-overview.html Hex Editor Neo],
   [http://apps.tempel.org/iBored/ iBored], and
   [https://www.x-ways.net/winhex/templates.html#User_Templates WinHex].

=== Less interesting tools ===

 Simple formats in hex editors::
   These support static fields and dynamic lengths only:
   [http://www.flexhex.com/ FlexHex],
   [http://hexedit.com/ HexEdit],
   [http://sourceforge.net/projects/hexplorer/ Hexplorer],
   [http://www.hexworkshop.com/ Hex Workshop], and
   [http://kde.org/applications/utilities/okteta/ Okteta].
 Simple formats elsewhere::
   [http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/ctf.h CTF],
   [http://ff-extractor.sourceforge.net/ ffe],
   [http://bigeasy.github.com/node-packet/ Node Packet],
   [https://www.secdev.org/projects/scapy/ Scapy], and
   [http://sourceware.org/gdb/current/onlinedocs/stabs/ stabs]
   can only handle trivial structures.
   [http://www.tecgraf.puc-rio.br/~lhf/ftp/lua/#lpack lpack],
   [http://perldoc.perl.org/functions/pack.html Perl’s pack],
   [http://docs.python.org/library/struct.html Python’s struct], and
   [https://github.com/ToxicFrog/vstruct VStruct]
   use concise string formats to describe simple structures.
   [https://bitbucket.org/haypo/hachoir Hachoir]
   uses Python for most things.
 Protocol definition formats::
   [https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One ASN.1],
   [https://en.wikipedia.org/wiki/Microsoft_Interface_Definition_Language MIDL],
   [http://piqi.org/ Piqi],
   and other IPC implementations go in the other direction: they generate a
   binary format from a text description of a structure. ASN.1 in particular
   has many features.
 [https://www.wireshark.org/ Wireshark] and [http://www.tcpdump.org/ tcpdump]::
   As the Construct wiki notes, you would expect these developers to have some
   sort of DSL, but they just use C for everything. Wireshark does use ASN.1,
   Diameter, and MIDL for protocols developed with them.

== Miscellaneous ideas ==

[http://corte.si/posts/visualisation/binvis/index.html Space‐filling curves]
look cool, but this project is about ''avoiding'' looking at raw binary data.

== Interesting formats ==

These formats will be interesting and/or difficult to handle. They should be
kept in mind when designing the language.

 * Filesystem allocation tables, which should be kept consistent with the
   actual usage of the disk.
 * Filesystem logs, which should be applied to the rest of the disk before
   interpreting it.
 * Formats where the whole file can have either endianness, depending on a
   field in the header.
 * The [http://www.blender.org/development/architecture/blender-file-format/ Blender file format]
   is especially dynamic. When Blender saves a file, it just copies the
   structures from memory and translates the pointers. Since each Blender
   version and architecture will have different structures, the output file
   includes a header describing the fields and binary layout of each structure.
   When the file is loaded, the header is read first and the structures are
   translated as necessary.
 * If the language is powerful enough, it might be possible to have a native
   description of Zlib and other compression formats.
 * It could be interesting to parse ARM or x86 machine code.