= Bithenge =

[[PageOutline(2-3)]]

Various aspects of the [[Bithenge]] design are discussed here.

== Motivation ==

Exploring and working with structured binary data is necessary in many
different situations in a project like HelenOS. For instance, when implementing
a file format or filesystem, it is first necessary to explore preexisting files
and disks and learn the low‐level details of the format. Debugging compiled
programs, working with core dumps, and exploring network protocols also require
some way of interpreting binary data.

The most basic tool for exploring binary data is the hex editor. Using a hex
editor is inefficient and unpleasant because it requires manual calculation of
lengths and offsets while constantly referring back to the data format.
General‐purpose scripting languages can be used instead, so a structure can be
defined once and decoded as often as necessary. However, even with useful tools
like Python’s struct module, the programmer must specify how to read the input
data, calculate lengths and offsets, and provide useful output, so there’s much
more work involved than simply specifying the format of the data. This extra
code tends to be rewritten for every new script, because each one has slightly
different requirements.
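
As a rough illustration of that bookkeeping, the sketch below decodes a made-up
record layout (a 4-byte little-endian length followed by that many bytes of
name data) with Python’s struct module; the layout and the helper function are
invented for this example and are not taken from any real format.

{{{
#!python
import struct

def read_record(data, offset):
    # The offset and length arithmetic has to be written by hand...
    (length,) = struct.unpack_from("<I", data, offset)
    name = data[offset + 4 : offset + 4 + length]
    # ...and so does any useful presentation of the result.
    print("record at offset %d: %r" % (offset, name))
    return offset + 4 + length

raw = b"\x05\x00\x00\x00hello\x03\x00\x00\x00fat"
offset = 0
while offset < len(raw):
    offset = read_record(raw, offset)
}}}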

== Requirements ==

 * Work in HelenOS: this means the code must be in C and/or an easily ported
   language like Lua.
 * View on different layers. For instance, when viewing a FAT directory entry,
   it should be possible to switch between viewing the formatted date and time,
   the integers, and the original bytes (see the sketch after this list).
 * Check whether data is valid; handle broken data reasonably well.
 * Parse pieces of the data lazily; don’t try to read everything at once.
 * Work in both directions (parsing and building) without requiring too much
   extra effort when specifying the format.
 * Support full modifications. Ideally, allow creation of a whole filesystem
   from scratch.
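
To make the “layers” requirement concrete, the sketch below walks a FAT
timestamp through all three views in plain Python (not Bithenge): the raw
bytes, the two 16-bit integers, and the formatted date and time. The bit layout
used is the standard FAT packed encoding (two-second resolution for seconds,
years counted from 1980); the sample bytes are invented.

{{{
#!python
import struct

raw = b"\x2c\x7d\x44\x41"                          # layer 1: raw bytes
time_word, date_word = struct.unpack("<HH", raw)   # layer 2: integers

seconds = (time_word & 0x1f) * 2
minutes = (time_word >> 5) & 0x3f
hours   = time_word >> 11
day     = date_word & 0x1f
month   = (date_word >> 5) & 0x0f
year    = 1980 + (date_word >> 9)

# layer 3: the formatted date and time
print("%04d-%02d-%02d %02d:%02d:%02d"
      % (year, month, day, hours, minutes, seconds))
}}}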

== Rationales ==

'''Purity''': Pure functions are a natural basis for converting data, and the
overall purpose of a script does not involve side effects. With a possible
constraint‐based version (see below) in mind, it seemed best to avoid side
effects entirely.

'''Composition''': Bithenge’s emphasis on transform composition is intended to
help view data on different “layers.” It should be relatively simple to view
the result partway through a composition rather than at the end. Note that the
order of composition is consistent with function composition and nested
application in mathematics, and also consistent with the general idea that data
moves from right to left as it is decoded.
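
As a plain-Python analogy (the transform names here are invented, and this is
not Bithenge syntax), a composition written `f <- g` corresponds to the nested
call `f(g(x))`, so the rightmost transform touches the raw data first:

{{{
#!python
def drop_trailing_zeros(raw):
    return raw.rstrip(b"\x00")

def decode_ascii(raw):
    return raw.decode("ascii")

def compose(f, g):
    # compose(f, g) behaves like "f <- g": apply g first, then f.
    return lambda x: f(g(x))

decode_name = compose(decode_ascii, drop_trailing_zeros)
print(decode_name(b"README\x00\x00"))   # prints README
}}}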

== Issues ==

 * Poor error handling in transform application: where and why did the error
   occur? `scope_error` should help, but it isn’t widely used and it can’t give
   the full context.
 * Mixing transforms and expressions, and using `in`, are confusing. It may be
   best to combine them and eliminate the mandatory parentheses.
 * There may be too much emphasis on single‐input, single‐output transforms;
   it’s too early to tell.
 * The internal handling of parameters and scopes seems awkward.
 * The language isn’t very powerful yet.

== Future language ideas ==

These are listed in approximate order of priority.

 Infinite loop detection:: When decoding transforms like
   `struct { if (.non_existent) { } }`, an infinite loop occurs. This can also
   happen if the field exists, but only in an outer `struct`. An error should
   be printed instead; Bithenge should not try to look inside the `if` when
   searching for `.non_existent`.
 Code exporter:: A tool could generate C code to read and write data given a
   specification. A separate file or annotations could be used to specify which
   types should be used and which things should be read lazily or strictly.
 Complex expressions:: More operators, and expressions that call transforms.
 Better error reporting:: `errno.h` is intended for system call errors; when
   other errors occur, a more helpful error message should be printed using
   `scope_error`, like "field .foo not found" or "cannot apply nonzero_boolean
   to blobs".
 Assertions:: These could be implemented as transforms that don't actually
   change the input. There could be multiple levels, ranging from “warning” to
   “fatal error”.
 Enumerations:: An easier way to handle many constant values, like
   `enum { 0: "none", 1: "file", 2: "directory", 3: "symlink" }`.
 Recursive transforms:: Although simple cases are handled by `do...while` or
   self‐recursion, some cases need transforms that recursively refer to each
   other.
 Merge blobs and internal nodes:: Currently, `struct`, `repeat`, and so on only
   work with blobs, which must be either byte sequences or bit sequences.
   Numbered internal nodes (such as those made by `repeat`) should be supported
   as well.
 Transforming internal nodes:: After binary data is decoded into a tree, it
   should be possible to apply additional transforms to interpret the data
   further. For instance, after the FAT and directory entries of a FAT
   filesystem have been decoded, a further transform could determine the data
   for each file.
 More information in repeat subtransforms:: Repeat subtransforms should have
   access to the current index and to previously decoded items.
 Hidden fields:: Some fields, such as length fields, are no longer interesting
   after the data is decoded, so they should be hidden by default.
 Search:: Decoding may require searching for a fixed sequence of bytes in the
   data.
 Automatic parameters:: It could be useful to automatically pass some
   parameters rather than computing and passing them explicitly. For instance,
   a version number that affects the format of many different parts of the file
   could be passed automatically, without having to write it out every time. A
   more advanced automatic parameter could keep track of the current offset
   being decoded within a blob. There would need to be some sort of grouping to
   determine which transforms receive the automatic parameters.
 Smarter length calculation:: Bithenge should automatically detect the length
   of certain composed transforms, such as `repeat(8) {bit} <- bits_le`. This
   would also be addressed by the constraint‐based version.
 Diff tool:: Could show differences in the interpreted data.

=== Constraint‐based version ===

This project, like most others in this area, uses an imperative design, where
the format specification is always applied in a fixed order, one step at a
time. The imperative design causes problems when the user wants to modify a
field, because arbitrary changes to other fields may be necessary that cannot
be determined from the format specification.

It may be possible to solve this with a constraint‐based design, where the
format specification consists of statements that must be true about the raw and
interpreted data, and the program figures out how to solve these constraints.
Unfortunately, this approach seems too open‐ended and unpredictable to fit
within GSoC.
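
As a toy illustration of the difference (plain Python, invented for this page,
not a proposed Bithenge design): an imperative specification reads a length and
then that many bytes, while a constraint‐based one states that the length field
and the payload length must agree, and the tool solves for whichever side the
user did not edit.

{{{
#!python
def decode(raw):
    length = raw[0]
    payload = raw[1:1 + length]
    assert len(payload) == length      # the constraint, merely checked here
    return {"length": length, "payload": payload}

def encode(node):
    # The same constraint, solved for "length", so an edit to the payload
    # does not require the user to update the length field by hand.
    payload = node["payload"]
    return bytes([len(payload)]) + payload

node = decode(b"\x05hello")
node["payload"] = b"a longer payload"
print(encode(node))                    # the length byte is recomputed
}}}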

== Existing Tools ==

I researched existing tools related to my project, so they can be used for
inspiration.

=== Construct ===

[http://construct.wikispaces.com/ Construct] is a Python library for creating
declarative structure definitions. Each instance of the `Construct` class has a
name, and knows how to read from a stream, write to a stream, and determine its
length. Some predefined `Construct` subclasses use an arbitrary Python function
evaluated at runtime, or behave differently depending on whether
sub‐`Construct`s throw exceptions. `Const` uses a sub‐`Construct` and makes
sure the value is correct. There are also lazy `Construct`s.
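
For a flavour of the declarative style, here is a small sketch using the API of
recent Construct releases (the syntax current when this page was written
differed); the header layout itself is invented for the example.

{{{
#!python
from construct import Struct, Const, Int32ul, Bytes, this

header = Struct(
    "magic" / Const(b"DATA"),
    "length" / Int32ul,
    "payload" / Bytes(this.length),
)

parsed = header.parse(b"DATA\x03\x00\x00\x00abc")
print(parsed.length, parsed.payload)

# Building works from the same definition, but "length" still has to be kept
# consistent with "payload" by hand, which is the limitation noted below.
rebuilt = header.build(dict(length=3, payload=b"xyz"))
}}}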

Unfortunately, if you change the size of a structure, you still have to change
everything else manually.

=== !BinData ===

[http://bindata.rubyforge.org/ BinData] makes good use of Ruby syntax; it
mostly has the same features as Construct.

=== Imperative DSLs ===

DSLs in this category are used in an obvious, deterministic manner, and complex
edits (such as changing the length of a structure) are difficult or impossible.
They are simple imperative languages in which fields, structures,
bitstructures, and arrays can be defined. The length, decoded value, and
presence of fields can be determined by expressions using any previously
decoded field, and structures can use `if`/`while`/`continue`/`break` and
similar statements. Structures can inherit from other structures, meaning that
the parent’s fields are present at the beginning of the child. Statements can
move to a different offset in the input data. There may be a real programming
language that can be used along with the DSL.

 [http://dwarfstd.org/ DWARF]::
   Uses a simple stack‐based VM to calculate variable locations.
 [http://ijdc.net/index.php/ijdc/article/view/207 “Grammar‐based specification and parsing of binary file formats”]::
   Actually uses an attribute grammar, but it isn’t terribly different from an
   imperative language.
 [http://pyffi.sourceforge.net/ PyFFI]::
   Lets you create or modify files instead of just reading them. Fields can
   refer to blocks of data elsewhere in the file. Uses an XML format.
 [http://aluigi.altervista.org/quickbms.htm QuickBMS]::
   A popular tool for extracting files from video game archives. Its main
   strength is the broad range of compression formats supported. It can put
   modified files back in the archive in trivial situations.
 [http://www.synalysis.net/ Synalize It!]::
   Not completely imperative; if you declare optional structs where part of the
   data is constant, the correct struct will be displayed. Has a Graphviz
   export of the file structure. Uses an XML format.
 Other free::
   [http://www-old.bro-ids.org/wiki/index.php/BinPAC_Userguide BinPAC],
   [https://metacpan.org/module/Data::ParseBinary Data::ParseBinary],
   [http://datascript.berlios.de/DataScriptLanguageOverview.html DataScript],
   [http://www.dataworkshop.de/ DataWorkshop],
   [http://wsgd.free.fr/ Wireshark Generic Dissector],
   [http://metafuzz.rubyforge.org/binstruct/ Metafuzz BinStruct], and
   [http://www.padsproj.org/ PADS].
 Other proprietary::
   [http://www.sweetscape.com/010editor/#templates 010 Editor],
   [http://www.nyangau.org/be/be.htm Andys Binary Folding Editor],
   [https://www.technologismiki.com/prod.php?id=31 Hackman Suite],
   [http://www.hhdsoftware.com/doc/hex-editor/language-reference-overview.html Hex Editor Neo],
   [http://apps.tempel.org/iBored/ iBored], and
   [https://www.x-ways.net/winhex/templates.html#User_Templates WinHex].

=== Less interesting tools ===

 Simple formats in hex editors::
   These support static fields and dynamic lengths only:
   [http://www.flexhex.com/ FlexHex],
   [http://hexedit.com/ HexEdit],
   [http://sourceforge.net/projects/hexplorer/ Hexplorer],
   [http://www.hexworkshop.com/ Hex Workshop], and
   [http://kde.org/applications/utilities/okteta/ Okteta].
 Simple formats elsewhere::
   [http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/ctf.h CTF],
   [http://ff-extractor.sourceforge.net/ ffe],
   [http://bigeasy.github.com/node-packet/ Node Packet],
   [https://www.secdev.org/projects/scapy/ Scapy], and
   [http://sourceware.org/gdb/current/onlinedocs/stabs/ stabs]
   can only handle trivial structures.
   [http://www.tecgraf.puc-rio.br/~lhf/ftp/lua/#lpack lpack],
   [http://perldoc.perl.org/functions/pack.html Perl’s pack],
   [http://docs.python.org/library/struct.html Python’s struct], and
   [https://github.com/ToxicFrog/vstruct VStruct]
   use concise string formats to describe simple structures.
   [https://bitbucket.org/haypo/hachoir Hachoir]
   uses Python for most things.
 Protocol definition formats::
   [https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One ASN.1],
   [https://en.wikipedia.org/wiki/Microsoft_Interface_Definition_Language MIDL],
   [http://piqi.org/ Piqi],
   and other IPC implementations go in the other direction: they generate a
   binary format from a text description of a structure. ASN.1 in particular
   has many features.
 [https://www.wireshark.org/ Wireshark] and [http://www.tcpdump.org/ tcpdump]::
   As the Construct wiki notes, you would expect these developers to have some
   sort of DSL, but they just use C for everything. Wireshark does use ASN.1,
   Diameter, and MIDL for protocols developed with them.

== Miscellaneous ideas ==

[http://corte.si/posts/visualisation/binvis/index.html Space‐filling curves]
look cool, but this project is about ''avoiding'' looking at raw binary data.

== Interesting formats ==

These formats will be interesting and/or difficult to handle. They should be
kept in mind when designing the language.

 * Filesystem allocation tables, which should be kept consistent with the
   actual usage of the disk.
 * Filesystem logs, which should be applied to the rest of the disk before
   interpreting it.
 * Formats where the whole file can have either endianness, depending on a
   field in the header.
 * The [http://www.blender.org/development/architecture/blender-file-format/ Blender file format]
   is especially dynamic. When Blender saves a file, it just copies the
   structures from memory and translates the pointers. Since each Blender
   version and architecture will have different structures, the output file
   includes a header describing the fields and binary layout of each structure.
   When the file is loaded, the header is read first and the structures are
   translated as necessary.
 * If the language is powerful enough, it might be possible to have a native
   description of Zlib and other compression formats.
 * It could be interesting to parse ARM or x86 machine code.