= Bithenge Syntax = [[PageOutline(2-3)]] This page describes the syntax and semantics of [[Bithenge]] scripts. == Simple example == {{{ transform item(scale) = struct { .name_len <- uint8; .name <- ascii <- known_length(.name_len); .value <- (in * scale) <- uint32le; }; transform main = repeat {item(100)}; }}} More complicated examples can be found in [https://bazaar.launchpad.net/~wtachi/helenos/bithenge/files/head:/uspace/dist/src/bithenge uspace/dist/src/bithenge]. == Tokenization == Bithenge scripts consist of ASCII text. When a script is read, it is first divided into pieces known as ''tokens''. As in most programming languages, the longest possible token is read, so `==` is always a single token rather than two. Whitespace (space, form‐feed, newline, carriage return, horizontal tab, and vertical tab) and comments (`#` until newline) are allowed between tokens; both LF (Unix) and CRLF (DOS) newline conventions can be used. There are several types of tokens: Symbols:: `&& ++ == >= // <- <= != || < > = + - * % ; : { } ( ) .` Keywords:: `do else false if in partial repeat struct switch transform true while` Identifiers:: Any sequence of letters, digits, and underscores, beginning with a letter, that is not a keyword. Integer literals:: A sequence of decimal digits; negative literals are not yet possible. Keywords and identifiers are case‐sensitive. Tokens can be at most 256 characters long. == Scripts == - script → *definition - definition → "`transform`" identifier [parameters] "`=`" transform "`;`" - parameters → "`(`" [identifier *("`,`" identifier)] "`)`" A Bithenge script consists of a series of transform definitions. It must define a transform named `main`, which is applied when the script is used. Transforms may be defined with parameters, in which case corresponding arguments must be provided when the transform is invoked. Examples: - `transform main = uint8;` defines an extremely simple main transform. - `transform is_odd(val) = (val % 2 == 1);` defines a transform that checks whether its argument is odd. == Transform invocation == - transform → identifier [arguments] - arguments → "`(`" [ expression *("`,`" expression) ] "`)`" A built‐in or previously defined transform can be invoked, with arguments if necessary. Examples: - `is_odd(17)` is a transform that applies `is_odd` with 17 as the argument. == Transform composition == - transform → transform "`<-`" transform A transform can be created by composing two existing transforms. It will work by first applying the right transform to the input, then applying the left transform to the result. Examples: - `ascii <- zero_terminated` decodes a zero‐terminated ASCII string. - `repeat{uint32le} <- known_length(8)` decodes two 32‐bit integers. == Expression transforms == - transform → "`(`" expression "`)`" Expression transforms produce their output by calculating an expression. If "`in`" is used in the expression, it refers to the expression transform’s input; otherwise, the input must be an empty blob. In either case, the parentheses are mandatory. Examples: - `(in + 1)` adds 1 to its input. - `(.bytes_per_sector * .sectors_per_cluster)` calculates the number of bytes per cluster. == Struct transforms == - transform → "`struct`" "`{`" *struct-field "`}`" - struct-field → [ "`.`" identifier ] "`<-`" transform "`;`" A struct transform applies each of its subtransforms in sequence to the input blob. The result is an internal node, with the specified key used for the result of each subtransform. If a subtransform has no corresponding key, its result must be an internal node; the result’s keys will be merged into the struct transform’s result. It must be possible to determine the size of the subblob each subtransform consumes. Bithenge currently only handles the simple cases: it knows `uint32le` needs 4 bytes, but not that `known_length(2) <- repeat{uint32le}` needs 8 bytes. Example: {{{ struct { .min <- uint32le; .max <- uint32le; .mid <- ((.min + .max) // 2); <- repeat(3) {uint8}; } }}} One possible output is `{"min": 0, "max": 8, "mid": 4, 0: 127, 1: 126, 2: 125}`. == Partial transforms == - transform → "`partial`" [ "`(`" expression "`)`" ] "`{`" transform "`}`" A partial transform applies its subtransform to a prefix of a blob. An expression can be given to specify the offset within the blob at which the subtransform is applied. Examples: - `partial {uint8}` decodes the first byte of the arbitrary‐length input as an integer. - `partial(1) {uint8}` decodes the second byte of the arbitrary‐length input as an integer. == Conditional transforms == - transform → "`if`" "`(`" expression "`)`" "`{`" transform "`}`" "`else`" "`{`" transform "`}`" - struct-field → "`if`" "`(`" expression "`)`" "`{`" *struct-field "`}`" [ "`else`" "`{`" *struct-field "`}`" ] An if transform applies its first subtransform if the expression evaluates to true, and the second subtransform if the expression evaluates to false. A second form can be used in struct transforms: `struct { ... if (...) {...} ... }` is equivalent to `struct { ... <- if (...) { struct { ... } } else { struct { } }; ... }`. - transform → "`switch`" "`(`" expression "`)`" "`{`" *( (expression/"`else`") "`:`" transform "`;`" ) "`}`" - struct-field → "`switch`" "`(`" expression "`)`" "`{`" *( (expression/"`else`") "`:`" "`{`" *struct-field "`}`" "`;`" ) "`}`" A switch transform compares the first expression to each of the case expressions in turn, stopping when they compare equal and using that subtransform. The last case can be "`else`", which will always be used if none of the previous cases matched. The second form can be used in struct transforms, much like the second form of "`if`". Examples: - `if (little_endian) { uint32le } else { uint32be }` decodes a little‐ or big‐endian integer, depending on `little_endian`. - `struct { switch (.type) { 0: {.x <- uint8; .y <- uint8;}; 1: {.val <- uint16;}; } }` decodes two 8‐bit integers or one 16‐bit integer depending on the value of `.type`. == Repetition transforms == - transform → "`repeat`" [ "`(`" expression "`)`" ] "`{`" transform "`}`" A repeat transform applies its subtransform repeatedly to sequential subblobs of the input. The output is an internal node with a member for each repetition, with keys starting at `0`. If an expression is given, it determines the number of times to repeat; otherwise, the subtransform is applied until it fails or there is no input remaining. - transform → "`do`" "`{`" transform "`}`" "`while`" "`(`" expression "`)`" A do/while transform applies its subtransform repeatedly until the expression evaluates to `false`. The output has the same format as repeat transform output. Scope member expressions should be used in the expression to access the output of the subtransform and determine when to stop. Examples: - `repeat{uint8}` decodes each byte as an integer. - `repeat(256) {uint32le}` decodes an array of 256 integers. - `do { struct { .type <- uint8; .value <- uint32le; } } while (.type != 0)` decodes types and values, stopping after a type of 0 is seen. == Expressions == - expression → "`true`" - expression → "`false`" - expression → integer Expressions can be boolean or integer literals. - expression → identifier Expressions can use parameters. - expression → "`in`" In an expression transform, "`in`" refers to the transform’s input. Otherwise, it refers to the input of the whole transform in the definition. - expression → "`.`" identifier A scope member expression looks in each containing struct transform for a member with the given key. It starts from the innermost struct transform and searches as far as the whole transform being defined. - expression → expression "`.`" identifier - expression → expression "`[`" expression "`]`" An expression can look for a member of another expression with a given key. - expression → expression "`[`" expression "`:`" "`]`" - expression → expression "`[`" expression "`:`" expression "`]`" - expression → expression "`[`" expression "`,`" expression "`]`" An expression can create a subblob of another expression. The first expression specifies the blob and the second expression specifies the start offset within the blob. If a comma is used, the third expression specifies the length of the subblob; otherwise it specifies the end offset. If no third expression is given, the subblob extends to the end of the blob. - expression → "`(`" expression "`)`" - expression → expression operator expression Expressions can, of course, use binary operators. The operators and their precedences are as follows: ||= Operator =||= Precedence =||= Associativity =||= Description =|| || "`.`" || 5 || Left‐to‐right || Member (see above) || || "`[]`" || 5 || Left‐to‐right || Member or subblob (see above) || || "`*`" || 4 || Left‐to‐right || Integer multiplication || || "`//`" || 4 || Left‐to‐right || [https://en.wikipedia.org/wiki/Modulo_operation Floored/euclidean] integer division; divisor must be positive || || "`%`" || 4 || Left‐to‐right || [https://en.wikipedia.org/wiki/Modulo_operation Floored/euclidean] modulo operation; divisor must be positive || || "`+`" || 3 || Left‐to‐right || Integer addition || || "`-`" || 3 || Left‐to‐right || Integer subtraction || || "`++`" || 3 || Left‐to‐right || Blob concatenation || || "`<`" || 2 || Left‐to‐right || Integer less‐than || || "`<=`" || 2 || Left‐to‐right || Integer less‐than‐or‐equal || || "`>`" || 2 || Left‐to‐right || Integer greater‐than || || "`>=`" || 2 || Left‐to‐right || Integer greater‐than‐or‐equal || || "`==`" || 1 || Left‐to‐right || Equal‐to (not supported for internal nodes) || || "`!=`" || 1 || Left‐to‐right || Unequal‐to (not supported for internal nodes) || || "`&&`" || 0 || Left‐to‐right || Logical or || || "`||`" || 0 || Left‐to‐right || Logical and || == Built‐in transforms == These transforms are implemented in C and included with Bithenge. Note that precise names are preferred; scripts can define shorter aliases if necessary. ||= name =||= input =||= output =||= description =||= example =|| ||ascii ||byte blob node ||string ||decodes some bytes as ASCII characters || `hex:6869` becomes `"hi"` || ||bit ||1‐bit blob node ||boolean ||decodes a single bit || `1` becomes `true` || ||bits_be ||byte blob node ||bit blob node ||decodes bytes as bits, starting with the most‐significant bit || `hex:0f` becomes `bit:00001111` || ||bits_le ||byte blob node ||bit blob node ||decodes bytes as bits, starting with the least‐significant bit || `hex:0f` becomes `bit:11110000` || ||known_length(len) ||blob node ||blob node ||requires the input to have a known length || || ||nonzero_boolean ||integer ||boolean ||decodes a boolean where nonzero values are true || `0` becomes `false` || ||uint8 ||1‐byte blob node ||integer node ||decodes a 1‐byte unsigned integer || `hex:11` becomes `17` || ||uint16be ||2‐byte blob node ||integer node ||decodes a 2‐byte big‐endian unsigned integer || `hex:0201` becomes `513` || ||uint16le ||2‐byte blob node ||integer node ||decodes a 2‐byte little‐endian unsigned integer || `hex:0101` becomes `257` || ||uint32be ||4‐byte blob node ||integer node ||decodes a 4‐byte big‐endian unsigned integer || `hex:00000201` becomes `513` || ||uint32le ||4‐byte blob node ||integer node ||decodes a 4‐byte little‐endian unsigned integer || `hex:01010000` becomes `257` || ||uint64be ||8‐byte blob node ||integer node ||decodes a 8‐byte big‐endian unsigned integer || `hex:0000000000000201` becomes `513` || ||uint64le ||8‐byte blob node ||integer node ||decodes a 8‐byte little‐endian unsigned integer || `hex:0101000000000000` becomes `257` || ||uint_be(len) ||bit blob node ||integer node ||decodes bits as an unsigned integer, starting with the most‐significant bit || || ||uint_le(len) ||bit blob node ||integer node ||decodes bits as an unsigned integer, starting with the least‐significant bit || || ||zero_terminated ||byte blob node ||byte blob node||takes bytes up until the first `00` || `hex:7f0400` becomes `hex:7f04` ||