Extras¶
In this section we discuss a number of additional features and programs included in this project.
Debugging¶
The parser and the writer support four debug levels, controlled via the `-d`
option of the command line interface.
| level | description |
|-------|-------------|
| 0 | No debugging. |
| 1 | Show general debugging information and internal variables. |
| 2 | Show general debugging information and parsing details. |
| 3 | Show all debugging information. |
General debugging information¶
The section `DEBUG INFO` contains some general debugging information.
For the parser it contains:
- The file position after the parsing has finished and the size of the file. Something is wrong if these two values are not equal.
- The number of bytes that have been parsed and assigned to variables. This is all the data that has not been assigned to the `__raw__` list.
For the writer this section only contains the number of bytes written.
Internal variables¶
The section `INTERNAL VARIABLES` contains the internal key-value store used for referencing previously read variables.
Parsing details¶
The section named `PARSING DETAILS` contains a detailed trace of the parsing or writing process. Every line represents either a conversion or information about substructures.
For the parser, a conversion line contains the following fields:
| field | description |
|-------|-------------|
| 1: | File position. |
| 2 | Field content. |
| (3) | Field size (not used for strings). |
| --> 4 | Variable name. |
In the following example, we see how the file from our balance example is parsed.
```
0x000000: cf 07 (2) --> year_of_birth
0x000002: John Doe --> name
0x00000b: 8a 0c (2) --> balance
```
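The raw bytes in this trace can be decoded by hand, assuming the integer fields are little-endian 16-bit integers (the `fmt: '<h'` struct format used later in this section). A minimal sketch:

```python
import struct

# Decode the raw bytes shown in the parser trace above, assuming
# little-endian 16-bit integers (struct format '<h').
year_of_birth = struct.unpack('<h', b'\xcf\x07')[0]  # bytes at 0x000000
balance = struct.unpack('<h', b'\x8a\x0c')[0]        # bytes at 0x00000b

print(year_of_birth, balance)  # 1999 3210
```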
For the writer, a conversion line contains the following fields:
| field | description |
|-------|-------------|
| 1: | File position. |
| 2 | Variable name. |
| --> 3 | Field content. |
In the following example, we see how the file from our balance example is written.
```
0x000000: year_of_birth --> 1999
0x000002: name --> John Doe
0x00000b: balance --> 3210
```
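The writer performs the inverse conversion; a short sketch of what the encoding of the integer fields looks like, again assuming little-endian 16-bit integers:

```python
import struct

# Encode the values from the writer trace back into the bytes that
# appear in the parser trace (little-endian 16-bit integers).
print(struct.pack('<h', 1999).hex(' '))  # cf 07
print(struct.pack('<h', 3210).hex(' '))  # 8a 0c
```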
The start of a substructure is indicated by `--` followed by the name of the substructure; the end of a substructure is indicated by `-->` followed by the name of the substructure.
`make_skeleton`¶
To facilitate the development of support for a new file type, the
`make_skeleton` command can be used to generate a definition stub. It takes
an example file and a delimiter as input and outputs a structure definition
and a types definition. The input file is scanned for occurrences of the
delimiter, and a field of type `raw` is created for the bytes preceding each
occurrence. All fields are treated as delimited variable length strings that
are processed by the `raw` function; as a result, all fixed sized fields are
appended to the start of these strings.
Example¶
Suppose we know that the string delimiter in our balance example is `0x00`.
We can create a stub for the structure and types definitions as follows:
```
make_skeleton -d 0x00 balance.dat structure.yml types.yml
```
The `-d` parameter can be used multiple times for multi-byte delimiters.
This will generate the following types definition:
```yml
---
types:
  raw:
    delimiter:
      - 0x00
    function:
      name: raw
  text:
    delimiter:
      - 0x00
```
with the following structure definition:
```yml
---
- name: field_000000
  type: raw
- name: field_000001
  type: raw
```
The performance of these generated definitions can be assessed by using the parser in debug mode:
```
bin_parser read -d 2 \
  balance.dat structure.yml types.yml balance.yml 2>&1 | less
```
which gives the following output:
```
0x000000: <CF>^GJohn Doe --> field_000000
0x00000b: <8A>^L --> field_000001
```
We see that the first field has two extra bytes preceding the text field. This is an indication that one or more fields need to be added to the start of the structure definition. If we also know that in this file format only strings and 16-bit integers are used, we can change the definitions as follows.
We remove the `raw` type and add a type for parsing 16-bit integers:
```yml
---
types:
  short:
    size: 2
    function:
      name: struct
      args:
        fmt: '<h'
  text:
    delimiter:
      - 0x00
```
and we change the structure to enable parsing of the newly found integers:
```yml
---
- name: number_1
  type: short
- name: name
  type: text
- name: number_2
  type: short
```
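These final definitions describe a 16-bit integer, a `0x00`-delimited string and another 16-bit integer. A plain-Python sketch of that layout (the example bytes below are assumed, based on the balance example):

```python
import struct

# A sketch of the layout the definitions above describe: a 16-bit
# integer, a 0x00-delimited string and another 16-bit integer.
# The input bytes are assumed example data.
data = b'\xcf\x07John Doe\x00\x8a\x0c'

number_1 = struct.unpack('<h', data[:2])[0]
end = data.index(0x00, 2)                    # find the string delimiter
name = data[2:end].decode()
number_2 = struct.unpack('<h', data[end + 1:end + 3])[0]

print(number_1, name, number_2)  # 1999 John Doe 3210
```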
By iterating this process, reverse engineering of these types of file formats is greatly simplified.
`compare_yaml`¶
Since YAML files are serialised dictionaries or JavaScript objects, the order
of the keys is not fixed. Also, differences in indentation, line wrapping and
other formatting can lead to false positives when using rudimentary tools
like `diff`.
`compare_yaml` takes two YAML files as input and outputs differences in the
content of these files:
```
compare_yaml input_1.yaml input_2.yaml
```
The program recursively compares the contents of dictionaries (keys), lists and values. The following differences are reported:
- Missing keys at any level.
- Lists of unequal size.
- Differences in values.
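A minimal sketch of such a recursive comparison — an illustration of the idea, not the actual `compare_yaml` implementation:

```python
def report_differences(a, b, path=''):
    """Recursively yield differences between two parsed YAML documents.

    Once a difference is found at some level, recursion stops there,
    so the report is not necessarily complete.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        for key in sorted(a.keys() | b.keys()):
            if key not in a or key not in b:
                yield f'{path}/{key}: missing key'
            else:
                yield from report_differences(a[key], b[key], f'{path}/{key}')
    elif isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            yield f'{path}: lists of unequal size'
        else:
            for index, (x, y) in enumerate(zip(a, b)):
                yield from report_differences(x, y, f'{path}/{index}')
    elif a != b:
        yield f'{path}: values differ ({a!r} != {b!r})'


print(list(report_differences({'a': [1, 2]}, {'a': [1, 3], 'b': 0})))
```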
When a difference is detected, no further recursive comparison is attempted, so the list of reported differences is not guaranteed to be complete. Conversely, if no differences are reported, then the YAML files are guaranteed to have the same content.
`sync_test`¶
To keep the Python and JavaScript implementations in sync, we use a shell script that compares the output of both the parser and the writer for various examples.
```
./extras/sync_test
```
This will perform a parser test and an invariance test for all examples.
Parser test¶
This test uses the Python and JavaScript implementations to convert from
binary to YAML. `compare_yaml` is used to check for any differences.
Invariance test¶
This test performs the following steps:
1. Use the Python implementation to convert from binary to YAML.
2. Use the Python implementation to convert the output of step 1 back to binary.
3. Use the JavaScript implementation to convert the output of step 1 back to binary.
4. Use the Python implementation to convert the output of step 2 to YAML.
The output of steps 1 and 4 is compared using `compare_yaml` to ensure that
the generated YAML is invariant under conversion to binary and back in the
Python implementation. The binary files generated in steps 2 and 3 are
compared with `diff` to confirm that the Python and JavaScript
implementations behave identically.
Note that the original binary may not be invariant under conversion to YAML and back. This is the case when variable length strings within fixed sized fields are used.