291 lines
12 KiB
Org Mode
291 lines
12 KiB
Org Mode
-*-mode:org-*-
|
|
|
|
M2-Planet being based on the goal of bootstrapping the Minimal C compiler
|
|
required to support C macros, structs, arrays, inline assembly and self hosting;
|
|
is rather small, around 3Kloc according to sloccount
|
|
|
|
* SETUP
|
|
The most obvious way to setup for M2-Planet development is to clone --recursive
|
|
and setup mescc-tools first (https://github.com/oriansj/mescc-tools.git)
|
|
Then be sure to install any C compiler and make clone of your choice.
|
|
|
|
* BUILD
|
|
The standard C based approach to building M2-Planet is simply running:
|
|
make M2-Planet
|
|
|
|
Should you wish to verify that M2-Planet was built correctly run:
|
|
make test
|
|
|
|
* ROADMAP
|
|
M2-Planet V1.0 is the bedrock of all future M2-Planet versions. Any future
|
|
release that will depend upon a more advanced version to be compiled, will
|
|
require the version prior to it to be named. V2.0 and the same properties apply
|
|
To all future release of M2-Planet. All minor releases are buildable by the last
|
|
major release and All major releases are buildable by the last major release.
|
|
|
|
* DEBUG
|
|
To get a properly debuggable binary of M2-Planet: make M2-Planet
|
|
M2-Planet also can create debuggable binaries with the help of blood-elf and the
|
|
--debug option. if you are comfortable with gdb, knowing that function names are
|
|
prefixed with FUNCTION_ M2-Planet built binaries are quite debuggable.
|
|
|
|
* Bugs
|
|
M2-Planet assumes a very heavily restricted subset of the C language and many C
|
|
programs will break hard when passed to M2-Planet.
|
|
|
|
M2-Planet does not actually implement any primitive functionality, it is assumed
|
|
that will be written in inline assembly by the programmer or leveraged via M2libc
|
|
which is the C library written in the M2-Planet C subset.
|
|
|
|
* Magic
|
|
** argument and local stack
|
|
In M2-Planet the stack is first the EDI pointer which is preserved as should an
|
|
argument be a function which returns a value, it may be overwritten and cause
|
|
issues, this is followed by the previous frame's base pointer (EBP) as it will
|
|
need to be restored upon return from the function call. This is then followed by
|
|
the arguments which are pushed onto the stack from the left to the right,
|
|
followed by the RETURN Pointer generated from the function call, after which the
|
|
locals are placed upon the stack first to last followed by any Temporary values: +----------------------+
|
|
EDI -> | Previous EDI pointer | +----------------------+
|
|
EBP -> | Previous EBP pointer | +----------------------+
|
|
1st -> | Argument 1 | +----------------------+
|
|
2nd -> | Argument 2 | +----------------------+
|
|
... -> ........................ +----------------------+
|
|
Nth -> | Argument N | +----------------------+
|
|
RET -> | RETURN Pointer | +----------------------+
|
|
1st -> | Local 1 | +----------------------+
|
|
2nd -> | Local 2 | +----------------------+
|
|
... -> ........................ +----------------------+
|
|
Nth -> | Local N | +----------------------+
|
|
temps-> .......................
|
|
|
|
** AArch64 port notes
|
|
Some details about design, implementation and generated code; maybe of
|
|
interest for new targets, to M1 users, compiler hackers and curious
|
|
minds in general.
|
|
|
|
*** Target ISA related issues
|
|
|
|
In the ARMv8 AArch64 A64 instruction set that we target, immediate
|
|
values into instructions are not aligned to 4 bits, which is the size
|
|
of the convenient single hexadecimal digit (that served well so far,
|
|
for other ports). Other groups of bits are affected. For example,
|
|
those to encode registers are usually 5 bits long and horror stories
|
|
about non-contiguous chunks (due to endianess interactions with M1, a
|
|
big bit endian language) are told, so not even using octal nor binary
|
|
encodings solve our problem.
|
|
|
|
Because of that, we have less flexible and reusable definitions than
|
|
usual (see aarch64_defs.M1). Also, we resort to unconventional (for
|
|
M2-Planet standards) workarounds and generate worse code. Anyway,
|
|
neither size nor speed are high priorities and there's room for
|
|
improvement.
|
|
|
|
On the bright side, affected codepaths/definitions and working tactics
|
|
are better known now, being this the first target of M2-Planet with
|
|
such features. That might be helpful in future ports (RISC-V comes to
|
|
mind, which has weird structure too... designed "so that as many bits
|
|
as possible are in the same position in every instruction" but not for
|
|
basic tools).
|
|
|
|
Some notable workarounds are:
|
|
|
|
- Create one independent definition per _needed_ operation, instead of
|
|
reusing common parts like we do for other archs. The resulting set is
|
|
quite small even following this simple rule consistently. See how
|
|
the SKIP_INST_* family seems nicely aligned for more fine-grained
|
|
hex but we don't exploit that; or the PUSH/POP ones that also kind
|
|
of do, but watch out for the general case if you plan to create your
|
|
own set of general purpose definitions.
|
|
|
|
One interesting example reflects that creating new definitions is
|
|
avoided unless readability suffers: the pair LOAD_W2_AHEAD,
|
|
LSHIFT_X0_X0_X2 exists because our two main registers are in use in
|
|
postfix_expr_array() and the common shift is inconvenient in this
|
|
particular case. It's possible to reuse definitions (preliminary
|
|
patches did this) using multiplication and addition (quite natural by
|
|
the way, even if suboptimal); or dancing with the stack to fit
|
|
everything into place (harder to reason about). It felt too alien in
|
|
the codebase so a couple of definitions were added.
|
|
|
|
- Use the register-based instructions instead of those using
|
|
immediates. This forces us to generate more code in order to put the
|
|
data in the register. Data is mixed with the code (not even in a
|
|
fancy pool) to be loaded from and then skipped at run-time. See some
|
|
of the multiple instances of the LOAD_W0_AHEAD then SKIP_32_DATA
|
|
pattern.
|
|
|
|
- For control flow structures, the problem about immediates bits us
|
|
again (hits, bites, bytes; sorry, can't resist) for conditional
|
|
PC-relative branching. The jump is arbitrary, because any amount of
|
|
code can be present in any given block to be skipped. AArch64
|
|
PC-relative conditional branch instructions [that I found, newbie on
|
|
board!] are based on immediate values, and we have to avoid
|
|
arbitrary immediate values as usual.
|
|
|
|
There's an *unconditional* absolute branch instruction that accepts
|
|
the target addr from a register (which we can set at will using the
|
|
"load_ahead+skip" pattern). So, we construct an unconditional
|
|
over-the-block jump and skip this jump with the conditional one
|
|
("inverted", more about this in a moment). The point is that now we
|
|
know exactly the distance to jump: it's the size of that
|
|
construction. We can define a couple of conditional branch
|
|
instructions because the immediate is not arbitrary anymore, nice!
|
|
|
|
Maybe this pseudo-code explains it better:
|
|
|
|
if(cond) block_foo; else block_bar;
|
|
more;
|
|
|
|
... is compiled to:
|
|
|
|
if cond then skip past the unconditional-branch // To get to foo-code.
|
|
// We know the space used by this code...
|
|
set register to addr of else-label
|
|
// ... and this one, that completes the jump to the alternative block.
|
|
unconditional-branch to addr in register
|
|
|
|
foo-code
|
|
[Here we jump to the endif-label, omitted for clarity.]
|
|
else-label:
|
|
bar-code
|
|
endif-label:
|
|
more-code
|
|
|
|
Similar approach is used for other control flow structures. See
|
|
CBZ_X0_PAST_BR (cbz x0, #20) and CBNZ_X0_PAST_BR (cbnz x0, #20) used
|
|
as part of the generation of 'if', 'for', 'do' and 'while'
|
|
statements. Notice how the test is inverted: when Knight does JUMP.Z
|
|
we do CBNZ (process_if); when JUMP.NZ we CBZ (process_do).
|
|
|
|
CSEL was considered but required an additional register, more labels
|
|
and code. A bit too invasive a change to make to the codebase.
|
|
|
|
As you can imagine, the ISA colored the port development from the very
|
|
beginning. It's a lot of fun to come up with basic solutions under
|
|
those limitations. The port works as expected but there's room for
|
|
experimentation.
|
|
|
|
*** Function call
|
|
|
|
The Base Pointer and its relation to arguments in function calls and
|
|
locals during function execution is a bit different compared to other
|
|
supported architectures. This simplifies some calculations. See how
|
|
unsurprising the depths are in collect_arguments() and
|
|
collect_local().
|
|
|
|
Note how this calculations are related to the "push/pop size". See
|
|
`Wasted stack space`.
|
|
|
|
Let's follow a couple of M2-Planet functions generating code for
|
|
prologue, call and epilogue with the help of some artsy-less ascii-art
|
|
stack graphs for clarity. The expected stack is "full" (the stack
|
|
pointer register contains the address of the last pushed element) and
|
|
descending (grows towards zero).
|
|
|
|
Most of the work is done by function_call(). First, we save (the
|
|
generated code does it at runtime of the compiled program, but please
|
|
bear with me about the point of view) three registers on the stack. We
|
|
include a scratch one ("tmp" value in the graphs) that we're going to
|
|
use for two different purposes. On the one hand, to store the actual
|
|
stack pointer (which is going to be the reference address --Base
|
|
Pointer-- during the execution of the called function). On the other
|
|
hand, when the BP is already set (which can't be done right now
|
|
because we need the actual BP to evaluate the arguments in caller
|
|
context) we use the register to store the addr of the function to be
|
|
called. The other two registers are the Link Register (X30) and Base
|
|
Pointer (X17 also know as IP1) itself, to allow for recursion. Both
|
|
are prefixed with "o" in the following graphs, as in "old".
|
|
|
|
This structure gives us a simple reference for both the args and the
|
|
locals, without extra elements between those two sets. We rely on the
|
|
semantics of BLR (more on this in a bit) which doesn't use the stack
|
|
to save the return address, but a register. For other archs this is
|
|
not possible (or not exploited, see how for ARM-7 the LR is saved in
|
|
the stack just around the call proper; this puts it between the args
|
|
and the locals) so it's a difference worth documenting.
|
|
|
|
---> Address 0
|
|
tmp | oLR | oBP |
|
|
^
|
|
|
|
|
--- SP
|
|
|
|
|
--- BP-to-be
|
|
|
|
Now we're ready to evaluate and push arguments. Note that M2-Planet
|
|
doesn't follow AAPCS64. The evaluation might involve function calls
|
|
itself and arbitrary use of the stack, but everything will be like
|
|
this after all.
|
|
|
|
tmp | oLR | oBP | arg1 | arg2 | ... | argN |
|
|
^ ^
|
|
| |
|
|
--- BP-to-be --- SP (omitted from now on)
|
|
|
|
At this point we set the BP from the scratch register and execute
|
|
branch-and-link (BLR) to the function reusing the (now free) X16
|
|
register (also know as IP0). This instruction saves the address of the
|
|
next instruction on X30 (LR, which we saved earlier to allow for
|
|
recursion).
|
|
|
|
tmp | oLR | oBP | arg1 | arg2 | ... | argN |
|
|
^
|
|
|
|
|
--- BP
|
|
|
|
During the called function the locals are pushed on the stack as usual
|
|
in M2-Planet.
|
|
|
|
tmp | oLR | oBP | arg1 | arg2 | ... | argN | loc1 | loc2 | ... | locN |
|
|
^
|
|
|
|
|
--- BP
|
|
|
|
When the function is about to return, we remove the locals from the
|
|
stack and execute the return proper, jumping to the address in LR
|
|
thanks to RET. This is handled by return_result().
|
|
|
|
tmp | oLR | oBP | arg1 | arg2 | ... | argN |
|
|
^
|
|
|
|
|
--- BP
|
|
|
|
Back in function_call() we remove the args from the stack.
|
|
|
|
tmp | oLR | oBP |
|
|
^
|
|
|
|
|
--- BP
|
|
|
|
Finally, we restore the saved registers (so X16, LR and BP contain
|
|
tmp, oLR and oBP again) leaving everything as it was before this
|
|
journey. Well... one important thing changed: following M2-Planet
|
|
conventions the value returned from the function, if any, is on X0.
|
|
|
|
*** Stack pointer
|
|
|
|
Due to alignment (128 bits) restriction for "push" and "pop" based on
|
|
the architectural register, we initialize and use X18 as stack pointer
|
|
instead.
|
|
|
|
The M1 definitions referring to SP use X18; stack operations too.
|
|
|
|
For example:
|
|
|
|
DEFINE LDR_X0_[SP] 400240f9 is ldr x0, [x18]
|
|
DEFINE PUSH_LR 5e8e1ff8 is str x30, [x18, #-8]!
|
|
DEFINE INIT_SP f2030091 is mov x18, sp
|