shasm 80386 specifics

The initial target of shasm is the Intel family of processors, primarily the 80386. The assembled instructions for these machines have a complex format that can involve: various instruction prefixes; one or two instruction bytes; zero, one, or two argument qualifier bytes; and zero, one or two register, literal number or memory reference arguments. There can be two register arguments to one instruction, but one x86 instruction can only have one memory reference. A single memory reference can consist of up to two registers and two literal values.

Every shasm x86 routine name is the assembler of that instruction, the interactive help prompt for that instruction or directive, and the assembly lister of that instruction. If you have sourced shasm into your shell state you can do L help or copy help from the shell prompt for examples.

to/from syntax

shasm has two oper argument separators: to and from. This means the source and destination operands can be given in either source/dest or dest/source order for any oper that cares.

asmacs

shasm uses a non-cryptic set of instruction names I call asmacs. I find it helpful. The Intel names are in the help prompts. As far as I am concerned, the usual style of opcode naming is a strange habitual anachronism. Yes, there is something frustrating about typing a dozen or so characters for one opcode, but as far as I can see it's typing well spent. The typing of machine code is not its primary difficulty. The stuff is plenty cryptic without names like "LMSW". When you see a pig like loadmachinestatusword, ask yourself how often you plan to use that instruction.

I'm interested in the OS-related functionality of the 80386. Early shasms by me will address that, but won't address later architectures, floating point coprocessors, and what I consider to be legacy functionality like the x86 ASCII instructions. The important thing, and the very problematic thing, is the ornate 386 instruction format. It goes something like this...


	[] means optional.			not to scale
   


|<----- prefixes ---->|<- operator ->|<------ mode --------->|dis][im]

[rep/lp][oas][oos][seg]oper [ oper2 ][ modeR/M   [    SIB   ]][dis][im]
                                     [..|...|... [..|...|...]]
                                     [m |r  |R/M [s |i i|b b]]
                                     [o |e o|    [h |n n|a a]]
                                     [d |gop|    [i |d d|s s]]
                                     [e |irc|    [f |e e|e e]]
                                     [  |s o|    [t |x x|   ]]
                                     [  |t d|    [  |   |   ]]
                                     [  |e e|                ]
                                     [  |r  |                ]
                                     [                       ]
|<---------------------bytes-------------------------------->|  various


dis is an optional displacement of a memory reference, and im is an optional immediate value. Either may be 1, 2 or 4 bytes in size. The only thing that's not optional is the first operator byte, "oper". Oper is 0x90 for NOP, "no operation", for example. NOP is a one-byte instruction. In order to need all the above fields in one instruction you would have to be doing something very bizarre, like deliberately trying to use all the above fields in one instruction just for amusement. Real world instructions vary from one to 6 or so bytes usually. Average in protected mode is probably 3 or so. The problem is that a very general-purpose instruction like copy (Intel MOV) might have cases that together use all the above fields, and shasm has to figure out which possibility is intended.
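The mode and SIB bytes in the diagram are each three bit fields of 2, 3 and 3 bits. Here is a minimal sketch of that packing in plain shell arithmetic. The names modrm_byte and sib_byte are hypothetical, made up for illustration. They are not shasm routines.

```shell
# Hypothetical helpers, not part of shasm: pack the bit fields from the
# diagram into one byte each and print it as two hex digits.

# modrm_byte MODE REGOP RM     (2 bits, 3 bits, 3 bits)
modrm_byte() (
    printf '%02X\n' $(( (($1 & 3) << 6) | (($2 & 7) << 3) | ($3 & 7) ))
)

# sib_byte SHIFT INDEX BASE    (2 bits, 3 bits, 3 bits)
sib_byte() (
    printf '%02X\n' $(( (($1 & 3) << 6) | (($2 & 7) << 3) | ($3 & 7) ))
)

modrm_byte 0 0 7    # mode 0, register/opcode field 0, R/M 7
sib_byte 2 6 3      # shift 2, index 6, base 3
```

The two calls print 07 and B3. The layout is the same for both bytes; only the meaning of the fields differs.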

Prefixes are simple. shasm cheats and handles them as separate one-byte opcodes. In a few cases they leave some switches set to control what follows, but usually they just happen independently. Syntactically, they are separate opcodes, and must be on separate logical shell lines. That's the cost of the cheat. This is the case for the segment register prefixes CS, SS, DS, ES, FS, and GS, the default cellsize switches otheroperandsize and otheraddresssize, for lock, and for the loop prefixes repeating and friends. If you want them on the same line with the instruction they control, use semicolon.
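A rough sketch of the idea, assuming nothing about how shasm really implements it: each prefix is a shell routine that emits exactly one byte, and the semicolon makes two logical lines out of one physical line. The helper emit_byte and the stand-in some_oper are hypothetical. The byte values are the standard x86 prefix encodings.

```shell
# Sketch only; shasm's real routines are not shown in this document.
# emit_byte is a hypothetical helper that prints one byte as hex.
emit_byte() { printf '%02X\n' "$1"; }

# Each prefix is its own command and emits exactly one byte.
otheroperandsize() { emit_byte 0x66; }
otheraddresssize() { emit_byte 0x67; }
lock()             { emit_byte 0xF0; }
ES()               { emit_byte 0x26; }

# Stand-in for an oper routine; 0x90 is the NOP operator byte.
some_oper() { emit_byte 0x90; }

# Prefix and oper on one physical line, two logical lines thanks to ";".
otheroperandsize ; some_oper
```

The last line prints 66 and then 90, one byte per command.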

The shasm keyword that assembles a particular oper is a named shell routine. Everything after oper is derived from arguments to that routine. This is where we must confront the syntax issues. We have to name registers, differentiate between register contents themselves and the memory locations they might be pointing at, differentiate between addresses and literal numbers, specify the sizes of numbers, and so on. We have to do all this in the shell. One welcome bit of Forth-like simplicity I cling to for this is that shasm parses tokens whole. Shasm tokens are separated by spaces. It doesn't use character-wise prefixes or suffixes. The shasm scanner is the shell, the tokenizer is shell argument handling, and there is no lexer. There is a fair bit of grammar to parse though, thanks to the rich history of the 386. Shell expressions can be whatever the shell allows, but as oper arguments they must look like single tokens to shasm.

registers

The main x86 register names in shasm are A, B, C, D, SP, BP, SI and DI. Names of sizes of things in shasm are in terms of bytes. Register-name sub-register qualifiers are byte, dual and quad. For example, AH in the usual usage can be "hbyte A" in shasm. EAX is "quad A", but can usually be specified as just A. A sub-register spec usually sets the size for the whole instruction. A full-size value at the current assembly size, which on x86 may alternately be 16 or 32 bits, is called a cell. "cell" is in fact the assembly mode global variable, and $cell will be 2 or 4.
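In the machine encoding each of the eight main registers has a fixed 3-bit number. The sketch below maps the shasm names to those numbers. The routine name regnum is hypothetical, and whether shasm keeps such a table internally is an assumption. The numbers themselves are the standard x86 ones.

```shell
# Hypothetical lookup, not a shasm routine: map a shasm register name to
# the 3-bit number the x86 encoding uses for it.
regnum() {
    case $1 in
        A)  echo 0 ;;
        C)  echo 1 ;;
        D)  echo 2 ;;
        B)  echo 3 ;;
        SP) echo 4 ;;
        BP) echo 5 ;;
        SI) echo 6 ;;
        DI) echo 7 ;;
        *)  echo "regnum: unknown register '$1'" >&2; return 1 ;;
    esac
}

# "cell" is the assembly mode variable named in the text; 4 means 32 bits.
cell=4
echo "DI is register $(regnum DI), a cell is $((cell * 8)) bits"
```

Note that the numbering is not alphabetical: A, C, D, B.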

Special registers have fairly typical names like TR6. AH and friends are available as such also.

memory addressing

x86 memory references can consist of a displacement, a base register, an index register, and a scale to upshift the index register by. The math that represents that is...

	displacement + base register value + (index register value << scale)
	 
"displacement" is a signed literal. "scale" can be absent, 1, 2 or 3, i.e. index multiplied by 2, 4 or 8. That can all be done quickly to generate one effective address on 386+. That's just the address; that's not taking into account the segment prefix and the size of the item in memory to be acted upon. The variations and sub-cases of this format are the addressing modes of the x86. Complex addressing modes are assembled into the modR/M and SIB bytes. An instruction can only have one mode byte, and thus only one memory reference.
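The address math can be tried directly in shell arithmetic. This is only a sketch of what the processor computes at runtime, and effective_address is a hypothetical name, not a shasm routine. An absent scale is treated as 0.

```shell
# Hypothetical helper: displacement + base + (index << scale).
# effective_address DISPLACEMENT BASE_VALUE INDEX_VALUE [SCALE]
effective_address() (
    scale=${4:-0}                      # absent scale means no shift
    case $scale in
        0|1|2|3) ;;
        *) echo "scale must be 0, 1, 2 or 3" >&2; exit 1 ;;
    esac
    echo $(( $1 + $2 + ($3 << $scale) ))
)

effective_address 16 4096 5 2     # 16 + 4096 + 5*4
effective_address -8 4096 3 3     # -8 + 4096 + 3*8
```

These print 4132 and 4112. The parentheses matter in shell arithmetic, where << binds more loosely than +.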

An assembler is supposed to allow you to not think about obscura like the mode and SIB bytes. What you do have to think about is what your arguments to a particular machine instruction are. By this I mean the actual objects the instruction acts on at runtime, not syntactical arguments to the instruction's shasm name. On x86 it's useful to think of machine instruction arguments as source and destination. If you have two arguments you need to separate them, and indicate which is source and which is dest. That's what to/from does.
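A sketch of how an oper might split its arguments at the separator. This is not shasm's own parser. The name split_operands is hypothetical, and it assumes that "X to Y" means source X and dest Y, and that "Y from X" means the same thing the other way around.

```shell
# Hypothetical sketch, not shasm's own parser: split an oper's arguments
# at "to" or "from" and report which side is source and which is dest.
split_operands() (
    left= right= sep=
    for tok in "$@"; do
        case $tok in
            to|from)
                if [ -n "$sep" ]; then
                    echo "more than one separator" >&2; exit 1
                fi
                sep=$tok ;;
            *)
                if [ -z "$sep" ]; then
                    left="${left:+$left }$tok"
                else
                    right="${right:+$right }$tok"
                fi ;;
        esac
    done
    case $sep in
        to)   echo "source=[$left] dest=[$right]" ;;
        from) echo "source=[$right] dest=[$left]" ;;
        *)    echo "no to/from separator" >&2; exit 1 ;;
    esac
)

split_operands A to @ DI
split_operands @ DI from A
```

Both calls print source=[A] dest=[@ DI]. The tokens arrive already split by the shell, so the routine never looks inside a token.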

From there the oper can further break down the instruction. Memory references can be seen as such if they have at least one of @, + or *2^ ("...times two to the...") in shasm token form, i.e. space-separated. An operand containing two registers is also recognized as a memory reference operand. + and @ can be anywhere on the same side of the to/from as the memory reference operand. *2^ is more syntactically specific, and is the only shasm token that asserts a positional grammatical relationship within one side of the "to" or "from". The token immediately preceding *2^ must be the index register that will be shifted, and the token immediately following the *2^ must be or evaluate to 0, 1, 2 or 3.
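The recognition rules above can be sketched as a small token scan over one side of the to/from. This is an illustration, not shasm's code, and classify_operand is a hypothetical name. It only knows the eight main register names, and it refuses SP as an index because the x86 encoding has no such form.

```shell
# Hypothetical sketch of the recognition rules; not shasm's own code.
# Prints "register/literal" or "memref", plus index and scale if the
# *2^ token is present.
classify_operand() (
    memref=no regs=0 prev= index= scale= want_scale=no
    for tok in "$@"; do
        if [ "$want_scale" = yes ]; then
            # The token right after *2^ must be the scale.
            case $tok in
                0|1|2|3) scale=$tok ;;
                *) echo "token after *2^ must be 0, 1, 2 or 3" >&2; exit 1 ;;
            esac
            want_scale=no
        fi
        case $tok in
            @|+) memref=yes ;;
            '*2^')
                # The token right before *2^ must be the index register.
                case $prev in
                    A|B|C|D|BP|SI|DI) index=$prev ;;
                    *) echo "token before *2^ must be an index register" >&2
                       exit 1 ;;
                esac
                memref=yes want_scale=yes ;;
            A|B|C|D|SP|BP|SI|DI) regs=$(( $regs + 1 )) ;;
        esac
        prev=$tok
    done
    if [ "$want_scale" = yes ]; then
        echo "missing scale after *2^" >&2; exit 1
    fi
    # Two registers on one side also mean a memory reference.
    if [ "$regs" -ge 2 ]; then memref=yes; fi
    if [ "$memref" = yes ]; then
        echo "memref${index:+ index=$index scale=$scale}"
    else
        echo "register/literal"
    fi
)

classify_operand A
classify_operand @ DI
classify_operand 8 + BP SI '*2^' 2   # quoted so the shell cannot glob it
```

The three calls print register/literal, memref, and memref index=SI scale=2.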

The following increments a memory cell pointed at by the contents of DI, not the contents of DI itself.


		increment @ DI
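
For orientation, here is the same instruction encoded by hand in plain shell arithmetic, with no shasm involved. It assumes 32-bit operand and address size. Intel's INC on a memory cell is operator byte FF with 0 in the register/opcode field, and mode 0 with R/M 7 selects the memory cell DI points at. Whether shasm emits exactly these bytes is an assumption.

```shell
# Worked encoding by hand; the variable names are for illustration only.
oper=0xFF                 # operator byte shared by INC on register/memory
mode=0 regop=0 rm=7       # mode 0 = no displacement, R/M 7 = DI
modrm=$(( ($mode << 6) | ($regop << 3) | $rm ))
printf '%02X %02X\n' "$oper" "$modrm"
```

This prints FF 07, a two-byte instruction with no prefixes, no SIB byte, no displacement and no immediate.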

All this memref construction is address arithmetic, which is always at the prevailing oper address size. The default operand size is assumed if an oper doesn't specify a sub-register argument. If the byte keyword is seen, the instruction is byte size. Usually if you want a 2 byte action you'll use the otheroperandsize prefix. "byte" is an oper argument. "bytes" is a shasm directive. There is no name conflict here because of the "s", but even if they were the same string, the shell would interpret them as appropriate by context, since command arguments may be any string, and are handled as strings. This means that for example the segment registers can have the same names as prefix operators and as oper register name arguments, e.g. "DS". If DS is alone on a shell logical line, it's a prefix. If it's an argument to an oper, it's the register as an operand of the instruction, not a segment spec.
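
The context rule is ordinary shell behaviour and can be seen with two stand-in routines. Both are hypothetical, not shasm code. 3E is the standard x86 DS segment prefix byte.

```shell
# Hypothetical stand-ins, not shasm code.
DS() { echo "prefix byte 3E"; }               # DS as a command: a prefix
show_arg() { echo "operand register $1"; }    # DS as an argument: a string

DS             # first word of a logical line, so the routine runs
show_arg DS    # an argument, so the DS routine is never called
```

The first line prints prefix byte 3E and the second prints operand register DS.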
