The Four Stages of Compilation (Toolchain)
Hello Everyone,
Before running any C code, we save our file with a .c extension, but we rarely think about how the compiler, or more precisely a toolchain such as gcc, handles it. This post explains the basics of how the toolchain converts our source code into an executable (the file we actually run).
Understanding this helps in many ways:
1. Finding errors in source code.
2. Understanding the toolchain.
3. Understanding the internals of the compiler and the other binutils, and how they depend on each other.
We will start with ELF, the format of executable files. First of all, let us understand what ELF is and what it is all about.
In this article, we'll be doing the following.
- Exploring ELF in a superficial manner.
- Phases of compilation (how the toolchain builds an executable).
0. ELF?
ELF stands for Executable and Linkable Format. It is the file format used for executable files, libraries, and object files on UNIX-like systems.
ELF is more interesting, and more confusing, than you might think. An ELF file is essentially a collection of data structures that are tightly coupled, much like a piece of knitting: many structures are involved, and they are interrelated and interlinked so closely that they are hard to tell apart. Here we take a brief look at ELF, and then we move on to the different phases of the toolchain.
Let us see what the Linux man page says about ELF:
ELF(5) Linux Programmer's Manual ELF(5)
NAME
elf - format of Executable and Linking Format (ELF) files
SYNOPSIS
#include <elf.h>
DESCRIPTION
The header file <elf.h> defines the format of ELF executable binary
files. Amongst these files are normal executable files, relocatable
object files, core files, and shared objects.
Now we have a fair idea of what ELF is about. Let us look at the different types of ELF files, using a simple program, codingforall.c, as the running example.
$ cat codingforall.c
#include <stdio.h>
int main()
{
printf("Coding for all\n");
return 0;
}
and build it in the following manner.
$ gcc codingforall.c -o codingforall --save-temps
$ ls
codingforall codingforall.c codingforall.i codingforall.o codingforall.s
1. Executable file: An executable file is a file that the operating system can load and run. It is generated by linking one or more object files. A dynamically linked executable also relies on the dynamic linker to access functions in shared objects. In our case, the executable is codingforall.
$ ./codingforall
Coding for all
2. Shared Object file: We all know that our code needs libraries in order to run. Many of these libraries are provided in the form of shared objects, also known as shared libraries. These are the ones our codingforall program uses:
$ ldd codingforall
linux-vdso.so.1 => (0x00007ffc320ac000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f975fa8e000)
/lib64/ld-linux-x86-64.so.2 (0x00007f975fe58000)
Most people are familiar with libc (the C library). This C library contains the definition of the printf function.
You may be wondering why libraries are called shared objects. They are called shared objects because they can be shared among multiple processes. Our example program codingforall.c uses libc, and other programs use libc functions too. The OS keeps one copy of the library's code in main memory, and every program that needs libc uses that copy. Since this single copy is shared among all the programs, such libraries are known as shared objects.
3. Object file: An object file is the direct machine-code equivalent of a C source file, with just a little metadata (which is part of ELF) to keep the code organized. It is the actual output of the compile and assemble steps. The codingforall.o file is an object file.
- It is not yet part of an executable or a shared object. The linker and the programmer will decide its fate.
- It is an intermediate file that the linker takes as input; it has to go through a separate linking step to become an executable.
- It will not run on its own, because it has not been linked yet. Try running it and see what you get.
$ ./codingforall.o
bash: ./codingforall.o: cannot execute binary file: Exec format error
4. Core file: If you are reading this article, you have probably run into the error known as a segmentation fault. You may never have looked at a core file, but you have likely seen the message "Segmentation fault (core dumped)". Have you ever wondered why this error occurs, and what "core dumped" means?
A segmentation fault mainly occurs when a program tries to access a memory location it is not allowed to access.
Whenever your program crashes, the immediate task is to understand why. The core file helps with that: it contains a snapshot of your program's memory at the moment it failed. From it you can find out in which function, and at what address, the program crashed.
Although we usually get the "core dumped" message, the core may not actually be dumped. On many systems, core dumping is suppressed by default and has to be enabled if you want to see what a core file looks like. Take it as an exercise to enable core dumping.
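One common reason no core file appears is that the process's core-file size limit is zero. The following is a minimal sketch, assuming that limit is what suppresses the dump on your system; the function name enable_core_dumps is made up for this illustration. It raises the soft limit to the hard limit using the standard getrlimit and setrlimit calls.

```c
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit for this process.
 * Returns 0 on success and -1 on failure (errno is set by the failing call).
 * enable_core_dumps is a hypothetical helper name, not a library function. */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;

    lim.rlim_cur = lim.rlim_max;   /* the soft limit may go up to the hard limit */
    return setrlimit(RLIMIT_CORE, &lim);
}
```

Calling this at the start of main affects only the current process and its children. From a shell, the usual equivalent is ulimit -c unlimited. Whether and where a core file is then written also depends on system settings such as /proc/sys/kernel/core_pattern.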
These 4 files are the most used ELF files. We’ll be talking about them in detail in later posts.
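The four kinds of file can be told apart by a single field in the ELF header. Here is a small sketch of that check. It parses the bytes by hand so that it does not depend on <elf.h>, and the function name elf_kind is made up for this illustration; the offsets and numeric values are the ones fixed by the ELF format (<elf.h> exposes them as EI_DATA, ET_REL, ET_EXEC, ET_DYN and ET_CORE).

```c
#include <stddef.h>
#include <string.h>

/* Position of the byte-order flag inside e_ident, and the size of e_ident. */
enum { IDENT_DATA = 5, IDENT_SIZE = 16 };

/* Return a short description of the file type stored in an ELF header,
 * or NULL if the buffer does not start with a valid ELF identification.
 * elf_kind is a hypothetical helper name used only for this sketch. */
const char *elf_kind(const unsigned char *hdr, size_t len)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };
    unsigned type;

    if (len < IDENT_SIZE + 2 || memcmp(hdr, magic, sizeof magic) != 0)
        return NULL;

    /* e_type is the 16-bit field right after e_ident; its byte order
     * is given by the EI_DATA byte. */
    if (hdr[IDENT_DATA] == 1)            /* little-endian */
        type = hdr[16] | (hdr[17] << 8);
    else if (hdr[IDENT_DATA] == 2)       /* big-endian */
        type = (hdr[16] << 8) | hdr[17];
    else
        return NULL;

    switch (type) {
    case 1:  return "relocatable object file";
    case 2:  return "executable file";
    case 3:  return "shared object";
    case 4:  return "core file";
    default: return "other";
    }
}
```

To use it, read the first 18 bytes of a file with fread and pass them in. Note that position-independent executables are also recorded as type 3, so an executable built by a recent gcc may be reported as a shared object.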
With that, we know which kinds of files are ELF files. Let us now look at the phases of the toolchain.
1. Phases of Toolchain (compilation)
Building an executable from C source code is a big process. For an overview, we can split the generation of an executable from C source code into four phases.
- Preprocessing
- Compilation
- Assembly
- Linking
Preprocessing
                      ----------------
codingforall.c -----> | Preprocessor | -----> codingforall.i (Preprocessed code)
                      ----------------
Compiling
                      ----------------
codingforall.i -----> |   Compiler   | -----> codingforall.s (Assembly code)
                      ----------------
Assembling
                      ----------------
codingforall.s -----> |  Assembler   | -----> codingforall.o (Object code)
                      ----------------
Linking
                            ----------------
codingforall.o + Lib -----> |    Linker    | -----> codingforall (Executable)
                            ----------------
We build the same codingforall.c as before, again with --save-temps:
$ gcc codingforall.c -o codingforall --save-temps
$ ls
codingforall codingforall.c codingforall.i codingforall.o codingforall.s
Note: Normally, the output files of the preprocessor, compiler, and assembler are temporary files (typically under /tmp) that are deleted as soon as the executable is generated. With the -save-temps option, those intermediate files are kept, which helps our analysis. There are 4 sub-processes, so 4 files are generated: codingforall.i, codingforall.s, codingforall.o, and codingforall, which is the final executable.
Preprocessing (extension .i)
Preprocessing is the first stage of compilation. In this stage, every line starting with the # symbol (header inclusions, macro definitions, and so on) is processed. These lines are preprocessor directives, which form a simple macro language of their own. This language is used to reduce repetition in source code by providing functionality to inline files, define macros, and conditionally omit code.
Before interpreting commands, the preprocessor does some initial processing. This includes joining continued lines (lines ending with a \) and stripping comments.
To get the result of the preprocessing stage, pass the -E option to your toolchain (here, gcc):
gcc -E codingforall.c
For the "Coding for all" example above, the preprocessor produces the contents of stdio.h (and of any other header files the developer has included in the C source), joined with the contents of the codingforall.c file, with comments stripped:
[lines omitted by content writer]
extern int __vsnprintf_chk (char * restrict, size_t,
int, size_t, const char * restrict, va_list);
# 493 "/usr/include/stdio.h" 2 3 4
# 2 "codingforall.c" 2
int
main(void) {
puts("Coding for all");
return 0;
}
The C preprocessor (cpp) does the preprocessing.
It generates a sourcefilename.i file, where i stands for intermediate. This file is the output of the preprocessing phase.
a. Preprocessing expands every #include in the C source file (in our source code, #include <stdio.h>). Expanding a header file means copying its contents into the place where the #include line was. An expanded header typically contributes:
1. Declarations of the functions related to that header (e.g. stdio.h has the declarations of the standard input and output functions).
2. Definitions of various macros.
3. #include lines for other related header files.
4. A bunch of typedefs for different datatypes.
b. Preprocessing replaces macros with their actual values. For example, with #define NUMBER 100, wherever the macro NUMBER is used in the C source file, it is replaced by its value, 100.
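To make macro replacement concrete, here is a small sketch. NUMBER, STR, XSTR, SUM3 and the function names are all made up for this illustration and are not part of codingforall.c. The comments show what the compiler proper sees once the preprocessor has finished.

```c
/* NUMBER is a hypothetical macro used only for this illustration. */
#define NUMBER 100

/* Two-level stringification: XSTR(NUMBER) expands NUMBER first, then quotes it. */
#define STR(x)  #x
#define XSTR(x) STR(x)

/* A macro definition continued over several lines with a trailing backslash;
 * the preprocessor joins the lines before it expands anything. */
#define SUM3(a, b, c) \
    ((a) + (b) +      \
     (c))

int number_value(void)
{
    return NUMBER;              /* after preprocessing: return 100; */
}

const char *number_text(void)
{
    return XSTR(NUMBER);        /* after preprocessing: return "100"; */
}

int sum_demo(void)
{
#ifdef NOT_DEFINED_ANYWHERE
    return -1;                  /* omitted: the condition is false */
#else
    return SUM3(1, 2, NUMBER);  /* after preprocessing: ((1) + (2) + (100)) */
#endif
}
```

Running gcc -E on a file containing these definitions shows the expanded text directly, with the #define lines and the untaken #ifdef branch gone.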
Compilation (extension .s)
cat codingforall.s
.file "codingforall.c"
.globl a
.data
.align 4
.type a, @object
.size a, 4
a:
.long 10
.comm b,4,4
.section .rodata
.LC0:
.string "Coding for all!"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $123, -8(%rbp)
movl $100, -4(%rbp)
movb $120, -9(%rbp)
movl $.LC0, %edi
call puts
movl $0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609"
.section .note.GNU-stack,"",@progbits
The compiler converts source code to a .s file in several stages. We will not go into much detail here; that is for a later post. For now, let us just see what a compiler does. (The listing above comes from a slightly larger variant of the example, one that also has a global variable a set to 10, an uninitialized global b, and three local variables; they show up again in the dumps that follow.)
a. The compiler converts our preprocessed C/C++ program to assembly language.
b. The compiler performs all the required optimizations.
c. It does all the syntax and semantic checking of the code.
Assembly (extension .o)
Let us use objdump. objdump stands for object dump: it gives a dump of the object file specified. This is how you use it:
$ objdump -D codingforall.o > codingforall.dump
The dump consists of many sections, but let us discuss the 5 important ones: .text, .data, .rodata, .comment and .eh_frame. We have dumped the disassembly of each section present in the codingforall.o object file. Disassembly simply means converting object code back to assembly code. Let us focus on the first 3 sections: .text, .data and .rodata.
a. .text section: This section contains the machine code of all the functions written in the C source file. In our example, main is the only function. Take a look at this.
codingforall.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: 48 83 ec 10 sub rsp,0x10
8: c7 45 f8 7b 00 00 00 mov DWORD PTR [rbp-0x8],0x7b
f: c7 45 fc 64 00 00 00 mov DWORD PTR [rbp-0x4],0x64
16: c6 45 f7 78 mov BYTE PTR [rbp-0x9],0x78
1a: bf 00 00 00 00 mov edi,0x0
1f: e8 00 00 00 00 call 24 <main+0x24>
24: b8 00 00 00 00 mov eax,0x0
29: c9 leave
2a: c3 ret
Note on objdump output: The rightmost column (push rbp, mov rbp,rsp etc.) holds the assembly instructions. The middle column holds the machine-code bytes of those instructions, in hexadecimal. The leftmost column is the offset of each instruction from the start of the section; you can think of it as a serial number for now.
Notice that the names and datatypes of local variables are removed during compilation. Instead of names and datatypes, the compiler reserves 4 bytes of stack space for each integer and 1 byte for each character variable. E.g. 0x7b = 123 in decimal, and it is stored at address rbp - 0x08 (do not worry about what rbp is; that will be explained in detail in the next post). So, whenever the C program (code1.c) refers to variable c, at assembly level it is referred to as rbp-0x08. This is a rough example; clear details will follow in the next post.
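The link between the hex bytes in the middle column and the operands on the right can be checked by hand. The sketch below decodes the two operand fields of the bytes c7 45 f8 7b 00 00 00 from the dump; the helper names imm32_le and disp8 are made up for this illustration.

```c
#include <stdint.h>

/* Read a 32-bit little-endian immediate, the way x86-64 stores it in an
 * instruction. imm32_le is a hypothetical helper name. */
int32_t imm32_le(const unsigned char *p)
{
    uint32_t v = (uint32_t)p[0]
               | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16)
               | ((uint32_t)p[3] << 24);
    return (int32_t)v;
}

/* Interpret one byte as a signed 8-bit displacement (two's complement). */
int disp8(unsigned char b)
{
    return b < 0x80 ? b : (int)b - 0x100;
}
```

For the instruction at offset 8, the third byte f8 is the displacement and decodes to -8, which is the rbp-0x8 in the listing. The last four bytes 7b 00 00 00 are the immediate and decode to 123, the value being stored.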
b. .data section: This section holds initialized global and static variables. Only the .text section contains machine code, so only it is worth disassembling. However, the -D option tells objdump to disassemble every section, which is why it also tries to disassemble .data. The instructions it prints there are meaningless and you don't have to worry about them: the bytes 0a 00 are simply the start of the value 10 stored in a.
Disassembly of section .data:
0000000000000000 <a>:
0: 0a 00 or al,BYTE PTR [rax]
c. .rodata section: This section holds all read-only (ro) data. In our example, the Hello world!! string is the only read-only item in the file.
Disassembly of section .rodata:
0000000000000000 <.rodata>:
0: 48 rex.W
1: 65 6c gs ins BYTE PTR es:[rdi],dx
3: 6c ins BYTE PTR es:[rdi],dx
4: 6f outs dx,DWORD PTR ds:[rsi]
5: 20 77 6f and BYTE PTR [rdi+0x6f],dh
8: 72 6c jb 76 <main+0x76>
a: 64 21 21 and DWORD PTR fs:[rcx],esp
If you look closely, 0x48 is the ASCII code for H, 0x65 for e, 0x6c for l and so on. You can use the ascii command-line tool for reference. If it is not installed, you can install it this way.
$ sudo apt-get install ascii
$ ascii
Dec Hex    Dec Hex    Dec Hex  Dec Hex  Dec Hex  Dec Hex   Dec Hex   Dec Hex
  0 00 NUL  16 10 DLE  32 20    48 30 0  64 40 @  80 50 P   96 60 `  112 70 p
  1 01 SOH  17 11 DC1  33 21 !  49 31 1  65 41 A  81 51 Q   97 61 a  113 71 q
  2 02 STX  18 12 DC2  34 22 "  50 32 2  66 42 B  82 52 R   98 62 b  114 72 r
  3 03 ETX  19 13 DC3  35 23 #  51 33 3  67 43 C  83 53 S   99 63 c  115 73 s
  4 04 EOT  20 14 DC4  36 24 $  52 34 4  68 44 D  84 54 T  100 64 d  116 74 t
  5 05 ENQ  21 15 NAK  37 25 %  53 35 5  69 45 E  85 55 U  101 65 e  117 75 u
  6 06 ACK  22 16 SYN  38 26 &  54 36 6  70 46 F  86 56 V  102 66 f  118 76 v
  7 07 BEL  23 17 ETB  39 27 '  55 37 7  71 47 G  87 57 W  103 67 g  119 77 w
  8 08 BS   24 18 CAN  40 28 (  56 38 8  72 48 H  88 58 X  104 68 h  120 78 x
  9 09 HT   25 19 EM   41 29 )  57 39 9  73 49 I  89 59 Y  105 69 i  121 79 y
 10 0A LF   26 1A SUB  42 2A *  58 3A :  74 4A J  90 5A Z  106 6A j  122 7A z
 11 0B VT   27 1B ESC  43 2B +  59 3B ;  75 4B K  91 5B [  107 6B k  123 7B {
 12 0C FF   28 1C FS   44 2C ,  60 3C <  76 4C L  92 5C \  108 6C l  124 7C |
 13 0D CR   29 1D GS   45 2D -  61 3D =  77 4D M  93 5D ]  109 6D m  125 7D }
 14 0E SO   30 1E RS   46 2E .  62 3E >  78 4E N  94 5E ^  110 6E n  126 7E ~
 15 0F SI   31 1F US   47 2F /  63 3F ?  79 4F O  95 5F _  111 6F o  127 7F DEL
NOTE:
Every instruction and section should have an address, right? But here all sections start at address zero. How can 2 sections be at the same address?
Observe the .data section. There is no mention of int b, the uninitialized global variable. But if we have used it, it should be somewhere, right?
The data present in the .rodata section cannot be executed by the processor. It is read-only, non-executable, non-writable data. objdump simply converted the data in the .rodata section to its assembly equivalent, but that makes no sense because the whole section is non-executable.
Observe the .text section. There is no mention of the printf we used in code1.c. But note that there is a call instruction (the 1f: line).
There are more, but these are the important ones.
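The point about .rodata holding plain data rather than instructions can be checked by decoding its bytes as characters instead of disassembling them. This is a minimal sketch; the function name bytes_to_text is made up for this illustration.

```c
#include <stddef.h>

/* Copy raw bytes into a NUL-terminated string, replacing anything that is
 * not printable ASCII with '.'. out must have room for n + 1 characters.
 * bytes_to_text is a hypothetical helper name. */
void bytes_to_text(const unsigned char *bytes, size_t n, char *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (bytes[i] >= 0x20 && bytes[i] < 0x7f) ? (char)bytes[i] : '.';
    out[n] = '\0';
}
```

Feeding it the 13 bytes from the .rodata dump (48 65 6c 6c 6f 20 77 6f 72 6c 64 21 21) yields the text Hello world!!, which is exactly the string literal from the program.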
To resolve a few of the issues mentioned above, let us use another tool called readelf to analyze code1.o.
- ELF: stands for Executable and Linkable Format. For now, it is enough to know that a compiled program we want to execute on a Linux machine must be in this format. Similar to ELF, Windows has its own executable format, known as the PE (Portable Executable) file format.
a. The object file (here code1.o) contains a table known as the Symbol Table. Take a look at this symbol table.
~/rev_eng_series/post_1$ readelf -s code1.o
Symbol table '.symtab' contains 13 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS code1.c
2: 0000000000000000 0 SECTION LOCAL DEFAULT 1
3: 0000000000000000 0 SECTION LOCAL DEFAULT 3
4: 0000000000000000 0 SECTION LOCAL DEFAULT 4
5: 0000000000000000 0 SECTION LOCAL DEFAULT 5
6: 0000000000000000 0 SECTION LOCAL DEFAULT 7
7: 0000000000000000 0 SECTION LOCAL DEFAULT 8
8: 0000000000000000 0 SECTION LOCAL DEFAULT 6
9: 0000000000000000 4 OBJECT GLOBAL DEFAULT 3 a
10: 0000000000000004 4 OBJECT GLOBAL DEFAULT COM b
11: 0000000000000000 43 FUNC GLOBAL DEFAULT 1 main
12: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND puts
~/rev_eng_series/post_1$
Focus on symbol numbers 9, 10, 11 and 12. Their names are a, b, main and puts respectively.
a is a global object of size 4 bytes.
b is a global object of size 4 bytes. COM stands for COMMON symbol.
main is a global function of size 43 bytes.
puts is a global symbol, but its type is not known (NOTYPE) and it is undefined in this file (UND). So, at this stage, the assembler does not know what puts is (though we know it). NOTE: When there are no format specifiers in the printf() call, some compilers replace printf() with puts(). That is why there is a puts() here instead of printf().
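Inside the object file, the Type and Bind columns that readelf prints are packed into one byte per symbol, called st_info: the high four bits hold the binding and the low four bits hold the type. The sketch below unpacks that byte. The function names are made up for this illustration (<elf.h> offers the same operations as the ELF64_ST_BIND and ELF64_ST_TYPE macros), and the st_info values mentioned afterwards are what the encoding implies for those rows, not values copied from the dump.

```c
/* Split the st_info byte of a symbol-table entry into binding and type.
 * sym_bind and sym_type are hypothetical helper names. */
unsigned sym_bind(unsigned char st_info) { return st_info >> 4; }
unsigned sym_type(unsigned char st_info) { return st_info & 0x0f; }

/* Names as printed by readelf in the Bind column. */
const char *bind_name(unsigned bind)
{
    switch (bind) {
    case 0:  return "LOCAL";
    case 1:  return "GLOBAL";
    case 2:  return "WEAK";
    default: return "?";
    }
}

/* Names as printed by readelf in the Type column. */
const char *type_name(unsigned type)
{
    switch (type) {
    case 0:  return "NOTYPE";
    case 1:  return "OBJECT";
    case 2:  return "FUNC";
    case 3:  return "SECTION";
    case 4:  return "FILE";
    default: return "?";
    }
}
```

Under this encoding, a GLOBAL FUNC symbol such as main would carry st_info 0x12, and a GLOBAL NOTYPE symbol such as puts would carry 0x10.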
b. The object file also has a section called the Relocation Section. Have a look at this:
~/rev_eng_series/post_1$ readelf -r code1.o
Relocation section '.rela.text' at offset 0x240 contains 2 entries:
Offset Info Type Sym. Value Sym. Name + Addend
00000000001b 00050000000a R_X86_64_32 0000000000000000 .rodata + 0
000000000020 000c00000002 R_X86_64_PC32 0000000000000000 puts - 4
Relocation section '.rela.eh_frame' at offset 0x270 contains 1 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000000020 000200000002 R_X86_64_PC32 0000000000000000 .text + 0
~/rev_eng_series/post_1$
We will come to the meaning of Relocation in the next sub-process.
- For now, just observe that the .rela.text section refers to 2 symbols: .rodata and puts.
This means code1.o has information about puts() in its Symbol Table and Relocation Section.
We have resolved a few of the issues mentioned in the NOTE, but not all. We still have to see what relocation is and what happens to puts.
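The Info column in the relocation tables above packs two numbers into one 64-bit value: the upper 32 bits are an index into the symbol table, and the lower 32 bits are the relocation type. A short sketch of the split follows; the function names are made up for this illustration, and <elf.h> offers the same operations as ELF64_R_SYM and ELF64_R_TYPE.

```c
#include <stdint.h>

/* Upper 32 bits of r_info: index of the symbol in .symtab. */
uint32_t reloc_sym(uint64_t r_info)
{
    return (uint32_t)(r_info >> 32);
}

/* Lower 32 bits of r_info: the relocation type number. */
uint32_t reloc_type(uint64_t r_info)
{
    return (uint32_t)(r_info & 0xffffffffu);
}
```

For the second entry, Info is 000c00000002: the symbol index is 12, which is puts in the symbol table shown earlier, and the type is 2, which readelf prints as R_X86_64_PC32. For the first entry, 00050000000a gives symbol 5 and type 10 (R_X86_64_32).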
Linking (gives executable code)
Linking is done by a system program known as the linker. The linker takes one or more object files, together with shared libraries (like libc), as input. If linking the object code with the shared libraries succeeds, the linker generates the executable file. Otherwise, it reports a linking error.
An object file has no absolute addresses. Every section starts at address 0, and everything inside a section is numbered relative to that starting address. But this is not possible in an actual executable file: every section must have a definite, absolute address. The linker relocates (or shifts) each section so that every section has a unique starting address. This is the meaning of Relocation.
The linker also links the symbols present in the Relocation Table to their definitions. This is known as Symbol Resolution. E.g.:
The symbol main is linked to .text + 0x00 because that is where the body of the main function is defined. Then how, and to what, will it link puts? We just have its symbol in the Relocation Table, but we never explicitly defined it anywhere in our C program.
The linker finds the definition of puts in libc, the standard C library, and links puts to that.
The linker then gives an absolute address to every section in the object file and adds a few more sections, thus making it a complete executable file.
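As a worked example of what the linker writes when it processes a relocation entry, consider the R_X86_64_PC32 entry for puts. For this type, the x86-64 ABI defines the value stored in the 32-bit field as S + A - P, where S is the final address of the symbol, A is the addend from the relocation entry, and P is the final address of the field being patched. The sketch below computes it; the function name pc32_value is made up for this illustration, and the addresses used in the note afterwards are hypothetical, chosen only to make the arithmetic easy to follow.

```c
#include <stdint.h>

/* Value written for an R_X86_64_PC32 relocation: S + A - P.
 *   S = final address of the symbol (for example puts, or its stub)
 *   A = addend from the relocation entry (-4 in the readelf output)
 *   P = final address of the 4-byte field inside the call instruction
 * pc32_value is a hypothetical helper name. */
int32_t pc32_value(uint64_t S, int64_t A, uint64_t P)
{
    return (int32_t)(S + (uint64_t)A - P);
}
```

Suppose main ends up at 0x401000, so the field at offset 0x20 sits at 0x401020, and suppose puts is reachable at 0x400400. The linker would then store 0x400400 - 4 - 0x401020 = -0xC24 in the field. At run time the CPU adds this value to the address of the next instruction, 0x401024, and lands on 0x400400. The addend is -4 precisely because the field is 4 bytes long and the CPU measures the jump from the end of the field.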