The Four Stages of Compilation (Toolchain)

Hello Everyone,

Before running any C code, we save our file with a .c extension, but we rarely think about how the compiler, or more precisely a toolchain such as gcc, handles it. This post explains the basics of how the toolchain converts our source code into an executable (the file we actually run).
This concept will help us in many ways:
  1. Finding errors in source code.
  2. Understanding the toolchain.
  3. Understanding the internals of the compiler and the other binutils, how they depend on each other, and much more.

We will start by looking at ELF, the format used for executable files. First of all, let us understand what ELF is and what it is all about.

In this article, we’ll be doing the following.

  1. Exploring ELF in a superficial manner.
  2. Phases of compilation (toolchain building).

Let us get started!

0. ELF?

ELF stands for Executable and Linkable Format. It is the file format used for executable files, libraries, and object files on UNIX-like systems.

ELF is more interesting, and more confusing, than you might think. An ELF file is essentially a set of data structures that are tightly knit together: headers, tables, and sections that refer to one another. We will go over each of these structures in detail in later posts; for now we only need a rough picture of ELF before we start on the phases of the toolchain.

Let us see what the Linux man page says about ELF.

ELF(5)                     Linux Programmer's Manual                    ELF(5)

NAME
       elf - format of Executable and Linking Format (ELF) files

SYNOPSIS
       #include <elf.h>

DESCRIPTION
       The  header  file  <elf.h>  defines the format of ELF executable binary
       files.  Amongst these files are normal  executable  files,  relocatable
       object files, core files, and shared objects.

Now we have a fair idea of what ELF is all about. Let us now look at the different types of ELF files. We will use a simple program, codingforall.c, to understand them.

$ cat codingforall.c
#include <stdio.h>

int main()
{
	printf("Coding for all\n");
	return 0;
}

and build it in the following manner.

$ gcc codingforall.c -o codingforall --save-temps
$ ls
codingforall codingforall.c  codingforall.i  codingforall.o  codingforall.s

1. Executable file: An executable file is a file that the operating system can load and run. It is generated by linking one or more object files. An executable can also use the dynamic linker to access functions from shared objects. In our case, the executable is codingforall.

$ ./codingforall
Coding for all

2. Shared object file: Our code needs libraries in order to run. Most of these libraries are provided in the form of shared objects, which is why they are also known as shared libraries. The ldd command lists the shared objects that our codingforall program uses.

$ ldd codingforall
	linux-vdso.so.1 =>  (0x00007ffc320ac000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f975fa8e000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f975fe58000)
  • Most people are familiar with libc (the C library). This library contains the definition of the printf function.

  • You may be wondering why libraries are called shared objects. They are called shared objects because they can be shared among multiple processes. Our example program codingforall.c uses libc, and other programs use libc functions too. The OS keeps one copy of the libc code in main memory and maps it into every program that uses it. Because this single copy is shared among all those programs, such libraries are known as shared objects.

3. Object file: This is the machine code equivalent of a C source file, together with a little metadata (which is part of ELF) to keep the code organized. It is the actual output of the compiler and assembler, and it cannot be executed on its own because it has not been linked yet. The codingforall.o file is an object file.

  • An object file is an intermediate file that the linker takes as input when creating an executable, so it has to go through a separate linking step.

  • It is not yet part of an executable or a shared object. The linker and the programmer will decide its fate. Try running it and see what you get.

$ ./codingforall.o 
bash: ./codingforall.o: cannot execute binary file: Exec format error

4. Core file: Most readers will have run into the error known as a segmentation fault. You may never have looked at a core file, but you have probably seen the line "Segmentation fault (core dumped)". Have you ever wondered why this error occurs, and what "core dumped" means?

  • Whenever your program crashes, the immediate task is to understand why. The core file is used for this. A core file contains a snapshot of your program's memory at the moment it failed. With a debugger you can find out in which function, and at which address, the program crashed.

  • A segmentation fault mainly occurs when the program tries to access a memory location that it is not authorized to access.

  • Even though we get the "core dumped" message, the core may not actually get dumped. On many systems, core dumping is suppressed by default. It needs to be enabled if you want to see what a core file looks like. Take it as an exercise to enable core dumping.

These four file types are the most commonly used ELF files. We’ll be talking about them in detail in later posts.

With that, we know which files are ELF files. Let us now look at how the toolchain produces them.


1. Phases of Toolchain (compilation)

Building an executable from C source code is a big process. For an overview, we can split the generation of an executable from C source code into four phases.

  1. Preprocessing
  2. Compilation
  3. Assembly
  4. Linking


                      Preprocessing
                  ---------------------
codingforall.c -> |   Preprocessor    | -> codingforall.i (Preprocessed code)
                  ---------------------

                        Compiling
                  ---------------------
codingforall.i -> |     Compiler      | -> codingforall.s (Assembly code)
                  ---------------------

                       Assembling
                  ---------------------
codingforall.s -> |     Assembler     | -> codingforall.o (Object code)
                  ---------------------

                               Linking
                        ---------------------
codingforall.o + Lib -> |      Linker       | -> codingforall (Executable)
                        ---------------------

Let us take our sample code again to understand the phases of the toolchain.

$ cat codingforall.c
#include <stdio.h>

int main()
{
	printf("Coding for all\n");
	return 0;
}

$ gcc codingforall.c -o codingforall --save-temps
$ ls
codingforall codingforall.c  codingforall.i  codingforall.o  codingforall.s

Note: Normally, the output files generated by the preprocessor, compiler, and assembler are stored temporarily in the /tmp directory and deleted as soon as the executable is generated. With the --save-temps option, gcc keeps those intermediate files in the current directory, which helps our analysis. There are four phases, so four files are generated: codingforall.i, codingforall.s, codingforall.o, and codingforall. codingforall is the final executable.


Preprocessing (extension .i)


Preprocessing is the first stage of compilation. In this stage, every line starting with the # symbol (such as #include and #define directives) is processed. These preprocessor directives form a simple macro language of their own, separate from C. This language is used to reduce repetition in source code by providing functionality to inline files, define macros, and conditionally omit code.

Before interpreting commands, the preprocessor does some initial processing. This includes joining continued lines (lines ending with a \) and stripping comments.

To get the result of the preprocessing stage, pass the -E option to your toolchain (here, gcc):

gcc -E codingforall.c

For our "Coding for all" example, the preprocessor produces the contents of stdio.h, and of any other header files that are included in the C source code, joined with the contents of the codingforall.c file, stripped of its comments:

[lines omitted by content writer]

extern int __vsnprintf_chk (char * restrict, size_t,
       int, size_t, const char * restrict, va_list);
# 493 "/usr/include/stdio.h" 2 3 4
# 2 "codingforall.c" 2

int main()
{
 printf("Coding for all\n");
 return 0;
}

  • The C preprocessor (cpp) does the preprocessing.

  • It generates the sourcefilename.i file. The .i file is the intermediate output of the preprocessing phase.

a. Preprocessing expands every #include (in our source code, #include <stdio.h>) found in the C source file. Expanding a header file means copying its contents into our file at the location of the #include. A header file typically contains:

1. Function declarations related to the header file (e.g. stdio.h has the declarations of the standard input and output functions).
2. Macro definitions.
3. #include directives for other related header files.
4. A bunch of typedefs for different data types.

b. Macros (for example, #define NUMBER 100) are replaced with their actual values: wherever the macro NUMBER is used in the C source file, it is replaced by its value.

c. Conditional compilation. Directives such as #if, #ifdef, and #ifndef decide which parts of the code are kept and which are dropped. This is why including the same header several times, for example writing #include <stdio.h> twice, does not produce an error: the header guards its contents so that they are processed only once.

Compilation (extension .s)

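This is the heart of the toolchain. Here the actual translation starts. The preprocessed file is taken as input and translated into an assembly file. The output of this stage is assembly code, which is architecture dependent. The .s file is the assembly equivalent of the .c file. Note that the listing below appears to come from a slightly larger version of the program: besides main, it contains an initialized global a, an uninitialized global b, and three local variables.
$ cat codingforall.s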
    .file   "codingforall.c"
    .globl  a
    .data
    .align 4
    .type   a, @object
    .size   a, 4
a:
    .long   10
    .comm   b,4,4
    .section    .rodata
.LC0:
    .string "Coding for all!"
    .text
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    subq    $16, %rsp
    movl    $123, -8(%rbp)
    movl    $100, -4(%rbp)
    movb    $120, -9(%rbp)
    movl    $.LC0, %edi
    call    puts
    movl    $0, %eax
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609"
    .section    .note.GNU-stack,"",@progbits

The instructions used here, such as movq, subq, and call, vary from architecture to architecture.


  • The compiler converts source code to a .s file in several stages. Let us not get into too much detail now; we will discuss that later. Here we only summarize what a compiler does.

    a. The compiler converts our C/C++ programs to assembly language.
    b. The compiler performs the requested optimizations.
    c. It performs the syntax and semantic analysis of the code.





Assembly (extension .o)


  • In this stage, the assembler translates the assembly code in codingforall.s into machine code and writes it to the object file codingforall.o. Let us use objdump to look inside it.

  • objdump stands for object dump, which means “give the dump of the object file specified”. This is how you use it; the -D option disassembles every section, and -M intel selects the Intel syntax used in the listings below.

  • $ objdump -M intel -D codingforall.o > codingforall.dump
    
    
  • The dump consists of many sections, but let us discuss the five important ones: .text, .data, .rodata, .comment, and .eh_frame. We have dumped the disassembly of each section present in the codingforall.o object file. Disassembly simply means converting object code back to assembly code. Let us focus on the first three sections: .text, .data, and .rodata.

a. .text section: This section consists of the machine code of all the functions we have written in the C source file. In our example, main is the only function. Take a look at this.

    codingforall.o:     file format elf64-x86-64


    Disassembly of section .text:

    0000000000000000 <main>:
       0:   55                      push   rbp
       1:   48 89 e5                mov    rbp,rsp
       4:   48 83 ec 10             sub    rsp,0x10
       8:   c7 45 f8 7b 00 00 00    mov    DWORD PTR [rbp-0x8],0x7b
       f:   c7 45 fc 64 00 00 00    mov    DWORD PTR [rbp-0x4],0x64
      16:   c6 45 f7 78             mov    BYTE PTR [rbp-0x9],0x78
      1a:   bf 00 00 00 00          mov    edi,0x0
      1f:   e8 00 00 00 00          call   24 <main+0x24>
      24:   b8 00 00 00 00          mov    eax,0x0
      29:   c9                      leave  
      2a:   c3                      ret    
  • Note on the objdump output: The rightmost column (push rbp, mov rbp,rsp, etc.) contains the assembly instructions. The middle column contains the machine code bytes of those instructions in hexadecimal. The leftmost column is the offset of each instruction from the start of the section; you can think of it as a serial number for now.

  • Observe that the names and data types of local variables are removed during compilation. Instead of names and data types, the compiler reserves 4 bytes for each integer and 1 byte for each character variable. E.g. 0x7b = 123 in decimal. It is stored at address rbp-0x8 (do not worry about what rbp is; it will be explained in detail in the next post). So, whenever the C program refers to that variable, the assembly refers to rbp-0x8. This is a rough example; the next post will give clear details.

b. .data section: This section consists of initialized global and static variables. Only the .text section contains machine code, so only it is worth disassembling. But because we passed -D (disassemble all sections), objdump also treats the bytes of .data as instructions. That output is meaningless as code, and you do not have to worry about it.

    Disassembly of section .data:

    0000000000000000 <a>:
       0:   0a 00                   or     al,BYTE PTR [rax]

c. .rodata section: This section consists of all read-only (ro) data. In this dump, the string Hello world!! is the only read-only item in the file.

    Disassembly of section .rodata:

    0000000000000000 <.rodata>:
       0:   48                      rex.W
       1:   65 6c                   gs ins BYTE PTR es:[rdi],dx
       3:   6c                      ins    BYTE PTR es:[rdi],dx
       4:   6f                      outs   dx,DWORD PTR ds:[rsi]
       5:   20 77 6f                and    BYTE PTR [rdi+0x6f],dh
       8:   72 6c                   jb     76 <main+0x76>
       a:   64 21 21                and    DWORD PTR fs:[rcx],esp
  • If you look closely, 0x48 is the ASCII code for H, 0x65 for e, 0x6c for l, and so on. You can use the ascii command line tool for reference. If it is not installed, you can install it in this way.

    $ sudo apt-get install ascii
    $ ascii
        
    Dec Hex    Dec Hex    Dec Hex  Dec Hex  Dec Hex  Dec Hex   Dec Hex   Dec Hex  
      0 00 NUL  16 10 DLE  32 20    48 30 0  64 40 @  80 50 P   96 60 `  112 70 p
      1 01 SOH  17 11 DC1  33 21 !  49 31 1  65 41 A  81 51 Q   97 61 a  113 71 q
      2 02 STX  18 12 DC2  34 22 "  50 32 2  66 42 B  82 52 R   98 62 b  114 72 r
      3 03 ETX  19 13 DC3  35 23 #  51 33 3  67 43 C  83 53 S   99 63 c  115 73 s
      4 04 EOT  20 14 DC4  36 24 $  52 34 4  68 44 D  84 54 T  100 64 d  116 74 t
      5 05 ENQ  21 15 NAK  37 25 %  53 35 5  69 45 E  85 55 U  101 65 e  117 75 u
      6 06 ACK  22 16 SYN  38 26 &  54 36 6  70 46 F  86 56 V  102 66 f  118 76 v
      7 07 BEL  23 17 ETB  39 27 '  55 37 7  71 47 G  87 57 W  103 67 g  119 77 w
      8 08 BS   24 18 CAN  40 28 (  56 38 8  72 48 H  88 58 X  104 68 h  120 78 x
      9 09 HT   25 19 EM   41 29 )  57 39 9  73 49 I  89 59 Y  105 69 i  121 79 y
     10 0A LF   26 1A SUB  42 2A *  58 3A :  74 4A J  90 5A Z  106 6A j  122 7A z
     11 0B VT   27 1B ESC  43 2B +  59 3B ;  75 4B K  91 5B [  107 6B k  123 7B {
     12 0C FF   28 1C FS   44 2C ,  60 3C <  76 4C L  92 5C \  108 6C l  124 7C |
     13 0D CR   29 1D GS   45 2D -  61 3D =  77 4D M  93 5D ]  109 6D m  125 7D }
     14 0E SO   30 1E RS   46 2E .  62 3E >  78 4E N  94 5E ^  110 6E n  126 7E ~
     15 0F SI   31 1F US   47 2F /  63 3F ?  79 4F O  95 5F _  111 6F o  127 7F DEL
    
  • NOTE:

    • Every instruction and section should have an address, right? But here all sections start at address zero. How can two sections be at the same address?

    • Observe the .data section. There is no mention of int b, the uninitialized global variable. But if we have declared it, it should be somewhere, right?

    • The data present in the .rodata section cannot be executed by the processor. It is read-only, non-executable data. objdump simply converted the data in .rodata to its assembly equivalent, but that makes no sense because the whole section is non-executable.

    • Observe the .text section. There is no mention of the printf we used in the source file. But note that there is a call instruction (the 1f: line).

    • There are more, but these are the important ones.

To resolve a few of the issues mentioned above, let us use another tool called readelf to analyze the object file (named code1.o in the output below).

  • ELF stands for Executable and Linkable Format. For now, it is enough to know that any native program we want to execute on a Linux machine must be in this format. A file in any other binary format cannot be run, even if it has machine code in it. Windows has its own executable format, known as the PE (Portable Executable) file format.

a. The object file (here code1.o) contains a table known as the symbol table. Take a look at it.

    ~/rev_eng_series/post_1$ readelf -s code1.o

    Symbol table '.symtab' contains 13 entries:
       Num:    Value          Size Type    Bind   Vis      Ndx Name
         0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
         1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS code1.c
         2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1 
         3: 0000000000000000     0 SECTION LOCAL  DEFAULT    3 
         4: 0000000000000000     0 SECTION LOCAL  DEFAULT    4 
         5: 0000000000000000     0 SECTION LOCAL  DEFAULT    5 
         6: 0000000000000000     0 SECTION LOCAL  DEFAULT    7 
         7: 0000000000000000     0 SECTION LOCAL  DEFAULT    8 
         8: 0000000000000000     0 SECTION LOCAL  DEFAULT    6 
         9: 0000000000000000     4 OBJECT  GLOBAL DEFAULT    3 a
        10: 0000000000000004     4 OBJECT  GLOBAL DEFAULT  COM b
        11: 0000000000000000    43 FUNC    GLOBAL DEFAULT    1 main
        12: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND puts
    ~/rev_eng_series/post_1$ 
  • Focus on symbol numbers 9, 10, 11, and 12. Their names are a, b, main, and puts respectively.

    • a is a global object of size 4 bytes.

    • b is a global object of size 4 bytes. COM stands for COMMON symbol.

    • main is a global function of size 43 bytes.

    • puts is a global symbol, but its type is not known (NOTYPE) and it is undefined (UND). So, at this stage, the assembler does not know what puts is (though we know it). NOTE: When printf() is called with a plain string that has no format specifiers and ends in a newline, some compilers replace the call with puts(). That is why there is a puts() here instead of printf().

b. The object file also has relocation sections. Have a look at this:

    ~/rev_eng_series/post_1$ readelf -r code1.o

    Relocation section '.rela.text' at offset 0x240 contains 2 entries:
      Offset          Info           Type           Sym. Value    Sym. Name + Addend
    00000000001b  00050000000a R_X86_64_32       0000000000000000 .rodata + 0
    000000000020  000c00000002 R_X86_64_PC32     0000000000000000 puts - 4

    Relocation section '.rela.eh_frame' at offset 0x270 contains 1 entries:
      Offset          Info           Type           Sym. Value    Sym. Name + Addend
    000000000020  000200000002 R_X86_64_PC32     0000000000000000 .text + 0
    ~/rev_eng_series/post_1$ 
  • We will come to the meaning of relocation in the next phase.

    • For now, just observe that there are two symbols, .rodata and puts, in the .rela.text section.
  • This means code1.o has information about puts() in its symbol table and in its relocation section.

We have resolved a few of the issues mentioned in the NOTE, but not all. We still have to see what relocation is and what happens to puts.



Linking (gives executable code)


  • Linking is done by a system program known as the linker. The linker takes one or more object files and shared libraries (like libc) as input. If linking them succeeds, it generates the executable file. Otherwise, it reports a linking error.

  • An object file has no absolute addresses. Every section starts at address 0, and everything in a section is numbered relative to that start. But this is not possible in an actual executable file, where every section needs a definite address. The linker relocates (or shifts) each section so that every section has a unique starting address. This is the meaning of relocation.

  • The linker links the symbols present in the relocation table to their definitions. This is known as symbol resolution. E.g.:

    • The symbol main is linked to .text + 0x00 because that is where the body of the main function is defined. Then what will the linker link puts to? We only have its symbol in the relocation table, and we never defined it anywhere in our C program.

    • The linker finds the definition of puts in libc, the standard C library, and links puts to that.

  • The linker then assigns an address to every section from the object file and adds a few more sections, thus making it a complete executable file.




