2017年8月14日星期一

Linux & OS & Linux内核设计与实现 & 深入理解计算机系统

Unix / Linux systems and internals

Preparations

Operating Systems: Know about processes, threads and concurrency issues. Know about locks and mutexes and semaphores and monitors and how they work. Know about deadlock and livelock and how to avoid them. Know what resources a processes needs, and a thread needs, and how context switching works, and how it's initiated by the operating system and underlying hardware. Know a little about scheduling. The world is rapidly moving towards multi-core, so know the fundamentals of "modern" concurrency constructs.

Linux internals
  • Process Execution and/or Threads
  • Memory Usage
  • RAID Levels
  • The kernel and how it interacts with other system components
  • System Calls
  • Signals and Signal Handlers
  • Modern Web Architectures and Webservers
Preparations:
  • Review userspace / Kernel space boundaries and interactions.
    Examples might include: ioctls, sysctls, context switches.
  • Review troubleshooting tools for system-level performance issues.
  • Review troubleshooting tools for debugging application-level performance issues or bugs.
file system processing, file properties, user permissions, text parsing



Linux








  • Process
  • Context switch
    • Process context switch: need to switch virtual memory mapping, so that need to flush TLB (which is expensive)
    • User mode to Kernel mode: a mode transition usually 
  • Kernel
    • How the Kernel Manages Your Memory
    • Kernel can preempt a task running in the kernel so long as it does not hold a lock. Because the kernel is SMP-safe, if a lock is not held, the current code is reentrant and capable of being preempted
  • Permissions
    • chmod
      • sticky bit, 1, t: has the final decision无法删除
      • guid, 2, s: 目录下新文件保留原目录ownership
      • suid, 4, s
      • chmod -R a+rX . 所有子目录可搜索(regular file retain excutable permission)
    • chattr
      • i, immutable
    • filetype
    • umask: file permissions are set for newly created files
  • TCP/UDP packet size: 64KB, MTU: 1500 B
  • Zero Copy: From kernel to NIC, no user space
  • https://fabiokung.com/2014/03/13/memory-inside-linux-containers/
  • executing machine code in memory: Doable, think JIT runtime and python interpreter
  • HDD vs SSD:
    • SSD连续读的能力相比普通磁盘优势并不明显; SSD适合多读,多随机读写
    • erase-before-write: SSD必须erase before write
    • 因为SSD存在“写磨损”的问题,当某个单元长时间被反复擦写时(比如Oracle redo),不仅会造成写入的性能问题,而且会大大缩短SSD的使用寿命,所以必须设计一个均衡负载的算法来保证SSD的每个单元能够被均衡的使用,这就是wear leveling,称为损耗均衡算法 -- offline erase
    • 传统数据库日志是sequential logging因为是连续位置的随机写入; SSD会大大折寿, SSD采用in-place logging: data and log in the same block; 日志文件还是适合HDD
    • SSD作为flashcache掉电后数据是有效还是无效?被当作无效的
  • 4K对齐是一种高级硬盘使用技术,用特殊方法将文件系统格式与硬盘物理层上进行契合,为提高硬盘寿命与高效率使用硬盘空间提供解决方案。因该技术将物理扇区与文件系统的每簇4096字节对齐而得名。当前电脑传统机械硬盘的每个扇区一般大小为512字节

Filesystem







《Linux内核设计与实现》

  • Overview
    • In Linux, each processor is doing exactly one of three things at any given moment
      • In user-space, executing user code in a progress
      • In kernel-space, in process context, executing on behalf of a specific process
      • In kernel-space, in interrupt context, not associated with a process, handling an interrupt
        • top halve has interrupt disabled
    • Kernel memory is not pageable
  • Process Management
    • Process includes
      • open files
      • pending signals
      • internal kernel data
      • processor state
      • memory address space (multiple memory mappings) page tables
      • data section containing global variables
    • Thread includes
      • unique program counter
      • process stack
      • set of processor registers
    • The kernel stores the list of processes in a circular doubly linked list called the task list.
    • Fork() only overhead: duplication of the parent's page tables & creation of a unique process descriptor for the child
      • ZOMBIE only has kernel stack, thread_info structure, and task_struct structure
      • When a task is ptraced, it is temporarily reparented to the debugging process
    • kernel threads do not have an address space (not normal process)
  • Process Scheduling: low latency (process response time) and high throughput (system utilization)
    • Context switch, the switching from one runnable task to another:
      • calls switch_mm() to switch virtual memory mapping
      • calls switch_to() to switch the processor state: saving/restoring stack information and processor registers and others
    • User preemption can occur:
      • when returning to user-space from a system call
      • when returning to user-space from an interrupt handler
    • Kernel preemption can occur: Linux 2.6 is a fully preemptive kernel, it is possible to preempt a task at any point, so long as the kernel is safe to reschedule.
      • NO LOCK. locks are used as markers of regions of nonpreemptibility
      • scheduler_tick() will check flag need_resched (which is per-process)
    • Kernel preemption can occur:
        • When an interrupt handler exits, before returning to kernel-space
        • When kernel code becomes preemptible again (NO LOCK, preempt_count is 0)
        • If a task in the kernel explicitly calls schedule()
        • If a task in the kernel blocks (which results in calling schedule())
    • System Calls
      • syscall (interrupt vector 128): an exception or trap to enter the kernel
      • syscall must be reentrant
    • Interrupts and Interrupt Handlers
      • Why interrupt? Processor polling incurs overhead, interrupt is what hardware to signal the kernel to get attention
      • Interrupt vs Exception:
        • interrupt handlers (top halves) are executed by the kernel asynchronously in immediate response to hardware interrupt
        • exceptions occur synchronously with respect to the processor clock, that they are called synchronous interrupts. (e.g., divide by zero, a page fault)
      • Special interrupt context: it is called atomic context as code executing in this context is unable to block (Top Halves)
      • Interrupt handlers can form only the first half of any interrupt processing solution due to these limitations:
        • Interrupt handlers run asynchronously and thus interrupt other, potentially important, code, including other interrupt handlers. (need to run fast)
        • Interrupt handlers run with at best the current interrupt level disabled, and at at worst all interrupts on the current processor disabled. But disabling interrupts prevents hardware from communicating with the operating systems
        • Interrupt handlers do not run in process context; therefore, they cannot block
        • They are often timing-critical
    • Kernel Synchronization
      • When only a single processor, the only way data could be concurrently accessed was: interrupt occurred or if kernel code explicitly rescheduled and enabled another task to run
      • Causes of concurrency:
        • Interrupts: An interrupt can occur asynchronously at almost any time, interrupting the currently executing code
        • Softirqs and tasklets: The kernel can raise or schedule a softirq or tasklet at almost any time, interrupting the currently executing code
        • Kernel preemption: Because the kernel is preemptive, one task in the kernel can preempt another
        • Sleeping and synchronization with user-space: A task in the kernel can sleep and thus invoke the scheduler, resulting in the running of a new process
        • Symmetrical multiprocessing: Two or more processors can execute kernel code at exactly the same time (per CPU memory)
      • Lock data not code
      • Deadlock
        • Implement lock ordering
        • Prevent starvation. Ask yourself, does this code always finish? If foo does not occur, will bar wait forever?
        • Do not double acquire the same lock
        • Only one process move at one time (prevent livelock)
    • Memory Management
      • A page may be used by page cache, private data, or as a mapping in a process's page table
      • kernel page structure is associate with physical pages not virtual pages
      • high memory: memories one architecture can not directly map

    《深入理解计算机系统》


    链接
    • Static Linking: ld copies functions from AR file
    • Dynamic Linking:
      • At loading:
        • compile to get partial executable object file from shared object (mark symbols relocatable)
        • execve(ld-linux.so),relocate目标文件和可重定向文件libc.so,  relocate text and data
      • At runtime: dlopen() loads functions from 共享库 to its memory directly (top中SHR的一部分就是dll文件)
    • 3 kinds of object files:
      • Executable object file
      • Relocatable object file
        • Shared object (shared library)
    异常处理
    • interrupt(键盘), trap, fault (除0), abort(机器检查)
    • syscall is trap 128
    • User vs Kernel mode is controlled by one mode bit in 控制寄存器
    • context: 寄存器、浮点寄存器、PC、User stack、状态寄存器、Kernel stack、页表、进程表、文件表
    • 信号的缺陷:
      • 同类型待处理信号被阻塞
      • 待处理信号不会排队等待:queue = 1
      • 系统调用可以被中断:read会返回EINTR
    • fork() vs execve()
      • fork将父子进程每个页面标记为read only,每个section都标记为private copy-on-write
      • execve: 删除已存在的用户区域->映射私有区域(text data bss stack heap)->映射共享区域(libc.so)->设置PC
        • execve会继承打开了的fd
    • 打开文件的数据结构
      • 描述符表(descriptor table): 指向file table
      • 文件表(file table): shared by all processes. 当前文件位置, 引用计数(多少个进程指向它),  指向v-node table
      • v-node table: shared by all processes. 文件访问、大小、类型


    Linux startup process



















    Boot

    1. Power On Self Test 
    2. Boot from CD, USB, HD 
    3. Load kernel into initramfs
      1. BIOS -> Master Boot Record (first sector) -> GRUB -> Active Partition load kernel
      2. UEFI (低阶OS) -> GPT -> /boot/efi/boot.efi -> GRUB/kernel
        1. EFI system partition is based on FAT, contains boot loader or kernel image
      3. Why GRUB has so many stages?
        1. stage 1, boot.img, 446 B, within first sector: to locate stage 2
        2. stage 1.5, core.img, 25 KB, between MBR and first partition: contains a few common filesystem drivers (Not all filesystems)
        3. stage 2, /boot/grub2, The kernels are located in the /boot directory, along with an initial RAM disk image, and device maps of the hard drives
    4. Why BIOS could not load kernel directly? Does not have system driver for addressing
    Startup
    1. Kernel loads systemd/init, then mounts filesystems in /etc/fstab
    2. Kernel calls start_kernel() to set up system functions: hardware and memory paging, interrupt handlers, the rest of memory management, device and driver initialization
      1. idle process: power saving mode
      2. scheduler: Complete Fair Scheduler O(logN)
      3. init/systemd who mounts /etc/fstab
    3. init start services from specific levels
    4. display manager and login manager -> session manager
    Shutdown
    1. Close down user space functionality
    2. init terminates
    3. kernel shutdown