About Anand

Anand is a veteran of Silicon Valley with development experience and patents that span processors, operating systems, networking, and systems development. Anand has been working for the past few years with Service Providers and large Enterprises developing e-Learning and Training systems.

Linux Boot: The Beginning was startup_32, and it was with Linus, and it WAS Linus also

In an earlier post, we referred to startup_32. Well, __start and start_of_setup are the targets of the INT 19h hand-off within the boot context, and the target of the first startup_32 is located at the 1M watermark, as we had mentioned.

What is also special about __start is that it is where we see the first instruction executed within the context of the kernel (in the case of the objdump shown, we have opted out of the “SAFE RESET of the Controller at config time”, because INT 19h has presumably done a good, thorough job in the boot context). From there we move on to creating a nice clean stack and then checking out the magic codes of the boot sectors, etc.

By the time we get to the first startup_32, we are in protected mode and memory above 1 MB can be accessed (the target of the decompression). The second startup_32 can therefore be located at the 1 MB watermark (0010 0000h) and, with virtual-address relocation, at c010 0000h.

What is so special about this watermark? It is … the first time we have executed instructions beyond the addressing limits set by x86 real mode (which, as we all know, is limited in memory access to 0xFFFFF).

We discuss Linux Kernel startup/boot concepts and more in my classes, with kernel code walk-throughs and programming assignments (Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently). Please take note, and take advantage also, of upcoming training sessions. Anand has also written production x86 protected-mode microcode, so he is in a unique position to educate on that front. As always, Feedback, Questions and Comments are appreciated and will be responded to.


4M/2M Pages, TLB Flushes, Huge Page (Splits) and Other Economy Busting “Non-Issues”

It is normal to reason, as we explain in our courses, that the Linux Kernel may organize at least a “few” pages as 4M pages. It can always be argued that we need not have enabled 4M paging, but that is not the point here. We need to figure out how to have our cake and eat it too. Hopefully.

SO… you guessed it: when push comes to shove, and for reasons that are as good as the ones I can come up with to the contrary, the buddy allocator becomes the busting allocator, busting up these 4M pages into the teeny-weeny 4K pages that we really needed to allocate in the first place. And then, needless to say, performance hell breaks loose with TLB flushes following the reset of ANY PDT “PS” bit (just in case you missed this on 60 Minutes: Intel requires TLB flushes, since the behavior for entries left in the TLBs when PDT “PS” bits get the “reset” treatment on the likes of page splits is IMPLEMENTATION SPECIFIC). I call that a bug-turned-nonfeature. Let’s move on, shall we?

You may argue “once busted, forever split”, but we do know that the buddy allocator will not let bygones be bygones, right? It sets the PDT “PS” bit again on buddy order-coalescing, etc.

These questions apply to PAE mode and to the 64-bit modes also. I went looking to see just what percentage of free pages were 4M pages in my VM (no-PAE compile option, 4M pages enabled); here it is ->
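For anyone who wants to take a similar measurement themselves, here is a minimal userspace sketch (my own approximation, not the original measurement above) that estimates the fraction of currently free memory sitting in order-10 buddy blocks, i.e. 4 MB chunks when the base page size is 4 KB, by parsing /proc/buddyinfo:

/*
 * Rough estimate: what fraction of currently free memory sits in order-10
 * buddy blocks (4 MB with a 4 KB base page)? Assumes the usual
 * /proc/buddyinfo layout: "Node N, zone NAME c0 c1 ... c10".
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/buddyinfo", "r");
	char line[512];
	unsigned long long free_pages = 0, free_in_4m = 0;

	if (!f) {
		perror("/proc/buddyinfo");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		char *p = strstr(line, "zone");
		unsigned long count;
		int order = 0;

		if (!p)
			continue;
		p += strlen("zone");
		while (*p == ' ')		/* skip spaces before the zone name */
			p++;
		while (*p && *p != ' ')		/* skip the zone name itself */
			p++;

		/* remaining columns: free-block counts for orders 0..MAX_ORDER-1 */
		while (sscanf(p, "%lu", &count) == 1) {
			free_pages += count << order;	/* an order-N block is 2^N pages */
			if (order == 10)		/* 2^10 pages * 4 KB = 4 MB       */
				free_in_4m += count << order;
			order++;
			while (*p == ' ')
				p++;
			while (*p && *p != ' ')
				p++;
		}
	}
	fclose(f);

	if (free_pages)
		printf("%.1f%% of free pages are in 4 MB (order-10) blocks\n",
		       100.0 * free_in_4m / free_pages);
	return 0;
}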

We all know the benefits of enabling 4M pages, yak yak, but how about the performance lost on TLB-entry flushes when just one, or just about one, of these huge pages is split up? The data just presented seems to suggest that if we enable 4M pages, performance loss would be as probabilistic as rain in the rain forests, especially for “higher-performance” systems at the cusp of the 4K and 2M/4M page sizes. I mean, being on one side of the camp (4K only) or the other (2M/4M) is so much more preferable. But reality rarely obliges our whims and wishes. And the trend to disappoint in that regard will continue, we surmise.

While the issues apply to the PAE and 64-bit addressing modes as well, a picture of the 32-bit 4K-to-4M paging transition, based on the PDT “PS” bit, is included here ->

After all this, is there any way of measuring dynamic, load-based performance loss due to TLB flushes against the prevailing context of page allocations? We do know it would be “traffic-dependent”, especially for caching and web servers, where the performance losses would, in all probability, be most noticeable and critically offensive as well. Is there any way for lesser machines to take advantage of 2M/4M page sizes and not burden a perhaps predominantly 4K allocation? And vice versa? Or do we just have to have customized and predetermined “preallocations” of huge/4K pages based on application-level heuristics? And if so, how is that to be accomplished, if at all?

We discuss Linux Kernel concepts and more in my classes, with code walk-throughs and programming assignments (Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently). Please take note of upcoming training sessions. As always, Feedback, Questions and Comments are appreciated and will be responded to.


Give me another jiffy, only a bit later

Given the description below, jiffies + 10*HZ means (now + 10 seconds). And (jiffies * 1000) / HZ would give the elapsed time in milliseconds (this is what jiffies_to_msecs() does for you), etc.

An example of usage in the Linux Kernel would be the XD disk driver, where a timer handler is programmed to execute jiffies + 8*HZ later (as initialized in the expires member of a static struct timer_list).

The “watchdog handler” xd_watchdog, in this particular instance, wakes up a sleeping process.

Needless to say, all the appropriate caveats (emptor) apply: the appropriate processes need to have been sleeping, the timer must have been declared and initialized, and so on.
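Here is a minimal, self-contained sketch of that pattern, using the classic (pre-4.15) timer API that matches the 2.6-era code discussed here. The names (my_watchdog, my_wait, my_timer) are illustrative placeholders of mine, not the XD driver’s actual symbols:

#include <linux/module.h>
#include <linux/timer.h>
#include <linux/jiffies.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wait);
static int my_condition;
static struct timer_list my_timer;

/* Runs in softirq context roughly 8 seconds after add_timer(). */
static void my_watchdog(unsigned long data)
{
	my_condition = 1;
	wake_up(&my_wait);			/* wake whoever sleeps on my_wait */
}

static int __init my_init(void)
{
	setup_timer(&my_timer, my_watchdog, 0);
	my_timer.expires = jiffies + 8 * HZ;	/* "now + 8 seconds" */
	add_timer(&my_timer);

	/* Sleep until the watchdog fires and sets my_condition. */
	wait_event(my_wait, my_condition);
	return 0;
}

static void __exit my_exit(void)
{
	del_timer_sync(&my_timer);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");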

I will blog on this at length later. Please subscribe to our mail list for automated updates on new blog entries.

I explain this specific Linux Kernel concept and more in my classes ( Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently). As always, Feedback, Questions and Comments are appreciated and will be responded to.

Thanks


Linux Kernel: Give me a Jiffy

The jiffies variable is a counter that stores the number of elapsed ticks since the system was started.

It is incremented by one when a timer interrupt occurs, that is, on every timer tick,
i.e. at HZ rate. HZ itself is configured globally (a compile-time option).

The Kernel makes generous use of this variable. Examples would be “timing” timeouts, or “budgeting” interrupt bottom-half handlers’ usage of the processor to prevent “hogging”.
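For instance, a small, illustrative kernel-side sketch of “timing a timeout” with jiffies (not taken from any particular driver); time_after() is used because it handles jiffies wrap-around, which a naive comparison would not:

#include <linux/jiffies.h>
#include <linux/delay.h>
#include <linux/errno.h>

static int wait_for_hardware(int (*ready)(void))
{
	unsigned long timeout = jiffies + 10 * HZ;	/* now + 10 seconds */

	while (!ready()) {
		if (time_after(jiffies, timeout))
			return -ETIMEDOUT;		/* gave up after ~10 s */
		msleep(10);				/* be polite while polling */
	}
	return 0;
}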

The xtime variable derives its information from the jiffies variable and stores the current time and date; it is a structure of type timespec having two fields:

tv_sec: Stores the number of seconds that have elapsed since midnight of January 1, 1970 (UTC)
tv_nsec: Stores the number of nanoseconds that have elapsed within the last second (its value ranges between 0 and 999,999,999)

We also need to blog on bottom-half handlers. They are used in a variety of places where the non-critical component of interrupt handling is “deferred”: I/O, packet reception, transmission, etc.


startup_32 : In how many ways shall I say I Love Thee ?

OK! I am just talking about at least one startup_32 here.

But given that this routine (or rather, two routines with an identical name but in different directories) gets used at boot time, one after the other… one just has to say “I love thee” at least twice. Perhaps more.

/arch/x86/boot/compressed/head_64.S
/arch/x86/kernel/head_32.S

The startup_32 in the “kernel” area is the starting point for the execution of the “decompressed” kernel. It IS in x86 assembly. And guess what? IT IS … located at the C0100000 watermark (per /proc/kallsyms), at least for 2.6.32.2.

And it is also where the mapping for the first 8M of memory (identity- and non-identity-mapped) is set up before paging is enabled, so that both the addresses relocated high to the 3G PAGE_OFFSET (C0000000) and the non-relocated addresses may be addressed by the processor, if only for the duration of the jump (grep for swapper_pg_dir within this file, or better yet grep for “Enable paging”). This jump is the one that clears the processor’s prefetch queue, so that the TLBs may be used for the fetch of the first non-identity-mapped instruction following the execution of the jump.
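To make the dual mapping concrete, here is a tiny userspace illustration of my own (not kernel code) of why the same first 8M can be reached both identity-mapped and at PAGE_OFFSET: with 32-bit paging, bits 31:22 of a linear address select the page-directory entry, so linear 0 lands in PDE 0 while linear C0000000 lands in PDE 768; pointing both groups of PDEs at the same page tables makes both views work.

#include <stdio.h>

#define PAGE_OFFSET 0xC0000000u
#define PDE_INDEX(va) ((unsigned)(va) >> 22)	/* top 10 bits of a linear address */

int main(void)
{
	unsigned long phys = 0x00100000;	/* 1 MB: physical home of startup_32 */

	printf("identity PDE for 0x%08lx : %u\n", phys, PDE_INDEX(phys));
	printf("kernel   PDE for 0x%08lx : %u\n",
	       phys + PAGE_OFFSET, PDE_INDEX(phys + PAGE_OFFSET));
	printf("8 MB spans PDEs %u-%u and %u-%u\n",
	       PDE_INDEX(0x0), PDE_INDEX(0x7FFFFF),
	       PDE_INDEX(PAGE_OFFSET), PDE_INDEX(PAGE_OFFSET + 0x7FFFFF));
	return 0;
}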

I explain Linux Kernel concepts and more in my classes ( Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently). Please register for email updates and new posts.

-Anand


Linux Kernel Packet interfaces for Transmission and Reception

Packet ingress and egress in the Linux Kernel are implemented with differing, though related, objectives.

On ingress, the traditional NIC/interrupt model is modified to incorporate “bulk” processing. That is, if more than a WEIGHT number of packets have ingressed and are available to be processed (WEIGHT is specified when the struct napi_struct described below is created), then the Kernel “polls” (with the device’s interrupts disabled) and processes the packets in “bulk” using the New API (“NAPI”, a per-device struct napi_struct). Otherwise interrupts are enabled, and each packet is processed on a per-interrupt basis. The objective is to reduce the effects of interrupt latencies under high-traffic conditions. However, limits are placed on “polling” in one shot without “deferral” (see below), so as to avoid “hogging” processor resources from the perspective of the Kernel’s and processor’s other duties. The polling itself runs from the Kernel’s “deferred processing” softirq mechanism (NET_RX_SOFTIRQ, whose action is net_rx_action); when the overall limit is reached (as specified by the configurable BUDGET), or when we have been polling for a set amount of time (in jiffies), the softirq is re-raised and the remaining work deferred. A per-NAPI “poll” function is initialized at device-initialization time (the poll member within struct napi_struct) to process the packets and pass them to the upper layers, and it is this function that checks the WEIGHT of the device it is processing to determine how many packets to process.
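A condensed sketch of that pattern for a hypothetical driver (“mydev”) follows; mydev_rx_one() and the two IRQ-management helpers are placeholders I am assuming, not a real driver’s API, and the netif_napi_add() shown is the classic signature that takes a weight:

#include <linux/netdevice.h>
#include <linux/interrupt.h>
#include <linux/skbuff.h>

#define MYDEV_NAPI_WEIGHT 64

struct mydev_priv {
	struct napi_struct napi;
	struct net_device *ndev;
};

/* Placeholders for the hardware-specific parts a real driver would have. */
static void mydev_disable_rx_irq(struct mydev_priv *priv) { }
static void mydev_enable_rx_irq(struct mydev_priv *priv) { }
static struct sk_buff *mydev_rx_one(struct mydev_priv *priv) { return NULL; }

/* Hardware interrupt: mask further RX interrupts and schedule the poll. */
static irqreturn_t mydev_isr(int irq, void *dev_id)
{
	struct mydev_priv *priv = dev_id;

	mydev_disable_rx_irq(priv);
	napi_schedule(&priv->napi);	/* mydev_poll will run from NET_RX_SOFTIRQ */
	return IRQ_HANDLED;
}

/* Called from net_rx_action (NET_RX_SOFTIRQ); budget caps this invocation. */
static int mydev_poll(struct napi_struct *napi, int budget)
{
	struct mydev_priv *priv = container_of(napi, struct mydev_priv, napi);
	struct sk_buff *skb;
	int done = 0;

	while (done < budget && (skb = mydev_rx_one(priv)) != NULL) {
		netif_receive_skb(skb);		/* hand the packet to the upper layers */
		done++;
	}

	if (done < budget) {			/* ring drained: back to interrupt mode */
		napi_complete(napi);
		mydev_enable_rx_irq(priv);
	}
	return done;				/* done == budget means "poll me again" */
}

/* At device-init time: register the poll function and its WEIGHT. */
static void mydev_setup_napi(struct mydev_priv *priv)
{
	netif_napi_add(priv->ndev, &priv->napi, mydev_poll, MYDEV_NAPI_WEIGHT);
}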

Well, packet egress is significantly more complex than packet ingress, with queue management and QoS (perhaps even packet shaping) being implemented. “Queueing disciplines” are used to implement user-specifiable QoS policies (specifiable via the “tc” command, for example). Struct Qdisc (the queueing discipline) is the central actor, with deferred processing implemented via NET_TX_SOFTIRQ (action net_tx_action).

I explain Linux Kernel concepts and more in my classes ( Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently). As always, Feedback, Questions and Comments are appreciated and will be responded to.

-Anand


Linux Kernel Address Spaces, Zone Pages and Allocations etc

Managing the processor’s memory (virtual and physical) is a key component of the Linux Kernel that is intimately tied to performance and scalability.

Physical memory is allocated from ZONEs. But what about an ADDRESS SPACE (struct address_space)?

At least one place where an ADDRESS SPACE (struct address_space) is initialized is when its INODE (struct inode) is created (on an open system call, for example).

It is also at this time, at least, that the flags of the ADDRESS SPACE are initialized to indicate the ZONE in which the pages that belong to it will reside.

When it comes time for the allocation of a page, a page needs to be selected from the appropriate ZONE as indicated by the ADDRESS SPACE. Page flags are also initialized at init time to record the ZONE they belong to (free_area_init_core). This facilitates returning a page to the correct ZONE‘s free list when it is freed up (when a process terminates, for example).
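A condensed, kernel-side sketch of that chain (illustrative only; I am assuming the generic pagemap helpers here, not quoting any particular file system’s code): the address_space carries an allocation (GFP) mask in its flags, and page-cache allocations consult that mask, which in turn steers the allocator to a particular ZONE (GFP_HIGHUSER_MOVABLE ends up in ZONE_HIGHMEM on 32-bit, for example).

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/gfp.h>

static struct page *grab_page_for(struct address_space *mapping)
{
	/* Typically done once, when the inode/address_space is initialized. */
	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);

	/*
	 * page_cache_alloc() allocates using mapping_gfp_mask(mapping), so the
	 * zone the page comes from follows from the mask stored in the flags.
	 */
	return page_cache_alloc(mapping);
}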

I will blog on this at length. Please subscribe to our mail list for automated updates on new blog entries. I explain Linux Kernel concepts and more in my classes ( Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently). As always, Feedback, Questions and Comments are appreciated and will be responded to.

Thanks


Avoiding TLB Flushes on Context Switches on x86 Processors: The PCID

CR3 and CR4 are control registers on x86 processors that are used to configure and manage protected-mode functionality. Instructions that access them may only be executed at CPL (Current Privilege Level) == 0, i.e. in kernel mode.
On x86, linear addresses are translated into physical addresses via the TLB (assuming the page-table walks have been done prior to lookup, etc.).
A CR3 load switches out the page tables. In the Linux Kernel, it is executed when a new process is scheduled at context-switch time (context_switch). Before the days of the PCID (see below), a load of CR3 flushed the TLB.
Avoiding TLB flushes on loads of CR3 is key to avoiding performance hits on context switches. In other words, a processor really needs to facilitate keeping multiple address spaces in the TLB across context switches.
Process-context identifiers (PCIDs) are a facility in x86 processors by which a logical processor may cache information for multiple linear-address spaces in the TLB.

The processor may retain cached information when software switches to a different linear-address space with a different PCID (e.g., by loading CR3).

When using entries in the TLBs and paging-structure caches to translate a linear address, a logical processor uses only those entries associated with the current PCID.

In other words, a PCID is a 12-bit identifier, and can be thought of as a “process ID” for the TLBs.

If CR4.PCIDE = 0 (bit 17 of CR4), the current PCID is always 000H; otherwise, the current PCID is the value of bits 11:0 of CR3.

Non-zero PCIDs are enabled by setting the PCIDE flag (bit 17 of CR4). Rules do apply to enabling PCIDs on x86 processors; caveat emptor. Naturally, restrictions apply to how the operating system can take advantage of this mechanism. Context switches that require isolation of linear addresses between processes must be done with care: with PCIDs enabled, the linear addresses of different processes may overlap in the TLB, and so may the cached translations of those linear addresses to physical memory! Ouch.
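A small kernel-mode sketch (CPL 0 only; my own illustration, not Linux’s code) of reading these bits with inline assembly:

#include <linux/kernel.h>

#define MY_CR4_PCIDE_BIT	(1UL << 17)	/* CR4.PCIDE */
#define MY_CR3_PCID_MASK	0xFFFUL		/* CR3 bits 11:0 */

static void report_pcid(void)
{
	unsigned long cr3, cr4;

	asm volatile("mov %%cr3, %0" : "=r" (cr3));
	asm volatile("mov %%cr4, %0" : "=r" (cr4));

	if (cr4 & MY_CR4_PCIDE_BIT)
		pr_info("PCIDs enabled, current PCID = %#lx\n",
			cr3 & MY_CR3_PCID_MASK);
	else
		pr_info("CR4.PCIDE = 0, current PCID is always 0\n");
}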

More on the implications of this for Linux will follow.

I  explain Linux Kernel concepts and more in my classes ( Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently). As always, Feedback, Questions  and Comments are appreciated and will be responded to.


Demystifying the Linux Kernel Socket File Systems (Sockfs)

All Linux networking starts with a system call creating a network socket (the socket system call). The socket system call returns an integer (the socket descriptor).

“Writing” to or “reading” from that socket descriptor (as though it were a file), using the generic system calls write/read respectively, creates TCP network traffic rather than file-system writes/reads.

Note: a file-system descriptor would have been created by the “open” system call IF the descriptor were a “regular” file-system descriptor, intended for “regular” file-system writes and reads (via the system calls write/read respectively) to files, etc.

Further note: this implies that the network socket descriptor created by the “socket” system call will be used by the systems programmer to write/read using the same system calls write/read used for “regular” file-system writes/reads (system calls that would, under normal circumstances, write/read data to/from files).

Further further note: a write system call (to the descriptor that was created by the socket system call) must translate “magically” into a TCP transaction that “writes” the data across the network (ostensibly to the client on the other end), with the data “written” encapsulated within the payload section of a TCP packet.
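A plain userspace illustration of that point (the target address 127.0.0.1, port 7, is just an arbitrary example): the very same write() used for files pushes bytes into a TCP stream when the descriptor came from socket().

#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in addr = { 0 };
	int fd = socket(AF_INET, SOCK_STREAM, 0);	/* the "socket descriptor" */

	if (fd < 0) { perror("socket"); return 1; }

	addr.sin_family = AF_INET;
	addr.sin_port = htons(7);			/* example: local echo service */
	inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("connect");
		return 1;
	}

	/* Same system call we would use on a regular file descriptor ... */
	if (write(fd, "hello\n", 6) < 0)
		perror("write");
	/* ... but these 6 bytes leave as TCP payload, via sockfs. */

	close(fd);
	return 0;
}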

This process of adapting and hijacking the kernel’s file-system infrastructure to incorporate network/socket operations is called SOCKFS (the socket file system).

So how does the Linux kernel accomplish this process, where a file-system write is “faked” into a network “write”, if indeed it can be called that?

Well… as is usually the case, the Linux kernel’s method begins at system/kernel initialization, when a special socket file system for networks (the statically defined sock_fs_type) is “registered” by register_filesystem. This happens in sock_init. Ordinarily, file systems are registered so that disk partitions can be mounted for that file-system type.

The kernel registers the file-system type sock_fs_type so that it can create a fake mount point for it using kern_mount. This mount point is necessary if the kernel is to later create a “fake file” (a struct file) using the existing, generic mechanisms and infrastructure made available by the Virtual File System (VFS); those mechanisms and infrastructure include a mount point being available.

        Note: no “actual” mount point exists, not in the sense of a disk-backed inode, etc.

        We will blog on file systems later.
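A heavily abridged paraphrase of that registration step, in the spirit of 2.6-era net/socket.c (the exact file_system_type members differ across kernel versions, e.g. .get_sb vs .mount, and the init-function name here is mine):

#include <linux/init.h>
#include <linux/fs.h>
#include <linux/mount.h>
#include <linux/err.h>

static struct vfsmount *sock_mnt;

static struct file_system_type sock_fs_type = {
	.name	= "sockfs",
	/* .get_sb (older kernels) or .mount (newer), plus .kill_sb, go here */
};

static int __init my_sock_init(void)
{
	int err = register_filesystem(&sock_fs_type);	/* make "sockfs" known */
	if (err)
		return err;

	sock_mnt = kern_mount(&sock_fs_type);		/* the in-kernel "fake" mount */
	if (IS_ERR(sock_mnt)) {
		unregister_filesystem(&sock_fs_type);
		return PTR_ERR(sock_mnt);
	}
	return 0;
}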

Then, when the socket system call is initiated (to create the socket descriptor), the kernel executes sock_create to create a new socket object, and also executes sock_map_fd, which creates a “fake file” and assigns that “fake file” to a new descriptor (aka the socket descriptor). The “fake” file’s ops (file->f_op) are then initialized to be socket_file_ops (statically defined at compile time in net/socket.c).

The kernel assigns/maps the socket descriptor created above to the new “fake” file using fd_install.

This socket descriptor is returned by the socket system call (as required by the socket system call’s man page) to the user program.

I only call it a “fake” file because a write system call executed against that socket descriptor will use the VFS infrastructure created, but the data will not be written to a disk file anywhere. It will, instead, be translated into a network operation because of the f_ops assigned to the “fake” file (socket_file_ops).

The kernel is now set up to create network traffic when the system calls write/read are executed against the “fake” file descriptor (the socket descriptor) which was returned to the user when the socket system call was executed.

In point of fact, a write system call on the “fake” file’s socket descriptor will then translate into a call to __sock_sendmsg within the kernel, instead of a write into the “regular” file system, because that is how socket_file_ops is statically defined before assignment to the “fake” file.
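For reference, an abridged paraphrase of the 2.6-era socket_file_ops definition (member list trimmed; verify against your kernel’s net/socket.c). The read/write paths funnel through sock_aio_read/sock_aio_write, which end up in __sock_recvmsg/__sock_sendmsg rather than in any on-disk file system.

/* Lives in net/socket.c, where the static sock_* handlers are defined. */
static const struct file_operations socket_file_ops = {
	.owner		= THIS_MODULE,
	.llseek		= no_llseek,		/* sockets are not seekable     */
	.aio_read	= sock_aio_read,	/* read(2)/readv(2) land here   */
	.aio_write	= sock_aio_write,	/* write(2)/writev(2) land here */
	.poll		= sock_poll,
	.unlocked_ioctl	= sock_ioctl,
	.mmap		= sock_mmap,
	.release	= sock_close,
	/* ... other members omitted ... */
};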

And then we are into networking space, and the promised LAN of milk, honey, TCP traffic, SOCKFS and file systems.

No one said understanding the kernel was easy. But extreme gratification awaits those who work on it, and it also creates enormous opportunities for innovation. I explain Linux Kernel concepts and more in my classes (Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently).

As always, Feedback, Questions and Comments are appreciated and will be responded to. I would like to listen to gripes too, especially if you also PayPal me some. Thanks

-Anand
