Archive | January, 2012

LinuxEco Cross-Functional Quiz#1

OK… we have talked of Huge Pages in an earlier blog, and we have referred to Kernel Preemption.

So we know now that a Huge TLB Page can be GB-sized… and therefore it stands to reason that a preemption is involved somewhere in the kernel API hugetlb_fault. Right ? Grins.

SO… where is it that the Kernel is possibly attempting a Preemption on a page fault on a huge page ? More importantly, WHY would we do that ? Do we dig it now ? Haaaaaah ? Tell me then.
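A hint, not the full answer: a fault path that grabs a lock that can sleep is itself a potential preemption point. Here is a simplified sketch of that shape, modeled on 2.6-era mm/hugetlb.c (the mutex name is from memory, and the function body is ours; verify against your tree):

#include <linux/mm.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(hugetlb_instantiation_mutex);

static int hugetlb_fault_sketch(struct mm_struct *mm,
                                struct vm_area_struct *vma,
                                unsigned long address, unsigned int flags)
{
        /* mutex_lock() may sleep, i.e. call schedule(): if another task
         * is runnable, we have just volunteered to be switched out in
         * the middle of servicing a huge-page fault. */
        mutex_lock(&hugetlb_instantiation_mutex);

        /* ... instantiate / map the huge page here ... */

        mutex_unlock(&hugetlb_instantiation_mutex);
        return 0;
}

Why a sleeping lock there at all ? That is the quiz.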

We explain these specific x86 and Linux Kernel features, concepts and more in detail in our class Advanced Kernel Memory Management, the second offering of which is to be made Corporate-direct. We also suggest the Spring Session 2012 UCSC-Extension Advanced Linux Kernel Programming.

Please take note of these and other upcoming Linux kernel training sessions. As always, Feedback, Questions and Comments are appreciated and will be responded to.

Comments { 0 }

Before the world of Linux Kernel Preemptions: a cooperative world

Kernel preemption tries to ensure fair usage of limited CPU resources. One way to understand kernel preemption is to explore its opposite, i.e. the way it WAS before the 2.6 Linux kernel (which is preemptive). Well, the way it WAS… was a cooperative world: a Process that got the CPU was expected to play nice and cooperatively hand over the CPU (e.g. at the EXIT system call), or the kernel could also, when it switched to user-mode, decide to schedule a new process. Processes were expected to be graceful in letting others use CPU resources.

Processes had to deal with many issues to try to ensure this model “worked”. However, in addition to some hard problems this created, there are architectural features that just plumb stall out CPU resources and ensure suboptimal CPU usage regardless of whatever processes could have done about it. In other words, Processes do not have visibility into the underlying mechanisms of the operating system / kernel itself.

Linux Kernel preemption ensures what I consider to be somewhat fair (aka somewhat arbitrary) reallocation of CPU resources between processes, with “all the knowledge under the sun” on the underlying “goings on”. As an example, if the Kernel KNOWS the system is taking interrupts at HZ rate (see blog below), why not prioritize between existing processes and give others a chance to run ? If the Kernel KNOWS a process is about to stall on a resource that may take some time to become available… why not put that process to “sleep” and “wake” some fortunate process up ?

Well… there are many reasons NOT to do that also (if processes are Real-time processes, for example), or to prevent “lockouts”, etc.
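Here is the “sleep the stalled process, wake some fortunate one” idea above in API form, as a minimal sketch (the function names consumer/producer and the resource_ready flag are hypothetical; the wait-queue calls are the standard kernel ones):

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(resource_wq);
static int resource_ready;

static void consumer(void)
{
        /* Sleeps, letting the scheduler run someone else, until the
         * condition becomes true. */
        wait_event_interruptible(resource_wq, resource_ready);
}

static void producer(void)
{
        resource_ready = 1;
        wake_up_interruptible(&resource_wq);    /* wake the sleeper(s) */
}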

In the end, it boils down to a tradeoff between latency for the lucky few vs. throughput for the very many. And all shades in between, with considerations galore, a few of which are listed below:

Explicit and Implicit “Blocking”, Critical Code Section Synchronizations, Network and Block Device Processing Latencies and Throughputs, Interrupt Latencies, “Deferred Processing”, Safety in Preemptability (preventing lockouts because we have preempted tasks that should not have been), their denials of preemptions / recursive depths of denials, relationships to interrupts and recursive relationships to the above, system-programming architectural considerations and requirements (Scheduler Priorities, Classes etc), SMP / Cross-Processor considerations, memory management, x86 architectural considerations in Interrupt Latencies, etc.

Again, all this is probably a good review for our past students. We explain these specific x86 features, Linux Kernel concepts and more in detail in my classes ( Advanced Linux Kernel Programming @UCSC-Extension), and also in other classes that I teach independently. Please take note, and take advantage also, of upcoming training sessions. As always, Feedback, Questions and Comments are appreciated and will be responded to.

Comments { 0 }

Linux Kernel Preemptions in the “Interrupt Context”: The x86 Interrupt Descriptor Table IDT

The term Interrupt / IRQ context in the kernel needs to be considered in the context of x86 and the low-level INTERRUPT and FAULT / TRAP handlers. INTERRUPTS are asynchronous events; FAULTS and TRAPS are programmatically generated events (i.e. they can be “pegged” to a specific instruction).

The purpose of this blog is to create an unambiguous delineation between Faults, Traps and Interrupts (COC, Change-Of-Control, events), and then create a lead-in into a future discussion of the implications to the kernel.

The actual entry into the fault handler is based on the Interrupt Number INT# (14 for Page Fault)… so for a Page Fault, the index 14 is used to index into the Interrupt Descriptor Table (IDT), and a “gate descriptor” is fetched (which has the Offset / EIP of the IRQ Handler). The Gate also holds a Segment Selector, which is used to fetch a Segment Descriptor per the normal mechanisms described in an older blog below. The BASE from that descriptor and the OFFSET from the gate get added, and we have the Linear Address of the IRQ Handler… where control will transfer when a COC event occurs.
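To make that lookup concrete, here is a small C sketch of the 8-byte 32-bit IDT gate just described (the bit layout is per the Intel SDM; the struct and field names are ours):

#include <stdint.h>

struct idt_gate32 {
    uint16_t offset_lo;   /* handler offset, bits 15:0  */
    uint16_t selector;    /* code segment selector      */
    uint8_t  zero;        /* reserved                   */
    uint8_t  type_attr;   /* gate type, DPL, Present    */
    uint16_t offset_hi;   /* handler offset, bits 31:16 */
};

/* Reassemble the handler's OFFSET; add the BASE from the segment
 * descriptor the selector names, and you have the handler's Linear
 * Address. */
static uint32_t gate_offset(const struct idt_gate32 *g)
{
    return g->offset_lo | ((uint32_t)g->offset_hi << 16);
}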

In the case of an INTERRUPT (not a FAULT or a TRAP), the INT# is returned by the interrupting device, an INT# it was given at the time of device initialization / request_irq (we will definitely blog on this later), and the processor hardware uses that INT# to vector through the IDT gizmo.

All COCs “push” the following x86 state onto the stack: CS (Code Segment Selector… see Blog below), EIP (Program Counter) and EFLAGS. However, in the case of an Interrupt, we really don’t know ahead of time which CS:EIP will be pushed onto the stack; it is determined by when the INTR pin is raised inside the processor, and at which “Instruction Boundary” that INTR is recognized.

In the case of a FAULT (for example a Page Fault), the CS:EIP pushed onto the Stack belongs to the Instruction generating the Page Fault (which could be based on the Instruction fetch, or an Operand fetch).

In the case of a TRAP, the CS:EIP belongs to the Instruction following the one that generated the trap for operands, and to the Instruction itself when the TRAP is generated on an Instruction fetch.

Now, entries into the Kernel for all these events, INTERRUPTS, FAULTS, TRAPS etc, happen through the same IDT, so the Linux Kernel calls them all IRQ events. The entry point into the “C” part of the Kernel (past the low-level interrupt handlers in entry_32.S) when preemption may be needed is unified for all these events, and is preempt_schedule_irq.

However, the state with which the Linux kernel enters preempt_schedule_irq will differ based on whether the entry is in the INTERRUPT context or not. For discussion later. Regardless, an entry into this routine is grounds for a “Preemption”. Remember, these Preemptions are caused by IRQs.
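For the curious, here is a C-flavored sketch (not verbatim; the real logic lives in assembly, in the resume_kernel path of arch/x86/kernel/entry_32.S, 2.6.32-era) of the checks made on return from an IRQ to kernel mode, just before preempt_schedule_irq:

#include <linux/sched.h>
#include <linux/thread_info.h>
#include <asm/ptrace.h>
#include <asm/processor-flags.h>

static void resume_kernel_sketch(struct thread_info *ti, struct pt_regs *regs)
{
        if (ti->preempt_count == 0 &&              /* preemption enabled   */
            (ti->flags & _TIF_NEED_RESCHED) &&     /* reschedule requested */
            (regs->flags & X86_EFLAGS_IF))         /* IRQs were on at COC  */
                preempt_schedule_irq();            /* IRQ-context preempt  */
}

Note the three-way guard: a nonzero preempt_count, no pending reschedule, or an interrupts-off context each individually deny the preemption.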

We will  discuss “non-IRQ” Kernel-induced preemptions later also.

All this is probably a good review for our students. We explain these specific x86 features, Linux Kernel concepts and more in detail in my classes (Advanced Linux Kernel Programming @UCSC-Extension), and also in other classes that I teach independently. Please take note, and take advantage also, of upcoming training sessions. As always, Feedback, Questions and Comments are appreciated and will be responded to. Cheers, Carlos !

 

Comments { 0 }

Give me your X86 Mode, make it Real, or else forget about it !

OK… I am talking about my friend Carlos Santana, and also about the difference between x86 Intel Real Mode Addressing and Protected Mode Addressing.

We will hit it up with Carlos later on in this blog. But business first for now, shall we ?

We would like to eventually explore just why… oh why… Real Mode is limited to 1MB of Memory.

And while doing so, also understand x86 “Protected Mode” addressing. We start with an x86 assembly instruction (AT&T syntax, because that is what the Linux Kernel uses), and illustrate address formation at the most important fundamental levels:

movl (%ebp), %eax

Let us look at Protected Mode Address formation first:

In the instruction shown above, we use ebp to form the “effective” address (ebp IS the “effective” address); then we add the “base” from the (default) segment register (DS) to the “effective address” to come up with the “linear address”. That linear address will be used to look up the TLB and translate into a Physical Address (if Paging is enabled), or the Linear Address becomes the Physical Address if Paging is NOT enabled.

The “Selector” shown IS the value of the “Selector” component of the Segment Register DS (each Segment register has a “Selector” component). In Protected mode, the third least-significant bit of the “Selector” is called the TI / Table Indicator bit. It is used to select one of two tables (the GDT or the LDT); bits 15:3 then index into the GDT or LDT, and we come up with an 8-byte “Segment Descriptor”.

That “Segment Descriptor” we just came up with has a 32-bit “Base”, which is added to the Effective Address (which, in our assembly instruction, is the value of the General Purpose Register EBP), and the net result becomes the Linear Address. That Linear Address either goes through the Paging translation to determine the Physical Memory Address (if Paging IS enabled)… or IS the Physical Address if Paging is NOT enabled (as determined by the CR0.PG bit).
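To pin that down in code, here is a small sketch of pulling the 32-bit Base out of an 8-byte descriptor and forming the Linear Address (the bit layout follows the Intel SDM; the struct and function names are ours):

#include <stdint.h>

/* A raw 8-byte segment descriptor, little-endian as on x86. */
struct seg_desc {
    uint16_t limit_lo;       /* limit 15:0              */
    uint16_t base_lo;        /* base  15:0              */
    uint8_t  base_mid;       /* base  23:16             */
    uint8_t  access;         /* type, DPL, Present      */
    uint8_t  limit_hi_flags; /* limit 19:16 + G/D flags */
    uint8_t  base_hi;        /* base  31:24             */
};

static uint32_t desc_base(const struct seg_desc *d)
{
    return d->base_lo | ((uint32_t)d->base_mid << 16)
                      | ((uint32_t)d->base_hi  << 24);
}

/* Protected Mode: linear = base(DS descriptor) + effective address (EBP). */
static uint32_t prot_mode_linear(const struct seg_desc *ds, uint32_t ebp)
{
    return desc_base(ds) + ebp;
}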

All of the above applies to x86 Protected Mode. How about “Real Mode” which does not have Descriptors etc or even Paging Mode ?

Well, the Selector (DS) and register sizes (bp) can only be 16 bits in Real Mode. The “Base” of the descriptor is replaced by the Selector left-shifted by 4 (i.e. multiplied by 16), and that is added to the effective address (bp).

Note that this implies that the Linear Address is (Selector << 4) + Effective Address: at most 0xFFFF0 + 0xFFFF = 0x10FFEF, an (essentially) 20-bit quantity.
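If you want to see it in a few lines of C (a standalone toy, nothing kernel about it):

#include <stdio.h>
#include <stdint.h>

/* Real Mode address formation: linear = (selector << 4) + offset.
 * Both quantities are 16 bits, so the maximum linear address is
 * 0xFFFF0 + 0xFFFF = 0x10FFEF -- essentially the 20-bit / 1MB limit
 * (modulo the A20 wraparound story). */
static uint32_t real_mode_linear(uint16_t selector, uint16_t offset)
{
    return ((uint32_t)selector << 4) + offset;
}

int main(void)
{
    printf("max real-mode linear address: 0x%X\n",
           real_mode_linear(0xFFFF, 0xFFFF));   /* prints 0x10FFEF */
    return 0;
}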

And we have just corroborated, 60 Minutes style, that x86 Real Mode is limited to 1MB of Memory addressing.

Of course, Carlos, x86 has now gone to Infinity and Beyond with Huge Pages and 48-bit Linear Addressing. And the implications are huge also, with the move beyond the measly (and respected) 4K Page sizes to 2M/4M/1G Page sizes. We have a blog below on this, and more coming.

As with some other key areas, Memory Management plays a key and consuming role in optimizing for Multiprocessing / Multicore and Multithreaded execution models.

There is no substitute in this regard for our course Advanced Kernel Memory Management. Hit us up on it.

We explain these specific x86 features, Linux Kernel concepts and more in detail in my classes ( Advanced Linux Kernel Programming @UCSC-Extension), and also in other classes that I teach independently. Please take note, and take advantage also, of upcoming training sessions. As always, Feedback, Questions and Comments are appreciated and will be responded to. Cheers !

Take it away Carlos Santana !

Comments { 0 }

Some of these posts are so friggin hard to read !

That is the feedback I get. Yes, true. Put me on that list also.

This is true even for the initiated. This IS hard stuff, make no bones about it. The fact that some of these posts may be difficult to read is perhaps one litmus test that good information is being communicated here.

And we all learn more when we revisit Kernel topics we thought were “100% understood”. No such thing.

These posts are supposed to get the “click” going for those who may be thinking on these issues. And to get the interested members of our workforce trained up in the collaborative, stupefying complexity that makes up the Linux kernel.

Kernel developers who wish to extend their reach beyond their immediate pales of influence, and Systems/Applications developers, will be the first-level beneficiaries. Also, the line between a great Linux System Admin and a Systems Programmer is beginning to blur…

Additionally, I do hope the following is taken in the right light. I am merely being correct when I say that instructive posts such as those posted here are more than what I got when I got started with Unix, and then Linux. We had to, and still have to, struggle, though I will also add that the ROI gets better with the years.

Now we have training sessions for the depths of the Kernel itself. We also discuss systems-level issues, and let the discussion go where it may (within the bounds of reason and the time allocated to us). This is true with us, and elsewhere also, we hope. These training sessions are very helpful, and this is per student feedback.

Also, in these sessions, it becomes clear just why in-person instruction and Q&A can clarify matters beyond what may be even reasonably or remotely possible in these blogs.

Some of our upcoming posts will deal with the overall issues related to MPX / Multithreaded / Multicore systems, and their relationships to the Linux kernel itself. We also have a talk coming up at the Dojo on this topic. Please put it on your calendar.

And we will be blogging a bit (read: a lot) more on Huge Pages etc… Cheers !

Comments { 0 }

The linux boot sequence: start_kernel, and there was light

Continuing with our discussion of Kernel Boot: start_kernel() is the implicit Process 0 (the PID was initialized to zero; let’s not go looking for a Process 0 with “ps”), AKA the “root thread”, the granddaddy of all processes to come. And the fall-back guy, as we will see.

Process 0 (PID 0, root thread) spawns off Process 1, known as the kernel_init (kernel thread) process, which will eventually exec /sbin/init within the thread we just created (created, not scheduled to run… not just yet).

Then the kernel process that starts off other kernel processes is created (kthreadd), aka Process 2.

Process 0 (PID 0) becomes the idle thread for the CPU when nothing else is runnable, as we will see below.

Regardless, we then schedule() (we DID create a process or two, hopefully). Nota Bene: when the /sbin/init kernel thread was created, we only had Processes 0 and 1 in the run queue, i.e. Process 1 becometh /sbin/init. Per 2.6.32.2 at least, and for a while now.

As part of schedule(), processes may have pooped out or popped in, so we will probably find something to run, right ? After all we have just gone through, we have a right to expect to have something to run. Grins.

However, in the unlikely event that we (eventually) can’t schedule() anything in or out, then we cpu_idle() for as long as it takes to find a process to run. Oh… this time under the aegis of Process 0, which, by our own definition, is the last man standing. And which, as we are well aware, cannot be “ps”-ed.
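For the curious, here is a condensed sketch (from memory, not verbatim 2.6.32 source; some housekeeping calls omitted) of the shape of rest_init(), where all of the above happens:

#include <linux/sched.h>
#include <linux/kthread.h>

static noinline void __init_refok rest_init(void)
{
	kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND); /* PID 1 */
	kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);      /* PID 2 */

	init_idle_bootup_task(current);
	preempt_enable_no_resched();
	schedule();                 /* let the newly created threads run */
	preempt_disable();
	cpu_idle();                 /* Process 0: the last man standing  */
}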

Please do note: a system-specific idle could have been created by /sbin/init, with its PID being whatever we get, since the init thread and the children it created could, in principle, have fork-ed till kingdom come.

We did mention that we needed to, and actually did, enable/disable preemption along the way depending on whether we were ready to schedule() or not. Given that we are within the pale of initialization, and memory locations are of origins and values unknown, caution is indeed the better part of valor. So why not apply that principle to thread_info’s also ? Right choice.

Such indeed are the joys of Linux Kernel Programming. Grins ?

We explain these specific Linux Kernel concepts and more in detail in my classes ( Advanced Linux Kernel Programming @UCSC-Extension), and also in other classes that I teach independently. Please take note, and take advantage also, of upcoming training sessions. As always, Feedback, Questions and Comments are appreciated and will be responded to.

Thanks

-Anand

Comments { 0 }

Linux Boot: The Beginning was startup_32, and it was with Linus, and it WAS Linus also

In an earlier post, we referred to startup_32. Well, __start and start_of_setup are the targets of the INT 19h within the boot context, and the target of the first startup_32 is located at the 1M watermark, as we had mentioned.

What is also special about __start is that we see here the first instruction executed within the context of the kernel (in the case of the objdump shown, we have opted out of the “SAFE RESET of the Controller at config time”, because INT 19h has presumably done a good, thorough job in the boot context), and we are on to creating a nice clean stack and then checking out the magic codes of the boot sectors etc.

By the time we get to the first startup_32, we are in protected mode, and memory above 1MB can be accessed (the target of the decompress). The second startup_32 can therefore be located at the 1M watermark (0010 0000h), with VA relocation at c010 0000h.

What is so special about this watermark ? It is… the first time we have executed instructions beyond the addressing limits set by x86 Real Mode (which, as we all know, is limited in memory access to 0xFFFFF).
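The arithmetic behind those two numbers, as a sketch (assuming the default 32-bit 3G/1G split, no kernel relocation): the kernel is linked at PAGE_OFFSET but loaded physically at the 1M watermark, so virtual c010 0000h maps to physical 0010 0000h.

/* The classic 32-bit linear-mapping arithmetic: */
#define PAGE_OFFSET 0xC0000000UL
#define __pa(x) ((unsigned long)(x) - PAGE_OFFSET)
#define __va(x) ((void *)((unsigned long)(x) + PAGE_OFFSET))

/* __pa(0xC0100000) == 0x00100000 : the second startup_32's home. */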

We discuss Linux Kernel startup/boot concepts and more in my classes with kernel code walk throughs and programming assignments ( Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently). Please take note, and take advantage also, of upcoming training sessions. Anand has also written production x86 protected-mode microcode, so is in a unique position to educate on that front. As always, Feedback, Questions and Comments are appreciated and will be responded to.

Comments { 0 }

4M/2M Pages, TLB Flushes, Huge Page (Splits) and Other Economy Busting “Non-Issues”

It is normal to reason, as we explain in our courses, that the Linux Kernel may organize at least a “few” pages as 4M pages. It can always be argued that we need not have enabled 4M paging, but that is not the point here. We need to figure out how to have our cake and eat it too. Hopefully.

SO… you guessed it: when push comes to shove, and for reasons as good as any I can come up with to the contrary, the buddy allocator becomes the busting allocator, busting up these 4M pages into the teeny-weeny 4K pages that we really needed to allocate in the first place. And then… needless to say, performance hell breaks loose with TLB flushes following the reset of ANY PDT entry’s “PS” bit (just in case you missed this on 60 Minutes: Intel requires TLB flushes, since the behavior for pages left in TLBs, with “PS” bits in the PDT getting the “Reset” treatment on the likes of page-splits etc, is IMPLEMENTATION SPECIFIC). I call that a bug-turned-nonfeature. Let’s move on, shall we ?

You may argue: once busted, forever split. But we do know that the Buddy allocator will not let bygones be bygones, right ? Setting the PDT entry’s “PS” bit again on Buddy Order-Coalescing etc.

The questions apply to PAE mode, and to 64-bit modes also. I went looking to see just what % of free pages were 4M pages in my VM (no PAE compile option, 4M Pages enabled); here it is ->
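If you want to reproduce that poke-around yourself, here is a minimal user-space sketch (our own code, with the assumption of a 4K base page and MAX_ORDER 11, so an order-10 buddy block is 4MB; on a 2M-page setup you would watch order 9 instead) that tallies order-10 free blocks from /proc/buddyinfo:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/buddyinfo", "r");
    char word[32], zone[32];
    unsigned long counts[11];

    if (!f)
        return 1;
    /* Each line: "Node N, zone <name> c0 c1 ... c10" (orders 0..10) */
    while (fscanf(f, "%31s %31s %31s %31s", word, word, word, zone) == 4) {
        int ok = 1;
        for (int i = 0; i <= 10; i++)
            if (fscanf(f, "%lu", &counts[i]) != 1)
                ok = 0;
        if (!ok)
            break;
        printf("zone %-8s: %lu free 4MB (order-10) blocks\n",
               zone, counts[10]);
    }
    fclose(f);
    return 0;
}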

We all know the benefits of enabling 4M Pages, yak yak, but how about the performance lost to TLB flushes when just one, or just about one, of these Huge Pages is split up ? The data we just presented seems to suggest that if we enable 4M Pages, performance loss would be as probabilistic as rain in the Rain Forests. Especially for “Higher Performance” Systems at the cusp of 4K and 2M/4M page sizes. I mean, being on one side of the camp (4K only) or the other (2M/4M) is so much more preferable. But reality rarely obliges our whims and wishes. And the trend to disappoint in that regard will continue, we surmise.

While the issues apply to PAE and 64-bit addressing modes as well, a pic of the 32-bit 4K-to-4M paging transition, based on the PDT entry’s “PS” bit, is included here ->

After all this, is there any way of measuring dynamic, load-based performance loss due to TLB flushes against prevailing contexts of page allocations ? We do know it would be “traffic-dependent”, especially for caching and web servers, where performance losses would in all probability be most noticeable, and critically offensive also. Is there any way for lesser machines to take advantage of 2M/4M Page sizes, and not burden a perhaps predominantly 4K allocation ? And vice versa also ? Or do we just have to have customized and predetermined “preallocations” of Huge/4K Pages based on application-level heuristics ? And if so, how is that to be accomplished, if at all ?

We discuss Linux Kernel concepts and more in my classes with code walk throughs and programming assignments ( Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently). Please take note of upcoming training sessions. As always, Feedback, Questions and Comments are appreciated and will be responded to.

Comments { 2 }

Give me another jiffy, only a bit later

Given the description below, jiffies + 10*HZ means (now + 10 seconds). And jiffies * 1000 / HZ would give the elapsed time in milliseconds, etc.

An example of usage in the Linux Kernel would be in the driver for XD, where, jiffies + 8*HZ later (as initialized in the member element expires of a static struct timer_list), a timer handler is programmed to execute.

The “watchdog handler” xd_watchdog, in this particular instance, wakes up a sleeping process.

Needless to say, all the appropriate caveats (emptor !) apply: the appropriate processes need to have been sleeping, timers must have been declared and initialized, etc.
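Here is the pattern as a minimal sketch, using the 2.6-era timer API (my_watchdog and its functions are hypothetical names; the jiffies + 8*HZ arming mirrors what the xd driver does):

#include <linux/timer.h>
#include <linux/jiffies.h>
#include <linux/wait.h>

static struct timer_list my_watchdog;

/* Runs in timer (softirq) context ~8 seconds later. */
static void my_watchdog_fn(unsigned long data)
{
        wake_up_interruptible((wait_queue_head_t *)data); /* wake sleeper */
}

static void arm_watchdog(wait_queue_head_t *wq)
{
        init_timer(&my_watchdog);
        my_watchdog.function = my_watchdog_fn;
        my_watchdog.data     = (unsigned long)wq;
        my_watchdog.expires  = jiffies + 8 * HZ;   /* now + 8 seconds */
        add_timer(&my_watchdog);
}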

I will blog on this at length later. Please subscribe to our mail list for automated updates on new blog entries.

I explain this specific Linux Kernel concept and more in my classes ( Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently). As always, Feedback, Questions and Comments are appreciated and will be responded to.

Thanks

Comments { 0 }

Linux Kernel: Give me a Jiffy

The jiffies variable is a counter that stores the number of ticks elapsed since the system was started.

It is incremented by one when a timer interrupt occurs—that is, on every timer tick, i.e. at HZ rate. HZ itself is configured globally.

The Kernel makes generous use of this variable. Examples would be “timing” timeouts, or “budgeting” Interrupt Bottom-Half handlers’ usage of the Processor to prevent “Hogging”.
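The usual jiffies arithmetic, as a short sketch (demo_timeout is a hypothetical name; time_after() is the standard wrap-safe comparison helper):

#include <linux/jiffies.h>
#include <linux/kernel.h>

static int demo_timeout(void)
{
        unsigned long deadline = jiffies + 10 * HZ;   /* now + 10 seconds */

        /* elapsed ticks -> milliseconds, without open-coding *1000/HZ */
        printk(KERN_INFO "uptime ~ %u ms\n", jiffies_to_msecs(jiffies));

        return time_after(jiffies, deadline);         /* true if past it */
}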

The xtime variable derives its information from the jiffies variable and stores the current time and date; it is a structure of type timespec having two fields:

tv_sec: Stores the number of seconds that have elapsed since midnight of January 1, 1970 (UTC)
tv_nsec: Stores the number of nanoseconds that have elapsed within the last second (its value ranges between 0 and 999,999,999)

We also need to blog on Bottom-Half handlers, which are used in a variety of places where the non-critical component of Interrupt handling is “deferred”: IO, packet reception, transmission, etc.

Comments { 0 }