Archive | Linux Kernel Memory Management

The Tortuous Road to the Joys of Extending Physical Memory Ranges and Page Sizes

My dear Brethren and Fellow travellers in the land of x86 and the Linux Kernel, it is time to come clean on all the modes that extend x86 page sizes and physical memory addressing.

So here we go, Ka Boom Ka Beem: In 32-bit non-PAE mode, we have PSE and PSE-36 (AKA PSE-40 on processors that stretch it to 40 bits). PSE-36/40 is a cheap way to get extended physical addressing (64 GB at 36 bits, 1 TB at 40 bits) on top of the 4K/4M paging that PSE gave us (really!), while keeping the 32-bit PDEs and PTEs.

In 32-bit PAE mode, the architecture remains 32-bit, but the PDPTEs, PDEs and PTEs extend to 64 bits, thereby extending page sizes to 4K/2M (2M when the PTE level is skipped) and physical addressing out to 52 bits (4 PB architecturally).

And last, but not least, we have IA-32e mode (true-blue 64-bit architecture), with 48-bit linear addressing, 4K/2M/1G page sizes (1G when both the PDE and PTE levels are skipped), physical addressing out to 52 bits (4 PB), and 64-bit PML4Es, PDPTEs, PDEs and PTEs.
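To see which of these features a given processor actually advertises, a quick user-space peek at CPUID does the job. A minimal sketch, assuming GCC or Clang on x86 and <cpuid.h>, with bit positions taken from CPUID leaves 01H, 80000001H and 80000008H:

/* Report the paging-related CPUID feature bits discussed above. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        printf("PSE    (4M pages, non-PAE)  : %s\n", (edx & (1u << 3))  ? "yes" : "no");
        printf("PAE    (64-bit entries)     : %s\n", (edx & (1u << 6))  ? "yes" : "no");
        printf("PSE-36 (wide phys w/ PSE)   : %s\n", (edx & (1u << 17)) ? "yes" : "no");
    }
    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
        printf("1G pages (IA-32e)           : %s\n", (edx & (1u << 26)) ? "yes" : "no");
        printf("Long mode (IA-32e)          : %s\n", (edx & (1u << 29)) ? "yes" : "no");
    }
    if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
        printf("Physical address bits       : %u\n", eax & 0xff);
        printf("Linear  address bits        : %u\n", (eax >> 8) & 0xff);
    }
    return 0;
}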

John, did I get it all, and right also, this time ? DID you get the Cigar too ? Haaah ?

We all … must admit to an acute case of x86-itis, so from the next post onwards it is time to switch gears. We will start looking at the various implications for the Kernel of some of these x86-isms about which we have waxed eloquent in this and prior posts.

I will also note that we do a good job of explaining these concepts in our training sessions, as measured by course evaluations and by feedback from clients (some of which are ISPs).

I do hope everyone enjoyed this post. Thanks again.


Mitigating the Performance impacts of TLB Flushes on Context Switches

We all know, presumably, that MOV CR3 (the PDBR) is an essential part of the Linux Kernel’s context_switch routine. This is necessary since the page tables have changed, but the MOV CR3 also flushes the TLB, thereby forcing page-table walks afterwards.

Avoiding TLB flushes on loads of CR3 is key to avoiding performance hits on context switches. In other words, a processor really needs to facilitate the caching of multiple address spaces in the TLB across context switches.

In “pure architectures” (which the x86 is NOT, and for good reasons of backward compatibility etc.), the PID (Process ID) would have been “hashed” into the TLB tags, thereby avoiding the need for TLB flushes on context switches. Not so with x86, since the PID is not part and parcel of the x86 architecture.

Process-context identifiers (PCIDs) are a facility in x86 by which a logical processor may cache information for multiple linear-address spaces in the TLB, and preserve it across context switches.

As we noted above, the processor retains cached page-table information when software switches to a different linear-address space by loading CR3, and presumably to a different process (we ARE executing a context_switch).

A PCID is a 12-bit identifier, and may be thought of as a “Process-ID” for TLBs. If CR4.PCIDE = 0 (bit 17 of CR4), the current PCID is always 000H; otherwise, the current PCID is the value of bits 11:0 of CR3. Non-zero PCIDs are enabled by setting the PCIDE flag (bit 17 of CR4).
When a logical processor creates entries in the TLBs (Section 4.10.2 of the x86 programming reference, the Intel SDM) and paging-structure caches (Section 4.10.3), it associates those entries with the current PCID (oh … such a loose association of PCID with PID). Note that this means the translations rooted at wherever the PGD points are, in effect, tagged with that “process context”. When using entries in the TLBs and paging-structure caches to translate a linear address, a logical processor uses only those entries associated with the current PCID, and hence flushes of the TLB are avoided.
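A toy model helps here (purely illustrative C, not how the hardware is organized): think of every TLB entry as carrying a PCID tag, and of lookups as hitting only entries whose tag matches the current PCID.

/* Toy model of a PCID-tagged TLB: entries belonging to other address
 * spaces can stay resident across a CR3 load, because lookups simply
 * ignore them. */
#include <stdint.h>
#include <stdbool.h>

struct tlb_entry {
    uint64_t vpn;      /* virtual page number */
    uint64_t pfn;      /* physical frame number */
    uint16_t pcid;     /* 12-bit tag: the "process ID" of the translation */
    bool     valid;
};

static bool tlb_lookup(const struct tlb_entry *tlb, int n,
                       uint64_t vpn, uint16_t cur_pcid, uint64_t *pfn)
{
    for (int i = 0; i < n; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].pcid == cur_pcid) {
            *pfn = tlb[i].pfn;   /* hit: no page-table walk needed */
            return true;
        }
    }
    return false;                /* miss: walk the page tables */
}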

With the x86, my dear brothers and sisters in grief and joy, we take what we can get, and run. In this case, where TLB flushes are avoided for what will turn out to be 99% of *current* address-space switches, that is more than we could have bargained for with Intel. I say .. Good Job, Intel.

 


The Tortuous Road to extending Page Size -and- Memory Size Addressing

When the earth was young, and you were not born, Unca Intel created Real Mode (Read previous article).

Real mode had the capacity to address a whopping 1 MB of memory (a 20-bit address space), with each “program” (there were no processes) addressing 64 KB of space (since the registers etc. were 16 bits in size). Believe you me, in those days, that was mongo memory. Ah, those were the days …

Then came progress, and the first KING of the Microprocessor Family, the 80386, and with it 32-bit addressing and 4K paging. Someone had a bright idea at Intel: why not fold the 10 bits that would normally index into the Page Table into the page offset instead? Voilà! And so were born 4M Pages.

However, memory sizes were still capped at 4 GB (so only 1024 4M pages, no Cigar, John!). AND … it will be noted that 4M Pages came with no new paging hierarchy.

SO .. the next step was to extend memory sizes … why not? 52 bits of physical addressing it was, architecturally (PAE, go for it John, what the frig!), with a new paging hierarchy, and we got 4K and 2M page sizes. Did we hear NEW paging hierarchy, with 64-bit PDPTEs, PDEs and PTEs? We did.

 

Ouuch. No John, no way: we NEED our cake (the OLD paging hierarchy with 32-bit PDEs and PTEs) and wanna eat it too (larger memory addressing, larger anyway than 4 GB).

SO .. we created PSE-36 (36-bit memory addressing, 64 GB of DDRx addressing with the “old yet modified” 32-bit PDEs and PTEs) with 4M paging.

But John, we TOLD you we really needed LARGER page sizes AND … larger memory. And let’s not worry, just GO with the paging hierarchy changes, OK?

And until then, no kiss-kiss, no bang-bang.

SO .. John extended PSE 4M pages out to 40-bit memory addressing (1 TB of DDRx addressing) and called it a Job Well Done. This was the original 4MB page-size PSE format extension, WHICH was never intended to extend physical addresses the way the new 64-bit paging hierarchies do, but eventually wound up doing just that. Amen.
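For the curious, here is a sketch (illustrative C, decoding a made-up PDE value) of how a 32-bit, 4M-page PDE yields a greater-than-32-bit physical address under PSE-36/40. Per the SDM, PDE bits 31:22 supply physical-address bits 31:22, and PDE bits 20:13 supply physical-address bits 39:32.

/* Decode a non-PAE, 4M-page PDE with PSE-36/40 in effect. */
#include <stdint.h>
#include <stdio.h>

static uint64_t pse36_pde_to_phys(uint32_t pde, uint32_t vaddr)
{
    uint64_t base  = (uint64_t)(pde & 0xFFC00000u);          /* PA[31:22] */
    base          |= (uint64_t)((pde >> 13) & 0xFFu) << 32;  /* PA[39:32] */
    return base | (vaddr & 0x003FFFFFu);                     /* 22-bit page offset */
}

int main(void)
{
    /* Hypothetical PDE mapping a 4M page at physical 0x1_2340_0000
     * (present, writable, PS set). */
    uint32_t pde = 0x23400000u | (0x1u << 13) | 0x83u;
    printf("phys = 0x%llx\n",
           (unsigned long long)pse36_pde_to_phys(pde, 0x00012345u));
    return 0;
}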

BUT did he not get 1 TB of memory addressing also? Thanks for correcting me, John. And with the new paging hierarchy (64-bit PDPTEs/PDEs etc.) that comes with PA52, memory addressing extends out as far as 4 PB. More than enough for the EXT3/4 file system. And on to extended file systems.

SO then … would one of my esteemed colleagues please enlighten me on the maximum size of an EXT3/4 file system? And why is it so? More when we discuss File Systems on this blog. We discuss these concepts during our talks on Advanced Linux Kernel Programming at UCSC-Extension, and during the course on Memory Management; please note the Event Calendar on the right.


x86 Segmentation Protection Mechanisms: #1

At the outset, please accept my apologies for the x86 protection mechanisms.

I did not create them. They ARE difficult to understand (huge understatement). They ARE elegant also (Oucchh). I am merely a messenger making an attempt at explaining them. The complaint department for the x86 architecture is with the FBI .. just kidding.

x86 protections revolve around the Current Privilege Level (CPL). The CPL is a two-bit field that can take on four privilege levels: 0, 1, 2, 3.

Please NOTE, as we go, the types of instructions that enable moves between the different privilege levels.

A future post will follow that describes just how and when the CPL changes, but for now let us take the following for granted: when x86 processors wake up, they are in REAL MODE (Give Me Your Real Mode), and the CPL is set to numerical “0”. This is the highest privilege level an x86 processor can operate at, and the Linux Kernel operates at this level. All applications on x86 processors operate at level 3.

This article focuses on a MOV into a Data Segment register (accomplished with a “MOV DS, AX”, for example), and the protection mechanisms associated with it.

Note that the Code Segment register CS is “just” another segment register; however, a MOV into CS is accomplished with a “transfer of control” instruction (far CALL, far JMP, software INT, RET, IRET etc.), or by an asynchronous event (a hardware interrupt).

In the MOV DS, AX, the AX register contains a 16-bit SELECTOR, whose bits 1:0 hold a two-bit Requestor Privilege Level (RPL) value. As explained in an earlier post, the SELECTOR also has a Table Indicator bit (“TI”) that is used to select either the GDT (Global Descriptor Table) or the LDT. The GDT and LDT are arrays of 8-byte DESCRIPTORs, and one of these DESCRIPTORs is chosen by the SELECTOR to be loaded into the segment register DS.

That Chosen DESCRIPTOR has a two-bit (surprise surprise) DPL (Descriptor Privilege Level) field.

SO … the major “players” in Protection Mechanisms are the CPL, DPL and the RPL.

Here, it is important to note the special role that the Segment SELECTOR plays.

In addition to identifying whether the GDT or the LDT will be used (as indicated by the “TI” bit of the SELECTOR), and the descriptor itself within the GDT or LDT (as indexed by SELECTOR[15:3]), the RPL field of the selector plays the role of “WEAKENING” the CPL (Current Privilege Level).
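For concreteness, a minimal sketch (illustrative C, nothing the hardware actually executes) of how those three fields break out of the 16-bit selector:

/* Selector layout: bits 1:0 = RPL, bit 2 = TI (0 = GDT, 1 = LDT),
 * bits 15:3 = descriptor index within the chosen table. */
#include <stdint.h>

struct selector_fields {
    unsigned rpl;    /* Requestor Privilege Level */
    unsigned ti;     /* Table Indicator: 0 selects the GDT, 1 the LDT */
    unsigned index;  /* index of the 8-byte descriptor in that table */
};

static struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f = {
        .rpl   = sel & 0x3,
        .ti    = (sel >> 2) & 0x1,
        .index = sel >> 3,
    };
    return f;
}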

The thinking is that the RPL represents a privilege level of sorts, and the numerically larger of the CPL and the SELECTOR’s RPL becomes CPL.effective (the effective CPL) for the purposes of the segment register load.

The “MOV DS, AX” (into the segment register) sets up the memory space that will become accessible to the program doing the MOV. The protection check performed (CPL.effective <= Descriptor.DPL) will either allow the MOV DS to proceed or not.

IF the Protection checks fail for the MOV DS, the processor hardware will generate a #GP (General Protection) Fault.
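And the check itself, in the same sketch style:

/* Data-segment load check: allowed only if max(CPL, RPL) <= DPL,
 * otherwise the processor raises #GP. */
#include <stdbool.h>

static bool data_segment_load_ok(unsigned cpl, unsigned rpl, unsigned dpl)
{
    unsigned effective_cpl = (rpl > cpl) ? rpl : cpl;  /* the RPL can only weaken the CPL */
    return effective_cpl <= dpl;  /* numerically larger level = less privileged */
}

/* The kernel (CPL 0, RPL 0) may load a DPL-3 user data segment;
 * user code (CPL 3) trying to load a DPL-0 kernel segment gets #GP. */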

This methodology for Segment register Loads is used, with some modifications, for the gamut of Segment register Loads within X86, wherever they may happen.

“Far” jumps etc. load the Code Segment (CS), possibly with a change in privilege levels, and a MOV SS loads the Stack Segment. On a MOV SS, the fault generated on a protection failure is the Stack Fault (#SS).

From the above, it becomes clear that the Kernel has full and complete access to User Mode memory. This is a capability that is necessary, for example, when, in response to a READ system call, the Kernel reads data from disk (in Privilege Level 0) and then copies that data into User Space (operating at Privilege Level 3). The vice versa is not true: user code gets no such access to Kernel memory.

Understanding the Linux x86 Server memory model is not possible without understanding x86 Segmentation, the Protection Mechanisms, and the Paging /TLB Mechanisms (To be discussed later).


LinuxEco Cross-Functional Quiz#1

OK… we have talked of Huge Pages in an earlier blog, and we have referred to Kernel Preemption.

So we know now that a Huge TLB Page can be GB-sized … and so it stands to reason that preemption is involved somewhere in the kernel API hugetlb_fault. Right? GF-rins.

SO … where is it that the Kernel is Possibly attempting a Preemption  on a page fault on a huge page ? More importantly, WHY would we do that ? Do we dig it now ? Haaaaaah ? Tell me then.

We explain these specific x86 and Linux Kernel features, concepts and more in detail in our class Advanced Kernel Memory Management, the second offering of which is to be made corporate-direct. We also suggest the Spring Session 2012 UCSC-Extension Advanced Linux Kernel Programming.

Please take note of these and other upcoming Linux kernel training sessions. As always, Feedback, Questions and Comments are appreciated and will be responded to.


Before the world of Linux Kernel Preemption: a cooperative world

Kernel preemption tries to ensure fair usage of limited CPU resources. One way to understand kernel preemption is to explore its opposite, i.e. the way it WAS before the 2.6 Linux kernel (which is preemptive). Well, the way it WAS .. was a cooperative world: a process that got the CPU was expected to play nice and cooperatively hand over the CPU (at the EXIT system call, for example), or the kernel could also, when it switched to user mode, decide to schedule a new process. Processes were expected to be graceful in letting others use CPU resources.

Processes had to deal with many issues to try to make this model “work”; however, in addition to some hard problems it created, there are architectural features that just plain stall out the CPU and ensure suboptimal CPU resource usage regardless of whatever processes could have done about it. In other words, processes do not have visibility into the underlying mechanisms of the operating system / kernel itself.

Linux Kernel preemption ensures what I consider to be a somewhat fair (aka somewhat arbitrary) reallocation of CPU resources between processes, with “all the knowledge under the sun” about the underlying “goings on”. As an example, if the Kernel KNOWS the system is taking timer interrupts at HZ rate (see blog below), then why not use them to re-prioritize among existing processes and give others a chance to run? If the Kernel KNOWS a process is about to stall on a resource that may take some time to become available … why not put that process to “sleep” and “wake” some fortunate process up?

Well.. there are many reasons to NOT do that also (if processes are Real-time processes for example). Or to prevent “lockouts” etc etc

In the end, it boils down to a tradeoff between latency for the lucky few vs. throughput for the very many. And all shades in between, with considerations galore, a few of which are listed below ->

Explicit and Implicit “Blocking”, Critical Code Section Synchronizations, Network and Block Device Processing Latencies and Throughputs, Interrupt Latencies, “Deferred Processing”, Safety in Preemptability (preventing lockouts because we have preempted tasks that should not have been), and their denials of preemption / recursive depths of denials, relationships to interrupts and recursive relationships to the above, system-programming architectural considerations and requirements (Scheduler Priorities, Classes etc.), SMP / Cross-Processor considerations, memory management, x86 architectural considerations in Interrupt Latencies, etc etc.

Again, all this is probably a good review for our past students. We explain these specific x86 features, Linux Kernel concepts and more in detail in my classes ( Advanced Linux Kernel Programming @UCSC-Extension), and also in other classes that I teach independently. Please take note, and take advantage also, of upcoming training sessions. As always, Feedback, Questions and Comments are appreciated and will be responded to.


Give me your X86 Mode, make it Real, or else forget about it !

OK… I am talking about my friend Carlos Santana, and also about the difference between x86 Intel Real Mode Addressing and Protected Mode Addressing.

We will hit it up with Carlos later on in this blog. But business first for now, shall we ?

We would like to eventually explore just why .. oh why … Real Mode is limited to 1MB Memory.

And while doing so, also understand x86 “Protected Mode” addressing. We start with an x86 assembly instruction (AT&T syntax, because that is what the Linux Kernel uses), and illustrate address formation at the most important fundamental levels:

movl (%ebp), %eax

Let us look at Protected Mode Address formation first:

In the instruction shown above, we use EBP to form the “effective” address (the value in EBP IS the “effective” address), then we add the “base” from the default segment register, DS in this discussion (strictly speaking, an EBP-based effective address defaults to SS, but the mechanics are identical), to the “effective address” to come up with the “linear address”. That linear address is then looked up in the TLB and translated into a Physical Address (if Paging is enabled), or the Linear Address simply becomes the Physical Address if Paging is NOT enabled.

The “Selector” shown IS the value of the “Selector” component of the segment register DS (each segment register has a “Selector” component). In Protected Mode, the third least-significant bit of the “Selector” (bit 2) is called the “TI” / Table Indicator. It is used to select one of two tables (the GDT or the LDT); bits 15:3 then index into the GDT or the LDT, and we come up with an 8-byte “Segment Descriptor”.

That “Segment Descriptor” we just came up with has a 32-bit “Base”, which is added to the Effective Address (which, in our assembly instruction, is the value of the general-purpose register EBP), and the net result becomes the Linear Address. And that Linear Address either goes through the paging translation to determine the Physical Memory Address (if Paging IS enabled) … or IS the Physical Address if Paging is NOT enabled (as determined by the CR0.PG bit).
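A minimal sketch of that protected-mode arithmetic (illustrative C with a made-up descriptor type; the descriptor-table lookup and the limit check are not modeled):

/* movl (%ebp), %eax : effective address = EBP, linear = base + effective. */
#include <stdint.h>

struct segment_descriptor {
    uint32_t base;   /* 32-bit base assembled from the 8-byte descriptor */
    uint32_t limit;  /* checked by real hardware before the add; ignored here */
};

static uint32_t linear_address(const struct segment_descriptor *seg,
                               uint32_t effective)
{
    return seg->base + effective;   /* paging (if enabled) takes it from here */
}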

All of the above applies to x86 Protected Mode. How about “Real Mode” which does not have Descriptors etc or even Paging Mode ?

Well, in Real Mode the selector (DS) and the register sizes (BP) can only be 16 bits. The “Base” from the descriptor is replaced by the selector left-shifted by 4 (i.e. multiplied by 16), and that is added to the effective address (BP).

Note that this implies that the Linear Address tops out at (0xFFFF << 4) + 0xFFFF = 0x10FFEF, i.e. just a shade over 20 bits’ worth, or about 1 MB.

And we have just corroborated, 60-Minutes style, that x86 Real Mode is limited to roughly 1 MB of memory addressing.
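And the real-mode arithmetic, for completeness, in the same sketch style; the 0xFFFF/0xFFFF corner case shows the ceiling:

/* Real mode: linear = (16-bit selector << 4) + 16-bit effective address. */
#include <stdint.h>
#include <stdio.h>

static uint32_t real_mode_linear(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

int main(void)
{
    /* 0xFFFF0 + 0xFFFF = 0x10FFEF: a shade over 1MB, and no further. */
    printf("max = 0x%05X\n", (unsigned)real_mode_linear(0xFFFF, 0xFFFF));
    return 0;
}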

Of course, Carlos, x86 has now gone to Infinity and Beyond with Huge Pages and 48-bit linear addressing. And the implications are huge also, with the move beyond the measly (and respected) 4K page size to 2M/4M/1G page sizes. We have a blog below on this, and more coming.

As in some other key areas, Memory Management plays a key and consuming role in optimizing for multiprocessing / multicore and multithreaded execution models.

There is no substitute in this regard for our Course Advanced Kernel memory Management. Hit us up on it.

We explain these specific x86 features, Linux Kernel concepts and more in detail in my classes ( Advanced Linux Kernel Programming @UCSC-Extension), and also in other classes that I teach independently. Please take note, and take advantage also, of upcoming training sessions. As always, Feedback, Questions and Comments are appreciated and will be responded to. Cheers !

Take it away Carlos Santana !


4M/2M Pages, TLB Flushes, Huge Page (Splits) and Other Economy Busting “Non-Issues”

It is normal to reason, as we explain in our courses, that the Linux Kernel may organize at least a “few” pages as 4M pages. It can always be argued that we need not have enabled 4M paging, but that is not the point here. We need to figure out how to have our cake and eat it too. Hopefully.

SO … you guessed it: when push comes to shove, and for reasons as good as any I can come up with to the contrary, the buddy allocator becomes the busting allocator, busting up these 4M pages into the teeny-weeny 4K pages that we really needed to allocate in the first place. And then … needless to say, performance hell breaks loose with the TLB flushes that follow the reset of ANY PDE’s “PS” bit (just in case you missed this on 60 Minutes: Intel requires the TLB flushes, since the behavior for pages left in the TLBs when PS bits in the page directory get the “reset” treatment on the likes of page splits is IMPLEMENTATION-SPECIFIC). I call that a bug-turned-non-feature. Let’s move on, shall we?

You may argue: once busted, forever split. But we do know that the Buddy allocator will not let bygones be bygones, right? It can set the PDE “PS” bit again on buddy order-coalescing etc.

The questions apply to PAE mode, and to 64-bit modes also. I went looking to see just what percentage of free pages were 4M pages in my VM (no-PAE kernel compile option, 4M pages enabled); here it is ->

We all know the benefits of enabling 4M pages, yak yak, but how about the performance lost to TLB-entry flushes when just one, or just about one, of these Huge Pages is split up? The data we just presented seems to suggest that if we enable 4M pages, performance loss would be as probabilistic as rain in the rain forests. Especially for “higher-performance” systems at the cusp of 4K and 2M/4M page sizes. I mean, being in one camp (4K only) or the other (2M/4M) is so much more preferable. But reality rarely obliges our whims and wishes. And the trend to disappoint in that regard will continue, we surmise.

While the issues apply to PAE and 64-bit addressing modes as well, a pic of the 32-bit 4K-to-4M paging transition, driven by the PDE “PS” bit, is included here ->

After all this, is there any way of measuring dynamic, load-based performance loss due to TLB flushes against the prevailing context of page allocations? We do know it would be “traffic-dependent”, especially for caching and web servers, where performance losses would, in all probability, be most noticeable and critically offensive also. Is there any way for lesser machines to take advantage of 2M/4M page sizes, and not burden a perhaps predominantly 4K allocation mix? And vice versa also? Or do we just have to have customized and predetermined “preallocations” of Huge/4K pages based on application-level heuristics? And if so, how is that to be accomplished, if at all?
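There is no single counter for this, but as a starting point one can at least watch the mix of page sizes backing the kernel's direct map via the DirectMap lines that x86 builds expose in /proc/meminfo; counting free 4M-sized blocks, as in the little experiment above, is a job for /proc/buddyinfo instead. A minimal sketch:

/* Print the DirectMap4k / DirectMap2M (and, where present, DirectMap1G)
 * lines from /proc/meminfo. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "DirectMap", 9) == 0)
            fputs(line, stdout);
    fclose(f);
    return 0;
}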

We discuss Linux Kernel concepts and more in my classes with code walk throughs and programming assignments ( Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently). Please take note of upcoming training sessions. As always, Feedback, Questions and Comments are appreciated and will be responded to.


Linux Kernel Address Spaces, Zone Pages and Allocations etc

Managing the processor’s memory (virtual and physical) is a key component of the Linux Kernel that is intimately tied to performance and scalability.

Processor physical memory is allocated by ZONEs. But how about an Address Space (struct address_space)?

At least one place where an ADDRESS SPACE (struct address_space) is initialized is when its INODE (struct inode) is created (on an open system call, for example).

It is at least at this time also that the flags of the ADDRESS SPACE are initialized to indicate the ZONE from which the pages belonging to it will be allocated.

When it comes time to allocate a page, the page needs to be selected from the appropriate ZONE as indicated by the Address Space. Page flags are also initialized at INIT time to record the ZONE each page belongs to (free_area_init_core). This facilitates returning the page to the correct ZONE‘s free list when it is freed up (when a process terminates, for example).
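As a hedged sketch of that flow (paraphrased kernel-style C, not verbatim kernel source; mapping_set_gfp_mask(), mapping_gfp_mask() and __page_cache_alloc() are the accessors in the kernels of this era):

#include <linux/pagemap.h>
#include <linux/gfp.h>

/* At inode-creation time: record which zones page-cache pages for this
 * mapping may come from. */
static void init_mapping_zone_hint(struct address_space *mapping)
{
    mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
}

/* Later, when a page is needed: the stored mask steers the allocation to
 * the intended zone, and the page returns to that zone's free list when
 * it is freed. */
static struct page *alloc_mapping_page(struct address_space *mapping)
{
    return __page_cache_alloc(mapping_gfp_mask(mapping));
}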

I will blog on this at length. Please subscribe to our mail list for automated updates on new blog entries. I explain Linux Kernel concepts and more in my classes ( Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently). As always, Feedback, Questions and Comments are appreciated and will be responded to.

Thanks


Avoiding TLB Flushes on Context Switches on x86 Processors: The PCID

CR3 and CR4 are control registers on x86 processors that are used to configure and manage protected-mode functionality. MOVs to and from them may only be executed at CPL (Current Privilege Level) == 0, i.e. in Kernel Mode.
In the x86, linear addresses are translated by the TLB into physical addresses (assuming the page-table walks have been done prior to the lookup, etc.).
A CR3 load switches out the page tables. In the Linux Kernel, it is executed while scheduling in a new process at context-switch time (context_switch). Before the days of the PCID (see below), a load of CR3 flushed the TLB.
Avoiding TLB flushes on loads of CR3 is key to avoiding performance hits on context switches. In other words, a processor really needs to facilitate the storage of multiple address spaces in the TLB across context switches.
Process-context identifiers (PCIDs) are a facility in x86 processors  by which a logical processor may cache information for multiple linear-address spaces in the TLB.

The processor may retain cached information when software switches to a different linear-address space with a different PCID e.g., by loading CR3.

When using entries in the TLBs and paging-structure caches to translate a linear address, a logical processor uses only those entries associated with the current PCID.

i.e. A PCID is a 12-bit identifier, and may be thought of as a “Process-ID” for TLBs.

If CR4.PCIDE = 0 (bit 17 of CR4), the current PCID is always 000H; otherwise, the current PCID is the value of bits 11:0 of CR3.

Non-zero PCIDs are enabled by setting the PCIDE flag (bit 17 of CR4). Rules do apply when enabling PCIDs on x86 processors; caveat emptor. Naturally, restrictions apply to how an operating system may take advantage of this mechanism: context switches that require isolation of linear addresses between processes must be handled with care, since with PCIDs enabled, linear addresses of different processes may overlap while their translations to physical memory differ and sit side by side in the TLB! Ouch.
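As a sketch of the plumbing (make_cr3() is a hypothetical helper, not a kernel API): with CR4.PCIDE = 1, bits 11:0 of CR3 carry the PCID, and bit 63 of the value MOVed into CR3 asks the processor to preserve the TLB entries already tagged with that PCID.

#include <stdint.h>

#define CR3_NOFLUSH   (1ULL << 63)   /* keep TLB entries tagged with this PCID */
#define CR3_PCID_MASK 0xFFFULL       /* bits 11:0 */

static inline uint64_t make_cr3(uint64_t pgd_phys, uint16_t pcid, int noflush)
{
    uint64_t cr3 = (pgd_phys & ~CR3_PCID_MASK) | (pcid & CR3_PCID_MASK);

    if (noflush)
        cr3 |= CR3_NOFLUSH;
    return cr3;                      /* value for a MOV to CR3 at CPL 0 */
}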

More on the implications of this for Linux will follow.

I  explain Linux Kernel concepts and more in my classes ( Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently). As always, Feedback, Questions  and Comments are appreciated and will be responded to.
