Archive | Huge TLBs

The Tortuous Road to the Joys of Extending Physical Memory Ranges and Page Sizes

My dear Brethren and Fellow travellers in the land of x86 and the Linux Kernel, it is time to come clean on all the modes that extend x86 page sizes and physical memory addressing.

So here we go, Ka Boom Ka Beem: In 32-bit non-PAE mode, we have PSE and PSE36 (AKA PSE40). PSE gives us 4K/4M Paging with 32-bit PGDs and PTEs, and PSE36/40 is a cheap way to extend the physical memory addressing of those 4M pages to 64GB/1TB (Really!), still with 32-bit PGDs and PTEs.

In 32-bit PAE mode, the architecture is still 32-bit, but the PDPTEs, PDEs and PTEs extend to 64 bits, thereby changing the Page sizes to 4K/2M (2M if the PTE level is not used), and extending physical memory addressing to 52 bits (4PB).

And last, but not least, we have IA-32e mode (true blue 64-bit architecture), with 48-bit linear addressing, 4K/2M/1G page sizes (2M if the PTE level is not used, 1G if the PDE and PTE levels are both not used), 4PB of physical memory addressing, and 64-bit PDPTEs, PDEs and PTEs.
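
For those who like to see the arithmetic, here is a minimal sketch of where the page sizes and memory limits above come from. It assumes 4K paging structures, so an entry width of 4 or 8 bytes gives 1024 or 512 entries per table; the function and constant names are made up for this illustration and come from no kernel header.

```python
# Illustrative arithmetic only; all names here are hypothetical.

PAGE_4K = 4 * 1024

# Entries per paging structure: one 4K table divided by the entry size.
ENTRIES_32BIT = PAGE_4K // 4   # 1024 entries (32-bit entries, non-PAE)
ENTRIES_64BIT = PAGE_4K // 8   # 512 entries (64-bit entries, PAE and IA-32e)


def page_size(entries_per_table, levels_skipped):
    """Bytes mapped by one entry when the lowest `levels_skipped` levels are unused."""
    return PAGE_4K * entries_per_table ** levels_skipped


def addressable_bytes(physical_address_bits):
    """Bytes reachable with the given number of physical address bits."""
    return 1 << physical_address_bits


def human(n):
    """Format a power-of-two byte count with a binary unit suffix."""
    for unit in ("B", "K", "M", "G", "T", "P"):
        if n < 1024:
            return f"{n}{unit}"
        n //= 1024
    return f"{n}E"


if __name__ == "__main__":
    print("non-PAE, PTE level unused:", human(page_size(ENTRIES_32BIT, 1)))
    print("PAE/IA-32e, PTE level unused:", human(page_size(ENTRIES_64BIT, 1)))
    print("IA-32e, PDE and PTE levels unused:", human(page_size(ENTRIES_64BIT, 2)))
    for bits in (36, 40, 52):
        print(bits, "physical address bits:", human(addressable_bytes(bits)))
```

Skipping one level multiplies the page size by the number of entries in a table, which is why the large page is 4M with 32-bit entries and 2M with 64-bit entries.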

John, did I get it all, and right also, this time ? DID you get the Cigar too ? Haaah ?

We all … must admit to an acute case of x86-itis, so from the next post onwards, it is time to switch gears. We will start looking next at the implications for the Kernel of some of these x86-isms, on which we have waxed eloquent in this and prior posts.

I will also note that we do a good job of explaining these concepts in some of our training sessions, going by course evaluations and by feedback from clients (some of which are ISPs).

I do hope everyone enjoyed this post. Thanks again

Mitigating the Performance impacts of TLB Flushes on Context Switches

We all know, presumably, that a MOV to CR3 (the PDBR) is an essential part of the Linux Kernel’s context_switch routine. This is necessary, since the incoming process may use a different set of page tables, but the MOV to CR3 also flushes the TLB, thereby forcing Page Table Walks.

Avoiding TLB flushes on loads of CR3 is key to avoiding performance hits on context switches. In other words, a processor really needs a way to keep the cached translations of an address space in the TLB across context switches.

In “pure architectures” (which the x86 is NOT, and for good reasons of backward compatibility etc), the PID (Process ID) would have been “hashed” into TLB addressing, thereby avoiding the need for TLB Flushes on context switches. Not so with x86, since the PID is not part and parcel of the x86 Architecture.

Process-context identifiers (PCIDs) are a facility in x86 by which a logical processor may cache information for multiple linear-address spaces in the TLB, and preserve it across context switches.

With PCIDs enabled, the processor can retain cached Page-Table information when software switches to a different linear-address space by loading CR3, and presumably to a different Process (we ARE executing a context_switch).

A PCID is a 12-bit identifier, and may be thought of as a “Process-ID” for TLBs. Non-zero PCIDs are enabled by setting the PCIDE flag (bit 17 of CR4). If CR4.PCIDE = 0, the current PCID is always 000H; otherwise, the current PCID is the value of bits 11:0 of CR3.
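
The rule above is small enough to write down. This is a minimal sketch that only models the bit positions just described; the register values passed in are made up, and the names `CR4_PCIDE`, `CR3_PCID_MASK` and `current_pcid` are hypothetical labels for this illustration.

```python
CR4_PCIDE = 1 << 17      # PCIDE flag, bit 17 of CR4
CR3_PCID_MASK = 0xFFF    # bits 11:0 of CR3


def current_pcid(cr3, cr4):
    """Return the current PCID for the given (made-up) CR3 and CR4 values."""
    if not cr4 & CR4_PCIDE:
        return 0x000     # PCIDs disabled: the current PCID is always 000H
    return cr3 & CR3_PCID_MASK


if __name__ == "__main__":
    cr3 = 0x12345ABC     # arbitrary example value, not taken from real hardware
    print(hex(current_pcid(cr3, cr4=0)))
    print(hex(current_pcid(cr3, cr4=CR4_PCIDE)))
```
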
When a logical processor creates entries in the TLBs (Section 4.10.2 of the x86 programming reference manual) and paging-structure caches (Section 4.10.3), it associates those entries with the current PCID (Oh … such a loose association of PCID with PID). Note that the PCID rides in the low bits of the same CR3 value that says where the PGD is located, so the location of the PGD and its “process context” tag are loaded together. When using entries in the TLBs and paging-structure caches to translate a linear address, a logical processor uses only those entries associated with the current PCID, and hence entries belonging to other PCIDs need not be flushed.
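
To make the tagging idea concrete, here is a toy model, not a description of any real TLB. It assumes unlimited capacity and ignores global pages, invalidation instructions and PCID reuse; `ToyTLB` and `walks_for_ping_pong` are hypothetical names. Entries are keyed by (PCID, virtual page number), and the model counts how many page-table walks two alternating address spaces cause with and without tagging.

```python
class ToyTLB:
    """Toy TLB whose entries are keyed by (pcid, virtual page number)."""

    def __init__(self, use_pcid):
        self.use_pcid = use_pcid
        self.entries = {}
        self.pcid = 0
        self.walks = 0   # page-table walks caused by TLB misses

    def load_cr3(self, pcid):
        """Model a context switch to the address space tagged `pcid`."""
        if self.use_pcid:
            self.pcid = pcid         # entries of other PCIDs stay cached
        else:
            self.pcid = 0
            self.entries.clear()     # untagged TLB: everything is flushed

    def translate(self, vpn):
        """Look up a virtual page number, counting a walk on a miss."""
        key = (self.pcid, vpn)
        if key not in self.entries:
            self.walks += 1
            self.entries[key] = True  # stand-in for the cached translation


def walks_for_ping_pong(use_pcid, switches=4, pages=8):
    """Alternate between two address spaces, touching the same pages each time."""
    tlb = ToyTLB(use_pcid)
    for i in range(switches):
        tlb.load_cr3(pcid=1 + i % 2)
        for vpn in range(pages):
            tlb.translate(vpn)
    return tlb.walks


if __name__ == "__main__":
    print("walks without PCID:", walks_for_ping_pong(use_pcid=False))
    print("walks with PCID:   ", walks_for_ping_pong(use_pcid=True))
```

In this toy, the untagged case walks the page tables again after every switch, while the tagged case walks only the first time each address space touches a page.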

With the x86, my dear brothers and sisters in grief and joy, we take what we can get, and run. In this case, where TLB flushes are avoided for what will turn out to be 99% of the *current* address space, that is more than we could have bargained for with Intel. I say.. Good Job Intel.


LinuxEco Cross-Functional Quiz#1

OK… we have talked of Huge Pages in an earlier blog, and we have referred to Kernel Preemption.

So we know now that a Huge TLB Page can be GB sized…. and so it stands to reason that a preemption is involved somewhere in the kernel API hugetlb_fault. Right ? GF-rins.

SO … where is it that the Kernel is Possibly attempting a Preemption  on a page fault on a huge page ? More importantly, WHY would we do that ? Do we dig it now ? Haaaaaah ? Tell me then.

We explain these specific x86 and Linux Kernel features, concepts and more in detail in our class Advanced Kernel Memory Management, the second offering of which is to be made Corporate-direct. We also suggest the Spring Session 2012 UCSC-Extension Advanced Linux Kernel Programming.

Please take note of these and other upcoming Linux kernel training sessions. As always, Feedback, Questions and Comments are appreciated and will be responded to.

4M/2M Pages, TLB Flushes, Huge Page (Splits) and Other Economy Busting “Non-Issues”

It is normal to reason, as we explain in our courses, that the Linux Kernel may organize at least a “few” pages as 4M pages. It can always be argued that we need not have enabled 4M paging, but that is not the point here. We need to figure out how to have our cake and eat it too. Hopefully.

SO…you guessed it: when push comes to shove, and for reasons that are as good as the one I can come up with to the contrary, the buddy allocator becomes the busting allocator, busting up these 4M pages into the teeny weeny 4K pages that we really needed to allocate in the first place. And then … needless to say, performance hell breaks loose with TLB flushes following the reset of ANY PDT’s “PS” bit. (Just in case you missed this on 60-minutes: Intel requires TLB Flushes because the behavior of entries left in the TLBs, once the PS bit in the PDT gets the “Reset” treatment on the likes of page-splits etc etc, is IMPLEMENTATION SPECIFIC.) I call that a bug-turned-nonfeature. Let's move on, shall we ?
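
A rough sketch of what one split costs in mappings, as plain arithmetic. The 64-entry TLB used in the example is an assumed number chosen only for illustration, not the size of any particular processor's TLB, and the function names are hypothetical.

```python
PAGE_4K = 4 * 1024
PAGE_4M = 4 * 1024 * 1024


def small_pages_per_split(large_page=PAGE_4M, small_page=PAGE_4K):
    """Number of small-page mappings that replace one large-page mapping."""
    return large_page // small_page


def tlb_reach(entries, page_size):
    """Bytes of address space covered by `entries` TLB entries of one page size."""
    return entries * page_size


if __name__ == "__main__":
    assumed_entries = 64  # hypothetical TLB size, for illustration only
    print("4K mappings per split 4M page:", small_pages_per_split())
    print("reach with 4K pages (KB):", tlb_reach(assumed_entries, PAGE_4K) // 1024)
    print("reach with 4M pages (MB):", tlb_reach(assumed_entries, PAGE_4M) // (1024 * 1024))
```

So on top of the flush itself, the range that one TLB entry used to cover now competes for as many as 1024 entries.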

You may argue once busted, forever split, but we do know that the Buddy allocator will not let bygones be bygones, right ? It sets the PDT “PS” bit again on Buddy Order-Coalescing etc.

The questions apply to PAE mode, and to 64-bit modes also. I went looking to see just what % of free pages were 4M pages in my VM (NO PAE compile option, 4M Pages enabled). Here it is ->
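
One way a reader might repeat this kind of measurement is from /proc/buddyinfo, which lists the number of free blocks per order for each zone. The sketch below assumes 4K base pages, so an order-10 block is 4M, and it assumes the usual "Node N, zone NAME counts..." line layout. `SAMPLE` is fabricated input for illustration and is NOT the data from my VM. Note also that a free order-10 block is only a candidate for a 4M mapping; this says nothing about how it is actually mapped.

```python
# Made-up /proc/buddyinfo-style text, for illustration only.
SAMPLE = """\
Node 0, zone      DMA      0      0      0      0      0      0      0      0      0      0      1
Node 0, zone   Normal   1024      0      0      0      0      0      0      0      0      0      3
"""


def free_blocks_by_order(buddyinfo_text):
    """Sum the free-block counts per order across all zones."""
    totals = []
    for line in buddyinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "Node":
            continue
        counts = [int(x) for x in fields[4:]]  # skip "Node", "0,", "zone", name
        if len(totals) < len(counts):
            totals.extend([0] * (len(counts) - len(totals)))
        for order, count in enumerate(counts):
            totals[order] += count
    return totals


def share_in_order(buddyinfo_text, order=10):
    """Fraction of free base pages that sit in free blocks of exactly `order`."""
    totals = free_blocks_by_order(buddyinfo_text)
    pages = [count << o for o, count in enumerate(totals)]  # block = 2**order pages
    total_pages = sum(pages)
    if total_pages == 0 or order >= len(pages):
        return 0.0
    return pages[order] / total_pages


if __name__ == "__main__":
    print(f"{share_in_order(SAMPLE):.0%} of free pages are in order-10 blocks")
```

On a live Linux system the same function can be fed the contents of /proc/buddyinfo, for example `share_in_order(open("/proc/buddyinfo").read())`.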

We all know the benefits of enabling 4M Pages yak yak, but how about the performance lost on TLB Flushes when just one or just about one of these Huge Pages is split up ? Data we just presented seems to suggest that if we enable 4M Pages, performance loss would be as probabilistic as rain in the Rain Forests. Especially for “Higher Performance” Systems at the cusp of 4K and 2M/4M page sizes. I mean, being on one side of the camp (4K only) or the other (2M/4M) is so much more preferable. But reality rarely obliges our whims and wishes. And the trend to disappoint in that regard will continue, we surmise.

While the issues apply to PAE and 64-bit addressing modes as well, a pic of the 32-bit 4K to 4M paging transition, based on the PDT “PS” bit, is included here ->

After all this, is there any way of measuring dynamic load-based performance loss due to TLB Flushes against prevailing contexts of page allocations ? We do know it would be “traffic-dependent”, especially for caching and web Servers, where performance losses would, in all probability, be most noticeable and critically offensive also. Is there any way for lesser machines to take advantage of 2M/4M Page sizes, and not burden a perhaps predominantly 4K allocation ? And vice versa also ? Or do we just have to have customized and predetermined “preallocations” of Huge/4K Pages based on application-level heuristics ? And if so, how is that to be accomplished, if at all ?
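
This does not answer the questions above, but one crude observation point is the set of DirectMap lines that x86 Linux prints in /proc/meminfo (DirectMap4k plus DirectMap2M or DirectMap4M, depending on the paging mode). They report how much of the kernel's direct mapping currently uses each page size. The sketch below compares two snapshots. It shows only how much mapping moved to 4K pages between them, not the cost of the associated TLB flushes, and the `BEFORE` and `AFTER` texts are fabricated for illustration.

```python
# Made-up /proc/meminfo-style snapshots, for illustration only.
BEFORE = """\
MemTotal:        1048576 kB
DirectMap4k:        8192 kB
DirectMap4M:     1040384 kB
"""
AFTER = """\
MemTotal:        1048576 kB
DirectMap4k:       12288 kB
DirectMap4M:     1036288 kB
"""


def direct_map_kb(meminfo_text):
    """Return {page-size label: kB} parsed from the DirectMap* lines."""
    result = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name.startswith("DirectMap") and rest.split():
            result[name[len("DirectMap"):]] = int(rest.split()[0])
    return result


def moved_to_small_kb(before_text, after_text, small="4k"):
    """kB of direct mapping that moved into small pages between two snapshots."""
    before = direct_map_kb(before_text)
    after = direct_map_kb(after_text)
    return after.get(small, 0) - before.get(small, 0)


if __name__ == "__main__":
    print("kB newly mapped with 4K pages:", moved_to_small_kb(BEFORE, AFTER))
```

Sampling these lines under load would at least show how often large mappings are being split, which is one input to the "traffic-dependent" question.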

We discuss Linux Kernel concepts and more in my classes, with code walkthroughs and programming assignments (Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently). Please take note of upcoming training sessions. As always, Feedback, Questions and Comments are appreciated and will be responded to.
