6 private links
Real-time performance monitoring, done right! http://netdata.firehol.org
Each of the CPUs has its own set of registers and modes. Only the memory is shared between them. That means that, in order to put the 8 cores of an i7 into long mode, we have to execute the very same procedure for each of the cores, because each core has its own register set, GDT, LDT etc. Therefore, we are able to start a CPU in real mode and keep it there, while directing another CPU to long mode
List of all register on a x86_X64 cpu
For one thing, chips have wider registers and can address more memory. In the 80s, you might have used an 8-bit CPU, but now you almost certainly have a 64-bit CPU in your machine. I’m not going to talk about this too much, since I assume you’re familiar with programming a 64-bit machine. In addition to providing more address space, 64-bit mode provides more registers and more consistent floating point results (via the avoidance of pseudo-randomly getting 80-bit precision for 32 and 64 bit operations via x867 floating point). Other things that you’re very likely to be using that were introduced to x86 since the early 80s include paging / virtual memory and floating point.
L1 cache reference ......................... 0.5 ns
Branch mispredict ............................ 5 ns on recent CPU
L2 cache reference ........................... 7 ns 14x L1 cache
Mutex lock/unlock ........................... 25 ns
Main memory reference ...................... 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy ............. 3,000 ns = 3 µs
Send 2K bytes over 1 Gbps network ....... 20,000 ns = 20 µs
SSD random read ........................ 150,000 ns = 150 µs
Read 1 MB sequentially from memory ..... 250,000 ns = 250 µs 4X memory
Round trip within same datacenter ...... 500,000 ns = 0.5 ms 20x datacenter roundtrip
Read 1 MB sequentially from SSD* ..... 1,000,000 ns = 1 ms 80x memory, 20X SSD
Disk seek ........................... 10,000,000 ns = 10 ms
Read 1 MB sequentially from disk .... 20,000,000 ns = 20 ms
Send packet CA->Netherlands->CA .... 150,000,000 ns = 150 ms
LTO4 tape seek/access time ...... 55.000.000.000 ns = 55 s
The development of caches and caching is one of the most significant events in the history of computing. Virtually every modern CPU core from ultra-low power chips like the ARM Cortex-A5 to the highest-end Intel Core i7 use caches. Even higher-end microcontrollers often have small caches or offer them as options — the performance benefits are too significant to ignore, even in ultra low-power designs.
Hardware performance monitoring counters have recently received a lot of attention. They have been used by diverse communities to understand and improve the quality of computing systems: for example, architects use them to extract application characteristics and propose new hardware mechanisms; compiler writers study how generated code behaves on particular hardware; software developers identify critical regions of their applications and evaluate design choices to select the best performing implementation. We propose that counters be used by all categories of users, in particular non-experts, and we advocate that a few simple metrics derived from these counters are relevant and useful. For example, a low IPC (number of executed instructions per cycle) indicates that the hardware is not performing at its best; a high cache miss ratio can suggest several causes, such as conflicts between processes in a multicore environment.