Sometimes when programming we need to tune a small portion of code that is critical to an implementation. For example, an inner loop may involve a pixel-bashing operation which dominates the program’s overall performance. If this operation uses a comparison, and that results in the compiled code branching, it can hurt performance on pipelined CPUs. It may be better to find a branch-free alternative, even if it appears to make the code slightly more complex.
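A minimal sketch of the idea, clamping a pixel value to [0, 255] without branches (the function names are made up; the branch-free version assumes arithmetic right shift of a negative int, i.e. two's complement, which is the common case and guaranteed since C++20):

#include <cstdint>

// Branchy clamp: the comparisons may compile to conditional jumps.
inline uint8_t clamp_branchy(int v) {
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return static_cast<uint8_t>(v);
}

// Branch-free clamp: masks derived from sign bits replace the jumps.
// Assumes arithmetic right shift of negative int (two's complement).
inline uint8_t clamp_branchless(int v) {
    v &= ~(v >> 31);        // v < 0   -> mask is all ones, v becomes 0
    v |= (255 - v) >> 31;   // v > 255 -> low byte becomes 0xFF
    return static_cast<uint8_t>(v);
}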
Virtual dispatch can be relatively expensive in terms of clock cycles due to multiple levels of indirection, including an indirect branch as well as a possible this-pointer adjustment. Wise programmers do not use virtual dispatch without a good reason, but it is often required, either by design or when creating non-template reusable components/libraries where the final implementation of some parts of the program is not known.
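A rough sketch of the trade-off, with made-up Filter/Invert names: the first loop pays for an indirect call per pixel, while the template version lets the compiler inline the call:

// Each call through the Filter interface goes through the vtable:
// load the vtable pointer, then take an indirect branch.
struct Filter {
    virtual ~Filter() = default;
    virtual int apply(int pixel) const = 0;
};

struct Invert : Filter {
    int apply(int pixel) const override { return 255 - pixel; }
};

// Hot loop paying for one virtual dispatch per pixel.
void run_virtual(const Filter& f, int* pixels, int n) {
    for (int i = 0; i < n; ++i)
        pixels[i] = f.apply(pixels[i]);
}

// Non-virtual functor plus a template: the concrete type is known at
// compile time, so the call can be inlined and the indirect branch goes away.
struct InvertFn {
    int operator()(int pixel) const { return 255 - pixel; }
};

template <typename F>
void run_static(F f, int* pixels, int n) {
    for (int i = 0; i < n; ++i)
        pixels[i] = f(pixels[i]);
}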
Memoization is a pretty well-known optimization technique which consists in “remembering” (i.e.: caching) the results of previous calls to a function, so that repeated calls with the same parameters are resolved without repeating the original computation.
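A minimal illustration using the classic Fibonacci example (not tied to any particular library):

#include <cstdint>
#include <unordered_map>

// Memoized Fibonacci: each result is cached on first computation, so
// repeated calls with the same argument are answered from the cache.
uint64_t fib(unsigned n) {
    static std::unordered_map<unsigned, uint64_t> cache;
    if (n < 2) return n;
    auto it = cache.find(n);
    if (it != cache.end()) return it->second;
    uint64_t result = fib(n - 1) + fib(n - 2);
    cache.emplace(n, result);
    return result;
}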
Hardware performance monitoring counters have recently received a lot of attention. They have been used by diverse communities to understand and improve the quality of computing systems: for example, architects use them to extract application characteristics and propose new hardware mechanisms; compiler writers study how generated code behaves on particular hardware; software developers identify critical regions of their applications and evaluate design choices to select the best performing implementation. We propose that counters be used by all categories of users, in particular non-experts, and we advocate that a few simple metrics derived from these counters are relevant and useful. For example, a low IPC (number of executed instructions per cycle) indicates that the hardware is not performing at its best; a high cache miss ratio can suggest several causes, such as conflicts between processes in a multicore environment.
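As an example, perf stat can report the raw counters these metrics are derived from (exact event names vary by CPU and kernel; "program" stands for whatever you are measuring):

perf stat -e cycles,instructions,cache-references,cache-misses ./program
(IPC = instructions / cycles; cache miss ratio = cache-misses / cache-references)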
This article is an attempt to sum up a small number of generic rules that appear to be useful rules of thumb when creating high-performing programs. It first establishes some fundamental causes of performance hits and then extends them.
When optimizing memory access, and memory cache misses in particular, there are surprisingly few tools to help you. valgrind’s cachegrind tool is the closest one I’ve found. It gives you a lot of information on cache misses, but not necessarily in the form you need.
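For reference, a typical run looks roughly like this (cachegrind writes its results to a cachegrind.out.<pid> file, which cg_annotate then summarizes per function):

valgrind --tool=cachegrind ./program
cg_annotate cachegrind.out.<pid>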
This is really black magic!
Also see: http://www.agner.org/optimize/microarchitecture.pdf
How to use SSE vector instructions
Also: http://felix.abecassis.me/2011/09/cpp-getting-started-with-sse/
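A small sketch using SSE intrinsics; it assumes n is a multiple of 4 and the pointers are 16-byte aligned (otherwise use _mm_loadu_ps/_mm_storeu_ps):

#include <immintrin.h>

// Add two float arrays four elements at a time with SSE.
void add_sse(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);   // load 4 packed, aligned floats
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));  // add and store 4 floats
    }
}

Compile with -msse (already implied on x86-64, where SSE is always available).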
perf is a performance tool like gprof, but it's more user-friendly :)
Use it like this:
perf record -p $(pidof program)
(time passes, press Ctrl-C)
perf report -i perf.data