As long as instructions stay in the pipeline, decode and execution appear to take one
cycle each. Pipeline flushes are the penalty, and that is where speculative execution
pays off ( also food for Meltdown- and Spectre-type security holes ). Such loops are
quite fast if the prediction was right.
Unrolling small loops only gives you a small advantage, and only if the predictor is
working well for your code. Unrolling large loops only gains a small amount
percentage-wise, as always.
If one unrolls by a large amount, one may end up with a cache miss, which can easily
eat up any benefit of unrolling the loop. Before speculative execution, unrolling had a
clear advantage.
Dwight
________________________________
From: cctalk <cctalk-bounces at classiccmp.org> on behalf of Eric Korpela via cctalk
<cctalk at classiccmp.org>
Sent: Wednesday, January 9, 2019 11:06 AM
To: ben; General Discussion: On-Topic and Off-Topic Posts
Subject: Re: OT? Upper limits of FSB
On Tue, Jan 8, 2019 at 3:01 PM ben via cctalk <cctalk at classiccmp.org> wrote:
I bet I/O loops throw everything off.
Even worse than you might think. For user-mode code you've got at least
two context switches, which typically cost thousands of CPU cycles. On the
plus side, when you start waiting for I/O the CPU will execute another
context switch to resume running something else while the I/O completes.
By the time you get back to your process, its memory is likely to be in L3
or back in main memory. Depending on what else is going on, that might add
1 to 50 microseconds per I/O just for context switching and reloading
caches.
Of course, in an embedded processor you can run in kernel mode and busy-wait
if you want.
Even fast memory-mapped I/O (e.g. a PCIe graphics card) that doesn't trigger
a page fault is going to have variable latency and will probably get
special cache handling.
--
Eric Korpela
korpela at
ssl.berkeley.edu
AST:7731^29u18e3