Do lock-free algorithms really perform better than their lock-full counterparts?
Beyond the simple cases of the InterlockedXxx functions, it seems like the prevailing pattern with all of these is that they implement their own locks.
None of the answers here really seem to get to the heart of the difference between a "lock-free" CAS loop and a mutex or spin-lock.
The important difference is that lock-free algorithms are guaranteed to make progress on their own - without the assistance of other threads. With a lock or spin-lock, any poor thread that can't acquire the lock is entirely at the mercy of the thread that owns it. The thread that can't acquire the lock can do nothing except wait (either via a busy wait or an OS-assisted sleep).
With lock-free algorithms that loop on a CAS, each thread is guaranteed to make progress regardless of what other contending threads are doing. Each thread is, essentially, in control of its own fate. Yes, it may still have to loop many times, but the number of times it loops is limited by the number of contending threads. It cannot, for the most part, loop infinitely. (In practice, livelock is possible - e.g., an LL/SC loop that keeps failing due to false sharing - but again, measures can be taken by the thread itself to deal with this; it is not at the mercy of another thread holding a lock.)
As for performance, it depends. I've seen flagrant examples of lock-free algorithms being totally outperformed by their locking counterparts, even under high thread contention. On an x86-64 machine running Debian 7, I compared the performance of the C++ Boost.Lockfree queue (based on the Michael/Scott algorithm) against a plain old std::queue surrounded by a std::mutex. Under high thread contention, the lock-free version was almost twice as slow.
So why is that? Well, the performance of lock-free algorithms ultimately comes down to the implementation details. How does the algorithm avoid ABA? How does it accomplish safe memory reclamation? There are so many variants... tagged pointers, epoch-based reclamation, RCU/quiescent state, hazard pointers, general process-wide garbage collection, etc. All these strategies have performance implications, and some also place restrictions on how your application in general can be designed. In general, reference-counting approaches (or tagged-pointer approaches) tend to perform poorly, in my experience. But the alternatives can be much more complex to implement, and require a lot more memory-reclamation infrastructure based around thread-local storage or generalized garbage collection.
In general, lock-free algorithms are less efficient per thread - you're doing more work, as you mentioned, to implement a lock-free algorithm than a simple lock.
However, they do tend to dramatically improve the overall throughput of the algorithm as a whole in the face of contention. Thread-switching latency and context switches, repeated across many threads, slow down the throughput of your application dramatically. Lock-free algorithms are effectively implementing their own "locks," but they do so in a manner that prevents or reduces the number of context switches, which is why they tend to outperform their locking counterparts.
That being said - most of this depends on the algorithm (and implementation) in question. For example, I've got some routines that I managed to switch to .NET 4's new concurrent collections instead of the previous locking mechanisms, and have measured improvements of nearly 30% in my total algorithm speed. That being said, there are many benchmarks you can find that show reduced performance using some of these same collections compared to a basic lock. As with all performance optimizations - you really don't know until you measure.
Lock-free isn't necessarily any faster, but it can eliminate the possibility of deadlock or livelock, so you can guarantee that your program will always make progress toward finishing. With locks, it's difficult to make any such guarantee - it's all too easy to miss some possible execution sequence that results in a deadlock.
Past that, it all depends. At least in my experience, differences in speed tend to depend more on the skill level deployed in the implementation than on whether it uses locks or not.
Under Windows on x64, a straightforward (no combining array in front of the freelist) lock-free freelist is about an order of magnitude faster than a mutex-based freelist.
On my laptop (Core i5), for a single thread, with lock-free I get about 31 million freelist operations per second, vs. about 2.3 million operations per second with a mutex.
For two threads (on separate physical cores), with lock-free I get about 12.4 million freelist operations per second per thread. With a mutex, I get about 80 THOUSAND operations per second.
The primary advantage of genuinely lock-free algorithms is that they are robust even if a task gets waylaid (note that lock-free is a tougher condition than "not using locks" (*)). While there are performance advantages to avoiding unnecessary locking, the best-performing data structures are often those which can operate lock-free in many cases, but which can use locks to minimize thrashing.
(*) I've seen some attempts at "lock-free" multi-producer queues where a producer that got waylaid at the wrong time would prevent consumers from seeing any new items until it completed its work; such data structures shouldn't really be called "lock-free". One producer that gets blocked won't block other producers from making progress, but may arbitrarily block consumers.
Lock-free algorithms can absolutely be faster than their blocking counterparts. But of course the inverse is true as well. Assuming the implementation performs better than the locking counterpart, the only limiting factor is contention.
Take the two Java classes, ConcurrentLinkedQueue and LinkedBlockingQueue. Under moderate real-world contention, the CLQ outperforms the LBQ by a good amount. With heavy contention, the use of suspending threads allows the LBQ to perform better.
I disagree with user237815. The synchronized keyword doesn't require as much overhead as it once did, but relative to a lock-free algorithm it does have a good amount of overhead associated with it compared to a single CAS.
Recently at JavaOne Russia, an Oracle employee (who specializes in Java performance and benchmarks) showed some measurements of operations per second under parallel access to a simple int counter, using CAS (lock-free, in effect a high-level spinlock) and classic locks.
According to these, spin-locks have better performance only while a small number of threads is contending for the monitor.
In Java, at least, locking by itself can be very quick. The synchronized keyword doesn't add a lot of overhead. You can benchmark it yourself just by calling a synchronized method in a loop.
Locking only gets slow when there is contention, and the process being locked isn't instantaneous.
Lock-free also has the advantage that it does not sleep. There are places in kernels where you are not permitted to sleep - the Windows kernel has a bunch of them - and that painfully restricts your ability to use data structures.