[ACCEPTED]-Parallelizing Sieve of Eratosthenes Algorithm for finding Prime Number-algorithm

Accepted answer
Score: 10

Note that using the Sieve of Eratostheens method to find the 24 prime numbers table, once you find a prime 23 number i - you set i*n as non-prime for each 22 n.

Note that for 2 elements that you know 21 they are prime numbers - i,j you can do it 20 in parallel, i does not require any information 19 from j and vise versa.
The same of course 18 holds for each k prime numbers.

Note however 17 that finding if a number is prime - depends 16 on last calculations! Thus, a barrier is needed 15 between marking numbers as non-primes and 14 finding next prime number to work with.

A 13 good place to start could be:

repeat until finished filling the table:
    1. Find k prime numbers [serially]
    2. Fill the table for these k numbers [parallely]
    3. after all threads finished step 2 [barrier] - return to 1

Since it seems 12 homework, I'll only give these hints, and 11 let you do the rest of the work. If you 10 later have any specific problem - ask a 9 new question and show us what you already 8 did.

EDIT: one more important hint, note that 7 if i is prime, the calculation of all the 6 non prime numbers that derive from i will 5 not affect the fact that j is prime for each 4 i < j < 2i. Use this fact to find the k prime numbers, you 3 don't want for example to take 2,3,4 as 2 prime numbers on your first iteration of 1 the algorithm.

Score: 5

Prime sieves use a lot of space.

In case 66 of 1 bit per number, you'd need 1 giga-bit 65 (2ˇ30) to store sieve for N = ~10ˇ9. And 64 1 terabit (2ˇ40) to store sieve for N = ~10ˇ12, which 63 is 2ˇ37 bytes = 2ˇ7 GB = 128GB. If only 62 storing odd numbers, 16 numbers/B, 64GB 61 is needed for N=10ˇ9. With your laptops 60 16GB you'll be limited to N < 2ˇ35. Good 59 to start cracking RSA 70-bit. If making 58 algorithm much more complex and not storing 57 multiples of 3 5 or 7 in the seive, 1 byte 56 would hold sieve-bits for 30 numbers, which 55 would let ~2ˇ36 = 2ˇ(30+6) = ~64 x 10ˇ9, dangerous 54 for RSA72 :lol: To crack RSA1024, youd need 53 (1024/2 = 512, 512-36 = 476)
2ˇ476 (~10ˇ159) times 52 more memory.

So memory usage is the main 51 problem.

After all, even fast RAM is hundreds 50 of times slower than CPU's L1 cache, so 49 you want to fit the data to 32KB or upto 48 256KB instead of zillion TBs.

As the sieve 47 is accessesd pretty randomly, it is not 46 automagically loaded to cache and you'd 45 get contineous cache-misses. You need to 44 go through the range of numbers finishing 43 the whole job chunk by chunk. While chunk-size 42 is your fastest memory (L1 L2 of CPU or 41 GPU).

Let's skip the complex part...

You'll 40 have a list of primes that are needed to 39 unmark composites in current junk. If already 38 sieving big numbers, lets say 10ˇ30 upto 37 10ˇ30+10ˇ8, the range of 100M numbers has 36 around 1M primes waiting in the list just 35 to unmark 1 of their composites, each prime 34 of that size would need (i guess) around 33 128B to be stored, so that you'd need 128MB 32 for it. Luckyly this list is accessed sequentially 31 and it may be already in cache when needed. But 30 anyway you'll need zillion bytes to store 29 the prime-lists for the next sieves.

About 28 multithreading, scalability to GPUs, thousands 27 of threads e.g.

Each prime in the list to 26 unmark 1 bit in sieve in the fastest memory, accesses 25 it basically randomly. GPu doesn't write 24 bit by bit, instead it writes a la 128B 23 per operation, it would be 1024bits. While 22 1 thread tries to unmark 1 bit, the other 21 thread unmarks the otherbit while the bit 20 of the 1st thread will have value 1 again. To 19 fence/barrier/lock the memory would stop 18 all the threads and no speed increase althoough 17 lots of threads running.

So threads should 16 not share the sieve. Would need to share 15 sieving-prime-lists instead, so that each 14 thread would have its own chunk of a chunk 13 of sieve, and would use the same primes 12 for unmarking. But after unmarking, they 11 need to "schedule" that prime to the shared 10 primes-list for its future sieve, which 9 will be executed after millions of sievings, at 8 the same time while other threads are changing 7 the same list. Pretyy much stuck again.

It 6 is very easy to make it parallel while slowing 5 it down. Rather often it would be faster 4 to recalculate something in each thread 3 than to ask it over a slow bus like RAM, PCIe, SSD, GBit-net... but 2 it is possible and not very complex if you 1 have only some threads.

Score: 1

An approach is to let a single thread find 7 the next prime number in the sieve, and 6 then let all threads mark the multiples 5 concurently. Every thread will be assigned 4 a different section of the array to avoid 3 memory sharing as much as possible. So every 2 thread needs to determine what range of 1 multiples it will handle.

Score: 0

First, should the parallelization affect 47 the creation of the table, or simply using 46 it to determine whether a number is prime 45 or not. In a real application, I'd do the 44 first off line, so the table would appear in 43 the final C++ code as a statically initialized 42 C style array. So the question of parallelization 41 would be irrelevant. And since nothing 40 should be modified in the second, you can 39 access from as many threads as you want, without 38 concern.

I suspect, however, that the purpose 37 of the exercise is for you to use multiple 36 threads to construct the table. (Otherwise, it 35 doesn't make sense.) In this case: the 34 table is constructed by a series of loops, with 33 steps 2, 3, 5... Each of these loops can 32 be executed in a separate thread, but... some 31 sort of synchronization will be needed for concurrent 30 accesses. If you treat the table as a single 29 object, with just one lock, you'll end up 28 either running the loops sequencially (because 27 you're acquiring the lock outside of the 26 loop), or spending more time acquiring and 25 releasing the lock than doing any real work. (Acquiring 24 and releasing an uncontested lock can be 23 very fast. But not as fast as just setting 22 a bool. And in this case, the lock is going to 21 be very, very contested, since all of the 20 threads want it most of the time.) If you 19 create a lock per bool, that's going to be an 18 awful lot of locks—it will probably 17 take less time to construct the table in a 16 single thread than to create all of the 15 mutexes.

Of course, in C++ (and perhaps in 14 Java as well), you'll want to use a bitmap, rather 13 than one bool per entry; the larger the table, the larger 12 the maximum number you can handle. (Something 11 like bool sieve[INT_MAX]; is almost certain to fail; you might 10 be able to get away with unsigned char sieve[INT_MAX / 8 + 1];, however.) In 9 this case, you'll need a mutex per element, not 8 per entry (which would be a single bit in 7 the element). Given that each mutex eats 6 up some resources as well, you probably 5 want to divide the table into discrete blocks, with 4 a mutex per block, and use a nested loop:

int j = 0;
for ( int i = 0; i < numberOfBlocks; ++ i ) {
    scoped_lock( mutex_table[i] );
    while ( j < (i + 1) * bitsInBlock ) {
        //  ...
        j += step;

Once 3 this is working, a bit of tuning will be 2 necessary to determine the optimal block 1 size (but I would guess fairly big).

Score: 0

Your teacher won't like this, but I guess 10 there is a brutal approach which can be 9 worth considering.

Just let every thread 8 repeat

find the next prime in the sieve
mark all multiples of this prime

independently of all others and without 7 any synchronization. Every thread stops 6 when there it finds no more prime.

This is 5 brutal because several threads may work 4 on the same prime by accident, but the final 3 sieve will be correct (all composites detected), and 2 what you loose in terms of duplicate work 1 could be regained by the absence of synchronization.

More Related questions