Programming
This forum is for all programming questions. The question does not have to be directly related to Linux and any language is fair game.


Just wrote my qsort() replacement, wanna share it with the community and stress my variant with more versatile data...

Generally, we have three major scenarios:
few distinct keys;
many distinct keys;
ALL distinct keys.

In the first two cases, the new implementation is faster than or equal to the "Standard" one.
Sadly, in the third case it is somewhat slower; this should be investigated further.
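The three scenarios can be sketched with a hypothetical key generator (the helper name and the generator itself are mine, not from the benchmark - only the number of distinct keys matters here, not the quality of the randomness):

```c
#include <stdint.h>
#include <stdlib.h>

/* Fill buf[] with n keys drawn from 'distinct' possible values.
   Two rand() calls are combined because RAND_MAX may be as small as
   32767, which would not cover the MANY-distinct scenario. */
static void fill_keys(uint64_t *buf, size_t n, uint64_t distinct, unsigned seed)
{
    srand(seed);
    for (size_t i = 0; i < n; i++)
        buf[i] = ((uint64_t)rand() * RAND_MAX + (uint64_t)rand()) % distinct;
}

/* few:  fill_keys(buf, n, 10, seed)       -> scenario 1
   many: fill_keys(buf, n, 2847531, seed)  -> scenario 2
   all:  distinct close to n               -> scenario 3 */
```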

Anyway, benchmarking will show the practical value of 'Magnetica'; personally, I have started using it already.
Please give it a try; my new partitioning scheme is not perfect, but it pushes the limitations of the old ones.

Didn't have time to add the second scenario, but the results (for the 1st scenario, with 10 distinct keys) are promising:
273.8/3.7 = 74x faster

Now attached as a .txt; LQ doesn't allow .c attachments, hmm.
As for Timsort, I've never looked it up; I just know it is among the major performers.
AFAIK, my 'Magnetica' partitioning is a new one. Nothing special: I just did the swaps on paper for an hour and saw that there is a crucial moment where unnecessary swaps are to be avoided. I wonder whether another guy had it before me, just to acknowledge it; I am too lazy to look up others' sources, but I love benchmarking against them.

The "performance" of any sorting algorithm, as measured in microseconds, is always dependent upon the data. However, the practical performance of any algorithm is usually described in terms of "order," as in O(log2 N). Shell Sort will generally out-perform Bubble Sort, and Quicksort will generally out-perform Shell Sort.
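For concreteness, the glibc qsort baseline that 'Magnetica' is benchmarked against can be wrapped for 64-bit keys like this (a generic sketch with names of my own choosing, not the benchmark's actual harness):

```c
#include <stdint.h>
#include <stdlib.h>

/* Comparator for 64-bit unsigned keys; (a > b) - (a < b) avoids the
   overflow that a plain subtraction of the two values would risk. */
static int cmp_u64(const void *pa, const void *pb)
{
    uint64_t a = *(const uint64_t *)pa, b = *(const uint64_t *)pb;
    return (a > b) - (a < b);
}

/* the "Standard" performer in the showdown */
static void sort_u64(uint64_t *keys, size_t n)
{
    qsort(keys, n, sizeof *keys, cmp_u64);
}
```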

I suggest that you create a new Github project for your new algorithm, and introduce it for discussion in various places, including the Software Engineering forum at StackExchange.

No matter how fast the processor, a 100 MB external sort using a single 1993-vintage SCSI disk takes about one minute of elapsed time. This one-minute barrier is created by the 3MB/s sequential transfer rate (bandwidth) of a single commodity disk. We measured both the OpenVMS Sort utility and AlphaSort to take a little under one minute when using one SCSI disk. Both sorts are disk-limited. A faster processor or faster algorithm would not sort much faster because the disk reads at about 4.5 MB/s, and writes at about 3.5 MB/s. Thus, it takes about 25 seconds to read the 100 MB, and about 30 seconds to write the 100 MB answer file. Even on mainframes, sort algorithms like Syncsort and DFsort are limited by this one-minute barrier, unless disk or file striping is used.

>However, the practical performance of any algorithm is usually described in terms of "order," as in O(Log2N).

Indeed; however, my focus is on 'perf'. You are welcome to see the statistics below and draw even better PRACTICAL conclusions on whether a given performer is good enough or not.

>I suggest that you create a new Github project for your new algorithm, and introduce it for discussion in various places, including the Software Engineering forum at StackExchange.

Nah, I prefer sharing it with fellow forum members; a few times I got treated as a spammer on stack*. Github is cool, but I seek feedback from amateurs (in the original sense: lovers of something) like me. Benchmarking is what interests me; of course, if some coder improves on it, then even better - we all win.
After all, all good things will make their way by themselves, naturally.

>Slightly off-topic. While searching XFCE documentation on their site, I stumbled upon an old article from year 1995 on AlphaSort.

Thanks, no idea what it is, but for thirty long years I have admired the work of Bob Pirko and his RPSORT.ASM - a 16-bit sorter. Yet I never looked into his source code - I didn't want to "steal" his ideas. Weird; maybe there are some ideas worth reviving.

Looking calmly at 'Magnetica' for a few minutes, I couldn't miss the obvious optimization of moving 'PR+1' outside the if-elseif-else; now the refined revision is smaller and faster.
Now we can benchmark either with QS_bench_r2.c or by adding 'Magnetica' to your code.
You need "22338618_QWORDS.bin" and "mobythesaurus.txt" from this package: https://www.qb64.org/forum/index.php...0;attach=14875

As for the third case (ALL distinct keys), I want 2,000,000,000 keys, but how do I shuffle them so that the keys come out in the same order on every run?
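One way to get a reproducible shuffle is to drive a Fisher-Yates pass with a deterministic PRNG seeded by a fixed constant, so every benchmark run sees the keys in exactly the same order. A sketch with a hand-rolled xorshift64 (not part of the benchmark; the seed must be nonzero):

```c
#include <stdint.h>

/* xorshift64 PRNG: the same seed yields the same sequence on every run,
   unlike rand(), whose sequence can differ between libc versions. */
static uint64_t xorshift64(uint64_t *s)
{
    uint64_t x = *s;
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return *s = x;
}

/* Fill a[] with 0..n-1 (ALL distinct), then Fisher-Yates shuffle with a
   fixed seed for a bit-identical key order across runs. */
static void make_shuffled_keys(uint64_t *a, uint64_t n, uint64_t seed)
{
    for (uint64_t i = 0; i < n; i++) a[i] = i;
    uint64_t s = seed;
    for (uint64_t i = n - 1; i > 0; i--) {
        uint64_t j = xorshift64(&s) % (i + 1);
        uint64_t t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```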

In the near future, I intend to make a showdown between Bentley-McIlroy 3-way partitioning and 'Magnetica'; I am not convinced their scheme is better, we will see...

This is the refinement which is used in QS_bench_r2.c:

Glad to share the first two cases benchmarked with 'Magnetica' rev.2. The console log on my Fedora 33 laptop, i5-7200U (3.1GHz max turbo), 36GB DDR4-2133:

Code:

[kaze@kaze Quicksort_says_rev3]$ gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/10/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC)
[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 qsort few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Checking whether sort went correct... OK.
Performance counter stats for './QS_bench_r2 qsort few':
372,145.95 msec task-clock:u # 0.998 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,363,067 page-faults:u # 0.012 M/sec
1,123,792,986,601 cycles:u # 3.020 GHz (50.00%)
3,268,399,480,738 instructions:u # 2.91 insn per cycle (62.50%)
650,531,988,573 branches:u # 1748.056 M/sec (62.50%)
5,519,977,909 branch-misses:u # 0.85% of all branches (62.50%)
734,216,718,766 L1-dcache-loads:u # 1972.927 M/sec (62.50%)
6,752,096,756 L1-dcache-load-misses:u # 0.92% of all L1-dcache accesses (62.50%)
423,592,362 LLC-loads:u # 1.138 M/sec (50.00%)
126,404,435 LLC-load-misses:u # 29.84% of all LL-cache accesses (50.00%)
372.770596352 seconds time elapsed
364.324923000 seconds user
5.532901000 seconds sys
[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 Magnetica few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Checking whether sort went correct... OK.
Performance counter stats for './QS_bench_r2 Magnetica few':
45,966.43 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,363,067 page-faults:u # 0.095 M/sec
123,532,537,542 cycles:u # 2.687 GHz (50.00%)
101,337,186,572 instructions:u # 0.82 insn per cycle (62.50%)
24,106,121,927 branches:u # 524.429 M/sec (62.50%)
3,277,959,707 branch-misses:u # 13.60% of all branches (62.50%)
13,329,593,401 L1-dcache-loads:u # 289.985 M/sec (62.50%)
1,655,094,422 L1-dcache-load-misses:u # 12.42% of all L1-dcache accesses (62.50%)
83,759,355 LLC-loads:u # 1.822 M/sec (50.00%)
69,790,047 LLC-load-misses:u # 83.32% of all LL-cache accesses (50.00%)
45.977603171 seconds time elapsed
40.202975000 seconds user
5.433257000 seconds sys
[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 qsort many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Checking whether sort went correct... OK.
Performance counter stats for './QS_bench_r2 qsort many':
547,697.55 msec task-clock:u # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,896,781 page-faults:u # 0.009 M/sec
1,660,777,393,784 cycles:u # 3.032 GHz (50.00%)
2,926,096,300,717 instructions:u # 1.76 insn per cycle (62.50%)
656,833,950,654 branches:u # 1199.264 M/sec (62.50%)
24,981,866,259 branch-misses:u # 3.80% of all branches (62.50%)
675,499,020,355 L1-dcache-loads:u # 1233.343 M/sec (62.50%)
8,384,728,679 L1-dcache-load-misses:u # 1.24% of all L1-dcache accesses (62.50%)
253,666,230 LLC-loads:u # 0.463 M/sec (50.00%)
149,205,273 LLC-load-misses:u # 58.82% of all LL-cache accesses (50.00%)
547.972289671 seconds time elapsed
538.278672000 seconds user
5.984880000 seconds sys
[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 Magnetica many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Checking whether sort went correct... OK.
Performance counter stats for './QS_bench_r2 Magnetica many':
241,082.32 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,896,781 page-faults:u # 0.020 M/sec
719,580,616,824 cycles:u # 2.985 GHz (50.00%)
788,555,184,960 instructions:u # 1.10 insn per cycle (62.50%)
152,383,462,963 branches:u # 632.081 M/sec (62.50%)
18,722,063,529 branch-misses:u # 12.29% of all branches (62.50%)
105,308,386,791 L1-dcache-loads:u # 436.815 M/sec (62.50%)
9,762,518,793 L1-dcache-load-misses:u # 9.27% of all L1-dcache accesses (62.50%)
291,667,293 LLC-loads:u # 1.210 M/sec (50.00%)
240,922,518 LLC-load-misses:u # 82.60% of all LL-cache accesses (50.00%)
241.152242784 seconds time elapsed
233.338084000 seconds user
5.991166000 seconds sys
[kaze@kaze Quicksort_says_rev3]$

In case "FEW distinct", Magnetica is 364.3/40.2 = 9.0x faster than qsort.
In case "MANY distinct", Magnetica is 538.2/233.3 = 2.3x faster than qsort.
In case "ALL distinct", to be seen...

I kinda expected the Bentley-McIlroy duo to be superior (in the many-distinct-keys scenario); it turns out 'Magnetica' dominates in all three cases. The following showdown tells a lot:

The Quicksort Showdown roster:

Code:

+-----------------+--------------+---------------+---------------+
| Performer/Keys  | FEW distinct | MANY distinct | ~ALL distinct |
+-----------------+--------------+---------------+---------------+
| qsort           | 317 seconds  | 508 seconds   | 504 seconds   |
| Magnetica       | 39 seconds   | 222 seconds   | 292 seconds   |
| Bentley-McIlroy | 45 seconds   | 226 seconds   | 306 seconds   |
+-----------------+--------------+---------------+---------------+
Legend:
FEW = 2,233,861,800 keys, of them distinct keys = 10
MANY = 2,482,300,900 keys, of them distinct keys = 2,847,531
~ALL = 2,009,333,753 keys, of them distinct keys = 1,912,608,132

The code (used in QS_bench_r3.c) of Bentley-McIlroy 3-way partitioning by Dr. Jon Bentley (Bell Laboratories) and Dr. Douglas McIlroy (Bell Laboratories):

Code:

#include <stdint.h>

/* swap() helper assumed by the original post (not shown there);
   a plain exchange of two 64-bit keys. */
static void swap(uint64_t *x, uint64_t *y)
{
    uint64_t t = *x; *x = *y; *y = t;
}

void quicksort_Bentley_McIlroy_3way_partitioning(uint64_t a[], int64_t l, int64_t r)
{
    int64_t i = l-1, j = r, p = l-1, q = r;
    int64_t k;
    uint64_t v = a[r];
    if (r <= l) return;
    for (;;)
    {
        while (a[++i] < v) ;
        while (v < a[--j]) if (j == l) break;
        if (i >= j) break;
        swap(&a[i], &a[j]);
        /* park keys equal to the pivot at the two ends */
        if (a[i] == v) { p++; swap(&a[p], &a[i]); }
        if (v == a[j]) { q--; swap(&a[j], &a[q]); }
    }
    swap(&a[i], &a[r]); j = i-1; i = i+1;
    /* move the equal-to-pivot keys from the ends into the middle */
    for (k = l; k < p; k++, j--) swap(&a[k], &a[j]);
    for (k = r-1; k > q; k--, i++) swap(&a[i], &a[k]);
    quicksort_Bentley_McIlroy_3way_partitioning(a, l, j);
    quicksort_Bentley_McIlroy_3way_partitioning(a, i, r);
}

The code (used in QS_bench_r3.c) of Magnetica 3-way partitioning by Sanmayce:

The benchmark is the most stressful one for a 32GB machine: it uses 29GB of QWORDS, i.e. 64-bit elements, taking 8 bytes as an element at each byte position in the file being sorted, i.e. filesize-8+1 elements in total. All in all, sorting 3,803,483,825 elements (of them, unique keys = 3,346,259,533).
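The filesize-8+1 element count comes from treating every byte offset in the file as the start of an overlapping 64-bit key. A minimal sketch (helper names are mine, not the benchmark's):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Read the QWORD starting at byte offset 'pos'; memcpy avoids the
   undefined behavior of an unaligned uint64_t load. */
static uint64_t qword_at(const unsigned char *buf, size_t pos)
{
    uint64_t v;
    memcpy(&v, buf + pos, sizeof v);
    return v;
}

/* Number of overlapping 8-byte elements in a file of 'filesize' bytes. */
static size_t element_count(size_t filesize)
{
    return filesize - 8 + 1;
}
```

For the 3,803,483,832-byte Fedora image used above, element_count gives exactly the 3,803,483,825 elements reported by the benchmark.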

The results:
Quicksort 'Magnetica' r.2 is 1020s/569s = 1.792x faster than GLIBC qsort.
Quicksort 'Magnetica' r.2 is 607s/569s = 1.066x faster than Bentley_McIlroy.

And the console dump on my laptop Intel i5-7200U 3.1GHz max turbo, 36GB DDR4 2133MHz, running Fedora 33, gcc 10.2.1 -O3 was used:

Code:

[kaze@kaze ~]$ cd /run/media/kaze/Sanmayce_223GB_A/Quicksort_says_rev7
[kaze@kaze Quicksort_says_rev7]$ gcc -O3 QS_bench_r7.c -o QS_bench_r7.elf
QS_bench_r7.c:1000:1: warning: return type defaults to 'int' [-Wimplicit-int]
1000 | g(d,h){for(i=s;i<1<<25;i*=2)d=d*1LL*d%m;for(p=t;p<t+N;p+=s)for(i=s,c=1;i;i--)a=p[s]*(h?c:1LL)%m,p[s]=(m*1U+*p-a)*(h?1LL:c)%m,*p=(a*1U+*p)%m,p++,c=c*1LL*d%m;}
| ^
QS_bench_r7.c: In function 'g':
QS_bench_r7.c:1000:1: warning: type of 'd' defaults to 'int' [-Wimplicit-int]
QS_bench_r7.c:1000:1: warning: type of 'h' defaults to 'int' [-Wimplicit-int]
[kaze@kaze Quicksort_says_rev7]$ ls -l
total 4501312
-rwxrwxrwx. 2 kaze kaze 3803483832 Nov 29 02:06 Fedora-Workstation-35-1.2.aarch64.raw.xz
-rwxrwxrwx. 2 kaze kaze 58408 Nov 29 12:56 QS_bench_r7.c
-rwxrwxrwx. 2 kaze kaze 335536 Nov 29 12:56 QS_bench_r7.cod
-rwxrwxrwx. 1 kaze kaze 268465864 Nov 29 13:06 QS_bench_r7.elf
-rwxrwxrwx. 2 kaze kaze 268535808 Nov 29 12:56 QS_bench_r7.exe
-rwxrwxrwx. 2 kaze kaze 268455855 Nov 29 12:56 QS_bench_r7.obj
[kaze@kaze Quicksort_says_rev7]$ perf stat -d ./QS_bench_r7.elf Magnetica ALLmore
Allocating FILE-Buffer 3627MB ...
Allocating Master-Buffer 29018MB ...
Sorting in single-thread 3803483825 elements...
Done in 569 seconds.
Checking whether sort went correct... OK. Unique keys = 3346259533
Performance counter stats for './QS_bench_r7.elf Magnetica ALLmore':
595,760.27 msec task-clock:u # 0.990 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
7,428,735 page-faults:u # 0.012 M/sec
1,774,492,088,758 cycles:u # 2.979 GHz (50.00%)
1,795,596,438,927 instructions:u # 1.01 insn per cycle (62.50%)
348,244,974,654 branches:u # 584.539 M/sec (62.50%)
52,674,210,396 branch-misses:u # 15.13% of all branches (62.50%)
238,612,258,163 L1-dcache-loads:u # 400.517 M/sec (62.50%)
12,071,470,906 L1-dcache-load-misses:u # 5.06% of all L1-dcache accesses (62.50%)
291,005,737 LLC-loads:u # 0.488 M/sec (50.00%)
236,834,179 LLC-load-misses:u # 81.38% of all LL-cache accesses (50.00%)
601.769088005 seconds time elapsed
575.942070000 seconds user
15.526557000 seconds sys
[kaze@kaze Quicksort_says_rev7]$ perf stat -d ./QS_bench_r7.elf BM ALLmore
Allocating FILE-Buffer 3627MB ...
Allocating Master-Buffer 29018MB ...
Sorting in single-thread 3803483825 elements...
Done in 607 seconds.
Checking whether sort went correct... OK. Unique keys = 3346259533
Performance counter stats for './QS_bench_r7.elf BM ALLmore':
632,750.44 msec task-clock:u # 0.992 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
7,428,742 page-faults:u # 0.012 M/sec
1,890,074,871,799 cycles:u # 2.987 GHz (50.00%)
1,657,130,448,069 instructions:u # 0.88 insn per cycle (62.50%)
424,010,738,154 branches:u # 670.107 M/sec (62.50%)
52,229,564,268 branch-misses:u # 12.32% of all branches (62.50%)
221,684,796,070 L1-dcache-loads:u # 350.351 M/sec (62.50%)
13,935,776,880 L1-dcache-load-misses:u # 6.29% of all L1-dcache accesses (62.50%)
296,863,115 LLC-loads:u # 0.469 M/sec (50.00%)
227,231,894 LLC-load-misses:u # 76.54% of all LL-cache accesses (50.00%)
638.047061016 seconds time elapsed
613.514293000 seconds user
14.803650000 seconds sys
[kaze@kaze Quicksort_says_rev7]$ perf stat -d ./QS_bench_r7.elf qsort ALLmore
Allocating FILE-Buffer 3627MB ...
Allocating Master-Buffer 29018MB ...
Sorting in single-thread 3803483825 elements...
Done in 1020 seconds.
Checking whether sort went correct... OK. Unique keys = 3346259533
Performance counter stats for './QS_bench_r7.elf qsort ALLmore':
1,047,717.43 msec task-clock:u # 0.996 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
7,429,008 page-faults:u # 0.007 M/sec
3,158,473,480,125 cycles:u # 3.015 GHz (50.00%)
4,410,953,498,619 instructions:u # 1.40 insn per cycle (62.50%)
1,045,210,950,583 branches:u # 997.608 M/sec (62.50%)
55,719,050,158 branch-misses:u # 5.33% of all branches (62.50%)
1,009,805,547,231 L1-dcache-loads:u # 963.815 M/sec (62.50%)
13,006,118,247 L1-dcache-load-misses:u # 1.29% of all L1-dcache accesses (62.50%)
320,405,768 LLC-loads:u # 0.306 M/sec (50.00%)
172,928,699 LLC-load-misses:u # 53.97% of all LL-cache accesses (50.00%)
1052.421139598 seconds time elapsed
1024.281537000 seconds user
16.586670000 seconds sys
[kaze@kaze Quicksort_says_rev7]$

Note: This is the initial release of Magnetica, not the latest rev.2; it has neither the branchlessly chosen median-of-5 nor the cutoff to Insertionsort at 13 elements.
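For context, the Insertionsort cutoff mentioned for rev.2 usually looks like the generic sketch below (the threshold 13 is taken from the note above; the code is my illustration, not the author's actual rev.2):

```c
#include <stdint.h>
#include <stddef.h>

#define CUTOFF 13  /* threshold cited in the note above */

/* Plain insertion sort: cheap for tiny partitions, so quicksort variants
   commonly call it once a partition shrinks to CUTOFF elements or fewer,
   e.g.:  if (r - l + 1 <= CUTOFF) { insertion_sort(&a[l], r - l + 1); return; } */
static void insertion_sort(uint64_t *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        uint64_t v = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > v) { a[j] = a[j - 1]; j--; }
        a[j] = v;
    }
}
```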
