Quadbike Alpha 1

Post by scarybeasts »

Diminished wrote: Wed Jul 13, 2022 4:48 pm You've mentioned this mysterious heuristic before, and several times I've wondered how it works.
I essentially piled hacks together until it was capable of loading the broadest range of iffy CSW files.

References:
b-em CSW loader: https://github.com/stardot/b-em/blob/d5 ... csw.c#L148
BeebEm CSW loader: https://github.com/stardot/beebem-windo ... w.cpp#L477

beebjit's original approach was similar to b-em and BeebEm, and very simple: look at each half wave length in the CSW file, sequentially, and decide whether it belongs to a "0" bit (1200Hz) or a "1" bit (2400Hz).
Code: https://github.com/scarybeasts/beebjit/ ... csw.c#L111

beebjit's original approach differed from b-em and BeebEm because it was actually pickier in two regards:
1) It checked all CSW half lengths in any bit, requiring each to be "in range". For a "0" bit there are 2x half waves checked and for a "1" bit there are 4x half waves checked. Both b-em and BeebEm look at one half wave only and then skip either 1 or 3 half waves depending on if it's a "0" bit or a "1" bit.
2) The tolerances for acceptable half wave lengths were tighter:
https://github.com/scarybeasts/beebjit/ ... _csw.c#L11
Both b-em and BeebEm, by contrast, appear to differentiate with the simple "<= 0xD".

So the original beebjit CSW loader was less capable of loading wobbly CSWs than b-em or BeebEm.
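
To make the comparison concrete, here is a minimal sketch of the skip-ahead style of decoding described above for b-em and BeebEm. It is written from the description in this post, not copied from either emulator; the function name and buffer layout are hypothetical, and it assumes half wave lengths measured in samples at 44.1kHz (roughly 18 samples for a 1200Hz half wave and 9 for a 2400Hz one).

```c
#include <assert.h>
#include <stddef.h>

/* Threshold quoted in the post: half waves of <= 0xD samples count as 2400Hz. */
#define HALF_WAVE_THRESHOLD 0x0D

/*
 * Decode half wave lengths (in samples) into bits by inspecting only the
 * first half wave of each bit and skipping the rest.
 * Returns the number of bits written to bits_out (at most max_bits).
 */
size_t decode_skip_ahead(const unsigned char *half_waves, size_t count,
                         unsigned char *bits_out, size_t max_bits)
{
    size_t pos = 0;
    size_t nbits = 0;

    assert(half_waves != NULL && bits_out != NULL);

    while (pos < count && nbits < max_bits) {
        if (half_waves[pos] <= HALF_WAVE_THRESHOLD) {
            bits_out[nbits++] = 1; /* short half wave: 2400Hz, "1" bit */
            pos += 4;              /* this half wave plus 3 skipped */
        } else {
            bits_out[nbits++] = 0; /* long half wave: 1200Hz, "0" bit */
            pos += 2;              /* this half wave plus 1 skipped */
        }
    }
    return nbits;
}
```

Fed the lengths 18, 18, 9, 9, 9, 9 this yields the bits 0, 1. A wobbly half wave in a skipped position is never inspected, which is why this style is more forgiving than checking every half wave against a tight range.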

The heuristic changes applied over time were:

1) Assess the half wave lengths as a sum, not individually. This smooths out any localized wobbles:
https://github.com/scarybeasts/beebjit/ ... 3c52338cd8

2) Fiddle with the exact threshold for detecting a 2400Hz half wave. This is important to lock on to the correct phase at carrier -> data transition.
As can be seen in this change, I also started being more disciplined about collecting and checking in interesting CSW test cases.
https://github.com/scarybeasts/beebjit/ ... fa6be6fbc2

3) Have two different "0" bit detection thresholds, depending on whether the state is in data or in carrier.
This was necessary to get the phase correct at the carrier -> data transition for borderline cases.
https://github.com/scarybeasts/beebjit/ ... d394c16232
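
Heuristic 1 can be illustrated with a short sketch. This is not beebjit's actual code: the function names, the 25% tolerance and the assumed 44.1kHz sample rate (36.75 samples per bit) are all made up for illustration, and it ignores the separate carrier and data thresholds from heuristics 2 and 3.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed nominal bit length: 44100Hz / 1200 baud = 36.75 samples. */
#define BIT_LEN_SAMPLES 36.75
/* Hypothetical tolerance as a fraction of a bit; not beebjit's real value. */
#define BIT_LEN_TOLERANCE 0.25

/* Returns 1 if the sum of n half wave lengths is close to one bit period. */
static int sum_in_range(const unsigned char *hw, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += hw[i];
    }
    return sum >= BIT_LEN_SAMPLES * (1.0 - BIT_LEN_TOLERANCE)
        && sum <= BIT_LEN_SAMPLES * (1.0 + BIT_LEN_TOLERANCE);
}

/*
 * Classify the bit starting at hw[0] by summed length rather than by
 * checking each half wave individually. Returns 1 or 0 for a recognised
 * bit, or -1 if neither reading fits. *consumed receives the number of
 * half waves used.
 */
int classify_bit_by_sum(const unsigned char *hw, size_t avail, size_t *consumed)
{
    assert(hw != NULL && consumed != NULL);

    /* Four half waves adding up to one bit period: a "1" bit (2400Hz). */
    if (avail >= 4 && sum_in_range(hw, 4)) {
        *consumed = 4;
        return 1;
    }
    /* Two half waves adding up to one bit period: a "0" bit (1200Hz). */
    if (avail >= 2 && sum_in_range(hw, 2)) {
        *consumed = 2;
        return 0;
    }
    *consumed = 0;
    return -1;
}
```

The point of summing is that a run such as 9, 12, 7, 9 still adds up to about one bit period, so it is accepted as a "1" even though the 12 and the 7 might each fail a tight per-half-wave check.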


Cheers
Chris

Post by Diminished »

scarybeasts wrote: Sun Jul 24, 2022 1:24 am...
Thanks for taking the time to go into detail on this -- there is actually a little less black magic here than I expected!

Post by Diminished »

I have been working on a version 2 of this. Mentally I am a bit of a mess, so progress is both slow and unenthusiastic. I'm just going to make a note of where I am up to, as much for my own benefit as anyone else's.

There are some issues with version 1, including but not limited to the following:

i) I failed to notice that PLL sync mode needs to be aware of the data's polarity. (This is not a function of the PLL itself, but rather a phase-sensitivity in the carrier extraction method). Although version 1 performs polarity detection for walk sync mode, it completely neglects this in the PLL case, always assuming that waves have positive polarity (yielding very different PLL results for normal and inverted input signals). Phase detection is therefore now needed both for walk and PLL sync mode. One upside is that this does suggest a new way of detecting polarity -- run the PLL against both a normal version and an inverted version of the input, and choose the polarity that produces the superior PLL lock condition. It should therefore be possible to provide two different polarity detection schemes in v2 -- this one, and the old version 1 method.

ii) PLL mode in version 1 cannot accurately transcribe the brief "squawks" caused by the CFS block-zero bug fix hacked into the MOS from 1.2 onwards. These "squawks" occur immediately following silence on the tape, and so the PLL gets no opportunity to lock onto a carrier. Squawks do not contain meaningful data, but they ought still to be captured correctly. Version 2 attempts to address this by (conceptually) reversing the leader tone that follows a squawk, locking the PLL onto this reversed leader tone, and then running backwards off the start of this leader tone into the squawk. The squawk is then transcribed backwards, and its data are simply reversed. This is a stupid idea, but I think it works.

iii) some weird phase issue that cropped up with my captures of Beebug magazine cassettes. Not yet understood.

iv) Version 1 naively assumes that tape speed is constant for the entire file. This is rarely the case. Version 2 therefore makes a formal attempt to break the file up into its CFS blocks at the very start of processing by examining the Goertzel data, and measures the tape speed for each block. This may improve measured bit fidelity for walk and PLL sync modes. I may also give up on the idea of accurately counting every single leader cycle, instead just synthesising those cycles based on the length of a leader span and its average tape speed. This would be a compromise that sacrifices archival purity for an improvement in transient rejection during leader spans.
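
For the leader synthesis idea in iv), the arithmetic might look something like the sketch below. It assumes a nominal 2400Hz leader tone and expresses tape speed as a ratio (1.0 = nominal); the function name and signature are hypothetical and are not taken from Quadbike.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed nominal leader (carrier) tone frequency. */
#define LEADER_FREQ_HZ 2400.0

/*
 * Estimate how many leader cycles to synthesise for a span, given its
 * length in samples, the sample rate, and the measured tape speed as a
 * ratio of nominal (hypothetical convention: 1.05 means 5% fast).
 */
size_t leader_cycle_count(size_t span_samples, double sample_rate_hz,
                          double tape_speed)
{
    assert(sample_rate_hz > 0.0 && tape_speed > 0.0);

    double seconds = (double)span_samples / sample_rate_hz;
    double cycles = seconds * LEADER_FREQ_HZ * tape_speed;

    return (size_t)(cycles + 0.5); /* round to the nearest whole cycle */
}
```

A one-second leader span at 44.1kHz and nominal speed comes out at 2400 cycles, however many glitches the span contains; that is the transient rejection being traded against archival purity.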

I am also unsure whether it might not be better to scrap the idea of measuring half-bits altogether, and just attempt to capture full bits instead. This would probably lead to greatly improved tolerance of iffy sync timings, but would require a pretty severe overhaul of everything, would further sacrifice archival integrity, and might make QB incompatible with the Atom's CFS.

Post by Diminished »

I have removed the v1 download.

The more progress I make with v2, the more I realise that v1 was downright awful. Given the lessons I have learned in the meantime, I don't really want anyone wasting their time with v1. I especially don't want people trying to make archival transcriptions of tapes with it.

Version 2's fidelity has improved greatly over v1. The big problem now is performance, both in terms of memory consumption and speed, but I plan to release v2 before I bother trying to optimise it.

I'll look into improving performance in v3. There are various ways to do this, but concurrency is an obvious one.

Post by Diminished »

Here's me moaning about CSW phase issues in another thread. This is probably a reasonable primer on how phase shifts are the work of the Devil. I have now completely reversed my previously held position on CSW being superior to UEF. It isn't -- it's a nightmare.

It took me a depressingly long time to appreciate the subtleties of this issue, which is one of the reasons why Quadbike 1 was garbage.

Post by vanekp »

Oh dear, sounds like you're struggling a bit with your project. I am sure you will find a way through, and we do appreciate all the hard work that you put into it.
Regards Peter.

Post by Diminished »

OK. Happy new year.

I am trying to get a release of Quadbike 2 together. It is not quite ready yet, but there has been a request for a copy of the code in its current development state. I have cleaned things up a bit, and so am now uploading a copy of the code here in case anybody wants to play with it. I am calling this version 1.9.5.

It should build OK on MacOS and Linux (use the provided build.sh script -- you'll need zlib and libsndfile). Version 2's release package will include a Windows binary, but I haven't attempted to build on Windows recently. If you want a Windows executable, you're on your own for now.

Pre-release testing has not been done on this, so anything outside my normal workflow could break. In particular I have done almost no proper testing with MakeUEF yet (unhelpfully).

This performs much better than Quadbike 1. In PLL mode, it transcribes 96% of the WAV files in the vanekp collection without errors, according to my cswblks.php script. Results with other software will vary, because CSW interpretation is ambiguous. (In the last few days I have begun experimenting with a new ASCII low-level output format called TIBET -- Textual Image of Beeb or Electron Tape -- which should eliminate the phase ambiguity associated with CSW).

quadbike-1.9.5-src-only.zip

Changes from v1:
  • the old "cycap" input/output format has gone; brand new optional ASCII "TIBET" output format (experimental) -- hopefully a "better CSW"; a reference TIBET -> UEF converter (PHP) is currently in the works; I also hope to produce a patch for native TIBET loading in b-em at the very least
  • low sample rates (8kHz and 11kHz) are no longer supported
  • version 2 now splits the input into "spans" (e.g. leader, silence, data) and processes each span individually
  • each span has independently-measured tape speed
  • -n switch for optionally normalising the input signal
  • multiple strategies for properly detecting the phase of the input signal; -s walk and -s pll now respect the input phase instead of burying their heads in the sand and pretending it doesn't exist
  • leader span cycles are now synthesised rather than transcribed (there are a few reasons for this)
  • optimum PLL carrier sync offset is now searched for on a per-span basis (+/- 2 samples, global value may also be provided explicitly)
  • output CSW polarity correction (produced CSWs are now supposed to have global zero phase no matter the input, which should please MakeUEF, although this hasn't been tested yet)
  • can independently scale 2400 Hz power relative to 1200 with --scale-hf-pwr option, occasionally helpful for some tapes
  • various automatic mismatched half-bit error correction strategies for -s pll and -s freq (as well as the old selectable -e options)
Recommendations & notes:
  • use 44.1kHz sampling rate; very little testing has been done at 22.05kHz or 48kHz yet
  • best results tend to be obtained with -s pll, so this is the default sync mode (even though it's deathly slow)
  • -s walk still sucks, but very occasionally a tape is encountered that transcribes better with this than with -s pll or -s freq, so it narrowly survives for now (it's also significantly faster than -s pll)
  • don't use -f, it never helps
  • loading CSWs works best in b-em; beebjit's heuristic gets confused easily; I haven't tested with BeebEm yet (I hate CSW)
  • for tapes recorded in multiple sessions, different blocks may have different input phases, so you'll need -p block (or use -s freq, which doesn't care about input phase)

Post by Diminished »

Uploading another source code dump. No v2 yet, no binaries yet. And still no documentation.

quadbike-1.9.6-src-only.zip

Most of this is code cleanup, but there have been a few changes:

Following a rethink, the TIBET file format has been revised, and is now on version 0.2.

Much of the arithmetic has been downgraded to single-precision float from double; disappointingly this does not seem to lead to any performance improvement (on x86 at least), but it does slash RAM consumption in half. Even for really long tapes, it should now be fine to run this on machines with 4 GB of RAM, whereas for 1.9.5 you really needed 8 GB. The negative effects of single-precision arithmetic on transcription fidelity appear to be very minimal, so this is basically a win.

I have now stripped out all of the scaffolding code for "multisource" support, as I am not going to pursue this idea any further in Quadbike. I am of the opinion that "multisourcing" (amalgamating multiple bad recordings or multiple channels of the same tape to form a single good copy) is better done at a higher level, by processing a series of TIBET files.

I am debating whether or not to add UEF support to Quadbike. Either way, here is a PHP script which will generate a UEF from a TIBET file, which hopefully provides an end-to-end tape-to-UEF solution for anyone who is sensible enough not to be using Windows. It is unlikely to work on protected tapes at this time.
tibetuef-0.1-php.zip
On Windows, you can still of course output a CSW instead, and feed that to MakeUEF.

Post by Diminished »

Today I started to consider exploiting SSE and friends to try to vectorise some algorithms. For certain operations (e.g. the PLL) this turns miserable very quickly.

One well-behaved target, though, is the Goertzel algorithm. Since we want to compute 1200 Hz and 2400 Hz Goertzel power over the same piece of audio, the input to these two operations is identical, and it's no problem to do both frequencies at the same time.

This seems to lead to a ~30% speed improvement. It should be possible to expedite tape speed measurement similarly, although I haven't tried this yet.

It's a start.
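
For anyone curious, a scalar sketch of the two-frequencies-in-one-pass idea is below. It is not Quadbike's code: the function name is hypothetical, and the caller is assumed to supply the coefficient 2*cos(2*pi*f/fs) for each frequency. Keeping the two filter states in adjacent array slots is what makes the inner loop an easy SIMD target.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Run two Goertzel filters over the same samples in a single pass.
 * two_cos_omega[k] must hold 2*cos(2*pi*f_k/fs) for frequency k
 * (e.g. k = 0 for 1200Hz, k = 1 for 2400Hz).
 * power_out[k] receives the squared magnitude at frequency k.
 */
void goertzel_power_pair(const float *samples, size_t n,
                         const float two_cos_omega[2], float power_out[2])
{
    float sn1[2] = {0.0f, 0.0f}; /* s[n-1] for each frequency */
    float sn2[2] = {0.0f, 0.0f}; /* s[n-2] for each frequency */

    assert(samples != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        for (int k = 0; k < 2; k++) {
            float sn = samples[i] + two_cos_omega[k] * sn1[k] - sn2[k];
            sn2[k] = sn1[k];
            sn1[k] = sn;
        }
    }

    for (int k = 0; k < 2; k++) {
        power_out[k] = (sn2[k] * sn2[k]) + (sn1[k] * sn1[k])
                     - (two_cos_omega[k] * sn1[k] * sn2[k]);
    }
}
```

A handy sanity check is fs = 9600Hz, where the coefficients are sqrt(2) for 1200Hz and 0 for 2400Hz: a 2400Hz tone sampled as 0, 1, 0, -1, ... should then put nearly all of its power in the second output.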

Post by BigEd »

Nice speedup for sure!

I saw a mention the other day, in this context, of SoX, a command line audio tool. The person in question demodulated a similar kind of audio by comparing the output of a high pass and a low pass. Might be an interesting tool to play with, or to do a bit of code-reading with, even.

Post by Diminished »

BigEd wrote: Fri Jan 27, 2023 7:01 pm Nice speedup for sure!
I've just applied a similar vectorisation process to tape speed measurement, except in this case I'm processing eight frequencies at once, instead of just two.

On a long test tape, without SSE, this process takes 48 seconds.

With SSE, it takes six seconds, which I still can't quite believe. O_o

The moral: I know there is a general distaste for what Donald Knuth called "premature optimisation", and with good reason, but it certainly would have saved me a lot of time running tests if I'd implemented this sooner ...
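
A rough sketch of what processing eight frequencies at once can look like, using the GCC/clang vector extensions. This is an illustration only, not Quadbike's implementation: the function name is hypothetical, it needs GCC or clang, and each lane of the coefficient vector is assumed to hold 2*cos(2*pi*f/fs) for one candidate frequency.

```c
#include <assert.h>
#include <stddef.h>

#define QB_VECSIZE 8
typedef float qb_vec_f_t __attribute__ ((vector_size (QB_VECSIZE * 4)));

/*
 * Goertzel power for QB_VECSIZE frequencies in one pass over the samples.
 * Vectors are passed by pointer to avoid ABI warnings on targets without
 * native 256-bit registers.
 */
void goertzel_power_vec(const float *samples, size_t n,
                        const qb_vec_f_t *two_cos_omega_vec,
                        qb_vec_f_t *power_out_vec)
{
    qb_vec_f_t sn1v = {0.0f}; /* s[n-1], one lane per frequency */
    qb_vec_f_t sn2v = {0.0f}; /* s[n-2], one lane per frequency */

    assert(samples != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        qb_vec_f_t xv = {0.0f};
        /* Broadcast the current sample into every lane. */
        for (int k = 0; k < QB_VECSIZE; k++) {
            xv[k] = samples[i];
        }
        qb_vec_f_t snv = xv + (*two_cos_omega_vec * sn1v) - sn2v;
        sn2v = sn1v;
        sn1v = snv;
    }

    *power_out_vec = (sn2v * sn2v) + (sn1v * sn1v)
                   - (*two_cos_omega_vec * sn1v * sn2v);
}
```

For tape speed measurement the eight lanes would hold coefficients for eight closely spaced frequencies around the nominal tone, with the strongest lane indicating the actual speed; that selection step is left out here.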

Post by Diminished »

The vectorisation push continues.

In typical usage with PLL mode, a lengthy tape that took 137 seconds to process without vector support is now down to about 64 seconds using AVX2. There is some overhead associated with having to swizzle and unswizzle data at various stages of the process, and there is also a RAM cost because Quadbike now has to cart around both swizzled and linear copies of the same data at various points, but overall it is a big win.

This is with a vector size of 8. I suspect that on the latest Intel chips with AVX512, you could increase the vector size to 16 (just one #define) and get even more speed, but my Intel chips are 7-10 years old at this point.

The remaining operation which hasn't been vectorised yet is the PLL. I believe I know how to do this, but it's going to be a bit yucky.

Of course one potential train wreck still remains, which will be trying to build this under Visual Studio.

Post by Diminished »

There exists a proverb. I am not sure to whom it is attributed, but it is this:


"No plan survives first contact with the enemy."


My own revision of this is as follows:


"No code survives first contact with Microsoft's compiler."



GCC and clang have a lovely SIMD implementation. You start by declaring a vector type; here is a vector of 8 floats, suitable for use with the 256-bit wide registers offered by AVX or AVX2:

#define QB_VECSIZE 8
typedef float qb_vec_f_t __attribute__ ((vector_size (QB_VECSIZE * 4)));

Then, you can access the elements of the vector just using []:

qb_vec_f_t a;
a[0] = 0.0f;
a[1] = 1.0f;
a[2] = 2.0f;
a[3] = 3.0f;
a[4] = 4.0f;
a[5] = 5.0f;
a[6] = 6.0f;
a[7] = 7.0f;

Which will give you the vector [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0].

Furthermore, element-wise arithmetic just works:

qb_vec_f_t b, c;
b = a + a;
c = a * a;

Giving you b = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0] and c = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0].

Of course this also downgrades -- if you take this code and compile it on a machine with SSE rather than AVX, the compiler will automatically split the vectors up so that they are at a suitable size for the target architecture, and everything will still just work. Or maybe you're building on PowerPC instead of x86, and want to use Altivec. Yep, it just works.

This implementation is a thing of beauty.




Then, along comes Microsoft, marching inexorably upon the Mona Lisa with a can of orange spray paint.

Under GCC and clang I have this, as part of the Goertzel algorithm:

*power_out_vec = (sn2v * sn2v) + (sn1v * sn1v) - (*two_cos_omega_vec * sn1v * sn2v);

which mutates into the following pile of dung under MSVC:

  c = _mm256_mul_ps(_mm256_mul_ps(*two_cos_omega_vec, sn1v), sn2v);
  b = _mm256_mul_ps(sn1v, sn1v);
  a = _mm256_sub_ps(b, c);
  *power_out_vec = _mm256_fmadd_ps(sn2v, sn2v, a);

And, to rub salt into the wound, this will not downgrade, and will only build for AVX2.

(Yes, there are Windows binaries now. I hope you have a recent Intel chip.)

Post by Diminished »

Unit tests written so far: 252 :x

Post by Diminished »

Here's the latest code dump.

quadbike-197.zip

This time, though, I've thrown in a couple of Windows binaries: quadbike-197-slow.exe and quadbike-197-avx2.exe. You will need 64-bit Windows. To use the fast ("fast") SIMD version with AVX2, you are also going to need an x86 CPU made sometime in the last ten years or so.

One word of warning: I've just bundled the two DLLs (zlib and libsndfile) that I built for the Quadbike 1 release, rather than bothering to make new ones at this time. These DLLs should work, but they are old versions of the libraries that likely have some security bugs, so I would advise against processing audio files of unknown provenance. Most of the time you're going to be processing audio files you've recorded yourself, so I don't anticipate this being a problem in typical use.

The build configuration has become slightly more complicated now, thanks to the SIMD support. Note that it will only build cleanly for 64-bit systems.

For building on Linux or MacOS, edit the build script src/build.sh to pick a compiler (clang is default, but gcc should work), and define either QB_MACOS or QB_LINUX depending on the target:

...
# -DQB_MACOS, -DQB_LINUX:
D="-DQB_MACOS"
...

You will then also want to look at src/build.h (apologies for the similar filename) and define one of the SIMD modes (if desired), or just comment both these lines out to get a scalar build instead.

...
// Choose vector implementation:
#define QB_VECTORS_GCC_CLANG
//#define QB_VECTORS_MSVC_AVX2
...

QB_VECTORS_GCC_CLANG activates a generic SIMD implementation that uses GCC/clang's vector extensions. You probably want this. You might also look at src/vector.h and consider changing QB_VECSIZE, which defaults to 8 for AVX. If you are lucky and have a recent chip supporting AVX512, you should try changing QB_VECSIZE to 16. I have no CPU new enough to test this, so I would be interested to hear what kind of speed increase is achieved with AVX512 and QB_VECSIZE=16 versus 8. (If you need some test audio, head over to the vanekp archive).

Conversely, if you have a CPU that only supports SSE, you might do better setting QB_VECSIZE to 4. In my testing on an old AMD Phenom, the overhead of having to swizzle and unswizzle various buffers means that SSE does not actually provide too much of a benefit over just building scalar code instead (maybe ~10%, but your mileage may vary).

(The alternative QB_VECTORS_MSVC_AVX2 macro activates a "hard-coded" AVX2 implementation using Intel AVX2 intrinsics that I put together for compiling under Visual Studio. You actually *can* select this option under GCC or clang, too, but there isn't really any benefit over defining the generic QB_VECTORS_GCC_CLANG macro instead.)

Then, cd to src and run build.sh to build it.

The big thing I have left to do now is write the bloody documentation.

Post by vanekp »

Thanks, will give it a go some time, and yes, the documentation is the least fun bit.
Regards Peter.