Frequently ciphers are attacked not through plainetxt or ciphertext, but through the key scheduling, exploiting the process of subkey generation. The generic name for this attack is "Related Key Attack", where differences are applied to the key and then differences in plaintext or ciphertext are investigated to find correlations between them. So the question is - how do we assure that any slightest change in the key produces a complete avalanche making the attack on key scheduling useless?

Here is a proposal of a subroutine which seems to foot the bill. It is based on modular multiplication (mod 2**32+1) and produces a stream of key-dependent randomized subkeys. Subkey material is produced in chunks 2 bytes at a time, one can generate as many bytes as needed for a particular application. 20 subkeys, or 50 subkeys, or 330 subkeys, or 2048 subkeys - there is no limit for this scheduler.

At the heart of the scheduler there is a 64-byte array, which can be regarded as a kind of shift register. The 32-bit multiplier is multiplied (mod 2**32+1) with 4 leftmost bytes of the register, and the product replaces the previous content of these 4 bytes. The data in the register is then cyclically shifted 2 bytes to the left (2 leftmost bytes end up in the rightmost position). Thereafter the procedure is repeated - 32-bit modular multiplication followed by 2-byte shift, until the first shifted bytes go 2 full turns and come back to the original position (this will require 64 multiplications-shifts). Schematically the procedure can be visualized with the following diagram.

MMMM - multiplier; 
AAA...AAA -original content;
PPP...PPP - multiplication products.

MMMM                  operation 1
AAAAAAAAA....AAAA    ->   PPPPAAAA....AAAA   ->    PPAAAA....AAAAPP
originally                    multiplication           shift-rotation

MMMM                          operation 2
PPAAAA....AAAAPP        ->   PPPPAAAA....AAPP    ->   PPAAAA....AAPPPP
originally                    multiplication           shift-rotation
....................
MMMM                          operation 31
PPAAPP....PPPPPP        ->   PPPPPPPP....PPPP    ->   PPPPPP....PPPPPP
originally                    multiplication           shift-rotation

Next operation #32 will complete 1 full turn, then another 32 operations will be applied in the same manner. If we take into account that each subsequent multiplication has a 2-byte overlap with the product of previous multiplication, it is easy to see that complete avalanche is likely to happen already after the 1-st full turn, and is certain to happen at the end of the 2-nd turn.

It should be obvious that the shift register is exactly where the user key must be placed, whatever the number of bytes in it (not more than 64), followed by some appropriate padding. Then 2 turns of multiply-shift procedure will produce in the register some hashed value ready for subkey generation. Any slightest alteration in the key, even 1 changed bit, causes an avalanche in the register, changing its hashed content to the extent of complete non-recognizability.

Generation of subkeys continues along the same procedure that was used for hashing. Multiplication (mod 2**32+1) and subsequent 2-byte rotation are repeated as many times as needed in order to produce the needed number of subkeys. Two bytes rotated to the right end of the register are copied to the subkey array.

In real program no actual shifts or rotations are done, instead of this the program juggles with indexes, changing them cyclically (mod 64). There are 3 functions necessary for this program:
1) Routine for modular multiplication (mod2**32+1)
2) Routine to break the 32-bit long number into bytes.
3) Routine to build the 32-bit long number from bytes.

The short "main" helps to test the program. It asks for user key, stuffs the shift register with the key characters, prints on screen the original content of the register, the hashed content of the register, and first 100 subkeys 16 bits each.

This program produces key-dependent pseudorandom numbers. Who knows about some handy program for statistical testing of keyed PRNG-s, please give a hint. I am now in a funny position - it works, but I'd be damned if I can describe how it works. As for one, my choice of the multiplier can be far from optimal (now it is 0x6A09E667, which happens to be the first 8 fractional hex digits of Sqrt(2), less initial 1 ). Maybe digits of "pi", or "e" or "cos(1)" will work better?