SC5 - FF:MP3


>    Does anyone have a description of the MP3 format, or know of a
> place where I can download one? I have already searched, but found nothing

Letters from the list relating to this question, and a selection from the
AMP developer list...

TomCat/Abaddon


If anyone rips and compresses audio, the following two cracks may come in
handy.

The first:

L3ENC ISO/MPEG Audio Layer 3 Software Only Decoder
version: 2.70
size: 344568
name: L3ENC.EXE

153591: 83 fa 0d 77 0a
                 eb

153763: 83 c4 0c 83 f8 01 74 0b
                          eb

153782: 66 39 bd 76 ff ff ff 74 07
                             eb

153851: 81 7e 0c 67 12 00 00 74 32
                             90 90

153880: 83 7a 14 61 7f 06
                    eb

153895: 81 7b 0c 9b 0b 00 00 74 06 83 7b 0c 01 75 08
                             90 90             eb

30.aug.1997 [tNC]


The second:

CD Worx for Windows 95
version: 2.10.0579
size: 532480
name: CDWORX.EXE

7592: 8b 44 24 28 2d ff ff ff 7f
      b8 01 00 00 00 eb 09

7.sep.1997 [tNC]


I have written what is, IMHO, a clever little utility, and beta testers
are welcome to queue up. Its name: SMP3 - Scheduler for L3ENC MP3 encoder.

It is meant for people who make MP3s by ripping several
discs to a network drive and then turning L3ENC loose on
them from several machines. On every encoding machine you
set where the WAV directories are, then you simply start them
all, and they compress the material on a first-come,
first-served basis. It turns *.WAV into *.MP3 and deletes
the WAV. If encoding is interrupted on one machine, that
machine hands the WAV back to the others. If it is
interrupted brutally, the next thread cleans up before it
starts an L3ENC run. So it is not really a big deal, but
it reduces the human work to a minimum. It is useful on
a single machine too, because of the interruptions.
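
A minimal sketch of how such first-come, first-served claiming could work, assuming each worker claims a WAV by renaming it. This is not SMP3's actual code, whose internals are not described in the message; claim_wav, release_wav and the .busy suffix are hypothetical names, and whether rename() behaves atomically on a given network drive depends on the file system.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: try to claim wav_path for the worker named `worker`.
 * Returns 0 and stores the claimed file name in `claimed` on success,
 * -1 if another worker got there first or the file does not exist. */
static int claim_wav(const char *wav_path, const char *worker,
                     char *claimed, size_t claimed_size)
{
    assert(wav_path != NULL && worker != NULL && claimed != NULL);
    int n = snprintf(claimed, claimed_size, "%s.%s.busy", wav_path, worker);
    if (n < 0 || (size_t)n >= claimed_size)
        return -1;                      /* claimed name would not fit */
    /* rename() fails once wav_path is gone, so only one worker can win. */
    return rename(wav_path, claimed) == 0 ? 0 : -1;
}

/* Hand an unfinished job back to the pool by restoring its original name. */
static int release_wav(const char *claimed, const char *wav_path)
{
    return rename(claimed, wav_path) == 0 ? 0 : -1;
}
```

A worker would call claim_wav() before starting the encoder and release_wav() if the run is interrupted; cleaning up after a hard crash would additionally need some way to detect stale .busy files, which this sketch leaves out.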

The utility is about 12 Kbyte; I will send it as an attachment to those
who sign up. It is 100.00% freeware, I just do not want to release it with bugs.

Of course, it may only be used to compress legal material,
and of course only with a registered L3ENC...

ERN0


I have been experimenting with using different compilers for amp today, and
I got some interesting results (this is primarily for Linux users; Sun users,
for example, already have a decent compiler).

che% time ./amp-gcc -q -nobuffer /mnt/aux/test_mp3/classic1.mp3 
1.50user 0.07system 0:20.26elapsed 7%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (80major+24minor)pagefaults 0swaps

che% time ./amp-pgcc -q -nobuffer /mnt/aux/test_mp3/classic1.mp3 
1.29user 0.11system 0:20.26elapsed 6%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (82major+25minor)pagefaults 0swaps

che% time ./amp-egcs -q -nobuffer /mnt/aux/test_mp3/classic1.mp3 
1.36user 0.11system 0:20.26elapsed 7%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (80major+25minor)pagefaults 0swaps


...it seems that you can win a _lot_ if you use the Pentium-optimized pgcc
(about 14% less user time here), or even egcs, which is generally improved
over gcc (about 9% less). I am waiting to see what happens when egcs gets
enhanced with Pentium-specific patches. Note that the measuring was done on
22kHz mp3's and a 200MHz Pentium; the little 10% or so won here might make
a difference on your machine.
   I am also considering making binary releases of amp, because most
people don't have the time to compile these compilers at home (or mess
with advanced features like rt).
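
As a worked check of the timings above, here is a small sketch that computes how much user CPU time each compiler saves relative to plain gcc. The function name percent_saved is made up for illustration; the inputs are the "user" figures from the three runs (1.50, 1.29 and 1.36 seconds). The elapsed time is identical in all three runs because playback is paced by the audio device, so only the user time is compared.

```c
#include <assert.h>

/* Percentage of user CPU time saved relative to a baseline build. */
static double percent_saved(double baseline_sec, double candidate_sec)
{
    assert(baseline_sec > 0.0);
    return 100.0 * (baseline_sec - candidate_sec) / baseline_sec;
}
```

With the numbers above, percent_saved(1.50, 1.29) comes out at roughly 14 for pgcc and percent_saved(1.50, 1.36) at roughly 9.3 for egcs.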

pgcc is available at:
   http://www.goof.com/pcg/

egcs is available at:
   http://www.cygnus.com/egcs/


tomislav


I'm working on a feature for amp that I call "Automatic Volume Control". 
It means that I handle volume correction for a start, and that I try to 
amplify the sound if the overall output of a song is low.
I use a lookup table for the samples, and apply a factor to the values as 
necessary. For the moment the code works, but:
+ It's outside of amp itself, since I do my processing after the frame is 
completely decoded
+ I still have problems with songs that start very loud (drum intros or 
such)

For other songs, the system works really well.

I compute the maximums of the sample values, then I do something 
like this:

    if (pre > player->levelCurrentMax) {
        if (!player->levelOverflows)
            printf("Start overflow\n");
        player->levelOverflows++;
    }
    if (pre > player->levelPreMax)
        player->levelPreMax = preL;  /* sic: preL as posted; pre may have been intended */
    if (player->levelOverflows > kLevelChangeTreshold) {
        float val = kVolumeControlLkpSize;
        float fact = val / player->levelPreMax;
        if (fact > 2.5) fact = 2.5;
        if (fact != player->volComputeFactor) {
            player->volComputeFactor = fact;
            printf("Computed volume factor %f\n", player->volComputeFactor);
            volumeComputeTable(player, player->volComputeFactor);
        }
        player->levelCurrentMax = player->levelPreMax;
        player->levelOverflows = 0;
    }

The reason why I have problems with fast intros is obvious: the 
first factors computed on silent frames are almost always 2.5, so the 
first real frames get REALLY loud :-)
I'm currently changing the system to start with a factor of 1.0, and 
gradually push it up to the computed value. That way, we'll never reach 
the 2.5 factor, since the first real frames will not be played at that 
volume.

In any case, the real goal for me is to compute that factor value for a 
complete song the first time, and then SAVE it in the playlist for later 
use on that particular song.
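
The gradual ramp described above could look roughly like the following. This is only a sketch under assumptions, not Michel's code: ramp_factor, apply_gain and the step size are hypothetical, samples are assumed to be signed 16-bit, and the gain is applied directly rather than through a lookup table. The 2.5 ceiling is the one from the snippet above.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_FACTOR 2.5f   /* same ceiling as in the snippet above */

/* Move the current gain one step toward the target. Increases are gradual;
 * decreases take effect at once so a loud passage is not overdriven. */
static float ramp_factor(float current, float target, float step)
{
    assert(step > 0.0f);
    if (target > MAX_FACTOR)
        target = MAX_FACTOR;
    if (current + step < target)
        return current + step;
    return target;
}

/* Scale one signed 16-bit sample and clamp it to the valid range. */
static int16_t apply_gain(int16_t sample, float factor)
{
    float v = (float)sample * factor;
    if (v > 32767.0f)
        v = 32767.0f;
    if (v < -32768.0f)
        v = -32768.0f;
    return (int16_t)v;
}
```

Starting from 1.0 and calling ramp_factor() once per frame means that a factor computed on leading silence is approached slowly, so the first loud frames are played at a gain close to 1.0.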

Michel Pollet


I found these MPEG conformance test bitstreams; they may be interesting for
testing amp:
ftp://ftp.tnt.uni-hannover.de/pub/MPEG/audio/mpeg1/compliance/

There are various MPEG files in layers 1, 2 and 3.

David Balazic


I've made available a prerelease of 0.7.7 on the mail server. You can get it with the standard:
          send amp_prerelease
sent in the body of the message to multimed@rasip.fer.hr

0.7.7 has been lagging behind a little because I had to work on my diploma thesis
last week, but all the major features are already in:

   - guicontrol fixes (me and Budor)
   - qt gui (Lodewijk Voge)
   - win32 port (SJC)

...and a lot of little fixes. guicontrol.c should now work on almost all unices, once
you define which mechanism you use for passing descriptors (see doc/guicontrol.txt).
Qt gui works and it's nice, try it out. Win32 port is working, but it will need a
maintainer as I, well, don't use windows (I tested the port, though).

What is still left to be done is some more flexibility in configuration (interactive
configure script?) and making it possible to decide certain things at runtime. More
fixing, tweaking, etc. If anyone tests this, drop me a note.

tomislav


Howdy all, here is my patch for amp 0.7.7. I've added preliminary MPEG 2.5
support, using information a guy named Marc Pirotte sent me.. (I also had
to do a little poking around in l3aud32.dll from winplay3 to get some data
he didn't have). Anyways, here it is, and it works, although it doesn't
sound perfect on all MPEG 2.5 files (there are little high blips
occasionally). Probably some oversight, I'm sure we'll get it worked
out...

Justin Frankel
justin@nullsoft.com 

--8323328-287504126-875128036=:8657
Content-Type: APPLICATION/octet-stream; name="0.7.7-mpeg2.5.patch.gz"
Content-Transfer-Encoding: BASE64
Content-ID: 
Content-Description: 

H4sICFAAKTQCAzAuNy43LW1wZWcyLjUucGF0Y2gA7Rprc9pI8jP8ig5XzoEl
sB4IAT6ycWK8oSqOc7azub3YpRJoMKqAxEoijtfr/37dMyOQeMSQePfuw7mM
NNL09PR7enrk+cMhVGdQrUZsMIti/wvDdsBuq0N/zKBWO3An06pWs2v2gTvz
/LA2gFraKlar1bUghdMwgAs2Bd0CXW9bjbaBjVbLLiqKshhf+Mg8DmbUQau3
9Xq7bguwly+h2mipNij8+vJlEQqFSXxTLt3DngcPUFKHkTth78LKIXY94M8f
QvnIuXhz9tG57J124flzeFaWQHu6VqnAfbFaKMRsEAZe3JE9++URcz0WVV/0
jjsd7SfLbrR13TIqB4kTI19jP7hxhhH7bcaCwd2nBfD1vL0Kdn1YVJ5uJvb1
evt5SUx+MEtY3JHzHzQ0/jolR95hD2QHF+unPa+9pxnedUmV41UJiAImbRhN
U9U1UOjeEAqJb/1kMIIyiyISLqIauDGDn7uXb47Pne75eRs8n5VL2B9GECHZ
SC1MpuwG+n4SJ/hmUgP21U/wfe0qKHFd0l8fuz4fkrqyGN9dQBtu3Sgol5KR
HwP+u8DN1A/g9H33ZzBqFgzDaOImKtyOfCQOYYIwAY8N/YB5Yo4qTSER9e+g
d3F2QKNr0EsI/qrkQjxlA98dw0nkzoJROGSRxHtVqq0iQUWAFzIxVTybTsMo
AU6iGFSDOIyiu+zIlMUloZ281VMWU7Jg7N7h9HrKisSPnoO008zlO5ZU1olv
GffJXHxoMUzStlAFSjNi+SnyhJMZmFqDm4GpGXQnO7ifO9/Rh+PemfP+7dGv
FQCanr/mL199ODnpnjsXvX930fgrXAg8CpxNWVD+EV9Tyde2x7WDN6lSnHPH
nYQee9YxeWg5co7PPr477f2romLU6V46v5y9/XDaFTpg45gtWHw1G6IBPSWj
22P8K9iNWDKLAqBo8lD0dl5QRvMFYbRpQRllFxSt2dZaGDpXF5TR0oJi6nMw
sl3dwMil4FUsKH/zg8F45jH4R5zw0S+KaLToC7NBAsJq33SPjrvnaOFKwQ8S
IHESw6Kdtrh/pg/TKEzYIPHDwEHHEj5jNdUmKFYLrzRvEUNewlBgBB8P3DEb
ugNn/Mm45v8GRfF1ELGE0M3rTyYCVTNAidNvChSmRLHSG8tencYqa8dmhitr
h2cwpCTGIx7siNvITVgKYhFIjg+fY7Ds5rWQSsPmy3tTaiMHu584YxUvcV4U
wWC0zNga4zeEeJQt4CRofvIj558fet1LVSYTr99R8+Ts/PTo0vl49EsXn44/
nL53XvXeHZ3/erg09uN577LrXJ6JYKhmXpz03nYF67reUFtoibqt6nXJ/TLb
C36/zSh0KK+5B8PQLA1UtHxNo7vewPuDSl31us5f1ZuiyzSoq6g8KhmOWync
w8H+Yn3l8Ix0HcP+ARQVwLBxj2meZlg4LeFWoSkmV6gHH3GywoOaxaMd6DVt
CZcA38QI79zECmWAD1yROMXPH3qAqUsShWN0ZwyVhJvL3WqoBrqi3jDUujA7
BE9GDH04Rk8b9mmRvcGQhdokqQz8ZHwHGHqmNzWjWYNwyKHjxA08N6LlUTpn
EkYxTg77tKoiDK6rYcD4gorsTzFCun1wx3HIgZBTnPcSMWFcBESQTEKc3w88
9hVE1sBjjYrLL59w6EdEn/8VymGAFCUjWsBnMfMqJH8+Dw1KlaQCEshHBmic
fGAOBGWvkhJQe0gMCWGJCI9hHoBK6R1zGsT8YlbkZMziHD4D8rlthUsiJSHE
SyQHZ0fp+VF6RYiGqDGQNotMgNRPL6vrglznnlsUJwhe1Zp9nsr1uq/RGHXb
qJp85L2l6rqKrmaYqtFSTUutm6qFSayl2i21Rb34M1uqjhEJ1wqEQ1jMck0E
NbG/3kB4Q1ct2+L+9Dg+zJTxtd5AsBYCmbpq2Dbi0xGfKfDhfNvj256+YhXd
DHGaqs2x4shWHrGuIlm2qTYRoYY/05SEWhIxElrX1ToSlBK4ARkCIRO4PdNV
HKSjwnQD7xbeEbmBUIaNFBI0UvsIspTlJnZp+MNYwjFx+SEgkmcin/U6klVv
pbw+HKLxZ4wjs8ZtaRwpOVz8JvGt4rQonxbJhjpQPi09RzrCWpxom8uSBCk1
Tu3tgM055lWVoS5JKi2uLBSJKYRLEywhF7CoMZt0gfIhi2kJ6a0FbXEGceW1
JR32nGguznRRWMkIOnwhuEdxZ9aCQ4xhGK6EjDHz6WPjDgNY8PcEA3CECSBu
fWRcf1ov/HMwIj6yAq6sjGHbwhIsLifpzi2JBgWKRm416IeDGmSahFIYvFj2
OG5N5ZM84PNjhllOY/QBhsHKf1OAuwSyp6YxI7snjGbfQPc98ezb1O0c0cg+
ioqIasqG1Dv1xO/0w63jkvo94DYPNNyHUDcoSVI6PnPnIcU0SEFkUvZ3Oscq
G4/E7S24yEbunZhea6OPhe9V6G8E8HWoN4fw1H7SDQXmWzdxmVrYGKiDkRvB
/j62v9DG/UvoewjmJ7479mPmeGyAO/6oTO+xf5edvDebTHllWDRW9/HifeEk
8vn+HHAb32jjFt1qZbfxEmqpLNyYQ1ESX28ir6DgzdB5Cs/ZwG3orSPSyfK6
Lfy+6KvwchUJRO5XO4uda7ZMYlbTJ767X9RNJLDD0+RruTsbxp0fqxZzLJPp
jSprCcp2WHeqDCvpHPhzcG+Rli24UcTJ5FOTl49FJY/vgCsgKjv0mkN36nmx
pPD5NJ4m6ejUR2Uw/mRQQTwPyb6msEhMRyOAFJxeWMu4qRjV6ZgVKtAMpneo
5IlamoRBKMqefGy2K8Y9OROd2D2NkPthuXQVXBXhfRROWZT4LG5joIW9GPa8
N79TV/V1yGvUpywZhV6bB9g93cONEucXmwSlbIDCXw4OXglj4bPQRN5nNJ+D
+CooXVF1UCVCh7E6V7wqravCzw+y3VmdZaB2rLeJim/MHXXeXvXVeVfhAvex
p+4dmDroWrtut61m1l0XgHmP1Yy2kfFYg1d6+JX8ldfMWECs+UI7+NQZJmw8
LoezxCG6K9V6nSrlP+xZSuGJ3YjKnbfo/QtS6d3wNvITVvZj3Op7KkZpTc32
76SlG5ZQTZ5rad5e1dK8azmottq6mdXSAjCrJRN12bYyx224UjdAwavQEt+K
x4mb+APUGIqBiUA7dSNcKraItEChVpGmDznHdyRJZZ2EswryCICIRXMYYx1M
vg6bR5iW4HRd1uDSA8a1QR7DlAV//JHtXzUMHplkHbxKsW/BOaYz5zPUMdoD
CvILLrXeopgmMMri12qA1Kj2TrEgH2KRnLwwnol3BAlrmYAX0KzkKFSQoQMq
0oDreT4Jyh3jWsAGnyEJYeJ+xnA6mkVMFLN4GY1iXsC+sAhQmjFMfDSPzyxA
4Rdk4cuF+C4Y3IZUGuMpaTSjWhGxHocThm5/g7NwknWQp04q9LH7dhRKPAM3
ovJSnwZJSDe4u8Uo1K5WatCDySxOiAJxwhb5NyOBPfEnrFYkHAdy7YC81FCW
eRMSZxsbTIZWpyV55Q46QNiRKe0I831pRjRzGUFR1R2KAOP+MCrXK5VnHW2O
UXSL6IeCKx8eyuPp2xHFg3I5zpisgZp71gHt63A4lGB8khgNBN+x9F0h55vP
k8k0PW1cWBZGQkz47hyuaQHTydA1P2kVAymjTEla0KNLcmxBDsgJUp7poJfM
oLyB5QI/rt8pJHJ1GTwips3VgJj25OKh0bY0DInZeDiHW1q0zLbWXIRDs06L
Fr/OtbrJUpDJUez/zhRKdEilT5VcpjHkx5fB7THtthqmf/nYhXkdcwOHf/ng
kGQ6ep3K+fuSw4NhnBkr8r6lAbaxAr+7yZgLkzE3moy5YjL1Vts0V0zGXDUZ
q20YmQ9WdH6ipavzowVRHOcl+5Eb8yOBm8gNMCgWRZR6KkP5X0yU/lLT4LFY
0+igVdzEqZo43kEcyW1Ie8TZ2IObEFxcTrgGqoXEGXeey4Lnlrx90rjIUVhy
ZLzTSGXDnDtId45nHQW74hFi8lh/dsM34GKx5+LZxeGmYczTCO5yi4dVp1v0
LWWuut42c/uLDORKrDYzB/t1Hqvraaze9PfAY3P6tOp6uBjWhNtRS7ocNTfG
5Y0+k6Ki+zrvKBQWIfkRJKTOLdAtGCN/EwO287XCrs5WKKRZzRIwKEATTymj
RAJxEFTFyiizbpvvBG17/kkf5ntxCLcMwyOaFiae6Jx9F3NQCpUcMeKsQ/+O
DiMV0HkLyn4A/MsmORPcYnDlp6E1gfTg/3p+Aj3v5P5R0p8Nue/L1qrjy47c
x6GYnGltdOqM16dgKy5fz9QUTCp5WpigibuwpkKZPi7MVowoUcWd2U+Q1oOg
DZmqUSG7ZcO0tFpwMYeGzvrPIrLISUvZ5w1r6vboSP07TyDSUj8cJOMy/wTK
GXoqXLw7fn351jm+eO9cvO92j1V4TnRUoNNBTmXKPuVfZZZLfHB7Zcwm+dB8
nKlnTyOkivi4ajes3yEryfVQFgLjBAEjFUof3ShAyDZwyJg+q0xHA3kCfW08
/7SzxI2sNHHps1eGOy/cA3whkKugpALfMT2FTMSHfn+OHMRHftwc5IfTxf8A
K1fXOQMuAAA=
--8323328-287504126-875128036=:8657--
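
For readers wondering how a decoder can tell an MPEG 2.5 stream apart at all, here is a small sketch. It is not taken from the patch above; it assumes the commonly described frame header layout, in which MPEG 2.5 shortens the sync word to 11 set bits and the following two bits give the version (11 = MPEG 1, 10 = MPEG 2, 00 = MPEG 2.5, 01 = reserved). The enum and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum mpeg_version {
    MPEG_INVALID  = -1,  /* no sync word found */
    MPEG_2_5      = 0,   /* version bits 00 */
    MPEG_RESERVED = 1,   /* version bits 01 */
    MPEG_2        = 2,   /* version bits 10 */
    MPEG_1        = 3    /* version bits 11 */
};

/* Classify a frame from the first two bytes of its header. */
static enum mpeg_version mpeg_version_of(const uint8_t hdr[2])
{
    assert(hdr != 0);
    /* 11 sync bits: all of byte 0 plus the top three bits of byte 1. */
    if (hdr[0] != 0xFF || (hdr[1] & 0xE0) != 0xE0)
        return MPEG_INVALID;
    return (enum mpeg_version)((hdr[1] >> 3) & 0x03);
}
```

A decoder that insists on a 12-bit sync word never sees such frames, because the twelfth bit is 0 in MPEG 2.5 headers.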


>      Is anyone working on a streaming MP3 player (vs. WinAmp's current need 
>      to download then play)? If so, I would like to get in touch.  If not, 
>      I would like to begin building such a beast.

There is a player supporting the http protocol, called mpg123. I don't
know who to contact about it, but I'm sure you can find something on
the net.

Tomas


WinAmp has supported HTTP streaming since some 1.??? version (the latest is 1.55).
You can get it from http://winamp.lh.net

David Balazic


 MP3 Web Pages
-------------

   Buster's Music Page    http://ns2.clever.net/~buster/music.htm
   The Music Vault        http://199.77.34.251/~tryp/musicvaul/
   Rabid Neurosis         http://www.wantree.com.au/~silpub/rns/
   Compress Da Audio      http://www.cda.net/

For more details, please write me:
Igor_Gavrilov


For making mp3s you can use l3enc, available at:
http://www.iis.fhg.de/departs/amm/layer3/

There is a bunch of MP3-related software at www.mp3shoppingmall.com

David Balazic


>The original ISO source is available at
>http://users.bart.nl/~soloh/mpegEnc.html

Pete, no. As far as I know, there is another source code that you have
to purchase from ISO that does support joint-stereo.

Forgive me if I am wrong, but I think the encoder that they used on
http://users.bart.nl/~soloh/mpegEnc.html is the same as:

ftp://ftp.tnt.uni-hannover.de/pub/MPEG/audio/mpeg2/public_software/
according to the Layer 3 FAQ on the FhG site:
http://www.iis.fhg.de/departs/amm/layer3/sw/index.html

I think there is other code in
ftp://ftp.tnt.uni-hannover.de/pub/MPEG/audio/mpeg1/software/

Dmitry


>I managed to compile the ISO encoder source code "dist10/lsf" on the Mac
>without big problems. It also seems to run fine, but the encoded files
>(regardless whether I choose layer II or layer III) just do not work with
>my MPEG players.

>Do I have to swap bytes in the output (Motorola/Intel byte ordering), or
>can you imagine what else I might have done wrong?

I finally ran a test file through the ISO encoder this evening to see how it
works. The encoder takes raw PCM data in big-endian, signed (2's complement)
format. I used IRIX 5.3's SoundFiler to convert an AIFF file to raw PCM and
it worked great (I was unable to get the included dist10/tool/pcm2aiff to do
much). The resulting MP3 works on all my players, including AMP and WinAMP.
However, this ISO source indeed doesn't include Joint-Stereo; it will
generate an error message if this is selected for layer-3 output; I had to
use plain stereo. The resulting MP3 still sounds very clear, however
(44.1 kHz; 128 kbit/s).
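
If the raw PCM at hand is little-endian (as in the data chunk of a typical WAV file), the samples would have to be byte-swapped before being fed to an encoder that expects big-endian input. A minimal sketch, assuming 16-bit samples; swap16_samples is a hypothetical helper and not part of the ISO sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of every 16-bit sample in place. */
static void swap16_samples(uint8_t *pcm, size_t nbytes)
{
    assert(nbytes % 2 == 0);    /* whole 16-bit samples only */
    for (size_t i = 0; i + 1 < nbytes; i += 2) {
        uint8_t tmp = pcm[i];
        pcm[i] = pcm[i + 1];
        pcm[i + 1] = tmp;
    }
}
```

Data that is already big-endian, such as the sample data of an AIFF file, should need no such swap.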

Pete Plank
Illudium Design
http://www.kic.or.jp/~plankp/


L3ENC/DEC V2.61 ISO/MPEG Audio Layer 3
    Software Only Encoder/Decoder
     Registration Code Generator
 Written by Outsider/NWT, Nov 16 1996

--------------------------------------
Write me, & I'm sending U this 'Code Generator'

Igor_Gavrilov@p10.f620.n5030.z2.fidonet.org


        I would like to try those assembly optimisations in AMP.
Unfortunately my C compiler doesn't understand that way of writing
assembly instructions. The truth is, I don't understand it either. I am
used to the classical one, i.e.:
fmul    st(1)
fstp    [ecx+eax*4]
inc     eax
...

Can someone give me a clue for understanding
those cryptic fmuls 8(%1)\n\t"\ and so on...

> #if defined(ARCH_i586)
>       /* x86 assembler optimisations.  These optimisations are tuned
>            specifically for Intel Pentiums. */
> 
>           asm("movl $15,%%eax\n\t"\
>         "1:\n\t"\
>       "flds (%0)\n\t"\
>       "fmuls (%1)\n\t"\
>       "flds 4(%0)\n\t"\
>       "fmuls 4(%1)\n\t"\
>       "flds 8(%0)\n\t"\
>       "fmuls 8(%1)\n\t"\
>       "fxch %%st(2)\n\t"\
>       "faddp\n\t"\
>       "flds 12(%0)\n\t"\
>.....

These assembly statements are in AT&T assembly format, as opposed to
the Intel format you are using. They are also written using a gcc feature
called "inline assembly". This is something most compilers have, only
with a different syntax. 

I'm sending you a document describing at&t format and inline assembly 
privately via email. Anyone else needing this should feel free to
ask for it as well.


Tomislav

P.S. The code itself was written by Karl Oygard, and he is probably the
only person that understands how it works and why it's so fast. I don't
have a clue personally :-)


                     Brennan's Guide to Inline Assembly

                      by Brennan "Mr. Wacko" Underwood

Document version 1.1.2

Ok. This is meant to be an introduction to inline assembly under DJGPP.
DJGPP is based on GCC, so it uses the AT&T/UNIX syntax and has a somewhat
unique method of inline assembly. I spent many hours figuring some of this
stuff out and told Info that I hate it, many times.

Hopefully if you already know Intel syntax, the examples will be helpful to
you. I've put variable names, register names and other literals in bold
type.

                                 The Syntax

So, DJGPP uses the AT&T assembly syntax. What does that mean to you?

   * Register naming:
     Register names are prefixed with "%". To reference eax:

     AT&T:  %eax
     Intel: eax

   * Source/Destination Ordering:
     In AT&T syntax (which is the UNIX standard, BTW) the source is always
     on the left, and the destination is always on the right.
     So let's load ebx with the value in eax:

     AT&T:  movl %eax, %ebx
     Intel: mov ebx, eax

   * Constant value/immediate value format:
     You must prefix all constant/immediate values with "$".
     Let's load eax with the address of the "C" variable booga, which is
     static.

     AT&T:  movl $_booga, %eax
     Intel: mov eax, _booga

     Now let's load ebx with 0xd00d:

     AT&T:  movl $0xd00d, %ebx
     Intel: mov ebx, 0d00dh

   * Operator size specification:
     You must suffix the instruction with one of b, w, or l to specify the
     width of the destination register as a byte, word or longword. If you
     omit this, GAS (GNU assembler) will attempt to guess. You don't want
     GAS to guess, and guess wrong! Don't forget it.

     AT&T:  movw %ax, %bx
     Intel: mov bx, ax

     The equivalent forms for Intel are byte ptr, word ptr, and dword ptr,
     but that is for when you are...
   * Referencing memory:
     DJGPP uses 386-protected mode, so you can forget all that real-mode
     addressing junk, including the restrictions on which register has what
     default segment, which registers can be base or index pointers. Now, we
     just get 6 general purpose registers. (7 if you use ebp, but be sure to
     restore it yourself or compile with -fomit-frame-pointer.)
     Here is the canonical format for 32-bit addressing:

     AT&T:  immed32(basepointer,indexpointer,indexscale)
     Intel: [basepointer + indexpointer*indexscale + immed32]

     You could think of the formula to calculate the address as:

       immed32 + basepointer + indexpointer * indexscale

     You don't have to use all those fields, but you do have to have at
     least 1 of immed32, basepointer and you MUST add the size suffix to the
     operator!
     Let's see some simple forms of memory addressing:

        o Addressing a particular C variable:

          AT&T:  _booga
          Intel: [_booga]

          Note: the underscore ("_") is how you get at C variables from
          assembler. But usually you will use extended asm to have them
          preloaded. I address that farther down.

        o Addressing what a register points to:

          AT&T:  (%eax)
          Intel: [eax]

        o Addressing a variable offset by a value in a register:

          AT&T: _variable(%eax)
          Intel: [eax + _variable]

        o Addressing a value in an array of integers (scaling up by 4):

          AT&T:  _array(,%eax,4)
          Intel: [eax*4 + _array]

        o You can also do offsets with the immediate value:

          C code: *(p+1) where p is a char *
          AT&T:  1(%eax) where eax has the value of p
          Intel: [eax + 1]

        o You can do some simple math on the immediate value:

          AT&T: _struct_pointer+8

          I assume you can do that with Intel format as well.

        o Addressing a particular char in an array of 8-character records:
          eax holds the number of the record desired. ebx has the wanted
          char's offset within the record.

          AT&T:  _array(%ebx,%eax,8)
          Intel: [ebx + eax*8 + _array]

     Whew. Hopefully that covers all the addressing you'll need to do. As a
     note, you can put esp into the address, but only as the base register.
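
To make the address formula concrete, the following sketch evaluates immed32 + basepointer + indexpointer * indexscale in plain C and can be compared against ordinary array indexing. The function effective_address is made up for illustration; it only mimics the arithmetic the CPU does, and it uses uintptr_t so that it also runs on machines where pointers are wider than 32 bits. The assertion reflects the fact that x86 only allows scale factors of 1, 2, 4 and 8.

```c
#include <assert.h>
#include <stdint.h>

/* immed32 + basepointer + indexpointer * indexscale */
static uintptr_t effective_address(uintptr_t immed32, uintptr_t base,
                                   uintptr_t index, unsigned scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return immed32 + base + index * scale;
}
```

For an array of 32-bit integers, effective_address((uintptr_t)array, 0, i, 4) gives the address of array[i], which is the _array(,%eax,4) form; for 8-character records, effective_address((uintptr_t)records, offset, n, 8) corresponds to _array(%ebx,%eax,8).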

                            Basic inline assembly

The format for basic inline assembly is very simple, and much like Borland's
method.

asm ("statements");

Pretty simple, no? So

asm ("nop");

will do nothing of course, and

asm ("cli");

will stop interrupts, with

asm ("sti");

of course enabling them. You can use __asm__ instead of asm if the keyword
asm conflicts with something in your program.

When it comes to simple stuff like this, basic inline assembly is fine. You
can even push your registers onto the stack, use them, and put them back.

asm ("pushl %eax\n\t"
     "movl $0, %eax\n\t"
     "popl %eax");

(The \n's and \t's are there so the .s file that GCC generates and hands to
GAS comes out right when you've got multiple statements per asm.)
It's really meant for issuing instructions for which there is no equivalent
in C and don't touch the registers.

But if you do touch the registers, and don't fix things at the end of your
asm statement, like so:

asm ("movl %eax, %ebx");
asm ("xorl %ebx, %edx");
asm ("movl $0, _booga");

then your program will probably blow things to hell. This is because GCC
hasn't been told that your asm statement clobbered ebx and edx and booga,
which it might have been keeping in a register, and might plan on using
later. For that, you need:

                          Extended inline assembly

The basic format of the inline assembly stays much the same, but now gets
Watcom-like extensions to allow input arguments and output arguments.

Here is the basic format:

asm ( "statements" : output_registers : input_registers : clobbered_registers);

Let's just jump straight to a nifty example, which I'll then explain:

asm ("cld\n\t"
     "rep\n\t"
     "stosl"
     : /* no output registers */
     : "c" (count), "a" (fill_value), "D" (dest)
     : "%ecx", "%edi" );

The above stores the value in fill_value count times to the pointer dest.

Let's look at this bit by bit.

asm ("cld\n\t"

We are clearing the direction bit of the flags register. Intel syntax uses
the same mnemonic, cld. You never know what this is going to be
left at, and it costs you all of 1 or 2 cycles.

     "rep\n\t"
     "stosl"

Notice that GAS requires the rep prefix to occupy a line of its own. Notice
also that stos has the l suffix to make it move longwords.

     : /* no output registers */

Well, there aren't any in this function.

     : "c" (count), "a" (fill_value), "D" (dest)

Here we load ecx with count, eax with fill_value, and edi with dest. Why
make GCC do it instead of doing it ourselves? Because GCC, in its register
allocating, might be able to arrange for, say, fill_value to already be in
eax. If this is in a loop, it might be able to preserve eax thru the loop,
and save a movl once per loop.

     : "%ecx", "%edi" );

And here's where we specify to GCC, "you can no longer count on the values
you loaded into ecx or edi to be valid." This doesn't mean they will be
reloaded for certain. This is the clobberlist.

Seem funky? Well, it really helps when optimizing, when GCC can know exactly
what you're doing with the registers before and after. It folds your
assembly code into the code it generates (whose rules for generation look
remarkably like the above) and then optimizes. It's even smart enough to
know that if you tell it to put (x+1) in a register, then if you don't
clobber it, and later C code refers to (x+1), and it was able to keep that
register free, it will reuse the computation. Whew.
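
A note for readers on newer compilers: as far as I know, current GCC releases reject a clobber list that names a register also used as an input operand, which is what the example above does with ecx and edi. The usual way to express "this register is loaded and then modified" today is a read-write operand with the "+" modifier. The following is a sketch under that assumption, not part of the original guide; fill32 is a hypothetical name, and a plain C loop is used when the compiler is not GCC-compatible or the target is not x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store fill_value into dest[0..count-1]. */
static void fill32(uint32_t *dest, uint32_t fill_value, size_t count)
{
    assert(dest != NULL || count == 0);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__ (
        "cld\n\t"
        "rep stosl"
        : "+D" (dest), "+c" (count)   /* edi and ecx are read and modified */
        : "a" (fill_value)            /* eax is only read */
        : "memory", "cc");            /* writes memory, changes flags */
#else
    while (count--)
        *dest++ = fill_value;         /* portable fallback */
#endif
}
```

Because dest and count are declared as read-write operands, the compiler knows their registers change and no register clobbers are needed; "memory" tells it that the buffer contents changed.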

Here's the list of register loading codes that you'll be likely to use:

a        eax
b        ebx
c        ecx
d        edx
S        esi
D        edi
I        constant value (0 to 31)
q,r      dynamically allocated register (see below)

Note that you can't directly refer to the byte registers (ah, al, etc.) or
the word registers (ax, bx, etc.) when you're loading this way. Once you've
got it in there, though, you can specify ax or whatever all you like.

The codes have to be in quotes, and the expressions to load in have to be in
parentheses.

When you do the clobber list, you specify the registers as above with the %.
If you write to a variable, you must include "memory" as one of The
Clobbered. This is in case you wrote to a variable that GCC thought it had
in a register. This is the same as clobbering all registers. While I've
never run into a problem with it, you might also want to add "cc" as a
clobber if you change the condition codes (the bits in the flags register
the jnz, je, etc. operators look at.)

Now, that's all fine and good for loading specific registers. But what if
you specify, say, ebx, and ecx, and GCC can't arrange for the values to be
in those registers without having to stash the previous values. It's
possible to let GCC pick the register(s). You do this:

asm ("leal (%1,%1,4), %0"
     : "=r" (x)
     : "0" (x) );

The above example multiplies x by 5 really quickly (1 cycle on the Pentium).
Now, we could have specified, say eax. But unless we really need a specific
register (like when using rep movsl or rep stosl, which are hardcoded to use
ecx, edi, and esi), why not let GCC pick an available one? So when GCC
generates the output code for GAS, %0 will be replaced by the register it
picked.

And where did "q" and "r" come from? Well, "q" causes GCC to allocate from
eax, ebx, ecx, and edx. "r" lets GCC also consider esi and edi. So make
sure, if you use "r" that it would be possible to use esi or edi in that
instruction. If not, use "q".

Now, you might wonder, how to determine how the %n tokens get allocated to
the arguments. It's a straightforward first-come-first-served, left-to-right
thing, mapping to the "q"'s and "r"'s. But if you want to reuse a register
allocated with a "q" or "r", you use "0", "1", "2"... etc.

You don't need to put a GCC-allocated register on the clobberlist as GCC
knows that you're messing with it.

Now for output registers.

asm ("leal (%1,%1,4), %0"
     : "=r" (x_times_5)
     : "r" (x) );

Note the use of = to specify an output register. You just have to do it that
way. If you want 1 variable to stay in 1 register for both in and out, you
have to respecify the register allocated to it on the way in with the "0"
type codes as mentioned above.

asm ("leal (%0,%0,4), %0"
     : "=r" (x)
     : "0" (x) );

This also works, by the way:

asm ("leal (%%ebx,%%ebx,4), %%ebx"
     : "=b" (x)
     : "b" (x) );

2 things here:

   * Note that we don't have to put ebx on the clobberlist, GCC knows it
     goes into x. Therefore, since it can know the value of ebx, it isn't
     considered clobbered.
   * Notice that in extended asm, you must prefix registers with %% instead
     of just %. Why, you ask? Because as GCC parses along for %0's and %1's
     and so on, it would interpret %edx as a %e parameter, see that that's
     non-existent, and ignore it. Then it would bitch about finding a symbol
     named dx, which isn't valid because it's not prefixed with % and it's
     not the one you meant anyway.
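
Putting the pieces above together, here is a small compilable sketch of the multiply-by-5 example wrapped in a function, with GCC choosing both registers. It is an illustration rather than part of the original guide: the function name times5 is reused from the macros further down, and the plain C branch is an added fallback for compilers or targets where the asm does not apply.

```c
#include <assert.h>

/* Multiply x by 5 with a single lea, letting GCC pick the registers. */
static int times5(int x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int result;
    __asm__ ("leal (%1,%1,4), %0"   /* result = x + x*4 */
             : "=r" (result)        /* %0: output, any general register */
             : "r" (x));            /* %1: input, any general register */
    return result;
#else
    return x * 5;                   /* portable fallback */
#endif
}

/* Example check: assert(times5(7) == 35); */
```

No volatile is used here, since the statement only computes its output; that leaves the compiler free to combine or drop repeated calls with the same argument.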

Important note: If your assembly statement must execute where you put it,
(i.e. must not be moved out of a loop as an optimization), put the keyword
volatile after asm and before the ()'s. To be ultra-careful, use

__asm__ __volatile__ (...whatever...);

However, I would like to point out that if your assembly's only purpose is
to calculate the output registers, with no other side effects, you should
leave off the volatile keyword so your statement will be processed into
GCC's common subexpression elimination optimization.

                            Some useful examples

#define disable() __asm__ __volatile__ ("cli");

#define enable() __asm__ __volatile__ ("sti");

Of course, libc has these defined too.

#define times3(arg1, arg2) \
__asm__ ( \
  "leal (%0,%0,2),%0" \
  : "=r" (arg2) \
  : "0" (arg1) );

#define times5(arg1, arg2) \
__asm__ ( \
  "leal (%0,%0,4),%0" \
  : "=r" (arg2) \
  : "0" (arg1) );

#define times9(arg1, arg2) \
__asm__ ( \
  "leal (%0,%0,8),%0" \
  : "=r" (arg2) \
  : "0" (arg1) );

These multiply arg1 by 3, 5, or 9 and put them in arg2. You should be ok to
do:

times5(x,x);

as well.

#define rep_movsl(src, dest, numwords) \
__asm__ __volatile__ ( \
  "cld\n\t" \
  "rep\n\t" \
  "movsl" \
  : : "S" (src), "D" (dest), "c" (numwords) \
  : "%ecx", "%esi", "%edi" )

Helpful Hint: If you say memcpy() with a constant length parameter, GCC will
inline it to a rep movsl like above. But if you need a variable length
version that inlines and you're always moving dwords, there ya go.
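
The same idea can also be written as an inline function instead of a macro, which gives the arguments real types. This is a sketch, not the guide's code: copy32 is a hypothetical name, the three registers are declared as read-write operands instead of being listed as clobbers, and a plain loop stands in where the asm does not apply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy numwords 32-bit words from src to dest (regions must not overlap). */
static void copy32(const uint32_t *src, uint32_t *dest, size_t numwords)
{
    assert((src != NULL && dest != NULL) || numwords == 0);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__ (
        "cld\n\t"
        "rep movsl"
        : "+S" (src), "+D" (dest), "+c" (numwords)  /* esi, edi, ecx change */
        :
        : "memory", "cc");
#else
    while (numwords--)
        *dest++ = *src++;                           /* portable fallback */
#endif
}
```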

#define rep_stosl(value, dest, numwords) \
__asm__ __volatile__ ( \
  "cld\n\t" \
  "rep\n\t" \
  "stosl" \
  : : "a" (value), "D" (dest), "c" (numwords) \
  : "%ecx", "%edi" )

Same as above but for memset(), which doesn't get inlined no matter what
(for now.)

                                   The End

"The End"?! Yah, I guess so.

If you're wondering, I personally am a big fan of AT&T/UNIX syntax now. (It
might have helped that I cut my teeth on SPARC assembly. Of course, that
machine actually had a decent number of general registers.) It might seem
weird to you at first, but it's really more logical than Intel format, and
has no ambiguities.

If I still haven't answered a question of yours, look in the Info pages for
more information, particularly on the input/output registers. You can do
some funky stuff like use "A" to allocate two registers at once for 64-bit
math or "m" for static memory locations, and a bunch more that aren't really
used as much as "q" and "r".

Alternately, mail me, and I'll see what I can do. (If you find any errors in
the above, please, e-mail me and tell me about it! It's frustrating enough
to learn without buggy docs!) Or heck, mail me to say "boogabooga."

It's the least you can do.

----------------------------------------------------------------------------
Thanks to Eric J. Korpela  for corrections.
----------------------------------------------------------------------------
Have you seen the DJGPP2+Games Page? Probably.
Page written and provided by Brennan Underwood.
Copyright © 1996 Brennan Underwood. Share and enjoy!
Page created with vi, God's own editor.


|  Could you say something about your code?  How does it work?

If you really want to understand it, start out with the C version.  The
assembler version is essentially the same, but with better instruction
scheduling.

In any case, the algorithm just does some dewindowing, which involves 16
coefficient multiplications.  The tables, however, are fairly large, which
breaks the cache.  Thus, the code has been obfuscated beyond recognition
to make table accesses cache-friendly.  If you really want to understand
it, you should download an earlier version of amp (e.g. 0.7.3) to see how
it's done there; it's easier to understand, but not as efficient.  From
there you can move on to the newer versions; they all do the same thing,
just in different ways.

Now for the assembler part.  As I mentioned, the dewindowing involves a lot
of multiplications.  Fortunately, the Pentium can pipeline floating point
operations very well.  A multiplication, addition or subtraction takes
three clock cycles to complete, but, if you write the code well, the
Pentium can start a new one every clock cycle.  As long as consecutive
operations work on different floating point registers and do not depend on
each other's results, you get no stalls and maximum throughput.  I'll
describe the code a bit:

        "flds (%0)\n\t"\        ; 1    push u_ptr[0] onto fpu register stack
        "fmuls (%1)\n\t"\       ; 2    multiply dewindow[0] with st0
        "flds 4(%0)\n\t"\       ; 3    push u_ptr[1] onto fpu register stack
        "fmuls 4(%1)\n\t"\      ; 4    multiply dewindow[1] with st0
        "flds 8(%0)\n\t"\       ; 5    push u_ptr[2] onto fpu register stack
        "fmuls 8(%1)\n\t"\      ; 6    multiply dewindow[2] with st0
        "fxch %%st(2)\n\t"\     ; 6    swap st0 and st2 (result from cycle 2)
        "faddp\n\t"\            ; 7    add st0 to st1 (result from cycle 4) and pop st0 off stack
        "flds 12(%0)\n\t"\      ; 8    push u_ptr[3] onto fpu register stack
        "fmuls 12(%1)\n\t"\     ; 9    multiply dewindow[3] with st0
        "fxch %%st(2)\n\t"\     ; 9    swap st0 and st2 (result from cycle 6)
        "faddp\n\t"\            ; 10   add st0 to st1 (result from cycle 7) and pop st0 off stack
        "flds 16(%0)\n\t"\      ; 11   push u_ptr[4] onto fpu register stack
        "fmuls 16(%1)\n\t"\     ; 12   multiply dewindow[4] with st0
        "fxch %%st(2)\n\t"\     ; 12   swap st0 and st2 (result from cycle 9)
        "faddp\n\t"\            ; 13   add st0 to st1 (result from cycle 10) and pop st0 off stack
          :

Here we did 5 multiplications, 3 additions and 5 loads in 13 cycles.  Now,
if we were _really_ bad at this, we could have made it run in something
like 34 cycles.  In this case, we're nearly three times as fast.

However, the code is pretty much incomprehensible.  This is what your
compiler should have done for you, but doesn't get quite right.  By putting
your mind to it, and minimising level 1 and level 2 cache stalls, you can
do it quite a bit better yourself.
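
The pattern can be approximated in plain C. The sketch below is not amp's
code: it assumes one output sample is a sum of 16 products, as described
above, and keeps three independent partial sums so that no addition has to
wait for the product issued just before it. The function name dewindow16 is
hypothetical; u_ptr and dewindow are borrowed from the comments in the
listing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from amp: sum of 16 products. */
static float dewindow16(const float *u_ptr, const float *dewindow)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f;
    size_t i;

    assert(u_ptr != NULL && dewindow != NULL);

    /* Three independent chains: each accumulator is touched only once
       per iteration, so consecutive operations do not depend on each
       other.  This covers elements 0..14. */
    for (i = 0; i + 2 < 16; i += 3) {
        acc0 += u_ptr[i]     * dewindow[i];
        acc1 += u_ptr[i + 1] * dewindow[i + 1];
        acc2 += u_ptr[i + 2] * dewindow[i + 2];
    }
    acc0 += u_ptr[15] * dewindow[15];   /* the one leftover element */

    return (acc0 + acc1) + acc2;
}
```

Because float addition is not associative, the result may differ in the
last bits from a strictly left-to-right sum. Whether a given compiler
schedules this as well as the hand-written version is something to
measure, not assume.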

If you want to read more about assembler optimisation on the Pentium cpu,
check out http://www.goof.com/pcg/docs.html.

Regards,
Karl Anders Oygard


Since I cannot get any version of amp (DOS or Windows) to compile, I am
unable to make the changes myself (anyone with Borland 4.5-friendly
source?)

1) The decoding calculations all mix precisions; while this may seem no
    big deal, I think there is enough room to warrant lower-precision
    (= faster) calculations.

2) Why not use an adaptive Huffman model? (OK, speed is an issue, and
    most MP3 players are designed for single-pass decoding.) But if we
    used an adaptive model, we could achieve 5 to 60% smaller files.

Taiken.


        Sorry to throw in this technical support email, but to resolve
the external symbol problem you need to include "winmm.lib" in the
"Project | Settings | Link" option.


Pv


FYI: This seems to be the German version of Visual C++ - the error message
translates into "unresolved external symbol".
Guess M$ got the libraries wrong once more.

LLaP
bero

http://www.star-trek.ml.org/ --

"Nobody will ever need more than 640k RAM!"
                           -- Bill Gates, 1981
"Windows 95 needs at least 8 MB RAM."
                           -- Bill Gates, 1996
"Nobody will ever need Windows 95."
                           -- logical conclusion


   Don't forget to add winmm.lib to the Release x86 configuration. If you have
any doubts, take a look at the Alpha configs. These are OK for sure.

   Remember I did this port using a machine with Alpha CPU. It is natural that
some x86 configs are not correct.

                Pedro Miguel Teixeira


[car mp3 stereo based on AMP]

>DOS has these advantages:
>I am more familiar with it.
>At least at first look, it is much smaller than Linux.
>Being smaller, it's possible the OS and application could fit on a single
>floppy disk during development, bypassing the need for a hard drive or
>large flash memory.

you could try mpg123/dos. it already has a driver for sb16 and so on and so
on...

http://www.sci.fi/~tobo/mpg123

and read the page before downloading :) i will not answer stupid questions.


I've already drafted a document on that topic. It needs some serious
input. Please take a look and comment.

http://home.dwave.net/~whizkid/ml3cd/

MP3 files already have a de facto standard for including title, artist,
album, year, genre and comment in the file itself.
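
The post does not name the format. Assuming it means the 128-byte tag
stored at the very end of the file and commonly known as ID3 (ID3v1), a
minimal parsing sketch could look like this. The struct and function names
are made up for the example; the caller is expected to have read the last
128 bytes of the file into a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical struct: each text field is its fixed width plus a NUL. */
struct tag_info {
    char title[31];
    char artist[31];
    char album[31];
    char year[5];
    char comment[31];
    unsigned char genre;    /* index into a genre table */
};

/* Copy a fixed-width field and strip trailing space or NUL padding. */
static void copy_field(char *dst, const unsigned char *src, size_t width)
{
    memcpy(dst, src, width);
    dst[width] = '\0';
    while (width > 0 && (dst[width - 1] == ' ' || dst[width - 1] == '\0'))
        dst[--width] = '\0';
}

/* Parse the last 128 bytes of a file.  Returns 1 if a tag is present. */
static int parse_tag(const unsigned char block[128], struct tag_info *out)
{
    assert(block != NULL && out != NULL);
    if (memcmp(block, "TAG", 3) != 0)
        return 0;                       /* no tag marker */
    copy_field(out->title,   block + 3,  30);
    copy_field(out->artist,  block + 33, 30);
    copy_field(out->album,   block + 63, 30);
    copy_field(out->year,    block + 93, 4);
    copy_field(out->comment, block + 97, 30);
    out->genre = block[127];
    return 1;
}
```

A file without the "TAG" marker simply yields 0, so a player can fall back
to showing the file name.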

Chris


Some parts of the newer amps aren't as accurate as they could be. The extra high
quality mode on winamp 1.64 uses a slower, more accurate poly() (though it isn't
much better). Winamp 1.666 (coming real soon) uses a rewritten MUCH more accurate
poly(), which was originally considerably slower (I can actually understand what 
it does), but I've optimized it in assembly to make it only a bit slower...

Anyways, not really a bug, just some optimizations of amp have been detrimental to
the accuracy of decoding. I'm trying to get past that :)

Justin


> I was curious to find out what you call 'extra high quality' in your windows
> port of the AMP decoder. The AMP code does not seem to do any special
> degrading of the audio quality, as it uses floating point numbers to do the
> computation. Also, I tried to decode a stream into a "WAV" file, both with
> extra high quality and normal quality, and the difference in the PCM values is
> most of the time 0, and never larger than 1 (in absolute value) (this is on a
> 16 bit range). 
> Also, I noticed that when compared to a reference decoder like l3dec (from the
> Fraunhofer institute), the PCM output has significant variations from the
> reference, especially in the MPEG frames that are using the SHORT WINDOWS
> transforms. Is it a bug ?

        Actually, amp has big problems in IMDCT and POLY. But I am
        not going to get into that. Also, the fact that we're using
        floats instead of doubles lowers the quality but ups the
        performance. I guess it is one of those quality vs. price
        trade-offs.
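
The comparison described in the quoted message (decode the same stream
twice, then compare the PCM values) can be sketched as below. This is only
an illustration: the function name is hypothetical, and it assumes both
decoders produced the same number of 16-bit samples, already loaded into
memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: largest absolute difference between two 16-bit
   PCM buffers of n samples each. */
static long max_pcm_diff(const int16_t *a, const int16_t *b, size_t n)
{
    long worst = 0;
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < n; i++) {
        /* widen before subtracting so the difference cannot overflow */
        long d = labs((long)a[i] - (long)b[i]);
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

A result of 0 or 1 would match what the quoted message reports for the two
quality modes; larger values point at a real difference between decoders.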

        Dmitry


>I'd like to see an mp3 player that supports more sophisticated
>pre-buffering which can occur *during* playback of another
>track.  Are there any currently available players with this feature?
>I've not encountered any.

i cant say for sure but i believe mpg123 for linux has this feature...
it uses a separate process for buffering and so on... http://mpg.123.org/

linux -> full source
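
For illustration only, here is the kind of buffer such pre-buffering is
usually built on. This is not mpg123's code: it is a single-threaded ring
buffer sketch with made-up names and an arbitrarily small capacity. A real
player would run the decoder and the audio output in separate processes or
threads and add shared memory or locking around it.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 16    /* arbitrary small capacity for the sketch */

struct ring {
    unsigned char data[RING_SIZE];
    size_t head;    /* next write position */
    size_t tail;    /* next read position */
    size_t used;    /* bytes currently buffered */
};

/* Store up to n bytes; returns how many actually fit. */
static size_t ring_write(struct ring *r, const unsigned char *src, size_t n)
{
    size_t i;

    assert(r != NULL && src != NULL);
    for (i = 0; i < n && r->used < RING_SIZE; i++) {
        r->data[r->head] = src[i];
        r->head = (r->head + 1) % RING_SIZE;
        r->used++;
    }
    return i;
}

/* Fetch up to n bytes in FIFO order; returns how many were available. */
static size_t ring_read(struct ring *r, unsigned char *dst, size_t n)
{
    size_t i;

    assert(r != NULL && dst != NULL);
    for (i = 0; i < n && r->used > 0; i++) {
        dst[i] = r->data[r->tail];
        r->tail = (r->tail + 1) % RING_SIZE;
        r->used--;
    }
    return i;
}
```

The decoder side would call ring_write() with data from the next track
while the output side is still draining the current one with ring_read().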


MPEG-4 Audio Reference Software is available:

MPEG Audio Web Page:
  http://www.tnt.uni-hannover.de/project/mpeg/audio/

MPEG Audio ftp server:
  ftp://ftp.tnt.uni-hannover.de/pub/MPEG/audio/

It contains an AAC encoder/decoder

David Balazic


does anyone have a text on what is really going on in mp3 compression? if
not...a brief description would be nice...i've been wondering and i doubt
it's simply working with the text or changing  0100100101100 to 012121122
or something.(actually, i have heard that that type of compression was
tried but doesn't give very good rates)

what i do know is this...the mp3 technology is based on the limited range
of our hearing (something like 20 Hz to 20 kHz...) but that is it.....does anyone
else have anymore info on this? 

andy haninger


"If you have to ask you won't understand the answer."  

However there are some pretty accessible intros to MP3 and audio compression
technology on www.mpeg.org.  They will give you a general understanding of the
sort of tricks that are used.  Real understanding requires a good background
in signal processing.

> does anyone have a text on what is really going on in mp3 compression? if
> not...a brief description would be nice...i've been wondering and i doubt
> it's simply working with the text or changing  0100100101100 to 012121122
> or something.(actually, i have heard that that type of compression was
> tried but doesn't give very good rates)
> 
> what i do know is this...the mp3 technology is based on the limited range
> of our hearing (something like 20 Hz to 20 kHz...) but that is it.....does anyone
> else have anymore info on this? 

Well, the limited range of our hearing which you speak of is a very simple
model of hearing which is not of much use (considering that recording
technology already takes advantage of this).  MP3 exploits a more complicated
phenomenon known as psychoacoustic masking.

-Matt


Finally, a small source excerpt from AMP.....

/* this file is a part of amp software, (C) tomislav uzelac 1996,1997
*/
 
/* transform.c  imdct and polyphase(DCT) transforms
 *
 * Created by: tomislav uzelac  May 1996
 * Karl Anders Oygard optimized this for speed, Mar 13 97
 * Some optimisations based on ideas from Michael Hipp's mpg123 package
 */

/*
 * Comments for this file:
 *
 * The polyphase algorithm is clearly the most cpu consuming part of mpeg 1
 * layer 3 decoding.  Thus, there has been some effort to optimise this
 * particular algorithm.  Currently, everything has been kept in straight C
 * with no assembler optimisations, but in order to provide efficient paths
 * for different architectures, alternative implementations of some
 * critical sections have been written.  You may want to experiment with these,
 * to see which suits your architecture better.
 *
 * Selection of the different implementations is done with the following
 * defines:
 *
 *     HAS_AUTOINCREMENT
 *
 *         Define this if your architecture supports preincrementation of
 *         pointers when referencing (applies to e.g. 68k)
 *
 * For those who are optimising amp, check out the Pentium rdtsc code
 * (define PENTIUM_RDTSC).  This code uses the rdtsc counter for showing
 * how many cycles are spent in different parts of the code.
 */

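/* NOTE: the header names were lost from this copy.  sin() needs <math.h>;
   the other three are presumed, not confirmed. */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>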

#include "audio.h"
#include "getdata.h"
#include "misc2.h"

#define TRANSFORM
#include "transform.h"

#define PI12      0.261799387f
#define PI36      0.087266462f

void imdct_init()
{
  int i;

  for(i=0;i<36;i++) /* 0 */
    win[0][i] = (float) sin(PI36 *(i+0.5));
  for(i=0;i<18;i++) /* 1 */
    win[1][i] = (float) sin(PI36 *(i+0.5));
  for(i=18;i<24;i++)
    win[1][i] = 1.0f;
  for(i=24;i<30;i++)
    win[1][i] = (float) sin(PI12 *(i+0.5-18));
  for(i=30;i<36;i++)
    win[1][i] = 0.0f;
  for(i=0;i<6;i++) /* 3 */
    win[3][i] = 0.0f;
  for(i=6;i<12;i++)
    win[3][i] = (float) sin(PI12 * (i+ 0.5 - 6.0));
  for(i=12;i<18;i++)
    win[3][i] = 1.0f;
  for(i=18;i<36;i++)
    win[3][i] = (float) sin(PI36 * (i + 0.5));
}

/* This uses Byeong Gi Lee's Fast Cosine Transform algorithm to decompose
   the 36 point and 12 point IDCT's into 9 point and 3 point IDCT's,
   respectively. Then the 9 point IDCT is computed by a modified version of
   Mikko Tommila's IDCT algorithm, based on the WFTA. See his comments
   before the first 9 point IDCT. The 3 point IDCT is already efficient to
   implement. -- Jeff Tsay. */
/* I got the unrolled IDCT from Jeff Tsay; the code is presumably by 
   Francois-Raymond Boyer - I unrolled it a little further. tu */

void imdct(int win_type,int sb,int ch)
{
/*------------------------------------------------------------------*/
/*                                                                  */
/*    Function: Calculation of the inverse MDCT                     */
/*    In the case of short blocks the 3 output vectors are already  */
/*    overlapped and added in this module.                          */
/*                                                                  */
/*    New layer3                                                    */
/*                                                                  */
/*------------------------------------------------------------------*/

       float tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8, tmp9, tmp10, tmp11;

       register float  save;
       float  pp1, pp2;
       float   *win_bt;
       int     i, p, ss;
                 float *in = xr[ch][sb];
                 float *s_p = s[ch][sb];
                 float *res_p = res[sb];
                 float out[36];

   if(win_type == 2){
                for(p=0;p<36;p+=9) {
                        out[p]   = out[p+1] = out[p+2] = out[p+3] =
                        out[p+4] = out[p+5] = out[p+6] = out[p+7] =
                        out[p+8] = 0.0f;
                }

        for(ss=0;ss<18;ss+=6) {

        /*
         *  12 point IMDCT
         */

                /* Begin 12 point IDCT */

                /* Input aliasing for 12 pt IDCT */
                in[5+ss]+=in[4+ss];in[4+ss]+=in[3+ss];in[3+ss]+=in[2+ss];
                in[2+ss]+=in[1+ss];in[1+ss]+=in[0+ss];

                /* Input aliasing on odd indices (for 6 point IDCT) */
                in[5+ss] += in[3+ss];  in[3+ss]  += in[1+ss];

                /* 3 point IDCT on even indices */

                pp2 = in[4+ss] * 0.500000000f;
                pp1 = in[2+ss] * 0.866025403f;
                save = in[0+ss] + pp2;
                tmp1 = in[0+ss] - in[4+ss];
                tmp0 = save + pp1;
                tmp2 = save - pp1;

                /* End 3 point IDCT on even indices */

                /* 3 point IDCT on odd indices (for 6 point IDCT) */

                pp2 = in[5+ss] * 0.500000000f;
                pp1 = in[3+ss] * 0.866025403f;
                save = in[1+ss] + pp2;
                tmp4 = in[1+ss] - in[5+ss];
                tmp5 = save + pp1;
                tmp3 = save - pp1;

                /* End 3 point IDCT on odd indices */

                /* Twiddle factors on odd indices (for 6 point IDCT) */

                tmp3 *= 1.931851653f;
                tmp4 *= 0.707106781f;
                tmp5 *= 0.517638090f;

                /* Output butterflies on 2 3 point IDCT's (for 6 point IDCT) */

                save = tmp0;
                tmp0 += tmp5;
                tmp5 = save - tmp5;

                save = tmp1;
                tmp1 += tmp4;
                tmp4 = save - tmp4;

                save = tmp2;
                tmp2 += tmp3;
                tmp3 = save - tmp3;

                /* End 6 point IDCT */

                /* Twiddle factors on indices (for 12 point IDCT) */

                tmp0 *= 0.504314480f;
                tmp1 *= 0.541196100f;
                tmp2 *= 0.630236207f;
                tmp3 *= 0.821339815f;
                tmp4 *= 1.306562965f;
                tmp5 *= 3.830648788f;

                /* End 12 point IDCT */

                /* Shift to 12 point modified IDCT, multiply by window type 2 */
                tmp8  = tmp0 * -0.793353340f;
                tmp9  = tmp0 * -0.608761429f;
                tmp7  = tmp1 * -0.923879532f;
                tmp10 = tmp1 * -0.382683432f;
                tmp6  = tmp2 * -0.991444861f;
                tmp11 = tmp2 * -0.130526192f;

                tmp0  = tmp3;
                tmp1  = tmp4 *  0.382683432f;
                tmp2  = tmp5 *  0.608761429f;

                tmp3  = tmp5 * -0.793353340f;
                tmp4  = tmp4 * -0.923879532f;
                tmp5  = tmp0 * -0.991444861f;

                tmp0 *= 0.130526192f;

                out[ss + 6]  += tmp0;
                out[ss + 7]  += tmp1;
                out[ss + 8]  += tmp2;
                out[ss + 9]  += tmp3;
                out[ss + 10] += tmp4;
                out[ss + 11] += tmp5;
                out[ss + 12] += tmp6;
                out[ss + 13] += tmp7;
                out[ss + 14] += tmp8;
                out[ss + 15] += tmp9;
                out[ss + 16] += tmp10;
                out[ss + 17] += tmp11;

        }
        if (sb&1) {
                for (i=0;i<18;i+=2) res_p[i]=out[i] + s_p[i];
                for (i=1;i<18;i+=2) res_p[i]=-out[i] - s_p[i];
        } else
                for (i=0;i<18;i++) res_p[i]=out[i] + s_p[i];
        for (i=18;i<36;i++) s_p[i-18]=out[i];

    } else {
/*
 * 36 point IDCT ****************************************************************
 */
        float tmp[18];

      /* input aliasing for 36 point IDCT */
      in[17]+=in[16]; in[16]+=in[15]; in[15]+=in[14]; in[14]+=in[13];
      in[13]+=in[12]; in[12]+=in[11]; in[11]+=in[10]; in[10]+=in[9];
      in[9] +=in[8];  in[8] +=in[7];  in[7] +=in[6];  in[6] +=in[5];
      in[5] +=in[4];  in[4] +=in[3];  in[3] +=in[2];  in[2] +=in[1];
      in[1] +=in[0];

      /* 18 point IDCT for odd indices */
      
      /* input aliasing for 18 point IDCT */
      in[17]+=in[15]; in[15]+=in[13]; in[13]+=in[11]; in[11]+=in[9];
      in[9] +=in[7];  in[7] +=in[5];  in[5] +=in[3];  in[3] +=in[1];
      

{
   float tmp0,tmp1,tmp2,tmp3,tmp4,tmp0_,tmp1_,tmp2_,tmp3_;
   float tmp0o,tmp1o,tmp2o,tmp3o,tmp4o,tmp0_o,tmp1_o,tmp2_o,tmp3_o;

/* Fast 9 Point Inverse Discrete Cosine Transform
//
// By  Francois-Raymond Boyer
//         mailto:boyerf@iro.umontreal.ca
//         http://www.iro.umontreal.ca/~boyerf
//
// The code has been optimized for Intel processors
//  (takes a lot of time to convert float to and from internal FPU representation)
//
// It is a simple "factorization" of the IDCT matrix.
*/
   /* 9 point IDCT on even indices */
   {
   /* 5 points on odd indices (not really an IDCT) */
   float i0 = in[0]+in[0];
   float i0p12 = i0 + in[12];

   tmp0 = i0p12 + in[4]*1.8793852415718f  + in[8]*1.532088886238f   + in[16]*0.34729635533386f;
   tmp1 = i0    + in[4]                   - in[8] - in[12] - in[12] - in[16];
   tmp2 = i0p12 - in[4]*0.34729635533386f - in[8]*1.8793852415718f  + in[16]*1.532088886238f;
   tmp3 = i0p12 - in[4]*1.532088886238f   + in[8]*0.34729635533386f - in[16]*1.8793852415718f;
   tmp4 = in[0] - in[4]                   + in[8] - in[12]          + in[16];
   }
   {
   float i6_ = in[6]*1.732050808f;              

   tmp0_ = in[2]*1.9696155060244f  + i6_ + in[10]*1.2855752193731f  + in[14]*0.68404028665134f;
   tmp1_ = (in[2]                        - in[10]                   - in[14])*1.732050808f;
   tmp2_ = in[2]*1.2855752193731f  - i6_ - in[10]*0.68404028665134f + in[14]*1.9696155060244f;
   tmp3_ = in[2]*0.68404028665134f - i6_ + in[10]*1.9696155060244f  - in[14]*1.2855752193731f;
   }

   /* 9 point IDCT on odd indices */
   {
   /* 5 points on odd indices (not really an IDCT) */
   float i0 = in[0+1]+in[0+1];
   float i0p12 = i0 + in[12+1];

   tmp0o = i0p12   + in[4+1]*1.8793852415718f  + in[8+1]*1.532088886238f       + in[16+1]*0.34729635533386f;
   tmp1o = i0      + in[4+1]                   - in[8+1] - in[12+1] - in[12+1] - in[16+1];
   tmp2o = i0p12   - in[4+1]*0.34729635533386f - in[8+1]*1.8793852415718f      + in[16+1]*1.532088886238f;
   tmp3o = i0p12   - in[4+1]*1.532088886238f   + in[8+1]*0.34729635533386f     - in[16+1]*1.8793852415718f;
   tmp4o = (in[0+1] - in[4+1]                   + in[8+1] - in[12+1]            + in[16+1])*0.707106781f; /* Twiddled */
   }
   {
   /* 4 points on even indices */
   float i6_ = in[6+1]*1.732050808f;            /* Sqrt[3] */

   tmp0_o = in[2+1]*1.9696155060244f  + i6_ + in[10+1]*1.2855752193731f  + in[14+1]*0.68404028665134f;
   tmp1_o = (in[2+1]                        - in[10+1]                   - in[14+1])*1.732050808f;
   tmp2_o = in[2+1]*1.2855752193731f  - i6_ - in[10+1]*0.68404028665134f + in[14+1]*1.9696155060244f;
   tmp3_o = in[2+1]*0.68404028665134f - i6_ + in[10+1]*1.9696155060244f  - in[14+1]*1.2855752193731f;
   }

   /* Twiddle factors on odd indices
   // and
   // Butterflies on 9 point IDCT's
   // and
   // twiddle factors for 36 point IDCT
   */
   {
   float e, o;
   e = tmp0 + tmp0_; o = (tmp0o + tmp0_o)*0.501909918f; tmp[0] = (e + o)*(-0.500476342f*.5f);    tmp[17] = (e - o)*(-11.46279281f*.5f);
   e = tmp1 + tmp1_; o = (tmp1o + tmp1_o)*0.517638090f; tmp[1] = (e + o)*(-0.504314480f*.5f);    tmp[16] = (e - o)*(-3.830648788f*.5f);
   e = tmp2 + tmp2_; o = (tmp2o + tmp2_o)*0.551688959f; tmp[2] = (e + o)*(-0.512139757f*.5f);    tmp[15] = (e - o)*(-2.310113158f*.5f);
   e = tmp3 + tmp3_; o = (tmp3o + tmp3_o)*0.610387294f; tmp[3] = (e + o)*(-0.524264562f*.5f);    tmp[14] = (e - o)*(-1.662754762f*.5f);
                                                        tmp[4] = (tmp4 + tmp4o)*(-0.541196100f); tmp[13] = (tmp4 - tmp4o)*(-1.306562965f);
   e = tmp3 - tmp3_; o = (tmp3o - tmp3_o)*0.871723397f; tmp[5] = (e + o)*(-0.563690973f*.5f);    tmp[12] = (e - o)*(-1.082840285f*.5f);
   e = tmp2 - tmp2_; o = (tmp2o - tmp2_o)*1.183100792f; tmp[6] = (e + o)*(-0.592844523f*.5f);    tmp[11] = (e - o)*(-0.930579498f*.5f);
   e = tmp1 - tmp1_; o = (tmp1o - tmp1_o)*1.931851653f; tmp[7] = (e + o)*(-0.630236207f*.5f);    tmp[10] = (e - o)*(-0.821339815f*.5f);
   e = tmp0 - tmp0_; o = (tmp0o - tmp0_o)*5.736856623f; tmp[8] = (e + o)*(-0.678170852f*.5f);    tmp[9] =  (e - o)*(-0.740093616f*.5f);
   }
   }
        /* shift to modified IDCT */
        win_bt = win[win_type];

        if (sb&1) {
                res_p[0] =  -tmp[9]  * win_bt[0] + s_p[0];
                res_p[1] =-(-tmp[10] * win_bt[1] + s_p[1]);
                res_p[2] =  -tmp[11] * win_bt[2] + s_p[2];
                res_p[3] =-(-tmp[12] * win_bt[3] + s_p[3]);
                res_p[4] =  -tmp[13] * win_bt[4] + s_p[4];
                res_p[5] =-(-tmp[14] * win_bt[5] + s_p[5]);
                res_p[6] =  -tmp[15] * win_bt[6] + s_p[6];
                res_p[7] =-(-tmp[16] * win_bt[7] + s_p[7]);
                res_p[8] =  -tmp[17] * win_bt[8] + s_p[8];
           
                res_p[9] = -(tmp[17] * win_bt[9] + s_p[9]);
                res_p[10]=  tmp[16] * win_bt[10] + s_p[10];
                res_p[11]=-(tmp[15] * win_bt[11] + s_p[11]);
                res_p[12]=  tmp[14] * win_bt[12] + s_p[12];
                res_p[13]=-(tmp[13] * win_bt[13] + s_p[13]);
                res_p[14]=  tmp[12] * win_bt[14] + s_p[14];
                res_p[15]=-(tmp[11] * win_bt[15] + s_p[15]);
                res_p[16]=  tmp[10] * win_bt[16] + s_p[16];
                res_p[17]=-(tmp[9]  * win_bt[17] + s_p[17]);
        } else {
                res_p[0] = -tmp[9]  * win_bt[0] + s_p[0];
                res_p[1] = -tmp[10] * win_bt[1] + s_p[1];
                res_p[2] = -tmp[11] * win_bt[2] + s_p[2];
                res_p[3] = -tmp[12] * win_bt[3] + s_p[3];
                res_p[4] = -tmp[13] * win_bt[4] + s_p[4];
                res_p[5] = -tmp[14] * win_bt[5] + s_p[5];
                res_p[6] = -tmp[15] * win_bt[6] + s_p[6];
                res_p[7] = -tmp[16] * win_bt[7] + s_p[7];
                res_p[8] = -tmp[17] * win_bt[8] + s_p[8];
           
                res_p[9] = tmp[17] * win_bt[9] + s_p[9];
                res_p[10]= tmp[16] * win_bt[10] + s_p[10];
                res_p[11]= tmp[15] * win_bt[11] + s_p[11];
                res_p[12]= tmp[14] * win_bt[12] + s_p[12];
                res_p[13]= tmp[13] * win_bt[13] + s_p[13];
                res_p[14]= tmp[12] * win_bt[14] + s_p[14];
                res_p[15]= tmp[11] * win_bt[15] + s_p[15];
                res_p[16]= tmp[10] * win_bt[16] + s_p[16];
                res_p[17]= tmp[9]  * win_bt[17] + s_p[17];
        }

        s_p[0]= tmp[8]  * win_bt[18];
        s_p[1]= tmp[7]  * win_bt[19];
        s_p[2]= tmp[6]  * win_bt[20];
        s_p[3]= tmp[5]  * win_bt[21];
        s_p[4]= tmp[4]  * win_bt[22];
        s_p[5]= tmp[3]  * win_bt[23];
        s_p[6]= tmp[2]  * win_bt[24];
        s_p[7]= tmp[1]  * win_bt[25];
        s_p[8]= tmp[0]  * win_bt[26];

        s_p[9]= tmp[0]  * win_bt[27];
        s_p[10]= tmp[1]  * win_bt[28];
        s_p[11]= tmp[2]  * win_bt[29];
        s_p[12]= tmp[3]  * win_bt[30];
        s_p[13]= tmp[4]  * win_bt[31];
        s_p[14]= tmp[5]  * win_bt[32];
        s_p[15]= tmp[6]  * win_bt[33];
        s_p[16]= tmp[7]  * win_bt[34];
        s_p[17]= tmp[8]  * win_bt[35];
    }
}

/* fast DCT according to Lee[84]
 * reordering according to Konstantinides[94]
 */ 
void poly(const int ch,int f)
{
static float u[2][2][17][16]; /* no v[][], it's redundant */
static int u_start[2]={0,0}; /* first element of u[][] */
static int u_div[2]={0,0}; /* which part of u[][] is currently used */
int start = u_start[ch];
int div = u_div[ch];
float (*u_p)[16];

#if defined(PENTIUM_RDTSC)
unsigned int cnt4, cnt3, cnt2, cnt1;
static int min_cycles = 99999999;

        __asm__(".byte 0x0f,0x31" : "=a" (cnt1), "=d" (cnt4));
#endif

        {
        float d16,d17,d18,d19,d20,d21,d22,d23,d24,d25,d26,d27,d28,d29,d30,d31;
        float d0,d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12,d13,d14,d15;

        /* step 1: initial reordering and 1st (16 wide) butterflies
        */

        d0 = res[ 0][f]; d16=(d0  - res[31][f]) *  b1; d0 += res[31][f];
        d1 = res[ 1][f]; d17=(d1  - res[30][f]) *  b3; d1 += res[30][f];
        d3 = res[ 2][f]; d19=(d3  - res[29][f]) *  b5; d3 += res[29][f];
        d2 = res[ 3][f]; d18=(d2  - res[28][f]) *  b7; d2 += res[28][f];
        d6 = res[ 4][f]; d22=(d6  - res[27][f]) *  b9; d6 += res[27][f];
        d7 = res[ 5][f]; d23=(d7  - res[26][f]) * b11; d7 += res[26][f];
        d5 = res[ 6][f]; d21=(d5  - res[25][f]) * b13; d5 += res[25][f];
        d4 = res[ 7][f]; d20=(d4  - res[24][f]) * b15; d4 += res[24][f];
        d12= res[ 8][f]; d28=(d12 - res[23][f]) * b17; d12+= res[23][f];
        d13= res[ 9][f]; d29=(d13 - res[22][f]) * b19; d13+= res[22][f];
        d15= res[10][f]; d31=(d15 - res[21][f]) * b21; d15+= res[21][f];
        d14= res[11][f]; d30=(d14 - res[20][f]) * b23; d14+= res[20][f];
        d10= res[12][f]; d26=(d10 - res[19][f]) * b25; d10+= res[19][f];
        d11= res[13][f]; d27=(d11 - res[18][f]) * b27; d11+= res[18][f];
        d9 = res[14][f]; d25=(d9  - res[17][f]) * b29; d9 += res[17][f];
        d8 = res[15][f]; d24=(d8  - res[16][f]) * b31; d8 += res[16][f];

        {
        float c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15;

/* a test to see what can be done with memory separation
 * first we process indexes 0-15
*/
        c0 = d0 + d8 ; c8 = ( d0 - d8 ) *  b2;
        c1 = d1 + d9 ; c9 = ( d1 - d9 ) *  b6;
        c2 = d2 + d10; c10= ( d2 - d10) * b14;
        c3 = d3 + d11; c11= ( d3 - d11) * b10;
        c4 = d4 + d12; c12= ( d4 - d12) * b30;
        c5 = d5 + d13; c13= ( d5 - d13) * b26;
        c6 = d6 + d14; c14= ( d6 - d14) * b18;
        c7 = d7 + d15; c15= ( d7 - d15) * b22;
        
        /* step 3: 4-wide butterflies
        */
        d0 = c0 + c4 ; d4 = ( c0 - c4 ) *  b4;
        d1 = c1 + c5 ; d5 = ( c1 - c5 ) * b12;
        d2 = c2 + c6 ; d6 = ( c2 - c6 ) * b28;
        d3 = c3 + c7 ; d7 = ( c3 - c7 ) * b20;
        
        d8 = c8 + c12; d12= ( c8 - c12) *  b4;
        d9 = c9 + c13; d13= ( c9 - c13) * b12;
        d10= c10+ c14; d14= (c10 - c14) * b28;
        d11= c11+ c15; d15= (c11 - c15) * b20;


        /* step 4: 2-wide butterflies
        */
        {
        float rb8 = b8;
        float rb24 = b24;

/**/    c0 = d0 + d2 ; c2 = ( d0 - d2 ) *  rb8;
        c1 = d1 + d3 ; c3 = ( d1 - d3 ) * rb24;
/**/    c4 = d4 + d6 ; c6 = ( d4 - d6 ) *  rb8;
        c5 = d5 + d7 ; c7 = ( d5 - d7 ) * rb24;
/**/    c8 = d8 + d10; c10= ( d8 - d10) *  rb8;
        c9 = d9 + d11; c11= ( d9 - d11) * rb24;
/**/    c12= d12+ d14; c14= (d12 - d14) *  rb8;
        c13= d13+ d15; c15= (d13 - d15) * rb24;
        }

        /* step 5: 1-wide butterflies
        */
        {
        float rb16 = b16;

        /* this is a little 'hacked up'
        */
        d0 = (-c0 -c1) * 2; d1 = ( c0 - c1 ) * rb16; 
        d2 = c2 + c3; d3 = ( c2 - c3 ) * rb16; 
        d3 -= d2;

        d4 = c4 +c5; d5 = ( c4 - c5 ) * rb16;
        d5 += d4;
        d7 = -d5;
        d7 += ( c6 - c7 ) * rb16; d6 = +c6 +c7;

        d8 = c8 + c9 ; d9 = ( c8 - c9 ) * rb16;
        d11= +d8 +d9;
        d11 +=(c10 - c11) * rb16; d10= c10+ c11; 

        d12 = c12+ c13; d13 = (c12 - c13) * rb16;
        d13 += -d8-d9+d12;
        d14 = c14+ c15; d15 = (c14 - c15) * rb16;
        d15-=d11;
        d14 += -d8 -d10;
        }

        /* step 6: final resolving & reordering
         * the other 32 are stored for use with the next granule
         */

        u_p = (float (*)[16]) &u[ch][div][0][start];

/*16*/  u_p[ 0][0] =+d1 ;
        u_p[ 2][0] = +d9 -d14;
/*20*/  u_p[ 4][0] = +d5 -d6;
        u_p[ 6][0] = -d10 +d13;
/*24*/  u_p[ 8][0] =d3;
        u_p[10][0] = -d8 -d9 +d11 -d13;
/*28*/  u_p[12][0] = +d7;
        u_p[14][0] = +d15;

        /* the other 32 are stored for use with the next granule
         */

        u_p = (float (*)[16]) &u[ch][!div][0][start];

/*0*/   u_p[16][0] = d0;
        u_p[14][0] = -(+d8 );
/*4*/   u_p[12][0] = -(+d4 );
        u_p[10][0] = -(-d8 +d12 );
/*8*/   u_p[ 8][0] = -(+d2 );
        u_p[ 6][0] = -(+d8 +d10 -d12 );
/*12*/  u_p[ 4][0] = -(-d4 +d6 );
        u_p[ 2][0] = -d14;
        u_p[ 0][0] = -d1;
        }

        {
        float c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15;

/* memory separation, second part
*/
/* 2
*/
        c0=d16 + d24; c8= (d16 - d24) *  b2;
        c1=d17 + d25; c9= (d17 - d25) *  b6;
        c2=d18 + d26; c10= (d18 - d26) * b14;
        c3=d19 + d27; c11= (d19 - d27) * b10;
        c4=d20 + d28; c12= (d20 - d28) * b30;
        c5=d21 + d29; c13= (d21 - d29) * b26;
        c6=d22 + d30; c14= (d22 - d30) * b18;
        c7=d23 + d31; c15= (d23 - d31) * b22;

/* 3
*/
        d16= c0+ c4; d20= (c0 - c4) *  b4;
        d17= c1+ c5; d21= (c1 - c5) * b12;
        d18= c2+ c6; d22= (c2 - c6) * b28;
        d19= c3+ c7; d23= (c3 - c7) * b20;

        d24= c8+ c12; d28= (c8 - c12) *  b4;
        d25= c9+ c13; d29= (c9 - c13) * b12;
        d26= c10+ c14; d30= (c10 - c14) * b28;
        d27= c11+ c15; d31= (c11 - c15) * b20;

/* 4
*/
        {
        float rb8 = b8;
        float rb24 = b24;

/**/    c0= d16+ d18; c2= (d16 - d18) *  rb8;
        c1= d17+ d19; c3= (d17 - d19) * rb24;
/**/    c4= d20+ d22; c6= (d20 - d22) *  rb8;
        c5= d21+ d23; c7= (d21 - d23) * rb24;
/**/    c8= d24+ d26; c10= (d24 - d26) *  rb8;
        c9= d25+ d27; c11= (d25 - d27) * rb24;
/**/    c12= d28+ d30; c14= (d28 - d30) *  rb8;
        c13= d29+ d31; c15= (d29 - d31) * rb24;
        }

/* 5
*/
        {
        float rb16 = b16;
        d16= c0+ c1; d17= (c0 - c1) * rb16;
        d18= c2+ c3; d19= (c2 - c3) * rb16;

        d20= c4+ c5; d21= (c4 - c5) * rb16;
        d20+=d16; d21+=d17;
        d22= c6+ c7; d23= (c6 - c7) * rb16;
        d22+=d16; d22+=d18;
        d23+=d16; d23+=d17; d23+=d19;


        d24= c8+ c9; d25= (c8 - c9) * rb16;
        d26= c10+ c11; d27= (c10 - c11) * rb16;
        d26+=d24;
        d27+=d24; d27+=d25;

        d28= c12+ c13; d29= (c12 - c13) * rb16;
        d28-=d20; d29+=d28; d29-=d21;
        d30= c14+ c15; d31= (c14 - c15) * rb16;
        d30-=d22;
        d31-=d23;
        }

        /* step 6: final resolving & reordering 
         * the other 32 are stored for use with the next granule
         */
        
        u_p = (float (*)[16]) &u[ch][!div][0][start];

        u_p[ 1][0] = -(+d30 );  
        u_p[ 3][0] = -(+d22 -d26 );
        u_p[ 5][0] = -(-d18 -d20 +d26 );
        u_p[ 7][0] = -(+d18 -d28 );
        u_p[ 9][0] = -(+d28 );
        u_p[11][0] = -(+d20 -d24 );
        u_p[13][0] = -(-d16 +d24 );
        u_p[15][0] = -(+d16 );

        /* the other 32 are stored for use with the next granule
         */

        u_p = (float (*)[16]) &u[ch][div][0][start];

        u_p[15][0] = +d31;
        u_p[13][0] = +d23 -d27;
        u_p[11][0] = -d19 -d20 -d21 +d27;
        u_p[ 9][0] = +d19 -d29;
        u_p[ 7][0] = -d18 +d29;
        u_p[ 5][0] = +d18 +d20 +d21 -d25 -d26;
        u_p[ 3][0] = -d17 -d22 +d25 +d26;
        u_p[ 1][0] = +d17 -d30;
        }
        }

#if defined(PENTIUM_RDTSC)
        __asm__(".byte 0x0f,0x31" : "=a" (cnt3), "=d" (cnt4));
#endif

        /* we're doing dewindowing and calculating final samples now
         */

#if defined(ARCH_i586)
        /* x86 assembler optimisations.  These optimisations are tuned
           specifically for Intel Pentiums. */

            asm("movl $15,%%eax\n\t"\
        "1:\n\t"\
        "flds (%0)\n\t"\
        "fmuls (%1)\n\t"\
        "flds 4(%0)\n\t"\
        "fmuls 4(%1)\n\t"\
        "flds 8(%0)\n\t"\
        "fmuls 8(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 12(%0)\n\t"\
        "fmuls 12(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 16(%0)\n\t"\
        "fmuls 16(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 20(%0)\n\t"\
        "fmuls 20(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 24(%0)\n\t"\
        "fmuls 24(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 28(%0)\n\t"\
        "fmuls 28(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 32(%0)\n\t"\
        "fmuls 32(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 36(%0)\n\t"\
        "fmuls 36(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 40(%0)\n\t"\
        "fmuls 40(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 44(%0)\n\t"\
        "fmuls 44(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 48(%0)\n\t"\
        "fmuls 48(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 52(%0)\n\t"\
        "fmuls 52(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 56(%0)\n\t"\
        "fmuls 56(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 60(%0)\n\t"\
        "fmuls 60(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "addl $64,%0\n\t"\
        "addl $128,%1\n\t"\
        "subl $4,%%esp\n\t"\
        "faddp\n\t"\
        "fistpl (%%esp)\n\t"\
        "popl %%ecx\n\t"\
        "cmpl $32767,%%ecx\n\t"\
        "jle 2f\n\t"\
        "movw $32767,%%cx\n\t"\
        "jmp 3f\n\t"\
        "2: cmpl $-32768,%%ecx\n\t"\
        "jge 3f\n\t"\
        "movw $-32768,%%cx\n\t"\
        "3: movw %%cx,(%2)\n\t"\
        "addl %3,%2\n\t"\
        "decl %%eax\n\t"\
        "jns 1b\n\t"\

        "testb $1,%4\n\t"\
        "je 4f\n\t"

        "flds (%0)\n\t"\
        "fmuls (%1)\n\t"\
        "flds 8(%0)\n\t"\
        "fmuls 8(%1)\n\t"\
        "flds 16(%0)\n\t"\
        "fmuls 16(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 24(%0)\n\t"\
        "fmuls 24(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 32(%0)\n\t"\
        "fmuls 32(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 40(%0)\n\t"\
        "fmuls 40(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 48(%0)\n\t"\
        "fmuls 48(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 56(%0)\n\t"\
        "fmuls 56(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "subl $4,%%esp\n\t"\
        "subl $64,%0\n\t"\
        "subl $192,%1\n\t"\
        "faddp\n\t"\
        "fistpl (%%esp)\n\t"\
        "popl %%ecx\n\t"\
        "cmpl $32767,%%ecx\n\t"\
        "jle 2f\n\t"\
        "movw $32767,%%cx\n\t"\
        "jmp 3f\n\t"\
        "2: cmpl $-32768,%%ecx\n\t"\
        "jge 3f\n\t"\
        "movw $-32768,%%cx\n\t"\
        "3: movw %%cx,(%2)\n\t"\

        "movl %5,%%ecx\n\t"\
        "sall $3,%%ecx\n\t"\
        "addl %%ecx,%1\n\t"\
        "addl %3,%2\n\t"\
        "movl $14,%%eax\n\t"\

        "1:flds 4(%0)\n\t"\
        "fmuls 56(%1)\n\t"\
        "flds (%0)\n\t"\
        "fmuls 60(%1)\n\t"\
        "flds 12(%0)\n\t"\
        "fmuls 48(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "fsubp\n\t"\
        "flds 8(%0)\n\t"\
        "fmuls 52(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 20(%0)\n\t"\
        "fmuls 40(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "fsubrp\n\t"\
        "flds 16(%0)\n\t"\
        "fmuls 44(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 28(%0)\n\t"\
        "fmuls 32(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "fsubrp\n\t"\
        "flds 24(%0)\n\t"\
        "fmuls 36(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 36(%0)\n\t"\
        "fmuls 24(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "fsubrp\n\t"\
        "flds 32(%0)\n\t"\
        "fmuls 28(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 44(%0)\n\t"\
        "fmuls 16(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "fsubrp\n\t"\
        "flds 40(%0)\n\t"\
        "fmuls 20(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 52(%0)\n\t"\
        "fmuls 8(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "fsubrp\n\t"\
        "flds 48(%0)\n\t"\
        "fmuls 12(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 60(%0)\n\t"\
        "fmuls (%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "fsubrp\n\t"\
        "flds 56(%0)\n\t"\
        "fmuls 4(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "subl $64,%0\n\t"\
        "subl $128,%1\n\t"\
        "subl $4,%%esp\n\t"\
        "fsubp\n\t"\
        "fistpl (%%esp)\n\t"\
        "popl %%ecx\n\t"\
        "cmpl $32767,%%ecx\n\t"\
        "jle 2f\n\t"\
        "movw $32767,%%cx\n\t"\
        "jmp 3f\n\t"\
        "2: cmpl $-32768,%%ecx\n\t"\
        "jge 3f\n\t"\
        "movw $-32768,%%cx\n\t"\
        "3: movw %%cx,(%2)\n\t"\
        "addl %3,%2\n\t"\
        "decl %%eax\n\t"\
        "jns 1b\n\t"\
        "jmp 5f\n\t"\

        "4:flds 4(%0)\n\t"\
        "fmuls 4(%1)\n\t"\
        "flds 12(%0)\n\t"\
        "fmuls 12(%1)\n\t"\
        "flds 20(%0)\n\t"\
        "fmuls 20(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 28(%0)\n\t"\
        "fmuls 28(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 36(%0)\n\t"\
        "fmuls 36(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 44(%0)\n\t"\
        "fmuls 44(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 52(%0)\n\t"\
        "fmuls 52(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 60(%0)\n\t"\
        "fmuls 60(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "subl $4,%%esp\n\t"\
        "subl $64,%0\n\t"\
        "subl $192,%1\n\t"\
        "faddp\n\t"\
        "fistpl (%%esp)\n\t"\
        "popl %%ecx\n\t"\
        "cmpl $32767,%%ecx\n\t"\
        "jle 2f\n\t"\
        "movw $32767,%%cx\n\t"\
        "jmp 3f\n\t"\
        "2: cmpl $-32768,%%ecx\n\t"\
        "jge 3f\n\t"\
        "movw $-32768,%%cx\n\t"\
        "3: movw %%cx,(%2)\n\t"\

        "movl %5,%%ecx\n\t"\
        "sall $3,%%ecx\n\t"\
        "addl %%ecx,%1\n\t"\
        "addl %3,%2\n\t"\

        "movl $14,%%eax\n\t"\
        "1:flds (%0)\n\t"\
        "fmuls 60(%1)\n\t"\
        "flds 4(%0)\n\t"\
        "fmuls 56(%1)\n\t"\
        "flds 8(%0)\n\t"\
        "fmuls 52(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "fsubp\n\t"\
        "flds 12(%0)\n\t"\
        "fmuls 48(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 16(%0)\n\t"\
        "fmuls 44(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "fsubrp\n\t"\
        "flds 20(%0)\n\t"\
        "fmuls 40(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 24(%0)\n\t"\
        "fmuls 36(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "fsubrp\n\t"\
        "flds 28(%0)\n\t"\
        "fmuls 32(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 32(%0)\n\t"\
        "fmuls 28(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "fsubrp\n\t"\
        "flds 36(%0)\n\t"\
        "fmuls 24(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 40(%0)\n\t"\
        "fmuls 20(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "fsubrp\n\t"\
        "flds 44(%0)\n\t"\
        "fmuls 16(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 48(%0)\n\t"\
        "fmuls 12(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "fsubrp\n\t"\
        "flds 52(%0)\n\t"\
        "fmuls 8(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "flds 56(%0)\n\t"\
        "fmuls 4(%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "fsubrp\n\t"\
        "flds 60(%0)\n\t"\
        "fmuls (%1)\n\t"\
        "fxch %%st(2)\n\t"\
        "faddp\n\t"\
        "subl $64,%0\n\t"\
        "subl $128,%1\n\t"\
        "subl $4,%%esp\n\t"\
        "fsubp\n\t"\
        "fistpl (%%esp)\n\t"\
        "popl %%ecx\n\t"\
        "cmpl $32767,%%ecx\n\t"\
        "jle 2f\n\t"\
        "movw $32767,%%cx\n\t"\
        "jmp 3f\n\t"\
        "2: cmpl $-32768,%%ecx\n\t"\
        "jge 3f\n\t"\
        "movw $-32768,%%cx\n\t"\
        "3: movw %%cx,(%2)\n\t"\
        "addl %3,%2\n\t"\
        "decl %%eax\n\t"\
        "jns 1b\n\t"\

        "5:"\
            : : "b" (u[ch][div]), "d" (t_dewindow[0] + 16 - start), "S" (&sample_buffer[f>>(2-nch)][nch==2?0:(f&1?16:0)][ch]), "m" (sizeof(short) * nch), "m" (div), "m" (start)\
            : "eax", "ecx", "memory");
#else
        {
          short *samples = (&sample_buffer[f>>(2-nch)][nch==2?0:(f&1?16:0)][ch]);
          int out, j;

#define PUT_SAMPLE(out)                 \
                if (out > 32767)        \
                  *samples = 32767;     \
                else                    \
                  if (out < -32768)     \
                    *samples = -32768;  \
                  else                  \
                    *samples = out;     \
                                        \
                samples += nch;

#if defined(SUPERHACK)
          /* This is a simple implementation which should be nicer to the
             cache; computation of samples is done in one pass rather than
             two.  However, for various reasons which I do not have time to
             investigate, it runs quite a lot slower than the two-pass
             computation.  If you have time, you are welcome to look into
             it. */

          {
            float (*u_ptr)[16] = u[ch][div];
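            const float *dewindow2 = t_dewindow[0] + start;
            /* dewindow is not declared in this branch; the offset below is
               an assumption copied from the default (#else) branch. */
            const float *dewindow = t_dewindow[0] + 16 - start;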

            {
              float outf1, outf2, outf3, outf4;

              outf1  = u_ptr[0][ 0] * dewindow[0x0];
              outf2  = u_ptr[0][ 1] * dewindow[0x1];
              outf3  = u_ptr[0][ 2] * dewindow[0x2];
              outf4  = u_ptr[0][ 3] * dewindow[0x3];
              outf1 += u_ptr[0][ 4] * dewindow[0x4];
              outf2 += u_ptr[0][ 5] * dewindow[0x5];
              outf3 += u_ptr[0][ 6] * dewindow[0x6];
              outf4 += u_ptr[0][ 7] * dewindow[0x7];
              outf1 += u_ptr[0][ 8] * dewindow[0x8];
              outf2 += u_ptr[0][ 9] * dewindow[0x9];
              outf3 += u_ptr[0][10] * dewindow[0xa];
              outf4 += u_ptr[0][11] * dewindow[0xb];
              outf1 += u_ptr[0][12] * dewindow[0xc];
              outf2 += u_ptr[0][13] * dewindow[0xd];
              outf3 += u_ptr[0][14] * dewindow[0xe];
              outf4 += u_ptr[0][15] * dewindow[0xf];

              out = outf1 + outf2 + outf3 + outf4;

              dewindow += 32;
              dewindow2 += 32;
              u_ptr++;

              if (out > 32767)
                samples[0] = 32767;
              else
                if (out < -32768)
                  samples[0] = -32768;
                else
                  samples[0] = out;
            }

            if (div & 0x1) {
              for (j = 1; j < 16; ++j) {
                float outf1, outf2, outf3, outf4;

                outf1  = u_ptr[0][ 0] * dewindow[0x0];
                outf3  = u_ptr[0][ 0] * dewindow2[0xf];
                outf2  = u_ptr[0][ 1] * dewindow[0x1];
                outf4  = u_ptr[0][ 1] * dewindow2[0xe];
                outf1 += u_ptr[0][ 2] * dewindow[0x2];
                outf3 += u_ptr[0][ 2] * dewindow2[0xd];
                outf2 += u_ptr[0][ 3] * dewindow[0x3];
                outf4 += u_ptr[0][ 3] * dewindow2[0xc];
                outf1 += u_ptr[0][ 4] * dewindow[0x4];
                outf3 += u_ptr[0][ 4] * dewindow2[0xb];
                outf2 += u_ptr[0][ 5] * dewindow[0x5];
                outf4 += u_ptr[0][ 5] * dewindow2[0xa];
                outf1 += u_ptr[0][ 6] * dewindow[0x6];
                outf3 += u_ptr[0][ 6] * dewindow2[0x9];
                outf2 += u_ptr[0][ 7] * dewindow[0x7];
                outf4 += u_ptr[0][ 7] * dewindow2[0x8];
                outf1 += u_ptr[0][ 8] * dewindow[0x8];
                outf3 += u_ptr[0][ 8] * dewindow2[0x7];
                outf2 += u_ptr[0][ 9] * dewindow[0x9];
                outf4 += u_ptr[0][ 9] * dewindow2[0x6];
                outf1 += u_ptr[0][10] * dewindow[0xa];
                outf3 += u_ptr[0][10] * dewindow2[0x5];
                outf2 += u_ptr[0][11] * dewindow[0xb];
                outf4 += u_ptr[0][11] * dewindow2[0x4];
                outf1 += u_ptr[0][12] * dewindow[0xc];
                outf3 += u_ptr[0][12] * dewindow2[0x3];
                outf2 += u_ptr[0][13] * dewindow[0xd];
                outf4 += u_ptr[0][13] * dewindow2[0x2];
                outf1 += u_ptr[0][14] * dewindow[0xe];
                outf3 += u_ptr[0][14] * dewindow2[0x1];
                outf2 += u_ptr[0][15] * dewindow[0xf];
                outf4 += u_ptr[0][15] * dewindow2[0x0];

                dewindow += 32;
                dewindow2 += 32;
                u_ptr++;

                out = outf1 + outf2;

                if (out > 32767)
                  samples[j * 2] = 32767;
                else
                  if (out < -32768)
                    samples[j * 2] = -32768;
                  else
                    samples[j * 2] = out;

                out = outf4 - outf3;

                if (out > 32767)
                  samples[64 - (j * 2)] = 32767;
                else
                  if (out < -32768)
                    samples[64 - (j * 2)] = -32768;
                  else
                    samples[64 - (j * 2)] = out;
              }

              {
                float outf2, outf4;

                outf2  = u_ptr[0][ 0] * dewindow[0x0];
                outf4  = u_ptr[0][ 2] * dewindow[0x2];
                outf2 += u_ptr[0][ 4] * dewindow[0x4];
                outf4 += u_ptr[0][ 6] * dewindow[0x6];
                outf2 += u_ptr[0][ 8] * dewindow[0x8];
                outf4 += u_ptr[0][10] * dewindow[0xa];
                outf2 += u_ptr[0][12] * dewindow[0xc];
                outf4 += u_ptr[0][14] * dewindow[0xe];

                out = outf2 + outf4;

                if (out > 32767)
                  samples[16 * 2] = 32767;
                else
                  if (out < -32768)
                    samples[16 * 2] = -32768;
                  else
                    samples[16 * 2] = out;
              }
            } else {
              for (j = 1; j < 16; ++j) {
                float outf1, outf2, outf3, outf4;

                outf1  = u_ptr[0][ 0] * dewindow[0x0];
                outf3  = u_ptr[0][ 0] * dewindow2[0xf];
                outf2  = u_ptr[0][ 1] * dewindow[0x1];
                outf4  = u_ptr[0][ 1] * dewindow2[0xe];
                outf1 += u_ptr[0][ 2] * dewindow[0x2];
                outf3 += u_ptr[0][ 2] * dewindow2[0xd];
                outf2 += u_ptr[0][ 3] * dewindow[0x3];
                outf4 += u_ptr[0][ 3] * dewindow2[0xc];
                outf1 += u_ptr[0][ 4] * dewindow[0x4];
                outf3 += u_ptr[0][ 4] * dewindow2[0xb];
                outf2 += u_ptr[0][ 5] * dewindow[0x5];
                outf4 += u_ptr[0][ 5] * dewindow2[0xa];
                outf1 += u_ptr[0][ 6] * dewindow[0x6];
                outf3 += u_ptr[0][ 6] * dewindow2[0x9];
                outf2 += u_ptr[0][ 7] * dewindow[0x7];
                outf4 += u_ptr[0][ 7] * dewindow2[0x8];
                outf1 += u_ptr[0][ 8] * dewindow[0x8];
                outf3 += u_ptr[0][ 8] * dewindow2[0x7];
                outf2 += u_ptr[0][ 9] * dewindow[0x9];
                outf4 += u_ptr[0][ 9] * dewindow2[0x6];
                outf1 += u_ptr[0][10] * dewindow[0xa];
                outf3 += u_ptr[0][10] * dewindow2[0x5];
                outf2 += u_ptr[0][11] * dewindow[0xb];
                outf4 += u_ptr[0][11] * dewindow2[0x4];
                outf1 += u_ptr[0][12] * dewindow[0xc];
                outf3 += u_ptr[0][12] * dewindow2[0x3];
                outf2 += u_ptr[0][13] * dewindow[0xd];
                outf4 += u_ptr[0][13] * dewindow2[0x2];
                outf1 += u_ptr[0][14] * dewindow[0xe];
                outf3 += u_ptr[0][14] * dewindow2[0x1];
                outf2 += u_ptr[0][15] * dewindow[0xf];
                outf4 += u_ptr[0][15] * dewindow2[0x0];

                dewindow += 32;
                dewindow2 += 32;
                u_ptr++;

                out = outf1 + outf2;

                if (out > 32767)
                  samples[j * 2] = 32767;
                else
                  if (out < -32768)
                    samples[j * 2] = -32768;
                  else
                    samples[j * 2] = out;

                out = outf3 - outf4;

                if (out > 32767)
                  samples[64 - (j * 2)] = 32767;
                else
                  if (out < -32768)
                    samples[64 - (j * 2)] = -32768;
                  else
                    samples[64 - (j * 2)] = out;
              }

              {
                float outf2, outf4;

                outf2  = u_ptr[0][ 1] * dewindow[0x1];
                outf4  = u_ptr[0][ 3] * dewindow[0x3];
                outf2 += u_ptr[0][ 5] * dewindow[0x5];
                outf4 += u_ptr[0][ 7] * dewindow[0x7];
                outf2 += u_ptr[0][ 9] * dewindow[0x9];
                outf4 += u_ptr[0][11] * dewindow[0xb];
                outf2 += u_ptr[0][13] * dewindow[0xd];
                outf4 += u_ptr[0][15] * dewindow[0xf];

                out = outf2 + outf4;

                if (out > 32767)
                  samples[16 * 2] = 32767;
                else
                  if (out < -32768)
                    samples[16 * 2] = -32768;
                  else
                    samples[16 * 2] = out;
              }
            }
          }
#elif defined(HAS_AUTOINCREMENT)
          const float *dewindow = t_dewindow[0] + 15 - start;

          /* This is tuned specifically for architectures with
             autoincrement and -decrement. */

          {
            float *u_ptr = (float*) u[ch][div];

            u_ptr--;

            for (j = 0; j < 16; ++j) {
              float outf1, outf2, outf3, outf4;

              outf1  = *++u_ptr * *++dewindow;
              outf2  = *++u_ptr * *++dewindow;
              outf3  = *++u_ptr * *++dewindow;
              outf4  = *++u_ptr * *++dewindow;
              outf1 += *++u_ptr * *++dewindow;
              outf2 += *++u_ptr * *++dewindow;
              outf3 += *++u_ptr * *++dewindow;
              outf4 += *++u_ptr * *++dewindow;
              outf1 += *++u_ptr * *++dewindow;
              outf2 += *++u_ptr * *++dewindow;
              outf3 += *++u_ptr * *++dewindow;
              outf4 += *++u_ptr * *++dewindow;
              outf1 += *++u_ptr * *++dewindow;
              outf2 += *++u_ptr * *++dewindow;
              outf3 += *++u_ptr * *++dewindow;
              outf4 += *++u_ptr * *++dewindow;

              out = outf1 + outf2 + outf3 + outf4;

              dewindow += 16;

              PUT_SAMPLE(out)
            }

            if (div & 0x1) {
              {
                float outf2, outf4;

                outf2  = u_ptr[ 1] * dewindow[0x1];
                outf4  = u_ptr[ 3] * dewindow[0x3];
                outf2 += u_ptr[ 5] * dewindow[0x5];
                outf4 += u_ptr[ 7] * dewindow[0x7];
                outf2 += u_ptr[ 9] * dewindow[0x9];
                outf4 += u_ptr[11] * dewindow[0xb];
                outf2 += u_ptr[13] * dewindow[0xd];
                outf4 += u_ptr[15] * dewindow[0xf];

                out = outf2 + outf4;

                PUT_SAMPLE(out)
              }

              dewindow -= 31;
              dewindow += start;
              dewindow += start;
              u_ptr -= 16;

              for (; j < 31; ++j) {
                float outf1, outf2, outf3, outf4;

                outf1  = *++u_ptr * *--dewindow;
                outf2  = *++u_ptr * *--dewindow;
                outf3  = *++u_ptr * *--dewindow;
                outf4  = *++u_ptr * *--dewindow;
                outf1 += *++u_ptr * *--dewindow;
                outf2 += *++u_ptr * *--dewindow;
                outf3 += *++u_ptr * *--dewindow;
                outf4 += *++u_ptr * *--dewindow;
                outf1 += *++u_ptr * *--dewindow;
                outf2 += *++u_ptr * *--dewindow;
                outf3 += *++u_ptr * *--dewindow;
                outf4 += *++u_ptr * *--dewindow;
                outf1 += *++u_ptr * *--dewindow;
                outf2 += *++u_ptr * *--dewindow;
                outf3 += *++u_ptr * *--dewindow;
                outf4 += *++u_ptr * *--dewindow;

                out = outf2 - outf1 + outf4 - outf3;

                dewindow -= 16;
                u_ptr -= 32;

                PUT_SAMPLE(out)
              }
            } else {
              {
                float outf2, outf4;

                outf2  = u_ptr[ 2] * dewindow[ 0x2];
                outf4  = u_ptr[ 4] * dewindow[ 0x4];
                outf2 += u_ptr[ 6] * dewindow[ 0x6];
                outf4 += u_ptr[ 8] * dewindow[ 0x8];
                outf2 += u_ptr[10] * dewindow[ 0xa];
                outf4 += u_ptr[12] * dewindow[ 0xc];
                outf2 += u_ptr[14] * dewindow[ 0xe];
                outf4 += u_ptr[16] * dewindow[0x10];

                out = outf2 + outf4;

                PUT_SAMPLE(out)
              }

              dewindow -= 31;
              dewindow += start;
              dewindow += start;
              u_ptr -= 16;

              for (; j < 31; ++j) {
                float outf1, outf2, outf3, outf4;

                outf1  = *++u_ptr * *--dewindow;
                outf2  = *++u_ptr * *--dewindow;
                outf3  = *++u_ptr * *--dewindow;
                outf4  = *++u_ptr * *--dewindow;
                outf1 += *++u_ptr * *--dewindow;
                outf2 += *++u_ptr * *--dewindow;
                outf3 += *++u_ptr * *--dewindow;
                outf4 += *++u_ptr * *--dewindow;
                outf1 += *++u_ptr * *--dewindow;
                outf2 += *++u_ptr * *--dewindow;
                outf3 += *++u_ptr * *--dewindow;
                outf4 += *++u_ptr * *--dewindow;
                outf1 += *++u_ptr * *--dewindow;
                outf2 += *++u_ptr * *--dewindow;
                outf3 += *++u_ptr * *--dewindow;
                outf4 += *++u_ptr * *--dewindow;

                out = outf1 - outf2 + outf3 - outf4;

                dewindow -= 16;
                u_ptr -= 32;

                PUT_SAMPLE(out)
              }
            }
          }
#else
          const float *dewindow = t_dewindow[0] + 16 - start;

          /* These optimisations are tuned specifically for architectures
             without autoincrement and -decrement. */

          {
            float (*u_ptr)[16] = u[ch][div];

            for (j = 0; j < 16; ++j) {
              float outf1, outf2, outf3, outf4;

              outf1  = u_ptr[0][ 0] * dewindow[0x0];
              outf2  = u_ptr[0][ 1] * dewindow[0x1];
              outf3  = u_ptr[0][ 2] * dewindow[0x2];
              outf4  = u_ptr[0][ 3] * dewindow[0x3];
              outf1 += u_ptr[0][ 4] * dewindow[0x4];
              outf2 += u_ptr[0][ 5] * dewindow[0x5];
              outf3 += u_ptr[0][ 6] * dewindow[0x6];
              outf4 += u_ptr[0][ 7] * dewindow[0x7];
              outf1 += u_ptr[0][ 8] * dewindow[0x8];
              outf2 += u_ptr[0][ 9] * dewindow[0x9];
              outf3 += u_ptr[0][10] * dewindow[0xa];
              outf4 += u_ptr[0][11] * dewindow[0xb];
              outf1 += u_ptr[0][12] * dewindow[0xc];
              outf2 += u_ptr[0][13] * dewindow[0xd];
              outf3 += u_ptr[0][14] * dewindow[0xe];
              outf4 += u_ptr[0][15] * dewindow[0xf];

              out = outf1 + outf2 + outf3 + outf4;

              dewindow += 32;
              u_ptr++;

              PUT_SAMPLE(out)
            }

            if (div & 0x1) {
              {
                float outf2, outf4;

                outf2  = u_ptr[0][ 0] * dewindow[0x0];
                outf4  = u_ptr[0][ 2] * dewindow[0x2];
                outf2 += u_ptr[0][ 4] * dewindow[0x4];
                outf4 += u_ptr[0][ 6] * dewindow[0x6];
                outf2 += u_ptr[0][ 8] * dewindow[0x8];
                outf4 += u_ptr[0][10] * dewindow[0xa];
                outf2 += u_ptr[0][12] * dewindow[0xc];
                outf4 += u_ptr[0][14] * dewindow[0xe];

                out = outf2 + outf4;

                PUT_SAMPLE(out)
              }

              dewindow -= 48;
              dewindow += start;
              dewindow += start;

              for (; j < 31; ++j) {
                float outf1, outf2, outf3, outf4;

                --u_ptr;

                outf1  = u_ptr[0][ 0] * dewindow[0xf];
                outf2  = u_ptr[0][ 1] * dewindow[0xe];
                outf3  = u_ptr[0][ 2] * dewindow[0xd];
                outf4  = u_ptr[0][ 3] * dewindow[0xc];
                outf1 += u_ptr[0][ 4] * dewindow[0xb];
                outf2 += u_ptr[0][ 5] * dewindow[0xa];
                outf3 += u_ptr[0][ 6] * dewindow[0x9];
                outf4 += u_ptr[0][ 7] * dewindow[0x8];
                outf1 += u_ptr[0][ 8] * dewindow[0x7];
                outf2 += u_ptr[0][ 9] * dewindow[0x6];
                outf3 += u_ptr[0][10] * dewindow[0x5];
                outf4 += u_ptr[0][11] * dewindow[0x4];
                outf1 += u_ptr[0][12] * dewindow[0x3];
                outf2 += u_ptr[0][13] * dewindow[0x2];
                outf3 += u_ptr[0][14] * dewindow[0x1];
                outf4 += u_ptr[0][15] * dewindow[0x0];

                out = -outf1 + outf2 - outf3 + outf4;

                dewindow -= 32;

                PUT_SAMPLE(out)
              }
            } else {
              {
                float outf2, outf4;

                outf2  = u_ptr[0][ 1] * dewindow[0x1];
                outf4  = u_ptr[0][ 3] * dewindow[0x3];
                outf2 += u_ptr[0][ 5] * dewindow[0x5];
                outf4 += u_ptr[0][ 7] * dewindow[0x7];
                outf2 += u_ptr[0][ 9] * dewindow[0x9];
                outf4 += u_ptr[0][11] * dewindow[0xb];
                outf2 += u_ptr[0][13] * dewindow[0xd];
                outf4 += u_ptr[0][15] * dewindow[0xf];

                out = outf2 + outf4;

                PUT_SAMPLE(out)
              }

              dewindow -= 48;
              dewindow += start;
              dewindow += start;

              for (; j < 31; ++j) {
                float outf1, outf2, outf3, outf4;

                --u_ptr;

                outf1  = u_ptr[0][ 0] * dewindow[0xf];
                outf2  = u_ptr[0][ 1] * dewindow[0xe];
                outf3  = u_ptr[0][ 2] * dewindow[0xd];
                outf4  = u_ptr[0][ 3] * dewindow[0xc];
                outf1 += u_ptr[0][ 4] * dewindow[0xb];
                outf2 += u_ptr[0][ 5] * dewindow[0xa];
                outf3 += u_ptr[0][ 6] * dewindow[0x9];
                outf4 += u_ptr[0][ 7] * dewindow[0x8];
                outf1 += u_ptr[0][ 8] * dewindow[0x7];
                outf2 += u_ptr[0][ 9] * dewindow[0x6];
                outf3 += u_ptr[0][10] * dewindow[0x5];
                outf4 += u_ptr[0][11] * dewindow[0x4];
                outf1 += u_ptr[0][12] * dewindow[0x3];
                outf2 += u_ptr[0][13] * dewindow[0x2];
                outf3 += u_ptr[0][14] * dewindow[0x1];
                outf4 += u_ptr[0][15] * dewindow[0x0];

                out = outf1 - outf2 + outf3 - outf4;

                dewindow -= 32;

                PUT_SAMPLE(out)
              }
            }
          }
#endif                                                                  
        }
#endif

        --u_start[ch];
        u_start[ch] &= 0xf;
        u_div[ch]=u_div[ch] ? 0 : 1;

#if defined(PENTIUM_RDTSC)
        __asm__(".byte 0x0f,0x31" : "=a" (cnt2), "=d" (cnt4));
                        
        if (cnt2-cnt1 < min_cycles) {
          min_cycles = cnt2-cnt1;
          printf("%d, %d cycles, %d\n", cnt3-cnt1, min_cycles, start);
        }
#endif
}
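
Every dewindowing branch above repeats two ideas: a 16-tap multiply-accumulate spread over four partial sums (outf1..outf4), and saturation of the result to the signed 16-bit range before it is stored. A minimal sketch of both in plain C follows. The names clip16 and dot16 are made up for illustration and do not appear in the decoder.

```c
/* Hypothetical helper: saturate an int to the signed 16-bit range,
   the same test that PUT_SAMPLE performs inline. */
static short clip16(int out)
{
    if (out > 32767)
        return 32767;
    if (out < -32768)
        return -32768;
    return (short) out;
}

/* Hypothetical helper: 16-tap dot product kept in four independent
   partial sums, in the manner of outf1..outf4 above. */
static float dot16(const float *u, const float *w)
{
    float s1 = 0.0f, s2 = 0.0f, s3 = 0.0f, s4 = 0.0f;
    int i;

    for (i = 0; i < 16; i += 4) {
        s1 += u[i]     * w[i];
        s2 += u[i + 1] * w[i + 1];
        s3 += u[i + 2] * w[i + 2];
        s4 += u[i + 3] * w[i + 3];
    }
    return s1 + s2 + s3 + s4;
}
```

A caller would write something like samples[0] = clip16((int) dot16(row, window)). Note that the C paths convert float to int by plain assignment, which truncates toward zero, while the assembler path uses fistpl, which follows the current FPU rounding mode, so the two may differ by one in the last bit.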

void premultiply()
{
  int i,t;

  for (i = 0; i < 17; ++i)
    for (t = 0; t < 32; ++t)
      t_dewindow[i][t] *= 16383.5f;
}
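
premultiply() folds the output gain (16383.5f) into the window table once, presumably so that the per-sample loops need no extra multiply to reach the 16-bit sample range. Because the dot product is linear, scaling every coefficient by a gain gives the same result as scaling the finished sum, up to rounding. The sketch below shows this with a 4-entry table and a gain of 2 so the arithmetic is exact. scale_table and dot are hypothetical names, not part of the decoder.

```c
/* Hypothetical stand-in for premultiply(): scale a small table in place. */
static void scale_table(float *table, int n, float gain)
{
    int i;

    for (i = 0; i < n; ++i)
        table[i] *= gain;
}

/* Plain dot product over n entries. */
static float dot(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    int i;

    for (i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Computing dot(x, w, n) * gain before the call to scale_table and dot(x, w, n) after it should give the same value for inputs like these. With a gain such as 16383.5f the two can differ in the last bits because the rounding happens at different points.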


Here's Justin's mp2.5 patch for anyone who cares to try it out.
Please report success or failure if you try it.

Tomislav

begin 755 mp25pat.gz
M'XL("%  *30" S N-RXW+6UP96,20>/?NP[F,
M--+T]/1[>GKD^<,A5&=0K49L,(MB_PO#=L!NJT-_S*!6.W GTZI6LVOV@3OS
M_+ V@%K:*E:KU;4@A=,P@ LV!=T"76];C;:!C5;++BJ*LAA?^,@\#F;40:NW
M]7J[;@NPER^AVFBI-BC\^O)E$0J%27Q3+MW#G@%:60'NZ5JG ?;%:*,1L$ 9>W)$]^^41OMY24Q^,$M8W)'S'S0T_CHE1]YA#V0'%^NG/:^]IQG>=4F5XU4)B (F;1A-
M4]4U4.C>$ J);_UD,((RBR(2+J(:N#&#G[N7;X[/G>[Y>1L\GY5+V!]&$"'9
M2"U,INP&^GX2)_AF4@/VU4_P?>TJ*'%=TE\?NSX?DKJR&-]=0!MNW2@HEY*1
M'P/^N\#-U _@]'WW9S!J%@S#:.(F*MR.?"0.88(P 8\-_8!Y8HXJ32$1]>^@
M=W%V0*-KT$L(_JKD0CQE ]\=PTGDSH)1.&21Q'M5JJTB046 %S(Q53R;3L,H
M 4ZB&%2#.(RBN^S(E,4EH9V\U5,64[)@[-[A]'K*BL2/GH.TT\SE.Y94UHEO
M&??)7'QH,4S2ME %2C-B^2GRA),9F%J#FX&I&70G.[B?.]_1A^/>F?/^[=&O
M%0":GK_F+U]].#GIGCL7O7]WT?@K7 @\"IQ-65#^$5]3R=>VQ[6#-ZE2G'/'
MG80>>]8Q>6@YPE!@!!\/W#$;
MN@-G_,FXYO\&1?%U$+&$T,WK3R8"53- B=-O"A2F1+'2&\M>G<8J:\=FABMK
MAV$>)0MX"1H?O(CYY\?>MU+5283K]]1\^3L_/3HTOEX]$L7GXX_
MG+YW7O7>'9W_>K@T]N-Y[[+K7)Z)8*AF7ISTWG8%Z[K>4%MHB;JMZG7)_3+;
M"WZ_S2AT**^Y!\/0+ U4M'Q-H[O>P/N#2EWUNLY?U9NBRS2HJZ@\*AF.6RG<
MP\'^8GWE\(QT'9E@X+>%6H2DF5Z@''W&RPH.:Q:,=Z#5M
M"9< W\0([]S$"F6 #UR1.,7/'WJ J4L2A6-T9PR5A)O+W6JH!KJBWC#4NC [
M!$]&#'TX1D\;]FF1O<&0A=HDJ0S\9'P'&'JF-S6C68-PR*'CQ T\-Z+E43IG
M$D8Q3@[[M*HB#*ZK8<#X@HKL3S%"NGUPQW'(@9!3G/<2,6%T@,"6&)"(]A'H!*Z1US&L3\8E;D9,SB'#X#\KEMA4LB)2'$
M2R0'9T?I^5%Z18B&J#&0-HM,@-1/+ZOK@ESGGEL4)PA>U9I]GLKUNJ_1&'7;
MJ)I\Y+VEZKJ*KF:8JM%234NMFZJ%2:REVBVU1;WX,UNJCA$)UPJ$0UC,$;F!4(:-%!(T4OL(LI3E)G9I^,-8PC%Q^2$@DF6U
M)1WVG&@NSG116,D(.GPAN$=Q9]:"0XQA&*Z$C#'SZ6/C#@-8\/<$ W"$"2!N
M?61#&F2:A%(8O%CV
M.&Y-Y9,\X/-CAEE.8_0!AL'*?U. NP2RIZ8Q([LGC&;?0/<]\>S;U.T:Z./A>]5Z&\$\'6H-X?PU'[2#07F6S=QF5K8&*B#D1O!
M_CZVO]#&_4OH>PCF)[X[]F/F>&R ._ZH3.^Q?Y>=O#>;3'EE6#16]_'B?>$D
M\OG^'' ;WVCC%MUJ9;?Q$FJI+-R80U$27V\BKZ#@S=!Y"L_9P&WHK2/2R?*Z
M+?R^Z*OPJ#"OI'/AS<&^1EBVX4<3)Y%.3EX]%)8_O@"L@*COTFD-WZGFQ
MI/#Y-)XFZ>C41V4P_F1003P/R;ZFL$A,1R. %)Q>6,NXJ1C5Z9@5*M ,IG>H
MY(E:FH1!*,J>?&RV*\8].1.=V#V-D/MAN7057!7A?11.693X+&YCH(6]&/:\
M-[]35_5UR&O4IRP9A5Z;!]@]W<.-$N<7FP2E;(#"7PX.7@ECX;/01-YG-)^#
M^"HH75%U4"5"A[$Z5[PJK:O"SP^RW5F=9:!VK+>)BF_,'77>7O75>5?A O>Q
MI^X=F#KH6KMNMZUFUET7@'F/U8RVD?%8@U=Z^)7\E=?,6$"L^4([^-09)FP\
M+H>SQ"&Z*]5ZG2KE/^Q92N&)W8C*G;?H_0M2Z=WP-O(35O9CW.I[*D9I3U=*\:SFHMMJZF=72 C"K)1-UV;8RQVVX4C= P:O0$M^*
MQXF;^ /4&(J!B4 [=2-<*K:(M$"A5I&F#SG'=R1)99V$LPKR"("(17,88QU,
MO@Z;1YB6X'1=UN#2 \:U01[#E 5__)'M7S4,'IED';Q*L6_!.:8SYS/4,=H#
M"O(++K7>HI@F,,KBUVJ U*CV3K$@'V*1G+PPGHEW! EKF8 7T*SD*%20H0,J
MTH#K>3X)RAWC6L &GR$)8>)^QG ZFD5,%+-X&8UB7L"^L A0FC%,?#2/SRQ 
MX1=DXXN\4HU*Y6:M"#R2Q.B )QPA;Y-R.!/?$GK%8D' =R[8"\U%"6
M>1,29QL;3(96IR5YY0XZ0-B1*>T(\WUI1C1S&4%1U1V* ./^,"K7*Y5G'6V.
M472+Z(>"*Q\>RN/IVQ'%@W(YSIBL@9I[U@'MZW XE&!\DA@-!-^Q]%TAYYO/
MD\DT/6U<6!9&0DSX[ARN:0'3R= U/VD5 RFC3$E:T*-+#B'6UJTS+;67(1#LTZ+
M%K_.M;K)4I#)4>S_SA1*=$BE3Y5IW*^?N2PX-AG!DK\KZE ;:Q K^[R9@+DS$WFHRY8C+U5MLT5TS&7#49
MJVT8F0]6='ZBI:OSHP51'.\39V(.;$%Q<3K@&JH7$&7>>RX+GEKQ]TKC(45AR
M9+S32&7#G#M(=XYG'06[XA%B\EA_=L,WX&*QY^+9Q>&F8]/? 8W/ZM.IZN!C6A-M12[H<-3?&
MY8T^DZ*B^SKO*!06(?D1)*3.+= M&"-_$P.V\[7"KLY6**19S1(P*$ 33RFC
M1 )Q$%3%RBBS;IOO!&U[_DD?YGMQ"+<,PR.:%B:>Z)Q]%W-0"I4<,>*L0_^.
M#B,5T'D+RGX _,LF.1/<8G#EIZ$U@?3@_WI^ CWOY/Y1TI\-N>_+UJKCRX[<
MQZ&8G&EM=.J,UZ=@*RY?S]043"IY6IB@B;NPID*9/B[,5HPH4<6=V4^0UH.@
M#9FJ42&[9<.TM%IP,8>&SOK/(K+(24O9YPUKZO;H2/T[3R#24C\<).,R_P3*
M&7HJ7+P[?GWYUCF^>.]LWR$KR?50%@+C! $C%4H?W2A R#9PR)@^JTQ' WD"?6T\
M_[2SQ(VL-''ILU>&.R_< WPAD*N@I +?,3V%3,2'?G^.',1'?MP