>    Does anyone have a description of the MP3 format, or know of a
> place where I can download one? I have already searched, but found nothing.

Messages related to this question from the list, plus a selection from
the AMP developers' list...

TomCat/Abaddon


Ha valaki grabbelget meg tomoritget audio dolgokat, annak jol johet a kovetkezo ket tores Az egyik: L3ENC ISO/MPEG Audio Layer 3 Software Only Decoder version: 2.70 size: 344568 name: L3ENC.EXE 153591: 83 fa 0d 77 0a eb 153763: 83 c4 0c 83 f8 01 74 0b eb 153782: 66 39 bd 76 ff ff ff 74 07 eb 153851: 81 7e 0c 67 12 00 00 74 32 90 90 153880: 83 7a 14 61 7f 06 eb 153895: 81 7b 0c 9b 0b 00 00 74 06 83 7b 0c 01 75 08 90 90 eb 30.aug.1997 [tNC] A masik: CD Worx for Windows 95 version: 2.10.0579 size: 532480 name: CDWORX.EXE 7592: 8b 44 24 28 2d ff ff ff 7f b8 01 00 00 00 eb 09 7.sep.1997 [tNC]
I've written what is IMHO a clever little utility, and beta testers are welcome to line up. Its name: SMP3 - Scheduler for L3ENC MP3 encoder. I'm looking for people who compress MP3s by grabbing several discs to a network drive and then turning several machines loose on them with L3ENC.

What the util does: on each encoding machine you set where the WAV directories are, then you simply start them all, and they compress the material on a first-come, first-served basis. It makes *.MP3 from *.WAV and deletes the WAV. If the compression is interrupted on one of the machines, that machine gives the WAV back to the others. And if it is interrupted brutally, the next thread cleans up before it starts an L3ENC run.

So it's not really a big deal, but it cuts the human work to a minimum. It is useful on a single machine too, because of the interruptions. The util is about 12 Kbytes; I'll send it as an attachment to those who apply. 100.00% freeware, I just don't want to release it with bugs. Of course, it may only be used to compress legal material, and of course with a registered L3ENC...

ERN0

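The message does not say how SMP3 hands out work, so the following is only a sketch of one common way to get "first come, first served" behaviour on a shared directory: each worker claims a WAV by renaming it, and only one rename of the same source name can succeed. The function names and the ".<worker>" suffix are hypothetical, and the sketch assumes the shared drive gives rename() the same all-or-nothing behaviour as a local file system.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "<wav>.<worker>" in out; returns 0 on success. */
static int claimed_name(const char *wav, const char *worker, char *out, size_t n)
{
    int len = snprintf(out, n, "%s.%s", wav, worker);
    return (len < 0 || (size_t)len >= n) ? -1 : 0;
}

/* Try to claim a WAV. Once one worker has renamed the file, the source
   name no longer exists, so every other worker's rename fails. */
int claim_wav(const char *wav, const char *worker, char *claimed, size_t n)
{
    if (claimed_name(wav, worker, claimed, n) != 0)
        return -1;
    return rename(wav, claimed) == 0 ? 0 : -1;
}

/* Give an interrupted job back to the pool by restoring the original name. */
int release_wav(const char *wav, const char *claimed)
{
    return rename(claimed, wav) == 0 ? 0 : -1;
}
```

A worker would call claim_wav() before starting the encoder and release_wav() if it is interrupted cleanly. Detecting claims left behind by a machine that was killed outright (the "brutal" case in the message) needs extra bookkeeping that is not shown here.
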
I have been experimenting with different compilers for amp today, and I got some interesting results (this is primarily for Linux users; Sun users, for example, already have a decent compiler).

che% time ./amp-gcc -q -nobuffer /mnt/aux/test_mp3/classic1.mp3
1.50user 0.07system 0:20.26elapsed 7%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (80major+24minor)pagefaults 0swaps

che% time ./amp-pgcc -q -nobuffer /mnt/aux/test_mp3/classic1.mp3
1.29user 0.11system 0:20.26elapsed 6%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (82major+25minor)pagefaults 0swaps

che% time ./amp-egcs -q -nobuffer /mnt/aux/test_mp3/classic1.mp3
1.36user 0.11system 0:20.26elapsed 7%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (80major+25minor)pagefaults 0swaps

...it seems that you can win a _lot_ if you use the pentium-optimized pgcc, or even the generally improved (over gcc) egcs. I am waiting to see what happens when egcs gets enhanced with pentium-specific patches.

Note that the measuring was done on 22kHz mp3's and a 200MHz pentium; the little 10% won here might make a difference on your machine.

I am also considering making binary releases of amp, because most people don't have the time to compile these compilers at home (or mess with advanced features like rt).

pgcc is available at: http://www.goof.com/pcg/
egcs is available at: http://www.cygnus.com/egcs/

tomislav

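To put numbers on the "10%" above: the elapsed time is identical in all three runs because playback runs in real time, so the gain shows up only in user CPU time. A small sketch that works out the savings from the figures quoted in the message (the function names are made up for illustration):

```c
#include <stdio.h>

/* Percentage of user CPU time saved relative to a baseline run. */
double percent_saved(double baseline_user, double candidate_user)
{
    return 100.0 * (baseline_user - candidate_user) / baseline_user;
}

/* Print the savings for the three runs quoted in the message. */
void print_savings(void)
{
    const double gcc = 1.50, pgcc = 1.29, egcs = 1.36; /* user seconds */
    printf("pgcc: %.1f%% less user time than gcc\n", percent_saved(gcc, pgcc));
    printf("egcs: %.1f%% less user time than gcc\n", percent_saved(gcc, egcs));
}
```

With the quoted timings this comes to roughly 14% for pgcc and 9% for egcs.
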
I'm working on a feature for amp that I call "Automatic Volume Control". It means that I handle volume correction for a start, and that I try to amplify the sound if the overall output of a song is low. I use a lookup table for the samples, and apply a factor to the values as necessary.

For the moment the code works, but:

+ It's outside of amp itself, since I do my processing after the frame is completely decoded
+ I still have problems with songs that start very loud (drum intros or such)

For other songs, the system works really well. I compute the maximums of the sample values, then I do something like this:

    if (pre > player->levelCurrentMax) {
        if (!player->levelOverflows)
            printf("Start overflow\n");
        player->levelOverflows++;
    }
    if (pre > player->levelPreMax)
        player->levelPreMax = pre;
    if (player->levelOverflows > kLevelChangeTreshold) {
        float val = kVolumeControlLkpSize;
        float fact = val / player->levelPreMax;
        if (fact > 2.5)
            fact = 2.5;
        if (fact != player->volComputeFactor) {
            player->volComputeFactor = fact;
            printf("Computed volume factor %f\n", player->volComputeFactor);
            volumeComputeTable(player, player->volComputeFactor);
        }
        player->levelCurrentMax = player->levelPreMax;
        player->levelOverflows = 0;
    }

The reason why I have problems with fast intros is obvious: the first factor computed on silent frames is almost always 2.5, and the first real frames get REALLY loud :-)

I'm currently changing the system to start with a factor of 1.0 and gradually push it up to the computed value. That way, we'll never reach the 2.5 factor, since the first real frames will not be played at that volume.

In any case, the real goal for me is to compute that factor for a complete song the first time, and then SAVE it in the playlist for later use on that particular song.

Michel Pollet

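The "start at 1.0 and gradually push it up" idea could look something like the sketch below. This is not Michel's code: the function name, the step size, and the choice to drop immediately when the target is lower are all assumptions made for illustration; only the 2.5 cap comes from the message.

```c
#define K_MAX_FACTOR 2.5f   /* cap taken from the message */
#define K_RAMP_STEP  0.05f  /* assumed step per frame, tune to taste */

/* Move the applied factor one small step toward the computed target
   instead of jumping to it. Call once per decoded frame. */
float ramp_factor(float current, float target)
{
    if (target > K_MAX_FACTOR)
        target = K_MAX_FACTOR;
    if (current < target) {
        current += K_RAMP_STEP;
        if (current > target)
            current = target;
    } else if (current > target) {
        current = target;   /* drop at once so loud frames do not clip */
    }
    return current;
}
```

Starting from 1.0, silent lead-in frames can then raise the factor only by a small amount per frame, so a loud first real frame is never played at the full 2.5.
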
I found these MPEG conformance test bitstreams; they may be interesting for testing amp:

ftp://ftp.tnt.uni-hannover.de/pub/MPEG/audio/mpeg1/compliance/

There are various MPEG files in layer 1, 2 and 3.

David Balazic

I've made a prerelease of 0.7.7 available on the mail server. You can get it the standard way: send a message with

    send amp_prerelease

in the body to multimed@rasip.fer.hr

0.7.7 has been lagging behind a little because I had to work on my diploma thesis last week, but all the major features are already in:

- guicontrol fixes (me and Budor)
- qt gui (Lodewijk Voge)
- win32 port (SJC)

...and a lot of little fixes. guicontrol.c should now work on almost all unices, once you define which mechanism you use for passing descriptors (see doc/guicontrol.txt). The Qt gui works and it's nice, try it out. The Win32 port is working, but it will need a maintainer as I, well, don't use windows (I tested the port, though).

What is still left to be done is some more flexibility in configuration (interactive configure script?) and making it possible to decide certain things at runtime. More fixing, tweaking, etc.

If anyone tests this, drop me a note.

tomislav

Howdy all, here is my patch for amp 0.7.7. I've added preliminary MPEG 2.5 support, using information a guy named Marc Pirotte sent me.. (I also had to do a little poking around in l3aud32.dll from winplay3 to get some data he didn't have). Anyways, here it is, and it works, although it doesn't sound perfect on all MPEG 2.5 files (there are little high blips occasionally). Probably some oversight, I'm sure we'll get it worked out... Justin Frankel justin@nullsoft.com --8323328-287504126-875128036=:8657 Content-Type: APPLICATION/octet-stream; name="0.7.7-mpeg2.5.patch.gz" Content-Transfer-Encoding: BASE64 Content-ID: Content-Description: H4sICFAAKTQCAzAuNy43LW1wZWcyLjUucGF0Y2gA7Rprc9pI8jP8ig5XzoEl sB4IAT6ycWK8oSqOc7azub3YpRJoMKqAxEoijtfr/37dMyOQeMSQePfuw7mM NNL09PR7enrk+cMhVGdQrUZsMIti/wvDdsBuq0N/zKBWO3An06pWs2v2gTvz /LA2gFraKlar1bUghdMwgAs2Bd0CXW9bjbaBjVbLLiqKshhf+Mg8DmbUQau3 9Xq7bguwly+h2mipNij8+vJlEQqFSXxTLt3DngcPUFKHkTth78LKIXY94M8f QvnIuXhz9tG57J124flzeFaWQHu6VqnAfbFaKMRsEAZe3JE9++URcz0WVV/0 jjsd7SfLbrR13TIqB4kTI19jP7hxhhH7bcaCwd2nBfD1vL0Kdn1YVJ5uJvb1 evt5SUx+MEtY3JHzHzQ0/jolR95hD2QHF+unPa+9pxnedUmV41UJiAImbRhN U9U1UOjeEAqJb/1kMIIyiyISLqIauDGDn7uXb47Pne75eRs8n5VL2B9GECHZ SC1MpuwG+n4SJ/hmUgP21U/wfe0qKHFd0l8fuz4fkrqyGN9dQBtu3Sgol5KR HwP+u8DN1A/g9H33ZzBqFgzDaOImKtyOfCQOYYIwAY8N/YB5Yo4qTSER9e+g d3F2QKNr0EsI/qrkQjxlA98dw0nkzoJROGSRxHtVqq0iQUWAFzIxVTybTsMo AU6iGFSDOIyiu+zIlMUloZ281VMWU7Jg7N7h9HrKisSPnoO008zlO5ZU1olv GffJXHxoMUzStlAFSjNi+SnyhJMZmFqDm4GpGXQnO7ifO9/Rh+PemfP+7dGv FQCanr/mL199ODnpnjsXvX930fgrXAg8CpxNWVD+EV9Tyde2x7WDN6lSnHPH nYQee9YxeWg5co7PPr477f2romLU6V46v5y9/XDaFTpg45gtWHw1G6IBPSWj 22P8K9iNWDKLAqBo8lD0dl5QRvMFYbRpQRllFxSt2dZaGDpXF5TR0oJi6nMw sl3dwMil4FUsKH/zg8F45jH4R5zw0S+KaLToC7NBAsJq33SPjrvnaOFKwQ8S IHESw6Kdtrh/pg/TKEzYIPHDwEHHEj5jNdUmKFYLrzRvEUNewlBgBB8P3DEb ugNn/Mm45v8GRfF1ELGE0M3rTyYCVTNAidNvChSmRLHSG8tencYqa8dmhitr h2cwpCTGIx7siNvITVgKYhFIjg+fY7Ds5rWQSsPmy3tTaiMHu584YxUvcV4U wWC0zNga4zeEeJQt4CRofvIj558fet1LVSYTr99R8+Ts/PTo0vl49EsXn44/ 
nL53XvXeHZ3/erg09uN577LrXJ6JYKhmXpz03nYF67reUFtoibqt6nXJ/TLb C36/zSh0KK+5B8PQLA1UtHxNo7vewPuDSl31us5f1ZuiyzSoq6g8KhmOWync w8H+Yn3l8Ix0HcP+ARQVwLBxj2meZlg4LeFWoSkmV6gHH3GywoOaxaMd6DVt CZcA38QI79zECmWAD1yROMXPH3qAqUsShWN0ZwyVhJvL3WqoBrqi3jDUujA7 BE9GDH04Rk8b9mmRvcGQhdokqQz8ZHwHGHqmNzWjWYNwyKHjxA08N6LlUTpn EkYxTg77tKoiDK6rYcD4gorsTzFCun1wx3HIgZBTnPcSMWFcBESQTEKc3w88 9hVE1sBjjYrLL59w6EdEn/8VymGAFCUjWsBnMfMqJH8+Dw1KlaQCEshHBmic fGAOBGWvkhJQe0gMCWGJCI9hHoBK6R1zGsT8YlbkZMziHD4D8rlthUsiJSHE SyQHZ0fp+VF6RYiGqDGQNotMgNRPL6vrglznnlsUJwhe1Zp9nsr1uq/RGHXb qJp85L2l6rqKrmaYqtFSTUutm6qFSayl2i21Rb34M1uqjhEJ1wqEQ1jMck0E NbG/3kB4Q1ct2+L+9Dg+zJTxtd5AsBYCmbpq2Dbi0xGfKfDhfNvj256+YhXd DHGaqs2x4shWHrGuIlm2qTYRoYY/05SEWhIxElrX1ToSlBK4ARkCIRO4PdNV HKSjwnQD7xbeEbmBUIaNFBI0UvsIspTlJnZp+MNYwjFx+SEgkmcin/U6klVv pbw+HKLxZ4wjs8ZtaRwpOVz8JvGt4rQonxbJhjpQPi09RzrCWpxom8uSBCk1 Tu3tgM055lWVoS5JKi2uLBSJKYRLEywhF7CoMZt0gfIhi2kJ6a0FbXEGceW1 JR32nGguznRRWMkIOnwhuEdxZ9aCQ4xhGK6EjDHz6WPjDgNY8PcEA3CECSBu fWRcf1ov/HMwIj6yAq6sjGHbwhIsLifpzi2JBgWKRm416IeDGmSahFIYvFj2 OG5N5ZM84PNjhllOY/QBhsHKf1OAuwSyp6YxI7snjGbfQPc98ezb1O0c0cg+ ioqIasqG1Dv1xO/0w63jkvo94DYPNNyHUDcoSVI6PnPnIcU0SEFkUvZ3Oscq G4/E7S24yEbunZhea6OPhe9V6G8E8HWoN4fw1H7SDQXmWzdxmVrYGKiDkRvB /j62v9DG/UvoewjmJ7479mPmeGyAO/6oTO+xf5edvDebTHllWDRW9/HifeEk 8vn+HHAb32jjFt1qZbfxEmqpLNyYQ1ESX28ir6DgzdB5Cs/ZwG3orSPSyfK6 Lfy+6KvwchUJRO5XO4uda7ZMYlbTJ767X9RNJLDD0+RruTsbxp0fqxZzLJPp jSprCcp2WHeqDCvpHPhzcG+Rli24UcTJ5FOTl49FJY/vgCsgKjv0mkN36nmx pPD5NJ4m6ejUR2Uw/mRQQTwPyb6msEhMRyOAFJxeWMu4qRjV6ZgVKtAMpneo 5IlamoRBKMqefGy2K8Y9OROd2D2NkPthuXQVXBXhfRROWZT4LG5joIW9GPa8 N79TV/V1yGvUpywZhV6bB9g93cONEucXmwSlbIDCXw4OXglj4bPQRN5nNJ+D +CooXVF1UCVCh7E6V7wqravCzw+y3VmdZaB2rLeJim/MHXXeXvXVeVfhAvex p+4dmDroWrtut61m1l0XgHmP1Yy2kfFYg1d6+JX8ldfMWECs+UI7+NQZJmw8 LoezxCG6K9V6nSrlP+xZSuGJ3YjKnbfo/QtS6d3wNvITVvZj3Op7KkZpTc32 76SlG5ZQTZ5rad5e1dK8azmottq6mdXSAjCrJRN12bYyx224UjdAwavQEt+K x4mb+APUGIqBiUA7dSNcKraItEChVpGmDznHdyRJZZ2EswryCICIRXMYYx1M 
vg6bR5iW4HRd1uDSA8a1QR7DlAV//JHtXzUMHplkHbxKsW/BOaYz5zPUMdoD CvILLrXeopgmMMri12qA1Kj2TrEgH2KRnLwwnol3BAlrmYAX0KzkKFSQoQMq 0oDreT4Jyh3jWsAGnyEJYeJ+xnA6mkVMFLN4GY1iXsC+sAhQmjFMfDSPzyxA 4Rdk4cuF+C4Y3IZUGuMpaTSjWhGxHocThm5/g7NwknWQp04q9LH7dhRKPAM3 ovJSnwZJSDe4u8Uo1K5WatCDySxOiAJxwhb5NyOBPfEnrFYkHAdy7YC81FCW eRMSZxsbTIZWpyV55Q46QNiRKe0I831pRjRzGUFR1R2KAOP+MCrXK5VnHW2O UXSL6IeCKx8eyuPp2xHFg3I5zpisgZp71gHt63A4lGB8khgNBN+x9F0h55vP k8k0PW1cWBZGQkz47hyuaQHTydA1P2kVAymjTEla0KNLcmxBDsgJUp7poJfM oLyB5QI/rt8pJHJ1GTwips3VgJj25OKh0bY0DInZeDiHW1q0zLbWXIRDs06L Fr/OtbrJUpDJUez/zhRKdEilT5VcpjHkx5fB7THtthqmf/nYhXkdcwOHf/ng kGQ6ep3K+fuSw4NhnBkr8r6lAbaxAr+7yZgLkzE3moy5YjL1Vts0V0zGXDUZ q20YmQ9WdH6ipavzowVRHOcl+5Eb8yOBm8gNMCgWRZR6KkP5X0yU/lLT4LFY 0+igVdzEqZo43kEcyW1Ie8TZ2IObEFxcTrgGqoXEGXeey4Lnlrx90rjIUVhy ZLzTSGXDnDtId45nHQW74hFi8lh/dsM34GKx5+LZxeGmYczTCO5yi4dVp1v0 LWWuut42c/uLDORKrDYzB/t1Hqvraaze9PfAY3P6tOp6uBjWhNtRS7ocNTfG 5Y0+k6Ki+zrvKBQWIfkRJKTOLdAtGCN/EwO287XCrs5WKKRZzRIwKEATTymj RAJxEFTFyiizbpvvBG17/kkf5ntxCLcMwyOaFiae6Jx9F3NQCpUcMeKsQ/+O DiMV0HkLyn4A/MsmORPcYnDlp6E1gfTg/3p+Aj3v5P5R0p8Nue/L1qrjy47c x6GYnGltdOqM16dgKy5fz9QUTCp5WpigibuwpkKZPi7MVowoUcWd2U+Q1oOg DZmqUSG7ZcO0tFpwMYeGzvrPIrLISUvZ5w1r6vboSP07TyDSUj8cJOMy/wTK GXoqXLw7fn351jm+eO9cvO92j1V4TnRUoNNBTmXKPuVfZZZLfHB7Zcwm+dB8 nKlnTyOkivi4ajes3yEryfVQFgLjBAEjFUof3ShAyDZwyJg+q0xHA3kCfW08 /7SzxI2sNHHps1eGOy/cA3whkKugpALfMT2FTMSHfn+OHMRHftwc5IfTxf8A K1fXOQMuAAA= --8323328-287504126-875128036=:8657--
> Is anyone working on a streaming MP3 player (vs. WinAmp's current need
> to download then play)? If so, I would like to get in touch. If not,
> I would like to begin building such a beast.

There is a player supporting the HTTP protocol, called mpg123. I don't know whom to contact about it, but I'm sure you can find something on the net.

Tomas

WinAmp supports HTTP streaming from some 1.??? version on (the latest is 1.55). You can get it from http://winamp.lh.net

David Balazic

MP3 Web Pages
-------------
Buster's Music Page    http://ns2.clever.net/~buster/music.htm
The Music Vault        http://199.77.34.251/~tryp/musicvaul/
Rabid Neurosis         http://www.wantree.com.au/~silpub/rns/
Compress Da Audio      http://www.cda.net/

For more details, please write me: Igor_Gavrilov

For making mp3s you can use l3enc, available at:

http://www.iis.fhg.de/departs/amm/layer3/

There is a bunch of mp3-related software on www.mp3shoppingmall.com

David Balazic

>The original ISO source is available at
>http://users.bart.nl/~soloh/mpegEnc.html

Pete, no. As far as I know, there is another source code that you have to purchase from ISO that does support joint-stereo. Forgive me if I am wrong, but I think the encoder that they used on http://users.bart.nl/~soloh/mpegEnc.html is the same as:

ftp://ftp.tnt.uni-hannover.de/pub/MPEG/audio/mpeg2/public_software/

according to the Layer 3 FAQ on the FhG site:

http://www.iis.fhg.de/departs/amm/layer3/sw/index.html

I think there is another code base in

ftp://ftp.tnt.uni-hannover.de/pub/MPEG/audio/mpeg1/software/

Dmitry

>I managed to compile the ISO encoder source code "dist10/lsf" on the Mac
>without big problems. It also seems to run fine, but the encoded files
>(regardless whether I choose layer II or layer III) just do not work with
>my MPEG players.
>Do I have to swap bytes in the output (Motorola/Intel byte ordering), or
>can you imagine what else I might have done wrong?

I finally ran a test file through the ISO encoder this evening to see how it works. The encoder takes raw PCM data in big-endian, signed (two's complement) format. I used IRIX 5.3's SoundFiler to convert an AIFF file to raw PCM and it worked great (I was unable to get the included dist10/tool/pcm2aiff to do much). The resulting MP3 works on all my players, including AMP and WinAMP.

However, this ISO source indeed doesn't include joint-stereo; it will generate an error message if this is selected for layer-3 output, so I had to use plain stereo. The resulting MP3 still sounds very clear, however (44.1 kHz; 128 kbit/s).

Pete Plank
Illudium Design
http://www.kic.or.jp/~plankp/

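Since the encoder wants big-endian samples, raw 16-bit PCM that was captured in little-endian (Intel) order would need its bytes swapped first. A minimal sketch, assuming 16-bit samples in a plain byte buffer; the function name is made up for illustration and nothing like it is mentioned in the message:

```c
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of every 16-bit sample in place.
   nbytes is the buffer length; a trailing odd byte is left untouched. */
void swap16_samples(uint8_t *buf, size_t nbytes)
{
    for (size_t i = 0; i + 1 < nbytes; i += 2) {
        uint8_t tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}
```

Data that is already big-endian, such as the sample data of an AIFF file, should be fed to the encoder without this step.
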
L3ENC/DEC V2.61 ISO/MPEG Audio Layer 3 Software Only Encoder/Decoder Registration Code Generator Written by Outsider/NWT, Nov 16 1996 -------------------------------------- Write me, & I'm sending U this 'Code Generator' Igor_Gavrilov@p10.f620.n5030.z2.fidonet.org
I would like to try those assembly optimisations in AMP. Unfortunately my C compiler doesn't understand that way of writing assembly instructions. The truth is, I don't understand it either. I am used to the classical style, i.e.:

    fmul st(1)
    fstp [ecx+eax*4]
    inc eax
    ...

Can someone give me a clue for understanding those cryptic fmuls 8(%1)\n\t"\ and so on...

> #if defined(ARCH_i586)
> /* x86 assembler optimisations. These optimisations are tuned
> specifically for Intel Pentiums. */
>
> asm("movl $15,%%eax\n\t"\
> "1:\n\t"\
> "flds (%0)\n\t"\
> "fmuls (%1)\n\t"\
> "flds 4(%0)\n\t"\
> "fmuls 4(%1)\n\t"\
> "flds 8(%0)\n\t"\
> "fmuls 8(%1)\n\t"\
> "fxch %%st(2)\n\t"\
> "faddp\n\t"\
> "flds 12(%0)\n\t"\
>.....

These assembly statements are in AT&T assembly format, as opposed to the Intel format you are using. They are also coded using a gcc feature called "inline assembly". This is something most compilers have, only with a different syntax.

I'm sending you a document describing the AT&T format and inline assembly privately via email. Anyone else needing this should feel free to ask for it as well.

Tomislav

P.S. The code itself was written by Karl Oygard, and he is probably the only person who understands how it works and why it's so fast. I don't have a clue personally :-)

Brennan's Guide to Inline Assembly

by Brennan "Mr. Wacko" Underwood

Document version 1.1.2

Ok. This is meant to be an introduction to inline assembly under DJGPP. DJGPP is based on GCC, so it uses the AT&T/UNIX syntax and has a somewhat unique method of inline assembly. I spent many hours figuring some of this stuff out and told Info that I hate it, many times.

Hopefully if you already know Intel syntax, the examples will be helpful to you.

The Syntax

So, DJGPP uses the AT&T assembly syntax. What does that mean to you?

* Register naming: Register names are prefixed with "%". To reference eax:

    AT&T:  %eax
    Intel: eax

* Source/Destination Ordering: In AT&T syntax (which is the UNIX standard, BTW) the source is always on the left, and the destination is always on the right. So let's load ebx with the value in eax:

    AT&T:  movl %eax, %ebx
    Intel: mov ebx, eax

* Constant value/immediate value format: You must prefix all constant/immediate values with "$". Let's load eax with the address of the "C" variable booga, which is static.

    AT&T:  movl $_booga, %eax
    Intel: mov eax, _booga

  Now let's load ebx with 0xd00d:

    AT&T:  movl $0xd00d, %ebx
    Intel: mov ebx, 0d00dh

* Operator size specification: You must suffix the instruction with one of b, w, or l to specify the width of the destination register as a byte, word or longword. If you omit this, GAS (GNU assembler) will attempt to guess. You don't want GAS to guess, and guess wrong! Don't forget it.

    AT&T:  movw %ax, %bx
    Intel: mov bx, ax

  The equivalent forms for Intel are byte ptr, word ptr, and dword ptr, but that is for when you are...

* Referencing memory: DJGPP uses 386-protected mode, so you can forget all that real-mode addressing junk, including the restrictions on which register has what default segment, and which registers can be base or index pointers. Now, we just get 6 general purpose registers.
  (7 if you use ebp, but be sure to restore it yourself or compile with -fomit-frame-pointer.)

  Here is the canonical format for 32-bit addressing:

    AT&T:  immed32(basepointer,indexpointer,indexscale)
    Intel: [basepointer + indexpointer*indexscale + immed32]

  You could think of the formula to calculate the address as:

    immed32 + basepointer + indexpointer * indexscale

  You don't have to use all those fields, but you do have to have at least 1 of immed32, basepointer and you MUST add the size suffix to the operator!

  Let's see some simple forms of memory addressing:

  o Addressing a particular C variable:

    AT&T:  _booga
    Intel: [_booga]

    Note: the underscore ("_") is how you get at C variables from assembler. But usually you will use extended asm to have them preloaded. I address that farther down.

  o Addressing what a register points to:

    AT&T:  (%eax)
    Intel: [eax]

  o Addressing a variable offset by a value in a register:

    AT&T:  _variable(%eax)
    Intel: [eax + _variable]

  o Addressing a value in an array of integers (scaling up by 4):

    AT&T:  _array(,%eax,4)
    Intel: [eax*4 + _array]

  o You can also do offsets with the immediate value:

    C code: *(p+1) where p is a char *
    AT&T:  1(%eax) where eax has the value of p
    Intel: [eax + 1]

  o You can do some simple math on the immediate value:

    AT&T:  _struct_pointer+8

    I assume you can do that with Intel format as well.

  o Addressing a particular char in an array of 8-character records: eax holds the number of the record desired. ebx has the wanted char's offset within the record.

    AT&T:  _array(%ebx,%eax,8)
    Intel: [ebx + eax*8 + _array]

Whew. Hopefully that covers all the addressing you'll need to do. As a note, you can put esp into the address, but only as the base register.

Basic inline assembly

The format for basic inline assembly is very simple, and much like Borland's method.

    asm ("statements");

Pretty simple, no? So

    asm ("nop");

will do nothing of course, and

    asm ("cli");

will stop interrupts, with

    asm ("sti");

of course enabling them.
You can use __asm__ instead of asm if the keyword asm conflicts with something in your program.

When it comes to simple stuff like this, basic inline assembly is fine. You can even push your registers onto the stack, use them, and put them back.

    asm ("pushl %eax\n\t"
         "movl $0, %eax\n\t"
         "popl %eax");

(The \n's and \t's are there so the .s file that GCC generates and hands to GAS comes out right when you've got multiple statements per asm.)

It's really meant for issuing instructions for which there is no equivalent in C and which don't touch the registers.

But if you do touch the registers, and don't fix things at the end of your asm statement, like so:

    asm ("movl %eax, %ebx");
    asm ("xorl %ebx, %edx");
    asm ("movl $0, _booga");

then your program will probably blow things to hell. This is because GCC hasn't been told that your asm statement clobbered ebx and edx and booga, which it might have been keeping in a register, and might plan on using later. For that, you need:

Extended inline assembly

The basic format of the inline assembly stays much the same, but now gets Watcom-like extensions to allow input arguments and output arguments.

Here is the basic format:

    asm ( "statements" : output_registers : input_registers : clobbered_registers);

Let's just jump straight to a nifty example, which I'll then explain:

    asm ("cld\n\t"
         "rep\n\t"
         "stosl"
         : /* no output registers */
         : "c" (count), "a" (fill_value), "D" (dest)
         : "%ecx", "%edi" );

The above stores the value in fill_value count times to the pointer dest.

Let's look at this bit by bit.

    asm ("cld\n\t"

We are clearing the direction bit of the flags register. (Intel syntax spells this instruction the same way: cld.) You never know what this is going to be left at, and it costs you all of 1 or 2 cycles.

         "rep\n\t"
         "stosl"

Notice that GAS requires the rep prefix to occupy a line of its own. Notice also that stos has the l suffix to make it move longwords.

         : /* no output registers */

Well, there aren't any in this function.
         : "c" (count), "a" (fill_value), "D" (dest)

Here we load ecx with count, eax with fill_value, and edi with dest. Why make GCC do it instead of doing it ourselves? Because GCC, in its register allocating, might be able to arrange for, say, fill_value to already be in eax. If this is in a loop, it might be able to preserve eax thru the loop, and save a movl once per loop.

         : "%ecx", "%edi" );

And here's where we specify to GCC, "you can no longer count on the values you loaded into ecx or edi to be valid." This doesn't mean they will be reloaded for certain. This is the clobberlist.

Seem funky? Well, it really helps when optimizing, when GCC can know exactly what you're doing with the registers before and after. It folds your assembly code into the code it generates (whose rules for generation look remarkably like the above) and then optimizes. It's even smart enough to know that if you tell it to put (x+1) in a register, then if you don't clobber it, and later C code refers to (x+1), and it was able to keep that register free, it will reuse the computation. Whew.

Here's the list of register loading codes that you'll be likely to use:

    a        eax
    b        ebx
    c        ecx
    d        edx
    S        esi
    D        edi
    I        constant value (0 to 31)
    q,r      dynamically allocated register (see below)

Note that you can't directly refer to the byte registers (ah, al, etc.) or the word registers (ax, bx, etc.) when you're loading this way. Once you've got it in there, though, you can specify ax or whatever all you like.

The codes have to be in quotes, and the expressions to load in have to be in parentheses.

When you do the clobber list, you specify the registers as above with the %. If you write to a variable, you must include "memory" as one of The Clobbered. This is in case you wrote to a variable that GCC thought it had in a register. This is the same as clobbering all registers.
While I've never run into a problem with it, you might also want to add "cc" as a clobber if you change the condition codes (the bits in the flags register the jnz, je, etc. operators look at.)

Now, that's all fine and good for loading specific registers. But what if you specify, say, ebx, and ecx, and GCC can't arrange for the values to be in those registers without having to stash the previous values? It's possible to let GCC pick the register(s). You do this:

    asm ("leal (%1,%1,4), %0"
         : "=r" (x)
         : "0" (x) );

The above example multiplies x by 5 really quickly (1 cycle on the Pentium). Now, we could have specified, say eax. But unless we really need a specific register (like when using rep movsl or rep stosl, which are hardcoded to use ecx, edi, and esi), why not let GCC pick an available one? So when GCC generates the output code for GAS, %0 will be replaced by the register it picked.

And where did "q" and "r" come from? Well, "q" causes GCC to allocate from eax, ebx, ecx, and edx. "r" lets GCC also consider esi and edi. So make sure, if you use "r", that it would be possible to use esi or edi in that instruction. If not, use "q".

Now, you might wonder how to determine how the %n tokens get allocated to the arguments. It's a straightforward first-come-first-served, left-to-right thing, mapping to the "q"'s and "r"'s. But if you want to reuse a register allocated with a "q" or "r", you use "0", "1", "2"... etc.

You don't need to put a GCC-allocated register on the clobberlist as GCC knows that you're messing with it.

Now for output registers.

    asm ("leal (%1,%1,4), %0"
         : "=r" (x_times_5)
         : "r" (x) );

Note the use of = to specify an output register. You just have to do it that way. If you want 1 variable to stay in 1 register for both in and out, you have to respecify the register allocated to it on the way in with the "0" type codes as mentioned above.
    asm ("leal (%0,%0,4), %0"
         : "=r" (x)
         : "0" (x) );

This also works, by the way:

    asm ("leal (%%ebx,%%ebx,4), %%ebx"
         : "=b" (x)
         : "b" (x) );

2 things here:

* Note that we don't have to put ebx on the clobberlist, GCC knows it goes into x. Therefore, since it can know the value of ebx, it isn't considered clobbered.

* Notice that in extended asm, you must prefix registers with %% instead of just %. Why, you ask? Because as GCC parses along for %0's and %1's and so on, it would interpret %edx as a %e parameter, see that that's non-existent, and ignore it. Then it would bitch about finding a symbol named dx, which isn't valid because it's not prefixed with % and it's not the one you meant anyway.

Important note: If your assembly statement must execute where you put it (i.e. must not be moved out of a loop as an optimization), put the keyword volatile after asm and before the ()'s. To be ultra-careful, use

    __asm__ __volatile__ (...whatever...);

However, I would like to point out that if your assembly's only purpose is to calculate the output registers, with no other side effects, you should leave off the volatile keyword so your statement will be processed into GCC's common subexpression elimination optimization.

Some useful examples

    #define disable() __asm__ __volatile__ ("cli");
    #define enable() __asm__ __volatile__ ("sti");

Of course, libc has these defined too.

    #define times3(arg1, arg2) \
    __asm__ ( \
      "leal (%0,%0,2),%0" \
      : "=r" (arg2) \
      : "0" (arg1) );

    #define times5(arg1, arg2) \
    __asm__ ( \
      "leal (%0,%0,4),%0" \
      : "=r" (arg2) \
      : "0" (arg1) );

    #define times9(arg1, arg2) \
    __asm__ ( \
      "leal (%0,%0,8),%0" \
      : "=r" (arg2) \
      : "0" (arg1) );

These multiply arg1 by 3, 5, or 9 and put them in arg2. You should be ok to do:

    times5(x,x);

as well.

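As a compact illustration of the "let GCC pick the registers" form described above, here is the multiply-by-5 lea wrapped in a function instead of a macro. This is a sketch and not part of the guide: the function name and the architecture guard are assumptions, and the plain C fallback is only there so the example also builds on non-x86 machines.

```c
/* Multiply by 5 with a single lea on x86; plain C elsewhere. */
static inline int times5_fn(int x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int result;
    __asm__ ("leal (%1,%1,4), %0"
             : "=r" (result)   /* %0: output, any general register */
             : "r" (x));       /* %1: input, any general register  */
    return result;
#else
    return x * 5;              /* fallback so the sketch builds anywhere */
#endif
}
```

Because the statement has no side effects beyond its output, it is written without volatile, which leaves GCC free to reuse or drop the computation as the guide recommends.
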
    #define rep_movsl(src, dest, numwords) \
    __asm__ __volatile__ ( \
      "cld\n\t" \
      "rep\n\t" \
      "movsl" \
      : : "S" (src), "D" (dest), "c" (numwords) \
      : "%ecx", "%esi", "%edi" )

Helpful Hint: If you say memcpy() with a constant length parameter, GCC will inline it to a rep movsl like above. But if you need a variable length version that inlines and you're always moving dwords, there ya go.

    #define rep_stosl(value, dest, numwords) \
    __asm__ __volatile__ ( \
      "cld\n\t" \
      "rep\n\t" \
      "stosl" \
      : : "a" (value), "D" (dest), "c" (numwords) \
      : "%ecx", "%edi" )

Same as above but for memset(), which doesn't get inlined no matter what (for now.)

The End

"The End"?! Yah, I guess so.

If you're wondering, I personally am a big fan of AT&T/UNIX syntax now. (It might have helped that I cut my teeth on SPARC assembly. Of course, that machine actually had a decent number of general registers.) It might seem weird to you at first, but it's really more logical than Intel format, and has no ambiguities.

If I still haven't answered a question of yours, look in the Info pages for more information, particularly on the input/output registers. You can do some funky stuff like use "A" to allocate two registers at once for 64-bit math or "m" for static memory locations, and a bunch more that aren't really used as much as "q" and "r".

Alternately, mail me, and I'll see what I can do. (If you find any errors in the above, please, e-mail me and tell me about it! It's frustrating enough to learn without buggy docs!) Or heck, mail me to say "boogabooga." It's the least you can do.

----------------------------------------------------------------------------
Thanks to Eric J. Korpela for corrections.
----------------------------------------------------------------------------
Have you seen the DJGPP2+Games Page? Probably.
Page written and provided by Brennan Underwood.
Copyright © 1996 Brennan Underwood. Share and enjoy!
Page created with vi, God's own editor.

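One caveat when trying the rep_movsl and rep_stosl macros from the guide with a current compiler: they name ecx, esi and edi both as inputs and as clobbers, and recent GCC versions reject an asm whose clobber list overlaps its operands. The usual replacement is to declare those registers as read-write operands with "+". The following is a sketch of that form, not Brennan's code; the function name and the C fallback are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill count 32-bit words at dest with value. */
static inline void fill_words(uint32_t *dest, uint32_t value, size_t count)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__ ("cld\n\t"
                          "rep; stosl"
                          : "+D" (dest), "+c" (count)  /* read and modified by rep stos */
                          : "a" (value)                /* fill value in eax */
                          : "memory", "cc");           /* writes memory, cld changes flags */
#else
    while (count--)
        *dest++ = value;   /* fallback so the sketch builds anywhere */
#endif
}
```

The "memory" clobber follows the guide's own advice for code that writes to variables GCC may have cached in registers.
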
| Could you say something about your code? How does it work?

If you really want to understand it, start out with the C version. The assembler version is essentially the same, but with better instruction scheduling.

In any case, the algorithm just does some dewindowing, which involves 16 coefficient multiplications. The tables, however, are fairly large, which breaks the cache. Thus, the code has been obfuscated beyond recognition, in order to make table accesses and caching more optimal.

If you really want to understand it, you should download earlier versions of amp (e.g. 0.7.3) to see how it's done there; it's easier to understand, but not as efficient. From there you can go to the newer versions; they all do the same, but in a different way.

Now for the assembler part. As I mentioned, the dewindowing involves a lot of multiplications. Fortunately, the Pentium can pipeline floating point operations very well. In effect, multiplications, additions and subtractions take three clock cycles, but, if you write code well, the Pentium can do one multiplication, addition or subtraction every clock cycle. If you operate on different floating point registers and none of them depend on each other, you get no stalls and maximum throughput.

I'll describe the code a bit:

    "flds (%0)\n\t"\       ;  1 push u_ptr[0] onto fpu register stack
    "fmuls (%1)\n\t"\      ;  2 multiply dewindow[0] with st0
    "flds 4(%0)\n\t"\      ;  3 push u_ptr[1] onto fpu register stack
    "fmuls 4(%1)\n\t"\     ;  4 multiply dewindow[1] with st0
    "flds 8(%0)\n\t"\      ;  5 push u_ptr[2] onto fpu register stack
    "fmuls 8(%1)\n\t"\     ;  6 multiply dewindow[2] with st0
    "fxch %%st(2)\n\t"\    ;  6 swap st0 and st2 (result from cycle 2)
    "faddp\n\t"\           ;  7 add st0 to st1 (result from cycle 4) and pop st0 off stack
    "flds 12(%0)\n\t"\     ;  8 push u_ptr[3] onto fpu register stack
    "fmuls 12(%1)\n\t"\    ;  9 multiply dewindow[3] with st0
    "fxch %%st(2)\n\t"\    ;  9 swap st0 and st2 (result from cycle 6)
    "faddp\n\t"\           ; 10 add st0 to st1 (result from cycle 7) and pop st0 off stack
    "flds 16(%0)\n\t"\     ; 11 push u_ptr[4] onto fpu register stack
    "fmuls 16(%1)\n\t"\    ; 12 multiply dewindow[4] with st0
    "fxch %%st(2)\n\t"\    ; 12 swap st0 and st2 (result from cycle 9)
    "faddp\n\t"\           ; 13 add st0 to st1 (result from cycle 10) and pop st0 off stack
    :

Here we did 5 multiplications, 3 additions and 5 loads in 13 cycles. Now, if we were _really_ bad at this, we could have made it run in something like 34 cycles. In this case, we're nearly three times as fast. However, the code is pretty much incomprehensible.

This is what your compiler should have done for you, but doesn't get quite right. By putting your mind to it, and minimising level 1 and level 2 cache stalls, you can do it quite a bit better yourself.

If you want to read more about assembler optimisation on the Pentium cpu, check out http://www.goof.com/pcg/docs.html.

Regards,
Karl Anders Oygard

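For readers who want the arithmetic without the scheduling, the listing above amounts to a sum of products of u_ptr[i] and dewindow[i]. The sketch below is not amp's C version; it only illustrates the idea of keeping several independent partial sums so that consecutive multiply-adds do not wait on each other. The names u_ptr and dewindow come from Karl's comments, while the function name and the split into four accumulators are assumptions.

```c
/* Sum of products over 16 taps, using four independent partial sums
   so that neighbouring operations have no data dependency. */
float dewindow_dot16(const float *u_ptr, const float *dewindow)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < 16; i += 4) {
        s0 += u_ptr[i]     * dewindow[i];
        s1 += u_ptr[i + 1] * dewindow[i + 1];
        s2 += u_ptr[i + 2] * dewindow[i + 2];
        s3 += u_ptr[i + 3] * dewindow[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Whether a given compiler schedules this as tightly as the hand-written version is a separate question; note also that floating-point sums grouped this way can differ in the last bits from a strictly left-to-right sum.
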
Since I cannot get any version of amp (dos or windows) to compile, I am unable to make the changes myself (anyone with Borland 4.5 friendly source?) 1) There is a mixed precision in all of the calculations for decoding; while this may seem no big deal, I think there is enough room to warrant lower precision calculations ( = faster ) . 2) Why not use an adaptive huffman model? ( Ok, speed is an issue and most MP3 players are designed for single pass decoding) but if we use an adaptive model, we can achieve 5 to 60% smaller files. Taiken.
Sorry to throw in this technical support email, but to resolve the external symbol problem you need to include "winmm.lib" in the "Project | Settings | Link" option. Pv
FYI: This seems to be the German version of Visual C++ - the error message translates into "unresolved external symbol". Guess M$ got the libraries wrong once more. LLaP bero http://www.star-trek.ml.org/ -- "Nobody will ever need more than 640k RAM!" -- Bill Gates, 1981 "Windows 95 needs at least 8 MB RAM." -- Bill Gates, 1996 "Nobody will ever need Windows 95." -- logical conclusion
Don't forget to add winmm.lib to the Release x86 configuration. If you have any doubts, take a look at the Alpha configs. These are OK for sure. Remember I did this port using a machine with an Alpha CPU. It is natural that some x86 configs are not correct. Pedro Miguel Teixeira
[car mp3 stereo based on AMP] >DOS has these advantages: >I am more familiar with it. >At least at first look, it is much smaller than Linux. >Being smaller, it's possible the OS and application could fit on a single >floppy disk during development, bypassing the need for a hard drive or >large flash memory. you could try mpg123/dos. it already has a driver for sb16 and so on and so on... http://www.sci.fi/~tobo/mpg123 and read the page before downloading :) i will not answer stupid questions.
I've already drafted a document on that topic. It needs some serious input. Please take a look and comment. http://home.dwave.net/~whizkid/ml3cd/ MP3 files already have a de facto standard for including title, artist, album, year, genre and comment in the file itself. Chris
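The post does not name the standard, but the field list matches the ID3v1 tag: a fixed 128-byte block at the very end of the file that starts with the characters "TAG". Assuming that is what is meant, a minimal C sketch of reading it could look like this. The struct and function names are invented for the example; only the byte offsets come from the ID3v1 layout.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the decoded fields of an ID3v1 tag. */
struct id3v1 {
    char title[31];
    char artist[31];
    char album[31];
    char year[5];
    char comment[31];
    unsigned char genre;   /* index into the genre table */
};

/* Copy a fixed-width field and strip trailing NUL or space padding. */
static void copy_field(char *dst, const unsigned char *src, size_t n)
{
    memcpy(dst, src, n);
    dst[n] = '\0';
    while (n > 0 && (dst[n - 1] == ' ' || dst[n - 1] == '\0'))
        dst[--n] = '\0';
}

/* Parse the last 128 bytes of a file.
 * Returns 1 if a tag was found, 0 if the "TAG" marker is missing. */
static int id3v1_parse(const unsigned char tag[128], struct id3v1 *out)
{
    assert(tag != NULL && out != NULL);
    if (memcmp(tag, "TAG", 3) != 0)
        return 0;
    copy_field(out->title,   tag + 3,  30);
    copy_field(out->artist,  tag + 33, 30);
    copy_field(out->album,   tag + 63, 30);
    copy_field(out->year,    tag + 93, 4);
    copy_field(out->comment, tag + 97, 30);
    out->genre = tag[127];
    return 1;
}
```

A caller would seek to 128 bytes before the end of the file, read 128 bytes into a buffer and pass it to id3v1_parse(); files without a tag simply return 0.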
Some parts of the newer amps aren't as accurate as they could be. The extra high quality mode on winamp 1.64 uses a slower, more accurate poly() (though it isn't much better). Winamp 1.666 (coming real soon) uses a rewritten MUCH more accurate poly(), which was originally considerably slower (I can actually understand what it does), but I've optimized it in assembly to make it only a bit slower... Anyways, not really a bug, just some optimizations of amp have been detrimental to the accuracy of decoding. I'm trying to get past that :) Justin
> I was curious to find out what you call 'extra high quality' in your windows
> port of the AMP decoder. The AMP code does not seem to do any special
> degrading of the audio quality, as it uses floating point numbers to do the
> computation. Also, I tried to decode a stream into a "WAV" file, both with
> extra high quality and normal quality, and the difference in the PCM values is
> most of the time 0, and never larger than 1 (in absolute value) (this is on a
> 16 bit range).
> Also, I noticed that when compared to a reference decoder like l3dec (from the
> Fraunhofer Institute), the PCM output has significant variations from the
> reference, especially in the MPEG frames that are using the SHORT WINDOWS
> transforms. Is it a bug ?

Actually, amp has big problems in IMDCT and POLY. But I am not going to get into that. Also, the fact that we're using floats instead of doubles lowers the quality but ups the performance. I guess it is one of those quality vs. cost trade-offs.

Dmitry
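The comparison described in the quoted question (decode the same stream twice, then look at the largest difference between the PCM values) is easy to reproduce. The following is a rough C sketch under the assumption that both decoder outputs are already in memory as 16-bit samples; pcm_max_abs_diff is a made-up name, not part of amp or l3dec.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest absolute per-sample difference between
 * two 16-bit PCM buffers of the same length. */
static long pcm_max_abs_diff(const short *a, const short *b, size_t n)
{
    long worst = 0;
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < n; i++) {
        /* Widen before subtracting so 32767 - (-32768) cannot overflow. */
        long d = (long)a[i] - (long)b[i];
        if (d < 0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

A result of 0 or 1 would match what the poster saw between the two winamp quality modes; larger values against a reference decoder would point at the kind of deviation reported for short-window frames.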
>I'd like to see an mp3 player that supports more sophisticated >pre-buffering which can occur *during* playback of another >track. Are there any currently available players with this feature? >I've not encountered any. i cant say for sure but i believe mpg123 for linux has this feature... it uses a separate process for buffering and so on... http://mpg.123.org/ linux -> full source
MPEG-4 Audio Reference Software is available: MPEG Audio Web Page: http://www.tnt.uni-hannover.de/project/mpeg/audio/ MPEG Audio ftp server: ftp://ftp.tnt.uni-hannover.de/pub/MPEG/audio/ It contains an AAC encoder/decoder. David Balazic
does anyone have a text on what is really going on in mp3 compression? if not...a brief description would be nice...i've been wondering and i doubt it's simply working with the text or changing 0100100101100 to 012121122 or something.(actually, i have heard that that type of compression was tried but doesn't give very good rates) what i do know is this...the mp3 technology is based on the limited range of our hearing(something like 2-20 hz...) but that is it.....does anyone else have anymore info on this? andy haninger
"If you have to ask you won't understand the answer." However there are some pretty accessible intros to MP3 and audio compression technology on www.mpeg.org. The will give you a general understanding of the sort of tricks that are used. Real understanding requires a good background in signal processing. > does anyone have a text on what is really going on in mp3 compression? if > not...a brief description would be nice...i've been wondering and i doubt > it's simply working with the text or changing 0100100101100 to 012121122 > or something.(actually, i have heard that that type of compression was > tried but doesn't give very good rates) > > what i do know is this...the mp3 technology is based on the limited range > of our hearing(something like 2-20 hz...) but that is it.....does anyone > else have anymore info on this? Well, the limited range of our hearing which you speak of is a very simple model of hearing which is not of much use (considering that recording technology already takes advantage of this). MP3 exploits a more complicated phenomenon known as psychoacoustic masking. -Matt
Finally, a small source excerpt from AMP.....

/* this file is a part of amp software, (C) tomislav uzelac 1996,1997
*/

/* transform.c  imdct and polyphase(DCT) transforms
 *
 * Created by: tomislav uzelac  May 1996
 * Karl Anders Oygard optimized this for speed, Mar 13 97
 * Some optimisations based on ideas from Michael Hipp's mpg123 package
 */

/*
 * Comments for this file:
 *
 * The polyphase algorithm is clearly the most cpu consuming part of mpeg 1
 * layer 3 decoding.  Thus, there has been some effort to optimise this
 * particular algorithm.  Currently, everything has been kept in straight C
 * with no assembler optimisations, but in order to provide efficient paths
 * for different architectures, alternative implementations of some
 * critical sections have been done.  You may want to experiment with these,
 * to see which suits your architecture better.
 *
 * Selection of the different implementations is done with the following
 * defines:
 *
 *      HAS_AUTOINCREMENT
 *
 *              Define this if your architecture supports preincrementation of
 *              pointers when referencing (applies to e.g. 68k)
 *
 * For those who are optimising amp, check out the Pentium rdtsc code
 * (define PENTIUM_RDTSC).  This code uses the rdtsc counter for showing
 * how many cycles are spent in different parts of the code.
*/ #include #include #include #include #include "audio.h" #include "getdata.h" #include "misc2.h" #define TRANSFORM #include "transform.h" #define PI12 0.261799387f #define PI36 0.087266462f void imdct_init() { int i; for(i=0;i<36;i++) /* 0 */ win[0][i] = (float) sin(PI36 *(i+0.5)); for(i=0;i<18;i++) /* 1 */ win[1][i] = (float) sin(PI36 *(i+0.5)); for(i=18;i<24;i++) win[1][i] = 1.0f; for(i=24;i<30;i++) win[1][i] = (float) sin(PI12 *(i+0.5-18)); for(i=30;i<36;i++) win[1][i] = 0.0f; for(i=0;i<6;i++) /* 3 */ win[3][i] = 0.0f; for(i=6;i<12;i++) win[3][i] = (float) sin(PI12 * (i+ 0.5 - 6.0)); for(i=12;i<18;i++) win[3][i] = 1.0f; for(i=18;i<36;i++) win[3][i] = (float) sin(PI36 * (i + 0.5)); } /* This uses Byeong Gi Lee's Fast Cosine Transform algorithm to decompose the 36 point and 12 point IDCT's into 9 point and 3 point IDCT's, respectively. Then the 9 point IDCT is computed by a modified version of Mikko Tommila's IDCT algorithm, based on the WFTA. See his comments before the first 9 point IDCT. The 3 point IDCT is already efficient to implement. -- Jeff Tsay. */ /* I got the unrolled IDCT from Jeff Tsay; the code is presumably by Francois-Raymond Boyer - I unrolled it a little further. tu */ void imdct(int win_type,int sb,int ch) { /*------------------------------------------------------------------*/ /* */ /* Function: Calculation of the inverse MDCT */ /* In the case of short blocks the 3 output vectors are already */ /* overlapped and added in this modul. 
*/ /* */ /* New layer3 */ /* */ /*------------------------------------------------------------------*/ float tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8, tmp9, tmp10, tmp11; register float save; float pp1, pp2; float *win_bt; int i, p, ss; float *in = xr[ch][sb]; float *s_p = s[ch][sb]; float *res_p = res[sb]; float out[36]; if(win_type == 2){ for(p=0;p<36;p+=9) { out[p] = out[p+1] = out[p+2] = out[p+3] = out[p+4] = out[p+5] = out[p+6] = out[p+7] = out[p+8] = 0.0f; } for(ss=0;ss<18;ss+=6) { /* * 12 point IMDCT */ /* Begin 12 point IDCT */ /* Input aliasing for 12 pt IDCT */ in[5+ss]+=in[4+ss];in[4+ss]+=in[3+ss];in[3+ss]+=in[2+ss]; in[2+ss]+=in[1+ss];in[1+ss]+=in[0+ss]; /* Input aliasing on odd indices (for 6 point IDCT) */ in[5+ss] += in[3+ss]; in[3+ss] += in[1+ss]; /* 3 point IDCT on even indices */ pp2 = in[4+ss] * 0.500000000f; pp1 = in[2+ss] * 0.866025403f; save = in[0+ss] + pp2; tmp1 = in[0+ss] - in[4+ss]; tmp0 = save + pp1; tmp2 = save - pp1; /* End 3 point IDCT on even indices */ /* 3 point IDCT on odd indices (for 6 point IDCT) */ pp2 = in[5+ss] * 0.500000000f; pp1 = in[3+ss] * 0.866025403f; save = in[1+ss] + pp2; tmp4 = in[1+ss] - in[5+ss]; tmp5 = save + pp1; tmp3 = save - pp1; /* End 3 point IDCT on odd indices */ /* Twiddle factors on odd indices (for 6 point IDCT) */ tmp3 *= 1.931851653f; tmp4 *= 0.707106781f; tmp5 *= 0.517638090f; /* Output butterflies on 2 3 point IDCT's (for 6 point IDCT) */ save = tmp0; tmp0 += tmp5; tmp5 = save - tmp5; save = tmp1; tmp1 += tmp4; tmp4 = save - tmp4; save = tmp2; tmp2 += tmp3; tmp3 = save - tmp3; /* End 6 point IDCT */ /* Twiddle factors on indices (for 12 point IDCT) */ tmp0 *= 0.504314480f; tmp1 *= 0.541196100f; tmp2 *= 0.630236207f; tmp3 *= 0.821339815f; tmp4 *= 1.306562965f; tmp5 *= 3.830648788f; /* End 12 point IDCT */ /* Shift to 12 point modified IDCT, multiply by window type 2 */ tmp8 = tmp0 * -0.793353340f; tmp9 = tmp0 * -0.608761429f; tmp7 = tmp1 * -0.923879532f; tmp10 = tmp1 * -0.382683432f; tmp6 
= tmp2 * -0.991444861f; tmp11 = tmp2 * -0.130526192f; tmp0 = tmp3; tmp1 = tmp4 * 0.382683432f; tmp2 = tmp5 * 0.608761429f; tmp3 = tmp5 * -0.793353340f; tmp4 = tmp4 * -0.923879532f; tmp5 = tmp0 * -0.991444861f; tmp0 *= 0.130526192f; out[ss + 6] += tmp0; out[ss + 7] += tmp1; out[ss + 8] += tmp2; out[ss + 9] += tmp3; out[ss + 10] += tmp4; out[ss + 11] += tmp5; out[ss + 12] += tmp6; out[ss + 13] += tmp7; out[ss + 14] += tmp8; out[ss + 15] += tmp9; out[ss + 16] += tmp10; out[ss + 17] += tmp11; } if (sb&1) { for (i=0;i<18;i+=2) res_p[i]=out[i] + s_p[i]; for (i=1;i<18;i+=2) res_p[i]=-out[i] - s_p[i]; } else for (i=0;i<18;i++) res_p[i]=out[i] + s_p[i]; for (i=18;i<36;i++) s_p[i-18]=out[i]; } else { /* * 36 point IDCT **************************************************************** */ float tmp[18]; /* input aliasing for 36 point IDCT */ in[17]+=in[16]; in[16]+=in[15]; in[15]+=in[14]; in[14]+=in[13]; in[13]+=in[12]; in[12]+=in[11]; in[11]+=in[10]; in[10]+=in[9]; in[9] +=in[8]; in[8] +=in[7]; in[7] +=in[6]; in[6] +=in[5]; in[5] +=in[4]; in[4] +=in[3]; in[3] +=in[2]; in[2] +=in[1]; in[1] +=in[0]; /* 18 point IDCT for odd indices */ /* input aliasing for 18 point IDCT */ in[17]+=in[15]; in[15]+=in[13]; in[13]+=in[11]; in[11]+=in[9]; in[9] +=in[7]; in[7] +=in[5]; in[5] +=in[3]; in[3] +=in[1]; { float tmp0,tmp1,tmp2,tmp3,tmp4,tmp0_,tmp1_,tmp2_,tmp3_; float tmp0o,tmp1o,tmp2o,tmp3o,tmp4o,tmp0_o,tmp1_o,tmp2_o,tmp3_o; /* Fast 9 Point Inverse Discrete Cosine Transform // // By Francois-Raymond Boyer // mailto:boyerf@iro.umontreal.ca // http://www.iro.umontreal.ca/~boyerf // // The code has been optimized for Intel processors // (takes a lot of time to convert float to and from iternal FPU representation) // // It is a simple "factorization" of the IDCT matrix. 
*/ /* 9 point IDCT on even indices */ { /* 5 points on odd indices (not realy an IDCT) */ float i0 = in[0]+in[0]; float i0p12 = i0 + in[12]; tmp0 = i0p12 + in[4]*1.8793852415718f + in[8]*1.532088886238f + in[16]*0.34729635533386f; tmp1 = i0 + in[4] - in[8] - in[12] - in[12] - in[16]; tmp2 = i0p12 - in[4]*0.34729635533386f - in[8]*1.8793852415718f + in[16]*1.532088886238f; tmp3 = i0p12 - in[4]*1.532088886238f + in[8]*0.34729635533386f - in[16]*1.8793852415718f; tmp4 = in[0] - in[4] + in[8] - in[12] + in[16]; } { float i6_ = in[6]*1.732050808f; tmp0_ = in[2]*1.9696155060244f + i6_ + in[10]*1.2855752193731f + in[14]*0.68404028665134f; tmp1_ = (in[2] - in[10] - in[14])*1.732050808f; tmp2_ = in[2]*1.2855752193731f - i6_ - in[10]*0.68404028665134f + in[14]*1.9696155060244f; tmp3_ = in[2]*0.68404028665134f - i6_ + in[10]*1.9696155060244f - in[14]*1.2855752193731f; } /* 9 point IDCT on odd indices */ { /* 5 points on odd indices (not realy an IDCT) */ float i0 = in[0+1]+in[0+1]; float i0p12 = i0 + in[12+1]; tmp0o = i0p12 + in[4+1]*1.8793852415718f + in[8+1]*1.532088886238f + in[16+1]*0.34729635533386f; tmp1o = i0 + in[4+1] - in[8+1] - in[12+1] - in[12+1] - in[16+1]; tmp2o = i0p12 - in[4+1]*0.34729635533386f - in[8+1]*1.8793852415718f + in[16+1]*1.532088886238f; tmp3o = i0p12 - in[4+1]*1.532088886238f + in[8+1]*0.34729635533386f - in[16+1]*1.8793852415718f; tmp4o = (in[0+1] - in[4+1] + in[8+1] - in[12+1] + in[16+1])*0.707106781f; /* Twiddled */ } { /* 4 points on even indices */ float i6_ = in[6+1]*1.732050808f; /* Sqrt[3] */ tmp0_o = in[2+1]*1.9696155060244f + i6_ + in[10+1]*1.2855752193731f + in[14+1]*0.68404028665134f; tmp1_o = (in[2+1] - in[10+1] - in[14+1])*1.732050808f; tmp2_o = in[2+1]*1.2855752193731f - i6_ - in[10+1]*0.68404028665134f + in[14+1]*1.9696155060244f; tmp3_o = in[2+1]*0.68404028665134f - i6_ + in[10+1]*1.9696155060244f - in[14+1]*1.2855752193731f; } /* Twiddle factors on odd indices // and // Butterflies on 9 point IDCT's // and // twiddle factors for 
36 point IDCT */ { float e, o; e = tmp0 + tmp0_; o = (tmp0o + tmp0_o)*0.501909918f; tmp[0] = (e + o)*(-0.500476342f*.5f); tmp[17] = (e - o)*(-11.46279281f*.5f); e = tmp1 + tmp1_; o = (tmp1o + tmp1_o)*0.517638090f; tmp[1] = (e + o)*(-0.504314480f*.5f); tmp[16] = (e - o)*(-3.830648788f*.5f); e = tmp2 + tmp2_; o = (tmp2o + tmp2_o)*0.551688959f; tmp[2] = (e + o)*(-0.512139757f*.5f); tmp[15] = (e - o)*(-2.310113158f*.5f); e = tmp3 + tmp3_; o = (tmp3o + tmp3_o)*0.610387294f; tmp[3] = (e + o)*(-0.524264562f*.5f); tmp[14] = (e - o)*(-1.662754762f*.5f); tmp[4] = (tmp4 + tmp4o)*(-0.541196100f); tmp[13] = (tmp4 - tmp4o)*(-1.306562965f); e = tmp3 - tmp3_; o = (tmp3o - tmp3_o)*0.871723397f; tmp[5] = (e + o)*(-0.563690973f*.5f); tmp[12] = (e - o)*(-1.082840285f*.5f); e = tmp2 - tmp2_; o = (tmp2o - tmp2_o)*1.183100792f; tmp[6] = (e + o)*(-0.592844523f*.5f); tmp[11] = (e - o)*(-0.930579498f*.5f); e = tmp1 - tmp1_; o = (tmp1o - tmp1_o)*1.931851653f; tmp[7] = (e + o)*(-0.630236207f*.5f); tmp[10] = (e - o)*(-0.821339815f*.5f); e = tmp0 - tmp0_; o = (tmp0o - tmp0_o)*5.736856623f; tmp[8] = (e + o)*(-0.678170852f*.5f); tmp[9] = (e - o)*(-0.740093616f*.5f); } } /* shift to modified IDCT */ win_bt = win[win_type]; if (sb&1) { res_p[0] = -tmp[9] * win_bt[0] + s_p[0]; res_p[1] =-(-tmp[10] * win_bt[1] + s_p[1]); res_p[2] = -tmp[11] * win_bt[2] + s_p[2]; res_p[3] =-(-tmp[12] * win_bt[3] + s_p[3]); res_p[4] = -tmp[13] * win_bt[4] + s_p[4]; res_p[5] =-(-tmp[14] * win_bt[5] + s_p[5]); res_p[6] = -tmp[15] * win_bt[6] + s_p[6]; res_p[7] =-(-tmp[16] * win_bt[7] + s_p[7]); res_p[8] = -tmp[17] * win_bt[8] + s_p[8]; res_p[9] = -(tmp[17] * win_bt[9] + s_p[9]); res_p[10]= tmp[16] * win_bt[10] + s_p[10]; res_p[11]=-(tmp[15] * win_bt[11] + s_p[11]); res_p[12]= tmp[14] * win_bt[12] + s_p[12]; res_p[13]=-(tmp[13] * win_bt[13] + s_p[13]); res_p[14]= tmp[12] * win_bt[14] + s_p[14]; res_p[15]=-(tmp[11] * win_bt[15] + s_p[15]); res_p[16]= tmp[10] * win_bt[16] + s_p[16]; res_p[17]=-(tmp[9] * win_bt[17] + 
s_p[17]); } else { res_p[0] = -tmp[9] * win_bt[0] + s_p[0]; res_p[1] = -tmp[10] * win_bt[1] + s_p[1]; res_p[2] = -tmp[11] * win_bt[2] + s_p[2]; res_p[3] = -tmp[12] * win_bt[3] + s_p[3]; res_p[4] = -tmp[13] * win_bt[4] + s_p[4]; res_p[5] = -tmp[14] * win_bt[5] + s_p[5]; res_p[6] = -tmp[15] * win_bt[6] + s_p[6]; res_p[7] = -tmp[16] * win_bt[7] + s_p[7]; res_p[8] = -tmp[17] * win_bt[8] + s_p[8]; res_p[9] = tmp[17] * win_bt[9] + s_p[9]; res_p[10]= tmp[16] * win_bt[10] + s_p[10]; res_p[11]= tmp[15] * win_bt[11] + s_p[11]; res_p[12]= tmp[14] * win_bt[12] + s_p[12]; res_p[13]= tmp[13] * win_bt[13] + s_p[13]; res_p[14]= tmp[12] * win_bt[14] + s_p[14]; res_p[15]= tmp[11] * win_bt[15] + s_p[15]; res_p[16]= tmp[10] * win_bt[16] + s_p[16]; res_p[17]= tmp[9] * win_bt[17] + s_p[17]; } s_p[0]= tmp[8] * win_bt[18]; s_p[1]= tmp[7] * win_bt[19]; s_p[2]= tmp[6] * win_bt[20]; s_p[3]= tmp[5] * win_bt[21]; s_p[4]= tmp[4] * win_bt[22]; s_p[5]= tmp[3] * win_bt[23]; s_p[6]= tmp[2] * win_bt[24]; s_p[7]= tmp[1] * win_bt[25]; s_p[8]= tmp[0] * win_bt[26]; s_p[9]= tmp[0] * win_bt[27]; s_p[10]= tmp[1] * win_bt[28]; s_p[11]= tmp[2] * win_bt[29]; s_p[12]= tmp[3] * win_bt[30]; s_p[13]= tmp[4] * win_bt[31]; s_p[14]= tmp[5] * win_bt[32]; s_p[15]= tmp[6] * win_bt[33]; s_p[16]= tmp[7] * win_bt[34]; s_p[17]= tmp[8] * win_bt[35]; } } /* fast DCT according to Lee[84] * reordering according to Konstantinides[94] */ void poly(const int ch,int f) { static float u[2][2][17][16]; /* no v[][], it's redundant */ static int u_start[2]={0,0}; /* first element of u[][] */ static int u_div[2]={0,0}; /* which part of u[][] is currently used */ int start = u_start[ch]; int div = u_div[ch]; float (*u_p)[16]; #if defined(PENTIUM_RDTSC) unsigned int cnt4, cnt3, cnt2, cnt1; static int min_cycles = 99999999; __asm__(".byte 0x0f,0x31" : "=a" (cnt1), "=d" (cnt4)); #endif { float d16,d17,d18,d19,d20,d21,d22,d23,d24,d25,d26,d27,d28,d29,d30,d31; float d0,d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12,d13,d14,d15; /* step 1: initial 
reordering and 1st (16 wide) butterflies */ d0 = res[ 0][f]; d16=(d0 - res[31][f]) * b1; d0 += res[31][f]; d1 = res[ 1][f]; d17=(d1 - res[30][f]) * b3; d1 += res[30][f]; d3 = res[ 2][f]; d19=(d3 - res[29][f]) * b5; d3 += res[29][f]; d2 = res[ 3][f]; d18=(d2 - res[28][f]) * b7; d2 += res[28][f]; d6 = res[ 4][f]; d22=(d6 - res[27][f]) * b9; d6 += res[27][f]; d7 = res[ 5][f]; d23=(d7 - res[26][f]) * b11; d7 += res[26][f]; d5 = res[ 6][f]; d21=(d5 - res[25][f]) * b13; d5 += res[25][f]; d4 = res[ 7][f]; d20=(d4 - res[24][f]) * b15; d4 += res[24][f]; d12= res[ 8][f]; d28=(d12 - res[23][f]) * b17; d12+= res[23][f]; d13= res[ 9][f]; d29=(d13 - res[22][f]) * b19; d13+= res[22][f]; d15= res[10][f]; d31=(d15 - res[21][f]) * b21; d15+= res[21][f]; d14= res[11][f]; d30=(d14 - res[20][f]) * b23; d14+= res[20][f]; d10= res[12][f]; d26=(d10 - res[19][f]) * b25; d10+= res[19][f]; d11= res[13][f]; d27=(d11 - res[18][f]) * b27; d11+= res[18][f]; d9 = res[14][f]; d25=(d9 - res[17][f]) * b29; d9 += res[17][f]; d8 = res[15][f]; d24=(d8 - res[16][f]) * b31; d8 += res[16][f]; { float c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15; /* a test to see what can be done with memory separation * first we process indexes 0-15 */ c0 = d0 + d8 ; c8 = ( d0 - d8 ) * b2; c1 = d1 + d9 ; c9 = ( d1 - d9 ) * b6; c2 = d2 + d10; c10= ( d2 - d10) * b14; c3 = d3 + d11; c11= ( d3 - d11) * b10; c4 = d4 + d12; c12= ( d4 - d12) * b30; c5 = d5 + d13; c13= ( d5 - d13) * b26; c6 = d6 + d14; c14= ( d6 - d14) * b18; c7 = d7 + d15; c15= ( d7 - d15) * b22; /* step 3: 4-wide butterflies */ d0 = c0 + c4 ; d4 = ( c0 - c4 ) * b4; d1 = c1 + c5 ; d5 = ( c1 - c5 ) * b12; d2 = c2 + c6 ; d6 = ( c2 - c6 ) * b28; d3 = c3 + c7 ; d7 = ( c3 - c7 ) * b20; d8 = c8 + c12; d12= ( c8 - c12) * b4; d9 = c9 + c13; d13= ( c9 - c13) * b12; d10= c10+ c14; d14= (c10 - c14) * b28; d11= c11+ c15; d15= (c11 - c15) * b20; /* step 4: 2-wide butterflies */ { float rb8 = b8; float rb24 = b24; /**/ c0 = d0 + d2 ; c2 = ( d0 - d2 ) * rb8; c1 = d1 + 
d3 ; c3 = ( d1 - d3 ) * rb24; /**/ c4 = d4 + d6 ; c6 = ( d4 - d6 ) * rb8; c5 = d5 + d7 ; c7 = ( d5 - d7 ) * rb24; /**/ c8 = d8 + d10; c10= ( d8 - d10) * rb8; c9 = d9 + d11; c11= ( d9 - d11) * rb24; /**/ c12= d12+ d14; c14= (d12 - d14) * rb8; c13= d13+ d15; c15= (d13 - d15) * rb24; } /* step 5: 1-wide butterflies */ { float rb16 = b16; /* this is a little 'hacked up' */ d0 = (-c0 -c1) * 2; d1 = ( c0 - c1 ) * rb16; d2 = c2 + c3; d3 = ( c2 - c3 ) * rb16; d3 -= d2; d4 = c4 +c5; d5 = ( c4 - c5 ) * rb16; d5 += d4; d7 = -d5; d7 += ( c6 - c7 ) * rb16; d6 = +c6 +c7; d8 = c8 + c9 ; d9 = ( c8 - c9 ) * rb16; d11= +d8 +d9; d11 +=(c10 - c11) * rb16; d10= c10+ c11; d12 = c12+ c13; d13 = (c12 - c13) * rb16; d13 += -d8-d9+d12; d14 = c14+ c15; d15 = (c14 - c15) * rb16; d15-=d11; d14 += -d8 -d10; } /* step 6: final resolving & reordering * the other 32 are stored for use with the next granule */ u_p = (float (*)[16]) &u[ch][div][0][start]; /*16*/ u_p[ 0][0] =+d1 ; u_p[ 2][0] = +d9 -d14; /*20*/ u_p[ 4][0] = +d5 -d6; u_p[ 6][0] = -d10 +d13; /*24*/ u_p[ 8][0] =d3; u_p[10][0] = -d8 -d9 +d11 -d13; /*28*/ u_p[12][0] = +d7; u_p[14][0] = +d15; /* the other 32 are stored for use with the next granule */ u_p = (float (*)[16]) &u[ch][!div][0][start]; /*0*/ u_p[16][0] = d0; u_p[14][0] = -(+d8 ); /*4*/ u_p[12][0] = -(+d4 ); u_p[10][0] = -(-d8 +d12 ); /*8*/ u_p[ 8][0] = -(+d2 ); u_p[ 6][0] = -(+d8 +d10 -d12 ); /*12*/ u_p[ 4][0] = -(-d4 +d6 ); u_p[ 2][0] = -d14; u_p[ 0][0] = -d1; } { float c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15; /* memory separation, second part */ /* 2 */ c0=d16 + d24; c8= (d16 - d24) * b2; c1=d17 + d25; c9= (d17 - d25) * b6; c2=d18 + d26; c10= (d18 - d26) * b14; c3=d19 + d27; c11= (d19 - d27) * b10; c4=d20 + d28; c12= (d20 - d28) * b30; c5=d21 + d29; c13= (d21 - d29) * b26; c6=d22 + d30; c14= (d22 - d30) * b18; c7=d23 + d31; c15= (d23 - d31) * b22; /* 3 */ d16= c0+ c4; d20= (c0 - c4) * b4; d17= c1+ c5; d21= (c1 - c5) * b12; d18= c2+ c6; d22= (c2 - c6) * b28; d19= 
c3+ c7; d23= (c3 - c7) * b20; d24= c8+ c12; d28= (c8 - c12) * b4; d25= c9+ c13; d29= (c9 - c13) * b12; d26= c10+ c14; d30= (c10 - c14) * b28; d27= c11+ c15; d31= (c11 - c15) * b20; /* 4 */ { float rb8 = b8; float rb24 = b24; /**/ c0= d16+ d18; c2= (d16 - d18) * rb8; c1= d17+ d19; c3= (d17 - d19) * rb24; /**/ c4= d20+ d22; c6= (d20 - d22) * rb8; c5= d21+ d23; c7= (d21 - d23) * rb24; /**/ c8= d24+ d26; c10= (d24 - d26) * rb8; c9= d25+ d27; c11= (d25 - d27) * rb24; /**/ c12= d28+ d30; c14= (d28 - d30) * rb8; c13= d29+ d31; c15= (d29 - d31) * rb24; } /* 5 */ { float rb16 = b16; d16= c0+ c1; d17= (c0 - c1) * rb16; d18= c2+ c3; d19= (c2 - c3) * rb16; d20= c4+ c5; d21= (c4 - c5) * rb16; d20+=d16; d21+=d17; d22= c6+ c7; d23= (c6 - c7) * rb16; d22+=d16; d22+=d18; d23+=d16; d23+=d17; d23+=d19; d24= c8+ c9; d25= (c8 - c9) * rb16; d26= c10+ c11; d27= (c10 - c11) * rb16; d26+=d24; d27+=d24; d27+=d25; d28= c12+ c13; d29= (c12 - c13) * rb16; d28-=d20; d29+=d28; d29-=d21; d30= c14+ c15; d31= (c14 - c15) * rb16; d30-=d22; d31-=d23; } /* step 6: final resolving & reordering * the other 32 are stored for use with the next granule */ u_p = (float (*)[16]) &u[ch][!div][0][start]; u_p[ 1][0] = -(+d30 ); u_p[ 3][0] = -(+d22 -d26 ); u_p[ 5][0] = -(-d18 -d20 +d26 ); u_p[ 7][0] = -(+d18 -d28 ); u_p[ 9][0] = -(+d28 ); u_p[11][0] = -(+d20 -d24 ); u_p[13][0] = -(-d16 +d24 ); u_p[15][0] = -(+d16 ); /* the other 32 are stored for use with the next granule */ u_p = (float (*)[16]) &u[ch][div][0][start]; u_p[15][0] = +d31; u_p[13][0] = +d23 -d27; u_p[11][0] = -d19 -d20 -d21 +d27; u_p[ 9][0] = +d19 -d29; u_p[ 7][0] = -d18 +d29; u_p[ 5][0] = +d18 +d20 +d21 -d25 -d26; u_p[ 3][0] = -d17 -d22 +d25 +d26; u_p[ 1][0] = +d17 -d30; } } #if defined(PENTIUM_RDTSC) __asm__(".byte 0x0f,0x31" : "=a" (cnt3), "=d" (cnt4)); #endif /* we're doing dewindowing and calculating final samples now */ #if defined(ARCH_i586) /* x86 assembler optimisations. These optimisations are tuned specifically for Intel Pentiums. 
*/ asm("movl $15,%%eax\n\t"\ "1:\n\t"\ "flds (%0)\n\t"\ "fmuls (%1)\n\t"\ "flds 4(%0)\n\t"\ "fmuls 4(%1)\n\t"\ "flds 8(%0)\n\t"\ "fmuls 8(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 12(%0)\n\t"\ "fmuls 12(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 16(%0)\n\t"\ "fmuls 16(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 20(%0)\n\t"\ "fmuls 20(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 24(%0)\n\t"\ "fmuls 24(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 28(%0)\n\t"\ "fmuls 28(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 32(%0)\n\t"\ "fmuls 32(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 36(%0)\n\t"\ "fmuls 36(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 40(%0)\n\t"\ "fmuls 40(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 44(%0)\n\t"\ "fmuls 44(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 48(%0)\n\t"\ "fmuls 48(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 52(%0)\n\t"\ "fmuls 52(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 56(%0)\n\t"\ "fmuls 56(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 60(%0)\n\t"\ "fmuls 60(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "addl $64,%0\n\t"\ "addl $128,%1\n\t"\ "subl $4,%%esp\n\t"\ "faddp\n\t"\ "fistpl (%%esp)\n\t"\ "popl %%ecx\n\t"\ "cmpl $32767,%%ecx\n\t"\ "jle 2f\n\t"\ "movw $32767,%%cx\n\t"\ "jmp 3f\n\t"\ "2: cmpl $-32768,%%ecx\n\t"\ "jge 3f\n\t"\ "movw $-32768,%%cx\n\t"\ "3: movw %%cx,(%2)\n\t"\ "addl %3,%2\n\t"\ "decl %%eax\n\t"\ "jns 1b\n\t"\ "testb $1,%4\n\t"\ "je 4f\n\t" "flds (%0)\n\t"\ "fmuls (%1)\n\t"\ "flds 8(%0)\n\t"\ "fmuls 8(%1)\n\t"\ "flds 16(%0)\n\t"\ "fmuls 16(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 24(%0)\n\t"\ "fmuls 24(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 32(%0)\n\t"\ "fmuls 32(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 40(%0)\n\t"\ "fmuls 40(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 48(%0)\n\t"\ "fmuls 48(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 56(%0)\n\t"\ "fmuls 56(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "subl 
$4,%%esp\n\t"\ "subl $64,%0\n\t"\ "subl $192,%1\n\t"\ "faddp\n\t"\ "fistpl (%%esp)\n\t"\ "popl %%ecx\n\t"\ "cmpl $32767,%%ecx\n\t"\ "jle 2f\n\t"\ "movw $32767,%%cx\n\t"\ "jmp 3f\n\t"\ "2: cmpl $-32768,%%ecx\n\t"\ "jge 3f\n\t"\ "movw $-32768,%%cx\n\t"\ "3: movw %%cx,(%2)\n\t"\ "movl %5,%%ecx\n\t"\ "sall $3,%%ecx\n\t"\ "addl %%ecx,%1\n\t"\ "addl %3,%2\n\t"\ "movl $14,%%eax\n\t"\ "1:flds 4(%0)\n\t"\ "fmuls 56(%1)\n\t"\ "flds (%0)\n\t"\ "fmuls 60(%1)\n\t"\ "flds 12(%0)\n\t"\ "fmuls 48(%1)\n\t"\ "fxch %%st(2)\n\t"\ "fsubp\n\t"\ "flds 8(%0)\n\t"\ "fmuls 52(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 20(%0)\n\t"\ "fmuls 40(%1)\n\t"\ "fxch %%st(2)\n\t"\ "fsubrp\n\t"\ "flds 16(%0)\n\t"\ "fmuls 44(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 28(%0)\n\t"\ "fmuls 32(%1)\n\t"\ "fxch %%st(2)\n\t"\ "fsubrp\n\t"\ "flds 24(%0)\n\t"\ "fmuls 36(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 36(%0)\n\t"\ "fmuls 24(%1)\n\t"\ "fxch %%st(2)\n\t"\ "fsubrp\n\t"\ "flds 32(%0)\n\t"\ "fmuls 28(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 44(%0)\n\t"\ "fmuls 16(%1)\n\t"\ "fxch %%st(2)\n\t"\ "fsubrp\n\t"\ "flds 40(%0)\n\t"\ "fmuls 20(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 52(%0)\n\t"\ "fmuls 8(%1)\n\t"\ "fxch %%st(2)\n\t"\ "fsubrp\n\t"\ "flds 48(%0)\n\t"\ "fmuls 12(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 60(%0)\n\t"\ "fmuls (%1)\n\t"\ "fxch %%st(2)\n\t"\ "fsubrp\n\t"\ "flds 56(%0)\n\t"\ "fmuls 4(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "subl $64,%0\n\t"\ "subl $128,%1\n\t"\ "subl $4,%%esp\n\t"\ "fsubp\n\t"\ "fistpl (%%esp)\n\t"\ "popl %%ecx\n\t"\ "cmpl $32767,%%ecx\n\t"\ "jle 2f\n\t"\ "movw $32767,%%cx\n\t"\ "jmp 3f\n\t"\ "2: cmpl $-32768,%%ecx\n\t"\ "jge 3f\n\t"\ "movw $-32768,%%cx\n\t"\ "3: movw %%cx,(%2)\n\t"\ "addl %3,%2\n\t"\ "decl %%eax\n\t"\ "jns 1b\n\t"\ "jmp 5f\n\t"\ "4:flds 4(%0)\n\t"\ "fmuls 4(%1)\n\t"\ "flds 12(%0)\n\t"\ "fmuls 12(%1)\n\t"\ "flds 20(%0)\n\t"\ "fmuls 20(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 28(%0)\n\t"\ "fmuls 
28(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 36(%0)\n\t"\ "fmuls 36(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 44(%0)\n\t"\ "fmuls 44(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 52(%0)\n\t"\ "fmuls 52(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 60(%0)\n\t"\ "fmuls 60(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "subl $4,%%esp\n\t"\ "subl $64,%0\n\t"\ "subl $192,%1\n\t"\ "faddp\n\t"\ "fistpl (%%esp)\n\t"\ "popl %%ecx\n\t"\ "cmpl $32767,%%ecx\n\t"\ "jle 2f\n\t"\ "movw $32767,%%cx\n\t"\ "jmp 3f\n\t"\ "2: cmpl $-32768,%%ecx\n\t"\ "jge 3f\n\t"\ "movw $-32768,%%cx\n\t"\ "3: movw %%cx,(%2)\n\t"\ "movl %5,%%ecx\n\t"\ "sall $3,%%ecx\n\t"\ "addl %%ecx,%1\n\t"\ "addl %3,%2\n\t"\ "movl $14,%%eax\n\t"\ "1:flds (%0)\n\t"\ "fmuls 60(%1)\n\t"\ "flds 4(%0)\n\t"\ "fmuls 56(%1)\n\t"\ "flds 8(%0)\n\t"\ "fmuls 52(%1)\n\t"\ "fxch %%st(2)\n\t"\ "fsubp\n\t"\ "flds 12(%0)\n\t"\ "fmuls 48(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 16(%0)\n\t"\ "fmuls 44(%1)\n\t"\ "fxch %%st(2)\n\t"\ "fsubrp\n\t"\ "flds 20(%0)\n\t"\ "fmuls 40(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 24(%0)\n\t"\ "fmuls 36(%1)\n\t"\ "fxch %%st(2)\n\t"\ "fsubrp\n\t"\ "flds 28(%0)\n\t"\ "fmuls 32(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 32(%0)\n\t"\ "fmuls 28(%1)\n\t"\ "fxch %%st(2)\n\t"\ "fsubrp\n\t"\ "flds 36(%0)\n\t"\ "fmuls 24(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 40(%0)\n\t"\ "fmuls 20(%1)\n\t"\ "fxch %%st(2)\n\t"\ "fsubrp\n\t"\ "flds 44(%0)\n\t"\ "fmuls 16(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 48(%0)\n\t"\ "fmuls 12(%1)\n\t"\ "fxch %%st(2)\n\t"\ "fsubrp\n\t"\ "flds 52(%0)\n\t"\ "fmuls 8(%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "flds 56(%0)\n\t"\ "fmuls 4(%1)\n\t"\ "fxch %%st(2)\n\t"\ "fsubrp\n\t"\ "flds 60(%0)\n\t"\ "fmuls (%1)\n\t"\ "fxch %%st(2)\n\t"\ "faddp\n\t"\ "subl $64,%0\n\t"\ "subl $128,%1\n\t"\ "subl $4,%%esp\n\t"\ "fsubp\n\t"\ "fistpl (%%esp)\n\t"\ "popl %%ecx\n\t"\ "cmpl $32767,%%ecx\n\t"\ "jle 2f\n\t"\ "movw $32767,%%cx\n\t"\ "jmp 3f\n\t"\ "2: 
cmpl $-32768,%%ecx\n\t"\ "jge 3f\n\t"\ "movw $-32768,%%cx\n\t"\ "3: movw %%cx,(%2)\n\t"\ "addl %3,%2\n\t"\ "decl %%eax\n\t"\ "jns 1b\n\t"\ "5:"\ : : "b" (u[ch][div]), "d" (t_dewindow[0] + 16 - start), "S" (&sample_buffer[f>>(2-nch)][nch==2?0:(f&1?16:0)][ch]), "m" (sizeof(short) * nch), "m" (div), "m" (start)\ : "eax", "ecx", "memory"); #else { short *samples = (&sample_buffer[f>>(2-nch)][nch==2?0:(f&1?16:0)][ch]); int out, j; #define PUT_SAMPLE(out) \ if (out > 32767) \ *samples = 32767; \ else \ if (out < -32768) \ *samples = -32768; \ else \ *samples = out; \ \ samples += nch; #if defined(SUPERHACK) /* These is a simple implementation which should be nicer to the cache; computation of samples are done in one pass rather than two. However, for various reasons which I do not have time to investigate, it runs quite a lot slower than two pass computations. If you have time, you are welcome to look into it. */ { float (*u_ptr)[16] = u[ch][div]; const float *dewindow2 = t_dewindow[0] + start; { float outf1, outf2, outf3, outf4; outf1 = u_ptr[0][ 0] * dewindow[0x0]; outf2 = u_ptr[0][ 1] * dewindow[0x1]; outf3 = u_ptr[0][ 2] * dewindow[0x2]; outf4 = u_ptr[0][ 3] * dewindow[0x3]; outf1 += u_ptr[0][ 4] * dewindow[0x4]; outf2 += u_ptr[0][ 5] * dewindow[0x5]; outf3 += u_ptr[0][ 6] * dewindow[0x6]; outf4 += u_ptr[0][ 7] * dewindow[0x7]; outf1 += u_ptr[0][ 8] * dewindow[0x8]; outf2 += u_ptr[0][ 9] * dewindow[0x9]; outf3 += u_ptr[0][10] * dewindow[0xa]; outf4 += u_ptr[0][11] * dewindow[0xb]; outf1 += u_ptr[0][12] * dewindow[0xc]; outf2 += u_ptr[0][13] * dewindow[0xd]; outf3 += u_ptr[0][14] * dewindow[0xe]; outf4 += u_ptr[0][15] * dewindow[0xf]; out = outf1 + outf2 + outf3 + outf4; dewindow += 32; dewindow2 += 32; u_ptr++; if (out > 32767) samples[0] = 32767; else if (out < -32768) samples[0] = -32768; else samples[0] = out; } if (div & 0x1) { for (j = 1; j < 16; ++j) { float outf1, outf2, outf3, outf4; outf1 = u_ptr[0][ 0] * dewindow[0x0]; outf3 = u_ptr[0][ 0] * dewindow2[0xf]; 
          outf2 = u_ptr[0][ 1] * dewindow[0x1]; outf4 = u_ptr[0][ 1] * dewindow2[0xe];
          outf1 += u_ptr[0][ 2] * dewindow[0x2]; outf3 += u_ptr[0][ 2] * dewindow2[0xd];
          outf2 += u_ptr[0][ 3] * dewindow[0x3]; outf4 += u_ptr[0][ 3] * dewindow2[0xc];
          outf1 += u_ptr[0][ 4] * dewindow[0x4]; outf3 += u_ptr[0][ 4] * dewindow2[0xb];
          outf2 += u_ptr[0][ 5] * dewindow[0x5]; outf4 += u_ptr[0][ 5] * dewindow2[0xa];
          outf1 += u_ptr[0][ 6] * dewindow[0x6]; outf3 += u_ptr[0][ 6] * dewindow2[0x9];
          outf2 += u_ptr[0][ 7] * dewindow[0x7]; outf4 += u_ptr[0][ 7] * dewindow2[0x8];
          outf1 += u_ptr[0][ 8] * dewindow[0x8]; outf3 += u_ptr[0][ 8] * dewindow2[0x7];
          outf2 += u_ptr[0][ 9] * dewindow[0x9]; outf4 += u_ptr[0][ 9] * dewindow2[0x6];
          outf1 += u_ptr[0][10] * dewindow[0xa]; outf3 += u_ptr[0][10] * dewindow2[0x5];
          outf2 += u_ptr[0][11] * dewindow[0xb]; outf4 += u_ptr[0][11] * dewindow2[0x4];
          outf1 += u_ptr[0][12] * dewindow[0xc]; outf3 += u_ptr[0][12] * dewindow2[0x3];
          outf2 += u_ptr[0][13] * dewindow[0xd]; outf4 += u_ptr[0][13] * dewindow2[0x2];
          outf1 += u_ptr[0][14] * dewindow[0xe]; outf3 += u_ptr[0][14] * dewindow2[0x1];
          outf2 += u_ptr[0][15] * dewindow[0xf]; outf4 += u_ptr[0][15] * dewindow2[0x0];

          dewindow += 32;
          dewindow2 += 32;
          u_ptr++;

          out = outf1 + outf2;
          if (out > 32767)
            samples[j * 2] = 32767;
          else if (out < -32768)
            samples[j * 2] = -32768;
          else
            samples[j * 2] = out;

          out = outf4 - outf3;
          if (out > 32767)
            samples[64 - (j * 2)] = 32767;
          else if (out < -32768)
            samples[64 - (j * 2)] = -32768;
          else
            samples[64 - (j * 2)] = out;
        }

        {
          float outf2, outf4;

          outf2 = u_ptr[0][ 0] * dewindow[0x0];
          outf4 = u_ptr[0][ 2] * dewindow[0x2];
          outf2 += u_ptr[0][ 4] * dewindow[0x4];
          outf4 += u_ptr[0][ 6] * dewindow[0x6];
          outf2 += u_ptr[0][ 8] * dewindow[0x8];
          outf4 += u_ptr[0][10] * dewindow[0xa];
          outf2 += u_ptr[0][12] * dewindow[0xc];
          outf4 += u_ptr[0][14] * dewindow[0xe];
          out = outf2 + outf4;

          if (out > 32767)
            samples[16 * 2] = 32767;
          else if (out < -32768)
            samples[16 * 2] = -32768;
          else
            samples[16 * 2] = out;
        }
      }
      else {
        for (j = 1; j < 16; ++j) {
          float outf1, outf2, outf3, outf4;

          outf1 = u_ptr[0][ 0] * dewindow[0x0]; outf3 = u_ptr[0][ 0] * dewindow2[0xf];
          outf2 = u_ptr[0][ 1] * dewindow[0x1]; outf4 = u_ptr[0][ 1] * dewindow2[0xe];
          outf1 += u_ptr[0][ 2] * dewindow[0x2]; outf3 += u_ptr[0][ 2] * dewindow2[0xd];
          outf2 += u_ptr[0][ 3] * dewindow[0x3]; outf4 += u_ptr[0][ 3] * dewindow2[0xc];
          outf1 += u_ptr[0][ 4] * dewindow[0x4]; outf3 += u_ptr[0][ 4] * dewindow2[0xb];
          outf2 += u_ptr[0][ 5] * dewindow[0x5]; outf4 += u_ptr[0][ 5] * dewindow2[0xa];
          outf1 += u_ptr[0][ 6] * dewindow[0x6]; outf3 += u_ptr[0][ 6] * dewindow2[0x9];
          outf2 += u_ptr[0][ 7] * dewindow[0x7]; outf4 += u_ptr[0][ 7] * dewindow2[0x8];
          outf1 += u_ptr[0][ 8] * dewindow[0x8]; outf3 += u_ptr[0][ 8] * dewindow2[0x7];
          outf2 += u_ptr[0][ 9] * dewindow[0x9]; outf4 += u_ptr[0][ 9] * dewindow2[0x6];
          outf1 += u_ptr[0][10] * dewindow[0xa]; outf3 += u_ptr[0][10] * dewindow2[0x5];
          outf2 += u_ptr[0][11] * dewindow[0xb]; outf4 += u_ptr[0][11] * dewindow2[0x4];
          outf1 += u_ptr[0][12] * dewindow[0xc]; outf3 += u_ptr[0][12] * dewindow2[0x3];
          outf2 += u_ptr[0][13] * dewindow[0xd]; outf4 += u_ptr[0][13] * dewindow2[0x2];
          outf1 += u_ptr[0][14] * dewindow[0xe]; outf3 += u_ptr[0][14] * dewindow2[0x1];
          outf2 += u_ptr[0][15] * dewindow[0xf]; outf4 += u_ptr[0][15] * dewindow2[0x0];

          dewindow += 32;
          dewindow2 += 32;
          u_ptr++;

          out = outf1 + outf2;
          if (out > 32767)
            samples[j * 2] = 32767;
          else if (out < -32768)
            samples[j * 2] = -32768;
          else
            samples[j * 2] = out;

          out = outf3 - outf4;
          if (out > 32767)
            samples[64 - (j * 2)] = 32767;
          else if (out < -32768)
            samples[64 - (j * 2)] = -32768;
          else
            samples[64 - (j * 2)] = out;
        }

        {
          float outf2, outf4;

          outf2 = u_ptr[0][ 1] * dewindow[0x1];
          outf4 = u_ptr[0][ 3] * dewindow[0x3];
          outf2 += u_ptr[0][ 5] * dewindow[0x5];
          outf4 += u_ptr[0][ 7] * dewindow[0x7];
          outf2 += u_ptr[0][ 9] * dewindow[0x9];
          outf4 += u_ptr[0][11] * dewindow[0xb];
          outf2 += u_ptr[0][13] * dewindow[0xd];
          outf4 += u_ptr[0][15] * dewindow[0xf];
          out = outf2 + outf4;

          if (out > 32767)
            samples[16 * 2] = 32767;
          else if (out < -32768)
            samples[16 * 2] = -32768;
          else
            samples[16 * 2] = out;
        }
      }
    }
#elif defined(HAS_AUTOINCREMENT)
    const float *dewindow = t_dewindow[0] + 15 - start;

    /* This is tuned specifically for architectures with
       autoincrement and -decrement. */
    {
      float *u_ptr = (float*) u[ch][div];

      u_ptr--;

      for (j = 0; j < 16; ++j) {
        float outf1, outf2, outf3, outf4;

        outf1 = *++u_ptr * *++dewindow;
        outf2 = *++u_ptr * *++dewindow;
        outf3 = *++u_ptr * *++dewindow;
        outf4 = *++u_ptr * *++dewindow;
        outf1 += *++u_ptr * *++dewindow;
        outf2 += *++u_ptr * *++dewindow;
        outf3 += *++u_ptr * *++dewindow;
        outf4 += *++u_ptr * *++dewindow;
        outf1 += *++u_ptr * *++dewindow;
        outf2 += *++u_ptr * *++dewindow;
        outf3 += *++u_ptr * *++dewindow;
        outf4 += *++u_ptr * *++dewindow;
        outf1 += *++u_ptr * *++dewindow;
        outf2 += *++u_ptr * *++dewindow;
        outf3 += *++u_ptr * *++dewindow;
        outf4 += *++u_ptr * *++dewindow;
        out = outf1 + outf2 + outf3 + outf4;

        dewindow += 16;

        PUT_SAMPLE(out)
      }

      if (div & 0x1) {
        {
          float outf2, outf4;

          outf2 = u_ptr[ 1] * dewindow[0x1];
          outf4 = u_ptr[ 3] * dewindow[0x3];
          outf2 += u_ptr[ 5] * dewindow[0x5];
          outf4 += u_ptr[ 7] * dewindow[0x7];
          outf2 += u_ptr[ 9] * dewindow[0x9];
          outf4 += u_ptr[11] * dewindow[0xb];
          outf2 += u_ptr[13] * dewindow[0xd];
          outf4 += u_ptr[15] * dewindow[0xf];
          out = outf2 + outf4;

          PUT_SAMPLE(out)
        }

        dewindow -= 31;
        dewindow += start;
        dewindow += start;
        u_ptr -= 16;

        for (; j < 31; ++j) {
          float outf1, outf2, outf3, outf4;

          outf1 = *++u_ptr * *--dewindow;
          outf2 = *++u_ptr * *--dewindow;
          outf3 = *++u_ptr * *--dewindow;
          outf4 = *++u_ptr * *--dewindow;
          outf1 += *++u_ptr * *--dewindow;
          outf2 += *++u_ptr * *--dewindow;
          outf3 += *++u_ptr * *--dewindow;
          outf4 += *++u_ptr * *--dewindow;
          outf1 += *++u_ptr * *--dewindow;
          outf2 += *++u_ptr * *--dewindow;
          outf3 += *++u_ptr * *--dewindow;
          outf4 += *++u_ptr * *--dewindow;
          outf1 += *++u_ptr * *--dewindow;
          outf2 += *++u_ptr * *--dewindow;
          outf3 += *++u_ptr * *--dewindow;
          outf4 += *++u_ptr * *--dewindow;
          out = outf2 - outf1 + outf4 - outf3;

          dewindow -= 16;
          u_ptr -= 32;

          PUT_SAMPLE(out)
        }
      }
      else {
        {
          float outf2, outf4;

          outf2 = u_ptr[ 2] * dewindow[ 0x2];
          outf4 = u_ptr[ 4] * dewindow[ 0x4];
          outf2 += u_ptr[ 6] * dewindow[ 0x6];
          outf4 += u_ptr[ 8] * dewindow[ 0x8];
          outf2 += u_ptr[10] * dewindow[ 0xa];
          outf4 += u_ptr[12] * dewindow[ 0xc];
          outf2 += u_ptr[14] * dewindow[ 0xe];
          outf4 += u_ptr[16] * dewindow[0x10];
          out = outf2 + outf4;

          PUT_SAMPLE(out)
        }

        dewindow -= 31;
        dewindow += start;
        dewindow += start;
        u_ptr -= 16;

        for (; j < 31; ++j) {
          float outf1, outf2, outf3, outf4;

          outf1 = *++u_ptr * *--dewindow;
          outf2 = *++u_ptr * *--dewindow;
          outf3 = *++u_ptr * *--dewindow;
          outf4 = *++u_ptr * *--dewindow;
          outf1 += *++u_ptr * *--dewindow;
          outf2 += *++u_ptr * *--dewindow;
          outf3 += *++u_ptr * *--dewindow;
          outf4 += *++u_ptr * *--dewindow;
          outf1 += *++u_ptr * *--dewindow;
          outf2 += *++u_ptr * *--dewindow;
          outf3 += *++u_ptr * *--dewindow;
          outf4 += *++u_ptr * *--dewindow;
          outf1 += *++u_ptr * *--dewindow;
          outf2 += *++u_ptr * *--dewindow;
          outf3 += *++u_ptr * *--dewindow;
          outf4 += *++u_ptr * *--dewindow;
          out = outf1 - outf2 + outf3 - outf4;

          dewindow -= 16;
          u_ptr -= 32;

          PUT_SAMPLE(out)
        }
      }
    }
#else
    const float *dewindow = t_dewindow[0] + 16 - start;

    /* These optimisations are tuned specifically for architectures
       without autoincrement and -decrement.
    */
    {
      float (*u_ptr)[16] = u[ch][div];

      for (j = 0; j < 16; ++j) {
        float outf1, outf2, outf3, outf4;

        outf1 = u_ptr[0][ 0] * dewindow[0x0];
        outf2 = u_ptr[0][ 1] * dewindow[0x1];
        outf3 = u_ptr[0][ 2] * dewindow[0x2];
        outf4 = u_ptr[0][ 3] * dewindow[0x3];
        outf1 += u_ptr[0][ 4] * dewindow[0x4];
        outf2 += u_ptr[0][ 5] * dewindow[0x5];
        outf3 += u_ptr[0][ 6] * dewindow[0x6];
        outf4 += u_ptr[0][ 7] * dewindow[0x7];
        outf1 += u_ptr[0][ 8] * dewindow[0x8];
        outf2 += u_ptr[0][ 9] * dewindow[0x9];
        outf3 += u_ptr[0][10] * dewindow[0xa];
        outf4 += u_ptr[0][11] * dewindow[0xb];
        outf1 += u_ptr[0][12] * dewindow[0xc];
        outf2 += u_ptr[0][13] * dewindow[0xd];
        outf3 += u_ptr[0][14] * dewindow[0xe];
        outf4 += u_ptr[0][15] * dewindow[0xf];
        out = outf1 + outf2 + outf3 + outf4;

        dewindow += 32;
        u_ptr++;

        PUT_SAMPLE(out)
      }

      if (div & 0x1) {
        {
          float outf2, outf4;

          outf2 = u_ptr[0][ 0] * dewindow[0x0];
          outf4 = u_ptr[0][ 2] * dewindow[0x2];
          outf2 += u_ptr[0][ 4] * dewindow[0x4];
          outf4 += u_ptr[0][ 6] * dewindow[0x6];
          outf2 += u_ptr[0][ 8] * dewindow[0x8];
          outf4 += u_ptr[0][10] * dewindow[0xa];
          outf2 += u_ptr[0][12] * dewindow[0xc];
          outf4 += u_ptr[0][14] * dewindow[0xe];
          out = outf2 + outf4;

          PUT_SAMPLE(out)
        }

        dewindow -= 48;
        dewindow += start;
        dewindow += start;

        for (; j < 31; ++j) {
          float outf1, outf2, outf3, outf4;

          --u_ptr;
          outf1 = u_ptr[0][ 0] * dewindow[0xf];
          outf2 = u_ptr[0][ 1] * dewindow[0xe];
          outf3 = u_ptr[0][ 2] * dewindow[0xd];
          outf4 = u_ptr[0][ 3] * dewindow[0xc];
          outf1 += u_ptr[0][ 4] * dewindow[0xb];
          outf2 += u_ptr[0][ 5] * dewindow[0xa];
          outf3 += u_ptr[0][ 6] * dewindow[0x9];
          outf4 += u_ptr[0][ 7] * dewindow[0x8];
          outf1 += u_ptr[0][ 8] * dewindow[0x7];
          outf2 += u_ptr[0][ 9] * dewindow[0x6];
          outf3 += u_ptr[0][10] * dewindow[0x5];
          outf4 += u_ptr[0][11] * dewindow[0x4];
          outf1 += u_ptr[0][12] * dewindow[0x3];
          outf2 += u_ptr[0][13] * dewindow[0x2];
          outf3 += u_ptr[0][14] * dewindow[0x1];
          outf4 += u_ptr[0][15] * dewindow[0x0];
          out = -outf1 + outf2 - outf3 + outf4;

          dewindow -= 32;

          PUT_SAMPLE(out)
        }
      }
      else {
        {
          float outf2, outf4;

          outf2 = u_ptr[0][ 1] * dewindow[0x1];
          outf4 = u_ptr[0][ 3] * dewindow[0x3];
          outf2 += u_ptr[0][ 5] * dewindow[0x5];
          outf4 += u_ptr[0][ 7] * dewindow[0x7];
          outf2 += u_ptr[0][ 9] * dewindow[0x9];
          outf4 += u_ptr[0][11] * dewindow[0xb];
          outf2 += u_ptr[0][13] * dewindow[0xd];
          outf4 += u_ptr[0][15] * dewindow[0xf];
          out = outf2 + outf4;

          PUT_SAMPLE(out)
        }

        dewindow -= 48;
        dewindow += start;
        dewindow += start;

        for (; j < 31; ++j) {
          float outf1, outf2, outf3, outf4;

          --u_ptr;
          outf1 = u_ptr[0][ 0] * dewindow[0xf];
          outf2 = u_ptr[0][ 1] * dewindow[0xe];
          outf3 = u_ptr[0][ 2] * dewindow[0xd];
          outf4 = u_ptr[0][ 3] * dewindow[0xc];
          outf1 += u_ptr[0][ 4] * dewindow[0xb];
          outf2 += u_ptr[0][ 5] * dewindow[0xa];
          outf3 += u_ptr[0][ 6] * dewindow[0x9];
          outf4 += u_ptr[0][ 7] * dewindow[0x8];
          outf1 += u_ptr[0][ 8] * dewindow[0x7];
          outf2 += u_ptr[0][ 9] * dewindow[0x6];
          outf3 += u_ptr[0][10] * dewindow[0x5];
          outf4 += u_ptr[0][11] * dewindow[0x4];
          outf1 += u_ptr[0][12] * dewindow[0x3];
          outf2 += u_ptr[0][13] * dewindow[0x2];
          outf3 += u_ptr[0][14] * dewindow[0x1];
          outf4 += u_ptr[0][15] * dewindow[0x0];
          out = outf1 - outf2 + outf3 - outf4;

          dewindow -= 32;

          PUT_SAMPLE(out)
        }
      }
    }
#endif
  }
#endif

  --u_start[ch];
  u_start[ch] &= 0xf;
  u_div[ch]=u_div[ch] ? 0 : 1;

#if defined(PENTIUM_RDTSC)
  __asm__(".byte 0x0f,0x31" : "=a" (cnt2), "=d" (cnt4));

  if (cnt2-cnt1 < min_cycles) {
    min_cycles = cnt2-cnt1;
    printf("%d, %d cycles, %d\n", cnt3-cnt1, min_cycles, start);
  }
#endif
}

void premultiply()
{
  int i,t;

  for (i = 0; i < 17; ++i)
    for (t = 0; t < 32; ++t)
      t_dewindow[i][t] *= 16383.5f;
}
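
For anyone who does not want to trace the unrolled code above by hand: as far as I can tell, each output sample in the first loop of the plain C branch is a 16-tap dot product of one row of u with a slice of the dewindow table, summed in four independent accumulators, truncated to int and clipped to the signed 16-bit range by PUT_SAMPLE. Below is a minimal sketch of that idea only. The names clip16, window16 and synth_block are mine and do not exist in amp, the loop is left rolled, and the strides (16 floats per u row, 32 floats per window row) are assumptions copied from the "dewindow += 32; u_ptr++" step of the branch for machines without autoincrement.

```c
#include <assert.h>

/* Clip a windowed sum to the signed 16-bit range, like PUT_SAMPLE. */
static short clip16(int out)
{
    if (out > 32767)
        return 32767;
    if (out < -32768)
        return -32768;
    return (short)out;
}

/* One output sample: a 16-tap dot product kept in four independent
   accumulators, mirroring outf1..outf4 in the unrolled code.
   The float sum is truncated to int, as the assignment to "out" does. */
static int window16(const float *u, const float *win)
{
    float acc[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    int k;

    for (k = 0; k < 16; ++k)
        acc[k & 3] += u[k] * win[k];

    return (int)(acc[0] + acc[1] + acc[2] + acc[3]);
}

/* Produce n samples for one channel of an interleaved buffer.
   u holds n rows of 16 floats; win must hold at least
   32 * (n - 1) + 16 floats (hypothetical layout, see the lead-in). */
static void synth_block(short *samples, int nch,
                        const float *u, const float *win, int n)
{
    int j;

    assert(nch == 1 || nch == 2);

    for (j = 0; j < n; ++j) {
        *samples = clip16(window16(u + 16 * j, win + 32 * j));
        samples += nch;   /* skip the other channel when nch == 2 */
    }
}
```

With nch == 2 the function only touches every second slot, so the caller runs it once per channel with samples pointing at slot 0 or slot 1. The second half of the real routine (j = 16..30) walks the window backwards with alternating signs, which this sketch leaves out.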
here's Justins mp2.5 patch for anyone caring to try it out. Please report success/failure if you've tried it. Tomislav begin 755 mp25pat.gz M'XL("% *30" S N-RXW+6UP96,20>/?NP[F, M--+T]/1[>GKD^<,A5&=0K49L,(MB_PO#=L!NJT-_S*!6.W GTZI6LVOV@3OS M_+ V@%K:*E:KU;4@A=,P@ LV!=T"76];C;:!C5;++BJ*LAA?^,@\#F;40:NW M]7J[;@NPER^AVFBI-BC\^O)E$0J%27Q3+MW#G@%:60'NZ5JG ?;%:*,1L$ 9>W)$]^^41OMY24Q^,$M8W)'S'S0T_CHE1]YA#V0'%^NG/:^]IQG>=4F5XU4)B (F;1A- M4]4U4.C>$ J);_UD,((RBR(2+J(:N#&#G[N7;X[/G>[Y>1L\GY5+V!]&$"'9 M2"U,INP&^GX2)_AF4@/VU4_P?>TJ*'%=TE\?NSX?DKJR&-]=0!MNW2@HEY*1 M'P/^N\#-U _@]'WW9S!J%@S#:.(F*MR.?"0.88(P 8\-_8!Y8HXJ32$1]>^@ M=W%V0*-KT$L(_JKD0CQE ]\=PTGDSH)1.&21Q'M5JJTB046 %S(Q53R;3L,H M 4ZB&%2#.(RBN^S(E,4EH9V\U5,64[)@[-[A]'K*BL2/GH.TT\SE.Y94UHEO M&??)7'QH,4S2ME %2C-B^2GRA),9F%J#FX&I&70G.[B?.]_1A^/>F?/^[=&O M%0":GK_F+U]].#GIGCL7O7]WT?@K7 @\"IQ-65#^$5]3R=>VQ[6#-ZE2G'/' MG80>>]8Q>6@YPE!@!!\/W#$; MN@-G_,FXYO\&1?%U$+&$T,WK3R8"53- B=-O"A2F1+'2&\M>G<8J:\=FABMK MAV$>)0MX"1H?O(CYY\?>MU+5283K]]1\^3L_/3HTOEX]$L7GXX_ MG+YW7O7>'9W_>K@T]N-Y[[+K7)Z)8*AF7ISTWG8%Z[K>4%MHB;JMZG7)_3+; M"WZ_S2AT**^Y!\/0+ U4M'Q-H[O>P/N#2EWUNLY?U9NBRS2HJZ@\*AF.6RG< MP\'^8GWE\(QT'9E@X+>%6H2DF5Z@''W&RPH.:Q:,=Z#5M M"9< W\0([]S$"F6 #UR1.,7/'WJ J4L2A6-T9PR5A)O+W6JH!KJBWC#4NC [ M!$]&#'TX1D\;]FF1O<&0A=HDJ0S\9'P'&'JF-S6C68-PR*'CQ T\-Z+E43IG M$D8Q3@[[M*HB#*ZK8<#X@HKL3S%"NGUPQW'(@9!3G/<2,6%T@,"6&)"(]A'H!*Z1US&L3\8E;D9,SB'#X#\KEMA4LB)2'$ M2R0'9T?I^5%Z18B&J#&0-HM,@-1/+ZOK@ESGGEL4)PA>U9I]GLKUNJ_1&'7; MJ)I\Y+VEZKJ*KF:8JM%234NMFZJ%2:REVBVU1;WX,UNJCA$)UPJ$0UC,$;F!4(:-%!(T4OL(LI3E)G9I^,-8PC%Q^2$@DF6U M)1WVG&@NSG116,D(.GPAN$=Q9]:"0XQA&*Z$C#'SZ6/C#@-8\/<$ W"$"2!N M?61#&F2:A%(8O%CV M.&Y-Y9,\X/-CAEE.8_0!AL'*?U. NP2RIZ8Q([LGC&;?0/<]\>S;U.T:Z./A>]5Z&\$\'6H-X?PU'[2#07F6S=QF5K8&*B#D1O! M_CZVO]#&_4OH>PCF)[X[]F/F>&R ._ZH3.^Q?Y>=O#>;3'EE6#16]_'B?>$D M\OG^'' ;WVCC%MUJ9;?Q$FJI+-R80U$27V\BKZ#@S=!Y"L_9P&WHK2/2R?*Z M+?R^Z*OPJ#"OI'/AS<&^1EBVX4<3)Y%.3EX]%)8_O@"L@*COTFD-WZGFQ MI/#Y-)XFZ>C41V4P_F1003P/R;ZFL$A,1R. 
%)Q>6,NXJ1C5Z9@5*M ,IG>H MY(E:FH1!*,J>?&RV*\8].1.=V#V-D/MAN7057!7A?11.693X+&YCH(6]&/:\ M-[]35_5UR&O4IRP9A5Z;!]@]W<.-$N<7FP2E;(#"7PX.7@ECX;/01-YG-)^# M^"HH75%U4"5"A[$Z5[PJK:O"SP^RW5F=9:!VK+>)BF_,'77>7O75>5?A O>Q MI^X=F#KH6KMNMZUFUET7@'F/U8RVD?%8@U=Z^)7\E=?,6$"L^4([^-09)FP\ M+H>SQ"&Z*]5ZG2KE/^Q92N&)W8C*G;?H_0M2Z=WP-O(35O9CW.I[*D9I3U=*\:SFHMMJZF=72 C"K)1-UV;8RQVVX4C= P:O0$M^* MQXF;^ /4&(J!B4 [=2-<*K:(M$"A5I&F#SG'=R1)99V$LPKR"("(17,88QU, MO@Z;1YB6X'1=UN#2 \:U01[#E 5__)'M7S4,'IED';Q*L6_!.:8SYS/4,=H# M"O(++K7>HI@F,,KBUVJ U*CV3K$@'V*1G+PPGHEW! EKF8 7T*SD*%20H0,J MTH#K>3X)RAWC6L &GR$)8>)^QG ZFD5,%+-X&8UB7L"^L A0FC%,?#2/SRQ MX1=DXXN\4HU*Y6:M"#R2Q.B )QPA;Y-R.!/?$GK%8D' =R[8"\U%"6 M>1,29QL;3(96IR5YY0XZ0-B1*>T(\WUI1C1S&4%1U1V* ./^,"K7*Y5G'6V. M472+Z(>"*Q\>RN/IVQ'%@W(YSIBL@9I[U@'MZW XE&!\DA@-!-^Q]%TAYYO/ MD\DT/6U<6!9&0DSX[ARN:0'3R= U/VD5 RFC3$E:T*-+#B'6UJTS+;67(1#LTZ+ M%K_.M;K)4I#)4>S_SA1*=$BE3Y5IW*^?N2PX-AG!DK\KZE ;:Q K^[R9@+DS$WFHRY8C+U5MLT5TS&7#49 MJVT8F0]6='ZBI:OSHP51'.\39V(.;$%Q<3K@&JH7$&7>>RX+GEKQ]TKC(45AR M9+S32&7#G#M(=XYG'06[XA%B\EA_=L,WX&*QY^+9Q>&F8]/? 8W/ZM.IZN!C6A-M12[H<-3?& MY8T^DZ*B^SKO*!06(?D1)*3.+= M&"-_$P.V\[7"KLY6**19S1(P*$ 33RFC M1 )Q$%3%RBBS;IOO!&U[_DD?YGMQ"+<,PR.:%B:>Z)Q]%W-0"I4<,>*L0_^. M#B,5T'D+RGX _,LF.1/<8G#EIZ$U@?3@_WI^ CWOY/Y1TI\-N>_+UJKCRX[< MQZ&8G&EM=.J,UZ=@*RY?S]043"IY6IB@B;NPID*9/B[,5HPH4<6=V4^0UH.@ M#9FJ42&[9<.TM%IP,8>&SOK/(K+(24O9YPUKZO;H2/T[3R#24C\<).,R_P3* M&7HJ7+P[?GWYUCF^>.]LWR$KR?50%@+C! $C%4H?W2A R#9PR)@^JTQ' WD"?6T\ M_[2SQ(VL-''ILU>&.R_< WPAD*N@I +?,3V%3,2'?G^.',1'?MP