The decoding library would make no difference whatsoever. Is your compile using the nasm code? If not, that would account for most of the difference.
I have now compiled with NASM and there was indeed an improvement: from 35 seconds down to 26. That is still far from the 8 seconds of the RareWares exe, though. Below is the output for the same WAV encoded with my executable and then with the one from RareWares. Note the big speed difference in "play/CPU" (about 5.5x vs 16.7x) and in "CPU time/estim" (0:26 vs 0:08):
C:\>lame -b 320 IN.wav OUT.mp3
LAME 3.100 32bits (http://lame.sf.net)
CPU features: MMX (ASM used), SSE (ASM used), SSE2
Using polyphase lowpass filter, transition band: 20094 Hz - 20627 Hz
Encoding IN.wav to OUT.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III (4.4x) 320 kbps qval=3
Frame | CPU time/estim | REAL time/estim | play/CPU | ETA
5624/5624 (100%)| 0:26/ 0:26| 0:00/ 0:00| 5.5449x| 0:00 h
-------------------------------------------------------------------------------
kbps LR MS % long switch short %
320.0 93.1 6.9 91.0 5.2 3.8
Writing LAME Tag...done
ReplayGain: -3.2dB
C:\>rlame -b 320 IN.wav OUT.mp3
LAME 3.100.1 32bits (https://lame.sourceforge.io)
CPU features: MMX (ASM used), SSE (ASM used), SSE2
Using polyphase lowpass filter, transition band: 20094 Hz - 20627 Hz
Encoding IN.wav to OUT.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III (4.4x) 320 kbps qval=3
Frame | CPU time/estim | REAL time/estim | play/CPU | ETA
5624/5624 (100%)| 0:08/ 0:08| 0:08/ 0:08| 16.744x| 0:00
-------------------------------------------------------------------------------
kbps LR MS % long switch short %
320.0 93.1 6.9 91.0 5.2 3.8
Writing LAME Tag...done
ReplayGain: -3.2dB
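As a sanity check on the two logs above, here is a small Python sketch that recomputes the CPU time from the frame count and the "play/CPU" figure. It assumes "play/CPU" means playback duration divided by CPU time, which is my reading of the column name and not something taken from the LAME source. The 1152 samples per frame is the standard MPEG-1 Layer III frame size; the function names are made up for this post.

```python
# Cross-check of the "play/CPU" column using the numbers from the two runs.
# Assumption: play/CPU = playback duration / CPU time.

SAMPLES_PER_FRAME = 1152  # MPEG-1 Layer III frame size, samples per channel


def playback_seconds(frames: int, sample_rate_hz: int) -> float:
    """Duration of the encoded audio in seconds."""
    return frames * SAMPLES_PER_FRAME / sample_rate_hz


def cpu_seconds(frames: int, sample_rate_hz: int, play_cpu: float) -> float:
    """CPU time implied by a given play/CPU ratio."""
    return playback_seconds(frames, sample_rate_hz) / play_cpu


if __name__ == "__main__":
    frames, rate = 5624, 44100          # from both logs
    mine, rarewares = 5.5449, 16.744    # play/CPU of each build

    print(f"playback length : {playback_seconds(frames, rate):.1f} s")
    print(f"my build        : {cpu_seconds(frames, rate, mine):.1f} s CPU")
    print(f"RareWares build : {cpu_seconds(frames, rate, rarewares):.1f} s CPU")
    print(f"speed ratio     : {rarewares / mine:.2f}x")
```

The file is about 147 seconds long, so the two ratios work out to roughly 26.5 s and 8.8 s of CPU time, which matches the 0:26 and 0:08 in the logs. So the RareWares build is about 3 times faster on the same input with the same settings.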