New to this forum, hoping I can get some help. I have a real-time media server that supports a number of RTP media stream formats. I need to record these to MP4 (we also support video, though that is not important for this issue) or to M4A containing AAC. The resulting file must be playable both for real-time streaming over RTP and in standard media players, including browsers. The input RTP can use various codecs and sample rates (a limited set from 8 kHz to 48 kHz), and may contain 1 or 2 channels. Whatever the input, it is depacketized and decoded to 16-bit linear PCM in 10 msec chunks/frames.
I am attempting to use libfdk-aac for encoding and libavformat for creating the MP4 or M4A container and writing the frames to it.
My limited understanding is that I need to encode to AAC-LC so that the resulting file can be played back in standard players. It seems the libfdk-aac encoder, configured for LC, is limited to frames of 1024 samples. Is this true? I'm used to other encoders, which tend to work in frame sizes that are multiples of 10 msec, and those sizes fit well within our real-time architecture. Using the default of 1024, the resulting MP4 contains frames whose length (in samples and in time) varies with the input sample rate. For 16 kHz mono input (160 samples every 10 msec), the resulting MP4 frames are 1280 samples, or 80 msec. For 48 kHz mono input (480 samples every 10 msec), the resulting MP4 frames are 1920 samples, or 40 msec. I'm not sure whether this makes sense, or whether there is some way to configure the encoder so that the resulting frames are 10 or 20 msec regardless of the input sample rate. The encoder header documents the frame length parameter as follows:
AACENC_GRANULE_LENGTH = 0x0105, /*!< Core encoder (AAC) audio frame length in samples:
- 1024: Default configuration.
- 512: Default LD/ELD configuration.
- 480: Optional length in LD/ELD configuration. */
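The durations in question follow directly from frame length divided by sample rate. A minimal sketch of that arithmetic; the helper names are hypothetical, and the list of sample rates is an assumed example within the 8 kHz to 48 kHz range:

```c
#include <assert.h>
#include <stdio.h>

/* Duration in milliseconds of one encoder frame of frame_len samples
 * (per channel) at sample_rate Hz. */
double frame_ms(unsigned frame_len, unsigned sample_rate)
{
    assert(sample_rate > 0);
    return 1000.0 * frame_len / sample_rate;
}

/* Returns 1 if the frame is a whole multiple of 10 ms, i.e. an integer
 * number of 10 msec PCM chunks (sample_rate / 100 samples each). */
int is_multiple_of_10ms(unsigned frame_len, unsigned sample_rate)
{
    assert(sample_rate > 0);
    return (100UL * frame_len) % sample_rate == 0;
}

/* Print durations for the granule lengths listed in the header
 * (1024 for LC, 512 or 480 for LD/ELD) at a few assumed rates. */
void print_frame_table(void)
{
    static const unsigned rates[] = { 8000, 16000, 32000, 44100, 48000 };
    static const unsigned lens[] = { 1024, 512, 480 };
    for (size_t r = 0; r < sizeof rates / sizeof rates[0]; r++)
        for (size_t l = 0; l < sizeof lens / sizeof lens[0]; l++)
            printf("%5u Hz, %4u samples: %7.3f ms%s\n", rates[r], lens[l],
                   frame_ms(lens[l], rates[r]),
                   is_multiple_of_10ms(lens[l], rates[r]) ? " (multiple of 10 ms)" : "");
}
```

At 16 kHz a 1024-sample frame lasts 64 msec (6.4 input chunks) and at 48 kHz about 21.333 msec. At these rates 1024 samples is never a whole multiple of 10 msec, while the 480-sample LD/ELD length is at 8, 16 and 48 kHz (exactly 10 msec at 48 kHz).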
Any feedback or pointers to additional info are greatly appreciated.
Thanks in advance,
Transform codecs (MP3, AAC-LC, Vorbis, etc.) use fixed-length transforms of one or two sizes: usually a long one of about 1024 samples, plus a shorter one, several of which can be placed in a row to make up one 1024-sample block. Because the transform length is fixed in samples, the frame duration generally won't be a round number of milliseconds like 10 ms.
Shorter transform lengths are possible with AAC-LD (low delay), which uses 512- or 480-sample frames, but most normal decoders only support AAC-LC and HE-AAC, so you will probably need to use 1024-sample frames if you expect people to decode your streams with their own software.
Thanks Saratoga, this makes more sense. Assuming I am now setting up the encoder properly, and that the encoder buffers the 16 kHz input until it has 1024 samples, I would expect each output frame to be written (using libavformat) with the encoder's frame size. However, the resulting file does not sound good and seems to play back too fast in standard players. The ffprobe frame dump shows a pkt_duration of 1280 samples (eight 10 msec input frames at 16 kHz, or 80 msec), but nb_samples is 1024. This seems inconsistent, but I am not sure it is the cause of the playback issues.
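The pkt_duration/nb_samples mismatch can be checked against the expected timestamp arithmetic. A sketch, assuming the audio stream's time_base is 1/sample_rate (the muxer may change it in avformat_write_header(), which is why a rescale step is included); both function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* pts of the n-th encoded AAC packet in a 1/sample_rate time base.
 * Each AAC-LC packet covers frame_len samples per channel (1024),
 * independent of how many 10 msec input chunks were fed in. */
int64_t aac_packet_pts(int64_t packet_index, int64_t frame_len)
{
    assert(packet_index >= 0 && frame_len > 0);
    return packet_index * frame_len;
}

/* Rescale a non-negative timestamp from 1/from_rate units to 1/to_rate
 * units, rounding to nearest. A hand-rolled stand-in for what
 * av_rescale_q() does with time bases {1, from_rate} and {1, to_rate}. */
int64_t rescale_ts(int64_t ts, int64_t from_rate, int64_t to_rate)
{
    assert(ts >= 0 && from_rate > 0 && to_rate > 0);
    return (ts * to_rate + from_rate / 2) / from_rate;
}
```

With a 1/16000 time base, each 1024-sample packet should advance pts by 1024 (64 msec). A pkt_duration of 1280 (80 msec) on packets that decode to 1024 samples suggests the timestamps are derived from the count of 10 msec input chunks rather than from the encoder's frame length. In libavformat terms, pts, dts and duration on each AVPacket would be set from values like these and converted with av_packet_rescale_ts() before av_interleaved_write_frame().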
Perhaps my assumption that the encoder buffers the input until it has a complete 1024-sample frame is wrong, and I need to do this buffering myself and only call the encoder with full 1024-sample frames?
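If the encoder is fed only full frames, the buffering can be done with a small FIFO. A minimal sketch, assuming interleaved 16-bit PCM; PcmFifo and its functions are hypothetical names, not part of libfdk-aac or FFmpeg (libavutil's AVAudioFifo, with av_audio_fifo_write() and av_audio_fifo_read(), is an existing alternative). Each popped frame would then be passed to the encoder (aacEncEncode() in libfdk-aac), and the resulting packet timestamped as one frame_len step.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: accumulates interleaved 16-bit PCM (e.g. 10 msec
 * chunks) and hands it out in fixed encoder-sized frames. */
typedef struct {
    int16_t *buf;        /* interleaved samples */
    size_t   channels;
    size_t   frame_len;  /* samples per channel per encoder frame */
    size_t   capacity;   /* samples per channel the buffer can hold */
    size_t   fill;       /* samples per channel currently buffered */
} PcmFifo;

/* Returns 1 on success, 0 if allocation fails. */
int pcm_fifo_init(PcmFifo *f, size_t channels, size_t frame_len, size_t max_chunk)
{
    f->channels = channels;
    f->frame_len = frame_len;
    /* Worst case: frame_len - 1 samples left over plus one new chunk. */
    f->capacity = frame_len + max_chunk;
    f->fill = 0;
    f->buf = malloc(f->capacity * channels * sizeof *f->buf);
    return f->buf != NULL;
}

void pcm_fifo_free(PcmFifo *f)
{
    free(f->buf);
    f->buf = NULL;
}

/* Append n samples per channel. The caller drains all full frames
 * between pushes so the buffer never overflows. */
void pcm_fifo_push(PcmFifo *f, const int16_t *pcm, size_t n)
{
    assert(f->fill + n <= f->capacity);
    memcpy(f->buf + f->fill * f->channels, pcm, n * f->channels * sizeof *pcm);
    f->fill += n;
}

/* Copy out one full frame if available; returns 1 on success, 0 otherwise. */
int pcm_fifo_pop_frame(PcmFifo *f, int16_t *out)
{
    if (f->fill < f->frame_len)
        return 0;
    size_t frame_vals = f->frame_len * f->channels;
    memcpy(out, f->buf, frame_vals * sizeof *out);
    f->fill -= f->frame_len;
    /* Shift the leftover samples to the front of the buffer. */
    memmove(f->buf, f->buf + frame_vals, f->fill * f->channels * sizeof *f->buf);
    return 1;
}
```

For 16 kHz mono input, pcm_fifo_init(&f, 1, 1024, 160) and pushing one 160-sample chunk every 10 msec yields a frame after the 7th chunk, with 96 samples carried over into the next frame.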
I only worked on porting parts of ffmpeg's decoder to libfaad, so I can't say much about ffmpeg's APIs; I have never used them.
You might want to ask on the ffmpeg list.