Question 1: What is needed for streaming?
In my opinion, a file format that supports streaming must allow the decoder to start decoding at an arbitrary position within the stream. That means that all info about the stream that the decoder needs to know must be repeated periodically throughout the stream.
A generic solution would be to interleave stream info and audio data at an arbitrary, user-definable ratio, for example:
(SI=stream info frame, A=audio frame)
ratio SI:A = 1:1:  SI A SI A SI A SI A SI ...
- lowest streaming delay
- biggest storage overhead
- purpose: broadcast (streaming)
ratio SI:A = 1:5:  SI A A A A A SI A A A A A SI ...
- higher streaming delay
- smaller storage overhead
- purpose: local storage for playback (quick seeking)
ratio SI:A = 1:n:  SI A A A A A A A A A A ...
- no streaming support
- no storage overhead
- purpose: archiving (slow seeking)
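The interleaving scheme above can be sketched in a few lines of Python; the frame representation and the ratio=None convention for the archiving case are my own illustrative choices, not part of any format proposal:

```python
def interleave(audio_frames, ratio):
    """Interleave stream-info ('SI') markers with audio frames.

    ratio=1  -> SI A SI A ...        (broadcast)
    ratio=5  -> SI A A A A A SI ...  (local playback)
    ratio=None -> one SI at the very start only (archiving case)
    """
    out = []
    for i, frame in enumerate(audio_frames):
        if ratio is None:
            if i == 0:
                out.append('SI')          # single SI frame for the whole file
        elif i % ratio == 0:
            out.append('SI')              # repeat SI every `ratio` audio frames
        out.append(frame)
    return out
```

The storage overhead and the streaming delay both scale directly with how often the SI marker is emitted, which is exactly the trade-off the three ratios illustrate.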
The stream info frames could contain info like this: (just an incomplete example!)
- sync code
- SI frame CRC
- audio stream info (sample bit width, sampling frequency, etc.)
- current position within stream (timestamp and/or sample number)
- metadata (artist, title, etc.)
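As a sketch only, such an SI frame header could be packed like this; the sync value, all field widths, and the CRC-8 polynomial are assumptions for illustration, not proposed values:

```python
import struct

SYNC = 0xFFF8  # hypothetical 16-bit sync code

def crc8(data, poly=0x07):
    # Straightforward bitwise CRC-8; any CRC would serve the same purpose.
    crc = 0
    for b in data:
        crc ^= b
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def pack_si_header(bits_per_sample, sample_rate, sample_number):
    # Body: 8-bit sample width, 32-bit sampling frequency, 64-bit sample number.
    body = struct.pack('>BIQ', bits_per_sample, sample_rate, sample_number)
    # Sync code, then the CRC over the body, then the body itself.
    return struct.pack('>H', SYNC) + bytes([crc8(body)]) + body
```

Variable-length fields such as the metadata (artist, title, etc.) would follow the fixed header in a real design.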
The sync code together with the SI frame CRC lowers the chance of a false positive match to practically zero.
A seek table and other non-streaming info (cue sheets, album cover JPEGs, etc.) could be included only in an additional info frame at the start of the file. The seek table would allow players to skip as many SI frames as possible (depending on the precision/number of entries in the seek table) to reach the target position within the stream. This quick-seeking feature relies on the presence of SI frames, because only those carry sync codes that allow the decoder to re-sync with the audio stream. Files without any SI frames (purpose: archiving), on the other hand, would require the player to do slow seeking, i.e. to decode the entire audio stream up to the target position.
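The quick-seeking path could look like this sketch, assuming a seek table of (sample_number, byte_offset) pairs that each point at an SI frame; both the table shape and the function name are hypothetical:

```python
import bisect

def seek_offset(seek_table, target_sample):
    """Byte offset of the last SI frame at or before target_sample.

    seek_table: sorted list of (sample_number, byte_offset) pairs, each
    pointing at an SI frame. The player jumps to that offset, re-syncs on
    the SI frame, and decodes forward to the exact target sample.
    """
    samples = [s for s, _ in seek_table]
    i = bisect.bisect_right(samples, target_sample) - 1
    if i < 0:
        return 0  # no usable entry: decode from the start (slow seek)
    return seek_table[i][1]
```

The precision of the table bounds how much audio must be decoded and discarded after the jump; a file with no SI frames at all degenerates to decoding from the beginning.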