I know there are already a few questions like this here on SO; however, they do not fully explain the formulas presented in the answers.
I'm writing a parser that should be able to process MPEG-1, 2 and 2.5 Audio Layer I, II and III frame headers. The goal is to calculate the exact size of the frame, including the header, the CRC (if present) and any data or metadata of this frame (basically the number of bytes between the start of one header and the start of the next one).
One of the code snippets/formulas commonly seen on the internet to achieve this is (in no specific programming language):
padding = doesThisFramehavePadding ? 1 : 0;
coefficient = sampleCount / 8;
// makes sense to me. the slot size seems to be the smallest addressable space in an mp3 frame
// and is thus important for padding.
slotSize = mpegLayer == Layer1 ? 4 : 1;
// all fine here. bitRate / sampleRate yields bits per sample, multiplied by that weird
// coefficient from earlier probably gives us <total bytes> per <all samples in this frame>.
// then add padding times slotSize.
frameSizeInBytes = ((coefficient * bitRate / sampleRate) + padding) * slotSize;
I have multiple questions regarding the above code snippet:
- sampleCount / 8: it's probably just something used to convert the units from bits to bytes in the final calculation, right?
- If (coefficient * bitRate / sampleRate) already yields something in bytes, what would multiplying it with the slot size achieve for Audio Layer I specifically? Wouldn't this imply that the unit of (coefficient * bitRate / sampleRate) should have been "slots" earlier, not "bytes"? If so, then what does the coefficient do, i.e. why divide by 8, even for Audio Layer I frames? Is this even correct?
- Is frameSizeInBytes the size of the whole frame, or does the result indicate the length of the frame data/body?

Basically, all these sub-questions can be summarized as:
What is the formula to calculate the total and exact length of the current frame in bytes, including the header and things like the CRC, Xing and LAME metadata frames, and other eventualities?
I wrote this in Delphi/Pascal, and the function returns either 0 for a bad frame or its exact size in bytes. It is based on multiple websites: the first two illustrate and explain an MPEG audio frame header with full precision, while the third has crucial additions like the formula(s):
For Layer I files use this formula:
FrameLengthInBytes = (12 * BitRate / SampleRate + Padding) * 4
For Layer II & III files use this formula:
FrameLengthInBytes = 144 * BitRate / SampleRate + Padding
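The two quoted formulas can be checked with a short Python sketch. The function name frame_length and its parameters are hypothetical. It truncates the slot count with integer division before adding padding and, like the Delphi code below, uses 72 instead of 144 for MPEG-2/2.5 Layer III frames. Bitrates are in bit/s, not kbit/s.

```python
def frame_length(layer: int, mpeg1: bool, bitrate_bps: int,
                 sample_rate_hz: int, padding: int) -> int:
    """Total frame length in bytes, header and CRC included (sketch).

    Free-format frames (bitrate 0) cannot be sized this way.
    """
    if layer == 1:
        # Layer I counts in 4-byte slots: 384 samples / 8 bits / 4 bytes = 12.
        slots = 12 * bitrate_bps // sample_rate_hz  # truncate to whole slots
        return (slots + padding) * 4
    # Layer II and III count in 1-byte slots: 1152 / 8 = 144,
    # but MPEG-2/2.5 Layer III frames hold only 576 samples: 576 / 8 = 72.
    coefficient = 72 if (layer == 3 and not mpeg1) else 144
    return coefficient * bitrate_bps // sample_rate_hz + padding
```

For example, frame_length(3, True, 128000, 44100, 0) gives 417, and with padding 1 it gives 418.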
const
MPEG_BITRATE: Array[0.. 1, 1.. 3, 0.. 14] of Word= // MPEG 2 & 2.5/1, Layer III/II/I, in kbit/s
( ( ( 0, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160 ) // 2 Layer III
, ( 0, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160 ) // 2 Layer II
, ( 0, 32, 48, 56, 64, 80, 96, 112, 128, 144, 160, 176, 192, 224, 256 ) // 2 Layer I
)
, ( ( 0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320 ) // 1 Layer III
, ( 0, 32, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 384 ) // 1 Layer II
, ( 0, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448 ) // 1 Layer I
)
);
MPEG_SAMPLERATE: Array[0.. 3, 0.. 2] of Word= // MPEG 2.5/reserved/2/1, in Hz
( ( 11025, 12000, 8000 )
, ( 0, 0, 0 )
, ( 22050, 24000, 16000 )
, ( 44100, 48000, 32000 )
);
// Read the next 4 bytes from the stream and give back the total length of this
// frame in bytes as a positive 16-bit value. The length covers the whole frame:
// the 4 header bytes, the optional 2 CRC bytes and the payload. If fewer than 4
// bytes can be read, or a non-standard condition is met, the function exits with
// size 0, indicating a bad frame.
function IsValidMpegHeader( oIn: TStream ): Word;
var
aHead: Array[1.. 4] of Byte; // 4 bytes.
iBitRateKilo, iSampleRate: Word; // 16-bit; looked up from the array constants above.
iPadding, iSlotSize, iSamples: Byte; // 8-bit.
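begin
result:= 0; // Function results are not initialized in Delphi, so every "exit" below returns 0.
if oIn.Read( aHead[1], 4 )< 4 then exit; // Read next 4 bytes into array; too few = no frame.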
// 11 bits sync:
if (aHead[1]<> $FF) then exit; // First 8 bits.
if (aHead[2] and $E0)<> $E0 then exit; // Next 3 bits.
// 2 bits MPEG version:
if (aHead[2] and $18)= $08 then exit; // $00=2.5; $08=reserved; $10=2; $18=1
// 2 bits Audio Layer:
if (aHead[2] and $06)= $00 then exit; // $00=reserved; $02=III; $04=II; $06=I
// 1 bit "Protection" flag. End of 16 bits.
// 4 bits Bitrate:
if (aHead[3] and $F0)= $F0 then exit; // 0=free format, thus allowed (but its length can't be computed from the header); all 4 bits set=bad
// 2 bits Frequency:
if (aHead[3] and $0C)= $0C then exit; // All bits=reserved.
// 1 bit "Padding" flag.
// 1 bit "Private" flag. End of 24 bits.
// 2 bits "Channel Mode": 0=stereo; 1=joint stereo; 2=dual channel; 3=mono
// 2 bits Mode Extension.
// 1 bit "Copyright" flag.
// 1 bit "Original" flag.
// 2 bits Emphasis. End of 32 bit.
if (aHead[4] and $03)= $02 then exit; // $00=none; $01=50/15 ms; $02=reserved; $03=CCITT J.17
// Lowest bit of the 2-bit version field (bit 3 of the 2nd byte), shifted 3 bits to the right: 1=MPEG 1; 0=MPEG 2/2.5
// 2 bits from 2nd byte, shifted 1 bit to the right = Audio Layer
// 4 bits from 3rd byte, shifted 4 bits to the right = Bitrate
iBitRateKilo:= MPEG_BITRATE[(aHead[2] shr 3) and 1][(aHead[2] shr 1) and 3][(aHead[3] shr 4) and $F];
// MPEG 1 Layer II disallows specific bitrate/channel mode combinations.
if (aHead[2] and $1E)= $1C then
case iBitRateKilo of
32, 48, 56, 80: if (aHead[4] and $C0)<> $C0 then exit; // Only single channel allowed.
224, 256, 320, 384: if (aHead[4] and $C0)= $C0 then exit; // No single channel allowed.
end;
// Samples per frame divided by 8 (bits to bytes); for Layer I also divided by the 4-byte slot size: 384/8/4=12; 1152/8=144; 576/8=72.
if (aHead[2] and $18)= $18 then begin // MPEG v1
case aHead[2] and $06 of
$06: iSamples:= 12; // Layer I
else
iSamples:= 144; // Layer II and III
end;
end else begin // MPEG v2 and v2.5
case aHead[2] and $06 of
$06: iSamples:= 12; // Layer I
$04: iSamples:= 144; // Layer II
else
iSamples:= 72; // Layer III
end;
end;
// Set slot size and padding (in bytes).
if (aHead[2] and $06)= $06 then iSlotSize:= 4 else iSlotSize:= 1; // Layer I = 32 bits.
if (aHead[3] and $02)= $02 then iPadding := 1 else iPadding := 0; // Padding bit.
// 2 bits from second byte, shifted 3 bits to the right = MPEG version
// 2 bits from third byte, shifted 2 bits to the right = Frequency
iSampleRate:= MPEG_SAMPLERATE[(aHead[2] shr 3) and 3][(aHead[3] shr 2) and 3];
if iSampleRate= 0 then exit;
// The division itself is a real/float one, not an Integer division. The quotient
// must not be rounded; instead its decimals must be cut off. If it is 1152.9, then
// it still means 1152 slots, not 1153. Truncate the slot count first, then add the
// padding and multiply by the slot size, so a Layer I frame is always a multiple
// of 4 bytes. This calculation works for all MPEG versions, not just v1.
result:= (Trunc( iSamples* iBitRateKilo* 1000/ iSampleRate )+ iPadding)* iSlotSize;
(* Originally I thought the checksum would make the frame bigger, but after examining
a couple of files it turned out that the 2 CRC bytes are already part of the computed
frame size. This is also confirmed by https://hydrogenaud.io/index.php/topic,119033.0.html,
indicating that the CRC was never meant for (stored) files, but only for (network)
transmissions, and would indeed waste 16 valuable bits.
if (aHead[2] and $01)= $00 then Inc( result, 2 ); // 16-bit CRC after header. *)
end;
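For comparison, here is a rough Python port of IsValidMpegHeader. It is a sketch, not the original code. The name mpeg_frame_length is hypothetical, and the port takes the 4 header bytes instead of a stream. It truncates the slot count before applying padding and slot size, applies the Layer II bitrate/mode rule only to MPEG 1, and returns 0 for free-format frames, because their length can't be derived from the header. The result is the total frame length, header and CRC included.

```python
# Same values as the Delphi constants: [version bit][layer bits][bitrate index], kbit/s.
MPEG_BITRATE = (
    (None,
     (0, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160),        # 2/2.5 Layer III
     (0, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160),        # 2/2.5 Layer II
     (0, 32, 48, 56, 64, 80, 96, 112, 128, 144, 160, 176, 192, 224, 256)),  # 2/2.5 Layer I
    (None,
     (0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320),    # 1 Layer III
     (0, 32, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 384),   # 1 Layer II
     (0, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448)),  # 1 Layer I
)
# [version bits 00=2.5, 01=reserved, 10=2, 11=1][sample rate index], Hz.
MPEG_SAMPLERATE = ((11025, 12000, 8000), (0, 0, 0),
                   (22050, 24000, 16000), (44100, 48000, 32000))


def mpeg_frame_length(header: bytes) -> int:
    """Total frame length in bytes (header, CRC, payload), or 0 for a bad header."""
    if len(header) < 4:
        return 0
    b1, b2, b3, b4 = header[:4]
    if b1 != 0xFF or (b2 & 0xE0) != 0xE0:
        return 0                                    # no 11-bit sync
    if (b2 & 0x18) == 0x08 or (b2 & 0x06) == 0:
        return 0                                    # reserved version or layer
    if (b3 & 0xF0) == 0xF0 or (b3 & 0x0C) == 0x0C or (b4 & 0x03) == 0x02:
        return 0                                    # bad bitrate, sample rate or emphasis
    mpeg1 = (b2 & 0x18) == 0x18
    layer_bits = (b2 >> 1) & 3                      # 3=I, 2=II, 1=III
    kbps = MPEG_BITRATE[(b2 >> 3) & 1][layer_bits][b3 >> 4]
    sample_rate = MPEG_SAMPLERATE[(b2 >> 3) & 3][(b3 >> 2) & 3]
    if kbps == 0:
        return 0                                    # free format: length unknown from header
    if mpeg1 and layer_bits == 2:                   # MPEG 1 Layer II mode restrictions
        mono = (b4 & 0xC0) == 0xC0
        if (kbps in (32, 48, 56, 80) and not mono) or (kbps in (224, 256, 320, 384) and mono):
            return 0
    if layer_bits == 3:
        coefficient, slot = 12, 4                   # 384 / 8 / 4
    elif mpeg1 or layer_bits == 2:
        coefficient, slot = 144, 1                  # 1152 / 8
    else:
        coefficient, slot = 72, 1                   # 576 / 8 (MPEG 2/2.5 Layer III)
    padding = (b3 >> 1) & 1
    slots = coefficient * kbps * 1000 // sample_rate  # truncate, never round
    return (slots + padding) * slot
```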
If the function returns 0, you're most likely in a metadata tag's area (or simply not at a frame header). The calculated frame size covers the whole frame: the 4 header bytes, the optional 2 CRC bytes and the payload. Counted from the first byte of the current header, it's exactly the number of bytes to skip to land in front of the next frame's header.
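A minimal frame walker shows how the length is used. It jumps from the first byte of one header to the first byte of the next, so the length it adds covers the 4 header bytes as well. This is a hypothetical Python sketch restricted to MPEG-1 Layer III (one bitrate table) to stay short. It stops at the first position that doesn't hold such a header, e.g. an ID3v1 "TAG".

```python
# MPEG-1 Layer III only (simplification): bitrate index -> kbit/s, rate index -> Hz.
BITRATES_V1_L3 = (0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320)
SAMPLE_RATES_V1 = (44100, 48000, 32000)


def count_frames(data: bytes) -> int:
    """Count consecutive MPEG-1 Layer III frames starting at offset 0."""
    pos, frames = 0, 0
    while pos + 4 <= len(data):
        b2, b3 = data[pos + 1], data[pos + 2]
        if data[pos] != 0xFF or (b2 & 0xFE) != 0xFA:
            break                                   # not an MPEG-1 Layer III header
        bitrate_idx, rate_idx = b3 >> 4, (b3 >> 2) & 3
        if bitrate_idx in (0, 15) or rate_idx == 3:
            break                                   # free format or invalid: stop
        padding = (b3 >> 1) & 1
        length = (144 * BITRATES_V1_L3[bitrate_idx] * 1000
                  // SAMPLE_RATES_V1[rate_idx] + padding)
        if pos + length > len(data):
            break                                   # truncated last frame
        pos += length                               # length already includes the header
        frames += 1
    return frames
```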
Padding is used to match the bitrate exactly. For example, Layer II at 128 kbit/s and 44.1 kHz uses mostly 418-byte frames and some 417-byte frames to average exactly 128 kbit/s. For Layer I a slot is 32 bits long; for Layer II and Layer III a slot is 8 bits long.
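The 417/418 mix can be reproduced with an exact-fraction accumulator. This is a hypothetical encoder-side sketch; real encoders may decide the padding bit differently. After k frames, it keeps the running byte total at floor(k × 144 × BitRate / SampleRate), so each frame is either the truncated length or one byte longer.

```python
import math
from fractions import Fraction


def padded_frame_lengths(bitrate_bps: int, sample_rate_hz: int,
                         frames: int, coefficient: int = 144) -> list[int]:
    """Frame lengths that average exactly coefficient * bitrate / sample_rate bytes."""
    exact = Fraction(coefficient * bitrate_bps, sample_rate_hz)  # e.g. 417.959...
    lengths = []
    for k in range(1, frames + 1):
        # The step between floor(k * exact) and floor((k - 1) * exact) is the
        # truncated length, or one byte more when the padding bit would be set.
        lengths.append(math.floor(k * exact) - math.floor((k - 1) * exact))
    return lengths
```

For 1000 frames at 128 kbit/s and 44.1 kHz, it yields 959 frames of 418 bytes and 41 of 417 bytes, 417,959 bytes in total.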
First, let's distinguish the two terms frame size and frame length. Frame size is the number of samples contained in a frame. It is constant: always 384 samples for Layer I, and 1152 samples for Layer II and MPEG-1 Layer III (MPEG-2 and 2.5 Layer III frames hold 576 samples). Frame length is the length of a frame when compressed. It is calculated in slots: one slot is 4 bytes long for Layer I, and 1 byte long for Layer II and Layer III. When you are reading an MPEG file, you must calculate it to be able to find each consecutive frame. Remember, frame length may change from frame to frame due to padding or bitrate switching.
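Written as one formula, the two terms combine as follows. This is a consolidated form of the formulas quoted earlier, not a quote from the standard. BitRate / SampleRate is bits per sample. Multiplying by SamplesPerFrame gives bits per frame, dividing by 8 gives bytes, and dividing by SlotSize gives slots. That is where the "divide by 8" in the question comes from. The floor applies to the slot count, before padding is added:

$$
\text{FrameLengthInBytes} = \left( \left\lfloor \frac{\text{SamplesPerFrame}}{8 \cdot \text{SlotSize}} \cdot \frac{\text{BitRate}}{\text{SampleRate}} \right\rfloor + \text{Padding} \right) \cdot \text{SlotSize}
$$

$$
\text{Layer I: } \frac{384}{8 \cdot 4} = 12, \qquad \text{Layer II, MPEG-1 Layer III: } \frac{1152}{8 \cdot 1} = 144, \qquad \text{MPEG-2/2.5 Layer III: } \frac{576}{8 \cdot 1} = 72
$$

$$
\left\lfloor \frac{144 \cdot 128000}{44100} \right\rfloor = \lfloor 417.96 \rfloor = 417 \text{ bytes (418 with padding)}
$$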
I wrote this to count frames exactly in MP3 files encoded with variable bitrates, where frames can have very different lengths. I was fed up with lazy overall calculations that only do guesswork.
The "special" VBR frames that don't contain audio but instead additional info can be fairly well detected, too. For this we need to know the "side info" of a frame:
const
// https://www.codeproject.com/Articles/8295/MPEG-Audio-Frame-Header
MPEG_SIDEINFO: Array[0.. 1, FALSE.. TRUE] of Byte= // MPEG 2/1, Mono/Non-mono
( ( 9, 17 )
, ( 17, 32 ) // Only MPEG 1 non-mono has the offset after 32 bytes
);
// Returns TRUE if one of the identifications matches.
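// aHead: the 4 header bytes already read by the caller. As an open array it is
// indexed 0..3, so aHead[1] is the 2nd and aHead[3] the 4th header byte. The
// stream must be positioned right after the header.
function IsVbrFrame( oIn: TStream; const aHead: Array of Byte ): Boolean;
var
iSideInfo: Byte;
aIdent: Array[1.. 4] of AnsiChar; // 1 byte per character (Char is 2 bytes since Delphi 2009), treated as ASCII.
begin
// Lowest bit of the version field (2nd header byte), shifted 3 bits to the right: 1=MPEG 1; 0=MPEG 2/2.5
// 2 highest bits of the 4th header byte (Channel Mode): TRUE if the mode is not "Mono".
iSideInfo:= MPEG_SIDEINFO[(aHead[1] shr 3) and 1][(aHead[3] and $C0)<> $C0];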
// After we read the 4 bytes from the header, go forward either 9, 17 or 32
// bytes and read 4 bytes of identification for almost any VBR frame.
oIn.Seek( iSideInfo, soCurrent );
oIn.Read( aIdent[1], 4 );
if (aIdent= 'Xing')
or (aIdent= 'Info')
or (aIdent= 'LAME')
or (aIdent= 'UUUU')
or (aIdent= 'GOGO')
or (aIdent= 'MPGE') then begin
result:= TRUE;
end else begin
// Go back the 4 bytes we just read and the sideinfo portion we skipped
// to then always jump 32 bytes forwards, regardless of MPEG version and
// Channel Mode. Then read 4 bytes again and check for the only known ID.
oIn.Seek( 0- 4- iSideInfo, soCurrent );
oIn.Seek( 32, soCurrent );
oIn.Read( aIdent[1], 4 );
result:= (aIdent= 'VBRI');
end;
end;
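The same checks fit in a short Python sketch that works on a byte buffer starting at the frame header instead of on a stream. The name is_vbr_frame is hypothetical. Like IsVbrFrame, it assumes Layer III side-info sizes and ignores the 2 CRC bytes a protected frame would carry.

```python
XING_LIKE_IDS = (b"Xing", b"Info", b"LAME", b"UUUU", b"GOGO", b"MPGE")


def is_vbr_frame(frame: bytes) -> bool:
    """True if a Layer III frame (bytes starting at its header) carries one of
    the IDs that IsVbrFrame checks for."""
    mpeg1 = ((frame[1] >> 3) & 1) == 1       # version bit: 1 = MPEG 1
    mono = (frame[3] & 0xC0) == 0xC0         # channel mode 3 = mono
    # Side info size, as in MPEG_SIDEINFO: MPEG 1 17/32, MPEG 2/2.5 9/17 (mono/non-mono).
    side_info = (17 if mono else 32) if mpeg1 else (9 if mono else 17)
    tag_at = 4 + side_info                   # skip the header, then the side info
    if frame[tag_at:tag_at + 4] in XING_LIKE_IDS:
        return True
    return frame[36:40] == b"VBRI"           # always 32 bytes after the header
```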
You may also want to read
...which is also useful to know where the first audio frame is to be found (after tags at the start of the file) and when you've reached the last one (before tags at the end of the file).