I have a sample C program that successfully detects a keyword in an audio file using the Microsoft Speech C SDK. I'm somewhat new to C, so it took me a bit to craft this working sample (especially because they do not document their C API, so I've had to infer from the C++ docs and intuition).
(Note: I'm aware there is a JS version of their Speech SDK. The problem is they have not implemented on-device keyword detection in their JS SDK, and when I opened an issue about it they recommended I use their C/C++ SDK with Node wrappers.)
#include "./speechsdk/include/c_api/speechapi_c.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define CHECK_RESULT(result, message)               \
  if ((result) != SPX_NOERROR) {                    \
    fprintf(stderr, message ": %lu\n", (result));   \
    exit(-1);                                       \
  }
int main() {
  SPXAUDIOCONFIGHANDLE audioConfig;
  AZACHR audioConfigResult =
      audio_config_create_audio_input_from_wav_file_name(&audioConfig,
                                                         "./file.wav");
  CHECK_RESULT(audioConfigResult, "Failed to create audio config.");
  printf("Created audio config.\n");
  SPXKEYWORDHANDLE keywordModel;
  AZACHR keywordModelResult = keyword_recognition_model_create_from_file(
      "./keyword_models/hey_bumblebee.table", &keywordModel);
  CHECK_RESULT(keywordModelResult, "Failed to create keyword model.");
  printf("Created keyword model.\n");
  SPXRECOHANDLE recognizer;
  AZACHR recognizerResult =
      recognizer_create_keyword_recognizer_from_audio_config(&recognizer,
                                                             audioConfig);
  CHECK_RESULT(recognizerResult, "Failed to create recognizer.");
  printf("Created recognizer.\n");
  SPXRESULTHANDLE resultHandle = NULL;
  AZACHR recognizeKeywordResult = recognizer_recognize_keyword_once(
      recognizer, keywordModel, &resultHandle);
  CHECK_RESULT(recognizeKeywordResult, "Failed to start recognition.");
  Result_Reason reason;
  AZACHR reasonResult = result_get_reason(resultHandle, &reason);
  CHECK_RESULT(reasonResult, "Failed to get result reason.");
  if (reason == ResultReason_RecognizedKeyword) {
    char textBuffer[256];
    AZACHR textResult =
        result_get_text(resultHandle, textBuffer, sizeof(textBuffer));
    CHECK_RESULT(textResult, "Failed to get recognized text.");
    printf("Recognized: \"%s\"\n", textBuffer);
  } else if (reason == ResultReason_NoMatch) {
    Result_NoMatchReason noMatchReason;
    AZACHR noMatchReasonResult =
        result_get_no_match_reason(resultHandle, &noMatchReason);
    CHECK_RESULT(noMatchReasonResult, "Failed to get no match reason.");
    printf("No match. Reason: %d.\n", noMatchReason);
  } else if (reason == ResultReason_Canceled) {
    Result_CancellationReason cancellationReason;
    Result_CancellationErrorCode cancellationCode;
    AZACHR canceledReasonResult =
        result_get_reason_canceled(resultHandle, &cancellationReason);
    CHECK_RESULT(canceledReasonResult, "Failed to get canceled reason.");
    AZACHR canceledCodeResult =
        result_get_canceled_error_code(resultHandle, &cancellationCode);
    CHECK_RESULT(canceledCodeResult, "Failed to get canceled error code.");
    printf("Canceled. Reason: %d. Code: %d.\n", cancellationReason,
           cancellationCode);
  } else {
    printf("Unknown.\n");
  }
  return 0;
}
The SDK methods return a status number that I check to ensure is 0 (aka SPX_NOERROR) via the CHECK_RESULT macro. The actual data I want from each call gets populated into a pointer I create and pass in. The C program appears to work perfectly and does indeed accurately detect the keyword in the audio file when it's present.
Now, I'm trying to do this exact thing but with Bun's FFI capabilities. Here's the code:
import { dlopen, FFIType, CString, ptr } from "bun:ffi";
const cwd = process.cwd();
// Enum derived from the SDK: https://github.com/catdadcode/microsoft-speech-sdk/blob/main/include/c_api/speechapi_c_result.h#L11-L26
enum ResultReason {
  NoMatch = 0,
  Canceled = 1,
  RecognizingSpeech = 2,
  RecognizedSpeech = 3,
  RecognizingIntent = 4,
  RecognizedIntent = 5,
  TranslatingSpeech = 6,
  TranslatedSpeech = 7,
  SynthesizingAudio = 8,
  SynthesizingAudioComplete = 9,
  RecognizingKeyword = 10,
  RecognizedKeyword = 11,
  SynthesizingAudioStart = 12,
}
// Assuming the Microsoft Speech SDK shared library is available at a certain path
const speechSdkPath = `${cwd}/speechsdk/lib/x64/libMicrosoft.CognitiveServices.Speech.core.so`;
// Load the Speech SDK shared library
const speechSdk = dlopen(speechSdkPath, {
  audio_config_create_audio_input_from_wav_file_name: {
    args: [FFIType.cstring, FFIType.ptr],
    returns: FFIType.u64_fast,
  },
  keyword_recognition_model_create_from_file: {
    args: [FFIType.cstring, FFIType.ptr],
    returns: FFIType.u64_fast,
  },
  recognizer_create_keyword_recognizer_from_audio_config: {
    args: [FFIType.ptr, FFIType.ptr],
    returns: FFIType.u64_fast,
  },
  recognizer_recognize_keyword_once: {
    args: [FFIType.ptr, FFIType.ptr, FFIType.ptr],
    returns: FFIType.u64_fast,
  },
  result_get_reason: {
    args: [FFIType.ptr, FFIType.ptr],
    returns: FFIType.u64_fast,
  },
  result_get_text: {
    args: [FFIType.ptr, FFIType.ptr, FFIType.u32],
    returns: FFIType.u64_fast,
  },
});
const textEncoder = new TextEncoder();
const textDecoder = new TextDecoder();
const checkResult = (result: number | bigint) => {
  if (result !== 0) {
    throw new Error(`Error: ${result}`);
  }
};
// Replace these with the actual file paths
const audioFilePath = textEncoder.encode(`${cwd}/file.wav`);
const keywordModelFilePath = textEncoder.encode(
  `${cwd}/keyword_models/hey_bumblebee.table`
);
// Create audio config
console.log(
  `Creating audio config from file: ${textDecoder.decode(audioFilePath)}`
);
const audioConfig = ptr(new Uint8Array(8));
const audioConfigResult =
  speechSdk.symbols.audio_config_create_audio_input_from_wav_file_name(
    audioFilePath,
    audioConfig
  );
checkResult(audioConfigResult);
console.log("Created audio config.");
// Create keyword model
console.log(
  `Creating keyword model from file: ${textDecoder.decode(
    keywordModelFilePath
  )}`
);
const keywordModel = ptr(new Uint8Array(8));
const keywordModelResult =
  speechSdk.symbols.keyword_recognition_model_create_from_file(
    keywordModelFilePath,
    keywordModel
  );
checkResult(keywordModelResult);
console.log("Created keyword model.");
// Create recognizer
console.log("Creating recognizer...");
const recognizer = ptr(new Uint8Array(8));
const recognizerResult =
  speechSdk.symbols.recognizer_create_keyword_recognizer_from_audio_config(
    recognizer,
    audioConfig
  );
console.log();
checkResult(recognizerResult!);
console.log("Created recognizer.");
// // Start recognition
// console.log("Starting recognition...");
// const resultHandle = ptr(new Uint8Array(16));
// const recognizeKeywordResult =
//   speechSdk.symbols.recognizer_recognize_keyword_once(
//     recognizer,
//     keywordModel,
//     resultHandle
//   );
// checkResult(recognizeKeywordResult);
// console.log("Recognition finished.");
//
// // Get result reason
// console.log("Getting result reason...");
// const reason = ptr(new Uint8Array(8));
// const reasonResult = speechSdk.symbols.result_get_reason(resultHandle, reason);
// checkResult(reasonResult);
// console.log("Got result reason:", new CString(reason));
//
// // Check the reason and handle accordingly
// if (reason === ResultReason.RecognizedKeyword) {
//   // Assuming a buffer size, adjust as needed
//   const textBuffer = new Uint8Array(256);
//   const textResult = speechSdk.symbols.result_get_text(
//     resultHandle,
//     textBuffer,
//     textBuffer.length
//   );
//   checkResult(textResult);
//   console.log("Recognized:", new CString(ptr(textBuffer)));
// } else {
//   console.error("Recognition failed:", reason);
// }
You'll note that half of it is commented out. This is because it's already failing before the commented-out portion, so I commented out the rest to keep it simple, but I'm including it for completeness. The recognizerResult status that gets returned ends up being a large number that doesn't seem to correspond to any of the error codes. The first two calls, which create the audioConfig and the keywordModel, appear to succeed fine and do indeed give me zeros. I don't know if I'm screwing up the symbol type mapping here or what.
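One thing that may help when comparing: the error codes in the SDK headers appear to be defined in hex, so the returned status is easier to eyeball against them when printed in hex, e.g.:
// Hypothetical debugging line (not part of the sample above): print the
// status in hex to compare against the error constants in the SDK headers.
console.log(`recognizerResult: 0x${recognizerResult.toString(16)}`);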
Microsoft doesn't seem to have a repo anywhere with their SDK header files (they just offer a .zip download), so I put them in a repo so I could link to them in case you want to reference the SDK itself when looking over this question: https://github.com/catdadcode/microsoft-speech-sdk
Oh also, in case it matters, I build the C program with the following command:
gcc main.c -o main -I./speechsdk/include/c_api/ -L./speechsdk/lib/x64/ -lMicrosoft.CognitiveServices.Speech.core -Wl,-rpath=./speechsdk/lib/x64/ -g
I'm not sure if I need to do anything more with the FFI version beyond passing the actual SDK .so library file to dlopen.
The issue here was twofold. First, the strings supplied from Bun's FFI needed to be NUL-terminated. This can be accomplished by concatenating a zeroed buffer:
const nul = Buffer.from([0])
const buf = Buffer.concat([Buffer.from("hello"), nul])
Or you can simply use the NUL escape character:
const buf = Buffer.from("hello\0")
The hexadecimal zero is also equivalent:
const buf = Buffer.from("hello\x00")
The second issue was a simple misuse of Bun's FFI API with regard to pointers. The value of a pointer in C is the address of the memory it points to, and that address can be accessed in C by simply referencing the pointer variable. Getting the address of the pointer itself is as simple as prefixing the pointer variable with an ampersand (&).
In Bun this is more complicated. Getting the value of a pointer (the address the pointer points to) requires us to read the bytes from the memory where the pointer is stored. Meanwhile, referencing the address of the pointer itself is as simple as passing in the raw pointer. In Bun the Pointer type is just a number: the address of the pointer.
What this means is that passing the raw pointer variable in C (ptr) is like passing the value of the pointer, which is the address of the data being pointed to. Passing the raw pointer variable in Bun (ptr) is like passing the address of the pointer itself (similar to passing with & in C).
To get the address the pointer is pointing to in Bun, we have to read the bytes from the memory where our pointer variable is stored (read.ptr(ptr)). Here is a simple example of doing the same thing in C and Bun:
C
SOMEHANDLE ptr;
someMethod(&ptr);
someOtherMethod(ptr);
Bun
import { dlopen, ptr, FFIType, read } from "bun:ffi";

const sdk = dlopen("path/to/sdk", {
  someMethod: { args: [FFIType.ptr], returns: FFIType.void },
  someOtherMethod: { args: [FFIType.ptr], returns: FFIType.void },
});

// Allocate space for the handle and keep a reference so it isn't
// garbage-collected; handlePtr is its address (like &ptr in the C version).
const handleBuffer = new Uint8Array(8);
const handlePtr = ptr(handleBuffer);
sdk.symbols.someMethod(handlePtr); // SDK writes the handle value into the buffer
sdk.symbols.someOtherMethod(read.ptr(handlePtr)); // dereference to get the handle value
The fact that things are flip-flopped (extra syntax to get the address of the pointer in C, versus extra syntax to get the address being pointed to in Bun) really threw me for a loop.
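Putting both fixes together, the part of the original program that was failing becomes something like this sketch (it reuses the speechSdk bindings, checkResult helper, and cwd from the question, and keeps the out-buffers in variables so they aren't garbage-collected while the SDK holds their addresses):
import { dlopen, FFIType, ptr, read } from "bun:ffi";
// ...same dlopen bindings, checkResult, and cwd as in the question...

// Fix 1: NUL-terminate the path string.
const audioFilePath = Buffer.from(`${cwd}/file.wav\0`);

// Out-parameter buffer, like `SPXAUDIOCONFIGHANDLE audioConfig;` in C.
const audioConfigOut = new Uint8Array(8);
checkResult(
  speechSdk.symbols.audio_config_create_audio_input_from_wav_file_name(
    audioFilePath,
    ptr(audioConfigOut) // like &audioConfig in C
  )
);

// Out-parameter buffer, like `SPXRECOHANDLE recognizer;` in C.
const recognizerOut = new Uint8Array(8);
checkResult(
  speechSdk.symbols.recognizer_create_keyword_recognizer_from_audio_config(
    ptr(recognizerOut), // like &recognizer in C
    read.ptr(ptr(audioConfigOut)) // Fix 2: dereference to pass the handle value
  )
);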