I'm trying to parse a binary file and extract different data structures from it. One can be a uint8 or int8 (also uint16, int16, ... up to 64 bits).
To keep the method as universal as possible, I read the data in from the given file pointer and save it in a uint8 array (buffer).
In my test I assumed that a file content of 40 (in hex) should lead to a resulting integer of 64; that's why my test method asserts these values, to be sure about it. **Unfortunately, the uint8 array's content always results in a decimal int of 52.** I don't know why and have tried various other ways to read in a specific amount of bytes and assign them to an integer variable. Is this an endianness issue or something?
Thanks in advance, if someone can help :)
My read_int method:
int read_int(FILE * file, int n, bool is_signed) throw(){
    assert(n>0);
    uint8_t n_chars[n];
    int result;
    for (int i = 0; i < n; i++)
    {
        if(fread(&n_chars[i],sizeof(n_chars[i]),1,file)!=1){
            std::cerr<< "fread() failed!\n";
            throw new ReadOpFailed();
        }
        result*=255;
        result+=n_chars[i];
    }
    std::cout<< "int read: "<<result<<"\n";
    return result;

    //-------------Some ideas that didn't work out either------------------
    // std::stringstream ss;
    // ss << std::hex << static_cast<int>(static_cast<unsigned char>(n_chars)); // Convert byte to hexadecimal string
    // int result;
    // ss >> result; // Parse the hexadecimal string to integer
    // std::cout << "result" << result<<"\n";
}
One little test that fails spectacularly... The endian detection part reports little endian (I don't know whether that has any part in the problem).
struct TestContext{
    FILE * create_test_file_hex(char * input_hex, const char * rel_file_path = "test.gguf") {
        std::ofstream MyFile(rel_file_path, std::ios::binary);
        // Write to the file
        MyFile << input_hex;
        // Close the file
        MyFile.close();
        // std::fstream outfile (rel_file_path,std::ios::trunc);
        // char str[20] =
        // outfile.write(str, 20);
        // outfile.close();
        FILE *file = fopen(rel_file_path,"rb");
        try{
            assert(file != nullptr);
        }catch (int e){
            std::cout << "file couldn't be opened due to exception n° "<<std::to_string(e)<<"\n";
            ADD_FAILURE();
        }
        std::remove(rel_file_path); // remove file whilst open, to be able to use it, but delete it after the last pointer was deleted
        return file;
    }
};
TEST(test_tool_functions, test_read_int){
    int n = 1;
    // little endian if true
    if(*(char *)&n == 1) {std::cout<<"Little Endian Detected!!!\n";}
    else{std::cout<<"Big Endian Detected!!!\n";}
    std::string file_hex_content = "400A0E00080000016";
    uint64_t should;
    std::istringstream("40") >> std::hex >> should;
    ASSERT_EQ(should,64);
    uint64_t result = read_int(TestContext().create_test_file_hex(file_hex_content.data()),1,false);
    ASSERT_EQ(result,should);
}
The root cause of the problem is that your file_hex_content consists of ASCII character bytes (which form a human-readable hexadecimal string representation of a number), not of the bytes that would form a binary integer representation. Therefore it doesn't start with a single byte 0x40, a.k.a. 64, but with a byte '4' (ASCII byte value 52) followed by another byte '0' (ASCII value 48). A single byte 64 (0x40) corresponds to the ASCII character '@' rather than the two characters '4' and '0'.
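To illustrate the difference (a minimal sketch, not your test harness; the helper hex_to_bytes is made up for this example): if the file should start with the single byte 0x40, the hex string has to be decoded into raw bytes before it is written. Something along these lines would do, assuming the hex string stays the test input:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: decode a hex string such as "400A..." into raw bytes.
std::vector<uint8_t> hex_to_bytes(const std::string& hex) {
    std::vector<uint8_t> bytes;
    for (std::size_t i = 0; i + 1 < hex.size(); i += 2)
        bytes.push_back(static_cast<uint8_t>(std::stoi(hex.substr(i, 2), nullptr, 16)));
    return bytes;
}

int main() {
    // "400A0E00080000016" has an odd length, so the trailing lone nibble is dropped here.
    const std::vector<uint8_t> bytes = hex_to_bytes("400A0E00080000016");
    std::ofstream out("test.gguf", std::ios::binary);
    out.write(reinterpret_cast<const char*>(bytes.data()),
              static_cast<std::streamsize>(bytes.size()));
    // The first byte of test.gguf is now 0x40 (decimal 64), not the character '4' (52).
}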
A small serialization example follows. As long as you serialize and deserialize on the same architecture and have no portability concerns, endianness is not a concern either.
#include <cstdint>
#include <ios>
#include <iostream>
#include <sstream>

int main() {
    std::stringstream encoded;
    const uint64_t source{0xabcd1234deadbeefULL};
    // Serialize: write the raw bytes of source into the stream.
    encoded.write(reinterpret_cast<const char*>(&source), sizeof(source));
    uint64_t target;
    // Deserialize: read the same bytes back into target.
    encoded.read(reinterpret_cast<char*>(&target), sizeof(target));
    std::cout << "source == target: " << std::hex << source << " == " << target
              << "\nserialized bytes:";
    for (const uint8_t byte : encoded.str())
        std::cout << ' ' << static_cast<uint32_t>(byte);
    std::cout << std::endl;
}
The output from the program above, when executed on my little endian machine, looks like this:
source == target: abcd1234deadbeef == abcd1234deadbeef
serialized bytes: ef be ad de 34 12 cd ab
As expected, the serialized string starts from the lowest order byte 0xef and ends with the highest order byte 0xab. On a big endian platform, the second line would be ordered from highest to lowest order byte, i.e. ab cd 12 34 de ad be ef.
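If the byte order of the data is fixed by the file format rather than by the machine that wrote it (as is the case for most binary formats), you can instead assemble the integer byte by byte with shifts, which makes the result independent of the host's endianness. A minimal sketch along the lines of your read_int, assuming a little-endian file format (the name read_uint_le and the fixed interpretation are assumptions, not part of your code):

#include <cassert>
#include <cstdint>
#include <cstdio>
#include <stdexcept>

// Hypothetical variant of read_int: read n little-endian bytes (n <= 8) from
// file and assemble them into an unsigned 64-bit value, regardless of the
// host's endianness.
uint64_t read_uint_le(FILE* file, int n) {
    assert(n > 0 && n <= 8);
    uint64_t result = 0;
    for (int i = 0; i < n; ++i) {
        uint8_t byte;
        if (fread(&byte, sizeof(byte), 1, file) != 1)
            throw std::runtime_error("fread() failed!");
        // Byte i is the i-th least significant byte of a little-endian value.
        result |= static_cast<uint64_t>(byte) << (8 * i);
    }
    return result;
}

For a big-endian format you would instead shift the accumulated result left by 8 and OR in each new byte, which is what your multiply-and-add loop is aiming at (with a factor of 256 rather than 255).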