I wrote this for my OS class:
#include <iostream>
#include <fstream>
//encodes a file using the (8,4) Hamming Code.
//usage : HammingEncode.out < inputFile > outputFile
int main() {
unsigned char const codebook[] = {0x00, 0x1E, 0x2D, 0x33, 0x4B, 0x55, 0x66, 0x78, 0x87, 0x99, 0xAA, 0xB4, 0xCC, 0xD2, 0xE1, 0xFF};
unsigned char in, nextByte;
unsigned char const leftMask = 0xF0, rightMask = 0x0F;
in = std::cin.get();
while (!std::cin.eof()) {
nextByte = (in & leftMask) >> 4;
std::cout << codebook[nextByte];
nextByte = in & rightMask;
std::cout << codebook[nextByte];
in = std::cin.get();
}
}
I then decided to test its speed (just to see) on the Old Testamenet form the King James Bible. This was our standard test file for a Data Structures Class which was taught in Java, and we could sort it and Huffman Encode it in basically no time, yet this takes quite a long time to encode. What is going on?
std::cin
is open in text-mode, and as such it is constantly on the lookout for all kinds of things to watch for (like newlines, etc).
Given the constant character sniffing by the std::cin
input stream, i'm not surprised it takes longer, but it does seem a little excessive. the following, bypassing iostream
and using FILE
stream directly will probably do what you were expecting:
#include <cstdlib>
#include <cstdio>
int main(int argc, char *argv[])
{
static unsigned char const codebook[] =
{
0x00, 0x1E, 0x2D, 0x33, 0x4B, 0x55, 0x66, 0x78,
0x87, 0x99, 0xAA, 0xB4, 0xCC, 0xD2, 0xE1, 0xFF
};
for (int c = std::fgetc(stdin); c!=EOF; c=std::fgetc(stdin))
{
std::fputc(codebook[c >> 4], stdout);
std::fputc(codebook[c & 0x0F], stdout);
}
return EXIT_SUCCESS;
}
I tested the exact code above on an 10MB random file loaded with chars ranging from a
to z
, and the results were ridiculously long when using std::cin
and std::cout
. Using the FILE
streams directly, the difference was enormous. All of the code in this answer was tested with Apple LLVM version 5.1 (clang-503.0.38) (based on LLVM 3.4svn)
using -O3
optimization.
Using FILE
streams
time ./hamming < bigfile.txt > bigfile.ham
real 0m1.855s
user 0m1.812s
sys 0m0.041s
Using std::cin
and std::cout
time ./hamming < bigfile.txt > bigfile.ham
real 0m23.819s
user 0m7.416s
sys 0m16.377s
Using std::cin
and std::cout
with std::cout.sync_with_stdio(false);
time ./hamming < bigfile.txt > bigfile.ham
real 0m24.867s
user 0m7.705s
sys 0m17.118s
In summary, ouch. Of note is the time spent in system. If I get a chance to update this with the std::istream::get()
and put()
methods I will, but honestly I don't expect any miracles on that. Unless there is some magic (to me, not to others) way of turning off io xlat from std::cin
the FILE
streams may be a reasonable alternative. I haven't yet investigated whether slurping std::cin
's rdbuf()
is a viable option either, but it may too have promise.
Edit: Using std::istreambuf_iterator<char>
Using the streambuf iterator class has a significant improvement, as it essentially bypasses all the inline slat junk, but it still isn't as efficient as FILE
streams:
#include <iostream>
#include <cstdlib>
#include <cstdio>
int main(int argc, char *argv[])
{
static unsigned char const codebook[] =
{
0x00, 0x1E, 0x2D, 0x33, 0x4B, 0x55, 0x66, 0x78,
0x87, 0x99, 0xAA, 0xB4, 0xCC, 0xD2, 0xE1, 0xFF
};
std::istreambuf_iterator<char> cin_it(std::cin), cin_eof;
std::for_each(cin_it, cin_eof, [](char c)
{
std::cout.put(static_cast<char>(codebook[static_cast<unsigned char>(c) >> 4]));
std::cout.put(static_cast<char>(codebook[static_cast<unsigned char>(c) & 0x0F]));
});
return EXIT_SUCCESS;
}
Results:
time ./hamming < bigfile.txt > bigfile.ham
real 0m6.062s
user 0m5.795s
sys 0m0.053s
Note that system
is now comparable to the FILE
stream results, but the overhead from rest of the iostream templates in user
seems a sore point (but still better than the other iostream
attempts). You win some, you lose some =P
Edit: Unbuffered System IO
In effort to be completely fair, bypassing all runtime buffering and relying solely on the system calls to do this madness, the following is also worth noting:
#include <cstdlib>
#include <cstdio>
#include <unistd.h>
int main(int argc, char *argv[])
{
static unsigned char const codebook[] =
{
0x00, 0x1E, 0x2D, 0x33, 0x4B, 0x55, 0x66, 0x78,
0x87, 0x99, 0xAA, 0xB4, 0xCC, 0xD2, 0xE1, 0xFF
};
unsigned char c;
while (read(STDIN_FILENO, &c, 1)> 0)
{
unsigned char duo[2] =
{
codebook[ c >> 4 ],
codebook[ c & 0x0F ]
};
write(STDOUT_FILENO, duo, sizeof(duo));
}
return EXIT_SUCCESS;
}
The results, as you would expect, were dreadful:
time ./hamming < bigfile.txt > bigfile.ham
real 0m26.509s
user 0m2.370s
sys 0m24.087s