common-lispbinaryfiles

Use a Lisp to identify whether a given sequence exists within a binary stream


Using standard Common Lisp techniques (libraries acceptable if they offer easier or quicker possibilities), what would be the analogous way to do (search "bar" "foobarbaz"), where "bar" is a given sequence of bytes and "foobarbaz" is the stream? Specifically, I am trying to search in a file that I have opened as (unsigned-byte 8). It would be helpful to know where the characters I am looking for (4d 54 68 64) are located within the file.

I know that the analogy breaks down because a stream is not the same kind of object as a collection. But there's got to be a relatively straightforward way to do this reading out some of the contents of a file I've opened using with-open-file.

I have combed the CL Cookbook and Practical Common Lisp, which has a great section on reading/writing binary files, but I couldn't quite figure out a quick and dirty solution for this.

EDIT: I've updated the title to reflect a more general purpose: using any Lisp to solve this problem. Many of the solutions posted here for doing so using Common Lisp represent valuable examples of accomplishing this task, but the simplest solution appears to be using Racket and "ports" along with the built-in support for regular expressions. For Racket reference:

https://docs.racket-lang.org/reference/ports.html

https://docs.racket-lang.org/reference/regexp.html

https://docs.racket-lang.org/guide/regexp.html


Solution

  • One way you could do this is to read from the stream to fill a buffer repeatedly, searching the buffer for the desired sequence until it is found or until the stream is exhausted. Each time the buffer is filled and searched without finding a match, the last n bytes, where n is the number of bytes in the sought sequence, have already been checked. So the last n-1 bytes will need to be checked after replenishing the buffer. This can be done by moving the last n-1 bytes to the front of the buffer before refilling it from the stream.

    The example code below defines a stream-find function that takes a needle argument which represents the sought sequence, a stream argument which represents the stream to be searched, and a buffer-length argument which defaults to 1024. It uses an array of unsigned bytes for a buffer. keep is the number of bytes to be kept between buffer refills, and refill-length is the number of fresh bytes to be searched after each refill. If the needle sequence is larger than the buffer an error is raised.

    A local function check-buffer is created which is recursively called to fill the buffer and search for the needle. The check-buffer function takes a keeping argument which tracks the number of bytes kept from the previous search, and a checked argument which tracks the total number of bytes searched so far. These are both zero for the first call of check-buffer.

    When check-buffer is called it copies the end of the buffer to the front of the buffer if the call is keeping bytes from a previous search. Then the contents of the stream are read into the buffer starting from the last kept byte. Then the buffer is searched for needle, and the resulting position is saved in found.

    If needle was found, the found position is combined with the number of bytes previously checked to return the byte position of needle in the stream. If the call to read-sequence terminated before reaching the end of the buffer the stream was exhausted without finding needle, so it was not in the stream and nil is returned. Otherwise the stream is not exhausted, so check-buffer is called again, keeping n-1 bytes from the end of the buffer and updating the number of bytes in checked.

    (defun stream-find (needle stream &key (buffer-length 1024))
      (let* ((buffer (make-array buffer-length
                                 :element-type '(unsigned-byte 8)))
             (needle-length (length needle))
             (keep (- needle-length 1))
             (refill-length (- buffer-length keep)))
        (when (> needle-length buffer-length)
          (error "buffer must be able to contain needle"))
        (labels ((check-buffer (keeping checked)
                   (when (> keeping 0)                         ; copy end of buffer to front
                     (map-into buffer #'identity
                               (subseq buffer refill-length)))
                   (let ((last (read-sequence buffer stream :start keeping)) ; fill buffer
                         (found (search needle buffer)))                     ; search buffer
                     (cond (found (+ found checked))                         ; found it
                           ((< last buffer-length) nil)                      ; not found
                           (t (check-buffer keep
                                            (+ checked refill-length)))))))  ; keep looking
          (check-buffer 0 0))))
    

    Here are a few sample usages given this testfile.txt. Note that the repl code here uses the SBCL extension sb-ext:string-to-octets as a convenience to create a vector of bytes to use for a needle:

    _________
    __**_____
    _________
    __***____
    
    CL-USER> (with-open-file (stream "c:/code/lisp/scratch/testfile.txt"
                                     :element-type '(unsigned-byte 8))
               (stream-find (sb-ext:string-to-octets "**") stream))
    13
    
    CL-USER> (with-open-file (stream "c:/code/lisp/scratch/testfile.txt"
                                     :element-type '(unsigned-byte 8))
               (stream-find (sb-ext:string-to-octets "***") stream))
    35
    
    CL-USER> (with-open-file (stream "c:/code/lisp/scratch/testfile.txt"
                                     :element-type '(unsigned-byte 8))
               (stream-find (sb-ext:string-to-octets "**") stream :buffer-length 3))
    13
    
    CL-USER> (with-open-file (stream "c:/code/lisp/scratch/testfile.txt"
                                     :element-type '(unsigned-byte 8))
               (stream-find (sb-ext:string-to-octets "**!") stream :buffer-length 3))
    NIL
    

    Finding All Matches in a File

    The above solution finds the position of the first match in a stream, but you may want to find all match positions. The above can be adapted to do this. One solution is to write a find-all function that returns a list of all position matches in a sequence, and to use this instead of search in check-buffer. These lists of matches will need to be adjusted to account for the number of bytes already processed in previous batches before collecting them into a list of results.

    ;;; Returns a list of 0-indexed positions where `needle` was found in `stream`.
    (defun stream-find-all (needle stream &key (buffer-length 1024))
      (let* ((buffer (make-array buffer-length
                                 :element-type '(unsigned-byte 8)))
             (needle-length (length needle))
             (keep (- needle-length 1))
             (refill-length (- buffer-length keep)))
        (when (> needle-length buffer-length)
          (error "buffer must be able to contain needle"))
        (labels ((check-buffer (keeping checked positions)
                   (when (> keeping 0)                         ; copy end of buffer to front
                     (map-into buffer #'identity
                               (subseq buffer refill-length)))
                   (let* ((last (read-sequence buffer stream :start keeping)) ; fill buffer
                          (found (find-all needle buffer last)))              ; search buffer
                     (cond (found (check-buffer keep                          ; found it
                                                (+ checked refill-length)
                                                (append positions
                                                        (mapcar #'(lambda (x)
                                                                    (+ x checked))
                                                                found))))
                           ((< last buffer-length) positions)         ; stream exhausted
                           (t (check-buffer keep                      ; keep looking
                                            (+ checked refill-length)
                                            positions))))))
          (check-buffer 0 0 nil))))
    
    ;;; Returns a list of 0-indexed positions where `needle` was found in `seq`.
    (defun find-all (needle seq &optional (end-seq nil))
      (let ((i 0))
        (loop with last = (- (length seq) (length needle))
              when (<= i last)
                do (setf i (search needle seq :start2 i :end2 end-seq))
              when (or (null i) (> i last))
                return positions
              collect i into positions
              do (incf i))))
    

    The stream-find-all function finds all matches in a stream, exhausting the stream before returning a list of positions. The local check-buffer function now has an accumulator argument that collects results which are taken from find-all and adjusted using mapcar to account for previously processed bytes. find-all returns a list of all match positions within the sequence it is given, but these positions are relative to that sequence, not the stream being processed.

    CL-USER> (with-open-file (stream "c:/code/lisp/scratch/testfile.txt"
                                     :element-type '(unsigned-byte 8))
               (stream-find-all (sb-ext:string-to-octets "____") stream))
    (0 1 2 3 4 5 15 16 22 23 24 25 26 27 38)
    

    The Easiest Non-Library Solution

    Most of the code in stream-find and stream-find-all is for managing the buffer. If you can slurp your file into a single buffer the problem becomes much simpler.

    Finding the first match needs none of the above code:

    CL-USER> (with-open-file (stream "c:/code/lisp/scratch/testfile.txt"
                                     :element-type '(unsigned-byte 8))
               (let ((file-contents (make-array (file-length stream)
                                                :element-type '(unsigned-byte 8))))
                 (read-sequence file-contents stream)
                 (search (sb-ext:string-to-octets "**") file-contents)))
    13
    

    Finding all matches only needs the find-all function from above:

    CL-USER> (with-open-file (stream "c:/code/lisp/scratch/testfile.txt"
                                     :element-type '(unsigned-byte 8))
               (let ((file-contents (make-array (file-length stream)
                                                :element-type '(unsigned-byte 8))))
                 (read-sequence file-contents stream)
                 (find-all (sb-ext:string-to-octets "**") file-contents)))
    (13 35 36)
    

    Or Use a Library

    I'm sure that there are ways to improve the code above, and I agree with @ignis volens that there are likely algorithmic improvements to be found.

    One virtue of using an existing library is that these problems have probably already been thought about and reasonable solutions arrived at. And using library code probably means that you will need to write less code yourself, as in the answer using the Series library by @coredump.