cobol

Record grouping and aggregation


I have a fixed-length record file with several million records containing sales for several thousand customers. I need to generate a file with one record per customer containing the customer id, sales count, and sales total.

I've just started to learn COBOL and I have no idea what the "COBOL way" to do this is.

The obvious way in other programming languages would be to use some kind of map, but COBOL lacks such structures. COBOL arrays are out of the question for performance reasons.

Another solution that comes to mind is using a database, but that seems like overkill for such a trivial task.

Could someone point me in the right direction?


Solution

  • I cannot stress this enough: if this is a one-time program, or even one run once a month, using the COBOL SORT should be preferred to spending time making it faster.


    This method is similar to an "unordered map".

    Create two tables: a "hash pointer table" (HPT) with more entries than there are customers (a prime number of entries is preferred, and the more entries the better for the distribution of hash values), and a variable-length (OCCURS DEPENDING ON) "record table" (RT) large enough to hold records for all customers that may be present.

    Each HPT entry is an integer: zero means no records hash to that value, while a positive value is used as a subscript into the record table.

    Each RT entry consists of two parts: an integer pointing to the next entry in a linked list (LLP), and a group containing the record fields (customer id, sales count, and sales total). The linked list handles overflow from hash collisions and should stay quite short, only a few records.

    An integer RTP tracks the number of RT entries in use; its initial value is zero. It also sets the current length of the RT (OCCURS DEPENDING ON RTP).
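
    In data-division terms, a minimal sketch of these structures (sizes and names are illustrative; the full program at the end uses the same shape) might be:

      *> hash pointer table: a prime number of entries, zero = empty
       78 hpt-size value 54983.
       01 hpt.
         02 hpt-entry comp-5 pic 9(8) value 0 occurs hpt-size.
      *> record table: linked-list pointer plus the customer fields;
      *> rtp counts the entries in use and sets the table length
       01 rtp comp-5 pic 9(8) value 0.
       01 rt.
         02 rt-entry occurs 0 to 30000 depending rtp.
           03 rt-llp comp-5 pic 9(8).
           03 rt-cust-id pic x(8).
           03 rt-sales-cnt comp-5 pic 9(8).
           03 rt-sales-total comp-5 pic 9(6)v99.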

    As each record is read, the customer id is "hashed" to an integer within the number of entries in the HPT. This involves some operations on the characters of the customer id, a MOD by the size of the HPT, and adding one so the result is a valid 1-based subscript.
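
    As a sketch, assuming the customer id is 8 characters that can be redefined as two binary words, and using hpt-size from the sketch above:

       01 .
         02 hash-id pic x(8).
         02 redefines hash-id.
           03 hash-1 comp pic 9(8).
           03 hash-2 comp pic 9(8).
       01 hash-value comp-5 pic 9(8).
      *> in the procedure division: mix the two halves, reduce
      *> modulo the table size, and add 1 for a 1-based subscript
           move cust-id to hash-id
           compute hash-value =
               function mod(hash-1 * 17 + hash-2, hpt-size) + 1

    The full program below computes the same thing with DIVIDE ... REMAINDER, which was faster on that system.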

    To insert or accumulate values, first use the hashed value to access the HPT.

    If the HPT entry is zero, insert the record: add one to the RTP, move the RTP to the HPT entry for that hash value, move zero to the LLP of the new RT entry, move the customer id and sales amount to the RT entry, and move 1 to the count.

    If the HPT entry is not zero, use its value to follow the linked list in the RT and determine whether a record with a matching customer id already exists. If it does, add one to the count and add the sales amount to the sales total. If it does not (the LLP of the last entry is zero), create a new RT entry as before, but move the RTP to the LLP of that last record in the linked list rather than to the HPT entry.
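
    In outline, using the names from the sketches above (search-chain is a stand-in for the chain-walking paragraph, search-existing in the full program):

           if hpt-entry (hash-value) = 0
      *> first record for this hash value: append to the RT and
      *> point the HPT entry at the new record
               add 1 to rtp
               move 0 to rt-llp (rtp)
               move cust-id to rt-cust-id (rtp)
               move 1 to rt-sales-cnt (rtp)
               move cust-sale-amt to rt-sales-total (rtp)
               move rtp to hpt-entry (hash-value)
           else
      *> walk the chain: update a matching record, or link a new
      *> entry to the end of the list
               perform search-chain
           end-if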

    After all records have been processed, the record table will contain RTP entries. Sort the table in memory on customer id and output the values.

    This method should require no more than a few megabytes of storage: for example, 30,000 RT entries at about 32 bytes each is under 1 MB, and a 54,983-entry HPT of four-byte integers adds roughly 220 KB.


    A few caveats.

    This method is sensitive to the distribution of values from the hash function.

    Performance suffers if the more frequently accessed customers fall toward the end of their linked lists.

    As the number of customers grows, table sizes will need to grow and that may cause distribution problems with the values from the hash function.


    I have only used this method once, over twenty years ago, in a word count speed challenge: how fast could a program count the number of distinct words in a large text file? It was very fast, much faster than the other methods, and that problem has some similarities to this one.

    It took several tries to get the program tuned to run really fast, which is why I suggest staying with the standard SORT. It works and is more easily understood by others.


    I wrote a program to demonstrate how this may be done. Since the question mentioned "several million records" and "several thousand customers", I created a test file with about 4 million records and about 27 thousand customers. I used a random number on each record to simulate date/time and sorted the test file into that sequence.

    I also wrote a program sorting the input file by customer id for comparison. Both programs created identical output, but the hashing solution ran about 12 times faster on my old COBOL system.
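
    A minimal sketch of such a SORT-based program (the program-id and file names are illustrative; the record layouts match the hashing program below) might be:

       identification division.
       program-id. summsort.
       environment division.
       input-output section.
       file-control.
           select cust-txns assign "z:custtxns.dat"
               organization sequential
               .
           select sort-work assign "z:sortwork.tmp"
               .
           select summ-list assign "z:summsort.txt"
               organization line sequential
               .
       data division.
       file section.
       fd cust-txns.
       01 cust-rec.
         02 cust-dt comp pic 9(8).
         02 cust-id pic x(8).
         02 cust-sale-amt pic 9(4)v99.
       sd sort-work.
       01 sort-rec.
         02 sort-dt comp pic 9(8).
         02 sort-id pic x(8).
         02 sort-sale-amt pic 9(4)v99.
       fd summ-list.
       01 summ-rec.
         02 summ-id pic x(8).
         02 summ-cnt pic zzz9.
         02 summ-amt pic z,zz9.99.
       working-storage section.
       01 prev-id pic x(8) value high-values.
       01 txn-cnt comp-5 pic 9(8) value 0.
       01 total-sales comp-5 pic 9(6)v99 value 0.
       procedure division.
      *> sort the transactions by customer id, then summarize the
      *> sorted stream with a simple control break
           sort sort-work ascending sort-id
               using cust-txns
               output procedure summarize
           stop run
           .
       summarize.
           open output summ-list
           perform until exit
               return sort-work
               at end
                   exit perform
               not end
      *> id changed: write the finished group, start a new one
                   if sort-id not = prev-id
                       perform write-summary
                       move sort-id to prev-id
                       move 0 to txn-cnt
                       move 0 to total-sales
                   end-if
                   add 1 to txn-cnt
                   add sort-sale-amt to total-sales
               end-return
           end-perform
      *> write the final group
           perform write-summary
           close summ-list
           .
       write-summary.
      *> high-values means no group has started yet
           if prev-id not = high-values
               move prev-id to summ-id
               move txn-cnt to summ-cnt
               move total-sales to summ-amt
               write summ-rec
           end-if
           .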

    Code:

       identification division.
      *> a program-id is required; the name here is illustrative
       program-id. summhash.
       environment division.
       input-output section.
       file-control.
           select cust-txns assign "z:custtxns.dat"
               organization sequential
               .
           select summ-list assign "z:summhash.txt"
               organization line sequential
               .
       data division.
       file section.
       fd cust-txns.
       01 cust-rec.
         02 cust-dt comp pic 9(8).
         02 cust-id pic x(8).
         02 cust-sale-amt pic 9(4)v99.
       fd summ-list.
       01 summ-rec.
         02 summ-id pic x(8).
         02 summ-cnt pic zzz9.
         02 summ-amt pic z,zz9.99.
       working-storage section.
       01 start-time pic 9(8).
       01 end-time pic 9(8).
       01 n comp-5 pic 9(8).
       01 cust-cnt comp-5 pic 9(8) value 0.
       01 hash-num comp-5 pic 9(8) value 0.
       01 .
         02 hash-id pic x(8).
         02 redefines hash-id.
           03 hash-1 comp pic 9(8).
           03 hash-2 comp pic 9(8).
      *> a prime value selected for efficient hashing
       78 hash-mod value 54983.
       01 hash-value comp-5 pic 9(8) value 0.
       01 discard comp-5 pic 9(16) value 0.
       01 curr-llp comp-5 pic 9(8) value 0.
       01 work-rec.
         02 work-llp comp-5 pic 9(8).
         02 work-id pic x(8).
         02 work-cnt comp-5 pic 9(8).
         02 work-sales comp-5 pic 9(6)v99.
      *> the hash table size is about twice the number of customers
       01 hash-lookup.
         02 hash-entry comp-5 pic 9(8) value 0 occurs hash-mod.
      *> required on my system for issues with crossing boundaries
       78 mem-fill value 65535 - next.
       01 pic x(mem-fill).
      *> less than 27000 customers expected
       01 cust-tab.
         02 cust-entry occurs 0 to 30000 depending cust-cnt.
           03 cust-llp comp-5 pic 9(8).
           03 cust-name pic x(8).
           03 cust-txn-cnt comp-5 pic 9(8).
           03 cust-total-sales comp-5 pic 9(6)v99.
      *> for the 64k boundary issue
           78 cust-fill value 12.
           03 pic x(cust-fill).
       procedure division.
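      *> main: time the run, tally each transaction, then report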
           accept start-time from time
           open input cust-txns
           perform until exit
               read cust-txns
               at end
                   exit perform
               not end
                   perform count-record
               end-read
           end-perform
           close cust-txns
           perform print-report
           accept end-time from time
           display "Start time: " start-time
           display "  End time: " end-time
           stop run
           .
       count-record.
      *> place data into a temporary record
           move 0 to work-llp
           move cust-id to work-id
           move 1 to work-cnt
           move cust-sale-amt to work-sales
      *> get the hash-value
           perform hash-proc
      *> add the record if this is the first with that hash-value
           if hash-entry (hash-value) = 0
               add 1 to cust-cnt
               move work-rec to cust-entry (cust-cnt)
               move cust-cnt to hash-entry (hash-value)
           else
               perform search-existing
           end-if
           .
       search-existing.
      *> set the head of a linked list
           move hash-entry (hash-value) to curr-llp
      *> follow the links to the last record or an existing record
           perform until
               cust-llp (curr-llp) = 0
             or cust-name (curr-llp) = work-id
               move cust-llp (curr-llp) to curr-llp
           end-perform
      *> if an existing record update
           if cust-name (curr-llp) = work-id
               add 1 to cust-txn-cnt (curr-llp)
               add work-sales to cust-total-sales (curr-llp)
           else
      *> add new record to end of linked list
               add 1 to cust-cnt
               move cust-cnt to cust-llp (curr-llp)
               move work-rec to cust-entry (cust-cnt)
           end-if
           .
       hash-proc.
           move work-id to hash-id
           compute discard = hash-1 * 17 + hash-2
      *> this is a MOD function but faster on this system
           divide discard by hash-mod
               giving discard remainder hash-value
           add 1 to hash-value
           .
       print-report.
      *> in-memory sort for the approx 27000 customers
           sort cust-entry ascending cust-name
      *>
           open output summ-list
           perform varying n from 1 by 1
           until n > cust-cnt
                move cust-name (n) to summ-id
               move cust-txn-cnt (n) to summ-cnt
               move cust-total-sales (n) to summ-amt
               write summ-rec
           end-perform
           close summ-list
           .