I have a fixed-length record file with several million records containing sales for several thousand customers. I need to generate a file with one record per customer, containing the customer id, the sales count, and the sales total.
I've just started to learn COBOL and I have no idea what the "COBOL way" to do this is.
The obvious approach in other languages would be to use some kind of map, but COBOL lacks such structures, and COBOL arrays are out of the question for performance reasons.
Another solution that comes to mind is using a database, but that seems like overkill for such a trivial task.
Could someone point me in the right direction?
I cannot stress this enough: if this is a one-time program, or even one run once a month, using the COBOL SORT should be preferred to spending time making it faster.
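For reference, here is a minimal sketch of what the SORT approach could look like (the file names and record layout here are assumptions, chosen to match the demonstration program shown later in this answer):

*> sketch only: sort the transactions by customer id, then
*> summarize with a control break in the output procedure
environment division.
input-output section.
file-control.
    select cust-txns assign "z:custtxns.dat"
        organization sequential.
    select sort-work assign "z:sortwork.tmp".
    select summ-list assign "z:summsort.txt"
        organization line sequential.
data division.
file section.
fd  cust-txns.
01  cust-rec.
    02 cust-dt       comp pic 9(8).
    02 cust-id       pic x(8).
    02 cust-sale-amt pic 9(4)v99.
sd  sort-work.
01  sort-rec.
    02 sort-dt       comp pic 9(8).
    02 sort-id       pic x(8).
    02 sort-sale-amt pic 9(4)v99.
fd  summ-list.
01  summ-rec.
    02 summ-id  pic x(8).
    02 summ-cnt pic zzz9.
    02 summ-amt pic z,zz9.99.
working-storage section.
01  prev-id     pic x(8).
01  txn-cnt     comp-5 pic 9(8) value 0.
01  total-sales comp-5 pic 9(6)v99 value 0.
procedure division.
main section.
    sort sort-work ascending sort-id
        using cust-txns
        output procedure summarize
    stop run
    .
summarize section.
    open output summ-list
    move spaces to prev-id
    perform until exit
        return sort-work
            at end
                exit perform
        end-return
        *> control break: a new customer id starts a new group
        if sort-id not = prev-id
            perform write-summary
            move sort-id to prev-id
        end-if
        add 1 to txn-cnt
        add sort-sale-amt to total-sales
    end-perform
    *> write the group still in progress at end of file
    perform write-summary
    close summ-list
    exit section
    .
write-summary.
    *> txn-cnt is zero only before the first record
    if txn-cnt > 0
        move prev-id to summ-id
        move txn-cnt to summ-cnt
        move total-sales to summ-amt
        write summ-rec
        move 0 to txn-cnt
        move 0 to total-sales
    end-if
    .

The SORT handles the millions of records; the output procedure only needs classic sequential control-break logic to do the counting and summing.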
The faster, hash-based method is similar to an "unordered map".
Create two tables: a "hash pointer table" (HPT) with more entries than there are customers (a prime number of entries is preferred, and the more entries the better for the distribution of hash values), and a variable-length (OCCURS DEPENDING ON) "record table" (RT) large enough to hold records for all customers that may be present.
Each HPT entry is an integer: zero means no records hash to that value, while a positive integer is a subscript into the record table.
Each RT entry consists of two parts: an integer pointing to the next entry in a linked list (LLP) and a group containing the record fields (customer id, sales count, and sales total). The linked list handles hash collisions and should stay quite short, only a few records.
An integer, RTP, points to the next available RT entry; its initial value is zero. It is also used to set the current length of the RT (OCCURS DEPENDING ON RTP).
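In COBOL terms, those declarations might look something like this (a sketch; the 8-character id and the table sizes are example choices matching the demonstration program later in this answer):

*> hash pointer table: one integer slot per hash value, zero = empty
01  hpt.
    02 hpt-entry comp-5 pic 9(8) value 0 occurs 54983.
*> count of RT entries in use; also sets the current ODO length
01  rtp comp-5 pic 9(8) value 0.
*> record table: linked-list pointer (LLP) plus the record fields
01  rt.
    02 rt-entry occurs 0 to 30000 depending on rtp.
        03 llp         comp-5 pic 9(8).
        03 rt-cust-id  pic x(8).
        03 rt-sale-cnt comp-5 pic 9(8).
        03 rt-sale-tot comp-5 pic 9(6)v99.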
As each record is read, the customer id is "hashed" to an integer within the number of entries in the HPT. This involves operations on the characters of the customer id, a MOD operation by the size of the HPT, and an ADD of one (COBOL subscripts start at 1).
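For example, with an 8-character id redefined as two binary words, the hash might be computed like this (a sketch; the multiplier 17 and the prime 54983 are just the choices used in the demonstration program below):

*> in working-storage: overlay the id with two binary words
01  hash-work.
    02 hash-id pic x(8).
    02 redefines hash-id.
        03 hash-1 comp pic 9(8).
        03 hash-2 comp pic 9(8).
01  hash-value comp-5 pic 9(8).

*> in the procedure division: combine the halves, take the
*> remainder, and add one to get a valid 1-based HPT subscript
    move cust-id to hash-id
    compute hash-value =
        function mod (hash-1 * 17 + hash-2, 54983) + 1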
To insert or accumulate values, first use the hashed value to access the HPT.
If the HPT entry is zero, insert the record: add one to the RTP, move the RTP to the HPT entry for that hash value, move zero to the LLP of the new RT entry, move the customer id and sales amount to the RT entry, and move 1 to the count.
If the HPT entry is not zero, use its value to follow the linked list in the RT, looking for an existing record with a matching customer id. If such a record exists, add one to the count and add the sales amount to the sales total. If it does not exist (the LLP of the last entry is zero), create a new RT entry as before, but move the RTP to the LLP of the last record in the linked list rather than to the HPT entry.
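Put together, the insert-or-accumulate step might look like the following sketch (names continue from the sketches above; cust-id and cust-sale-amt are the input record fields, and curr is a working subscript):

*> in working-storage: subscript for walking a linked list
01  curr comp-5 pic 9(8) value 0.

*> in the procedure division, for each input record:
    if hpt-entry (hash-value) = 0
        *> first record for this hash value: allocate a new RT entry
        add 1 to rtp
        move rtp to hpt-entry (hash-value)
        move 0 to llp (rtp)
        move cust-id to rt-cust-id (rtp)
        move 1 to rt-sale-cnt (rtp)
        move cust-sale-amt to rt-sale-tot (rtp)
    else
        *> follow the linked list to a match or to the last entry
        move hpt-entry (hash-value) to curr
        perform until rt-cust-id (curr) = cust-id
                or llp (curr) = 0
            move llp (curr) to curr
        end-perform
        if rt-cust-id (curr) = cust-id
            *> existing customer: accumulate
            add 1 to rt-sale-cnt (curr)
            add cust-sale-amt to rt-sale-tot (curr)
        else
            *> new customer: append a new RT entry to the list
            add 1 to rtp
            move rtp to llp (curr)
            move 0 to llp (rtp)
            move cust-id to rt-cust-id (rtp)
            move 1 to rt-sale-cnt (rtp)
            move cust-sale-amt to rt-sale-tot (rtp)
        end-if
    end-if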
After all records have been processed, the RT will contain RTP entries. Sort the table with an in-memory sort on customer id and output the values.
This method should require no more than a few megabytes of storage.
A few caveats:
This method is sensitive to the distribution of values from the hash function.
Performance suffers if the more frequently accessed customers sit at the end of their linked lists.
As the number of customers grows, the table sizes will need to grow, and that may cause distribution problems with the values from the hash function.
I only used this method once, over twenty years ago, in a word-count speed challenge: how fast a program could count the number of distinct words in a large text file. It was very fast, much faster than the other methods, and that problem has some similarities to this one.
It took several tries to get the program tuned to run really fast, which is why I suggest staying with the standard SORT: it works and is more easily understood by others.
I wrote a program to demonstrate how this may be done. Since the question mentioned "several million records" and "several thousand customers", I created a test file with about 4 million records and about 27 thousand customers. I used a random number on each record to simulate a date/time and sorted the test file into that sequence.
I also wrote a program that sorts the input file by customer id for comparison. Both programs created identical output, but the hashing solution ran about 12 times faster on my old COBOL system.
Code:
environment division.
input-output section.
file-control.
    select cust-txns assign "z:custtxns.dat"
        organization sequential
        .
    select summ-list assign "z:summhash.txt"
        organization line sequential
        .
data division.
file section.
fd  cust-txns.
01  cust-rec.
    02 cust-dt       comp pic 9(8).
    02 cust-id       pic x(8).
    02 cust-sale-amt pic 9(4)v99.
fd  summ-list.
01  summ-rec.
    02 summ-id  pic x(8).
    02 summ-cnt pic zzz9.
    02 summ-amt pic z,zz9.99.
working-storage section.
01  start-time pic 9(8).
01  end-time   pic 9(8).
01  n          comp-5 pic 9(8).
01  cust-cnt   comp-5 pic 9(8) value 0.
01  hash-num   comp-5 pic 9(8) value 0.
*> the 8-character id is overlaid with two binary words for hashing
01  .
    02 hash-id pic x(8).
    02 redefines hash-id.
        03 hash-1 comp pic 9(8).
        03 hash-2 comp pic 9(8).
*> a prime value selected for efficient hashing
78  hash-mod value 54983.
01  hash-value comp-5 pic 9(8) value 0.
01  discard    comp-5 pic 9(16) value 0.
01  curr-llp   comp-5 pic 9(8) value 0.
01  work-rec.
    02 work-llp   comp-5 pic 9(8).
    02 work-id    pic x(8).
    02 work-cnt   comp-5 pic 9(8).
    02 work-sales comp-5 pic 9(6)v99.
*> the hash table size is about twice the number of customers
01  hash-lookup.
    02 hash-entry comp-5 pic 9(8) value 0 occurs hash-mod.
*> required on my system for issues with crossing boundaries
78  mem-fill value 65535 - next.
01  pic x(mem-fill).
*> less than 27000 customers expected
01  cust-tab.
    02 cust-entry occurs 0 to 30000 depending cust-cnt.
        03 cust-llp         comp-5 pic 9(8).
        03 cust-name        pic x(8).
        03 cust-txn-cnt     comp-5 pic 9(8).
        03 cust-total-sales comp-5 pic 9(6)v99.
*> for the 64k boundary issue
78  cust-fill value 12.
        03 pic x(cust-fill).
procedure division.
    accept start-time from time
    open input cust-txns
    perform until exit
        read cust-txns
            at end
                exit perform
            not end
                perform count-record
        end-read
    end-perform
    close cust-txns
    perform print-report
    accept end-time from time
    display "Start time: " start-time
    display " End time: " end-time
    stop run
    .
count-record.
    *> place data into a temporary record
    move 0 to work-llp
    move cust-id to work-id
    move 1 to work-cnt
    move cust-sale-amt to work-sales
    *> get the hash-value
    perform hash-proc
    *> add the record if this is the first with that hash-value
    if hash-entry (hash-value) = 0
        add 1 to cust-cnt
        move work-rec to cust-entry (cust-cnt)
        move cust-cnt to hash-entry (hash-value)
    else
        perform search-existing
    end-if
    .
search-existing.
    *> set the head of a linked list
    move hash-entry (hash-value) to curr-llp
    *> follow the links to the last record or an existing record
    perform until cust-llp (curr-llp) = 0
            or cust-name (curr-llp) = work-id
        move cust-llp (curr-llp) to curr-llp
    end-perform
    *> if an existing record, update it
    if cust-name (curr-llp) = work-id
        add 1 to cust-txn-cnt (curr-llp)
        add work-sales to cust-total-sales (curr-llp)
    else
        *> add a new record to the end of the linked list
        add 1 to cust-cnt
        move cust-cnt to cust-llp (curr-llp)
        move work-rec to cust-entry (cust-cnt)
    end-if
    .
hash-proc.
    move work-id to hash-id
    compute discard = hash-1 * 17 + hash-2
    *> this is a MOD function but faster on this system
    divide discard by hash-mod
        giving discard remainder hash-value
    add 1 to hash-value
    .
print-report.
    *> in-memory sort for the approx 27000 customers
    sort cust-entry ascending cust-name
    *>
    open output summ-list
    perform varying n from 1 by 1
            until n > cust-cnt
        move cust-name (n) to summ-id
        move cust-txn-cnt (n) to summ-cnt
        move cust-total-sales (n) to summ-amt
        write summ-rec
    end-perform
    close summ-list
    .