Reporting the Top N items is one of the classic problems of data bases. The reporting part of the Las Vegas 4GL Problem is a Top N problem. The Top N problem is trivially solve in DATATRIEVE using REPORT FIRST N ... SORTED BY DECREASING ... when there is one record for each occurrence to be sorted. When there are multiple occurrences, the records need to be aggregated with
Typical uses of the Top N problem in a manufacturing environment is to determine the Top 50 customers for a particular period of time. Another example would be to determine the Top 100 manufactured items by gross or net.
The particular twist to the Top N problem which is the subject of this article is to determine the Top N items for the Top 20 customers for a particular period of time. We call this the Double Top N problem.
Knappco Corporation is a small manufacturer making fillcaps, hatches, valves, nozzles, vents, and manholes for the cargo tank manufacturing industry. Knappco has several thousand customers in its customer master file, and Knappco makes several thousand different items. The customer master file contains year-to-date and month-to-date sales for each customer. The inventory master file contains the year-to-date and the month-to-date sales for each item (part). But only the sales history file contains the records of the individual sales which can determine which customer purchased which item. Unfortunately the sales history is a relative file written from DIBOL so there are no keys in the file to facilitate rapid access to the data. There are 25 fields in the sales history file; only six are relevant to illustrate this problem. There are
DEFINE RECORD SALESHISTORY_RECORD USING 01 SALESHISTORY_REC. 03 SH_ITEM_NUMBER PIC X(15). 03 SH_ITEM_DESCRIPTION PIC X(30). 03 MEMO_DATE. 05 SALES_MONTH PIC 99. 05 SALES_DAY PIC 99. 05 SALES_YEAR PIC 99. 03 SH_QUANTITY PIC 9(6) USAGE IS ZONED EDIT_STRING IS ------9. 03 SH_EXTENDED_PRICE PIC 9(8)V99 USAGE IS ZONED EDIT_STRING IS $$$,$$$,$$9.99- . 03 SH_CUSTOMER_NAME PIC X(25). ;
The data in SALESHISTORY is summarized into the derived data domains TOP_CUSTS, TOP_ITEMS, and TOP_BOTH. We originally tried to accomplish this without the use of the third domain, TOP_BOTH; however, performance was so bad that the use of a third domain was essentially required. The records for the three derived domains are:
DEFINE RECORD TOP_CUSTS_RECORD USING 01 TOP_CUSTS_REC. 03 SH_CUSTOMER_NAME PIC X(25). ! primary key ** 03 TOTAL_SH_EXTENDED_PRICE PIC 9(8)V99 USAGE IS ZONED EDIT_STRING IS $$$,$$$,$$9.99- . 03 CUSTS_ORDER USAGE IS WORD. 03 ALTERNATE_NAME PIC X(25). ; DEFINE RECORD TOP_ITEMS_RECORD USING 01 TOP_ITEMS_REC. 03 SH_ITEM_NUMBER PIC X(15). ! (primary key) 03 SH_ITEM_DESCRIPTION PIC X(30). 03 TOTAL_SH_QUANTITY PIC 9(6) USAGE IS ZONED EDIT_STRING IS ------9. 03 TOTAL_SH_EXTENDED_PRICE PIC 9(8)V99 USAGE IS ZONED EDIT_STRING IS $$$,$$$,$$9.99- . 03 ITEMS_ORDER USAGE IS WORD. 03 ALTERNATE_NAME PIC X(15). ; DEFINE RECORD TOP_BOTH_RECORD USING 01 TOP_BOTH_REC. 03 GROUP PIC 9. 03 ITEMS_ORDER USAGE IS WORD. 03 CUSTS_ORDER USAGE IS WORD. 03 ALTERNATE_ITEMNO PIC X(15). 03 ALTERNATE_NAME PIC X(25). 03 SH_QUANTITY PIC 9(6) USAGE IS ZONED EDIT_STRING IS ------9. 03 SH_EXTENDED_PRICE PIC 9(8)V99 USAGE IS ZONED EDIT_STRING IS $$$,$$$,$$9.99- . ;
The procedure to report the Top N Items for the Top 20 Customers is in several parts:
The first part of the procedure establishes global variables for selecting the year, the number of items to be reported, and temporary variables (labeled TEMP_) to tally the aggregation of sales and quantities. Access to the SALESHISTORY is then gained.
DEFINE PROCEDURE TOP_N_ITEMS_FOR_TOP20_CUSTS DECLARE YEAR PIC 99 VALID IF YEAR BT 89 AND 99 . DECLARE THIS_YEAR PIC 99. DECLARE TODAY USAGE IS DATE. TODAY = "today" THIS_YEAR = FN$YEAR(TODAY) - 1900 YEAR = *."year [89, 90, 91, etc]" DECLARE MAX_NUMBER_OF_ITEMS USAGE IS WORD. MAX_NUMBER_OF_ITEMS = 1000 DECLARE NUMBER_OF_ITEMS USAGE IS WORD VALID IF NUMBER_OF_ITEMS BETWEEN 1 AND MAX_NUMBER_OF_ITEMS . NUMBER_OF_ITEMS = *."how many of the top items to print [25, 50, etc]" DECLARE FIRST_FLAG PIC X. FIRST_FLAG = *."space and return when LA100 printer is ready" FIRST_FLAG = "Y" ! DECLARE TEMP_SH_ITEM_NUMBER PIC X(15). DECLARE TEMP_SH_ITEM_DESCRIPTION PIC X(30). DECLARE TEMP_TOTAL_SH_QUANTITY PIC 9(6) USAGE IS ZONED . DECLARE TEMP_TOTAL_SH_EXTENDED_PRICE PIC 9(8)V99 USAGE IS ZONED . DECLARE TOTAL_SALES PIC 9(8)V99 USAGE IS ZONED . READY SALESHISTORY SHARED READ
The second part of the procedure creates a new blank TOP_ITEMS file, READYs it for storing data, zeroes temporary variables, and begins reading SALESHISTORY records in order of item number and tallying then up and storing them in TOP_ITEMS. Note that the last STORE outside of the FOR SALESHISTORY loop handles the last or last group of records properly.
DEFINE FILE FOR TOP_ITEMS KEY=SH_ITEM_NUMBER; READY TOP_ITEMS SHARED WRITE TEMP_SH_ITEM_NUMBER = " " TEMP_TOTAL_SH_QUANTITY = 0 TEMP_TOTAL_SH_EXTENDED_PRICE = 0.0 TOTAL_SALES = 0.0 FOR SALESHISTORY WITH SALES_YEAR = YEAR SORTED BY SH_ITEM_NUMBER BEGIN TOTAL_SALES = TOTAL_SALES + SH_EXTENDED_PRICE IF (FIRST_FLAG EQ "Y") THEN BEGIN FIRST_FLAG = "N" END ELSE BEGIN IF (SH_ITEM_NUMBER NE TEMP_SH_ITEM_NUMBER) THEN BEGIN STORE TOP_ITEMS USING BEGIN SH_ITEM_NUMBER = TEMP_SH_ITEM_NUMBER SH_ITEM_DESCRIPTION = TEMP_SH_ITEM_DESCRIPTION TOTAL_SH_QUANTITY = TEMP_TOTAL_SH_QUANTITY TOTAL_SH_EXTENDED_PRICE = TEMP_TOTAL_SH_EXTENDED_PRICE END TEMP_TOTAL_SH_QUANTITY = SH_QUANTITY TEMP_TOTAL_SH_EXTENDED_PRICE = SH_EXTENDED_PRICE END ELSE BEGIN TEMP_TOTAL_SH_QUANTITY = TEMP_TOTAL_SH_QUANTITY + SH_QUANTITY TEMP_TOTAL_SH_EXTENDED_PRICE = TEMP_TOTAL_SH_EXTENDED_PRICE + SH_EXTENDED_PRICE END END TEMP_SH_ITEM_NUMBER = SH_ITEM_NUMBER TEMP_SH_ITEM_DESCRIPTION = SH_ITEM_DESCRIPTION END STORE TOP_ITEMS USING BEGIN SH_ITEM_NUMBER = TEMP_SH_ITEM_NUMBER SH_ITEM_DESCRIPTION = TEMP_SH_ITEM_DESCRIPTION TOTAL_SH_QUANTITY = TEMP_TOTAL_SH_QUANTITY TOTAL_SH_EXTENDED_PRICE = TEMP_TOTAL_SH_EXTENDED_PRICE END
The third part of the procedure goes back through the data in TOP_ITEMS and assigns the order of the items for the first NUMBER_OF_ITEMS. For those items past the first NUMBER_OF_ITEMS it assigns the order MAX_NUMBER_OF_ITEMS + 1 and the alternate name of "OTHER".
FOR TOP_ITEMS SORTED BY DECREASING TOTAL_SH_EXTENDED_PRICE BEGIN IF (RUNNING COUNT LE NUMBER_OF_ITEMS) THEN BEGIN MODIFY USING BEGIN ITEMS_ORDER = RUNNING COUNT ALTERNATE_NAME = SH_ITEM_NUMBER END END ELSE BEGIN MODIFY USING BEGIN ITEMS_ORDER = MAX_NUMBER_OF_ITEMS + 1 ALTERNATE_NAME = "OTHERS" END END END
The fourth part of the procedure summarizes the data by customer. But there is a small problem here. Knappco deals with several manufacturing locations of large corporate trailer manufactures. That is, there a many different locations (and thus different customer numbers and customer names) for manufactures such as CARGILL, BEALL, FREUHAUF, CHEVRON, and TRAILMOBILE. But we would like to summarize the sales by the whole corporation rather than one specific manufacturing location. Therefore, we created a declared variable which collapses all the various customer names such as FREUHAUF-1, FREUHAUF-2, FREUHAUF-3 into FREUHAUF.
DECLARE CUSTOMER_NAME COMPUTED BY CHOICE OF SH_CUSTOMER_NAME STARTING WITH "BEALL" THEN "BEALL" SH_CUSTOMER_NAME STARTING WITH "CARGILL" THEN "CARGILL" SH_CUSTOMER_NAME STARTING WITH "CHEVRON" THEN "CHEVRON" SH_CUSTOMER_NAME STARTING WITH "FREUHAUF" THEN "FREUHAUF" SH_CUSTOMER_NAME STARTING WITH "TRAILMOBILE" THEN "TRAILMOBILE" . . . ELSE SH_CUSTOMER_NAME END_CHOICE.
Then a temporary variable is declared to keep track of when we have reached the next customer, a new TOP_CUSTS files is created, the domain is READY'ed for storing data, temporary variables are set, and the SALESHISTORY data is read in order of CUSTOMER_NAME (the collapsed customer name), the sales dollars tallied, and the data is stored in TOP_CUSTS. Again, an extra STORE is required outside of the FOR SALESHISTORY loop to handle the last record or group of records.
DECLARE TEMP_CUSTOMER_NAME PIC X(25). DEFINE FILE FOR TOP_CUSTS KEY=SH_CUSTOMER_NAME; READY TOP_CUSTS SHARED WRITE FIRST_FLAG = "Y" TEMP_TOTAL_SH_EXTENDED_PRICE = 0.0 TOTAL_SALES = 0.0 ! FOR SALESHISTORY WITH SALES_YEAR = YEAR SORTED BY CUSTOMER_NAME BEGIN TOTAL_SALES = TOTAL_SALES + SH_EXTENDED_PRICE IF (FIRST_FLAG EQ "Y") THEN BEGIN FIRST_FLAG = "N" END ELSE BEGIN IF (CUSTOMER_NAME NE TEMP_CUSTOMER_NAME) THEN BEGIN STORE TOP_CUSTS USING BEGIN SH_CUSTOMER_NAME = TEMP_CUSTOMER_NAME TOTAL_SH_EXTENDED_PRICE = TEMP_TOTAL_SH_EXTENDED_PRICE END TEMP_TOTAL_SH_EXTENDED_PRICE = SH_EXTENDED_PRICE END ELSE BEGIN TEMP_TOTAL_SH_EXTENDED_PRICE = TEMP_TOTAL_SH_EXTENDED_PRICE + SH_EXTENDED_PRICE END END TEMP_CUSTOMER_NAME = CUSTOMER_NAME END STORE TOP_CUSTS USING BEGIN SH_CUSTOMER_NAME = TEMP_CUSTOMER_NAME TOTAL_SH_EXTENDED_PRICE = TEMP_TOTAL_SH_EXTENDED_PRICE END
The fifth part of the procedure assigns the top 20 customers. For those past the top 20, the order is set to 21 and the name is changed to "OTHERS".
FOR TOP_CUSTS SORTED BY DECREASING TOTAL_SH_EXTENDED_PRICE BEGIN IF (RUNNING COUNT LE 20) THEN BEGIN MODIFY USING BEGIN CUSTS_ORDER = RUNNING COUNT ALTERNATE_NAME = SH_CUSTOMER_NAME END END ELSE BEGIN MODIFY USING BEGIN CUSTS_ORDER = 20 + 1 ALTERNATE_NAME = "OTHERS" END END END
The sixth part of the procedure is to copy to the data into TOP_BOTH. In this step the order is assigned for both customers and items. In addition, a GROUP is assigned. The field GROUP is used later to control the sub-totals of those items which are in the Top N parts and to prevent a sub-total from being displayed for those outside the Top N. ITEMS_ORDER is determined by looking up the item number in a domain table, TOP_ITEMS_ORDER_TABLE, which is from the domain TOP_ITEMS mapping SH_ITEM_NUMBER to ITEMS_ORDER. CUSTS_ORDER is determined by looking up the collapsed customer name in TOP_CUSTS to find the customers order. Data is read from SALESHISTORY and stored in TOP_BOTH. Storing data in TOP_BOTH could be avoided. However, because the SALESHISTORY file contains several years worth of data and the performance is very significantly improved if GROUP is a stored variable, we have chosen to actually rewrite the data.
DECLARE GROUP COMPUTED BY CHOICE OF ITEMS_ORDER LE NUMBER_OF_ITEMS THEN 0 ELSE 1 END_CHOICE. DECLARE ITEMS_ORDER COMPUTED BY FORMAT (SH_ITEM_NUMBER VIA TOP_ITEMS_ORDER_TABLE) USING 9999 . DECLARE CUSTS_ORDER COMPUTED BY CUSTOMER_NAME VIA TOP_CUSTS_ORDER_TABLE . DEFINE FILE FOR TOP_BOTH; READY TOP_BOTH SHARED WRITE FOR SALESHISTORY WITH SALES_YEAR = YEAR BEGIN STORE TOP_BOTH USING BEGIN GROUP = GROUP ITEMS_ORDER = ITEMS_ORDER CUSTS_ORDER = CUSTS_ORDER ALTERNATE_NAME = CUSTOMER_NAME VIA TOP_CUSTS_NAME_TABLE ALTERNATE_ITEMNO = SH_ITEM_NUMBER VIA TOP_ITEMS_NAME_TABLE SH_QUANTITY = SH_QUANTITY SH_EXTENDED_PRICE = SH_EXTENDED_PRICE END END
The seventh part of the procedure creates three sets of twenty-one declared variables each. X01 to X21 tally the sales; Y01 to Y21 tally the item quantities; and Z01 to Z21 tally the average price. This looks like a lot of "brute force" and it is. But this is the standard technique in DATATRIEVE for getting horizontal totals and sub-totals. Note that the declarations for X02 through X20, Y02 through Y20, and Z02 through Z20 have been replaced by " . . . ".
DECLARE X01 COMPUTED BY CHOICE OF CUSTS_ORDER EQ 01 THEN SH_EXTENDED_PRICE ELSE 0 END_CHOICE. . . . DECLARE X21 COMPUTED BY CHOICE OF CUSTS_ORDER EQ 21 THEN SH_EXTENDED_PRICE ELSE 0 END_CHOICE. DECLARE Y01 COMPUTED BY CHOICE OF CUSTS_ORDER EQ 01 THEN SH_QUANTITY ELSE 0 END_CHOICE. . . . DECLARE Y21 COMPUTED BY CHOICE OF CUSTS_ORDER EQ 21 THEN SH_QUANTITY ELSE 0 END_CHOICE. DECLARE Z01 COMPUTED BY CHOICE OF TOTAL Y01 EQ 0 THEN 0 ELSE (TOTAL X01)/(TOTAL Y01) END_CHOICE. . . . DECLARE Z21 COMPUTED BY CHOICE OF TOTAL Y21 EQ 0 THEN 0 ELSE (TOTAL X21)/(TOTAL Y21) END_CHOICE.
The eighth and last part of the procedure actually makes the report. However, first the label indicating which customers are the top 20 customers must be created. This requires a variable LABEL and a loop through the TOP_CUSTS domain picking up the first seven characters of the customer name. With the number of columns in this report, there is only enough room for seven characters from each. Fortunately, seven characters are enough to differentiate the customers each of which, after all, purchase several hundreds of thousands of dollars of worth of parts from Knappco. (We surely ought to be able to recognize our best customers.)
DECLARE LABEL PIC X(176). LABEL = "" FOR FIRST 21 TOP_CUSTS SORTED BY CUSTS_ORDER BEGIN LABEL = LABEL || FORMAT (SH_CUSTOMER_NAME VIA TOP_CUSTS_NAME_TABLE) USING X(7) | "|" END LABEL = LABEL || "Totals"
The report is directed to an LA100. The reason that an LA100 is
chosen rather than line printer is the width of the report.
LP-type devices print only 132 columns; an LA100 in 16.5
characters per inch mode can printer a report which is 217
characters wide. To report the top 50 items would required 4 or 5
pages so performance is not a serious problem with the slower
LA100 device. The
ON LA100printer: BEGIN PRINT "
[4w" REPORT TOP_BOTH SORTED BY GROUP, ITEMS_ORDER SET COLUMNS_PAGE = 217 SET REPORT_NAME = "KNAPPCO Corporation"/"Top Items for Top 20 Customers"
The AT TOP OF PAGE is a little unusual. Since the LABEL contains the first seven characters of the Top 20 customers, an AT TOP OF PAGE was required to have computed column headers on the tallies. However, the first part of the AT TOP OF PAGE allows the creation of a line of the form
Top N Items for 19YY [Year to Date] on total sales of $$$,$$$,$$$9.99
which appear before the column headers. The optional "Year to Date" is printed if we are reporting the current (incomplete) year.
AT TOP OF PAGE PRINT REPORT_HEADER, SKIP, COL 1, "Top"||| FORMAT NUMBER_OF_ITEMS USING ZZZ9 ||| "Items for 19"| FORMAT YEAR USING 99 | CHOICE OF YEAR EQ THIS_YEAR THEN " Year to date" ELSE "" END_CHOICE ||| "on total sales of "| FORMAT TOTAL_SALES USING $$$,$$$,$$9.99, SKIP 2, COL 1, "Item", COL 10, "Description", COL 42, LABEL(-), SKIP
The data is summarized by ITEMS_ORDER to give the part number (the ALTERNATE_ITEMNO), and the description of the part (via a domain table INVMAS_DESC_TABLE). The total sales, X01 to X21; the total quantities, Y01 to Y21; and the average unit prices, Z01 to Z21 are printed on three successive lines. Notice that code for totals of X02 through X20, totals Y02 through Y20, and Z02 through Z20 have been replaced with " . . . ".
AT BOTTOM OF ITEMS_ORDER PRINT ALTERNATE_ITEMNO USING X(8), CHOICE OF ITEMS_ORDER LE NUMBER_OF_ITEMS THEN ALTERNATE_ITEMNO VIA INVMAS_DESC_TABLE ELSE " " END_CHOICE USING X(30), COL 42, TOTAL X01(-) USING Z(6)9, . . . ,TOTAL X21(-) USING Z(6)9, TOTAL SH_EXTENDED_PRICE(-) USING Z(6)9, SKIP, COL 15, "TOTAL QUANTITIES =", COL 42, TOTAL Y01(-) USING Z(6)9, . . . , TOTAL Y21(-) USING Z(6)9, TOTAL SH_QUANTITY(-) USING Z(6)9, SKIP, COL 15, "AVERAGE UNIT PRICE =", COL 42, Z01(-) USING ZZZ9.99, . . . , Z21(-) USING ZZZ9.99, (TOTAL SH_EXTENDED_PRICE)/(TOTAL SH_QUANTITY)(-) USING ZZZ9.99
At the bottom of GROUP, sales total are printed. But only for the items in the Top N. Notice that if GROUP = 1, there are no sub-totals printed; only a blank line is printed! Notice that the code for total X02 through X20 has been replaced with " . . . ".
AT BOTTOM OF GROUP PRINT SKIP, COL 2, CHOICE OF GROUP = 0 THEN "Subtotals=" ELSE " " END_CHOICE , COL 42, CHOICE OF GROUP = 0 THEN FORMAT TOTAL X01 USING Z(6)9 ELSE " " END_CHOICE, . . . CHOICE OF GROUP = 0 THEN FORMAT TOTAL X21 USING Z(6)9 ELSE " " END_CHOICE, CHOICE OF GROUP = 0 THEN FORMAT TOTAL SH_EXTENDED_PRICE USING Z(6)9 ELSE " " END_CHOICE, SKIP
At the bottom of the report the grand totals for sales are printed, the LA100 printer is returned to 10 characters per inch mode, and all domains FINISHed, and all tables and global variables RELEASEd (although the code is not shown).
AT BOTTOM OF REPORT PRINT COL 2, "Grand Totals=", TOTAL X01(-) USING Z(6)9, . . . , TOTAL X21(-) USING Z(6)9, TOTAL SH_EXTENDED_PRICE(-) USING Z(6)9 END_REPORT PRINT "
[0w" END ! finish all domains and release all tables and global variables END-PROCEDURE
Because the report is 217 columns wide, a typical report can not easily be printed in the newsletter. However, the report does contain a number of interesting features particularly the suppression of a sub-total on the second group.
Joe H. Gallagher, Ph. D.