Developing Applications in DATATRIEVE
The Double Top N Problem

Jerry L. Kelley and Joe H. Gallagher



Reporting the Top N items is one of the classic problems of data bases. The reporting part of the Las Vegas 4GL Problem is a Top N problem. The Top N problem is trivially solve in DATATRIEVE using REPORT FIRST N ... SORTED BY DECREASING ... when there is one record for each occurrence to be sorted. When there are multiple occurrences, the records need to be aggregated with

  1. a declared variable and REDUCE TO (the technique used by John Babiarz in his article on the Las Vegas 4GL Problem which appears earlier in this issue),
  2. the use of a hierarchy where there is one and only one occurrence for each master and a declared variable tallies the results of the detail records, or
  3. the storage of the summarized data into another temporary domain.

Typical uses of the Top N problem in a manufacturing environment is to determine the Top 50 customers for a particular period of time. Another example would be to determine the Top 100 manufactured items by gross or net.

The particular twist to the Top N problem which is the subject of this article is to determine the Top N items for the Top 20 customers for a particular period of time. We call this the Double Top N problem.

Knappco Corporation is a small manufacturer making fillcaps, hatches, valves, nozzles, vents, and manholes for the cargo tank manufacturing industry. Knappco has several thousand customers in its customer master file, and Knappco makes several thousand different items. The customer master file contains year-to-date and month-to-date sales for each customer. The inventory master file contains the year-to-date and the month-to-date sales for each item (part). But only the sales history file contains the records of the individual sales which can determine which customer purchased which item. Unfortunately the sales history is a relative file written from DIBOL so there are no keys in the file to facilitate rapid access to the data. There are 25 fields in the sales history file; only six are relevant to illustrate this problem. There are


DEFINE RECORD SALESHISTORY_RECORD USING
01 SALESHISTORY_REC.
   03 SH_ITEM_NUMBER                   PIC X(15).
   03 SH_ITEM_DESCRIPTION              PIC X(30).
   03 MEMO_DATE.
      05 SALES_MONTH                   PIC 99.
      05 SALES_DAY                     PIC 99.
      05 SALES_YEAR                    PIC 99.
   03 SH_QUANTITY                      PIC 9(6) USAGE IS ZONED
      EDIT_STRING IS ------9.
   03 SH_EXTENDED_PRICE                PIC 9(8)V99 USAGE IS ZONED
      EDIT_STRING IS $$$,$$$,$$9.99- .
   03 SH_CUSTOMER_NAME                 PIC X(25).
;

The data in SALESHISTORY is summarized into the derived data domains TOP_CUSTS, TOP_ITEMS, and TOP_BOTH. We originally tried to accomplish this without the use of the third domain, TOP_BOTH; however, performance was so bad that the use of a third domain was essentially required. The records for the three derived domains are:


DEFINE RECORD TOP_CUSTS_RECORD USING
01 TOP_CUSTS_REC.
   03 SH_CUSTOMER_NAME                 PIC X(25). ! primary key **
   03 TOTAL_SH_EXTENDED_PRICE          PIC 9(8)V99 USAGE IS ZONED
      EDIT_STRING IS $$$,$$$,$$9.99- .
   03 CUSTS_ORDER                      USAGE IS WORD.
   03 ALTERNATE_NAME                   PIC X(25).
;

DEFINE RECORD TOP_ITEMS_RECORD USING
01 TOP_ITEMS_REC.
   03 SH_ITEM_NUMBER                   PIC X(15). ! (primary key)
   03 SH_ITEM_DESCRIPTION              PIC X(30).
   03 TOTAL_SH_QUANTITY                PIC 9(6) USAGE IS ZONED
      EDIT_STRING IS ------9.
   03 TOTAL_SH_EXTENDED_PRICE          PIC 9(8)V99 USAGE IS ZONED
      EDIT_STRING IS $$$,$$$,$$9.99- .
   03 ITEMS_ORDER                      USAGE IS WORD.
   03 ALTERNATE_NAME                   PIC X(15).
;

DEFINE RECORD TOP_BOTH_RECORD USING
01 TOP_BOTH_REC.
   03 GROUP                            PIC 9.
   03 ITEMS_ORDER                      USAGE IS WORD.
   03 CUSTS_ORDER                      USAGE IS WORD.
   03 ALTERNATE_ITEMNO                 PIC X(15).
   03 ALTERNATE_NAME                   PIC X(25).
   03 SH_QUANTITY                      PIC 9(6) USAGE IS ZONED
      EDIT_STRING IS ------9.
   03 SH_EXTENDED_PRICE                PIC 9(8)V99 USAGE IS ZONED
      EDIT_STRING IS $$$,$$$,$$9.99- .
;

The procedure to report the Top N Items for the Top 20 Customers is in several parts:

  1. enter information to select the period of time and enter the number of items to report
  2. summarize the data by items
  3. assign the top N items and all others
  4. summarize the data by customer
  5. assign the top 20 customers and all others
  6. copy the SALESHISTORY records into TOP_BOTH assigning alternate names for customers and items and setting the ordering
  7. create three sets of global variables for tallying up the results
  8. make the report

The first part of the procedure establishes global variables for selecting the year, the number of items to be reported, and temporary variables (labeled TEMP_) to tally the aggregation of sales and quantities. Access to the SALESHISTORY is then gained.


DEFINE PROCEDURE TOP_N_ITEMS_FOR_TOP20_CUSTS
DECLARE YEAR PIC 99 VALID IF YEAR BT 89 AND 99 .
DECLARE THIS_YEAR PIC 99.
DECLARE TODAY USAGE IS DATE.
TODAY = "today"
THIS_YEAR = FN$YEAR(TODAY) - 1900
YEAR = *."year [89, 90, 91, etc]"
DECLARE MAX_NUMBER_OF_ITEMS USAGE IS WORD.
MAX_NUMBER_OF_ITEMS = 1000
DECLARE NUMBER_OF_ITEMS USAGE IS WORD
    VALID IF NUMBER_OF_ITEMS BETWEEN 1 AND MAX_NUMBER_OF_ITEMS .
NUMBER_OF_ITEMS = *."how many of the top items to print [25, 50, etc]"
DECLARE FIRST_FLAG PIC X.
FIRST_FLAG = *."space and return when LA100 printer is ready"
FIRST_FLAG = "Y"
!
DECLARE TEMP_SH_ITEM_NUMBER            PIC X(15).
DECLARE TEMP_SH_ITEM_DESCRIPTION       PIC X(30).
DECLARE TEMP_TOTAL_SH_QUANTITY         PIC 9(6) USAGE IS ZONED .
DECLARE TEMP_TOTAL_SH_EXTENDED_PRICE   PIC 9(8)V99 USAGE IS ZONED .
DECLARE TOTAL_SALES                    PIC 9(8)V99 USAGE IS ZONED .
READY SALESHISTORY SHARED READ

The second part of the procedure creates a new blank TOP_ITEMS file, READYs it for storing data, zeroes temporary variables, and begins reading SALESHISTORY records in order of item number and tallying then up and storing them in TOP_ITEMS. Note that the last STORE outside of the FOR SALESHISTORY loop handles the last or last group of records properly.


DEFINE FILE FOR TOP_ITEMS KEY=SH_ITEM_NUMBER;
READY TOP_ITEMS SHARED WRITE
TEMP_SH_ITEM_NUMBER = " "
TEMP_TOTAL_SH_QUANTITY = 0
TEMP_TOTAL_SH_EXTENDED_PRICE = 0.0
TOTAL_SALES = 0.0
FOR SALESHISTORY WITH SALES_YEAR = YEAR SORTED BY SH_ITEM_NUMBER BEGIN
    TOTAL_SALES = TOTAL_SALES + SH_EXTENDED_PRICE
    IF (FIRST_FLAG EQ "Y") THEN BEGIN
        FIRST_FLAG = "N"
    END ELSE BEGIN
        IF (SH_ITEM_NUMBER NE TEMP_SH_ITEM_NUMBER) THEN BEGIN
            STORE TOP_ITEMS USING BEGIN
                SH_ITEM_NUMBER = TEMP_SH_ITEM_NUMBER
                SH_ITEM_DESCRIPTION = TEMP_SH_ITEM_DESCRIPTION
                TOTAL_SH_QUANTITY = TEMP_TOTAL_SH_QUANTITY
                TOTAL_SH_EXTENDED_PRICE = TEMP_TOTAL_SH_EXTENDED_PRICE
                END
            TEMP_TOTAL_SH_QUANTITY = SH_QUANTITY
            TEMP_TOTAL_SH_EXTENDED_PRICE = SH_EXTENDED_PRICE
            END ELSE BEGIN
            TEMP_TOTAL_SH_QUANTITY = TEMP_TOTAL_SH_QUANTITY + SH_QUANTITY
            TEMP_TOTAL_SH_EXTENDED_PRICE = 
                TEMP_TOTAL_SH_EXTENDED_PRICE + SH_EXTENDED_PRICE
            END
        END
    TEMP_SH_ITEM_NUMBER = SH_ITEM_NUMBER
    TEMP_SH_ITEM_DESCRIPTION = SH_ITEM_DESCRIPTION
    END
STORE TOP_ITEMS USING BEGIN
    SH_ITEM_NUMBER = TEMP_SH_ITEM_NUMBER
    SH_ITEM_DESCRIPTION = TEMP_SH_ITEM_DESCRIPTION
    TOTAL_SH_QUANTITY = TEMP_TOTAL_SH_QUANTITY
    TOTAL_SH_EXTENDED_PRICE = TEMP_TOTAL_SH_EXTENDED_PRICE
    END

The third part of the procedure goes back through the data in TOP_ITEMS and assigns the order of the items for the first NUMBER_OF_ITEMS. For those items past the first NUMBER_OF_ITEMS it assigns the order MAX_NUMBER_OF_ITEMS + 1 and the alternate name of "OTHER".


FOR TOP_ITEMS SORTED BY DECREASING TOTAL_SH_EXTENDED_PRICE BEGIN
    IF (RUNNING COUNT LE NUMBER_OF_ITEMS) THEN BEGIN
        MODIFY USING BEGIN
            ITEMS_ORDER = RUNNING COUNT
            ALTERNATE_NAME = SH_ITEM_NUMBER
            END
        END ELSE BEGIN
        MODIFY USING BEGIN
            ITEMS_ORDER = MAX_NUMBER_OF_ITEMS + 1
            ALTERNATE_NAME = "OTHERS"
            END
        END
    END

The fourth part of the procedure summarizes the data by customer. But there is a small problem here. Knappco deals with several manufacturing locations of large corporate trailer manufactures. That is, there a many different locations (and thus different customer numbers and customer names) for manufactures such as CARGILL, BEALL, FREUHAUF, CHEVRON, and TRAILMOBILE. But we would like to summarize the sales by the whole corporation rather than one specific manufacturing location. Therefore, we created a declared variable which collapses all the various customer names such as FREUHAUF-1, FREUHAUF-2, FREUHAUF-3 into FREUHAUF.


DECLARE CUSTOMER_NAME COMPUTED BY CHOICE OF
SH_CUSTOMER_NAME STARTING WITH "BEALL" THEN "BEALL"
SH_CUSTOMER_NAME STARTING WITH "CARGILL" THEN "CARGILL"
SH_CUSTOMER_NAME STARTING WITH "CHEVRON" THEN "CHEVRON"
SH_CUSTOMER_NAME STARTING WITH "FREUHAUF" THEN "FREUHAUF"
SH_CUSTOMER_NAME STARTING WITH "TRAILMOBILE" THEN "TRAILMOBILE"
    .
    .
    .
ELSE SH_CUSTOMER_NAME
END_CHOICE.

Then a temporary variable is declared to keep track of when we have reached the next customer, a new TOP_CUSTS files is created, the domain is READY'ed for storing data, temporary variables are set, and the SALESHISTORY data is read in order of CUSTOMER_NAME (the collapsed customer name), the sales dollars tallied, and the data is stored in TOP_CUSTS. Again, an extra STORE is required outside of the FOR SALESHISTORY loop to handle the last record or group of records.


DECLARE TEMP_CUSTOMER_NAME PIC X(25).
DEFINE FILE FOR TOP_CUSTS KEY=SH_CUSTOMER_NAME;
READY TOP_CUSTS SHARED WRITE
FIRST_FLAG = "Y"
TEMP_TOTAL_SH_EXTENDED_PRICE = 0.0
TOTAL_SALES = 0.0
!
FOR SALESHISTORY WITH SALES_YEAR = YEAR SORTED BY CUSTOMER_NAME BEGIN
    TOTAL_SALES = TOTAL_SALES + SH_EXTENDED_PRICE
    IF (FIRST_FLAG EQ "Y") THEN BEGIN
        FIRST_FLAG = "N"
        END ELSE BEGIN
        IF (CUSTOMER_NAME NE TEMP_CUSTOMER_NAME) THEN BEGIN
            STORE TOP_CUSTS USING BEGIN
                SH_CUSTOMER_NAME = TEMP_CUSTOMER_NAME
                TOTAL_SH_EXTENDED_PRICE = TEMP_TOTAL_SH_EXTENDED_PRICE
                END
            TEMP_TOTAL_SH_EXTENDED_PRICE = SH_EXTENDED_PRICE
            END ELSE BEGIN
            TEMP_TOTAL_SH_EXTENDED_PRICE = 
                TEMP_TOTAL_SH_EXTENDED_PRICE + SH_EXTENDED_PRICE
            END
        END
    TEMP_CUSTOMER_NAME = CUSTOMER_NAME
    END
STORE TOP_CUSTS USING BEGIN
    SH_CUSTOMER_NAME = TEMP_CUSTOMER_NAME
    TOTAL_SH_EXTENDED_PRICE = TEMP_TOTAL_SH_EXTENDED_PRICE
    END

The fifth part of the procedure assigns the top 20 customers. For those past the top 20, the order is set to 21 and the name is changed to "OTHERS".


FOR TOP_CUSTS SORTED BY DECREASING TOTAL_SH_EXTENDED_PRICE BEGIN
    IF (RUNNING COUNT LE 20) THEN BEGIN
        MODIFY USING BEGIN
            CUSTS_ORDER = RUNNING COUNT
            ALTERNATE_NAME = SH_CUSTOMER_NAME
            END
        END ELSE BEGIN
        MODIFY USING BEGIN
            CUSTS_ORDER = 20 + 1
            ALTERNATE_NAME = "OTHERS"
        END
        END
    END

The sixth part of the procedure is to copy to the data into TOP_BOTH. In this step the order is assigned for both customers and items. In addition, a GROUP is assigned. The field GROUP is used later to control the sub-totals of those items which are in the Top N parts and to prevent a sub-total from being displayed for those outside the Top N. ITEMS_ORDER is determined by looking up the item number in a domain table, TOP_ITEMS_ORDER_TABLE, which is from the domain TOP_ITEMS mapping SH_ITEM_NUMBER to ITEMS_ORDER. CUSTS_ORDER is determined by looking up the collapsed customer name in TOP_CUSTS to find the customers order. Data is read from SALESHISTORY and stored in TOP_BOTH. Storing data in TOP_BOTH could be avoided. However, because the SALESHISTORY file contains several years worth of data and the performance is very significantly improved if GROUP is a stored variable, we have chosen to actually rewrite the data.


DECLARE GROUP COMPUTED BY CHOICE OF
ITEMS_ORDER LE NUMBER_OF_ITEMS THEN 0
ELSE 1
END_CHOICE.
DECLARE ITEMS_ORDER COMPUTED BY
    FORMAT (SH_ITEM_NUMBER VIA TOP_ITEMS_ORDER_TABLE) USING 9999 .
DECLARE CUSTS_ORDER COMPUTED BY 
    CUSTOMER_NAME VIA TOP_CUSTS_ORDER_TABLE .
DEFINE FILE FOR TOP_BOTH;
READY TOP_BOTH SHARED WRITE
FOR SALESHISTORY WITH SALES_YEAR = YEAR BEGIN
    STORE TOP_BOTH USING BEGIN
        GROUP = GROUP
        ITEMS_ORDER = ITEMS_ORDER
        CUSTS_ORDER = CUSTS_ORDER
        ALTERNATE_NAME = CUSTOMER_NAME VIA TOP_CUSTS_NAME_TABLE
        ALTERNATE_ITEMNO = SH_ITEM_NUMBER VIA TOP_ITEMS_NAME_TABLE
        SH_QUANTITY = SH_QUANTITY
        SH_EXTENDED_PRICE = SH_EXTENDED_PRICE
        END
    END

The seventh part of the procedure creates three sets of twenty-one declared variables each. X01 to X21 tally the sales; Y01 to Y21 tally the item quantities; and Z01 to Z21 tally the average price. This looks like a lot of "brute force" and it is. But this is the standard technique in DATATRIEVE for getting horizontal totals and sub-totals. Note that the declarations for X02 through X20, Y02 through Y20, and Z02 through Z20 have been replaced by " . . . ".


DECLARE X01 COMPUTED BY CHOICE OF
CUSTS_ORDER EQ 01 THEN SH_EXTENDED_PRICE
ELSE 0
END_CHOICE.
    .
    .
    .
DECLARE X21 COMPUTED BY CHOICE OF
CUSTS_ORDER EQ 21 THEN SH_EXTENDED_PRICE
ELSE 0
END_CHOICE.

DECLARE Y01 COMPUTED BY CHOICE OF
CUSTS_ORDER EQ 01 THEN SH_QUANTITY
ELSE 0
END_CHOICE.
    .
    .
    .
DECLARE Y21 COMPUTED BY CHOICE OF
CUSTS_ORDER EQ 21 THEN SH_QUANTITY
ELSE 0
END_CHOICE.

DECLARE Z01 COMPUTED BY CHOICE OF
TOTAL Y01  EQ 0 THEN 0
ELSE  (TOTAL X01)/(TOTAL Y01)
END_CHOICE.
    .
    .
    .
DECLARE Z21 COMPUTED BY CHOICE OF
TOTAL Y21  EQ 0 THEN 0
ELSE  (TOTAL X21)/(TOTAL Y21)
END_CHOICE.

The eighth and last part of the procedure actually makes the report. However, first the label indicating which customers are the top 20 customers must be created. This requires a variable LABEL and a loop through the TOP_CUSTS domain picking up the first seven characters of the customer name. With the number of columns in this report, there is only enough room for seven characters from each. Fortunately, seven characters are enough to differentiate the customers each of which, after all, purchase several hundreds of thousands of dollars of worth of parts from Knappco. (We surely ought to be able to recognize our best customers.)


DECLARE LABEL PIC X(176).
LABEL = ""
FOR FIRST 21 TOP_CUSTS SORTED BY CUSTS_ORDER BEGIN
    LABEL = LABEL ||
    FORMAT (SH_CUSTOMER_NAME VIA TOP_CUSTS_NAME_TABLE) USING X(7) | "|"
    END
LABEL = LABEL || "Totals"

The report is directed to an LA100. The reason that an LA100 is chosen rather than line printer is the width of the report. LP-type devices print only 132 columns; an LA100 in 16.5 characters per inch mode can printer a report which is 217 characters wide. To report the top 50 items would required 4 or 5 pages so performance is not a serious problem with the slower LA100 device. The [4w puts the LA100 in 16.5 characters per inch mode. GROUP has a value of zero for the first N items and one for items greater than the first N. It is the variable GROUP which allows sub-totals to printed for the Top N, but no sub-totals for the "OTHER" items.


ON LA100printer: BEGIN
    PRINT "[4w"
    REPORT TOP_BOTH SORTED BY GROUP, ITEMS_ORDER
    SET COLUMNS_PAGE = 217
    SET REPORT_NAME = "KNAPPCO Corporation"/"Top Items for Top 20 Customers"

The AT TOP OF PAGE is a little unusual. Since the LABEL contains the first seven characters of the Top 20 customers, an AT TOP OF PAGE was required to have computed column headers on the tallies. However, the first part of the AT TOP OF PAGE allows the creation of a line of the form


  Top N Items for 19YY [Year to Date] on total sales of $$$,$$$,$$$9.99

which appear before the column headers. The optional "Year to Date" is printed if we are reporting the current (incomplete) year.


    AT TOP OF PAGE PRINT REPORT_HEADER, SKIP, COL 1,
        "Top"||| FORMAT NUMBER_OF_ITEMS USING ZZZ9 |||
        "Items for 19"| FORMAT YEAR USING 99 | CHOICE OF
        YEAR EQ THIS_YEAR THEN " Year to date"
        ELSE ""
        END_CHOICE ||| "on total sales of "|
        FORMAT TOTAL_SALES USING $$$,$$$,$$9.99, SKIP 2, COL 1,
        "Item", COL 10, "Description", COL 42, LABEL(-), SKIP

The data is summarized by ITEMS_ORDER to give the part number (the ALTERNATE_ITEMNO), and the description of the part (via a domain table INVMAS_DESC_TABLE). The total sales, X01 to X21; the total quantities, Y01 to Y21; and the average unit prices, Z01 to Z21 are printed on three successive lines. Notice that code for totals of X02 through X20, totals Y02 through Y20, and Z02 through Z20 have been replaced with " . . . ".


    AT BOTTOM OF ITEMS_ORDER PRINT ALTERNATE_ITEMNO USING X(8),
        CHOICE OF
        ITEMS_ORDER LE NUMBER_OF_ITEMS THEN 
            ALTERNATE_ITEMNO VIA INVMAS_DESC_TABLE
        ELSE " "
        END_CHOICE USING X(30), COL 42,
        TOTAL X01(-) USING Z(6)9, . . . ,TOTAL X21(-) USING Z(6)9,
        TOTAL SH_EXTENDED_PRICE(-) USING Z(6)9,
        SKIP, COL 15, "TOTAL QUANTITIES   =", COL 42,
        TOTAL Y01(-) USING Z(6)9, . . . , TOTAL Y21(-) USING Z(6)9,
        TOTAL SH_QUANTITY(-) USING Z(6)9,
        SKIP, COL 15, "AVERAGE UNIT PRICE =", COL 42,
        Z01(-) USING ZZZ9.99, . . . , Z21(-) USING ZZZ9.99,
        (TOTAL SH_EXTENDED_PRICE)/(TOTAL SH_QUANTITY)(-) USING ZZZ9.99

At the bottom of GROUP, sales total are printed. But only for the items in the Top N. Notice that if GROUP = 1, there are no sub-totals printed; only a blank line is printed! Notice that the code for total X02 through X20 has been replaced with " . . . ".


    AT BOTTOM OF GROUP PRINT SKIP, COL 2,
        CHOICE OF
        GROUP = 0 THEN "Subtotals="
        ELSE           "          "
        END_CHOICE , COL 42,
        CHOICE OF
        GROUP = 0 THEN FORMAT TOTAL X01 USING Z(6)9
        ELSE "       "
        END_CHOICE,
            .
            .
            .
        CHOICE OF
        GROUP = 0 THEN FORMAT TOTAL X21 USING Z(6)9
        ELSE "       "
        END_CHOICE,
        CHOICE OF
        GROUP = 0 THEN FORMAT TOTAL SH_EXTENDED_PRICE USING Z(6)9
        ELSE "       "
        END_CHOICE, SKIP

At the bottom of the report the grand totals for sales are printed, the LA100 printer is returned to 10 characters per inch mode, and all domains FINISHed, and all tables and global variables RELEASEd (although the code is not shown).


    AT BOTTOM OF REPORT PRINT COL 2, "Grand Totals=",
        TOTAL X01(-) USING Z(6)9, . . . , TOTAL X21(-) USING Z(6)9,
        TOTAL SH_EXTENDED_PRICE(-) USING Z(6)9
    END_REPORT
    PRINT "[0w"
    END
! finish all domains and release all tables and global variables
END-PROCEDURE

Because the report is 217 columns wide, a typical report can not easily be printed in the newsletter. However, the report does contain a number of interesting features particularly the suppression of a sub-total on the second group.


Jerry Kelley was the Controller for Knappco Corporation where he ran their microVAX II computer system. Knappco's integrated manufacturing and business applications were written in DIBOL with significant use of ENGLISH, 20/20, and DATATRIEVE for reporting and analysis.
Originally published in the newsletter of the 4GL SIG, The Wombat Examiner and 4GL Dispatch, Volume 12, Number 7, pages 5-9; in the Combined SIGs Newsletters of Digital Equipment Computer Users Society, Volume 6, Number 7, March 1991.
Joe H. Gallagher, Ph. D.
dtrwiz@ix.netcom.com
MAILTO

BACK Back