Developing Applications in DATATRIEVE
Generating Test Data

Joe H. Gallagher, Ph. D.

Most of our efforts in developing applications are spent in thinking about and analyzing the problem or writing, testing, and debugging procedures, reports, FMS or TDMS screens, tables, domain definitions, and record definitions. When an application becomes more complex (the number of concurrently accessed domains), when the number of simultaneous users increases, and when remote data is accessed (data accessed across a network), the ability to accurately simulate the "real-world" with trivial data is no longer possible. In such situations it is necessary to create data to test our application design and to evaluate the performance of our application.

The best test data for an application, of course, is the "real" data. But there are many instances when the "real" data has not yet been acquired or may not yet be available to the application developer. It is these instances were one needs to generate test data that will be address by this article.

The examples used in this article are taken from the command files and procedures which were used to generated the data for the Las Vegas 1990 Fall Symposium 4GL Problem. A complete description of the Las Vegas 4GL Problem will appear in a future issue of this newsletter.

Types of Data

With respect to generation of data, there are two broad categories. The first is data that because of its non-mathematical nature or because there are a discrete number of occurrence among a very, very large number of possibilities can only be generated from lists of existing data. The second is data that is numeric, either continuous or discrete, and which can be generated by mathematical methods.

List Data

The best example of first kind of data, list data, is first and last names. Mathematically there a billions more possible names than actually occur in a given language (English, for example); so there is no effective way to generate real names (as opposed to nonsense names) with a random number generator. The only way to get realistic names is to take names from a pre-existing list. One source of possible names is the telephone book. John Babiarz used his local telephone book to generate names for the 1990 Spring New Orleans 4GL Problem. He used a little over 1100 unique names and used each one 4 times with four different first names. However, he did not make a complete or random sample of the names in the phone book. The frequency of occurrence of names was peculiar, if not bazaar, as there were more that 400 names beginning with "Z" in a set of 5518 names.

I chose another approach. Since I work in a fairly large hospital and have some good contacts in the Data Processing Department, I was able to secure the list of last names of the most recent 32000 hospital admissions. This list of names has some rather interesting properties. First there were about 20 names in the list whose length was more that 15 characters. All but three of the names were due to hyphenated names (the concatenation of two last names). All three of the "naturally" long names were of Germanic origin. They were 16, 17, and 19 characters long (and the 19 character one was a real mouth full). Because the master file in the problem set had already specified that the LAST-NAME field was PIC X(15), I discarded all last names longer than 15 characters. Names on a hospital admitting list do have one peculiar property. If a name appears on the list, then there is too high a probability that the next name on the list will be the same. The occurs when several members of the same family are involved in a motor vehicle accident and they are all admitted through the emergency room at the same time. Because of this peculiarity, I did remove the duplicates if they occurred together. The set of names did have a good representation of English, Irish, Scotch, German, American Black, and Hispanic names with a few Scandinavian, Asian, Middle Eastern, and Indian names. However, because the 4GL Problem was of project of the 4GL SIG, I check the names against the list of members of the SIG Steering Committee and if a steering committee member's name was not present I added it. Thus, the list I used came from three sources: more than 1100 unique names from John Babiarz's telephone book, about 20 names from the SIG Steering Committee list who weren't already on the list, and more than 4000 more from the patient admissions' list at Research Medical Center to make a list of 5518 names. While the list had some of the properties of a typical list of American names (Smith, Jones, and Brown occurred 46, 37, and 35 times), there were still 67.4% (3718 of 5518) which were unique. In a typical list there are often only 33% to 40% unique names.

Now with a list of 5518 last names in a domain NEWNAMES, I needed to load these names into the MASTER domain. If the MASTER domain did not already have values in the other fields, I could just STORE the records. But to move the first name from NEWNAMES to the first last-name in MASTER, and the second to the second and so on, there is a bit of a trick!

The MASTER record is

	   03 ID_NUMBER         PIC 9(7).
	   03 LAST_NAME         PIC X(15). ! may contain " ", "'", or "-"
	   03 FIRST_NAME        PIC X(10).
	   03 MIDDLE_INITIAL    PIC X(1).
	   03 SENIORITY         PIC X(4).
	   03 ADDR              PIC X(15).
	   03 CITY              PIC X(12).
	   03 STATE             PIC X(2).
	   03 ZIP_CODE          PIC 9(5).
and the NEWNAMES record is

	   03 NAMES             PIC X(15).
One would like to be able to have DATATRIEVE do something like

    ready MASTER modify
    ready NEWNAMES read
    for a in NEWNAMES
        for next 1 b in MASTER
            modify using b.last-name = a.names
but, of course, the "FOR NEXT 1" is not legal syntax in DATATRIEVE. Richard Copeland suggested something like

    ready MASTER write
    ready NEWNAMES read
    find MASTER
    declare size_of_master usage is integer.
    size_of_master = count of current
    declare x usage is integer.
    for a in NEWNAMES begin
        x = running count
        modify b in MASTER with 
            FN$MOD(running count,size_of_master) = 
            x using last-name = names
which would work, but its performance would be poor. The solutions I used was:

    ready MASTER modify
    ready NEWNAMES read
    find c in MASTER
    for NEWNAMES begin
        select next c
        modify using LAST-NAME = NAMES
which apparently violates the documented restriction that a SELECT may not appear within a BEGIN-END block. However, in this case it works just fine and gives the desired result. Mathematically Generated Data The second type of data can be, but is not always, generated by some mathematical function or algorithm. If the data is continuous or almost continuous, a function with or without the use of a random number generator is often used. But if the data is discrete, many times it is easier to develop a table or a frequency of occurrence for the values. Discrete Values For the Las Vegas 4GL Problem, we needed DETAIL records which would be associated with the MASTER records. The frequency of occurrence of the DETAIL records with respect to the MASTER records is given by the values in the domain DISTRIBUTION. I actually used a spread sheet to fiddle with the values so that the records would come out to exactly 5518. The three fields in the DISTRIBUTION domain are N, the number of DETAIL records per MASTER record, M, the number of MASTER records having this number of DETAIL records, and Z, total number of detail records (the product of N and M).

            N            M          Z
            0         1199          0
            1          996        996
            2          364        728
            3          487       1461
            4          254       1016
            5          187        935
            6           96        576
            7           73        511
            8          229       1832
            9          174       1566
           10          135       1350
           11          288       3168
           12          418       5016
           13            9        117
           14           18        252
           15           23        345
           16           42        672
           17           71       1207
           18           97       1746
           19            0          0
           20            0          0
           21            1         21
           22            1         22
           23            2         46
           24            6        144
           25            0          0
           26            0          0
           27            1         27
           28            0          0
           29            0          0
           30            0          0
           31            1         31
           32            4        128
           33           10        330
           34           35       1190
           35           98       3430
           36          199       7164
         Totals       5518      36027
To create DETAIL records with the distribution gives by the values in DISTRIBUTION, the following skeleton set of commands and statements can be used:

	ready DISTRIBUTION read
	ready MASTER modify
	define file for DETAIL;
	ready DETAIL write
	find a in MASTER
	    repeat m begin
	        select next a
	        repeat n begin
	            store DETAIL using begin
	                id-number =
	                . . 
and again the trick of using SELECT NEXT A in a collection is used to step through the MASTER domain.

Continuous Uniform Random Numbers

A random number generator makes numbers uniformly, but pseudo-randomly, distributed on the interval 0 <= x < 1. There are many kinds of random number generators and many kinds of tests for them as well. For information on random number generators see "Encyclopedia of Computer Science", Anthony Ralsont, Editor, Petrocelli/Charter, New York, 1976, pp. 1192-7, or "The Art of Computer Programming" Volume 2 (Seminumerical Algorithms). Addison Wesley, Reading, MA, 1969. The VAX-FORTRAN random number generator has properties better than the additive-multiplicative generator and a shift generator since it is apparently has properties of both types. A self-seeding routine to call the FORTRAN random number generator was described in Volume 4, Number 1, Section DTR, pages 11-12 of the newsletter. The FORTRAN code is:

!       a self-initializing, FORTRAN interface for 
!       DATATRIEVE to the VAX random number generator.
        real*4 function fn_random()
        real*4 mth$random
        integer*4 quad(2),seed
        equivalence (quad,seed)
        if (seed .eq. 0) then
            call sys$gettim(quad)
            end if
        fn_random = mth$random(seed)
The linkage to attach this function to DATATRIEVE is

; FN$RANDOM- a pseudo-random number generator
; output is an F-floating pseudo-random number 0<= x < 1
One can then use the random number generator to create number distributed between LOW_VALUE and HIGH_VALUE by the expression

        LOW_VALUE + fn$random * (HIGH_VALUE - LOW_VALUE)
However if one needs numbers exponentially distributed between 10 and 100,000, once could use the expression

            FN$EXP(FN$LN(10.0)*(1.0 + 4.0*FN$RANDOM))

Discrete Random Numbers

To make integers (discrete numbers) rather than real (continuous) number, judicious use of FN$FLOOR or FN$NINT will convert the continues variable to a discrete one. However, if you need unique integer, you have to keep track of which integers have already been used. For that, you need a domain and a table like

   03 ID PIC 9(7).
and code like

        ready INDEX write
        declare temp-id pic 9(7).
        . . .
        temp-id = 9999999 * fn$random
        while (temp-id in index-table) begin
            if temp-id eq 9999999 then begin
                temp-id = 0
            temp-id = temp-id + 1
        store INDEX using id = temp-id
        . . .
to keep track of which numbers have already been used.

Normally Distributed (Gaussian) Numbers

Normally distributed numbers can be created from uniformly distributed numbers by inverting the Gaussian function. Unfortunately, there is no exact closed-form expression for this inverse distribution. There is, however, a good approximation to the needed function. The FORTRAN routine FN_ZPROB gives such a result.

!       A real function to invert the probability function.
!       See Handbook of Mathematical Functions with Formulas, Graphs,
!       and Mathematical Tables, Edited by Milton Abramowitz and 
!       Irene A. Stegun, National Bureau of Standards, Applied
!       Mathematics Series, 55, page 933.
!       input                0 =< rin  =< 1.0
!       output          -9.262 =< rout =< +9.262
!       If the input is uniformly distributed between 0 and 1,
!       the output is normally distributed about zero with a
!       standard deviation of 1 (except for the very ends of
!       the distribution).
!       Because of the singularity at rin = 0 and rin = 1
!       this function sets
!               0 <= rin <= 1.0E-20     rout = -9.262
!            1 - 1.0E-20 <= rin <= 1.0  rout = +9.262
!       Thus, the tails of the distribution are "chopped" off
!       at the very, very ends.
        real function fn_zprob(rin)
        real*4 rin, rout, p, t
        logical side    ! 0 to 0.5 -> TRUE; 0.5 to 1 -> FALSE

        side = .TRUE.
        p = rin
        if ( p .gt. 0.5 ) then
            side = .FALSE.      ! data is 0.5 to 1
            p = 1.0 - p         ! invert the data
            end if
        if ( p .lt. 1.0E-20) then       ! is data too close to 0?
            p = 1.0E-20                 ! yes, set to a small value
            end if
        t = sqrt( 2.0 * alog ( 1.0/p))
        rout = t - (2.515517 + t * (0.802853 + t * 0.010328))/
        1 (1.0 + t * (1.432788 + t * (0.189269 + t*0.001308)))
!       Size of error in this formula is
!           | error(p) | < 0.0004.5
!       However, near input of 0.5, output of formula can go
!       slightly negative.
!       May sure value does not go negative, here.
        if (rout .lt. 0.0) then
            rout = 0.0
            end if
        if (side) then          ! which side are we on?
            rout = -1.0 * rout  ! 0 to 0.5 maps to -9.262 to 0.0
            end if
        fn_zprob = rout         ! return value
The linkage to DATATRIEVE is

; FN$ZPROB- a normally distributed function
; input is an F-floating number   0<= z <= 1
; output is an F-floating number normally distributed
The function combination FN$ZPROB(FN$RANDOM) can be used to normally and randomly distribute values with a mean, MEAN, and a standard deviation, STD_DEV, with the expression

              MEAN + STD_DEV * FN$ZPROB(FN$RANDOM)

Generation of Test Data for the DETAIL Record

All of the techniques described above are used to generate records for the DETAIL domain. The record definition is:

   03 ID_NUMBER          PIC 9(7).
   03 FUND_CODE          PIC 9(1).
The variable ID_NUMBER is uniformly distributed over the range 0 to 9999999, but each value is unique. The DATE_OF_DONATION is normally distributed about a target date which is one nth (where n is the number of donations) of the time between the beginning and ending donation dates. These start and stop dates depend on the number of donations given. The start date is determine by a complex CHOICE OF expression which depends on the number of donations. The standard deviation about these target dates is arithmetically distributed between one half and one fifth of the mean time between donations. The donations are in 1987, 1988, and 1989. Donation dates are also restricted to be week days; that is, Saturdays and Sundays are not allowed. The FUND_CODE is uniformly distributed between 0 and 9. The AMOUNT_OF_DONATION is exponentially distributed between $10.00 and $1,000.00 or $10.00 and $100,000.00 depending on the number of donations. Ninety- eight percent of the donations are truncated to 3 significant figures; 2% of the donations have all digits. The DATATRIEVE code to generate the data in DETAIL is:

!   a command file to create new id numbers and to
!   create DETAIL records for the Las Vegas 4GL Problem
declare start-date usage is date.
declare stop-date usage is date.
declare start-quad usage is quad.
declare stop-quad usage is quad.
declare delta usage is quad.
declare std usage is quad.
declare index-quad usage is quad.
declare temp-quad usage is quad.
declare temp-date usage is date.
declare ln_of_10 usage is real.
declare trunk usage is integer.
ln_of_10 = fn$ln(10.0)
declare dollar-base usage is real.
declare dollar-range usage is real.
declare t usage is real.
declare amount pic 9(6)v99 .
declare tamount pic 9(6)v99 .
declare fund pic 9.
declare tfund pic 9.
stop-date = "12/31/89"
stop-quad = stop-date
ready MASTER modify
define file for DETAIL;
ready DETAIL write
define file for INDEX key=id;
ready INDEX write
declare temp-id pic 9(7).
find a in MASTER
    repeat m begin
        temp-id = 9999999 * fn$random
        while temp-id in INDEX-table begin
            if temp-id eq 9999999 then begin
                temp-id = 0
            temp-id = temp-id + 1
        store INDEX using id = temp-id
        select next a
        modify using id-number = temp-id
        if (n gt 0) begin
            start_date = choice of
            n bt 1 and 6 then "1/1/87"
            n bt 7 and 12 then "1/1/89"
            n bt 13 and 18 then "7/1/88"
            n bt 19 and 24 then "1/1/88"
            else "1/1/87"

            start-quad = start-date
            delta = (stop-quad - start-quad)/n
            index-quad = start-quad + delta/2
            std = delta/(2.0 + 3.0*fn$random)
            if ( n bt 1 and 6) then begin
                dollar-base = 1.0
                dollar-range = 4.0
                end else begin
                dollar-base = 1.0
                dollar-range = 2.0
            t = fn$random
            t = t * t
            tamount=fn$exp((dollar-base+t*dollar-range)* ln_of_10)
            tfund = fn$floor(10.0 * fn$random)
            repeat n begin
                amount = tamount
                if (fn$random le 0.05) then begin
                    t = fn$random
                    amount = fn$exp( (dollar-base +  t * dollar-range) *      
                if (fn$random lt 0.98) then begin
                    trunk = fn$floor(amount/100.0)
                    if (trunk lt 1) then trunk = 1
                    amount = trunk * fn$floor(amount/trunk)
                fund = tfund
                if (fn$random le 0.15) then begin
                    fund = fn$floor(10.0 * fn$random)
                temp-quad = index-quad + std * fn$zprob(fn$random)
                temp-date = temp-quad
                if (temp-date before "1/1/87") then begin
                    temp-date = "1/2/87"
                if (temp-date after "12/29/89") then begin
                    temp-date = "12/29/89"
                temp-date = fn$date(fn$upcase(format 
                    temp-date using dd-mmm-yyyy |" 00:00:00.00"))
                while (format temp-date using w(3) eq "Sat","Sun") begin
                    temp-date = temp-date + 1
                store DETAIL using begin
                    id-number = temp-id
                    date-of-donation = temp-date
                    fund-code = fund
                    amount-of-donation = amount
                index-quad = index-quad + delta
Perhaps the techniques in this article will give you the knowledge and tools to generate data to test your application.
Originally published in the newsletter of the 4GL SIG, The Wombat Examiner and 4GL Dispatch, Volume 12, Number 1, pages 6-10; in the Combined SIGs Newsletters of Digital Equipment Computer Users Society, Volume 6, Number 1, September 1990.
Joe H. Gallagher, Ph. D.