Most of our efforts in developing applications are spent in thinking about and analyzing the problem or writing, testing, and debugging procedures, reports, FMS or TDMS screens, tables, domain definitions, and record definitions. When an application becomes more complex (the number of concurrently accessed domains), when the number of simultaneous users increases, and when remote data is accessed (data accessed across a network), the ability to accurately simulate the "real-world" with trivial data is no longer possible. In such situations it is necessary to create data to test our application design and to evaluate the performance of our application.
The best test data for an application, of course, is the "real" data. But there are many instances when the "real" data has not yet been acquired or may not yet be available to the application developer. It is these instances were one needs to generate test data that will be address by this article.
The examples used in this article are taken from the command files and procedures which were used to generated the data for the Las Vegas 1990 Fall Symposium 4GL Problem. A complete description of the Las Vegas 4GL Problem will appear in a future issue of this newsletter.
Types of Data
With respect to generation of data, there are two broad categories. The first is data that because of its non-mathematical nature or because there are a discrete number of occurrence among a very, very large number of possibilities can only be generated from lists of existing data. The second is data that is numeric, either continuous or discrete, and which can be generated by mathematical methods.
The best example of first kind of data, list data, is first and last names. Mathematically there a billions more possible names than actually occur in a given language (English, for example); so there is no effective way to generate real names (as opposed to nonsense names) with a random number generator. The only way to get realistic names is to take names from a pre-existing list. One source of possible names is the telephone book. John Babiarz used his local telephone book to generate names for the 1990 Spring New Orleans 4GL Problem. He used a little over 1100 unique names and used each one 4 times with four different first names. However, he did not make a complete or random sample of the names in the phone book. The frequency of occurrence of names was peculiar, if not bazaar, as there were more that 400 names beginning with "Z" in a set of 5518 names.
I chose another approach. Since I work in a fairly large hospital and have some good contacts in the Data Processing Department, I was able to secure the list of last names of the most recent 32000 hospital admissions. This list of names has some rather interesting properties. First there were about 20 names in the list whose length was more that 15 characters. All but three of the names were due to hyphenated names (the concatenation of two last names). All three of the "naturally" long names were of Germanic origin. They were 16, 17, and 19 characters long (and the 19 character one was a real mouth full). Because the master file in the problem set had already specified that the LAST-NAME field was PIC X(15), I discarded all last names longer than 15 characters. Names on a hospital admitting list do have one peculiar property. If a name appears on the list, then there is too high a probability that the next name on the list will be the same. The occurs when several members of the same family are involved in a motor vehicle accident and they are all admitted through the emergency room at the same time. Because of this peculiarity, I did remove the duplicates if they occurred together. The set of names did have a good representation of English, Irish, Scotch, German, American Black, and Hispanic names with a few Scandinavian, Asian, Middle Eastern, and Indian names. However, because the 4GL Problem was of project of the 4GL SIG, I check the names against the list of members of the SIG Steering Committee and if a steering committee member's name was not present I added it. Thus, the list I used came from three sources: more than 1100 unique names from John Babiarz's telephone book, about 20 names from the SIG Steering Committee list who weren't already on the list, and more than 4000 more from the patient admissions' list at Research Medical Center to make a list of 5518 names. While the list had some of the properties of a typical list of American names (Smith, Jones, and Brown occurred 46, 37, and 35 times), there were still 67.4% (3718 of 5518) which were unique. In a typical list there are often only 33% to 40% unique names.
Now with a list of 5518 last names in a domain NEWNAMES, I needed to load these names into the MASTER domain. If the MASTER domain did not already have values in the other fields, I could just STORE the records. But to move the first name from NEWNAMES to the first last-name in MASTER, and the second to the second and so on, there is a bit of a trick!
The MASTER record is
and the NEWNAMES record is01 MASTER_REC. 03 ID_NUMBER PIC 9(7). 03 LAST_NAME PIC X(15). ! may contain " ", "'", or "-" 03 FIRST_NAME PIC X(10). 03 MIDDLE_INITIAL PIC X(1). 03 TITLE_OF_ADDRESS PIC X(4). 03 SENIORITY PIC X(4). 03 ADDR PIC X(15). 03 CITY PIC X(12). 03 STATE PIC X(2). 03 ZIP_CODE PIC 9(5). ;
One would like to be able to have DATATRIEVE do something like01 NEWNAMES_REC. 03 NAMES PIC X(15). ;
but, of course, the "FOR NEXT 1" is not legal syntax in DATATRIEVE. Richard Copeland suggested something likeready MASTER modify ready NEWNAMES read for a in NEWNAMES for next 1 b in MASTER modify using b.last-name = a.names
which would work, but its performance would be poor. The solutions I used was:ready MASTER write ready NEWNAMES read find MASTER declare size_of_master usage is integer. size_of_master = count of current declare x usage is integer. for a in NEWNAMES begin x = running count modify b in MASTER with FN$MOD(running count,size_of_master) = x using last-name = names end
which apparently violates the documented restriction that a SELECT may not appear within a BEGIN-END block. However, in this case it works just fine and gives the desired result. Mathematically Generated Data The second type of data can be, but is not always, generated by some mathematical function or algorithm. If the data is continuous or almost continuous, a function with or without the use of a random number generator is often used. But if the data is discrete, many times it is easier to develop a table or a frequency of occurrence for the values. Discrete Values For the Las Vegas 4GL Problem, we needed DETAIL records which would be associated with the MASTER records. The frequency of occurrence of the DETAIL records with respect to the MASTER records is given by the values in the domain DISTRIBUTION. I actually used a spread sheet to fiddle with the values so that the records would come out to exactly 5518. The three fields in the DISTRIBUTION domain are N, the number of DETAIL records per MASTER record, M, the number of MASTER records having this number of DETAIL records, and Z, total number of detail records (the product of N and M).ready MASTER modify ready NEWNAMES read find c in MASTER for NEWNAMES begin select next c modify using LAST-NAME = NAMES end
To create DETAIL records with the distribution gives by the values in DISTRIBUTION, the following skeleton set of commands and statements can be used:N M Z 0 1199 0 1 996 996 2 364 728 3 487 1461 4 254 1016 5 187 935 6 96 576 7 73 511 8 229 1832 9 174 1566 10 135 1350 11 288 3168 12 418 5016 13 9 117 14 18 252 15 23 345 16 42 672 17 71 1207 18 97 1746 19 0 0 20 0 0 21 1 21 22 1 22 23 2 46 24 6 144 25 0 0 26 0 0 27 1 27 28 0 0 29 0 0 30 0 0 31 1 31 32 4 128 33 10 330 34 35 1190 35 98 3430 36 199 7164 Totals 5518 36027
and again the trick of using SELECT NEXT A in a collection is used to step through the MASTER domain.ready DISTRIBUTION read ready MASTER modify define file for DETAIL; ready DETAIL write find a in MASTER for DISTRIBUTION begin repeat m begin select next a repeat n begin store DETAIL using begin id-number = a.id-number . . end end end end
Continuous Uniform Random NumbersA random number generator makes numbers uniformly, but pseudo-randomly, distributed on the interval 0 <= x < 1. There are many kinds of random number generators and many kinds of tests for them as well. For information on random number generators see "Encyclopedia of Computer Science", Anthony Ralsont, Editor, Petrocelli/Charter, New York, 1976, pp. 1192-7, or "The Art of Computer Programming" Volume 2 (Seminumerical Algorithms). Addison Wesley, Reading, MA, 1969. The VAX-FORTRAN random number generator has properties better than the additive-multiplicative generator and a shift generator since it is apparently has properties of both types. A self-seeding routine to call the FORTRAN random number generator was described in Volume 4, Number 1, Section DTR, pages 11-12 of the newsletter. The FORTRAN code is:
The linkage to attach this function to DATATRIEVE is! ! a self-initializing, FORTRAN interface for ! DATATRIEVE to the VAX random number generator. ! real*4 function fn_random() real*4 mth$random integer*4 quad(2),seed equivalence (quad,seed) if (seed .eq. 0) then call sys$gettim(quad) end if fn_random = mth$random(seed) return end
One can then use the random number generator to create number distributed between LOW_VALUE and HIGH_VALUE by the expression; ; FN$RANDOM- a pseudo-random number generator ; ; output is an F-floating pseudo-random number 0<= x < 1 ; $DTR$FUN_DEF FN$RANDOM, FN_RANDOM, 0 $DTR$FUN_OUT_ARG TYPE = FUN$K_VALUE, DTYPE = DSC$K_DTYPE_F $DTR$FUN_NOOPTIMIZE $DTR$FUN_END_DEF
However if one needs numbers exponentially distributed between 10 and 100,000, once could use the expressionLOW_VALUE + fn$random * (HIGH_VALUE - LOW_VALUE)
FN$EXP(FN$LN(10.0)*(1.0 + 4.0*FN$RANDOM))
Discrete Random NumbersTo make integers (discrete numbers) rather than real (continuous) number, judicious use of FN$FLOOR or FN$NINT will convert the continues variable to a discrete one. However, if you need unique integer, you have to keep track of which integers have already been used. For that, you need a domain and a table like
and code likeDEFINE DOMAIN INDEX USING INDEX_RECORD ON INDEX.DAT; DEFINE RECORD INDEX_RECORD USING 01 INDEX_REC. 03 ID PIC 9(7). ; DEFINE TABLE INDEX_TABLE FROM DOMAIN INDEX ID : ID END_TABLE
to keep track of which numbers have already been used.ready INDEX write declare temp-id pic 9(7). . . . temp-id = 9999999 * fn$random while (temp-id in index-table) begin if temp-id eq 9999999 then begin temp-id = 0 end temp-id = temp-id + 1 end store INDEX using id = temp-id . . .
Normally Distributed (Gaussian) NumbersNormally distributed numbers can be created from uniformly distributed numbers by inverting the Gaussian function. Unfortunately, there is no exact closed-form expression for this inverse distribution. There is, however, a good approximation to the needed function. The FORTRAN routine FN_ZPROB gives such a result.
The linkage to DATATRIEVE is! A real function to invert the probability function. ! See Handbook of Mathematical Functions with Formulas, Graphs, ! and Mathematical Tables, Edited by Milton Abramowitz and ! Irene A. Stegun, National Bureau of Standards, Applied ! Mathematics Series, 55, page 933. ! ! input 0 =< rin =< 1.0 ! output -9.262 =< rout =< +9.262 ! ! If the input is uniformly distributed between 0 and 1, ! the output is normally distributed about zero with a ! standard deviation of 1 (except for the very ends of ! the distribution). ! ! Because of the singularity at rin = 0 and rin = 1 ! this function sets ! ! 0 <= rin <= 1.0E-20 rout = -9.262 ! 1 - 1.0E-20 <= rin <= 1.0 rout = +9.262 ! ! Thus, the tails of the distribution are "chopped" off ! at the very, very ends. ! real function fn_zprob(rin) real*4 rin, rout, p, t logical side ! 0 to 0.5 -> TRUE; 0.5 to 1 -> FALSE side = .TRUE. p = rin if ( p .gt. 0.5 ) then side = .FALSE. ! data is 0.5 to 1 p = 1.0 - p ! invert the data end if if ( p .lt. 1.0E-20) then ! is data too close to 0? p = 1.0E-20 ! yes, set to a small value end if t = sqrt( 2.0 * alog ( 1.0/p)) rout = t - (2.515517 + t * (0.802853 + t * 0.010328))/ 1 (1.0 + t * (1.432788 + t * (0.189269 + t*0.001308))) ! ! Size of error in this formula is ! | error(p) | < 0.0004.5 ! However, near input of 0.5, output of formula can go ! slightly negative. ! May sure value does not go negative, here. ! if (rout .lt. 0.0) then rout = 0.0 end if if (side) then ! which side are we on? rout = -1.0 * rout ! 0 to 0.5 maps to -9.262 to 0.0 end if fn_zprob = rout ! return value return end
The function combination FN$ZPROB(FN$RANDOM) can be used to normally and randomly distribute values with a mean, MEAN, and a standard deviation, STD_DEV, with the expression; ; FN$ZPROB- a normally distributed function ; ; input is an F-floating number 0<= z <= 1 ; output is an F-floating number normally distributed ; $DTR$FUN_DEF FN$ZPROB, FN_ZPROB, 1 $DTR$FUN_OUT_ARG TYPE = FUN$K_VALUE, DTYPE = DSC$K_DTYPE_F $DTR$FUN_IN_ARG TYPE = FUN$K_REF, DTYPE = DSC$K_DTYPE_F, ORDER = 1 $DTR$FUN_NOOPTIMIZE $DTR$FUN_END_DEF
MEAN + STD_DEV * FN$ZPROB(FN$RANDOM)
Generation of Test Data for the DETAIL RecordAll of the techniques described above are used to generate records for the DETAIL domain. The record definition is:
The variable ID_NUMBER is uniformly distributed over the range 0 to 9999999, but each value is unique. The DATE_OF_DONATION is normally distributed about a target date which is one nth (where n is the number of donations) of the time between the beginning and ending donation dates. These start and stop dates depend on the number of donations given. The start date is determine by a complex CHOICE OF expression which depends on the number of donations. The standard deviation about these target dates is arithmetically distributed between one half and one fifth of the mean time between donations. The donations are in 1987, 1988, and 1989. Donation dates are also restricted to be week days; that is, Saturdays and Sundays are not allowed. The FUND_CODE is uniformly distributed between 0 and 9. The AMOUNT_OF_DONATION is exponentially distributed between $10.00 and $1,000.00 or $10.00 and $100,000.00 depending on the number of donations. Ninety- eight percent of the donations are truncated to 3 significant figures; 2% of the donations have all digits. The DATATRIEVE code to generate the data in DETAIL is:01 DETAIL_REC. 03 ID_NUMBER PIC 9(7). 03 DATE_OF_DONATION USAGE IS DATE. 03 FUND_CODE PIC 9(1). 03 AMOUNT_OF_DONATION PIC 9(5)V99 . ;
Perhaps the techniques in this article will give you the knowledge and tools to generate data to test your application.! ! a command file to create new id numbers and to ! create DETAIL records for the Las Vegas 4GL Problem ! declare start-date usage is date. declare stop-date usage is date. declare start-quad usage is quad. declare stop-quad usage is quad. declare delta usage is quad. declare std usage is quad. declare index-quad usage is quad. declare temp-quad usage is quad. declare temp-date usage is date. declare ln_of_10 usage is real. declare trunk usage is integer. ln_of_10 = fn$ln(10.0) ! declare dollar-base usage is real. declare dollar-range usage is real. declare t usage is real. declare amount pic 9(6)v99 . declare tamount pic 9(6)v99 . declare fund pic 9. declare tfund pic 9. ! stop-date = "12/31/89" stop-quad = stop-date ! ready DISTRIBUTION read ready MASTER modify define file for DETAIL; ready DETAIL write define file for INDEX key=id; ready INDEX write declare temp-id pic 9(7). find a in MASTER ! for DISTRIBUTION begin repeat m begin temp-id = 9999999 * fn$random while temp-id in INDEX-table begin if temp-id eq 9999999 then begin temp-id = 0 end temp-id = temp-id + 1 end store INDEX using id = temp-id select next a modify using id-number = temp-id if (n gt 0) begin start_date = choice of n bt 1 and 6 then "1/1/87" n bt 7 and 12 then "1/1/89" n bt 13 and 18 then "7/1/88" n bt 19 and 24 then "1/1/88" else "1/1/87" end_choice start-quad = start-date delta = (stop-quad - start-quad)/n index-quad = start-quad + delta/2 std = delta/(2.0 + 3.0*fn$random) if ( n bt 1 and 6) then begin dollar-base = 1.0 dollar-range = 4.0 end else begin dollar-base = 1.0 dollar-range = 2.0 end t = fn$random t = t * t tamount=fn$exp((dollar-base+t*dollar-range)* ln_of_10) tfund = fn$floor(10.0 * fn$random) repeat n begin amount = tamount if (fn$random le 0.05) then begin t = fn$random amount = fn$exp( (dollar-base + t * dollar-range) * ln_of_10) end if (fn$random lt 0.98) then begin trunk = fn$floor(amount/100.0) if (trunk lt 1) then trunk = 1 amount = trunk * fn$floor(amount/trunk) end fund = tfund if (fn$random le 0.15) then begin fund = fn$floor(10.0 * fn$random) end temp-quad = index-quad + std * fn$zprob(fn$random) temp-date = temp-quad if (temp-date before "1/1/87") then begin temp-date = "1/2/87" end if (temp-date after "12/29/89") then begin temp-date = "12/29/89" end temp-date = fn$date(fn$upcase(format temp-date using dd-mmm-yyyy |" 00:00:00.00")) while (format temp-date using w(3) eq "Sat","Sun") begin temp-date = temp-date + 1 end store DETAIL using begin id-number = temp-id date-of-donation = temp-date fund-code = fund amount-of-donation = amount end index-quad = index-quad + delta end end end end
Joe H. Gallagher, Ph. D.