Generating Test Data

Most of our efforts in developing applications are spent in thinking about and analyzing the problem or writing, testing, and debugging procedures, reports, FMS or TDMS screens, tables, domain definitions, and record definitions. When an application becomes more complex (the number of concurrently accessed domains), when the number of simultaneous users increases, and when remote data is accessed (data accessed across a network), the ability to accurately simulate the "real-world" with trivial data is no longer possible. In such situations it is necessary to create data to test our application design and to evaluate the performance of our application.

The best test data for an application, of course, is the "real" data. But there are many instances when the "real" data has not yet been acquired or may not yet be available to the application developer. It is these instances were one needs to generate test data that will be address by this article.

The examples used in this article are taken from the command files and procedures which were used to generated the data for the Las Vegas 1990 Fall Symposium 4GL Problem. A complete description of the Las Vegas 4GL Problem will appear in a future issue of this newsletter.

Types of Data

With respect to generation of data, there are two broad categories. The first is data that because of its non-mathematical nature or because there are a discrete number of occurrence among a very, very large number of possibilities can only be generated from lists of existing data. The second is data that is numeric, either continuous or discrete, and which can be generated by mathematical methods.

List Data

The best example of first kind of data, list data, is first and last names. Mathematically there a billions more possible names than actually occur in a given language (English, for example); so there is no effective way to generate real names (as opposed to nonsense names) with a random number generator. The only way to get realistic names is to take names from a pre-existing list. One source of possible names is the telephone book. John Babiarz used his local telephone book to generate names for the 1990 Spring New Orleans 4GL Problem. He used a little over 1100 unique names and used each one 4 times with four different first names. However, he did not make a complete or random sample of the names in the phone book. The frequency of occurrence of names was peculiar, if not bazaar, as there were more that 400 names beginning with "Z" in a set of 5518 names.

I chose another approach. Since I work in a fairly large hospital and have some good contacts in the Data Processing Department, I was able to secure the list of last names of the most recent 32000 hospital admissions. This list of names has some rather interesting properties. First there were about 20 names in the list whose length was more that 15 characters. All but three of the names were due to hyphenated names (the concatenation of two last names). All three of the "naturally" long names were of Germanic origin. They were 16, 17, and 19 characters long (and the 19 character one was a real mouth full). Because the master file in the problem set had already specified that the LAST-NAME field was PIC X(15), I discarded all last names longer than 15 characters. Names on a hospital admitting list do have one peculiar property. If a name appears on the list, then there is too high a probability that the next name on the list will be the same. The occurs when several members of the same family are involved in a motor vehicle accident and they are all admitted through the emergency room at the same time. Because of this peculiarity, I did remove the duplicates if they occurred together. The set of names did have a good representation of English, Irish, Scotch, German, American Black, and Hispanic names with a few Scandinavian, Asian, Middle Eastern, and Indian names. However, because the 4GL Problem was of project of the 4GL SIG, I check the names against the list of members of the SIG Steering Committee and if a steering committee member's name was not present I added it. Thus, the list I used came from three sources: more than 1100 unique names from John Babiarz's telephone book, about 20 names from the SIG Steering Committee list who weren't already on the list, and more than 4000 more from the patient admissions' list at Research Medical Center to make a list of 5518 names. While the list had some of the properties of a typical list of American names (Smith, Jones, and Brown occurred 46, 37, and 35 times), there were still 67.4% (3718 of 5518) which were unique. In a typical list there are often only 33% to 40% unique names.

Now with a list of 5518 last names in a domain NEWNAMES, I needed to load these names into the MASTER domain. If the MASTER domain did not already have values in the other fields, I could just STORE the records. But to move the first name from NEWNAMES to the first last-name in MASTER, and the second to the second and so on, there is a bit of a trick!

The MASTER record is

01 MASTER_REC.
03 ID_NUMBER PIC 9(7).
03 LAST_NAME PIC X(15). ! may contain " ", "'", or "-"
03 FIRST_NAME PIC X(10).
03 MIDDLE_INITIAL PIC X(1).
03 TITLE_OF_ADDRESS PIC X(4).
03 SENIORITY PIC X(4).
03 ADDR PIC X(15).
03 CITY PIC X(12).
03 STATE PIC X(2).
03 ZIP_CODE PIC 9(5).
;

and the NEWNAMES record is
01 NEWNAMES_REC.
03 NAMES PIC X(15).
;

One would like to be able to have DATATRIEVE do something like
ready MASTER modify
ready NEWNAMES read
for a in NEWNAMES
for next 1 b in MASTER
modify using b.last-name = a.names

but, of course, the "FOR NEXT 1" is not legal syntax in DATATRIEVE.
Richard Copeland suggested something like
ready MASTER write
ready NEWNAMES read
find MASTER
declare size_of_master usage is integer.
size_of_master = count of current
declare x usage is integer.
for a in NEWNAMES begin
x = running count
modify b in MASTER with
FN$MOD(running count,size_of_master) =
x using last-name = names
end

which would work, but its performance would be poor.
The solutions I used was:
ready MASTER modify
ready NEWNAMES read
find c in MASTER
for NEWNAMES begin
select next c
modify using LAST-NAME = NAMES
end

which apparently violates the documented restriction that a
SELECT may not appear within a BEGIN-END block. However, in this
case it works just fine and gives the desired result.
Mathematically Generated Data
The second type of data can be, but is not always, generated by
some mathematical function or algorithm. If the data is
continuous or almost continuous, a function with or without the
use of a random number generator is often used. But if the data
is discrete, many times it is easier to develop a table or a
frequency of occurrence for the values.
Discrete Values
For the Las Vegas 4GL Problem, we needed DETAIL records which
would be associated with the MASTER records. The frequency of
occurrence of the DETAIL records with respect to the MASTER
records is given by the values in the domain DISTRIBUTION. I
actually used a spread sheet to fiddle with the values so that
the records would come out to exactly 5518. The three fields in
the DISTRIBUTION domain are N, the number of DETAIL records per
MASTER record, M, the number of MASTER records having this number
of DETAIL records, and Z, total number of detail records (the
product of N and M).
N M Z
0 1199 0
1 996 996
2 364 728
3 487 1461
4 254 1016
5 187 935
6 96 576
7 73 511
8 229 1832
9 174 1566
10 135 1350
11 288 3168
12 418 5016
13 9 117
14 18 252
15 23 345
16 42 672
17 71 1207
18 97 1746
19 0 0
20 0 0
21 1 21
22 1 22
23 2 46
24 6 144
25 0 0
26 0 0
27 1 27
28 0 0
29 0 0
30 0 0
31 1 31
32 4 128
33 10 330
34 35 1190
35 98 3430
36 199 7164
Totals 5518 36027

To create DETAIL records with the distribution gives by the
values in DISTRIBUTION, the following skeleton set of commands
and statements can be used:
ready DISTRIBUTION read
ready MASTER modify
define file for DETAIL;
ready DETAIL write
find a in MASTER
for DISTRIBUTION begin
repeat m begin
select next a
repeat n begin
store DETAIL using begin
id-number = a.id-number
. .
end
end
end
end

and again the trick of using SELECT NEXT A in a collection is
used to step through the MASTER domain.
Continuous Uniform Random Numbers

A random number generator makes numbers uniformly, but pseudo-randomly, distributed on the interval 0 <= x < 1. There are many kinds of random number generators and many kinds of tests for them as well. For information on random number generators see "Encyclopedia of Computer Science", Anthony Ralsont, Editor, Petrocelli/Charter, New York, 1976, pp. 1192-7, or "The Art of Computer Programming" Volume 2 (Seminumerical Algorithms). Addison Wesley, Reading, MA, 1969. The VAX-FORTRAN random number generator has properties better than the additive-multiplicative generator and a shift generator since it is apparently has properties of both types. A self-seeding routine to call the FORTRAN random number generator was described in Volume 4, Number 1, Section DTR, pages 11-12 of the newsletter. The FORTRAN code is:
!
! a self-initializing, FORTRAN interface for
! DATATRIEVE to the VAX random number generator.
!
real*4 function fn_random()
real*4 mth$random
integer*4 quad(2),seed
equivalence (quad,seed)
if (seed .eq. 0) then
call sys$gettim(quad)
end if
fn_random = mth$random(seed)
return
end

The linkage to attach this function to DATATRIEVE is
;
; FN$RANDOM- a pseudo-random number generator
;
; output is an F-floating pseudo-random number 0<= x < 1
;
$DTR$FUN_DEF FN$RANDOM, FN_RANDOM, 0
$DTR$FUN_OUT_ARG TYPE = FUN$K_VALUE, DTYPE = DSC$K_DTYPE_F
$DTR$FUN_NOOPTIMIZE
$DTR$FUN_END_DEF

One can then use the random number generator to create number
distributed between LOW_VALUE and HIGH_VALUE by the expression
LOW_VALUE + fn$random * (HIGH_VALUE - LOW_VALUE)

However if one needs numbers exponentially distributed between 10
and 100,000, once could use the expression
FN$EXP(FN$LN(10.0)*(1.0 + 4.0*FN$RANDOM))

Discrete Random Numbers

To make integers (discrete numbers) rather than real (continuous) number, judicious use of FN$FLOOR or FN$NINT will convert the continues variable to a discrete one. However, if you need unique integer, you have to keep track of which integers have already been used. For that, you need a domain and a table like
DEFINE DOMAIN INDEX USING INDEX_RECORD ON INDEX.DAT;
DEFINE RECORD INDEX_RECORD USING
01 INDEX_REC.
03 ID PIC 9(7).
;
DEFINE TABLE INDEX_TABLE FROM DOMAIN INDEX
ID : ID
END_TABLE

and code like
ready INDEX write
declare temp-id pic 9(7).
. . .
temp-id = 9999999 * fn$random
while (temp-id in index-table) begin
if temp-id eq 9999999 then begin
temp-id = 0
end
temp-id = temp-id + 1
end
store INDEX using id = temp-id
. . .

to keep track of which numbers have already been used.
Normally Distributed (Gaussian) Numbers

Normally distributed numbers can be created from uniformly distributed numbers by inverting the Gaussian function. Unfortunately, there is no exact closed-form expression for this inverse distribution. There is, however, a good approximation to the needed function. The FORTRAN routine FN_ZPROB gives such a result.
! A real function to invert the probability function.
! See Handbook of Mathematical Functions with Formulas, Graphs,
! and Mathematical Tables, Edited by Milton Abramowitz and
! Irene A. Stegun, National Bureau of Standards, Applied
! Mathematics Series, 55, page 933.
!
! input 0 =< rin =< 1.0
! output -9.262 =< rout =< +9.262
!
! If the input is uniformly distributed between 0 and 1,
! the output is normally distributed about zero with a
! standard deviation of 1 (except for the very ends of
! the distribution).
!
! Because of the singularity at rin = 0 and rin = 1
! this function sets
!
! 0 <= rin <= 1.0E-20 rout = -9.262
! 1 - 1.0E-20 <= rin <= 1.0 rout = +9.262
!
! Thus, the tails of the distribution are "chopped" off
! at the very, very ends.
!
real function fn_zprob(rin)
real*4 rin, rout, p, t
logical side ! 0 to 0.5 -> TRUE; 0.5 to 1 -> FALSE
side = .TRUE.
p = rin
if ( p .gt. 0.5 ) then
side = .FALSE. ! data is 0.5 to 1
p = 1.0 - p ! invert the data
end if
if ( p .lt. 1.0E-20) then ! is data too close to 0?
p = 1.0E-20 ! yes, set to a small value
end if
t = sqrt( 2.0 * alog ( 1.0/p))
rout = t - (2.515517 + t * (0.802853 + t * 0.010328))/
1 (1.0 + t * (1.432788 + t * (0.189269 + t*0.001308)))
!
! Size of error in this formula is
! | error(p) | < 0.0004.5
! However, near input of 0.5, output of formula can go
! slightly negative.
! May sure value does not go negative, here.
!
if (rout .lt. 0.0) then
rout = 0.0
end if
if (side) then ! which side are we on?
rout = -1.0 * rout ! 0 to 0.5 maps to -9.262 to 0.0
end if
fn_zprob = rout ! return value
return
end

The linkage to DATATRIEVE is
;
; FN$ZPROB- a normally distributed function
;
; input is an F-floating number 0<= z <= 1
; output is an F-floating number normally distributed
;
$DTR$FUN_DEF FN$ZPROB, FN_ZPROB, 1
$DTR$FUN_OUT_ARG TYPE = FUN$K_VALUE, DTYPE = DSC$K_DTYPE_F
$DTR$FUN_IN_ARG TYPE = FUN$K_REF, DTYPE = DSC$K_DTYPE_F, ORDER = 1
$DTR$FUN_NOOPTIMIZE
$DTR$FUN_END_DEF

The function combination FN$ZPROB(FN$RANDOM) can be used to
normally and randomly distribute values with a mean, MEAN, and a
standard deviation, STD_DEV, with the expression
MEAN + STD_DEV * FN$ZPROB(FN$RANDOM)

Generation of Test Data for the DETAIL Record

All of the techniques described above are used to generate records for the DETAIL domain. The record definition is:
01 DETAIL_REC.
03 ID_NUMBER PIC 9(7).
03 DATE_OF_DONATION USAGE IS DATE.
03 FUND_CODE PIC 9(1).
03 AMOUNT_OF_DONATION PIC 9(5)V99 .
;

The variable ID_NUMBER is uniformly distributed over the range 0
to 9999999, but each value is unique. The DATE_OF_DONATION is
normally distributed about a target date which is one nth (where
n is the number of donations) of the time between the beginning
and ending donation dates. These start and stop dates depend on
the number of donations given. The start date is determine by a
complex CHOICE OF expression which depends on the number of
donations. The standard deviation about these target dates is
arithmetically distributed between one half and one fifth of the
mean time between donations. The donations are in 1987, 1988, and
1989. Donation dates are also restricted to be week days; that
is, Saturdays and Sundays are not allowed. The FUND_CODE is
uniformly distributed between 0 and 9. The AMOUNT_OF_DONATION is
exponentially distributed between $10.00 and $1,000.00 or $10.00
and $100,000.00 depending on the number of donations. Ninety-
eight percent of the donations are truncated to 3 significant
figures; 2% of the donations have all digits.
The DATATRIEVE code to generate the data in DETAIL is:
!
! a command file to create new id numbers and to
! create DETAIL records for the Las Vegas 4GL Problem
!
declare start-date usage is date.
declare stop-date usage is date.
declare start-quad usage is quad.
declare stop-quad usage is quad.
declare delta usage is quad.
declare std usage is quad.
declare index-quad usage is quad.
declare temp-quad usage is quad.
declare temp-date usage is date.
declare ln_of_10 usage is real.
declare trunk usage is integer.
ln_of_10 = fn$ln(10.0)
!
declare dollar-base usage is real.
declare dollar-range usage is real.
declare t usage is real.
declare amount pic 9(6)v99 .
declare tamount pic 9(6)v99 .
declare fund pic 9.
declare tfund pic 9.
!
stop-date = "12/31/89"
stop-quad = stop-date
!
ready DISTRIBUTION read
ready MASTER modify
define file for DETAIL;
ready DETAIL write
define file for INDEX key=id;
ready INDEX write
declare temp-id pic 9(7).
find a in MASTER
!
for DISTRIBUTION begin
repeat m begin
temp-id = 9999999 * fn$random
while temp-id in INDEX-table begin
if temp-id eq 9999999 then begin
temp-id = 0
end
temp-id = temp-id + 1
end
store INDEX using id = temp-id
select next a
modify using id-number = temp-id
if (n gt 0) begin
start_date = choice of
n bt 1 and 6 then "1/1/87"
n bt 7 and 12 then "1/1/89"
n bt 13 and 18 then "7/1/88"
n bt 19 and 24 then "1/1/88"
else "1/1/87"
end_choice
start-quad = start-date
delta = (stop-quad - start-quad)/n
index-quad = start-quad + delta/2
std = delta/(2.0 + 3.0*fn$random)
if ( n bt 1 and 6) then begin
dollar-base = 1.0
dollar-range = 4.0
end else begin
dollar-base = 1.0
dollar-range = 2.0
end
t = fn$random
t = t * t
tamount=fn$exp((dollar-base+t*dollar-range)* ln_of_10)
tfund = fn$floor(10.0 * fn$random)
repeat n begin
amount = tamount
if (fn$random le 0.05) then begin
t = fn$random
amount = fn$exp( (dollar-base + t * dollar-range) *
ln_of_10)
end
if (fn$random lt 0.98) then begin
trunk = fn$floor(amount/100.0)
if (trunk lt 1) then trunk = 1
amount = trunk * fn$floor(amount/trunk)
end
fund = tfund
if (fn$random le 0.15) then begin
fund = fn$floor(10.0 * fn$random)
end
temp-quad = index-quad + std * fn$zprob(fn$random)
temp-date = temp-quad
if (temp-date before "1/1/87") then begin
temp-date = "1/2/87"
end
if (temp-date after "12/29/89") then begin
temp-date = "12/29/89"
end
temp-date = fn$date(fn$upcase(format
temp-date using dd-mmm-yyyy |" 00:00:00.00"))
while (format temp-date using w(3) eq "Sat","Sun") begin
temp-date = temp-date + 1
end
store DETAIL using begin
id-number = temp-id
date-of-donation = temp-date
fund-code = fund
amount-of-donation = amount
end
index-quad = index-quad + delta
end
end
end
end

Perhaps the techniques in this article will give you the
knowledge and tools to generate data to test your application.
Originally published in the newsletter of the 4GL SIG,

Joe H. Gallagher, Ph. D. dtrwiz@ix.netcom.com |

Back |