Copyright © 2000 Christopher S. Charabaruk and Matthew R. Knight. All rights reserved. All articles, tutorials, etc. copyright © by the original authors unless otherwise noted. QB Cult Magazine is the exclusive property and copyright of Christopher Steffan Charabaruk and Matthew R. Knight.
Welcome to the first issue of QB Cult Magazine's second year. Although smaller than I had hoped, it's still a great issue, and I'm sure you'll find it as useful as any of our previous issues.
Like last month, we have no award image for Site of the Month. Next month (or any time before), this problem will be corrected, and you'll once again have a pretty picture to look at in the issues and on the winning sites.
Now, I want to talk about the diminishing quantity of submissions to QBCM. Since about Issue 7, we've been receiving less and less content each month, to the point where QBCM might have to be published once every other month. I don't want to do this, and I doubt you want that to happen either. So I ask you, submit tips, articles, news, anything you feel belongs in QB Cult Magazine. We're by QB coders for QB coders, so without you QBCM is nothing.
Let's keep this boat floating.
Chris Charabaruk (EvilBeaver), editor
Note: BASIC Techniques and Utilities was originally published by Ziff-Davis Press, producers of PC Magazine and other computer-related publications. After ZD stopped printing it, they released the rights to Ethan Winer, the author. After I communicated with Ethan by e-mail, he allowed me to reproduce his book, chapter by chapter, in QBCM. You can find a full text version of BASIC Techniques and Utilities at Ethan's website <www.ethanwiner.com>, or wait until the serialization is complete, when I will have an HTML version ready.
There's a lot of interesting information in your magazine. I want to respond regarding Peter Cooper's letter in Issue 9. My 3d game is a direct descendant of Peter's original raycaster, ray1.bas. Thanks for showing the way in QB, Peter.
Jacques Mallah <jackmallah@yahoo.com>
I'm glad that you find QBCM interesting. We're here to be interesting and informative. And you aren't the only one who's built a raycaster around Peter's engine. And if you like QB games in 3d environments, check out this month's Demo of the Month.
To place an ad, please e-mail <qbcm@tekscode.com>, subject QB Ads. Include your name (real or fake), e-mail address, and message. You may include HTML formatting (but risk the chance of the formatting being thrown out).
By Ethan Winer <ethan@ethanwiner.com>
At some point, all but the most trivial computer programs will need to store and retrieve data using a disk file. Data files are used for two primary purposes: to hold information when there is more than can fit into the computer's memory all at once, and to provide a permanent, non-volatile means of storage. Files are also used to allow data from one computer to be used on another. Such data sharing can be as simple as a "sneaker net" system, whereby a floppy disk is manually carried from one PC to another, or as complex as a multi-user network where disk data can be accessed simultaneously by several users.
Although there are two fundamentally different types of disk drives, floppy and fixed [not counting CD-ROM drives, which are removable], they are accessed identically using the same BASIC statements. BASIC's file commands may also be used to communicate with devices such as a printer or modem, and even the screen and keyboard. There are many ways to manipulate files and devices, and some are substantially faster than others. By understanding fully how BASIC interacts with DOS, file access in your programs can often be sped up by a factor of five or even more.
In this chapter I will address the fundamental aspects of file and device handling, and provide specific examples of how to achieve the highest performance possible. I will begin with an overview of how DOS organizes information on a disk, and then continue with practical examples. Unlike earlier chapters in which only short program fragments were shown, several complete programs and subprograms will be presented to illustrate the most important of these techniques in context. I will also describe the underlying theory of how disks are organized, and explain why this is important for the BASIC programmer to know.
In Chapter 7 the subject of files will be continued; there you will learn how to write programs for use with a network, and also how relational databases are constructed. In particular, coverage of these two very important subjects is severely lacking in the documentation that comes with Microsoft BASIC. As personal computers continue to permeate the office environment, networks and databases are becoming ever more common. Many programmers find themselves in the awkward position of having to write programs that run on a network, but with no adequate source of information.
All disks used with MS-DOS are organized into groups of bytes called sectors, and these sectors are further combined into clusters. DOS keeps track of every file on a disk, but with this organization DOS needs to remember only the cluster number at which each file begins. The minimum amount of disk space that is allocated by DOS is one cluster. Therefore, if you create a very small file--say, ten bytes--an entire cluster is allocated to that file, and then marked as unavailable for other use.
In most cases, each disk sector holds 512 bytes; however, one exception is when you use a RAM disk to simulate a disk drive in memory. Many RAM disk programs let you specify a smaller sector size, to minimize waste when there are many small files. The number of sectors that are stored in each cluster depends on the type of disk and its size. For example, a 360K floppy disk stores two sectors in each cluster, and a 32MB hard disk formatted using DOS 3.3 stores four sectors in each cluster. Therefore, the minimum unit of storage allocation for these disks is 1K (1024 bytes) and 2K (2048 bytes) respectively. DOS 2.x offers less room to store cluster numbers, and must combine more sectors into each cluster. A 20MB hard disk formatted with DOS 2.1 allocates 8K for even a one-line batch file!
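The arithmetic behind these figures can be sketched in a few lines of BASIC. This is only an illustration: the sector and cluster values are typed in by hand from the 360K example above, not read from a real disk, and the variable names are made up for the purpose.

```basic
' Sketch: how much disk space DOS would set aside for a file of a given size.
' SectorSize and SectorsPerCluster are illustrative values, not read from a disk.
DEFINT A-Z
SectorSize = 512
SectorsPerCluster = 2               '360K floppy, per the text
ClusterSize = SectorSize * SectorsPerCluster

FileSize& = 10                      'the ten-byte file mentioned earlier
Clusters& = (FileSize& + ClusterSize - 1) \ ClusterSize   'round up
Allocated& = Clusters& * ClusterSize

PRINT "Clusters used:"; Clusters&
PRINT "Bytes allocated:"; Allocated&
PRINT "Bytes wasted:"; Allocated& - FileSize&
```

For the ten-byte file this reports one cluster, 1,024 bytes allocated, and 1,014 bytes wasted. Changing SectorsPerCluster to 4 gives the 2K figure for the 32MB hard disk.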
As files are created and appended, DOS allocates new space to hold the file contents. By allocating disk space in whole clusters, DOS is also able to minimize disk fragmentation. As you learned in Chapter 2, BASIC manages variable-length strings by claiming new memory as necessary. When available memory is exhausted BASIC compacts its string space, overwriting abandoned string data with strings that are still active.
This method is not practical with disk files, because copying data from one part of the disk to another for the purpose of compaction would take an unacceptable amount of time. Therefore, DOS initially allocates an entire cluster for each file, to provide space for subsequent data. When the ten-byte file mentioned earlier is added to, space on the disk has already been set aside for all or part of the new data that will be written. And when the first cluster's capacity is exceeded, DOS allocates an entire second cluster to hold the additional data.
Even though it is common for a disk to become fragmented, allocating clusters that are comprised of groups of contiguous sectors greatly reduces the number of individual fragments that must be accessed. The track, sector, and cluster makeup of a 360k 5-1/4 inch floppy disk is shown in Figure 6-1.
This disk is divided into 40 circular tracks, and each track is further divided into nine sectors. One sector holds 512 bytes, and each pair of sectors is combined to form a single cluster. For a 360k disk, no file fragment will ever be smaller than two sectors, since this is the minimum amount of space that DOS allocates. Likewise, a hard disk that combines four sectors into each cluster will never be divided into pieces smaller than four sectors.
Please understand that tracks and sectors are physical entities that are magnetically encoded onto the disk when it is formatted--it is DOS that treats each pair of sectors as a single cluster. Note that since a 360k disk stores nine sectors on each track, some clusters will in fact span two tracks.
Using the disk in Figure 6-1 as an example, the first short file that is written to it will be placed in cluster 1 (sectors 1 and 2), even if the file does not fill both sectors. The second file written to this disk will then be stored starting at cluster 2 (sectors 3 and 4). If the first file is later extended beyond the 1,024 bytes that can fit into cluster 1, the excess will be added beginning at cluster 3 (sectors 5 and 6). Thus, when DOS reads the first file sequentially, it must read cluster 1, skip over cluster 2, and then continue reading at cluster 3.
Of course, this takes longer than reading a file that is contiguous, because the disk drive must wait until the second file's intervening sectors have passed beneath it. This problem is compounded by additional head movement when the fragmentation extends across more than one track, as well as by other timing issues.
There are also three special areas on every disk: the boot sector, the Disk Directory, and the File Allocation Table (FAT). DOS uses the directory and FAT to know the name of each file, and where on the disk its first cluster is located. For simplicity, these are not shown in Figure 6-1, although they are in fact stored before any files on a disk.
When a 360K floppy disk is formatted, DOS sets aside room for 112 directory entries. Each entry is 32 bytes long, and holds the name of each file on the disk, its current size, the date and time it was last written to, its attribute (hidden, read-only, and so forth), and starting cluster number. When you open a file, DOS searches each directory entry for the file name you specified, and once found, goes to the first cluster that holds the file's data.
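One way to picture a directory entry is as a BASIC TYPE. The sketch below follows the standard DOS directory layout as I understand it; the field names are my own, and note that the fields are stored in a different order than the sentence above lists them.

```basic
' Sketch of a 32-byte DOS directory entry. Field names are invented here;
' the sizes follow the standard DOS layout (8 + 3 + 1 + 10 + 2 + 2 + 2 + 4).
TYPE DirEntry
  BaseName     AS STRING * 8      'file name, padded with spaces
  Extension    AS STRING * 3      'extension, padded with spaces
  Attribute    AS STRING * 1      'hidden, read-only, and so forth
  Reserved     AS STRING * 10     'unused by DOS
  FileTime     AS INTEGER         'time last written, packed
  FileDate     AS INTEGER         'date last written, packed
  StartCluster AS INTEGER         'first cluster of the file's data
  FileSize     AS LONG            'current size in bytes
END TYPE

DIM Entry AS DirEntry
PRINT "Bytes per entry:"; LEN(Entry)
PRINT "Bytes for 112 entries:"; 112& * LEN(Entry)
PRINT "Sectors for 112 entries:"; (112& * LEN(Entry)) \ 512
```

LEN(Entry) comes to 32, so the 112 entries on a 360K disk occupy 3,584 bytes, or seven 512-byte sectors.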
The disk's FAT contains one entry for every cluster in the data area, to show which clusters are in use and by which file. The FAT is organized as a linked list, with each entry pointing to the next. The last cluster in the file is identified with a special value. The FAT also holds other special values to identify unused, reserved, and defective clusters.
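The linked-list idea can be modeled with an ordinary integer array. This is a deliberately simplified sketch: a real FAT packs its entries into 12 or 16 bits each, and the end-of-chain marker used here (-1) is only a stand-in for the special value DOS uses. The chain mirrors the earlier example in which a file occupies cluster 1 and then continues at cluster 3.

```basic
' Simplified model of a FAT chain. Real FAT entries are packed 12- or
' 16-bit values; the -1 end marker is a stand-in chosen for this sketch.
DEFINT A-Z
CONST EndOfChain = -1
DIM Fat(1 TO 10)

Fat(1) = 3              'first file: cluster 1 points to cluster 3
Fat(3) = EndOfChain     '...and cluster 3 is its last cluster
Fat(2) = EndOfChain     'second file occupies only cluster 2

Cluster = 1             'starting cluster, as found in the directory entry
DO
  PRINT "Read cluster"; Cluster
  Cluster = Fat(Cluster)          'follow the link to the next cluster
LOOP UNTIL Cluster = EndOfChain
```

The loop visits cluster 1 and then cluster 3, skipping cluster 2 just as DOS must when it reads the fragmented file.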
Because there are a fixed number of directory entries on a disk, it is possible to receive a "Disk full" message when attempting to open a new file, even when there is sufficient data space. The root directory of a 360K floppy disk is limited to 112 entries, and a 1.2MB disk can hold up to 224 file names. Notice that a volume label takes one directory entry, although no data space is allocated to it. Unlike the root directory on a disk, subdirectories that you create are not limited to an arbitrary number of file name entries. Rather, a subdirectory is in fact a file, and it can be extended indefinitely until there is no more room on the disk.
Fortunately, most programmers do not have to deal with disk access at this level. When you ask BASIC to open a file and then read from or write to it, DOS handles all the low-level details for you. However, I think it is important to have at least a rudimentary understanding of how disks are organized. If you are interested in learning more about the structure of disks and data files, I recommend Peter Norton's Programmer's Guide to the IBM PC & PS/2. This excellent reference is published by Microsoft Press, and can be found at most major book stores.
A device is related to a file in that you can open it using BASIC's OPEN command, and then access it with GET # and PRINT # and the other file-related BASIC statements. There are a number of devices commonly used with personal computers, and these include printers, modems, tape backup units, and the console (the PC's keyboard and display screen). Some of these devices are maintained by DOS, and others are also controlled by BASIC.
For example, when you open "SCRN:" for Output mode in a BASIC program, BASIC takes responsibility for displaying the characters that you print. However, if you instead open "CON", BASIC merely sends the data to DOS, which in turn sends it to the display screen. Any device whose name is followed by a colon is considered to be a BASIC device; the absence of a trailing colon indicates a DOS device. This is important to understand, because there may be situations when you want to route your program's output directly through DOS, and not have it be intercepted by BASIC.
One such situation would be when printing the special control characters that the ANSI.SYS device driver recognizes. Normally, BASIC processes data in a PRINT statement by writing directly to screen memory. This provides the fastest response, which is of course desirable in most programs. But ANSI.SYS operates by intercepting the stream of characters sent through DOS. Since BASIC normally bypasses DOS for screen operations, ANSI.SYS never gets a chance to see those characters.
Another reason for printing through DOS is to activate TSR (Terminate and Stay Resident) programs that intercept the BIOS video routines. (When data is sent through DOS for display, DOS merely passes it on to the BIOS routines which do the real work.) For example, some early screen design utilities use this method, to accommodate multiple programming languages by avoiding the differences in calling and linking. Therefore, to activate, say, a pop-up help screen, you are required to print a special control string. One such utility uses two CHR$(255) bytes followed by the name of the screen to be displayed.
Although this method is very clumsy when compared to newer products that provide BASIC-linkable object files, it is simpler for the vendor than providing different objects for each supported language. This also allows screens to be displayed from within a batch file using the ECHO command. Therefore, if you need to send data through DOS or the BIOS for whatever reason, you would open and print to the "CON" device, instead of using normal PRINT statements or printing to the "SCRN:" device.
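As a minimal sketch of printing through DOS, the fragment below sends a few common ANSI escape sequences to the "CON" device. It assumes ANSI.SYS has been installed in CONFIG.SYS; without it, the sequences simply appear on the screen as literal characters. The variable name DosCon% is arbitrary.

```basic
' Assumes ANSI.SYS is loaded via CONFIG.SYS; otherwise the escape
' sequences show up on the screen as ordinary characters.
DosCon% = FREEFILE
OPEN "CON" FOR OUTPUT AS #DosCon%
PRINT #DosCon%, CHR$(27); "[2J";          'ANSI: clear the screen
PRINT #DosCon%, CHR$(27); "[1;33m";       'ANSI: bold yellow text
PRINT #DosCon%, "This line was routed through DOS."
PRINT #DosCon%, CHR$(27); "[0m";          'ANSI: restore normal attributes
CLOSE #DosCon%
```

Replacing "CON" with "SCRN:" sends the same bytes through BASIC's own screen handling instead, so ANSI.SYS never sees them.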
One final point worth mentioning is the value of using the same syntax for both files and devices. Many programs let the user specify where a report is to be sent--either to a disk file, a printer, or the screen. Rather than duplicate similar code three times in a program, you can simply assign a string variable to the appropriate device or file name. This is shown in the listing below.
PRINT "Printer, Screen, or File? (P/S/F): ";
DO
  Choice$ = UCASE$(INKEY$)
LOOP UNTIL INSTR(" PSF", Choice$) > 1

IF Choice$ = "P" THEN
  Report$ = "LPT1:"
ELSEIF Choice$ = "S" THEN
  Report$ = "SCRN:"
ELSE
  PRINT
  LINE INPUT "Enter a file name: ", Report$
END IF

OPEN Report$ FOR OUTPUT AS #1
  PRINT #1, Header$
  PRINT #1, SomeStuff$
  PRINT #1, MoreStuff$
  ...
  ...
CLOSE #1
END
Here, the same block of code can be used regardless of where the report is to be sent. The only alternative is to duplicate similar code three times using PRINT statements if the screen was specified, LPRINT if they want the printer, or PRINT # if the report is being sent to a file. Of course, this example could be further expanded to prompt for a printer number (1, 2, or 3) if a printer is specified.
All data is stored on disk as a continuous stream of binary information, regardless of how the file was opened. Even though BASIC and other languages offer a number of different file access methods, all disk files merely contain a series of individual bytes. When you open a file for random access, you are telling BASIC that it is to treat those bytes in a particular manner. In this case, the file is comprised of one or more fixed-length records. Thus, BASIC can perform many of the low level details that help you to organize and maintain that data.
Likewise, opening a file for INPUT tells BASIC that you plan to read variable-length string data. Rather than reading or writing a single block of a given length, BASIC instead knows to continue to read bytes from the file until a terminating comma or carriage return is encountered. However, in both of these cases the disk file is still comprised of a series of bytes, and the access method you specify merely tells BASIC how it is to treat those bytes.
The short program below illustrates this in context, and you can verify that all three files are identical using the DOS COMP utility program.
OPEN "File1" FOR OUTPUT AS #1
  PRINT #1, "Testing"; SPC(13);
CLOSE

OPEN "File2" FOR BINARY AS #1
  Work$ = "Testing" + SPACE$(13)
  PUT #1, , Work$
CLOSE

OPEN "File3" FOR RANDOM AS #1 LEN = 20
  FIELD #1, 20 AS Temp$
  LSET Temp$ = "Testing"
  PUT #1
CLOSE
END
In fact, even executable program files are indistinguishable from data files, other than by their file name extension. Again, it is how you choose to view the file contents that determines the actual form of the data.
Before I explain the various file access methods that BASIC provides, there is one additional low-level detail that needs to be addressed: file buffers. A file buffer is a portion of memory that holds data on its way to and from a disk file, and it is used to speed up file reads and writes.
As you undoubtedly know, accessing a disk drive is one of the slowest operations that occurs on a PC. Because disk drives are mechanical, data being read or written requires a motor that spins the actual disk, as well as a mechanism to move the drive head to the appropriate location on the disk surface. Even if a file is located in contiguous disk clusters, a substantial amount of mechanical activity is required during the course of accessing a large file.
When you open a file for reading, DOS uses a section of memory that it allocated on bootup as a disk buffer. The first time the file is accessed, DOS reads an entire sector into memory, even if your program requests only a few bytes. This way, when your program makes a subsequent read request, DOS can retrieve that data from memory instead of from the disk. This provides an enormous performance boost, since memory can be accessed many times faster than any mechanical disk drive. Without this buffering, even if the next portion of data being read is located in the same sector, the disk drive would have to wait for the disk to spin until that sector arrives at the magnetic read/write head.
When using a floppy disk the time delays are even worse. Once a second or two have passed after accessing a floppy disk, the motor is turned off automatically. Having to then restart it again imposes yet another one or two second delay.
Similarly, when you write data to a file DOS simply stores the data in the buffer, instead of writing it to the disk. When the buffer becomes full (or when you close the file--whichever comes first), DOS writes the entire buffer contents to the disk all at once. Again, this is many times faster than accessing the physical drive every time data is written.
You can control the amount of memory that DOS sets aside for its buffers with a BUFFERS= statement in the PC's CONFIG.SYS file. For each buffer you specify, 512 bytes of memory is taken and made unavailable for other uses. Even though you might think that more buffers will always be faster than fewer, this is not necessarily the case. For each buffer, DOS also maintains a table that shows which disk sectors the buffer currently holds. At some point it can actually take longer for DOS to search through this table than to read the sector from disk. Of course, this time depends on the type of disk (floppy or hard), and the disk's access speed.
Although DOS' use of disk buffers greatly improves file access speed, there is still room for improvement. Each call to DOS to read or write a file takes a finite amount of time, because most DOS services are handled by the same interrupt service routine. Which particular service a program wants is specified in one of the processor's registers, and determining which of the many possible services has been requested takes time.
To further improve disk access performance, BASIC performs additional file buffering using its own routines. Since BASIC's buffers are usually located in near memory, they can also be accessed very quickly; accessing data outside of DGROUP requires additional steps. However, BASIC PDS [and VB/DOS] store file buffers in the same segment used for string variables, so there is slightly less improvement when far strings are being used. When you open a random access file, a block of memory large enough to hold one entire record is set aside in string memory. If a record length is given as part of the OPEN command with LEN =, BASIC uses that for the buffer size. Otherwise, it uses the default size of 128 bytes.
When you open a file for sequential access, BASIC also allocates string memory for a buffer. 512 bytes are used by default, though you can override that with the optional LEN = argument. Specifying a buffer size with non-random files will be discussed later in this chapter.
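The following sketch shows the LEN = argument used with a sequential file. The file name and the 4096-byte buffer size are arbitrary choices made for illustration, not recommended values; the program first creates a small test file so it can be run on its own.

```basic
' BUFTEST.TXT and the 4096-byte buffer are arbitrary illustrative choices.
OPEN "BUFTEST.TXT" FOR OUTPUT AS #1          'default 512-byte buffer
FOR X% = 1 TO 100
  PRINT #1, "Line number"; X%
NEXT
CLOSE #1

OPEN "BUFTEST.TXT" FOR INPUT AS #1 LEN = 4096   'eight sectors' worth
Count% = 0
DO UNTIL EOF(1)
  LINE INPUT #1, Work$
  Count% = Count% + 1
LOOP
CLOSE #1
PRINT Count%; "lines read"
```

The larger buffer is taken from string memory for as long as the file is open, so the extra speed is paid for with memory.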
Note that BASIC PDS does not create a buffer when a file is opened for random access and you are using far strings. If a subsequent FIELD statement is then used, the fielded strings themselves comprise the buffer. Otherwise, BASIC assumes you will be reading the data into a TYPE variable, and avoids the extra buffering altogether. Also, file buffers in a BASIC PDS program are always stored in string memory, which is not necessarily DGROUP. If you are in the QBX environment or have compiled with the /fs far strings option, all file buffers will be stored in the far string data segment.
Although BASIC's additional file buffering does improve your program's speed, it also comes at a cost: the buffers take away from string memory, and the only way to release their memory is to flush their contents to disk by closing the file. DOS offers a service to purge a file's buffers, to ensure that the data will be intact even if the program is terminated abnormally or the power is turned off. Therefore, it is considered good practice to periodically close a file during long data entry sessions. But closing the file and then reopening it after writing each record takes a long time, and more than negates any advantage offered by BASIC's added buffering. [Also, the DOS service that flushes a file's buffers does not flush BASIC's buffers. Any data you have written to disk that is still pending in a BASIC buffer will not be written to the file by this service.]
It is interesting to note that BASIC always closes all open files when a program ends, so it is not strictly necessary to do that manually. I mention this only because you can save a few bytes by eliminating the CLOSE command. Also, DOS flushes its buffers and closes all open files when a program ends, so a few bytes can be saved this way even with non-BASIC programs. Again, I am not necessarily recommending that you do this, and some programmers would no doubt disagree with such advice. But the fact is that an explicit CLOSE is not truly needed.
BASIC offers three fundamental methods for accessing files, and these are specified when the file is opened. There are also several variations and options available with each method, and these will be discussed in more detail in the sections that describe each method.
The first access method is called Sequential, because it requires you to read from or write to the file in a continuous stream. That is, to read the last item in a sequential file you must read all of the items that precede it. There are three different forms of OPEN for accessing sequential files.
OPEN FOR OUTPUT creates the named file if it does not yet exist, or truncates it to a length of zero if it does. Once a file has been opened for output, you may only write data to it.
OPEN FOR APPEND is related to OPEN FOR OUTPUT, and it also tells BASIC to open the file for writing. Unlike OPEN FOR OUTPUT, however, OPEN FOR APPEND does not truncate a file if it already exists. Rather, it opens the file and then seeks to the place just past the last byte. This way, data that is subsequently written will be appended to the end of the file. Note that OPEN FOR APPEND will also create a file if it does not already exist.
OPEN FOR INPUT requires that the named file be present; otherwise, a "File not found" error will result. Once a file has been opened for input, you may only read from it.
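The three sequential forms can be seen side by side in this short sketch. LOG.TXT is a hypothetical file name chosen for the example.

```basic
' LOG.TXT is a hypothetical file name.
OPEN "LOG.TXT" FOR OUTPUT AS #1      'create, or truncate to zero length
PRINT #1, "First line"
CLOSE #1

OPEN "LOG.TXT" FOR APPEND AS #1      'keep the contents, write past the end
PRINT #1, "Second line"
CLOSE #1

OPEN "LOG.TXT" FOR INPUT AS #1       'file must exist; reading only
DO UNTIL EOF(1)
  LINE INPUT #1, Work$
  PRINT Work$
LOOP
CLOSE #1
```

Running it displays both lines, showing that APPEND preserved what OUTPUT had written.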
BASIC also offers the SEEK command to skip to any arbitrary position in the file, and SEEK can in fact be used with sequential files. However, sequential files are generally written using a comma or a carriage return/line feed pair, to indicate the end of each data item. Since each item can be of a varying length, it is difficult if not impossible to determine where in the file a given item begins. That is, if you wanted to read, say, the 200th line in a README file, how could you know where to seek to?
The second primary file access method is Random, and it allows you to read from and write to the file. When you use OPEN FOR RANDOM, BASIC knows that you will be accessing fixed-length blocks of data called records. The advantage of random access is that any record can be accessed by a record number, instead of having to read through the entire file to get to a particular location. That is, you can read or write any record randomly, without regard to where it is in the file. Because each record has the same physical length as every other record, it is easy for BASIC to calculate the location in the file to seek to, based on the desired record number and the fixed record length.
Using random access is ideal for data that is already organized as fixed-length records such as you would find in a name and address database. Since each record contains the same amount of information, there is a natural one-to-one correspondence between the data and the record number in which it resides. For example, the data for customer number 1 would be stored in record number 1, customer 2 is stored in record 2, and so forth.
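The calculation BASIC performs for a record number can be sketched as follows. The Customer type, its fields, and the file name CUST.DAT are all invented for this example. The record is written by number in random mode, then located again by hand in binary mode using the computed byte offset.

```basic
' The Customer layout and CUST.DAT are hypothetical, for illustration only.
TYPE Customer
  FullName AS STRING * 30
  Balance  AS DOUBLE
END TYPE
DIM Cust AS Customer

OPEN "CUST.DAT" FOR RANDOM AS #1 LEN = LEN(Cust)
Cust.FullName = "Customer number 3"
Cust.Balance = 125.5
PUT #1, 3, Cust                      'write record 3 directly
CLOSE #1

'Locate the same record by hand; byte positions in a file start at 1
RecNum& = 3
Offset& = (RecNum& - 1) * LEN(Cust) + 1

OPEN "CUST.DAT" FOR BINARY AS #1
GET #1, Offset&, Cust                'read LEN(Cust) bytes at that offset
CLOSE #1
PRINT RTRIM$(Cust.FullName); Cust.Balance
```

Because every record is LEN(Cust) bytes long, no searching is needed: the offset follows from the record number alone.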
Random access can also be used for text and other document files; however, that is much less common. Although this would let you quickly access any arbitrary line of text in the file, the tradeoff is a considerable waste of disk resources. For each line, space equal to the longest one must be set aside for all of them. In a typical document file line lengths will vary greatly, and it is wasteful to set aside, say, 80 bytes for each line.
The third access method is Binary, which is a hybrid of sequential and random access. A binary file is opened using the OPEN FOR BINARY command, and like random, BASIC lets you both read and write the file. Binary access is most commonly used when the data in the file is neither fixed-length in nature, nor delimited by commas or carriage returns. One example of a binary file is a Lotus 1-2-3 worksheet file. Each cell's contents follow a well-defined format, but varying types of information are interspersed throughout the file.
For example, an 8-byte double-precision number may be followed by a variable length text field, which is in turn followed by the current column width represented as a 2-byte integer. Another example of binary information is the header portion of a dBASE data file. Although the data itself is of a fixed length, a block of data is stored at the beginning of every dBASE data file to indicate the number of fields in each file and their type. [Naturally, the length of this header will vary depending on the number of fields in each record.] An example program to read Lotus worksheet files is given later in this chapter, and a program to read and process dBASE files is shown in Chapter 7.
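To give a feel for mixed binary data, the sketch below writes a double-precision number, a variable-length text field, and an integer column width, then reads them back. The layout is invented for illustration and is not the Lotus or dBASE format; in particular, storing the text length as an integer ahead of the text is my own convention so the reader knows how many bytes to fetch.

```basic
' MIXED.BIN and its layout are invented here; this is not the Lotus
' or dBASE format. The length prefix is this sketch's own convention.
Value# = 3.14159
Text$ = "Column title"
TextLen% = LEN(Text$)
ColWidth% = 12

OPEN "MIXED.BIN" FOR BINARY AS #1
PUT #1, , Value#                 '8-byte double
PUT #1, , TextLen%               '2-byte length of the text that follows
PUT #1, , Text$                  'the text itself, no delimiter
PUT #1, , ColWidth%              '2-byte integer

SEEK #1, 1                       'back to the first byte
GET #1, , NewValue#
GET #1, , NewLen%
NewText$ = SPACE$(NewLen%)       'GET reads as many bytes as the string holds
GET #1, , NewText$
GET #1, , NewWidth%
CLOSE #1

PRINT NewValue#; NewText$; NewWidth%
```

Note that in binary mode a string variable must already be the right length before GET is used, which is why NewText$ is first filled with spaces.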
Note that BASIC imposes its own rules on what you may and may not do with each file access method. This is unfortunate, because DOS itself has no such restrictions. That is, DOS allows you to open a file for output, and then freely read from the same file. To do this with BASIC you must first close the file, and then open it again for input. You can bypass BASIC entirely if you want, to open files and then read and write them. This requires using CALL Interrupt, and examples of doing this will be shown in Chapter 11.
BASIC offers two different forms of the OPEN command. The more common method--and the one I prefer--is as follows:
OPEN FileName$ FOR OUTPUT AS #FileNum [LEN = Length]
Of course, OUTPUT could be replaced with RANDOM, BINARY, INPUT, or APPEND. The other syntax is more cryptic, and it uses a string to specify the file mode. To open a file for output using the second method you'd use this:
OPEN "O", #FileNum, FileName$ [, Length]
The first syntax is available only in QuickBASIC and the other current versions of the BASIC compiler. The second is a holdover from GW-BASIC, and according to Microsoft is maintained solely for compatibility with old programs. The available single-letter mode designators are "O" for output, "I" for input, "R" for random, "A" for append, and "B" for binary. Note that "B" is not supported in GW-BASIC, and was added beginning with QuickBASIC version 4.0.
Besides being more obscure and harder to read, the older syntax does not let you specify the various access and sharing options available in the newer syntax. One advantage of the older method is that you can defer the open mode until the program runs. That is, a string variable can be used to determine how the file will be opened. However, there are few situations I can envision where that would be useful. Of course, the choice is yours, and some programmers continue to use the original version.
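For what it is worth, here is a small sketch of deferring the open mode until run time with the older syntax. The prompt and the file name NOTES.TXT are made up for the example.

```basic
' NOTES.TXT and the prompt are hypothetical; the point is that Mode$
' is not known until the program runs.
INPUT "Append to the existing file (Y/N)"; Answer$
IF UCASE$(LEFT$(Answer$, 1)) = "Y" THEN
  Mode$ = "A"                    'append
ELSE
  Mode$ = "O"                    'output: create or truncate
END IF

OPEN Mode$, #1, "NOTES.TXT"
PRINT #1, "Written at "; TIME$
CLOSE #1
```

The same effect is possible with the newer syntax by placing two OPEN statements inside the IF block, which is arguably easier to read.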
BASIC offers a number of different statements for opening and manipulating files. In a few cases, the same command may have different meanings, depending on how the file is opened. For example, the LEN = argument mentioned earlier assumes a different default value when a file is opened for random access compared to when it is opened for output. Similarly, GET # may or may not accept or require a variable name and optional seek offset, depending on the file mode. Therefore, pay close attention to each statement as it is described in the sections that follow. Specific differences will be listed as they relate to each of the various file access methods.
Before any file or device may be accessed, it must first be opened with BASIC's OPEN statement. When you use OPEN, it is up to you to make up a file number that will be used when you reference the file later. If you use OPEN "MYDATA" FOR OUTPUT AS #1, then you will also use the same file number (1) when you subsequently print to the file. For example, you might use PRINT #1, Any$. Initially, it might appear that letting the programmer determine his or her own file numbers is a feature. After all, you are allowed to make up your own variable names, so why not file numbers too? Indeed, BASIC is rare among the popular languages in this regard; both C and Pascal require that the programmer remember a file number that is given to them.
There are several problems with BASIC's use of file numbers, and in fact DOS does not use this method either. Instead, DOS returns a file handle when a file has been successfully opened. When an assembly language program (or BASIC itself) calls DOS to open a file, it is DOS that issues the number, and not the program. BASIC must therefore maintain a translation table to relate the numbers you give to the actual handles that DOS returns. This table requires memory, and that memory is taken from DGROUP.
But there is another, more severe problem with BASIC's use of file numbers instead of DOS handles, because it is possible that you could accidentally try to open more than one file using the same number. In a small program that opens only one or two files, it is not difficult to remember which file number goes with which file. But when designing reusable subroutines that will be added to more than one program, it is impossible to know ahead of time what file numbers will be in use.
To solve this problem, Microsoft introduced the FREEFILE function with QuickBASIC 4.0. FREEFILE was described in Chapter 4, but it certainly bears a brief mention again here. Each time you use FREEFILE it returns the next available file number, based on which numbers are already taken. Therefore, any subroutine that needs to open a file can use the number FREEFILE returns, confident that the number is not already in use.
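A reusable subprogram built around FREEFILE might look like the sketch below. SaveText and the file names are hypothetical; the main program deliberately keeps file #1 open to show that the subprogram does not collide with it.

```basic
' SaveText and both file names are hypothetical, for illustration only.
DECLARE SUB SaveText (FileName$, Text$)

OPEN "MAIN.DAT" FOR OUTPUT AS #1        'the main program already uses #1
CALL SaveText("NOTE.TXT", "Saved without disturbing file #1")
PRINT #1, "Still open"
CLOSE #1
END

SUB SaveText (FileName$, Text$)
  FileNum% = FREEFILE                   'next number not currently in use
  OPEN FileName$ FOR OUTPUT AS #FileNum%
  PRINT #FileNum%, Text$
  CLOSE #FileNum%
END SUB
```

The subprogram never needs to know which file numbers its caller has chosen.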
Unless you specify otherwise, a file that has been opened for RANDOM or BINARY can be both read from and written to. The ACCESS option of the OPEN statement lets you restrict a random or binary file to reading only or to writing only. Although you may explicitly ask for both READ and WRITE access when the file is opened, read/write permission is the default. In some cases you may need to open a file for binary access, and also prevent your program from later writing to it. In that case you would use the ACCESS READ option.
Likewise, specifying ACCESS WRITE tells BASIC to let your program write to the file, but prevent it from reading. This may seem nonsensical, but one situation in which write-only access might be desirable is when designing a network mail system. In that case it is quite likely that a program would be permitted to send mail to another user's electronic "mailbox", but not be allowed to read the mail contained in that file. The various ACCESS options are intended for use with any version of DOS higher than 2.0.
Frankly, these ACCESS options are pointless, because if you wrote the program then you can control whether the file is read from or written to. If you are writing the Send Mail portion of a network application, then you would disallow reading someone else's mail as part of the program logic. And if you do open a file for ACCESS WRITE, BASIC will generate an error if you later try to read from it. So I personally don't see any real value in using these ACCESS arguments.
The remaining two OPEN options are LOCK and SHARED, and these are meant for use with shared files under DOS 3.0 or later. Shared access is primarily employed on a network, though it is possible to share files on a single computer. This could be the case when a file needs to be accessed by more than one program when running under a task-switching program such as Microsoft Windows.
You can specify that a file is to be shared by simply adding the SHARED clause to the OPEN statement. Thus, another program could both read and write the file, even while it is open in your program. To specify shared access but prevent other programs from writing to the file you would use LOCK WRITE. Similarly, using LOCK READ lets another program write to the file but not read from it, and LOCK READ WRITE prevents both.
The LOCK statement can optionally be used on a shared file that is already open to prohibit another program from accessing it only at certain times. The LOCK statement allows all or just a portion of a file to be locked, and the UNLOCK statement releases the locks that were applied earlier. Please understand that these network operations are described here just as a way to introduce what is possible. Network and database programming will be described in depth in Chapter 7.
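The fragment below is only a sketch of how the ACCESS, SHARED, and LOCK clauses and the LOCK and UNLOCK statements might be written. The file names and the record length of 64 are placeholders I made up, and the files are assumed to exist already in a suitable format.

```basic
'Open read-only, letting other programs read and write the file freely
OPEN "CUSTOMER.DAT" FOR BINARY ACCESS READ SHARED AS #1

'Open for reading and writing, but deny writes by other programs
OPEN "ORDERS.DAT" FOR RANDOM ACCESS READ WRITE LOCK WRITE AS #2 LEN = 64

'Temporarily lock record 5 of a shared random file while updating it
OPEN "STOCK.DAT" FOR RANDOM SHARED AS #3 LEN = 64
LOCK #3, 5
'... GET, modify, and PUT record 5 here ...
UNLOCK #3, 5      'use the same arguments as the matching LOCK

CLOSE             'no file numbers, so every open file is closed
```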
Finally, you close an open file using BASIC's CLOSE command. CLOSE accepts one or more file numbers separated by commas, or no numbers at all which means that every open file is to be closed. You can also use the RESET command to close all currently open files. When a file that has been opened for one of the output modes is closed, its file buffer is flushed to disk and DOS updates the directory entry for that file to indicate the current date and time and new file size. Closing any type of file releases the buffer memory back to BASIC's string memory pool for other uses.
Once a file has been opened you can read from it, write to it, or both, depending on what form of OPEN was used. Any file that has been opened for input may be read from only. Unlike the BASIC-related limitations I mentioned earlier, DOS imposes this restriction, and for obvious reasons. However, when you open a file for output or append, it is BASIC that prevents you from reading back what you wrote. BASIC imposes several other unfortunate limitations regarding what you can and cannot do with an open file, as you will see momentarily.
Sequential access is commonly used with devices as well as with files. Although it is possible to open a printer for random access, there is little point since data is always printed sequentially. Similarly, reading from the keyboard or writing to the screen must be sequential. In the discussions that follow, you can assume that what is said about accessing files also applies to devices, unless otherwise noted.
Data is written to a sequential file using the PRINT # statement, using the same syntax as the normal PRINT statement when printing to the display screen. That is, PRINT # accepts an optional semicolon to suppress a carriage return and line feed from being written to the file, or a comma to indicate that one or more blank spaces is to be written after the data. The number of blanks sent to the file depends on the current print position, just like when printing to the screen.
You can also use the WRITE # statement to print data to a sequential file, but I recommend against using WRITE in most situations. Unlike PRINT, which merely sends the data you give it, WRITE adds surrounding quotes to all string data, which takes time and also additional disk space. Since a subsequent INPUT from the file will just have to remove those quotes, which takes even more time, what's the point? Further, WRITE does not let you specify a trailing semicolon or comma. Although a comma may be used as a delimiter between items written to disk, the comma is stored in the file literally when WRITE is used.
The only time I can see WRITE being useful is for printing data that will be read by a non-BASIC application that explicitly requires this format. Many database and spreadsheet programs let you import comma-delimited data with quoted strings such as WRITE uses. These programs treat each complete line ending with a carriage return as an entire record, and each comma-delimited item within the line as a field in that record. But you should avoid WRITE unless your program really needs to communicate with other such applications, because it results in larger data files and slower performance.
Another use for WRITE is to protect strings that contain commas from being read incorrectly by a subsequent INPUT statement. INPUT uses commas to delimit individual strings, and the quotes allow you to input an entire string with a single INPUT command. But BASIC's LINE INPUT does this anyway, since it reads an entire line of text up to a terminating carriage return. You could also add the quotes manually when needed:
IF INSTR(Work$, ",") THEN
  PRINT #1, CHR$(34); Work$; CHR$(34)
ELSE
  PRINT #1, Work$
END IF
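To make the difference concrete, here is a small sketch that writes the same string with PRINT # and with WRITE #, then reads each file back using INPUT #. The file names are arbitrary and chosen only for this illustration.

```basic
Work$ = "Smith, John"
Age% = 42

OPEN "PRINT.TXT" FOR OUTPUT AS #1
PRINT #1, Work$                'written exactly as given, no quotes
CLOSE #1

OPEN "WRITE.TXT" FOR OUTPUT AS #1
WRITE #1, Work$, Age%          'stored as "Smith, John",42
CLOSE #1

OPEN "PRINT.TXT" FOR INPUT AS #1
INPUT #1, A$                   'stops at the comma, so A$ holds only Smith
CLOSE #1

OPEN "WRITE.TXT" FOR INPUT AS #1
INPUT #1, B$, N%               'the quotes protect the embedded comma
CLOSE #1

PRINT A$
PRINT B$; N%
```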
You may also use TAB and SPC to format the output you print to a file or device. For the most part, TAB and SPC operate like their non-file counterparts, including the need to add an extra empty PRINT to force a carriage return at the end of a line. That is, when you use
PRINT Any$; TAB(20)
or
PRINT #1, SomeVar; SPC(13)
BASIC adds a trailing semicolon whether you want it or not. To force a new line at that point in the printing process requires an additional PRINT or PRINT # statement. This isn't really as much of a nuisance as yet another code bloater, since an empty PRINT adds 9 bytes of compiler-generated code and an empty PRINT # adds 18 bytes.
One important difference between the screen and file versions of TAB and SPC is the way long strings are handled. If you use TAB or SPC in a PRINT statement that is then followed by a string too long to fit on the current line, the screen version will advance to the next row, and print the string at the left edge. This is probably not what you expected or wanted. When printing to a file, however, the string is simply written without regard to the current column. Column 80 is the default width for the screen and printer when they have been opened as devices, though you may change that using WIDTH.
The WIDTH statement lets you specify at which column BASIC is to automatically add a carriage return/line feed pair. The default for a printer is at column 80. In most programming situations this behavior is a nuisance, since many printers can accommodate 132 columns. After all, why shouldn't you be allowed to print what you want when you want, without BASIC intervening to add unexpected and often unwanted extra characters? Most programmers disable this automatic line wrapping by using WIDTH #FileNum, 255 if the printer was opened as a device, or WIDTH LPRINT 255 if using LPRINT statements.
Curiously, this special value is not mentioned anywhere in the otherwise very complete documentation that comes with BASIC PDS. In fact, using a width value of 255 is mandatory if you intend to send binary data to a printer. Most modern printers accept both graphics commands and downloadable fonts. Since either of these will no doubt result in strings longer than 80 or even 255 characters, it is essential that you have a way to disable the "favor" that BASIC does for you. Undoubtedly, the automatic addition of a carriage return and line feed goes back to the early days of primitive printers that required this. The only reason Microsoft continues this behavior is to assure compatibility with programs written using earlier versions of BASIC.
Related to the WIDTH anomaly is BASIC's insistence on adding a CHR$(10) line feed whenever you print a CHR$(13) carriage return to a device. Again, this dubious feature is provided on the assumption that you would always want a line feed after every carriage return. But there are many cases where you wouldn't, such as the font and graphics examples mentioned earlier. If you add the "BIN" (binary) option when opening a printer, you can prevent BASIC from forcing a new line every 80 columns, and also suppress the addition of a line feed following each carriage return. For example, OPEN "LPT1:BIN" FOR OUTPUT AS #1 tells BASIC to open the first parallel printer in binary mode.
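The three variants discussed above might look like this in practice. This is only a sketch, and it assumes a printer is attached to LPT1; the text being printed is arbitrary.

```basic
'Variant 1: open the printer as a device and disable line wrapping
OPEN "LPT1:" FOR OUTPUT AS #1
WIDTH #1, 255
PRINT #1, STRING$(120, "*")    'sent as a single 120-character line
CLOSE #1

'Variant 2: the same idea when using LPRINT statements
WIDTH LPRINT 255
LPRINT STRING$(120, "*")

'Variant 3: binary mode, so no line feed is added after CHR$(13)
OPEN "LPT1:BIN" FOR OUTPUT AS #1
PRINT #1, "Overstrike"; CHR$(13); "__________";
CLOSE #1
```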
The PRINT # USING statement lets you send formatted numeric data to a file, in the same way you would use the regular PRINT USING to format numbers on the screen. PRINT # USING accepts the same set of formatting commands as PRINT USING, allowing you to mix text and formatted numbers in a single PRINT operation. If your program will be printing formatted reports from the disk file later, I recommend using PRINT USING at that time, instead of when writing the data to disk. Otherwise, the extra spaces and other formatting information are added to the file increasing its size. In fact, PRINT # USING is really most appropriate when printing to a device such as a printer.
Finally, it is important to point out the importance of selecting a suitable buffer size. As I described earlier, BASIC and DOS employ an area of memory as a buffer to hold information on its way to and from disk. This way information can often be written to or read from memory, instead of having to access the physical disk each time. Besides the buffers that DOS maintains, BASIC provides additional buffering when your program is using sequential input or output.
BASIC lets you control the size of this buffer, using the LEN = option of the OPEN statement. In general, the larger you make the buffer, the faster your programs will read and write files. The trade-off, however, is that BASIC's buffers are stored in string memory. With QuickBASIC and near strings in BASIC PDS, the buffer is located in DGROUP. When BASIC PDS far strings are used, the buffer is in the same segment that the current module uses for string storage.
Conversely, you can actually reduce the default buffer size when string space is at a premium, but at the expense of disk access speed. When using OPEN FOR INPUT and OPEN FOR OUTPUT, BASIC sets aside 512 bytes of string memory for the buffer, unless you specify otherwise. If you have many sequential files open at once you could reduce the buffer sizes to 128 bytes, for a net savings of 384 bytes for each file. The legal range of values for LEN = is between 1 and 32767 bytes.
Notice that the best buffer values will be a multiple of a power of two, and when increasing the buffer size, a multiple of 512. Since a disk sector is almost always 512 bytes, DOS will fill the buffer with an entire sector. In fact, DOS always reads and writes entire sectors anyway. If you use a buffer size of, say, 600 bytes, DOS will have to read 1024 bytes just to get the first portion of the second sector. But when more data is needed later, BASIC will then have to go back and ask DOS for the same information again. By reading entire sectors or evenly divisible portions of a sector, you can avoid having BASIC and DOS read the same information more than once.
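A rough way to try different buffer sizes on your own system is sketched below. BIGFILE.TXT is a placeholder for any large text file you have on hand, and the two sizes tested are arbitrary. Keep in mind that any disk caching software may favor the second pass, so treat the numbers as a guide only.

```basic
DEFINT A-Z
FileName$ = "BIGFILE.TXT"          'hypothetical: any large text file

FOR Pass = 1 TO 2
  IF Pass = 1 THEN BufSize = 512 ELSE BufSize = 4096
  Start! = TIMER
  OPEN FileName$ FOR INPUT AS #1 LEN = BufSize   'LEN = sets the buffer size
  DO UNTIL EOF(1)
    LINE INPUT #1, Work$           'read and discard each line
  LOOP
  CLOSE #1
  PRINT "Buffer size"; BufSize; "took"; TIMER - Start!; "seconds"
NEXT
```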
Even though larger buffers usually translate to better performance, you will eventually reach the point of diminishing returns, beyond which little performance improvement will result. Table 6-1 shows the timing results with various buffer sizes when reading a 104K BASIC source file using LINE INPUT. Understand that this test is informal, and merely shows the results obtained using only one PC. In particular, the hard disk results are for a fairly fast (17 millisecond) 150 MB ESDI drive and a PC equipped with a 25 MHz 386. Therefore, the improvement from a larger buffer is less than you would get on a slower computer with a slower hard disk or with a floppy disk. Many older XT and AT compatible PCs will probably fall somewhere between the results shown here for the hard and floppy disks. Notice that while performance actually seems slightly worse after some increases, this can be attributed to the lack of resolution in the PC's system timer.
| Buffer Size (in bytes) | Hard Disk (seconds) | Floppy Disk (seconds) |
|---|---|---|
| 64 | 2.699 | 45.260 |
| 128 | 2.420 | 45.141 |
| 256 | 2.410 | 45.148 |
| 512 | 2.420 | 45.150 |
| 1024 | 2.311 | 27.180 |
| 2048 | 2.139 | 18.180 |
| 4096 | 2.201 | 13.570 |
| 8192 | 2.080 | 11.650 |
| 16384 | 2.039 | 11.371 |
It is important to point out that a buffer is created only for sequential input and output, and also for random files with QuickBASIC. Opening a file for random access with BASIC PDS [and I'll presume VB/DOS] does not create a buffer, nor does opening a file for binary with either version. Further, with random access files a buffer is created by QuickBASIC only when FIELD is used, and the buffer is located within the actual fielded strings. Therefore, the LEN = argument in an OPEN FOR RANDOM statement merely tells BASIC how to calculate record offsets when SEEK and GET are used.
Sequential data is read using INPUT #, LINE INPUT #, or INPUT$ #. Like the console form of INPUT, INPUT # can be used to read one or more variables of any type and in any order with a single statement. When reading a file, INPUT # recognizes both the comma and the carriage return as a valid delimiter, to indicate the end of one variable. This is in contrast to the regular [keyboard] version of INPUT, which issues a "Redo from start" error if the wrong number of comma-delimited variables are entered. Instead, INPUT # simply moves on to the next line for the remaining variables.
LINE INPUT # avoids this entirely, and simply reads an entire string without regard to commas until a carriage return is encountered. This precludes LINE INPUT # from being used with anything but string variables. However, LINE INPUT # can be used with fixed- as well as variable-length strings, without the overhead of copying from one type to the other that BASIC usually adds. [This copying was described in Chapter 2.] As with INPUT #, LINE INPUT # strips leading and trailing quotes from the line if they are present in the file.
The last method for reading a sequential file or device is with the INPUT$ # function. INPUT$ # is used to read a specified number of characters, without regard to their meaning. Where commas and carriage returns are normally used to delimit each line of text, INPUT$ returns them as part of the string. INPUT$ # accepts two arguments--the number of characters to read and the file number--and assigns them to the specified string. To read, say, 20 bytes from a sequential file that has been opened as #3, you would use Any$ = INPUT$(20, #3). Although the pound sign (#) is optional, I prefer to include it to avoid confusion as to which parameter is the file number and which is the number of bytes.
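The following sketch contrasts the three input methods on a small file it creates itself. The file name and its contents are invented for this example.

```basic
FileNum% = FREEFILE
OPEN "SAMPLE.TXT" FOR OUTPUT AS #FileNum%
PRINT #FileNum%, "Smith,John,42"
PRINT #FileNum%, "A full line, commas and all"
PRINT #FileNum%, "ABCDEFGHIJ"
CLOSE #FileNum%

OPEN "SAMPLE.TXT" FOR INPUT AS #FileNum%
INPUT #FileNum%, Last$, First$, Age%   'three comma-delimited items
LINE INPUT #FileNum%, Whole$           'everything up to the carriage return
Chunk$ = INPUT$(5, #FileNum%)          'the next five characters, as is
CLOSE #FileNum%

PRINT Last$; " / "; First$; " /"; Age%
PRINT Whole$
PRINT Chunk$
```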
As with sequential output, specifying a larger buffer size than the default 512 bytes can greatly improve the speed of INPUT # and LINE INPUT # statements, but at the expense of string memory.
Unlike sequential files that are almost always read starting at the beginning, data in a random access file can be accessed literally in any arbitrary order. Random access files are comprised of fixed-length records, and each record contains one or more fields. The most common application of random access techniques is in database programs, where each record holds the same type of information as the next. For example, a customer name and address database is comprised of a first name, a last name, a street address, city, state, and zip code. Even though different names and addresses will be stored in different records, the format and length of the information in each record is identical.
BASIC provides two different ways to handle random access files: the FIELD statement and TYPE variables. Before QuickBASIC version 4.0, the FIELD method was the only way to define the structure of a random access data file. Although Microsoft has publicly stated that FIELD is provided in current versions of BASIC only for compatibility with older programs, it has several important properties that cannot be duplicated in any other way. FIELD also lets you perform some interesting and non-obvious tricks that have nothing to do with reading or writing files. These are described later in this chapter in the section Advanced File Techniques.
Once a file has been opened for RANDOM you may use the FIELD statement by specifying one or more string variables to hold each field, along with their length. A typical example showing the syntax for the FIELD statement is as follows:
OPEN FileName$ FOR RANDOM AS #1 LEN = 97
FIELD #1, 17 AS LastName$, 14 AS FirstName$, 32 AS Address$, 15 AS City$, _
  2 AS State$, 9 AS Zip$, 8 AS BalanceDue$
Here, the file is opened for random access, and the record length is established as being 97 characters. This allows room for each of the fields in the FIELD statement. In this case 17 characters are set aside for the last name, 14 for the first name, 32 for the street address, 15 for the city, 2 for the state, 9 for the zip code, and 8 for the double precision balance due value. I often use a field length of 32 characters for name and address data, because that's how many can fit comfortably on a standard 3-1/2 by 15/16 inch mailing label. (The first and last names above add up to 32 characters, including a separating blank space.)
Note that the underscore shown above is used here as a line continuation character, and you'd actually type the entire statement as one long line. In fact, in most cases a FIELD statement must be able to fit entirely on a single line, and there is no direct way to continue the list of variables. Although the BC compiler recognizes an underscore to continue a line as shown here, the BASIC environment does not. Underscores in a source file are removed by the BASIC editor when the file is loaded, and the lines are then combined.
If a second FIELD statement for the same file number is given on a separate line, the additional strings specified are placed starting at the beginning of the same buffer. While it is possible to coerce a new FIELD statement to begin farther into the buffer, that requires an additional dummy string variable:
FIELD #1, 17 AS LastName$, 14 AS FirstName$
FIELD #1, 31 AS Dummy$, 32 AS Address$, 15 AS City$
FIELD #1, 78 AS Dummy2$, 2 AS State$, 9 AS Zip$
Here, the dummy strings are used as placeholders to force the Address$ and State$ variables farther into the buffer, and you would not refer to the dummy strings in your program.
Once a field buffer has been defined, special precautions are needed when assigning and reading the fielded string variables. As you know, BASIC often moves strings around in memory when they are assigned. However, that would be fatal if those strings are in a field buffer. A field buffer is written to disk all at once when you use PUT, and it is essential that all of the strings therein be contiguous. If you simply assign a variable that is part of a field buffer, BASIC may move the string data to a new location outside of the buffer and your program will fail.
To avoid this problem you must assign fielded strings using either LSET, RSET, or the statement form of MID$. These BASIC commands let you insert characters into a string, so BASIC will not have to claim new string memory. This further contributes to FIELD's complexity, and it also adds slightly to the amount of code needed for each assignment. For example, the statement One$ = Two$ generates 13 bytes of compiled code, and the statement LSET One$ = Two$ creates 17. Although LSET is generally faster than a direct assignment, it is important to understand that it also creates more code. But the situation gets even worse.
Because all of the variables in a field buffer must be strings, additional steps are needed to assign numeric variables such as integer and double precision. The CVI and MKS$ family of BASIC functions are needed to convert numeric data to their equivalent in string form and back. There are eight of these functions in QuickBASIC with two each for integer, long integer, single precision, and double precision variables. BASIC PDS adds two more to support the Currency data type. All of the various conversion functions have names that start with the letters MK or CV, and a complete list can be found in your BASIC manual.
To convert a double precision variable to equivalent data in an 8-byte string you would use MKD$, and to convert a 2-byte string that holds an integer to an actual integer value you would use CVI. MKD$ stands for "Make Double into a string" and it has a dollar sign to show that it returns a string. CVI stands for "Convert to Integer" and the absence of a dollar sign shows that it returns a numeric value. Combined with the requisite LSET, a complete assignment prior to writing a record to disk with PUT would be something like this: LSET BalanceDue$ = MKD$(BalDue#). And if a record has just been read using GET, an integer value in the field buffer could be retrieved using code such as MyInt% = CVI(IntVar$).
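Putting these pieces together, a complete round trip through a field buffer could be sketched as follows. The file name, the 10-byte record layout, and the variable names are assumptions made for this example.

```basic
OPEN "BALANCE.DAT" FOR RANDOM AS #1 LEN = 10
FIELD #1, 2 AS Count$, 8 AS BalanceDue$

Items% = 3
BalDue# = 4029.8
LSET Count$ = MKI$(Items%)          'integer to 2-byte string
LSET BalanceDue$ = MKD$(BalDue#)    'double precision to 8-byte string
PUT #1, 1                           'write the field buffer as record 1

GET #1, 1                           'read record 1 back into the buffer
PRINT CVI(Count$); CVD(BalanceDue$) 'convert the strings back to numbers
CLOSE #1
```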
The need for LSET, RSET, CVI, and MKS$ and so forth has historically made learning random access file techniques one of the most difficult and messy aspects of BASIC programming. Besides having to learn all of the statements and how they are used, you also need to understand how many bytes each numeric data type occupies to set aside the correct amount of space in the field buffer. Further, a lot of compiled code is created to convert large amounts of data between numeric and string form. For these and other reasons, Microsoft introduced the TYPE variable with its release of QuickBASIC 4.0.
The TYPE method allows you to establish a record's structure by defining a custom variable that contains individual components for each field in the record. In general, using TYPE is a much clearer way to define a record, and it also avoids the added library code to handle the FIELD, LSET, CVI, and MKS$ statements. When you use AS INTEGER and AS DOUBLE and so forth to define each portion of the TYPE, the correct number of bytes are allocated to store the value in its native fixed-length format. This avoids having to convert the data to and from ASCII digits.
Using the earlier example, here's how you would define and assign the same record using a TYPE variable:
TYPE Record
  LastName AS STRING * 17
  FirstName AS STRING * 14
  Address AS STRING * 32
  City AS STRING * 15
  State AS STRING * 2
  Zip AS STRING * 9
  BalanceDue AS DOUBLE
END TYPE

DIM MyRecord AS Record

MyRecord.LastName = LastName$
MyRecord.FirstName = FirstName$
MyRecord.Address = Address$
MyRecord.City = City$
MyRecord.State = State$
MyRecord.Zip = Zip$
MyRecord.BalanceDue = BalanceDue#
Even though the same names are used for both the TYPE variable members and the strings they are being assigned from, you may of course use any names you want. You could also assign the portions of a TYPE variable from constants using MyRecord.Zip = "06896" or MyRecord.BalanceDue = 4029.80. Further, one entire TYPE variable may be assigned to another in a single operation using ThisType = ThatType. Dissimilar TYPE variables may be assigned using LSET like this: LSET MyType = YourType.
As you can see, using TYPE variables instead of FIELD yields an enormous improvement in a program's clarity. However, there are still some programming problems that only FIELD can solve. One limitation of using TYPE variables is that the file structure must be known when the program is compiled, and you cannot defer this until runtime. Therefore, it is impossible to design a general purpose database program, in which a single program can manipulate any number of differently structured files. The compiler needs to know the length and type of data within a TYPE variable, in order to access the data it contains. So while you can use a variable as the LEN = argument with OPEN, the record structure itself must remain fixed.
FIELD avoids that limitation because it accepts a variable number of arguments, and varying lengths within each field component. Therefore, by dimensioning a string array to the number of elements needed for a given record, the entire process of opening, fielding, reading, and writing can be handled using variables whose contents and type are determined at runtime. Some amount of IF testing will of course be required when the program runs, but at least it's possible to process a file using variable information.
The following complete program first creates a random access file with five slightly different records using a TYPE variable. It then reads the file independently of the TYPE structure using the FIELD method. Although the second portion of the program uses DATA statements to define the file's structure, in practice this information would be read from disk. In fact, this is the method used by dBASE and Clipper files, based on the field information that is stored in a header portion of the data file.
'----- create a data file containing five records
DEFINT A-Z
TYPE MyType
  FirstName AS STRING * 17
  LastName AS STRING * 14
  DblValue AS DOUBLE
  IntValue AS INTEGER
  MiscStuff AS STRING * 20
  SngValue AS SINGLE
END TYPE
DIM MyVar AS MyType
OPEN "MYFILE.DAT" FOR RANDOM AS #1 LEN = 65
MyVar.FirstName = "Jonathan"
MyVar.LastName = "Smith"
MyVar.DblValue = 123456.7
MyVar.IntValue = 10
MyVar.MiscStuff = "Miscellaneous stuff"
MyVar.SngValue = 14.29
FOR X = 1 TO 5
  PUT #1, , MyVar
  MyVar.DblValue = MyVar.DblValue * 2
  MyVar.IntValue = MyVar.IntValue * 2
  MyVar.SngValue = MyVar.SngValue * 2
NEXT
CLOSE #1
'----- read the data without regard to the TYPE above
READ FileName$, NumFields
REDIM Buffer$(1 TO NumFields) 'holds the FIELD strings
REDIM FieldType(1 TO NumFields) 'the array of data types
RecLength = 0
FOR X = 1 TO NumFields
  READ ThisType
  FieldType(X) = ThisType
  RecLength = RecLength + ABS(ThisType)
NEXT
OPEN FileName$ FOR RANDOM AS #1 LEN = RecLength
PadLength = 0
FOR X = 1 TO NumFields
  ThisLength = ABS(FieldType(X))
  FIELD #1, PadLength AS Pad$, ThisLength AS Buffer$(X)
  PadLength = PadLength + ThisLength
NEXT
NumRecs = LOF(1) \ RecLength 'calc number of records
FOR X = 1 TO NumRecs                  'read each in sequence
  GET #1                              'get the current record
  CLS
  FOR Y = 1 TO NumFields              'walk through each field
    PRINT "Field"; Y; TAB(15);        'display each field
    SELECT CASE FieldType(Y)          'see what type of data
      CASE -8                         'double precision
        PRINT CVD(Buffer$(Y))         'so use CVD
      CASE -4                         'single precision
        PRINT CVS(Buffer$(Y))         'as above
      CASE -2                         'integer
        PRINT CVI(Buffer$(Y))
      CASE ELSE                       'string
        PRINT Buffer$(Y)
    END SELECT
  NEXT
  LOCATE 20, 1
  PRINT "Press a key to view the next record ";
  WHILE LEN(INKEY$) = 0: WEND
NEXT
CLOSE #1
END
DATA MYFILE.DAT, 6
DATA 17, 14, -8, -2, 20, -4
There are several issues that need elaboration in this program. First is the use of arrays to hold the fielded string data and also each field's type. When the field buffer is defined with an array, the same variable name can be used repeatedly in a loop. A parallel array that holds the field data types permits the program to relate the field data to its corresponding type of data. That is, Buffer$(3) holds the data for field 3, and FieldType(3) indicates what type of data it is.
Second, the FieldType array uses a simple coding method that combines both the data type and its length into a single value. That is, positive values are used to indicate string data, and the value itself is the field length. Negative values reflect the data type as well as the length, using a negative version of that data type's length. Specifically, -8 is used to indicate a double precision field type, -4 a single precision type, and -2 an integer. If you need to handle long integers or the BASIC PDS Currency data type, you'll need to devise a slightly different method. I chose this one because it is simple and effective.
The final point worth mentioning when comparing FIELD to TYPE is that the field buffer is relinquished back to BASIC's string pool when the file is closed. But when a TYPE variable is dimensioned, the near memory it occupies is allocated by the compiler, and is never available for other uses. Although there is a solution, it requires some slight trickery. The statement REDIM TypeVar(1 TO 1) AS TypeName will create a 1-element TYPE array in far memory that can then be used as if it were a single TYPE variable. That is, any place you would have used the TYPE variable, simply substitute the sole element in the array.
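A sketch of this trick follows. The TYPE layout and the file name are made up for the example, and the ERASE at the end is my own addition to show the memory being released again.

```basic
TYPE Record
  LastName AS STRING * 17
  BalanceDue AS DOUBLE
END TYPE

REDIM MyRecord(1 TO 1) AS Record    '1-element dynamic TYPE array

MyRecord(1).LastName = "Smith"      'use element 1 wherever the plain
MyRecord(1).BalanceDue = 4029.8     '  TYPE variable would have appeared

OPEN "ACCOUNTS.DAT" FOR RANDOM AS #1 LEN = 25    '17 + 8 bytes per record
PUT #1, 1, MyRecord(1)
CLOSE #1

ERASE MyRecord                      'release the array's memory when done
```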
Understand that more code is required to access data in a dynamic array than in a static variable. For example, an integer assignment to a member of a dynamic TYPE array generates 17 bytes of code, compared to only 6 bytes for the same operation on a static TYPE. But when string space is more important than .EXE file size, this trick can make the difference between a program that runs and one that doesn't.
Regardless of which method you use--TYPE or FIELD--there are several additional points to be aware of. First, the PUT # and GET # statements are used to write and read a random access file respectively. PUT # and GET # accept two different forms, depending on whether you are using TYPE or FIELD to define the record structure.
When FIELD is used, PUT # and GET # may be used with either no argument to access the current record, or with an optional record number argument. That is, PUT #1 writes the current field buffer contents to disk at the current DOS SEEK position, and GET #1, RecNum reads record number RecNum into the buffer for subsequent access by your program.
As with sequential files, each time a record is read or written, DOS advances its internal seek location to the next successive position in the file. Therefore, to read a group of records in forward order does not require a record number, nor does writing them in that order. In fact, slightly more time is required to access a record when a record number is given but not needed, because BASIC makes a separate call to perform an explicit Seek to that location in the file.
When the TYPE method is used to access random access data, the record number is also optional, but you must provide the name of a TYPE variable or TYPE array element. In this case, the record number is still used as the first argument, and the TYPE variable is the second argument. If you omit the record number you must include an empty comma placeholder. For example, PUT #1, RecNum, TypeVar writes the contents of TypeVar to the file at record number RecNum, and GET #1, , TypeArray(X) reads the current record into TYPE array element X.
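Both forms can be seen side by side in the sketch below, which writes three records in order and then reads two of them back. The Customer TYPE and the file name are hypothetical.

```basic
TYPE Customer
  CustName AS STRING * 20
  Balance AS DOUBLE
END TYPE
DIM Cust AS Customer

OPEN "CUST.DAT" FOR RANDOM AS #1 LEN = 28    '20 + 8 bytes per record

FOR X% = 1 TO 3                      'records are written in order,
  Cust.CustName = "Customer" + STR$(X%)
  Cust.Balance = X% * 100
  PUT #1, , Cust                     '  so no record number is needed
NEXT

GET #1, 2, Cust                      'jump directly to record 2
PRINT Cust.CustName; Cust.Balance
GET #1, , Cust                       'no number: reads the next record
PRINT Cust.CustName; Cust.Balance
CLOSE #1
```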
It is not essential that the TYPE variable be as long as the record length specified when LEN = was used with OPEN, but it generally should be. When a record number is given with PUT # or GET #, BASIC uses the original LEN = value to know where to seek to in the file. If a record number is omitted, BASIC will still advance to the next complete record even if the TYPE variable being read or written is shorter than the stated record length. In most cases, however, you should use a TYPE whose length corresponds to the LEN = argument unless you have a good reason not to.
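As a minimal sketch of the two PUT # and GET # forms described above, the fragment below writes three records in order without a record number, and then reads two of them back. The CustRec layout, its field names, and the file name CUST.DAT are hypothetical and are used here only for illustration.

```basic
DEFINT A-Z
TYPE CustRec                   'hypothetical record layout
  Id AS INTEGER
  Balance AS DOUBLE
  CustName AS STRING * 20
END TYPE
DIM Cust AS CustRec

OPEN "CUST.DAT" FOR RANDOM AS #1 LEN = 30   '2 + 8 + 20 bytes

FOR X = 1 TO 3                 'write three records in order
  Cust.Id = X
  Cust.Balance = X * 100
  Cust.CustName = "Customer" + STR$(X)
  PUT #1, , Cust               'no record number: next position
NEXT

GET #1, 2, Cust                'explicit record number
PRINT Cust.Id; Cust.Balance; Cust.CustName
GET #1, , Cust                 'record 3 follows automatically
PRINT Cust.Id; Cust.Balance; Cust.CustName
CLOSE #1
```

The empty comma placeholder is what tells BASIC that the TYPE variable is the third argument and not a record number.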
Notice that when LEN = is omitted, BASIC defaults to a record length of 128 bytes. Indeed, forgetting to include the length can lead to some interesting surprises. One clever trick that avoids having to calculate the record length manually is to use BASIC's LEN function. Although earlier versions of BASIC allowed LEN only in conjunction with string variables, QuickBASIC 4.0 and later versions recognize LEN for any type of data.
For example, LEN(IntVar%) is always 2, and LEN(AnyDouble#) is always equal to 8. When LEN is used this way the compiler merely substitutes the appropriate numeric constant when it builds your program. Since LEN can also be used with TYPE variables and TYPE array elements, you can let BASIC do the byte counting for you. The brief program fragment below shows this in context.
TYPE Something
  X AS INTEGER
  Y AS DOUBLE
  Z AS STRING * 100
END TYPE
DIM Anything AS Something
OPEN MyData$ FOR RANDOM AS #1 LEN = LEN(Anything)
In particular, this method is useful if you later modify the TYPE definition, since the program will be self-accommodating. Changing Z to STRING * 102 will also change the value used as the LEN = argument to OPEN. Be careful to use the actual variable name with LEN, and not the TYPE name itself. That is, LEN(Anything) will equal 110, but LEN(Something) will be 2 if DEFINT is in effect. When BASIC sees LEN(Something) it assumes you are referring to a variable with that name, not the TYPE definition.
The only time this use of LEN will be detrimental is when it is used as a passed parameter many times in a program. Since LEN is treated in this case as a numeric constant, it is subject to the same copying issues that CONST values and literal numbers are. Therefore, you would probably want to assign a variable once from the value that LEN returns, and use that variable repeatedly later as described in Chapter 2.
Binary file access lets you read or write any portion of a file, and manipulate any type of information. Reading a sequential file requires that the end of each data item be identified by a comma, or a carriage return line feed pair. Random access files do not require special delimiters, and instead rely on a fixed record length to know where each record's data starts and ends. A binary file may be organized in any arbitrary manner; however, it is up to the programmer to devise a method for determining what goes where in the file.
The overwhelming advantage of binary over sequential access is the enormous space and speed savings. A file that requires extra carriage returns or commas will be larger than one that does not. Moreover, numeric data in a binary file is stored in its native fixed-length format, instead of as a string of ASCII digits. Therefore, the integer value -32700 will occupy only two bytes, as opposed to the seven needed for the digits plus either a comma or carriage return and line feed.
Furthermore, converting between numbers and their ASCII representation is one of the slowest operations in BASIC. Because the STR$ and VAL functions must be able to operate on floating point numbers and perform rounding, they are extremely slow. For example, VAL must examine the digits in a string for many special characters such as "e", "d", "&H", and so forth. And with the statement IntVar% = VAL("1234.56"), VAL must also round the value to 1235 before assigning the result to IntVar%. Even if you don't use STR$ or VAL explicitly when reading or writing a file, BASIC does internally. That is, the statement PRINT #1, D# is compiled as if you used PRINT #1, STR$(D#). Likewise, INPUT #1, IntVar% is compiled the same as INPUT #1, Temp$: IntVar% = VAL(Temp$).
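One way to see the size difference for yourself is to write the same integer both ways and compare the resulting file lengths. This is only a sketch; the temporary file names SEQ.TMP and BIN.TMP are assumptions, and the files are deleted when the test is finished.

```basic
DEFINT A-Z
Value = -32700

OPEN "SEQ.TMP" FOR OUTPUT AS #1
PRINT #1, Value              'ASCII digits plus a CR/LF pair
CLOSE #1

OPEN "BIN.TMP" FOR BINARY AS #1
PUT #1, , Value              'two bytes in native format
CLOSE #1

OPEN "SEQ.TMP" FOR BINARY AS #1
PRINT "Sequential:"; LOF(1); "bytes"
CLOSE #1
OPEN "BIN.TMP" FOR BINARY AS #1
PRINT "Binary:"; LOF(1); "bytes"
CLOSE #1

KILL "SEQ.TMP"               'remove the temporary files
KILL "BIN.TMP"
```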
When a file has been opened for binary access you may not use PRINT #, WRITE #, or PRINT # USING. The only statement that can write data to a binary file is PUT #. PUT # may be used with any type of variable, but not constants or expressions. That is, you can use PUT #1, , AnyVar, but not PUT #1, , 13 or PUT #1, SeekLoc, X + Y! or PUT #1, , LEFT$(Work$, 10). This is yet another unnecessary BASIC limitation, which means that to write a constant you must first assign it to a temporary variable, and then use PUT specifying that variable.
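The workaround looks like this in practice. The file name TEST.BIN and the variable names are hypothetical; the point is only that every value passes through a variable before it reaches PUT #.

```basic
DEFINT A-Z
OPEN "TEST.BIN" FOR BINARY AS #1

Temp = 13                    'PUT #1, , 13 is not allowed
PUT #1, , Temp

Work$ = "A longer string of text"
Temp$ = LEFT$(Work$, 10)     'nor is PUT #1, , LEFT$(Work$, 10)
PUT #1, , Temp$

CLOSE #1
```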
Reading from a binary file requires GET #, which is the complement of PUT #. Like PUT #, GET # may be used with any kind of variable, including TYPE variables. When a string variable is written to disk with PUT #, the entire string is sent. However, when a string variable is used with GET #, BASIC reads only as many bytes as will fit into the target string. So to read, say, 20 bytes into a string from a binary file you would use this:
Temp$ = SPACE$(20) 'make room for 20 bytes
GET #FileNum, , Temp$ 'read all 20 bytes
Although fixed-length strings cannot be cleared to relinquish the memory they occupied, they are equally valid for reading data from a binary file:
DIM FLen AS STRING * 20
GET #FileNum, , FLen
You can also use INPUT$ to read a specified number of bytes from a binary file. Therefore you can replace both examples above with the statement Temp$ = INPUT$(20, #FileNum). Contrary to some versions of Microsoft BASIC documentation, PUT # does not store the length of the string in a binary file prior to writing the data as it does with files opened for RANDOM.
As you've seen, data is written to a binary file using the PUT # command, and read using GET #. These work much like their random access counterparts in that a seek offset is optional, and if omitted must be replaced with an empty comma placeholder. But where the seek argument in a random GET # or PUT # specifies a record number, a binary GET # treats it as a byte offset into the file.
The first byte in a binary file is considered by BASIC to be byte number 1. This is important to point out now, because DOS considers the first byte to be numbered 0. When we discuss using CALL Interrupt to access files in Chapter 11, you will need to take this difference into account.
When reading and writing binary files, BASIC always uses the length of the specified variable to know how many bytes to read or write. The statement GET #1, , IntVar% reads two bytes at the current DOS seek location into the integer variable IntVar%, and PUT #1, 1000, LongVar# writes the contents of LongVar# (eight bytes) to the file starting at the 1000th byte. Let's now take a look at a practical application of binary file techniques.
Rather than invent a binary file format as an example, I will instead use the Lotus 1-2-3 file structure to illustrate the effective use of binary access. Although it is possible to skip around in a binary file and read its data in any arbitrary order, a Lotus worksheet file is intended to be read sequentially. Each data item is preceded by an integer code that indicates the type and length of the data that follows. Note that the same format is used by Lotus 1-2-3 versions 1 and 2, and also Lotus Symphony. Newer versions of 1-2-3 that support three-dimensional worksheets use a different format that this program will not accommodate.
A Lotus spreadsheet can contain as many as 63 different kinds of data. However, we will concern ourselves with only those that are of general interest such as cell contents and simple formatting commands. These are Beginning of File, End of File, Integer values, Floating point values, Text labels and their format, and the double precision values embedded within a Formula record. The format used by the actual formulas is quite complex, and will not be addressed. Other records that will not be covered here are those that pertain to the structure of the worksheet itself, such as range names, printer setup strings, macro definitions, and so forth. You can get complete information on the Lotus file structure as well as other standard formats in Jeff Walden's excellent book, File Formats for Popular PC Software (Wiley Press, ISBN 0-471-83671-0). [Unfortunately that book is now out of print. But you may be able to get this information from Lotus directly.]
A Lotus file is comprised of individual records, and each record may have a varying length. The length of a record depends on its type and contents, and most records contain a fixed-length header which describes the information that follows. Regardless of the type of record being considered, each follows the same format: an operation code (opcode), the data length, and the data itself.
The opcode is always a two-byte integer which identifies the type of data that will follow. For example, an opcode of 15 indicates that the data in the record will be treated by 1-2-3 as a text label. The length is also an integer, and it holds the number of bytes in the Data section (the actual text) that follows.
All of the records that pertain to a spreadsheet cell contain a five-byte header at the beginning of the data section. These five bytes are included as part of the data's length word. The first header byte contains the formatting information, such as the number of decimal positions to display. The next two bytes together contain the cell's row as an integer, and the following two bytes hold the cell's column.
Again, this header is present only in records that refer to a cell's contents. For example, the Beginning of File and End of File records do not contain a header, nor do those records that describe the worksheet. Some records such as labels and formulas will have a varying length, while those that contain numbers will be fixed, depending on the type of number. Floating point values are always eight bytes long, and are in the same IEEE format used by BASIC. Likewise, an integer value will always have a length of two bytes. Because the length word includes the five-byte header size, the total length for these double precision and integer examples is 13 and 7 respectively.
It is important to understand that in a Lotus worksheet file, rows and columns are based at zero. Even though 1-2-3 considers the topmost row to be number 1, it is stored in the file as a zero. Likewise, the first column as displayed by 1-2-3 is labelled "A", but is identified in the file as column 0. Thus, it is up to your program to take that into account as it translates the columns to the alphabetic format, if you intend to display them as Lotus does.
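The translation from a zero-based column number to Lotus-style letters can be sketched as follows. ColName$ is a hypothetical helper that is not part of the program in this chapter, and it assumes the 256-column limit (A through IV) of the worksheet versions discussed here. Rows need only have 1 added before being displayed.

```basic
DEFINT A-Z
DECLARE FUNCTION ColName$ (Column)

PRINT ColName$(0)          'A
PRINT ColName$(25)         'Z
PRINT ColName$(26)         'AA
PRINT ColName$(255)        'IV

FUNCTION ColName$ (Column) STATIC
  'Column is the zero-based column number stored in the file
  IF Column < 26 THEN
    ColName$ = CHR$(65 + Column)
  ELSE
    ColName$ = CHR$(64 + Column \ 26) + CHR$(65 + Column MOD 26)
  END IF
END FUNCTION
```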
In the Read portion of the program that follows, the same steps are performed for each record. That is, binary GET # statements read the record's type, length, and data. If the record type indicates that it pertains to a worksheet cell, then the five-byte header is also read using the GetFormat subprogram. Opcodes that are not supported by this program are simply displayed, so you will see that they were encountered.
The Write portion of the program performs simple formatting, and also ensures that a column-width record is written only once. Table 6-2 shows the makeup of the numeric formatting byte used in all Lotus files.

The program example below can either read or write a Lotus 1-2-3 worksheet file. If you select Create when this program is run, it will write a worksheet file named SAMPLE.WKS suitable for reading into any version of Lotus 1-2-3. This sample file contains an assortment of labels and values. If you select Read, the program will prompt for the name of a worksheet file which it then reads and displays.
DEFINT A-Z
DECLARE SUB GetFormat (Format, Row, Column)
DECLARE SUB WriteColWidth (Column, ColWidth)
DECLARE SUB WriteInteger (Row, Column, ColWidth, Temp)
DECLARE SUB WriteLabel (Row, Column, ColWidth, Msg$)
DECLARE SUB WriteNumber (Row, Col, ColWidth, Fmt$, Num#)
DIM SHARED CellFmt AS STRING * 1    'to read one byte
DIM SHARED ColNum(40)               'max columns to write
DIM SHARED FileNum                  'the file number to use
CLS
PRINT "Read an existing 123 file or ";
PRINT "Create a sample file (R/C)? "
LOCATE , , 1
DO
  X$ = UCASE$(INKEY$)
LOOP UNTIL X$ = "R" OR X$ = "C"
LOCATE , , 0
PRINT X$
IF X$ = "R" THEN
  '----- read an existing file
  INPUT "Lotus file to read: ", FileName$
  IF INSTR(FileName$, ".") = 0 THEN
    FileName$ = FileName$ + ".WKS"
  END IF
  PRINT
  '----- get the next file number and open the file
  FileNum = FREEFILE
  OPEN FileName$ FOR BINARY AS #FileNum
  DO UNTIL Opcode = 1               'until End of File code
    GET FileNum, , Opcode           'get the next opcode
    GET FileNum, , Length           'and the data length
    SELECT CASE Opcode              'filter the Opcodes
      CASE 0                        'Beginning of File record
        PRINT "Beginning of file, Lotus ";
        GET FileNum, , Temp
        SELECT CASE Temp
          CASE 1028
            PRINT "1-2-3 version 1.0 or 1A"
          CASE 1029
            PRINT "Symphony version 1.0"
          CASE 1030
            PRINT "123 version 2.x"
          CASE ELSE
            PRINT "NOT a Lotus File!"
        END SELECT
      CASE 1                        'End of File
        PRINT "End of File"
      CASE 12                       'Blank cell
        'Note that Lotus saves blank cells only if
        'they are formatted or protected.
        CALL GetFormat(Format, Row, Column)
        PRINT "Blank: Format ="; Format,
        PRINT "Row ="; Row,
        PRINT "Col ="; Column
      CASE 13                       'Integer
        CALL GetFormat(Format, Row, Column)
        GET FileNum, , Temp
        PRINT "Integer: Format ="; Format,
        PRINT "Row ="; Row,
        PRINT "Col ="; Column,
        PRINT "Value ="; Temp
      CASE 14                       'Floating point
        CALL GetFormat(Format, Row, Column)
        GET FileNum, , Number#
        PRINT "Number: Format ="; Format,
        PRINT "Row ="; Row,
        PRINT "Col ="; Column,
        PRINT "Value ="; Number#
      CASE 15                       'Label
        CALL GetFormat(Format, Row, Column)
        'Create a string to hold the label. 6 is
        'subtracted to exclude the Format, Column,
        'and Row information.
        Info$ = SPACE$(Length - 6)
        GET FileNum, , Info$        'read the label
        GET FileNum, , CellFmt$     'eat the CHR$(0)
        PRINT "Label: Format ="; Format,
        PRINT "Row ="; Row,
        PRINT "Col ="; Column, Info$
      CASE 16                       'Formula
        CALL GetFormat(Format, Row, Column)
        GET FileNum, , Number#      'read cell value
        GET FileNum, , Length       'and formula length
        SEEK FileNum, SEEK(FileNum) + Length 'skip formula
        PRINT "Formula: Format ="; Format,
        PRINT "Row ="; Row,
        PRINT "Col ="; Column,
        PRINT "Value ="; Number#
      CASE ELSE
        Dummy$ = SPACE$(Length)     'skip the record
        GET FileNum, , Dummy$       'read it in
        PRINT "Opcode: "; Opcode    'show its Opcode
    END SELECT
    '----- pause when the screen fills
    IF CSRLIN > 21 THEN
      PRINT
      PRINT "Press
There are several points worth noting about this program. First, Lotus
label strings are always terminated with a CHR$(0) zero byte, which is the
same method used by DOS and the C language. Therefore, the WriteLabel
subprogram adds this byte, which is also included as part of the length
word that follows the Opcode.
In the WriteNumber subprogram, the 1-byte format code is either 127 to
default to unformatted, or bit-coded to indicate fixed, currency, or
percent formatting. WriteNumber expects a format string such as "F3" which
indicates fixed-point with three decimal positions, or "P1" for percent
formatting using one decimal place. If you instead use "C", WriteNumber
will use a fixed 2-decimal point currency format.
Earlier I pointed out that extra work is needed to write a constant
value to a binary file, because only variables may be used with PUT #.
This is painfully clear in each of the Write subprograms, where the integer
variable Temp is repeatedly assigned to new values. We can only hope that
Microsoft will see fit to remove this arbitrary limitation in a later
version of BASIC.
Finally, note the use of the fixed-length string CellFmt$. Although
some languages support a one-byte numeric variable type, BASIC does not.
Therefore, to read and write these values you must use a fixed-length
string. To determine the value after reading a file you will use ASC, and
to assign a value prior to writing it you instead use CHR$. For example,
to assign CellFmt$ to the byte value 123 use CellFmt$ = CHR$(123).
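A short round trip shows both conversions together. This is a sketch only;
the variable OneByte and the temporary file BYTE.TMP are hypothetical names
chosen for the example.

```basic
DEFINT A-Z
DIM OneByte AS STRING * 1

OPEN "BYTE.TMP" FOR BINARY AS #1
OneByte = CHR$(123)          'assign the byte value 123
PUT #1, 1, OneByte           'write it as the first byte
OneByte = CHR$(0)            'clear the variable
GET #1, 1, OneByte           'read the byte back
CLOSE #1
PRINT ASC(OneByte)           'displays 123
KILL "BYTE.TMP"              'remove the temporary file
```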
Navigating Your Files
BASIC offers a number of file-related functions to determine how long a
file is, the current DOS seek location where the next read or write will
take place, and also if that location is at the end of the file. These are
LOF, LOC and SEEK, and EOF respectively. LOF stands for Length Of File,
LOC means current Location, and EOF is End Of File. The SEEK statement is
also available to force the next file access to occur at a specified place
within the file. All of these require a file number argument to indicate
which file is being referred to.
The EOF Function
The EOF function is most useful when reading sequential text files, and it
avoids BASIC's "Input past end" error that would otherwise result from
trying to read past the end of the available data. The following short
complete program reads a text file and displays its contents, and shows how
EOF is used for this purpose.
OPEN FileName$ FOR INPUT AS #1
WHILE NOT EOF(1)
  LINE INPUT #1, This$
  PRINT This$
WEND
CLOSE
Notice the use of the NOT operator in this example. The EOF function
returns an integer value of either -1 or 0, to indicate true (at the end of
the file) or false. Therefore, NOT -1 is equal to 0 (False), and NOT 0 is
equal to -1 (True). This use of bit manipulation was described earlier in
Chapter 2.
EOF can also be used with binary and random access files for the same
purpose. In fact, EOF may be even more useful in those cases, because
BASIC does not create an error when you attempt to read past the end as it
does for sequential files. Indeed, once you go past the end of a binary or
random access file, BASIC simply fills the variables being read with zero
bytes. Without EOF there is no way to distinguish between zeros returned
by BASIC because you went past the end of the file and zeros that were read
as legitimate data.
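With a binary file the test is usually made after the read rather than
before it, because (as I understand the BASIC documentation) EOF becomes
true only once a GET # has been unable to read all of its data. The sketch
below assumes that behavior; the file name NUMBERS.TMP is hypothetical.

```basic
DEFINT A-Z
OPEN "NUMBERS.TMP" FOR BINARY AS #1
FOR X = 1 TO 3               'write three integers
  PUT #1, , X
NEXT
SEEK #1, 1                   'go back to the first byte
DO
  GET #1, , Value            'try to read the next integer
  IF EOF(1) THEN EXIT DO     'the last GET ran out of data
  PRINT Value
LOOP
CLOSE #1
KILL "NUMBERS.TMP"           'remove the temporary file
```

Testing EOF before the PRINT keeps the zero that BASIC returns past the end
of the file from being mistaken for data.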
The EOF function was originally needed with DOS 1.0 for a program to
determine when the end of the file was reached. That version of DOS always
wrote all data in multiples of 128 bytes, and all file directory entries
also were listed with lengths being a multiple of 128. [That is, a file
which contains only ten bytes of data will be reported by DIR as being 128
bytes long.] To indicate the true end of the file, a CHR$(26) end of file
marker was placed just past the last byte of valid data. Thus, EOF was
originally written to search for a byte with that value, and return True
when it was found.
Most modern applications do not use an EOF character, and instead rely
on the file length that is stored in the file's directory entry. However,
some older programs still write a CHR$(26) at the end of the data, and DOS'
COPY CON command does this as well. Therefore, BASIC's EOF will return a
True value when this character is encountered, even if there is still more
data to be read in the file. In fact, you can provide a minimal amount of
data security by intentionally writing a CHR$(26) at or near the beginning
of a sequential file. If someone then uses the DOS TYPE command to view
the file, only what precedes the EOF marker will be displayed.
Another implication of EOF characters in BASIC surfaces when you open
a sequential file for append mode. BASIC makes a minimal attempt to locate
an EOF character, and if one exists it begins appending on top of it.
After all, if writing started just past the EOF byte, a subsequent LINE
INPUT would fail when it reached that point. Likewise, an EOF test would
return true and the program would stop reading at that location in the
file. Therefore, BASIC checks the last few bytes in the file when you open
for append, to see if an EOF marker is present. However, if the marker is
much earlier in a large file, BASIC will not see it.
When EOF is used with serial communications, it returns 0 until a
CHR$(26) byte is received, at which point it continues to return -1 until
the communications port is closed.
The LOF Function
The LOF function simply returns the current length of the file, and that
too can be used as a way to tell when you have reached the end. In the
random access FIELD example program shown earlier, LOF was used in
conjunction with the record length to determine the number of records in
the file. Since the length of most random access files is directly related
to [and evenly divisible by] the number of records in the file, simple
division can be used to determine how many records there are. The formula
is NumRecords = LOF(FileNum) \ RecLength.
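As a brief sketch of that formula in use, the fragment below counts the
records in a file and then reads the last one. The file name DATA.DAT and
the 64-byte record length are assumptions made for the example. A long
integer holds the result because LOF can exceed 32767.

```basic
DEFINT A-Z
CONST RecLength% = 64          'hypothetical record length
DIM Buffer AS STRING * 64      'one record's worth of data

OPEN "DATA.DAT" FOR RANDOM AS #1 LEN = RecLength%
NumRecords& = LOF(1) \ RecLength%
PRINT "Records in file:"; NumRecords&
IF NumRecords& > 0 THEN
  GET #1, NumRecords&, Buffer  'read the last record
END IF
CLOSE #1
```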
Understand that LOF always returns the length of the file in bytes,
regardless of whether the file was opened for sequential, binary, or random
access. It never returns a record count directly, which is why the division
by the record length shown above is needed.
LOF can also be used as a crude way to see if a file exists. Even
though this is done much more effectively and elegantly with assembly
language or CALL Interrupt, the short example below shows how LOF can be
used for this purpose.
FUNCTION Exist% (FileName$) STATIC
  FileNum = FREEFILE
  OPEN FileName$ FOR BINARY AS #FileNum
  Length& = LOF(FileNum)     'long, so big files don't overflow
  CLOSE #FileNum
  IF Length& = 0 THEN        'it probably wasn't there
    Exist% = 0               'return False to show that
    KILL FileName$           'and delete what we created
  ELSE
    Exist% = -1              'otherwise return True
  END IF
END FUNCTION
Besides being clunky, this program also has a serious flaw: If the file
does exist but has a perfectly legal length of zero, this function will say
it doesn't exist and then delete it! As I said, this method is crude, but
a lot of programmers have used it.
The LOC and SEEK Functions
LOC and SEEK are closely related, in that they return information about
where you are in the file. However, LOC reports the position of the last
read or write, and SEEK tells where the next one will occur. SEEK returns
a byte position for files that were opened for sequential or binary access,
and a record number when used with random access files. LOC does the same
for binary and random access files, but behaves differently with sequential
files.
In practice, LOC is of little value, especially when you are
manipulating sequential files. For reasons that only Microsoft knows, LOC
returns the number of the last byte read or written, but divided by 128.
Since no program I know of treats sequential files as containing 128-byte
records, I cannot imagine how this could be useful. Further, since LOC
returns the location of the last read or write, it never reflects the
true position in the file.
When used with communications, LOC reports the number of characters in
the receive buffer that are currently waiting to be read, which is useful.
When used with INPUT$ #, LOC provides a handy way to retrieve all of the
characters present in the buffer at one time. This is shown in context
below, and the example assumes that the communications port has already
been opened.
NumChars = LOC(1)
IF NumChars THEN
  This$ = INPUT$(NumChars, #1)
END IF
The SEEK function always returns the current file position, which is the
point at which the next read or write will take place. One good use for
SEEK is to read the current location in a sequential file, to allow a
program to walk backwards through the file later. For example, if you need
to create a text file browsing program, there is no other way to know where
the previous line of a file is located. A short program that shows this in
context follows in the section that describes the SEEK statement.
The SEEK Statement
Where the SEEK function lets you determine where you are currently in a
file, the SEEK statement lets you move to any arbitrary position. As you
might imagine, SEEK as a statement is similar to the function version in
that it assumes a byte value when used with sequential and binary files,
and a record number with random access files.
SEEK can be very useful in a variety of situations, and in particular
when indexing random access files. When an indexing system is employed,
selected portions of a data file are loaded into memory where they can be
searched very quickly. Since the location of the index information being
searched corresponds to the record number of the complete data record, the
record can be accessed with a single GET #. This was described briefly in
the discussion of the BASIC PDS ISAM options in Chapter 5. Thus, once the
record number for a given entry has been identified, the SEEK statement (or
the SEEK argument in the GET # command) is used to access that particular
record.
For this example, though, I will instead show how SEEK can be used
with a sequential file. The following complete program provides the
rudiments of a text file browser, but this version displays only one line
at a time. It would be fairly easy to expand this program to display
entire screenfuls of text, and I leave that as an exercise for you.
The program begins by prompting for a file name, and then opens that
file for sequential input. The maximum number of lines that can be
accommodated is set arbitrarily at 5000, though you will not be able to
specify more than 16384 unless you compile with the /ah option. The long
integer Offset&() array is used to remember where each line encountered so
far in the file begins, and 16384 is the maximum number of elements that
can fit into a single 64K array. For a typical text file with line lengths
that average 60 characters, 16384 lines is nearly 1MB of text.
When you run the program, it expects only the up and down arrow keys
to advance and go backwards through the file, the Home key to jump to the
beginning, or the Escape key to end the program. Notice that the words
"blank line" are printed when a blank line is encountered, just so you can
see that something has happened.
DEFINT A-Z
CONST MaxLines% = 5000
REDIM Offset&(1 TO MaxLines% + 1)  'one extra element to hold the
                                   ' offset past the last line
CLS
PRINT "Enter the name of file to browse: ";
LINE INPUT "", FileName$
OPEN FileName$ FOR INPUT AS #1
Offset&(1) = 1                     'initialize to offset 1
CurLine = 1                        'and start with line 1
WHILE Action$ <> CHR$(27)          'until they press Escape
  SEEK #1, Offset&(CurLine)        'seek to the current line
  LINE INPUT #1, Text$             'read that line
  Offset&(CurLine + 1) = SEEK(1)   'save where the next
                                   ' line starts
  CLS
  IF LEN(Text$) THEN               'if it's not blank
    PRINT Text$                    'print the line
  ELSE                             'otherwise
    PRINT "(blank line)"           'show that it's blank
  END IF
  DO                               'wait for a key
    Action$ = INKEY$
  LOOP UNTIL LEN(Action$)
  SELECT CASE ASC(RIGHT$(Action$, 1))
    CASE 71                        'Home
      CurLine = 1
    CASE 72                        'Up arrow
      IF CurLine > 1 THEN
        CurLine = CurLine - 1
      END IF
    CASE 80                        'Down arrow
      IF (NOT EOF(1)) AND CurLine < MaxLines% THEN
        CurLine = CurLine + 1
      END IF
    CASE ELSE
  END SELECT
WEND
CLOSE
END
You should be aware that BASIC does not prevent you from using SEEK to go
past the end of a file that has been opened for Binary access. If you do
this and then write any data, DOS will actually extend the file to include
the data that was just written. Therefore, it is important to understand
that any data that lies between the previous end of the file and the newly
added data will be undefined. When a file is deleted DOS simply abandons
the sectors that held its data, and makes them available for later use.
But whatever data those sectors contained remains intact. When you later
expand a file this way using SEEK, the old abandoned sector contents are
incorporated into the file. Even if the sectors that are allocated were
never written to previously, they will contain the &HF6 bytes that DOS'
FORMAT.COM uses to initialize a disk.
You can turn this behavior into an important feature, and in some
cases recreate a file that was accidentally truncated. If you erase a file
by mistake, it is possible to recover it using the Norton Utilities or a
similar disk utility program. But when an existing file is opened for
output, DOS truncates it to a length of zero. The following program shows
the steps necessary to reconstruct a file that has been destroyed this way.
OPEN FileName$ FOR BINARY AS #1
SEEK #1, 30000
PUT #1, , X%
CLOSE #1
In this case, the file is restored to a length of 30001 bytes, because the
two-byte integer X% is written at bytes 30000 and 30001. You can use larger
or smaller values as appropriate. Understand that there is no
guarantee that DOS will reassign the same sectors to the file that it
originally used. But I have seen this trick work more than once, and it is
at least worth a try.
In a similar fashion, you can reduce the size of a file by seeking to
a given location and then writing zero bytes there. Since BASIC provides
no way to write zero bytes to a file, some additional trickery is needed.
This will be described in Chapter 11 in the section that discusses using
CALL Interrupt to access DOS and BIOS services.
Advanced File Techniques
There are a number of clever file-related tricks that can be performed
using only BASIC programming. Some of these tricks help you to improve on
BASIC's speed, and others let you do things that are not possible using the
normal and obvious methods. BASIC is no slower than other languages when
reading and writing large amounts of data, and indeed, the bottleneck is
frequently DOS itself. Further, if you can reduce the amount of data that
is written, your files will be smaller as well. With that in mind, let's
look at some ways to further improve your programs.
Speeding Up File Access
The single most important way to speed up your programs is to read and
write large amounts of data in one operation. The normal method for saving
a numeric or TYPE array is to write each element to disk in a loop. But
when there are many thousands of elements, a substantial amount of overhead
is incurred just from BASIC's repeated calls to DOS. There are several
solutions you can consider, each with increasing levels of complexity.
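For reference, the element-by-element method that the following techniques
improve on might look like the sketch below. The file name ARRAY.BIN is an
assumption made for the example. Each pass through the loop makes a
separate trip through BASIC's file routines to write just two bytes.

```basic
DEFINT A-Z
CONST NumEls% = 20000
REDIM Array(1 TO NumEls%)        'an integer array to save
OPEN "ARRAY.BIN" FOR BINARY AS #1
FOR X = 1 TO NumEls%             'one PUT per element
  PUT #1, , Array(X)
NEXT
CLOSE #1
```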
BLOAD and BSAVE
The simplest way to read and write a large amount of contiguous data is
with BLOAD and BSAVE. BSAVE takes a "snapshot" of any contiguous area of
memory up to 64K in size, and saves it to disk in a single operation. When
an application calls DOS to read or write a file, it furnishes DOS with the
segment and address where the data is to be loaded or saved from, and also
the number of bytes. BLOAD and BSAVE provide a simple interface to the DOS
read and write services, and they can be used to load and save numeric
arrays up to 64K in size, as well as screen images.
[I have seen a number of messages in the MSBASIC forum on CompuServe
stating that BSAVE and BLOAD do not work with compressed disks. Many of
those messages have come from Microsoft technical support, and I have no
reason to doubt them. It may be that only VB/DOS has this problem, but I
have no way to test QB and PDS because I don't use disk compression.]
A file that has been written using BSAVE includes a 7-byte header that
identifies it as a BSAVE file, and also shows where it was saved from and
how many bytes it contains. BLOAD requires this header, and thus cannot be
used with any arbitrary type of file. But when used together, these
commands can be as much as ten times faster than a FOR/NEXT loop.
The example below creates and then saves an integer array, and
then loads it again to prove the process worked.
DEFINT A-Z
CONST NumEls% = 20000
REDIM Array(1 TO NumEls%)           'create the array
FOR X = 1 TO NumEls%                'fill it with values
  Array(X) = X
NEXT
DEF SEG = VARSEG(Array(1))          'set the BSAVE segment
BSAVE "ARRAY.DAT", VARPTR(Array(1)), NumEls% * LEN(Array(1))
REDIM Array(1 TO NumEls%)           'recreate the array
DEF SEG = VARSEG(Array(1))          'the array may have moved
BLOAD "ARRAY.DAT", VARPTR(Array(1))
FOR X = 1 TO NumEls%                'prove the data is valid
  IF Array(X) <> X THEN
    PRINT "Error in element"; X
  END IF
NEXT
END
Because BSAVE and BLOAD use the current DEF SEG setting to know the segment
the data is in, VARSEG is used with the first element of the array. Once
the correct segment has been established, BSAVE is given the name of the
file to save, the starting address, and the number of bytes of data. As
with the TYPE variable example shown earlier, LEN is ideal here as well to
help calculate the number of bytes that must be saved. In this case, each
integer array element is two bytes long, and BASIC multiplies the constants
NumEls% and LEN(Array(1)) when the program is compiled. Therefore, no
additional code is added to the program to calculate this value at runtime.
Once the array has been saved it is redimensioned, which effectively
clears it to all zero values prior to reloading. Notice that DEF SEG is
used again before the BLOAD statement. This is an important point, because
there is no guarantee that BASIC will necessarily allocate the same block
of memory the second time. If a file is loaded into the wrong area of
memory, your program is sure to crash or at least not work correctly.
Also note that BLOAD always loads the entire file, and a length
argument is not needed or expected. This brings up an important issue: how
can you determine how large to dimension an array prior to loading it? The
answer, as you may have surmised, is to open the file for binary access and
read the length stored in the BSAVE header. All that's needed is to know
how the header is organized, as the following program reveals.
DEFINT A-Z
TYPE BHeader
  Header AS STRING * 1
  Segment AS INTEGER
  Address AS INTEGER
  Length AS INTEGER
END TYPE
DIM BLHeader AS BHeader
OPEN "ARRAY.DAT" FOR BINARY AS #1
GET #1, , BLHeader
CLOSE
IF ASC(BLHeader.Header) <> &HFD THEN
  PRINT "Not a valid BSAVE file"
  END
END IF
LongLength& = BLHeader.Length
IF LongLength& < 0 THEN
  LongLength& = LongLength& + 65536
END IF
NumElements = LongLength& \ 2
REDIM Array(1 TO NumElements)
DEF SEG = VARSEG(Array(1))
BLOAD "ARRAY.DAT", VARPTR(Array(1))
END
Even though the original segment and address from which the file was saved
are in the BSAVE header, that information is not used here. In most
situations you will provide BLOAD with an address to load the file to.
However, if the address is omitted, BASIC uses the segment and address
stored in the file, and ignores the current DEF SEG setting. This would be
useful when handling text and graphics images which are always loaded to
the same segment from which they were originally saved. But in general I
recommend that you always define an explicit segment and address.
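A minimal sketch of the screen image case follows. It is an illustration
rather than one of the chapter's listings, and it assumes a color adapter
in 80 by 25 text mode, whose screen memory lies at segment &HB800. The file
name SCREEN.DAT is arbitrary.

```basic
DEF SEG = &HB800             'color text screen segment (assumed adapter)
BSAVE "SCREEN.DAT", 0, 4000  '80 columns x 25 rows x 2 bytes per cell
CLS
DEF SEG                      'restore BASIC's default data segment
BLOAD "SCREEN.DAT"           'no address given: uses the saved location
```

Because the header records segment &HB800 and address 0, the saved text
should reappear even though DEF SEG no longer points at video memory.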
There are a few other points worth elaborating on as well. First, the
program examines the first byte in the file to be sure it is the special
value &HFD which identifies a BSAVE file. The ASC function is required for
that, since the only way to define a TYPE component one byte long is as a
string.
Second, the length is stored as an unsigned integer, which cannot be
manipulated directly in a BASIC program if its value exceeds 32767. As you
learned in Chapter 2, integer values larger than 32767 are treated by BASIC
as signed, and in this case they are considered negative. Therefore, the
value is first assigned to a long integer, which is then tested for a value
less than zero. If it is indeed negative, 65536 is added to the variable
to convert it to an equivalent positive number. Note that the length in a
BSAVE header does not include the header length; only the data itself is
considered.
If you single-step through this program after running the earlier one
that created the file, you will see that the code that adds 65536 is
executed, because the header shows that the file contains 40000 bytes.
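The arithmetic can be checked in isolation. The sketch below is
illustrative only, and the variable name Raw is a hypothetical stand-in for
BLHeader.Length. A 16-bit value of 40000 appears to BASIC as
40000 - 65536 = -25536, and adding 65536 restores it. Masking with the long
constant &HFFFF& is an equivalent alternative.

```basic
DEFINT A-Z
Raw = -25536                        'how an unsigned 40000 looks to BASIC
LongLength& = Raw
IF LongLength& < 0 THEN
  LongLength& = LongLength& + 65536
END IF
PRINT LongLength&                   'prints 40000
PRINT Raw AND &HFFFF&               'also prints 40000
```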
There are two limitations to using BSAVE and BLOAD this way. One
problem is that you may not want the header to be attached to the file.
The other, more important problem is that BASIC allows arrays to exceed
64K, while BSAVE and BLOAD can handle no more than 64K at a time. Saving a
single huge array in multiple files is clumsy, and contributes to the
clutter on your disks. The header issue is less
important, because you can always access the file with normal binary
statements after using a SEEK to skip over the header. But the huge array
problem requires some heavy ammunition.
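For the header issue, a sketch along these lines reads data directly from
a BSAVE file with ordinary binary statements. It assumes ARRAY.DAT was
written by the earlier BSAVE example, and the names HeaderLen% and
FirstElement are illustrative.

```basic
DEFINT A-Z
CONST HeaderLen% = 7
OPEN "ARRAY.DAT" FOR BINARY AS #1
SEEK #1, HeaderLen% + 1      'positions count from 1, so data begins at 8
GET #1, , FirstElement       'read one 2-byte integer element
CLOSE #1
PRINT FirstElement           'expect 1 for the array saved earlier
```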
One final point worth mentioning is that BSAVE and BLOAD assume a .BAS
file name extension if none is given. This is incredibly stupid, since the
contents of a BSAVE file have no relationship to a BASIC source file.
Therefore, to save a file with no extension at all you must append a period
to the name: BSAVE "MYFILE.", Address, Length.
Beyond BSAVE
The program that follows includes both a demonstration and a pair of
subprograms that let you save any data regardless of its size or location.
These routines are primarily intended for saving huge numeric and TYPE
arrays, but there is no reason they couldn't be used for other purposes.
However, they cannot be used with conventional variable-length string
arrays, because the data in those arrays is not contiguous. The file is
processed in 16K blocks using multiple passes, and the actual saving and
loading is performed by calling BASIC's internal PUT # and GET # routines.
DEFINT A-Z
'NOTE: This program must be compiled with the /ah option.
DECLARE SUB BigLoad (FileName$, Segment, Address, Bytes&)
DECLARE SUB BigSave (FileName$, Segment, Address, Bytes&)
DECLARE SUB BCGet ALIAS "B$GET3" (BYVAL FileNum, BYVAL Segment, _
BYVAL Address, BYVAL NumBytes)
DECLARE SUB BCPut ALIAS "B$PUT3" (BYVAL FileNum, BYVAL Segment, _
BYVAL Address, BYVAL NumBytes)
CONST NumEls% = 20000
REDIM Array&(1 TO NumEls%)
NumBytes& = LEN(Array&(1)) * CLNG(NumEls%)
FOR X = 1 TO NumEls% 'fill the array
  Array&(X) = X
NEXT
Segment = VARSEG(Array&(1)) 'save the array
Address = VARPTR(Array&(1))
CALL BigSave("ARRAY.DAT", Segment, Address, NumBytes&)
REDIM Array&(1 TO NumEls%) 'clear the array
Segment = VARSEG(Array&(1)) 'reload the array
Address = VARPTR(Array&(1))
CALL BigLoad("ARRAY.DAT", Segment, Address, NumBytes&)
FOR X = 1 TO NumEls% 'prove this all worked
  IF Array&(X) <> X THEN
    PRINT "Error in element"; X
  END IF
NEXT
END
SUB BigLoad (FileName$, DataSeg, Address, Bytes&) STATIC
  FileNum = FREEFILE
  OPEN FileName$ FOR BINARY AS #FileNum
  NumBytes& = Bytes& 'work with copies to
  Segment = DataSeg 'protect the parameters
  DO
    IF NumBytes& > 16384 THEN
      CurrentBytes = 16384
    ELSE
      CurrentBytes = NumBytes&
    END IF
    CALL BCGet(FileNum, Segment, Address, CurrentBytes)
    NumBytes& = NumBytes& - CurrentBytes
    Segment = Segment + &H400
  LOOP WHILE NumBytes&
  CLOSE #FileNum
END SUB
SUB BigSave (FileName$, DataSeg, Address, Bytes&) STATIC
  FileNum = FREEFILE
  OPEN FileName$ FOR BINARY AS #FileNum
  NumBytes& = Bytes& 'work with copies to
  Segment = DataSeg 'protect the parameters
  DO
    IF NumBytes& > 16384 THEN
      CurrentBytes = 16384
    ELSE
      CurrentBytes = NumBytes&
    END IF
    CALL BCPut(FileNum, Segment, Address, CurrentBytes)
    NumBytes& = NumBytes& - CurrentBytes
    Segment = Segment + &H400
  LOOP WHILE NumBytes&
  CLOSE #FileNum
END SUB
Although BASIC lets you save and load only single variables or array
elements, its internal library routines can work with data of nearly any
size. And since TYPE variables can be as large as 64K, these routines must
be able to accommodate data at least that big. Therefore, BASIC's usual
restriction on what you can and cannot read or write to disk with GET # and
PUT # is an arbitrary one.
Accessing BASIC's internal routines requires that you declare them
using ALIAS, since it is illegal to call a routine that has a dollar sign
in its name. As you can see, these routines expect their parameters to be
passed by value, and this is handled by the DECLARE statements. Normally,
you cannot call these routines from within the QB editing environment. But
if you separate the two subprograms and place them into a different module,
that module can be compiled and added to a Quick Library. That is, the
subprograms can be together in one file, but not with the demo that calls
them. Be sure to add the two DECLARE statements that define B$PUT3 and
B$GET3 to that module as well.
The long integer array this program creates exceeds the normal 64K
limit, so the /ah compiler switch must be used. Notice in the BigLoad and
BigSave subprograms that copies are made of two of the incoming parameters.
If this were not done, the subprograms would change the passed values,
which is a bad practice in this case. Also, notice how the segment value
that is used for saving and loading is adjusted through each pass of the DO
loop. Since the data is saved in 16K blocks, the segment must be increased
by 16384 \ 16 = 1024 for each pass. The use of an equivalent &H value here
is arbitrary; I translated this program from another version written in
assembly language that used Hex for that number.
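To see how the block size and the segment adjustment interact, the
following sketch (an illustration, not part of the original program) walks
the same loop without touching a file. The 80000-byte figure matches the
demonstration array, and SegOffset& is a hypothetical name for the amount
added to the starting segment.

```basic
DEFINT A-Z
NumBytes& = 80000            '20000 long integers at 4 bytes each
SegOffset& = 0               'paragraphs past the starting segment
DO
  IF NumBytes& > 16384 THEN
    CurrentBytes = 16384
  ELSE
    CurrentBytes = NumBytes&
  END IF
  PRINT "Segment offset"; SegOffset&; "bytes"; CurrentBytes
  NumBytes& = NumBytes& - CurrentBytes
  SegOffset& = SegOffset& + &H400   '16384 bytes \ 16 bytes per paragraph
LOOP WHILE NumBytes&
```

This should report five passes: four full 16K blocks followed by a final
block of 14464 bytes.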
Processing Large Files
Although the solutions shown so far are valuable when saving or loading
large amounts of data, that is as far as they go. In many cases you will
also need to process an entire existing file. Some examples are a program
that copies or encrypts files, or a routine that searches an entire file
for a string of text. As with saving and loading files, processing a file
or portion of a file in large blocks is always faster and more effective
than processing it line by line.
The file copying subprogram below accepts source and destination file
names, and copies the data in 4K blocks. The 4K size is significant,
because it is large enough to avoid many repeated calls to DOS, and small
enough to allow a conventional string to be used as a file buffer. As with
the BigLoad and BigSave routines, the file is processed in pieces. Also,
for simplicity a complete file name and path is required. Although the DOS
COPY command lets you use a source file name and a destination drive or
path only, the CopyFile subprogram requires that entire file names be given
for both.
DEFINT A-Z
DECLARE SUB CopyFile (InFile$, OutFile$)
SUB CopyFile (InFile$, OutFile$) STATIC
  File1 = FREEFILE
  OPEN InFile$ FOR BINARY AS #File1
  File2 = FREEFILE
  OPEN OutFile$ FOR BINARY AS #File2
  Remaining& = LOF(File1)
  DO
    IF Remaining& > 4096 THEN
      ThisPass = 4096
    ELSE
      ThisPass = Remaining&
    END IF
    Buffer$ = SPACE$(ThisPass)
    GET #File1, , Buffer$
    PUT #File2, , Buffer$
    Remaining& = Remaining& - ThisPass
  LOOP WHILE Remaining&
  CLOSE File1, File2
END SUB
Once the basic structure of a routine that processes an entire file has
been established, it can be easily modified for other purposes. For
example, CopyFile can be altered to encrypt an entire file, search a file
for a text string, and so forth. A few of these will be shown here. Note
that for simplicity and clarity, CopyFile creates a new buffer with each
pass through the loop. You could avoid that by preceding the assignment
with IF LEN(Buffer$) <> ThisPass THEN or similar logic, so the buffer is
created only when it does not already exist at the correct length.
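The buffer test described above might look like the following variation.
This is a sketch only, and CopyFile2 is a hypothetical name chosen to avoid
clashing with the original subprogram.

```basic
DEFINT A-Z
DECLARE SUB CopyFile2 (InFile$, OutFile$)

SUB CopyFile2 (InFile$, OutFile$) STATIC
  File1 = FREEFILE
  OPEN InFile$ FOR BINARY AS #File1
  File2 = FREEFILE
  OPEN OutFile$ FOR BINARY AS #File2
  Remaining& = LOF(File1)
  DO
    IF Remaining& > 4096 THEN
      ThisPass = 4096
    ELSE
      ThisPass = Remaining&
    END IF
    'create the buffer only when its length must change
    IF LEN(Buffer$) <> ThisPass THEN Buffer$ = SPACE$(ThisPass)
    GET #File1, , Buffer$
    PUT #File2, , Buffer$
    Remaining& = Remaining& - ThisPass
  LOOP WHILE Remaining&
  CLOSE File1, File2
  Buffer$ = ""               'release the string memory (optional)
END SUB
```

For a file larger than 4K, the buffer is then allocated once for the full
blocks and once more for the final partial block.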
The BufIn function and example below serve as a very fast LINE INPUT
replacement. Even though BASIC's own file input routines provide buffering
for increased speed, they are not as effective as this function. In my
measurements I have found BufIn to be consistently four to five times
faster than BASIC's LINE INPUT routine when reading large (greater than
50K) files. With smaller files the improvement is less, but still
substantial.
DEFINT A-Z
DECLARE FUNCTION BufIn$ (FileName$, Done)
LINE INPUT "Enter a file name: ", FileName$
'---- Show how fast BufIn$ reads the file.
Start! = TIMER
DO
  This$ = BufIn$(FileName$, Done)
  IF Done THEN EXIT DO
LOOP
Done! = TIMER
PRINT "Buffered input: "; Done! - Start!
'---- Now show how long BASIC's LINE INPUT takes.
Start! = TIMER
OPEN FileName$ FOR INPUT AS #1
DO
  LINE INPUT #1, This$
LOOP UNTIL EOF(1)
Done! = TIMER
PRINT " BASIC's INPUT: "; Done! - Start!
CLOSE
END
FUNCTION BufIn$ (FileName$, Done) STATIC
  IF Reading GOTO Process 'now reading, jump in
  '----- initialization
  Reading = -1 'not reading so start now
  Done = 0 'clear Done just in case
  CR$ = CHR$(13) 'define for speed later
  FileNum = FREEFILE 'open the file
  OPEN FileName$ FOR BINARY AS #FileNum
  Remaining& = LOF(FileNum) 'byte count to be read
  IF Remaining& = 0 GOTO ExitFn 'empty or nonexistent file
  BufSize = 4096 'bytes to read each pass
  Buffer$ = SPACE$(BufSize) 'assume BufSize bytes
  DO 'the main outer loop
    IF Remaining& < BufSize THEN 'read only what remains
      BufSize = Remaining& 'resize the buffer
      IF BufSize < 1 GOTO ExitFn 'possible only if EOF byte
      Buffer$ = SPACE$(BufSize) 'create the file buffer
    END IF
    GET #FileNum, , Buffer$ 'read a block
    BufPos = 1 'start at the beginning
    DO 'walk through buffer
      CR = INSTR(BufPos, Buffer$, CR$) 'look for a Return
      IF CR THEN 'we found one
        SaveCR = CR 'save where
        BufIn$ = MID$(Buffer$, BufPos, CR - BufPos)
        BufPos = CR + 2 'skip inevitable LF
        EXIT FUNCTION 'all done for now
      ELSE 'back up in the file
        '---- if at the end and no CHR$(13) was found
        ' return what remains in the string
        IF SEEK(FileNum) >= LOF(FileNum) THEN
          Output$ = MID$(Buffer$, SaveCR + 2)
          '---- trap a trailing EOF marker
          IF RIGHT$(Output$, 1) = CHR$(26) THEN
            Output$ = LEFT$(Output$, LEN(Output$) - 1)
          END IF
          BufIn$ = Output$ 'assign the function
          GOTO ExitFn 'and exit now
        END IF
        Slop = BufSize - SaveCR - 1 'calc buffer excess
        Remaining& = Remaining& + Slop 'calc file excess
        SEEK #FileNum, SEEK(FileNum) - Slop
      END IF
Process:
    LOOP WHILE CR 'while more in buffer
    Remaining& = Remaining& - BufSize
  LOOP WHILE Remaining& 'while more in the file
ExitFn:
  Reading = 0 'we're not reading anymore
  Done = -1 'show that we're all done
  CLOSE #FileNum 'final clean-up
END FUNCTION
As you can see, the BufIn function opens the file, reads each line of text,
and then closes the file and sets a flag when it has exhausted the text.
Even though this example shows BufIn being invoked in a DO loop, it can be
used in any situation where LINE INPUT would normally be used. As long as
you declare the function, it may be added to programs of your own and used
when sequential line-oriented data must be read as quickly as possible.
I don't think each statement in the BufIn function warrants a complete
explanation, but some of the less obvious aspects do. BufIn operates by
reading the file in 4K blocks in an outer loop, and each block is then
examined for a CHR$(13) line terminator in an inner loop that uses INSTR.
INSTR happens to be extremely fast, and it is ideal when used this way to
search a string for a single character.
The only real complication arises when only part of a line is in the
buffer, because that requires seeking backwards in the file to the start of
that line. Other, less important complications that also must be handled
arise from the presence of a CHR$(26) EOF marker, and a final string that
has no terminating carriage return.
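The inner loop and the partial-line case can be illustrated on a fixed
string. The sketch below is not part of BufIn; it uses a short hand-built
buffer in place of a 4K block read from disk, and the name CRLF$ is
introduced only for this example.

```basic
DEFINT A-Z
CR$ = CHR$(13)
CRLF$ = CR$ + CHR$(10)
Buffer$ = "one" + CRLF$ + "two" + CRLF$ + "thr"   'last line is cut off
BufPos = 1
DO
  CR = INSTR(BufPos, Buffer$, CR$)     'look for the next Return
  IF CR = 0 THEN EXIT DO               'none left in this buffer
  PRINT MID$(Buffer$, BufPos, CR - BufPos)
  BufPos = CR + 2                      'skip the Return and line feed
LOOP
PRINT "Left over: "; MID$(Buffer$, BufPos)
```

This prints one and two, then reports thr as left over. BufIn handles such
a fragment by seeking backwards so that the next block begins at the start
of the unfinished line.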
I have made every effort to make this function as bullet-proof as
possible; however, it is mandatory that every carriage return in the file
be followed by a corresponding line feed. Some word processors eliminate
the line feed to indicate a "soft return" at the end of a line, as opposed
to the "hard return" that signifies the end of a paragraph. Most word
processor files use a non-standard format anyway, so that should not be
much of a problem.
The last complete program I'll present here is called TEXTFIND.BAS,
and it searches a group of files for a specified string. TEXTFIND is
particularly useful when you need to find a document, and cannot remember
its name. If you can think of a snippet of text the file might contain,
TEXTFIND will identify which files contain that text, and then display it
in context.
'----- TEXTFIND.BAS
'Copyright (c) 1991 by Ethan Winer
DEFINT A-Z
TYPE RegTypeX 'used by CALL Interrupt
AX AS INTEGER
BX AS INTEGER
CX AS INTEGER
DX AS INTEGER
BP AS INTEGER
SI AS INTEGER
DI AS INTEGER
Flags AS INTEGER
DS AS INTEGER
ES AS INTEGER
END TYPE
DIM Registers AS RegTypeX 'holds the CPU registers
TYPE DTA 'used by DOS services
Reserved AS STRING * 21 'reserved for use by DOS
Attribute AS STRING * 1 'the file's attribute
FileTime AS STRING * 2 'the file's time
FileDate AS STRING * 2 'the file's date
FileSize AS LONG 'the file's size
FileName AS STRING * 13 'the file's name
END TYPE
DIM DTAData AS DTA
DECLARE SUB InterruptX (IntNumber, InRegs AS RegTypeX, OutRegs AS RegTypeX)
CONST MaxFiles% = 1000
CONST BufMax% = 4096
REDIM Array$(1 TO MaxFiles%) 'holds the file names
Zero$ = CHR$(0) 'do this once for speed
'----- This function returns the larger of two integers.
DEF FNMax% (Value1, Value2)
FNMax% = Value1
IF Value2 > Value1 THEN FNMax% = Value2
END DEF
'----- This function loads a group of file names.
DEF FNLoadNames%
STATIC Count
'---- define a new Data Transfer Area for DOS
Registers.DX = VARPTR(DTAData)
Registers.DS = VARSEG(DTAData)
Registers.AX = &H1A00
CALL InterruptX(&H21, Registers, Registers)
Count = 0 'zero the file counter
Spec$ = Spec$ + Zero$ 'DOS needs an ASCIIZ string
Registers.DX = SADD(Spec$) 'show where the spec is
Registers.DS = SSEG(Spec$) 'use this with PDS
'Registers.DS = VARSEG(Spec$) 'use this with QB
Registers.CX = 39 'the attribute for any file
Registers.AX = &H4E00 'find file name service
'---- Read the file names that match the search specification. The Flags
' registers indicates when no more ma