Large scale programming

This is a comparatively new section, inspired by a technique I have been trying out. It is for the programmer who is already very comfortable with ksh programming, and would like to code something complex.

In addition to making large scale coding manageable, the techniques described on this page also make it maintainable in the long term.
When you are writing something important that you know is going to be around for years, it is a really, really good idea to write it in a way that will be easy to debug, test, and modify a year from now, when you've forgotten most of its innards.

Shellscripting for large tasks???

Some people like to code large shellscripts in perl; however, it is a mistake to choose a language based solely on the size of the task. Perl is a good choice for tasks that involve fine-grained data twiddling, without calling external programs.
In contrast, Korn shell (and perhaps other related shells such as bash) is a good choice for tasks that have a high-level, "user command" type of flow, where you call external programs a lot.

If you find yourself needing to do a LARGE amount of this kind of work, the size of the task is not in itself a reason to move to another language. It is, however, a reason to start with a drastically different mindset.

The mindset that is most useful is that of the professional, traditional programming-language programmer, who breaks up a program into separate modules.
Since such a mindset is the subject of full semester-long courses, I won't be able to do it justice on a single page! However, I can at least describe a basic framework by which you can exercise this mindset within the bounds of ksh programming.

Source Code Control

First of all, if you are going to be writing multiple thousands of lines worth of code, it may be time for you to consider using some kind of Source Code/Version Control system. I won't waste my time claiming there is one best one; use whatever is right for you. Any of (RCS, SCCS, CVS, git, subversion, mercurial, ??) may be suitable. The important choice is not so much which one to use, as whether to use one at all.

If nothing else, version control becomes useful when you wish to do "releases" of your software.
It also makes it convenient to track down a stupid one-line error that has crept in unexpectedly since the last commit.
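For example, if you happen to use git (the other systems all have equivalents), spotting and undoing such a mistake takes only a couple of commands; the file name here is just illustrative:

git diff                       # show every change since the last commit
git checkout -- somefile.ksh   # discard the accidental change to that one file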

Lastly, it can be a good safeguard against an accidental "rm somespec_ *.ksh" (note the unexpected space). Backups can serve this purpose... IF you have them! But even then, losing a full day's work between backups can be very demotivating.

Modularity

Those who have written in a more traditional "programming language" will be familiar with the concept of modules, include files, etc.

Modules take the concept of functions a step further. Rather than just grouping a set of commands together, modules group sets of functions (and associated values) together. On this page, I will treat separate files as "modules".

ksh, and other sh-derivatives, have the dot command "." as a way to "source" (include) other files. This is roughly equivalent to the "#include" directive in many other languages. Most commonly, it is used as a convenience to slurp in a configuration file that contains nothing but variable settings. That being said, there is no reason not to use it to include full functions as well.
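A tiny illustration (all the file names here are made up for the example):

# settings.conf -- nothing but variable settings
LOGDIR=/tmp
MAXTRIES=3

# funcs.ksh -- nothing but function definitions
log_msg() {
	print "$(date): $*" >> $LOGDIR/myprog.log
}

# main script -- pull both in with the dot command, then use them
. ./settings.conf
. ./funcs.ksh
log_msg "starting up"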

But why do this?
For sanity, and debugging purposes.

Sanity

Let's be honest: it is tough to meaningfully peruse a 1,000+ line shellscript in its entirety. Yet as a script grows larger, it is in a way even MORE important to do so! It is crucial to ensure that a change in one place does not adversely affect code in other places.

In contrast, if you are disciplined enough to subdivide your code so that, for example, all your file manipulation code is in a single file, you then have a much smaller target to check through when you make a change that affects "file manipulation" in your program.

Debugging

In more complex programs, it becomes very important to test individual sub-functions of your program. Manually running through a program that may have 10 or more different branches is impractical. You need "unit testing".

In smaller scripts, it is easy to just make a quick copy and abuse the copy for quick-and-dirty testing, cutting out pieces unnecessary for the test willy-nilly. In larger, more complex programs, there is a higher danger that such an approach will end up accidentally cutting out a piece that is crucial to normal operation of the program.

But what if the program is already "cut up" into pre-planned, self-contained sections?
From these "modular sections", you can then pick and choose just the ones you need, and jump right into the routines you need, with a good degree of confidence that it is safe to do so.
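For example, using the trivialized files from further down this page, a quick interactive debugging session might look like this:

$ . ./prog_status     # pull in only the module under scrutiny
$ get_status          # run a single routine by hand
$ print $?            # and check what it returned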

As an even further benefit, once you have modules, it is then easier to write task-specific test harnesses. For a large program, you might have a collection of scripts just for testing, that you keep around with the program itself. Then, when you make a change to a particular section, you can re-run the test script to verify that you have not broken anything in that module.

Important limits

NB: make sure that your "modules" contain only functions, and perhaps a few variable assignments. There should be no top-level running code. Doing
. prog_somemodule
should not actually DO anything, besides defining functions and variables.
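As a skeleton, a well-behaved module might look something like this (the names are made up for illustration):

# prog_examplemodule -- safe to "." at any time
EXAMPLE_RETRIES=${EXAMPLE_RETRIES:-3}	# default value only; the caller may override

example_task() {
	print "doing the example task, with up to $EXAMPLE_RETRIES retries"
}

# note: no top-level commands here -- nothing actually runs until example_task is called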

Trivialized example

Here is something to look at, that demonstrates the principles above. It is supremely non-useful, other than hopefully being enlightening :-)

Top level 'prog'
#!/bin/ksh -p

CODEDIR="."
. $CODEDIR/prog_status
. $CODEDIR/prog_mungedata

get_status
if [[ $? -eq 0 ]] ; then
	rotate_data
fi

'prog_status'
# start of prog_status
get_status() {
	# the [a]pache trick keeps grep from matching its own entry in the ps output
	ps -ef | grep '[a]pache' > /dev/null
	if [[ $? -eq 0 ]] ; then
		return 0
	else
		return 1
	fi
}

'prog_mungedata'
# start of prog_mungedata

# rotate_data: note you should probably call get_status before this,
# to ensure program is not active before log rotation
rotate_data() {
	mv -f /var/log/apache.1 /var/log/apache.2
	mv -f /var/log/apache /var/log/apache.1
}

'tester'
#!/bin/ksh -p

# Test harness for 'prog' routines.
# right now, we just test to validate get_status works properly

CODEDIR="."
. $CODEDIR/prog_status

get_status
if [[ $? -eq 0 ]] ; then
	print According to get_status, program is running now
else
	print According to get_status, program is NOT running now
fi

Final notes: includes and Makefiles

It is important to note that there are lots of ways to handle final delivery of the code. In the trivial examples above, the code was sourced from the current directory. But in the real world, that is not good practice. It would be better to source it from CODEDIR=/opt/prog/libexec or someplace like that.
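One simple approach is to make that installed location the default, while still allowing an override during development. A minimal sketch for the top of 'prog':

# default to the installed location, but allow
#    CODEDIR=. ./prog
# to run against the copies in the current directory while developing
CODEDIR=${CODEDIR:-/opt/prog/libexec}
. $CODEDIR/prog_status
. $CODEDIR/prog_mungedata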

There is yet another option, however: you can keep the different modules split up for coding, but for delivery, do the "include" yourself, by concatenating the files into a single shellscript before deployment.
In a sense, this can be considered akin to "linking" object files to deliver a single executable object.
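In its crudest form, that "link" step can just be cat. A sketch, where prog_header (holding only the #!/bin/ksh -p line) and prog_main (holding the top-level commands, minus the dot-includes) are made-up names for this layout:

# crude "link" step: header first, then the modules, then the top-level commands
cat prog_header prog_status prog_mungedata prog_main > prog
chmod +x prog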

Since "cpp" does not like '#' style comments, (but you'll want to use '#' style comments in your shell programs!!) here's a quick Makefile example of how to do this sort of thing easily:

yourprog:	main.ksh incfile1 incfile2
	awk '$$1 == "AWKinclude" {system ("cat "$$2);next;} \
	{print}' main.ksh > $@
Then in your main.ksh, use "AWKinclude incfile1" instead of the cpp style of "#include <incfile1>"
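So a hypothetical main.ksh for that rule might look like this (main_routine is just a made-up stand-in for whatever your top-level code does):

#!/bin/ksh -p
# main.ksh -- each AWKinclude line below is replaced by the contents
# of the named file when "make yourprog" runs, yielding one flat script
AWKinclude incfile1
AWKinclude incfile2

main_routine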

For a fancy, useful example of this style, you can look at my zrep "source code" directory.


This material is copyrighted by Philip Brown, © January 2002-2012