February 2003
Issues and Answers for Open Source
by Lynn Maxson
The Issues
The success of open source depends upon resolving three issues:
- the timely receipt of source contributions,
- how to increase the number of contributors, and
- how to maximize contributors' productivity.
Timely Receipt of Source Contributions
Both open source and closed source have the same two challenges. One,
to provide in terms of data types and operators the closest match
between the software, the Solution Set (SS), and the real world, the
Problem Set (PS): SS = PS. Two, to reflect changes in the Problem Set
in the Solution Set at least as fast as they occur:
dSS/dt >= dPS/dt.
Two stages occur in the lifecycle of software. The first, development,
occurs when no source yet exists. The second, maintenance, occurs when
source already exists. This says that the development stage ceases
when a change request in the Problem Set modifies existing source. In
effect development exists as the initial state of maintenance, lasting
only until the first change request impacts existing source.
We can freeze change requests from entering a development or maintenance
cycle. Doing this creates a backlog for the next cycle. Thus freezing
change requests runs counter to our need to have the solution set match
as closely as possible to changes in the problem set. This means we
need to implement changes as close as possible to their submission.
That excludes freezing along with tools and methods that cause it.
Every change request translates into a set of one or more
specifications. All these aggregate in a specification pool. From this
pool we extract a set representing a version of an application system,
i.e. one or more programs. If we can use the specification language as
our programming language, we can go directly to producing an executable
version of our application system.
We need to minimize the interval from time of entry of change request to
translation into specifications to their inclusion in an executable.
The shortest interval occurs when we have the same language for both
specification and programming.
The timely receipt of source contributions then depends upon the rate at
which we can translate change requests into specifications. To avoid
backlog means we must have the capability to translate faster than the
change rate. That in turn means having sufficient people resource to
allow parallel translation.
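
The backlog argument above can be sketched with simple arithmetic. This is a minimal illustration, not a staffing model: the function names and the figures (10 change requests per week, 3 translated per person per week) are assumptions made up for the example.

```python
import math

def contributors_needed(change_rate, per_person_rate):
    """Smallest number of parallel translators whose combined rate meets the change rate."""
    if per_person_rate <= 0:
        raise ValueError("per_person_rate must be positive")
    return math.ceil(change_rate / per_person_rate)

def backlog_over_time(change_rate, per_person_rate, people, periods):
    """Backlog at the end of each period; it never drops below zero."""
    backlog = 0.0
    history = []
    for _ in range(periods):
        backlog = max(0.0, backlog + change_rate - people * per_person_rate)
        history.append(backlog)
    return history

# Hypothetical figures: 10 change requests arrive per week,
# and one person translates 3 per week into specifications.
print(contributors_needed(10, 3))        # 4
print(backlog_over_time(10, 3, 3, 4))    # [1.0, 2.0, 3.0, 4.0]
print(backlog_over_time(10, 3, 4, 4))    # [0.0, 0.0, 0.0, 0.0]
```

With three people the backlog grows by one request every week; with four it stays at zero, which is the dSS/dt >= dPS/dt condition stated earlier.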
How to Increase the Number of Contributors
You cannot increase the number of people working in parallel beyond the
number you have. That number depends upon the change rate in the
problem set which the solution set must match or else fall behind, i.e.
create a backlog. That number depends also upon the change rate
capacity of each individual: the rate at which they can translate user
requirements into the different formats necessary to produce
executables.
Each format has a language associated with it. For each language we
have at least one tool which processes it. Each tool has a user
interface which the user must master to some degree to use properly. So
we have the language which the tool processes as well as the one we use
to communicate with the tool.
We use people to translate user requirements into specifications. In
this manner people act as the tool. If we can go directly from
specifications to executables, then we only need master one additional
language, that of the tool interface. So we have a minimum of three
languages, that used by the user requirements, that used in
specifications, and that used to interface with the tool.
If we assume that the people already know the language of the user
requests, then they need only learn two additional languages: one for
writing specifications; the other, for interfacing with the tool. In
pre-object-oriented programming we had the language of the user request,
of analysis, of design, of construction, and of testing. Further in
construction aside from the compiler we had language source and
interfaces with additional tools like make utilities, linkers,
debuggers, etc.
In object-oriented programming, analysis and design went from two
languages (dataflows and structure charts) to as many as fourteen in UML
(Unified Modeling Language). The principle of logical equivalence says
that nothing exists in all these different source language forms not
present in the specifications themselves. Moreover instead of using
people to translate specifications into UML, we have no reason not to do
this with software.
In the end we can increase the number of source contributors by reducing
the number of different sources to two: the informal language of user
requests and the formal language of specifications. This means we can
reduce the number of tools to one, that which translates source into
executables. That we require more in current methods says that we have
ignored already existing advances, not that we have something yet to
discover.
You increase the number of source contributors by minimizing the number
of different languages they need to learn. This in turn minimizes the
number of different tools to learn. This lowers the barrier to learning
to within the comfort zone of more people. Now you need only to
increase their productivity.
How to Maximize Contributors' Productivity
We measure productivity as the ratio of work-out over work-in. We can
increase productivity in two ways:
- increasing the work-out for the same work-in, or
- decreasing the work-in for the same work-out.
In either case we increase the ratio of work-out over work-in. In
developing and maintaining software our productivity gains or losses
occur with the tools we use. These tools include languages used,
software tools that process the languages, and the methods that combine
them into a process.
We gain in productivity by shifting more of the work-in from people to
our software tools that produce the work-out. Our basic guide for this
lies in letting people do what software cannot and software what people
need not. Basically this occurs by shifting as much of the clerical
processing as possible from people to software. No better example of
this exists than in the evolution of programming languages.
Evolution of Programming Languages
We have four generations of programming languages:
- actual or machine language,
- symbolic assembly plus macro,
- imperative higher level languages (HLLs), and
- declarative HLLs.
Actually the first three generations represent progressive forms of imperative
languages while the fourth generation carries the progression one step
farther from imperative to declarative. This progression comes from
reducing the "what" and "how" logic of imperative languages to the
"what" language of declaratives. This effectively shifts the writing of
the "how," the logical organization of the source, over to the software.
This not only reduces the amount of manual writing, but more
significantly that of rewriting, i.e. reflecting changes in the logical
organization.
The second generation, symbolic assembly plus macro, introduced the use
of mnemonics for actual op codes and symbolic operands. The first eased
the task of reading the source while the second allowed the resolution of
operand addressing to shift from data entry, the time of writing the
source, to compile time, after writing the source. The macro introduced
the instruction aggregate, using one instruction to replace the use of
several others. The first, shifting of the binding time from data entry
to compile, significantly reduced the amount of rewriting while the
second reduced the amount of writing necessary.
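
To illustrate the shift in binding time, here is a toy two-pass assembler written in Python. It is only a sketch: the mnemonics (DEC, JNZ, JMP, NOP, HALT) and the one-address-per-instruction layout are invented for the example and belong to no real machine.

```python
def assemble(lines):
    """Two-pass toy assembler: pass 1 records label addresses, pass 2 substitutes them."""
    symbols = {}
    instructions = []
    for line in lines:
        text = line.strip()
        if text.endswith(":"):
            # A label binds to the address of the next instruction.
            symbols[text[:-1]] = len(instructions)
        elif text:
            instructions.append(text.split())
    resolved = []
    for op, *operands in instructions:
        resolved.append((op, *[symbols.get(x, x) for x in operands]))
    return resolved

program = ["start:", "DEC", "JNZ start", "JMP done", "NOP", "done:", "HALT"]
print(assemble(program))
# [('DEC',), ('JNZ', 0), ('JMP', 4), ('NOP',), ('HALT',)]

# Inserting an instruction moves "done", yet no operand is rewritten by hand.
longer = program[:4] + ["NOP"] + program[4:]
print(assemble(longer)[2])   # ('JMP', 5)
```

With actual addresses in the source, the inserted instruction would have forced the programmer to rewrite the operand of the jump.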
While the second generation maintained the machine-dependency of the
first, third generation took advantage of the second's macro form to
introduce the use of machine-independent expressions. These occur in an
assignment statement, e.g. "a = b + c;". This effectively says
evaluate the expression "b + c" and assign the result to "a", i.e.
replace the value of a. HLLs like APL use the left arrow ("<-") instead
of the equal symbol for assignment, e.g. "a <- b + c".
While the second generation introduced an instruction aggregate, the
third generation introduced the data aggregate in the forms of arrays,
structures, and lists. Some programming languages, notably APL and
PL/I, offered operator support for aggregate operands. This shifted
responsibility for writing the underlying element by element processing
from the programmer to the software.
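
A small Python sketch of the same idea follows. The Vector class is hypothetical, written only to show an operator accepting aggregate operands; APL and PL/I provide this in the language itself.

```python
class Vector:
    """Minimal data aggregate whose '+' applies element by element."""

    def __init__(self, *items):
        self.items = list(items)

    def __add__(self, other):
        if len(self.items) != len(other.items):
            raise ValueError("length mismatch")
        return Vector(*(x + y for x, y in zip(self.items, other.items)))

    def __eq__(self, other):
        return isinstance(other, Vector) and self.items == other.items

    def __repr__(self):
        return f"Vector{tuple(self.items)}"

b = Vector(1, 2, 3)
c = Vector(10, 20, 30)

# Element-by-element loop written by the programmer.
a_loop = []
for i in range(len(b.items)):
    a_loop.append(b.items[i] + c.items[i])

# Aggregate form: one assignment, the looping lives in the software.
a = b + c
print(a)   # Vector(11, 22, 33)
```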
The third generation also introduced an alternate to the compiler: the
interpreter. The interpreter combines the functions of the editor and
compiler into a single tool with a single interface. This makes it
intrinsically an IDE (Integrated Development Environment), rather than
the "add-on" an IDE becomes when used with a compiler.
The fourth generation introduced a significant paradigm shift from
imperative to declarative mode. We need to ensure a better
understanding of both modes to make the differences distinct.
The Imperative Mode
In grammar we associate the imperative mode with the issuing of commands
like "come here," "shut up," "step back" and so on. Machine
instructions have the same imperative purpose in stating what action
takes place. The difference between English and machine imperative
grammar lies in the details. You can tell someone to "Go to the store.
Buy some bread. Return home." We expect that someone is then capable of
determining how to execute these three commands. No such luck in
programming.
A machine only executes low-level instructions. We must translate any
higher level instruction like "Go to the store" into an ordered set, a
sequence, of low-level instructions (or commands). The machine follows
them exactly as written. While each instruction has its own internal
(local) logic relative to its successful execution, the global logic
falls entirely on the programmer.
In first and second generation languages, machine and symbolic assembly,
the programmer sequences the global logic on an
instruction-by-instruction basis. The writing process incorporates this
global logic. For that reason we have said this form exists as "logic
in programming," i.e. part of the writing process.
In third generation languages we move from instruction-by-instruction
logic to one based on statements: statement-by-statement. Even though
we now have an enhanced macro facility allowing us to write assignment
statements like " a = (b + c)/(d - e);", the programmer still determines
the global logic covering the sequencing of such statements. If a
change in global logic occurs, the programmer must do the necessary
rewriting.
To ease this task somewhat and to move away from "spaghetti" code, the
early 1970s saw the introduction of structured programming. This
introduced the concept of "control structures" which in terms of control
flow had the common connection feature of "one in/one out." In theory
then these control structures of sequence, decision, and iteration
became pluggable or reusable units in a manner similar to the "one
in/one out" interface of subroutines.
The global logic then involved the sequencing or ordering of these
control structures. If something necessitated a change in the global
logic, the programmer had to make the changes to the ordering of the
control structures. As these normally did not exist as reusable
components, i.e. named files within an %include statement, this meant
physically rewriting the source.
Understand that we had a capability of reuse of control structures that
was never implemented as such. The question is, "Why?" The answer lies
in the use of a compiler and editor as separate processes. In an editor
you write a %include statement, but the actual loading of the file, i.e.
its inclusion, does not occur until later during compilation. In theory
we could implement each control structure as a separate file, creating
each in turn as a separate process with the editor. Having done this we
could once more use our editor to create a source program file
containing no more than %include statements arranged in the proper
sequence.
This means, of course, that unless we opened up separate instances of
the editor to allow us to view the files, which we nevertheless now had
to piece together in our heads, the only time we would see the source in
its entirety occurred after the compile. We could, of course, enhance
the editor by allowing it to recognize the %include statement on input,
i.e. from within a file, or upon data entry from the keyboard. It
could then do as the compiler does about retrieving the named file and
inserting it in place within the source.
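
What such an editor enhancement might do can be sketched in a few lines of Python. This is an assumption-laden illustration: a dictionary stands in for the file system, the file names are invented, and only the simple form "%include name;" is recognized.

```python
def expand_includes(name, files, _active=()):
    """Return the lines of files[name] with every '%include other;' replaced in place."""
    if name in _active:
        raise ValueError(f"circular %include of {name!r}")
    expanded = []
    for line in files[name].splitlines():
        words = line.strip().rstrip(";").split()
        if len(words) == 2 and words[0] == "%include":
            # Retrieve the named file and insert its lines at this point.
            expanded.extend(expand_includes(words[1], files, _active + (name,)))
        else:
            expanded.append(line)
    return expanded

# Hypothetical source files, one control structure per file.
files = {
    "main": "%include read_input;\n%include compute;\n%include report;",
    "read_input": "get list (b, c);",
    "compute": "a = b + c;",
    "report": "put list (a);",
}
print("\n".join(expand_includes("main", files)))
```

The programmer maintains only the short "main" file and the individual pieces, while the editor displays the whole source as the compiler would see it.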
Unfortunately editors know nothing of compilers and vice versa. Perhaps
no better example of this exists than the LPEX editor IBM packages with
its VisualAge products. LPEX (for Live Parsing Extensible Editor) performs
syntax checking on opening a file or data entry, using a colorization
process to denote the parsed elements by type, e.g. variable, constant,
operator. As LPEX can parse and thus recognize a %include statement, it
should not take more effort to then load the file named in the %include
statement and parse it as well. In this way the entire source would
appear as it would appear after the compile.
Having parsed the source statements, i.e. performed syntax analysis,
the editor could on completion of each statement (as well as on
completion of parsing all statements) submit the parsed elements for
semantic analysis. This would allow the programmer to check for
spelling errors, any one of which will cause a severe error during
compilation, causing it in effect to abort. Then on completing semantic
analysis it could submit the results for code generation, eliminating
the need for a separate compilation step. Putting all these functions
within the editor makes it an IDE without implementing it as another
function within the edit-compile-link-execute package.
Before leaving this we should note that what we have described here
represents a tool, not a language failure. It is our tools which place
restrictions on our use of what the language makes available.
The Declarative Mode
No better example of the advantages of fourth generation, declarative
languages exists than SQL. It is probably used more than any other
language worldwide by people who don't regard what they write as
programming. It has an imperative form, e.g.
"SELECT...FROM...WHERE...", from the outside where the "programmer"
determines the order of the output (SELECT clause), the source(s) of the
input (FROM clause), and the conditions under which selection occurs
(WHERE clause). The "programmer" then leaves it up to the database
manager to supply all the logic necessary to satisfy the query.
If we change anything of what we want to see or the order in which we
want to see it, of which input sources to use, or conditions to apply,
the database manager ignores what it did with the previous query in
satisfying the new one. In short, it generates the code on an individual
query basis.
It does this through performing the three steps of syntax analysis,
semantic analysis, and proof theory. In imperative languages the proof
theory consists of code generation only. In declarative the proof
theory decomposes into two stages: a completeness proof and an
exhaustive true/false proof. The completeness proof verifies that it
has enough information to satisfy the request. It returns "false" if it
doesn't; otherwise "true". If "true", it then imposes a logical
organization based on the information. It then generates the code based
on this logical organization.
At this point it has an executable form: the "end result" of a "true"
completeness proof. It then enters the second stage: the exhaustive
true/false proof. Here it evaluates each instance of input data, in
this instance each row of the table(s) specified in the FROM clause,
according to the conditions set out in the WHERE clause now incorporated
within the executing logical organization. It includes each "true"
instance, one or more, in the output according to the specifications of
the SELECT clause.
It may happen that no "true" instance exists. In that case the query
returns "false": no "true" instances. It cannot do this without having
evaluated every input instance, i.e. each row in the table. Thus the
description of "exhaustive true/false proof."
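
The query behavior described above can be tried with Python's standard sqlite3 module and an in-memory database. The parts table and its rows are invented for illustration. Note that sqlite3 reports "no true instances" as an empty result list rather than as a literal "false".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (name TEXT, qty INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO parts VALUES (?, ?, ?)",
    [("bolt", 120, 0.10), ("nut", 40, 0.05), ("gear", 7, 4.50)],
)

# The query states what is wanted; the database manager decides how to find it.
in_stock = conn.execute(
    "SELECT name, qty FROM parts WHERE qty > 10 ORDER BY name"
).fetchall()
print(in_stock)    # [('bolt', 120), ('nut', 40)]

# No row satisfies the condition, so the result is empty.
none_found = conn.execute(
    "SELECT name FROM parts WHERE qty > 1000"
).fetchall()
print(none_found)  # []
```

Nowhere does the "programmer" write a loop over rows or a comparison per row; both come from the database manager.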
Now SQL by its name, Structured Query Language, is not free form. Every
SQL statement has a specifically ordered form. Other declarative
languages, e.g. Prolog or Trilogy, have no such restrictions. They
essentially allow the input source statement groups to appear in any
order, i.e. unordered. Thus the programmer does not have to apply a
logical organization, i.e. global logic, on the input. That
responsibility for logical organization transfers from the programmer to
the software as part of its completeness proof.
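
One way to picture software imposing the logical organization is a dependency sort. The sketch below is not how Prolog or Trilogy work internally; it is a simplified stand-in, assuming each statement names one target and the names it uses. It relies on graphlib from the Python standard library (Python 3.9 or later), and the statement names are invented.

```python
from graphlib import TopologicalSorter, CycleError

def organize(statements, given):
    """Return the statement targets in an executable order, or None if incomplete.

    statements maps each target name to the names its expression uses.
    given is the set of names supplied as input.
    """
    known = set(statements) | set(given)
    for target, uses in statements.items():
        if not set(uses) <= known:
            return None          # "false": a needed name is never defined
    graph = {t: [u for u in uses if u in statements]
             for t, uses in statements.items()}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None              # circular definitions cannot be ordered

# Statements entered in no particular order.
statements = {
    "total": ["net", "tax"],
    "tax": ["net", "rate"],
    "net": ["gross", "deductions"],
}
print(organize(statements, {"gross", "deductions", "rate"}))
# ['net', 'tax', 'total']
print(organize(statements, {"gross", "deductions"}))
# None
```

The first call plays the part of a "true" completeness proof and yields an order; the second lacks "rate" and so returns the equivalent of "false".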
This allows for named control structures on input. With the exception
of COBOL with its paragraph reuse through the PERFORM verb, almost all
users of other imperative languages regard such reuse through the
%include statement as impractical. Yet such reuse is inherent in
declarative languages other than SQL.
Basically this means that the programmer needs only to concern himself
with the local or internal logic of a control structure. Now control
structures themselves may contain other control structures or at least
references to them. However, the programmer need only concern himself
with the logical organization of the "containing" control structure,
having presumably done the same for each "contained" control structure.
Now as control structures contain logic their application implies rules.
Declarative languages allow the explicit declaration of rules separate
from explicitly written control structures. For example, suppose we
have the following data declaration:
dcl easy fixed bin (31) range (0...99999);
This specifies that easy is a fixed binary number, 31
bits plus sign, whose only allowable, i.e. "true," values range from 0
to 99999. The responsibility for applying this rule in every possible
instance lies with the software and not with the programmer.
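
The effect of such a declaration can be approximated in Python with a descriptor. This is only an analogy under stated assumptions: the Ranged and Record names are hypothetical, and the check happens at run time on assignment rather than through a language-level rule.

```python
class Ranged:
    """Descriptor that rejects any assignment outside [low, high]."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def __set_name__(self, owner, name):
        self.name = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return getattr(obj, self.name)

    def __set__(self, obj, value):
        if not isinstance(value, int) or not self.low <= value <= self.high:
            raise ValueError(f"{value!r} is outside {self.low}..{self.high}")
        setattr(obj, self.name, value)

class Record:
    # Mirrors: dcl easy fixed bin (31) range (0...99999);
    easy = Ranged(0, 99999)

    def __init__(self, easy):
        self.easy = easy

r = Record(42)
r.easy = 99999                   # allowed
try:
    r.easy = 100000              # rejected by the rule, not by a hand-written check
except ValueError as err:
    print(err)                   # 100000 is outside 0..99999
```

The rule is stated once, with the data, and applies to every assignment wherever it occurs in the program.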
By explicitly incorporating the rules governing data usage within the
declaration of that data, some hundreds and thousands of which may occur
affecting the different applications used within an enterprise, we give
to the software, which is not subject to memory lapses or ignorance, the
responsibility for their application. We have the net effect of
increased granularity of reuse down to the statement level with software
assistance as well as enforcement. This does not exist in imperative
languages, even object-oriented ones that boast reuse as a major
feature.
Progress, not Regress
We have to ask ourselves, "Why?".
Considering what we had achieved in third generation languages like APL,
LISP, and PL/I prior to 1970, why did we ever regress to a starting
level of C? Why in the 30+ years since have we not reached the same
capability we achieved in the ten years prior? That includes the heavy
investment in object-oriented technology and languages like C++ and
Java. It's not that we didn't try. Much of C++ tried to address
deficiencies in C. Even so, it still has not reached the level of PL/I
prior to 1970.
If we value reuse so highly, why do we not accept granularity down to
the statement level? Why do we insist on involving people in
implementing rules when we have software that will do it for us,
providing physical replication wherever needed without failure?
We could make the list of such questions considerably longer. After we
have worked our way through them it will not turn out that our
languages have failed us. We can change them and at the same time
ensure full backward compatibility: no conversion necessary. Instead
our problems lie with our software toolset. Our editors. Our compilers.
We have four generations of programming languages. Each successive
generation has provided productivity gains over the previous. Each time
those gains occurred by shifting the clerical effort from people to
software. We went from the low level instruction writing of machine
language to the introduction of the higher-level macro in symbolic
assembly. We went from the macro to the higher-level assignment
statement of third generation. We went from ordered input of third
generation to unordered of fourth, allowing software instead of people
to construct the global logic. In doing so we saw our granularity of reuse
extend down to the statement and control structure level.
We know that a programmer can make three types of errors, excluding, of
course, the decision to become a programmer. Errors of syntax. Errors
of semantics (spelling). And errors of logic. The first two we can
catch at data entry. The last during testing. We know that operating
interpretively we can test closer to the point of data entry than we
can in a following compile step. Moreover, operating interpretively we
can test segments in isolation regardless of the presence of other
source. We can do it without even having a minimal complete program as
required when using a compiler.
Speaking of testing, we have a form of fourth generation languages using
predicate logic. This differs from the clausal logic used in SQL, which
requires the existence of test data, i.e. the table(s) referenced in
the FROM clause. Predicate logic allows us to perform an exhaustive
true/false proof on code from the statement level on up using
automatically created test data based on data value rules assigned to
each variable. This again shifts clerical effort from people to
software.
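
A rough Python sketch of this kind of testing follows. It assumes the value rules are small integer ranges so that every combination can be enumerated; the function and variable names are hypothetical, and a real predicate-logic system would do far more than brute-force enumeration.

```python
from itertools import product

def exhaustive_proof(predicate, rules):
    """Check predicate against every combination of values the rules allow.

    rules maps a variable name to a range of allowed values.
    Returns (True, None) or (False, first_counterexample).
    """
    names = list(rules)
    for values in product(*(rules[n] for n in names)):
        case = dict(zip(names, values))
        if not predicate(**case):
            return False, case
    return True, None

# Data value rules: b ranges over 0..99, c over 1..99.
rules = {"b": range(0, 100), "c": range(1, 100)}

# Code under test: q = b // c, r = b % c.
# Claim: b == q * c + r and 0 <= r < c.
def division_holds(b, c):
    q, r = b // c, b % c
    return b == q * c + r and 0 <= r < c

print(exhaustive_proof(division_holds, rules))
# (True, None)

# A faulty claim is refuted by the first generated case that fails.
print(exhaustive_proof(lambda b, c: b // c < 50, rules))
# (False, {'b': 50, 'c': 1})
```

No one wrote test data by hand here; the 9,900 cases come from the rules attached to b and c.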
So how do we maximize contributors' productivity? By improving our
toolset with proven technology.
The Answers
We need to move to fourth generation languages. In doing so we need to
retain the assignment statement of the third. This allows us to have
full imperative capability within a declarative language. With that we
have the capability of completely writing the tools for the language in
the language.
This becomes more meaningful when we recognize that Intel in volume 2
(Instruction Set) of its Pentium reference manuals offers three
generational forms (1:1:1) for each instruction: machine, symbolic, and
HLL. This means in essence that we can include machine or assembly
language capability in an HLL form, requiring only a simple translation
to machine language during code generation.
Having then defined our fourth generation, declarative language, we can
then look at our toolset written entirely with it. We know from our
experience with interpreters that we need only one tool and one
interface. Thus only one tool language to learn.
We also know that interpreters and compilers share the same syntax and
semantic analysis, differing only in their proof theory (code
generation). In our fourth generation language prior to the code
generation phase of the completeness proof we can indicate which of the
two executable forms we desire. In effect we incorporate the complete
compiler functions within an interpreter.
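
The shared front end with two back ends can be sketched in Python using the standard ast module for syntax analysis. This is a toy under stated assumptions: it handles only names, constants, and the operators +, -, and *, and the stack-machine instruction names (LOAD, PUSH, ADD, SUB, MUL) are invented.

```python
import ast

def front_end(source):
    """Shared syntax analysis: parse once into a tree."""
    return ast.parse(source, mode="eval").body

OPS = {
    ast.Add: ("ADD", lambda x, y: x + y),
    ast.Sub: ("SUB", lambda x, y: x - y),
    ast.Mult: ("MUL", lambda x, y: x * y),
}

def interpret(node, env):
    """Back end 1: evaluate the tree directly."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        return env[node.id]
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        apply = OPS[type(node.op)][1]
        return apply(interpret(node.left, env), interpret(node.right, env))
    raise SyntaxError("unsupported construct")

def generate(node):
    """Back end 2: emit instructions for a hypothetical stack machine."""
    if isinstance(node, ast.Constant):
        return [("PUSH", node.value)]
    if isinstance(node, ast.Name):
        return [("LOAD", node.id)]
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return generate(node.left) + generate(node.right) + [(OPS[type(node.op)][0],)]
    raise SyntaxError("unsupported construct")

tree = front_end("(b + c) * 2")
print(interpret(tree, {"b": 3, "c": 4}))   # 14
print(generate(tree))
# [('LOAD', 'b'), ('LOAD', 'c'), ('ADD',), ('PUSH', 2), ('MUL',)]
```

Both back ends consume the same tree, so choosing between immediate execution and generated code is a decision made after analysis, not a reason for two separate tools.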
Take a look at the historical five stages of development:
- specification,
- analysis,
- design,
- construction, and
- testing.
Using a fourth generation language shifts responsibility for analysis,
design, and construction to the software as part of the completeness
proof. As the completeness proof if "true" contains the global logic
organization of the unordered source, we can use this now ordered source
as the input to our CASE tools now integrated within our
interpreter/compiler tool. This means we have only one source, the
unordered input, we have to maintain. The software tool then generates
the ordered source and from this can produce any of the CASE outputs:
flowcharts, structure charts, UML documents.
This capability essentially eliminates the principal arguments for
recent RAD (Rapid Application Development) approaches like Extreme
Programming and Agile Modeling. These argue against having to produce
and maintain separate source files for each document as done currently
by CASE tools. The argument disappears once you have only one source
with a choice of generated outputs. This ability makes every RAD slow
by comparison.
Finally the use of the exhaustive true/false proof of declarative
languages in conjunction with predicate logic to automatically generate
any range of test data allows a level of testing not possible with the
current system of beta testing regardless of the number of beta testers
(users). This not only eliminates the need for beta testers, but
provides more immediate feedback of verified ("true") results.
We can begin with any third generation language, any open source editor,
and any open source compiler. We can incrementally enhance each, using
each step along the way as empirical proof of increased productivity.
In short we can take full advantage of known technology and then build
from there: add some technology of our own.
The Southern California OS/2 User Group
P.O. Box 26904
Santa Ana, CA 92799-6904, USA
Copyright 2003 the Southern California OS/2 User Group. ALL RIGHTS
RESERVED.
SCOUG, Warp Expo West, and Warpfest are trademarks of the Southern California OS/2 User Group.
OS/2, Workplace Shell, and IBM are registered trademarks of International
Business Machines Corporation.
All other trademarks remain the property of their respective owners.