EBCDIC -vs- The World

My good friend and former co-worker Mike related today his struggles with 'autoconf' on MVS.
He has a better grasp of cross-platform issues, like 'make' logic that works as well on z/OS as on Unix, than most people I know personally. Mike tells me the './configure' step works fine, but that a specific package using it refuses to support EBCDIC and it sounds like a religious matter. [sigh]

When I first encountered "Open Edition" (as it was called then), I was delighted and dismayed.
First I launched a shell and found all those Unix commands that I had seen on other platforms. But when I brought in a TAR file with my own bag of tricks, it failed. The archive was intact, but my scripts crashed. When I tried to eyeball one of them, I got garbage. Then it hit me: they were all in ASCII. More significantly, the system was EBCDIC. Duh!
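For anyone who has not seen this first hand, here is a minimal sketch of the effect. It is in Python only because Python ships with an EBCDIC codec, so it can be tried on an ASCII machine. The choice of `cp037` as the code page and the two-line script are assumptions for illustration; they are not the files or the code page from the story above.

```python
# A small ASCII shell script, as it might sit inside a TAR file.
# (Hypothetical content, for illustration only.)
script = "#!/bin/sh\necho hello\n"
ascii_bytes = script.encode("ascii")

# What an EBCDIC system makes of those same bytes when it treats them
# as text. cp037 is one common EBCDIC code page (an assumption here).
misread = ascii_bytes.decode("cp037")
print(repr(misread))  # unreadable: the bytes are intact, the meaning is not

# The fix is a conversion, not a repair: interpret the bytes as ASCII,
# then write them back out in the local encoding.
ebcdic_bytes = ascii_bytes.decode("ascii").encode("cp037")
print(ebcdic_bytes.decode("cp037") == script)
```

The archive really was intact. Nothing was damaged in transit; the same bytes simply mean something else under a different character set.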

I assumed what so many others assume: If it's Unix, it's ASCII. But I was wrong.
It took several months before I could accept that OVM and OMVS being EBCDIC was not only okay, but was and is "the right thing". But developers who do not know our beloved mainframe environments have not walked this path and may react against it. (As the authors of this package Mike is wrestling with appear to have done.)

The designers of the POSIX specification and of the X/Open Unix brand were very careful about what is defined and where, what is required and how. Just what makes a system Unix? For ten years, MVS has passed the test and is Unix branded. But surely none of us expect "a Unix person" to accept MVS that way. The single biggest difference between OpenMVS and "real Unix" is the character set. It is a curse and a blessing.

Let me first mention the blessing. CMS and z/OS, even with a POSIX face, must be EBCDIC for the sake of the legacy code they run. For all its faults, this is one place where IBM is exceptional: it supports historical workloads. (It does so better than a certain other vendor of operating systems which shall remain nameless.) The old code works. But the old code uses EBCDIC for character representation. After chewing on this for more than half a year, I realized that it must be so for the POSIX side as well, or there would be grossly confusing results.

In theory, the character set should be as easily replaced as most other aspects of the system. (For example, we let users run CSH rather than insisting on the Bourne shell, which has grave consequences if they want to do scripting.) In practice, the character set is more deeply entrenched. When moving from one Unix to another, the theory was "just recompile". In practice, we know it doesn't work so smoothly. This is bad. This is sad!

Programmers make assumptions. I know: I'm a programmer, so I'm just as guilty. There are ways to render any application source "character set agnostic". Such techniques take time and practice. Is it worth the hassle? Yes! Today, the unnamed authors of the unidentified package Mike is wrestling with reject EBCDIC. It's not so much that they can't as that they won't. What is heartbreaking is that they have already done the tough part: they deal with differing character encodings. For them, supporting EBCDIC would hardly be an extra mile (IMHO), and their attitude paints them into a corner where they'll have trouble with any new-and-wonderful encoding yet to be devised.
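One classic assumption is that the letters 'a' through 'z' occupy consecutive code points, so a range test is enough to spot a lowercase letter. That holds in ASCII but not in EBCDIC, where the alphabet is split into separate runs. The sketch below compares the range test with a test by meaning. It is Python only because Python ships with an EBCDIC codec; `cp037` and the function names are my own choices for illustration. In C, the equivalent of the second approach would be to call `islower()` from `<ctype.h>` rather than comparing against 'a' and 'z'.

```python
import string

def lowercase_by_range(codec):
    """Byte values that fall between 'a' and 'z' in the given encoding."""
    first = "a".encode(codec)[0]
    last = "z".encode(codec)[0]
    return set(range(first, last + 1))

def lowercase_by_meaning(codec):
    """Byte values that really are the letters a..z in the given encoding."""
    return {ch.encode(codec)[0] for ch in string.ascii_lowercase}

def false_hits(codec):
    """Bytes the range test accepts that are not lowercase letters."""
    return lowercase_by_range(codec) - lowercase_by_meaning(codec)

# cp037 is one common EBCDIC code page (an assumption for this sketch).
for codec in ("ascii", "cp037"):
    print(codec, "range test wrongly accepts", len(false_hits(codec)), "byte values")
```

The range test is not wrong on an ASCII system; it is merely right by accident. Asking what a character is, rather than where it sits, works on both.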

Thankfully, compiler writers tend to be more disciplined than the rest of us. The foundation is strong: any special character is represented by a well defined and always expected meta-character or escape sequence. Notably, newline is always coded as "\n", never as 0x0A. Even the most ASCII-entrenched Unix hack will chastise the programmer who uses the hexadecimal value rather than the symbolic escape. We all need to be more consistent.
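The newline case is easy to demonstrate. The sketch below counts lines in some EBCDIC-encoded text twice: once looking for the hard-coded byte 0x0A, and once asking the encoding what "\n" is. Python and its `cp037` codec are assumptions made so this can run anywhere, and the byte that this codec assigns to "\n" may not be the one a z/OS compiler uses. The only point is that it is not 0x0A.

```python
# cp037 is one common EBCDIC code page; using it is an assumption.
CODEC = "cp037"

text = "first\nsecond\nthird\n"
data = text.encode(CODEC)

# Hard-coded: assumes a newline is always the byte 0x0A.
hardcoded_count = data.count(b"\x0a")

# Symbolic: let the encoding say what a newline is.
newline = "\n".encode(CODEC)
symbolic_count = data.count(newline)

print("lines found with 0x0A:", hardcoded_count)
print("lines found with the symbolic newline:", symbolic_count)
```

Change `CODEC` to "ascii" and both counts agree, which is exactly why the hard-coded version survives so long in code that has never left an ASCII machine.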

The problem does not simply go away when we are more diligent. There continue to be situations where character encoding bites us. But as source code grows more robust, we can make progress.

-- R;