An entry in the 2003 (0th) International Obfuscated Haskell Code Competition ---------------------------------------------------------------------------- Program: 'remorse' Author: Malcolm.Wallace@cs.york.ac.uk Copyright: Public Domain TO BUILD -------- To build the program, just use 'make'. The Makefile assumes ghc, but nhc98 works equally as well. Note, this file contains progessively more information about this submission, starting with the simple command-line USAGE. You can also get that information by running the program with no arguments. Then follow some SIMPLE HINTS, some MORE HINTS, and finally the full description marked with SPOILER, further down. Read only as far as you want to... USAGE ----- Usage: remorse (+/-) file.hs The output goes to stdout, so it doesn't overwrite your original file. 'remorse' is an invertible function: use the + or - argument to decide the direction. SIMPLE HINTS ------------ 'remorse' transforms Haskell code into Haskell code. It is so-named because, if you wish you hadn't used it, you can go back and undo your action. I am certainly beginning to regret writing it. :-) You might like to try running 'remorse +' on some arbitrary Haskell'98 modules you have lying around, perhaps even some of the other entries in the IOHC! The results should still be valid Haskell modules (with one caveat (see below)). To undo the changes, use 'remorse -', and you should get the original module back. MORE HINTS ---------- 'remorse' is an automatic Haskell code obfuscator and de-obfuscator. With the command-line argument +, it generates an obfuscated version of the program, and with the argument -, it undoes the obfuscation. Obviously, the primary test case was 'remorse' itself, and of course the version I submitted has already had 'remorse +' applied to itself. To view the (relatively) unobfuscated version, try 'remorse -' on the source code. Now, even when you auto-de-obfuscate 'remorse', it will still not be easy to read I'm afraid. I had to use all the obvious techniques to squash the source code to fit into the 5k limit. (Actually, it was the 'remorse'ified version that had to fit into 5k - the 'remorse'less version is only 2.7k.) I really wish the limit had been 5k LoC! * omission of type signatures * removal of all non-essential whitespace between identifiers+patterns * use of explicit {} and ; to avoid layout whitespace * grossly shortened identifier names * local re-binding of Prelude names to shorten them * import renaming to shorten the module part of a qualified name Nevertheless, I felt it prudent to retain a /few/ comments in the source, just so I can remember what each function does. SPOILER ALERT! -------------- The technique used by 'remorse' to obfuscate programs is to rename all the identifiers using a morse-code spelling. Any alphanumeric variable name can be replaced in Haskell by a symbolic operator name surrounded by parentheses. This includes the argument positions in function and pattern bindings. :-) Each valid alphanumeric character in an identifier is replaced by its morse code equivalent, in dots and dashes, with a separating bar between characters. Thus, 'a' = .- and 'b' = -... so the identifier ab = (.-|-...) The full transliteration was found on the web at http://www.babbage.demon.co.uk/morseabc.html The maximal munch rule of Haskell'98 ensures that there is no problem using operator names that look a little bit like the start of a comment, e.g --.. The only tricky part is to ensure that ambiguous parses cannot happen. Luckily, there are only five ambiguous cases: each a single-letter identifier: e = . = overlaps with composition i = .. = syntax for numeric ranges m = -- = end-of-line comment o = --- = end-of-line comment t = - = overlaps with subtraction For these few exceptions, we simply append a bar to the morse spelling. Morse code covers only one case of the alphabetic characters, which we chose for convenience to be lower-case. Thus, every upper-case character needs a special tag, for which we chose a trailing + character. Valid Haskell varids also permit underscore _ and prime ' which have no morse equivalent, so we chose ! and = respectively (allowing some nice confusion with function/pattern bindings.) Now, another special case. An ordinary varid can be turned into an infix varop by enclosing it in backquotes, e.g. `elem`, but we have already translated the varid into a varop surrounded by parentheses, so in this case we must rather un-bracket the varop and ensure that the backquotes are dropped. One cannot spell keywords or conids (type, constructor, or module name) in morse code - a conop is not permitted in place of a conid at all syntactic locations. What about qualified names? The choices here are to translate /all/ modules including imported ones, and thus the qualified name should be spelled in morse; or to translate modules stand-alone in which case the qualified names should remain as they are. I chose the latter course of action, since translating the entire Prelude and Libraries would have far exceeded the competition's 5k limit. :-) This leads us to the previously-mentioned caveat on the validity of morse-spelled Haskell programs generated by 'remorse'. Any imported function name that is /not/ qualified, will be spelled in morse, and of course there will be no definition with that spelling. Thus, to compile and run a 'remorse'ful program, you must either 'remorse' the whole import dependency graph including the Prelude and Libraries, or qualify each import. As illustrated in 'remorse' itself, you can get away with qualifying each name only once if you use a local binding e.g. x = Prelude.lex becomes (-..-) = Prelude.lex. Finally, although I tried to cover all the common lexical features of Haskell, obviously the 5k limit meant I couldn't do an exhaustive analysis. The undo-ing of the morse-code spelling is rather brute force, and is the most likely place for bugs to lurk.