Friday, January 18, 2013

Large Files (some understandings)

Haskell is lazy at the top level: it defers evaluation, storing unevaluated thunks in memory rather than computing results right away. So if you have some function that is supposed to process a single line of a file, say processLine :: Handle -> IO (), then only the parts whose output actually depends on the file's contents get forced; everything else is just an action stored to execute later. With a large file, that can mean every call to processLine is just storing more into memory. More worryingly, it can break the action order you specified, with deferred file reads firing after the file has been closed. So essentially Haskell's handling of imperative ordering is somewhat broken, which makes sense: that part of the language is the part the gurus don't like, so they write their code as pure functions, that part of the language remains undertested in places, or they just think that's how it should be.
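A minimal sketch of that ordering hazard, assuming GHC's lazy hGetContents (the file name and its contents are made up for the demo):

```haskell
import System.IO

-- Demo: close the handle before anything has demanded the lazy contents.
main :: IO ()
main = do
  writeFile "demo.txt" "line one\nline two\n"
  h <- openFile "demo.txt" ReadMode
  contents <- hGetContents h  -- returns immediately; no bytes read yet
  hClose h                    -- close before the deferred reads have run
  print contents              -- the string is truncated at the close:
                              -- typically prints "" because nothing
                              -- forced the reads while the handle was open
```

The close doesn't throw an error; hGetContents leaves the handle semi-closed, and closing it just freezes the string at whatever had been read so far, which here is nothing.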

The solution, then, is to fit all the IO into one monolithic top-level action and write the processing as pure code, say processFile :: IO () and processLine :: String -> String, and then just map over the whole file lazily. In general, since processFile is a single action at the top level, using hGetContents will not leave the whole file sitting in memory unprocessed: the contents stream in as they are demanded. There are still some limits here, because the results of processLine might get retained, and I haven't yet found a method to simply say not to do that.
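A sketch of that pattern, with a made-up per-line transform (uppercasing) standing in for the real processing; the file paths are hypothetical:

```haskell
import System.IO
import Data.Char (toUpper)

-- Pure per-line processing: easy to test, no IO involved.
processLine :: String -> String
processLine = map toUpper

-- One top-level IO action: read lazily, map the pure function over
-- the lines, and write the result. withFile guarantees the handles
-- close only after the writes have forced all the reads.
processFile :: FilePath -> FilePath -> IO ()
processFile inPath outPath =
  withFile inPath ReadMode $ \hin ->
    withFile outPath WriteMode $ \hout -> do
      contents <- hGetContents hin
      hPutStr hout (unlines (map processLine (lines contents)))
```

Because hPutStr consumes the string incrementally, the input streams through in roughly constant memory instead of being read in whole.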