Tuesday, 1 December 2009

read_file() and write_file() in Python

Here are two functions I would like to see in the Python standard library:
def read_file(filename):
    fh = open(filename, "r")
    try:
        return fh.read()
    finally:
        fh.close()

def write_file(filename, data):
    fh = open(filename, "w")
    try:
        fh.write(data)
    finally:
        fh.close()

In the first case, it is possible to do open(filename).read() instead of read_file(filename). However, this is not good practice: on a Python implementation whose garbage collector is not based on reference counting, there will be a delay before the Python file object, and the file descriptor it wraps, are freed. So you will temporarily leak a scarce resource.

I have sometimes argued that GC should really know about file descriptors: If the open() syscall fails with EMFILE or ENFILE, the Python standard library should run GC and retry. (However this is probably not going to work for reliably receiving FDs in messages using recvmsg().) But in the case of read_file() it is really easy to inform the system of the lifetime of the FD using close(), so there is no excuse not to do so!
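To illustrate the retry idea, here is a rough sketch at the level of user code. As far as I know the standard library does nothing like this; open_with_gc_retry and its _open parameter are names invented for this example, and the sketch assumes that gc.collect() gives the implementation a chance to finalise unreachable file objects and release their FDs.

```python
import errno
import gc


def open_with_gc_retry(filename, mode="r", _open=open):
    # Hypothetical helper: _open exists only so that the failure
    # path can be exercised with a fake opener.
    try:
        return _open(filename, mode)
    except OSError as exc:
        # Only "too many open files" errors are worth a retry.
        if exc.errno not in (errno.EMFILE, errno.ENFILE):
            raise
    # Assumption: collecting garbage frees leaked file objects,
    # and with them the file descriptors they wrap.
    gc.collect()
    # Retry once. If this fails too, the error propagates.
    return _open(filename, mode)
```

A single retry is a deliberate choice here: if the FDs are genuinely in use rather than leaked, running GC again will not help.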

In the second case, again it is possible to do open(filename, "w").write(data) instead of write_file(filename, data), but in the presence of non-refcounting GC it will not only temporarily leak an FD, it will also do the wrong thing! The danger is that the file's contents will be only partially written, because Python file objects are buffered, and the buffer is not flushed until you do close() or flush(). If you use open() this way and test only on CPython, you won't find out that your program breaks on Jython, IronPython or PyPy (all of which normally use non-refcounting GC, as far as I know).
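The buffering can be observed even on CPython by keeping the file object alive and checking the file's size before close() is called. This is only a small sketch: demo_unflushed_write is a made-up name, and it assumes the default buffering, under which a write this short stays in memory until the file is flushed or closed.

```python
import os
import tempfile


def demo_unflushed_write():
    # Hypothetical demo function, not part of any library.
    with tempfile.TemporaryDirectory() as temp_dir:
        path = os.path.join(temp_dir, "foo")
        fh = open(path, "w")
        fh.write("some data")
        # The data is still sitting in the file object's buffer.
        size_before_close = os.path.getsize(path)
        fh.close()
        # close() flushes the buffer, so the data is now on disk.
        size_after_close = os.path.getsize(path)
    return size_before_close, size_after_close
```

With non-refcounting GC, a program that exits or reads the file back before the file object is finalised sees the "before" state.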

Maybe CPython should be changed so that the file object's destructor does not flush the buffer. That would break some programs but expose the problem. Maybe it could be optional.

read_file() and write_file() calls also look nicer. I don't want the ugliness of open-coding these functions.

Don't write code like this:

fh = open(os.path.join(temp_dir, "foo"), "w")
fh.write(blah)
fh.close()
fh = open(os.path.join(temp_dir, "bar"), "w")
fh.write(stuff)
fh.close()

Instead, write it like this:

write_file(os.path.join(temp_dir, "foo"), blah)
write_file(os.path.join(temp_dir, "bar"), stuff)

Until Python includes these functions in the standard library, I will copy and paste them into every Python codebase I work on in protest! They are too trivial, in my opinion, to be worth the trouble of depending on an external library.

I'd like to see these functions as builtins, because they should be as readily available as open(), for which they are alternatives. However if they have to be imported from a module I won't complain!

6 comments:

Anonymous said...

It's already pretty trivial:

with open(fn,'r') as f:
    for L in f.readlines():
        pass

with open(fn,'w') as f: f.write(blah)

f is close()'d at the right time.

Zooko said...

Added to my pyutil library:

http://allmydata.org/trac/pyutil/browser/pyutil/pyutil/fileutil.py?rev=20091202172212-92b7f-daea0abbf974b4faf4df90320014b0ff7b4e0945#L19

Mark Seaborn said...

Thanks Zooko! I hope you find them useful.

Mark Seaborn said...

@Anonymous: Your first snippet doesn't work as an expression. The second example will take up two lines for those of us who follow the coding style that nested blocks always go on a new line.

I often use these functions in automated tests where I like to have conciseness.

For example, to test a command line tool I might do:

temp_dir = self.make_temp_dir()
write_file(os.path.join(temp_dir, "input.c"), "blah blah")
subprocess.check_call(["baz", os.path.join(temp_dir, "input.c")])
self.assertEqual(read_file(os.path.join(temp_dir, "input.o")), "blah")

Vic said...

Hey, these functions don't have to be in std python lib since they are promoting axe-like programming without error checking.

Mark Seaborn said...

Crikey, what do you mean by "axe-like programming"? What sort of error checking would you use?