A guide to Python Namespaces

This post is part of the Powerful Python series where I talk about features of the Python language that make the programmer’s job easier. The Powerful Python page contains links to more articles as well as a list of future articles.

Namespaces are a fundamental idea in Python and can be very helpful in structuring and organizing your code (especially if you have a large enough project). However, namespaces might be a somewhat difficult concept to grasp and get used to if you’re new to programming or even coming from another programming language (in my case, Java). Here’s my attempt to make namespaces just a little easier to understand.

What’s in a name?

Before starting off with namespaces, you have to understand what Python means by a name. A name in Python is roughly analogous to a variable in just about any other language, but with a few extras. First off, because of Python’s dynamic nature, you can apply a name to just about anything. You can, of course, give names to values.

a = 12
b = 'B'
c = [1, 2, 3, 4]

But you can also give names to things like functions:

def func():
    print('This is a function')

f = func

Now whenever you want to use func(), you can use f() instead. You can also take a name and reuse it. For example, the following code is perfectly legal in Python:

var = 12
var = "This is a string now"
var = [2, 4, 6, 8]

If you accessed the name var in between assignments, you’d get a number, a string and a list at different times. Names go hand in hand with Python’s object system, i.e. everything in Python is an object. Numbers, strings, functions, classes: these are all objects. The way to get to an object is often through a name.

Modules and Namespaces go hand in hand

So much for names. A namespace is, obviously enough, a space that holds a bunch of names. The Python tutorial describes namespaces as mappings from names to objects. Think of a namespace as a big list of all the names that you’ve defined, either explicitly or by importing from modules. It’s not something that you have to create; it’s created whenever necessary.

To understand namespaces, you also need some understanding of modules in Python. A module is simply a file containing Python code. This code can be in the form of Python classes, functions, or just a list of names. Each module gets its own global namespace. So you can’t have two classes or two functions with the same name in the same module: they share the module’s namespace, so a later definition simply rebinds the name (unless they are nested, which we’ll come to later).

However, each namespace is also completely isolated, so two modules can have the same names within them. You could have a module called Integer and a module called FloatingPoint, and both could have a function named add(). Once you import a module into your script, you can access its names by prefixing them with the module name: FloatingPoint.add() and Integer.add().
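As a sketch of how this might look, here are the two modules and a script that uses them (each comment marks a separate file):

# Integer.py
def add(a, b):
    return int(a + b)

# FloatingPoint.py
def add(a, b):
    return float(a + b)

# main.py
import Integer
import FloatingPoint

print(Integer.add(1, 2))        # 3
print(FloatingPoint.add(1, 2))  # 3.0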

Whenever you run a simple Python script, the interpreter treats it as a module called __main__, which gets its own namespace. The built-in functions that you use also live in a module, called __builtin__, with its own namespace.
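You can see this for yourself: every module’s namespace contains a name __name__ holding the module’s name, and for a script run directly it is set to '__main__'. This is exactly what the common script-guard idiom checks:

print(__name__)                 # prints '__main__' when run as a script

if __name__ == '__main__':
    print('Running as the main program')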

Importing pitfalls

Of course, modules are useless unless you import them into your program. There are a number of ways to do imports, and each has a different effect on the namespace.

1. import SomeModule

This is the simplest way to do imports and is generally recommended. You get access to the module’s namespace, provided you use the module’s name as a prefix. This means you can have names in your program which are the same as those in the module, and you’ll be able to use both of them. It’s also helpful when you’re importing a large number of modules, as you can see at a glance which module a particular name belongs to.
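For instance, here’s a small sketch using the standard math module. Our own sqrt name and the module’s sqrt coexist, because the module’s version sits behind the prefix:

import math

sqrt = 'our own name, nothing to do with math'
print(math.sqrt(16))    # 4.0 -- the module's name, reached through the prefix
print(sqrt)             # our own name is untouched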

2. from SomeModule import SomeName

This imports a name (or a few, separated by commas) from a module’s namespace directly into the program’s. To use the name you imported, you no longer need a prefix; you use the name directly. This can be useful if you know for certain that you’ll only need a few names. The downside is that you can’t use the imported name for something else in your own program. For example, you could use add() instead of Integer.add(), but if your program defines its own add() function, you’ll lose access to Integer’s add() function.
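Sticking with the math module, here’s a short sketch of both the convenience and the downside:

from math import sqrt

print(sqrt(16))         # 4.0 -- no prefix needed

def sqrt(x):            # reusing the name for our own function...
    return 'no longer a square root'

print(sqrt(16))         # ...and math's sqrt is no longer reachable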

3. from SomeModule import *

This imports all the names from SomeModule directly into your module’s namespace. It’s generally a bad idea, as it leads to ‘namespace pollution’. If you find yourself writing this in your code, you’d probably be better off with the first type of import.
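As a small illustration of the pollution: the math module exports a name as short as e, so a star import can silently clobber your own names:

e = 'my employee record'
from math import *      # quietly drops dozens of names into our namespace

print(e)                # 2.718281828... -- our 'e' was silently replaced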

These imports apply to classes and other data just as much as functions. Imports can be confusing for the effect they have on the namespace, but exercising a little care can make things much cleaner.

Scoping

Even though modules have their own global namespaces, this doesn’t mean that all names can be used from everywhere in the module. A scope refers to a region of a program from where a namespace can be accessed without a prefix. Scopes are important for the isolation they provide within a module. At any time there are a number of scopes in operation: the scope of the current function you’re in, the scope of the module and then the scope of the Python builtins. This nesting of scopes means that one function can’t access names inside another function.

Namespaces are also searched for names from the inside out. This means that if a certain name is declared in the module’s global namespace, you can reuse the name inside a function while being certain that every other function will still get the global name. Of course, you can force a function to use the global name by declaring the name with the ‘global’ keyword before assigning to it. But if you find yourself needing to do this, you might be better off using classes and objects.
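Here’s a minimal sketch of both behaviors (the function names are made up for illustration):

x = 10                  # a name in the module's global namespace

def rebind_locally():
    x = 20              # creates a new local name; the global x is untouched

def rebind_globally():
    global x            # 'x' now refers to the module-level name
    x = 20

rebind_locally()
print(x)                # 10
rebind_globally()
print(x)                # 20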

Classes

Classes and namespaces have special interactions. The only way for a class’s methods to access its own variables or functions (as names) is to use a reference to itself. This means that the first argument of a method must be a ‘self’ parameter if it is to access other class attributes. This is necessary because, while the module has a global namespace, the class itself does not. You can define multiple classes in the same module (and hence the same namespace) and have them share some global data. While this is different from other object-oriented languages, you’ll quickly get used to it.
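A minimal sketch (the class and its names are made up for illustration). Notice that the method can’t reach limit as a bare name; it has to go through self:

class Counter:
    limit = 10          # lives in the class, not in any enclosing scope

    def bump(self, count):
        # a bare 'limit' here would raise a NameError;
        # the attribute must be reached through self
        if count < self.limit:
            return count + 1
        return count

c = Counter()
print(c.bump(5))        # 6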

Hopefully this guide will help you avoid some of the pitfalls that can arise if you don’t understand namespaces. They can lead to unusual results if you don’t use them properly, but they can help you write clean, properly separated code if you use them well. A further source of information on namespaces and classes is the excellent Python tutorial.

Save everything

The hard drive in my first computer was a mere 20GB. When we had to replace it three years later, we got one with a 40GB capacity. My current laptop (the first computer I bought on my own) has a 160GB hard drive. Even my pocket-sized iPod classic sports 80GB. Today you can get terabyte hard drives for a few hundred dollars. So the question is: with all this massive storage easily available, why is the user still prompted to save their work? Why not just save everything?

In a way, even though we have 21st-century storage capacity, we still have 20th-century storage techniques. Filesystems of today, while certainly efficient at doing their job, don’t quite do the right job. After all, the bulk of today’s personal data isn’t in the form of documents that can be neatly sorted into hierarchical categories. Instead, most of our data is in the form of pictures, music and video, for which the easiest interface is some sort of multi-property search rather than directory-style storage. As our data grows, metadata will become increasingly important.

But even if we have rich metadata, that’s still only going to take up a small amount of space. What I would really like to see is ubiquitous versioning. Any changes that I make to any files (including moving and renaming) should be easily viewable, and I should be able to roll back to previous versions without any difficulty. Software developers have already been using robust versioning systems for decades, but I would like to see versioning become an inherent property of the file-storage system itself. Versioning goes hand in hand with backups, and while Apple’s Time Machine is a step in the right direction, it’s still got a while to go.

Another twist in the storage tale is that though local storage in the form of hard drives and flash drives is becoming dirt cheap, online bulk storage is cheaper still (and in some cases free). Unfortunately there is often quite some work to be done to get reliable online storage working seamlessly with your local machine. Like versioning, the technology is already out there, it just needs to be packaged into a convenient, always-available form.

So where do we start? I think Google Docs has shown a good starting point: instead of making the user explicitly save something, applications should just go ahead and save it anyway. If the user decides to actually keep it, she can then rename it to something meaningful and move it somewhere else. Perhaps there should be some sort of garbage collection, where files that were autosaved and then left untouched are deleted after a certain amount of time (after asking the user, of course). Or you could just save everything forever and only run garbage collection if disk space gets dangerously low.

Once you have a basic save-everything system, you could add versioning on top of that. I was hoping to find a versioning filesystem already in existence, but the closest to fully operational ones that I could find were Wayback and CopyFS, neither of which is quite what I’m talking about yet. ZFS shows some promise with its copy-on-write and snapshot features. Hopefully it will only be a few more years before one of the major OS makers (or an open-source initiative) decides to bake version control into the filesystem (or at least tie the two together closely).

Once we have the capability to store such massive amounts of versioned data seamlessly, we’ll need a way to find it all. WinFS would have gone a long way toward solving this problem, if it had ever been finished. I’ve personally come to see the shelving of WinFS as one of the greatest tragedies our industry has faced in recent times. The hierarchical file structure is being pushed to its limit, and WinFS would have offered a good way forward. As personal data gets into the terabyte range, we will absolutely need filesystems that can work with rich metadata. Hopefully WinFS will be pulled out of mothballs, or Apple will come up with a working solution post-Snow Leopard.

Right now I’m stuck with my large hard drives hopelessly underutilized. I’ve started trying some home-grown solutions, such as putting all my documents under version control. Over the next semester at college I’ll also be experimenting with S3 and trying to run a personal backup server. Hopefully I’ll be able to put all those gigabytes to work.

CherryPal: Is cloud computing finally here?

Slashdot just linked to an article about the CherryPal, a lightweight computer designed to run applications off the Internet rather than locally installed ones. The CherryPal is light both in terms of size and power: it weighs only 10.5 ounces and uses just 2 watts of power, but it’s also driven by a low-power 400MHz Freescale processor with just 4GB of storage and 256MB of RAM. That’s incredibly low-powered by today’s standards. However, unlike other lightweight machines such as the OLPC or the Eee PC, the CherryPal isn’t meant to be a stripped-down generic computer.

The CherryPal runs an embedded version of Debian Linux, but the operating system is effectively hidden from the user. While there are a few installed applications like OpenOffice and Firefox, along with multimedia support, the bulk of the CherryPal’s applications run online and are augmented by 50GB of online storage. Essentially, the CherryPal is meant to be just a gateway to the cloud. I’ve talked about cloud computing before, and I’m gradually coming to believe that cloud computing is going to be one of the core infrastructures of 21st-century computer technology. While I am a bit skeptical about the CherryPal’s ultra-light specs, I think it is certainly a good concept. The question is: will it all actually work?

Two days ago Infoworld ran an article comparing four cloud computing architecture providers. The article was interesting for showing that while cloud computing is already a feasible option for those who want to try it, there is still no lowest common denominator for the industry as a whole. The four services are quite different, and if you should ever be in a position of moving from one to another, I don’t think the experience would be entirely painless (or even possible in some cases).

When cloud computing becomes an everyday reality (which won’t be too long, looking at how things are going), it seems there will be a three-tier structure. The first tier will be the user: you, running a low-cost, low-power computer, perhaps not very different from the CherryPal. It will have just enough power to run a browser with full Javascript and Flash, and maybe a very minimal set of on-site software. You’ll mainly use this machine for online services, such as Zoho’s suite for office work, Gmail for email, MP3tunes to store your music, and so on. Of course, all your data will be stored online behind layers of encryption, with multiple backups. Some of these providers will have their own servers with their own back-end infrastructures. But a lot of them, especially smaller, newer startups, will simply be leasing computing power and storage from the even larger clouds provided by companies such as Amazon and Google (S3 and App Engine, for example).

Of course, cloud computing won’t completely replace normal computers. I’d hate to play a 3D FPS over the cloud, and more intensive graphics applications will still need powerful on-site hardware for acceptable performance (though Photoshop Express does show promise). There’s also the fact that many people will simply not trust their data to another party. However, for these people, it would be possible to run their own cloud. With the falling price of hardware and storage, you can even today easily put together a powerful home server and connect to it remotely, thus not having to lug your data with you everywhere. In fact, this will be one of my projects for next semester at school.

So is cloud computing finally here? Not quite. The industry is still in a growing stage and needs to become much more mature before it goes mainstream. We will need to see standardization, in the same way that the internet has (mostly) standardized around HTML and CSS. Different cloud computing providers will have to work together so that it is possible to pull features from different clouds and change providers if necessary. Hopefully this time around we’ll see a more civilized growth of a powerful new computing paradigm, and not a rerun of the browser wars of the 90s.

Programmers should read

For the longest time I believed that programming was all about writing code. However, over the past year I’ve slowly changed that opinion. I now think that reading is just as important as writing. Computer science is a huge and continually expanding field, and as a programmer you’ll probably only be dealing with a very small segment of the problems out there. More importantly, there is a significant amount of cross-fertilization between parts of our field. The only way to keep abreast of all the developments is to read, and read voraciously.

Though our field is still one of the youngest in the world (probably only genetics and nanotechnology are younger), an incredible amount of knowledge has been generated in the last 60 years or so. We’ve come a long way and solved an awful lot of complicated problems along the way. All the collective knowledge of our field is locked into three primary sources:

  1. Thousands of articles, books and other publications
  2. Billions (maybe trillions) of lines of computer code.
  3. A growing number of blogs, wikis and other easily accessible electronic forms, most of which are free.

Of course it is humanly impossible to read all of it. Okay, maybe it is possible, but then you’d never get around to doing any programming. The good news is that reading even a very small fraction of the total mass of computer science literature out there will give you a far better idea of what’s been done and what’s going on.

No matter what part of computer science you’re in, there are bound to be a few classics that are required reading for everyone in that area. These might be in the form of books or research papers (especially if you’re into theoretical aspects). Though these might be somewhat more difficult than the daily problems you are used to solving, there are probably only a handful of them, so it’s not going to take forever to get through them. And it will probably be worth your time to go through them with some rigor. Reading the classics and founding works of your area will give you a fresh perspective on your work and offer valuable insight into the minds of the early pioneers.

Now that you’ve started, you might as well keep going and read something more modern. A lot of the things that were significant problems 60 years ago aren’t problems anymore, and it might come in handy to know what the solutions are. There are modern books of course, but in this case the Internet might prove a better source. You can always start from Wikipedia to get a general idea and then work from there. Being in such a dynamic field, it’s important for us to keep abreast of current happenings and developments. Enter the blog. Sure, many blogs are of questionable quality and there’s a lot of noise. But that being said, there are still a number of authoritative voices, and it shouldn’t be very hard for you to find ones pertinent to your interests.

Blogs are important not just as news sources, but also because they often link to or mention sources of information that would otherwise have passed under your radar. Blog comments are often a vibrant (and sometimes vitriolic) medium of discussion, and you might often be led to question your own assumptions on a topic.

Finally, there is reading code. Reading other people’s code is not always easy. After all, it represents someone else’s thought patterns, which might be very different from your own. It’s even harder if the code is badly written. However, reading code is important, both good code and bad. If you don’t know what bad code is, you won’t be able to tell when you’re writing it. Luckily, with the rise of open source, there’s a profusion of code out there. Again, the same warnings apply: there’s a lot of crap out there and you have to be a bit careful about what exactly you’re looking at. However, just as with blogs, there are some well-known sources you can safely trust.

Since I started actively reading a few months ago, I’ve found that I’ve become better at recognizing certain classes of problems and thinking of innovative solutions. I’ve also come to know very interesting parts of our field which I would never have encountered in a class setting. I’m still rather deficient when it comes to reading code, but that is something I will work on over the rest of the year.

Blog posts or essays?

Between my work, traveling back home and frequent power cuts, my blogging hasn’t been very regular recently. I haven’t been suffering from any sort of writer’s block; in fact, I have a list of about 7-8 topics that I’d like to write about. However, one thing has kept bothering me for quite some time: the size of my posts. I’ve been trying to use this blog as a way to tell the world about the things that I learn and discover as I pursue my career as a computer science student. However, many of the things that I deal with and think about daily are quite complex and take a long discussion to tie together. At the same time, I would like to be able to post new things every day, or at least every other day. Often these two things don’t really go together, and being an avid reader myself, I understand that it can be very trying to read something long on a topic like computer science. Hence the question: do I write small, compact blog posts on a regular basis, or do I write longer, essay-style posts where I can talk at length about a topic?

I’ve been looking to some of my favorite technology-oriented bloggers for possible solutions to my dilemma. My favorite blogger is Steve Yegge, who without fail writes long, sometimes rambling, but always interesting essays on an approximately bimonthly basis. While I find his essays thoroughly entertaining, they are a bit too big for something that I would want to write. More importantly, I certainly want to post more often than twice a month. Paul Graham’s essays are somewhat shorter, but are published at a similar frequency. Again, brilliant, but not quite what I’m aiming for.

Perhaps the closest to what I’m aiming for would be Jeff Atwood’s Coding Horror. Atwood posts regularly (almost every day) and his posts are of a good length: long enough to make you feel you’ve actually read something worthwhile, while short enough that you don’t need to set aside an entire hour to go through them. Though he does occasionally err on the side of excess, his posts are nowhere near as long as Yegge’s.

Of course, length isn’t a separate consideration in itself. It’s closely tied to the content of what I write. Recently I’ve been writing more from a software engineering standpoint, though I would like to write posts of a slightly more theoretical nature (especially since I’m getting increasingly interested in compilers and programming languages). While I’m willing to accept that such topics might require slightly lengthier posts, I really don’t want to turn my posts into mini-theses.

Long blog posts also mean longer time investments on my part, an important consideration given the heavy course load I plan on taking. Perhaps the best way for me to decide is to think about how much time I’m willing to invest on a daily basis. On average a blog post right now takes me about 40 minutes to an hour to write, and I think that’s a good amount of time for me to spend right now. Considering my typing speed, that translates to about 1000 words, even allowing for looking up pertinent links and confirming information. 1000 words might be pushing it a bit (that’s about one average-sized college paper), but much less would probably be too little for me to say everything clearly. 800 to 1000 words seems like a decent size from what I’ve been reading. I think a good idea would be to have a number of sections which are more or less self-contained in terms of content.

I’m going to work on controlling my posts’ size and structure. At the same time, however, my primary concern will be content, so if I need to write longer or shorter posts to give a coherent, well-paced account of everything I have to say, so be it. I’m sure my readers read other tech blogs, so any comments as to what you prefer would be very much appreciated.