Friday, August 8, 2008

assert the machine did not explode

I see it often done by novices, heck, I used to do that myself. When writing a unit test for a method cloning an object, I would put an assertion to check that the original object is left unchanged.

Well, that is wrong. This is checking against a possible bug in the implementation, in which the original object gets modified. But this is just one thing that could go wrong. There are infinitely many possible bugs like this, maybe even one that causes your machine to explode. You can't write assertions against all of them, so don't do it against any of them.

It has two advantages. First is that you stop worrying you forget assertions against some possible bug; that is, if you were worrying about that in the first place. I am sure there are people out there who feel bad about not testing against printing foul language. Second is that the tests themselves become smaller, focused on the positive behavior, therefore easier to read.

In the specific example of cloning: check that the cloned object is equal to the original but they are not the same objects, and that is it. Nothing else.

Unit tests are positive tests, they should only test what you want the code to do, not what it shouldn't do. I am not sure about the case when we fix a defect, but probably a functional test for the defect is enough.

Sunday, April 6, 2008

GVim to convert code into html

Few are aware of the fact that gvim can generate html for the code inside the buffer. The command is :TOhtml; just type it in, and the html will promptly appear in a split window.

In the browser, the generated html should look exactly as you see it in gvim. By default it generates html that should work in old browsers too, but can be told to generate CSS stylesheets or even xhtml. Type :help TOhtml for more details.

It works only in gvim, though; not in vim. It is handy for posting snippets of code on the web.

Sunday, March 2, 2008

.NET BCL source code

Source code for .NET Base Class Library has been made available for debugging sessions in VS2008. It would download one file at a time as they are needed. Since then John Robbins and Karem Kusmezer have create a tool to download all the source files without hassle, great work! The tool is NetMassDownloader.

Use it whilst it is still working :). Download, unpack, and run:

NetMassDownloader.exe -output bcl
    -d C:\Windows\Microsoft.NET\Framework\v2.0.50727

I executed in in Windows XP as a limited user and have seen numerous failure messages printed out. Nevertheless, 120MB of data was downloaded and I had all the files I wanted. The authors advise though to run it as a privileged process (at least on Vista).

The most interesting parts are bcl/redbits/ndp/clr/src/BCL/System and bcl/redbits/ndp/fx/src where bcl is the name of the directory you provided after -output switch. Together they are about 75MB.

It was pretty interesting to read the implementation of the classes I had been using for so long: String, Hashtable, Arraylist. They are easy to understand, with lots of comments, and of course in C#.

The classes I have read are all pretty basic and no wonder there are lots of pointers and calls into unmanaged code, but it is still readable.

One surprising thing, but obvious after knowing it, is that " s ".Trim() will not create a new string "s", but return a slice, an object that holds a reference to the original string and knows where are the beginning and the end. This is possible thanks to the immutability of strings. I bet the substring method is implemented like this too.

There are also some monster methods to be seen. One example is internal HttpListener.HandleAuthenticate spanning more than half a thousand lines.

BTW source code for mono is available at mono-project.com/AnonSVN. Source for mono's BCL is under mcs/class/corlib. It might be interesting to compare the two.

Saturday, March 1, 2008

Why Vim's modes frustrate newbies

Some people say they hate the Vim's modes. Puzzling, because I consider the modes as the killer feature of Vim, something that makes it superior to Emacs.

Why modes rock

I spend most of the time in the normal mode. In this mode, undo is just u, delete line dd and move one word forward w. In Emacs these commands would all include ctrl, shift, alt or whatever. Inserting text is not what I spend most of the time in an editor on. Most of the time I need to quickly navigate around and make small modification to an already existing text. It is false economy to have the keys on the keyboard be bound to less often used functionality all the time.

When I reach the exact place where I need to enter some text, I press i to enter insert mode, type it, and immediately get back to normal mode (I get to normal mode by pressing ctrl-[ rather than ESC).

Why modes frustrate newbies

While pairing with people new to Vim I noticed that they leave it in whatever mode. So if they were inserting, they live it in insert mode. When they come back to Vim window (after inspecting results of a test run for example), they think they are typing normal-mode commands, but are actually in the insert mode typing text. Or the opposite situation: start typing and instead of inserting text they are rapidly killing one chunk of the file after another.

Experienced Vimers follow a convention. After doing something briefly in other mode, immediately go back to the normal mode. I never live Vim in the insert mode or any other mode than normal; when I get back to the Vim window I can always assume Vim is in the normal mode.

Sunday, February 17, 2008

xmonad

I am running xmonad! I had some troubles configuring it though.

I am running Ubuntu Gutsy Gibbon and had to compile it from sources which turned out to be easy, but then the instructions to actually run it didn't work. I have finally got it going by following instructions in FAQ about using xmonad with a display manager.

Before I got xmonad working I tried wmii, which was readily available in repositories. After installing it (sudo apt-get install wmii) you can log out, and in gdm (session manager) change session to wmii to finally log in and enjoy wmii. wmii would be good for me had I no dual head configuration. It was treating my two monitors as one big virtual screen. This meant that a part of it was not visible to me as my left monitor is shorter than the right.

xmonad has much fewer lines of code that wmii. I have seen it a few times already, Haskell programs being very small in comparison with alternatives in other languages. wmii is proud to keep it small in under 10 000 lines of code, while xmonad is written in less then 1300 lines of Haskell. Additionally, there are 583 lines in tests. Not bad considering that xmonad turned out to be better then wmii.

I got these numbers by running the following command on ver 0.6 sources, it ignores blank lines and comments.

find . -name \*.hs -exec cat {} \; |\
    egrep -v "^ *$|^ *--" | wc

BTW xmonad is cool.

Monday, February 4, 2008

Debugging memory problems in an IronPython app

Introduction

This blog entry contains notes on debugging memory problems in an IronPython application on Windows. I also provide scripts that are helpful in the process. I should also mention that it is specific to IronPython 1.*.

The Debugger I was using is windbg.exe. It is a GUI application, but feels more like a command line. It is certainly more powerful than Visual Studio Debugger.

First steps

First thing after starting windbg should be loading SOS.dll, which is an extension to windbg providing commands specific to debugging .NET apps (all starting with !). There is !help command that provides description for these commands (invoke after loading sos).

 
.load sos 
!help 

If you have problems it might be because of the wrong sos.dll version. There is a different version for each .net runtime. Since IronPython runs in .net 2.0 you need sos for that specific version. You can find it in the .net sdk directory and specify the full path for the load command.

Say freeze ;)

Windbg can attach to a process by id (F6 shortcut, but there is also a menu item for it). From that moment on, the debugee is freezed until you detach from it.

Before we actually do any debugging we need to have something to debug. Prepare application so that after it does some considerable leaking, there is a moment in its execution that you would attach windbg to it. This moment should be after doing thorough garbage collection to ensure that during debugging we are dealing only with alive objects. You could have code like this:

from System import GC
from System.Diagnostics import Process
from System.Threading import Thread
for _ in range(2):
    GC.Collect()
    GC.WaitForPendingFinalizers()
print "attach now to process id %d" %\
        Process.GetCurrentProcess().Id
Thread.Sleep(100000)

Next thing is to find objects that leak. In our case the biggest ones were Dictionaries so it was enough to just list all dictionary objects. Sometimes it is unclear which objects leak, this blogpost suggest a strategy for finding out leaking objects. For this task you will find the dumpdiff.py helpful.

This following listing shows output of !dumpheap command the heap with given substring in the type name. This is always the first command after attaching to a process.

!dumpheap -type System.Collections.Generic.Dictionary
...
1b489a58 000e2e00      124     
02713600 000e2e00   134720     
02b756b0 000e2e00  2503008     
24b496a0 000e2e00  2503008  
...

Now copy one of the addresses and issue gcroot command to find some of the chains of references holding the object alive. I'll take the last address. It is worth mentioning that 24b496a0 points to an array that holds the entries of the dictionary.

!gcroot 24b496a0
...
13ab6360(<Unloaded Type>)->
13ab63bc(IronPython.Runtime.FieldIdDict)->
13ab63c8(System.Collections.Generic.Dictionary`2[[IronPython.Runtime.SymbolId, IronPython],[System.Object, mscorlib]])->
13ab6424(System.Collections.Generic.Dictionary`2+Entry[[IronPython.Runtime.SymbolId, IronPython],[System.Object, mscorlib]][])->
186787d4(<Unloaded Type>)->
...

The previous listing shows only a fragment of the output of the !gcroot command. Arrows indicate directions of references. In the output of !gcroot command, the root (a static object or value on some stack) will be at the top whereas the object you invoked the !gcroot with will be towards the bottom.

Difficulties in debugging an IronPython app

Python is a dynamic language and that makes it more difficult to debug IronPython than C# for example (with windbg). There are two problems. Firstly, types defined in IronPython appear under windbg as Unloaded Type. Even if the type names would be shown (no clue why they are not), they would be C# types (or should I say .NET native types) like OldType and UserType: the guts of the IronPython implementation, not the classes defined in our IronPython code.

Secondly, the attributes of an object are not fields. They are stored in a dictionary, and to make things worse, that dictionary is not keyed by strings but by SymbolIDs. Somewhere in memory lies a table containing strings to which SymbolIDs hold indexes.

To find out the attribute name or the type name there are many commands needed. I have created two scripts to automate it. Explaining the windbg scripts is out of scope for this entry, but I think I will post about it soon.

attr script

The first script, attr, takes two addresses as arguments, one to the array holding the dictionary entries, second to the value of the attribute, and prints the name of the attribute. Below I include the same fragment of the gcroot output plus example use of the attr. It will display the name under which 186787d4 is kept in the __dict__ of 13ab6360. Note that I pass to attr the address of the array not 13ab6360.

... 13ab6360(<Unloaded Type>)->
13ab63bc(IronPython.Runtime.FieldIdDict)->
13ab63c8(System.Collections.Generic.Dictionary`2[[IronPython.Runtime.SymbolId, IronPython],[System.Object, mscorlib]])->
13ab6424(System.Collections.Generic.Dictionary`2+Entry[[IronPython.Runtime.SymbolId, IronPython],[System.Object, mscorlib]][])->
186787d4(<Unloaded Type>)->
...
$$>a<c:\path\attr 13ab6424 186787d4
Name: System.String
MethodTable: 790fc6cc
EEClass: 790fc62c
Size: 110(0x6e) bytes
(C:\Windows\assembly\GAC_32\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)
String: _GUIThreadTestCase__guiControl

The above listing assumes you save the attr script in c:\path, and the dollars are required, pounds will not do.

Make sure you invoke it giving an address of a dictionary entry array keyed by SymbolIds like in the above example.

So the above gives the name of one attribute. To get more attributes you can invoke !dumparray on the entry array, which will display all the values there which you can then check individually with attr script (or write another script to automate it ^^).

!dumparray -details <entryArrayAddress>

The attr script may not work if the executable you are debugging is running from a path that contains spaces. There is one constant in the script that would have to be increased to make up for that additional space.

usrtype script

The next script is less reliable that the first one, it worked half the time I invoked it on a Unloaded Type object. I think it works for UserType, so I named it usrtype. It takes just one argument which is the address of an object and returns the name of its Python type.

$$>a<c:\path\usrtype <address>

I again assume you have it saved in c:\path.

!dumpobj

Now the most frequently used command, !dumpobj. It displays the values of the fields of an object. Of course it displays only the fields defined in the .net class. It knows what the object's type is, because each object on the managed heap has the pointer to its type in its first 4 bytes (or 8 if it is 64 bit machine).

Since all the leaking I was chasing ware caused by design flaw in implementation of events in IronPython, I usually looked for object of type ReflectedEvent in the chain displayed by the !gcroot. ReflectedEvent stores the name of the event. It was usually also useful to inspect the IronPython.Runtime.Calls.Method further down the chain, from which you can get the name of the method handling the event as well as the class this method is defined in.

Tips & Tricks

I hope you will find them useful.

Testing that the memory leak is gone

So you got rid of the memory leak, but how do you test that it is gone? I mean an automated unit test!

obj = objectThatMightLeak()
weakRefToObj = WeakReference(obj)
del obj
for _ in range(2):
    GC.Collect()
    GC.WaitForPendingFinalizers()
assert !weakRefToObj.IsAlive

WeakReference will allow GC to collect the object, and after collection it will answer if the object was garbage collected. The del obj line is very important, ware it not there the test would have no chance of passing.

Marking objects

Let's say you suspect that some object is not being garbage collected, and its class is defined in Python. Because of the dynamic nature of Python, its class is not represented as .NET type, hence you can't simply find it with !dumpheap -type <type_name> command.

My mother works in laboratory, when she analyzes samples, she sometimes marks them with special substance to make some elements stand out. You can mark Python objects in a similar way.

from System import GopherStyleUriParser
pythonObject.tag = GopherStyleUriParser()

GopherStyleUriParser is a good enough mark, there is low probability you will have a second one on the heap, and because it is .NET type you can find it easily with !dumpheap -type Gopher. Then invoke !gcroot on it and you will find that the chain goes through your object.

Update: Previously I was recommending using ArithmeticException instead of GopherStyleUriParser for the tag; don't do this. As soon as you would get rid of the other leaks, the object would be kept alive by the instance of ArithmeticException, which captures the stack trace.

Nuking events

That one is a dirty hack. All the trouble I had with memory leaks were because of the events, how about nuking them?

The reason the IronPython team didn't just use the .net implementation of events was that they wanted to allow removing handlers by object equality, rather then by referential equality. By referential equality I mean comparing objects' references (is operator in Python).

So how about defining an object that is equal to everything?

class EqualityForAll(object):
    def __eq__(_, __):
        return True

That is a very fine fellow, which we can use to get all the handlers out of an event.

e = EqualityForAll()
for _ in irange(100)
    control.OnClick -= e

Ok, that will kill at most 100 handlers, but there is no way you can find out how many there are. And not, it will not throw an exception if there are no more handlers. If I remember correctly from looking at the implementation code (open source is great), there are some exception being thrown under the hood, but are swallowed higher up. This makes the operation costly if there are no handlers, as the exception in .NET are much more expensive than in CPython.

Summary

I thought that finding and stopping the leaks would be easy, but it took many days. Perhaps fixing the IronPython implementation could have taken less time. Anyway, if you think that the bug should be fixed, please vote for it. It was given a low priority and it is not easy to do, so without votes it might not make it to IronPython 2.0.

Thursday, January 31, 2008

Arc

I have just read the Arc tutorial. Here are some comments.

Since functions of one argument are so often used in Lisp programs, Arc has a special notation for them. [... _ ...] is an abbreviation for (fn (_) (... _ ...)). So our first map example could have been written

arc> (map [+ _ 10] '(1 2 3))
(11 12 13 . nil)
I know this feature from Nemerle, and miss it when programming in other languages.

Another nice one seems to be ~ operator for negating a function, though would not be nearly as useful as the last one.

arc> (map ~odd '(1 2 3 4 5)) 
(nil t nil t nil)

Code for adding an element to a hashtable does not look very natural:

arc> (= (airports "Boston") 'bos)
bos
In Python it would be:
airports["Boston"] = 'bos

Wednesday, January 30, 2008

Resolver One for Open Source

Resolver One, a new MS Excel's alternative, is currently available in two versions. First is a vanilla commercial one, whereas the second, Resolver One - Non-Commercial, requires a bit of explanation.

Resolver One - Non-Commercial is free (as in beer). Internally at Resolver we used to call it Resolver One for Open Source. Not because it comes with sources, because it doesn't, but because it could only be used to produce open source. Since then we have further relaxed the license to allow personal use.

What a user gets after installing Resolver One Non-Commercial is feature-wise the same application as in the commercial one, but with a different license. The license requires the user to make his spreadsheets open source or mark them as personal -- not for redistribution and profit.

Technically, the Non-Commercial version will not save the spreadsheet file until user decides on a license for his spreadsheet. When he tries to save the file for the first time, he will be presented with seven well known open source licenses to choose from (MIT, GPL, Creative Commons, ...), but he can also mark it as personal. If he selects one of the open-source licenses, it is included on top of the code part of the spreadsheet, making his spreadsheet free (as in speech).

Sunday, January 6, 2008

Anti-pattern: static subject to observer mapping

Short description

When implementing observer pattern in a language with automatic garbage collection don't use static mapping from subject to observer. Doing this will cause memory leaks.

Such an implementation involves having a hashtable which keys are subjects and values are lists of observers. Then to notify about a change in a subject, get the list of its observers from the mapping and notify them one by one.

Glossary

GC - Garbage Collector.
Dead object - an object is dead when it is eligible for garbage collection (can't be reached by strong references from any root). By symmetry, an object can be called alive.
Strong reference - a reference that keeps an object alive (normal reference), as opposed to a weak reference.
Weak reference - a reference that is not taken into account by GC. In .NET there is WeakReference class to define weak references. As far as I know this is the only way to define a weak reference.

What is wrong with static subject to observer mapping

For a warm-up I will discuss the traditional implementation of observer pattern. The nodes on the diagram represent objects. An arc from one object to another means that it is possible to get from one to the another following strong references.

In the first diagram, observer can be reached from subject and vice versa. The fact that observer can be reached from subject follows from subject keeping a list of observers to notify.

I would like to point out that the arrow from observer to subject on this diagram indicates that there is, maybe indirect, way by strong references from observer to subject, this does not follow from observer pattern. However this being optional, in production code it is quite often that observers keep references to the subjects.

What happens in this case when there are no more references to both subject and observer (apart from their own cycle of references)? They can't be reached from any root, so they are dead and the memory they occupy will be released during next garbage collection.

The second diagram shows a situation where observers are kept in a weak hashtable. The dashed arc from entry to subject indicates a weak reference. This one is not that bad. If there are no references to subject and observers apart from those on the diagram, the subject can be garbage collected, there being only a weak reference to it. The observer will be kept alive by the strong reference from weakhashmap, but WeakHashMaps are usually implemented so that on adding a new entry they will remove entries which keys have been garbage collected. This prevents entries with dead keys to pile up.

The third and last diagram depicts a case where there is a reference from observer to subject. This is the situation which results in memory leaks. In this configuration the weakHashMap will keep both subject and observer alive and every other object that can be reach from either of them.

This happens because the reference from entry to the value must be strong, otherwise an observer might get garbage collected when its subject is still alive.

Known occurrences

I have first come across this anti-pattern in the paper about implementing GOF design patterns in AspectJ by Jan Hannemann and Gregor Kiczales. Gregor Kiczales is renowned for leading the team in XEROX PARC that invented aspect-oriented programming.

In this paper they have a definition of an aspect, ObserverProtocol, that keeps a WeakHashMap from subjects to list of observers. This mapping not being static per se, is kept alive throughout the lifespan of a program execution.

The second use I have seen was in IronPython's implementation of events. IronPython has it's own implementation of events because the one in .NET runtime is not sufficient for IronPython's requirements.

For IronPython programmers it means that hooking to events will cause memory leaks if source of the event can be reached from the handler (very common).

Summary

Both of these cases show how easy it is to make this mistake, therefore I have documented it as an anti-pattern for the benefit of mankind ;).

Thursday, January 3, 2008

All possible FontStyles

That was funny. To instantiate Font in .NET 2.0 you call one of many Font's constructors. Basically you provide a font family (either an object of the class FontFamily or a name of one as a string), size and FontStyle. There is a constructor that doesn't take a FontStyle, it assumes the FontStyle.Regular as default.

A FontFamily is a collection of fonts. Arial family will have Arial Regular, Arial Bold, Arial Bold Italic...

Now, in theory there might be a family with only Bold Italic. The requirement is in case there is no style the user requested, give back a font from the family in any available style.

Unfortunatelly in FontFamily object there is no property returning available members of the family, but there is a method for checking if given FontStyle is available.

FontStyle is an enumeration with the following members.

FontStyle.Regular = 0
FontStyle.Bold = 1
FontStyle.Italic = 2
FontStyle.Underline = 4
FontStyle.Strikeout = 8

A font style can be created by bitwise or-ing the values. For example Bold Italic can be defined by FontStyle.Bold | FontStyle.Italic wihich makes 3.

How to generate all possible combinations?

for i in xrange(16):
    fontStyle = Enum.ToObject(FontStyle, i)
    if fontFamily.IsStyleAvailable(fontStyle):
        return Font(fontFamily, fontSize, fontStyle)

Integers from 0 to 15 form all valid combinations of the FontStyle. This sixteen numbers can be represented with 4 bits, 15's binary representation having them all lit (8 + 4 + 2 + 1).