Archive for the ‘Software Testing’ Category

New Version Every Other Week for Three Years?

February 16, 2011

I get up every morning determined to both change the world and have one hell of a good time. Sometimes this makes planning my day difficult.

E. B. White
US author & humorist (1899 – 1985)

Releasing a working version to customers every two weeks is fun.

  • It is fun for customers who use the features instead of watching fictitious “product road maps”.
  • It is fun for developers who see their work is actually used.
  • It is fun for the executives who can change the business priorities quickly.
  • It is fun for product managers who can measure actual usage.
  • It is fun for the R&D manager, as the problems cannot be hidden for long.

In my company, we delivered 72 versions to customers in three years.

Here is one way to do it:

  • Hire top talent for development, QA, IT and operations.
  • Deliver the product as a Service (SaaS). Upgrading one instance is much easier than upgrading 10,000.
  • Twice-weekly synchronization meetings, on Monday and Thursday. Monday is just team leaders and Thursday is all of R&D.
  • Invest early in QA automation. We invested $20,000 in Automation infrastructure at a very early stage.
  • Invest in Unit-Testing as much as possible.
  • Avoid branching. Branches are evil. Merges are Yikes. One branch is good, two is max.
  • Invest in the “Ugly stuff”. Deployment scripts, upgrade scripts, database consistency.
  • Constructive dictatorship. Every code change has a ticket. Every. No exceptions. Really.
  • First week is for coding. Then it is feature freeze. Three days for QA and bug fixes. Code freeze. Two days for final QA and critical fixes only. Release on Sunday.

In the next post I’ll try to answer the tricky questions: What about longer features? How not to scare the customers? And more.

Enhance Your User Experience – Visual Attention Service

February 5, 2011

3M offers a great tool to find the best user interface for your application. In less than five minutes it analyzes the attention attractiveness of a web application.

The score indicates the probability that each area of interest would get attention in the first 3-5 seconds.

Visual Attention Service - CloudShare ProPlus Screen Area Of Interest

Here is an example of the original CloudShare ProPlus screen.

CloudShare ProPlus Screen Original

Heatmaps highlight areas of the image that are likely to get attention in the first 3-5 seconds.

See how the feedback button (lower right corner) gets a lot of attention, but the Upgrade button (top right corner) does not.

CloudShare ProPlus Screen Heatmap Visual Attention Service

And a similar discrete view.

CloudShare ProPlus Screen Regions Visual Attention Service

To try it for your own application or design:

1. Take a screenshot of your application (full screen, high resolution).

2. Sign up for 3M. Activate by email.

3. Upload your screenshot. Select “Web Application” under Type.

4. Mark important functional areas that you want to compare (optional).

5. Press “Analyze”.

The results are fast and simple to understand and downloadable as PDF.

While attention studies have their limitations and cannot be used as the only usability parameter, they can be quite useful.

The service is not cheap ($12-$20 per picture), but you can try five pictures for free. Moreover, it can save hours of stupid arguments between product managers, developers, designers and even VPs of Marketing.

p.s – it runs on Azure.

Contracting Nightmares

August 23, 2010

It is a common mistake to think that hiring contractors is a silver bullet.

No obligations, get experts and focus on core skills.

That’s why the service level in Cable companies is so great 🙂

I have found that in 50% of the cases, hiring contractors in IT, Sales, Marketing, QA or development results in a complete failure.

Looking at some of the characters who work as free-lancers, it is not surprising.

1. The not-so-competent

“It is not my fault I changed thirteen companies in ten years. I just made some bad career moves”.

“I like working on many projects at the same time. Just installed Exchange 5.5 for the local donut shop last quarter”

2. The lazy genius

“Making $200 per hour seems much nicer than coming to work every day”

“I got this stupid Japanese company to pay me $666 per hour. I can’t really commit to any deadline for you. At least not until they go bankrupt.”

3. The former executive

“In this stage of my life I want to work half a day per week for $2000. You should be thanking me every day”.

4. The in-the-wrong-profession-guy

“I really want to promote my band, but got to make a living till then. Let me recompile the kernel for you.”

“I have founded five social network start-ups so far, but in the meantime I can help you guys with some JavaScript. Yes, I’m 19 years old.”

5. The Californian-work-life-balance-dude

“I don’t like to work. I like to enjoy life. Please send money to increase my bank balance”.

6. The open-source-Linux-purist

“I will not work with these evil DLL’s. Only tar.gz+RPM would stop world hunger and buffer overflows”

7. The incompetent-misunderstood-self-proclaimed-genius

“I have developed this amazing web site for selling dog food. It has no traffic yet, but I’m starting SEO soon.”

8. The ethically challenged

“I like free-lancing on top of my day job. What they don’t know can’t hurt them”

In the next post: “When contracting works well”.

The Hardest Bugs in The World – Part Two

June 5, 2010

It seems our mysterious IronPython memory leak is actually a “Handle Leak”. These are even harder to detect than memory leaks. It seems some of our “Future” is leaking. I guess we’ll have to patch a hole in the time-space continuum to fix it.

In other words, when we create a “Future” we use one event and one mutex (or Mutant in Windows lingo) and they just keep piling up. It all sounds like a Sci-Fi B movie. Read all about it, if you have trouble falling asleep.

On the database side things are just as challenging. It seems next to impossible to get determinism out of our SQL Server 2008. We are trying to understand why the results in our production database, our QA database and our performance setup are sometimes completely different.

It is pretty annoying when the same query lasts two seconds in one place and thirty seconds in the second location, while the database tables are identical.

I would have expected this type of problem to go away by 2010. Here are a few insights (credit goes to Roy R).

  • Increasing the number of CPUs can decrease performance when running on VMware. A virtual machine with four CPUs, for example, requires all four CPUs to be free at the same time, which happens less often than having just two CPUs free.
  • Why would a query run slowly inside the web app, but quickly when run inside SQL Management Studio on the same database?
    Why would a slow query suddenly start running quickly in the same web app?
    Sounds like voodoo? Probably, but the answers may lie in the SQL plan cache.
  • Q: Why would a query run slowly inside the web app, but quickly when run inside SQL Management Studio on the same DB?
    A: The two queries may be using different query plans because of different text, parametrization or connection settings. The old query plan may have become obsolete.
  • Q: Why would a slow query suddenly start running quickly in the same web app?
    A: The query plan could have been refreshed. Changes to the table (updates, deletes or inserts) can cause an automatic statistics update. The plan could also be retired after a while to free memory.
  • Different queries produce different plans. Text matters and parametrization matters.
  • It is quite hard to “freeze” the query plan, since it requires a lot of memory and there are too many variations.
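
To make the plan cache points above concrete, here is a toy model in Python. It is only a sketch and not how SQL Server is implemented: the class ToyPlanCache, the sample queries and the choice of ARITHABORT as the differing connection setting are all assumptions made for illustration. It shows why the "same" query can end up with several plans: the cache key includes the exact text and the connection settings.

```python
from typing import Dict, FrozenSet, Tuple

# Toy model only: a cache keyed by (exact query text, connection settings).
CacheKey = Tuple[str, FrozenSet[Tuple[str, str]]]


class ToyPlanCache:
    def __init__(self) -> None:
        self._plans: Dict[CacheKey, str] = {}
        self.compilations = 0  # how many times a new plan had to be built

    def get_plan(self, sql_text: str, settings: Dict[str, str]) -> str:
        key = (sql_text, frozenset(settings.items()))
        if key not in self._plans:
            # Cache miss: "compile" a new plan for this exact text and settings.
            self.compilations += 1
            self._plans[key] = f"plan#{self.compilations}"
        return self._plans[key]


cache = ToyPlanCache()
web_app = {"ARITHABORT": "OFF"}  # hypothetical web app connection settings
studio = {"ARITHABORT": "ON"}    # hypothetical Management Studio settings

parametrized = "SELECT * FROM orders WHERE customer_id = @id"
literal = "SELECT * FROM orders WHERE customer_id = 42"

plan_web = cache.get_plan(parametrized, web_app)
plan_web_again = cache.get_plan(parametrized, web_app)  # reuses the cached plan
plan_studio = cache.get_plan(parametrized, studio)      # same text, new plan
plan_literal = cache.get_plan(literal, web_app)         # different text, new plan

print(plan_web, plan_web_again, plan_studio, plan_literal, cache.compilations)
```

In this model the web app and the studio never share a plan, so one of them can be stuck with a stale plan while the other gets a fresh one.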

This is an interesting variation of the Heisenberg uncertainty principle: trying to measure the performance changes the statistics and therefore changes the measurement. This is also known as a Heisenbug. We are open to creative ideas. Till then I’m considering trying out London’s first pensioners’ playground.

The Hardest Bugs in The World – Part One

May 29, 2010

Some bugs in software are extremely hard to track down. One keeps trying to “kill” them and fails for a long time. Much like the guys in this video.

The most difficult bugs in the world can take months to track down and isolate. While they are not always deadly, sometimes even when the bug location is known (like George Bush in Washington DC) the fix itself can be extremely difficult.

Bugs are hard to solve when they are non-deterministic, non-reproducible or out of control. They happen only in the production system, or only during load, only at certain places, or just on the customer desktop.

Here are a few examples of elusive software bugs, their characteristics and how to turn them into butterflies of code :).

Bugs in other people’s code

We just found a bug in the way Firefox loads plug-ins. It appears that Firefox tries to cache a plug-in DLL so it does not have to load it many times. While the feature is useful, there is a flaw in the caching mechanism that causes the Java plug-in to load instead of our plug-in for our object.

Why was it hard to find? The bug only happens under very specific conditions, and not all the time. It only happens on Firefox 3.6.3. It only happens when the plug-in uses the Object HTML tag and not the “SRC” attribute. It only happens when Java and our plug-in are loaded in the same page, in a certain order …

What did we do? We (which means Leeor) compiled Firefox and ran the source code with a debugger. Thank god for Open Source. If we had the same issue in IE, we would not have had much of a chance to solve it. We submitted the bug to Firefox, but since we needed a solution right away, we found a workaround by finding the root cause.

Memory Leaks

Why was it hard to find? Memory leaks are extremely hard to find because they tend to be non-deterministic. When the memory of the process just keeps growing, it is relatively easy to find the problem’s source. But in many operating systems the memory management and garbage collection have become so sophisticated that it is not trivial to know if there is a leak or not. Memory goes up and down, or just stays still.

In theory, memory management problems were supposed to go away in Java, C# and Python. While most of them have, the ones that remain are the hardest ones to solve. For a while we kept hunting an “Out Of Memory” problem in IronPython. In the good (?) old days of C++ we could have used a memory profiler to locate our lost memory chunks. In IronPython this is next to impossible, since the .NET objects are so mangled that it is not possible to correlate them to the original language objects. Therefore, traditional tools like Quantify have little value.

What did we do? Sometimes the best resolution is to write a custom memory management library. We used this trick in Check Point when we needed to debug memory leaks in the kernel, where no standard tool works. This approach works especially well when this infrastructure is written from the ground up.

A similar approach can be used in Python, but the performance implications are too severe to run it in production. Unfortunately, the memory leak only happens in production …
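
The idea behind a custom memory management layer can be sketched in a few lines of Python. This is an illustration only, not the library we wrote: the class TrackingAllocator and the owner tags are hypothetical. Every allocation is registered with the name of whoever asked for it, so anything that is never freed can be reported by owner.

```python
from collections import Counter
from typing import Dict, Tuple


class TrackingAllocator:
    """Hands out handles and remembers who asked for each one."""

    def __init__(self) -> None:
        self._live: Dict[int, Tuple[str, int]] = {}  # handle -> (owner, size)
        self._next_handle = 1

    def allocate(self, size: int, owner: str) -> int:
        handle = self._next_handle
        self._next_handle += 1
        self._live[handle] = (owner, size)
        return handle

    def free(self, handle: int) -> None:
        # Raises KeyError on a double free, which is a useful signal in itself.
        del self._live[handle]

    def leaks_by_owner(self) -> Dict[str, int]:
        """Total bytes still allocated, grouped by owner tag."""
        totals: Counter = Counter()
        for owner, size in self._live.values():
            totals[owner] += size
        return dict(totals)


allocator = TrackingAllocator()
a = allocator.allocate(1024, owner="session_cache")
b = allocator.allocate(256, owner="request_parser")
c = allocator.allocate(2048, owner="session_cache")
allocator.free(b)
allocator.free(a)

leaks = allocator.leaks_by_owner()
print(leaks)  # handle c was never freed
```

The extra bookkeeping on every allocation is exactly the cost that makes this hard to leave switched on in production.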

If the problem can be reproduced with unit tests, life is a bit better. One innovative idea that Idan came up with is binary search over the code. Since we moved to Git we can now change the past retroactively. In other words, we perform a binary search on the code history to track down the change in which the memory leak started. We “pretend” the unit test was written in the past and run it against the old code. Using binary search we can locate the exact commit in which the problem arose.

Another option is to look for the usual suspects: unmanaged code. In one case we used the NetApp SDK, which is written in C++, in our .NET code. It took three iterations to resolve all the memory leaks caused by their library. Pretty much a trial-and-error process.
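
The binary search over commits can be sketched as follows. This is a minimal illustration, not our actual tooling: the function first_bad_commit, the fake history and the leaks() check are made up, and in real life the check would check out the commit and run the unit test. It assumes the commits are ordered oldest to newest, that the newest one is bad, and that once the leak appears it stays. Git ships the same idea as the git bisect command.

```python
from typing import Callable, List, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is True.

    Assumes commits are ordered oldest to newest, the last one is bad,
    and every commit after the first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid       # the culprit is mid or something older
        else:
            lo = mid + 1   # the culprit is newer than mid
    return commits[lo]


# Fake history of 100 commits; the leak is (hypothetically) introduced in c37.
history = [f"c{i}" for i in range(100)]
checked: List[str] = []


def leaks(commit: str) -> bool:
    checked.append(commit)  # stands in for "check out and run the unit test"
    return int(commit[1:]) >= 37


culprit = first_bad_commit(history, leaks)
print(culprit, "found after", len(checked), "test runs")
```

With 100 commits the test runs at most seven times, instead of up to a hundred when walking the history commit by commit.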

A Case Study in Restore Nightmares

March 15, 2009

I’m learning to fly but I aint got wings
Coming down is the hardest thing

Well the good old days may not return
And the rocks might melt, and the sea may burn

Tom Petty

Read the first part of the story first.

A few hours had passed and Sasha felt more secure about starting the recovery of the ClearCase source files. The initial restore was successful (300GB and 50 million files) and everyone was happy. Everyone but yours truly.

Me: How do you know the recovery was successful?

Sasha: The recovery software says so.

Me: Well, we know what that’s worth. How do I know all the files are OK?

Sasha: We will run FSCK on all file systems and see what happens.

Me: That’s a good start, but it still does not tell me which files we lost, what data is corrupted, and that no changes happened. And by the way, how long will it take to run FSCK?

Sasha: These are excellent questions. I don’t have any idea. Let’s try and see what happens.

Backup and Restore Case Study

Running FSCK took a few more hours, but it seemed OK. We tried to bring ClearCase up but it was not willing to. Surprisingly, the restore process didn’t restore all the file permissions and soft links. These had to be added manually.

I was still unsure that we hadn’t lost any information. With 300 modules, 20 million files and thousands of branches it is really hard to know nothing was lost. It could take months before someone tries to build an old package and finds out it is missing. With databases there are multiple integrity checks in place. With a version control system such as ClearCase it is not so simple.

When we called the ClearCase guys (also IBM) it turned out there is a secret script that validates the internal consistency of the setup. We decided to run the script and at the same time build all the main products on the compile servers. If all the compilation results come out identical to what we have, we can be quite confident we have at least the latest source available.
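
The "identical results" check can be automated by comparing checksums of two directory trees. The sketch below is an illustration, not the procedure we ran: the functions build_manifest and compare_manifests and the sample files are hypothetical, and it only compares file contents, not permissions or soft links (which is exactly what our restore lost).

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file at once; fine for a sketch, chunk it for big files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def compare_manifests(expected: Dict[str, str], actual: Dict[str, str]) -> Dict[str, List[str]]:
    """List files that are missing, unexpected, or different in `actual`."""
    common = expected.keys() & actual.keys()
    return {
        "missing": sorted(expected.keys() - actual.keys()),
        "extra": sorted(actual.keys() - expected.keys()),
        "changed": sorted(p for p in common if expected[p] != actual[p]),
    }


# Tiny demonstration with two made-up build output trees.
with tempfile.TemporaryDirectory() as tmp:
    reference = Path(tmp) / "reference"
    restored = Path(tmp) / "restored"
    for tree in (reference, restored):
        (tree / "bin").mkdir(parents=True)
    (reference / "bin" / "app").write_bytes(b"build 1")
    (reference / "bin" / "lib").write_bytes(b"library")
    (restored / "bin" / "app").write_bytes(b"build 1 (corrupted)")
    report = compare_manifests(build_manifest(reference), build_manifest(restored))

print(report)
```

An empty report does not prove nothing was lost, but a non-empty one tells you exactly which files to worry about.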

What we soon discovered is that we couldn’t do both at the same time. The hidden script was locking the files. We had to run the build after the script. Of course nobody in IBM could estimate how long the script would run. I sent all the developers home at this stage.

A couple of hours later everything seemed to be in order. I went back home and prepared for the holy day. Yom Kippur is the day in the year when Jews are supposed to ask for forgiveness for the evil they did to their fellow men. It is also a day for reflection. I felt this was a very appropriate opportunity.

Lessons Learned:

  • A backup without frequent restore exercises is like a Pizza with no cheese
    • Just like High Availability never works in Passive Active mode
  • Everything starts from the requirements. IT is not different than R&D
    • Restore time is one example 🙂
  • If you can’t prove that the restore works you have a serious problem
    • This is a hard one. Think – what will make you sleep well at night.
  • When crisis happens, it is very nice to have a process in place.
    • Start from the phone list of key people
  • Trying to save money might cost you lots of money
    • There is usually a reason for high-end products
  • Trust no one
    • If they say they have backup, ask them for the DRP plans
    • If they claim to have backup, ask them to restore something

Strange Google Problem – False “This site may harm your computer” messages

January 31, 2009

It seems the Google main search engine has a critical problem.

It reports any site as being malicious, even when you just look for CNN.

See attached image. I validated on three different laptops, three different browsers and multiple IP locations. I assume it is not a virus on my system. I wonder if the problem is global. Without Google it is hard to know what happens in the world.

Google This site may harm your computer problem 2009

The Proof is in the Pudding – Stating the Obvious III

January 31, 2009

Contrary to what many programmers think, QA’s role is not to do the dirty work for them. QA’s role is to validate, independently, that the code actually works.

The reason I put the responsibility on the coder is simple. The coder is the one who writes the code, the one who understands it and the one who can change it. Why should anyone else be the owner?

QA has far fewer options than the developer for proving the code works and reducing the risk. They can only test the functionality from a black-box perspective.

Smart Software Developer using Virtual Lab Automation

The developer, on the other hand, has multiple options, beyond the ones already listed in part II.

  • Rewrite the code in a more modular fashion so it is easier to have unit tests
  • Move from C# to Python to make it easier to write mocks and do sub system testing
  • Add logs, alerts and assertions so he knows that edge conditions are safely handled
  • Refactor the code so User Interface validations and server validations use the same mechanism
  • Add new code with a separate flag/object/screen so it has less chance of causing a regression in other functionality
  • Shout at the product manager that the requirements are too complex and there is no way to implement them in SQL with proper testing
  • Move from simple ASP.NET mode to MVC model so more parts of the UI can be tested separately
  • Ask QA to help with extensive PRE-COMMIT manual testing as part of the development stage
  • Ask QA to help with running the automatic testing on development branches
  • Help the Automated QA team to make sure new features are tested during the development stage and not post deployment
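
To illustrate the point about mocks in the list above, here is a minimal sketch using Python’s standard unittest.mock module. The function notify_on_low_disk and the disk and mailer objects are hypothetical; the point is only that code written against small interfaces can be tested without a real disk or a real mail server.

```python
from unittest.mock import Mock


def notify_on_low_disk(disk, mailer, threshold_gb: int = 10) -> bool:
    """Send one alert if free space drops below the threshold."""
    free = disk.free_gb()
    if free < threshold_gb:
        mailer.send(f"Low disk space: {free} GB left")
        return True
    return False


# Edge condition: almost no space left. No real disk or mail server involved.
disk = Mock()
disk.free_gb.return_value = 3
mailer = Mock()

alerted = notify_on_low_disk(disk, mailer)
mailer.send.assert_called_once_with("Low disk space: 3 GB left")

# Normal condition: plenty of space, so no mail should go out.
healthy_disk = Mock()
healthy_disk.free_gb.return_value = 500
quiet_mailer = Mock()
quiet = notify_on_low_disk(healthy_disk, quiet_mailer)
quiet_mailer.send.assert_not_called()
```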

The manager’s role is:

  • Reiterate, over and over, the concepts of ownership, proof and responsibility
  • Back the theory with resources – buy machines for testing, software for code checking etc
    • For example, buying two servers for the clustering team so they can test their code actually runs on a cluster
  • Help to manage trade-offs and real world considerations
    • For example, which functionality is used a lot and which is hardly used
  • Pay the “price” for making higher quality code
    • For example, pay $50,000 for a new automated testing project
  • Avoid being dogmatic in the specific methodology
    • For example, unit testing might not be effective in certain places and forcing everyone to do them will just create resentment
  • Introduce and promote new technologies such as Virtualization and lab automation
  • Help apply the right methods in the right context
    • The JavaScript testing framework is great, but should we implement it right now?

To summarize, like any other professional, the developer is the one responsible for the quality of his or her work. Allowing them to push unproven code to customers is what gave us a bad reputation as an industry. However, the best ones are able not just to code, but also to analyze the risk, check for validity, rewrite and design to create bulletproof products.

And if you have read this far, here is a reminder of a lovely 80’s song.

Evidence Based Coding – Stating the Obvious II

January 31, 2009

Continuing from the previous post, let us check why one would pay a developer who can’t prove his code is working.

When I was in Check Point I used to have weekly pseudo-random interviews with employees. It is a great habit that I learned from Dorit Dor, my manager at the time.

When you manage 200+ employees it is one of the only ways to get direct feedback and stay in touch with what’s really going on, but it turns out to be a good idea even when there is only a single employee reporting to you.

One of my favorite questions to developers was:

How do you KNOW the code you produce really works?

Amazingly, this pretty simple question had all of them surprised.

The university graduate, the PhD, the autodidact, the hacker, the PC kid and even the group manager. None were prepared for this question.

The Surprised Software Developer

The common answers were:

  • I don’t know it works, but I have a good feeling about it
  • It works most of the time
  • QA will test it and then I’ll know it works
  • I tried it a bit and it looks fine
  • I did a code review with my team leader and he approved it
  • It is a small change and I’m confident in it
  • There is no way to do it in the time I was given

As you can imagine, I was not very happy with most of these answers. Here are some of the best developers in the world, with five times the salary of a social worker with 30 years’ experience, and they can’t explain why their work is actually, hmm, working.

My belief is that the developer has to PROVE, to a reasonable degree, that the code he commits is working as planned and does not break other code.

If he can’t do it, he should not be committing the code to the general working branch.

How can the poor programmer achieve this goal?

  • Writing unit tests for his code and running them
  • Writing sub system tests for his code and running them
  • Using code checking tools, looking for warnings, errors and suggestions
  • Asking peers for a code review
  • Going through the design and requirements and validating that the actual code implements them
  • Manually working with the system and going through all scenarios he claims to support
  • Spending a couple of hours trying to come up with all the extreme cases and special problems
  • Going over the QA test design and making sure his code will pass the tests

And by the way, if all these methods are not available, reliable or feasible, it is also OK to commit the code if the developer EXPLICITLY lets everyone know the status beforehand and gets the manager’s approval.

I want to commit the new screen, but I never tested it on Firefox and therefore I assume it does not work on Firefox. Is it OK to commit? I also didn’t test the sorting or the client side validation, but I think they might work because I didn’t touch this code and it is very solid.

Obviously, developers are notoriously over-optimistic, so this should be kept as a last resort, but making them say it out loud is key to maintaining a high level of professionalism and ownership. More on this in the next post.