It is fun for customers who use the features instead of watching fictitious “product road maps”.
It is fun for developers who see their work is actually used.
It is fun for the executives who can change the business priorities quickly.
It is fun for product managers who can measure actual usage.
It is fun for the R&D manager, as problems cannot be hidden for long.
In my company, we delivered 72 versions to customers in three years.
Here is one way to do it:
Hire top talent for development, QA, IT and operations.
Deliver the product as a Service (SaaS). Upgrading one instance is much easier than upgrading 10,000.
Twice-weekly synchronization meetings, on Monday and Thursday. Monday is just team leaders; Thursday is all of R&D.
Invest early in QA automation. We invested $20,000 in Automation infrastructure at a very early stage.
Invest in Unit-Testing as much as possible.
Avoid branching. Branches are evil. Merges are Yikes. One branch is good, two is max.
Invest in the “Ugly stuff”. Deployment scripts, upgrade scripts, database consistency.
Constructive dictatorship. Every code change has a ticket. Every one. No exceptions. Really.
The first week is for coding. Then comes feature freeze: three days for QA and bug fixes. Then code freeze: two days for final QA and critical fixes only. Release on Sunday.
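As a sketch of what that early automation investment buys, here is a minimal pytest-style unit test file. The `apply_discount` function is a hypothetical stand-in for any small piece of business logic; the point is that checks like these are cheap to write and can run on every commit.

```python
# test_discount.py -- run with `pytest`.
# apply_discount is a hypothetical function standing in for real business logic.

def apply_discount(price, percent):
    """Return the price after a percentage discount, rounded to cents."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (100 - percent) / 100, 2)

def test_typical_discount():
    assert apply_discount(200.0, 25) == 150.0

def test_boundary_discounts():
    assert apply_discount(99.9, 0) == 99.9
    assert apply_discount(99.9, 100) == 0.0

def test_rejects_out_of_range_percent():
    import pytest
    with pytest.raises(ValueError):
        apply_discount(10.0, 150)
```

A suite like this is the first brick of the automation infrastructure: once it runs automatically, nobody has to remember to re-check the basics before each release.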
In the next post I’ll try to answer the tricky questions: What about longer features? How not to scare the customers? and more.
3M offers a great tool for finding the best user interface for your application. In less than five minutes it analyzes the attention attractiveness of a web application.
The score indicates the probability that each area of interest would get attention in the first 3-5 seconds.
Visual Attention Service - CloudShare ProPlus Screen Area Of Interest
Here is an example of the original CloudShare ProPlus screen.
CloudShare ProPlus Screen Original Screen
Heatmaps highlight areas of the image that are likely to get attention in the first 3-5 seconds.
See how the feedback button (lower right corner) gets a lot of attention, but the Upgrade button (top right corner) does not.
CloudShare ProPlus Screen Heatmap Visual Attention Service
And a similar discrete view.
CloudShare ProPlus Screen Regions Visual Attention Service
To try it for your own application or design:
1. Take a screenshot of your application (full screen, high resolution).
The service is not cheap ($12-$20 per picture), but you can try five pictures for free. Moreover, it can save hours of pointless arguments between product managers, developers, designers and even VPs of Marketing.
It seems our mysterious IronPython memory leak is actually a “handle leak”. These are even harder to detect than memory leaks. It seems some of our “Futures” are leaking. I guess we’ll have to patch a hole in the space-time continuum to fix it.
In other words, when we create a “Future” we use one event and one mutex (or “Mutant” in Windows kernel lingo), and they just keep piling up. It all sounds like a sci-fi B movie. Read all about it, if you have trouble falling asleep.
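Our real leak lives in IronPython internals, but the piling-up pattern itself is easy to show. This toy Python sketch (all names hypothetical) creates one event and one lock per “future” and parks them in a registry that is never cleaned, so the count of live synchronization objects only grows:

```python
import threading

class LeakyFuture:
    """Toy stand-in for a future that allocates sync objects on creation."""
    live_handles = []  # simulates the process handle table

    def __init__(self):
        self.done = threading.Event()   # one event per future
        self.lock = threading.Lock()    # one mutex per future
        # Bug: the objects are parked in a global registry and never removed,
        # so even after the future completes, the "handles" stay alive.
        LeakyFuture.live_handles.extend([self.done, self.lock])

    def complete(self):
        self.done.set()  # the future finishes, but nothing is cleaned up

def run_batch(n):
    """Run n futures to completion and return the live handle count."""
    for _ in range(n):
        LeakyFuture().complete()
    return len(LeakyFuture.live_handles)
```

After `run_batch(1000)`, the registry holds 2,000 synchronization objects. Memory barely moves, which is exactly why a handle leak hides from memory profilers: you have to watch the handle count, not the heap.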
On the database side things are just as challenging. It seems next to impossible to get determinism out of our SQL Server 2008. We are trying to understand why the results in our production database, our QA database and our performance setup are sometimes completely different.
It is pretty annoying when the same query takes two seconds in one place and thirty seconds in another, while the database tables are identical.
I would have expected this type of problem to go away in 2010. Here are a few insights (credit goes to Roy R).
Increasing the number of CPUs can decrease performance when running on VMware, because VMware needs all four virtual CPUs to be free at the same time, which happens less often than having just two CPUs free.
Why would a query run slowly inside the web app, but quickly when run inside SQL Server Management Studio on the same database?
Why would a query running slowly suddenly start running quickly on the same web app?
Sounds like Voodoo? Probably, but the answers may lie in the SQL plan cache.
Q: Why would a query run slowly inside the web app, but quickly when run inside SQL Server Management Studio on the same DB?
A: The two queries may be using different query plans because of different text, parameterization or connection settings, or the old query plan may have become obsolete.
Q: Why would a query running slowly suddenly start running quickly on the same web app?
A: The query plan may have been refreshed. Changes to the table (updates, deletes or inserts) can trigger an automatic statistics update. The plan can also be retired after a while to free memory.
Different queries produce different plans. Text matters and parameterization matters.
It is quite hard to “freeze” the query plan, since doing so requires a lot of memory and there are too many variations.
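To make “text matters” concrete, here is a toy Python model (not SQL Server internals) of an ad-hoc plan cache keyed by the exact query text. Two logically identical queries with different text compile twice, while a parameterized query compiles once and is reused:

```python
# Toy model of an ad-hoc plan cache keyed by the *exact* query text.
plan_cache = {}
compile_count = 0

def get_plan(query_text):
    """Return a cached plan, 'compiling' a new one on a text miss."""
    global compile_count
    if query_text not in plan_cache:
        compile_count += 1               # a "recompile"
        plan_cache[query_text] = f"plan#{compile_count}"
    return plan_cache[query_text]

# Logically identical queries, textually different -> two separate plans.
get_plan("SELECT * FROM Users WHERE Id = 42")
get_plan("select * from Users where Id = 42")

# Parameterized text is stable -> one plan, reused for every value.
get_plan("SELECT * FROM Users WHERE Id = @id")
get_plan("SELECT * FROM Users WHERE Id = @id")
```

In this toy model `compile_count` ends at 3: two plans for the textually different ad-hoc queries, one shared plan for the parameterized form. This is why consistent, parameterized query text is the first step toward consistent performance.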
This is an interesting variation of the Heisenberg Uncertainty Principle: trying to measure the performance changes the statistics, and therefore changes the measurement. This is also known as a Heisenbug. We are open to creative ideas. Till then, I’m considering trying out London’s first pensioners’ playground.
Some bugs in software are extremely hard to track down. One keeps trying to “kill” them and failing, for a long time. Much like the guys in this video.
The most difficult bugs in the world can take months to track down and isolate. While not always deadly, even when the bug’s location is known (like George Bush in Washington DC), the fix itself can be extremely difficult.
Bugs are hard to solve when they are non-deterministic, non-reproducible or out of our control. They happen only in production systems, only under load, only in certain places, or just on the customer’s desktop.
Here are a few examples of elusive software bugs, their characteristics, and how to turn them into butterflies of code :).
We just found a bug in the way Firefox loads plug-ins. It appears that Firefox caches a plug-in DLL so it does not have to load it many times. While the feature is useful, there is a flaw in the caching mechanism that causes the Java plug-in to load instead of ours for our object.
Why was it hard to find? The bug only happens under very specific conditions, and not all the time. It only happens on Firefox 3.6.3. It only happens when the plug-in is embedded with the Object HTML tag and not the “SRC” attribute. It only happens when Java and our plug-in are loaded in the same page, in a certain order…
What did we do? We (which means Leeor) compiled Firefox and ran the source code under a debugger. Thank God for open source. If we had hit the same issue in IE, we would not have had much of a chance to solve it. We submitted the bug to Mozilla, but since we needed a solution right away, we found a workaround by identifying the root cause.
Why was it hard to find? Memory leaks are extremely hard to find because they tend to be non-deterministic. When the memory of the process just keeps growing, it is relatively easy to find the problem’s source. But in many runtimes the memory management and garbage collection have become so sophisticated that it is not trivial to know whether there is a leak at all. Memory goes up and down, or just stays still.
In theory, memory management problems were supposed to go away in Java, C# and Python. While most of them have, the ones that remain are the hardest to solve. For a while we kept hunting an “Out of Memory” problem in IronPython. In the good (?) old days of C++ we could have used a memory profiler to locate our lost memory chunks. In IronPython this is next to impossible, since the .NET objects are so mangled that it is not possible to correlate them to the original language objects. Therefore, traditional tools like Quantify have little value.
What did we do? Sometimes the best solution is to write a custom memory-management library. We used this trick at Check Point when we needed to debug memory leaks in the kernel, where no standard tool works. This approach works especially well when the infrastructure is written with it from the ground up.
A similar approach can be used in Python, but the performance overhead is too high to run it in production. Unfortunately, the memory leaks only happen in production…
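When the leak can be reproduced outside production, Python’s standard `tracemalloc` module (in the standard library since Python 3.4) offers a lighter alternative to a custom allocator: take a snapshot before and after the suspected workload and diff them. Here `suspected_leaky_operation` is a hypothetical stand-in for the real code path:

```python
import tracemalloc

leaky_registry = []  # hypothetical global that quietly accumulates objects

def suspected_leaky_operation():
    # Stand-in for the real workload; each call parks 10,000 ints forever.
    leaky_registry.append(list(range(10_000)))

tracemalloc.start()
before = tracemalloc.take_snapshot()

for _ in range(100):
    suspected_leaky_operation()

after = tracemalloc.take_snapshot()
top = after.compare_to(before, "lineno")[:3]
for stat in top:
    print(stat)  # the leaking allocation site floats to the top of the diff
tracemalloc.stop()
```

The top entries of the diff point at the source lines doing the largest net allocation, which is usually enough to find the registry or cache that keeps growing. (This works for CPython objects; it would not have caught our IronPython handle leak.)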
If the problem can be reproduced with unit tests, life is a bit better. One innovative idea that Idan came up with is binary search over the code’s history. Since we moved to Git we can now “change the past” retroactively: we pretend the unit test was written long ago and run it against old commits. Using binary search, we can locate the exact commit in which the memory leak first appeared.
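This is exactly what `git bisect` automates. Conceptually it is a plain binary search over commit history, assuming the test is monotonic: it passes on every commit before the offending one and fails on every commit after. A sketch with a toy commit list:

```python
def find_first_bad(commits, test_passes):
    """Binary search for the first commit where test_passes() fails.

    Assumes commits are ordered oldest-to-newest and that once the test
    starts failing it keeps failing (the git-bisect assumption).
    """
    lo, hi = 0, len(commits) - 1
    first_bad = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if test_passes(commits[mid]):
            lo = mid + 1          # still good: the bug was introduced later
        else:
            first_bad = commits[mid]
            hi = mid - 1          # bad: look earlier for the first failure
    return first_bad

# Toy history: the leak was introduced at commit "c6".
history = [f"c{i}" for i in range(10)]
no_leak_before = lambda commit: int(commit[1:]) < 6
print(find_first_bad(history, no_leak_before))  # prints c6
```

With 1,000 commits this takes about ten test runs instead of a thousand, which is what makes “pretending the unit test existed in the past” practical.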
Another option is to look for the usual suspects: unmanaged code. In one case we used the NetApp SDK, written in C++, from our .NET code. It took three iterations to resolve all the memory leaks caused by their library. Pretty much a trial-and-error process.
A few hours passed, and Sasha felt more confident about starting the recovery of the ClearCase source files. The initial restore was successful (300GB and 50 million files) and everyone was happy. Everyone but yours truly.
Me: How do you know the recovery was successful ?
Sasha: The recovery software says so.
Me: Well, we know what that’s worth. How do I know all the files are OK?
Sasha: We will run FSCK on all the file systems and see what happens.
Me: That’s a good start, but it still does not tell me which files we lost, what data is corrupted, or whether anything changed. And by the way, how long will it take to run FSCK?
Sasha: These are excellent questions, I don’t have any idea. Let’s try and see what happens.
Backup and Restore Case Study
Running FSCK took a few more hours, but everything seemed OK. We tried to bring ClearCase up, but it refused. Surprisingly, the restore process did not restore all the file permissions and soft links; these had to be added manually.
I was still not sure we hadn’t lost information. With 300 modules, 20 million files and thousands of branches, it is really hard to know that nothing was lost. It could take months before someone tries to build an old package and finds out something is missing. With databases there are multiple integrity checks in place. With a version-control system such as ClearCase it is not so simple.
When we called the ClearCase guys (also IBM), it turned out there is a secret script that validates the internal consistency of the setup. We decided to run the script and, at the same time, build all the main products on the compile servers. If all the compilation results came out identical to what we had, we could be quite confident that at least the latest source was available.
What we soon discovered is that we couldn’t do both at the same time: the hidden script was locking the files, so we had to run the build after the script. Of course, nobody at IBM could estimate how long the script would run. I sent all the developers home at this stage.
A couple of hours later everything seemed to be in order. I went back home and prepared for the holy day. Yom Kippur is the day of the year when Jews ask forgiveness for the wrongs they have done to their fellow men. It is also a day for reflection, and I felt this was a very appropriate opportunity.
Lessons Learned:
A backup without frequent restore exercises is like a Pizza with no cheese
Just like High Availability never really works in Active-Passive mode until you actually test the failover
Everything starts from the requirements. IT is no different from R&D
Restore time is one example 🙂
If you can’t prove that the restore works you have a serious problem
This is a hard one. Think – what will make you sleep well at night.
When crisis happens, it is very nice to have a process in place.
Start from the phone list of key people
Trying to save money might cost you lots of money
There is usually a reason for high-end products
Trust no one
If they say they have a backup, ask them for the DRP plans
If they claim to have a backup, ask them to restore something
It seems the Google main search engine has a critical problem.
It reports every site as malicious, even when you just search for CNN.
See the attached image. I validated it on http://www.google.co.il and www.google.com, on three different laptops, three different browsers and multiple IP locations, so I assume it is not a virus on my system. I wonder if the problem is global. Without Google, it is hard to know what is happening in the world.
Google This site may harm your computer problem 2009
Contrary to what many programmers think, QA’s role is not to do the dirty work for them. QA’s role is to validate, independently, that the code actually works.
The reason I put the responsibility on the coder is simple. The coder is the one who writes the code, the one who understands it and the one who can change it. Why should anyone else be the owner?
QA has far fewer options than the developer for proving the code works and reducing the risk; they can only test the functionality from a black-box perspective.
Smart Software Developer using Virtual Lab Automation
The developer, on the other hand, has multiple options, beyond the ones already listed in part II:
Rewrite the code in a more modular fashion so it is easier to write unit tests
Move from C# to Python to make it easier to write mocks and do sub-system testing
Add logs, alerts and assertions so he knows that edge conditions are safely handled
Refactor the code so user-interface validations and server validations use the same mechanism
Add new code behind a separate flag/object/screen so it has less chance of causing regressions in other functionality
Shout at the product manager that the requirements are too complex and there is no way to implement them in SQL with proper testing
Move from the simple ASP.NET model to the MVC model so more parts of the UI can be tested separately
Ask QA to help with extensive pre-commit manual testing as part of the development stage
Ask QA to help with running the automated testing on development branches
Help the automated QA team make sure new features are tested during the development stage and not post-deployment
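As one concrete illustration of the mock option, here is a sketch using Python’s standard `unittest.mock`. The `upgrade_account` function and the `billing_service` dependency are hypothetical; the point is that the developer can prove the logic without a real billing system:

```python
from unittest.mock import Mock

# Hypothetical function under test: charges a customer via an external
# billing service and reports whether the upgrade succeeded.
def upgrade_account(billing_service, customer_id, plan_price):
    if plan_price <= 0:
        raise ValueError("plan price must be positive")
    receipt = billing_service.charge(customer_id, plan_price)
    return receipt is not None

# The developer proves the logic without touching a real billing system.
billing = Mock()
billing.charge.return_value = {"receipt_id": "r-123"}

assert upgrade_account(billing, "cust-42", 99.0) is True
billing.charge.assert_called_once_with("cust-42", 99.0)
```

The same trick covers the sub-system testing option: stub out the slow or external pieces, and the interesting logic becomes testable at every commit.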
The manager role is:
Iterate over and over the concept of ownership, proof and responsibility
Back the theory with resources: buy machines for testing, software for code checking, etc.
For example, buying two servers for the clustering team so they can test their code actually runs on a cluster
Help to manage trade-offs and real world considerations
For example, which functionality is used a lot and which is hardly used
Pay the “price” for making higher quality code
For example, Pay $50,000 for a new automated testing project
Avoid being dogmatic in the specific methodology
For example, unit testing might not be effective in certain places and forcing everyone to do them will just create resentment
Introduce and promote new technologies such as Virtualization and lab automation
Help apply the right methods in the right context
The JavaScript testing framework is great, but should we implement it right now ?
To summarize: like any other professional, the developer is the one responsible for the quality of his or her work. Allowing developers to push unproven code to customers is what gave us a bad reputation as an industry. However, the best ones are able not just to code, but also to analyze the risk, check for validity, rewrite and design to create bulletproof products.
And if you have read this far, here is a reminder of a lovely ’80s song.
Continuing from the previous post, let us ask why anyone would pay a developer who can’t prove his code is working.
When I was in Check Point I used to have weekly pseudo-random interviews with employees. It is a great habit that I learned from Dorit Dor, my manager at the time.
When you manage 200+ employees it is one of the only ways to get direct feedback and stay in touch with what’s really going on, but it turns out to be a good idea even when there is only a single employee reporting to you.
One of my favorite questions to developers was :
How do you KNOW the code you produce really works ?
Amazingly, this pretty simple question surprised all of them.
The university graduate, the PhD, the autodidact, the hacker, the PC kid and even the group manager: none were prepared for this question.
The Surprised Software Developer
The common answers were :
I don’t know it works, but I have a good feeling about it
It works most of the time
QA will test it and then I’ll know it works
I tried it a bit and it looks fine
I did a code review with my team leader and he approved it
It is a small change and I’m confident in it
There is no way to do it in the time I was given
As you can imagine, I was not very happy with most of these answers. Here are some of the best developers in the world, with five times the salary of a social worker with 30 years of experience, and they can’t explain why their work is actually, hmm, working.
My belief is that the developer has to PROVE, to a reasonable degree, that the code he commits works as planned and does not break other code.
If he can’t do that, he should not be committing the code to the general working branch.
How can the poor programmer achieve this goal?
Writing unit tests for his code and running them
Writing sub system tests for his code and running them
Using code checking tools, looking for warning, errors and suggestions
Asking peers for a code review
Going through the design and requirements and validating that the actual code implements them
Manually working with the system and going through all scenarios he claims to support
Spending a couple of hours trying to come up with all the extreme cases and special problems
Going over the QA test design and making sure his code will pass the tests
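The extreme-case hunt pays off most when the cases are written down as data rather than kept in someone’s head. A small sketch, with `parse_port` as a hypothetical helper under test:

```python
def parse_port(text):
    """Parse a TCP port from user input; hypothetical helper under test."""
    value = int(text.strip())
    if not 1 <= value <= 65535:
        raise ValueError(f"port out of range: {value}")
    return value

# Extreme cases written down as data, so nobody forgets them next release.
edge_cases = [
    ("80", 80),            # typical
    ("  443 ", 443),       # surrounding whitespace
    ("1", 1),              # lower bound
    ("65535", 65535),      # upper bound
]
for raw, expected in edge_cases:
    assert parse_port(raw) == expected

# Inputs that must be rejected, not silently accepted.
for bad in ["0", "65536", "-1", "http", ""]:
    try:
        parse_port(bad)
    except ValueError:
        pass
    else:
        raise AssertionError(f"{bad!r} should have been rejected")
```

Listing the cases makes the claim checkable: anyone can read exactly which inputs the developer verified, and QA can extend the table instead of guessing.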
And by the way, if all these methods are unavailable, unreliable or infeasible, it is also OK to commit the code, provided the developer EXPLICITLY lets everyone know the status beforehand and gets the manager’s approval:
“I want to commit the new screen, but I never tested it on Firefox, and therefore I assume it does not work on Firefox. Is it OK to commit? I also didn’t test the sorting or the client-side validation, but I think they might work, because I didn’t touch that code and it is very solid.”
Obviously, developers are notoriously over-optimistic, so this should be kept as a last resort, but making them say it out loud is key to maintaining a high level of professionalism and ownership. More on this in the next post.