Architecture/Source Code Inventory

Owner: Kay Ramme, Stefan Zimmermann Type: analysis State: draft

Introduction

Recent surveys and current experience with the project have raised concern about the existing "barrier of entrance" for potential contributors, which may keep developers, for example, from becoming active members of the community. This "barrier of entrance" surely has many dimensions. Some of them may be the complexity of the source code, the build environment, the lack of modularity, or simply the sheer mass of items involved in the product.

http://marketing.openoffice.org/ooocon2006/presentations/wednesday_c10.odp

Therefore Kay Ramme and Stefan Zimmermann stepped up to determine the sub-dimensions of complexity, to find and develop measures that quantify the code base of the OpenOffice.org project, and to provide data describing these sub-dimensions to potential improvement teams. This is a call for help: everybody who wants to contribute experiences and ideas is more than welcome.

Motto

The overarching motto we agree on is: Less [code] is better!, where the word "code" is actually optional.

If we say "less", we in turn need to know how much we have now. That means we need to quantify our (code) base. As a first step, though, we think we should focus on these specific areas:

dead code
redundancy
cyclomatic complexity (McCabe)
(useless features)

After these focus areas are addressed, we may focus more on finding indicators for some of the properties described in the next section, "The Zen of Programming" ;)
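As an illustration of the cyclomatic complexity focus area, the following is a minimal sketch that approximates McCabe's number for a single C/C++ function body by counting decision keywords and operators. The names `approximate_complexity` and `DECISION_PATTERN`, and the exact set of counted tokens, are assumptions of this sketch. A proper measurement would work on the control-flow graph of each function, so the result here is only a rough indicator.

```python
import re

# Tokens treated as decision points (an assumption of this sketch).
DECISION_PATTERN = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")

# Comments and string/character literals, so their content is not counted.
NON_CODE_PATTERN = re.compile(
    r'//[^\n]*|/\*.*?\*/|"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'',
    re.DOTALL,
)


def approximate_complexity(function_source: str) -> int:
    """Return 1 plus the number of decision points found in the text."""
    code = NON_CODE_PATTERN.sub(" ", function_source)
    return 1 + len(DECISION_PATTERN.findall(code))


if __name__ == "__main__":
    sample = """
    int classify(int x) {
        // if this comment mentions while, it must not count
        if (x < 0 && x != -1) { return -1; }
        for (int i = 0; i < x; ++i) {
            if (i == 3) return 3;
        }
        return x > 10 ? 1 : 0;
    }
    """
    print(approximate_complexity(sample))
```

The sketch expects the text of one function at a time. Splitting a file into functions is a separate problem that would need a real parser.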

The Zen of Programming

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

(cited from the Zen of Python by Tim Peters)

Possible data collection plan

Data to be collected:

At first the data will be quantitative, ranging from the number of files, lines of code (in its two forms, LINES and SLOC according to the DSI concept), number of classes and methods, to lines of code per function. File dependencies, file scattering and file location will also come into the focus of the investigation.
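The basic line counts mentioned above could be gathered with something like the following minimal sketch for C-style sources. `LineCounts` and `count_lines` are hypothetical names. The sketch counts physical source lines, which is a simplification of the DSI concept, and it does not recognize comment markers that appear inside string literals.

```python
from dataclasses import dataclass


@dataclass
class LineCounts:
    lines: int = 0    # all physical lines (LINES)
    blank: int = 0    # empty or whitespace-only lines
    comment: int = 0  # lines that hold only comment text
    sloc: int = 0     # lines that hold at least some code


def count_lines(source: str) -> LineCounts:
    """Classify each physical line of a C-style source text."""
    counts = LineCounts()
    in_block = False  # True while inside a /* ... */ comment
    for raw in source.splitlines():
        counts.lines += 1
        line = raw.strip()
        if not line:
            counts.blank += 1
            continue
        has_code = False
        i = 0
        while i < len(line):
            if in_block:
                end = line.find("*/", i)
                if end == -1:
                    break              # comment continues on the next line
                in_block = False
                i = end + 2
            elif line.startswith("//", i):
                break                  # rest of the line is a comment
            elif line.startswith("/*", i):
                in_block = True
                i += 2
            else:
                if not line[i].isspace():
                    has_code = True
                i += 1
        if has_code:
            counts.sloc += 1
        else:
            counts.comment += 1
    return counts
```

A comment line to source line ratio then follows as `counts.comment / counts.sloc`, provided `counts.sloc` is not zero.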

Purpose of Data Collection:

Ultimately, the goal is to provide ideas on how to simplify the project in order to lower the "barrier of entrance" for contributors, and to determine whether maintenance capability or maintainability can be expressed through such data.

What Insight The Data Will Provide:

The data, when counted and compared, will provide us with information about dependencies and redundancies in the code, as well as the purpose/duty of specific code sections.
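To make the dependency aspect concrete, here is a minimal sketch that derives a file inclusion graph from `#include` directives and reports circular inclusion chains. The function names are hypothetical. Files are matched by the name written in the directive, without resolving include paths or evaluating conditional compilation, both of which a real inventory would have to handle.

```python
import re
from typing import Dict, List

INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)


def parse_includes(source: str) -> List[str]:
    """Return the file names named in #include directives, in order."""
    return INCLUDE_RE.findall(source)


def build_graph(files: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each file name to the list of files it includes."""
    return {name: parse_includes(text) for name, text in files.items()}


def find_cycles(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Depth-first search that records one cycle per back edge found."""
    cycles: List[List[str]] = []
    state: Dict[str, int] = {}  # 1 = on the current path, 2 = finished
    path: List[str] = []

    def visit(node: str) -> None:
        state[node] = 1
        path.append(node)
        for dep in graph.get(node, []):
            if state.get(dep) == 1:
                # dep is already on the current path: a cycle closes here
                cycles.append(path[path.index(dep):] + [dep])
            elif dep not in state:
                visit(dep)
        path.pop()
        state[node] = 2

    for node in graph:
        if node not in state:
            visit(node)
    return cycles
```

The search reports one cycle per back edge rather than every elementary cycle, which is usually enough to point at the files involved.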

How It Will Help Potential Improvement Teams:

The teams will be able to decide whether to eliminate, consolidate, refactor or modularize code, or simply to drop the possible effects of the multiple dimensions of complexity from consideration.

What Will Be Done With The Data After Collection:

The teams will use the data to arrive at code complexity measures, which may be able to distinguish code that is "easy to maintain" from code that is "not so easy to maintain" :). The data will certainly also be used to continuously draw a picture of what the OpenOffice.org code base is about and how it develops over time.

What data we think should be collected, and why (detailed)

Counts
- Files to handle
  - evaluation of possible consolidation efforts (scattering together with location)
  - comparison with "best practice" data from industry
  - ratio of product source to product build environment
- LINES
  - size estimates
  - comment line / source line ratio
  - best practice comparison
  - language to language comparison
- SLOC (source lines of code) according to DSI concept (delivered source instructions)
  - use in COCOMO II (Constructive Cost Model II)
  - PM (person month) estimates
  - TDEV (development time) estimates
- Preprocessor directives
  - creating file inclusion hierarchy
  - comparing definition count (constants and macros) with "best practice"
- Keywords
  - calculate cyclomatic complexity (McCabe)
  - compare with "best practice"
- Statements
  - compare with DSI
  - estimate statement density per method, file
- Classes
  - class hierarchy (inheritance depth)
  - dependencies (circular)
  - "is a" - "has a" relationships (ratio)
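The PM and TDEV estimates listed under SLOC can be sketched as follows. The constants are the COCOMO II.2000 calibration values as commonly published, and the default scale factor sum of 18.97 corresponds to all five scale factors at their nominal rating. Both should be treated as assumptions here and checked against the COCOMO II model definition before use. The function names are hypothetical, and the schedule formula is the nominal one, without schedule compression.

```python
import math
from typing import Iterable

# COCOMO II.2000 calibration constants (assumed, see lead-in).
A, B, C, D = 2.94, 0.91, 3.67, 0.28
NOMINAL_SCALE_FACTOR_SUM = 18.97  # all five scale factors rated nominal


def effort_person_months(
    ksloc: float,
    scale_factor_sum: float = NOMINAL_SCALE_FACTOR_SUM,
    effort_multipliers: Iterable[float] = (),
) -> float:
    """PM = A * KSLOC^E * product(EM), with E = B + 0.01 * sum(SF)."""
    exponent = B + 0.01 * scale_factor_sum
    return A * ksloc ** exponent * math.prod(effort_multipliers)


def schedule_months(
    person_months: float,
    scale_factor_sum: float = NOMINAL_SCALE_FACTOR_SUM,
) -> float:
    """TDEV = C * PM^F, with F = D + 0.2 * (E - B)."""
    exponent_e = B + 0.01 * scale_factor_sum
    exponent_f = D + 0.2 * (exponent_e - B)
    return C * person_months ** exponent_f


if __name__ == "__main__":
    pm = effort_person_months(100.0)  # a hypothetical 100 KSLOC module
    print(round(pm, 1), round(schedule_months(pm), 1))
```

Because the exponent E is greater than 1 at nominal ratings, the estimated effort grows faster than the code size, which fits the motto that less code is better.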

Any ideas and experiences about what to collect, and why, are welcome.

Links

Wikipedia gives a good introduction to software metrics, pros / cons and approaches.
Thorsten was so kind as to create a page with various (source) code tools, including tools that generate metrics. See Other Tools.
Official OOo statistics - http://stats.openoffice.org
Wikipedia article on software maintenance (wikipedia:Software_maintenance).

To be continued ...
