wasc-satec@lists.webappsec.org

WASC Static Analysis Tool Evaluation Criteria

Comments on the direction of SATEC - and on every item

Romain Gaucher
Mon, Aug 22, 2011 3:47 PM

Everyone, sorry for the loooong delay in responding; the past weeks have been
totally crazy.
Anyhow, here's what I think and what I would suggest, based on the current draft:

1. Tool Setup and Installation

"Setup and Installation" is not really interesting, I believe. The more
important is the platform support (can I run the tool from my linux box, my
mac, our windows server, etc).

1.1 Time required to perform initial installation

That's usually subjective, unless you say something like "always less than 2
hours", "always less than a day", etc. But then again, I find this largely
irrelevant to the problem.

1.2 Skills required to perform initial installation

Subjective.

1.3 Privileges required to perform initial installation

I don't find this item very informative. Okay, you need to have root access,
or admin access on the machine... or not.

1.4 Documentation setup accuracy

Subjective.

1.5 Platform Support

This one is interesting for the customers.

2. Performing a Scan

Logically, I would not talk about scanning just yet. After the platform
support section, I would talk about language and framework support.

2.1 Time required to perform a scan

This does not make any sense. "Time required to scan"... what? This question,
however, is answerable if we provide a proper test case and environment in
which to run the tool. But then again, it's quite misleading information.

2.2 Number of steps required to perform a scan

Many tools have scripting interfaces. Using scripts, you reduce your steps
from 7 to 1 (i.e., run the script). How does that count?
In summary, I don't find this information interesting at all.
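To make that concrete, here is a minimal sketch of what such a script could
look like (the "sast-cli" command and its flags are purely hypothetical, not
any vendor's actual interface):

    #!/usr/bin/env python3
    """Sketch: collapsing a multi-step scan into one scripted step."""
    import subprocess
    import sys

    def run_scan(project_dir: str, report_path: str) -> int:
        # Each command below would otherwise be a manual step in a GUI.
        steps = [
            ["sast-cli", "create-project", "--dir", project_dir],            # hypothetical
            ["sast-cli", "scan", "--project", project_dir],                  # hypothetical
            ["sast-cli", "export", "--format", "xml", "--out", report_path], # hypothetical
        ]
        for cmd in steps:
            result = subprocess.run(cmd)
            if result.returncode != 0:
                return result.returncode
        return 0

    if __name__ == "__main__":
        sys.exit(run_scan("src/", "findings.xml"))

Counting "steps" on something like this tells you nothing about the tool
itself.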

2.3 Skills required to perform a scan

I understand that some tools (like PolySpace) require someone to actually
design and model the suspected behavior of the program. But most tools do not
require that. Then again, how do we rate the user? Do we assume the user (who
runs the scan) will also look at the findings? Does he also set up the scan?
I can definitely see the scan being run by security operations (mostly for
monitoring) and being set up by security engineers...

3. Tool Coverage:

"Tool Coverage" might be the most misleading term here. Coverage of what?!
Coverage of supported weaknesses, languages, version of languages,
framework, application coverage, entry point coverage, etc.?

3.1 Languages supported by the tool

Very important. Now, we should not limit ourselves to the languages; we
should go down to the framework-version level. Nowadays, the language is just
a means; most of the juicy stuff happens in the relationship with the
frameworks... Also, the behavior of a framework might differ from one version
to another...

3.2 Support for Semantic Analysis
3.3 Support for Syntactic Analysis

I do not understand these items. (Usually, "semantic" is used to mean
something like AST-level knowledge.) Honestly, I would be more interested to
know whether the tool is properly capable of inter-procedural data flow
analysis, or whether it has some other limitations. Then again, I would
prefer not to talk about the underlying logic (and modeling) of the tool,
since I believe this is out of scope. Users don't really care about that;
they just want the tool to work perfectly. Whether you use a dataflow-based
model, abstract interpretation, or whatever one comes up with... don't care.
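To illustrate what I mean by inter-procedural data flow (a made-up Python
snippet, not output from any particular tool): the weakness below is only
visible if the analysis follows the tainted value across function boundaries.

    # Source and sink live in different functions; an analysis limited to a
    # single function body sees nothing suspicious in either one alone.

    def read_username(request):
        # Source: attacker-controlled input (hypothetical request object).
        return request.params["username"]

    def build_query(name):
        # Sink: string concatenation into SQL (a SQL injection weakness).
        return "SELECT * FROM users WHERE name = '" + name + "'"

    def handle(request, db):
        tainted = read_username(request)   # taint enters here...
        query = build_query(tainted)       # ...and reaches the sink here
        return db.execute(query)

Whether the tool finds this via dataflow, abstract interpretation, or
something else is irrelevant to the user; finding it is what matters.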

3.4 Ability of the tool to understand different components of a project
(.sql, .xml, .xsd, .properties…etc)

This is a very interesting item. When generalized a little bit, we can
derive several items:

  • Analysis support for configuration files (i.e., the tool gets knowledge
    from the configuration files)
  • Analysis support for multiple languages in separate files
  • Cross-language analysis support (the tool is capable of carrying its
    analysis from one language, let's say Java, into SQL, and back to Java)

Another item that would be quite interesting is support for "new extensions",
or redefinition of extensions. Let's say the tool recognizes ".pl" as Perl,
but I have all my stored procedures (in PL/SQL) with this extension; I'd like
to be able to tell the tool to treat .pl as PL/SQL for this application. The
same reasoning applies to entirely new extensions.
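As a sketch of what such an override could look like (the mapping format and
helper below are invented for illustration; no real tool's configuration is
implied):

    import os

    # Hypothetical per-application overrides: extension -> analysis language.
    EXTENSION_OVERRIDES = {
        ".pl": "plsql",   # this application's .pl files are PL/SQL, not Perl
        ".tpl": "jsp",    # a "new" extension mapped onto an existing parser
    }

    def language_for(path: str, defaults: dict) -> str:
        """Resolve the language for a file, letting per-app overrides win."""
        ext = os.path.splitext(path)[1].lower()
        return EXTENSION_OVERRIDES.get(ext, defaults.get(ext, "unknown"))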

3.5 Coverage of Industry Standard Vulnerability Categories (OWASP Top 10,
SANS Top 25…etc)

Static analysis tools do not find vulnerabilities. They find source code
weaknesses (there is a huge difference). Now, I do not understand what
"coverage of industry standard vulnerability categories" means.

  • Is this category supposed to be about coverage of the types of "stuff" (or
    weaknesses, flaws, bugs, issues, etc.) that the tool can find? If so, we
    should use CWE, and nothing else.
  • Is this category about the reporting and classification of findings?
    (Such as, "Oh noes, this finding is mapped to OWASP Top 10 risks... that's
    very bad for your PCI compliance!")

4. Detection Accuracy

Usually, that does not mean anything.

4.1 Number of false positives
4.2 Number of true negatives

My first comment here was "Gniii?"; then I did s/Number/Rate and it made a
bit more sense.
I can understand why someone would want a rate of false positives and false
negatives, but true negatives? True negatives are the things that are not
reported by the tool, and it's good that the tool does not report them; an
example would be a data-flow path that goes through a proper validation
routine before sending the data to a sink. You do not want the tool to report
that, and this is a true negative.

By the way, the rates of FP/FN are very interesting from an experimental
point of view, but there is no way to get this data to mean anything for Joe
the project manager who wants to buy a tool. Most likely your data will be
very different from his (if the same experiment is run on different
applications). Sad reality check: tool results depend a lot on the
application.
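A small worked example of why "number of true negatives" is the odd one out
(all counts below are made up, measured against a hypothetical labeled
benchmark; outside such a benchmark the TN count is not even well defined):

    # Made-up confusion-matrix counts from a labeled benchmark run.
    TP, FP, FN, TN = 40, 10, 20, 930

    precision = TP / (TP + FP)   # share of reported findings that are real: 0.80
    recall    = TP / (TP + FN)   # share of real weaknesses reported: ~0.67
    fp_rate   = FP / (FP + TN)   # needs TN, hence needs a closed benchmark: ~0.011

    print(f"precision={precision:.2f} recall={recall:.2f} fp_rate={fp_rate:.3f}")

And even then, those numbers only describe that benchmark, not your
application.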

4.3 Accuracy %

Accuracy of what? Compared to what? Nonsense to me, cf. the previous point.
We cannot measure that in a meaningful way.

5. Triage and Remediation Process

Do we want to talk about the quality of the UI provided by the tool to
facilitate the triage? IMO, the remediation process is out of scope for a
SAST.

5.1 Average time to triage a finding

This seems to me like rating your assessor more than the tool you use.

5.2 Quality of data surrounding a finding (explanation, tracing, trust
level…etc)

This is indeed very important information. As an assessor, I want to know why
the heck this tool reported this finding to me. Not only do I want to have
paths, confidence, data-flow info, etc., but I also want to know the
internals. Some tools will report the pre-conditions and post-conditions that
generated the finding. This is extremely useful for advanced use of the
tools. I understand that most tools do not report that, so at least reporting
the rule ID (or something I can track later on and make sense of) is
important.

5.3 Ability to mark findings as false positive

Marking a finding as FP might have several meanings. Does this mean:

  • Mark the finding as FP for the report?
  • Mark the finding as FP for the engine, so that the next time it encounters
    a similar case, it won't report it?

5.4 Ability to “diff” assessments

Very important indeed.

5.5 Ability to merge assessments

Tracking, merging, and combining assessments is definitely part of the
workflow...

5.6 Correctness of remediation advice
5.7 Completeness of remediation advice

I hope no one actually relies on the tool to give proper remediation advice.
Tools are usually fine for giving an idea, but there is no way they will give
you a good solution for your particular case (even though, in theory, they
have a lot of information to do so).

5.8 Does the tool automatically prioritize defects?

Prioritize what? Is this category supposed to be about the severity rating?
Or is this about prioritization at the engine level, so that the tool misses
lots of stuff (yeah, that's usually what happens when the flow gets complex)?

6. UI Simplicity and Intuitiveness
6.1 Quality of triage interface (need a way to measure this)
6.2 Quality of remediation interface (need a way to measure this)

Subjective.

6.3 Support for IDE plug-ins both out of the box and on-demand

"Integration with IDEs", and possible support for new IDEs. Yes, that's
important to get at least, a list of integrated IDEs.

6.4 Quality of tools’ out of the box plugin UI

Subjective. Why not talk about the features available through the plugin instead?

7. Product Update Process

It's indeed good to know that automated/federated/etc. updates are possible.

7.1 Frequency of signature update

Interesting, but the reader must be careful not to base much of the decision
on that. Whether the tool gets a new pack of rules every week or every month
does not say much about their quality...

7.2 Relevance of signatures to evolving threats
7.3 Reactiveness to evolving threats

Are we talking about new weaknesses? The word "threat" is very confusing
here... and does not make sense to me in the context of SAST.

8. Product Maturity and Scalability

Would be good to know indeed, though... how to get the data?

8.1 Peak memory usage

42GB?! That's a highly variable number that depends on many factors (machine,
configuration, application, etc.).

8.2 Number of scans done before a crash or serious degradation in performance

42, but only because it was 71 degrees in the room, and the train was passing
every 2.5 days.

8.3 Maximum lines of code the tool can scan per project

It would be good to talk about the scalability of the tool, and how to
improve it. For example, can I scan the same application with several
machines (parallelism)? If I add more RAM/CPU, do I get much better results?
Is there a known limit?

8.4 What languages does the tool support?

This should be covered in a different section.

9. Enterprise Offerings

This is also very interesting for companies. However, the enterprise
offerings are usually central solutions that host findings, support review of
findings, etc. This is not really SAST, but SAST management. Do we want to
talk about that? I'm happy to have this in the criteria...

9.1 Ability to integrate with major bug tracking systems

This is mostly a general comment: instead of a boolean answer, we should ask
for the list of supported bug tracking systems.
Also, it's important to be able to customize this, and to be able to
integrate with JoeBugTracker...

9.2 Ability to integrate with enterprise software configuration management

In what regard?

10. Reporting Capabilities
10.1 Quality of reports

Subjective.

10.2 Availability of role-based reports

It's indeed important to report different kinds of data for the engineer,
dev, QA, managers, etc. Eventually, we're talking about data reporting here,
and tools should provide several ways to slice and present the data for the
different audiences.

10.3 Availability of report customization

Yup, though to what extent is the report customizable? Can I just change the
logo, or can I integrate the findings into my Word template?

11. Tool Customization and Automation

I feel that we're finally getting to the interesting part. Every mature use
of SAST has to make use of automation and tool customization. This section is
a very important one, and we should emphasize it as much as we can.

11.1 Can custom rules be added?

Right, that's the first question to ask: does the tool support rule
customization? Now, we need many other points, such as... What kinds of rules
are supported? Can we specify/create a new type of weakness/finding/category?
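To make the question concrete, one hypothetical shape a custom rule could
take is a plain source/sink/sanitizer description with its own category (none
of the identifiers below come from a real product):

    # Hypothetical custom taint rule, declared as plain data.
    custom_rule = {
        "id": "ACME-0001",                            # made-up rule identifier
        "category": "Custom: internal API misuse",    # user-defined weakness category
        "sources": ["acme.http.Request.getParam"],    # hypothetical method names
        "sinks": ["acme.db.Connection.rawQuery"],
        "sanitizers": ["acme.security.Escaper.forSql"],
        "severity": "high",
    }

The criteria should ask which of these pieces (sources, sinks, sanitizers,
categories, severities) the tool actually lets you define.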

11.2 Do the rules require learning a new language/script?

Most likely the answer will be "yes", unless it's only GUI based. My point is
that even XML rules represent a "language" to describe the rules...

11.3 Can the tool be scripted? (e.g. integrated into an ANT build script or
other build script)

Build automation is crucial, but to me it is a different topic. This item
should be in a different section.
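For reference, "scripted into the build" usually boils down to an exit-code
contract like the sketch below (hypothetical "sast-cli" command and report
format; only the gating logic matters):

    #!/usr/bin/env python3
    """Sketch: failing a build when a scan reports too many high findings."""
    import json
    import subprocess
    import sys

    def scan_and_gate(max_high: int = 0) -> int:
        # Run the (made-up) scanner and ask for machine-readable output.
        subprocess.run(["sast-cli", "scan", "--out", "findings.json"], check=True)
        with open("findings.json") as fh:
            findings = json.load(fh)
        high = [f for f in findings if f.get("severity") == "high"]
        # A non-zero exit code is what breaks the ANT/CI build.
        return 1 if len(high) > max_high else 0

    if __name__ == "__main__":
        sys.exit(scan_and_gate())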

11.4 Can documentation be customized (installation instructions, remediation
advice, finding explanation…etc)?

Interesting point. Can we override the remediation advice given by a tool?

11.5 Can the defect prioritization scheme be customized?

Right! Can I integrate the results within my risk management system?

11.6 Can the tool be extended so that custom plugins could be developed for
other IDEs?

That part should go in the IDE integration section.

In summary, I believe that the SATEC needs to be restructured to address the
actual problems. We should also move away from any subjective criteria. I
believe that the SATEC should be something that can be filled in by a tool
vendor, or by someone who will evaluate the tool. Eventually, we should
provide a spreadsheet that can be filled in.

Concerning the overall sections, the order should make sense as well.

Anyhow, I suggest the list rethink the current criteria and see what can be
measured properly, and what needs to be captured by any tool evaluator. The
following is just a suggestion (I came up with it in too little time), but I
believe it captures the interesting parts in a better order:

  1. Platform support
    1.1 OS support
    1.2 Scalability tuning (support for 64-bit, etc.)
  2. Application technology support
    2.1 Language support (down to the language version)
    2.2 Framework support
  3. Scan, command and control
    3.1 Scan configuration
    3.2 Build system integration
    3.3 IDE integration
    3.4 Command line support
    3.5 Automation support
    3.6 Enterprise offerings (needs better terminology)
  4. Application analysis
    4.1 Testing capabilities (weakness coverage, finding-level data, etc.)
    4.2 Customization
    4.3 Triage capabilities
    4.4 Scan results post-processing
  5. Reporting
    5.1 Reports for different audiences
    5.2 Report customization
    5.3 Finding-level reporting information
      5.3.1 Classification/taxonomy mapping (e.g., CWE, OWASP, WASC, etc.)
      5.3.2 Finding description (paths, pre-/post-conditions, etc.)
      5.3.3 Finding remediation (available, customizable, etc.)
  6. Miscellaneous
    6.1 Knowledge update (rules update)
    6.2 Integration with bug trackers (list of supported BTs, customization, etc.)

Btw, I'm sorry to come back with such feedback quite late... but the
deadlines are too aggressive for me.

Romain
