Developers are often overwhelmed by the number of vulnerability findings returned by source code analysis tools. But a proper visualization of the code flow can pinpoint optimal code fixes. How?
This webinar shows how visual flow graphs can be generated automatically, even without knowledge of the code’s logic. Using OWASP’s WebGoat as a test bed, we demonstrate how visualization can help developers and security auditors to:
- Recognize the correlation between different findings of the same vulnerability
- Understand the ripple effect of each specific flaw over the complete code
- Locate vulnerability junctions and best-fix locations
- Prioritize and effectively mitigate identified vulnerabilities
Video transcript – Checkmarx Web Seminar
I’m Maty Siman. I’m Checkmarx’s CTO. And in the next twenty minutes or so, we will discuss a research area that I find personally very fascinating, and it was introduced into our product a few months ago.
So, let’s get started. So, the topic is Graph Visualization. And once again, if you have any questions, feel free to use the Chat Window on your right-hand side. This is the easiest way for all of us.
(0.34 – 1.10) So, assuming that today’s source code analysis products are good and the results that they find are good, the biggest challenge is how to deliver usable, actionable results automatically, out-of-the-box, for extra-large code bases with thousands of results. If I phrase it a bit differently, the question is: “If you find thousands of accurate results, does it really make you happy?” Probably not.
(1.10 – 1.47) If we take WebGoat, for example, it has more than one hundred Cross-Site Scripting (XSS) findings. I will show you a technique where we can narrow this down to fewer than twenty places where you have to fix your code. So, when you try to do effort estimation, instead of multiplying the time it takes you to fix a single issue by the number of issues that you find, you can actually multiply by the number of fixing places. So I’ll show you a way to get an overview of the results, to get a recommendation of where to fix your code, and to actively look for the best places to fix your code.
(1.48 – 2.32) This makes the process of fixing thousands of results really, really easy. So, the current situation is that each result has a single data flow that is presented independently of the other findings. For example, the classical way would be that on the right-hand side you see the data flow graph and on the left-hand side you see the actual relevant source code. Then whenever you click on a node in the data flow graph, you see the relevant part of the source code highlighted. In this case, it’s a cross-site scripting (XSS), where an input goes into the S variable and is then printed back to the screen. And then you click here; it shows that it goes to the variable S and then to the response write in here.
(2.33 – 3.13) So, this is a very easy way to see how cross-site scripting flows from source to sink for a single vulnerability. But what happens when you have fourteen vulnerabilities? Okay? This is the graph that you get. Okay, you get fourteen cross-site scripting findings. There is very little knowledge that you get from such a graph. You gain absolutely no understanding. How do you prioritize the fixes? Which issues should be fixed first? So we decided on a three-step process to improve the visibility of that graph and to provide a useful tool for the developers.
(3.14 – 3.51) So, the first thing that we did: if the same node appears in multiple paths, we combine the copies into a single node. Just by doing so, you get this graph. So now we’re not dealing with fourteen individual issues, but rather we have an overview of the findings. We get to understand the relationship between the findings. And as human beings, it’s much easier for us to understand that picture, so that was the first step. This actually allows you to have a feature called ‘The What-if’.
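Editor’s illustration: the merging step described here can be sketched in a few lines of Python. This is only a minimal sketch, not the product’s implementation; it assumes each finding is a list of node labels from source to sink, and the labels themselves (`getParameter`, `s`, `write_1`, and so on) are hypothetical.

```python
from collections import defaultdict

def merge_paths(paths):
    """Merge per-finding data flow paths into one graph.

    Nodes with the same label in different paths become a single node.
    Returns a dict mapping each node to a sorted list of its successors.
    """
    edges = defaultdict(set)
    for path in paths:
        if not path:
            continue
        for src, dst in zip(path, path[1:]):
            edges[src].add(dst)
        edges[path[-1]]  # make sure the sink is present even with no successors
    return {node: sorted(succ) for node, succ in edges.items()}

# Three hypothetical findings that share a source and, partly, a variable.
paths = [
    ["getParameter", "s", "write_1"],
    ["getParameter", "s", "write_2"],
    ["getParameter", "t", "write_3"],
]
graph = merge_paths(paths)
```

Here three separate three-node paths (nine nodes in total) become one graph with six nodes, in which `getParameter` is visibly the shared starting point.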
(3.52 – 4.19) You can just click on any node and ask, “What if I fix my code in here?” And then you’ll see which findings will get fixed by that. However, if you fix your code just in here, you see that this part gets fixed, but the bottom part is only partially fixed. Some of the extended sources are clean, but others are still vulnerable, still tainted, so you know that’s probably not the best location to fix.
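Editor’s illustration: a ‘what-if’ check of this kind can be approximated by asking which paths pass through the chosen node. The sketch below assumes the same hypothetical list-of-labels representation of findings; it is not the product’s implementation.

```python
def what_if(paths, node):
    """Split findings into those that pass through `node` and those that do not.

    Returns (fixed, remaining) as lists of path indexes, under the assumption
    that fixing the code at `node` neutralises every path that goes through it.
    """
    fixed = [i for i, path in enumerate(paths) if node in path]
    remaining = [i for i, path in enumerate(paths) if node not in path]
    return fixed, remaining

# Hypothetical findings: two go through variable "s", one through "t".
paths = [
    ["getParameter", "s", "write_1"],
    ["getParameter", "s", "write_2"],
    ["getParameter", "t", "write_3"],
]
fixed_at_s, left_after_s = what_if(paths, "s")
fixed_at_source, left_after_source = what_if(paths, "getParameter")
```

Fixing at `s` leaves the third finding untouched, while fixing at `getParameter` covers all three, which is the kind of comparison the speaker makes by clicking on nodes.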
(4.19 – 4.49) And if you click here, you see that these vulnerabilities will get fixed. So, by using the ‘max-flow-min-cut’ algorithm, you can easily find the best fix locations: the minimal cover set, where fixing these points will eliminate all findings. So, here we were able to reduce fourteen vulnerabilities to three fixing points. Okay?
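Editor’s illustration: one standard way to apply max-flow min-cut to this problem is to split every node into an “in” and an “out” half joined by an edge of capacity 1, run a maximum flow from all sources to all sinks, and read the cut off the residual graph. The sketch below does that with a plain Edmonds-Karp loop. It is an assumption about how such a search could work, not a description of the product; the node labels are hypothetical, and the labels `SRC` and `SNK` are reserved for the artificial super-source and super-sink.

```python
from collections import defaultdict, deque

INF = float("inf")

def best_fix_locations(paths):
    """Smallest set of nodes whose removal separates every source from every
    sink in the merged graph (a minimum vertex cut, found via max flow)."""
    cap = defaultdict(lambda: defaultdict(int))

    def add_edge(u, v, c):
        cap[u][v] += c
        cap[v][u] += 0  # create the reverse (residual) edge

    nodes = {n for p in paths for n in p}
    for n in nodes:
        add_edge((n, "in"), (n, "out"), 1)  # fixing any one node costs 1
    for p in paths:
        add_edge("SRC", (p[0], "in"), INF)
        add_edge((p[-1], "out"), "SNK", INF)
        for a, b in zip(p, p[1:]):
            add_edge((a, "out"), (b, "in"), INF)

    def bfs_parents():
        # Breadth-first search over edges that still have residual capacity.
        parent = {"SRC": None}
        queue = deque(["SRC"])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = bfs_parents()
        if "SNK" not in parent:
            break
        # Walk back from the sink to collect the augmenting path.
        path, v = [], "SNK"
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= flow
            cap[w][u] += flow

    # A node is in the cut if its "in" half is still reachable but "out" is not.
    reachable = set(parent)
    return sorted(n for n in nodes
                  if (n, "in") in reachable and (n, "out") not in reachable)

# Hypothetical findings: three inputs meet in "raw"; a fourth path is separate.
paths = [
    ["getParameter", "raw", "write_1"],
    ["getCookie", "raw", "write_2"],
    ["getHeader", "raw", "write_3"],
    ["readFile", "line", "write_4"],
]
fix_points = best_fix_locations(paths)
```

For these four findings the cut has two nodes: the junction `raw` covers the first three paths, and one node on the separate fourth path covers the rest. Unit capacities treat every node as equally easy to fix; a real tool would presumably weight or exclude nodes that cannot be changed.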
(4.49 – 5.19) So that was the first step in the process: just combining relevant nodes. Then there are two more steps. The next step is to simplify the graph by combining similar-looking nodes together. And then we found, through our cognitive research, that developers gain very little information from long data flows.
(5.19 – 5.51) So, we can actually collect them together into a single node. The technical term for that is to find the homeomorph of the original graph. The homeomorph means a graph with the same structure, but a simplified version of it. So, actually, we get this Space Invader shape, okay? So we were able to simplify the first graph into that graph, and then to simplify it further into that graph.
(5.51 – 6.11) So, we not only provide a visually appealing and useful graph to the developers, but we also provide a tool with which they can proactively search for the best fix locations. Okay? So, let’s see a quick demo. Okay?
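Editor’s illustration: collapsing long, unbranched stretches of a data flow into a single node can be sketched as repeated edge contraction. The code below is one simple way to approximate the idea, assuming the graph is given as a set of `(from, to)` pairs with string labels; the labels and the `+` naming of merged nodes are hypothetical.

```python
def collapse_chains(edges):
    """Contract every edge u -> v where u has no other successor and v has no
    other predecessor, so that unbranched chains end up as one merged node."""
    edges = set(edges)
    while True:
        for u, v in sorted(edges):
            out_u = [e for e in edges if e[0] == u]
            in_v = [e for e in edges if e[1] == v]
            if u != v and len(out_u) == 1 and len(in_v) == 1:
                merged = u + "+" + v
                edges.remove((u, v))
                # Redirect every remaining edge that touched u or v.
                edges = {(merged if a in (u, v) else a,
                          merged if b in (u, v) else b) for a, b in edges}
                break
        else:
            return edges  # no contractible edge left

# A hypothetical long chain that fans out to two sinks at the end.
edges = {("src", "a"), ("a", "b"), ("b", "c"), ("c", "sink1"), ("c", "sink2")}
simplified = collapse_chains(edges)
```

The five-edge graph shrinks to two edges: the whole chain, including the source, is folded into one node that still branches to both sinks, so the overall shape is preserved with fewer levels.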
(6.14 – 6.52) So, this is the first scan that I did for the WebGoat project. I believe most of you are familiar with the WebGoat project. It’s an open source vulnerable application. It’s vulnerable on purpose. So, the old way to do stuff would be that we find 124 cross-site scripting (XSS) findings. Then you get a long list of 124 vulnerabilities, where each of them has its own data flow path that shows, as you can see here, how the data flows from the tainted source to the sink, etc., etc. Okay?
(6.53 – 7.30) The new way would be to view everything through the graph view. So, now you can easily see that we’re not dealing with 124 individual issues, but rather that most of them have the very same root cause or root source. And the system highlights, in red, the best fix locations. So, the system tells you, for example, for this tree: if you fix your code in here, the whole tree below it will get fixed at once. So, you’re not dealing with fifty individual places, but with just one place.
(7.30 – 8.01) For those of you who are actually familiar with WebGoat, you probably know that getRawParameter is actually the place that the WebGoat team itself recommends fixing. So we were able to find the real-life best fix location without taking a single look at the source code, just by watching the graph and listening to what Checkmarx’s product tells us. By finding the best fix location, we were able to determine the really vulnerable piece of code.
(8.01 – 8.33) And the same can be done right through here. This is another best fix location, which is getParameterValues, so this is pretty straightforward. In case you believe that you can do better than Checkmarx in finding the best fix location, you can use the ‘what-if’ technique that I presented before. So, when I click in here, you can see that these nodes are highlighted. It tells me: “These paths will get fixed. However, these ones will remain vulnerable.”
(8.33 – 9.01) And if I fix my code in here, well, only this path gets fixed, but all of these remain vulnerable. Fixing my code here, at the best fix location, will get all these paths fixed automatically, so this really makes the life of the developer easier. First of all, he gets a suggestion of where to fix the code, and he gets a tool that helps him to determine the quality of the location he chose to use.
(9.01 – 9.30) Getting back to the simplification methods, this graph is very useful when we’re dealing with hundreds of vulnerabilities. However, when we deal with thousands and tens of thousands of vulnerabilities, the graph becomes a bit more complex. And then, as I mentioned, we can find the homeomorph of this graph, a simplified version of the graph. I will show you how we get the very same structure, but with fewer levels.
(9.30 – 9.50) So this is a simplified version of the very same graph that we saw before, and the very same best fix locations are highlighted. And in this case, when you are limited to just a few levels, you can very easily understand the basic structure of the graph when you’re dealing with many results.
(9.55 – 10.41) Now, what I did is I took these two best fix locations. I fixed them manually, and then I rescanned the code to see how it actually affected the results. So, in the original scan, we had 124 reflected cross-site scripting findings. And by fixing our code in two places, in getRawParameter and getParameterValues, we were able to bring it down to 39. So, we were able to eliminate 85 cross-site scripting findings just by modifying our code in two places, and all that was done semi-automatically, just by watching the graph.
(10.41 – 11.16) How does it affect false positives? Okay, that’s a great question. Someone asked me how it eliminates false positives. So, for example, if, by watching the graph, I see that this tree is a false positive because... let me just zoom in. For example, let’s say that I find that this node in here is actually a sanitization routine, which should have been detected as a sanitization routine, and all these findings are false positives.
(11.16 – 11.55) So, the easiest way, instead of having to go through each and every one of the results in the table and determine that manually, is this: all I have to do is select this sanitization routine and then change the result state to “not exploitable”. And by doing so, the graph gets updated, and now all the elements below it are flagged as well and do not appear here, because they are not exploitable. I can choose to show the non-exploitable flows, and then they will reappear here in grey.
(11.55 – 12.12) This shows you that there are some paths here that you declared as non-exploitable. So, you can manage the entire process of triaging just with this graph view instead of working through the boring table view.
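Editor’s illustration: marking one node as a sanitizer and having every finding that passes through it change state can be sketched as follows. The state name “not exploitable” comes from the talk; the default state `to verify`, the function name, and the node labels are assumptions made for this example.

```python
def triage_by_node(paths, sanitizer):
    """Flag every finding whose path passes through `sanitizer` as
    "not exploitable"; leave all other findings in a "to verify" state."""
    return [
        {"path": path,
         "state": "not exploitable" if sanitizer in path else "to verify"}
        for path in paths
    ]

# Hypothetical findings: two of them pass through an encoding helper.
paths = [
    ["getParameter", "htmlEncode", "write_1"],
    ["getCookie", "htmlEncode", "write_2"],
    ["getHeader", "h", "write_3"],
]
results = triage_by_node(paths, "htmlEncode")
visible = [r for r in results if r["state"] != "not exploitable"]
```

One decision about `htmlEncode` updates two findings at once, and the filtered `visible` list corresponds to the default graph view in which non-exploitable flows are hidden.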
(12.12 – 12.52) Let’s see if there’s another question in here. “Can you explain how a false positive is discovered?” Okay, so there’s a question. I’m not sure that I understand it accurately. So, this is a bit of a different question from this graph visualization technique, but basically, at Checkmarx, we analyze the source code. We build the DOM, the document object model. On top of that, we build a DFG, the data flow graph. The DOM represents the static properties of the code and the DFG represents the dynamic properties of the code.
(12.52 – 13.20) All that information, both the static and the dynamic, is stored in a queryable database. And from that moment on, much like any other database where you can find all the employees with a specific salary, you can use our proprietary query language in order to ask your code questions. For example, whether Variable A is influenced by Variable B, or whether a value of five goes through your code until it reaches a specific location.
(13.20 – 13.47) So, you can build your own queries to interrogate your code base. Checkmarx comes out of the box with hundreds of predefined queries, for example, SQL injection, cross-site scripting, etc. And all the queries are fully open, which means that you can see the queries. You can adapt them to your own needs. You can modify them. You can write your own queries. And the results of these queries are actually the vulnerabilities found.
(13.47 – 14.12) Each result comes with its own data flow graph, and the graph view allows you to combine all these different data flows into a single consolidated graph view. David, did I answer your question? If not, feel free to elaborate through the chat window. Thanks. Okay.
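Editor’s illustration: the question “is Variable A influenced by Variable B?” amounts to a reachability check on the data flow graph. The sketch below uses plain Python rather than the proprietary query language mentioned in the talk, whose syntax is not shown here; the graph contents and the function name are hypothetical.

```python
from collections import deque

def influences(dfg, source, target):
    """Return True if data can flow from `source` to `target`.

    `dfg` maps each variable to the variables its value flows into.
    """
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in dfg.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Hypothetical data flow: b -> tmp -> a, and separately c -> d.
dfg = {"b": ["tmp"], "tmp": ["a"], "c": ["d"]}
a_depends_on_b = influences(dfg, "b", "a")
a_depends_on_c = influences(dfg, "c", "a")
```

In this small graph `a` is influenced by `b` through `tmp`, but not by `c`.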
(14.15 – 14.59) Getting back to the presentation. So, recapping. As I said, the biggest challenge of current source code analysis is how to deal with large projects with many results. In this webinar, I answered the second question: how to handle a case where you have thousands of results. In a future webinar we will discuss Source Code Knowledge Discovery (SCKD): how to automatically detect patterns in large code bases. In this case, instead of fearing large code bases, we actually take them to our side and use statistics on the code base, so this will be our next webinar.
(15.00 – 15.50) Okay, another question here is whether we support automated code analysis within an agile environment. So, the answer is yes. Once again, it’s not really relevant to the current webinar, but the answer is yes, we have the incremental scan capability. We’re the only product with the incremental scan capability. This means that once the code changes, you don’t have to rescan the entire code base from the beginning. We can just scan the modified files together with their dependencies, in a fraction of the overall time. You get the results, so it scales well. You can use it within any continuous integration environment. Jenkins, Hudson, TeamCity, you name it. We have a lot of experience. We can take it offline if you want.
(15.50 – 16.07) So, as I said, the current webinar discussed how to handle thousands of results. The next webinar will discuss how to handle projects with multi-million or tens of millions of Lines of Code (LoC). So, you’re very welcome to join our next webinar.
(16.09 – 16.28) We have a free trial set up for you at www.checkmarx.com/register, so feel free to ask for a free trial and to try the graph view either on your own code base or on any open source application of your choice.
(16.32 – 16.41) So, the trial is for our cloud environment. We also have private (on-premise) installations, so it’s up to you to choose. Thank you very much, and thank you for your time. I hope you enjoyed it, and see you in our next webinar.
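Editor’s illustration: the talk does not say how the incremental scan chooses what to rescan. One plausible reading, sketched below purely as an assumption, is to rescan the changed files plus every file that depends on them, directly or indirectly. The file names and the `depends_on` map are hypothetical.

```python
def files_to_rescan(depends_on, changed):
    """Return the changed files plus all files that transitively depend on them.

    `depends_on` maps a file to the files it uses.
    """
    # Invert the map: for each file, which files use it?
    dependents = {}
    for user, used_files in depends_on.items():
        for used in used_files:
            dependents.setdefault(used, set()).add(user)

    result = set(changed)
    todo = list(changed)
    while todo:
        current = todo.pop()
        for user in dependents.get(current, ()):
            if user not in result:
                result.add(user)
                todo.append(user)
    return result

# Hypothetical project: Report uses Login, Login uses Parser, Help is isolated.
depends_on = {
    "Login.java": ["Parser.java"],
    "Report.java": ["Login.java"],
    "Help.java": [],
}
rescan = files_to_rescan(depends_on, {"Parser.java"})
```

A change to `Parser.java` pulls in `Login.java` and `Report.java` but leaves `Help.java` alone, which is where the time saving over a full rescan would come from.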