Vulnerability Management - A Story

Nir Dagan, DevSecOps Engineer

New Company Jitters

It is always exciting to join a new company. No matter how experienced you are, there are always those jitters of new beginnings. However, the transition was not limited to our team: the company itself was changing from a small startup to a mid-size company, and as it grew, the customer base changed as larger, more established businesses started using the product.

It was exciting to be in the company during that spurt of growth. There was a sense of optimism and confidence. On our part, we felt that leadership was ready to put more focus and resources on product and infrastructure security. 

The pace was staggering; every week new features were added to the product, and new cloud accounts were being opened daily.

My team’s responsibility was to secure the different services and products the company offers to its customers. Since those services were all cloud-native applications, we needed security controls and processes at each stage of the development and deployment cycle. Given the size of our company, we did not have separate AppSec and infrastructure teams. This meant that in addition to supporting the different SDLC processes, the team would be responsible for the security of the cloud infrastructure that powered the company’s services.

Know the Unknown 

Security is a multi-layered process. A company can check all the boxes and still not have a mature security program. Attributes like secure architecture, culture, and safeguards are often not captured well by compliance checkboxes and security products. However, there is immense value in those processes, as they create a road map and help organize resource allocation for different security projects. One of the main initiatives is the resolution and management of vulnerabilities in different IT systems and software services. According to NIST, a vulnerability “is a weakness in an information system, system security procedures, internal controls, or implementation that could be exploited or triggered by a threat source.”

When discussing software security, one of the leading sources is the OWASP project. We started looking for papers and recommendations on software security processes. The OWASP Vulnerability Management Guide (https://owasp.org/www-project-vulnerability-management-guide/) includes a diagram of the process, showing three main cycles: detection, remediation, and reporting.

Detection

First, we had to ensure we were able to detect the vulnerabilities. We used a variety of tools, some commercial and some open source. It was important for us to identify vulnerabilities across the different stages of the development pipeline and for services that were already in production.  

Using the GitHub API we made sure whenever a new repository was created in our organization, we would get a notice. We then created a small questionnaire about what the repository was used for, and later used this information when we needed to prioritize the vulnerability.  
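As a rough sketch, the notification flow can be a small handler that filters organization webhook deliveries for newly created repositories and queues a questionnaire task. The handler name and questionnaire contents below are illustrative, not our actual implementation; GitHub does send repository lifecycle changes as "repository" events with an "action" field.

```python
def handle_github_event(event_type, payload):
    """Filter GitHub organization webhook deliveries for newly created
    repositories and produce a questionnaire task for the security team.

    Returns a task dict for new repos, or None for events we ignore.
    """
    # GitHub sends the event name in the X-GitHub-Event header;
    # repository creation arrives as a "repository" event, action "created".
    if event_type != "repository" or payload.get("action") != "created":
        return None
    repo = payload["repository"]
    return {
        "repo": repo["full_name"],
        "owner": repo["owner"]["login"],
        # Answers are later used when prioritizing findings in this repo.
        "questionnaire": [
            "What is this repository used for?",
            "Does it ship to production?",
            "Does it handle customer data?",
        ],
    }

# Example webhook payload, trimmed to the fields we read (hypothetical repo).
payload = {"action": "created",
           "repository": {"full_name": "acme/payments-api",
                          "owner": {"login": "acme"}}}
task = handle_github_event("repository", payload)
```

In practice the handler would sit behind an HTTP endpoint registered as an organization webhook, and the task would land in a ticketing queue rather than a return value.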

Every repo containing code scheduled for integration into our product had to undergo a vulnerability scan as part of the Continuous Integration and Continuous Deployment (CI/CD) process.
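A CI gate like this is typically a small script that parses the scanner's report and fails the build above an agreed severity. A minimal sketch, assuming the scanner emits JSON findings with a "severity" field; the threshold policy is illustrative:

```python
# Rank severities so thresholds can be compared numerically.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def should_fail_build(findings, fail_at="high"):
    """Return True if any finding meets or exceeds the failure threshold."""
    threshold = SEVERITY_RANK[fail_at]
    return any(SEVERITY_RANK.get(f.get("severity", "low"), 0) >= threshold
               for f in findings)

# In CI this list would be loaded from the scanner's JSON report;
# a non-zero exit code based on this boolean would then block the merge.
findings = [{"id": "CVE-2021-44228", "severity": "critical"},
            {"id": "CVE-2020-0001", "severity": "medium"}]
blocked = should_fail_build(findings)
```

The same gate can run with a stricter threshold on release branches than on feature branches, which keeps the pipeline from blocking day-to-day work.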

Since all our services would only be deployed on major cloud providers' platforms, we had to ensure that every account had a vulnerability scanner deployed. The number of vulnerabilities discovered was in the thousands. We understood that not all were of equal importance, and we knew that we needed to prioritize.


Prioritizing

In order to establish a structured approach for addressing these vulnerabilities, we decided to categorize and group them and communicate this process to other stakeholders beyond our team.

1. Categorizing

We categorized each vulnerability into five different types and defined an SLA for each, including an agreed time for resolution. For example, a super-critical vulnerability like log4j would require us to stop everything we were doing until it was fixed, while other vulnerabilities would be categorized as less urgent.
Upon completing the process, we developed a quadrant system based on two key factors: the simplicity of the fix and the criticality or severity of the vulnerabilities. We then collaborated with the relevant stakeholders to establish Service Level Agreements (SLAs) for addressing each category of vulnerabilities.
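The quadrant system boils down to a lookup from the two axes to an agreed time to fix. The bucket names and day counts below are illustrative placeholders, not the SLAs we actually negotiated:

```python
def sla_days(severity, easy_fix):
    """Map a (criticality, ease-of-fix) quadrant to an agreed time to fix.

    Values are illustrative; a real SLA table is negotiated with stakeholders.
    """
    quadrants = {
        ("critical", True): 1,    # drop everything, e.g. log4j-class issues
        ("critical", False): 7,   # critical but hard: plan a focused effort
        ("moderate", True): 30,   # easy wins batched into the next cycle
        ("moderate", False): 90,  # scheduled alongside other tech debt
    }
    return quadrants[(severity, easy_fix)]
```

The advantage of the quadrant over a raw score is that each cell maps directly to an action and a deadline everyone has already agreed to.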

2. Determining the Criticality

We started with the CVSS score from the scanner as a baseline, and added our own environment context to create a more representative score for our business. If the vulnerability severity score was very high, we would do some of our own technical analysis of the vulnerability to see if it was exploitable in our current setup. However, running an analysis like that on each vulnerability was not scalable, so we created a list of questions/attributes that would help us refine the score.
These are a few examples:

  • How many hops are needed to reach the system running the vulnerable component (public: 0 hops; company domain: 1 hop or more)?
  • What type of environment is it? Dev, production, or a lab?
  • What is the system's risk to the business: loss of data, disruption of business, disruption of corporate activity?
  • Can the system be used as a hop to another system (Jenkins server -> cloud account, workstation -> database, etc.)?
  • Can the vulnerability affect the company's customers (weakness in authentication, loss of customer data)?
  • How likely is the exploit to be usable in our current environment?
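Questions like these can be folded into the scanner's CVSS base score as simple modifiers. A sketch with made-up weights, just to show the shape of the calculation; tuning the weights against your own environment is the real work:

```python
def contextual_score(cvss_base, env):
    """Adjust a CVSS base score with environment context.

    The weights below are illustrative, not a standard.
    """
    score = cvss_base
    # Each network hop from the public internet lowers exposure.
    score -= 1.0 * env.get("hops", 0)
    # Non-production systems are less urgent.
    if env.get("environment") in ("dev", "lab"):
        score -= 2.0
    # Pivot potential (e.g. Jenkins server -> cloud account) raises urgency.
    if env.get("pivot_to_other_systems"):
        score += 1.5
    # Direct customer impact: auth weakness, loss of customer data.
    if env.get("customer_facing"):
        score += 1.0
    # Keep the result on the familiar 0-10 CVSS scale.
    return max(0.0, min(10.0, score))

# A high CVSS finding on an internal dev box scores lower in context.
adjusted = contextual_score(9.8, {"hops": 2, "environment": "dev"})
```

This keeps the output on the familiar 0-10 scale, so the adjusted number still slots into the same categorization buckets.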

Once we added our input, we wanted to create a decision tree similar to SSVC, but with another parameter that would influence our final decision: ease of fix. This parameter, in broad terms, helped us understand the complexity involved in rectifying the vulnerability.

We used some of the following attributes to create a score for remediation complexity:

  • What are the consequences of the systems/app being unavailable?
  • How many version upgrades would be needed to reach the fix candidate?
  • Does the fix candidate require a minor or a major upgrade?
  • Does the system have a vendor that supports it or is it an obscure open source project?
  • How well maintained is the source code?
  • How often, and in how many systems, does the vulnerability appear? This can cut both ways: a widespread vulnerability is more worthwhile to address, but also more complex to fix because it affects many systems.
  • How many times has the library been called in the code? This requires technical understanding of the code.
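These attributes can be rolled up into a rough remediation-complexity score, which then feeds the ease-of-fix axis of the decision. Again a sketch with illustrative weights, mirroring the questions above:

```python
def remediation_complexity(attrs):
    """Rough 0-10 complexity score for fixing a vulnerability.

    Weights are illustrative; the attribute names mirror the questions above.
    """
    score = 0
    # Several version jumps to the fix candidate mean more regression risk.
    score += min(attrs.get("versions_behind", 0), 3)
    # Major upgrades are riskier than patch-level bumps.
    if attrs.get("major_upgrade"):
        score += 2
    # No supporting vendor (an obscure open source project) adds work.
    if not attrs.get("vendor_supported", True):
        score += 2
    # Poorly maintained source code makes backports harder.
    if not attrs.get("well_maintained", True):
        score += 1
    # Many call sites means more code paths to retest.
    score += min(attrs.get("call_sites", 0) // 10, 2)
    return min(score, 10)
```

A score like this is coarse on purpose: it only needs to separate "bump a patch version" from "plan a migration project".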

It wasn't perfect, and we knew some assumptions would be wrong. However, we felt this process gave us a few advantages. Having junior members of the team process each finding brought overall cost down and allowed more senior members to focus on strategy. Instead of a number that could be interpreted in different ways, we had an action with an accepted SLA. It gave us a baseline to set the expectation for time-till-remediation.
We could use the information added by the process to create detailed reports. We could test and change our assumptions as we got more feedback from the product system owners.

3. Creating Tickets for Code Owners

We were pretty happy with the process we built. The next phase was to process the output of our different scanners and create tickets for code or system owners. This part of the process was handed off from our team to engineering: only the code or system owners knew the expected behavior of their services, so they had to run the tests after applying a fix.

We started creating tickets and eagerly anticipated the gradual resolution of vulnerabilities within our code, and we were expecting to be able to proceed to other important tasks. We did have some success with our infrastructure teams; however, it soon became clear the developers did not share the same sense of urgency.
Our ticket responses ranged from pushed-out due dates to outright disregard, yet the one response we never received was "issue resolved." It was obvious that we were not moving forward, definitely not at the pace we wanted. One option was to accept the reality that this organization didn't prioritize security, and that only occasionally, when a sufficiently critical vulnerability came up, some team leader might address our concerns. However, we recognized that this approach was not sustainable, and we needed to change our processes to drive progress.

Negotiate, Enforce, Renegotiate, and Repeat 

At our company, the work was divided into two-week sprints. Every feature was broken down into tasks, each with a maximum completion time of two weeks.

At the end of each sprint, the developer team leads and the product leads would meet to discuss what work would be done in the next sprint. We knew throwing tickets into the wind wasn’t going to help resolve vulnerabilities. We knew we had to join these meetings, and we did.

The product team plays a key role in remediation. They assign the tasks for each development cycle. Our teams agreed that each cycle, or every other cycle, would have time allocated to fixing issues in the code. This meant that the security team had to act as a customer and push the product team to allocate time.

In addition to assigning tasks, the product team also categorized engineering work by attribute. Each sprint incorporated different attributes: things like reliability, extensibility, and performance. It is a way for the company to manage its technical debt, especially when there is a race toward creating new features. Security wasn't taken into account until we created a more collaborative work environment. Once security became a category, the company had a way to measure its risk posture.

The prioritization process we created earlier came in handy, because we had a way to articulate which vulnerabilities needed to be fixed and why. During each meeting, we presented a list of the most critical vulnerabilities and security issues. We also had a general understanding of the complexity, and could occasionally point to other teams that had already performed the same fix. This allowed us to manually build an organizational database of patching experiences.

Reporting  

The third part of the OWASP diagram is the reporting cycle. One of the keys for success in this process is having a consistent way to track and report your progress. We had to make sure all of the key stakeholders were aware of the progress or lack of it. Using the reports we could track if the product was getting safer as time went by or if vulnerabilities were accumulating faster than we were repairing them. We could see which teams were efficient in fixing vulnerabilities and which weren't. 
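A consistent report can start as something as simple as net open findings per team over time: positive means the backlog is growing faster than it is being repaired. A minimal sketch of the kind of aggregation we kept in spreadsheets (team names are hypothetical):

```python
from collections import defaultdict

def burn_down(events):
    """Net open findings per team from a stream of opened/resolved events.

    A positive number means vulnerabilities are accumulating faster
    than the team is fixing them.
    """
    net = defaultdict(int)
    for e in events:
        net[e["team"]] += 1 if e["type"] == "opened" else -1
    return dict(net)

# Ticket events as they would be exported from the tracker.
events = [
    {"team": "payments", "type": "opened"},
    {"team": "payments", "type": "opened"},
    {"team": "payments", "type": "resolved"},
    {"team": "platform", "type": "resolved"},
]
trend = burn_down(events)
```

Bucketing the same events by week or by severity turns this into the trend lines that show whether the product is getting safer over time.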

The reports were the only way we could make our case for additional resources or, when we were making progress, demonstrate our success. We discovered that the number of vulnerabilities and the difficulty of addressing them correlated strongly with the general lack of maintainability of the software. So it happened that our report became an indicator the CTO used when deciding which projects needed upgrades or a redesign.

Thankfully, Things are Different Now

As I recount the process my team and I built at my previous company, I’m both proud and relieved. What we built was painfully manual. Every step along the way - from prioritization to reporting - was done by hand and tracked in sophisticated spreadsheets. It was inefficient and prone to human error.

Today’s world of vulnerability management and remediation needs something drastically different. That’s the reason I moved to Opus Security. Our cloud-native remediation platform aggregates, deduplicates, and prioritizes vulnerabilities across AppSec and Cloud Sec, reducing the complexity of remediation into a single unified view. If I had had Opus Security in my previous role, I could have cut the human hours spent from half a day to hours, and in some cases minutes.