John Dellinger, Author at Microsoft Security Blog

Lessons learned from the Microsoft SOC—Part 3d: Zen and the art of threat hunting
http://approjects.co.za/?big=en-us/security/blog/2020/06/25/zen-and-the-art-of-threat-hunting/
Thu, 25 Jun 2020 16:00:18 +0000

This blog provides lessons learned on how Microsoft hunts for threats in our IT environment and how you can apply these lessons to building or improving your threat hunting program. This is the seventh in a series.


Threat hunting is a powerful way for the SOC to reduce organizational risk, but it’s commonly portrayed and seen as a complex and mysterious art form for deep experts only, which can be counterproductive. In this and the next blog we will shed light on this important function and recommend simple ways to get immediate and meaningful value out of threat hunting.

This is the seventh blog in the Lessons learned from the Microsoft SOC series designed to share our approach and experience from the front lines of our security operations center (SOC) protecting Microsoft, and our Detection and Response Team (DART) helping our customers with their incidents. For a visual depiction of our SOC philosophy, download our Minutes Matter poster.

Before we dive in, let’s clarify the definition of “threat hunting.” There are various disciplines and processes that contribute to the successful proactive discovery of threat actor operations. For example, our Hunting Team works with threat intelligence to help shape and guide their efforts, but our threat intelligence teams are not “threat hunters.” When we use the term “threat hunting,” we are talking about the process of experienced analysts proactively and iteratively searching through the environment to find attacker operations that have evaded other detections.

Hunting is a complement to reactive processes, alerts, and detections, and enables you to proactively get ahead of attackers. What sets hunting apart from reactive activities is the proactive nature of it, where hunters spend extended focus time thinking through issues, identifying trends and patterns, and getting a bigger picture perspective.

A successful hunting program is not purely proactive, however, as it requires continuously balancing attention between reactive and proactive efforts. Threat hunters still need to maintain a connection to the reactive side to keep their skills sharp and stay attuned to trends in the alert queue. They also need to jump in at a moment’s notice to help put out the fire during major incidents. The amount of time available for proactive activities will depend heavily on whether the hunting mission is full time or part time.

Our SOC approaches threat hunting by applying our analysts to different types of threat hunting tasks:

1. Proactive adversary research and threat hunting

This is what most of our threat hunters spend the majority of their time doing. The team searches through a variety of sources, including alerts and external indicators of compromise. The team primarily works to build and refine structured hypotheses of what attackers may do based on threat intelligence (TI), unusual observations in the environment, and their own experience. In practice, this type of threat hunting includes:

  • Proactive search through the data (queries or manual review).
  • Proactive development of hypotheses based on TI and other sources.
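To make the first bullet concrete: a hypothesis such as “an actor is using known-bad infrastructure for command and control” becomes a simple query joining TI against telemetry. A minimal Python sketch (the log fields and watchlist here are hypothetical illustrations, not a real advanced hunting schema):

```python
# Hypothetical hunting sketch: join a threat-intelligence watchlist
# against network log records to surface hosts worth a closer look.

def hunt_c2_beacons(logs, bad_ips):
    """Return hosts that contacted any IP on the TI watchlist."""
    hits = {}
    for record in logs:
        if record["dest_ip"] in bad_ips:
            hits.setdefault(record["host"], []).append(record["dest_ip"])
    return hits

logs = [
    {"host": "ws01", "dest_ip": "203.0.113.7"},
    {"host": "ws02", "dest_ip": "198.51.100.2"},
    {"host": "ws01", "dest_ip": "192.0.2.10"},
]
bad_ips = {"203.0.113.7"}

print(hunt_c2_beacons(logs, bad_ips))  # {'ws01': ['203.0.113.7']}
```

In practice a query like this runs in advanced hunting or Azure Sentinel over far richer telemetry; the point is only the hypothesis-to-query shape.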

2. Red and purple teaming

Some of our threat hunters work with red teams who simulate attacks and others who conduct authorized penetration testing against our environment. This is a rotating duty for our threat hunters and typically involves purple teaming, where both red and blue teams work to do their jobs and learn from each other. Each activity is followed up by fully transparent reviews that capture lessons learned which are shared throughout the SOC, with product engineering teams, and with other security teams in the company.

3. Incidents and escalations

Proactive hunters aren’t sequestered somewhere away from the watch floor. They are co-located with reactive analysts; they frequently check in with each other, share what they are working on, share interesting findings/observations, and generally maintain situational awareness of current operations. Threat hunters aren’t necessarily assigned to this task full time; they may simply remain flexible and jump in to help when needed.

These are not isolated functions. The members of these teams work in the same facility, check in with each other frequently, and share findings and observations across all three activities.

What makes a good threat hunter?

While any high performing analyst has good technical skills, a threat hunter must be able to see past technical data and tools to attackers’ actions, motivations, and ideas. They need to have a “fingertip feel” (sometimes referred to as Fingerspitzengefühl), which is a natural sense of what is normal and abnormal in security data and the environment. Threat hunters can recognize when an alert (or cluster of alerts/logs) seems different or out of place.

One way to think about the qualities that make up a good threat hunter is to look at the Three F’s.

Functionality

This is technical knowledge and competency in investigating and remediating incidents. Security analysts (including threat hunters) should be proficient with the security tools, the general flow of investigation and remediation, and the types of technologies commonly deployed in enterprise environments.

Familiarity

This is “know thyself” and “know thy enemy” and includes familiarity with your organization’s specific environment and familiarity with attacker tactics, techniques, and procedures (TTPs). Attacker familiarity starts with understanding common adversary behaviors and then grows into a deeper sense of specific adversaries (including technologies, processes, playbooks, business priorities and mission, industry, and typical threat patterns). Familiarity also includes the relationships threat hunters develop with the people in your organization, and their roles/responsibilities. Familiarity with your organization is highly valued for analysts on investigation teams, and critical for effective threat hunting.

Flexibility

Flexibility is a highly valued attribute of any analyst role, but it is absolutely required for a threat hunter. Flexibility is a mindset of being adaptable in what you may do every day and how you do it. This manifests in how you understand problems, process information, and pursue solutions. This mindset comes from within each person and is reflected in almost everything they do.

Where any threat analyst (or threat hunter) can take a particular alert or event and run it into the ground, a good threat hunter will take a step back and look at a collection of data, alerts, or events. Threat hunters must be inquisitive and unrelentingly curious, to the point that it bugs them if they don’t have a clear understanding of something. Instead of just answering a question, threat hunters are constantly trying to ask better questions of the data, coming up with creative new angles to answer them, and seeing what new questions they raise. Threat hunting also requires humility: the ability to quickly admit your mistakes so you can rapidly re-enter learning mode.

Threat hunting tooling

Threat hunting naturally pulls in a wide variety of tools, but our team has grown to prefer a few of the Microsoft tools whose design they have influenced.

  • Advanced hunting in Microsoft Threat Protection (MTP) tends to be the go-to tool for anything related to endpoints, identities, email, Azure resources, and SaaS applications.
  • Our teams also use Azure Sentinel, Jupyter notebooks, and custom analytics to hunt across broad datasets like application and network data, as well as diving deeper into identity, endpoint, Office 365, and other log data.

Our threat hunting teams across Microsoft contribute queries, playbooks, workbooks, and notebooks to the Azure Sentinel Community, including specific hunting queries that your teams can adapt and use.

Conclusion

We have discussed the art of threat hunting, different approaches to it, and what makes a good threat hunter. In the next entry, we dive deeper into how to build and refine a threat hunting program. If you are looking for more information on the SOC and other cybersecurity topics, check out previous entries in the series (Part 1 | Part 2a | Part 2b | Part 3a | Part 3b| Part 3c), Mark’s List, and our new security documentation site. Be sure to bookmark the Security blog to keep up with our expert coverage on security matters. Also, follow us at @MSFTSecurity for the latest news and updates on cybersecurity.

CISO Series: Lessons learned from the Microsoft SOC—Part 3c: A day in the life part 2
http://approjects.co.za/?big=en-us/security/blog/2020/05/04/lessons-learned-microsoft-soc-part-3c/
Tue, 05 May 2020 01:00:36 +0000

This blog wraps up the day in the life of a SOC analyst on the investigation team with insights on remediating incidents, post-incident cleanup, and impact of COVID-19 on the SOC. This is the sixth blog post in the series.

This is the sixth blog in the Lessons learned from the Microsoft SOC series designed to share our approach and experience from the front lines of our security operations center (SOC) protecting Microsoft and our Detection and Response Team (DART) helping our customers with their incidents. For a visual depiction of our SOC philosophy, download our Minutes Matter poster.

COVID-19 and the SOC

Before we conclude the day in the life, we thought we would share an analyst’s eye view of the impact of COVID-19. Our analysts are mostly working from home now, and our cloud-based tooling approach enabled this transition to go pretty smoothly. The differences in the attacks we have seen are mostly in the early stages: phishing lures designed to exploit emotions related to the current pandemic, and an increased focus on home firewalls and routers (using techniques like RDP brute-forcing attempts and DNS poisoning—more here). The attack techniques attackers attempt to employ after that are fairly consistent with what they were doing before.

A day in the life—remediation

When we last left our heroes in the previous entry, our analyst had built a timeline of the potential adversary attack operation. Of course, knowing what happened doesn’t actually stop the adversary or reduce organizational risk, so let’s remediate this attack!

  1. Decide and act—As the analyst develops a high enough level of confidence that they understand the story and scope of the attack, they quickly shift to planning and executing cleanup actions. While this appears as a separate step in this particular description, our analysts often execute on cleanup operations as they find them.

Big Bang or clean as you go?

Depending on the nature and scope of the attack, analysts may clean up attacker artifacts as they go (emails, hosts, identities), or they may build a list of compromised resources to clean up all at once (Big Bang).

  • Clean as you go—For most typical incidents that are detected early in the attack operation, analysts quickly clean up the artifacts as we find them. This rapidly puts the adversary at a disadvantage and prevents them from moving forward with the next stage of their attack.
  • Prepare for a Big Bang—This approach is appropriate for a scenario where an adversary has already “settled in” and established redundant access mechanisms to the environment (frequently seen in incidents investigated by our Detection and Response Team (DART) at customers). In this case, analysts should avoid tipping off the adversary until all attacker presence has been discovered, as surprise can help fully disrupt their operation. We have learned that partial remediation often tips off an adversary, which gives them a chance to react and rapidly make the incident worse (spread further, change access methods to evade detection, inflict damage/destruction for revenge, cover their tracks, etc.). Note that cleaning up phishing and malicious emails can often be done without tipping off the adversary, but cleaning up host malware and reclaiming control of accounts has a high chance of doing so.
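The choice between the two approaches can be sketched as a toy decision helper. The indicators and thresholds below are illustrative assumptions for the sketch, not an actual SOC rule:

```python
# Illustrative decision helper: choose a remediation approach based on
# how established the adversary appears to be in the environment.

def remediation_strategy(persistence_mechanisms, compromised_accounts):
    """Return 'big_bang' when the adversary looks settled in,
    otherwise 'clean_as_you_go'."""
    # Redundant access (multiple persistence footholds or several
    # compromised identities) suggests partial cleanup would tip them off.
    if persistence_mechanisms > 1 or compromised_accounts > 2:
        return "big_bang"
    return "clean_as_you_go"

print(remediation_strategy(persistence_mechanisms=0, compromised_accounts=1))
# clean_as_you_go
print(remediation_strategy(persistence_mechanisms=3, compromised_accounts=5))
# big_bang
```

As the surrounding text notes, no helper replaces experienced judgment; a sketch like this only frames the conversation.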

These are not easy decisions to make, and we have found no substitute for experience in making these judgment calls. The collaborative work environment and culture we have built in our SOC helps immensely, as our analysts can tap into each other’s experience to help make these tough calls.

The specific response steps are very dependent on the nature of the attack, but the most common procedures used by our analysts include:

  • Client endpoints—SOC analysts can isolate a computer and contact the user directly (or IT operations/helpdesk) to have them initiate a reinstallation procedure.
  • Server or applications—SOC analysts typically work with IT operations and/or application owners to arrange rapid remediation of these resources.
  • User accounts—We typically reclaim control of these by disabling the account and resetting the password for compromised accounts (though these procedures are evolving as a large portion of our users become passwordless, using Windows Hello or another form of MFA). Our analysts also explicitly expire all authentication tokens for the user with Microsoft Cloud App Security.
    Analysts also review the multi-factor phone number and device enrollment to ensure they haven’t been hijacked (often contacting the user) and reset this information as needed.
  • Service Accounts—Because of the high risk of service/business impact, SOC analysts work with the service account owner of record (falling back on IT operations as needed) to arrange rapid remediation of these resources.
  • Emails—The attack/phishing emails are deleted (and sometimes purged to prevent recovery of deleted emails), but we always save a copy of the original email in the case notes for later search and analysis (headers, content, scripts/attachments, etc.).
  • Other—Custom actions can also be executed based on the nature of the attack such as revoking application tokens, reconfiguring servers and services, and more.
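A procedure list like the one above maps naturally to a dispatch table in an orchestration playbook. A hedged sketch in Python, where every action name is a hypothetical placeholder rather than a real API call:

```python
# Hypothetical SOAR-style dispatch: map a resource type to the
# remediation steps described above. Each action is a placeholder
# string standing in for a real orchestration call.

REMEDIATION_PLAYBOOK = {
    "client_endpoint": ["isolate_machine", "notify_user", "reinstall"],
    "user_account": ["disable_account", "reset_password",
                     "expire_tokens", "verify_mfa_enrollment"],
    "email": ["delete_message", "archive_copy_to_case"],
}

def plan_remediation(resource_type):
    """Return the ordered remediation steps for a resource type;
    anything unrecognized falls back to the resource owner."""
    return REMEDIATION_PLAYBOOK.get(resource_type, ["escalate_to_owner"])

print(plan_remediation("user_account"))
# ['disable_account', 'reset_password', 'expire_tokens', 'verify_mfa_enrollment']
print(plan_remediation("service_account"))
# ['escalate_to_owner']
```

Note how service accounts deliberately fall through to escalation, matching the high service-impact caution described above.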

Automation and integration for the win

It’s hard to overstate the value of integrated tools and process automation, as these bring so many benefits—improving the analysts’ daily experience and improving the SOC’s ability to reduce organizational risk.

  • Analysts spend less time on each incident, reducing the attacker’s time to operate—measured by mean time to remediate (MTTR).
  • Analysts aren’t bogged down in manual administrative tasks so they can react quickly to new detections (reducing mean time to acknowledge—MTTA).
  • Analysts have more time to engage in proactive activities that both reduce organization risk and increase morale by keeping them focused on the mission.
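Given the timestamps this series has described (detection, acknowledgement, remediation), MTTA and MTTR reduce to simple averages. A minimal sketch:

```python
from datetime import datetime

def mean_minutes(pairs):
    """Average elapsed minutes across (start, end) timestamp pairs."""
    total = sum((end - start).total_seconds() for start, end in pairs)
    return total / len(pairs) / 60

# Toy incidents carrying the three timestamps discussed in the series.
incidents = [
    {"detected": datetime(2020, 5, 4, 9, 0),
     "acknowledged": datetime(2020, 5, 4, 9, 10),
     "remediated": datetime(2020, 5, 4, 11, 0)},
    {"detected": datetime(2020, 5, 4, 12, 0),
     "acknowledged": datetime(2020, 5, 4, 12, 20),
     "remediated": datetime(2020, 5, 4, 13, 0)},
]

# MTTA: detection -> acknowledgement; MTTR: acknowledgement -> remediation.
mtta = mean_minutes([(i["detected"], i["acknowledged"]) for i in incidents])
mttr = mean_minutes([(i["acknowledged"], i["remediated"]) for i in incidents])
print(mtta, mttr)  # 15.0 75.0
```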

Our SOC has a long history of developing its own automation and scripts, built by a dedicated automation team, to make analysts’ lives easier. Because custom automation requires ongoing maintenance and support, we are constantly looking for ways to shift automation and integration to capabilities provided by Microsoft engineering teams (which also benefits our customers). While still early in this journey, this approach typically improves the analyst experience and reduces maintenance effort and challenges.

This is a complex topic that could fill many blogs, but this takes two main forms:

  • Integrated toolsets save analysts manual effort during incidents by allowing them to easily navigate multiple tools and datasets. Our SOC relies heavily on the integration of Microsoft Threat Protection (MTP) tools for this experience, which also saves the automation team from writing and supporting custom integration for this.
  • Automation and orchestration capabilities reduce manual analyst work by automating repetitive tasks and orchestrating actions between different tools. Our SOC currently relies on an advanced custom SOAR platform and is actively working with our engineering teams (MTP’s AutoIR capability and Azure Sentinel SOAR) on how to shift our learnings and workload onto those capabilities.

After the attacker operation has been fully disrupted, the analyst marks the case as remediated, which is the timestamp signaling the end of MTTR measurement (which started when the analyst began the active investigation in step 2 of the previous blog).

While having a security incident is bad, having the same incident repeated multiple times is much worse.

  2. Post-incident cleanup—Because lessons aren’t actually “learned” unless they change future actions, our analysts always integrate any useful information learned from the investigation back into our systems. Analysts capture these learnings so that we avoid repeating manual work in the future and can rapidly see connections between past and future incidents by the same threat actors. This can take a number of forms, but common procedures include:
    • Indicators of Compromise (IoCs)—Our analysts record any applicable IoCs such as file hashes, malicious IP addresses, and email attributes into our threat intelligence systems so that our SOC (and all customers) can benefit from these learnings.
    • Unknown or unpatched vulnerabilities—Our analysts can initiate processes to ensure that missing security patches are applied, misconfigurations are corrected, and vendors (including Microsoft) are informed of “zero day” vulnerabilities so that they can create security patches for them.
    • Internal actions such as enabling logging on assets and adding or changing security controls. 
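Recording IoCs in a structured store is what makes the connections between past and future incidents visible. A toy sketch (the record shape is an assumption for illustration, not our TI system's schema):

```python
# Hypothetical IoC store: normalize indicators from a closed case so
# future alerts can be matched against past incidents.

def record_iocs(case_id, indicators, store):
    """Add each indicator to the store, keyed by (type, value)."""
    for ioc in indicators:
        key = (ioc["type"], ioc["value"])
        store.setdefault(key, set()).add(case_id)
    return store

store = {}
record_iocs("case-042", [
    {"type": "file_hash", "value": "d41d8cd98f00b204e9800998ecf8427e"},
    {"type": "ip", "value": "203.0.113.7"},
], store)
record_iocs("case-051", [{"type": "ip", "value": "203.0.113.7"}], store)

# The same IP now links two cases, a hint of a possible common actor.
print(sorted(store[("ip", "203.0.113.7")]))  # ['case-042', 'case-051']
```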

Continuous improvement

So the adversary has now been kicked out of the environment and their current operation poses no further risk. Is this the end? Will they retire and open a cupcake bakery or auto repair shop? Not likely after just one failure, but we can consistently disrupt their successes by increasing the cost of attack and reducing the return, which will deter more and more attacks over time. For now, we must assume that adversaries will try to learn from what happened on this attack and try again with fresh ideas and tools.

Because of this, our analysts also focus on learning from each incident to improve their skills, processes, and tooling. This continuous improvement occurs through many informal and formal processes ranging from formal case reviews to casual conversations where they tell the stories of incidents and interesting observations.

As caseload allows, the investigation team also hunts proactively for adversaries when they are not on shift, which helps them stay sharp and grow their skills.

This closes our virtual shift visit for the investigation team. Join us next time as we shift to our threat hunting team (a.k.a. Tier 3) and get some hard-won advice and lessons learned.

…until then, share and enjoy!

P.S. If you are looking for more information on the SOC and other cybersecurity topics, check out previous entries in the series (Part 1 | Part 2a | Part 2b | Part 3a | Part 3b), Mark’s List (https://aka.ms/markslist), and our new security documentation site—https://aka.ms/securitydocs. Be sure to bookmark the Security blog to keep up with our expert coverage on security matters. Also, follow us at @MSFTSecurity for the latest news and updates on cybersecurity. Or reach out to Mark on LinkedIn or Twitter.

CISO series: Lessons learned from the Microsoft SOC—Part 3b: A day in the life
http://approjects.co.za/?big=en-us/security/blog/2019/12/23/ciso-series-lessons-learned-from-the-microsoft-soc-part-3b-a-day-in-the-life/
Mon, 23 Dec 2019 17:00:57 +0000

In this next post in our series, we provide insight into a day in the life of our SOC analysts investigating common front door attacks.

The Lessons learned from the Microsoft SOC blog series is designed to share our approach and experience with security operations center (SOC) operations. We share strategies and learnings from our SOC, which protects Microsoft, and our Detection and Response Team (DART), who helps our customers address security incidents. For a visual depiction of our SOC philosophy, download our Minutes Matter poster.

For the next two installments in the series, we’ll take you on a virtual shadow session of a SOC analyst, so you can see how we use security technology. You’ll get to virtually experience a day in the life of these professionals and see how Microsoft security tools support the processes and metrics we discussed earlier. We’ll primarily focus on the experience of the Investigation team (Tier 2) as the Triage team (Tier 1) is a streamlined subset of this process. Threat hunting will be covered separately.


General impressions

Newcomers to the facility often remark on how calm and quiet our SOC physical space is. It looks and sounds like a “normal” office with people going about their job in a calm professional manner. This is in sharp contrast to the dramatic moments in TV shows that use operations centers to build tension/drama in a noisy space.

Nature doesn’t have edges

We have learned that the real world is often “messy” and unpredictable, and the SOC tends to reflect that reality. What comes into the SOC doesn’t always fit into the nice neat boxes, but a lot of it follows predictable patterns that have been forged into standard processes, automation, and (in many cases) features of Microsoft tooling.

Routine front door incidents

The most common attack patterns we see are phishing and stolen credentials attacks (or minor variations on them):

  • Phishing email → Host infection → Identity pivot
  • Stolen credentials → Identity pivot → Host infection

While these aren’t the only ways attackers gain access to organizations, they’re the most prevalent methods mastered by most attackers. Just as martial artists start by mastering basic common blocks, punches, and kicks, SOC analysts and teams must build a strong foundation by learning to respond rapidly to these common attack methods.

As we mentioned earlier in the series, it’s been over two years since network-based detection was our primary method for detecting attacks. We attribute this primarily to investments that improved our ability to rapidly remediate attacks early with host/email/identity detections. There are also fundamental challenges with network-based detections: they are noisy and have limited native context for filtering true from false positives.

Analyst investigation process

Once an analyst settles into the analyst pod on the watch floor for their shift, they start checking the queue of our case management system for incidents (not entirely unlike the way phone support or help desk analysts do).

While anything might show up in the queue, the process for investigating common front door incidents includes:

  1. Alert appears in the queue—After a threat detection tool detects a likely attack, an incident is automatically created in our case management system. The Mean Time to Acknowledge (MTTA) measurement of SOC responsiveness begins with this timestamp. See Part 1: Organization for more information on key SOC metrics.

Basic threat hunting helps keep a queue clean and tidy

Require a 90 percent true positive rate for alert sources (e.g., detection tools and types) before allowing them to generate incidents in the analyst queue. This quality requirement reduces the volume of false positive alerts, which can lead to frustration and wasted time. To implement, you’ll need to measure and refine the quality of alert sources and create a basic threat hunting process. A basic threat hunting process leverages experienced analysts to comb through alert sources that don’t meet this quality bar to identify interesting alerts that are worth investigating. This review (without requiring full investigation of each one) helps ensure that real incident detections are not lost in the high volume of noisy alerts. It can be a simple part-time process, but it does require skilled analysts who can apply their experience to the task.
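The 90 percent quality bar can be enforced mechanically by measuring each alert source. A sketch of such a gate (the source names and counter fields are illustrative):

```python
def queue_eligible(sources, threshold=0.90):
    """Return the alert sources allowed to create incidents directly;
    the rest are routed to the hunting review process instead."""
    eligible = []
    for name, stats in sources.items():
        total = stats["true_positives"] + stats["false_positives"]
        # A source earns queue access only with a measured TP rate
        # at or above the threshold.
        if total and stats["true_positives"] / total >= threshold:
            eligible.append(name)
    return sorted(eligible)

sources = {
    "edr_lateral_movement": {"true_positives": 95, "false_positives": 5},
    "legacy_network_ids": {"true_positives": 40, "false_positives": 60},
}
print(queue_eligible(sources))  # ['edr_lateral_movement']
```

Here the noisy network source (40 percent true positive) stays out of the analyst queue and would instead feed the part-time hunting review described above.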

  2. Own and orient—The analyst on shift begins by taking ownership of the case and reading through the information available in the case management tool. The timestamp for this is the end of the MTTA responsiveness measurement and begins the Mean Time to Remediate (MTTR) measurement.

Experience matters

A SOC is dependent on the knowledge, skills, and expertise of the analysts on the team. The attack operators and malware authors you defend against are often adaptable and skilled humans, so no prescriptive textbook or playbook on response will stay current for very long. We work hard to take good care of our people—giving them time to decompress and learn, recruiting them from diverse backgrounds that can bring fresh perspectives, and creating a career path and shadowing programs that encourage them to learn and grow.

  3. Check out the host—Typically, the first priority is to identify affected endpoints so analysts can rapidly get deep insight. Our SOC relies on the Endpoint Detection and Response (EDR) functionality in Microsoft Defender Advanced Threat Protection (ATP) for this.

Why endpoint is important

Our analysts have a strong preference to start with the endpoint because:

  • Endpoints are involved in most attacks—Malware on an endpoint represents the sole delivery vehicle of most commodity attacks, and most attack operators still rely on malware on at least one endpoint to achieve their objective. We’ve also found the EDR capabilities detect advanced attackers that are “living off the land” (using tools deployed by the enterprise to navigate). The EDR functionality in Microsoft Defender ATP provides visibility into normal behavior that helps detect unusual command lines and process creation events.
  • Endpoint offers powerful insights—Malware and its behavior (whether automated or manual actions) on the endpoint often provides rich detailed insight into the attacker’s identity, skills, capabilities, and intentions, so it’s a key element that our analysts always check for.

Identifying the endpoints affected by this incident is easy for alerts raised by the Microsoft Defender ATP EDR, but may take a few pivots on an email or identity sourced alert, which makes integration between these tools crucial.

  4. Scope out and fill in the timeline—The analyst then builds a full picture and timeline of the related chain of events that led to the alert (which may be an adversary’s attack operation or a false positive) by following leads from the first host alert. The analyst travels along the timeline:
  • Backward in time—Track backward to identify the entry point in the environment.
  • Forward in time—Follow leads to any devices/assets an attacker may have accessed (or attempted to access).

Our analysts typically build this picture using the MITRE ATT&CK™ model (though some also adhere to the classic Lockheed Martin Cyber Kill Chain®).
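Mechanically, building the timeline means ordering the linked events and reading them in both directions. A toy sketch (the events are invented for illustration):

```python
# Illustrative timeline builder: order linked events chronologically,
# then read the earliest as the likely entry point (backward in time)
# and later ones as attacker movement (forward in time).

def build_timeline(events):
    """Sort linked events chronologically by ISO timestamp."""
    return sorted(events, key=lambda e: e["time"])

events = [
    {"time": "2019-12-20T10:42", "event": "identity pivot to file server"},
    {"time": "2019-12-20T09:05", "event": "phishing email delivered"},
    {"time": "2019-12-20T09:17", "event": "malicious macro executed on ws07"},
]

timeline = build_timeline(events)
entry_point = timeline[0]      # backward in time: where they got in
later_activity = timeline[1:]  # forward in time: where they went next
print(entry_point["event"])  # phishing email delivered
```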

True or false? Art or science?

The process of investigation is partly a science and partly an art. The analyst is ultimately building a storyline of what happened to determine whether this chain of events is the result of a malicious actor (often attempting to mask their actions/nature), a normal business/technical process, an innocent mistake, or something else.

This investigation is an iterative process. Analysts identify potential leads based on the information in the original report, follow those leads, and evaluate whether the results contribute to the investigation.

Analysts often contact users to determine whether they performed an anomalous action intentionally, performed it accidentally, or didn’t perform it at all.

Running down the leads with automation

Much like analyzing physical evidence in a criminal investigation, cybersecurity investigations involve iteratively digging through potential evidence, which can be tedious work. Another parallel between cybersecurity and traditional forensic investigations is that popular TV and movie depictions are often much more exciting and faster than the real world.

One significant advantage of investigating cyberattacks is that the relevant data is already electronic, making it easier to automate investigation. For many incidents, our SOC takes advantage of security orchestration, automation, and response (SOAR) technology to automate investigation (and remediation) of routine incidents. Our SOC relies heavily on the AutoIR functionality in Microsoft Threat Protection tools like Microsoft Defender ATP and Office 365 ATP to reduce analyst workload. In our current configuration, some remediations are fully automatic and some are semi-automatic (where analysts review the automated investigation and approve the proposed remediation before it executes).

Document, document, document

As the analyst builds this understanding, they must capture a complete record with their conclusions and reasoning/evidence for future use (case reviews, analyst self-education, re-opening cases that are later linked to active attacks, etc.).

As our analyst develops information on an incident, they quickly capture the most common and relevant details in the case, such as:

  • Alert info: Alert links and Alert timeline
  • Machine info: Name and ID
  • User info
  • Event info
  • Detection source
  • Download source
  • File creation info
  • Process creation
  • Installation/Persistence method(s)
  • Network communication
  • Dropped files
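A structured record keeps these fields consistent across analysts and cases. A sketch using a Python dataclass, mirroring a subset of the fields above (a real case management system would define its own schema):

```python
from dataclasses import dataclass, field

# Sketch of a case record mirroring the fields listed above; in
# practice this lives in the case management system, not a local object.

@dataclass
class CaseRecord:
    alert_links: list = field(default_factory=list)
    machine_name: str = ""
    machine_id: str = ""
    user_info: str = ""
    detection_source: str = ""
    download_source: str = ""
    persistence_methods: list = field(default_factory=list)
    network_communication: list = field(default_factory=list)
    dropped_files: list = field(default_factory=list)

case = CaseRecord(machine_name="ws07", detection_source="EDR")
case.persistence_methods.append("scheduled task")
print(case.machine_name, case.persistence_methods)
# ws07 ['scheduled task']
```

Using `default_factory` gives every case its own lists, so appending evidence to one record never leaks into another.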

Fusion and integration avoid wasting analyst time

Each minute an analyst wastes on manual effort is another minute the attacker has to spread, infect, and do damage during an attack operation. Repetitive manual activity also creates analyst toil, increases frustration, and can drive interest in finding a new job or career.

We learned that several technologies are key to reducing toil (in addition to automation):

  • Fusion—Adversary attack operations frequently trip multiple alerts in multiple tools, and these must be correlated and linked to avoid duplication of effort. Our SOC has found significant value from technologies that automatically find and fuse these alerts together into a single incident. Azure Security Center and Microsoft Threat Protection include these natively.
  • Integration—Few things are more frustrating and time consuming than having to switch consoles and tools to follow a lead (a.k.a. swivel-chair analytics). Switching consoles interrupts an analyst’s thought process and often requires manually copying/pasting information between tools to continue the work. Our analysts are extremely appreciative of the work our engineering teams have done to bring threat intelligence natively into Microsoft’s threat detection tools and link together the consoles for Microsoft Defender ATP, Office 365 ATP, and Azure ATP. They’re also looking forward to (and starting to test) the Microsoft Threat Protection console and Azure Sentinel updates that will continue to reduce swivel-chair analytics.
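Alert fusion can be approximated by merging alerts that share an entity such as a host, user, or file hash. A minimal sketch of that grouping logic (the alert shapes are invented for illustration):

```python
# Illustrative fusion: merge alerts into one incident when they share
# any entity (host, user, hash), so analysts work one case, not many.

def fuse_alerts(alerts):
    incidents = []  # each incident: {"alerts": [...], "entities": set}
    for alert in alerts:
        entities = set(alert["entities"])
        # Attach to the first incident sharing an entity, else start new.
        match = next((i for i in incidents if i["entities"] & entities), None)
        if match:
            match["alerts"].append(alert["id"])
            match["entities"] |= entities
        else:
            incidents.append({"alerts": [alert["id"]], "entities": entities})
    return incidents

alerts = [
    {"id": "a1", "entities": ["ws07", "alice"]},
    {"id": "a2", "entities": ["alice", "mailbox-12"]},
    {"id": "a3", "entities": ["ws99"]},
]
incidents = fuse_alerts(alerts)
print(len(incidents))              # 2
print(incidents[0]["alerts"])      # ['a1', 'a2']
```

Here an endpoint alert and a mailbox alert fuse through the shared user, which is the kind of correlation the products do natively at much larger scale.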

Stay tuned for the next segment in the series, where we’ll conclude our investigation, remediate the incident, and take part in some continuous improvement activities.

Learn more

In the meantime, bookmark the Security blog to keep up with our expert coverage on security matters and follow us at @MSFTSecurity for the latest news and updates on cybersecurity.

To learn more about SOCs, read previous posts in the Lessons learned from the Microsoft SOC series, including:

Watch the CISO Spotlight Series: Passwordless: What’s It Worth.

Also, see our full CISO series and download our Minutes Matter poster for a visual depiction of our SOC philosophy.

The post CISO series: Lessons learned from the Microsoft SOC—Part 3b: A day in the life appeared first on Microsoft Security Blog.

CISO series: Lessons learned from the Microsoft SOC—Part 3a: Choosing SOC tools http://approjects.co.za/?big=en-us/security/blog/2019/10/07/ciso-series-lessons-learned-from-the-microsoft-soc-part-3a-choosing-soc-tools/ Mon, 07 Oct 2019 21:20:56 +0000 In the next post of our series, we provide tips on choosing technology to help a security operations center (SOC) be more responsive, effective, and collaborative.

The post CISO series: Lessons learned from the Microsoft SOC—Part 3a: Choosing SOC tools appeared first on Microsoft Security Blog.

The Lessons learned from the Microsoft SOC blog series is designed to share our approach and experience with security operations center (SOC) operations. Our learnings in the series come primarily from Microsoft’s corporate IT security operation team, one of several specialized teams in the Microsoft Cyber Defense Operations Center (CDOC).

Over the course of the series, we’ve discussed how we operate our SOC at Microsoft. In the last two posts, Part 2a, Organizing people, and Part 2b: Career paths and readiness, we discussed how to support our most valuable resources—people—based on successful job performance.

We’ve also included lessons learned from the Microsoft Detection and Response Team (DART) to help our customers respond to major incidents, as well as insights from the other internal SOC teams.

For a visual depiction of our SOC philosophy, download our Minutes Matter poster. To learn more about our Security operations, watch CISO Spotlight Series: The people behind the cloud.

As part of Cybersecurity Awareness month, today’s installment focuses on the technology that enables our people to accomplish their mission by sharing our current approach to technology, how our tooling evolved over time, and what we learned along the way. We hope you can use what we learned to improve your own security operations.

Our strategic approach to technology

Ultimately, the role of technology in a SOC is to help empower people to better contain risk from adversary attacks. Our design for the modern enterprise SOC has moved away from the classic model of relying primarily on alerts generated by static queries in an on-premises security information and event management (SIEM) system. The volume and sophistication of today’s threats have outpaced the ability of this model to detect and respond to threats effectively.

We also found that augmenting this model with disconnected point solutions led to additional complexity and didn’t necessarily speed up analysis, prioritization, orchestration, and execution of response actions.

Selecting the right technology

Every tool we use must enable the SOC to better achieve its mission and provide meaningful improvement before we invest in purchasing and integrating it. Each tool must also meet rigorous requirements for the sheer scale and global footprint of our environment and the top-shelf skill level of the adversaries we face, and must efficiently enable our analysts to provide high-quality outcomes. The tools we selected support a range of scenarios.

In addition to enabling firstline responders to rapidly remediate threats, we must also enable deep subject matter experts in security and data science to reason over immense volumes of data as they hunt for highly skilled and well-funded nation state level adversaries.

Making the unexpected choice

Even though many of the tools we currently use are made by Microsoft, they still must meet our stringent requirements. All SOC tools—no matter who makes them—are strictly vetted, and we don’t hesitate to reject tools that don’t work for our purposes. For example, our SOC rejected Microsoft’s Advanced Threat Analytics tool because of the infrastructure required to scale it up (despite some promising detection results in a pilot). Its successor, Azure Advanced Threat Protection (Azure ATP), solved this infrastructure challenge by shifting to a SaaS architecture and is now in active use daily.

Our SOC analysts work with Microsoft engineering and third-party tool providers to drive their requirements and provide feedback. As an example, our SOC team has a weekly meeting with the Microsoft Defender ATP team to review learnings, findings, request features or changes, share engineering progress on requested features, and share attacker research from both teams. Even today, as we roll out Azure Sentinel, our SOC is actively working with the engineering team to ensure key requirements are met, so we can fully retire our legacy SIEM (more details below). Additionally, we regularly invite engineers from our product groups to join us in the SOC to learn how the technology is applied by our experts.

History and evolution to broad and deep tooling

Microsoft’s Corporate IT SOC protects a cross platform environment with a significant population of Windows, Linux, and Macs running a variety of Microsoft and non-Microsoft software. This environment is approximately 95 percent hosted on the cloud today. The tooling used in this SOC has evolved significantly over the years starting from the classic model centered around an on-premises SIEM.

Phase 1—Classic on-premises SIEM-centric model

This is the common model where all event data is fed into an on-premises SIEM where analytics are performed on the data (primarily static queries that were refined over time).

Infographic showing the classic SIEM model: Incidents, Alert Queue, Primary Investigation, Pivot and Remediate.

We experienced a set of challenges that we now view as natural limitations of this model. These challenges included:

  • Overwhelming event volume—High volume and growth (on the scale of 20+ billion events a day currently) exceeded the capacity of the on-premises SIEM to handle it.
  • Analyst overload and fatigue—The static rulesets generated excessive amounts of false positive alerts that led to alert fatigue.
  • Poor investigation workflow—Investigation of events using the SIEM was clunky and required manual queries and manual steps when switching between tools.

Phase 2—Bolster on-premises SIEM weaknesses with cloud analytics and deep tools

We introduced several changes designed to address shortcomings of the classic model.

Infographic showing Investigation tooling in SIEM: Log analytics, Endpoint, Identity, Saas, and Azure Assets.

Three strategic shifts were introduced and included:

1. Cloud-based log analytics—To address the SIEM scalability challenges discussed previously, we introduced cloud data lake and machine learning technology to more efficiently store and analyze events. This took pressure off our legacy SIEM and allowed our hunters to embrace the scale of cloud computing to apply advanced techniques like machine learning to reason over the data. We were early adopters of this technology before many current commercial offerings had matured, so we ended up with several “generations” of custom technology that we had to later reconcile and consolidate (into the Log Analytics technology that now powers Azure Sentinel).

Lesson learned: “Good enough” and “supported” is better than “custom.”

Adopt commercial products if they meet at least the “Pareto 80 percent” of your needs because the support of these custom implementations (and later rationalization effort) takes resources and effort away from hunting and other core mission priorities.

2. Specialized high-quality tooling—To address analyst overload and poor workflow challenges, we tested and adopted specialized tooling designed to:

  • Produce high quality alerts (versus high quantity of detailed data).
  • Enable analysts to rapidly investigate and remediate compromised assets.

It is hard to overstate the benefits of this incredibly successful integration of technology. These tools had a powerful positive impact on our analyst morale and productivity, driving significant improvements of our SOC’s mean time to acknowledge (MTTA) and remediate (MTTR).
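For reference, MTTA and MTTR are straightforward averages over incident timestamps. A minimal sketch of the computation (the timestamps are made up, and organizations differ on whether MTTR is measured from creation or from acknowledgement):

```python
from datetime import datetime

def mean_minutes(deltas):
    """Average a list of timedeltas, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

incidents = [
    # (created, acknowledged, remediated) -- illustrative timestamps
    (datetime(2019, 10, 1, 9, 0),  datetime(2019, 10, 1, 9, 5),   datetime(2019, 10, 1, 9, 45)),
    (datetime(2019, 10, 1, 13, 0), datetime(2019, 10, 1, 13, 15), datetime(2019, 10, 1, 13, 55)),
]

# MTTA: creation -> acknowledgement; MTTR here: creation -> remediation
mtta = mean_minutes([ack - created for created, ack, _ in incidents])
mttr = mean_minutes([rem - created for created, _, rem in incidents])
print(f"MTTA: {mtta:.0f} min, MTTR: {mttr:.0f} min")  # MTTA: 10 min, MTTR: 50 min
```

Tracking these two numbers over time is what made the impact of the tooling changes above measurable rather than anecdotal.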

We attribute a significant amount of this success of these tools to the direct real-world input that was used to design them.

  • SOC—The engineering group spent approximately 18-24 months with our SOC team focused on learning about SOC analyst needs, thought processes, pain points, and more while designing and building the first release of Microsoft Defender ATP. These teams still stay in touch weekly.
  • DART team—The engineering group directly integrated analysis and hunting techniques that DART developed to rapidly find and evict advanced adversaries from customers.

Here’s a quick summary of the key tools. We’ll share more details on how we use them in our next blog:

  • Endpoint—Microsoft Defender ATP is the default starting point for analysts for almost any investigation (regardless of the source of the alert) because of its powerful visibility and investigation capabilities.
  • Email—Office 365 ATP’s integration with Office 365 Exchange Online helps analysts rapidly find and remove phishing emails from mailboxes. The integration with Microsoft Defender ATP and Azure ATP enables analysts to handle common cases extremely quickly, which led to growth in our analyst caseload (in a good way ☺).
  • Identity—Integrating Azure ATP helped complete the triad of the most attacked/utilized resources (Endpoint-Email-Identity) and enabled analysts to smoothly pivot across them (and added some useful detections too).
  • We also added Microsoft Cloud App Security and Azure Security Center to provide high quality detections and improve investigation experience as well.

Even before adding the Automated investigations technology (originally acquired from Hexadite), we found that Microsoft Defender ATP’s Endpoint Detection and Response (EDR) solution increased the SOC’s efficiency to the point where investigation team analysts can start doing proactive hunting part-time (often by sifting through lower priority alerts from Microsoft Defender ATP).

Lesson learned: Enable rapid end-to-end workflow for common Email-Endpoint-Identity attacks.

Ensure your technology investments optimize the analyst workflow to detect, investigate, and remediate common attacks. Microsoft Defender ATP and the connected tools (Office 365 ATP, Azure ATP) were a game changer in our SOC and enabled us to consistently remediate these attacks within minutes. This is our number one recommendation to SOCs, as it helped with:

  • Commodity attacks—Efficiently dispatch (a high volume of) commodity attacks in the environment.
  • Targeted attacks—Mitigate the impact of advanced attacks by severely limiting the attack operator’s time to laterally traverse and explore, hide, set up command and control (C2), etc.

3. Mature case management—To further improve analyst workflow challenges, we transitioned the analyst’s primary queue to our case management service hosted by a commercial SaaS provider. This further reduced our dependency on our legacy SIEM (primarily hosting legacy static analytics that had been refined over time).

Lesson learned: Single queue

Regardless of the size and tooling of your SOC, it’s important to have a single queue and to govern its quality.

This can be implemented as a case management solution, the alert queue in a SIEM, or something as simple as the alert list in the Microsoft Threat Protection tool for smaller organizations. Having a single place to go for reactive analysis and ensuring that place produces high quality alerts are key enablers of SOC effectiveness and responsiveness. As a complement to the quality piece, you should also have a proactive hunting activity to ensure that attacker activities are not lost in high-noise detections.
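For a small team, the single queue can literally be a severity-ordered merge of every alert source. A toy sketch (the severity labels and source names are illustrative, not from any particular product):

```python
import heapq
from itertools import count

SEVERITY = {"high": 0, "medium": 1, "low": 2}

class SingleQueue:
    """One prioritized queue across all alert sources (SIEM, EDR, email, identity)."""
    def __init__(self):
        self._heap = []
        self._order = count()  # tiebreaker: preserve arrival order within a severity

    def push(self, source, severity, alert_id):
        heapq.heappush(self._heap, (SEVERITY[severity], next(self._order), source, alert_id))

    def pop(self):
        _, _, source, alert_id = heapq.heappop(self._heap)
        return source, alert_id

q = SingleQueue()
q.push("EDR", "medium", "E-17")
q.push("email", "high", "M-4")
q.push("identity", "low", "I-2")
print(q.pop())  # ('email', 'M-4') -- highest severity first, regardless of source
```

The point is not the data structure but the discipline: every reactive lead enters one place, in one priority order, no matter which tool raised it.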

Phase 3—Modernize SIEM to cloud native

Our current focus is the transition of the remaining SIEM functions from our legacy capability to Azure Sentinel.

Infographic showing a unified view: Event logs, endpoint, identity, SaaS, Azure assets, network, servers, and 3rd party logs and tools.

We’re now focused on refining our tool strategy and architecture into a model designed to optimize both breadth (unified view of all events) and depth capabilities. The specialized high-quality tooling (depth tooling) works great for monitoring the “front door” and some hunting but isn’t the only tooling we need.

We’re now in the early stages of operating Microsoft’s Azure Sentinel technology in our SOC to completely replace our legacy on-premises SIEM. This task is a bit simpler for us than most, as we have years of experience using the underlying event log analysis technology that powers Azure Sentinel (Azure Monitor technology, which was previously known as Azure Log Analytics and Operations Management Suite (OMS)).

Our SOC analysts have also been contributing heavily to Azure Sentinel and its community (queries, dashboards, etc.) to share what we have learned about adversaries with our customers.

Learn more details about this SOC and download slides from the CISO Workshop:

Lesson learned: Side-by-side transition state

Based on our experience and conversations with customers, we expect transitioning to cloud analytics like Azure Sentinel will often include a side-by-side configuration with an existing legacy SIEM. This could include a:

  • Short-term transition state—For organizations that are committed to rapidly retiring a legacy SIEM in favor of Azure Sentinel (often to reduce cost/complexity) and need operational continuity during this short bridge period.
  • Medium-term coexistence—For organizations with significant investment in an on-premises SIEM and/or a longer-term plan for cloud migration. These organizations recognize the power of Data Gravity—placing analytics closer to the cloud data avoids the costs and challenges of transferring logs to/from the cloud.

Managing the SOC investigations across the SIEM platforms can be accomplished with reasonable efficiency using either a case management tool or the Microsoft Graph Security API (synchronizing Alerts between the two SIEM platforms).
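A minimal sketch of what that alert synchronization has to do: create missing alerts in the second platform and propagate status changes so analysts don't re-triage work already closed in the first console. Everything here is hypothetical plumbing; a real integration would go through your case management tool or an alert API such as the Microsoft Graph Security API:

```python
def sync_alerts(source_alerts, target_index):
    """Illustrative one-way sync of alerts from a legacy SIEM into a cloud SIEM
    during a side-by-side period. `target_index` maps alert ID -> status and
    stands in for the target platform; all names here are hypothetical."""
    created, updated = [], []
    for alert in source_alerts:
        aid, status = alert["id"], alert["status"]
        if aid not in target_index:
            target_index[aid] = status      # alert not yet mirrored: create it
            created.append(aid)
        elif target_index[aid] != status:
            target_index[aid] = status      # e.g. propagate "resolved" so the
            updated.append(aid)             # second console stays in step
    return created, updated

target = {"L-1": "new"}
created, updated = sync_alerts(
    [{"id": "L-1", "status": "resolved"}, {"id": "L-2", "status": "new"}], target)
print(created, updated)  # ['L-2'] ['L-1']
```

Running a loop like this in both directions (or centralizing state in the case management tool) is what keeps the two queues from diverging during the transition.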

Microsoft is continuing to invest in building more detailed guidance and capabilities to document learnings on this process and continue to refine technology to support it.

Learn more

To learn more, read previous posts in the “Lessons learned from the Microsoft SOC” series, including:

Also, see our full CISO series.

Watch the CISO Spotlight Series: The people behind the cloud.

For a visual depiction of our SOC philosophy, download our Minutes Matter poster.

Stay tuned for the next segment in “Lessons learned from the Microsoft SOC” where we dive into more of the analyst experience of using these tools to rapidly investigate and remediate attacks. In the meantime, bookmark the Security blog to keep up with our expert coverage on security matters. Also, follow us at @MSFTSecurity for the latest news and updates on cybersecurity.

To learn more about how you can protect your time and empower your team, check out the cybersecurity awareness page this month.

The post CISO series: Lessons learned from the Microsoft SOC—Part 3a: Choosing SOC tools appeared first on Microsoft Security Blog.

CISO Series: Lessons learned from the Microsoft SOC Part 2b: Career paths and readiness http://approjects.co.za/?big=en-us/security/blog/2019/06/06/lessons-learned-from-the-microsoft-soc-part-2b-career-paths-and-readiness/ Thu, 06 Jun 2019 16:00:16 +0000 In our second post about people—our most valuable resource in the SOC—we talk about our investments into readiness programs, career paths, and recruiting for success.

The post CISO Series: Lessons learned from the Microsoft SOC Part 2b: Career paths and readiness appeared first on Microsoft Security Blog.

The “Lessons learned from the Microsoft SOC” blog series is designed to share our approach and experience with security operations center (SOC) operations, so you can use what we learned to improve your SOC. The learnings in the series come primarily from Microsoft’s corporate IT security operation team, one of several specialized teams in the Microsoft Cyber Defense Operations Center (CDOC). We’ve also included lessons our Detection and Response Team (DART) have learned helping our customers respond to major incidents and insights from the other internal SOC teams.

Today, we wrap up our discussion on people—our most valuable resource in the SOC. In the first part of our discussion, Part 2a: Organizing people, we covered how to set up people in the security operations center (SOC) for success. Today, we talk about our investments into readiness programs and career paths for our SOC analysts as well as recruiting for success. We’ll close the series with discussions about the technology that enables our people to accomplish their mission.

Something new every day

When an analyst walks into our SOC for a shift, they never know what to expect. They must be ready for anything as they face off with intelligent, adaptable, and well-funded adversaries who are intent on evading our defenses. For each problem, they must apply their unique knowledge and experience, the accumulated learnings from our SOC, and the expertise of their SOC teammates.

Our investments into readiness programs, career paths, and recruitment strategies are designed so our SOC analysts are prepared to succeed in their duties, increase mastery of their discipline, and grow as individuals. This ensures that our SOC staff brings their best to every shift, every time.

You may have to adapt some of these practices to the unique needs of your security operations team to be successful. We’re fortunate to have dedicated security operations teams, dedicated facilities, and experienced peers to learn from already on staff, but understand not all security organizations have these resources available.

Analyst roles and career paths

Empowering humans means investing in them. Being a SOC analyst is a high-stress job, and we know our success is built upon actively engaged people applying their experience and problem-solving creativity. The longer our analysts do this work, the better they get, so it’s important to nurture a long-running, sustainable workforce. This starts by clearly defining a career path. Our tier model not only organizes the work of the SOC, but also guides our analysts in building their knowledge and skills and shapes their careers with increasing levels of skills and different challenges.

Because we strive to empower and attract smart people with a continuous learning mindset, we’re motivated to promote from within. An analyst’s career path typically progresses from Tier 1 to Tier 2 to Tier 3 or to incident response, program management, security product engineering, or leadership tracks. There are exceptions, but this tends to be the norm.

  • Tier 1—Analysts acquire and refine core skills including attacker mindset and techniques, using detection and investigation tools, working with internal teams and processes, and calmly applying a thoughtful approach in a high pressure situation. This is similar to martial arts where beginners acquire basic competencies (marked by a progression of colored belts) until they have achieved their black belt and move to the next stage of skills. Similarly, transition from Tier 1 to Tier 2 is a key turning point in the career of an analyst.
  • Tier 2—Analysts continue to hone their skills as they move from executing well-defined playbooks for (mostly) predictable incidents at Tier 1 to investigating advanced incidents with greater unpredictability. Tier 2 analysts investigate attack operations conducted by organized groups with specialized skills and a specific targeted goal. Analysts investigating these incidents continue growing skills while learning from Tier 2 peer analysts and the incidents themselves. Over time, senior Tier 2 analysts often shadow different Tier 3 teams as they try out potential career paths and/or prepare for the next stage of their career.
  • Tier 3—At this level, analyst career paths typically start to diverge into deeper specialties. Analysts can choose to pursue mastery of a particular skill or increasing competency/mastery across multiple skills. Tier 3 increasingly requires data analytics skillsets on the team because proactive hunting, investigation of advanced attacks, and automation development frequently require navigating many datasets with massive amounts of information.

Careful balancing

Defining a clear career path is important, but like all disciplines dealing with people, we must carefully balance and manage some nuances along the way.

  • Balancing short and long term goals—As our analysts learn new skills and progress through their career, they learn to balance goals, such as ensuring alerts and cases are handled as top priority while simultaneously developing creative solutions that can reduce toil and increase efficiency over the long term.
  • Balancing empowerment and guidance—Managers and senior personnel need to strike this careful balance as they mentor analysts in their careers. This is particularly important at key transition points, such as when an analyst first begins onboarding into a new role. Much like the martial arts films where the talented but “not fully trained” student has an overabundance of confidence and tries to take on more than they can handle, we see a similar dynamic as analysts begin shadowing Tier 3 roles. In this situation, we have to be careful not to discourage the creative impulse (offering a feedback channel for ideas) while coaching and guiding analysts to complete their learning from seasoned professionals and focus on the journey ahead.

Recruiting for success

Recruiting people and developing their skills is one of the most critical aspects of the SOC’s success. The biggest challenges in this space are the scarcity of people with the right skillsets, the speed at which skillsets must evolve, the potential for analyst burnout, and the need to blend diverse skills and perspectives to address both the human and technical aspects of attacks.

Much has been written about the scarcity of cybersecurity skills. We recommend reading a relevant blog on this topic that offers different ways of addressing the scarcity of talent in security. Additionally, you may want to watch a recent RSA Conference Keynote from Ann Johnson (Corporate Vice President of Cybersecurity Solutions Group at Microsoft), which addresses many related topics including the mental health and burnout risks our industry faces.

The evolving skillset challenge is particularly acute for our SOC because classic SOCs tend to be network centric, but our detection and investigation have evolved to rely primarily on device-, identity-, and application-specific tooling. While we still have and use advanced network security tools, we’ve seen their utility diminish significantly over the years to a supporting role in investigation and advanced hunting. As of the writing of this blog, it’s been over two years since the last primary detection of an attack on our corporate environment came from a network tool. We expect this trend to continue and have oriented our analyst readiness accordingly.

When it comes to recruiting and building skilled analysts, we’ve found that we require a combination of diverse perspectives and some common traits. As with any role, success requires having a diverse team with different backgrounds, mindsets, and skillsets to bring more perspective to the problems at hand and surface better solutions faster. We’ve also found certain personality traits tend to make analysts more successful in a fast-paced high-pressure work environment of a SOC.

It’s critical to note that the following observations are general trends, not absolute rules. The primary factor of success in hiring an individual into a role is how well that particular person fits the role. With that said, we tend to look for people with a kind of “grace under pressure,” as we find it’s easier to train technical and security skills to people with a growth mindset and calm demeanor under pressure than it is to do the reverse.

For example, we have found that people with military experience are often a good fit because they have experience focusing on the mission despite the strong distractions in ambiguous situations with active hostile adversaries.

We’ve also had success with recruiting and investing into people early in their careers who are eager to learn and have few preconceptions. We’ve had good results with integrating seasoned professionals, but there are simply not enough available for the needs of the marketplace today.

An interesting aspect of the SOC attracting mission-oriented personalities is that when we have a major incident off hours, we more often get too many people volunteering to help versus not enough—a good “problem” to have!

Building skills and job readiness

Because of the high complexity required to be an effective SOC analyst, it’s difficult to educate new analysts in the ways of the SOC through formal training alone. We’ve tried different training approaches to build skills over the years and have found the apprenticeship model to be most effective at rapidly and consistently building skills. For new analysts we take an “I do, we do, you do” approach that progresses from observation to hands on with supervision of a seasoned analyst to independent investigation with support from peers and mentors.

This is similar to other industries with a need to transfer rich context and nuance during real world practice, such as an internship or a residency during a medical career.

The readiness process focuses on building understanding and competency in three domains:

  1. Technical tools/capabilities.
  2. Our organization (mission and assets being protected).
  3. Attackers (motivations, tools, techniques, habits, etc.).

These competencies map well to established doctrine on human conflict. Sun Tzu’s advice to “know thyself” and “know thy enemy” map well to the second and third domains. Our SOC processes also map well to thinking from Colonel John Boyd’s OODA ‘loop’ on real-time human conflict: observe, orient, decide, act.

Beyond the competencies, we also need to train our analysts to be big picture thinkers and maintain an end-to-end view of the attack. It’s not enough to focus on a single threat; analysts must also “look left and right.” We need our analysts to think about how else the attacker might be trying to gain access and what else they may be after. For example, a password spray may be the entry point of a multi-stage attack, or an attacker may be using a distributed denial-of-service (DDoS) attack as a smokescreen to distract from their real objective.
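As a concrete example of “looking left and right,” a password spray has a recognizable shape in sign-in logs: one source failing logins against many accounts with only a few attempts each, the inverse of classic brute force. A toy detection sketch (the thresholds are illustrative, not tuned guidance):

```python
from collections import defaultdict

def detect_password_spray(signin_events, min_users=10, max_attempts_per_user=3):
    """Flag source IPs that fail logins against many distinct accounts with only
    a few attempts each -- the opposite of brute force (many attempts, one account)."""
    failures = defaultdict(lambda: defaultdict(int))  # ip -> user -> failed count
    for e in signin_events:
        if not e["success"]:
            failures[e["ip"]][e["user"]] += 1
    suspects = []
    for ip, per_user in failures.items():
        if (len(per_user) >= min_users
                and max(per_user.values()) <= max_attempts_per_user):
            suspects.append(ip)
    return suspects

# Illustrative log: one IP fails once against 12 accounts, another fails once against one
events = [{"ip": "203.0.113.9", "user": f"user{i}", "success": False} for i in range(12)]
events.append({"ip": "198.51.100.7", "user": "admin", "success": False})
print(detect_password_spray(events))  # ['203.0.113.9']
```

The big-picture step is what follows the flag: treating the spray not as the incident itself but as a possible first stage, and asking what the operator behind that IP does next.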

We supplement this apprenticeship model with structured, formal training on topics, such as new products or features and SOC procedures. We also encourage attendance at conferences and work hard to ensure our staffing model supports these and other learning opportunities, so they aren’t empty promises.

This approach has been successful, allowing us to train new Tier 1 analysts in approximately 10–12 weeks, and we’re continuously looking for ways to improve our readiness processes. In addition, our staffing approach has been critical at mitigating burnout risk.

Learn more

For a visual depiction of our SOC philosophy, download our Minutes matter poster. Also, read previous posts in the “Lessons learned from the Microsoft SOC” series, including Part 1: Organization and Part 2a: Organizing people as well as see our full CISO series to learn more.

For more discussion on some of these topics, see John and Kristina’s session (starting at 1:05:48) at Microsoft’s recent Virtual Security Summit.

Stay tuned for the next segment in “Lessons learned from the Microsoft SOC” where we discuss the technology that enables our people to accomplish their mission.

Read more from this series

The post CISO Series: Lessons learned from the Microsoft SOC Part 2b: Career paths and readiness appeared first on Microsoft Security Blog.

CISO Series: Lessons learned from the Microsoft SOC—Part 2a: Organizing people http://approjects.co.za/?big=en-us/security/blog/2019/04/23/lessons-learned-microsoft-soc-part-2-organizing-people/ Tue, 23 Apr 2019 16:00:23 +0000 In the second of our three-part series, we focus on the most valuable resource in the SOC—our people.

The post CISO Series: Lessons learned from the Microsoft SOC—Part 2a: Organizing people appeared first on Microsoft Security Blog.

In the second post in our series, we focus on the most valuable resource in the security operations center (SOC)—our people. This series is designed to share our approach and experience with operations, so you can use what we learned to improve your SOC. In Part 1: Organization, we covered the SOC’s organizational role and mission, culture, and metrics.

The lessons in the series come primarily from Microsoft’s corporate IT security operation team, one of several specialized teams in the Microsoft Cyber Defense Operations Center (CDOC). We also include lessons our Detection and Response Team (DART) have learned helping our customers respond to major incidents.

People are the most valuable asset in the SOC—their experience, skill, insight, creativity, and resourcefulness are what makes our SOC effective. Our SOC management team spends a lot of time thinking about how to ensure our people are set up with what they need to succeed and stay engaged. As we’ve improved our processes, we’ve been able to decrease the time it takes to ramp people up and increase employee enjoyment of their jobs.

Today, we cover the first two aspects of how to set up people in the SOC for success:

  • Empower humans with automation.
  • Microsoft SOC teams and tiers model.

Empower humans with automation

Rapidly sorting out the signal (real detections) from the noise (false positives) in the SOC requires investing in both humans and automation. We strongly believe in the power of automation and technology to reduce human toil, but ultimately, we’re dealing with human attack operators and human judgement is critical to the process.

In our SOC, automation is not about using efficiency to remove humans from the process—it is about empowering humans. We continuously think about how we can automate repetitive tasks from the analyst’s job, so they can focus on the complex problems that people are uniquely able to solve.

Automation empowers humans to do more in the SOC by increasing response speed and capturing human expertise. The toil our staff experiences comes mostly from repetitive tasks and repetitive tasks come from either attackers or defenders doing the same things over and over. Repetitive tasks are ideal candidates for automation.
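A simple way to find those candidates is to measure repetition directly: count how often analysts perform the same action for the same alert type, and automate the top of the list. The sketch below is illustrative only (the action and alert-type names are hypothetical):

```python
from collections import Counter

def automation_candidates(action_log, min_repeats=5):
    """Rank analyst actions by how often the identical (action, alert type)
    pair repeats; the most frequent pairs are the best automation candidates."""
    counts = Counter((a["action"], a["alert_type"]) for a in action_log)
    return [item for item, n in counts.most_common() if n >= min_repeats]

# Illustrative action log: one step repeats 8 times, another only twice
log = [{"action": "reset-password", "alert_type": "phish"} for _ in range(8)]
log += [{"action": "isolate-host", "alert_type": "malware"} for _ in range(2)]
print(automation_candidates(log))  # [('reset-password', 'phish')]
```

Even this crude counting turns "automate the boring parts" from a slogan into a ranked backlog the team can work through.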

We also found that we need to constantly refine the automation because attackers are creative and persistent, constantly innovating to avoid detections and preventive controls. When an effective attack method is identified (like phishing), they exploit it until it stops working. But they also continually innovate new tactics to evade defenses introduced by the cybersecurity community. Given the profit potential of attacks, we expect the challenges of evolving attacks to continue for the foreseeable future.

When repetitive and boring work is automated, analysts can apply more of their creative minds and energy to solving the new problems that attackers present to them and proactively hunting for attackers that got past the first lines of defense. We’ll discuss areas where we use automation and machine learning in “Part 3: Technology.”

Microsoft SOC teams and tiers model

At Microsoft, we organized our SOC into specialized teams, allowing them to better develop and apply deep expertise, which supports the overall goals of reducing time to acknowledge and remediate.

This diagram represents the key SOC functions: threat intelligence, incident management, and SOC analyst tiers:

Image showing key SOC functions: threat intelligence, incident management, and SOC analysts (tiers 1, 2, and 3).

Threat intelligence—We have several threat intelligence teams at Microsoft that support the SOC and other business functions. Their role is to both inform business stakeholders of risk and provide technical support for incident investigations, hunting operations, and defensive measures for known threats. These strategic (business) and tactical (technical) intelligence goals are related but distinctly different from each other. We task different teams with each goal and ensure processes are in place (such as daily standup meetings) to keep them in close contact.

Incident management—Enterprise-wide coordination of incidents, impact assessment, and related tasks are handled by dedicated personnel separate from technical analyst teams. At Microsoft, these incident response teams work with the SOC and business stakeholders to coordinate actions that may impact services or business units. Additionally, this team brings in legal, compliance, and privacy experts as needed to consult and advise on actions regarding regulatory aspects of incidents. This is particularly important at Microsoft because we’re compliant with a large number of international standards and regulations.

SOC analyst tiers—This three-tier model for SOC analysts will probably look familiar to seasoned SOC professionals, though there are some subtleties in our model we don’t see widely in the industry.

Image showing Microsoft's Corporate IT SOC tiers and tools: alert queue (hot path), proactive hunt (cold path), tiers 3, 2, and 1, and finally, automation.

Our organization uses the terms hot path and cold path to describe how we discover adversaries and optimize processes to handle them.

  • Hot path—Reflects detection of confirmed active attack activity that must be investigated and remediated as soon as possible. Managing and remediating these incidents is primarily handled by Tier 1 and Tier 2, though a small percentage (about 4 percent) are escalated to Tier 3. Automated investigation and remediation is also beginning to help reduce hot path workloads.
  • Cold path—Refers to all other activities including proactively hunting for adversary campaigns that haven’t triggered a hot path alert.

Roles and functions of the SOC analyst tiers

Tier 1—This team is the primary front line and focuses on high-speed remediation across a large volume of incidents. Tier 1 analysts respond to a very specific set of alert sources and follow prescriptive instructions to investigate, remediate, and document the incidents. The rule of thumb for alerts that Tier 1 handles is that each can typically be remediated within seconds to minutes. Incidents are escalated to Tier 2 if they aren’t covered by a documented Tier 1 procedure or require involved/advanced remediation (for example, device isolation and cleanup).

In addition:

  • The Tier 1 function is currently performed by full-time employees in our corporate IT SOC. In the past, and in other teams at Microsoft, we have staffed Tier 1 functions with contracted employees or managed service agreements.
  • A current initiative for the full-time employee Tier 1 team is to increase the use of automated investigation and remediation for these incidents. One goal of this initiative is to grow the skills of our current Tier 1 employees, so they can shift to proactive work in other security assignments in SOC or across the company.
  • Tier 1 (and Tier 2) SOC analysts may stay involved with an escalated incident until it is remediated. This helps preserve context during and after transferring ownership of an incident and also accelerates their learning and skills growth.
  • The typical ratio of alert volumes is noted in the Tiers and Tools diagram above. (We’ll share more details in “Part 3: Technology.”)

Tier 2—This team is focused on incidents that require deeper analysis and remediation. Many Tier 2 incidents have been escalated from Tier 1 analysts, but Tier 2 also directly monitors alerts for sensitive assets and known attacker campaigns. These incidents are usually more complex and require an approach that is still structured, but much more flexible than Tier 1 procedures. Additionally, some Tier 2 analysts also proactively hunt for adversaries (typically using lower priority alerts from the same Microsoft Threat Protection tools they use to manage reactive incidents).

Tier 3—This team is focused primarily on advanced hunting and sophisticated analysis to identify anomalies that may indicate advanced adversaries. Most incidents are remediated at Tiers 1 and 2 (96 percent) and only unprecedented findings or deviations from norms are escalated to Tier 3 teams. Tier 3 team members have a high degree of freedom to bring their different skills, backgrounds, and approaches to the goal of ferreting out red team/hidden adversaries. Tier 3 team members have backgrounds as security professionals, data scientists, intelligence analysts, and more. These teams use different tools (Microsoft, custom, and third-party) to sift through a number of different datasets to uncover hidden adversary activity. A favorite of many analysts is the use of Kusto Query Language (KQL) queries across Microsoft Threat Protection tool datasets.
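As a toy illustration of this kind of sifting, the Python sketch below flags rare parent/child process pairs in a small synthetic event list, a miniature stand-in for the frequency-analysis hunts analysts run with KQL over much larger real datasets. The event fields and the rarity threshold are assumptions made for the example.

```python
from collections import Counter

# Synthetic process-creation events; the field names are illustrative only.
events = [
    {"host": "ws1", "parent": "explorer.exe", "child": "outlook.exe"},
    {"host": "ws2", "parent": "explorer.exe", "child": "outlook.exe"},
    {"host": "ws3", "parent": "explorer.exe", "child": "outlook.exe"},
    {"host": "ws4", "parent": "winword.exe",  "child": "powershell.exe"},  # unusual
]

def rare_pairs(events, max_count=1):
    """Flag parent/child process pairs seen rarely—candidates for a closer look."""
    counts = Counter((e["parent"], e["child"]) for e in events)
    return [pair for pair, n in counts.items() if n <= max_count]

suspicious = rare_pairs(events)  # Word spawning PowerShell stands out here
```

A real hunt would layer many such signals (prevalence, timing, asset sensitivity) and always ends with human judgement; rarity alone is a lead, not a verdict.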

The structure of Tier 3 has changed over time, but has recently gravitated to four different functions:

  • Major incident engineering—Handles escalation of incidents from Tier 2. These virtual teams are created as needed for the duration of an incident and often include both reactive investigation and proactive hunting for additional adversary presence.
  • External adversary research and threat analysis—Focuses on hunting for adversaries using existing tools and data sources, as well as signals from various external intelligence sources. The team both hunts for undiscovered adversaries and creates and refines alerts and automation.
  • Purple team operations—A temporary duty assignment where Tier 3 analysts (blue team) are paired with our in-house attack team members (red team) as they perform authorized attacks. We found this purple (red+blue) activity results in high-value learning for both teams, strengthening our overall security posture and resilience. This team is also responsible for the critical task of coordinating with the red team to deconflict whether a detection is an authorized red team operation or a live attacker. At customer organizations, we’ve seen failure to deconflict red team activity result in our DART teams flying onsite to respond to a false alarm (an avoidable, expensive, and embarrassing mistake).
  • Future operations team—Focuses on future-proofing our technology and processes by building and testing new capabilities.

Learn more

For more insights into Microsoft’s approach to using technology to empower people, watch Ann Johnson’s keynote at RSA 2019 and download our poster. For information on organizational culture and goals, read Lessons learned from the Microsoft SOC—Part 1: Organization. In addition, see our CISO series to learn more.

Stay tuned for the second segment in “Lessons learned from the Microsoft SOC—Part 2,” where we’ll cover career paths and readiness programs for people in our SOC. And finally, we’ll wrap up this series with “Part 3: Technology,” where we’ll discuss the technology that enables our people to accomplish their mission.

For more discussion on some of these topics, see John and Kristina’s session (starting at 1:05:48) at Microsoft’s recent Virtual Security Summit.

Read more from this series

The post CISO Series: Lessons learned from the Microsoft SOC—Part 2a: Organizing people appeared first on Microsoft Security Blog.

CISO Series: Lessons learned from the Microsoft SOC—Part 1: Organization http://approjects.co.za/?big=en-us/security/blog/2019/02/21/lessons-learned-from-the-microsoft-soc-part-1-organization/ Thu, 21 Feb 2019 19:00:17 +0000 In the first of our three-part series, we provide tips on how to manage a security operations center (SOC) to be more responsive, effective, and collaborative.

The post CISO Series: Lessons learned from the Microsoft SOC—Part 1: Organization appeared first on Microsoft Security Blog.

We’re frequently asked how we operate our Security Operations Center (SOC) at Microsoft (particularly as organizations are integrating cloud into their enterprise estate). This is the first in a three-part blog series designed to share our approach and experience, so you can use what we learned to improve your SOC.

In Part 1: Organization, we start with the critical organizational aspects (organizational purpose, culture, and metrics). In Part 2: People (Part 2a and 2b), we cover how we manage our most valuable resource—human talent. And finally, Part 3: Technology covers the technology that enables these people to accomplish their mission.

Overall SOC model

Microsoft has multiple security operations teams that each have specialized knowledge to protect the different technical environments at Microsoft. We use a “fusion center” model with a shared operating floor, which we call our Cyber Defense Operations Center (CDOC), to increase collaboration and facilitate rapid communication among these teams. Each team tailors its operations to the specific needs of its environment.

In this three-part series, we focus on the operation of our corporate IT SOC team, as they most closely reflect the challenges and approaches of our customers—having many users and endpoints, email attack vectors, and a hybrid of on-premises and cloud assets. In addition, we include a few lessons learned from the other SOCs and our Detection and Response Team (DART) that helps our customers respond to major incidents.

This SOC operates with three tiers of analysts plus automation as seen in Figure 1 below. (We’ll provide more details in Part 2: People.)

 

Figure 1. SOC analyst tiers plus automation.

The tooling in the SOC (Figure 2) is a mixture of centralized breadth capabilities and specialized tools to enable high quality alerts and an end-to-end investigation and remediation experience. (Part 3: Technology will provide more details.)

Figure 2. SOC tooling.

Like all things in security, our SOC has evolved considerably over the years to its current state and will continue to evolve. We recently noticed that our SOC had sustained a 100+ percent growth in incidents handled over the past three years with a nearly flat staffing level. While we don’t know if we can expect this astounding trend to continue in the future, it validates that we are on the right track and should share our learnings.

SOC organizational purpose

The first element we cover is the value of the SOC in the context of the overall mission and risk of the organization. As with traditional crime and espionage, we don’t expect there will be a straightforward “solution” to cyberattacks. A SOC is often a crucial risk mitigation investment for an enterprise as it is core to limiting how much time and access attackers have in the organization. This ultimately increases the attacker’s cost and decreases the benefit, which damages their return on investment (ROI) and motivation for attacking your organization. Everything in the SOC should be oriented toward limiting the time and access attackers can gain to the organization’s assets, thereby mitigating business risk.

At Microsoft, our SOCs bear not just the responsibility of reducing risk to our employees and investors, but also the weight of the trust that millions of customers accessing our cloud services and products put in us.

We’ve learned that the SOC has four primary functional integration points with the business:

  • Business context (to the SOC)—The SOC needs to understand what is most important to the organization so the team can apply that context to fluid real-time security situations. What would have the most negative impact on the business? Downtime of critical systems? A loss of reputation and customer trust? Disclosure of sensitive data? Tampering with critical data or systems? We’ve learned it’s critical that key leaders and staff in the SOC understand this context as they wade through the continuous flood of information and triage incidents and prioritize their time, attention, and effort.
  • Joint practice exercises (with the SOC)—Business leaders should regularly join the SOC in practicing response to major incidents. This builds the muscle memory and relationships that are critical to fast and effective decision making in the high pressure of real incidents, reducing organizational risk. This practice also reduces risk by exposing gaps and assumptions in the process that can be fixed prior to a real incident.
  • Major incidents updates (from the SOC)—The SOC should provide updates to business stakeholders for major incidents as they happen. This allows business leaders to understand their risk and take both proactive and reactive steps to manage that risk. For more learnings on major incidents by our DART team, see the incident response reference guide.
  • Business intelligence (from the SOC)—Sometimes the SOC finds that adversaries are targeting a system or data set that isn’t expected. As the SOC discovers the targets of attacks, they should share these with business leaders as these signals may trigger insight for business leaders (outside awareness of a secret business initiative, relative value of an overlooked data set, etc.).

SOC culture

If you take one thing away from this post, it’s that the SOC culture is just as important as the individuals you hire and the tools you use. Culture guides countless decisions each day by establishing what the right answer looks and feels like in ambiguous situations, which are plentiful in a SOC.

Our cultural elements are very much focused on people, teamwork, and continuous learning and include these learnings:

  • Use your human talent wisely—Our people are the most valuable asset we have in the SOC and we can’t afford to waste their time on repetitive, thoughtless tasks that can be automated. To combat the human threats we face, we need knowledgeable and well-equipped humans who can apply expertise, judgement, and creative thinking. This human factor affects almost every aspect of SOC operations, including the role of tools and automation in empowering humans to do more (versus replacing them) and in reducing toil on our analysts. (More on this topic in Part 2: People.)
  • Teamwork—We’ve learned that we can’t tolerate the “lone hero” mindset in the SOC; nobody is as smart as all of us together. Teamwork makes a high-pressure working environment like the SOC much more enjoyable and productive because everyone knows they’re on the same team and everyone has each other’s back. We design our processes and tools to divide tasks into specialties and to encourage people to share insights, coordinate and check each other’s work, and constantly learn from each other.
  • Shift left mindset—To get and stay ahead of cybercriminals and hackers who constantly evolve their techniques, we must continuously improve and shift our activities “left” in the attack timeline. We focus on speed and efficiency to try to get “faster than the speed of attack” by looking at ways we could have detected attacks earlier and responded more quickly. This principle is effectively an application of a continuous learning “growth mindset” that keeps the team laser-focused on reducing risk for our organization and our customers.

SOC metrics

The final organizational element is how we measure success, a critical element to get right. Metrics translate culture into clear measurable objectives and have a powerful influence on shaping people’s behavior. We’ve learned that it’s critical to consider both what you measure, as well as the way that you focus on and enforce those metrics. We measure several indicators of success in the SOC, but we always recognize that the SOC’s job is to manage significant variables that are out of our direct control (attacks, attackers, etc.). We view deviations primarily as a learning opportunity for process or tool improvement rather than a failing on the part of the SOC to meet a goal.

These are the metrics we track, trend, and report on:

  • Time to acknowledge (TTA)—Responsiveness is one of the few elements the SOC has direct control over. We measure the time between an alert being raised (“light starts to blink”) and when an analyst acknowledges that alert and begins the investigation. Improving this responsiveness requires that analysts don’t waste time investigating false positives while another true positive alert sits waiting. We achieve this with ruthless prioritization: any alert that requires an analyst response must have a track record of at least 90 percent true positives. We’ll talk more about the technology we use in Part 3: Technology and will describe our use of “cold path” activities like proactive hunting to supplement the “hot path” of alerts in Part 2: People.
  • Time to remediate (TTR)—Like many SOCs, we track the time to remediate an incident to ensure we’re limiting the time attackers have access to our environment, which drives effectiveness and efficiency in our SOC processes and tools.
  • Incidents remediated (manually/with automation)—We measure how many incidents are remediated manually and how many are resolved with automation. This ensures our staffing levels are appropriate and measures the effectiveness of our automation technology.
  • Escalations between each tier—We track how many incidents are escalated between tiers to ensure we accurately capture the workload for each tier. For example, we need to ensure that Tier 1 work on an escalated incident isn’t fully attributed to Tier 2.
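To make the first two metrics concrete, here is a minimal sketch of computing TTA and TTR from incident timestamps. The record shape and sample values are assumptions for illustration and don’t reflect any particular SOC tooling.

```python
from datetime import datetime

# Illustrative incident records: when the alert was raised, when an analyst
# acknowledged it, and when it was remediated. Values are invented.
incidents = [
    {"raised": datetime(2019, 2, 21, 9, 0),
     "acked":  datetime(2019, 2, 21, 9, 4),
     "fixed":  datetime(2019, 2, 21, 9, 45)},
    {"raised": datetime(2019, 2, 21, 10, 0),
     "acked":  datetime(2019, 2, 21, 10, 2),
     "fixed":  datetime(2019, 2, 21, 11, 0)},
]

def mean_minutes(incidents, start, end):
    """Average gap in minutes between two timestamp fields across incidents."""
    total = sum((i[end] - i[start]).total_seconds() for i in incidents)
    return total / len(incidents) / 60

tta = mean_minutes(incidents, "raised", "acked")  # time to acknowledge
ttr = mean_minutes(incidents, "raised", "fixed")  # time to remediate
```

In practice these would be tracked as trends (and often as percentiles rather than means, so a few long-running incidents don’t mask responsiveness), but the underlying arithmetic is this simple.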

Get started

Our biggest recommendation for the SOC organization is to define the culture you want to inculcate. This will shape your team and attract the talent you want. In the coming weeks, we’ll share our philosophy on managing people, career paths, skills, and readiness, and what tools we use to enable our people to accomplish their mission. In the meantime, head over to the CISO series to learn more.

Read more from this series
