Case Study: Cloud Support Advisory

The Client: Major Social Media Company

Date: 2021

Project Lead: Tyler Fischella

Summary

In response to exponential growth and increased traffic driven by the COVID-19 pandemic, a large social media company required a dedicated project team to enhance its operational capabilities on Google Cloud. As end-user engagement surged, the company faced significant technical issues and, as a result, growing difficulty managing its cloud resources. To address these challenges, the project team focused on operationalizing and scaling support management systems at Google Cloud.

Customer Challenge

The client faced operational capacity challenges in 2021, often requiring resources at a scale that was not readily available. The impact of COVID-19 and the resulting remote work policies hindered their ability to provide comprehensive onboarding training for new hires, which had been more effective in previous years. Additionally, they sought to increase consumption of services across products without compromising operational efficiency.

On the product side, the client encountered difficulties running Google Cloud services at scale, grappling with both product and organizational challenges. High availability, latency, and cost optimization emerged as key concerns as their adoption of Google Cloud matured.

Given the large scale of their operations on Google Cloud, which included over 100 services and more than 1,000 projects, the client experienced a high volume of support cases each month. Many of these issues were related or long-standing, necessitating greater engagement from Google Cloud teams for investigation and root cause analysis.

Project Deliverables

The project team coordinated with engineers and product managers to launch several high-priority solutions, effectively advocating for the client.

  1. Assessed Current Support Model:

    • The team analyzed existing support workflows, response times, and issue resolution metrics.

    • Pain points and bottlenecks were identified from both support teams and the client.

  2. Defined New Support Objectives:

    • Clear goals were set for response times, resolution times, client satisfaction, and cost efficiency.

  3. Designed the New Support Process Framework:

    • New roles, responsibilities, and escalation paths were defined for the client’s primary workloads.

  4. Implemented New Self-Service Paths:

    • A knowledge base or FAQ repository was set up to enable self-service for common issues.

    • Automation was introduced for routine tasks, including capacity management, priority assignment, and status updates.

  5. Established Clear SLAs and SOPs:

    • Service Level Agreements (SLAs) were developed to outline expected response and resolution times.

    • Standard Operating Procedures (SOPs) were created for handling common and complex issues, ensuring consistency.

  6. Deployed Monitoring & Alerting Systems:

    • Real-time monitoring for cloud services was set up to proactively identify slot management and quota issues.

    • Alerting systems were improved to notify support teams of potential problems before the client was impacted.
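
As an illustration only, the proactive quota check described in deliverable 6 can be sketched as a simple threshold rule. This is a minimal sketch, not the team's actual tooling: the `QuotaReading` type, the `quota_alerts` function, the 80% warning threshold, and the sample project and metric names are all hypothetical, and a real deployment would read usage and limits from the cloud provider's monitoring data rather than from hard-coded values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaReading:
    """One hypothetical usage sample for a quota in a project."""
    project: str
    metric: str
    usage: float
    limit: float


def quota_alerts(readings, warn_ratio=0.8):
    """Return (project, metric, utilization) for readings at or above warn_ratio.

    Results are sorted so the most heavily used quota comes first.
    The 0.8 default is an assumed threshold, not a documented value.
    """
    alerts = []
    for r in readings:
        if r.limit <= 0:
            # Skip quotas with no meaningful limit to avoid dividing by zero.
            continue
        ratio = r.usage / r.limit
        if ratio >= warn_ratio:
            alerts.append((r.project, r.metric, round(ratio, 2)))
    return sorted(alerts, key=lambda a: a[2], reverse=True)


# Hypothetical sample data for illustration.
readings = [
    QuotaReading("project-a", "bigquery_slots", 1800, 2000),
    QuotaReading("project-b", "cpu_cores", 300, 1000),
    QuotaReading("project-c", "cpu_cores", 990, 1000),
]

for project, metric, ratio in quota_alerts(readings):
    print(f"{project}: {metric} at {ratio:.0%} of limit")
```

Flagging a quota before it is exhausted, rather than after a request fails, is what lets a support team raise a capacity request ahead of client impact.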

Engagement initiatives also included weekly onboarding sessions for new engineers, which introduced them to the available resources and to ways of collaborating with the project team.

Through program management of multiple workstreams, the team maintained alignment with customer goals via weekly status updates and addressed any potential risks or issues. Additionally, they translated implementation requirements for the engineering and product management teams, ensuring that the client’s needs were effectively met.

Impact

Black Friday/Cyber Monday (BFCM) & New Year’s Eve (NYE)

  • Identified pain points and bottlenecks from both support teams and the client.

  • Set up real-time monitoring for cloud services to proactively identify potential issues.

  • Monitored and supported two scale tests during BFCM, ensuring a successful pre-spin with no major incidents or outages.

  • Prepared for three scale tests on NYE, submitting over 60 capacity signals for critical projects and alerting product engineering teams.

  • Implemented alerting systems to notify support teams of emerging problems before client impact.

  1. New Support Management

    • Established clear goals for response times, resolution times, customer satisfaction, and cost efficiency.

    • Created key performance indicators (KPIs) aligned with business and customer needs.

    • Set up a knowledge base or FAQ repository to enable self-service for common issues.

  2. AI/ML Initiatives

    • Delivered a new integration of Tensor Processing Units (TPUs) into the Ads Ranking team’s training pipeline, which improved the operation of AI/ML workloads and made ad hoc experiments easier to run.

  3. Infrastructure, Compute, & GKE

    • Delivered multiple solutions focused on simplifying operations and management through a service mesh abstraction.

    • Provided solutions for comprehensive visibility into customer interactions and ongoing support tickets.

  4. Big Data Achievements

    • Delivered faster root cause analyses for several technical issues across BigQuery, Dataproc, and Dataflow.

    • Created Standard Operating Procedures (SOPs) for handling common and complex issues.

  5. Database & Storage Developments

    • Delivered four essential security features prior to BFCM and NYE.