Using the cloud to scale Etsy

Etsy, an online marketplace for unique, handmade, and vintage items, has
seen high growth over the last five years. Then the pandemic dramatically
changed shoppers’ habits, leading to more consumers shopping online. As a
result, the Etsy marketplace grew from 45.7 million buyers at the end of
2019 to 90.1 million buyers (97%) at the end of 2021 and from 2.5 to 5.3
million (112%) sellers in the same period.

The growth massively increased demand on the technical platform, scaling
traffic almost 3X overnight. And Etsy had significantly more customers for
whom it needed to continue delivering great experiences. To keep up with
that demand, they had to scale up infrastructure, product delivery, and
talent drastically. While the growth challenged teams, the business was never
bottlenecked. Etsy’s teams were able to deliver new and improved
functionality, and the marketplace continued to provide an excellent customer
experience. This article and the next form the story of Etsy’s scaling strategy.

Etsy’s foundational scaling work had started long before the pandemic. In
2017, Mike Fisher joined as CTO. Josh Silverman had recently joined as Etsy’s
CEO, and was establishing institutional discipline to usher in a period of
growth. Mike has a background in scaling high-growth companies, and along
with Martin Abbott wrote several books on the topic, including The Art of Scalability
and Scalability Rules.

Etsy relied on physical hardware in two data centers, presenting several
scaling challenges. With their expected growth, it was apparent that the
costs would ramp up quickly. It affected product teams’ agility as they had
to plan far in advance for capacity. In addition, the data centers were
based in one state, which represented an availability risk. It was clear
they needed to move onto the cloud quickly. After an assessment, Mike and
his team chose the Google Cloud Platform (GCP) as the cloud partner and
started to plan a program to move their many systems onto the cloud.

While the cloud migration was happening, Etsy was growing its business and
its team. Mike identified the product delivery process as being another
potential scaling bottleneck. The autonomy afforded to product teams had
caused an issue: each team was delivering in different ways. Joining a team
meant learning a new set of practices, which was problematic as Etsy was
hiring many new people. In addition, they had noticed several product
initiatives that did not pay off as expected. These indicators led leadership
to re-evaluate the effectiveness of their product planning and delivery
processes.

Strategic Principles

Mike Fisher (CTO) and Keyur Govande (Chief Architect) created the
initial cloud migration strategy with these principles:

Minimum viable product – A typical anti-pattern Etsy wanted to avoid
was rebuilding too much and prolonging the migration. Instead, they used
the lean concept of an MVP to validate as quickly and cheaply as possible
that Etsy’s systems would work in the cloud, and removed the dependency on
the data center.

Local decision making – Each team can make its own decisions for what
it owns, with oversight from a program team. Etsy’s platform was split
into a number of capabilities, such as compute, observability and ML
infra, along with domain-oriented application stacks such as search, bid
engine, and notifications. Each team ran proofs of concept to develop a
migration plan. The main marketplace application is a famously large
monolith, so it required creating a cross-team initiative to focus on it.

No changes to the developer experience – Etsy views a high-quality
developer experience as core to productivity and employee happiness. It
was important that the cloud-based systems continued to provide
capabilities that developers relied upon, such as fast feedback and
sophisticated observability.

There also was a deadline associated with existing contracts for the
data center that they were very keen to hit.

Using a partner

To accelerate their cloud migration, Etsy wanted to bring on outside
expertise to help in the adoption of new tooling and technology, such as
Terraform, Kubernetes, and Prometheus. Unlike a lot of Thoughtworks’
typical clients, Etsy didn’t have a burning platform driving their
fundamental need for the engagement. They are a digital native company
and had been using a thoroughly modern approach to software development.
Even without a single problem to focus on though, Etsy knew there was
room for improvement. So the engagement approach was to embed across the
platform organization. Thoughtworks infrastructure engineers and
technical product managers joined search infrastructure, continuous
deployment services, compute, observability and machine learning
infrastructure teams.

An incremental federated approach

The initial “lift & shift” to the cloud for the marketplace monolith was
the most difficult.
The team wanted to keep the monolith intact with minimal changes.
However, it used a LAMP stack and so would be difficult to re-platform.
They did a number of dry runs testing performance and capacity. Though
the first cut-over was unsuccessful, they were able to quickly roll
back. In typical Etsy style, the failure was celebrated and used as a
learning opportunity. The migration was eventually completed in 9 months, less time
than the full year originally planned. After the initial migration, the
monolith was then tweaked and tuned to situate better in the cloud,
adding features like autoscaling and auto-fixing bad nodes.

Meanwhile, other stacks were also being migrated. While each team
created its own journey, the teams were not completely on their own.
Etsy used a cross-team architecture advisory group to share broader
context, and to help pattern match across the company. For example, the
search stack moved onto GKE as part of the migration, which took longer than
the lift and shift operation for the monolith. Another example is the
data lake migration. Etsy had an on-prem Vertica cluster, which they
moved to BigQuery, changing everything about it in the process.

Unsurprisingly for Etsy, optimizing for the cloud didn’t stop once the
migration was complete. Each team continued to look for opportunities
to utilize the cloud to its full extent. With the help of the
architecture advisory group, they looked at things such as: how to
reduce the amount of custom code by moving to industry-standard tools,
how to improve cost efficiency and how to improve feedback loops.

Figure 1: Federated cloud migration

As an example, let’s look at the journey of two teams, observability
and ML infra:

The challenges of observing everything

Etsy is famous for measuring everything, “If it moves, we track it.”
Operational metrics – traces, metrics and logs – are used by the full
company to create value. Product managers and data analysts leverage the
data for planning and proving the predicted value of an idea. Product
teams use it to support the uptime and performance of their individual
areas of responsibility.

With Etsy’s commitment to hyper-observability, the amount of data
being analyzed isn’t small. Observability is self-service; each team
gets to decide what it wants to measure. They use 80M metric series,
covering the site and supporting infrastructure. This creates 20 TB
of logs a day.

When Etsy originally developed this strategy there weren’t a lot of
tools and services on the market that could handle their demanding
requirements. In many cases, they ended up having to build their own
tools. An example is StatsD, a stats aggregation tool, now open-sourced
and used throughout the industry. Over time the DevOps movement
exploded, and the industry caught up. A lot of innovative
observability tools such as Prometheus appeared. With the cloud
migration, Etsy could assess the market and leverage third-party tools
to reduce operational cost.
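
To give a feel for why StatsD spread so widely, here is a minimal sketch of its plain-text wire format, where each datagram has the shape `<name>:<value>|<type>` and is sent over UDP. The metric names and the helper functions are illustrative assumptions, not Etsy's code; port 8125 is the conventional StatsD default.

```python
import socket


def format_metric(name: str, value, metric_type: str = "c") -> bytes:
    """Build one StatsD datagram: <name>:<value>|<type>.

    Common types are "c" (counter), "ms" (timer) and "g" (gauge).
    """
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send a datagram over UDP. It is fire-and-forget: no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

An application would call something like `send_metric(format_metric("checkout.completed", 1))` for a counter, or `send_metric(format_metric("search.latency", 42, "ms"))` for a timer. Because UDP needs no acknowledgement, instrumenting a code path this way adds very little overhead, which fits a culture of measuring everything.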

The observability stack was the last to move over due to its complex
nature. It required a rebuild, rather than a lift and shift. The stack had
relied on large servers, whereas to use the cloud efficiently it needed to
run on many smaller servers and scale horizontally with ease. They moved large
parts of the stack onto managed services and third-party SaaS products.
An example of this was introducing Lightstep, which they could use to
outsource the tracing processing. It was still necessary to do some
amount of processing in-house to handle the unique scenarios that Etsy
relies on.

Migration to the cloud enabled a better ML platform

A big source of innovation at Etsy is the way they use machine
learning.

Etsy leverages machine learning (ML) to create personalized experiences for our
millions of buyers around the world with state-of-the-art search, ads,
and recommendations. The ML Platform team at Etsy supports our machine
learning experiments by developing and maintaining the technical
infrastructure that Etsy’s ML practitioners rely on to prototype, train,
and deploy ML models at scale.

Kyle Gallatin and Rob Miles

The move to the cloud enabled Etsy to build a new ML platform based
on managed services that both reduces operational costs and improves the
time from idea generation to production deployment.

Because their resources were in the cloud, they could now rely on
cloud capabilities. They used Dataflow for ETL and Vertex AI for
training their models. As they saw success with these tools, they made
sure to design the platform so that it was extensible to other tools. To
make it widely accessible they adopted industry-standard tools such as
TensorFlow and Kubernetes. Etsy’s productivity in developing and testing
ML leapfrogged their prior performance. As Rob and Kyle put it, “We’re
estimating a ~50% reduction in the time it takes to go from idea to live
ML experiment.”

This performance growth wasn’t without its challenges however. As the
scale of data grew, so too did the importance of high-performing code.
With low-performing code, the customer experience could be impacted, and
so the team had to produce a system which was highly optimized.
“Seemingly small inefficiencies such as non-vectorized code can result
in a massive performance degradation, and in some cases we’ve seen that
optimizing a single TensorFlow transform function can reduce the model
runtime from 200ms to 4ms.” In numeric terms, that’s a 50x improvement,
but in business terms, this is a change in performance easily perceived
by the customer.

What were the challenges of the cloud?

In the data center, Etsy had to operate its own infrastructure, and a lot
of the platform teams’ skills were in systems operation. Moving to the cloud
allowed teams to work at a higher level of abstraction, managed through
infrastructure as code. They changed their infrastructure hiring to look
for software engineering skills. This caused friction with the existing
team; some people were very excited but others were apprehensive about
the new approach.

While the cloud certainly reduced the number of things they had to
manage and allowed for simpler planning, it didn’t fully get them away
from capacity planning. The cloud services still run on servers with
CPUs and disks, and in some situations right-sizing for future
load still has to be done. Going forward, as on-demand cloud services
improve, Etsy is hopeful they can reduce this capacity planning.
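
As a rough illustration of the kind of right-sizing arithmetic that remains, the sketch below estimates how many nodes are needed for a forecast peak while keeping some headroom free on each node. The function name, the parameters and all the numbers are hypothetical; the article does not describe Etsy's actual capacity model.

```python
import math


def nodes_needed(peak_rps: float, rps_per_node: float, headroom: float = 0.25) -> int:
    """Nodes required to serve peak_rps while leaving `headroom`
    (a fraction between 0 and 1) of every node's capacity unused."""
    if rps_per_node <= 0:
        raise ValueError("rps_per_node must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_rps = rps_per_node * (1 - headroom)  # capacity we plan to actually use
    return math.ceil(peak_rps / usable_rps)


# Hypothetical figures: today's peak, and the same service after a 3X jump.
today = nodes_needed(peak_rps=9000, rps_per_node=500)
after_spike = nodes_needed(peak_rps=3 * 9000, rps_per_node=500)
```

With these made-up inputs the answer moves from 24 nodes to 72, which is the sort of jump that is a provisioning request in the cloud but a hardware purchase in a data center.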

The stress test of the pandemic

Etsy had always been data center based, which had kept them
constrained in some ways. Because they’d been so heavily invested in
their data center presence, they hadn’t been taking advantage of new
offerings cloud vendors had developed. For example, their data center
setup lacked robust APIs to manage provisioning and capacity.

When Mike Fisher came on board, Etsy began its cloud migration
journey. This set them up for success, since the
migration was basically finished at the start of the pandemic. This
manifested in a few ways. They had no capacity crunch, even though
traffic exploded 2-3X overnight, as events had increased from 1 billion
to 6 billion.

And there were specific examples of ways the cloud gave them agility
during the pandemic. For example, the cloud enabled efforts to close the
“semantic gap”, ensuring searches for “masks” surfaced cloth masks not
face masks of the cosmetic or costume variety. This was possible because
Google Cloud enabled Etsy to implement more sophisticated machine
learning and the agility to retrain algorithms in real time. Another
example was how database management changed in the move from the data
center to the cloud. Specifically, around backups, Etsy’s DR posture
improved in the cloud, since they leveraged block storage snapshotting as
a way of restoring databases. This enabled them to do fast restores and
to test them quickly and with confidence, unlike the older method,
where a restore would take several hours and was not perfectly
scalable.

Etsy performs extensive load and performance testing. They use chaos
engineering techniques, having a ‘scale day’ that stresses the systems
at max capacity. After the pandemic the increased load was no longer a
spike; it was now the daily average. The load testing architecture and
techniques needed to be just as scalable as any other system in order to
handle the growth.

Continually Improving the platform

One of Etsy’s next focus areas is to create “paved roads” for
engineers: a set of suggested approaches and machinery to reduce
friction when launching and developing services. During the initial four
years of the cloud migration, they decided to take a very federated
strategy. They took the “let 1000 flowers bloom” approach as described
by Peter Seibel in his article on engineering effectiveness at
Twitter.
The systems had never existed in the cloud before. They did not know
what the payoffs would be, and wanted to maximize the chances of
discovering value in the cloud.

As a result, some product teams are reinventing the wheel because
Etsy doesn’t have existing implementation patterns and services. Now
that they have more experience operating in the cloud, platform teams
know where the gaps are and can see where tooling is needed.

To determine if the investments are paying off, Etsy is tracking
various measures. For example, they monitor trends in SLI/SLOs related
to reliability, debuggability and availability of the systems. One other
key metric is Time to Productive – the time it takes for a new engineer
to be set up with their environments and make the first change. What
exactly that means changes by domain; for example it might be the first
website push or the first data pipeline working in the big data
platform. Something that used to take 2 hours now takes 20 minutes.

They combine these quantitative metrics with regular measurement of
engineering satisfaction, using a form of NPS survey to assess how much
engineers enjoy working in their respective engineering environments
and to give them an opportunity to point out problems and suggest improvements.
Another interesting stat is that the infrastructure has expanded to use
10x the number of nodes but only requires 2x the number of people to
manage them.
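
For readers unfamiliar with NPS, the standard Net Promoter Score is the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 to 6). The sketch below computes that standard formula; the article only says Etsy uses "a form of" NPS, so the exact variant and the sample ratings here are assumptions.

```python
def net_promoter_score(ratings):
    """Return the standard NPS, in the range -100 to 100, from 0-10 ratings."""
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)   # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)  # ratings of 0 through 6
    return 100.0 * (promoters - detractors) / len(ratings)


# Hypothetical survey responses from ten engineers.
sample = [10, 9, 8, 7, 6, 3, 10, 9, 5, 10]
score = net_promoter_score(sample)
```

In this sample there are five promoters and three detractors out of ten responses, giving a score of 20. Tracking the trend of such a score over time, alongside the free-text feedback, is usually more informative than any single reading.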

Measuring Cost and Carbon Consumption

Etsy continues to embrace measuring everything. Moving to the cloud
made it easier for teams to identify and track their operational costs
than it had been in the datacenters. Etsy built tools on top of Google
Cloud to provide dashboards which give insight into spending, in order
to help teams understand which features were causing costs to rise. The
dashboards included rich contextual information to help them make
optimization decisions, measured against their understanding of what
ideal efficiency should be.

A very important company pillar is sustainability. Etsy reports its
energy consumption in its quarterly SEC filings, and has made
commitments to reduce it. They had been measuring energy consumption in
the data center, but trying to do this in the cloud was initially more
difficult. A team at Etsy researched and created Cloud Jewels, an energy
estimation tool, which they open-sourced.

We’ve been unable to measure our progress against one of our key impact goals
for 2025 — to reduce our energy intensity by 25%. Cloud providers
generally do not disclose to customers how much energy their services
consume. To make up for this lack of data, we created a set of
conversion factors called Cloud Jewels to help us roughly convert our
cloud usage information (like Google Cloud usage data) into approximate
energy used. We’re proud that our work and methodology have been leveraged by
Google and AWS to build into their own models and tools.

— Emily Sommer (Etsy sustainability architect)
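
The conversion-factor idea in the quote above can be sketched in a few lines: multiply each category of cloud usage by an energy coefficient and sum the results. The coefficients, resource names and usage figures below are placeholder assumptions chosen for readability; they are not the published Cloud Jewels values.

```python
# Placeholder coefficients in watt-hours per unit of usage.
# These are illustrative assumptions, NOT the published Cloud Jewels factors.
WATT_HOURS_PER_UNIT = {
    "vcpu_hours": 2.0,    # Wh per vCPU-hour of compute
    "hdd_tb_hours": 1.0,  # Wh per terabyte-hour of HDD storage
    "ssd_tb_hours": 1.5,  # Wh per terabyte-hour of SSD storage
}


def estimate_energy_kwh(usage, factors=WATT_HOURS_PER_UNIT):
    """Roughly convert a usage report {resource: amount} into kWh."""
    total_wh = 0.0
    for resource, amount in usage.items():
        if resource not in factors:
            raise KeyError(f"no conversion factor for {resource!r}")
        total_wh += amount * factors[resource]
    return total_wh / 1000.0  # watt-hours to kilowatt-hours


# Hypothetical monthly usage for one team.
monthly_usage = {"vcpu_hours": 1000, "ssd_tb_hours": 200}
estimate = estimate_energy_kwh(monthly_usage)
```

The result is only as good as the coefficients, which is why the approach is described as a rough estimate. Its value is in showing trends: the same calculation run per team or per feature makes changes in energy use visible.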

These metrics have recently been added to their product dashboard,
allowing product managers and engineers to find opportunities to reduce
energy consumption and spot whether a new feature has had any effect.
Thoughtworks, which has a similar sustainability mission, also created an
open-source tool called Cloud Carbon Footprint, which was inspired
by initial research into Cloud Jewels and further developed by an
internal Thoughtworks team.

