Flash Crowd
Scaling Global E-Commerce

Istanbul

Medium-sweet: On my first visit to Avon's office in Istanbul my hosts told me this was the way to properly enjoy their coffee. They were right. It was excellent.

Avon is one of the largest beauty companies in the world, and Avon Turkey is a strong market for the company. Hundreds of thousands of women across the region run their own businesses selling Avon cosmetics, and having just taken the role of Chief Web Architect in Avon's Global IT organization, I was there to support them. They were about to get a new website.

Business leaders from across Avon, including marketing and technology, were coming together to create an ambitious new way for the Avon representatives to manage their business. The new website would enable every single one of them to place their orders online, and team leads would be able to track those orders in real time.

I listened and learned about the planned functionality as well as the planned marketing campaigns and processes.

The goals were set, and as always, the schedule was ambitious.

The General Manager took me aside and asked what I thought. Confidently, I assured him we would be able to deliver the new system as expected.

New York

Back in New York, I started to learn about the systems environment. The global websites were built using a classic 3-tier architecture: Apache web servers, IBM WebSphere Java application servers, and Oracle 11g databases. A shared physical infrastructure was used to host the websites for dozens of markets. Given the substantial amount of traffic and orders, this was fairly heavy iron. High-end physical servers that could house many CPUs and large amounts of RAM were combined with a high-end storage area network (SAN) to provide the high-speed disk storage for Oracle.

Leveraging IBM WebSphere, the e-commerce system implemented a variety of business rules to accommodate the varied markets around the world. As such, it was highly configurable, and there was a rich custom object model and database schema to power the website experience. As would be expected, it was both complex and far from error-free.

In addition to the usual smattering of functional bugs reported by users and tracked in our bug tracking system, there were other reports of website outages at varying times. These were generally reported by the local markets and then researched by members of the Global IT team.

The system had some problems - and now they were mine.

Campaigns

Avon operates on sales campaigns that range from two to four weeks in duration, depending on the market. Avon Turkey operates a three-week campaign cycle. Every three weeks, new products, promotions, and marketing plans are loaded into the Oracle database, and e-commerce orders are tracked within the same period.

Special deals come on offer at the start of each campaign; these could drive spikes in website orders that the system might not be able to handle. The team had identified some Java code flaws related to the functional bugs, and we worked in a fix, test, and release cycle coordinated with the actual campaigns.

By now we had also identified and fixed some high-cost Oracle stored procedures, but I knew that these alone would not enable us to sustain extreme order volume peaks. And I had reason to expect that such peaks were on their way.

To maximize engagement and sell-through, Avon Turkey marketing was going to put selected products on sale in the last days and hours of the next campaign. These excellent deals would be limited-time offers, and they would only be communicated at the last minute. When the emails went out to hundreds of thousands of Avon representatives, each looking to maximize their own business, the website crashed.

We had brought upon ourselves a flash crowd and our system could not handle the onrush of traffic and orders. Not only did this failure put a damper on the company's sales, but it also affected the livelihood of many independent women. The problem had to be fixed.

Analysis and Solution

In addition to poring over Java code and Oracle PL/SQL, I had been looking for ways to measure and characterize the aggregate behavior of our users. I wanted to understand, for a given moment in time, how many orders were being placed on the website. The conventional way to do this is to query the Oracle database for orders placed in a given time period. I procured access to the production Apache web server logs instead. Prolonged access to the production database would be hard to obtain, but the Apache logs were available to me whenever I wanted. I could now study the traffic of our many websites in a unique and valuable way. Analyzing the site-wide click-stream, I developed a way to generate an approximate order rate graph.
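
The kind of log analysis described above can be sketched in a few lines of Python. This is only an illustration, not the tooling actually used: it assumes the logs are in Apache's Common or Combined Log Format, and the order-submission path `/order/submit` is a hypothetical placeholder, since the real URL is not given here. The sample log lines are invented.

```python
import re
from collections import Counter
from datetime import datetime

# Leading fields of Apache Common/Combined Log Format:
# host ident user [timestamp] "METHOD path protocol" status ...
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)

# Hypothetical order-submission endpoint; the real path is an assumption.
ORDER_PATH_PREFIX = "/order/submit"


def order_attempts_per_minute(lines):
    """Count order-placement attempts in each one-minute bucket.

    Every POST to the order endpoint counts as an attempt, whatever
    its status code, so failed attempts are included in the rate.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        if match.group("method") != "POST":
            continue
        if not match.group("path").startswith(ORDER_PATH_PREFIX):
            continue
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        counts[ts.replace(second=0)] += 1  # truncate to the minute
    return counts


# Invented sample lines, for illustration only.
sample = [
    '10.0.0.1 - - [12/Mar/2012:22:58:03 +0300] "POST /order/submit HTTP/1.1" 200 512',
    '10.0.0.2 - - [12/Mar/2012:22:58:41 +0300] "POST /order/submit HTTP/1.1" 500 0',
    '10.0.0.3 - - [12/Mar/2012:22:59:07 +0300] "POST /order/submit HTTP/1.1" 200 512',
    '10.0.0.4 - - [12/Mar/2012:22:59:09 +0300] "GET /catalog HTTP/1.1" 200 2048',
]

per_minute = order_attempts_per_minute(sample)
for minute in sorted(per_minute):
    print(minute.strftime("%H:%M"), per_minute[minute])
```

Plotting the resulting per-minute counts over a campaign's final hours would give an approximate order rate graph of the kind described, without touching the production database.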

I used the Apache logs to create high-resolution graphs of the order arrival rate. This meant accurately accounting for attempts to place orders on the website in minute-by-minute increments. With data now in hand, I worked with our WebSphere administrators to develop a special configuration. We had already tried scaling up by adding more 'servers' (in quotes, because I do not mean hardware, but rather WebSphere process instances). Now we would also pre-allocate Java worker threads to be able to handle the peak arrival rate that we knew would be coming.
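
One common way to turn a peak arrival rate into a thread count is Little's Law: average concurrency equals arrival rate multiplied by average time in the system. The sketch below applies it as a rough estimate. It is not necessarily how the actual configuration was derived; the function name, the headroom factor, and every number in the example are hypothetical.

```python
import math


def threads_per_instance(peak_orders_per_minute, mean_service_seconds,
                         instances, headroom=1.5):
    """Estimate worker threads to pre-allocate on each server instance.

    Little's Law: concurrent requests = arrival rate * time in system.
    The headroom factor is an assumed safety margin for burstiness
    within a minute; it is not a measured value.
    """
    arrivals_per_second = peak_orders_per_minute / 60.0
    concurrent_requests = arrivals_per_second * mean_service_seconds
    total_threads = math.ceil(concurrent_requests * headroom)
    return math.ceil(total_threads / instances)


# Hypothetical inputs: 6000 order attempts per minute at peak,
# 2 seconds mean service time, spread over 4 process instances.
print(threads_per_instance(6000, 2.0, instances=4))  # prints 75
```

An estimate like this is only a starting point. The mean service time itself tends to rise under load, which is one reason the configuration still had to be proven in a load test.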

Testing our solution would not be easy. Setting up our load test environment with both the special configuration and the ability to replicate the flash crowd traffic was a major undertaking. The team did a great job, and we finally had what we were looking for - a runtime server configuration that we believed could handle the volume.

Success

The next campaign was going to be a big one. A blitz of new products in skin care, color, and fragrance. Marketing was pulling out all the stops. We assured them that this time, "technology would be ready."

It is 4 pm in New York when it is 11 pm in Istanbul. With the campaign and the special deals on offer for one more hour, we watched our systems as the orders flowed in. We watched server CPU utilization, JVM thread counts and memory, Oracle process and connection counts. All looked clean and stable. One more hour to go.

While I was fairly confident in the outcome, I remained focused on the monitors, watching the systems, looking for any telltale signs of trouble. And then the campaign ended; traffic levels dropped like a stone. We had succeeded.

It always feels good when you engineer a solution to a critical business problem - but what was even more gratifying in this case was knowing how many independent businesswomen were also benefiting from my work. Although it was now after 5 pm, I decided to treat myself to a coffee. Medium-sweet. Beautiful.