
Ecommerce Org – Contributions

Posted on August 20, 2023 by Deep.Kulshreshtha

I worked at a major e-commerce organization. The following are the major projects I delivered while there.

.

Cache Redesign

.

Impact

.

● Direct savings in infra costs: ~₹30 lakh / year
● Projected savings in infra costs: ~₹1 crore over the next 2 years

.

Infra

.

● Normal-time site load: 30,000 requests/second
● Sale-time load: 60,000 requests/second

.

● Normal pod throughput: ~2,000 requests/second
● Sale-time throughput: ~3,000 requests/second

.

● Pod RAM: 16 GB
● Pod CPU: 8 cores

.

● Redis: a cluster of 3 masters and 3 slaves.

.

Context

.

The client is an e-commerce website. One of its existing services, the Aggregator, fetched data from ~7 different services.

.

a. Price service (prices fluctuated frequently)
b. Inventory service (fluctuated frequently, but less than Price)
c. Promo service (what promotions are available with the product; varied every sale season)
d. Size chart service (mostly static data)
e. Cohort service, or Classification service (static, but varied based on business needs)
f. … a few others

.

Data was saved in Redis collectively for each product, e.g.

.

Key = Price service key + Inventory service key + Promo service key + ….

.

Value = Price of product + Inventory of product + Promos on product … etc.

.

This design was copied from a legacy system the organization had bought (I forget its name).

.
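To make the legacy layout concrete, here is a minimal sketch of the combined-entry cache, assuming redis-py; the key format, service names, and TTL values are illustrative, not the real ones.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection

# Illustrative per-service TTLs (seconds); price changed most often.
SERVICE_TTLS = {"price": 15, "inventory": 60, "promo": 3600, "size_chart": 86400}

def cache_product(product_id: str, per_service_data: dict) -> None:
    """Legacy design: one combined key and one combined value per product."""
    # Key = concatenation of each service's key part (one big string).
    key = "|".join(f"{svc}:{product_id}" for svc in per_service_data)
    # Value = all services' payloads bundled together.
    value = json.dumps(per_service_data)
    # TTL = minimum across services, so everything expires when price does.
    ttl = min(SERVICE_TTLS[svc] for svc in per_service_data)
    r.setex(key, ttl, value)

cache_product("SKU123", {
    "price": {"mrp": 999}, "inventory": {"qty": 4},
    "promo": {"codes": ["SALE10"]}, "size_chart": {"std": "UK"},
})
```

The single TTL is the crux of the problem described in the cons below: the whole entry lives only as long as its most volatile member.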

Cons of design

.

● The TTL was set to the minimum across all ~7 services.

.

Prices fluctuated most frequently, so ‘price’ had the smallest TTL: ~15 seconds.

.

After 15 seconds, however, all the other data expired along with the price and had to be fetched again.

.

Cohort and size chart data was also deleted and had to be fetched again. This premature expiry created unwanted load on the Cohort and Size Chart services: data that changed perhaps once a day was refetched every ~15 seconds.

.

Data was fetched repeatedly despite it NOT changing.

.

●Key creation was slow

.

Each key was heavy (a big string) and was slow to build.

.

It had to be assembled by 7 different adapters working sequentially, because NOT all services keyed on the same product property (a sketch of this follows at the end of these cons).

.

e.g. Size Chart would NOT always use a product code; it could instead use a standard size metric – Indian, UK, US, etc.

.

● Total Redis IO was high: ~50 ms.

.

Each key–value pair carried data from 7 different services, so each pair was heavy.

Each Redis IO therefore took ~50–100 ms.

.
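Before moving to the pros, here is a rough sketch of the adapter-based key building mentioned above: each adapter derived its segment from a different product property and they ran one after another. The adapter names and properties below are illustrative.

```python
class PriceKeyAdapter:
    def key_part(self, product: dict) -> str:
        return f"price:{product['product_code']}"

class SizeChartKeyAdapter:
    def key_part(self, product: dict) -> str:
        # Size Chart did not always key on product code; it could use a
        # standard size metric instead (Indian, UK, US, ...).
        return f"size:{product.get('size_standard', product['product_code'])}"

ADAPTERS = [PriceKeyAdapter(), SizeChartKeyAdapter()]  # ~7 adapters in reality

def build_combined_key(product: dict) -> str:
    # Adapters ran sequentially, each appending its segment to one big key.
    return "|".join(adapter.key_part(product) for adapter in ADAPTERS)

print(build_combined_key({"product_code": "SKU123", "size_standard": "UK"}))
```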

Pros of design

.

● Only 1 call was made to Redis for each read.

.

However, the same data was read sequentially from the 6 upstream systems and posted to Redis every “n” seconds.

.

.

Redesign

.

As part of the redesign, the following changes were made:

.

.

●Created a separate Redis key + TTL for each service.

.

When data from one service fluctuated, it did NOT impact data from the other services, e.g. price fluctuations did NOT impact size chart data.

.

We ended up NOT caching Price data. It was always fetched live.

.

Some services had TTL set to 6 hours while others had TTL in days.

.

Size chart – 24 hours

Promo – 3 hours

.

Enabling/disabling caching – control given to the business team.

TTL for each service – control given to the business team.

.
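A minimal sketch of the redesigned write path, assuming redis-py and a business-editable config; the config shape and service names are illustrative, though the TTL values mirror the examples above.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection

# Business-controlled settings: caching can be toggled and TTLs tuned
# per upstream service.
CACHE_CONFIG = {
    "price":      {"enabled": False, "ttl": 0},          # always fetched live
    "promo":      {"enabled": True,  "ttl": 3 * 3600},   # 3 hours
    "size_chart": {"enabled": True,  "ttl": 24 * 3600},  # 24 hours
}

def cache_service_data(service: str, product_id: str, data: dict) -> None:
    """Redesign: one key and one TTL per service, per product."""
    cfg = CACHE_CONFIG[service]
    if not cfg["enabled"]:
        return  # e.g. price is never cached
    r.setex(f"{service}:{product_id}", cfg["ttl"], json.dumps(data))

cache_service_data("size_chart", "SKU123", {"std": "UK"})
cache_service_data("price", "SKU123", {"mrp": 999})  # no-op: price not cached
```

With this split, a price fluctuation touches only the price path; size chart and cohort entries keep living out their own TTLs.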

Con

.

1. The number of calls required to read data increased from 1 to 7.

.

2. After the 7 calls, all the data had to be merged to produce a single output for the frontend server.

.

3. This meant the slowest response added to the overall response time (the weakest link in the chain). A sketch of this read-and-merge path follows below.

.
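For the read side, here is a sketch of the 7-call-and-merge path under the new design: one key per service, price fetched live, and the results merged into a single payload for the frontend. The helper names and the set of services are assumptions; the post does not show the actual read code.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection

CACHED_SERVICES = ["inventory", "promo", "size_chart", "cohort"]  # illustrative

def fetch_price_live(product_id: str) -> dict:
    # Price was never cached; stand-in for the live Price service call.
    return {"mrp": 999}

def read_product_view(product_id: str) -> dict:
    """Read one key per service (several calls instead of 1), then merge."""
    merged = {"price": fetch_price_live(product_id)}       # price always live
    for svc in CACHED_SERVICES:
        raw = r.get(f"{svc}:{product_id}")                  # one Redis call per service
        merged[svc] = json.loads(raw) if raw else None      # None => cache miss
    return merged

print(read_product_view("SKU123"))
```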

Pros

.

1. Traffic on static services decreased.

Only the Price service was called frequently; for the others, traffic dropped.

.

2. The impact of upstream service failures was reduced.

Data was served from the cache. (In the past, ~2–5% of requests failed due to upstream errors.)

.

3. IO duration for each Redis request dropped to ~5 ms.

This was because each read/write became smaller.

.

.

Time spent

.

1. One other mid-level engineer (7 years of experience) and I worked on the redesign for ~3 weeks.
2. Changes included:
a. Code rewrite
b. Prometheus integration for the new calls (sketched below)
c. Creating a Grafana dashboard for the Prometheus stats

.
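A minimal sketch of the kind of Prometheus instrumentation added for the new per-service calls, using the prometheus_client library; the metric names, labels, and the stubbed read are assumptions, since the post does not list the real metrics.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metrics: a counter and a latency histogram,
# labelled by the upstream service being read.
CACHE_READS = Counter(
    "aggregator_cache_reads_total", "Cache reads per upstream service",
    ["service", "result"],          # result: hit / miss
)
READ_LATENCY = Histogram(
    "aggregator_read_latency_seconds", "Read latency per upstream service",
    ["service"],
)

def instrumented_read(service: str, product_id: str):
    start = time.perf_counter()
    value = None  # stand-in for the actual per-service Redis/upstream read
    READ_LATENCY.labels(service=service).observe(time.perf_counter() - start)
    CACHE_READS.labels(service=service, result="hit" if value else "miss").inc()
    return value

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for Prometheus to scrape
    instrumented_read("size_chart", "SKU123")
```

The Grafana dashboard was then built on top of these Prometheus stats.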

.

Results

.

.

1. Throughput increased from ~1,200 requests/sec to ~2,100 requests/sec.

.

The company was moving from 24-core machines to 8-core machines.

.

On the old machines, performance was ~1,200 requests/second, so the requirement on the new machines was also ~1,200 requests/second.

.

But the new design served ~2,100 requests/second.

.

Meaning, despite the CPU being reduced by ~66% (24 cores down to 8), throughput increased by ~75% (1,200 to 2,100 requests/second).

.

2. CPU utilization fell from ~20% to ~4% most of the time.

.

3. Response time remained around 50–100 ms per request.

.

4. The service was tunable at each upstream level.

.

The business team could tune the TTL for each upstream system.

.

.

Failed Rewards Reporting

Impact

.

● Turnaround reduction: 3 weeks to 3 minutes

.

Context

.

When users bought from the website, they received rewards, e.g. 50OFFDOMINOS, 75OFFOVENSTORY, etc.

.

Rewards are of 2 types …

1. Immediate
2. Delayed (given after the order-cancellation duration is over)

.

.

Players

.

– Business team: configured sale times.
– Business team: estimated the coupons required.
– Promo team: prepared coupons in advance.
– Promo engine: validated, calculated and awarded coupons. This was our service.
– Email service: notified customers.

.

.

Problem

.

These players sometimes malfunctioned, e.g.

.

● The business team forgot to enable rewards at midnight.

So, orders placed from midnight until the issue was caught would NOT get rewarded.

.

● The business team MISCALCULATED sale volumes.

Many orders would then NOT get coupons, because the Coupon engine would have no coupons left.

.

By the time the coupon engine generated more coupons (from a third-party service), the sale would be over, and our service would be unable to disburse them.

.

●Bugs in Promo engine

Some customers would NOT get coupons.

.

● Bugs in the Notification engine (emails or app notifications would NOT get delivered)

.

.

Solution Chain

.

After a customer had NOT received a coupon, the following would typically happen.

.

● Customers would call Ajio customer care.
● Customer care would redirect the issue to the business team.
● The business team would come to us (the Dev team).
● We did NOT have access to the production environment, so we contacted the TechOps team.
● After a few days, some issues would be figured out.
● To fix the error, some action would need to be taken.
● We would ask the PII team for data; TechOps did not have access to sensitive data like phone numbers or emails.
● After the necessary exceptions were granted, the PII team would provide the data.
● The Dev team would create a script and give it to TechOps.
● TechOps would run the script (from the Dev team) with the data (from the PII team).

.

.

All of the above typically took ~3 weeks and was very tiring for the tech team.

.

.

Redesign

.

I recommended a new business process, got it approved, and started building it.

.

● Generate a Rewards report for each order.

The report would be used by the Customer Care and Business teams.

.

● They would be able to see a 360-degree view of each order and its rewards.

.

● To do the above … (the event shapes are sketched after this list)

.

○ For each reward
■ Given – Generate an event with ALL information on the reward given.
■ NOT given – Generate an event with ALL information on the failure reason.

.

○ For each email / notification
■ Success – Generate an event.
■ Fail – Generate an event.

.

○ At the PII layer … read and consolidate all events.

.

○ Create a report in Tableau (used for reporting) which allows users to see
■ User data
■ Order data
■ Reward data
■ Notification / email data

.
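A sketch of the event shapes this calls for: one event per reward (given, or not given with the failure reason) and one per email/notification attempt. The field names and the in-memory sink standing in for the real event pipeline are assumptions; the post does not name the actual transport.

```python
import json
import time
import uuid

EVENT_SINK = []  # stand-in for the real event pipeline feeding the PII layer

def emit(event: dict) -> None:
    event.update({"event_id": str(uuid.uuid4()), "emitted_at": time.time()})
    EVENT_SINK.append(json.dumps(event))

def reward_event(order_id, coupon=None, reason=None) -> None:
    """One event per reward: either given, or not given with the failure reason."""
    emit({
        "type": "REWARD_GIVEN" if coupon else "REWARD_FAILED",
        "order_id": order_id,
        "coupon": coupon,
        "failure_reason": reason,   # e.g. "coupons exhausted", "promo bug"
    })

def notification_event(order_id, channel, success) -> None:
    """One event per email / app notification attempt."""
    emit({"type": "NOTIFICATION", "order_id": order_id,
          "channel": channel, "success": success})

reward_event("ORD-1", coupon="50OFFDOMINOS")
reward_event("ORD-2", reason="coupons exhausted")
notification_event("ORD-1", "email", True)
```

Downstream, the PII layer consolidates these events per order, and the Tableau report joins in the user, order, reward, and notification data.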

.

Result

.

When an Ajio customer called, customer care would have a complete view of what had happened.

.

80% of the requests could be handled at level 1 itself; customer care could resolve or route the issue without the intervention of the Tech team.

.

The turnaround of 3 weeks reduced to 3 minutes.

.

I coded the highlighted part of the design, initiated the Tableau report creation, and then moved to my next organization.

.
