I worked at a major e-commerce organization. The following are the major projects I delivered while there.
Cache Redesign
Impact
Infra
Context
The client is an e-commerce website. One of its existing services, the Aggregator, fetched data from ~7 different upstream services.
Data was saved in Redis as one collective entry per product, e.g.
Key = Price service key + Inventory service key + Promo service key + ….
Value = Price of product + Inventory of product + Promos on product … etc.
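To make the layout concrete, here is a minimal sketch of such a collective entry, using the redis-py client. The key fragments, fields, and values are purely illustrative assumptions based on the description above; the single shared TTL is the real point.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Hypothetical per-service key fragments for one product.
# In the real system, 7 adapters produced these.
price_key = "price:SKU123"
inventory_key = "inv:SKU123"
promo_key = "promo:SKU123"

# One big composite key for the whole product...
composite_key = price_key + "|" + inventory_key + "|" + promo_key  # + 4 more fragments

# ...and one big composite value carrying data from every service.
composite_value = json.dumps({
    "price": 499.0,        # fluctuates every few seconds
    "inventory": 42,       # changes occasionally
    "promos": ["50OFF"],   # changes a few times a day
    # ...size chart, cohort, etc.
})

# A single TTL governs the whole entry. It has to match the most
# volatile field (price), so everything expires every ~15 seconds.
r.setex(composite_key, 15, composite_value)
```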
This design was copied from a legacy system the organization had acquired (I forget its name).
Cons of design
Prices fluctuated most frequently, so 'price' dictated the smallest TTL: ~15 seconds.
Because everything lived in one entry, after 15 seconds all the other data expired along with the price and had to be fetched again.
Cohort and size chart data was evicted too and had to be re-fetched. This premature expiry created unwanted load on the Cohort and Size Chart services.
Data was fetched repeatedly despite it NOT changing.
Each key was heavy (a big string) and was slow to build.
It had to be assembled by 7 different adapters working sequentially (sketched below), because NOT all services keyed on the same product property.
e.g. Size Chart did NOT always use the product code; it could also use a standard size metric – Indian, UK, US, etc.
Each key–value pair carried data from 7 different services, so each pair was heavy.
Each Redis IO therefore took ~50 ms – ~100 ms.
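As an illustration of the sequential key assembly (the adapter names and lookups below are hypothetical):

```python
def price_adapter(product):
    # Keys off the product code.
    return f"price:{product['code']}"

def size_chart_adapter(product):
    # Does NOT always key off the product code: it may key off a
    # standard size metric instead (Indian, UK, US, ...).
    metric = product.get("size_metric") or product["code"]
    return f"sizechart:{metric}"

# ...plus 5 more adapters in the real system.

ADAPTERS = [price_adapter, size_chart_adapter]

def build_composite_key(product):
    fragments = []
    for adapter in ADAPTERS:                # strictly one after another,
        fragments.append(adapter(product))  # no parallelism
    return "|".join(fragments)

print(build_composite_key({"code": "SKU123", "size_metric": "UK-9"}))
```

Because every adapter had to finish before the next one started, building one key paid the full cost of all 7 lookups in series.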
Pros of design
The design was simple: the same data was read sequentially from the 6 upstream systems and posted to Redis every "n" seconds.
Redesign
As part of the redesign, the following changes were made (a sketch follows this list):
Each service's data was cached separately, so when data from one service fluctuated, it did NOT impact data from the other services. e.g. price fluctuations did NOT impact size chart data.
We ended up NOT caching price data at all; it was always fetched live.
Some services had their TTL set to 6 hours while others had TTLs measured in days, e.g.:
Size chart – 24 hours
Promo – 3 hours
Enabling/disabling caching per service – control given to the business team.
The TTL for each service – control given to the business team.
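A minimal sketch of the split layout, again with redis-py. The key names, config shape, and fetch functions are assumptions; the real points are one entry per service, business-owned TTLs, and price bypassing the cache.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# TTLs (seconds) and caching toggles now live in business-owned
# config rather than code. TTL values are the ones mentioned above.
CACHE_CONFIG = {
    "size_chart": {"enabled": True, "ttl": 24 * 3600},
    "promo":      {"enabled": True, "ttl": 3 * 3600},
    "price":      {"enabled": False},   # never cached: always live
}

def get_field(service, product_code, fetch_live):
    cfg = CACHE_CONFIG[service]
    if not cfg["enabled"]:
        return fetch_live(product_code)     # e.g. price

    key = f"{service}:{product_code}"       # small, per-service key
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    value = fetch_live(product_code)
    # Expiry of this entry no longer evicts any other service's data.
    r.setex(key, cfg["ttl"], json.dumps(value))
    return value

# Usage: each service passes its own live-fetch function.
price = get_field("price", "SKU123", lambda code: 499.0)
```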
Con
Pros
Only the Price service was called frequently; for the others, traffic dropped.
Data kept being served from the cache even when an upstream failed. (In the past, ~2–5% of requests had failed due to upstream errors.)
Each Redis IO also became faster, because the size of each read/write became smaller.
Time spent
Results
The company was moving from 24-core machines to 8-core machines.
On the old machine, performance was ~1200 requests/second, so the requirement on the new machine was also ~1200 requests/second.
But the new design served ~2100 requests/second.
Meaning, despite CPU capacity being cut by ~66% (24 cores → 8), throughput increased by ~75% ((2100 − 1200) / 1200 = 0.75).
The business team could choose to tune the TTL for each upstream system.
Failed Rewards Reporting
Impact
Context
When users bought from the website, they received rewards, e.g. 50OFFDOMINOS, 75OFFOVENSTORY, etc.
Rewards are of 2 types …
Players
Problem
The players malfunctioned, e.g.:
So, orders placed from midnight till now would NOT get rewarded.
Many orders would NOT get their coupons, because the coupon engine would NOT have any coupons left.
By the time the coupon engine generated more coupons (from a third-party service), the sale would be over, and our service would be unable to disburse them.
So some customers would NOT get coupons.
Solution Chain
When a customer did NOT receive a coupon, the following would typically happen.
All of the above typically took ~3 weeks and was very tiring for the tech team.
Redesign
I recommended, got approval for, and started building a new business process: a report of failed reward disbursements.
The report would be used by the Customer Care and Business teams.
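The design diagram is not reproduced here, so the following is only a rough sketch of the kind of reconciliation the process implies: compare orders that qualified for a reward against coupons actually disbursed, and emit the misses as rows for the report. Every field, input, and rule below is an assumption.

```python
import csv
from datetime import datetime, timezone

# Hypothetical inputs: orders that qualified for a reward, and the
# set of orders whose coupon was actually disbursed. In reality these
# would come from the order store and the coupon engine.
orders = [
    {"order_id": "O-1", "user_id": "U-7", "reward": "50OFFDOMINOS"},
    {"order_id": "O-2", "user_id": "U-8", "reward": "75OFFOVENSTORY"},
]
disbursed = {"O-1"}

failed = [o for o in orders if o["order_id"] not in disbursed]

# One row per failed disbursement; the Tableau report (and Customer
# Care) read this to see which order missed which reward, and when.
with open("failed_rewards.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["order_id", "user_id", "reward", "detected_at"]
    )
    writer.writeheader()
    for o in failed:
        writer.writerow(
            {**o, "detected_at": datetime.now(timezone.utc).isoformat()}
        )
```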
Result
When an Ajio customer calls, Customer Care would have a complete view of what happened.
~80% of the requests could be handled at level 1 itself; Customer Care could resolve them without the intervention of the tech team.
The turnaround time would drop from ~3 weeks to ~3 minutes.
I coded the highlighted part of the design, initiated the Tableau report creation, and then moved on to my next organization.