Worked at a Crypto exchange. Following are the major tasks accomplished while there.
Incoming payments redesign.
Impact
Helped reduce Ops team from 11 to 3
Context
Crypto is an upcoming field and does NOT have sufficient support from the Government; therefore the banks.
Incoming payments (on-ramp) integration is therefore done by accepting statement files from various banks.
Problems
The previous design had been done with many design flaws. E.g.
This meant – whenever 2 payments were processed for the same user. ( different flows – UPI / bank deposit ). Sometimes the lock wasn’t taken.
This meant that the user was either double credited or double debited.
On one hand, this resulted in direct losses to the organization. On the other hand, it created a HEAVY operation cost.
Solved by changing the lock on a single mutex – userId
Among hundreds of locks, few missed unlocking a user. This made all other requests for the user wait for the expiration time of 1 min. Increasing the turnaround Time.
Solved by creating a lambda function for locking + unlocking. A method could be passed as an argument in the method.
Ensured guaranteed locking/ unlocking. Solving the issue across the codebase.
The previous version of the code was written with if/ else conditions for each bank integration. Over time the number of conditions became so many, that managing the code became too much.
The code was modularized into classes. The classes had specific logic.
Each logic was called for a specific banking partner.
Modular code reduced the management cost.
Logic was written once and triggered from multiple places.
Each new change needed an ad hoc column. This created problems later.
e.g. the same column updated twice lost its history. History was needed for business logic.
Changed logic to vertical expansion from horizontal table expansion.
e.g. instead of adding a new column called vipStatus. Design the table to have 2 columns called PropertyName and PropertyValue
Now the table can store any number of properties uniquely without another property of the same user.
This meant …
These properties were redesigned to be stored in a JSON column in the same table. This reduced both storage costs + the need for additional joins.
This created issues while debugging.
Solved by creating webhook callback tables. The tables were purged after 30 days to NOT take up lots of space.
All logs were available with success/ failure messages.
Time Spent
Result
Reliability / Cost Reduction / Performance Improvement
Impact
Reduced Cloud costs by 30% ( still going down )
Below are the projected numbers.
Context
Organization was one of the highest $$ payers to AWS for cloud costs.
At the same time, the application faced issues during the months – Feb, March and April 2023.
Improvements
Following were done to improve the application:
Indexes were either missing or NOT being used in the tables.
Logs were analyzed to find the slowest APIs. API queries were checked.
Each query was analyzed to find whether it used an appropriate index.
The following types of indexing were done.
Queries that referenced a direct = or in comparison. Hash indexes were created.
Performance |
Before Index |
After Index |
Increase in Performance
( old – new ) / new * 100 |
Max |
1799 ms |
2.5 ms |
71860 % |
P99 |
1747 ms |
1.5 ms |
116300 % |
P95 |
628 ms |
1 ms |
62700 % |
P90 |
609 ms |
0.933 ms |
65100 % |
Gin index helps with full text search.
Certain fields needed full and partial search ( some banks gave partial ids ).
Performance |
Before Index |
After Index |
Increase in Performance
( old – new ) / new * 100 |
Max |
5.3 sec |
471 ms |
10200 % |
P99 |
4.94 sec |
193 ms |
24600 % |
P95 |
4.94 sec |
97.7 ms |
49900 % |
P90 |
4.94 sec |
53.4 ms |
91500 % |
The columns in the Index were verified to be in the same sequence as that in the query. This is important to ensure the index performs optimally
Json columns with particular key searches are created as Expression indexes.
Performance |
Before Index |
After Index |
Increase in Performance
( old – new ) / new * 100 |
Max |
8.91 sec |
3.72 ms |
239400 % |
P99 |
8.76 sec |
3.38 ms |
259000 % |
P95 |
8.36 sec |
2.85 ms |
835700 % |
P90 |
8.11 sec |
1.96 ms |
413600 % |
A parent query fetches many child objects. Each child object fetches details via a query.
Such queries were batched to fetch details in one go instead of multiple ones
Such queries were batched to fetch details in one go instead of multiple ones
These were fixed to use the right data structures.
These were date range based partitions. Such tables were used in the range of last 12 months.
Data before that wasn’t needed
Result
Reduced Cloud utilization by 30% ( hence 30% costs saved )
Below are projected numbers.
Checklist Creation
Team manager gave feedback that the same issues repeated in the team e.g. missed deployment steps, partial testing deployments etc.
So I created this checklist. This helped reduce 95% issues.