Trading Org - Contributions

Worked at a Crypto exchange. Following are the major tasks accomplished while there.

Incoming payments redesign.

Impact

Helped reduce Ops team from 11 to 3

●Direct savings : ~30 lakhs / year

●Direct Processing savings : ~1 crore / year

●Indirect savings : ~ 50 lakhs/ year of dev effort was saved.

Context

Crypto is an upcoming field and does NOT have sufficient support from the Government; therefore the banks.

Incoming payments (on-ramp) integration is therefore done by accepting statement files from various banks.

Problems

The previous design had been done with many design flaws. E.g.

●Lock was taken on prefix + user.

This meant – whenever 2 payments were processed for the same user. ( different flows – UPI / bank deposit ). Sometimes the lock wasn’t taken.

This meant that the user was either double credited or double debited.

On one hand, this resulted in direct losses to the organization. On the other hand, it created a HEAVY operation cost.

●Extra credits had to be recovered manually via legal notices.

●Extra debits had to be refunded with penalties

Solved by changing the lock on a single mutex – userId

●Code bugs – locking/ unlocking

Among hundreds of locks, few missed unlocking a user. This made all other requests for the user wait for the expiration time of 1 min. Increasing the turnaround Time.

Solved by creating a lambda function for locking + unlocking. A method could be passed as an argument in the method.

Ensured guaranteed locking/ unlocking. Solving the issue across the codebase.

●If – else based code

The previous version of the code was written with if/ else conditions for each bank integration. Over time the number of conditions became so many, that managing the code became too much.

The code was modularized into classes. The classes had specific logic.

Each logic was called for a specific banking partner.

Modular code reduced the management cost.

●Each flow had its own logic

Logic was written once and triggered from multiple places.

●Design flaws

○Tables were NOT designed to allow logic expansion.

Each new change needed an ad hoc column. This created problems later.

e.g. the same column updated twice lost its history. History was needed for business logic.

Changed logic to vertical expansion from horizontal table expansion.

e.g. instead of adding a new column called vipStatus. Design the table to have 2 columns called PropertyName and PropertyValue

Now the table can store any number of properties uniquely without another property of the same user.

○Related ad hoc properties were stored in a different data structure ( tables ).

This meant …

■Additional query joins, therefore higher DB costs, and slow queries.

These properties were redesigned to be stored in a JSON column in the same table. This reduced both storage costs + the need for additional joins.

○Webhook callbacks were not persisted.

This created issues while debugging.

■Logs were difficult to parse through. Were filtered down for cost reasons … only 30% of the logs were brought to Datadog ( similar to Kibana ).

■Logs persisted only for 15 days. Cost reasons.

■Parsing through logs was time-consuming… at peak times of production issues.

Solved by creating webhook callback tables. The tables were purged after 30 days to NOT take up lots of space.

All logs were available with success/ failure messages.

Time Spent

●Me and 2 other engineers ( 6 years and 3 years experience ) respectively, worked for ~1 month.

●Change Included

○Code rewrite

○DB redesign

○New modules – purge, flows etc.

Result

●Losses reduced to ~0 from ~75 lakhs, in the previous quarter. Any losses were due to Ops team errors.

●New integration time was reduced from 14 days to ~3 days.

●Ops team size reduction ( 11 to 3 ) saved further costs to the organization.

Reliability / Cost Reduction / Performance Improvement

Impact

Reduced Cloud costs by 30% ( still going down )

Below are the projected numbers.

●Direct savings ~30 – 60 lakhs / year

●Direct Processing savings ~2 crore / year

●Indirect savings – ~ 1 crore / year of dev effort was saved.

Context

Organization was one of the highest $$ payers to AWS for cloud costs.

At the same time, the application faced issues during the months – Feb, March and April 2023.

Improvements

Following were done to improve the application:

●DB Index creations.

Indexes were either missing or NOT being used in the tables.

Logs were analyzed to find the slowest APIs. API queries were checked.

Each query was analyzed to find whether it used an appropriate index.

The following types of indexing were done.

○Hash Indexes – Simple equality comparisons.

Queries that referenced a direct = or in comparison. Hash indexes were created.

Performance	Before Index	After Index	Increase in Performance . *( old – new ) / new 100**
Max	1799 ms	2.5 ms	71860 %
P99	1747 ms	1.5 ms	116300 %
P95	628 ms	1 ms	62700 %
P90	609 ms	0.933 ms	65100 %

○GIN Index ( Generalized Inverted Index )

Gin index helps with full text search.

Certain fields needed full and partial search ( some banks gave partial ids ).

Performance	Before Index	After Index	Increase in Performance . *( old – new ) / new 100**
Max	5.3 sec	471 ms	10200 %
P99	4.94 sec	193 ms	24600 %
P95	4.94 sec	97.7 ms	49900 %
P90	4.94 sec	53.4 ms	91500 %

○B Tree Indexes

■Composite

The columns in the Index were verified to be in the same sequence as that in the query. This is important to ensure the index performs optimally

■Expression

Json columns with particular key searches are created as Expression indexes.

Performance	Before Index	After Index	Increase in Performance . *( old – new ) / new 100**
Max	8.91 sec	3.72 ms	239400 %
P99	8.76 sec	3.38 ms	259000 %
P95	8.36 sec	2.85 ms	835700 %
P90	8.11 sec	1.96 ms	413600 %

●N + 1 Query problem

A parent query fetches many child objects. Each child object fetches details via a query.

Such queries were batched to fetch details in one go instead of multiple ones

●Reads in a loop – SQL / Redis / Dynamo DB

Such queries were batched to fetch details in one go instead of multiple ones

●Poorly designed code

○High complexity Redis commands like – HGETALL ( gets all keys with a prefix ) were used.

These were fixed to use the right data structures.

●High response time from internal / external services

○Cache responses where possible e.g. Master data – currencies etc.

○Optimized APIs

●Dynamodb – insufficient read/ write capacity

○Right size was provisioned

●Archival of data

○Created a Util file – taking DB name, table name, where condition, batch size, deleted by.

○Trigger to the util did the following :

■Read data in batches

■Write it in a file

■Encrypted file using a random key

■Calculated Hash of the file

■Create an entry into the table with details of the DB, table, where condition, batch size, deleted by etc.

■Move the file to Amazon S3

■Delete the data.

●Create Partitions in the table.

These were date range based partitions. Such tables were used in the range of last 12 months.

Data before that wasn’t needed

Result

Reduced Cloud utilization by 30% ( hence 30% costs saved )

Below are projected numbers.

●Direct savings ~30 – 60 lakhs / year

●Direct Processing savings ~2 crore / year

●Indirect savings – ~ 1 crore / year of dev effort was saved.

Checklist Creation

Team manager gave feedback that the same issues repeated in the team e.g. missed deployment steps, partial testing deployments etc.

So I created this checklist. This helped reduce 95% issues.