Not long ago, I experienced a classic "Today I Fucked Up" moment while working on a data migration at work. We needed to sync data between Azure storage accounts to set up infrastructure in a new region. The source accounts were cluttered with a mix of good and stale data, the latter accumulated over the years without a proper cleanup routine. The task was daunting: over 1 billion blobs awaited migration.
Given the scale, traditional migration options like AzCopy were sidelined due to the anticipated manual effort for syncing and monitoring. Instead, I proposed leveraging Azure's "Object Replication" feature for storage accounts, which promised to replicate containers in the background with minimal manual intervention.
Cost Estimation Misstep
Before kicking off the process, and without enough caffeine in my body, I used the Azure cost calculator to estimate the migration cost. However, I mistakenly entered 10,000 instead of the correct 100,000 as the multiplier for write transactions, leading to a gross underestimate. The initial cost seemed reasonable at $1,300 for write operations, but in reality I was off by a factor of 10. Nevertheless, the data had to be either migrated or purged, so there was not much we could do about that cost.
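To make the slip concrete, here is a minimal sketch of the arithmetic. It assumes roughly one write per blob (about 1 billion writes) and uses a per-unit price that is only implied by the numbers in this post ($1,300 for a multiplier of 10,000), not an official rate. The function and constant names are made up for illustration.

```python
import math

# Write operations are entered as a multiplier of 10,000 operations.
OPS_PER_UNIT = 10_000

# Implied by this post's numbers, NOT an official price:
# $1,300 for a multiplier of 10,000 works out to 13 cents per 10,000 writes.
IMPLIED_PRICE_CENTS_PER_UNIT = 13


def units_for(operations: int) -> int:
    """Convert a raw operation count into the calculator's multiplier."""
    return math.ceil(operations / OPS_PER_UNIT)


def write_cost_usd(units: int, cents_per_unit: int = IMPLIED_PRICE_CENTS_PER_UNIT) -> float:
    """Cost in dollars for the given multiplier, computed from integer cents."""
    return units * cents_per_unit / 100


expected_writes = 1_000_000_000  # assumption: about one write per blob
correct_units = units_for(expected_writes)
entered_units = 10_000  # the value typed by mistake

print(f"correct multiplier: {correct_units:,}")
print(f"estimate as entered: ${write_cost_usd(entered_units):,.2f}")
print(f"estimate with correct multiplier: ${write_cost_usd(correct_units):,.2f}")
```

Deriving the multiplier from the raw operation count, rather than typing it by hand, is the kind of cross-check that would have caught the missing zero.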
The Migration Process
The migration began smoothly. By Friday morning, a significant portion of the data was synced, and the cost metrics seemed to align with my (flawed) calculations. By Monday, the metrics indicated that the replication was complete, and I proudly shared the news with my colleagues, pleased with the minimal effort required.
The Bitter Victory
However, the sense of accomplishment was short-lived. When I checked Azure Cost Management, my heart skipped a few beats at the staggering figure: over $40,000 spent in two days. This was my oversight. I had forgotten to account for Microsoft Defender for Storage, which also bills by transaction volume and can become exorbitantly expensive.
Damage Control
After the initial shock subsided, I promptly informed the SRE, Technology, and Finance teams of the error. I then reached out to Azure support, explaining the oversight and asking if any credit for the unintended costs could be given. Thankfully, Azure's finance department was kind enough to grant a one-time credit for the costs of Defender.
Learnings and Takeaways
This incident served as a potent reminder that anyone can make mistakes, regardless of seniority. It's vital to maintain data hygiene, prioritize data lifecycle management, and be aware of additional cost factors like Microsoft Defender for Storage.
There is a way to stop Microsoft Defender for Storage (Classic) from scanning storage accounts by using a special tag. Before undertaking substantial data transactions, temporarily disabling Defender on the target storage accounts for the duration of the migration could help prevent such a costly error.
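One way to keep add-on charges visible before a large operation is to list every service that bills per transaction and total them up front. The sketch below is only an illustration: the `PerTransactionCharge` class, the `estimate_total` helper, and both rates are hypothetical placeholders, not real Azure prices. Look up the actual rates for your region and plan before relying on it.

```python
from dataclasses import dataclass


@dataclass
class PerTransactionCharge:
    name: str
    usd_per_10k: float  # placeholder rate; replace with the real one


def estimate_total(transactions: int, charges: list[PerTransactionCharge]) -> dict[str, float]:
    """Return the cost per charge plus a 'total' entry, in dollars."""
    units = transactions / 10_000
    breakdown = {c.name: round(units * c.usd_per_10k, 2) for c in charges}
    breakdown["total"] = round(sum(breakdown.values()), 2)
    return breakdown


# Hypothetical rates, chosen only to keep the arithmetic easy to follow.
charges = [
    PerTransactionCharge("storage writes", usd_per_10k=0.10),
    PerTransactionCharge("security scanning add-on", usd_per_10k=0.05),
]

result = estimate_total(1_000_000, charges)
for name, usd in result.items():
    print(f"{name}: ${usd:,.2f}")
```

The point is less the numbers than the habit: if a service is enabled on the account and bills per transaction, it gets a line in the estimate.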
Migration Stats
The migration resulted in over 10 billion storage transactions and 15.5 TB of data movement, a testament to the scale of the operation and the importance of meticulous planning and oversight.
Conclusion
I chose to share this story to highlight that mistakes are part of the learning process. Owning up to them, sharing the experience, and implementing improvements are crucial steps. As a result, my team is now committed to developing better data management & lifecycle practices to ensure such an oversight does not happen again.
Remember, always double-check your calculations, consider all cost factors, and maintain clear communication with all relevant parties during significant operations like data migrations.
Happy coding, and may your migrations always be cost-effective and incident-free!