In many cases, hundreds of millions of records can be handled with split-second response times, especially when dealing with aggregate queries on modern hardware. But today, a considerable number of organizations are clinging to 90s ETL technology.
ETL seems to be an addiction for IT organizations because they still consider millions of records a lot of data. But having more memory than data may allow these organizations to have their cake and eat it too. And today, getting to this point is dirt-cheap.
This may be a little controversial, but I believe too many companies are falling for the myth that a transactional database with millions of records can't also handle the query load. As an IT community, let's eliminate the idea that users have to put up with a batch process that leaves them in the dark about the outcome of their system interactions until a day (or even a month) later.
Today’s technologists are fortunate to have access to powerful, yet surprisingly affordable, hardware that would have been considered “supercomputers” ten years ago. This is obviously a good thing! Here are seven reasons to break the ETL addiction.
Existing ETL tools make easy things hard. People lucky enough never to have experienced an ETL project assume there is a magic "copy the database" button: we have four databases, so let's just "copy" them to a big one every night.
While databases are getting faster and SSDs give a great boost to random access, data growth is rapidly outpacing performance gains. This trend stems from the economics of storage technology: a company can now purchase several gigabytes of storage for $1, and even laptops are shipping with terabyte drives. There is no reason to assume this trend will stop anytime soon.
Real-world data transfer speeds, especially over a network, are growing at a much slower pace. This means that creating a backup of the database will take longer and longer until it borders on the ridiculous. Moving a gigabyte over 100BaseT takes at least 80 seconds at the theoretical line rate, call it 100 seconds with protocol overhead. Moving a terabyte takes a thousand times that: roughly 100,000 seconds, or more than a day. (And that assumes the network isn't doing anything else.)
In a real-world scenario on an active network with lots of users, it could take far longer. Sure, you could get better networking, but that is a temporary Band-Aid. As data volumes continue to grow, ETL jobs will need more than 24 hours just to copy the data.
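The arithmetic is easy to sanity-check. Here's a minimal back-of-envelope sketch in Python; the 80% efficiency figure is an illustrative assumption, since real protocol overhead varies:

```python
# Back-of-envelope transfer times for a nightly ETL copy.
# Assumes a dedicated link at the stated rate; real networks with
# competing traffic will be slower still.

def transfer_hours(size_bytes: int, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move size_bytes over a link_mbps link."""
    bytes_per_sec = link_mbps * 1_000_000 / 8 * efficiency
    return size_bytes / bytes_per_sec / 3600

GB, TB = 10**9, 10**12

print(f"1 GB over 100BaseT: {transfer_hours(GB, 100) * 3600:.0f} s")
print(f"1 TB over 100BaseT: {transfer_hours(TB, 100):.1f} h (theoretical)")
print(f"1 TB over 100BaseT: {transfer_hours(TB, 100, 0.8):.1f} h (80% efficiency)")
# 1 GB over 100BaseT: 80 s
# 1 TB over 100BaseT: 22.2 h (theoretical)
# 1 TB over 100BaseT: 27.8 h (80% efficiency)
```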
ETL jobs are typically scheduled around midnight, during supposedly idle downtime. What if you have people who access the system from California after-hours? What if some of these people are in offices in different time zones?
Your midnight could well be the middle of the day in Asia, and you may have suppliers there that need access to your applications and data. Should they suffer bad performance so that the executives in New York can get their dashboard in the morning?
So you hired a DBA to set up the ETL, but DBAs have an average turnover rate of 75%. Will they still be available when the sales department adds new fields to the database for marketing automation?
Since ETL scripts are often very difficult to decipher, are you really going to invest the few days it will take someone new to figure out how the old scripts work and update them? Isn't there some "automatically add new fields" feature? The answer is no. And this assumes you have the luxury of a spare DBA; if not, it could take months to hire one, especially for a short-term project, since DBAs prefer longer assignments and the demand is there.
Social media applications have trained billions of people to expect to know what's happening as it happens. Yesterday's data is like yesterday's news. Your customers expect to be able to change their data and see the results in their dashboard immediately. Sometimes this means fixing bad data, and sometimes it is a critical operational problem that needs immediate attention.
Would you want a 911 operator to put your information into a queue and send it to responders the next day? No. The same holds true for all critical data.
If your company is like most organizations, your database now fits in the system memory of a commodity server. Meanwhile, DBAs cost over $100 an hour and won't tell you that $100 worth of memory may be a better investment than hiring them. Upgrading to such a server, and using a platform that gives users access to that now lightning-fast data, could be just what you need to keep the company happy.
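As a rough illustration, here's a minimal sketch using only Python's standard library. The sales table and its columns are hypothetical, and timings will vary by machine, but the point stands: once the data lives in RAM, an aggregate over millions of rows is interactive, not a batch job.

```python
# In-memory aggregation sketch: an entire SQLite database held in RAM.
import random
import sqlite3
import time

con = sqlite3.connect(":memory:")  # the whole database lives in memory
con.execute("CREATE TABLE sales (region INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    ((random.randrange(50), random.random() * 100) for _ in range(5_000_000)),
)

start = time.perf_counter()
top = con.execute(
    "SELECT region, SUM(amount) AS total FROM sales"
    " GROUP BY region ORDER BY total DESC LIMIT 5"
).fetchall()
print(top, f"({time.perf_counter() - start:.3f} s)")
```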
You no longer have to copy data in order to work with different sources at the same time. Data virtualization technologies available today let users enjoy interactive dashboards that consume data from various places, such as relational databases, web services, and CSV files, all with one simple user experience that simulates having the data in one place.
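Here's a minimal sketch of the idea, assuming pandas and requests are installed; the file names, URL, and column names are hypothetical stand-ins for whatever sources you actually have:

```python
# Query three live sources side by side and join them in memory,
# instead of copying everything into a warehouse overnight.
import sqlite3

import pandas as pd
import requests

# Source 1: a relational database (SQLite here, to keep the demo self-contained)
con = sqlite3.connect("orders.db")
orders = pd.read_sql("SELECT customer_id, total FROM orders", con)

# Source 2: a CSV export from another system (customer_id, name, region)
customers = pd.read_csv("customers.csv")

# Source 3: a JSON web service (region, region_name)
regions = pd.DataFrame(requests.get("https://example.com/api/regions").json())

# Join them as if they lived in one database -- no nightly copy required
report = (
    orders.merge(customers, on="customer_id")
          .merge(regions, on="region")
          .groupby("region_name")["total"]
          .sum()
)
print(report)
```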