ETL, which stands for Extract, Transform, and Load, is central to a lot of big data work. But what does that mean? Let's explain it with an example:
Lauren is a data scientist working at a university. She wants to bring together different datasets to make sure students are offered the courses that best suit their profiles. To do this, she needs to pull data from lots of places into a centralized data warehouse.
First, she needs to extract data from the original sources, which can include existing university databases as well as social media information on students gathered by web crawling.
Next, Lauren has to transform this extracted data into a form the centralized data warehouse can use. For this, she can apply a series of rules or functions to get the data into shape, for instance converting dates of birth to ages, deriving aggregated values, deduplicating records, or joining data from multiple sources, depending on what the final data warehouse needs.
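To make the transform step more concrete, here is a minimal Python sketch of the kinds of rules mentioned above: deduplicating records, converting a date of birth to an age, and joining in an aggregated value. The field names (`student_id`, `dob`, `course`), the sample records, and the `transform` function itself are hypothetical and chosen only for illustration; a real pipeline would follow whatever schema the warehouse requires.

```python
from datetime import date


def age_on(dob: date, today: date) -> int:
    """Return the age in whole years on the given day."""
    years = today.year - dob.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1
    return years


def transform(students, enrollments, today):
    """Deduplicate students, derive ages, and join in a course count."""
    # Deduplicate: keep the first record seen for each student_id.
    unique = {}
    for record in students:
        unique.setdefault(record["student_id"], record)

    # Aggregate: count enrollments per student.
    course_counts = {}
    for enrollment in enrollments:
        sid = enrollment["student_id"]
        course_counts[sid] = course_counts.get(sid, 0) + 1

    # Join the two sources on student_id and derive the age column.
    return [
        {
            "student_id": sid,
            "name": record["name"],
            "age": age_on(date.fromisoformat(record["dob"]), today),
            "course_count": course_counts.get(sid, 0),
        }
        for sid, record in unique.items()
    ]


# Hypothetical extracted data; note the duplicated record for student 1.
students = [
    {"student_id": 1, "name": "Ada", "dob": "2000-06-15"},
    {"student_id": 1, "name": "Ada", "dob": "2000-06-15"},
    {"student_id": 2, "name": "Ben", "dob": "1999-01-02"},
]
enrollments = [
    {"student_id": 1, "course": "Statistics"},
    {"student_id": 1, "course": "Databases"},
]

rows = transform(students, enrollments, today=date(2024, 6, 14))
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the transformation reproducible when the same extract is reprocessed later.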
Finally, Lauren can load this data into the data warehouse, giving her a way to gain new insight into students by mining the collected data for patterns.
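As a rough illustration of the load step, the sketch below writes already-transformed rows into a table and then runs a simple query over them. It uses an in-memory SQLite database from Python's standard library as a stand-in for the data warehouse; the table name `student_profile`, its columns, and the sample rows are assumptions made for this example, not part of Lauren's actual setup.

```python
import sqlite3


def load(conn, rows):
    """Write transformed rows into the (stand-in) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS student_profile ("
        "student_id INTEGER PRIMARY KEY, age INTEGER, course_count INTEGER)"
    )
    # INSERT OR REPLACE makes the load safe to rerun for the same students.
    conn.executemany(
        "INSERT OR REPLACE INTO student_profile (student_id, age, course_count) "
        "VALUES (:student_id, :age, :course_count)",
        rows,
    )
    conn.commit()


# Hypothetical output of the transform step.
transformed = [
    {"student_id": 1, "age": 23, "course_count": 2},
    {"student_id": 2, "age": 25, "course_count": 0},
    {"student_id": 3, "age": 23, "course_count": 4},
]

conn = sqlite3.connect(":memory:")
load(conn, transformed)

# Once loaded, the data can be mined for patterns,
# for example the average number of courses taken per age group.
pattern = conn.execute(
    "SELECT age, AVG(course_count) FROM student_profile "
    "GROUP BY age ORDER BY age"
).fetchall()
```

Keying the table on `student_id` and replacing on conflict is one possible design choice; it means a rerun of the pipeline updates existing students instead of creating duplicates.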