Andrzej Szczechla - "Data Lake vs Data Warehouse"

Data Lake vs. Data Warehouse


A few months ago I visited a big retail company. We discussed the challenges that a firm of this size has with reporting. Specifically, the situation looked like this:
  • A large number of source systems fed many reporting systems.
  • Those reporting systems produced divergent results.
  • The divergent results led to different business conclusions.
  • On top of that, the whole environment was hard to maintain and develop.
The people I talked with were deeply convinced that the firm needed to take major action in the field of reporting. As it turned out, they had already worked out a plan and only needed someone to confirm that the idea was right. The step they believed should be taken was copying all the data into one place, which they called the Data Lake. The data would be copied without any cleaning or integration and then shared with users. Consequently, all the problems would be solved.
I tried to explain that they needed a traditional Data Warehouse and that a Data Lake could only be its extension. I did not convince anyone. In return I received a message that ‘they are searching for a partner with a much more modern approach to reporting solutions than traditional Data Warehouses’ :(
So are Data Warehouses passé already? Will they soon be replaced by the modern Data Lake? And if I don’t want to end up next to the floppy disk and the punch card, should I dive head first into one of the lakes as soon as possible? ;)
To answer these questions I should first define what a Data Warehouse and a Data Lake are.

What is a Data Warehouse?
A Data Warehouse is a system that collects as much of the company’s data as possible. The data are collected in an organised way. The key element of the Warehouse is the data model. As the author of the first Data Warehouse in Poland, built for NBP, used to say: ‘the data model in a Data Warehouse is the equivalent of a house’s foundation. You can build the house without it, but will it stand for long?’
There are two major approaches to building such a data model: multidimensional (Kimball) and normalised (Inmon). In both, the key factor is that data of a particular type are integrated and stored in one area of the Warehouse. To simplify, you can say that, for example, client data are stored in one table (even if we collect them from multiple source systems) and each client appears in that table only once (although in different source systems the same client may be identified by different IDs).
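To make the "one client, one row" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the two source systems (crm and shop), the field names, and above all the matching rule (the same normalised e-mail means the same client). Real warehouses use far more careful matching; the point is only that the integrated table holds each client once and keeps a map back to the source IDs.

```python
# Hypothetical extracts from two source systems. The system names, the
# field names and the e-mail matching rule are illustrative assumptions.
crm_customers = [
    {"crm_id": "C-100", "name": "Anna Nowak", "email": "Anna.Nowak@example.com"},
    {"crm_id": "C-101", "name": "Jan Kowalski", "email": "jan.kowalski@example.com"},
]
shop_customers = [
    {"shop_id": 7, "name": "A. Nowak", "email": "anna.nowak@example.com "},
    {"shop_id": 8, "name": "Ewa Lis", "email": "ewa.lis@example.com"},
]


def normalise_email(email):
    """Trim whitespace and lower-case so the same address compares equal."""
    return email.strip().lower()


def integrate_customers(sources):
    """Build one customer table plus a map from source IDs to warehouse keys.

    `sources` is a list of (system_name, id_field, rows) tuples.
    """
    customers = {}     # warehouse key -> integrated customer row
    key_by_email = {}  # normalised e-mail -> warehouse key
    id_map = {}        # (system_name, source_id) -> warehouse key

    for system_name, id_field, rows in sources:
        for row in rows:
            email = normalise_email(row["email"])
            key = key_by_email.get(email)
            if key is None:
                # First time this client is seen: assign a new surrogate key.
                key = len(customers) + 1
                key_by_email[email] = key
                customers[key] = {"customer_key": key, "name": row["name"], "email": email}
            # Remember which warehouse key this source ID belongs to.
            id_map[(system_name, row[id_field])] = key
    return customers, id_map


customers, id_map = integrate_customers([
    ("crm", "crm_id", crm_customers),
    ("shop", "shop_id", shop_customers),
])
```

In this sketch the four source rows collapse into three clients, and both ("crm", "C-100") and ("shop", 7) point to the same warehouse key.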

What is a Data Lake?
I understand a Data Lake as a store of different kinds of data as well. However, the data are stored in the same form as in the source systems (structured and unstructured), with no or only minor transformations. A Data Lake does not have an integrated data model.
To return to the previous example: if client data live in many source systems, then the Data Lake will contain several tables with those data, each one a copy from a different source system.
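By contrast, a Data Lake in this sense could be sketched as follows. The paths, field names and records are hypothetical; the only point is that each source keeps its own copy and its own layout, so a simple question such as "how many clients do we have?" has no single obvious answer.

```python
# Hypothetical raw extracts, landed in the lake exactly as the sources
# deliver them: no shared key, no shared column names, no cleaning.
data_lake = {
    "crm/customers": [
        {"crm_id": "C-100", "full_name": "Anna Nowak"},
        {"crm_id": "C-101", "full_name": "Jan Kowalski"},
    ],
    "shop/customers": [
        {"shop_id": 7, "first": "Anna", "last": "Nowak"},
        {"shop_id": 8, "first": "Ewa", "last": "Lis"},
    ],
}


def count_rows(lake, suffix):
    """Count rows in every lake table whose path ends with `suffix`."""
    return {path: len(rows) for path, rows in lake.items() if path.endswith(suffix)}


def columns_per_table(lake):
    """List the column names each raw table uses, to show they differ."""
    return {path: sorted({col for row in rows for col in row}) for path, rows in lake.items()}


rows_per_source = count_rows(data_lake, "/customers")
naive_total = sum(rows_per_source.values())  # counts Anna Nowak twice
schemas = columns_per_table(data_lake)
```

Adding up the rows gives four "clients", although the sample describes only three people. Any user who needs the real number has to do the integration work on their own.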
A Data Lake is often equated with Big Data. At one workshop run by Microsoft, the following graphic was presented.
It clearly shows where a Big Data solution sits in relation to the Data Warehouse.
  • The first case (ETL Offload Data Prep) uses Big Data as a staging area for transformations that require a lot of computing power and whose results go to the Data Warehouse. Such an approach could perhaps be called a Data Lake, but this Data Lake only serves to support the target Data Warehouse.
  • In the next two cases Big Data is simply a different technology used for the Data Warehouse. In both cases organised and integrated data are stored there; otherwise they could not be accessed as easily as the other elements of the Warehouse.
  • The last architecture (Run data science) shows Big Data as an environment next to the Data Warehouse, serving the advanced analyses performed by data scientists. Data are loaded there without major changes. The necessary transformations and data cleaning are performed by the data scientists themselves.
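The first pattern in the list above (ETL offload) can be illustrated with a small sketch. The event records, field names and the rule for rejecting bad rows are all made up for this example; in practice this step would run on a Big Data engine rather than in plain Python. The idea is that the heavy pass over raw data happens outside the Warehouse, and only a compact, cleaned result is loaded into it.

```python
from collections import defaultdict

# Hypothetical raw sales events as they might sit in the Big Data environment.
raw_events = [
    {"store": "WAW-01", "day": "2024-01-01", "amount": "19.99"},
    {"store": "WAW-01", "day": "2024-01-01", "amount": "5.01"},
    {"store": "KRK-02", "day": "2024-01-01", "amount": "12.50"},
    {"store": "WAW-01", "day": "2024-01-02", "amount": "bad-value"},
]


def offload_transform(events):
    """Clean and aggregate raw events into rows ready for a warehouse table.

    Returns (rows, rejected), where `rejected` counts the events whose
    amount could not be parsed as a number.
    """
    totals = defaultdict(float)
    rejected = 0
    for event in events:
        try:
            amount = float(event["amount"])
        except ValueError:
            # Unparseable amounts are dropped here, before the warehouse load.
            rejected += 1
            continue
        totals[(event["store"], event["day"])] += amount
    rows = [
        {"store": store, "day": day, "revenue": round(total, 2)}
        for (store, day), total in sorted(totals.items())
    ]
    return rows, rejected


warehouse_rows, rejected = offload_transform(raw_events)
```

Here four raw events become two daily revenue rows for the Warehouse, and one malformed event is counted as rejected instead of reaching the reports.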
I understand a Data Lake as described in the last example: an unorganised data environment for sampling different types of data, verifying their quality and usefulness, and building and verifying prediction models and other advanced analyses.
If our data are meant for advanced analyses by data scientists, a Data Lake will be enough. However, if they are to be used for preparing consistent reports for the whole company, and especially for reports prepared by individual employees, then this will not be possible without a traditional Data Warehouse with integrated, coherent and cleaned data.
Of course, a traditional Data Warehouse can be built on Big Data solutions.


About the author
Andrzej Szczechla - Over 20 years in Business Intelligence. Expert in traditional data warehouses and Big Data solutions. Since 2000 he has worked on many machine learning projects focused on customer behavior prediction.