Demystifying Data Engineering

Things you should know...

Last year, I embarked on a journey to learn data analysis, and it wasn't long before I encountered the complex world of data engineering. Its intricacies felt like an impenetrable maze, especially to a beginner.

However, completing the IBM Introduction to Data Engineering course just last week gave me valuable insights, and that newfound knowledge motivated me to write this article. I wanted to share what I've learned and create a guide for others who, like me, are just starting their data journey.

This article aims to explain what data engineering is and how it differs from data science and data analysis, and to break down concepts like the types of data, where data is stored, and how it all fits together.

I'll also dive into topics like ETL/ELT processes and the differences between transactional and analytical systems. Most importantly, I'll share the key things you need to know when designing effective data systems. This article is meant to be a friendly introduction, making data engineering less mysterious for those starting their data analysis journey.

What is Data Engineering?

Data Engineering encompasses all the activities, processes, and systems involved in collecting and ingesting, integrating and storing, processing, protecting, and managing data. The main objective is to ensure that data is available, scalable, secure, and authentic, and that it is accessible to end users and applications.

Who are Data Engineers, and how are they Different from Data Scientists and Data Analysts?

In brief, Data Analysts analyze data to derive actionable insights and drive data-informed decisions for stakeholders. Data Scientists analyze data and build predictive models on top of it for forecasting, monitoring, and prediction, and to enable AI applications. A Data Engineer's role is to enable the work of these other data professionals, as well as applications and AI systems, by supplying them with the data they require.

Types of Data

Data comes in varying forms, sizes, and structures. Generally, there are three major types of data, grouped by structure:

  • Structured Data: This simply refers to data that is organized in rows and columns. Examples are SQL databases, Excel spreadsheets, and CSV files.

  • Semi-structured Data: Similar to structured data, it has some organization, but not the strict rows and columns. It often uses tags or markers to describe the data; in JSON objects, these are called key-value pairs. Examples are XML and JSON data. A JSON document could look like this:

      Character 1

      {
        "Name": "James Bond",
        "Gender": {
          "gender1": "007",
          "gender2": "ULTIMATE CHAD"
        },
        "Occupation": "Highly Classified",
        "Rank": "Top Secret",
        "Car": "2019 Aston Martin DBS Superleggera"
      }

      Character 2

      {
        "Name": "Dominic Toretto",
        "Gender": {
          "gender1": "Family",
          "gender2": "Mechanic"
        },
        "Occupation": "Family Man",
        "Rank": "Family Head",
        "Car": "1970 Dodge Charger R/T"
      }

The information regarding the characters is stored as key-value pairs in a JSON document. (A short Python sketch showing how such a document is parsed follows this list.)

  • Unstructured Data: This category is a free-for-all, encompassing a variety of data ranging from text documents, PDFs, and images to videos, audio, and more.
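
To make the key-value idea concrete, here is a minimal Python sketch that parses a JSON document like the ones above using the standard-library json module (the document content is just the illustrative example from this article):

    import json

    # A semi-structured record: the tags (keys) describe each value.
    raw = '''
    {
      "Name": "Dominic Toretto",
      "Occupation": "Family Man",
      "Car": "1970 Dodge Charger R/T"
    }
    '''

    character = json.loads(raw)  # parse the text into a Python dict
    print(character["Name"])     # -> Dominic Toretto
    print(character["Car"])      # -> 1970 Dodge Charger R/T

Because the keys travel with the data itself, a program can read the document without a predefined schema of rows and columns, which is exactly what makes it semi-structured rather than structured.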

Data Repositories

Data of varying structures needs to be stored somewhere; that's where data repositories come in. A data repository is a general term for a store of data, and it encompasses forms such as Data Warehouses, Data Lakes, Data Marts, and Big Data Stores.

  • Data Warehouse: A central repository for data from various sources relevant to the business. The data stored here has been processed and structured for its intended use, often analytics.

  • Data Lakes: The data stored here is raw and unprocessed! Think of it as where collected data is dumped as is, with no transformations done. This doesn't mean that there is no data governance or order. If mismanaged, this can turn into a data swamp, which should be avoided by all means.

  • Data Marts: This is a subsection of the Data Warehouse, containing specialized data relevant to a particular aspect of a business. For example, a database containing Employee and Payroll information is mainly relevant to the HR department.

  • Big Data Store: Big Data? What is Big Data? There isn't really an official definition, but generally speaking, big data can be characterized by the 5 Vs:

  • Volume - The size or amount of the data.

  • Velocity - The speed at which data is generated.

  • Veracity - The accuracy or authenticity of the data.

  • Value - How useful, relevant, or meaningful the data is.

  • Variety - The different forms the data comes in, and the diversity of data types (structured, semi-structured, and unstructured).

Big Data Stores are used for the storage of Big Data.

The Layers of Data Engineering

  1. Data Collection and Ingestion Layer: This layer is responsible for bringing data into the data ecosystem and fetching data from various sources by connecting to different source systems.

  2. Data Storage and Integration Layer: After data ingestion, data needs to be stored. This layer is responsible for the storage of collected data. Data integration involves taking data from various sources and arranging them into a unified format or view.

  3. Data Processing Layer: The transformation of data, data validation, and the application of business logic to the data occurs in this layer. Data modeling, normalization or denormalization, data partitioning, and Data Wrangling/Data Cleaning are some of the transformation operations that may be carried out. This layer can precede the storage layer depending on what framework is applied, ETL or ELT.

  4. User Interface and Analytics Layer: This is where Data Analysts, BI Analysts, and Data Scientists come in. End users can access the data through various querying tools, programming languages, and APIs. This layer also enables dashboarding and other BI applications such as Tableau and Power BI, as well as programmatic analysis through Jupyter Notebooks, Python libraries, etc.

  5. The Almighty Pipeline Layer!!!: This layer overlays the Collection and Ingestion Layer, the Data Processing Layer, and the Storage and Integration Layer. The ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) process is the embodiment of this layer (a minimal sketch contrasting the two patterns follows the bullets below).

  • ETL: In ETL, data is first extracted from source systems, then transformed into the desired format, and finally loaded into the target data repository. This approach is often used when data needs significant transformation before it's ready for analysis.

  • ELT: In ELT, data is first extracted from source systems and loaded into a data repository, typically a data lake or data warehouse, and the transformation occurs after loading. ELT is preferred when the target repository can handle raw data efficiently. Data lakes, folks! 😉
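
To make the difference concrete, here is a minimal, illustrative Python sketch of both patterns. The file names and columns (orders.csv with status and amount fields) are hypothetical, invented purely for this example:

    import csv
    import json

    def extract(path):
        # Pull raw records out of a source system (here, a CSV file).
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(records):
        # Apply business logic: keep completed orders, cast amounts to numbers.
        return [
            {**r, "amount": float(r["amount"])}
            for r in records
            if r["status"] == "completed"
        ]

    def load(records, path):
        # Write records into the target repository (here, a JSON file).
        with open(path, "w") as f:
            json.dump(records, f, indent=2)

    # ETL: transform BEFORE loading, so only analysis-ready data
    # lands in the target (typical for a data warehouse).
    load(transform(extract("orders.csv")), "warehouse.json")

    # ELT: load the raw data first (typical for a data lake)...
    raw = extract("orders.csv")
    load(raw, "raw_zone.json")
    # ...then transform later, when the curated view is needed.
    load(transform(raw), "curated_zone.json")

The steps themselves are identical; only their order changes, which is exactly the ETL/ELT distinction.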

Apache Airflow and Dataflow: Apache Airflow is an open-source platform used to programmatically develop, schedule, and monitor workflows. It's commonly used for orchestrating complex ETL processes.

Google Cloud Dataflow, on the other hand, is a fully managed stream and batch data processing service. It's suitable for ELT workflows in a cloud-based environment.
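
As an illustration, here is a minimal sketch of how the ETL steps above might be declared as an Airflow DAG (assuming a recent Airflow 2.x release; the dag_id and task functions are placeholders):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pulling data from the source system")

    def transform():
        print("applying business logic")

    def load():
        print("writing to the target repository")

    with DAG(
        dag_id="daily_etl",              # hypothetical pipeline name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",               # run once per day
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        # Declare the dependency chain: extract -> transform -> load.
        t_extract >> t_transform >> t_load

Airflow then takes care of scheduling, retries, and monitoring, which is what "orchestrating" a pipeline means in practice.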

Transactional Processing and Analytical Processing Systems

There are two major use cases of any data system, categorized based on how the data is processed:

  • Transactional Processing (OLTP): This is also known as Online Transactional Processing (OLTP). Think of it as the system used to handle day-to-day operations and real-time transactions. For instance, when many customers place orders simultaneously on an e-commerce website and the number of items in stock is instantly reduced with each purchase, that's OLTP at work. Similarly, in banking systems, thousands of credits and debits are initiated every minute: the sender's balance is instantly reduced, and the receiver's balance is increased. These real-time operations are managed by OLTP systems.

  • Analytical Processing Systems (OLAP): On the other hand, there's Online Analytical Processing (OLAP). This is like a record keeper for your data. OLAP systems are designed to capture historical records and support complex queries and analytical procedures. They are used when you need to dig deep into data to gain insights, make decisions, or generate reports. For example, OLAP might be used to analyze sales trends over the past year, compare product performance, or forecast future demand. It is a crucial part of the foundation upon which Data Analysis is built.

In summary, OLTP is all about handling the present, managing transactions, and ensuring that day-to-day operations run smoothly, while OLAP is about looking back at the past, analyzing historical data, and making informed decisions based on that analysis. (A small sketch contrasting the two follows below.)
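
As an illustration, here is a small, self-contained Python/SQLite sketch. The first block mimics an OLTP-style transaction (an order that atomically decrements stock), and the final query is OLAP-flavored (an aggregate over the sales history). The table and column names are made up for this example:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stock (item TEXT, qty INTEGER)")
    conn.execute("CREATE TABLE sales (item TEXT, qty INTEGER, sold_on TEXT)")
    conn.execute("INSERT INTO stock VALUES ('widget', 100)")

    # OLTP: a small real-time transaction that must succeed or fail as a unit.
    with conn:  # commits on success, rolls back on error
        conn.execute("UPDATE stock SET qty = qty - 2 WHERE item = 'widget'")
        conn.execute("INSERT INTO sales VALUES ('widget', 2, '2024-01-15')")

    # OLAP: an analytical query that aggregates over historical records.
    rows = conn.execute(
        "SELECT item, SUM(qty) AS total_sold FROM sales GROUP BY item"
    ).fetchall()
    print(rows)  # -> [('widget', 2)]

In production these workloads usually live in separate systems: the transactional writes hit an OLTP database, while the historical records are copied into a warehouse where the analytical queries run.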

Factors To Consider when Building a Data Architecture

When building a data architecture, several factors come into play. These factors can be categorized as follows:

  • Structure of the Data: Consider whether your data is structured (suitable for relational databases like SQL), semi-structured (where NoSQL databases like MongoDB might be a better fit), or unstructured (e.g., text documents, images).

  • The 5 Vs of Big Data: Evaluate your data based on the 5 Vs: Volume (amount of data), Velocity (speed of data generation), Veracity (data quality), Value (usefulness), and Variety (data types).

  • Purpose or Use Case: Define the specific objectives and use cases for your data architecture. For example, are you building it for real-time analytics, reporting, or machine learning?

  • Security and Privacy Considerations: Ensure that your architecture complies with data security and privacy regulations. Implement encryption, access controls, and audit trails as needed.

  • Scalability: Consider how your architecture can scale to accommodate future data growth and increased user demands.

  • Cost Considerations: Assess the cost implications of your chosen architecture, including infrastructure, tools, and maintenance.

Conclusion

In conclusion, data engineering is a critical discipline in the world of data. It involves collecting, storing, processing, and delivering data to support a wide range of applications and analytics. By understanding the types of data, data repositories, layers of data engineering, key factors to consider when building a data architecture, and the distinction between OLAP and OLTP, you can lay a solid foundation for effective data management and utilization.

Thank you for taking the time to read; I hope you've learned something new. I'd like to use this medium to give credit to the IBM Skills Network and recommend the IBM Introduction to Data Engineering Course or DataCamp's Understanding Data Engineering Course to gain a deeper understanding of the concepts I've discussed.

My Socials

Twitter

LinkedIn

Github
