WDIS AI-ML
20 min read
WDIS AI-ML Series: Module 2 Lesson 4: Data Collection and Data Preprocessing
Written by
Vinay Roy
Published on
3rd May 2024

We have all heard it many times: "Data is the new currency." Yet few companies invest as much in data as they do in data science, although the realization is growing that to be seen as an 'AI-first' company, one must first become a 'data-first' company. The biggest challenge is to collect, clean, organize, and deliver data in a way that makes it usable for valuable analysis by data scientists and analysts.

In this section we will give an overview of what end-to-end data processing looks like from the viewpoint of a data science project:

Topic 1: Scalable Data Architecture for Data Collection

Companies today receive all sorts of data, such as customer data, store-level data, inventory data, digital analytics data, and marketing campaign data, that remain dispersed across multiple systems and in multiple incompatible formats. The herculean challenge is how to store this data in a well-organized manner so that it can be used for descriptive, predictive, and prescriptive analytics and for building machine learning models.

While data architecture differs from one company to another based on many factors, including but not limited to how data is organized, how it is used for analytics or other purposes, and what kind of data flows from one system to another, a very high-level overview of a data system architecture may look like the following:

The architecture is broadly divided into four components:

1.1 Data Source: Data may come from many sources today, but some prominent ones where the bulk of data gets created are the following:

Websites: Companies with a digital presence, such as Netflix, get the majority of their user engagement data from how users interact with the service on laptops, TVs, and other devices.
Apps: For companies such as Reddit, mobile and web apps are a primary data source.
Offline sources: These include stores with point-of-sale data, as well as factory and supply chain data points collected by companies such as Apple, Nvidia, and Qualcomm.
Each of these data-generating systems is connected in the backend to its own database, as shown in the figure above.
Moreover, the data may be collected in varying formats such as CSV, XLS, and JSON. Since the data is so dispersed (multiple sources, multiple formats, and possibly multiple definitions of the same data), how do we combine it into a single store to make sense of it? This is where ETL comes in.

1.2 ETL: ETL stands for Extract, Transform, Load. It is a fundamental data engineering process used to move data from source systems into a data warehouse. It involves retrieving data from various sources (Extract); cleaning, manipulating, and converting it into a format suitable for analysis (Transform); and finally loading the transformed data into a target system (Load), typically a data warehouse or data lake. ETL helps businesses integrate data from multiple sources into a single, consistent view, which can be used to support a variety of decision-making processes, such as customer segmentation, product development, and financial planning. The ETL process typically involves the following steps:

ETL Pipeline

1.2.1 Extract: As we discussed, data such as customer names, email addresses, or phone numbers often resides in different databases, spreadsheets, or software systems across an organization. Extracting data involves identifying and retrieving relevant information from these sources, using tools such as database connectors and data integration tools. A few important points to keep in mind:
1.2.1.1 Scheduling and Frequency: Data extraction is typically not a one-time process, so one might need to schedule regular extractions (for example, hourly or daily) to capture new data from source systems.
1.2.1.2 Incremental vs. Full Extraction: For large datasets, extracting all data at once may not be efficient. Incremental extraction captures only the changes since the last extraction, reducing processing time and resource usage (see the sketch after this list).
1.2.1.3 Change Data Capture (CDC): This advanced technique involves capturing only the modifications made to source data since the last extraction, streamlining the process further.
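A minimal sketch of incremental extraction (point 1.2.1.2 above), assuming a hypothetical orders table with an updated_at column and a watermark (the timestamp of the last successful run) persisted between scheduled extractions; only rows changed since that watermark are pulled:

```python
import sqlite3
from datetime import datetime, timezone

def extract_incremental(conn: sqlite3.Connection, last_extracted_at: str):
    """Pull only rows changed since the previous run (incremental extraction)."""
    rows = conn.execute(
        "SELECT id, customer_id, amount, updated_at "
        "FROM orders WHERE updated_at > ?",
        (last_extracted_at,),
    ).fetchall()
    # Persist this timestamp as the watermark for the next scheduled run.
    new_watermark = datetime.now(timezone.utc).isoformat()
    return rows, new_watermark
```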

1.2.2 Transform: The extracted data may not be in a suitable format for analysis or reporting. Transformation involves cleaning, structuring, and manipulating the data to make it consistent and usable. This involves the following steps (illustrated in the sketch after this list):
1.2.2.1 Data cleansing: Removing duplicate data and correcting errors, for example removing duplicate entries that represent the same customer.
1.2.2.2 Data standardization: Converting data into a consistent format, for example converting all dates to YYYY-MM-DD.
1.2.2.3 Data aggregation: Combining data from multiple rows into a single row, for example calculating a new field such as "Total Spend" from a customer's purchase history.
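A minimal pandas sketch of these three transform steps, using hypothetical purchase data:

```python
import pandas as pd

# Hypothetical raw purchase data: duplicate rows, date strings to be
# standardized, and one row per purchase rather than per customer.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "purchase_date": ["05/01/2024", "05/01/2024", "05/02/2024", "05/03/2024"],
    "amount": [20.0, 20.0, 15.0, 30.0],
})

# 1.2.2.1 Data cleansing: drop exact duplicate entries.
clean = raw.drop_duplicates().copy()

# 1.2.2.2 Data standardization: convert dates to a consistent YYYY-MM-DD format.
clean["purchase_date"] = pd.to_datetime(clean["purchase_date"]).dt.strftime("%Y-%m-%d")

# 1.2.2.3 Data aggregation: derive a "Total Spend" field per customer.
total_spend = (
    clean.groupby("customer_id", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total_spend"})
)
print(total_spend)
```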

1.2.3 Load: The transformed data needs to be stored in a centralized location for easy access and analysis. Loading involves moving the transformed data into a data warehouse, database, or analytical tool. It's like organizing books in a library after categorizing and labeling them for easy retrieval.
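A minimal sketch of the load step, assuming pandas and SQLAlchemy; SQLite stands in for a real warehouse so the example runs locally:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical warehouse connection; a real pipeline would use the SQLAlchemy
# dialect for a warehouse such as Redshift, BigQuery, or Snowflake.
engine = create_engine("sqlite:///warehouse.db")

transformed = pd.DataFrame({"customer_id": [1, 2], "total_spend": [40.0, 45.0]})

# Load: append the transformed batch into a warehouse table.
transformed.to_sql("fact_customer_spend", engine, if_exists="append", index=False)
```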

Real-time vs. Batch ETL: Traditional ETL operates in batches, processing data periodically. However, some scenarios require near real-time data integration such as user click-stream analysis to provide movie recommendations on Netflix. Technologies like Apache Kafka or Apache Spark enable stream processing for real-time data pipelines.
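As a hedged sketch of the streaming case, the snippet below assumes the kafka-python client, a broker running at localhost:9092, and a hypothetical "clickstream" topic of JSON events; names and fields are illustrative only:

```python
import json
from kafka import KafkaConsumer  # kafka-python client (assumed installed)

# Hypothetical "clickstream" topic carrying JSON click events.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    click = message.value
    # In a real pipeline this would update a feature store or a
    # recommendation service in near real time.
    print(click.get("user_id"), click.get("title_id"))
```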

ETL vs. ELT (Extract, Load, Transform): In some data architectures, data is loaded into the target system (data lake) in its raw format before transformation. This approach can be beneficial for big data scenarios where upfront transformation might be computationally expensive. However, it requires the target system to have robust data processing capabilities.

ELT Pipeline

1.3 Data warehouse: A data warehouse is a specialized database designed to store historical data from various sources within an organization. It's optimized for complex queries and analysis, allowing business users to gain insights into trends, customer behavior, and operational performance.
Data modelers design the data warehouse schema, which defines how data is structured and organized within the data warehouse. This ensures efficient storage, retrieval, and querying of data for analysis.
Data engineers implement data quality checks and cleansing processes within the ETL pipelines to ensure the data stored in the warehouse is accurate, consistent, and usable for analysis.
A data warehouse also provides options to configure security measures that restrict access to sensitive data and ensure compliance with data privacy regulations.

The data warehouse can be used to support a variety of decision-making processes, such as:

Customer segmentation: Identifying different groups of customers based on their purchase history and demographics.
Product development: Identifying new products to develop based on customer demand.
Financial planning: Forecasting future sales and expenses.

The data warehouse has a few limitations:
One is that data warehouses are designed for structured data, typically from relational databases. As businesses see an influx of unstructured and semi-structured data generated by modern applications, social media, sensor networks, and IoT devices, there is a need to look beyond the data warehouse.
The other is that data warehouses require an upfront schema definition, meaning you need to decide how data will be structured before loading it. This can be limiting for exploratory analysis or when dealing with new, unforeseen data sources.
This is where the data lake comes in. Data lakes are designed to handle massive volumes of data in its raw, native format, regardless of structure (structured, semi-structured, or unstructured). This flexibility allows you to store all your data without an upfront schema definition, and data lakes also scale more easily and cost-effectively in the cloud.

Structured, semi-structured, and unstructured data is stored in raw format in the data lake and is later processed and stored in tabular format in the data warehouse, as we can see in the image below. The data stored in the data warehouse is used for analytics and ML models, while the semi-structured and unstructured data in the data lake is used for data science and machine learning.

Architecture: ETL for Data warehouse vs ELT for Data Lake

1.4 Analytics and Machine Learning Models: The final part of the data pipeline is the consumption of data. It is typically used both for analytics (descriptive, predictive, and prescriptive analytics, as discussed in WDIS AI-ML Series: Module 1 Lesson 5) and to create machine learning models.
As you can see, analytics and ML models depend on the quality of the data in the data lake and the data warehouse, so it is essentially garbage in, garbage out. The more we invest in data quality, the better these models become.

Topic 2: Data Preprocessing

Data preprocessing is a fundamental step in machine learning, just like cleaning and organizing the ingredients before one starts cooking. We discussed earlier, in the ETL section, that one part of the ETL pipeline is transformation, which involves cleaning, structuring, and manipulating the data to make it consistent and usable. While data engineers spend a considerable amount of time cleaning data, data scientists may still have to do their own preprocessing to prepare data for the next step, either because they are accessing raw data from the data lake or because more advanced preprocessing techniques are needed for machine learning models.

Some common preprocessing techniques are:

2.1 Handling Missing Values: Real-world data often contains missing entries. Data scientists need to decide how to handle these missing values depending on the situation. Some standard techniques are:

2.1.1 Deletion: If the missing data is minimal and unlikely to bias the model, removing rows or columns with missing values might be acceptable.
2.1.2 Imputation: Imputation means estimating the missing values based on other available data. Techniques like mean/median imputation (filling with the average/middle value) or more sophisticated methods like K-Nearest Neighbors (which we will discuss later in the course when we cover the Unsupervised Learning - Clustering machine learning model) are typically used to fill in the missing values.
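A minimal sketch of both options using pandas and scikit-learn; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": [50_000, 62_000, np.nan, 80_000],
})

# 2.1.1 Deletion: drop rows that contain any missing value.
dropped = df.dropna()

# 2.1.2 Imputation: fill with the column median ...
median_imputed = df.fillna(df.median(numeric_only=True))

# ... or estimate missing values from the nearest neighbours (KNN imputation).
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)
```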

2.2 Data Cleaning and Correction: Data cleaning involves identifying and correcting errors like typos, inconsistencies, or outliers. This might involve:
2.2.1 Formatting: Standardizing date formats, currency units, or ensuring consistent capitalization.
2.2.2 Outlier Detection: Identifying data points that fall significantly outside the expected range. These outliers might be genuine anomalies or indicate errors and require further investigation.
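For example, a simple interquartile-range (IQR) check, a common rule of thumb, flags values far outside the middle 50% of the data; the numbers below are made up for illustration:

```python
import pandas as pd

spend = pd.Series([42, 38, 45, 50, 41, 39, 400])  # 400 looks suspicious

# Flag values more than 1.5 * IQR beyond the first or third quartile.
q1, q3 = spend.quantile(0.25), spend.quantile(0.75)
iqr = q3 - q1
outliers = spend[(spend < q1 - 1.5 * iqr) | (spend > q3 + 1.5 * iqr)]
print(outliers)  # flags 400 for further investigation
```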

2.3 Advanced data transformation techniques: Data transformation involves changing the format, structure, or representation of the data to make it suitable for analysis or modeling. It focuses on preparing the raw data for further processing.
2.3.1 Encoding Categorical Features: Categorical data (like colors) can be encoded using techniques like one-hot encoding (creating separate binary features for each category) or label encoding (assigning numerical values to categories).

Color    New Feature (Red)    New Feature (Green)    New Feature (Blue)
Red      1                    0                      0
Green    0                    1                      0
Blue     0                    0                      1
Red      1                    0                      0

This is done to convert categorical variables into numerical values so that the machine learning model can use the data as it would any other numerical data.
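A minimal sketch of both encodings using pandas; the "color" data mirrors the table above:

```python
import pandas as pd

purchases = pd.DataFrame({"color": ["Red", "Green", "Blue", "Red"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(purchases["color"], prefix="color")

# Label encoding: a single integer code per category.
label_encoded = purchases["color"].astype("category").cat.codes
```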

2.3.2 Normalization or Standardization: Scaling numerical features to a common range, e.g., between 0 and 1, or to a mean of 0 and a standard deviation of 1, can improve model performance by placing all features on an equal footing.
2.3.3 Feature Scaling and Normalization: Normalization scales features to a specific range, often between 0 and 1, so that all features contribute proportionally to the model's learning process. An example of normalization is when players on the fielding side in cricket are penalized for a slow over rate as a percentage of their salary rather than as an absolute amount, so that lower-paid cricketers pay smaller penalties and higher-paid cricketers pay larger ones.
Alternatively, data can be standardized, a technique that transforms features to have a mean of 0 and a standard deviation of 1.

The example above shows how Age and Income are normalized so that they are transformed onto the same scale between 0 and 1. This technique ensures all features contribute proportionally to the learning process, regardless of their original unit or scale.
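A minimal sketch of both techniques using scikit-learn's scalers on hypothetical Age and Income columns:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"age": [22, 35, 58, 41], "income": [30_000, 72_000, 150_000, 95_000]})

# Normalization: rescale each feature to the 0-1 range.
normalized = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Standardization: rescale each feature to mean 0 and standard deviation 1.
standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
```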

2.3.4 Text preprocessing (e.g., tokenization, stemming, lemmatization): Natural language processing, which we will discuss in more detail in later modules, relies on preprocessing text so that it can be used in a model. The most common techniques are:
2.3.4.1 Tokenization: Breaking down the text into smaller, meaningful units that machines can understand and process.

Text “The sky is blue” is tokenized with a token size of 1 to break the text into words
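As a minimal illustration, word-level tokenization can be done in plain Python; production pipelines would typically use a tokenizer from a library such as NLTK or spaCy:

```python
text = "The sky is blue"

# Word-level tokenization (a token size of one word).
tokens = text.lower().split()
print(tokens)  # ['the', 'sky', 'is', 'blue']
```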

2.3.4.2 Stemming and Lemmatization: Both stemming and lemmatization aim to reduce words to their base forms in text data processing for machine learning. Stemming crudely chops off word endings, while lemmatization maps each word to its dictionary form; the examples below illustrate the key difference:
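A hedged sketch using NLTK (an assumption; the lesson does not prescribe a library), which requires the wordnet corpus to be downloaded first:

```python
# Requires the nltk package and the wordnet corpus: nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "studies", "feet"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))

# Stemming chops suffixes ("studies" -> "studi"), while lemmatization
# returns dictionary forms ("studies" -> "study", "feet" -> "foot").
```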

So that was all for this lesson. We covered data collection and data cleaning. Now we can move on to the next step - Feature Extraction.
