Where to Start in the Data Science Career?

Data Science has been growing in the technology market. Okay, but do you already know the paths you could take to start a data science career? That is the purpose of this article.

First, we will clear up your doubts about the existing professional roles and their responsibilities. Let us go.

Why is this professional being so sought after?

Over time, computing has evolved rapidly, almost exponentially. This advance has produced more robust hardware, which in turn has enabled more powerful software.

When do we generate data? We produce an enormous amount of data simply by browsing the internet and consuming web pages. Every service available on the network generates data associated with its customers. This data needs to be stored for further analysis.

Okay, why do they say data is the new gold? Because data directly influences a company’s decision-making.

We can better understand the size of this impact through examples such as process optimization, profit increase, higher engagement rates, value prediction, and the classification and characterization of behaviors.

Profiles of Data Career Professionals?

Within this context, who are the professionals working on polishing this raw diamond (data) and transforming it into useful information for decision-making?

As a data scientist, I realize that there is still some “confusion” about the boundaries of each role. For example, when searching for jobs on LinkedIn, it is common to find postings that mix the responsibilities of an analyst (or engineer) with those of a scientist. We even find positions titled analyst that actually describe the work of a scientist.

These overlapping functions and mixed job titles occur because this is a new area that is still spreading in Brazil and other countries. As the data career is still in its infancy, it is common to see this variation.

Within the data career, we can define three primary roles. Of course, there are others on the market, but if you look closely, you’ll notice that they fit into one of the three profiles that we will discuss now.

Who are data science professionals?

Data Scientist
Data Analyst
Data Engineer

Data Engineer

All right, what does a data engineer do? This professional is focused on designing, building, and maintaining data solutions. In other words, this professional is responsible for handling the data and designing a storage structure for it according to the context or demand.

Therefore, the data engineer will determine the best system to persist this data. After this step, the data can finally be analyzed by another professional. But one thing at a time. How does this happen?

ETL Process

A process directly related to the data engineer is known as ETL: Extract, Transform, and Load. Briefly, this process comprises collecting, cleaning, structuring, and maintaining data in a system.

Extract

The name of the extraction step already gives an idea of what it involves. Here the engineer works with data collection solutions. The sources can include relational or NoSQL DBMSs.

In addition, programming languages can be used to collect data from APIs, and web scraping is another option. However, one issue is worth mentioning before continuing: be careful about legal issues, and do not perform web scraping without permission.

Collecting information from pages using this technique without permission may be considered unethical. Therefore, be sure of what you are doing.
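To make the extraction step more concrete, here is a minimal sketch in Python that reads rows from a relational database. The in-memory SQLite database, the orders table, and the extract_orders function are assumptions made only for illustration; a real source system would have its own schema and connection details.

```python
import sqlite3


def extract_orders(connection, min_total):
    """Extract orders whose total is at least `min_total` as a list of dicts."""
    # sqlite3.Row lets us convert each result row into a dictionary.
    connection.row_factory = sqlite3.Row
    cursor = connection.execute(
        "SELECT id, customer, total FROM orders WHERE total >= ? ORDER BY id",
        (min_total,),
    )
    return [dict(row) for row in cursor]


# Hypothetical source system: an in-memory database standing in for a real DBMS.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)
source.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("ana", 120.0), ("bruno", 35.5), ("carla", 80.0)],
)
source.commit()

extracted = extract_orders(source, min_total=50)
print(extracted)
```

The query uses a parameter placeholder instead of string concatenation, which is a common way to avoid SQL injection when the filter value comes from outside the program.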

Transform

After the data collection stage, we are faced with a difficult task. The data transformation phase is related to data cleaning and structuring.

That is right, cleaning. This term describes the data refinement step.

Data does not arrive ready to use. In addition, the data is often collected from heterogeneous sources to compose the same context. Therefore, each dataset will have a distinct structure.

It is up to the engineer to remove redundancies and unnecessary information and fill in gaps when possible. That way, the engineer can define which structural model to use for the loading process.

In this step, if necessary, you will use a programming language to transform the data and automate the process. Importantly, data must be consistent in order to be meaningful. The value of the analysis is linked to the quality of the data.
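The cleaning tasks described above (removing redundancies, dropping unnecessary information, and filling gaps) can be sketched in plain Python. The clean_records function, the field names, and the default values are hypothetical and serve only as an illustration; in practice a library such as Pandas is often used for this.

```python
def clean_records(records, keep_fields, defaults):
    """Drop unneeded fields, fill gaps with defaults, and remove duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        row = {}
        for field in keep_fields:
            value = record.get(field)
            # Fill gaps (missing or empty values) with a default when one exists.
            if value is None or value == "":
                value = defaults.get(field)
            # Normalize text so that " Ana " and "ana" count as the same value.
            if isinstance(value, str):
                value = value.strip().lower()
            row[field] = value
        # Remove redundancies: keep only the first occurrence of each row.
        key = tuple(row[field] for field in keep_fields)
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


raw = [
    {"customer": " Ana ", "city": "Rio", "total": 120.0, "debug": "x"},
    {"customer": "ana", "city": "rio", "total": 120.0},
    {"customer": "Bruno", "city": None, "total": 35.5},
]
cleaned = clean_records(
    raw,
    keep_fields=["customer", "city", "total"],
    defaults={"city": "unknown"},
)
print(cleaned)
```

Here the second record is treated as a duplicate of the first after normalization, and the missing city of the third record is filled with the assumed default.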

Load

Now the data must be transferred to the target system in your architecture. Here we can mention data lakes, relational DBMSs, NoSQL, and other data storage/management systems. We can also use these solutions in a cloud environment as a service.

This step consists of adapting and configuring systems, in addition to a possible integration between systems. Consequently, the engineer will maintain a structure for the data, structured and persisted, for further analysis.

The engineer will often need to keep this structure running smoothly for consumption by analysts and scientists, perhaps even dealing with integration issues between systems like MySQL and Power BI.
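As an illustration of the load step, the sketch below writes already-cleaned rows into a relational table. The in-memory SQLite database, the customers table, and the load_customers function are assumptions for this example, not a prescribed design.

```python
import sqlite3


def load_customers(connection, rows):
    """Load cleaned rows into the target table and return the row count."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(name TEXT PRIMARY KEY, city TEXT, total REAL)"
    )
    # The connection context manager commits on success and rolls back on error.
    with connection:
        connection.executemany(
            "INSERT OR REPLACE INTO customers (name, city, total) "
            "VALUES (:name, :city, :total)",
            rows,
        )
    return connection.execute("SELECT COUNT(*) FROM customers").fetchone()[0]


# Hypothetical target system: an in-memory database standing in for a warehouse.
warehouse = sqlite3.connect(":memory:")
rows = [
    {"name": "ana", "city": "rio", "total": 120.0},
    {"name": "bruno", "city": "unknown", "total": 35.5},
]
loaded = load_customers(warehouse, rows)
reloaded = load_customers(warehouse, rows)  # running the load again adds no duplicates
print(loaded, reloaded)
```

Using INSERT OR REPLACE with a primary key makes the load repeatable, which is one possible way to keep the structure consistent when a job has to be rerun.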

Big Data Paradigm

With the evolution of technology, real-time applications, and large volumes of data generated on the Internet, a new paradigm has emerged: Big Data. Imagine this whole process now applied to the Big Data context.

The Big Data paradigm is based on 3 Vs: Velocity, Variety, and Volume. These are the main characteristics of Big Data.

The data engineer must be able to handle data generated at a high rate. This data is usually linked to high system demand. This type of scenario differs from “traditional” data, that is, data that follows the relational model.

NoSQL database management systems are often associated with this scenario. Among the NoSQL (read: Not Only SQL) systems, we can mention the following data models: wide-column, graph, document-oriented, and key-value, among others.

NoSQL DBMSs can be used in conjunction with relational ones to meet the access demands of companies. A real-world example is the well-known Black Friday.

For relational DBMSs to meet such overwhelming demand, investment in vertical scaling of the infrastructure is necessary. However, this leads to high costs to meet a short-lived peak in demand. Therefore, we can use a NoSQL DBMS, which is typically designed to scale horizontally, instead of spending to make a single server more robust.

NoSQL inherits several characteristics of distributed systems, such as scalability, performance, independent failures, heterogeneity, and transparency. These aspects are attractive in scenarios like Black Friday, where demand is high but short-lived.
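One way to picture horizontal scalability is to spread keys across several nodes instead of growing a single server. The sketch below is a simplified illustration of hash-based partitioning for a key-value store; the node names and key format are hypothetical, and real NoSQL systems use more elaborate schemes.

```python
import hashlib


def node_for_key(key, nodes):
    """Pick the node that stores `key` using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes):
    """Return a mapping from each node to the list of keys it stores."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for_key(key, nodes)].append(key)
    return placement


nodes = ["node-a", "node-b", "node-c"]  # hypothetical cluster
keys = [f"cart:{user_id}" for user_id in range(1000)]
placement = distribute(keys, nodes)

# Each node ends up with a share of the keys, so load is split across machines.
print({node: len(stored) for node, stored in placement.items()})
```

With this simple modulo scheme, adding a node changes where most keys live, which is why production systems tend to prefer techniques such as consistent hashing.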

Data Flow

Another responsibility of a data engineer that companies often request is knowledge of advanced software architectures, such as messaging and event-oriented systems. These systems use asynchronous communication.

Asynchronous communication is attractive in many scenarios compared to a traditional system. Let us think for a moment. Imagine that you start to notice an increase in demand for your services.

In a synchronous system, both endpoints must be online for the operation to succeed. However, the server can become overloaded, as in a denial-of-service (DoS) situation, and be unable to process requests anymore.

With their asynchronous communication, messaging and event systems allow for a better user experience. In this case, it may not be necessary to scale the infrastructure vertically.

A server that consumes messages can retrieve them at a later time. In other words, the consumer node can choose when to process information. This provides greater availability, performance, and a better user experience.

Among these systems, we can mention the main ones: RabbitMQ and Kafka.
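The idea that the consumer can process messages later can be sketched with Python's standard library. This is only an in-process analogy using queue.Queue as a stand-in for a broker; it does not use the RabbitMQ or Kafka APIs, and the message names are hypothetical.

```python
import queue
import threading


def producer(broker, orders):
    """Publish messages without waiting for any consumer to be available."""
    for order in orders:
        broker.put(order)  # returns immediately; the message waits in the queue
    broker.put(None)  # sentinel meaning "no more messages"


def consumer(broker, processed):
    """Retrieve and process messages whenever the consumer decides to run."""
    while True:
        message = broker.get()
        if message is None:
            break
        processed.append(message.upper())


broker = queue.Queue()  # stand-in for a message broker
processed = []

# The producer finishes before the consumer even starts.
producer(broker, ["order-1", "order-2", "order-3"])

worker = threading.Thread(target=consumer, args=(broker, processed))
worker.start()
worker.join()
print(processed)  # ['ORDER-1', 'ORDER-2', 'ORDER-3']
```

The producer never blocks on the consumer, which is the property that lets a system absorb a spike in demand and process the backlog afterwards.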

Want to know more about RabbitMQ?

Go to: RabbitMQ #1 – What are messaging systems?

Summing up

The engineer is responsible for the solutions and systems that manage and persist data, providing a ready structure for further analysis.

Study Topics for a Data Engineer Career

Relational and NoSQL DBMS
Python, Java
Container Orchestration and Clustering
Event-Driven Systems
Messaging Systems
ETL Process

Average salary of a data engineer

The average salary of a data engineer according to www.glassdoor.com is approximately R$7,000. However, there are variations according to the level of each professional. Senior engineers can be paid in the range of R$10,000.

Data Analyst

The data analyst’s focus is on analyzing business-oriented data. The analyst’s main objective is to show policies and the results of company metrics visually.

Therefore, this professional needs to know the context in which the company operates. That way, the analyst can formulate the most appropriate questions for the scenario.

Generally, the analyst answers questions related to directives and business rules. The data analyst must be able to “tell” the story of the data through dashboards, graphs, reports, and other visual devices.

Practical examples are associated with reporting for sales, marketing, finance, government, industry, and other market-oriented sectors.

An analyst needs to understand the data’s context, analyze it to infer information and insights, and present the findings to stakeholders. Once the answers to the questions have been found, the specific patterns within the data need to be visualized.

In most cases, the analyst will retrieve data from DBMSs. Most business scenarios use traditional relational database systems. Therefore, the analyst must know SQL.

From this data, the analyst will use the BI (Business Intelligence) approach to analyze the data. Then, with the analysis in hand, the analyst can answer questions and create reports and dashboards.
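As a small illustration of why SQL matters for the analyst, the sketch below answers a typical business question ("which region generated the most revenue?") with an aggregation query. The sales table and its values are made up for this example.

```python
import sqlite3

# Hypothetical sales data in an in-memory relational database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
db.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("south", "2023-01", 100.0),
        ("south", "2023-02", 150.0),
        ("north", "2023-01", 80.0),
        ("north", "2023-02", 60.0),
    ],
)

# Total revenue per region, highest first: the raw material for a report.
report = db.execute(
    "SELECT region, SUM(amount) AS revenue "
    "FROM sales GROUP BY region ORDER BY revenue DESC"
).fetchall()
print(report)  # [('south', 250.0), ('north', 140.0)]
```

A result like this would typically feed a chart or dashboard in a BI tool rather than being read as raw rows.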

Critical points for a data analyst

Analytical thinking to turn data into information
Understand sales and business
Know the difference between value and price
Turn data into decisions that influence the health of the company
Teamwork (data team: engineer, analyst and scientist)

Business-focused Analysis Tools

The analyst’s focus is on business analysis. Therefore, the tools used by this professional include:

SAP and SPSS
Excel, or Google Sheets
Tableau, Power BI, or Qlik
Relational DBMS – SQL

Average salary of a data analyst

Average salary based on Glassdoor.

Data Scientist

What is a data scientist really?

Data science is an interdisciplinary field where the professional will convert data into useful information and insights. The areas covered by data science are computing, mathematics, and modeling.

A data scientist uses computational techniques with models and metrics based on statistical mathematics to obtain insights and generate inferences about a context.

Exploratory and analytical thinking accompany the data scientist in their daily work. The scientist searches the data for information that explains the context in which the data is inserted.

Characterization, classification, prediction, and interpretation of the context are part of the scientist’s repertoire.

Although the scientist’s job is well defined, this professional often performs tasks assigned to other data professions. Therefore, the scientist must know the ETL process. That way, the scientist will have a macro view of the situation.

Often, a scientist will need to perform data collection, cleaning, and transformation for further analysis. This entire process can be performed using the Python programming language (or another one of your choice). In addition, scientists may have to represent their insights graphically.

Okay… what’s the difference between an analyst and a scientist?

The analyst is concerned with the business perspective and therefore needs to have business know-how. Another characteristic of this professional is the time perspective: the analyst looks at the past and understands how a company’s indicators evolved over time.

The scientist is concerned with understanding the snapshot provided by the data. Therefore, the scientist focuses on the characterization, classification, and prediction of events. Compared to the analyst, the scientist’s gaze is directed more toward the future.

Understanding a little more …

The Big Data paradigm is also present here. Often the scientist will analyze data that deviates from the traditional pattern of the relational model. Therefore, the scientist needs to be equipped with techniques for analyzing large volumes of data.

In addition to these skills, companies have been requesting knowledge of natural language processing (NLP). With the popularization of AI agents, data has taken on new importance, and the scientist needs to know how to handle it.

We have talked about techniques to analyze data with a scientist’s eye. So what are these techniques? We can separate them into three major areas within Data Science: Machine Learning, Network Data Science, and Statistics. We will talk about each of them next.

Statistical Approach

Statistics and linear algebra are part of the life of any data analysis professional. However, data can also be analyzed with a purely statistical approach. This happens a lot in academia, where statisticians and mathematicians use mathematical resources to model and explain phenomena through equations.

We can also see this scenario within theoretical computer science. For example, computing topics like cryptography are purely math-based. However, in the market, you will find that the math is applied to machine learning models and to analytics metrics with complex networks.
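To illustrate what modeling a phenomenon through an equation can look like, here is a minimal sketch of an ordinary least-squares line fit in plain Python. The fit_line function and the sample points are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that follow y = 2x + 1 exactly.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

With real, noisy data the fitted equation would only approximate the observations, and the size of the residuals becomes part of the analysis.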

Network Data Science

Network Data Science, or complex networks, uses graph theory to model, characterize, and analyze data. In this case, the data is modeled as a structure of nodes (actors) and connections (relationships).

After modeling, we can apply metrics specific to this area, such as degree centrality, PageRank, and betweenness, among others. For example, Google uses the PageRank algorithm to help rank pages in its search engine.

Unlike in the traditional relational model, here the relationships themselves are a central part of the data. This graph structure represents several real-world scenarios very well. For example, we can find the structure of complex networks in scenarios like:
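As an example of one of these metrics, degree centrality can be computed by counting each node's connections and dividing by the maximum possible number of connections. The sketch below does this in plain Python for a small, hypothetical friendship network; it assumes an undirected graph with at least two nodes.

```python
def degree_centrality(edges):
    """Return each node's degree divided by (number of nodes - 1)."""
    neighbors = {}
    for a, b in edges:
        # Undirected graph: record the connection in both directions.
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}


# Hypothetical social network: nodes are users, edges are friendships.
friendships = [
    ("ana", "bruno"),
    ("ana", "carla"),
    ("ana", "davi"),
    ("bruno", "carla"),
]
centrality = degree_centrality(friendships)
print(centrality)
```

In this network, ana is connected to everyone else and gets the maximum value of 1.0. The NetworkX library offers a degree_centrality function that applies the same normalization to its graph objects.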

Social networks (users and friendships)
Authors and publications networks
Banking networks
DNA mapping (connection between chromosomes)
Public transport networks
Airlines (cities and flight routes)

We can combine the complex networks approach with several mathematical techniques, such as correlation, eigenvalues and eigenvectors, and probability. We can then use Python libraries to create a robust analysis.

Some of these libraries are: NumPy, Pandas, and NetworkX (complex networks).

Machine Learning

Machine learning techniques, or AI, can contribute to your data analysis. However, the data analysis work performed by the data scientist often comes before the application of ML models. After all, we can have a fantastic model, but we will not get good results with poor-quality or insufficient data.

Unlike data science, machine learning is a field of computing that works with systems that can learn. These systems are trained so that they can make predictions about unknown data. Therefore, the context is provided by the training data.

In this context, there are a few approaches: AI (artificial intelligence), which is a field in its own right but provides a foundation for ML; and deep learning, with models specific to this area.

The goal of Machine Learning is the recognition and detection of patterns in the processed data. We need to pay considerable attention to the training data to ensure the quality of the results.
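To show how training data provides the context for predictions on unknown data, here is a minimal sketch of a 1-nearest-neighbor classifier in plain Python. The predict function, the feature values, and the labels are hypothetical; real projects usually rely on an ML library instead.

```python
import math


def predict(training_data, point):
    """Return the label of the training example closest to `point`."""
    _, nearest_label = min(
        training_data,
        key=lambda example: math.dist(example[0], point),
    )
    return nearest_label


# Hypothetical training data: (visits per month, items per order) -> label.
training_data = [
    ((1.0, 1.0), "low spender"),
    ((1.5, 2.0), "low spender"),
    ((8.0, 9.0), "high spender"),
    ((9.0, 8.5), "high spender"),
]

print(predict(training_data, (2.0, 1.5)))  # low spender
print(predict(training_data, (8.5, 9.5)))  # high spender
```

Because the prediction simply copies the label of the nearest training example, a mislabeled or unrepresentative training set translates directly into wrong predictions.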

In short… in any analysis the quality of the data influences the result.

Data Visualization

A data scientist must be able to express their results clearly, objectively, and reasonably. Therefore, the data scientist must possess the soft skill of storytelling.

To display the results visually, we can use libraries for plotting graphs, such as Python’s Matplotlib.

Following this topic, we can use dashboard creation tools like Power BI, Tableau, and Qlik. Using these tools, we can bring data to life through interactive dashboards.

There are several features associated with these tools. Power BI, for example, can be integrated with PowerPoint to create the presentation of the results.

Technologies used by data scientists

Python is widely used by data scientists, and the entire process can be built with it: collecting, cleaning, modeling, analyzing, and visually representing data.

However, we can mention other important tools.

Data Analysis with Python
Relational Database – SQL
NoSQL Database
Apache Spark
Jupyter Notebook or Google Colab

Skills requested by companies

This topic covers skills that go beyond the basics of a data scientist. Since you can never have too much knowledge, it is worth paying attention to the following:

Git (versioning)
Docker & Kubernetes
MLOps – deploying and maintaining ML models
AI – Artificial Intelligence
BI Tools

Among the BI tools, we can mention: Power BI, Google Data Studio, Tableau, Qlik, and Looker, among others.

Average salary of a data scientist in Brazil

Average salary based on the Glassdoor website.


Juliana Mascarenhas

Data Scientist with a Master’s degree in Computational Modeling from LNCC.
Computer Engineer
