📄️ Initial Setup
🗃️ Workspace
**Workspace** is a dedicated working environment for users within the **Data Platform** system. Its primary purpose is to provide an isolated and secure space where users can efficiently and conveniently perform data-related operations and workflows.
🗃️ CDC Service
**CDC Service** provides a real-time Change Data Capture (CDC) platform for monitoring database changes. It lets users define connectors that move data between **Kafka** and a variety of database systems, in either direction.
🗃️ Apache Superset
**Apache Superset** is an open-source Business Intelligence (BI) platform that enables users to visualize data, create interactive dashboards, and perform data analysis with ease. It serves as a powerful alternative to **Tableau** and **Power BI**, especially in big data ecosystems leveraging platforms such as **Druid**, **Presto**, **Trino**, **BigQuery**, **ClickHouse**, **MySQL**, **PostgreSQL**, and many others.
🗃️ JupyterHub
**JupyterHub** is an open-source platform designed to provide a multi-user **Jupyter Notebook** environment, enabling data scientists, data engineers, and software developers to access computational resources for data analysis, data processing, and machine learning model development. When integrated into the **Cloud Data Platform**, JupyterHub becomes a core component that allows management, scaling, and optimization of resources across cloud services, thereby supporting large-scale data storage and processing workflows.
🗃️ Ranger
**FPT Data Governance**, powered by **Ranger**, is a security and access control solution designed for **Lakehouse** environments using the **Trino** query engine. It provides centralized and fine-grained access management, supporting both **Role-Based Access Control (RBAC)** and **Attribute-Based Access Control (ABAC)** models.
🗃️ Hive Metastore
**Hive Metastore** is a core component for metadata storage within a **Lakehouse** architecture. It provides information about tables, schemas, partitions, and data locations, enabling engines such as **Apache Spark**, **Trino**, and **Presto** to efficiently understand, manage, and access data.
🗃️ Query Engine
**FPT Query Engine**, powered by **Trino**, is an open-source distributed SQL query engine designed to deliver fast and efficient querying across large-scale datasets. Trino enables users to query data from multiple sources, including relational databases, data warehouses, and non-relational storage systems, without the need to move or duplicate data.
🗃️ Nessie
**Nessie** is a data lake catalog that applies Git-like version control (branches, tags, and commits) to data. It is designed to support large-scale and complex distributed data environments, enabling data teams to manage data development, version control, and deployment processes more effectively across the system.
🗃️ Flink
**Apache Flink** is an open-source distributed data processing framework designed primarily for real-time stream processing. It also supports batch processing, but it is best known for handling continuous data streams with low latency. **Flink** scales flexibly, supports stateful processing, and ensures data consistency, making it a leading choice for **Big Data Analytics**, Machine Learning, IoT, financial systems, and system monitoring applications.
🗃️ Orchestration
The **Orchestration service** manages and automates workflows within a data system. It ensures that data processing tasks run sequentially or in parallel according to schedules or events, and it provides monitoring and troubleshooting capabilities.
🗃️ Ingestion Service
The **Ingestion service** automates data flows between systems. It manages and orchestrates the movement of data from one system to another, and it provides monitoring and management of those data flows.
🗃️ Processing Service
The **Processing Service** runs on the **Data Platform** and provides batch and real-time data processing on user-configured compute resources. It supports both CPU and GPU environments, so high-performance data processing tasks can be executed flexibly in a distributed manner.
🗃️ Open Metadata
**Open Metadata** is a platform for managing and automating metadata within a data system. It centralizes the collection, organization, and governance of information about data objects from multiple sources. The platform supports data tracking, classification, lineage tracing, and change alerting, which improves operational efficiency and helps ensure data quality across the organization.