Purnama Academy 0838-0838-0001, Training Syllabus:
STREAMSETS DATA COLLECTOR (SDC)
Level of Knowledge: StreamSets Fundamentals (Basic to Intermediate)
Duration: 5 Days (09.00 – 16.00)
Class Method: Offline Class Only
Prerequisites:
- Fundamental knowledge of Ubuntu OS & LDAP
- Basic knowledge of Java / Python
- Basic knowledge of Vagrant / Docker
DESCRIPTION:
What is StreamSets?
StreamSets is a system for creating, executing, and operating continuous dataflows that connect the various parts of your data infrastructure. It comprises two complementary products: StreamSets Data Collector (SDC) and StreamSets Dataflow Performance Manager (DPM).
StreamSets Data Collector (SDC)
SDC is the workhorse of the system: it implements your data plane, i.e. the actual physical movement of data from one place to another. It provides a pipeline authoring environment for building any-to-any data movement pipelines, either through a drag-and-drop graphical interface or programmatically in Python or Java. Pipelines can work with minimal or no schema/structure specification and can filter, decorate, or transform data as it flows through.
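Because pipelines can also be authored and controlled programmatically, a small illustration is useful here. The sketch below drives SDC from Python through its REST API; the host and port (18630 is the default UI/API port), the default admin/admin credentials, the endpoint paths, and the pipeline id are all assumptions for illustration, so check them against the REST API reference of your SDC version.

# Minimal sketch: controlling SDC via its REST API with the `requests` library.
# Assumptions: SDC at localhost:18630, default admin/admin credentials, and the
# /rest/v1 endpoint paths below -- verify against your SDC version's API docs.
import requests

SDC_URL = "http://localhost:18630"
AUTH = ("admin", "admin")              # default credentials; change in production
HEADERS = {"X-Requested-By": "sdc"}    # SDC requires this header on mutating calls

# List the pipelines registered on this Data Collector.
resp = requests.get(SDC_URL + "/rest/v1/pipelines", auth=AUTH, headers=HEADERS)
resp.raise_for_status()
for info in resp.json():
    # Field names vary slightly across SDC versions ("pipelineId" vs "name").
    print(info.get("pipelineId") or info.get("name"), "-", info.get("title"))

# Start one pipeline by id (a hypothetical id, for illustration only).
pipeline_id = "MyFirstPipeline"
resp = requests.post(SDC_URL + "/rest/v1/pipeline/" + pipeline_id + "/start",
                     auth=AUTH, headers=HEADERS)
resp.raise_for_status()
print("start requested, state:", resp.json().get("status"))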
Pipelines can run in standalone mode, cluster streaming mode, or cluster batch mode. The SDC instance that runs them can be installed on freestanding dedicated nodes or on edge/gateway/cluster nodes alike; all that is needed is that SDC has direct access to the data sources and destinations it operates on, and sufficient resources to run the dataflow.
OVERVIEW
INSTALLATION
Installation
Full Installation and Launch (Manual Start)
Full Installation and Launch (Service Start)
Core Installation
Install Additional Stage Libraries
Run Data Collector from Docker
Installation with Cloudera Manager
MapR Prerequisites
Creating Another Data Collector Instance
Uninstallation
CONFIGURATION
User Authentication
Roles and Permissions
Data Collector Configuration
Data Collector Environment Configuration
Install External Libraries
Custom Stage Libraries
Accessing HashiCorp Vault Secrets
Enabling External JMX Tools
PIPELINE CONCEPTS AND DESIGN
What is a Pipeline?
Data in Motion
Single and Multithreaded Pipelines
Delivery Guarantee
Designing the Data Flow
Branching Streams
Merging Streams
Dropping Unwanted Records
Required Fields
Preconditions
Error Record Handling
Pipeline Error Record Handling
Stage Error Record Handling
Example
Record Header Attributes
Working with Header Attributes
Viewing Attributes in Data Preview
Header Attribute-Generating Stages
Record Header Attributes for Record-Based Writes
Field Attributes
Field Attribute-Generating Stages
Processing Changed Data
CRUD Operation Header Attribute
CDC-Enabled Origins
CRUD-Enabled Stages
Processing the Record
Use Cases
Delimited Data Root Field Type
Protobuf Data Format Prerequisites
SDC Record Data Format
Text Data Format with Custom Delimiters
Processing XML Data with Custom Delimiters
Whole File Data Format
Basic Pipeline
Whole File Records
Additional Processors
Defining the Transfer Rate
Writing Whole Files
XML Data Format and Data Processing
Creating Multiple Records with an XML Element
Creating Multiple Records with an XPath Expression
Including Field XPaths and Namespaces
XML Attributes and Namespace Declarations
Parsed XML
Control Character Removal
Development Stages
PIPELINE CONFIGURATION
Data Collector Console - Edit Mode
Retrying the Pipeline
Pipeline Memory
Rate Limit
Runtime Values
Using Runtime Parameters
Using Runtime Properties
Using Runtime Resources
Webhooks
Request Method
Payload and Parameters
Examples
Notifications
SSL/TLS Configuration
Keystore and Truststore Configuration
Transport Protocols
Cipher Suites
Implicit and Explicit Validation
Expression Configuration
Basic Syntax
Using Field Names in Expressions
Referencing Field Names and Field Paths
Expression Completion in Properties
Data Type Coercion
Configuring a Pipeline
ORIGINS
Elasticsearch
Hadoop FS
HTTP Client
HTTP Server (see the example after this list)
HTTP to Kafka
JDBC Multitable Consumer
JDBC Query Consumer
Kafka Consumer
MySQL Binary Log
SFTP/FTP Client
UDP Source
UDP to Kafka
WebSocket Server
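Of the origins above, the HTTP Server origin is easy to exercise with the Python knowledge listed in the prerequisites: it listens on a configured port and accepts records over HTTP. The sketch below pushes newline-delimited JSON at it; the port (8000) and application id ("myAppId") are illustrative values you would set in the origin's configuration, and the header name is an assumption to verify against your SDC version's docs.

# Minimal sketch: feeding JSON records to an SDC HTTP Server origin.
# Assumptions (illustrative): the origin listens on port 8000, is configured
# with application id "myAppId", and expects that id in a request header.
import json
import requests

records = [{"id": 1, "status": "new"}, {"id": 2, "status": "open"}]

resp = requests.post(
    "http://localhost:8000",
    headers={
        "X-SDC-APPLICATION-ID": "myAppId",  # must match the origin's configured id
        "Content-Type": "application/json",
    },
    data="\n".join(json.dumps(r) for r in records),  # one JSON object per line
)
resp.raise_for_status()
print("accepted with HTTP", resp.status_code)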
PROCESSORS
Processors
Base64 Field Decoder
Base64 Field Encoder
Expression Evaluator
Field Flattener
Field Hasher
Field Masker
Field Merger
Field Order
Field Pivoter
Field Remover
Field Renamer
Field Splitter
Field Type Converter
Field Zip
Geo IP
Groovy Evaluator
HBase Lookup
Hive Metadata
HTTP Client
JavaScript Evaluator
JDBC Lookup
JDBC Tee
JSON Parser
Jython Evaluator (see the sketch after this list)
Log Parser
Record Deduplicator
Spark Evaluator
Static Lookup
Stream Selector
Value Replacer
XML Flattener
XML Parser
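The scripting processors in this list (the Groovy, JavaScript, and Jython Evaluators) run a user script over each batch of records. A minimal Jython Evaluator sketch follows, using the `records`, `output`, and `error` objects that the evaluator binds into the script's scope; the /status field and the 'priority' header attribute are hypothetical examples.

# Minimal Jython Evaluator sketch. SDC exposes `records`, `output`, and
# `error` to the script; the field and attribute names are hypothetical.
for record in records:
    try:
        # Decorate: derive a new field from an existing one.
        record.value['is_open'] = (record.value.get('status') == 'new')

        # Record header attributes are also readable/writable from scripts.
        if record.attributes.get('priority') is None:
            record.attributes['priority'] = 'normal'

        output.write(record)           # pass the record downstream
    except Exception as e:
        error.write(record, str(e))   # route the record to error handling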
DESTINATIONS
Elasticsearch
Hadoop FS
HBase
Hive Metastore
Hive Streaming
HTTP Client
Kafka Producer
MapR DB
WebSocket Client
EXECUTORS
Executors
HDFS File Metadata Executor
Hive Query Executor
JDBC Query Executor
MapReduce Executor
Pipeline Finisher Executor
Shell Executor
Spark Executor
DATAFLOW TRIGGERS (A.K.A. EVENT FRAMEWORK)
Dataflow Triggers Overview
Event Streams
Event Records
Case Study: Parquet Conversion
Case Study: Impala Metadata Updates for DDS for Hive
Case Study: Output File Management
Case Study: Stop the Pipeline
Event Records in Data Preview, Monitor, and Snapshot
Summary
MULTITHREADED PIPELINES
Multithreaded Pipeline Overview
How It Works
Monitoring
Tuning Threads and Runners
Resource Usage
Multithreaded Pipeline Summary
SDC RPC PIPELINES
SDC RPC Pipeline Overview
Deployment Architecture
Configuring the Delivery Guarantee
Defining the RPC ID
Enabling Encryption
Configuration Guidelines for SDC RPC Pipelines
CLUSTER PIPELINES
Cluster Pipeline Overview
Kafka Cluster Requirements
MapR Requirements
HDFS Requirements
Stage Limitations
DATA PREVIEW
Data Preview Overview
Data Collector Console - Preview Mode
Previewing a Single Stage
Previewing Multiple Stages
Editing Preview Data
Editing Properties
RULES AND ALERTS
Rules and Alerts Overview
Metric Rules and Alerts
Data Rules and Alerts
Data Drift Rules and Alerts
Alert Webhooks (see the sketch after this list)
Configuring Email for Alerts
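An alert webhook simply POSTs a configurable payload to a URL you supply when a rule triggers, so any small HTTP endpoint can receive it. The sketch below uses Flask purely as an illustration; the route, port, and payload shape are whatever you configure in the webhook settings, so nothing here reflects a fixed SDC format.

# Minimal sketch: an HTTP endpoint an SDC alert webhook could POST to.
# Flask is used only for illustration; the payload fields depend entirely on
# the payload configured for the webhook, so none are assumed here.
from flask import Flask, request

app = Flask(__name__)

@app.route("/sdc-alert", methods=["POST"])
def sdc_alert():
    payload = request.get_json(silent=True) or {}
    print("alert received:", payload)  # e.g. forward to chat or a ticket system
    return "", 204

if __name__ == "__main__":
    # Point the alert webhook URL at http://<host>:5000/sdc-alert
    app.run(port=5000)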
PIPELINE MONITORING
Pipeline Monitoring Overview
Data Collector Console - Monitor Mode
Viewing Pipeline and Stage Statistics
Monitoring Errors
Snapshots
Viewing the Run History
PIPELINE MAINTENANCE
Data Collector Console - All Pipelines on the Home Page
Understanding Pipeline States
Starting Pipelines
Stopping Pipelines
Importing Pipelines
Sharing Pipelines
Adding Labels to Pipelines
Exporting Pipelines
Duplicating a Pipeline
Deleting Pipelines
Next Training Topic Recommendation: STREAMSETS DPM
Thank you for visiting our website. If you have any questions about the information above, please fill in the Comment Box below; our team will respond to your comment or question within 2 x 24 hours at the latest.
For a faster response, please contact 0838-0838-0001 (Call/WhatsApp).
Regards,
Management,
www.purnamaacademy.com