Introduction to Talend Open Studio
Talend offers Open Studio, an open-source tool for data integration. It has more than 800 components for various integration purposes. Download Talend Open Studio from https://www.talend.com/download/
Data integration means combining data from different sources into a single view, so that a company or organization can analyze it and improve its business. Integration covers extracting the data, cleaning it, applying the required transformations, and then loading it into a data warehouse.
What is Talend?
Talend is an ETL tool used for data integration. Talend provides solutions for data preparation, data quality, data integration, and big data. Here we will discuss some of its components. To make this concrete, consider the example below. A SIM operator holds a large amount of data about plans, customers, SIM details, and so on. Because the volume is so large, big data tooling is also used in the integration.
Customer A buys a SIM using a government ID, giving:
Name: AB C
Address: Chennai, Chennai
Phone number: 1234567890
After data integration:
First name: AB
Address: Chennai, India
Here the data is cleansed and transformed into something more meaningful.
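The cleansing in the example above can be sketched in plain Java. This is only an illustration and not how Talend implements it: the class CustomerCleaner and its methods are hypothetical names, and both rules (the first space-separated token is the first name; repeated address parts are dropped and a default country is appended) are assumptions inferred from the sample record.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Talend: mirrors the cleansing in the example above.
class CustomerCleaner {

    // Assumption: the first whitespace-separated token is the first name.
    static String firstName(String fullName) {
        String trimmed = fullName.trim();
        int space = trimmed.indexOf(' ');
        return space < 0 ? trimmed : trimmed.substring(0, space);
    }

    // Drops repeated address parts and appends the country if it is missing.
    // Assumption: the country is supplied by the caller (for example from a lookup).
    static String cleanAddress(String rawAddress, String country) {
        Set<String> parts = new LinkedHashSet<>(); // keeps order, ignores duplicates
        for (String part : rawAddress.split(",")) {
            String p = part.trim();
            if (!p.isEmpty()) {
                parts.add(p);
            }
        }
        parts.add(country); // no effect if the country is already present
        return String.join(", ", parts);
    }

    public static void main(String[] args) {
        System.out.println("First name: " + firstName("AB C"));
        System.out.println("Address: " + cleanAddress("Chennai, Chennai", "India"));
    }
}
```

In a Talend job, rules like these would normally live in a transformation component rather than in a standalone class.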
The benefits are pointed out below:
- Analyzing Business trends using data integration
- Combining data into a single system
- Saving time, improving efficiency, and reducing rework
- Easy report generation for use by BI tools
- Maintaining and inserting data into the data warehouse and data marts
The following sections walk through some applications of Talend Open Studio.
1. Working with Talend
- Make sure you have Java installed and the environment variables set.
- Download the open-source edition from the Talend website and install the software.
- Create a new project and finish the setup.
- Talend will open with the Designer tab.
- Talend is an Eclipse-based tool; components can be dragged from the Palette, or you can click in the workspace and type the component's name.
2. The first job: reading a file
- Search for the component tFileInputDelimited. This component is used for reading any delimited file.
- Place the tFileInputDelimited component. Search for tLogRow and place it in the job designer.
- Right-click tFileInputDelimited, select Row -> Main, and draw a line to tLogRow.
- In the Component tab, select the path of the file you want to read and give the row separator as \n. If the file has a field delimiter, you can specify it.
- Click the schema and give the column and type details, or read the entire row as a string in a single column, in which case the delimiter value should be empty.
- You can also skip the header and footer.
- In the tLogRow component, select how you want to see the data: table format or single-line format.
- tLogRow displays its output in the Run console.
- After connecting tFileInputDelimited and tLogRow, run the job from the Run tab.
- You can see the file contents printed in the console.
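What this first job does can be approximated in a few lines of plain Java. The sketch below is not the code Talend generates. The class DelimitedReaderSketch, the method readRows, and the sample data are hypothetical, and the input is an in-memory string so the example runs without a file on disk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch of "read a delimited file, print each row".
class DelimitedReaderSketch {

    // Skips headerLines lines, then splits every remaining line on the delimiter.
    // Pattern.quote treats the delimiter literally; -1 keeps trailing empty fields.
    static List<String[]> readRows(BufferedReader reader, String delimiter, int headerLines) {
        return reader.lines()
                .skip(headerLines)
                .map(line -> line.split(Pattern.quote(delimiter), -1))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample content: one header line, ";" as the field delimiter.
        String sample = "id;name\n1;talend\n2;java\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            for (String[] row : readRows(reader, ";", 1)) {
                // Rough equivalent of the single-line log output.
                System.out.println(String.join("|", row));
            }
        }
    }
}
```

To read a real file, the StringReader could be swapped for java.nio.file.Files.newBufferedReader with the file's path.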
3. A second job using tMap
- Read a file and filter it into different output files.
- Read a file in the tFileInputDelimited component with a one-column schema named record.
- tMap component: this component helps in transforming data with built-in functions such as lookups and joins.
- In tMap, create two outputs, out1 and out2.
- In out1, add the filter row3.record.contains("talend") and draw the record to out1.
- Draw the record line to out2 as well. For out2 to receive only the remaining records, it needs the opposite filter or the option to catch rows rejected by out1.
- From the tMap, take the main rows and connect them to two tFileOutputDelimited components.
- Link out1 to tFileOutputDelimited1, writing file1.txt, and out2 to tFileOutputDelimited2, writing file2.txt.
- file1.txt will have the records that contain talend.
- file2.txt will have the records that have other names.
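The routing in this job can be sketched in plain Java, reusing the same contains("talend") condition as the tMap filter. This is an illustration under assumptions, not Talend's generated code: the class RecordRouter and the sample records are hypothetical, and the two lists stand in for file1.txt and file2.txt.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of splitting records into two outputs by a filter.
class RecordRouter {

    // Key true  -> records matching the filter (out1 / file1.txt).
    // Key false -> all remaining records (out2 / file2.txt).
    static Map<Boolean, List<String>> split(List<String> records) {
        return records.stream()
                .collect(Collectors.partitioningBy(record -> record.contains("talend")));
    }

    public static void main(String[] args) {
        // Assumed sample records, one string per input row.
        List<String> records = Arrays.asList("talend open studio", "spark", "talend tmap", "hive");
        Map<Boolean, List<String>> outputs = split(records);
        System.out.println("file1.txt: " + outputs.get(true));
        System.out.println("file2.txt: " + outputs.get(false));
    }
}
```

Partitioning guarantees that every record lands in exactly one of the two outputs, which is the behaviour the two files are meant to have.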
4. Built-in and repository
- Built-in means you must set the schema or the database connection details manually every time.
- The repository lets you save these details as metadata so that you can reuse them without entering them by hand each time. In metadata, you can save file schemas, database connections, Hadoop connections, Hive connections, S3 connections, and many more.
Components of Talend Open Studio
Some of the components of Talend Open Studio are described below.
- This component lists the files in a directory or folder with a given file mask pattern.
- This component is used for connecting to the MySQL database.
- Other MySQL components can reuse this connection, which makes connecting to the database easy to set up.
- This component runs a MySQL database query and returns the table or columns. It is used to run SELECT queries and fetch the details.
- This component is used for inserting or updating data in the MySQL database.
- This component is the first to execute in the job and can be connected to other components with an On Subjob Ok trigger.
- This component is the last to execute in the job. You can connect this with connection close components.
- This component catches the warnings and errors in the job.
- It is the most important component for error handling.
- Error logs can be written using this component together with tFileOutputDelimited.
- There are more than 800 components in total.
- Context variables are variables that can be used in the job anywhere.
- They hold values and can also be passed to another job using the tRunJob component.
- The benefit of context variables is that their values can be changed for different purposes.
- For example, we can have one set of values for the development context group and a different set of context values for production.
- This way we don't have to change the job; changing the context parameters is enough.
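The idea behind context groups can be illustrated with a small plain-Java sketch: the job logic stays the same and only the set of values changes. Everything named here is hypothetical (the class ContextSketch, the variables db_host and input_dir, and their values), and it does not show how Talend stores or loads contexts internally.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one job, two sets of context values.
class ContextSketch {

    // Returns the values for the requested context group.
    // Variable names and values are made up for illustration.
    static Map<String, String> load(String group) {
        Map<String, String> context = new HashMap<>();
        if ("production".equals(group)) {
            context.put("db_host", "prod-db.example.com");
            context.put("input_dir", "/data/prod/in");
        } else { // any other group falls back to development values
            context.put("db_host", "localhost");
            context.put("input_dir", "/tmp/dev/in");
        }
        return context;
    }

    // The "job": identical logic regardless of which context is supplied.
    static String describeJob(Map<String, String> context) {
        return "Reading from " + context.get("input_dir")
                + ", loading into " + context.get("db_host");
    }

    public static void main(String[] args) {
        System.out.println(describeJob(load("development")));
        System.out.println(describeJob(load("production")));
    }
}
```

Switching from development to production changes only the argument passed to load, which mirrors selecting a different context group when running a job.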
Building a job
Let us see how to select and build a job.
- To build a job, right-click the job and select Build Job.
- You can import the built job into TAC (Talend Administration Console).
- In Talend Administration Console you can schedule a job, and you can also set triggers and dependencies between jobs.
- You can also import the job from the Nexus repository using an artifact task.
Create a task in TAC
Let us discuss how to create a task in TAC.
- Open Job Conductor in TAC.
- Click to add a new task and select a normal or an artifact task.
- Import the built job or select it from Nexus.
- Select the job server on which the job will run.
- Save the task.
- Now you can deploy and run the job.
"Simplify ETL and ELT with the leading free open source ETL tool for big data." is the tagline for Open Studio. Talend Big Data has many components for handling huge volumes of data. Standard jobs, Big Data batch jobs, and Big Data streaming jobs are the different types of jobs available in Talend. Big Data jobs can be created on the Spark or MapReduce framework.