
Ingesting Wikipedia Data To Tinybird

Hubert Dulay
Decodable

Tinybird

Tinybird is a serverless analytical backend that serves data through low-latency APIs. It gives developers a fast, easy way to build real-time web applications by letting them focus on the application rather than on the data APIs that serve it.

Like most real-time analytical databases, Tinybird performs best when the data is prepared in advance so that it can focus on low-latency queries. Decodable captures, enriches, and transforms real-time streaming data and sends it to Tinybird. This gives you a fully managed streaming solution end to end, from the source to the real-time web application.

Architecture

The diagram below shows Wikipedia page change events provided by Wikimedia. Wikimedia provides an RSS feed that we will parse and send to Amazon Kinesis. Decodable consumes from Kinesis and prepares the data for Tinybird through Kafka. Decodable will also send the same prepared data to Amazon S3.

We will then define API endpoints in Tinybird for use in a simple web application that shows analytics on the Wikipedia page changes.

RSS

The Wikimedia source provides an RSS feed in XML format. We will use a Python application to parse the RSS feed and send its entries to Kinesis. In the code snippet below, we use a Python RSS module called feedparser to simplify this work and capture the change entries in the feed. We dump each entry as JSON before sending it to a Kinesis stream.
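A minimal sketch of such a parser, assuming boto3 for the Kinesis client (the stream name, region, and feed URL are illustrative):

    import json
    import boto3        # AWS SDK for Python
    import feedparser   # RSS/Atom feed parsing

    # Illustrative stream name and region; substitute your own.
    STREAM_NAME = "wikipedia-events"
    kinesis = boto3.client("kinesis", region_name="us-east-1")

    # Wikimedia publishes recent page changes as a feed.
    FEED_URL = "https://en.wikipedia.org/w/index.php?title=Special:RecentChanges&feed=atom"

    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        # Dump each entry to JSON before sending it to the Kinesis stream.
        # default=str handles non-serializable fields such as parsed dates.
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(entry, default=str).encode("utf-8"),
            PartitionKey=entry.get("id", "unknown"),
        )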

The payload sent to Kinesis will have the following complex format.

Payload
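A representative entry, reconstructed from feedparser’s entry shape and the fields used later in this post (values are illustrative):

    {
      "title": "User talk:74.52.105.242",
      "title_detail": {
        "type": "text/html",
        "language": "en",
        "base": "https://en.wikipedia.org/w/index.php?title=Special:RecentChanges&feed=atom",
        "value": "User talk:74.52.105.242"
      },
      "links": [
        {
          "rel": "alternate",
          "type": "text/html",
          "href": "https://en.wikipedia.org/w/index.php?title=User_talk:74.52.105.242&diff=1103147176&oldid=780606035"
        }
      ],
      "id": "https://en.wikipedia.org/w/index.php?title=User_talk:74.52.105.242&diff=1103147176&oldid=780606035",
      "published": "Mon, 08 Aug 2022 14:13:14 GMT",
      "author": "Zyxw",
      "summary": "<p>...</p>"
    }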

Schema

Before ingesting into Decodable, we need to define a stream with a schema that represents the data above. Decodable represents complex nested structures using a type called ROW<>.

The data types defined for title_detail and links in the stream schema correspond to the nested JSON structures shown in the payload above.

We use these Decodable data types to define a stream in Decodable that will hold the incoming Wikipedia page change data from Kinesis.
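A sketch of the field definitions, assuming the payload shape above (names follow the feedparser entry; the exact schema may differ):

    title          STRING
    title_detail   ROW<`type` STRING, `language` STRING, `base` STRING, `value` STRING>
    links          ARRAY<ROW<`rel` STRING, `type` STRING, `href` STRING>>
    id             STRING
    published      STRING
    author         STRING
    summary        STRING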

After creating the stream, we can create a source connection to AWS Kinesis following these simple instructions. Below is a screenshot of the stream we created with data sourced from Kinesis.

Pipeline

Pipelines are the “heavy lifters” in Decodable that perform complex transformations on streaming data. The SQL below is used in a Decodable pipeline to transform the Wikipedia change data before it reaches Tinybird. A few Decodable functions make this transformation easy; in the sections that follow, we’ll go over each of them.
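Reconstructed from the fragments discussed in the sections that follow, the pipeline SQL looks roughly like this (the stream names and exact field list are assumptions):

    INSERT INTO wikipedia_changes_prepared
    SELECT
        title,
        author,
        published_at,
        diff_id,
        oldid,
        url_title,
        CASE
            WHEN LOCATE(':', url_title) > 0 THEN SPLIT_INDEX(url_title, ':', 0)
            ELSE null
        END AS `type`
    FROM (
        SELECT
            title,
            author,
            TO_TIMESTAMP(`published`, 'E, dd MMM yyyy HH:mm:ss z') AS published_at,
            PARSE_URL(id, 'QUERY', 'diff')  AS diff_id,
            PARSE_URL(id, 'QUERY', 'oldid') AS oldid,
            PARSE_URL(id, 'QUERY', 'title') AS url_title
        FROM wikipedia_changes
    )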

Parsing Timestamp

Decodable has functions that can parse strings into timestamp data types. Timestamp fields are necessary for services like Tinybird that use real-time analytical databases underneath; in Tinybird’s case, that database is ClickHouse.

Here we use the TO_TIMESTAMP function, which takes a timestamp format and converts a string column to a timestamp column. The format is provided in the second parameter and follows Java’s SimpleDateFormat class.

You don’t need to be a Java programmer to use this function. This link is a SimpleDateFormat tool that will help you test your timestamp format and make sure the parser understands how to parse your string into a timestamp.

For example, the published field contains:

 "published": "Mon, 08 Aug 2022 14:13:14 GMT",

Using this tool, we can enter the text from the published field and the Format Pattern to see whether the resulting timestamp displayed in the gray box is valid. In this case the pattern “E, dd MMM yyyy HH:mm:ss z” works.

  • E = day-of-week name
  • dd = two-digit day of month
  • MMM = abbreviated month name
  • yyyy = four-digit year
  • HH:mm:ss = two-digit hour, minute, and second
  • z = time zone

Here is the function call in the SQL:

TO_TIMESTAMP(`published`, 'E, dd MMM yyyy HH:mm:ss z')

Parsing URL

The id field of the data provides additional metadata about the Wikipedia page change. Query parameters in the URL can be used to categorize the change and to reference older page versions.

  • title - contains the title of the page as well as the type of page being changed.
  • diff - a revision ID referencing the changes between the new and previous versions of the page.
  • oldid - a revision ID referencing the older version of the page.

https://en.wikipedia.org/w/index.php?title=User_talk:74.52.105.242&diff=1103147176&oldid=780606035

We use the PARSE_URL function in Decodable to pull values from the query string and create new columns from them. The function calls below read the string in the id field, extract the QUERY part (the query string), and grab the value of the parameter named in the third argument: diff, oldid, or title.

 PARSE_URL(id,'QUERY','diff') as diff_id,
 PARSE_URL(id,'QUERY','oldid') as oldid,
 PARSE_URL(id,'QUERY','title') as url_title

Page Type

Lastly, we use a CASE statement and the LOCATE function to pull off the title prefix that indicates the type of page being changed. For the example URL above, SPLIT_INDEX returns the prefix User_talk from the title User_talk:74.52.105.242.

   CASE
       WHEN LOCATE(':', url_title) > 0 THEN SPLIT_INDEX(url_title, ':', 0)
       ELSE null
   END AS `type`

Creating API Endpoints

At this point we have our data in a format that is optimal for Tinybird. All the fields are flattened and a timestamp is parsed.

In Tinybird, there are a few ways to integrate data. You can send events using their Events API, or you can send them via Kafka, which is what we will do by creating a sink connection to Confluent Cloud. Instructions can be found here.

Once the Confluent Cloud data is flowing, add a new data source in Tinybird by clicking the “+” next to “Data Sources” and selecting the Kafka tab. You will see the screen shown below. Complete the fields by creating an API key in Confluent Cloud and adding the key/secret here.

Tinybird will pull a list of topics in Kafka and grab a sample of the data for configuration.

Next to each of the string fields, click the three dots and indicate that the field is nullable. This will prevent your records from being quarantined in case you receive any null fields.

Lastly, click “Create Data Source” to finish the data integration process.

Tinybird CLI

The Decodable examples GitHub repository contains a Tinybird project with predefined pipes. You can deploy the pipes by installing Tinybird’s CLI here, or you can paste them into the UI directly. For example, one of the pipes runs a simple aggregation over the change data.
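A sketch of such a pipe, assuming the field names from the Decodable pipeline above and an illustrative data source name:

    SELECT author, count() AS edits
    FROM wikipedia_changes
    GROUP BY author
    ORDER BY edits DESC
    LIMIT 10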

You can now create an API endpoint from the result of this query to be used in your application. Do this for each query you would like to run from the application.

Web Application

After creating an endpoint for your result in Tinybird, you can view its usage metrics on its API page. Tinybird also provides some starter code for your application. In our case, we copied the HTML code and used Chart.js to display the results from Tinybird in a chart.

Below is a screenshot of two charts displaying the results of two different endpoints, both sourced from the same Decodable data source.

The bar chart on the right opens another window when clicked, showing the last 10 edits a user has made. This type of query requires a parameter sent from the application, so an endpoint with a query string is necessary. The query below creates an API endpoint in Tinybird with a parameter: author.
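A sketch of such a templated query, with assumed field and data source names (the leading % enables Tinybird’s template syntax, and the second argument to String is a default value):

    %
    SELECT title, url_title, published_at, diff_id, oldid
    FROM wikipedia_changes
    WHERE author = {{ String(author, 'Zyxw') }}
    ORDER BY published_at DESC
    LIMIT 10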

In the example web application, the URL below sends an author name to a second page.

examples/tinybird/author.html?author=Zyxw

The second page grabs the author parameter from the query string using JavaScript and applies it to the Tinybird endpoint to load its result.
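A sketch of that logic, with a placeholder pipe name and token:

    // Read the author name from this page's query string.
    const author = new URLSearchParams(window.location.search).get("author");

    // Call the Tinybird pipe endpoint; pipe name and token are placeholders.
    const url = "https://api.tinybird.co/v0/pipes/last_edits.json" +
      "?author=" + encodeURIComponent(author) +
      "&token=YOUR_TOKEN";

    fetch(url)
      .then((response) => response.json())
      .then((result) => {
        // result.data holds the rows returned by the pipe.
        console.log(result.data);
      });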

The page then generates a table of the last 10 page edits for a user.

Sending to Amazon S3

Decodable can also send the same data to other sinks, such as a data warehouse or a data lake like Amazon S3. This makes the same prepared data available to data scientists and analysts for model training and insights.

Summary

In this post, we captured Wikipedia page changes and sent them to Tinybird, a serverless analytical backend for building low-latency APIs for real-time applications. Decodable pre-processed the data into a flat format and cleansed it to reduce the workload for Tinybird so that it can focus on serving real-time data with low latency. We also made the same prepared data available to other consumers, such as data scientists, by sending it to S3. If you have any questions or would like help implementing this solution, please contact us at support@decodable.co.
