In this post, I will write about Amazon AppFlow and how we can use it to "flow" our data from Slack channels into an S3 bucket and subsequently run queries against it using Amazon Athena. In a follow-up blog post we will explore options for visualizing our data by leveraging the Athena table we create here.
Amazon AppFlow Overview
Amazon AppFlow is a managed AWS service that allows us to connect to various SaaS platforms such as Slack, Salesforce, Facebook/Instagram ads and many more, and to extract and transfer data to other AWS services such as S3, RDS and Redshift, or to third-party platforms like Snowflake, Zendesk and more.
The basic concept in AppFlow, as the name suggests, is the flow. A flow is a high-level, fully managed process that defines the configuration needed to transfer our data between a source and a destination, without much effort and in the desired format. It removes the need to provision infrastructure or write complex SDK code to retrieve and manipulate data from external platforms. AppFlow provides numerous connectors for a variety of SaaS platforms, and each connector lets us configure the data mapping and the file format we want to use. We can apply filters, run validations and define whether flows run on a schedule or on demand. In the following paragraphs we will explore how to set all this up, with our focus on the Slack integration.
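Everything we do in the console below can also be done programmatically. As a quick illustration of the flow and connector concepts, here is a minimal boto3 sketch (assuming credentials are already configured; the region is illustrative) that lists existing flows and inspects what the Slack connector supports:

```python
# Minimal sketch: list existing flows and inspect the Slack connector.
# Assumes boto3 credentials are configured; the region is illustrative.
import boto3

appflow = boto3.client("appflow", region_name="ap-northeast-1")

# Each flow definition records its source and destination connector types.
for flow in appflow.list_flows()["flows"]:
    print(flow["flowName"], flow["sourceConnectorType"], "->", flow["destinationConnectorType"])

# The connector configuration describes what the Slack connector can do,
# e.g. which trigger types (OnDemand, Scheduled) it supports.
slack = appflow.describe_connectors(connectorTypes=["Slack"])["connectorConfigurations"]["Slack"]
print(slack["supportedTriggerTypes"])
```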
Create a Slack application
Before we delve into the AppFlow specifics, we need to create a Slack application with the necessary permissions. The AppFlow connector will use this application to connect to Slack. We can do this from the Slack API page by clicking the Create New App button. After the application is created, we need to configure a few things required by AppFlow, which we will go through below.
On the Basic Information page, we need to take note of the Client ID and Client Secret values. We will use them later on to set up the connection in AppFlow.
Next stop is the OAuth & Permissions page. Here we need to configure two things:
1) In the Redirect URLs section we need to add a new redirect URL for AppFlow. It should be the following (you might need to adjust the region): https://ap-northeast-1.console.aws.amazon.com/appflow/oauth
2) In the Scopes section we need to add the OAuth scopes that allow the Slack application to read from Slack channels. There are separate permissions for public channels, private channels and direct messages. In our case we are only interested in public channel messages, so we only need the two scopes shown below.
Set up AppFlow
With the Slack setup out of the way, we can now move to the AWS Management Console and have our first look at the AppFlow service. Once we are on the AppFlow page, we select Flows from the left panel. Here we can see all of our flows; assuming this is the first time we use the service, the list will be empty. From the top right of the page, we click the Create flow button to start populating that empty list.
1) On the first screen, we choose a name for the flow and can optionally write a description to identify it.
2) The second screen is where we do the wiring between Slack and S3. We choose the data source, the destination where our extracted data will be saved, and set up a connection. There are some optional settings as well, such as integrating with the AWS Glue Data Catalog and choosing the file format. The default format is JSON, but since we will be working with Athena, choosing Parquet will give us better performance.
In the image below we can see the screen where we create the connection. Depending on the SaaS platform, the connector will have different parameters that we can configure. For Slack we need to input the workspace name and the Client ID and Client Secret. This information can be found in the Slack API apps page as described in the previous section.
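For reference, the same connection can also be created through the API. Below is a rough boto3 sketch; the connection name and workspace URL are made up, and the console normally takes care of the OAuth handshake for us, so the authorization code in the credentials block is purely illustrative:

```python
# Rough sketch of creating the Slack connection via the API.
# The console performs the OAuth handshake for us; via boto3 we would have
# to obtain an OAuth authorization code ourselves, so the values below are
# placeholders rather than working credentials.
import boto3

appflow = boto3.client("appflow", region_name="ap-northeast-1")

appflow.create_connector_profile(
    connectorProfileName="slack-demo-connection",  # illustrative name
    connectorType="Slack",
    connectionMode="Public",
    connectorProfileConfig={
        "connectorProfileProperties": {
            "Slack": {"instanceUrl": "https://my-workspace.slack.com"}  # your workspace URL
        },
        "connectorProfileCredentials": {
            "Slack": {
                "clientId": "<slack-client-id>",          # from the Basic Information page
                "clientSecret": "<slack-client-secret>",
                "oAuthRequest": {
                    "authCode": "<oauth-code>",           # obtained during the OAuth flow
                    "redirectUri": "https://ap-northeast-1.console.aws.amazon.com/appflow/oauth",
                },
            }
        },
    },
)
```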
After we set up the connection, we are asked to choose the Slack channel that we want to use as our source. We are also given two options for how the flow is triggered: it can either run on demand or on a schedule, as can be seen below.
3) In the next screen, we can choose various options to configure how our data is mapped. Slack stores a variety of information for each message and we can choose which fields we want to store in S3. We are also given some more advanced options such as transforming the fields, performing validations and defining a folder structure for the S3 bucket. One useful option is to create sub-folders sorted by date.
4) The next screen is very simple and allows us to create a filter for the data. Unfortunately, at the time of writing, the filter did not work as expected. For Slack we are only given the option to filter by message timestamp, which would be very useful if we don't want to extract the entire channel history. However, AppFlow reads the timestamp field as a string instead of a datetime, so the filter is not applied. This might be a regression in a newer version of AppFlow, as the AWS documentation includes a video example in which this field is interpreted as a datetime.
The final screen is there just to review our input. If everything looks good, we can hit the Create flow button, which concludes creating the flow. After the flow is created, if we didn't choose scheduled execution, we can select it from the Flows panel and execute it by clicking the Run flow button.
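Putting the steps above together, here is a rough boto3 equivalent of the flow we just created in the console. All names (flow, connection, bucket, channel) are illustrative, and the Slack "object" value in particular is an assumption about how the connector addresses a channel, so treat this as a sketch rather than a drop-in script:

```python
# Rough boto3 equivalent of the console walkthrough: Slack source,
# S3 destination with Parquet output and date-based folders, on-demand trigger.
import boto3

appflow = boto3.client("appflow", region_name="ap-northeast-1")

fields = ["timestamp", "text", "user"]  # the fields we map to the destination

appflow.create_flow(
    flowName="slack-to-s3-demo",
    description="Export public channel messages from Slack to S3 as Parquet",
    triggerConfig={"triggerType": "OnDemand"},  # or "Scheduled" with triggerProperties
    sourceFlowConfig={
        "connectorType": "Slack",
        "connectorProfileName": "slack-demo-connection",
        "sourceConnectorProperties": {
            "Slack": {"object": "conversations/C0123456789"}  # assumed channel object id
        },
    },
    destinationFlowConfigList=[
        {
            "connectorType": "S3",
            "destinationConnectorProperties": {
                "S3": {
                    "bucketName": "my-appflow-demo-bucket",
                    "s3OutputFormatConfig": {
                        "fileType": "PARQUET",
                        # date-based sub-folders, as chosen in the console
                        "prefixConfig": {"prefixType": "PATH", "prefixFormat": "DAY"},
                    },
                }
            },
        }
    ],
    # one PROJECTION task selecting the source fields, plus one Map task per field
    tasks=[
        {
            "sourceFields": fields,
            "connectorOperator": {"Slack": "PROJECTION"},
            "taskType": "Filter",
            "taskProperties": {},
        },
        *[
            {
                "sourceFields": [f],
                "connectorOperator": {"Slack": "NO_OP"},
                "destinationField": f,
                "taskType": "Map",
                "taskProperties": {},
            }
            for f in fields
        ],
    ],
)

# Run the on-demand flow (the Run flow button in the console).
appflow.start_flow(flowName="slack-to-s3-demo")
```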
Query our Data in Athena
The final piece of the puzzle is to query the extracted data in S3 using Athena. Athena is a serverless service that allows us to query and analyze data in S3 without provisioning any infrastructure ourselves. It is very straightforward to use, and we can query the data in S3 directly with standard SQL.
1) The first step is to create the database and table in Athena. From the Athena page in the Management Console we select Create table. On the next screen we need to name the table and database, and also select the S3 location that holds our AppFlow output as the data source. This can be seen in the image below.
In the same page there are some additional options we can configure. We can choose the table format between Iceberg and Hive. There are some performance considerations here for large datasets, but that is out of scope for our simple use case, so I will use the default option, which is Hive. The one thing we do need to configure here is the table columns. Slack messages exported by AppFlow have 20 different fields; for demo purposes I will just use the following three: timestamp, text and user.
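The console wizard generates the DDL for us, but the end result looks roughly like the statement below, which we could also submit ourselves through boto3. The database, table and bucket names are illustrative, and I declare all three columns as strings to keep the example simple:

```python
# Sketch of creating the database and table programmatically with Athena.
# Names are illustrative; all columns are declared as strings for simplicity.
import time
import boto3

athena = boto3.client("athena", region_name="ap-northeast-1")

def run(query: str) -> None:
    """Submit a query to Athena and wait for it to finish."""
    qid = athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return
        time.sleep(1)

run("CREATE DATABASE IF NOT EXISTS slack_demo")
run("""
CREATE EXTERNAL TABLE IF NOT EXISTS slack_demo.messages (
  `timestamp` string,
  `text`      string,
  `user`      string
)
STORED AS PARQUET
LOCATION 's3://my-appflow-demo-bucket/'
""")
```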
We should be done with the configuration now. Once we click the Create table button, the table is created and we are taken back to the main page, where we can see the SQL query Athena used to create it. We can also open a new query tab to run our own SQL commands. Nothing fancy here; selecting everything from the table is enough to prove that our integration works.
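The same query can of course be run programmatically. Here is a minimal sketch using boto3; the database, table and results bucket names are the illustrative ones from above, and since Athena queries are asynchronous we poll for completion before fetching the rows:

```python
# Minimal sketch: run a SELECT against the Athena table and print the rows.
import time
import boto3

athena = boto3.client("athena", region_name="ap-northeast-1")

qid = athena.start_query_execution(
    QueryString="SELECT * FROM messages LIMIT 10",
    QueryExecutionContext={"Database": "slack_demo"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)["QueryExecutionId"]

# Athena queries are asynchronous, so poll until the query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:  # the first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])
```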
And that should be it. It was a bit lengthy, but I believe AppFlow is an easy-to-use and well-priced service, so I hope you found this post useful. In a follow-up blog post, I will explore visualization options for the data in the Athena table.
Akis Papaditsas - Monex Insight