Medium, Lambda, and Me (or how I export Medium stories to my website)

Nicholas Martinez
16 min read · Jan 20, 2018

As I neared the completion of software school, I was encouraged to build a personal website to showcase my work and social media presence in one concise place. I wasn’t sure what I wanted to build or how I wanted to build it. How does one build a website without any content? How does one host content without a website? It was a classic chicken-and-egg problem, and when confronted with a problem like this, the best first move is always to just get started. Now, as a software student it felt borderline dishonest to build a personal website, blog included, with WordPress. There is nothing wrong with WordPress or WordPress development, but I had just spent six months and nearly 1,500 hours learning how to build web applications from the ground up. I didn’t need WordPress… did I? Well, probably not, but I used it anyway. After all, why reinvent the wheel? I didn’t have dozens of hours to spend building a greenfield web application to host half a dozen links and a few articles that didn’t exist yet. I needed this egg to hatch!

So I configured and built my WordPress site, using Route 53 to put it behind http://www.nzenitram.com. I ended up using some free themes I found through WordPress and wrote a couple of posts about what I was learning as I began using AWS. Then I started using Medium. I wanted to get my blog posts onto my profile and discovered that Medium offers a wonderful import tool that does all of the work for you! Hot damn, that was easy; it even linked back to my website at the bottom of each post! Fast forward some time and I decided to rebuild the site. I didn’t (and still don’t) know exactly what I want to do with it, but I knew I wanted to build and host a static site in an S3 bucket. A bit of HTML, JavaScript, and bucket configuration later and it was done. Now whenever I want to update my website, I make local changes on my machine, execute a simple console command to push them to S3, and it updates. Wow, this S3 thing is simple and sweet.

Now what? Well, the first rule of profitable web design is to keep people on your page, and while I am not attempting to turn a profit writing blog posts no one will read (and very few posts at that), I figured, what the hell? Let’s export the blog posts I had previously imported to Medium and let users read them right on my web page. Perfect; the import to Medium was so simple I had to imagine the export would be a breeze. Wrong. Medium doesn’t make it easy to export your posts. Sure, in a way they do: there is an export feature built into your user profile that will zip up all of your stories, HTML and all, and download them. Great, now I have a compressed file full of .html documents that I have to parse through to get the text out of, and if I want the images, well, those are links served from the Medium CDN. What a mess. What to do…

Styling exported with the Medium story.

The first thing I did with my exported posts was back them up to S3. Then it dawned on me: these are HTML files! If I want to present them on my own website, I can work some JavaScript magic, import the HTML string to my website from the bucket, and append a DOM element with the HTML. Straightforward and simple, perfect! So I modify my navbar links to include a name attribute that stores the URL of the corresponding HTML file for the Medium story I would like to fetch. I write an AJAX call, wired to a click event, that removes the HTML from the element and appends the HTML from the file. Done. Let’s test it! Uh oh… I can’t really test this locally. I don’t want to make the bucket public for obvious reasons, so I am receiving a cross-origin request error. The bucket is blocking access to the request from my local machine. Alright, I know how to solve that! I push the HTML and JavaScript changes to the S3 bucket hosting my website, refresh it, and what?! I get the same error! But it’s hosted on S3, on my own account. Why don’t I have access?

AJAX call and jQuery updating the HTML
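
For reference, a minimal jQuery sketch of that kind of call, based on the description above (the .blog link class, the #post container, and the name-attribute scheme are assumptions for illustration, not the exact code from the screenshot):

$('.blog').on('click', function (event) {
  event.preventDefault();
  var url = $(this).attr('name');   // the S3 URL stored on the navbar link
  $.get(url, function (html) {
    $('#post').empty();             // remove the current HTML from the element
    $('#post').append(html);        // append the HTML fetched from the bucket
  });
});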

Oh, right, the request is coming from http://www.nzenitram.com, not the same domain. If you want to investigate this, you can open the https://s3-us-west-1.amazonaws.com/{bucket_name}/index.html page directly from the bucket hosting your website and make the same AJAX request. It works (but probably looks awful; we will get into this later). Why? Well, your buckets share a domain, so it isn’t a cross-origin request. Great, so how do we fix this for your personal domain without opening the bucket up to the world? Let’s configure our S3 bucket to accept cross-origin requests only from the domains we want to have access. Open the bucket you are trying to access from your website in the AWS Console, click on the Permissions tab, then open the CORS configuration editor. The editor takes an XML-formatted list of permissions:

<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <CORSRule>
    <AllowedOrigin>http://*.example.com</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
  </CORSRule>
  <CORSRule>
    <AllowedOrigin>http://example.com</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
  </CORSRule>
</CORSConfiguration>

Save the configuration and travel back to your website. Your domain should now have access to the objects in the S3 bucket. If your JavaScript is written properly, the HTML from the bucket should append to the element you targeted in the script; however, it probably doesn’t look right. What happened? Well, the Medium export brings a lot of baggage with it (as we saw in the screenshot above). If we travel to the bucket we saved the posts in and open them directly in the browser, we will see that they are styled and formatted; they look great. Of course, none of that carries over to our own website, which has its own stylesheets and formatting. At this point we have a couple of options. We could open the exported stories in a text editor and manually remove the style elements and modify the HTML to suit our needs, but where is the fun in that? Enter: AWS Lambda.

Lambda Overview and first steps:

It is entirely possible that using AWS Lambda for this task is overkill, but the point of the exercise was to experiment with the service, and to be honest, with a little more fine-tuning this is going to save me hours of formatting as I continue to write blog posts for my web page and for Medium. There are also some quirks in the overall workflow that don’t make a whole lot of sense. For example, when you export your stories from Medium, you export ALL of your stories, even the “comments” you have made (yes, for some reason, Medium collects everything you write on its website, calls them all stories, and saves them in the same place). Consequently, if you want to migrate a single story from Medium to your website using this procedure, you will have to download all of your stories and comments and upload only the story (or stories) you would like to migrate.

Lambda trigger on PUTS

I broke the process up into seven steps:

  1. Trigger a Lambda event on an S3 PUT
  2. GET the object that was PUT to the bucket
  3. Edit the HTML from the object using Beautiful Soup
  4. PUT the edited Medium post back to the bucket
  5. GET the index.html file from the bucket hosting the website
  6. Append a new link to the page
  7. PUT index.html back to the hosting bucket

Getting Started:

Visit the AWS Console > Compute > Lambda and create a new function. Select Author from scratch and fill out the form. I will be using the Python 3.6 runtime for this example, and because we will be fetching and uploading S3 objects from this Lambda function, we will have to create or use a custom IAM role that allows AmazonS3FullAccess to our buckets. I created a new role called lambda_fullS3 specifically for this, because when you attach the role to your Lambda function, AWS automatically adds inline Lambda and CloudWatch policies to it, and it is always a good idea to keep your IAM policies organized and identifiable as a matter of practice. Now, for example, if we have a security problem with this one service, we can deny access to only this service by re-configuring or deleting only this role, and it won’t impact any of the other services we are running on our account.

New Lambda Function Console

On the left-hand side of the Designer window there is an Add triggers column. The plan is to trigger the Lambda function on an S3 PUT, so scroll down a bit and click on S3. The box that reads Add triggers from the list on the left should populate with the S3 icon and a warning that Configuration is Required. Click on the box and the Configure triggers window should appear. The configuration window includes fields for the Bucket we wish to use to trigger the event, the Event type that must occur within the bucket, and prefix/suffix filters that give us more granular control over the file types that can trigger our function. Select the bucket you are using to store your stories and the PUT event type, then click Add at the bottom of the configuration window. After you make changes to your function or its configuration, you will have to save those changes by clicking the Save button at the top of the page. A brief note about the console window: if you take a look at the structure of the function in the diagram, you should now see the S3 trigger pointing to the Lambda function we’ve created and the Lambda function pointing to CloudWatch and S3. When you run a Lambda function, AWS stores the logs for the run and lets you view them in CloudWatch. If you wish, return to the IAM role we created for Lambda and you will see the inline permissions AWS added to the role so Lambda can communicate with those services.

Now that we have our event trigger configured, we can begin writing our function’s code. Beneath the Designer window is the Function code window, and above the Designer window is a drop-down called Test. Click the drop-down and select Configure test events. From here you can choose from predefined test templates; there is one called S3 Put. Select it, name it, and create it. (AWS also offers SAM Local, a CLI tool that uses Docker to test your Lambda functions locally, but that is outside the scope of this post.)

{
  "Records": [
    {
      "eventVersion": "2.0",
      "eventTime": "1970-01-01T00:00:00.000Z",
      "requestParameters": {
        "sourceIPAddress": "127.0.0.1"
      },
      "s3": {
        "configurationId": "testConfigRule",
        "object": {
          "eTag": "0123456789abcdef0123456789abcdef",
          "sequencer": "0A1B2C3D4E5F678901",
          "key": "HappyFace.jpg",
          "size": 1024
        },
        "bucket": {
          "arn": "arn:aws:s3:::mybucket",
          "name": "sourcebucket",
          "ownerIdentity": {
            "principalId": "EXAMPLE"
          }...

Let’s run a couple of tests:

The name of the test we saved should appear in the drop-down menu at the top of the page. With it selected, click the Test button. If everything is configured properly and the code in the editor is left as the default, we should receive a green flash message at the top with a link to our CloudWatch logs and a response in the Execution results portion of the code window.

Response:
"Hello from Lambda"
Request ID:
"efec9a6b-fd49-11e7-8a83-ed5b337ce748"

The response text is coming from the return statement in the code block. Let’s modify our code to see how the test event influences the run of our function.

def lambda_handler(event, context):
    # return 'Hello from Lambda'
    return event

If we return the event, the response is simply the JSON from our test.

We can parse that response to get the details of the event that triggered the function. In this case, we can parse out the bucket name:

def lambda_handler(event, context):
    return event['Records'][0]['s3']['bucket']['name']

Response:
"sourcebucket"
Request ID:
"b21e9537-fd4c-11e7-a909-4120ebeee88f"

So if we set up our function like this:

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    return bucket, key

We can return the bucket and the key of the object that is triggering the event:

Response:
[
  "sourcebucket",
  "HappyFace.jpg"
]

In short, the event parameter is a JSON response from the trigger that our Lambda function can consume.

So now that we have covered the basics, let’s upload a post! But before we do, there is one caveat. We are going to be using dependencies outside of the Python standard library. When writing Lambda functions that require outside dependencies, we need to install those dependencies into the directory we are running our code from. Practically speaking, this means we are not going to be able to write this code directly into the Function code window in the AWS console. Instead, the code will need to be written on your local workstation, the dependencies installed, the project packaged, and the package uploaded to Lambda by selecting an upload option from the Code entry type drop-down in the Function code window (more on this later).

Let’s get to those seven steps I mentioned above, shall we?

Our Code:

Step 1: Trigger a Lambda event with an S3 PUT

(This step will trigger an event after the Lambda function has been written and configured.)

cd into the directory where you saved your Medium stories and identify which story you want to parse and add to your blog. Then, using the AWS CLI, upload (PUT) the file:

aws s3 cp /user/dir/filename s3://bucketname/

Note: bucketname needs to be the same bucket we selected in the Add triggers step earlier.

If you would like to upload the entire directory or more than one file, more information on the AWS CLI for S3 can be found here.
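
For instance, to push every exported HTML file at once, something like this should work (the local path and bucket name are placeholders):

aws s3 cp /user/dir/ s3://bucketname/ --recursive --exclude "*" --include "*.html"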

Step 2: GET the object that was PUT to the bucket

The upload will trigger our Lambda function. We will need to parse the JSON event sent to our function:

from bs4 import BeautifulSoup
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        soup = BeautifulSoup(response['Body'].read(), "html.parser")
        remove_elements(soup)
        html = soup.prettify()
        end_path = path(key)
        s3_upload_article(html, bucket, end_path)
        puts_index(bucket, key, end_path)
        return response['ContentType']
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

From the Docs:

AWS Lambda includes the AWS SDK for Python (Boto 3), so you don’t need to include it in your deployment package. However, if you want to use a version of Boto3 other than the one included by default, you can include it in your deployment package.

When the event comes in, we parse out the bucket and the key (the file we uploaded); then, in the try block, we fetch the object with s3.get_object(Bucket=bucket, Key=key) using the boto3 library.

Step 3: Edit the HTML from the object using Beautiful Soup

First we parse the response from the get_object() call with BeautifulSoup(response['Body'].read(), "html.parser"). Recall that the Medium export feature downloads our stories as HTML documents, so when we fetch these files from S3 we are simply bringing in the HTML as a string that we can then parse with the Python library Beautiful Soup. A partial screenshot of the style element that Medium exports with the story is included above, so removing the style elements was the natural first step in cleaning up the HTML. Let’s write the remove_elements() function and pass it soup to begin formatting our file:

def remove_elements(soup):
    soup.style.decompose()
    soup.section.decompose()
    soup.header.decompose()
    soup.footer.decompose()
    remove_style_tags(soup)
    center_figures(soup)

def remove_style_tags(soup):
    for tag in soup():
        for attribute in ["style"]:
            del tag[attribute]

def center_figures(soup):
    figures = soup.findAll('figure')
    for fig in figures:
        fig['style'] = "text-align:center"

BeautifulSoup is an absolutely wonderful library. As you can see, it allows us to call tags directly from the parsed HTML and remove them. The first thing we do is remove the style, section, header, and footer components from the document. As you follow along with this post, I would recommend reviewing the HTML document you are manipulating; your needs may vary, and you may want to keep the footer information that was exported. It’s difficult to say until you see the HTML on your own website. For my use case, after removing those explicit tags from the document, I decided to iterate over all of the tags, find any lingering style attributes, and delete them. I then went over the figure tags, the tags that contain the images from the post, and centered them.
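
To make that behavior concrete, here is a tiny, self-contained sketch (the HTML snippet is made up for illustration) of what decompose() and the attribute deletion are doing:

from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a Medium export
sample = "<html><head><style>p {color: red}</style></head>" \
         "<body><p style='font-size: 18px'>Hello</p></body></html>"
soup = BeautifulSoup(sample, "html.parser")

soup.style.decompose()   # removes the entire <style> element from the tree
for tag in soup():       # soup() is shorthand for soup.find_all()
    del tag["style"]     # strips the inline style attribute; tags without one are left alone

print(soup.prettify())
# -> the <style> block and the inline style attribute are both gone;
#    only the bare <p>Hello</p> markup remains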

Step 4: PUT the edited Medium post back to the bucket

After we have completed our formatting, let’s modify the path (rename our upload) of the object and push it back up to the bucket to save it. We need to be sure to upload the file as an HTML document with the flag ContentType='text/html' and add public-read to the Access Control List with ACL='public-read':

def lambda_handler(event, context):
    ...
    html = soup.prettify()
    end_path = path(key)
    s3_upload_article(html, bucket, end_path)
    ...

def s3_upload_article(html, bucket, end_path):
    s3.put_object(Body=html, Bucket=bucket, Key=end_path, ContentType='text/html', ACL='public-read')

def path(key):
    s = '-'
    seq = key.split('_')[1].split('-')[:-1]
    end_path = s.join(seq) + '.html'
    return end_path
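
To see what path() is doing, here is a quick walk-through with a made-up export filename (Medium names exported files roughly as date_slug-hash.html; your actual filenames may differ):

# Hypothetical exported filename: date_slug-hash.html
key = "2018-01-20_Medium-Lambda-and-Me-1a2b3c4d5e6f.html"

# key.split('_')[1]      -> "Medium-Lambda-and-Me-1a2b3c4d5e6f.html"
# .split('-')[:-1]       -> ["Medium", "Lambda", "and", "Me"]  (drops the trailing hash)
# '-'.join(...) + ".html"
print(path(key))         # -> "Medium-Lambda-and-Me.html"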

Step 5: GET the index.html file from the bucket hosting the website

For this step I wrote the function puts_index(bucket, key, end_path) which looks like:

def puts_index(bucket, key, end_path):
    markup = create_markup(end_path)
    index_response_www, index_bucket_www, index_key = get_object()
    soup_www = BeautifulSoup(index_response_www['Body'].read(), "html5lib")
    soup_www.ul.insert(-1, markup)
    html2 = soup_www.prettify()
    s3_upload_article(html2, index_bucket_www, index_key)

def get_object():
    index_key = 'index.html'
    index_bucket_www = {bucket_name}  # placeholder: the bucket hosting your website
    index_response_www = s3.get_object(Bucket=index_bucket_www, Key=index_key)
    return index_response_www, index_bucket_www, index_key

def create_markup(end_path):
    url = 'https://s3-us-west-1.amazonaws.com/{bucket_name}/' + end_path
    text = ' '.join(end_path.split('.'[:1])[0].split('-'))
    markup = BeautifulSoup("<li><a class='waves-effect blog' name={}>{}</a></li>".format(url, text), 'html5lib').body.next
    return markup

Step 6: Append a new link to the page

The create_markup() function is responsible for generating the name attribute (the URL endpoint for our formatted HTML object) and the text that will appear in the navbar for our users to see. We then use BeautifulSoup to dynamically generate the markup that will be appended to the list of links on our webpage. We call the get_object() function, where we pull down index.html from the bucket that is hosting our website; we also return the bucket name and the object key for use later. We create the soup_www variable by parsing our index.html page with BeautifulSoup(index_response_www['Body'].read(), "html5lib"). The navbar on our page is wrapped in a ul element, so we can grab that with .ul, which returns the entire list including the elements within it. We then use soup_www.ul.insert(-1, markup) to insert the element we created at the end of the list of links.
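
Continuing the hypothetical filename from Step 4, and using the Step 5 version of create_markup() (which returns just the markup), the generated link looks roughly like this ({bucket_name} is still a placeholder):

markup = create_markup("Medium-Lambda-and-Me.html")
print(markup)
# Roughly: <li><a class="waves-effect blog"
#   name="https://s3-us-west-1.amazonaws.com/{bucket_name}/Medium-Lambda-and-Me.html">
#   Medium Lambda and Me</a></li>

soup_www.ul.insert(-1, markup)   # as in puts_index(): drop it into the navbar's <ul>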

Step 7: PUT index.html back to the hosting bucket

Rebuild the HTML structure of the document with soup_www.prettify() and finally upload the newly formatted HTML to the website’s original bucket.

def puts_index(bucket, key, end_path):
    ...
    html2 = soup_www.prettify()
    s3_upload_article(html2, index_bucket_www, index_key)

def s3_upload_article(html, bucket, end_path):
    s3.put_object(Body=html, Bucket=bucket, Key=end_path, ContentType='text/html', ACL='public-read')

Visit your website and perform a hard refresh on the page. Check the navbar and you should see the link we just created added to your list. Click on it and check the formatting. If it doesn’t look quite right, we can always modify the function to add or remove any elements we would like. My current website, for example, has some issues with resizing the images when the browser is resized; in the future I would like to add some elements that resize the images dynamically. Keep in mind that the images we are serving with our blog post still come from the Medium CDN, so if we want to modify or change those in any way, we may need to save them elsewhere ourselves and update the image links in the HTML.

Your code now looks something like this:

from bs4 import BeautifulSoup
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        soup = BeautifulSoup(response['Body'].read(), "html.parser")
        remove_elements(soup)
        html = soup.prettify()
        end_path = path(key)
        s3_upload_article(html, bucket, end_path)
        puts_index(bucket, key, end_path)
        return response['ContentType']
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

def puts_index(bucket, key, end_path):
    url, text, markup = create_markup(end_path)
    index_response_www, index_bucket_www, index_key = get_object()
    soup_www = BeautifulSoup(index_response_www['Body'].read(), "html5lib")
    soup_www.ul.insert(-1, markup)
    html2 = soup_www.prettify()
    s3_upload_article(html2, index_bucket_www, index_key)

def get_object():
    index_key = 'index.html'
    index_bucket_www = {bucket_name}
    index_response_www = s3.get_object(Bucket=index_bucket_www, Key=index_key)
    return index_response_www, index_bucket_www, index_key

def remove_elements(soup):
    soup.style.decompose()
    soup.section.decompose()
    soup.header.decompose()
    soup.footer.decompose()
    remove_style_tags(soup)
    center_figures(soup)

def remove_style_tags(soup):
    for tag in soup():
        for attribute in ["style"]:
            del tag[attribute]

def center_figures(soup):
    figures = soup.findAll('figure')
    for fig in figures:
        fig['style'] = "text-align:center"

def s3_upload_article(html, bucket, end_path):
    s3.put_object(Body=html, Bucket=bucket, Key=end_path, ContentType='text/html', ACL='public-read')

def path(key):
    s = '-'
    seq = key.split('_')[1].split('-')[:-1]
    end_path = s.join(seq) + '.html'
    return end_path

def create_markup(end_path):
    url = 'https://s3-us-west-1.amazonaws.com/{bucket_name}/' + end_path
    text = ' '.join(end_path.split('.'[:1])[0].split('-'))
    markup = BeautifulSoup("<li><a class='waves-effect blog' name={}>{}</a></li>".format(url, text), 'html5lib').body.next
    return url, text, markup

As mentioned, because Beautiful Soup is not part of the Python standard library, we need to package it with our function. To do so, execute pip install beautifulsoup4 -t <code-dir>. Two sub-directories will be created in your project directory for the library. Now we can zip up the contents of the project directory and upload it to Lambda.

From the Docs:

https://docs.aws.amazon.com/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html

Important

Zip the directory content, not the directory. The contents of the Zip file are available as the current working directory of the Lambda function. For example:
/project-dir/codefile.py
/project-dir/lib/yourlibraries
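
Concretely, the packaging steps look something like this (the directory and zip names are placeholders):

cd project-dir                    # the directory containing your function code
pip install beautifulsoup4 -t .   # install the dependency alongside your code
zip -r ../lambda_function.zip .   # zip the *contents*, not the directory itself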

Return to the AWS Lambda console and upload your .zip:

Function .zip upload window

Return to Step 1 and let’s upload an exported Medium post to the S3 bucket to trigger the event. If all is well, your website should have a new link and a new blog post! If your Lambda function has an error, you’re going to have to join me for my next blog post on setting up SAM Local testing, as debugging Lambda functions with outside dependencies from the console isn’t ideal.

Don’t forget to check out my other AWS articles linked below:

Spinning up an EC2 instance.

Setting up your Identity and Access Management for AWS

Connecting an AWS VPC to your VPN — From the Cloud to the Colo.
