Static site search using Hugo + App Engine + Search API + Python – News Couple



As the year turned into 2018, I decided to ditch WordPress, which I had used as my preferred CMS for over 12 years. I had many reasons, but the biggest motivation was the opportunity to try something new and trade the bloat and clutter of WordPress for a simpler, more elegant arrangement of things. Inspired by another adopter, Mark Edmondson, I decided to give Hugo a go (pun intended).

Migration was not easy, as I had to convert the relational WordPress database into a static set of markdown files. Along the way, I had to configure two themes (normal and AMP), optimize all my images, JavaScript, and stylesheets, and go through each of the 200+ articles looking for stylistic issues and broken links (oh boy, there were a lot of those!).

Hugo is written in Go, and is fairly easy to use if you are familiar with markdown and the command line of your operating system. The trick with a static site is that all the content is stored in static files on the file server. There is no relational database to reference, which means that a static site can be very fast and requires very little maintenance.

One of the biggest problems for me was how to set up the site search. Without a database or web server that generates dynamic HTML documents, finding a suitable way to index content in the browser and respond quickly and efficiently to search queries seemed like an insurmountable task.

I tried a number of things initially, including:

  • Algolia, which I had to give up because I have too much content for their free tier.

  • Lunr.js running on a NodeJS virtual machine in the Google cloud, which I had to give up because I was billed some $400 for keeping it running in December alone.

  • A custom solution that digested Hugo-generated JSON and parsed it for search directly in the browser with jQuery, which I had to give up because downloading an indexed JSON file of about 5 MB per page load is not conducive to a good user experience.

After the failed experiment with lunr.js, I still wanted to give Google App Engine another chance. I’ve been in love with App Engine ever since I published my first version of GTM Tools on it. Well, as it turns out, App Engine has a really useful and flexible Search API for Python, which seems almost purpose-built to work with the JSON generated by a static Hugo site!

The Search API in App Engine



My setup looks like this:

  1. Hugo’s config file is set up to output an index.json file in the public directory, with all my site content ready for indexing.

  2. A script that publishes this JSON file in an App Engine project.

  3. The App Engine project uses the Python Search API client to create an index for this JSON.

  4. The App Engine project also provides an HTTP endpoint where my site performs all search queries. Each request is processed as a search query, and the result is returned in an HTTP response.

  5. Finally, I have a set of JavaScript running the search form and search results page on my site, sending the request to the App Engine endpoint as well as formatting the search results page with the response.

The beauty of using the Search API is that I’m well below the quota limits for the free version, so I don’t have to pay a dime to make it fully functional!

My setup for the Search API

1. Modify the configuration file

It’s easy to make the change in Hugo’s config file, because Hugo has built-in support for generating JSON in a format that most search libraries can digest. In the configuration file, you need to find the outputs configuration and add "JSON" as one of the output formats for the home content type, so that it looks something like this:

[outputs]
  home = [ "HTML", "RSS", "JSON" ]

This configuration change creates an index.json file in the root of your public folder whenever the Hugo site is built.

Here is an example of what a blog post might look like in this file:

[
    {
        "uri": "https://www.simoahava.com/upcoming-talks/",
        "title": "Upcoming Talks",
        "tags": [],
        "description": "My upcoming conference talks and events",
        "content": "17 March 2018: MeasureCamp London 20 March 2018: SMX München 19 April 2018: Advanced GTM Workshop (Hamburg) 24 May 2018: NXT Nordic (Oslo) 20 September 2018: Advanced GTM Workshop (Hamburg) 14-16 November 2018: SMXL Milan    I enjoy presenting at conferences and meetups, and I have a track record of hundreds of talks since 2013, comprising keynotes, conference presentations, workshops, seminars, and public trainings. Audience sizes have varied between 3 and 2,000.\nMy favorite topics revolve around web analytics development and analytics customization, but I\u0026rsquo;m more than happy to talk about integrating analytics into organizations, knowledge transfer, improving technical skills, digital marketing, and content creation.\nSome of my conference slides can be found at SlideShare.\nFor a sample, here\u0026rsquo;s a talk I gave at Reaktor Breakpoint in 2015.\nYou can contact me at simo (at) simoahava.com for enquiring about my availability for your event.\n"
    }
]

2. Publishing script

The deployment script is a piece of Bash code that builds the Hugo website, copies the index.json file into my search project folder, and then deploys the entire search project to App Engine. This is what it looks like:

# Build the Hugo site from a clean slate
cd ~/Documents/Projects/www-simoahava-com/
rm -rf public
hugo

# Copy the generated index.json to the search project
cp public/index.json ../www-simoahava-com-search/
rm -rf public

# Deploy the search project and refresh the search index
cd ~/Documents/Projects/www-simoahava-com-search/
gcloud app deploy
curl https://search-www-simoahava-com.appspot.com/update

The hugo command builds the site and creates the public folder. From the public folder, the index.json file is then copied to my search project folder, which is later deployed to App Engine with the gcloud app deploy command. Finally, a curl request to my custom endpoint makes sure that the Python script updates the search index with the latest version of index.json.

3. Python code running in App Engine

In App Engine I simply created a new project with an easy-to-remember name, since the project ID becomes the endpoint. I didn’t add any billing to the account, because I set myself the challenge of creating a free search API for my site.

See this documentation for a quick-start guide on getting started with Python and App Engine. Focus especially on how to set up an App Engine project (you don’t need to enable billing), and how to install and configure the gcloud command-line tool for your project.

The Python code looks like this:

#!/usr/bin/python

from urlparse import urlparse
from urlparse import parse_qs

import json
import re

import webapp2
from webapp2_extras import jinja2

from google.appengine.api import search

# Index name for your search documents
_INDEX_NAME = 'search-www-simoahava-com'


def create_document(title, uri, description, tags, content):
    """Create a search document with ID generated from the post title"""
    doc_id = re.sub(r'\s+', '', title)
    document = search.Document(
        doc_id=doc_id,
        fields=[
            search.TextField(name='title', value=title),
            search.TextField(name='uri', value=uri),
            search.TextField(name='description', value=description),
            search.TextField(name='tags', value=json.dumps(tags)),
            search.TextField(name='content', value=content)
        ]
    )
    return document


def add_document_to_index(document):
    index = search.Index(_INDEX_NAME)
    index.put(document)

class BaseHandler(webapp2.RequestHandler):
    """The other handlers inherit from this class. Provides some helper methods
    for rendering a template."""

    @webapp2.cached_property
    def jinja2(self):
        return jinja2.get_jinja2(app=self.app)


class ProcessQuery(BaseHandler):
    """Handles search requests for comments."""

    def get(self):
        """Handles a get request with a query."""
        uri = urlparse(self.request.uri)
        query = ''
        if uri.query:
            query = parse_qs(uri.query)
            query = query['q'][0]

        index = search.Index(_INDEX_NAME)

        compiled_query = search.Query(
            query_string=json.dumps(query),
            options=search.QueryOptions(
                sort_options=search.SortOptions(match_scorer=search.MatchScorer()),
                limit=1000,
                returned_fields=['title', 'uri', 'description']
            )
        )

        results = index.search(compiled_query)

        json_results = {
            'results': [],
            'query': json.dumps(query)
        }

        for document in results.results:
            search_result = {}
            for field in document.fields:
                search_result[field.name] = field.value
            json_results['results'].append(search_result)
        self.response.headers.add('Access-Control-Allow-Origin', 'https://www.simoahava.com')
        self.response.write(json.dumps(json_results))

class UpdateIndex(BaseHandler):
    """Updates the index using index.json"""

    def get(self):
        with open('index.json') as json_file:
            data = json.load(json_file)

        for post in data:
            title = post.get('title', '')
            uri = post.get('uri', '')
            description = post.get('description', '')
            tags = post.get('tags', [])
            content = post.get('content', '')

            doc = create_document(title, uri, description, tags, content)
            add_document_to_index(doc)

application = webapp2.WSGIApplication(
    [('/', ProcessQuery),
     ('/update', UpdateIndex)],
    debug=True)

In the end, requests to the / endpoint are routed to ProcessQuery, and requests to /update to UpdateIndex. In other words, these are the two endpoints the app serves.

UpdateIndex loads the index.json file, and for every single piece of content inside it (blog posts, pages, etc.), it grabs the title, uri, description, tags, and content fields from the JSON and generates a search document for each. Each document is then added to the index.
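The field extraction itself needs nothing from App Engine, so it can be sketched and tested in plain Python. This is just an illustration of the same `.get()`-with-defaults pattern; `extract_fields` is a hypothetical helper name, not part of the actual handler:

```python
import json

def extract_fields(post):
    """Pull the five indexed fields out of one Hugo index.json entry,
    falling back to empty defaults when a field is missing."""
    return {
        'title': post.get('title', ''),
        'uri': post.get('uri', ''),
        'description': post.get('description', ''),
        'tags': post.get('tags', []),
        'content': post.get('content', ''),
    }

# A minimal entry, like one element of Hugo's index.json
sample = json.loads(
    '[{"uri": "https://www.simoahava.com/upcoming-talks/",'
    ' "title": "Upcoming Talks", "tags": []}]'
)
fields = extract_fields(sample[0])
print(fields['title'])        # prints: Upcoming Talks
print(repr(fields['description']))  # '' (the field was missing)
```

The defaults matter because not every Hugo page has every field; a missing description should become an empty string rather than crash the index update.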

This is how the Search API can be used to translate any JSON file into a valid search index, against which you can then create queries.

Queries are made by polling the /?q=<keyword> endpoint, where keyword is a valid query for the Search API’s query engine. Each query is processed by ProcessQuery, which takes the query term, polls the search index with that term, and then aggregates a result for all documents returned by the search index for that query (in ranked order). This result is then written into the JSON response to the client.
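The query-extraction step at the top of ProcessQuery uses only the standard library, so it can be exercised on its own. This sketch uses the Python 3 `urllib.parse` equivalents of the Python 2 `urlparse` module imported in the handler; `extract_query` is an illustrative name for the inlined logic:

```python
from urllib.parse import urlparse, parse_qs

def extract_query(request_uri):
    """Return the value of the q parameter, or '' when the URL has no
    query string, mirroring the guard clause in ProcessQuery.get()."""
    uri = urlparse(request_uri)
    query = ''
    if uri.query:
        params = parse_qs(uri.query)
        query = params['q'][0]
    return query

# parse_qs also decodes the form encoding, turning + back into spaces
print(extract_query('https://search-www-simoahava-com.appspot.com/?q=google+tag+manager'))
# prints: google tag manager
print(repr(extract_query('https://search-www-simoahava-com.appspot.com/')))  # ''
```

Note that, like the original handler, this assumes any non-empty query string contains a q parameter; a request like /?foo=bar would raise a KeyError in both.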

The Search API gives you plenty of room for index optimization and for compiling complex queries. I’ve opted for a fairly plain approach, which may result in some odd outliers, like documents that should obviously be at the top of the list of relevant results ending up at the end, but I’m still very happy with the power of the API.

4. JavaScript

Finally, I need some client-side code to produce the search results page. Since Hugo doesn’t have a web server, I can’t do the search server side – it has to be done in the client. This is one case where a static site loses some of its luster when compared to its counterpart with a web server and server-side processing capabilities. The Hugo site is created and published once, so there is no dynamic generation of HTML pages after creation – everything has to happen in the client.

Anyway, the search form on my site is very simple. It just looks like this:

<form id="search" action="/search/">
  <input name="q" type="text" class="form-control input--xlarge" placeholder="Search blog..." autocomplete="off">
</form>

When the form is submitted, it makes a GET request to the /search/ page on my site, adding whatever was typed into the field as the q query parameter, so the URL becomes something like:

https://www.simoahava.com/search/?q=google+tag+manager
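The + signs come from standard form URL-encoding of spaces, which the browser applies automatically on submit. Just to illustrate (in Python 3 standard-library terms, not something the site itself runs):

```python
from urllib.parse import urlencode

# urlencode applies the same application/x-www-form-urlencoded rules
# a browser uses for a GET form: spaces become plus signs.
query_string = urlencode({'q': 'google tag manager'})
print('/search/?' + query_string)  # prints: /search/?q=google+tag+manager
```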

On the /search/ page, I have a loading indicator that waits for the request to the search endpoint to complete. The search call is made with JavaScript, and it looks like this:

(function($) {

    var printSearchResults = function(results) {
        // Update the page DOM with the search results...
    };

    var endpoint = 'https://search-www-simoahava-com.appspot.com';

    var getQuery = function() {
        if (!/(\?|&)q=/.test(window.location.search)) {
            return undefined;
        }

        var parts = window.location.search.substring(1).split('&');
        var query = parts.map(function(part) {
            var temp = part.split('=');
            return temp[0] === 'q' ? temp[1] : false;
        }).filter(function(item) {
            return item !== false;
        });
        return query[0];
    };

    $(document).ready(function() {
        var query = getQuery();

        if (typeof query === 'undefined') {
            printSearchResults();
            return;
        } else {
            $.get(endpoint + '?q=' + query, function(data) {
                printSearchResults(JSON.parse(data));
            });
        }
    });

})(window.jQuery);

To keep things simple, I’ve only included the relevant parts of the code, which can be reused elsewhere as well. In short, when the /search/ page loads, whatever is included as the value of the q query parameter is immediately sent to the search API endpoint. The response is then processed and rendered on the search results page.

So, if the page URL is https://www.simoahava.com/search/?q=google+tag+manager, this piece of JavaScript turns that into a GET request to https://search-www-simoahava-com.appspot.com/?q=google+tag+manager. You can visit this URL to see what the response looks like.

That response is then processed, and the search results page is built from it.

Summary

This is how I chose to build on-site search using Hugo’s flexibility combined with the powerful search API offered by Google App Engine.

Based on the limited amount of searching I did, it’s as good a solution as any, and seems pretty fast without compromising the power of the search query engine. However, as more content accumulates, it is conceivable that the query engine will either become slower or start hitting the free tier quotas, at which point I will need to rethink my approach.

The weak link at the moment is that everything is done on the client side. This means that, contrary to the philosophy of static sites, a lot of processing takes place in the browser. But I’m not sure how to avoid that, since a static site doesn’t give you server-side processing capabilities.

At this time, I’m willing to make a trade-off, but I’m eager to hear feedback if the search is inaccurate or not working properly for you.


