This is an example tech spec, written to illustrate [this blog post].
● Commentary (like this) has a yellow highlight.
● I’d suggest you read the spec once through ignoring the highlighted parts, then go back and
read the full document again.
● Please don’t get too worked up about the specific proposal here - the point is to illustrate
a tech spec in general, not to talk about technical decisions.
● Not all sections are required, or to this degree of chattiness. Yours might be much
shorter.
Tech Spec - Transactional Emails
November 1, 2017
Author: Chuck Groom [ There should be only one author - see the blog post for details ]
Team: Louise Wagner and Colin Mullen [ … but of course, give credit where it’s due! ]
Overview
[ Link to the product spec and supporting documents, if they exist]
We currently use the default emailer included in our web framework. Templates
are checked in with code and are part of the release cycle, and we use our own SMTP server.
This solution has limitations, and should be improved; see ticket #1845. We want to move
transactional emails to a hosted 3rd party service that supports templates and tracking
delivery/open/click rates. We propose using Mailchimp. Send-email jobs will be enqueued to
RabbitMQ, and consumed by a new Email Sending Service. This is a medium-sized project that
will take about one engineer-month of effort from our team, and about two people-weeks of
combined work from other teams.
Goals
[ You can skip Goals, Product Requirements if there is a product spec that explained these in
detail. ]
For transactional emails:
1. Allow marketing and design teams to modify email templates without engineering
involvement.
2. Make available email metrics on: open rates, click rates, and deliverability.
3. Make it easier and faster to add new emails to our system.
Product Requirements
● Emails use a template (with variable substitution and basic if-then-else logic) which can
be edited directly by the marketing and design teams in a self-serve fashion.
● Email designs can be tested in a staging environment that does not affect production.
Assumptions
● We must have the capacity to send up to 2 emails to each customer per day (though we’ll
certainly not plan to send this many emails on a regular basis). At current growth
projections, assuming we are planning for a 2-year horizon, that means we need
capacity for up to 20,000 emails/day. [ It’s a good idea to provide SLAs and capacity
whenever possible. ]
● We only care about HTML email; sorry, Mutt/Pine users!
● We will want to segregate our email sending system (APIs, templates) so there is a
development/test sandbox which is separate from production; e.g. a change in a
development/test email template won’t impact production.
● It is the responsibility of the marketing and design teams to copy changes forward from
development/test to production. Note that this is a potentially error-prone process that will
be difficult to test; this is likely something we’ll want to address soon. [ This is a gross
problem - it’s likely going to be a pain point, but I’m not sure we can solve this right
away, so I’d rather call it out and move on. ]
● We don’t want to accidentally trigger emails to real people from our development/test
domains, so should add a filter that only sends to @ourcompany.com.
● Because it is possible that a bug could cause us to e.g. send 50 emails to the same
customer in 5 minutes, it is highly desired to include a safeguard throttle that (a) raises
an internal alarm if we send more than 3 emails to the same email address in one hour,
and (b) prevents sending more than 5 emails to the same email address in one hour.
[These numbers were pulled out of thin air. I don’t care about the actual numbers, but I
find that people are often more inclined to engage when there’s a specific stake in the
ground to talk about.]
Out of Scope
● This system only applies to transactional emails, triggered by in-app user events. Email
blasts for e.g. marketing purposes will be handled by a separate system.
● We will not support A/B testing at this stage, though it’s a likely future project.
● Cron-job triggered emails (e.g. send a customer daily summary at 4am) might be
handled by our system in the future, but this is not supported for now.
● We do not have a multilingual or localized solution yet; only English emails using a US
date format are supported. [It never hurts to think about international…]
● While we will certainly be able to more quickly add new emails than today (e.g. “let’s
send a thank-you email when a customer refers a friend”), this is not just a matter of
filing an engineering ticket because it requires coordination between three teams: the
email content owner; engineering; and QA. I’ve noted this as an open question below.
[Ugh, this will be a tricky process. For now, let’s just say it’s not all on us, and sort
through the details later.]
Open Questions
[ It’s totally OK to publish a tech spec with holes in it, as long as you call out the holes. ]
● What’s our process for adding new emails? Who owns streamlining this?
● How will we manage email retry?
● How should we throttle emails (so a bug doesn’t send a slew of emails to one
customer?)
● How should we configure our Nginx/Firewall rules to allow Mailchimp to make callbacks to
our systems securely?
● What happens if someone messes up a Mailchimp template such that it e.g. has broken
if-then-else logic and cannot be sent? How is this monitored, and who is responsible for
checking? We are trusting marketing to own this, but does there need to be an
engineering safe-guard too?
● Do we need a manual “pause” switch that will turn off the email worker (so jobs enqueue,
but emails aren’t sent)? This might be useful for maintenance or bugs.
● How do we absolutely, 100% guarantee that someone will stay on top of Mailchimp
billing? E.g. if someone in finance leaves the company, we don’t want an expired credit
card notice to go unnoticed.
Approach
[ Here, I’m combining a presentation of the chosen approach alongside the alternatives we
considered. It would also be fine to separate these out into a section “Other Options
Considered” ]
The first question we considered is: do we want to send email ourselves (from our own server)
or use a third party system? While it’s cheap to send emails using our own SMTP server, we
know from experience that maintaining email deliverability (IP address reputation, bounces,
unsubscribes) is a huge operational burden, and we strongly lean towards using a third party.
Our ideal third party system will handle both sending emails and templates (with a template
editor supporting variable substitution and if-then-else logic). We would prefer an all-in-one
solution to reduce the number of external touch points; this means we won’t consider solutions
like e.g. SendWithUs which just manage templates.
Two of the more popular solutions for transactional emails with templates are SendGrid and
MailChimp. Because our team has had experience with MailChimp and they have a solid
operational reputation, we suggest using MailChimp. (Note that their transactional system used
to be called Mandrill, but this was pulled into the MailChimp brand name).
Components
[ This kind of summary is optional, but sometimes it helps the reader to re-state what we
proposed in terms of what we’ll be building. ]
To recap, our plan is:
● Marketing creates email templates using MailChimp
● Clients will send emails by enqueuing them as async RabbitMQ jobs
○ Specify an email template, recipient
○ We will provide a small library that makes sending emails a simple API call
● There will be a new database table tracking the status of each email address we’ve
sent to (deliverable, #sent, etc).
● We will create two new services
○ Email Sending Service - consumes email-send jobs.
■ Makes API calls to MailChimp, using email templates
■ Populates additional standard variables (e.g. user first name)
■ Use Redis to throttle emails (optional)
○ Email Callback Service - processes async callbacks from MailChimp about email
delivery
Currently Supported Emails
These are the current emails we send. They will be moved to the new system.
● Welcome (on sign-up)
● Subscription confirmation
● Unsubscribe confirmation (with CTA to re-subscribe)
● Payment receipt
Send Queue, and Email Sender Service
We propose that all transactional emails are sent by putting them onto an async job queue. The
email job is a JSON blob that includes at least the following fields (we may add more):
● Recipient email address
● Template identifier
● User-id (may be empty if not associated with an account)
● Variables (key-value tuples)
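As a rough illustration of the job blob described above, here is a minimal sketch. The JSON key names (`recipient`, `template`, `user_id`, `variables`) and the helper `build_email_job` are assumptions made for this example; the spec only fixes which pieces of information the job carries.

```python
import json
from typing import Optional


def build_email_job(recipient: str, template: str,
                    user_id: Optional[str] = None,
                    variables: Optional[dict] = None) -> str:
    """Serialize one send-email job as a JSON string for the queue."""
    job = {
        "recipient": recipient,        # recipient email address
        "template": template,          # template identifier, e.g. "welcome"
        "user_id": user_id,            # None when not tied to an account
        "variables": variables or {},  # key-value pairs for the template
    }
    return json.dumps(job)


job = build_email_job("pat@example.com", "welcome", "u-123", {"plan": "basic"})
```

Keeping the job as plain JSON means any service can enqueue emails without sharing code with the workers.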
We will write a new “Transactional Email Sending Service” that consumes from the jobs queue
to send emails. There will be at least 4 workers running on at least 2 different machines. These
workers are responsible for the (simple) mechanics of making remote API calls to MailChimp to
send emails; populating additional standardized variables (see below), and retry on failure (also
see below).
Because we already use RabbitMQ, we will also be using RabbitMQ for our email queues (with
a new queue). There will be separate instances of the queue and worker fleets for production
and development/staging. When an email queue worker is started, it will know whether it is
running in production or not.
Send Email Library
We will write a small library that any service can use to enqueue transactional emails in this
standardized fashion. The API will be something like:
# template_name: string, e.g. “welcome”
# customer: customer object
# params: dict, optional
send_email(template_name, customer, params = None)
This method just adds the email to the send-queue. It is assumed that all internal applications
will use our standardized configuration management. We will add hooks into this system so on
application start, email configurations (how to connect to the right RabbitMQ queue) are
automatically set up.
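To make the shape of the library concrete, here is a minimal sketch under some assumptions: the `Customer` class, the `configure` hook, and the JSON key names are hypothetical, and a plain callable stands in for the RabbitMQ publisher that configuration management would wire up on application start.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Customer:
    # Hypothetical stand-in for the real customer object.
    id: str
    email: str


# Set once at application start by the configuration hooks. In production
# this callable would publish to the RabbitMQ email queue (assumption).
_publish: Optional[Callable[[str], None]] = None


def configure(publish: Callable[[str], None]) -> None:
    """Install the function used to put a serialized job on the queue."""
    global _publish
    _publish = publish


def send_email(template_name: str, customer: Customer,
               params: Optional[dict] = None) -> None:
    """Enqueue a transactional email; never calls MailChimp directly."""
    if _publish is None:
        raise RuntimeError("email queue is not configured")
    _publish(json.dumps({
        "recipient": customer.email,
        "template": template_name,
        "user_id": customer.id,
        "variables": params or {},
    }))


# Usage, with an in-memory list standing in for the queue.
queued = []
configure(queued.append)
send_email("welcome", Customer(id="u-123", email="pat@example.com"))
```

Injecting the publisher keeps the library testable without a running broker.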
Template Names
[ I might be rambling a bit here and getting off into the weeds - but you know what? I’m talking to
myself and figuring out an approach, and that’s largely the point of the exercise. ]
We suggest that we use ONE MailChimp account for both test and production emails. Each
email template must start with either “test_” or “prod_” to distinguish between the environment,
e.g. there would be both “test_welcome” and “prod_welcome” templates.
When our software triggers an email, it would just specify the e.g. “welcome” template; if it was
in a staging environment, the email sending service will prepend “test_” to the template name.
Note that the marketing team owns creating and managing these email templates.
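The naming rule above is small enough to sketch directly. The function name `resolve_template_name` is hypothetical, and we assume that production prepends “prod_” in the same way staging prepends “test_”.

```python
def resolve_template_name(template_name: str, is_production: bool) -> str:
    """Map a logical template name to the environment-specific one."""
    prefix = "prod_" if is_production else "test_"
    return prefix + template_name


staging_name = resolve_template_name("welcome", is_production=False)
production_name = resolve_template_name("welcome", is_production=True)
```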
Standard Variables
We will want to provide all emails with a standard list of variables. These will be automatically
populated by the email sending service; clients do not need to specify these. Note that we can’t
expect to pull in the universe of all data about a customer (e.g. full click-stream history) because
some data might require complex queries. Instead, we will maintain a list of standard variables
on this wiki page. Initially our variables include:
● First name
● Date joined
● Membership tier
● User timezone (as both current UTC offset in hours, and name like “US/Pacific”)
Client-specified variables are merged into the set of variables given by the email service. In the
case where there is a conflict (the client specifies a variable name which is provided by the
service) we will log an ERROR and override the client variable.
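A minimal sketch of that merge rule, assuming variables are plain dicts; the function name, logger name, and variable names are made up for the example.

```python
import logging

logger = logging.getLogger("email_sender")  # hypothetical logger name


def merge_variables(standard: dict, client: dict) -> dict:
    """Merge client variables with service variables; service values win."""
    for name in sorted(set(client) & set(standard)):
        logger.error("client variable %r conflicts with a standard "
                     "variable; overriding it", name)
    merged = dict(client)
    merged.update(standard)  # standard variables override client ones
    return merged


merged = merge_variables(
    {"first_name": "Pat", "membership_tier": "gold"},
    {"first_name": "Patricia", "referral_code": "abc"},
)
```

Here the client-supplied `first_name` is logged as a conflict and replaced by the service's value, while `referral_code` passes through untouched.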
Email Retry
If the email sending service is unable to send an email via MailChimp due to connectivity
problems or because of some other outage, we will attempt to re-send the email using
backing-off retry logic. The mechanics are TBD. [ It’s OK to not know this yet, let’s move on... ]
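Since the mechanics are still TBD, the following is only one possible shape: exponential backoff with a capped number of attempts. The attempt count, base delay, and the choice to treat `ConnectionError` as the retryable failure are all assumptions for illustration.

```python
import time
from typing import Callable


def send_with_retry(send: Callable[[], None], max_attempts: int = 5,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call send() until it succeeds, doubling the wait after each failure.

    Returns True on success, False once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            send()
            return True
        except ConnectionError:
            if attempt == max_attempts - 1:
                return False  # give up; the caller decides what happens next
            sleep(base_delay * 2 ** attempt)  # waits of 1, 2, 4, ... times base_delay
    return False
```

A real worker might instead re-enqueue the job with a delay rather than sleep in-process; that trade-off is part of the open question.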
Filter Non-Company Email Addresses
When the email sending service runs in non-production mode, we will log but skip sending
emails to all addresses that are outside of our company domain. We don’t want to accidentally
send real customers emails from staging!
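The filter could look roughly like this. The function name `should_send` is hypothetical, and the domain is the @ourcompany.com placeholder used in the Assumptions section.

```python
COMPANY_DOMAIN = "ourcompany.com"  # placeholder domain from the Assumptions


def should_send(address: str, is_production: bool) -> bool:
    """Production sends to anyone; other environments only to our domain."""
    if is_production:
        return True
    _, at_sign, domain = address.rpartition("@")
    # Compare the whole domain so "evil-ourcompany.com" does not slip through.
    return at_sign == "@" and domain.lower() == COMPANY_DOMAIN
```

Comparing the full domain (rather than checking a suffix) avoids look-alike domains passing the filter.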
Throttle
Note that this feature might not be included in the initial rollout.
It is highly desired to detect sending a slew of emails to the same email address in a short
period of time. We could use Redis counters with a TTL; but the specifics are TBD. [ Again, we
capture this as an open question and move on.]
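To give the discussion something concrete, here is a sketch of a fixed-window counter using the thresholds from the Assumptions section (alarm above 3 per hour, block above 5 per hour). A dict stands in for Redis so the example runs on its own; with Redis, the same idea would typically use a per-address counter key that expires after an hour. The class and return values are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple

ALARM_THRESHOLD = 3    # alarm when more than 3 emails/hour go to one address
BLOCK_THRESHOLD = 5    # never send more than 5 emails/hour to one address
WINDOW_SECONDS = 3600


class Throttle:
    """Fixed-window counter per address; a dict stands in for Redis here."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # address -> (count in current window, time the window expires)
        self._counters: Dict[str, Tuple[int, float]] = {}

    def check(self, address: str) -> str:
        """Return "ok", "alarm" (send but alert), or "block" (do not send)."""
        key = address.lower()
        now = self._clock()
        count, expires_at = self._counters.get(key, (0, now + WINDOW_SECONDS))
        if now >= expires_at:  # window elapsed, like a key with a TTL expiring
            count, expires_at = 0, now + WINDOW_SECONDS
        if count >= BLOCK_THRESHOLD:
            return "block"
        count += 1
        self._counters[key] = (count, expires_at)
        return "alarm" if count > ALARM_THRESHOLD else "ok"
```

With these numbers, the fourth and fifth emails in an hour are sent but raise an alarm, and the sixth is suppressed.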
Email Callback Service
We will write a new, small, separate service for receiving MailChimp async callbacks. These
callbacks tell us whether an email was sent, delivered, opened, clicked; or if there is a
hard bounce, soft bounce, or unsubscribe. This service will write all such events to the log. In
addition, we will keep a new database table (email_address_status) about the status of each
unique email address we have sent emails to, with fields like (this isn’t finalized):
● Email address - case insensitive string
● Active (has not hard-bounced, has not unsubscribed from all emails) - boolean, default ‘t’
● Count sent - integer
● Count delivered - integer
● Count soft-bounce - integer
● Count hard-bounce - integer
● Count opened - integer
● Count clicked - integer
● Last sent date - timestamp
The email sending service will also check this table before sending emails (to skip non-active
addresses) and will upsert on sending.
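As a sketch of how events might update a row of this table, here is an in-memory version using a dataclass in place of the database. The field names follow the list above; the event names (`sent`, `hard_bounce`, and so on) are our own labels, not MailChimp's actual callback payload values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmailAddressStatus:
    email_address: str
    active: bool = True
    count_sent: int = 0
    count_delivered: int = 0
    count_soft_bounce: int = 0
    count_hard_bounce: int = 0
    count_opened: int = 0
    count_clicked: int = 0
    last_sent: Optional[float] = None  # timestamp of the most recent send


# Hypothetical event names mapped to the counter each one increments.
_COUNTER_FOR_EVENT = {
    "sent": "count_sent",
    "delivered": "count_delivered",
    "soft_bounce": "count_soft_bounce",
    "hard_bounce": "count_hard_bounce",
    "opened": "count_opened",
    "clicked": "count_clicked",
}


def apply_event(status: EmailAddressStatus, event: str, timestamp: float) -> None:
    """Update one address's status record for a single event."""
    if event not in _COUNTER_FOR_EVENT and event != "unsubscribe":
        raise ValueError(f"unknown event: {event}")
    field = _COUNTER_FOR_EVENT.get(event)
    if field is not None:
        setattr(status, field, getattr(status, field) + 1)
    if event == "sent":
        status.last_sent = timestamp
    if event in ("hard_bounce", "unsubscribe"):
        status.active = False  # the sending service skips non-active addresses
```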
Schema Changes
[ In this example, I’m being almost absurdly terse because I’m assuming I’m working with a
particular team and process where more details wouldn’t be necessary. ]
We will create a new database table email_address_status, discussed above. It will be
approximately the same size as the users table. The only two users are the email sending
service and the email callback service.
Security and Privacy
[ Always consider customer data, encryption, PII, and potential vectors of attack or data
leakage.]
Our biggest risk is that we send too many emails, which annoy or alienate customers. And of
course, there must be an opt-out option for users so they are in control of how we use their
emails.
In terms of privacy, we will be sending some customer PII (email address, first name, pricing
tier) to a third party (Mailchimp) via an encrypted (HTTPS) API; this shouldn’t be a big deal, but
the legal and devops teams should be made aware of this.
Our email callback service will need to receive external callbacks from Mailchimp, so we’ll need
to be careful to ensure that these external endpoints are not a vector of attack. Precise
firewall/Nginx configuration TBD.
Test Plan
For major releases with a black-box (manual) QA cycle, we’d like to ask the QA team to add the
welcome email as part of their test plan. Signing up to create an account should trigger an
email, which the QA team will verify arrives and looks visually correct.
We would also like to add email integration tests to our Selenium automated tests. The
golden-path sign-up flow should include an end check of an email inbox to ensure a welcome
email arrived within a few seconds of sign-up.
Operations
Deployment
We will work with the devops team to include the email sending service and email callback
service as new components (along with config file management, secrets, etc) in our deployment
system. We will own adding the new DB schema ourselves (as a checked-in migration), but the
DBAs will be keeping an eye on things like e.g. query performance on prod.
Initially, our expected load is quite low (we’re just taking small data blobs off a queue and
making remote HTTPS calls), so we suggest deploying to a fleet of 2 separate VMs (on separate
physical servers, of course) for each service.
RabbitMQ is owned by the infrastructure team; they are on board with adding and supporting the
new email queues.
Rollout Plan
Because this is a new system, we can deploy the services and components in advance of
actually sending customer emails. The order of operation is:
● Deploy full infrastructure (but don’t send welcome emails)
○ Marketing creates templates
○ RabbitMQ setup
○ Database changes
○ Deploy email sending service
○ Deploy email callback service
● Manually test (trigger emails)
○ Validate integrations
○ Validate monitoring and metrics
● Deploy code that sends actual emails
Rollbacks
[ Always include some kind of playbook for what people should do if something goes wrong with
the release - even if it’s just one line. ]
If there are unforeseen issues, it should be easy to simply shut down the email workers. We will
be closely looking at logfile ERRORs when this goes live.
Monitoring and Logging
[ Never fly blind! Always have a monitoring plan, even if it’s just “dump to a log file, and have
something that checks for ERRORs” ]
In addition to standard machine-health metrics and monitoring, we will use logfile monitoring
(with Graylog).
Email sending service:
● Log each email sent (email address, template, user information)
● Log emails suppressed (e.g. known bounce, or in testing if out of domain)
● Log any remote errors/exceptions -- add monitoring alert on these. This should include
missing or broken templates.
● Add abnormal level monitors
○ Production triggers: < 1 email successfully sent per half hour, > 2000 emails sent
per half hour (we’ll update this over time)
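The production triggers above amount to a simple range check on the half-hourly send count. A sketch, assuming the count comes from the logfile metrics; the function name and alert messages are hypothetical.

```python
from typing import Optional

MIN_SENT_PER_HALF_HOUR = 1     # fewer than this suggests sending is stuck
MAX_SENT_PER_HALF_HOUR = 2000  # more than this suggests a runaway sender


def check_send_volume(sent_in_last_half_hour: int) -> Optional[str]:
    """Return an alert message if the send volume is abnormal, else None."""
    if sent_in_last_half_hour < MIN_SENT_PER_HALF_HOUR:
        return "too few emails sent in the last half hour"
    if sent_in_last_half_hour > MAX_SENT_PER_HALF_HOUR:
        return "too many emails sent in the last half hour"
    return None
```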
Email callback service:
● Log each callback (email, status reported by Mailchimp)
● Trigger monitoring on remotely reported errors/exceptions
Metrics
The project success metrics (customer engagement rates by cohort, customer support request
rate) are provided by the BI team, and we’ll check regularly to see how this project is affecting
these.
We suggest that we start tracking some additional email-status metrics from logfile entries, with
a Graylog dashboard report:
● Count emails sent, by template
● Count bounces reported by Mailchimp
Note that MailChimp has good dashboards for open rate, click rates, etc. For now, we won’t pull
these into our system but will give the business team access to log into the MailChimp account.
Long-term Support
The platform team owns the email sending service and email callback service going forward;
infrastructure team owns the RabbitMQ. Note that while we’re using Mailchimp for now, our
system is generic enough that there’s not all that much vendor lock-in. We expect to spend on
the order of $2500/month sending emails. [ N.b. this number is totally made up ].
As noted in the “Out of Scope” section in the beginning, adding new email templates is not just
an engineering project, but a process involving several teams. [ It’s OK to repeat yourself to
CYA ].
Timeline and Components
● MARKETING
○ Set up MailChimp account, add company credit card, create initial email
templates for development and production (one person, 3 days)
● ENGINEERING (platform)
○ Write library for enqueuing emails (one person, 3 days)
○ Write new Email Sending Service, integrate with MailChimp (one person, 5 days)
○ Write new Email Callback Service (one person, 4 days)
○ Work with DevOps team to configure, deploy, troubleshoot services (one person,
2 days)
○ Set up monitoring alerts and metrics dashboards (one person, 1 day)
● ENGINEERING (infrastructure)
○ Add and support new email queues for RabbitMQ (one person, 1 day)
● QA
○ Black-box plan for testing emails (1 day)
○ Add welcome email check to golden-path automated Selenium tests (2 days)
○ Manually test initial rollout (1 day)
● DEVOPS
○ Support setting up new services (2 days)