Headless CMS application lifecycle management: how to upgrade production without downtime or content loss?

Hello!

TL;DR: Which team member role is intended to do what? How can I design a system that allows consistently safe production releases with a minimum of manual intervention?

I’m setting up CI/CD for a customer, and the tech choice was to go with the latest and greatest (a headless CMS). It seems very cool, but I don’t know how some details are supposed to work out when it comes to deploying cleanly.

“Traditional” application lifecycle management (an example)

  1. Stakeholders create a Figma design for the developers
  2. Developers develop frontend/backend → push to dev branch → CI/CD magic → development.example[dot]com
  3. Looks good: push/deploy to staging → staging.example[dot]com
    3a. Feedback, etc.: stakeholders say “change this colour” or “change this text to this” → back to step 2.
    3b. Stakeholders happy, on to 4.
  4. Push to uat.example[dot]com and do green/blue testing, a testing sprint, or something similar against a real database + backend to ensure stability
  5. Testing OK, it’s safe: upgrade production by retargeting DNS/rerouting to the new frontend (or similar) and draining the previous production → example[dot]com

Every step is isolated. The frontend can be updated independently of the backend; it’s only concerned with displaying content. Because of this, there are many strategies for safely verifying that an update is OK without any data loss, since data is handled solely in the backend.

Headless CMS application lifecycle management

  1. Stakeholders create a Figma design and give it to the developers
  2. Developers (?) create content-types that fit the Figma designs → CI/CD magic → dev.api.example[dot]com Strapi deployment + dev.example[dot]com, which is empty because there is no content yet
  3. Developers populate the content-types with content; this is how they build the Figma design
  4. Looks good: deploy to staging → stage.api.example[dot]com Strapi deployment + staging.example[dot]com, a Next.js frontend (or similar)
  5. Developers re-populate the content-types (?). This introduces the content migration issue:
    5a. They migrate the content from dev.api.example[dot]com?
    5b. They point the stage.example[dot]com Next.js frontend at the dev.api.example[dot]com Strapi instance, which has the correct content? (And what happens to stage.api.example here?)
    5c. They somehow figure it out; staging.example[dot]com is now deployed
  6. Stakeholders look at stage.example[dot]com and want changes:
    6a. They change the content themselves?
    6b. They change the design within strapi by altering the components?
    6c. They update their Figma design and tell the developers to update… what? The content-types? The content itself?
    6d. They figure it out, and now it’s ready for deployment testing (this is where my problems really show up)
  7. Deploy to a UAT/testing server:
    7a. Deploy to UAT → uat.api.example[dot]com, uat.example[dot]com → content migration issue + change request issue
    7b. Deploy only the frontend: can’t be done, the frontend needs the content-types from the CMS
    7c. Update the production CMS so that it hosts both UAT and prod content-types + content? Very risky: it’s production, it still has the content migration issue, and it’s pointless anyway, since the data is too different for this kind of testing to be useful
    7d. They somehow figure this out; UAT is greenlit, on to production!
  8. We have a uat.example[dot]com that should become the new production for example[dot]com
    8a. Merge UAT to master → CI/CD magic → Next.js + Strapi Cloud update and become example[dot]com? Won’t work. The content in the master Strapi Cloud instance is no longer compatible; it was written for the current production environment. This method simply results in a broken deployment until the content migration problem is resolved (not acceptable)
    8b. Reroute example[dot]com to uat.example[dot]com? All the user registrations, leads and similar data are in the prod Strapi Cloud instance and need to be synced. This introduces race conditions and potential data loss, but it’s acceptable: no downtime.
    8c. Use only one production Strapi Cloud instance and introduce all the new designs as draft pages? This could also work, but here we’re yet again poking at a live production server. Also: who is supposed to do this poking? The developers? DevOps? The stakeholders?
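Option 5a (migrating content from dev.api to stage.api) could at least be scripted instead of done by hand. Below is a minimal sketch, assuming Strapi v5-style flat REST entries (no `attributes` wrapper), API tokens with read access on the source and create access on the target, and plain entries without relations, media or draft/publish handling; the `StrapiInstance` shape and the helper names are hypothetical.

```typescript
// Sketch only: copy one collection type between two Strapi instances over REST.
// ASSUMPTIONS: v5-style flat entries, valid API tokens, no relations/media/draft handling.

type StrapiEntry = Record<string, unknown>;

interface StrapiInstance {
  url: string; // e.g. https://dev.api.example.com (hypothetical)
  token: string; // API token for that instance
}

// Server-managed fields that the target instance generates itself.
const SYSTEM_FIELDS = new Set(["id", "documentId", "createdAt", "updatedAt", "publishedAt"]);

// Turn an exported entry into a create payload for the target instance.
function toCreatePayload(entry: StrapiEntry): { data: StrapiEntry } {
  const data: StrapiEntry = {};
  for (const [key, value] of Object.entries(entry)) {
    if (!SYSTEM_FIELDS.has(key)) data[key] = value;
  }
  return { data };
}

// Read every page of a collection from the source instance.
async function fetchAll(src: StrapiInstance, collection: string): Promise<StrapiEntry[]> {
  const entries: StrapiEntry[] = [];
  for (let page = 1; ; page++) {
    const url = `${src.url}/api/${collection}?pagination[page]=${page}&pagination[pageSize]=100`;
    const res = await fetch(url, { headers: { Authorization: `Bearer ${src.token}` } });
    if (!res.ok) throw new Error(`GET ${url} failed with ${res.status}`);
    const body = (await res.json()) as {
      data: StrapiEntry[];
      meta: { pagination: { pageCount: number } };
    };
    entries.push(...body.data);
    if (page >= body.meta.pagination.pageCount) return entries;
  }
}

// Create every source entry on the target. Not idempotent: a second run duplicates content.
async function copyCollection(
  src: StrapiInstance,
  dst: StrapiInstance,
  collection: string,
): Promise<number> {
  const entries = await fetchAll(src, collection);
  for (const entry of entries) {
    const res = await fetch(`${dst.url}/api/${collection}`, {
      method: "POST",
      headers: { Authorization: `Bearer ${dst.token}`, "Content-Type": "application/json" },
      body: JSON.stringify(toCreatePayload(entry)),
    });
    if (!res.ok) throw new Error(`POST ${collection} failed with ${res.status}`);
  }
  return entries.length;
}
```

Strapi also ships `strapi export`, `strapi import` and `strapi transfer` CLI commands for copying a whole instance, but as far as I understand they replace the destination’s data, which brings back the 8b problem for prod; a selective per-collection copy like this avoids touching user registrations and leads.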

These are no doubt growing pains that will get figured out eventually (I googled around and found no information on how to handle this), but I currently see no production-ready way except 8b, where the potential data loss has to be managed retroactively. Additional problems with this: the “uat” Strapi Cloud instance would become prod, and the old prod would no longer be prod; it would just be… ex-prod (or something). This could be fixed by having “green” and “blue” servers instead, but then we need to keep track of which colour is production and which is UAT. What if someone pushes incorrectly and breaks prod? We’re also forced to have at minimum 3 cloud instances if we also want a staging instance.
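The “which colour is prod?” worry can be reduced by never letting a person choose the deploy target: CI reads a single recorded live colour and always deploys UAT to the other one. A minimal sketch, assuming a hypothetical `DeploymentState` record stored somewhere CI can read (a CI variable, a small config file); none of these names come from Strapi Cloud.

```typescript
// Sketch only: blue/green bookkeeping so a deploy can never target the live colour.
// ASSUMPTION: DeploymentState and its storage location are hypothetical.

type Colour = "blue" | "green";

interface DeploymentState {
  live: Colour; // the colour example.com currently routes to
}

function other(c: Colour): Colour {
  return c === "blue" ? "green" : "blue";
}

// UAT deploys always go to the idle colour; CI computes this instead of asking a human.
function deployTarget(state: DeploymentState): Colour {
  return other(state.live);
}

// Guard for any manual or scripted deploy request.
function assertDeployAllowed(state: DeploymentState, requested: Colour): void {
  if (requested === state.live) {
    throw new Error(`Refusing to deploy to ${requested}: it is serving production`);
  }
}

// After UAT passes on the idle colour, flip routing; the old live colour becomes the next UAT target.
function promote(state: DeploymentState): DeploymentState {
  return { live: other(state.live) };
}
```

With this, “ex-prod” is simply the next UAT slot, so only the routing and the recorded colour change on release. It does not remove the need for a third instance if a separate staging environment is wanted.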

Sorry for the long post. I suspect the real problem here is that I’m “stuck” in a more traditional development lifecycle paradigm. But so is the rest of the world.

Very interested to hear any ideas and replies!

Another issue I don’t understand how to solve: backwards compatibility for the content-type API.

Let’s say my Next.js app, along with its Strapi Cloud backend, gets successfully deployed: example[dot]com v1 → example[dot]com v2. The content-type APIs have changed, and that’s okay for the Next.js app since we figured that out in the previous post, but…! It hasn’t been fixed for all the other devices in the omnichannel setup (and omnichannel is the entire selling point of headless CMSs, btw…). They still rely on the v1 API served from api.example[dot]com. Since api.example[dot]com now serves v2 content-types, all of these non-updated devices will have their frontends broken by content APIs that are not backwards compatible.

How can all the devices in the omnichannel system ensure that they are using a Strapi Cloud instance that provides content-types they are compatible with? Is there some sort of versioning API built into Strapi Cloud? Should I tag the content-types manually and never remove a content-type once it’s committed?
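One general API practice (not something I know Strapi Cloud to provide) is to version the content-type contract like any public API: additive changes bump the minor version, removing or renaming fields bumps the major, and old majors stay served until every channel has migrated. A minimal sketch of the client-side check, assuming hypothetical `ContentApiVersion`/`ServedApi` types and hypothetical `/v1`, `/v2` base URLs:

```typescript
// Sketch only: semver-style compatibility between a channel and the content APIs on offer.
// ASSUMPTION: versions and base URLs are hypothetical; Strapi does not expose them this way.

interface ContentApiVersion {
  major: number; // bumped on breaking changes (removed/renamed fields or content-types)
  minor: number; // bumped on additive changes
}

interface ServedApi {
  version: ContentApiVersion;
  baseUrl: string;
}

// A channel built against major M, minor m can use a served API with the same major and minor >= m.
function isCompatible(required: ContentApiVersion, served: ContentApiVersion): boolean {
  return served.major === required.major && served.minor >= required.minor;
}

// Pick the newest compatible API for a channel, or null if it must be updated first.
function resolveApi(required: ContentApiVersion, served: ServedApi[]): ServedApi | null {
  const compatible = served.filter((s) => isCompatible(required, s.version));
  compatible.sort((a, b) => b.version.minor - a.version.minor);
  return compatible[0] ?? null;
}
```

In practice this amounts to “never remove a committed content-type until no channel still declares its major version”, which could be checked in CI before a breaking change is deployed.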