The @artbhot Text-To-Image Twitter Bot
Amy Smith 1 and Simon Colton 1,2
1 School of Electronic Engineering and Computer Science, Queen Mary University London, UK
2 SensiLab, Faculty of Information Technology, Monash University, Australia
amy.smith@qmul.ac.uk  s.colton@qmul.ac.uk
Abstract

@artbhot is a Twitter bot that brings the generative capabilities of CLIP-guided GAN image generation to the public domain by transforming user-given text prompts into novel artistic imagery. Until recently, access to such image synthesis techniques has been largely restricted to Google Colab notebooks, which require some technical knowledge to use, and to limited services which require invited access. @artbhot increases access to text-to-image technology, as Twitter users already have the platform knowledge needed to interact with the model. We discuss here some of the technical challenges of implementing such a system, and provide some illustrative examples of its usage. We further discuss what this mounting of generative technology amongst social media could mean for autonomous computationally creative agents.

Introduction

Recent developments with generative deep learning technologies have enabled text-to-image computational models to produce artistic images and video content, given only text prompts from users. Colton et al. (2021) explored the possibilities for this within the context of generative search engines, where images are generated rather than retrieved, as per Google image search. Such approaches in the field of text-to-image synthesis (Agnese et al. 2019) allow the user to encode text in such a way as to drive a search for a latent vector input to a pre-trained image generation neural model. This technology has an impressive ability to innovate novel visual content from text, producing high quality and diverse imagery which reflects the prompt well, with images that are often surprisingly innovative. Examples of the kind of artwork that can be produced are given in (Smith and Colton 2021), and we describe the CLIP-guided VQGAN text-to-image system in the background section below.

Interaction with such systems has been largely limited to Google Colab notebooks (Bisong 2019), but this has barriers to entry due to the technical knowledge required to run the notebooks, and user interaction is limited to an image retrieval service. Other recent text-to-image generators (mentioned below) have invitation-only limited access for a small number of artists and researchers. To address this lack of access, we have built the @artbhot Twitter bot (Veale and Cook 2018), which embeds CLIP-guided VQGAN in the Twitter social media platform experience. As described and illustrated with examples below, people can tweet their text prompt with appropriate annotations, and expect an image to be returned in due course. This greatly increases accessibility to the public, as Twitter has over 200 million active users. Due to its popularity and reach, and both the data and interaction available through its API, Twitter also provides an ideal platform for @artbhot to take on more creative autonomy. In particular, we plan to challenge the assumption that text-to-image users should be served only imagery which purely reflects their prompt. Instead, as described in the final section below, we aim for @artbhot to use prompts as springboards for creative ideation and visualisation, and for it to enter into a dialogue with users in a fashion akin to discussions with artists on social media.

Background

In early 2021, Ryan Murdock combined OpenAI's Contrastive Learning Image Pretraining model (CLIP) (Radford et al. 2021) with the BigGAN generative adversarial network (Brock, Donahue, and Simonyan 2019) into a text-to-image generation process. He made the system available via a Colab notebook called The Big Sleep. In overview (with further details in (Colton et al. 2021)), the process involves first encoding a user-given text prompt into the CLIP latent space as a vector v1. The system then performs a search for a latent vector input to BigGAN, v2, which produces an image that, when encoded into the CLIP latent space as v3, has optimally low cosine distance between v1 and v3. The search is performed using gradient descent to minimise a loss function based on this cosine distance. Given that related images and text are encoded by CLIP to similar places in the latent space, this approach tends to produce images which somehow reflect the given text prompt.
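For illustration, the optimisation loop at the heart of this process can be sketched as follows. This is a minimal sketch rather than the Big Sleep notebook code itself: it assumes OpenAI's clip package for the text and image encoders, uses a placeholder generator G standing in for BigGAN (which in practice also conditions on a class vector), and omits CLIP's image preprocessing and normalisation details.

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _preprocess = clip.load("ViT-B/32", device=device)

def generate(prompt, G, latent_dim=128, steps=500, lr=0.05):
    # v1: encode the text prompt into the CLIP latent space
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        v1 = clip_model.encode_text(tokens)
        v1 = v1 / v1.norm(dim=-1, keepdim=True)

    # v2: the latent input to the generator, found by gradient descent
    v2 = torch.randn(1, latent_dim, device=device, requires_grad=True)
    optimiser = torch.optim.Adam([v2], lr=lr)

    for _ in range(steps):
        image = G(v2)  # candidate image from the (placeholder) generator
        image = torch.nn.functional.interpolate(image, size=224)  # CLIP input size
        v3 = clip_model.encode_image(image)  # v3: the image in CLIP latent space
        v3 = v3 / v3.norm(dim=-1, keepdim=True)
        loss = 1 - (v1 * v3).sum()  # cosine distance between v1 and v3
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()

    return G(v2).detach()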
In the interim, many CLIP-guided text-to-image generators have been made available, with steadily improved quality and fidelity (with respect to the prompt) of the images produced. The most recent and impressive examples of this generative technology are @midjourney1, Disco Diffusion2, DALL-E3 from OpenAI and Imagen4 from Google.

1 midjourney.co
2 tinyurl.com/yckn4h7
3 openai.com/dall-e-
4 imagen.research.google/
Figure 1: (a) Processing of a tweet by @artbhot (b) Example user interaction on Twitter.
DALL-E is particularly impressive as it employs a one-shot process, with an encoded text prompt fed forward through a model to produce images near-instantaneously. However, the trained model is so large that access is limited, with the expectation that OpenAI will provide a subscription service for it soon. Currently, Disco Diffusion is available as a Google Colab notebook, and @midjourney is only available to selected users. Wombo Dream5, however, is an app that is available for free from the app store, and appears to have been very popular. In addition to users being able to enter a prompt and receive an image based on this text, they can also select from several art styles that can influence the aesthetic of their generated image. These styles include 'Dark Fantasy', 'Mystical' and 'Salvador Dali'. There is also now DALL.E mini6, which is available to the public and free of charge. It is a smaller version of the model mentioned above and is hosted on Hugging Face7.

5 wombo.art
6 tinyurl.com/4eyr5yjv
7 huggingface.co

In a similar process to that of the Big Sleep approach, CLIP-guided VQGAN harnesses the perceptual power of CLIP and the image generation capabilities of the Vector Quantized Generative Adversarial Network (VQGAN) (Esser, Rombach, and Ommer 2021). This GAN architecture combines two approaches to interpreting meaning, using both discrete and continuous representations of content (Cartuyvels, Spinks, and Moens 2021). Discrete representations model a more human way of interpreting meaning, aside from the pixel-based approach which is traditionally how computers have processed images. In particular, this considers the image as a whole and interprets the relationships between the different compositional elements of the contents, i.e., relationships between different parts of an image (such as the sky and the ground in a landscape image).

VQGAN models these discrete representations as long range dependencies, meaning it can interpret the relationships between compositional elements, and not just the elements themselves, as described in (Esser, Rombach, and Ommer 2021). VQGAN also models image elements, and the local relationships within visual parts of an image, using continuous representations (such as the RGB channels in a pixel). It interprets discrete representations within image content using a transformer (Vaswani et al. 2017), but before a feature map can be passed to this, the model learns an intermediary representation of the image data using a codebook, as described at tinyurl.com/2vm3t9r8. This is a fixed-size table of embedding vectors that is learned by the model. The intermediary stage is necessary because transformers scale quadratically with the length of their input sequence, making even a 224 x 224 pixel image above the processing capacity of most GPUs. CLIP-guided VQGAN is described in (Crowson et al. 2022), and various notebooks for CLIP-guided VQGAN have been implemented, with a list of ten given here: ljvmiranda921.github.io/notebook/2021/08/11/vqgan-list/
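To make the codebook step concrete, the following is a minimal illustrative sketch (our own, in PyTorch, and not the VQGAN implementation itself) of quantising a continuous feature map against a learned embedding table, producing the short sequence of discrete tokens that the transformer then operates over. The tensor shapes and codebook size are illustrative assumptions.

import torch

def quantise(feature_map, codebook):
    # feature_map: (batch, channels, height, width) continuous encoder features
    # codebook:    (num_entries, channels) learned table of embedding vectors
    b, c, h, w = feature_map.shape
    flat = feature_map.permute(0, 2, 3, 1).reshape(-1, c)   # one vector per location
    distances = torch.cdist(flat, codebook)                 # distance to every entry
    indices = distances.argmin(dim=1)                       # nearest codebook entry
    quantised = codebook[indices].reshape(b, h, w, c).permute(0, 3, 1, 2)
    return indices.reshape(b, h, w), quantised

# An image encoded to a 16 x 16 feature map becomes a sequence of just 256
# tokens for the transformer, rather than tens of thousands of pixels, which
# is what keeps the quadratic cost of attention manageable.
codebook = torch.randn(1024, 256)        # e.g. 1024 entries of dimension 256
features = torch.randn(1, 256, 16, 16)   # encoder output for one image
tokens, z_q = quantise(features, codebook)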
@artbhot Implementation and Deployment

Twitter bots are usually small, autonomous programs running on a server, which regularly produce and tweet outputs composed of texts, images, animations and/or music/audio compositions, as described in (Veale and Cook 2018). More advanced bots can respond to replies on Twitter and/or to tweets if they are hashtagged appropriately. Our Twitter bot, @artbhot, is currently only reactive, in that it is used as a service: people tweet text prompt requests at it, and it responds with a reply comprising an image that (hopefully) reflects the prompt, and a repetition of the prompt.

@artbhot comprises two parts: the generative process, which is provided by CLIP-guided VQGAN; and code which enables it to interact with the Twitter API. The implementation is hosted on a remote server which runs 24 hours a day, so users can access image generation capabilities on demand. Users can read instructions on how to use the bot from a document linked in the bio section of @artbhot's Twitter page.
Figure 2: Generated images for prompts. Top row: "Steampunk morocco, concept art"; "[weather emojis]"; "Aliens invading Newcastle Upon Tyne"; "Pythagoras killing his student because the square root of 2 is irrational". Middle row: "A positive lateral flow test"; "Waiting for the bot"; "Wake up @artbhot"; "The Scribe, sitting in her throne. Deviant art character illustration". Bottom row (all): "A 35mm analog film photo of an alchemists lab in the distant future".
These instructions include how to communicate with the bot using the following tweet format:

@artbhot #makeme prompt text
(e.g. @artbhot #makeme an oil painting of a burger).

Every 15 seconds, the bot code checks for new tweets in this format from any user, using the Python Twitter API. Once found, the prompt text is extracted, processed and either used as input for a CLIP-guided VQGAN process, or rejected for containing any prohibited words. This cross-referencing of the prompt against a list of prohibited words aims to keep the experience of using the bot as friendly as possible. If a prohibited word is found, a textual reply is automatically generated and sent to the user as a reply to their tweet, asking them to try again. The processing performed by @artbhot for a given input tweet is portrayed in figure 1(a). If an image is generated, it is then sent to the user via the Twitter API as a reply to their initial tweet, with a reminder of the prompt they used (this is to ensure that the prompt text follows the generated image in the case where a bot reply is shared on Twitter without the original tweet from the user to provide context). An example user interaction on Twitter with @artbhot is given in figure 1(b).

The first iteration of @artbhot incorporated CLIP-guided BigGAN for image generation, as this model was one of the best CLIP-guided GANs available to the public. This was a local version of the code released in the Big Sleep Colab notebook, installed on our server. Later, an implementation of CLIP-guided VQGAN was released (github.com/nerdyrodent/VQGAN-CLIP). On experimenting with this text-to-image generator, we found that the output from the newer model showed improvements in multiple ways. Firstly, almost no images were outright failures from VQGAN, in the way that BigGAN regularly generated blank or highly noisy/textured uninterpretable images. Also, the fidelity of the image to the prompt was usually much better, and there was much less visual indeterminacy (Hertzmann 2020), making the images more coherent from VQGAN than from BigGAN. For these reasons, we replaced BigGAN in @artbhot with VQGAN. The top two rows of figure 2 show 8 example images generated in response to tweets sent to it, which we refer to in the next subsection.
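The request-handling loop can be sketched as follows. This is a simplified illustration rather than @artbhot's actual code: it assumes the tweepy library as the Python Twitter client, a hypothetical generate_image(prompt) function wrapping the CLIP-guided VQGAN process, and a local banned_words.txt file holding the prohibited words; the credentials are placeholders and error handling is omitted.

import time
import tweepy

# Placeholder credentials; the deployed bot reads its keys on the server.
CONSUMER_KEY, CONSUMER_SECRET = "...", "..."
ACCESS_TOKEN, ACCESS_TOKEN_SECRET = "...", "..."

api = tweepy.API(tweepy.OAuth1UserHandler(
    CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET))

with open("banned_words.txt") as f:   # assumed local list of prohibited words
    banned = {w.strip().lower() for w in f if w.strip()}

last_seen_id = 1
while True:
    # Look for new tweets addressed to the bot in the #makeme format
    for tweet in api.search_tweets(q="@artbhot #makeme", since_id=last_seen_id):
        last_seen_id = max(last_seen_id, tweet.id)
        prompt = tweet.text.split("#makeme", 1)[-1].strip()
        if any(word in banned for word in prompt.lower().split()):
            # Prohibited word found: ask the user to try again
            api.update_status(
                status=f"@{tweet.user.screen_name} Sorry, please try a different prompt.",
                in_reply_to_status_id=tweet.id)
            continue
        image_path = generate_image(prompt)  # hypothetical CLIP-guided VQGAN wrapper
        media = api.media_upload(image_path)
        # Reply with the generated image and a reminder of the prompt used
        api.update_status(
            status=f"@{tweet.user.screen_name} {prompt}",
            media_ids=[media.media_id],
            in_reply_to_status_id=tweet.id)
    time.sleep(15)   # poll every 15 seconds, as described above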
A Preliminary Evaluation

We plan to make @artbhot open to the public in 2022, after some additional implementation described under future work below. Before this, we have made it available to a user group of 16 people. It has been running for 5 months and has processed over 600 tweets, taking, on average, around 2 minutes for a user to receive an image in response to their tweet. While there have been no outright failures where images don't reflect the prompt at all, after an informal evaluation (by ourselves) of the most recent 100 replies to Twitter prompts, we found that 16% of the images were not visually coherent enough to reflect the prompt satisfactorily. Two examples of this can be seen on the left of row two in figure 2, with neither properly reflecting its prompt ("A positive lateral flow test" and "Waiting for the bot" respectively). Generally, the images that are less successful have a high degree of visual indeterminacy (Hertzmann 2020), making it difficult to interpret the content of the image and how it may be associated with the tweet text. Other factors for relative failure include content that is off topic, inaccurate colours for the subject matter, or image content that is too small and/or off-centre. We do acknowledge, however, that this is a subjective evaluation and that other opinions may differ regarding interpretations of image content.

We found that @artbhot was able to handle unexpected prompts, for instance ones containing emojis. As per the second image in the first row of figure 2, CLIP-guided VQGAN interpreted the weather emojis correctly and produced an image with sun and clouds. Diversity was also a concern, as users would expect a variety of images for similar prompts. We asked four users to each use the prompt "a 35mm analog film photo of an alchemists lab in the distant future", with the resulting images portrayed in the bottom row of figure 2. We see that there is some diversity, but perhaps not enough to be satisfying, and this is something we hope to improve upon, probably with automated augmentation/alteration of prompts.

Overall, the interactions users have had with @artbhot have been playful and casual, with people feeling free to try out all manner of interesting and unusual prompts, often trying to stretch the bot past its limitations. The qualitative responses we've gathered have been largely positive, with people reporting that they have used it for amusement, entertainment and conversation, but wish it would return images faster, as attention can wane. We noticed some trends in the kinds of prompts users sent, including: referring to the bot itself (see the middle row of figure 2); setting moods or styles such as steampunk (first image of the top row); setting up imaginary or historical scenes such as aliens over cityscapes or Pythagorean murders (top row, right); and asking for design inspiration (final image on the middle row). One user wanted longer interactions with @artbhot, in particular to ask it to enhance images and to combine their prompts/images with those from friends.

Conclusions and Future Work

Text-to-image Colab notebooks are very popular, and initial responses to @artbhot suggest that it would also be very popular on Twitter. Unfortunately, it is beyond our computational resources to provide GPU processing to anyone on Twitter who tweets a prompt. Moreover, as predicted in (Colton et al. 2021), there seems little doubt that consumer text-to-image generation services will become available soon, and will likely find their way into products such as Adobe's Creative Suite eventually. For these reasons, we are interested in offering more than a service which fulfils image generation requests, as @artbhot currently does. Instead, we will open up @artbhot so that it can receive tweets from any member of the public (which it currently does not), and select a few tweets each day to reply to that have the highest potential for a meaningful, creative and thought-provoking interaction with the user. Once a user is selected, this longer interaction with @artbhot may take the form of a string of iterations on an image, with the user employing an 'evolvethis' hashtag to repeatedly evolve the image with new prompts. It may also take the form of merging several tweets into a prompt that is then used to generate an image, using a 'mergethis' hashtag. In this way, the user will still feel in control of the process, but will receive innovative and surprising output as the bot takes on more autonomy.
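As a first illustration of how these planned commands might be recognised, the sketch below parses a reply for the 'evolvethis' and 'mergethis' hashtags and returns a (command, payload) pair for the relevant generation routine to act on. Neither behaviour is implemented yet; the function and its dispatch scheme are hypothetical.

# Hypothetical parsing of the planned interaction hashtags described above.
def parse_interaction(tweet_text, linked_prompts):
    text = tweet_text.strip()
    lowered = text.lower()
    if "#evolvethis" in lowered:
        # Iterate on the previously generated image using the new prompt
        new_prompt = text[lowered.index("#evolvethis") + len("#evolvethis"):].strip()
        return ("evolve", new_prompt)
    if "#mergethis" in lowered:
        # Combine the prompts from several linked tweets into a single prompt
        return ("merge", ", ".join(linked_prompts))
    return ("makeme", text)   # fall back to the existing #makeme behaviour

print(parse_interaction("#evolvethis make it more steampunk", []))
# ('evolve', 'make it more steampunk')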
On responding to the chosen prompts, we plan for @artbhot to apply a range of generative techniques and appeal to a number of computational creativity theories and practices. These include (on the text side) fictional ideation, humour, narrative generation, poetry, etc., and (on the imagery side) style transfer, animations, and visual stories. @artbhot will employ framing and explainable computational creativity techniques (Llano et al. 2020) to get users to look more closely at its ideas and creations. We further aim to enable @artbhot to learn from feedback, so as to be more interesting and engaging for users. We also aim to encourage conversation and collaboration with users, to ultimately generate pieces deemed to be artworks rather than just imagery reflecting text. To do this, we will need to utilise existing evaluation techniques from casual creators (Compton and Mateas 2015) and computational creativity in general, and to develop new ones specific to the project. We will also need to implement more advanced artistic image generation techniques. We have already taken first steps in this direction by writing software which takes animations from @artbhot and makes a large collaged animation (as per figure 3) for an exhibition8 at the Pablo Gargallo Museum in Zaragoza, Spain, celebrating the life and work of Nobel laureate Santiago Ramon y Cajal.

Figure 3: Exhibition piece: Pericellular Nests

8 zaragoza.es/sede/servicio/cultura/evento/232731
Author Contributions
AS is lead author; SC is second author and contributed to writing, supervision, reviewing & editing. Both AS and SC contributed to the concept of @artbhot; AS developed and evaluated @artbhot. SC implemented initial interactions with CLIP + BigGAN, while AS implemented initial interactions with CLIP + VQGAN.
Acknowledgments
We would like to thank Dr. Mike Cook for his support. We
would also like to thank Ryan Murdock, Katherine Crowson
and Nerdy Rodent for their groundbreaking contributions in
pioneering CLIP-guided image generation. We would like to
thank the anonymous reviewers for their helpful comments,
and also thank the participants of the pilot study from the
GSIG Discord group.
References
Agnese, J.; Herrera, J.; Tao, H.; and Zhu, X. 2019. A Survey
and Taxonomy of Adversarial Neural Networks for Text-to-Image
Synthesis. arXiv: 1910.09399.
Bisong, E. 2019. Google Colaboratory. In Building Machine Learn-
ing & Deep Learning Models on Google Cloud Platform. Springer.
Brock, A.; Donahue, J.; and Simonyan, K. 2019. Large Scale
GAN Training for High Fidelity Natural Image Synthesis. arXiv:
1809.11096.
Cartuyvels, R.; Spinks, G.; and Moens, M.-F. 2021. Discrete and
continuous representations and processing in deep learning: Look-
ing forward. AI Open 2:143–159.
Colton, S.; Smith, A.; Berns, S.; Murdock, R.; and Cook, M. 2021.
Generative search engines: Initial experiments. In Proceedings of
the International Conference on Computational Creativity.
Compton, K., and Mateas, M. 2015. Casual creators. In Proceed-
ings of the International Conference on Computational Creativity.
Crowson, K.; Biderman, S.; Kornis, D.; Stander, D.; Hallahan, E.; Castricato, L.; and Raff, E. 2022. VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance. arXiv: 2204.08583.
Esser, P.; Rombach, R.; and Ommer, B. 2021. Taming Transform-
ers for High-Resolution Image Synthesis. arXiv: 2012.09841.
Hertzmann, A. 2020. Visual Indeterminacy in GAN Art. Leonardo
53(4):424–428.
Llano, M. T.; d'Inverno, M.; Yee-King, M.; McCormack, J.; Il-
sar, A.; Pease, A.; and Colton, S. 2020. Explainable Computa-
tional Creativity. In Proceedings of the International Conference
on Computational Creativity.
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agar-
wal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.;
and Sutskever, I. 2021. Learning Transferable Visual Models From
Natural Language Supervision. arXiv: 2103.00020.
Smith, A., and Colton, S. 2021. CLIP-Guided GAN Image Gener-
ation: An Artistic Exploration. In Proceedings of the EvoMusArt
conference.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is All you Need. In Advances in NeurIPS.
Veale, T., and Cook, M. 2018. Twitterbots: Making Machines that
Make Meaning. MIT Press.