In this blog, we demonstrate how multiple deep learning models can be combined to create a speech-to-image solution. We used the Stable Diffusion model together with the Amazon Transcribe and Amazon Translate services. The workflow starts when an MP3 recording is uploaded to S3, which triggers a Lambda function that calls Amazon Transcribe for speech-to-text conversion. Amazon Translate then translates the transcribed text to English, and a Hugging Face model deployed on SageMaker is invoked to generate an image from the translated text. The resulting image is saved to a dedicated sub-folder in the bucket.
Text-to-image generation in artificial intelligence is the task of generating images from a given text prompt. This requires advanced natural language processing and image generation techniques. While still an active area of research, it is a major field in AI. Our solution uses Amazon Transcribe for speech-to-text conversion, Amazon Translate for text translation, and the Stable Diffusion model for image generation. Stable Diffusion is a latent text-to-image diffusion model developed by Stability AI and trained on a dataset of 5.85 billion CLIP-filtered image-text pairs (LAION-5B).
Once an MP3 recording is uploaded to the S3 bucket, a Lambda function calls Amazon Transcribe to transcribe the recording (speech-to-text), then calls Amazon Translate to translate the transcribed text into English (text-to-text), and finally invokes the Hugging Face model deployed on SageMaker to generate an image from the translated text (text-to-image). The generated image is uploaded back to a dedicated sub-folder in the same bucket.
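For reference, the S3 trigger that fires the Lambda on new MP3 uploads can be set up from the console or in code. Below is a minimal boto3 sketch of that configuration; the bucket name and function ARN are placeholders, not values from our setup.

import boto3

s3 = boto3.client('s3')

# Fire the Lambda whenever an object ending in .mp3 is created in the bucket
s3.put_bucket_notification_configuration(
    Bucket='my-speech-to-image-bucket',  # placeholder bucket name
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:speech-to-image',  # placeholder ARN
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [{'Name': 'suffix', 'Value': '.mp3'}]}}
        }]
    }
)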
Stable Diffusion is a generative model that produces photo-realistic images from any text input, enabling users to create striking art within seconds.
The Lambda function orchestrates the deep learning models and chains their responses together. Three main models are called from this Lambda: Amazon Transcribe, Amazon Translate, and the Hugging Face model hosted on SageMaker.
import json
import boto3

from transcribe import transcribe_mp3
from translate import translate_text
from hf_model import generate_images
from upload2s3 import upload_file

client = boto3.client('s3')

def lambda_handler(event, context):
    # parse out the bucket & file name from the event handler
    for record in event['Records']:
        file_bucket = record['s3']['bucket']['name']
        file_name = record['s3']['object']['key']
        object_url = 'https://s3.amazonaws.com/{0}/{1}'.format(file_bucket, file_name)

        # speech-to-text
        transcribed_text = transcribe_mp3(file_name, object_url)

        # text-to-text (translate to English)
        translated_text = translate_text(transcribed_text)

        # text-to-image
        generated_images = generate_images(translated_text, 2)

        # save each generated image locally, then upload it to the result/ prefix
        for i, img in enumerate(generated_images):
            img_name = f'{translated_text.replace(" ", "_").replace(".", "")}-{i}.jpeg'
            img_path = "/tmp/" + img_name
            img.save(img_path)
            upload_file(
                file_name=img_path,
                bucket='',  # enter the s3 bucket name
                object_name='result/' + img_name
            )

    return "lambda handled Successfully!"
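The upload_file helper imported from upload2s3 is not shown in the post; a minimal sketch of what it might look like, built on the boto3 S3 client, is:

import boto3

s3_client = boto3.client('s3')

def upload_file(file_name, bucket, object_name=None):
    # upload a local file to the given bucket under object_name (defaults to the file name)
    if object_name is None:
        object_name = file_name
    s3_client.upload_file(file_name, bucket, object_name)
    return object_name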
Amazon Transcribe is used for automatic speech recognition. It uses deep learning models to convert audio to text.
The code uses the boto3 library, which allows us to access AWS services from Python. It is as follows:
import json
import time
from urllib.request import urlopen

import boto3

transcribe = boto3.client('transcribe')

def transcribe_mp3(file_name, object_url):
    # Transcribe job names must be unique; derive one from the uploaded file name
    job_name = file_name.replace('/', '')[:10]

    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        LanguageCode='ar-AE',
        MediaFormat='mp3',
        Media={'MediaFileUri': object_url}
    )

    # poll until the transcription job completes or fails
    while True:
        status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
            break
        print('In Progress')
        time.sleep(5)

    # download the transcript JSON and pull out the transcribed text
    load_url = urlopen(status['TranscriptionJob']['Transcript']['TranscriptFileUri'])
    transcript = json.load(load_url)
    text = str(transcript['results']['transcripts'][0]['transcript'])
    return text
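For context, the transcript file that Amazon Transcribe produces is a JSON document shaped roughly like the following (values shown are illustrative), which is why the code above reads results.transcripts[0].transcript:

{
    "jobName": "myrecording",
    "accountId": "123456789012",
    "results": {
        "transcripts": [
            {"transcript": "..."}
        ],
        "items": ["..."]
    },
    "status": "COMPLETED"
}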
Amazon Translate is used to make the application accessible to users around the world. It is a neural machine translation service that uses deep learning models to deliver fast, high-quality language translation. The implementation uses the boto3 library to connect the Lambda code to the Amazon Translate service, and the English translation is saved in the prompt variable.
The code used is as follows:
import boto3

translate = boto3.client('translate')

def translate_text(text):
    result = translate.translate_text(
        Text=text,
        SourceLanguageCode="auto",
        TargetLanguageCode="en"
    )
    prompt = result["TranslatedText"]
    return prompt
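As a quick illustration (the Arabic sample sentence and its output are our own, not from the post), the helper can be called like this:

prompt = translate_text("قطة تقرأ كتابا في المكتبة")
print(prompt)  # e.g. "A cat reading a book in the library"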
To deploy the Hugging Face model, you can follow the steps provided by Phil Schmid in his blog post entitled “Stable Diffusion on Amazon SageMaker”. Once the model is deployed, note the SageMaker endpoint name; the Lambda code invokes this endpoint to generate an image from the translated text.
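For orientation, the deployment step in that blog uses the SageMaker Python SDK; a rough sketch is shown below, where the model artifact path, IAM role, framework versions, and instance type are placeholders to be replaced with the values from Phil Schmid's walkthrough.

from sagemaker.huggingface.model import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    model_data="s3://<your-bucket>/model.tar.gz",   # placeholder: model artifact with the inference code
    role="<your-sagemaker-execution-role>",         # placeholder IAM role
    transformers_version="4.17",                    # illustrative versions; use the ones from the blog
    pytorch_version="1.10",
    py_version="py38",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # a GPU instance is needed for Stable Diffusion
)
print(predictor.endpoint_name)  # this is the ENDPOINT_NAME used by the Lambda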
The code to invoke the SageMaker Endpoint is as follows:
import base64
import json
from io import BytesIO

import boto3
from PIL import Image

ENDPOINT_NAME = ''  # enter your ENDPOINT_NAME
runtime = boto3.client('sagemaker-runtime')

# helper decoder
def decode_base64_image(image_string):
    base64_image = base64.b64decode(image_string)
    buffer = BytesIO(base64_image)
    return Image.open(buffer)

def generate_images(prompt, num_images_per_prompt):
    data = {
        "inputs": prompt,
        "num_images_per_prompt": num_images_per_prompt
    }
    payload = json.dumps(data, indent=2).encode('utf-8')
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='application/json',
        Body=payload
    )
    response_decoded = json.loads(response['Body'].read().decode())
    decoded_images = [decode_base64_image(image) for image in response_decoded["generated_images"]]
    return decoded_images
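For completeness, a hypothetical call from the Lambda looks like this; the prompt is illustrative, and the returned objects are PIL images that can be saved and uploaded as shown earlier:

images = generate_images("A cat reading a book in the library", 2)
images[0].save("/tmp/example-0.jpeg")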