This subject covers prompt engineering for multimodal models that process and generate combinations of text, images, and other media. Learners design prompts for vision-language models, text-to-image systems such as DALL-E and Midjourney, and cross-modal reasoning tasks that link visual and textual information.
Upon completion of this subject, learners will be able to describe how multimodal models represent and connect text and images; write precise prompts for generating and editing images with systems such as DALL-E and Midjourney; craft instructions for models that answer questions about images or combine visual and textual inputs; and design workflows that leverage multimodal capabilities for analytics, design, and education.
This topic focuses on practical prompt engineering for text-to-image systems. It explains how to structure prompts with a clear subject, modifiers for style (e.g., photorealistic, watercolor, cyberpunk), composition details (e.g., close-up, wide shot, bird’s-eye view), environmental context, and post-processing attributes (e.g., depth of field, color grading). Using DALL-E and Midjourney as reference products, the topic shows common syntax patterns and model-specific features such as aspect ratio settings or stylistic tags. Learners see how small prompt changes significantly alter results, and practice iterative refinement by adjusting wording, adding constraints, or providing reference images. The topic discusses use cases in design, marketing, storyboarding, and education, as well as limitations such as rendering legible text within images, depicting faces accurately, and handling copyrighted characters.
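As a concrete illustration, here is a minimal sketch of assembling a structured prompt from the elements above and sending it to an image-generation API. It assumes the official openai Python SDK; the model name, size value, and the build_prompt helper are illustrative choices, not a fixed course requirement.

```python
# Minimal sketch: structured text-to-image prompting.
# Assumes the openai Python SDK; model/size values and the
# build_prompt helper are illustrative assumptions.
from openai import OpenAI

def build_prompt(subject, style, composition, environment, post):
    """Assemble a prompt from the structural elements discussed above."""
    return ", ".join([subject, style, composition, environment, post])

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = build_prompt(
    subject="a lighthouse on a rocky coast",
    style="photorealistic",
    composition="wide shot, rule of thirds",
    environment="stormy dusk sky",
    post="shallow depth of field, cinematic color grading",
)

response = client.images.generate(
    model="dall-e-3",   # illustrative model name
    prompt=prompt,
    size="1024x1024",   # aspect ratio set via size here; Midjourney
                        # expresses the same idea with an --ar flag
    n=1,
)
print(response.data[0].url)
```

Keeping the prompt elements as separate arguments makes iterative refinement cheap: each revision changes one slot (style, composition, etc.) while the rest stays fixed.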
This topic addresses prompts that mix visual and textual cues. Learners design prompts for tasks such as describing what is happening in an image, extracting text from screenshots, interpreting graphs and charts, answering natural language questions about documents, and comparing multiple images. The topic explains how to reference images in prompts (e.g., numbered attachments), how to ask for step-by-step visual reasoning, and how to request structured outputs summarizing visual content. It illustrates typical failure modes, such as miscounting objects or misreading small text, and shows how follow-up prompts can correct or clarify. The topic also touches on advanced scenarios, such as multimodal tutoring systems that show a learner a diagram and guide them through problem-solving by referencing visual elements explicitly.
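A hedged sketch of what such a mixed prompt can look like in code, using the OpenAI chat API's image-input message format; the model name and image URL are placeholders, and the requested JSON keys are illustrative:

```python
# Sketch: asking a vision-language model about an attached image and
# requesting structured output. Assumes the openai Python SDK;
# model name, URL, and JSON keys are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative vision-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Image 1 is attached. Describe the chart step by "
                          "step, then return JSON with keys 'title', "
                          "'x_axis', 'y_axis', and 'main_trend'.")},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Note how the prompt names the attachment ("Image 1"), asks for step-by-step reasoning, and then pins the answer to a structured schema, mirroring the three prompting techniques this topic covers.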
This topic introduces multimodal models that accept both text and images as input and can output text, images, or both. It explains how convolutional networks, vision transformers, or CLIP-like encoders convert images into vector representations and how these visual embeddings are combined or aligned with text embeddings. The topic describes common tasks including image captioning, where models generate descriptions of images; visual question answering, where models answer questions about image content; and visual reasoning, where models perform comparison, counting, or interpretation. Learners also see how general-purpose multimodal LLMs such as GPT-4V or Gemini unify these capabilities. The topic highlights application areas including accessibility (describing images for visually impaired users), document understanding, e-commerce, and creative media workflows.
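To make the encoder-and-alignment idea concrete, here is a short sketch that computes CLIP image and text embeddings and scores their similarity, using Hugging Face's transformers library; the checkpoint name, image path, and candidate captions are assumptions for illustration:

```python
# Sketch: aligning image and text embeddings with a CLIP-like encoder.
# Uses Hugging Face transformers; the checkpoint, image path, and
# captions are illustrative assumptions.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder local image
captions = ["a dog playing in the park", "a city skyline at night"]

inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores in the shared
# embedding space; softmax turns them into a distribution over captions.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
```

The same shared embedding space that lets CLIP rank captions here is what general-purpose multimodal LLMs build on when they caption, answer questions about, or reason over images.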
This topic bridges design literacy and prompt engineering. It explains fundamental visual concepts including rule of thirds, leading lines, framing, depth, color harmony, and contrast, and shows how to encode these concepts in prompts. Learners practice describing desired compositions (e.g., 'portrait, subject centered, shallow depth of field, soft natural light from the left'), and examine how models interpret such language. The topic highlights that effective image prompting requires both linguistic precision and a mental model of visual aesthetics. It covers edge cases where models misinterpret instructions and shows how layering more explicit constraints or descriptive language can resolve ambiguity. The topic also briefly addresses accessibility and inclusivity in visual design, such as avoiding stereotypical imagery and representing diverse users.
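As a small sketch of the "layering constraints" idea, the snippet below starts from a base prompt and appends increasingly explicit composition language; the constraint phrases are illustrative examples drawn from the concepts above, not a fixed taxonomy:

```python
# Sketch: layering explicit composition constraints onto a base prompt.
# The constraint phrases are illustrative, not an official vocabulary.
base = "portrait of a violinist"

refinements = [
    "subject centered",                            # framing
    "shallow depth of field",                      # depth
    "soft natural light from the left",            # lighting
    "complementary color palette, high contrast",  # color harmony / contrast
]

prompt = base
for constraint in refinements:
    prompt += ", " + constraint
    print(prompt)  # inspect each layered refinement before sending it on
```

Printing each intermediate prompt mirrors the classroom exercise: generate at each step, see which instruction the model misreads, and add the next, more explicit constraint only where ambiguity remains.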
This topic focuses on ethical and safety concerns specific to multimodal AI. It discusses how generative models may reinforce stereotypes through visual outputs, overrepresent certain demographics, or underrepresent others. Learners examine misuse vectors such as deepfake imagery, synthetic obscene or violent content, and deceptive advertising. The topic covers safety filters and content policies imposed by providers, and explains why some prompts are refused. Learners are encouraged to design prompts that respect privacy, avoid reproducing trademarked or copyrighted assets, and prevent harmful scenarios. The topic also addresses disclosure practices, such as watermarking or labeling AI-generated images, and emerging regulatory frameworks that may govern generative imagery.
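Providers typically surface a policy refusal as an API error rather than silently returning an image. Below is a minimal sketch of handling that case, assuming the openai Python SDK; the exact error class, status code, and message wording vary by provider and SDK version:

```python
# Sketch: handling a provider safety refusal gracefully.
# Assumes the openai Python SDK; error details vary by provider/version.
from openai import OpenAI, BadRequestError

client = OpenAI()

try:
    response = client.images.generate(
        model="dall-e-3",  # illustrative model name
        prompt="a photorealistic portrait of a named public figure",
        size="1024x1024",
    )
    print(response.data[0].url)
except BadRequestError as err:
    # Content-policy rejections surface as 400-level errors; log them
    # and ask the user to revise rather than retrying the same request.
    print(f"Request refused by the provider's safety filters: {err}")
```

Treating refusals as a normal, recoverable outcome, rather than a bug to route around, is the design stance this topic encourages.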