Gemini 1.5 Pro can see, hear & read yet it still sucks at OCR

TL;DR: I tested all the major vision-capable models, and none of them can reliably understand where text appears on the screen. Here are my notes.




Now that Google has finally published API access to Gemini 1.5 Pro on Vertex AI, I was eager to get back to a task I've been struggling with* since Christmas 2021: extracting messages from imperfect screenshots of iMessage, Tinder, Instagram, and the like.

*It's a hobby side project, so I haven't spent actual years working on it. But to be fair, I have put in a lot of time since then testing various ideas.*

While I mainly focused on Gemini, I also ran parallel prompts against GPT-4 Turbo and Claude 3 Opus for comparison. The observations apply to all of them.

The problem is that the screenshots are very random. They can contain many unwanted elements (the mobile carrier name, UI elements), may not be perfectly cropped, and the app's design may have changed over time, adding or removing details.

This is exactly where powerful LLMs such as Gemini 1.5 Pro bring big promise, and thus set big expectations, for processing general unstructured information.


To complete this job successfully, the output, at a bare minimum, needs to contain:

  • the message body, including emojis,

  • the correct sender (either "us" or the opponent),

  • no text from other UI elements, only the actual messages of the conversation.
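It helps to pin these requirements down as a tiny schema before prompting. A minimal sketch in Python; the field names and the `is_valid` helper are my own, not any provider's API:

```python
from dataclasses import dataclass

# Hypothetical target schema for one parsed chat message.
@dataclass
class ChatMessage:
    sender: str  # "us" or "opponent"
    body: str    # message text, emojis included

def is_valid(msg: ChatMessage) -> bool:
    # Reject anything not attributed to one of the two parties,
    # which filters out carrier names and other UI text.
    return msg.sender in ("us", "opponent") and bool(msg.body.strip())

msgs = [ChatMessage("us", "See you at 8 🙂"), ChatMessage("carrier", "Vodafone LTE")]
clean = [m for m in msgs if is_valid(m)]  # drops the UI noise
```

Having a strict schema also gives you something mechanical to validate model output against, instead of eyeballing it.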

My observation notes

Saturday, April 13. I spent the last four hours integrating and testing Gemini 1.5 Pro to run OCR on screenshots. The approach was a detailed prompt describing how to extract messages, with one request per image. That way, no matter how many screenshots (that is, how many tokens) a conversation contains, they can be processed one by one.
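The loop itself is trivial; all the difficulty lives in the prompt and the model. A sketch of the one-request-per-image approach, where `call_model` is a hypothetical stand-in for whichever SDK call you actually use (Vertex AI, OpenAI, ...), taking a prompt and one image and returning the messages extracted from that screenshot:

```python
# Prompt text is illustrative, not the exact one I used.
EXTRACTION_PROMPT = (
    "Extract every chat message from this screenshot. "
    "For each message, report the sender ('us' or 'opponent') and the body, "
    "including emojis. Ignore all other UI text."
)

def parse_conversation(screenshots, call_model):
    messages = []
    for image in screenshots:
        # One request per image, so conversation length never
        # runs into context-window limits.
        messages.extend(call_model(EXTRACTION_PROMPT, image))
    return messages
```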

I spent a lot of time testing, attempting various angles, and feeling really frustrated because it was not successful, and the time, like the Titanic, is gone.


My first approach was to ask it whether a message is aligned to the left or to the right (padded on the right or on the left, respectively). This is how we intuitively tell who sent it, and it works even with the black & white filter I love using on my phone.

One thing that was not immediately obvious, but that I found through debugging, is that I had to explicitly tell it to judge from the viewer's perspective. Otherwise it would flip/mirror the sides too often, even with temperature set to 0.
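With the perspective fix in the prompt, mapping alignment to sender becomes a one-liner on my side. A sketch, with my own label names:

```python
def sender_from_alignment(alignment: str) -> str:
    # Alignment is what the model reports, judged from the
    # viewer's perspective (this must be stated in the prompt).
    if alignment not in ("left", "right"):
        raise ValueError(f"unexpected alignment: {alignment!r}")
    return "us" if alignment == "right" else "opponent"
```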

This proved to work rather great-ish: about 7.5 times out of 10.

The model is consistently inconsistent. This is especially common when messages vary in length but are still short enough to fit on a single line in the UI. It shifts its judgment of whether a message appears on the right or the left, resulting in poor parsing quality.

OpenAI has mentioned that their vision model struggles with spatial reasoning. It seems to me that no matter the provider, they all struggle as a team.


In the context of an iMessage screenshot, I thought of asking about the color of the message bubble, since the opponent always appears in gray. However, when the model fails to detect the alignment, it also hallucinates the color: it will reply that the bubble is gray even though the message appears on the right in a blue bubble.

I also noticed that in some requests it outputs colors that don't exist, or describes the image as if it were black and white. For example, Gemini sometimes says a message appears in a white bubble on a green app background, which never happened in any image. Super strange.

In the remaining 4 out of 10 cases, it thinks everything is gray. So either the model is hallucinating, or Google is performing some kind of undisclosed pre-processing.

Furthermore, once in a while the model would spit out complete bullshit, mentioning colors that do not appear at all, or making it seem as if the image had been further pre-processed and turned monochrome. If any of the providers are doing this, failing to communicate it is a very unprofessional move.

The problem with the bubble-color rule is that, in certain cases, emojis are shown as tiny images with no background color at all. A hard rule like that would not work.
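What helped a little was treating alignment and bubble color as two weak signals and cross-checking them, rather than trusting either alone. A sketch of that idea; the function and labels are my own, and `None` stands for the no-bubble emoji case:

```python
def resolve_sender(alignment, bubble_color):
    # Returns (sender, needs_review). Alignment is "left"/"right";
    # bubble_color is "blue", "gray", or None for bare emojis.
    by_alignment = "us" if alignment == "right" else "opponent"
    if bubble_color is None:
        # Standalone emoji: no bubble to judge, fall back to alignment.
        return by_alignment, False
    by_color = "us" if bubble_color == "blue" else "opponent"
    # When the two signals disagree, one of them is hallucinated;
    # flag the message for human review instead of guessing.
    return by_alignment, by_alignment != by_color
```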


Let's talk about 'em, the emojis. The model is very good at parsing text but only 50-50 at extracting and parsing emojis. They're often not the ones that actually appear, and they often get missed entirely.

To my surprise, and perhaps to yours too, no company seems to have worked on extracting emojis except for The Hive AI. They have the best models I've seen hands-on. However, they struggle with "I"s: these get missed a lot or become a vertical bar instead. Crazy.

Mirroring issue

Another issue worth describing is mirroring: the model tends to flip the image in the middle of processing it.

If a short message from the opponent is followed by a long message from "us", it will assign "our" message to the opponent and keep assigning the wrong sender for the rest of the parse.

If it did that consistently, all I would need to do is flip the labels. But no, it flips mid-conversation, haha.
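Still, the "flip it over" idea is salvageable if you have a second signal to compare against (say, color-derived labels): flip everything only when the disagreement is global, and escalate partial flips to a human. A sketch of that check, with hypothetical names:

```python
def reconcile(model_senders, reference_senders):
    # model_senders: labels from the model's alignment judgment.
    # reference_senders: labels from an independent signal (e.g. bubble color).
    flip = {"us": "opponent", "opponent": "us"}
    if model_senders == reference_senders:
        return model_senders
    if [flip[s] for s in model_senders] == reference_senders:
        # Globally mirrored: the trivial fix applies, flip everything.
        return reference_senders
    # Flipped mid-conversation: no mechanical fix, send to human review.
    return None
```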

Google AI Studio Playground

Furthermore, using Google AI Studio, I discovered that when the first request in a chat includes an image, subsequent requests in that chat do not actually process the image again.

I found this while trying to debug by asking the model why it thought a message was not blue and not aligned to the right when it was. It would spit out an answer based on an image description and on assumptions about how messages typically appear. It's quite obvious that all it gets is a description of the once-processed image.

This is rather frustrating, and I blame Google for it. I've never seen this communicated, and if it is, it's definitely not communicated clearly.

I do read through documentation, and I'd like to think I only miss a minority of details.

On the other hand, I was cheating too: using a VPN to access the otherwise geo-blocked playground (hence my waiting for Vertex AI). I guess that means we're even: 1-1.

Not ready for now

For now I conclude that, although it is promising and exciting to parse information from screenshots that are otherwise not text-searchable, it is not implementable without a human in the loop and an interface for review. And at that point, it is only marginally faster than a human alone.

To close with the opening,

powerful LLMs bring big promise and set high expectations for processing general unstructured information, but they fail to execute it consistently.

On the bright side, this is how I build my knowledge. Much more fun than reading books.

That being said, Elon Musk has introduced Grok 1.5 with vision capabilities. I would bet that model has to be good if he wants to use it for self-driving Teslas.

Tips for working with LLMs

A few things that I found very useful and that you should know too.

  • Debug prompts by asking the model to explain what it sees, be it text or an image. It sees things differently than we do, so make no assumptions.

  • Run parallel queries to all providers and make sure the hyperparameters are the same. This way you can compare outputs and get ideas.

  • Ask the model to report the variables you want it to apply the logic to. For example, if a message counts as sent by us when it's in a blue bubble, ask what color the bubble is.

  • Save prompt versions and take breaks. It's easy to get lost in a rabbit hole and deviate from the main approach. Taking a break gives you fresh eyes and fresh ideas. A prescription of a 15-minute walk can do wonders.
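The parallel-queries tip can be pushed one step further into a majority vote across providers. A sketch, where `providers` maps a name to a hypothetical callable with the same `(prompt, image)` shape as before:

```python
from collections import Counter

def majority_sender(prompt, image, providers):
    # Ask every provider the same question with identical hyperparameters,
    # then keep the answer a strict majority agrees on.
    votes = Counter(call(prompt, image) for call in providers.values())
    label, count = votes.most_common(1)[0]
    # No strict majority: don't guess, return None for human review.
    return label if count > len(providers) / 2 else None
```

With three providers this tolerates one hallucinating model per message, at three times the API cost.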

That is it, my dear reader.
If you have any ideas, I would love to hear them.
