A mention in The Jakarta Post and a mention in Forbes are both real, independent, institutional mentions. But they do not carry the same weight in AI-generated answers.

This is uncomfortable to say. It feels unfair. But understanding it is essential if you want to build entity infrastructure that actually works across both local and global AI systems.

The reality is that AI models are trained predominantly on English-language data from internationally recognized sources. Local-language sources in Indonesian, Thai, Vietnamese, or Malay are underrepresented in training datasets. This creates a structural asymmetry: your local mentions are valuable for local search, but they contribute less to global AI visibility than equivalent mentions in international publications.

I wrote about this asymmetry in the Indonesia-specific context in Singapore vs Indonesia: The Entity Gap. Singaporean businesses appear more frequently in global AI answers not because they are better, but because their digital footprint is overwhelmingly in English and indexed by international systems. Indonesian businesses, even large ones, are underrepresented because their documentation is primarily in Bahasa Indonesia.

The mention weighting hierarchy

This diagram shows how AI models weight different types of mentions based on source type and reach. The higher in the hierarchy, the more influence the mention has on global AI answers.

```mermaid
graph TD
    A["Tier 1: Global Institutional<br/>Reuters, Forbes, Nature, IEEE<br/>Weight: Very High"] --> E["Global AI Models<br/>(ChatGPT, Gemini, Claude)"]
    B["Tier 2: International Industry<br/>Trade publications, conferences<br/>Weight: High"] --> E
    C["Tier 3: Regional English<br/>The Jakarta Post, Nikkei Asia<br/>Weight: Medium"] --> E
    C --> F["Regional AI Responses<br/>(localized queries)"]
    D["Tier 4: Local Language<br/>Kompas, Bogor Today, Detik<br/>Weight: Low for global AI"] --> F
    D --> G["Google Local Search<br/>(Maps, local pack)"]
    F --> E
    style A fill:#222221,stroke:#c8a882,color:#ede9e3
    style B fill:#222221,stroke:#c8a882,color:#ede9e3
    style C fill:#222221,stroke:#6b8f71,color:#ede9e3
    style D fill:#222221,stroke:#6b8f71,color:#ede9e3
    style E fill:#222221,stroke:#c8a882,color:#ede9e3
    style F fill:#222221,stroke:#6b8f71,color:#ede9e3
    style G fill:#222221,stroke:#6b8f71,color:#ede9e3
```

This is not a judgment about the quality of local journalism or the importance of local-language content. Kompas is a legitimate, respected publication. Bogor Today covers stories nobody else covers. These are real sources. The issue is purely technical: current AI training pipelines process English-language content from internationally indexed sources more thoroughly than local-language content from regional sources.

Why the asymmetry exists

Three factors drive this.

Training data composition. Common Crawl, which underpins most AI training datasets, is heavily skewed toward English-language content. Estimates vary, but English typically accounts for 40-60% of the crawled web, while Indonesian accounts for roughly 1-2%. This means Indonesian-language content starts from a far smaller pool, so any individual article is less likely to be well represented in training data, and the tools that process and filter that data are optimized for English.

Source authority scoring. AI training pipelines do not ingest all crawled data equally. They apply quality filters that weight sources based on various authority signals. International publications with long histories, extensive cross-referencing, and high domain authority score higher in these filters. Regional publications, even good ones, score lower because they have fewer inbound links, less international cross-referencing, and smaller digital footprints.

Entity graph density. International sources are more densely interconnected in knowledge graphs. When Forbes mentions a company, that mention links to other entities (people, places, industries) that are also well-represented in the graph. When a local Indonesian publication mentions a company, the surrounding entity context is often sparser. AI models learn entity relationships from these interconnections, so denser contexts produce stronger entity recognition.

What local mentions are still good for

None of this means local mentions are worthless. They serve different purposes.

Local mentions are essential for Google's local search ecosystem. Google Maps, the local pack, and geo-specific search results all draw heavily from local-language sources. If someone searches "pump supplier Bogor" on Google, your mentions in Bogor Today and local business directories directly influence your visibility.

Local mentions build the citation network that institutional sources draw from later. A journalist at The Jakarta Post checking your background will find your local coverage and use it as corroboration. Local mentions are stepping stones, not endpoints.

Local mentions also feed Perplexity's and Gemini's live retrieval systems. Unlike ChatGPT's training-data-based approach, Perplexity searches the live web when generating answers. If your entity appears in indexed local sources, Perplexity can find and cite them. This is a meaningful difference for entities in non-English markets.

I discussed this differentiation between AI systems in Why AI Does Not Mention Your Name. Each AI platform has different data sourcing, which means your mention strategy should be diversified, not concentrated on a single type.

The right mix for Indonesian businesses

If you are an Indonesian business trying to be visible in both local and global AI systems, you need a layered approach.

Layer 1: Local foundation. Get mentioned in local and regional Indonesian media. Chamber of commerce listings. Regional business directories. Local news coverage. This establishes your geographic entity context and supports local search visibility.

Layer 2: Regional English. Get mentioned in English-language regional publications. The Jakarta Post, Nikkei Asia, Tech in Asia, DealStreetAsia. These publications are indexed by international systems and included in AI training data at higher rates than Indonesian-language sources.

Layer 3: International institutional. Publish in or get cited by international trade publications, conference proceedings, or industry reports. This is the hardest layer to reach, but it carries the most weight for global AI visibility.

Layer 4: Structured databases. Wikidata, Crunchbase, OpenCorporates, ORCID. These are language-agnostic and feed directly into AI training data regardless of your primary language. I wrote about this extensively in the Indonesia AI landscape essay. Structured data bypasses the language barrier entirely.

The mistake most Indonesian businesses make is concentrating entirely on Layer 1 and wondering why they are invisible to global AI. The mistake international-focused businesses make is skipping Layer 1 and losing their local search foundation. You need all four layers, allocated based on your target market.

The language strategy

Here is the uncomfortable practical advice. If you want global AI to cite you, some of your content needs to be in English.

This does not mean abandoning Indonesian. It means being strategic about which content exists in which language. Your website can be bilingual. Your published articles can target both Indonesian and English-language publications. Your structured data (schema markup, Wikidata entries) should always include English labels and descriptions because that is what machines process most effectively.
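As a concrete illustration, here is a minimal bilingual Organization schema of the kind described above. The company name, address, and identifiers are all placeholders, not real entities; the point is that the `description` and machine-facing fields stay in English while `alternateName` can carry the Indonesian trading name, and `sameAs` links the page to language-agnostic databases like Wikidata and Crunchbase.

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Contoh Pumps Indonesia",
  "alternateName": "PT Contoh Pompa Nusantara",
  "description": "Industrial pump distributor serving West Java, Indonesia.",
  "address": {
    "@type": "PostalAddress",
    "addressLocality": "Bogor",
    "addressCountry": "ID"
  },
  "sameAs": [
    "https://www.wikidata.org/wiki/Q00000000",
    "https://www.crunchbase.com/organization/example-company"
  ]
}
```

The page itself can remain in Indonesian; this markup sits in the page head and gives machines an English-labeled, cross-referenced version of the entity regardless of the visible language.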

I run three companies from Bogor. My primary market is Indonesia. Most of my daily communication is in Indonesian. But my entity infrastructure is bilingual because I understand that the AI systems evaluating my entity are predominantly English-trained. This is not cultural surrender. It is technical pragmatism.

The Systems Thinking course I built covers this kind of strategic layering. Understanding which system you are optimizing for determines which inputs matter most. For local Google search, Indonesian-language content is the primary input. For global AI citation, English-language institutional presence is the primary input. Both matter. They serve different systems.

What changes over time

The current asymmetry is not permanent. AI companies are actively working to improve multilingual coverage. Google's Gemini processes Indonesian better than models did a year ago. Training datasets are getting more diverse with each iteration.

But "improving" is not "equal." The gap is narrowing slowly, and entities that build English-language presence now will retain their advantage even as multilingual capabilities improve. This is because entity maturation is cumulative. The signals you build today are in the training data of future models. Starting later means those signals are absent from more training cycles.

The practical takeaway: build for today's AI systems while anticipating tomorrow's improvements. That means investing in both local and international presence, not choosing one over the other.

Measuring where you stand

Test your entity across different AI platforms with queries in both English and Indonesian. Ask ChatGPT "Who distributes ALBIN Pumps in Indonesia?" in English and in Indonesian. Compare the answers. Ask Perplexity the same questions. Ask Gemini.

The differences in responses will tell you exactly where your mention gaps are. If AI answers your English queries correctly but not your Indonesian ones, your international presence is working but your local entity infrastructure needs attention. If neither language returns your entity, you need foundational work across the board.
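The diagnostic above reduces to a simple decision table, sketched below. The query pair is the example from the text; whether a given AI answer actually names your entity is still a manual judgment, and the fourth branch (found in Indonesian but not English) is the symmetric case the text implies rather than states.

```python
# Illustrative sketch of the bilingual entity diagnostic described above.
# Judging whether an AI platform's answer names your entity is manual;
# this helper only maps the two observations to a diagnosis.

# Hypothetical English/Indonesian query pair from the example above.
QUERY_PAIR = (
    "Who distributes ALBIN Pumps in Indonesia?",
    "Siapa distributor ALBIN Pumps di Indonesia?",
)

def classify_gap(found_in_english: bool, found_in_indonesian: bool) -> str:
    """Map the two manual observations to a diagnosis."""
    if found_in_english and found_in_indonesian:
        return "entity visible in both languages"
    if found_in_english:
        return "local entity infrastructure needs attention"
    if found_in_indonesian:
        return "international presence needs attention"
    return "foundational work needed across the board"

print(classify_gap(True, False))
```

Run the same query pair against each platform (ChatGPT, Perplexity, Gemini) and record a diagnosis per platform; the pattern across platforms is the real signal.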

This kind of diagnostic is part of the Entity Infrastructure services I offer. But you can start with the basic test described above. It takes fifteen minutes and tells you more about your entity status than any SEO tool.

Frequently Asked Questions

Do AI models completely ignore local-language mentions?

No. AI models do include local-language content in their training data. The issue is proportionality. English-language content is overrepresented in training datasets, which means local-language mentions contribute less to entity recognition in global AI models. For AI systems with live retrieval (Perplexity, Gemini with Search), local-language sources can be retrieved and cited in real time regardless of training data composition. The gap is real but not absolute.

Should Indonesian businesses publish everything in English for AI visibility?

No. That would sacrifice local search visibility, which is often more commercially valuable than global AI citation. The recommended approach is bilingual: maintain strong Indonesian-language presence for local search and Google Maps, while building targeted English-language content for global AI visibility. Structured data (Wikidata, schema markup) should always include English because machines process it more efficiently.

Will the language bias in AI training data improve over time?

Yes, gradually. Each generation of AI models includes more multilingual data. Google's Gemini processes Indonesian better than earlier models. But the improvement is incremental, and entities that build English-language presence now will retain compounding advantages because their data appears in more training cycles. Building bilingually today hedges against the current bias while benefiting from future improvements.

