Detecting brand mentions across structured tags, hashtags, and free text — and weighing them against the platform’s disclosure flag — over 99,920 Instagram posts.
paid_partnership is the platform's raw disclosure flag, not ground truth — a false value is known to hide undisclosed ads. This report measures detected brand mentions and compares them against that flag; it never treats the flag as the answer.
Recognizable posts behind every category: the four partnership types and the Confirmed tier from the taxonomy doc, plus three things we discovered. In each table, → @author shows who Instagram says posted. Tabs marked ⚠ are problems we surfaced, not clean findings.
Platform paid_partnership flag = true. The only certain signal: 466 posts. Recognizable disclosed ads:
| Post → author | Brand / cue | Caption |
|---|---|---|
| reel/DV8uClCDCxf → @goodmorningamerica 4.3M | @burgerking | “Our sponsor @burgerking asked for feedback — and you delivered…” |
| reel/DW6ruPbAP7Q → @instylemagazine 4.4M | @macys | “#ad Soft, romantic layers for spring, styled head to toe from @macys 💐” |
| reel/DLXolpeAe6W → @rappingchef 2.4M | @cawalnuts | “California walnuts 😂 @cawalnuts #CAWalnutsPartner #Sponsored” |
| reel/DMFTpwXqC4x → @brettlee_58 2.1M | Aussie Avocados | “Fuel up with Aussie avocados 🥑 #AustralianAvocados” |
Type: Sponsored. Platform flag, or #ad/#sponsored/“sponsored by”. 700 posts.
| Post → author | Brand | Caption |
|---|---|---|
| reel/DWwIt4MEUyG → @markwahlberg 30M | @hallowapp | “HAPPY EASTER🙏 @hallowapp 🙏 #ad” |
| reel/DRR0Ca6DHCy → @ladbible 15.8M | — | “So pet cloning is actually a thing now? 😳🐾 #ad” |
| reel/DQZVGnaifU6 → @todayshow 5.5M | Winter Olympics | “The 2026 Winter Olympics are 100 days away… #ad” |
| reel/DSI1eeeklH3 → @rocky_barnes 3.3M | — | “All dressed up for the holidays ✨ 🎁 #ad” |
Type: Affiliate. Promo/discount code or affiliate-platform link. 656 posts (after removing the noisy % off pattern; see §09).
| Post → author | Code / cue | Caption |
|---|---|---|
| reel/DUWAS7SjpFl → @tweets 4.2M | SUPERBOWL10 | “Click the link in bio to claim $10 with code SUPERBOWL10…” |
| reel/DWoysvSDnV2 → @annabel.lucinda 3.4M | @gymshark | “GLUTE DAY… wearing the new @gymshark ‘interval’ collection” |
| p/DYkIPPXkZ0B → @kellylmatthews 1.7M | KELLY | “EUPHORIA … **Code KELLY** #activewear” |
| reel/DQmNKDGgGmP → @styled.by.josepha 878K | JOSEPHA35 · @vici | “🔥35% off with my code: JOSEPHA35 🔗Comment VICI…” |
Type: Collaboration. A brand co-author on the post (appears on both grids). 1,921 posts. The co-author handle is shown.
| Post → author | Brand co-author | Caption |
|---|---|---|
| p/DMDSVYfI3Y9 → @victoriabeckham 32.9M | @victoriabeckhambeauty | “Late check-out. Instant obsession…” |
| reel/DXVCvtLjMXJ → @premierleague 79.5M | @astonvilla | “Unai Emery celebrations never disappoint.” |
| reel/DNVhbzIg8-B → @ralphlauren 16.7M | @poloralphlauren | “The Polo Bear Chronicles: Operation Black Tie…” |
Type: Gifted. Gifting language or #gifted, the weakest/noisiest type. 300 posts.
| Post → author | Brand / cue | Caption |
|---|---|---|
| reel/DMjXnEOvKuN → @bybinalshah 479K | NuFACE | “gifted by NuFACE ✨ 3 hours of sleep, but make it look like 8…” |
| reel/DQIJHV7Efst → @aarondinin 809K | (box of product) | “What would you have done in a failure challenge featuring a giant box…” |
| reel/DXzemYAOZjO → @bandana.eats1 417K | Red Lobster | “RED LOBSTER ENDLESS SHRIMP!!! #foodie #redlobster” |
★ Discovered: undisclosed organic mentions. Creator @-mentions a hiker-verified brand, no paid flag, no brand co-author (so genuinely the creator’s own post, distinct from the author-inversion cases). The triage pool for hidden ads.
| Post → author | Brand | Caption |
|---|---|---|
| reel/DS2_kLdkrG5 → @renatarrii 1.4M | @iamgia | “training in @iamgia 🥊” |
| reel/DYsDhjSAoSp → @malloryervin 1.1M | @ultabeauty | “Testing my more is more mentality 🤪✨ I am a makeup lover…” |
| reel/DRzrlOXjA9I → @phenixsoul 782K | @yslbeauty | “60 second red carpet glam. Products used (in order)…” |
| reel/DUGcZTFEkAx → @heatherdubrow 1.8M | @popupbagels | “Bagels are always better in NY … @popupbagels” |
⚠ Discovered: hashtag matching ≠ mention. A #brand hashtag conflates “sponsored by” with “about this brand.” These matched our dictionary by a hashtag but are not brand mentions: they are news topic-tags or brands tagging themselves. This is why the raw “hidden-in-hashtag” count (5,406) collapses to ~420 once cleaned.
| Post → author | Hashtag | Why it’s noise |
|---|---|---|
| p/DYuDMtUiEll → @dainikbhaskar_ news | #Amazon | Hindi news story on tech layoffs — topic tag, no Amazon relationship. |
| p/DPboq_pk5xI → @9gag media | #lego | Meme account; #lego is the subject, not a sponsor. |
| p/DN22AxOYqHC → @bmw brand | #bmw | BMW tagging itself — first-party, not an influencer ad. |
| reel/DLSWrb3Iljl → @tommyhilfiger brand | #TommyHilfiger | Brand’s own post. |
⚠ Discovered: primary-author inversion on collab posts. For brand×creator collaborations, the dataset’s influencer_name can record a co-author creator as the author when Instagram’s real owner is the brand. Verified against hiker get_v2_media_info_by_code (true owner in media.user). These look like “undisclosed creator mentions” but are brand-published collab ads.
| Post | DB says author | Hiker: true author |
|---|---|---|
| reel/DWEmrEPhWdQ | @atsukocomedy (creator) | @ugg (brand); atsukocomedy = co-author |
| reel/DWeNYlyjJNg | @zeyatilgan (creator) | @lcwaikiki (brand); caption has “işbirliği”=collab |
Impact: 217 of the 2,327 “verified-brand undisclosed” posts are actually brand-coauthor collabs. Recoverable per-post via hiker; logged as a known data issue, not yet corrected in bulk.
Snippets truncated/whitespace-collapsed; full captions in posts.caption. Links built from post_links view (verified: reel/DX6wxRhRv_3 = @mattestlea).
mv_brand_mentions where has_brand_mention = true (any dictionary handle found in mention_tags, hashtags, or caption text). 12,891 / 99,920 = 12.9%.
brand_via_hashtag AND NOT brand_via_mention_tag AND NOT brand_via_caption_text. Caveat: raw 5,406 conflates ads with topic tags (e.g. a news post tagging #Amazon) and first-party brand posts. Restricting to creator authors and dropping generic/place tokens (instagram, miami, nfl, disney…) leaves 420 plausible. See §05.
sponsor_handles (brands proven to pay here), paid_partnership=false, minus 12 generic platform handles (instagram, nfl…). See "high-confidence" section.
handle_verdict with verdict = 'brand' (hikerapi returned a commercial account category). The defensible brand set. (Was “18,485 dictionary handles”; that was a noisy raw lookup list, mostly unused. See methodology.)
A brand mention is detected in 12.9% of posts. A notable channel: brands referenced only through a hashtag (e.g. #nike), invisible to the structured mention_tags field, though raw hashtag matching is noisy (see §04: #Amazon on a news story is a topic tag, not an ad). And among posts naming a brand that provably runs paid partnerships in this data, undisclosed mentions outnumber disclosed ones roughly 7× after noise filtering: the candidate pool this detection project exists to surface.
#brand.
@handles in the caption / transcript that never reached the structured arrays (regex-extracted).
Transcripts yielded zero @-handles (speech-to-text doesn't emit handles), so text detection is caption-driven.
Step 1 is a dictionary mined from the dataset itself — handles harvested from three fields. This is a broad, noisy candidate list (18,485 total); hikerapi verification (below) is what turns it into the trustworthy 587-brand set.
| Source | handles | confidence |
|---|---|---|
| sponsor handles (paid) | 130 | high |
| brand-typed co-authors | 3,438 | medium |
| brand-typed accounts | 14,917 | broad |
sponsor = distinct lowercased handles from posts.sponsor_handles where sponsor_present. coauthor_brand = handles from coauthor_handles where coauthor_is_brand. brand_account = influencer_name of authors whose influencer_account_type_v2='brand'. De-duplicated (keep strongest source) → 18,485. Only 6,221 are ever actually mentioned in a post.
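The "de-duplicated (keep strongest source)" step can be illustrated with a minimal sketch. It assumes candidates arrive as (handle, source) pairs; the function name `build_dictionary` and the toy rows are hypothetical, and only the three source labels come from the note above.

```python
# Source strength, strongest first; labels mirror the three sources above.
SOURCE_RANK = {"sponsor": 0, "coauthor_brand": 1, "brand_account": 2}


def build_dictionary(candidates):
    """Collapse (handle, source) pairs to one entry per handle,
    keeping the strongest source seen for that handle."""
    best = {}
    for handle, source in candidates:
        # Normalise: trim, drop a leading @, lowercase.
        key = handle.strip().lstrip("@").lower()
        if not key:
            continue
        current = best.get(key)
        if current is None or SOURCE_RANK[source] < SOURCE_RANK[current]:
            best[key] = source
    return best


# Toy rows for illustration only, not real dataset counts.
rows = [
    ("@Macys", "brand_account"),
    ("macys", "sponsor"),
    ("UGG", "coauthor_brand"),
    ("instagram", "brand_account"),
]
dictionary = build_dictionary(rows)
```

In this toy run `macys` keeps its `sponsor` label even though it was first seen as a `brand_account`, which is the behaviour the "keep strongest source" rule describes. A generic handle such as `instagram` still gets through, which is why a stoplist is needed later.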
The data-derived dictionary is broad and noisy. To ground it in reality, the top 2,000 most-mentioned handles were enriched live via the hikerapi Instagram API (get_v2_user_by_username), storing the complete raw profile per handle. The brand-discriminating signal is the account category — not is_business, which is true for creators too (a creator like maddiedoodle__ is a "business" account categorised "Entrepreneur").
category matched a commercial pattern (Brand, Clothing, Retail, Health/beauty…) via classify_brand().
mv_brand_verified where has_verified_brand = true (mention a handle with verdict 'brand').
has_verified_person = true: a handle the raw dictionary called a brand but hikerapi categorised as a creator/person (e.g. carlifestyle = "Digital creator").
has_verified_brand = true AND paid_partnership = false. Disclosed counterpart = 76 (paid_partnership = true).

| Hiker verdict | handles | meaning |
|---|---|---|
| brand | 587 | commercial category → confirmed |
| person | 223 | creator/individual → false positive |
| org_other | 196 | team / community / label / media |
| uncertain | 949 | empty category → kept at dict confidence |
| not_found | 45 | handle no longer resolves |
~49% of accounts return an empty category (incl. ZARA, Louis Vuitton, BMW and Kim Kardashian) — category can't classify those, so they fall back to the dictionary.
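To make the verdict logic concrete, here is a rough sketch of a category-based classifier. It is not the report's actual classify_brand(): the function name `classify_brand_sketch`, the two regex lists, and the assumption that the stored profile exposes the category under a `category` key are all illustrative guesses. Only the five verdict labels and the example categories come from the text.

```python
import re

# Hypothetical pattern lists; the real classify_brand() patterns are not shown in this report.
COMMERCIAL = re.compile(r"brand|clothing|retail|beauty", re.IGNORECASE)
PERSON = re.compile(r"creator|entrepreneur|trainer", re.IGNORECASE)


def classify_brand_sketch(profile):
    """Map a raw profile dict (or None when the handle no longer resolves)
    to one of: brand, person, org_other, uncertain, not_found."""
    if profile is None:
        return "not_found"
    # Assumed field name; is_business is deliberately ignored because
    # creators can be "business" accounts too.
    category = (profile.get("category") or "").strip()
    if not category:
        return "uncertain"  # empty category: fall back to dictionary confidence
    if COMMERCIAL.search(category):
        return "brand"
    if PERSON.search(category):
        return "person"
    return "org_other"


verdicts = {
    "amazon": classify_brand_sketch({"category": "Retail company"}),
    "ugg": classify_brand_sketch({"category": "Clothing (Brand)"}),
    "carlifestyle": classify_brand_sketch({"category": "Digital creator"}),
    "maddiedoodle__": classify_brand_sketch({"category": "Entrepreneur", "is_business": True}),
    "zara": classify_brand_sketch({"category": ""}),
    "gone_handle": classify_brand_sketch(None),
}
```

A sketch like this also shows the limit of the approach: any brand that self-selects a creator-style category lands in "person", so the verdict removes noise with high precision but cannot serve as a complete brand list.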
| Handle | category | posts |
|---|---|---|
| amazon | Retail company | 300 |
| sheinofficial | Brand | 209 |
| tommyhilfiger | Brand | 99 |
| yslbeauty | Health/beauty | 93 |
| ultabeauty | Beauty/cosmetic | 89 |
| ugg | Clothing (Brand) | 64 |
| sezane | Clothing (Brand) | 61 |
| larocheposay | Health/beauty | 45 |
carlifestyle → "Digital creator", removed) — but some genuine brands self-select a creator category and get dropped: minecraft ("Video Game"), cerakote ("Entrepreneur"), gozwift/Zwift ("Fitness Trainer"). So treat "person" as high-precision noise removal, not a complete brand blocklist. The full raw hiker profile is stored per handle (brand_verified.raw) for richer signals later.
| Location | posts w/ brand | share of detections |
|---|---|---|
| Structured mention_tags | 7,477 | |
| Hashtags #brand | 6,738 | |
| Caption free text (extra, not in tags) | 11 | |
| Transcript free text | 0 | — |
Locations overlap (a post can mention a brand in several). 5,406 posts are caught only by hashtag — but see the caveat below. Caption free-text adds almost nothing (11) — and that is correct, not a bug: the JSONL is already structured, so Instagram has resolved real caption @-mentions into mention_tags. Of 23,848 posts with an @ in the caption, 23,099 are already in mention_tags; only 749 are “extra,” and those are mostly not real mentions — email addresses (name@gmail.com) and truncated URLs (@amazon.com) caught by the @ regex. Just 11 coincidentally matched a brand handle. The structured field has effectively pre-absorbed this channel.
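The email and URL false positives are easy to reproduce. The sketch below is an assumption-laden illustration, not the extraction code used for this report: the regex, the domain-suffix list, and the function name `split_extras` are hypothetical.

```python
import re

# Naive @-token regex of the kind that also fires on emails and URLs.
AT_TOKEN = re.compile(r"@([A-Za-z0-9._]+)")
# Hypothetical suffix list used to spot domain-like tokens.
DOMAIN_TAIL = re.compile(r"\.(com|net|org|co|io)$")


def split_extras(caption, mention_tags):
    """Return (real, noise): @-tokens not already in mention_tags,
    split into plausible mentions and email/URL debris."""
    known = {t.lower() for t in mention_tags}
    real, noise = [], []
    for m in AT_TOKEN.finditer(caption):
        token = m.group(1).rstrip(".").lower()
        if not token or token in known:
            continue  # already resolved into the structured field
        # An alphanumeric character right before "@" suggests an email address.
        preceded_by_word = m.start() > 0 and caption[m.start() - 1].isalnum()
        looks_like_domain = bool(DOMAIN_TAIL.search(token))
        if preceded_by_word or looks_like_domain:
            noise.append(token)
        else:
            real.append(token)
    return real, noise


caption = "Thanks @ugg! Contact name@gmail.com or shop @amazon.com and @newbrand."
real, noise = split_extras(caption, mention_tags=["ugg"])
```

Here `@ugg` is skipped because it is already in mention_tags, `gmail.com` and `amazon.com` are classed as noise, and only `newbrand` survives as a candidate "extra" mention.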
count(*) FILTER (WHERE brand_via_mention_tag / _hashtag / _caption_text / _transcript_text) over mv_brand_mentions. A handle counts here if it's in the 18,485-dictionary; transcript = 0 because speech-to-text emits no @-handles.
A #brand hashtag conflates “sponsored by this brand” with “this post is about this brand.” Validated example: p/DYuDMtUiEll (@dainikbhaskar_, a news outlet) carries #Amazon in a story about tech layoffs; there is no Amazon mention or relationship in the post, so it is a topic tag. The raw 5,406 includes such topic tags, first-party brand posts (2,371 authored by brand accounts), and generic tokens (#instagram 1,416, #miami, #nfl, #disney, #starwars). Restricting to creator authors and removing generic/place/platform tokens leaves 420 plausible cases, and even some of those (#imax, or #hornets meaning the insects) are topical. Treat hashtag-only as a weak, high-recall/low-precision signal, not a confirmed mention.
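The cleaning rule described here (hashtag-only, creator author, no generic token) can be sketched as a simple predicate. The boolean column names follow mv_brand_mentions as quoted in this report; the `matched_hashtags` field, the short stoplist, and the function name are hypothetical.

```python
# Illustrative stoplist; the report's full generic/place/platform list is longer.
GENERIC_TOKENS = {"instagram", "miami", "nfl", "disney", "starwars"}


def is_plausible_hashtag_only(post):
    """True when a post is caught only by hashtag, is authored by a creator,
    and matched at least one non-generic brand hashtag."""
    hashtag_only = (
        post["brand_via_hashtag"]
        and not post["brand_via_mention_tag"]
        and not post["brand_via_caption_text"]
    )
    if not hashtag_only:
        return False
    if post["influencer_account_type_v2"] != "creator":
        return False  # drops news topic-tags and first-party brand posts
    matched = {h.lower() for h in post["matched_hashtags"]}  # hypothetical field
    return bool(matched - GENERIC_TOKENS)


def toy_post(account_type, hashtags, via_tag=False):
    """Build a toy row for illustration only."""
    return {
        "brand_via_hashtag": True,
        "brand_via_mention_tag": via_tag,
        "brand_via_caption_text": False,
        "influencer_account_type_v2": account_type,
        "matched_hashtags": hashtags,
    }


results = [
    is_plausible_hashtag_only(toy_post("media", ["Amazon"])),      # news topic tag
    is_plausible_hashtag_only(toy_post("brand", ["bmw"])),         # brand tagging itself
    is_plausible_hashtag_only(toy_post("creator", ["miami"])),     # generic place token
    is_plausible_hashtag_only(toy_post("creator", ["nike"])),      # plausible case
    is_plausible_hashtag_only(toy_post("creator", ["nike"], via_tag=True)),  # not hashtag-only
]
```

Even the posts that pass this predicate remain candidates only, since a creator can tag a brand as a topic.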
| | brand detected | not detected |
|---|---|---|
| paid = true | 218 | 248 |
| paid = false | 12,673 | 86,781 |
Cross-tab of all 99,920 posts.
how: GROUP BY paid_partnership, has_brand_mention over mv_brand_mentions. “detected” = dictionary brand mention (18,485 set); rows: 218 / 248 / 12,673 / 86,781 sum to 99,920.
Recall on disclosed ads = 46.8%. We detect a brand handle in only 218 of 466 disclosed-paid posts — many paid posts name the sponsor in prose or rely on the platform partnership banner rather than an @-handle. A handle/hashtag detector alone under-covers.
12,673 brand mentions are undisclosed. Most are organic (a creator tagging a brand they like), but this is precisely where hidden ads live — the pool to triage.
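The headline rates follow directly from the four cross-tab cells. The snippet below simply recomputes them from the counts in the table; the variable names are illustrative.

```python
# Counts copied from the cross-tab: (paid_partnership, brand_detected) -> posts.
crosstab = {
    (True, True): 218,
    (True, False): 248,
    (False, True): 12_673,
    (False, False): 86_781,
}

total = sum(crosstab.values())
disclosed = crosstab[(True, True)] + crosstab[(True, False)]
detected = crosstab[(True, True)] + crosstab[(False, True)]

# Share of disclosed-paid posts in which a brand handle was detected.
recall_on_disclosed = crosstab[(True, True)] / disclosed
# Share of all posts with any detected brand mention.
detection_rate = detected / total
# Share of detected brand mentions that carry no paid flag.
undisclosed_share = crosstab[(False, True)] / detected

print(f"total posts: {total}")
print(f"recall on disclosed ads: {recall_on_disclosed:.1%}")
print(f"detection rate: {detection_rate:.1%}")
print(f"undisclosed share of detections: {undisclosed_share:.1%}")
```

Note that the recall figure uses the platform flag as its denominator, so it measures coverage of disclosed ads only and says nothing about hidden ones.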
Restricting to posts that mention a known sponsor brand (brands proven to pay creators in this dataset) and excluding 12 generic platform/place handles (instagram, nfl, amazon, disney, …) that polluted the sponsor list:
sponsor-tier handle, paid_partnership=false, excluding 12 generic handles (instagram/nfl/amazon/disney/youtube/tiktok/spotify/google/facebook/threads/whatsapp/miami).

| Brand handle | undisclosed posts |
|---|---|
| lego | 311 |
| sheinofficial | 182 |
| yslbeauty | 92 |
| etsy | 76 |
| adidasoriginals | 70 |
| ebay | 52 |
| peppermayo | 39 |
| garageclothing | 33 |
| newlook | 31 |
| macys | 28 |
| googlepixel | 25 |
| bravotv | 24 |
| Type | posts | branded | % |
|---|---|---|---|
| post | 47,745 | 7,059 | 14.78 |
| reel | 41,773 | 5,205 | 12.46 |
| story | 10,402 | 627 | 6.03 |
GROUP BY type over mv_brand_mentions; branded = has_brand_mention. % = branded ÷ posts.
| Account type | posts | branded | % |
|---|---|---|---|
| brand | 15,655 | 4,356 | 27.82 |
| creator | 33,967 | 4,444 | 13.08 |
| unknown | 2,651 | 348 | 13.13 |
| media | 6,925 | 785 | 11.34 |
| pro_service | 13,637 | 1,429 | 10.48 |
| venue | 2,318 | 138 | 5.95 |
| (empty) | 23,849 | 1,344 | 5.64 |
GROUP BY influencer_account_type_v2 over mv_brand_mentions. Caveat: brand & pro_service authors are first-party accounts — high brand-mention % is expected (brands mention brands) and is not an ad signal. Ad detection should focus on creator (0.98% paid-flag rate, 3–7× the others).
We tagged every post with a two-filter taxonomy — a confidence tier (one per post) and multi-label partnership types — then audited it against the Competitor Insights — Paid Partnership Taxonomy Notion doc (built on this same D1 dataset). The doc is a hypothesis we double-checked, not a spec we followed.
| Tier | Our trigger | posts | doc |
|---|---|---|---|
| Confirmed | platform flag | 466 | 466 ✓ |
| Likely | #ad/#sponsored, “sponsored by/paid partnership”, brand co-author | 2,095 | 2,151 |
| Possible | promo-code / affiliate-host / gifting (soft, low-conf) | 836 | 668 |
| None | no commercial signal | 96,523 | 96,635 |
Tiers are leak-free: Confirmed = exactly the 466 flagged; Likely/Possible/None contain 0 flagged posts.
how: pp_tier column in mv_partnership (CASE over signal booleans, highest tier wins). “doc” = numbers published in the Notion taxonomy doc, shown for comparison. We reproduce the hard signals exactly, but our soft-signal tiers differ (see audit note below).
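The "highest tier wins" rule is what keeps the tiers leak-free. A minimal Python rendering of that CASE logic might look like the following; the signal key names are hypothetical stand-ins for the booleans in mv_partnership, and only the tier names and trigger groups come from the table.

```python
# Hypothetical signal names grouped by the tier they can trigger.
LIKELY_SIGNALS = ("has_ad_hashtag", "has_sponsored_phrase", "has_brand_coauthor")
POSSIBLE_SIGNALS = ("has_promo_code", "has_affiliate_host", "has_gifting_language")


def pp_tier(signals):
    """Return exactly one tier per post; checks run from strongest to weakest."""
    if signals.get("paid_partnership"):
        return "Confirmed"
    if any(signals.get(name) for name in LIKELY_SIGNALS):
        return "Likely"
    if any(signals.get(name) for name in POSSIBLE_SIGNALS):
        return "Possible"
    return "None"


tiers = [
    pp_tier({"paid_partnership": True, "has_promo_code": True}),    # flag outranks soft text
    pp_tier({"has_brand_coauthor": True, "has_promo_code": True}),  # hard signal outranks soft
    pp_tier({"has_gifting_language": True}),
    pp_tier({}),
]
```

Because the platform flag is tested first, a flagged post can never fall into a lower tier, and because soft-text signals are tested last, they can never lift a post above "Possible".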
| Type | Our trigger | posts | doc |
|---|---|---|---|
| collaboration | brand co-author | 1,921 | 1,921 ✓ |
| affiliate | promo code / affiliate host | 656 | 487 |
| sponsored | flag / #ad / #sponsored / “sponsored by” | 700 | 766 |
| gifted | gifting language / #gifted | 300 | 291 |
Types are multi-label and only apply to commercial posts — counts don’t sum to 100%.
how: unnest(partnership_types) from mv_partnership, grouped. A post can carry several types. collaboration (=brand co-author) and the hard sponsored signals reproduce the doc; affiliate/gifted are soft-text regex and diverge.
#ad=338, #sponsored=32, #ambassador=8, brand co-author=1,921, any @mention=43,416, Confirmed=466. The earlier gap on Possible / Affiliate traced to a single over-broad sub-pattern: a bare [0-9]% off regex that fired on any retail markdown — spa pricing, product listings, first-party sales — not creator affiliate offers (472 such posts). Removing it brings Affiliate 1,130 → 656 and Possible 1,263 → 836, both now near the doc’s 487 / 668. The residual gap is the opposite error in the doc: it missed real bare-code XXXX affiliate posts (“use code jess15”, “code ali20 to save”). Net lesson: soft-text volumes swing on regex wording — keep them at the lowest tier only (we never let them reach “Likely”).
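The difference between the over-broad markdown pattern and a bare-code pattern can be shown on a few captions. Only the `[0-9]% off` pattern is taken from the text; the code-matching regex and the "contains a digit or is all caps" heuristic are assumptions made for this sketch, not the patterns used in the report.

```python
import re

# The over-broad sub-pattern described above: fires on any retail markdown.
PERCENT_OFF = re.compile(r"[0-9]% off", re.IGNORECASE)
# Hypothetical bare-code pattern: the word "code" followed by a token.
CODE = re.compile(r"\bcode:?\s+([A-Za-z0-9]{3,})", re.IGNORECASE)


def has_percent_off(caption):
    return bool(PERCENT_OFF.search(caption))


def has_promo_code(caption):
    """True when 'code' is followed by a token that looks like a promo code,
    i.e. it contains a digit or is written in all caps (an assumed heuristic)."""
    for m in CODE.finditer(caption):
        token = m.group(1)
        if any(ch.isdigit() for ch in token) or token.isupper():
            return True
    return False


captions = [
    "Spa day packages now 20% off all week",  # retail markdown, no creator code
    "use code jess15",
    "35% off with my code: JOSEPHA35",
    "EUPHORIA ... **Code KELLY** #activewear",
    "Enter the code below to join",
]
percent_hits = [has_percent_off(c) for c in captions]
code_hits = [has_promo_code(c) for c in captions]
```

On these toy captions the percent-off pattern fires on the spa listing and misses "use code jess15", while the code pattern does the reverse, which mirrors the two opposite errors described in the audit note. Small wording changes in either regex would shift the counts, so the output of patterns like these belongs in the lowest tier.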
| Tier | posts | w/ verified brand |
|---|---|---|
| Confirmed | 466 | 76 |
| Likely | 2,095 | 267 |
| Possible | 836 | 70 |
| None | 96,523 | 1,990 |
Brand mentions concentrate in commercial tiers — but 1,990 “None” posts still mention a hiker-verified brand, reinforcing that brand-mention detection surfaces commercial content the disclosure-driven tiers miss.
how: mv_partnership JOIN mv_brand_verified on post id, GROUP BY pp_tier, counting has_verified_brand.
brand_account tier admits generic handles (instagram, miami). A 12-handle stoplist was applied to the high-confidence tier; a fuller stoplist / brand allowlist would sharpen all tiers.