UGC-132 — Brand Mention Detection (D1)

Reading note. paid_partnership is the platform's raw disclosure flag, not ground truth: a false value does not rule out an ad, and undisclosed ads are known to hide behind it. This report measures detected brand mentions and compares them against that flag; it never treats the flag as the answer.

§01 · Findings & issues — example browser

Recognizable posts behind every category — the four partnership types and the Confirmed tier from the taxonomy doc, plus three things we discovered. Click an account to open it; the → @author shows who Instagram says posted. Tabs marked ⚠ are problems we surfaced, not clean findings.

OUR FINDINGS FROM DOC

Platform paid_partnership flag = true. The only certain signal: 466 posts. Recognizable disclosed ads:

Post → author	Brand / cue	Caption
reel/DV8uClCDCxf → @goodmorningamerica 4.3M	@burgerking	“Our sponsor @burgerking asked for feedback — and you delivered…”
reel/DW6ruPbAP7Q → @instylemagazine 4.4M	@macys	“#ad Soft, romantic layers for spring, styled head to toe from @macys 💐”
reel/DLXolpeAe6W → @rappingchef 2.4M	@cawalnuts	“California walnuts 😂 @cawalnuts #CAWalnutsPartner #Sponsored”
reel/DMFTpwXqC4x → @brettlee_58 2.1M	Aussie Avocados	“Fuel up with Aussie avocados 🥑 #AustralianAvocados”

Type: Sponsored. Platform flag, or #ad/#sponsored/“sponsored by”. 700 posts.

Post → author	Brand	Caption
reel/DWwIt4MEUyG → @markwahlberg 30M	@hallowapp	“HAPPY EASTER🙏 @hallowapp 🙏 #ad”
reel/DRR0Ca6DHCy → @ladbible 15.8M	—	“So pet cloning is actually a thing now? 😳🐾 #ad”
reel/DQZVGnaifU6 → @todayshow 5.5M	Winter Olympics	“The 2026 Winter Olympics are 100 days away… #ad”
reel/DSI1eeeklH3 → @rocky_barnes 3.3M	—	“All dressed up for the holidays ✨ 🎁 #ad”

Type: Affiliate. Promo/discount code or affiliate-platform link. 656 posts (after removing the noisy % off pattern; see §09).

Post → author	Code / cue	Caption
reel/DUWAS7SjpFl → @tweets 4.2M	`SUPERBOWL10`	“Click the link in bio to claim $10 with code SUPERBOWL10…”
reel/DWoysvSDnV2 → @annabel.lucinda 3.4M	@gymshark	“GLUTE DAY… wearing the new @gymshark ‘interval’ collection”
p/DYkIPPXkZ0B → @kellylmatthews 1.7M	`KELLY`	“EUPHORIA … Code KELLY #activewear”
reel/DQmNKDGgGmP → @styled.by.josepha 878K	`JOSEPHA35` · @vici	“🔥35% off with my code: JOSEPHA35 🔗Comment VICI…”

Type: Collaboration. A brand co-author on the post (appears on both grids). 1,921 posts. The co-author handle is shown.

Post → author	Brand co-author	Caption
p/DMDSVYfI3Y9 → @victoriabeckham 32.9M	@victoriabeckhambeauty	“Late check-out. Instant obsession…”
reel/DXVCvtLjMXJ → @premierleague 79.5M	@astonvilla	“Unai Emery celebrations never disappoint.”
reel/DNVhbzIg8-B → @ralphlauren 16.7M	@poloralphlauren	“The Polo Bear Chronicles: Operation Black Tie…”

Type: Gifted. Gifting language or #gifted; the weakest/noisiest type. 300 posts.

Post → author	Brand / cue	Caption
reel/DMjXnEOvKuN → @bybinalshah 479K	NuFACE	“gifted by NuFACE ✨ 3 hours of sleep, but make it look like 8…”
reel/DQIJHV7Efst → @aarondinin 809K	(box of product)	“What would you have done in a failure challenge featuring a giant box…”
reel/DXzemYAOZjO → @bandana.eats1 417K	Red Lobster	“RED LOBSTER ENDLESS SHRIMP!!! #foodie #redlobster”

★ Discovered: undisclosed organic mentions. Creator @-mentions a hiker-verified brand, no paid flag, no brand co-author (so genuinely the creator’s own post, distinct from the author-inversion cases). The triage pool for hidden ads.

Post → author	Brand	Caption
reel/DS2_kLdkrG5 → @renatarrii 1.4M	@iamgia	“training in @iamgia 🥊”
reel/DYsDhjSAoSp → @malloryervin 1.1M	@ultabeauty	“Testing my more is more mentality 🤪✨ I am a makeup lover…”
reel/DRzrlOXjA9I → @phenixsoul 782K	@yslbeauty	“60 second red carpet glam. Products used (in order)…”
reel/DUGcZTFEkAx → @heatherdubrow 1.8M	@popupbagels	“Bagels are always better in NY … @popupbagels”

⚠ Discovered: hashtag matching ≠ mention. A #brand hashtag conflates “sponsored by” with “about this brand.” These matched our dictionary by a hashtag but are not brand mentions: they are news topic-tags or brands tagging themselves. This is why the raw “hidden-in-hashtag” count (5,406) collapses to ~420 once cleaned.

Post → author	Hashtag	Why it’s noise
p/DYuDMtUiEll → @dainikbhaskar_ news	`#Amazon`	Hindi news story on tech layoffs — topic tag, no Amazon relationship.
p/DPboq_pk5xI → @9gag media	`#lego`	Meme account; #lego is the subject, not a sponsor.
p/DN22AxOYqHC → @bmw brand	`#bmw`	BMW tagging itself — first-party, not an influencer ad.
reel/DLSWrb3Iljl → @tommyhilfiger brand	`#TommyHilfiger`	Brand’s own post.

⚠ Discovered: primary-author inversion on collab posts. For brand×creator collaborations, the dataset’s influencer_name can record a co-author creator as the author when Instagram’s real owner is the brand. Verified against hiker get_v2_media_info_by_code (true owner in media.user). These look like “undisclosed creator mentions” but are brand-published collab ads.

Post	DB says author	Hiker: true author
reel/DWEmrEPhWdQ	@atsukocomedy (creator)	@ugg (brand); atsukocomedy = co-author
reel/DWeNYlyjJNg	@zeyatilgan (creator)	@lcwaikiki (brand); caption has “işbirliği”=collab

Impact: 217 of the 2,327 “verified-brand undisclosed” posts are actually brand-coauthor collabs. Recoverable per-post via hiker; logged as a known data issue, not yet corrected in bulk.
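
As a minimal sketch, the per-post recovery can be expressed as a comparison between the dataset's author and the owner reported under media.user. The `username` key and the function name are assumptions made for illustration; only the media.user location comes from the verification described above.

```python
def is_author_inverted(db_author: str, media: dict) -> bool:
    """Return True when the dataset's author differs from the owner hiker reports.

    `media` is assumed to look like {"user": {"username": "..."}} (hypothetical shape).
    """
    owner = (media.get("user") or {}).get("username", "")
    if not owner:
        return False  # no owner in the payload, so there is nothing to compare
    return owner.lower() != db_author.lstrip("@").lower()


# Illustrative payload shaped like the first row of the table above
print(is_author_inverted("@atsukocomedy", {"user": {"username": "ugg"}}))  # True
```

A post flagged this way would move from the "undisclosed creator mention" pool to the brand-published collab pool.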

Snippets truncated/whitespace-collapsed; full captions in posts.caption. Links built from post_links view (verified: reel/DX6wxRhRv_3 = @mattestlea).

§02 · Headline numbers

12,891

posts mention a known brand (12.9%)

how: count of posts in mv_brand_mentions where has_brand_mention = true (any dictionary handle found in mention_tags, hashtags, or caption text). 12,891 / 99,920 = 12.9%.

5,406 → 420

brand only via hashtag (raw → cleaned)

how: posts where brand_via_hashtag AND NOT brand_via_mention_tag AND NOT brand_via_caption_text. Caveat: raw 5,406 conflates ads with topic tags (e.g. a news post tagging #Amazon) and first-party brand posts. Restricting to creator authors and dropping generic/place tokens (instagram, miami, nfl, disney…) leaves 420 plausible. See §05.

1,141

mention a sponsor brand but not disclosed paid

how: posts mentioning a handle from sponsor_handles (brands proven to pay here), paid_partnership=false, minus 12 generic platform handles (instagram, nfl…). See §07.

587

brands confirmed by hikerapi

how: handles in handle_verdict with verdict = 'brand' (hikerapi returned a commercial account category). The defensible brand set. (Was “18,485 dictionary handles” — that was a noisy raw lookup list, mostly unused; see methodology.)

A brand mention is detected in 12.9% of posts. A notable channel: brands referenced only through a hashtag (e.g. #nike), invisible to the structured mention_tags field, though raw hashtag matching is noisy (see §05: #Amazon on a news story is a topic tag, not an ad). And among posts naming a brand that provably runs paid partnerships in this data, undisclosed mentions outnumber disclosed ones roughly 7.6× after noise filtering (1,141 vs. 151; see §07). That is the candidate pool this detection project exists to surface.

§03 · How detection works

Three mention locations

mention_tags — structured @-handles Instagram resolved for the post.
hashtag_tags — hashtags; a brand can hide as #brand.
free text — @handles in the caption / transcript that never reached the structured arrays (regex-extracted).

Transcripts yielded zero @-handles — speech-to-text doesn't emit handles — so text detection is caption-driven.
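
The three locations can be sketched as one function that returns a flag per location. This is an illustration rather than the logic in brand_mentions.sql: the post is assumed to be a dict with `mention_tags`, `hashtag_tags`, and `caption` keys, and the handle regex is an assumption.

```python
import re

# Assumed handle pattern: letters, digits, underscores, periods
HANDLE_RE = re.compile(r"@([A-Za-z0-9_.]+)")


def detect_locations(post: dict, brands: set[str]) -> dict[str, bool]:
    """Flag where a dictionary brand appears: structured tags, hashtags, or extra caption text."""
    tags = {h.lower() for h in post.get("mention_tags", [])}
    hashtags = {h.lower().lstrip("#") for h in post.get("hashtag_tags", [])}
    caption = post.get("caption") or ""
    # Strip a trailing period so "@adidas." at the end of a sentence still matches
    in_caption = {h.lower().rstrip(".") for h in HANDLE_RE.findall(caption)}
    extra = in_caption - tags  # handles that never reached mention_tags
    return {
        "brand_via_mention_tag": bool(tags & brands),
        "brand_via_hashtag": bool(hashtags & brands),
        "brand_via_caption_text": bool(extra & brands),
    }


post = {
    "mention_tags": ["gymshark"],
    "hashtag_tags": ["#Nike"],
    "caption": "wearing @gymshark and @adidas. love it",
}
print(detect_locations(post, {"gymshark", "nike", "adidas"}))
```

The hashtag branch is a plain string match, so it inherits the topic-tag noise discussed elsewhere in this report.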

What counts as a "brand"

Step 1 is a dictionary mined from the dataset itself — handles harvested from three fields. This is a broad, noisy candidate list (18,485 total); hikerapi verification (below) is what turns it into the trustworthy 587-brand set.

Source	handles	confidence
sponsor handles (paid)	130	high
brand-typed co-authors	3,438	medium
brand-typed accounts	14,917	broad

how each: sponsor = distinct lowercased handles from posts.sponsor_handles where sponsor_present. coauthor_brand = handles from coauthor_handles where coauthor_is_brand. brand_account = influencer_name of authors whose influencer_account_type_v2='brand'. De-duplicated (keep strongest source) → 18,485. Only 6,221 are ever actually mentioned in a post.
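
The "keep strongest source" de-duplication can be sketched as follows. The priority order mirrors the confidence column above (sponsor, then coauthor_brand, then brand_account); the function name and the (handle, source) input shape are assumptions.

```python
# Lower number = stronger source, following the high / medium / broad confidence column
PRIORITY = {"sponsor": 0, "coauthor_brand": 1, "brand_account": 2}


def build_dictionary(candidates):
    """Collapse (handle, source) pairs to one entry per lowercased handle, keeping the strongest source."""
    best = {}
    for handle, source in candidates:
        key = handle.strip().lstrip("@").lower()
        if key not in best or PRIORITY[source] < PRIORITY[best[key]]:
            best[key] = source
    return best


pairs = [("Nike", "brand_account"), ("@nike", "sponsor"), ("lego", "coauthor_brand")]
print(build_dictionary(pairs))  # {'nike': 'sponsor', 'lego': 'coauthor_brand'}
```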

§04 · Hiker-verified brands (live API enrichment)

The data-derived dictionary is broad and noisy. To ground it in reality, the top 2,000 most-mentioned handles were enriched live via the hikerapi Instagram API (get_v2_user_by_username), storing the complete raw profile per handle. The brand-discriminating signal is the account category — not is_business, which is true for creators too (a creator like maddiedoodle__ is a "business" account categorised "Entrepreneur").
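
A category-based verdict could look like the sketch below. It is not the project's classify_brand(): the pattern lists are illustrative (built only from categories quoted in this report), and routing every other non-empty category to org_other is an assumption.

```python
import re

# Illustrative patterns only; the real commercial pattern list is longer
COMMERCIAL = re.compile(r"brand|clothing|retail|beauty", re.IGNORECASE)
PERSON = re.compile(r"digital creator|entrepreneur", re.IGNORECASE)


def verdict_for(category, is_business=False):
    """Map an account category to a verdict; is_business is deliberately ignored."""
    if not category or not category.strip():
        return "uncertain"  # empty category: fall back to dictionary confidence
    if COMMERCIAL.search(category):
        return "brand"
    if PERSON.search(category):
        return "person"
    return "org_other"  # assumption: anything else is treated as non-brand


print(verdict_for("Retail company"))                     # brand
print(verdict_for("Entrepreneur", is_business=True))     # person
print(verdict_for(""))                                   # uncertain
```

Ignoring `is_business` is the point of the design: a creator flagged as a business account still resolves to person through the category.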

587

handles hiker confirms are brands

how: of the top-2,000 most-mentioned handles enriched via hikerapi, those whose account category matched a commercial pattern (Brand, Clothing, Retail, Health/beauty…) via classify_brand().

2,403

posts with a hiker-confirmed brand

how: posts in mv_brand_verified where has_verified_brand = true (mention a handle with verdict 'brand').

303

posts whose “brand” is really a person (noise removed)

how: posts where has_verified_person = true — a handle the raw dictionary called a brand but hikerapi categorised as a creator/person (e.g. carlifestyle = "Digital creator").

2,327

confirmed-brand mentions not disclosed paid

how: has_verified_brand = true AND paid_partnership = false. Disclosed counterpart = 76 (paid_partnership = true).

Enrichment outcome (2,000 handles)

Hiker verdict	handles	meaning
brand	587	commercial category → confirmed
person	223	creator/individual → false positive
org_other	196	team / community / label / media
uncertain	949	empty category → kept at dict confidence
not_found	45	handle no longer resolves

~47% of the enriched handles (949 of 2,000) return an empty category (incl. ZARA, Louis Vuitton, BMW and Kim Kardashian). Category can't classify those, so they fall back to the dictionary.

Top hiker-confirmed brands

Handle	category	posts
amazon	Retail company	300
sheinofficial	Brand	209
tommyhilfiger	Brand	99
yslbeauty	Health/beauty	93
ultabeauty	Beauty/cosmetic	89
ugg	Clothing (Brand)	64
sezane	Clothing (Brand)	61
larocheposay	Health/beauty	45

Hiker is sharper, not perfect. Category corrects clear noise (carlifestyle → "Digital creator", removed), but some genuine brands self-select a creator or otherwise non-commercial category and get dropped: minecraft ("Video Game"), cerakote ("Entrepreneur"), gozwift/Zwift ("Fitness Trainer"). So treat "person" as high-precision noise removal, not a complete brand blocklist. The full raw hiker profile is stored per handle (brand_verified.raw) for richer signals later.

§05 · Where brands are mentioned

Location	posts w/ brand	share of detections (of 12,891)
Structured `mention_tags`	7,477	58.0%
Hashtags `#brand`	6,738	52.3%
Caption free text (extra, not in tags)	11	0.1%
Transcript free text	0	—

Locations overlap (a post can mention a brand in several). 5,406 posts are caught only by hashtag — but see the caveat below. Caption free-text adds almost nothing (11) — and that is correct, not a bug: the JSONL is already structured, so Instagram has resolved real caption @-mentions into mention_tags. Of 23,848 posts with an @ in the caption, 23,099 are already in mention_tags; only 749 are “extra,” and those are mostly not real mentions — email addresses (name@gmail.com) and truncated URLs (@amazon.com) caught by the @ regex. Just 11 coincidentally matched a brand handle. The structured field has effectively pre-absorbed this channel.
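
The email and URL noise is easy to reproduce. The sketch below contrasts a naive @ regex with a stricter variant that skips an @ preceded by a word character and drops domain-like tokens. Both patterns are assumptions for illustration, not the regex used in the pipeline.

```python
import re

NAIVE_RE = re.compile(r"@([A-Za-z0-9_.]+)")
# Stricter: the @ must not follow a letter, digit, underscore, or period (rules out emails)
STRICT_RE = re.compile(r"(?<![A-Za-z0-9_.])@([A-Za-z0-9_.]+)")
DOMAIN_RE = re.compile(r"\.(com|net|org)$", re.IGNORECASE)


def caption_handles(caption: str, strict: bool = False) -> set[str]:
    """Extract candidate handles from caption text, optionally filtering email and URL debris."""
    pattern = STRICT_RE if strict else NAIVE_RE
    found = {h.lower() for h in pattern.findall(caption)}
    if strict:
        found = {h for h in found if not DOMAIN_RE.search(h)}
    return found


caption = "Shop via @ultabeauty, questions to name@gmail.com or @amazon.com"
print(sorted(caption_handles(caption)))               # ['amazon.com', 'gmail.com', 'ultabeauty']
print(sorted(caption_handles(caption, strict=True)))  # ['ultabeauty']
```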

how: per-location counts are count(*) FILTER (WHERE brand_via_mention_tag / _hashtag / _caption_text / _transcript_text) over mv_brand_mentions. A handle counts here if it's in the 18,485-dictionary; transcript = 0 because speech-to-text emits no @-handles.

Hashtag matching is a string match, not an intent judgment. A #brand hashtag conflates “sponsored by this brand” with “this post is about this brand.” Validated example: p/DYuDMtUiEll (@dainikbhaskar_, a news outlet) carries #Amazon in a story about tech layoffs — there is no Amazon mention or relationship in the post; it’s a topic tag. The raw 5,406 includes such topic tags, first-party brand posts (2,371 authored by brand accounts), and generic tokens (#instagram 1,416, #miami, #nfl, #disney, #starwars). Restricting to creator authors and removing generic/place/platform tokens leaves 420 plausible cases — and even some of those (#imax, #hornets=insects) are topical. Treat hashtag-only as a weak, high-recall/low-precision signal, not a confirmed mention.
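
The cleaning step from raw to plausible hashtag-only cases can be sketched as a predicate. The token list is the partial one quoted above, and the dict keys are assumed to mirror the column names used in this report.

```python
# Partial list: only the generic/place tokens named in this report
GENERIC_TOKENS = {"instagram", "miami", "nfl", "disney", "starwars"}


def is_plausible_hashtag_only(post: dict, brands: set[str]) -> bool:
    """Keep hashtag-only brand matches from creator authors, minus generic tokens."""
    matched = {h.lower().lstrip("#") for h in post.get("hashtag_tags", [])} & brands
    hashtag_only = (
        bool(matched)
        and not post.get("brand_via_mention_tag", False)
        and not post.get("brand_via_caption_text", False)
    )
    if not hashtag_only:
        return False
    if post.get("influencer_account_type_v2") != "creator":
        return False  # drops news outlets and first-party brand posts
    return bool(matched - GENERIC_TOKENS)


news = {"hashtag_tags": ["#Amazon"], "influencer_account_type_v2": "media"}
creator = {"hashtag_tags": ["#nike"], "influencer_account_type_v2": "creator"}
print(is_plausible_hashtag_only(news, {"amazon", "nike"}))     # False
print(is_plausible_hashtag_only(creator, {"amazon", "nike"}))  # True
```

Even with this filter the result stays a weak signal, since a creator can still use a brand hashtag purely as a topic.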

§06 · Brand mention vs. disclosure flag

	brand detected	not detected
paid = true	218	248
paid = false	12,673	86,781

Cross-tab of all 99,920 posts.

how: GROUP BY paid_partnership, has_brand_mention over mv_brand_mentions. “detected” = dictionary brand mention (18,485 set); rows: 218 / 248 / 12,673 / 86,781 sum to 99,920.

Two readings

Recall on disclosed ads = 46.8%. We detect a brand handle in only 218 of 466 disclosed-paid posts — many paid posts name the sponsor in prose or rely on the platform partnership banner rather than an @-handle. A handle/hashtag detector alone under-covers.

12,673 posts mention a brand without the paid flag. Most are organic (a creator tagging a brand they like), but this is precisely where hidden ads live: the pool to triage.
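
Both readings follow directly from the four cells of the cross-tab. The snippet below only redoes that arithmetic with the published counts; the variable names are illustrative.

```python
# Keys are (paid_partnership, has_brand_mention); values are the published cell counts
crosstab = {
    (True, True): 218,
    (True, False): 248,
    (False, True): 12_673,
    (False, False): 86_781,
}

total = sum(crosstab.values())
disclosed = crosstab[(True, True)] + crosstab[(True, False)]
detected = crosstab[(True, True)] + crosstab[(False, True)]

recall_on_disclosed = crosstab[(True, True)] / disclosed
detection_rate = detected / total

print(total)                                 # 99920
print(round(100 * recall_on_disclosed, 1))   # 46.8
print(round(100 * detection_rate, 1))        # 12.9
```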

§07 · High-confidence undisclosed candidates

Restricting to posts that mention a known sponsor brand (brands proven to pay creators in this dataset) and excluding 12 generic platform/place handles (instagram, nfl, amazon, disney, …) that polluted the sponsor list:

1,141

sponsor-brand mention, not disclosed

how: distinct posts mentioning a sponsor-tier handle, paid_partnership=false, excluding 12 generic handles (instagram/nfl/amazon/disney/youtube/tiktok/spotify/google/facebook/threads/whatsapp/miami).

151

sponsor-brand mention, disclosed paid

how: same sponsor-handle set, paid_partnership=true.

~7.6×

undisclosed : disclosed ratio

how: 1,141 ÷ 151 = 7.56. Read: for brands proven to pay here, undisclosed mentions outnumber disclosed ~7.6×.
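
A rough sketch of the counting logic behind these three numbers, assuming each post exposes a `handles` collection of mentioned handles (a hypothetical key) alongside paid_partnership. The stoplist is the 12-handle list given above.

```python
GENERIC_HANDLES = {
    "instagram", "nfl", "amazon", "disney", "youtube", "tiktok",
    "spotify", "google", "facebook", "threads", "whatsapp", "miami",
}


def sponsor_hits(mentioned, sponsor_handles):
    """Sponsor-tier handles mentioned in a post, after removing the generic stoplist."""
    return ({h.lower() for h in mentioned} & sponsor_handles) - GENERIC_HANDLES


def count_sponsor_mentions(posts, sponsor_handles):
    """Return (undisclosed, disclosed) counts of posts mentioning a sponsor-tier handle."""
    undisclosed = disclosed = 0
    for post in posts:
        if not sponsor_hits(post["handles"], sponsor_handles):
            continue
        if post["paid_partnership"]:
            disclosed += 1
        else:
            undisclosed += 1
    return undisclosed, disclosed


# The published ratio, recomputed from the two counts above
print(round(1141 / 151, 2))  # 7.56
```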

Brand handle	undisclosed posts
lego	311
sheinofficial	182
yslbeauty	92
etsy	76
adidasoriginals	70
ebay	52
peppermayo	39
garageclothing	33
newlook	31
macys	28
googlepixel	25
bravotv	24

§08 · Breakdowns

By post type

Type	posts	branded	%
post	47,745	7,059	14.78
reel	41,773	5,205	12.46
story	10,402	627	6.03

how: GROUP BY type over mv_brand_mentions; branded = has_brand_mention. % = branded ÷ posts.

By author account type

Account type	posts	branded	%
brand	15,655	4,356	27.82
creator	33,967	4,444	13.08
unknown	2,651	348	13.13
media	6,925	785	11.34
pro_service	13,637	1,429	10.48
venue	2,318	138	5.95
(empty)	23,849	1,344	5.64

how: GROUP BY influencer_account_type_v2 over mv_brand_mentions. Caveat: brand & pro_service authors are first-party accounts — high brand-mention % is expected (brands mention brands) and is not an ad signal. Ad detection should focus on creator (0.98% paid-flag rate, 3–7× the others).

§09 · Partnership taxonomy & doc validation (audited the spec)

We tagged every post with a two-filter taxonomy — a confidence tier (one per post) and multi-label partnership types — then audited it against the Competitor Insights — Paid Partnership Taxonomy Notion doc (built on this same D1 dataset). The doc is a hypothesis we double-checked, not a spec we followed.

Filter 1 — confidence tier (one per post)

Tier	Our trigger	posts	doc
Confirmed	platform flag	466	466 ✓
Likely	#ad/#sponsored, “sponsored by/paid partnership”, brand co-author	2,095	2,151
Possible	promo-code / affiliate-host / gifting (soft, low-conf)	836	668
None	no commercial signal	96,523	96,635

Tiers are leak-free: Confirmed = exactly the 466 flagged; Likely/Possible/None contain 0 flagged posts.

how: pp_tier column in mv_partnership (CASE over signal booleans, highest tier wins). “doc” = numbers published in the Notion taxonomy doc, shown for comparison — we reproduce the hard signals exactly but our soft-signal tiers differ (see audit note below).
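
The "highest tier wins" rule can be sketched as an ordered chain of checks. The signal key names below are hypothetical stand-ins for the signal booleans in mv_partnership; only the tier names and their triggers come from the table above.

```python
LIKELY_SIGNALS = ("has_ad_hashtag", "has_sponsored_phrase", "has_brand_coauthor")
POSSIBLE_SIGNALS = ("has_promo_code", "has_affiliate_host", "has_gifting")


def pp_tier(signals: dict) -> str:
    """Assign exactly one confidence tier per post; the first matching (highest) tier wins."""
    if signals.get("paid_partnership"):
        return "Confirmed"
    if any(signals.get(name) for name in LIKELY_SIGNALS):
        return "Likely"
    if any(signals.get(name) for name in POSSIBLE_SIGNALS):
        return "Possible"
    return "None"


print(pp_tier({"paid_partnership": True, "has_promo_code": True}))   # Confirmed
print(pp_tier({"has_brand_coauthor": True, "has_gifting": True}))    # Likely
print(pp_tier({"has_promo_code": True}))                             # Possible
```

Because the flag is checked first, a flagged post can never land in a lower tier, which is what keeps the tiers leak-free.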

Filter 2 — partnership type (multi-label)

Type	Our trigger	posts	doc
collaboration	brand co-author	1,921	1,921 ✓
affiliate	promo code / affiliate host	656	487
sponsored	flag / #ad / #sponsored / “sponsored by”	700	766
gifted	gifting language / #gifted	300	291

Types are multi-label and only apply to commercial posts, so the counts overlap and do not add up to the number of commercial posts.

how: unnest(partnership_types) from mv_partnership, grouped. A post can carry several types. collaboration (=brand co-author) and the hard sponsored signals reproduce the doc; affiliate/gifted are soft-text regex and diverge.
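
For readers less familiar with unnest, the same multi-label count can be sketched in a few lines. The `partnership_types` key follows the column name above; the toy posts are invented for illustration.

```python
from collections import Counter


def count_types(posts):
    """Count each partnership type once per post that carries it (a post may carry several)."""
    return Counter(t for post in posts for t in set(post.get("partnership_types", [])))


toy_posts = [
    {"partnership_types": ["sponsored", "affiliate"]},
    {"partnership_types": ["collaboration"]},
    {"partnership_types": []},  # non-commercial post: contributes nothing
]
counts = count_types(toy_posts)
print(counts["sponsored"], counts["affiliate"], counts["collaboration"])  # 1 1 1
```

Three type labels come from only two commercial posts here, which is why the per-type counts cannot be summed into a post total.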

Audit verdict — the doc is right where it’s structured, soft where it’s textual. Every hard signal reproduced exactly: #ad=338, #sponsored=32, #ambassador=8, brand co-author=1,921, any @mention=43,416, Confirmed=466. The earlier gap on Possible / Affiliate traced to a single over-broad sub-pattern: a bare [0-9]% off regex that fired on any retail markdown — spa pricing, product listings, first-party sales — not creator affiliate offers (472 such posts). Removing it brings Affiliate 1,130 → 656 and Possible 1,263 → 836, both now near the doc’s 487 / 668. The residual gap is the opposite error in the doc: it missed real bare-code XXXX affiliate posts (“use code jess15”, “code ali20 to save”). Net lesson: soft-text volumes swing on regex wording — keep them at the lowest tier only (we never let them reach “Likely”).
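
The regex sensitivity is easy to demonstrate. `PERCENT_OFF` is the bare pattern named above; `CODE` is a hypothetical promo-code pattern written for this illustration and is not the one used in the pipeline.

```python
import re

# The over-broad sub-pattern that was removed
PERCENT_OFF = re.compile(r"[0-9]% off", re.IGNORECASE)
# Hypothetical bare-code pattern: "code XXXX", "use code XXXX", "code: XXXX"
CODE = re.compile(r"\b(?:use\s+)?code[:\s]+([A-Za-z0-9]{3,})", re.IGNORECASE)

captions = [
    "Facials now 20% off this week",          # retail markdown, not an affiliate offer
    "use code jess15",
    "code ali20 to save",
    "35% off with my code: JOSEPHA35",
]

for text in captions:
    code = CODE.search(text)
    print(bool(PERCENT_OFF.search(text)), code.group(1) if code else None)
```

The first caption trips only the % off pattern, while the next two are caught only by the code pattern. A pattern this loose would also capture ordinary words after "code", so it belongs in the lowest tier as well.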

Cross-check: tiers × hiker-verified brand mentions

Tier	posts	w/ verified brand
Confirmed	466	76
Likely	2,095	267
Possible	836	70
None	96,523	1,990

Brand mentions concentrate in commercial tiers — but 1,990 “None” posts still mention a hiker-verified brand, reinforcing that brand-mention detection surfaces commercial content the disclosure-driven tiers miss.

how: mv_partnership JOIN mv_brand_verified on post id, GROUP BY pp_tier, counting has_verified_brand.

§10 · Limitations & next steps

Handle-only detection. We match brand handles/hashtags. Brands named in prose without an @ or # (e.g. "loving my new Nikes") are missed — the 46.8% recall on disclosed ads reflects this. Next: fuzzy brand-name matching against caption & transcript text.
Dictionary noise. The broad brand_account tier admits generic handles (instagram, miami). A 12-handle stoplist was applied to the high-confidence tier; a fuller stoplist / brand allowlist would sharpen all tiers.
Mentions ≠ ads. An undisclosed brand mention is a candidate, not a confirmed ad. Pairing with caption ad-cues (links/shorteners, "use my code", gifting language) would rank candidates.
No gold labels. Validation here is against the noisy disclosure flag; a human-confirmed eval set (planned follow-on) is needed for precision/recall.
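
As a rough sketch of the fuzzy brand-name matching proposed as a next step, the standard library's difflib can compare caption tokens against a list of brand names. The brand list, the tokenizer, and the 0.8 cutoff are assumptions, and none of this is implemented in the current pipeline.

```python
import re
from difflib import get_close_matches


def fuzzy_brand_hits(text: str, brand_names: list[str], cutoff: float = 0.8) -> set[str]:
    """Return brand names that closely match any word in the text."""
    hits = set()
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # n=1 keeps only the best candidate per token
        hits.update(get_close_matches(token, brand_names, n=1, cutoff=cutoff))
    return hits


print(fuzzy_brand_hits("loving my new Nikes", ["nike", "adidas"]))  # {'nike'}
```

Short or common-word brand names would need a higher cutoff or an allowlist, since fuzzy matching trades precision for recall.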

Source: UGC-132-datasets/d1.jsonl → Postgres ugc.posts (99,920 rows, verified against d1_manifest.json). Detection logic: brand_mentions.sql (brand_handles dictionary + posts_brand_mentions view / mv_brand_mentions materialized view). Numbers reproducible via queries.sql.
