What it does
social-analyzer searches for a username or person's name across 999 social media sites and returns a confidence-scored result set. It runs as a CLI, a REST API, or a web app — your choice.
Why I starred it
Most username-lookup tools work like a dictionary check: fire requests at a list of URLs, see if you get a 200. That produces a lot of noise. What caught my eye here is the detection layer. Instead of returning every hit, social-analyzer scores each match from 0–100 using multiple detection strategies in parallel, then labels results as good, maybe, or bad. That distinction matters when you're doing real OSINT work — wading through false positives wastes time.
The site database is 999 entries deep, not just the top 20 obvious platforms. It covers wiki farms, adult sites, niche forums, music platforms, and regional social networks, categorized by type and country.
How it works
The entry point (app.js) boots an Express server that serves both the web app and the REST API. The CLI path skips the server and calls the same scan functions directly. The scan itself lives in three modules: fast-scan.js, slow-scan.js, and special-scan.js — corresponding to the --mode flag.
The fast path in modules/fast-scan.js is the most interesting. It doesn't just fire all requests and wait. It runs three sequential passes, each retrying only the sites that failed in the previous one:
const [first_re_try, first_profiles] = await find_username_normal_wrapper(req, helper.websites_entries)
const [second_re_try, second_profiles] = await find_username_normal_wrapper(req, first_re_try)
const [third_re_try, third_profiles] = await find_username_normal_wrapper(req, second_re_try)
if (third_re_try.length > 0) {
    const failed_sites = await get_failed(req, third_re_try)
    all_results = Array.prototype.concat(first_profiles, second_profiles, third_profiles, failed_sites)
}
Any site that fails (network timeout, transient error) gets moved to a retry queue. Sites that fail all three passes are recorded as explicitly failed rather than silently dropped — so you know the search was incomplete for those entries. The concurrency limit is 15 workers per pass via async.parallelLimit.
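The pass structure boils down to "retry only what failed, then report what never succeeded." Here's a minimal plain-Promise sketch of that pattern (illustrative names, not the project's code, which uses async.parallelLimit with the 15-worker cap):

```javascript
// Sketch of the multi-pass retry pattern (not the project's actual code).
// check(site) resolves with a profile result or rejects on a transient error.
async function scanWithRetries(sites, check, passes = 3) {
  let pending = sites
  const profiles = []
  for (let pass = 0; pass < passes && pending.length > 0; pass++) {
    const retry = []
    await Promise.all(pending.map(async (site) => {
      try {
        profiles.push(await check(site))
      } catch {
        retry.push(site) // transient failure: queue for the next pass
      }
    }))
    pending = retry
  }
  // whatever is still pending failed every pass: report it, don't drop it
  return { profiles, failed: pending }
}
```

The key design point survives the simplification: failures are data, not silence.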
The scoring engine lives in modules/engine.js. Every site entry in data/sites.json carries a detections array. Each detection specifies a type, a string to match, and a return value — essentially "if this string is present, that's a good sign" or "if this string is absent, that's a good sign." Four detection types run depending on the scan mode:
normal — the raw HTML source contains the string
advanced — the stripped plain-text version contains the string
ocr — Tesseract.js reads a screenshot and looks for the string
shared — references a shared detection definition (e.g. the same MediaWiki "user not found" message reused across 180+ wiki sites)
} else if (detection.type === 'normal' && source !== '') {
    if (source.toLowerCase().includes(detection.string.replace('{username}', username).toLowerCase())) {
        temp_found = 'true'
    }
    if (detection.return === temp_found) {
        temp_profile.found += 1
        temp_detected.normal += 1
    }
}
The {username} placeholder gets substituted at match time. The final score is (found / detections_count) * 100, and a site must clear a threshold (by default, at least one passing detection) before it's labeled good. This is why the false-positive rate stays lower than with naive URL checkers.
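Condensed into a standalone function, the scoring idea looks roughly like this (hypothetical names; the real engine in modules/engine.js also handles the advanced, ocr, and shared types):

```javascript
// Hypothetical condensation of the scoring loop: each detection "passes"
// when the string's presence or absence matches its declared return value.
function scoreProfile(source, username, detections) {
  let found = 0
  for (const d of detections) {
    const needle = d.string.replace('{username}', username).toLowerCase()
    const present = source.toLowerCase().includes(needle) ? 'true' : 'false'
    if (d.return === present) found += 1
  }
  return (found / detections.length) * 100 // 0-100 confidence score
}
```

Note how a `return: "false"` detection rewards the *absence* of an error message, which is what lets the same machinery express both "this string means found" and "this string means not found."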
The shared detection system is clever. data/sites.json contains a shared_detections array with named entries like mastodon, mediawiki, phpbb. Individual site entries reference these by name instead of duplicating detection logic. The 181 wiki entries all point to a single mediawiki shared detection rather than repeating the same four strings 181 times.
{
    "name": "mastodon",
    "detections": [
        { "return": "false", "string": "The page you are looking for isn", "type": "advanced" },
        { "return": "true", "string": "profile:username", "type": "normal" },
        { "return": "true", "string": "/@{username}", "type": "normal" }
    ]
}
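If you want to reuse the shared entries in your own tooling, resolution is just a lookup-and-splice. A sketch, under the assumption that a site references a shared entry via a shared-type detection naming it (the exact sites.json schema may differ):

```javascript
// Hypothetical resolver: expand "shared" detection references into the
// concrete entries they point at. Schema details are assumed, not verbatim.
function resolveDetections(siteDetections, sharedDetections) {
  const byName = Object.fromEntries(
    sharedDetections.map((s) => [s.name, s.detections])
  )
  return siteDetections.flatMap((d) =>
    d.type === 'shared' ? (byName[d.string] || []) : [d]
  )
}
```

This is the whole trick: one canonical definition, 181 one-line references.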
The string analysis module (modules/string-analysis.js) runs separately from the profile search. It breaks down a username using a dictionary of male/female names, common prefixes, suffixes, and numbers, then tries to infer whether the handle encodes a real name, a word, or random characters. It uses most-common-words-by-language and wordsninja to split concatenated strings like johndoe99 into john + doe + 99.
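The splitting step can be approximated with a greedy longest-prefix match against a dictionary. A toy version of the idea (the real module leans on wordsninja and its name lists, not this code):

```javascript
// Toy greedy splitter illustrating the string-analysis idea:
// peel off trailing digits, then repeatedly take the longest
// dictionary word that prefixes the remainder.
function splitHandle(handle, dictionary) {
  const parts = []
  const digits = handle.match(/\d+$/)
  let rest = handle.toLowerCase().replace(/\d+$/, '')
  while (rest.length > 0) {
    let word = null
    for (let len = rest.length; len > 0; len--) {
      if (dictionary.has(rest.slice(0, len))) {
        word = rest.slice(0, len)
        break
      }
    }
    if (!word) return null // unsplittable: likely random characters
    parts.push(word)
    rest = rest.slice(word.length)
  }
  if (digits) parts.push(digits[0])
  return parts
}
```

A null result is itself signal: handles that resist splitting are probably generated rather than name-derived.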
Using it
# Install and run as CLI
npm install
node app.js --username "johndoe"
# Limit to top 100 sites, extract metadata
node app.js --username "johndoe" --top 100 --metadata
# Filter to a single category (e.g. music)
node app.js --username "johndoe" --type "music"
# JSON output for piping
node app.js --username "johndoe" --output json --filter good | jq .
# Python package (limited to fast mode)
pip install social-analyzer
python3 -m social-analyzer --username "johndoe" --websites "youtube twitter"
Docker is the recommended path for the web app since it bundles Firefox and Tesseract for the slow/OCR scan modes:
docker-compose up
# Web app at http://0.0.0.0:9005/app.html
Rough edges
The OCR detection type appears in the code but data/sites.json contains zero entries that use it — all existing detections use normal, advanced, or shared. The Tesseract.js dependency is wired up but effectively unused by the current site database.
The test directory has a single file checking basic module imports. No integration tests, no tests against known-good usernames on real sites, no tests for the scoring math. For a security-adjacent tool used by law enforcement (per the README), that's a gap.
The slow mode uses selenium-webdriver to take screenshots and run JS, but the site database doesn't indicate which sites require it. You'd need to experiment or read the source to know when slow mode adds value over fast mode.
Git activity has slowed significantly — recent commits are mostly PRs from contributors fixing cheerio API calls and adding Node engine version requirements, not new detection logic. The detection database is described as "different than the one shared here" for the law enforcement deployments, which tells you the public version isn't the full one.
The Python CLI is explicitly limited to FindUserProfilesFast. If you want slow mode, OCR, or the web interface, you need the Node version.
Bottom line
If you're doing OSINT work or building tools that need cross-platform username attribution, social-analyzer's detection scoring model is worth studying even if you end up maintaining your own site list. For straight username lookup across a wide site base with less noise than raw HTTP checkers, it does the job.
