Chemical Identification Systems: Research Summary
What exists, who runs it, what people love/hate, and where HQ fits
Chemical Identifier Systems
CAS Registry Number [Commercial]
American Chemical Society (ACS) • Since 1965 • 290M+ substances
Universal standard: everyone uses it
Unique, unambiguous identifier
Works across languages
Commercial: $6+ per lookup
CAS RN is trademarked IP
Can't verify without paying
Errors propagate because no one checks
Doesn't cover mixtures/undefined compositions
ACS told Wikipedia not to verify numbers (!)
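One check is free regardless of the paywall: the last digit of a CAS RN is a checksum over the preceding digits, so mistyped or malformed numbers can be caught offline. The sketch below is a minimal Python version; the function name is illustrative, and passing the check only shows that a number is well-formed, not that it is actually assigned to the substance in question.

```python
import re

# CAS RN layout: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
_CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas_rn(cas_rn: str) -> bool:
    """Return True if cas_rn is well-formed and its check digit matches.

    This is only a format/checksum test. It cannot confirm that the
    number is assigned to a particular substance.
    """
    match = _CAS_PATTERN.fullmatch(cas_rn.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


if __name__ == "__main__":
    print(is_valid_cas_rn("1071-83-6"))  # glyphosate
    print(is_valid_cas_rn("1071-83-5"))  # same digits, altered check digit
```

A check like this could run on every cross-reference at entry time, which addresses typos but not wrong-substance assignments.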
PubChem CID [Free]
NIH / National Library of Medicine • Since 2004 • 115M+ compounds
Completely free and open
Government-backed (NIH mandate)
Structure data freely available
Actively maintained
Deliberately moved away from CAS
Only covers compounds with structures
Multiple CIDs can map to the same CAS number
Less universal recognition than CAS
InChI / InChIKey [Open Standard]
IUPAC • Since 2005 • Computable from structure
Free, non-proprietary
Can be computed from the structure; no issuing authority needed
Structure information encoded in ID
InChIKey is web-searchable (27 chars)
Not human-readable
Can't handle all chemistry (clays, polymers)
Tautomers cause issues
InChIKey collisions possible (rare)
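Because an InChIKey has a fixed 27-character layout, a stored key can at least be shape-checked without any lookup service. The following is a minimal sketch, assuming the standard layout of a 14-letter block, a 10-letter block, and a single final letter. The function name is illustrative, and the check does not recompute the key from a structure (that would need an InChI toolkit).

```python
import re

# Standard InChIKey layout: 14 uppercase letters, hyphen, 10 uppercase
# letters, hyphen, 1 uppercase letter (27 characters in total).
_INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(value: str) -> bool:
    """Return True if value has the shape of a standard InChIKey.

    Shape only: a well-formed key can still be the wrong key for a
    given compound.
    """
    return _INCHIKEY_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    water_key = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"  # InChIKey of water
    print(len(water_key), looks_like_inchikey(water_key))
    print(looks_like_inchikey("1071-83-6"))  # a CAS RN, not an InChIKey
```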
EC Number (ECHA) [Free]
European Chemicals Agency • EU regulation (REACH)
Free to access
Required for EU market
Good regulatory data
EU-only coverage
Not all substances have EC numbers
Less global recognition
"We (including Wikipedia) should now switch from using CAS numbers to using PubChem IDs wherever possible... PubChem has deliberately moved away from CAS because CAS numbers are IP."
- Peter Murray-Rust, Cambridge chemist, 2008
"~120,000 of the 350,000+ chemicals in commercial products were too poorly described to link to a CAS number or their identities were withheld as trade secrets."
- Environ. Sci. Technol. research
Consumer Safety Databases
EWG Skin Deep [Biased]
Environmental Working Group • 88K+ products
Free consumer access
Large product database
Raised awareness of ingredient safety
Easy-to-understand ratings (1-10)
80% of surveyed Society of Toxicology members said EWG overstates risks
Assigns ratings even to ingredients listed as "data: none"
Chemically identical compounds rated differently
Naturalness bias: natural ingredients are rated more favorably
Doesn't update with new research
Pay-to-play "EWG Verified" program
Amazon affiliate links on "dangerous" products
Safety Data Sheets (SDS) [Free]
Manufacturer-required • GHS standard (16 sections)
Legally required for hazardous chemicals
Standardized format (in the US, since OSHA's 2012 Hazard Communication Standard update)
Comprehensive information
Manufacturer-specific data
Hard to find: buried on manufacturer sites
Written for industrial/workplace use
Not consumer-friendly language
41% of SDSs in one study didn't mention combustibility
Many consumer products don't have an SDS
Not designed for pet or child exposure contexts
PubChem [Free]
NIH • Comprehensive compound data
Free, authoritative, government-backed
Chemical/physical properties
Hazard information
Literature citations
Designed for researchers, not consumers
Technical language
No product-level data
No pet-specific information
"A decade ago, George Mason University surveyed ~1000 members of the Society of Toxicology. 80% felt that EWG overstated the risks of chemicals."
- The Eco Well, citing a toxicologist survey
Comparison Matrix
| System | Free? | Authority | Consumer-friendly | Pet data | Products |
|---|---|---|---|---|---|
| CAS | No ($6+/lookup) | High | No | No | No |
| PubChem | Yes | High | Somewhat | No | No |
| InChI | Yes | High | No | No | No |
| ECHA | Yes | High | Somewhat | No | No |
| EWG | Yes | Low (biased) | Yes | No | Yes |
| SDS/MSDS | Yes | High | No | No | Some |
| HQ Safety DB | Yes | High (cited) | Yes (goal) | Yes | Yes |
The Gap HQ Fills
What exists:
High authority, bad UX: PubChem, ECHA, SDS (accurate but not consumer-friendly)
Consumer-friendly, bad authority: EWG (accessible but biased/inaccurate)
Universal identifier, paywalled: CAS (everyone uses it, but verification costs money)
What doesn't exist:
Free, accurate, consumer-friendly safety data
Multi-context data (human, pet, environmental) in one place
Compound → Material → Product linkage
Non-extractive (no pay-to-play verification programs)
Citable, ALCOA+ compliant, conflict-acknowledged
HQ Nomenclature Approach
| Layer | HQ ID | Cross-references |
|---|---|---|
| Compound | hq-c-0001 | CAS, PubChem CID, InChIKey, ECHA EC |
| Material | hq-m-0001 | Resin codes, ASTM standards |
| Product | hq-p-0001 | None (our category, our ID) |
Why our own IDs + crossrefs?
Independence: We don't depend on CAS (commercial) or any single system
Verification: Anyone can check our data against official sources
Coverage: We can ID things that don't have CAS numbers (mixtures, products)
Stability: Our IDs won't change if external systems change
Non-extractive: Our system is free; theirs may not be
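To make the cross-reference idea concrete, here is a minimal sketch of a reverse index from external identifiers back to HQ IDs. The record layout (a `crossrefs` mapping keyed by system name) and both helper names are assumptions for illustration, not the actual HQ schema; only the HQ IDs and CAS numbers are taken from this page.

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical record layout: HQ ID -> name plus external cross-references.
COMPOUNDS: Dict[str, Dict[str, Any]] = {
    "hq-c-0001": {"name": "glyphosate", "crossrefs": {"cas": "1071-83-6"}},
    "hq-c-0002": {"name": "glyphosate IPA salt", "crossrefs": {"cas": "38641-94-0"}},
}


def build_crossref_index(
    compounds: Dict[str, Dict[str, Any]],
) -> Dict[Tuple[str, str], str]:
    """Map (system, external_id) pairs to the HQ ID that claims them."""
    index: Dict[Tuple[str, str], str] = {}
    for hq_id, record in compounds.items():
        for system, external_id in record.get("crossrefs", {}).items():
            key = (system, external_id)
            # Assumption: one external ID belongs to at most one HQ record.
            if key in index:
                raise ValueError(f"{key} claimed by {index[key]} and {hq_id}")
            index[key] = hq_id
    return index


def find_hq_id(
    index: Dict[Tuple[str, str], str], system: str, external_id: str
) -> Optional[str]:
    """Return the HQ ID for an external identifier, or None if unknown."""
    return index.get((system, external_id))


if __name__ == "__main__":
    idx = build_crossref_index(COMPOUNDS)
    print(find_hq_id(idx, "cas", "38641-94-0"))
```

Keeping the HQ ID as the primary key and external IDs as plain attributes is what lets a record exist with no CAS number at all.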
Option A (chosen): hq-c-0001.json (machine-friendly, unambiguous)
Option B: glyphosate.json (human-friendly, HQ ID inside file only)
The human-readable name lives inside the file, not in the filename.
Q: Sequential vs structured IDs? [RESOLVED]
Option A (chosen): hq-c-0001 (simple sequence, in order of entry)
Option B: hq-c-herb-0001 (category embedded)
Option C: hq-c-2025-0001 (year embedded)
Simple sequential. Categories change, and years complicate lookups. The registry tracks assignments.
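A sequential scheme like this is easy to automate. The sketch below derives the next ID from the highest number already assigned rather than from a count of entries, which matters once a registry has gaps. The function name and the four-digit zero padding are assumptions based on the example IDs shown here, not the actual registry code.

```python
import re
from typing import Iterable

# Layers as used in the examples: c = compound, m = material, p = product.
_HQ_ID = re.compile(r"hq-([cmp])-(\d{4,})")


def next_hq_id(layer: str, assigned: Iterable[str]) -> str:
    """Return the next sequential HQ ID for a layer.

    Uses max + 1 over the already assigned IDs of that layer, so gaps
    in the sequence are never reused.
    """
    if layer not in {"c", "m", "p"}:
        raise ValueError(f"unknown layer: {layer!r}")
    highest = 0
    for hq_id in assigned:
        match = _HQ_ID.fullmatch(hq_id)
        if match and match.group(1) == layer:
            highest = max(highest, int(match.group(2)))
    # Zero-pad to four digits; larger numbers simply grow wider.
    return f"hq-{layer}-{highest + 1:04d}"


if __name__ == "__main__":
    print(next_hq_id("c", ["hq-c-0001", "hq-c-0004", "hq-m-0009"]))
```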
Q: How to handle "same compound, different form"? [RESOLVED]
Glyphosate acid vs glyphosate isopropylamine salt: same thing?
Option A: Same HQ ID, different CAS in crossrefs array
Option B (chosen): Separate HQ IDs, linked via hierarchy
Children reference the parent compound (hq-c-0001, glyphosate) via hierarchy.parent.
Children have their own CAS numbers but set inherits_safety: true to take safety data from the parent.
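The parent/child lookup described above can be sketched as follows. Only `hierarchy.parent` and `inherits_safety` come from the text; the in-memory `REGISTRY` dict, the placeholder `hazard_profile` payload, and the function name are hypothetical stand-ins, and the rule that a record's own data wins over inherited data is an assumption.

```python
from typing import Any, Dict

# Hypothetical in-memory registry; real records would be loaded from files.
REGISTRY: Dict[str, Dict[str, Any]] = {
    "hq-c-0001": {
        "name": "glyphosate",
        "cas": "1071-83-6",
        "hazard_profile": {"note": "placeholder safety data"},
    },
    "hq-c-0002": {
        "name": "glyphosate IPA salt",
        "cas": "38641-94-0",
        "hierarchy": {"parent": "hq-c-0001"},
        "inherits_safety": True,
    },
}


def resolve_hazard_profile(
    hq_id: str, registry: Dict[str, Dict[str, Any]]
) -> Dict[str, Any]:
    """Return a record's own hazard profile, or the one it inherits."""
    seen = set()
    current = hq_id
    while True:
        if current in seen:
            raise ValueError(f"hierarchy cycle at {current}")
        seen.add(current)
        record = registry[current]
        # Assumption: data stored on the record itself takes precedence.
        if "hazard_profile" in record:
            return record["hazard_profile"]
        parent = record.get("hierarchy", {}).get("parent")
        if not (record.get("inherits_safety") and parent):
            raise LookupError(f"{hq_id} has no safety data to use or inherit")
        current = parent


if __name__ == "__main__":
    print(resolve_hazard_profile("hq-c-0002", REGISTRY))
```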
Current Registry State
As of March 2026
Active Compounds: 888 (+ 295 aliases = 1,183 files). Next: hq-c-1226
Materials: 262 (hq-m-0001 to hq-m-0262). Next: hq-m-0263
Products: 237 active (with gaps). Next: hq-p-0362
Key Coverage Milestones
| Field | Coverage | Notes |
|---|---|---|
| hazard_profile, found_in, alternatives | 100% | All 888 active compounds |
| regulatory.classifications | 95% | ~43 at ceiling (no classifiable language) |
| GHS hazard data | 80% | Ceiling: mixtures/classes lack single SDS |
| identity (formula + SMILES) | 78% | Ceiling: classes/mixtures can't have structure |
| dose_response.ld50 | 68% | PubChem exhausted; CTX API pending |
First Family: Glyphosate (still canonical example)
hq-c-0001  Glyphosate (acid)  CAS 1071-83-6  [parent]
├── hq-c-0002  IPA salt        CAS 38641-94-0   [alias → hq-c-0001]
├── hq-c-0003  Potassium salt  CAS 70901-12-1   [alias → hq-c-0001]
└── hq-c-0004  Ammonium salt   CAS 114370-14-8  [alias → hq-c-0001]