Parallel HTML Parsing with HSV

Proof-of-concept demonstrating parallel parsing of HTML-like structured documents using HSV format.

Background

HTML parsing has been sequential for 30 years. Research attempts to parallelize it achieved limited results:

Project  Year  Approach                   Result
-------  ----  --------                   ------
HPar     2013  Speculative data-parallel  2.4x on 4 cores
ZOOMM    2013  Parallel browser engine    2x (whole engine)
Servo    2017  Off-main-thread parsing    Tokenization only

HSV solves this by changing the representation, not the parser.

How It Works

1. Represent HTML as HSV - use ASCII control characters instead of angle brackets as delimiters
2. Split at delimiters - one O(n) scan for FS (ASCII 0x1C, used here as the record separator)
3. Parse chunks in parallel - each chunk is self-contained, so no state synchronization is needed
4. Reconstruct - results are independent, just collect them
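
The steps above can be sketched in Go roughly as follows. This is a minimal illustration, not the code in this repository: the names Element, parseChunk, parseSequential and parseParallel are hypothetical, and the roles given to RS (0x1E) and US (0x1F) are assumptions made for the example. Only the use of FS as the record separator comes from the description above.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Delimiters assumed for this sketch. Only FS as record separator is
// taken from the README; the RS and US roles are illustrative.
const (
	fs = "\x1c" // between records
	rs = "\x1e" // between properties inside a record (assumed)
	us = "\x1f" // between a key and its value (assumed)
)

// Element is a hypothetical parsed record: a flat set of properties.
type Element map[string]string

// parseChunk parses one record. It reads only its own chunk, so it
// shares no state with the parsing of any other chunk.
func parseChunk(chunk string) Element {
	el := Element{}
	for _, prop := range strings.Split(chunk, rs) {
		if key, value, ok := strings.Cut(prop, us); ok {
			el[key] = value
		}
	}
	return el
}

// parseSequential parses the records one after another.
func parseSequential(doc string) []Element {
	chunks := strings.Split(doc, fs)
	out := make([]Element, len(chunks))
	for i, c := range chunks {
		out[i] = parseChunk(c)
	}
	return out
}

// parseParallel splits at FS and parses each chunk in its own goroutine.
// Every goroutine writes to a distinct slice index, so no lock is needed
// and the results come back in document order.
func parseParallel(doc string) []Element {
	chunks := strings.Split(doc, fs)
	out := make([]Element, len(chunks))
	var wg sync.WaitGroup
	for i, c := range chunks {
		wg.Add(1)
		go func(i int, c string) {
			defer wg.Done()
			out[i] = parseChunk(c)
		}(i, c)
	}
	wg.Wait()
	return out
}

func main() {
	doc := "tag" + us + "div" + rs + "text" + us + "hello" + fs +
		"tag" + us + "p" + rs + "text" + us + "a < b & c"
	for _, el := range parseParallel(doc) {
		fmt.Println(el["tag"], el["text"])
	}
}
```

One goroutine per chunk keeps the sketch short. For large documents a worker pool sized to the number of cores would probably reduce scheduling overhead.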

Run Tests

go test -v

Run Benchmarks

go test -bench=. -benchmem

Results

Size    Chunks  Sequential      Parallel
----    ------  ----------      --------
100     100     68µs            77µs
500     500     360µs           349µs
1000    1000    646µs           637µs
2000    2000    1.45ms          1.40ms

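Parallel is about 13% slower at 100 elements and roughly 1-3% faster from 500 elements up. For real HTML processing (DOM building, rendering), each chunk would carry more work, so the advantage would likely be larger; that is not measured here.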
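
To make the comparison concrete, the ratio sequential/parallel can be computed directly from the table. The short Go sketch below does only that arithmetic; the row type and speedup function are illustrative names, and the timings are copied from the table with the last row converted to microseconds.

```go
package main

import "fmt"

// row holds one line of the results table, with times in microseconds.
type row struct {
	size  int
	seqUs float64
	parUs float64
}

// speedup returns sequential time divided by parallel time.
// Values above 1 mean the parallel version was faster.
func speedup(seq, par float64) float64 {
	return seq / par
}

func main() {
	rows := []row{
		{100, 68, 77},
		{500, 360, 349},
		{1000, 646, 637},
		{2000, 1450, 1400},
	}
	for _, r := range rows {
		fmt.Printf("%4d  %.3fx\n", r.size, speedup(r.seqUs, r.parUs))
	}
}
```

The ratios range from about 0.88x at 100 elements to about 1.04x at 2000 elements.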

Key Points

No escaping: <div>, &, "quotes" preserved literally in HSV
Trivial parallelization: ~50 lines of code
Verified correctness: Sequential and parallel produce identical results
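Scaling without coordination: No speculation, no state synchronization. The benchmark above shows only a 1-3% gain, so linear scaling is a design goal, not a measured result.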

Why HSV Succeeds Where Others Struggled

HPar needed speculative parallelization with rollback. Servo moved tokenization off-thread but kept DOM construction sequential. Both fight HTML's stateful parsing model.

HSV changes the question: instead of "how do we parallelize HTML parsing?" it asks "why use a format that requires sequential parsing?"

It's the difference between building a faster horse and building a car.

References

HPar (2013)

Zhijia Zhao, Michael Bebenita, Dave Herman, Jianhua Sun, and Xipeng Shen. "HPar: A practical parallel parser for HTML—taming HTML complexities for parallel parsing." ACM Transactions on Architecture and Code Optimization (TACO), Vol. 10, No. 4, Article 44, December 2013. https://research.csc.ncsu.edu/picture/publications/papers/taco14.pdf

ZOOMM (2013)

Calin Cascaval, Seth Fowler, Pablo Montesinos-Ortego, Wayne Piekarski, Mehrdad Reshadi, Behnam Robatmili, Michael Weber, and Vrajesh Bhavsar. "ZOOMM: A parallel web browser engine for multicore mobile devices." Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '13), February 2013. https://dl.acm.org/doi/10.1145/2442516.2442543

Servo (2017)

"Off main thread HTML parsing in Servo." Servo Blog, August 2017. https://servo.org/blog/2017/08/23/gsoc-parsing/

ParDOM (2009)

Bhavik Shah, Praveen R. Rao, Bongki Moon, and Mohan Rajagopalan. "A data parallel algorithm for XML DOM parsing." Database and XML Technologies (XSym 2009), 2009. https://www.researchgate.net/publication/221412394_A_data_parallel_algorithm_for_XML_DOM_parsing
