November 19, 2019 News Magazine

Hatebase catalogues the world’s hate speech in real time so you don’t have to

Policing hate speech is something nearly every online communication platform struggles with. Because to police it, you must detect it; and to detect it, you must understand it. Hatebase is a company that has made understanding hate speech its primary mission, and it provides that understanding as a service — an increasingly valuable one.

Essentially Hatebase analyzes language use on the web, structures and contextualizes the resulting data, and sells (or provides) the resulting database to companies and researchers that don’t have the expertise to do this themselves.

The Canadian company, a small but growing operation, emerged out of research at the Sentinel Project into predicting and preventing atrocities based on analyzing the language used in a conflict-ridden region.

“What Sentinel discovered was that hate speech tends to precede escalation of these conflicts,” explained Timothy Quinn, founder and CEO of Hatebase. “I partnered with them to build Hatebase as a pilot project — basically a lexicon of multilingual hate speech. What surprised us was that a lot of other NGOs [non-governmental organizations] started using our data for the same purpose. Then we started getting a lot of commercial entities using our data. So last year we decided to spin it out as a startup.”

You might be thinking, “what’s so hard about detecting a handful of ethnic slurs and hateful phrases?” And sure, anyone can tell you (perhaps reluctantly) the most common slurs and offensive things to say — in their language… that they know of. There’s much more to hate speech than just a couple of ugly words. It’s an entire genre of slang, and the slang of a single language would fill a dictionary. What about the slang of all languages?

A shifting lexicon

As Victor Hugo pointed out in Les Miserables, slang (or “argot” in French) is the most mutable part of any language. These words can be “solitary, barbarous, sometimes hideous words… Argot, being the idiom of corruption, is easily corrupted. Moreover, as it always seeks disguise so soon as it perceives it is understood, it transforms itself.”

Not only are slang and hate speech voluminous, but they are ever-shifting. So the task of cataloguing them is a continuous one.

Hatebase uses a combination of human and automated processes to scrape the public web for uses of hate-related terms. “We go out to a bunch of sources — the biggest, as you might imagine, is Twitter — and we pull it all in and turn it over to Hatebrain. It’s a natural language program that goes through the post and returns true, false, or unknown.”

True means it’s pretty sure it’s hate speech — as you can imagine, there are plenty of examples of this. False means no, of course. And unknown means it can’t be sure; perhaps it’s sarcasm, or academic chatter about a phrase, or someone using a word who belongs to the group and is attempting to reclaim it or rebuke others who use it. Those are the values that go out via the API, and users can choose to look up more information or context in the larger database, including location, frequency, level of offensiveness, and so on. With that kind of data you can understand global trends, correlate activity with other events, or simply keep abreast of the fast-moving world of ethnic slurs.
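The three-valued result described above can be sketched in a few lines of Python. This is only an illustration, not Hatebase's actual code or API: the `Verdict` enum, the `to_verdict` and `route` functions, the confidence score, and both thresholds are assumptions made up for the example.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    """Three-valued result, mirroring the true / false / unknown described above."""
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


def to_verdict(score: Optional[float], low: float = 0.2, high: float = 0.8) -> Verdict:
    """Map a hypothetical confidence score in [0, 1] to a verdict.

    The thresholds are invented for illustration. None means the
    classifier could not produce a score at all.
    """
    if score is None:
        return Verdict.UNKNOWN
    if score >= high:
        return Verdict.TRUE
    if score <= low:
        return Verdict.FALSE
    # Anything in between (sarcasm, quotation, reclaimed usage) stays unresolved.
    return Verdict.UNKNOWN


def route(verdict: Verdict) -> str:
    """Decide what a moderation pipeline might do with each verdict."""
    if verdict is Verdict.TRUE:
        return "flag"
    if verdict is Verdict.FALSE:
        return "allow"
    return "human_review"
```

The point of keeping "unknown" as a first-class value, rather than forcing a yes or no, is that ambiguous posts can be handed to a person instead of being silently flagged or allowed.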

Hate speech being flagged all around the world — these were a handful detected today, along with the latitude and longitude of the IP they came from.

Quinn doesn’t pretend the process is magical or perfect, though. “There are very few 100 percents coming out of Hatebrain,” he explained. “It varies a little from the machine learning approach others use. ML is great when you have an unambiguous training set, but with human speech, and hate speech, which can be so nuanced, that’s when you get bias floating in. We just don’t have a massive corpus of hate speech, because no one can agree on what hate speech is.”

That’s part of the problem faced by companies like Google, Twitter, and Facebook — you can’t automate what can’t be automatically understood.

Fortunately Hatebrain also employs human intelligence, in the form of a corps of volunteers and partners who authenticate, adjudicate, and aggregate the more ambiguous data points.

“We have a bunch of NGOs that partner with us in linguistically diverse regions around the world, and we just launched our ‘citizen linguists’ program, which is a volunteer arm of our company, and they’re constantly updating and approving and cleaning up definitions,” Quinn said. “We place a high degree of authenticity on the data they provide us.”

That local perspective can be crucial for understanding the context of a word. He gave the example of a word in Nigeria, which when used between members of one group means friend, but when used by that group to refer to someone else means uneducated. It’s unlikely anyone but a Nigerian would be able to tell you that. Currently Hatebase covers 95 languages in 200 countries, and they’re adding to that all the time.

Furthermore there are “intensifiers,” words or phrases that are not offensive on their own but serve to indicate whether someone is emphasizing the slur or phrase. Other factors enter into it too, some of which a natural language engine may not be able to recognize because it has so little data concerning them. So in addition to keeping definitions up to date, the team is also constantly working on improving the parameters used to categorize speech Hatebrain encounters.
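One way to picture how an intensifier might affect a score is the Python sketch below. It is an assumption-laden toy, not a description of Hatebrain: the function name `score_post`, the lexicon of base scores, the two-word window, and the 1.5x boost are all hypothetical, and the terms are placeholders.

```python
from typing import Dict, List, Set


def score_post(
    tokens: List[str],
    lexicon: Dict[str, float],
    intensifiers: Set[str],
    window: int = 2,
    boost: float = 1.5,
) -> float:
    """Return the highest term score in a post, capped at 1.0.

    lexicon maps a flagged term to a hypothetical base offensiveness in [0, 1].
    A term's score is multiplied by `boost` when an intensifier appears
    within `window` tokens on either side of it.
    """
    score = 0.0
    for i, tok in enumerate(tokens):
        base = lexicon.get(tok)
        if base is None:
            continue  # not a flagged term; intensifiers alone score nothing
        nearby = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        if any(t in intensifiers for t in nearby):
            base *= boost
        score = max(score, base)
    return min(score, 1.0)
```

Note that an intensifier on its own contributes nothing here, which matches the description above: it only matters when it sits next to a term already in the lexicon.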

Building a better database for science and profit

The system just ingested its millionth hate speech sighting (out of perhaps ten times that many phrases evaluated), which sounds simultaneously like a lot and a little. It’s a little because the volume of speech on the internet is so vast that one rather expects even the tiny proportion of it constituting hate speech to add up to millions and millions.

But it’s a lot because no one else has put together a database of this size and quality. A vetted, million-data-point set of words and phrases classified as hate speech or not hate speech is a valuable commodity all on its own. That’s why Hatebase provides it for free to researchers and institutions using it for humanitarian or scientific purposes.

But companies and larger organizations looking to outsource hate speech detection for moderation purposes pay a license fee, which keeps the lights on and allows the free tier to exist.

“We’ve got, I think, four of the world’s ten largest social networks pulling our data. We’ve got the UN pulling data, NGOs, the hyper local ones working in conflict areas. We’ve been pulling data for the LAPD for the last couple years. And we’re increasingly talking to government departments,” Quinn said.

They have a number of commercial clients, many of which are under NDA, Quinn noted, but the most recent to join up did so publicly, and that’s TikTok. As you can imagine, a popular platform like that has a great need for quick, accurate moderation.

In fact it’s something of a crisis, since there are laws coming into play that penalize companies enormous amounts if they don’t promptly remove offending content. That kind of threat really loosens the purse strings; if a fine could be in the tens of millions of dollars, paying a significant fraction of that for a service like Hatebase’s is a good investment.

“These big online ecosystems need to get this stuff off their platforms, and they need to automate a certain percentage of their content moderation,” Quinn said. “We don’t ever think we’ll be able to get rid of human moderation, that’s a ridiculous and unachievable goal; what we want to do is help automation that’s already in place. It’s increasingly unrealistic that every online community under the sun is going to build up their own massive database of multilingual hate speech, their own AI. The same way companies don’t have their own mail server any more, they use Gmail, or they don’t have server rooms, they use AWS — that’s our model, we call ourselves hate speech as a service. About half of us love that term, half don’t, but that really is our model.”

Hatebase’s commercial clients have made the company profitable from day one, but they’re “not rolling in cash by any means.”

“We were nonprofit until we spun out, and we’re not walking away from that, but we wanted to be self-funding,” Quinn said. Relying on the kindness of rich strangers is no way to stay in business, after all. The company is hiring and investing in its infrastructure, but Quinn indicated that they’re not looking to juice growth or anything — just make sure the jobs that need doing have someone to do them.

In the meantime it seems clear to Quinn and everyone else that this kind of information has real value, though it’s rarely simple.

“It’s a really, it’s a really complicated problem. We always grapple with it, you know, in terms of, well, what role does hate speech play? What role does misinformation play? What role do socioeconomics play?” he said. “There’s a great paper that came out of the University of Warwick, they studied the correlation between hate speech and violence against immigrants in Germany over, I want to say, 2015 to 2017. They graph it out. And it’s peak for peak, you know, valley for valley. It’s amazing. We don’t do a hell of a lot of analysis — we’re a data provider.”

“But now we have, like, almost 300 universities pulling the data, and they do those kinds of analyses. So that’s very validating for us.”

You can learn more about Hatebase, join the Citizen Linguists or research partnership, or see recent sightings and updates to the database at the company’s website.
